Many users and downstream applications rely on CPU/GPU transfers for interoperability, including in the zero code change interfaces for pandas, Polars, and Spark. Depending on the workflow and use case, the speed of CPU/GPU transfers can materially affect the benefit of GPU acceleration.
Today, when a Python user triggers a transfer of a GPU DataFrame to the CPU via e.g. `to_arrow` or `to_pandas`, every column is sequentially copied to the host with a call to `cudaMemcpyAsync`.
In practice, we end up nowhere near saturating the theoretically available system bandwidth. For example, an A100 system with PCIe Gen4 x16 lanes has a peak unidirectional bandwidth of 32 GB/s, yet we see the following with a 10-column DataFrame of 1 GB int64/float64 columns:
Converting this GPU DataFrame into a CPU PyArrow table achieves about 5 GB/s, or roughly 15% of theoretical peak bandwidth.
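A back-of-envelope check of the utilization figure above (the 32 GB/s and 5 GB/s numbers are taken from the measurements described, not computed here):

```python
# Sanity-check the utilization figure: observed vs. theoretical peak.
peak_bw = 32e9       # PCIe Gen4 x16 unidirectional peak, bytes/s
observed_bw = 5e9    # measured to_arrow throughput, bytes/s

utilization = observed_bw / peak_bw
print(f"{utilization:.0%}")  # → 16% (i.e., ~15-16% of peak)
```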
These copies already use `cudaMemcpyAsync` under the hood. Would pre-allocating the CPU table's buffers and using multiple streams to fill them asynchronously (with a single synchronization at the end) be a viable path to better saturating the available system resources?
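For concreteness, the orchestration being proposed might look something like the sketch below. This is a CPU-only structural sketch, not cuDF code: `np.copyto` stands in for `cudaMemcpyAsync`, and a thread pool stands in for a set of CUDA streams (in real code, each worker would issue async copies into pre-allocated pinned host memory on its own stream, followed by one stream synchronize per stream).

```python
import concurrent.futures
import numpy as np

def copy_columns_parallel(device_columns, n_streams=4):
    """Structural sketch of multi-stream device-to-host copies.

    np.copyto is a stand-in for cudaMemcpyAsync; the thread pool is a
    stand-in for n_streams CUDA streams."""
    # Step 1: pre-allocate every destination (host) buffer up front,
    # so the copy phase itself performs no allocation.
    host_buffers = [np.empty_like(col) for col in device_columns]

    # Step 2: dispatch one copy per column across n_streams workers,
    # so transfers can overlap instead of running strictly sequentially.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_streams) as pool:
        futures = [
            pool.submit(np.copyto, dst, src)
            for dst, src in zip(host_buffers, device_columns)
        ]
        # Step 3: a single synchronization point at the end, analogous
        # to synchronizing each stream after all copies are enqueued.
        concurrent.futures.wait(futures)

    return host_buffers

# Hypothetical 10-column frame of float64 data.
cols = [np.arange(8, dtype="float64") + i for i in range(10)]
out = copy_columns_parallel(cols)
```

Note that for the real GPU path the host buffers would need to be pinned (page-locked) for `cudaMemcpyAsync` to actually overlap; copies into pageable memory fall back to synchronous behavior.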
beckernick changed the title from "[QST] Will using streams for async copying table / DataFrame columns speed up CPU/GPU transfers?" to "[QST] Would using streams for async copying table / DataFrame columns speed up CPU/GPU transfers?" on Oct 14, 2024.