
[QST] Would using streams for async copying table / DataFrame columns speed up CPU/GPU transfers? #17082

Open
beckernick opened this issue Oct 14, 2024 · 0 comments
Labels
question Further information is requested

Comments

@beckernick
Member

Many users and downstream applications rely on CPU/GPU transfers for interoperability, including in the zero code change interfaces for pandas, Polars, and Spark. Speed of CPU/GPU transfers can sometimes materially impact the benefit of GPU-acceleration, depending on the workflow and use case.

Today, when a Python user triggers a transfer of a GPU DataFrame to the CPU via, e.g., to_arrow or to_pandas, every column is sequentially copied to the host with a call to cudaMemcpyAsync.

In practice, we end up not getting close to saturating the theoretically available system bandwidth. E.g., an A100 system with PCIe Gen4 x16 lanes has a peak unidirectional bandwidth of 32GB/s. In practice, we see the following with a 10 column DataFrame of 1GB int64/float64 columns:

import cudf

NROWS = 125000000
N_INT = 5
N_FLOAT = 5

DTYPES = {f"i{x}": int for x in range(N_INT)}
DTYPES.update({f"f{x}": float for x in range(N_FLOAT)})

df = cudf.datasets.randomdata(nrows=NROWS, dtypes=DTYPES, seed=12)
size_mb = df.memory_usage(deep=True).sum()/1e6

print(f"{size_mb} MB dataframe")
10000.0 MB dataframe

%timeit -n2 -r2 cpu_table = df.to_arrow()
1.99 s ± 66.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

To convert this GPU DataFrame into a CPU PyArrow table, we're seeing about 5 GB/s -- roughly 15% of the theoretical peak bandwidth.
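For reference, the ~5 GB/s figure follows directly from the numbers above (10 GB of column data over the ~1.99 s mean %timeit result):

```python
# Back-of-the-envelope check of the achieved transfer bandwidth,
# using the sizes and timing reported above.
df_size_gb = 10.0          # 10 columns x 1 GB each
elapsed_s = 1.99           # mean %timeit result
pcie_gen4_x16_gbs = 32.0   # theoretical peak unidirectional bandwidth

achieved_gbs = df_size_gb / elapsed_s
fraction_of_peak = achieved_gbs / pcie_gen4_x16_gbs

print(f"{achieved_gbs:.2f} GB/s, {fraction_of_peak:.1%} of peak")
```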

These copies already use cudaMemcpyAsync under the hood. Would pre-allocating the CPU table's buffers and using multiple streams to async fill them (with a final sync at the end) potentially be a viable path to better saturating the available system resources?
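The host-side shape of the proposed pattern (pre-allocate destination buffers, issue independent per-column copies, synchronize once at the end) might look something like the sketch below. This is only an illustration: it uses a thread pool and numpy copies as stand-ins for per-stream cudaMemcpyAsync calls, and the function and column names are hypothetical, not cuDF API.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def gather_columns(device_columns):
    """Illustrative stand-in for a multi-stream device-to-host gather.

    Each np.copyto below models one cudaMemcpyAsync on its own stream;
    exiting the executor models the single final synchronization.
    """
    # Pre-allocate all host-side destination buffers up front.
    host_columns = {
        name: np.empty_like(col) for name, col in device_columns.items()
    }

    def copy_one(name):
        # Real version: cudaMemcpyAsync(dst, src, nbytes, D2H, streams[i])
        np.copyto(host_columns[name], device_columns[name])

    # Issue all per-column copies concurrently...
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(copy_one, host_columns))
    # ...and the `with` exit is the one sync point before handing
    # the buffers to PyArrow.
    return host_columns

src = {f"c{i}": np.arange(1000, dtype="int64") + i for i in range(10)}
dst = gather_columns(src)
print(all(np.array_equal(src[k], dst[k]) for k in src))  # -> True
```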

@beckernick beckernick added the question Further information is requested label Oct 14, 2024
@beckernick beckernick changed the title [QST] Will using streams for async copying table / DataFrame columns speed up CPU/GPU transfers? [QST] Would using streams for async copying table / DataFrame columns speed up CPU/GPU transfers? Oct 14, 2024
Projects
Status: In Progress