RAPIDS 0.10 and NYCTaxi-E2E notebook: Boolean and RMM_ERROR_OUT_OF_MEMORY errors #214
Comments
It looks like dask_xgboost is not handling boolean columns. I dropped the boolean column 'day_of_week' (which wasn't needed) and it worked, but why can't dask_xgboost handle boolean columns? |
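A minimal sketch of that workaround, assuming a dask_cudf/cudf dataframe; the name taxi_df is illustrative, not the notebook's exact variable:

```python
# Sketch: exclude the boolean 'day_of_week' column before handing the frame to
# dask_xgboost. Plain column selection works on both cudf and dask_cudf dataframes;
# 'taxi_df' is an illustrative name.
feature_cols = [c for c in taxi_df.columns if c != 'day_of_week']
taxi_df = taxi_df[feature_cols]
```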
Hey @vilmara! I'm working with Ty on the Boolean issue and have raised it with our XGBoost and Dask teams; they are looking into it. I'll let you know the resolution and when to expect it. For now, I'll drop the "day of week" boolean column in a PR (or you could PR it yourself and it would count as an awesome community contribution!). Are you still having the out-of-memory issue? |
Hi @taureandyernv, thanks for your prompt reply. Here are some comments and questions regarding the NYCTaxi notebook: 1- PR #215 sent for the error 2- Error 3- What does the warning below mean and how can I eliminate it:
4- How can I implement the RAPIDS Memory Manager (RMM) functionality on RAPIDS_v0.10? I used the previous technique and I am getting the error |
With the XGBoost updates, it should be able to determine the number of GPUs automatically based on the client and does not require the
The RMM imports have changed. Updating the imports to |
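For readers hitting the same import error, here is a sketch of what the updated helpers could look like with rmm imported as its own module; the rmm.rmm_config path and the initialize()/finalize() calls reflect the 0.10-era API as I understand it and should be treated as assumptions:

```python
# Sketch: the notebook's RMM helpers with rmm imported as a standalone module
# instead of through cudf. The rmm.rmm_config module path and the
# initialize()/finalize() calls are assumptions about the 0.10-era rmm API.
import rmm
import rmm.rmm_config as rmm_cfg

def initialize_rmm_pool():
    rmm_cfg.use_pool_allocator = True   # allocate a memory pool up front
    return rmm.initialize()

def initialize_rmm_no_pool():
    rmm_cfg.use_pool_allocator = False  # fall back to per-allocation device malloc
    return rmm.initialize()

def finalize_rmm():
    return rmm.finalize()
```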
Hey @vilmara, I have been running the notebook on a 2x GPU system, so it's taking me a bit longer per iteration than I think it takes you or Ty :). Just a quick reply...
A few things changed in v0.10 and I'm working with the community (like you!) and devs to iron out any wrinkles. |
Hi @taureandyernv / @ayushdg,
Thanks, it eliminated the WARNING: Deprecated. Single process multi-GPU training is no longer supported
I have updated the imports, and now I am getting a different error:
I am using 2 nodes with 4x V100-16GB each; the total ETL cycle is very quick on my system
Do you mean it isn't necessary to explicitly handle the RMM functionality with the helper functions initialize_rmm_pool(), initialize_rmm_no_pool(), and finalize_rmm()? |
Could you share the exact import command you used? The error message implies it is looking for rmm inside cudf, though rmm is a separate module.
Dask handles memory in the sense of partitioning the DataFrame, etc. Running the ETL by default will do the operations without using pool mode for the underlying CUDA memory management. The RMM pool step will help enable
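In practice that pool step is run on every dask-cuda worker through the client; a minimal sketch, assuming the helper functions discussed above and an existing dask.distributed Client named client:

```python
# Sketch: enable the RMM pool allocator on each worker before the training step.
# Assumes 'client' is a dask.distributed Client connected to the dask-cuda workers
# and initialize_rmm_pool is the helper sketched earlier in the thread.
client.run(initialize_rmm_pool)
```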
I have modified the NYCTaxi-E2E notebook to use RMM with these changes:
Notice dropped Just before
At end, add:
I tested the |
Hi @taureandyernv, after implementing the recommendations mentioned in this issue (thanks @tym1062 for the update), I got the code working for the first iteration, but the second iteration gives me the same RMM memory error you got, see below: |
Awesome! Thanks @tym1062 for sharing the snippet. Could you check if rmm pool mode really gets initialized after calling |
Thanks @ayushdg, you are correct: you need to finalize RMM before initializing the RMM pool (I checked via nvidia-smi). Here is the correct way: Just before
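Since the original snippet did not survive the copy, here is roughly what that ordering looks like, using the helper names from earlier in the thread (the placement comments are paraphrased, not the exact notebook cells):

```python
# Sketch of the corrected ordering: finalize the workers' default RMM state before
# switching to pool mode, then restore no-pool mode at the end of the notebook.
# Assumes 'client' and the RMM helpers sketched earlier in the thread.

# Just before the dask_xgboost training step:
client.run(finalize_rmm)
client.run(initialize_rmm_pool)

# ... dask_xgboost training ...

# At the end of the notebook:
client.run(finalize_rmm)
client.run(initialize_rmm_no_pool)
```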
For the out-of-memory issue, @vilmara, have you tried using less CSV data from the taxi datasets? I can generate an OOM using too much data (Jan-2014 through Jun-2016) on my system with 4x 32GB GV100. |
Thanks @tym1062, I have fixed the OOM error after the second iteration by increasing the device memory limit as shown below (my new system has 4x V100-32GB). @taureandyernv / @ayushdg, thanks for your support. Now that the NYCTaxi-E2E notebook is working without issues and with RMM functionality on RAPIDS_v0.10, will NVIDIA update the notebook, or do we need to create a PR? |
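The snippet itself was lost in the copy; roughly, raising the device memory limit when creating the cluster looks like this (a sketch; LocalCUDACluster and the "30GB" value are illustrative assumptions, not the exact settings used here):

```python
# Sketch: set a larger per-GPU device memory limit on the dask-cuda cluster so the
# workers spill to host memory later. The cluster type and the "30GB" value are
# illustrative assumptions, not the exact settings from this thread.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(device_memory_limit="30GB")
client = Client(cluster)
```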
@JohnZed can we merge under Vilmara's PR (I can also update)? @vilmara Congrats on your new system!! I'm running 1 node with 2x GV100s, 32GB each, using a local Dask CUDA cluster :) It took my system nearly an hour from the start to get back to the Dask XGBoost training, with all the data downloads. Can you add the fix for the RMM issue after we merge the RAPIDS solution into your PR? I'll merge after that. You rock! @ayushdg, thanks for sharing the great solutions! |
Additions to NYCTaxi-E2E notebook addressing issues on rapidsai-community#214
What was the largest data size you were able to handle on your 4x 32GB GV100 system before it generated an OOM? |
Describe the bug
The NYCTaxi-E2E notebook is throwing boolean and rmm errors:
Steps/Code to reproduce bug
running the notebook via docker image
rapidsai/rapidsai:0.10-cuda10.1-runtime-ubuntu18.04
Environment details (please complete the following information):
docker run --gpus all --rm -it --net=host -p 8888:8888 -p 8787:8787 -p 8786:8786 -v /home/rapids/notebooks-contrib/:/rapids/notebooks/contrib/ -v /home/rapids/data/:/data/ rapidsai/rapidsai:0.10-cuda10.1-runtime-ubuntu18.04