
Relax PyTorch upper bound (allowing 2.4) #4703

Open · wants to merge 11 commits into base: branch-24.12
Conversation

@jakirkham (Member) commented Oct 7, 2024

As the issue around PyTorch being built without NumPy was fixed in conda-forge, we can now relax these upper bounds to allow PyTorch 2.4.

xref: conda-forge/pytorch-cpu-feedstock#254
xref: conda-forge/pytorch-cpu-feedstock#266
xref: #4615
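For anyone who wants to sanity-check an install locally, a minimal sketch along these lines (not part of this PR's changes) confirms that the installed PyTorch build has NumPy interop; builds compiled without NumPy typically raise a RuntimeError from Tensor.numpy():

```python
import numpy as np
import torch

# A PyTorch build compiled without NumPy support typically raises
# RuntimeError ("Numpy is not available") on the .numpy() call below.
tensor = torch.arange(4)
array = tensor.numpy()
assert isinstance(array, np.ndarray)
print(torch.__version__, np.__version__, "NumPy interop OK")
```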

@jakirkham requested review from a team as code owners on October 7, 2024 at 19:45
@jakirkham mentioned this pull request on Oct 7, 2024
@jakirkham added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on Oct 7, 2024
@alexbarghi-nv (Member) left a comment

👍

@jakirkham (Member, Author) commented Oct 7, 2024

Odd, looks like the S3 upload of wheels built on CI failed. Going to cancel the builds and restart.

[rapids-upload-to-s3] Path to upload is a directory, creating .tar.gz
./
./cugraph_equivariant_cu11-24.12.0a40-py3-none-any.whl
upload failed: - to s3://rapids-downloads/ci/cugraph/pull-request/4703/fb7a6a6/cugraph_wheel_python_cugraph-equivariant_cu11.tar.gz An error occurred (InternalError) when calling the PutObject operation (reached max retries: 2): We encountered an internal error. Please try again.
Error: Process completed with exit code 1.

Edit: Restarting appears to have fixed this

@jameslamb (Member) left a comment

Great! Always happy to see a `<` removed when possible.

@jakirkham (Member, Author)

One CI build failed because it pulled in CuPy 12.2.0 alongside NumPy 2, which is an incompatible combination.

Looks like this is due to a repodata patch not getting applied to older CuPy packages in conda-forge. Addressing in upstream PR: conda-forge/conda-forge-repodata-patches-feedstock#873

Not expecting any other action needed here (beyond restarting the failed build when the upstream PR lands)
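As an illustration only, a small environment guard could flag the bad solve described above, where an old CuPy lands next to NumPy 2; the threshold mirrors the cupy>=13.3 pin discussed later in this thread, and the sketch assumes the packaging library is available:

```python
# Illustrative sketch only: flag environments where an old CuPy was solved
# alongside NumPy 2 (CuPy 12.2.0 broke here, while CuPy 13.3.0 worked).
from packaging.version import Version

import cupy
import numpy

if Version(numpy.__version__) >= Version("2.0") and Version(cupy.__version__) < Version("13.3"):
    raise RuntimeError(
        f"cupy {cupy.__version__} may not be compatible with numpy {numpy.__version__}"
    )
print("cupy/numpy combination looks OK:", cupy.__version__, numpy.__version__)
```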

@alexbarghi-nv (Member)

WholeGraph feature store tests are failing. I'm looking into it.

@jakirkham (Member, Author)

Thanks Alex! 🙏

Meaning the failures in this CI job:

FAILED tests/data_store/test_gnn_feat_storage_wholegraph.py::test_feature_storage_wholegraph_backend - assert 0 > 0
FAILED tests/data_store/test_gnn_feat_storage_wholegraph.py::test_feature_storage_wholegraph_backend_mg - assert 0 > 0

Looks like that job does have CuPy 13.3.0 + NumPy 2, which are compatible. So this is unrelated to the CuPy repodata patch above

Agree these are worth looking into

Did notice this line when going through the log

pytorch                   2.4.1           cpu_mkl_py312hf535c18_100    conda-forge

So maybe this is a different dependency resolution issue. Have a hunch as to what this might be, but will need to investigate further
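
If the hunch is that the CPU-only cpu_mkl variant of PyTorch got installed (which would leave GPU-backed WholeGraph tests with no usable devices), a quick check in that environment might look like this sketch:

```python
import torch

# Report which PyTorch build is installed and whether it can see any GPUs.
# A CPU-only variant (such as the cpu_mkl build in the log above) reports
# no CUDA support and a device count of 0.
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)       # None for CPU-only builds
print("CUDA available:", torch.cuda.is_available())
print("visible devices:", torch.cuda.device_count())
```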

@alexbarghi-nv (Member)

Oh, good catch on the pytorch 2.4.1 cpu_mkl_py312hf535c18_100 line in the log. That is very likely the cause.

@alexbarghi-nv (Member)

I'm having a hard time replicating that dependency resolution locally. Instead, mamba is resolving:

  + pytorch                                  2.3.1  cuda120_py312h26b3cf7_300            conda-forge            25MB

@alexbarghi-nv (Member)

Never mind, I just replicated it.

@alexbarghi-nv (Member)

This error goes away when adding -c pytorch, which correctly installs pytorch from the pytorch channel. Let me push a change to this PR.

@alexbarghi-nv (Member)

It looks like we failed due to selecting an older version of cupy. Can we safely bind cupy to >=13.3 in dependencies.yaml? @jakirkham

@jakirkham (Member, Author)

Think this has to do with the issue mentioned above ( #4703 (comment) ), which was recently resolved

Let's give this another try

Am going to try without the pytorch channel if that is ok

Let's look at the results of that

We can then discuss afterwards whether we want any more changes

@jakirkham (Member, Author)

Now seeing errors like these in the failing CI jobs:

NotImplementedError: `ego_graph' is not implemented by ['cugraph'] backends. To remedy this, you may enable automatic conversion to more backends (including 'networkx') by adding them to `nx.config.backend_priority`, or you may specify a backend to use with the `backend=` keyword argument.

Are we missing something in cuGraph?
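
For reference, the error message itself suggests two workarounds on the NetworkX side; a minimal sketch with a toy graph (not the failing test's setup):

```python
import networkx as nx

G = nx.karate_club_graph()

# Option 1 (per the error message): add fallback backends to
# nx.config.backend_priority so automatic conversion can happen.
# Option 2: request a backend explicitly for a single call, e.g.:
H = nx.ego_graph(G, 0, radius=1, backend="networkx")
print(H.number_of_nodes())
```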

@jakirkham (Member, Author)

From offline discussion, sounds like this was fixed in PR: #4717

Updated this PR to pull in those changes

@jakirkham (Member, Author)

Looks like this is passing! 🥳

@alexbarghi-nv @jameslamb would you like to give it one last look?

@jameslamb (Member) left a comment

Looks fine to me. Now that we have CI set up in https://github.com/rapidsai/cugraph-gnn, can you replicate these changes over there as well? Hopefully soon we can move cugraph-dgl and cugraph-pyg out of this repo completely and not have to do that duplicate work.

Labels: ci, conda, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python