Testset RFBacthedLowLevel factorization is failing with CUDA 12.3 #15

Open
amontoison opened this issue Nov 9, 2023 · 3 comments

@amontoison
Member

On the master branch, the RFBacthedLowLevel factorization testset fails with the following error:

CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 535.104.12, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+535.104.12

Julia packages: 
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.9.3
- LLVM: 14.0.6

1 device:
  0: NVIDIA A100 80GB PCIe (sm_80, 79.143 GiB / 80.000 GiB available)
RFBacthedLowLevel factorization: Error During Test at /home/montalex/git/CUSOLVERRF.jl/test/cusolverRF.jl:29
  Got exception outside of a @test
  ReadOnlyMemoryError()
  Stacktrace:
    [1] macro expansion
      @ ~/.julia/packages/CUDA/nIZkq/lib/cusolver/libcusolverRF.jl:236 [inlined]
    [2] #1195
      @ ~/.julia/packages/CUDA/nIZkq/lib/utils/call.jl:27 [inlined]
    [3] #1
      @ ~/.julia/packages/CUDA/nIZkq/lib/cusolver/libcusolver.jl:17 [inlined]
    [4] retry_reclaim(f::CUDA.CUSOLVER.var"#1#2"{CUDA.CUSOLVER.var"#1195#1196"{CUSOLVERRF.RfHandle}}, isfailed::Base.Fix2{typeof(in), Tuple{CUDA.CUSOLVER.cusolverStatus_t}})
      @ CUDA ~/.julia/packages/CUDA/nIZkq/src/pool.jl:359
    [5] check
      @ ~/.julia/packages/CUDA/nIZkq/lib/cusolver/libcusolver.jl:16 [inlined]
    [6] cusolverRfBatchAnalyze
      @ ~/.julia/packages/CUDA/nIZkq/lib/utils/call.jl:26 [inlined]
    [7] CUSOLVERRF.RFBatchedLowLevel(lu_host::CUSOLVERRF.RFSymbolicAnalysis{Float64, Int32}, batchsize::Int64; options::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ CUSOLVERRF ~/git/CUSOLVERRF.jl/src/rf_wrapper.jl:222
    [8] RFBatchedLowLevel
      @ ~/git/CUSOLVERRF.jl/src/rf_wrapper.jl:200 [inlined]
    [9] #RFBatchedLowLevel#5
      @ ~/git/CUSOLVERRF.jl/src/rf_wrapper.jl:197 [inlined]
   [10] CUSOLVERRF.RFBatchedLowLevel(A::CuSparseMatrixCSR{Float64, Int32}, batchsize::Int64)
      @ CUSOLVERRF ~/git/CUSOLVERRF.jl/src/rf_wrapper.jl:192
   [11] macro expansion
      @ ~/git/CUSOLVERRF.jl/test/cusolverRF.jl:39 [inlined]
   [12] macro expansion
      @ ~/Applications/julia/julia-1.9.3/share/julia/stdlib/v1.9/Test/src/Test.jl:1498 [inlined]
   [13] macro expansion
      @ ~/git/CUSOLVERRF.jl/test/cusolverRF.jl:30 [inlined]
   [14] macro expansion
      @ ~/Applications/julia/julia-1.9.3/share/julia/stdlib/v1.9/Test/src/Test.jl:1498 [inlined]
   [15] top-level scope
      @ ~/git/CUSOLVERRF.jl/test/cusolverRF.jl:4
   [16] include(fname::String)
      @ Base.MainInclude ./client.jl:478
   [17] top-level scope
      @ ~/git/CUSOLVERRF.jl/test/runtests.jl:15
   [18] include(fname::String)
      @ Base.MainInclude ./client.jl:478
   [19] top-level scope
      @ none:6
   [20] eval
      @ ./boot.jl:370 [inlined]
   [21] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:280
   [22] _start()
      @ Base ./client.jl:522
Test Summary:                     | Pass  Error  Total   Time
cusolverRF                        |    2      1      3  19.5s
  RFLowLevel factorization        |    2             2  15.4s
  RFBacthedLowLevel factorization |           1      1   1.9s
ERROR: LoadError: Some tests did not pass: 2 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/montalex/git/CUSOLVERRF.jl/test/cusolverRF.jl:2
in expression starting at /home/montalex/git/CUSOLVERRF.jl/test/runtests.jl:15
ERROR: Package CUSOLVERRF errored during testing
amontoison changed the title from "Testset RFBacthedLowLevel factorization is failing with with CUDA 12.3" to "Testset RFBacthedLowLevel factorization is failing with CUDA 12.3" on Nov 9, 2023
@frapac
Member

frapac commented Nov 9, 2023

I can confirm the same behavior with CUSOLVERRF.jl master and CUDA runtime 12.3. Interestingly, everything works with CUDA runtime 11.8.

@amontoison
Member Author

amontoison commented Nov 9, 2023

We should check in CI whether it works with CUDA 12.2; that will help us isolate the issue.
I suspect that some operations can no longer be done in-place.

@frapac
Member

frapac commented Nov 9, 2023

I just ran the tests: CI passes with CUDA 12.1, but fails with 12.2 and 12.3.
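
For anyone who wants to repeat this bisection locally rather than in CI, one option is to pin the runtime that CUDA.jl downloads. This is only a minimal sketch: it assumes a CUDA.jl 5.x installation (where `CUDA.set_runtime_version!` is documented), a working GPU, and that CUSOLVERRF is available in the active environment. The version numbers are the ones mentioned in this thread.

```julia
using CUDA

# Pin the CUDA runtime artifact used by CUDA.jl.
# This writes a preference to LocalPreferences.toml in the active environment
# and only takes effect after Julia is restarted.
CUDA.set_runtime_version!(v"12.1")

# --- run the rest in a fresh Julia session ---
using CUDA, Pkg

# Print the runtime, driver, and library versions actually in use,
# to check that the pinned runtime was picked up.
CUDA.versioninfo()

# Run the package test suite against the pinned runtime.
# Assumption: CUSOLVERRF is added or dev'ed in the active environment.
Pkg.test("CUSOLVERRF")
```

Repeating the same steps with `v"12.2"` and `v"12.3"` should show whether the `ReadOnlyMemoryError` in `cusolverRfBatchAnalyze` first appears with 12.2, as the CI runs suggest.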
