Using with CUDA Solver? #18

zjwegert · 2021-01-08T22:37:21Z

Hi,
I'm interested in using this package with a GPU CG method from IterativeMethods. I think only forward_substitution and backward_substitution methods are required to extend this to CUDA. I've tried using CUDAs sv2! with little success. Any thoughts?

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers
CUDA.allowscalar(false)

val = sprand(200,200,0.05);
A_cpu = val*val'
b_cpu = rand(200)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)                       
  sv2!('N', 'L', 1.0, P, y, 'O') 
  sv2!('N', 'U', 1.0, P, y, 'O')  
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)                        
  sv2!('N', 'L', 1.0, P, x, 'O') 
  sv2!('N', 'U', 1.0, P, x, 'O')  
  return x
end

# function LinearAlgebra.ldiv!(_lu::LUperso, rhs::CUDA.CuArray)
# 	_x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
# 	rhs .= vec(_x)
# 	# CUDA.unsafe_free!(_x)
# 	rhs
# end
#
# function LinearAlgebra.ldiv!(yrhs::CuArray,_lu::LUperso, rhs::CuArray)
# 	copyto!(yrhs,rhs)
# 	_x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
# 	rhs .= vec(_x)
# 	# CUDA.unsafe_free!(_x)
# 	rhs
# end

struct LUgpu
	L
	Ut	# transpose of U in LU decomposition
end


P = LUperso(LowerTriangular(CuSparseMatrixCSR(I+Precilu.L)), UpperTriangular(CuSparseMatrixCSR(sparse(Precilu.U'))));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7,maxiter=1000)

P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=10^-7,maxiter=1000)
A_cpu\b_cpu

The text was updated successfully, but these errors were encountered:

amontoison · 2021-01-09T04:37:42Z

You need to convert Precilu.L and Precilu.U before to make it working on GPU.

L_gpu = CuSparseMatrixCSC(Precilu.L)
U_gpu = CuSparseMatrixCSC(Precilu.U)

After that you can use sv2! for the backward and forward sweeps.
Be careful because the storage is a little bit specific for the U factor (@haampie I am right ?)
With Tim, we updated some functions for triangular types in CUSPARSE to avoid the use of sv2! for information (JuliaGPU/CUDA.jl#575).

About your example, CG requires a symmetric and positive definite preconditioner. ilu is great for non-symmetric square systems and Krylov solvers bicgstab, gmres, etc....
An incomplete Cholesky or LDL' factorization is more appropriate for your problem.

zjwegert · 2021-01-09T06:14:45Z

Hi @amontoison, I gave that a go but no success! Do you mean something like

...
import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('T', 'L', 1.0, P, y, 'O')
  sv2!('N', 'U', 1.0, P, y, 'O',unit_diag=true)
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('T', 'L', 1.0, P, x, 'O')
  sv2!('N', 'U', 1.0, P, x, 'O',unit_diag=true)
  return x
end

struct LUgpu
	L
	Ut	# transpose of U in LU decomposition
end

L_gpu = CuSparseMatrixCSC(sparse(Precilu.L+I))
U_gpu = CuSparseMatrixCSC(Precilu.U)

P = LUperso(LowerTriangular(L_gpu), UpperTriangular(U_gpu));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7,maxiter=1000)

Note the change to unit_diag which you recommended in the other issue. Let me know if you have any ideas. Also, thank you for the heads up in-regard to ILU vs Cholesky. Should I not be using sv2!?

haampie · 2021-01-09T13:18:17Z

I think what @amontoison tries to say is if you use CUDA#master (or is there a tagged version already?) you can do this:

using SparseArrays, LinearAlgebra, CUDA, IncompleteLU
using IncompleteLU: ILUFactorization
import LinearAlgebra: ldiv!

struct CudaILUFactorization{TL, TU}
    L::TL
    U::TU
end

function CudaILUFactorization(f::ILUFactorization)
    L = UnitLowerTriangular(cu(f.L))
    U = UpperTriangular(transpose(cu(f.U)))

    return CudaILUFactorization(L, U)
end

LinearAlgebra.ldiv!(f::CudaILUFactorization, x) = ldiv!(f.U, ldiv!(f.L, x))

function example()
    A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I
    fact = ilu(A, τ = 0.001f0);
    cuda_fact = CudaILUFactorization(fact);

    x = rand(Float32, 1000);

    return norm(Array(ldiv!(cuda_fact, cu(x))) - fact \ x)
end

s.t.

julia> example()
2.401256f-8

haampie · 2021-01-09T13:21:01Z

Now, I'm not 100% sure anymore why I wasn't using the UnitLowerTriangular + Transpose types for the CPU. IIRC there was some performance issue or dispatch issue back in the days. Maybe that's resolved by now. I should check.

amontoison · 2021-01-09T17:17:14Z

@Omega-xyZac you forgot the factors P.L and P.Ut.

function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
    copyto!(y, x)
    sv2!('T', 'L', 1.0, P.L, y, 'O')
    sv2!('N', 'U', 1.0, P.Ut, y, 'O',unit_diag=true)
    return y
end
 
function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
    sv2!('T', 'L', 1.0, P.L, x, 'O')
    sv2!('N', 'U', 1.0, P.Ut, x, 'O',unit_diag=true)
    return x
end

@haampie The code for the backward and forward sweeps was slow If I remember well.
I don't know if it was improved but I know that the dispatch issues were solved.

The last release 2.4.0 should include these modifications Harmen but I don't know why I have the old behavior of sv2!.

zjwegert · 2021-01-10T02:30:03Z

Thank you both for all the help. I really appreciate it. I'm still having issues with the GPU version VS CPU. CPU converges in 40 iterations whilst GPU never converges and hits iteration limit. I have attached an example A and b along with the script. Would appreciate any ideas.

A&b.zip
&

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers
CUDA.allowscalar(false)

A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')

A_cpu = A
b_cpu = b/norm(b)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

struct LUgpu{TL, TU}
	L::TL
	U::TU	# transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('T', 'L', 1.0, P.L, y, 'O')
  sv2!('N', 'U', 1.0, P.U, y, 'O',unit_diag=true)
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('T', 'L', 1.0, P.L, x, 'O')
  sv2!('N', 'U', 1.0, P.U, x, 'O',unit_diag=true)
  return x
end

P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(Precilu.U));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7)

P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=10^-7) # <--- Working as expected

Probably still doing something silly here...

amontoison · 2021-01-10T21:05:46Z

I found the mistakes :

b = readdlm("./b.mat",' ') -> b = readdlm("./b.mat",' ')[:]
CuSparseMatrixCSC(Precilu.U) -> CuSparseMatrixCSC(sparse(Precilu.U')))
Update ldiv! functions

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
CUDA.allowscalar(false)
using DelimitedFiles
using IterativeSolvers

A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')[:]
n, m = size(A)

A_cpu = A
b_cpu = b/norm(b)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

struct LUgpu{TL, TU}
  L::TL
  U::TU # transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('N', 'L', 1.0, P.L, y, 'O')
  sv2!('N', 'U', 1.0, P.U, y, 'O')
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('N', 'L', 1.0, P.L, x, 'O')
  sv2!('N', 'U', 1.0, P.U, x, 'O')
  return x
end

P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(sparse(Precilu.U')));
x1 = cg(A_gpu,b_gpu,verbose=true,Pl=P)
norm(b_gpu - A_gpu * x1)

P_cpu = ilu(A_cpu,τ=3.0)
x2 = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu)
norm(b_cpu - A_cpu * x2)

But as I explained before, It's not relevant to use ILU factorization for symmetric positive definite linear systems...
If you also want to solve linear systems on GPU, you should compute the preconditioner on GPU, for example an Incomplete Cholesky factorization.

The best solution for your problem is from my point of view:

using LinearOperators
import Krylov

F = ic02(A_gpu, 'O')

# Solve Fy = x
function ldiv2!(y, F, x)
  copyto!(y, x)
  sv2!('T', 'U', 1.0, F, y, 'O')
  sv2!('N', 'U', 1.0, F, y, 'O')
  return y
end

# Operator that model F⁻¹
y = similar(b_gpu); n = length(b_gpu); T = eltype(b_gpu)
opM = LinearOperator(T, n, n, true, true, x -> ldiv2!(y, F, x))

# Solve a symmetric positive definite system with an incomplete Cholesky preconditioner on GPU
(x, stats) = Krylov.cg(A_gpu, b_gpu, M=opM)

haampie · 2021-01-10T22:17:57Z

Also, at least on the CPU, computing the incomplete Cholesky factorization is much more efficient than computing the incomplete LU decomposition, cause it barely requires bookkeeping for indices.

zjwegert · 2021-01-10T22:50:47Z

Thank you both for the info. The main issue that I was having with ic02 and even ilu02 was the lack of drop tolerance.. The example matrix I gave you is for a tiny system (6x6x6 FE cells). In reality, I am usually dealing with 500000x500000 matrices or larger. As such, using the full IC/ILU decompositions is very slow. Is there a way to set the drop tolerance for these?

amontoison · 2021-01-10T23:02:10Z

ic02 and ilu02 are IC(0) and ILU(0) factorizations. The factors L and U computed by IncompleteLU have more nnz.

zjwegert · 2021-01-10T23:07:44Z

Using one of the larger systems I have

nnz(Precilu.U),nnz(Precilu.L) = (569704,69708)
nnz(F) = 53999152

where F is ic02 and Precilu is ilu with a drop tolerance of 3.0.

cuihantao · 2022-06-24T02:52:48Z

    A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I

Hello @haampie ,

If the matrix is ill-conditioned, say A = sprand(Float32, 1000, 1000, 10 / 1000) + 0.1I, the norm of the difference Array(ldiv!(cuda_fact, cu(x))) - fact \ x becomes significant. What would be the possible cause? Thanks!

cond number: 7681.252426717046
norm: 3.2563258f14

zjwegert mentioned this issue Jan 9, 2021

GPU Documention Broken JuliaSmoothOptimizers/Krylov.jl#257

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using with CUDA Solver? #18

Using with CUDA Solver? #18

zjwegert commented Jan 8, 2021 •

edited

Loading

amontoison commented Jan 9, 2021 •

edited

Loading

zjwegert commented Jan 9, 2021

haampie commented Jan 9, 2021

haampie commented Jan 9, 2021 •

edited

Loading

amontoison commented Jan 9, 2021 •

edited

Loading

zjwegert commented Jan 10, 2021 •

edited

Loading

amontoison commented Jan 10, 2021 •

edited

Loading

haampie commented Jan 10, 2021

zjwegert commented Jan 10, 2021

amontoison commented Jan 10, 2021

zjwegert commented Jan 10, 2021

cuihantao commented Jun 24, 2022

Using with CUDA Solver? #18

Using with CUDA Solver? #18

Comments

zjwegert commented Jan 8, 2021 • edited Loading

amontoison commented Jan 9, 2021 • edited Loading

zjwegert commented Jan 9, 2021

haampie commented Jan 9, 2021

haampie commented Jan 9, 2021 • edited Loading

amontoison commented Jan 9, 2021 • edited Loading

zjwegert commented Jan 10, 2021 • edited Loading

amontoison commented Jan 10, 2021 • edited Loading

haampie commented Jan 10, 2021

zjwegert commented Jan 10, 2021

amontoison commented Jan 10, 2021

zjwegert commented Jan 10, 2021

cuihantao commented Jun 24, 2022

zjwegert commented Jan 8, 2021 •

edited

Loading

amontoison commented Jan 9, 2021 •

edited

Loading

haampie commented Jan 9, 2021 •

edited

Loading

amontoison commented Jan 9, 2021 •

edited

Loading

zjwegert commented Jan 10, 2021 •

edited

Loading

amontoison commented Jan 10, 2021 •

edited

Loading