Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using with CUDA Solver? #18

Open
zjwegert opened this issue Jan 8, 2021 · 12 comments
Open

Using with CUDA Solver? #18

zjwegert opened this issue Jan 8, 2021 · 12 comments

Comments

@zjwegert
Copy link

zjwegert commented Jan 8, 2021

Hi,
I'm interested in using this package with a GPU CG method from IterativeMethods. I think only forward_substitution and backward_substitution methods are required to extend this to CUDA. I've tried using CUDAs sv2! with little success. Any thoughts?

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers
CUDA.allowscalar(false)

val = sprand(200,200,0.05);
A_cpu = val*val'
b_cpu = rand(200)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)                       
  sv2!('N', 'L', 1.0, P, y, 'O') 
  sv2!('N', 'U', 1.0, P, y, 'O')  
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)                        
  sv2!('N', 'L', 1.0, P, x, 'O') 
  sv2!('N', 'U', 1.0, P, x, 'O')  
  return x
end

# function LinearAlgebra.ldiv!(_lu::LUperso, rhs::CUDA.CuArray)
# 	_x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
# 	rhs .= vec(_x)
# 	# CUDA.unsafe_free!(_x)
# 	rhs
# end
#
# function LinearAlgebra.ldiv!(yrhs::CuArray,_lu::LUperso, rhs::CuArray)
# 	copyto!(yrhs,rhs)
# 	_x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
# 	rhs .= vec(_x)
# 	# CUDA.unsafe_free!(_x)
# 	rhs
# end

struct LUgpu
	L
	Ut	# transpose of U in LU decomposition
end


P = LUperso(LowerTriangular(CuSparseMatrixCSR(I+Precilu.L)), UpperTriangular(CuSparseMatrixCSR(sparse(Precilu.U'))));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7,maxiter=1000)

P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=10^-7,maxiter=1000)
A_cpu\b_cpu
@amontoison
Copy link
Contributor

amontoison commented Jan 9, 2021

You need to convert Precilu.L and Precilu.U before to make it working on GPU.

L_gpu = CuSparseMatrixCSC(Precilu.L)
U_gpu = CuSparseMatrixCSC(Precilu.U)

After that you can use sv2! for the backward and forward sweeps.
Be careful because the storage is a little bit specific for the U factor (@haampie I am right ?)
With Tim, we updated some functions for triangular types in CUSPARSE to avoid the use of sv2! for information (JuliaGPU/CUDA.jl#575).

About your example, CG requires a symmetric and positive definite preconditioner. ilu is great for non-symmetric square systems and Krylov solvers bicgstab, gmres, etc....
An incomplete Cholesky or LDL' factorization is more appropriate for your problem.

@zjwegert
Copy link
Author

zjwegert commented Jan 9, 2021

Hi @amontoison, I gave that a go but no success! Do you mean something like

...
import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('T', 'L', 1.0, P, y, 'O')
  sv2!('N', 'U', 1.0, P, y, 'O',unit_diag=true)
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('T', 'L', 1.0, P, x, 'O')
  sv2!('N', 'U', 1.0, P, x, 'O',unit_diag=true)
  return x
end

struct LUgpu
	L
	Ut	# transpose of U in LU decomposition
end

L_gpu = CuSparseMatrixCSC(sparse(Precilu.L+I))
U_gpu = CuSparseMatrixCSC(Precilu.U)

P = LUperso(LowerTriangular(L_gpu), UpperTriangular(U_gpu));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7,maxiter=1000)

Note the change to unit_diag which you recommended in the other issue. Let me know if you have any ideas. Also, thank you for the heads up in-regard to ILU vs Cholesky. Should I not be using sv2!?

@haampie
Copy link
Owner

haampie commented Jan 9, 2021

I think what @amontoison tries to say is if you use CUDA#master (or is there a tagged version already?) you can do this:

using SparseArrays, LinearAlgebra, CUDA, IncompleteLU
using IncompleteLU: ILUFactorization
import LinearAlgebra: ldiv!

struct CudaILUFactorization{TL, TU}
    L::TL
    U::TU
end

function CudaILUFactorization(f::ILUFactorization)
    L = UnitLowerTriangular(cu(f.L))
    U = UpperTriangular(transpose(cu(f.U)))

    return CudaILUFactorization(L, U)
end

LinearAlgebra.ldiv!(f::CudaILUFactorization, x) = ldiv!(f.U, ldiv!(f.L, x))

function example()
    A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I
    fact = ilu(A, τ = 0.001f0);
    cuda_fact = CudaILUFactorization(fact);

    x = rand(Float32, 1000);

    return norm(Array(ldiv!(cuda_fact, cu(x))) - fact \ x)
end

s.t.

julia> example()
2.401256f-8

@haampie
Copy link
Owner

haampie commented Jan 9, 2021

Now, I'm not 100% sure anymore why I wasn't using the UnitLowerTriangular + Transpose types for the CPU. IIRC there was some performance issue or dispatch issue back in the days. Maybe that's resolved by now. I should check.

@amontoison
Copy link
Contributor

amontoison commented Jan 9, 2021

@Omega-xyZac you forgot the factors P.L and P.Ut.

function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
    copyto!(y, x)
    sv2!('T', 'L', 1.0, P.L, y, 'O')
    sv2!('N', 'U', 1.0, P.Ut, y, 'O',unit_diag=true)
    return y
end
 
function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
    sv2!('T', 'L', 1.0, P.L, x, 'O')
    sv2!('N', 'U', 1.0, P.Ut, x, 'O',unit_diag=true)
    return x
end

@haampie The code for the backward and forward sweeps was slow If I remember well.
I don't know if it was improved but I know that the dispatch issues were solved.

The last release 2.4.0 should include these modifications Harmen but I don't know why I have the old behavior of sv2!.

@zjwegert
Copy link
Author

zjwegert commented Jan 10, 2021

Thank you both for all the help. I really appreciate it. I'm still having issues with the GPU version VS CPU. CPU converges in 40 iterations whilst GPU never converges and hits iteration limit. I have attached an example A and b along with the script. Would appreciate any ideas.

A&b.zip
&

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers
CUDA.allowscalar(false)

A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')

A_cpu = A
b_cpu = b/norm(b)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

struct LUgpu{TL, TU}
	L::TL
	U::TU	# transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('T', 'L', 1.0, P.L, y, 'O')
  sv2!('N', 'U', 1.0, P.U, y, 'O',unit_diag=true)
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('T', 'L', 1.0, P.L, x, 'O')
  sv2!('N', 'U', 1.0, P.U, x, 'O',unit_diag=true)
  return x
end

P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(Precilu.U));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7)

P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=10^-7) # <--- Working as expected

Probably still doing something silly here...

@amontoison
Copy link
Contributor

amontoison commented Jan 10, 2021

I found the mistakes :

  • b = readdlm("./b.mat",' ') -> b = readdlm("./b.mat",' ')[:]
  • CuSparseMatrixCSC(Precilu.U) -> CuSparseMatrixCSC(sparse(Precilu.U')))
  • Update ldiv! functions
using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
CUDA.allowscalar(false)
using DelimitedFiles
using IterativeSolvers

A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')[:]
n, m = size(A)

A_cpu = A
b_cpu = b/norm(b)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

struct LUgpu{TL, TU}
  L::TL
  U::TU # transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('N', 'L', 1.0, P.L, y, 'O')
  sv2!('N', 'U', 1.0, P.U, y, 'O')
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('N', 'L', 1.0, P.L, x, 'O')
  sv2!('N', 'U', 1.0, P.U, x, 'O')
  return x
end

P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(sparse(Precilu.U')));
x1 = cg(A_gpu,b_gpu,verbose=true,Pl=P)
norm(b_gpu - A_gpu * x1)

P_cpu = ilu(A_cpu,τ=3.0)
x2 = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu)
norm(b_cpu - A_cpu * x2)

But as I explained before, It's not relevant to use ILU factorization for symmetric positive definite linear systems...
If you also want to solve linear systems on GPU, you should compute the preconditioner on GPU, for example an Incomplete Cholesky factorization.

The best solution for your problem is from my point of view:

using LinearOperators
import Krylov

F = ic02(A_gpu, 'O')

# Solve Fy = x
function ldiv2!(y, F, x)
  copyto!(y, x)
  sv2!('T', 'U', 1.0, F, y, 'O')
  sv2!('N', 'U', 1.0, F, y, 'O')
  return y
end

# Operator that model F⁻¹
y = similar(b_gpu); n = length(b_gpu); T = eltype(b_gpu)
opM = LinearOperator(T, n, n, true, true, x -> ldiv2!(y, F, x))

# Solve a symmetric positive definite system with an incomplete Cholesky preconditioner on GPU
(x, stats) = Krylov.cg(A_gpu, b_gpu, M=opM)

@haampie
Copy link
Owner

haampie commented Jan 10, 2021

Also, at least on the CPU, computing the incomplete Cholesky factorization is much more efficient than computing the incomplete LU decomposition, cause it barely requires bookkeeping for indices.

@zjwegert
Copy link
Author

Thank you both for the info. The main issue that I was having with ic02 and even ilu02 was the lack of drop tolerance.. The example matrix I gave you is for a tiny system (6x6x6 FE cells). In reality, I am usually dealing with 500000x500000 matrices or larger. As such, using the full IC/ILU decompositions is very slow. Is there a way to set the drop tolerance for these?

@amontoison
Copy link
Contributor

ic02 and ilu02 are IC(0) and ILU(0) factorizations. The factors L and U computed by IncompleteLU have more nnz.

@zjwegert
Copy link
Author

Using one of the larger systems I have

nnz(Precilu.U),nnz(Precilu.L) = (569704,69708)
nnz(F) = 53999152

where F is ic02 and Precilu is ilu with a drop tolerance of 3.0.

@cuihantao
Copy link

    A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I

Hello @haampie ,

If the matrix is ill-conditioned, say A = sprand(Float32, 1000, 1000, 10 / 1000) + 0.1I, the norm of the difference Array(ldiv!(cuda_fact, cu(x))) - fact \ x becomes significant. What would be the possible cause? Thanks!

cond number: 7681.252426717046
norm: 3.2563258f14

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants