Performance regression in version 0.3.28 on Graviton4 #4939

Open
dnoan opened this issue Oct 15, 2024 · 5 comments

Comments

@dnoan
Contributor

dnoan commented Oct 15, 2024

Recently I reported a performance regression at a4e56e0. A glitch in GitHub prevents me from posting a follow-up there, so I am opening this ticket.

In my case the app calls DGEMM with 86100 different inputs. I picked some of what appear to be the most common calls:

DGEMM TRANSA: T, TRANSB: N, M: 45, N: 1, K: 211, ALPHA: -1.000000, LDA: 256, LDB: 256, BETA: 1.000000, LDC: 5383
DGEMM TRANSA: T, TRANSB: N, M: 23, N: 1, K: 117, ALPHA: -1.000000, LDA: 140, LDB: 140, BETA: 1.000000, LDC: 5383
DGEMM TRANSA: T, TRANSB: N, M: 211, N: 1, K: 45, ALPHA: -1.000000, LDA: 45, LDB: 45, BETA: 1.000000, LDC: 211
DGEMM TRANSA: T, TRANSB: N, M: 117, N: 1, K: 23, ALPHA: -1.000000, LDA: 23, LDB: 23, BETA: 1.000000, LDC: 117
DGEMM TRANSA: T, TRANSB: N, M: 9, N: 1, K: 75, ALPHA: -1.000000, LDA: 84, LDB: 84, BETA: 1.000000, LDC: 5383
DGEMM TRANSA: T, TRANSB: N, M: 98, N: 1, K: 17, ALPHA: -1.000000, LDA: 17, LDB: 17, BETA: 1.000000, LDC: 98
DGEMM TRANSA: T, TRANSB: N, M: 94, N: 1, K: 56, ALPHA: -1.000000, LDA: 56, LDB: 56, BETA: 1.000000, LDC: 94

DGEMM TRANSA: N, TRANSB: N, M: 5, N: 5, K: 1, ALPHA: -1.000000, LDA: 5, LDB: 301, BETA: 1.000000, LDC: 301
DGEMM TRANSA: N, TRANSB: N, M: 8, N: 8, K: 1, ALPHA: -1.000000, LDA: 8, LDB: 301, BETA: 1.000000, LDC: 301
DGEMM TRANSA: N, TRANSB: N, M: 33, N: 20, K: 1, ALPHA: -1.000000, LDA: 33, LDB: 301, BETA: 1.000000, LDC: 301
DGEMM TRANSA: N, TRANSB: N, M: 69, N: 24, K: 1, ALPHA: -1.000000, LDA: 69, LDB: 301, BETA: 1.000000, LDC: 301
DGEMM TRANSA: N, TRANSB: N, M: 210, N: 15, K: 1, ALPHA: -1.000000, LDA: 210, LDB: 211, BETA: 1.000000, LDC: 211
DGEMM TRANSA: N, TRANSB: N, M: 196, N: 1, K: 1, ALPHA: -1.000000, LDA: 196, LDB: 211, BETA: 1.000000, LDC: 211
DGEMM TRANSA: N, TRANSB: N, M: 211, N: 211, K: 45, ALPHA: -1.000000, LDA: 256, LDB: 256, BETA: 1.000000, LDC: 256
@Mousius
Contributor

Mousius commented Oct 15, 2024

Hi @dnoan,

That is unfortunate. What's interesting is that many of these have N=1, which means they should go through GEMV, not GEMM:

OpenBLAS/interface/gemm.c

Lines 501 to 545 in 8483a71

#if defined(GEMM_GEMV_FORWARD) && !defined(GEMM3M) && !defined(COMPLEX) && !defined(BFLOAT16)
  // Check if we can convert GEMM -> GEMV
  if (args.k != 0) {
    if (args.n == 1) {
      blasint inc_x = 1;
      blasint inc_y = 1;
      // These were passed in as blasint, but the struct translates them to blaslong
      blasint m = args.m;
      blasint n = args.k;
      blasint lda = args.lda;
      // Create new transpose parameters
      char NT = 'N';
      if (transa & 1) {
        NT = 'T';
        m = args.k;
        n = args.m;
      }
      if (transb & 1) {
        inc_x = args.ldb;
      }
      GEMV(&NT, &m, &n, args.alpha, args.a, &lda, args.b, &inc_x, args.beta, args.c, &inc_y);
      return;
    }
    if (args.m == 1) {
      blasint inc_x = args.lda;
      blasint inc_y = args.ldc;
      // These were passed in as blasint, but the struct translates them to blaslong
      blasint m = args.k;
      blasint n = args.n;
      blasint ldb = args.ldb;
      // Create new transpose parameters
      char NT = 'T';
      if (transa & 1) {
        inc_x = 1;
      }
      if (transb & 1) {
        NT = 'N';
        m = args.n;
        n = args.k;
      }
      GEMV(&NT, &m, &n, args.alpha, args.b, &ldb, args.a, &inc_x, args.beta, args.c, &inc_y);
      return;
    }
  }
#endif
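
To make the N=1 branch above concrete, here is a minimal sketch of how the DGEMM arguments are remapped before GEMV is called. The `gemv_args` struct and the `forward_n1` function are hypothetical names introduced only for illustration and are not part of OpenBLAS; the transpose flags are simplified to 0 (N) and 1 (T).

```c
/* Hypothetical container for the arguments GEMV would receive.
 * Illustration only; not an OpenBLAS type. */
typedef struct {
    char trans;  /* 'N' or 'T' as passed to GEMV */
    int  m;      /* rows of A as stored in memory */
    int  n;      /* columns of A as stored in memory */
    int  lda;    /* leading dimension of A */
    int  inc_x;  /* stride through B, which is treated as the vector x */
    int  inc_y;  /* stride through C, which is treated as the vector y */
} gemv_args;

/* Mirrors the N == 1 branch of the quoted snippet.
 * transa/transb: 0 = not transposed, 1 = transposed (simplified flags). */
gemv_args forward_n1(int transa, int transb, int M, int K, int lda, int ldb)
{
    gemv_args g = { 'N', M, K, lda, 1, 1 };
    if (transa) {
        /* A is stored as K x M, so GEMV gets 'T' with swapped dimensions. */
        g.trans = 'T';
        g.m = K;
        g.n = M;
    }
    if (transb) {
        /* B is stored as 1 x K, so consecutive elements are ldb apart. */
        g.inc_x = ldb;
    }
    return g;
}
```

For the first call in the report (TRANSA: T, TRANSB: N, M: 45, K: 211, LDA: 256, LDB: 256), `forward_n1(1, 0, 45, 211, 256, 256)` yields a transposed GEMV with m = 211, n = 45, lda = 256 and unit strides.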

The patch you're indicating, by contrast, is triggered later, after checking the small GEMM permit:

OpenBLAS/interface/gemm.c

Lines 551 to 571 in 8483a71

#if USE_SMALL_MATRIX_OPT
#if !defined(COMPLEX)
  if(GEMM_SMALL_MATRIX_PERMIT(transa, transb, args.m, args.n, args.k, *(FLOAT *)(args.alpha), *(FLOAT *)(args.beta))){
    if(*(FLOAT *)(args.beta) == 0.0){
      (GEMM_SMALL_KERNEL_B0((transb << 2) | transa))(args.m, args.n, args.k, args.a, args.lda, *(FLOAT *)(args.alpha), args.b, args.ldb, args.c, args.ldc);
    }else{
      (GEMM_SMALL_KERNEL((transb << 2) | transa))(args.m, args.n, args.k, args.a, args.lda, *(FLOAT *)(args.alpha), args.b, args.ldb, *(FLOAT *)(args.beta), args.c, args.ldc);
    }
    return;
  }
#else
  if(GEMM_SMALL_MATRIX_PERMIT(transa, transb, args.m, args.n, args.k, alpha[0], alpha[1], beta[0], beta[1])){
    if(beta[0] == 0.0 && beta[1] == 0.0){
      (ZGEMM_SMALL_KERNEL_B0((transb << 2) | transa))(args.m, args.n, args.k, args.a, args.lda, alpha[0], alpha[1], args.b, args.ldb, args.c, args.ldc);
    }else{
      (ZGEMM_SMALL_KERNEL((transb << 2) | transa))(args.m, args.n, args.k, args.a, args.lda, alpha[0], alpha[1], args.b, args.ldb, beta[0], beta[1], args.c, args.ldc);
    }
    return;
  }
#endif
#endif

The dgemm permit for small gemm is conservative (M*N*K <= 64*64*64):
https://github.com/OpenMathLib/OpenBLAS/blob/8483a71169bac112db133d45b39d4def812f81b6/kernel/arm64/gemm_small_kernel_permit_sve.c#L35C14-L35C22

This would indicate the last line (M:211, N:211, K:45) would not trigger the small kernels either.
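
The reasoning above can be sketched as a small classifier that predicts which path a DGEMM call takes. This is a simplification under stated assumptions: `GEMM_GEMV_FORWARD` and `USE_SMALL_MATRIX_OPT` are both enabled, the permit is exactly `M*N*K <= 64*64*64`, and the early exits and other checks in `interface/gemm.c` are ignored. The `gemm_path` enum and `classify_dgemm` function are hypothetical names, not OpenBLAS symbols.

```c
/* Hypothetical labels for the three paths discussed in this thread. */
typedef enum {
    PATH_GEMV,          /* forwarded to GEMV because N == 1 or M == 1 */
    PATH_SMALL_KERNEL,  /* accepted by the small-matrix permit */
    PATH_REGULAR_GEMM   /* falls through to the regular GEMM driver */
} gemm_path;

/* Simplified model of the dispatch order in the quoted snippets:
 * the GEMV forward is checked first, then the small-matrix permit. */
gemm_path classify_dgemm(long long m, long long n, long long k)
{
    if (k != 0 && (n == 1 || m == 1)) {
        return PATH_GEMV;
    }
    if (m * n * k <= 64LL * 64LL * 64LL) {
        return PATH_SMALL_KERNEL;
    }
    return PATH_REGULAR_GEMM;
}
```

Under these assumptions, the N=1 calls from the report map to `PATH_GEMV`, the K=1 calls with N > 1 (for example M: 210, N: 15) map to `PATH_SMALL_KERNEL`, and the last call (M: 211, N: 211, K: 45) maps to `PATH_REGULAR_GEMM`, since 211*211*45 = 2003445 exceeds 64*64*64 = 262144.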

That leaves the K=1 path; could you modify the permit so that you do not take the small matrices for that path and see if your performance improves?

Could you also let me know what compiler you're using?

@martin-frbg
Collaborator

martin-frbg commented Oct 15, 2024

This could indeed be an unintentional downgrade caused by switching from SVE GEMM to NEON GEMV (the SVE implementation of the latter, from #4803, is only available on A64FX right now).

@martin-frbg
Collaborator

It might be worthwhile to copy kernel/arm64/KERNEL.A64FX to kernel/arm64/KERNEL.NEOVERSEV2 and rebuild/retest.

@dnoan
Contributor Author

dnoan commented Oct 16, 2024

Copying KERNEL.A64FX to KERNEL.NEOVERSEV2 didn't improve performance. Actually, the workload ran marginally slower.

@Mousius, can you provide a diff so that I don't mess things up?

@martin-frbg
Collaborator

I guess that would be something like:

--- a/kernel/arm64/gemm_small_kernel_permit_sve.c
+++ b/kernel/arm64/gemm_small_kernel_permit_sve.c
@@ -32,7 +32,7 @@ int CNAME(int transa, int transb, BLASLONG M, BLASLONG N, BLASLONG K, FLOAT alph
   BLASLONG MNK = M * N * K;
 
 #if defined(DOUBLE) // dgemm
-  if (MNK <= 64*64*64)
+  if (MNK <= 64*64*64 && K > 1)
     return 1;
 #else // sgemm
   if (MNK <= 64*64*64)
