omatcopy much slower than copy in OMP loop #4902

Open

david-cortes opened this issue Sep 21, 2024 · 1 comment

david-cortes commented Sep 21, 2024

I'm trying to use cblas_domatcopy to transpose large row-major matrices.

I'm finding that the function is slower than a simple loop of cblas_dcopy calls parallelized with OpenMP (with the number of threads set to the number of logical cores; otherwise the OMP loop is much slower).

cblas_domatcopy appears to be especially slow when the input has more columns than rows. Relatedly, in the dcopy loop there is also a large timing difference depending on whether the copies walk the rows of the input or the rows of the output, so I'm guessing that perhaps omatcopy always follows the same order regardless of the matrix shape.

(code is provided at the end of this post)
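
For reference, the omatcopy call being timed is essentially the following (a minimal sketch - the wrapper name transpose_with_omatcopy is just for illustration, and alpha = 1 since no scaling is wanted):

#include <cblas.h>

/* Sketch of the benchmarked call: transpose a row-major nrows x ncols A into B. */
void transpose_with_omatcopy(const double *A, int nrows, int ncols, double *B)
{
    cblas_domatcopy(CblasRowMajor, CblasTrans,
                    nrows, ncols,   /* dimensions of the source matrix A */
                    1.0,            /* alpha = 1: plain copy, no scaling */
                    A, ncols,       /* lda = ncols for row-major A       */
                    B, nrows);      /* ldb = nrows for the transposed B  */
}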

  • Timings in seconds on an Intel 12700H, average of 7 runs:

    • Input size: 100,000 x 5,000
      • OpenBLAS cblas_domatcopy: 3.12
      • OpenMP dcopy loop: 2.38
      • MKL mkl_domatcopy: 1.26
    • Input size: 5,000 x 100,000
      • OpenBLAS cblas_domatcopy: 3.74
      • OpenMP dcopy loop: 1.23
      • MKL mkl_domatcopy: 1.27
  • Timings in seconds on an AMD Ryzen 7840HS, average of 7 runs:

    • Input size: 100,000 x 5,000
      • OpenBLAS cblas_domatcopy: 0.922
      • OpenMP dcopy loop: 0.586
      • MKL mkl_domatcopy: 0.560
    • Input size: 5,000 x 100,000
      • OpenBLAS cblas_domatcopy: 1.12
      • OpenMP dcopy loop: 0.402
      • MKL mkl_domatcopy: 0.516

OpenBLAS version: 0.3.26, OpenMP variant.

Code that I'm using for the OMP dcopy loop:

#include <stddef.h>
#include <cblas.h>

/* Out-of-place transpose: A is nrows x ncols (row-major), B is ncols x nrows. */
void transpose_mat(const double *A, const int nrows, const int ncols, double *B, int nthreads)
{
    if (nrows >= ncols)
    {
        /* Copy each row of A into the corresponding column of B. */
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int row = 0; row < nrows; row++)
            cblas_dcopy(ncols, A + (size_t)row*(size_t)ncols, 1, B + row, nrows);
    }
    else
    {
        /* Copy each column of A into the corresponding row of B. */
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int col = 0; col < ncols; col++)
            cblas_dcopy(nrows, A + col, ncols, B + (size_t)col*(size_t)nrows, 1);
    }
}
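
And a minimal timing sketch along these lines (an assumed harness, not the exact one behind the numbers above; it only times a single run and fills A with arbitrary values):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void transpose_mat(const double *A, int nrows, int ncols, double *B, int nthreads);

int main(void)
{
    const int nrows = 100000, ncols = 5000;
    double *A = malloc((size_t)nrows * ncols * sizeof(double));
    double *B = malloc((size_t)nrows * ncols * sizeof(double));
    for (size_t i = 0; i < (size_t)nrows * ncols; i++)
        A[i] = (double)i;   /* arbitrary fill, just to touch the memory */

    double t0 = omp_get_wtime();
    transpose_mat(A, nrows, ncols, B, omp_get_max_threads());
    double t1 = omp_get_wtime();
    printf("OpenMP dcopy loop: %.3f s\n", t1 - t0);

    free(A); free(B);
    return 0;
}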
martin-frbg (Collaborator) commented
related to #1243 - the current ?matcopy code is indeed just a fairly poorly optimized stopgap implementation
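
(For context on the "stopgap" remark: an optimized out-of-place transpose typically tiles the matrix into cache-sized blocks, roughly as in the illustrative sketch below. This is not the OpenBLAS code, and the block size is an arbitrary assumption.)

#include <stddef.h>

/* Illustrative cache-blocked transpose (not OpenBLAS code; block size arbitrary). */
enum { TBLOCK = 64 };

void transpose_blocked(const double *A, int nrows, int ncols, double *B)
{
    for (int ib = 0; ib < nrows; ib += TBLOCK)
        for (int jb = 0; jb < ncols; jb += TBLOCK)
        {
            const int imax = ib + TBLOCK < nrows ? ib + TBLOCK : nrows;
            const int jmax = jb + TBLOCK < ncols ? jb + TBLOCK : ncols;
            for (int i = ib; i < imax; i++)
                for (int j = jb; j < jmax; j++)
                    B[(size_t)j * nrows + i] = A[(size_t)i * ncols + j];
        }
}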
