XPMEM runtime warning/error #46

Open
BiplabRaut opened this issue Mar 25, 2021 · 10 comments


@BiplabRaut

I am trying to use XPMEM with Open MPI 4.x. I configured Open MPI 4.1.0 with the following command:
$ ompi_info --all|grep 'command line'
Configure command line: '--prefix=/home/server/ompi4_xmem' '--with-xpmem=/home/server/xpmm' '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'
User-specified command line parameters passed to ROMIO's configure script
Complete set of command-line parameters passed to ROMIO's configure script

However, I get a warning/error when running the built-in FFTW MPI benchmark:
$ mpirun --map-by core -rank-by core --bind-to core ./mpi-bench -s ic1000000
WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: lib-server-03
Error code: 2 (No such file or directory)
Problem: ic1000000, setup: 580.97 ms, time: 1.76 ms, ``mflops'': 56555
[lib-server-03:1297333] 127 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[lib-server-03:1297333] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Because of this warning/error, I am not sure whether the MPI program is using XPMEM or CMA.
Can you please help me resolve it?

Thanks in advance.

@mahendrapaipuri

Hello,

I have a similar issue with Open MPI 4.1.1. I built Open MPI with UCX, and configured UCX with the --with-xpmem flag. When I run the Intel MPI Benchmarks suite to measure shared-memory bandwidth, I get the same error.

Benchmark:
mpirun --bind-to core -np $SLURM_NTASKS_PER_NODE IMB-MPI1 Bcast
Log:

--------------------------------------------------------------------------
WARNING: Could not generate an xpmem segment id for this process'
address space.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: openhpc-compute-0
  Error code: 2 (No such file or directory)
--------------------------------------------------------------------------

XPMEM appears to be installed correctly: I can see the header files and shared libraries in the installation folder. I could not find any further leads on this error.

@jdinan
Contributor

jdinan commented Aug 9, 2021

The Open UCX team has started maintaining their own fork of XPMEM. Your issue might get more attention if you post it there: https://github.com/openucx/xpmem

@hjelmn
Collaborator

hjelmn commented Aug 9, 2021

Indeed. I have had limited time to respond to the issues. This one may be an xpmem issue or UCX and MVAPICH issues.

I intend to bring all the fixes from the UCX fork here, then move the repository to hpc/xpmem, where I can add more people to review fixes.

@jdinan
Contributor

jdinan commented Aug 9, 2021

@hjelmn As a user, it would be great to have a single source for XPMEM. I don't have an opinion on who should own the repository, but would prefer to have just one.

@hjelmn
Collaborator

hjelmn commented Aug 9, 2021

@jdinan Agreed. That is why I want to move it to the LANL-collaborative hpc org. UCX should not own the main fork but in the hpc org I can give UCX developers more access :)

@hjelmn
Collaborator

hjelmn commented Aug 9, 2021

I will try to get that done tomorrow.

@gkatev

gkatev commented Jun 17, 2022

As I'm randomly cruising through here:
"No such file or directory" usually occurs when /dev/xpmem is not present. @BiplabRaut @mahendrapaipuri is the kernel module inserted? Furthermore, if you get "Permission denied", it is likely due to insufficient Unix permissions on /dev/xpmem.
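
The two checks described above (is the device node present, and is it accessible to the current user) can be sketched as a small shell script. This is only an illustrative sketch: the function names `xpmem_device_status` and `xpmem_module_loaded` are hypothetical helpers, not part of XPMEM or Open MPI, and the `/sys/module/xpmem` path assumes the kernel module is named `xpmem` on a Linux system.

```shell
#!/bin/sh
# Hypothetical helper (not part of XPMEM or Open MPI): classify the state
# of the XPMEM device node. The default path /dev/xpmem is taken from the
# thread; pass another path as $1 to check something else.
xpmem_device_status() {
    dev="${1:-/dev/xpmem}"
    if [ ! -e "$dev" ]; then
        echo "missing"      # corresponds to "No such file or directory"
    elif [ ! -r "$dev" ] || [ ! -w "$dev" ]; then
        echo "no-access"    # corresponds to "Permission denied"
    else
        echo "ok"
    fi
}

# Hypothetical helper: succeed if the module directory exists.
# Assumption: a loaded module named "xpmem" shows up as /sys/module/xpmem.
xpmem_module_loaded() {
    [ -d "${1:-/sys/module/xpmem}" ]
}

echo "device: $(xpmem_device_status)"
if xpmem_module_loaded; then
    echo "module: loaded"
else
    echo "module: not loaded"
fi
```

A result of "missing" together with "module: not loaded" would point to the kernel module not being inserted, which needs an administrator to fix. A result of "no-access" would point to the permissions on the device node. Note that the read/write tests always pass for root, so the check is only meaningful when run as the user who launches the MPI job.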

@jywangx

jywangx commented Dec 28, 2022

Hey @gkatev 🙂, sorry to bother you here. I encountered a problem with xpmem_make while trying to use XPMEM in Open MPI's SMSC framework. While looking for an answer I saw your comment.

The Open MPI MCA output looks like this:

mca_smsc_base_select: could not select component xpmem. query returned error code -16

It may come from here (opal/mca/smsc/xpmem/smsc_xpmem_component line 143):

    mca_smsc_xpmem_component.my_seg_id = xpmem_make(0, XPMEM_MAXADDR_SIZE, XPMEM_PERMIT_MODE,
                                                    (void *) 0666);
    if (-1 == mca_smsc_xpmem_component.my_seg_id) {
        return OPAL_ERR_NOT_AVAILABLE;
    }

I'm running my program on a supercomputer platform, so I don't have root permission, and there is no /dev/xpmem on the system. Could the problem with xpmem_make be related to this as well (although I don't see "No such file or directory" or "Permission denied" in the output)?

@gkatev

gkatev commented Dec 28, 2022

Hi @jywangx, definitely don't expect XPMEM to work without the kernel module inserted and /dev/xpmem present. It is most likely related to this: xpmem_make (and other ops) work via ioctl to /dev/xpmem.

I imagine the -16 you see is the value of OPAL_ERR_NOT_AVAILABLE, and if you checked errno after xpmem_make you'd get No such file or directory. In the comments above (OpenMPI v4), the XPMEM code resided in btl/vader instead of smsc/xpmem (OpenMPI v5), and apparently that used to emit a help text that includes errno, while smsc does not.

@jywangx

jywangx commented Dec 28, 2022

> Hi @jywangx, definitely don't expect XPMEM to work without the kernel module inserted and /dev/xpmem present. It most likely is related to this, xpmem_make (and other ops) work via ioctl to /dev/xpmem.
>
> I imagine the -16 you see is the value of OPAL_ERR_NOT_AVAILABLE, and if you checked errno after xpmem_make you'd get No such file or directory. In the comments above (OpenMPI v4), the XPMEM code resided in btl/vader instead of smsc/xpmem (OpenMPI v5), and apparently that used to emit a help text that includes errno, while smsc does not.

Got it :) Thank you, this really helped me.

tzafrir-mellanox pushed a commit to tzafrir-mellanox/xpmem that referenced this issue Sep 11, 2024
KERNEL: Use pte offset kernel function for mapped pages