Summary of recent UCX TPCxBB tests and intermittent failures #670

Open
beckernick opened this issue Jan 19, 2021 · 15 comments

@beckernick
Member

beckernick commented Jan 19, 2021

We've been doing some UCX TPCxBB tests on a Slurm cluster. Across multiple configurations, we've run into intermittent and as-yet-unexplained failures using UCX and InfiniBand. We have been using UCX 1.8 and 1.9 rather than UCX 1.10 due to the issues already discussed (see #668 and associated issues/PRs). This issue summarizes several of the configurations we've recently tested and the failures we've seen with them.

The setup includes a manual QA check of the mappings between GPUs, MLNX NICs, and NIC interfaces. The specific failures are being triaged and may be split out into their own, more detailed issues, which can be cross-linked here for tracking.
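
For reference, this is roughly the kind of manual check we mean -- a minimal sketch, assuming the MLNX_OFED tools (ibdev2netdev) and nvidia-smi are available on each node:

# Show PCIe/NUMA affinity between GPUs, NICs, and CPUs on the node
nvidia-smi topo -m

# Map Mellanox IB devices (mlx5_*) to their network interfaces and link state
ibdev2netdev -v

# Confirm the interfaces the workers should bind to are actually up
ip -br addr show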

Initial Setup

  • UCX 1.8.1
  • UCX-Py 0.18 (2021-01-19)
  • RAPIDS 0.18 nightlies (2021-01-19)
  • CUDA 11.0
  • Ubuntu 18.04
  • MLNX_OFED-5.1-2.5.8.0

With this setup, we were able to run a few queries successfully. However, we experienced intermittent segfaults that were not consistently reproducible.

We also saw the following warning related to libibcm. We are still triaging it, but it may resolve itself with Ubuntu 20.04. Others (including @pentschev ) have suggested that we may simply no longer need libibcm.

> libibcm: kernel ABI version 0 doesn't match library version 5.
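
For anyone triaging the same warning, a quick way to check whether libibcm is even present on a node (a sketch; the dpkg query assumes Ubuntu/Debian):

# Is libibcm visible to the dynamic loader?
ldconfig -p | grep ibcm

# Which package (if any) provides it?
dpkg -l | grep -i libibcm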

Second Setup

  • UCX 1.9
  • UCX-Py 0.18 (2021-01-19)
  • RAPIDS 0.18 nightlies (2021-01-19)
  • CUDA 11.0
  • Ubuntu 18.04
  • MLNX_OFED-5.1-2.5.8.0

The only change in this setup was the move to OpenUCX 1.9. With this setup, we were also able to run a few queries successfully. However, we again experienced intermittent failures. Failing queries included both large and small queries, suggesting that the failures were driven by something other than out-of-memory errors.

Third Setup (pending, may succeed -- will edit as appropriate)

  • UCX 1.9
  • UCX-Py 0.18 (2021-01-19)
  • RAPIDS 0.18 nightlies (2021-01-19)
  • CUDA 11.0
  • Ubuntu 20.04 Focal
  • MLNX_OFED-5.1-2.5.8.0

After additional discussions, we upgraded from Ubuntu 18.04 to Ubuntu 20.04. In this test, we also removed --with-cm from the UCX build process. We now consistently see compute start, followed shortly afterward by a hang.
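
When the hang occurs, one thing we can do is turn up UCX logging on the workers before launching them, to see where communication stalls -- a sketch using standard UCX environment variables:

# More verbose UCX logging for the next run
export UCX_LOG_LEVEL=debug                 # or "trace" for even more detail
export UCX_LOG_FILE=/tmp/ucx-%h-%p.log     # %h -> hostname, %p -> pid

# Print a backtrace on fatal UCX errors instead of failing silently
export UCX_HANDLE_ERRORS=bt

# ...then start the scheduler / dask-cuda workers as usual...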

@quasiben please feel free to edit/correct me if I've misstated anything.

@jakirkham
Member

It's worth noting that one needs to use a different set of patches on 1.9.0 as opposed to 1.8.0. They are more similar to the patches we would use on 1.10.0. More details in this discussion ( https://github.com/rapidsai/ucx-split-feedstock/pull/50#issuecomment-763067547 ). Mentioning in case this hasn't already come up elsewhere.

@quasiben
Member

Thanks @jakirkham, I think we have the right patch:

ADD https://raw.githubusercontent.com/rapidsai/ucx-split-feedstock/master/recipe/cuda-alloc-rcache.patch /tmp/ib_registration_cache.patch

@jakirkham
Member

The CUDA alloc patch, yes. The IB patch shouldn't be there, though (not sure if I'm understanding the quoted text correctly).

@quasiben
Member

It's a dirty remnant from the Dockerfile -- the patch is just being written to a file named /tmp/ib_reg...

@pentschev
Member

The patch quoted in #670 (comment) is the correct one for 1.9; only one patch is needed. IIRC, the old patches from 1.8 will not apply to 1.9.

@beckernick
Member Author

beckernick commented Jan 19, 2021

This is the UCX build we're using (for the third setup with UCX 1.9), extracted from a Dockerfile.

ADD https://raw.githubusercontent.com/rapidsai/ucx-split-feedstock/master/recipe/cuda-alloc-rcache.patch /tmp/cuda-alloc-rcache.patch
RUN git clone --recurse-submodules -b v1.9.x https://github.com/openucx/ucx /tmp/ucx \
 && cd /tmp/ucx \
 && source activate $CONDA_ENV \
 && patch -p1 < /tmp/cuda-alloc-rcache.patch \
 && ./autogen.sh \
 && ./configure \
    --prefix="${CONDA_PREFIX}" \
    --with-sysroot \
    --enable-cma \
    --enable-mt \
    --enable-numa \
    --with-gnu-ld \
    --with-rdmacm \
    --with-verbs \
    --with-cuda="${CUDA_HOME}" \
 && make -j${PARALLEL_LEVEL} \
 && make install

Note that we removed --with-cm and now are seeing a hang (the issue summary has been updated).
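
For what it's worth, a quick way to sanity-check what the resulting build actually enabled (assuming the ucx_info from this build is first on the PATH inside the conda env):

# Show the UCX version and the configure flags it was built with
ucx_info -v

# List the transports/devices UCX can see on this node
# (we expect rc/ud/dc on mlx5_* plus cuda_copy/cuda_ipc)
ucx_info -d | grep -E 'Transport|Device'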

@jakirkham
Member

Ok thanks for clarifying Nick! Yeah that looks right.

@jakirkham
Member

jakirkham commented Jan 20, 2021

For context, ibcm was removed in rdma-core 17.0 (see commit linux-rdma/rdma-core@357d23c for details). So if Ubuntu is using a newer version of rdma-core, dropping --with-cm is the right choice. This is certainly true on CentOS 7, though CentOS 6 needed that flag and library.
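
A quick way to confirm which rdma-core a node carries, and therefore whether --with-cm even makes sense there (a sketch; the dpkg query assumes Ubuntu, the rpm query CentOS):

# Ubuntu / Debian
dpkg -s rdma-core | grep -i '^version'

# CentOS / RHEL
rpm -q rdma-core

# If the version is >= 17, libibcm no longer exists and --with-cm should be dropped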

@jakirkham
Member

My only suggestion at this point would be to double check that we are linking against the libraries in the OFED version we expect as opposed to a different version (or non-OFED libraries). This is one place where we have seen issues crop up before (though not the only place).
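
Something along these lines can show where the verbs/rdmacm symbols are actually being resolved from (a sketch; exact library names and paths depend on how UCX was built and where MLNX_OFED installs its libraries):

# Which libibverbs/librdmacm do the UCX transport libraries resolve to?
ldd "${CONDA_PREFIX}"/lib/libuct*.so* 2>/dev/null | grep -E 'ibverbs|rdmacm|mlx5'

# The resolved paths should point at the MLNX_OFED-provided libraries,
# not at a stray conda or distro copy of a different version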

@quasiben
Member

When using the second setup listed above, query 3 consistently fails with UCX issues on the latest dask/distributed (note: the latest dask/distributed resolved the issues around BlockwiseIO):

[luna-0337:3023299:1:3023434] ib_mlx5_log.c:143  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[luna-0337:3023299:1:3023434] ib_mlx5_log.c:143  RC QP 0x15ac5 wqe[300]: SEND --e [inl len 26]
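
"Transport retry count exceeded" usually points at the port/fabric level rather than at UCX itself, so the first checks are typically the port state and error counters -- a sketch using tools that ship with MLNX_OFED / infiniband-diags:

# Port state, rate, and LID for each HCA on the node
ibstat

# Per-port error counters across the fabric (look for link errors and retries)
ibqueryerrors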

@pentschev
Member

Any chance we could run the same query using RAPIDS 0.17? The error from #670 (comment) is consistent with what I remember from the many issues we had with IB last year, and I hadn't seen it since.

@quasiben
Member

quasiben commented Jan 21, 2021

With both UCX 1.8 / 1.9 and RAPIDS 0.17 (and probably the other configurations above), we are seeing the following error in the scheduler:

[luna-0526:947971:0:949164] sockcm_iface.c:257 Fatal: sockcm_listener: unable to create handler for new connection

We think this is due to the scheduler opening too many file descriptors, and possibly to some ulimit settings. We are still investigating...
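
A quick way to check whether the scheduler really is hitting its file-descriptor ceiling (a sketch; SCHED_PID below is a placeholder for the scheduler's process ID):

# Soft/hard fd limits for the shell that launches the scheduler
ulimit -Sn
ulimit -Hn

# Open fds of the running scheduler vs. its limit (SCHED_PID is hypothetical)
ls /proc/${SCHED_PID}/fd | wc -l
grep 'Max open files' /proc/${SCHED_PID}/limits

# Raise the soft limit before launching the scheduler
ulimit -n 65536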

@pentschev
Member

> We think this is due to the scheduler opening too many file descriptors

Correction: We know this is due to the scheduler opening too many file descriptors.

@quasiben
Member

Update:

We've been testing UCX 1.10 with RDMACM, but we are running into other issues (also observed with ucp_client_server):

Transport retry count exceeded on mlx5_1:1/IB

We are working with UCX devs to diagnose and resolve.

@pentschev
Member

@beckernick it would be good to have GPU-BDB tested again with UCX 1.11; we believe the issues here have been resolved.
