Summary of recent UCX TPCxBB tests and intermittent failures #670

Open
beckernick opened this issue Jan 19, 2021 · 15 comments

@beckernick
Member

beckernick commented Jan 19, 2021

We've been doing some UCX TPCxBB tests on a Slurm cluster. Across multiple configurations, we've run into intermittent and as-yet-unexplained failures using UCX and InfiniBand. We have been using UCX 1.8 and 1.9 rather than UCX 1.10 due to the issues already discussed (see #668 and associated issues/PRs). This issue summarizes several of the configurations we've recently tested and the failures we've seen with them.

The setup includes a manual QA check of the mappings between GPUs, MLNX NICs, and NIC interfaces. The specific failures are being triaged and may be split out into their own, more detailed issues, which can be cross-linked here for tracking.
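
For reference, this is roughly the kind of manual check we mean -- a minimal sketch, assuming the MLNX_OFED tools (ibdev2netdev) and nvidia-smi are available on each node:

# Show PCIe/NUMA affinity between GPUs, NICs, and CPUs on the node
nvidia-smi topo -m

# Map Mellanox IB devices (mlx5_*) to their network interfaces and link state
ibdev2netdev -v

# Confirm the interfaces the workers should bind to are actually up
ip -br addr show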

Initial Setup

  • UCX 1.8.1
  • UCX-Py 0.18 (2021-01-19)
  • RAPIDS 0.18 nightlies (2021-01-19)
  • CUDA 11.0
  • Ubuntu 18.04
  • MLNX_OFED-5.1-2.5.8.0

With this setup, we were able to run a few queries successfully. However, we experienced intermittent segfaults that were not consistently reproducible.

We also saw the following warning related to libibcm. We are still triaging it, but it may resolve itself with Ubuntu 20.04. Others (including @pentschev ) have suggested that we may simply no longer need libibcm.

> libibcm: kernel ABI version 0 doesn't match library version 5.
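
For anyone triaging the same warning, a quick way to check whether libibcm is even present on a node (a sketch; the dpkg query assumes Ubuntu/Debian):

# Is libibcm visible to the dynamic loader?
ldconfig -p | grep ibcm

# Which package (if any) provides it?
dpkg -l | grep -i libibcm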

Second Setup

  • UCX 1.9
  • UCX-Py 0.18 (2021-01-19)
  • RAPIDS 0.18 nightlies (2021-01-19)
  • CUDA 11.0
  • Ubuntu 18.04
  • MLNX_OFED-5.1-2.5.8.0

The only change in this setup was the move to OpenUCX 1.9. With this setup, we were also able to run a few queries successfully. However, we again experienced intermittent failures. Failing queries included both large and small queries, suggesting that the failures were driven by something other than out-of-memory errors.

Third Setup (pending, may succeed -- will edit as appropriate)

  • UCX 1.9
  • UCX-Py 0.18 (2021-01-19)
  • RAPIDS 0.18 nightlies (2021-01-19)
  • CUDA 11.0
  • Ubuntu 20.04 Focal
  • MLNX_OFED-5.1-2.5.8.0

After additional discussions, we upgraded from Ubuntu 18.04 to Ubuntu 20.04. In this test, we also removed --with-cm from the UCX build process. We now consistently see compute start, followed shortly afterward by a hang.
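
When the hang occurs, one thing we can do is turn up UCX logging on the workers before launching them, to see where communication stalls -- a sketch using standard UCX environment variables:

# More verbose UCX logging for the next run
export UCX_LOG_LEVEL=debug                 # or "trace" for even more detail
export UCX_LOG_FILE=/tmp/ucx-%h-%p.log     # %h -> hostname, %p -> pid

# Print a backtrace on fatal UCX errors instead of failing silently
export UCX_HANDLE_ERRORS=bt

# ...then start the scheduler / dask-cuda workers as usual...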

@quasiben please feel free to edit/correct me if I've misstated anything.

@jakirkham
Member

It's worth noting that one needs to use a different set of patches on 1.9.0 as opposed to 1.8.0. They are more similar to the patches we would use on 1.10.0. More details in this discussion ( https://github.com/rapidsai/ucx-split-feedstock/pull/50#issuecomment-763067547 ). Mentioning in case this hasn't already come up elsewhere.

@quasiben
Member

Thanks @jakirkham, I think we have the right patch:

ADD https://raw.githubusercontent.com/rapidsai/ucx-split-feedstock/master/recipe/cuda-alloc-rcache.patch /tmp/ib_registration_cache.patch

@jakirkham
Member

The CUDA alloc patch, yes. The IB patch shouldn't be there, though (not sure if I'm understanding the quoted text correctly).

@quasiben
Member

It's a dirty remnant from the Dockerfile -- the patch is just being written to a file named /tmp/ib_reg...

@pentschev
Member

The patch quoted in #670 (comment) is the correct one for 1.9; only one patch is needed. IIRC, the old patches from 1.8 will not apply to 1.9.

@beckernick
Member Author

beckernick commented Jan 19, 2021

This is the UCX build we're using (for the third setup with UCX 1.9), extracted from a Dockerfile.

ADD https://raw.githubusercontent.com/rapidsai/ucx-split-feedstock/master/recipe/cuda-alloc-rcache.patch /tmp/cuda-alloc-rcache.patch
RUN git clone --recurse-submodules -b v1.9.x https://github.com/openucx/ucx /tmp/ucx \
 && cd /tmp/ucx \
 && source activate $CONDA_ENV \
 && patch -p1 < /tmp/cuda-alloc-rcache.patch \
 && ./autogen.sh \
 && ./configure \
    --prefix="${CONDA_PREFIX}" \
    --with-sysroot \
    --enable-cma \
    --enable-mt \
    --enable-numa \
    --with-gnu-ld \
    --with-rdmacm \
    --with-verbs \
    --with-cuda="${CUDA_HOME}" \
 && make -j${PARALLEL_LEVEL} \
 && make install

Note that we removed --with-cm and now are seeing a hang (the issue summary has been updated).
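
For what it's worth, a quick way to sanity-check what the resulting build actually enabled (assuming the ucx_info from this build is first on the PATH inside the conda env):

# Show the UCX version and the configure flags it was built with
ucx_info -v

# List the transports/devices UCX can see on this node
# (we expect rc/ud/dc on mlx5_* plus cuda_copy/cuda_ipc)
ucx_info -d | grep -E 'Transport|Device'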

@jakirkham
Member

Ok thanks for clarifying Nick! Yeah that looks right.

@jakirkham
Member

jakirkham commented Jan 20, 2021

For context, ibcm was removed in rdma-core 17.0 (see commit linux-rdma/rdma-core@357d23c for details). So if Ubuntu is using a newer version of rdma-core, dropping --with-cm is the right choice. This is certainly true on CentOS 7, though CentOS 6 needed that flag and library.
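
A quick way to confirm which rdma-core a node carries, and therefore whether --with-cm even makes sense there (a sketch; the dpkg query assumes Ubuntu, the rpm query CentOS):

# Ubuntu / Debian
dpkg -s rdma-core | grep -i '^version'

# CentOS / RHEL
rpm -q rdma-core

# If the version is >= 17, libibcm no longer exists and --with-cm should be dropped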

@jakirkham
Member

My only suggestion at this point would be to double check that we are linking against the libraries in the OFED version we expect as opposed to a different version (or non-OFED libraries). This is one place where we have seen issues crop up before (though not the only place).
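
Something along these lines can show where the verbs/rdmacm symbols are actually being resolved from (a sketch; exact library names and paths depend on how UCX was built and where MLNX_OFED installs its libraries):

# Which libibverbs/librdmacm do the UCX transport libraries resolve to?
ldd "${CONDA_PREFIX}"/lib/libuct*.so* 2>/dev/null | grep -E 'ibverbs|rdmacm|mlx5'

# The resolved paths should point at the MLNX_OFED-provided libraries,
# not at a stray conda or distro copy of a different version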

@quasiben
Member

When using the second setup listed above, query 3 consistently fails with UCX issues on the latest dask/distributed (note: the latest dask/distributed resolved the issues around BlockwiseIO):

[luna-0337:3023299:1:3023434] ib_mlx5_log.c:143  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[luna-0337:3023299:1:3023434] ib_mlx5_log.c:143  RC QP 0x15ac5 wqe[300]: SEND --e [inl len 26]
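
"Transport retry count exceeded" usually points at the port/fabric level rather than at UCX itself, so the first checks are typically the port state and error counters -- a sketch using tools that ship with MLNX_OFED / infiniband-diags:

# Port state, rate, and LID for each HCA on the node
ibstat

# Per-port error counters across the fabric (look for link errors and retries)
ibqueryerrors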

@pentschev
Member

Any chance we could run the same query using RAPIDS 0.17? The error from #670 (comment) is consistent with what I remember from the many issues we had with IB last year, and I hadn't seen it since.

@quasiben
Member

quasiben commented Jan 21, 2021

With both UCX 1.8 / 1.9 and RAPIDS 0.17 (and probably the other configurations above), we are seeing the following error in the scheduler:

[luna-0526:947971:0:949164] sockcm_iface.c:257 Fatal: sockcm_listener: unable to create handler for new connection

We think this is due to the scheduler opening too many file descriptors, and possibly to some ulimit settings. We are still investigating...
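
A quick way to check whether the scheduler really is hitting its file-descriptor ceiling (a sketch; SCHED_PID below is a placeholder for the scheduler's process ID):

# Soft/hard fd limits for the shell that launches the scheduler
ulimit -Sn
ulimit -Hn

# Open fds of the running scheduler vs. its limit (SCHED_PID is hypothetical)
ls /proc/${SCHED_PID}/fd | wc -l
grep 'Max open files' /proc/${SCHED_PID}/limits

# Raise the soft limit before launching the scheduler
ulimit -n 65536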

@pentschev
Member

> We think this is due to the scheduler opening too many file descriptors

Correction: We know this is due to the scheduler opening too many file descriptors.

@quasiben
Member

Update:

We've been testing UCX 1.10 with RDMACM, but we are running into other issues (also observed with ucp_client_server):

Transport retry count exceeded on mlx5_1:1/IB

We are working with UCX devs to diagnose and resolve.

@pentschev
Member

@beckernick it would be good to have GPU-BDB tested again with UCX 1.11; we believe the issues here have been resolved.
