Bluefog Docker Test Notes

Bicheng Ying edited this page Jul 14, 2021 · 24 revisions

Run for testing under Docker Container (GPU)

WARNING: The container must be run in privileged mode (--privileged); otherwise none of the win_ops will execute correctly.

For easier testing in the Docker environment, it is better to mount the host directory into the container. To build the test Docker image (you may only need to do this the first time):

$ sudo docker build -t bluefog_gpu:devel . -f dockerfile.gpu.test

Run the following command from the root folder of the bluefog repository to mount it into the container:

$ sudo docker run --privileged -it --gpus all --name bluefog_gpu_devtest \
   --network=host --shm-size=64g -v /mnt/share/ssh:/root/.ssh \
   --mount type=bind,source="$(pwd)",target=/bluefog bluefog_gpu:devel

Remember to remove the devtest container once you are done with it (a later docker run with the same --name fails while the old container still exists):

$ sudo docker container rm bluefog_gpu_devtest
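
The cleanup step can be wrapped in a small helper so that the same call works for both the GPU and CPU containers. This is only a sketch: `remove_container` is a hypothetical name and the `DOCKER` variable is an illustrative override (it is not a Docker or Bluefog setting); by default the helper runs the same `sudo docker` used elsewhere on this page.

```shell
# remove_container NAME: force-remove the named container.
# DOCKER is a hypothetical override for dry runs, e.g. DOCKER=echo.
remove_container() {
    # -f also stops the container first if it is still running
    ${DOCKER:-sudo docker} container rm -f "$1"
}

# Usage:
#   remove_container bluefog_gpu_devtest
#   DOCKER=echo remove_container bluefog_gpu_devtest   # only prints the arguments
```

Unlike the plain `docker container rm` above, `-f` removes a container that is still running, so use it only when nothing inside the container needs to be kept.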

Nvidia Container Runtime

The following error may pop up when running a docker container with GPUs.

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

To run Docker with GPUs, the Nvidia container runtime must be installed; on Ubuntu, use the following commands. The GPU driver is also required on the host.

$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
    sudo apt-key add -
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
$ sudo apt-get update
$ sudo apt-get install nvidia-container-runtime
$ sudo service docker restart

More details can be found at https://github.com/NVIDIA/nvidia-container-runtime and https://nvidia.github.io/nvidia-container-runtime.
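
Before retrying the docker run command, it can help to confirm that the pieces mentioned above are actually on the host. A minimal sketch: `missing_tools` is a hypothetical helper, and the check assumes that the driver provides `nvidia-smi` and that the runtime package installs a `nvidia-container-runtime` binary on PATH.

```shell
# missing_tools TOOL...: print every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: empty output means all three commands were found.
#   missing_tools docker nvidia-smi nvidia-container-runtime
```

An empty result only shows that the commands exist; it does not prove that the driver and runtime versions work together.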

Run for testing under Docker Container (CPU)

The CPU version takes a similar approach to the GPU version:

$ sudo docker build -t bluefog_cpu:devel . -f dockerfile.cpu.test

Run the following command from the root folder of the bluefog repository to mount it into the container:

$ sudo docker run --privileged -it --mount type=bind,source="$(pwd)",target=/bluefog \
   --network=host -v /mnt/share/ssh:/root/.ssh \
   --name bluefog_cpu_devtest bluefog_cpu:devel

Remember to remove the devtest container once you are done with it (a later docker run with the same --name fails while the old container still exists):

$ sudo docker container rm bluefog_cpu_devtest

Run on Multiple Machines

Assuming machine1 is the master and machine2 is the slave (the reverse works as well), first check that the ssh config in the shared directory, which will later be mounted into the container, is correct:

root@machine2$ sudo cat /mnt/share/ssh/config
Host machine1
    HostName xxx.xxx.xxx.xxx
    User root
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
    PubkeyAuthentication yes
root@machine1$ sudo cat /mnt/share/ssh/config
Host machine2
    HostName xxx.xxx.xxx.xxx
    User root
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
    PubkeyAuthentication yes
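
If an entry is missing, a stanza in the same shape as the ones above can be generated rather than typed by hand. This is a sketch only: `ssh_config_entry` is a hypothetical helper, and the address in the usage line is a placeholder from the documentation range, not a real machine.

```shell
# ssh_config_entry ALIAS ADDRESS: print one ssh config stanza
# with the same fields as the config files shown above.
ssh_config_entry() {
    printf 'Host %s\n' "$1"
    printf '    HostName %s\n' "$2"
    printf '    User root\n'
    printf '    IdentityFile ~/.ssh/id_rsa\n'
    printf '    IdentitiesOnly yes\n'
    printf '    PubkeyAuthentication yes\n'
}

# Usage (placeholder address; run on machine1 to add an entry for machine2):
#   ssh_config_entry machine2 192.0.2.10 | sudo tee -a /mnt/share/ssh/config
```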

On machine2 (slave), run the container and start the ssh service on port 40000:

$ sudo docker run --privileged -it --mount type=bind,source="$(pwd)",target=/bluefog \
   --network=host -v /mnt/share/ssh:/root/.ssh \
   --name bluefog_gpu_devtest bluefog_gpu:devel
/examples# bash -c "/usr/sbin/sshd -p 40000; sleep infinity;"

The location of the ssh folder may differ between machines. Another possible choice is:

$ sudo docker run --privileged -it --mount type=bind,source="$(pwd)",target=/bluefog \
   --network=host -v ~/.ssh:/root/.ssh \
   --name bluefog_gpu_devtest bluefog_gpu:devel
/examples# bash -c "/usr/sbin/sshd -p 40000; sleep infinity;"

A note on using the host network: doing so removes the network isolation between the container and the host. As a result, an ssh service opened on port 40000 inside the container should be reachable by ssh'ing directly to the host on port 40000.
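
Whether port 40000 is really reachable can be checked from the other machine before going further. A minimal sketch, assuming bash and the coreutils `timeout` command are installed; `port_open` is a hypothetical helper name and the 3-second limit is an arbitrary choice.

```shell
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT
# can be opened within 3 seconds. Uses bash's /dev/tcp redirection,
# so bash is invoked explicitly.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage (pass the address from the HostName line; ssh config aliases
# such as machine2 are only understood by ssh itself):
#   port_open xxx.xxx.xxx.xxx 40000 && echo "sshd port reachable"
#   ssh -p 40000 machine2 hostname
```

The first command only tests that something is listening on the port; the second confirms that key-based login through the mounted ssh config works.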

On machine1 (master):

$ sudo docker run --privileged -it --mount type=bind,source="$(pwd)",target=/bluefog \
   --network=host -v /mnt/share/ssh:/root/.ssh \
   --name bluefog_gpu_devtest bluefog_gpu:devel