
Overview

At the top level, the structure of this codebase looks like:

├── bluefog
├── docs
├── examples
├── setup.py
├── test
└── ...

The main project code is under the bluefog folder; the examples folder contains usage code for users (serving as end-to-end tests as well); and the test folder, of course, holds the tests. Overall, our repository shares a very similar organization with the horovod repository. FYI, in the future, we will put all documentation into a website here.

Code Architecture

The main project code is located under bluefog, which is (for now) composed of two folders: common and torch. We plan to support a simple API for at least the PyTorch and TensorFlow frameworks, so the torch and (planned) tensorflow folders implement the adaptors for tensor operations under PyTorch and TensorFlow respectively. The common folder includes the shared functionality, such as MPI communication, the background processing thread, etc., because those operations are the same for any framework.
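To make the layout concrete, the sub-tree described above is:

bluefog
├── common    # framework-agnostic code: MPI communication, background thread, etc.
└── torch     # PyTorch adaptor for tensor operations (a tensorflow folder is planned)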

The core code consists of three parts of logic:

  1. Framework(pytorch/tensorflow) API
  2. Public MPI ops (such as AllReduce, WinPut, NeighborAllGather, etc.)
  3. Background Processing Part (background thread, status maintenance, communication scheduling, etc.)

Framework APIs are located in the __init__.py file. Among them, the most important function is the DistributedOptimizer wrapper around the original optimizer. Under the hood, a hook function is registered for each layer. Depending on the style of distributed optimizer (ring-allreduce, diffusion, consensus, etc.), different asynchronous communication/callback/update functions are implemented. A rough sketch of the wrapper idea follows.
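The sketch below shows only the hook-registration pattern, not the real implementation; bf.allreduce_nonblocking and bf.synchronize are hypothetical stand-ins for the actual non-blocking ops.

# A minimal sketch of the per-layer hook idea, not the real implementation.
import torch
import bluefog.torch as bf  # assumed import path

class DistributedOptimizerSketch:
    def __init__(self, base_optimizer):
        self._base = base_optimizer
        self._handles = {}
        # One hook per trainable parameter: it fires as soon as that
        # layer's gradient is computed, so communication overlaps with
        # the rest of the backward pass.
        for group in self._base.param_groups:
            for p in group["params"]:
                if p.requires_grad:
                    p.register_hook(self._make_hook(p))

    def _make_hook(self, p):
        def hook(grad):
            # Enqueue the communication request; return immediately.
            self._handles[p] = bf.allreduce_nonblocking(grad)  # hypothetical name
        return hook

    def step(self):
        # Wait for every outstanding communication, replace each gradient
        # with the averaged result, then run the wrapped optimizer's update.
        for p, handle in self._handles.items():
            p.grad.data.copy_(bf.synchronize(handle))  # hypothetical name
        self._handles.clear()
        self._base.step()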

Public MPI ops can be categorized into four types (a usage sketch follows the list):

  1. Basic Ops -- (init, shutdown, size, local_size, rank, local_rank)
  2. Collective Ops -- (allreduce, allgather, broadcast) x (sync, async, in-place) + (poll, synchronize)
  3. Topology and Neighbor Ops -- (load_topology, set_topology, neighbor_allgather, neighbor_allreduce)
  4. One-Sided Communication Ops -- (win_create, win_free, win_get, win_put, win_accumulate, win_sync, win_fence)
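Putting the four categories together, a hedged usage sketch might look like the following; op names are taken from the list above, but the exact signatures are assumptions:

# Illustrative only; the real API's signatures may differ.
import torch
import bluefog.torch as bf  # assumed import path

bf.init()                                  # 1. basics
print("rank", bf.rank(), "of", bf.size())

x = torch.ones(3)
avg = bf.allreduce(x)                      # 2. collective: average over all ranks

topo = bf.load_topology()                  # 3. inspect the current graph topology
neighbor_avg = bf.neighbor_allreduce(x)    #    average with graph neighbors only

bf.win_create(x, name="x")                 # 4. one-sided: expose x to neighbors
bf.win_put(x, name="x")                    #    write into neighbors' windows
x_synced = bf.win_sync(name="x")           #    fold received updates into x
bf.win_free(name="x")

bf.shutdown()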

Most communication ops take tensors and return the averaged tensors across all processes or across a process's neighbors. To keep these functions non-blocking, their implementations do not call the MPI functions directly but instead push the communication request onto a queue. The real communication is executed by the "Background Processing Part". An asynchronous callback pattern lets those communication ops retrieve the results.
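From the caller's side, the pattern looks roughly like this; bf.allreduce_nonblocking is a hypothetical name for the asynchronous variant, while poll and synchronize come from the op list above, with assumed signatures:

# A sketch of the handle pattern; signatures are assumptions.
import torch
import bluefog.torch as bf  # assumed import path

handle = bf.allreduce_nonblocking(torch.ones(3))  # enqueue and return at once

# ... overlap communication with other computation here ...

if not bf.poll(handle):           # non-blocking check: is the result ready?
    pass                          # keep computing if you like
result = bf.synchronize(handle)   # blocks until the background thread's
                                  # callback has delivered the averaged tensor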

Lastly, the Background Processing Part is the core of this project. There are lots of niceties behind the implementation, but the main logic is simple and clear. First, it sets up some global state, which contains the thread(s) for communication. Similar to a thread-pool strategy, the background thread keeps pulling communication requests off the queue (pushed there by calls to the MPI ops) and executes the communication. (Currently, we only have one thread doing that job. It is possible to create multiple threads, but that may not be necessary and could cause more trouble than benefit.) After the communication is done, it executes the callback function to inform the MPI ops.
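A simplified Python rendering of that producer/consumer loop is below; the real implementation is C++ under the common folder, and the Request fields here are illustrative:

# Simplified sketch of the background loop's logic, not the real C++ code.
import collections
import queue
import threading

Request = collections.namedtuple("Request", ["mpi_fn", "args", "callback"])
request_queue = queue.Queue()    # MPI ops push Requests here

def background_loop():
    while True:
        req = request_queue.get()        # blocks until an op enqueues a request
        if req is None:                  # shutdown sentinel
            break
        result = req.mpi_fn(*req.args)   # the actual MPI call runs here,
                                         # on this single background thread
        req.callback(result)             # inform the waiting MPI op

threading.Thread(target=background_loop, daemon=True).start()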

Build C++ Files

All the interface code will be Python. However, several functions, such as low-level hardware interaction, MPI control, etc., require a C++ interface. Therefore, part of our code base is C++. In order to communicate between C++ and Python, we need to build a shared library first (a .so file on Linux or a .dll file on Windows). If everything works correctly on your system, the only thing you need to do is type

python setup.py build_ext

It will generate the object files under the build folder. Currently, we have a simple hello_world.cc and c_hello_world.py as an example. Alternatively, you can generate the shared library in place by running

python setup.py build_ext -i
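For orientation, a minimal Extension declaration in setup.py looks roughly like the following; the module path and source location below are illustrative guesses rather than the real configuration:

# Illustrative only; the real setup.py declares more sources and flags.
from setuptools import setup, Extension

hello_world = Extension(
    "bluefog.common.hello_world",               # hypothetical module path
    sources=["bluefog/common/hello_world.cc"],  # assumed location of the example file
    language="c++",
)

setup(name="bluefog", ext_modules=[hello_world])

Running python setup.py build_ext then compiles each declared Extension into the shared library that Python loads.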