Increasing throughput of a Convolutional Neural Network (CNN), executed on a multi-processor edge (mobile or embedded) platform.

SvetlanaMinakova/high_throughput_cnns


Code for the scientific paper "Combining Task- and Data-level Parallelism for High-Throughput CNN Inference on Embedded CPUs-GPUs MPSoCs"

Authors: Svetlana Minakova, Erqian Tang, and Todor Stefanov

Paper published in Proc. 20th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS'20), pp. 18-35, Pythagoreio, Samos Island, Greece, July 05-09, 2020.

Abstract

The code is aimed at increasing the throughput of a Convolutional Neural Network (CNN) executed on a multi-processor edge (mobile or embedded) platform. It implements a methodology, based on the publication above, which exploits two types of parallelism available in a CNN: data-level parallelism and task-level parallelism. These are used to efficiently distribute (map) the computations within the CNN to the computational resources of an edge (mobile or embedded) platform. The CNN distribution (mapping) is considered efficient if it ensures high CNN throughput. To find an efficient CNN mapping, our methodology performs an automated Design Space Exploration (DSE), based either on a Genetic Algorithm (GA), as proposed in the original paper, or on a Greedy Algorithm (an extension available only in the tool).

The combined exploitation of task-level (pipeline) parallelism and data-level parallelism is the main novel feature of our methodology. This feature distinguishes it from existing DL frameworks and other methodologies that utilize only task-level (pipeline) parallelism or only data-level parallelism to ensure high CNN throughput. Thanks to the combined use of task- and data-level parallelism, our methodology takes full advantage of all computational resources available on the edge platform and ensures very high CNN throughput.

To evaluate the methodology, we perform experiments in which we apply it to real-world CNNs from the Open Neural Network Exchange (ONNX) model zoo and execute the CNNs on the NVIDIA Jetson TX2 edge platform. We compare the throughput of the CNNs mapped onto the Jetson TX2 platform using our methodology against the best-known and state-of-the-art TensorRT DL framework for the Jetson TX2 edge platform. The experimental results show that ~20% higher throughput is achieved when the CNNs are mapped using our methodology.

Note that, while the efficient mapping search offered by the methodology is applicable to a wide range of CNNs and platforms, the (real) code generation provided by this tool only supports the ARM-CL (ARM CPU) and TensorRT (GPU) DL frameworks. Moreover, the generated code is only meant to be executed on NVIDIA Jetson platforms.

Requirements

  • python 3.6+
  • onnx 1.8

[optional] for Keras CNN models

  • keras with tensorflow backend
  • onnx2keras

Inputs and outputs

Examples of the tool inputs and outputs are located in the ./input_examples folder.

Inputs

  • a DNN in one of the supported formats (see below)
  • a target platform architecture in json format. For an example see ./input_examples/architecture/jetson.json
  • step-specific intermediate files in .json format, generated by the tool. For examples see ./input_examples/intermediate folder

Supported input DNN formats

We support several input DNN formats/extensions:

  • ONNX format. For an example see ./input_examples/dnn/mnist.onnx
  • H5 format, provided by Keras DL framework
  • JSON format with a specific structure. For an example see ./input_examples/dnn/mnist.json.

The JSON format represents the internal (analytical) DNN model. The internal DNN model is a lightweight and generic DNN model, highly suitable for the analysis performed by the tool. When you pass an ONNX or .h5 DNN model to the tool, the tool always converts it into the internal DNN model for processing. However, the conversion strongly depends on third-party tools (such as the onnx package for processing .onnx DNN models) and does not always go smoothly. Therefore, we also allow the use of manually created analytical CNN models. The analytical CNN model can also be obtained automatically from the .onnx or .h5 format using the dnn_to_json.py script.

Example use:

python ./dnn_to_json.py --cnn ./input_examples/dnn/mnist.onnx -o ./output/mnist/

Example output: see ./output/mnist/cnn.json or ./input_examples/dnn/mnist.json

Outputs

A final CNN inference model (in .json format), which describes how a CNN is partitioned, mapped and scheduled for efficient CNN execution on the target edge platform. For an example see ./input_examples/intermediate/mnist/app.json

Tests

The tool has a set of tests, located in the ./tests folder. To run the whole toolflow for the example mnist DNN, located in ./input_examples/dnn, and the example edge platform described in ./input_examples/architecture/jetson.json, run:

python tests/t_integration.py

To learn how to run individual steps of the toolflow (explained below), run:

python tests/t_unit.py --help

The settings (input CNN, input platform, paths to intermediate files, etc.) for the tests can be found and manually changed in ./tests/test_config.

Toolflow

The tool consists of several steps, executed one after another. The steps are implemented as API scripts, located in the root directory of the project. In short, the steps are:

  1. Generation of an SDF (task graph) from an input CNN (dnn_to_sdf_task_graph.py)
  2. Generation of a per-layer execution time (latency) evaluation template (sdf_latency_eval_template.py)
  3. Filling the generated per-layer execution time (latency) template with real latency estimations (this step is performed manually!)
  4. Generation of efficient mapping of the input CNN onto target edge platform architecture (generate_mapping.py)
  5. Generation of final (CSDF) application model (generate_final_app_model.py)
  6. Generation of executable code from the input DNN and the final application model (see generate_code*.py)

Below, we give a more detailed explanation of every step (script). If you are wondering why the tool uses so many steps and intermediate files, the answer is: to provide modularity, reusability, and control over each separate step. For example, if the conversion from an ONNX (CNN) model into a task graph (SDF) goes wrong for a particular CNN model, you can always create the CNN task graph manually. Analogously, if you'd like to map a CNN onto several platforms, the modular structure means you do not have to repeat Step 1 many times.

Step 1: Generation of an SDF (task graph) from an input CNN (dnn_to_sdf_task_graph.py)

This script generates a static dataflow (task graph) from the input CNN model. The task graph shows the tasks performed to execute the CNN inference, as well as the connections between these tasks. One task corresponds to one or more layers of the input CNN, and one task is always mapped onto one processor of the target edge platform.

Example use

python ./dnn_to_sdf_task_graph.py --cnn ./input_examples/dnn/mnist.json -o ./output/mnist

Example output: see ./output/mnist/task_graph.json or ./input_examples/intermediate/mnist/task_graph.json
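
As an illustration, such a task graph can be held in plain Python structures. The sketch below uses hypothetical task and layer names (it is not the tool's task_graph.json schema): each task fuses one or more CNN layers, edges carry data dependencies, and Kahn's topological sort yields a valid execution order, which any well-formed task graph must admit.

```python
from collections import defaultdict, deque

# Hypothetical task graph for a small CNN: each task fuses one or more
# layers; edges are the data dependencies between tasks.
tasks = ["conv1_relu1", "pool1", "conv2_relu2", "pool2", "fc"]
edges = [("conv1_relu1", "pool1"), ("pool1", "conv2_relu2"),
         ("conv2_relu2", "pool2"), ("pool2", "fc")]

def topological_order(tasks, edges):
    """Return tasks in dependency order (Kahn's algorithm)."""
    succ = defaultdict(list)
    indeg = {t: 0 for t in tasks}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    ready = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for d in succ[t]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return order

# For this chain-shaped graph, the order matches the layer order.
print(topological_order(tasks, edges))
```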

Step 2: Generation of a per-layer execution time (latency) evaluation template (sdf_latency_eval_template.py)

To explore the efficient mapping of a CNN onto a target edge platform, our methodology uses CNN execution time (latency) estimations. The estimations are stored in a CNN- and platform-dependent .json file. At this step, we generate such a file for an input CNN and an input target platform description. The template can be initialized with zeros (execution time of every layer on every processor = 0) or with the number of floating-point operations (FLOPs). NOTE: even though we provide FLOPs as a metric for execution time template initialization, we strongly recommend (manually) replacing the FLOPs with real measurements on the platform (see Step 3). Measurements on the platform provide a precise and platform-aware estimation of CNN latency, while FLOPs are a platform-agnostic and imprecise metric.

Example 1: empty template

Example use

python ./sdf_latency_eval_template.py -tg ./input_examples/intermediate/mnist/task_graph.json -p ./input_examples/architecture/jetson.json -o ./output/mnist/

Example output: see ./output/mnist/eval_template.json or ./input_examples/intermediate/mnist/eval_template.json

Example 2: template initialized with FLOPs

python ./sdf_latency_eval_template.py --cnn ./input_examples/dnn/mnist.json -tg ./input_examples/intermediate/mnist/task_graph.json -p ./input_examples/architecture/jetson.json --flops

Example output: see ./output/mnist/eval_template.json or ./input_examples/intermediate/mnist/eval_template.json
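
For reference, the FLOPs metric mentioned above can be sketched for a single convolutional layer with the standard estimate of two operations (one multiply, one add) per multiply-accumulate. The function name and layer sizes below are illustrative, not part of the tool:

```python
def conv2d_flops(c_in, c_out, kernel, h_out, w_out):
    """Standard FLOPs estimate for a 2D convolution:
    one multiply and one add per MAC, i.e. 2 * MACs."""
    macs = kernel * kernel * c_in * c_out * h_out * w_out
    return 2 * macs

# Example: 3x3 conv, 16 -> 32 channels, 28x28 output feature map.
print(conv2d_flops(16, 32, 3, 28, 28))  # 7225344 FLOPs
```

As the note above stresses, such a count ignores memory traffic, processor frequency, and framework overheads, which is exactly why real measurements (Step 3) are preferred.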

STEP 3: Filling the generated per-layer execution time (latency) template with real latency estimations

NOTE: this step is performed manually!

As explained in Step 2, to explore the efficient mapping of a CNN onto a target edge platform, our methodology uses CNN execution time (latency) estimations. Our tool does not provide the means to estimate CNN latency. You can do it by performing measurements on the platform or by using third-party tools. When you have obtained your measurements, please insert them into the .json template generated at Step 2.

Step 4: Generation of efficient mapping of the input CNN onto target edge platform architecture (generate_mapping.py)

At this step, we map a CNN, represented as a task graph (SDF), onto a target edge platform. To find an efficient CNN mapping, our methodology performs an automated Design Space Exploration (DSE), based either on a Genetic Algorithm (GA), as proposed in the original paper, or on a Greedy Algorithm (an extension available only in the tool). The GA typically delivers better results (in terms of CNN throughput increase). However, it may create cyclic dependencies in the final application graph, which have to be resolved with a specific schedule. The greedy algorithm, on the other hand, provides worse results, but works fast and does not create cyclic dependencies. The details of the GA and the greedy algorithm are given in the scripts ./DSE/mapping/ga.py (see the top-level class "ga") and ./DSE/mapping/greedy_mapping.py (see the top-level function "map_greedy").

Example use:

python ./generate_mapping.py --cnn ./input_examples/dnn/mnist.json -tg ./input_examples/intermediate/mnist/task_graph.json -p ./input_examples/architecture/jetson.json -o ./output/mnist/ -e ./input_examples/intermediate/mnist/eval_template.json -map-algo greedy

Example output: see ./output/mnist/mapping.json or ./input_examples/intermediate/mnist/mapping.json
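
To illustrate the idea behind greedy mapping, the sketch below is a generic load-balancing heuristic (not the tool's map_greedy function, and the task latencies are hypothetical): each task is assigned, largest first, to the currently least-loaded processor.

```python
def greedy_map(task_latencies, processors):
    """Assign each task to the processor with the lowest accumulated
    load so far (a simple list-scheduling heuristic)."""
    load = {p: 0.0 for p in processors}
    mapping = {}
    # Placing the largest tasks first tends to balance loads better.
    for task, lat in sorted(task_latencies.items(), key=lambda kv: -kv[1]):
        p = min(load, key=load.get)  # least-loaded processor
        mapping[task] = p
        load[p] += lat
    return mapping, load

# Hypothetical per-task latencies (ms) on an assumed two-processor platform.
lats = {"conv1": 4.0, "conv2": 6.0, "pool": 1.0, "fc": 2.0}
mapping, load = greedy_map(lats, ["gpu", "cpu0"])
print(mapping, load)
```

A GA explores many such assignments instead of committing to one greedy choice, which is why it can find better mappings, at the cost of possibly introducing cyclic dependencies.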

Step 5: Generation of final (CSDF) application model (generate_final_app_model.py)

At this step, we generate the final (CSDF) application model from the input CNN and the meta-files generated at the steps above. The final application model describes the CNN partitioning (splitting into sub-graphs), mapping (of sub-graphs onto processors of the target edge platform), and scheduling.

Example use

python ./generate_final_app_model.py --cnn ./input_examples/dnn/mnist.onnx -tg ./input_examples/intermediate/mnist/task_graph.json -p ./input_examples/architecture/jetson.json -m ./input_examples/intermediate/mnist/mapping.json -o ./output/mnist/

Example output: see ./output/mnist/app.json or ./input_examples/intermediate/mnist/app.json
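
What "partitioning" means here can be sketched in a few lines: given a mapping of tasks onto processors, each partition is the ordered group of tasks assigned to one processor. The names below are hypothetical, and this is not the app.json schema:

```python
def partitions_from_mapping(task_order, mapping):
    """Split a task graph into per-processor sub-graphs (partitions),
    preserving the execution order of the tasks."""
    partitions = {}
    for task in task_order:
        partitions.setdefault(mapping[task], []).append(task)
    return partitions

# Hypothetical mapping of four tasks onto two processors.
order = ["conv1", "conv2", "pool", "fc"]
mapping = {"conv1": "gpu", "conv2": "gpu", "pool": "cpu0", "fc": "cpu0"}
print(partitions_from_mapping(order, mapping))
# {'gpu': ['conv1', 'conv2'], 'cpu0': ['pool', 'fc']}
```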

Step 6: Generation of executable code from DNN and final (CSDF) application model (generate_code*.py)

At this step, we generate executable C++ code from the final (CSDF) application model. To generate the code, we use:

  • static (DNN-independent) C++ code, which is written manually and simply copied into the target folder every time. The static code is located in ./codegen/static_lib_files
  • one of the scripts (listed below and located in the root folder of the project) which generate DNN-dependent C++ code for every individual DNN:
    • generate_code_wrapper.py: generates wrapper code, which implements communication and synchronization primitives for the final app. model, but NO real CNN functionality. This code can be compiled and executed on any device supporting C++11 (or above) and pthreads.

    • generate_code_arm_cl.py: generates CPU-only code which implements input DNN using ARM-CL DL framework

    • generate_code_tensorrt.py: generates GPU-only code which implements input DNN using TensorRT DL framework

    • generate_code_mixed.py: generates CPU+GPU code which implements the final (CSDF) application generated at Step 5, using both the ARM-CL and TensorRT DL frameworks

      NOTE: the TensorRT, ARM-CL and mixed code have only been tested on the NVIDIA Jetson TX2 platform. We cannot guarantee that this code will run anywhere else. Also, this code still requires a significant amount of refactoring and finishing-up.

  • the code generation config (./codegen/codegen_config), which specifies code generation parameters such as the version of C++, paths to local libraries, etc.

Wrapper code (generate_code_wrapper.py)

As mentioned above, this code implements communication and synchronization primitives for the final app. model, but NO real CNN functionality. In other words, this code simulates the execution of the CSDF model representing the final DNN-based application, and shows what partitioning, mapping, communication and synchronization would look like for this application. This code can be compiled and executed on any device supporting C++11 (or above) and pthreads. The wrapper code can be untimed (every DNN partition takes 0 seconds to execute) or timed (every DNN partition is characterised with the execution time specified in the eval.json file generated at Step 2 and filled at Step 3). The static files for the wrapper code are located in ./codegen/static_lib_files/wrapper

Example use (untimed)

python ./generate_code_wrapper.py --cnn ./input_examples/dnn/mnist.json -a ./input_examples/intermediate/mnist/app.json -o ./output/mnist/

Example use (timed)

python ./generate_code_wrapper.py --cnn ./input_examples/dnn/mnist.json -a ./input_examples/intermediate/mnist/app.json -o ./output/mnist/ -e ./input_examples/intermediate/mnist/eval_template.json

Example output: see ./output/mnist/code/wrapper
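
The benefit the timed wrapper demonstrates can also be estimated analytically: in steady state, a pipeline delivers one frame per bottleneck-stage interval, so throughput is bounded by the slowest partition. The per-partition times below are hypothetical numbers, not measurements:

```python
def pipeline_throughput(stage_times_ms):
    """Steady-state throughput of a pipeline: one frame leaves the
    pipeline every max(stage time), regardless of pipeline depth."""
    bottleneck = max(stage_times_ms)
    return 1000.0 / bottleneck  # frames per second

# Hypothetical per-partition execution times (ms), as filled into eval.json.
times = [4.0, 6.0, 2.0]
sequential_fps = 1000.0 / sum(times)          # one processor: ~83 fps
pipelined_fps = pipeline_throughput(times)    # pipelined: ~167 fps
print(round(sequential_fps), round(pipelined_fps))
```

This is why balancing partition execution times across processors (Step 4) matters: the gain over sequential execution shrinks as the mapping becomes more unbalanced.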

TensorRT (GPU) code (generate_code_tensorrt.py)

Generates GPU-only code which implements input DNN using TensorRT DL framework. Example use:

python ./generate_code_tensorrt.py --cnn ./input_examples/dnn/mnist.json -o ./output/mnist/

Example output: see ./output/mnist/code/gpu

ARM-CL (CPU) code (generate_code_arm_cl.py)

Generates CPU-only code which implements the input DNN using the ARM-CL DL framework. Besides generating code which implements the (whole) input DNN, this script can generate a partitioned DNN, where the DNN is split into partitions and every partition corresponds to a node of the Task Graph (SDF) model generated at Step 1. The "partitioned" code is used for per-node Task Graph (SDF) latency measurements on the target edge platform.

Example use:

python ./generate_code_arm_cl.py --cnn ./input_examples/dnn/mnist.json -o ./output/mnist/

Example output: see ./output/mnist/code/cpu

Example use (partitioned):

python ./generate_code_arm_cl.py --cnn ./input_examples/dnn/mnist.json -o ./output/mnist/ -tg ./input_examples/intermediate/mnist/task_graph.json --partitioned

Example output: see ./output/mnist/code/cpu_partitioned

Mixed TensorRT (GPU) + ARM-CL (CPU) code (generate_code_mixed.py)

Generates CPU+GPU code which implements the final (CSDF) application generated at Step 5, using both the ARM-CL and TensorRT DL frameworks.

Example use:

python ./generate_code_mixed.py --cnn ./input_examples/dnn/mnist.json -a ./input_examples/intermediate/mnist/app.json -o ./output/mnist/

Example output: see ./output/mnist/code/mixed
