diff --git a/docs/conceptual/includes/high-level-design.rst b/docs/conceptual/includes/high-level-design.rst index 38145c279..80eeedcec 100644 --- a/docs/conceptual/includes/high-level-design.rst +++ b/docs/conceptual/includes/high-level-design.rst @@ -8,23 +8,25 @@ following diagram. Core Omniperf profiler Acquires raw performance counters via application replay using ``rocprof``. - Counters are stored in a comma-separated-values format for further analysis. - It runs a set of accelerator-specific micro benchmarks to acquire - hierarchical roofline data. The roofline model is not available on - accelerators pre-MI200. + Counters are stored in a comma-separated-values format for further + :ref:`analysis `. It runs a set of accelerator-specific micro + benchmarks to acquire hierarchical roofline data. The roofline model is not + available on accelerators pre-MI200. -Grafana analyzer for Omniperf +Grafana server for Omniperf * **Grafana database import**: All raw performance counters are imported into - the a backend MongoDB database to support analysis and visualization in the - Grafana GUI. Compatibility with previously generated data using older - Omniperf versions is not guaranteed. - * **Grafana dashboard GUI**: The Grafana dashboard retrieves the raw counters - information from the backend database. It displays the relevant performance - metrics and visualization. + a :ref:`backend MongoDB database ` to support + analysis and visualization in the Grafana GUI. Compatibility with + previously generated data using older Omniperf versions is not guaranteed. + * **Grafana analysis dashboard GUI**: The + :ref:`Grafana dashboard ` retrieves the raw counter + information from the backend database. It displays the relevant + performance metrics and visualization. -Omniperf standalone GUI analyzer - Omniperf provides a standalone GUI to enable basic performance analysis - without the need to import data into a database instance. 
+Omniperf standalone analysis GUI + Omniperf provides a :ref:`standalone GUI ` to enable + basic performance analysis without the need to import data into a database + instance. .. figure:: ./data/omniperf_server_vs_client_install.png :align: center diff --git a/docs/data/standalone_gui.png b/docs/data/analyze/standalone_gui.png similarity index 100% rename from docs/data/standalone_gui.png rename to docs/data/analyze/standalone_gui.png diff --git a/docs/how-to/analyze-mode.rst b/docs/how-to/analyze-mode.rst index f825c8aa6..2a9315d1b 100644 --- a/docs/how-to/analyze-mode.rst +++ b/docs/how-to/analyze-mode.rst @@ -25,16 +25,18 @@ Learn about Omniperf's other modes in :ref:`modes`. performance analysis. Unless otherwise noted, the performance analysis is done on the :ref:`MI200 platform `. +.. _cli-analysis: + CLI analysis ============ Features -------- -* __Derived metrics__: All of Omniperf's built-in metrics. -* __Baseline comparison__: Compare multiple runs in a side-by-side manner. -* __Metric customization__: Isolate a subset of built-in metrics or build your own profiling configuration. -* __Filtering__: Hone in on a particular kernel, gpu-id, and/or dispatch-id via post-process filtering. +* **Derived metrics**: All of Omniperf's built-in metrics. +* **Baseline comparison**: Compare multiple runs side by side. +* **Metric customization**: Isolate a subset of built-in metrics or build your own profiling configuration. +* **Filtering**: Home in on a particular kernel, gpu-id, and/or dispatch-id via post-process filtering. Run ``omniperf analyze -h`` for more details. @@ -295,171 +297,177 @@ Demo #. Redo a comprehensive analysis with Omniperf CLI at any optimization milestone. -More options ------------- +More analysis options +--------------------- + +Single run + .. 
code-block:: shell -- __Single run__ - ```shell - $ omniperf analyze -p workloads/vcopy/MI200/ - ``` + $ omniperf analyze -p workloads/vcopy/MI200/ -- __List top kernels and dispatches__ - ```shell - $ omniperf analyze -p workloads/vcopy/MI200/ --list-stats - ``` +List top kernels and dispatches + .. code-block:: shell -- __List metrics__ + $ omniperf analyze -p workloads/vcopy/MI200/ --list-stats - ```shell - $ omniperf analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a - ``` +List metrics + .. code-block:: shell -- __Show "System Speed-of-Light" and "CS_Busy" blocks only__ + $ omniperf analyze -p workloads/vcopy/MI200/ --list-metrics gfx90a - ```shell - $ omniperf analyze -p workloads/vcopy/MI200/ -b 2 5.1.0 - ``` +Show System Speed-of-Light and CS_Busy blocks only + .. code-block:: shell - ```{note} - Users can filter single metric or the whole hardware component by its id. In this case, 1 is the id for "system speed of light" and 5.1.0 the id for metric "GPU Busy Cycles". - ``` + $ omniperf analyze -p workloads/vcopy/MI200/ -b 2 5.1.0 -- __Filter kernels__ +.. note:: + You can filter a single metric or the whole hardware component by its ID. In + this case, ``2`` is the ID for System Speed-of-Light and ``5.1.0`` the ID for + the GPU Busy Cycles metric. + +Filter kernels First, list the top kernels in your application using `--list-stats`. - ```shell-session - $ omniperf analyze -p workloads/vcopy/MI200/ --list-stats - - Analysis mode = cli - [analysis] deriving Omniperf metrics... 
- - -------------------------------------------------------------------------------- - Detected Kernels (sorted descending by duration) - ╒════╤══════════════════════════════════════════════╕ - │ │ Kernel_Name │ - ╞════╪══════════════════════════════════════════════╡ - │ 0 │ vecCopy(double*, double*, double*, int, int) │ - ╘════╧══════════════════════════════════════════════╛ - - -------------------------------------------------------------------------------- - Dispatch list - ╒════╤═══════════════╤══════════════════════════════════════════════╤══════════╕ - │ │ Dispatch_ID │ Kernel_Name │ GPU_ID │ - ╞════╪═══════════════╪══════════════════════════════════════════════╪══════════╡ - │ 0 │ 0 │ vecCopy(double*, double*, double*, int, int) │ 0 │ - ╘════╧═══════════════╧══════════════════════════════════════════════╧══════════╛ - - ``` - - Second, select the index of the kernel you would like to filter (i.e. __vecCopy(double*, double*, double*, int, int) [clone .kd]__ at index __0__). Then, use this index to apply the filter via `-k/--kernels`. - - ```shell-session - $ omniperf analyze -p workloads/vcopy/MI200/ -k 0 - - Analysis mode = cli - [analysis] deriving Omniperf metrics... - - -------------------------------------------------------------------------------- - 0. Top Stats - 0.1 Top Kernels - ╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╤═════╕ - │ │ Kernel_Name │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │ S │ - ╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╪═════╡ - │ 0 │ vecCopy(double*, double*, double*, int, │ 1.00 │ 18560.00 │ 18560.00 │ 18560.00 │ 100.00 │ * │ - │ │ int) │ │ │ │ │ │ │ - ╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╧═════╛ - ... ... 
- ``` - - ```{note} - You will see your filtered kernel(s) indicated by an asterisk in the Top Stats table - ``` - - -- __Baseline comparison__ - - ```shell - omniperf analyze -p workload1/path/ -p workload2/path/ - ``` + + .. code-block:: + + $ omniperf analyze -p workloads/vcopy/MI200/ --list-stats + + Analysis mode = cli + [analysis] deriving Omniperf metrics... + + -------------------------------------------------------------------------------- + Detected Kernels (sorted descending by duration) + ╒════╤══════════════════════════════════════════════╕ + │ │ Kernel_Name │ + ╞════╪══════════════════════════════════════════════╡ + │ 0 │ vecCopy(double*, double*, double*, int, int) │ + ╘════╧══════════════════════════════════════════════╛ + + -------------------------------------------------------------------------------- + Dispatch list + ╒════╤═══════════════╤══════════════════════════════════════════════╤══════════╕ + │ │ Dispatch_ID │ Kernel_Name │ GPU_ID │ + ╞════╪═══════════════╪══════════════════════════════════════════════╪══════════╡ + │ 0 │ 0 │ vecCopy(double*, double*, double*, int, int) │ 0 │ + ╘════╧═══════════════╧══════════════════════════════════════════════╧══════════╛ + + Second, select the index of the kernel you would like to filter; for example, + ``vecCopy(double*, double*, double*, int, int) [clone .kd]`` at index ``0``. + Then, use this index to apply the filter via ``-k`` or ``--kernels``. + + .. code-block:: shell + + $ omniperf analyze -p workloads/vcopy/MI200/ -k 0 + + Analysis mode = cli + [analysis] deriving Omniperf metrics... + + -------------------------------------------------------------------------------- + 0. 
Top Stats + 0.1 Top Kernels + ╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╤═════╕ + │ │ Kernel_Name │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │ S │ + ╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╪═════╡ + │ 0 │ vecCopy(double*, double*, double*, int, │ 1.00 │ 18560.00 │ 18560.00 │ 18560.00 │ 100.00 │ * │ + │ │ int) │ │ │ │ │ │ │ + ╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╧═════╛ + ... + + .. note:: + + You should see your filtered kernels indicated by an asterisk in the **Top + Stats** table. + + +Baseline comparison + .. code-block:: shell + + omniperf analyze -p workload1/path/ -p workload2/path/ + OR - ```shell - omniperf analyze -p workload1/path/ -k 0 -p workload2/path/ -k 1 - ``` + + .. code-block:: shell + + omniperf analyze -p workload1/path/ -k 0 -p workload2/path/ -k 1 GUI analysis ============ -Web-based GUI -------------- +.. _standalone-gui-analysis: + +Standalone web-based analysis GUI +--------------------------------- Features ^^^^^^^^ -Omniperf's standalone GUI analyzer is a lightweight web page that can -be generated directly from the command-line. This option is provided -as an alternative for users wanting to explore profiling results -graphically, but without the additional setup requirements or -server-side overhead of Omniperf's detailed [Grafana -interface](analysis.md#grafana-based-gui) -option. The standalone GUI analyzer is provided as simple -[Flask](https://flask.palletsprojects.com/en/2.2.x/) application -allowing users to view results from within a web browser. - -```{admonition} Port forwarding - -Note that the standalone GUI analyzer publishes a web interface on port 8050 by default. 
-On production HPC systems where profiling jobs run -under the auspices of a resource manager, additional SSH tunneling -between the desired web browser host (e.g. login node or remote workstation) and compute host may be -required. Alternatively, users may find it more convenient to download -profiled workloads to perform analysis on their local system. - -See [FAQ](faq.md) for more details on SSH tunneling. -``` +Omniperf's standalone analyzer GUI is a lightweight web page that can be +generated directly from the command line. The standalone analyzer GUI is an +alternative to the CLI if you want to explore profiling results visually, but +without the additional setup requirements or server-side overhead of Omniperf's +detailed :ref:`Grafana interface ` option. The standalone GUI +analyzer is provided as a simple `Flask `_ +application that lets you view results from your preferred web browser. -#### Usage +.. note:: -To launch the standalone GUI, include the `--gui` flag with your desired analysis command. For example: + A note on **port forwarding**: the standalone GUI analyzer publishes a web + interface on port ``8050`` by default. On production HPC systems where + profiling jobs run under the auspices of a resource manager, additional SSH + tunneling between the desired web browser host (such as a login node or + remote workstation) and the compute host may be required. Alternatively, you + might find it more convenient to download profiled workloads to perform + analysis on a local system. -```shell-session -$ omniperf analyze -p workloads/vcopy/MI200/ --gui + See the :doc:`../reference/faq` for more details on SSH tunneling. - ___ _ __ - / _ \ _ __ ___ _ __ (_)_ __ ___ _ __ / _| -| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_ -| |_| | | | | | | | | | | |_) | __/ | | _| - \___/|_| |_| |_|_| |_|_| .__/ \___|_| |_| - |_| + Usage + ^^^^^ -Analysis mode = web_ui -[analysis] deriving Omniperf metrics... 
-Dash is running on http://0.0.0.0:8050/ - - * Serving Flask app 'omniperf_analyze.analysis_webui' (lazy loading) - * Environment: production - WARNING: This is a development server. Do not use it in a production deployment. - Use a production WSGI server instead. - * Debug mode: off - * Running on all addresses (0.0.0.0) - WARNING: This is a development server. Do not use it in a production deployment. - * Running on http://127.0.0.1:8050 - * Running on http://10.228.33.172:8050 (Press CTRL+C to quit) -``` +To launch the standalone GUI, include the ``--gui`` flag with your desired +analysis command. For example: -At this point, users can then launch their web browser of choice and -go to http://localhost:8050/ to see an analysis page. +.. code-block:: shell + $ omniperf analyze -p workloads/vcopy/MI200/ --gui + ___ _ __ + / _ \ _ __ ___ _ __ (_)_ __ ___ _ __ / _| + | | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_ + | |_| | | | | | | | | | | |_) | __/ | | _| + \___/|_| |_| |_|_| |_|_| .__/ \___|_| |_| + |_| -![Standalone GUI Homepage](images/standalone_gui.png) + Analysis mode = web_ui + [analysis] deriving Omniperf metrics... + Dash is running on http://0.0.0.0:8050/ -```{tip} -To launch the web application on a port other than 8050, include an optional port argument: -`--gui ` -``` + * Serving Flask app 'omniperf_analyze.analysis_webui' (lazy loading) + * Environment: production + WARNING: This is a development server. Do not use it in a production deployment. + Use a production WSGI server instead. + * Debug mode: off + * Running on all addresses (0.0.0.0) + WARNING: This is a development server. Do not use it in a production deployment. + * Running on http://127.0.0.1:8050 + * Running on http://10.228.33.172:8050 (Press CTRL+C to quit) + +At this point, you can launch your web browser of choice and navigate to +``http://localhost:8050/`` to view the analysis interface. 
-When no filters are applied, users will see five basic sections derived from their application's profiling data: +.. image:: ../data/analyze/standalone_gui.png + :align: center + :alt: Omniperf standalone GUI home screen + +.. tip:: + + To launch the standalone GUI analyzer web app on a port other than ``8050``, + include the optional argument ``--gui ``. + +When no filters are applied, you'll see five basic sections derived from your +application's profiling data: 1. Memory Chart Analysis 2. Empirical Roofline Analysis @@ -479,29 +487,35 @@ interface](analysis.md#grafana-based-gui). .. _grafana-analysis: -Grafana-based GUI ------------------ +Grafana analyzer GUI +-------------------- + +Find setup instructions in :doc:`../install/grafana-setup`. Features ^^^^^^^^ -The Omniperf Grafana GUI Analyzer supports the following features to facilitate MI GPU performance profiling and analysis: +The Omniperf Grafana analyzer GUI supports the following features to facilitate +MI accelerator performance profiling and analysis: + +* System and Hardware Component (Hardware Block) + Speed-of-Light (SOL) +* Multiple normalization options, including per-cycle, per-wave, per-kernel and per-second. +* Baseline comparisons +* Regex-based Dispatch ID filtering +* Roofline Analysis +* Detailed performance counters and metrics per hardware component, e.g., + * Command Processor - Fetch (CPF) / Command Processor - Controller (CPC) + * Workgroup Manager (SPI) + * Shader Sequencer (SQ) + * Shader Sequencer Controller (SQC) + * L1 Address Processing Unit, a.k.a. Texture Addresser (TA) / L1 Backend Data Processing Unit, a.k.a. Texture Data (TD) + * L1 Cache (TCP) + * L2 Cache (TCC) (both aggregated and per-channel perf info) + +Speed-of-Light +++++++++++++++ -- System and Hardware Component (Hardware Block) Speed-of-Light (SOL) -- Multiple normalization options, including per-cycle, per-wave, per-kernel and per-second. 
-- Baseline comparisons -- Regex based Dispatch ID filtering -- Roofline Analysis -- Detailed performance counters and metrics per hardware component, e.g., - - Command Processor - Fetch (CPF) / Command Processor - Controller (CPC) - - Workgroup Manager (SPI) - - Shader Sequencer (SQ) - - Shader Sequencer Controller (SQC) - - L1 Address Processing Unit, a.k.a. Texture Addresser (TA) / L1 Backend Data Processing Unit, a.k.a. Texture Data (TD) - - L1 Cache (TCP) - - L2 Cache (TCC) (both aggregated and per-channel perf info) - -##### Speed-of-Light Speed-of-light panels are provided at both the system and per hardware component level to help diagnosis performance bottlenecks. The performance numbers of the workload under testing are compared to the theoretical maximum, (e.g. floating point operations, bandwidth, cache hit rate, etc.), to indicate the available room to further utilize the hardware capability. ##### Multi Normalization diff --git a/docs/how-to/includes/global-options.rst b/docs/how-to/includes/global-options.rst deleted file mode 100644 index d133a1d9d..000000000 --- a/docs/how-to/includes/global-options.rst +++ /dev/null @@ -1,20 +0,0 @@ -.. _global-options: - -Global options -============== - -The Omniperf command line tool has a set of *global* utility options that are -available across all modes. - -``-v``, ``--version`` - Prints the Omniperf version and exits. - -``-V``, ``--verbose`` - Increases output verbosity. Use multiple times for higher levels of - verbosity. - -``-q``, ``--quiet`` - Reduces output verbosity and runs quietly. - -``-s``, ``--specs`` - Prints system specs and exits. diff --git a/docs/how-to/includes/modes.rst b/docs/how-to/includes/modes.rst index 5eb886b2c..bfdf812de 100644 --- a/docs/how-to/includes/modes.rst +++ b/docs/how-to/includes/modes.rst @@ -37,7 +37,7 @@ Analyze mode generated metrics. 
It generates metrics from the entirety of your profiled application or a subset identified through the Omniperf CLI analysis filters. - To generate a lightweight GUI interface, you can add the `--gui` flag to your + To generate a lightweight GUI interface, you can add the ``--gui`` flag to your analysis command. This mode is a middle ground to the highly detailed Omniperf Grafana GUI and @@ -57,11 +57,11 @@ Database mode ------------- ``database`` - The :doc:`Grafana GUI dashboard <../install/grafana-setup>` is built on a - MongoDB database. ``--import`` profiling results to the DB to interact with - the workload in Grafana or `--remove` the workload from the DB. + The Grafana analyzer GUI is built on a MongoDB database. ``--import`` + profiling results to the DB to interact with the workload in Grafana or + ``--remove`` the workload from the DB. - Connection options need to be specified. See :ref:`grafana-gui-import` for + Connection options need to be specified. See :ref:`grafana-analysis` for more details. .. code-block:: shell diff --git a/docs/how-to/use.rst b/docs/how-to/use.rst index c02694b74..436d6ad28 100644 --- a/docs/how-to/use.rst +++ b/docs/how-to/use.rst @@ -89,6 +89,8 @@ workload path. ``-p``, ``--path`` Enables you to analyze existing profiling data in the Omniperf CLI. +See :ref:`cli-analysis` for more detailed information. + .. _basic-analyze-grafana: Analyze in the Grafana GUI @@ -100,15 +102,36 @@ data to the MongoDB instance included in the Omniperf Dockerfile. See :doc:`../install/grafana-setup`. To interact with Grafana data, stored in the Omniperf database, enter -``database`` :ref:`mode `; for example: +``database`` :ref:`mode `; for example: .. code-block:: shell $ omniperf database --import [CONNECTION OPTIONS] +See :ref:`grafana-analysis` for more detailed information. + .. include:: ./includes/modes.rst -.. include:: ./includes/global-options.rst +.. 
_global-options: + +Global options +============== + +The Omniperf command line tool has a set of *global* utility options that are +available across all modes. + +``-v``, ``--version`` + Prints the Omniperf version and exits. + +``-V``, ``--verbose`` + Increases output verbosity. Use multiple times for higher levels of + verbosity. + +``-q``, ``--quiet`` + Reduces output verbosity and runs quietly. + +``-s``, ``--specs`` + Prints system specs and exits. .. note:: diff --git a/docs/index.rst b/docs/index.rst index 7d3ea350c..3fc5b895b 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -7,7 +7,7 @@ Omniperf ******** Omniperf documentation provides a detailed breakdown of all facets of Omniperf. -In addition to a full deployment guide with installation instructions, this +In addition to a full deployment guide with installation instructions, this documentation also explains the ideas motivating the design behind the tool and its components. @@ -56,10 +56,10 @@ in practice. * :doc:`license` -This project is proudly open source and we welcome all feedback. For more -details on how to contribute, refer to `Contributing to ROCm -`_. +This project is proudly open source; all feedback is welcome. For more details +on how to contribute, refer to +`Contributing to ROCm `_. -Find licensing information on the +Find ROCm licensing information on the `Licensing `_ page. diff --git a/docs/install/grafana-setup.rst b/docs/install/grafana-setup.rst index f1fc38441..3e48874d8 100644 --- a/docs/install/grafana-setup.rst +++ b/docs/install/grafana-setup.rst @@ -73,6 +73,7 @@ directory to begin. .. code-block:: bash + $ cd grafana $ sudo docker-compose build $ sudo docker-compose up -d