diff --git a/source/docs/_toc.yml.in b/source/docs/_toc.yml.in index 8cc75d39..f6987bfe 100644 --- a/source/docs/_toc.yml.in +++ b/source/docs/_toc.yml.in @@ -6,14 +6,6 @@ defaults: root: index subtrees: - - entries: - - file: what-is-rocprof-sdk - - file: buffered_services.md - - file: callback_services.md - - file: counter_collection_services.md - - file: intercept_table.md - - file: pc_sampling.md - - file: tool_library_overview.md - caption: Install entries: - file: install/installation @@ -23,8 +15,17 @@ subtrees: - file: how-to/samples - caption: API reference entries: + - file: api-reference/buffered_services + - file: api-reference/callback_services + - file: api-reference/counter_collection_services + - file: api-reference/intercept_table + - file: api-reference/pc_sampling + - file: api-reference/tool_library - file: _doxygen/html/index title: API library + - caption: Conceptual + entries: + - file: conceptual/comparing-with-legacy-tools - caption: License entries: - file: license diff --git a/source/docs/buffered_services.md b/source/docs/api-reference/buffered_services.md similarity index 99% rename from source/docs/buffered_services.md rename to source/docs/api-reference/buffered_services.md index 77d09027..f6a7eead 100644 --- a/source/docs/buffered_services.md +++ b/source/docs/api-reference/buffered_services.md @@ -1,4 +1,4 @@ -# Buffered Services +# Buffered services For the buffered approach, supported buffer record categories are enumerated in `rocprofiler_buffer_category_t` category field. 
diff --git a/source/docs/callback_services.md b/source/docs/api-reference/callback_services.md similarity index 99% rename from source/docs/callback_services.md rename to source/docs/api-reference/callback_services.md index 6744d9d4..1a458490 100644 --- a/source/docs/callback_services.md +++ b/source/docs/api-reference/callback_services.md @@ -1,4 +1,4 @@ -# Callback Tracing Services +# Callback tracing services ## Overview diff --git a/source/docs/counter_collection_services.md b/source/docs/api-reference/counter_collection_services.md similarity index 99% rename from source/docs/counter_collection_services.md rename to source/docs/api-reference/counter_collection_services.md index 86cf29e8..a7f58b59 100644 --- a/source/docs/counter_collection_services.md +++ b/source/docs/api-reference/counter_collection_services.md @@ -1,4 +1,4 @@ -# Counter Collection Services +# Counter collection services ## Definitions diff --git a/source/docs/intercept_table.md b/source/docs/api-reference/intercept_table.md similarity index 98% rename from source/docs/intercept_table.md rename to source/docs/api-reference/intercept_table.md index 54a95093..58cdc745 100644 --- a/source/docs/intercept_table.md +++ b/source/docs/api-reference/intercept_table.md @@ -1,4 +1,4 @@ -# Runtime Intercept Tables +# Runtime intercept tables Although most tools will want to leverage the callback or buffer tracing services for tracing the HIP, HSA, and ROCTx APIs, rocprofiler-sdk does provide access to the raw API dispatch tables. 
Each of the aforementioned APIs are diff --git a/source/docs/pc_sampling.md b/source/docs/api-reference/pc_sampling.md similarity index 98% rename from source/docs/pc_sampling.md rename to source/docs/api-reference/pc_sampling.md index c7abfde5..a75cf03e 100644 --- a/source/docs/pc_sampling.md +++ b/source/docs/api-reference/pc_sampling.md @@ -1,4 +1,4 @@ -# PC Sampling Method +# PC sampling method PC Sampling is a profiling method that uses statistical approximation of the kernel execution by sampling GPU program counters. Furthermore, the method periodically chooses an active wave (in a round robin manner) and snapshot it's program counter (PC). The process takes place on every compute unit simultaneously which makes it device-wide PC sampling. The outcome is the histogram of samples that says how many times each kernel instruction was sampled. diff --git a/source/docs/tool_library_overview.md b/source/docs/api-reference/tool_library.md similarity index 98% rename from source/docs/tool_library_overview.md rename to source/docs/api-reference/tool_library.md index b8930e34..0d63f5a5 100644 --- a/source/docs/tool_library_overview.md +++ b/source/docs/api-reference/tool_library.md @@ -143,18 +143,6 @@ tool_init(rocprofiler_client_finalize_t fini_func, Otherwise, ROCprofiler-SDK invokes the `finalize` callback via an `atexit` handler. -## Agent Information - -## Contexts - -## Configuring Services - -## Synchronous Callbacks - -## Asynchronous Callbacks for Buffers - -## Recommendations - ## Full `rocprofiler_configure` Sample All of the snippets from the previous sections have been combined here for convenience. 
diff --git a/source/docs/what-is-rocprof-sdk.rst b/source/docs/conceptual/comparing-with-legacy-tools.rst similarity index 53% rename from source/docs/what-is-rocprof-sdk.rst rename to source/docs/conceptual/comparing-with-legacy-tools.rst index e4389d1d..82909b99 100644 --- a/source/docs/what-is-rocprof-sdk.rst +++ b/source/docs/conceptual/comparing-with-legacy-tools.rst @@ -1,22 +1,15 @@ .. meta:: - :description: Documentation of the installation, configuration, use of the ROCProfiler SDK, and rocprofv3 command-line tool - :keywords: ROCProfiler SDK tool, ROCProfiler SDK library, rocprofv3, ROCm, API, reference + :description: Documentation of the installation, configuration, use of the ROCprofiler-SDK, and rocprofv3 command-line tool + :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, ROCm, API, reference -.. _what-is-rocprof-sdk: +.. _comparing-with-legacy-tools: -========================== -What is ROCprofiler-SDK? -========================== +======================================================== +Comparing ROCprofiler-SDK to other ROCm profiling tools +======================================================== -ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software. -It supports application tracing to provide a big picture of the GPU application execution and kernel profiling to provide low-level hardware details from the performance counters. -The ROCprofiler-SDK library provides runtime-independent APIs for tracing runtime calls and asynchronous activities such as GPU kernel dispatches and memory moves. The tracing includes callback APIs for runtime API tracing and activity APIs for asynchronous activity records logging. - -In summary, ROCprofiler-SDK combines `ROCProfiler `_ and `ROCTracer `_. -You can utilize the ROCprofiler-SDK to develop a tool for profiling and tracing HIP applications on ROCm software. 
- -ROCprofiler-SDK is an improved version that enables more efficient implementations and better thread safety while avoiding problems that plague the former implementations of ROCProfiler and ROCTracer. -Here are the distinct ROCprofiler-SDK features: +ROCprofiler-SDK is an improved version of ROCm profiling tools that enables more efficient implementations and better thread safety while avoiding problems that plague the former implementations of ROCProfiler and ROCTracer. +Here are the distinct ROCprofiler-SDK features, which also highlight the improvements over ROCProfiler and ROCTracer: - Improved tool initialization - Support for simultaneous use of the same services by multiple tools @@ -25,10 +18,7 @@ Here are the distinct ROCprofiler-SDK features: - Backward ABI compatibility - PC sampling (beta implementation) -Improvements over ROCProfiler and ROCTracer ----------------------------------------------------- - -The former implementations allow a tool to access any of the services provided by ROCProfiler or ROCTracer such as API tracing, kernel tracing, etc., by calling ``roctracer_init()`` when a ROCm runtime is initially loaded. +The former implementations allow a tool to access any of the services provided by ROCProfiler or ROCTracer, such as API tracing and kernel tracing, by calling ``roctracer_init()`` when a ROCm runtime is initially loaded. As the calling tool is not required to specify during initialization, the services it needs to use, the libraries must be effectively prepared for any service to be available anytime. This behavior introduces unnecessary overhead and makes thread-safe data management difficult, as tools generally don't use all the available services. For example, ROCTracer always installs wrappers around every runtime API and adds indirection overhead through the ROCTracer library to check for the current service configuration in a thread-safe manner. 
diff --git a/source/docs/data/counter_collection.csv b/source/docs/data/counter_collection.csv new file mode 100644 index 00000000..b650bd02 --- /dev/null +++ b/source/docs/data/counter_collection.csv @@ -0,0 +1,2 @@ +"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value" +0,1,1,139892123975680,5619,5619,1048576,"matrixTranspose(float*, float*, int)",16,0,0,8,16,"SQ_WAVES",65536 diff --git a/source/docs/data/kernel_names.csv b/source/docs/data/kernel_names.csv new file mode 100644 index 00000000..c0b571c2 --- /dev/null +++ b/source/docs/data/kernel_names.csv @@ -0,0 +1,5 @@ +"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value" +4,4,1,1,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 +8,8,1,2,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 +12,12,1,3,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 +16,16,1,4,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 diff --git a/source/docs/how-to/samples.md b/source/docs/how-to/samples.md index f92fa314..3d6140cf 100644 --- a/source/docs/how-to/samples.md +++ b/source/docs/how-to/samples.md @@ -4,7 +4,7 @@ The samples are provided to help you see the profiler in action. ## Finding samples -After the ROCm build is installed: +The ROCm installation provides sample programs and `rocprofv3` tool. - Sample programs are installed here: @@ -35,7 +35,7 @@ ctest -V ``` :::{note} -Running a few of these tests require you to install Pandas and pytest first. 
+Running a few of these tests requires you to install [pandas](https://pandas.pydata.org/) and [pytest](https://docs.pytest.org/en/stable/) first. ::: ```bash diff --git a/source/docs/how-to/using-rocprofv3.rst b/source/docs/how-to/using-rocprofv3.rst index a13f7ee3..c8c20753 100644 --- a/source/docs/how-to/using-rocprofv3.rst +++ b/source/docs/how-to/using-rocprofv3.rst @@ -1,6 +1,6 @@ .. meta:: - :description: Documentation of the installation, configuration, use of the ROCProfiler SDK, and rocprofv3 command-line tool - :keywords: ROCProfiler SDK tool, ROCProfiler SDK library, rocprofv3, ROCm, API, reference + :description: Documentation of the installation, configuration, use of the ROCprofiler-SDK, and rocprofv3 command-line tool + :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, ROCm, API, reference .. _using-rocprofv3: @@ -8,8 +8,8 @@ Using rocprofv3 ====================== -``rocprofv3`` is a CLI tool that helps you quickly optimize applications and understand the low-level kernel details without requiring any modification in the source code. -It is being developed to be backward compatible with its predecessor, ``rocprof``, and to provide more features for application profiling with better accuracy. +``rocprofv3`` is a CLI tool that helps you quickly optimize applications and understand the low-level kernel details without requiring any modification in the source code. +It's backward compatible with its predecessor, ``rocprof``, and provides more features for application profiling with better accuracy. The following sections demonstrate the use of ``rocprofv3`` for application tracing and kernel profiling using various command-line options. @@ -37,7 +37,7 @@ Here is the list of ``rocprofv3`` command-line options. Some options are used fo * - Option - Description - Use - + * - ``--hip-trace`` - Collects HIP runtime traces. - Application tracing @@ -113,7 +113,7 @@ Here is the list of ``rocprofv3`` command-line options. 
Some options are used fo * - ``-o`` \| ``--output-file`` - Specifies the name of the output file. Note that this name is appended to the default names (_api_trace or counter_collection.csv) of the generated files'. - Output control - + * - ``-M`` \| ``--mangled-kernels`` - Overrides the default demangling of kernel names. - Output control @@ -125,7 +125,7 @@ Here is the list of ``rocprofv3`` command-line options. Some options are used fo * - ``--output-format`` - For adding output format (supported formats: csv, json, pftrace) - Output control - + * - ``--preload`` - Libraries to prepend to LD_PRELOAD (usually for sanitizers) - Extension @@ -158,9 +158,6 @@ To trace HIP runtime APIs, use: rocprofv3 --hip-trace -- < app_relative_path > -.. note:: - The tracing and counter collection options generate an additional `agent info` file. - The above command generates a `hip_api_trace.csv` file prefixed with the process ID. .. code-block:: shell @@ -170,9 +167,9 @@ The above command generates a `hip_api_trace.csv` file prefixed with the process Here are the contents of `hip_api_trace.csv` file: .. csv-table:: HIP runtime api trace - :file: /data/hip_compile_trace.csv - :widths: 10,10,10,10,10,20,20 - :header-rows: 1 + :file: /data/hip_compile_trace.csv + :widths: 10,10,10,10,10,20,20 + :header-rows: 1 To trace HIP compile time APIs, use: @@ -189,23 +186,12 @@ The above command generates a `hip_api_trace.csv` file prefixed with the process Here are the contents of `hip_api_trace.csv` file: .. csv-table:: HIP compile time api trace - :file: /data/hip_compile_trace.csv - :widths: 10,10,10,10,10,20,20 - :header-rows: 1 + :file: /data/hip_compile_trace.csv + :widths: 10,10,10,10,10,20,20 + :header-rows: 1 For the description of the fields in the output file, see :ref:`output-file-fields`. -Agent Info -'''''''''''''' - -.. 
code-block:: shell - - $ cat 238_agent_info.csv - - "Node_Id","Logical_Node_Id","Agent_Type","Cpu_Cores_Count","Simd_Count","Cpu_Core_Id_Base","Simd_Id_Base","Max_Waves_Per_Simd","Lds_Size_In_Kb","Gds_Size_In_Kb","Num_Gws","Wave_Front_Size","Num_Xcc","Cu_Count","Array_Count","Num_Shader_Banks","Simd_Arrays_Per_Engine","Cu_Per_Simd_Array","Simd_Per_Cu","Max_Slots_Scratch_Cu","Gfx_Target_Version","Vendor_Id","Device_Id","Location_Id","Domain","Drm_Render_Minor","Num_Sdma_Engines","Num_Sdma_Xgmi_Engines","Num_Sdma_Queues_Per_Engine","Num_Cp_Queues","Max_Engine_Clk_Ccompute","Max_Engine_Clk_Fcompute","Sdma_Fw_Version","Fw_Version","Capability","Cu_Per_Engine","Max_Waves_Per_Cu","Family_Id","Workgroup_Max_Size","Grid_Max_Size","Local_Mem_Size","Hive_Id","Gpu_Id","Workgroup_Max_Dim_X","Workgroup_Max_Dim_Y","Workgroup_Max_Dim_Z","Grid_Max_Dim_X","Grid_Max_Dim_Y","Grid_Max_Dim_Z","Name","Vendor_Name","Product_Name","Model_Name" - 0,0,"CPU",24,0,0,0,0,0,0,0,0,1,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3800,0,0,0,0,0,0,23,0,0,0,0,0,0,0,0,0,0,0,"AMD Ryzen 9 3900X 12-Core Processor","CPU","AMD Ryzen 9 3900X 12-Core Processor","" - 1,1,"GPU",0,256,0,2147487744,10,64,0,64,64,1,64,4,4,1,16,4,32,90000,4098,26751,12032,0,128,2,0,2,24,3800,1630,432,440,138420864,16,40,141,1024,4294967295,0,0,64700,1024,1024,1024,4294967295,4294967295,4294967295,"gfx900","AMD","Radeon RX Vega","vega10" - HSA trace +++++++++++++ @@ -214,7 +200,7 @@ The HIP runtime library is implemented with the low-level HSA runtime. HSA API t HSA trace contains the start and end time of HSA runtime API calls and their asynchronous activities. .. code-block:: bash - + rocprofv3 --hsa-trace -- < app_relative_path > The above command generates a `hsa_api_trace.csv` file prefixed with process ID. Note that the contents of this file have been truncated for demonstration purposes. @@ -226,9 +212,9 @@ The above command generates a `hsa_api_trace.csv` file prefixed with process ID. 
Here are the contents of `hsa_api_trace.csv` file: .. csv-table:: HSA api trace - :file: /data/hsa_trace.csv - :widths: 10,10,10,10,10,20,20 - :header-rows: 1 + :file: /data/hsa_trace.csv + :widths: 10,10,10,10,10,20,20 + :header-rows: 1 For the description of the fields in the output file, see :ref:`output-file-fields`. @@ -284,9 +270,9 @@ Running the preceding command generates a `marker_api_trace.csv` file prefixed w Here are the contents of `marker_api_trace.csv` file: .. csv-table:: Marker api trace - :file: /data/marker_api_trace.csv - :widths: 10,10,10,10,10,20,20 - :header-rows: 1 + :file: /data/marker_api_trace.csv + :widths: 10,10,10,10,10,20,20 + :header-rows: 1 For the description of the fields in the output file, see :ref:`output-file-fields`. @@ -308,10 +294,10 @@ The above command generates a `kernel_trace.csv` file prefixed with the process Here are the contents of `kernel_trace.csv` file: .. csv-table:: Kernel trace - :file: /data/kernel_trace.csv - :widths: 10,10,10,10,10,10,20,20,10,10,10,10,10,10,10,10 + :file: /data/kernel_trace.csv + :widths: 10,10,10,10,10,10,20,20,10,10,10,10,10,10,10,10 :header-rows: 1 - + For the description of the fields in the output file, see :ref:`output-file-fields`. Memory copy trace @@ -332,8 +318,8 @@ The above command generates a `memory_copy_trace.csv` file prefixed with the pro Here are the contents of `memory_copy_trace.csv` file: .. csv-table:: Memory copy trace - :file: /data/memory_copy_trace.csv - :widths: 10,10,10,10,10,20,20 + :file: /data/memory_copy_trace.csv + :widths: 10,10,10,10,10,20,20 :header-rows: 1 For the description of the fields in the output file, see :ref:`output-file-fields`. @@ -377,10 +363,11 @@ The above command generates a `hip_stats.csv` and `hip_api_trace` file prefixed Here are the contents of `hip_stats.csv` file: .. 
csv-table:: HIP stats - :file: /data/hip_stats.csv - :widths: 10,10,20,20,10,10,10,10 + :file: /data/hip_stats.csv + :widths: 10,10,20,20,10,10,10,10 :header-rows: 1 +For the description of the fields in the output file, see :ref:`output-file-fields`. Kernel profiling ------------------- @@ -392,160 +379,141 @@ For a comprehensive list of counters available on MI200, see `MI200 performance Input file ++++++++++++ -Rocprofv3 supports three input file formats: text (.txt), yaml (.yaml/.yml), or JSON (.json) format. +To collect the desired basic counters or derived metrics, mention them in an input file. In the input file, the line consisting of the counter or metric names must begin with ``pmc``. The input file could be in text (.txt), yaml (.yaml/.yml), or JSON (.json) format. -Text input is used collect the desired basic counters or derived metrics. In the input file, the line consisting of the counter or metric names must begin with ``pmc``. -The input files in JSON/YAML support all commandline options. Using these files each run can be configured with different set of options. -The schema supported by input json and yaml is as given below: - -*Schema for the rocprofv3 JSON/YAML input* +.. code-block:: shell -Properties -++++++++++++ + $ cat input.txt -- **``jobs``** *(array)*: rocprofv3 input data per application run. - - - **Items** *(object)*: data for rocprofv3. - - - **``pmc``** *(array)*: list of counters to collect. - - **``kernel_include_regex``** *(string)*: regex string. - - **``kernel_exclude_regex``** *(string)*: regex string. - - **``kernel_iteration_range``** *(string)*: range for range for - each kernel that match the filter [start-stop]. - - **``hip_trace``** *(boolean)*: For Collecting HIP Traces - (runtime + compiler). - - **``hip_runtime_trace``** *(boolean)*: For Collecting HIP - Runtime API Traces. - - **``hip_compiler_trace``** *(boolean)*: For Collecting HIP - Compiler generated code Traces. 
- - **``marker_trace``** *(boolean)*: For Collecting Marker (ROCTx) - Traces. - - **``kernel_trace``** *(boolean)*: For Collecting Kernel - Dispatch Traces. - - **``memory_copy_trace``** *(boolean)*: For Collecting Memory - Copy Traces. - - **``scratch_memory_trace``** *(boolean)*: For Collecting - Scratch Memory operations Traces. - - **``stats``** *(boolean)*: For Collecting statistics of enabled - tracing types. - - **``hsa_trace``** *(boolean)*: For Collecting HSA Traces (core - + amd + image + finalizer). - - **``hsa_core_trace``** *(boolean)*: For Collecting HSA API - Traces (core API). - - **``hsa_amd_trace``** *(boolean)*: For Collecting HSA API - Traces (AMD-extension API). - - **``hsa_finalize_trace``** *(boolean)*: For Collecting HSA API - Traces (Finalizer-extension API). - - **``hsa_image_trace``** *(boolean)*: For Collecting HSA API - Traces (Image-extenson API). - - **``sys_trace``** *(boolean)*: For Collecting HIP, HSA, Marker - (ROCTx), Memory copy, Scratch memory, and Kernel dispatch - traces. - - **``mangled-kernels``** *(boolean)*: Do not demangle the kernel - names. - - **``truncate-kernels``** *(boolean)*: Truncate the demangled - kernel names. - - **``output_file``** *(string)*: For the output file name. - - **``output_directory``** *(string)*: For adding output path - where the output files will be saved. - - **``output_format``** *(array)*: For adding output format - (supported formats: csv, json, pftrace). - - **``list_metrics``** *(boolean)*: List the metrics. - - **``log_level``** *(string)*: fatal, error, warning, info, - trace. - - **``preload``** *(array)*: Libraries to prepend to LD_PRELOAD - (usually for sanitizers). - -The number of basic counters or derived metrics that can be collected in one run of profiling are limited by the GPU hardware resources. If too many counters or metrics are selected, the kernels need to be executed multiple times to collect them. 
-For multi-pass execution, in the input text file include multiple ``pmc`` rows and counters or metrics in each ``pmc`` row can be collected in each kernel run. Whereas Json/Yaml input files have a list of jobs and each job corresponds to a pass/run. + pmc: GPUBusy SQ_WAVES + pmc: GRBM_GUI_ACTIVE .. code-block:: shell $ cat input.json - { - "jobs": [ - { - "hsa_trace": true, - "kernel_trace": true, - "memory_copy_trace": true, - "marker_trace": true, - "output_file": "out", - "output_format": [ - "csv", - "json", - "pftrace" - ] - }, - { - "pmc": [ - "SQ_WAVES" - ], - "kernel_include_regex": ".*_kernel", - "kernel_exclude_regex": "multiply", - "kernel_iteration_range": "[1-2]", - "output_file": "out", - "output_format": [ - "csv", - "json" - ], - "truncate_kernels": true - } - ] - } + { + "metrics": [ + { + "pmc": ["SQ_WAVES", "GRBM_COUNT", "GUI_ACTIVE"] + }, + { + "pmc": ["FETCH_SIZE", "WRITE_SIZE"] + } + ] + } .. code-block:: shell - $ cat input.txt + $ cat input.yaml - pmc: GPUBusy SQ_WAVES - pmc: GRBM_GUI_ACTIVE + metrics: + - pmc: + - SQ_WAVES + - GRBM_COUNT + - GUI_ACTIVE + - 'TCC_HIT[1]' + - 'TCC_HIT[2]' + - pmc: + - FETCH_SIZE + - WRITE_SIZE + +The number of basic counters or derived metrics that can be collected in one run of profiling are limited by the GPU hardware resources. If too many counters or metrics are selected, the kernels need to be executed multiple times to collect them. For multi-pass execution, include multiple ``pmc`` rows in the input file. Counters or metrics in each ``pmc`` row can be collected in each kernel run. + +Kernel profiling output ++++++++++++++++++++++++++ + +To supply the input file for kernel profiling, use: .. code-block:: shell - $ cat input.yml + rocprofv3 -i input.txt -- - jobs: +Running the above command generates a `./pmc_n/counter_collection.csv` file prefixed with the process ID. 
For each ``pmc`` row, a directory ``pmc_n`` containing a `counter_collection.csv` file is generated, where n = 1 for the first row and so on. - - "hsa_trace": true - "kernel_trace": true - "memory_copy_trace": true - "marker_trace": true - "output_file": "out" - "output_format" - - "csv", - - "json", - - "pftrace" +Each row of the CSV file is an instance of kernel execution. Here is a truncated version of the output file from ``pmc_1``: - - pmc: - - SQ_WAVES - kernel_include_regex: "addition" - kernel_exclude_regex: "multiply" - kernel_iteration_range: - - "[1-2]" - - "[3-4]" - - "[5-6]" +.. code-block:: shell + $ cat pmc_1/218_counter_collection.csv -Kernel profiling output -+++++++++++++++++++++++++ +Here are the contents of `counter_collection.csv` file: -To supply the input file for kernel profiling, use: +.. csv-table:: Counter collection + :file: /data/counter_collection.csv + :widths: 10,10,10,10,10,10,10,10,10,10,10,10,10,10,10 + :header-rows: 1 + +For the description of the fields in the output file, see :ref:`output-file-fields`. + +Kernel names +++++++++++++++ + +To target a specific kernel for counter collection when multiple kernels are present, use the ``--kernel-names`` option: .. code-block:: shell - rocprofv3 -i input.txt -- + rocprofv3 -i input.txt --kernel-names divide_kernel -- Running the above command generates a `./pmc_n/counter_collection.csv` file prefixed with the process ID. For each ``pmc`` row, a directory ``pmc_n`` containing a `counter_collection.csv` file is generated, where n = 1 for the first row and so on. -Each row of the CSV file is an instance of kernel execution. Here is a truncated version of the output file from ``pmc_1``. +Each row of the CSV file is an instance of kernel execution. Here is a truncated version of the output file from ``pmc_1``: + +.. code-block:: shell + + $ cat pmc_1/312_counter_collection.csv + +Here are the contents of `counter_collection.csv` file: + +.. 
csv-table:: Targeted kernel counter collection + :file: /data/kernel_names.csv + :widths: 10,10,10,10,10,10,10,10,10,10,10,10,10,10,10 + :header-rows: 1 + +Agent info +++++++++++++ + +.. note:: + All tracing and counter collection options generate an additional `agent_info.csv` file prefixed with the process ID. +The `agent_info.csv` file contains information about the CPU or GPU the kernel runs on. + .. code-block:: shell - $ cat pmc_1/218_counter_collection.csv + $ cat 238_agent_info.csv + + "Node_Id","Logical_Node_Id","Agent_Type","Cpu_Cores_Count","Simd_Count","Cpu_Core_Id_Base","Simd_Id_Base","Max_Waves_Per_Simd","Lds_Size_In_Kb","Gds_Size_In_Kb","Num_Gws","Wave_Front_Size","Num_Xcc","Cu_Count","Array_Count","Num_Shader_Banks","Simd_Arrays_Per_Engine","Cu_Per_Simd_Array","Simd_Per_Cu","Max_Slots_Scratch_Cu","Gfx_Target_Version","Vendor_Id","Device_Id","Location_Id","Domain","Drm_Render_Minor","Num_Sdma_Engines","Num_Sdma_Xgmi_Engines","Num_Sdma_Queues_Per_Engine","Num_Cp_Queues","Max_Engine_Clk_Ccompute","Max_Engine_Clk_Fcompute","Sdma_Fw_Version","Fw_Version","Capability","Cu_Per_Engine","Max_Waves_Per_Cu","Family_Id","Workgroup_Max_Size","Grid_Max_Size","Local_Mem_Size","Hive_Id","Gpu_Id","Workgroup_Max_Dim_X","Workgroup_Max_Dim_Y","Workgroup_Max_Dim_Z","Grid_Max_Dim_X","Grid_Max_Dim_Y","Grid_Max_Dim_Z","Name","Vendor_Name","Product_Name","Model_Name" + 0,0,"CPU",24,0,0,0,0,0,0,0,0,1,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3800,0,0,0,0,0,0,23,0,0,0,0,0,0,0,0,0,0,0,"AMD Ryzen 9 3900X 12-Core Processor","CPU","AMD Ryzen 9 3900X 12-Core Processor","" + 1,1,"GPU",0,256,0,2147487744,10,64,0,64,64,1,64,4,4,1,16,4,32,90000,4098,26751,12032,0,128,2,0,2,24,3800,1630,432,440,138420864,16,40,141,1024,4294967295,0,0,64700,1024,1024,1024,4294967295,4294967295,4294967295,"gfx900","AMD","Radeon RX Vega","vega10" + +Kernel filtering ++++++++++++++++++ +Kernel filtering allows you to filter the kernel profiling output based on the kernel name by specifying regex strings in the 
input file. To include kernel names matching the regex string in the kernel profiling output, use ``kernel_include_regex``. To exclude the kernel names matching the regex string from the kernel profiling output, use ``kernel_exclude_regex``. +You can also specify an iteration range to profile only a set of iterations of the included kernels. If the iteration range is not specified, then all iterations of the included kernels are profiled. + +Here is an input file with kernel filters: + +.. code-block:: shell + + $ cat input.yml + jobs: + - pmc: [SQ_WAVES] + kernel_include_regex: "divide" + kernel_exclude_regex: "" + +To collect counters for the kernels matching the filters specified in the preceding input file, run: + +.. code-block:: shell + + rocprofv3 -i input.yml -- + + $ cat pass_1/312_counter_collection.csv "Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value" - 0,1,1,139892123975680,5619,5619,1048576,"matrixTranspose(float*, float*, int)",16,0,0,8,16,"SQ_WAVES",65536 + 4,4,1,1,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 + 8,8,1,2,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 + 12,12,1,3,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 + 16,16,1,4,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 .. _output-file-fields: @@ -605,32 +573,6 @@ The following table lists the various fields or the columns in the output CSV fi * - VGPR_Count - Kernel's Vector General Purpose Register (VGPR) count. -Kernel Filtering -+++++++++++++++++ - -rocprofv3 supports kernel filtering for profiling. 
A kernel filter is a set of a regex string (to include the kernels matching this filter), a regex string (to exclude the kernels matching this filter), -and an iteration range (set of iterations of the included kernels). If the iteration range is not provided then all iterations of the included kernels are profiled. - -.. code-block:: shell - - $ cat input.yml - jobs: - - pmc: [SQ_WAVES] - kernel_include_regex: "divide" - kernel_exclude_regex: "" - - -.. code-block:: shell - - rocprofv3 -i input.yml -- - - $ cat pass_1/312_counter_collection.csv - "Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value" - 4,4,1,1,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 - 8,8,1,2,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 - 12,12,1,3,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 - 16,16,1,4,36499,36499,1048576,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,16,"SQ_WAVES",16384 - Output formats ---------------- diff --git a/source/docs/index.rst b/source/docs/index.rst index d97efc90..8aaec858 100644 --- a/source/docs/index.rst +++ b/source/docs/index.rst @@ -1,16 +1,24 @@ .. meta:: - :description: Documentation of the installation, configuration, use of the ROCProfiler SDK, and rocprofv3 command-line tool - :keywords: ROCProfiler SDK tool, ROCProfiler SDK library, rocprofv3, ROCm, API, reference + :description: Documentation of the installation, configuration, use of the ROCprofiler-SDK, and rocprofv3 command-line tool + :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, ROCm, API, reference .. 
_index: ****************************************** -ROCProfiler SDK documentation +ROCprofiler-SDK documentation ****************************************** -ROCProfiler SDK is a comprehensive library that provides APIs for profiling and tracing HIP applications on AMD ROCm Software. To learn more, see :ref:`what-is-rocprof-sdk` +ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software. +It supports application tracing to provide a big picture of the GPU application execution and kernel profiling to provide low-level hardware details from the performance counters. +The ROCprofiler-SDK library provides runtime-independent APIs for tracing runtime calls and asynchronous activities such as GPU kernel dispatches and memory moves. The tracing includes callback APIs for runtime API tracing and activity APIs for asynchronous activity records logging. -You can access ROCProfiler SDK on our `GitHub repository `_. +In summary, ROCprofiler-SDK combines `ROCProfiler `_ and `ROCTracer `_. +You can utilize the ROCprofiler-SDK to develop a tool for profiling and tracing HIP applications on ROCm software. + +The code is open and hosted at ``_. + +.. note:: + ROCprofiler-SDK is in beta and subject to change in future releases. The documentation is structured as follows: @@ -23,12 +31,22 @@ The documentation is structured as follows: .. grid-item-card:: How to - * :doc:`Using rocprofv3 ` + * :ref:`using-rocprofv3` * :doc:`Samples ` .. grid-item-card:: API reference + * :doc:`Buffered services ` + * :doc:`Callback services ` + * :doc:`Counter collection services ` + * :doc:`Intercept table ` + * :doc:`PC sampling ` + * :doc:`Tool library ` * :doc:`API library <_doxygen/html/index>` + + .. grid-item-card:: Conceptual + + * :ref:`comparing-with-legacy-tools` To contribute to the documentation, refer to `Contributing to ROCm `_. 
diff --git a/source/docs/install/installation.md b/source/docs/install/installation.md index 053e3851..f4150ae3 100644 --- a/source/docs/install/installation.md +++ b/source/docs/install/installation.md @@ -11,7 +11,7 @@ ROCprofiler-SDK is supported only on Linux. The following distributions are test - OpenSUSE 15.4 - RedHat 8.8 -Other [Linux distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) might be supported but not tested yet. +ROCprofiler-SDK might operate as expected on other [Linux distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems), but it has not been tested on them. ### Identifying the operating system @@ -31,9 +31,11 @@ The relevant fields are `ID` and the `VERSION_ID`. ## Build requirements -Install [CMake](https://cmake.org/) version 3.21 or higher. +Install [CMake](https://cmake.org/) version 3.21 or later. -**Note:** If the `CMake` installed on the system is too old, you can install a new version using various methods. One of the easiest options is to use PyPi (Python’s pip). +:::{note} +If the CMake version installed on the system is too old, you can install a newer version using various methods. One of the easiest options is to install it from PyPI using pip. +::: ```bash pip install --user 'cmake==3.22.0' diff --git a/source/scripts/update-docs.sh b/source/scripts/update-docs.sh index 09357af3..c4528106 100755 --- a/source/scripts/update-docs.sh +++ b/source/scripts/update-docs.sh @@ -31,7 +31,7 @@ message "Running doxysphinx" doxysphinx build ${WORK_DIR} ${WORK_DIR}/_build/html ${WORK_DIR}/_doxygen/html message "Building html documentation" -make html SPHINXOPTS="-W --keep-going -n" +make html SPHINXOPTS="--keep-going -n" if [ -d ${SOURCE_DIR}/docs ]; then message "Removing stale documentation in ${SOURCE_DIR}/docs/"