From 4204042ac6358abc0b8f2cabc2efe3b436bc3811 Mon Sep 17 00:00:00 2001
From: srawat <120587655+SwRaw@users.noreply.github.com>
Date: Wed, 30 Oct 2024 19:39:08 +0530
Subject: [PATCH] Refactor API reference docs (#1125)

* Refactor API reference docs
* refactor API ref docs
* corrections
* consistent naming
* updates
* Update CHANGELOG.md
* improving SEO
* improving SEO
* Update using-rocprofv3.rst
* Update counter_collection_services.md
* Update using-rocprofv3.rst
* Fixing doc build errors
* changelogs and some formatting issues

---------

Co-authored-by: Gopesh Bhardwaj
---
 CHANGELOG.md | 181 ++++++++++--------
 source/docs/_toc.yml.in | 6 +
 .../docs/api-reference/buffered_services.md | 141 +++++++-------
 .../docs/api-reference/callback_services.md | 178 ++++++++---------
 .../counter_collection_services.md | 125 +++++++-----
 source/docs/api-reference/intercept_table.md | 47 +++--
 source/docs/api-reference/pc_sampling.md | 25 ++-
 source/docs/api-reference/tool_library.md | 21 +-
 source/docs/how-to/samples.md | 9 +-
 source/docs/how-to/using-rocprofv3.rst | 20 +-
 source/docs/index.rst | 10 +-
 source/docs/install/installation.md | 11 +-
 12 files changed, 416 insertions(+), 358 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 2da332f..e20ff4b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,128 +1,139 @@
 # Changelog for ROCprofiler-SDK
-Full documentation for ROCprofiler-SDK is available at [Click Here](source/docs/index.md)
+Full documentation for ROCprofiler-SDK is available at [rocm.docs.amd.com/projects/rocprofiler-sdk](source/docs/index.rst)
 ## ROCprofiler-SDK for AFAR I
-### Additions
+### Added
-- HSA API Tracing
-- Kernel Dispatch Tracing
-- Kernel Dispatch Counter Collection
-  - Instances are reported as single dimensions
+- HSA API tracing
+- Kernel dispatch tracing
+- Kernel dispatch counter collection
+  - Instances reported as single dimension
   - No serialization
 ## ROCprofiler-SDK for AFAR II
-### Additions
+### Added
-- HIP API Tracing
-- ROCTx
Tracing
+- HIP API tracing
+- ROCTx tracing
 - Tracing ROCProf Tool V3
-- Packaging Documentation
-- ROCTx start/stop
-- Memory Copy Tracing
+- Documentation packaging
+- ROCTx control (start and stop)
+- Memory copy tracing
 ## ROCprofiler-SDK for AFAR III
-### Additions
-
-- Kernel Dispatch Counter Collection – (includes serialization and multidimensional instances)
-- Kernel serialization
-- Serialization on/off handling
-- ROCprof Tool Plugin Interface V3 for Counters and Dimensions
-- List metrics support
-- Correlation-id retirement
-- HIP and HSA trace distinction
-  - --hip-runtime-trace For Collecting HIP Runtime API Traces
-  - --hip-compiler-trace For Collecting HIP Compiler generated code Traces
-  - --hsa-core-trace For Collecting HSA API Traces (core API)
-  - --hsa-amd-trace For Collecting HSA API Traces (AMD-extension API)
-  - --hsa-image-trace For Collecting HSA API Traces (Image-extension API)
-  - --hsa-finalizer-trace For Collecting HSA API Traces (Finalizer-extension API)
+### Added
+
+- Kernel dispatch counter collection. This includes serialization and multidimensional instances.
+- Kernel serialization.
+- Serialization control (on and off).
+- ROCprof tool plugin interface V3 for counters and dimensions.
+- Support to list metrics.
+- Correlation-Id retirement
+- HIP and HSA trace distinction:
+  - --hip-runtime-trace For collecting HIP Runtime API traces
+  - --hip-compiler-trace For collecting HIP compiler-generated code traces
+  - --hsa-core-trace For collecting HSA API traces (core API)
+  - --hsa-amd-trace For collecting HSA API traces (AMD-extension API)
+  - --hsa-image-trace For collecting HSA API traces (image-extension API)
+  - --hsa-finalizer-trace For collecting HSA API traces (finalizer-extension API)
 ## ROCprofiler-SDK for AFAR IV
-### Additions
+### Added
-- Page Migration Reporting (API)
-- Scratch Memory Reporting (API)
-- Kernel Dispatch Callback Tracing (API)
-- External Correlation ID Request Service (API)
-- Buffered counter collection record headers (API)
-- Remove HSA dependency from counter collection (API)
-- rocprofv3 Multi-GPU support in single-process (tool)
+**API:**
+
+- Page migration reporting
+- Scratch memory reporting
+- Kernel dispatch callback tracing
+- External correlation Id request service
+- Buffered counter collection record headers
+- Option to remove HSA dependency from counter collection
+
+**Tool:**
+
+- `rocprofv3` multi-GPU support in a single-process
 ## ROCprofiler-SDK for AFAR V
-### Additions
+### Added
+
+**API:**
-- Agent/Device Counter Collection (API)
-- Single JSON output format support (tool)
-- Perfetto output format support(.pftrace) (tool)
-- Input YAML support for counter collection (tool)
-- Input JSON support for counter collection (tool)
-- Application Replay (Counter collection)
-- PC Sampling (Beta)(API)
-- ROCProf V3 Multi-GPU Support:
-  - Multi-process (multiple files)
+- Agent or device counter collection
+- PC sampling (beta)
-### Fixes
+**Tool:**
-- SQ_ACCUM_PREV and SQ_ACCUM_PREV_HIRE overwriting issue
+- Single JSON output format support
+- Perfetto output format support (.pftrace)
+- Input YAML support for counter collection
+- Input JSON support for counter collection
+- Application replay in counter collection
+- 
`rocprofv3` multi-GPU support:
+  - Multiprocess (multiple files)
-### Changes
+### Changed
-- rocprofv3 tool now needs `--` in front of application. For detailed uses, please [Click Here](source/docs/rocprofv3.md)
+- `rocprofv3` tool now requires mentioning `--` before the application. For detailed use, see [Using rocprofv3](source/docs/how-to/using-rocprofv3.rst)
-## ROCprofiler-SDK for AFAR VI
+### Resolved issues
-### Additions
+- Fixed `SQ_ACCUM_PREV` and `SQ_ACCUM_PREV_HIRE` overwriting issue
-- OTF2 Tool Support
-- Kernel and Range Filtering
-- Counter Collection Definitions in YAML
-- Documentation updates (SQ Block, Counter Collection, Tracing, Tool Usage)
-- Added rocprofv3 option --kernel-rename
-- Added rocprofv3 options for perfetto settings (buffer size, etc.)
-- Added CSV columns for kernel trace
-  - Thread_Id
-  - Dispatch_Id
-- Added CSV column for counter_collection
+## ROCprofiler-SDK 0.4.0 for ROCm release 6.2 (AFAR VI)
-### Fixes
+### Added
-- Miscellaneous bug fixes
+- OTF2 tool support
+- Kernel and range filtering
+- Counter collection definitions in YAML
+- Documentation updates (SQ block, counter collection, tracing, tool usage)
+- `rocprofv3` option `--kernel-rename`
+- `rocprofv3` options for Perfetto settings (buffer size and so on)
+- CSV columns for kernel trace
+  - `Thread_Id`
+  - `Dispatch_Id`
+- CSV column for counter collection
-## ROCprofiler-SDK 0.5.0 for ROCm Release 6.3 (AFAR VII)
-### Additions
+## ROCprofiler-SDK 0.5.0 for ROCm release 6.3 (AFAR VII)
-### Changes
+### Added
-- Support `--marker-trace` on application linked against old (roctracer) ROCTx (i.e.
`libroctx64.so`)
-- Replaced deprecated hipHostMalloc and hipHostFree functions with hipExtHostAlloc and hipFreeHost in when ROCm version is greater than or equal to 6.3
+- Start and end timestamp columns to the counter collection csv output
+- Check to force tools to initialize context id with zero
+
+### Changed
+
+- `--marker-trace` option for `rocprofv3` now supports the legacy ROCTx library `libroctx64.so` when the application is linked against the new library `librocprofiler-sdk-roctx.so`.
+- Replaced deprecated `hipHostMalloc` and `hipHostFree` functions with `hipExtHostAlloc` and `hipFreeHost` for ROCm versions starting 6.3.
 - Updated `rocprofv3` `--help` options.
-- Adding start and end timestamp columns to the counter collection csv output.
-- Changed naming of agent profiling to device counting service (which more closely follows its name). To convert existing tool/user code to the new names, the following sed can be used: `find . -type f -exec sed -i 's/rocprofiler_agent_profile_callback_t/rocprofiler_device_counting_service_callback_t/g; s/rocprofiler_configure_agent_profile_counting_service/rocprofiler_configure_device_counting_service/g; s/agent_profile.h/device_counting_service.h/g; s/rocprofiler_sample_agent_profile_counting_service/rocprofiler_sample_device_counting_service/g' {} +`
-- Changed naming of dispatch profiling service to dispatch counting service (which more closely follows its name).
To convert existing tool/user code to the new names, the following sed can be used: `-type f -exec sed -i -e 's/dispatch_profile_counting_service/dispatch_counting_service/g' -e 's/dispatch_profile.h/dispatch_counting_service.h/g' -e 's/rocprofiler_profile_counting_dispatch_callback_t/rocprofiler_dispatch_counting_service_callback_t/g' -e 's/rocprofiler_profile_counting_dispatch_data_t/rocprofiler_dispatch_counting_service_data_t/g' -e 's/rocprofiler_profile_counting_dispatch_record_t/rocprofiler_dispatch_counting_service_record_t/g' {} +`
+- Changed naming of "agent profiling" to a more descriptive "device counting service". To convert existing tool or user code to the new name, use the following sed:
+`find . -type f -exec sed -i 's/rocprofiler_agent_profile_callback_t/rocprofiler_device_counting_service_callback_t/g; s/rocprofiler_configure_agent_profile_counting_service/rocprofiler_configure_device_counting_service/g; s/agent_profile.h/device_counting_service.h/g; s/rocprofiler_sample_agent_profile_counting_service/rocprofiler_sample_device_counting_service/g' {} +`
+- Changed naming of "dispatch profiling service" to a more descriptive "dispatch counting service". To convert existing tool or user code to the new names, the following sed can be used: `-type f -exec sed -i -e 's/dispatch_profile_counting_service/dispatch_counting_service/g' -e 's/dispatch_profile.h/dispatch_counting_service.h/g' -e 's/rocprofiler_profile_counting_dispatch_callback_t/rocprofiler_dispatch_counting_service_callback_t/g' -e 's/rocprofiler_profile_counting_dispatch_data_t/rocprofiler_dispatch_counting_service_data_t/g' -e 's/rocprofiler_profile_counting_dispatch_record_t/rocprofiler_dispatch_counting_service_record_t/g' {} +`
 - Support specifying HW counters via command-line in rocprofv3, e.g. `rocprofv3 --pmc [COUNTER [COUNTER ...]]`
-- FETCH_SIZE metric on gfx94x uses TCC_BUBBLE for 128B reads.
+- `FETCH_SIZE` metric on gfx94x now uses `TCC_BUBBLE` for 128B reads.
+- PMC dispatch-based counter collection serialization is now per-device instead of being global across all devices.
+
-### Fixes
+### Resolved issues
-- Creation of subdirection when rocprofv3 `--output-file` contains a folder path
-- Fix misaligned stores (undefined behavior) for buffer records
-- Fix crash when only scratch reporting is enabled
-- Fixed MeanOccupancy* metrics
-- Fix aborted-app validation test to properly check for hipExtHostAlloc command now that it is supported
-- Fix for SQ and GRBM metrics implicitly reduced.
-- Fix Support for derived counters in reduce operation and bug fix for max in reduce
-- Check to force tools to initialize context id with zero.
-- Fix to handle a range of values for select() dimension in expressions parser.
-- PMC dispatch based Counter Collection Serialization is now per-device instead of global across all devices.
+- Introduced subdirection when `rocprofv3 --output-file` used to specify a folder path
+- Fixed misaligned stores (undefined behavior) for buffer records
+- Fixed crash when only scratch reporting is enabled
+- Fixed `MeanOccupancy` metrics
+- Fixed aborted-application validation test to properly check for `hipExtHostAlloc` command
+- Fixed implicit reduction of SQ and GRBM metrics
+- Fixed support for derived counters in reduce operation
+- Bug fixed in max-in-reduce operation
+- Introduced fix to handle a range of values for `select()` dimension in expressions parser
 ### Removed
-- Removed gfx8 metric definitions.
-- Removed rocprofv3 installation to sbin directory.
+- Removed gfx8 metric definitions
+- Removed `rocprofv3` installation to sbin directory
diff --git a/source/docs/_toc.yml.in b/source/docs/_toc.yml.in
index f6987bf..b43bae6 100644
--- a/source/docs/_toc.yml.in
+++ b/source/docs/_toc.yml.in
@@ -16,11 +16,17 @@ subtrees:
   - caption: API reference
     entries:
       - file: api-reference/buffered_services
+        title: Buffered services
       - file: api-reference/callback_services
+        title: Callback tracing services
       - file: api-reference/counter_collection_services
+        title: Counter collection services
       - file: api-reference/intercept_table
+        title: Runtime intercept tables
       - file: api-reference/pc_sampling
+        title: PC sampling
       - file: api-reference/tool_library
+        title: Tool library
       - file: _doxygen/html/index
         title: API library
   - caption: Conceptual
diff --git a/source/docs/api-reference/buffered_services.md b/source/docs/api-reference/buffered_services.md
index 88bf80b..6fcccad 100644
--- a/source/docs/api-reference/buffered_services.md
+++ b/source/docs/api-reference/buffered_services.md
@@ -1,22 +1,23 @@
-# Buffered services
+---
+myst:
+  html_meta:
+    "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software."
+    "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK buffered services, Buffered services API"
+---
-For the buffered approach, supported buffer record categories are enumerated in `rocprofiler_buffer_category_t` category field.
+# ROCprofiler-SDK buffered services
-## Overview
+In the buffered approach, the internal (background) thread sends callbacks for batches of records.
+Supported buffer record categories are enumerated in `rocprofiler_buffer_category_t` category field and supported buffer tracing services are enumerated in `rocprofiler_buffer_tracing_kind_t`. Configuring
+a buffered tracing service requires buffer creation.
Flushing the buffer implicitly or explicitly invokes a callback to the tool, which provides an array of one or more buffer records. +To flush a buffer explicitly, use `rocprofiler_flush_buffer` function. -In buffered approach, callbacks are received for batches of records from an internal (background) thread. -Supported buffered tracing services are enumerated in `rocprofiler_buffer_tracing_kind_t`. Configuring -a buffer tracing service requires the creation of a buffer. When the buffer is "flushed", either implicitly -or explicitly, a callback to the tool will be invoked which provides an array of one or more buffer records. -A buffer can be explicitly flushed via the `rocprofiler_flush_buffer` function. +## Subscribing to buffer tracing services -## Subscribing to Buffer Tracing Services +During tool initialization, the tool configures callback tracing using `rocprofiler_configure_buffer_tracing_service` +function. However, before invoking `rocprofiler_configure_buffer_tracing_service`, the tool must create a buffer for the tracing records as shown in the following section. -During tool initialization, tools configure callback tracing via the `rocprofiler_configure_buffer_tracing_service` -function. However, before invoking `rocprofiler_configure_buffer_tracing_service`, the tool must create a buffer -for the tracing records. - -### Creating a Buffer +### Creating a buffer ```cpp rocprofiler_status_t @@ -29,49 +30,40 @@ rocprofiler_create_buffer(rocprofiler_context_id_t context, rocprofiler_buffer_id_t* buffer_id); ``` -The `size` parameter is the size of the buffer in bytes and will be rounded up to the nearest -memory page size (defined by `sysconf(_SC_PAGESIZE)`); the default memory page size on Linux +Here are the parameters required to create a buffer: + +- `size`: Size of the buffer in bytes, which is rounded up to the nearest +memory page size (defined by `sysconf(_SC_PAGESIZE)`). The default memory page size on Linux is 4096 bytes (4 KB). 
-The `watermark` parameter specifies the number of bytes at which -the buffer should be "flushed", i.e. when the records in the buffer should invoke the -`callback` parameter to deliver the records to the tool. For example, if a buffer has a size -of 4096 bytes and the watermark is set to 48 bytes, six 8-byte records can be placed in the +- `watermark`: Specifies the number of bytes at which the buffer should be flushed. To flush the buffer, the records in the buffer must invoke the `callback` parameter to deliver the records to the tool. For example, for a buffer of size 4096 bytes with the watermark set to 48 bytes, six 8-byte records can be placed in the buffer before `callback` is invoked. However, every 64-byte record that is placed in the buffer will trigger a flush. It is safe to set the `watermark` to any value between zero and the buffer size. -The `policy` parameter specifies the behavior for when a record is larger than the -amount of free space in the current buffer. For example, if a buffer has a size of -4000 bytes with a watermark set to 4000 bytes and 3998 of the bytes in the buffer -have been populated with records, the `policy` dictates how to handle an incoming record > -2 bytes. The `ROCPROFILER_BUFFER_POLICY_DISCARD` policy dictates that all records greater -than should 2 bytes should be dropped until the tool _explicitly_ flushes the buffer via -a `rocprofiler_flush_buffer` function call whereas the `ROCPROFILER_BUFFER_POLICY_LOSSLESS` -policy dictates that the current buffer should be swapped out for an empty buffer and placed -in that new buffer and former (full) buffer should be _implicitly_ flushed. - -The `callback` parameter is the function that rocprofiler-sdk should invoke when flushing -the buffer; the value of the `callback_data` parameter will be passed as one of the arguments -to the `callback` function. 
- -The `buffer_id` parameter is an output parameter for the function call and will have a +- `policy`: Specifies the behavior when a record is larger than the +amount of free space in the current buffer. For example, for a buffer of size 4000 bytes with the watermark set to 4000 bytes and 3998 bytes populated with records, the `policy` dictates how to handle an incoming record greater than 2 bytes. If the environment variable `ROCPROFILER_BUFFER_POLICY_DISCARD` is enabled, all records greater than 2 bytes are dropped until the tool _explicitly_ flushes the buffer using `rocprofiler_flush_buffer` function call whereas, if the environment variable `ROCPROFILER_BUFFER_POLICY_LOSSLESS` is enabled, the current buffer is swapped out for an empty buffer and placed in the new buffer while the former (full) buffer is _implicitly_ flushed. + +- `callback`: Invoked to flush the buffer. + +- `callback_data`: Value passed as one of the arguments to the `callback` function. + +- `buffer_id`: Output parameter for the function call to contain a non-zero handle field after successful buffer creation. -### Creating a Dedicated Thread for Buffer Callbacks +### Creating a dedicated thread for buffer callbacks -By default, all buffers will use the same (default) background thread created by rocprofiler-sdk to -invoke their callback. However, rocprofiler-sdk provides an interface for tools to specify the -creation of an additional background thread for one or more of their buffers. +By default, all buffers use the same (default) background thread created by ROCprofiler-SDK to +invoke their callback. However, ROCprofiler-SDK provides an interface to allow the tools to create an additional background thread for one or more of their buffers. 
-Callback threads for buffers are created via the `rocprofiler_create_callback_thread` function: +To create callback threads for buffers, use `rocprofiler_create_callback_thread` function: ```cpp rocprofiler_status_t rocprofiler_create_callback_thread(rocprofiler_callback_thread_t* cb_thread_id); ``` -Buffers are assigned to that callback thread via the `rocprofiler_assign_callback_thread` function: +To assign buffers to that callback thread, use `rocprofiler_assign_callback_thread` function: ```cpp rocprofiler_status_t @@ -79,7 +71,7 @@ rocprofiler_assign_callback_thread(rocprofiler_buffer_id_t buffer_id, rocprofiler_callback_thread_t cb_thread_id); ``` -#### Buffer Callback Thread Creation and Assignment Example +**Example:** ```cpp { @@ -101,7 +93,9 @@ rocprofiler_assign_callback_thread(rocprofiler_buffer_id_t buffer_id, } ``` -### Configuring Buffer Tracing Services +### Configuring buffer tracing services + +To configure buffer tracing services, use: ```cpp rocprofiler_status_t @@ -112,20 +106,21 @@ rocprofiler_configure_buffer_tracing_service(rocprofiler_context_id_t c rocprofiler_buffer_id_t buffer_id); ``` -The `kind` parameter is a high-level specifier of which service to trace (also known as a "domain"). -Domain examples include, but are not limited to, the HIP API, the HSA API, and kernel dispatches. -For each domain, there are (often) various "operations", which can be used to restrict the callbacks -to a subset within the domain. For domains which correspond to APIs, the "operations" are the functions -which compose the API. If all operations in a domain should be traced, the `operations` and `operations_count` -parameters can be set to `nullptr` and `0`, respectively. If the tracing domain should be restricted to a subset -of operations, the tool library should specify a C-array of type `rocprofiler_tracing_operation_t` and the -size of the array for the `operations` and `operations_count` parameter. 
+Here are the parameters required to configure buffer tracing services: + +- `kind`: A high-level specification of the services to be traced. This parameter is also known as "domain". +Domain examples include, but not limited to, the HIP API, HSA API, and kernel dispatches. -Similar to `rocprofiler_configure_callback_tracing_service`, -`rocprofiler_configure_buffer_tracing_service` will return an error if a buffer service for given context -and given domain is configured more than once. +- `operations`: For each domain, there are often various `operations` that can be used to restrict the callbacks to a subset within the domain. For domains corresponding to APIs, the `operations` are the functions +composing the API. To trace all operations in a domain, set the `operations` and `operations_count` +parameters to `nullptr` and `0` respectively. To restrict the tracing domain to a subset +of operations, the tool library must specify a C-array of type `rocprofiler_tracing_operation_t` for `operations` and size of the array for the `operations_count` parameter. -#### Example +Similar to the `rocprofiler_configure_callback_tracing_service`, +`rocprofiler_configure_buffer_tracing_service` returns an error if a buffer service for the specified context +and domain is configured more than once. + +**Example:** ```cpp { @@ -158,9 +153,9 @@ and given domain is configured more than once. 
} ``` -## Buffer Tracing Callback Function +## Buffer tracing callback function -Rocprofiler-sdk buffer tracing callback functions have the signature: +Here is the buffer tracing callback function: ```cpp typedef void (*rocprofiler_buffer_tracing_cb_t)(rocprofiler_context_id_t context, @@ -171,20 +166,14 @@ typedef void (*rocprofiler_buffer_tracing_cb_t)(rocprofiler_context_id_t co uint64_t drop_count); ``` -The `rocprofiler_record_header_t` data type provides three pieces of information: +The `rocprofiler_record_header_t` data type contains the following information: + +- `category` (`rocprofiler_buffer_category_t`): The `category` is used to classify the buffer record. For all +services configured via `rocprofiler_configure_buffer_tracing_service`, the `category` is equal to the value of `ROCPROFILER_BUFFER_CATEGORY_TRACING`. The other available categories are `ROCPROFILER_BUFFER_CATEGORY_PC_SAMPLING` and `ROCPROFILER_BUFFER_CATEGORY_COUNTERS`. -1. Category (`rocprofiler_buffer_category_t`) -2. Kind -3. Payload +- `kind`: The `kind` field is dependent on the `category`. For example, for `category` `ROCPROFILER_BUFFER_CATEGORY_TRACING`, the value of `kind` depicts the tracing type such as HSA core API in `ROCPROFILER_BUFFER_TRACING_HSA_CORE_API`. -The category is used to distinguish the classification of the buffer record. For all -services configured via `rocprofiler_configure_buffer_tracing_service`, the category will -be equal to the value of `ROCPROFILER_BUFFER_CATEGORY_TRACING`. The meaning of the kind -field is dependent on the category but when the category is `ROCPROFILER_BUFFER_CATEGORY_TRACING`, -the kind value will be equivalent to the is used -to distinguish the `rocprofiler_buffer_tracing_kind_t` value passed to -`rocprofiler_configure_buffer_tracing_service`, e.g. `ROCPROFILER_BUFFER_TRACING_KERNEL_DISPATCH`. 
-Once the category and kind have been determined, the payload can be casted: +- `payload`: The `payload` is casted after the category and kind have been determined. ```cpp { @@ -199,7 +188,7 @@ Once the category and kind have been determined, the payload can be casted: } ``` -### Buffer Tracing Callback Function Example +**Example:** ```cpp void @@ -238,14 +227,12 @@ buffer_callback_func(rocprofiler_context_id_t context, } ``` -## Buffer Tracing Record +## Buffer tracing record Unlike callback tracing records, there is no common set of data for each buffer tracing record. However, -many buffer tracing records contain a `kind` field and an `operation` field. -The name of a tracing kind can be obtained via the `rocprofiler_query_buffer_tracing_kind_name` function. -The name of an operation specific to a tracing kind can be obtained via the `rocprofiler_query_buffer_tracing_kind_operation_name` -function. One can also iterate over all the buffer tracing kinds and operations for each tracing kind via the +many buffer tracing records contain a `kind` and an `operation` field. +You can obtain the value for the `kind` of tracing using `rocprofiler_query_buffer_tracing_kind_name` function and the value for the `operation` specific to a tracing kind using the `rocprofiler_query_buffer_tracing_kind_operation_name` +function. You can also iterate over all the buffer tracing `kinds` and `operations` for each tracing kind using the `rocprofiler_iterate_buffer_tracing_kinds` and `rocprofiler_iterate_buffer_tracing_kind_operations` functions. -The buffer tracing record data types can be found in the `rocprofiler-sdk/buffer_tracing.h` header -(`source/include/rocprofiler-sdk/buffer_tracing.h` in the [rocprofiler-sdk GitHub repository](https://github.com/ROCm/rocprofiler-sdk)). 
+The buffer tracing record data types are available in the [rocprofiler-sdk/buffer_tracing.h](https://github.com/ROCm/rocprofiler-sdk/blob/amd-mainline/source/include/rocprofiler-sdk/buffer_tracing.h) header. diff --git a/source/docs/api-reference/callback_services.md b/source/docs/api-reference/callback_services.md index 7bcb4d6..85274cd 100644 --- a/source/docs/api-reference/callback_services.md +++ b/source/docs/api-reference/callback_services.md @@ -1,15 +1,19 @@ -# Callback tracing services +--- +myst: + html_meta: + "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software." + "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK callback services, Callback services API" +--- -## Overview +# ROCprofiler-SDK callback tracing services -Callback tracing services provide immediate callbacks to a tool on the current CPU thread when a given event occurs. -For example, when tracing an API function, e.g. `hipSetDevice`, callback tracing invokes a user-specified callback -before and after the traced function executes on the thread which is invoking the API function. +Callback tracing services provide immediate callbacks to a tool on the current CPU thread on the occurrence of an event. +For example, when tracing an API function such as `hipSetDevice`, callback tracing invokes a user-specified callback +before and after the traced function executes on the thread invoking the API function. 
-## Subscribing to Callback Tracing Services +## Subscribing to callback tracing services -During tool initialization, tools configure callback tracing via the `rocprofiler_configure_callback_tracing_service` -function: +During tool initialization, tools configure callback tracing using: ```cpp rocprofiler_status_t @@ -21,18 +25,20 @@ rocprofiler_configure_callback_tracing_service(rocprofiler_context_id_t void* callback_args); ``` -The `kind` parameter is a high-level specifier of which service to trace (also known as a "domain"). -Domain examples include, but are not limited to, the HIP API, the HSA API, and kernel dispatches. -For each domain, there are (often) various "operations", which can be used to restrict the callbacks -to a subset within the domain. For domains which correspond to APIs, the "operations" are the functions -which compose the API. If all operations in a domain should be traced, the `operations` and `operations_count` -parameters can be set to `nullptr` and `0`, respectively. If the tracing domain should be restricted to a subset -of operations, the tool library should specify a C-array of type `rocprofiler_tracing_operation_t` and the -size of the array for the `operations` and `operations_count` parameter. +Here are the parameters required to configure callback tracing services: -`rocprofiler_configure_callback_tracing_service` will return an error if a callback service for given context -and given domain is configured more than once. For example, if one only wanted to trace two functions within -the HIP runtime API, `hipGetDevice` and `hipSetDevice`, the following code would accomplish this objective: +- `kind`: A high-level specification of the services to be traced. This parameter is also known as "domain". +Domain examples include, but not limited to, the HIP API, HSA API, and kernel dispatches. 
+ +- `operations`: For each domain, there are often various `operations` that can be used to restrict the callbacks to a subset within the domain. For domains corresponding to APIs, the `operations` are the functions +composing the API. To trace all operations in a domain, set the `operations` and `operations_count` +parameters to `nullptr` and `0` respectively. To restrict the tracing domain to a subset +of operations, the tool library must specify a C-array of type `rocprofiler_tracing_operation_t` for `operations` and size of the array for the `operations_count` parameter. + +`rocprofiler_configure_callback_tracing_service` returns an error if a callback service for the specified context and domain is configured more than once. + +**Example:** To trace only two functions within +the HIP runtime API, `hipGetDevice` and `hipSetDevice`: ```cpp { @@ -55,7 +61,7 @@ the HIP runtime API, `hipGetDevice` and `hipSetDevice`, the following code would } ``` -But the following code would be invalid: +The following code returns error `ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED` as the callback service is already configured: ```cpp { @@ -70,7 +76,7 @@ But the following code would be invalid: for(auto op : operations) { - // after the first iteration, will return ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED + // after the first iteration, returns ROCPROFILER_STATUS_ERROR_SERVICE_ALREADY_CONFIGURED rocprofiler_configure_callback_tracing_service(ctx, ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API, &op, @@ -83,9 +89,9 @@ But the following code would be invalid: } ``` -## Callback Tracing Callback Function +## Callback tracing callback function -Rocprofiler-sdk callback tracing callback functions have the signature: +Here is the callback tracing callback function: ```cpp typedef void (*rocprofiler_callback_tracing_cb_t)(rocprofiler_callback_tracing_record_t record, @@ -93,12 +99,13 @@ typedef void (*rocprofiler_callback_tracing_cb_t)(rocprofiler_callback_tracing_r 
void* callback_data) ``` -The `record` parameter contains the information to uniquely identify a tracing record type and has the -following definition: +The parameters `record` and `user_data` are discussed here: -```cpp -typedef struct rocprofiler_callback_tracing_record_t -{ +- `record`: Contains the information to uniquely identify a tracing record type. Here is the definition: + + ```cpp + typedef struct rocprofiler_callback_tracing_record_t + { rocprofiler_context_id_t context_id; rocprofiler_thread_id_t thread_id; rocprofiler_correlation_id_t correlation_id; @@ -106,26 +113,25 @@ typedef struct rocprofiler_callback_tracing_record_t uint32_t operation; rocprofiler_callback_phase_t phase; void* payload; -} rocprofiler_callback_tracing_record_t; -``` - -The underlying type of `payload` field above is typically unique to a domain and, less frequently, an operation. -For example, for the `ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API` and `ROCPROFILER_CALLBACK_TRACING_HIP_COMPILER_API`, -the payload should be casted to `rocprofiler_callback_tracing_hip_api_data_t*` -- which will contain the arguments -to the function and (in the exit phase) the return value of the function. The payload field will only be a valid -pointer during the invocation of the callback function(s). - -The `user_data` parameter can be used to store data in between callback phases. It is a unique for every -instance of an operation. For example, if the tool library wishes to store the timestamp of the + } rocprofiler_callback_tracing_record_t; + ``` + The underlying type of `payload` field is typically unique to a domain and, less frequently, an operation. + For example, for the `ROCPROFILER_CALLBACK_TRACING_HIP_RUNTIME_API` and `ROCPROFILER_CALLBACK_TRACING_HIP_COMPILER_API`, + the payload must be casted to `rocprofiler_callback_tracing_hip_api_data_t*`, which contains the arguments + to the function and the return value when exiting the function. 
The payload field is a valid + pointer only during the invocation of the callback function(s). + +- `user_data`: Stores data in between callback phases. This value is unique for every +instance of an operation. For example, for a tool library to store the timestamp of the `ROCPROFILER_CALLBACK_PHASE_ENTER` phase for the ensuing `ROCPROFILER_CALLBACK_PHASE_EXIT` callback, -this data can be stored in a method similar to below: +the data can be stored using: -```cpp -void -callback_func(rocprofiler_callback_tracing_record_t record, + ```cpp + void + callback_func(rocprofiler_callback_tracing_record_t record, rocprofiler_user_data_t* user_data, void* cb_data) -{ + { auto ts = rocprofiler_timestamp_t{}; rocprofiler_get_timestamp(&ts); @@ -142,26 +148,20 @@ callback_func(rocprofiler_callback_tracing_record_t record, { // ... etc. ... } -} -``` + } + ``` -The `callback_data` argument will be the value of `callback_args` passed to `rocprofiler_configure_callback_tracing_service` -in [the previous section](#subscribing-to-callback-tracing-services). + The `callback_data` is passed to `rocprofiler_configure_callback_tracing_service` as the value of `callback_args` to [subscribe to callback tracing services](#subscribing-to-callback-tracing-services). -## Callback Tracing Record +## Callback tracing record -The name of a tracing kind can be obtained via the `rocprofiler_query_callback_tracing_kind_name` function. -The name of an operation specific to a tracing kind can be obtained via the `rocprofiler_query_callback_tracing_kind_operation_name` -function. One can also iterate over all the callback tracing kinds and operations for each tracing kind via the -`rocprofiler_iterate_callback_tracing_kinds` and `rocprofiler_iterate_callback_tracing_kind_operations` functions. -Lastly, for a given `rocprofiler_callback_tracing_record_t` object, rocprofiler-sdk supports generically iterating over -the arguments of the payload field for many domains. 
+To obtain the name of the `kind` of tracing, you can use `rocprofiler_query_callback_tracing_kind_name` function and to obtain the name of an `operation` specific to a tracing kind, use `rocprofiler_query_callback_tracing_kind_operation_name` +function. To iterate over all the callback tracing kinds and operations for each tracing kind, use `rocprofiler_iterate_callback_tracing_kinds` and `rocprofiler_iterate_callback_tracing_kind_operations` functions. -As mentioned above, within the `rocprofiler_callback_tracing_record_t` object, -an opaque `void* payload` is provided for accessing domain specific information. -The data types generally follow the naming convention of `rocprofiler_callback_tracing__data_t`, -e.g., for the tracing kinds `ROCPROFILER_BUFFER_TRACING_HSA_{CORE,AMD_EXT,IMAGE_EXT,FINALIZE_EXT}_API`, -the payload should be casted to `rocprofiler_callback_tracing_hsa_api_data_t*`: +Lastly, for a specified `rocprofiler_callback_tracing_record_t` object, ROCprofiler-SDK supports generically iterating over the arguments of the payload field for many domains. Within the `rocprofiler_callback_tracing_record_t` object, the domain-specific information is available in +an opaque `void* payload`. +The data types generally follow the naming convention of `rocprofiler_callback_tracing__data_t`. 
For example, for the tracing kinds `ROCPROFILER_CALLBACK_TRACING_HSA_{CORE,AMD_EXT,IMAGE_EXT,FINALIZE_EXT}_API`, +cast the payload to `rocprofiler_callback_tracing_hsa_api_data_t*`: ```cpp void @@ -205,7 +205,7 @@ callback_func(rocprofiler_callback_tracing_record_t record, } ``` -### Sample `rocprofiler_iterate_callback_tracing_kind_operation_args` +**Example:** Iterating over the arguments of a callback tracing record using `rocprofiler_iterate_callback_tracing_kind_operation_args`: ```cpp int @@ -263,7 +263,7 @@ callback_func(rocprofiler_callback_tracing_record_t record, } ``` -Sample Output: +**Sample output:** ```console @@ -283,55 +283,57 @@ Sample Output: 4: hipStream_t* stream = 0x25dfcf0 ``` -## Code Object Tracing +## Code object tracing The code object tracing service is a critical component for obtaining information regarding asynchronous activity on the GPU. The `rocprofiler_callback_tracing_code_object_load_data_t` payload (kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`, operation=`ROCPROFILER_CODE_OBJECT_LOAD`) -provides a unique identifier for a bundle of one or more GPU kernel symbols which have been loaded -for a specific GPU agent. For example, if your application is leveraging a multi-GPU system system -containing 4 Vega20 GPUs and 4 MI100 GPUs, there will at least 8 code objects loaded: one code -object for each GPU. Each code object will be associated with a set of kernel symbols: -the `rocprofiler_callback_tracing_code_object_kernel_symbol_register_data_t` payload +provides a unique identifier for a bundle of one or more GPU kernel symbols that are loaded +for a specific GPU agent. For example, if your application leverages a multi-GPU system +consisting of four Vega20 GPUs and four MI100 GPUs, at least eight code objects will be loaded: one code +object for each GPU. Each code object will be associated with a set of kernel symbols.
+The `rocprofiler_callback_tracing_code_object_kernel_symbol_register_data_t` payload (kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT`, operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER`) provides a globally unique identifier for the specific kernel symbol along with the kernel name and -several other static properties of the kernel (e.g. scratch size, scalar general purpose register count, etc.). -Note: two otherwise identical kernel symbols (same kernel name, scratch size, etc.) which are part of -otherwise identical code objects but the code objects are loaded for different GPU agents ***will*** have unique -kernel identifiers. Furthermore, if the same code object (and it's kernel symbols) are unloaded and then -re-loaded, that code object and all of it's kernel symbols ***will*** be given new unique identifiers. +several other static properties of the kernel such as scratch size, scalar general purpose register count, and so on. + +:::{note} +The kernel identifiers for two identical kernel symbols with the same properties (kernel name, scratch size, and so on) that are part of similar code objects loaded for different GPU agents will still be unique. Furthermore, the identifier for a code object and its kernel symbols after being unloaded and then +reloaded, will also be unique. +::: -In general, when a code object is loaded and unloaded, here is the sequence of events: +Here is the general sequence of events when a code object is loaded and unloaded: -1. Callback: code object load +1. Callback: load code object - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT` - operation=`ROCPROFILER_CODE_OBJECT_LOAD` - phase=`ROCPROFILER_CALLBACK_PHASE_LOAD` -2. Callback: kernel symbol load +2. Callback: load kernel symbol - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT` - operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER` - phase=`ROCPROFILER_CALLBACK_PHASE_LOAD` - Repeats for each kernel symbol in code object -3. Application Execution -4. 
Callback: kernel symbol unload +3. Execute application +4. Callback: unload kernel symbol - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT` - operation=`ROCPROFILER_CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER` - phase=`ROCPROFILER_CALLBACK_PHASE_UNLOAD` - Repeats for each kernel symbol in code object -5. Callback: code object unload +5. Callback: unload code object - kind=`ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT` - operation=`ROCPROFILER_CODE_OBJECT_LOAD` - phase=`ROCPROFILER_CALLBACK_PHASE_UNLOAD` -Note: rocprofiler-sdk does not provide an interface to query this information outside of the -code object tracing service. If you wish to be able to associate kernel names with kernel tracing records, -a tool is personally responsible for making a copy of the relevant information when the code objects and -kernel symbol are loaded (however, any constant string fields like the (`const char* kernel_name` field) -need not to be copied, these are guaranteed to be valid pointers until after rocprofiler-sdk finalization). -If a tool decides to delete their copy of the data associated with a given code object or kernel symbol +:::{note} +ROCprofiler-SDK doesn't provide an interface to query information outside of the +code object tracing service. If you wish to associate kernel names with kernel tracing records, +the tool must be configured to create a copy of the relevant information when the code objects and +kernel symbol are loaded. However, any constant string fields like `const char* kernel_name` +don't need to be copied as these are guaranteed to be valid pointers until after ROCprofiler-SDK finalization. 
+If a tool decides to delete its copy of the data associated with a code object or kernel symbol identifier when the code object and kernel symbols are unloaded, it is highly recommended to flush -any/all buffers which might contain references to that code object or kernel symbol identifiers before +all buffers that might contain references to that code object or kernel symbol identifier before deleting the associated data. +::: -For a sample of code object tracing, please see the `samples/code_object_tracing` example in the -[rocprofiler-sdk GitHub repository](https://github.com/ROCm/rocprofiler-sdk). +For a sample of code object tracing, see [samples/code_object_tracing](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples/code_object_tracing). diff --git a/source/docs/api-reference/counter_collection_services.md b/source/docs/api-reference/counter_collection_services.md index c60bd88..cc47998 100644 --- a/source/docs/api-reference/counter_collection_services.md +++ b/source/docs/api-reference/counter_collection_services.md @@ -1,14 +1,29 @@ -# Counter collection services +--- +myst: + html_meta: + "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software." + "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK counter collection services, Counter collection services API" +--- + +# ROCprofiler-SDK counter collection services + +There are two modes of counter collection service: + +- Dispatch profiling: In this mode, counters are collected on a per-kernel launch basis. This mode is useful for collecting highly detailed counters for a specific kernel execution in isolation. Note that dispatch profiling allows only a single kernel to execute in hardware at a time. + +- Agent profiling: In this mode, counters are collected on a device level. 
This mode is useful for collecting device level counters not tied to a specific kernel execution, which encompasses collecting counter values for a specific time range. + +This topic explains how to setup dispatch and agent profiling and use common counter collection APIs. For details on the APIs including the less commonly used counter collection APIs, see the API library. For fully functional examples of both dispatch and agent profiling, see [Samples](https://github.com/ROCm/rocprofiler-sdk/tree/amd-mainline/samples). ## Definitions -*Profile Config*: A configuration to specify what counters should be collected on an agent. This needs to be supplied to various counter collection APIs to initiate collection of counter data. Profiles are agent specific and cannot be used on different agents. +Profile Config: A configuration to specify the counters to be collected on an agent. This must be supplied to various counter collection APIs to initiate collection of counter data. Profiles are agent-specific and can't be used on different agents. -*Counter ID*: Unique ID (per-architecture) that specifies the counter. The counter interface can be used to fetch information about the counter (such as its name or expression). +Counter ID: Unique Id (per-architecture) that specifies the counter. The counter Id can be used to fetch counter information such as its name or expression. -*Instance ID*: Unique record id encoding both the counter id and dimension for a specific collected value. +Instance ID: Unique record Id that encodes the counter Id and dimension for a collected value. -*Dimension*: Dimensions provide context to the raw counter values to specify the specific hardware register (such as shader engine) that the value was collected from. All counter values have dimension data encoded in its instance id and functions in the counter interface can be used to extract the values for individual dimensions. 
There following dimensions are currently supported by rocprofiler-sdk: +Dimension: Dimensions help to provide context to the raw counter values by specifying the hardware register that is the source of counter collection such as a shader engine. All counter values have dimension data encoded in their instance Id, which allows you to extract the values for individual dimensions using functions in the counter interface. The following dimensions are supported: ```c ROCPROFILER_DIMENSION_XCC, ///< XCC dimension of result @@ -20,15 +35,17 @@ ROCPROFILER_DIMENSION_INSTANCE, ///< From unspecified hardware register ``` -## Using The Counter Collection Service - -There are two modes for the counter collection service: *dispatch profiling* where counters are collected on a per kernel launch basis and *agent profiling* where counters are collected on a device level. Dispatch profiling is useful for collecting highly detailed counters for a specific kernel execution in isolation (Note: dispatch profiling allows only a single kernel to execute in hardware at a time). Agent profiling is useful for collecting device level counters not tied to a specific kernel execution (i.e. collecting counter values for a specific time range). +## Using the counter collection service -This guide explains how to setup dispatch and agent profiling along will describing the usage of the common counter collection APIs. More detail on the APIs themselves (as well as non-common options) is available in the API documentation. Fully functional examples of both dispatch and agent profiling can be found on the sample directory of rocprofiler-sdk. +The setup for dispatch and agent profiling is similar with only minor changes needed to adapt code from one to another. +Here are the steps required to configure the counter collection services: ### tool_init() setup -The setup for dispatch and agent profiling is similar (with only minor changes needed to adapt code from one to another). 
In tool_init, similar to tracing services, you need to create a context and a buffer to collect the output. Important Note: buffered_callback in rocprofiler_create_buffer is called when the buffer is full with a vector of collected counter samples, see the buffered callback section below for processing. +Similar to tracing services, you must create a context and a buffer to collect the output when initializing the tool. +:::{note} +`Buffered_callback` in `rocprofiler_create_buffer` is invoked with a vector of collected counter samples, when the buffer is full. For details, see the [Buffered callback](#buffered-callback) section. +::: ```CPP rocprofiler_context_id_t ctx{0}; @@ -44,13 +61,13 @@ ROCPROFILER_CALL(rocprofiler_create_buffer(ctx, "buffer creation failed"); ``` -After creating a context and buffer to store results, it is highly recommended (but not required) that you construct the profiles for each agent containing the counters you wish to collect in tool_init. Profile creation has a high time cost associated with it due to validating that the counters can be collected on the agent and thus should be avoided in the time critical dispatch profiling callback. After profile setup, the collection service for dispatch or agent profiling can be setup. The following two calls can be used to setup either dispatch or agent profiling (only one can be in use at a time). +After creating a context and buffer to store results in `tool_init`, it is highly recommended but not mandatory for you to construct the profiles for each agent, containing the counters for collection. Profile creation should be avoided in the time critical dispatch profiling callback as it involves validating if the counters can be collected on the agent. After profile setup, you can set up the collection service for dispatch or agent profiling. 
To set up either dispatch or agent profiling (only one can be used at a time), use: ```CPP /* For Dispatch Profiling */ // Setup the dispatch profile counting service. This service will trigger the dispatch_callback // when a kernel dispatch is enqueued into the HSA queue. The callback will specify what - // counters to collect by returning a profile config id. + // counters to collect by returning a profile config id. ROCPROFILER_CALL(rocprofiler_configure_buffered_dispatch_counting_service( ctx, buff, dispatch_callback, nullptr), "Could not setup buffered service"); @@ -63,9 +80,9 @@ After creating a context and buffer to store results, it is highly recommended ( "Could not setup buffered service"); ``` -#### Profile Setup +#### Profile setup -The first step in constructing a counter collection profile is to find the GPU agents on the machine. A profile will need to be created for each set of counters you want to collect on every agent on the machine. You can use rocprofiler_query_available_agents to find agents on the system. The below example will collect all GPU agents on the device and store them in the vector agents. +1. The first step in constructing a counter collection profile is to find the GPU agents on the machine. You must create a profile for each set of counters to be collected on every agent on the machine. You can use `rocprofiler_query_available_agents` to find agents on the system. The following example collects all GPU agents on the device and stores them in the vector agents: ```CPP std::vector agents; @@ -98,7 +115,7 @@ The first step in constructing a counter collection profile is to find the GPU a "query available agents"); ``` -To identify the counters that an agent supports, you can query the available counters with rocprofiler_iterate_agent_supported_counters. An example with a single agent (returning the available counters in gpu_counters) would be the following: +2. 
To identify the counters supported by an agent, query the available counters with `rocprofiler_iterate_agent_supported_counters`. Here is an example of a single agent returning the available counters in `gpu_counters`: ```CPP std::vector gpu_counters; @@ -122,7 +139,7 @@ To identify the counters that an agent supports, you can query the available cou "Could not fetch supported counters"); ``` -rocprofiler_counter_id_t is a handle to a counter. The information about the counter (such as its name) can be fetched using rocprofiler_query_counter_info. +3. `rocprofiler_counter_id_t` is a handle to a counter. To fetch information about the counter such as its name, use `rocprofiler_query_counter_info`: ```CPP for(auto& counter : gpu_counters) @@ -137,7 +154,7 @@ rocprofiler_counter_id_t is a handle to a counter. The information about the cou } ``` -After you have identified a set of counters you wish to collect, a profile can be constructed by passing a list of these counters to rocprofiler_create_profile_config. +4. After identifying the counters to be collected, construct a profile by passing a list of these counters to `rocprofiler_create_profile_config`. ```C++ // Create and return the profile @@ -147,18 +164,21 @@ After you have identified a set of counters you wish to collect, a profile can b "Could not construct profile cfg"); ``` -The created profile can in turn be used for both dispatch and agent counter collection services. +5. You can use the created profile for both dispatch and agent counter collection services. -##### Special Notes On Profile Behavior +:::{note} + +Points to note on profile behavior: - Profile created is *only valid* for the agent it was created for. -- Profiles are immutable. If a new counter set is desired to be collected, construct a new profile. -- A single profile can be used multiple times on the same agent. 
-- Counter IDs that are supplied to rocprofiler_create_profile_config are *agent specific* and cannot be used to construct profiles for other agents. +- Profiles are immutable. To collect a new counter set, construct a new profile. +- A single profile can be used multiple times on the same agent. +- Counter Ids supplied to `rocprofiler_create_profile_config` are *agent-specific* and can't be used to construct profiles for other agents. +::: -### Dispatch Profiling Callback +### Dispatch profiling callback -When a kernel is dispatched, a dispatch callback is issued to the tool to allow for the selection of counters to collect for the dispatch (via supplying a profile). +When a kernel is dispatched, a dispatch callback is issued to the tool to allow selection of counters to be collected for the dispatch by supplying a profile. ```CPP void @@ -168,11 +188,11 @@ dispatch_callback(rocprofiler_dispatch_counting_service_data_t dispatch_data, void* /*callback_data_args*/) ``` -Dispatch data contains information about the dispatch that is being launched (such as its name) and config is where the tool can specify the profile (and in turn counters) to collect for the dispatch. If no profile is supplied, no counters are collected for this dispatch. User data contains user data supplied to rocprofiler_configure_buffered_dispatch_counting_service. +`dispatch_data` contains information about the dispatch being launched such as its name. `config` is used by the tool to specify the profile, which determines the counters collected for the dispatch. If no profile is supplied, no counters are collected for this dispatch. `user_data` contains user data supplied to `rocprofiler_configure_buffered_dispatch_counting_service`. -### Agent Set Profile Callback +### Agent set profile callback -This callback is called when the context is started and allows for the tool to specify the profile to be used.
+This callback is invoked after the context starts and allows the tool to specify the profile to be used. ```CPP void @@ -182,11 +202,11 @@ set_profile(rocprofiler_context_id_t context_id, void*) ``` -The profile to be used for this agent is specified by calling set_config(agent, profile). +The profile to be used for this agent is specified by calling `set_config(agent, profile)`. -### Buffered Callback +### Buffered callback -Data from collected counter values is returned via a buffered callback. The buffered callback routines are similar between dispatch and agent profiling with the exception that some data (such as kernel launch ids) are not available in agent profiling mode. A sample iteration to print out counter collection data is the following: +Data from collected counter values is returned through a buffered callback. The buffered callback routines are similar for dispatch and agent profiling except that some data such as kernel launch Ids is not available in agent profiling mode. Here is a sample iteration to print out counter collection data: ```CPP for(size_t i = 0; i < num_headers; ++i) @@ -225,14 +245,14 @@ Data from collected counter values is returned via a buffered callback. The buff } ``` -## Counter Definitions +## Counter definitions -Counters are defined in yaml format in the file counter_defs.yaml. The counter definition has the following format +Counters are defined in yaml format in the `counter_defs.yaml` file. The counter definition has the following format: ```yaml counter_name: # Counter name architectures: - gfx90a: # Architecture name + gfx90a: # Architecture name block: # Block information (SQ/etc) event: # Event ID (used by AQLProfile to identify counter register) expression: # Formula for the counter (if derived counter) @@ -242,11 +262,12 @@ counter_name: # Counter name description: # Description of the counter ``` -Architectures can be separately defined with their own definitions (i.e. gfx90a and gfx1010 in the above example). 
If two or more architectures share the same block/event/expression definition, they can be "/" delimited on a single line (i.e. "gfx90a/gfx1010:"). Hardware metrics have the elements block, event, and description defined. Derived metrics have the element expression defined (and cannot have block or event defined). +You can separately define the counters for different architectures as shown in the preceding example for gfx90a and gfx1010. If two or more architectures share the same block, event, and expression definition, they can be specified together using the "/" delimiter ("gfx90a/gfx1010:"). +Hardware metrics have the elements block, event, and description defined. Derived metrics have the element expression defined and can't have block or event defined. -## Derived Metrics +## Derived metrics -Derived metrics allow for computations (via expressions) to be performed on collected hardware metrics with the result returned as it it were a real hardware counter. +Derived metrics are expressions that perform computations on collected hardware metrics. The result is returned as if it were a real hardware counter. ```yaml GPU_UTIL: @@ -256,30 +277,35 @@ GPU_UTIL: description: Percentage of the time that GUI is active ``` -GPU_UTIL is an example of a derived metric which takes the values of two GRBM hardware counters (GRBM_GUI_ACTIVE and GRBM_COUNT) and uses a mathematic expression to calculate the utilization rate of the GPU. Expressions support the standard set of math operators (/,*,-,+) along with a set of special functions (reduce and accumulate). +In the preceding example, `GPU_UTIL` is a derived metric that uses a mathematical expression to calculate the utilization rate of the GPU from the values of two GRBM hardware counters, `GRBM_GUI_ACTIVE` and `GRBM_COUNT`. Expressions support the standard set of math operators (/,*,-,+) along with a set of special functions such as reduce and accumulate.
-### Reduce Function +### Reduce function ```yaml -expression: 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum)) +expression: 100*reduce(GL2C_HIT,sum)/(reduce(GL2C_HIT,sum)+reduce(GL2C_MISS,sum)) ``` -Reduce() reduces counter values across all dimensions (shader engine, SIMD, etc) to produce a single output value. This is useful when you want to collect and compare values across the entire device. There are a number of reduction operations that can be perfomed: sum, average (avr), minimum value (selects minimum value across all dimensions, min), and max (selects the maximum value across all dimensions). For example reduce(GL2C_HIT,sum) sums all GL2C_HIT hardware register values together to return a single output value. +The reduce function reduces counter values across all dimensions such as shader engine, SIMD, and so on, to produce a single output value. This helps to collect and compare values across the entire device. +Here are the common reduction operations: + +- `sum`: Sums the values across all dimensions to create a single output. For example, `reduce(GL2C_HIT,sum)` sums all `GL2C_HIT` hardware register values. +- `avr`: Calculates the average across all dimensions. +- `min`: Selects the minimum value across all dimensions. +- `max`: Selects the maximum value across all dimensions. -### Accumulate Function +### Accumulate function ```yaml -expression: accumulate(, ) +expression: accumulate(, ) ``` -#### Description +- The accumulate function sums the values of a basic level counter over the specified number of cycles. The `resolution` parameter allows you to control the frequency of the summing operation as follows: -- The accumulate metric is used to sum the values of a basic level counter over a specified number of cycles. By setting the resolution parameter, you can control the frequency of the summing operation: - - HIGH_RES: Sums up the basic counter every clock cycle. Captures the value every single cycle for higher accuracy, suitable for fine-grained analysis.
- - LOW_RES: Sums up the basic counter every four clock cycles. Reduces the data points and provides less detailed summing, useful for reducing data volume. - - NONE: Does nothing and is equivalent to collecting basic_level_counter. Outputs the value of the basic counter without any summing operation. + - `HIGH_RES`: Sums up the basic level counter every clock cycle. Captures the value every cycle for higher accuracy, which helps in fine-grained analysis. + - `LOW_RES`: Sums up the basic level counter every four clock cycles. Reduces the data points and provides less detailed summing, which helps in reducing data volume. + - `NONE`: Does nothing and is equivalent to collecting basic level counter. Outputs the value of the basic level counter without performing any summing operation. -#### Usage +**Example:** ```yaml MeanOccupancyPerCU: @@ -291,4 +317,5 @@ MeanOccupancyPerCU: -- MeanOccupancyPerCU: This metric calculates the mean occupancy per compute unit. It uses the accumulate function with HIGH_RES to sum the SQ_LEVEL_WAVES counter at every clock cycle. This sum is then divided by GRBM_GUI_ACTIVE and the number of compute units (CU_NUM) to derive the mean occupancy. +- `MeanOccupancyPerCU`: In the preceding example, the `MeanOccupancyPerCU` metric calculates the mean occupancy per compute unit. It uses the accumulate function with `HIGH_RES` to sum the `SQ_LEVEL_WAVES` counter every clock cycle. +This sum is then divided by the maximum value of GRBM_GUI_ACTIVE and the number of compute units `CU_NUM` to derive the mean occupancy. 
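The reduction operations and the `MeanOccupancyPerCU` arithmetic described above can be sketched in plain C++. This is a minimal illustration under stated assumptions, not ROCprofiler-SDK code: the helpers `reduce_sum`, `reduce_max`, and `mean_occupancy_per_cu` are hypothetical names introduced here, and the inputs are ordinary vectors standing in for the per-dimension values of a counter.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical helper: mimics a `sum` reduction by adding the values
// collected from every dimension instance of a counter.
double
reduce_sum(const std::vector<double>& values)
{
    return std::accumulate(values.begin(), values.end(), 0.0);
}

// Hypothetical helper: mimics a `max` reduction by selecting the largest
// value across all dimension instances. Assumes a non-empty input.
double
reduce_max(const std::vector<double>& values)
{
    return *std::max_element(values.begin(), values.end());
}

// Hypothetical helper: divides an already accumulated wave count by the
// maximum GRBM_GUI_ACTIVE value and by the number of compute units,
// following the description of MeanOccupancyPerCU in the text.
double
mean_occupancy_per_cu(double                     accumulated_waves,
                      const std::vector<double>& gui_active,
                      double                     cu_num)
{
    return accumulated_waves / reduce_max(gui_active) / cu_num;
}
```

For instance, with made-up values of 64000 accumulated waves, `GRBM_GUI_ACTIVE` samples of 1000, 900, and 950, and 8 compute units, `mean_occupancy_per_cu` divides 64000 by 1000 and then by 8.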
diff --git a/source/docs/api-reference/intercept_table.md b/source/docs/api-reference/intercept_table.md index bce3dc0..7d341bd 100644 --- a/source/docs/api-reference/intercept_table.md +++ b/source/docs/api-reference/intercept_table.md @@ -1,12 +1,18 @@ -# Runtime intercept tables +--- +myst: + html_meta: + "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software." + "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK intercept table, Intercept table API" +--- -Although most tools will want to leverage the callback or buffer tracing services for tracing the HIP, HSA, and ROCTx -APIs, rocprofiler-sdk does provide access to the raw API dispatch tables. Each of the aforementioned APIs are -designed similar to the following sample. +# ROCprofiler-SDK runtime intercept tables -## Dispatch Table Overview +While tools commonly leverage the callback or buffer tracing services for tracing the HIP, HSA, and ROCTx +APIs, ROCprofiler-SDK also provides access to the raw API dispatch tables. 
-### Forward Declaration of public C API function +## Forward declaration of public C API function + +All the aforementioned APIs are designed similar to the following sample: ```cpp extern "C" @@ -17,7 +23,7 @@ foo(int) __attribute__((visibility("default"))); } ``` -### Internal Implementation of API function +## Internal implementation of API function ```cpp namespace impl @@ -31,7 +37,7 @@ foo(int val) } ``` -### Dispatch Table Implementation +## Dispatch table implementation ```cpp namespace impl @@ -41,21 +47,21 @@ struct dispatch_table int (*foo_fn)(int) = nullptr; }; -// invoked once: populates the dispatch_table with function pointers to implementation +// Invoked once: populates the dispatch_table with function pointers to implementation dispatch_table*& construct_dispatch_table() { static dispatch_table* tbl = new dispatch_table{}; tbl->foo_fn = impl::foo; - // in between above and below, rocprofiler-sdk gets passed the pointer + // In between, ROCprofiler-SDK gets passed the pointer // to the dispatch table and has the opportunity to wrap the function // pointers for interception return tbl; } -// constructs dispatch table and stores it in static variable +// Constructs dispatch table and stores it in static variable dispatch_table* get_dispatch_table() { @@ -65,7 +71,7 @@ get_dispatch_table() } // namespace impl ``` -### Implementation of public C API function +## Implementation of public C API function ```cpp extern "C" @@ -79,18 +85,9 @@ foo(int val) } ``` -### Dispatch Table Chaining - -rocprofiler-sdk is given an opportunity within `impl::construct_dispatch_table()` to -save the original value(s) of the function pointers such as `foo_fn` and install -it's own function pointers in its place -- this results in the public C API function `foo` -calling into the rocprofiler-sdk function pointer, which then in turn, calls the original -function pointer to `impl::foo` (this is called "chaining"). 
Once rocprofiler-sdk -has made any necessary modifications to the dispatch table, tools which indicated -they also want access to the raw dispatch table via `rocprofiler_at_intercept_table_registration` -will be passed the pointer to the dispatch table. +## Dispatch table chaining -## Sample +Within `impl::construct_dispatch_table()`, ROCprofiler-SDK can save the original values of the function pointers, such as `foo_fn`, and install its own function pointers in their place. This results in the public C API function `foo` calling into the ROCprofiler-SDK function pointer, which in turn calls the original function pointer to `impl::foo`. This technique is called chaining. Once ROCprofiler-SDK +makes the necessary modifications to the dispatch table, tools that requested access to the raw dispatch table via `rocprofiler_at_intercept_table_registration` are provided the pointer to the dispatch table. -For a demo of dispatch table chaining, please see the `samples/intercept_table` example in the -[rocprofiler-sdk GitHub repository](https://github.com/ROCm/rocproifler-sdk). +For an example of dispatch table chaining, see [samples/intercept_table](https://github.com/ROCm/rocprofiler-sdk-internal/tree/amd-staging/samples/intercept_table). diff --git a/source/docs/api-reference/pc_sampling.md b/source/docs/api-reference/pc_sampling.md index cadbbe0..53f7a85 100644 --- a/source/docs/api-reference/pc_sampling.md +++ b/source/docs/api-reference/pc_sampling.md @@ -1,14 +1,21 @@ -# PC sampling method +--- +myst: + html_meta: + "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software." + "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK PC sampling, Program counter sampling, PC sampling" +--- -PC Sampling is a profiling method that uses statistical approximation of the kernel execution by sampling GPU program counters. 
Furthermore, the method periodically chooses an active wave (in a round robin manner) and snapshot it's program counter (PC). The process takes place on every compute unit simultaneously which makes it device-wide PC sampling. The outcome is the histogram of samples that says how many times each kernel instruction was sampled. +# ROCprofiler-SDK PC sampling method -**Note**: The PC sampling feature is still under development and may not be completely stable. +Program counter (PC) sampling is a profiling method that statistically approximates kernel execution by sampling GPU program counters. The method periodically chooses an active wave in a round-robin manner and snapshots its PC. This process takes place on every compute unit simultaneously, making it device-wide PC sampling. The outcome is a histogram of samples that shows how many times each kernel instruction was sampled. - **Risk Acknowledgment**: +:::{note} +Risk acknowledgment: - By activating this feature through `ROCPROFILER_PC_SAMPLING_BETA_ENABLED` environment variable, you acknowledge and accept the following potential risks: - -- **Hardware Freeze**: This beta feature could cause your hardware to freeze unexpectedly. -- **Need for Cold Restart**: In the event of a hardware freeze, you may need to perform a cold restart (turning the hardware off and on) to restore normal operations. +The PC sampling feature is under development and might not be completely stable. Use this beta feature cautiously. It might affect your system's stability and performance. Proceed at your own risk. - Please use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk. +By activating this feature through the `ROCPROFILER_PC_SAMPLING_BETA_ENABLED` environment variable, you acknowledge and accept the following potential risks: + +- Hardware freeze: This beta feature could cause your hardware to freeze unexpectedly. 
+- Need for cold restart: In the event of a hardware freeze, you might need to perform a cold restart (turning the hardware off and on) to restore normal operations. +::: diff --git a/source/docs/api-reference/tool_library.md b/source/docs/api-reference/tool_library.md index f698d9a..cce13d5 100644 --- a/source/docs/api-reference/tool_library.md +++ b/source/docs/api-reference/tool_library.md @@ -1,4 +1,11 @@ -# Tool library +--- +myst: + html_meta: + "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software." + "keywords": "ROCprofiler-SDK API reference, ROCprofiler-SDK tool library, Tool library API" +--- + +# ROCprofiler-SDK tool library The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.0.4.0`, which uses these libraries. @@ -144,9 +151,9 @@ tool_init(rocprofiler_client_finalize_t fini_func, Otherwise, ROCprofiler-SDK invokes the `finalize` callback via an `atexit` handler. -## Full `rocprofiler_configure` Sample +## Full `rocprofiler_configure` sample -All of the snippets from the previous sections have been combined here for convenience. +All the code snippets from the previous sections are combined here to demonstrate a complete ROCprofiler-SDK tool configuration. 
```cpp #include @@ -184,7 +191,7 @@ tool_init(rocprofiler_client_finalize_t fini_func, // Save your contexts tool_data->contexts.emplace_back(ctx); - // associate code object tracing with this context + // Associate code object tracing with this context rocprofiler_configure_callback_tracing_service( ctx, ROCPROFILER_CALLBACK_TRACING_CODE_OBJECT, @@ -193,7 +200,7 @@ tool_init(rocprofiler_client_finalize_t fini_func, tool_tracing_callback, tool_data); - // ... associate services with contexts ... + // ... Associate services with contexts ... return 0; } @@ -213,14 +220,14 @@ rocprofiler_configure(uint32_t version, // (optional) Provide a name for this tool to rocprofiler client_id->name = "ExampleTool"; - // info provided back to tool_init and tool_fini + // Info provided back to tool_init and tool_fini auto* my_tool_data = new rocp_tool_data{ version, runtime_version, priority, client_id, nullptr }; - // create configure data + // Create configure data static auto cfg = rocprofiler_tool_configure_result_t{ sizeof(rocprofiler_tool_configure_result_t), &tool_init, diff --git a/source/docs/how-to/samples.md b/source/docs/how-to/samples.md index d8e2b0c..d55934a 100644 --- a/source/docs/how-to/samples.md +++ b/source/docs/how-to/samples.md @@ -1,4 +1,11 @@ -# Samples +--- +myst: + html_meta: + "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software." + "keywords": "ROCprofiler-SDK, ROCProfiler-SDK samples" +--- + +# ROCprofiler-SDK samples The samples are provided to help you see the profiler in action. diff --git a/source/docs/how-to/using-rocprofv3.rst b/source/docs/how-to/using-rocprofv3.rst index 6d814e7..97b6ce4 100644 --- a/source/docs/how-to/using-rocprofv3.rst +++ b/source/docs/how-to/using-rocprofv3.rst @@ -1,6 +1,6 @@ .. 
meta:: :description: Documentation of the installation, configuration, use of the ROCprofiler-SDK, and rocprofv3 command-line tool - :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, ROCm, API, reference + :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, rocprofv3 tool usage, Using rocprofv3, ROCprofiler-SDK command line tool, ROCprofiler-SDK CLI .. _using-rocprofv3: @@ -137,19 +137,19 @@ Here is the sample of commonly used ``rocprofv3`` command-line options. Some opt * - ``--preload`` - Libraries to prepend to LD_PRELOAD (usually for sanitizers) - Extension - + * - ``--perfetto-backend {inprocess,system}`` - Perfetto data collection backend. 'system' mode requires starting traced and perfetto daemons - Extension - + * - ``--perfetto-buffer-size KB`` - Size of buffer for perfetto output in KB. default: 1 GB - Extension - + * - ``--perfetto-buffer-fill-policy {discard,ring_buffer}`` - Policy for handling new records when perfetto has reached the buffer limit - Extension - + * - ``--perfetto-shmem-size-hint KB`` - Perfetto shared memory size hint in KB. default: 64 KB - Extension @@ -266,9 +266,9 @@ Here is a list of useful APIs for code instrumentation. See how to use ``ROCTx`` APIs in the MatrixTranspose application below: .. code-block:: bash - + #include - + roctxMark("before hipLaunchKernel"); int rangeId = roctxRangeStart("hipLaunchKernel range"); roctxRangePush("hipLaunchKernel"); @@ -542,7 +542,7 @@ Properties - WRITE_SIZE -Command-Line +Command-line +++++++++++++ Desired counters can now be collected as ``command-line`` option as well. @@ -585,7 +585,7 @@ Here are the contents of ``counter_collection.csv`` file: For the description of the fields in the output file, see :ref:`output-file-fields`. -Kernel Filtering +Kernel filtering +++++++++++++++++ rocprofv3 supports kernel filtering in case of profiling. 
A kernel filter is a set of a regex string (to include the kernels matching this filter), a regex string (to exclude the kernels matching this filter), @@ -768,7 +768,7 @@ Properties - **`simd_per_cu`** `(integer)`: SIMDs per CU. - **`max_slots_scratch_cu`** `(integer)`: Maximum slots for scratch CU. - **`gfx_target_version`** `(integer)`: GFX target version. - - **`vendor_id`** `(integer)`: Vendor ID. + - **`vendor_id`** `(integer)`: Vendor ID. - **`device_id`** `(integer)`: Device ID. - **`location_id`** `(integer)`: Location ID. - **`domain`** `(integer)`: Domain identifier. diff --git a/source/docs/index.rst b/source/docs/index.rst index 8aaec85..e6c4ccf 100644 --- a/source/docs/index.rst +++ b/source/docs/index.rst @@ -1,6 +1,6 @@ .. meta:: - :description: Documentation of the installation, configuration, use of the ROCprofiler SDK, and rocprofv3 command-line tool - :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, ROCm, API, reference + :description: ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software + :keywords: ROCprofiler-SDK tool, ROCprofiler-SDK library, rocprofv3, ROCprofiler-SDK API, ROCprofiler-SDK documentation .. _index: @@ -10,7 +10,7 @@ ROCprofiler-SDK documentation ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software. It supports application tracing to provide a big picture of the GPU application execution and kernel profiling to provide low-level hardware details from the performance counters. -The ROCprofiler-SDK library provides runtime-independent APIs for tracing runtime calls and asynchronous activities such as GPU kernel dispatches and memory moves. The tracing includes callback APIs for runtime API tracing and activity APIs for asynchronous activity records logging. 
+The ROCprofiler-SDK library provides runtime-independent APIs for tracing runtime calls and asynchronous activities such as GPU kernel dispatches and memory moves. The tracing includes callback APIs for runtime API tracing and activity APIs for asynchronous activity records logging. In summary, ROCprofiler-SDK combines `ROCProfiler `_ and `ROCTracer `_. You can utilize the ROCprofiler-SDK to develop a tool for profiling and tracing HIP applications on ROCm software. @@ -33,7 +33,7 @@ The documentation is structured as follows: * :ref:`using-rocprofv3` * :doc:`Samples ` - + .. grid-item-card:: API reference * :doc:`Buffered services ` @@ -47,7 +47,7 @@ The documentation is structured as follows: .. grid-item-card:: Conceptual * :ref:`comparing-with-legacy-tools` - + To contribute to the documentation, refer to `Contributing to ROCm `_. diff --git a/source/docs/install/installation.md b/source/docs/install/installation.md index 88007ac..41f045b 100644 --- a/source/docs/install/installation.md +++ b/source/docs/install/installation.md @@ -1,4 +1,11 @@ -# Installation +--- +myst: + html_meta: + "description": "ROCprofiler-SDK is a tooling infrastructure for profiling general-purpose GPU compute applications running on the ROCm software." + "keywords": "ROCprofiler-SDK installation, Install ROCprofiler-SDK, Build ROCprofiler-SDK" +--- + +# ROCprofiler-SDK installation This document provides information required to install ROCprofiler-SDK from source. @@ -53,7 +60,7 @@ cmake \ -D CMAKE_INSTALL_PREFIX=/opt/rocm \ rocprofiler-sdk-source -cmake --build rocprofiler-sdk-build --target all --parallel 8 +cmake --build rocprofiler-sdk-build --target all --parallel 8 ``` ## Installing ROCprofiler-SDK