
Commit

fix words and formatting
Signed-off-by: Peter Jun Park <peter.park@amd.com>

formatting

Signed-off-by: Peter Jun Park <peter.park@amd.com>
peterjunpark committed Jul 9, 2024
1 parent 184f18a commit dee9b62
Showing 45 changed files with 4,044 additions and 2,483 deletions.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
* @koomie @coleramos425

# Documentation files
docs/ @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
.readthedocs.yaml @ROCm/rocm-documentation
137 changes: 95 additions & 42 deletions docs/concept/command-processor.rst
Command processor (CP)
**********************

The command processor (CP) is responsible for interacting with the AMDGPU
kernel driver -- part of the Linux kernel -- on the CPU and for interacting
with user-space HSA clients when they submit commands to HSA queues. Basic
tasks of the CP include reading commands (such as those corresponding to a
kernel launch) out of :hsa-runtime-pdf:`HSA queues <68>`, scheduling work to
subsequent parts of the scheduler pipeline, and marking kernels complete for
synchronization events on the host.
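
For example, every HIP kernel launch reaches the CP as a packet on an HSA
queue. The following minimal sketch (the ``saxpy`` kernel and launch geometry
are illustrative) notes where the CP does its work:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void saxpy(float a, const float* x, float* y, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) y[i] = a * x[i] + y[i];
   }

   int main() {
     int n = 1 << 20;
     float *x = nullptr, *y = nullptr;
     hipMalloc(&x, n * sizeof(float));
     hipMalloc(&y, n * sizeof(float));

     // The runtime packages this launch as an AQL kernel-dispatch packet on
     // an HSA queue. The CPF fetches the packet out of memory; the CPC
     // decodes it and passes the workgroups on for scheduling.
     hipLaunchKernelGGL(saxpy, dim3(n / 256), dim3(256), 0, 0, 2.0f, x, y, n);

     // Kernel completion is marked by the CP for host synchronization.
     hipDeviceSynchronize();

     hipFree(x);
     hipFree(y);
     return 0;
   }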

The command processor consists of two sub-components:

* :ref:`Fetcher <cpf-metrics>` (CPF): Fetches commands out of memory to hand
them over to the CPC for processing.

* :ref:`Packet processor <cpc-metrics>` (CPC): Micro-controller running the
command processing firmware that decodes the fetched commands and (for
kernels) passes them to the :ref:`workgroup processors <desc-spi>` for
scheduling.

Before scheduling work to the accelerator, the command processor can first
acquire a memory fence to ensure system consistency
(:hsa-runtime-pdf:`Section 2.6.4 <91>`). After the work is complete, the
command processor can apply a memory-release fence. Depending on the AMD CDNA
accelerator in question, either of these operations *might* initiate a cache
write-back or invalidation.
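
These fences correspond to the acquire and release scopes encoded in the
header of an HSA kernel dispatch packet. The sketch below builds such a header
using enumerations from the HSA runtime's ``hsa.h``; note that assembling
packets by hand like this is normally the job of the HSA or HIP runtime:

.. code-block:: cpp

   #include <hsa/hsa.h>

   #include <cstdint>

   // Build an AQL kernel-dispatch packet header requesting a system-scope
   // acquire fence before the kernel runs and a system-scope release fence
   // after it completes. On CDNA accelerators, either fence might trigger a
   // cache write-back or invalidation.
   uint16_t make_dispatch_header() {
     uint16_t header = 0;
     header |= HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
     header |= HSA_FENCE_SCOPE_SYSTEM
               << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE;
     header |= HSA_FENCE_SCOPE_SYSTEM
               << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE;
     return header;
   }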

Analyzing command processor performance is most interesting for kernels that
you suspect to be limited by scheduling or launch rate. The command
processor's metrics are therefore focused on reporting, for example:

* Utilization of the fetcher

* Utilization of the packet processor, and of its packet decoding

* Stalls in fetching and processing

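Every utilization metric below is a ratio of busy cycles to total cycles,
reported as a percentage. As a worked example, a CPF that is busy for 800 of
1,000 counted cycles reports:

.. math::

   \text{CPF Utilization} = 100 \times
   \frac{\text{busy cycles}}{\text{total cycles}} = 100 \times
   \frac{800}{1000} = 80\%
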
.. _cpf-metrics:

Command Processor Fetcher (CPF) metrics
=======================================

.. list-table::
   :header-rows: 1
   :widths: 20 65 15

   * - Metric
     - Description
     - Unit

   * - CPF Utilization
     - Percent of total cycles where the CPF was busy actively doing any work.
       The ratio of CPF busy cycles over total cycles counted by the CPF.
     - Percent

   * - CPF Stall
     - Percent of CPF busy cycles where the CPF was stalled for any reason.
     - Percent

   * - CPF-L2 Utilization
     - Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>`
       interface where the CPF-L2 interface was active doing any work. The
       ratio of CPF-L2 busy cycles over total cycles counted by the CPF-L2.
     - Percent

   * - CPF-L2 Stall
     - Percent of CPF-L2 busy cycles where the CPF-:doc:`L2 <l2-cache>`
       interface was stalled for any reason.
     - Percent

   * - CPF-UTCL1 Stall
     - Percent of CPF busy cycles where the CPF was stalled by address
       translation.
     - Percent

.. _cpc-metrics:

Command Processor Packet Processor (CPC) metrics
================================================

.. list-table::
   :header-rows: 1
   :widths: 20 65 15

   * - Metric
     - Description
     - Unit

   * - CPC Utilization
     - Percent of total cycles where the CPC was busy actively doing any work.
       The ratio of CPC busy cycles over total cycles counted by the CPC.
     - Percent

   * - CPC Stall
     - Percent of CPC busy cycles where the CPC was stalled for any reason.
     - Percent

   * - CPC Packet Decoding Utilization
     - Percent of CPC busy cycles spent decoding commands for processing.
     - Percent

   * - CPC-Workgroup Manager Utilization
     - Percent of CPC busy cycles spent dispatching workgroups to the
       :ref:`workgroup manager <desc-spi>`.
     - Percent

   * - CPC-L2 Utilization
     - Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>`
       interface where the CPC-L2 interface was active doing any work.
     - Percent

   * - CPC-UTCL1 Stall
     - Percent of CPC busy cycles where the CPC was stalled by address
       translation.
     - Percent

   * - CPC-UTCL2 Utilization
     - Percent of total cycles counted by the CPC's L2 address translation
       interface where the CPC was busy doing address translation work.
     - Percent

27 changes: 15 additions & 12 deletions docs/concept/compute-unit.rst
The CU consists of several independent execution pipelines and functional
units.

* The *vector arithmetic logic unit (VALU)* is responsible for executing much
  of the computational work on CDNA accelerators, including but not limited
  to floating-point operations (FLOPs) and integer operations (IOPs).

* The *vector memory (VMEM)* unit is responsible for issuing loads, stores,
  and atomic operations that interact with the memory system.

* The :ref:`desc-salu` is shared by all threads in a
  :ref:`wavefront <desc-wavefront>`, and is responsible for executing
  instructions that are known to be uniform across the wavefront at
  compile-time. The SALU has a memory unit (SMEM) for interacting with
  memory, but it cannot issue separately from the SALU.

* The :ref:`desc-lds` is an on-CU software-managed scratchpad memory that can
  be used to efficiently share data between all threads in a
  :ref:`workgroup <desc-workgroup>`.

* The :ref:`desc-scheduler` is responsible for issuing and decoding
  instructions for all the :ref:`wavefronts <desc-wavefront>` on the compute
  unit.

* The :doc:`vector L1 data cache (vL1D) <vector-l1-cache>` is the first level
  cache local to the compute unit. On current CDNA accelerators, the vL1D is
  write-through. The vL1D caches from multiple compute units are kept
  coherent with one another through software instructions.

* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
  specialized matrix-multiplication accelerator pipelines known as the
  :ref:`desc-mfma`.
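
As a rough sketch of this division of labor, the comments in the following
HIP kernel note the pipeline that typically executes each line; the kernel
itself is illustrative and assumes 256-thread workgroups:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void pipeline_demo(const float* in, float* out, float alpha) {
     // LDS pipeline: on-CU scratchpad shared by the whole workgroup.
     __shared__ float tile[256];

     // SALU: blockIdx and blockDim are uniform across the wavefront.
     int base = blockIdx.x * blockDim.x;

     // VALU: per-thread address arithmetic.
     int i = base + threadIdx.x;

     // VMEM: global load, cached by the vL1D on the way in.
     tile[threadIdx.x] = in[i];
     __syncthreads();

     // VALU multiply, then a VMEM store through the write-through vL1D.
     out[i] = alpha * tile[threadIdx.x];
   }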

For a more in-depth description of a compute unit on a CDNA accelerator, see
:hip-training-pdf:`22` and :gcn-crash-course:`27`.

:ref:`pipeline-desc` details the various execution pipelines (VALU, SALU,
LDS, scheduler, and so forth). The metrics presented by Omniperf for these
pipelines are described in :ref:`pipeline-metrics`. The
:doc:`vL1D <vector-l1-cache>` cache and :doc:`LDS <local-data-share>` are
described in their own sections.

.. include:: ./includes/pipeline-metrics.rst
109 changes: 109 additions & 0 deletions docs/concept/definitions.rst
.. meta::
   :description: Omniperf terminology and definitions
   :keywords: Omniperf, ROCm, glossary, definitions, terms, profiler, tool,
              Instinct, accelerator, AMD

***********
Definitions
***********

The following table briefly defines some terminology used in Omniperf interfaces
and in this documentation.

.. include:: ./includes/terms.rst

.. include:: ./includes/normalization-units.rst

.. _memory-spaces:

Memory spaces
=============

AMD Instinct MI accelerators can access memory through multiple address
spaces, which may map to different physical memory locations on the system.
The following table provides a view into how various types of memory used in
HIP map onto these constructs:

.. list-table::
   :header-rows: 1

   * - LLVM Address Space
     - Hardware Memory Space
     - HIP Terminology

   * - Generic
     - Flat
     - N/A

   * - Global
     - Global
     - Global

   * - Local
     - LDS
     - LDS/Shared

   * - Private
     - Scratch
     - Private

   * - Constant
     - Same as global
     - Constant

The following is a high-level description of the address spaces in the AMDGPU
backend of LLVM:

.. list-table::
   :header-rows: 1

   * - Address space
     - Description

   * - Global
     - Memory that can be seen by all threads in a process, and may be backed
       by the local accelerator's HBM, a remote accelerator's HBM, or the
       CPU's DRAM.

   * - Local
     - Memory that is only visible to a particular workgroup. On AMD's
       Instinct accelerator hardware, this is stored in
       :ref:`LDS <local-data-share>` memory.

   * - Private
     - Memory that is only visible to a particular work-item (thread), stored
       in the scratch space on AMD's Instinct accelerators.

   * - Constant
     - Read-only memory that is in the global address space and stored on the
       local accelerator's HBM.

   * - Generic
     - Used when the compiler cannot statically prove that a pointer is
       addressing memory in a single (non-generic) address space. Mapped to
       Flat on AMD's Instinct accelerators, the pointer could dynamically
       address global, local, private, or constant memory.

`LLVM's documentation for AMDGPU Backend <https://llvm.org/docs/AMDGPUUsage.html#address-spaces>`_
has the most up-to-date information. Refer to this source for a more complete
explanation.
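
The following HIP kernel is a sketch that touches several of these spaces;
the comments give the LLVM address space that each declaration or access maps
to, following the tables above (the kernel assumes 256-thread workgroups):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Constant address space: read-only, stored in the global address space
   // on the local accelerator's HBM.
   __constant__ float coeff[4];

   __global__ void spaces_demo(const float* in, float* out) {
     // Local address space: backed by LDS, visible to the whole workgroup.
     __shared__ float staging[256];

     // Private address space: per-thread, backed by scratch if spilled.
     float acc = 0.0f;

     // Global address space: visible to all threads in the process.
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     staging[threadIdx.x] = in[i];
     __syncthreads();

     for (int k = 0; k < 4; ++k)
       acc += coeff[k] * staging[threadIdx.x];

     out[i] = acc;
   }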

.. _memory-type:

Memory type
===========

AMD Instinct accelerators contain a number of different memory allocation
types to enable the HIP language's
:doc:`memory coherency model <hip:how-to/programming_manual>`.
These memory types are broadly similar between AMD Instinct accelerator
generations, but may differ in exact implementation.

In addition, these memory types *might* differ between accelerators on the same
system, even when accessing the same memory allocation.

For example, an :ref:`MI2XX <mixxx-note>` accelerator accessing *fine-grained*
memory allocated local to that device may see the allocation as coherently
cacheable, while a remote accelerator might see the same allocation as
*uncached*.
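
As an illustration, HIP exposes these allocation types through allocation
flags. The sketch below uses ``hipExtMallocWithFlags`` and ``hipHostMalloc``;
flag availability varies by ROCm version, so verify these names against your
installed headers:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   void allocation_examples() {
     void* fine = nullptr;
     void* pinned = nullptr;

     // Fine-grained device memory: the owning accelerator may see the
     // allocation as coherently cacheable, while a remote accelerator might
     // see the same allocation as uncached.
     hipExtMallocWithFlags(&fine, 4096, hipDeviceMallocFinegrained);

     // Fine-grained (coherent) pinned host memory.
     hipHostMalloc(&pinned, 4096, hipHostMallocCoherent);

     hipFree(fine);
     hipHostFree(pinned);
   }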
