add extlinks
Signed-off-by: Peter Jun Park <peter.park@amd.com>
peterjunpark committed Jun 27, 2024
1 parent cde3538 commit 67b6de3
Showing 8 changed files with 191 additions and 228 deletions.
7 changes: 2 additions & 5 deletions docs/conceptual/compute-unit.rst
@@ -38,16 +38,13 @@
The CU consists of several independent execution pipelines and functional units.
:ref:`desc-mfma`.

For a more in-depth description of a compute unit on a CDNA accelerator, see
slides 22 to 28 in
`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
and slide 27 in
`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.
:hip-training-2019:`22` and :gcn-crash-course:`27`.

:ref:`pipeline-desc` details the various
execution pipelines (VALU, SALU, LDS, Scheduler, etc.). The metrics
presented by Omniperf for these pipelines are described in
:ref:`pipeline-metrics`. Finally, the `vL1D <vL1D>`__ cache and
`LDS <LDS>`__ will be described their own sections.
:ref:`LDS <desc-lds>` will be described in their own sections.

.. include:: ./includes/pipeline-descriptions.rst

110 changes: 62 additions & 48 deletions docs/conceptual/includes/pipeline-descriptions.rst
@@ -13,20 +13,18 @@
over an entire wavefront, each `work-item <Workitem>`__ (or
vector-lane) potentially operating on distinct data. The VALU of a CDNA
accelerator or GCN GPU typically consists of:

- four 16-wide SIMD processors (see `An introduction to AMD GPU
Programming with
HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`__
for more details)
- four 64 or 128 KiB VGPR files (yielding a total of 256-512 KiB total
per CU), see `AGPRs <agprs>`__ for more detail.
- An instruction buffer (per-SIMD) that contains execution slots for up
* Four 16-wide SIMD processors (see :hip-training-2019:`24` for more details).
* Four 64 or 128 KiB VGPR files (yielding 256-512 KiB in total
per CU); see :ref:`AGPRs <agprs>` for more detail.
* An instruction buffer (per-SIMD) that contains execution slots for up
to 8 wavefronts (for 32 total wavefront slots on each CU).
- A vector memory (VMEM) unit which transfers data between VGPRs and
* A vector memory (VMEM) unit which transfers data between VGPRs and
memory; each work-item supplies its own memory address and supplies
or receives unique data.
- CDNA accelerators, such as the MI100 and `MI2XX <2xxnote>`__, contain
additional `Matrix Fused Multiply-Add (MFMA)
units <https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/>`__.
* CDNA accelerators, such as the MI100 and MI2XX [#mi2xx]_, contain
additional
:amd-lab-note:`Matrix Fused Multiply-Add (MFMA) <amd-lab-notes-matrix-cores-readme>`
units.
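
To make this concrete, the minimal HIP sketch below (our illustration, not
from the original docs; the kernel name is hypothetical) shows the execution
model these units implement: every work-item runs the same VALU instruction
stream on its own vector lane of data, with a 64-wide wavefront flowing
through a 16-wide SIMD over several cycles.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // One VALU instruction stream, executed per-lane on distinct data.
   __global__ void vec_add(const float* a, const float* b, float* c, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per work-item
       if (i < n)                // out-of-range lanes are simply masked off
           c[i] = a[i] + b[i];   // the add executes once per active lane
   }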

In order to support branching / conditionals, each wavefront in the VALU
has a distinct execution mask which determines which work-items in the
@@ -37,8 +35,11 @@
and are treated as no-ops.

.. note::

On GCN GPUs and the CDNA MI100 accelerator, there are slots for up to 10 wavefronts in the instruction buffer, but generally occupancy is limited by other factors to 32 waves per [Compute Unit](CU).
On the CDNA2 [MI2XX](2xxnote) series accelerators, there are only 8 waveslots per-SIMD.
On GCN GPUs and the CDNA MI100 accelerator, there are slots for up to 10
wavefronts in the instruction buffer, but generally occupancy is limited by
other factors to 32 waves per :doc:`compute unit <compute-unit>`.
On the CDNA2 MI2XX [#mi2xx]_ series accelerators, there are only 8 waveslots
per-SIMD.
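
As a small illustrative HIP example (ours, not from the docs), both sides of
a divergent branch execute in sequence, each time with the complementary set
of lanes disabled by the execution mask:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Hypothetical example of wavefront divergence: the hardware runs both
   // paths, masking off the inactive lanes (no-ops) for each one.
   __global__ void divergent(const int* in, int* out)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (in[i] % 2 == 0)
           out[i] = in[i] * 2;   // executes with even-valued lanes active
       else
           out[i] = in[i] + 1;   // executes with the remaining lanes active
   }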

.. _desc-salu:

@@ -47,16 +48,17 @@
Scalar Arithmetic Logic Unit (SALU)

The scalar arithmetic logic unit (SALU) executes instructions that are
shared between all work-items in a wavefront. This includes control-flow
– such as if/else conditionals, branches and looping – pointer
arithmetic, loading common values, etc. The SALU consists of:
– such as if/else conditionals, branches and looping
– pointer arithmetic, loading common values, etc.

- a scalar processor capable of various arithmetic, conditional, and
comparison (etc.) operations. See, e.g., `Chapter 5. Scalar ALU
Operations <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`__
of the CDNA2 Instruction Set Architecture (ISA) Guide for more
The SALU consists of:

- A scalar processor capable of various arithmetic, conditional, and
comparison (etc.) operations. See :mi200-isa-pdf:`Chapter 5. Scalar ALU Operations <35>`
of the CDNA2 Instruction Set Architecture (ISA) Reference Guide for more
detail.
- a 12.5 KiB Scalar General Purpose Register (SGPR) file
- a scalar memory (SMEM) unit which transfers data between SGPRs and
- A 12.5 KiB Scalar General Purpose Register (SGPR) file
- A scalar memory (SMEM) unit which transfers data between SGPRs and
memory
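
For example (an illustrative sketch, assuming typical compiler behavior),
values that are uniform across the wavefront, such as kernel arguments,
travel through the SMEM and SGPRs and are manipulated by SALU instructions,
while per-lane values live in VGPRs and use the VALU:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // 'a', 'n', and the x/y base pointers are identical for every work-item,
   // so the compiler typically loads them via the SMEM into SGPRs and
   // handles them with SALU instructions; i and x[i] are per-lane values
   // held in VGPRs and processed by the VALU.
   __global__ void saxpy(float a, const float* x, float* y, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           y[i] = a * x[i] + y[i];
   }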

Data loaded by the SMEM can be cached in the `scalar L1 data
@@ -65,35 +67,40 @@
accesses such as kernel arguments, or HIP’s ``__constant__`` memory.

.. _desc-lds:

Local Data Share (LDS)
Local data share (LDS)
----------------------

The local data share (LDS, a.k.a., “shared memory”) is fast on-CU
scratchpad that can be explicitly managed by software to effectively
share data and to coordinate between wavefronts in a workgroup.
.. _perf-model-branch:

The local data share (LDS, a.k.a. "shared memory") is a fast on-CU scratchpad
that can be explicitly managed by software to effectively share data and to
coordinate between wavefronts in a workgroup.
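
In HIP, the LDS backs ``__shared__`` allocations. The following sketch
(hypothetical, for illustration only) stages data through the LDS so that
work-items in a workgroup can exchange values:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Reverse a 256-element tile through the LDS (assumes blockDim.x == 256).
   __global__ void reverse_tile(const float* in, float* out)
   {
       __shared__ float tile[256];            // allocated in the CU's LDS
       int t = threadIdx.x;
       tile[t] = in[blockIdx.x * 256 + t];    // stage through the scratchpad
       __syncthreads();                       // coordinate the workgroup's wavefronts
       out[blockIdx.x * 256 + t] = tile[255 - t];
   }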

\```{figure} images/lds.\* :scale: 150 % :alt: Performance model of the
Local Data Share (LDS) on AMD Instinct(tm) MI accelerators. :align:
center
.. figure:: ../data/performance-model/lds.*
:align: center
:alt: Performance model of the local data share (LDS) on AMD Instinct
accelerators

Performance model of the Local Data Share (LDS) on AMD Instinct(tm) MI
accelerators.
Performance model of the local data share (LDS) on AMD Instinct MI-series
accelerators.

Above is Omniperf's performance model of the LDS on CDNA accelerators (adapted from [GCN Architecture, by Mike Mantor](https://old.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-3-ManyCore/HC24.28.315-AMD.GCN.mantor_v1.pdf), slide 20).
The SIMDs in the [VALU](valu) are connected to the LDS in pairs (see above).
Only one SIMD per pair may issue an LDS instruction at a time, but both pairs may issue concurrently.

On CDNA accelerators, the LDS contains 32 banks and each bank is 4B wide.
The LDS is designed such that each bank can be read from/written to/atomically updated every cycle, for a total throughput of 128B/clock ([GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 40).
The LDS is designed such that each bank can be read from/written to/atomically updated every cycle, for a total throughput of 128B/clock :gcn-crash-course:`40`.

On each of the two ports to the SIMDs, 64B can be sent in each direction per cycle. So, a single wavefront, coming from one of the 2 SIMDs in a pair, can only get back 64B/cycle (16 lanes per cycle). The input port is shared between data and address and this can affect achieved bandwidth for different data sizes. For example, a 64-wide store where each lane is sending a 4B value takes 8 cycles (50% peak bandwidth) while a 64-wide store where each lane is sending a 16B value takes 20 cycles (80% peak bandwidth).
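
These utilization figures can be reproduced with simple arithmetic. The model
below is our own back-of-the-envelope sketch; it assumes 64 active lanes and
that each lane's 4B address shares the 64B/cycle input port with the data:

.. code-block:: cpp

   // Assumed model: 64B/cycle per port, 64 active lanes, 4B per lane address.
   constexpr int lanes       = 64;
   constexpr int port_bytes  = 64;                      // bytes per cycle, per direction
   constexpr int addr_cycles = lanes * 4 / port_bytes;  // 4 cycles of addresses

   constexpr int store_cycles(int bytes_per_lane)
   {
       return addr_cycles + lanes * bytes_per_lane / port_bytes;
   }

   static_assert(store_cycles(4)  == 8,  "4B/lane: 256B in 8 cycles = 50% of peak");
   static_assert(store_cycles(16) == 20, "16B/lane: 1024B in 20 cycles = 80% of peak");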

In addition, the LDS contains conflict-resolution hardware to detect and handle bank conflicts.
A bank conflict occurs when two (or more) work-items in a wavefront want to read, write, or atomically update different addresses that map to the same bank in the same cycle.
In this case, the conflict detection hardware will determine a new schedule such that the access is split into multiple cycles with no conflicts in any single cycle.
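
For instance (a hypothetical HIP sketch based on the 32 x 4B bank layout
described above), an array stride that is a multiple of the bank count
funnels every lane into the same bank, while padding the stride by one
element spreads the lanes across all banks:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Bank of a 4B element on CDNA accelerators: (byte_address / 4) % 32.
   __global__ void bank_conflict_demo(float* out)
   {
       __shared__ float lds[64 * 33];
       int t = threadIdx.x;                    // assume one 64-wide wavefront
       for (int i = t; i < 64 * 33; i += blockDim.x)
           lds[i] = static_cast<float>(i);     // initialize the scratchpad
       __syncthreads();

       float conflicted = lds[t * 32];  // stride 32: every lane maps to bank 0,
                                        // so the read is split over many cycles
       float padded     = lds[t * 33];  // stride 33: lanes spread over all banks
       out[t] = conflicted + padded;
   }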

When multiple work-items want to read from the same address within a bank, the result can be efficiently broadcasted ([GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 41).
Multiple work-items writing to the same address within a bank typically results undefined behavior in HIP and other languages, as the LDS will write the value from the last work-item as determined by the hardware scheduler ([GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 41). This behavior may be useful in the very specific case of storing a uniform value.
When multiple work-items want to read from the same address within a bank, the result can be efficiently broadcasted
:gcn-crash-course:`41`.
Multiple work-items writing to the same address within a bank typically results in undefined behavior in HIP and other languages, as the LDS will write the value from the last work-item as determined by the hardware scheduler
:gcn-crash-course:`41`. This behavior may be useful in the very specific case of storing a uniform value.

Relatedly, an address conflict is defined as occurring when two (or more) work-items in a wavefront want to atomically update the same address on the same cycle.
As in a bank-conflict, this may cause additional cycles of work for the LDS operation to complete.
@@ -103,30 +110,37 @@
Branch
------

The branch unit is responsible for executing jumps and branches to execute control-flow operations.
Note that Branch operations are not used for execution mask updates, but only for “whole wavefront” control-flow changes.
The branch unit is responsible for executing jumps and branches to execute
control flow operations.
Note that branch operations are not used for execution mask updates, but only
for “whole wavefront” control-flow changes.

.. _desc-scheduler:

Scheduler
---------

The scheduler is responsible for arbitration and issue of instructions for all the wavefronts currently executing on the CU. On every clock cycle, the scheduler:
The scheduler is responsible for arbitration and issue of instructions for all
the wavefronts currently executing on the :doc:`CU <compute-unit>`. On every
clock cycle, the scheduler:

- considers waves from one of the SIMD units for execution, selected in a round-robin fashion between the SIMDs in the [compute unit](CU)
- issues up to one instruction per wavefront on the selected SIMD
- issues up to one instruction per each of the instruction categories among the waves on the selected SIMD:
- [VALU](valu)
- [VMEM](valu) operations
- [SALU](salu) / SMEM operations
- [LDS](lds)
- [Branch](branch) operations
* Considers waves from one of the SIMD units for execution, selected in a
round-robin fashion between the SIMDs in the compute unit
* Issues up to one instruction per wavefront on the selected SIMD
* Issues up to one instruction from each of the instruction categories among the waves on the selected SIMD:
* :ref:`VALU <desc-valu>` / :ref:`VMEM <desc-valu>` operations
* :ref:`SALU <desc-salu>` / SMEM operations
* :ref:`LDS <desc-lds>`
* :ref:`Branch <desc-branch>` operations

This gives a maximum of five issued Instructions Per Cycle (IPC), per-SIMD, per-CU ([AMD GPU HIP Training](https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf), [GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah)).
This gives a maximum of five issued Instructions Per Cycle (IPC), per-SIMD,
per-CU ([AMD GPU HIP Training](https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf), [GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah)).
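
The issue rules above can be summarized in a toy model (our own
simplification, for intuition only; it ignores the internal-instruction and
unused-unit caveats in the note below):

.. code-block:: cpp

   #include <array>

   // One scheduler clock tick on one SIMD: each wave may issue at most one
   // instruction, and each category may issue at most once, so at most five
   // instructions go out: 0 = VALU, 1 = VMEM, 2 = SALU/SMEM, 3 = LDS, 4 = branch.
   struct Wave { std::array<bool, 5> ready{}; };   // per-category ready flags

   int issue_one_cycle(std::array<Wave, 8>& waves)
   {
       std::array<bool, 5> category_used{};
       int issued = 0;
       for (Wave& w : waves) {
           for (int c = 0; c < 5; ++c) {
               if (w.ready[c] && !category_used[c]) {
                   category_used[c] = true;
                   ++issued;            // this wave is done for the cycle
                   break;
               }
           }
       }
       return issued;                   // never exceeds 5 (the IPC bound)
   }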

On CDNA accelerators with [MFMA](mfma) instructions, these are issued via the [VALU](valu). Some of them will execute on a separate functional unit and typically allow other [VALU](valu) operations to execute in their shadow (see the [MFMA](mfma) section for more detail).
On CDNA accelerators with [MFMA](mfma) instructions, these are issued via the
[VALU](valu). Some of them will execute on a separate functional unit and typically allow other [VALU](valu) operations to execute in their shadow (see the [MFMA](mfma) section for more detail).

.. note::

```{note}
The IPC model used by Omniperf omits the following two complications for clarity.
First, CDNA accelerators contain other execution units on the CU that are unused for compute applications.
Second, so-called "internal" instructions (see [Layla Mah's GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 29) are not issued to a functional unit, and can technically cause the maximum IPC to _exceed_ 5 instructions per-cycle in special (largely unrealistic) cases.