add extlinks
Signed-off-by: Peter Jun Park <peter.park@amd.com>
peterjunpark committed Jun 27, 2024
1 parent cde3538 commit 67b6de3
Showing 8 changed files with 191 additions and 228 deletions.
7 changes: 2 additions & 5 deletions docs/conceptual/compute-unit.rst
@@ -38,16 +38,13 @@
The CU consists of several independent execution pipelines and functional units.
:ref:`desc-mfma`.

For a more in-depth description of a compute unit on a CDNA accelerator, see
slides 22 to 28 in
`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
and slide 27 in
`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.
:hip-training-2019:`22` and :gcn-crash-course:`27`.

:ref:`pipeline-desc` details the various
execution pipelines (VALU, SALU, LDS, Scheduler, etc.). The metrics
presented by Omniperf for these pipelines are described in
:ref:`pipeline-metrics`. Finally, the `vL1D <vL1D>`__ cache and
`LDS <LDS>`__ will be described their own sections.
:ref:`LDS <desc-lds>` will be described in their own sections.

.. include:: ./includes/pipeline-descriptions.rst

110 changes: 62 additions & 48 deletions docs/conceptual/includes/pipeline-descriptions.rst
@@ -13,20 +13,18 @@
over an entire wavefront, each `work-item <Workitem>`__ (or
vector-lane) potentially operating on distinct data. The VALU of a CDNA
accelerator or GCN GPU typically consists of:

- four 16-wide SIMD processors (see `An introduction to AMD GPU
Programming with
HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`__
for more details)
- four 64 or 128 KiB VGPR files (yielding a total of 256-512 KiB total
per CU), see `AGPRs <agprs>`__ for more detail.
- An instruction buffer (per-SIMD) that contains execution slots for up
* Four 16-wide SIMD processors (see :hip-training-2019:`24` for more details).
* Four 64 or 128 KiB VGPR files (yielding 256-512 KiB in total
per CU); see :ref:`AGPRs <agprs>` for more detail.
* An instruction buffer (per-SIMD) that contains execution slots for up
to 8 wavefronts (for 32 total wavefront slots on each CU).
- A vector memory (VMEM) unit which transfers data between VGPRs and
* A vector memory (VMEM) unit which transfers data between VGPRs and
memory; each work-item supplies its own memory address and supplies
or receives unique data.
- CDNA accelerators, such as the MI100 and `MI2XX <2xxnote>`__, contain
additional `Matrix Fused Multiply-Add (MFMA)
units <https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/>`__.
* CDNA accelerators, such as the MI100 and MI2XX [#mi2xx]_, contain
additional
:amd-lab-note:`Matrix Fused Multiply-Add (MFMA) <amd-lab-notes-matrix-cores-readme>`
units.
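
To make this concrete, the minimal HIP sketch below (our illustration, not
from the original docs; the kernel name is hypothetical) shows the execution
model these units implement: every work-item runs the same VALU instruction
stream on its own vector lane of data, with a 64-wide wavefront flowing
through a 16-wide SIMD over several cycles.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // One VALU instruction stream, executed per-lane on distinct data.
   __global__ void vec_add(const float* a, const float* b, float* c, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per work-item
       if (i < n)                // out-of-range lanes are simply masked off
           c[i] = a[i] + b[i];   // the add executes once per active lane
   }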

In order to support branching / conditionals, each wavefront in the VALU
has a distinct execution mask which determines which work-items in the
@@ -37,8 +35,11 @@
and are treated as no-ops.

.. note::

On GCN GPUs and the CDNA MI100 accelerator, there are slots for up to 10 wavefronts in the instruction buffer, but generally occupancy is limited by other factors to 32 waves per [Compute Unit](CU).
On the CDNA2 [MI2XX](2xxnote) series accelerators, there are only 8 waveslots per-SIMD.
On GCN GPUs and the CDNA MI100 accelerator, there are slots for up to 10
wavefronts in the instruction buffer, but generally occupancy is limited by
other factors to 32 waves per :doc:`compute unit <compute-unit>`.
On the CDNA2 MI2XX [#mi2xx]_ series accelerators, there are only 8 waveslots
per-SIMD.
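
As a small illustrative HIP example (ours, not from the docs), both sides of
a divergent branch execute in sequence, each time with the complementary set
of lanes disabled by the execution mask:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Hypothetical example of wavefront divergence: the hardware runs both
   // paths, masking off the inactive lanes (no-ops) for each one.
   __global__ void divergent(const int* in, int* out)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (in[i] % 2 == 0)
           out[i] = in[i] * 2;   // executes with even-valued lanes active
       else
           out[i] = in[i] + 1;   // executes with the remaining lanes active
   }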

.. _desc-salu:

@@ -47,16 +48,17 @@
Scalar Arithmetic Logic Unit (SALU)

The scalar arithmetic logic unit (SALU) executes instructions that are
shared between all work-items in a wavefront. This includes control-flow
– such as if/else conditionals, branches and looping – pointer
arithmetic, loading common values, etc. The SALU consists of:
– such as if/else conditionals, branches and looping
– pointer arithmetic, loading common values, etc.

- a scalar processor capable of various arithmetic, conditional, and
comparison (etc.) operations. See, e.g., `Chapter 5. Scalar ALU
Operations <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`__
of the CDNA2 Instruction Set Architecture (ISA) Guide for more
The SALU consists of:

- A scalar processor capable of various arithmetic, conditional, and
comparison (etc.) operations. See :mi200-isa-pdf:`Chapter 5. Scalar ALU Operations <35>`
of the CDNA2 Instruction Set Architecture (ISA) Reference Guide for more
detail.
- a 12.5 KiB Scalar General Purpose Register (SGPR) file
- a scalar memory (SMEM) unit which transfers data between SGPRs and
- A 12.5 KiB Scalar General Purpose Register (SGPR) file
- A scalar memory (SMEM) unit which transfers data between SGPRs and
memory
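
For example (an illustrative sketch, assuming typical compiler behavior),
values that are uniform across the wavefront, such as kernel arguments,
travel through the SMEM and SGPRs and are manipulated by SALU instructions,
while per-lane values live in VGPRs and use the VALU:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // 'a', 'n', and the x/y base pointers are identical for every work-item,
   // so the compiler typically loads them via the SMEM into SGPRs and
   // handles them with SALU instructions; i and x[i] are per-lane values
   // held in VGPRs and processed by the VALU.
   __global__ void saxpy(float a, const float* x, float* y, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           y[i] = a * x[i] + y[i];
   }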

Data loaded by the SMEM can be cached in the `scalar L1 data
@@ -65,35 +67,40 @@
accesses such as kernel arguments, or HIP’s ``__constant__`` memory.

.. _desc-lds:

Local Data Share (LDS)
Local data share (LDS)
----------------------

The local data share (LDS, a.k.a., “shared memory”) is fast on-CU
scratchpad that can be explicitly managed by software to effectively
share data and to coordinate between wavefronts in a workgroup.
.. _perf-model-branch:

The local data share (LDS, a.k.a. "shared memory") is a fast on-CU scratchpad
that can be explicitly managed by software to effectively share data and to
coordinate between wavefronts in a workgroup.
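
In HIP, the LDS backs ``__shared__`` allocations. The following sketch
(hypothetical, for illustration only) stages data through the LDS so that
work-items in a workgroup can exchange values:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Reverse a 256-element tile through the LDS (assumes blockDim.x == 256).
   __global__ void reverse_tile(const float* in, float* out)
   {
       __shared__ float tile[256];            // allocated in the CU's LDS
       int t = threadIdx.x;
       tile[t] = in[blockIdx.x * 256 + t];    // stage through the scratchpad
       __syncthreads();                       // coordinate the workgroup's wavefronts
       out[blockIdx.x * 256 + t] = tile[255 - t];
   }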

\```{figure} images/lds.\* :scale: 150 % :alt: Performance model of the
Local Data Share (LDS) on AMD Instinct(tm) MI accelerators. :align:
center
.. figure:: ../data/performance-model/lds.*
:align: center
:alt: Performance model of the local data share (LDS) on AMD Instinct
accelerators

Performance model of the Local Data Share (LDS) on AMD Instinct(tm) MI
accelerators.
Performance model of the local data share (LDS) on AMD Instinct MI-series
accelerators.

Above is Omniperf's performance model of the LDS on CDNA accelerators (adapted from [GCN Architecture, by Mike Mantor](https://old.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-3-ManyCore/HC24.28.315-AMD.GCN.mantor_v1.pdf), slide 20).
The SIMDs in the [VALU](valu) are connected to the LDS in pairs (see above).
Only one SIMD per pair may issue an LDS instruction at a time, but both pairs may issue concurrently.

On CDNA accelerators, the LDS contains 32 banks and each bank is 4B wide.
The LDS is designed such that each bank can be read from/written to/atomically updated every cycle, for a total throughput of 128B/clock ([GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 40).
The LDS is designed such that each bank can be read from/written to/atomically updated every cycle, for a total throughput of 128B/clock :gcn-crash-course:`40`.

On each of the two ports to the SIMDs, 64B can be sent in each direction per cycle. So, a single wavefront, coming from one of the 2 SIMDs in a pair, can only get back 64B/cycle (16 lanes per cycle). The input port is shared between data and address and this can affect achieved bandwidth for different data sizes. For example, a 64-wide store where each lane is sending a 4B value takes 8 cycles (50% peak bandwidth) while a 64-wide store where each lane is sending a 16B value takes 20 cycles (80% peak bandwidth).
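
These utilization figures can be reproduced with simple arithmetic. The model
below is our own back-of-the-envelope sketch; it assumes 64 active lanes and
that each lane's 4B address shares the 64B/cycle input port with the data:

.. code-block:: cpp

   // Assumed model: 64B/cycle per port, 64 active lanes, 4B per lane address.
   constexpr int lanes       = 64;
   constexpr int port_bytes  = 64;                      // bytes per cycle, per direction
   constexpr int addr_cycles = lanes * 4 / port_bytes;  // 4 cycles of addresses

   constexpr int store_cycles(int bytes_per_lane)
   {
       return addr_cycles + lanes * bytes_per_lane / port_bytes;
   }

   static_assert(store_cycles(4)  == 8,  "4B/lane: 256B in 8 cycles = 50% of peak");
   static_assert(store_cycles(16) == 20, "16B/lane: 1024B in 20 cycles = 80% of peak");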

In addition, the LDS contains conflict-resolution hardware to detect and handle bank conflicts.
A bank conflict occurs when two (or more) work-items in a wavefront want to read, write, or atomically update different addresses that map to the same bank in the same cycle.
In this case, the conflict detection hardware will determine a new schedule such that the access is split into multiple cycles with no conflicts in any single cycle.
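
For instance (a hypothetical HIP sketch based on the 32 x 4B bank layout
described above), an array stride that is a multiple of the bank count
funnels every lane into the same bank, while padding the stride by one
element spreads the lanes across all banks:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Bank of a 4B element on CDNA accelerators: (byte_address / 4) % 32.
   __global__ void bank_conflict_demo(float* out)
   {
       __shared__ float lds[64 * 33];
       int t = threadIdx.x;                    // assume one 64-wide wavefront
       for (int i = t; i < 64 * 33; i += blockDim.x)
           lds[i] = static_cast<float>(i);     // initialize the scratchpad
       __syncthreads();

       float conflicted = lds[t * 32];  // stride 32: every lane maps to bank 0,
                                        // so the read is split over many cycles
       float padded     = lds[t * 33];  // stride 33: lanes spread over all banks
       out[t] = conflicted + padded;
   }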

When multiple work-items want to read from the same address within a bank, the result can be efficiently broadcasted ([GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 41).
Multiple work-items writing to the same address within a bank typically results undefined behavior in HIP and other languages, as the LDS will write the value from the last work-item as determined by the hardware scheduler ([GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 41). This behavior may be useful in the very specific case of storing a uniform value.
When multiple work-items want to read from the same address within a bank, the result can be efficiently broadcasted
:gcn-crash-course:`41`.
Multiple work-items writing to the same address within a bank typically results in undefined behavior in HIP and other languages, as the LDS will write the value from the last work-item as determined by the hardware scheduler
:gcn-crash-course:`41`. This behavior may be useful in the very specific case of storing a uniform value.

Relatedly, an address conflict is defined as occurring when two (or more) work-items in a wavefront want to atomically update the same address on the same cycle.
As in a bank-conflict, this may cause additional cycles of work for the LDS operation to complete.
@@ -103,30 +110,37 @@
Branch
------

The branch unit is responsible for executing jumps and branches to execute control-flow operations.
Note that Branch operations are not used for execution mask updates, but only for “whole wavefront” control-flow changes.
The branch unit is responsible for executing jumps and branches to execute
control flow operations.
Note that branch operations are not used for execution mask updates, but only
for “whole wavefront” control-flow changes.

.. _desc-scheduler:

Scheduler
---------

The scheduler is responsible for arbitration and issue of instructions for all the wavefronts currently executing on the CU. On every clock cycle, the scheduler:
The scheduler is responsible for arbitration and issue of instructions for all
the wavefronts currently executing on the :doc:`CU <compute-unit>`. On every
clock cycle, the scheduler:

- considers waves from one of the SIMD units for execution, selected in a round-robin fashion between the SIMDs in the [compute unit](CU)
- issues up to one instruction per wavefront on the selected SIMD
- issues up to one instruction per each of the instruction categories among the waves on the selected SIMD:
- [VALU](valu)
- [VMEM](valu) operations
- [SALU](salu) / SMEM operations
- [LDS](lds)
- [Branch](branch) operations
* Considers waves from one of the SIMD units for execution, selected in a
round-robin fashion between the SIMDs in the compute unit
* Issues up to one instruction per wavefront on the selected SIMD
* Issues up to one instruction from each of the instruction categories among the waves on the selected SIMD:
* :ref:`VALU <desc-valu>` / :ref:`VMEM <desc-valu>` operations
* :ref:`SALU <desc-salu>` / SMEM operations
* :ref:`LDS <desc-lds>`
* :ref:`Branch <desc-branch>` operations

This gives a maximum of five issued Instructions Per Cycle (IPC), per-SIMD, per-CU ([AMD GPU HIP Training](https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf), [GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah)).
This gives a maximum of five issued Instructions Per Cycle (IPC), per-SIMD,
per-CU ([AMD GPU HIP Training](https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf), [GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah)).
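
The issue rules above can be summarized in a toy model (our own
simplification, for intuition only; it ignores the internal-instruction and
unused-unit caveats in the note below):

.. code-block:: cpp

   #include <array>

   // One scheduler clock tick on one SIMD: each wave may issue at most one
   // instruction, and each category may issue at most once, so at most five
   // instructions go out: 0 = VALU, 1 = VMEM, 2 = SALU/SMEM, 3 = LDS, 4 = branch.
   struct Wave { std::array<bool, 5> ready{}; };   // per-category ready flags

   int issue_one_cycle(std::array<Wave, 8>& waves)
   {
       std::array<bool, 5> category_used{};
       int issued = 0;
       for (Wave& w : waves) {
           for (int c = 0; c < 5; ++c) {
               if (w.ready[c] && !category_used[c]) {
                   category_used[c] = true;
                   ++issued;            // this wave is done for the cycle
                   break;
               }
           }
       }
       return issued;                   // never exceeds 5 (the IPC bound)
   }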

On CDNA accelerators with [MFMA](mfma) instructions, these are issued via the [VALU](valu). Some of them will execute on a separate functional unit and typically allow other [VALU](valu) operations to execute in their shadow (see the [MFMA](mfma) section for more detail).
On CDNA accelerators with [MFMA](mfma) instructions, these are issued via the
[VALU](valu). Some of them will execute on a separate functional unit and typically allow other [VALU](valu) operations to execute in their shadow (see the [MFMA](mfma) section for more detail).

.. note::

```{note}
The IPC model used by Omniperf omits the following two complications for clarity.
First, CDNA accelerators contain other execution units on the CU that are unused for compute applications.
Second, so-called "internal" instructions (see [Layla Mah's GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 29) are not issued to a functional unit, and can technically cause the maximum IPC to _exceed_ 5 instructions per-cycle in special (largely unrealistic) cases.