 Primitive Ordered Pixel Shading
 ===============================

 Primitive Ordered Pixel Shading (POPS) is a feature, available starting from
 GFX9, that provides the Fragment Shader Interlock or Fragment Shader Ordering
 functionality.

 It allows a part of a fragment shader — an ordered section (or a critical
 section) — to be executed sequentially in rasterization order for different
 invocations covering the same pixel position.

 This article describes how POPS is set up in shader code and the registers. The
 information here is currently provided for architecture generations up to GFX11.

 Note that the information in this article is **not official** and may contain
 inaccuracies, as well as incomplete or incorrect assumptions. It is based on the
 shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage
 in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references,
 and experimentation with the hardware.

 Shader code
 -----------

 With POPS, a wave can dynamically execute at most one ordered section. A wave
 may also skip the ordered section entirely if its execution path doesn't need
 ordering.

 The setup of the ordered section consists of three parts:

 1. Entering the ordered section in the current wave — awaiting the completion of
    ordered sections in overlapped waves.
 2. Resolving overlap within the current wave — intrawave collisions (optional
    and GFX9–10.3 only).
 3. Exiting the ordered section — resuming overlapping waves trying to enter
    their ordered sections.

 GFX9–10.3: Entering the ordered section in the wave
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Awaiting the completion of ordered sections in overlapped waves is performed by
 setting the POPS packer hardware register, and then polling the volatile
 ``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest
 overlapped wave ID for the current wave.

 The information needed for the wave to perform the waiting is provided to it via
 the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in both
 the ``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers. Unlike
 various other arguments, the POPS arguments must be enabled in
 ``PA_SC_SHADER_CONTROL`` as well, not only in ``RSRC``.

 The collision wave ID argument contains the following unsigned values:

 * [31]: Whether overlap has occurred.
 * [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated
   with.
 * [25:16]: Newest overlapped wave ID.
 * [9:0]: Current wave ID.
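
 As a minimal sketch, the fields listed above could be unpacked like this, for
 example in a CPU-side debugging tool. The struct and function names are
 hypothetical, and the layout is only the unofficial one given in this article.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded COLLISION_WAVEID fields. */
struct pops_collision_wave_id {
   bool did_overlap;              /* [31] */
   uint32_t packer_id;            /* [29:28] on GFX10+, [28] on GFX9 */
   uint32_t newest_overlapped_id; /* [25:16] */
   uint32_t current_id;           /* [9:0] */
};

/* Decode the SGPR value; is_gfx10_plus selects the packer ID width. */
static inline struct pops_collision_wave_id
pops_decode_collision_wave_id(uint32_t sgpr, bool is_gfx10_plus)
{
   struct pops_collision_wave_id out;
   out.did_overlap = (sgpr >> 31) & 0x1;
   out.packer_id = (sgpr >> 28) & (is_gfx10_plus ? 0x3 : 0x1);
   out.newest_overlapped_id = (sgpr >> 16) & 0x3FF;
   out.current_id = sgpr & 0x3FF;
   return out;
}
```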

 The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths
 for these fields, possibly from an early development iteration, but the field
 meanings described there are accurate.

 The wait must not be performed if the "did overlap" bit 31 is 0, otherwise the
 wave will hang. A value of 0 also indicates that there is *both* no wave
 overlap *and no intrawave collisions* for the current wave. So if the bit is 0,
 the wave can safely skip all of the POPS logic and execute the contents of the
 ordered section as usual with unordered access, as a potential additional
 optimization. The packer hardware register, however, may safely be set even
 without overlap. It's only the wait loop itself that must not be executed when
 no overlap was reported.

 The packer ID needs to be passed to the packer hardware register using
 ``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.

 On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer
 the wave is associated with:

 * [25]: The wave is associated with packer 1.
 * [24]: The wave is associated with packer 0.

 Initially, both of these bits are set to 0, meaning that POPS is disabled for
 the wave. If the wave needs to enter the ordered section, it must set bit 24 to
 1 if the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the
 packer ID is 1.

 Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead,
 containing the following fields:

 * [2:1]: Packer ID.
 * [0]: POPS enabled for the wave.

 Initially, POPS is disabled for a wave. To start entering the ordered section,
 bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs
 to be set to 1.
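
 The two encodings described above (one-hot bits in ``MODE`` on GFX9, an enable
 bit plus a packer ID in ``POPS_PACKER`` on GFX10–10.3) can be sketched as two
 small helpers that compute the value written by ``s_setreg_b32``. The helper
 names are hypothetical and are not taken from any driver.

```c
#include <assert.h>
#include <stdint.h>

/* GFX9: value for the 2-bit field MODE[25:24] (one-hot packer select). */
static inline uint32_t
pops_gfx9_mode_packer_bits(uint32_t packer_id)
{
   assert(packer_id < 2);
   return packer_id ? 0x2 : 0x1;
}

/* GFX10-10.3: value for POPS_PACKER[2:0] (enable bit plus packer ID). */
static inline uint32_t
pops_gfx10_packer_value(uint32_t packer_id)
{
   assert(packer_id < 4);
   return 0x1 | (packer_id << 1);
}
```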

 The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are
 10-bit values wrapping around on overflow — consecutive waves are numbered 1022,
 1023, 0, 1… This wraparound needs to be taken into account when comparing the
 exiting wave ID and the newest overlapped wave ID.

 Specifically, until the current wave exits the ordered section, its ID can't be
 smaller than the newest overlapped wave ID or the exiting wave ID. So
 ``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to
 monotonically increasing unsigned values. In this case, the largest value,
 0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current
 wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from
 before the last wraparound will be near 0 increasing away from it. Subtracting
 ``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.

 GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit
 newest overlapped wave ID is greater than the 10-bit current wave ID (meaning
 that it's behind the last wraparound point), 1 needs to be added to the newest
 overlapped wave ID before using it in the comparison. This was corrected in
 GFX10.
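
 To make the remapping concrete, here is a small sketch of it in C, assuming
 plain 32-bit unsigned arithmetic like the scalar ALU uses. The function names
 are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Remap a 10-bit wave ID so that the current wave maps to 0xFFFFFFFF. */
static inline uint32_t
pops_remap_wave_id(uint32_t wave_id, uint32_t current_10bit_wave_id)
{
   assert(current_10bit_wave_id < 1024);
   /* Adding ~current is the same as subtracting (current + 1). */
   return wave_id + ~current_10bit_wave_id;
}

/* Remapped newest overlapped wave ID, with the GFX9 off-by-one correction. */
static inline uint32_t
pops_remap_newest_overlapped(uint32_t newest_10bit_wave_id,
                             uint32_t current_10bit_wave_id, bool is_gfx9)
{
   if (is_gfx9 && newest_10bit_wave_id > current_10bit_wave_id)
      ++newest_10bit_wave_id;
   return pops_remap_wave_id(newest_10bit_wave_id, current_10bit_wave_id);
}
```

 For example, with the current wave ID 2, the consecutive wave IDs 1022, 1023,
 0, 1, 2 are remapped to 1019, 1020, 0xFFFFFFFD, 0xFFFFFFFE, 0xFFFFFFFF, which
 can be compared as ordinary unsigned integers.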

 The exiting wave ID (not to be confused with "exited" — the exiting wave ID is
 the wave that will exit the ordered section next) is queried via the
 ``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be
 one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave
 ID to a monotonically increasing one.

 It's a volatile operand, and it needs to be read in a loop until its value
 becomes greater than the newest overlapped wave ID (after remapping both to
 monotonic). However, if it's too early for the current wave to enter the ordered
 section, it needs to yield execution, via ``s_sleep``, to other waves that may
 potentially be overlapped. GFX9 requires a finite delay to be specified; AMD
 uses 3. Starting from GFX10, exiting the ordered section wakes up the waiting
 waves, so the maximum delay of 0xFFFF can be used.

 In pseudocode, the entering logic would look like this::

    bool did_overlap = collision_wave_id[31];
    if (did_overlap) {
       if (gfx_level >= GFX10) {
          uint packer_id = collision_wave_id[29:28];
          s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
       } else {
          uint packer_id = collision_wave_id[28];
          s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
       }

       uint current_10bit_wave_id = collision_wave_id[9:0];
       // Or -(current_10bit_wave_id + 1).
       uint wave_id_remap_offset = ~current_10bit_wave_id;

       uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
       if (gfx_level < GFX10 &&
           newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
          ++newest_overlapped_10bit_wave_id;
       }
       uint newest_overlapped_wave_id =
          newest_overlapped_10bit_wave_id + wave_id_remap_offset;

       while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
                newest_overlapped_wave_id)) {
          s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
       }
    }

 The SPIR-V fragment shader interlock specification requires an invocation — an
 individual invocation, not the whole subgroup — to execute
 ``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple
 begin instructions, or even multiple begin/end pairs, under divergent
 conditions, a wave may end up waiting for the overlapped waves multiple times.
 Thankfully, it's safe to set the POPS packer hardware register to the same
 value, or to run the wait loop, multiple times during the wave's execution, as
 long as the ordered section isn't exited in between by the wave.

 GFX11: Entering the ordered section in the wave
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave
 status flag to report that the wave may enter the ordered section. It's awaited
 by the ``s_wait_event`` instruction, with bit 0 ("don't wait for
 ``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD
 passes 0 as the whole immediate operand.

 The "export ready" wait can be done multiple times safely.

 GFX9–10.3: Resolving intrawave collisions
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 On GFX9–10.3, it's possible for overlapping fragment shader invocations to be
 placed not only in different waves, but also in the same wave, with the shader
 code making sure that the ordered section is executed for overlapping
 invocations in order.

 This functionality is optional — it can be activated by enabling loading of the
 ``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and
 ``PA_SC_SHADER_CONTROL``.

 The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION``
 contain the mask of whether each quad in the wave starts a new layer of
 overlapping invocations, and thus the ordered section code for them needs to be
 executed after running it for all lanes with indices preceding that quad index
 multiplied by 4. The rest of the bits in the argument need to be ignored — AMD
 explicitly masks them out in shader code (although this is not necessary if the
 shader uses "find first 1" to obtain the start of the next set of overlapping
 quads or expands this quad mask into a lane mask).

 For example, if the intrawave collision mask is 0b0000001110000100, or
 ``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code of the ordered section
 needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads
 6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32),
 and then for the remaining quads 15:9 (lanes 63:36).
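
 The expansion of the quad mask into per-layer lane masks can be sketched on
 the CPU as follows. This is only an illustration of the splitting rule above,
 not shader code, and the function name and signature are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Split a wave into layers of lanes. Each set bit N (N > 0) in quad_mask
 * starts a new layer at lane N * 4. Writes up to max_layers lane masks in
 * execution order and returns the number of layers written.
 */
static inline unsigned
pops_intrawave_layers(uint32_t quad_mask, unsigned wave_size,
                      uint64_t *layer_lane_masks, unsigned max_layers)
{
   assert(wave_size == 32 || wave_size == 64);
   unsigned num_quads = wave_size / 4;
   unsigned num_layers = 0;
   unsigned first_quad = 0;

   /* Ignore the bits above the quad count, as described above. */
   quad_mask &= (1u << num_quads) - 1;

   for (unsigned quad = 1; quad <= num_quads; quad++) {
      /* A layer ends at the end of the wave or right before a set bit. */
      if (quad < num_quads && !(quad_mask & (1u << quad)))
         continue;
      if (num_layers == max_layers)
         break;
      unsigned num_lanes = (quad - first_quad) * 4;
      uint64_t lanes = num_lanes == 64 ? ~UINT64_C(0)
                                       : (UINT64_C(1) << num_lanes) - 1;
      layer_lane_masks[num_layers++] = lanes << (first_quad * 4);
      first_quad = quad;
   }
   return num_layers;
}
```

 For the mask from the example, 0b0000001110000100 in a 64-lane wave, this
 produces five lane masks covering lanes 7:0, 27:8, 31:28, 35:32 and 63:36.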

 This effectively causes the ordered section to be executed as smaller
 "sub-subgroups" within the original subgroup.

 However, this is not always compatible with the execution model of SPIR-V or
 GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of
 the shader in a loop may be unsafe in some cases. One particular example is when
 the shader uses subgroup operations influenced by lanes outside the current
 quad. In this case, the code outside and inside the ordered section may be
 executed with different sets of active invocations, affecting the results of
 subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not
 supposed to modify the set of active invocations in any way. So the intrawave
 collision loop may break the results of subgroup operations in unpredictable
 ways, even outside the driver's compiler infrastructure. Even if the driver
 splits the subgroup exactly at ``OpBeginInvocationInterlockEXT`` and makes the
 lane subsets rejoin exactly at ``OpEndInvocationInterlockEXT``, the application
 and the compilers that created the source shader are still not aware of that
 happening — the input SPIR-V or GLSL shader might have already gone through
 various optimizations, such as common subexpression elimination which might
 have considered a subgroup operation before ``OpBeginInvocationInterlockEXT``
 and one after it equivalent.

 The idea behind reporting intrawave collisions to shaders is to reduce the
 impact on the parallelism of the part of the shader that doesn't depend on the
 ordering, to avoid wasting lanes in the wave and to allow the code outside the
 ordered section in different invocations to run in parallel lanes as usual. This
 may be especially helpful if the ordered section is small compared to the rest
 of the shader, for instance, a custom blending equation at the end of the usual
 fragment shader for a surface in the world.

 However, whether handling intrawave collisions is preferred is not a question
 with one universal answer. Intrawave collisions are pretty uncommon without
 multisampling, or when using sample interlock with multisampling, although
 they're highly frequent with pixel interlock with multisampling, when adjacent
 primitives cover the same pixels along the shared edge (though that's an
 extremely expensive situation in general). But resolving intrawave collisions
 adds some overhead costs to the shader. If intrawave overlap is unlikely to
 happen often, or even more importantly, if the majority of the shader is inside
 the ordered section, handling it in the shader may cause more harm than good.

 GFX11 removes this concept entirely. Instead, overlapping invocations are always
 placed in different waves.

 GFX9–10.3: Exiting the ordered section in the wave
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 To exit the ordered section and let overlapping waves resume execution and enter
 their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message
 (7) using ``s_sendmsg``.

 If the wave has enabled POPS by setting the packer hardware register, it *must
 not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the
 message must be sent on all execution paths after the packer register setup.
 However, if the wave exits before having configured the packer register, sending
 the message is not required, though it's still fine to send it regardless of
 that.

 Note that if the shader has multiple ``OpEndInvocationInterlockEXT``
 instructions executed in the same wave (depending on a divergent condition, for
 example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave
 only once, and especially not before any awaiting of overlapped waves.
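
 The rules above (send the message at most once, and never end the program
 with POPS enabled but the message unsent) can be modeled with a tiny state
 tracker. This is a hypothetical sketch for reasoning about the rules. A real
 compiler would enforce them statically when emitting code rather than track
 them at run time.

```c
#include <stdbool.h>

/* Hypothetical per-wave bookkeeping for GFX9-10.3 ordered section exits. */
struct pops_wave_state {
   bool packer_register_set; /* POPS was enabled via the packer register */
   bool done_sent;           /* ORDERED_PS_DONE has already been sent */
};

/* Call at every exit point; returns true if the message should be sent. */
static inline bool
pops_exit_ordered_section(struct pops_wave_state *state)
{
   if (state->done_sent)
      return false; /* never send ORDERED_PS_DONE twice */
   state->done_sent = true;
   return true;
}

/* s_endpgm is allowed if POPS was never enabled or the message was sent. */
static inline bool
pops_may_end_program(const struct pops_wave_state *state)
{
   return !state->packer_register_set || state->done_sent;
}
```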

 Before the message is sent, all counters for memory accesses that need to be
 primitive-ordered must be awaited. This applies to writes, and also to reads if
 something after the ordered section depends on the per-pixel data (for
 instance, the tail blending fallback in order-independent transparency). The
 counters may include ``vm``, ``vs``, and in some cases ``lgkm``. Normally,
 primitive-ordered memory accesses are done through VMEM with divergent
 addresses rather than SMEM, as there's no synchronization between fragments at
 different pixel coordinates. It's still technically possible, even though
 pointless and suboptimal, for a shader to perform them explicitly through SMEM,
 for instance in a waterfall loop, and that must work correctly too. Without
 the wait, a race condition will occur when the newly resumed waves start
 accessing memory locations to which the current wave still has outstanding
 accesses.

 Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction,
 which combines waiting for all the counters, sending the ``ORDERED_PS_DONE``
 message, and ending the program. Generally, however, it's desirable to resume
 overlapping waves as early as possible, including before the export, as it may
 stall the wave for some time too.

 GFX11: Exiting the ordered section in the wave
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 The overlapping waves are resumed when the wave performs the last export (with
 the ``done`` flag).

 The same requirements for awaiting the memory access counters as on GFX9–10.3
 still apply.

 Memory access requirements
 ^^^^^^^^^^^^^^^^^^^^^^^^^^

 The compiler needs to ensure that entering the ordered section implements
 acquire semantics, and exiting it implements release semantics, in the fragment
 interlock memory scope for ``UniformMemory`` and ``ImageMemory`` SPIR-V storage
 classes.

 A fragment interlock memory scope instance includes overlapping fragment shader
 invocations executed by commands inside a single subpass. It may be considered a
 subset of a queue family memory scope instance from the perspective of memory
 barriers.

 Fragment shader interlock doesn't perform implicit memory availability or
 visibility operations. Shaders must do them by themselves for accesses requiring
 primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL
 or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope
 in SPIR-V.

 On AMD hardware, this means that the accessed memory locations must be made
 available or visible between waves that may be executed on any compute unit — so
 accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag
 and L1$ via DLC.

 However, it should be noted that memory accesses in the ordered section may be
 expected by the application to be done in primitive order even if they don't
 have the GLC and DLC flags. Coherent access not only bypasses, but also
 invalidates the lower-level caches for the accessed memory locations. Thus,
 considering that normally per-pixel data is accessed exclusively by the
 invocation executing the ordered section, it's not necessary to make all reads
 or writes in the ordered section for one memory location GLC/DLC, just the
 first read and the last write. It doesn't matter if per-pixel data is cached in
 L0/L1 in the middle of a dependency chain in the ordered section, as long as
 it's invalidated in them at the beginning and flushed to L2 at the end.
 Therefore, optimizations in the compiler must not simply assume that only
 coherent accesses need primitive ordering. Moreover, the compiler must also
 take into account that the same data may be accessed through different
 bindings.

 Export requirements
 ^^^^^^^^^^^^^^^^^^^

 With POPS, on all hardware generations, the shader must have at least one
 export, though it can be a null or an ``off, off, off, off`` one.

 Also, even if the shader doesn't need to export any real data, the export
 skipping that was added in GFX10 must not be used, and some space must be
 allocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for
 some color output to ``SPI_SHADER_32_R``.

 Without this, the shader will be executed without the needed synchronization on
 GFX10, and will hang on GFX11.

 Drawing context setup
 ---------------------

 Configuring POPS
 ^^^^^^^^^^^^^^^^

 Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register.

 To enable POPS for the draw,
 ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1.

 On GFX9–10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which
 fragment shader invocations are considered overlapping:

 * For pixel interlock, it must be set to 0 (1 sample).
 * If sample interlock is sufficient (only synchronizing between invocations that
   have any common sample mask bits), it may be set to
   ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``, the number of sample coverage mask
   bits passed to the shader, which is expected to use the sample mask to
   determine whether it's allowed to access the data for each of the samples.

   As of April 2023, PAL for some reason doesn't use non-1x
   ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer
   Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading.
   Those APIs tie the interlock granularity to the shading frequency. Vulkan
   and OpenGL fragment shader interlock, however, allow specifying the interlock
   granularity independently of it, making it possible both to ask for finer
   synchronization guarantees and to require stronger ones than Direct3D ROVs can
   provide.

   With MSAA on AMD hardware, however, pixel interlock generally performs
   *massively*, sometimes prohibitively, slower than sample interlock, because
   it causes fragment shader invocations along the common edge of adjacent
   primitives to be ordered as they cover the same pixels (even though they
   don't cover any common samples). So it's highly desirable for the driver to
   provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES``
   accordingly, if the shader declares via the execution mode that it's enough
   for it.

 On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is
 used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier
 architecture generations (and has a different bit offset in the register), and
 ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11
 blending performance workaround overriding the intrinsic rate must not be
 applied if POPS is used in the draw — the intrinsic rate override must be used
 solely to control the interlock granularity in this case.
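
 As a rough sketch, the ``DB_SHADER_CONTROL`` field selection described above
 could look like this. The struct is a hypothetical stand-in for the real
 register packing macros, and the sketch assumes that
 ``OVERRIDE_INTRINSIC_RATE`` takes the same encoding as
 ``POPS_OVERLAP_NUM_SAMPLES``, which the text above implies but doesn't state
 outright.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical unpacked view of the POPS-related DB_SHADER_CONTROL fields. */
struct pops_db_shader_control {
   bool primitive_ordered_pixel_shader;
   uint32_t pops_overlap_num_samples;   /* GFX9-10.3 only */
   bool override_intrinsic_rate_enable; /* GFX11 only */
   uint32_t override_intrinsic_rate;    /* GFX11 only */
};

/*
 * msaa_exposed_samples is the PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES field
 * value. Pixel interlock always uses 0 (1 sample).
 */
static inline struct pops_db_shader_control
pops_db_shader_control_fields(bool is_gfx11, bool sample_interlock,
                              uint32_t msaa_exposed_samples)
{
   struct pops_db_shader_control c = {0};
   uint32_t granularity = sample_interlock ? msaa_exposed_samples : 0;

   c.primitive_ordered_pixel_shader = true;
   if (is_gfx11) {
      /* Assumption: same encoding as POPS_OVERLAP_NUM_SAMPLES. */
      c.override_intrinsic_rate_enable = true;
      c.override_intrinsic_rate = granularity;
   } else {
      c.pops_overlap_num_samples = granularity;
   }
   return c;
}
```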

 No explicit flushes/synchronization are needed when changing the pipeline state
 variables that may be involved in POPS, such as the rasterization sample count.
 POPS automatically keeps synchronizing invocations even between draws with
 different sample counts (invocations with common coverage mask bits are
 considered overlapping by the hardware, regardless of what those samples
 actually are — only the indices are important).

 Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage
 sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``
 even if there's no depth/stencil target.

 Hardware bug workarounds
 ^^^^^^^^^^^^^^^^^^^^^^^^

 Early revisions of GFX9 — ``CHIP_VEGA10`` and ``CHIP_RAVEN`` — contain a
 hardware bug that may result in a hang, and need a workaround to be enabled.
 Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or
 more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP``
 must be set to 1 for draws that satisfy this condition. In PAL, this is the
 ``waMiscPopsMissedOverlap`` workaround. It reduces performance in those cases,
 increasing the frame time by around 1.5 to 2 times in
 `nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
 on the RX Vega 10, but it's only required in a fairly rare case (8x+ MSAA) and
 is mandatory to ensure stability.

 Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required
 on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if
 it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``,
 ``CHIP_NAVI14``), and the draw uses POPS,
 ``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to
 ``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL).
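
 The two conditions above can be summarized as predicates. This is a minimal
 sketch with a hypothetical chip enum and function names. It only mirrors the
 rules stated in this section and is not PAL's or any driver's actual logic.

```c
#include <stdbool.h>

/* Hypothetical chip identifiers; only the chips named above are listed. */
enum pops_chip {
   POPS_CHIP_VEGA10,
   POPS_CHIP_RAVEN,
   POPS_CHIP_NAVI10,
   POPS_CHIP_NAVI12,
   POPS_CHIP_NAVI14,
   POPS_CHIP_OTHER,
};

/* Whether DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP must be 1 for the draw. */
static inline bool
pops_needs_drain_ps_on_overlap(enum pops_chip chip, bool pops_enabled,
                               unsigned rasterization_samples,
                               unsigned depth_stencil_samples)
{
   if (chip != POPS_CHIP_VEGA10 && chip != POPS_CHIP_RAVEN)
      return false;
   return pops_enabled &&
          (rasterization_samples >= 8 || depth_stencil_samples >= 8);
}

/* Whether PARTIAL_SQUAD_LAUNCH_CONTROL must be PSLC_ON_HANG_ONLY (GFX10.1). */
static inline bool
pops_needs_pslc_on_hang_only(enum pops_chip chip, bool pops_enabled,
                             bool drain_ps_on_overlap_enabled)
{
   bool is_gfx10_1 = chip == POPS_CHIP_NAVI10 || chip == POPS_CHIP_NAVI12 ||
                     chip == POPS_CHIP_NAVI14;
   return is_gfx10_1 && pops_enabled && drain_ps_on_overlap_enabled;
}
```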

 Out-of-order rasterization interaction
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 This is a largely unresearched topic currently. However, considering that POPS
 is primarily the functionality of the Depth Block, similarity to the behavior of
 out-of-order rasterization in depth/stencil testing may possibly be expected.

 If the shader specifies an ordered interlock execution mode, out-of-order
 rasterization likely must not be enabled implicitly.

 As of April 2023, PAL doesn't have any rules specifically for POPS in the logic
 determining whether out-of-order rasterization can be enabled automatically.
 Some of the POPS use cases may possibly be covered by the rule that always
 disables out-of-order rasterization if the shader writes to Unordered Access
 Views (storage resources), though fragment shader interlock can be used for
 read-only purposes too (for ordering between draws that only read per-pixel data
 and draws that may write it), so that may be an oversight.

 Explicitly enabled relaxed rasterization order modifies the concept of
 rasterization order itself in Vulkan, so from the point of view of the
 specification of fragment shader interlock, relaxed rasterization order should
 still be applicable regardless of whether the shader requests ordered interlock.
 PAL also doesn't make any POPS-specific exceptions here as of April 2023.

 Variable-rate shading interaction
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces
 the shading rate to be 1x1, thus the
 ``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must
 be false.

 On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the
 ``fragmentShadingRateWithFragmentShaderInterlock`` property must be true.
 However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set,
 enabling POPS will force 1x1 shading rate.

 The widest interlock granularity available on GFX11 — with the lowest possible
 Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There's no
 synchronization between coarse fragment shader invocations if they don't cover
 common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device
 feature is not available.

 Additional configuration
 ^^^^^^^^^^^^^^^^^^^^^^^^

 These are some largely unresearched options found in the register declarations.
 PAL doesn't use them, so it's unknown if they make any significant difference.
 No effect was found in `nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
 during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_GFX1100``.

 * ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9–10.3.
 * ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+.