| Single-sampled Color Compression |
| ================================ |
| |
| Starting with Ivy Bridge, Intel graphics hardware provides a form of color |
| compression for single-sampled surfaces. In its initial form, this provided an |
| acceleration of render target clear operations that, in the common case, allows |
| you to avoid almost all of the bandwidth of a full-surface clear operation. On |
| Sky Lake, single-sampled color compression was extended to allow for the |
| compression of color values from actual rendering and not just the initial |
| clear. From here on, the older Ivy Bridge form of color compression will be |
| called "fast-clears" and the term "color compression" will be reserved for the |
| more powerful Sky Lake form. |
| |
| The documentation for Ivy Bridge through Broadwell overloads the term MCS to |
| refer both to the *multisample control surface* used for multisample |
| compression and to the control surface used for fast-clears. In ISL, the |
| :cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_MCS` enum always refers to |
| multisample color compression while the |
| :cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_CCS_` enums always refer to |
| single-sampled color compression. Throughout this chapter and the rest of the |
| ISL documentation, we will use the term "color control surface", abbreviated |
| CCS, to denote the control surface used for both fast-clears and color |
| compression. While this is still an overloaded term, Ivy Bridge fast-clears |
| are much closer to Sky Lake color compression than they are to multisample |
| compression. |
| |
| CCS data |
| -------- |
| |
| Fast clears and CCS are possibly the single most poorly documented aspect of |
| surface layout/setup for Intel graphics hardware (with HiZ coming in a close |
| second). All the documentation really says is that you can use an MCS buffer on |
| single-sampled surfaces (we will call it the CCS in this case). It also |
| provides some documentation on how to program the hardware to perform clear |
| operations, but that's it. How big is this buffer? What does it contain? |
| Those questions are left as exercises to the reader. Almost everything we know |
| about the contents of the CCS is gleaned from reverse-engineering of the |
| hardware. The best bit of documentation we have ever had comes from the |
| display section of the Sky Lake PRM Vol 12 section on planes (p. 159): |
| |
|     The Color Control Surface (CCS) contains the compression status of the |
|     cache-line pairs. The compression state of the cache-line pair is |
|     specified by 2 bits in the CCS. Each CCS cache-line represents an area |
|     on the main surface of 16x16 sets of 128 byte Y-tiled cache-line-pairs. |
|     CCS is always Y tiled. |
| |
| While this is technically for color compression and not fast-clears, it |
| provides a good bit of insight into how color compression and fast-clears |
| operate. Each cache-line pair in the main surface corresponds to 1 or 2 bits |
| in the CCS. The primary difference, as far as the current discussion is |
| concerned, is that fast-clears use only 1 bit per cache-line pair whereas color |
| compression uses 2 bits. |
| |
| What is a cache-line pair? Both the X and Y tiling formats are arranged as an |
| 8x8 grid of cache lines. (See the :doc:`chapter on tiling <tiling>` for more |
| details.) In either case, a cache-line pair is a pair of cache lines whose |
| starting addresses differ by 512 bytes or 8 cache lines. This results in the |
| two cache lines being vertically adjacent when the main surface is X-tiled and |
| horizontally adjacent when the main surface is Y-tiled. For an X-tiled surface |
| this forms an area of 64B x 2rows and for a Y-tiled surface this forms an area |
| of 32B x 4rows. In either case, it is guaranteed that, regardless of surface |
| format, each 2x2 subspan coming out of a shader will land entirely within one |
| cache-line pair. |
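
The pairing described above can be sketched in a few lines of C. This is a
minimal illustration rather than ISL code: it assumes the standard 4KB X-tile
(512B x 8 rows, row-major) and Y-tile (128B x 32 rows, stored as 16B-wide
columns) layouts, ignores bit-6 swizzling, and assumes that the two cache lines
of a pair differ only in address bit 9. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum tiling { TILING_X, TILING_Y };

/* Byte offset of (x_bytes, row) inside one 4KB tile, ignoring swizzling. */
static uint32_t
tile_offset(enum tiling tiling, uint32_t x_bytes, uint32_t row)
{
   if (tiling == TILING_X) {
      /* X tile: 8 rows of 512 bytes, stored row after row. */
      assert(x_bytes < 512 && row < 8);
      return row * 512 + x_bytes;
   } else {
      /* Y tile: 8 columns of 16 bytes x 32 rows, stored column after column. */
      assert(x_bytes < 128 && row < 32);
      return (x_bytes / 16) * 512 + row * 16 + (x_bytes % 16);
   }
}

/* Offset of the first byte of the 64B cache line containing this offset. */
static uint32_t
cache_line_start(uint32_t offset)
{
   return offset & ~63u;
}

/* Start of the other cache line in the pair: 512 bytes away.
 * Assumption: pairs are aligned, so the partner is found by flipping bit 9. */
static uint32_t
pair_partner_start(uint32_t offset)
{
   return cache_line_start(offset) ^ 512u;
}
```

Under these assumptions, the partner of the first cache line of an X tile
starts at ``tile_offset(TILING_X, 0, 1)``, one row down, while in a Y tile it
starts at ``tile_offset(TILING_Y, 16, 0)``, one 16B column to the right.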
| |
| What is the correspondence between bits and cache-line pairs? The best model I |
| (Jason) know of is to consider the CCS as having a special tiling format |
| combined with a 1-bit color format for fast-clears or a 2-bit color format for |
| color compression. The CCS tiling formats operate on a 1 or 2-bit granularity |
| rather than the byte granularity of most tiling formats. |
| |
| The following table represents the bit-layouts that yield the CCS tiling format |
| on different hardware generations. Bits 0-11 correspond to the regular swizzle |
| of bytes within a 4KB page whereas the negative bits represent the address of |
| the particular 1 or 2-bit portion of a byte. The 1-bit fast-clear layouts need |
| three such bits (-1 to -3) while the 2-bit Sky Lake layout needs only two, so |
| its row has no entry for bit -3. (Note: The Haswell data was gathered on a |
| dual-channel system so bit-6 swizzling was enabled. It's unclear how this |
| affects the CCS layout.) |
| |
| ============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== |
| Generation Tiling 11 10 9 8 7 6 5 4 3 2 1 0 -1 -2 -3 |
| ============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== |
| Ivy Bridge X or Y :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0` |
| Haswell X :math:`u_6` :math:`u_5` :math:`v_3 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0` |
| Haswell Y :math:`u_6` :math:`u_5` :math:`v_2 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0` |
| Broadwell X :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`u_3` :math:`v_3` :math:`u_2` :math:`u_1` :math:`u_0` :math:`v_2` :math:`v_1` :math:`v_0` |
| Broadwell Y :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_1` :math:`v_0` :math:`u_0` |
| Sky Lake Y :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_0` :math:`u_0` |
| ============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== |
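
As a worked example, the Sky Lake row of the table can be turned into an address
computation. The sketch below is only an illustration and makes two assumptions
the text does not state outright: that :math:`u` and :math:`v` are the
horizontal and vertical coordinates of a 2-bit element within one 4KB CCS tile,
and that bit -1 is the more significant of the two sub-byte address bits. The
names are hypothetical and not part of ISL.

```c
#include <assert.h>
#include <stdint.h>

/* One source bit of the CCS element address: which coordinate, which bit. */
struct ccs_addr_bit {
   char coord;   /* 'u' or 'v' */
   uint8_t bit;  /* bit index within that coordinate */
};

/* Sky Lake Y row of the table, listed from address bit 11 down to bit -2. */
static const struct ccs_addr_bit skl_y_layout[14] = {
   {'u', 6}, {'u', 5}, {'u', 4}, {'v', 6}, {'v', 5}, {'v', 4}, {'v', 3},
   {'v', 2}, {'v', 1}, {'u', 3}, {'u', 2}, {'u', 1}, {'v', 0}, {'u', 0},
};

/* Index of the 2-bit element inside a 4KB CCS tile.  The byte offset is
 * (index >> 2) and the bit shift within that byte is (index & 3) * 2,
 * assuming bit -1 is the more significant sub-byte address bit. */
static uint32_t
skl_ccs_element_index(uint32_t u, uint32_t v)
{
   assert(u < 128 && v < 128);

   uint32_t index = 0;
   for (unsigned i = 0; i < 14; i++) {
      const struct ccs_addr_bit src = skl_y_layout[i];
      const uint32_t coord = (src.coord == 'u') ? u : v;
      /* Append the selected coordinate bit, most significant first. */
      index = (index << 1) | ((coord >> src.bit) & 1);
   }
   return index;
}
```

With this reading, a 128x128 grid of elements fills exactly one 4KB tile, and
every ``(u, v)`` pair lands on a distinct 2-bit slot.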
| |
| CCS surface layout |
| ------------------ |
| |
| Starting with Broadwell, fast-clears and color compression can be used on |
| mipmapped and array surfaces. When considered from a higher level, the CCS is |
| laid out like any other surface. The Broadwell and Sky Lake PRMs describe |
| this as follows: |
| |
| Broadwell PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 676): |
| |
|     Mip-mapped and arrayed surfaces are supported with MCS buffer layout with |
|     these alignments in the RT space: Horizontal Alignment = 256 and Vertical |
|     Alignment = 128. |
| |
| Broadwell PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 279): |
| |
|     For non-multisampled render target's auxiliary surface, MCS, QPitch must be |
|     computed with Horizontal Alignment = 256 and Surface Vertical Alignment = |
|     128. These alignments are only for MCS buffer and not for associated render |
|     target. |
| |
| Sky Lake PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 632): |
| |
|     Mip-mapped and arrayed surfaces are supported with MCS buffer layout with |
|     these alignments in the RT space: Horizontal Alignment = 128 and Vertical |
|     Alignment = 64. |
| |
| Sky Lake PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 435): |
| |
|     For non-multisampled render target's CCS auxiliary surface, QPitch must be |
|     computed with Horizontal Alignment = 128 and Surface Vertical Alignment |
|     = 256. These alignments are only for CCS buffer and not for associated |
|     render target. |
| |
| Empirical evidence seems to confirm this. On Sky Lake, the vertical alignment |
| is always one cache line. The horizontal alignment, however, varies by main |
| surface format: 1 cache line for 32bpp, 2 for 64bpp and 4 cache lines for |
| 128bpp formats. This nicely corresponds to the alignment of 128x64 pixels in |
| the primary color surface. The second PRM citation about Sky Lake CCS above |
| gives a vertical alignment of 256 rather than 64. With a little |
| experimentation, this additional alignment appears to only apply to QPitch and |
| not to the miplevels within a slice. |
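
The arithmetic behind that correspondence can be sketched as follows. This is a
minimal illustration using only figures quoted earlier in this chapter (a CCS
cache line covering 16x16 cache-line pairs, each 32B x 4 rows on a Y-tiled main
surface); the function names are hypothetical and not part of ISL.

```c
#include <assert.h>

/* Figures taken from the text: a Y-tiled cache-line pair is 32B x 4 rows and
 * one CCS cache line covers a 16x16 grid of such pairs. */
enum {
   PAIR_WIDTH_BYTES = 32,
   PAIR_HEIGHT_ROWS = 4,
   PAIRS_PER_CCS_CACHE_LINE_SIDE = 16,
};

/* Width, in main-surface pixels, covered by one CCS cache line. */
static unsigned
skl_ccs_cache_line_width_px(unsigned bpp)
{
   assert(bpp == 32 || bpp == 64 || bpp == 128);
   return PAIRS_PER_CCS_CACHE_LINE_SIDE * PAIR_WIDTH_BYTES / (bpp / 8);
}

/* Height, in main-surface rows, covered by one CCS cache line. */
static unsigned
skl_ccs_cache_line_height_px(void)
{
   return PAIRS_PER_CCS_CACHE_LINE_SIDE * PAIR_HEIGHT_ROWS;
}

/* CCS cache lines needed to span a 128-pixel horizontal alignment. */
static unsigned
skl_ccs_halign_cache_lines(unsigned bpp)
{
   return 128 / skl_ccs_cache_line_width_px(bpp);
}
```

For 32bpp a single CCS cache line already spans 128 pixels, while 64bpp and
128bpp formats need 2 and 4 cache lines respectively; the height is 64 rows in
every case.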
| |
| On Broadwell, each miplevel in the CCS is aligned to a cache-line pair |
| boundary: horizontal when the primary surface is X-tiled and vertical when |
| Y-tiled. For a 32bpp format, this works out to an alignment of 256x128 main |
| surface pixels regardless of X or Y tiling. On Sky Lake, the alignment is |
| a single cache line which works out to an alignment of 128x64 main surface |
| pixels. |
| |
| TODO: More than just 32bpp formats on Broadwell! |
| |
| Once armed with the above alignment information, we can lay out the CCS surface |
| itself. The way ISL does CCS layout calculations is by a very careful and |
| subtle application of its normal surface layout code. |
| |
| Above, we described the CCS data layout as a mapping of address bits. In |
| ISL, this is represented by :cpp:enumerator:`isl_tiling::ISL_TILING_CCS`. The |
| logical and physical tile dimensions of this tiling correspond to the above |
| mapping. |
| |
| We also have special :cpp:enum:`isl_format` enums for CCS. These formats are 1 |
| bit-per-pixel on Ivy Bridge through Broadwell and 2 bits-per-pixel on Sky Lake |
| and above to correspond to the 1 and 2-bit values represented in the CCS data. |
| They have a block size (similar to a block compressed format such as BC or |
| ASTC) which says what area (in surface elements) in the main surface is covered |
| by a single CCS element (1 or 2-bit). Because this depends on the main surface |
| tiling and format, we have several different CCS formats. |
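
To make the block-size idea concrete, the sketch below derives the area covered
by one CCS element from the cache-line pair dimensions given earlier (64B x 2
rows for X-tiled and 32B x 4 rows for Y-tiled main surfaces). It is a rough
illustration with hypothetical names, not the ISL format table, and it ignores
the alignment requirements discussed above.

```c
#include <assert.h>

enum main_tiling { MAIN_TILING_X, MAIN_TILING_Y };

struct extent2d {
   unsigned w, h;
};

/* Main-surface area, in pixels, covered by one CCS element, i.e. by one
 * cache-line pair: 64B x 2 rows when X-tiled, 32B x 4 rows when Y-tiled. */
static struct extent2d
ccs_block_extent(enum main_tiling tiling, unsigned bpp)
{
   assert(bpp == 32 || bpp == 64 || bpp == 128);
   const unsigned bytes_per_pixel = bpp / 8;

   struct extent2d block;
   if (tiling == MAIN_TILING_X) {
      block.w = 64 / bytes_per_pixel;
      block.h = 2;
   } else {
      block.w = 32 / bytes_per_pixel;
      block.h = 4;
   }
   return block;
}

/* Size of the CCS in elements for a width x height main surface, rounding up
 * to whole blocks.  Alignment and padding are deliberately left out. */
static struct extent2d
ccs_extent_in_elements(enum main_tiling tiling, unsigned bpp,
                       unsigned width, unsigned height)
{
   const struct extent2d block = ccs_block_extent(tiling, bpp);
   const struct extent2d size = {
      (width + block.w - 1) / block.w,
      (height + block.h - 1) / block.h,
   };
   return size;
}
```

For example, a Y-tiled 32bpp surface has an 8x4 pixel block under these
assumptions, so a 1920x1080 surface needs 240x270 CCS elements before any
alignment is applied.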
| |
| Once the appropriate :cpp:enum:`isl_format` has been selected, computing the |
| size and layout of a CCS surface is as simple as passing the same surface |
| creation parameters to :cpp:func:`isl_surf_init_s` as were used to create the |
| primary surface, only with :cpp:enumerator:`isl_tiling::ISL_TILING_CCS` and the |
| correct CCS format. This not only results in a correctly sized surface but |
| most other ISL helpers for things such as computing offsets into surfaces work |
| correctly as well. |
| |
| CCS on Tiger Lake and above |
| --------------------------- |
| |
| Starting with Tiger Lake, CCS is no longer done via a surface and, instead, the |
| term CCS gets overloaded once again (gotta love it!) to now refer to a form of |
| universal compression which can be applied to almost any surface. Nothing in |
| this chapter applies to any hardware with a graphics IP version 12 or above. |