Single-sampled Color Compression
================================

Starting with Ivy Bridge, Intel graphics hardware provides a form of color
compression for single-sampled surfaces. In its initial form, this provided an
acceleration of render target clear operations that, in the common case, allows
you to avoid almost all of the bandwidth of a full-surface clear operation. On
Sky Lake, single-sampled color compression was extended to allow for the
compression of color values from actual rendering and not just the initial
clear. From here on, the older Ivy Bridge form of color compression will be
called "fast-clears" and the term "color compression" will be reserved for the
more powerful Sky Lake form.

The documentation for Ivy Bridge through Broadwell overloads the term MCS for
referring both to the *multisample control surface* used for multisample
compression and the control surface used for fast-clears. In ISL, the
:cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_MCS` enum always refers to
multisample color compression while the
:cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_CCS_` enums always refer to
single-sampled color compression. Throughout this chapter and the rest of the
ISL documentation, we will use the term "color control surface", abbreviated
CCS, to denote the control surface used for both fast-clears and color
compression. While this is still an overloaded term, Ivy Bridge fast-clears
are much closer to Sky Lake color compression than they are to multisample
compression.

CCS data
--------

Fast-clears and CCS are possibly the single most poorly documented aspect of
surface layout/setup for Intel graphics hardware (with HiZ coming in a close
second). All the documentation really says is that you can use an MCS buffer on
single-sampled surfaces (we will call it the CCS in this case). It also
provides some documentation on how to program the hardware to perform clear
operations, but that's it. How big is this buffer? What does it contain?
Those questions are left as exercises to the reader. Almost everything we know
about the contents of the CCS is gleaned from reverse-engineering of the
hardware. The best bit of documentation we have ever had comes from the
display section of the Sky Lake PRM Vol 12 section on planes (p. 159):

    The Color Control Surface (CCS) contains the compression status of the
    cache-line pairs. The compression state of the cache-line pair is
    specified by 2 bits in the CCS. Each CCS cache-line represents an area
    on the main surface of 16x16 sets of 128 byte Y-tiled cache-line-pairs.
    CCS is always Y tiled.

While this is technically for color compression and not fast-clears, it
provides a good bit of insight into how color compression and fast-clears
operate. Each cache-line pair in the main surface corresponds to 1 or 2 bits
in the CCS. The primary difference, as far as the current discussion is
concerned, is that fast-clears use only 1 bit per cache-line pair whereas color
compression uses 2 bits.
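
As a quick sanity check on these numbers, the following sketch works out how
much of the main surface a single CCS cache line or byte describes. It assumes
only the figures quoted above (64-byte cache lines, 128-byte cache-line pairs,
1 or 2 bits per pair); the function names are illustrative and are not part of
ISL.

.. code-block:: python

   CACHE_LINE_BYTES = 64
   PAIR_BYTES = 2 * CACHE_LINE_BYTES  # one cache-line pair of main surface


   def pairs_per_ccs_cache_line(bits_per_pair):
       """Main-surface cache-line pairs described by one CCS cache line."""
       return CACHE_LINE_BYTES * 8 // bits_per_pair


   def main_bytes_per_ccs_byte(bits_per_pair):
       """Bytes of main surface described by one byte of CCS."""
       return (8 // bits_per_pair) * PAIR_BYTES


   for name, bits in (("fast-clear", 1), ("color compression", 2)):
       print(name, pairs_per_ccs_cache_line(bits), main_bytes_per_ccs_byte(bits))

With 2 bits per pair, one CCS cache line describes 256 pairs, which matches the
16x16 arrangement in the PRM quote.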

What is a cache-line pair? Both the X and Y tiling formats are arranged as an
8x8 grid of cache lines. (See the :doc:`chapter on tiling <tiling>` for more
details.) In either case, a cache-line pair is a pair of cache lines whose
starting addresses differ by 512 bytes or 8 cache lines. This results in the
two cache lines being vertically adjacent when the main surface is X-tiled and
horizontally adjacent when the main surface is Y-tiled. For an X-tiled surface
this forms an area of 64B x 2rows and for a Y-tiled surface this forms an area
of 32B x 4rows. In either case, it is guaranteed that, regardless of surface
format, each 2x2 subspan coming out of a shader will land entirely within one
cache-line pair.
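
The subspan claim can be checked with a little arithmetic. The sketch below
converts the byte footprints given above into pixels for a given format size
and tests that an aligned 2x2 block cannot straddle a pair boundary. It assumes
subspans start at even pixel coordinates and formats between 8 and 128 bits per
pixel; the names are made up for illustration.

.. code-block:: python

   # Footprint of one cache-line pair as (bytes wide, rows tall), per the text.
   PAIR_FOOTPRINT = {"X": (64, 2), "Y": (32, 4)}


   def pair_footprint_px(tiling, bpp):
       """Return (width, height) in pixels of one cache-line pair."""
       width_bytes, rows = PAIR_FOOTPRINT[tiling]
       return width_bytes * 8 // bpp, rows


   def subspan_fits(tiling, bpp):
       """True if every even-aligned 2x2 subspan stays inside one pair."""
       width_px, height_px = pair_footprint_px(tiling, bpp)
       return width_px % 2 == 0 and height_px % 2 == 0


   for tiling in ("X", "Y"):
       for bpp in (8, 16, 32, 64, 128):
           print(tiling, bpp, pair_footprint_px(tiling, bpp),
                 subspan_fits(tiling, bpp))

Even in the tightest case, a 128bpp Y-tiled surface, the pair is still 2 pixels
wide and 4 rows tall.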

What is the correspondence between bits and cache-line pairs? The best model I
(Jason) know of is to consider the CCS as having a 1-bit color format for
fast-clears and a 2-bit format for color compression, together with a special
tiling format. The CCS tiling formats operate on a 1 or 2-bit granularity
rather than the byte granularity of most tiling formats.

The following table represents the bit layouts that yield the CCS tiling format
on different hardware generations. Bits 0-11 correspond to the regular swizzle
of bytes within a 4KB page whereas the negative bits represent the address of
the particular 1 or 2-bit portion of a byte. (Note: The Haswell data was
gathered on a dual-channel system so bit-6 swizzling was enabled. It's unclear
how this affects the CCS layout.)

============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== | |

Generation Tiling 11 10 9 8 7 6 5 4 3 2 1 0 -1 -2 -3 | |

============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== | |

Ivy Bridge X or Y :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0` | |

Haswell X :math:`u_6` :math:`u_5` :math:`v_3 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0` | |

Haswell Y :math:`u_6` :math:`u_5` :math:`v_2 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0` | |

Broadwell X :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`u_3` :math:`v_3` :math:`u_2` :math:`u_1` :math:`u_0` :math:`v_2` :math:`v_1` :math:`v_0` | |

Broadwell Y :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_1` :math:`v_0` :math:`u_0` | |

Sky Lake Y :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_0` :math:`u_0` | |

============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== | |
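
One way to read a row of this table is as a recipe for assembling an address
from coordinate bits. The sketch below does that for a few rows, copied as
written above. It assumes :math:`u` and :math:`v` are the horizontal and
vertical element coordinates within one CCS tile, which the table does not
state explicitly, and it reports the sub-byte position only as an element index
without claiming which physical bits of the byte that index selects. All names
are illustrative and none of them come from ISL.

.. code-block:: python

   # Rows from the table, most significant address bit first. Each term names
   # the coordinate bit(s) feeding one address bit; "^" denotes XOR.
   IVB_LAYOUT = "u6 u5 u4 v7 v6 v5 v4 v2 v3 v1 v0 u3 u2 u1 u0".split()
   HSW_X_LAYOUT = "u6 u5 v3^u1 v7 v6 v5 v4 v2 v3 v1 v0 u4 u3 u2 u0".split()
   SKL_LAYOUT = "u6 u5 u4 v6 v5 v4 v3 v2 v1 u3 u2 u1 v0 u0".split()


   def ccs_element_index(layout, u, v):
       """Index of element (u, v) within a 4KB CCS page (assumed meaning)."""
       coords = {"u": u, "v": v}
       index = 0
       for term in layout:
           bit = 0
           for src in term.split("^"):
               # src looks like "u4": coordinate name followed by bit number.
               bit ^= (coords[src[0]] >> int(src[1:])) & 1
           index = (index << 1) | bit
       return index


   def ccs_element_location(layout, u, v, bits_per_element):
       """Return (byte offset in page, element index within that byte)."""
       elements_per_byte = 8 // bits_per_element
       index = ccs_element_index(layout, u, v)
       return index // elements_per_byte, index % elements_per_byte


   print(ccs_element_location(IVB_LAYOUT, 16, 0, 1))
   print(ccs_element_location(SKL_LAYOUT, 2, 0, 2))

Under these assumptions, each of the three rows maps every coordinate pair to a
distinct element, so each layout fills the 4KB page exactly once.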

CCS surface layout
------------------

Starting with Broadwell, fast-clears and color compression can be used on
mipmapped and array surfaces. When considered from a higher level, the CCS is
laid out like any other surface. The Broadwell and Sky Lake PRMs describe
this as follows:

Broadwell PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 676):

    Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
    these alignments in the RT space: Horizontal Alignment = 256 and Vertical
    Alignment = 128.

Broadwell PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 279):

    For non-multisampled render target's auxiliary surface, MCS, QPitch must be
    computed with Horizontal Alignment = 256 and Surface Vertical Alignment =
    128. These alignments are only for MCS buffer and not for associated render
    target.

Sky Lake PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 632):

    Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
    these alignments in the RT space: Horizontal Alignment = 128 and Vertical
    Alignment = 64.

Sky Lake PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 435):

    For non-multisampled render target's CCS auxiliary surface, QPitch must be
    computed with Horizontal Alignment = 128 and Surface Vertical Alignment
    = 256. These alignments are only for CCS buffer and not for associated
    render target.

Empirical evidence seems to confirm this. On Sky Lake, the vertical alignment
is always one cache line. The horizontal alignment, however, varies by main
surface format: 1 cache line for 32bpp, 2 for 64bpp and 4 cache lines for
128bpp formats. This nicely corresponds to the alignment of 128x64 pixels in
the primary color surface. The second PRM citation about Sky Lake CCS above
gives a vertical alignment of 256 rather than 64. With a little
experimentation, this additional alignment appears to only apply to QPitch and
not to the miplevels within a slice.
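
The 128x64 figure can be reproduced from numbers given earlier in this chapter.
The sketch below assumes that a Sky Lake CCS cache line covers a block 16
cache-line pairs wide by 16 pairs tall (one reading of the PRM quote in the CCS
data section) and that each Y-tiled pair is 32B x 4rows. The function names are
hypothetical.

.. code-block:: python

   PAIR_BYTES_WIDE_Y, PAIR_ROWS_Y = 32, 4  # Y-tiled cache-line pair footprint
   PAIRS_WIDE, PAIRS_TALL = 16, 16         # pairs per CCS cache line (assumed)

   # Observed horizontal alignment in CCS cache lines, keyed by main-surface bpp.
   SKL_HALIGN_CACHE_LINES = {32: 1, 64: 2, 128: 4}


   def skl_ccs_cache_line_coverage_px(bpp):
       """Main-surface area in pixels described by one CCS cache line."""
       width_px = PAIRS_WIDE * PAIR_BYTES_WIDE_Y * 8 // bpp
       height_px = PAIRS_TALL * PAIR_ROWS_Y
       return width_px, height_px


   def skl_ccs_alignment_px(bpp):
       """Miplevel alignment expressed in main-surface pixels."""
       width_px, height_px = skl_ccs_cache_line_coverage_px(bpp)
       return width_px * SKL_HALIGN_CACHE_LINES[bpp], height_px


   for bpp in (32, 64, 128):
       print(bpp, skl_ccs_cache_line_coverage_px(bpp), skl_ccs_alignment_px(bpp))

The coverage of a single CCS cache line halves in width each time the format
size doubles, and the growing cache-line alignment compensates so that the
result is 128x64 pixels in all three cases.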

On Broadwell, each miplevel in the CCS is aligned to a cache-line pair
boundary: horizontal when the primary surface is X-tiled and vertical when
Y-tiled. For a 32bpp format, this works out to an alignment of 256x128 main
surface pixels regardless of X or Y tiling. On Sky Lake, the alignment is
a single cache line which works out to an alignment of 128x64 main surface
pixels.

TODO: More than just 32bpp formats on Broadwell!

Once armed with the above alignment information, we can lay out the CCS surface
itself. The way ISL does CCS layout calculations is by a very careful and
subtle application of its normal surface layout code.

Above, we described the CCS data layout as a mapping of address bits. In
ISL, this is represented by :cpp:enumerator:`isl_tiling::ISL_TILING_CCS`. Its
logical and physical tile dimensions correspond to the above mapping.

We also have special :cpp:enum:`isl_format` enums for CCS. These formats are 1
bit-per-pixel on Ivy Bridge through Broadwell and 2 bits-per-pixel on Sky Lake
and above to correspond to the 1 and 2-bit values represented in the CCS data.
They have a block size (similar to a block compressed format such as BC or
ASTC) which says what area (in surface elements) in the main surface is covered
by a single CCS element (1 or 2-bit). Because this depends on the main surface
tiling and format, we have several different CCS formats.

Once the appropriate :cpp:enum:`isl_format` has been selected, computing the
size and layout of a CCS surface is as simple as passing the same surface
creation parameters to :cpp:func:`isl_surf_init_s` as were used to create the
primary surface, only with :cpp:enumerator:`isl_tiling::ISL_TILING_CCS` and the
correct CCS format. This not only results in a correctly sized surface, but
most other ISL helpers for things such as computing offsets into surfaces work
correctly as well.

CCS on Tigerlake and above
--------------------------

Starting with Tigerlake, CCS is no longer done via a surface and, instead, the
term CCS gets overloaded once again (gotta love it!) to now refer to a form of
universal compression which can be applied to almost any surface. Nothing in
this chapter applies to any hardware with a graphics IP version 12 or above.