ACO (short for AMD compiler) is a back-end compiler for AMD GCN / RDNA GPUs, based on the NIR compiler infrastructure. Simply put, ACO translates shader programs from the NIR intermediate representation into a GCN / RDNA binary which the GPU can execute.
Why did we choose to develop a new compiler backend?
Modern GPUs are SIMD machines that execute the shader in parallel. In the case of GCN / RDNA, this parallelism is achieved by executing the shader on several waves, each of which has several lanes (32 or 64). When every lane executes exactly the same instructions and takes the same path, the control flow is uniform; when some lanes take one path while other lanes take a different path, it's divergent.
Each hardware lane corresponds to a shader invocation from a software perspective.
The hardware doesn't directly support divergence, so in case of divergent control flow, the GPU must execute both code paths, each with some lanes disabled. This is why divergence is a performance concern in shader programming.
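To make the cost of divergence concrete, here is a toy model (plain C++, not ACO or GPU code) of an 8-lane wave executing a divergent if/else: both sides run, each with the non-participating lanes masked off.

```cpp
// Toy model of an 8-lane wave executing "if (x[i] < 0) y = -x; else y = x;".
// Not ACO code; it just illustrates why a divergent branch costs both paths.
#include <array>
#include <cstdint>
#include <cstdio>

int main() {
    std::array<int, 8> x = {3, -1, 4, -1, 5, -9, 2, 6};
    std::array<int, 8> y{};
    uint8_t exec = 0xff;                    // all lanes active

    uint8_t cond = 0;                       // lanes where x[i] < 0
    for (int i = 0; i < 8; ++i)
        if (x[i] < 0) cond |= 1u << i;

    // "then" side: only lanes with the condition set are active
    for (int i = 0; i < 8; ++i)
        if (exec & cond & (1u << i)) y[i] = -x[i];

    // "else" side: the remaining lanes are active
    for (int i = 0; i < 8; ++i)
        if (exec & ~cond & (1u << i)) y[i] = x[i];

    for (int v : y) printf("%d ", v);       // prints: 3 1 4 1 5 9 2 6
    printf("\n");
}
```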
ACO deals with divergent control flow by maintaining two control flow graphs (CFG):

* the logical CFG, which is directly translated from the NIR CFG and represents the control flow as seen by the shader invocations;
* the linear CFG, which represents the control flow as seen by the hardware (created according to Whole-Function Vectorization by Ralf Karrenberg and Sebastian Hack), including the blocks needed to manage the exec mask around divergent branches.
The instruction selection is based around the divergence analysis and works in 3 passes on the NIR shader.
We have two types of instructions:

* Hardware instructions, as defined by the GCN / RDNA ISA.
* Pseudo instructions, which are helpers that encapsulate more complex functionality and are lowered to hardware instructions later in the compilation.
Each instruction can have operands (temporaries that it reads), and definitions (temporaries that it writes). Temporaries can be fixed to a specific register, or just specify a register class (either a single register, or a vector of several registers).
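As a rough illustration of these concepts, here is a hypothetical, simplified sketch (not ACO's actual classes) of how an instruction with operands, definitions and register classes could be modeled:

```cpp
// Hypothetical, simplified sketch: a temporary has a register class, and an
// instruction reads operands and writes definitions, each of which may
// optionally be fixed to a specific register.
#include <cstdint>
#include <optional>
#include <vector>

enum class RegClass { s1, s2, v1, v2 };     // scalar/vector, one or two registers

struct Temp {
    uint32_t id;
    RegClass rc;
};

struct Operand {
    Temp temp;
    std::optional<uint16_t> fixed_reg;      // set if the operand must live in a given register
};

struct Definition {
    Temp temp;
    std::optional<uint16_t> fixed_reg;      // set if the result must be written to a given register
};

struct Instruction {
    uint16_t opcode;
    std::vector<Operand> operands;          // temporaries the instruction reads
    std::vector<Definition> definitions;    // temporaries the instruction writes
};
```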
The value numbering pass is necessary for two reasons: NIR has no representation for descriptor loads, and every NIR instruction that gets emitted as multiple ACO instructions also creates potential for CSE. This pass performs dominator-tree value numbering.
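For illustration, here is a minimal sketch of the value-numbering idea on a hypothetical IR; the dominator-tree variant additionally reuses keys recorded in dominating blocks:

```cpp
// Minimal local value-numbering sketch (hypothetical IR, not ACO's pass).
// An expression is keyed by its opcode and the value numbers of its operands;
// a repeated expression gets the same value number and can reuse the earlier
// result instead of being recomputed.
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

using ValueNumber = uint32_t;
using Key = std::tuple<uint16_t /*opcode*/, std::vector<ValueNumber> /*operands*/>;

struct ValueNumbering {
    std::map<Key, ValueNumber> table;
    ValueNumber next = 0;

    // Returns the value number of an equivalent earlier expression if one
    // exists, otherwise assigns a fresh one.
    ValueNumber lookup_or_add(uint16_t opcode, const std::vector<ValueNumber> &ops) {
        auto [it, inserted] = table.try_emplace(Key{opcode, ops}, next);
        if (inserted)
            ++next;
        return it->second;
    }
};
```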
In this phase, simpler instructions are combined into more complex instructions (like the different versions of multiply-add as well as neg, abs, clamp, and output modifiers), constants are inlined, moves are eliminated, and so on. Exactly which optimizations are performed depends on the hardware for which the shader is being compiled.
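As a hedged example of the kind of combine this pass performs (hypothetical IR and opcodes, not ACO's real optimizer), a multiply whose only use is an addition can be fused into a single multiply-add:

```cpp
// Hypothetical peephole sketch: fold  t = a * b;  r = t + c;  into
// r = fma(a, b, c) when the multiply's result has no other uses.
// ACO's real combines also handle neg/abs/clamp and output modifiers.
struct Inst {
    int op;
    int dst;
    int src[3];
    int uses_of_dst;
};

enum { OP_MUL, OP_ADD, OP_FMA };

bool try_fuse_mad(Inst &mul, Inst &add) {
    bool add_uses_mul = add.op == OP_ADD &&
                        (add.src[0] == mul.dst || add.src[1] == mul.dst);
    if (mul.op != OP_MUL || !add_uses_mul || mul.uses_of_dst != 1)
        return false;

    int other = (add.src[0] == mul.dst) ? add.src[1] : add.src[0];
    add.op = OP_FMA;
    add.src[0] = mul.src[0];
    add.src[1] = mul.src[1];
    add.src[2] = other;
    mul.uses_of_dst = 0;                    // the multiply is now dead and can be removed
    return true;
}
```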
This pass is responsible for making sure that register allocation is correct for reductions, by adding pseudo instructions that utilize linear VGPRs. When a temporary has a linear VGPR register class, this means that the variable is considered live in the linear control flow graph.
In the GCN/RDNA architecture, there is a special register called `exec` which is used for manually controlling which VALU threads (aka lanes) are active. The value of `exec` has to change in divergent branches, loops, etc., and it needs to be restored after the branch or loop is complete. This pass ensures that the correct lanes are active in every branch.
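A simplified conceptual model of what this pass arranges around a divergent if/else (plain C++ as pseudocode, not actual GCN code generation) looks like this:

```cpp
// Conceptual model of exec handling around a divergent if/else.
// 'cond' holds, per lane, whether that lane takes the "then" side.
#include <cstdint>

void divergent_if(uint64_t &exec, uint64_t cond,
                  void (*then_side)(uint64_t), void (*else_side)(uint64_t)) {
    const uint64_t saved_exec = exec;       // save the mask of currently active lanes

    exec = saved_exec & cond;               // "then" side: lanes where cond is true
    if (exec) then_side(exec);

    exec = saved_exec & ~cond;              // "else" side: the remaining active lanes
    if (exec) else_side(exec);

    exec = saved_exec;                      // restore the original mask afterwards
}
```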
A live-variable analysis is used to calculate the register need of the shader. This information is used for spilling and scheduling before register allocation.
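Live-variable analysis is the classic backward dataflow problem; here is a minimal sketch on a hypothetical block representation (the register need at a point is then the number of simultaneously live temporaries):

```cpp
// Classic backward liveness: live_out(b) = union of live_in over successors,
// live_in(b) = use(b) ∪ (live_out(b) − def(b)); iterate until a fixed point.
#include <set>
#include <vector>

struct Block {
    std::vector<int> succs;                 // indices of successor blocks
    std::set<unsigned> use, def;            // temps read before written / temps written
    std::set<unsigned> live_in, live_out;
};

void compute_liveness(std::vector<Block> &blocks) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = int(blocks.size()) - 1; b >= 0; --b) {  // reverse order converges faster
            Block &blk = blocks[b];

            std::set<unsigned> out;
            for (int s : blk.succs)
                out.insert(blocks[s].live_in.begin(), blocks[s].live_in.end());

            std::set<unsigned> in = blk.use;
            for (unsigned t : out)
                if (!blk.def.count(t))
                    in.insert(t);

            if (in != blk.live_in || out != blk.live_out) {
                blk.live_in = std::move(in);
                blk.live_out = std::move(out);
                changed = true;
            }
        }
    }
}
```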
First, we lower the shader program to CSSA form. Then, if the register demand exceeds the global limit, this pass lowers register usage by temporarily storing excess scalar values in free vector registers, or excess vector values in scratch memory, and reloading them when needed. It is based on the paper “Register Spilling and Live-Range Splitting for SSA-Form Programs”.
Scheduling is another NP-complete problem, and basically all known heuristics suffer from unpredictable changes in register pressure. For that reason, the implemented scheduler does not completely re-schedule all instructions; it only aims to move memory loads up as far as possible without exceeding the maximum register limit for the pre-calculated wave count. This works because ILP is very limited on GCN. The approach looks promising so far.
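A hedged sketch of the constraint the scheduler enforces when moving a load upwards (the representation and names here are made up for illustration):

```cpp
// Sketch: a load may only move up while the register demand at every point it
// crosses stays within the limit implied by the target wave count.
#include <vector>

// register_demand[i] = number of registers live at instruction i (pre-computed
// by the liveness pass). Returns how many instructions the load at 'load_idx'
// can move up without the demand at any crossed point exceeding 'max_registers'.
int movable_distance(const std::vector<int> &register_demand, int load_idx,
                     int load_extra_registers, int max_registers) {
    int dist = 0;
    for (int i = load_idx - 1; i >= 0; --i) {
        // Moving the load above instruction i extends the lifetime of its
        // result across i, increasing the register demand there.
        if (register_demand[i] + load_extra_registers > max_registers)
            break;
        ++dist;
    }
    return dist;
}
```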
The register allocator works on SSA (as opposed to LLVM's register allocator, which works on virtual registers). The SSA properties guarantee that there are always as many registers available as needed. The problem is that some instructions require a vector of neighboring registers to be available, but the free registers might be scattered. In this case, the register allocator inserts shuffle code (moving some temporaries to other registers) to make space for the variable. The assumption is that it is (almost) always better to have a few more moves than to sacrifice a wave. The RA does SSA-reconstruction on the fly, which makes its runtime linear.
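For illustration, here is a sketch of the sub-problem that triggers shuffle code, assuming a simple free-register bitmap rather than ACO's real allocator state:

```cpp
// Sketch: find 'size' neighboring free VGPRs for a vector temporary.
// If no contiguous window exists, the allocator instead shuffles other
// temporaries out of the way (a few extra moves are cheaper than losing a wave).
#include <array>
#include <optional>

std::optional<unsigned> find_contiguous(const std::array<bool, 256> &is_free, unsigned size) {
    unsigned run = 0;
    for (unsigned r = 0; r < is_free.size(); ++r) {
        run = is_free[r] ? run + 1 : 0;
        if (run == size)
            return r + 1 - size;            // first register of the free window
    }
    return std::nullopt;                    // caller must make room by moving temporaries
}
```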
The next step is a pass out of SSA: parallel copies are inserted at the end of blocks to match the semantics of the phi nodes.
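A minimal sketch of this lowering on a hypothetical phi representation:

```cpp
// Sketch of phi lowering: for  x3 = phi(x1 from B1, x2 from B2),  insert a copy
// of the incoming value at the end of each predecessor block. Grouping the
// copies of a block into one parallelcopy matters when sources and
// destinations overlap, because all copies conceptually happen at once.
#include <utility>
#include <vector>

struct Copy { unsigned dst, src; };

struct Phi {
    unsigned dst;
    std::vector<std::pair<int /*pred block*/, unsigned /*incoming temp*/>> incoming;
};

// Collect, per predecessor block, the copies its outgoing parallelcopy must perform.
std::vector<std::vector<Copy>> lower_phis(const std::vector<Phi> &phis, int num_blocks) {
    std::vector<std::vector<Copy>> copies_at_end(num_blocks);
    for (const Phi &phi : phis)
        for (auto [pred, src] : phi.incoming)
            copies_at_end[pred].push_back({phi.dst, src});
    return copies_at_end;                   // each inner vector becomes one parallelcopy
}
```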
Most pseudo instructions are lowered to actual machine instructions. These are mostly parallel copy instructions created by instruction selection or register allocation and spill/reload code.
GCN requires some wait states to be manually inserted in order to ensure correct behavior with memory instructions and some register dependencies. This means that we need to insert `s_waitcnt` instructions (and their variants) so that the shader program waits until, e.g., a memory operation is complete.
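A much-simplified model of the bookkeeping involved, assuming memory operations complete in issue order (the real pass tracks several hardware counters according to the ISA rules):

```cpp
// Simplified waitcnt bookkeeping model. Before an instruction reads a register
// written by a load, a wait is emitted that allows at most
// wait_needed_for(reg) loads to still be outstanding.
#include <map>

struct WaitcntModel {
    unsigned loads_issued = 0;                       // cumulative number of loads issued
    std::map<unsigned, unsigned> load_index_for_reg; // reg -> index of the load writing it

    void on_load(unsigned dst_reg) {
        ++loads_issued;
        load_index_for_reg[dst_reg] = loads_issued;
    }

    // Counter value for the wait instruction: how many later loads may still be
    // outstanding once the load writing 'reg' has completed.
    // Returns ~0u if no wait is needed for this register.
    unsigned wait_needed_for(unsigned reg) const {
        auto it = load_index_for_reg.find(reg);
        if (it == load_index_for_reg.end())
            return ~0u;
        return loads_issued - it->second;
    }
};
```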
Some instructions require wait states or other instructions to resolve hazards which are not handled by the hardware. This pass makes sure that no known hazards occur.
The assembler emits the actual binary that will be sent to the hardware for execution. ACO's assembler is straightforward because every instruction already has its format, opcode, registers and potential fields available, so it only needs to cater to the differences between hardware generations.
Hardware stages (as executed on the chip) don't exactly match software stages (as defined in OpenGL / Vulkan). Which software stage gets executed on which hardware stage depends on what kind of software stages are present in the current pipeline.
An important difference is that the VS is always the first stage to run in the SW models, whereas the HW VS refers to the last HW stage before fragment shading in GCN/RDNA terminology. That is why, among other things, the HW VS is no longer used to execute the SW VS when tessellation or geometry shading is used.
HW PS reads its inputs from a special ring buffer called Parameter Cache (PC) that only HW VS can write to, using export instructions. However, legacy GS (before GFX10/NGG) stores its output in VRAM. So in order for HW PS to be able to read the GS outputs, we must run something on the VS stage which reads the GS outputs from VRAM and exports them to the PC. This is what we call a “GS copy” shader. From a HW perspective the “GS copy” shader is in fact a VS (it runs on the HW VS stage), but from a SW perspective it's not part of the traditional pipeline; it's just some “glue code” that we need for the outputs to play nicely.
On GFX10/NGG this limitation no longer exists, because NGG can export directly to the PC.
The merged stages on GFX9 (and GFX10/legacy) are: LSHS and ESGS. On GFX10/NGG the ESGS is merged with HW VS into NGG.
This might be confusing due to a mismatch between the number of invocations of these shaders. For example, ES is per-vertex, but GS is per-primitive. This is why merged shaders get an argument called `merged_wave_info`, which tells how many invocations each part needs, and there is some code at the beginning of each part to ensure the correct number of invocations by disabling some threads. So, think about these as two independent shader programs slapped together.
| GFX6-8 HW stages:      | LS  | HS  | ES  | GS  | VS      | PS  | ACO terminology |
|------------------------|-----|-----|-----|-----|---------|-----|-----------------|
| SW stages: only VS+PS: |     |     |     |     | VS      | FS  | `vertex_vs`, `fragment_fs` |
| with tess:             | VS  | TCS |     |     | TES     | FS  | `vertex_ls`, `tess_control_hs`, `tess_eval_vs`, `fragment_fs` |
| with GS:               |     |     | VS  | GS  | GS copy | FS  | `vertex_es`, `geometry_gs`, `gs_copy_vs`, `fragment_fs` |
| with both:             | VS  | TCS | TES | GS  | GS copy | FS  | `vertex_ls`, `tess_control_hs`, `tess_eval_es`, `geometry_gs`, `gs_copy_vs`, `fragment_fs` |
| GFX9+ HW stages:       | LSHS     | ESGS     | VS      | PS  | ACO terminology |
|------------------------|----------|----------|---------|-----|-----------------|
| SW stages: only VS+PS: |          |          | VS      | FS  | `vertex_vs`, `fragment_fs` |
| with tess:             | VS + TCS |          | TES     | FS  | `vertex_tess_control_hs`, `tess_eval_vs`, `fragment_fs` |
| with GS:               |          | VS + GS  | GS copy | FS  | `vertex_geometry_gs`, `gs_copy_vs`, `fragment_fs` |
| with both:             | VS + TCS | TES + GS | GS copy | FS  | `vertex_tess_control_hs`, `tess_eval_geometry_gs`, `gs_copy_vs`, `fragment_fs` |
| GFX10/NGG HW stages:   | LSHS     | NGG      | PS  | ACO terminology |
|------------------------|----------|----------|-----|-----------------|
| SW stages: only VS+PS: |          | VS       | FS  | `vertex_ngg`, `fragment_fs` |
| with tess:             | VS + TCS | TES      | FS  | `vertex_tess_control_hs`, `tess_eval_ngg`, `fragment_fs` |
| with GS:               |          | VS + GS  | FS  | `vertex_geometry_ngg`, `fragment_fs` |
| with both:             | VS + TCS | TES + GS | FS  | `vertex_tess_control_hs`, `tess_eval_geometry_ngg`, `fragment_fs` |
Mesh shading (GFX10.3+):

| GFX10.3+ HW stages:    | CS  | NGG | PS  | ACO terminology |
|------------------------|-----|-----|-----|-----------------|
| SW stages: only MS+PS: |     | MS  | FS  | `mesh_ngg`, `fragment_fs` |
| with task:             | TS  | MS  | FS  | `task_cs`, `mesh_ngg`, `fragment_fs` |
Compute (GFX6-10):

| GFX6-10 HW stage       | CS  | ACO terminology |
|------------------------|-----|-----------------|
| SW stage:              | CS  | `compute_cs`    |
Handy `RADV_DEBUG` options that help with ACO debugging:

* `nocache` - you always want to use this when debugging, otherwise you risk using a broken shader from the cache.
* `shaders` - makes ACO print the IR after register allocation, as well as the disassembled shader binary.
* `metashaders` - does the same thing as `shaders` but for built-in RADV shaders.
* `preoptir` - makes ACO print the final NIR shader before instruction selection, as well as the ACO IR after instruction selection.
* `nongg` - disables NGG support.

We also have `ACO_DEBUG` options:
* `validateir` - Validate the ACO IR between compilation stages. Enabled by default in debug builds and disabled in release builds.
* `validatera` - Perform a RA (register allocation) validation.
* `perfwarn` - Warn when sub-optimal instructions are found.
* `force-waitcnt` - Forces ACO to emit a wait state after each instruction when there is something to wait for. Harms performance.
* `novn` - Disables the ACO value numbering stage.
* `noopt` - Disables the ACO optimizer.
* `nosched` - Disables the ACO scheduler.

Note that you need to combine these options into a comma-separated list, for example `RADV_DEBUG=nocache,shaders`, otherwise only the last one will take effect. (This is how all environment variables work; it is nonetheless a frequently made mistake.) Example:
`RADV_DEBUG=nocache,shaders ACO_DEBUG=validateir,validatera vkcube`
GCC has several sanitizers which can help figure out hard to diagnose issues. To use these, you need to pass the `-Db_sanitize` flag to `meson` when building mesa. For example, `-Db_sanitize=undefined` will add support for the undefined behavior sanitizer.
Several Linux distributions use “hardened” builds, meaning that several special compiler flags are added by downstream packaging which are not used in mesa builds by default. These may be responsible for some bug reports of inexplicable crashes with assertion failures that you can't reproduce. Most notable are the libstdc++ debug flags, which you can enable by adding the `-D_GLIBCXX_ASSERTIONS=1` and `-D_GLIBCXX_DEBUG=1` flags.
To see the full list of downstream compiler flags, you can use, e.g., `rpm --eval "%optflags"` on Red Hat based distros like Fedora.
Here are some good practices we learned while debugging visual corruption and hangs.
* Edit `radv_shader.c` or `radv_pipeline.c` to change whether shaders are compiled with LLVM or ACO.
* Use `s_endpgm` to end the shader early to find the problematic instruction.