docs/drivers/panfrost.rst - third_party/mesa - Git at Google

 Panfrost
 ========

 The Panfrost driver stack includes an OpenGL ES implementation for Arm Mali
 GPUs based on the Midgard and Bifrost microarchitectures. It is **conformant**
 on Mali-G52 and Mali-G57 but **non-conformant** on other GPUs. The following
 hardware is currently supported:

 =========  ============ ============ =======
 Product    Architecture OpenGL ES    OpenGL
 =========  ============ ============ =======
 Mali T620  Midgard (v4) 2.0          2.1
 Mali T720  Midgard (v4) 2.0          2.1
 Mali T760  Midgard (v5) 3.1          3.1
 Mali T820  Midgard (v5) 3.1          3.1
 Mali T830  Midgard (v5) 3.1          3.1
 Mali T860  Midgard (v5) 3.1          3.1
 Mali T880  Midgard (v5) 3.1          3.1
 Mali G72   Bifrost (v6) 3.1          3.1
 Mali G31   Bifrost (v7) 3.1          3.1
 Mali G51   Bifrost (v7) 3.1          3.1
 Mali G52   Bifrost (v7) 3.1          3.1
 Mali G76   Bifrost (v7) 3.1          3.1
 Mali G57   Valhall (v9) 3.1          3.1
 =========  ============ ============ =======

 Other Midgard and Bifrost chips (T604, G71) are not yet supported.

 Older Mali chips based on the Utgard architecture (Mali 400, Mali 450) are
 supported in the :doc:`Lima <lima>` driver, not Panfrost. Lima is also
 available in Mesa.

 Other graphics APIs (Vulkan, OpenCL) are not supported at this time.

 Building
 --------

 Panfrost's OpenGL support is a Gallium driver. Since Mali GPUs are 3D-only and
 do not include a display controller, Mesa uses kmsro to support display
 controllers paired with Mali GPUs. If your board with a Panfrost supported GPU
 has a display controller with mainline Linux support not supported by kmsro,
 it's easy to add support, see the commit ``cff7de4bb597e9`` as an example.

 LLVM is *not* required by Panfrost's compilers. LLVM support in Mesa can
 safely be disabled for most OpenGL ES users with Panfrost.

 Build like ``meson . build/ -Dvulkan-drivers=
 -Dgallium-drivers=panfrost -Dllvm=disabled`` for a build directory
 ``build``.

 For general information on building Mesa, read :doc:`the install documentation
 <../install>`.

 Chat
 ----

 Panfrost developers and users hang out on IRC at ``#panfrost`` on OFTC. Note
 that registering and authenticating with ``NickServ`` is required to prevent
 spam. `Join the chat. <https://webchat.oftc.net/?channels=panfrost>`_

 Compressed texture support
 --------------------------

 In the driver, Panfrost supports ASTC, ETC, and all BCn formats (e.g. RGTC,
 S3TC, etc.) However, Panfrost depends on the hardware to support these formats
 efficiently.  All supported Mali architectures support these formats, but not
 every system-on-chip with a Mali GPU support all these formats. Many lower-end
 systems lack support for some BCn formats, which can cause problems when playing
 desktop games with Panfrost. To check whether this issue applies to your
 system-on-chip, Panfrost includes a ``panfrost_texfeatures`` tool to query
 supported formats.

 To use this tool, include the option ``-Dtools=panfrost`` when configuring Mesa.
 Then inside your Mesa build directory, the tool is located at
 ``src/panfrost/tools/panfrost_texfeatures``. Copy it to your target device,
 set as executable as necessary, and run on the target device. A table of
 supported formats will be printed to standard output.

 drm-shim
 --------

 Panfrost implements ``drm-shim``, stubbing out the Panfrost kernel interface.
 Use cases for this functionality include:

 - Future hardware bring up
 - Running shader-db on non-Mali workstations
 - Reproducing compiler (and some driver) bugs without Mali hardware

 Although Mali hardware is usually paired with an Arm CPU, Panfrost is portable C
 code and should work on any Linux machine. In particular, you can test the
 compiler on shader-db on an Intel desktop.

 To build Mesa with Panfrost drm-shim, configure Meson with
 ``-Dgallium-drivers=panfrost`` and ``-Dtools=drm-shim``. See the above
 building section for a full invocation. The drm-shim binary will be built to
 ``build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so``.

 To use, set the ``LD_PRELOAD`` environment variable to the drm-shim binary.  It
 may also be necessary to set ``LIBGL_DRIVERS_PATH`` to the location where Mesa
 was installed.

 By default, drm-shim mocks a Mali-G52 system. To select a specific Mali GPU,
 set the ``PAN_GPU_ID`` environment variable to the desired GPU ID:

 =========  ============ =======
 Product    Architecture GPU ID
 =========  ============ =======
 Mali-T720  Midgard (v4) 720
 Mali-T860  Midgard (v5) 860
 Mali-G72   Bifrost (v6) 6221
 Mali-G52   Bifrost (v7) 7212
 Mali-G57   Valhall (v9) 9093
 =========  ============ =======

 Additional GPU IDs are enumerated in the ``panfrost_model_list`` list in
 ``src/panfrost/lib/pan_props.c``.

 As an example: assuming Mesa is installed to a local path ``~/lib`` and Mesa's
 build directory is ``~/mesa/build``, a shader can be compiled for Mali-G52 as:

 .. code-block:: console

    ~/shader-db$ BIFROST_MESA_DEBUG=shaders \
    LIBGL_DRIVERS_PATH=~/lib/dri/ \
    LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
    PAN_GPU_ID=7212 \
    ./run shaders/glmark/1-1.shader_test

 The same shader can be compiled for Mali-T720 as:

 .. code-block:: console

    ~/shader-db$ MIDGARD_MESA_DEBUG=shaders \
    LIBGL_DRIVERS_PATH=~/lib/dri/ \
    LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
    PAN_GPU_ID=720 \
    ./run shaders/glmark/1-1.shader_test

 These examples set the compilers' ``shaders`` debug flags to dump the optimized
 NIR, backend IR after instruction selection, backend IR after register
 allocation and scheduling, and a disassembly of the final compiled binary.

 As another example, this invocation runs a single dEQP test "on" Mali-G52,
 pretty-printing GPU data structures and disassembling all shaders
 (``PAN_MESA_DEBUG=trace``) as well as dumping raw GPU memory
 (``PAN_MESA_DEBUG=dump``). The ``EGL_PLATFORM=surfaceless`` environment variable
 and various flags to dEQP mimic the surfaceless environment that our
 continuous integration (CI) uses. This eliminates window system dependencies,
 although it requires a specially built CTS:

 .. code-block:: console

    ~/VK-GL-CTS/build/external/openglcts/modules$ PAN_MESA_DEBUG=trace,dump \
    LIBGL_DRIVERS_PATH=~/lib/dri/ \
    LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
    PAN_GPU_ID=7212 EGL_PLATFORM=surfaceless \
    ./glcts --deqp-surface-type=pbuffer \
    --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 \
    --deqp-surface-height=256 -n \
    dEQP-GLES31.functional.shaders.builtin_functions.common.abs.float_highp_compute

 U-interleaved tiling
 ---------------------

 Panfrost supports u-interleaved tiling. U-interleaved tiling is
 indicated by the ``DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED`` modifier.

 The tiling reorders whole pixels (blocks). It does not compress or modify the
 pixels themselves, so it can be used for any image format. Internally, images
 are divided into tiles. Tiles occur in source order, but pixels (blocks) within
 each tile are reordered according to a space-filling curve.

 For regular formats, 16x16 tiles are used. This harmonizes with the default tile
 size for binning and CRCs (transaction elimination). It also means a single line
 (16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.

 For formats that are already block compressed (S3TC, RGTC, etc), 4x4 tiles are
 used, where entire blocks are reorder. Most of these formats compress 4x4
 blocks, so this gives an effective 16x16 tiling. This justifies the tile size
 intuitively, though it's not a rule: ASTC may uses larger blocks.

 Within a tile, the X and Y bits are interleaved (like Morton order), but with a
 twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious.
 Visually, addresses take the form::

    | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |

 Reference routines to encode/decode u-interleaved images are available in
 ``src/panfrost/shared/test/test-tiling.cpp``, which documents the space-filling
 curve. This reference implementation is used to unit test the optimized
 implementation used in production. The optimized implementation is available in
 ``src/panfrost/shared/pan_tiling.c``.

 Although these routines are part of Panfrost, they are also used by Lima, as Arm
 introduced the format with Utgard. It is the only tiling supported on Utgard. On
 Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
 should be used instead where possible. However, not all formats are
 compressible, so u-interleaved tiling remains an important fallback on Panfrost.

 Instancing
 ----------

 The attribute descriptor lets the attribute unit compute the address of an
 attribute given the vertex and instance ID. Unfortunately, the way this works is
 rather complicated when instancing is enabled.

 To explain this, first we need to explain how compute and vertex threads are
 dispatched.  When a quad is dispatched, it receives a single, linear index.
 However, we need to translate that index into a (vertex id, instance id) pair.
 One option would be to do:

 .. math::
    \text{vertex id} = \text{linear id} \% \text{num vertices}

    \text{instance id} = \text{linear id} / \text{num vertices}

 but this involves a costly division and modulus by an arbitrary number.
 Instead, we could pad num_vertices. We dispatch padded_num_vertices *
 num_instances threads instead of num_vertices * num_instances, which results
 in some "extra" threads with vertex_id >= num_vertices, which we have to
 discard.  The more we pad num_vertices, the more "wasted" threads we
 dispatch, but the division is potentially easier.

 One straightforward choice is to pad num_vertices to the next power of two,
 which means that the division and modulus are just simple bit shifts and
 masking. But the actual algorithm is a bit more complicated. The thread
 dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
 to dividing by a power of two. As a result, padded_num_vertices can be
 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
 since we need less padding.

 padded_num_vertices is picked by the hardware. The driver just specifies the
 actual number of vertices. Note that padded_num_vertices is a multiple of four
 (presumably because threads are dispatched in groups of 4). Also,
 padded_num_vertices is always at least one more than num_vertices, which seems
 like a quirk of the hardware. For larger num_vertices, the hardware uses the
 following algorithm: using the binary representation of num_vertices, we look at
 the most significant set bit as well as the following 3 bits. Let n be the
 number of bits after those 4 bits. Then we set padded_num_vertices according to
 the following table:

 ==========  =======================
 high bits   padded_num_vertices
 ==========  =======================
 1000		   :math:`9 \cdot 2^n`
 1001		   :math:`5 \cdot 2^{n+1}`
 101x		   :math:`3 \cdot 2^{n+2}`
 110x		   :math:`7 \cdot 2^{n+1}`
 111x		   :math:`2^{n+4}`
 ==========  =======================

 For example, if num_vertices = 70 is passed to glDraw(), its binary
 representation is 1000110, so n = 3 and the high bits are 1000, and
 therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.

 The attribute unit works in terms of the original linear_id. if
 num_instances = 1, then they are the same, and everything is simple.
 However, with instancing things get more complicated. There are four
 possible modes, two of them we can group together:

 1. Use the linear_id directly. Only used when there is no instancing.

 2. Use the linear_id modulo a constant. This is used for per-vertex
 attributes with instancing enabled by making the constant equal
 padded_num_vertices. Because the modulus is always padded_num_vertices, this
 mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
 The shift field specifies the power of two, while the extra_flags field
 specifies the odd number. If shift = n and extra_flags = m, then the modulus
 is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
 computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
 extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
 algorithm used to get padded_num_vertices in order to correctly implement
 per-vertex attributes.

 3. Divide the linear_id by a constant. In order to correctly implement
 instance divisors, we have to divide linear_id by padded_num_vertices times
 to user-specified divisor. So first we compute padded_num_vertices, again
 following the exact same algorithm that the hardware uses, then multiply it
 by the GL-level divisor to get the hardware-level divisor. This case is
 further divided into two more cases. If the hardware-level divisor is a
 power of two, then we just need to shift. The shift amount is specified by
 the shift field, so that the hardware-level divisor is just
 :math:`2^\text{shift}`.

 If it isn't a power of two, then we have to divide by an arbitrary integer.
 For that, we use the well-known technique of multiplying by an approximation
 of the inverse. The driver must compute the magic multiplier and shift
 amount, and then the hardware does the multiplication and shift. The
 hardware and driver also use the "round-down" optimization as described in
 https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
 The hardware further assumes the multiplier is between :math:`2^{31}` and
 :math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
 to 0 by the driver -- presumably this simplifies the hardware multiplier a
 little. The hardware first multiplies linear_id by the multiplier and
 takes the high 32 bits, then applies the round-down correction if
 extra_flags = 1, then finally shifts right by the shift field.

 There are some differences between ridiculousfish's algorithm and the Mali
 hardware algorithm, which means that the reference code from ridiculousfish
 doesn't always produce the right constants. Mali does not use the pre-shift
 optimization, since that would make a hardware implementation slower (it
 would have to always do the pre-shift, multiply, and post-shift operations).
 It also forces the multiplier to be at least :math:`2^{31}`, which means
 that the exponent is entirely fixed, so there is no trial-and-error.
 Altogether, given the divisor d, the algorithm the driver must follow is:

 1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
 2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`.
 3. If :math:`e <= 2^{shift}`, then we need to use the round-down algorithm. Set
    magic_divisor = m - 1 and extra_flags = 1.  4. Otherwise, set magic_divisor =
    m and extra_flags = 0.
	Panfrost
	========

	The Panfrost driver stack includes an OpenGL ES implementation for Arm Mali
	GPUs based on the Midgard and Bifrost microarchitectures. It is conformant
	on Mali-G52 and Mali-G57 but non-conformant on other GPUs. The following
	hardware is currently supported:

	========= ============ ============ =======
	Product Architecture OpenGL ES OpenGL
	========= ============ ============ =======
	Mali T620 Midgard (v4) 2.0 2.1
	Mali T720 Midgard (v4) 2.0 2.1
	Mali T760 Midgard (v5) 3.1 3.1
	Mali T820 Midgard (v5) 3.1 3.1
	Mali T830 Midgard (v5) 3.1 3.1
	Mali T860 Midgard (v5) 3.1 3.1
	Mali T880 Midgard (v5) 3.1 3.1
	Mali G72 Bifrost (v6) 3.1 3.1
	Mali G31 Bifrost (v7) 3.1 3.1
	Mali G51 Bifrost (v7) 3.1 3.1
	Mali G52 Bifrost (v7) 3.1 3.1
	Mali G76 Bifrost (v7) 3.1 3.1
	Mali G57 Valhall (v9) 3.1 3.1
	========= ============ ============ =======

	Other Midgard and Bifrost chips (T604, G71) are not yet supported.

	Older Mali chips based on the Utgard architecture (Mali 400, Mali 450) are
	supported in the :doc:`Lima <lima>` driver, not Panfrost. Lima is also
	available in Mesa.

	Other graphics APIs (Vulkan, OpenCL) are not supported at this time.

	Building
	--------

	Panfrost's OpenGL support is a Gallium driver. Since Mali GPUs are 3D-only and
	do not include a display controller, Mesa uses kmsro to support display
	controllers paired with Mali GPUs. If your board with a Panfrost supported GPU
	has a display controller with mainline Linux support not supported by kmsro,
	it's easy to add support, see the commit ``cff7de4bb597e9`` as an example.

	LLVM is not required by Panfrost's compilers. LLVM support in Mesa can
	safely be disabled for most OpenGL ES users with Panfrost.

	Build like ``meson . build/ -Dvulkan-drivers=
	-Dgallium-drivers=panfrost -Dllvm=disabled`` for a build directory
	``build``.

	For general information on building Mesa, read :doc:`the install documentation
	<../install>`.

	Chat
	----

	Panfrost developers and users hang out on IRC at ``#panfrost`` on OFTC. Note
	that registering and authenticating with ``NickServ`` is required to prevent
	spam. `Join the chat. <https://webchat.oftc.net/?channels=panfrost>`_

	Compressed texture support
	--------------------------

	In the driver, Panfrost supports ASTC, ETC, and all BCn formats (e.g. RGTC,
	S3TC, etc.) However, Panfrost depends on the hardware to support these formats
	efficiently. All supported Mali architectures support these formats, but not
	every system-on-chip with a Mali GPU support all these formats. Many lower-end
	systems lack support for some BCn formats, which can cause problems when playing
	desktop games with Panfrost. To check whether this issue applies to your
	system-on-chip, Panfrost includes a ``panfrost_texfeatures`` tool to query
	supported formats.

	To use this tool, include the option ``-Dtools=panfrost`` when configuring Mesa.
	Then inside your Mesa build directory, the tool is located at
	``src/panfrost/tools/panfrost_texfeatures``. Copy it to your target device,
	set as executable as necessary, and run on the target device. A table of
	supported formats will be printed to standard output.

	drm-shim
	--------

	Panfrost implements ``drm-shim``, stubbing out the Panfrost kernel interface.
	Use cases for this functionality include:

	- Future hardware bring up
	- Running shader-db on non-Mali workstations
	- Reproducing compiler (and some driver) bugs without Mali hardware

	Although Mali hardware is usually paired with an Arm CPU, Panfrost is portable C
	code and should work on any Linux machine. In particular, you can test the
	compiler on shader-db on an Intel desktop.

	To build Mesa with Panfrost drm-shim, configure Meson with
	``-Dgallium-drivers=panfrost`` and ``-Dtools=drm-shim``. See the above
	building section for a full invocation. The drm-shim binary will be built to
	``build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so``.

	To use, set the ``LD_PRELOAD`` environment variable to the drm-shim binary. It
	may also be necessary to set ``LIBGL_DRIVERS_PATH`` to the location where Mesa
	was installed.

	By default, drm-shim mocks a Mali-G52 system. To select a specific Mali GPU,
	set the ``PAN_GPU_ID`` environment variable to the desired GPU ID:

	========= ============ =======
	Product Architecture GPU ID
	========= ============ =======
	Mali-T720 Midgard (v4) 720
	Mali-T860 Midgard (v5) 860
	Mali-G72 Bifrost (v6) 6221
	Mali-G52 Bifrost (v7) 7212
	Mali-G57 Valhall (v9) 9093
	========= ============ =======

	Additional GPU IDs are enumerated in the ``panfrost_model_list`` list in
	``src/panfrost/lib/pan_props.c``.

	As an example: assuming Mesa is installed to a local path ``~/lib`` and Mesa's
	build directory is ``~/mesa/build``, a shader can be compiled for Mali-G52 as:

	.. code-block:: console

	~/shader-db$ BIFROST_MESA_DEBUG=shaders \
	LIBGL_DRIVERS_PATH=~/lib/dri/ \
	LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
	PAN_GPU_ID=7212 \
	./run shaders/glmark/1-1.shader_test

	The same shader can be compiled for Mali-T720 as:

	.. code-block:: console

	~/shader-db$ MIDGARD_MESA_DEBUG=shaders \
	LIBGL_DRIVERS_PATH=~/lib/dri/ \
	LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
	PAN_GPU_ID=720 \
	./run shaders/glmark/1-1.shader_test

	These examples set the compilers' ``shaders`` debug flags to dump the optimized
	NIR, backend IR after instruction selection, backend IR after register
	allocation and scheduling, and a disassembly of the final compiled binary.

	As another example, this invocation runs a single dEQP test "on" Mali-G52,
	pretty-printing GPU data structures and disassembling all shaders
	(``PAN_MESA_DEBUG=trace``) as well as dumping raw GPU memory
	(``PAN_MESA_DEBUG=dump``). The ``EGL_PLATFORM=surfaceless`` environment variable
	and various flags to dEQP mimic the surfaceless environment that our
	continuous integration (CI) uses. This eliminates window system dependencies,
	although it requires a specially built CTS:

	.. code-block:: console

	~/VK-GL-CTS/build/external/openglcts/modules$ PAN_MESA_DEBUG=trace,dump \
	LIBGL_DRIVERS_PATH=~/lib/dri/ \
	LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
	PAN_GPU_ID=7212 EGL_PLATFORM=surfaceless \
	./glcts --deqp-surface-type=pbuffer \
	--deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 \
	--deqp-surface-height=256 -n \
	dEQP-GLES31.functional.shaders.builtin_functions.common.abs.float_highp_compute

	U-interleaved tiling
	---------------------

	Panfrost supports u-interleaved tiling. U-interleaved tiling is
	indicated by the ``DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED`` modifier.

	The tiling reorders whole pixels (blocks). It does not compress or modify the
	pixels themselves, so it can be used for any image format. Internally, images
	are divided into tiles. Tiles occur in source order, but pixels (blocks) within
	each tile are reordered according to a space-filling curve.

	For regular formats, 16x16 tiles are used. This harmonizes with the default tile
	size for binning and CRCs (transaction elimination). It also means a single line
	(16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.

	For formats that are already block compressed (S3TC, RGTC, etc), 4x4 tiles are
	used, where entire blocks are reorder. Most of these formats compress 4x4
	blocks, so this gives an effective 16x16 tiling. This justifies the tile size
	intuitively, though it's not a rule: ASTC may uses larger blocks.

	Within a tile, the X and Y bits are interleaved (like Morton order), but with a
	twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious.
	Visually, addresses take the form::

	\| y3 \| (x3 ^ y3) \| y2 \| (y2 ^ x2) \| y1 \| (y1 ^ x1) \| y0 \| (y0 ^ x0) \|

	Reference routines to encode/decode u-interleaved images are available in
	``src/panfrost/shared/test/test-tiling.cpp``, which documents the space-filling
	curve. This reference implementation is used to unit test the optimized
	implementation used in production. The optimized implementation is available in
	``src/panfrost/shared/pan_tiling.c``.

	Although these routines are part of Panfrost, they are also used by Lima, as Arm
	introduced the format with Utgard. It is the only tiling supported on Utgard. On
	Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
	should be used instead where possible. However, not all formats are
	compressible, so u-interleaved tiling remains an important fallback on Panfrost.

	Instancing
	----------

	The attribute descriptor lets the attribute unit compute the address of an
	attribute given the vertex and instance ID. Unfortunately, the way this works is
	rather complicated when instancing is enabled.

	To explain this, first we need to explain how compute and vertex threads are
	dispatched. When a quad is dispatched, it receives a single, linear index.
	However, we need to translate that index into a (vertex id, instance id) pair.
	One option would be to do:

	.. math::
	\text{vertex id} = \text{linear id} \% \text{num vertices}

	\text{instance id} = \text{linear id} / \text{num vertices}

	but this involves a costly division and modulus by an arbitrary number.
	Instead, we could pad num_vertices. We dispatch padded_num_vertices *
	num_instances threads instead of num_vertices * num_instances, which results
	in some "extra" threads with vertex_id >= num_vertices, which we have to
	discard. The more we pad num_vertices, the more "wasted" threads we
	dispatch, but the division is potentially easier.

	One straightforward choice is to pad num_vertices to the next power of two,
	which means that the division and modulus are just simple bit shifts and
	masking. But the actual algorithm is a bit more complicated. The thread
	dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
	to dividing by a power of two. As a result, padded_num_vertices can be
	1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
	since we need less padding.

	padded_num_vertices is picked by the hardware. The driver just specifies the
	actual number of vertices. Note that padded_num_vertices is a multiple of four
	(presumably because threads are dispatched in groups of 4). Also,
	padded_num_vertices is always at least one more than num_vertices, which seems
	like a quirk of the hardware. For larger num_vertices, the hardware uses the
	following algorithm: using the binary representation of num_vertices, we look at
	the most significant set bit as well as the following 3 bits. Let n be the
	number of bits after those 4 bits. Then we set padded_num_vertices according to
	the following table:

	========== =======================
	high bits padded_num_vertices
	========== =======================
	1000 :math:`9 \cdot 2^n`
	1001 :math:`5 \cdot 2^{n+1}`
	101x :math:`3 \cdot 2^{n+2}`
	110x :math:`7 \cdot 2^{n+1}`
	111x :math:`2^{n+4}`
	========== =======================

	For example, if num_vertices = 70 is passed to glDraw(), its binary
	representation is 1000110, so n = 3 and the high bits are 1000, and
	therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.

	The attribute unit works in terms of the original linear_id. if
	num_instances = 1, then they are the same, and everything is simple.
	However, with instancing things get more complicated. There are four
	possible modes, two of them we can group together:

	1. Use the linear_id directly. Only used when there is no instancing.

	2. Use the linear_id modulo a constant. This is used for per-vertex
	attributes with instancing enabled by making the constant equal
	padded_num_vertices. Because the modulus is always padded_num_vertices, this
	mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
	The shift field specifies the power of two, while the extra_flags field
	specifies the odd number. If shift = n and extra_flags = m, then the modulus
	is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
	computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
	extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
	algorithm used to get padded_num_vertices in order to correctly implement
	per-vertex attributes.

	3. Divide the linear_id by a constant. In order to correctly implement
	instance divisors, we have to divide linear_id by padded_num_vertices times
	to user-specified divisor. So first we compute padded_num_vertices, again
	following the exact same algorithm that the hardware uses, then multiply it
	by the GL-level divisor to get the hardware-level divisor. This case is
	further divided into two more cases. If the hardware-level divisor is a
	power of two, then we just need to shift. The shift amount is specified by
	the shift field, so that the hardware-level divisor is just
	:math:`2^\text{shift}`.

	If it isn't a power of two, then we have to divide by an arbitrary integer.
	For that, we use the well-known technique of multiplying by an approximation
	of the inverse. The driver must compute the magic multiplier and shift
	amount, and then the hardware does the multiplication and shift. The
	hardware and driver also use the "round-down" optimization as described in
	https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
	The hardware further assumes the multiplier is between :math:`2^{31}` and
	:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
	to 0 by the driver -- presumably this simplifies the hardware multiplier a
	little. The hardware first multiplies linear_id by the multiplier and
	takes the high 32 bits, then applies the round-down correction if
	extra_flags = 1, then finally shifts right by the shift field.

	There are some differences between ridiculousfish's algorithm and the Mali
	hardware algorithm, which means that the reference code from ridiculousfish
	doesn't always produce the right constants. Mali does not use the pre-shift
	optimization, since that would make a hardware implementation slower (it
	would have to always do the pre-shift, multiply, and post-shift operations).
	It also forces the multiplier to be at least :math:`2^{31}`, which means
	that the exponent is entirely fixed, so there is no trial-and-error.
	Altogether, given the divisor d, the algorithm the driver must follow is:

	1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
	2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`.
	3. If :math:`e <= 2^{shift}`, then we need to use the round-down algorithm. Set
	magic_divisor = m - 1 and extra_flags = 1. 4. Otherwise, set magic_divisor =
	m and extra_flags = 0.