
HotSort

HotSort is a high-performance GPU-accelerated integer sorting library for Vulkan.

HotSort's advantages include:

  • Ultra-fast sorting of 32‑bit or 64‑bit keys
  • Reaches peak throughput on small arrays
  • In-place sorting for low-memory environments
  • Strong scaling with number of multiprocessors
  • Low memory transactions relative to array size
  • A concurrency-friendly dense kernel grid
  • Support for GPU post-processing of sorted results

Although HotSort is a comparison sort, it's typically significantly faster than other GPU-accelerated algorithms when sorting arrays smaller than ~500K keyvals.

Benchmarks

Throughput

Here is a throughput plot for HotSort sorting 32-bit and 64-bit keys with a 640-core Quadro M1200:

NVIDIA throughput for 32-bit keys
NVIDIA throughput for 64-bit keys

HotSort throughput on Vulkan (Mesa) with a 704-core AMD V1807B APU:

AMD throughput

HotSort throughput on Vulkan with a 192-core Intel HD 630:

Intel throughput

Execution time

Note that these sorting rates translate to sub-millisecond to multi-millisecond execution times on small GPUs:

NVIDIA duration for 32-bit keys
NVIDIA duration for 64-bit keys
AMD duration
Intel duration

Usage

HotSort provides a build-time tool, hotsort_gen, that generates highly optimized sets of compiled compute kernels targeting a specific GPU architecture. These binary code modules are packed with configuration data into what is called a HotSort target file.

HotSort also provides a small runtime library to load a target into your application.

A simple benchmarking example for HotSort can be found here: hotsort_vk_bench.

Note that HotSort implements a comparison sort and supports in-place sorting.

Not all targeted architectures have been tested.

The following architectures are supported:

| Vendor | Architecture                                    | 32‑bit | 64‑bit | 32+32‑bit | Notes                                                                        |
|--------|-------------------------------------------------|--------|--------|-----------|------------------------------------------------------------------------------|
| NVIDIA | sm_35, sm_37, sm_50, sm_52, sm_60, sm_61, sm_70 | ✔      | ✔      | 🚫        | Not tested on all architectures                                              |
| NVIDIA | sm_30, sm_32, sm_53, sm_62                      | 🚫     | 🚫     | 🚫        | Need to generate properly shaped kernels                                     |
| AMD    | GCN                                             | ✔      | ✔      | 🚫        | Tested on Linux MESA 18.2                                                    |
| Intel  | GEN8+                                           | ✔      | ✔      | 🚫        | Good, but the assumed best-shaped kernels aren't being used due to a compiler issue |
| Intel  | APL/GLK using a 2x9 or 1x12 thread pool         | 🚫     | 🚫     | 🚫        | Need to generate properly shaped kernels                                     |

HotSort target generation

A HotSort target can be generated using the hotsort_target GN template.

See the hotsort_vk_bench BUILD.gn for a concrete example.

By default, the HotSort target's binary data is made available as a C source file defining an array of uint32_t literals, along with a corresponding header declaring the array by name.

HotSort target usage

Include the generated hs_target.h file in your source code to access the HotSort target's data, compiled as a uint32_t array.

Include hotsort_vk.h in your source code to access the Vulkan-based HotSort APIs that load the HotSort target data and run it. See the comments in this header file for more details about the API.

For example, to sort count keys on Vulkan:

// Provides the hotsort_vk_*() functions.
#include "hotsort_vk.h"

// Defines the hs_intel_gen8_u32 variable pointing to the target's data.
#include "targets/intel/gen8/u32/hs_target.h"

// create a HotSort instance from a target
struct hotsort_vk * hs = hotsort_vk_create(...,
                                           <pipeline layout>,
                                           <descriptor set locations>,
                                           &hs_intel_gen8_u32);

...

// bind pipeline-compatible descriptor sets
...

// see how much padding may be required
hotsort_vk_pad(hs, count, &count_padded_in, &count_padded_out);

// append the sorting compute shaders to a command buffer
hotsort_vk_sort(cb, hs, <array offsets>, count, count_padded_in, count_padded_out);

// end the command buffer and submit it to a queue
...

// release the HotSort instance
hotsort_vk_release(hs, ...);

Background

The HotSort sorting algorithm was created in 2012 and generalized in 2015 to support GPUs that benefit from non-power-of-two workgroups.

The primary use case of HotSort is to achieve high throughput as early as possible on small GPUs when sorting arrays containing thousands to hundreds of thousands of integer keys.

HotSort uses unique properties of bitonic sequences to create a novel mapping of keys onto data-parallel devices like GPUs.

Overview

The overall flow of the HotSort algorithm is below. Kernel launches are in italics.

  1. For each workgroup of slabs:
    1. For each slab in the workgroup:
      1. Slab Load
      2. Slab Sort
    2. Until all slabs in the workgroup are merged:
      1. Multi-Slab Flip Merge
    3. Slab Store
  2. Until all slabs are merged:
    1. Streaming Flip Merge
    2. If necessary, Streaming Half Merge
    3. If necessary, Multi-Slab Half Merge
    4. If necessary, Slab Half Merge
    5. If complete:
      1. Optionally, Report Key Changes
      2. Optionally, Slab Transpose & Store
    6. Otherwise: Slab Store
  3. Done

Sorting

The algorithm begins with a very dense per-multiprocessor block sorting algorithm that loads a “slab” of keys into a subgroup's registers, sorts the slabs, merges all slabs in the workgroup, and stores the slabs back to global memory.

In the slab sorting phase, each lane of a subgroup executes a bitonic sorting network on its registers and successively merges lanes until the slab of registers is sorted in serpentine order.

Slab layout

Merging

HotSort has several different merging strategies.

The merging kernels leverage the multiprocessor's register file by loading, merging, and storing a large number of strided slab rows without using local memory.

The merging kernels exploit the bitonic sequence property that interleaved subsequences of a bitonic sequence are also bitonic sequences. This property also holds for non-power-of-two sequences.

As an example, the Streaming Flip Merge kernel is illustrated below:

Flip merge algorithm

Future Enhancements

Hybrid improved merging

HotSort's initial sorting and merging phases are performed on bitonic sequences. Because of this, throughput decreases as the problem size increases.

A hybrid algorithm that combines HotSort's block sorter and several rounds of merging with a state-of-the-art GPU merging algorithm would likely improve performance on larger arrays.

Reenable support for devices lacking shuffle functions

The original version of HotSort ran on pre-Kepler GPUs without intra-warp/inter-lane shuffling; this capability should be reenabled.