HotSort throughput on Vulkan (Mesa) on a 704-core AMD V1807B APU:
HotSort throughput on Vulkan on a 192-core Intel HD 630:
Note that these sorting rates translate to sub-millisecond to multi-millisecond execution times on small GPUs:
This repository contains the HotSort API, a utility for generating HotSort's compute shaders, and recipes for a number of target architectures.
A simple benchmarking example for HotSort can be found here: hs_bench_vk
.
Note that HotSort is a comparison sort and supports in-place sorting.
Not all targeted architectures have been tested.
The following architectures are supported:
Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes |
---|---|---|---|---|---|
NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | Not tested on all architectures |
NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels |
AMD | GCN | :white_check_mark: | :white_check_mark: | :x: | Tested on Linux MESA 18.2 |
Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Good but the assumed best-shaped kernels aren't being used due to a compiler issue |
Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels |
Add an arch-specific HotSort algorithm (aka “target”) to your project by including a .c
source and .h
header file:
Key Size | Source | Header |
---|---|---|
32‑bit | hotsort/vk/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c | hotsort/vk/<vendor>/<arch>/u32/hs_target.h |
64‑bit | hotsort/vk/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c | hotsort/vk/<vendor>/<arch>/u64/hs_target.h |
To sort count
keys on Vulkan:
#include "hotsort/vk/intel/gen8/u32/hs_target.h" ... // create the Vulkan HotSort target struct hs_vk * hs = hs_vk_create(<address of target>,...); // allocate a descriptor set from a pool VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool); ... // command buffer begin ... // bind buffer(s) to a command buffer hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting // see how much padding may be required hs_vk_pad(hs,count,&count_padded_in,&count_padded_out); // append compute shaders to command buffer hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args ... // command buffer end and queue submit ... // release the Vulkan HotSort target hs_vk_release(hs,...);
The hs_vk.h
header file describes these functions in greater detail.
The HotSort sorting algorithm was created in 2012 and generalized in 2015 to support GPUs that benefit from non-power-of-two workgroups.
The objective of HotSort is to achieve high throughput as early as possible on small GPUs when sorting modestly-sized arrays ― 1,000s to 100s of thousands of 64‑bit keys.
HotSort uses both well-known and obscure properties of bitonic sequences to create a novel mapping of keys onto data-parallel devices like GPUs.
The overall flow of the HotSort algorithm is below. Kernel launches are in italics.
The algorithm begins with a very dense per-multiprocessor block sorting algorithm that loads a “slab” of keys into a subgroup's registers, sorts the slabs, merges all slabs in the workgroup, and stores the slabs back to global memory.
In the slab sorting phase, each lane of a subgroup executes a bitonic sorting network on its registers and successively merges lanes until the slab of registers is sorted in serpentine order.
HotSort has several different merging strategies.
The merging kernels leverage the multiprocessor's register file by loading, merging and storing a large number of strided slab rows without using local memory.
The merging kernels exploit the bitonic sequence property that interleaved subsequences of a bitonic sequence are also bitonic sequences. This property also holds for non-power-of-two sequences.
As an example, the Streaming Flip Merge kernel is illustrated below:
HotSort's initial sorting and merging phases are performed on bitonic sequences. Because of this, throughput decreases as the problem size increases.
A hybrid algorithm that combined HotSort‘s block sorter and several rounds of merging with a state-of-the-art GPU merging algorithm would probably improve the algorithm’s performance on larger arrays.
The original version of HotSort ran on pre-Kepler GPUs without intra-warp/inter-lane shuffling ― reenable this capability.