commit	01aac629320dfd7ecc16f756c16f6990f01a22fa	[log] [tgz]
author	Jonathan Oliver <jonathan.s.oliver42@gmail.com>	Sat Oct 05 20:03:02 2019 -0600
committer	Jonathan Oliver <jonathan.s.oliver42@gmail.com>	Sat Oct 05 20:03:02 2019 -0600
tree	d7b127f26766f99a1f0f456b57427b68eb77935b
parent	874e3c74063992e7b8e137c38909809fbf7e1bdc [diff]

tree: d7b127f26766f99a1f0f456b57427b68eb77935b

README.md

Notes

Disruptor Overview

This is a port of the LMAX Disruptor into the Go programming language. It retains the essence and spirit of the Disruptor and utilizes a lot of the same abstractions and concepts, but does not maintain the same API.

On a MacBook Pro (Intel Core i9-8950HK CPU @ 2.90GHz) using Go 1.13.1, it can pass many hundreds of millions of messages per second (yes, you read that right) from one goroutine to another goroutine. The message being transferred between the two CPU cores was a simple, incrementing number, but literally could be anything. Note that your mileage may vary and that different operating systems can introduce significant “jitter” into the application by taking control of the CPU and invalidating the various CPU caches. Linux and Windows have the ability to assign a given process to specific CPU cores which reduces jitter significantly by keeping all the CPU caches hot.

Once initialized and running, one of the preeminent design considerations of the Disruptor is to process messages at a constant rate. It does this using two primary techniques. First, it avoids using locks at all costs which cause contention between CPU cores and prevents true scalability. Secondly, it produces no garbage by allowing the application to preallocate sequential space on a ring buffer. By avoiding garbage, the need for a garbage collection and the stop-the-world application pauses introduced can be almost entirely avoided.

In Go, the current channel implementation maintains lock around enqueue/dequeue operations and maxes out on the aforementioned hardware at about 25 million messages per second for uncontended access—more than an orders of magnitude slower when compared to the Disruptor. The same channel, when contended between OS threads only pushes about 7 million messages per second.

Example Usage

func main() {
    // wireup
    options := []disruptor.Options{
        disruptor.WithCapacity(BufferSize),
        disruptor.WithConsumerGroup(MyConsumer{})} 

    sequencer, listener := disruptor.New(options...)).Build()
    
    // publish messages
    go func() {
        reservation := sequencer.Reserve(1)
        ringBuffer[sequence&RingBufferMask] = "Hello, World"
        sequencer.Commit(sequence, sequence)

        _ = listener.Close()
    }()
    
    // listen for messages and process them
    listener.Listen()
}

type MyConsumer struct{}

func (m MyConsumer) Consume(lowerSequence, upperSequence int64) {
	for sequence := lowerSequence; sequence <= upperSequence; sequence++ {
		message := ringBuffer[sequence&RingBufferMask]
        fmt.Println(message)		
	}
}

var ringBuffer = [BufferSize]string

const (
	BufferSize   = 1024 * 64
	BufferMask   = BufferSize - 1
)

The above example is more complex than a typical channel implementation. When removing all the comments, wireup, and extra fluff to explain what‘s happening, the code is very concise. In fact, a given “Publish” only takes three lines—Reserve a slot, update the ring buffer at that slot, and Commit the reserved sequence range. On the consumer side, there’s a for-loop to handle all incoming items into your application. Again, this not quite as concise as a channel (nor as flexible), but it's much, much faster.

Benchmarks

Each of the following benchmark tests sends an incrementing sequence message from one goroutine to another. The receiving goroutine asserts that the message is received is the expected incrementing sequence value. Any failures cause a panic. Unless otherwise noted, all tests were run using GOMAXPROCS=2.

MacBook Pro 15" Retina, Mid 2014

CPU: Intel Core i9-8950HK CPU @ 2.90GHz
Operation System: OS X 10.14.6
Go Runtime: Go 1.13.1
Go Architecture: amd64

Scenario	Per Operation Time
Channels: Buffered, Blocking	53 ns
Channels: Buffered, Blocking	54 ns
Channels: Buffered, Blocking, Contended Write	121 ns
Channels: Buffered, Non-blocking	60 ns
Channels: Buffered, Non-blocking	68 ns
Channels: Buffered, Non-blocking Contended Write	205 ns
Disruptor: Sequencer, Reserve One	10 ns
Disruptor: Sequencer, Reserve Many	8 ns
Disruptor: Sequencer, Reserve One, Multiple Readers	15 ns
Disruptor: Sequencer, Reserve Many, Multiple Readers	10 ns

When In Doubt, Use Channels

Despite Go channels being significantly slower than the Disruptor, channels should still be considered the easiest, best, and most desirable choice for the vast majority of all use cases. The Disruptor's target use case is ultra-low latency environments where application response times are measured in nanoseconds and where stable, consistent latency is paramount and latency spikes cannot be tolerated.

Pre-Alpha

This code is currently experimental (pre-Alpha stage) and is not supported or recommended for production environments. It does not have any unit tests and is only meant serve as spike code to and a proof of concept that the Disruptor is possible on the Go runtime despite some of the limits imposed by the Go memory model.

We are very interested to receive feedback on this project and how performance can be improved using subtle techniques such as additional cache line padding, memory alignment, utilizing a pointer vs a struct in a given location, replacing less optimal techniques with more optimal ones, especially in the performance critical paths of Reserve/Commit in the various Sequencer structs as well as the Receive operation of the Listener.

Caveats

One last caveat worth noting. In the Java-based Disruptor implementation, a ring buffer is created, preallocated, and prepopulated with instances of the class which serve as the message type to be transferred between threads. Because Go lacks generics, we have opted to not interact with ring buffers at all within the library code. This has the benefit of avoiding an unnecessary type conversions (“casts”) during the receipt of a given message from type interface{} to a concrete type. It also means that it is the responsibility of the application developer to create and populate their particular ring buffer during application wireup. Pre-populating the ring buffer at startup should ensure contiguous memory allocation for all items in the various ring buffer slots, whereas on-the-fly creation may introduce gaps in the memory allocation and subsequent CPU cache misses which introduce latency spikes.

The reference to the ring buffer can (but need not be) be scoped as a package-level variable. The reason for this is that any given application should have very few Disruptor instances. The instances are designed to be created at startup and stopped during shutdown. They are not typically meant to be created ad-hoc and passed around like channel instances. It is the responsibility of the application developer to manage references to the ring buffer instances such that the producer can push messages to the buffer and consumers can receive messages from the buffer.