The code in this module governs when worker threads should go to sleep. This is a tricky topic -- the work-stealing algorithm relies on having active worker threads running around stealing from one another. But, if there isn't a lot of work, this can be a bit of a drag, because it requires high CPU usage.
The code in this module takes a fairly simple approach to the problem. It allows worker threads to fall asleep if they have failed to steal work after various thresholds; however, whenever new work appears, they will wake up briefly and try to steal again. There are some shortcomings in this current approach, but it is simple and cheap in the common case.
Workers interact with the sleep module by invoking three methods:

- `work_found()`: signals that the worker found some work and is about to execute it.
- `no_work_found()`: signals that the worker searched all available sources for work and found none.
- `tickle()`: indicates that new work is available (e.g., a job has been pushed to the local deque) or that some other blocking condition has been resolved (e.g., a latch has been set). Wakes up any sleeping workers.

When in a loop searching for work, workers also have to maintain an integer `yields` that they provide to the sleep module (which will return a new value for the next time). Thus the basic worker “find work” loop looks like this (this is `wait_until()`, basically):
```rust
let mut yields = 0;
while /* not done */ {
    if let Some(job) = search_for_work() {
        yields = work_found(self.index, yields);
    } else {
        yields = no_work_found(self.index, yields);
    }
}
```
The basic idea here is that every worker goes through three states:

- **Awake:** the worker is actively searching for (or executing) work.
- **Sleepy:** the worker has repeatedly failed to find work and has announced, via the global state, that it is getting ready to sleep; it keeps searching for a while longer.
- **Asleep:** the worker is blocked on a condition variable, waiting to be tickled.
At any given time, only one worker can be in the sleepy state. This allows us to coordinate the entire sleep protocol using a single `AtomicUsize`, without the need for epoch counters or other values that might roll over and so forth.
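To make this concrete, here is a sketch of the shape such global state might take (field names are illustrative, not the exact implementation; the lock and condition variable are used for the actual blocking, described below):

```rust
use std::sync::atomic::AtomicUsize;
use std::sync::{Condvar, Mutex};

// Sketch of the registry-global sleep state described in this document.
pub struct Sleep {
    // Packs "are any workers asleep?" and "which worker is sleepy, if
    // any?" into a single word (the encoding is described later on).
    state: AtomicUsize,
    // Lock acquired before blocking; guards the condition variable.
    data: Mutex<()>,
    // Sleeping workers block on this condition variable.
    tickle: Condvar,
}
```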
Whenever a worker invokes `work_found()`, it transitions back to the awake state. In other words, if it was sleepy, it stops being sleepy. (`work_found()` cannot be invoked when the worker is asleep, since then it is not doing anything.)
On the other hand, whenever a worker invokes `no_work_found()`, it may transition to a more sleepy state. To track this, we use the counter `yields` that is maintained by the worker's steal loop. This counter starts at 0. Whenever work is found, the counter is returned to 0. But each time that no work is found, the counter is incremented. Eventually it will reach a threshold `ROUNDS_UNTIL_SLEEPY`. At this point, the worker will try to become the sleepy one. It does this by executing a CAS into the global registry state (details on this below). If that attempt is successful, then the counter is incremented again, so that it is equal to `ROUNDS_UNTIL_SLEEPY + 1`. Otherwise, the counter stays the same (and hence we will keep trying to become sleepy until either work is found or we are successful).
Becoming sleepy does not put us to sleep immediately. Instead, we keep iterating and looking for work for some further number of rounds. If during this search we do find work, then we will return the counter to 0 and also reset the global state to indicate we are no longer sleepy.
But if again no work is found, `yields` will eventually reach the value `ROUNDS_UNTIL_ASLEEP`. At that point, we will try to transition from sleepy to asleep. This is done by the helper fn `sleep()`, which executes another CAS on the global state that removes our worker as the sleepy worker and instead sets a flag to indicate that there are sleeping workers present (the flag may already have been set; that's ok). Assuming that CAS succeeds, we will block on a condition variable.
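Putting the transitions together, the counter logic might look roughly like the following sketch. The threshold values and the helpers `get_sleepy`, `still_sleepy`, `worker_is_not_sleepy`, and `sleep` are illustrative assumptions, not the exact implementation:

```rust
use std::thread;

// Illustrative thresholds; the real values are tuning parameters.
const ROUNDS_UNTIL_SLEEPY: usize = 32;
const ROUNDS_UNTIL_ASLEEP: usize = 64;

impl Sleep {
    pub fn work_found(&self, worker_index: usize, yields: usize) -> usize {
        if yields > ROUNDS_UNTIL_SLEEPY {
            // We had claimed the sleepy slot; release it.
            self.worker_is_not_sleepy(worker_index);
        }
        0 // back to the awake state
    }

    pub fn no_work_found(&self, worker_index: usize, yields: usize) -> usize {
        if yields < ROUNDS_UNTIL_SLEEPY {
            // Still awake: yield the CPU and keep searching.
            thread::yield_now();
            yields + 1
        } else if yields == ROUNDS_UNTIL_SLEEPY {
            // Try to become the sleepy worker (a CAS on the global state).
            thread::yield_now();
            if self.get_sleepy(worker_index) { yields + 1 } else { yields }
        } else if yields < ROUNDS_UNTIL_ASLEEP {
            // Sleepy: keep scanning, unless a tickle cleared our sleepy state.
            thread::yield_now();
            if self.still_sleepy(worker_index) { yields + 1 } else { 0 }
        } else {
            // Out of rounds: transition from sleepy to asleep.
            self.sleep(worker_index);
            0 // when we are awoken, we start over awake
        }
    }
}
```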
Of course, while all the stuff in the previous section is happening, other workers are (hopefully) producing new work. There are three kinds of events that can allow a blocked worker to make progress:

1. A new task is pushed onto a worker's deque. This task could be stolen.
2. A new task is injected into the thread-pool from the outside. This task could be taken by any worker.
3. A latch is set. One of the sleeping workers might have been waiting on that latch before it could proceed.
Whenever one of these things happens, the worker (or thread, more generally) responsible must invoke `tickle()`. Tickle will basically wake up all the workers:

- If any worker was the sleepy one, the global state is changed so that there is no sleepy worker. The sleepy one will notice this the next time it invokes `no_work_found()` and return to the awake state (with a yield counter of zero).
- If any workers were actually asleep, we invoke `notify_all()` on the condition variable, which will cause them to awaken and start over from the awake state (with a yield counter of zero).

Because `tickle()` is invoked very frequently -- and hopefully most of the time it is not needed, because the workers are already actively stealing -- it is important that it be very cheap. The current design requires, in the case where nobody is even sleepy, just a load and a compare. If there are sleepy workers, a swap is needed. If there are workers asleep, we must naturally acquire the lock and signal the condition variable.
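In sketch form, `tickle()` might look like this (again illustrative; `AWAKE` is the all-awake state and `SLEEPING` is the low bit, per the encoding described below):

```rust
use std::sync::atomic::Ordering;

const AWAKE: usize = 0;    // nobody is sleepy or asleep
const SLEEPING: usize = 1; // bit 0: some workers are asleep

impl Sleep {
    // Sketch of tickle(): in the common case, one seq-cst load + compare.
    pub fn tickle(&self) {
        let old_state = self.state.load(Ordering::SeqCst);
        if old_state != AWAKE {
            // Someone is sleepy or asleep: clear the state with a swap.
            let old_state = self.state.swap(AWAKE, Ordering::SeqCst);
            if old_state & SLEEPING != 0 {
                // Workers are actually asleep: take the lock and notify.
                let _data = self.data.lock().unwrap();
                self.tickle.notify_all();
            }
        }
    }
}
```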
We manage all of the above state transitions using a small bit of global state (well, global to the registry). This is stored in the `Sleep` struct. The primary thing is a single `AtomicUsize`. The value in this usize packs in two pieces of information:
We use bit 0 to indicate whether any workers are asleep. So if `state & 1` is zero, then no workers are sleeping. But if `state & 1` is 1, then some workers are either sleeping or on their way to falling asleep (i.e., they have acquired the lock).
The remaining bits are used to store whether there is a sleepy worker. We want `0` to indicate that there is no sleepy worker. If there is a sleepy worker with index `worker_index`, we would store `(worker_index + 1) << 1`. The `+ 1` is there because worker indices are 0-based, so this ensures that the value is non-zero, and the shift skips over the sleeping bit (bit 0).
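Continuing the sketch, the encoding and decoding helpers might look like this:

```rust
// Encode "worker `worker_index` is the sleepy one" into the state word.
// The `+ 1` keeps the value non-zero; the shift skips the sleeping bit.
fn sleepy_state(worker_index: usize) -> usize {
    (worker_index + 1) << 1
}

// Decode: which worker is sleepy, if any?
fn sleepy_worker(state: usize) -> Option<usize> {
    match state >> 1 {
        0 => None,
        n => Some(n - 1),
    }
}

// Is the "workers asleep" bit set?
fn anyone_sleeping(state: usize) -> bool {
    state & SLEEPING != 0
}
```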
Some examples:

- `0`: everyone is awake, nobody is sleepy.
- `1`: some workers are asleep, no sleepy worker.
- `2`: no workers are asleep, but worker 0 is sleepy (`(0 + 1) << 1 == 2`).
- `3`: some workers are asleep, and worker 0 is sleepy.

In general, we do not want to miss wakeups. Two bad things could happen:

- If the missed wakeup concerns newly published work, we lose parallelism: the work sits unclaimed while workers that could have stolen it stay asleep.
- If the missed wakeup concerns a latch being set, the system can deadlock: a sleeping worker may be the very one blocked on that latch.
It turns out that guaranteeing we don't miss a wakeup while retaining good performance is fairly tricky. This is because of some details of the C++11 memory model. But let's ignore those for now and generally assume sequential consistency. In that case, our scheme should work perfectly.
Even if you assume seqcst, though, ensuring that you don't miss wakeups can be fairly tricky in the absence of a central queue. For example, consider the simplest scheme: imagine we just had a boolean flag indicating whether anyone was asleep. Then you could imagine that when workers find no work, they flip this flag to true. When work is published, if the flag is true, we issue a wakeup.
The problem here is that checking for new work is not an atomic action. So it's possible that worker 1 could start looking for work and (say) see that worker 0's queue is empty and then search workers 2..N. While that searching is taking place, worker 0 publishes some new work. At the time when the new work is published, the “anyone sleeping?” flag is still false, so nothing happens. Then worker 1, who failed to find any work, goes to sleep --- completely missing the wakeup!
We use the “sleepy worker” idea to sidestep this problem. Under our scheme, instead of going right to sleep at the end, worker 1 would become sleepy. Worker 1 would then do at least one additional scan. During this scan, they should find the work published by worker 0, so they will stop being sleepy and go back to work (here of course we are assuming that no one else has stolen the worker 0 work yet; if someone else stole it, worker 1 may still go to sleep, but that's ok, since there is no more work to be had).
Now you may be wondering -- how does being sleepy help? What if worker 0, instead of publishing its job right before worker 1 became sleepy, waited until right before worker 1 was going to go to sleep? In other words, what if the sequence were like this:

1. Worker 1 becomes sleepy.
2. Worker 1 starts another scan and sees that worker 0's queue is empty.
3. Worker 0 publishes its job; nobody is asleep yet, so no wakeups occur.
4. Worker 1 finishes its scan having found nothing and goes to sleep, missing the job.
The reason that this doesn't occur is that, when worker 0 publishes its job, it will see that there is a sleepy worker. It will clear the global state to 0. Then, when worker 1 finishes its scan, it will notice that it is no longer sleepy, and hence it will not go to sleep. Instead it will awaken and keep searching for work.
The sleepy worker phase thus also serves as a cheap way to signal that work is around: instead of doing the whole dance of acquiring a lock and issuing notifications, when we publish work we can just swap a single atomic counter and let the sleepy worker notice that on their own.
Unfortunately, the C++11 memory model doesn't generally guarantee seq-cst. And, somewhat annoyingly, it's not easy for the sleep module in isolation to guarantee the properties it needs. The key challenge has to do with the synchronized-with relation. Typically, we try to use acquire-release reasoning, and in that case the idea is that if a load observes a store, it will also observe those writes that preceded the store. But nothing says that the load must observe the store -- at least not right away.
The place where this is most relevant is the load in the `tickle()` routine. The routine begins by reading from the global state. If it sees anything other than 0, it then does a swap and -- if necessary -- acquires a lock and does a notify. This load is a seq-cst load (as are the other accesses in tickle). This ensures that it is sensible to talk about a tickle happening before a worker gets sleepy, and so forth.
It turns out that to get things right, if we use the current tickle routine, we have to use seq-cst operations both in the sleep module and when publishing work. We'll walk through two scenarios to show what I mean.
This scenario shows why the operations in sleep must be seq-cst. We want to ensure that once a worker gets sleepy, any other worker that does a tickle will observe that. In other words, we want to ensure that the following scenario cannot happen:

1. Worker 1 has found no work; it is waiting for the latch L to be set.
2. Worker 1 becomes sleepy: it executes a seq-cst CAS on the global state S, storing 4 (the encoding of “worker 1 is sleepy”: `(1 + 1) << 1`).
3. Worker 0 sets the latch L.
4. Worker 0 invokes `tickle()`, which reads the global state S, but fails to observe that worker 1 is sleepy; hence no wakeup is issued, and worker 1 later falls asleep, missing the latch.
Let's diagram this. The notation `read_xxx(A) = V` means that a read of location `A` was executed with the result `V`. The `xxx` is the ordering and the location `A` is either `L` (the latch) or `S` (the global state). I will leave the ordering on the latch as `xxx`, as it is not relevant here. The numbers correspond to the steps above.
```
    worker 0                      worker 1
       |                          |
       |                          +- 2: cas_sc(S, 4)
    s  | 3: write_xxx(L)          |
    b  | 4: read_sc(S) = ??? <-sc-+
       v
```
Clearly, this cannot happen with sc orderings, because the read in step 4 will always return `4` here. However, if we tried to use acquire-release orderings on the global state, then there would be no guarantee that the tickle will observe that a sleepy worker occurred. We would be guaranteed only that worker 0 would eventually observe that worker 1 had become sleepy (and, at that time, that it would see other writes). But it could take time -- and if we indeed miss that worker 1 is sleepy, it could lead to deadlock or loss of efficiency, as explained earlier.
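For reference, the CAS in step 2 might look like this sketch (reusing the illustrative `AWAKE` and `sleepy_state` from earlier; simplified in that it assumes nobody is asleep either):

```rust
use std::sync::atomic::Ordering;

impl Sleep {
    // Become the sleepy worker: a seq-cst CAS from "everyone awake" to
    // our encoded index. With a weaker ordering, a concurrent tickle's
    // load would not be forced to observe this store -- the bug above.
    fn get_sleepy(&self, worker_index: usize) -> bool {
        self.state
            .compare_exchange(
                AWAKE,
                sleepy_state(worker_index),
                Ordering::SeqCst,
                Ordering::SeqCst,
            )
            .is_ok()
    }
}
```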
This scenario shows why latch operations must also be seq-cst (and, more generally, so must any operations that publish work before a tickle). We wish to ensure that this ordering of events cannot occur:

1. Worker 1 is searching for work; it will block on the latch L if it finds none.
2. Worker 1 probes the latch L and sees that it is not set.
3. Worker 0 sets the latch L.
4. Worker 0 does a tickle: it reads the global state S and sees 0 (no sleepy or sleeping workers), so no wakeup is issued.
5. Worker 1 becomes sleepy, executing a CAS on the global state S.
6. Worker 1 probes the latch L again during its scan, still sees it as unset, and eventually goes to sleep. Deadlock.

In other words, we want to ensure that if worker 0 sets a latch and does a tickle before worker 1 gets sleepy, then worker 1 will observe that latch as set when it calls probe. We'll see that, with the current scheme, this implies that the latch memory orderings must be seq-cst as well.
Here is the diagram:
```
    worker 0                      worker 1
       |                          2: read_xxx(L) = false
    s  | 3: write_xxx(L, true)
    b  | 4: read_sc(S) = 0 -+
       |                    +-sc---> 5: cas_sc(S, 4)
       v                             6: read_xxx(L) = ???
```
The diagram shows that each thread's actions are related by sequenced-before (sb). Moreover, the read and write of `S` are related by `sc` (the seq-cst ordering). However -- and this is crucial -- this does not imply that operation 4 synchronizes-with operation 5. This is because a read never synchronizes-with a store, only the reverse. Hence, if the latch were using acq-rel orderings, it would be legal for operation 6 to return false. But if the latch were to use an sc ordering itself, then we know that operation 6 must return true, since `3 -sc-> 4 -sc-> 5 -sc-> 6`.
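For instance, a latch satisfying this requirement might look like the following sketch (rayon's actual latch types differ in detail):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// A minimal latch sketch using seq-cst orderings, per the argument above.
pub struct SpinLatch {
    set: AtomicBool,
}

impl SpinLatch {
    pub fn new() -> SpinLatch {
        SpinLatch { set: AtomicBool::new(false) }
    }

    // Set the latch. Seq-cst, so this store participates in the single
    // total order with operations 4 and 5 in the diagram.
    pub fn set(&self) {
        self.set.store(true, Ordering::SeqCst);
    }

    // Probe the latch. Also seq-cst: given 3 -sc-> 4 -sc-> 5 -sc-> 6,
    // this load (operation 6) must observe the store (operation 3).
    pub fn probe(&self) -> bool {
        self.set.load(Ordering::SeqCst)
    }
}
```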
Note that this means that, before we tickle, we must execute some seq-cst stores to publish our work (and during the scan we must load from those same locations) if we wish to guarantee that the work we published WILL be seen by the other threads (as opposed to may). This is true for setting a latch -- if a latch is set but another thread misses it, then the system could deadlock. However, in the case of pushing new work to a deque, we choose not to use a seqcst ordering. This is for several reasons:

- Missing this kind of wakeup only costs parallelism, not correctness: the new work remains in the deque, so some worker (at worst, the one that pushed it) will still execute it eventually.
- The deque does not use seqcst orderings internally; to get them, we would have to insert a dummy `AtomicBool` for each deque and do a seqcst write to it (with whatever value) after we push to the deque, and a seqcst load whenever we steal from the deque.

In both cases above, our problems arose because tickle is merely performing a seq-cst read. If we instead had tickle perform a release swap, that would be a write action on the global state. No matter the ordering mode, all writes to the same memory location have a total ordering, and hence we would not have to worry about others storing a value that we fail to read (as in scenario 1). Similarly, as a release write, a swap during tickle would synchronize-with a later cas and so scenario 2 should be averted. So you might wonder why we don't do that. The simple reason was that it didn't perform as well! In my measurements, many benchmarks were unaffected by using a swap, but some of them were hit hard.
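For comparison, the rejected swap-based tickle might look like this sketch (same illustrative names as before):

```rust
use std::sync::atomic::Ordering;

impl Sleep {
    // The alternative discussed above: tickle always performs a release
    // swap, making it a write to the global state. This avoids both
    // scenarios, but proved slower in benchmarks, so it is not used.
    pub fn tickle_with_swap(&self) {
        let old_state = self.state.swap(AWAKE, Ordering::Release);
        if old_state & SLEEPING != 0 {
            let _data = self.data.lock().unwrap();
            self.tickle.notify_all();
        }
    }
}
```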