Refactor Thread internals for clarity and efficiency.

On the clarity side, the thread main loop is now just:

    while (GetNewStateOtherThanReady() == State::HasWork) {
      RevertToReadyState();
    }
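
For context, here is a minimal sketch of what those two calls might
look like, assuming the names used in the notes below (State, state_,
task_, state_cond_, state_cond_mutex_). It illustrates the pattern, not
the actual ruy implementation; in particular, the real code also
signals task completion back to the outside thread, which is omitted
here.

    #include <atomic>
    #include <condition_variable>
    #include <functional>
    #include <mutex>

    // Hypothetical sketch, not the actual ruy code.
    enum class State { Ready, HasWork, ExitAsSoonAsPossible };

    class Thread {
     public:
      // Called only from the worker thread: block until the outside
      // thread moves state_ away from Ready, then return the new state.
      State GetNewStateOtherThanReady() {
        // The mutex is needed only around the condition-variable wait;
        // the predicate re-checks state_ under it, so no wakeup is lost.
        std::unique_lock<std::mutex> lock(state_cond_mutex_);
        state_cond_.wait(lock, [this] {
          // Acquire pairs with the outside thread's release-store of
          // state_, publishing the task_ written before that store.
          return state_.load(std::memory_order_acquire) != State::Ready;
        });
        // Already ordered by the acquire-load above: relaxed suffices.
        return state_.load(std::memory_order_relaxed);
      }

      // Called only from the worker thread: run the pending task, then
      // go back to Ready.
      void RevertToReadyState() {
        if (task_) task_();  // plain access, safe thanks to the pair above
        task_ = nullptr;
        state_.store(State::Ready, std::memory_order_release);
      }

     private:
      std::function<void()> task_;  // published via release-acquire on state_
      std::atomic<State> state_{State::Ready};
      std::condition_variable state_cond_;
      std::mutex state_cond_mutex_;
    };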

On the efficiency side:
* Locking and atomic ops have been reduced. We used to lock state_mutex_
  around the entire thread task execution; now it is only locked around
  notify/wait on the state_cond_ condition_variable, so this mutex is
  renamed state_cond_mutex_, which clarifies its purpose (see the sketch
  after this list).
* A redundant acquire-reload of the new state_ in the thread main loop
  has been removed.
* Some accesses are demoted to relaxed because they are already ordered
  by other release-acquire relationships.
* A notify_all becomes notify_one.
* Send all thread exit requests upfront so threads can exit in parallel.
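
A standalone sketch of the outside-thread side of the same pattern,
again with hypothetical names, showing the narrowed locking and the
upfront exit requests:

    #include <atomic>
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Hypothetical sketch, not the actual ruy code.
    enum class State { Ready, HasWork, ExitAsSoonAsPossible };

    struct WorkerSlot {
      std::function<void()> task;  // written before the release-store below
      std::atomic<State> state{State::Ready};
      std::condition_variable state_cond;
      std::mutex state_cond_mutex;
      std::thread thread;  // runs the worker main loop shown above
    };

    // Called only from the outside thread.
    void ChangeState(WorkerSlot& w, State new_state) {
      // Release-store: everything written before this (notably w.task)
      // becomes visible to the worker after its acquire-load of state.
      w.state.store(new_state, std::memory_order_release);
      // The lock is held only around the notify. Because the worker
      // checks its wait predicate under the same mutex, this cannot
      // race into a lost wakeup.
      std::lock_guard<std::mutex> lock(w.state_cond_mutex);
      w.state_cond.notify_one();  // one waiter per thread: notify_one
    }

    // Send all exit requests upfront, then join, so the workers wind
    // down in parallel rather than one at a time.
    void ExitAllThreads(std::vector<WorkerSlot>& workers) {
      for (WorkerSlot& w : workers) {
        ChangeState(w, State::ExitAsSoonAsPossible);
      }
      for (WorkerSlot& w : workers) w.thread.join();
    }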

A comment is added on Thread::task_ to explain the release-acquire
relationships making this all work.
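
The gist of that comment, paraphrased here in hypothetical wording:

    // task_ is written by the outside thread before its release-store
    // of state_ (to HasWork), and read by the worker thread after its
    // acquire-load of state_. That release-acquire pair is what makes
    // this plain, non-atomic member safe to access without any mutex.
    std::function<void()> task_;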

Internal code is broken into functions that are only ever called from
the main thread, and functions that are only ever called from the
worker thread. That specialization made further simplifications and
performance gains obvious.

It was found by continuous integration that some TFLite users construct
and destroy the context from two different threads, due to the use of
reference counting. This means that the notion of a "main thread" is
not well-defined. Accordingly, instances of "main thread" in comments
and identifiers have been rephrased as "outside thread", as opposed to
the worker thread.

Tested with TSan (also enabled on presubmits), so we are fairly
confident that this is correct.

PiperOrigin-RevId: 442697771

README.md

The ruy matrix multiplication library

This is not an officially supported Google product.

ruy is a matrix multiplication library. Its focus is to cover the matrix multiplication needs of neural network inference engines. Its initial user has been TensorFlow Lite, where it is used by default on the ARM CPU architecture.

ruy supports both floating-point and 8-bit-integer-quantized matrices.
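
For a flavor of the API, here is a minimal float multiplication
following the pattern of the code in the example/ directory (which is
the authoritative, up-to-date reference):

    #include <iostream>

    #include "ruy/ruy.h"

    int main() {
      // 2x2 matrices, stored as plain arrays.
      const float lhs_data[] = {1, 2, 3, 4};
      const float rhs_data[] = {1, 2, 3, 4};
      float dst_data[4];

      ruy::Context context;

      ruy::Matrix<float> lhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor,
                            lhs.mutable_layout());
      lhs.set_data(lhs_data);

      ruy::Matrix<float> rhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor,
                            rhs.mutable_layout());
      rhs.set_data(rhs_data);

      ruy::Matrix<float> dst;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor,
                            dst.mutable_layout());
      dst.set_data(dst_data);

      // Plain float multiplication: default MulParams, no quantization.
      ruy::MulParams<float, float> mul_params;
      ruy::Mul(lhs, rhs, mul_params, &context, &dst);

      std::cout << "dst[0] = " << dst_data[0] << "\n";
      return 0;
    }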

Efficiency

ruy is designed to achieve high performance not just on very large sizes, as is the focus of many established libraries, but on whatever are the actual sizes and shapes of matrices most critical in current TensorFlow Lite applications. This often means quite small sizes, e.g. 100x100 or even 50x50, and all sorts of rectangular shapes. It's not as fast as completely specialized code for each shape, but it aims to offer a good compromise of speed across all shapes and a small binary size.

Documentation

Some documentation will eventually be available in the doc/ directory; see doc/README.md.