Refactor Thread internals for clarity and efficiency.

On the clarity side, the thread main loop is now just:

    while (GetNewStateOtherThanReady() == State::HasWork) {
      RevertToReadyState();
    }
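
For context, here is a minimal sketch of what those two calls might
look like, assuming the names used in the notes below (State, state_,
task_, state_cond_, state_cond_mutex_). It illustrates the pattern, not
the actual ruy implementation; in particular, the real code also
signals task completion back to the outside thread, which is omitted
here.

    #include <atomic>
    #include <condition_variable>
    #include <functional>
    #include <mutex>

    // Hypothetical sketch, not the actual ruy code.
    enum class State { Ready, HasWork, ExitAsSoonAsPossible };

    class Thread {
     public:
      // Called only from the worker thread: block until the outside
      // thread moves state_ away from Ready, then return the new state.
      State GetNewStateOtherThanReady() {
        // The mutex is needed only around the condition-variable wait;
        // the predicate re-checks state_ under it, so no wakeup is lost.
        std::unique_lock<std::mutex> lock(state_cond_mutex_);
        state_cond_.wait(lock, [this] {
          // Acquire pairs with the outside thread's release-store of
          // state_, publishing the task_ written before that store.
          return state_.load(std::memory_order_acquire) != State::Ready;
        });
        // Already ordered by the acquire-load above: relaxed suffices.
        return state_.load(std::memory_order_relaxed);
      }

      // Called only from the worker thread: run the pending task, then
      // go back to Ready.
      void RevertToReadyState() {
        if (task_) task_();  // plain access, safe thanks to the pair above
        task_ = nullptr;
        state_.store(State::Ready, std::memory_order_release);
      }

     private:
      std::function<void()> task_;  // published via release-acquire on state_
      std::atomic<State> state_{State::Ready};
      std::condition_variable state_cond_;
      std::mutex state_cond_mutex_;
    };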

On the efficiency side:
* Locking and atomic ops have been reduced. We used to lock state_mutex_
  around the entire thread task execution; now it is only locked around
  notify/wait on the state_cond_ condition_variable, so this mutex is
  renamed state_cond_mutex_, which clarifies its purpose (see the sketch
  after this list).
* A redundant acquire-reload of the new state_ in the thread main loop
  has been removed.
* Some accesses are demoted to relaxed because they are already ordered
  by other release-acquire relationships.
* A notify_all becomes notify_one.
* Send all thread exit requests upfront so threads can exit in parallel.
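
A standalone sketch of the outside-thread side of the same pattern,
again with hypothetical names, showing the narrowed locking and the
upfront exit requests:

    #include <atomic>
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Hypothetical sketch, not the actual ruy code.
    enum class State { Ready, HasWork, ExitAsSoonAsPossible };

    struct WorkerSlot {
      std::function<void()> task;  // written before the release-store below
      std::atomic<State> state{State::Ready};
      std::condition_variable state_cond;
      std::mutex state_cond_mutex;
      std::thread thread;  // runs the worker main loop shown above
    };

    // Called only from the outside thread.
    void ChangeState(WorkerSlot& w, State new_state) {
      // Release-store: everything written before this (notably w.task)
      // becomes visible to the worker after its acquire-load of state.
      w.state.store(new_state, std::memory_order_release);
      // The lock is held only around the notify. Because the worker
      // checks its wait predicate under the same mutex, this cannot
      // race into a lost wakeup.
      std::lock_guard<std::mutex> lock(w.state_cond_mutex);
      w.state_cond.notify_one();  // one waiter per thread: notify_one
    }

    // Send all exit requests upfront, then join, so the workers wind
    // down in parallel rather than one at a time.
    void ExitAllThreads(std::vector<WorkerSlot>& workers) {
      for (WorkerSlot& w : workers) {
        ChangeState(w, State::ExitAsSoonAsPossible);
      }
      for (WorkerSlot& w : workers) w.thread.join();
    }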

A comment is added on Thread::task_ to explain the release-acquire
relationships making this all work.
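
The gist of that comment, paraphrased here in hypothetical wording:

    // task_ is written by the outside thread before its release-store
    // of state_ (to HasWork), and read by the worker thread after its
    // acquire-load of state_. That release-acquire pair is what makes
    // this plain, non-atomic member safe to access without any mutex.
    std::function<void()> task_;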

Internal code is broken into functions that are only ever called from
the main thread, and functions that are only ever called from the
worker thread. That specialization made further simplifications and
performance gains obvious.

It was found by continuous integration that some TFLite users construct
and destroy the context from two different threads, due to the use of
reference counting. This means that the notion of a "main thread" is
not well-defined. Accordingly, instances of "main thread" in comments
and identifiers have been rephrased as "outside thread", as opposed to
the worker thread.

Tested with TSan (also enabled on presubmits), so we are fairly
confident that this is correct.

PiperOrigin-RevId: 442697771

README.md

The ruy matrix multiplication library

This is not an officially supported Google product.

ruy is a matrix multiplication library. Its focus is to cover the matrix multiplication needs of neural network inference engines. Its initial user has been TensorFlow Lite, where it is used by default on the ARM CPU architecture.

ruy supports both floating-point and 8-bit-integer-quantized matrices.
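
For a flavor of the API, here is a minimal float multiplication
following the pattern of the code in the example/ directory (which is
the authoritative, up-to-date reference):

    #include <iostream>

    #include "ruy/ruy.h"

    int main() {
      // 2x2 matrices, stored as plain arrays.
      const float lhs_data[] = {1, 2, 3, 4};
      const float rhs_data[] = {1, 2, 3, 4};
      float dst_data[4];

      ruy::Context context;

      ruy::Matrix<float> lhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor,
                            lhs.mutable_layout());
      lhs.set_data(lhs_data);

      ruy::Matrix<float> rhs;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor,
                            rhs.mutable_layout());
      rhs.set_data(rhs_data);

      ruy::Matrix<float> dst;
      ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor,
                            dst.mutable_layout());
      dst.set_data(dst_data);

      // Plain float multiplication: default MulParams, no quantization.
      ruy::MulParams<float, float> mul_params;
      ruy::Mul(lhs, rhs, mul_params, &context, &dst);

      std::cout << "dst[0] = " << dst_data[0] << "\n";
      return 0;
    }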

Efficiency

ruy is designed to achieve high performance not just on very large sizes, as is the focus of many established libraries, but on whatever are the actual sizes and shapes of matrices most critical in current TensorFlow Lite applications. This often means quite small sizes, e.g. 100x100 or even 50x50, and all sorts of rectangular shapes. It's not as fast as completely specialized code for each shape, but it aims to offer a good compromise of speed across all shapes and a small binary size.

Documentation

Some documentation will eventually be available in the doc/ directory; see doc/README.md.