| ============================================= |
| Machine Learning - Guided Optimization (MLGO) |
| ============================================= |
| |
| Introduction |
| ============ |
| |
MLGO refers to integrating ML techniques into LLVM, primarily to replace
heuristics with machine-learned models.
| |
| Currently the following heuristics feature such integration: |
| |
| * Inlining for size |
| * Register allocation (LLVM greedy eviction heuristic) for performance |
| |
| This document is an outline of the tooling and APIs facilitating MLGO. |
| |
| Note that tools for orchestrating ML training are not part of LLVM, as they are |
| dependency-heavy - both on the ML infrastructure choice, as well as choices of |
| distributed computing. For the training scenario, LLVM only contains facilities |
| enabling it, such as corpus extraction, training data extraction, and evaluation |
| of models during training. |
| |
| |
| .. contents:: |
| |
| Corpus Tooling |
| ============== |
| |
Within the LLVM monorepo, there is the ``mlgo-utils`` python package, which
lives at ``llvm/utils/mlgo-utils``. This package primarily contains tooling
| for working with corpora, or collections of LLVM bitcode. We use these corpora |
| to train and evaluate ML models. Corpora consist of a description in JSON |
| format at ``corpus_description.json`` in the root of the corpus, and then |
| a bitcode file and command line flags file for each extracted module. The |
| corpus structure is designed to contain sufficient information to fully |
| compile the bitcode to bit-identical object files. |
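
For illustration, an extracted corpus might look roughly like the following
(an illustrative layout only; the exact file names and nesting mirror the
object file paths of the originating build)::

  corpus/
    corpus_description.json
    lib/foo.bc     # bitcode of one extracted module
    lib/foo.cmd    # command line flags for that module
    lib/bar.bc
    lib/bar.cmd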
| |
| .. program:: extract_ir.py |
| |
| Synopsis |
| -------- |
| |
| Extracts a corpus from some form of a structured compilation database. This |
| tool supports a variety of different scenarios and input types. |
| |
| Options |
| ------- |
| |
| .. option:: --input |
| |
  The path to the input. This should be a path to a supported structured
  compilation database. Currently, ``compile_commands.json`` files, linker
  parameter files, directories containing object files (for the local
  ThinLTO case only), and JSON files containing a bazel aquery result are
  supported.
| |
| .. option:: --input_type |
| |
| The type of input that has been passed to the ``--input`` flag. |
| |
| .. option:: --output_dir |
| |
| The output directory to place the corpus in. |
| |
| .. option:: --num_workers |
| |
| The number of workers to use for extracting bitcode into the corpus. This |
| defaults to the number of hardware threads available on the host system. |
| |
| .. option:: --llvm_objcopy_path |
| |
| The path to the llvm-objcopy binary to use when extracting bitcode. |
| |
| .. option:: --obj_base_dir |
| |
| The base directory for object files. Bitcode files that get extracted into |
| the corpus will be placed into the output directory based on where their |
| source object files are placed relative to this path. |
| |
| .. option:: --cmd_filter |
| |
  Allows filtering of modules by command line. If set, only modules that match
  the filter will be extracted into the corpus. Regular expressions are
  supported in some instances.
| |
| .. option:: --thinlto_build |
| |
| If the build was performed with ThinLTO, this should be set to either |
| ``distributed`` or ``local`` depending upon how the build was performed. |
| |
| .. option:: --cmd_section_name |
| |
| This flag allows specifying the command line section name. This is needed |
| on non-ELF platforms where the section name might differ. |
| |
| .. option:: --bitcode_section_name |
| |
| This flag allows specifying the bitcode section name. This is needed on |
| non-ELF platforms where the section name might differ. |
| |
| Example: CMake |
| -------------- |
| |
CMake can output a ``compile_commands.json`` compilation database if the
``CMAKE_EXPORT_COMPILE_COMMANDS`` switch is turned on when configuring the
build. It is
| also necessary to enable bitcode embedding (done by passing |
| ``-Xclang -fembed-bitcode=all`` to all C/C++ compilation actions in the |
| non-ThinLTO case). For example, to extract a corpus from clang, you would |
| run the following commands (assuming that the system C/C++ compiler is clang): |
| |
| .. code-block:: bash |
| |
| cmake -GNinja \ |
| -DCMAKE_BUILD_TYPE=Release \ |
| -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \ |
| -DCMAKE_C_FLAGS="-Xclang -fembed-bitcode=all" \ |
    -DCMAKE_CXX_FLAGS="-Xclang -fembed-bitcode=all" \
| ../llvm |
| ninja |
| |
| After running CMake and building the project, there should be a |
``compile_commands.json`` file within the build directory. You can then
| run the following command to create a corpus: |
| |
| .. code-block:: bash |
| |
| python3 ./extract_ir.py \ |
| --input=./build/compile_commands.json \ |
| --input_type=json \ |
| --output_dir=./corpus |
| |
| After running the above command, there should be a full |
| corpus of bitcode within the ``./corpus`` directory. |
| |
| Example: Bazel Aquery |
| --------------------- |
| |
| This tool also supports extracting bitcode from bazel in multiple ways |
| depending upon the exact configuration. For ThinLTO, a linker parameters file |
| is preferred. For the non-ThinLTO case, the script will accept the output of |
| ``bazel aquery`` which it will use to find all the object files that are linked |
| into a specific target and then extract bitcode from them. First, you need |
| to generate the aquery output: |
| |
| .. code-block:: bash |
| |
| bazel aquery --output=jsonproto //path/to:target > /path/to/aquery.json |
| |
| Afterwards, assuming that the build is already complete, you can run this |
| script to create a corpus: |
| |
| .. code-block:: bash |
| |
| python3 ./extract_ir.py \ |
| --input=/path/to/aquery.json \ |
    --input_type=bazel_aquery \
| --output_dir=./corpus \ |
| --obj_base_dir=./bazel-bin |
| |
This will again leave a corpus that contains all the bitcode files. Note,
however, that this mode does not capture all object files in the build, only
those involved in the link of the binary passed to the ``bazel aquery``
invocation.
| |
| .. program:: make_corpus.py |
| |
| Synopsis |
| -------- |
| |
| Creates a corpus from a collection of bitcode files. |
| |
| Options |
| ------- |
| |
| .. option:: --input_dir |
| |
| The input directory to search for bitcode files in. |
| |
| .. option:: --output_dir |
| |
| The output directory to place the constructed corpus in. |
| |
| .. option:: --default_args |
| |
| A list of space separated flags that are put into the corpus description. |
| These are used by some tooling when compiling the modules within the corpus. |
| |
| .. program:: combine_training_corpus.py |
| |
| Synopsis |
| -------- |
| |
| Combines two training corpora that share the same parent folder by generating |
| a new ``corpus_description.json`` that contains all the modules in both corpora. |
| |
| Options |
| ------- |
| |
| .. option:: --root_dir |
| |
| The root directory that contains subfolders consisting of the corpora that |
| should be combined. |
| |
| Interacting with ML models |
| ========================== |
| |
We interact with ML models in two primary scenarios. One is training such a
model. The other, inference, is using a trained model during compilation to
make optimization decisions.
| |
For a specific optimization problem - e.g. inlining, or regalloc eviction - we
first separate correctness-preserving decisions from optimization decisions.
Not inlining functions marked "no inline" is an example of the former; so is
not evicting an unevictable live range. An example of the latter is deciding to
inline a function that will bloat the caller size, just because we have reason
to believe that, later, the effect will be some constant propagation that
actually reduces the size (or dynamic instruction count).
| |
| ML models can be understood as functions. Their inputs are tensors - buffers of |
| scalars. The output (in our case, singular) is a scalar. For example, for |
| inlining, the inputs are properties of the caller, callee, and the callsite |
| being analyzed for inlining. The output is a boolean. |
| |
Inputs and outputs are named, have a scalar type (e.g. ``int32_t``) and a shape
(e.g. 3x4). These are the elements we use to bind to an ML model.
| |
| In both training and inference, we want to expose to ML (training algorithms or |
| trained model, respectively) the features we want to make optimization |
| decisions on. In that regard, the interface from the compiler side to the ML |
| side is the same: pass features, and get a decision. It's essentially a function |
| call, where the parameters and result are bound by name and are described by |
| name, scalar type, and shape tuples. |
| |
| The main types in LLVM are: |
| |
| - ``MLModelRunner`` - an abstraction for the decision making mechanism |
| - ``TensorSpec`` which describes a tensor. |
| |
| TensorSpec |
| ---------- |
| |
| See ``llvm/Analysis/TensorSpec.h``. This is a simple data bag, identifying a |
| tensor by name (a string), scalar type, and shape (a vector of ints). The scalar |
| type can only be int (8, 16, 32, or 64), signed or unsigned; float; or double. |
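
For illustration, here is a minimal sketch of declaring a few tensor
specifications, assuming the ``createSpec<T>`` factory declared in
``llvm/Analysis/TensorSpec.h``; the feature names and shapes below are made up
for this example:

.. code-block:: c++

  #include "llvm/Analysis/TensorSpec.h"

  #include <cstdint>
  #include <vector>

  // Hypothetical input features: a single-element scalar and a 3x4 tensor.
  std::vector<llvm::TensorSpec> Inputs{
      llvm::TensorSpec::createSpec<int64_t>("callee_basic_block_count", {1}),
      llvm::TensorSpec::createSpec<int64_t>("edge_features", {3, 4})};

  // The (singular) output: a scalar decision.
  llvm::TensorSpec Output =
      llvm::TensorSpec::createSpec<int64_t>("decision", {1});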
| |
| MLModelRunner |
| ------------- |
| |
See ``llvm/Analysis/MLModelRunner.h``. The abstraction has a pure virtual method,
| ``evaluateUntyped``, but the contract with implementers is a bit more involved: |
| |
| Implementers |
| ^^^^^^^^^^^^ |
| |
| At construction, the implementer is expected to receive a list of ``TensorSpec`` |
| for input features and the ``TensorSpec`` of the output (e.g. |
| ``std::vector<TensorSpec>``). The list type is not contractual, but it must be |
a 0-based indexed, array-like container. Given a ``TensorSpec`` at index "I" in
the input list, that has a name "N", shape "D1 x D2 x ... x Dn", and scalar type
| "T", the implementer must: |
| |
| - set up a contiguous buffer sized ``sizeof(T) * D1 * D2 * ... * Dn``. This |
| buffer's lifetime must be the same as the lifetime of the implementer object. |
| - call ``MLModelRunner::setUpBufferForTensor`` passing I, the ``TensorSpec``, |
| and the buffer above. |
| |
| Internally, the expectation is that the implementer uses the name (and maybe |
| shape) of a ``TensorSpec`` for binding (e.g. lookup in an underlying ML model). |
| |
| ``MLModelRunner::setUpBufferForTensor`` stores each buffer at the corresponding |
| index (i.e. its position in the list used at construction). The expectation is |
| that the user will use that position when calling ``MLModelRunner::getTensor`` |
| to retrieve the underlying buffer (more on that in a bit). |
| |
The implementation of ``evaluateUntyped`` is expected to use the values in the
buffers described above, carry out whatever computation is needed (e.g.
evaluate an ML
| model) and then place the outcome in an output buffer which will be returned to |
| the caller. Importantly, ``evaluateUntyped`` must not reset the input buffers. |
| This is because during training we may want to log the features and decisions, |
| and since the data is already buffered, there's no reason to force backing it |
| up elsewhere. |
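
To make this contract concrete, a hypothetical implementer might look roughly
like the sketch below. This is illustrative only: the base-class constructor
arguments, the ``Kind::NoOp`` value, and ``TensorSpec::getTotalTensorBufferSize``
(assumed to compute ``sizeof(T) * D1 * ... * Dn``) should all be checked against
the actual headers. The "model" here just returns a constant.

.. code-block:: c++

  #include "llvm/Analysis/MLModelRunner.h"
  #include "llvm/Analysis/TensorSpec.h"
  #include "llvm/IR/LLVMContext.h"

  #include <cstdint>
  #include <vector>

  // Hypothetical implementer, for illustration only.
  class ConstantAnswerRunner : public llvm::MLModelRunner {
  public:
    ConstantAnswerRunner(llvm::LLVMContext &Ctx,
                         const std::vector<llvm::TensorSpec> &Inputs)
        : llvm::MLModelRunner(Ctx, llvm::MLModelRunner::Kind::NoOp,
                              Inputs.size()) {
      for (size_t I = 0; I < Inputs.size(); ++I) {
        // One contiguous buffer per input, sized from its TensorSpec. The
        // buffers are owned by this object, so their lifetimes match its own.
        OwnedStorage.emplace_back(Inputs[I].getTotalTensorBufferSize());
        setUpBufferForTensor(I, Inputs[I], OwnedStorage.back().data());
        // For a feature the underlying model does not know about, we could
        // instead pass nullptr as the buffer (see Versioning below).
      }
    }

  private:
    // A real implementation would evaluate a model over the input buffers;
    // this sketch hands back a fixed answer. Input buffers are not reset.
    void *evaluateUntyped() override { return &Result; }

    std::vector<std::vector<char>> OwnedStorage;
    int64_t Result = 1;
  };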
| |
| Users |
| ^^^^^ |
| |
| The users must pass the input ``TensorSpec`` list at the construction of a |
| specific ``MLModelRunner`` object. After that, users can be agnostic of the |
specific implementation, and would typically follow this workflow (sketched in
code after the list):
| |
| - call ``getTensor`` or ``getTensorUntyped``, for each input tensor, identified |
| by its index (i.e. the index of the corresponding ``TensorSpec`` in the list |
| used at construction). |
- populate the tensor buffer of each input tensor with values. Users can take
  advantage of the stability of the tensor buffers, e.g. by setting only once
  those that don't change, or by caching the buffer address.
| - call ``evaluate`` and use its result. |
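
A minimal sketch of that workflow, assuming ``Runner`` is an already-constructed
``MLModelRunner`` whose input at index 0 was declared as a single-element
``int64_t`` tensor (the feature index, its meaning, and the interpretation of
the output are all assumptions made for this example):

.. code-block:: c++

  #include "llvm/Analysis/MLModelRunner.h"

  #include <cstdint>

  bool shouldOptimize(llvm::MLModelRunner &Runner, int64_t FeatureValue) {
    // The buffer for input 0 is stable across evaluations, so its address
    // could also be cached by the caller.
    *Runner.getTensor<int64_t>(0) = FeatureValue;
    // Evaluate the model and interpret its scalar output as a yes/no decision.
    return Runner.evaluate<int64_t>() > 0;
  }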
| |
| Versioning |
| ^^^^^^^^^^ |
| |
We support a model "knowing" fewer inputs than the compiler. This is supported by
| ``MLModelRunner::setUpBufferForTensor``. If a ``TensorSpec`` requested by the |
| compiler is not supported by the underlying model, the ``MLModelRunner`` |
| implementer must still call ``setUpBufferForTensor`` with a ``nullptr`` value |
for the buffer. In turn, ``MLModelRunner`` will allocate an appropriately-sized
| buffer and track its lifetime. The user can safely populate that buffer. Since |
| the rest of the inputs are still provided, this allows an evolution model where |
| we first add features to the compiler and continue using older models without |
regressing. Then, the new compiler can be used to train new models. Deprecating
features in the compiler then involves first training a model without those
features.
| |
| ``MLModelRunner`` implementations |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| We currently feature 3 implementations: |
| |
| - ``ModelUnderTrainingRunner``. This requires the compiler be built with TFLite |
| support. It allows loading a TFLite model dynamically and is primarily |
| intended for training scenarios, but it can be used relatively easily in |
| production build environments, as it does not change how the compiler operates |
| (why this remark is necessary will become clear in a few paragraphs) |
| |
| - ``ReleaseModeModelRunner``. This is intended for inference scenarios. This |
| uses the rules defined in ``llvm/cmake/modules/TensorFlowCompile.cmake`` to |
| convert, at the time the compiler is built, TensorFlow Saved Models into a |
| header (.h) and native object (.o). The latter is a CPU-based implementation of |
| the neural network, together with its weights (essentially, loops performing |
| matrix multiplications) |
| |
| NOTE: we are actively working on replacing this with an EmitC implementation |
| requiring no out of tree build-time dependencies. |
| |
| - ``InteractiveModelRunner``. This is intended for training scenarios where the |
| training algorithm drives compilation. This model runner has no special |
| dependencies, and relies on I/O pipes to communicate with a separate process, |
| presumably a python training algorithm. We do not envision using this in a |
| production environment. |
| |
| Note that training leaves it to the training infrastructure to handle |
| distributed computing. The assumed architecture has python processes |
communicating remotely among themselves, while managing local communication with
| clang. |
| |
| .. |
| TODO(mtrofin): |
| - logging, and the use in interactive mode. |
| - discuss an example (like the inliner) |