| BOLT |
| ==== |
| |
| BOLT is a post-link optimizer developed to speed up large applications. |
| It achieves the improvements by optimizing the application’s code layout |
| based on an execution profile gathered by a sampling profiler, such as |
| the Linux ``perf`` tool. An overview of the ideas implemented in BOLT, |
| along with a discussion of its potential and current results, is |
| available in the `CGO’19 |
| paper <https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/>`__. |
| |
| Input Binary Requirements |
| ------------------------- |
| |
| BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the |
| binaries should have an unstripped symbol table, and, to get maximum |
| performance gains, they should be linked with relocations |
| (``--emit-relocs`` or ``-q`` linker flag). |
| |
| BOLT disassembles functions and reconstructs the control flow graph |
| (CFG) before it runs optimizations. Since this is a nontrivial task, |
| especially when indirect branches are present, we rely on certain |
| heuristics to accomplish it. These heuristics have been tested on code |
| generated by the Clang and GCC compilers. The main requirement for C/C++ |
| code is not to rely on code layout properties, such as function pointer |
| deltas. Assembly code can be processed too. Requirements for it include |
| a clear separation of code and data, with data objects being placed into |
| data sections/segments. If indirect jumps are used for intra-function |
| control transfer (e.g., jump tables), the code patterns should match |
| those generated by Clang/GCC. |
| |
| NOTE: BOLT is currently incompatible with the |
| ``-freorder-blocks-and-partition`` compiler option. Since GCC 8 enables |
| this option by default, you have to explicitly disable it by adding the |
| ``-fno-reorder-blocks-and-partition`` flag if you are compiling with |
| GCC 8 or above. |
| |
| NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM |
| and GCC compilers. It offers several benefits over the previous DWARF |
| v4. Support for v5 in BOLT is currently a work in progress. While |
| you will be able to optimize binaries produced by the latest compilers, |
| until the support is complete, you will not be able to update the debug |
| info with ``-update-debug-sections``. To temporarily work around the |
| issue, we recommend compiling binaries with the ``-gdwarf-4`` option, |
| which forces DWARF v4 output. |
| |
| PIE and .so support has been added recently. Please report bugs if you |
| encounter any issues. |
| |
| Installation |
| ------------ |
| |
| Docker Image |
| ~~~~~~~~~~~~ |
| |
| You can build and use the docker image containing BOLT using our `docker |
| file <utils/docker/Dockerfile>`__. Alternatively, you can build BOLT |
| manually using the steps below. |
| |
| Manual Build |
| ~~~~~~~~~~~~ |
| |
| BOLT heavily uses LLVM libraries, and by design, it is built as one of |
| the LLVM tools. The build process is not much different from a regular |
| LLVM build. The following instructions assume that you are running |
| under Linux. |
| |
| Start by cloning the LLVM repo: |
| |
| :: |
| |
| > git clone https://github.com/llvm/llvm-project.git |
| > mkdir build |
| > cd build |
| > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt" |
| > ninja bolt |
| |
| ``llvm-bolt`` will be available under ``bin/``. Add this directory to |
| your path to ensure the rest of the commands in this tutorial work. |
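
One way to add the directory to your path is sketched below. The helper
name ``prepend_path`` is hypothetical and not part of BOLT or LLVM; it
prepends a directory to ``PATH`` only if it is not already listed.

```shell
# Hypothetical helper (not part of BOLT): prepend a directory to PATH
# unless it is already listed there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                      # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

# Example, run from the build directory created above:
#   prepend_path "$PWD/bin"
```

The change only affects the current shell session; add the call to your
shell startup file if you want it to persist.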
| |
| Optimizing BOLT’s Performance |
| ----------------------------- |
| |
| BOLT runs many internal passes in parallel. If you foresee heavy usage |
| of BOLT, you can improve the processing time by linking against one of |
| the memory allocation libraries with good support for concurrency. For |
| example, to use jemalloc: |
| |
| :: |
| |
| > sudo yum install jemalloc-devel |
| > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt .... |
| |
| Or, if you would rather use tcmalloc: |
| |
| :: |
| |
| > sudo yum install gperftools-devel |
| > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt .... |
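
Library locations differ between distributions, so the paths above may
not exist on your system. As a rough sketch, the hypothetical helper
``first_existing`` (not part of BOLT) prints the first candidate path
that exists, so the preload is attempted only when a library was found.

```shell
# Hypothetical helper: print the first path that exists as a regular file.
first_existing() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1                              # none of the candidates exist
}

# Example (library paths are the ones used above; adjust for your distro):
#   lib=$(first_existing /usr/lib64/libjemalloc.so /usr/lib64/libtcmalloc_minimal.so) \
#       && LD_PRELOAD="$lib" llvm-bolt ....
```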
| |
| Usage |
| ----- |
| |
| For a complete practical guide to using BOLT, see `Optimizing Clang with |
| BOLT <docs/OptimizingClang.md>`__. |
| |
| Step 0 |
| ~~~~~~ |
| |
| In order to allow BOLT to re-arrange functions (in addition to |
| re-arranging code within functions) in your program, it needs a little |
| help from the linker. Add ``--emit-relocs`` to the final link step of |
| your application. You can verify the presence of relocations by checking |
| for the ``.rela.text`` section in the binary. BOLT will also report if it |
| detects relocations while processing the binary. |
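
The section check can be scripted. The sketch below assumes that
``readelf`` from binutils is installed; ``has_rela_text`` is a
hypothetical helper name that searches a section-header listing on
standard input for ``.rela.text``.

```shell
# Hypothetical helper: succeed if a section listing read from stdin
# mentions a .rela.text section.
has_rela_text() {
    grep -q -F '.rela.text'
}

# Example (assumes binutils readelf is installed):
#   readelf -S --wide <executable> | has_rela_text && echo "relocations present"
```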
| |
| Step 1: Collect Profile |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| This step is different for different kinds of executables. If you can |
| invoke your program to run on a representative input from a command |
| line, then check the **For Applications** section below. If your program |
| typically runs as a server/service, then skip to the **For Services** |
| section. |
| |
| The version of the ``perf`` command used for the following steps has to |
| support the ``-F brstack`` option. We recommend using ``perf`` version 4.5 |
| or later. |
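
A version check along these lines can guard a profiling script. This is
only a sketch: ``perf_version_ok`` is a hypothetical helper, it compares
the version number alone (it does not probe for ``brstack`` support),
and it assumes the version string starts with a numeric major component.

```shell
# Hypothetical helper: succeed if a "major.minor[...]" version string
# is 4.5 or later.
perf_version_ok() {
    major="${1%%.*}"            # text before the first dot
    rest="${1#*.}"              # text after the first dot
    minor="${rest%%.*}"         # text before the next dot
    minor="${minor%%[!0-9]*}"   # drop suffixes such as "-rc1"
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 5 ]; }
}

# Example (assumes `perf version` prints "perf version X.Y..."):
#   perf_version_ok "$(perf version | awk '{print $3}')" || echo "perf too old"
```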
| |
| For Applications |
| ^^^^^^^^^^^^^^^^ |
| |
| This assumes you can run your program from a command line with a typical |
| input. In this case, simply prepend the command line invocation with |
| ``perf``: |
| |
| :: |
| |
| $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ... |
| |
| For Services |
| ^^^^^^^^^^^^ |
| |
| Once you get the service deployed and warmed up, it is time to collect |
| perf data with LBR (branch information). The exact perf command to use |
| will depend on the service. E.g., to collect the data for all processes |
| running on the server for the next 3 minutes use: |
| |
| :: |
| |
| $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180 |
| |
| Depending on the application, you may need more samples to be included |
| with your profile. It’s hard to tell upfront what would be a sweet spot |
| for your application. We recommend that the profile cover 1B instructions, |
| as reported by the BOLT ``-dyno-stats`` option. If you need to increase |
| the number of samples in the profile, you can either run the ``sleep`` |
| command for longer or use the ``-F<N>`` option with ``perf`` to increase |
| the sampling frequency. |
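
Both knobs can be wrapped in a small command builder. The sketch below
only prints the command line rather than running it; ``perf_record_cmd``
is a hypothetical helper, and the duration and frequency values in the
example are illustrative, not recommendations.

```shell
# Hypothetical helper: print a system-wide `perf record` command line.
#   $1 = collection time in seconds (default 180)
#   $2 = optional sampling frequency, passed to perf as -F <N>
perf_record_cmd() {
    duration="${1:-180}"
    freq="${2:-}"
    set -- perf record -e cycles:u -j any,u -a -o perf.data
    if [ -n "$freq" ]; then
        set -- "$@" -F "$freq"
    fi
    set -- "$@" -- sleep "$duration"
    printf '%s\n' "$*"
}

# Example: collect for 5 minutes with an explicit sampling frequency.
#   perf_record_cmd 300 5000
```

Called without arguments, the helper prints the same command as the one
shown above for services.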
| |
| Note that for profile collection we recommend using cycle events and not |
| ``BR_INST_RETIRED.*``. Empirically we found it to produce better |
| results. |
| |
| If the collection of a profile with branches is not available, e.g., |
| when you run on a VM or on hardware that does not support it, then you |
| can use only sample events, such as cycles. In this case, the quality of |
| the profile information would not be as good, and performance gains with |
| BOLT are expected to be lower. |
| |
| With instrumentation |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| If ``perf record`` is not available to you, you may collect a profile by |
| first instrumenting the binary with BOLT and then running it. |
| |
| :: |
| |
| llvm-bolt <executable> -instrument -o <instrumented-executable> |
| |
| After you run ``<instrumented-executable>`` with the desired workload, its |
| BOLT profile should be ready for you in ``/tmp/prof.fdata`` and you can |
| skip **Step 2**. |
| |
| Run BOLT with the ``-help`` option and check the category “BOLT |
| instrumentation options” for a quick reference on instrumentation knobs. |
| |
| Step 2: Convert Profile to BOLT Format |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| NOTE: you can skip this step and feed ``perf.data`` directly to BOLT |
| using the experimental ``-p perf.data`` option. |
| |
| For this step, you will need the ``perf.data`` file collected in the |
| previous step and a copy of the binary that was running. The binary |
| must either be unstripped or at least have its symbol table intact |
| (i.e., running ``strip -g`` is okay). |
| |
| Make sure ``perf`` is in your ``PATH``, and execute ``perf2bolt``: |
| |
| :: |
| |
| $ perf2bolt -p perf.data -o perf.fdata <executable> |
| |
| This command will aggregate branch data from ``perf.data`` and store it |
| in a format that is both more compact and more resilient to binary |
| modifications. |
| |
| If the profile was collected without LBRs, you will need to add the |
| ``-nl`` flag to the command line above. |
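
The two variants of the conversion command can be sketched as a single
builder. ``perf2bolt_cmd`` and its ``lbr``/``nolbr`` mode names are
hypothetical; the helper only prints the command line, using the file
names from this tutorial.

```shell
# Hypothetical helper: print a perf2bolt command line.
#   $1 = executable, $2 = "lbr" (default) or "nolbr"
perf2bolt_cmd() {
    exe="$1"
    mode="${2:-lbr}"
    set -- perf2bolt -p perf.data -o perf.fdata
    if [ "$mode" = "nolbr" ]; then
        set -- "$@" -nl            # profile was collected without LBRs
    fi
    printf '%s\n' "$* $exe"
}

# Example:
#   perf2bolt_cmd ./app nolbr
```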
| |
| Step 3: Optimize with BOLT |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Once you have ``perf.fdata`` ready, you can use it for optimizations |
| with BOLT. Assuming your environment is set up to include the right path, |
| execute ``llvm-bolt``: |
| |
| :: |
| |
| $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats |
| |
| If you need updated debug info, then add the |
| ``-update-debug-sections`` option to the command above. The processing |
| time will be slightly longer. |
| |
| For a full list of options, see the ``-help``/``-help-hidden`` output. |
| |
| The input binary for this step does not have to be identical to the |
| binary used for profile collection in **Step 1**. This could happen when |
| you are doing active development, and the source code constantly changes, |
| yet you want to benefit from profile-guided optimizations. However, |
| since the binary is not precisely the same, the profile information |
| could become invalid or stale, and BOLT will report the number of |
| functions with a stale profile. The higher the number, the less |
| performance improvement should be expected. Thus, it is crucial to |
| update ``.fdata`` for release branches. |
| |
| Multiple Profiles |
| ----------------- |
| |
| Suppose your application can run in different modes, and you can |
| generate a profile for each one of them. To generate a single |
| binary that can benefit all modes (assuming the profiles don’t |
| contradict each other), you can use the ``merge-fdata`` tool: |
| |
| :: |
| |
| $ merge-fdata *.fdata > combined.fdata |
| |
| Use ``combined.fdata`` for **Step 3** above to generate a universally |
| optimized binary. |
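
One caveat with the glob: the shell expands ``*.fdata`` before it
applies the redirection, so a ``combined.fdata`` left over from an
earlier run in the same directory would also be passed as an input. A
minimal sketch of one way to avoid that follows; ``list_fdata_inputs``
is a hypothetical helper, not part of BOLT.

```shell
# Hypothetical helper: list the .fdata files in a directory, one per
# line, skipping the file that will receive the merged profile.
#   $1 = directory, $2 = output file name (e.g. combined.fdata)
list_fdata_inputs() {
    for f in "$1"/*.fdata; do
        if [ ! -f "$f" ]; then
            continue               # glob matched nothing
        fi
        if [ "${f##*/}" = "$2" ]; then
            continue               # skip a previous merge result
        fi
        printf '%s\n' "$f"
    done
}

# Example (assumes file names without whitespace):
#   merge-fdata $(list_fdata_inputs . combined.fdata) > combined.fdata
```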
| |
| License |
| ------- |
| |
| BOLT is licensed under the `Apache License v2.0 with LLVM |
| Exceptions <./LICENSE.TXT>`__. |