 BOLT
 ====

 BOLT is a post-link optimizer developed to speed up large applications.
 It achieves the improvements by optimizing an application’s code layout
 based on an execution profile gathered by a sampling profiler, such as
 the Linux ``perf`` tool. An overview of the ideas implemented in BOLT,
 along with a discussion of its potential and current results, is
 available in the `CGO’19
 paper <https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/>`__.

 Input Binary Requirements
 -------------------------

 BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the
 binaries should have an unstripped symbol table, and, to get maximum
 performance gains, they should be linked with relocations
 (the ``--emit-relocs`` or ``-q`` linker flag).

 BOLT disassembles functions and reconstructs the control flow graph
 (CFG) before it runs optimizations. Since this is a nontrivial task,
 especially when indirect branches are present, we rely on certain
 heuristics to accomplish it. These heuristics have been tested on code
 generated by the Clang and GCC compilers. The main requirement for C/C++
 code is that it must not rely on code layout properties, such as
 function pointer deltas. Assembly code can be processed too. It requires
 a clear separation of code and data, with data objects placed into
 data sections/segments. If indirect jumps are used for intra-function
 control transfer (e.g., jump tables), the code patterns should match
 those generated by Clang/GCC.

 NOTE: BOLT is currently incompatible with the
 ``-freorder-blocks-and-partition`` compiler option. Since GCC 8 enables
 this option by default, you have to explicitly disable it by adding
 the ``-fno-reorder-blocks-and-partition`` flag if you are compiling
 with GCC 8 or above.
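
 The rule in the note above can be folded into a build script. The
 sketch below is illustrative only: ``bolt_extra_cflags`` is a made-up
 helper name, and the caller is assumed to already know the compiler
 family and its major version. Only the flag itself comes from the note.

```shell
# Print the extra compiler flag needed for BOLT, if any.
# bolt_extra_cflags is a hypothetical helper, not part of BOLT or GCC.
bolt_extra_cflags() {
    compiler="$1"   # compiler family: "gcc" or "clang"
    major="$2"      # major version number, e.g. 8
    if [ "$compiler" = "gcc" ] && [ "$major" -ge 8 ]; then
        echo "-fno-reorder-blocks-and-partition"
    fi
}

# Example use in a build script:
#   CFLAGS="$CFLAGS $(bolt_extra_cflags gcc 12)"
```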

 NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM
 and GCC compilers. It offers several benefits over the previous DWARF
 v4. BOLT’s support for v5 is currently a work in progress. You can
 still optimize binaries produced by the latest compilers, but until the
 support is complete, you will not be able to update the debug info with
 ``-update-debug-sections``. As a temporary workaround, we recommend
 compiling binaries with the ``-gdwarf-4`` option, which forces DWARF v4
 output.

 PIE and .so support has been added recently. Please report bugs if you
 encounter any issues.

 Installation
 ------------

 Docker Image
 ~~~~~~~~~~~~

 You can build and use the Docker image containing BOLT using our
 `Dockerfile <utils/docker/Dockerfile>`__. Alternatively, you can build
 BOLT manually using the steps below.

 Manual Build
 ~~~~~~~~~~~~

 BOLT heavily uses LLVM libraries, and by design, it is built as one of
 the LLVM tools. The build process is not much different from a regular
 LLVM build. The following instructions assume that you are running
 Linux.

 Start by cloning the LLVM repository, then configure and build BOLT:

 ::

     > git clone https://github.com/llvm/llvm-project.git
     > mkdir build
     > cd build
     > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
     > ninja bolt

 ``llvm-bolt`` will be available under ``bin/``. Add this directory to
 your ``PATH`` to ensure the rest of the commands in this tutorial work.
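
 For example, the directory can be prepended for the current shell
 session as follows. ``BOLT_BUILD_DIR`` is an assumed variable name
 introduced here for illustration; point it at the ``build`` directory
 created above.

```shell
# BOLT_BUILD_DIR is an assumed variable naming the CMake build directory;
# it defaults to ./build relative to the current directory.
BOLT_BUILD_DIR="${BOLT_BUILD_DIR:-$PWD/build}"

# Prepend bin/ so that llvm-bolt from this build is found first.
PATH="$BOLT_BUILD_DIR/bin:$PATH"
export PATH
```

 The change lasts only for the current session; add the same lines to
 your shell startup file to make it permanent.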

 Optimizing BOLT’s Performance
 -----------------------------

 BOLT runs many internal passes in parallel. If you foresee heavy usage
 of BOLT, you can improve the processing time by preloading (or linking
 against) one of the memory allocation libraries with good support for
 concurrency. E.g., to use jemalloc:

 ::

     > sudo yum install jemalloc-devel
     > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....

 Or, if you would rather use tcmalloc:

 ::

     > sudo yum install gperftools-devel
     > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
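
 Library locations differ between distributions, so a wrapper script may
 want to preload an allocator only when one is actually installed. The
 following is a minimal sketch under that assumption; ``first_existing``
 and ``ALLOC_LIB`` are names invented for this example, and the two
 paths are simply the ones used above.

```shell
# Print the first argument that names an existing file; print nothing if
# none exist. first_existing is a hypothetical helper name.
first_existing() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 0
}

# Candidate paths are taken from the commands above and may differ on
# your system.
ALLOC_LIB="$(first_existing /usr/lib64/libjemalloc.so /usr/lib64/libtcmalloc_minimal.so)"

# Usage: preload the allocator only when one was found.
#   if [ -n "$ALLOC_LIB" ]; then
#       LD_PRELOAD="$ALLOC_LIB" llvm-bolt ....
#   else
#       llvm-bolt ....
#   fi
```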

 Usage
 -----

 For a complete practical guide to using BOLT, see `Optimizing Clang with
 BOLT <docs/OptimizingClang.md>`__.

 Step 0
 ~~~~~~

 To allow BOLT to rearrange functions (in addition to rearranging code
 within functions) in your program, it needs a little help from the
 linker. Add ``--emit-relocs`` to the final link step of your
 application. You can verify the presence of relocations by checking
 for the ``.rela.text`` section in the binary. BOLT will also report if
 it detects relocations while processing the binary.
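
 One way to script this check is to search the section-header listing
 for ``.rela.text``. The sketch below assumes ``readelf`` from binutils
 is installed; ``has_rela_text`` is a helper name made up for this
 example and is not part of BOLT.

```shell
# Succeed (exit status 0) if the section listing read from stdin
# mentions .rela.text. has_rela_text is a hypothetical helper name.
has_rela_text() {
    grep -q -F '.rela.text'
}

# Typical use (needs readelf and a real binary):
#   readelf -S --wide <executable> | has_rela_text && echo "relocations present"
```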

 Step 1: Collect Profile
 ~~~~~~~~~~~~~~~~~~~~~~~

 This step differs depending on the kind of executable. If you can
 invoke your program on a representative input from the command line,
 see the **For Applications** section below. If your program typically
 runs as a server/service, skip to the **For Services** section.

 The version of the ``perf`` command used for the following steps must
 support the ``-F brstack`` option. We recommend using ``perf`` version
 4.5 or later.

 For Applications
 ^^^^^^^^^^^^^^^^

 This assumes you can run your program from the command line with a
 typical input. In this case, simply prepend the command-line invocation
 with a ``perf record`` command:

 ::

     $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...

 For Services
 ^^^^^^^^^^^^

 Once the service is deployed and warmed up, it is time to collect perf
 data with LBR (branch information). The exact perf command to use
 will depend on the service. E.g., to collect the data for all processes
 running on the server for the next 3 minutes, use:

 ::

     $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180

 Depending on the application, you may need more samples to be included
 in your profile. It’s hard to tell upfront what the sweet spot is for
 your application. We recommend that the profile cover 1B instructions,
 as reported by BOLT’s ``-dyno-stats`` option. If you need to increase
 the number of samples in the profile, you can either run the ``sleep``
 command for longer or use the ``-F<N>`` option with ``perf`` to
 increase the sampling frequency.
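
 Both knobs can be combined in a single command. The helper below only
 prints the command so it can be reviewed before running;
 ``perf_record_cmd`` is a hypothetical name, and the values 300 and 5000
 are arbitrary illustrations, not recommendations.

```shell
# Print (do not run) a system-wide perf command for a given duration and
# sampling frequency. perf_record_cmd is a hypothetical helper name.
perf_record_cmd() {
    seconds="$1"    # how long to sample, passed to sleep
    freq="$2"       # samples per second, passed to perf's -F option
    echo "perf record -e cycles:u -j any,u -a -F $freq -o perf.data -- sleep $seconds"
}

# Example: sample for 5 minutes at 5000 Hz (illustrative values).
perf_record_cmd 300 5000
```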

 Note that for profile collection we recommend using cycle events and not
 ``BR_INST_RETIRED.*``. Empirically, we found that cycle events produce
 better results.

 If the collection of a profile with branches is not available, e.g.,
 when you run on a VM or on hardware that does not support it, then you
 can use only sample events, such as cycles. In this case, the quality of
 the profile information would not be as good, and performance gains with
 BOLT are expected to be lower.

 With instrumentation
 ^^^^^^^^^^^^^^^^^^^^

 If ``perf record`` is not available to you, you may collect a profile
 by first instrumenting the binary with BOLT and then running it.

 ::

     llvm-bolt <executable> -instrument -o <instrumented-executable>

 After you run ``<instrumented-executable>`` with the desired workload,
 its BOLT profile should be ready for you in ``/tmp/prof.fdata`` and you
 can skip **Step 2**.

 Run BOLT with the ``-help`` option and check the category “BOLT
 instrumentation options” for a quick reference on instrumentation knobs.

 Step 2: Convert Profile to BOLT Format
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 NOTE: you can skip this step and feed ``perf.data`` directly to BOLT
 using the experimental ``-p perf.data`` option.

 For this step, you will need the ``perf.data`` file collected in the
 previous step and a copy of the binary that was running. The binary
 must either be unstripped or at least have its symbol table intact
 (i.e., running ``strip -g`` is okay).

 Make sure ``perf`` is in your ``PATH``, and execute ``perf2bolt``:

 ::

     $ perf2bolt -p perf.data -o perf.fdata <executable>

 This command will aggregate branch data from ``perf.data`` and store it
 in a format that is both more compact and more resilient to binary
 modifications.

 If the profile was collected without LBRs, you will need to add the
 ``-nl`` flag to the command line above.
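
 The two variants can be captured in a small command builder. This is a
 sketch only: ``perf2bolt_cmd`` is a made-up helper, and the caller is
 assumed to know whether the profile was collected with LBRs. It prints
 the command instead of running it.

```shell
# Print the perf2bolt command for a binary, adding -nl when the profile
# was collected without LBRs. perf2bolt_cmd is a hypothetical helper.
perf2bolt_cmd() {
    binary="$1"     # path to the profiled executable
    has_lbr="$2"    # "yes" if perf.data contains branch (LBR) records
    if [ "$has_lbr" = "yes" ]; then
        echo "perf2bolt -p perf.data -o perf.fdata $binary"
    else
        echo "perf2bolt -p perf.data -o perf.fdata -nl $binary"
    fi
}

# Example: profile collected on a VM without LBR support.
#   perf2bolt_cmd ./my_app no
```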

 Step 3: Optimize with BOLT
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

 Once you have ``perf.fdata`` ready, you can use it for optimizations
 with BOLT. Assuming your environment is set up to include the right
 path, execute ``llvm-bolt``:

 ::

     $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats

 If you need updated debug info, add the ``-update-debug-sections``
 option to the command above. The processing time will be slightly
 longer.
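
 A build script might make the debug-info flag optional. The following
 sketch prints the resulting command rather than running it;
 ``bolt_cmd`` and its ``debug`` argument are names invented for this
 example, while the flags are the ones shown above.

```shell
# Print the llvm-bolt command for a binary; pass "debug" as the second
# argument to also update debug info. bolt_cmd is a hypothetical helper.
bolt_cmd() {
    binary="$1"
    mode="${2:-}"   # optional: "debug"
    cmd="llvm-bolt $binary -o $binary.bolt -data=perf.fdata"
    cmd="$cmd -reorder-blocks=ext-tsp -reorder-functions=hfsort"
    cmd="$cmd -split-functions -split-all-cold -split-eh -dyno-stats"
    if [ "$mode" = "debug" ]; then
        cmd="$cmd -update-debug-sections"
    fi
    echo "$cmd"
}

# Example:
#   bolt_cmd ./my_app debug
```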

 For a full list of options see ``-help``/``-help-hidden`` output.

 The input binary for this step does not have to match exactly the
 binary used for profile collection in **Step 1**. A mismatch can occur
 when you are doing active development and the source code changes
 constantly, yet you still want to benefit from profile-guided
 optimizations. However, since the binary is not precisely the same, the
 profile information could become invalid or stale, and BOLT will report
 the number of functions with a stale profile. The higher the number,
 the less performance improvement should be expected. Thus, it is
 crucial to update ``.fdata`` for release branches.

 Multiple Profiles
 -----------------

 Suppose your application can run in different modes, and you can
 generate a profile for each one of them. To generate a single binary
 that can benefit all modes (assuming the profiles don’t contradict each
 other), you can use the ``merge-fdata`` tool:

 ::

     $ merge-fdata *.fdata > combined.fdata

 Use ``combined.fdata`` for **Step 3** above to generate a universally
 optimized binary.
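
 When merging is scripted, it helps to fail early if no profiles are
 present instead of passing an unmatched glob to the tool. The wrapper
 below is a sketch under that assumption; ``merge_profiles`` and the
 ``MERGE_FDATA`` override variable are names made up for this example.

```shell
# Merge every .fdata file in a directory into one output file, refusing
# to run when the directory holds no profiles.
# merge_profiles is a hypothetical wrapper; MERGE_FDATA optionally names
# the merge tool and defaults to merge-fdata on PATH.
merge_profiles() {
    dir="$1"    # directory containing per-mode .fdata files
    out="$2"    # output file; keep it outside dir so reruns skip it
    set -- "$dir"/*.fdata
    if [ ! -e "$1" ]; then
        echo "no .fdata profiles found in $dir" >&2
        return 1
    fi
    "${MERGE_FDATA:-merge-fdata}" "$@" > "$out"
}

# Example:
#   merge_profiles ./profiles ./combined.fdata
```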

 License
 -------

 BOLT is licensed under the `Apache License v2.0 with LLVM
 Exceptions <./LICENSE.TXT>`__.