auto-sync is the architecture update tool for Capstone. Because the architecture modules of Capstone use mostly code from LLVM, we need to update this part with every LLVM release. auto-sync helps with this synchronization between LLVM and Capstone's modules by automating most of it.
You can find it in suite/auto-sync.
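This document is split into four parts.
1. An overview of the update process and which subcomponents of auto-sync do what.
2. Instructions for updating an architecture that already supports auto-sync.
3. Instructions for refactoring an architecture to use auto-sync.
4. Notes on adding a new architecture with auto-sync.

Please read the section about architecture module design in ARCHITECTURE.md before proceeding. This architectural understanding is important for what follows.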
As already described in the ARCHITECTURE document, Capstone uses translated and generated source code from LLVM.
Because LLVM is written in C++ and Capstone in C, the update process is internally complicated but almost completely automated.
auto-sync categorizes the source files of a module into three groups. Each group is updated differently.
File type | Update method | Edits by hand |
---|---|---|
Generated files | Generated by patched LLVM backends | Never/Not allowed |
Translated LLVM C++ files | CppTranslator and Differ | Only changes which are too complicated for automation. |
Capstone files | By hand | All |
Let's look at the update procedure for each group in detail.
Note: The only permitted way to modify generated files is via git patches. This is a last resort for cases where something is broken in LLVM and we cannot generate correct files.
Generated files
Generated files always have the file extension .inc.
There are generated files for the LLVM code and for Capstone. They can be distinguished by their names:
- For Capstone: <ARCH>GenCS<NAME>.inc.
- For LLVM code: <ARCH>Gen<NAME>.inc.

The files are generated by refactored LLVM TableGen emitter backends.
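As a minimal sketch of the naming convention above, the following Python snippet sorts .inc file names into the two groups. The function name classify_inc and both regular expressions are hypothetical and written only for illustration; they are not part of auto-sync.

```python
import re

# Capstone-specific generated files: <ARCH>GenCS<NAME>.inc
CS_PATTERN = re.compile(r"^(?P<arch>[A-Za-z0-9]+?)GenCS(?P<name>\w+)\.inc$")
# Generated files for the LLVM code: <ARCH>Gen<NAME>.inc
LLVM_PATTERN = re.compile(r"^(?P<arch>[A-Za-z0-9]+?)Gen(?P<name>\w+)\.inc$")


def classify_inc(filename: str) -> str:
    """Return "capstone", "llvm" or "unknown" for a generated file name."""
    # Check the more specific GenCS pattern first, because every
    # GenCS name would also match the plain Gen pattern.
    if CS_PATTERN.match(filename):
        return "capstone"
    if LLVM_PATTERN.match(filename):
        return "llvm"
    return "unknown"
```

Note that this sketch would misclassify an LLVM file whose <NAME> part happens to start with "CS"; it assumes no such file exists.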
The procedure looks roughly like this:
                                                                       ┌──────────┐
        1               2                 3                4           │CS .inc   │
    ┌───────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐  ┌─►│files     │
    │ .td   │     │           │     │           │     │ Code-    │  │  └──────────┘
    │ files ├────►│ TableGen  ├────►│  CodeGen  ├────►│ Emitter  ├──┤
    └───────┘     └──────┬────┘     └───────────┘     └──────────┘  │  ┌──────────┐
                         │                                 ▲        └─►│LLVM .inc │
                         └─────────────────────────────────┘           │files     │
                                                                       └──────────┘
LLVM architectures are defined in .td files. They describe instructions, operands, features and other properties of an architecture.
LLVM TableGen parses these files and converts them to an internal representation.
In the second step, a TableGen component called CodeGen abstracts these properties even further. The result is a representation which is not specific to any architecture (e.g. the CodeGenInstruction class can represent a machine instruction of any architecture).
The Code-Emitter uses the abstract representation of the architecture (provided by CodeGen) to generate state machines for instruction decoding. Architecture-specific information (think of register names, operand properties etc.) is taken from TableGen's internal representation.
The result is emitted to .inc files. Those are included in the translated C++ files or Capstone code where necessary.
Translation of LLVM C++ files
We use two tools to translate C++ to C files: first the CppTranslator and afterward the Differ.
The CppTranslator parses the C++ files and patches C++ syntax with its equivalent C syntax.
Note: For details about this, check out suite/auto-sync/CppTranslator/README.md.
Because the result of the CppTranslator is not perfect, we still have many syntax problems left.
Those need to be fixed by hand. To ease this process, we run the Differ after the CppTranslator.
The Differ parses each translated file and the corresponding source file currently used in Capstone. It then compares specific nodes from the newly translated file to the equivalent nodes in the old file.
The user can choose whether to accept the version from the translated file or the one from the old file. This decision is saved for every node. If a saved decision exists for a node, it is automatically applied again.
Every other syntax error must be solved manually.
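The decision reuse described above can be sketched in Python as follows. This is only an illustration under stated assumptions: how the real Differ identifies nodes and where it stores decisions is not covered here, so the sketch uses plain dictionaries keyed by a hypothetical node id, and the names resolve_nodes, OLD and NEW are made up for this example.

```python
from typing import Callable, Dict

OLD, NEW = "old", "new"


def resolve_nodes(
    old_nodes: Dict[str, str],
    new_nodes: Dict[str, str],
    saved: Dict[str, str],
    ask: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Pick the text for each node id, reusing saved decisions where possible."""
    result = {}
    for node_id, new_text in new_nodes.items():
        old_text = old_nodes.get(node_id)
        if old_text is None or old_text == new_text:
            # Nothing to compare against, or no difference: take the new text.
            result[node_id] = new_text
            continue
        if node_id not in saved:
            # No earlier decision: ask the user and remember the answer.
            saved[node_id] = ask(node_id, old_text, new_text)
        result[node_id] = old_text if saved[node_id] == OLD else new_text
    return result
```

Calling resolve_nodes a second time with the same saved dictionary never invokes ask for nodes that were already decided, which mirrors how a later user can auto-apply earlier corrections.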
To update an architecture, do the following:
Rebase llvm-capstone onto the new LLVM release (if not already done).
    # 1. Clone Capstone's LLVM
    git clone https://github.com/capstone-engine/llvm-capstone
    cd llvm-capstone
    git checkout auto-sync

    # 2. Rebase onto the new LLVM release and resolve the conflicts.

    # 3. Build tblgen
    mkdir build
    cd build
    cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=<ARCH> -DCMAKE_BUILD_TYPE=Debug ../llvm
    cmake --build . --target llvm-tblgen --config Debug

    # 4. Run the updater
    cd ../../suite/auto-sync/
    ./Updater/ASUpdater.py -a <ARCH>
The update script will execute the steps described above and copy the new files to their directories.
Afterward try to build Capstone and fix any build errors left.
If new instructions or operands were added, add test cases for those (regression tests for instructions are located in suite/MC/).
TODO: Operand and detail tests
Refactoring an architecture to use auto-sync
To refactor an architecture to use auto-sync, you need to add it to the configuration.
1. Add the architecture to ASUpdater.py.
2. Configure the CppTranslator for your architecture (suite/auto-sync/CppTranslator/arch_config.json).

Now, manually run the update commands within ASUpdater.py but skip the Differ step:
./Updater/ASUpdater.py -a <ARCH> -s IncGen Translate
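If you drive the updater from a script, the argument list can be assembled as in this small Python sketch. It uses only the -a and -s options and the step names IncGen and Translate that appear in the command above; the helper name updater_command is hypothetical, and the sketch assumes it is run from suite/auto-sync.

```python
from typing import List, Sequence


def updater_command(arch: str, steps: Sequence[str] = ()) -> List[str]:
    """Build the argument list for one ASUpdater.py run."""
    cmd = ["./Updater/ASUpdater.py", "-a", arch]
    if steps:
        # -s is followed by the steps to run, as in the command above
        # (IncGen and Translate, which leaves out the Differ).
        cmd += ["-s", *steps]
    return cmd
```

The result can be passed to subprocess.run(..., check=True) so that a failing step stops the script.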
The task after this is to:
- Implement the add_cs_detail() handler in <ARCH>Mapping for each operand type.

Notes:
If you find yourself fixing the same syntax error multiple times, please consider adding a Patch to the CppTranslator for this case.
Please check out the implementation of ARM's add_cs_detail() before implementing your own.
Running the Differ after everything is done preserves your version of the syntax corrections, and the next user can auto-apply them.
Sometimes the LLVM code uses a single function from a larger source file. It is not worth translating the whole file just for this function. Bundle those lonely functions in <ARCH>DisassemblerExtension.c.
Some generated enums must be included in the include/capstone/<ARCH>.h header. At the position where the enum should be inserted, add a comment like this (don't remove the <> brackets):

    // generate content <FILENAME.inc> begin
    // generate content <FILENAME.inc> end

The update script will insert the content of the .inc file at this place.
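To illustrate the idea, here is a minimal Python sketch of such a marker-based insertion. It is not the update script's actual code: the function name insert_generated is hypothetical, and the sketch assumes that the markers stay in the header and that everything between them is replaced on each run.

```python
import re


def insert_generated(header: str, inc_name: str, inc_content: str) -> str:
    """Replace the text between the begin/end markers for inc_name."""
    begin = f"// generate content <{inc_name}> begin"
    end = f"// generate content <{inc_name}> end"
    pattern = re.compile(
        re.escape(begin) + r".*?" + re.escape(end), flags=re.DOTALL
    )
    if not pattern.search(header):
        raise ValueError(f"markers for {inc_name} not found")
    replacement = f"{begin}\n{inc_content.rstrip()}\n{end}"
    # Use a function as replacement so backslashes in the generated
    # content are not interpreted as regex escapes.
    return pattern.sub(lambda _m: replacement, header, count=1)
```

Because the markers are kept, running the function again with the same content leaves the header unchanged.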
Adding a new architecture follows the same steps as above, with the exception that you need to implement all the Capstone files from scratch.
Check out an architecture that already supports auto-sync for guidance, and open an issue if you need help.