Auto-Sync

auto-sync is the architecture update tool for Capstone. Because the architecture modules of Capstone use mostly code from LLVM, we need to update this part with every LLVM release. auto-sync helps with this synchronization between LLVM and Capstone's modules by automating most of it.

You can find it in suite/auto-sync.

This document is split into four parts.

  1. An overview of the update process and which subcomponents of auto-sync do what.
  2. Instructions on how to update an architecture which already supports auto-sync.
  3. Instructions on how to refactor an architecture to use auto-sync.
  4. Notes about how to add a new architecture to Capstone with auto-sync.

Please read the section about architecture module design in ARCHITECTURE.md before proceeding. The rest of this document assumes that understanding.

Update procedure

As already described in the ARCHITECTURE document, Capstone uses translated and generated source code from LLVM.

Because LLVM is written in C++ and Capstone in C, the update process is complicated internally, but it is almost completely automated.

auto-sync categorizes source files of a module into three groups. Each group is updated differently.

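File type                  Update method                       Edits by hand
-------------------------  ----------------------------------  ------------------------------------------------------
Generated files            Generated by patched LLVM backends  Never/Not allowed
Translated LLVM C++ files  CppTranslator and Differ            Only changes which are too complicated for automation
Capstone files             By hand                             All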

Let's look at the update procedure for each group in detail.

Note: The only permitted way to change generated files is via git patches. This is a last resort for cases where something is broken in LLVM and correct files cannot be generated.

Generated files

Generated files always have the file extension .inc.

There are generated files for the LLVM code and for Capstone. They can be distinguished by their names:

  • For Capstone: <ARCH>GenCS<NAME>.inc.
  • For LLVM code: <ARCH>Gen<NAME>.inc.
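The naming rule above can be checked mechanically. The following is a minimal sketch and not part of auto-sync: the function name `classify_inc` is hypothetical, and the file names in the usage lines are only illustrative. It assumes the `<ARCH>` prefix does not itself contain `Gen`, and it would misreport an LLVM file whose `<NAME>` happens to start with `CS`.

```python
def classify_inc(filename: str) -> str:
    """Classify a generated file as 'capstone', 'llvm' or 'unknown'.

    Hypothetical helper based only on the two naming patterns
    <ARCH>GenCS<NAME>.inc and <ARCH>Gen<NAME>.inc.
    """
    if not filename.endswith(".inc"):
        return "unknown"
    stem = filename[: -len(".inc")]
    # Split at the first "Gen": everything before it is treated as <ARCH>.
    arch, sep, rest = stem.partition("Gen")
    if not sep or not arch or not rest:
        return "unknown"
    # A "CS" prefix followed by a non-empty <NAME> marks a Capstone file.
    if rest.startswith("CS") and len(rest) > len("CS"):
        return "capstone"
    return "llvm"


if __name__ == "__main__":
    for name in ("ARMGenCSInsnEnum.inc", "ARMGenAsmWriter.inc", "ARMMapping.c"):
        print(name, "->", classify_inc(name))
```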

The files are generated by refactored LLVM TableGen emitter backends.

The procedure looks roughly like this:

                                                                   ┌──────────┐
    1               2                 3                4           │CS .inc   │
┌───────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐  ┌─►│files     │
│ .td   │     │           │     │           │     │ Code-    │  │  └──────────┘
│ files ├────►│ TableGen  ├────►│  CodeGen  ├────►│ Emitter  ├──┤
└───────┘     └──────┬────┘     └───────────┘     └──────────┘  │  ┌──────────┐
                     │                                 ▲        └─►│LLVM .inc │
                     └─────────────────────────────────┘           │files     │
                                                                   └──────────┘
  1. LLVM architectures are defined in .td files. They describe instructions, operands, features and other properties of an architecture.

  2. LLVM TableGen parses these files and converts them to an internal representation.

  3. In the next step, a TableGen component called CodeGen abstracts these properties even further. The result is a representation which is not specific to any architecture (e.g. the CodeGenInstruction class can represent a machine instruction of any architecture).

  4. The Code-Emitter uses the abstract representation of the architecture (provided by CodeGen) to generate state machines for instruction decoding. Architecture-specific information (think of register names, operand properties etc.) is taken from TableGen's internal representation.

The result is emitted to .inc files. Those are included in the translated C++ files or Capstone code where necessary.

Translation of LLVM C++ files

We use two tools to translate C++ files to C: first the CppTranslator, and afterward the Differ.

The CppTranslator parses the C++ files and patches C++ syntax with its equivalent C syntax.

Note: For details about this, check out suite/auto-sync/CppTranslator/README.md.
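To give a feel for what "patching C++ syntax with its C equivalent" means, here is a deliberately simplified sketch. It is not how the CppTranslator is implemented: the CppTranslator parses the files, while this toy works on plain text with regular expressions. The function name `patch_line` and the two rewrites chosen here are assumptions made for illustration only.

```python
import re

# Matches only non-nested static_cast<T>(expr). Expressions with nested
# parentheses or nested template arguments are deliberately left untouched.
STATIC_CAST = re.compile(r"static_cast<([^<>]+)>\(([^()]*)\)")
NULLPTR = re.compile(r"\bnullptr\b")


def patch_line(line: str) -> str:
    """Rewrite two simple C++ constructs into C (illustrative only)."""
    # static_cast<T>(x)  ->  ((T)(x))
    line = STATIC_CAST.sub(r"((\1)(\2))", line)
    # nullptr  ->  NULL
    return NULLPTR.sub("NULL", line)


if __name__ == "__main__":
    print(patch_line("unsigned v = static_cast<unsigned>(Imm);"))
    print(patch_line("if (MI == nullptr) return;"))
```

Anything the patterns do not cover stays as it was, which mirrors the situation described in this section: leftover C++ syntax has to be handled in a later step.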

Because the result of the CppTranslator is not perfect, we still have many syntax problems left.

Those need to be fixed by hand. In order to ease this process we run the Differ after the CppTranslator.

The Differ parses each translated file and the corresponding source file currently used in Capstone. It then compares specific nodes from the just translated file to the equivalent nodes in the old file.

The user can choose whether to accept the version from the translated file or the one from the old file. This decision is saved for every node. If a saved decision exists for a node, it is automatically applied again.

Every other syntax error must be solved manually.
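The idea of saved decisions can be sketched as follows. This is a minimal illustration under stated assumptions, not the Differ's actual code: the names `node_key` and `resolve`, the use of a plain dictionary as the store, and the choice to key a decision by the node name plus a hash of both versions are all hypothetical.

```python
import hashlib


def node_key(node_id: str, old_text: str, new_text: str) -> str:
    """Build a lookup key for one node (assumed scheme: name + content hash)."""
    digest = hashlib.sha256(f"{old_text}\0{new_text}".encode()).hexdigest()
    return f"{node_id}:{digest}"


def resolve(node_id, old_text, new_text, saved, ask):
    """Return the chosen text for a node, asking the user at most once.

    `saved` maps keys to "old" or "new". `ask` is a callback that asks the
    user and returns one of those two strings.
    """
    key = node_key(node_id, old_text, new_text)
    if key not in saved:
        # No earlier decision for this node: ask and remember the answer.
        saved[key] = ask(node_id, old_text, new_text)
    return old_text if saved[key] == "old" else new_text


if __name__ == "__main__":
    decisions = {}
    always_old = lambda node_id, old, new: "old"
    print(resolve("printOperand", "int a;", "int b;", decisions, always_old))
    # The second call reuses the stored decision and does not ask again.
    print(resolve("printOperand", "int a;", "int b;", decisions, always_old))
```

In this sketch a changed node body produces a new key, so the user would be asked again for that node. Whether that matches the Differ's behavior is not stated in this document.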

Update an architecture

To update an architecture do the following:

Rebase llvm-capstone onto the new LLVM release (if not already done).

# 1. Clone Capstone's LLVM
git clone https://github.com/capstone-engine/llvm-capstone
cd llvm-capstone
git checkout auto-sync

# 2. Rebase onto the new LLVM release and resolve the conflicts.

# 3. Build tblgen
mkdir build
cd build
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=<ARCH> -DCMAKE_BUILD_TYPE=Debug ../llvm
cmake --build . --target llvm-tblgen --config Debug

# 4. Run the updater
cd ../../suite/auto-sync/
./Updater/ASUpdater.py -a <ARCH>

The update script will execute the steps described above and copy the new files to their directories.

Afterward, build Capstone and fix any remaining build errors.

If new instructions or operands were added, add test cases for them (regression tests for instructions are located in suite/MC/).

TODO: Operand and detail tests

Refactor an architecture for auto-sync

To refactor an architecture to use auto-sync, you need to add it to the configuration.

  1. Add the architecture to the supported architectures list in ASUpdater.py.
  2. Configure the CppTranslator for your architecture (suite/auto-sync/CppTranslator/arch_config.json)

Now, manually run the update commands within ASUpdater.py but skip the Differ step:

./Updater/ASUpdater.py -a <ARCH> -s IncGen Translate

After this, the remaining tasks are:

  • Replace leftover C++ syntax with its C equivalent.
  • Implement the add_cs_detail() handler in <ARCH>Mapping for each operand type.
  • Add any missing logic to the translated files.
  • Make it build and write tests.
  • Run the Differ again and always select the old nodes.

Notes:

  • If you find yourself fixing the same syntax error multiple times, please consider adding a Patch to the CppTranslator for this case.

  • Please check out the implementation of ARM's add_cs_detail() before implementing your own.

  • Running the Differ after everything is done preserves your version of the syntax corrections, so the next user can auto-apply them.

  • Sometimes the LLVM code uses a single function from a larger source file. It is not worth it to translate the whole file just for this function. Bundle those lonely functions in <ARCH>DisassemblerExtension.c.

  • Some generated enums must be included in the include/capstone/<ARCH>.h header. At the position where the enum should be inserted, add a comment like this (don't remove the <> brackets):

    // generate content <FILENAME.inc> begin
    // generate content <FILENAME.inc> end
    

The update script will insert the content of the .inc file at this place.
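The marker mechanism can be illustrated with a short sketch. This is not the update script's actual code: the function name `insert_generated` is hypothetical, and the sketch assumes that anything already between the two marker lines is replaced on each run.

```python
def insert_generated(header_text: str, inc_name: str, inc_content: str) -> str:
    """Place inc_content between the begin/end markers for inc_name."""
    begin = f"// generate content <{inc_name}> begin"
    end = f"// generate content <{inc_name}> end"
    out = []
    skipping = False  # True while we are between the two markers
    found = False
    for line in header_text.splitlines():
        stripped = line.strip()
        if stripped == begin:
            out.append(line)
            out.extend(inc_content.splitlines())
            skipping = True
            found = True
            continue
        if stripped == end:
            skipping = False
        if not skipping:
            out.append(line)
    if not found or skipping:
        raise ValueError(f"markers for {inc_name} are missing or unbalanced")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    header = (
        "typedef enum {\n"
        "\t// generate content <FILENAME.inc> begin\n"
        "\t// generate content <FILENAME.inc> end\n"
        "} example_enum;\n"
    )
    print(insert_generated(header, "FILENAME.inc", "\tEXAMPLE_A,\n\tEXAMPLE_B,\n"))
```

Because old content between the markers is dropped before the new content is written, running the sketch twice with the same input gives the same result.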

Adding a new architecture

Adding a new architecture follows the same steps as above, with the exception that you need to implement all the Capstone files from scratch.

Check out an architecture that already supports auto-sync for guidance, and open an issue if you need help.