C++ Translator

Capstone uses source files from LLVM to disassemble opcodes. Because LLVM is written in C++ we must translate those files to C.

The task of the CppTranslator is to do just that. The translation will not result in a completely correct C file! But it takes away most of the manual work.

The configuration file

The configuration for each architecture is set in arch_config.json.

The config values have the following meaning:

General: Settings valid for all architectures.
- diff_color_new: Color in the Differ for translated content.
- diff_color_old: Color in the Differ for old/current Capstone content.
- diff_color_saved: Color in the Differ for saved content.
- diff_color_edited: Color in the Differ for edited content.
- patch_editor: Editor to open for patch editing.
- nodes_to_diff: List of parse tree nodes which get diffed - Mind the note below.
  - node_type: The type of the node to be diffed.
  - identifier_node_type: Types of child nodes which identify the node during diffing (the identifier must be the same in the translated and the old file!). Types can be of the form <parent-type>/<child type>.
<ARCH>: Settings valid for a specific architecture
- files_to_translate: A list of file paths to translate.
  - in: Path to a specific source file.
  - out: The filename of the translated file.
- files_for_template_search: List of file paths to search for calls to template functions.
- manually_edite_files: List of files which are too complicated to translate. The user will be warned about them.
- templates_with_arg_deduction: Template functions which uses argument deduction. Those templates are translated to normal functions, not macro definition.

Note:

To understand the nodes_to_diff setting, check out Differ.py.
Paths can contain {AUTO_SYNC_ROOT}, {CS_ROOT} and {CPP_TRANSLATOR_ROOT}. They are replaced with the absolute paths to those directories.

Translation process

The translation process simply searches for certain syntax and patches it.

To allow searches for complicated patterns we parse the C++ file with Tree-sitter. Afterward we can use pattern queries to find our syntax we would like to patch.

Here is an overview of the procedure:

First the source file is parsed with Tree-Sitter.
Afterward the translator iterates of a number of patches.

For each patch we do the following.


 Translator                                  Patch
   +---+
   |   |                                     +----+
   |   |  Request pattern to search for      |    |
   |   | ----------------------------------> |    |
   |   |                                     |    |
   |   |  Return pattern                     |    |
   |   | <---------------------------------  |    |
   |   |                                     |    |
   |   | ---+                                |    |
   |   |    | Find                           |    |
   |   |    | captures                       |    |
   |   |    | in src                         |    |
   |   | <--+                                |    |
   |   |                                     |    |
   |   |  Return captures found              |    |
   |   | ----------------------------------> |    |
   |   |                                     |    |
   |   |                                 +-- |    |
   |   |                     Use capture |   |    |
   |   |                     info to     |   |    |
   |   |                     build new   |   |    |
   |   |                     syntax str  |   |    |
   |   |                                 +-> |    |
   |   |                                     |    |
   |   | Return new syntax string to patch   |    |
   |   | <---------------------------------- |    |
   |   |                                     |    |
   |   | ---+                                |    |
   |   |    | Replace old                    |    |
   |   |    | with new syntax                |    |
   |   |    | at all occurrences             |    |
   |   |    | in the file.                   |    |
   |   | <--+                                |    |
   |   |                                     |    |
   +---+                                     +----+

C++ Template translation

Most of the C++ syntax is simple to translate. But unfortunately the one exception are C++ templates.

Translating template functions and calls from C++ to C is tricky. Since each template has a number of actual implementations we do the following.

A template function definition is translated into a C macro.
The template parameters get translated to the macro parameters.
To differentiate the C implementations, the functions follow the naming pattern fcn_[template_param_0]_[template_param_1]()

Example

This C++ template function

template<unsigned X>
void fcn() {
   unsigned a = X * 8;
}

becomes

#define DEFINE_FCN(X)  \
void fcn ## _ ## X() { \
   unsigned a = X * 8; \
}

To define an implementation where X = 0 we do

DEFINE_FCN(0)

To call this implementation we call fcn_0().

(There is a special case when a template parameter is passed on to a template call. But this is explained in the code.)

Enumerate template instances

In our C++ code a template function can be called with different template parameters. For each of those calls we need to define a template implementation in C.

To do that we first scan source files for calls to template functions (TemplateCollector.py does this). For each unique call we check the parameter list. Knowing the parameter list we can now define a C function which uses exactly those parameters. For the definition we use a macro as above.

Example

Within this C++ code we see two template function calls:

void main() {
   fcn<0>();
   fcn<4>();
}

With the knowledge that once parameter 1 and once parameter 4 was passed to the template, we can define the implementations with the help of our DEFINE_FCN macro.

DEFINE_FCN(0)
DEFINE_FCN(4)

Within the C code we can now call those with fcn_0() and fcn_4().