Refactoring of `rustc_codegen_llvm`

by Denis Merigoux, October 23rd 2018

State of the code before the refactoring

All the code related to the compilation of MIR into LLVM IR was contained inside the rustc_codegen_llvm crate. Here is the breakdown of the most important elements:

the back folder (7,800 LOC) implements the mechanisms for creating the different object files and archive through LLVM, but also the communication mechanisms for parallel code generation;
the debuginfo (3,200 LOC) folder contains all code that passes debug information down to LLVM;
the llvm (2,200 LOC) folder defines the FFI necessary to communicate with LLVM using the C++ API;
the mir (4,300 LOC) folder implements the actual lowering from MIR to LLVM IR;
the base.rs (1,300 LOC) file contains some helper functions but also the high-level code that launches the code generation and distributes the work.
the builder.rs (1,200 LOC) file contains all the functions generating individual LLVM IR instructions inside a basic block;
the common.rs (450 LOC) contains various helper functions and all the functions generating LLVM static values;
the type_.rs (300 LOC) defines most of the type translations to LLVM IR.

The goal of this refactoring is to separate inside this crate code that is specific to the LLVM from code that can be reused for other rustc backends. For instance, the mir folder is almost entirely backend-specific but it relies heavily on other parts of the crate. The separation of the code must not affect the logic of the code nor its performance.

For these reasons, the separation process involves two transformations that have to be done at the same time for the resulting code to compile :

replace all the LLVM-specific types by generics inside function signatures and structure definitions;
encapsulate all functions calling the LLVM FFI inside a set of traits that will define the interface between backend-agnostic code and the backend.

While the LLVM-specific code will be left in rustc_codegen_llvm, all the new traits and backend-agnostic code will be moved in rustc_codegen_ssa (name suggestion by @eddyb).

Generic types and structures

@irinagpopa started to parametrize the types of rustc_codegen_llvm by a generic Value type, implemented in LLVM by a reference &'ll Value. This work has been extended to all structures inside the mir folder and elsewhere, as well as for LLVM's BasicBlock and Type types.

The two most important structures for the LLVM codegen are CodegenCx and Builder. They are parametrized by multiple lifetime parameters and the type for Value.

struct CodegenCx<'ll, 'tcx> {
  /* ... */
}

struct Builder<'a, 'll, 'tcx> {
  cx: &'a CodegenCx<'ll, 'tcx>,
  /* ... */
}

CodegenCx is used to compile one codegen-unit that can contain multiple functions, whereas Builder is created to compile one basic block.

The code in rustc_codegen_llvm has to deal with multiple explicit lifetime parameters, that correspond to the following:

'tcx is the longest lifetime, that corresponds to the original TyCtxt containing the program's information;
'a is a short-lived reference of a CodegenCx or another object inside a struct;
'll is the lifetime of references to LLVM objects such as Value or Type.

Although there are already many lifetime parameters in the code, making it generic uncovered situations where the borrow-checker was passing only due to the special nature of the LLVM objects manipulated (they are extern pointers). For instance, a additional lifetime parameter had to be added to LocalAnalyser in analyse.rs, leading to the definition:

struct LocalAnalyzer<'mir, 'a, 'tcx> {
  /* ... */
}

However, the two most important structures CodegenCx and Builder are not defined in the backend-agnostic code. Indeed, their content is highly specific of the backend and it makes more sense to leave their definition to the backend implementor than to allow just a narrow spot via a generic field for the backend's context.

Traits and interface

Because they have to be defined by the backend, CodegenCx and Builder will be the structures implementing all the traits defining the backend's interface. These traits are defined in the folder rustc_codegen_ssa/traits and all the backend-agnostic code is parametrized by them. For instance, let us explain how a function in base.rs is parametrized:

pub fn codegen_instance<'a, 'tcx, Bx: BuilderMethods<'a, 'tcx>>(
    cx: &'a Bx::CodegenCx,
    instance: Instance<'tcx>
) {
    /* ... */
}

In this signature, we have the two lifetime parameters explained earlier and the master type Bx which satisfies the trait BuilderMethods corresponding to the interface satisfied by the Builder struct. The BuilderMethods defines an associated type Bx::CodegenCx that itself satisfies the CodegenMethods traits implemented by the struct CodegenCx.

On the trait side, here is an example with part of the definition of BuilderMethods in traits/builder.rs:

pub trait BuilderMethods<'a, 'tcx>:
    HasCodegen<'tcx>
    + DebugInfoBuilderMethods<'tcx>
    + ArgTypeMethods<'tcx>
    + AbiBuilderMethods<'tcx>
    + IntrinsicCallMethods<'tcx>
    + AsmBuilderMethods<'tcx>
{
    fn new_block<'b>(
        cx: &'a Self::CodegenCx,
        llfn: Self::Value,
        name: &'b str
    ) -> Self;
    /* ... */
    fn cond_br(
        &mut self,
        cond: Self::Value,
        then_llbb: Self::BasicBlock,
        else_llbb: Self::BasicBlock,
    );
    /* ... */
}

Finally, a master structure implementing the ExtraBackendMethods trait is used for high-level codegen-driving functions like codegen_crate in base.rs. For LLVM, it is the empty LlvmCodegenBackend. ExtraBackendMethods should be implemented by the same structure that implements the CodegenBackend defined in rustc_codegen_utils/codegen_backend.rs.

During the traitification process, certain functions have been converted from methods of a local structure to methods of CodegenCx or Builder and a corresponding self parameter has been added. Indeed, LLVM stores information internally that it can access when called through its API. This information does not show up in a Rust data structure carried around when these methods are called. However, when implementing a Rust backend for rustc, these methods will need information from CodegenCx, hence the additional parameter (unused in the LLVM implementation of the trait).

State of the code after the refactoring

The traits offer an API which is very similar to the API of LLVM. This is not the best solution since LLVM has a very special way of doing things: when addding another backend, the traits definition might be changed in order to offer more flexibility.

However, the current separation between backend-agnostic and LLVM-specific code has allows the reuse of a significant part of the old rustc_codegen_llvm. Here is the new LOC breakdown between backend-agnostic (BA) and LLVM for the most important elements:

back folder: 3,800 (BA) vs 4,100 (LLVM);
mir folder: 4,400 (BA) vs 0 (LLVM);
base.rs: 1,100 (BA) vs 250 (LLVM);
builder.rs: 1,400 (BA) vs 0 (LLVM);
common.rs: 350 (BA) vs 350 (LLVM);

The debuginfo folder has been left almost untouched by the splitting and is specific to LLVM. Only its high-level features have been traitified.

The new traits folder has 1500 LOC only for trait definitions. Overall, the 27,000 LOC-sized old rustc_codegen_llvm code has been split into the new 18,500 LOC-sized new rustc_codegen_llvm and the 12,000 LOC-sized rustc_codegen_ssa. We can say that this refactoring allowed the reuse of approximately 10,000 LOC that would otherwise have had to be duplicated between the multiple backends of rustc.

The refactored version of rustc's backend introduced no regression over the test suite nor in performance benchmark, which is in coherence with the nature of the refactoring that used only compile-time parametricity (no trait objects).

Refactoring of rustc_codegen_llvm

State of the code before the refactoring

Generic types and structures

Traits and interface

State of the code after the refactoring

Refactoring of `rustc_codegen_llvm`