Swift Compiler Performance

This document is a guide to understanding, diagnosing and reporting compilation-performance problems in the Swift compiler. That is: the speed at which the compiler compiles code, not the speed at which that code runs.

While this guide is lengthy, it should all be relatively straightforward. Performance analysis is largely a matter of patience, thoroughness and perseverance, measuring carefully and consistently, and gradually eliminating noise and focusing on a signal.

Outline of processes and factors affecting compilation performance

This section is intended to provide a high-level orientation around what the compiler is doing when it's run -- beyond the obvious “compiling” -- and what major factors influence how much time it spends.

When you compile or run a Swift program, either with Xcode or on the command line, you typically invoke swift or swiftc (the latter is a symbolic link to the former), which is a program that can behave in very different ways depending on its arguments.
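
If you want to check this yourself, the symlink is easy to inspect; a quick look on a typical toolchain (the exact path varies per installation, and the xcrun step is macOS-only):

$ ls -l "$(xcrun --find swiftc)"    # macOS
$ ls -l "$(which swiftc)"           # Linux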

It may compile or execute code directly, but it will usually instead turn around and run one or more copies of swift or swiftc as subprocesses. In typical batch compilation, the first copy of swiftc runs as a so-called driver process, and it then executes a number of so-called frontend subprocesses, in a process tree. It's essential, when interpreting Swift compilation, to have a clear picture of which processes are run and what they're doing:

  • Driver: the top-level swiftc process in a tree of subprocesses. Responsible for deciding which files need compiling or recompiling and running child processes — so-called jobs — to perform compilation and linking steps. For most of its execution, it is idle, waiting for subprocesses to complete.

  • Frontend Jobs: subprocesses launched by the driver, running swift -frontend ... and performing compilation, generating PCH files, merging modules, etc. These are the jobs that incur the bulk of the costs of compiling.

  • Other Jobs: subprocesses launched by the driver, running ld, swift -modulewrap, swift-autolink-extract, dsymutil, dwarfdump and similar tools involved in finishing off a batch of work done by the frontend jobs. Some of these will be the swift program too, but they're not “doing frontend jobs” and so will have completely different profiles.

The set of jobs that are run, and the way they spend their time, is itself highly dependent on compilation modes. Information concerning those modes that's relevant to compilation performance is recounted in the following section; for more details on the driver, see the driver docs, as well as docs on driver internals and driver parseable output.
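
If you want to see exactly which subprocesses a given invocation will produce, the driver can print the jobs it would run without executing them; for example (the file set here is a placeholder):

$ swiftc -driver-print-jobs *.swift

(The -### flag is a short synonym for the same option.)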

After discussing compilation modes in the following section, we'll also touch on large-scale variation in workload that can occur without obvious hotspots, in terms of laziness strategies and approximations.

Compilation modes

There are many different options for controlling the driver and frontend jobs, but the two dimensions that cause the most significant variation in behaviour are often referred to as modes. It's important when looking at compilation to be clear on which mode swiftc is running in, and often to perform separate analysis for each mode. The significant modes are:

  • Primary-file vs. whole-module: this varies depending on whether the driver is run with the flag -wmo (a.k.a. -whole-module-optimization).

    • Batch vs. single-file primary-file mode. This distinction refines the behaviour of primary-file mode, with the new batch mode added in the Swift 4.2 release cycle. Batching eliminates much of the overhead of primary-file mode, and will eventually become the default way of running primary-file mode, but until that time it is explicitly enabled by passing the -enable-batch-mode flag.
  • Optimizing vs. non-optimizing: this varies depending on whether the driver (and thus each frontend) is run with the flags -O, -Osize, or -Ounchecked (each of which turns on one or more sets of optimizations), or the default (no-optimization), which is synonymous with -Onone or -Oplayground.

When you build a program in Xcode or using xcodebuild, often there is a configuration parameter that will switch both of these modes simultaneously. That is, a typical project has two configurations:

  • Debug which combines primary-file mode with -Onone
  • Release which combines WMO mode with -O

But these parameters can be varied independently and the compiler will spend its time very differently depending on their settings, so it's worth understanding both dimensions in a bit more detail.
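
In driver terms, the two typical configurations correspond roughly to the following invocations (file names are placeholders, and -Onone is spelled out even though it is the default):

$ swiftc -Onone *.swift       # Debug: primary-file mode, non-optimizing
$ swiftc -wmo -O *.swift      # Release: whole-module mode, optimizing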

Primary-file (with and without batching) vs. WMO

This is the most significant variable in how the compiler behaves, so it's worth getting perfectly clear:

  • In primary-file mode, the driver divides the work it has to do between multiple frontend processes, emitting partial results and merging those results when all the frontends finish. Each frontend job itself reads all the files in the module, and focuses on one or more primary file(s) among the set it read, which it compiles, lazily analyzing other referenced definitions from the module as needed. This mode has two sub-modes:

    • In the single-file sub-mode, it runs one frontend job per file, with each job having a single primary.

    • In the batch sub-mode, it runs one frontend job per CPU, identifying an equal-sized “batch” of the module's files as primaries.

  • In whole-module optimization (WMO) mode, the driver runs one frontend job for the entire module, no matter what. That frontend reads all the files in the module once and compiles them all at once.

For example: if your module has 100 files in it:

  • Running swiftc *.swift will compile in single-file mode, and will thus run 100 frontend subprocesses, each of which will parse all 100 inputs (for a total of 10,000 parses), and then each subprocess will (in parallel) compile the definitions in its single primary file.

  • Running swiftc -enable-batch-mode *.swift will compile in batch mode, and on a system with 4 CPUs will run 4 frontend subprocesses, each of which will parse all 100 inputs (for a total of 400 parses), and then each subprocess will (in parallel) compile the definitions of 25 primary files (one quarter of the module in each process).

  • Running swiftc -wmo *.swift will compile in whole-module mode, and will thus run one frontend subprocess, which then reads all 100 files once (for a total of 100 parses) and compiles the definitions in all of them, in order (serially).

Why do multiple modes exist? Because they have different strengths and weaknesses; neither is perfect:

  • Primary-file mode's advantages are that the driver can do incremental compilation by only running frontends for files that it thinks are out of date, as well as running multiple frontend jobs in parallel, making use of multiple cores. Its disadvantage is that each frontend job has to read all the source files in the module before focusing on its primary files of interest, which means that a portion of the frontend jobs' work is, in aggregate, quadratic in the number of jobs. Usually this portion is relatively small and fast, but because it's quadratic, it can easily go wrong. The addition of batch mode was specifically to eliminate this quadratic increase in early work.

  • WMO mode's advantages are that it can do certain optimizations that are only possible when the compiler can see the entire module, and it avoids the quadratic work in the early phases of primary-file mode. Its disadvantages are that it always rebuilds everything, and that it exploits parallelism less well (at least before LLVM IR code generation, which is always multithreaded).

Whole-module mode does enable a set of optimizations that are not possible when compiling in primary-file mode. In particular, in modules with a lot of private dead code, whole-module mode can eliminate the dead code earlier and avoid needless work compiling it, making for both smaller output and faster compilation.

It is therefore possible that, in certain cases (such as with limited available parallelism, or many modules built in parallel), building in whole-module mode with optimization disabled can complete in less time than batched primary-file mode. This scenario depends on many factors, seldom gives a significant advantage, and since using it trades away support for incremental compilation entirely, it is not a recommended configuration.

Amount of optimization

This document isn't the right place to give a detailed overview of the compiler architecture, but it's important to keep in mind that the compiler deals with Swift code in memory in 3 major representations, and can therefore be conceptually divided into 3 major stages, the latter 2 of which behave differently depending on optimization mode:

  • ASTs (Abstract Syntax Trees): this is the representation (defined in the lib/AST directory) closest to what's in a source file, produced from Swift source code, Swift modules and Clang modules (in lib/Parse, lib/Serialization and lib/ClangImporter respectively) and interpreted by resolution, typechecking and high-level semantics functions (in lib/Sema) early on in compilation.

  • SIL (Swift Intermediate Language): this is a form that's private to the Swift compiler, lower-level and more explicit than the AST representation, but still higher-level and more Swift-specific than a machine-oriented representation like LLVM IR. It's defined in lib/SIL, produced by code in lib/SILGen and optionally optimized by code in lib/SILOptimizer.

  • LLVM IR (LLVM Intermediate Representation): this is an abstract representation of the machine language being compiled for; it doesn't contain any Swift-specific knowledge. Rather, it's a form the Swift compiler generates from SIL (in lib/IRGen) and then hands off as input to the LLVM backend, a library upon which the Swift compiler depends. LLVM has its own optional optimizations that apply to LLVM IR before it's lowered to machine code.

When running the Swift compiler in optimizing mode, many SIL and LLVM optimizations are turned on, making those phases of compilation (in each frontend job) take significantly more time and memory. When running in non-optimizing mode, SIL and LLVM IR are still produced and consumed along the way, but only as part of lowering, with comparatively few “simple” optimizations applied.
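
If you want to look at these intermediate forms directly, the driver can stop after each stage and print the result, which can help attribute cost or size to a phase; some illustrative invocations (the file name is a placeholder):

$ swiftc -emit-silgen t.swift    # raw SIL, straight out of SILGen
$ swiftc -emit-sil -O t.swift    # SIL after the (optimizing) SIL pipeline
$ swiftc -emit-ir t.swift        # LLVM IR as handed to the LLVM backend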

Additionally, the IRGen and LLVM phases can operate (and usually are operated) in parallel, using multiple threads in each frontend job, as controlled by the -num-threads flag. This option only applies to the latter phases, however: the AST and SIL-related phases never run multithreaded.
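
For example, a minimal sketch of a multithreaded whole-module build (the thread count here is arbitrary):

$ swiftc -wmo -O -num-threads 8 *.swift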

The amount of work done to the AST representation (in particular: importing, resolving and typechecking ASTs) does not vary between different optimization modes. However, it does vary significantly between different projects and among seemingly-minor changes to code, depending on the amount of laziness the frontend is able to exploit.

Workload variability, approximation and laziness

While some causes of slow compilation have definite hotspots (which we will get to shortly), one final thing to keep in mind when doing performance analysis is that the compiler tries to be lazy in a variety of ways, and that laziness does not always work: it is driven by certain approximations and assumptions that often err on the side of doing more work than strictly necessary.

The outcome of a failure in laziness is not usually a visible hotspot in a profile: rather, it's the appearance of doing “too much work altogether” across a generally-flat profile. Two areas in particular where this occurs — and where there are significant, ongoing improvements to be made — are in incremental compilation and lazy resolution.

Incremental compilation

As mentioned in the section on primary-file mode, the driver has an incremental mode that can be used to attempt to avoid running frontend jobs entirely. When successful, this is the most effective form of time-saving possible: nothing is faster than a process that doesn't even run.

Unfortunately, judgements about when a file “needs recompiling” are themselves driven by an auxiliary data structure that summarizes the dependencies between files, and this data structure is necessarily a conservative approximation. The approximation is weaker than it should be, and as a result the driver often runs more frontend jobs than it should.
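
Incremental compilation is requested with the -incremental flag, and the driver can explain its scheduling decisions as it makes them; a sketch (note that incremental mode requires an output file map — here a hypothetical ofm.json mapping each source file to its object and swiftdeps paths):

$ swiftc -v -incremental -driver-show-incremental \
    -output-file-map ofm.json *.swift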

Lazy resolution

Swift source files contain names that refer to definitions outside the enclosing file, and frequently outside of the enclosing module. These “external” definitions are resolved lazily from two very different locations (both called “modules”):

  • C/ObjC modules, provided by the Clang importer
  • Serialized Swift modules

Despite their differences, both kinds of modules support laziness in the Swift compiler in one crucial way: they are both indexed binary file formats that permit loading single definitions out of them by name, without having to load the entire contents of the module.

When the Swift compiler manages to be lazy and limit the number of definitions it tries to load from modules, it can be very fast; the file formats support very cheap access. But often the logic in the Swift compiler is unnecessarily conservative about exploiting this potential laziness, and so it loads more definitions than it should.
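
One way to get a rough sense of how many definitions are being loaded is the compiler's unified stats reporter, which writes per-job counters (deserialization counts among them; exact counter names vary by compiler version) as JSON files into a directory of your choosing:

$ mkdir -p /tmp/swift-stats
$ swiftc -stats-output-dir /tmp/swift-stats *.swift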

Summing up: high-level picture of compilation performance

Swift compilation performance varies significantly by at least the following parameters:

  • WMO vs. primary-file (non-WMO) mode, including batching thereof
  • Optimizing vs. non-optimizing mode
  • Quantity of incremental work avoided (if in non-WMO)
  • Quantity of external definitions lazily loaded

When approaching Swift compilation performance, it's important to be aware of these parameters and keep them in mind, as they tend to frame the problem you're analyzing: changing one (or any of the factors influencing them, in a project) will likely completely change the resulting profile.

Known problem areas

These are areas where we know the compiler has room for improvement, performance-wise, where it's worth searching for existing bugs on the topic, finding an existing team member who knows the area, and trying to relate the problem you're seeing to some of the existing strategies and plans for improvement:

  • Incremental mode is over-approximate, runs too many subprocesses.
  • Too many referenced (non-primary-file) definitions are type-checked beyond the point they need to be, during the quadratic phase.
  • Expression type inference solves constraints inefficiently, and can sometimes behave super-linearly or even exponentially.
  • Periodically the analysis phase of a SIL optimization fails to cache overlapping subproblems, causing a super-linear slowdown.
  • Some SIL-to-IR lowerings (e.g. large value types) can generate too much LLVM IR, increasing the time spent in LLVM.

(Subsystem experts: please add further areas of concern here.)
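
Some of these areas can be observed with built-in diagnostic flags. For example, per-function type-checking time (relevant to the expression-type-inference item above) can be surfaced like so; the 100ms threshold is arbitrary:

$ swiftc -Xfrontend -debug-time-function-bodies t.swift
$ swiftc -Xfrontend -warn-long-function-bodies=100 t.swift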

How to diagnose compilation performance problems

Compiler performance analysis breaks down into two broad categories of work, depending on what you're trying to do:

  • Isolating a regression
  • Finding areas that need general improvement

In all cases, it's important to be familiar with several tools and compiler options we have at our disposal. If you know about all these tools, you can skip the following section.

Tools and options

You'll use several tools along the way. These come in 5 main categories:

  • Profilers
  • Diagnostic options built-in to the compiler (timers, counters)
  • Post-processing tools to further analyze diagnostic output
  • Tools to generally analyze the output artifacts of the compiler
  • Tools to minimize the regression range or testcases

Profilers

The basic tool of performance analysis is a profiler, and you will need to learn to use at least one profiler for the purposes of this work. The main two profilers we use are Instruments.app on macOS, and perf(1) on Linux. Both are freely available and extremely powerful; this document will barely scratch the surface of what they can do.

Instruments.app

Instruments is a tool on macOS that ships as part of Xcode. It contains graphical and batch interfaces to a very wide variety of profiling services; see here for more documentation.

The main way we will use Instruments.app is in “Counter” mode, to record and analyze a single run of swiftc. We will also use it in simple push-button interactive mode, as a normal application. While it's possible to run Instruments in batch mode on the command-line, the batch interface is less reliable than running it as an interactive application, and frequently causes lockups or fails to collect data.

Before starting, you should also be sure you are going to profile a version of Swift without DWARF debuginfo; while in theory debuginfo will give a higher-resolution, more-detailed profile, in practice Instruments will often stall out and become unresponsive trying to process the additional detail.

Similarly, be sure that as many applications as possible (especially those with debuginfo themselves!) are closed, so that Instruments has as little additional material to symbolicate as possible. It collects a whole-system profile at very high resolution, so you want to make its life easy by profiling on a quiet machine doing little beyond the task you're interested in.

Once you're ready, follow these steps:

  • Open Xcode.app
  • Click Xcode => Open Developer Tool => Instruments (Once it's open, you might want to pin Instruments.app to the dock for ease of access)
  • Select the Counters profiling template
  • Open a terminal and get prepared to run your test-case
  • Switch back to Instruments.app
  • Press the red record button in the top-left of the instruments panel
  • Quickly switch to your terminal, run the test-case you wish to profile, and as soon as it's finished switch back to Instruments.app and press the stop button.

That's it! You should have a profile gathered.

Ideally you want to get to a situation that looks like this:

[Screenshot: Instruments profile with terminal]

In the main panel you can see a time-sorted set of process and call-frame samples, which you can filter to show only swift processes by typing swift in the Input Filter box at the bottom of the window. Each line in the main panel can be expanded by clicking the triangle at its left, showing the callees as indented sub-frames.

If you hover over the line corresponding to a specific swift process, you'll see a small arrow enclosed in a grey circle to the right of the line. Click on it and Instruments will shift the focus of the main panel to just that process's subtree (and recalculate time-percentages accordingly). Once you're focused on a specific swift process, you can begin looking at its individual stack-frame profile.

In the panel to the right of the main panel, you can see the heaviest stack trace within the currently-selected line of the main panel. If you click on one of the frames in that stack, the main panel will automatically expand every level between the current frame and the frame you clicked on. For example, clicking 11 frames down the hottest stack, on the frame called swift::ModuleFile::getModule, will expand the main panel to show something like this:

[Screenshot: Instruments profile, main panel expanded to swift::ModuleFile::getModule]

Click around a profile by expanding and contracting nodes in the stack tree, and you'll pretty quickly get a feeling for where the program is spending its time. Each line in the main display shows both the cumulative sample count and running time of its subtree (including all of its children), as well as its own frame-specific Self time.

In the example above, it's pretty clear that the compiler is spending 66% of its time in Sema, and the heaviest stack inside there is the time spent deserializing external definitions (which matches a known problem area, mentioned earlier).

If you want to keep notes on what you're seeing while exploring a profile, you can expand and collapse frames until you see a meaningful pattern, then select the displayed set of stack frames and copy them as text (using ⌘-C as usual) and paste it into a text file; whitespace indentation will be inserted in the copied text, to keep the stack structure readable.

If you have two profiles and want to compare them, Instruments does have a mode for direct diffing between profiles, but it doesn't work when the profiles are gathered from different binaries, so for purposes of comparing different swift compilers, you'll typically have to do manual comparison of the profiles.

Perf

Perf is a Linux profiler that runs on the command line. In many Linux distributions it's included in a package called linux-tools that needs to be separately installed. It's small, fast, robust, flexible, and can be easily scripted; the main disadvantages are that it lacks any sort of GUI and only runs on Linux, so you can't use it to diagnose problems in builds that need macOS or iOS frameworks or run under xcodebuild.

Perf is documented on the kernel wiki as well as on Brendan Gregg's website.

Using perf requires access to hardware performance counters, so you cannot use it in most virtual machines (unless they virtualize access to performance counters). Further, you will need root access to give yourself permission to use the profiling interface of the kernel.
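
On most distributions, access is governed by the kernel.perf_event_paranoid sysctl; a typical way to open it up for a profiling session (shown at its most permissive setting — prefer the least-permissive value that works for you):

$ sudo sysctl -w kernel.perf_event_paranoid=-1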

The simplest use of perf just involves running your command under perf stat. This gives high level performance counters including an instructions-executed count, which is a comparatively-stable approximation of total execution cost, and is often enough to pick out a regression when bisecting (see below):

$ perf stat swiftc t.swift

 Performance counter stats for 'swiftc t.swift':

       2140.543052      task-clock (msec)         #    0.966 CPUs utilized
                17      context-switches          #    0.008 K/sec
                 6      cpu-migrations            #    0.003 K/sec
            52,084      page-faults               #    0.024 M/sec
     5,373,530,212      cycles                    #    2.510 GHz
     9,709,304,679      instructions              #    1.81  insn per cycle
     1,812,011,233      branches                  #  846.519 M/sec
        22,026,587      branch-misses             #    1.22% of all branches

       2.216754787 seconds time elapsed

The fact that perf gives relatively stable and precise cost measurements means that it can be made into a useful subroutine when doing other performance-analysis tasks, such as bisecting (see section on git bisect) or reducing (see section on creduce). A shell function like the following is very useful:

count_instructions() {
    # Run the given command 10 times under perf, counting instructions
    # executed. perf writes its counters in CSV form to fd 3, which is
    # routed to stdout, while the command's own stdout and stderr are
    # discarded; cut then keeps just the count column.
    perf stat -x , --log-fd 3    \
      -e instructions -r 10 "$@" \
      3>&1 2>/dev/null 1>&2 | cut -d , -f 1
}
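
With that function defined, getting a single, relatively stable cost number for a compile is a one-liner (the output is the instruction count, averaged by perf over the 10 runs):

$ count_instructions swiftc t.swift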

To gather a full profile with perf (when not just using it as a batch counter), use the perf record and perf report commands; depending on configuration you might need to play with the --call-graph and -e parameters to get a clear picture:

$ perf record -e cycles -c 10000 --call-graph=lbr swiftc t.swift
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.676 MB perf.data (9731 samples) ]

Once recorded, data will be kept in a file called perf.data, which is the default file acted upon by perf report. Running it should give you something like the following textual user interface, which operates similarly to Instruments.app, only using cursor keys: