:orphan:

==================================
Whole Module Optimization in Swift
==================================

State of Whole Module Optimization in Swift 2.0
===============================================

Since Swift 1.2 / Xcode 6.3, the Swift optimizer has included support
for whole module optimization (WMO).

To date (Swift 2.0 / Xcode 7), the differences in the optimization
pipeline and specific optimization passes when WMO is enabled have
been relatively minimal, and have provided high value at low
implementation cost. Examples of this include inferring final on
internal methods, and removing functions that are not referenced
within the module and cannot be referenced from outside the module.
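As a hypothetical illustration of both examples, consider a module
containing the following (the type and function names are invented for
illustration, not drawn from any real codebase):

```swift
// Hypothetical module contents illustrating both optimizations. With
// WMO the compiler sees every use in the module: no subclass of
// `Counter` overrides `increment()`, so the optimizer can infer
// `final` and devirtualize calls to it, and `neverCalled()` is
// internal and unreferenced, so it can be removed entirely.
internal class Counter {
    internal var value = 0

    // Inferred final under WMO: no override exists anywhere in the module.
    internal func increment() -> Int {
        value += 1
        return value
    }
}

// Dead under WMO: internal, with no call site anywhere in the module.
internal func neverCalled() -> Int {
    return 42
}
```

Under separate compilation neither optimization is safe, since another
file could in principle subclass Counter or reference neverCalled.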

Additionally, compiling with WMO has some natural consequences that
require no enhancements to passes. For example, the increased scope of
compilation that results from having the entire module available makes
it possible for the inliner to inline functions that it would
otherwise not be able to inline under normal separate compilation. Other
optimizations similarly benefit, for example generic specialization
(since it has more opportunities to specialize) and function
signature optimization (since it has more call sites to rewrite).
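A minimal sketch of the cross-file case, with invented names. Under
separate compilation of the second file only the declaration of
midpoint is visible; under WMO its body is available, so the call can
be inlined and the generic specialized for Int:

```swift
// Hypothetical code spread across two files of one module.

// File1.swift
internal func midpoint<T: BinaryInteger>(_ a: T, _ b: T) -> T {
    return a + (b - a) / 2
}

// File2.swift
internal func probe(_ lo: Int, _ hi: Int) -> Int {
    // With WMO: inlinable here, and specializable as midpoint<Int>.
    return midpoint(lo, hi)
}
```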


Whole Module Optimization for Swift.Next
========================================

As it stands, WMO provides significant benefit with minimal complexity,
but also has many areas in which it can be improved. Some of these
areas for improvement are architectural, e.g. improvements to the pass
management scheme, while others involve adding new interprocedural
analyses and updating existing passes to make use of the results of
these analyses.

Passes and Pass Management
--------------------------

Our current pass management scheme intersperses module passes and
function passes in a way that results in module passes being run on
functions that are only partially optimized, which is suboptimal. Consider
inlining as one example. If we run the inliner when callee functions
are only partially optimized, we may make different inlining decisions
in a caller than if we ran it on the caller only after its callees had
been fully optimized. This particular issue and others mentioned below
are not specific to WMO per se, but are more generally a problem for
any interprocedural optimization that we currently do (most of which
happen in per-file builds as well).

This also affects interprocedural optimizations where we can compute
summary information for a function and use that information when
optimizing callers of that function. We can obtain better results by
processing strongly connected components (SCCs) of the call graph
rather than individual functions, and running the full optimization
pipeline on a given SCC before moving to the next SCC in a post-order
traversal of the call graph (i.e. bottom-up). A similar approach can
be taken for running transforms that benefit from information that is
propagated in reverse post-order over the call graph (i.e. top-down).
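The bottom-up order falls out naturally from Tarjan's SCC algorithm,
which emits SCCs in post-order: every SCC is produced only after all
SCCs it calls into. A minimal sketch, using an invented string-keyed
call-graph representation rather than the compiler's actual data
structures:

```swift
// Sketch of bottom-up SCC discovery over a call graph using Tarjan's
// algorithm. Names and the graph representation are illustrative.
struct CallGraph {
    var callees: [String: [String]] = [:]
}

func stronglyConnectedComponents(_ g: CallGraph, roots: [String]) -> [[String]] {
    var index: [String: Int] = [:], lowlink: [String: Int] = [:]
    var stack: [String] = [], onStack: Set<String> = []
    var next = 0
    var sccs: [[String]] = []

    func visit(_ v: String) {
        index[v] = next; lowlink[v] = next; next += 1
        stack.append(v); onStack.insert(v)
        for w in g.callees[v] ?? [] {
            if index[w] == nil {
                visit(w)
                lowlink[v] = min(lowlink[v]!, lowlink[w]!)
            } else if onStack.contains(w) {
                lowlink[v] = min(lowlink[v]!, index[w]!)
            }
        }
        if lowlink[v] == index[v] {
            // v is the root of an SCC; pop its members off the stack.
            var scc: [String] = []
            while true {
                let w = stack.removeLast()
                onStack.remove(w)
                scc.append(w)
                if w == v { break }
            }
            sccs.append(scc)
        }
    }

    for r in roots where index[r] == nil { visit(r) }
    return sccs  // post-order: an SCC appears after every SCC it calls
}

// main calls helper; helper and util are mutually recursive.
var g = CallGraph()
g.callees = ["main": ["helper"], "helper": ["util"], "util": ["helper"]]
let order = stronglyConnectedComponents(g, roots: ["main"])
// Running the full pipeline over `order` in sequence means that by the
// time we reach an SCC, everything it calls is already fully optimized.
```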

Processing one SCC at a time versus one function at a time is
primarily for the benefit of improved analyses. For example, consider
escape analysis. If there is an SCC in the call graph and we process
one function at a time, there are cases where we would have to be
pessimistic and assume a value escapes, when in fact the value may be
used within the SCC such that it never escapes. The same pessimism can
arise in other analyses, e.g. dead argument analysis.
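A hypothetical Swift example of the escape-analysis case (the function
names are invented): ping and pong form an SCC in which a reference is
handed back and forth but never stored. Analyzing either function
alone, the analysis must pessimistically assume the call to the SCC
sibling lets the value escape; analyzing the SCC as a unit shows it
does not.

```swift
final class Buffer {
    var total = 0
}

func ping(_ buffer: Buffer, _ n: Int) -> Int {
    if n == 0 { return buffer.total }
    buffer.total += n
    return pong(buffer, n - 1)   // circulates within the SCC, never stored
}

func pong(_ buffer: Buffer, _ n: Int) -> Int {
    return ping(buffer, n)       // likewise: passed on, never escapes
}
```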

In our current set of passes, several are implemented as
SILModuleTransforms but simply iterate over the functions in the
module in whatever order they happen to appear, and do not currently
appear to benefit from being module passes (although some would
benefit from optimizing the functions in bottom-up order on the call
graph). Other SILModuleTransforms already walk the functions
bottom-up in the call graph, and do gain some benefit from doing so.

Moving forward we should eliminate the notion of module passes and
instead have SCC passes alongside the existing function passes. We
should change the pass manager to run the SCC passes as a bottom-up
traversal of the call graph. As previously mentioned, we may also want
the flexibility to run passes in a top-down order (so that we could
create a pass manager with passes that benefit from running in this
order, not because we would want to run an arbitrary pass in that
order).
For each SCC we would run the entire set of passes before moving on to
the next SCC, so that when we go to optimize an SCC, any functions it
calls (when going bottom-up, or that call into it when going top-down)
have already been fully optimized. As part of this, analyses that
benefit from looking at an SCC at a time will need to be modified to
do so. Existing analyses that would benefit from looking at an SCC at
a time but have not yet been updated to do so can be run on each
function in the SCC in turn, producing potentially pessimistic
results. In time these can be updated to analyze an entire
SCC. Likewise, function passes would be run on each function in the
SCC in turn. SCC function passes would be handed an entire SCC and be
able to ask for the analysis results for that SCC, but could process
each function individually based on those results.

**TBD: Do we really need SCC transforms at all, or is it sufficient to
simply have function transforms that are always passed an SCC, and
have them ask for the results of an analysis for the entire SCC and
then iterate over all functions in the SCC?**

In some cases we have transforms that generate new work in a top-down
fashion, for example the devirtualizer, as well as any pass that
clones functions. These can be handled by allowing function passes to
push new work onto the top of the stack of work items; upon finishing
a pass, the pass pipeline is restarted with those new functions at the
top of the work stack and the previous function buried beneath them,
to be reprocessed after all of its callees are processed.
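The work-stack scheme above can be sketched as follows; the pipeline
driver, pass signature, and function names are all invented for
illustration:

```swift
// Sketch of the restartable work stack. A pass may push newly created
// functions (e.g. specialized clones) as new work; the driver then
// restarts the pipeline with those functions on top of the stack, and
// the original function is reprocessed only after they finish.
func runPipeline(initial: [String],
                 passes: [(String, inout [String]) -> Void]) -> [String] {
    var workStack = Array(initial.reversed())  // top = last element
    var completed: [String] = []

    while let function = workStack.last {
        var newWork: [String] = []
        for p in passes {
            p(function, &newWork)
            if !newWork.isEmpty { break }  // restart with new work on top
        }
        if newWork.isEmpty {
            workStack.removeLast()
            completed.append(function)
        } else {
            workStack.append(contentsOf: newWork)  // bury current function
        }
    }
    return completed
}

var cloned = false
let pipelineOrder = runPipeline(
    initial: ["caller"],
    passes: [{ function, newWork in
        // A cloning pass: the first time it sees "caller" it creates
        // "caller_specialized" and pushes it on top of the stack.
        if function == "caller" && !cloned {
            cloned = true
            newWork.append("caller_specialized")
        }
    }])
// The clone is fully processed before "caller" is revisited.
```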

**TBD: This may result in re-running some early passes multiple times
for any given function, and it may mean we want to front-load the
passes that generate new callees like this so that they are early in
the pipeline. We could also choose not to rerun portions of the
pipeline on a function that has already had some processing done on
it.**

SCC-based analyses
------------------

There are a variety of analyses that can be run on an SCC to produce
better information than can be produced by looking at a single
function at a time, even when processing in (reverse) post-order on
the call graph. For example, a dead argument analysis can provide
information about arguments that are never actually used, making it
possible for optimizations like DCE and dead-store elimination (which
we do as part of load/store optimization) to eliminate code,
independent of a pass like function signature optimization running
(and thus we eliminate a phase-ordering issue). It benefits from
looking at an SCC at a time because some arguments passed into the SCC
might be passed around within the SCC, but are never used in any other
way by a function within the SCC (and are never passed outside of the
SCC).
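A hypothetical example of that dead-argument shape (invented names):
context is passed into the SCC formed by even/odd and circulated
around it, but no function in the SCC ever reads it or passes it
elsewhere. Per-function analysis sees it flow into a call to an SCC
sibling and must conservatively keep it; SCC-wide analysis proves it
dead, so DCE can drop it.

```swift
func even(_ n: Int, _ context: [String: Int]) -> Bool {
    if n == 0 { return true }
    return odd(n - 1, context)   // context only circulates within the SCC
}

func odd(_ n: Int, _ context: [String: Int]) -> Bool {
    if n == 0 { return false }
    return even(n - 1, context)  // never read, never passed outside
}
```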

Similarly, escape analysis, alias analysis, and mod/ref analysis
benefit from analyzing an SCC rather than a function.

This list is not meant to be exhaustive, but is probably a good
initial set of analyses to plan for once we have the new pass
framework up.

Incremental Whole Module Optimization
-------------------------------------

Building with WMO enabled can result in slower build times due to
reduced parallelization of the build process. We currently parallelize
some of the last-mile optimization and code generation through LLVM,
but do not attempt to parallelize any of the SIL passes (see the next
section).

With that in mind, it seems very worthwhile to examine doing
incremental WMO in order to minimize the amount of work done for each
compile.

One way to accomplish this is to serialize function bodies after
canonical SIL is produced (i.e. after the mandatory passes) and only
reoptimize those functions which have changed between builds. Doing
just this, however, is not sufficient, since we have used information
gleaned from examining other functions, and we have also done things
like inline functions into one another. As a result, we need to track
dependencies between functions in order to do incremental compilation
properly during WMO.

Some approaches to tracking these dependencies could be very
burdensome, requiring passes to explicitly track exactly which
information they actually use during optimization. This seems error
prone and difficult to maintain.

Another approach might be to recompile adjacent functions in the call
graph when a given function changes. This might be somewhat practical
if we only had analyses which propagate information bottom-up, but it
would be more expensive than necessary, and impractical if we also
have analyses that propagate information top-down, since that could
result in a full recompile of the module in the worst case.

A more reasonable approach would be to serialize the results of the
interprocedural analyses at the end of the pass pipeline, and use
these serialized results to drive some of the dependency tracking
(along with some manual tracking, e.g. tracking which functions are
inlined at which call sites). These serialized analysis results would
then be compared against the results of running the same analyses at
the end of the compilation pipeline on any function which has changed
since the previous compile. If the results of an analysis change, the
functions which use those results would also need to be recompiled.
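A minimal sketch of that invalidation step, with invented summary and
function names; a real implementation would also propagate the
dirtying transitively to a fixpoint, which this one-level sketch
omits:

```swift
// Compare the previous build's serialized analysis summaries against
// freshly computed ones for changed functions; any function that
// consumed a changed summary must also be recompiled.
struct AnalysisSummary: Equatable {
    var deadArguments: Set<Int>
    var escapes: Bool
}

func functionsToRecompile(
    previous: [String: AnalysisSummary],
    current: [String: AnalysisSummary],
    changed: Set<String>,
    dependents: [String: Set<String>]   // consumers of each summary
) -> Set<String> {
    var dirty = changed
    for function in changed where previous[function] != current[function] {
        dirty.formUnion(dependents[function] ?? [])
    }
    return dirty
}

let dirty = functionsToRecompile(
    previous: ["callee": AnalysisSummary(deadArguments: [1], escapes: false)],
    current: ["callee": AnalysisSummary(deadArguments: [], escapes: false)],
    changed: ["callee"],
    dependents: ["callee": ["callerA", "callerB"]])
// dirty holds "callee" plus both callers that consumed its summary.
```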

**TBD: Properly tracking dependencies for functions generated from
other functions via cloning. Is this any different from tracking for
inlining?**

Parallel Whole Module Optimization
----------------------------------

We could also explore the possibility of doing more work in parallel
during WMO builds. For example, it may be feasible to run the SIL
optimization passes in parallel. It may also be feasible to do IRGen
in parallel, although there are shared mutable structures that would
need to be guarded.

It's TBD whether this is actually going to be practical and
worthwhile, but it seems worth investigating and scoping out the work
involved to some first level of approximation.