docs/proposals/WholeModuleOptimization.rst - third_party/swift - Git at Google

 :orphan:

 ==================================
 Whole Module Optimization in Swift
 ==================================

 State of Whole Module Optimization in Swift 2.0
 ===============================================

 Since Swift 1.2 / Xcode 6.3, the Swift optimizer has included support
 for whole module optimization (WMO).

 To date (Swift 2.0 / Xcode 7), the differences in the optimization
 pipeline and specific optimization passes when WMO is enabled have
 been relatively minimal, and have provided high value at low
 implementation cost. Examples of this include inferring final on
 internal methods, and removing functions that are not referenced
 within the module and cannot be referenced from outside the module.

 Additionally, compiling with WMO has some natural consequences that
 require no enhancements to passes. For example the increased scope of
 compilation that results from having the entire module available makes
 it possible for the inliner to inline functions that it would
 otherwise not be able to inline in normal separate compilation. Other
 optimizations similarly benefit, for example generic specialization
 (since it has more opportunities for specialize) and function
 signature optimization (since it has more call sites to rewrite).


 Whole Module Optimization for Swift.Next
 ========================================

 As it stands, WMO provides significant benefit with minimal complexity,
 but also has many areas in which it can be improved. Some of these
 areas for improvement are architectural, e.g. improvements to the pass
 management scheme, while others involve adding new interprocedural
 analyses and updating existing passes to make use of the results of
 these analyses.

 Passes and Pass Management
 --------------------------

 Our current pass management scheme intersperses module passes and
 function passes in a way that results in module passes being run on
 functions that are partially optimized, which is suboptimal. Consider
 inlining as one example. If we run the inliner when callee functions
 are only partially optimized we may make different inlining decisions
 in a caller than if we ran it on a caller only after its callees have
 been fully optimized. This particular issue and others mentioned below
 are not specific to WMO per-se, but are more generally a problem for
 any interprocedural optimization that we currently do (most of which
 happen in per-file builds as well).

 This also affects interprocedural optimizations where we can compute
 summary information for a function and use that information when
 optimizing callers of that function. We can obtain better results by
 processing strongly connected components (SCCs) of the call graph
 rather than individual functions, and running the full optimization
 pipeline on a given SCC before moving to the next SCC in a post-order
 traversal of the call graph (i.e. bottom-up). A similar approach can
 be taken for running transforms that benefit from information that is
 propagated in reverse-post-order in the call graph (i.e. top-down).

 Processing one SCC at a time versus one function at a time is
 primarily for the benefit of improved analyses. For example consider
 escape analysis. If there is an SCC in the call graph and we process
 one function at a time, there are cases where we would have to be
 pessimistic and assume a value escapes, when in fact the value may be
 used within the SCC such that it never escapes. The same pessimism can
 happen in other analyses, e.g. dead argument analysis.

 In our current set of passes, several are implemented as
 SILModuleTransforms but simply iterate over the functions in the
 module in whatever order they happen to be in and do not appear to
 benefit from being module passes at this time (although some would
 benefit from optimizing the functions in bottom-up order in the call
 graph). Other SILModuleTransforms currently walk the functions
 bottom-up in the call graph, and do gain some benefit from doing so.

 Moving forward we should eliminate the notion of module passes and
 instead have SCC passes as well as the existing function passes. We
 should change the pass manager to run the SCC passes as a bottom-up
 traversal of the call graph. As previously mentioned we may also want
 to consider having the flexibility of being able to run passes in a
 top-down order (so we could create a pass manager with passes that
 benefit from running in this order, not because we would want to run
 any arbitrary pass in that order).

 For each SCC we would run the entire set of passes before moving on to
 the next SCC, so when we go to optimize an SCC any functions that it
 calls (when going bottom-up - or calls into it when going top-down)
 have been fully optimized. As part of this, analyses that benefit from
 looking at an SCC-at-a-time will need to be modified to do
 so. Existing analyses that would benefit from looking at an
 SCC-at-a-time but have not yet been updated to do so can be run on
 each function in the SCC in turn, producing potentially pessimistic
 results. In time these can be updated to analyze an entire
 SCC. Likewise function passes would be run on each function in the SCC
 in turn. SCC function passes would be handed an entire SCC and be able
 to ask for the analysis results for that SCC, but can process each
 function individually based on those results.

 **TBD: Do we really need SCC transforms at all, or is it sufficient to
 simply have function transforms that are always passed an SCC, and
 have them ask for the results of an analysis for the entire SCC and
 then iterate over all functions in the SCC?**

 In some cases we have transforms that generate new work in a top-down
 fashion, for example the devirtualizer as well as any pass that
 clones. These can be handled by allowing function passes to push new
 work at the top of the stack of work items, and then upon finishing a
 pass the pass pipeline will be restarted with those new functions at
 the top of the work stack, and the previous function buried beneath
 them, to be reprocessed after all the callees are processed.

 **TBD: This may result in re-running some early passes multiple times
 for any given function, and it may mean we want to front-load the
 passes that generate new callees like this so that they are early in
 the pipeline. We could also choose not to rerun portions of the
 pipeline on the function that's already had some processing done on
 it.***

 SCC-based analyses
 ------------------

 There are a variety of analyses that can be done on an SCC to produce
 better information than can be produced by looking at a single
 function at a time, even when processing in (reverse-)post-order on
 the call graph. For example, a dead argument analysis can provide
 information about arguments that are not actually used, making it
 possible for optimizations like DCE and dead-store optimization (which
 we do as part of load/store opts) to eliminate code, independent of a
 pass like function signature optimization running (and thus we
 eliminate a phase-ordering issue). It benefits from looking at an
 SCC-at-a-time because some arguments getting passed into the SCC might
 be passed around the SCC, but are never used in any other way by a
 function within the SCC (and are never passed outside of the SCC).

 Similarly, escape analysis, alias analysis, and mod/ref analysis
 benefit from analyzing an SCC rather than a function.

 This list is not meant to be exhaustive, but is probably a good
 initial set of analyses to plan for once we have the new pass
 framework up.

 Incremental Whole Module Optimization
 -------------------------------------

 Building with WMO enabled can result in slower build times due to
 reduced parallelization of the build process. We currently parallelize
 some of the last-mile optimization and code generation through LLVM,
 but do not attempt to parallelize any of the SIL passes (see the next
 section).

 With that in mind, it seems very worthwhile to examine doing
 incremental WMO in order to minimize the amount of work done for each
 compile.

 One way to accomplish this is to serialize function bodies after
 canonical SIL is produced (i.e. after the mandatory passes) and only
 reoptimize those functions which change between builds. Doing just
 this, however, is not sufficient since we've used information gleaned
 from examining other functions, and we've also done things like inline
 functions into one another. As a result, we need to track dependencies
 between functions in order to properly do incremental compilation
 during WMO.

 Some approaches to tracking these dependencies could be very
 burdensome, requiring passes to explicitly track exactly which
 information they actually use during optimization. This seems error
 prone and difficult to maintain.

 Another approach might be to recompile adjacent functions in the call
 graph when a given function changes. This might be somewhat practical
 if we only have analyses which propagate information bottom-up, but it
 would be more expensive than necessary, and impractical if we also
 have analyses that propagate information top-down since it could
 result in a full recompile of the module in the worst case.

 A more reasonable approach would be to serialize the results of the
 interprocedural analyses at the end of the pass pipeline, and use
 these serialized results to drive some of the dependency tracking
 (along with some manual tracking, e.g. tracking which functions are
 inlined at which call sites). These serialized analysis results would
 then be compared against the results of running the same analyses at
 the end of the compilation pipeline on any function which has changed
 since the previous compile. If the results of an analysis changes, the
 functions which use the results of that analysis would also need to be
 recompiled.

 **TBD: Properly tracking dependencies for functions generated from
 other functions via cloning. Is this any different from tracking for
 inlining?**

 Parallel Whole Module Optimization
 ----------------------------------

 We could also explore the possibility of doing more work in parallel
 during WMO builds. For example, it may be feasible to run the SIL
 optimization passes in parallel. It may also be feasible to do IRGen
 in parallel, although there are shared mutating structures that would
 need to be guarded.

 It's TBD whether this is actually going to be practical and
 worthwhile, but it seems worth investigating and scoping out the work
 involved to some first-level of approximation.
	:orphan:

	==================================
	Whole Module Optimization in Swift
	==================================

	State of Whole Module Optimization in Swift 2.0
	===============================================

	Since Swift 1.2 / Xcode 6.3, the Swift optimizer has included support
	for whole module optimization (WMO).

	To date (Swift 2.0 / Xcode 7), the differences in the optimization
	pipeline and specific optimization passes when WMO is enabled have
	been relatively minimal, and have provided high value at low
	implementation cost. Examples of this include inferring final on
	internal methods, and removing functions that are not referenced
	within the module and cannot be referenced from outside the module.

	Additionally, compiling with WMO has some natural consequences that
	require no enhancements to passes. For example the increased scope of
	compilation that results from having the entire module available makes
	it possible for the inliner to inline functions that it would
	otherwise not be able to inline in normal separate compilation. Other
	optimizations similarly benefit, for example generic specialization
	(since it has more opportunities for specialize) and function
	signature optimization (since it has more call sites to rewrite).


	Whole Module Optimization for Swift.Next
	========================================

	As it stands, WMO provides significant benefit with minimal complexity,
	but also has many areas in which it can be improved. Some of these
	areas for improvement are architectural, e.g. improvements to the pass
	management scheme, while others involve adding new interprocedural
	analyses and updating existing passes to make use of the results of
	these analyses.

	Passes and Pass Management
	--------------------------

	Our current pass management scheme intersperses module passes and
	function passes in a way that results in module passes being run on
	functions that are partially optimized, which is suboptimal. Consider
	inlining as one example. If we run the inliner when callee functions
	are only partially optimized we may make different inlining decisions
	in a caller than if we ran it on a caller only after its callees have
	been fully optimized. This particular issue and others mentioned below
	are not specific to WMO per-se, but are more generally a problem for
	any interprocedural optimization that we currently do (most of which
	happen in per-file builds as well).

	This also affects interprocedural optimizations where we can compute
	summary information for a function and use that information when
	optimizing callers of that function. We can obtain better results by
	processing strongly connected components (SCCs) of the call graph
	rather than individual functions, and running the full optimization
	pipeline on a given SCC before moving to the next SCC in a post-order
	traversal of the call graph (i.e. bottom-up). A similar approach can
	be taken for running transforms that benefit from information that is
	propagated in reverse-post-order in the call graph (i.e. top-down).

	Processing one SCC at a time versus one function at a time is
	primarily for the benefit of improved analyses. For example consider
	escape analysis. If there is an SCC in the call graph and we process
	one function at a time, there are cases where we would have to be
	pessimistic and assume a value escapes, when in fact the value may be
	used within the SCC such that it never escapes. The same pessimism can
	happen in other analyses, e.g. dead argument analysis.

	In our current set of passes, several are implemented as
	SILModuleTransforms but simply iterate over the functions in the
	module in whatever order they happen to be in and do not appear to
	benefit from being module passes at this time (although some would
	benefit from optimizing the functions in bottom-up order in the call
	graph). Other SILModuleTransforms currently walk the functions
	bottom-up in the call graph, and do gain some benefit from doing so.

	Moving forward we should eliminate the notion of module passes and
	instead have SCC passes as well as the existing function passes. We
	should change the pass manager to run the SCC passes as a bottom-up
	traversal of the call graph. As previously mentioned we may also want
	to consider having the flexibility of being able to run passes in a
	top-down order (so we could create a pass manager with passes that
	benefit from running in this order, not because we would want to run
	any arbitrary pass in that order).

	For each SCC we would run the entire set of passes before moving on to
	the next SCC, so when we go to optimize an SCC any functions that it
	calls (when going bottom-up - or calls into it when going top-down)
	have been fully optimized. As part of this, analyses that benefit from
	looking at an SCC-at-a-time will need to be modified to do
	so. Existing analyses that would benefit from looking at an
	SCC-at-a-time but have not yet been updated to do so can be run on
	each function in the SCC in turn, producing potentially pessimistic
	results. In time these can be updated to analyze an entire
	SCC. Likewise function passes would be run on each function in the SCC
	in turn. SCC function passes would be handed an entire SCC and be able
	to ask for the analysis results for that SCC, but can process each
	function individually based on those results.

	**TBD: Do we really need SCC transforms at all, or is it sufficient to
	simply have function transforms that are always passed an SCC, and
	have them ask for the results of an analysis for the entire SCC and
	then iterate over all functions in the SCC?**

	In some cases we have transforms that generate new work in a top-down
	fashion, for example the devirtualizer as well as any pass that
	clones. These can be handled by allowing function passes to push new
	work at the top of the stack of work items, and then upon finishing a
	pass the pass pipeline will be restarted with those new functions at
	the top of the work stack, and the previous function buried beneath
	them, to be reprocessed after all the callees are processed.

	**TBD: This may result in re-running some early passes multiple times
	for any given function, and it may mean we want to front-load the
	passes that generate new callees like this so that they are early in
	the pipeline. We could also choose not to rerun portions of the
	pipeline on the function that's already had some processing done on
	it.***

	SCC-based analyses
	------------------

	There are a variety of analyses that can be done on an SCC to produce
	better information than can be produced by looking at a single
	function at a time, even when processing in (reverse-)post-order on
	the call graph. For example, a dead argument analysis can provide
	information about arguments that are not actually used, making it
	possible for optimizations like DCE and dead-store optimization (which
	we do as part of load/store opts) to eliminate code, independent of a
	pass like function signature optimization running (and thus we
	eliminate a phase-ordering issue). It benefits from looking at an
	SCC-at-a-time because some arguments getting passed into the SCC might
	be passed around the SCC, but are never used in any other way by a
	function within the SCC (and are never passed outside of the SCC).

	Similarly, escape analysis, alias analysis, and mod/ref analysis
	benefit from analyzing an SCC rather than a function.

	This list is not meant to be exhaustive, but is probably a good
	initial set of analyses to plan for once we have the new pass
	framework up.

	Incremental Whole Module Optimization
	-------------------------------------

	Building with WMO enabled can result in slower build times due to
	reduced parallelization of the build process. We currently parallelize
	some of the last-mile optimization and code generation through LLVM,
	but do not attempt to parallelize any of the SIL passes (see the next
	section).

	With that in mind, it seems very worthwhile to examine doing
	incremental WMO in order to minimize the amount of work done for each
	compile.

	One way to accomplish this is to serialize function bodies after
	canonical SIL is produced (i.e. after the mandatory passes) and only
	reoptimize those functions which change between builds. Doing just
	this, however, is not sufficient since we've used information gleaned
	from examining other functions, and we've also done things like inline
	functions into one another. As a result, we need to track dependencies
	between functions in order to properly do incremental compilation
	during WMO.

	Some approaches to tracking these dependencies could be very
	burdensome, requiring passes to explicitly track exactly which
	information they actually use during optimization. This seems error
	prone and difficult to maintain.

	Another approach might be to recompile adjacent functions in the call
	graph when a given function changes. This might be somewhat practical
	if we only have analyses which propagate information bottom-up, but it
	would be more expensive than necessary, and impractical if we also
	have analyses that propagate information top-down since it could
	result in a full recompile of the module in the worst case.

	A more reasonable approach would be to serialize the results of the
	interprocedural analyses at the end of the pass pipeline, and use
	these serialized results to drive some of the dependency tracking
	(along with some manual tracking, e.g. tracking which functions are
	inlined at which call sites). These serialized analysis results would
	then be compared against the results of running the same analyses at
	the end of the compilation pipeline on any function which has changed
	since the previous compile. If the results of an analysis changes, the
	functions which use the results of that analysis would also need to be
	recompiled.

	**TBD: Properly tracking dependencies for functions generated from
	other functions via cloning. Is this any different from tracking for
	inlining?**

	Parallel Whole Module Optimization
	----------------------------------

	We could also explore the possibility of doing more work in parallel
	during WMO builds. For example, it may be feasible to run the SIL
	optimization passes in parallel. It may also be feasible to do IRGen
	in parallel, although there are shared mutating structures that would
	need to be guarded.

	It's TBD whether this is actually going to be practical and
	worthwhile, but it seems worth investigating and scoping out the work
	involved to some first-level of approximation.