MultiSource/Applications/treecc/doc/essay.html - third_party/llvm-test-suite - Git at Google

 <html>
 <head>
 <title>Treecc: An Aspect-Oriented Approach to Writing Compilers</title>
 </head>
 <body bgcolor="#ffffff">
 <h1>Treecc: An Aspect-Oriented Approach to Writing Compilers</h1>

 Rhys Weatherley, <a href="mailto:rweather@southern-storm.com.au">rweather@southern-storm.com.au</a>.<p>

 Copyright &copy; 2001, 2002 Rhys Weatherley<br>
 Verbatim copying and distribution of this entire article is permitted
 in any medium, provided this copyright notice is preserved.<p>

 This is a copy of an article that was published in
 <a href="http://www.rons.net.cn/english/FSM/issue02">Issue 2</a>
 of Free Software Magazine.<p>

 <h2>1. Introduction</h2>

 The C# compiler in Portable.NET <a href="#pnetref">[1]</a> is built on top
 of the "Tree Compiler Compiler" (treecc) utility program.
 Treecc <a href="#treeccref">[2]</a> is distributed as Free Software
 under the terms of the GNU General Public License.<p>

 This article provides some background of why treecc came about.  It discusses
 two common compiler implementation techniques, and the reasons why they
 often fail to manage the complexity of large programming languages like C#.<p>

 We then discuss a new programming technique called "Aspect-Oriented
 Programming".  Treecc is an example of applying this technique to managing
 the complexity of compiler construction.<p>

 <h2>2. Patterns and Compiler Design</h2>

 Compiler writing is generally seen as a black art, but in reality it isn't
 all that hard.  The basic compilation steps are:<p>

 <ol>
 	<li>Convert the program into an abstract syntax tree.</li>
 	<li>Perform type-checking and semantic analysis on the tree.</li>
 	<li>Rearrange the tree to perform optimisations.</li>
 	<li>Convert the tree into the target code.</li>
 </ol>

 The difficulty in writing compilers is not the steps involved, but rather
 the sheer number of tiny little details to keep straight.  Modern languages
 contain large numbers of operators, which are all very similar, but slightly
 different:<p>

 <blockquote>
 Add, substract, multiply, divide, remainder, shift left, shift right,
 shift right unsigned, bitwise and, bitwise or, exclusive-or, negate,
 bitwise not, logical not, logical and, logical or, ...
 </blockquote>

 And that's just the operators.  Introduce statements, arrays, type coercion,
 method invocation, and class definition, and it becomes very easy to
 forget something amongst the forest of code.<p>

 In an attempt to control this complexity, two common pattern-based
 approaches have arisen over the years: Inheritance and Visitor.<p>

 The inheritance pattern can be characterised as follows:<p>

 <ol>
 	<li>Declare a node type for every syntactic element in the language.
 	    All types ultimately inherit from "<code>Node</code>".</li>
 	<li>Declare virtual methods in "<code>Node</code>" for
 	    operations on node types: semantic analysis, optimization,
 		code generation, etc.</li>
 	<li>Override the virtuals in sub-classes to provide the
 	    compiler implementation.</li>
 </ol><p>

 The visitor pattern can be characterised as follows:<p>

 <ol>
 	<li>Declare a node type for every syntactic element in the language.
 	    All types ultimately inherit from "<code>Node</code>".</li>
 	<li>Declare a "<code>Visitor</code>" class with abstract virtual
 		methods such as "<code>VisitAdd</code>", "<code>VisitSub</code>",
 		"<code>VisitIf</code>", "<code>VisitFunction</code>",
 		etc for all of the node types.</li>
 	<li>Define a "walking procedure" over "<code>Node</code>" objects for
 	    walking around a syntax tree, making callbacks on a supplied
 		visitor object.</li>
 	<li>Create multiple classes that inherit from "<code>Visitor</code>",
 	    one for each operation.  Implement the "<code>Visit*</code>"
 		functions for that type of operation.</li>
 </ol>

 In the following sections, we will explore why these two patterns often
 fail to manage compiler complexity, even when the programmer applies them
 rigorously.<p>

 <h2>3. The implementation language is your worst enemy</h2>

 We will start with the inheritance pattern.  Consider that we've written
 the following C# code during the "declare all the node types" phase of
 the project:<p>

 <blockquote>
 <pre>public class UnaryExpression : Expression
 {
     protected Expression expr;

     public UnaryExpression(Expression _expr) { expr = _expr; }
 }

 public class NegateExpression : UnaryExpression
 {
     public NegateExpression(Expression _expr) : base(_expr) {}
 }

 public class UnaryPlusExpression : UnaryExpression
 {
     public UnaryPlusExpression(Expression _expr) : base(_expr) {}
 }

 public class BitwiseNotExpression : UnaryExpression
 {
     public BitwiseNotExpression(Expression _expr) : base(_expr) {}
 }</pre>
 </blockquote>

 We continue this process for several hundred other node types, gradually
 building up the entire syntax tree.  This will probably take several weeks
 to complete for a substantial language like C#, assuming that we are writing
 the parser alongside the node types.<p>

 We now want to go in and implement type-checking, so we modify the
 "<code>UnaryExpression</code>" class as follows:<p>

 <blockquote>
 <pre>public class UnaryExpression : Expression
 {
     protected Expression expr;

     public UnaryExpression(Expression _expr) { expr = _expr; }

     public override LanguageType TypeCheck()
     {
         LanguageType type = expr.TypeCheck();
         if(type.IsInteger() || type.IsFloat())
         {
             return type;
         }
         throw new TypeCheckException();
     }
 }</pre>
 </blockquote>

 We've put the common unary expression type-checking code in a common base
 class.  This makes it easier to maintain because there is only one copy.
 We continue the process over the next few weeks and months for all the
 other operators, statements, and declarations in the language.  So far,
 so good.<p>

 Unfortunately, we've made a mistake.  The "<code>BitwiseNot</code>"
 operator is only legal on integer values; not floating-point.<p>

 But will we find this bug?  It was several weeks ago when we first wrote
 the "<code>BitwiseNotExpression</code>" class, and we have since forgotten
 all about it.  It may even have been written by another programmer on the
 team, who has also forgotten all about it.  When we build our compiler,
 we don't get any errors because the implemention language is perfectly
 happy with the above code.<p>

 Surely testing will find it?  We are building a test suite alongside
 the code, right?  Unfortunately, that doesn't help either.  The test
 suite for a major language will be at least as complex as the compiler
 itself, and so there is always the temptation to abstract common tests
 into common test classes.  We've just shifted the bug into the test suite
 and given ourselves a false sense of security.<p>

 So we keep coding for several more months, adding lots more code.  And then
 a really nasty bug pops up.  The "<code>BitwiseNot</code>" operator is acting
 strangely, and we have no idea why.  The system is now so complex, with
 so many common base classes implementing fallback defaults, that tracking
 this down becomes very hard.<p>

 The inheritance pattern has a fatal flaw.  Adding a new operation entails
 a very large maintainence burden, because hundreds of classes must be
 modified.  This is very error-prone, so we try to abstract details into
 common base classes.  But this introduces other errors.<p>

 The problem basically boils down to semantic analysis: the implementation
 language does not have enough knowledge about the application domain to
 spot the problem and warn us about it.  So it happily compiles the code
 and leaves us to hang ourselves on the system's complexity later.
 Programming languages are supposed to help us manage complexity, not
 make the problem worse!<p>

 I wrote a number of compilers using the inheritance approach, and every
 single time the complexity killed me.  I needed fallback defaults for
 code maintainence reasons, but using fallbacks introduced massive numbers
 of bugs.  I was stuck.<p>

 <h2>4. Design patterns aren't always what they are cracked up to be</h2>

 After much hair-pulling, I searched <em>Design Patterns</em>
 by Gamma, et al <a href="#gammaref">[3]</a>.  "<em>Is there a better
 way of doing this?</em>".  Visitor patterns are the answer:
 the book even gives a compiler example.<p>

 Visitors solve the "<em>I forgot to implement the <code>BitwiseNot</code></em>"
 problem.  Because every node type has its own "<code>Visit*</code>" method,
 we will get an error when we try to build the compiler without implementing
 the operation for a node type.  Of course, this assumes that we haven't
 been dumb and implemented fallback defaults in the "<code>Visitor</code>"
 base class.<p>

 Instead of using virtual methods, we can implement visitors using
 switch statements over node types.  e.g.<p>

 <blockquote>
 <pre>switch(node.type)
 {
     case Negate: ...
     case UnaryPlus: ...
     case BinaryNot: ...
     ...
 }</pre>
 </blockquote>

 However, switch statements have a similar flaw to inheritance: the
 implementation language will not warn us if we forget to put in
 a case for every node type.  It will happily fall out through the
 "<code>default</code>" case with no warning.  Tracking down these bugs
 can be just as hard as tracking down inheritance fallback bugs.
 Using virtual methods is "safer", if not quite as efficient.<p>

 Unfortunately, there is a catch with visitors, as explained in
 <em>Design Patterns</em>:<p>

 <blockquote>
 <em>Use the Visitor pattern when ... the classes defining the object
 structure rarely change, but you often want to define new operations
 over the structure.  Changing the object structure classes requires
 redefining the interface to all visitors, which is potentially
 costly.  If the object structure classes change often, then it's
 probably better to define the operations in those classes.</em>
 </blockquote>

 During the early stages of writing a compiler, the node types change very
 frequently.  This activates the Achilles heel of the Visitor pattern,
 and creates a maintainence nightmare.  The book suggests that we should
 use the inheritance approach to solve this problem.<p>

 <h2>5. So which one do we use?  Inheritance or Visitor?</h2>

 The inheritance pattern becomes a problem when new operations are needed.
 The solution is visitors.  Visitors become a problem when new node types
 are needed.  The solution is inheritance.<p>

 What we have is a situation that the design patterns gurus didn't
 consider: if the set of nodes and operations are both changing rapidly,
 then we will have problems no matter what we do.<p>

 We need a solution that combines the strengths of both patterns without
 the drawbacks of either.  We want the implementation language to catch
 us when we forget something, but we also want it to handle large numbers
 of nodes and operations smoothly.  We want to split different operations
 into different modules, but also keep them closely associated with the
 node type.<p>

 None of the standard patterns provide this combination of functionality,
 because none of the existing implementation languages support both styles
 of program design at the same time.<p>

 <h2>6. Aspect-Oriented Programming</h2>

 A new field in language design has emerged in recent years called
 "Aspect-Oriented Programming" (AOP).  A good review of the field
 can be found in the October 2001 issue of the "<em>Communications of
 the ACM</em>" <a href="#aopref">[4]</a>, and on the AspectJ Web site
 <a href="#aspectjref">[5]</a>.<p>

 The following excerpt from the introduction to the AOP section in the
 CACM issue describes the essential aspects of AOP, and the difference
 between OOP and AOP:<p>

 <blockquote>
 <em>AOP is based on the idea that computer systems are better programmed
 by separately specifying the various concerns (properties or areas
 of interest) of a system and some description of their relationships,
 and then relying on mechanisms in the underlying AOP environment to
 weave or compose them together into a coherent program. ...
 While the tendancy in OOP's is to find commonality among classes
 and push it up the inheritance tree, AOP attempts to realize
 scattered concerns as first-class elements, and eject them
 horizontally from the object structure.</em>
 </blockquote>

 Aspect-orientation gives us some hope of solving our compiler
 complexity problems.  When we moved from the inheritance pattern
 to the visitor pattern, we were attempting to eject the operations
 horizontally.  But it didn't quite work as well as we had hoped: the
 intrinsic complexity of the set of nodes kept interfering.  AOP
 allows us to take the idea further, without re-introducing the problems
 that visitors have.<p>

 We can view each operation on node types (semantic analysis,
 optimization, code generation, etc) as an "aspect" of the compiler's
 construction.  The AOP language weaves these aspects with the node
 types to create the final compiler.<p>

 However, we don't really want to invent a completely new programming
 language for compiler construction.  We would have to implement this
 new language using all of the flawed techniques that makes writing
 compilers hard.  It's a classic chicken and egg problem: we don't
 want to replace a buggy compiler with a buggy compiler implementation
 language.<p>

 We can strike a compromise, similar to that used by lex and yacc.
 Those tools use a custom syntax for the difficult parts, and a
 pre-existing underlying language (usually C) to implement everything
 else.  The code is expanded by the tool and then compiled with the
 underlying language's compiler.<p>

 Treecc uses a domain-specific aspect-oriented programming language for
 declaring and managing nodes and operations, and uses an underlying language
 to implement the body of the operations.  C, C++, C#, or Java can be used
 as the underlying language, depending upon your personal preference.<p>

 Treecc is about 13,000 lines of code in size, which is relatively
 easy to debug by hand.<p>

 <em>Aside:</em> treecc does not support all of the AOP features that are
 described in the literature.  Treecc weaves together classes from multiple
 method definitions.  Other AOP languages can also weave together methods
 from fragments in multiple aspects.  We concentrated on those AOP features
 that were useful for compiler construction.  Other features could be
 incorporated at a later date.<p>

 <h2>7. Using Treecc to Beat Inheritance Bugs</h2>

 Now that we've identified the problems of inheritance and visitor patterns
 for compiler implementation, we will show how treecc helps the programmer
 avoid these traps.<p>

 The following is the treecc definition of our example node types:

 <blockquote>
 <pre>%option lang = "C#"

 %node Expression %abstract %typedef

 %node UnaryExpression Expression %abstract =
 {
     Expression expr;
 }

 %node NegateExpression UnaryExpression
 %node UnaryPlusExpression UnaryExpression
 %node BitwiseNotExpression UnaryExpression</pre>
 </blockquote>

 Treecc converts this into a number of C# classes, one for each
 node type.  It also inserts helper methods for testing the type
 of a node and for tracking source line numbers.  If the output language
 is C or C++, treecc will also insert code for allocating large numbers
 of nodes efficiently.<p>

 The type-checking operation (or "aspect") is declared as follows:<p>

 <blockquote>
 <pre>%operation %virtual LanguageType TypeCheck(Expression e)

 TypeCheck(NegateExpression),
 TypeCheck(UnaryPlusExpression)
 {
     LanguageType type = e.expr.TypeCheck();
     if(type.IsInteger() || type.IsFloat())
     {
         return type;
     }
     throw new TypeCheckException();
 }

 TypeCheck(BitwiseNotExpression)
 {
     LanguageType type = e.expr.TypeCheck();
     if(type.IsInteger())
     {
         return type;
     }
     throw new TypeCheckException();
 }</pre>
 </blockquote>

 We have not declared the operation to cover "<code>Expression</code>" or
 "<code>UnaryExpression</code>".  Instead, we list all of the applicable
 subtypes explicitly.  When treecc is run on the above file, it will check
 that every non-abstract node type is handled by an operation case.  If it finds
 a missing type, it will report an error.  Let's demonstrate that by adding a
 new unary expression type:<p>

 <blockquote><code>
 %node LogicalNotExpression UnaryExpression<br>
 <br>
 eg.tc:17: node type `LogicalNotExpression' is not handled in operation `TypeCheck'<br>
 </code></blockquote>

 This is at the heart of treecc's power: it performs a large amount of
 semantic analysis over the node types to determine if the programmer has
 forgotten something.  The programmer is notified of this early in
 development process, when it is easier to fix the problem.<p>

 As we discussed earlier, we want to implement common code in common
 base classes to improve code sharing.  However, this introduces
 hard to find bugs.  Treecc avoids the need to do this by allowing
 the programmer to attach multiple cases to the same code block,
 as in the case of "<code>NegateExpression</code>" and
 "<code>UnaryPlusExpression</code>" above.<p>

 An important feature of aspect-oriented languages is aspect modularity:
 it should be possible to implement separate aspects in different parts
 of the code.  Treecc supports this by separating node and operation
 definitions.  Operations do not need to be implemented in the same
 file as the nodes to which they apply, and multiple operations on
 the same node types can be scattered through-out the code.<p>

 Portable.NET takes this even further by separating individual operations
 across multiple files for expressions, statements, declarations, etc.
 This introduces a clearer structure to the code that makes it easier
 to navigate the source during compiler construction.  The semantic analysis
 routines in treecc ensure that nothing is missed when all of the
 separate modules are recombined.<p>

 <h2>8. Painless Visitors</h2>

 The previous example used a "<code>%virtual</code>" operation, which
 is defined over all node types.  Treecc inserts the virtual method
 body wherever it is required.<p>

 Sometimes we don't want to define an operation as a virtual method.
 We would prefer to use the visitor approach.  The following is what
 a visitor version of "<code>TypeCheck</code>" would look like:

 <blockquote>
 <pre>%operation LanguageType TypeChecker::TypeCheck(Expression e) = {null}

 TypeCheck(NegateExpression),
 TypeCheck(UnaryPlusExpression)
 {
     LanguageType type = TypeCheck(e.expr);
     if(type.IsInteger() || type.IsFloat())
     {
         return type;
     }
     throw new TypeCheckException();
 }

 TypeCheck(BitwiseNotExpression)
 {
     LanguageType type = TypeCheck(e.expr);
     if(type.IsInteger())
     {
         return type;
     }
     throw new TypeCheckException();
 }</pre>
 </blockquote>

 This is almost identical to the previous version, except for the
 definition of the operation, and the calling conventions for
 "<code>TypeCheck</code>".  Behind the scenes, treecc creates
 a class called "<code>TypeChecker</code>" that contains a static
 method called "<code>TypeCheck</code>".  This is the visitor.<p>

 Interestingly, if we had used C as the underlying language, then no
 changes are necessary to the bodies of the operation cases.  Only
 the "<code>%virtual</code>" keyword changes.  The C macro pre-processor
 is used to smooth out the differences.<p>

 This illustrates another useful property of treecc: it is very simple
 to flip operations between inhertance-based virtuals and visitor-based
 non-virtuals.  This allows the programmer to start developing the compiler
 one way, change their mind, and quickly flip to the other way.<p>

 Normally, changing inheritance-based code into visitor-based code
 would entail a complete system rewrite.  Changing patterns with
 treecc is trivial.<p>

 This is a common property of aspect-oriented programming languages:
 because the language takes care of the inserting the aspects into
 the main classes, it is easier to change the style of insertion
 without a major system overhaul.<p>

 Non-virtual operations can be applied to multiple arguments, which
 can be very useful when implementing coercions:

 <blockquote>
 <pre>%enum SimpleType =
 {
     Integer,
     Long,
     Float,
     Error
 }

 %operation %inline SimpleType Binary::Coerce
         ([SimpleType type1], [SimpleType type2]) = {Error}

 Coerce(Integer, Integer)
 {
     return Integer;
 }

 Coerce(Integer, Long),
 Coerce(Long, Integer),
 Coerce(Long, Long)
 {
     return Long;
 }

 Coerce(Integer, Float),
 Coerce(Float, Integer),
 Coerce(Long, Float),
 Coerce(Float, Float)
 {
     return Float;
 }

 Coerce(Integer, Error),
 Coerce(Error, Integer),
 Coerce(Long, Error),
 Coerce(Error, Long),
 Coerce(Float, Error),
 Coerce(Error, Float),
 Coerce(Error, Error)
 {
     return Error;
 }</pre>
 </blockquote>

 Treecc turns this into a highly efficient nested switch statement,
 which would be extremely difficult to debug by hand.  We actually
 left out one of the cases above, so we get an error:

 <blockquote><code>eg.tc:48: case `Float, Long' is missing from operation `Coerce'</code></blockquote>

 Casts and coercions on primitive types can now be implemented as
 simple table lookups, with treecc checking the completeness of the
 table for us.<p>

 <h2>9. Conclusion and Future Directions</h2>

 Treecc provides a new way to attack the complexity of compiler implementation
 by automating error-prone tasks.  It performs a large amount of semantic
 analysis on the program to ensure that common problems are caught early
 in the development cycle.<p>

 Because treecc is based on an aspect-oriented foundation, it allows the
 programmer to separate out concerns and deal with them individually.
 Treecc puts the whole system back together in the most efficient manner
 possible.<p>

 The system is not necessarily complete.  We'd like to experiment with
 rule-based code generation techniques.  At present, optimizers and
 code generators must be written by hand, as operations on node types.<p>

 A rule-based system would make it easier to build clever optimizers
 as a set of pattern matching directives.  Operations are already a
 special class of pattern matcher, but they don't have any back-tracking
 and retry capabilities.<p>

 Another area that treecc can be applied to is the construction of
 "Just In Time" compilers.  The first phase of the JIT process is the
 reconstruction of the intermediate code form of the program.  This
 intermediate code typically takes the form of abstract syntax trees or
 three-address statements.<p>

 Treecc is well-suited to the management of JIT intermediate forms.
 Register allocation, dynamic flow analysis, and machine-dependent
 code generation can be added as JIT aspects.  We are currently
 exploring the use of treecc to assist in the construction of a
 JIT for Portable.NET<p>

 <h2>References</h2>

 <a name="pnetref">[1] Portable.NET Web Site, <a href="http://www.southern-storm.com.au/portable_net.html">http://www.southern-storm.com.au/portable_net.html</a>.<p>

 <a name="treeccref">[2] Treecc Web Site, <a href="http://www.southern-storm.com.au/treecc.html">http://www.southern-storm.com.au/treecc.html</a>.<p>

 <a name="gammaref">[3] Gamma, et al., <em>Design Patterns: Elements of
 Reusable Object-Oriented Software</em>, Addison-Wesley, 1995.<p>

 <a name="aopref">[4] Aspect-Oriented Programming, <em>Communications of
 the ACM</em>, October 2001.<p>

 <a name="aspectjref">[5] AspectJ Web Site, <a href="http://www.aspectj.org/">http://www.aspectj.org/</a>.<p>

 </body>
 </html>