| Thoughts on the Insanity of C++ Parsing |
| |
| <h2>Thoughts on the Insanity of C++ Parsing</h2> |
| |
| <center> |
| <em> |
| "Parsing C++ is simply too complex to do correctly." -- Anonymous |
| </em> |
| </center> |
| <p> |
| Author: David Beazley (beazley@cs.uchicago.edu) |
| |
| <p> |
| August 12, 2002 |
| |
| <p> |
| A central goal of the SWIG project is to generate extension modules by |
| parsing the contents of C++ header files. It's not too hard to come up |
| with reasons why this might be useful---after all, if you've got |
| several hundred class definitions, do you really want to go off and |
| write a bunch of hand-crafted wrappers? No, of course not---you're |
| busy and, like everyone else, you've got better things to do with |
| your time. |
| |
| <p> |
| Okay, so there are many reasons why parsing C++ would be nice. |
| However, parsing C++ is also a nightmare. In fact, C++ would |
| probably be the last language that any normal person would choose to |
| serve as an interface specification language. It's hard to parse, |
| hard to analyze, and it involves all sorts |
| of nasty little problems related to scoping, typenames, templates, |
| access, and so forth. Because of this, most of the tools that claim |
| to "parse" C++ don't. Instead, they parse a subset of the language |
| that happens to match the C++ programming style used by the tool's |
| creator (believe me, I know---this is how SWIG started). Not |
| surprisingly, these tools tend to break down when presented with code |
| that starts to challenge the capabilities of the C++ compiler. |
| Needless to say, critics see this as an opportunity to make bold claims |
| such as "writing a C++ parser is folly" or "this whole approach is too |
| hard to ever work correctly." |
| |
| <p> |
| Well, one does have to give the critics a little credit---writing a |
| C++ parser certainly <em>is</em> hard and writing a parser that |
| actually works correctly is even harder. However, these tasks are |
| certainly not "impossible." After all, there would be no working C++ |
| compiler if such claims were true! Therefore, the question of whether |
| or not a wrapper generator can parse C++ is clearly the wrong question |
| to ask. Instead, the real question is whether or not a wrapper |
| generation tool that parses C++ can actually do anything useful. |
| |
| <p> |
| <h3>The problem with using C++ as an interface definition language</h3> |
| |
| If you cut through all of the low-level details of parsing, the primary |
| problem of using C++ as a module specification language is that of |
| ambiguity. Consider a declaration like this: |
| |
| <blockquote> |
| <pre> |
| void foo(double *x, int n); |
| </pre> |
| </blockquote> |
| |
| If you look at this declaration, you can ask yourself the question, |
| what is "x"? Is it a single input value? Is it an output value |
| (modified by the function)? Is it an array? Is "n" somehow related? |
| Perhaps the real problem in this example is that of expressing the |
| programmer's intent. Yes, the function clearly accepts a pointer to |
| some object and an integer, but the declaration does not contain |
| enough additional information to determine the purpose of these |
| parameters--information that could be useful in generating a suitable |
| set of wrappers. |
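<p>
To make the ambiguity concrete, here is a minimal sketch. The two functions and their names (<tt>scale_all</tt>, <tt>store_square</tt>) are hypothetical and do not come from any real library; they only illustrate that two declarations with exactly the same type can treat their arguments in completely different ways.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical function: x is an array of n values, scaled in place.
void scale_all(double *x, int n) {
    for (int i = 0; i < n; ++i) {
        x[i] *= 2.0;
    }
}

// Hypothetical function: x points to a single output value; n is an input.
void store_square(double *x, int n) {
    *x = static_cast<double>(n) * n;
}

// Both functions have the type void(double *, int), so nothing in the
// declarations separates "array plus length" from "output plus input".
static_assert(std::is_same<decltype(scale_all), decltype(store_square)>::value,
              "the two declarations have identical types");

// Small self-check showing the two very different behaviors.
void check_intent_examples() {
    double values[3] = {1.0, 2.0, 3.0};
    scale_all(values, 3);
    assert(values[0] == 2.0 && values[2] == 6.0);

    double result = 0.0;
    store_square(&result, 4);
    assert(result == 16.0);
}
```

A wrapper generator that sees only the declarations has no way to tell these two cases apart.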
| |
| <p> |
| IDL compilers associated with popular component frameworks (e.g., |
| CORBA, COM, etc.) get around this problem by requiring interfaces to |
| be precisely specified--input and output values are clearly indicated |
| as such. Thus, one might adopt a similar approach and extend C++ |
| syntax with some special modifiers or qualifiers. For example: |
| |
| <blockquote> |
| <pre> |
| void foo(%output double *x, int n); |
| </pre> |
| </blockquote> |
| |
| The problem with this approach is that it breaks from C++ syntax and |
| it requires the user to annotate their input files (a task that C++ |
| wrapper generators are supposed to eliminate). Meanwhile, critics sit |
| back and say "Ha! I told you C++ parsing would never work." |
| |
| <p> |
| Another problem with using C++ as an input language is that interface |
| building often involves more than just blindly wrapping declarations. For instance, |
| users might want to rename declarations, specify exception handling procedures, |
| add customized code, and so forth. This suggests that a |
| wrapper generator really needs to do |
| more than just parse C++---it must give users the freedom to customize |
| various aspects of the wrapper generation process. Again, things aren't |
| looking too good for C++. |
| |
| <p> |
| <h3>The SWIG approach: pattern matching</h3> |
| |
| SWIG takes a different approach to the C++ wrapping problem. |
| Instead of trying to modify C++ with all sorts of little modifiers and |
| add-ons, wrapping is largely controlled by a pattern matching mechanism that is |
| built into the underlying C++ type system. |
| |
| <p> |
| One part of the pattern matcher is programmed to look for specific sequences of |
| datatypes and argument names. These patterns, known as typemaps, are |
| responsible for all aspects of data conversion. They work by simply attaching |
| bits of C conversion code to specific datatypes and argument names in the |
| input file. For example, a typemap might be used like this: |
| |
| <blockquote> |
| <pre> |
| %typemap(in) <b>double *items</b> { |
|     // Get an array from the input |
|     ... |
| } |
| ... |
| void foo(<b>double *items</b>, int n); |
| </pre> |
| </blockquote> |
| |
| With this approach, type and argument names are used as |
| a basis for specifying customized wrapping behavior. For example, if a program |
| always uses an argument of <tt>double *items</tt> to refer to an |
| array, SWIG can latch onto that and use it to provide customized |
| processing. It is even possible to write pattern matching rules for |
| sequences of arguments. For example, you could write the following: |
| |
| <blockquote> |
| <pre> |
| %typemap(in) (<b>double *items, int n</b>) { |
|     // Get an array of items. Set n to number of items |
|     ... |
| } |
| ... |
| void foo(<b>double *items, int n</b>); |
| </pre> |
| </blockquote> |
| |
| The precise details of typemaps are not so important (in fact, most of |
| this pattern matching is hidden from SWIG users). What is important |
| is that pattern matching allows customized data handling to be |
| specified without breaking C++ syntax--instead, a user merely has to |
| define a few patterns that get applied across the declarations that |
| appear in C++ header files. In some sense, you might view this |
| approach as providing customization through naming conventions rather than |
| having to annotate arguments with extra qualifiers. |
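<p>
The idea of customization through naming conventions can be sketched as a small lookup table. This is a toy model and not SWIG's implementation: the <tt>PatternTable</tt> class, its methods, and the rule that a (type, name) pattern takes precedence over a type-only pattern are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy pattern table: maps (type, argument name) to a conversion label.
// An empty name means "any argument of this type".
class PatternTable {
public:
    void add(const std::string &type, const std::string &name,
             const std::string &action) {
        rules_[std::make_pair(type, name)] = action;
    }

    // Prefer an exact (type, name) match; otherwise fall back to a
    // type-only rule; otherwise report that no pattern applies.
    std::string lookup(const std::string &type, const std::string &name) const {
        Rules::const_iterator it = rules_.find(std::make_pair(type, name));
        if (it != rules_.end()) {
            return it->second;
        }
        it = rules_.find(std::make_pair(type, std::string()));
        if (it != rules_.end()) {
            return it->second;
        }
        return "default";
    }

private:
    typedef std::map<std::pair<std::string, std::string>, std::string> Rules;
    Rules rules_;
};

// Usage: "double *items" is treated as an array, other double* as a pointer.
void check_pattern_table() {
    PatternTable table;
    table.add("double *", "items", "array");
    table.add("double *", "", "pointer");
    assert(table.lookup("double *", "items") == "array");
    assert(table.lookup("double *", "x") == "pointer");
    assert(table.lookup("int", "n") == "default");
}
```

The header file itself is never touched; only the table of patterns changes.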
| |
| <p> |
| The other pattern matching mechanism used by SWIG is a declaration annotator |
| that is used to attach properties to specific declarations. A simple example of declaration |
| annotation might be renaming. For example: |
| |
| <blockquote> |
| <pre> |
| %rename(cprint) print; // Rename all occurrences of 'print' to 'cprint' |
| </pre> |
| </blockquote> |
| |
| A more advanced form of declaration matching would be exception handling. |
| For example: |
| |
| <blockquote> |
| <pre> |
| %exception Foo::getitem(int) { |
|     try { |
|         $action |
|     } catch (std::out_of_range& e) { |
|         SWIG_exception(SWIG_IndexError,const_cast&lt;char*&gt;(e.what())); |
|     } |
| } |
| |
| ... |
| template&lt;class T&gt; class Foo { |
| public: |
|     ... |
|     T &getitem(int index); // Exception handling code attached |
|     ... |
| }; |
| </pre> |
| </blockquote> |
| |
| Like typemaps, declaration matching does not break from C++ syntax. |
| Instead, a user merely specifies special processing rules in advance. |
| These rules are then attached to any matching C++ |
| declaration that appears later in the input. This means that raw C++ |
| header files can often be parsed and customized with few, if any, |
| modifications. |
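<p>
The exception rule above can be read as a template that is wrapped around the real call, with <tt>$action</tt> standing in for that call. Here is a rough plain-C++ sketch of the same shape. The names <tt>WrapperError</tt> and <tt>guarded_call</tt> are hypothetical and are not part of SWIG; the sketch only records the error instead of raising it in a scripting language.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the target language's error state.
struct WrapperError {
    bool set = false;
    std::string message;
};

// Runs `action` (the role played by $action) inside the handler.
// Returns true on success; on std::out_of_range, records the error.
template <class Action>
bool guarded_call(Action action, WrapperError &err) {
    try {
        action();
        return true;
    } catch (const std::out_of_range &e) {
        err.set = true;
        err.message = e.what();
        return false;
    }
}

// Usage: a bounds-checked access that succeeds, then one that fails.
void check_guarded_call() {
    std::vector<int> items(3, 7);
    WrapperError err;
    int value = 0;

    bool ok = guarded_call([&] { value = items.at(1); }, err);
    assert(ok && value == 7 && !err.set);

    ok = guarded_call([&] { value = items.at(10); }, err);
    assert(!ok && err.set);
}
```

The wrapped function needs no changes; the handler is supplied separately, which is the point of declaration annotation.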
| |
| <h3>The SWIG difference</h3> |
| |
| Pattern-based approaches to wrapper code generation are not unique to SWIG. |
| However, most prior efforts have based their pattern matching engines on simple |
| regular-expression matching. The key difference between SWIG and these systems |
| is that SWIG's customization features are fully integrated into the |
| underlying C++ type system. This means that SWIG is able to deal with very |
| complicated types of C/C++ code---especially code that makes heavy use of |
| <tt>typedef</tt>, namespaces, aliases, class hierarchies, and more. To |
| illustrate, consider some code like this: |
| |
| <blockquote> |
| <pre> |
| // A simple SWIG typemap |
| %typemap(in) int { |
|     $1 = PyInt_AsLong($input); |
| } |
| |
| ... |
| // Some raw C++ code (included later) |
| namespace X { |
|     typedef int Integer; |
| |
|     class _FooImpl { |
|     public: |
|         typedef Integer value_type; |
|     }; |
|     typedef _FooImpl Foo; |
| } |
| |
| namespace Y = X; |
| using Y::Foo; |
| |
| class Bar : public Foo { |
| }; |
| |
| void spam(Bar::value_type x); |
| </pre> |
| </blockquote> |
| |
| If you trace your way through this example, you will find that the |
| <tt>Bar::value_type</tt> argument to function <tt>spam()</tt> is |
| really an integer. What's more, if you take a close look at the |
| SWIG-generated wrappers, you will find that the typemap pattern defined for |
| <tt>int</tt> is applied to it--in other words, SWIG does exactly the right thing despite |
| our efforts to make the code confusing. |
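<p>
If you would rather not trace through the typedefs by hand, a C++ compiler can confirm the claim. The sketch below repeats the C++ portion of the example with two small changes that are mine, not the original's: <tt>_FooImpl</tt> is spelled <tt>FooImpl</tt> (names that start with an underscore and a capital letter are reserved), and the helper <tt>value_type_is_int()</tt> is added purely for checking. It says nothing about how SWIG resolves the type internally.

```cpp
#include <type_traits>

namespace X {
    typedef int Integer;

    class FooImpl {
    public:
        typedef Integer value_type;
    };
    typedef FooImpl Foo;
}

namespace Y = X;   // namespace alias
using Y::Foo;      // bring the typedef name into the global scope

class Bar : public Foo {
};

void spam(Bar::value_type x);

// Bar::value_type -> FooImpl::value_type -> X::Integer -> int
static_assert(std::is_same<Bar::value_type, int>::value,
              "Bar::value_type resolves to int");

// Consequently, spam() is just a function taking an int.
static_assert(std::is_same<decltype(spam), void(int)>::value,
              "spam has type void(int)");

// Hypothetical helper exposing the same fact as a constant expression.
constexpr bool value_type_is_int() {
    return std::is_same<Bar::value_type, int>::value;
}
```

Any pattern matcher that works on the text of the declaration alone would see <tt>Bar::value_type</tt> and miss the <tt>int</tt> rule.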
| |
| <p> |
| Similarly, declaration annotation is integrated into the type system |
| and can be used to define properties that span inheritance hierarchies |
| and more (in fact, there are many similarities between the operation of |
| SWIG and tools developed for Aspect Oriented Programming). |
| |
| <h3>What does this mean?</h3> |
| |
| Pattern-based approaches allow wrapper generation tools to parse C++ |
| declarations and to provide a wide variety of high-level customization |
| features. Although this approach is quite different from that found |
| in a typical IDL, the use of patterns makes it possible to work from |
| existing header files without having to make many (if any) changes to |
| those files. Moreover, when the underlying pattern matching mechanism |
| is integrated with the C++ type system, it is possible to build |
| reliable wrappers to real software---even if that software is filled |
| with namespaces, templates, classes, <tt>typedef</tt> declarations, |
| pointers, and other bits of nastiness. |
| |
| <h3>The bottom line</h3> |
| |
| Not only is it possible to generate extension modules by parsing C++, |
| it is possible to do so with real software and with a high degree of |
| reliability. Don't believe me? Download SWIG-1.3.14 and try it for |
| yourself. |
| |
| |
| |
| |
| |
| |
| |
| |
| |