swigweb/article_cpp.ht - third_party/swig - Git at Google

 Thoughts on the Insanity C++ Parsing

 <h2>Thoughts on the Insanity of C++ Parsing</h2>

 <center>
 <em>
 "Parsing C++ is simply too complex to do correctly." -- Anonymous
 </em>
 </center>
 <p>
 Author: David Beazley (beazley@cs.uchicago.edu)

 <p>
 August 12, 2002

 <p>
 A central goal of the SWIG project is to generate extension modules by
 parsing the contents of C++ header files.  It's not too hard to come up
 with reasons why this might be useful---after all, if you've got
 several hundred class definitions, do you really want to go off and
 write a bunch of hand-crafted wrappers?  No, of course not---you're
 busy and like everyone else, you've got better things to do with
 your time.

 <p>
 Okay, so there are many reasons why parsing C++ would be nice.
 However, parsing C++ is also a nightmare.  In fact, C++ would
 probably the last language that any normal person would choose to
 serve as an interface specification language.  It's hard to parse,
 hard to analyze, and it involves all sorts
 of nasty little problems related to scoping, typenames, templates,
 access, and so forth.  Because of this, most of the tools that claim
 to "parse" C++ don't.  Instead, they parse a subset of the language
 that happens to match the C++ programming style used by the tool's
 creator (believe me, I know---this is how SWIG started).  Not
 surprisingly, these tools tend to break down when presented with code
 that starts to challenge the capabilities of the C++ compiler.
 Needless to say, critics see this as opportunity to make bold claims
 such as "writing a C++ parser is folly" or "this whole approach is too
 hard to ever work correctly."

 <p>
 Well, one does have to give the critics a little credit---writing a
 C++ parser certainly <em>is</em> hard and writing a parser that
 actually works correctly is even harder.  However, these tasks are
 certainly not "impossible."  After all, there would be no working C++
 compiler if such claims were true!  Therefore, the question of whether
 or not a wrapper generator can parse C++ is clearly the wrong question
 to ask.  Instead, the real question is whether or not a wrapper
 generation tool that parses C++ can actually do anything useful.

 <h3>The problem with using C++ as an interface definition language</h3>

 If you cut through all of the low-level details of parsing, the primary
 problem of using C++ as an module specification language is that of
 ambiguity. Consider a declaration like this:

 <blockquote>
 <pre>
 void foo(double *x, int n);
 </pre>
 </blockquote>

 If you look at this declaration, you can ask yourself the question,
 what is "x"?  Is it a single input value?  Is it an output value
 (modified by the function)?  Is it an array?  Is "n" somehow related?
 Perhaps the real problem in this example is that of expressing the
 programmer's intent.  Yes, the function clearly accepts a pointer to
 some object and an integer, but the declaration does not contain
 enough additional information to determine the purpose of these
 parameters--information that could be useful in generating a suitable
 set of a wrappers.

 <p>
 IDL compilers associated with popular component frameworks (e.g.,
 CORBA, COM, etc.)  get around this problem by requiring interfaces to
 be precisely specified--input and output values are clearly indicated
 as such.  Thus, one might adopt a similar approach and extend C++
 syntax with some special modifiers or qualifiers.  For example:

 <blockquote>
 <pre>
 void foo(%output double *x, int n);
 </pre>
 </blockquote>

 The problem with this approach is that it breaks from C++ syntax and
 it requires the user to annotate their input files (a task that C++
 wrapper generators are supposed to eliminate).  Meanwhile, critics sit
 back and say "Ha! I told you C++ parsing would never work."

 <p>
 Another problem with using C++ as an input language is that interface
 building often involves more than just blindly wrapping declarations.  For instance,
 users might want to rename declarations, specify exception handling procedures,
 add customized code, and so forth.  This suggests that a
 wrapper generator really needs to do
 more than just parse C++---it must give users the freedom to customize
 various aspects of the wrapper generation process.  Again, things aren't
 looking too good for C++.

 <h3>The SWIG approach: pattern matching</h3>

 SWIG takes a different approach to the C++ wrapping problem.
 Instead of trying to modify C++ with all sorts of little modifiers and
 add-ons, wrapping is largely controlled by a pattern matching mechanism that is
 built into the underlying C++ type system.

 <p>
 One part of the pattern matcher is programmed to look for specific sequences of
 datatypes and argument names.   These patterns, known as typemaps, are
 responsible for all aspects of data conversion.  They work by simply attaching
 bits of C conversion code to specific datatypes and argument names in the
 input file.  For example, a typemap might be used like this:

 <blockquote>
 <pre>
 %typemap(in) <b>double *items</b> {
    // Get an array from the input
    ...
 }
 ...
 void foo(<b>double *items</b>, int n);
 </pre>
 </blockquote>

 With this approach, type and argument names are used as
 a basis for specifying customized wrapping behavior.  For example, if a program
 always used an argument of <tt>double *items</tt> to refer to an
 array, SWIG can latch onto that and use it to provide customized
 processing.  It is even possible to write pattern matching rules for
 sequences of arguments.  For example, you could write the following:

 <blockquote>
 <pre>
 %typemap(in) (<b>double *items, int n</b>) {
    // Get an array of items. Set n to number of items
    ...
 }
 ...
 void foo(<b>double *items, int n</b>);
 </pre>
 </blockquote>

 The precise details of typemaps are not so important (in fact, most of
 this pattern matching is hidden from SWIG users).  What is important
 is that pattern matching allows customized data handling to be
 specified without breaking C++ syntax--instead, a user merely has to
 define a few patterns that get applied across the declarations that
 appear in C++ header files.  In some sense, you might view this
 approach as providing customization through naming conventions rather than
 having to annotate arguments with extra qualifiers.

 <p>
 The other pattern matching mechanism used by SWIG is a declaration annotator
 that is used to attach properties to specific declarations.  A simple example of declaration
 annotation might be renaming.  For example:

 <blockquote>
 <pre>
 %rename(cprint) print;  // Rename all occurrences of 'print' to 'cprint'
 </pre>
 </blockquote>

 A more advanced form of declaration matching would be exception handling.
 For example:

 <blockquote>
 <pre>
 %exception Foo::getitem(int) {
     try {
         $action
     } catch (std::out_of_range&amp; e) {
         SWIG_exception(SWIG_IndexError,const_cast&lt;char*&gt;(e.what()));
     }
 }

 ...
 template&lt;class T&gt; class Foo {
 public:
      ...
      T &amp;getitem(int index);    // Exception handling code attached
      ...
 };
 </pre>
 </blockquote>

 Like typemaps, declaration matching does not break from C++ syntax.
 Instead, a user merely specifies special processing rules in advance.
 These rules are then attached to any matching C++
 declaration that appears later in the input.  This means that raw C++
 header files can often be parsed and customized with few, if any,
 modifications.

 <h3>The SWIG difference</h3>

 Pattern based approaches to wrapper code generation are not unique to SWIG.
 However, most prior efforts have based their pattern matching engines on simple
 regular-expression matching. The key difference between SWIG and these systems
 is that SWIG's customization features are fully integrated into the
 underlying C++ type system.  This means that SWIG is able to deal with very
 complicated types of C/C++ code---especially code that makes heavy use of
 <tt>typedef</tt>, namespaces, aliases, class hierarchies, and more.  To
 illustrate, consider some code like this:

 <blockquote>
 <pre>
 // A simple SWIG typemap
 %typemap(in) int {
     $1 = PyInt_AsLong($input);
 }

 ...
 // Some raw C++ code (included later)
 namespace X {
   typedef int Integer;

   class _FooImpl {
   public:
       typedef Integer value_type;
   };
   typedef _FooImpl Foo;
 }

 namespace Y = X;
 using Y::Foo;

 class Bar : public Foo {
 };

 void spam(Bar::value_type x);
 </pre>
 </blockquote>

 If you trace your way through this example, you will find that the
 <tt>Bar::value_type</tt> argument to function <tt>spam()</tt> is
 really an integer.  What's more, if you take a close look at the SWIG
 generated wrappers, you will find that the typemap pattern defined for
 <tt>int</tt> is applied to it--in other words, SWIG does exactly the right thing despite
 our efforts to make the code confusing.

 <p>
 Similarly, declaration annotation is integrated into the type system
 and can be used to define properties that span inheritance hierarchies
 and more (in fact, there are many similarities between the operation of
 SWIG and tools developed for Aspect Oriented Programming).

 <h3>What does this mean?</h3>

 Pattern-based approaches allow wrapper generation tools to parse C++
 declarations and to provide a wide variety of high-level customization
 features.  Although this approach is quite different than that found
 in a typical IDL, the use of patterns makes it possible to work from
 existing header files without having to make many (if any) changes to
 those files.  Moreover, when the underlying pattern matching mechanism
 is integrated with the C++ type system, it is possible to build
 reliable wrappers to real software---even if that software is filled
 with namespaces, templates, classes, <tt>typedef</tt> declarations,
 pointers, and other bits of nastiness.

 <h3>The bottom line</h3>

 Not only is it possible to generate extension modules by parsing C++,
 it is possible to do so with real software and with a high degree of
 reliability.  Don't believe me?  Download SWIG-1.3.14 and try it for
 yourself.
	Thoughts on the Insanity C++ Parsing

	<h2>Thoughts on the Insanity of C++ Parsing</h2>

	<center>
	<em>
	"Parsing C++ is simply too complex to do correctly." -- Anonymous
	</em>
	</center>
	<p>
	Author: David Beazley (beazley@cs.uchicago.edu)

	<p>
	August 12, 2002

	<p>
	A central goal of the SWIG project is to generate extension modules by
	parsing the contents of C++ header files. It's not too hard to come up
	with reasons why this might be useful---after all, if you've got
	several hundred class definitions, do you really want to go off and
	write a bunch of hand-crafted wrappers? No, of course not---you're
	busy and like everyone else, you've got better things to do with
	your time.

	<p>
	Okay, so there are many reasons why parsing C++ would be nice.
	However, parsing C++ is also a nightmare. In fact, C++ would
	probably the last language that any normal person would choose to
	serve as an interface specification language. It's hard to parse,
	hard to analyze, and it involves all sorts
	of nasty little problems related to scoping, typenames, templates,
	access, and so forth. Because of this, most of the tools that claim
	to "parse" C++ don't. Instead, they parse a subset of the language
	that happens to match the C++ programming style used by the tool's
	creator (believe me, I know---this is how SWIG started). Not
	surprisingly, these tools tend to break down when presented with code
	that starts to challenge the capabilities of the C++ compiler.
	Needless to say, critics see this as opportunity to make bold claims
	such as "writing a C++ parser is folly" or "this whole approach is too
	hard to ever work correctly."

	<p>
	Well, one does have to give the critics a little credit---writing a
	C++ parser certainly <em>is</em> hard and writing a parser that
	actually works correctly is even harder. However, these tasks are
	certainly not "impossible." After all, there would be no working C++
	compiler if such claims were true! Therefore, the question of whether
	or not a wrapper generator can parse C++ is clearly the wrong question
	to ask. Instead, the real question is whether or not a wrapper
	generation tool that parses C++ can actually do anything useful.

	<h3>The problem with using C++ as an interface definition language</h3>

	If you cut through all of the low-level details of parsing, the primary
	problem of using C++ as an module specification language is that of
	ambiguity. Consider a declaration like this:

	<blockquote>
	<pre>
	void foo(double *x, int n);
	</pre>
	</blockquote>

	If you look at this declaration, you can ask yourself the question,
	what is "x"? Is it a single input value? Is it an output value
	(modified by the function)? Is it an array? Is "n" somehow related?
	Perhaps the real problem in this example is that of expressing the
	programmer's intent. Yes, the function clearly accepts a pointer to
	some object and an integer, but the declaration does not contain
	enough additional information to determine the purpose of these
	parameters--information that could be useful in generating a suitable
	set of a wrappers.

	<p>
	IDL compilers associated with popular component frameworks (e.g.,
	CORBA, COM, etc.) get around this problem by requiring interfaces to
	be precisely specified--input and output values are clearly indicated
	as such. Thus, one might adopt a similar approach and extend C++
	syntax with some special modifiers or qualifiers. For example:

	<blockquote>
	<pre>
	void foo(%output double *x, int n);
	</pre>
	</blockquote>

	The problem with this approach is that it breaks from C++ syntax and
	it requires the user to annotate their input files (a task that C++
	wrapper generators are supposed to eliminate). Meanwhile, critics sit
	back and say "Ha! I told you C++ parsing would never work."

	<p>
	Another problem with using C++ as an input language is that interface
	building often involves more than just blindly wrapping declarations. For instance,
	users might want to rename declarations, specify exception handling procedures,
	add customized code, and so forth. This suggests that a
	wrapper generator really needs to do
	more than just parse C++---it must give users the freedom to customize
	various aspects of the wrapper generation process. Again, things aren't
	looking too good for C++.

	<h3>The SWIG approach: pattern matching</h3>

	SWIG takes a different approach to the C++ wrapping problem.
	Instead of trying to modify C++ with all sorts of little modifiers and
	add-ons, wrapping is largely controlled by a pattern matching mechanism that is
	built into the underlying C++ type system.

	<p>
	One part of the pattern matcher is programmed to look for specific sequences of
	datatypes and argument names. These patterns, known as typemaps, are
	responsible for all aspects of data conversion. They work by simply attaching
	bits of C conversion code to specific datatypes and argument names in the
	input file. For example, a typemap might be used like this:

	<blockquote>
	<pre>
	%typemap(in) <b>double *items</b> {
	// Get an array from the input
	...
	}
	...
	void foo(<b>double *items</b>, int n);
	</pre>
	</blockquote>

	With this approach, type and argument names are used as
	a basis for specifying customized wrapping behavior. For example, if a program
	always used an argument of <tt>double *items</tt> to refer to an
	array, SWIG can latch onto that and use it to provide customized
	processing. It is even possible to write pattern matching rules for
	sequences of arguments. For example, you could write the following:

	<blockquote>
	<pre>
	%typemap(in) (<b>double *items, int n</b>) {
	// Get an array of items. Set n to number of items
	...
	}
	...
	void foo(<b>double *items, int n</b>);
	</pre>
	</blockquote>

	The precise details of typemaps are not so important (in fact, most of
	this pattern matching is hidden from SWIG users). What is important
	is that pattern matching allows customized data handling to be
	specified without breaking C++ syntax--instead, a user merely has to
	define a few patterns that get applied across the declarations that
	appear in C++ header files. In some sense, you might view this
	approach as providing customization through naming conventions rather than
	having to annotate arguments with extra qualifiers.

	<p>
	The other pattern matching mechanism used by SWIG is a declaration annotator
	that is used to attach properties to specific declarations. A simple example of declaration
	annotation might be renaming. For example:

	<blockquote>
	<pre>
	%rename(cprint) print; // Rename all occurrences of 'print' to 'cprint'
	</pre>
	</blockquote>

	A more advanced form of declaration matching would be exception handling.
	For example:

	<blockquote>
	<pre>
	%exception Foo::getitem(int) {
	try {
	$action
	} catch (std::out_of_range& e) {
	SWIG_exception(SWIG_IndexError,const_cast<char*>(e.what()));
	}
	}

	...
	template<class T> class Foo {
	public:
	...
	T &getitem(int index); // Exception handling code attached
	...
	};
	</pre>
	</blockquote>

	Like typemaps, declaration matching does not break from C++ syntax.
	Instead, a user merely specifies special processing rules in advance.
	These rules are then attached to any matching C++
	declaration that appears later in the input. This means that raw C++
	header files can often be parsed and customized with few, if any,
	modifications.

	<h3>The SWIG difference</h3>

	Pattern based approaches to wrapper code generation are not unique to SWIG.
	However, most prior efforts have based their pattern matching engines on simple
	regular-expression matching. The key difference between SWIG and these systems
	is that SWIG's customization features are fully integrated into the
	underlying C++ type system. This means that SWIG is able to deal with very
	complicated types of C/C++ code---especially code that makes heavy use of
	<tt>typedef</tt>, namespaces, aliases, class hierarchies, and more. To
	illustrate, consider some code like this:

	<blockquote>
	<pre>
	// A simple SWIG typemap
	%typemap(in) int {
	$1 = PyInt_AsLong($input);
	}

	...
	// Some raw C++ code (included later)
	namespace X {
	typedef int Integer;

	class _FooImpl {
	public:
	typedef Integer value_type;
	};
	typedef _FooImpl Foo;
	}

	namespace Y = X;
	using Y::Foo;

	class Bar : public Foo {
	};

	void spam(Bar::value_type x);
	</pre>
	</blockquote>

	If you trace your way through this example, you will find that the
	<tt>Bar::value_type</tt> argument to function <tt>spam()</tt> is
	really an integer. What's more, if you take a close look at the SWIG
	generated wrappers, you will find that the typemap pattern defined for
	<tt>int</tt> is applied to it--in other words, SWIG does exactly the right thing despite
	our efforts to make the code confusing.

	<p>
	Similarly, declaration annotation is integrated into the type system
	and can be used to define properties that span inheritance hierarchies
	and more (in fact, there are many similarities between the operation of
	SWIG and tools developed for Aspect Oriented Programming).

	<h3>What does this mean?</h3>

	Pattern-based approaches allow wrapper generation tools to parse C++
	declarations and to provide a wide variety of high-level customization
	features. Although this approach is quite different than that found
	in a typical IDL, the use of patterns makes it possible to work from
	existing header files without having to make many (if any) changes to
	those files. Moreover, when the underlying pattern matching mechanism
	is integrated with the C++ type system, it is possible to build
	reliable wrappers to real software---even if that software is filled
	with namespaces, templates, classes, <tt>typedef</tt> declarations,
	pointers, and other bits of nastiness.

	<h3>The bottom line</h3>

	Not only is it possible to generate extension modules by parsing C++,
	it is possible to do so with real software and with a high degree of
	reliability. Don't believe me? Download SWIG-1.3.14 and try it for
	yourself.