| <html> |
| <head> |
| <title>SWIG C Scanner</title> |
| </head> |
| |
| <body> |
| <center> |
| <h1>SWIG C/C++ Scanning</h1> |
| |
| <p> |
| David M. Beazley <br> |
| dave-swig@dabeaz.com<br> |
| January 11, 2007<br> |
| |
| </b> |
| </center> |
| |
| <h2>Introduction</h2> |
| |
| This document describes functions that can be used to tokenize C/C++ |
| input text. These functions are relatively low-level and are meant to |
| be used in the implementation of scanners that can be plugged into yacc or used for |
| other purposes. For instance, the preprocessor uses these functions to evaluate and test |
| constant expressions. |
| |
| <p> |
| All of these functions are declared in <tt>Source/Swig/swigscan.h</tt>. This API is considered to be stable. |
| |
| <h2>Creation and Deletion of Scanners</h2> |
| |
| The following functions are used to create and destroy a scanner object. More than one scanner object can be created and used |
| as necessary. |
| |
| <p> |
| <b><tt>Scanner *NewScanner()</tt></b> |
| |
| <blockquote> |
| Creates a new scanner object. The scanner contains initially contains no text. To feed text to the scanner use <tt>Scanner_push()</tt>. |
| </blockquote> |
| |
| <p> |
| <b><tt>Scanner *DelScanner()</tt></b> |
| |
| <blockquote> |
| Deletes a scanner object. |
| </blockquote> |
| |
| <h2>Scanner Functions</h2> |
| |
| <p> |
| <b><tt>void Scanner_clear(Scanner *s)</tt></b> |
| <blockquote> |
| Clears all text from the scanner. This can be used to reset a scanner to its initial state, ready to receive new input text. |
| </blockquote> |
| |
| <p> |
| <b><tt>void Scanner_push(Scanner *s, String *text)</tt></b> |
| <blockquote> |
| Pushes an input string into the scanner. Subsequent tokens will be |
| returned from the new string. If the scanner is already processing a |
| string, the pushed string takes precedence--in effect, interrupting |
| the scanning of the previous string. This behavior is used to |
| implement certain SWIG features such as the <tt>%inline</tt> |
| directive. Once the pushed string has been completely scanned, the |
| scanner will return to scanning the previous string (if any). The |
| scanning of text relies upon the DOH file interface to strings |
| (<tt>Getc()</tt>, <tt>Ungetc()</tt>, etc.). Prior to calling this |
| function, the input string should be set so that its file pointer is |
| in the location where you want scanning to begin. You may have to |
| use <tt>Seek()</tt> to set the file pointer back to the beginning of a |
| string prior to using this function. |
| </blockquote> |
| |
| <p> |
| <b><tt>void Scanner_pushtoken(Scanner *s, int tokvalue, String_or_char *val)</tt></b> |
| <blockquote> |
| Pushes a token into the scanner. This exact token will be returned by the next call to <tt>Scanner_token()</tt>. |
| <tt>tokvalue</tt> is the integer token value to return and <tt>val</tt> is the token text to return. This |
| function is only used to handle very special parsing cases. For instance, if you need the scanner to |
| return a fictitious token into order to enter a special parsing case. |
| </blockquote> |
| |
| <p> |
| <b><tt>int Scanner_token(Scanner *s)</tt></b> |
| |
| <blockquote> |
| Returns the next token. An integer token code is returned (see table below) on success. If no more input text is |
| available 0 is returned. If a scanning error occurred, -1 is returned. In this case, error information can be |
| obtained using <tt>Scanner_errinfo()</tt>. |
| </blockquote> |
| |
| <p> |
| <b><tt>String *Scanner_text(Scanner *s)</tt></b> |
| <blockquote> |
| Returns the scanned text corresponding to the last token returned by <tt>Scanner_token()</tt>. The returned string |
| is only valid until the next call to <tt>Scanner_token()</tt>. If you need to save it, make a copy. |
| </blockquote> |
| |
| <p> |
| <b><tt>void Scanner_skip_line(Scanner *s)</tt></b> |
| <blockquote> |
| Skips to the end of the current line. The text skipped can be obtained using <tt>Scanner_text()</tt> afterwards. |
| </blockquote> |
| |
| <p> |
| <b><tt>int Scanner_skip_balanced(Scanner *s, int startchar, int endchar)</tt></b> |
| <blockquote> |
| Skips to the end of a block of text denoted by starting and ending characters. For example, <tt>{</tt> and <tt>}</tt>. The |
| function is smart about how it skips text. String literals and comments are ignored. The function also is aware of nesting. The |
| skipped text can be obtained using <tt>Scanner_text()</tt> afterwards. Returns 0 on success, -1 if no matching <tt>endchar</tt> could be found. |
| </blockquote> |
| |
| |
| <p> |
| <b><tt>void Scanner_set_location(Scanner *s, int startchar, int endchar)</tt></b> |
| <blockquote> |
| Changes the current filename and line number of the scanner. |
| </blockquote> |
| |
| <p> |
| <b><tt>String *Scanner_file(Scanner *s)</tt></b> |
| <blockquote> |
| Gets the current filename associated with text in the scanner. |
| </blockquote> |
| |
| <p> |
| <b><tt>int Scanner_line(Scanner *s)</tt></b> |
| <blockquote> |
| Gets the current line number associated with text in the scanner. |
| </blockquote> |
| |
| <p> |
| <b><tt>int Scanner_start_line(Scanner *s)</tt></b> |
| <blockquote> |
| Gets the starting line number of the last token returned by the scanner. |
| </blockquote> |
| |
| <p> |
| <b><tt>void Scanner_idstart(Scanner *s, char *idchar)</tt></b> |
| <blockquote> |
| Sets additional characters (other than the C default) that may be used to start C identifiers. <tt>idchar</tt> is a string |
| containing the characters (e.g., "%@"). The purpose of this function is to up special keywords such as "%module" or "@directive" as |
| simple identifiers. |
| </blockquote> |
| |
| <p> |
| <b><tt>String *Scanner_errmsg(Scanner *s)</tt></b> |
| <blockquote> |
| Returns the error message associated with the last scanner error (if any). This will only return a meaningful result |
| if <tt>Scanner_token()</tt> returned -1. |
| </blockquote> |
| |
| <p> |
| <b><tt>int Scanner_errline(Scanner *s)</tt></b> |
| <blockquote> |
| Returns the line number associated with the last scanner error (if any). This will only return a meaningful result |
| if <tt>Scanner_token()</tt> returned -1. The line number usually corresponds to the starting line number of a particular |
| token (e.g., for unterminated strings, comments, etc.). |
| </blockquote> |
| |
| <p> |
| <b><tt>int Scanner_isoperator(int tokval)</tt></b> |
| <blockquote> |
| A convenience function that returns 0 or 1 depending on whether <tt>tokval</tt> is a valid C/C++ operator (i.e., a candidate for |
| operator overloading). |
| </blockquote> |
| |
| <p> |
| <b><tt>void Scanner_freeze_line(int val)</tt></b> |
| <blockquote> |
| Freezes the current line number depending upon whether or not <tt>val</tt> is 1 or 0. When the line number is frozen, newline characters will not result in |
| updates to the line number. This is sometimes useful in tracking line numbers through complicated macro expansions. |
| </blockquote> |
| |
| |
| <h2>Token Codes</h2> |
| |
| The following table shows token codes returned by the scanner. These are integer codes returned by |
| the <tt>Scanner_token()</tt> function. |
| |
| <blockquote> |
| <pre> |
| Token code C Token |
| ------------------------- ------------- |
| SWIG_TOKEN_LPAREN ( |
| SWIG_TOKEN_RPAREN ) |
| SWIG_TOKEN_SEMI ; |
| SWIG_TOKEN_COMMA , |
| SWIG_TOKEN_STAR * |
| SWIG_TOKEN_TIMES * |
| SWIG_TOKEN_LBRACE { |
| SWIG_TOKEN_RBRACE } |
| SWIG_TOKEN_EQUAL = |
| SWIG_TOKEN_EQUALTO == |
| SWIG_TOKEN_NOTEQUAL != |
| SWIG_TOKEN_PLUS + |
| SWIG_TOKEN_MINUS - |
| SWIG_TOKEN_AND & |
| SWIG_TOKEN_LAND && |
| SWIG_TOKEN_OR | |
| SWIG_TOKEN_LOR || |
| SWIG_TOKEN_XOR ^ |
| SWIG_TOKEN_LESSTHAN < |
| SWIG_TOKEN_GREATERTHAN > |
| SWIG_TOKEN_LTEQUAL <= |
| SWIG_TOKEN_GTEQUAL >= |
| SWIG_TOKEN_LTEQUALGT <=> |
| SWIG_TOKEN_NOT ~ |
| SWIG_TOKEN_LNOT ! |
| SWIG_TOKEN_LBRACKET [ |
| SWIG_TOKEN_RBRACKET ] |
| SWIG_TOKEN_SLASH / |
| SWIG_TOKEN_DIVIDE / |
| SWIG_TOKEN_BACKSLASH \ |
| SWIG_TOKEN_POUND # |
| SWIG_TOKEN_PERCENT % |
| SWIG_TOKEN_MODULO % |
| SWIG_TOKEN_COLON : |
| SWIG_TOKEN_DCOLON :: |
| SWIG_TOKEN_DCOLONSTAR ::* |
| SWIG_TOKEN_LSHIFT << |
| SWIG_TOKEN_RSHIFT >> |
| SWIG_TOKEN_QUESTION ? |
| SWIG_TOKEN_PLUSPLUS ++ |
| SWIG_TOKEN_MINUSMINUS -- |
| SWIG_TOKEN_PLUSEQUAL += |
| SWIG_TOKEN_MINUSEQUAL -= |
| SWIG_TOKEN_TIMESEQUAL *= |
| SWIG_TOKEN_DIVEQUAL /= |
| SWIG_TOKEN_ANDEQUAL &= |
| SWIG_TOKEN_OREQUAL |= |
| SWIG_TOKEN_XOREQUAL ^= |
| SWIG_TOKEN_LSEQUAL <<= |
| SWIG_TOKEN_RSEQUAL >>= |
| SWIG_TOKEN_MODEQUAL %= |
| SWIG_TOKEN_ARROW -> |
| SWIG_TOKEN_ARROWSTAR ->* |
| SWIG_TOKEN_PERIOD . |
| SWIG_TOKEN_AT @ |
| SWIG_TOKEN_DOLLAR $ |
| SWIG_TOKEN_ENDLINE Literal newline |
| SWIG_TOKEN_ID identifier |
| SWIG_TOKEN_FLOAT Floating point with F suffix (e.g., 3.1415F) |
| SWIG_TOKEN_DOUBLE Floating point (e.g., 3.1415 ) |
| SWIG_TOKEN_INT Integer (e.g., 314) |
| SWIG_TOKEN_UINT Unsigned integer (e.g., 314U) |
| SWIG_TOKEN_LONG Long integer (e.g., 314L) |
| SWIG_TOKEN_ULONG Unsigned long integer (e.g., 314UL) |
| SWIG_TOKEN_LONGLONG Long long integer (e.g., 314LL ) |
| SWIG_TOKEN_ULONGLONG Unsigned long long integer (e.g., 314ULL) |
| SWIG_TOKEN_CHAR Character literal in single quotes ('c') |
| SWIG_TOKEN_STRING String literal in double quotes ("str") |
| SWIG_TOKEN_RSTRING Reverse quote string (`str`) |
| SWIG_TOKEN_CODEBLOCK SWIG code literal block %{ ... %} |
| SWIG_TOKEN_COMMENT C or C++ comment (// or /* ... */) |
| SWIG_TOKEN_ILLEGAL Illegal character |
| </pre> |
| </blockquote> |
| |
| <b>Notes</b> |
| |
| <ul> |
| <li>When more than one token code exist for the same token text, those codes are identical (e.g., <tt>SWIG_TOKEN_STAR</tt> and <tt>SWIG_TOKEN_TIMES</tt>). |
| |
| <p> |
| <li> |
| String literals are returned in their exact representation in which escape codes (if any) have been interpreted. |
| |
| <p> |
| <li> |
| All C identifiers and keywords are simply returned as <tt>SWIG_TOKEN_ID</tt>. To check for specific keywords, you will need to |
| add extra checking on the returned text. |
| |
| <p> |
| <li>C and C++ comments include the comment starting and ending text (e.g., "//", "/*"). |
| |
| <p> |
| <li>The maximum token integer value is found in the constant <tt>SWIG_MAXTOKENS</tt>. This can be used if you wanted to create |
| an array or table for the purposes of remapping tokens to a different set of codes. For instance, if you are |
| using these functions to write a yacc-compatible lexer. |
| </ul> |
| |
| </body> |
| </html> |