| =head1 Introduction |
| |
| Yxml is a small non-validating and mostly conforming XML parser written in C. |
| |
| The latest version of yxml and this document can be found on |
| L<http://dev.yorhel.nl/yxml>. |
| |
| B<This document is a work in progress; some useful material is still missing.> |
| |
| =head1 Compiling yxml |
| |
| Due to the small size of yxml, the recommended way to use it is to copy |
| L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and |
| L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h> from the git repository |
| into your project directory, and to compile and link yxml.c as part of your |
| program or library. |
| |
| The git repository also includes a Makefile. Running C<make> without specifying |
| a target will compile a C<.a> file for easy static linking. A test suite is |
| available under C<make test>. |
| |
| =head1 API documentation |
| |
| =head2 Overview |
| |
| Yxml is designed to be very flexible and efficient, and thus offers a |
| relatively low-level stream-based API. The entire API consists of two typedefs |
| and three functions: |
| |
| typedef enum { /* .. */ } yxml_ret_t; |
| typedef struct { /* .. */ } yxml_t; |
| |
| void yxml_init(yxml_t *x, char *buf, size_t bufsize); |
| yxml_ret_t yxml_parse(yxml_t *x, int ch); |
| yxml_ret_t yxml_eof(yxml_t *x); |
| |
| The values of I<yxml_ret_t> and the public fields of I<yxml_t> are explained in |
| detail below. Parsing a file using yxml involves three steps: |
| |
| =over |
| |
| =item 1. Initialization, using C<yxml_init()>. |
| |
| =item 2. Parsing. This is performed in a loop where C<yxml_parse()> is called |
| on each character of the input file. |
| |
| =item 3. Finalization, using C<yxml_eof()>. |
| |
| =back |
| |
| |
| =head2 Initialization |
| |
| #define BUFSIZE 4096 |
| char *buf = malloc(BUFSIZE); |
| yxml_t x; |
| yxml_init(&x, buf, BUFSIZE); |
| |
| The parsing state for an input document is remembered in the C<yxml_t> |
| structure. This structure needs to be allocated and initialized before parsing |
| a new XML document. |
| |
| Allocating space for the C<yxml_t> structure is the responsibility of the |
| application. Allocation can be done on the stack, but it is also possible to |
| embed the struct inside a larger object or to allocate space for the struct |
| separately. |
| |
| C<yxml_init()> takes a pointer to an (uninitialized) C<yxml_t> struct as first |
| argument and performs the necessary initialization. The two additional |
| arguments specify a pointer to a buffer and the size of this buffer. The given |
| buffer must be writable, but does not have to be initialized by the |
| application. |
| |
| The buffer is used internally by yxml to keep a stack of opened XML element |
| names, attribute names and PI targets. The size of the buffer determines both |
| the maximum depth to which XML elements can be nested and the maximum length |
| of element names, attribute names and PI targets. Each name consumes |
| C<strlen(name)+1> bytes in the buffer, and the first byte of the buffer is |
| reserved for the C<\0> byte. This means that in order to parse an XML document |
| with element names of up to 100 bytes, attribute names or PI targets of up to |
| 50 bytes and a nesting depth of up to 10 levels, the buffer must be at least |
| C<1+10*(100+1)+(50+1)=1062> bytes. Note that attributes and PIs don't nest, so |
| the C<max(PI_name, attribute_name)> term only needs to be counted once. |
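| |
| The sizing rule above can be restated as a small helper. This is only a |
| sketch: C<min_stack_size()> and its parameter names are hypothetical and not |
| part of yxml. All lengths are in bytes and exclude the terminating zero byte. |
| |
```c
#include <stddef.h>

/* Hypothetical helper, not part of yxml: smallest buffer size that fits a
 * document with the given worst-case limits. */
static size_t min_stack_size(size_t max_depth, size_t max_elem_len,
                             size_t max_attr_len, size_t max_pi_len)
{
    /* Attributes and PIs do not nest, so only the longer one counts. */
    size_t max_leaf = max_attr_len > max_pi_len ? max_attr_len : max_pi_len;

    return 1                                /* reserved leading zero byte */
         + max_depth * (max_elem_len + 1)   /* one element name per level */
         + (max_leaf + 1);                  /* one attribute name or PI target */
}
```
| |
| With the numbers from the example, C<min_stack_size(10, 100, 50, 50)> |
| evaluates to 1062. |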
| |
| It is not currently possible to dynamically grow the buffer while parsing, so |
| it is important to choose a buffer size that is large enough to handle all the |
| XML documents that you want to parse. Since element names, attribute names and |
| PI targets are typically much shorter than in the previous example, a buffer |
| size of 4 or 8 KiB will give enough headroom even for documents with deep |
| nesting. |
| |
| As a useful hack, it is possible to merge the memory for the C<yxml_t> struct |
| and the stack buffer in a single allocation: |
| |
| yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE); |
| yxml_init(x, (char *)(x+1), BUFSIZE); |
| |
| This way, the complete parsing state can be passed around with a single |
| pointer, and both the struct and the buffer can be freed with a single call to |
| C<free(x)>. |
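| |
| The layout behind this trick can be illustrated without yxml itself. The |
| sketch below uses a stand-in C<state_t> struct in place of C<yxml_t> so that |
| it compiles on its own; C<state_t>, its fields and C<state_new()> are all |
| hypothetical. It also adds the allocation check that the short form omits. |
| |
```c
#include <stdlib.h>

/* Stand-in for yxml_t so that this sketch compiles on its own; real code
 * would include yxml.h and use yxml_t instead. */
typedef struct {
    char  *buf;
    size_t bufsize;
} state_t;

/* Allocate the state struct and its buffer as one block.
 * Returns NULL on allocation failure; release with a single free(). */
static state_t *state_new(size_t bufsize)
{
    state_t *s = malloc(sizeof(*s) + bufsize);
    if (s == NULL)
        return NULL;
    /* s+1 points at the first byte past the struct, which is where the
     * buffer starts. Real code would pass this pointer to yxml_init(). */
    s->buf = (char *)(s + 1);
    s->bufsize = bufsize;
    return s;
}
```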
| |
| |
| =head2 Parsing |
| |
| yxml_t *x; /* An initialized state */ |
| char *doc; /* The XML document as a zero-terminated string */ |
| for(; *doc; doc++) { |
|     yxml_ret_t r = yxml_parse(x, *doc); |
|     if(r < 0) |
|         exit(1); /* Handle error */ |
|     /* Handle any tokens we are interested in */ |
| } |
| |
| The actual parsing of an XML document is done by the C<yxml_parse()> |
| function. It accepts a pointer to an initialized C<yxml_t> struct as its first |
| argument and a byte as its second argument. The byte is passed as an C<int>, |
| and values in the range of -128 to 255 (both inclusive) are accepted. This way |
| you can pass either C<signed char> or C<unsigned char> values; yxml works with |
| both. To parse a complete document, C<yxml_parse()> needs to be called |
| for each byte of the document in sequence, as done in the above example. |
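| |
| The range of -128 to 255 covers both ways in which a byte can be converted to |
| an C<int>. The sketch below shows why the two are interchangeable: a |
| hypothetical C<to_byte()> helper (not part of yxml, and not necessarily how |
| yxml handles this internally) maps both representations to the same value in |
| the range 0 to 255. |
| |
```c
/* Hypothetical helper, not part of yxml: map an int in the range
 * -128..255 to the byte value 0..255 that it represents. */
static unsigned to_byte(int ch)
{
    /* Adding 256 makes negative inputs non-negative; masking with 0xff
     * then folds 256..511 back onto 0..255. */
    return (unsigned)(ch + 256) & 0xffu;
}
```
| |
| With this mapping, the byte 0xE9 yields 233 whether it arrives as the |
| C<signed char> value -23 or as the C<unsigned char> value 233. |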
| |
| For each byte, C<yxml_parse()> will return either I<YXML_OK> (0), a token (>0) |
| or an error (<0). I<YXML_OK> is returned if the given byte has been |
| parsed/consumed correctly and nothing else worthy of note has happened. The |
| application should then continue processing and pass the next byte of the |
| document. |
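| |
| The sign convention can be captured in a three-way check. In the sketch |
| below, the return value is passed as a plain C<int> so that the code compiles |
| without yxml.h; C<res_kind_t> and C<classify()> are hypothetical names. |
| |
```c
/* Hypothetical classification of a yxml_parse() return value. */
typedef enum { RES_ERROR, RES_OK, RES_TOKEN } res_kind_t;

static res_kind_t classify(int r)
{
    if (r < 0)
        return RES_ERROR;  /* an error: stop parsing this document */
    if (r == 0)
        return RES_OK;     /* YXML_OK: nothing to report, pass the next byte */
    return RES_TOKEN;      /* a token: inspect the public state fields */
}
```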
| |
| =head3 Public State Variables |
| |
| After each call to C<yxml_parse()>, a number of interesting fields in the |
| C<yxml_t> struct are updated. The fields documented here are part of the API, |
| and are considered as extra return values of C<yxml_parse()>. All of these |
| fields should be considered read-only. |
| |
| =over |
| |
| =item C<char *elem;> |
| |
| Name of the currently opened XML element. Points into the buffer given to |
| C<yxml_init()>. Described in L</Elements>. |
| |
| =item C<char *attr;> |
| |
| Name of the currently opened attribute. Points into the buffer given to |
| C<yxml_init()>. Described in L</Attributes>. |
| |
| =item C<char *pi;> |
| |
| Target of the currently opened PI. Points into the buffer given to |
| C<yxml_init()>. Described in L</Processing Instructions>. |
| |
| =item C<char data[8];> |
| |
| Character data of element contents, attribute values or PI contents. Described |
| in L</Character Data>. |
| |
| =item C<uint32_t line;> |
| |
| Number of the line in the XML document that is currently being parsed. |
| |
| =item C<uint64_t byte;> |
| |
| Byte offset into the current line of the XML document. |
| |
| =item C<uint64_t total;> |
| |
| Byte offset into the XML document. |
| |
| =back |
| |
| The values of the I<elem>, I<attr>, I<pi> and I<data> fields depend on the |
| parsing context, and only remain valid within that context. The exact contexts |
| in which these fields contain valid information are described in their |
| respective sections below. |
| |
| The I<line>, I<byte> and I<total> fields are mainly useful for error reporting. |
| When C<yxml_parse()> or C<yxml_eof()> returns with an error, these fields can |
| be used to generate a useful error message. For example: |
| |
| printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64"\n", |
|     filename, x->line, x->byte, x->total); |
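| |
| The C<PRIu32> and C<PRIu64> macros come from C<< <inttypes.h> >>. As a |
| self-contained sketch, the hypothetical C<format_position()> helper below (not |
| part of yxml) writes a similar message into a caller-supplied buffer. Its |
| three position arguments stand in for the I<line>, I<byte> and I<total> |
| fields, and the message layout is only one possible choice. |
| |
```c
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper, not part of yxml: write a position message into out
 * (at most outsize bytes, zero-terminated when outsize > 0).
 * Returns the value of snprintf(). */
static int format_position(char *out, size_t outsize, const char *filename,
                           uint32_t line, uint64_t byte, uint64_t total)
{
    return snprintf(out, outsize,
                    "%s:%" PRIu32 ":%" PRIu64
                    ": parse error (byte offset %" PRIu64 ")",
                    filename, line, byte, total);
}
```
| |
| For the file name C<doc.xml>, line 3, byte 7 and total offset 42, this |
| produces C<doc.xml:3:7: parse error (byte offset 42)>. |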
| |
| =head3 Error Handling |
| |
| Errors are not recoverable. After an error, no further calls to |
| C<yxml_parse()> or C<yxml_eof()> should be made on the same C<yxml_t> struct. |
| Re-initializing the same struct using C<yxml_init()> to start parsing a new |
| document is possible, however. The following error values may be returned by |
| C<yxml_parse()>: |
| |
| =over |
| |
| =item YXML_EREF |
| |
| Invalid character or entity reference. E.g. C<&whatever;> or C<&#ABC;>. |
| |
| =item YXML_ECLOSE |
| |
| Close tag does not match open tag. E.g. C<< <Tag> .. </SomeOtherTag> >>. |
| |
| =item YXML_ESTACK |
| |
| Stack overflow. This happens when the buffer given to C<yxml_init()> was not |
| large enough to parse this document. E.g. when elements are too deeply nested |
| or an element name, attribute name or PI target is too long. |
| |
| =item YXML_ESYN |
| |
| Miscellaneous syntax error. |
| |
| =back |
| |
| =head3 Elements |
| |
| The C<YXML_ELEMSTART> and C<YXML_ELEMEND> tokens are returned when an XML |
| element is opened and closed, respectively. When C<YXML_ELEMSTART> is returned, |
| the I<elem> struct field will hold the name of the element. This field will be |
| valid (i.e. keep pointing to the name of the opened element) until the end of |
| the attribute list. That is, until any token other than those described in |
| L</Attributes> is returned. Although the I<elem> pointer itself may be reused |
| and modified while parsing the contents of the element, the buffer that I<elem> |
| points to will remain valid up to and including the corresponding |
| C<YXML_ELEMEND>. |
| |
| Yxml will validate that elements properly nest and that the name of each |
| closing tag properly matches that of the respective opening tag. The |
| application may safely assume that each C<YXML_ELEMSTART> is properly matched |
| with a C<YXML_ELEMEND>, or that otherwise an error is returned. Furthermore, |
| only a single root element is allowed. When the root element is closed, no |
| further C<YXML_ELEMSTART> tokens will be returned. |
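| |
| Because start and end tokens are guaranteed to balance, an application can |
| track nesting with a plain counter. The sketch below uses a stand-in C<tok_t> |
| enum so that it compiles without yxml.h; its names and values are |
| illustrative only. Real code would instead switch on the value returned by |
| C<yxml_parse()>. |
| |
```c
#include <stddef.h>

/* Stand-ins for the element start and end tokens (YXML_ELEMSTART and
 * YXML_ELEMEND in yxml); everything else is lumped into TOK_OTHER. */
typedef enum { TOK_OTHER, TOK_ELEMSTART, TOK_ELEMEND } tok_t;

/* Return the deepest nesting level reached in a token sequence.
 * Relies on every start token having a matching end token. */
static unsigned max_depth(const tok_t *toks, size_t n)
{
    unsigned depth = 0, deepest = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (toks[i] == TOK_ELEMSTART) {
            if (++depth > deepest)
                deepest = depth;
        } else if (toks[i] == TOK_ELEMEND) {
            depth--;
        }
    }
    return deepest;
}
```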
| |
| No distinction is made between self-closing tags and elements with empty |
| content. For example, both C<< <a/> >> and C<< <a></a> >> will result in the |
| C<YXML_ELEMSTART> token (with C<elem="a">) followed by the C<YXML_ELEMEND> |
| token. |
| |
| Element contents are returned in the form of the C<YXML_CONTENT> token and the |
| I<data> field. This is described in more detail in L</Character Data>. |
| |
| =head3 Attributes |
| |
| =head3 Processing Instructions |
| |
| =head3 Character Data |
| |
| |
| =head2 Finalization |
| |
| |