| =head1 Introduction |
| |
| Yxml is a small non-validating and mostly conforming XML parser written in C. |
| |
| The latest version of yxml and this document can be found on |
| L<http://dev.yorhel.nl/yxml>. |
| |
| =head1 Compiling yxml |
| |
| Due to the small size of yxml, the recommended way to use it is to copy the |
| L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and |
| L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h> from the git repository |
| into your project directory, and compile and link yxml.c as part of your |
| program or library. |
| |
| The git repository also includes a Makefile. Running C<make> without specifying |
| a target will compile a C<.a> file for easy static linking. A test suite is |
| available under C<make test>. |
| |
| =head1 API documentation |
| |
| =head2 Overview |
| |
| Yxml is designed to be very flexible and efficient, and thus offers a |
| relatively low-level stream-based API. The entire API consists of two typedefs |
| and three functions: |
| |
| typedef enum { /* .. */ } yxml_ret_t; |
| typedef struct { /* .. */ } yxml_t; |
| |
| void yxml_init(yxml_t *x, char *buf, size_t bufsize); |
| yxml_ret_t yxml_parse(yxml_t *x, int ch); |
| yxml_ret_t yxml_eof(yxml_t *x); |
| |
| The values of I<yxml_ret_t> and the public fields of I<yxml_t> are explained in |
| detail below. Parsing a file using yxml involves three steps: |
| |
| =over |
| |
| =item 1. Initialization, using C<yxml_init()>. |
| |
| =item 2. Parsing. This is performed in a loop where C<yxml_parse()> is called |
| on each character of the input file. |
| |
| =item 3. Finalization, using C<yxml_eof()>. |
| |
| =back |
| |
| |
| =head2 Initialization |
| |
| #define BUFSIZE 4096 |
| char *buf = malloc(BUFSIZE); |
| yxml_t x; |
| yxml_init(&x, buf, BUFSIZE); |
| |
| The parsing state for an input document is remembered in the C<yxml_t> |
| structure. This structure needs to be allocated and initialized before parsing |
| a new XML document. |
| |
| Allocating space for the C<yxml_t> structure is the responsibility of the |
| application. Allocation can be done on the stack, but it is also possible to |
| embed the struct inside a larger object or to allocate space for the struct |
| separately. |
| |
| C<yxml_init()> takes a pointer to an (uninitialized) C<yxml_t> struct as first |
| argument and performs the necessary initialization. The two additional |
| arguments specify a pointer to a buffer and the size of this buffer. The given |
| buffer must be writable, but does not have to be initialized by the |
| application. |
| |
| The buffer is used internally by yxml to keep a stack of opened XML element |
| names, property names and PI targets. The size of the buffer determines both |
| the maximum depth in which XML elements can be nested and the maximum length of |
| element names, property names and PI targets. Each name consumes |
| C<strlen(name)+1> bytes in the buffer, and the first byte of the buffer is |
| reserved for the C<\0> byte. This means that in order to parse an XML document |
| with an element name of 100 bytes, a property name or PI target of 50 bytes and |
| a nesting depth of 10 levels, the buffer must be at least |
| C<1+10*(100+1)+(50+1)=1062> bytes. Note that properties and PIs don't nest, so |
| the C<max(PI_name, property_name)> only needs to be counted once. |
| |
| It is not currently possibly to dynamically grow the buffer while parsing, so |
| it is important to choose a buffer size that is large enough to handle all the |
| XML documents that you want to parse. Since element names, property names and |
| PI targets are typically much shorter than in the previous example, a buffer |
| size of 4 or 8 KiB will give enough headroom even for documents with deep |
| nesting. |
| |
| As a useful hack, it is possible to merge the memory for the C<yxml_t> struct |
| and the stack buffer in a single allocation: |
| |
| yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE); |
| yxml_init(x, x+1, BUFSIZE); |
| |
| This way, the complete parsing state can be passed around with a single |
| pointer, and both the struct and the buffer can be freed with a single call to |
| C<free(x)>. |
| |
| |
| =head2 Parsing |
| |
| yxml_t *x; /* An initialized state */ |
| char *doc; /* The XML document as a zero-terminated string */ |
| for(; *doc; doc++) { |
| yxml_ret_t r = yxml_parse(x, *doc); |
| if(r < 0) |
| exit(1); /* Handle error */ |
| /* Handle any tokens we are interested in */ |
| } |
| |
| The actual parsing of an XML document is facilitated by the C<yxml_parse()> |
| function. It accepts a pointer to an initialized C<yxml_t> struct as first |
| argument and a byte as second argument. The byte is passed as an C<int>, and |
| values in the range of -128 to 255 (both inclusive) are accepted. This way you |
| can pass either C<signed char> or C<unsigned char> values, yxml will work fine |
| with both. To parse a complete document, C<yxml_parse()> needs to be called |
| for each byte of the document in sequence, as done in the above example. |
| |
| For each byte, C<yxml_parse()> will return either I<YXML_OK> (0), a token (>0) |
| or an error (<0). I<YXML_OK> is returned if the given byte has been |
| parsed/consumed correctly but that otherwise nothing worthy of note has |
| happened. The application should then continue processing and pass the next |
| byte of the document. |
| |
| =head3 Public State Variables |
| |
| After each call to C<yxml_parse()>, a number of interesting fields in the |
| C<yxml_t> struct are updated. The fields documented here are part of the API, |
| and are considered as extra return values of C<yxml_parse()>. All of these |
| fields should be considered read-only. |
| |
| =over |
| |
| =item C<char *elem;> |
| |
| Name of the currently opened XML element. Points into the buffer given to |
| C<yxml_init()>. Described in L</Elements>. |
| |
| =item C<char *attr;> |
| |
| Name of the currently opened attribute. Points into the buffer given to |
| C<yxml_init()>. Described in L</Attributes>. |
| |
| =item C<char *pi;> |
| |
| Target of the currently opened PI. Points into the buffer given to |
| C<yxml_init()>. Described in L</Processing Instructions>. |
| |
| =item C<char data[8];> |
| |
| Character data of element contents, attribute values or PI contents. Described |
| in L</Character Data>. |
| |
| =item C<uint32_t line;> |
| |
| Number of the line in the XML document that is currently being parsed. |
| |
| =item C<uint64_t byte;> |
| |
| Byte offset into the current line the XML document. |
| |
| =item C<uint64_t total;> |
| |
| Byte offset into the XML document. |
| |
| =back |
| |
| The values of the I<elem>, I<attr>, I<pi> and I<data> elements depend on the |
| parsing context, and only remain valid within that context. The exact contexts |
| in which these fields contain valid information is described in their |
| respective sections below. |
| |
| The I<line>, I<byte> and I<total> fields are mainly useful for error reporting. |
| When C<yxml_parse()> reports an error, these fields can be used to generate a |
| useful error message. For example: |
| |
| printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64", |
| filename, x->line, x->byte, x->total); |
| |
| =head3 Error Handling |
| |
| Errors are not recoverable. No further calls to C<yxml_parse()> or |
| C<yxml_eof()> should be performed on the same C<yxml_t> struct. Re-initializing |
| the same struct using C<yxml_init()> to start parsing a new document is |
| possible, however. The following error values may be returned by |
| C<yxml_parse()>: |
| |
| =over |
| |
| =item YXML_EREF |
| |
| Invalid character or entity reference. E.g. C<&whatever;> or C<&#ABC;>. |
| |
| =item YXML_ECLOSE |
| |
| Close tag does not match open tag. E.g. C<< <Tag> .. </SomeOtherTag> >>. |
| |
| =item YXML_ESTACK |
| |
| Stack overflow. This happens when the buffer given to C<yxml_init()> was not |
| large enough to parse this document. E.g. when elements are too deeply nested |
| or an element name, attribute name or PI target is too long. |
| |
| =item YXML_ESYN |
| |
| Miscellaneous syntax error. |
| |
| =back |
| |
| |
| =head2 Handling Tokens |
| |
| The C<yxml_parse()> function will return tokens as they are found. When loading |
| an XML document, it is important to know which tokens are returned in which |
| situation and how to handle them. |
| |
| The following graph shows the (simplified) state machine of the parser to |
| illustrate the order in which tokens are returned. The labels on the edge |
| indicate the tokens that are returned by C<yxml_parse()>, with their C<YXML_> |
| prefix removed. The special return value C<YXML_OK> and error returns are not |
| displayed. |
| |
| [html]<img src="/img/yxml-apistates.png" />É |
| |
| Tokens that the application is not interested in can be ignored safely. For |
| example, if you are not interested in handling processing instructions, then |
| the C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND> tokens can be handled |
| exactly as if they were an alias for C<YXML_OK>. |
| |
| =head3 Elements |
| |
| The C<YXML_ELEMSTART> and C<YXML_ELEMEND> tokens are returned when an XML |
| element is opened and closed, respectively. When C<YXML_ELEMSTART> is returned, |
| the I<elem> struct field will hold the name of the element. This field will be |
| valid (i.e. keeps pointing to the name of the opened element) until the end of |
| the attribute list. That is, until any token other than those described in |
| L</Attributes> is returned. Although the I<elem> pointer itself may be reused |
| and modified while parsing the contents of the element, the buffer that I<elem> |
| points to will remain valid up to and including the corresponding |
| C<YXML_ELEMCLOSE>. |
| |
| Yxml will verify that elements properly nest and that the name of each closing |
| tag properly matches that of the respective opening tag. The application may |
| safely assume that each C<YXML_ELEMSTART> is properly matched with a |
| C<YXML_ELEMCLOSE>, or that otherwise an error is returned. Furthermore, only a |
| single root element is allowed. When the root element is closed, no further |
| C<YXML_ELEMSTART> tokens will be returned. |
| |
| No distinction is made between self-closing tags and elements with empty |
| content. For example, both C<< <a/> >> and C<< <a></a> >> will result in the |
| C<YXML_ELEMSTART> token (with C<elem="a">) followed by C<YXML_ELEMEND>. |
| |
| Element contents are returned in the form of the C<YXML_CONTENT> token and the |
| I<data> field. This is described in more detail in L</Character Data>. |
| |
| =head3 Attributes |
| |
| Element attributes are passed using the C<YXML_ATTRSTART>, C<YXML_ATTRVAL> and |
| C<YXML_ATTREND> tokens. The name of the attribute is available in the I<attr> |
| field, which is available when C<YXML_ATTRSTART> is returned and valid up to |
| and including the next C<YXML_ATTREND>. |
| |
| Yxml does not verify that attribute names are unique within a single element. |
| It is thus possible that the same attribute will appear twice, possibly with a |
| different value. The correct way to handle this situation is to stop parsing |
| the rest of the document and to report an error, but if the application is not |
| interested in all attributes, detecting duplicates in them may complicate the |
| code and possibly even introduce security vulnerabilities (e.g. algorithmic |
| complexity attacks in a hash table). As such, the best solution is to report an |
| error when you can easily detect a duplicate attribute, but ignore duplicates |
| that require more effort to be detected. |
| |
| The attribute value is returned with the C<YXML_ATTRVAL> token and the I<data> |
| field. This is described in more detail in L</Character Data>. |
| |
| =head3 Processing Instructions |
| |
| Processing instructions are passed in similar fashion to attributes, and are |
| passed using C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND>. The target |
| of the PI is available in the I<pi> field after C<YXML_PISTART> and remains |
| valid up to (but excluding) the next C<YXML_PIEND> token. |
| |
| PI contents are returned as C<YXML_PICONTENT> tokens and using the I<data> |
| field, described in more detail in L</Character Data>. |
| |
| =head3 Character Data |
| |
| Element contents (C<YXML_CONTENT>), attribute values (C<YXML_ATTRVAL>) and PI |
| contents (C<YXML_PICONTENT>) are all passed to the application in small chunks |
| through the I<data> field. Each time that C<yxml_parse()> returns one of these |
| tokens, the I<data> field will contain one or more bytes of the element |
| contents, attribute value or PI content. The string is zero-terminated, and its |
| value is only valid until the next call to C<yxml_parse()>. |
| |
| Typically only a single byte is returned after each call, but multiple bytes |
| can be returned in the following special cases: |
| |
| =over |
| |
| =item * Character references outside of the ASCII character range. When a |
| character reference is encountered in element contents or in an attribute |
| value, it is automatically replaced with the referenced character. For example, |
| the XML string C</> is replaced with the single character "/". If the |
| character value is above 127, its value is encoded in UTF-8 and then returned |
| as a multi-byte string in the I<data> field. For example, the character |
| reference C<ç> is returned as the C string "\xc3\xa9", which is the UTF-8 |
| encoding for the character "é". Character references are not expanded in PI |
| contents. |
| |
| =item * The special character "]" in CDATA sections. When the "]" character is |
| encountered inside a CDATA section, yxml can't immediately return it to the |
| application because it does not know whether the character is part of the CDATA |
| ending or whether it is still part of its contents. So it remembers the |
| character for the next call to C<yxml_parse()>, and if it then turns out that |
| the character was part of the CDATA contents, it returns both the "]" character |
| and the following byte in the same I<data> string. Similarly, if two "]" |
| characters appear in sequence as part of the CDATA content, then the two |
| characters are returned in a single I<data> string together with the byte that |
| follows. CDATA sections only appear in element contents, so this does not |
| happen in attribute values or PI contents. |
| |
| =item * The special character "?" in PI contents. This is similar to the issue |
| with "]" characters in CDATA sections. Yxml remembers a "?" character while |
| parsing a PI, and then returns it together with the byte following it if it |
| turned out to be part of the PI contents. |
| |
| =back |
| |
| Note that C<yxml_parse()> operates on bytes rather than characters. If the |
| document is encoded in a multi-byte character encoding such as UTF-8, then each |
| Unicode character that occupies more than a single byte will be broken up and |
| its bytes processed individually. As a result, the bytes returned in the |
| I<data> field may not necessarily represent a single Unicode character. To |
| ensure that multi-byte characters are not broken up, the application can |
| concatenate multiple data tokens to a single buffer before attempting to do |
| further processing on the result. |
| |
| To make processing easier, an application may want to combine all the tokens |
| into a single buffer. This can be easily implemented as follows: |
| |
| SomeString attrval; |
| while(..) { |
| yxml_ret_t r = yxml_parse(x, ch); |
| switch(r) { |
| case YXML_ATTRSTART: |
| somestring_initialize(attrval); |
| break; |
| case YXML_ATTRVAL: |
| somestring_append(attrval, x->data); |
| break; |
| case YXML_ATTREND: |
| /* Now we have a full attribute. Its name is in x->attr, and its value is |
| * in the string 'attrval'. */ |
| somestring_reset(attrval); |
| break; |
| } |
| } |
| |
| The C<SomeString> type and C<somestring_> functions are stubs for any string |
| handling library of your choosing. When using Glib, for example, one could use |
| the L<GString|https://developer.gnome.org/glib/stable/glib-Strings.html> |
| type and the C<g_string_new()>, C<g_string_append()> and C<g_string_free()> |
| functions. For a more lighter-weight string library there is also |
| L<kstring.h in klib|https://github.com/attractivechaos/klib>, but the |
| functionality required in the above example can easily be implemented in a few |
| lines of pure C, too. |
| |
| When buffering data into an ever-growing string, as done in the previous |
| example, one should be careful to protect against memory exhaustion. This can |
| be done trivially by limiting the size of the total XML document or the maximum |
| length of the buffer. If you want to extract information from an XML document |
| that might not fit into memory, but you know that the information you care |
| about is limited in size and is only stored in specific attributes or elements, |
| you can choose to ignore data you don't care about. For example, if you only |
| want to extract the "Size" attribute and you know that its value is never |
| larger than 63 bytes, you can limit your code to read only that value and store |
| it into a small pre-allocated buffer: |
| |
| char sizebuf[64], *sizecur = NULL, *tmp; |
| while(..) { |
| yxml_ret_t r = yxml_parse(x, ch); |
| switch(r) { |
| case YXML_ATTRSTART: |
| if(strcmp(x->attr, "Size") == 0) |
| sizecur = sizebuf; |
| break; |
| case YXML_ATTRVAL: |
| if(!sizecur) /* Are we in the "Size" attribute? */ |
| break; |
| /* Append x->data to sizecur while there is space */ |
| tmp = x->data; |
| while(*tmp && sizecur < sizebuf+sizeof(sizebuf)) |
| *(sizecur++) = *(tmp++); |
| if(sizecur == sizebuf+sizeof(sizebuf)) |
| exit(1); /* Too long attribute value, handle error */ |
| *sizecur = 0; |
| break; |
| case YXML_ATTREND: |
| if(sizecur) { |
| /* Now we have the value of the "Size" attribute in sizebuf */ |
| sizecur = NULL; |
| } |
| break; |
| } |
| } |
| |
| |
| =head2 Finalization |
| |
| yxml_t *x; /* An initialized state */ |
| yxml_ret_t r = yxml_eof(x); |
| if(r < 0) |
| exit(1); /* Handle error */ |
| else |
| /* No errors in the XML document */ |
| |
| Because C<yxml_parse()> does not know when the end of the XML document has been |
| reached, it is unable to detect certain errors in the document. This is why, |
| after successfully parsing a complete document with C<yxml_parse()>, the |
| application should call C<yxml_eof()> to perform some extra checks. |
| |
| C<yxml_eof()> will return C<YXML_OK> if the parsed XML document is well-formed, |
| C<YXML_EEOF> otherwise. The following errors are not detected by |
| C<yxml_parse()> but will result in an error on C<yxml_eof()>: |
| |
| =over |
| |
| =item * The XML document did not contain a root element (e.g. an empty |
| file). |
| |
| =item * The XML root element has not been closed (e.g. "C<< <a> .. >>"). |
| |
| =item * The XML document ended in the middle of a comment or PI (e.g. |
| "C<< <a/><!-- .. >>"). |
| |
| =back |