yxml.pod - third_party/yxml - Git at Google

 =head1 Introduction

 Yxml is a small non-validating and mostly conforming XML parser written in C.

 The latest version of yxml and this document can be found on
 L<http://dev.yorhel.nl/yxml>.

 =head1 Compiling yxml

 Due to the small size of yxml, the recommended way to use it is to copy the
 L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and
 L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h> from the git repository
 into your project directory, and compile and link yxml.c as part of your
 program or library.

 The git repository also includes a Makefile. Running C<make> without specifying
 a target will compile a C<.a> file for easy static linking. A test suite is
 available under C<make test>.

 =head1 API documentation

 =head2 Overview

 Yxml is designed to be very flexible and efficient, and thus offers a
 relatively low-level stream-based API. The entire API consists of two typedefs
 and three functions:

   typedef enum { /* .. */ } yxml_ret_t;
   typedef struct { /* .. */ } yxml_t;

   void yxml_init(yxml_t *x, void *buf, size_t bufsize);
   yxml_ret_t yxml_parse(yxml_t *x, int ch);
   yxml_ret_t yxml_eof(yxml_t *x);

 The values of I<yxml_ret_t> and the public fields of I<yxml_t> are explained in
 detail below. Parsing a file using yxml involves three steps:

 =over

 =item 1. Initialization, using C<yxml_init()>.

 =item 2. Parsing. This is performed in a loop where C<yxml_parse()> is called
 on each character of the input file.

 =item 3. Finalization, using C<yxml_eof()>.

 =back


 =head2 Initialization

   #define BUFSIZE 4096
   void *buf = malloc(BUFSIZE);
   yxml_t x;
   yxml_init(&x, buf, BUFSIZE);

 The parsing state for an input document is remembered in the C<yxml_t>
 structure. This structure needs to be allocated and initialized before parsing
 a new XML document.

 Allocating space for the C<yxml_t> structure is the responsibility of the
 application. Allocation can be done on the stack, but it is also possible to
 embed the struct inside a larger object or to allocate space for the struct
 separately.

 C<yxml_init()> takes a pointer to an (uninitialized) C<yxml_t> struct as first
 argument and performs the necessary initialization. The two additional
 arguments specify a pointer to a buffer and the size of this buffer. The given
 buffer must be writable, but does not have to be initialized by the
 application.

 The buffer is used internally by yxml to keep a stack of opened XML element
 names, property names and PI targets. The size of the buffer determines both
 the maximum depth in which XML elements can be nested and the maximum length of
 element names, property names and PI targets. Each name consumes
 C<strlen(name)+1> bytes in the buffer, and the first byte of the buffer is
 reserved for the C<\0> byte. This means that in order to parse an XML document
 with an element name of 100 bytes, a property name or PI target of 50 bytes and
 a nesting depth of 10 levels, the buffer must be at least
 C<1+10*(100+1)+(50+1)=1062> bytes. Note that properties and PIs don't nest, so
 the C<max(PI_name, property_name)> only needs to be counted once.

 It is not currently possibly to dynamically grow the buffer while parsing, so
 it is important to choose a buffer size that is large enough to handle all the
 XML documents that you want to parse. Since element names, property names and
 PI targets are typically much shorter than in the previous example, a buffer
 size of 4 or 8 KiB will give enough headroom even for documents with deep
 nesting.

 As a useful hack, it is possible to merge the memory for the C<yxml_t> struct
 and the stack buffer in a single allocation:

   yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE);
   yxml_init(x, x+1, BUFSIZE);

 This way, the complete parsing state can be passed around with a single
 pointer, and both the struct and the buffer can be freed with a single call to
 C<free(x)>.


 =head2 Parsing

   yxml_t *x; /* An initialized state */
   char *doc; /* The XML document as a zero-terminated string */
   for(; *doc; doc++) {
     yxml_ret_t r = yxml_parse(x, *doc);
     if(r < 0)
       exit(1); /* Handle error */
     /* Handle any tokens we are interested in */
   }

 The actual parsing of an XML document is facilitated by the C<yxml_parse()>
 function. It accepts a pointer to an initialized C<yxml_t> struct as first
 argument and a byte as second argument. The byte is passed as an C<int>, and
 values in the range of -128 to 255 (both inclusive) are accepted. This way you
 can pass either C<signed char> or C<unsigned char> values, yxml will work fine
 with both. To parse a complete document, C<yxml_parse()> needs to be called
 for each byte of the document in sequence, as done in the above example.

 For each byte, C<yxml_parse()> will return either I<YXML_OK> (0), a token (>0)
 or an error (<0). I<YXML_OK> is returned if the given byte has been
 parsed/consumed correctly but that otherwise nothing worthy of note has
 happened. The application should then continue processing and pass the next
 byte of the document.

 =head3 Public State Variables

 After each call to C<yxml_parse()>, a number of interesting fields in the
 C<yxml_t> struct are updated. The fields documented here are part of the API,
 and are considered as extra return values of C<yxml_parse()>. All of these
 fields should be considered read-only.

 =over

 =item C<char *elem;>

 Name of the currently opened XML element. Points into the buffer given to
 C<yxml_init()>. Described in L</Elements>.

 =item C<char *attr;>

 Name of the currently opened attribute. Points into the buffer given to
 C<yxml_init()>. Described in L</Attributes>.

 =item C<char *pi;>

 Target of the currently opened PI. Points into the buffer given to
 C<yxml_init()>. Described in L</Processing Instructions>.

 =item C<char data[8];>

 Character data of element contents, attribute values or PI contents. Described
 in L</Character Data>.

 =item C<uint32_t line;>

 Number of the line in the XML document that is currently being parsed.

 =item C<uint64_t byte;>

 Byte offset into the current line the XML document.

 =item C<uint64_t total;>

 Byte offset into the XML document.

 =back

 The values of the I<elem>, I<attr>, I<pi> and I<data> elements depend on the
 parsing context, and only remain valid within that context. The exact contexts
 in which these fields contain valid information is described in their
 respective sections below.

 The I<line>, I<byte> and I<total> fields are mainly useful for error reporting.
 When C<yxml_parse()> reports an error, these fields can be used to generate a
 useful error message. For example:

   printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64",
     filename, x->line, x->byte, x->total);

 =head3 Error Handling

 Errors are not recoverable. No further calls to C<yxml_parse()> or
 C<yxml_eof()> should be performed on the same C<yxml_t> struct. Re-initializing
 the same struct using C<yxml_init()> to start parsing a new document is
 possible, however.  The following error values may be returned by
 C<yxml_parse()>:

 =over

 =item YXML_EREF

 Invalid character or entity reference. E.g. C<&whatever;> or C<&#ABC;>.

 =item YXML_ECLOSE

 Close tag does not match open tag. E.g. C<< <Tag> .. </SomeOtherTag> >>.

 =item YXML_ESTACK

 Stack overflow. This happens when the buffer given to C<yxml_init()> was not
 large enough to parse this document. E.g. when elements are too deeply nested
 or an element name, attribute name or PI target is too long.

 =item YXML_ESYN

 Miscellaneous syntax error.

 =back


 =head2 Handling Tokens

 The C<yxml_parse()> function will return tokens as they are found. When loading
 an XML document, it is important to know which tokens are returned in which
 situation and how to handle them.

 The following graph shows the (simplified) state machine of the parser to
 illustrate the order in which tokens are returned. The labels on the edge
 indicate the tokens that are returned by C<yxml_parse()>, with their C<YXML_>
 prefix removed.  The special return value C<YXML_OK> and error returns are not
 displayed.

 [html]<img src="/img/yxml-apistates.png" />É

 Tokens that the application is not interested in can be ignored safely. For
 example, if you are not interested in handling processing instructions, then
 the C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND> tokens can be handled
 exactly as if they were an alias for C<YXML_OK>.

 =head3 Elements

 The C<YXML_ELEMSTART> and C<YXML_ELEMEND> tokens are returned when an XML
 element is opened and closed, respectively. When C<YXML_ELEMSTART> is returned,
 the I<elem> struct field will hold the name of the element. This field will be
 valid (i.e. keeps pointing to the name of the opened element) until the end of
 the attribute list. That is, until any token other than those described in
 L</Attributes> is returned. Although the I<elem> pointer itself may be reused
 and modified while parsing the contents of the element, the buffer that I<elem>
 points to will remain valid up to and including the corresponding
 C<YXML_ELEMEND>.

 Yxml will verify that elements properly nest and that the name of each closing
 tag properly matches that of the corresponding opening tag. The application may
 safely assume that each C<YXML_ELEMSTART> is properly matched with a
 C<YXML_ELEMEND>, or that otherwise an error is returned. Furthermore, only a
 single root element is allowed. When the root element is closed, no further
 C<YXML_ELEMSTART> tokens will be returned.

 No distinction is made between self-closing tags and elements with empty
 content. For example, both C<< <a/> >> and C<< <a></a> >> will result in the
 C<YXML_ELEMSTART> token (with C<elem="a">) followed by C<YXML_ELEMEND>.

 Element contents are returned in the form of the C<YXML_CONTENT> token and the
 I<data> field. This is described in more detail in L</Character Data>.

 =head3 Attributes

 Element attributes are passed using the C<YXML_ATTRSTART>, C<YXML_ATTRVAL> and
 C<YXML_ATTREND> tokens. The name of the attribute is available in the I<attr>
 field, which is available when C<YXML_ATTRSTART> is returned and valid up to
 and including the next C<YXML_ATTREND>.

 Yxml does not verify that attribute names are unique within a single element.
 It is thus possible that the same attribute will appear twice, possibly with a
 different value. The correct way to handle this situation is to stop parsing
 the rest of the document and to report an error, but if the application is not
 interested in all attributes, detecting duplicates in them may complicate the
 code and possibly even introduce security vulnerabilities (e.g. algorithmic
 complexity attacks in a hash table). As such, the best solution is to report an
 error when you can easily detect a duplicate attribute, but ignore duplicates
 that require more effort to be detected.

 The attribute value is returned with the C<YXML_ATTRVAL> token and the I<data>
 field. This is described in more detail in L</Character Data>.

 =head3 Processing Instructions

 Processing instructions are passed in similar fashion to attributes, and are
 passed using C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND>. The target
 of the PI is available in the I<pi> field after C<YXML_PISTART> and remains
 valid up to (but excluding) the next C<YXML_PIEND> token.

 PI contents are returned as C<YXML_PICONTENT> tokens and using the I<data>
 field, described in more detail in L</Character Data>.

 =head3 Character Data

 Element contents (C<YXML_CONTENT>), attribute values (C<YXML_ATTRVAL>) and PI
 contents (C<YXML_PICONTENT>) are all passed to the application in small chunks
 through the I<data> field. Each time that C<yxml_parse()> returns one of these
 tokens, the I<data> field will contain one or more bytes of the element
 contents, attribute value or PI content. The string is zero-terminated, and its
 value is only valid until the next call to C<yxml_parse()>.

 Typically only a single byte is returned after each call, but multiple bytes
 can be returned in the following special cases:

 =over

 =item * Character references outside of the ASCII character range. When a
 character reference is encountered in element contents or in an attribute
 value, it is automatically replaced with the referenced character. For example,
 the XML string C<&#47;> is replaced with the single character "/". If the
 character value is above 127, its value is encoded in UTF-8 and then returned
 as a multi-byte string in the I<data> field. For example, the character
 reference C<&#xe7;> is returned as the C string "\xc3\xa9", which is the UTF-8
 encoding for the character "é". Character references are not expanded in PI
 contents.

 =item * The special character "]" in CDATA sections. When the "]" character is
 encountered inside a CDATA section, yxml can't immediately return it to the
 application because it does not know whether the character is part of the CDATA
 ending or whether it is still part of its contents. So it remembers the
 character for the next call to C<yxml_parse()>, and if it then turns out that
 the character was part of the CDATA contents, it returns both the "]" character
 and the following byte in the same I<data> string. Similarly, if two "]"
 characters appear in sequence as part of the CDATA content, then the two
 characters are returned in a single I<data> string together with the byte that
 follows. CDATA sections only appear in element contents, so this does not
 happen in attribute values or PI contents.

 =item * The special character "?" in PI contents. This is similar to the issue
 with "]" characters in CDATA sections. Yxml remembers a "?" character while
 parsing a PI, and then returns it together with the byte following it if it
 turned out to be part of the PI contents.

 =back

 Note that C<yxml_parse()> operates on bytes rather than characters. If the
 document is encoded in a multi-byte character encoding such as UTF-8, then each
 Unicode character that occupies more than a single byte will be broken up and
 its bytes processed individually. As a result, the bytes returned in the
 I<data> field may not necessarily represent a single Unicode character. To
 ensure that multi-byte characters are not broken up, the application can
 concatenate multiple data tokens to a single buffer before attempting to do
 further processing on the result.

 To make processing easier, an application may want to combine all the tokens
 into a single buffer. This can be easily implemented as follows:

   SomeString attrval;
   while(..) {
     yxml_ret_t r = yxml_parse(x, ch);
     switch(r) {
     case YXML_ATTRSTART:
       somestring_initialize(attrval);
       break;
     case YXML_ATTRVAL:
       somestring_append(attrval, x->data);
       break;
     case YXML_ATTREND:
       /* Now we have a full attribute. Its name is in x->attr, and its value is
        * in the string 'attrval'. */
       somestring_reset(attrval);
       break;
     }
   }

 The C<SomeString> type and C<somestring_> functions are stubs for any string
 handling library of your choosing. When using Glib, for example, one could use
 the L<GString|https://developer.gnome.org/glib/stable/glib-Strings.html>
 type and the C<g_string_new()>, C<g_string_append()> and C<g_string_free()>
 functions. For a more lighter-weight string library there is also
 L<kstring.h in klib|https://github.com/attractivechaos/klib>, but the
 functionality required in the above example can easily be implemented in a few
 lines of pure C, too.

 When buffering data into an ever-growing string, as done in the previous
 example, one should be careful to protect against memory exhaustion. This can
 be done trivially by limiting the size of the total XML document or the maximum
 length of the buffer. If you want to extract information from an XML document
 that might not fit into memory, but you know that the information you care
 about is limited in size and is only stored in specific attributes or elements,
 you can choose to ignore data you don't care about. For example, if you only
 want to extract the "Size" attribute and you know that its value is never
 larger than 63 bytes, you can limit your code to read only that value and store
 it into a small pre-allocated buffer:

   char sizebuf[64], *sizecur = NULL, *tmp;
   while(..) {
     yxml_ret_t r = yxml_parse(x, ch);
     switch(r) {
     case YXML_ATTRSTART:
       if(strcmp(x->attr, "Size") == 0)
         sizecur = sizebuf;
       break;
     case YXML_ATTRVAL:
       if(!sizecur) /* Are we in the "Size" attribute? */
         break;
       /* Append x->data to sizecur while there is space */
       tmp = x->data;
       while(*tmp && sizecur < sizebuf+sizeof(sizebuf))
         *(sizecur++) = *(tmp++);
       if(sizecur == sizebuf+sizeof(sizebuf))
         exit(1); /* Too long attribute value, handle error */
       *sizecur = 0;
       break;
     case YXML_ATTREND:
       if(sizecur) {
         /* Now we have the value of the "Size" attribute in sizebuf */
         sizecur = NULL;
       }
       break;
     }
   }


 =head2 Finalization

   yxml_t *x; /* An initialized state */
   yxml_ret_t r = yxml_eof(x);
   if(r < 0)
     exit(1); /* Handle error */
   else
     /* No errors in the XML document */

 Because C<yxml_parse()> does not know when the end of the XML document has been
 reached, it is unable to detect certain errors in the document. This is why,
 after successfully parsing a complete document with C<yxml_parse()>, the
 application should call C<yxml_eof()> to perform some extra checks.

 C<yxml_eof()> will return C<YXML_OK> if the parsed XML document is well-formed,
 C<YXML_EEOF> otherwise. The following errors are not detected by
 C<yxml_parse()> but will result in an error on C<yxml_eof()>:

 =over

 =item * The XML document did not contain a root element (e.g. an empty
 file).

 =item * The XML root element has not been closed (e.g. "C<< <a> .. >>").

 =item * The XML document ended in the middle of a comment or PI (e.g.
 "C<< <a/><!-- .. >>").

 =back

 =head2 Utility functions

   size_t yxml_symlen(yxml_t *, const char *);

 C<yxml_symlen()> returns the length of the element name (C<< x->elem >>),
 attribute name (C<< x->attr >>), or PI name (C<< x->pi >>). When used
 correctly, it gives the same result as C<strlen()>, except without having to
 scan through the string. This function should B<ONLY> be used directly after
 the C<YXML_ELEMSTART>, C<YXML_ATTRSTART> or C<YXML_PISTART> (respectively)
 tokens have been returned by C<yxml_parse()>, calling this function at any
 other time may not give the correct results. This function should B<NOT> be
 used on strings other than C<< x->elem >>, C<< x->attr >> or C<< x->pi >>.
	=head1 Introduction

	Yxml is a small non-validating and mostly conforming XML parser written in C.

	The latest version of yxml and this document can be found on
	L<http://dev.yorhel.nl/yxml>.

	=head1 Compiling yxml

	Due to the small size of yxml, the recommended way to use it is to copy the
	L<yxml.c\|http://g.blicky.net/yxml.git/plain/yxml.c> and
	L<yxml.h\|http://g.blicky.net/yxml.git/plain/yxml.h> from the git repository
	into your project directory, and compile and link yxml.c as part of your
	program or library.

	The git repository also includes a Makefile. Running C<make> without specifying
	a target will compile a C<.a> file for easy static linking. A test suite is
	available under C<make test>.

	=head1 API documentation

	=head2 Overview

	Yxml is designed to be very flexible and efficient, and thus offers a
	relatively low-level stream-based API. The entire API consists of two typedefs
	and three functions:

	typedef enum { /* .. */ } yxml_ret_t;
	typedef struct { /* .. */ } yxml_t;

	void yxml_init(yxml_t x, void buf, size_t bufsize);
	yxml_ret_t yxml_parse(yxml_t *x, int ch);
	yxml_ret_t yxml_eof(yxml_t *x);

	The values of I<yxml_ret_t> and the public fields of I<yxml_t> are explained in
	detail below. Parsing a file using yxml involves three steps:

	=over

	=item 1. Initialization, using C<yxml_init()>.

	=item 2. Parsing. This is performed in a loop where C<yxml_parse()> is called
	on each character of the input file.

	=item 3. Finalization, using C<yxml_eof()>.

	=back


	=head2 Initialization

	#define BUFSIZE 4096
	void *buf = malloc(BUFSIZE);
	yxml_t x;
	yxml_init(&x, buf, BUFSIZE);

	The parsing state for an input document is remembered in the C<yxml_t>
	structure. This structure needs to be allocated and initialized before parsing
	a new XML document.

	Allocating space for the C<yxml_t> structure is the responsibility of the
	application. Allocation can be done on the stack, but it is also possible to
	embed the struct inside a larger object or to allocate space for the struct
	separately.

	C<yxml_init()> takes a pointer to an (uninitialized) C<yxml_t> struct as first
	argument and performs the necessary initialization. The two additional
	arguments specify a pointer to a buffer and the size of this buffer. The given
	buffer must be writable, but does not have to be initialized by the
	application.

	The buffer is used internally by yxml to keep a stack of opened XML element
	names, property names and PI targets. The size of the buffer determines both
	the maximum depth in which XML elements can be nested and the maximum length of
	element names, property names and PI targets. Each name consumes
	C<strlen(name)+1> bytes in the buffer, and the first byte of the buffer is
	reserved for the C<\0> byte. This means that in order to parse an XML document
	with an element name of 100 bytes, a property name or PI target of 50 bytes and
	a nesting depth of 10 levels, the buffer must be at least
	C<1+10*(100+1)+(50+1)=1062> bytes. Note that properties and PIs don't nest, so
	the C<max(PI_name, property_name)> only needs to be counted once.

	It is not currently possibly to dynamically grow the buffer while parsing, so
	it is important to choose a buffer size that is large enough to handle all the
	XML documents that you want to parse. Since element names, property names and
	PI targets are typically much shorter than in the previous example, a buffer
	size of 4 or 8 KiB will give enough headroom even for documents with deep
	nesting.

	As a useful hack, it is possible to merge the memory for the C<yxml_t> struct
	and the stack buffer in a single allocation:

	yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE);
	yxml_init(x, x+1, BUFSIZE);

	This way, the complete parsing state can be passed around with a single
	pointer, and both the struct and the buffer can be freed with a single call to
	C<free(x)>.


	=head2 Parsing

	yxml_t x; / An initialized state */
	char doc; / The XML document as a zero-terminated string */
	for(; *doc; doc++) {
	yxml_ret_t r = yxml_parse(x, *doc);
	if(r < 0)
	exit(1); /* Handle error */
	/* Handle any tokens we are interested in */
	}

	The actual parsing of an XML document is facilitated by the C<yxml_parse()>
	function. It accepts a pointer to an initialized C<yxml_t> struct as first
	argument and a byte as second argument. The byte is passed as an C<int>, and
	values in the range of -128 to 255 (both inclusive) are accepted. This way you
	can pass either C<signed char> or C<unsigned char> values, yxml will work fine
	with both. To parse a complete document, C<yxml_parse()> needs to be called
	for each byte of the document in sequence, as done in the above example.

	For each byte, C<yxml_parse()> will return either I<YXML_OK> (0), a token (>0)
	or an error (<0). I<YXML_OK> is returned if the given byte has been
	parsed/consumed correctly but that otherwise nothing worthy of note has
	happened. The application should then continue processing and pass the next
	byte of the document.

	=head3 Public State Variables

	After each call to C<yxml_parse()>, a number of interesting fields in the
	C<yxml_t> struct are updated. The fields documented here are part of the API,
	and are considered as extra return values of C<yxml_parse()>. All of these
	fields should be considered read-only.

	=over

	=item C<char *elem;>

	Name of the currently opened XML element. Points into the buffer given to
	C<yxml_init()>. Described in L</Elements>.

	=item C<char *attr;>

	Name of the currently opened attribute. Points into the buffer given to
	C<yxml_init()>. Described in L</Attributes>.

	=item C<char *pi;>

	Target of the currently opened PI. Points into the buffer given to
	C<yxml_init()>. Described in L</Processing Instructions>.

	=item C<char data[8];>

	Character data of element contents, attribute values or PI contents. Described
	in L</Character Data>.

	=item C<uint32_t line;>

	Number of the line in the XML document that is currently being parsed.

	=item C<uint64_t byte;>

	Byte offset into the current line the XML document.

	=item C<uint64_t total;>

	Byte offset into the XML document.

	=back

	The values of the I<elem>, I<attr>, I<pi> and I<data> elements depend on the
	parsing context, and only remain valid within that context. The exact contexts
	in which these fields contain valid information is described in their
	respective sections below.

	The I<line>, I<byte> and I<total> fields are mainly useful for error reporting.
	When C<yxml_parse()> reports an error, these fields can be used to generate a
	useful error message. For example:

	printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64",
	filename, x->line, x->byte, x->total);

	=head3 Error Handling

	Errors are not recoverable. No further calls to C<yxml_parse()> or
	C<yxml_eof()> should be performed on the same C<yxml_t> struct. Re-initializing
	the same struct using C<yxml_init()> to start parsing a new document is
	possible, however. The following error values may be returned by
	C<yxml_parse()>:

	=over

	=item YXML_EREF

	Invalid character or entity reference. E.g. C<&whatever;> or C<&#ABC;>.

	=item YXML_ECLOSE

	Close tag does not match open tag. E.g. C<< <Tag> .. </SomeOtherTag> >>.

	=item YXML_ESTACK

	Stack overflow. This happens when the buffer given to C<yxml_init()> was not
	large enough to parse this document. E.g. when elements are too deeply nested
	or an element name, attribute name or PI target is too long.

	=item YXML_ESYN

	Miscellaneous syntax error.

	=back


	=head2 Handling Tokens

	The C<yxml_parse()> function will return tokens as they are found. When loading
	an XML document, it is important to know which tokens are returned in which
	situation and how to handle them.

	The following graph shows the (simplified) state machine of the parser to
	illustrate the order in which tokens are returned. The labels on the edge
	indicate the tokens that are returned by C<yxml_parse()>, with their C<YXML_>
	prefix removed. The special return value C<YXML_OK> and error returns are not
	displayed.

	[html]<img src="/img/yxml-apistates.png" />É

	Tokens that the application is not interested in can be ignored safely. For
	example, if you are not interested in handling processing instructions, then
	the C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND> tokens can be handled
	exactly as if they were an alias for C<YXML_OK>.

	=head3 Elements

	The C<YXML_ELEMSTART> and C<YXML_ELEMEND> tokens are returned when an XML
	element is opened and closed, respectively. When C<YXML_ELEMSTART> is returned,
	the I<elem> struct field will hold the name of the element. This field will be
	valid (i.e. keeps pointing to the name of the opened element) until the end of
	the attribute list. That is, until any token other than those described in
	L</Attributes> is returned. Although the I<elem> pointer itself may be reused
	and modified while parsing the contents of the element, the buffer that I<elem>
	points to will remain valid up to and including the corresponding
	C<YXML_ELEMEND>.

	Yxml will verify that elements properly nest and that the name of each closing
	tag properly matches that of the corresponding opening tag. The application may
	safely assume that each C<YXML_ELEMSTART> is properly matched with a
	C<YXML_ELEMEND>, or that otherwise an error is returned. Furthermore, only a
	single root element is allowed. When the root element is closed, no further
	C<YXML_ELEMSTART> tokens will be returned.

	No distinction is made between self-closing tags and elements with empty
	content. For example, both C<< <a/> >> and C<< <a></a> >> will result in the
	C<YXML_ELEMSTART> token (with C<elem="a">) followed by C<YXML_ELEMEND>.

	Element contents are returned in the form of the C<YXML_CONTENT> token and the
	I<data> field. This is described in more detail in L</Character Data>.

	=head3 Attributes

	Element attributes are passed using the C<YXML_ATTRSTART>, C<YXML_ATTRVAL> and
	C<YXML_ATTREND> tokens. The name of the attribute is available in the I<attr>
	field, which is available when C<YXML_ATTRSTART> is returned and valid up to
	and including the next C<YXML_ATTREND>.

	Yxml does not verify that attribute names are unique within a single element.
	It is thus possible that the same attribute will appear twice, possibly with a
	different value. The correct way to handle this situation is to stop parsing
	the rest of the document and to report an error, but if the application is not
	interested in all attributes, detecting duplicates in them may complicate the
	code and possibly even introduce security vulnerabilities (e.g. algorithmic
	complexity attacks in a hash table). As such, the best solution is to report an
	error when you can easily detect a duplicate attribute, but ignore duplicates
	that require more effort to be detected.

	The attribute value is returned with the C<YXML_ATTRVAL> token and the I<data>
	field. This is described in more detail in L</Character Data>.

	=head3 Processing Instructions

	Processing instructions are passed in similar fashion to attributes, and are
	passed using C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND>. The target
	of the PI is available in the I<pi> field after C<YXML_PISTART> and remains
	valid up to (but excluding) the next C<YXML_PIEND> token.

	PI contents are returned as C<YXML_PICONTENT> tokens and using the I<data>
	field, described in more detail in L</Character Data>.

	=head3 Character Data

	Element contents (C<YXML_CONTENT>), attribute values (C<YXML_ATTRVAL>) and PI
	contents (C<YXML_PICONTENT>) are all passed to the application in small chunks
	through the I<data> field. Each time that C<yxml_parse()> returns one of these
	tokens, the I<data> field will contain one or more bytes of the element
	contents, attribute value or PI content. The string is zero-terminated, and its
	value is only valid until the next call to C<yxml_parse()>.

	Typically only a single byte is returned after each call, but multiple bytes
	can be returned in the following special cases:

	=over

	=item * Character references outside of the ASCII character range. When a
	character reference is encountered in element contents or in an attribute
	value, it is automatically replaced with the referenced character. For example,
	the XML string C</> is replaced with the single character "/". If the
	character value is above 127, its value is encoded in UTF-8 and then returned
	as a multi-byte string in the I<data> field. For example, the character
	reference C<ç> is returned as the C string "\xc3\xa9", which is the UTF-8
	encoding for the character "é". Character references are not expanded in PI
	contents.

	=item * The special character "]" in CDATA sections. When the "]" character is
	encountered inside a CDATA section, yxml can't immediately return it to the
	application because it does not know whether the character is part of the CDATA
	ending or whether it is still part of its contents. So it remembers the
	character for the next call to C<yxml_parse()>, and if it then turns out that
	the character was part of the CDATA contents, it returns both the "]" character
	and the following byte in the same I<data> string. Similarly, if two "]"
	characters appear in sequence as part of the CDATA content, then the two
	characters are returned in a single I<data> string together with the byte that
	follows. CDATA sections only appear in element contents, so this does not
	happen in attribute values or PI contents.

	=item * The special character "?" in PI contents. This is similar to the issue
	with "]" characters in CDATA sections. Yxml remembers a "?" character while
	parsing a PI, and then returns it together with the byte following it if it
	turned out to be part of the PI contents.

	=back

	Note that C<yxml_parse()> operates on bytes rather than characters. If the
	document is encoded in a multi-byte character encoding such as UTF-8, then each
	Unicode character that occupies more than a single byte will be broken up and
	its bytes processed individually. As a result, the bytes returned in the
	I<data> field may not necessarily represent a single Unicode character. To
	ensure that multi-byte characters are not broken up, the application can
	concatenate multiple data tokens to a single buffer before attempting to do
	further processing on the result.

	To make processing easier, an application may want to combine all the tokens
	into a single buffer. This can be easily implemented as follows:

	SomeString attrval;
	while(..) {
	yxml_ret_t r = yxml_parse(x, ch);
	switch(r) {
	case YXML_ATTRSTART:
	somestring_initialize(attrval);
	break;
	case YXML_ATTRVAL:
	somestring_append(attrval, x->data);
	break;
	case YXML_ATTREND:
	/* Now we have a full attribute. Its name is in x->attr, and its value is
	* in the string 'attrval'. */
	somestring_reset(attrval);
	break;
	}
	}

	The C<SomeString> type and C<somestring_> functions are stubs for any string
	handling library of your choosing. When using Glib, for example, one could use
	the L<GString\|https://developer.gnome.org/glib/stable/glib-Strings.html>
	type and the C<g_string_new()>, C<g_string_append()> and C<g_string_free()>
	functions. For a more lighter-weight string library there is also
	L<kstring.h in klib\|https://github.com/attractivechaos/klib>, but the
	functionality required in the above example can easily be implemented in a few
	lines of pure C, too.

	When buffering data into an ever-growing string, as done in the previous
	example, one should be careful to protect against memory exhaustion. This can
	be done trivially by limiting the size of the total XML document or the maximum
	length of the buffer. If you want to extract information from an XML document
	that might not fit into memory, but you know that the information you care
	about is limited in size and is only stored in specific attributes or elements,
	you can choose to ignore data you don't care about. For example, if you only
	want to extract the "Size" attribute and you know that its value is never
	larger than 63 bytes, you can limit your code to read only that value and store
	it into a small pre-allocated buffer:

	char sizebuf[64], sizecur = NULL, tmp;
	while(..) {
	yxml_ret_t r = yxml_parse(x, ch);
	switch(r) {
	case YXML_ATTRSTART:
	if(strcmp(x->attr, "Size") == 0)
	sizecur = sizebuf;
	break;
	case YXML_ATTRVAL:
	if(!sizecur) /* Are we in the "Size" attribute? */
	break;
	/* Append x->data to sizecur while there is space */
	tmp = x->data;
	while(*tmp && sizecur < sizebuf+sizeof(sizebuf))
	(sizecur++) = (tmp++);
	if(sizecur == sizebuf+sizeof(sizebuf))
	exit(1); /* Too long attribute value, handle error */
	*sizecur = 0;
	break;
	case YXML_ATTREND:
	if(sizecur) {
	/* Now we have the value of the "Size" attribute in sizebuf */
	sizecur = NULL;
	}
	break;
	}
	}


	=head2 Finalization

	yxml_t x; / An initialized state */
	yxml_ret_t r = yxml_eof(x);
	if(r < 0)
	exit(1); /* Handle error */
	else
	/* No errors in the XML document */

	Because C<yxml_parse()> does not know when the end of the XML document has been
	reached, it is unable to detect certain errors in the document. This is why,
	after successfully parsing a complete document with C<yxml_parse()>, the
	application should call C<yxml_eof()> to perform some extra checks.

	C<yxml_eof()> will return C<YXML_OK> if the parsed XML document is well-formed,
	C<YXML_EEOF> otherwise. The following errors are not detected by
	C<yxml_parse()> but will result in an error on C<yxml_eof()>:

	=over

	=item * The XML document did not contain a root element (e.g. an empty
	file).

	=item * The XML root element has not been closed (e.g. "C<< <a> .. >>").

	=item * The XML document ended in the middle of a comment or PI (e.g.
	"C<< <a/><!-- .. >>").

	=back

	=head2 Utility functions

	size_t yxml_symlen(yxml_t , const char );

	C<yxml_symlen()> returns the length of the element name (C<< x->elem >>),
	attribute name (C<< x->attr >>), or PI name (C<< x->pi >>). When used
	correctly, it gives the same result as C<strlen()>, except without having to
	scan through the string. This function should B<ONLY> be used directly after
	the C<YXML_ELEMSTART>, C<YXML_ATTRSTART> or C<YXML_PISTART> (respectively)
	tokens have been returned by C<yxml_parse()>, calling this function at any
	other time may not give the correct results. This function should B<NOT> be
	used on strings other than C<< x->elem >>, C<< x->attr >> or C<< x->pi >>.