=head1 Introduction
Yxml is a small non-validating and mostly conforming XML parser written in C.
The latest version of yxml and this document can be found on
L<http://dev.yorhel.nl/yxml>.
B<This document is a work in progress; some useful material is still missing.>
=head1 Compiling yxml
Due to the small size of yxml, the recommended way to use it is to copy the
L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and
L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h> from the git repository
into your project directory, and compile and link yxml.c as part of your
program or library.
The git repository also includes a Makefile. Running C<make> without specifying
a target will compile a C<.a> file for easy static linking. A test suite is
available under C<make test>.
=head1 API documentation
=head2 Overview
Yxml is designed to be very flexible and efficient, and thus offers a
relatively low-level stream-based API. The entire API consists of two typedefs
and three functions:
typedef enum { /* .. */ } yxml_ret_t;
typedef struct { /* .. */ } yxml_t;
void yxml_init(yxml_t *x, char *buf, size_t bufsize);
yxml_ret_t yxml_parse(yxml_t *x, int ch);
yxml_ret_t yxml_eof(yxml_t *x);
The values of I<yxml_ret_t> and the public fields of I<yxml_t> are explained in
detail below. Parsing a file using yxml involves three steps:
=over
=item 1. Initialization, using C<yxml_init()>.
=item 2. Parsing. This is performed in a loop where C<yxml_parse()> is called
on each character of the input file.
=item 3. Finalization, using C<yxml_eof()>.
=back
=head2 Initialization
#define BUFSIZE 4096
char *buf = malloc(BUFSIZE);
yxml_t x;
yxml_init(&x, buf, BUFSIZE);
The parsing state for an input document is remembered in the C<yxml_t>
structure. This structure needs to be allocated and initialized before parsing
a new XML document.
Allocating space for the C<yxml_t> structure is the responsibility of the
application. Allocation can be done on the stack, but it is also possible to
embed the struct inside a larger object or to allocate space for the struct
separately.
C<yxml_init()> takes a pointer to an (uninitialized) C<yxml_t> struct as first
argument and performs the necessary initialization. The two additional
arguments specify a pointer to a buffer and the size of this buffer. The given
buffer must be writable, but does not have to be initialized by the
application.
The buffer is used internally by yxml to keep a stack of opened XML element
names, attribute names and PI targets. The size of the buffer determines both
the maximum depth to which XML elements can be nested and the maximum length of
element names, attribute names and PI targets. Each name consumes
C<strlen(name)+1> bytes in the buffer, and the first byte of the buffer is
reserved for the C<\0> byte. This means that in order to parse an XML document
with element names of up to 100 bytes, an attribute name or PI target of up to
50 bytes and a nesting depth of 10 levels, the buffer must be at least
C<1+10*(100+1)+(50+1)=1062> bytes. Note that attributes and PIs don't nest, so
the C<max(PI_name, attribute_name)> only needs to be counted once.
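The arithmetic above can be wrapped in a small helper. The following is a
minimal sketch and not part of yxml: the name C<estimate_bufsize()> and its
parameters are made up for this example, and it assumes the worst case in which
every nesting level uses the longest element name.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the yxml API.
 * max_elem_len:       longest element name, in bytes
 * max_attr_or_pi_len: longest attribute (property) name or PI target, in bytes
 * max_depth:          deepest element nesting level to support
 * Returns the minimum buffer size according to the formula in the text. */
static size_t estimate_bufsize(size_t max_elem_len, size_t max_attr_or_pi_len,
                               size_t max_depth) {
    return 1                                  /* reserved leading \0 byte */
         + max_depth * (max_elem_len + 1)     /* one name per nesting level */
         + (max_attr_or_pi_len + 1);          /* attributes and PIs don't nest */
}
```

With the numbers from the example, C<estimate_bufsize(100, 50, 10)> evaluates
to 1062.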
It is not currently possible to dynamically grow the buffer while parsing, so
it is important to choose a buffer size that is large enough to handle all the
XML documents that you want to parse. Since element names, attribute names and
PI targets are typically much shorter than in the previous example, a buffer
size of 4 or 8 KiB will give enough headroom even for documents with deep
nesting.
As a useful hack, it is possible to merge the memory for the C<yxml_t> struct
and the stack buffer in a single allocation:
yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE);
yxml_init(x, (char *)(x+1), BUFSIZE);
This way, the complete parsing state can be passed around with a single
pointer, and both the struct and the buffer can be freed with a single call to
C<free(x)>.
=head2 Parsing
yxml_t *x; /* An initialized state */
char *doc; /* The XML document as a zero-terminated string */
for(; *doc; doc++) {
  yxml_ret_t r = yxml_parse(x, *doc);
  if(r < 0)
    exit(1); /* Handle error */
  /* Handle any tokens we are interested in */
}
The actual parsing of an XML document is facilitated by the C<yxml_parse()>
function. It accepts a pointer to an initialized C<yxml_t> struct as first
argument and a byte as second argument. The byte is passed as an C<int>, and
values in the range of -128 to 255 (both inclusive) are accepted. This way you
can pass either C<signed char> or C<unsigned char> values; yxml works with
both. To parse a complete document, C<yxml_parse()> needs to be called
for each byte of the document in sequence, as done in the above example.
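To illustrate why a range of -128 to 255 covers both kinds of C<char>, the
sketch below folds such a value onto the byte it represents. This is an
illustration only and does not claim to be how yxml handles the value
internally; C<fold_byte()> is a hypothetical name.

```c
/* Hypothetical illustration, not taken from yxml: maps an int in the
 * range -128..255 onto the byte value 0..255 that it represents.
 * A signed char holding 0xE9 arrives as -23, an unsigned char as 233;
 * both fold to 233. */
static unsigned fold_byte(int ch) {
    return (unsigned)(ch + 256) & 0xffu;
}
```

Note that the C<EOF> value returned by C<getc()> is negative (typically -1) and
therefore falls inside the accepted range, so when reading from a C<FILE *> the
end-of-file check has to happen before the value is passed on.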
For each byte, C<yxml_parse()> will return either I<YXML_OK> (0), a token (>0)
or an error (<0). I<YXML_OK> is returned if the given byte has been
parsed/consumed correctly but nothing else worthy of note has
happened. The application should then continue processing and pass the next
byte of the document.
=head3 Public State Variables
After each call to C<yxml_parse()>, a number of interesting fields in the
C<yxml_t> struct are updated. The fields documented here are part of the API,
and are considered as extra return values of C<yxml_parse()>. All of these
fields should be considered read-only.
=over
=item C<char *elem;>
Name of the currently opened XML element. Points into the buffer given to
C<yxml_init()>. Described in L</Elements>.
=item C<char *attr;>
Name of the currently opened attribute. Points into the buffer given to
C<yxml_init()>. Described in L</Attributes>.
=item C<char *pi;>
Target of the currently opened PI. Points into the buffer given to
C<yxml_init()>. Described in L</Processing Instructions>.
=item C<char data[8];>
Character data of element contents, attribute values or PI contents. Described
in L</Character Data>.
=item C<uint32_t line;>
Number of the line in the XML document that is currently being parsed.
=item C<uint64_t byte;>
Byte offset into the current line of the XML document.
=item C<uint64_t total;>
Byte offset into the XML document.
=back
The values of the I<elem>, I<attr>, I<pi> and I<data> fields depend on the
parsing context, and only remain valid within that context. The exact contexts
in which these fields contain valid information are described in their
respective sections below.
The I<line>, I<byte> and I<total> fields are mainly useful for error reporting.
When C<yxml_parse()> or C<yxml_eof()> returns with an error, these fields can
be used to generate a useful error message. For example:
printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64"\n",
  filename, x->line, x->byte, x->total);
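Where the message should end up in a buffer rather than on standard output, the
same format specifiers can be used with C<snprintf()>. The following is a
minimal sketch: C<format_position()> is a hypothetical helper, and it takes the
three counters as plain arguments so that the example does not depend on
yxml.h.

```c
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper, not part of yxml. The line, byte and total arguments
 * correspond to the yxml_t fields of the same names. Returns the value of
 * snprintf(), i.e. the length the full message needs. */
static int format_position(char *out, size_t outsize, const char *filename,
                           uint32_t line, uint64_t byte, uint64_t total) {
    return snprintf(out, outsize,
                    "%s:%" PRIu32 ":%" PRIu64 " (byte offset %" PRIu64 ")",
                    filename, line, byte, total);
}
```

A caller would pass C<x-E<gt>line>, C<x-E<gt>byte> and C<x-E<gt>total> after
C<yxml_parse()> or C<yxml_eof()> has returned an error.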
=head3 Error Handling
Errors are not recoverable: once an error has been returned, no further calls
to C<yxml_parse()> or C<yxml_eof()> should be performed on the same C<yxml_t>
struct. Re-initializing
the same struct using C<yxml_init()> to start parsing a new document is
possible, however. The following error values may be returned by
C<yxml_parse()>:
=over
=item YXML_EREF
Invalid character or entity reference. E.g. C<&whatever;> or C<&#ABC;>.
=item YXML_ECLOSE
Close tag does not match open tag. E.g. C<< <Tag> .. </SomeOtherTag> >>.
=item YXML_ESTACK
Stack overflow. This happens when the buffer given to C<yxml_init()> was not
large enough to parse this document. E.g. when elements are too deeply nested
or an element name, attribute name or PI target is too long.
=item YXML_ESYN
Miscellaneous syntax error.
=back
=head3 Elements
The C<YXML_ELEMSTART> and C<YXML_ELEMEND> tokens are returned when an XML
element is opened and closed, respectively. When C<YXML_ELEMSTART> is returned,
the I<elem> struct field will hold the name of the element. This field will be
valid (i.e. keep pointing to the name of the opened element) until the end of
the attribute list. That is, until any token other than those described in
L</Attributes> is returned. Although the I<elem> pointer itself may be reused
and modified while parsing the contents of the element, the buffer that I<elem>
points to will remain valid up to and including the corresponding
C<YXML_ELEMEND>.
Yxml will validate that elements properly nest and that the name of each
closing tag properly matches that of the respective opening tag. The
application may safely assume that each C<YXML_ELEMSTART> is properly matched
with a C<YXML_ELEMEND>, or that otherwise an error is returned. Furthermore,
only a single root element is allowed. When the root element is closed, no
further C<YXML_ELEMSTART> tokens will be returned.
No distinction is made between self-closing tags and elements with empty
content. For example, both C<< <a/> >> and C<< <a></a> >> will result in the
C<YXML_ELEMSTART> token (with C<elem="a">) followed by the C<YXML_ELEMEND>
token.
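Because every C<YXML_ELEMSTART> is matched by a C<YXML_ELEMEND>, an application
can track nesting with a simple counter. The sketch below shows this on an
array of token values. The C<TOK_*> constants and C<max_depth()> are
hypothetical stand-ins so that the example compiles without yxml.h; real code
would compare the return value of C<yxml_parse()> against the yxml constants
instead.

```c
#include <stddef.h>

/* Stand-in token values for this sketch only. A real program would use
 * YXML_ELEMSTART and YXML_ELEMEND from yxml.h. */
enum { TOK_OTHER = 0, TOK_ELEMSTART = 1, TOK_ELEMEND = 2 };

/* Returns the deepest element nesting level seen in a token sequence. */
static int max_depth(const int *tokens, size_t n) {
    int depth = 0, deepest = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (tokens[i] == TOK_ELEMSTART) {
            depth++;                  /* element opened */
            if (depth > deepest)
                deepest = depth;
        } else if (tokens[i] == TOK_ELEMEND) {
            depth--;                  /* element closed */
        }
    }
    return deepest;
}
```

Since self-closing tags produce the same two tokens as an empty element, the
counter needs no special case for them.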
Element contents are returned in the form of the C<YXML_CONTENT> token and the
I<data> field. This is described in more detail in L</Character Data>.
=head3 Attributes
=head3 Processing Instructions
=head3 Character Data
=head2 Finalization