Made a start on some documentation

commit: c28dd19f5f7f547b35afd80673d1d713e7271b26
author: Yorhel <git@yorhel.nl> Wed Nov 13 10:47:28 2013 +0100
committer: Yorhel <git@yorhel.nl> Wed Nov 13 10:47:28 2013 +0100
tree: 8467f47d114e31785dfae71039a1ef399f5ecd90
parent: e8bd5434b34dad7c7d173c5afdf9a0abe4ce2df1
diff --git a/yxml.pod b/yxml.pod
new file mode 100644
index 0000000..978a6a8
--- /dev/null
+++ b/yxml.pod

@@ -0,0 +1,249 @@
+=head1 Introduction
+
+Yxml is a small non-validating and mostly conforming XML parser written in C.
+
+The latest version of yxml and this document can be found on
+L<http://dev.yorhel.nl/yxml>.
+
+B<This document is a work in progress; some useful material is still missing.>
+
+=head1 Compiling yxml
+
+Due to the small size of yxml, the recommended way to use it is to copy
+L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and
+L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h> from the git repository
+into your project directory, and to compile and link yxml.c as part of your
+program or library.
+
+The git repository also includes a Makefile. Running C<make> without specifying
+a target will compile a C<.a> file for easy static linking. A test suite is
+available under C<make test>.
+
+=head1 API documentation
+
+=head2 Overview
+
+Yxml is designed to be very flexible and efficient, and thus offers a
+relatively low-level stream-based API. The entire API consists of two typedefs
+and three functions:
+
+  typedef enum { /* .. */ } yxml_ret_t;
+  typedef struct { /* .. */ } yxml_t;
+
+  void yxml_init(yxml_t *x, char *buf, size_t bufsize);
+  yxml_ret_t yxml_parse(yxml_t *x, int ch);
+  yxml_ret_t yxml_eof(yxml_t *x);
+
+The values of I<yxml_ret_t> and the public fields of I<yxml_t> are explained in
+detail below. Parsing a file using yxml involves three steps:
+
+=over
+
+=item 1. Initialization, using C<yxml_init()>.
+
+=item 2. Parsing. This is performed in a loop where C<yxml_parse()> is called
+on each character of the input file.
+
+=item 3. Finalization, using C<yxml_eof()>.
+
+=back
+
+
+=head2 Initialization
+
+  #define BUFSIZE 4096
+  char *buf = malloc(BUFSIZE);
+  yxml_t x;
+  yxml_init(&x, buf, BUFSIZE);
+
+The parsing state for an input document is remembered in the C<yxml_t>
+structure. This structure needs to be allocated and initialized before parsing
+a new XML document.
+
+Allocating space for the C<yxml_t> structure is the responsibility of the
+application. Allocation can be done on the stack, but it is also possible to
+embed the struct inside a larger object or to allocate space for the struct
+separately.
+
+C<yxml_init()> takes a pointer to an (uninitialized) C<yxml_t> struct as first
+argument and performs the necessary initialization. The two additional
+arguments specify a pointer to a buffer and the size of this buffer. The given
+buffer must be writable, but does not have to be initialized by the
+application.
+
+The buffer is used internally by yxml to keep a stack of opened XML element
+names, attribute names and PI targets. The size of the buffer determines both
+the maximum depth to which XML elements can be nested and the maximum length of
+element names, attribute names and PI targets. Each name consumes
+C<strlen(name)+1> bytes in the buffer, and the first byte of the buffer is
+reserved for the C<\0> byte. This means that in order to parse an XML document
+with element names of 100 bytes, an attribute name or PI target of 50 bytes and
+a nesting depth of 10 levels, the buffer must be at least
+C<1+10*(100+1)+(50+1)=1062> bytes. Note that attributes and PIs don't nest, so
+the C<max(PI_name, attribute_name)> only needs to be counted once.
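+
+The sizing rule above can be written as a small helper. This is only a sketch:
+C<min_stack_bufsize()> is a hypothetical name and not part of the yxml API, and
+it assumes the worst case in which every nesting level uses the longest element
+name.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of yxml: smallest buffer that fits `depth`
 * nested element names of `max_elem_len` bytes each, plus one attribute name
 * or PI target of `max_attr_or_pi_len` bytes, following the rule
 * 1 + depth*(elem+1) + (attr_or_pi+1) described in the text. */
static size_t min_stack_bufsize(size_t depth, size_t max_elem_len,
                                size_t max_attr_or_pi_len)
{
    assert(depth >= 1); /* a well-formed document has at least a root element */
    return 1                               /* reserved leading '\0' byte  */
         + depth * (max_elem_len + 1)      /* element names + terminators */
         + (max_attr_or_pi_len + 1);       /* one attribute or PI name    */
}
```

+
+With the numbers from the example, C<min_stack_bufsize(10, 100, 50)> evaluates
+to 1062.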
+
+It is not currently possible to dynamically grow the buffer while parsing, so
+it is important to choose a buffer size that is large enough to handle all the
+XML documents that you want to parse. Since element names, attribute names and
+PI targets are typically much shorter than in the previous example, a buffer
+size of 4 or 8 KiB will give enough headroom even for documents with deep
+nesting.
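+
+To illustrate why a fixed buffer limits both nesting depth and name length,
+here is a minimal sketch of a name stack that follows the layout described
+above (a reserved leading C<\0>, then C<strlen(name)+1> bytes per name). The
+C<name_stack> type and its functions are hypothetical and are an assumption
+about one possible layout, not yxml's actual implementation.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size name stack; NOT yxml's real internals. */
typedef struct {
    char  *buf;   /* caller-provided storage */
    size_t size;  /* total number of bytes in buf */
    size_t top;   /* index of the '\0' terminating the topmost name */
} name_stack;

static void name_stack_init(name_stack *s, char *buf, size_t size)
{
    assert(s != NULL && buf != NULL && size > 0);
    s->buf = buf;
    s->size = size;
    s->top = 0;
    buf[0] = '\0'; /* reserved first byte marks the bottom of the stack */
}

/* Push a name. Returns 0 on success, -1 if it does not fit (overflow). */
static int name_stack_push(name_stack *s, const char *name)
{
    size_t n = strlen(name);
    if (n + 1 > s->size - 1 - s->top) /* bytes still free after `top` */
        return -1;
    memcpy(s->buf + s->top + 1, name, n + 1);
    s->top += n + 1;
    return 0;
}

/* Pop the topmost name; does nothing on an empty stack. */
static void name_stack_pop(name_stack *s)
{
    if (s->top == 0)
        return;
    s->top--;
    while (s->buf[s->top] != '\0') /* stops at buf[0] at the latest */
        s->top--;
}

/* Return the topmost name, or "" if the stack is empty. */
static const char *name_stack_peek(const name_stack *s)
{
    size_t i = s->top;
    while (i > 0 && s->buf[i - 1] != '\0')
        i--;
    return s->buf + i;
}
```

+
+In this sketch a failed push corresponds to the situation in which the buffer
+passed to C<yxml_init()> is too small for the document.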
+
+As a useful hack, it is possible to merge the memory for the C<yxml_t> struct
+and the stack buffer in a single allocation:
+
+  yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE);
+  yxml_init(x, (char *)(x+1), BUFSIZE);
+
+This way, the complete parsing state can be passed around with a single
+pointer, and both the struct and the buffer can be freed with a single call to
+C<free(x)>.
+
+
+=head2 Parsing
+
+  yxml_t *x; /* An initialized state */
+  char *doc; /* The XML document as a zero-terminated string */
+  for(; *doc; doc++) {
+    yxml_ret_t r = yxml_parse(x, *doc);
+    if(r < 0)
+      exit(1); /* Handle error */
+    /* Handle any tokens we are interested in */
+  }
+
+The actual parsing of an XML document is facilitated by the C<yxml_parse()>
+function. It accepts a pointer to an initialized C<yxml_t> struct as first
+argument and a byte as second argument. The byte is passed as an C<int>, and
+values in the range of -128 to 255 (both inclusive) are accepted. This way you
+can pass either C<signed char> or C<unsigned char> values; yxml will work fine
+with both. To parse a complete document, C<yxml_parse()> needs to be called
+for each byte of the document in sequence, as done in the above example.
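+
+A C<signed char> and an C<unsigned char> holding the same byte differ only in
+how values above 127 are represented as an C<int>. How yxml treats the argument
+internally is not described here; the hypothetical helper C<byte_value()> below
+(not part of yxml) merely sketches why both ranges can denote the same byte.
+

```c
#include <assert.h>

/* Hypothetical helper, not part of yxml: reduce an int in the accepted
 * range -128..255 to the byte value 0..255 it stands for. */
static unsigned char byte_value(int ch)
{
    assert(ch >= -128 && ch <= 255);
    /* Conversion to unsigned char is defined modulo 256, so the negative
     * values produced by a signed char map onto 128..255. */
    return (unsigned char)ch;
}
```

+
+For instance, the byte C<0xC3> arrives as -61 from a C<signed char> and as 195
+from an C<unsigned char>; both reduce to 195 here.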
+
+For each byte, C<yxml_parse()> will return either I<YXML_OK> (0), a token (>0)
+or an error (<0). I<YXML_OK> is returned if the given byte has been
+parsed/consumed correctly and nothing else worthy of note has
+happened. The application should then continue processing and pass the next
+byte of the document.
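+
+The sign convention can be captured in a small dispatch helper. This is a
+sketch only: C<step_kind> and C<classify_result()> are hypothetical names, and
+the helper takes a plain C<int> so that it does not depend on the actual values
+of C<yxml_ret_t>.
+

```c
#include <assert.h>

/* Hypothetical classification of a parse result; not part of yxml. */
typedef enum { STEP_ERROR, STEP_CONTINUE, STEP_TOKEN } step_kind;

/* Map a return value to one of the three cases described in the text:
 * negative = error, zero = nothing to report, positive = token. */
static step_kind classify_result(int r)
{
    if (r < 0)
        return STEP_ERROR;
    if (r == 0)
        return STEP_CONTINUE;
    return STEP_TOKEN;
}
```

+
+An application loop would stop on C<STEP_ERROR>, feed the next byte on
+C<STEP_CONTINUE>, and inspect the token on C<STEP_TOKEN>.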
+
+=head3 Public State Variables
+
+After each call to C<yxml_parse()>, a number of interesting fields in the
+C<yxml_t> struct are updated. The fields documented here are part of the API,
+and are considered as extra return values of C<yxml_parse()>. All of these
+fields should be considered read-only.
+
+=over
+
+=item C<char *elem;>
+
+Name of the currently opened XML element. Points into the buffer given to
+C<yxml_init()>. Described in L</Elements>.
+
+=item C<char *attr;>
+
+Name of the currently opened attribute. Points into the buffer given to
+C<yxml_init()>. Described in L</Attributes>.
+
+=item C<char *pi;>
+
+Target of the currently opened PI. Points into the buffer given to
+C<yxml_init()>. Described in L</Processing Instructions>.
+
+=item C<char data[8];>
+
+Character data of element contents, attribute values or PI contents. Described
+in L</Character Data>.
+
+=item C<uint32_t line;>
+
+Number of the line in the XML document that is currently being parsed.
+
+=item C<uint64_t byte;>
+
+Byte offset into the current line of the XML document.
+
+=item C<uint64_t total;>
+
+Byte offset into the XML document.
+
+=back
+
+The values of the I<elem>, I<attr>, I<pi> and I<data> fields depend on the
+parsing context, and only remain valid within that context. The exact contexts
+in which these fields contain valid information are described in their
+respective sections below.
+
+The I<line>, I<byte> and I<total> fields are mainly useful for error reporting.
+When C<yxml_parse()> or C<yxml_eof()> returns with an error, these fields can
+be used to generate a useful error message. For example:
+
+  printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64"\n",
+    filename, x->line, x->byte, x->total);
+
+=head3 Error Handling
+
+Errors are not recoverable. After an error has been returned, no further calls
+to C<yxml_parse()> or C<yxml_eof()> should be made on the same C<yxml_t>
+struct. Re-initializing the same struct using C<yxml_init()> to start parsing a
+new document is possible, however. The following error values may be returned
+by C<yxml_parse()>:
+
+=over
+
+=item YXML_EREF
+
+Invalid character or entity reference. E.g. C<&whatever;> or C<&#ABC;>.
+
+=item YXML_ECLOSE
+
+Close tag does not match open tag. E.g. C<< <Tag> .. </SomeOtherTag> >>.
+
+=item YXML_ESTACK
+
+Stack overflow. This happens when the buffer given to C<yxml_init()> was not
+large enough to parse this document. E.g. when elements are too deeply nested
+or an element name, attribute name or PI target is too long.
+
+=item YXML_ESYN
+
+Miscellaneous syntax error.
+
+=back
+
+=head3 Elements
+
+The C<YXML_ELEMSTART> and C<YXML_ELEMEND> tokens are returned when an XML
+element is opened and closed, respectively. When C<YXML_ELEMSTART> is returned,
+the I<elem> struct field will hold the name of the element. This field will be
+valid (i.e. keep pointing to the name of the opened element) until the end of
+the attribute list. That is, until any token other than those described in
+L</Attributes> is returned. Although the I<elem> pointer itself may be reused
+and modified while parsing the contents of the element, the buffer that I<elem>
+points to will remain valid up to and including the corresponding
+C<YXML_ELEMEND>.
+
+Yxml will validate that elements properly nest and that the name of each
+closing tag properly matches that of the respective opening tag. The
+application may safely assume that each C<YXML_ELEMSTART> is properly matched
+with a C<YXML_ELEMEND>, or that otherwise an error is returned. Furthermore,
+only a single root element is allowed. When the root element is closed, no
+further C<YXML_ELEMSTART> tokens will be returned.
+
+No distinction is made between self-closing tags and elements with empty
+content. For example, both C<< <a/> >> and C<< <a></a> >> will result in the
+C<YXML_ELEMSTART> token (with C<elem="a">) followed by the C<YXML_ELEMEND>
+token.
+
+Element contents are returned in the form of the C<YXML_CONTENT> token and the
+I<data> field. This is described in more detail in L</Character Data>.
+
+=head3 Attributes
+
+=head3 Processing Instructions
+
+=head3 Character Data
+
+
+=head2 Finalization
+
+