<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Encoding Conversion</title><meta name="generator" content="DocBook XSL Stylesheets V1.50.0"><link rel="home" href="index.html" title="Libxml Tutorial"><link rel="up" href="index.html" title="Libxml Tutorial"><link rel="previous" href="ar01s07.html" title="Retrieving Attributes"><link rel="next" href="apa.html" title="A. Sample Document"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">Encoding Conversion</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="ar01s07.html">Prev</a> </td><th width="60%" align="center"> </th><td width="20%" align="right"> <a accesskey="n" href="apa.html">Next</a></td></tr></table><hr></div><div class="sect1"><div class="titlepage"><div><h2 class="title" style="clear: both"><a name="xmltutorialconvert"></a>Encoding Conversion</h2></div></div><p>Data encoding compatibility problems are one of the most common
difficulties encountered by programmers new to XML in
general and libxml in particular. Thinking
through the design of your application in light of this issue will help
avoid difficulties later. Internally, libxml
stores and manipulates data in the UTF-8 format. Data used by your program
in other formats, such as the commonly used ISO-8859-1 encoding, must be
converted to UTF-8 before passing it to libxml
functions. If you want your program's output in an encoding other than
UTF-8, you also must convert it.</p><p>Libxml uses
iconv if it is available to convert
data. Without iconv, only UTF-8, UTF-16 and
ISO-8859-1 can be used as external formats. With
iconv, any format can be used provided
iconv is able to convert it to and from
UTF-8. Currently iconv supports about 150
different character formats, with the ability to convert from any of them to any other. While
the actual number of supported formats varies between implementations, every
iconv implementation is almost guaranteed to
support every format anyone has ever heard of.</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0"><tr><td rowspan="2" align="center" valign="top" width="25"><img src="images/warning.png"></td><th>Warning</th></tr><tr><td colspan="2" align="left" valign="top"><p>A common mistake is to use different formats for the internal data
in different parts of one's code. The most common case is an application
that assumes ISO-8859-1 to be the internal data format, combined with
libxml, which assumes UTF-8 to be the
internal data format. The result is an application that treats internal
data differently, depending on which code section is executing. One part
of the code or the other will then, naturally, misinterpret the data.
</p></td></tr></table></div><p>This example constructs a simple document, then adds content provided
at the command line to the document's root element and outputs the results
to <tt>stdout</tt> in the proper encoding. For this example, we
use ISO-8859-1 encoding. The encoding of the string input at the command
line is converted from ISO-8859-1 to UTF-8. Full code: <a href="apf.html" title="F. Code for Encoding Conversion Example">Appendix F</a></p><p>The conversion, encapsulated in the example code in the
<tt>convert</tt> function, uses
libxml's
<tt>xmlFindCharEncodingHandler</tt> function:
<pre class="programlisting">
<a name="handlerdatatype"></a><img src="images/callouts/1.png" alt="1" border="0">xmlCharEncodingHandlerPtr handler;
<a name="calcsize"></a><img src="images/callouts/2.png" alt="2" border="0">size = (int)strlen(in)+1;
out_size = size*2-1;
out = malloc((size_t)out_size);
&#8230;
<a name="findhandlerfunction"></a><img src="images/callouts/3.png" alt="3" border="0">handler = xmlFindCharEncodingHandler(encoding);
&#8230;
<a name="callconversionfunction"></a><img src="images/callouts/4.png" alt="4" border="0">handler-&gt;input(out, &amp;out_size, in, &amp;temp);
&#8230;
<a name="outputencoding"></a><img src="images/callouts/5.png" alt="5" border="0">xmlSaveFormatFileEnc(&quot;-&quot;, doc, encoding, 1);
</pre>
</p><div class="calloutlist"><table border="0" summary="Callout list"><tr><td width="5%" valign="top" align="left"><a href="#handlerdatatype"><img src="images/callouts/1.png" alt="1" border="0"></a> </td><td valign="top" align="left"><p><tt>handler</tt> is declared as a pointer to an
<tt>xmlCharEncodingHandler</tt> structure.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#calcsize"><img src="images/callouts/2.png" alt="2" border="0"></a> </td><td valign="top" align="left"><p>The conversion function of the <tt>xmlCharEncodingHandler</tt> needs
to be given the size of the input and output strings, which are
calculated here for strings <tt>in</tt> and
<tt>out</tt>.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#findhandlerfunction"><img src="images/callouts/3.png" alt="3" border="0"></a> </td><td valign="top" align="left"><p><tt>xmlFindCharEncodingHandler</tt> takes as its
argument the data's initial encoding and searches
libxml's built-in set of conversion
handlers, returning a pointer to the handler or NULL if none is
found.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#callconversionfunction"><img src="images/callouts/4.png" alt="4" border="0"></a> </td><td valign="top" align="left"><p>The conversion function identified by <tt>handler</tt>
requires as its arguments pointers to the input and output strings,
along with pointers to the length of each. The lengths must be determined
separately by the application.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#outputencoding"><img src="images/callouts/5.png" alt="5" border="0"></a> </td><td valign="top" align="left"><p>To output in a specified encoding rather than UTF-8, we use
<tt>xmlSaveFormatFileEnc</tt>, specifying the
encoding.</p></td></tr></table></div><p>
</p></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="ar01s07.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="index.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="apa.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Retrieving Attributes </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> A. Sample Document</td></tr></table></div></body></html>