<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Unicode character categories: HarfBuzz Manual</title>
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="index.html" title="HarfBuzz Manual">
<link rel="up" href="shaping-concepts.html" title="Shaping concepts">
<link rel="prev" href="shaping-operations.html" title="Shaping operations">
<link rel="next" href="text-runs.html" title="Text runs">
<meta name="generator" content="GTK-Doc V1.25 (XML mode)">
<link rel="stylesheet" href="style.css" type="text/css">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table class="navigation" id="top" width="100%" summary="Navigation header" cellpadding="2" cellspacing="5"><tr valign="middle">
<td width="100%" align="left" class="shortcuts"></td>
<td><a accesskey="h" href="index.html"><img src="home.png" width="16" height="16" border="0" alt="Home"></a></td>
<td><a accesskey="u" href="shaping-concepts.html"><img src="up.png" width="16" height="16" border="0" alt="Up"></a></td>
<td><a accesskey="p" href="shaping-operations.html"><img src="left.png" width="16" height="16" border="0" alt="Prev"></a></td>
<td><a accesskey="n" href="text-runs.html"><img src="right.png" width="16" height="16" border="0" alt="Next"></a></td>
</tr></table>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="unicode-character-categories"></a>Unicode character categories</h2></div></div></div>
<p>
Shaping models are typically specified in terms of the
properties that the Unicode standard assigns to the codepoints
of each script.
</p>
<p>
Every codepoint in the Unicode Character Database (UCD) is
assigned a <span class="emphasis"><em>Unicode General Category</em></span> (UGC),
which provides the most fundamental information about the
codepoint: whether the codepoint represents a
<span class="emphasis"><em>Letter</em></span>, a <span class="emphasis"><em>Mark</em></span>, a
<span class="emphasis"><em>Number</em></span>, <span class="emphasis"><em>Punctuation</em></span>, a
<span class="emphasis"><em>Symbol</em></span>, a <span class="emphasis"><em>Separator</em></span>,
or something else (<span class="emphasis"><em>Other</em></span>).
</p>
<p>
These seven groups are the "Major" categories. Each codepoint is
further assigned to a "minor" category within its Major
category, such as "Letter, uppercase" (<code class="literal">Lu</code>) or
"Letter, modifier" (<code class="literal">Lm</code>). The first letter of
the two-letter code identifies the Major category.
</p>
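<p>
As a rough illustration of Major and minor categories, the sketch
below uses the <code class="literal">unicodedata</code> module from the
Python standard library rather than any HarfBuzz API. The
<code class="literal">MAJOR_NAMES</code> table and the
<code class="literal">describe</code> helper are names invented for this
example.
</p>

```python
import unicodedata

# Major category names keyed by the first letter of the two-letter code.
# (Illustrative table written for this example.)
MAJOR_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(ch):
    """Return (two-letter category code, Major category name) for ch."""
    code = unicodedata.category(ch)
    return code, MAJOR_NAMES[code[0]]


# A few sample characters: letters, a modifier letter, a combining
# accent, a digit, punctuation, a symbol, and a space.
samples = ["A", "a", "\u02B0", "\u0301", "5", "!", "+", " "]
for ch in samples:
    code, major = describe(ch)
    print(f"U+{ord(ch):04X} {code} {major}")
```

<p>
For instance, <code class="literal">describe("A")</code> yields
<code class="literal">Lu</code> in the Letter category, while U+02B0
(MODIFIER LETTER SMALL H) yields <code class="literal">Lm</code>.
</p>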
<p>
Shaping models are concerned primarily with Letter and Mark
codepoints. The minor categories of Mark codepoints are
particularly important for shaping. Marks can be nonspacing
(<code class="literal">Mn</code>), spacing combining
(<code class="literal">Mc</code>), or enclosing (<code class="literal">Me</code>).
</p>
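<p>
The three Mark categories can be told apart with the same
standard-library lookup. The following is a minimal sketch; the
helper names are hypothetical, and the grouping function is a
deliberate simplification that is neither Unicode grapheme-cluster
segmentation nor the cluster logic of a real shaper.
</p>

```python
import unicodedata

# Human-readable names for the three minor categories of Mark.
MARK_KINDS = {
    "Mn": "nonspacing",
    "Mc": "spacing combining",
    "Me": "enclosing",
}


def mark_kind(ch):
    """Return the kind of Mark for ch, or None if ch is not a Mark."""
    return MARK_KINDS.get(unicodedata.category(ch))


def split_base_and_marks(text):
    """Attach each Mark to the nearest preceding non-Mark character.

    Simplified on purpose: a Mark at the very start of the text
    becomes a group of its own.
    """
    groups = []
    for ch in text:
        if mark_kind(ch) is not None and groups:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


print(mark_kind("\u0301"))  # COMBINING ACUTE ACCENT
print(mark_kind("\u093E"))  # DEVANAGARI VOWEL SIGN AA
print(mark_kind("\u20DD"))  # COMBINING ENCLOSING CIRCLE
print(split_base_and_marks("e\u0301a"))
```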
<p>
In addition to the UGC property, codepoints in the Indic and
Southeast Asian scripts are also assigned
<span class="emphasis"><em>Unicode Indic Syllabic Category</em></span> (UISC) and
<span class="emphasis"><em>Unicode Indic Positional Category</em></span> (UIPC)
properties that provide more detailed information needed for
shaping.
</p>
<p>
The UISC property sub-categorizes Letters and Marks according to
common script-shaping behaviors. For example, UISC distinguishes
between consonant letters, vowel letters, and vowel marks. The
UIPC property sub-categorizes Mark codepoints by the relative visual
position that they occupy (above, below, right, left, or in
multiple positions).
</p>
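<p>
Python's <code class="literal">unicodedata</code> module does not expose
the UISC or UIPC properties, so the sketch below uses a small
hand-written table instead. The entries are transcribed for
illustration from the UCD files
<code class="literal">IndicSyllabicCategory.txt</code> and
<code class="literal">IndicPositionalCategory.txt</code> and should be
checked against the UCD before being relied on. The table and the
<code class="literal">indic_properties</code> helper are hypothetical
names.
</p>

```python
# Illustrative excerpt only: codepoint mapped to (UISC, UIPC).
# A real implementation would load the full UCD data files.
INDIC_TABLE = {
    0x0905: ("Vowel_Independent", "NA"),           # DEVANAGARI LETTER A
    0x0915: ("Consonant", "NA"),                   # DEVANAGARI LETTER KA
    0x093E: ("Vowel_Dependent", "Right"),          # DEVANAGARI VOWEL SIGN AA
    0x093F: ("Vowel_Dependent", "Left"),           # DEVANAGARI VOWEL SIGN I
    0x0941: ("Vowel_Dependent", "Bottom"),         # DEVANAGARI VOWEL SIGN U
    0x0947: ("Vowel_Dependent", "Top"),            # DEVANAGARI VOWEL SIGN E
    0x094D: ("Virama", "Bottom"),                  # DEVANAGARI SIGN VIRAMA
    0x09CB: ("Vowel_Dependent", "Left_And_Right"), # BENGALI VOWEL SIGN O
}


def indic_properties(ch):
    """Return (UISC, UIPC) for ch, falling back to the property defaults."""
    return INDIC_TABLE.get(ord(ch), ("Other", "NA"))


for ch in "\u0915\u093F\u0941\u09CB":
    uisc, uipc = indic_properties(ch)
    print(f"U+{ord(ch):04X} {uisc} {uipc}")
```

<p>
Note that U+093F is a vowel mark positioned to the left of its base
even though it follows the base in the text, which is one reason the
positional property matters for shaping.
</p>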
<p>
Some complex scripts require that the text run be split into
syllables. What constitutes a valid syllable in these
scripts is specified by regular expressions over the
Letter and Mark codepoints that take the UISC and UIPC
properties into account.
</p>
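<p>
The idea of a syllable regular expression can be sketched by mapping
each codepoint to a one-letter class and matching a pattern over the
resulting class string. Everything here is an assumption made for
illustration: the class table covers only a few Devanagari codepoints,
the pattern is far simpler than the grammar of any real shaping
model, and it ignores positional (UIPC) information entirely.
</p>

```python
import re

# Hypothetical one-letter classes for a handful of Devanagari codepoints:
# C = consonant, V = independent vowel, M = dependent vowel, H = virama.
CLASS_OF = {
    0x0905: "V",  # LETTER A
    0x0915: "C",  # LETTER KA
    0x0924: "C",  # LETTER TA
    0x0937: "C",  # LETTER SSA
    0x093F: "M",  # VOWEL SIGN I
    0x0940: "M",  # VOWEL SIGN II
    0x094D: "H",  # SIGN VIRAMA
}

# Simplified pattern: any number of consonant+virama pairs, a consonant,
# then an optional vowel sign or final virama; or a lone independent
# vowel. The trailing "." makes any other single class its own segment.
SYLLABLE = re.compile(r"(?:CH)*C[MH]?|V|.")


def split_syllables(text):
    """Split text into segments by matching SYLLABLE over its classes."""
    classes = "".join(CLASS_OF.get(ord(ch), "X") for ch in text)
    # Each class is one character, so match offsets index text directly.
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


word = "\u0915\u094D\u0937\u093F\u0924\u093F"  # KA VIRAMA SSA I, TA I
for syllable in split_syllables(word):
    print(" ".join(f"U+{ord(ch):04X}" for ch in syllable))
```

<p>
With this pattern the sample word splits into two segments: the
conjunct KA, VIRAMA, SSA with its vowel sign, followed by TA with
its vowel sign.
</p>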
</div>
<div class="footer">
<hr>Generated by GTK-Doc V1.25</div>
</body>
</html>