<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>BODY, P, DIV, H1, H2, H3, H4, H5, H6, ADDRESS, OL, UL, LI, TITLE, TD, OPTION, SELECT 
{ 
 font-family: Verdana 
}
BODY, P, DIV, ADDRESS, OL, UL, LI, TITLE, TD, OPTION, SELECT  
{  
  font-size: 10.0pt; 
  margin-top:0pt;  
  margin-bottom:0pt;  
} 
BODY, P
{
  margin-left:0pt; 
  margin-right:0pt;
}
BODY
{
  background: white;
  margin: 6px;
  padding: 0px;
}
h6 { font-size: 10pt }
h5 { font-size: 11pt }
h4 { font-size: 12pt }
h3 { font-size: 13pt }
h2 { font-size: 14pt }
h1 { font-size: 16pt }
blockquote { padding: 10px; border: 1px #DDDDDD dashed }
a img {	border: 0; }
table.zeroBorder {
	border-width: 1px 1px 1px 1px;
	border-style: dotted dotted dotted dotted;
	border-color: gray gray gray gray;
}
table.zeroBorder th {
	border-width: 1px 1px 1px 1px;
	border-style: dotted dotted dotted dotted;
	border-color: gray gray gray gray;
}
table.zeroBorder td {
	border-width: 1px 1px 1px 1px;
	border-style: dotted dotted dotted dotted;
	border-color: gray gray gray gray;
}
.hiddenStyle {
		visibility: hidden; 
		position: absolute;
		z-Index: 1;
		paddingRight: 0;
		background: white
	}
.misspell { background-image: url('/images/misspell.gif'); background-repeat: repeat-x; background-position: bottom }
@media screen {
.pb { border-top: 1px dashed #C0C0C0; border-bottom: 1px dashed #C0C0C0 }
.writely-comment { font-size: 9pt; line-height: 1.4em; padding: 1px; border: 1px dashed #C0C0C0 }
}
@media print {
.pb { border-top: 0px; border-bottom: 0px }
.writely-comment { display: none }
}
@media screen,print {
.pb { height: 1px }
}
</style></head>
<body revision='ddd5crnh_206zxf29:22'>
<p class=western style="MARGIN-LEFT:2.46in; TEXT-INDENT:0.49in">
  Doc. no.: N2238=07-0098
</p>
<p class=western style="MARGIN-LEFT:2.46in; TEXT-INDENT:0.49in">
  Date: 2007-04-17
</p>
<p class=western style="MARGIN-LEFT:2.46in; TEXT-INDENT:0.49in">
  Project: Programming Language C++
</p>
<p class=western style="MARGIN-LEFT:2.46in; TEXT-INDENT:0.49in">
  Subgroup: Library
</p>
<p class=western style="MARGIN-LEFT:2.46in; TEXT-INDENT:0.49in">
  Reply to: Matthew Austern &lt;austern@google.com&gt;
</p>
<h1 class=western>
  <font size=5>Minimal Unicode support for the standard library (revision 3)<br>
  </font>
</h1>
<h2 class=western>
  <font size=4 style=FONT-SIZE:16pt>Background </font>
</h2>
<p class=western style="">
  Unicode is an industry standard developed by the Unicode Consortium, with the
  goal of encoding every character in every writing system. It is synchronized
  with ISO 10646, which contains the same characters and the same character
  codes, and for the purposes of this paper we may treat Unicode and ISO 10646
  as synonymous. Many programming languages and platforms already support
  Unicode, and many standards, such as XML, are defined in terms of Unicode.
  There has already been some work to add Unicode support to ISO C++.<br>
  <br>
  C++ has two character types, char and wchar_t. The standard does not specify
  which character set either type uses, except that each is a superset of the
  95-character <i>basic execution character set</i>. In practice char is almost
  always an 8-bit type, typically used either for the ASCII character set or for
  some 256-character superset of ASCII (e.g. ISO-8859-1). Some programs use
  wchar_t for Unicode characters, but wchar_t varies enough from one platform to
  another that it is unsuitable for portable Unicode programming.<br>
  <br>
  Unicode assigns “a unique number for every character, no matter what the
  platform, no matter what the program, no matter what the language.” [6] These
  numbers are known as <i>code points</i>. A <i>character encoding form</i>
  specifies the way in which a sequence of code points is represented in an
  actual program. Code points range from 0x000000 through 0x10ffff, so 21 bits
  suffice to represent all Unicode characters. No popular architecture has a
  21-bit word size, so instead most programs that work with Unicode use one of
  the following character encoding forms for internal processing:
</p>
<ul>
  <li>
    <p class=western>
      UTF-32 uses a 32-bit word to store each character. This encoding is
      attractive because of its simplicity, unattractive because it wastes 11
      bits per character.
    </p>
  </li>
  <li>
    <p class=western>
      UTF-16 is a variable-width character encoding form where a code point is
      represented by either one or two 16-bit <i>code units</i>. The most common
      characters are represented as a single code unit, and the less common
      characters are represented as two code units, called <i>surrogates</i>. It
      is possible to tell, without having to examine context, whether a UTF-16
      code unit is a leading surrogate, a trailing surrogate, or a complete
      character.
    </p>
  </li>
  <li>
    <p class=western style="">
      UTF-8 is a variable-width character encoding form where a code point is
      represented by up to four 8-bit code units. It is possible to tell, without
      examining context, whether a code unit is a complete character, a leading
      byte, or a trailing byte. An ASCII character is represented as a single
      UTF-8 code unit.
    </p>
  </li>
</ul>
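<p class=western style="">
  As a rough illustration of the UTF-16 encoding form described above, the
  following sketch classifies a single code unit without looking at its
  neighbours and encodes one code point as one or two code units. The names
  Utf16Unit, classify_utf16, and append_utf16 are hypothetical and are not part
  of this proposal; the sketch assumes a compiler that already provides
  char16_t, char32_t, and std::u16string.
</p>

```cpp
#include <cassert>
#include <string>

// Kind of a single UTF-16 code unit; decidable from the unit alone.
enum class Utf16Unit { Complete, LeadingSurrogate, TrailingSurrogate };

inline Utf16Unit classify_utf16(char16_t u) {
    if (u >= 0xD800 && u <= 0xDBFF) return Utf16Unit::LeadingSurrogate;
    if (u >= 0xDC00 && u <= 0xDFFF) return Utf16Unit::TrailingSurrogate;
    return Utf16Unit::Complete;
}

// Append the UTF-16 encoding of one code point to `out`.
// Precondition: cp is at most 0x10FFFF and is not itself a surrogate value.
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: 20 remaining bits split across two surrogates.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
}
```

<p class=western style="">
  A code point in the Basic Multilingual Plane produces one code unit; any code
  point above U+FFFF produces a leading surrogate followed by a trailing
  surrogate.
</p>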
<p class=western style="">
  Other character encoding forms are sometimes used for serialization or
  external storage.
</p>
<p class=western style="">
  <br>
</p>
<p class=western style="">
  Unicode support for ISO C is described in TR 19769:2004, a Type 2 technical
  report. TR 19769 proposes two new character types, char16_t and char32_t,
  together with new syntax for character and string literals of those types, and
  a few additions to the C library to manipulate strings of those types.
  Lawrence Crowl’s paper N2149, “New Character Types in C++,” proposes that WG21
  adopt TR 19769 almost unchanged; essentially the only change from TR 19769 is
  that char16_t and char32_t are required to be distinct from other integer
  types, so that it’s possible to overload on them.
</p>
<p class=western style="">
  <br>
</p>
<p class=western style="">
  This paper describes changes to the standard library that will be needed if
  WG21 chooses to adopt N2149. It is a proposal for C++0x, because it proposes
  changes in existing standard library components.
</p>
<h2 class=western>
  <font size=4 style=FONT-SIZE:16pt>Goals and design decisions</font>
</h2>
<p class=western style="">
  The main goal of this paper is simple: make it possible to use library
  facilities in combination with the two new character types char16_t and
  char32_t. This paper does not attempt to define new library facilities or to
  fix defects in existing ones, but only to make it possible to use char16_t and
  char32_t with existing library facilities.
</p>
<p class=western style="">
  <br>
</p>
<p class=western style="">
  This goal is important despite the existence of wchar_t. Even if wchar_t is
  the same size as one of those two types, it is distinct from both from the
  point of view of the C++ type system. It would be a very poor user experience if
  we told users that they had to cast their Unicode strings to some other type
  in order to use library facilities, especially since that type would vary from
  one system to another. (Internally, of course, I imagine most library
  implementers will choose to share code between char32_t and wchar_t or between
  char16_t and wchar_t.) It is indeed irritating to have three distinct types
  when two of them will almost always be identical, but, as with char, signed
  char, and unsigned char, history leaves us little choice.
</p>
<h3 class=western>
  Minimal support for char32_t
</h3>
<p class=western style="">
  Minimal support for char32_t is simple: UTF-32 is a fixed width encoding, so
  we just need to require specializations of library facilities for char32_t in
  the same way that we do for char and wchar_t. Arguably a basic_string of
  32-bit characters isn't all that useful, but I think just enough people would
  use it to make it worth having.
</p>
<h3 class=western>
  Minimal support for char16_t
</h3>
<p class=western style="">
  Minimal support for char16_t is more complicated in theory, but equally simple
  in practice: again, just add specializations of all library facilities for
  char16_t. UTF-16 is not a fixed width encoding, but, for two reasons, it can
  almost be treated as one. First, most text is composed only of the common
  characters that lie in the Basic Multilingual Plane (BMP), and for such text
  UTF-16 is in fact a fixed width encoding. Second, since it’s always possible
  to tell whether a code unit is a complete character, a leading surrogate, or a
  trailing surrogate, there is little danger from treating a UTF-16 string as a
  sequence of code units instead of a sequence of code points. Corrupting a
  UTF-16 string by inserting an incorrect code unit is no more likely than
  corrupting a UTF-32 string, and the corruption, if any, will be confined to a
  single character.
</p>
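<p class=western style="">
  To make the distinction between code units and code points concrete, here is a
  minimal sketch, assuming well-formed UTF-16 input. The function
  count_code_points is a hypothetical helper, not a proposed library facility.
  It relies only on the property that a trailing surrogate can be recognized
  without context.
</p>

```cpp
#include <cstddef>
#include <string>

// Count code points in a well-formed UTF-16 string. Every code point
// contributes exactly one code unit that is not a trailing surrogate.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::u16string::size_type i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool trailing = (u >= 0xDC00 && u <= 0xDFFF);
        if (!trailing) ++n;
    }
    return n;
}
```

<p class=western style="">
  For text confined to the BMP the result equals size(); the two differ only
  when surrogate pairs are present, since size() counts code units.
</p>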
<p class=western style="">
  <br>
</p>
<p class=western style="">
  We don’t need to say very much about how the library handles char16_t strings.
  There is already language in the standard to allow facets to give errors at
  run time for invalid strings, and we need that for UTF-32 as well as UTF-16.
</p>
<p class=western style="">
  <br>
</p>
<p class=western style="">
  In practice, we need library support for UTF-16 because that’s the real world;
  if the standard library ignores UTF-16 then the standard library will be
  irrelevant to processing non-ASCII text. The small amount of extra simplicity
  that you get from using UTF-32 instead of UTF-16 just doesn’t outweigh the
  cost of using 4 bytes per character instead of 2+ε. Microsoft, Apple, and Java
  all use UTF-16 as their primary string representation, and in practice it
  works fine. Microsoft’s decision to use UTF-16 for wchar_t shows that there is
  no insuperable obstacle to using UTF-16 with the standard C++ library.
</p>
<h3 class=western>
  Names of template specializations
</h3>
<p class=western style="">
  The C++ standard assigns names for many of the specializations of class
  templates on character types. For example, string is shorthand for
  basic_string&lt;char&gt; and wstreambuf is shorthand for
  basic_streambuf&lt;wchar_t&gt;. The general pattern is no prefix for char
  specializations and the prefix ‘w’ for wchar_t specializations. What should
  the pattern be for specializations on char16_t and char32_t?
</p>
<p class=western style="">
  In principle we could use a prefix based on the “u” and “U” prefixes that
  N2149 proposes for Unicode string literals, or we could use a prefix or suffix
  based on the “16” and “32” in the type names themselves. In the first revision
  of this paper I proposed a combination of the two: a “u” prefix for the
  char16_t specializations and a “u32” prefix for the char32_t specializations.
  Rationale:
</p>
<ul>
  <li>
    <p class=western style="">
      I expect that basic_string&lt;char16_t&gt; will be used much more often
      than basic_string&lt;char32_t&gt; or basic_string&lt;wchar_t&gt;, since
      UTF-16 is generally a good tradeoff between convenience and space
      efficiency. This argues that the name of basic_string&lt;char16_t&gt;
      should not be more cumbersome than that of basic_string&lt;char32_t&gt; or
      basic_string&lt;wchar_t&gt;, and that it should have a single-character
      prefix. The obvious choice is “u”.
    </p>
  </li>
  <li>
    <p class=western style="">
      There is an obvious one-character prefix for char32_t specializations,
      “U”. In general, however, the standard library avoids uppercase names, and
      especially avoids having two names that differ only by case. Using a “u32”
      prefix for char32_t specializations does mean that it will be less
      convenient to use basic_string&lt;char32_t&gt; than to use
      basic_string&lt;char16_t&gt;, but that reflects what I expect to be
      real-world usage. I do not expect people to use UTF-32 as often as they
      use UTF-16.
    </p>
  </li>
</ul>
<p class=western style="">
  <br>
</p>
<p class=western style="">
  This revision makes a different choice: for the sake of consistency with the
  names of the character types, it uses the prefix “u16” for char16_t
  specializations and “u32” for char32_t specializations.
</p>
<h3 class=western>
  Prior art
</h3>
<p class=western>
  There is no prior art for a C++ library implementation containing four types
  named char, wchar_t, char16_t, and char32_t. However, there is also no doubt
  that the proposal is implementable. There is extensive prior art for C++
  standard library implementations that use UTF-16 wide characters (wchar_t in
  Microsoft’s C++ implementation uses UTF-16), and there is extensive prior art
  for C++ standard library implementations that use UTF-32 wide characters (most
  Unix implementations).
</p>
<h3 class=western>
  Dependence on N2149
</h3>
<p class=western style="">
  Since N2149 has not been voted into the WP, building library facilities on top
  of it is slightly dicey. In principle a number of questions are still open:
  the names of the two new types (I have chosen to use char16_t and char32_t in
  accordance with TR 19769 and N2149), whether they are new built-in types or
  new user-defined types (the latter would only make sense if there are core
  language changes to permit string literals for user-defined types), whether
  char16_t and char32_t are the names of types or just the names of typedefs for
  underlying types with uglier names, and, if they are typedef names, which
  namespace the typedefs live in. Except for the names char16_t and char32_t,
  nothing in this paper depends on those decisions.
</p>
<br>
<h2>
  Changes since revision 1
</h2>
<p class=western style="">
  This document has been revised as a result of the LWG's straw polls in
  Portland. The LWG voted on whether to support char16_t and char32_t for
  various library facilities:
</p>
<div align=center>
  <table border=0 cellpadding=3 cellspacing=0 style="WIDTH:563px; HEIGHT:218px">
    <tbody>
    <tr>
      <td width=33%>
        <span style=FONT-WEIGHT:bold>Library facility</span><br>
      </td>
      <td width=20%>
        <span style=FONT-WEIGHT:bold>For</span><br>
      </td>
      <td width=20%>
        <span style=FONT-WEIGHT:bold>Against</span><br>
      </td>
    </tr>
    <tr>
      <td width=33%>
        char_traits<br>
      </td>
      <td width=20%>
        12<br>
      </td>
      <td width=20%>
        0<br>
      </td>
    </tr>
    <tr>
      <td width=33%>
        iostream<br>
      </td>
      <td width=20%>
        1<br>
      </td>
      <td width=20%>
        9<br>
      </td>
    </tr>
    <tr>
      <td width=33%>
        fstream<br>
      </td>
      <td width=20%>
        3<br>
      </td>
      <td width=20%>
        4<br>
      </td>
    </tr>
    <tr>
      <td width=33%>
        sstream<br>
      </td>
      <td width=20%>
        2<br>
      </td>
      <td width=20%>
        1<br>
      </td>
    </tr>
    <tr>
      <td width=33%>
        facets (other than codecvt)<br>
      </td>
      <td width=20%>
        3<br>
      </td>
      <td width=20%>
        4<br>
      </td>
    </tr>
    <tr>
      <td width=33%>
        codecvt<br>
      </td>
      <td width=20%>
        11<br>
      </td>
      <td width=20%>
        0<br>
      </td>
    </tr>
    <tr>
      <td width=33%>
        regex<br>
      </td>
      <td width=20%>
        2<br>
      </td>
      <td width=20%>
        7<br>
      </td>
    </tr>
    </tbody>
  </table>
  <br>
  <div style=TEXT-ALIGN:left>
    The rationale for leaving out stream specializations of the two new types
    was that streams of non-char types have not attracted wide usage, so it is
    not clear that there is a real need for doubling the number of
    specializations of this very complicated machinery. The rationale for leaving
    out regex specializations was that regexes include ranges, which gets us
    into complicated collation issues; this ought to be addressed in detail in a
    separate paper.<br>
    <br>
    There were no straw polls on numeric_limits and hash, and it was assumed
    that basic_string would work if char_traits was supported. We need language
    in the standard to support hash since it lists all types for which it is
    defined, but no such language is needed for numeric_limits since the
    standard already says that it is specialized for all fundamental types.<br>
    <br>
    Revision 1 included support for all library facilities, i.e. adding char16_t
    and char32_t overloads to every feature that currently takes char and
    wchar_t overloads. Revision 2 removes support for all facilities that the
    majority of the LWG opposed in the Portland straw poll. Even though sstream
    attracted majority support, I have also removed it from revision 2 of this
    proposal because (a) it was a very weak majority, and (b) sstream depends so
    heavily on the general iostream machinery that it would be difficult to
    support sstream without also supporting other parts of the iostream
    machinery.<br>
    <br>
  </div>
</div>
<h2>
  Changes since revision 2
</h2>
<p class=western style="">
  The only change between revision 2 and revision 3 is that the “ustring”
  typedef name has been changed to “u16string”.
</p>
<h2 class=western>
  <font size=4 style=FONT-SIZE:16pt>Possible future directions</font>
</h2>
<p class=western style="">
  Two items are conspicuously missing from this paper: UTF-8 support, and
  explicit support for Unicode features like normalization, case conversion, and
  collation. I intend to address those issues in future papers.
</p>
<h3 class=western>
  UTF-8 support
</h3>
<p class=western style="">
  One way to provide UTF-8 support would be a new string class whose interface
  is very different from basic_string, designed to preserve string validity and
  to encourage users to view the string as code points rather than individual
  bytes. Alternative approaches include UTF-8 iterator adaptors, or just user
  education to encourage users to store UTF-8 data in the existing string class.
</p>
<br>
<p class=western style="">
  Some form of UTF-8 support is important because there's an awful lot of
  real-world code that uses UTF-8 even internally, and programmers certainly
  need UTF-8 to interface with third-party libraries like libxml2.
</p>
<h3 class=western>
  Unicode text manipulation
</h3>
<p class=western style="">
  Unicode is more than a character set and a handful of encoding schemes. It
  also specifies a great deal of information about each character, including
  script identification, character classification, and text direction, and
  various operations on strings, including normalization, case conversion, and
  collation.
</p>
<p class=western style="">
  <br>
</p>
<p class=western style="">
  Normalization is particularly important because there are cases where two
  different sequences of code points can represent what is conceptually the same
  string. For example, a string that is printed as “á” can be either the single
  character U+00E1 (LATIN SMALL LETTER A WITH ACUTE) or the two-character
  sequence U+0061 (LATIN SMALL LETTER A) U+0301 (COMBINING ACUTE ACCENT).
  Unicode defines several different canonical forms, and algorithms for
  converting to canonical form and for testing string equivalence.
</p>
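<p class=western style="">
  The “á” example can be sketched in code, assuming std::u32string is available.
  The two helper functions are hypothetical. Note that ordinary string
  comparison looks only at code points and performs no normalization, so the two
  canonically equivalent strings compare unequal.
</p>

```cpp
#include <string>

// U+00E1 LATIN SMALL LETTER A WITH ACUTE as a single code point.
inline std::u32string precomposed_a_acute() {
    return std::u32string(1, char32_t(0x00E1));
}

// U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT.
inline std::u32string decomposed_a_acute() {
    std::u32string s;
    s.push_back(char32_t(0x0061));
    s.push_back(char32_t(0x0301));
    return s;
}
```

<p class=western style="">
  Comparing the two results with operator== yields false even though both are
  printed as “á”; testing equivalence requires converting both to the same
  canonical form first, which this sketch deliberately does not attempt.
</p>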
<p class=western style="">
  <br>
</p>
<p class=western style="">
  In principle some of these facilities are already part of C++’s facet
  interface, and it might be argued that we do not need a separate mechanism
  just for Unicode. There is, however, an important way in which Unicode is
  special: since it uses a single code point space for all scripts, many
  operations in Unicode are locale-independent that in other encodings are
  necessarily locale-dependent. Since the C++ locale interface is so awkward, it
  would be useful to provide a locale-independent interface for common
  operations that do not require locales.
</p>
<p class=western style="">
  <br>
</p>
<p class=western style="">
  <font color=#000080><u><a href=http://icu.sourceforge.net/>ICU (International
  Components for Unicode)</a></u></font> is a useful source of prior art.
</p>
<p class=western style="">
  <br>
</p>
<h2 class=western style="">
  <font size=4 style=FONT-SIZE:16pt>Proposed working paper changes</font>
</h2>
<p class=western>
  In clause 20.5 [lib.function.objects], in the header &lt;functional&gt;
  synopsis, add the following specializations of class template hash&lt;&gt;:
</p>
<p class=western style="MARGIN-LEFT:40px; FONT-FAMILY:Courier New">
  template&lt;&gt; struct hash&lt;char16_t&gt;;
</p>
<p class=western style="MARGIN-LEFT:40px; FONT-FAMILY:Courier New">
  template&lt;&gt; struct hash&lt;char32_t&gt;;<br>
</p>
<p class=western style="MARGIN-LEFT:40px; FONT-FAMILY:Courier New">
  template&lt;&gt; struct hash&lt;std::u16string&gt;;
</p>
<p class=western style="MARGIN-LEFT:40px; FONT-FAMILY:Courier New">
  template&lt;&gt; struct hash&lt;std::u32string&gt;;
</p>
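<p class=western style="">
  As an illustration of what these hash specializations enable (not part of the
  proposed wording), the following sketch uses std::u16string as a key in an
  unordered container. It assumes an implementation that provides
  std::unordered_map and the specializations above; WordCounts and add_word are
  hypothetical names.
</p>

```cpp
#include <string>
#include <unordered_map>

// The default hasher for this map is std::hash<std::u16string>.
typedef std::unordered_map<std::u16string, int> WordCounts;

// Record one occurrence of `word`.
inline void add_word(WordCounts& counts, const std::u16string& word) {
    ++counts[word];
}
```

<p class=western style="">
  Without hash&lt;std::u16string&gt;, users would have to supply their own hash
  function object as a template argument to the container.
</p>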
<br style="FONT-FAMILY:Times New Roman">
<p class=western style="FONT-FAMILY:Courier New">
  <span style=FONT-FAMILY:Verdana>In clause 20.5.15 [lib.unord.hash], in
  paragraph 1, change "and
  <span style="FONT-FAMILY:Courier New">std::string</span> and
  <span style="FONT-FAMILY:Courier New">std::wstring</span>" to read "and
  <span style="FONT-FAMILY:Courier New">std::string</span>,
  <span style="FONT-FAMILY:Courier New">std::wstring</span>,
  <span style="FONT-FAMILY:Courier New">std::u16string</span>, and
  <span style="FONT-FAMILY:Courier New">std::u32string</span>".</span><span style="FONT-FAMILY:Times New Roman"><br>
  </span>
</p>
<p class=western>
  <br>
</p>
<p class=western>
  Add two new sections after 21.1.3.2 [lib.char.traits.specializations.wchar.t]:
</p>
<p class=western>
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  [lib.char.traits.specializations.char16.t]
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">namespace std { </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">template&lt;&gt; </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">struct char_traits&lt;char16_t&gt; { </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef char16_t char_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef uint_least16_t int_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef streamoff off_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef u16streampos pos_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef mbstate_t state_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static void assign(char_type&amp; c1, const
  char_type&amp; c2); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static bool eq(const char_type&amp; c1, const
  char_type&amp; c2); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static bool lt(const char_type&amp; c1, const
  char_type&amp; c2); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int compare(const char_type* s1, const
  char_type* s2, size_t n); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static size_t length(const char_type* s); </font>
</p>
<p class=western style="MARGIN-LEFT:0.6634in; FONT-FAMILY:Courier New">
  static const char_type* find(const char_type* s, size_t n,
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  const char_type&amp; a); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type* move(char_type* s1, const
  char_type* s2, size_t n); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type* copy(char_type* s1, const
  char_type* s2, size_t n); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type* assign(char_type* s, size_t n,
  char_type a);</font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int_type not_eof(const int_type&amp; c);
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type to_char_type(const int_type&amp; c);
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int_type to_int_type(const char_type&amp; c);
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static bool eq_int_type(const int_type&amp; c1, const
  int_type&amp; c2);</font><font face="Courier New"><br>
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int_type eof(); </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">}; </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">} </font>
</p>
<p class=western style="">
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  The header &lt;string&gt; (21.2) declares a specialization of the class
  template char_traits for char16_t.
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  The two-argument member assign is defined identically to the built-in
  operator=. The two-argument members eq and lt are defined identically to the
  built-in operators == and &lt;.
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  The member eof() returns an implementation-defined constant that cannot appear
  as a valid UTF-16 code unit.
</p>
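<p class=western style="">
  The following sketch is an illustration only, not proposed wording. It shows
  how the members of char_traits&lt;char16_t&gt; might be used directly on
  null-terminated arrays of code units, assuming an implementation that provides
  the specialization. The names traits16, unit_length, and starts_with are
  hypothetical.
</p>

```cpp
#include <cstddef>
#include <string>

typedef std::char_traits<char16_t> traits16;

// Number of code units before the terminating null.
inline std::size_t unit_length(const char16_t* s) {
    return traits16::length(s);
}

// True if `s` begins with `prefix`, comparing code unit by code unit.
inline bool starts_with(const char16_t* s, const char16_t* prefix) {
    const std::size_t n = traits16::length(prefix);
    return traits16::length(s) >= n && traits16::compare(s, prefix, n) == 0;
}
```

<p class=western style="">
  All operations here work on code units; a surrogate pair counts as two
  elements, in the same way that basic_string&lt;char16_t&gt; would treat it.
</p>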
<p class=western style=MARGIN-LEFT:0.25in>
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  [lib.char.traits.specializations.char32.t]
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">namespace std { </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">template&lt;&gt; </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">struct char_traits&lt;char32_t&gt; { </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef char32_t char_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef uint_least32_t int_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef streamoff off_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef u32streampos pos_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">typedef mbstate_t state_type; </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static void assign(char_type&amp; c1, const
  char_type&amp; c2); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static bool eq(const char_type&amp; c1, const
  char_type&amp; c2); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static bool lt(const char_type&amp; c1, const
  char_type&amp; c2); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int compare(const char_type* s1, const
  char_type* s2, size_t n); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static size_t length(const char_type* s); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New"><span style="FONT-FAMILY:Courier New">static const
  char_type* find(const char_type* s, size_t n,</span><br>
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  const char_type&amp; a); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type* move(char_type* s1, const
  char_type* s2, size_t n); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type* copy(char_type* s1, const
  char_type* s2, size_t n); </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type* assign(char_type* s, size_t n,
  char_type a);</font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int_type not_eof(const int_type&amp; c);
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static char_type to_char_type(const int_type&amp; c);
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int_type to_int_type(const char_type&amp; c);
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static bool eq_int_type(const int_type&amp; c1, const
  int_type&amp; c2);</font><font face="Courier New"><br>
  </font>
</p>
<p class=western style=MARGIN-LEFT:0.6634in>
  <font face="Courier New">static int_type eof(); </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">}; </font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">} </font>
</p>
<p class=western style="">
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  The header &lt;string&gt; (21.2) declares a specialization of the class
  template char_traits for char32_t.
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  The two-argument member assign is defined identically to the built-in
  operator=. The two-argument members eq and lt are defined identically to the
  built-in operators == and &lt;.
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <br>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  The member eof() returns an implementation-defined constant that does not
  represent a Unicode code point.
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <br>
</p>
<p class=western>
  In clause 21.2 [lib.string.classes], add the following to the beginning of the
  header &lt;string&gt; synopsis:
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">template&lt;&gt; struct
  char_traits&lt;char16_t&gt;;</font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">template&lt;&gt; struct
  char_traits&lt;char32_t&gt;;</font>
</p>
<p class=western style="">
  and the following to the end:
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">typedef basic_string&lt;char16_t&gt;
  u16string;</font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">typedef basic_string&lt;char32_t&gt;
  u32string;</font>
</p>
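<p class=western style="">
  A brief sketch of how the two typedefs might be used, assuming an
  implementation that provides them together with the u"" literal syntax from
  N2149. The function greet is hypothetical and is shown only to illustrate that
  the usual basic_string operations apply unchanged.
</p>

```cpp
#include <string>
#include <type_traits>

// The typedefs are plain names for the basic_string specializations.
static_assert(std::is_same<std::u16string,
                           std::basic_string<char16_t> >::value,
              "u16string names basic_string<char16_t>");
static_assert(std::is_same<std::u32string::value_type, char32_t>::value,
              "u32string stores char32_t elements");

// Concatenation works as for std::string and std::wstring.
inline std::u16string greet(const std::u16string& name) {
    return u"Hello, " + name;
}
```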
<p class=western>
  <br>
</p>
<p class=western>
  In Table 65 (Locale category facets) in clause 22.1 [lib.locale.category], add
  the following specializations:
</p>
<br>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">codecvt&lt;char16_t, char, mbstate_t&gt;</font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">codecvt&lt;char32_t, char, mbstate_t&gt;</font>
</p>
<br>
<p class=western>
  <br>
</p>
<p class=western>
  In Table 66 (Required Specializations) in clause 22.1 [lib.locale.category],
  add the following specializations:
</p>
<br>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">codecvt_byname&lt;char16_t, char,
  mbstate_t&gt;</font>
</p>
<p class=western style=MARGIN-LEFT:0.25in>
  <font face="Courier New">codecvt_byname&lt;char32_t, char,
  mbstate_t&gt;</font>
</p>
<br>
<p class=western>
  In clause 22.2.1.4 [lib.locale.codecvt] paragraph 3, remove the phrase “namely
  codecvt&lt;wchar_t, char, mbstate_t&gt; and codecvt&lt;char, char,
  mbstate_t&gt;.” Add the following sentence, after the one describing the
  wchar_t specialization: “The specialization codecvt&lt;char16_t, char,
  mbstate_t&gt; converts between the UTF-16 and UTF-8 encoding schemes, and the
  specialization codecvt&lt;char32_t, char, mbstate_t&gt; converts between the
  UTF-32 and UTF-8 encoding schemes.”
</p>
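<p class=western style="">
  The conversion that the char16_t specialization is meant to perform can be
  sketched as follows. This assumes an implementation that installs
  codecvt&lt;char16_t, char, mbstate_t&gt; in the classic locale with the
  UTF-16 to UTF-8 behaviour described above; the helper utf16_to_utf8 and its
  error handling are hypothetical. Later revisions of the language deprecate
  this specialization, so a compiler may emit a warning.
</p>

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-16 to UTF-8 through the codecvt facet.
inline std::string utf16_to_utf8(const std::u16string& in) {
    typedef std::codecvt<char16_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());

    // One UTF-16 code unit expands to at most three UTF-8 bytes
    // (a surrogate pair is two code units and yields four bytes).
    std::string out(in.size() * 3, '\0');
    if (in.empty()) return out;

    std::mbstate_t state = std::mbstate_t();
    const char16_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);

    // Treat anything other than a complete, successful conversion as an error.
    if (r != std::codecvt_base::ok || from_next != in.data() + in.size())
        throw std::runtime_error("input is not well-formed UTF-16");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

<p class=western style="">
  The facet reports invalid input through its result code at run time, which is
  the mechanism referred to earlier for handling malformed strings.
</p>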
<br>
<br>
<h2 class=western style="">
  <font size=4 style=FONT-SIZE:16pt>References </font>
</h2>
<p class=western style="">
  [1] Lawrence Crowl, <i>Extensions for the Programming Language C++ to Support
  New Character Data Types.</i> WG21 N2149, 2007.
</p>
<p class=western style="">
  [2] ISO. <i>Information technology -- Universal Multiple-Octet Coded Character
  Set (UCS)</i>, ISO/IEC 10646.<br>
  [3] ISO. <i>Information technology -- Programming languages, their
  environments and system software interfaces -- Extensions for the programming
  language C to support new character data types</i>, ISO/IEC TR 19769:2004.<br>
</p>
<p class=western style="">
  [4] The Unicode Consortium. The Unicode Standard, Version 4.1.0, defined by:
  <i>The Unicode Standard, Version 4.0</i> (Boston, MA, Addison-Wesley, 2003.
  ISBN 0-321-18578-1), as amended by Unicode 4.0.1
  (<font color=#000080><u><a href=http://www.unicode.org/versions/Unicode4.0.1>http://www.unicode.org/versions/Unicode4.0.1</a></u></font>)
  and by Unicode 4.1.0
  (<font color=#000080><u><a href=http://www.unicode.org/versions/Unicode4.1.0>http://www.unicode.org/versions/Unicode4.1.0</a></u></font>).
</p>
<p class=western style="">
  [5] The Unicode Consortium, <i>Frequently Asked Questions</i>,
  <font color=#000080><u><a href=http://www.unicode.org/unicode/faq/>http://www.unicode.org/unicode/faq/</a></u></font>.
  See in particular
  <font color=#000080><u><a href=http://www.unicode.org/faq/utf_bom.html>http://www.unicode.org/faq/utf_bom.html</a></u></font>
  for a discussion of encodings.
</p>
<p class=western style="">
  [6] The Unicode Consortium, <span style=FONT-STYLE:italic>What is
  Unicode?</span>,
  <a href=http://www.unicode.org/standard/WhatIsUnicode.html title=http://www.unicode.org/standard/WhatIsUnicode.html>http://www.unicode.org/standard/WhatIsUnicode.html</a>
</p>
<p class=western style="">
  <br>
  <br>
  <br>
</p></body>
</html>