<html>
<head>
<title>New Character Types in C++</title>
</head>

<body>
<h1>New Character Types in C++</h1>

<p>ISO/IEC JTC1 SC22 WG21 N2249 = 07-0109 - 2007-04-19

<p>Lawrence Crowl

<p>This document replaces N2149 = 07-0009 - 2007-01-10.

<h2>Problem</h2>

<p>Many users of C++ need to manipulate Unicode character strings.
Unfortunately, there is no C++ standard means to do so.

<h2>Solution</h2>

<p>The
<a href="http://www.open-std.org/jtc1/sc22/wg14/">ISO C</a>
committee has addressed this issue extensively.
See
<a href="http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=33907&ICS1=35&ICS2=60&ICS3=&scopelist=">
ISO/IEC TR 19769:2004</a>
"Extensions for the programming language C
to support new character data types"
as described in draft report ISO/IEC JTC1 SC22 WG14
<a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf">
N1040</a>
at
<a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf">
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf</a>.

<p>This proposal adopts their work,
but with those changes necessary for effective use within C++.
In particular, we propose new types to support overloading.

<p>A separate proposal will address specializations for
numeric_limits, character traits, basic strings,
streams, and insertion operations.

<h2>References</h2>

<p>See section 2.5 "Encoding Forms" in
<blockquote>
The Unicode Consortium.
The Unicode Standard, Version 5.0.0, defined by:
<cite>The Unicode Standard, Version 5.0</cite>
(Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)
</blockquote>
The online version (printing prohibited)
is at
<a href="http://www.unicode.org/versions/Unicode5.0.0/">
http://www.unicode.org/versions/Unicode5.0.0/</a>.

<p>See
<a href="http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc">
Annex C</a>
of
<a href="http://std.dkuug.dk/JTC1/SC2/WG2/docs/projects#10646">ISO
10646</a>-1,
which is online at
<a href="http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc">
http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc</a>.

<p>See
<a href="http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39921&ICS1=35&ICS2=40&ICS3=">ISO/IEC 10646:2003</a>,
which is publicly available
in several text and PDF files within a zip archive from 
<a href="http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip">
http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip</a>.

<p>See
<a href="http://www.unicode.org/faq/utf_bom.html">UTF-8,
UTF-16, UTF-32 &amp; BOM</a>.

<h2>Summary of ISO/IEC TR 19769 (WG14 N1040)</h2>

<p>The document ISO/IEC TR 19769 (WG14 N1040) provides motivation,
new typedefs for the (at least) 16-bit and
(at least) 32-bit character types,
macros for reporting
<a href="http://std.dkuug.dk/JTC1/SC2/WG2/docs/projects#10646">
ISO 10646</a>
encoding,
character and string literals, mixed string concatenation,
four library functions,
and a new header with appropriate declarations.

<h2>Summary of Changes to the C Proposal for C++</h2>

<p>The document ISO/IEC TR 19769 (WG14 N1040) can be adopted with few changes.
Further changes are possible,                                               
but this proposal minimizes the changes to ensure maximum interoperability.

<h3>Define new primitive types.</h3>

<p>Define <tt>char16_t</tt> to be a distinct new type
that has the same size and representation as <tt>uint_least16_t</tt>.
Likewise,
define <tt>char32_t</tt> to be a distinct new type
that has the same size and representation as <tt>uint_least32_t</tt>.

<p>[N1040 defined <tt>char16_t</tt> and <tt>char32_t</tt>
as typedefs to <tt>uint_least16_t</tt> and <tt>uint_least32_t</tt>,
which makes overloading on these character types impossible.]

<p>[Experiments on open-source software
indicate that these identifiers are not commonly used
and, when used,
are used in a manner consistent with the proposal.]
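<p>[Illustration, not part of the proposed wording:
a minimal sketch of the overloading that distinct types permit.
The function name <tt>kind</tt> is hypothetical.
With typedefs, the <tt>char16_t</tt> and <tt>uint_least16_t</tt> overloads
below would be redefinitions of the same function.]

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical overload set; `kind` is an illustrative name only.
std::string kind(char16_t) { return "char16_t"; }
std::string kind(char32_t) { return "char32_t"; }
std::string kind(std::uint_least16_t) { return "uint_least16_t"; }

// Each argument selects the overload for its own type.
void check_overloads() {
    assert(kind(u'a') == "char16_t");
    assert(kind(U'a') == "char32_t");
    assert(kind(std::uint_least16_t(97)) == "uint_least16_t");
}
```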

<h3>Add C++-specific headers.</h3>

<p>Add a new C++ header <tt>&lt;cuchar&gt;</tt>
corresponding to the new C header <tt>&lt;uchar.h&gt;</tt>.

<h3>Clarify literals.</h3>

<p>Clarify the handling of universal character names
that do not fit within a single <tt>char16_t</tt>.
In particular, the interaction with ISO 10646 UTF-16
is underspecified in the C proposal.

<h3>Require UTF.</h3>

<p>The C TR makes the encoding of <tt>char16_t</tt> and <tt>char32_t</tt>
implementation-defined.
It also provides macros to indicate whether or not the encoding is UTF.
In contrast, this proposal requires UTF encoding.

<h2>Changes to the C++ Standard</h2>

<h3>2.11 Keywords</h3>

<p>To "Table 3 -- keywords",
add <tt>char16_t</tt> and <tt>char32_t</tt>.

<h3>2.13.2 Character literals</h3>

<p>To the grammar, add
<blockquote>
<dl>
<dt><var>character-literal:</var></dt>
<dd><tt>u'</tt> <var>c-char-sequence</var> <tt>'</tt></dd>
<dd><tt>U'</tt> <var>c-char-sequence</var> <tt>'</tt></dd>
</dl>
</blockquote>

<p>In paragraph 1, edit
<blockquote>
A character literal is one or more characters enclosed in single quotes,
as in <tt>'x'</tt>,
optionally preceded by <ins>one of</ins> the letter<ins>s
<tt>u</tt>, <tt>U</tt>, or</ins> <tt>L</tt>,
as in
<ins><tt>u'y'</tt>, <tt>U'z'</tt>, or</ins>
<tt>L'x'</tt><ins>, respectively</ins>.
A character literal that does not begin with
<ins><tt>u</tt>, <tt>U</tt>, or</ins> <tt>L</tt>
is an ordinary character literal,
also referred to as a narrow-character literal.
An ordinary character literal that contains a single <var>c-char</var>
has type <tt>char</tt>,
with value equal to the numerical value of
the encoding of the <var>c-char</var> in the execution character set.
An ordinary character literal that contains more than one <var>c-char</var>
is a <dfn>multicharacter</dfn> literal.
A multicharacter literal has type <tt>int</tt>
and implementation-defined value.
</blockquote>

<p>In paragraph 2, edit
<blockquote>
<ins>A character literal that begins with the letter <tt>u</tt>,
such as <tt>u'y'</tt>,
is a character literal of type <tt>char16_t</tt>.
The value of a <tt>char16_t</tt> literal
containing a single <var>c-char</var>
is equal to its ISO 10646 code point value,
provided that the code point is representable with a single 16-bit code unit.
(That is, provided it is a basic multi-lingual plane code point.)
If the value is not representable within 16 bits,
the program is ill-formed.
A <tt>char16_t</tt> literal containing multiple <var>c-char</var>s           
is ill-formed.</ins>
<ins>A character literal that begins with the letter <tt>U</tt>,
such as <tt>U'z'</tt>,
is a character literal of type <tt>char32_t</tt>.
The value of a <tt>char32_t</tt> literal
containing a single <var>c-char</var>
is equal to its ISO 10646 code point value.
A <tt>char32_t</tt> literal containing multiple <var>c-char</var>s           
is ill-formed.</ins>
A character literal that begins with the letter <tt>L</tt>,
such as <tt>L'x'</tt>, is a wide-character literal.
A wide-character literal has type <code>wchar_t</code>.26)
The value of a wide-character literal containing a single <var>c-char</var>
has value equal to the numerical value of the encoding of the <var>c-char</var>
in the execution wide-character set.
The value of a wide-character literal containing multiple
<var>c-char</var>s is implementation-defined.
</blockquote>
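<p>[Illustration, not part of the proposed wording:
a short sketch of the types and values the edited paragraphs give
to prefixed character literals.
The function name <tt>check_literal_values</tt> is hypothetical,
and the sketch assumes a compiler that already implements these literals.]

```cpp
#include <cassert>
#include <type_traits>

// The prefix alone determines the type of the literal.
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "u prefix gives char16_t");
static_assert(std::is_same<decltype(U'z'), char32_t>::value, "U prefix gives char32_t");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "L prefix gives wchar_t");

// u'\U0001F600' is deliberately not written here: the code point lies outside
// the basic multi-lingual plane, so that literal would be ill-formed.

// The value of each literal is its ISO 10646 code point.
void check_literal_values() {
    assert(u'y' == 0x0079);
    assert(u'\u00E9' == 0x00E9);
    assert(U'\U0001F600' == 0x1F600);
}
```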

<p>In paragraph 4, edit
<blockquote>
The escape <var>\ooo</var> consists of the backslash
followed by one, two, or three octal digits
that are taken to specify the value of the desired character.
The escape <var>\xhhh</var> consists of the backslash
followed by <tt>x</tt> followed by one or more hexadecimal digits
that are taken to specify the value of the desired character.
There is no limit to the number of digits in a hexadecimal sequence.
A sequence of octal or hexadecimal digits
is terminated by the first character
that is not an octal digit or a hexadecimal digit, respectively.
The value of a character literal is implementation-defined
if it falls outside of the implementation-defined range
defined for char (for ordinary literals)<ins>,
<tt>char16_t</tt> (for literals prefixed by '<tt>u</tt>'),
<tt>char32_t</tt> (for literals prefixed by '<tt>U</tt>'),</ins>
or wchar_t (for wide literals).
</blockquote>

<h3>2.13.4 String literals</h3>

<p>To the grammar, add
<blockquote>
<dl>
<dt><var>string-literal:</var></dt>
<dd><tt>u"</tt> <var>s-char-sequence<sub>opt</sub></var> <tt>"</tt></dd>
<dd><tt>U"</tt> <var>s-char-sequence<sub>opt</sub></var> <tt>"</tt></dd>
</dl>
</blockquote>

<p>In paragraph 1, edit
<blockquote>
A string literal is a sequence of characters (as defined in 2.13.2)
surrounded by double quotes,
optionally beginning with <ins>one of</ins> the letter<ins>s
<tt>u</tt>, <tt>U</tt>, or</ins> <tt>L</tt>,
as in <tt>"..."</tt><ins>, <tt>u"..."</tt>, <tt>U"..."</tt></ins>
or <tt>L"..."</tt><ins>, respectively</ins>.
A string literal that does not begin with
<ins><tt>u</tt>, <tt>U</tt>, or</ins> <tt>L</tt>,
is an ordinary string literal,
also referred to as a narrow string literal.
An ordinary string literal has type "array of <var>n</var> <tt>const char</tt>"
and <ins>has</ins> static storage duration (3.7),
where <var>n</var> is the size of the string as defined below,
and is initialized with the given characters.

<ins>A string literal that begins with <tt>u</tt>, such as <tt>u"asdf"</tt>,
is a <tt>char16_t</tt> string literal.
A <tt>char16_t</tt> string literal has type
"array of <var>n</var> <tt>const char16_t</tt>"
and has static storage duration,
where <var>n</var> is the size of the string as defined below,
and is initialized with the given characters.
A single <tt><var>c-char</var></tt>
may produce more than one <tt>char16_t</tt>
in the form of surrogate pairs.</ins>

<ins>A string literal that begins with <tt>U</tt>, such as <tt>U"asdf"</tt>,
is a <tt>char32_t</tt> string literal.
A <tt>char32_t</tt> string literal has type
"array of <var>n</var> <tt>const char32_t</tt>"
and has static storage duration,
where <var>n</var> is the size of the string as defined below,
and is initialized with the given characters.</ins>

A string literal that begins with <tt>L</tt>, such as <tt>L"asdf"</tt>,
is a wide string literal.
A wide string literal has type "array of <var>n</var> <tt>const wchar_t</tt>"
and has static storage duration,
where <var>n</var> is the size of the string as defined below,
and is initialized with the given characters.

</blockquote>
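<p>[Illustration, not part of the proposed wording:
a sketch of the array types of the new string literals
and of the surrogate pair produced for one code point above U+FFFF.
The code point U+1F600 and the function name are chosen for illustration only.]

```cpp
#include <cassert>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to the array type.
static_assert(std::is_same<decltype(u"asdf"), const char16_t (&)[5]>::value,
              "array of 5 const char16_t");
static_assert(std::is_same<decltype(U"asdf"), const char32_t (&)[5]>::value,
              "array of 5 const char32_t");

void check_string_literal_contents() {
    // The last element is the terminating null.
    assert(u"asdf"[4] == 0);
    assert(U"asdf"[4] == 0);
    // One c-char, two char16_t code units: U+1F600 becomes the pair D83D DE00.
    assert(u"\U0001F600"[0] == 0xD83D);
    assert(u"\U0001F600"[1] == 0xDE00);
    // The same c-char is a single char32_t element.
    assert(U"\U0001F600"[0] == 0x1F600);
}
```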

<p>In paragraph 3, edit
<blockquote>
In translation phase 6 (2.1),
adjacent string literals are concatenated.
<del>If a narrow string literal token
is adjacent to a wide string literal token,
the result is a wide string literal.</del>
<ins>
If both string literals have the same prefix,
the resulting concatenated string literal has that prefix.
If one string literal has no prefix,
it is treated as a string literal
of the same prefix as the other operand.
Any other concatenations are conditionally supported
with implementation-defined behavior.
Note that this concatenation is an interpretation, not a conversion.
[<i>Example:</i> Here are some examples of valid concatenations:
<table border=1 cellpadding=2>
<tr>
<th>source</th><th>means</th>
<th>source</th><th>means</th>
<th>source</th><th>means</th>
<tr>
<td><tt>u"a" u"b"</tt></td><td><tt>u"ab"</tt></td>
<td><tt>U"a" U"b"</tt></td><td><tt>U"ab"</tt></td>
<td><tt>L"a" L"b"</tt></td><td><tt>L"ab"</tt></td>
<tr>
<td><tt>u"a" "b"</tt></td><td><tt>u"ab"</tt></td>
<td><tt>U"a" "b"</tt></td><td><tt>U"ab"</tt></td>
<td><tt>L"a" "b"</tt></td><td><tt>L"ab"</tt></td>
<tr>
<td><tt>"a" u"b"</tt></td><td><tt>u"ab"</tt></td>
<td><tt>"a" U"b"</tt></td><td><tt>U"ab"</tt></td>
<td><tt>"a" L"b"</tt></td><td><tt>L"ab"</tt></td>
</table>
]
</ins>
Characters in concatenated strings are kept distinct.
[ <i>Example:</i>
<blockquote>
<tt>"\xA"</tt> <tt>"B"</tt>
</blockquote>
contains the two characters <tt>\xA</tt> and <tt>B</tt>
after concatenation (and not the single hexadecimal character <tt>\xAB</tt>).
-- end example ]
</blockquote>
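<p>[Illustration, not part of the proposed wording:
a sketch that checks the concatenations in the table above.
The helper <tt>same_array</tt> is hypothetical;
it only compares two arrays of the same element type.]

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when both arrays have equal length and contents.
// Deduction of C fails unless both literals have the same element type.
template <typename C, std::size_t N, std::size_t M>
bool same_array(const C (&a)[N], const C (&b)[M]) {
    return N == M && std::equal(a, a + N, b);
}

void check_concatenation() {
    assert(same_array(u"a" u"b", u"ab"));  // same prefix on both operands
    assert(same_array(u"a" "b", u"ab"));   // unprefixed operand takes the u prefix
    assert(same_array("a" U"b", U"ab"));   // unprefixed operand takes the U prefix
    assert(same_array("a" L"b", L"ab"));   // unprefixed operand takes the L prefix
    // Characters are kept distinct: "\xA" "B" is two characters plus the null.
    assert(sizeof("\xA" "B") == 3);
}
```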

<p>In paragraph 5, edit
<blockquote>
Escape sequences and universal-character-names in string literals
have the same meaning as in character literals (2.13.2),
except that the single quote <tt>'</tt>
is representable either by itself or by the escape sequence <tt>\'</tt>,
and the double quote <tt>"</tt> shall be preceded by a <tt>\</tt>.
In a narrow string literal,
a universal-character-name may map to more than one char element
due to multibyte encoding.

The size of a <ins><tt>char32_t</tt> or</ins> wide string literal
is the total number of escape sequences,
universal-character-names, and other characters,
plus one for the terminating
<ins><tt>U'\0'</tt> or</ins> <tt>L'\0'</tt>.

<ins>The size of a <tt>char16_t</tt> string literal
is the total number of escape sequences,
universal-character-names, and other characters,
plus one for each character requiring a surrogate pair,
plus one for the terminating <tt>u'\0'</tt>.
[Note: The size of a <tt>char16_t</tt> string literal
is the number of code units,
not the number of characters.]
Within <tt>char32_t</tt> or <tt>char16_t</tt> literals,
any universal-character-names must be within the range 0x0 to 0x10FFFF.</ins>

The size of a narrow string literal
is the total number of escape sequences and other characters,
plus at least one for the multibyte encoding of each universal-character-name,
plus one for the terminating <tt>'\0'</tt>.
</blockquote>
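<p>[Illustration, not part of the proposed wording:
a worked sketch of the size rule for <tt>char16_t</tt> string literals.
The helpers <tt>expected_char16_size</tt> and <tt>element_count</tt>
are hypothetical names.
The first applies the rule to the code points of a <tt>char32_t</tt> literal;
the result is compared with the actual size of the matching <tt>char16_t</tt> literal.]

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of elements in an array, terminator included.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// Hypothetical helper: applies the size rule to a sequence of code points.
// One element per character, one more for each character that needs a
// surrogate pair, and one for the terminating u'\0'.
template <std::size_t N>
std::size_t expected_char16_size(const char32_t (&code_points)[N]) {
    std::size_t size = 1;  // terminating u'\0'
    for (std::size_t i = 0; i + 1 < N; ++i) {
        size += (code_points[i] > 0xFFFF) ? 2 : 1;
    }
    return size;
}

void check_literal_sizes() {
    // Three characters plus the terminator.
    assert(element_count(U"a\U0001F600b") == 4);
    // The same characters need one extra code unit for the surrogate pair.
    assert(element_count(u"a\U0001F600b") == 5);
    assert(expected_char16_size(U"a\U0001F600b") == element_count(u"a\U0001F600b"));
}
```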

<h3>3.9.1 Fundamental Types</h3>

<p>In paragraph 5, edit
<blockquote>
Type <tt>wchar_t</tt> is a distinct type
whose values can represent distinct codes
for all members of the largest extended character set
specified among the supported locales (22.1.1).
Type <tt>wchar_t</tt> shall have
the same size, signedness, and alignment requirements (3.9)
as one of the other integral types, called its underlying type.
<ins>Types <tt>char16_t</tt> and <tt>char32_t</tt>
denote distinct types with the same size, signedness, and alignment as
<tt>uint_least16_t</tt> and <tt>uint_least32_t</tt>,
respectively, in <tt>&lt;stdint.h&gt;</tt>,
called the underlying types.</ins>
</blockquote>

<p>The <tt>&lt;stdint.h&gt;</tt> header is from                               
<a href="http://www.open-std.org/jtc1/sc22/wg14/">ISO C</a>                   
as proposed in document WG21 N1835 = 05-0095,                                
and subsequently adopted into                                               
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf">
ISO/IEC TR 19768: C++ Library Extensions TR1</a>.                          
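<p>[Illustration, not part of the proposed wording:
a sketch of the relationship between the new types and their underlying types,
written as compile-time checks.
It uses <tt>&lt;cstdint&gt;</tt> and <tt>&lt;type_traits&gt;</tt>,
which are assumed to be available;
the function name <tt>check_round_trip</tt> is hypothetical.]

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Distinct types, not typedefs.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value, "distinct type");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value, "distinct type");

// Same size, alignment, and signedness as the underlying types.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "same size");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "same size");
static_assert(alignof(char16_t) == alignof(std::uint_least16_t), "same alignment");
static_assert(alignof(char32_t) == alignof(std::uint_least32_t), "same alignment");
static_assert(std::is_unsigned<char16_t>::value, "unsigned");
static_assert(std::is_unsigned<char32_t>::value, "unsigned");

// Both are integral types.
static_assert(std::is_integral<char16_t>::value, "integral");
static_assert(std::is_integral<char32_t>::value, "integral");

// A value converted to char16_t and back is unchanged.
void check_round_trip() {
    const std::uint_least16_t v = 0xFFFF;
    assert(static_cast<std::uint_least16_t>(static_cast<char16_t>(v)) == v);
}
```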

<p>In paragraph 7, edit
<blockquote>
Types <tt>bool</tt>, <tt>char</tt>,
<ins><tt>char16_t</tt>, <tt>char32_t</tt>,</ins> <tt>wchar_t</tt>,
and the signed and unsigned integer types
are collectively called integral types.48)
A synonym for integral type is integer type.
The representations of integral types
shall define values by use of a pure binary numeration system.49) ....
</blockquote>

<h3>4.2 Array-to-pointer conversion</h3>

<p>In paragraph 2, edit
<blockquote>
A string literal (2.13.4) <del>that is not a wide string literal</del>
<ins>with no prefix,
with <tt>u</tt> prefix,
with <tt>U</tt> prefix,
or with <tt>L</tt> prefix</ins>
can be converted to an rvalue of type
"pointer to char"<del>;
a wide string literal can be converted to an rvalue of type</del>
<ins>"pointer to <tt>char16_t</tt>",
"pointer to <tt>char32_t</tt>",
or</ins> "pointer to <tt>wchar_t</tt>"<ins>, respectively</ins>.
In <del>either</del> <ins>any</ins> case,
the result is a pointer to the first element of the array.
....
</blockquote>

<h3>4.5 Integral promotions</h3>

<p>In paragraph 1, edit
<blockquote>
An rvalue of
an integer type other than
<tt>bool</tt><ins>,</ins>
<ins><tt>char16_t</tt>, <tt>char32_t</tt>,</ins>
or <tt>wchar_t</tt>
whose integer conversion rank (4.13)
is less than the rank of <tt>int</tt>
can be converted to an rvalue of type <tt>int</tt>
if <tt>int</tt> can represent all the values of the source type;
otherwise,
the source rvalue can be converted to an rvalue of type <tt>unsigned int</tt>.
</blockquote>

<p>In paragraph 2, edit
<blockquote>
An rvalue of type
<ins><tt>char16_t</tt>, <tt>char32_t</tt>, or</ins>
<tt>wchar_t</tt> (3.9.1)
can be converted to an rvalue of the first of the following types
that can represent all the values of its underlying type:
<tt>int</tt>, <tt>unsigned int</tt>, <tt>long int</tt>,
<tt>unsigned long int</tt>, <tt>long long int</tt>, or
<tt>unsigned long long int</tt>.
If none of the types in that list
can represent all the values of its underlying type,
an rvalue of type
<ins><tt>char16_t</tt>, <tt>char32_t</tt>, or</ins>
<tt>wchar_t</tt>
can be converted to an rvalue of its underlying type.
</blockquote>
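<p>[Illustration, not part of the proposed wording:
a sketch that computes the promoted type from the rule in paragraph 2
and compares it with what the compiler produces for unary <tt>+</tt>.
The templates <tt>first_fit</tt> and <tt>promoted</tt> are hypothetical names,
and the sketch assumes the underlying types are unsigned,
so only their maximum values need comparing.]

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: the first candidate that can represent every value of
// Underlying. With no candidates left, the result is Underlying itself.
template <typename Underlying, typename... Candidates>
struct first_fit { using type = Underlying; };

template <typename Underlying, typename T, typename... Rest>
struct first_fit<Underlying, T, Rest...> {
    using type = typename std::conditional<
        (static_cast<unsigned long long>(std::numeric_limits<Underlying>::max())
             <= static_cast<unsigned long long>(std::numeric_limits<T>::max())),
        T,
        typename first_fit<Underlying, Rest...>::type>::type;
};

// The candidate list from paragraph 2, in order.
template <typename Underlying>
using promoted = typename first_fit<Underlying, int, unsigned int, long int,
                                    unsigned long int, long long int,
                                    unsigned long long int>::type;

// Unary + applies the integral promotions to its operand.
static_assert(std::is_same<decltype(+u'a'), promoted<std::uint_least16_t>>::value,
              "char16_t promotes per paragraph 2");
static_assert(std::is_same<decltype(+U'a'), promoted<std::uint_least32_t>>::value,
              "char32_t promotes per paragraph 2");

// Promotion preserves the value.
void check_promoted_values() {
    assert(+u'a' == 0x61);
    assert(+U'\U0001F600' == 0x1F600);
}
```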

<h3>4.13 Integer conversion rank [conv.rank]</h3>

<p> In paragraph 1, bullet 8, edit
<blockquote>
The rank<ins>s</ins> of
<ins><tt>char16_t</tt>, <tt>char32_t</tt>, and</ins> <tt>wchar_t</tt>
shall equal the rank of <del>its</del>
<ins>their</ins> underlying type<ins>s</ins> (3.9.1).
</blockquote>

<h3>5 Expressions</h3>

<p>In paragraph 10, bullet 4, footnote 59, edit
<blockquote>
As a consequence,
operands of type <tt>bool</tt>,
<ins><tt>char16_t</tt>, <tt>char32_t</tt>,</ins> <tt>wchar_t</tt>,
or an enumerated type are converted to some integral type.
</blockquote>

<h3>5.3.3 Sizeof</h3>

<p>In paragraph 1, note 1, edit
<blockquote>
[ <i>Note:</i> in particular,
<tt>sizeof(bool)</tt><ins>,
<tt>sizeof(char16_t)</tt>, <tt>sizeof(char32_t)</tt>,</ins>
and <tt>sizeof(wchar_t)</tt>
are implementation-defined.73)
-- end note ]
</blockquote>

<h3>7.1.5.2 Simple type specifiers</h3>

<p>To the grammar in paragraph 1, add
<blockquote>
<dl>
<dt><var>simple-type-specifier:</var></dt>
<dd><tt>char16_t</tt></dd>
<dd><tt>char32_t</tt></dd>
</dl>
</blockquote>

<p>To Table 8 "<var>simple-type-specifiers</var> and the types they specify",
add
<blockquote>
<table>
<tr><td><tt>char16_t</tt></td><td>"<tt>char16_t</tt>"</td></tr>
<tr><td><tt>char32_t</tt></td><td>"<tt>char32_t</tt>"</td></tr>
</table>
</blockquote>

<h3>8.5 Initializers</h3>

<p>In paragraph 15, bullet 2, edit
<blockquote>
If the destination type is an array of characters<ins>,
an array of <tt>char16_t</tt>,
an array of <tt>char32_t</tt>,</ins>
or an array of <tt>wchar_t</tt>,
and the initializer is a string literal, see 8.5.2.
</blockquote>

<h3>8.5.2 Character arrays</h3>

<p>In paragraph 1, edit
<blockquote>
A <tt>char</tt> array
(whether plain <tt>char</tt>, <tt>signed char</tt>, or
<tt>unsigned char</tt>)<ins>,
<tt>char16_t</tt> array,
<tt>char32_t</tt> array,
or <tt>wchar_t</tt> array</ins>
can be initialized by a string-literal
(optionally enclosed in braces)<del>;
a <tt>wchar_t</tt> array
can be initialized by a wide string-literal
(optionally enclosed in braces)</del>
<ins>with no prefix,
with <tt>u</tt> prefix, with <tt>U</tt> prefix, or
with <tt>L</tt> prefix, respectively</ins>;
successive characters of the string-literal
initialize the members of the array.
....
</blockquote>
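<p>[Illustration, not part of the proposed wording:
a sketch of array initialization from prefixed string literals.
The function and variable names are hypothetical.]

```cpp
#include <cassert>

void check_array_initialization() {
    // The array bound is deduced from the literal, terminator included.
    char16_t a16[] = u"hi";
    assert(sizeof(a16) / sizeof(a16[0]) == 3);
    assert(a16[0] == u'h' && a16[1] == u'i' && a16[2] == 0);

    // With an explicit bound, the remaining elements are zero-initialized.
    char32_t a32[4] = U"hi";
    assert(a32[0] == U'h' && a32[1] == U'i' && a32[2] == 0 && a32[3] == 0);

    // The literal may optionally be enclosed in braces.
    wchar_t aw[] = { L"hi" };
    assert(sizeof(aw) / sizeof(aw[0]) == 3);

    // The arrays are copies, so they can be modified.
    a16[0] = u'H';
    assert(a16[0] == 0x48);
}
```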

<h3>15.1 Throwing an exception</h3>

<p>In paragraph 3, note 1, edit
<blockquote>
[ <i>Note:</i> the temporary object created
for a <var>throw-expression</var> that is a string literal
is never of type <tt>char*</tt><ins>,
<tt>char16_t*</tt>, <tt>char32_t*</tt>,</ins> or <tt>wchar_t*</tt>;
that is, the special conversions for string literals
from the types "array of <tt>const char</tt>"<ins>,
"array of <tt>const char16_t</tt>",
"array of <tt>const char32_t</tt>",</ins> and
"array of <tt>const wchar_t</tt>"
to the types "pointer to <tt>char</tt>"<ins>,
"pointer to <tt>char16_t</tt>",
"pointer to <tt>char32_t</tt>",</ins> and
"pointer to <tt>wchar_t</tt>", respectively (4.2),
are never applied to a throw-expression. -- <i>end note</i> ]
</blockquote>
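<p>[Illustration, not part of the proposed wording:
a sketch of which handler catches a thrown <tt>char16_t</tt> string literal.
The function name is hypothetical.
Because the special conversion to "pointer to <tt>char16_t</tt>" is never applied,
the handler for the non-const pointer is skipped.]

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: reports which handler catches a thrown u"..." literal.
std::string handler_for_thrown_u16_literal() {
    try {
        throw u"oops";  // the array decays to const char16_t*
    } catch (char16_t*) {
        return "char16_t*";        // not selected: const cannot be dropped
    } catch (const char16_t*) {
        return "const char16_t*";  // selected
    } catch (...) {
        return "other";
    }
}

void check_thrown_literal() {
    assert(handler_for_thrown_u16_literal() == "const char16_t*");
}
```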

<h3>17 Library introduction [library]</h3>

<p>In paragraph 4, edit
<blockquote>
The strings components provide support for manipulating text represented as
sequences of type <tt>char</tt>,
<ins>sequences of type <tt>char16_t</tt>,
sequences of type <tt>char32_t</tt>,</ins>
sequences of type <tt>wchar_t</tt>,
or sequences of any other "character-like" type.
The localization components
extend internationalization support for such text processing.
</blockquote>

<h3>17.1.2 [defns.character]</h3>

<p>In paragraph 1, edit
<blockquote>
<b>character</b><br>
in clauses 21, 22, and 27, means any object which, when treated sequentially,
can represent text.
The term does not only mean <tt>char</tt><ins>,
<tt>char16_t</tt>, <tt>char32_t</tt>,</ins> and <tt>wchar_t</tt> objects,
but any value that can be represented by a type
that provides the definitions specified in these clauses.
</blockquote>

<h3>17.3.2.1.3.4 [NEW] char16-character sequences</h3>

<p>A <dfn>char16-character sequence</dfn>
is an array object (8.3.4) <tt>A</tt>
that can be declared as <tt>T A[N]</tt>,
where <tt>T</tt> is type <tt>char16_t</tt> (3.9.1),
optionally qualified
by any combination of <tt>const</tt> and <tt>volatile</tt>.
The initial elements of the array have defined contents
up to and including an element determined by some predicate.
A character sequence can be designated by a pointer value <tt>S</tt>
that designates its first element.

<p>A <dfn>null-terminated char16-character string</dfn>,
or <dfn>NTC16S</dfn>,
is a char16-character sequence
whose highest-addressed element with defined content
has the value zero.
[Footnote: Many of the objects
manipulated by function signatures declared in <tt>&lt;cuchar&gt;</tt>
are char16-character sequences or NTC16Ss.]

<p>The <dfn>length of an NTC16S</dfn>
is the number of elements
that precede the terminating null char16 character.
An <dfn>empty NTC16S</dfn> has a length of zero.

<p>The <dfn>value of an NTC16S</dfn>
is the sequence of values of the elements
up to and including the terminating null character.

<p>A <dfn>static NTC16S</dfn>
is an NTC16S with static storage duration.
[Footnote: A char16 string literal, such as <tt>u"abc"</tt>,
is a static NTC16S.]

<h3>17.3.2.1.3.5 [NEW] char32-character sequences</h3>

<p>A <dfn>char32-character sequence</dfn>
is an array object (8.3.4) <tt>A</tt>
that can be declared as <tt>T A[N]</tt>,
where <tt>T</tt> is type <tt>char32_t</tt> (3.9.1),
optionally qualified
by any combination of <tt>const</tt> and <tt>volatile</tt>.
The initial elements of the array have defined contents
up to and including an element determined by some predicate.
A character sequence can be designated by a pointer value <tt>S</tt>
that designates its first element.

<p>A <dfn>null-terminated char32-character string</dfn>,
or <dfn>NTC32S</dfn>,
is a char32-character sequence
whose highest-addressed element with defined content
has the value zero.
[Footnote: Many of the objects
manipulated by function signatures declared in <tt>&lt;cuchar&gt;</tt>
are char32-character sequences or NTC32Ss.]

<p>The <dfn>length of an NTC32S</dfn>
is the number of elements
that precede the terminating null char32 character.
An <dfn>empty NTC32S</dfn> has a length of zero.

<p>The <dfn>value of an NTC32S</dfn>
is the sequence of values of the elements
up to and including the terminating null character.

<p>A <dfn>static NTC32S</dfn>
is an NTC32S with static storage duration.
[Footnote: A char32 string literal, such as <tt>U"abc"</tt>,
is a static NTC32S.]
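<p>[Illustration, not part of the proposed wording:
a sketch of the length definitions in 17.3.2.1.3.4 and 17.3.2.1.3.5.
The function template <tt>ntcs_length</tt> is a hypothetical name.
It counts elements, so for an NTC16S it counts code units, not characters.]

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the number of elements that precede the terminating
// null element. The pointer designates the first element of the sequence.
template <typename CharT>
std::size_t ntcs_length(const CharT* s) {
    std::size_t n = 0;
    while (s[n] != CharT()) {
        ++n;
    }
    return n;
}

void check_ntcs_length() {
    assert(ntcs_length(u"abc") == 3);         // static NTC16S
    assert(ntcs_length(U"abc") == 3);         // static NTC32S
    assert(ntcs_length(u"") == 0);            // empty NTC16S
    assert(ntcs_length(u"\U0001F600") == 2);  // one character, two code units
    assert(ntcs_length(U"\U0001F600") == 1);  // one character, one code unit
}
```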

<h3>17.4.1.2 Headers</h3>

<p>To table 12, add <tt>&lt;cuchar&gt;</tt>.

<h3>17.4.3.1.3 External linkage</h3>

<p>In paragraph 5, footnote 168, add <tt>&lt;cuchar&gt;</tt>.

<h3>21.4 Null-terminated sequence utilities</h3>

<p>Add paragraph 20,

<blockquote>
<p>Table 50 describes headers
<tt>&lt;cuchar&gt;</tt> and <tt>&lt;uchar.h&gt;</tt>.
The distinction is that <tt>&lt;cuchar&gt;</tt>
defines the function names within namespace <tt>std</tt>
and that <tt>&lt;uchar.h&gt;</tt>
defines them at global scope.
</blockquote>

<p>Add Table 50,
<blockquote>
<table border=1 cellpadding=2>
<caption>Table 50 -- Headers <tt>&lt;cuchar&gt;</tt> and
<tt>&lt;uchar.h&gt;</tt> synopsis</caption>
<tr><th colspan=2>Macro Names</th></tr>
<tr><td>__STDC_UTF_16__</td><td>__STDC_UTF_32__</td></tr>
<tr><th colspan=2>Function Names</th></tr>
<tr><td>mbrtoc16</td><td>c16rtomb</td></tr>
<tr><td>mbrtoc32</td><td>c32rtomb</td></tr>
</table>
</blockquote>

<h3>21 Strings Library</h3>

<p>Add <tt>&lt;cuchar&gt;</tt> to table 38
under "Null-terminated sequence utilities".

<h3>21.5 [NEW] char16 and char32 characters</h3>

<p>The headers <tt>&lt;cuchar&gt;</tt> and <tt>&lt;uchar.h&gt;</tt>
define macros and declare functions
for use with at-least-16-bit and at-least-32-bit characters.

<h3>21.5.1 [NEW] The <tt>__STDC_UTF_16__</tt>
and <tt>__STDC_UTF_32__</tt> macros</h3>

<p>The headers <tt>&lt;cuchar&gt;</tt> and <tt>&lt;uchar.h&gt;</tt>
shall define the macro <tt>__STDC_UTF_16__</tt>,
and values of type <tt>char16_t</tt> shall be valid UTF-16 code units,
as defined by ISO 10646.

<p>The headers <tt>&lt;cuchar&gt;</tt> and <tt>&lt;uchar.h&gt;</tt>
shall define the macro <tt>__STDC_UTF_32__</tt>,
and values of type <tt>char32_t</tt> shall be valid UTF-32 code units,
as defined by ISO 10646.

<h3>21.5.2 [NEW] The <tt>mbrtoc16</tt> function</h3>

<h4>Synopsis</h4>

<blockquote><tt>
#include &lt;cuchar&gt;<br>
size_t std::mbrtoc16(char16_t * pc16, const char * s, size_t n, mbstate_t * ps);
</tt></blockquote>

<h4>Description</h4>

<p>If <tt>s</tt> is a null pointer,
the <tt>mbrtoc16</tt> function is equivalent to the call:
<blockquote>
<tt>mbrtoc16(NULL, "", 1, ps)</tt>
</blockquote>
In this case,
the values of the parameters <tt>pc16</tt> and <tt>n</tt>
are ignored.

<p>If <tt>s</tt> is not a null pointer,
the <tt>mbrtoc16</tt> function inspects at most <tt>n</tt> bytes
beginning with the byte pointed to by <tt>s</tt>
to determine the number of bytes needed
to complete the next multibyte character
(including any shift sequences).
If the function determines
that the next multibyte character is complete and valid,
it determines the value of the corresponding wide character
and then, if <tt>pc16</tt> is not a null pointer,
stores that value in the object pointed to by <tt>pc16</tt>.
If the corresponding wide character is the null wide character,
the resulting state described is the initial conversion state.

<h4>Returns</h4>

<p>The <tt>mbrtoc16</tt> function
returns the first of the following that applies
(given the current conversion state):

<dl>

<dt><tt>0</tt></dt>
<dd>if the next <tt>n</tt> or fewer bytes
complete the multibyte character
that corresponds to the null wide character
(which is the value stored).</dd>

<dt><tt>[1..n]</tt></dt>
<dd>if the next <tt>n</tt> or fewer bytes
complete a valid multibyte character
(which is the value stored);
the value returned is the number of bytes
that complete the multibyte character.</dd>

<dt><tt>(size_t)(-3)</tt></dt>
<dd>if the multibyte sequence
converted more than one corresponding <tt>char16_t</tt> character
and not all these characters have yet been stored;
the next character in the sequence has now been stored
and no bytes from the input have been consumed by this call.</dd>

<dt><tt>(size_t)(-2)</tt></dt>
<dd>if the next <tt>n</tt> bytes
contribute to an incomplete (but potentially valid) multibyte character,
and all <tt>n</tt> bytes have been processed (no value is stored).

<p>Note:
When <tt>n</tt> has at least the value of the <tt>MB_CUR_MAX</tt> macro,
this case can only occur
if <tt>s</tt> points at a sequence of redundant shift sequences
(for implementations with state-dependent encodings).</dd>

<dt><tt>(size_t)(-1)</tt></dt>
<dd>if an encoding error occurs,
in which case the next <tt>n</tt> or fewer bytes
do not contribute to a complete and valid multibyte character
(no value is stored);
the value of the macro <tt>EILSEQ</tt>
is stored in <tt>errno</tt>,
and the conversion state is unspecified.</dd>

</dl>
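<p>[Illustration, not part of the proposed wording:
a sketch of a conversion loop that handles each return value listed above.
The function name <tt>to_char16</tt> is hypothetical,
and <tt>std::u16string</tt> is assumed to be available from the library,
although string support is left to a separate proposal.
The result depends on the current locale;
with the default "C" locale only ASCII input is expected to convert.]

```cpp
#include <cassert>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a multibyte string to char16_t code units.
std::u16string to_char16(const std::string& in) {
    std::u16string out;
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* p = in.c_str();
    std::size_t left = in.size() + 1;         // include the terminating null
    while (left > 0) {
        char16_t c16 = 0;
        const std::size_t rc = std::mbrtoc16(&c16, p, left, &state);
        if (rc == 0) {
            break;  // the null character was converted
        }
        if (rc == static_cast<std::size_t>(-1)) {
            throw std::runtime_error("encoding error");
        }
        if (rc == static_cast<std::size_t>(-2)) {
            throw std::runtime_error("incomplete multibyte character");
        }
        out.push_back(c16);
        // (size_t)(-3): a further code unit was stored, no input was consumed.
        if (rc != static_cast<std::size_t>(-3)) {
            assert(rc <= left);
            p += rc;
            left -= rc;
        }
    }
    return out;
}
```

<p>[Passing the terminating null as part of the input
lets the loop fetch a pending second code unit
before the final call returns <tt>0</tt>.]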

<h3>21.5.3 [NEW] The <tt>c16rtomb</tt> function</h3>

<h4>Synopsis</h4>

<blockquote><tt>
#include &lt;cuchar&gt;<br>
size_t std::c16rtomb(char * s, char16_t c16, mbstate_t * ps);
</tt></blockquote>

<h4>Description</h4>

<p>If <tt>s</tt> is a null pointer,
the <tt>c16rtomb</tt> function is equivalent to the call
<blockquote>
<tt>c16rtomb(buf, L'\0', ps)</tt>
</blockquote>
where <tt>buf</tt> is an internal buffer.

<p>If <tt>s</tt> is not a null pointer,
the <tt>c16rtomb</tt> function
determines the number of bytes needed
to represent the multibyte character
that corresponds to the wide character given by <tt>c16</tt>
(including any shift sequences),
and stores the multibyte character representation
in the array whose first element is pointed to by <tt>s</tt>.
At most <tt>MB_CUR_MAX</tt> bytes are stored.
If <tt>c16</tt> is a null wide character,
a null byte is stored,
preceded by any shift sequence
needed to restore the initial shift state;
the resulting state described is the initial conversion state.

<h4>Returns</h4>

<p>The <tt>c16rtomb</tt> function
returns the number of bytes stored in the array object;
this may be 0 (including any shift sequences).
When <tt>c16</tt> is not a valid wide character,
an encoding error occurs:
the function
stores the value of the macro <tt>EILSEQ</tt> in <tt>errno</tt>
and returns <tt>(size_t)(-1)</tt>;
the conversion state is unspecified.

<h3>21.5.4 [NEW] The <tt>mbrtoc32</tt> function</h3>

<h4>Synopsis</h4>

<blockquote><tt>
#include &lt;cuchar&gt;<br>
size_t std::mbrtoc32(char32_t * pc32, const char * s, size_t n, mbstate_t * ps);
</tt></blockquote>

<h4>Description</h4>

<p>If <tt>s</tt> is a null pointer,
the <tt>mbrtoc32</tt> function is equivalent to the call:
<blockquote>
<tt>mbrtoc32(NULL, "", 1, ps)</tt>
</blockquote>
In this case,
the values of the parameters <tt>pc32</tt> and <tt>n</tt> are ignored.

<p>If <tt>s</tt> is not a null pointer,
the <tt>mbrtoc32</tt> function inspects at most <tt>n</tt> bytes
beginning with the byte pointed to by <tt>s</tt>
to determine the number of bytes needed
to complete the next multibyte character
(including any shift sequences).
If the function determines that the next multibyte character
is complete and valid,
it determines the value of the corresponding wide character
and then, if <tt>pc32</tt> is not a null pointer,
stores that value in the object pointed to by <tt>pc32</tt>.
If the corresponding wide character is the null wide character,
the resulting state described is the initial conversion state.

<h4>Returns</h4>

<p>The <tt>mbrtoc32</tt> function
returns the first of the following that applies
(given the current conversion state):

<dl>

<dt><tt>0</tt></dt>
<dd>if the next <tt>n</tt> or fewer bytes
complete the multibyte character
that corresponds to the null wide character
(which is the value stored).</dd>

<dt><tt>[1..n]</tt></dt>
<dd>if the next <tt>n</tt> or fewer bytes
complete a valid multibyte character (which is the value stored);
the value returned is the number of bytes
that complete the multibyte character.</dd>

<dt><tt>(size_t)(-3)</tt></dt>
<dd>if the multibyte sequence
converted more than one corresponding <tt>char32_t</tt> character
and not all these characters have yet been stored;
the next character in the sequence has now been stored
and no bytes from the input have been consumed by this call.</dd>

<dt><tt>(size_t)(-2)</tt></dt>
<dd>if the next <tt>n</tt> bytes
contribute to an incomplete (but potentially valid) multibyte character,
and all <tt>n</tt> bytes have been processed (no value is stored).

<p>Note:
When <tt>n</tt> has at least the value of the <tt>MB_CUR_MAX</tt> macro,
this case can only occur
if <tt>s</tt> points at a sequence of redundant shift sequences
(for implementations with state-dependent encodings).</dd>

<dt><tt>(size_t)(-1)</tt></dt>
<dd>if an encoding error occurs,
in which case the next <tt>n</tt> or fewer bytes
do not contribute to a complete and valid multibyte character
(no value is stored);
the value of the macro <tt>EILSEQ</tt> is stored in <tt>errno</tt>,
and the conversion state is unspecified.</dd>

</dl>

<h3>21.5.5 [NEW] The <tt>c32rtomb</tt> function</h3>

<h4>Synopsis</h4>

<blockquote><tt>
#include &lt;cuchar&gt;<br>
size_t std::c32rtomb(char * s, char32_t c32, mbstate_t * ps);
</tt></blockquote>

<h4>Description</h4>

<p>If <tt>s</tt> is a null pointer,
the <tt>c32rtomb</tt> function is equivalent to the call
<blockquote>
<tt>c32rtomb(buf, L'\0', ps)</tt>
</blockquote>
where <tt>buf</tt> is an internal buffer.

<p>If <tt>s</tt> is not a null pointer,
the <tt>c32rtomb</tt> function
determines the number of bytes needed
to represent the multibyte character
that corresponds to the wide character given by <tt>c32</tt>
(including any shift sequences),
and stores the multibyte character representation
in the array whose first element is pointed to by <tt>s</tt>.
At most <tt>MB_CUR_MAX</tt> bytes are stored.
If <tt>c32</tt> is a null wide character,
a null byte is stored,
preceded by any shift sequence
needed to restore the initial shift state;
the resulting state described is the initial conversion state.

<h4>Returns</h4>

<p>The <tt>c32rtomb</tt> function
returns the number of bytes stored in the array object;
this may be 0 (including any shift sequences).
When <tt>c32</tt> is not a valid wide character,
an encoding error occurs:
the function
stores the value of the macro <tt>EILSEQ</tt> in <tt>errno</tt>
and returns <tt>(size_t)(-1)</tt>;
the conversion state is unspecified.
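<p>[Illustration, not part of the proposed wording:
a sketch of converting <tt>char32_t</tt> characters to a multibyte string
with <tt>c32rtomb</tt>.
The function name <tt>to_multibyte</tt> is hypothetical,
and <tt>std::u32string</tt> is assumed to be available from the library.
The buffer is sized with <tt>MB_LEN_MAX</tt>,
which is a constant no smaller than <tt>MB_CUR_MAX</tt>.
The result depends on the current locale;
with the default "C" locale only ASCII characters are expected to convert.]

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts char32_t characters to a multibyte string.
std::string to_multibyte(const std::u32string& in) {
    std::string out;
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    char buf[MB_LEN_MAX];                     // at least MB_CUR_MAX bytes
    for (char32_t c32 : in) {
        const std::size_t rc = std::c32rtomb(buf, c32, &state);
        if (rc == static_cast<std::size_t>(-1)) {
            throw std::runtime_error("encoding error");
        }
        assert(rc <= sizeof buf);
        out.append(buf, rc);  // rc bytes were stored, shift sequences included
    }
    return out;
}
```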

<h3>C.1.1 Clause 2: lexical conventions</h3>

<p>At the end of Subclause _lex.string: Change:, add
<blockquote>
The type of a char16 string literal is changed from
array of <var>some-integer-type</var>
to array of const <tt>char16_t</tt>.
The type of a char32 string literal is changed from
array of <var>some-integer-type</var>
to array of const <tt>char32_t</tt>.
</blockquote>

<h3>C.2.2.4 Header &lt;uchar.h&gt;</h3>

<p>Add section.
<blockquote>
The types <tt>char16_t</tt> and <tt>char32_t</tt>
are distinct types
rather than typedefs to existing integral types.
</blockquote>

<h3>D.5 Standard C Library Headers</h3>

<p>Replace "18 C headers"
with "18 C headers and 1 C technical report header".

<p>To table 101, add
<blockquote>
<tt>&lt;uchar.h&gt;</tt>
</blockquote>

</body>
</html>
