<html>
<head>
<title>UTF-8 String Literals</title>
</head>

<body>
<h1>UTF-8 String Literals</h1>

<p>ISO/IEC JTC1 SC22 WG21 N2159 = 07-0019 - 2007-01-10

<p>Lawrence Crowl

<h2>Problem</h2>

<p>Many users of C++ need to manipulate Unicode character strings.
While <a href="n2149.html"><cite>N2149 New Character Types for C++</cite></a>
addresses most low-level issues,
it does not provide a mechanism to ensure that a string literal is encoded in UTF-8.
For portable international code,
the standard needs such a mechanism.

<h2>Solution</h2>

<p>We propose to add a new lexical token for UTF-8 string literals.
No new types or other language changes are required.
In particular, we do <em>not</em> propose character literals.

<p>Note that this paper does not presume adoption of
<a href="n2149.html">N2149</a>,
and some editorial merge will be necessary.

<p>Likewise, this paper does not presume adoption of
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2053.html">
<cite>N2053 Raw String Literals</cite></a>,
for which some editorial merge will also be necessary.

<h2>References</h2>

<p>See section 2.5 "Encoding Forms" in
<blockquote>
The Unicode Consortium.
The Unicode Standard, Version 5.0.0, defined by:
<cite>The Unicode Standard, Version 5.0</cite>
(Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)
</blockquote>
The online version (printing prohibited)
is at
<a href="http://www.unicode.org/versions/Unicode5.0.0/">
http://www.unicode.org/versions/Unicode5.0.0/</a>.

<p>See
<a href="http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc">
Annex C</a>
of
<a href="http://std.dkuug.dk/JTC1/SC2/WG2/docs/projects#10646">ISO
10646</a>-1,
which is online at
<a href="http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc">
http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc</a>.

<p>See
<a href="http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39921&ICS1=35&ICS2=40&ICS3=">ISO/IEC 10646:2003</a>,
which is publicly available
in several text and PDF files within a zip archive from
<a href="http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip">
http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip</a>.

<p>See
<a href="http://www.unicode.org/faq/utf_bom.html">UTF-8,
UTF-16, UTF-32 &amp; BOM</a>.

<h2>Changes to the C++ Standard</h2>

<h3>2.13.4 String literals</h3>

<p>To the grammar, add
<blockquote>
<dl>
<dt><var>string-literal:</var></dt>
<dd><tt>E"</tt> <var>s-char-sequence<sub>opt</sub></var> <tt>"</tt></dd>
</dl>
</blockquote>

<p>In paragraph 1, replace
<blockquote>
optionally beginning with the letter <tt>L</tt>,
as in <tt>"..."</tt> or <tt>L"..."</tt>
</blockquote>
with
<blockquote>
optionally beginning with one of the letters
<tt>L</tt> or <tt>E</tt>,
as in <tt>"..."</tt>, <tt>L"..."</tt>,
or <tt>E"..."</tt>,
respectively
</blockquote>

<p>To paragraph 1, append
<blockquote>
A string literal that begins with <tt>E</tt>, such as <tt>E"asdf"</tt>,
is a <tt>char</tt> string literal.
The literal has the type <q>array of <var>n</var> <tt>const char</tt></q>
where <var>n</var> is the size of the string as defined below,
and is initialized with the given characters encoded in UTF-8.
It is implementation-defined whether such literals
may contain characters other than members of the basic character set
and universal character names
(<tt>\U</tt><var>nnnnnnnn</var> and <tt>\u</tt><var>nnnn</var>).
</blockquote>
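<p>As a rough illustration of what <q>encoded in UTF-8</q> means
for the array contents,
the following C++ sketch encodes a single code point by hand.
The names <tt>encode_utf8</tt> and <tt>check_encode_utf8</tt> are hypothetical
and are not part of this proposal.
The sketch assumes a compiler that provides <tt>char32_t</tt>,
and it does not reject surrogates or values above U+10FFFF.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the proposal): encode one Unicode
// code point as a sequence of UTF-8 bytes.
// Assumption: cp is a valid scalar value; no validation is performed.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Small self-check; call it from your own main() if desired.
void check_encode_utf8() {
    assert(encode_utf8(0x41) == "A");           // U+0041 stays one byte
    assert(encode_utf8(0xE9) == "\xC3\xA9");    // U+00E9 becomes two bytes
}
```

<p>Under the proposed wording,
<tt>E"\u00E9"</tt> would presumably hold the bytes 0xC3 0xA9
followed by the terminating null,
regardless of the implementation's ordinary narrow encoding.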

<p>To paragraph 3, append
<blockquote>
If any narrow string literal in the concatenation specifies UTF-8 encoding,
the resulting string has UTF-8 encoding.
</blockquote>
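<p>The concatenation rule can be sketched with a stand-in.
The <tt>E</tt> prefix proposed here is not accepted by existing compilers,
so the sketch below assumes a C++11 or later compiler
and uses the <tt>u8</tt> prefix that C++11 later standardized,
which follows a comparable rule when a prefixed literal
is adjacent to an unprefixed one.
The names <tt>combined_size</tt> and <tt>byte_at</tt> are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: u8 stands in for the proposed E prefix.
// Only the first piece carries the prefix; the unprefixed second piece
// is still encoded as UTF-8 in the concatenated result.
constexpr std::size_t combined_size = sizeof(u8"caf" "\u00E9");

// Return byte i of the concatenated literal as an unsigned value,
// so the check does not depend on the literal's element type.
unsigned char byte_at(std::size_t i) {
    const auto& lit = u8"caf" "\u00E9";
    return static_cast<unsigned char>(lit[i]);
}

// Small self-check; call it from your own main() if desired.
void check_concatenation() {
    assert(combined_size == 6);   // 'c' 'a' 'f' + 2 bytes for U+00E9 + null
    assert(byte_at(3) == 0xC3);   // first UTF-8 byte of U+00E9
    assert(byte_at(4) == 0xA9);   // second UTF-8 byte of U+00E9
}
```

<p>With the proposed syntax, the analogous source text would be
<tt>E"caf" "\u00E9"</tt>.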

<p>Paragraph 5 already admits a multi-byte encoding
of ordinary character string literals.

</body>
</html>
