<html>
<head>
<title>UTF-8 String Literals</title>
</head>

<body>
<h1>UTF-8 String Literals</h1>

<p>ISO/IEC JTC1 SC22 WG21 N2209 = 07-0069 - 2007-03-08 

<p>Lawrence Crowl

<p>This document replaces N2159 = 07-0019 - 2007-01-10.

<h2>Problem</h2>

<p>Many users of C++ need to manipulate Unicode character strings.
While
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2149.html">
<cite>N2149 New Character Types for C++</cite></a>
addresses most low-level issues,
it does not provide a mechanism to ensure that a string literal is encoded in UTF-8.
For portable international code,
the standard needs such a mechanism.

<h2>Solution</h2>

<p>We propose to add a new lexical token for UTF-8 string literals.
No new types or other language changes are required.
In particular, we do <em>not</em> propose character literals.

<p>Adoption of this paper requires all conforming implementations
to have bytes of at least eight bits in size.
We believe that all existing systems already conform.

<p>Note that this paper does not presume adoption of 
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2149.html">
<cite>N2149 New Character Types for C++</cite></a>,
so some editorial merge will be necessary.

<p>Likewise, this paper does not presume adoption of
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2053.html">
<cite>N2053 Raw String Literals</cite></a>,
for which some editorial merge will also be necessary.

<h2>References</h2>

<p>See section 2.5 "Encoding Forms" in
<blockquote>
The Unicode Consortium.
The Unicode Standard, Version 5.0.0, defined by:
<cite>The Unicode Standard, Version 5.0</cite>
(Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)
</blockquote>
The online version (printing prohibited)
is at
<a href="http://www.unicode.org/versions/Unicode5.0.0/">
http://www.unicode.org/versions/Unicode5.0.0/</a>.

<p>See
<a href="http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc">
Annex C</a>
of
<a href="http://std.dkuug.dk/JTC1/SC2/WG2/docs/projects#10646">ISO
10646</a>-1,
which is online at
<a href="http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc">
http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc</a>.

<p>See
<a href="http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39921&amp;ICS1=35&amp;ICS2=40&amp;ICS3=">ISO/IEC 10646:2003</a>,
which is publicly available
in several text and PDF files within a zip archive from
<a href="http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip">
http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip</a>.

<p>See
<a href="http://www.unicode.org/faq/utf_bom.html">UTF-8,
UTF-16, UTF-32 &amp; BOM</a>.

<h2>Changes to the C++ Standard</h2>

<h3>1.7 The C++ memory model [intro.memory]</h3>

<p>To paragraph 1, edit

<blockquote>
The fundamental storage unit in the C++ memory model is the <dfn>byte</dfn>.
A byte is at least large enough to contain
<del>any member of the basic execution character set</del>
<ins>the eight-bit code units of the Unicode UTF-8 encoding form</ins>
and is composed of a contiguous sequence of bits,
the number of which is implementation-defined.
The least significant bit is called the <dfn>low-order bit</dfn>;
the most significant bit is called the <dfn>high-order bit</dfn>.
The memory available to a C++ program
consists of one or more sequences of contiguous bytes.
Every byte has a unique address.
</blockquote>
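<p>As an informal illustration, not proposed wording,
the revised byte requirement can be sketched as a compile-time check.
The helper name <code>byte_holds_code_unit</code> is hypothetical
and exists only for this example.

```cpp
#include <cassert>
#include <climits>  // CHAR_BIT

// The edited wording requires a byte to hold an eight-bit UTF-8 code unit,
// which amounts to requiring at least eight bits per byte.
static_assert(CHAR_BIT >= 8, "a byte must hold an eight-bit UTF-8 code unit");

// Hypothetical helper: stores a code unit value (0x00..0xFF) in one byte
// and reports whether the value survives the round trip.
inline bool byte_holds_code_unit(unsigned value) {
    unsigned char byte = static_cast<unsigned char>(value);
    return byte == value;
}
```

<p>On an implementation that satisfies the requirement,
the helper returns <code>true</code> for every value from 0x00 through 0xFF.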

<h3>2.13.4 String literals [lex.string]</h3>

<p>To the grammar, edit
<blockquote>
<dl>
<dt><var>string-literal:</var></dt>
<dd><tt>"</tt> <var>c-char-sequence<sub>opt</sub></var> <tt>"</tt></dd>
<dd><ins><tt>E"</tt> <var>c-char-sequence<sub>opt</sub></var> <tt>"</tt></ins></dd>
<dd><tt>L"</tt> <var>c-char-sequence<sub>opt</sub></var> <tt>"</tt></dd>
</dl>
</blockquote>

<p>To paragraph 1, edit
<blockquote>
A string literal is a sequence of characters (as defined in 2.13.2)
surrounded by double quotes,
optionally beginning with <ins>one of</ins> the letter<ins>s E or</ins> L,
as in <code>"..."</code><ins>, <code>E"..."</code>,</ins>
or <code>L"..."</code>.
A string literal
that does not begin with <ins><code>E</code> or</ins> <code>L</code>
is an ordinary string literal,
<ins>and is initialized with the given characters.
A string literal that begins with <tt>E</tt>, such as <tt>E"asdf"</tt>,
is a UTF-8 string literal
and is initialized with the given characters as encoded in UTF-8.
It is implementation-defined whether literals
may contain characters other than members of the basic character set
and universal character names
(<tt>\U</tt><var>nnnnnnnn</var> and <tt>\u</tt><var>nnnn</var>).
Ordinary string literals and UTF-8 string literals are</ins>
also referred to as <del>a</del> narrow string literal<ins>s</ins>.
An <del>ordinary</del> <ins>narrow</ins> string literal has type
"array of <var>n</var> <code>const char</code>"
and static storage duration (3.7),
where <var>n</var> is the size of the string as defined below<del>,
and is initialized with the given characters</del>.
A string literal that begins with <code>L</code>,
such as <code>L"asdf"</code>, is a wide string literal.
A wide string literal has type
"array of <var>n</var> <code>const wchar_t</code>"
and has static storage duration,
where <var>n</var> is the size of the string as defined below,
and is initialized with the given characters.
</blockquote>
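<p>The following sketch illustrates the bytes a UTF-8 string literal would hold.
Because the <code>E"..."</code> prefix is only proposed here
and is not accepted by existing compilers,
the example spells out the expected code units with hexadecimal escapes.
The function <code>encode_utf8</code> and the array <code>e_acute</code>
are hypothetical names introduced for illustration;
they are not part of the proposal.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-8 code units.
inline std::string encode_utf8(char32_t cp) {
    // Precondition: cp is at most U+10FFFF and is not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // one code unit
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // two code units
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // three code units
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // four code units
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Stand-in for the proposed E"\u00E9": the same bytes written by hand.
// Like any narrow string literal, it initializes an array of const char.
const char e_acute[] = "\xC3\xA9";
static_assert(sizeof(e_acute) == 3, "two code units plus the terminating null");
```

<p>Under this reading, <code>E"\u00E9"</code> would have type
"array of 3 <code>const char</code>":
two UTF-8 code units followed by the terminating null character.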

<p>In paragraph 3, edit
<blockquote>
In translation phase 6 (2.1), adjacent string literals are concatenated.
<ins>If an ordinary string literal token
is adjacent to a UTF-8 string literal token,
the result is a UTF-8 string literal.</ins>
If <del>a narrow</del> <ins>an ordinary</ins>
string literal token
is adjacent to a wide string literal token,
the result is a wide string literal.
<ins>If a UTF-8 string literal token
is adjacent to a wide string literal token,
the program is ill-formed.</ins>
</blockquote>
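<p>The concatenation rules above can be modelled as a small table.
This is a sketch of one reading of the wording, not an implementation:
the names <code>Kind</code>, <code>Concat</code>, and <code>concatenate</code>
are hypothetical, the rules are assumed to apply in either order of adjacency,
and two literals of the same kind are assumed to keep that kind.

```cpp
#include <cassert>

// Hypothetical model of the three literal kinds discussed in this paper.
enum class Kind { Ordinary, Utf8, Wide };

struct Concat {
    bool well_formed;  // false when the program is ill-formed
    Kind kind;         // meaningful only when well_formed is true
};

// Kind of the literal produced by concatenating two adjacent tokens.
inline Concat concatenate(Kind a, Kind b) {
    // Assumption: literals of the same kind concatenate to that kind.
    if (a == b) return {true, a};
    // An ordinary literal adopts the kind of its UTF-8 or wide neighbour.
    if (a == Kind::Ordinary) return {true, b};
    if (b == Kind::Ordinary) return {true, a};
    // The only remaining pairing is UTF-8 with wide, which is ill-formed.
    return {false, Kind::Ordinary};
}
```

<p>For example, <code>concatenate(Kind::Ordinary, Kind::Utf8)</code>
yields a well-formed UTF-8 result,
while <code>concatenate(Kind::Utf8, Kind::Wide)</code> reports an ill-formed program.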

<p>Paragraph 5 already admits a multibyte encoding of narrow string literals,
so it needs no change.

<h3>3.9.1 Fundamental Types [basic.fundamental]</h3>

<p>To paragraph 1, edit

<blockquote>
Objects declared as characters (<code>char</code>)
shall be large enough to store
<ins>either one byte (1.7 [intro.memory]) or</ins>
any member of the implementation's basic character set.
If a character from this set is stored in a character object,
the integral value of that character object
is equal to the value of the single character literal form of that character.
</blockquote>
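<p>One consequence is that each <code>char</code> in a UTF-8 string
holds a code unit rather than a whole character.
The sketch below, which assumes well-formed UTF-8 input,
contrasts the number of code units with the number of code points.
The helper <code>count_code_points</code> is a hypothetical name
used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

static_assert(sizeof(char) == 1, "a char object occupies exactly one byte");

// Hypothetical helper: counts the code points in a null-terminated,
// well-formed UTF-8 string by skipping continuation units (10xxxxxx).
inline std::size_t count_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        // Inspect the code unit as an unsigned value, since plain char
        // may be signed.
        unsigned char unit = static_cast<unsigned char>(*s);
        if ((unit & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

<p>For the bytes <code>"\xC3\xA9t\xC3\xA9"</code>,
which encode a three-character word with two accented letters,
<code>std::strlen</code> reports five code units
while <code>count_code_points</code> reports three code points.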

</body>
</html>
