<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>



  <meta name="generator" content="Microsoft FrontPage 5.0">
  <meta http-equiv="Content-Language" content="en-us">
  <meta name="ProgId" content="FrontPage.Editor.Document">
  <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"><title>Raw and Unicode String Literals; Unified Proposal</title></head><body>
  <p>Doc. no.&nbsp;&nbsp; <span style="background-color: rgb(255, 255, 0);">N2442</span><span style="background-color: rgb(255, 255, 0);">=07-0312</span><br>
  Date:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
  2007-10-05<br>

  Project:&nbsp;&nbsp;&nbsp;&nbsp; Programming Language C++<br>
  Reply to:&nbsp;&nbsp; Lawrence Crowl &lt;lawrence at
  crowl.org&gt;<br>
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  Beman Dawes &lt;bdawes at acm.org&gt;</p>

  <h1>Raw and Unicode String Literals; Unified Proposal (Rev. 2)</h1>

  <h2>Introduction</h2>

  <p>Two papers, <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2209.html">
  N2209, UTF-8 String Literals</a>, by Lawrence Crowl, and <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2146.html">
  N2146, Raw String Literals (Revision 1)</a>, by Beman Dawes,
  propose additional forms of string literals for C++. Both have
  been approved by the Evolution Working Group and are ready for
  processing by the Core Working Group. Both papers make changes to
  the same text in the Working Paper. This proposal unifies the
  changed wording to avoid race conditions in editing the text.</p>

  <p>The motivation, discussion, and other details from the
  original proposals remain unchanged.</p>

  <h2>Revision history</h2>

  <h3><span style="background-color: rgb(255, 255, 0);">N2442</span></h3>

  <p>Changes from
  <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2384.html">
  N2384</a>:</p>

  <ul>
    <li>Excluded capital "U" from the grammar for r-char.<br>
&nbsp;</li>
    <li>Added "(1)" and "(2)" to clarify wording in specification of grammar for 
    r-char.<br>
&nbsp;</li>
    <li>Retained the 1.7 words "any member of the basic execution character set", 
    but added the indicated eight-bit UTF-8 wording.<br>
&nbsp;</li>
    <li>In 1.7, paragraph 1, section 5, made the wording singular and fixed the conjunctions.<br>
&nbsp;</li>
    <li>After the grammar, ended the sentence with a full stop at "... initial d-char-sequence". 
    Removed "The maximum length of d-char-sequence shall be 16 characters." and added 
    "A d-char-sequence shall consist of at most 16 characters."<br>
&nbsp;</li>
    <li>In "The terminating d-char-sequence of a raw-string shall be the same 
    sequence of characters as the initial d-char-sequence," the words "shall be" imply 
    that an implementation has to diagnose a mismatch, which isn't the case. Consensus: 
    replaced "shall be" with "is".<br>
&nbsp;</li>
    <li>Added footnote: "For a specification of Unicode and UTF-8, see ISO 
    10646."</li>
  </ul>

  <h3>
  <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2384.html">N2384</a></h3>

  <p>Changes from 
  <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2295.html">
  N2295</a>:</p>

  <ul>
    <li>Removed the note that read<i> "[Note:</i>
  Implementations are encouraged to accept as physical source file
  characters all the permissible characters whose character short name in
  ISO/IEC 10646 is 0000NNNN. <i>--end note]" </i>The core working group 
    considers that market forces are adequate to motivate implementors.<br>
&nbsp;</li>
    <li>In the grammar, the sequence <i><code>"</code>d-char-sequence</i><sub> opt</sub>
    <code>[</code><i>r-char-sequence</i><sub>opt</sub>
    <code>]</code><i>d-char-sequence</i><sub>opt</sub>
    <code>"</code> was factored into a new non-terminal named <i>
    raw-string</i>, at the request of Jens Maurer. Clark Nelson suggested the 
    name.<br>
&nbsp;</li>
    <li>Wording for <i>r-char</i> was clarified in response to comments from 
    Jens Maurer, Clark Nelson, and Lawrence Crowl.<br>
&nbsp;</li>
    <li>The space character was excluded from <i>d-char</i> in response to a 
    request by Lawrence Crowl.<br>
&nbsp;</li>
    <li>The uR example was corrected, fixing an error noticed by James Widman.<br>
&nbsp;</li>
    <li><i>d-char</i> has been limited to the basic source character set, as 
    suggested by Clark Nelson and Beman Dawes.</li>
  </ul>

  <h3>
  <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2295.html">
  N2295</a></h3>

  <p>The proposed text is the same as in the original papers (<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2209.html">N2209,</a> <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2146.html">
  N2146</a>),
  except:</p>

  <ul>
    <li>The original raw string literal syntax allowed the 'R' that
    denotes a raw string literal either before or after other
    prefixes. Thus either LR or RL were valid. To reduce the
    combinatorial explosion caused by the addition of the u, U, and
    u8 prefixes, the R is now only valid following the other
    portion of a prefix. This is the same as in Python.<br>
    &nbsp;</li>

    <li>The original UTF-8 string literal wording made any source
    character set extensions to the basic source character set
    implementation-defined, but only for literals. It seemed
    awkward to make the source character set
    implementation-defined, but only in literals, so that was
    changed to apply to the entire source file. Non-normative
    encouragement to support all 16-bit ISO/IEC 10646 characters
    was added to encourage physical source file character set
    uniformity. That's existing practice for compilers such as
    VC++.</li>
  </ul>

  <h2>Proposed Text</h2>

  <p><i>Change 1.7 The C++ memory model [intro.memory] as indicated:</i></p>

  <p>The fundamental storage unit in the C++ memory model is the
  <dfn>byte</dfn>. A byte is at least large enough to contain any member of the 
  basic execution character set <font color="#228822"><ins>
  and the eight-bit code units of 
  the Unicode UTF-8 encoding form</ins></font> and is composed of a contiguous 
  sequence of bits, the number of which is implementation-defined. The least 
  significant bit is called the <dfn>low-order bit</dfn>; the most significant 
  bit is called the <dfn>high-order bit</dfn>. The memory available to a C++ 
  program consists of one or more sequences of contiguous bytes. Every byte has 
  a unique address.</p>
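  <p>As an illustration only (not part of the proposed wording), the added 
  requirement can be read as a lower bound of eight bits per byte. The sketch 
  below checks that bound at compile time; the helper name 
  <code>fits_in_byte</code> is hypothetical and exists only for this example.</p>

```cpp
#include <cassert>
#include <climits>  // CHAR_BIT, UCHAR_MAX

// A byte has to hold every eight-bit UTF-8 code unit, so it needs at least 8 bits.
static_assert(CHAR_BIT >= 8, "a byte must be able to hold a UTF-8 code unit");

// Hypothetical helper: true if the value is representable in one byte here.
constexpr bool fits_in_byte(unsigned value) {
    return value <= UCHAR_MAX;
}

// 0xFF is the largest value an eight-bit code unit could take.
static_assert(fits_in_byte(0xFF), "largest eight-bit value fits in a byte");
```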

  <p><i>Change 2.1 [lex.phases], paragraph 1 as indicated. (Note to
  reviewers: the ISO/IEC short name wording is the same as used in
  2.2 Character sets [lex.charset] paragraph two.)</i></p>

  <p>1. Physical source file characters are mapped, in an
  implementation-defined manner, to the basic source character set
  (introducing new-line characters for end-of-line indicators) if
  necessary. <u><font color="#228822">The set of physical source
  file characters accepted is implementation-defined. </font></u>&nbsp;Trigraph
  sequences (2.3) are replaced by corresponding single-character
  internal representations. Any source file character not in the
  basic source character set (2.2) is replaced by the
  universal-character-name that designates that character. (An
  implementation may use any internal encoding, so long as an
  actual extended character encountered in the source file, and the
  same extended character expressed in the source file as a
  universal-character-name (i.e. using the \uXXXX notation), are
  handled equivalently.)</p>

  <p><i>Change 2.1 [lex.phases], paragraph 1, phase 5 as indicated:</i></p>

  <p>5. Each source character set member<font color="#ff0000"><strike>, escape sequence</strike></font>, or
  universal-character-name in <font color="#ff0000"><strike>
  character literals and string
  literals</strike></font><u><font color="#228822"> 
  a character literal or a string literal, or escape sequence in
  a character literal or a non-raw string literal,</font></u> is
  converted to the corresponding member of the execution character
  set (2.13.2, 2.13.4); if there is no corresponding member, it is
  converted to an implementation-defined member other than the null
  (wide) character.<sup>17)</sup></p>
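  <p>As an illustration only (not part of the proposed wording), the effect of 
  excluding raw string literals from escape-sequence conversion can be sketched 
  as follows. The sketch assumes a compiler implementing ISO C++11, where the 
  raw-string delimiters that were eventually standardized are parentheses 
  rather than the square brackets proposed here. The function names are 
  hypothetical.</p>

```cpp
#include <cassert>
#include <cstring>

// Assumption: C++11 syntax R"( ... )"; this paper proposes R"[ ... ]" instead.

// Non-raw literal: \t is an escape sequence, so the string has 3 characters.
inline const char* cooked() { return "a\tb"; }

// Raw literal: the backslash and the 't' are kept as two ordinary characters.
inline const char* raw() { return R"(a\tb)"; }
```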

  <p><i>Change 2.13.4 String literals [lex.string] as
  indicated:</i></p>

  <blockquote>
    <p><i>string-literal:<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</i>
    <code>"</code><i>s-char-sequence<sub>opt</sub></i><code>"</code><i>
    <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</i> <font color="#228822"><u><code>u8"</code><i>s-char-sequence<sub>opt</sub></i><code>
    "</code></u></font><i><br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</i>
    <code>u"</code><i>s-char-sequence<sub>opt</sub></i><code>"</code><i>
    <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</i>
    <code>U"</code><i>s-char-sequence<sub>opt</sub></i><code>"</code><i>
    <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</i>
    <code>L"</code><i>s-char-sequence<sub>opt</sub></i><code>"<br>
    &nbsp;&nbsp;&nbsp;</code> <u><font color="#228822"><code>R<i> </i></code><i>raw-string</i><code><br></code></font></u>
    <code>&nbsp;&nbsp;&nbsp;</code> <u><font color="#228822"><code>u8R<i> </i></code><i>raw-string</i><code><br></code></font></u>
    <code>&nbsp;&nbsp;&nbsp;</code> <u><font color="#228822"><code>uR<i> </i></code><i>raw-string</i><code><br></code></font></u>
    <code>&nbsp;&nbsp;&nbsp;</code> <u><font color="#228822"><code>UR<i> </i></code><i>raw-string</i><code><br></code></font></u>
    <code>&nbsp;&nbsp;&nbsp;</code> <u><font color="#228822"><code>LR<i> </i></code><i>raw-string</i></font></u></p>

    <p><i>s-char-sequence:<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s-char<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s-char-sequence
    s-char</i></p>

    <p><i>s-char:</i><br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; any member of the
    source character set except the double-quote ", backslash \, or
    new-line character<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <i>escape-sequence<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    universal-character-name</i></p>

    <p align="left"><font color="#228822"><i><u>
    raw-string:<code><br>
    </code></u><code>&nbsp; <u>"</u></code><u>d-char-sequence</u></i><u><sub> opt</sub>
    <code>[</code><i>r-char-sequence</i><sub>opt</sub>
    <code>]</code><i>d-char-sequence</i><sub>opt</sub>
    <code>"</code></u></font></p>

    <p align="left"><i><font face="Times New Roman"><u><font color="#228822">r-char-sequence:</font></u><br>
    &nbsp;&nbsp;&nbsp; <u><font color="#228822">r-char</font></u><br>
    &nbsp;&nbsp;&nbsp; <u><font color="#228822">r-char-sequence
    r-char</font></u></font></i></p>

    <p><font face="Times New Roman"><i><u><font color="#228822">r-char:</font></u><br>
    &nbsp;&nbsp;&nbsp;</i> <u><font color="#228822">any member of the source 
    character set, except, (1), a<b> </b></font></u> </font><font color="#228822"><u>
    backslash \</u></font><font face="Times New Roman"><u><font color="#228822"><b> </b>followed by a u
    or U, or, 
    (2), 
    a right square bracket
    ]<br></font></u> <font color="#228822">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<u>followed by the initial</u></font></font><u> <font color="#228822"><i>d-char-sequence</i> (which may be empty) followed by a double quote
    "</font></u><font color="#228822" face="Times New Roman"><u>.<br></u> &nbsp;&nbsp;&nbsp;
    <i><u>universal-character-name</u></i></font></p>

    <p align="left"><i><font face="Times New Roman"><u><font color="#228822">d-char-sequence:</font></u><br>
    &nbsp;&nbsp;&nbsp; <u><font color="#228822">d-char</font></u><br>
    &nbsp;&nbsp;&nbsp; <u><font color="#228822">d-char-sequence
    d-char</font></u></font></i></p>

    <p><font face="Times New Roman"><u><font color="#228822"><i>d-char:</i></font></u><br>
    &nbsp;&nbsp;&nbsp; <font color="#228822"><u>any member of the basic source character set, except space, the left square bracket [, the
    right square bracket ],<br></u>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<u>or the control characters representing horizontal tab,
    vertical tab, form feed, or new-line.</u></font></font></p>
  </blockquote>

  <p>A string literal is a sequence of characters (as defined in
  2.13.2) surrounded by double quotes, optionally
  <strike><font color="#ff0000">beginning with one of the
  letters</font></strike> <u><font color="#228822">prefixed by
  <code>R</code>,</font></u> <font color="#228822"><u><code>u8</code></u></font>, <font color="#228822"><u><code>u8R</code></u></font>, <code>u</code>,
  <font color="#228822"><u><code>uR</code>,</u></font>
  <code>U</code>, <u><font color="#228822"><code>UR</code>,</font></u>&nbsp; <code>L</code>, or
  <u><font color="#228822"><code>LR</code></font></u>, as in
  <code>"..."</code><font color="#228822"><u>,&nbsp;<code>R"[...]"</code></u></font><u><font color="#228822">
  ,&nbsp;</font></u> <code><font color="#228822"><u>u8"..."</u></font></code><u><font color="#228822">,&nbsp;</font></u> <code><font color="#228822"><u>u8R"**[...]**"</u></font></code>, <code>u"..."</code>,
  <u><font color="#228822"><code>uR"*@[...]*@"</code>,</font></u>
  <code>U"..."</code>, <code><u><font color="#228822">UR"zzz[...]zzz"</font></u></code><u><font color="#228822">,</font></u> <code>L"..."</code>, or <font color="#228822"><u><code>LR"[...]"</code></u></font>, respectively.</p>
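  <p>As an illustration only (not part of the proposed wording), the prefix 
  combinations can be checked against the element types they select. The sketch 
  assumes an ISO C++11 or later compiler and therefore uses the parenthesis 
  delimiters that were standardized instead of the square brackets proposed 
  here. The helper <code>element_count</code> is hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// decltype of a string literal is a reference to an array of const elements.
static_assert(std::is_same<decltype(R"(x)"),  const char (&)[2]>::value,     "R");
static_assert(std::is_same<decltype(uR"(x)"), const char16_t (&)[2]>::value, "uR");
static_assert(std::is_same<decltype(UR"(x)"), const char32_t (&)[2]>::value, "UR");
static_assert(std::is_same<decltype(LR"(x)"), const wchar_t (&)[2]>::value,  "LR");

// u8 elements are char before C++20 and char8_t from C++20 on; both are one byte.
static_assert(sizeof(u8R"(x)") == 2, "u8R");

// Hypothetical helper: number of elements in a literal, terminator included.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }
```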

  <p><u><font color="#228822">A
  string literal that has an <code>R</code> in the prefix is a <i>raw
  string literal</i>. The terminating <i>d-char-sequence</i> of a <i>raw-string </i>
  is the same sequence of characters as the
  initial <i>d-char-sequence</i></font></u><font color="#228822"><span style="text-decoration: underline;">. 
  A <i>d-char-sequence</i> shall consist of at most 16 characters.</span></font></p>

  <p><u><font color="#228822"><i>[Note:</i> A source-file new-line
  in a raw string-literal results in a new-line in the resulting
  execution <i>string-literal</i>, unless preceded by a backslash.
  Assuming no whitespace at the beginning of lines in the following
  example, the assert will succeed:</font></u></p>

  <p><font color="#228822"><code>&nbsp;&nbsp; <u>const char * p =
  R"[a\</u><br>
  &nbsp;&nbsp; <u>b<br></u>&nbsp;&nbsp; <u>c]";</u><br>
  &nbsp;&nbsp; <u>assert(std::strcmp(p, "ab\nc") ==
  0);<br></u><br></code></font><u><font color="#228822">&nbsp;<i>--
  end note]</i></font></u></p>
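  <p>As an illustration only (not part of the proposed wording), the sketch 
  below shows a source new-line inside a raw string literal and a non-empty 
  <i>d-char-sequence</i>. It assumes an ISO C++11 or later compiler, so it uses 
  the standardized parenthesis delimiters rather than the square brackets 
  proposed here. It deliberately avoids the backslash-new-line case, whose 
  treatment in the published C++11 standard differs from the note above. The 
  function names are hypothetical.</p>

```cpp
#include <cassert>
#include <cstring>

// A source new-line inside a raw string literal becomes a new-line character.
inline const char* two_lines() {
    return R"(b
c)";
}

// A non-empty delimiter lets the body contain )" without ending the literal.
inline const char* with_quote() {
    return R"delim(say )" here)delim";
}
```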

  <p>A string literal that does not begin with <font color="#228822"><u><code>u8</code>,</u></font> <code>u</code>,
  <code>U</code>, or <code>L</code> is an ordinary string literal,
  <font color="#228822"><u>and is initialized with the given
  characters.</u></font></p>

  <p><font color="#228822"><u>A string literal that begins with
  <code>u8</code>, such as <code>u8</code><tt>"asdf"</tt>, is a
  UTF-8 string literal and is initialized with the given characters
  as encoded in UTF-8.<sup><i>footnote</i></sup></u></font></p>

  <p><sup><u><font color="#228822"><i>
  footnote </i></font></u></sup><font color="#228822"><u>
  For a specification of Unicode and 
  UTF-8, see ISO 10646.</u></font></p>
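  <p>As an illustration only (not part of the proposed wording), a UTF-8 string 
  literal containing U+00E9 holds the two code units 0xC3 and 0xA9. The sketch 
  is written so that it compiles both where <code>u8</code> literals have 
  <code>char</code> elements, as proposed here, and under C++20, where the 
  element type became <code>char8_t</code>. The helper 
  <code>code_unit</code> is hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>

// U+00E9 needs two UTF-8 code units, plus one element for the terminator.
static_assert(sizeof(u8"\u00E9") == 3, "two code units plus terminator");

// Hypothetical helper: the i-th code unit of the literal as an unsigned value.
inline unsigned code_unit(std::size_t i) {
    const auto& s = u8"\u00E9";  // reference to an array of 3 elements
    return static_cast<unsigned char>(s[i]);
}
```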

  <p><font color="#228822"><u>Ordinary string literals and UTF-8
  string literals are</u></font> also referred to as
  <strike><font color="#ff0000">a</font></strike> narrow string
  literal<font color="#228822"><u>s</u></font>.
  A<strike><font color="#ff0000">n ordinary</font></strike>
  <u><font color="#228822">narrow</font></u> string literal has
  type &#8220;array of <i>n</i> <code>const char</code>&#8221;,
  where <i>n</i> is the size of the string as defined below,
  <strike><font color="#ff0000">it</font></strike> <u><font color="#228822">and</font></u> has static storage duration (3.7)
  <strike><font color="#ff0000">and is initialized with the given
  characters</font></strike>.</p>

  <p>A string literal that begins with <code>u</code>, such as
  <code>u"asdf"</code>, is a <code>char16_t</code> string literal.
  A <code>char16_t</code> string literal has type &#8220;array of
  <i>n</i> <code>const char16_t</code>&#8221;, where <i>n</i> is
  the size of the string as defined below; it has static storage
  duration and is initialized with the given characters. A single
  <i>c-char</i> may produce more than one <code>char16_t</code>
  character in the form of surrogate pairs.</p>
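  <p>As an illustration only (not part of the proposed wording), a character 
  outside the Basic Multilingual Plane, here U+1F600, occupies two 
  <code>char16_t</code> code units. The array name <code>grin</code> is 
  hypothetical, and the sketch assumes an ISO C++11 or later compiler.</p>

```cpp
#include <cassert>

// U+1F600 is above U+FFFF, so UTF-16 encodes it as a surrogate pair.
constexpr char16_t grin[] = u"\U0001F600";

// Two code units for the pair plus one for the terminating u'\0'.
static_assert(sizeof(grin) / sizeof(grin[0]) == 3, "pair plus terminator");
```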

  <p>A string literal that begins with <code>U</code>, such as
  <code>U"asdf"</code>, is a <code>char32_t</code> string literal.
  A <code>char32_t</code> string literal has type &#8220;array of
  <i>n</i> <code>const char32_t</code>&#8221;, where <i>n</i> is
  the size of the string as defined below; it has static storage
  duration and is initialized with the given characters.</p>

  <p>A string literal that begins with <code>L</code>, such as
  <code>L"asdf"</code>, is a wide string literal. A wide string
  literal has type &#8220;array of <i>n</i> <code>const
  wchar_t</code>&#8221;, where <i>n</i> is the size of the string
  as defined below; it has static storage duration and is
  initialized with the given characters.</p>

  <p>Whether all string literals are distinct (that is, are stored
  in nonoverlapping objects) is implementation-defined. The effect
  of attempting to modify a string literal is undefined.</p>

  <p>In translation phase 6 (2.1), adjacent string literals are
  concatenated. If both string literals have the same prefix, the
  resulting concatenated string literal has that prefix. If one
  string literal has no prefix, it is treated as a string literal
  of the same prefix as the other operand. <font color="#228822"><ins>If a UTF-8 string literal token is adjacent to a
  wide string literal token, the program is
  ill-formed.</ins></font> Any other concatenations are
  conditionally supported with implementation-defined behavior.
  <i>[ Note:</i> This concatenation is an interpretation, not a
  conversion. <i>&#8212;end note ] [ Example:</i> Here are some
  examples of valid concatenations:</p>

  <p align="center"><i>Table unchanged</i></p>

  <p><i>&#8212;end example ]</i> Characters in concatenated strings
  are kept distinct. <i>[ Example:<br></i> <code>"\xA"
  "B"</code></p>

  <p>contains the two characters <code>&#8217;\xA&#8217;</code> and
  <code>&#8217;B&#8217;</code> after concatenation (and not the
  single hexadecimal character <code>&#8217;\xAB&#8217;</code>).
  <i>&#8212;end example ]</i></p>
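  <p>As an illustration only (not part of the proposed wording), the 
  concatenation rules can be checked at compile time. The sketch assumes an ISO 
  C++11 or later compiler; the array name <code>kept_distinct</code> is 
  hypothetical. The combination of a UTF-8 literal with a wide literal is left 
  out because the wording above makes it ill-formed.</p>

```cpp
#include <cassert>
#include <type_traits>

// An unprefixed literal is treated as having the prefix of its neighbour.
static_assert(std::is_same<decltype(u"ab" "cd"), const char16_t (&)[5]>::value,
              "u + none gives a char16_t string literal");
static_assert(std::is_same<decltype("ab" L"cd"), const wchar_t (&)[5]>::value,
              "none + L gives a wide string literal");

// Characters stay distinct: this is '\xA' followed by 'B', not '\xAB'.
constexpr char kept_distinct[] = "\xA" "B";
static_assert(sizeof(kept_distinct) == 3, "two characters plus terminator");
```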

  <p>After any necessary concatenation, in translation phase 7
  (2.1), <code>&#8217;\0&#8217;</code> is appended to every string
  literal so that programs that scan a string can find its end.</p>

  <p>Escape sequences <u><font color="#228822">in 
  non-raw string
  literals</font></u> and universal-character-names in string
  literals have the same meaning as in character literals (2.13.2),
  except that the single quote <code>&#8217;</code> is
  representable either by itself or by the escape sequence
  <code>\&#8217;</code>, and the double quote " shall be preceded
  by a <code>\</code>. In a narrow string literal, a
  universal-character-name may map to more than one char element
  due to multibyte encoding. The size of a <code>char32_t</code> or
  wide string literal is the total number of escape sequences,
  universal-character-names, and other characters, plus one for the
  terminating <code>U&#8217;\0&#8217;</code> or
  <code>L&#8217;\0&#8217;</code>. The size of a
  <code>char16_t</code> string literal is the total number of
  escape sequences, universal-character-names, and other
  characters, plus one for each character requiring a surrogate
  pair, plus one for the terminating
  <code>u&#8217;\0&#8217;</code>. [ Note: The size of a
  <code>char16_t</code> string literal is the number of code units,
  not the number of characters. &#8212;end note ] Within
  <code>char32_t</code> and <code>char16_t</code> literals, any
  universal-character-names must be within the range 0x0 to
  0x10FFFF. The size of a narrow string literal is the total number
  of escape sequences and other characters, plus at least one for
  the multibyte encoding of each universal-character-name, plus one
  for the terminating <code>&#8217;\0&#8217;</code>.</p>
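  <p>As an illustration only (not part of the proposed wording), the size rules 
  can be compared side by side for the same two characters. The sketch assumes 
  an ISO C++11 or later compiler and uses a UTF-8 string literal for the narrow 
  case so that the result does not depend on the execution character set. The 
  helper <code>literal_size</code> is hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: size of a string literal in elements, terminator included.
template <typename T, std::size_t N>
constexpr std::size_t literal_size(const T (&)[N]) { return N; }

// char32_t: one element per character, plus the terminator.
static_assert(literal_size(U"a\U0001F600") == 3, "char32_t size");

// char16_t: one extra element for the character needing a surrogate pair.
static_assert(literal_size(u"a\U0001F600") == 4, "char16_t size");

// UTF-8: 'a' takes one code unit and U+00E9 takes two, plus the terminator.
static_assert(literal_size(u8"a\u00E9") == 4, "UTF-8 size");

// Wide: U+00E9 fits in a single wchar_t, plus the terminator.
static_assert(literal_size(L"a\u00E9") == 3, "wide size");
```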
  <hr>

</body></html>
