<html>
<style>
ins {background-color:#A0FFA0}
del {background-color:#FFA0A0}
</style>
<title>Alternative approach to Raw String issues</title>
<body>
Jason Merrill
<br>2010-03-12
<br>N3077=10-0067
<h1>Alternative approach to Raw String issues</h1>

<h2>Introduction</h2>

<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2009/n2990.pdf">N2990</a>
deals with the problem of trigraph replacement in raw strings (part of core
issue 789) by moving trigraphs out of phase 1, making them alternate
spellings instead.  This solves the problem for trigraphs, but does not
deal with the related issues for UCNs (extended characters being replaced
with \uXXXX in phase 1) and line splicing in phase 2.

<p>This paper proposes an alternative approach to these issues:
simply undo the transformations done in phases 1 and 2 inside a raw string.
Apparently many compilers already keep track of those transformations
internally.</p>
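<p>As a rough illustration of how an implementation might record a phase 2
transformation and undo it later, the following C++ sketch notes where
backslash-new-line pairs were removed and reinserts them for a given range.
The names <tt>Spliced</tt>, <tt>splice_lines</tt>, <tt>revert_splices</tt>,
and <tt>check_round_trip</tt> are hypothetical and are not taken from any
compiler; real implementations may work quite differently.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Result of a toy phase 2: the spliced text plus the offsets (in the
// spliced text) at which a backslash-new-line pair was removed.
struct Spliced {
    std::string text;
    std::vector<std::size_t> removed_at;
};

// Toy phase 2: delete every backslash immediately followed by a new-line,
// remembering where each deletion happened.
Spliced splice_lines(const std::string& src) {
    Spliced out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            out.removed_at.push_back(out.text.size());
            ++i;  // skip the new-line as well
        } else {
            out.text += src[i];
        }
    }
    return out;
}

// Rebuild the original spelling of the spliced text in [begin, end),
// reinserting every splice recorded at an offset in [begin, end].
std::string revert_splices(const Spliced& s, std::size_t begin, std::size_t end) {
    std::string out;
    std::size_t next = 0;  // index into s.removed_at
    while (next < s.removed_at.size() && s.removed_at[next] < begin) ++next;
    for (std::size_t pos = begin; pos <= end; ++pos) {
        while (next < s.removed_at.size() && s.removed_at[next] == pos) {
            out += "\\\n";
            ++next;
        }
        if (pos < end) out += s.text[pos];
    }
    return out;
}

// Round trip on a raw string literal whose content holds one splice.
void check_round_trip() {
    const std::string src = "R\"(a\\\nb)\"";
    const Spliced s = splice_lines(src);
    assert(s.text == "R\"(ab)\"");
    // Offsets 3 and 5 delimit the text between the parentheses.
    assert(revert_splices(s, 3, 5) == "a\\\nb");
}
```

<p>In this sketch the caller supplies the range of the
<i>r-char-sequence</i> in the spliced text; splices outside that range are
left removed, which matches the intent that only raw string contents are
restored.</p>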

<p>This paper incorporates the proposed wording for issue 872, and
  addresses UK comment 11 (core issue 789).</p>

<h2>Proposed Wording</h2>

<p>2.2 [lex.phases]:</p>

<blockquote>
3. The source file is decomposed into preprocessing tokens (2.5) and
sequences of white-space characters (including comments). A source file
shall not end in a partial preprocessing token or in a partial
comment.<sup><small>12</small></sup> Each comment is replaced by one space
character. New-line characters are retained. Whether each nonempty sequence
of white-space characters other than new-line is retained or replaced by
one space character is unspecified. The process of dividing a source file's
characters into preprocessing tokens is context-dependent. [ Example: see
the handling of <tt>&lt;</tt> within a <tt>#include</tt> preprocessing
directive.  -- end example ] <ins>Within the <i>r-char-sequence</i> of a
raw string literal, any transformations performed in phases 1 and 2
(trigraphs, universal-character-names, and line splicing) are reverted.</ins>


</blockquote>
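<p>The effect of the added sentence can be sketched with a few raw string
literals. This is an illustration only, assuming a compiler that implements
the reverting rule; the variable names and <tt>check_reverted</tt> are
hypothetical.</p>

```cpp
#include <cassert>
#include <cstring>

// Line splice: the backslash and the new-line after it are both kept.
const char* spliced = R"(a\
b)";

// Universal-character-name spelling: kept as six ordinary characters.
const char* ucn = R"(\u00e9)";

// Trigraph spelling: kept as three ordinary characters.
const char* tri = R"(??=)";

void check_reverted() {
    assert(std::strcmp(spliced, "a\\\nb") == 0);
    assert(std::strlen(ucn) == 6 && ucn[0] == '\\' && ucn[1] == 'u');
    // Compared character by character so that this check itself contains
    // no trigraph spelling inside an ordinary string literal.
    assert(std::strlen(tri) == 3);
    assert(tri[0] == '?' && tri[1] == '?' && tri[2] == '=');
}
```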

<p>2.5 [lex.pptoken] paragraph 3:</p>
<blockquote>
  If the input stream has been parsed into preprocessing tokens up to a
  given character<ins>:</ins>
<ul>
  <li><ins>if the next character begins a sequence of characters that could be
  the prefix and initial double quote of a raw string literal, such as <tt>R"</tt>,
  the next preprocessing token shall be a raw string literal;</ins></li>
  <li><ins>otherwise</ins>, the next preprocessing token is the longest sequence of
  characters that could constitute a preprocessing token, even if that
  would cause further lexical analysis to fail.</li>
</ul>
<ins>[ Example: </ins>
<pre>
  <ins>#define R "x"</ins>
  <ins>const char* s = R"y"; // ill-formed raw string, not "x" "y"</ins>
</pre>
<ins>--end example ]</ins>
</blockquote>
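<p>A well-formed counterpart to the example in the wording above may help
show the distinction. In this sketch, which assumes a compiler implementing
the rule, <tt>R</tt> followed by white space is an ordinary identifier and is
macro-expanded, while <tt>R</tt> immediately followed by a double quote
begins a raw string literal. The names <tt>expanded</tt>, <tt>raw</tt>, and
<tt>check_tokenization</tt> are hypothetical.</p>

```cpp
#include <cassert>
#include <cstring>

#define R "x"

// R followed by a space is an identifier token, so the macro is expanded
// and the two adjacent string literals are concatenated.
const char* expanded = R "y";

// R immediately followed by a double quote begins a raw string literal,
// so the macro is not expanded here.
const char* raw = R"(y)";

#undef R  // keep the single-letter macro from leaking into later code

void check_tokenization() {
    assert(std::strcmp(expanded, "xy") == 0);
    assert(std::strcmp(raw, "y") == 0);
}
```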

<p>2.14.5 [lex.string]:</p>

<blockquote>
<pre>
raw-string:
      " d-char-sequence<sub>opt</sub> <del>[</del><ins>(</ins> r-char-sequence<sub>opt</sub> <del>]</del><ins>)</ins> d-char-sequence<sub>opt</sub> "

r-char-sequence:
      r-char
      r-char-sequence r-char

r-char:
  any member of the source character set, except
    <del>(1), a backslash \ followed by a u or U, or</del>
    <del>(2),</del> a right <del>square bracket ]</del><ins>parenthesis )</ins> followed by the initial d-char-sequence
    (which may be empty) followed by a double quote ".
  <del>universal-character-name</del>

d-char-sequence:
      d-char
      d-char-sequence d-char

d-char:
      any member of the basic source character set except:
            space, the left <del>square bracket [</del><ins>parenthesis (</ins>, the right <del>square bracket ]</del><ins>parenthesis )</ins>,
            <ins>the backslash \,</ins>
            and the control characters representing horizontal tab,
            vertical tab, form feed, and newline.
</pre>

...

<p>  A string literal is a sequence of characters (as defined in 2.14.3) surrounded by double quotes, optionally
  prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"<del>[</del><ins>(</ins>...<del>]</del><ins>)</ins>", u8"...", u8R"**<del>[</del><ins>(</ins>...<del>]</del><ins>)</ins>**", u"...",
  uR"*~<del>[</del><ins>(</ins>...<del>]</del><ins>)</ins>*~", U"...", UR"zzz<del>[</del><ins>(</ins>...<del>]</del><ins>)</ins>zzz", L"...", or LR"<del>[</del><ins>(</ins>...<del>]</del><ins>)</ins>", respectively.</p>


<p>A string literal that has an <tt>R</tt> in the prefix is a <i>raw string
  literal</i>. The <i>d-char-sequence</i> serves as a delimiter.

  The terminating <i>d-char-sequence</i> of a raw-string is the same
  sequence of characters as the initial <i>d-char-sequence</i>. A <i>d-char-sequence</i>
  shall consist of at most 16 characters. <ins>[ Footnote: Use of characters with trigraph equivalents in a <i>d-char-sequence</i> may produce unintended results.  --end footnote ]</ins></p>

<p>[ Note: The characters '<del>[</del><ins>(</ins>' and '<del>]</del><ins>)</ins>' are permitted in a raw-string. 
  Thus, R"delimiter<del>[[a-z]]</del><ins>((a|b))</ins>delimiter" is equivalent to "<del>[a-z]</del><ins>(a|b)</ins>". -- end note ]</p>

<p>[ Note: A source-file new-line in a raw string literal results in a new-line
  in the resulting execution string-literal<del>, unless preceded by a backslash</del>. 
  Assuming no whitespace at the beginning of lines in the following
  example, the assert will succeed:
<pre>
     const char *p = R"<del>[</del><ins>(</ins>a\
     b
     c<ins>)</ins><del>]</del>";
     assert(std::strcmp(p, "a<ins>\\\n</ins>b\nc") == 0);
</pre>
   -- end note ]
</p>

<p>...</p>

<p>Escape sequences <ins>and universal-character-names</ins> in non-raw
  string literals <del>and universal-character-names in string
  literals</del> have the same meaning as in character literals ....</p>



</blockquote>
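<p>The delimiter rules in the revised grammar can be sketched as follows.
This is an illustration under the proposed parenthesis syntax, not part of
the wording; the variable names and <tt>check_delimiters</tt> are
hypothetical.</p>

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// Parentheses may appear in the content; only a right parenthesis followed
// by the delimiter and a double quote ends the literal.
const char* delimited = R"delimiter((a|b))delimiter";

// A non-empty delimiter lets the content contain a right parenthesis
// followed directly by a double quote.
const char* quoted = R"x(a)"b)x";

// Escape sequences are not interpreted in a raw string literal, with or
// without an encoding prefix.
const wchar_t* wide = LR"(a\n)";

void check_delimiters() {
    assert(std::strcmp(delimited, "(a|b)") == 0);
    assert(std::strcmp(quoted, "a)\"b") == 0);
    assert(std::wcslen(wide) == 3);  // a, backslash, n
}
```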

</body>
</html>
