<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<TITLE>
    CWG Issue 872</TITLE>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<STYLE TYPE="text/css">
  INS { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
  .INS { text-decoration:none; background-color:#D0FFD0 }
  DEL { text-decoration:line-through; background-color:#FFA0A0 }
  .DEL { text-decoration:line-through; background-color: #FFD0D0 }
  @media (prefers-color-scheme: dark) {
    HTML { background-color:#202020; color:#f0f0f0; }
    A { color:#5bc0ff; }
    A:visited { color:#c6a8ff; }
    A:hover, a:focus { color:#afd7ff; }
    INS { background-color:#033a16; color:#aff5b4; }
    .INS { background-color: #033a16; }
    DEL { background-color:#67060c; color:#ffdcd7; }
    .DEL { background-color:#67060c; }
  }
  SPAN.cmnt { font-family:Times; font-style:italic }
</STYLE>
</HEAD>
<BODY>
<P><EM>This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21
  Core Issues List revision 118b.
  See http://www.open-std.org/jtc1/sc22/wg21/ for the official
  list.</EM></P>
<P>2025-09-28</P>
<HR>
<A NAME="872"></A><H4>872.
  
Lexical issues with raw strings
</H4>
<B>Section: </B>5.13.5&#160; [<A href="https://wg21.link/lex.string">lex.string</A>]
 &#160;&#160;&#160;

 <B>Status: </B>CD2
 &#160;&#160;&#160;

 <B>Submitter: </B>Joseph Myers
 &#160;&#160;&#160;

 <B>Date: </B>16 April, 2009<BR>


<P>[Voted into the WP at the March, 2010 meeting as document N3077.]</P>

<P>The specification of raw string literals interacts poorly with
the specification of preprocessing tokens.  The grammar in
5.5 [<A href="https://wg21.link/lex.pptoken">lex.pptoken</A>] has a production reading</P>

<UL>each non-white-space character that cannot be one of the above</UL>

<P>This is echoed in the max-munch rule in paragraph 3:</P>

<BLOCKQUOTE>

If the input stream has been parsed into preprocessing tokens up to a
given character, the next preprocessing token is the longest sequence
of characters that could constitute a preprocessing token, even if
that would cause further lexical analysis to fail.

</BLOCKQUOTE>
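<P>As a minimal sketch of what max-munch means for ordinary C++ tokens
(an illustration only, not part of the issue as submitted; it assumes a
C++14 compiler, and the names <TT>MunchResult</TT> and
<TT>max_munch_demo</TT> are hypothetical), the sequence <TT>a+++b</TT>
is lexed as <TT>a ++ + b</TT> because <TT>++</TT> is the longest token
that can be formed at that point:</P>

<PRE>
```cpp
// Hypothetical illustration of max-munch on ordinary operators.
struct MunchResult { int sum; int a; int b; };

constexpr MunchResult max_munch_demo() {
    int a = 1;
    int b = 10;
    int sum = a+++b;   // lexed as a ++ + b: uses a, then increments it
    return MunchResult{sum, a, b};
}
```
</PRE>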

<P>This raises questions about the handling of raw string literals.
Consider, for instance,</P>

<PRE>
    #define R "x"
    const char* s = R"y";
</PRE>

<P>The character sequence <TT>R"y"</TT> does not satisfy the
syntactic requirements for a raw string.  Should it be diagnosed as
an ill-formed attempt at a raw string, or should it be well-formed,
interpreting <TT>R</TT> as a preprocessing token that is a macro name and
thus initializing <TT>s</TT> with a pointer to the string
<TT>"xy"</TT>?</P>
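<P>For comparison, the well-formed reading is what a compiler produces
when white space separates <TT>R</TT> from the quote, since <TT>R</TT>
is then an ordinary identifier token. A small sketch, assuming a C++14
compiler (the names <TT>spaced</TT> and <TT>same_text</TT> are
hypothetical and not part of the issue):</P>

<PRE>
```cpp
#define R "x"

// White space after R makes it a separate identifier token, so the macro
// is expanded and the adjacent literals "x" and "y" are concatenated.
constexpr const char* spaced = R "y";

// Hypothetical helper: true if two NUL-terminated strings are equal.
constexpr bool same_text(const char* lhs, const char* rhs) {
    while (*lhs == *rhs) {
        if (*lhs == '\0') return true;
        ++lhs;
        ++rhs;
    }
    return false;
}
```
</PRE>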

<P>For another example, consider:</P>

<PRE>
    #define R "]"
    const char* x = R"foo[";
</PRE>

<P>Presumably this means that the entire rest of the file must be
scanned for the characters <TT>]foo"</TT> and, if they are not
found, <TT>R</TT> is macro-expanded and <TT>x</TT> is initialized with a
pointer to the string <TT>"]foo["</TT>.  Is this the intended result?</P>
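<P>The scan described above can be sketched roughly as follows, assuming
the bracketed raw string syntax used in this issue and a C++14 compiler.
<TT>find_terminator</TT> is a hypothetical helper written for this
illustration, not taken from any implementation:</P>

<PRE>
```cpp
// Hypothetical helper: returns the index of the ']' that starts the
// terminator ]delim" within rest, or -1 if no terminator is present.
constexpr int find_terminator(const char* rest, const char* delim) {
    for (int i = 0; rest[i] != '\0'; ++i) {
        if (rest[i] != ']') continue;
        int j = 0;
        while (delim[j] != '\0') {
            if (rest[i + 1 + j] != delim[j]) break;   // delimiter mismatch
            ++j;
        }
        if (delim[j] == '\0') {                        // whole delimiter matched
            if (rest[i + 1 + j] == '"') return i;      // closing quote follows
        }
    }
    return -1;
}
```
</PRE>

<P>With the text that follows <TT>R"foo</TT> in the example, the helper
finds no terminator, which is the case where the question of falling back
to macro expansion arises.</P>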

<P>Finally, does the requirement in 5.13.5 [<A href="https://wg21.link/lex.string">lex.string</A>]
that</P>

<BLOCKQUOTE>

A <I>d-char-sequence</I> shall consist of at most 16 characters.

</BLOCKQUOTE>

<P>mean that</P>

<PRE>
    #define R "x"
    const char* y = R"12345678901234567[y]12345678901234567";
</PRE>

<P>is ill-formed, or a valid initialization of <TT>y</TT> with a
pointer to the string <TT>"x12345678901234567[y]12345678901234567"</TT>?</P>
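<P>For reference, a delimiter of exactly 16 characters is within the
limit. The sketch below assumes the parenthesized form
<TT>R"delim(...)delim"</TT> adopted in C++11 as published, rather than
the bracketed form shown in this issue; the name <TT>at_limit</TT> is
hypothetical:</P>

<PRE>
```cpp
// Delimiter of exactly 16 characters, using the C++11 parenthesized syntax.
// The literal's contents are the single character y.
constexpr char at_limit[] = R"1234567890123456(y)1234567890123456";
```
</PRE>

<P>Adding a seventeenth character to the delimiter makes the literal
ill-formed under the same rule.</P>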

<P><B>Additional note, June, 2009:</B></P>



<P>The translation of characters that are not in the basic source
character set into universal-character-names in translation phase 1
raises an additional problem: each such character will occupy at
least six of the 16 <I>d-char</I>s that are permitted.  Thus, for
example, <TT>R"@@@[]@@@"</TT> is ill-formed because <TT>@@@</TT>
becomes <TT>\u0040\u0040\u0040</TT>, which is 18 characters.</P>

<P>One possibility for addressing this might be to disallow the
<TT>\</TT> character completely as a <I>d-char</I>, which would
have the effect of restricting <I>d-char</I>s to the basic source
character set.</P>
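<P>The character count in the note above can be checked with a short
sketch. It only spells out the translated delimiter as ordinary text and
measures it; the names <TT>spelled</TT>, <TT>spelled_length</TT>, and
<TT>fits_in_limit</TT> are hypothetical:</P>

<PRE>
```cpp
// The delimiter @@@ as it would be spelled after the phase 1 translation
// described in the note: three universal-character-names of six
// characters each. The backslashes are escaped so that the array holds
// the spelling itself.
constexpr char spelled[] = "\\u0040\\u0040\\u0040";

// Length of the spelling, excluding the terminating NUL.
constexpr int spelled_length = sizeof(spelled) - 1;

// The limit on a d-char-sequence is 16 characters.
constexpr bool fits_in_limit = spelled_length <= 16;
```
</PRE>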



<P><B>Proposed resolution (October, 2009):</B></P>

<OL>
<LI><P>Change the grammar in 5.13.5 [<A href="https://wg21.link/lex.string">lex.string</A>] as
follows:</P></LI>

<UL>
<I>d-char:</I>
<UL>any member of the basic source character set except:
<UL>space, the left square bracket <TT>[</TT>,
the right square bracket <TT>]</TT>, <INS>the backslash <TT>\</TT>,</INS>
and the control characters representing horizontal tab,
vertical tab, form feed, and newline.</UL>
</UL>
</UL>

<LI><P>Change 5.13.5 [<A href="https://wg21.link/lex.string#2">lex.string</A>] paragraph 2 as follows:</P></LI>

<BLOCKQUOTE>

<P>A string literal that has an <TT>R</TT> in the prefix is a
<I>raw string literal</I>. The <I>d-char-sequence</I> serves as a
delimiter.  The terminating <I>d-char-sequence</I> of a raw-string is the
same sequence of characters as the initial <I>d-char-sequence</I>. A
<I>d-char-sequence</I> shall consist of at most 16 characters.
<INS>If the input stream contains a sequence of characters that could
be the prefix and initial double quote of a raw string literal, such
as <TT>R"</TT>, those characters are considered to begin a raw string
literal even if that literal is not well-formed. [<I>Example:</I></INS>
</P>

<PRE>
<INS>  #define R "x"
  const char* s = R"y"; //<SPAN CLASS="cmnt"> ill-formed raw string, not </SPAN>"x" "y"
</INS>
</PRE>

<P><INS>&#8212;<I>end example</I>]</INS></P>

</BLOCKQUOTE>

</OL>
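<P>As a rough sketch of how the amended <I>d-char</I> grammar and the
16-character limit could be checked, taking the list of exclusions in the
proposed wording literally and assuming a C++14 compiler
(<TT>is_d_char</TT> and <TT>is_valid_delimiter</TT> are hypothetical
helpers, not part of the resolution):</P>

<PRE>
```cpp
// Hypothetical helper: is c a d-char under the amended grammar?
constexpr bool is_d_char(char c) {
    // Graphic members of the basic source character set, with '[', ']'
    // and the backslash left out. Space, tab, vertical tab, form feed
    // and newline are not listed either, so they are rejected as well.
    const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}#()<>%:;.?*+-/^&|~!=,\"'";
    for (int i = 0; allowed[i] != '\0'; ++i) {
        if (allowed[i] == c) return true;
    }
    return false;
}

// Hypothetical helper: every character is a d-char and there are at most 16.
constexpr bool is_valid_delimiter(const char* d) {
    int length = 0;
    for (; d[length] != '\0'; ++length) {
        if (!is_d_char(d[length])) return false;
    }
    return length <= 16;
}
```
</PRE>

<P>Under this sketch a delimiter containing a backslash is rejected, so
a universal-character-name can no longer appear in a delimiter.</P>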

<BR><BR>
</BODY>
</HTML>
