<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Raw String Literals</title>
</head>

<body>

<p>Doc. no.&nbsp;&nbsp; WG21/N2053=06-0123<br>
Date:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
2006-09-06<br>
Project:&nbsp;&nbsp;&nbsp;&nbsp; Programming Language C++<br>
Reply to:&nbsp;&nbsp; Beman Dawes &lt;<a href="mailto:bdawes@acm.org">bdawes@acm.org</a>&gt;</p>

<h1>Raw String Literals</h1>

<p><a href="#Introduction">Introduction</a><br>
<a href="#Motivating">Motivating examples</a><br>
&nbsp;&nbsp;&nbsp; <a href="#Regular">Regular Expression motivating example</a><br>
&nbsp;&nbsp;&nbsp; <a href="#Markup">Markup motivating example</a><br>
<a href="#Implementation">Implementation experience</a><br>
<a href="#Raw-character-literals">Raw character literals?</a><br>
<a href="#Acknowledgements">Acknowledgements</a><br>
<a href="#Proposed">Proposed wording</a></p>

<h2><a name="Introduction">Introduction</a></h2>

<p>In recent years it has become more common for C++ programs to work with regular 
expressions and with markup languages such as HTML and XML.</p>

<p>Regular expressions use the same backslash escape convention as C++ does in 
string literals, so every backslash intended for the regular expression must itself 
be escaped. The resulting plethora of backslashes is very difficult to write 
correctly and impenetrable to read. See <a href="#Regular">Regular expression 
motivating example</a>.</p>

<p>Markup languages such as XML and HTML use a lot of quotation marks and 
newlines. The resulting escape sequences in string literals are irritating, 
cumbersome, and error prone. See <a href="#Markup">Markup motivating example</a>.</p>

<p>Other programming languages, such as Perl, Python, 
and Lua, have addressed these issues by providing raw string literals in 
addition to regular string literals. A <b><i>raw string literal</i></b> is 
simply a string literal that does not recognize C++ escape sequences. Raw string 
literals are well accepted and used regularly (pun intended!) in languages that 
have them.</p>

<p>This document proposes adding raw string literals to C++0x.</p>

<p>The proposal is a pure extension. It will have no impact on any existing code.</p>

<p>The proposal has some minor interaction with proposal N2018 to add additional 
character types. See <a class="moz-txt-link-freetext" href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html">
www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html</a>. If N2018 is 
accepted, four lines must be added to the grammar beyond the two grammar lines 
that N2018 itself proposes.</p>

<h2><a name="Motivating">Motivating</a> examples</h2>

<h3><a name="Regular">Regular</a> Expression motivating example</h3>

<p>Here is an example of the concatenated string literals in an actual C++ 
program (by John Maddock):</p>
<blockquote>
  <pre>&quot;(^[[:blank:]]*#(?:[^\\\\\\n]|\\\\[^\\n[:punct:][:word:]]*[\\n[:punct:][:word:]])*)|&quot;
&quot;(//[^\\n]*|/\\*.*?\\*/)|&quot;
&quot;\\&lt;([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)&quot;
&quot;?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\&gt;|&quot;
&quot;('(?:[^\\\\']|\\\\.)*'|\&quot;(?:[^\\\\\&quot;]|\\\\.)*\&quot;)|&quot;
&quot;\\&lt;(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import&quot;
&quot;|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall&quot;
&quot;|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool&quot;
&quot;|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete&quot;
&quot;|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto&quot;
&quot;|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected&quot;
&quot;|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast&quot;
&quot;|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned&quot;
&quot;|using|virtual|void|volatile|wchar_t|while)\\&gt;&quot;</pre>
</blockquote>
  <p>Note in particular the line that reads:</p>
  <blockquote>
    <pre>&quot;('(?:[^\\\\']|\\\\.)*'|\&quot;(?:[^<span style="background-color: #FFFF00">\\\\\</span>&quot;]|\\\\.)*\&quot;)|&quot;</pre>
</blockquote>
<p>Are the highlighted five backslashes correct or not? Even experts are 
easily confused. Here is the equivalent line as a raw string:</p>
<blockquote>
  <pre>('(?:[^\\']|\\.)*'|&quot;(?:[^<span style="background-color: #FFFF00">\\</span>&quot;]|\\.)*&quot;)|\</pre>
</blockquote>
<p>Note that the five-backslash sequence has been reduced to a more manageable 
two-backslash sequence. And, yes, the original five-backslash sequence was both 
correct and necessary in C++03.</p>
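<p>The claim about the five backslashes can be checked mechanically. The following 
is a minimal C++03 sketch, not taken from the original program; the helper names 
<code>escaped_fragment</code> and <code>expected_fragment</code> are hypothetical and 
exist only for illustration. It compares the escaped spelling of the highlighted 
character class with the same text spelled one character at a time.</p>

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: the highlighted character class as it must be
// spelled in a C++03 string literal, five backslashes included.
const char* escaped_fragment()
{
    return "[^\\\\\"]";
}

// The same fragment spelled one character at a time, so that every
// backslash the regular expression engine will see is explicit:
// an opening bracket, a caret, two backslashes, a quote, a closing bracket.
const char* expected_fragment()
{
    static const char text[] = { '[', '^', '\\', '\\', '"', ']', '\0' };
    return text;
}
```

<p>The first four backslashes in the escaped spelling denote the two backslashes 
the regular expression needs, and the fifth escapes the quotation mark so that it 
does not end the literal.</p>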
<p>Here is the complete example using the raw string proposal:</p>
<blockquote>
  <pre>R&quot;&quot;(^[[:blank:]]*#(?:[^\\\n]|\\[^\n[:punct:][:word:]]*[\n[:punct:][:word:]])*)|\
(//[^\n]*|/\*.*?\*/)|\
\&lt;([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)\
?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\&gt;|\
('(?:[^\\']|\\.)*'|&quot;(?:[^\\&quot;]|\\.)*&quot;)|\
\&lt;(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import\
|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall\
|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool\
|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete\
|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto\
|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected\
|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast\
|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned\
|using|virtual|void|volatile|wchar_t|while)\&gt;&quot;&quot;</pre>
</blockquote>
<h3><a name="Markup">Markup</a> motivating example</h3>

<p>Here is an example of the concatenated string literals in an actual C++ 
program (again by John Maddock):</p>
<blockquote>
  <pre>&quot;&lt;HTML&gt;\n&quot;
&quot;&lt;HEAD&gt;\n&quot;
&quot;&lt;TITLE&gt;Auto-generated html formated source&lt;/TITLE&gt;\n&quot;
&quot;&lt;META HTTP-EQUIV=\&quot;Content-Type\&quot; CONTENT=\&quot;text/html; charset=windows-1252\&quot;&gt;\n&quot;
&quot;&lt;/HEAD&gt;\n&quot;
&quot;&lt;BODY LINK=\&quot;#0000ff\&quot; VLINK=\&quot;#800080\&quot; BGCOLOR=\&quot;#ffffff\&quot;&gt;\n&quot;
&quot;&lt;P&gt; &lt;/P&gt;\n&quot;
&quot;&lt;PRE&gt;\n&quot;</pre>
</blockquote>

  <p>Here is the complete example using the raw string proposal:</p>
<blockquote>
  <pre>R&quot;$\
&lt;HTML&gt;
&lt;HEAD&gt;
&lt;TITLE&gt;Auto-generated html formated source&lt;/TITLE&gt;
&lt;META HTTP-EQUIV=&quot;Content-Type&quot; CONTENT=&quot;text/html; charset=windows-1252&quot;&gt;
&lt;/HEAD&gt;
&lt;BODY LINK=&quot;#0000ff&quot; VLINK=&quot;#800080&quot; BGCOLOR=&quot;#ffffff&quot;&gt;
&lt;P&gt; &lt;/P&gt;
&lt;PRE&gt;
$&quot;</pre>
</blockquote>

  <p>There are several reasons the raw string version is preferred:</p>
<ul>
  <li>It is easier to write, whether by hand or by cut-and-paste from an actual 
  HTML file.</li>
  <li>It is easier to read, although perhaps not as markedly easier as with the 
  regular expression example.</li>
  <li>Code that does markup language generation often does a lot of it, so the 
  multiplier effect is large. In other words, even moderate gains in writeability 
  and readability in a single example become important when multiplied by many 
  similar uses in a larger program.</li>
</ul>

<h2><a name="Implementation">Implementation</a> experience</h2>

<p>None yet.</p>

<h2><a name="Raw-character-literals">Raw character literals</a>?</h2>

<p>As a deliberate design choice, the proposal does not include raw character 
(as opposed to string) literals because there is no apparent need; escape 
sequences do not pose the same practical problems in character literals that 
they do in string-literals.</p>

<p>The arguments in favor of raw character literals are symmetry and 
error-reduction. Knowing that raw string-literals are allowed, programmers are 
likely to assume raw character-literals are also available. Indeed, a committee 
member inadvertently made that assumption when reading a draft of this paper. 
Although the resulting compiler error is easy to fix, there is the argument that 
it is better to eliminate the possibility of the error by providing raw 
character-literals in the first place.</p>

<p>I will be happy to provide proposed wording if the committee desires to add 
raw character-literals.</p>

<h2><a name="Acknowledgements">Acknowledgements</a></h2>
  <p>This proposal was initiated in response to a posting on the LWG reflector 
  from Thomas Witt, with comments from several other committee members. John 
  Maddock provided insights about the string-literal needs of regular 
  expressions. Robert Klarer provided examples and clarifications.</p>

<h2><a name="Proposed">Proposed</a> wording</h2>

<p>Added text is shown in <u>
<font color="#228822">green and underlined</font></u>. Deleted text is shown 
in <strike><font color="#FF0000">red with strikethrough</font></strike>. 
Commentary is shown in <span style="background-color: #C0C0C0">gray shading</span> 
and is not part of the proposed wording.</p>

<p>Change 2.1 [lex.phases], paragraph 5:</p>
<blockquote>
<p>Each source character set member<strike><font color="#FF0000">, escape 
sequence, or universal-character-name</font></strike> in character literals and 
string<br>
literals<u><font color="#228822">, and escape sequence or universal-character-name 
in character literals and regular string literals,</font></u> is converted 
to the corresponding member of the execution character set (2.13.2, 2.13.4); if 
there is no<br>
corresponding member, it is converted to an implementation-defined member other 
than the null (wide) character.<sup>17)</sup></p>
</blockquote>
<p>Change 2.13.4 [lex.string] :</p>
<blockquote>
  <p align="left"><i>string-literal:</i><br>
&nbsp;&nbsp;&nbsp; &quot;<i>s-char-sequence</i><sub>opt</sub>&quot;<br>
&nbsp;&nbsp;&nbsp; L&quot;<i>s-char-sequence</i><sub>opt</sub>&quot;<br>
&nbsp;&nbsp;&nbsp; <u><font color="#228822"> <span style="background-color: #FFFFFF">R<i>&quot;d-char r-char-sequence</i></span><sub><span style="background-color: #FFFFFF">opt</span></sub><span style="background-color: #FFFFFF"><i> d-char</i>&quot;</span></font></u><br>
&nbsp;&nbsp;&nbsp; <u><font color="#228822">LR<i>&quot;</i><span style="background-color: #FFFFFF"><i>d-char r-char-sequence</i></span><sub><span style="background-color: #FFFFFF">opt</span></sub><span style="background-color: #FFFFFF"><i> d-char</i></span>&quot;</font></u><br>
&nbsp;&nbsp;&nbsp; <u><font color="#228822">RL<i>&quot;</i><span style="background-color: #FFFFFF"><i>d-char r-char-sequence</i></span><sub><span style="background-color: #FFFFFF">opt</span></sub><span style="background-color: #FFFFFF"><i> d-char</i></span>&quot;</font></u><br>
&nbsp;&nbsp;&nbsp; <font color="#228822"> <u>uR<i>&quot;</i><span style="background-color: #FFFFFF"><i>d-char r-char-sequence</i></span><sub><span style="background-color: #FFFFFF">opt</span></sub><span style="background-color: #FFFFFF"><i> d-char</i></span>&quot;</u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </font>
  <span style="background-color: #C0C0C0">Applies only if N2018 is accepted</span><br>
&nbsp;&nbsp;&nbsp; <font color="#228822"> <u>Ru<i>&quot;</i><span style="background-color: #FFFFFF"><i>d-char r-char-sequence</i></span><sub><span style="background-color: #FFFFFF">opt</span></sub><span style="background-color: #FFFFFF"><i> d-char</i></span>&quot;</u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </font>
  <span style="background-color: #C0C0C0">Applies only if N2018 is accepted</span><br>
&nbsp;&nbsp;&nbsp; <font color="#228822"><u>UR<i>&quot;</i><span style="background-color: #FFFFFF"><i>d-char r-char-sequence</i></span><sub><span style="background-color: #FFFFFF">opt</span></sub><span style="background-color: #FFFFFF"><i> d-char</i></span>&quot;</u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </font>
  <span style="background-color: #C0C0C0">Applies only if N2018 is accepted</span><br>
&nbsp;&nbsp;&nbsp; <font color="#228822"> <u>RU<i>&quot;</i><span style="background-color: #FFFFFF"><i>d-char r-char-sequence</i></span><sub><span style="background-color: #FFFFFF">opt</span></sub><span style="background-color: #FFFFFF"><i> d-char</i></span>&quot;</u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </font>
  <span style="background-color: #C0C0C0">Applies only if N2018 is accepted</span><br>
  <br>
  <i><font face="Times New Roman"><u><font color="#228822">r-char-sequence:</font></u><br>
&nbsp;&nbsp;&nbsp; <u><font color="#228822">r-char</font></u><br>
&nbsp;&nbsp;&nbsp; <u><font color="#228822">r-char-sequence r-char</font></u></font></i></p>
  <p><font face="Times New Roman"><i><u><font color="#228822">r-char:</font></u><br>
&nbsp;&nbsp;&nbsp; </i><u><font color="#228822">any member of the source character set, except the 
  initial <i>d-char</i> when followed by &quot;.</font></u></font></p>
  <p><font face="Times New Roman"><u><font color="#228822"><i>d-char:</i></font></u><br>
&nbsp;&nbsp;&nbsp; <u><font color="#228822">any member of the source character set for which <code>
  std::ispunct</code> is true;</font></u><br>
&nbsp;&nbsp;&nbsp; 
  <u><font color="#228822">the terminating <i>d-char</i> is the same character as the initial <i>d-char</i>.</font></u></font></p>
</blockquote>
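<p><span style="background-color: #C0C0C0">The grammar above can be read as a 
simple scanning rule. The following is a minimal sketch under stated assumptions, 
not a definitive implementation: the function name <code>raw_string_body</code> is 
hypothetical, the input is assumed to be a complete token beginning with 
<code>R</code> (the <code>L</code> prefixes are ignored), and 
<code>std::ispunct</code> is assumed to be evaluated in the &quot;C&quot; locale.</span></p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the proposed wording: extracts the
// r-char-sequence from a token of the proposed form  R"d...d"  where d is
// the d-char. Returns true and stores the body on success.
bool raw_string_body(const std::string& token, std::string& body)
{
    // The shortest valid token is R"dd" with an empty r-char-sequence.
    if (token.size() < 5 || token[0] != 'R' || token[1] != '"')
        return false;

    // The initial d-char must be a punctuation character.
    const char d = token[2];
    if (!std::ispunct(static_cast<unsigned char>(d)))
        return false;

    // The literal ends at the first d-char immediately followed by a
    // double quote; everything before that is taken verbatim.
    for (std::size_t i = 3; i + 1 < token.size(); ++i) {
        if (token[i] == d && token[i + 1] == '"') {
            if (i + 2 != token.size())
                return false;               // characters after the terminator
            body = token.substr(3, i - 3);
            return true;
        }
    }
    return false;                           // no terminating d-char found
}
```

<p><span style="background-color: #C0C0C0">Because the scan stops at the first 
<i>d-char</i> followed by a double quote, the author of a literal chooses a 
<i>d-char</i> for which that pair does not occur in the body.</span></p>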
  <p>Change 2.13.4 [lex.string] paragraph 1:</p>
<blockquote>
  <p><u><font color="#228822">A string literal is a regular string literal or a 
  raw string literal</font></u><font color="#228822">. <u>A regular string 
  literal does not have an R prefix. A raw string literal has an R prefix, as in
  <code>R&quot;&quot;...&quot;&quot;</code></u></font><u><font color="#228822">,&nbsp; </font></u>
  <font color="#228822"><u><code>RL&quot;&quot;...&quot;&quot;</code> or <code>LR&quot;&quot;...&quot;&quot;</code>.</u></font> 
  A string literal is <strike><font color="#FF0000">a sequence of characters (as 
  defined in 2.13.2) surrounded by double quotes,</font></strike> optionally <u>
  <font color="#228822">prefixed</font></u> <strike><font color="#FF0000">
  beginning</font></strike> with the letter L, as in <code>L&quot;...&quot;<u><font color="#228822">,</font></u></code><u><font color="#228822">
  <code>RL&quot;&quot;...&quot;&quot;</code> or <code>LR&quot;&quot;...&quot;&quot;</code></font></u>. A string literal that does not 
  <strike><font color="#FF0000">begin with</font></strike> <u>
  <font color="#228822">have an</font></u> L <u><font color="#228822">prefix</font></u> is an ordinary string 
  literal, also referred to as a narrow string literal. An ordinary string 
  literal has type array of n const char and <i>static</i> storage duration 
  (3.7), where n is the size of the string as defined below, and is initialized 
  with the given characters. A string literal that <strike>
  <font color="#FF0000">begins with</font></strike> <u><font color="#228822">has 
  an</font></u> L prefix, such as <code>L&quot;asdf&quot;</code><font color="#228822"><u> or
  <code>RL&quot;/\bgd/&quot;</code></u></font>, 
  is a wide string literal. A wide string literal has type array of n const 
  wchar_t and has static storage duration, where n is the size of the string as 
  defined below, and is initialized with the given characters.</p>
  <p><u><font color="#228822"><i>[Example: </i>Whether or not a source-file 
  new-line in a raw string-literal 
  results in a newline in the resulting execution <i>string-literal</i> is 
  determined by the second phase of translation (2.1) rules for a trailing 
  backslash:</font></u></p>
  <p><font color="#228822">
  <code>&nbsp;&nbsp; <u>const char * p1 = R&quot;&quot;abc</u><br>
&nbsp;&nbsp; <u>def&quot;&quot;;</u><br>
&nbsp;&nbsp; <u>assert(strcmp(p1, &quot;abc\ndef&quot;) == 0);&nbsp;// assert 
  will succeed</u><br>
  <br>
  &nbsp;&nbsp; <u>const char * p2 = R&quot;&quot;abc\</u><br>
&nbsp;&nbsp; <u>def&quot;&quot;;</u><br>
&nbsp;&nbsp; <u>assert(strcmp(p2, &quot;abcdef&quot;) == 0);&nbsp;&nbsp;// assert 
  will succeed<br>
  </u><br>
  </code></font><u><font color="#228822">&nbsp;<i>-- end example]</i></font></u></p>
</blockquote>
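<p><span style="background-color: #C0C0C0">The trailing-backslash case relies on 
existing phase 2 behavior, which can already be observed with ordinary string 
literals in C++03. The following is a small sketch of that behavior; the function 
name <code>spliced_literal</code> is hypothetical.</span></p>

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: an ordinary literal split across two source lines.
// Phase 2 of translation deletes the backslash and the new-line that
// immediately follows it, so the two physical lines form one literal.
const char* spliced_literal()
{
    return "abc\
def";
}
```

<p><span style="background-color: #C0C0C0">The continuation line starts in the 
first column because any leading white space on it would become part of the 
literal.</span></p>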

  <p>To 2.13.4 [lex.string] paragraph 4 add:</p>
<blockquote>
  <p><font color="#228822"><u><i>[Example:</i></u><br>
  <br>
  <code>&nbsp;&nbsp; <u>const char * p1 = R&quot;$A\bC$&quot; &quot;def&quot; R&quot;!GHI!&quot;;</u><br>
&nbsp;&nbsp; <u>const char * p2 = &quot;A\\bCdefGHI&quot;;</u><br>
&nbsp;&nbsp; <u>assert(strcmp(p1, p2) == 0);&nbsp;&nbsp;&nbsp;&nbsp; // assert 
  will succeed</u><br>
  </code><br>
  <u><i>-- end example]</i></u></font></p>
</blockquote>

<hr>
<p> Beman Dawes 2006</p>

</body>

</html>
