<html>
<head>
<title>P2201R1: Mixed string literal concatenation</title>

<style type="text/css">
  ins { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
  .new { text-decoration:none; font-weight:bold; background-color:#D0FFD0 }
  del { text-decoration:line-through; background-color:#FFA0A0 }  
  strong { font-weight: inherit; color: #2020ff }
  table, td, th { border: 1px solid black; border-collapse:collapse; padding: 5px }
</style>
</head>

<body>
ISO/IEC JTC1 SC22 WG21 P2201R1<br/>
Author: Jens Maurer<br/>
Target audience: CWG<br/>
2021-04-12<br/>

<h1>P2201R1: Mixed string literal concatenation</h1>

<h2>Introduction</h2>

String concatenation involving <em>string-literal</em>s
with <em>encoding-prefix</em>es mixing L"", u8"", u"", and U"" is
currently conditionally-supported with implementation-defined behavior
(5.13.5 [lex.string] paragraph 11).
<p>
None of icc, gcc, clang, or MSVC supports such mixed
concatenations; all issue an error:
<a href="https://compiler-explorer.com/z/4NDo-4">https://compiler-explorer.com/z/4NDo-4</a>.
Test code:
<pre>
void f() {

  { auto a = L"" u""; }
  { auto a = L"" u8""; }
  { auto a = L"" U""; }

  { auto a = u8"" L""; }
  { auto a = u8"" u""; }
  { auto a = u8"" U""; }

  { auto a = u"" L""; }
  { auto a = u"" u8""; }
  { auto a = u"" U""; }

  { auto a = U"" L""; }
  { auto a = U"" u""; }
  { auto a = U"" u8""; }
}
</pre>
SDCC, the Small Device C Compiler, does support such mixed concatenations,
apparently taking the first <em>encoding-prefix</em>.
The sentiment was expressed that the feature is not actually used much,
if at all: <a href="http://open-std.org/jtc1/sc22/wg14/18105">WG14 e-mail</a>.
<p>
No meaningful use-case for such mixed concatenations is known.
<p>
This paper makes such mixed concatenations ill-formed.
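<p>
As a minimal sketch of what stays well-formed and what does not, the
compile-time checks below cover the same-prefix case, the case where one
operand has no <em>encoding-prefix</em>, and (commented out) a mixed case.
The helper <code>element_t</code> is a hypothetical name introduced for
illustration only; it is not part of the paper or the standard library.

```cpp
#include <type_traits>

// Hypothetical helper (not from the paper): the character type of a
// string literal expression, with reference, extent, and const stripped.
template <class T>
using element_t = typename std::remove_cv<
    typename std::remove_extent<
        typename std::remove_reference<T>::type>::type>::type;

// Same encoding-prefix on both operands: the result keeps that prefix.
static_assert(std::is_same<element_t<decltype(L"a" L"b")>, wchar_t>::value,
              "L + L stays a wide string literal");

// One operand has no encoding-prefix: it is treated as having the
// prefix of the other operand.
static_assert(std::is_same<element_t<decltype("a" u"b")>, char16_t>::value,
              "unprefixed + u is a UTF-16 string literal");
static_assert(std::is_same<element_t<decltype(U"a" "b")>, char32_t>::value,
              "U + unprefixed is a UTF-32 string literal");

// Conflicting encoding-prefixes: rejected by the compilers listed above,
// and ill-formed under this paper.
// auto bad = L"a" u"b";
```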

<h2>History</h2>

The history was kindly provided by Alisdair Meredith,
although all errors should be blamed on the author.
<p>
Concatenating narrow and wide string literals was made defined behavior
for C++11 by Clark Nelson’s paper synchronizing with the C99 preprocessor:
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1653.htm">N1653</a>.
<p>
The conditionally-supported implementation-defined behavior for concatenating
Unicode and wide string literals was a feature of the original proposal for
Unicode character types:
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html">N2249</a>.
<p>
The final rule, which makes concatenating a u8 literal with
a wide string literal ill-formed, comes from the original paper proposing u8 literals:
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm">N2442</a>.

<h2>Changes in R1 vs. R0</h2>

<ul>
  <li>Approved by SG16, EWG, and CWG.</li>
  <li>Added Annex C entries.</li>
</ul>


<h2>Wording changes</h2>

Change in 5.13.5 [lex.string] paragraph 11:
<blockquote>
In translation phase 6 (5.2 [lex.phases]),
adjacent <em>string-literal</em>s are concatenated. If both
<em>string-literal</em>s have the same <em>encoding-prefix</em>, the
resulting concatenated <em>string-literal</em> has
that <em>encoding-prefix</em>.  If one <em>string-literal</em> has no
<em>encoding-prefix</em>, it is treated as a <em>string-literal</em>
of the same <em>encoding-prefix</em> as the other operand. <del>If a
UTF-8 string literal token is adjacent to a wide string literal token,
the program is ill-formed.</del> Any other concatenations
are <del>conditionally-supported with implementation-defined
behavior</del> <ins>ill-formed</ins>. [Note: This concatenation is an
interpretation, not a conversion. Because the interpretation happens
in translation phase 6 (after each character from
a <em>string-literal</em> has been translated into a value from the
appropriate character set), a <em>string-literal</em>’s initial
rawness has no effect on the interpretation or well-formedness of the
concatenation.  — end note] Table 11 has some examples of valid
concatenations.
<p>
  (Table 11)
<p>
Characters in concatenated strings are kept distinct.
[Example:
<pre>
  "\xA" "B"
</pre>
contains the two characters ’\xA’ and ’B’ after concatenation (and not
the single hexadecimal character ’\xAB’). — end example]
</blockquote>
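<p>
The "characters are kept distinct" rule and the note on rawness can be
checked at compile time. The following is only a sketch;
<code>literal_length</code> is a hypothetical helper introduced here, not
something the wording defines.

```cpp
#include <cstddef>

// Hypothetical helper: number of characters in a string literal,
// excluding the terminating null character.
template <class CharT, std::size_t N>
constexpr std::size_t literal_length(const CharT (&)[N]) {
    return N - 1;
}

// "\xA" "B" is two characters, '\xA' followed by 'B' ...
static_assert(literal_length("\xA" "B") == 2, "two distinct characters");
static_assert(("\xA" "B")[0] == '\xA', "first character is \\xA");
static_assert(("\xA" "B")[1] == 'B', "second character is B");

// ... whereas "\xAB" is a single hexadecimal escape.
static_assert(literal_length("\xAB") == 1, "one character");

// Rawness does not affect concatenation: the raw literal contributes a
// backslash and an 'n', the wide literal contributes 'x'.
static_assert(literal_length(R"(\n)" L"x") == 3, "raw + wide concatenates");
```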

<strong>Insert a new subclause C.1 "C++ and ISO C++ 2020":</strong>

<blockquote>
<b>Affected subclause:</b> 5.13.5 [lex.string]
<p>
<b>Change:</b> Concatenated <em>string-literal</em>s can no longer
have conflicting <em>encoding-prefix</em>es.
<p>
<b>Rationale:</b> Removal of unimplemented conditionally-supported feature.
<p>
<b>Effect on original feature</b>:
Concatenation of <em>string-literal</em>s with different
<em>encoding-prefix</em>es is now ill-formed.  [ Example:
<pre>
  auto c = L"a" U"b";  // was conditionally-supported; now ill-formed
</pre>
-- end example ]
</blockquote>

<strong>Add to C.5.1 [diff.lex]:</strong>

<blockquote>
<b>Affected subclause:</b> 5.13.5 [lex.string]
<p>
<b>Change:</b> Concatenated <em>string-literal</em>s can no longer
have conflicting <em>encoding-prefix</em>es.
<p>
<b>Rationale:</b> Removal of non-portable feature.
<p>
<b>Effect on original feature</b>:
Concatenation of <em>string-literal</em>s with different
<em>encoding-prefix</em>es is now ill-formed.<br/>
<b>Difficulty of converting:</b> Syntactic transformation.<br/>
<b>How widely used:</b> Seldom.
</blockquote>
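<p>
The "syntactic transformation" mentioned above could look like the sketch
below: every operand is given the same <em>encoding-prefix</em>. Which
prefix to keep depends on the intent of the original code; choosing
<code>U</code> here is an assumption made for illustration, and the name
<code>fixed</code> is hypothetical.

```cpp
// Before (conditionally-supported previously, ill-formed under this paper):
//   auto c = L"a" U"b";

// After: both operands carry the same encoding-prefix. Keeping U is an
// assumption; code that relied on an implementation taking the first
// prefix would use L on both operands instead.
constexpr char32_t fixed[] = U"a" U"b";

static_assert(sizeof(fixed) / sizeof(fixed[0]) == 3,
              "two characters plus the terminating null");
```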

</body>
</html>
