<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<TITLE>
    CWG Issue 1802</TITLE>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<STYLE TYPE="text/css">
  INS { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
  .INS { text-decoration:none; background-color:#D0FFD0 }
  DEL { text-decoration:line-through; background-color:#FFA0A0 }
  .DEL { text-decoration:line-through; background-color: #FFD0D0 }
  @media (prefers-color-scheme: dark) {
    HTML { background-color:#202020; color:#f0f0f0; }
    A { color:#5bc0ff; }
    A:visited { color:#c6a8ff; }
    A:hover, a:focus { color:#afd7ff; }
    INS { background-color:#033a16; color:#aff5b4; }
    .INS { background-color: #033a16; }
    DEL { background-color:#67060c; color:#ffdcd7; }
    .DEL { background-color:#67060c; }
  }
  SPAN.cmnt { font-family:Times; font-style:italic }
</STYLE>
</HEAD>
<BODY>
<P><EM>This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21
  Core Issues List revision 118b.
  See http://www.open-std.org/jtc1/sc22/wg21/ for the official
  list.</EM></P>
<P>2025-09-28</P>
<HR>
<A NAME="1802"></A><H4>1802.
<TT>char16_t</TT> string literals and surrogate pairs
</H4>
<B>Section: </B>5.13.5&#160; [<A href="https://wg21.link/lex.string">lex.string</A>]
 &#160;&#160;&#160;

 <B>Status: </B>CD4
 &#160;&#160;&#160;

 <B>Submitter: </B>Jeffrey Yasskin
 &#160;&#160;&#160;

 <B>Date: </B>2013-10-30<BR>


<P>[Moved to DR at the November, 2014 meeting.]</P>



<P>The intent of <TT>char16_t</TT> string literals, as is
evident from 5.13.5 [<A href="https://wg21.link/lex.string#9">lex.string</A>] paragraph 9, is
that they be encoded in UTF-16, using surrogate
pairs to represent code points outside the basic
multi-lingual plane:</P>

<BLOCKQUOTE>

A single <I>c-char</I> may produce more than
one <TT>char16_t</TT> character in the form of surrogate
pairs.

</BLOCKQUOTE>
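
<P>As an illustration of what "surrogate pairs" means here, the following
is a minimal sketch of the UTF-16 encoding arithmetic for a code point
outside the basic multi-lingual plane. The names <TT>SurrogatePair</TT>
and <TT>to_surrogates</TT> are hypothetical helpers invented for this
example; they do not appear in the Standard or in the issue.</P>

```cpp
// Hypothetical helper type, for illustration only.
struct SurrogatePair {
    char16_t high;  // in the range 0xD800..0xDBFF
    char16_t low;   // in the range 0xDC00..0xDFFF
};

// Encode a code point in the range U+10000..U+10FFFF as two UTF-16 code
// units. Assumption: the caller passes a code point in that range only.
constexpr SurrogatePair to_surrogates(char32_t cp) {
    return SurrogatePair{
        static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}

// U+1F600 is encoded as the pair D83D DE00.
static_assert(to_surrogates(U'\U0001F600').high == 0xD83D, "high surrogate");
static_assert(to_surrogates(U'\U0001F600').low == 0xDE00, "low surrogate");
```

<P>A single <I>c-char</I> naming such a code point therefore occupies two
<TT>char16_t</TT> elements of the resulting array.</P>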

<P>Paragraph 15, however, is inconsistent with this approach,
saying,</P>

<BLOCKQUOTE>

Escape sequences and universal-character-names in non-raw
string literals have the same meaning as in character
literals (5.13.3 [<A href="https://wg21.link/lex.ccon">lex.ccon</A>]), except that the
single quote <TT>'</TT> is representable either by itself or
by the escape sequence <TT>\'</TT>, and the double
quote <TT>"</TT> shall be preceded by a <TT>\</TT>.

</BLOCKQUOTE>

<P>The inconsistency arises because code points outside the basic
multi-lingual plane are ill-formed in <TT>char16_t</TT>
character literals (5.13.3 [<A href="https://wg21.link/lex.ccon">lex.ccon</A>]):</P>

<BLOCKQUOTE>

A character literal that begins with the letter <TT>u</TT>, such as
<TT>u'y'</TT>, is a character literal of
type <TT>char16_t</TT>. The value of a <TT>char16_t</TT>
literal containing a single <I>c-char</I> is equal to its
ISO 10646 code point value, provided that the code point is
representable with a single 16-bit code unit. (That is,
provided it is a basic multi-lingual plane code point.) If
the value is not representable within 16 bits, the program
is ill-formed.

</BLOCKQUOTE>

<P>It should be clarified that this restriction does not apply
to <TT>char16_t</TT> string literals.</P>
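
<P>The distinction can be sketched with a short example, assuming an
implementation that follows the intended UTF-16 encoding for
<TT>char16_t</TT> string literals. The helper <TT>code_units</TT> and the
variable names are hypothetical, introduced only for this illustration.</P>

```cpp
#include <cstddef>

// Hypothetical helper: number of char16_t code units in a string literal,
// not counting the terminating null character.
template <std::size_t N>
constexpr std::size_t code_units(const char16_t (&)[N]) {
    return N - 1;
}

// U+00E9 is in the basic multi-lingual plane: one code unit.
constexpr char16_t bmp[] = u"\u00E9";

// U+1D11E is outside the basic multi-lingual plane: in a string literal
// the universal-character-name yields a surrogate pair.
constexpr char16_t clef[] = u"\U0001D11E";

static_assert(code_units(bmp) == 1, "one code unit");
static_assert(code_units(clef) == 2, "surrogate pair");
static_assert(clef[0] == 0xD834, "high surrogate");
static_assert(clef[1] == 0xDD1E, "low surrogate");

// By contrast, the corresponding character literal is ill-formed, because
// the value is not representable within 16 bits:
// char16_t c = u'\U0001D11E';
```

<P>The commented-out line is the case the quoted character-literal rule
rejects; the string literal is the case this issue asks to be clarified as
valid.</P>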

<P><B>Proposed resolution (February, 2014):</B></P>

<P>Change 5.13.5 [<A href="https://wg21.link/lex.string#16">lex.string</A>] paragraph 16 as follows:</P>

<BLOCKQUOTE>

Escape sequences and universal-character-names in non-raw string literals
have the same meaning as in character literals
(5.13.3 [<A href="https://wg21.link/lex.ccon">lex.ccon</A>]), except that the single quote <TT>'</TT> is
representable either by itself or by the escape sequence <TT>\'</TT>, and
the double quote <TT>"</TT> shall be preceded by a <TT>\</TT><INS>, and
except that a universal-character-name in a <TT>char16_t</TT> string
literal may yield a surrogate pair</INS>. In a narrow string literal...

</BLOCKQUOTE>

<BR><BR>
</BODY>
</HTML>
