<html>
<head>
<meta charset="UTF-8">
<title>P2314R2: Character sets and encodings</title>

<style type="text/css">
  ins { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
  .new { text-decoration:none; background-color:#D0FFD0 }
  del { text-decoration:line-through; background-color:#FFA0A0 }  
  strong { font-weight: inherit; color: #2020ff }
  table, td, th { border: 1px solid black; border-collapse:collapse; padding: 5px }
</style>
</head>

<body>
ISO/IEC JTC1 SC22 WG21 P2314R2<br/>
Author: Jens Maurer<br/>
Target audience: CWG, LWG<br/>
2021-05-14<br/>

<h1>P2314R2: Character sets and encodings</h1>

<h2>Introduction</h2>

This paper implements the following changes:

<ul>

<li>Switch C++ to a modified "model C" approach
for <em>universal-character-name</em>s as described in the C99
  Rationale v5.10, section 5.2.1.</li>

<li>Introduce the term "literal encoding".  For purposes of the C++
specification, the actual set of characters is not relevant, but the
sequence of code units (i.e. the encoding) specified by a given
character or string literal is.  The terms "execution (wide)
character set" are retained to describe the locale-dependent runtime
character set used by functions such as <code>isalpha</code>.</li>

<li>(Not a wording change) Do not attempt to treat all string literals
  the same; their treatment depends on (phase 7) context.</li>
</ul>

This paper resolves the following core issues:

<ul>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#578">578.</a> Phase 1 replacement of characters with universal-character-names</li>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#1332">1332.</a> Handling of invalid universal-character-names</li>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1335">1335.</a> Stringizing, extended characters, and universal-character-names</li>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1403">1403.</a> Universal-character-names in comments</li>
  <li><strong><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#2455">2455.</a> Concatenation of string literals vs translation phases 5 and 6</strong></li>
</ul>

<h2>Change history</h2>

<h3>Changes since R0</h3>

<ul>
  <li>missed edits in the normative wording</li>
  <li>add comparison with P2297R0</li>
  <li>retained "execution (wide) character set" for locale-dependent runtime encoding, but moved the definition to the library wording</li>
</ul>

<h3>Changes since R1</h3>

<ul>
  <li>missed edits in the normative wording</li>
  <li>Clarify that an <em>r-char-sequence</em> never contains
    a <em>universal-character-name</em>.</li>
  <li>Remove words that claim string literal objects are initialized
    in translation phase 5.</li>
  <li>Add SG16 and EWG poll results.</li>
  <li>Fix typo in [lex.ccon] table 10.</li>
</ul>


<h2>Terminology changes</h2>

The following terms are defined by this paper:

<ul>
<li>translation character set: the abstract character set used during translation; can represent the character equivalent of all valid <em>universal-character-name</em>s</li>
<li>basic character set: minimum character set needed to express C++ program source</li>
<li>basic literal character set: minimum set of characters expressible by literals</li>
<li>ordinary / wide literal encoding: compile-time encoding used for initializing string literal objects</li>
</ul>

The term "basic / extended source character set" is removed.

<h2>Behavior changes</h2>

The core behavior change is that <em>universal-character-name</em>s
are no longer formed in translation phase 1.  Instead, all Unicode
input characters are retained throughout the translation.
<p>
This changes the specified behavior of the stringizing preprocessor
operator [cpp.stringize] as follows:
  
  <table align="center">
    <tr><td>C++20</td><td>this paper</td></tr>
    <tr><td><pre>
#define S(x) # x
const char * s1 = S(Köppe);       // "K\\u00f6ppe"
const char * s2 = S(K\u00f6ppe);  // "K\\u00f6ppe"
      </pre></td>
      <td><pre>
#define S(x) # x
const char * s1 = S(Köppe);       // "Köppe"
const char * s2 = S(K\u00f6ppe);  // "Köppe"
      </pre></td>
    </tr>
    </table>

However, it turns out that all major implementations already implement
what this paper specifies, i.e. no implementation provides an escaped
UCN.
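<p>
A minimal sketch of how the table above could be checked on a given
implementation. The macro <code>S</code> is the one from the table;
<code>stringized_ucn_is_unescaped</code> and
<code>check_stringize</code> are hypothetical helpers introduced here.
The argument is spelled with a <em>universal-character-name</em> so the
check does not depend on the source file encoding. The outcome is
implementation-specific: it is expected, not guaranteed, to hold on
implementations that behave as this paper specifies.
</p>

```cpp
#include <cassert>
#include <cstring>

#define S(x) # x

// Hypothetical helper: returns true if stringizing an identifier spelled
// with a universal-character-name yields the designated character (the
// "this paper" column), and false if it yields the escaped spelling shown
// in the C++20 column.
inline bool stringized_ucn_is_unescaped() {
    const char* stringized = S(K\u00f6ppe);
    return std::strcmp(stringized, "K\u00f6ppe") == 0;
}

// Hypothetical check; assumes an implementation that follows this paper.
inline void check_stringize() {
    assert(stringized_ucn_is_unescaped());
}
```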

<h2>Not all <em>string-literal</em>s are the same</h2>

In C++, string literals can appear in the following contexts:
<p>
<table align="center">
  <tr><th>Context</th><th>Destination</th></tr>
  <tr><td><em>asm-declaration</em></td><td>build environment</td></tr>
  <tr><td><code>#include "fn"</code> or <code>#include &lt;fn&gt;</code></td><td>file name</td></tr>
  <tr><td>language linkage</td><td>translation</td></tr>
  <tr><td><code>operator ""</code> [over.literal]</td><td>translation</td></tr>
  <tr><td><code>#line</code> directive</td><td>diagnostic</td></tr>
  <tr><td>argument for [[nodiscard]] and [[deprecated]]</td><td>diagnostic</td></tr>
  <tr><td><code>#error, static_assert</code></td><td>diagnostic</td></tr>
  <tr><td><code>__FILE__</code>, <code>__func__</code></td><td>literal encoding</td></tr>
  <tr><td><code>std::type_info::name()</code></td><td>literal encoding</td></tr>
  <tr><td><em>character-literal</em> or <em>string-literal</em> appearing elsewhere</td><td>literal encoding</td></tr>
  <tr><td><em>user-defined-literal</em></td><td>literal encoding</td></tr>
</table>
<p>
The destinations have the following meaning:
<ul>
  <li>build environment: A string likely passed as text (program input) to another component in the build environment.</li>
  <li>file name: A file name suitable for the build environment.</li>
  <li>translation: No use outside the compiler.</li>
  <li>diagnostic: Diagnostic text; kept in the translation character
    set until needed for output; then treated the same as
    e.g. <em>identifier</em>s appearing in diagnostic messages</li>
  <li>literal encoding: The implementation-defined encoding for the
    runtime environment, used for <em>string-literal</em>s that appear
    in usual program text.</li>
</ul>

The existing text in 5.13.5 [lex.string] already specifies that the
initialization of a string literal object (as needed when using
a <em>string-literal</em> as a primary expression) is the point where
the <em>string-literal</em> is encoded.  In other contexts, no such
encoding happens.
<strong>Words to the contrary appearing for translation phase 5
[lex.phases] have been excised.</strong>
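<p>
The following sketch places <em>string-literal</em>s in several of the
contexts from the table above. All names (<code>c_entry_point</code>,
<code>old_api</code>, <code>greeting</code>,
<code>check_contexts</code>) are hypothetical and chosen for
illustration only. Only the last literal initializes a string literal
object and is therefore encoded in the ordinary literal encoding.
</p>

```cpp
#include <cassert>
#include <cstddef>

// Language linkage: the string-literal is consumed during translation only.
extern "C" {
inline int c_entry_point() { return 42; }
}

// Attribute argument: the string-literal is diagnostic text.
[[deprecated("use a newer interface instead")]]
inline int old_api() { return 1; }

// static_assert: the string-literal is diagnostic text as well.
static_assert(sizeof(char) == 1, "char occupies exactly one byte");

// Primary expression: a string literal object is initialized here, so the
// characters are encoded in the ordinary literal encoding.
inline const char* greeting() { return "hello"; }

// Number of code units in that object, including the terminating null.
constexpr std::size_t greeting_size = sizeof("hello");

// Hypothetical check of the runtime-visible parts.
inline void check_contexts() {
    assert(c_entry_point() == 42);
    assert(greeting()[0] == 'h');
    assert(greeting_size == 6);
}
```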

<h2>Comparison with P2297R0</h2>

This paper P2314 reformulates the core language rules around lexing of
non-basic characters, while keeping the actual semantic changes to a
minimum. This makes it more likely that the paper can either directly
proceed to CWG or be reviewed by EWG with minimal effort.
<p>
  
The paper P2297R0 "Wording improvements for encodings and character
sets" by Corentin Jabot has overlap with this paper.
The main differences are:

<ul>
  <li>terminology: This paper uses "translation character set";
  P2297R0 uses "Unicode" to describe the C++ compile-time character
  set.  Note that ISO 10646 does not seem to define "Unicode" as a
  stand-alone term, and the term seems unclear regarding the inclusion
    of non-assigned code points.</li>

  <li>alert and backspace for literals: This paper retains the status
  quo, after reformulation to refer to Unicode code points; P2297R0
  removes the requirement to represent these. In my view,
  since <em>simple-escape-sequence</em>s are expressly specified to
  represent alert and backspace, those characters should be required
  to be representable in the literal encoding. The presence or absence
  of such a specified requirement is believed not to have an impact on
  currently existing implementations.</li>
</ul>

<h2>Poll results</h2>

<h3>SG16</h3>

Poll: Introduce the concept of a 'translation character set' which
synthesizes characters for unassigned UCS scalar values.

<pre>
  SF F N A SA
   2 4 1 0  1
</pre>

Present: 9<br/>
Consensus: In favour<br/>
P2314 author&apos;s position: F<br/>
Strongly against: The abstraction is unnecessary and the definition of
'translation character set' is incorrectly using terms defined by Unicode and
UCS.

<p>
  
Poll: Forward D2314R2 as presented on 2021-03-24 to EWG for inclusion in C++23.

<pre>
  SF F N A SA
   3 5 0 0  0
</pre>

Present: 9<br/>
Consensus: Strongly in favour<br/>
P2314 author&apos;s position: In favour

<h3>EWG</h3>

Send P2314 to Electronic Polling, with the intent of going to Core for C++23.

<pre>
SF 	F 	N 	A 	SA
5 	6 	0 	0 	0
</pre>


<h2>Wording changes</h2>

Change in 3.35 [defns.multibyte]:

<blockquote>
<b>multibyte character</b>
<p>
sequence of one or more bytes representing <del>a member of the
extended character set of either the source or the execution
environment</del> <ins>the code unit sequence for an encoded
character of the execution character set</ins>
<p>
<del>[Note 1 to entry: The extended character set is a superset of the basic character set (5.3). — end note]</del>
</blockquote>

Change in 5.2 [lex.phases] paragraph 1:

<blockquote>
1. Physical source file characters are mapped, in an
implementation-defined manner, to the <del>basic source</del>
<ins>translation</ins>
character set (introducing new-line characters for end-of-line
indicators) <del>if necessary</del>. The set of physical source file
characters accepted is implementation-defined. <del>Any source file
character not in the basic source character set (5.3 [lex.charset]) is
replaced by the universal-character-name that designates that
character. An implementation may use any internal encoding, so long as
an actual extended character encountered in the source file, and the
same extended character expressed in the source file as a
universal-character-name (e.g., using the \uXXXX notation), are
handled equivalently except where this replacement is reverted (5.4
[lex.pptoken]) in a raw string literal.</del>
<p>
  ...
<p>
3. The source file is decomposed into preprocessing tokens (5.4
[lex.pptoken]) and sequences of white-space characters (including
comments). A source file shall not end in a partial preprocessing
token or in a partial comment. [ Footnote: ... ] Each comment is
replaced by one space character. New-line characters are retained.
Whether each nonempty sequence of white-space characters other than
new-line is retained or replaced by one space character is
unspecified. <ins>Each <em>universal-character-name</em> outside of a
<em>header-name</em> or a character or string literal
is replaced by the designated element of the translation character set
([lex.charset]).</ins>  The process of dividing a source file’s
characters into preprocessing tokens is context-dependent. [Example:
See the handling of &lt; within a #include preprocessing directive.  —
end example]
<p>
4. Preprocessing directives are executed, macro invocations are
expanded, and _Pragma unary operator expressions are executed. <del>If a
character sequence that matches the syntax of a
<em>universal-character-name</em> is produced by token concatenation
(15.6.3 [cpp.concat]), the behavior is undefined.</del> A #include
preprocessing directive causes the named header or source file to be
processed from phase 1 through phase 4, recursively.  All
preprocessing directives are then deleted.
<p>
5. <ins>For a sequence of two or more
adjacent <em>string-literal</em> tokens, a
common <em>encoding-prefix</em> is determined as specified in 5.13.5
[lex.string].  Each such <em>string-literal</em> token is then
considered to have that common <em>encoding-prefix</em>.</ins>
<del>
Each <em>basic-c-char</em>, <em>basic-s-char</em>, and <em>r-char</em>
in a <em>character-literal</em> or a <em>string-literal</em>, as well
as each <em>escape-sequence</em> and <em>universal-character-name</em>
in a <em>character-literal</em> or a non-raw string literal, is
encoded in the literal’s associated character encoding as specified in
5.13.3 [lex.ccon] and 5.13.5 [lex.string].</del>
<p>
6. Adjacent <del>string literal</del>
<ins><em>string-literal</em></ins> tokens are concatenated
<del>and a null character is appended to the result as specified in 5.13.5</del>
<ins>(5.13.5 [lex.string])</ins>.
</blockquote>

Replace all of 5.3 [lex.charset] (paragraphs 1-3):

<blockquote class="new">
1 The <em>translation character set</em> consists of the following elements:
<ul>
<li>each character named by ISO/IEC 10646,
    as identified by its unique UCS scalar value, and</li>
<li>a distinct character for each UCS scalar value where no
  named character is assigned.</li>
</ul>
[ Note: ISO/IEC 10646 code points are integers in the range [0,
10FFFF] (hexadecimal). A surrogate code point is a value in the range
[D800, DFFF] (hexadecimal). A UCS scalar value is any code point that
is not a surrogate code point. -- end note ]
<p>
2 The <em>basic character set</em> is a subset of the translation
character set, consisting of 96 characters as specified in table X. [
Note: Unicode short names are given only as a means of identifying the
character; the numerical value has no other meaning in this
context. -- end note ]
<p>

<table border="1" align="center">
<tr><td>U+0009</td> <td>CHARACTER TABULATION</td></tr>
<tr><td>U+000B</td> <td> LINE TABULATION</td></tr>
<tr><td>U+000C</td> <td> FORM FEED (FF)</td></tr>
<tr><td>U+0020</td> <td> SPACE</td></tr>
<tr><td>U+000A</td> <td>LINE FEED (LF)</td><td><em>new-line</em></td></tr>
<tr><td>U+0021</td> <td> EXCLAMATION MARK</td><td>!</td></tr>
<tr><td>U+0022</td> <td> QUOTATION MARK</td><td>&quot;</td></tr>
<tr><td>U+0023</td> <td> NUMBER SIGN</td><td>#</td></tr>
<tr><td>U+0025</td> <td> PERCENT SIGN</td><td>%</td></tr>
<tr><td>U+0026</td> <td> AMPERSAND</td><td>&amp;</td></tr>
<tr><td>U+0027</td> <td> APOSTROPHE</td><td>&#39;</td></tr>
<tr><td>U+0028</td> <td> LEFT PARENTHESIS</td><td>(</td></tr>
<tr><td>U+0029</td> <td> RIGHT PARENTHESIS</td><td>)</td></tr>
<tr><td>U+002A</td> <td> ASTERISK</td><td>*</td></tr>
<tr><td>U+002B</td> <td> PLUS SIGN</td><td>+</td></tr>
<tr><td>U+002C</td> <td> COMMA</td><td>,</td></tr>
<tr><td>U+002D</td> <td> HYPHEN-MINUS</td><td>-</td></tr>
<tr><td>U+002E</td> <td> FULL STOP</td><td>.</td></tr>
<tr><td>U+002F</td> <td> SOLIDUS</td><td>/</td></tr>
<tr><td>U+0030 .. U+0039</td> <td>DIGIT ZERO .. NINE</td><td>0 1 2 3 4 5 6 7 8 9</td></tr>
<tr><td>U+003A</td> <td> COLON</td><td>:</td></tr>
<tr><td>U+003B</td> <td> SEMICOLON</td><td>;</td></tr>
<tr><td>U+003C</td> <td> LESS-THAN SIGN</td><td>&lt;</td></tr>
<tr><td>U+003D</td> <td> EQUALS SIGN</td><td>=</td></tr>
<tr><td>U+003E</td> <td> GREATER-THAN SIGN</td><td>&gt;</td></tr>
<tr><td>U+003F</td> <td> QUESTION MARK</td><td>?</td></tr>
<tr><td>U+0041 .. U+005A</td> <td>LATIN CAPITAL LETTER A .. Z</td><td>A B C D E F G H I J K L M<br/>N O P Q R S T U V W X Y Z</td></tr>
<tr><td>U+005B</td> <td> LEFT SQUARE BRACKET</td><td>[</td></tr>
<tr><td>U+005C</td> <td> REVERSE SOLIDUS</td><td>\</td></tr>
<tr><td>U+005D</td> <td> RIGHT SQUARE BRACKET</td><td>]</td></tr>
<tr><td>U+005E</td> <td> CIRCUMFLEX ACCENT</td><td>^</td></tr>
<tr><td>U+005F</td> <td> LOW LINE</td><td>_</td></tr>
<tr><td>U+0061 .. U+007A</td> <td> LATIN SMALL LETTER A .. Z</td><td>a b c d e f g h i j k l m<br/>n o p q r s t u v w x y z</td></tr>
<tr><td>U+007B</td> <td> LEFT CURLY BRACKET</td><td>{</td></tr>
<tr><td>U+007C</td> <td> VERTICAL LINE</td><td>|</td></tr>
<tr><td>U+007D</td> <td> RIGHT CURLY BRACKET</td><td>}</td></tr>
<tr><td>U+007E</td> <td> TILDE</td><td>~</td></tr>
</table>
</blockquote>
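<p>
Illustration only, not part of the proposed wording: a small sketch
that restates the note on UCS scalar values and cross-checks the count
of 96 characters against the table above. The helper names
(<code>is_ucs_scalar_value</code>, <code>is_basic_character</code>,
<code>count_basic_characters</code>,
<code>check_character_sets</code>) are assumptions made for this
example.
</p>

```cpp
#include <cassert>

// Hypothetical helper: true if `value` is a UCS scalar value, i.e. a code
// point in [0, 10FFFF] that is not a surrogate code point in [D800, DFFF].
constexpr bool is_ucs_scalar_value(unsigned long value) {
    return value <= 0x10FFFFul && !(value >= 0xD800ul && value <= 0xDFFFul);
}

// Hypothetical helper: true if `c` is one of the code points listed in the
// basic character set table: four control characters plus U+0020 through
// U+007E, except U+0024, U+0040 and U+0060.
constexpr bool is_basic_character(unsigned long c) {
    return c == 0x09 || c == 0x0A || c == 0x0B || c == 0x0C
        || (c >= 0x20 && c <= 0x7E && c != 0x24 && c != 0x40 && c != 0x60);
}

// Counts the code points accepted by is_basic_character.
inline int count_basic_characters() {
    int count = 0;
    for (unsigned long c = 0; c <= 0x7F; ++c) {
        if (is_basic_character(c)) ++count;
    }
    return count;
}

// Hypothetical check of the properties stated in the wording above.
inline void check_character_sets() {
    assert(count_basic_characters() == 96);
    assert(is_ucs_scalar_value(0x10FFFF));
    assert(!is_ucs_scalar_value(0xD800));
    assert(!is_ucs_scalar_value(0x110000));
}
```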

<blockquote>
The <em>universal-character-name</em> construct provides a way to name
other characters.
<pre>
<em>hex-quad :
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
  
universal-character-name :
    \u hex-quad
    \U hex-quad hex-quad</em>
</pre>

A <em>universal-character-name</em> designates the
character in <del>ISO/IEC 10646 (if any)</del> <ins>the translation
character set</ins> whose <del>Unicode code point</del> <ins>UCS
scalar value</ins> is the hexadecimal number represented by the
sequence of <em>hexadecimal-digit</em>s in
the <em>universal-character-name</em>. The program is ill-formed if
that number is not a <del>Unicode code point or if it is a surrogate
code point</del> <ins>UCS scalar value</ins>. <del>Noncharacter code
points and reserved code points are considered to designate separate
characters distinct from any ISO/IEC 10646 character.</del>  If
a <em>universal-character-name</em> outside
the <em>c-char-sequence</em>, <em>s-char-sequence</em>,
or <em>r-char-sequence</em> of a <em>character-literal</em>
or <em>string-literal</em> (in either case, including within a
user-defined-literal) corresponds to a control character or to a
character in the basic <del>source</del> character set, the program is
ill-formed. [ <del>Footnote:</del> <ins>Note:</ins> A sequence of
characters resembling a <em>universal-character-name</em> in
an <em>r-char-sequence</em> (5.13.5) does not form
a <em>universal-character-name</em>. ] <del>[Note: ISO/IEC 10646 code
points are integers in the range [0, 10FFFF] (hexadecimal). A
surrogate code point is a value in the range [D800, DFFF]
(hexadecimal). A control character is a character whose code point is
in either of the ranges [0, 1F] or [7F, 9F] (hexadecimal). — end
note]</del>
</blockquote>
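<p>
Illustration only, not part of the proposed wording: a short sketch of
how the two spellings of a <em>universal-character-name</em> relate,
and of the note on <em>r-char-sequence</em>s. The function
<code>check_universal_character_names</code> is a hypothetical name
introduced for this example.
</p>

```cpp
#include <cassert>

// Both spellings designate the same element of the translation character set.
static_assert(U'\u00f6' == U'\U000000f6',
              "short and long forms name the same character");

// In a UTF-32 character literal the value is the UCS scalar value itself.
static_assert(U'\u00f6' == 0xF6, "U+00F6 has the scalar value F6");

// In a raw string literal the same spelling is six separate characters and
// does not form a universal-character-name.
static_assert(sizeof(R"(\u00f6)") == 7,
              "six characters plus the terminating null");

// Hypothetical runtime check on a UTF-32 string literal.
inline void check_universal_character_names() {
    const char32_t text[] = U"K\u00f6ppe";
    assert(sizeof(text) / sizeof(text[0]) == 6);  // five characters + null
    assert(text[1] == 0xF6);
}
```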

<blockquote class="new">
The <em>basic literal character set</em> consists of all characters of
the basic character set, plus the control characters specified in table Y.
<table border="1" align="center">
<tr><td>U+0000</td><td>NULL</td></tr>
<tr><td>U+0007</td><td>BELL</td></tr>
<tr><td>U+0008</td><td>BACKSPACE</td></tr>
<tr><td>U+000D</td><td>CARRIAGE RETURN (CR)</td></tr>
</table>
<p>
A <em>code unit</em> is an integer value of character type (6.8.1
[basic.fundamental]).  Characters in a <em>character-literal</em>
other than a multicharacter or non-encodable character literal or in
a <em>string-literal</em> are encoded as a sequence of one or more
code units, as determined by the <em>encoding-prefix</em> ([lex.ccon],
[lex.string]); this is termed the respective <em>literal
encoding</em>.

The <em>ordinary literal encoding</em> is the encoding applied to an
ordinary character or string literal.  The <em>wide literal
encoding</em> is the encoding applied to a wide character or string
literal.
<p>
A literal encoding encodes each element of the basic literal character
set as a single code unit with non-negative value, distinct
from the code unit for any other such element. [ Note: A character not
in the basic literal character set can be encoded with more than one
code unit; the value of such a code unit can be the same as that of a
code unit for an element of the basic literal character set. -- end
note ]  The U+0000 NULL character is
encoded as the value 0. No other element of the translation character
set is encoded with a code unit of value 0.  The code unit value of
each decimal digit character after the digit 0 (U+0030) shall be one
greater than the value of the previous.  The ordinary and wide literal
encodings are otherwise implementation-defined.
For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value
corresponding to each character of the translation character set is
encoded as specified in ISO/IEC 10646 for the respective UCS encoding
form.
</blockquote>
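<p>
Illustration only, not part of the proposed wording: the requirements
on literal encodings stated above can be spelled out as compile-time
checks. <code>digit_value</code> and
<code>check_literal_encodings</code> are hypothetical helpers; the code
unit counts for the UTF-8, UTF-16 and UTF-32 literals follow from the
respective UCS encoding forms.
</p>

```cpp
#include <cassert>

// U+0000 NULL is encoded as the value 0.
static_assert('\0' == 0 && L'\0' == 0, "null is encoded as 0");

// Each decimal digit after 0 is one greater than the previous.
static_assert('1' == '0' + 1 && '9' == '0' + 9, "ordinary digits are consecutive");
static_assert(L'1' == L'0' + 1 && L'9' == L'0' + 9, "wide digits are consecutive");

// UTF-8: U+00F6 takes two code units, plus the terminating null.
static_assert(sizeof(u8"\u00f6") == 3, "two UTF-8 code units plus null");

// UTF-16: U+1F600 takes a surrogate pair, plus the terminating null.
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t),
              "two UTF-16 code units plus null");

// UTF-32: every UCS scalar value takes exactly one code unit.
static_assert(sizeof(U"\U0001F600") == 2 * sizeof(char32_t),
              "one UTF-32 code unit plus null");

// Hypothetical helper: numeric value of a decimal digit character, relying
// only on the consecutive-digits guarantee.
constexpr int digit_value(char c) { return c - '0'; }

// Hypothetical runtime check of the UTF-8 code units for U+00F6.
inline void check_literal_encodings() {
    assert(digit_value('7') == 7);
    assert(static_cast<unsigned char>(u8"\u00f6"[0]) == 0xC3);
    assert(static_cast<unsigned char>(u8"\u00f6"[1]) == 0xB6);
}
```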

<blockquote>
<del>The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set, plus control characters representing alert,
backspace, and carriage return, plus a null character (respectively,
null wide character), whose value is 0. For each basic execution
character set, the values of the members shall be non-negative and
distinct from one another. In both the source and execution basic
character sets, the value of each character after 0 in the above list
of decimal digits shall be one greater than the value of the
previous. The <em>execution character set</em> and the <em>execution
wide-character set</em> are implementation-defined supersets of the
basic execution character set and the basic execution wide-character
set, respectively.  The values of the members of the execution
character sets and the sets of additional members are
locale-specific.</del>
</blockquote>

Change in 5.4 [lex.pptoken] paragraph 2:

<blockquote>
A preprocessing token is the minimal lexical element of the language
in translation phases 3 through 6. <ins>In this document, glyphs are
used to identify elements of the basic character set
([lex.charset]).</ins>  The categories of preprocessing token are:
header names, placeholder tokens produced by preprocessing import and
module directives (<em>import-keyword</em>, <em>module-keyword</em>,
and <em>export-keyword</em>), identifiers, preprocessing numbers,
character literals (including user-defined character literals), string
literals (including user-defined string literals), preprocessing
operators and punctuators, and single non-whitespace characters that
do not lexically match the other preprocessing token categories.  If
a <del>&apos; or a &quot;</del><ins>U+0027 APOSTROPHE or a U+0022
QUOTATION MARK</ins> character matches the last category, the behavior
is undefined. Preprocessing tokens can be separated by whitespace;
this consists of comments (5.7), or whitespace characters (<del>space,
horizontal tab</del><ins>U+0020 SPACE, U+0009 CHARACTER
TABULATION</ins>, new-line, <del>vertical tab, and
form-feed</del><ins>U+000B LINE TABULATION, and U+000C FORM
FEED</ins>), or both. ...
</blockquote>

Change in 5.4 [lex.pptoken] paragraph 3 bullet 1:

<blockquote>
<ul>
<li>If the next character begins a sequence of characters that could
be the prefix and initial double quote of a raw string literal, such
as R&quot;, the next preprocessing token shall be a raw string
literal. Between the initial and final double quote characters of the
raw string, any transformations performed in <del>phases</del> <ins>phase</ins>
<del>1 and</del> 2 (<del>universal-character-names and</del> line
splicing) are reverted; this reversion shall apply before any d-char,
r-char, or delimiting parenthesis is identified.</li>
</ul>
</blockquote>

Change in 5.8 [lex.header] paragraph 1:

<blockquote>
<pre>
<em>h-char</em>:
    any member of the <del>source</del> <ins>translation</ins> character set except new-line and <del>&gt;</del> <ins>U+003E GREATER-THAN SIGN</ins>
...
<em>q-char</em>:
    any member of the <del>source</del> <ins>translation</ins> character set except new-line and <del>&quot;</del> <ins>U+0022 QUOTATION MARK</ins>
</pre>
</blockquote>

<strong>Change in 5.10 [lex.name]:</strong>

<blockquote>
<pre>
<em>identifier-nondigit:
    nondigit
    <ins>any member of the translation character set from Table 2</ins>
    universal-character-name</em>
</pre>
</blockquote>


Change in 5.13.3 [lex.ccon] before paragraph 1:

<blockquote>
<pre>
<em>basic-c-char</em>:
    any member of the <del>basic source</del> <ins>translation</ins> character set
    except the <del>single-quote ’, backslash \</del> <ins>U+0027 APOSTROPHE, U+005C REVERSE SOLIDUS</ins>, or new-line character
...
<em>conditional-escape-sequence-char</em>:
    any member of the basic <del>source</del> character set that is not an <em>octal-digit</em>, a <em>simple-escape-sequence-char</em>, or
    the characters u, U, or x
</pre>
</blockquote>

Change in 5.13.3 [lex.ccon] paragraph 2:

<blockquote>
[Note 1 : The associated character encoding for ordinary and wide character literals determines encodability, but
does not determine the value of non-encodable ordinary or wide character literals or ordinary or wide multicharacter
literals. The examples in Table 9 for non-encodable ordinary and wide character literals assume that the specified
character lacks representation in the <del>execution character set</del> <ins>ordinary literal encoding</ins> or <del>execution wide-character set</del> <ins>wide literal encoding</ins>, respectively, or that
encoding it would require more than one code unit. — end note]
</blockquote>

Change in 5.13.3 [lex.ccon] table tab:lex.ccon.literal:
<blockquote>
  <table>
    <tr><td>Encoding prefix</td><td>...</td><td>Associated character encoding</td></tr>
    <tr><td>none</td><td>...</td><td>
    <del>encoding of the execution character set</del>
    <ins>ordinary literal encoding</ins>
    </td></tr>
      <tr><td>L</td><td>...</td><td>
      <del>encoding of the execution wide-character set</del>
      <ins>wide literal encoding</ins>
      </td></tr>
  </table>
</blockquote>

Replace 5.13.3 [lex.ccon] table tab:lex.ccon.esc:

<blockquote>
The character specified by a <em>simple-escape-sequence</em> is
specified in Table 10.
<p>
  
<table class="new" border="1" align="center">
<tr><th colspan=3>character</th><th><em>simple-escape-sequence</em></th></tr>
<tr><td>U+000A</td><td>LINE FEED (LF)</td><td></td><td><code>\n</code></td></tr>
<tr><td>U+0009</td><td>CHARACTER TABULATION</td><td></td><td><code>\t</code></td></tr>
<tr><td>U+000B</td><td>LINE TABULATION</td><td></td><td><code>\v</code></td></tr>
<tr><td>U+0008</td><td>BACKSPACE</td><td></td><td><code>\b</code></td></tr>
<tr><td>U+000D</td><td>CARRIAGE RETURN (CR)</td><td></td><td><code>\r</code></td></tr>
<tr><td>U+000C</td><td>FORM FEED (FF)</td><td></td><td><code>\f</code></td></tr>
<tr><td>U+0007</td><td>BELL</td><td></td><td><code>\a</code></td></tr>
<tr><td>U+005C</td><td>REVERSE SOLIDUS</td><td><code>\</code></td><td><code>\\</code></td></tr>
<tr><td>U+003F</td><td>QUESTION MARK</td><td><code>?</code></td><td><code>\?</code></td></tr>
<tr><td>U+0027</td><td>APOSTROPHE</td><td><code>&apos;</code></td><td><code>\&apos;</code></td></tr>
<tr><td>U+0022</td><td>QUOTATION MARK</td><td><code>"</code></td><td><code>\&quot;</code></td></tr>
</table>
</blockquote>
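<p>
Illustration only, not part of the proposed wording: the table can be
restated as a lookup and cross-checked against UTF-32 character
literals, whose values are the designated code points.
<code>simple_escape_code_point</code> and
<code>check_simple_escapes</code> are hypothetical names introduced for
this sketch.
</p>

```cpp
#include <cassert>

// Hypothetical helper: code point designated by the simple-escape-sequence
// whose second character is `c`, or 0x110000 (not a UCS scalar value) if
// `c` does not form a simple-escape-sequence.
constexpr unsigned long simple_escape_code_point(char c) {
    return c == 'n'  ? 0x000A
         : c == 't'  ? 0x0009
         : c == 'v'  ? 0x000B
         : c == 'b'  ? 0x0008
         : c == 'r'  ? 0x000D
         : c == 'f'  ? 0x000C
         : c == 'a'  ? 0x0007
         : c == '\\' ? 0x005C
         : c == '?'  ? 0x003F
         : c == '\'' ? 0x0027
         : c == '"'  ? 0x0022
         : 0x110000;
}

// The lookup agrees with the values of the corresponding UTF-32 literals.
static_assert(simple_escape_code_point('n') == U'\n', "LINE FEED (LF)");
static_assert(simple_escape_code_point('t') == U'\t', "CHARACTER TABULATION");
static_assert(simple_escape_code_point('a') == U'\a', "BELL");
static_assert(simple_escape_code_point('\\') == U'\\', "REVERSE SOLIDUS");

// Hypothetical runtime check on a UTF-32 string literal.
inline void check_simple_escapes() {
    const char32_t text[] = U"\a\b\r";
    assert(text[0] == simple_escape_code_point('a'));
    assert(text[1] == simple_escape_code_point('b'));
    assert(text[2] == simple_escape_code_point('r'));
}
```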
  

Change in 5.13.5 [lex.string] before paragraph 1:

<blockquote>
<pre>
<em>basic-s-char</em>:
    any member of the <del>basic source</del> <ins>translation</ins> character set
    except the <del>double-quote &quot;, backslash \</del><ins>U+0022 QUOTATION MARK, U+005C REVERSE SOLIDUS</ins>, or new-line character
...
<em>r-char</em>:
    any member of the <del>source</del> <ins>translation</ins> character set, except a <del>right parenthesis )</del> <ins>U+0029 RIGHT PARENTHESIS</ins> followed by
    the initial <em>d-char-sequence</em> (which may be empty) followed by a <del>double quote &quot;</del> <ins>U+0022 QUOTATION MARK</ins>.
...
<em>d-char</em>:
    any member of the basic <del>source</del> character set except:
    <del>space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters</del>
    <del>representing horizontal tab, vertical tab, form feed</del>
    <ins>U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+005C REVERSE SOLIDUS,
    U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM FEED (FF)</ins>, and new-line


</pre>
</blockquote>

Change in 5.13.5 [lex.string] table tab:lex.string.literal:

<blockquote>
<table>
  <tr><td>Encoding prefix</td><td>...</td><td>Associated character encoding</td></tr>
  <tr><td>none</td><td>...</td>
    <td><del>encoding of the execution character set</del>
      <ins>ordinary literal encoding</ins></td></tr>
    <tr><td>L</td><td>...</td>
      <td><del>encoding of the execution wide-character set</del>
	<ins>wide literal encoding</ins></td></tr>
</table>
</blockquote>

Change in 5.13.5 [lex.string] paragraphs 7 and 8:

<blockquote>
- 7 - <del>In translation phase 6 (5.2 [lex.phases]),
adjacent <em>string-literal</em>s are concatenated.</del>
<ins>The common <em>encoding-prefix</em> for a sequence of
adjacent <em>string-literal</em>s is determined pairwise as
follows:</ins>

If <del>both</del> <ins>two</ins> <em>string-literal</em>s have the same
<em>encoding-prefix</em>, the <del>resulting
concatenated <em>string-literal</em> has</del>
<ins>common <em>encoding-prefix</em> is</ins>
that <em>encoding-prefix</em>. If one <em>string-literal</em> has no
<em>encoding-prefix</em>, <del>it is treated as
a <em>string-literal</em> of the same <em>encoding-prefix</em>
as</del> <ins>the common <em>encoding-prefix</em> is that of</ins> the
other <del>operand</del> <ins><em>string-literal</em></ins>. If a UTF-8 string
literal token is adjacent to a wide string literal token, the program
is ill-formed. Any other <del>concatenations</del>
<ins>combinations</ins> are conditionally-supported with
implementation-defined behavior. [Note: <del>This concatenation is an
interpretation, not a conversion. Because the interpretation happens
in translation phase 6 (after each character from a
string-literal has been translated into a value from the appropriate
character set), a</del> <ins>A</ins>
<em>string-literal</em>’s <del>initial</del> rawness has no effect on the
<del>interpretation or well-formedness of the concatenation</del>
<ins>determination of the common <em>encoding-prefix</em></ins>.  -- end note]
<p>
<del>Table 13 has some examples of valid concatenations.</del>
<p>
- 8 -
<ins>In translation phase 6 (5.2 [lex.phases]),
adjacent <em>string-literal</em>s are concatenated.
The lexical structure of the contents of the individual
<em>string-literal</em>s is retained.</ins>
<del>Characters in concatenated strings are kept distinct.</del>
[Example:
<pre>"\xA" "B"</pre>
<del>contains the two characters</del>
<ins>represents the code unit</ins> ’\xA’ and <ins>the character</ins> ’B’ after concatenation
(and not the single <del>hexadecimal character</del> <ins>code unit</ins> ’\xAB’).
<ins>Similarly,
<pre><ins>R"(\u00)" "41"</ins></pre>
represents six characters, starting with a backslash and ending with
the digit <code>1</code> (and not the single character "A" specified by
a <em>universal-character-name</em>).</ins>
<p>
<ins>Table 13 has some examples of valid concatenations.</ins>
— end example]
<p>
<del>In translation phase 6 (5.2), after adjacent string-literals are
concatenated, a null character is appended to the result.</del>
</blockquote>
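<p>
Illustration only, not part of the proposed wording: the two examples
in paragraph 8 and the rule for the common <em>encoding-prefix</em> in
paragraph 7 can be observed through the size and type of the resulting
string literal. <code>concatenated</code> and
<code>check_concatenation</code> are hypothetical names introduced for
this sketch.
</p>

```cpp
#include <cassert>
#include <type_traits>

// "\xA" "B": the escape sequence ends with the first literal, so the result
// is the code unit 0x0A followed by the character B, plus the null.
static_assert(sizeof("\xA" "B") == 3, "two code units plus the terminating null");

// R"(\u00)" "41": six characters, not a universal-character-name.
static_assert(sizeof(R"(\u00)" "41") == 7,
              "six characters plus the terminating null");

// A literal without an encoding-prefix takes the prefix of its neighbor.
static_assert(std::is_same<decltype(L"a" "b"), const wchar_t(&)[3]>::value,
              "common encoding-prefix is L");
static_assert(std::is_same<decltype("a" u"b"), const char16_t(&)[3]>::value,
              "common encoding-prefix is u");

// Hypothetical helper returning the first example from paragraph 8.
inline const char* concatenated() { return "\xA" "B"; }

// Hypothetical runtime check of the individual code units.
inline void check_concatenation() {
    assert(concatenated()[0] == 0x0A);
    assert(concatenated()[1] == 'B');
    assert(concatenated()[2] == '\0');
}
```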

Change in 5.13.5 [lex.string] paragraph 10 and de-bulletize:

<blockquote>
String literal objects are initialized with the sequence of code unit
values corresponding to the <em>string-literal</em>&#39;s sequence
of <em>s-char</em>s (for a non-raw string literal)
and <em>r-char</em>s (for a raw string literal)<ins>, plus a
terminating U+0000 NULL character,</ins> in order as follows:
<ul>
<li>The sequence of characters denoted by each contiguous sequence
of <em>basic-s-char</em>s, <em>r-char</em>s, <em>simple-escape-sequence</em>s
(5.13.3), and <em>universal-character-name</em>s (5.3) is encoded to a
code unit sequence using the <em>string-literal</em>&apos;s associated
character encoding.
If a character lacks representation in the
associated character encoding, <del>then: If
the <em>string-literal</em>&apos;s <em>encoding-prefix</em> is absent
or <code>L</code>,</del> then the <em>string-literal</em> is
conditionally-supported and an implementation-defined code unit
sequence is encoded.
<ins>[ Note: No character lacks representation in any of the UCS
encoding forms. -- end note ]</ins>

<del>Otherwise, the <em>string-literal</em> is ill-formed.</del>
<p>
When encoding a stateful character encoding, ... </li>
  <li>...</li>
  </ul>
</blockquote>

Change in 5.13.8 [lex.ext] paragraph 3:

<blockquote>
[ Note: The sequence c<sub>1</sub> c<sub>2</sub> ...c<sub>k</sub> can
only contain characters from the basic <del>source</del> character
set. — end note]
</blockquote>

Change in 5.13.8 [lex.ext] paragraph 4:

<blockquote>
[ Note: The sequence c<sub>1</sub> c<sub>2</sub> ...c<sub>k</sub> can
only contain characters from the basic <del>source</del> character
set. — end note]
</blockquote>


Change in 6.7.1 [intro.memory] paragraph 1:

<blockquote>
The fundamental storage unit in the memory model is
the <em>byte</em>. A byte is at least large enough to contain <del>any
member</del> <ins>the ordinary literal encoding of any element</ins>
of the basic <del>execution</del> <ins>literal</ins> character set
(5.3) and the eight-bit code units of the Unicode UTF-8 encoding form
and is composed of a contiguous sequence of bits, [ Footnote: ... ]
the number of which is implementation-defined.
</blockquote>
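
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch of the
guarantees a program can rely on from the definition of "byte" above.
The function name <code>bits_per_byte</code> is an assumption made for
this example only.
</p>

```cpp
#include <cassert>
#include <climits>

// A byte must be able to hold any eight-bit UTF-8 code unit.
static_assert(CHAR_BIT >= 8, "a byte has at least eight bits");
static_assert(UCHAR_MAX >= 0xFF, "every eight-bit code unit value fits");

// char occupies exactly one byte by definition of sizeof.
static_assert(sizeof(char) == 1, "char is one byte");

// An empty UTF-8 string literal holds just the null code unit, so its size
// equals the size of one UTF-8 code unit: one byte.
static_assert(sizeof(u8"") == 1, "a UTF-8 code unit is one byte");

// Hypothetical accessor used only for the example.
constexpr unsigned bits_per_byte() { return CHAR_BIT; }
```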

Change in 6.8.2 [basic.fundamental] paragraph 7:

<blockquote>
Type <code>char</code> is a distinct type that has an
implementation-defined choice of “signed char” or “unsigned char” as
its underlying type.  <del>The values of type <code>char</code> can
represent distinct codes for all members of the implementation’s basic
character set.</del> ...
</blockquote>

<em>Editing note: The requirement struck out above is already implied
by the definition of "byte" in 6.7.1 [intro.memory].  If desired, a
note can be added stating that a <code>char</code> occupies exactly
one byte.</em>

<p>

Change in 6.8.2 [basic.fundamental] paragraph 8:
<blockquote>
Type <code>wchar_t</code> is a distinct type that has an
implementation-defined signed or unsigned integer type as its
underlying type. The values of type <code>wchar_t</code> can represent
distinct codes for all members of <del>the largest
extended</del> <ins>any</ins> character set specified among the
supported locales (28.3.1).
</blockquote>

Change in 6.8.2 [basic.fundamental] paragraph 11:

<blockquote>
<ins>The types <code>char</code>, <code>wchar_t</code>,
<code>char8_t</code>, <code>char16_t</code>, <code>char32_t</code> are
collectively called <em>character types</em>.  The character
types, </ins> <del>Types</del> <code>bool</code>,
<del>char, wchar_t, char8_t, char16_t, char32_t,</del> and the signed
and unsigned integer types are collectively called <em>integral
types</em>. A synonym for integral type is integer type. [Note:
Enumerations (9.7.1) are not integral; however, unscoped enumerations
can be promoted to integral types as specified in 7.3.6. — end note]
</blockquote>
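
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch of the
term "character types" as introduced above. The
trait <code>is_character_type</code> is hypothetical and exists only
for this example; it is not a standard library facility.
The <code>char8_t</code> specialization is guarded by the
<code>__cpp_char8_t</code> feature-test macro so that the sketch also
compiles before C++20.
</p>

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical trait: true exactly for the types the wording above
// collectively calls "character types".
template <class T> struct is_character_type : std::false_type {};
template <> struct is_character_type<char> : std::true_type {};
template <> struct is_character_type<wchar_t> : std::true_type {};
#ifdef __cpp_char8_t
template <> struct is_character_type<char8_t> : std::true_type {};
#endif
template <> struct is_character_type<char16_t> : std::true_type {};
template <> struct is_character_type<char32_t> : std::true_type {};

// Every character type is an integral type.
static_assert(std::is_integral<char>::value, "char is integral");
static_assert(std::is_integral<wchar_t>::value, "wchar_t is integral");
static_assert(std::is_integral<char16_t>::value, "char16_t is integral");
static_assert(std::is_integral<char32_t>::value, "char32_t is integral");

// bool and the signed/unsigned integer types are integral types,
// but they are not character types.
static_assert(std::is_integral<bool>::value, "bool is integral");
static_assert(!is_character_type<bool>::value, "bool is not a character type");
static_assert(!is_character_type<signed char>::value,
              "signed char is a signed integer type");

// char is a distinct type regardless of its underlying type.
static_assert(!std::is_same<char, signed char>::value &&
              !std::is_same<char, unsigned char>::value,
              "char is distinct from signed char and unsigned char");
```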

Change in 7.5.1 [expr.prim.literal] paragraph 1:

<blockquote>
<del>A <em>literal</em> is a primary expression.</del>  The type of
a <em>literal</em> is determined based on its form as specified in
5.13 [lex.literal].  A <em>string-literal</em> is an
lvalue <ins>designating the corresponding string literal object
([lex.string])</ins>, a <em>user-defined-literal</em> has the same
value category as the corresponding operator call expression described
in 5.13.8 [lex.ext], and any other <em>literal</em> is a prvalue.
</blockquote>
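
<p>
<em>Illustration (not proposed wording):</em> The value categories
described above can be observed with <code>decltype</code> applied to a
parenthesized expression, which yields a reference type for an lvalue
and a non-reference type for a prvalue. The literal
operator <code>_km</code> is a hypothetical example; it returns by
value, so the corresponding operator call is a prvalue.
</p>

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical user-defined literal, introduced only for this example.
constexpr long double operator""_km(long double v) { return v * 1000.0L; }

// A string-literal is an lvalue designating the string literal object.
static_assert(std::is_same<decltype(("abc")), const char (&)[4]>::value,
              "string-literal is an lvalue of array type");

// Other literals are prvalues.
static_assert(std::is_same<decltype((42)), int>::value, "prvalue");
static_assert(std::is_same<decltype(('a')), char>::value, "prvalue");

// A user-defined-literal has the value category of the operator call.
static_assert(std::is_same<decltype((1.5_km)), long double>::value,
              "prvalue, because the operator returns by value");
```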

Change in 15.2 [cpp.cond] paragraph 12:

<blockquote>
The resulting tokens comprise the controlling constant expression
which is evaluated according to the rules of 7.7 using arithmetic that
has at least the ranges specified in 17.3. For the purposes of this
token conversion and evaluation all signed and unsigned integer types
act as if they have the same representation as, respectively, <code>intmax_t</code>
or <code>uintmax_t</code> (17.4). [Note: ... -- end note] This includes
interpreting <em>character-literal</em>s, which may
involve <del>converting escape sequences into execution character set
members</del> <ins>interpreting <em>escape-sequence</em>s
and <em>universal-character-name</em>s (5.13.3
[lex.ccon])</ins>. Whether the numeric value for these
<em>character-literal</em>s matches the value obtained when an identical
<em>character-literal</em> occurs in an expression (other than within
a #if or #elif directive) is implementation-defined. [Note: ... -- end
note] Also, whether a single-character <em>character-literal</em> may
have a negative value is implementation-defined. Each subexpression
with type <code>bool</code> is subjected to integral promotion before
processing continues.
</blockquote>
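
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch for
observing the two implementation-defined properties mentioned above on
a given implementation: whether a <em>character-literal</em> has the
same value in a <code>#if</code> directive as in an expression, and
whether a single-character literal can be negative. All variable and
function names are assumptions made for this example, and the result
may legitimately differ between implementations.
</p>

```cpp
#include <cassert>

// Values as seen by the preprocessor in a controlling constant expression.
#if 'a' == 97
constexpr bool pp_a_is_97 = true;
#else
constexpr bool pp_a_is_97 = false;
#endif

#if '\xff' < 0
constexpr bool pp_xff_negative = true;
#else
constexpr bool pp_xff_negative = false;
#endif

// The same literals evaluated as ordinary expressions (phase 7).
constexpr bool expr_a_is_97 = ('a' == 97);
constexpr bool expr_xff_negative = ('\xff' < 0);

// Reports whether both views agree on this implementation.
// Agreement is implementation-defined, not guaranteed.
constexpr bool preprocessor_agrees_with_expressions() {
    return pp_a_is_97 == expr_a_is_97 &&
           pp_xff_negative == expr_xff_negative;
}
```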

Change in 15.6.3 [cpp.concat] paragraph 3:

<blockquote>
For both object-like and function-like macro invocations, before the
replacement list is reexamined for more macro names to replace, each
instance of a ## preprocessing token in the replacement list (not from
an argument) is deleted and the preceding preprocessing token is
concatenated with the following preprocessing token. Placemarker
preprocessing tokens are handled specially: concatenation of two
placemarkers results in a single placemarker preprocessing token, and
concatenation of a placemarker with a non-placemarker preprocessing
token results in the non-placemarker preprocessing token. If the
result is not a valid preprocessing token, the behavior is
undefined. <ins>If the result matches the syntax of a
<em>universal-character-name</em>, the behavior is undefined.</ins>
The resulting token is available for further macro replacement. The
order of evaluation of ## operators is unspecified.
</blockquote>
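
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch of
well-defined uses of the <code>##</code> operator, with the case made
undefined by the added sentence described only in a comment. The
macro <code>CAT</code> and the variable names are assumptions made for
this example.
</p>

```cpp
#include <cassert>

#define CAT(a, b) a##b

// Well-defined: the tokens `coun` and `ter` form the identifier `counter`.
int CAT(coun, ter) = 41;

// Well-defined: the tokens `4` and `2` form the pp-number `42`.
constexpr int forty_two = CAT(4, 2);

// Deliberately not done here: pasting a lone backslash token with a token
// such as u00E9 would spell something that matches the syntax of a
// universal-character-name. Under the added sentence that is undefined
// behavior, so portable code must not rely on it.
```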

Change in 16.3.3.3.5.1 [character.seq] paragraph 1:

<blockquote>
The C standard library makes widespread use of characters and
character sequences that follow a few uniform conventions:

<ul>
<li><ins>Properties specified as <em>locale-specific</em> may change
during program execution by a call to <code>setlocale(int, const
char*)</code> (28.5.1 [clocale.syn]), or by a change to
a <code>locale</code> object, as described in 28.3 [locales] and
Clause 29 [input.output].</ins></li>

<li><ins>The <em>execution character set</em> and the <em>execution
wide-character set</em> are supersets of the basic literal character
set (5.3 [lex.charset]).  The encodings of the execution character
sets and the sets of additional elements (if any) are
locale-specific. [ Note: The encoding of the execution character sets
can be unrelated to any literal encoding. -- end note ]</ins></li>
    
<li>A <em>letter</em> is any of the 26 lowercase or 26 uppercase
letters in the basic <del>execution</del> character set.</li>

<li>The <em>decimal-point character</em> is
the <ins>locale-specific</ins> (single-byte) character used by
functions that convert between a (single-byte) character sequence and
a value of one of the floating-point types. It is used in the
character sequence to denote the beginning of a fractional part. It is
represented in Clause 17 through Clause 32 and Annex D by a period,
’.’, which is also its value in the "C" locale<del>, but may change
during program execution by a call to <code>setlocale(int, const
char*)</code>, [ Footnote: ... ] or by a change to
a <code>locale</code> object, as described in 28.3 and Clause
29</del>.</li>
</ul>

</blockquote>
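
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch showing
that the decimal-point character is a locale-specific property that can
be queried at run time and that it is a period in
the <code>"C"</code> locale. The
helper <code>decimal_point_for</code> is a hypothetical name; it
returns an empty string if the requested locale is not available.
</p>

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical helper: returns the decimal-point character sequence of the
// named C locale, restoring the previous LC_NUMERIC setting afterwards.
std::string decimal_point_for(const char* locale_name) {
    const char* current = std::setlocale(LC_NUMERIC, nullptr);
    const std::string saved = current ? current : "C";

    std::string result;
    if (std::setlocale(LC_NUMERIC, locale_name) != nullptr) {
        result = std::localeconv()->decimal_point;
    }

    std::setlocale(LC_NUMERIC, saved.c_str());
    return result;
}
```

Calling the helper with the name of another installed locale may yield
a different character, which is exactly the locale-specific behavior
the first bullet describes.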

Change in 16.3.3.3.5.2 [multibyte.strings] paragraph 1:

<blockquote>
A null-terminated multibyte string, or ntmbs, is an ntbs that
constitutes a sequence of valid multibyte characters, beginning and
ending in the initial shift state. [ Footnote: An NTBS that contains
characters only from the basic <del>execution</del> <ins>literal</ins>
character set is also an NTMBS. Each multibyte character then consists
of a single byte. ]
</blockquote>
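
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch of the
footnote above: for a string made up only of basic literal characters,
every multibyte character is a single byte, so the multibyte character
count equals the byte count. The
helper <code>count_multibyte_chars</code> is a hypothetical name; it
uses the current C locale and returns -1 for an invalid sequence.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: counts the multibyte characters in an NTMBS
// according to the current C locale.
long count_multibyte_chars(const char* s) {
    std::mblen(nullptr, 0);  // reset to the initial shift state

    long count = 0;
    std::size_t remaining = std::strlen(s);
    while (remaining > 0) {
        const int len = std::mblen(s, remaining);
        if (len <= 0) {
            return -1;  // invalid or incomplete multibyte character
        }
        s += len;
        remaining -= static_cast<std::size_t>(len);
        ++count;
    }
    return count;
}
```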

Change in 27.13 [time.parse] table [tab:time.parse.spec]:

<blockquote>
  <table>
    <tr><td>%Z</td><td>The time zone abbreviation or name. A single
word is parsed. This word can only contain characters from the
basic <del>source</del> character set (5.3 [lex.charset]) that are
	alphanumeric, or one of ’_’, ’/’, ’-’, or ’+’.</td></tr>
    </table>
</blockquote>
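
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch of the
character restriction for <code>%Z</code> stated above. The
functions <code>is_tz_word_char</code> and <code>is_tz_word</code> are
hypothetical and are not part of <code>&lt;chrono&gt;</code>; treating
an empty string as "not a word" is an assumption of this sketch. The
permitted characters are listed explicitly so that the check does not
depend on the current locale.
</p>

```cpp
#include <cassert>
#include <cstring>

// Hypothetical predicate: alphanumeric basic characters or one of _ / - +.
bool is_tz_word_char(char c) {
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_/-+";
    // strchr would also match the terminating null, so exclude it first.
    return c != '\0' && std::strchr(allowed, c) != nullptr;
}

// Hypothetical predicate: true if every character of a non-empty string
// is permitted in a %Z word.
bool is_tz_word(const char* s) {
    if (*s == '\0') {
        return false;  // assumption: an empty string is not a word
    }
    for (; *s != '\0'; ++s) {
        if (!is_tz_word_char(*s)) {
            return false;
        }
    }
    return true;
}
```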

Change in 28.4.2.2.3 [locale.ctype.virtuals] paragraphs 11 and 13:

<blockquote>
The only characters for which unique transformations are required are
those in the basic <del>source</del> character set (5.3
[lex.charset]).
<p>
[...]
<p>
For any character c in the basic <del>source</del> character set (5.3
[lex.charset]) the transformation is such that
<pre>do_widen(do_narrow(c, 0)) == c</pre>
</blockquote>
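
<p>
<em>Illustration (not proposed wording):</em> A minimal sketch that
checks the round-trip property above through the public
<code>narrow</code> and <code>widen</code> members
of <code>std::ctype&lt;wchar_t&gt;</code> in the classic locale. The
function name <code>round_trips_in_classic_locale</code> is an
assumption made for this example.
</p>

```cpp
#include <cassert>
#include <locale>
#include <string>

// Returns true if widen(narrow(c, 0)) == c holds for every character in
// `chars`, using the ctype<wchar_t> facet of the classic "C" locale.
bool round_trips_in_classic_locale(const std::wstring& chars) {
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t> >(std::locale::classic());
    for (wchar_t c : chars) {
        if (ct.widen(ct.narrow(c, 0)) != c) {
            return false;
        }
    }
    return true;
}
```

The guarantee covers only characters of the basic character set; for
other characters <code>narrow</code> may return the supplied default
and the round trip need not hold.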

Change in C.2.3 [diff.cpp14.lex]:

<blockquote>
<b>Affected subclause:</b> 5.2<br/>
<b>Change:</b> Removal of trigraph support as a required feature.<br/>
<b>Rationale:</b> Prevents accidental uses of trigraphs in non-raw string literals and comments.<br/>
<b>Effect on original feature:</b> Valid C++ 2014 code that uses trigraphs may not be valid or may have different
semantics in this revision of C++. Implementations may choose to translate trigraphs as specified in C++ 2014
if they appear outside of a raw string literal, as part of the implementation-defined mapping from physical
source file characters to the basic <del>source</del> character set.
</blockquote>

<h2>Acknowledgements</h2>

Thanks to Corentin Jabot for detailed discussions and for his related
paper P2297R0.

</body>
</html>
