<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=US-ASCII">
<title>Universal Character Names in Literals</title>
</head>

<body>
<h1>Universal Character Names in Literals</h1>

<p>ISO/IEC JTC1 SC22 WG21 N2170 = 07-0030 - 2007-02-02

<p>Lawrence Crowl

<h2>Problem: Excessive Exclusion of Some Characters</h2>

<p>The current standard prohibits using universal character names
to specify many characters,
in particular the control characters
(00-1F, 7F-9F)
and the basic source characters
(20-23 <q><code>&nbsp;!"#</code></q>,
25-3F <q><code>%&amp;'()*+,-./0123456789:;&lt;=&gt;?</code></q>,
41-5F <q><code>ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_</code></q>,
61-7E <q><code>abcdefghijklmnopqrstuvwxyz{|}~</code></q>).
By implication,
the standard permits specifying the printable ASCII characters
24 <q><code>$</code></q>,
40 <q><code>@</code></q>, and
60 <q><code>`</code></q>.

<p>While the prohibition against basic source characters
is generally not a significant problem,
the prohibition against control characters within character and string literals
causes programmers to fall back upon traditional escape sequences,
which makes the code more platform-dependent.

<p>For example, the high control characters of Unicode (80-9F)
occupy code points that have different meanings in windows-1252.
In UTF-8, those code points also have a different representation:
each is encoded as two bytes.
Written with traditional escape sequences for UTF-8,
<code>"\u0085"</code> would have to be <code>"\xC2\x85"</code>.
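<p>As a minimal sketch of the correspondence above,
the following assumes a C++11 or later compiler
that accepts control-character universal character names inside string literals.
The function name <code>nel_as_utf8</code> is hypothetical,
and the <code>u8</code> prefix is used only so that the result
does not depend on the execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the bytes of the UTF-8 encoding of U+0085,
// written portably with a universal character name.
std::string nel_as_utf8() {
    const auto& lit = u8"\u0085";  // array of code units plus terminating null
    std::string bytes;
    for (std::size_t i = 0; i + 1 < sizeof(lit); ++i) {
        // The cast keeps this valid whether the element type is char or char8_t.
        bytes.push_back(static_cast<char>(lit[i]));
    }
    return bytes;
}
```

<p>The platform-dependent spelling <code>"\xC2\x85"</code>
produces the same two bytes only when the target encoding is UTF-8.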

<h2>Problem: Excessive Inclusion of Some Values</h2>

<p>The current C++ standard
permits specification of universal characters
within the range D800 through DFFF inclusive.
These values do not identify characters;
each identifies one half of a surrogate pair.
The C 1999 standard prohibits specification of these values.

<p>This problem is core issue number 558,
and this paper proposes a solution to that issue.

<p>The only potential need for values within this range
is in code that processes strings.
In those rare cases, use of direct numeric constants
(e.g., <code>0xD83F</code>)
will suffice.
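<p>A minimal sketch of such string processing,
using only numeric constants for the surrogate range, might look as follows.
The helper names are hypothetical and the input is assumed to be UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers: surrogate tests written with plain numeric constants,
// so no universal character name in the range D800-DFFF is needed.
constexpr bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Counts code points in a UTF-16 string, treating each well-formed
// high/low surrogate pair as a single code point.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
    }
    return n;
}
```

<p>A character outside the basic multilingual plane is still written
with its own universal character name, for example <code>u"\U0001F600"</code>;
the compiler produces the surrogate pair.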

<h2>Solution</h2>

<p>We propose to lift the prohibitions
on universal character names that designate control characters
or basic source characters
when they appear within character and string literals.
We propose to add a prohibition
against surrogate values in all universal character names.

<p>The existing wording in the phases of translation (2.1)
and the existing grammar for character (2.13.2) and string (2.13.4) literals
prevent problems in parsing literals,
because interpretation of the universal character names
occurs after tokenization.
Because the prohibitions remain in force
outside of character and string literals,
the existing parse is not affected.
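<p>To illustrate the intended effect,
the following sketch assumes a compiler that implements the relaxed rule
(C++11 or later mode) and an ASCII-compatible execution character set.
The function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Assumed behavior under the proposal: inside a literal, a universal
// character name may designate a basic source or control character.
bool ucn_equals_plain_spelling() {
    return std::strcmp("\u0041\u000A", "A\n") == 0;
}

// Because universal character names are interpreted after tokenization,
// \u0022 does not terminate the literal; it contributes one quote character.
bool quote_ucn_stays_inside_literal() {
    return sizeof("\u0022") == 2 && "\u0022"[0] == '"';
}
```

<p>Outside a literal, for example in an identifier,
the same universal character names remain ill-formed.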

<h3>2.2 Character sets [lex.charset]</h3>

<p>In paragraph 2, edit as follows:

<blockquote>
The <var>universal-character-name</var> construct
provides a way to name other characters.
<blockquote>
<dl>
<dt><var>hex-quad:</var></dt>
<dd><var>hexadecimal-digit hexadecimal-digit
hexadecimal-digit hexadecimal-digit</var></dd>
<dt><var>universal-character-name:</var></dt>
<dd><code>\u</code> <var>hex-quad</var></dd>
<dd><code>\U</code> <var>hex-quad hex-quad</var></dd>
</dl>
</blockquote>
The character designated
by the universal-character-name <code>\UNNNNNNNN</code>
is that character whose character short name in ISO/IEC 10646
is <code>NNNNNNNN</code>;
the character designated
by the universal-character-name <code>\uNNNN</code>
is that character whose character short name in ISO/IEC 10646
is <code>0000NNNN</code>.
If the hexadecimal value for a universal character name
<ins>corresponds to a surrogate code point
(in the range 0xD800-0xDFFF, inclusive),
the program is ill-formed.
Additionally, if the hexadecimal value for a universal character name
outside a character or string literal</ins>
<del>is less than 0x20,
or in the range 0x7F-0x9F (inclusive),</del>
<ins>corresponds to a control character
(in either of the ranges 0x0-0x1F or 0x7F-0x9F, both inclusive)</ins>
or <ins>to</ins> <del>if the universal character name
designates</del> a character in the basic source character set,
<del>then</del> the program is ill-formed.
</blockquote>
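<p>The edited rule can be sketched as a predicate over the hexadecimal value.
This is an illustration only, not proposed wording:
all names are hypothetical,
the printable basic source characters are taken from the ranges listed earlier in this paper,
and the basic source control characters (such as tab and newline)
fall under the control-character test.
Restrictions stated elsewhere in the standard,
such as which characters may appear in identifiers, are not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers mirroring the ranges named in the edited paragraph.
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }
constexpr bool is_control(std::uint32_t v) {
    return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
}
// Printable basic source characters: everything in 20-7E except $, @ and `.
constexpr bool is_basic_source(std::uint32_t v) {
    return (v >= 0x20 && v <= 0x23) || (v >= 0x25 && v <= 0x3F) ||
           (v >= 0x41 && v <= 0x5F) || (v >= 0x61 && v <= 0x7E);
}

// True if a universal character name with value v passes the edited rule.
// Surrogates are rejected everywhere; control and basic source characters
// are rejected only outside character and string literals.
constexpr bool ucn_is_permitted(std::uint32_t v, bool inside_literal) {
    return !is_surrogate(v) &&
           (inside_literal || !(is_control(v) || is_basic_source(v)));
}
```

<p>For instance, <code>ucn_is_permitted(0x0085, true)</code> holds,
while <code>ucn_is_permitted(0x0085, false)</code>
and <code>ucn_is_permitted(0xD800, true)</code> do not.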

</body>
</html>
