<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang=""><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  
  <meta name="generator" content="mpark/wg21">
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
  <meta name="dcterms.date" content="2019-11-03">
  <title>C++ Identifier Syntax using Unicode Standard Annex 31</title>
  <style>
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
  </style>
  <style>
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
  { counter-reset: source-line 0; }
pre.numberSource code > span
  { position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
  { content: counter(source-line);
    position: relative; left: -1em; text-align: right; vertical-align: baseline;
    border: none; display: inline-block;
    -webkit-touch-callout: none; -webkit-user-select: none;
    -khtml-user-select: none; -moz-user-select: none;
    -ms-user-select: none; user-select: none;
    padding: 0 4px; width: 4em;
    color: #aaaaaa;
  }
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa;  padding-left: 4px; }
div.sourceCode
  {  background-color: #f6f8fa; }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span. { } /* Normal */
code span.al { color: #ff0000; } /* Alert */
code span.an { } /* Annotation */
code span.at { } /* Attribute */
code span.bn { color: #9f6807; } /* BaseN */
code span.bu { color: #9f6807; } /* BuiltIn */
code span.cf { color: #00607c; } /* ControlFlow */
code span.ch { color: #9f6807; } /* Char */
code span.cn { } /* Constant */
code span.co { color: #008000; font-style: italic; } /* Comment */
code span.cv { color: #008000; font-style: italic; } /* CommentVar */
code span.do { color: #008000; } /* Documentation */
code span.dt { color: #00607c; } /* DataType */
code span.dv { color: #9f6807; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #9f6807; } /* Float */
code span.fu { } /* Function */
code span.im { } /* Import */
code span.in { color: #008000; } /* Information */
code span.kw { color: #00607c; } /* Keyword */
code span.op { color: #af1915; } /* Operator */
code span.ot { } /* Other */
code span.pp { color: #6f4e37; } /* Preprocessor */
code span.re { } /* RegionMarker */
code span.sc { color: #9f6807; } /* SpecialChar */
code span.ss { color: #9f6807; } /* SpecialString */
code span.st { color: #9f6807; } /* String */
code span.va { } /* Variable */
code span.vs { color: #9f6807; } /* VerbatimString */
code span.wa { color: #008000; font-weight: bold; } /* Warning */
code.diff {color: #898887}
code.diff span.va {color: #006e28}
code.diff span.st {color: #bf0303}
  </style>
  <style type="text/css">
body {
margin: 5em;
font-family: serif;

hyphens: auto;
line-height: 1.35;
}
div.wrapper {
max-width: 60em;
margin: auto;
}
ul {
list-style-type: none;
padding-left: 2em;
margin-top: -0.2em;
margin-bottom: -0.2em;
}
a {
text-decoration: none;
color: #4183C4;
}
a.hidden_link {
text-decoration: none;
color: inherit;
}
li {
margin-top: 0.6em;
margin-bottom: 0.6em;
}
h1, h2, h3, h4 {
position: relative;
line-height: 1;
}
a.self-link {
position: absolute;
top: 0;
left: calc(-1 * (3.5rem - 26px));
width: calc(3.5rem - 26px);
height: 2em;
text-align: center;
border: none;
transition: opacity .2s;
opacity: .5;
font-family: sans-serif;
font-weight: normal;
font-size: 83%;
}
a.self-link:hover { opacity: 1; }
a.self-link::before { content: "§"; }
ul > li:before {
content: "\2014";
position: absolute;
margin-left: -1.5em;
}
:target { background-color: #C9FBC9; }
:target .codeblock { background-color: #C9FBC9; }
:target ul { background-color: #C9FBC9; }
.abbr_ref { float: right; }
.folded_abbr_ref { float: right; }
:target .folded_abbr_ref { display: none; }
:target .unfolded_abbr_ref { float: right; display: inherit; }
.unfolded_abbr_ref { display: none; }
.secnum { display: inline-block; min-width: 35pt; }
.header-section-number { display: inline-block; min-width: 35pt; }
.annexnum { display: block; }
div.sourceLinkParent {
float: right;
}
a.sourceLink {
position: absolute;
opacity: 0;
margin-left: 10pt;
}
a.sourceLink:hover {
opacity: 1;
}
a.itemDeclLink {
position: absolute;
font-size: 75%;
text-align: right;
width: 5em;
opacity: 0;
}
a.itemDeclLink:hover { opacity: 1; }
span.marginalizedparent {
position: relative;
left: -5em;
}
li span.marginalizedparent { left: -7em; }
li ul > li span.marginalizedparent { left: -9em; }
li ul > li ul > li span.marginalizedparent { left: -11em; }
li ul > li ul > li ul > li span.marginalizedparent { left: -13em; }
div.footnoteNumberParent {
position: relative;
left: -4.7em;
}
a.marginalized {
position: absolute;
font-size: 75%;
text-align: right;
width: 5em;
}
a.enumerated_item_num {
position: relative;
left: -3.5em;
display: inline-block;
margin-right: -3em;
text-align: right;
width: 3em;
}
div.para { margin-bottom: 0.6em; margin-top: 0.6em; text-align: justify; }
div.section { text-align: justify; }
div.sentence { display: inline; }
span.indexparent {
display: inline;
position: relative;
float: right;
right: -1em;
}
a.index {
position: absolute;
display: none;
}
a.index:before { content: "⟵"; }

a.index:target {
display: inline;
}
.indexitems {
margin-left: 2em;
text-indent: -2em;
}
div.itemdescr {
margin-left: 3em;
}
.bnf {
font-family: serif;
margin-left: 40pt;
margin-top: 0.5em;
margin-bottom: 0.5em;
}
.ncbnf {
font-family: serif;
margin-top: 0.5em;
margin-bottom: 0.5em;
margin-left: 40pt;
}
.ncsimplebnf {
font-family: serif;
font-style: italic;
margin-top: 0.5em;
margin-bottom: 0.5em;
margin-left: 40pt;
background: inherit; 
}
span.textnormal {
font-style: normal;
font-family: serif;
white-space: normal;
display: inline-block;
}
span.rlap {
display: inline-block;
width: 0px;
}
span.descr { font-style: normal; font-family: serif; }
span.grammarterm { font-style: italic; }
span.term { font-style: italic; }
span.terminal { font-family: monospace; font-style: normal; }
span.nonterminal { font-style: italic; }
span.tcode { font-family: monospace; font-style: normal; }
span.textbf { font-weight: bold; }
span.textsc { font-variant: small-caps; }
a.nontermdef { font-style: italic; font-family: serif; }
span.emph { font-style: italic; }
span.techterm { font-style: italic; }
span.mathit { font-style: italic; }
span.mathsf { font-family: sans-serif; }
span.mathrm { font-family: serif; font-style: normal; }
span.textrm { font-family: serif; }
span.textsl { font-style: italic; }
span.mathtt { font-family: monospace; font-style: normal; }
span.mbox { font-family: serif; font-style: normal; }
span.ungap { display: inline-block; width: 2pt; }
span.textit { font-style: italic; }
span.texttt { font-family: monospace; }
span.tcode_in_codeblock { font-family: monospace; font-style: normal; }
span.phantom { color: white; }

span.math { font-style: normal; }
span.mathblock {
display: block;
margin-left: auto;
margin-right: auto;
margin-top: 1.2em;
margin-bottom: 1.2em;
text-align: center;
}
span.mathalpha {
font-style: italic;
}
span.synopsis {
font-weight: bold;
margin-top: 0.5em;
display: block;
}
span.definition {
font-weight: bold;
display: block;
}
.codeblock {
margin-left: 1.2em;
line-height: 127%;
}
.outputblock {
margin-left: 1.2em;
line-height: 127%;
}
div.itemdecl {
margin-top: 2ex;
}
code.itemdeclcode {
white-space: pre;
display: block;
}
span.textsuperscript {
vertical-align: super;
font-size: smaller;
line-height: 0;
}
.footnotenum { vertical-align: super; font-size: smaller; line-height: 0; }
.footnote {
font-size: small;
margin-left: 2em;
margin-right: 2em;
margin-top: 0.6em;
margin-bottom: 0.6em;
}
div.minipage {
display: inline-block;
margin-right: 3em;
}
div.numberedTable {
text-align: center;
margin: 2em;
}
div.figure {
text-align: center;
margin: 2em;
}
table {
border: 1px solid black;
border-collapse: collapse;
margin-left: auto;
margin-right: auto;
margin-top: 0.8em;
text-align: left;
hyphens: none; 
}
td, th {
padding-left: 1em;
padding-right: 1em;
vertical-align: top;
}
td.empty {
padding: 0px;
padding-left: 1px;
}
td.left {
text-align: left;
}
td.right {
text-align: right;
}
td.center {
text-align: center;
}
td.justify {
text-align: justify;
}
td.border {
border-left: 1px solid black;
}
tr.rowsep, td.cline {
border-top: 1px solid black;
}
tr.even, tr.odd {
border-bottom: 1px solid black;
}
tr.capsep {
border-top: 3px solid black;
border-top-style: double;
}
tr.header {
border-bottom: 3px solid black;
border-bottom-style: double;
}
th {
border-bottom: 1px solid black;
}
span.centry {
font-weight: bold;
}
div.table {
display: block;
margin-left: auto;
margin-right: auto;
text-align: center;
width: 90%;
}
span.indented {
display: block;
margin-left: 2em;
margin-bottom: 1em;
margin-top: 1em;
}
ol.enumeratea { list-style-type: none; background: inherit; }
ol.enumerate { list-style-type: none; background: inherit; }

code.sourceCode > span { display: inline; }

div#refs p { padding-left: 32px; text-indent: -32px; }
</style>
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
  
</head>
<body>
<div class="wrapper">
<header id="title-block-header">
<h1 class="title" style="text-align:center">C++ Identifier Syntax using Unicode Standard Annex 31</h1>

<table style="border:none;float:right">
  <tbody><tr>
    <td>Document #: </td>
    <td>D1949R0</td>
  </tr>
  <tr>
    <td>Date: </td>
    <td>2019-11-03</td>
  </tr>
  <tr>
    <td style="vertical-align:top">Project: </td>
    <td>Programming Language C++<br>
      SG16<br>
      EWG<br>
      CWG<br>
    </td>
  </tr>
  <tr>
    <td style="vertical-align:top">Reply-to: </td>
    <td>
      Steve Downey<br>&lt;<a href="mailto:sdowney@gmail.com" class="email">sdowney@gmail.com</a>, <a href="mailto:sdowney2@bloomberg.net" class="email">sdowney2@bloomberg.net</a>&gt;<br>
    </td>
  </tr>
</tbody></table>

</header>
<div style="clear:both">
<h1 data-number="1" id="abstract"><span class="header-section-number">1</span> Abstract<a href="http://wiki.edg.com/pub/Wg21belfast/SG16/D1949R0.html#abstract" class="self-link"></a></h1>
<p>In response to NL 029: Disallow zero-width and control characters.</p>
<p>Adopt Unicode Standard Annex 31 as part of C++23:</p>
<ul>
<li>C++ identifiers match the pattern (XID_START + _ ) + XID_CONTINUE*.</li>
<li>Portable source is required to be normalized as NFC.</li>
<li>Using unassigned code points is ill-formed.</li>
</ul>
<h1 data-number="2" id="poll-before-discussion"><span class="header-section-number">2</span> Poll before discussion<a href="http://wiki.edg.com/pub/Wg21belfast/SG16/D1949R0.html#poll-before-discussion" class="self-link"></a></h1>
<p>The current state, which allows control characters, ZWJ, and unassigned code points in C++ identifiers, is not a defect, is working as designed, and does not need to be addressed.</p>
<h1 data-number="3" id="addressing-identifiers-in-a-more-principled-ways"><span class="header-section-number">3</span> Addressing identifiers in a more principled way<a href="http://wiki.edg.com/pub/Wg21belfast/SG16/D1949R0.html#addressing-identifiers-in-a-more-principled-ways" class="self-link"></a></h1>
<p><a href="https://unicode.org/reports/tr31/">UNICODE IDENTIFIER AND PATTERN SYNTAX</a> (UAX31) is an attempt to provide a normative way of specifying definitions of general-purpose identifiers for use in programming languages. It has evolved significantly over the years, particularly since C++11 was specified. The characters allowed in identifiers, and the patterns for combining them, were not stable at the time of C++11, which is the last time identifiers were addressed in the standard. In addition, at that time ISO was promulgating advice that recommended a list of code points as the method for ISO standards to specify identifiers.</p>
<p>Today the definitions in UAX31 can be used to provide stable definitions for programming language identifiers, with guarantees that an identifier will not be invalidated by later standards.</p>
<p>Originally, UAX31 relied on derived properties of characters, ID_START and ID_CONTINUE; however, those properties relied on fundamental properties that could change over time. The Unicode database now provides XID_START and XID_CONTINUE, which are based on the same characteristics but carry an additional stability guarantee, and it gives an explicit classification for both.</p>
<p>The original definitions, ID_Start and ID_Continue, closely match the identifier syntax of C:</p>
<table>
<colgroup>
<col style="width: 4%">
<col style="width: 95%">
</colgroup>
<thead>
<tr class="header">
<th><div style="text-align:center">
<strong>Properties</strong>
</div></th>
<th><div style="text-align:center">
<strong>General Description of Coverage</strong>
</div></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>ID_Start</td>
<td>ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.</td>
</tr>
<tr class="even">
<td></td>
<td>In set notation:</td>
</tr>
<tr class="odd">
<td></td>
<td>[\p{L}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
</tr>
<tr class="even">
<td>ID_Continue</td>
<td>ID_Continue characters include ID_Start characters, plus characters having the Unicode General_Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code points.</td>
</tr>
<tr class="odd">
<td></td>
<td>In set notation:</td>
</tr>
<tr class="even">
<td></td>
<td>[\p{ID_Start}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Other_ID_Continue}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</td>
</tr>
</tbody>
</table>
<p>The X versions of the properties, XID_Start and XID_Continue, start from the same definitions, but are guaranteed to be stable in subsequent versions of the Unicode Standard.</p>
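<p>As a rough illustration of the three rules proposed in the abstract, the following sketch checks a candidate identifier against them. It is written in Python only because Python's standard library exposes the Unicode Character Database; <code>str.isidentifier</code> is documented as following an XID_Start/XID_Continue grammar (plus underscore), and is assumed here to be a close stand-in for the proposed C++ pattern. The function name <code>check_identifier</code> and its result tags are hypothetical, and the results depend on the Unicode version bundled with the interpreter.</p>
<pre>
```python
import unicodedata


def check_identifier(name):
    """Return a list of rule tags that the candidate name violates."""
    problems = []

    # Rule 1: (XID_Start + _) followed by XID_Continue*.
    # Assumption: str.isidentifier is used as a stand-in for this pattern.
    if not name.isidentifier():
        problems.append("pattern")

    # Rule 2: the spelling must already be in Normalization Form C.
    if unicodedata.normalize("NFC", name) != name:
        problems.append("nfc")

    # Rule 3: no unassigned code points (General_Category Cn).
    if any(unicodedata.category(ch) == "Cn" for ch in name):
        problems.append("unassigned")

    return problems


if __name__ == "__main__":
    for candidate in ["_tmp1", "1abc", "caf\u00e9", "cafe\u0301", "a\u0378"]:
        print(ascii(candidate), check_identifier(candidate))
```
</pre>
<p>For example, <code>check_identifier("1abc")</code> reports a pattern violation, while a name spelled with a combining accent (<code>"cafe\u0301"</code>) passes the pattern but is flagged as not being in NFC.</p>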
<h1 data-number="4" id="issues"><span class="header-section-number">4</span> Issues<a href="http://wiki.edg.com/pub/Wg21belfast/SG16/D1949R0.html#issues" class="self-link"></a></h1>
<ul>
<li>XID_CONTINUE does not include ZWJ, which some scripts require</li>
<li>Does not exclude homoglyph attacks</li>
<li>Does not require the compiler to normalize identifiers</li>
<li>Does not allow emoji</li>
</ul>
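<p>The homoglyph, normalization, and emoji items in the list above can be observed with a short sketch. It again uses Python's <code>str.isidentifier</code> as an assumed stand-in for the (XID_START + _ ) + XID_CONTINUE* pattern; the variable names and the helper <code>same_after_nfc</code> are hypothetical and chosen only for illustration.</p>
<pre>
```python
import unicodedata

latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A, visually similar
composed = "\u00e9"      # U+00E9, a single precomposed code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
emoji = "\U0001F600"     # U+1F600 GRINNING FACE


def same_after_nfc(a, b):
    """True if two spellings become identical once both are normalized to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # Homoglyphs: both spellings match the pattern, yet they are different names.
    print(latin_a.isidentifier(), cyrillic_a.isidentifier(), latin_a == cyrillic_a)

    # Normalization: both spellings match the pattern and differ as code point
    # sequences; they only compare equal after an explicit NFC step.
    print(composed.isidentifier(), decomposed.isidentifier())
    print(composed == decomposed, same_after_nfc(composed, decomposed))

    # Emoji: rejected by the pattern.
    print(emoji.isidentifier())
```
</pre>
<p>Because the pattern alone accepts both spellings of the accented letter, a compiler that does not normalize would treat them as two distinct identifiers, which is why the proposal asks for portable source to be in NFC.</p>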
<h1 data-number="5" id="history"><span class="header-section-number">5</span> History<a href="http://wiki.edg.com/pub/Wg21belfast/SG16/D1949R0.html#history" class="self-link"></a></h1>
<p>Using an explicit list of Unicode characters was considered a best practice for ISO standardization in TR 10176:2003, Guidelines for the preparation of programming language standards.</p>
<p>National body comment CA 24 for C++11:</p>
<blockquote>
<p>A list of issues related TR 10176:2003:</p>
<ul>
<li>“Combining characters should not appear as the first character of an identifier.” Reference: ISO/IEC TR 10176:2003 (Annex A) This is not reflected in FCD.</li>
<li>Restrictions on the first character of an identifier are not observed as recommended in TR 10176:2003. The inclusion of digits (outside of those in the basic character set) under identifer-nondigit is implied by FCD.</li>
<li>It is implied that only the “main listing” from Annex A is included for C++. That is, the list ends with the Special Characters section. This is not made explicit in FCD. Existing practice in C++03 as well as WG 14 (C, as of N1425) and WG 4 (COBOL, as of N4315) is to include a list in a normative Annex.</li>
<li>Specify width sensitivity as implied by C++03: \uFF21 is not the same as A. Case sensitivity is already stated in [lex.name].</li>
</ul>
</blockquote>
<p>N3146 (2010-10-04) considered using UAX31, but at the time there were stability issues with identifiers, and the paper came down on the side of explicit whitelisting.</p>
<p>The Unicode standard has since made stability guarantees about identifiers, and created the XID_START and XID_CONTINUE properties to alleviate the stability concerns that existed in 2010.</p>
<h1 data-number="6" id="wording"><span class="header-section-number">6</span> Wording<a href="http://wiki.edg.com/pub/Wg21belfast/SG16/D1949R0.html#wording" class="self-link"></a></h1>
<p>Wording to follow based on SG16 and EWG guidance. There is much prior art to follow based on similar proposals and adoption in Rust and Swift.</p>
<p>Explicit universal character names and code points are available for particular Unicode standards from the published database, and could be included as an appendix.</p>
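<p>One way such an explicit list might be produced is sketched below. It scans code points and groups those accepted as an identifier start into inclusive ranges, printed in universal-character-name form. This is an illustration rather than a proposed tool: it assumes Python's <code>str.isidentifier</code> on a single character reflects XID_START (plus underscore, which is filtered out), the helper names <code>xid_start_ranges</code> and <code>format_ucn</code> are hypothetical, and the output depends on the Unicode version bundled with the interpreter.</p>
<pre>
```python
def xid_start_ranges(limit):
    """Collect inclusive (first, last) ranges of identifier-start code points up to limit."""
    ranges = []
    first = None  # start of the range currently being built, if any
    for cp in range(limit + 1):
        ch = chr(cp)
        # Assumption: a one-character string is an identifier exactly when the
        # character is XID_Start or underscore; underscore is excluded here.
        accepted = ch != "_" and ch.isidentifier()
        if accepted and first is None:
            first = cp
        elif not accepted and first is not None:
            ranges.append((first, cp - 1))
            first = None
    if first is not None:
        ranges.append((first, limit))
    return ranges


def format_ucn(cp):
    """Spell a code point as a C++ universal-character-name."""
    if cp in range(0x10000):
        return "\\u{:04X}".format(cp)
    return "\\U{:08X}".format(cp)


if __name__ == "__main__":
    # Only the first 256 code points are listed to keep the output short.
    for first, last in xid_start_ranges(0xFF):
        print(format_ucn(first), format_ucn(last))
```
</pre>
<p>Passing <code>sys.maxunicode</code> as the limit would cover the full code space; the same approach with a two-character probe could list the continuation characters.</p>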
</div>
</div>


</body></html>