<html>
<head>
<title>Digit Separators coming back</title>

<style type="text/css">
  ins { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
  del { text-decoration:line-through; background-color:#FFA0A0 }
</style>

</head>

<body>

N3342=12-0032<br/>
Jens Maurer<br/>
2012-01-09<br/>

<h1>Digit Separators coming back</h1>

<h2>Introduction</h2>

<p>
This paper proposes syntax extensions to C++ in order to be able to
write large numeric literals with separators between the digits to
make them more readable.
</p>

<p>
This paper is largely based on N2281 = 07-0141 "Digit Separators" by
Lawrence Crowl.  The proposed wording changes have been updated for
C++11 (more specifically, the latest working draft N3290).
</p>

<p>
This paper does not propose to add binary literals or hexadecimal
floating-point literals; those are considered largely independent of
this paper and thus can be addressed separately.
</p>


<h2>Motivation</h2>

<p>
For most people, reading large numbers without additional (redundant)
visual cues is hard.  Examples:
</p>

<ul>
<li>pronounce 7237498123</li>
<li>compare 237498123 with 237499123 for equality</li>
<li>decide whether 237499123 or 20249472 is larger</li>
</ul>

<p>
Adding visual cues helps, for example spaces:
</p>

<ul>
<li>pronounce 7 237 498 123</li>
<li>compare 237 498 123 with 237 499 123 for equality</li>
<li>decide whether 237 499 123 or 20 249 472 is larger</li>
</ul>

<p>
An alternative visual cue is the underscore, which is often used
elsewhere as a space-lookalike character when forming identifiers
(without violating identifier syntax):
</p>

<ul>
<li>pronounce 7_237_498_123</li>
<li>compare 237_498_123 with 237_499_123 for equality</li>
<li>decide whether 237_499_123 or 20_249_472 is larger</li>
</ul>


<h2>Discussion</h2>

<p>
Using a space character could split a literal into two
or more <var>preprocessing-token</var>s, with rather substantial impact
not only on the lexing phase, but also on the parsing phase of C++.
Therefore, this paper proposes to use the underscore variant.
</p>

<p>
Using underscores conflicts with user-defined literals.  Appropriate
disambiguation is already provided by the current wording (see
2.14.8 lex.ext paragraph 1), but the example there can be improved for
the new situation.

In effect, the suffix of a user-defined literal
may not start with an underscore followed by a digit.  Given that
user-defined literal suffixes are already severely constrained (see
2.14.8 lex.ext and 17.6.4.3.5 usrlit.suffix), this seems to be a mild
inconvenience for the next revision of the standard.
</p>
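
<p>
As an illustration of the suffix constraint, the following C++11 sketch
defines a user-defined literal whose suffix starts with an underscore
followed by a letter, which stays unambiguous under this proposal.  The
type <code>Metres</code>, the suffix <code>_km</code> and the variable
<code>marathon</code> are hypothetical names chosen for this example only.
</p>

<pre>
// Hypothetical unit type for illustration; not part of the proposal.
struct Metres { unsigned long long value; };

// The suffix _km starts with underscore-letter, so 42_km cannot be
// read as the digits 42 followed by a separator and more digits.
constexpr Metres operator""_km(unsigned long long km)
{
  return Metres{km * 1000ULL};
}

constexpr Metres marathon = 42_km;
static_assert(marathon.value == 42000ULL, "42_km is 42000 metres");

// A suffix such as _5 would collide: under the proposed grammar,
// 42_5 would be the integer literal 425, not a use of operator""_5.
</pre>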


<h2>Wording Changes</h2>

<p>
The grammar production <em>pp-number</em> in 2.10 lex.ppnumber already
permits underscores inside (via <em>identifier-nondigit</em> and
<em>nondigit</em>).  No changes are necessary.
</p>

<p>
Change in 2.14.2 lex.icon:
</p>

<blockquote>
<pre>
<em>decimal-literal:
       nonzero-digit
       decimal-literal <ins>underscore<sub>opt</sub></ins> digit</em>

<em>octal-literal:
       0
       octal-literal <ins>underscore<sub>opt</sub></ins> octal-digit</em>

<em>hexadecimal-literal:
       0x hexadecimal-digit
       0X hexadecimal-digit
       hexadecimal-literal <ins>underscore<sub>opt</sub></ins> hexadecimal-digit</em>

<ins><em>underscore:</em> _</ins>
</pre>
</blockquote>

<p>
Change in 2.14.2 lex.icon paragraph 1:
</p>

<blockquote>
An <em>integer literal</em> is a sequence of digits that has no period
or exponent part<ins>, with optional separating underscores that are
ignored when determining its value</ins>. ...
[ Example: the number twelve can be written 12, <ins>1_2,</ins> 014,
<ins>01_4,</ins> or 0XC. -- end example ]
</blockquote>
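
<p>
The rule that separating underscores are ignored when determining the
value can be sketched as a small library function.  This is only an
illustration under stated assumptions, not how a compiler is required to
implement it: the names <code>digit_value</code> and
<code>literal_value</code> are hypothetical, suffixes are not handled,
overflow is not checked, and the letters a to f are assumed to be
contiguous in the execution character set.
</p>

<pre>
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;stdexcept&gt;
#include &lt;string&gt;

// Hypothetical helper: numeric value of a digit character, or -1.
int digit_value(char c)
{
  if (c &gt;= '0' &amp;&amp; c &lt;= '9') return c - '0';
  if (c &gt;= 'a' &amp;&amp; c &lt;= 'f') return c - 'a' + 10;
  if (c &gt;= 'A' &amp;&amp; c &lt;= 'F') return c - 'A' + 10;
  return -1;
}

// Value of an unsuffixed integer literal written with the proposed
// separators.  Throws std::invalid_argument if the text does not match
// the proposed grammar; overflow is not checked in this sketch.
std::uint64_t literal_value(const std::string&amp; text)
{
  if (text.empty()) throw std::invalid_argument("empty literal");

  unsigned base = 10;
  std::size_t pos = 0;
  bool have_digit = false;       // at least one digit consumed so far
  if (text[0] == '0') {
    if (text.size() &gt; 1 &amp;&amp; (text[1] == 'x' || text[1] == 'X')) {
      base = 16;                 // a digit must follow the 0x prefix
      pos = 2;
    } else {
      base = 8;                  // the leading 0 starts an octal-literal
      pos = 1;
      have_digit = true;
    }
  }

  std::uint64_t value = 0;
  bool after_underscore = false;
  for (; pos &lt; text.size(); ++pos) {
    const char c = text[pos];
    if (c == '_') {
      // An underscore may only appear between two digits.
      if (!have_digit || after_underscore)
        throw std::invalid_argument("misplaced underscore");
      after_underscore = true;
      continue;                  // ignored when determining the value
    }
    const int d = digit_value(c);
    if (d &lt; 0 || static_cast&lt;unsigned&gt;(d) &gt;= base)
      throw std::invalid_argument("invalid digit");
    value = value * base + static_cast&lt;unsigned&gt;(d);
    have_digit = true;
    after_underscore = false;
  }
  if (!have_digit || after_underscore)
    throw std::invalid_argument("literal must end in a digit");
  return value;
}
</pre>

<p>
With this sketch, the spellings 12, 1_2, 014, 01_4 and 0XC from the
example all yield the same value, while 1__2, 12_ and 0x_C are rejected
because the grammar allows at most one underscore, and only between two
digits.
</p>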

<p>
Change in 2.14.4 lex.fcon:
</p>

<blockquote>
<pre>
<em>digit-sequence:
       digit
       digit-sequence <ins>underscore<sub>opt</sub></ins> digit</em>
</pre>
</blockquote>

<p>
Change in 2.14.4 lex.fcon paragraph 1:
</p>

<blockquote>
... The integer and fraction parts both consist of a sequence of
decimal (base ten) digits<ins>, with optional separating underscores
that are ignored when determining the value</ins>. ...
</blockquote>


<p>
Change in 2.14.8 lex.ext paragraph 1:
</p>

<blockquote>
If a token matches both <em>user-defined-literal</em> and another literal kind,
it is treated as the latter. [ Example: 123_km is a
<em>user-defined-literal</em>, but <ins>123_456 and 12LL are
<em>integer-literal</em>s</ins> <del>12LL is an
<em>integer-literal</em></del>. -- end example ] ...
</blockquote>
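
<p>
The disambiguation in the example above can be sketched as a token
classifier.  This is a simplified illustration, not proposed wording: it
handles decimal tokens only, and the names <code>Kind</code>,
<code>is_integer_suffix</code>, <code>is_identifier</code> and
<code>classify_decimal</code> are hypothetical.  A token that matches the
proposed <em>integer-literal</em> grammar is treated as such; only
otherwise is the remainder considered as a user-defined suffix.
</p>

<pre>
#include &lt;cctype&gt;
#include &lt;cstddef&gt;
#include &lt;string&gt;

enum class Kind { IntegerLiteral, UserDefinedLiteral, Neither };

bool is_digit(char c) { return c &gt;= '0' &amp;&amp; c &lt;= '9'; }

// True if s is one of the C++11 integer-suffix spellings (u, LL, ul, ...).
bool is_integer_suffix(const std::string&amp; s)
{
  static const char* const unsigned_part[] = {"u", "U"};
  static const char* const long_part[] = {"l", "L", "ll", "LL"};
  for (const char* u : unsigned_part) {
    if (s == u) return true;
    for (const char* l : long_part) {
      if (s == l || s == std::string(u) + l || s == std::string(l) + u)
        return true;
    }
  }
  return false;
}

// True if s has the shape of an identifier (basic characters only).
bool is_identifier(const std::string&amp; s)
{
  if (s.empty() || is_digit(s[0])) return false;
  for (char c : s) {
    if (c != '_' &amp;&amp; !std::isalnum(static_cast&lt;unsigned char&gt;(c)))
      return false;
  }
  return true;
}

// Classifies a token that starts like a decimal-literal.
Kind classify_decimal(const std::string&amp; tok)
{
  if (tok.empty() || tok[0] &lt; '1' || tok[0] &gt; '9') return Kind::Neither;

  // Consume digits, and underscores that are directly followed by a digit.
  std::size_t pos = 1;
  while (pos &lt; tok.size()) {
    if (is_digit(tok[pos])) {
      ++pos;
    } else if (tok[pos] == '_' &amp;&amp; pos + 1 &lt; tok.size()
               &amp;&amp; is_digit(tok[pos + 1])) {
      pos += 2;
    } else {
      break;
    }
  }

  const std::string suffix = tok.substr(pos);
  // A match as integer-literal takes precedence over user-defined-literal.
  if (suffix.empty() || is_integer_suffix(suffix)) return Kind::IntegerLiteral;
  if (is_identifier(suffix)) return Kind::UserDefinedLiteral;
  return Kind::Neither;
}
</pre>

<p>
Under these assumptions, 123_456 and 12LL are classified as integer
literals, while 123_km is classified as a user-defined literal.
</p>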
