<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=US-ASCII">

<title>Ambiguity and Insecurity with User-Defined Literals</title>
</head>

<body>
<h1>Ambiguity and Insecurity with User-Defined Literals</h1>

<p>
ISO/IEC JTC1 SC22 WG21 N2747 = 08-0257 - 2008-08-24
</p>

<p>
Lawrence Crowl, Lawrence@Crowl.org, crowl@google.com
</p>


<h2>Introduction</h2>

<p>
The proposal for user-defined literals,
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2378.pdf">N2378
User-defined Literals (aka. Extensible Literals (revision 3))</a>,
has significant ambiguity and insecurity.
These problems should be solved
before the feature is added to the language
to prevent significant harm.
</p>

<p>
The use cases for user-defined literals are:
</p>
<ul>
<li>
Compatibility with C language evolution
via library changes instead of lexical changes.
The primary example is decimal floating point.
</li>
<li>
Representation of new low-level numeric types.
For example, a type representing probabilities.
</li>
<li>
Specification of scientific units.
For example, pressure in kPa.
</li>
</ul>

<p>
The current proposal for user-defined literals
may inhibit the use of user-defined literals in exactly those cases.
</p>


<h2>Lookup Ambiguity</h2>

<p>
User-defined literals can be defined via literal operators.
These literal operators may be defined in namespace scope
(N2378 section 5 "Proposed Wording",
modifying clause 13.5.8 over.literal, paragraph 1).
</p>

<p>
The definition of these operators in a nested namespace
with a using directive in the global namespace
(N2378 section 3.5 "An idiom")
effectively provides for conflict-free <em>definition</em>
of literal operators.
</p>

<p>
Unfortunately, unambiguous definition of a literal operator
does not imply unambiguous <em>invocation</em>.
</p>

<p>
This ambiguity will be compounded
because suffixes will tend to be very short,
so collisions are likely.
For example, competition for SI suffixes
will likely be fierce and immediate,
particularly as the C++ committee
is not yet ready to standardize SI types or their literals.
</p>

<p>
Consider two libraries that export literal operators
in the manner suggested by the idiom.
</p>

<pre><code>
// library ping.h
namespace ping {
     struct X {};
     namespace literals {
         X operator "foo"( unsigned long long );
         X operator "bar"( unsigned long long );
     }
}
using namespace ping::literals;

// library pong.h
namespace pong {
     struct Y {};
     namespace literals {
         Y operator "foo"( unsigned long long );
         Y operator "bar"( unsigned long long );
     }
}
using namespace pong::literals;

// application.cc

auto u = 1foo;
auto v = 2bar;

namespace applic {
    auto u = 1foo;
    auto v = 2bar;
}

int main() {
    auto a = 1foo;
    auto b = 2bar;
}
</code></pre>

<p>
In this example,
any simple use of either suffix
will result in an ambiguity.
</p>


<h3>Existing Solution: Using Declarations</h3>

<p>
Daveed Vandevoorde notes that some of this ambiguity can be resolved
with a using declaration.
For example,
</p>

<pre><code>
int main() {
    using ping::literals::operator "foo";
    using pong::literals::operator "bar";
    auto a = 1foo;
    auto b = 2bar;
}
</code></pre>

<p>
Unfortunately, this solution does not work at the global namespace
because the using declaration is redundant with the existing using directive.
The solution also does not work
when using the same suffix from two different libraries.
</p>
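<p>
The block-scope case can be sketched in compilable form.
The sketch below is an illustration only:
it uses the literal operator syntax later standardized in C++11
(<code>operator""_foo</code>) rather than the N2378 syntax,
and the <code>value</code> members, the scaling by 10,
and the helper names <code>from_ping</code> and <code>from_pong</code>
are assumptions introduced to make the selected operator observable.
</p>

```cpp
// Hypothetical libraries, following the header-side idiom from N2378.
namespace ping {
    struct X { unsigned long long value; };
    namespace literals {
        constexpr X operator""_foo(unsigned long long n) { return X{n}; }
    }
}
using namespace ping::literals;

namespace pong {
    struct Y { unsigned long long value; };
    namespace literals {
        // Same suffix as ping; scaled by 10 only so the two can be told apart.
        constexpr Y operator""_foo(unsigned long long n) { return Y{n * 10}; }
    }
}
using namespace pong::literals;

// auto u = 1_foo;  // would not compile here: both operators are visible

// A block-scope using declaration is found before the names made
// visible by the global using directives, so the literal is unambiguous.
constexpr unsigned long long from_ping() {
    using ping::literals::operator""_foo;
    return (1_foo).value;
}

constexpr unsigned long long from_pong() {
    using pong::literals::operator""_foo;
    return (1_foo).value;
}
```

<p>
Note that each function can select only one library per suffix;
a function needing <code>_foo</code> from both libraries
is not helped by this technique.
</p>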


<h3>Proposed Solution: Modified Idiom</h3>

<p>
Much of the ambiguity in the existing idiom
arises from headers
providing the using directive for the namespace of literals.
Modifying the idiom so that clients,
not libraries,
provide the using directive would substantially reduce
the number of likely collisions.
</p>

<p>
In the following example,
all literals will be selected from the library <code>ping</code>.
</p>

<pre><code>
// library ping.h
namespace ping {
     struct X {};
     namespace literals {
         X operator "foo"( unsigned long long );
         X operator "bar"( unsigned long long );
     }
}

// library pong.h
namespace pong {
     struct Y {};
     namespace literals {
         Y operator "foo"( unsigned long long );
         Y operator "bar"( unsigned long long );
     }
}

// application.cc

using namespace ping::literals;
auto u = 1foo;
auto v = 2bar;

namespace applic {
    auto u = 1foo;
    auto v = 2bar;
}

int main() {
    auto a = 1foo;
    auto b = 2bar;
}
</code></pre>
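<p>
For readers who want to try the modified idiom with a current compiler,
here is a rough equivalent in the C++11 literal operator syntax.
The <code>library</code> and <code>value</code> members are hypothetical
and exist only so that the chosen library can be checked;
they are not part of the idiom.
</p>

```cpp
// Hypothetical library headers: no using directive is exported.
namespace ping {
    struct X { int library; unsigned long long value; };
    namespace literals {
        constexpr X operator""_foo(unsigned long long n) { return X{1, n}; }
        constexpr X operator""_bar(unsigned long long n) { return X{1, n}; }
    }
}

namespace pong {
    struct Y { int library; unsigned long long value; };
    namespace literals {
        constexpr Y operator""_foo(unsigned long long n) { return Y{2, n}; }
        constexpr Y operator""_bar(unsigned long long n) { return Y{2, n}; }
    }
}

// Application code: the client, not the library, chooses the suffix set.
using namespace ping::literals;

constexpr auto u = 1_foo;

namespace applic {
    constexpr auto v = 2_bar;
}
```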


<h3>Existing Solution: Call Syntax</h3>

<p>
Again, Daveed Vandevoorde notes that
one can avoid these ambiguities by using function call syntax,
either the operator form or perhaps some constructor.
For example,
</p>

<pre><code>
int main() {
    auto a = ping::literals::operator"foo"(1);
    auto b = pong::Y(2);
}
</code></pre>

<p>
The first alternative suffers from verbosity.
The second alternative suffers
from two problems.
First, the intent of a literal is lost.
Second, there is potential for undesirable function overloading.
</p>
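<p>
A minimal sketch of both alternatives, again in C++11 syntax,
may make the trade-off easier to see.
The <code>value</code> members and the explicit constructor of
<code>pong::Y</code> are assumptions made for the illustration.
</p>

```cpp
namespace ping {
    struct X { unsigned long long value; };
    namespace literals {
        constexpr X operator""_foo(unsigned long long n) { return X{n}; }
    }
}

namespace pong {
    struct Y {
        unsigned long long value;
        // Hypothetical constructor standing in for a literal.
        constexpr explicit Y(unsigned long long n) : value(n) {}
    };
}

// First alternative: name the literal operator explicitly (verbose).
constexpr ping::X a = ping::literals::operator""_foo(1);

// Second alternative: an ordinary constructor call; nothing at the
// call site says that a literal was intended.
constexpr pong::Y b = pong::Y(2);
```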


<h3>Proposed Solution: Qualified Literals</h3>

<p>
The general solution is to provide qualified literals,
which enable fine-grained selection of the appropriate literal operator.
The evolution subcommittee discussed this option,
but ultimately did not choose it.
Perhaps that choice should be revisited.
</p>


<h2>Parse Ambiguity</h2>

<p>
Any user-defined suffix starting with
[<code>A</code>-<code>F</code><code>a</code>-<code>f</code>]
has a potential conflict with hexadecimal notation.
While the proposal defines away ambiguity
(N2378 section 5 "Proposed Wording",
modifying clause 13.5.8 over.literal, paragraph 0),
that definition effectively means that suffixes
intended for use with arbitrary integers must
avoid more than 22% of the available suffix namespace.
Some of these effectively prohibited suffixes
would otherwise be the natural choice.
The same applies to floating point values,
e.g. "e" as electron charges,
though to a lesser degree.
</p>

<p>
As Daveed Vandevoorde points out,
there is also a more subtle ambiguity.
Some letters are visually similar to digits,
which could lead to misinterpretation by readers
on casual reading.
Daveed's example is:
</p>

<blockquote>
<p>
I might introduce "units" for memory sizes:
<code>B</code> for bytes, <code>KB</code> for kilobytes,
<code>MB</code> for megabytes etc.
Unfortunately,
</p>
<blockquote>
<p>
<code>size_t memsize = 11B;</code>
</p>
</blockquote>
<p>
and
</p>
<blockquote>
<p>
<code>size_t memsize = 118;</code>
</p>
</blockquote>
<p>
look a lot like each other, ...
(It gets worse even
if you have suffixes that start with l (letter ell) or O (letter oh).)
</p>
</blockquote>


<h3>Existing Solution: Leading Underscore</h3>

<p>
The intended solution to these parse ambiguities
is to define a suffix with leading underscore,
which separates the meaningful part of the suffix
from the remainder of the literal.
Continuing Daveed's example,
</p>
<blockquote>
<p>
so I prefer to separate the suffix with an underscore:
</p>
<blockquote>
<p>
<code>size_t memsize = 11_B;</code>
</p>
</blockquote>
</blockquote>

<p>
Unfortunately,
the ambiguity reappears with digit separators.
The intent of the current proposal is to be compatible
with any future adoption of 
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2281.html">N2281
Digit Separators</a>
(N2378 section 4 "Use cases", paragraph 1).
However, it fails to achieve that goal
because the syntax of separated digits matches
the syntax of user-defined literals starting with an underscore.
For example, <code>0xAB_B</code>
is ambiguously <code>0xABB</code>
or <code>operator"_B"(0xAB)</code>.
(Likewise, <code>11_B</code>
is visually similar to the number <code>11_8</code>,
though that particular construct is less likely.)
</p>
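<p>
As a point of comparison, the sketch below shows the suffix reading
under the rules eventually published in C++11,
which has no underscore digit separator.
The suffixes <code>_B</code> and <code>_KB</code> follow Daveed's example;
the factor of 1024 and the variable names are assumptions of this sketch.
</p>

```cpp
// Hypothetical memory-size suffixes, written in C++11 syntax.
constexpr unsigned long long operator""_B(unsigned long long n) { return n; }
constexpr unsigned long long operator""_KB(unsigned long long n) { return n * 1024; }

constexpr unsigned long long plain = 11_B;    // operator""_B(11)
constexpr unsigned long long hex   = 0xAB_B;  // operator""_B(0xAB), not 0xABB
constexpr unsigned long long kilo  = 2_KB;    // operator""_KB(2)
```

<p>
If underscore were also a digit separator,
the initializer of <code>hex</code> would be exactly the ambiguous token
described above.
</p>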

<h3>Possible Solution: Prefer Digit Separator Interpretation</h3>

<p>
As Daveed Vandevoorde points out,
one possible resolution to this problem
is to simply require the compiler
to disambiguate digit separators and literal suffixes.
</p>

<p>
Unfortunately, this approach will cause arbitrary
interpretation of inherently ambiguous tokens.
</p>

<p>
The problem will be exacerbated
because,
in the absence of digit separators,
programmers will be well motivated to add user-defined literals
for the sole purpose of achieving digit separation.
For example,
programmers are likely to define
</p>
<blockquote>
<p>
<code>int operator"_000"( unsigned long long n ) { return 1000*n; }</code>
</p>
</blockquote>
<p>
to clarify the magnitude of literals, as in
<code>123_000</code>.
</p>

<p>
(At the same time,
because every suffix must be defined individually,
literal operators are an insufficient mechanism for digit separation:
they can cover only a sparse set of values,
such as round thousands and millions.)
</p>
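<p>
The following sketch, in C++11 syntax, shows what such pseudo-separators
might look like and why they cover only a sparse set.
It returns <code>unsigned long long</code> rather than <code>int</code>
to avoid narrowing, and the <code>_000_000</code> operator and the
variable names are additions made for the illustration.
</p>

```cpp
// Suffixes abused as digit separators; each one must be written by hand.
constexpr unsigned long long operator""_000(unsigned long long n) {
    return 1000 * n;
}
constexpr unsigned long long operator""_000_000(unsigned long long n) {
    return 1000000 * n;
}

constexpr unsigned long long thousands = 123_000;    // operator""_000(123)
constexpr unsigned long long millions  = 5_000_000;  // operator""_000_000(5)

// constexpr unsigned long long other = 12_345;
// would not compile: no operator""_345 has been defined.
```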

<p>
Any later change to the standard to recognize digit separators
will invalidate such code.
</p>

<p>
Finally, users who do define literal operators for the purposes
of digit separation
effectively exclude the use of all other literal operators
because the proposal permits invocation of only one literal operator
per literal.
</p>


<h3>Proposed Solution: Preemptive Adoption of Digit Separators</h3>

<p>
We can prevent misuse of user-defined literals for digit separation
by preemptively adopting digit separators.
</p>


<h3>Proposed Solution: Double Leading Underscore</h3>

<p>
Both with and without digit separators,
some ambiguity remains.
Rather than provide conventions for disambiguation,
it is preferable to prevent ambiguity in the first place.
We can achieve this by using a double underscore to separate
the value from the suffix.
</p>

<p>
For example, <code>0xAB__B</code> would be unambiguously
a suffix because
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2281.html">N2281</a>
admits only a single underscore between digits.
</p>

<p>
The standard can enforce this double underscore separation
either by introducing it as syntax,
or by simply requiring the operator identifier
to have two underscores rather than one.
</p>


<h3>Retained Solution: No Leading Underscore</h3>

<p>
For compatibility with C,
literals must retain the capability for suffixes
with no leading underscores.
There is no proposal to remove that capability.
</p>


<h3>Proposed Solution: Qualified Literals</h3>

<p>
With a double underscore separating value from suffix,
we are very close to qualified names.
One could instead separate value from suffix with the scope operator.
For example,
</p>
<blockquote>
<p>
<code>size_t memsize = 11::B;</code>
</p>
</blockquote>

<p>
Admittedly, this suffix form of qualification
would be somewhat unnatural.
</p>

<p>
Daveed Vandevoorde points out
that there is a potential problem with this approach
in the code
</p>
<blockquote>
<p>
<code>extern "C++"::X f();</code>
</p>
</blockquote>
<p>
However, in this one special case in the syntax,
user-defined literals are inappropriate as well.
Normal literals cannot have a following scope operator.
</p>


<h2>Evolution Insecurity</h2>

<p>
Because of the high degree of ambiguity,
use of user-defined literals will expose programs
to long-term instability.
Because users of literals must be <code>using</code> the definition,
they must extend the set of using declarations and definitions,
and are vulnerable to any change in the using environment.
</p>

<p>
A concrete example of the insecurity
is the subsequent addition to the standard
of a new suffix.
Once C++0x is published,
and users start to define literal operators,
any addition to the standard of a new suffix
will potentially invalidate those literal operators.
This problem is particularly vexing because
the C language has no such problem
and is thus free to add suffixes at will.
Thus, C++ will have the unfortunate choice of
either being incompatible with C at the lexical level,
or breaking user code.
Since suffixes will tend to be short,
that breakage is likely.
</p>

<p>
Further,
the C++ standard is likely to define new literal operators over time.
Again, these are likely to be highly ambiguous
with user-written literal operators.
Again, C++ will have the unfortunate choice of
either not introducing new literal operators
or breaking user code.
Since suffixes will tend to be short,
that breakage is likely.
</p>


<h3>Proposed Solution: Reserve "Jammed" Literals</h3>

<p>
One solution to this problem is to simply reserve
suffixes with no leading underscores
(or with no scope operator)
to the standard.
This approach has several advantages.
</p>
<ul>
<li>
It enables future compatibility with all likely future C suffixes.
</li>
<li>
It provides a namespace for C++ suffixes that cannot conflict
with user suffixes.
</li>
<li>
It puts the hexadecimal and floating literal ambiguity problems
in the hands of the C++ committee,
where such concerns are likely to get more attention.
</li>
<li>
It enables automatic lookup in namespace <code>std</code>
without requiring any using directives or declarations.
</li>
</ul>


</body>
</html>
