<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=us-ascii">
<title>Skeleton Proposal for Thread-Local Storage (TLS)</title>
</head>
<body>
<h1>Skeleton Proposal for Thread-Local Storage (TLS)</h1>

<p>
ISO/IEC JTC1 SC22 WG21 P0108R0 - 2015-09-24
</p>

<p>
Paul E. McKenney, paulmck@linux.vnet.ibm.com<br>
JF Bastien, jfb@google.com<br>
TBD
</p>

<h2>Introduction</h2>

<p>
This document is a follow-on to
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4376.html">N4376</a>,
and provides an initial description of a potential solution to the TLS
problem statement implied by that document.

<h2>Summary of Problem Statement</h2>

<p>
We expect that lightweight executors will have problems
with TLS as currently envisioned and implemented.
For example, some types of executors nest hierarchically, so that
a number of light-weight executors might run in the context of
a single heavy-weight <tt>std::thread</tt>.
If a given function accesses TLS, and is called both from the context
of a <tt>std::thread</tt> and from the context of a task executing
within an <tt>std::thread</tt>, what should its TLS accesses do?
If the instances invoked from a task access task-level TLS data,
the function must do different things when invoked in different contexts.
If the <tt>std::thread</tt>-level TLS data is accessed, then the
task-level accesses might introduce data races and thus undefined behavior.
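<p>
The following minimal sketch illustrates the second case, in which tasks see the <tt>std::thread</tt>-level TLS data. The names <tt>sketch::tls_counter</tt>, <tt>sketch::bump</tt>, and <tt>sketch::run_tasks_on_one_thread</tt> are hypothetical and exist only for illustration. The toy "executor" runs its tasks one after another, so there is no data race here, but every task observes the updates made by the previous one.
</p>

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

namespace sketch {

// Per-std::thread counter, standing in for any TLS datum.
thread_local int tls_counter = 0;

// A function that uses TLS without knowing which executor called it.
inline int bump() { return ++tls_counter; }

// Hypothetical toy "executor": runs every task, one after another, on a
// single freshly created std::thread and returns what the last task saw.
inline int run_tasks_on_one_thread(
    const std::vector<std::function<int()>>& tasks) {
  assert(!tasks.empty());
  int last = 0;
  std::thread worker([&] {
    for (const auto& task : tasks) last = task();
  });
  worker.join();  // join() makes the write to 'last' visible here
  return last;
}

}  // namespace sketch
```

<p>
Three tasks that each call <tt>sketch::bump()</tt> share one counter because they share one <tt>std::thread</tt>. Had the tasks been interleaved or run concurrently within that thread, the same accesses would interfere with one another, or race, as described above.
</p>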

<p>
This also can interact with signal handling.
To see this, suppose that a signal arrives at a
<tt>std::thread</tt> while that <tt>std::thread</tt>
is running a light-weight executor, for example, a task.
The signal handler will likely conceptually be part of the
<tt>std::thread</tt> rather than the task.
This would imply some additional context switching at signal-handler
start and end.

<p>
TLS is especially a problem for light-weight executors
implementing single-instruction-multiple-data (SIMD) units
and general-purpose graphics processing units (GPGPUs)
because large programs can have very large amounts of TLS data,
each item of which might have C++ constructors and destructors.
Spending many milliseconds to run constructors and destructors
for a SIMD computation that only takes a few microseconds to run
is clearly not a reasonable way to achieve high performance.
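<p>
As a rough illustration of the constructor and destructor overhead, the sketch below counts how often a TLS object is constructed and destroyed when several threads touch it. The type <tt>sketch::Tracked</tt> and the functions <tt>sketch::touch_tls</tt> and <tt>sketch::run_threads</tt> are hypothetical names introduced for this example only.
</p>

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

namespace sketch {

std::atomic<int> constructed{0};
std::atomic<int> destroyed{0};

// Stand-in for a TLS object with a nontrivial constructor and destructor.
struct Tracked {
  Tracked() { ++constructed; }
  ~Tracked() { ++destroyed; }
  int payload = 0;
};

// A block-scope thread_local is constructed the first time each thread
// passes through its declaration and destroyed when that thread exits.
inline int touch_tls() {
  thread_local Tracked t;
  return ++t.payload;
}

// Start n threads that each touch the TLS object once, then join them all.
inline void run_threads(int n) {
  assert(n >= 0);
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i) threads.emplace_back([] { touch_tls(); });
  for (auto& th : threads) th.join();
}

}  // namespace sketch
```

<p>
Each thread that touches the object pays for one construction and one destruction, so the cost scales with the number of threads and with the number of such objects, regardless of how little work each thread does.
</p>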

<p>
GPGPU computations often have longer runtimes, but they also tend to run
extremely large numbers of threads, adding a memory-footprint
problem to the constructor-destructor overhead problem.
To make matters worse, in some environments, the constructors and
destructors must be run on heavyweight CPUs rather than on the
lightweight GPGPU hardware threads, which severely restricts the
computational resources that can be applied to run constructors
and destructors for GPGPU TLS data.

<p>
At the source-code level, it isn't generally knowable which executor a
function is called from, or even if a function is called from multiple
executors.
It is left up to the programmer to write code which correctly
accesses state for the executor(s) that the code will execute in.
(In theory, we could use a TLS variable to record which type
of executor is currently executing, but in practice that
requires a TLS implementation that is efficient enough to be used by
light-weight executors, and if we had that, we wouldn't be writing
this paper.)

<h2>Tentative Goals</h2>

<p>
There are a number of possible ways of resolving this issue, as discussed in
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4376.html">N4376</a>;
however, this paper focuses on the possibility that TLS is an optional
component of an executor.
With this approach, <tt>std::thread</tt> implements TLS, but lighter-weight
executors might choose not to.

<p>
For this approach, we put forward the following tentative goals:

<ol>
<li>	Make TLS availability optional for light-weight executors,
	as noted above.
	<ol type="alpha">
	<li>	Modify the standard library so as to minimize the
		number of standard library functions that are
		prohibited from within TLS-free executors.
	<li>	Maintain the performance and scalability of high-quality
		standard-library implementations.
	</ol>
<li>	Avoid source-code changes for existing code running in
	existing executors (such as <tt>std::thread</tt>) that
	provide TLS.
<li>	Avoid the need to recompile existing code running in
	existing executors (such as <tt>std::thread</tt>) that
	provide TLS.
<li>	Avoid API changes in the standard library.
	(C++ only, as it seems quite unlikely that this goal can be
	achieved in C.)
<li>	Recruit sanitizer developers to help identify issues in
	new code and in standard-library code related to this change.
</ol>

<p>
The next section exercises these goals by attempting to apply them
to the TLS <tt>errno</tt> facility as used by the standard math library,
in the hope of sparking productive discussion.
Note that when multiple lightweight executors run concurrently in the
context of a single <tt>std::thread</tt>, setting <tt>errno</tt>
implicitly (and for some, surprisingly) invokes
undefined behavior, so a fix is a matter of some importance.
At a minimum, lightweight executors that do not support TLS need to
state that attempts to access TLS result in undefined behavior.

<h2>The Curious Case of <tt>errno</tt> and the Standard Math Library</h2>

<p>
C++ provides a per-<tt>std::thread</tt> facility named
<tt>errno</tt> (19.4) in order to provide POSIX compatibility.
This is also required to allow C++'s standard math library (26) to
maintain compatibility with that of C.
Section 7.12 of the C standard specifies that if
<tt>math_errhandling &amp; MATH_ERRNO</tt> is non-zero, indications of
certain errors are available via <tt>errno</tt>.
Furthermore, Section 19.4 of the C++ standard specifies that
<tt>errno</tt> is provided on a per-thread basis.
Therefore, <tt>errno</tt> is frequently implemented using TLS,
which in turn means that the math library's use of <tt>errno</tt>
forms an excellent initial test case for changes to TLS.
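<p>
The sketch below probes both properties from a worker <tt>std::thread</tt>: whether the worker has its own <tt>errno</tt> object, and whether a domain error in <tt>std::sqrt</tt> is reported through it. The names <tt>sketch::errno_probe</tt> and <tt>sketch::probe_errno_in_worker</tt> are hypothetical. Whether <tt>errno</tt> is actually set depends on the implementation's value of <tt>math_errhandling</tt>, so the result should be read with that in mind.
</p>

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <thread>

namespace sketch {

struct errno_probe {
  bool separate_object;  // worker's errno is a different object
  bool reported_edom;    // worker's errno read EDOM after the call
  bool result_is_nan;    // the domain error produced a NaN
};

inline errno_probe probe_errno_in_worker() {
  int* const caller_errno = &errno;  // address of the caller's errno
  assert(caller_errno != nullptr);
  errno_probe probe{};
  std::thread worker([&] {
    probe.separate_object = (&errno != caller_errno);
    errno = 0;
    volatile double minus_one = -1.0;  // volatile defeats constant folding
    const double r = std::sqrt(minus_one);
    probe.result_is_nan = std::isnan(r);
    probe.reported_edom = (errno == EDOM);
  });
  worker.join();
  return probe;
}

}  // namespace sketch
```

<p>
On implementations where <tt>math_errhandling &amp; MATH_ERRNO</tt> is non-zero, <tt>reported_edom</tt> is expected to be true. An executor without TLS has no obvious place to store that per-thread value.
</p>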

<p>
This section looks at the following approaches:

<ol>
<li>	Restricting configuration.
<li>	Adding <tt>errno</tt> parameter via function overloading.
<li>	Adding <tt>errno</tt> to return value.
</ol>

<h3>Restricting Configuration</h3>

<p>
One approach is to require that
<tt>math_errhandling &amp; MATH_ERREXCEPT</tt> be non-zero
(as is required for IEC 60559) and that
<tt>math_errhandling &amp; MATH_ERRNO</tt> be zero in all cases where
math library functions are invoked from executors that do not provide TLS.
Note that <tt>math_errhandling</tt> is global and constant, which means
that it cannot have different values in different contexts of the same
execution.
However, this approach cannot be used in conjunction with existing code
that invokes math functions and tests <tt>errno</tt>.
This could in turn be dealt with by forbidding use of code
that checks for math errors using <tt>errno</tt>, but this would have
the undesirable effect of acting as a barrier to the adoption of
light-weight executors.
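<p>
Under this configuration, callers would detect math errors through the floating-point exception flags rather than through <tt>errno</tt>. A minimal sketch follows, assuming an implementation that supports the <tt>&lt;cfenv&gt;</tt> flags. The names <tt>sketch::checked_double</tt> and <tt>sketch::sqrt_checked</tt> are hypothetical. Strictly conforming code would also need <tt>#pragma STDC FENV_ACCESS ON</tt>, which compilers support to varying degrees and which is omitted here.
</p>

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

namespace sketch {

struct checked_double {
  double value;
  bool invalid;  // FE_INVALID was raised while computing value
};

// Computes sqrt(x) and reports a domain error via the FE_INVALID flag
// instead of errno. Note that this clears all pending exception flags.
inline checked_double sqrt_checked(double x) {
  const int cleared = std::feclearexcept(FE_ALL_EXCEPT);
  assert(cleared == 0);
  (void)cleared;  // avoids an unused-variable warning when NDEBUG is set
  volatile double input = x;  // volatile defeats constant folding
  volatile double result = std::sqrt(input);
  const bool invalid = std::fetestexcept(FE_INVALID) != 0;
  return {result, invalid};
}

}  // namespace sketch
```

<p>
Note that the floating-point environment, including these flags, is itself per-thread state, so this approach moves the problem rather than eliminating it.
</p>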

<h3>Adding <tt>errno</tt> Parameter Via Function Overloading</h3>

<p>
Another approach is to use function overloading, so that an additional
<tt>double sqrt(double, int *)</tt> declaration could be used in
light-weight executors.
Note that in some implementations this could require modifying the
underlying C library in order to bypass <tt>errno</tt> setting.
Code invoked both from light-weight and heavy-weight executors would
need to use the new declaration, but code invoked only from
heavy-weight executors could continue using the old API, consistent
with the goals of preserving existing source and binary code.
It is tempting to instead overload on the return type, but C++
does not support this notion.
A (probably partial) list of new APIs is as follows:

<ul>
<li>	<tt>double acos(double x, int *errnm);</tt>
<li>	<tt>float acosf(float x, int *errnm);</tt>
<li>	<tt>long double acosl(long double x, int *errnm);</tt>
<li>	<tt>double asin(double x, int *errnm);</tt>
<li>	<tt>float asinf(float x, int *errnm);</tt>
<li>	<tt>long double asinl(long double x, int *errnm);</tt>
<li>	<tt>double atan2(double y, double x, int *errnm);</tt>
<li>	<tt>float atan2f(float y, float x, int *errnm);</tt>
<li>	<tt>long double atan2l(long double y, long double x, int *errnm);</tt>
<li>	<tt>double acosh(double x, int *errnm);</tt>
<li>	<tt>float acoshf(float x, int *errnm);</tt>
<li>	<tt>long double acoshl(long double x, int *errnm);</tt>
<li>	<tt>double atanh(double x, int *errnm);</tt>
<li>	<tt>float atanhf(float x, int *errnm);</tt>
<li>	<tt>long double atanhl(long double x, int *errnm);</tt>
<li>	<tt>double cosh(double x, int *errnm);</tt>
<li>	<tt>float coshf(float x, int *errnm);</tt>
<li>	<tt>long double coshl(long double x, int *errnm);</tt>
<li>	<tt>double sinh(double x, int *errnm);</tt>
<li>	<tt>float sinhf(float x, int *errnm);</tt>
<li>	<tt>long double sinhl(long double x, int *errnm);</tt>
<li>	<tt>double exp(double x, int *errnm);</tt>
<li>	<tt>float expf(float x, int *errnm);</tt>
<li>	<tt>long double expl(long double x, int *errnm);</tt>
<li>	<tt>double exp2(double x, int *errnm);</tt>
<li>	<tt>float exp2f(float x, int *errnm);</tt>
<li>	<tt>long double exp2l(long double x, int *errnm);</tt>
<li>	<tt>double expm1(double x, int *errnm);</tt>
<li>	<tt>float expm1f(float x, int *errnm);</tt>
<li>	<tt>long double expm1l(long double x, int *errnm);</tt>
<li>	<tt>int ilogb(double x, int *errnm);</tt>
<li>	<tt>int ilogbf(float x, int *errnm);</tt>
<li>	<tt>int ilogbl(long double x, int *errnm);</tt>
<li>	<tt>double log(double x, int *errnm);</tt>
<li>	<tt>float logf(float x, int *errnm);</tt>
<li>	<tt>long double logl(long double x, int *errnm);</tt>
<li>	<tt>double log10(double x, int *errnm);</tt>
<li>	<tt>float log10f(float x, int *errnm);</tt>
<li>	<tt>long double log10l(long double x, int *errnm);</tt>
<li>	<tt>double log1p(double x, int *errnm);</tt>
<li>	<tt>float log1pf(float x, int *errnm);</tt>
<li>	<tt>long double log1pl(long double x, int *errnm);</tt>
<li>	<tt>double log2(double x, int *errnm);</tt>
<li>	<tt>float log2f(float x, int *errnm);</tt>
<li>	<tt>long double log2l(long double x, int *errnm);</tt>
<li>	<tt>double logb(double x, int *errnm);</tt>
<li>	<tt>float logbf(float x, int *errnm);</tt>
<li>	<tt>long double logbl(long double x, int *errnm);</tt>
<li>	<tt>double scalbn(double x, int n, int *errnm);</tt>
<li>	<tt>float scalbnf(float x, int n, int *errnm);</tt>
<li>	<tt>long double scalbnl(long double x, int n, int *errnm);</tt>
<li>	<tt>double scalbln(double x, long int n, int *errnm);</tt>
<li>	<tt>float scalblnf(float x, long int n, int *errnm);</tt>
<li>	<tt>long double scalblnl(long double x, long int n, int *errnm);</tt>
<li>	<tt>double hypot(double x, double y, int *errnm);</tt>
<li>	<tt>float hypotf(float x, float y, int *errnm);</tt>
<li>	<tt>long double hypotl(long double x, long double y, int *errnm);</tt>
<li>	<tt>double pow(double x, double y, int *errnm);</tt>
<li>	<tt>float powf(float x, float y, int *errnm);</tt>
<li>	<tt>long double powl(long double x, long double y, int *errnm);</tt>
<li>	<tt>double sqrt(double x, int *errnm);</tt>
<li>	<tt>float sqrtf(float x, int *errnm);</tt>
<li>	<tt>long double sqrtl(long double x, int *errnm);</tt>
<li>	<tt>double erfc(double x, int *errnm);</tt>
<li>	<tt>float erfcf(float x, int *errnm);</tt>
<li>	<tt>long double erfcl(long double x, int *errnm);</tt>
<li>	<tt>double lgamma(double x, int *errnm);</tt>
<li>	<tt>float lgammaf(float x, int *errnm);</tt>
<li>	<tt>long double lgammal(long double x, int *errnm);</tt>
<li>	<tt>double tgamma(double x, int *errnm);</tt>
<li>	<tt>float tgammaf(float x, int *errnm);</tt>
<li>	<tt>long double tgammal(long double x, int *errnm);</tt>
<li>	<tt>long int lrint(double x, int *errnm);</tt>
<li>	<tt>long int lrintf(float x, int *errnm);</tt>
<li>	<tt>long int lrintl(long double x, int *errnm);</tt>
<li>	<tt>long long int llrint(double x, int *errnm);</tt>
<li>	<tt>long long int llrintf(float x, int *errnm);</tt>
<li>	<tt>long long int llrintl(long double x, int *errnm);</tt>
<li>	<tt>long int lround(double x, int *errnm);</tt>
<li>	<tt>long int lroundf(float x, int *errnm);</tt>
<li>	<tt>long int lroundl(long double x, int *errnm);</tt>
<li>	<tt>long long int llround(double x, int *errnm);</tt>
<li>	<tt>long long int llroundf(float x, int *errnm);</tt>
<li>	<tt>long long int llroundl(long double x, int *errnm);</tt>
<li>	<tt>double fmod(double x, double y, int *errnm);</tt>
<li>	<tt>float fmodf(float x, float y, int *errnm);</tt>
<li>	<tt>long double fmodl(long double x, long double y, int *errnm);</tt>
<li>	<tt>double remainder(double x, double y, int *errnm);</tt>
<li>	<tt>float remainderf(float x, float y, int *errnm);</tt>
<li>	<tt>long double remainderl(long double x, long double y, int *errnm);</tt>
<li>	<tt>double remquo(double x, double y, int *quo, int *errnm);</tt>
<li>	<tt>float remquof(float x, float y, int *quo, int *errnm);</tt>
<li>	<tt>long double remquol(long double x, long double y, int *quo, int *errnm);</tt>
<li>	<tt>double nextafter(double x, double y, int *errnm);</tt>
<li>	<tt>float nextafterf(float x, float y, int *errnm);</tt>
<li>	<tt>long double nextafterl(long double x, long double y, int *errnm);</tt>
<li>	<tt>double fdim(double x, double y, int *errnm);</tt>
<li>	<tt>float fdimf(float x, float y, int *errnm);</tt>
<li>	<tt>long double fdiml(long double x, long double y, int *errnm);</tt>
<li>	<tt>double fma(double x, double y, double z, int *errnm);</tt>
<li>	<tt>float fmaf(float x, float y, float z, int *errnm);</tt>
<li>	<tt>long double fmal(long double x, long double y, long double z, int *errnm);</tt>
</ul>

<p>
Note that new APIs need be provided only for those math functions that
set <tt>errno</tt>.
Note also that because C does not provide function overloading,
different names will need to be used should C adopt similar functionality.
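<p>
To make the shape of such an overload concrete, here is a minimal sketch of one of the listed functions. It lives in a hypothetical namespace <tt>sketch</tt> to avoid colliding with the real library, and leaving <tt>*errnm</tt> untouched on success is an assumption modeled on how <tt>errno</tt> itself behaves, not something this paper specifies.
</p>

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <limits>

namespace sketch {

// Hypothetical overload in the style of the list above: a domain error
// is reported through the caller-supplied *errnm instead of via errno.
inline double sqrt(double x, int* errnm) {
  assert(errnm != nullptr);
  if (x < 0.0) {
    *errnm = EDOM;  // the error value never touches TLS
    return std::numeric_limits<double>::quiet_NaN();
  }
  // Valid (or NaN) input: defer to the existing library function.
  return std::sqrt(x);
}

}  // namespace sketch
```

<p>
A caller would write <tt>int e = 0; double r = sketch::sqrt(x, &amp;e);</tt> and then test <tt>e</tt>. This sketch filters the domain error before calling the existing <tt>std::sqrt</tt>. A real implementation might instead need the modified C library mentioned above.
</p>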

<p>One might expect some dissatisfaction with the invention of more than
100 new functions, especially given that a great many uses of these functions
ignore <tt>errno</tt>.
Although one can argue that ignoring <tt>errno</tt> is a bad idea,
one might also expect strenuous objections to pointless modifications
of existing errno-ignoring code.

<h3>Adding <tt>errno</tt> to Function Return Value</h3>

<p>
Another approach is to define an additional namespace containing
definitions of these functions that return a tuple that includes both
the normal return value and the <tt>errno</tt> value.
For example:

<blockquote>
<pre>
template&lt;typename T&gt; std::tuple&lt;T, errno_t&gt; acos(T);

template&lt;typename T&gt; struct math_result {
  explicit math_result(T);
  explicit math_result(errno_t);
  operator T() const;
  errno_t error() const;
  // Implementation-defined.
};
</pre>
</blockquote>

<p>
This approach allows errno-ignoring code to run safely in light-weight
executors, with modest changes for code that pays attention to errno.
One way of preventing silent miscomputation by errno-ignoring code
is to use exceptions, which this approach also supports.
However, some might take exception to the use of exceptions, given that
a number of current implementations of exceptions use, you guessed it,
TLS!
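<p>
One way the <tt>math_result</tt> interface might be fleshed out is sketched below. Everything here is an assumption made for illustration: the namespace <tt>sketch</tt>, the alias of <tt>errno_t</tt> to <tt>int</tt>, the use of NaN as the value carried by an error result, and the explicit range check in <tt>acos</tt>.
</p>

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <limits>

namespace sketch {

using errno_t = int;  // assumption: errno_t is not provided by standard C++

template <typename T>
class math_result {
 public:
  explicit math_result(T value) : value_(value), error_(0) {}
  explicit math_result(errno_t error)
      : value_(std::numeric_limits<T>::quiet_NaN()), error_(error) {
    assert(error != 0);
  }
  operator T() const { return value_; }  // errno-ignoring code uses this
  errno_t error() const { return error_; }

 private:
  T value_;
  errno_t error_;
};

// Hypothetical TLS-free acos: the error travels with the return value.
inline math_result<double> acos(double x) {
  if (x < -1.0 || x > 1.0) return math_result<double>(EDOM);
  return math_result<double>(std::acos(x));
}

}  // namespace sketch
```

<p>
Code that ignores errors can keep writing <tt>double r = sketch::acos(x);</tt> thanks to the conversion operator, while code that cares can call <tt>error()</tt> on the result.
</p>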

<h2>Summary</h2>

<p>
This document has examined some ways to permit light-weight executors
to avoid implementing TLS.
Your ideas are more than welcome!

<p>
Future work includes handling of
allocators (which introduces the problem of cross-executor freeing),
setjmp/longjmp,
locales,
filesystems,
signal handling,
floating-point rounding modes (and everything else in <tt>fenv</tt>),
and exceptions.
The problem of nested executors that all provide TLS is also left
unaddressed by this draft.
In addition, and perhaps most important, future work includes guidelines
and patterns to allow user code to work well with TLS in environments
that include lightweight executors.

<h2>Acknowledgements</h2>

<p>
@@@

<h2>Additional Information</h2>

<p>
Floating-point state is stored on a per-thread basis, which means that
if a light-weight executor can be preempted or migrated among
<tt>std::thread</tt> instances, things like rounding modes and
error/exception indications can be subject to unscheduled revision.
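<p>
The per-thread nature of the rounding mode can be observed with a sketch along the following lines. The names <tt>sketch::rounding_probe</tt> and <tt>sketch::probe_rounding_modes</tt> are hypothetical, and the example assumes an implementation that defines <tt>FE_UPWARD</tt> and <tt>FE_DOWNWARD</tt>.
</p>

```cpp
#include <cassert>
#include <cfenv>
#include <thread>

namespace sketch {

struct rounding_probe {
  int inherited;  // mode the worker started with
  int in_worker;  // mode after the worker changed it
  int in_caller;  // caller's mode after the worker finished
};

inline rounding_probe probe_rounding_modes() {
  const int saved = std::fegetround();
  assert(saved >= 0);  // a negative value means the mode is indeterminable
  std::fesetround(FE_UPWARD);
  rounding_probe probe{};
  std::thread worker([&] {
    probe.inherited = std::fegetround();
    std::fesetround(FE_DOWNWARD);  // changes only the worker's environment
    probe.in_worker = std::fegetround();
  });
  worker.join();
  probe.in_caller = std::fegetround();
  std::fesetround(saved);  // restore the caller's original mode
  return probe;
}

}  // namespace sketch
```

<p>
The worker's change to <tt>FE_DOWNWARD</tt> does not leak into the caller, which still sees <tt>FE_UPWARD</tt>. A light-weight executor that migrates between <tt>std::thread</tt> instances would silently pick up whichever mode its current host thread happens to have.
</p>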

</body></html>
