<html>
<head>
<title>P2066R9: Suggested draft TS for C++ Extensions for Minimal Transactional Memory</title>

<style type="text/css">
  ins { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
  .new { text-decoration:none; font-weight:bold; background-color:#D0FFD0 }
  del { text-decoration:line-through; background-color:#FFA0A0 }  
  strong { font-weight: inherit; color: #2020ff }
  table, td, th { border: 1px solid black; border-collapse:collapse; padding: 5px }
</style>
</head>

<body>
ISO/IEC JTC1 SC22 WG21 P2066R9<br/>
Authors: Hans Boehm, Victor Luchangco, Jens Maurer, Michael L. Scott, Michael Spear, and Michael Wong<br/>
Reply to: Jens Maurer &lt;Jens.Maurer@gmx.net&gt;<br/>
Target audience: CWG, LWG<br/>
2021-09-15<br/>

<h1>P2066R9: Suggested draft TS for C++ Extensions for Minimal Transactional Memory</h1>

<h2>Introduction</h2>

This paper presents suggested wording for a future Technical
Specification on a variant of Transactional Memory. See
P1875R2 "Transactional Memory Lite Support in C++" for discussion.

<h2>Paper history</h2>

<h3>Changes in R1 since R0</h3>
<ul>
  <li>incorporated feedback from SG1 in Prague</li>
</ul>

<h3>Changes in R2 since R1</h3>
<ul>
  <li>turned "atomic" into a context-sensitive keyword</li>
  <li>added questions for TS deployment</li>
</ul>

<h3>Changes in R3 since R2</h3>
<ul>
  <li>EWG feedback: extend definition of full-expression in
    [intro.execution], add note about parallel execution of
    transactions, constant evaluations in atomic blocks are
    irrelevant, use "atomic do" as the introducer keywords.</li>
  <li>Explicitly list standard library functions required to be
    supported in an atomic block</li>
</ul>

<h3>Changes in R4 since R3</h3>
<ul>
  <li>Make throwing an exception from inside a transaction undefined
  behavior</li>
</ul>

<h3>Changes in R5 since R4</h3>
<ul>
  <li>SG5 review: Exceptions handled inside an atomic block are fine.</li>
  <li>SG5 review: Nearly all standard library functions may be used
  inside an atomic block; any other function is restricted to
  compiler-visible.</li>
</ul>

<h3>Changes in R6 since R5</h3>
<ul>
  <li>SG5 review: Exclude more standard library facilities causing synchronization.</li>
</ul>

<h3>Changes in R7 since R6</h3>
<ul>
  <li>EWG review: Exclude more standard library facilities causing synchronization.</li>
  <li>R6 was approved by EWG on 2021-04-28.</li>
</ul>

<h3>Changes in R8 since R7</h3>
<ul>
  <li>Added a section "notable design decisions" in preparation for LEWG review.</li>
</ul>

<h3>Changes in R9 since R8</h3>
<ul>
  <li>Address feedback from Hubert Tong.</li>
  <li>Address feedback from CWG.</li>
  <li>Address preliminary feedback from LWG.</li>
</ul>

<h2>Notable design decisions</h2>
<p>
A more detailed rationale for many of the decisions made here is given in
<a href="http://wg21.link/p1875">P1875</a>. Here we quickly summarize design decisions leading
to this proposal. These are generally motivated by the desire to produce a much simpler,
more easily implementable specification. Except for the last two, these have been stable
since the original SG1 discussion. All were part of the proposal during the final EWG
discussion.
<ul>
<li> This is a language, not library, proposal.
Passing lambda expressions to a "run this as a transaction" function was not well-liked, and has
issues with varargs access. An RAII-based library facility was also considered, but SG1 disliked
it as well; among other issues, it was unclear what it would mean if the RAII
object was not stack-allocated.</li>
<li> We want to leave a lot implementation-defined for the TS, to encourage implementations,
and support experimentation. We do not want to again end up with a TS that is difficult to
implement usefully, and thus fails to attract implementations.</li>
<li> We want to focus on simple implementations that are based on hardware transactional memory (HTM)
and/or use the
trivial global lock implementation. We expect HTM implementations that fall back on a global
lock to dominate in practice, at least initially. We want to allow more elaborate implementations
based on software transactional memory but, unlike the current TS, we do not want to require
such elaborate implementations.</li>
<li> There is no support for explicitly declaring transaction-safety. This is a major simplification
over the current Technical Specification. This does not matter for HTM or global lock implementations,
which do not need to instrument code. It does make it more difficult to extract good performance
from software transactional memory implementations, but there is agreement that the simplification is worth
the trade-off.</li>
<li> Exceptions may not escape from transactions. This dodges many complicated issues about whether
transactions should commit or abort and, in the latter case, how exception objects themselves survive
transaction aborts.</li>
<li> The "atomic do {...}" syntax. This was the most natural syntax we (mostly EWG) could come up with that
appears to be easily parseable, and does not require a new keyword. It was "atomic {...}" during SG1 discussions.
"atomic do" reflects a non-unanimous consensus in EWG.</li>
<li> Allow <code>malloc</code> (and hence non-escaping exceptions, and much more of the standard library) inside transactions.
This was a recent change, but preceded the last EWG discussion. We previously thought that we could get by with
very restricted transactions, so long as they were sufficiently general to implement N-way compare_exchange.
Closer examination by SG5 determined that was problematic since the load operations preceding the compare_exchange
transaction would also need to be transactions, making them expensive, at least without heroic implementation
effort. Thus we concluded that it was desirable to again allow much of the standard library inside
transactions. (Also discussed in more detail in P1875, but more as an open issue.)</li>
</ul>
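<p>
To make the global lock strategy and the N-way compare_exchange use case more concrete, the
following is a minimal sketch in standard C++ that emulates an atomic block with a single
process-wide mutex. It is an illustration only, not the proposed facility: the names
<code>global_tx_lock</code>, <code>atomic_do_emulated</code> and <code>double_compare_exchange</code>
are hypothetical, and under this proposal the body of the lambda would be written directly
as <code>atomic do { ... }</code>.
</p>

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the "trivial global lock implementation":
// every emulated atomic block runs under one process-wide mutex.
std::mutex& global_tx_lock() {
  static std::mutex m;
  return m;
}

// Hypothetical helper. Under the proposal the body would instead be
// written as `atomic do { ... }`; the lambda is only an emulation device.
template <class F>
decltype(auto) atomic_do_emulated(F&& body) {
  std::lock_guard<std::mutex> guard(global_tx_lock());
  return std::forward<F>(body)();
}

// Two-way compare-exchange: both locations are updated only if both
// currently hold their expected values. Returns true on success.
bool double_compare_exchange(int& a, int expected_a, int desired_a,
                             int& b, int expected_b, int desired_b) {
  assert(&a != &b);  // assumption of this sketch: two distinct locations
  return atomic_do_emulated([&] {
    if (a != expected_a || b != expected_b)
      return false;
    a = desired_a;
    b = desired_b;
    return true;
  });
}
```

<p>
In this emulation, any other thread that reads or writes <code>a</code> or <code>b</code>
concurrently has to do so under the same lock to avoid a data race. This corresponds to the
observation above that the loads preceding the compare_exchange would also need to be transactions.
</p>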

<h2>Questions for TS deployment</h2>

<ul>
  <li>Can we strengthen some of the "implementation-defined" parts of
    [stmt.tx]? In which way?</li>
  <li>What is the facility used for in practice?</li>
  <li>Are there problems that might benefit from transactions, but
    cannot be solved with the facilities presented?</li>
  <li>Do implementations provide additional facilities for
    transactions (possibly with extra syntax) beyond those
    presented?</li>
  <li>Is it vital for the implementation that the definitions of all
    functions invoked within a transaction are visible?</li>
  <li>Are there functions in the standard library that would be useful
    inside a transaction, but do not work yet?</li>
</ul>

<h2>Wording changes</h2>

These wording changes are relative to the current C++ Working Draft, N4849.
<p>
In 5.10 [lex.name], add <ins><code>atomic</code></ins> to table 4 [tab:lex.name.special].

<p>

Change in 6.9.1 [intro.execution] paragraph 5:

<blockquote>
A <em>full-expression</em> is
<ul>
  <li>...</li>
  <li>an invocation of a destructor generated at the end of the lifetime of an object other than a temporary
    object (6.7.7) whose lifetime has not been extended, <del>or</del></li>
  <li><ins>the start and the end of an atomic block (8.8 [stmt.tx]), or</ins></li>
  <li>an expression that is not a subexpression of another expression and that is not otherwise part of a
full-expression.</li>
</ul>
</blockquote>

Change in 6.9.2.1 [intro.races] paragraph 6:

<blockquote>
<ins>Atomic blocks as well
as</ins> <del>Certain</del> <ins>certain</ins> library
calls <ins>may</ins> <em>synchronize with</em> other <ins>atomic
blocks and</ins> library calls performed by another thread.
</blockquote>

Add a new paragraph after 6.9.2.1 [intro.races] paragraph 20:

<blockquote class="new">
An atomic block that is not
dynamically nested within another atomic block is termed
a transaction. [ Note: Due to syntactic
constraints, blocks cannot overlap unless one is nested within the
other. -- end note ] There is a global total order of execution for
all transactions. If, in that total order,
a transaction T1 is ordered before a transaction T2, then
<ul>
  <li>no evaluation in T2 happens before any evaluation in T1 and</li>
  <li>if T1 and T2 perform conflicting expression evaluations, then
  the end of T1 synchronizes with the start of T2.</li>
</ul>
[ Note: If the evaluations in T1 and T2 do not conflict, they
might be executed concurrently. -- end note ]
</blockquote>
<blockquote>
Two actions are <em>potentially concurrent</em> if ...
</blockquote>
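<p>
The ordering guarantee for conflicting transactions can be illustrated with a small sketch.
It assumes a global lock emulation of atomic blocks (one of the implementation strategies
this paper anticipates) and uses hypothetical names (<code>tx_lock</code>, <code>Accounts</code>,
<code>transfer_one</code>, <code>observe_sum</code>, <code>run_transfers</code>) that are not part of
the wording. Each locked region stands in for one transaction; because conflicting
transactions are totally ordered, an observer never sees a half-completed transfer.
</p>

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical global-lock emulation of `atomic do { ... }`.
std::mutex& tx_lock() {
  static std::mutex m;
  return m;
}

struct Accounts {
  long a = 1000;
  long b = 0;
};

// Writer transaction: moves one unit from a to b.
void transfer_one(Accounts& acc) {
  std::lock_guard<std::mutex> tx(tx_lock());  // start of emulated atomic block
  --acc.a;
  ++acc.b;
}  // end of emulated atomic block

// Reader transaction: conflicts with transfer_one on both fields.
long observe_sum(const Accounts& acc) {
  std::lock_guard<std::mutex> tx(tx_lock());
  return acc.a + acc.b;
}

// Runs several writers and one observer concurrently. Returns true if
// every observed sum equals the initial sum and no transfer was lost.
bool run_transfers(int writers, int transfers_each) {
  assert(writers > 0 && transfers_each >= 0);
  Accounts acc;
  const long expected = acc.a + acc.b;
  bool ok = true;  // written only by the observer thread, read after join
  std::vector<std::thread> threads;
  for (int w = 0; w < writers; ++w)
    threads.emplace_back([&acc, transfers_each] {
      for (int i = 0; i < transfers_each; ++i)
        transfer_one(acc);
    });
  std::thread observer([&acc, &ok, expected] {
    for (int i = 0; i < 1000; ++i)
      if (observe_sum(acc) != expected)
        ok = false;
  });
  for (auto& t : threads)
    t.join();
  observer.join();
  return ok && acc.b == static_cast<long>(writers) * transfers_each &&
         acc.a + acc.b == expected;
}
```

<p>
A global lock serializes all transactions, which is stronger than required: per the note
above, an implementation might execute non-conflicting transactions concurrently.
</p>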

Change in 6.9.2.1 [intro.races] paragraph 21:

<blockquote>
... [Note: It can be shown that programs that correctly use
mutexes<ins>, atomic blocks,</ins>
and <code>memory_order::seq_cst</code> operations to prevent all data
races and use no other synchronization operations behave as if the
operations executed by their constituent threads were simply
interleaved, with each value computation of an object being taken from
the last side effect on that object in that interleaving. This is
normally referred to as "sequential consistency". ...
</blockquote>

Add a new paragraph after 6.9.2.1 [intro.races] paragraph 21:

<blockquote class="new">
[ Note: The following holds for a data-race-free program: If the start
of an atomic block T is sequenced before an evaluation A, A is
sequenced before the end of T, A strongly happens before some
evaluation B, and B is not sequenced before the end of T,
then the end of T strongly happens before B. If an
evaluation C strongly happens before that evaluation A
and C is not sequenced after the start of T, then C
strongly happens before the start of T. These properties in turn imply
that in any simple interleaved (sequentially consistent) execution,
the operations of each atomic block appear to be contiguous in the
interleaving. -- end note ]
</blockquote>

Change in 6.9.2.2 [intro.progress] paragraph 1:

<blockquote>
  <del>The implementation may assume that any thread will eventually do</del>
  <ins>An <em>inter-thread side effect</em> is</ins> one
  of the following:
<ul>
  <li><del>terminate,</del></li>
  <li>a call to a library I/O function,</li>
  <li>an access through a volatile glvalue, or</li>
  <li>a synchronization operation or an atomic operation <ins>([atomics])</ins>.</li>
</ul>
<ins>The implementation may assume that any thread will eventually
terminate or evaluate an inter-thread side effect.</ins>
[Note: This is intended to allow compiler transformations such as
removal of empty loops, even when termination cannot be proven. — end
note]
</blockquote>

Add a production to the grammar in 8.1 [stmt.pre]:

<blockquote>
  <pre>
<em>statement:
      labeled-statement
      attribute-specifier-seq<sub>opt</sub> expression-statement
      attribute-specifier-seq<sub>opt</sub> compound-statement
      attribute-specifier-seq<sub>opt</sub> selection-statement
      attribute-specifier-seq<sub>opt</sub> iteration-statement
      attribute-specifier-seq<sub>opt</sub> jump-statement
      declaration-statement
      attribute-specifier-seq<sub>opt</sub> try-block
      <ins>atomic-statement</ins></em>
</pre>
</blockquote>

Add a new subclause before 8.8 [stmt.dcl]:

<blockquote class="new">
<b>8.8 Atomic statement [stmt.tx]</b>
<p>
<pre>
<em>atomic-statement:</em>
     atomic do <em>compound-statement</em>
</pre>

An <em>atomic-statement</em> is also called an <em>atomic block</em>.
<p>
The start of the atomic block is immediately before the
opening <code>{</code> of the <em>compound-statement</em>. The end of the
atomic block is immediately after the closing <code>}</code> of the
<em>compound-statement</em>. [ Note: Thus, variables with automatic
storage duration declared in the <em>compound-statement</em> are
destroyed prior to reaching the end of the atomic block; see 8.7
[stmt.jump]. -- end note ]
<p>
A goto or switch statement shall not be used to transfer control into
an atomic block.
<p>
If the execution of an atomic block evaluates an inter-thread side
effect (6.9.2.2 [intro.progress]) or
if an atomic block is exited via an exception,
the behavior is undefined.
<p>
<em>Recommended practice:</em> In case an atomic block is
exited via an exception, the program should be terminated without
invoking a terminate handler (17.9.5 [exception.terminate]) or
destroying any objects with static or thread storage
duration (6.9.3.4 [basic.start.term]).
<p>
If the execution of an atomic block evaluates any of the following
outside of a manifestly constant-evaluated context (7.7
[expr.const]), the behavior is implementation-defined:
<ul>
  <li>an <em>asm-declaration</em> (9.10 [dcl.asm]);</li>
  <li>an invocation of a function <strong>other than one of the
  standard library functions specified in (16.4.6.17 [atomic.use]),
  unless the function is inline</strong> with a reachable
  definition;</li>
  <li>a virtual function call (7.6.1.3 [expr.call]);</li>
  <li>a function call, <strong>unless overload resolution selects</strong>
    <ul>
      <li><strong>a named function (12.2.2.2.2 [over.call.func]) or</strong></li>
      <li><strong>a function call operator (12.2.2.2.3
  [over.call.object]), but not a surrogate call
  function;</strong></li>
      </ul>
  </li>
  <li>a <code>co_await</code> expression (7.6.2.3 [expr.await]),
  a <em>yield-expression</em> (7.6.17 [expr.yield]), or
  a <code>co_return</code> statement (8.7.4 [stmt.return.coroutine]);</li>
  <li>dynamic initialization of a block-scope variable with static storage duration; or</li>
  <li>dynamic initialization of a variable with thread storage duration.</li>
</ul>
[ Note: The implementation <strong>can</strong>
define that the behavior is undefined in some or all of the cases
above. -- end note ]
<p>
[ Example:
<pre>
<strong>unsigned</strong> int f()
{
  static <strong>unsigned</strong> int i = 0;
  atomic do {
    ++i;
    return i;
  }
}
</pre>
Each invocation of f (even when called from several threads
simultaneously) retrieves a unique value (ignoring <strong>wrap-around</strong>). -- end
example ]
<p>

[ Note: Atomic blocks are likely to perform best where they
execute quickly and touch little data. -- end note ]

<p>

</blockquote>
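<p>
Since current compilers do not accept <code>atomic do</code>, the behavior claimed for the
example function <code>f</code> above can be approximated in standard C++ with a global lock.
The following is a sketch under that assumption; <code>f_emulated</code> and
<code>all_values_unique</code> are hypothetical names introduced for illustration, and the mutex
merely stands in for the atomic block.
</p>

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical emulation of the example's f(): the atomic block is
// replaced by a critical section on one global mutex.
unsigned int f_emulated() {
  static std::mutex tx_lock;
  static unsigned int i = 0;
  std::lock_guard<std::mutex> tx(tx_lock);  // start of emulated atomic block
  ++i;
  return i;  // the return value is copied before the lock is released
}

// Calls f_emulated() from several threads and reports whether all
// returned values are distinct (wrap-around is not reached here).
bool all_values_unique(int thread_count, int calls_per_thread) {
  assert(thread_count > 0 && calls_per_thread >= 0);
  std::vector<std::vector<unsigned int>> per_thread(thread_count);
  std::vector<std::thread> threads;
  for (int t = 0; t < thread_count; ++t)
    threads.emplace_back([&per_thread, t, calls_per_thread] {
      for (int c = 0; c < calls_per_thread; ++c)
        per_thread[t].push_back(f_emulated());
    });
  for (auto& th : threads)
    th.join();

  // Merge all results and look for duplicates.
  std::vector<unsigned int> all;
  for (const auto& v : per_thread)
    all.insert(all.end(), v.begin(), v.end());
  std::sort(all.begin(), all.end());
  return std::adjacent_find(all.begin(), all.end()) == all.end();
}
```

<p>
With the proposed syntax, the lock and its guard disappear and the increment and return
are simply enclosed in <code>atomic do { ... }</code>, as in the example in the wording.
</p>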

Add 16.4.6.17 [atomic.use]:

<blockquote class="new">
<h2>16.4.6.17 Functions usable in an atomic block [atomic.use]</h2>

All library functions may be used in an atomic block (8.8 [stmt.tx]), except

<ul>
  <li><strong>error category objects ([syserr.errcat.objects])</strong></li>
  <li>time zone database ([time.zone.db])</li>
  <li>clocks ([time.clock])</li>
  <li><code>signal</code> ([support.signal]) <strong>and <code>raise</code> ([csignal.syn])</strong></li>
  <li><code>set_new_handler</code>, <code>set_terminate</code>, <code>get_new_handler</code>, <code>get_terminate</code> ([handler.functions], [alloc.errors], [exception.syn])</li>
  <li><strong><code>system</code> ([cstdlib.syn])</strong></li>
  <li><code>shared_ptr</code> ([util.smartptr.shared]) and <code>weak_ptr</code> ([util.smartptr.weak])</li>
  <li><code>synchronized_pool_resource</code> ([mem.res.pool])</li>
  <li><strong>program-wide <code>memory_resource</code> objects ([mem.res.global])</strong></li>
  <li><code>setjmp</code> / <code>longjmp</code> ([csetjmp.syn])</li>
  <li><strong>parallel algorithms ([algorithms.parallel])</strong></li>
  <li><code>locale</code> construction ([locale.cons])</li>
  <li>input/output ([input.output])</li>
  <li>atomic operations ([atomics])</li>
  <li>thread support ([thread])</li>
</ul>
</blockquote>

</body>
</html>
