<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link href='https://fonts.googleapis.com/css?family=Roboto' rel='stylesheet' type='text/css'>
    <link href='https://fonts.googleapis.com/css?family=Inconsolata:bold' rel='stylesheet' type='text/css'>
    <style>
      body {
        font-family: 'Roboto', sans-serif;
      }
      .body {
        margin: 0 auto;
        max-width: 80em;
      }
      a {
        color: #455A64;
      }
      h1, h2, h3, h4, h5, h6 {
        color: #37474F;
      }
      h1 a, h2 a, h3 a, h4 a, h5 a, h6 a {
        padding-left: 1em;
        padding-right: 3em;
        text-decoration: none;
        opacity: 0;
      }
      a.headerlink:hover {
        opacity: 1;
      }
      .field-name {
        text-align: right;
        padding-right: 1em;
      }
      tt, .highlight {
        color: #263238;
        background-color: #ECEFF1;
        font-family: 'Inconsolata', monospace;
        font-weight: bold;
      }
      tt {
        padding: 0em 0.5em;
      }
      .highlight {
        margin: 0em 1em;
        padding: 0.1em 1em;
      }
    </style>
    <title>N4522 std::atomic_object_fence(mo, T&amp;&amp;...)</title> 
  </head>
  <body>
    <div class="body">

  <div class="section" id="n4522-std-atomic-object-fence-mo-t">
<h1>N4522 <tt class="docutils literal"><span class="pre">std::atomic_object_fence(mo,</span> <span class="pre">T&amp;&amp;...)</span></tt><a class="headerlink" href="#n4522-std-atomic-object-fence-mo-t" title="Permalink to this headline">¶</a></h1>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Author:</th><td class="field-body">Olivier Giroux</td>
</tr>
<tr class="field-even field"><th class="field-name">Contact:</th><td class="field-body"><a class="reference external" href="mailto:ogiroux&#37;&#52;&#48;nvidia&#46;com">ogiroux<span>&#64;</span>nvidia<span>&#46;</span>com</a></td>
</tr>
<tr class="field-odd field"><th class="field-name">Author:</th><td class="field-body">JF Bastien</td>
</tr>
<tr class="field-even field"><th class="field-name">Contact:</th><td class="field-body"><a class="reference external" href="mailto:jfb&#37;&#52;&#48;google&#46;com">jfb<span>&#64;</span>google<span>&#46;</span>com</a></td>
</tr>
<tr class="field-odd field"><th class="field-name">Date:</th><td class="field-body">2015-05-21</td>
</tr>
<tr class="field-even field"><th class="field-name">URL:</th><td class="field-body"><a class="reference external" href="https://github.com/jfbastien/papers/blob/master/source/N4522.rst">https://github.com/jfbastien/papers/blob/master/source/N4522.rst</a></td>
</tr>
</tbody>
</table>
<div class="section" id="rationale">
<h2>Rationale<a class="headerlink" href="#rationale" title="Permalink to this headline">¶</a></h2>
<p>Fences allow programmers to express a conservative approximation to the precise
pair-wise relations of operations required to be ordered in the happens-before
relation. This is conservative because fences use the sequenced-before relation
to select vast extents of the program into the happens-before relation.</p>
<p>This conservatism is commonly desired because it is difficult to reason about
operations hidden behind layers of abstraction in C++ programs. An unfortunate
consequence of this is that precise expression of ordering is not possible in
C++ currently, which makes it easy to over-constrain the order of operations
internal to synchronization primitives that comprise multiple atomic objects.
This constrains the ability of implementations (compiler and hardware) to
reorder, ignore, or assume the absence of operations that are not relevant or
not visible.</p>
<p>In existing practice, the <tt class="docutils literal"><span class="pre">flush</span></tt> primitive of OpenMP is more expressive than
the fences of C++ in at least this one sense: it can optionally restrict the
ordering of operations to a developer-specified set of memory locations. This is
enough to exactly express the required pair-wise ordering for short lock-free
algorithms. This capability isn&#8217;t only relevant to OpenMP and would be further
enhanced if it was integrated with the other facets of the more modern C++
memory model.</p>
<p>An example use-case for this capability is a likely implementation strategy for
<a class="reference external" href="http://wg21.link/N4392">N4392</a>&#8216;s <tt class="docutils literal"><span class="pre">std::barrier</span></tt> object. This algorithm makes ordered modifications on
the atomic sub-objects of a larger non-atomic synchronization object, but the
internal modifications need only be ordered with respect to each other, not all
surrounding objects (they are ordered separately).</p>
<p>In one example implementation, <tt class="docutils literal"><span class="pre">std::barrier</span></tt> is coded as follows:</p>
<div class="highlight-c++"><div class="highlight"><pre><span class="k">struct</span> <span class="n">barrier</span> <span class="p">{</span>
    <span class="c1">// Some member functions elided.</span>
    <span class="kt">void</span> <span class="n">arrive_and_wait</span><span class="p">()</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="k">const</span> <span class="n">myepoch</span> <span class="o">=</span> <span class="n">epoch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
        <span class="kt">int</span> <span class="k">const</span> <span class="n">result</span> <span class="o">=</span> <span class="n">arrived</span><span class="p">.</span><span class="n">fetch_add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_acq_rel</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="n">expected</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">expected</span> <span class="o">=</span> <span class="n">nexpected</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
            <span class="n">arrived</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
            <span class="c1">// Only need to order {expected, arrived} -&gt; {epoch}.</span>
            <span class="n">epoch</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">myepoch</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_release</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="k">else</span>
            <span class="k">while</span> <span class="p">(</span><span class="n">epoch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">memory_order_acquire</span><span class="p">)</span> <span class="o">==</span> <span class="n">myepoch</span><span class="p">)</span>
                <span class="p">;</span>
    <span class="p">}</span>
<span class="nl">private:</span>
    <span class="kt">int</span> <span class="n">expected</span><span class="p">;</span>
    <span class="n">atomic</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="n">arrived</span><span class="p">,</span> <span class="n">nexpected</span><span class="p">,</span> <span class="n">epoch</span><span class="p">;</span>
<span class="p">};</span>
</pre></div>
</div>
<p>The release operation on the epoch atomic is likely to require the compiler to
insert a fence that has an effect that goes beyond the intended constraint,
which is to order only the operations on the barrier object. Since the barrier
object is likely to be smaller than a cache line and the library&#8217;s
implementation can control its alignment using <tt class="docutils literal"><span class="pre">alignas</span></tt>, then it would be
possible to compile this program without a fence in this location on
architectures that are cache-line coherent.</p>
<p>To concisely express the bound on the set of memory operations whose order is
constrained, we propose to accompany <tt class="docutils literal"><span class="pre">std::atomic_thread_fence</span></tt> with an
<tt class="docutils literal"><span class="pre">object</span></tt> variant which takes a reference to the object(s) to be ordered by
the fence.</p>
</div>
<div class="section" id="proposed-addition">
<h2>Proposed addition<a class="headerlink" href="#proposed-addition" title="Permalink to this headline">¶</a></h2>
<p>Under 29.2 Header <tt class="docutils literal"><span class="pre">&lt;atomic&gt;</span></tt> synopsis [<strong>atomics.syn</strong>]:</p>
<div class="highlight-c++"><div class="highlight"><pre><span class="k">namespace</span> <span class="n">std</span> <span class="p">{</span>
   <span class="c1">// 29.8, fences</span>
   <span class="c1">// ...</span>
   <span class="k">template</span><span class="o">&lt;</span><span class="n">class</span><span class="p">...</span> <span class="n">T</span><span class="o">&gt;</span>
   <span class="kt">void</span> <span class="n">atomic_object_fence</span><span class="p">(</span><span class="n">memory_order</span><span class="p">,</span> <span class="n">T</span><span class="o">&amp;&amp;</span><span class="p">...</span> <span class="n">objects</span><span class="p">)</span> <span class="n">noexcept</span><span class="p">;</span>
 <span class="p">}</span>
</pre></div>
</div>
<p>Under 29.8 Fences [<strong>atomics.fences</strong>], after the current
<tt class="docutils literal"><span class="pre">atomic_thread_fence</span></tt> paragraph:</p>
<p><tt class="docutils literal"><span class="pre">template&lt;class...</span> <span class="pre">T&gt;</span> <span class="pre">void</span> <span class="pre">atomic_object_fence(memory_order,</span> <span class="pre">T&amp;&amp;...</span> <span class="pre">objects)</span> <span class="pre">noexcept;</span></tt></p>
<p><em>Effect</em>: Equivalent to <tt class="docutils literal"><span class="pre">atomic_thread_fence(order)</span></tt> except that operations on
objects other than those in the variadic template arguments and their
sub-objects are <em>un-sequenced</em> with the fence.</p>
<p><em>Note</em>: The compiler may omit fences entirely depending on alignment
information, may generate a dynamic test leading to a fence for under-aligned
objects, or may emit the same fence an <tt class="docutils literal"><span class="pre">atomic_thread_fence</span></tt> would.</p>
<p>The <tt class="docutils literal"><span class="pre">__cpp_lib_atomic_object_fence</span></tt> feature test macro should be added.</p>
</div>
<div class="section" id="example-implementation">
<h2>Example implementation<a class="headerlink" href="#example-implementation" title="Permalink to this headline">¶</a></h2>
<p>A trivial, yet conforming implementation may implement the new fence in terms of
the existing <tt class="docutils literal"><span class="pre">std::atomic_thread_fence</span></tt> using the same memory order:</p>
<div class="highlight-c++"><div class="highlight"><pre><span class="k">template</span><span class="o">&lt;</span><span class="n">class</span><span class="p">...</span> <span class="n">T</span><span class="o">&gt;</span>
<span class="kt">void</span> <span class="n">atomic_object_fence</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order</span> <span class="n">order</span><span class="p">,</span> <span class="n">T</span> <span class="o">&amp;&amp;</span><span class="p">...)</span> <span class="n">noexcept</span> <span class="p">{</span>
  <span class="n">std</span><span class="o">::</span><span class="n">atomic_thread_fence</span><span class="p">(</span><span class="n">order</span><span class="p">);</span>
<span class="p">}</span>
</pre></div>
</div>
<p>A more advanced implementation can overload this for the single-object case
on architectures (or micro-architectures) that have cache coherency with a known
line size, even if it is conservatively approximated:</p>
<div class="highlight-c++"><div class="highlight"><pre><span class="cp">#define __CACHELINE_SIZE </span><span class="c1">// Secret (micro-)architectural value.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="o">&gt;</span>
<span class="n">std</span><span class="o">::</span><span class="kt">enable_if_t</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">is_standard_layout</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;::</span><span class="n">value</span> <span class="o">&amp;&amp;</span>
                 <span class="n">__CACHELINE_SIZE</span> <span class="o">-</span> <span class="n">alignof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">%</span> <span class="n">__CACHELINE_SIZE</span> <span class="o">&gt;=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span><span class="o">&gt;</span>
<span class="n">atomic_object_fence</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order</span><span class="p">,</span> <span class="n">T</span> <span class="o">&amp;&amp;</span><span class="n">object</span><span class="p">)</span> <span class="n">noexcept</span> <span class="p">{</span>
  <span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">&quot;&quot;</span> <span class="o">:</span> <span class="s">&quot;+m&quot;</span><span class="p">(</span><span class="n">object</span><span class="p">)</span> <span class="o">:</span> <span class="s">&quot;m&quot;</span><span class="p">(</span><span class="n">object</span><span class="p">));</span>  <span class="c1">// Code motion barrier.</span>
<span class="p">}</span>
</pre></div>
</div>
<p>To extend this for multiple objects, an implementation for the same architecture may
emit a run-time check that the total footprint of all the objects fits in the span of
a single cache line.  This check may commonly be eliminated as dead code, for example
when the objects are references from a common base pointer.</p>
<p>The above <tt class="docutils literal"><span class="pre">std::barrier</span></tt> example&#8217;s inner-code can use the new overload as follows:</p>
<div class="highlight-c++"><div class="highlight"><pre><span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="n">expected</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">expected</span> <span class="o">=</span> <span class="n">nexpected</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
    <span class="n">arrived</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
    <span class="n">atomic_object_fence</span><span class="p">(</span><span class="n">memory_order_release</span><span class="p">,</span> <span class="o">*</span><span class="k">this</span><span class="p">);</span>
    <span class="n">epoch</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">myepoch</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
</pre></div>
</div>
<p>It is equivalently valid to list the individual members of <tt class="docutils literal"><span class="pre">barrier</span></tt> instead of
<tt class="docutils literal"><span class="pre">*this</span></tt>. Both forms are equivalent.</p>
<p>Less trivial implementations of <tt class="docutils literal"><span class="pre">std::atomic_object_fence</span></tt> can enable more
optimizations for new hardware and portable program representations.</p>
</div>
<div class="section" id="relation-to-n4523">
<h2>Relation to N4523<a class="headerlink" href="#relation-to-n4523" title="Permalink to this headline">¶</a></h2>
<p>In <a class="reference external" href="http://wg21.link/N4523">N4523</a> we propose to formalize the notions of false-sharing and true-sharing
as perceived by the implementation in relation to the placement of objects in
memory. In the expository implementation of the previous section we also showed
how a cache-line coherent architecture or micro-architecture can elide fences
that only bisect relations between objects that are in the same cache line, if
provable at compile-time. These notions interact in a virtuous way because
N4523&#8217;s abstraction enables reasoning about likely cache behavior that
implementations can optimize for.</p>
<p>The example application of <tt class="docutils literal"><span class="pre">std::atomic_object_fence</span></tt> to the <tt class="docutils literal"><span class="pre">std::barrier</span></tt>
object is improved by combining these notions as follows:</p>
<div class="highlight-c++"><div class="highlight"><pre><span class="n">alignas</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">::</span><span class="n">hardware_true_sharing_size</span><span class="p">)</span> <span class="c1">// N4523</span>
<span class="k">struct</span> <span class="n">barrier</span> <span class="p">{</span>
    <span class="c1">// Some member functions elided.</span>
    <span class="kt">void</span> <span class="n">arrive_and_wait</span><span class="p">()</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="k">const</span> <span class="n">myepoch</span> <span class="o">=</span> <span class="n">epoch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
        <span class="kt">int</span> <span class="k">const</span> <span class="n">result</span> <span class="o">=</span> <span class="n">arrived</span><span class="p">.</span><span class="n">fetch_add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_acq_rel</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="n">expected</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">expected</span> <span class="o">=</span> <span class="n">nexpected</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
            <span class="n">arrived</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
            <span class="n">atomic_object_fence</span><span class="p">(</span><span class="n">memory_order_release</span><span class="p">,</span> <span class="o">*</span><span class="k">this</span><span class="p">);</span> <span class="c1">// N4522</span>
            <span class="n">epoch</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">myepoch</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="k">else</span>
            <span class="k">while</span> <span class="p">(</span><span class="n">epoch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">memory_order_acquire</span><span class="p">)</span> <span class="o">==</span> <span class="n">myepoch</span><span class="p">)</span>
                <span class="p">;</span>
    <span class="p">}</span>
<span class="nl">private:</span>
    <span class="kt">int</span> <span class="n">expected</span><span class="p">;</span>
    <span class="n">atomic</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="n">arrived</span><span class="p">,</span> <span class="n">nexpected</span><span class="p">,</span> <span class="n">epoch</span><span class="p">;</span>
<span class="p">};</span>
</pre></div>
</div>
<p>By aligning the barrier object to the true-sharing granularity, it is
significantly more likely that the implementation will be able to elide the
fence if the architecture or micro-architecture has cache-line coherency. Of
course an implementation of the Standard is free to ensure this by other means,
we provide this example as exposition for what developer programs might do.</p>
</div>
<div class="section" id="memory-model-example">
<h2>Memory model example<a class="headerlink" href="#memory-model-example" title="Permalink to this headline">¶</a></h2>
<table border="1" class="docutils">
<colgroup>
<col width="50%" />
<col width="50%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">T0</th>
<th class="head">T1</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">0:</span> <span class="pre">w</span> <span class="pre">=</span> <span class="pre">1;</span></tt></td>
<td><tt class="docutils literal"><span class="pre">4:</span> <span class="pre">while(!a.load(rlx));</span></tt></td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">1:</span> <span class="pre">x</span> <span class="pre">=</span> <span class="pre">1;</span></tt></td>
<td><tt class="docutils literal"><span class="pre">5:</span> <span class="pre">objfence(acq,</span> <span class="pre">a,</span> <span class="pre">x);</span></tt></td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">2:</span> <span class="pre">objfence(rel,</span> <span class="pre">a,</span> <span class="pre">x);</span></tt></td>
<td><tt class="docutils literal"><span class="pre">6:</span> <span class="pre">assert(x);</span></tt></td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">3:</span> <span class="pre">a.store(1,rlx);</span></tt></td>
<td><tt class="docutils literal"><span class="pre">7:</span> <span class="pre">assert(w);</span></tt></td>
</tr>
</tbody>
</table>
<p>The semantics of fences mean that:</p>
<dl class="docutils">
<dt><tt class="docutils literal"><span class="pre">2</span></tt> synchronizes-with <tt class="docutils literal"><span class="pre">5</span></tt> because [<strong>29.8¶2</strong>]:</dt>
<dd><ol class="first last upperalpha simple">
<li><tt class="docutils literal"><span class="pre">2</span></tt> is sequenced-before <tt class="docutils literal"><span class="pre">3</span></tt>,</li>
<li><tt class="docutils literal"><span class="pre">3</span></tt> inter-thread happens-before <tt class="docutils literal"><span class="pre">4</span></tt>, and</li>
<li><tt class="docutils literal"><span class="pre">4</span></tt> is sequenced-before <tt class="docutils literal"><span class="pre">5</span></tt>.</li>
</ol>
</dd>
<dt><tt class="docutils literal"><span class="pre">1</span></tt> happens-before <tt class="docutils literal"><span class="pre">6</span></tt> because [<strong>1.10¶13-14</strong>]:</dt>
<dd><ol class="first last upperalpha simple">
<li><tt class="docutils literal"><span class="pre">1</span></tt> is sequenced-before <tt class="docutils literal"><span class="pre">2</span></tt>,</li>
<li><tt class="docutils literal"><span class="pre">2</span></tt> synchronizes-with <tt class="docutils literal"><span class="pre">5</span></tt>, and</li>
<li><tt class="docutils literal"><span class="pre">5</span></tt> is sequenced-before <tt class="docutils literal"><span class="pre">6</span></tt>.</li>
</ol>
</dd>
</dl>
<p>Therefore the program is well-defined (so far) and the <tt class="docutils literal"><span class="pre">assert(x)</span></tt> of <tt class="docutils literal"><span class="pre">6</span></tt>
does not fire.</p>
<p>However, the <em>un-sequenced</em> semantics of the object fence also mean that:</p>
<dl class="docutils">
<dt><tt class="docutils literal"><span class="pre">0</span></tt>  conflicts with <tt class="docutils literal"><span class="pre">7</span></tt> because [<strong>1.10¶23</strong>]:</dt>
<dd><ol class="first last upperalpha simple">
<li><tt class="docutils literal"><span class="pre">0</span></tt> is a store to <tt class="docutils literal"><span class="pre">w</span></tt>, <tt class="docutils literal"><span class="pre">7</span></tt> is a load of <tt class="docutils literal"><span class="pre">w</span></tt> and they are not both
atomic, and</li>
<li><tt class="docutils literal"><span class="pre">0</span></tt> is not sequenced-before <tt class="docutils literal"><span class="pre">2</span></tt> and <tt class="docutils literal"><span class="pre">5</span></tt> is not sequenced-before
<tt class="docutils literal"><span class="pre">7</span></tt>.</li>
</ol>
</dd>
</dl>
<p>Therefore the <tt class="docutils literal"><span class="pre">assert(w)</span></tt> of <tt class="docutils literal"><span class="pre">7</span></tt> makes the program undefined due to a
data-race.</p>
</div>
</div>


    </div>
  </body>
</html>