<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>Explicit memory fences</title>

    
    <meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
    <meta http-equiv="Content-type" content="text/html;charset=UTF-8" >
    <meta http-equiv="Content-Language" content="en-us">
</head>
<body>
    <address>
      Document number: N2262=07-0122 </address>
    <address>
      Programming Language C++, Evolution/Library</address>
    <address>
      &nbsp;</address>
    <address>
      Raul Silvera, &lt;<a href="mailto:rauls@ca.ibm.com">rauls@ca.ibm.com</a>&gt;</address>
    <address>
      Peter Dimov, &lt;<a href="mailto:pdimov@pdimov.com">pdimov@pdimov.com</a>&gt;</address>
    <address>
      &nbsp;</address>
    <address>
      2007-05-06</address>
    <h1>
      Explicit Memory Fences</h1>
    <ul>
      <li><a href="#overview">Overview</a></li>
      <li><a href="#rationale">Rationale</a></li>
      <li><a href="#objections">Objections</a></li>
      <li><a href="#proposed">Proposed Text</a></li>
    </ul>
    <h2>
      <a name="overview">I. Overview</a></h2>
    <p>
      This document presents a proposal for explicit memory fences in
      the C++ standard.  It has been extracted from the atomic
      operation library N2195 by Peter Dimov, which was based in
      part on the memory model proposed in N2237 by Silvera, Wong,
      McKenney and Blainey, and on Alexander Terekhov's contributions
      to mailing lists and the comp.programming.threads discussion
      group.</p>
    <p>
      Memory fences are a mechanism to provide ordering between memory
      operations. They differ from ordered atomic operations as
      described in N2195 and N2145 in that they are not associated with
      a specific memory access; they describe an ordering relationship
      between all preceding memory accesses and all subsequent ones.
      This proposal includes several variants of fences, which differ in
      the class of memory operations they affect (loads vs stores).
      </p>
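    <p>
      As a rough illustration of a fence that is not tied to a single
      access, the sketch below uses <code>std::atomic_thread_fence</code>
      from the C++11 <code>&lt;atomic&gt;</code> header as a stand-in for
      the release and acquire fences discussed here. That mapping is an
      assumption made for illustration only, and the names
      <code>data_a</code>, <code>data_b</code>, <code>ready_a</code>,
      <code>ready_b</code>, <code>producer</code> and
      <code>consumer_b</code> are hypothetical. One release fence orders
      both payload writes before both relaxed flag stores.</p>

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state: two plain payloads and two atomic flags.
int data_a = 0;
int data_b = 0;
std::atomic<bool> ready_a{false};
std::atomic<bool> ready_b{false};

void producer() {
    data_a = 1;
    data_b = 2;
    // A single standalone fence orders both payload writes before
    // both of the relaxed flag stores that follow it.
    std::atomic_thread_fence(std::memory_order_release);
    ready_a.store(true, std::memory_order_relaxed);
    ready_b.store(true, std::memory_order_relaxed);
}

int consumer_b() {
    // Spin on a relaxed load; the load itself carries no ordering.
    while (!ready_b.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // The fence orders the flag load before the payload reads below.
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(data_a == 1 && data_b == 2);
    return data_a + data_b;
}
```

    <p>
      Expressing the same thing with ordered atomic operations alone
      would attach a release constraint to each flag store separately.</p>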
    <h2>
      <a name="rationale">II. Rationale</a></h2>
    <p>
      The proposal in this document maintains the semantics from
      N2195. The additional contribution from this paper is to discuss
      the justification for standalone memory fences and to summarize
      and refute the main objections to their introduction that have
      been raised so far. </p>
    
	<p>
      There are multiple grounds for introducing explicit memory
      fences to the C++ standard:
      </p>

    <ul>
      <li><b>Widespread use of fences in current software.</b> 

	There is a large body of code that currently relies on
	standalone memory fences for concurrent operations.

	These programs can benefit from using standard mechanisms to
	facilitate portability to new platforms.

	However, it would be a large and error-prone undertaking to modify
	these programs to use ordered atomic operations.

      <li><b>Runtime performance.</b> 

	Atomic ordered operations are insufficient to precisely
	describe the synchronization requirements of some
	algorithms. 

	In the absence of explicit fences, programmers will be
	forced to introduce redundant ordering operations into their
	programs, which will impact runtime performance on many
	platforms.

	There is a detailed discussion in N2237 of some of the
	algorithms that would be affected; these include some common
	idioms, such as reference-counting and multiple lock release.

      <li><b>Widely implemented and well understood.</b> 

	Memory fences are provided by several hardware
	implementations, and are also present on many programming
	models, including OpenMP and UPC.

	Many programmers are familiar with fences and have significant
	experience with them; they will find it easier to program with
	fences than with the atomic ordered operations being proposed.

	The acquire and release variants of fences share the same
	properties as the acquire and release operations in the
	current memory models, to minimize the number of rules that
	the programmer needs to follow.

	Doug Lea also has a proposal analogous to this one for
	explicit fences in Java, which is being implemented as a GPL
	add-on to the HotSpot compiler.
    </ul>
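    <p>
      The reference-counting idiom mentioned above can be sketched as
      follows. This is one plausible reading of the case, not the text
      of N2237: the type <code>Counted</code> and the functions
      <code>add_ref</code> and <code>release_ref</code> are hypothetical,
      and <code>std::atomic_thread_fence</code> from C++11 is assumed as
      a stand-in for the proposed acquire fence. The decrement needs
      only release ordering; acquire ordering is required solely on the
      path that drops the last reference, so a standalone fence is
      placed there instead of strengthening every decrement.</p>

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference-counted object.
struct Counted {
    std::atomic<int> refs{1};
    int payload = 0;
};

// Taking a new reference needs no ordering of its own.
void add_ref(Counted& obj) {
    obj.refs.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if this call dropped the last reference, in which case
// the caller may destroy the object.
bool release_ref(Counted& obj) {
    int previous = obj.refs.fetch_sub(1, std::memory_order_release);
    assert(previous >= 1);  // releasing more references than taken is a bug
    if (previous == 1) {
        // Only the final release pays for acquire ordering, so that the
        // destroying thread sees all earlier writes to the object.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}
```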

    <p>
      This proposal includes three variants of fences: acquire,
      release and full (ordered) fences. Acquire and release fences
      have semantics analogous to the acquire and release operations
      in the atomics package. They typically perform much better
      than fully ordered fences, and they can be used in
      many concurrent programming idioms. N2237 includes a more
      detailed justification and use cases for these fences.
      </p>
    
    <h2>
      <a name="objections">III. Objections</a></h2>

    <p>
      This section summarizes the objections raised so far against
      the introduction of memory fences and responds to each of
      them.</p>
    
    <p>
      <b>Globality.</b> Standalone memory fences order all preceding
      memory operations vs all subsequent ones. Such globality could
      hinder their performance on widely decoupled, massively parallel
      architectures, and including them in the standard would
      prevent hardware vendors from developing such machines.</p>
    
    <p>
      In the current memory model, acquire fences could be implemented
      (albeit suboptimally) by upgrading the latest executed load of
      each live atomic variable from relaxed to acquire. If the latest
      executed load of an atomic variable was already a load_acquire,
      the fence would have no effect on the visibility of that
      variable. While the set of variables affected is potentially
      unbounded, it seems clear that if all atomic loads in a program
      were acquire loads, then all acquire fences would be redundant
      and could be eliminated.

      This supports the contention that the hardware mechanisms
      required to support acquire loads are sufficient to implement
      acquire fences. So, the introduction of acquire fences into the
      memory model will not prevent development of new parallel
      architectures any further than the presence of acquire loads.
      An analogous argument can be made for release and ordered
      fences.
    </p>
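    <p>
      The argument above can be made concrete with two readers of the
      same flag. In this sketch, which assumes C++11
      <code>std::atomic</code> and <code>std::atomic_thread_fence</code>
      as stand-ins for the proposed primitives and uses hypothetical
      names throughout, <code>read_with_fence</code> pairs a relaxed
      load with an acquire fence, while
      <code>read_with_acquire_load</code> upgrades the load itself to
      acquire and needs no fence.</p>

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state for the illustration.
std::atomic<int> flag{0};
int payload = 0;

void writer() {
    payload = 7;
    flag.store(1, std::memory_order_release);
}

// Variant 1: relaxed load followed by a standalone acquire fence.
int read_with_fence() {
    while (flag.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Variant 2: the load is upgraded to acquire, so the fence would be
// redundant and is omitted.
int read_with_acquire_load() {
    while (flag.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();
    }
    return payload;
}

// Sanity check that both variants observe the same payload once the
// writer has run.
void check_readers_agree() {
    writer();
    assert(read_with_fence() == read_with_acquire_load());
}
```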
    <p>
      There are no known existing platforms where fences cannot be
      efficiently implemented, and even if such a platform existed,
      only programs that explicitly use fences would be affected by
      their inclusion in the standard.
      
    </p>
    <p>
      <b>Teachability.</b> The vast majority of programmers are
      generally unable to comprehend weak memory ordering, so they will
      be unable to properly use mechanisms such as standalone memory
      fences.</p>

    <p>
      One of our main justifications for introducing fences is to
      upgrade existing code that is already written in terms of
      fences. In this case, the programmer is already familiar with
      the fence abstraction. </p>

    <p>
      It is questionable whether weakly constrained atomic accesses
      are easier to teach than explicit memory fences. Some
      programmers find the fence abstraction much easier to comprehend
      and reason about. We believe that both are advanced tools that
      require approximately the same level of deep
      understanding. Fences are the ordering mechanism of choice for
      several existing weakly-ordered architectures and programming
      models, so most programmers with experience in concurrency are
      likely to have encountered them.</p>

    <p>
      <b>Redundancy.</b> Atomic ordered operations already provide an
      ordering mechanism, so explicit memory fences are redundant and
      unnecessary.</p>

    <p>
      The ordering provided by explicit fences does not exactly
      duplicate the one provided by ordered atomic operations. As
      described in N2237, some algorithms can be more precisely
      described using fences instead of atomic ordered operations. On
      the other hand, atomic ordered operations are fully redundant
      with the proposed fences; however, we are not recommending their
      removal since they are a useful abstraction that is sufficient
      in many practical cases.
      </p>
    <h2>
        <a name="proposed">IV. Proposed Text</a></h2>
    <p>
      This section describes the functions to be introduced as part of the atomic library package.</p>
    <h4>
        Header &lt;atomic&gt; synopsis</h4>
    <pre>

// Fences

inline void acquire_memory_fence( void );
inline void release_memory_fence( void );
inline void acq_rel_memory_fence( void );
inline void ordered_memory_fence( void );

// Compiler fences

inline void acquire_compiler_fence( void );
inline void release_compiler_fence( void );
inline void acq_rel_compiler_fence( void );
inline void ordered_compiler_fence( void );

</pre>

    <p>An alternative syntax for these primitives uses a single
    function with a template parameter to specify the ordering
    constraint. This proposal recommends a syntax consistent with the
    rest of the atomics package.</p>
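    <p>
      For comparison, the template-based alternative might look roughly
      like the sketch below. Everything in it is hypothetical: the
      enumeration <code>fence_order</code>, the helper
      <code>to_std_order</code> and the template
      <code>memory_fence</code> are not part of this proposal, and the
      mapping onto C++11 <code>std::memory_order</code> values (with
      "ordered" taken to mean sequentially consistent) is an assumption
      made only so that the example compiles against an existing
      library.</p>

```cpp
#include <atomic>

// Hypothetical ordering tag for the single-function alternative.
enum class fence_order { acquire, release, acq_rel, ordered };

// Assumed mapping onto the C++11 memory orders.
constexpr std::memory_order to_std_order(fence_order o) {
    return o == fence_order::acquire ? std::memory_order_acquire
         : o == fence_order::release ? std::memory_order_release
         : o == fence_order::acq_rel ? std::memory_order_acq_rel
                                     : std::memory_order_seq_cst;
}

// One function template, parameterized on the ordering constraint.
template <fence_order Order>
inline void memory_fence() {
    std::atomic_thread_fence(to_std_order(Order));
}

// The named forms from the synopsis, expressed through the template.
inline void acquire_memory_fence() { memory_fence<fence_order::acquire>(); }
inline void release_memory_fence() { memory_fence<fence_order::release>(); }
inline void acq_rel_memory_fence() { memory_fence<fence_order::acq_rel>(); }
inline void ordered_memory_fence() { memory_fence<fence_order::ordered>(); }
```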

    <h4>
        Fence Semantics</h4>
    <ul>
        <li><code>acquire_memory_fence( void )</code> ensures that
        all subsequent operations in program order are performed after
        all preceding loads in program order;</li>

        <li><code>release_memory_fence( void )</code> ensures that
        all preceding operations in program order are performed before
        all subsequent stores in program order;</li>

        <li><code>acq_rel_memory_fence( void )</code> combines the
        semantics of <code>acquire</code> and
        <code>release</code>;</li>

        <li><code>ordered_memory_fence( void )</code> ensures that
        all preceding operations in program order are performed before
        all subsequent operations in program order.</li>

    </ul>
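    <p>
      Under the definitions above, neither the acquire nor the release
      fence orders a preceding store against a subsequent load; only
      the ordered fence does. The store-then-load pattern below is a
      sketch of a case that needs it. It assumes that a C++11
      sequentially consistent <code>std::atomic_thread_fence</code> is
      an adequate stand-in for <code>ordered_memory_fence</code>, and
      the variable and function names are hypothetical. With the fences
      in place, the two functions cannot both return 0 when run
      concurrently.</p>

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared flags, both initially zero.
std::atomic<int> x{0};
std::atomic<int> y{0};

// Each side announces itself with a store, then inspects the other side.
int side_one() {
    x.store(1, std::memory_order_relaxed);
    // Orders the preceding store before the subsequent load.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_relaxed);
}

int side_two() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return x.load(std::memory_order_relaxed);
}

// The outcome the fences rule out: both sides missing the other's store.
void check_outcome(int r1, int r2) {
    assert(!(r1 == 0 && r2 == 0));
}
```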

    <p>
      The compiler versions of these fences provide the same mechanism
      as their memory counterparts, except that the memory accesses
      are ordered only with respect to threads executing on the same
      processing unit. These fences avoid introducing any
      hardware primitives that order memory accesses across different
      processors.
      </p>
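    <p>
      A close existing analogue of a compiler-only fence is
      <code>std::atomic_signal_fence</code> from C++11, which constrains
      reordering between a thread and a signal handler running on that
      same thread without emitting a hardware fence. The sketch below
      uses it purely as an illustration; treating it as equivalent to
      the proposed <code>release_compiler_fence</code> and
      <code>acquire_compiler_fence</code> is an assumption, and all
      names in the example are hypothetical.</p>

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical state shared between a thread and its own signal handler.
// std::atomic<int> and std::atomic<bool> are assumed to be lock-free here.
std::atomic<int> payload{0};
std::atomic<bool> ready{false};
std::atomic<int> seen{-1};

extern "C" void on_signal(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Compiler-only acquire ordering: the payload read stays after
        // the flag read, with no hardware fence emitted.
        std::atomic_signal_fence(std::memory_order_acquire);
        seen.store(payload.load(std::memory_order_relaxed),
                   std::memory_order_relaxed);
    }
}

int publish_and_signal() {
    auto previous = std::signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    payload.store(42, std::memory_order_relaxed);
    // Compiler-only release ordering: the payload store stays before
    // the flag store as seen by the handler on this thread.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
    std::raise(SIGINT);  // the handler runs on this thread before raise returns
    return seen.load(std::memory_order_relaxed);
}
```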

    <hr>
    <p>
        <em>--end</em></p>
</body></html>
