<html><head><title>C++ Atomic Types and Operations</title></head>
<body>
<h1>C++ Atomic Types and Operations</h1>

<p> ISO/IEC JTC1 SC22 WG21 N2324 = 07-0184 - 2007-06-24

<p> Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
<br>Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

<p> This document is a revision of N2145 = 07-0005 - 2007-01-12.

<ul>
<li><a href="#Introduction">Introduction</a></li>
    <ul>
    <li><a href="#Rationale">Rationale</a></li>
    <li><a href="#Prior">Prior Approaches</a></li>
    <li><a href="#Current">Current Approach</a></li>
    <li><a href="#Model">Memory Model Interaction</a></li>
    <li><a href="#Summary">Summary of Types and Operations</a></li>
    <li><a href="#Implementability">Implementability</a></li>
    <li><a href="#Remaining">Remaining Issues</a></li>
    </ul>
<li><a href="#Interface">Interface</a></li>
    <ul>
    <li><a href="#Headers">Headers</a></li>
    <li><a href="#Memory">Memory Order</a></li>
    <li><a href="#Operations">Atomic Operations</a></li>
    <li><a href="#Types">Atomic Types</a></li>
    <li><a href="#LockFree">Lock-Free</a></li>
    <li><a href="#Synopsis">Synopsis</a></li>
    </ul>
<li><a href="#Implementation">Implementation</a></li>
    <ul>
    <li><a href="#Presentation">Notes on the Presentation</a></li>
    <li><a href="#Files">Implementation Files</a></li>
    <li><a href="#CPP0X">C++0x Features</a></li>
    <li><a href="#Order">Memory Order</a></li>
    <li><a href="#Flag">Flag Type and Operations</a></li>
    <li><a href="#ImplMacros">Implementation Macros</a></li>
    <li><a href="#LockFreeMacro">Lock-Free Macro</a></li>
    <li><a href="#Regular">Regular Types</a></li>
        <ul>
        <li><a href="#Boolean">Boolean</a></li>
        <li><a href="#Address">Address</a></li>
        <li><a href="#Integers">Integers</a></li>
        <li><a href="#Typedefs">Integer Typedefs</a></li>
        <li><a href="#Characters">Characters</a></li>
        <li><a href="#Generic">Generic Type</a></li>
        <li><a href="#Pointer">Pointer Partial Specialization</a></li>
        <li><a href="#Special">Full Specializations</a></li>
        </ul>
    <li><a href="#Functions">C++ Core Functions</a></li>
    <li><a href="#CoreMacros">C Core Macros</a></li>
    <li><a href="#Methods">Operators and Methods</a></li>
    <li><a href="#Cleanup">Implementation Header Cleanup</a></li>
    <li><a href="#Standard">Standard Headers</a></li>
    </ul>
<li><a href="#Examples">Examples of Use</a></li>
    <ul>
    <li><a href="#ExampleFlag">Flag</a></li>
    <li><a href="#ExampleLazy">Lazy Initialization</a></li>
    <li><a href="#ExampleInteger">Integer</a></li>
    <li><a href="#ExampleEvent">Event Counter</a></li>
    <li><a href="#ExampleList">List Insert</a></li>
    <li><a href="#ExampleUpdate">Update</a></li>
    <li><a href="#ExampleMain">Main</a></li>
    </ul>
</ul>

<h2><a name="Introduction">Introduction</a></h2>

<p> We present an interface and minimal implementation
for standard atomic types and operations.

<h3><a name="Rationale">Rationale</a></h3>

<p> The standard needs atomic types for two reasons.

<dl>

<dt>Lock-free concurrent data structures</dt>
<dd> Lock-free concurrent data structures
are difficult to design and implement correctly.
As such, there is a significant need for programmers
capable of doing so to write portably,
and for that they need standards.</dd>

<dt>Inter-thread memory synchronization</dt>
<dd> Occasionally synchronization with locks offers insufficient performance,
and other, more specific idioms suffice.</dd>

</dl>

<p> The traditional shared-memory notion
that every store is instantly visible to all threads
induces an unacceptable performance loss on modern systems.
Therefore programmers must have a mechanism
to indicate when stores in one thread should be communicated to another.
This mechanism must necessarily be atomic,
and we use the proposed interface to provide that mechanism.

<p> Specifically, a program that wishes to communicate the fact that
a particular piece of data prepared by one thread
is ready to be examined by another thread,
needs a shared variable <em>flag</em>
that both

<ul>
<li> Allows atomic accesses,
in the sense that concurrent reads and writes are allowed
and that the reads result in only one of the assigned values
and never undefined behavior.</li>
<li> Ensures that any ordinary data written before the flag is set
(i.e. the prepared data)
is seen correctly by another thread after it sees a set flag.</li>
</ul>
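<p> As a concrete sketch of this idiom
(written with the C++11 descendants of the proposed interface,
whose spellings differ slightly from this proposal's):

```cpp
#include <atomic>

int payload = 0;                // ordinary, non-atomic data
std::atomic<bool> flag(false);  // the shared flag

void producer() {
    payload = 42;                                 // prepare the data
    flag.store(true, std::memory_order_release);  // publish it
}

int consumer() {
    while (!flag.load(std::memory_order_acquire))
        ;            // spin until the flag is set; acquire pairs with release
    return payload;  // guaranteed to observe the value written above
}
```

<p> The release store and the acquire load together provide the second
property: once the consumer observes the flag set,
it also observes the prepared data.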

<p> Although the second aspect is often glossed over,
it is usually not automatic with modern hardware and compilers,
and is just as important as the first in ensuring correctness.

<h3><a name="Prior">Prior Approaches</a></h3>

<p> <a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1875.html">N1875</a>
presented an atomic statement,
which would require the compiler to select the appropriate atomic operations.
We chose to abandon this approach because:

<ul>
<li> The non-atomic evaluations within the atomic block
were less than obvious.</li>
<li> The atomic block syntax was too invitingly simple
for the level of skill required to use them correctly.</li>
<li> The atomic block syntax
seems far more appropriate to express general atomic memory transactions.
It is premature to standardize those at this point.</li>
</ul>

<p> <a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2006/n2047.html">N2047</a>
presented a template-based C++ library of atomic types
layered on top of a C library of atomic operations on plain integral types.
We chose to abandon this approach because:

<ul>
<li> The template-based approach severely limits interoperability with C.</li>
<li> The proposal had
both "guaranteed atomic" and "possibly emulated" versions of atomic types,
which appears to have insufficient expressive power to warrant the added complexity.
It was originally designed in part
to allow for access to hardware-provided atomic load and store operations
on platforms that do not provide a hardware compare-and-swap,
a property not shared by the current proposal.
Our current feeling is that such platforms
will be rare enough by 2010
that this property is not a major consideration.</li>
<li> The distinction between wait-free and lock-free
had insufficient user support and use cases.</li>
<li> The C-level approach
failed to identify concurrently accessed data at the point of declaration,
and made it too easy to access such data with non-atomic primitives.</li>
</ul>

<p> <a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2145.html">N2145</a>
is the basis for this proposal and is not fundamentally different.
However, we have refined the proposal in several areas.

<ul>
<li>The synchronization specification
has changed from a part of the function name to a function parameter.
The change simplifies the specification and
simplifies the programming of generic algorithms.
Its major cost is that optimizers
will need to do constant parameter propagation to obtain full performance.
We feel this capability is common enough in modern compilers
to not be an excessive burden.</li>
<li>We have added per-variable fence operations,
consistent with the updated concurrency model.</li>
<li>We have updated the proposal to include the new character types from
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2249.html">N2249
New Character Types in C++</a>.
</li>
<li>We have addressed the definitional weaknesses
with respect to POD-ness
with the help of the following papers.
<ul>
<li><a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2215.pdf">N2215</a> Initializer lists (Rev.3)</li>
<li><a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2230.html">N2230</a> POD's Revisited; Resolving Core Issue 568 (Revison 3)</li>
<li><a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2235.pdf">N2235</a> Generalized Constant Expressions &mdash; Revision 5</li>
<li><a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2326.html">N2326</a> Defaulted and Deleted Functions</li>
</ul>
</li>
</ul>

<h3><a name="Current">Current Approach</a></h3>

<p> We propose to add atomic types and operations purely as a library API.
In practice, this API would have to be implemented largely with
either compiler intrinsics or assembly code.
(As such, this proposal should be implemented by compiler vendors,
not library vendors,
much as the exception facilities are implemented by compiler vendors.)

<p> The proposal defines atomic types and
defines all atomic operations on those types,
which enables enforcement of a single synchronization protocol
for all operations on an object.

<p> The proposal defines the types as standard layout structs
and the core functions on pointers to those structs,
so that types and core functions are usable from both C and C++.
That is, a header included from both C and C++
can declare a variable of an atomic type
and provide inline functions that operate on them.
The proposal additionally provides member operators and member functions
so that C++ programmers may use a more concise syntax.

<p> The proposal defines the core functions
as overloaded functions in C++
and as type-generic macros in C.
This approach helps programmers avoid changing code
when an atomic type changes size.

<p> The proposal defines additional template types
to aid type-safe use of atomic pointers
and to simplify the wrapping of user-defined types.

<p> The proposal defines feature queries to determine
whether or not a type's operations are lock-free.
In some cases, both the decision to use a lock-free algorithm
and the choice of lock-free algorithm
depend on the availability of underlying hardware primitives.
In other cases, e.g. when dealing with asynchronous signals,
it may be important to know that operations like compare-and-swap
are really lock-free,
because a lock-based emulation might result in deadlock.
We provide two kinds of feature queries,
compile-time preprocessor macros and run-time functions.
The former provide general characteristics of the implementation.
The latter provide per-type information.
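<p> A hypothetical sketch of the two query styles,
using the C++11 spellings of this interface
(the <samp>ATOMIC_INT_LOCK_FREE</samp> macro and
<samp>is_lock_free</samp> member are the forms ultimately standardized):

```cpp
#include <atomic>

std::atomic<int> counter(0);

// Run-time, per-type query:
bool counter_is_lock_free() { return counter.is_lock_free(); }

// Compile-time, general characteristic of the implementation:
// a value of 2 means atomic ints are always lock-free.
#if defined(ATOMIC_INT_LOCK_FREE)
constexpr bool int_always_lock_free = (ATOMIC_INT_LOCK_FREE == 2);
#else
constexpr bool int_always_lock_free = false;
#endif
```

<p> A signal handler, for example, could rely on the compile-time macro,
while a generic container could select an algorithm with the run-time query.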


<p> To facilitate inter-process communication via shared memory,
it is our intent that lock-free operations also be <em>address-free</em>.
That is, atomic operations on the same memory location
via two different addresses
will communicate atomically.
The implementation may not depend on any per-process state.
While such a definition is beyond the scope of the standard,
a clear statement of our intent
will enable a portable expression of a class of programs already extant.

<h3><a name="Model">Memory Model Interaction</a></h3>

<p> Synchronization operations in the memory model
(<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2006/n2052.htm">N2052</a>,
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2006/n2138.html">N2138</a>,
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2171.htm">N2171</a>,
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2176.html">N2176</a>,
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2177.html">N2177</a>,
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2300.html">N2300</a>)
may be either <em>acquire</em> or <em>release</em> operations, or both.
These operations govern the communication of non-atomic stores between threads.
A release operation ensures that
prior memory operations become visible
to a thread performing a subsequent acquire operation
on the same object.

<p> Rather than have atomic operations implicitly provide
both acquire and release semantics, 
we choose to complicate the interface
by adding explicit ordering specifications to various operations.
Many comparable packages do not,
and instead provide only a single version of operations,
like compare-and-swap,
which implicitly include a full memory fence.

<p> Unfortunately,
the extra ordering constraint introduced by the single version
is almost never completely necessary.
For example,
an atomic increment operation
may be used simply to count the number of times a function is called,
as in a profiler.
This requires atomicity, but no ordering constraints.
And on many architectures (PowerPC, ARM, Itanium, Alpha, though not X86),
the extra ordering constraints are at least moderately expensive.
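<p> The profiling counter can be sketched as follows
(C++11 spelling of the interface):

```cpp
#include <atomic>

std::atomic<unsigned long> call_count(0);

void profiled_function() {
    // The increment must be atomic so that concurrent increments are
    // not lost, but it need not order any other memory operations.
    call_count.fetch_add(1, std::memory_order_relaxed);
}
```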

<p> It is also unclear how a convention requiring full memory fences
would properly interact with an interface
that supported simple atomic loads and stores.
Here a full memory fence would generally multiply the cost by a large factor.
(The current gcc interface
does not seem to support simple atomic loads and stores explicitly,
which makes it unclear how to support e.g. lock-based emulation,
or architectures on which the relevant loads and stores
are not implicitly atomic.)

<p> There are two possible approaches to specifying ordering constraints:

<ul>
<li> Have the programmer provide explicit memory fences/barriers,
perhaps most usefully
in a way that is analogous to the SPARC membar instructions.</li>
<li> Associate the ordering semantics with operations.
The closest hardware analog for this is probably Itanium,
though we carry this through more consistently.</li>
</ul>

<p> Both approaches appear to have their merits.
We chose the latter for several reasons:

<ul>
<li> On architectures such as X86 and Itanium,
it can lead to substantially faster code,
at least in the absence of complex compiler analysis.
For example, on X86,
a lock is often released with a simple store instruction,
which is widely believed to effectively have implicit "release" semantics.
If this appears in the source as <samp>store_release</samp>,
it is easy to simply map that to a store instruction.
If it is expressed as a fence followed by a store,
the compiler would have to deduce that the fence is redundant.
It is unclear that that can be done under realistic conditions,
since the fence prevents operations from moving into the critical region,
while a simple store does not do so for loads.
On Itanium a similar situation arises with the compare-and-swap operation.</li>

<li> It seems to be marginally more convenient.
For example, double-checked locking can be easily written
with load_acquire and store_release, with no explicit barriers.
(The semantics of the fence version are also unnecessarily stronger,
causing unnecessary overhead on Itanium.
We are not aware of common examples where the reverse is true
for other architectures.)</li>

<li> It gives us an easy way to express
that atomic loads and stores "normally" have acquire and release semantics,
but that weaker options may exist.
It is important to encourage the acquire/release versions,
since they behave with respect to dependencies
in the way that essentially all programmers expect,
while remaining easily definable.
The unordered variants can be very counterintuitive.</li>

<li> It makes it harder to ignore ordering issues.
Ignoring them is almost always a mistake at this level.</li>
</ul>
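<p> For instance, double-checked lazy initialization needs only a
load-acquire on the fast path and a store-release at publication.
A sketch in C++11 spelling
(<samp>Widget</samp> and the mutex are illustrative):

```cpp
#include <atomic>
#include <mutex>

struct Widget { int value = 7; };

std::atomic<Widget*> instance(nullptr);
std::mutex init_mutex;

Widget* get_instance() {
    Widget* p = instance.load(std::memory_order_acquire);  // fast path
    if (p == nullptr) {
        std::lock_guard<std::mutex> guard(init_mutex);
        p = instance.load(std::memory_order_relaxed);      // re-check under lock
        if (p == nullptr) {
            p = new Widget;
            instance.store(p, std::memory_order_release);  // publish
        }
    }
    return p;
}
```

<p> No explicit barriers appear;
the acquire load pairs with the release store.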

<p> Some architectures provide fences that are limited to loads or stores.
We have, so far, not included them,
since it seems to be hard to find cases in which both:

<ul>
<li> Such limited ordering constraints
are useful and not excessively brittle.</li>
<li> They actually result in a performance benefit
over the more general constraint.</li>
</ul>

<p> However, we have provided limited fence operations,
which are semantically modeled on read-modify-write operations.
We expect that implementations that have hardware fences
will use such operations to implement the language fences.

<p> Most architectures provide additional ordering guarantees
if one memory operation is dependent on another.
In fact, these are critical
for efficient implementation of languages like Java.

<p> In this case, there is near-universal agreement
that it would be nice to have some such guarantees.
The difficulty is that we have not been able to formulate such a guarantee
that both makes sense at the C++ source level,
and does not interfere with optimization of code in compilation units
that do not mention atomics.
The fundamental issues are:
<ul>
<li> Compilers may remove or change data and/or control dependencies.</li>
<li> Detailed guarantees vary across architectures.</li>
</ul>

<p> A strict interpretation of acquire and release
yields a fairly weak consistency model,
which allows threads to have a different notion of the order of writes.
For stronger consistency,
this proposal distinguishes between
an operation with acquire and release semantics
and an operation with sequentially consistent semantics.

<h3><a name="Summary">Summary of Types and Operations</a></h3>

<p> The proposal defines a synchronization enumeration,
which enables detailed specification of the memory order 
for every operation.

<p> The proposal includes a very simple atomic flag type,
providing two operations, test-and-set-acquire and clear-release.
This type is the minimum hardware-implemented type
needed to conform to this standard.
The remaining types can be emulated with the atomic flag,
though with less than ideal properties.
Few programmers should be using this type.
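<p> For illustration, a spinlock built from just the flag operations
(sketched here with the C++11 <samp>std::atomic_flag</samp>),
from which lock-based emulations of the other types can be assembled:

```cpp
#include <atomic>

class spinlock {
    std::atomic_flag flag_;
public:
    spinlock() { flag_.clear(); }  // start unlocked
    void lock() {
        // test-and-set with acquire semantics: spin until we set the flag
        while (flag_.test_and_set(std::memory_order_acquire))
            ;
    }
    void unlock() {
        flag_.clear(std::memory_order_release);  // clear with release
    }
};
```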

<p> The proposal includes several standard scalar atomic types:
boolean, address, and many integral types.

<p> To support generic data and algorithms,
the proposal also includes a generic atomic type,
a partial specialization for pointers (based on the atomic address),
and full specializations for integral types (based on the atomic scalars).

<p> The primary problem with a generic atomic template
is that effective use of machine operations
requires three properties of their parameter types:

<ul>
<li> bitwise copyable, </li>
<li> bitwise comparable, and </li>
<li> statically initializable </li>
</ul>

<p> In the present language,
there is no mechanism to enforce these properties on the parameter type.
Roland Schwarz suggests using a template union to enforce POD parameter types.
Unfortunately, that approach also prevents
the derivation of specializations of atomic for the types above,
which is unacceptable.
Furthermore, Lois Goldthwaite
proposes generalizing unions to permit non-POD types in
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2248.html">N2248 Toward a More Perfect Union</a>.
We believe that concepts are a more appropriate mechanism
to enforce this restriction,
but we do not employ that mechanism in this paper.

<p> The intent is that vendors will specialize
a fully-general locking implementation of a generic atomic template
with implementations using hardware primitives
when those primitives are applicable.
This specialization can be accomplished
by defining a base template with the size and alignment
of the parameter type as additional template parameters,
and then specializing on those two arguments.

<p> The proposal defines several atomic functions,
which serve the requirements of both C and C++ programmers.
Because these functions are clumsy,
the proposal includes member operator and member function definitions
that are syntactically simpler.
The member operators provide only the strongest memory synchronization.

<p> The fetch-and-operation functions return the original stored value.
This approach is required for fetch-and-or and fetch-and-and
because there is no means to compute the original stored value
from the new value and the modifying argument.
In contrast to the core functions,
the modifying assignment operators return the new value.
We do this for consistency with normal assignment operators.
Unlike normal assignment operators, though,
the atomic assignments return values rather than references.
The reason is that another thread might intervene
between an assignment and a subsequent read.
Rather than introduce this classic parallel programming bug,
we return a value.
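<p> A small sketch of the distinction (C++11 spelling):

```cpp
#include <atomic>
#include <utility>

std::pair<int,int> demo() {
    std::atomic<int> a(5);
    int from_function = a.fetch_add(1);  // core function: returns the original value, 5
    int from_operator = (a += 1);        // operator: returns the updated value, 7
    return std::make_pair(from_function, from_operator);
}
```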

<p> The functions and operations are defined to work with volatile objects,
so that variables that should be volatile can also be atomic.
The volatile qualifier, however, is not required for atomicity.

<p> The normal signed integral addition operations
have undefined behavior in the event of an overflow.
For atomic variables,
this undefined behavior introduces significant burden.
Rather than leave overflow undefined,
we recognize the de facto behavior of modern systems
and define atomic fetch-and-add (-subtract) to use two's-complement arithmetic.
We are aware of no implementation of C++
for which this definition is a problem.

<p> The fence operations act upon an atomic variable
and provide the synchronization semantics of a read-modify-write operation.


<h3><a name="Implementability">Implementability</a></h3>

<p> We believe that there is ample evidence for implementability.

<p> The Intel/gcc
<a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html#Atomic-Builtins"><samp>__sync</samp></a> intrinsics
provide evidence for compiler implementability of the proposed interface.

<blockquote>
(We do not advocate standardizing these intrinsics as is.
They provide far less control over memory ordering
than we advocated above.
For example,
they provide no way to atomically increment a counter
without imposing unnecessary ordering constraints.
The lack of appropriate ordering control
appears to already have resulted in implementation shortcuts,
e.g. gcc does not implement <samp>__sync_synchronize()</samp>
as a full memory barrier on X86,
in spite of the documentation.
We believe a number of issues were not fully understood
when that design was developed,
and it could greatly benefit from another iteration at this stage.)
</blockquote>

<p> Other packages,
particularly Boehm's
<a href="http://www.hpl.hp.com/research/linux/atomic_ops/">atomic_ops</a>
package provide evidence of efficient implementability
over a range of architectures.

<p> This proposal includes atomic integers smaller than a machine word,
even though many architectures do not have such operations.
For machines that implement a word-based compare-and-swap operation,
the effect of operations can be achieved by loading the containing word,
modifying the sub-word in place,
and performing a compare-and-swap on the containing word.
In the event that no compare-and-swap is available,
the implementation may need to 
either implement smaller types with larger types
or use locking algorithms.
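<p> The word-based emulation described above might be sketched as follows
(a hypothetical helper operating on one byte lane of an atomic word,
C++11 spelling):

```cpp
#include <atomic>
#include <cstdint>

// fetch-and-add on a one-byte lane of 'word', using only a word-wide
// compare-and-swap: load the containing word, modify the sub-word
// in place, and compare-and-swap the containing word back.
uint8_t byte_fetch_add(std::atomic<uint32_t>& word, int lane, uint8_t delta) {
    const uint32_t shift = static_cast<uint32_t>(lane) * 8;
    const uint32_t mask = 0xFFu << shift;
    uint32_t old = word.load(std::memory_order_relaxed);
    for (;;) {
        uint8_t byte = static_cast<uint8_t>((old >> shift) & 0xFFu);
        uint8_t sum  = static_cast<uint8_t>(byte + delta);  // wraps modulo 256
        uint32_t desired = (old & ~mask)
                         | (static_cast<uint32_t>(sum) << shift);
        // On failure, compare_exchange_weak refreshes 'old'; retry with it.
        if (word.compare_exchange_weak(old, desired))
            return byte;  // original lane value, fetch-and-add style
    }
}
```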

<p> The remaining implementation issue is the burden on implementors
to produce a minimally conforming implementation.
The minimum hardware support required
is the atomic test-and-set and clear operations
that form the basis of the atomic flag type.
This proposal includes an example implementation
based on that minimal hardware support,
and thus shows the vendor work <em>required</em>.

<h3><a name="Remaining">Remaining Issues</a></h3>

<p> The proposal does not standardize any flag wait functions,
despite using one.
The issue is that any particular wait function will often be inappropriate.
Should the proposal define a set of wait functions
for common situations?
Example situations include non-preemptive kernel execution
and preemptive user-space execution.

<p> There are FIX 470 functions and FIX 206 methods
defined by this proposal.
While the methods generally have trivial implementations,
the functions do not.
Will vendors commit to implementing this large a definition?
Note that both the prototype presented here
and the
<a href="http://www.hpl.hp.com/research/linux/atomic_ops/">atomic_ops</a>
package
provide moderately compact implementations,
in spite of the interface sizes.

<p> The mechanism provided to verify that programmers
are not using detailed synchronization control
is to encode such control in an enumeration,
which makes its uses easy to search for.
Is this acceptable?

<p> Are non-volatile overloads in template classes necessary or desirable?

<p> Is a per-type static lock-free query function desirable?

<p> The present implementation does not privatize members.
This is a shortcut to avoid all the friend declarations.


<h2><a name="Interface">Interface</a></h2>

<p> This section of the proposal is intended to provide
the basis for formal wording in a subsequent proposal.

<h3><a name="Headers">Headers</a></h3>

<p> The standard provides two headers,
<samp>cstdatomic</samp> and
<samp>stdatomic.h</samp>.
The <samp>cstdatomic</samp> header
defines the types and operations in namespace <samp>std</samp>.
The <samp>stdatomic.h</samp> header
defines the types and operations in namespace <samp>std</samp>
and also exports them to the global namespace.


<h3><a name="Memory">Memory Order</a></h3>

<p> The enumeration <samp>memory_order</samp>
specifies the detailed memory synchronization order
as defined in
<a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2300.html">N2300</a>.
Its enumerated values and their meanings are as follows.

<dl>
<dt><samp>memory_order_relaxed</samp>
<dd>
These operations do not order memory.
This order does not apply to fences.
</dd>

<dt><samp>memory_order_release</samp>
<dd>
These operations make regular memory writes
visible to other threads
through the atomic variable to which they are applied.
This order applies only to fences and operations that store.
</dd>

<dt><samp>memory_order_acquire</samp>
<dd>
These operations make regular memory writes in other threads
released through the atomic variable to which they are applied,
visible to the current thread.
This order applies only to fences and operations that load.
</dd>

<dt><samp>memory_order_acq_rel</samp>
<dd>
These operations have both acquire and release semantics.
This order applies only to fences and operations that may both load and store.
</dd>

<dt><samp>memory_order_seq_cst</samp>
<dd>
These operations have both acquire and release semantics,
and, in addition, have sequentially-consistent operation ordering.
This order applies to all operations.
The <samp>memory_order_seq_cst</samp> operations that load a value
are acquire operations on the affected locations.
The <samp>memory_order_seq_cst</samp> operations that store a value
are release operations on the affected locations.
In addition, in a consistent execution,
there must be a single total order S
on all <samp>memory_order_seq_cst</samp> operations,
consistent with the happens before order and modification orders
for all affected locations,
such that each <samp>memory_order_seq_cst</samp> operation
observes either the last preceding modification according to this order S,
or the result of an operation that is not <samp>memory_order_seq_cst</samp>.
[<i>Note:</i> Although we do not explicitly require that S include locks,
it can always be extended to an order that does include locks.
<i>&mdash;end note</i>]
</dd>

</dl>

<p> An atomic store shall only store a value
that has been computed from constants and program input values
by a finite sequence of program evaluations,
such that each evaluation observes the values of variables
as computed by the last prior assignment in the sequence.
The ordering of evaluations in this sequence must be such that 

<ul>

<li>If an evaluation B observes a value computed by A in the same thread,
then B must not be sequenced before A.
</li>

<li>If an evaluation B observes a value computed by A in a different thread,
then B must not happen before A.
</li>

<li>If an evaluation A is included,
then all evaluations that assign to the same variable
and are sequenced before or happens-before A
must be included.
[<i>Note:</i> This requirement disallows "out-of-thin-air",
or "speculative" stores of atomics when relaxed atomics are used.
Since unordered operations are involved,
evaluations may appear in this sequence out of thread order.
For example, with <samp>x</samp> and <samp>y</samp> initially zero, 

<dl>
<dt>Thread 1:</dt>
<dd><samp>r1 = y.load( memory_order_relaxed );</samp></dd>
<dd><samp>x.store( r1, memory_order_relaxed );</samp></dd>
<dt>Thread 2:</dt>
<dd><samp>r2 = x.load( memory_order_relaxed );</samp></dd>
<dd><samp>y.store( 42, memory_order_relaxed );</samp></dd>
</dl>
is allowed to produce r1 = r2 = 42.
The sequence of evaluations justifying this consists of:

<blockquote><samp>
y.store( 42, memory_order_relaxed );
<br>r1 = y.load( memory_order_relaxed );
<br>x.store( r1, memory_order_relaxed );
<br>r2 = x.load( memory_order_relaxed );
</samp></blockquote>

On the other hand, 

<dl>
<dt>Thread 1:</dt>
<dd><samp>r1 = y.load( memory_order_relaxed );</samp></dd>
<dd><samp>x.store( r1, memory_order_relaxed );</samp></dd>
<dt>Thread 2:</dt>
<dd><samp>r2 = x.load( memory_order_relaxed );</samp></dd>
<dd><samp>y.store( r2, memory_order_relaxed );</samp></dd>
</dl>

may not produce r1 = r2 = 42,
since there is no sequence of evaluations
that results in the computation of 42.
In the absence of "relaxed" operations
and read-modify-write operations with weaker than acq_rel ordering,
this requirement has no impact.  <i>&mdash;end note</i>]
</li>

<li>No evaluation may produce an outcome
that would be disallowed if an atomic object
were replaced by its non-atomic counterpart.
[<i>Note:</i> Since such a replacement usually results in a data race,
this is rarely a significant constraint.
It does disallow r1 == r2 == 42 in
(x,y initially zero):

<dl>
<dt>Thread 1:</dt>
<dd><samp>r1 = x.load( memory_order_relaxed );</samp></dd>
<dd><samp>if ( r1 == 42 ) y.store( r1, memory_order_relaxed );</samp></dd>
<dt>Thread 2:</dt>
<dd><samp>r2 = y.load( memory_order_relaxed );</samp></dd>
<dd><samp>if ( r2 == 42 ) x.store( 42, memory_order_relaxed );</samp></dd>
</dl>

since the corresponding program without atomics is race-free.

<i>&mdash;end note</i>]
</li>
</ul>

<h3><a name="Operations">Atomic Operations</a></h3>

<p> There are only a few kinds of general operations,
though there are many variants on those kinds.
The operations may explicitly specify memory order.
Implicitly, the memory order is <samp>memory_order_seq_cst</samp>.

<dl>

<dt>store</dt>
<dd>
The store operations replace the current value of the atomic
with the desired new value.
These operations may also appear as assignment operators,
in which case the value assigned is the result of the assignment.
</dd>

<dt>load</dt>
<dd>
The load operations return the current value of the atomic.
These operations may also appear as conversion operators.
</dd>

<dt>swap</dt>
<dd>
The swap operations
replace the value of the atomic with the desired new value
and return the old value.
These operations are read-modify-write operations
in the sense of the "synchronizes with" definition in 1.10p7,
and hence both the evaluation that produced the input value,
and the operation itself,
synchronize with any evaluation that reads the updated value.
</dd>

<dt>compare-and-swap</dt>
<dd>
The compare-and-swap operation
compares the expected old value to the current value of the atomic
and if equal,
replaces the value of the atomic with the desired new value
and returns true.
If the values are unequal, the operation updates the expected value
with the current value
and returns false.
These operations are read-modify-write operations
in the sense of the "synchronizes with" definition in 1.10p7,
and hence both the evaluation that produced the input value,
and the operation itself,
synchronize with any evaluation that reads the updated value.
The compare-and-swap operation
may spuriously fail (e.g. on load-locked store-conditional machines)
by returning false
with an updated expected value equal to the original expected value.
</dd>

<dt>fence</dt>
<dd>The fence operations
do not modify the atomic and do not return any values.
However, these operations are read-modify-write operations
in the sense of the "synchronizes with" definition in 1.10p7,
and hence both the evaluation that produced the input value,
and the operation itself,
synchronize with any evaluation that subsequently reads the value.
</dd>

<dt>fetch-and-{add,sub,and,or,xor}</dt>
<dd>
The fetch-and-<var>op</var> operations
replace the current value of the atomic
with the result of applying <var>op</var> to that value and the operand.
For member operators, the value returned is the updated value.
For other functions, the value returned is the old value.
In the event of signed integer overflow,
the result is as though the operations were two's complement.
These operations are read-modify-write operations
in the sense of the "synchronizes with" definition in 1.10p7,
and hence both the evaluation that produced the input value
and the operation itself
synchronize with any evaluation that reads the updated value.
</dd>

</dl>

<p> Implementations should strive to make atomic stores
visible to atomic loads within a reasonable amount of time.
They should never move an atomic operation out of an unbounded loop.

<p> Many operations are volatile-qualified.
This qualification means that volatility is preserved
when applying these operations to volatile objects.
It does not mean that operations on non-volatile objects
become volatile.


<h3><a name="Types">Atomic Types</a></h3>

<p> The <samp>atomic_flag</samp> type
is the minimum hardware-implemented type needed to conform to this standard.
The remaining types can be emulated with the atomic flag,
though with less than ideal properties.
The flag type has two primary operations,
test-and-set and clear.

<p> The <samp>atomic_bool</samp> type
supports store, load, swap, compare-and-swap, and fence.

<p> The <samp>atomic_address</samp> type
provides atomic <samp>void*</samp> operations
and 
supports store, load, swap, compare-and-swap, fence,
fetch-and-add, and fetch-and-subtract.
The unit of addition/subtraction is one byte.

<p> The atomic integral types
support store, load, swap, compare-and-swap, fence,
fetch-and-add, fetch-and-subtract,
fetch-and-and, fetch-and-or, and fetch-and-xor.
The integral types are
characters
<blockquote>
<samp>atomic_char16_t</samp>,
<samp>atomic_char32_t</samp>,
<samp>atomic_wchar_t</samp>
</blockquote>
and integers
<blockquote>
<samp>atomic_char</samp>,
<samp>atomic_schar</samp>,
<samp>atomic_uchar</samp>,
<samp>atomic_short</samp>,
<samp>atomic_ushort</samp>,
<samp>atomic_int</samp>,
<samp>atomic_uint</samp>,
<samp>atomic_long</samp>,
<samp>atomic_ulong</samp>,
<samp>atomic_llong</samp>,
<samp>atomic_ullong</samp>
</blockquote>
In addition, there are typedefs
for atomic types corresponding to the stdint typedefs.
<blockquote>
<samp>atomic_int_least8_t</samp>,
<samp>atomic_uint_least8_t</samp>,
<samp>atomic_int_least16_t</samp>,
<samp>atomic_uint_least16_t</samp>,
<samp>atomic_int_least32_t</samp>,
<samp>atomic_uint_least32_t</samp>,
<samp>atomic_int_least64_t</samp>,
<samp>atomic_uint_least64_t</samp>,
<samp>atomic_int_fast8_t</samp>,
<samp>atomic_uint_fast8_t</samp>,
<samp>atomic_int_fast16_t</samp>,
<samp>atomic_uint_fast16_t</samp>,
<samp>atomic_int_fast32_t</samp>,
<samp>atomic_uint_fast32_t</samp>,
<samp>atomic_int_fast64_t</samp>,
<samp>atomic_uint_fast64_t</samp>,
<samp>atomic_intptr_t</samp>,
<samp>atomic_uintptr_t</samp>,
<samp>atomic_size_t</samp>,
<samp>atomic_ssize_t</samp>,
<samp>atomic_ptrdiff_t</samp>,
<samp>atomic_intmax_t</samp>,
<samp>atomic_uintmax_t</samp>
</blockquote>

<p> Note that in C,
the atomic character types are not distinct types,
but rather typedefs to other atomic integral types.

<p> There is a generic <samp>atomic</samp> class template.
The template requires that its type argument be
<ul>
<li> bitwise copyable, </li>
<li> bitwise comparable, and </li>
<li> statically initializable </li>
</ul>
The operations supported are store, load, swap, compare-and-swap, and fence.

<p> There are pointer partial specializations
on the <samp>atomic</samp> class template.
The operations supported are store, load, swap, compare-and-swap, fence,
fetch-and-add, and fetch-and-subtract.
For <samp>atomic</samp> pointer partial specializations,
the unit of addition/subtraction is the size of the referenced type.

<p> Finally, there are full specializations
over the integral types
on the <samp>atomic</samp> class template.
The operations supported are store, load, swap, compare-and-swap, fence,
fetch-and-add, fetch-and-subtract,
fetch-and-and, fetch-and-or, and fetch-and-xor.

<h3><a name="LockFree">Lock-Free</a></h3>

<p> Whether or not a particular type supports lock-free operations
is important to its use in signal handlers and certain algorithms,
so there are queries for the lock-free property.
Because consistent use requires
that all operations on a type use the same protocol,
either all operations on a type are lock-free or none of them are.
Therefore, there is a single lock-free query per type.
However, the proposal defines operations on the <samp>atomic_flag</samp> type
to be lock-free, so that type needs no query.

<p> To facilitate optimal storage use,
the proposal supplies a feature macro, <samp>ATOMIC_SCALAR_LOCK_FREE</samp>,
that indicates general lock-free status of scalar atomic types.
A value of 0 indicates that the scalar types are never lock-free.
A value of 1 indicates that the scalar types are sometimes lock-free.
A value of 2 indicates that the scalar types are always lock-free.

<p> The result of a lock-free query on an object
cannot be inferred from the result of a lock-free query
on another object.
[<i>Note:</i>
This rule permits misaligned atomic variables when they are unavoidable.
<i>&mdash;end note</i> ]

<p>
The clear and test-and-set operations must be lock-free,
and hence address-free.

<h3><a name="Synopsis">Synopsis</a></h3>

<p> The atomic types and operations have the following synopsis.

<pre><samp>
typedef enum memory_order {
       memory_order_relaxed, memory_order_acquire, memory_order_release,
       memory_order_acq_rel, memory_order_seq_cst
} memory_order;

typedef struct atomic_flag
{
    bool test_and_set( memory_order = memory_order_seq_cst ) volatile;
    void clear( memory_order = memory_order_seq_cst ) volatile;

    atomic_flag() = default;
    atomic_flag( const atomic_flag&amp; ) = delete;
    atomic_flag&amp; operator =( const atomic_flag&amp; ) = delete;
} atomic_flag;

bool atomic_flag_test_and_set( volatile atomic_flag* );
bool atomic_flag_test_and_set_explicit( volatile atomic_flag*, memory_order );
void atomic_flag_clear( volatile atomic_flag* );
void atomic_flag_clear_explicit( volatile atomic_flag*, memory_order );

typedef struct atomic_bool
{
    bool lock_free() volatile;
    void store( bool, memory_order = memory_order_seq_cst ) volatile;
    bool load( memory_order = memory_order_seq_cst ) volatile;
    bool swap( bool, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( bool&amp;, bool,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;

    atomic_bool() = default;
    constexpr atomic_bool( bool __v__ );
    atomic_bool( const atomic_bool&amp; ) = delete;
    atomic_bool&amp; operator =( const atomic_bool &amp; ) = delete;
    bool operator =( bool ) volatile;
    operator bool() volatile;
} atomic_bool;

bool atomic_lock_free( volatile atomic_bool* );
void atomic_store( volatile atomic_bool*, bool );
void atomic_store_explicit( volatile atomic_bool*, bool, memory_order );
bool atomic_load( volatile atomic_bool* );
bool atomic_load_explicit( volatile atomic_bool*, memory_order );
bool atomic_swap( volatile atomic_bool*, bool );
bool atomic_swap_explicit( volatile atomic_bool*, bool, memory_order );
bool atomic_compare_swap( volatile atomic_bool*, bool*, bool );
bool atomic_compare_swap_explicit( volatile atomic_bool*, bool*, bool,
                                   memory_order );
void atomic_fence( volatile atomic_bool*, memory_order );

typedef struct atomic_address
{
    bool lock_free() volatile;
    void store( void*, memory_order = memory_order_seq_cst ) volatile;
    void* load( memory_order = memory_order_seq_cst ) volatile;
    void* swap( void*, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( void*&amp;, void*,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    void* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    void* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;

    atomic_address() = default;
    constexpr atomic_address( void* );
    atomic_address( const atomic_address&amp; ) = delete;
    atomic_address&amp; operator =( const atomic_address&amp; ) = delete;
    void* operator =( void* ) volatile;
    operator void*() volatile;
    void* operator +=( ptrdiff_t ) volatile;
    void* operator -=( ptrdiff_t ) volatile;
} atomic_address;

bool atomic_lock_free( volatile atomic_address* );
void atomic_store( volatile atomic_address*, void* );
void atomic_store_explicit( volatile atomic_address*, void*, memory_order );
void* atomic_load( volatile atomic_address* );
void* atomic_load_explicit( volatile atomic_address*, memory_order );
void* atomic_swap( volatile atomic_address*, void* );
void* atomic_swap_explicit( volatile atomic_address*, void*, memory_order );
bool atomic_compare_swap( volatile atomic_address*, void**, void* );
bool atomic_compare_swap_explicit( volatile atomic_address*, void**, void*,
                                   memory_order );
void atomic_fence( volatile atomic_address*, memory_order );
void* atomic_fetch_add( volatile atomic_address*, ptrdiff_t );
void* atomic_fetch_add_explicit( volatile atomic_address*, ptrdiff_t,
                                 memory_order );
void* atomic_fetch_sub( volatile atomic_address*, ptrdiff_t );
void* atomic_fetch_sub_explicit( volatile atomic_address*, ptrdiff_t,
                                 memory_order );
</samp></pre>

<p> And for each of
the <samp><var>integral</var></samp> (character and integer) types listed above,

<pre><samp>
typedef struct atomic_<var>integral</var>
{
    bool lock_free() volatile;
    void store( <var>integral</var>, memory_order = memory_order_seq_cst ) volatile;
    <var>integral</var> load( memory_order = memory_order_seq_cst ) volatile;
    <var>integral</var> swap( <var>integral</var>,
                   memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( <var>integral</var>&amp;, <var>integral</var>,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    <var>integral</var> fetch_add( <var>integral</var>,
                        memory_order = memory_order_seq_cst ) volatile;
    <var>integral</var> fetch_sub( <var>integral</var>,
                        memory_order = memory_order_seq_cst ) volatile;
    <var>integral</var> fetch_and( <var>integral</var>,
                        memory_order = memory_order_seq_cst ) volatile;
    <var>integral</var> fetch_or( <var>integral</var>,
                        memory_order = memory_order_seq_cst ) volatile;
    <var>integral</var> fetch_xor( <var>integral</var>,
                        memory_order = memory_order_seq_cst ) volatile;

    atomic_<var>integral</var>() = default;
    constexpr atomic_<var>integral</var>( <var>integral</var> );
    atomic_<var>integral</var>( const atomic_<var>integral</var>&amp; ) = delete;
    atomic_<var>integral</var>&amp; operator =( const atomic_<var>integral</var> &amp; ) = delete;
    <var>integral</var> operator =( <var>integral</var> ) volatile;
    operator <var>integral</var>() volatile;
    <var>integral</var> operator +=( <var>integral</var> ) volatile;
    <var>integral</var> operator -=( <var>integral</var> ) volatile;
    <var>integral</var> operator &amp;=( <var>integral</var> ) volatile;
    <var>integral</var> operator |=( <var>integral</var> ) volatile;
    <var>integral</var> operator ^=( <var>integral</var> ) volatile;
} atomic_<var>integral</var>;

bool atomic_lock_free( volatile atomic_<var>integral</var>* );
void atomic_store( volatile atomic_<var>integral</var>*, <var>integral</var> );
void atomic_store_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>, memory_order );
<var>integral</var> atomic_load( volatile atomic_<var>integral</var>* );
<var>integral</var> atomic_load_explicit( volatile atomic_<var>integral</var>*, memory_order );
<var>integral</var> atomic_swap( volatile atomic_<var>integral</var>*, <var>integral</var> );
<var>integral</var> atomic_swap_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>,
                               memory_order );
bool atomic_compare_swap( volatile atomic_<var>integral</var>*, <var>integral</var>*, <var>integral</var> );
bool atomic_compare_swap_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>*,
                                   <var>integral</var>, memory_order );
void atomic_fence( volatile atomic_<var>integral</var>*, memory_order );
<var>integral</var> atomic_fetch_add( volatile atomic_<var>integral</var>*, <var>integral</var> );
<var>integral</var> atomic_fetch_add_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>,
                                    memory_order );
<var>integral</var> atomic_fetch_sub( volatile atomic_<var>integral</var>*, <var>integral</var> );
<var>integral</var> atomic_fetch_sub_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>,
                                    memory_order );
<var>integral</var> atomic_fetch_and( volatile atomic_<var>integral</var>*, <var>integral</var> );
<var>integral</var> atomic_fetch_and_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>,
                                    memory_order );
<var>integral</var> atomic_fetch_or( volatile atomic_<var>integral</var>*, <var>integral</var> );
<var>integral</var> atomic_fetch_or_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>,
                                    memory_order );
<var>integral</var> atomic_fetch_xor( volatile atomic_<var>integral</var>*, <var>integral</var> );
<var>integral</var> atomic_fetch_xor_explicit( volatile atomic_<var>integral</var>*, <var>integral</var>,
                                    memory_order );
</samp></pre>

<h2><a name="Implementation">Implementation</a></h2>

<p> This proposal embeds
an example, minimally conforming implementation.
The implementation uses a hash table of flags
and does busy waiting on the flags.

<h3><a name="Presentation">Notes on the Presentation</a></h3>

<p> The proposal marks the defined interface
with the <samp>&lt;code&gt;</samp> font tag,
which typically renders in a <samp>teletype</samp> font.

<p> The proposal marks the example implementation
with the <samp>&lt;var&gt;</samp> font tag
within the <samp>&lt;code&gt;</samp> font tag,
which typically renders in an <samp><var>italic teletype</var></samp> font.
This example implementation is <em>not</em> part of the standard;
it is evidence of implementability.

<p> The embedded source is a bash script
that generates the C and C++ source files.
We have taken this approach
because the definitions have a high degree of redundancy,
which would otherwise interfere with the readability of the document.

<p> To extract the bash script from the HTML source,
use the following sed script.
(The bash script will also generate the sed script.)
<pre><code>
echo n2324.sed
cat &lt;&lt;EOF &gt;<var>n2324.sed</var>

<var>1,/&lt;code&gt;/        d
/&lt;\/code&gt;/,/&lt;code&gt;/    d
            s|&lt;var&gt;||g
            s|&lt;/var&gt;||g
            s|&amp;lt;|&lt;|g
            s|&amp;gt;|&gt;|g
            s|&amp;amp;|\&amp;|g</var>

EOF
</code></pre>

<p> To compile the enclosed sources and examples,
use the following Makefile.
(The bash script will also generate the Makefile.)
<pre><code>
echo Makefile
cat &lt;&lt;EOF &gt;<var>Makefile</var>

<var>default : test

n2324.bash : n2324.html
	sed -f n2324.sed n2324.html &gt; n2324.bash

stdatomic.h cstdatomic impatomic.h impatomic.c n2324.c : n2324.bash
	bash n2324.bash

impatomic.o : impatomic.h impatomic.c
	gcc -std=c99 -c impatomic.c

n2324.c.exe : n2324.c stdatomic.h impatomic.o
	gcc -std=c99 -o n2324.c.exe n2324.c impatomic.o

n2324.c++.exe : n2324.c stdatomic.h impatomic.o
	g++ -o n2324.c++.exe n2324.c impatomic.o

test : n2324.c.exe n2324.c++.exe

clean :
	rm -f n2324.bash stdatomic.h cstdatomic impatomic.h impatomic.c
	rm -f impatomic.o n2324.c.exe n2324.c++.exe</var>

EOF
</code></pre>

<h3><a name="Files">Implementation Files</a></h3>

<p> As is common practice,
we place the common portions of the C and C++ standard headers
in a separate implementation header.

<p> The implementation header includes standard headers
to obtain basic typedefs.
<pre><code>
echo impatomic.h includes
cat &lt;&lt;EOF &gt;<var>impatomic.h</var>

#ifdef __cplusplus
#include &lt;cstddef&gt;
namespace std {
#else
#include &lt;stddef.h&gt;
#include &lt;stdbool.h&gt;
#endif

EOF
</code></pre>

<p> The corresponding implementation source
includes the implementation header and <samp>stdint.h</samp>.
<pre><code>
echo impatomic.c includes
cat &lt;&lt;EOF &gt;<var>impatomic.c</var>

#include &lt;stdint.h&gt;
#include "<var>impatomic.h</var>"

EOF
</code></pre>

<h3><a name="CPP0X">C++0x Features</a></h3>

<p> Because current compilers do not support the new C++0x features,
we have surrounded these with a macro to conditionally remove them.

<pre><code>
echo impatomic.h CPP0X
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

<var>#define CPP0X( feature )</var>

EOF
</code></pre>

<h3><a name="Order">Memory Order</a></h3>

<pre><code>
echo impatomic.h order
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

typedef enum memory_order {
    memory_order_relaxed, memory_order_acquire, memory_order_release,
    memory_order_acq_rel, memory_order_seq_cst
} memory_order;

EOF
</code></pre>

<h3><a name="Flag">Flag Type and Operations</a></h3>

<p> To aid the emulated implementation,
the example implementation includes a predefined hash table of locks
implemented via flags.

<pre><code>
echo impatomic.h flag
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

typedef struct atomic_flag
{
#ifdef __cplusplus
    bool test_and_set( memory_order = memory_order_seq_cst ) volatile;
    void clear( memory_order = memory_order_seq_cst ) volatile;

    atomic_flag() CPP0X(=default);
    atomic_flag( const atomic_flag&amp; ) CPP0X(=delete);
    atomic_flag&amp; operator =( const atomic_flag&amp; ) CPP0X(=delete);

#endif
    <var>bool __f__</var>;
} atomic_flag;

#define ATOMIC_FLAG_INIT { <var>false</var> }

#ifdef __cplusplus
extern "C" {
#endif

<var>extern</var> bool atomic_flag_test_and_set( atomic_flag volatile* );
<var>extern</var> bool atomic_flag_test_and_set_explicit
( atomic_flag volatile*, memory_order );
<var>extern</var> void atomic_flag_clear( atomic_flag volatile* );
<var>extern</var> void atomic_flag_clear_explicit
( atomic_flag volatile*, memory_order );
<var>extern void __atomic_flag_wait__
( atomic_flag volatile* );</var>
<var>extern void __atomic_flag_wait_explicit__
( atomic_flag volatile*, memory_order );</var>
<var>extern atomic_flag volatile* __atomic_flag_for_address__
( void volatile* __z__ )
__attribute__((const))</var>;

#ifdef __cplusplus
}
#endif

#ifdef __cplusplus

inline bool atomic_flag::test_and_set( memory_order __x__ ) volatile
{ return atomic_flag_test_and_set_explicit( this, __x__ ); }

inline void atomic_flag::clear( memory_order __x__ ) volatile
{ atomic_flag_clear_explicit( this, __x__ ); }

#endif

EOF
</code></pre>

<p> The wait operation may be implemented with busy-waiting,
and hence must be used only with care.

<p> The for_address function returns the address of a flag.
Multiple argument values may yield a single flag,
and the implementation of locks may use these flags,
so no operation should attempt to hold any flag or lock
while holding a flag for an address.

<p> The prototype implementation of flags
uses the
<a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html#Atomic-Builtins"><samp>__sync</samp> macros</a>
from the <a href="http://gcc.gnu.org/">GNU C/C++ compiler</a>,
when available,
and otherwise uses a non-atomic implementation
with the expectation that vendors will replace it.
It might even be implemented with, for example,
<samp>pthread_mutex_trylock</samp>,
in which case the internal flag wait function
might just be <samp>pthread_mutex_lock</samp>.
This would of course tend to
make <samp>atomic_flag</samp> larger than necessary.

<p> The prototype implementation of flags is implemented in C.
<pre><code>
echo impatomic.c flag
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.c</var>

<var>#if defined(__GNUC__)
#if __GNUC__ &gt; 4 || (__GNUC__ == 4 &amp;&amp; __GNUC_MINOR__ &gt; 0)
#define USE_SYNC
#endif
#endif</var>

bool atomic_flag_test_and_set( atomic_flag volatile* __a__ )
{ return atomic_flag_test_and_set_explicit( __a__, memory_order_seq_cst ); }

bool atomic_flag_test_and_set_explicit
( atomic_flag volatile* __a__, memory_order __x__ )
<var>{
#ifdef USE_SYNC
    if ( __x__ >= memory_order_acq_rel )
        __sync_synchronize();
    return __sync_lock_test_and_set( &amp;(__a__-&gt;__f__), 1 );
#else
    bool result = __a__-&gt;__f__;
    __a__-&gt;__f__ = true;
    return result;
#endif
}</var>

void atomic_flag_clear( atomic_flag volatile* __a__ )
{ atomic_flag_clear_explicit( __a__, memory_order_seq_cst ); }

void atomic_flag_clear_explicit
( atomic_flag volatile* __a__, memory_order __x__ )
<var>{
#ifdef USE_SYNC
    __sync_lock_release( &amp;(__a__-&gt;__f__) );
    if ( __x__ >= memory_order_acq_rel )
        __sync_synchronize();
#else
    __a__-&gt;__f__ = false;
#endif
} </var>

</code></pre>

<p> Note that the following implementation of wait
is almost always wrong:
it induces high contention.
Some form of exponential backoff would prevent excessive contention.

<pre><code>
<var>void __atomic_flag_wait__( atomic_flag volatile* __a__ )
{ while ( atomic_flag_test_and_set( __a__ ) ); }</var>

<var>void __atomic_flag_wait_explicit__( atomic_flag volatile* __a__,
                                    memory_order __x__ )
{ while ( atomic_flag_test_and_set_explicit( __a__, __x__ ) ); }</var>

<var>#define LOGSIZE 4

static atomic_flag volatile __atomic_flag_anon_table__[ 1 &lt;&lt; LOGSIZE ] =
{
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
};</var>

<var>atomic_flag volatile* __atomic_flag_for_address__( void volatile* __z__ )
{
    uintptr_t __u__ = (uintptr_t)__z__;
    __u__ += (__u__ &gt;&gt; 2) + (__u__ &lt;&lt; 4);
    __u__ += (__u__ &gt;&gt; 7) + (__u__ &lt;&lt; 5);
    __u__ += (__u__ &gt;&gt; 17) + (__u__ &lt;&lt; 13);
    if ( sizeof(uintptr_t) &gt; 4 ) __u__ += (__u__ &gt;&gt; 31);
    __u__ &amp;= ~((~(uintptr_t)0) &lt;&lt; LOGSIZE);
    return __atomic_flag_anon_table__ + __u__;
}</var>

EOF
</code></pre>

<h3><a name="ImplMacros">Implementation Macros</a></h3>

<p> The remainder of the example implementation uses the following macros.
These macros exploit GNU extensions for
value-returning blocks (AKA statement expressions)
and __typeof__.

<p> The macros rely on data fields of atomic structs being named __f__.
Other symbols used are
__a__=atomic, 
__e__=expected, 
__f__=field, 
__g__=flag, 
__m__=modified, 
__o__=operation, 
__r__=result, 
__p__=pointer to field,
__v__=value (for single evaluation), and
__x__=memory-ordering.

<pre><code>
echo impatomic.h macros implementation
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

<var>#define _ATOMIC_LOAD_( __a__, __x__ ) \\
({ __typeof__((__a__)-&gt;__f__) volatile* __p__ = &amp;((__a__)-&gt;__f__); \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)-&gt;__f__) __r__ = *__p__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })</var>

<var>#define _ATOMIC_STORE_( __a__, __m__, __x__ ) \\
({ __typeof__((__a__)-&gt;__f__) volatile* __p__ = &amp;((__a__)-&gt;__f__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   *__p__ = __v__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __v__; })</var>

<var>#define _ATOMIC_MODIFY_( __a__, __o__, __m__, __x__ ) \\
({ __typeof__((__a__)-&gt;__f__) volatile* __p__ = &amp;((__a__)-&gt;__f__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)-&gt;__f__) __r__ = *__p__; \\
   *__p__ __o__ __v__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })</var>

<var>#define _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ) \\
({ __typeof__((__a__)-&gt;__f__) volatile* __p__ = &amp;((__a__)-&gt;__f__); \\
   __typeof__(__e__) __q__ = (__e__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   bool __r__; \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)-&gt;__f__) __t__ = *__p__; \\
   if ( __t__ == *__q__ ) { *__p__ = __v__; __r__ = true; } \\
   else { *__q__ = __t__; __r__ = false; } \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })</var>

EOF
</code></pre>

<h3><a name="LockFreeMacro">Lock-Free Macro</a></h3>

<pre><code>
echo impatomic.h lock-free macros
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#define ATOMIC_SCALAR_LOCK_FREE <var>0</var>

EOF
</code></pre>

<h3><a name="Regular">Regular Types</a></h3>

<p> The standard defines atomic types
corresponding to booleans, addresses, integers, and for C++, wider characters.
These atomic types are defined in terms of a base type.

<p> The base types have two names in this proposal,
a short name usually embedded within other identifiers,
and a long name for the base type.
The mapping between them is as follows.
<pre><code>
bool="bool"
address="void*"

INTEGERS="char schar uchar short ushort int uint long ulong llong ullong"
char="char"
schar="signed char"
uchar="unsigned char"
short="short"
ushort="unsigned short"
int="int"
uint="unsigned int"
long="long"
ulong="unsigned long"
llong="long long"
ullong="unsigned long long"

CHARACTERS="wchar_t"
# CHARACTERS="char16_t char32_t wchar_t" # char*_t not yet in compilers
char16_t="char16_t"
char32_t="char32_t"
wchar_t="wchar_t"
</code></pre>

<p> In addition to types, some operations also need two names,
one for embedding within other identifiers,
and one consisting of the operator.
<pre><code>
ADR_OPERATIONS="add sub"
INT_OPERATIONS="add sub and ior xor"
add="+"
sub="-"
and="&amp;"
ior="|"
xor="^"
</code></pre>

<h4><a name="Boolean">Boolean</a></h4>
<pre><code>
echo impatomic.h type boolean
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

typedef struct atomic_bool
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( bool, memory_order = memory_order_seq_cst ) volatile;
    bool load( memory_order = memory_order_seq_cst ) volatile;
    bool swap( bool, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap ( bool&amp;, bool,
                        memory_order = memory_order_seq_cst) volatile;
    void fence( memory_order ) volatile;

    atomic_bool() CPP0X(=default);
    CPP0X(constexpr) atomic_bool( bool __v__ ) : __f__( __v__ ) { }
    atomic_bool( const atomic_bool&amp; ) CPP0X(=delete);
    atomic_bool&amp; operator =( const atomic_bool&amp; ) CPP0X(=delete);

    bool operator =( bool __v__ ) volatile
    { store( __v__ ); return __v__; }

    operator bool() volatile
    { return load(); }

#endif
    <var>bool __f__;</var>
} atomic_bool;

EOF
</code></pre>

<h4><a name="Address">Address</a></h4>
<pre><code>
echo impatomic.h type address
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

typedef struct atomic_address
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( void*, memory_order = memory_order_seq_cst ) volatile;
    void* load( memory_order = memory_order_seq_cst ) volatile;
    void* swap( void*, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( void*&amp;, void*,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    void* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    void* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;

    atomic_address() CPP0X(=default);
    atomic_address( const atomic_address&amp; ) CPP0X(=delete);
    CPP0X(constexpr) atomic_address( void* __v__ ) : __f__( __v__) { }
    atomic_address&amp; operator =( const atomic_address &amp; ) CPP0X(=delete);

    void* operator =( void* __v__ ) volatile
    { store( __v__ ); return __v__; }

    operator void*() volatile
    { return load(); }

    void* operator +=( ptrdiff_t __v__ ) volatile
    { return fetch_add( __v__ ); }

    void* operator -=( ptrdiff_t __v__ ) volatile
    { return fetch_sub( __v__ ); }

#endif
    void* __f__;
} atomic_address;

EOF
</code></pre>

<h4><a name="Integers">Integers</a></h4>
<pre><code>
echo impatomic.h type integers
for TYPEKEY in ${INTEGERS}
do
TYPENAME=${!TYPEKEY}
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

typedef struct atomic_${TYPEKEY}
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( ${TYPENAME},
                memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} load( memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} swap( ${TYPENAME},
                      memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( ${TYPENAME}&amp;, ${TYPENAME},
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    ${TYPENAME} fetch_add( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_sub( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_and( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_or( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_xor( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;

    atomic_${TYPEKEY}() CPP0X(=default);
    atomic_${TYPEKEY}( const atomic_${TYPEKEY}&amp; ) CPP0X(=delete);
    CPP0X(constexpr) atomic_${TYPEKEY}( ${TYPENAME} __v__ ) : __f__( __v__) { }
    atomic_${TYPEKEY}&amp; operator =( const atomic_${TYPEKEY}&amp; ) CPP0X(=delete);

    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }

    operator ${TYPENAME}() volatile
    { return load(); }

    ${TYPENAME} operator ++( int ) volatile
    { return fetch_add( 1 ); }

    ${TYPENAME} operator --( int ) volatile
    { return fetch_sub( 1 ); }

    ${TYPENAME} operator ++() volatile
    { return fetch_add( 1 ) + 1; }

    ${TYPENAME} operator --() volatile
    { return fetch_sub( 1 ) - 1; }

    ${TYPENAME} operator +=( ${TYPENAME} __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }

    ${TYPENAME} operator -=( ${TYPENAME} __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }

    ${TYPENAME} operator &amp;=( ${TYPENAME} __v__ ) volatile
    { return fetch_and( __v__ ) &amp; __v__; }

    ${TYPENAME} operator |=( ${TYPENAME} __v__ ) volatile
    { return fetch_or( __v__ ) | __v__; }

    ${TYPENAME} operator ^=( ${TYPENAME} __v__ ) volatile
    { return fetch_xor( __v__ ) ^ __v__; }

#endif
    <var>${TYPENAME} __f__;</var>
} atomic_${TYPEKEY};

EOF
done
</code></pre>

<h4><a name="Typedefs">Integer Typedefs</a></h4>

<p> The following typedefs
support atomic versions
of the <samp>cstdint</samp> and <samp>stdint.h</samp> types.
<pre><code>
echo impatomic.h typedefs integers
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

typedef <var>atomic_schar</var> atomic_int_least8_t;
typedef <var>atomic_uchar</var> atomic_uint_least8_t;
typedef <var>atomic_short</var> atomic_int_least16_t;
typedef <var>atomic_ushort</var> atomic_uint_least16_t;
typedef <var>atomic_int</var> atomic_int_least32_t;
typedef <var>atomic_uint</var> atomic_uint_least32_t;
typedef <var>atomic_llong</var> atomic_int_least64_t;
typedef <var>atomic_ullong</var> atomic_uint_least64_t;

typedef <var>atomic_schar</var> atomic_int_fast8_t;
typedef <var>atomic_uchar</var> atomic_uint_fast8_t;
typedef <var>atomic_short</var> atomic_int_fast16_t;
typedef <var>atomic_ushort</var> atomic_uint_fast16_t;
typedef <var>atomic_int</var> atomic_int_fast32_t;
typedef <var>atomic_uint</var> atomic_uint_fast32_t;
typedef <var>atomic_llong</var> atomic_int_fast64_t;
typedef <var>atomic_ullong</var> atomic_uint_fast64_t;

typedef <var>atomic_long</var> atomic_intptr_t;
typedef <var>atomic_ulong</var> atomic_uintptr_t;

typedef <var>atomic_long</var> atomic_ssize_t;
typedef <var>atomic_ulong</var> atomic_size_t;

typedef <var>atomic_long</var> atomic_ptrdiff_t;

typedef <var>atomic_llong</var> atomic_intmax_t;
typedef <var>atomic_ullong</var> atomic_uintmax_t;

EOF
</code></pre>

<h4><a name="Characters">Characters</a></h4>
<pre><code>
echo impatomic.h type characters
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#ifdef __cplusplus

EOF

for TYPEKEY in ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

typedef struct atomic_${TYPEKEY}
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( ${TYPENAME}, memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} load( memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} swap( ${TYPENAME},
                      memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( ${TYPENAME}&amp;, ${TYPENAME},
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    ${TYPENAME} fetch_add( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_sub( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_and( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_or( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_xor( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;

    atomic_${TYPEKEY}() CPP0X(=default);
    atomic_${TYPEKEY}( const atomic_${TYPEKEY}&amp; ) CPP0X(=delete);
    CPP0X(constexpr) atomic_${TYPEKEY}( ${TYPENAME} __v__ ) : __f__( __v__) { }
    atomic_${TYPEKEY}&amp; operator =( const atomic_${TYPEKEY}&amp; ) CPP0X(=delete);

    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }

    operator ${TYPENAME}() volatile
    { return load(); }

    ${TYPENAME} operator ++( int ) volatile
    { return fetch_add( 1 ); }

    ${TYPENAME} operator --( int ) volatile
    { return fetch_sub( 1 ); }

    ${TYPENAME} operator ++() volatile
    { return fetch_add( 1 ) + 1; }

    ${TYPENAME} operator --() volatile
    { return fetch_sub( 1 ) - 1; }

    ${TYPENAME} operator +=( ${TYPENAME} __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }

    ${TYPENAME} operator -=( ${TYPENAME} __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }

    ${TYPENAME} operator &amp;=( ${TYPENAME} __v__ ) volatile
    { return fetch_and( __v__ ) &amp; __v__; }

    ${TYPENAME} operator |=( ${TYPENAME} __v__ ) volatile
    { return fetch_or( __v__ ) | __v__; }

    ${TYPENAME} operator ^=( ${TYPENAME} __v__ ) volatile
    { return fetch_xor( __v__ ) ^ __v__; }

#endif
    <var>${TYPENAME} __f__;</var>
} atomic_${TYPEKEY};

EOF
done

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#else

typedef <var>atomic_int_least16_t</var> atomic_char16_t;
typedef <var>atomic_int_least32_t</var> atomic_char32_t;
typedef <var>atomic_int_least32_t</var> atomic_wchar_t;

#endif

EOF
</code></pre>

<h4><a name="Generic">Generic Type</a></h4>

<p> This minimal implementation does not specialize on size.

<pre><code>
echo impatomic.h type generic
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#ifdef __cplusplus

template< typename T >
struct atomic
{
#ifdef __cplusplus

    bool lock_free() volatile;
    void store( T, memory_order = memory_order_seq_cst ) volatile;
    T load( memory_order = memory_order_seq_cst ) volatile;
    T swap( T __v__, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( T&amp;, T, memory_order = memory_order_seq_cst) volatile;
    void fence( memory_order ) volatile;

    atomic() CPP0X(=default);
    CPP0X(constexpr) atomic( T __v__ ) : __f__( __v__ ) { }
    atomic( const atomic&amp; ) CPP0X(=delete);
    atomic&amp; operator =( const atomic&amp; ) CPP0X(=delete);

    T operator =( T __v__ ) volatile
    { store( __v__ ); return __v__; }

    operator T() volatile
    { return load(); }

#endif
    T <var>__f__</var>;
};

#endif
EOF
</code></pre>

<h4><a name="Pointer">Pointer Partial Specialization</a></h4>
<pre><code>
echo impatomic.h type pointer
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#ifdef __cplusplus

template&lt;typename T&gt; struct atomic&lt; T* &gt; : atomic_address
{
    T* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    T* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;

    atomic() CPP0X(=default);
    CPP0X(constexpr) atomic( T* __v__ ) : atomic_address( __v__ ) { }
    atomic( const atomic&amp; ) CPP0X(=delete);
    atomic&amp; operator =( const atomic&amp; ) CPP0X(=delete);

    T* operator =( T* __v__ ) volatile
    { store( __v__ ); return __v__; }

    operator T*() volatile
    { return load(); }

    T* operator ++( int ) volatile
    { return fetch_add( 1 ); }

    T* operator --( int ) volatile
    { return fetch_sub( 1 ); }

    T* operator ++() volatile
    { return fetch_add( 1 ) + 1; }

    T* operator --() volatile
    { return fetch_sub( 1 ) - 1; }

    T* operator +=( ptrdiff_t __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }

    T* operator -=( ptrdiff_t __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }
};

#endif
EOF
</code></pre>

<h4><a name="Special">Full Specializations</a></h4>

<p> We provide full specializations of the generic atomic
for booleans, addresses, integers, and characters.
These specializations derive from the specific atomic types
to enable implicit reference conversions.
The implicitly-declared copy assignment operator of each derived class
hides the base class assignment operators,
and so the value assignment operator must be explicitly redeclared.
<pre><code>
echo impatomic.h type specializations
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#ifdef __cplusplus

EOF

for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

template&lt;&gt; struct atomic&lt; ${TYPENAME} &gt; : atomic_${TYPEKEY}
{
    atomic() CPP0X(=default);
    CPP0X(constexpr) atomic( ${TYPENAME} __v__ ) : atomic_${TYPEKEY}( __v__ ) { }
    atomic( const atomic&amp; ) CPP0X(=delete);
    atomic&amp; operator =( const atomic&amp; ) CPP0X(=delete);

    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }
};

EOF
done

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#endif

EOF
</code></pre>

<h3><a name="Functions">C++ Core Functions</a></h3>

<p> In C++, these operations are implemented as overloaded functions.
<pre><code>
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#ifdef __cplusplus

EOF

echo impatomic.h functions ordinary basic
for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

<var>inline</var> bool atomic_lock_free( volatile atomic_${TYPEKEY}* __a__ )
<var>{ return false; }</var>

<var>inline</var> ${TYPENAME} atomic_load( volatile atomic_${TYPEKEY}* __a__ )
<var>{ return _ATOMIC_LOAD_( __a__, memory_order_seq_cst ); }</var>

<var>inline</var> ${TYPENAME} atomic_load_explicit
( volatile atomic_${TYPEKEY}* __a__, memory_order __x__ )
<var>{ return _ATOMIC_LOAD_( __a__, __x__ ); }</var>

<var>inline</var> ${TYPENAME} atomic_store
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
<var>{ return _ATOMIC_STORE_( __a__, __m__, memory_order_seq_cst ); }</var>

<var>inline</var> ${TYPENAME} atomic_store_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
<var>{ return _ATOMIC_STORE_( __a__, __m__, __x__ ); }</var>

<var>inline</var> ${TYPENAME} atomic_swap
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
<var>{ return _ATOMIC_MODIFY_( __a__, =, __m__, memory_order_seq_cst ); }</var>

<var>inline</var> ${TYPENAME} atomic_swap_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
<var>{ return _ATOMIC_MODIFY_( __a__, =, __m__, __x__ ); }</var>

<var>inline</var> bool atomic_compare_swap
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME}* __e__, ${TYPENAME} __m__ )
<var>{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, memory_order_seq_cst ); }</var>

<var>inline</var> bool atomic_compare_swap_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME}* __e__, ${TYPENAME} __m__,
  memory_order __x__ )
<var>{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ); }</var>

<var>inline</var> void atomic_fence
( volatile atomic_${TYPEKEY}* __a__, memory_order __x__ )
<var>{ ${TYPENAME} __v__ = _ATOMIC_LOAD_( __a__, memory_order_relaxed );
  _ATOMIC_CMPSWP_( __a__, &amp;__v__, __v__, __x__ ); }</var>

EOF
done

echo impatomic.h functions template basic
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

template&lt; typename T &gt;
<var>inline</var> bool atomic_lock_free( volatile atomic&lt;T&gt;* __a__ )
<var>{ return false; }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic_store
( volatile atomic&lt;T&gt;* __a__, T __m__ )
<var>{ return _ATOMIC_STORE_( __a__, __m__, memory_order_seq_cst ); }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic_store_explicit
( volatile atomic&lt;T&gt;* __a__, T __m__, memory_order __x__ )
<var>{ return _ATOMIC_STORE_( __a__, __m__, __x__ ); }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic_load( volatile atomic&lt;T&gt;* __a__ )
<var>{ return _ATOMIC_LOAD_( __a__, memory_order_seq_cst ); }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic_load_explicit
( volatile atomic&lt;T&gt;* __a__, memory_order __x__ )
<var>{ return _ATOMIC_LOAD_( __a__, __x__ ); }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic_swap
( volatile atomic&lt;T&gt;* __a__, T __m__ )
<var>{ return _ATOMIC_MODIFY_( __a__, =, __m__, memory_order_seq_cst ); }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic_swap_explicit
( volatile atomic&lt;T&gt;* __a__, T __m__, memory_order __x__ )
<var>{ return _ATOMIC_MODIFY_( __a__, =, __m__, __x__ ); }</var>

template&lt; typename T &gt;
<var>inline</var> bool atomic_compare_swap
( volatile atomic&lt;T&gt;* __a__, T* __e__, T __m__ )
<var>{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, memory_order_seq_cst ); }</var>

template&lt; typename T &gt;
<var>inline</var> bool atomic_compare_swap_explicit
( volatile atomic&lt;T&gt;* __a__, T* __e__, T __m__, memory_order __x__ )
<var>{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ); }</var>

template&lt; typename T &gt;
<var>inline</var> void atomic_fence
( volatile atomic&lt;T&gt;* __a__, memory_order __x__ )
<var>{ T __v__ = _ATOMIC_LOAD_( __a__, memory_order_relaxed );
  _ATOMIC_CMPSWP_( __a__, &amp;__v__, __v__, __x__ ); }</var>

EOF

echo impatomic.h functions address fetch
TYPEKEY=address
TYPENAME=${!TYPEKEY}

for FNKEY in ${ADR_OPERATIONS}
do
OPERATOR=${!FNKEY}

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

<var>inline</var> ${TYPENAME} atomic_fetch_${FNKEY}
( volatile atomic_${TYPEKEY}* __a__, ptrdiff_t __m__ )
<var>{ ${TYPENAME} volatile* __p__ = &amp;((__a__)-&gt;__f__);
  atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ );
  __atomic_flag_wait__( __g__ );
  ${TYPENAME} __r__ = *__p__;
  *__p__ = (${TYPENAME})((char*)(*__p__) ${OPERATOR} __m__);
  atomic_flag_clear( __g__ );
  return __r__; }</var>

<var>inline</var> ${TYPENAME} atomic_fetch_${FNKEY}_explicit
( atomic_${TYPEKEY} volatile* __a__, ptrdiff_t __m__, memory_order __x__ )
<var>{ ${TYPENAME} volatile* __p__ = &amp;((__a__)-&gt;__f__);
  atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ );
  __atomic_flag_wait_explicit__( __g__, __x__ );
  ${TYPENAME} __r__ = *__p__;
  *__p__ = (${TYPENAME})((char*)(*__p__) ${OPERATOR} __m__);
  atomic_flag_clear_explicit( __g__, __x__ );
  return __r__; }</var>

EOF
done

echo impatomic.h functions integer fetch
for TYPEKEY in ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}

for FNKEY in ${INT_OPERATIONS}
do
OPERATOR=${!FNKEY}

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

<var>inline</var> ${TYPENAME} atomic_fetch_${FNKEY}
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
<var>{ return _ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, memory_order_seq_cst ); }</var>

<var>inline</var> ${TYPENAME} atomic_fetch_${FNKEY}_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
<var>{ return _ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, __x__ ); }</var>

EOF
done
done
</code></pre>

<h3><a name="CoreMacros">C Core Macros</a></h3>

<p> For C, we need type-generic macros.
<pre><code>
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#else

EOF

echo impatomic.h type-generic macros basic
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#define atomic_lock_free( __a__ ) \\
<var>false</var>

#define atomic_load( __a__ ) \\
<var>_ATOMIC_LOAD_( __a__, memory_order_seq_cst )</var>

#define atomic_load_explicit( __a__, __x__ ) \\
<var>_ATOMIC_LOAD_( __a__, __x__ )</var>

#define atomic_store( __a__, __m__ ) \\
<var>_ATOMIC_STORE_( __a__, __m__, memory_order_seq_cst )</var>

#define atomic_store_explicit( __a__, __m__, __x__ ) \\
<var>_ATOMIC_STORE_( __a__, __m__, __x__ )</var>

#define atomic_swap( __a__, __m__ ) \\
<var>_ATOMIC_MODIFY_( __a__, =, __m__, memory_order_seq_cst )</var>

#define atomic_swap_explicit( __a__, __m__, __x__ ) \\
<var>_ATOMIC_MODIFY_( __a__, =, __m__, __x__ )</var>

#define atomic_compare_swap( __a__, __e__, __m__ ) \\
<var>_ATOMIC_CMPSWP_( __a__, __e__, __m__, memory_order_seq_cst )</var>

#define atomic_compare_swap_explicit( __a__, __e__, __m__, __x__ ) \\
<var>_ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ )</var>

#define atomic_fence( __a__, __x__ ) \\
<var>({ __typeof__((__a__)-&gt;__f__) __v__ = \\
     _ATOMIC_LOAD_( __a__, memory_order_relaxed ); \\
   _ATOMIC_CMPSWP_( __a__, &amp;__v__, __v__, __x__ ); })</var>

EOF

echo impatomic.h type-generic macros fetch
for FNKEY in ${INT_OPERATIONS}
do
OPERATOR=${!FNKEY}

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#define atomic_fetch_${FNKEY}( __a__, __m__ ) \\
<var>_ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, memory_order_seq_cst )</var>

#define atomic_fetch_${FNKEY}_explicit( __a__, __m__, __x__ ) \\
<var>_ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, __x__ )</var>

EOF
done

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#endif

EOF
</code></pre>

<h3><a name="Methods">C++ Methods</a></h3>

<p> The core functions are verbose to use directly,
and so the proposal also includes member function definitions
that are syntactically simpler.
The member operators are defined in the class definitions.

<pre><code>
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#ifdef __cplusplus

EOF

echo impatomic.h methods ordinary basic
for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

<var>inline</var> bool atomic_${TYPEKEY}::lock_free() volatile
<var>{ return false; }</var>

<var>inline</var> void atomic_${TYPEKEY}::store
( ${TYPENAME} __m__, memory_order __x__ ) volatile
<var>{ atomic_store_explicit( this, __m__, __x__ ); }</var>

<var>inline</var> ${TYPENAME} atomic_${TYPEKEY}::load
( memory_order __x__ ) volatile
<var>{ return atomic_load_explicit( this, __x__ ); }</var>

<var>inline</var> ${TYPENAME} atomic_${TYPEKEY}::swap
( ${TYPENAME} __m__, memory_order __x__ ) volatile
<var>{ return atomic_swap_explicit( this, __m__, __x__ ); }</var>

<var>inline</var> bool atomic_${TYPEKEY}::compare_swap
( ${TYPENAME}&amp; __e__, ${TYPENAME} __m__, memory_order __x__ ) volatile
<var>{ return atomic_compare_swap_explicit( this, &amp;__e__, __m__, __x__ ); }</var>

EOF
done

echo impatomic.h methods template basic
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

template&lt; typename T &gt;
<var>inline</var> bool atomic&lt;T&gt;::lock_free() volatile
<var>{ return false; }</var>

template&lt; typename T &gt;
<var>inline</var> void atomic&lt;T&gt;::store( T __v__, memory_order __x__ ) volatile
<var>{ atomic_store_explicit( this, __v__, __x__ ); }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic&lt;T&gt;::load( memory_order __x__ ) volatile
<var>{ return atomic_load_explicit( this, __x__ ); }</var>

template&lt; typename T &gt;
<var>inline</var> T atomic&lt;T&gt;::swap( T __v__, memory_order __x__ ) volatile
<var>{ return atomic_swap_explicit( this, __v__, __x__ ); }</var>

template&lt; typename T &gt;
<var>inline</var> bool atomic&lt;T&gt;::compare_swap
( T&amp; __r__, T __v__, memory_order __x__ ) volatile
<var>{ return atomic_compare_swap_explicit( this, &amp;__r__, __v__, __x__ ); }</var>

EOF

echo impatomic.h methods address fetch
TYPEKEY=address
TYPENAME=${!TYPEKEY}

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

<var>inline</var> void* atomic_address::fetch_add
( ptrdiff_t __m__, memory_order __x__ ) volatile
<var>{ return atomic_fetch_add_explicit( this, __m__, __x__ ); }</var>

<var>inline</var> void* atomic_address::fetch_sub
( ptrdiff_t __m__, memory_order __x__ ) volatile
<var>{ return atomic_fetch_sub_explicit( this, __m__, __x__ ); }</var>

EOF

echo impatomic.h methods pointer fetch
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

template&lt; typename T &gt;
T* atomic&lt;T*&gt;::fetch_add( ptrdiff_t __v__, memory_order __x__ ) volatile
<var>{ return atomic_fetch_add_explicit( this, sizeof(T) * __v__, __x__ ); }</var>

template&lt; typename T &gt;
T* atomic&lt;T*&gt;::fetch_sub( ptrdiff_t __v__, memory_order __x__ ) volatile
<var>{ return atomic_fetch_sub_explicit( this, sizeof(T) * __v__, __x__ ); }</var>

EOF

cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#endif

EOF
</code></pre>

<h3><a name="Cleanup">Implementation Header Cleanup</a></h3>
<pre><code>
echo impatomic.h close namespace
cat &lt;&lt;EOF &gt;&gt;<var>impatomic.h</var>

#ifdef __cplusplus
} // namespace std
#endif

EOF
</code></pre>

<h3><a name="Standard">Standard Headers</a></h3>

<p> The C standard header.
<pre><code>
echo stdatomic.h
cat &lt;&lt;EOF &gt;stdatomic.h

#include "<var>impatomic.h</var>"

#ifdef __cplusplus

EOF

for TYPEKEY in flag bool address ${INTEGERS} ${CHARACTERS}
do
cat &lt;&lt;EOF &gt;&gt;stdatomic.h

using std::atomic_${TYPEKEY};

EOF
done

cat &lt;&lt;EOF &gt;&gt;stdatomic.h

using std::atomic;
using std::memory_order;
using std::memory_order_relaxed;
using std::memory_order_acquire;
using std::memory_order_release;
using std::memory_order_acq_rel;
using std::memory_order_seq_cst;

#endif

EOF

</code></pre>

<p> The C++ standard header.
<pre><code>
echo cstdatomic
cat &lt;&lt;EOF &gt;cstdatomic

#include "<var>impatomic.h</var>"

EOF
</code></pre>

<h2><a name="Examples">Examples of Use</a></h2>

<p> The following program shows example uses of the atomic types.
These examples also serve as tests for the interface definition.
<pre><code>
echo n2324.c include
cat &lt;&lt;EOF &gt;n2324.c

#include "stdatomic.h"

EOF
</code></pre>

<h3><a name="ExampleFlag">Flag</a></h3>

<p> We show two uses:
global functions with explicit memory ordering,
and member functions with implicit memory ordering.

<pre><code>
echo n2324.c flag
cat &lt;&lt;EOF &gt;&gt;n2324.c

atomic_flag af = ATOMIC_FLAG_INIT;

void flag_example( void )
{
    if ( ! atomic_flag_test_and_set_explicit( &amp;af, memory_order_acquire ) )
        atomic_flag_clear_explicit( &amp;af, memory_order_release );
#ifdef __cplusplus
    if ( ! af.test_and_set() )
        af.clear();
#endif
}

EOF
</code></pre>

<h3><a name="ExampleLazy">Lazy Initialization</a></h3>

<p> For lazy initialization,
a thread that does not do initialization
may need to wait on the thread that does.
(Lazy initialization is similar to double-checked locking.)
For this example, we busy wait on a boolean.
Busy waiting like this is usually ill-advised,
but it suffices for the example.
There are three variants of the example:
one using strong C++ operators and methods,
one using weak C functions, and
one using fence-based C++ operators and methods.

<pre><code>
echo n2324.c lazy
cat &lt;&lt;EOF &gt;&gt;n2324.c

atomic_bool lazy_ready = { false };
atomic_bool lazy_assigned = { false };
int lazy_value;

#ifdef __cplusplus

int lazy_example_strong_cpp( void )
{
    if ( ! lazy_ready ) {
        /* the value is not yet ready */
        if ( lazy_assigned.swap( true ) ) {
            /* initialization assigned to another thread; wait */
            while ( ! lazy_ready );
        }
        else {
            lazy_value = 42;
            lazy_ready = true;
        }
    }
    return lazy_value;
}

#endif

int lazy_example_weak_c( void )
{
    if ( ! atomic_load_explicit( &amp;lazy_ready, memory_order_acquire ) ) {
        if ( atomic_swap_explicit( &amp;lazy_assigned, true,
                                   memory_order_relaxed ) ) {
            while ( ! atomic_load_explicit( &amp;lazy_ready,
                                            memory_order_acquire ) );
        }
        else {
            lazy_value = 42;
            atomic_store_explicit( &amp;lazy_ready, true, memory_order_release );
        }
    }
    return lazy_value;
}

#ifdef __cplusplus

int lazy_example_fence_cpp( void )
{
    if ( lazy_ready.load( memory_order_relaxed ) )
        lazy_ready.fence( memory_order_acquire );
    else if ( lazy_assigned.swap( true, memory_order_relaxed ) )
        while ( ! lazy_ready.load( memory_order_relaxed ) );
    else {
        lazy_value = 42;
        lazy_ready.store( true, memory_order_release );
    }
    return lazy_value;
}

#endif

EOF
</code></pre>

<h3><a name="ExampleInteger">Integer</a></h3>
<pre><code>
echo n2324.c integer
cat &lt;&lt;EOF &gt;&gt;n2324.c

atomic_ulong volatile aulv = { 0 };
atomic_ulong auln = { 1 };
#ifdef __cplusplus
atomic&lt; unsigned long &gt; taul CPP0X( { 3 } );
#endif

void integer_example( void )
{
    atomic_ulong a = { 3 };
    unsigned long x = atomic_load_explicit( &amp;auln, memory_order_acquire );
    atomic_store_explicit( &amp;aulv, x, memory_order_release );
    unsigned long y = atomic_fetch_add_explicit( &amp;aulv, 1,
                                                 memory_order_relaxed );
    unsigned long z = atomic_fetch_xor( &amp;auln, 4 );
#ifdef __cplusplus
    x = auln;
    aulv = x;
    auln += 1;
    aulv ^= 4;
    // auln = aulv; // uses a deleted operator
    aulv -= auln++;
    auln |= --aulv;
    aulv &amp;= 7;
    atomic_store_explicit( &amp;taul, 7, memory_order_release );
    x = taul.load( memory_order_acquire);
    y = atomic_fetch_add_explicit( &amp; taul, 1, memory_order_acquire );
    z = atomic_fetch_xor( &amp; taul, 4 );
    x = taul;
    // auln = taul; // uses a deleted operator
    // taul = aulv; // uses a deleted operator
    taul = x;
    taul += 1;
    taul ^= 4;
    taul -= taul++;
    taul |= --taul;
    taul &amp;= 7;
#endif
}

EOF
</code></pre>

<h3><a name="ExampleEvent">Event Counter</a></h3>

<p> An event counter is not part of the communication between threads,
and so it can use faster primitives.
<pre><code>
echo n2324.c event
cat &lt;&lt;EOF &gt;&gt;n2324.c

#ifdef __cplusplus

struct event_counter
{
    void inc() { au.fetch_add( 1, memory_order_relaxed ); }
    unsigned long get() { return au.load( memory_order_relaxed ); }
    atomic_ulong au;
};
event_counter ec = { 0 };

void generate_events()
{
    ec.inc();
    ec.inc();
    ec.inc();
}

int read_events()
{
    return ec.get();
}

int event_example()
{
    generate_events(); // possibly in multiple threads
    // join all other threads, ensuring that final value is written
    return read_events();
}

#endif

EOF
</code></pre>

<p> An important point here is that this is safe,
and we are guaranteed to see exactly the final value,
because the thread joins force the necessary ordering
between the <samp>inc</samp> calls and the <samp>get</samp> call.

<h3><a name="ExampleList">List Insert</a></h3>

<p> Insertion into a shared linked list
can be accomplished in a lock-free manner with compare-and-swap,
provided that compare-and-swap itself is lock-free.
(Note that adding a correct "remove" operation without a garbage collector
is harder than it seems.)
<pre><code>
echo n2324.c list
cat &lt;&lt;EOF &gt;&gt;n2324.c

#ifdef __cplusplus

struct data;
struct node
{
    node* next;
    data* value;
};

atomic&lt; struct node* &gt; head CPP0X( { (node*)0 } );

void list_example_strong( data* item )
{
    node* candidate = new node;
    candidate-&gt;value = item;
    candidate-&gt;next = head;
    while ( ! head.compare_swap( candidate-&gt;next, candidate ) );
}

void list_example_weak( struct data* item )
{
    node* candidate = new node;
    candidate-&gt;value = item;
    candidate-&gt;next = head.load( memory_order_relaxed );
    while ( ! head.compare_swap( candidate-&gt;next, candidate,
                                 memory_order_release ) );
}

#endif

EOF
</code></pre>

<h3><a name="ExampleUpdate">Update</a></h3>

<p> The best algorithm for updating a variable
may depend on whether or not atomics are lock-free.
In the example below,
this update can be accomplished in a lock-free manner with compare-and-swap
when atomic scalars are lock-free,
but may require other mechanisms when
atomic scalars are not lock-free.
This example uses the feature macro to generate minimal code
when the lock-free status is known a priori.
<pre><code>
echo n2324.c update
cat &lt;&lt;EOF &gt;&gt;n2324.c

#if ATOMIC_SCALAR_LOCK_FREE &lt;= 1
atomic_flag pseudo_mutex = ATOMIC_FLAG_INIT;
unsigned long regular_variable = 1;
#endif
#if ATOMIC_SCALAR_LOCK_FREE &gt;= 1
atomic_ulong atomic_variable = { 1 };
#endif

void update()
{
#if ATOMIC_SCALAR_LOCK_FREE == 1
    if ( atomic_lock_free( &amp;atomic_variable ) ) {
#endif
#if ATOMIC_SCALAR_LOCK_FREE &gt; 0
        unsigned long full = atomic_load( &amp;atomic_variable );
        unsigned long half = full / 2;
        while ( ! atomic_compare_swap( &amp;atomic_variable, &amp;full, half ) )
            half = full / 2;
#endif
#if ATOMIC_SCALAR_LOCK_FREE == 1
    } else {
#endif
#if ATOMIC_SCALAR_LOCK_FREE &lt; 2
        __atomic_flag_wait__( &amp;pseudo_mutex );
        regular_variable /= 2 ;
        atomic_flag_clear( &amp;pseudo_mutex );
#endif
#if ATOMIC_SCALAR_LOCK_FREE == 1
    }
#endif
}

EOF
</code></pre>

<h3><a name="ExampleMain">Main</a></h3>
<pre><code>
echo n2324.c main
cat &lt;&lt;EOF &gt;&gt;n2324.c

int main()
{
}

EOF
</code></pre>

</body></html>
