<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>WG21/N2392: A Memory Model for C++: Sequential Consistency for Race-Free Programs</TITLE>
<meta http-equiv="Content-Type" content="text/html;charset=US-ASCII" >
<BODY>
<TABLE 
  summary="This table provides identifying information for this document."><TBODY>
  <TR>
    <TH>Doc. No.:</TH>
    <TD>WG21/N2392<BR>J16/07-252</TD></TR>
  <TR>
    <TH>Date:</TH>
    <TD>2007-09-09</TD></TR>
  <TR>
    <TH>Reply to:</TH>
    <TD>Hans-J. Boehm</TD></TR>
  <TR>
    <TH>Phone:</TH>
    <TD>+1-650-857-3406</TD></TR>
  <TR>
    <TH>Email:</TH>
    <TD><A 
href="mailto:Hans.Boehm@hp.com">Hans.Boehm@hp.com</A></TD></TR></TBODY></TABLE>
<H1>N2392: A Memory Model for C++: Sequential Consistency for Race-Free 
Programs</H1>
Here we explore consequences of the proposed
<A HREF="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2334.htm">
N2334</a>
concurrency memory model for C++, and in the process suggest a
change to the proposed <TT>try_lock</tt> API
(see
<A HREF="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2320.htm">
N2320</a>).
This paper does
not itself propose specific wording changes to the standard.
<P>
A program has a sequentially consistent execution if there
is an interleaving of the actions performed by each thread such that
each read access to a scalar object sees the last write
to the object that precedes it in the interleaving.  Slightly
more formally, we must be able to arrange the actions performed by
all threads in a single total order <VAR>T</var>, such that:
<UL>
<LI>Actions performed by each individual thread occur in <VAR>T</var>
in the order in which they were performed by that thread.
<LI>Each load of a scalar object <VAR>s</var> observes the last prior
write to <VAR>s</var> in <VAR>T</var>.
</ul>
<P>
Here we argue that programs that are data-race-free, either by
the definition in
<A HREF="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2334.htm#races">N2334</a> or by some more intuitive formulations,
and that use only locks and sequentially consistent atomics
for synchronization, exhibit only sequentially consistent executions.
<P>This is somewhat analogous to
<A href="http://portal.acm.org/citation.cfm?doid=1040305.1040336">
the corresponding theorem for Java</A>. 
<P><B>Main Claims:</B>
If a program uses no synchronization operations other than
<UL>
<LI> Sequentially consistent atomics, and
<LI> Simple <TT>lock()</TT> and <TT>unlock()</TT> 
operations with acquire and release semantics respectively,
</ul>
<P>
then
<P>
<OL>
<LI>
If a program allows no data races on a given input
(using the N2334 definition), then the program 
obeys sequential consistency rules, i.e. it behaves as though it had been 
executed by sequentially interleaving its actions, and letting each load 
instruction see the last preceding value stored to the same location in this 
interleaving. 
<LI>
If a program does allow data races on a given input (using the
N2334 definition), then there exists
a (possibly partial) sequentially consistent execution,
with two conflicting actions, neither of which happens before the other.
In effect, we only need to look at sequentially consistent
executions in order to determine whether there is a race.  For example,
a program such as
<TABLE align=center border=1>
  <TBODY>
  <TR>
    <TD>Thread1 </TD>
    <TD>Thread2 </TD></TR>
  <TR>
    <TD rowSpan=2>x = 0; <BR>if (x) y = 1; </TD>
    <TD rowSpan=2>y = 0; <BR>if (y) x = 1; 
  </TD></TR></TBODY></TABLE>
cannot possibly contain a race, since in a sequentially consistent execution,
each variable is accessed by only one thread.
<LI>
A program allows a data race on a given input according to the N2334
definition, if and only if there exists a (partial) sequentially consistent
execution in which the two unordered conflicting actions are adjacent in the
sequential interleaving, i.e. one occurs directly before the other.
</ol>
<P>
From a pure correctness
perspective, condition variable notification can be modelled
as a no-op, and a condition variable wait as an <TT>unlock()</tt>
followed by a <TT>lock()</tt> operation.  Hence the results here
also apply to programs with condition variables.

<P><B>Assumptions about Synchronization Operations:</b>
<P>
First note that although N2334 no longer explicitly requires the happens-before
relation to be irreflexive, i.e. acyclic, this is in fact still an implicit
requirement.  If there were a cycle such that <I>A</i> happened before
<I>A</i>, then this cycle would have to involve at least one inter-thread
synchronizes-with relationship, for which both the store <I>S</i> and load
<I>L</i> appear in the cycle.  But this would prevent <I>S</i> from being in
the visible sequence (1.10p10) of <I>L</i>, since <I>S</i> also "happens after"
<I>L</i>.
<P>
We now restrict our attention to programs whose only conflicting accesses
between threads are to locks (<TT>lock()</tt> and <TT>unlock()</tt> only for now)
and sequentially consistent atomics.
<P>
We assume that there exists a single strict irreflexive total
"atomic synchronization order"
<I>SO</i> on atomic operations, such that:
<OL>
<LI><I>SO</i> is
consistent with the happens-before relation,
i.e. the transitive closure of the union of the happens before
relation and the synchronization order remains irreflexive.
<LI><I>SO</i> is
consistent with the modification order of each atomic variable, i.e. the
modification orders are just <I>SO</i> restricted to operations on that
variable.
<LI>Each load operation performed on an atomic object yields the value of the
immediately preceding (in <I>SO</i>) store to that atomic object.
</ol>
This is required by the current atomics proposal
<A HREF="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2381.html">N2381</a>.
<P>
We further know that lock acquisitions and releases of lock <I>l</i> occur in
a single total order, namely the modification order <I>M<SUB>l</sub></i>
for <I>l</i>.  Since every lock release synchronizes with the
next lock acquisition in <I>M<SUB>l</sub></i>, and we assume that for
every lock acquisition, the next operation in <I>M<SUB>l</sub></i> is
a release of the lock performed by the same thread, it follows that
<I>M<SUB>l</sub></i> is a subset of the happens before relation.
<P>
Since <I>SO</i> is consistent with happens-before, and the <I>M<SUB>l</sub></i>
for various locks are subsets of happens before, we can extend <I>SO</i>
to a total order that includes all the <I>M<SUB>l</sub></i>.  (We get such
an order by topologically sorting the transitive closure of the union of
all the <I>M<SUB>l</sub></i> and <I>SO</i>.)
From here on, we will assume, without loss of generality,
that <I>SO</i> includes lock operations, and refer to it simply as
the "synchronization order".
<P><B>The peculiar role of <TT>try_lock()</tt>:</b>
<P>
So far we have limited ourselves to only <TT>lock()</tt> and
<TT>unlock()</tt> operations on locks.
<A HREF="http://portal.acm.org/citation.cfm?doid=1229428.1229470">
Boehm, "Reordering constraints for pthread-style locks", PPoPP 07</a>
points out that <TT>try_lock()</tt> introduces a different set of
issues.  Herb Sutter has pointed out that <TT>timed_lock()</tt>
shares the same issues.
<P>
The fundamental issue here is that, since <TT>lock()</tt> does not have release
semantics, a failed <TT>try_lock()</tt> sees the value of a modification that
did not necessarily happen before it.  There is not necessarily a single
irreflexive order that is consistent with both <I>SO</i> and the apparent
ordering of lock operations.  For example, the following is consistent with our
currently proposed semantics, since there are no synchronizes-with
relationships between the two threads:
<P>
<TABLE align=center border=1>
  <TBODY>
  <TR>
    <TD>Thread1 </TD>
    <TD>Thread2 </TD></TR>
  <TR>
    <TD rowSpan=2>store(&amp;x, 1); <BR>lock(&amp;l1); </TD>
    <TD rowSpan=2>r2 = try_lock(&amp;l1); // fails<BR>r3 = load(&amp;x); // yields 0 
  </TD></TR></TBODY></TABLE>
Thus, in spite of contrary claims in earlier versions of this paper, even our
first claim does not appear to hold in the presence of <TT>try_lock()</tt>.
<P>
If we wanted the first claim to hold with the customary interpretation
of <TT>try_lock()</tt>, we would need to preclude the above outcome by ensuring
that the two statements executed by thread 2 in the above example become
visible in order.  This would certainly require that failed <TT>try_lock()</tt>
operations have acquire semantics, which has non-negligible cost on some
architectures.  If we want all our claims to hold in the presence of
a standard <TT>try_lock()</tt>, we would also need the <TT>lock()</tt>
operation to have release semantics (in addition to its normal
acquire semantics), since it writes the value read by a failed
<TT>try_lock()</tt>.  This often has a substantial performance cost,
even in the absence of lock contention.
<P>
It was generally agreed that we do not want to incur either of the above
costs solely in order to support abuses of
<TT>try_lock()</tt>, such as the one in the above example.  We thus
proceed on a different path.
<P>
We will assume that if <TT>try_lock()</tt> is present at all, then
it can <EM>fail spuriously</em>, i.e. fail to acquire the lock and
return failure, even if the lock is available.  Similarly, if
<TT>timed_lock()</tt> is available, it may fail to acquire the lock,
even if the lock was available during the entire time window in
which we attempted to acquire it.
<P>
These have the effect of ensuring that neither a failed <TT>try_lock()</tt>
nor a failed <TT>timed_lock()</tt> provides useful information about the
state of the lock.  Hence they no longer act as read operations, and we can
no longer "read" the value of a lock unless the corresponding "write"
operation happened before the read.
<P>
The example above is no longer a counter-example to our first claim.
The outcome is possible in a sequentially consistent execution in which
all of thread 2 is executed before thread 1, since the
<TT>try_lock()</tt> in thread 2 can fail spuriously.

<P><B>Proof Of Main Claim 1</B>: 

<P>Again consider a particular race-free execution on the given input,
which follows the rules of
<A HREF="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2334.htm#races">N2334</a>.
<P>
The corresponding happens-before relation (<I>hb</i>) and synchronization
order are irreflexive and consistent, and hence can be extended
to a strict total order <I>T</i>.
<P>
Clearly the actions of each thread appear in <I>T</i> in thread order, i.e. in
sequenced-before order.
<P>
It remains to be shown that
each load sees the last preceding store in <I>T</i> that stores to the 
same location.  (For present purposes we view bit-field stores as loading and
then storing into the entire "location", i.e. the contiguous sequence
of non-zero-length bit-fields.)
<P>
Clearly this is true for operations on atomic objects, since all such operations
appear in <I>T</i> in the same order as in <I>SO</i>, and each load
sees the immediately preceding store to the same object in <I>SO</i>.  A similar
argument applies to operations on locks.
<P>
Lock operations on a single lock, other than failed <TT>try_lock</tt>s,
are also totally ordered by <I>SO</i>.  Thus each such operation
must see the results of the last preceding one in <I>T</i>.
<P>
We can say little about where a failed <TT>try_lock()</tt> operation
on a lock <I>l</i> appears in
<I>T</i>.  But, since we assume that <TT>try_lock()</tt> may fail spuriously,
it does not matter.  A failure outcome is acceptable no matter what state
the lock was left in by the last preceding operation (in <I>T</i>)
on <I>l</i>.  No matter where the failed  <TT>try_lock()</tt> appears
in <I>T</i>, the operations on <I>l</i> could have been executed in that
order, and produced the original results.
<P>
From here on, we consider only ordinary, non-atomic memory operations.
<P>
Consider a store operation <I>S<SUB>visible</sub></i> seen by a load <I>L</i>.
<P>
By the rules of N2334, <I>S<SUB>visible</sub></i> must happen before <I>L</i>.
Hence
<I>S<SUB>visible</sub></i> precedes <I>L</i> in the total order <I>T</i>.
<P>
Now assume that another store 
<I>S<SUB>middle</sub></i> appears between
<I>S<SUB>visible</sub></i> and <I>L</i> in <I>T</i>. 
<P>
We know from the fact that <I>T</i> is an extension of 
<I>hb</i>, that we cannot have either
<P ALIGN="center">
<I>L</i> <I>hb</i> <I>S<SUB>middle</sub></i>
<P ALIGN="center">
or
<P ALIGN="center">
<I>S<SUB>middle</sub></i> <I>hb</i> <I>S<SUB>visible</sub></i>
<P>
since either would be inconsistent with the reverse ordering in
<I>T</i>.
<P>
However, all three operations conflict, and we have no data races.
Hence they must all be ordered by <I>hb</i>, and <I>S<SUB>middle</sub></i>
must be <I>hb</i>-ordered between the other two.  But this violates
the second clause of the visibility definition in 1.10p9, which concludes the proof.

<P><B>Proof Of Main Claim 2</B>: 
<P>
We show that any data race by our definition corresponds to a data race in
a sequentially consistent execution.
<P>
Consider an execution with a data race.  Let <VAR>T</var> be the total
extension of the happens before and synchronization orders, as constructed
above.
<P>
Consider the longest prefix <VAR>TP</var> of <VAR>T</var>
that contains no race.
Note that each load in <VAR>TP</var> must see a store that precedes it in
either the synchronization or happens before orders.  Hence each load
in <VAR>TP</var> must see a store that is also in <VAR>TP</var>.
Similarly each lock operation must see the state produced by another lock
operation also in <VAR>TP</var>, or it must be a failed <TT>try_lock()</tt>
whose outcome could have occurred if it had seen such a state.
<P>
By the arguments of
the preceding section, the original execution restricted to <VAR>TP</var>
is equivalent to a sequentially consistent execution.
<P>
The next element <VAR>N</var> of <VAR>T</var> following
<VAR>TP</var> must be an ordinary
memory access that introduces a race.
If <VAR>N</var> is a write operation,
consider the original execution restricted to <VAR>TP</var>
&cup; {<VAR>N</var>}.  Otherwise consider the same execution except
that <VAR>N</var> sees the value written by the last write to the
same variable in <VAR>TP</var>.
<P>
In either case, the resulting execution (of <VAR>TP</var> plus the
operation introducing the race) is sequentially consistent; if
we extend the interleaving <VAR>TP</var> from above with the (variant of) <VAR>N</var>,
the resulting sequence is guaranteed to still be an interleaving
of the evaluation steps of the threads, such that each read sees
the preceding write.  (If <VAR>N</var> was a write, it could not
have been seen by any of the reads in <VAR>TP</var>, since those
reads were ordered before <VAR>N</var> in <VAR>T</var>.  If <VAR>N</var>
was a read, it was modified to make this claim true by construction.)
Thus we have a sequentially consistent execution of the program that
exhibits the data race.

<P><B>Proof Of Main Claim 3</B>: 
<P>
If <I>A</i> synchronizes with, or is sequenced before, <I>B</i>, then clearly
<I>A</i> must precede <I>B</i> in the interleaving corresponding to a
sequentially consistent execution: in the first case <I>B</i> sees the value
stored by <I>A</i>; in the second, the order within a thread must be
preserved.  Thus if <I>A</i> happens before <I>B</i>, <I>A</i> must precede
<I>B</i> in the interleaving.
<P>
It follows that if two conflicting operations executed by different threads
are adjacent in the interleaving, neither can happen before the other.
The synchronization operations introducing the happens-before ordering
would otherwise have to occur between them.  Thus if two such operations
exist in the interleaving, we must have an N2334 race.
<P>
It remains to show the converse:  If we have an N2334 race, there
must be an interleaving corresponding to a sequentially consistent execution
in which the racing operations are adjacent.
<P>
Start with the execution restricted to <VAR>TP</var>
&cup; {<VAR>N</var>}, as above, with any value read by <VAR>N</var>
adjusted as needed, also as above.  We know that nothing in this partial
execution depends on the value read by <VAR>N</var>.  Let <VAR>M</var>
be the other memory reference involved in the race.
<P>
We can further restrict the execution to those operations that
happen before either <VAR>M</var> or <VAR>N</var>.  This set still
consists of prefixes of the sequences of operations performed by
each thread.   Since each load sees a store that happens before it,
the omitted operations cannot impact the remaining execution.
(This would of course not be true for <TT>try_lock()</tt>.)
<P>
Define a partial order <EM>race-order</em> on {<VAR>x</var> : <VAR>x</var> happens before
<VAR>M</var> or <VAR>N</var>} &cup; {<VAR>M</var>} &cup; {<VAR>N</var>}
as follows.  First divide this set into three subsets:
<OL>
<LI> All elements that happen before either <VAR>M</var> or <VAR>N</var>.
<LI> { <VAR>M</var> }
<LI> { <VAR>N</var> }
</ol>
Race-order orders elements in each subset after elements of the subset(s)
that precede it, and before elements of the subset(s) that follow it.
We impose no ordering on the elements within the initial subset.
Clearly <VAR>M</var> and <VAR>N</var> must be adjacent in any total
extension of race-order.
<P>
Race-order is consistent with happens-before and the synchronization
order.  It imposes no additional order on the initial subset.  Neither
<VAR>M</var> nor <VAR>N</var> appears in the synchronization order,
and neither is race-ordered before, or happens before, any element of the
first subset.
If we had a cycle <VAR>A<SUB>0</sub></var>,
<VAR>A<SUB>1</sub></var>, ...,
<VAR>A<SUB>n</sub></var> = <VAR>A<SUB>0</sub></var>,
where each element of the sequence happens before, or is race-ordered
or synchronization-ordered before, the next, then neither <VAR>M</var>
nor <VAR>N</var> could appear in the cycle: nothing is ordered after
<VAR>N</var>, and only <VAR>N</var> is ordered after <VAR>M</var>.
The cycle would therefore involve only happens-before and synchronization-order
edges, which is impossible, since those two orders are required to be
consistent.
<P>
Construct the total order <VAR>T'</var> as a total extension of the
transitive closure of the union of
<OL>
<LI>happens before
<LI>synchronization order
<LI>race order
</ol>
By the preceding observation, this exists.
<P>
By the same arguments as in the proof of claim 1, every memory read
must see the preceding write in this sequence,
except possibly <VAR>N</var>, since it is the only one
that may see a value stored by a racing operation.  But we can again simply
adjust the value seen by <VAR>N</var> to obtain the property we desire,
without affecting the rest of the execution.  Thus <VAR>T'</var> is the
desired interleaving of thread actions in which the racing actions are
adjacent.

<P><B>Concluding Observation</B>: 
<P>
None of the above applies if we allow <TT>load_acquire</TT> and 
<TT>store_release</TT> operations, since the synchronization
operations themselves may not behave in a sequentially
consistent manner.
In particular, consider the following standard ("Dekker's") example: 
<P>
<TABLE align=center border=1>
  <TBODY>
  <TR>
    <TD>Thread1 </TD>
    <TD>Thread2 </TD></TR>
  <TR>
    <TD rowSpan=2>store_release(&amp;y, 1); <BR>r1 = load_acquire(&amp;x); </TD>
    <TD rowSpan=2>store_release(&amp;x, 1); <BR>r2 = load_acquire(&amp;y); 
  </TD></TR></TBODY></TABLE>
<P>The acquire/release version allows <I>r1</i> = <I>r2</i> = 0, an outcome that 
sequential consistency (and Java) rule out. </P></BODY></HTML>
