<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html lang="en-us">

<HEAD>

<TITLE>WG21/N2300: Concurrency memory model (revised)</TITLE>

<META http-equiv=Content-Type content="text/html; charset=windows-1252">

<STYLE type=text/css>.deleted {

	TEXT-DECORATION: line-through

}

.inserted {

	TEXT-DECORATION: underline

}

</STYLE>



</HEAD>

<BODY>

<TABLE 

  summary="This table provides identifying information for this document."><TBODY>

  <TR>

    <TH>Doc. No.:</TH>

    <TD>WG21/N2300<BR>J16/07-160</TD></TR>

  <TR>

    <TH>Date:</TH>

    <TD>2007-06-22</TD></TR>

  <TR>

    <TH>Reply to:</TH>

    <TD>Clark Nelson</TD>

    <TD>Hans-J. Boehm</TD></TR>

  <TR>

    <TH>Phone:</TH>

    <TD>+1-503-712-8433</TD>

    <TD>+1-650-857-3406</TD></TR>

  <TR>

    <TH>Email:</TH>

    <TD><A href="mailto:clark.nelson@intel.com">clark.nelson@intel.com</A></TD>

    <TD><A 

href="mailto:Hans.Boehm@hp.com">Hans.Boehm@hp.com</A></TD></TR></TBODY></TABLE>

<H1>Concurrency memory model (revised)</H1>

<P>This paper is a follow-on to <A

href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2171.htm">N2171</A>,

which was effectively divided into

<A 

href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2239.htm">N2239</A>

and this paper.  N2239 is included in the current working paper.  This is an

update to the remaining changes, which are all related to concurrency.

<P>

Changes to the corresponding section of N2171 include:

<UL>

<LI> Various changes to notes, including fixing an editing mistake in 1.10p10,

and the addition of some explanatory notes.

<LI> Switched to a weaker "synchronizes with" formulation in which

a load must see either the value of the store itself, or a derivative obtained

by a sequence of RMW operations.

<LI> Rephrased the interaction of modification order and visibility in

1.10p10.  This version imposes stronger restrictions, and generally disallows

"flickering" of values.

<LI> Switched to a more conventional happens-before definition, which

includes sequenced-before.

<LI> Switched to a more conventional formulation in which the "precedes"

relation was replaced by a statement that no evaluation can see a store that

happens after it, or is sequenced after it.

<P>

This avoids some concerns about synchronization

elimination.  In particular, the old formulation allowed (everything

initially zero, atomic):

<P>

<TABLE align=center border=1>

  <TBODY>

  <TR>

    <TD>Thread1 </TD>

    <TD>Thread2 </TD></TR>

  <TR>

    <TD rowSpan=2>r1 = x.load_relaxed(); // yields 1

        <BR>y.store_relaxed(1);</td>

    <TD rowSpan=2>r1 = y.load_relaxed(); // yields 1

        <BR>x.store_relaxed(1);</td>

  </TR></TBODY></TABLE>

<P>

but disallowed the corresponding test case if the two statements in each

thread were separated by acq_rel read-modify-write operations to dead

variables, or a locked region.  That interferes with the elimination of

locks if the compiler decides to statically combine threads, which is

likely to become important for certain programming styles.

<LI> Explicitly mentioned the input when referring to a data race.

A program may exhibit a data race on some inputs and have well-defined

semantics on others.

<LI> Added back a proposed paragraph to explicitly address nonterminating

loops.

<LI> On Beman Dawes' suggestion, added the change to 15.3p9 dealing with

uncaught exceptions, and renamed "inter-thread data race" to just "data race".

</ul>

<P>This version has benefited from feedback from many people, including

Sarita Adve, Paul McKenney, Raul Silvera, Lawrence Crowl, and Peter Dimov.

<H2>Contents</H2>

<UL>

  <LI><A href="#location">The definition of "memory location"</A> 

  <LI><A href="#races">Multi-threaded executions and data races</A> 

  <LI><A href="#loops">Nonterminating loops</A> 

  <LI><A href="#exceptions">Treatment of uncaught exceptions</A> 

</ul>

<H2><A id=location>The definition of "memory location"</A></H2>

<P>New paragraphs inserted as 1.7p3 et seq.:</P>

<BLOCKQUOTE class=inserted>

  <P>A <DFN>memory location</DFN> is either an object of scalar type, or a

  maximal sequence of adjacent bit-fields all having non-zero width. Two

  threads of execution can update and access separate memory locations without

  interfering with each other.</P> <P>[<EM>Note</EM>: Thus a bit-field and an

  adjacent non-bit-field are in separate memory locations, and therefore can be

  concurrently updated by two threads of execution without interference. The

  same applies to two bit-fields, if one is declared inside a nested struct

  declaration and the other is not, or if the two are separated by a

  zero-length bit-field declaration, or if they are separated by a

  non-bit-field declaration. It is not safe to concurrently update two

  bit-fields in the same struct if all fields declared between them are also bit-fields of non-zero width,

  no matter what the sizes of those intervening bit-fields happen to be.

  <EM>end note</EM> ]</P>

  <P>[<EM>Example</EM>: A structure declared as <CODE>struct {char a; int b:5, 

  c:11, :0, d:8; struct {int ee:8;} e;}</CODE> contains four separate memory 

  locations: The field <CODE>a</CODE>, and bit-fields <CODE>d</CODE> and 

  <CODE>e.ee</CODE> are each separate memory locations, and can be modified 

  concurrently without interfering with each other. The bit-fields 

  <CODE>b</CODE> and <CODE>c</CODE> together constitute the fourth memory 

  location. The bit-fields <CODE>b</CODE> and <CODE>c</CODE> can not be 

  concurrently modified, but <CODE>b</CODE> and <CODE>a</CODE>, for example, can 

  be. <EM>end example</EM>.] </P></BLOCKQUOTE>

<H2><A id=races>Multi-threaded executions and data races</A></H2>

<P>Insert a new section between 1.9 and 1.10, titled "Multi-threaded executions 

and data races".</P>

<P>1.10p1:</P>

<BLOCKQUOTE class=inserted>

  <P>Under a hosted implementation, a C++ program can have more than one 

  <DFN>thread of execution</DFN> (a.k.a. <DFN>thread</DFN>) running 

  concurrently. Each thread executes a single function according to the rules 

  expressed in this standard. The execution of the entire program consists of an 

  execution of all of its threads. [<EM>Note:</EM> Usually the execution can be 

  viewed as an interleaving of all its threads. However some kinds of atomic 

  operations, for example, allow executions inconsistent with a simple 

  interleaving, as described below. <EM>end note</EM> ] Under a freestanding 

  implementation, it is implementation-defined whether a program can have more 

  than one thread of execution.</P></BLOCKQUOTE>

<P>1.10p2:</P>

<BLOCKQUOTE class=inserted>

  <P>The execution of each thread proceeds as defined by the remainder of this

  standard. The value of an object visible to a thread <VAR>T</VAR> at a

  particular point might be the initial value of the object, a value assigned

  to the object by <VAR>T</VAR>, or a value assigned to the object by another 

  thread, according to the rules below.

  [<em>Note:</em> Much of this section is motivated by the desire

  to support atomic operations with explicit and detailed visibility

  constraints.  However, it also implicitly supports a simpler view

  for more restricted programs.  See 1.10p11. <EM>end note</EM> ]

  </P></BLOCKQUOTE>

<P>1.10p3:</P>

<BLOCKQUOTE class=inserted>

  <P>Two expression evaluations <DFN>conflict</DFN> if one of them modifies a 

  memory location and the other one accesses or modifies the same memory 

  location.</P></BLOCKQUOTE>

<P>1.10p4:</P>

<BLOCKQUOTE class=inserted>

  <P>The library defines a number of operations, such as operations on locks

  and atomic objects, that are specially identified as synchronization

  operations.  These operations play a special role in making assignments in

  one thread visible to another. A <DFN>synchronization operation</DFN> is

  either an acquire operation or a release operation, or both, on one or more

  memory locations. [<EM>Note:</EM> For example, a call that acquires a lock

  will perform an acquire operation on the locations comprising the lock.

  Correspondingly, a call that releases the same lock will perform a release

  operation on those same locations. Informally, performing a release operation

  on <VAR>A</VAR> forces prior side effects on other memory locations to become

  visible to other threads that later perform an acquire operation on

  <VAR>A</VAR>.  We do not include "relaxed" atomic operations as
  "synchronization" operations, though, like synchronization operations, they
  cannot contribute to data races.

  &#8212;<EM>end note</EM> ]</P></BLOCKQUOTE>

<P><A id=s1.10p5>1.10p5-6, previously containing the definition of

"inter-thread ordered before",</A> have been deleted from this revision.

1.10p6 was subsequently replaced by the following paragraph, which was at
one point part of 1.10p7.

Subsequent paragraphs will be renumbered eventually.</P>

<P>1.10p6:</P>

<BLOCKQUOTE class=inserted>

  <P>All modifications to a particular atomic object <VAR>M</VAR> occur in some 

  particular total order, called the <DFN>modification order</DFN> of 

  <VAR>M</VAR>.

  [<em>Note:</em> These are separate orders for each scalar object.  There is

  no requirement that these can be combined into a single total order for all

  objects.  In general this will be impossible since different threads may

  observe modifications to different variables in inconsistent orders.

  &#8212;<em>end note</em> ]</p></blockquote>



<P>1.10p7:</P>

<P>This has been weakened since N2171 so as not to require synchronizes-with
for all later reads.

Some weakening of the older specs appears to be necessary to preserve

efficient cross-platform implementability of low-level atomics.

This is probably not the only possible such weakening.  But all of them

appear to either:

<UL>

<LI>Make the memory model much harder to describe, or

<LI>Allow somewhat counterintuitive outcomes for some test cases.

</ul>

Without the special exemption for read-modify-write operations,

we would allow the particularly counterintuitive outcome for one of

Peter Dimov's

examples: (x, y ordinary, v atomic, all initially zero)

<P>

<TABLE align=center border=1>

  <TBODY>

  <TR>

    <TD>Thread1 </TD>

    <TD>Thread2 </TD>

    <TD>Thread3 </TD></TR>

  <TR>

    <TD rowSpan=2>x = 1; <BR>fetch_add_release(&amp;v, 1);</td>

    <TD rowSpan=2>y = 1; <BR>fetch_add_release(&amp;v, 1);</td>

    <TD rowSpan=2>if (load_acquire(&amp;v) == 2) <BR>&nbsp;&nbsp;assert (x + y == 2);</td>

  </TR></TBODY></TABLE>

<P>

Here the assertion could fail, since only the later fetch_add_release

would ensure

visibility of the preceding store.  The value written by the earlier
one might not be seen by thread3.  The special clause for RMW operations

prevents the assertion from failing here and in similar examples.

<BLOCKQUOTE class=inserted>

  <P>An evaluation <VAR>A</VAR> that performs a release operation on 

  an object <VAR>M</VAR> <DFN>synchronizes with</DFN>

  an evaluation <VAR>B</VAR> 

  that performs an acquire operation on <VAR>M</VAR> and reads either the value 

  written by <VAR>A</VAR> or, if the following (in modification order)

  sequence of updates to <VAR>M</VAR> are atomic read-modify-write operations

  or sequentially consistent atomic stores,

  a value written by one of these read-modify-write operations or

  sequentially consistent stores.

  [<EM>Note:</em> 

  Except in the specified cases,

  reading a later value does not necessarily ensure

  visibility as described below.  Such a requirement would sometimes

  interfere with efficient implementation. &#8212;<EM>end note</EM> ]

  [<EM>Note:</EM> The specifications of the synchronization 

  operations define when one reads the value written by another. For atomic 

  variables, the definition is clear. All operations on a given lock occur in a 

  single total order. Each lock acquisition "reads the value written" by the 

  last lock release. &#8212;<EM>end note</EM> ]</P></BLOCKQUOTE>

<P>1.10p8:</P>

<P> This has been strengthened since N2171 to include sequenced before in

happens before.

<BLOCKQUOTE class=inserted>

  <P>An evaluation <VAR>A</VAR> <DFN>happens before</DFN> an evaluation 

  <VAR>B</VAR> if:</P>

  <UL>

    <LI><VAR>A</VAR> is sequenced before <VAR>B</VAR>; or 

    <LI><VAR>A</VAR> synchronizes with <VAR>B</VAR>; or 

    <LI>for some evaluation <VAR>X</VAR>, <VAR>A</VAR> happens before 

    <VAR>X</VAR> and <VAR>X</VAR> happens before <VAR>B</VAR>. 

</LI></UL></BLOCKQUOTE>

<P><A id=s1.10p9>1.10p9</A> was once proposed to define a "precedes" relation,

which is no longer needed.  It should eventually be renumbered out of

existence.  Insisting on an acyclic "precedes" relation potentially interfered

with synchronization elimination.



<P>1.10p10:</P>

This paragraph has been revised repeatedly, as we have tried to

pin down the interaction with "modification" order, i.e. what's normally

known as "cache coherence".  Note that directly including modification

order in "happens before" is too strong.  To see this,

consider (everything again initially zero):

<P>

<TABLE align=center border=1>

  <TBODY>

  <TR>

    <TD>Thread1 </TD>

    <TD>Thread2 </TD></TR>

  <TR>

    <TD rowSpan=3> x.store_relaxed(1); <BR>

    		   v.store_relaxed(1); <BR>

        	   r1 = y.load_relaxed();</td>

    <TD rowSpan=3> y.store_relaxed(1); <BR>

    		   v.store_relaxed(2); <BR>

        	   r2 = x.load_relaxed();</td>

  </TR></TBODY></TABLE>

<P>

If we had a happens-before ordering between the two stores to <VAR>v</var>,

in either direction, we would preclude <VAR>r1</var> = <VAR>r2</var> = 0,

which could usually only be enforced with a fence.

<P>

This version was also altered by the removal of the "precedes" relation.

Note that the new first clause here may be technically redundant, but I

think it is clearer to state it explicitly.

<P>

<BLOCKQUOTE class=inserted>

  <P>A multi-threaded execution is <DFN>consistent</DFN> if

  <UL>

  <LI>no evaluation happens before itself,

  <LI>if a side effect <VAR>W</VAR> to scalar object <VAR>M</VAR>

    happens before another side effect <VAR>W'</VAR> to the same

    scalar object <VAR>M</VAR>, then <VAR>W</VAR> must precede

    <VAR>W'</VAR> in <VAR>M</VAR>'s modification order,

  </ul>

  and if for every read access <VAR>R</var>
  to scalar object <VAR>M</var> that observes value <VAR>a</var> written
  by side effect <VAR>W</var>, the following conditions hold:

  <UL>

    <LI> <VAR>R</var> does not happen before <VAR>W</var>.

    <LI> There is no side effect <VAR>W'</VAR> to <VAR>M</VAR> such that 

    <UL>

      <LI><VAR>W</VAR> happens before <VAR>W'</VAR>,  and 

      <LI><VAR>W'</VAR> happens before <VAR>R</VAR>. 

    </UL>

    <LI>If read access <VAR>R</VAR> happens before

    read access <VAR>R'</VAR> to the same scalar object <VAR>M</VAR>

    which observes value <VAR>b</var>,

    then the corresponding side effect assigning <VAR>b</var> to <VAR>M</VAR>

    may not precede the side effect <VAR>W</VAR> in

    <VAR>M</VAR>'s modification order.

  </UL>

  <P>[<EM>Note:</EM> The first condition states essentially that the

  happens-before relation consistently orders evaluations.  We cannot

  have <VAR>A</VAR> happens before <VAR>B</VAR>, and <VAR>B</VAR>

  happens before <VAR>A</VAR>, since that would imply <VAR>A</VAR>

  happens before <VAR>A</VAR>.

  The second condition states that the modification orders must

  respect happens before.

  The third condition implies that a read operation 

  <VAR>R</VAR> cannot "see" an assignment <VAR>W</VAR> if <VAR>R</VAR> happens 

  before <VAR>W</VAR>.

  The fourth condition effectively asserts that later assignments 

  hide earlier ones if there is a well-defined order between them.

  The fifth condition states that reads of the same object must observe

  a sequence of changes that is consistent with that object's

  modification order.  This last condition effectively disallows

  compiler reordering of atomic operations to a single object,

  even if both operations are "relaxed" loads.  By doing so, we effectively

  make the "cache coherence" guarantee provided by essentially all

  hardware available to C++ atomic operations. <EM>end 

  note</EM> ]</P></BLOCKQUOTE>

<P>1.10p11:</P>

<BLOCKQUOTE class=inserted>

  <P>An execution contains a <DFN>data race</DFN> if it contains two

  conflicting actions in different threads, at least one of which is not

  atomic, and neither happens before the other. Any data race results in

  undefined behavior. A multi-threaded program that does not allow a data

  race for the given inputs exhibits the behavior of a consistent execution,

  as defined in 1.10p10.

  [<EM>Note:</EM> It can be shown that programs that correctly use simple locks

  to prevent all data races, and use no other synchronization operations,

  behave as though the executions of their constituent threads were simply

  interleaved, with each observed value of an object being the last value

  assigned in that interleaving. This is normally referred to as "sequential

  consistency".  However, this applies only to race-free programs, and

  race-free programs cannot observe most program transformations that do not

  change single-threaded program semantics. In fact, most single-threaded

  program transformations continue to be allowed, since any program that

  behaves differently as a result must perform an undefined operation.

  <EM>end note</EM> ]</P></BLOCKQUOTE>

<P>1.10p12:</P>

<BLOCKQUOTE class=inserted>

  <P>[<EM>Note:</EM> Compiler transformations that introduce assignments to a 

  potentially shared memory location that would not be modified by the abstract 

  machine are generally precluded by this standard, since such an assignment 

  might overwrite another assignment by a different thread in cases in which an 

  abstract machine execution would not have encountered a data race.

  This includes implementations of data member assignment that overwrite

  adjacent members in separate memory locations. <EM>end 

  note</EM> ]</P></BLOCKQUOTE>

<P>1.10p13:</P>

<BLOCKQUOTE class=inserted>

  <P>[<EM>Note:</EM> Transformations that introduce a speculative read of

  a shared variable may not preserve the semantics of the C++ program as

  defined in this standard,

  since they potentially introduce a data race.  However, they are typically

  valid in the context of an optimizing compiler that targets a specific

  machine with well-defined semantics for data races.  They would

  be invalid for a hypothetical machine that is not tolerant of races

  or provides hardware race detection.

  <EM>end note</EM> ]</P></BLOCKQUOTE>

<H2><A id=loops>Nonterminating loops</A></H2>

<P>

It is generally felt that it is important to allow the transformation

of potentially nonterminating loops (e.g. by merging two loops that

iterate over the same potentially infinite set, or by eliminating

a side-effect-free loop), even when that

may not otherwise be justified in the case in which the first loop never

terminates.

<P>

Existing compilers commonly assume that code immediately

following a loop is executed if and only if code immediately preceding

a loop is executed.  This assumption is clearly invalid if the loop fails

to terminate.  Even if we wanted to prohibit this behavior, it is unclear

that all relevant compilers could comply in a reasonable amount of

time.  The assumption appears both pervasive and hard to test for.

<P>

The treatment of nonterminating loops in the current standard is very

unclear.  We believe that some implementations already eliminate

potentially nonterminating, side-effect-free loops, probably based on 1.9p9,

which appears to impose very weak requirements on conforming

implementations for nonterminating programs.  We had previously arrived

at a tentative conclusion that nonterminating loops were already

sufficiently weakly specified that no changes were needed.

We no longer believe this, for the following reasons:

<UL>

<LI>On closer inspection,

it is at best unclear that this reasoning would continue to apply in a

world in which the program may terminate even if one of the threads does not.

<LI>In the presence of threads, the elimination of certain side-effect-free

potentially infinite loops (e.g. <TT>while

(!please_self_destruct.load_acquire()) {}; self_destruct()</tt>) is clearly

hazardous, and a bit more clarity seems appropriate.

</ul>

Hence we propose the following addition:

<P>6.5p5:</P>

<P>

<BLOCKQUOTE class=inserted>

  <P>A nonterminating loop that

  <UL>

  	<LI> performs no I/O operations, and

	<LI> does not access or modify volatile objects, and

	<LI> performs no synchronization or atomic operations

  </ul>

  invokes undefined behavior.  [<EM>Note:</EM>  This is meant to

  allow compiler transformations, such as removal of empty loops,

  even when termination cannot be proven. <EM>end note</EM>]

</BLOCKQUOTE>

<P>

We had previously discussed limiting "undefined" behavior to certain

optimizations.  But it is unclear how to do that usefully, such that

there are any programs that could usefully take advantage of such a

statement.

<P>

This formulation does have the advantage that it makes it possible to

painlessly write nonterminating loops that <EM>cannot</em>

be eliminated by the compiler, even for single-threaded programs.

<H2><A id=exceptions>Treatment of uncaught exceptions</A></H2>

<P>15.3p9:</P>

<P>[Beman Dawes' suggestion, reflecting an earlier discussion:]

Change "a program" to "the current thread of execution" in

<BLOCKQUOTE>

  <P>If no matching handler is found in <DEL>a program</del>

  <INS>the current thread of execution</ins>,

  the function std::terminate() is called; whether

  or not the stack is unwound before this call to std::terminate()

  is implementation-defined (15.5.1).</blockquote>

</BODY></HTML>

