<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
	"http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-us">

<head>
<title>WG21/N2334: Concurrency memory model (revised again)</title>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
<style type="text/css">
.deleted {
	text-decoration: line-through
}
.inserted {
	text-decoration: underline
}
</style>
</head>

<body>

<table summary="This table provides identifying information for this document.">
	<tr>
		<th>Doc. No.:</th>
		<td>WG21/N2334<br />
		J16/07-0194</td>
	</tr>
	<tr>
		<th>Date:</th>
		<td>2007-08-05</td>
	</tr>
	<tr>
		<th>Reply to:</th>
		<td>Clark Nelson</td>
		<td>Hans-J. Boehm</td>
	</tr>
	<tr>
		<th>Phone:</th>
		<td>+1-503-712-8433</td>
		<td>+1-650-857-3406</td>
	</tr>
	<tr>
		<th>Email:</th>
		<td><a href="mailto:clark.nelson@intel.com">clark.nelson@intel.com</a></td>
		<td><a href="mailto:Hans.Boehm@hp.com">Hans.Boehm@hp.com</a></td>
	</tr>
</table>
<h1>Concurrency memory model (revised again)</h1>
<p>This paper is a revision of
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2300.htm">N2300</a>. 
Changes to N2300 include:</p>
<ul>
	<li>Added another clause to the definition of consistent execution, after Sarita 
	Adve pointed out that under the N2300 version, the definition of data race 
	declared data races to exist in programs without atomics that had no corresponding 
	data race in any sequentially consistent execution. This subsequently 
	turned into the much cleaner restriction that ordinary memory reads see only 
	writes that happen before them.</li>
	<li>Made it clearer that the definition of data race is based on the definition 
	of consistent execution.</li>
	<li>Weakened the &quot;synchronizes with&quot; rule for intervening sequentially consistent 
	stores and for relaxed read-modify-write operations, due to implementability concerns. 
	Strengthened it to include stores in the same thread, to accommodate N2324-style fences.
	</li>
	<li>Expanded 1.10p2 and 1.10p12 notes.</li>
	<li>Clarified that the conditions of 6.5p5 do not apply to for loop initialization.</li>
	<li>Restructured much of the section to make it more comprehensible to the 
	intended target audience, eliminating the explicit notion of a consistent 
	execution. Instead the notions of &quot;visible&quot; side effect and &quot;visible 
	sequence&quot; were added in 1.10p9 and 1.10p10. Although this changed much of 
	the text, hopefully making it much more readable, it is not intended to be a 
	substantive change.</li>
	<li>Renumbered paragraphs.</li>
</ul>
<p>The existing discussion from N2300 was generally left in place.</p>
<p>N2300's changes to the corresponding section of its predecessor,
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2171.htm">N2171</a>, 
include:</p>
<ul>
	<li>Various changes to notes, including fixing an editing mistake in 1.10p10, 
	and the addition of some explanatory notes.</li>
	<li>Switched to a weaker &quot;synchronizes with&quot; formulation in which a load must 
	see either the value of the store itself, or a derivative obtained by a sequence 
	of RMW operations.</li>
	<li>Rephrased the interaction of modification order and visibility in 1.10p10. 
	This version imposes stronger restrictions, and generally disallows &quot;flickering&quot; 
	of values.</li>
	<li>Switched to a more conventional definition of &quot;happens before&quot;, which includes 
	&quot;sequenced before&quot;.</li>
	<li>Switched to a more conventional formulation in which the &quot;precedes&quot; relation 
	was replaced by a statement that no evaluation can see a store that happens 
	after it, or is sequenced after it.
	<p>This avoids some concerns about synchronization elimination. In particular, 
	the old formulation allowed (everything initially zero, atomic):</p>
	<table>
		<tr>
			<td>Thread1</td>
			<td>Thread2</td>
		</tr>
		<tr>
			<td>r1 = x.load_relaxed(); // yields 1 <br />
			y.store_relaxed(1);</td>
			<td>r1 = y.load_relaxed(); // yields 1 <br />
			x.store_relaxed(1);</td>
		</tr>
	</table>
	<p>but disallowed the corresponding test case if the two statements in each 
	thread were separated by acq_rel read-modify-write operations to dead variables, 
	or a locked region. That interferes with the elimination of locks if the compiler 
	decides to statically combine threads, which is likely to become important for 
	certain programming styles.</p>
	</li>
	<li>Explicitly mentioned the input when referring to a data race. A program 
	may exhibit a data race on some inputs and have well-defined semantics on others.
	</li>
	<li>Added back a proposed paragraph to explicitly address nonterminating loops.
	</li>
	<li>On Beman Dawes&#39; suggestion, added the change to 15.3p9 dealing with uncaught 
	exceptions, and renamed &quot;inter-thread data race&quot; to just &quot;data race&quot;.</li>
</ul>
<p>This version has benefited from feedback from many people, including Sarita 
Adve, Paul McKenney, Raul Silvera, Lawrence Crowl, and Peter Dimov.</p>
<h2>Contents</h2>
<ul>
	<li><a href="#location">The definition of &quot;memory location&quot;</a></li>
	<li><a href="#races">Multi-threaded executions and data races</a></li>
	<li><a href="#loops">Nonterminating loops</a></li>
	<li><a href="#exceptions">Treatment of uncaught exceptions</a></li>
</ul>
<h2><a id="location">The definition of &quot;memory location&quot;</a></h2>
<p>New paragraphs inserted as 1.7p3 et seq.:</p>
<blockquote class="inserted">
	<p>A <dfn>memory location</dfn> is either an object of scalar type, or a maximal 
	sequence of adjacent bit-fields all having non-zero width. Two threads of execution 
	can update and access separate memory locations without interfering with each 
	other.</p>
	<p>[<em>Note</em>: Thus a bit-field and an adjacent non-bit-field are in separate 
	memory locations, and therefore can be concurrently updated by two threads of 
	execution without interference. The same applies to two bit-fields, if one is 
	declared inside a nested struct declaration and the other is not, or if the 
	two are separated by a zero-length bit-field declaration, or if they are separated 
	by a non-bit-field declaration. It is not safe to concurrently update two bit-fields 
	in the same struct if all fields between them are also bit-fields, no matter 
	what the sizes of those intervening bit-fields happen to be. <em>end note</em> 
	]</p>
	<p>[<em>Example</em>: A structure declared as <code>struct {char a; int b:5, 
	c:11, :0, d:8; struct {int ee:8;} e;}</code> contains four separate memory locations: 
	The field <code>a</code>, and bit-fields <code>d</code> and <code>e.ee</code> 
	are each separate memory locations, and can be modified concurrently without 
	interfering with each other. The bit-fields <code>b</code> and <code>c</code> 
	together constitute the fourth memory location. The bit-fields <code>b</code> 
	and <code>c</code> cannot be concurrently modified, but <code>b</code> and
	<code>a</code>, for example, can be. <em>end example</em>.]</p>
</blockquote>
<h2><a id="races">Multi-threaded executions and data races</a></h2>
<p>Insert a new section between 1.9 and 1.10, titled &quot;Multi-threaded executions 
and data races&quot;.</p>
<p>1.10p1:</p>
<blockquote class="inserted">
	<p>Under a hosted implementation, a C++ program can have more than one <dfn>
	thread of execution</dfn> (a.k.a. <dfn>thread</dfn>) running concurrently. Each 
	thread executes a single function according to the rules expressed in this standard. 
	The execution of the entire program consists of an execution of all of its threads. 
	[<em>Note:</em> Usually the execution can be viewed as an interleaving of all 
	its threads. However some kinds of atomic operations, for example, allow executions 
	inconsistent with a simple interleaving, as described below. <em>end note</em> 
	] Under a freestanding implementation, it is implementation-defined whether 
	a program can have more than one thread of execution.</p>
</blockquote>
<p>1.10p2:</p>
<blockquote class="inserted">
	<p>The execution of each thread proceeds as defined by the remainder of this 
	standard. The value of an object visible to a thread <var>T</var> at a particular 
	point might be the initial value of the object, a value assigned to the object 
	by <var>T</var>, or a value assigned to the object by another thread, according 
	to the rules below. [<em>Note:</em> In some cases, there may instead be undefined 
	behavior. Much of this section is motivated by the desire to support atomic 
	operations with explicit and detailed visibility constraints. However, it also 
	implicitly supports a simpler view for more restricted programs.  
	<em>end note</em> ]</p>
</blockquote>
<p>1.10p3:</p>
<blockquote class="inserted">
	<p>Two expression evaluations <dfn>conflict</dfn> if one of them modifies a 
	memory location and the other one accesses or modifies the same memory location.</p>
</blockquote>
<p>1.10p4:</p>
<blockquote class="inserted">
	<p>The library defines a number of operations, such as operations on locks and 
	atomic objects, that are specially identified as synchronization operations. 
	These operations play a special role in making assignments in one thread visible 
	to another. A <dfn>synchronization operation</dfn> is either an <dfn>acquire</dfn> operation 
	or a <dfn>release</dfn> operation, or both, on one or more memory locations; 
	the semantics of these are described below. In addition, there are <dfn>relaxed</dfn> 
	atomic operations, which are not synchronization operations, and <dfn>read-modify-write</dfn> operations, 
	which have special characteristics, also described below. [<em>Note:</em> 
	For example, a call that acquires a lock will perform an acquire operation on 
	the locations comprising the lock. Correspondingly, a call that releases the 
	same lock will perform a release operation on those same locations. Informally, 
	performing a release operation on <var>A</var> forces prior side effects on 
	other memory locations to become visible to other threads that later perform 
	an acquire operation on <var>A</var>. We do not include &quot;relaxed&quot; atomic operations 
	as &quot;synchronization&quot; operations though, like synchronization operations, they 
	cannot contribute to data races. <em>end note</em> ]</p>
</blockquote>
<p>1.10p5 (previously 1.10p6):</p>
<blockquote class="inserted">
	<p>All modifications to a particular atomic object <var>M</var> occur in some 
	particular total order, called the <dfn>modification order</dfn> of <var>M</var>. 
	If <var>A</var> and <var>B</var> are modifications of an atomic object <var>
	M</var>, and <var>A</var> happens before <var>B</var> (where &quot;happens before&quot; 
	is defined below), then <var>A</var> shall precede <var>B</var> in the modification order of <var>M</var>. [<em>Note:</em> 
	This states that the modification orders must respect &quot;happens before&quot;. <em>end 
	note</em> ] [<em>Note:</em> There is a separate order for each scalar object. 
	There is no requirement that these can be combined into a single total order 
	for all objects. In general this will be impossible since different threads 
	may observe modifications to different variables in inconsistent orders. <em>end 
	note</em> ]</p>
</blockquote>
<p>1.10p6 (previously embedded in 1.10p7):</p>
<blockquote class="inserted">
	<p>A <dfn>release sequence</dfn> on an atomic object <var>M</var> is a maximal 
	contiguous sub-sequence of side effects in the modification order of <var>M</var>, 
	where the first operation is a release, and every subsequent operation</p>
	<ul>
		<li>is performed by the same thread that performed the release, or</li>
		<li>is a non-relaxed atomic read-modify-write operation.</li>
	</ul>
</blockquote>
<p>1.10p7:</p>
<p>This has been weakened since N2171, so that it no longer requires synchronizes-with 
for all later reads. Some weakening of the older specifications appears to be necessary 
to preserve efficient cross-platform implementability of low-level atomics. This is 
probably not the only possible such weakening, but all of the alternatives appear to either:</p>
<ul>
	<li>Make the memory model much harder to describe, or</li>
	<li>Allow somewhat counterintuitive outcomes for some test cases.</li>
</ul>
<p>Without the special exemption for read-modify-write operations, we would allow 
the particularly counterintuitive outcome for one of Peter Dimov&#39;s examples: (x, 
y ordinary, v atomic, all initially zero)</p>
<table>
	<tr>
		<td>Thread1</td>
		<td>Thread2</td>
		<td>Thread3</td>
	</tr>
	<tr>
		<td>x = 1; <br />
		fetch_add_release(&amp;v, 1);</td>
		<td>y = 1; <br />
		fetch_add_release(&amp;v, 1);</td>
		<td>if (load_acquire(&amp;v) == 2) <br />
&nbsp; assert (x + y == 2);</td>
	</tr>
</table>
<p>Here the assertion could fail, since only the later fetch_add_release would ensure 
visibility of the store that precedes it; the value written by the earlier one might 
not be seen by Thread3. The special clause for RMW operations prevents the assertion 
from failing here and in similar examples.</p>
<blockquote class="inserted">
	<p>An evaluation <var>A</var> that performs a release operation on an object
	<var>M</var> <dfn>synchronizes with</dfn> an evaluation <var>B</var> that performs 
	an acquire operation on <var>M</var> and reads a value written by any side effect 
	in the release sequence headed by <var>A</var>. [<em>Note:</em> Except in the 
	specified cases, reading a later value does not necessarily ensure visibility 
	as described below. Such a requirement would sometimes interfere with efficient 
	implementation. <em>end note</em> ] [<em>Note:</em> The specifications of the 
	synchronization operations define when one reads the value written by another. 
	For atomic variables, the definition is clear. All operations on a given lock 
	occur in a single total order. Each lock acquisition &quot;reads the value written&quot; 
	by the last lock release. <em>end note</em> ]</p>
</blockquote>
<p>1.10p8:</p>
<p>This has been strengthened since N2171 to include &quot;sequenced before&quot; in &quot;happens 
before&quot;.</p>
<blockquote class="inserted">
	<p>An evaluation <var>A</var> <dfn>happens before</dfn> an evaluation <var>B</var> 
	if:</p>
	<ul>
		<li><var>A</var> is sequenced before <var>B</var>; or</li>
		<li><var>A</var> synchronizes with <var>B</var>; or</li>
		<li>for some evaluation <var>X</var>, <var>A</var> happens before <var>X</var> 
		and <var>X</var> happens before <var>B</var>.</li>
	</ul>
</blockquote>
<p>1.10p9 (previously embedded in 1.10p10):</p>
<blockquote class="inserted">
	<p>A <dfn>visible</dfn> side effect <var>A</var> on an object <var>M</var> 
	with respect to a value computation <var>B</var> of <var>M</var> satisfies the 
	conditions:</p>
	<ul>
		<li><var>A</var> happens before <var>B</var>, and</li>
		<li>there is no other side effect <var>X</var> to <var>M</var> such that
		<var>A</var> happens before <var>X</var> and <var>X</var> happens before
		<var>B</var>.</li>
	</ul>
	<p>The value of a non-atomic scalar object <var>M</var>, as determined by evaluation
	<var>B</var>, shall be the value stored by the visible side effect <var>A</var>. 
	[ <em>Note:</em> If there is ambiguity about which side effect to a non-atomic 
	object is visible, then there is a data race, and the behavior is undefined. 
	<em>end note</em> ] [ <em>Note:</em> This states that operations on 
	ordinary variables are not visibly reordered. This is not actually 
	detectable without data races, but it is necessary to ensure that data 
	races, as defined here, and with suitable restrictions on the use of 
	atomics, correspond to data races in a simple interleaved (sequentially 
	consistent) execution. <em>end note</em> ]</p>
</blockquote>
<p>1.10p10 (previously embedded in 1.10p10):</p>
<blockquote class="inserted">
	<p>The <dfn>visible sequence</dfn> of side effects on an atomic object <var>M</var>, with respect 
	to a value computation <var>B</var> of <var>M</var>, is a maximal contiguous 
	sub-sequence of side effects in the modification order of <var>M</var>, where 
	the first operation is visible with respect to <var>B</var>, and <var>B</var> happens before no subsequent 
	operation. The value of an atomic object <var>M</var>, as determined 
	by evaluation <var>B</var>, shall be the value stored by some operation in the 
	visible sequence of <var>M</var> with respect to <var>B</var>. Furthermore, if a value computation
	<var>A</var> of an atomic object <var>M</var> happens before a value computation
	<var>B</var> of <var>M</var>, and the value computed by <var>A</var> corresponds 
	to the value stored by side effect <var>X</var>, then the value computed by
	<var>B</var> shall either equal the value computed by <var>A</var>, or be the value stored by side effect <var>Y</var>, where <var>Y</var> follows
	<var>X</var> in the modification order of <var>M</var>. [<em>Note:</em> 
	This effectively disallows compiler reordering of atomic operations 
	to a single object, even if both operations are &quot;relaxed&quot; loads. By doing so, 
	we effectively make the &quot;cache coherence&quot; guarantee provided by essentially 
	all hardware available to C++ atomic operations. <em>end note</em> ] [<em>Note:</em> 
	The visible sequence depends on the &quot;happens 
	before&quot; relation, which depends on the values observed by loads of atomics, 
	which we are restricting here. The intended reading is that there must exist 
	an association of atomic loads with the modifications they observe that, 
	together with suitably chosen modification orders and the &quot;happens before&quot; 
	relation derived as described above, satisfies the constraints 
	imposed here. <em>end note</em> ]</p>
</blockquote>
<!--
<p>1.10p11 (previously 1.10p10):</p>
<p>What was 1.10p10 has been revised repeatedly, as we have tried to pin down the 
interaction with &quot;modification&quot; order, i.e. what&#39;s normally known as &quot;cache coherence&quot;. 
Note that directly including modification order in &quot;happens before&quot; is too strong. 
To see this, consider (everything again initially zero):</p>
<table>
	<tr>
		<td>Thread1</td>
		<td>Thread2</td>
	</tr>
	<tr>
		<td>x.store_relaxed(1); <br />
		v.store_relaxed(1); <br />
		r1 = y.load_relaxed();</td>
		<td>y.store_relaxed(1); <br />
		v.store_relaxed(2); <br />
		r2 = x.load_relaxed();</td>
	</tr>
</table>
<p>If we had a &quot;happens before&quot; ordering between the two stores to <var>v</var>, 
in either direction, we would preclude <var>r1</var> = <var>r2</var> = 0, which 
could usually only be enforced with a fence.</p>
<p>This version was also altered by the removal of the &quot;precedes&quot; relation. Note 
that the new first clause here may be technically redundant, but I think it is clearer 
to state it explicitly.</p>
<blockquote class="inserted">
	<p>A multi-threaded execution is <dfn>consistent</dfn> if no evaluation happens 
	before itself. [<em>Note:</em> This states essentially that the &quot;happens before&quot; 
	relation consistently orders evaluations. We cannot have <var>A</var> happens 
	before <var>B</var>, and <var>B</var> happens before <var>A</var>, since that 
	would imply <var>A</var> happens before <var>A</var>. <em>end note</em> ] </p>
</blockquote>
-->
<p>1.10p11:</p>
<blockquote class="inserted">
	<p>The execution of a program contains a <dfn>data race</dfn> if it contains two 
	conflicting actions in different threads, at least one of which is not atomic, 
	and neither happens before the other. Any such data race results in undefined 
	behavior. <!-- A multi-threaded program that does not allow a data race for the given 
	inputs exhibits the behavior of a consistent execution. --> [<em>Note:</em> It can 
	be shown that programs that correctly use simple locks to prevent all data races, 
	and use no other synchronization operations, behave as though the executions 
	of their constituent threads were simply interleaved, with each observed value 
	of an object being the last value assigned in that interleaving. This is normally 
	referred to as &quot;sequential consistency&quot;. However, this applies only to race-free 
	programs, and race-free programs cannot observe most program transformations 
	that do not change single-threaded program semantics. In fact, most single-threaded 
	program transformations continue to be allowed, since any program that behaves 
	differently as a result must perform an undefined operation. <em>end note</em> 
	]</p>
</blockquote>
<p>1.10p12:</p>
<blockquote class="inserted">
	<p>[<em>Note:</em> Compiler transformations that introduce assignments to a 
	potentially shared memory location that would not be modified by the abstract 
	machine are generally precluded by this standard, since such an assignment might 
	overwrite another assignment by a different thread in cases in which an abstract 
	machine execution would not have encountered a data race. This includes implementations 
	of data member assignment that overwrite adjacent members in separate memory 
	locations. We also generally preclude reordering of atomic loads in cases in 
	which the atomics in question may alias, since this may violate the last clause 
	of 1.10p10. <em>end note</em> ]</p>
</blockquote>
<p>1.10p13:</p>
<blockquote class="inserted">
	<p>[<em>Note:</em> Transformations that introduce a speculative read of a shared 
	variable may not preserve the semantics of the C++ program as defined in this 
	standard, since they potentially introduce a data race. However, they are typically 
	valid in the context of an optimizing compiler that targets a specific machine 
	with well-defined semantics for data races. They would be invalid for a hypothetical 
	machine that is not tolerant of races or provides hardware race detection. <em>end 
	note</em> ]</p>
</blockquote>
<h2><a id="loops">Nonterminating loops</a></h2>
<p>It is generally felt that it is important to allow the transformation of potentially 
nonterminating loops (e.g. by merging two loops that iterate over the same potentially 
infinite set, or by eliminating a side-effect-free loop), even when that may not 
otherwise be justified in the case in which the first loop never terminates.</p>
<p>Existing compilers commonly assume that code immediately following a loop is 
executed if and only if code immediately preceding a loop is executed. This assumption 
is clearly invalid if the loop fails to terminate. Even if we wanted to prohibit 
this behavior, it is unclear that all relevant compilers could comply in a reasonable 
amount of time. The assumption appears both pervasive and hard to test for.</p>
<p>The treatment of nonterminating loops in the current standard is very unclear. 
We believe that some implementations already eliminate potentially nonterminating, 
side-effect-free, loops, probably based on 1.9p9, which appears to impose very weak 
requirements on conforming implementations for nonterminating programs. We had previously 
arrived at a tentative conclusion that nonterminating loops were already sufficiently 
weakly specified that no changes were needed. We no longer believe this, for the 
following reasons:</p>
<ul>
	<li>On closer inspection, it is at best unclear that this reasoning would continue 
	to apply in a world in which the program may terminate even if one of the threads 
	does not.</li>
	<li>In the presence of threads, the elimination of certain side-effect-free 
	potentially infinite loops (e.g. <code>while (!please_self_destruct.load_acquire()) 
	{}; self_destruct()</code>) is clearly hazardous, and a bit more clarity seems 
	appropriate.</li>
</ul>
<p>Hence we propose the following addition:</p>
<p>6.5p5:</p>
<blockquote class="inserted">
	<p>A nonterminating loop that, outside of the <var>for-init-statement</var> 
	in the case of a for statement,</p>
	<ul>
		<li>performs no I/O operations, and</li>
		<li>does not access or modify volatile objects, and</li>
		<li>performs no synchronization or atomic operations</li>
	</ul>
	<p>invokes undefined behavior. [<em>Note:</em> This is meant to allow compiler 
	transformations, such as removal of empty loops, even when termination cannot 
	be proven. <em>end note</em>]</p>
</blockquote>
<p>We had previously discussed limiting &quot;undefined&quot; behavior to certain optimizations. 
But it is unclear how to do that in a way that would leave any programs able to 
usefully take advantage of such a statement.</p>
<p>This formulation does have the advantage that it makes it possible to painlessly 
write nonterminating loops that <em>cannot</em> be eliminated by the compiler, even 
for single-threaded programs.</p>
<h2><a id="exceptions">Treatment of uncaught exceptions</a></h2>
<p>15.3p9:</p>
<p>[Beman Dawes&#39; suggestion, reflecting an earlier discussion:] Change &quot;a program&quot; 
to &quot;the current thread of execution&quot; in</p>
<blockquote>
	<p>If no matching handler is found in <span class="deleted">a program</span>
	<span class="inserted">the current thread of execution</span>, the function 
	std::terminate() is called; whether or not the stack is unwound before this 
	call to std::terminate() is implementation-defined (15.5.1).&quot;</p>
</blockquote>

</body>

</html>
