<html>

<head>
  <meta name="description" content="Proposal for C++ Standard">
  <meta name="keywords" content="C++,cplusplus,wg21,C++20,parallelism,wavefront,vectorization">
  <meta name="author" content="Alisdair Meredith">

  <title>Target Vectorization Policies from Parallelism V2 TS to C++20</title>

  <style type="text/css">
    ins {background-color:#A0FFA0}
    del {background-color:#FFA0A0}
    li {text-align:justify}
    blockquote.note
    {
      background-color:#E0E0E0;
      padding-left: 15px;
      padding-right: 15px;
      padding-top: 1px;
      padding-bottom: 1px;
    }
  </style>
</head>

<body>
<table>
<tr>
  <td align="left">Doc. no.</td>
  <td align="left">P1001R2</td>
</tr>
<tr>
  <td align="left">Date:</td>
  <td align="left">2019-02-22</td>
</tr>
<tr>
  <td align="left">Project:</td>
  <td align="left">Programming Language C++</td>
</tr>
<tr>
  <td align="left">Audience:</td>
  <td align="left">SG1 Parallelism and Concurrency</td>
</tr>
<tr>
  <td align="left"> </td>
  <td align="left">Library Working Group</td>
</tr>
<tr>
  <td align="left">Reply to:</td>
  <td align="left">Alisdair Meredith &lt;<a href="mailto:ameredith1@bloomberg.net">ameredith1@bloomberg.net</a>&gt;</td>
</tr>
<tr>
  <td> </td>
  <td align="left">Pablo Halpern &lt;<a href="mailto:phalpern@halpernwightsoftware.com">phalpern@halpernwightsoftware.com</a>&gt;</td>
</tr>
</table>

<h1>Target Vectorization Policies from Parallelism V2 TS to C++20</h1>

<h2>Table of Contents</h2>
<ol start="0">
<a href="#rev.hist"><li>Revision History</li></a>
  <ul>
  <li><a href="#rev.0">Revision 0</a></li>
  <li><a href="#rev.1">Revision 1</a></li>
  <li><a href="#rev.2">Revision 2</a></li>
  </ul>
<a href="#1.0"><li>Introduction</li></a>
<a href="#2.0"><li>Stating the problem</li></a>
  <ol>
  <a href="#2.1"><li>Conclusion of SG1 Review (Jacksonville 2018)</li></a>
  </ol>
<a href="#3.0"><li>Proposed Solution</li></a>
<a href="#4.0"><li>Other Directions</li></a>
<a href="#5.0"><li>Formal Wording</li></a>
<a href="#6.0"><li>Acknowledgements</li></a>
<a href="#7.0"><li>References</li></a>
</ol>


<h2><a name="rev.hist">Revision History</a></h2>

<h3><a name="rev.0">Revision 0</a></h3>
<p>
Original version of the paper for the 2018 Jacksonville meeting.
</p>

<h3><a name="rev.1">Revision 1</a></h3>
<p>
Revision of the paper following SG1 feedback at the 2018 Jacksonville meeting,
proposing just the unsequenced policy for C++20.
</p>

<h3><a name="rev.2">Revision 2</a></h3>
<p>
Revision of the paper following LWG feedback at the 2018 San Diego meeting.
</p>
<ul>
<li>
Rebased the wording onto
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4800.pdf">N4800</a>.
</li>
<li>
Proposed feature macros for the <tt>&lt;version&gt;</tt> header.
</li>
<li>
Moved normative permissions on using the new policy to [algorithms.parallel.exec].
</li>
<li>
Moved the IS definition of a vector-unsafe function up two subclauses into the
parallel algorithms terms and definitions subclause.
</li>
</ul>


<h2><a name="1.0">1 Introduction</a></h2>
<p>
The Parallelism v2 PDTS has two new vector policies for the parallel
algorithms that might better target C++20 directly.
</p>


<h2><a name="2.0">2 Stating the problem</a></h2>
<p>
We sent the Parallelism V2 working paper to PDTS ballot at the Jacksonville
meeting.  It seems reasonable to conclude that if we want feedback from the TS
process, then anything in that TS vehicle would miss the merge deadline for
C++20.  This paper was written to confirm early that there were no features
more appropriately targeted straight at the IS.
</p>
<p>
Of the (expected) three components to that TS, the Task Blocks feature depends
on the <tt>exception_list</tt> class that is still underspecified for general
use, and is awaiting feedback from such a TS process.
</p>
<p>
The <tt>simd</tt> library type has been going through rapid evolution and has
been subject to much change in the review process, so it seems appropriate to
seek feedback through the TS process.
</p>
<p>
Finally, there are two new vectorization policies for the parallel
algorithms, which rely on the main <tt>&lt;algorithm&gt;</tt> header providing
overloads for the new policies in the non-experimental <tt>std</tt> namespace
in order to be usable.  The algorithms can safely be implemented as reverting
to serial or any other lesser parallel behavior without using the freedoms
granted by the wavefront policies, so implementability is much less of a
concern than QoI.  There is no room in the design space for how we dispatch to
the algorithms to be seeking TS feedback, so it seems reasonable to suggest
that if we want this feature, it would be more appropriate to target the C++20
standard directly, avoiding the awkward interaction with the <tt>std</tt> and
<tt>experimental</tt> namespaces.  We are still early enough in the C++20 cycle
to respond to unexpected implementer feedback, and the current state of the
C++17 parallel algorithms is that most standard library vendors are still
starting or in the middle of their first implementation of the parallel
algorithms, so it would be easier to tackle these extra policies now along with
the rest of that work.
</p>
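<p>
To illustrate why implementability is a minor concern, the following is a
minimal sketch of such a serial fallback.  It is not taken from any shipping
library: the namespace <tt>sketch</tt> and its <tt>unsequenced_policy</tt> tag
are hypothetical stand-ins for the proposed
<tt>std::execution::unsequenced_policy</tt>, so that the sketch does not depend
on a library that already ships the policy.  The overload is declared
<tt>noexcept</tt> so that an exception escaping the element access function
ends in <tt>std::terminate</tt>, as the proposed wording requires.
</p>
<pre>
#include &lt;algorithm&gt;
#include &lt;utility&gt;

namespace sketch {

// Hypothetical stand-in for the proposed std::execution::unsequenced_policy.
struct unsequenced_policy {};
inline constexpr unsequenced_policy unseq{};

// Fallback overload: ignores the freedom to vectorize and runs serially.
// noexcept turns an escaping exception into a call to std::terminate.
template &lt;class ForwardIt, class Function&gt;
void for_each(const unsequenced_policy&amp;, ForwardIt first, ForwardIt last,
              Function f) noexcept {
    std::for_each(first, last, std::move(f));
}

}  // namespace sketch
</pre>
<p>
A library could start from an overload of this shape and later replace the
body with a vectorized loop, which is why the remaining questions are largely
about quality of implementation.
</p>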

<h3><a name="2.1">2.1 Conclusion of SG1 Review (Jacksonville 2018)</a></h3>
<p>
Merging the <tt>unsequenced</tt> policy was entirely non-controversial, and is
recommended for C++20 by SG1.  It fills an important hole in the original
specification.
</p>

<p>
Moving ahead with the vector wavefront policy seemed premature to the group.
There is a desire to see that the new policy is generally useful, and provides
a real optimization opportunity in practice.  There are particular concerns
that it appears useful in only a small subset of the parallel algorithms, and a
more targeted approach for those few special cases might be more appropriate.
</p>


<h2><a name="3.0">3 Proposed Solution</a></h2>
<p>
Merge the wording for the <tt>unsequenced</tt> execution policy into C++20, but
leave it in place in the Parallelism V2 TS as it may be useful as a fallback
policy for algorithms implementing the <tt>vector</tt> execution policy.
</p>


<h2><a name="4.0">4 Other Directions</a></h2>
<p>
The original proposal recommended merging both vectorization policies, and their
associated machinery, into C++20, and removing them from the Parallelism V2 TS.
SG1 rejected this approach as there is a desire to see real benefits from the
implementation of the vector wavefront policy in shipping libraries before
adopting it in the standard.  There was a particular concern that very few existing
algorithms are expected to see a benefit from this policy, other than the new
algorithms specifically added in the TS for that policy.  We may want to revisit
the idea of whether that policy is a generic execution policy like the others,
or whether we add just a few specific overloads explicitly taking this policy
for the few algorithms that are expected to profit.
</p>


<h2><a name="5.0">5 Formal Wording</a></h2>

<blockquote>

<h3>16.3.1 General [support.limits.general]</h3>
<p>
Table 36 &mdash; Standard library feature-test macros
</p>

<table>
  <tr>
    <td><b>Macro name</b></td>
    <td><b>Value</b></td>
    <td><b>Header(s)</b></td>
  </tr>
  <tr>
    <td>...</td>
    <td>...</td>
    <td>...</td>
  </tr>
  <tr>
    <td><tt>__cpp_lib_execution</tt></td>
    <td><del>201603L</del><ins>201903L</ins></td>
    <td><tt>&lt;execution&gt;</tt></td>
  </tr>
  <tr>
    <td>...</td>
    <td>...</td>
    <td>...</td>
  </tr>
</table>

</blockquote>

<blockquote class="note">
The value of the macro is a suggestion; the project editor should pick a
final value inspired by the date the edits are applied.
</blockquote>
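<p>
As an illustration of how user code might consume the updated macro, the
following sketch selects the new policy only when the library advertises it.
It assumes C++17 or later; the function name <tt>double_all</tt> is
hypothetical.  Because the final macro value is left to the project editor, the
check tests for any value greater than the existing <tt>201603L</tt> instead of
a specific new value.
</p>
<pre>
#include &lt;algorithm&gt;
#include &lt;execution&gt;
#include &lt;vector&gt;

// Doubles every element, using unseq only when the library advertises it.
inline void double_all(std::vector&lt;int&gt;&amp; v) {
    auto twice = [](int&amp; x) { x *= 2; };
#if defined(__cpp_lib_execution) &amp;&amp; __cpp_lib_execution &gt; 201603L
    // Library reports a value newer than the original parallel algorithms.
    std::for_each(std::execution::unseq, v.begin(), v.end(), twice);
#else
    // Older library: fall back to the plain serial overload.
    std::for_each(v.begin(), v.end(), twice);
#endif
}
</pre>
<p>
Both branches produce the same result; only the freedom granted to the
implementation differs.
</p>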

<blockquote>

<h3>19.18.2 Header <tt>&lt;execution&gt;</tt> synopsis [execution.syn]</h3>
<pre>
namespace std {
  <i>// 19.18.3, execution policy type trait</i>
  template&lt;class T&gt; struct is_execution_policy;
  template&lt;class T&gt; inline constexpr bool is_execution_policy_v = is_execution_policy&lt;T&gt;::value;
}

namespace std::execution {
  <i>// 19.18.4, sequenced execution policy</i>
  class sequenced_policy;

  <i>// 19.18.5, parallel execution policy</i>
  class parallel_policy;

  <i>// 19.18.6, parallel and unsequenced execution policy</i>
  class parallel_unsequenced_policy;

  <ins><i>// 19.18.7, unsequenced execution policy</i></ins>
  <ins>class unsequenced_policy;</ins>

  <i>// 19.18.<del>7</del><ins>8</ins>, execution policy objects</i>
  inline constexpr sequenced_policy seq{ <i>unspecified</i> };
  inline constexpr parallel_policy par{ <i>unspecified</i> };
  inline constexpr parallel_unsequenced_policy par_unseq{ <i>unspecified</i> };
  <ins>inline constexpr unsequenced_policy unseq{ <i>unspecified</i> };</ins>
}
</pre>

<h3><ins>19.18.7 Unsequenced execution policy [parallel.execpol.unseq]</ins></h3>
<pre>
<ins>class unsequenced_policy{ <i>unspecified</i> };</ins>
</pre>

<ol>
<li><ins>
The class <tt>unsequenced_policy</tt> is an execution policy type used as a
unique type to disambiguate parallel algorithm overloading and indicate that a
parallel algorithm's execution may be vectorized, e.g., executed on a single
thread using instructions that operate on multiple data items.
</ins></li>

<li><ins>
During the execution of a parallel algorithm with the
<tt>execution::unsequenced_policy</tt> policy, if the invocation of an element
access function exits via an uncaught exception, <tt>terminate()</tt> shall be
called.
</ins></li>
</ol>


<h3>19.18.<del>7</del><ins>8</ins> Execution policy objects [execpol.objects]</h3>
<pre>
inline constexpr execution::sequenced_policy            execution::seq{ <i>unspecified</i> };
inline constexpr execution::parallel_policy             execution::par{ <i>unspecified</i> };
inline constexpr execution::parallel_unsequenced_policy execution::par_unseq{ <i>unspecified</i> };
<ins>inline constexpr execution::unsequenced_policy          execution::unseq{ <i>unspecified</i> };</ins>
</pre>
<ol>
<li>
The header <tt>&lt;execution&gt;</tt> declares global objects associated with each type of execution policy.
</li>
</ol>


<h3>24.3.1 Terms and definitions [algorithms.parallel.defns]</h3>
<ol start="3">
<li><ins>
A standard library function is <i>vectorization-unsafe</i> if it is specified
to synchronize with another function invocation, or another function invocation
is specified to synchronize with it, and if it is not a memory allocation or
deallocation function.  [<i>Note:</i> Implementations must ensure that internal
synchronization inside standard library functions does not prevent forward
progress when those functions are executed by threads of execution with weakly
parallel forward progress guarantees.  <i>&mdash; end note</i>] [<i>Example:</i>
<blockquote><pre>
int x = 0;
std::mutex m;
int a[] = {1,2};
std::for_each(std::execution::par_unseq, std::begin(a), std::end(a), [&amp;](int) {
  std::lock_guard&lt;mutex&gt; guard(m); <i>// incorrect: lock_guard constructor calls m.lock()</i>
  ++x;
});
</pre></blockquote>
The above program may result in two consecutive calls to <tt>m.lock()</tt> on
the same thread of execution (which may deadlock), because the applications of
the function object are not guaranteed to run on different threads of
execution.  <i>&mdash; end example</i>]
</ins></li>
</ol>

<h3>24.3.3 Effect of execution policies on algorithm execution [algorithms.parallel.exec]</h3>
<ol start="4">

<li>
The invocations of element access functions in parallel algorithms invoked with
an execution policy object of type <tt>execution::sequenced_policy</tt> all
occur in the calling thread of execution. [<i>Note:</i> The invocations are not
interleaved; see 6.8.1. <i>&mdash; end note</i>] </li>

<li><ins>
The invocations of element access functions in parallel algorithms invoked with
an execution policy object of type <tt>execution::unsequenced_policy</tt> are
permitted to execute in an unordered fashion in the calling thread of
execution, unsequenced with respect to one another in the calling thread of
execution. [<i>Note:</i> This means that multiple function object invocations
may be interleaved on a single thread of execution, which overrides the usual
guarantee from 6.8.1 [intro.execution] that function executions do not overlap
with one another.  <i>&mdash; end note</i>] Since
<tt>execution::unsequenced_policy</tt> allows the execution of element access
functions to be interleaved on a single thread of execution, blocking
synchronization, including the use of mutexes, risks deadlock. Thus, the
synchronization with <tt>execution::unsequenced_policy</tt> is restricted as
follows: vectorization-unsafe standard library functions may not be invoked by
user code called from <tt>execution::unsequenced_policy</tt> algorithms.
</ins></li>

<li>
The invocations of element access functions in parallel algorithms invoked with
an execution policy object of type <tt>execution::parallel_policy</tt> are
permitted to ...
</li>

<li>
The invocations of element access functions in parallel algorithms invoked with
an execution policy <ins>object</ins> of type <tt>execution::parallel_unsequenced_policy</tt> are
permitted to execute in an unordered fashion in unspecified threads of
execution, and unsequenced with respect to one another within each thread of
execution. These threads of execution are either the invoking thread of
execution or threads of execution implicitly created by the library; the latter
will provide weakly parallel forward progress guarantees. [<i>Note:</i> This
means that multiple function object invocations may be interleaved on a single
thread of execution, which overrides the usual guarantee from 6.8.1 that
function executions do not <del>interleave</del><ins>overlap</ins> with one
another. <i>&mdash; end note</i>] Since
<tt>execution::parallel_unsequenced_policy</tt> allows the execution of element
access functions to be interleaved on a single thread of execution, blocking
synchronization, including the use of mutexes, risks deadlock. Thus, the
synchronization with <tt>execution::parallel_unsequenced_policy</tt> is
restricted as follows: <del>A standard library function is
<i>vectorization-unsafe</i> if it is specified to synchronize with another
function invocation, or another function invocation is specified to synchronize
with it, and if it is not a memory allocation or deallocation function.</del>
<del>V</del><ins>v</ins>ectorization-unsafe standard library functions may not be invoked by user code
called from <tt>execution::parallel_unsequenced_policy</tt> algorithms.
<del>[<i>Note:</i> Implementations must ensure that internal synchronization
inside standard library functions does not prevent forward progress when those
functions are executed by threads of execution with weakly parallel forward
progress guarantees.  <i>&mdash; end note</i>] [<i>Example:</i>
<blockquote><pre>
int x = 0;
std::mutex m;
int a[] = {1,2};
std::for_each(std::execution::par_unseq, std::begin(a), std::end(a), [&amp;](int) {
  std::lock_guard&lt;mutex&gt; guard(m); <i>// incorrect: lock_guard constructor calls m.lock()</i>
  ++x;
});
</pre></blockquote>
The above program may result in two consecutive calls to <tt>m.lock()</tt> on
the same thread of execution (which may deadlock), because the applications of
the function object are not guaranteed to run on different threads of
execution.  <i>&mdash; end example</i>] [<i>Note:</i> The semantics of the
<tt>execution::parallel_policy</tt> or the
<tt>execution::parallel_unsequenced_policy</tt> invocation allow the
implementation to fall back to sequential execution if the system cannot
parallelize an algorithm invocation due to lack of resources. <i>&mdash; end
note</i>]</del>
</li>

<li><ins>
[<i>Note:</i> The semantics of invocation with
<tt>execution::unsequenced_policy</tt>, <tt>execution::parallel_policy</tt>, or
<tt>execution::parallel_unsequenced_policy</tt> allow the implementation to
fall back to sequential execution if the system cannot parallelize an algorithm
invocation, e.g., due to lack of resources. <i>&mdash; end note</i>]
</ins></li>

<li>
If an invocation of a parallel algorithm uses threads of execution implicitly
created by the library, then the invoking thread of execution will either ...
</li>

</ol>


</blockquote>
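<p>
The following sketch is not part of the proposed wording.  It shows one
possible way to restructure the counting example above so that the element
access function calls no vectorization-unsafe function: the count is expressed
as a reduction, so there is no shared counter and no lock.  The function name
<tt>count_elements</tt> is hypothetical, C++17 or later is assumed, and the
macro check assumes that a library providing <tt>execution::unseq</tt> reports
a <tt>__cpp_lib_execution</tt> value greater than <tt>201603L</tt>.
</p>
<pre>
#include &lt;execution&gt;
#include &lt;functional&gt;
#include &lt;numeric&gt;
#include &lt;vector&gt;

// Counts the elements of a without any shared mutable state.
inline long count_elements(const std::vector&lt;int&gt;&amp; a) {
    auto one = [](int) { return 1L; };  // each element contributes 1
#if defined(__cpp_lib_execution) &amp;&amp; __cpp_lib_execution &gt; 201603L
    return std::transform_reduce(std::execution::unseq, a.begin(), a.end(),
                                 0L, std::plus&lt;&gt;{}, one);
#else
    // Older library: the serial overload computes the same value.
    return std::transform_reduce(a.begin(), a.end(), 0L, std::plus&lt;&gt;{}, one);
#endif
}
</pre>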

<h2><a name="6.0">6 Acknowledgements</a></h2>
<p>
Thanks to all the troublemakers in Jacksonville who persuaded me there was time
to write the original late paper!  Thanks to SG1 for finding the time to review
it, and provide feedback for a more appropriate timetable.
</p>


<h2><a name="7.0">7 References</a></h2>
<ul>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/n4725">N4725</a> Working Draft, Technical Specification for C++ Extensions for Parallelism Version 2, Jared Hoberock</li>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0075r2">P0075R2</a> Template Library for Parallel For Loops, Pablo Halpern, Clark Nelson, Arch D. Robison, Robert Geva</li>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0076r4">P0076R4</a> Vector and Wavefront Policies, Arch Robison, Pablo Halpern, Robert Geva, Clark Nelson, Jens Maurer</li>
</ul>


</body>
</html>
