<html>

<head>
  <meta name="description" content="Proposal for C++ Standard">
  <meta name="keywords" content="C++,cplusplus,wg21,C++20,parallism,wavefront,vectorization">
  <meta name="author" content="Alisdair Meredith">

  <title>Target Vectorization Policies from Parallelism V2 TS to C++20</title>

  <style type="text/css">
    ins {background-color:#A0FFA0}
    del {background-color:#FFA0A0}
    li {text-align:justify}
    blockquote.note
    {
      background-color:#E0E0E0;
      padding-left: 15px;
      padding-right: 15px;
      padding-top: 1px;
      padding-bottom: 1px;
    }
  </style>
</head>

<body>
<table>
<tr>
  <td align="left">Doc. no.</td>
  <td align="left">P10010R1</td>
</tr>
<tr>
  <td align="left">Date:</td>
  <td align="left">2018-03-16</td>
</tr>
<tr>
  <td align="left">Project:</td>
  <td align="left">Programming Language C++</td>
</tr>
<tr>
  <td align="left">Audience:</td>
  <td align="left">SG1 Parallelism and Concurrency</td>
</tr>
<tr>
  <td align="left"> </td>
  <td align="left">Library Evolution Working Group</td>
</tr>
<tr>
  <td align="left">Reply to:</td>
  <td align="left">Alisdair Meredith &lt;<a href="mailto:ameredith1@bloomberg.net">ameredith1@bloomberg.net</a>&gt;</td>
</tr>
</table>

<h1>Target Vectorization Policies from Parallelism V2 TS to C++20</h1>

<h2>Table of Contents</h2>
<ol>
<li><a href="#rev.hist">Revision History</a></li>
  <ul>
  <li><a href="#rev.0">Revision 0</a></li>
  <li><a href="#rev.1">Revision 1</a></li>
  </ul>
<li><a href="#1.0">1 Introduction</a></li>
<li><a href="#2.0">2 Stating the problem</a></li>
  <ul>
  <li><a href="#2.1">2.1 Conclusion of SG1 Review (Jacksonville 2018</a></li>
  </ul>
<li><a href="#3.0">3 Propose Solution</a></li>
<li><a href="#4.0">4 Other Directions</a></li>
<li><a href="#5.0">5 Formal Wording</a></li>
<li><a href="#6.0">6 Acknowledgements</a></li>
<li><a href="#7.0">7 References</a></li>
</ol>


<h2><a name="rev.hist">Revision History</a></h2>

<h3><a name="rev.0">Revision 0</a></h3>
<p>
Original version of the paper for the 2018 Jacksonville meeting.
</p>

<h3><a name="rev.1">Revision 1</a></h3>
<p>
Revision of the paper following SG1 feedback at the 2018 Jacksonville meeting,
proposing just the unsquenced policy for C++20.
</p>


<h2><a name="1.0">1 Introduction</a></h2>
<p>
The Parallelism v2 PDTS has two new vector policies for the parallel
algorithms that might better target C++20 directly.
</p>


<h2><a name="2.0">2 Stating the problem</a></h2>
<p>
We plan to send the Parallelism V2 working paper to PDTS ballot at the
Jacksonville meeting.  It seems reasonable to conclude that if we want feedback
from the TS process, then anything in that TS vehicle will miss the merge
deadline for C++20.
</p>
<p>
Of the (expected) three components to that TS, the Task Blocks feature depends
on the <tt>exception_list</tt> class that is still underspecified for general
use, and waiting feedback from such a TS process.
<p>
</p>
The <tt>simd</tt> library type has been going through rapid evolution and been
subject to much change in the review process, so seems appropriate to seek
feedback through the TS process.
</p>
<p>
Finally, there are two new vectorization policies for the parallel
algorithms, which rely on the main <tt>&lt;algorithm&gt;</tt> header providing
overloads for the new policies in the non-experimental <tt>std</tt> namespace
in order to be usable.  The algorithms can safely be implemented as reverting
to serial or any other lesser parallel behavior without using the freedoms
granted by the wavefront policies, so implementability is much less of a
concern than QoI.  There is no room in the design space for how we dispatch to
the algorithms to be seeking TS feedback, so it is seems reasonable to suggest
that if we want this feature, it would be more appropriate to target the C++20
standard directly, avoiding the awkward interaction with the <tt>std</tt> and
<tt>experimental</tt> namespaces.  We are still early enough in the C++20 cycle
to respond to unexpected implementer feedback, and the current state of the
C++17 parallel algorithms is that most standard library vendors are still
starting or in the middle of their first implementation of the parallel
algorithms, so it would be easier to tackle these extra policies now along with
the rest of that work.
</p>

<h3><a name="2.1">2.1 Conclusion of SG1 Review (Jacksonville 2018</a></h3>
<p>
Merging the <tt>unsequenced</tt> policy was entirely non-controversial, and is
recommended for C++20 by SG1.  It fills an important missing hole in the original
specifciation.
</p>

<p>
Moving ahead with the vector wavefront policy seemed premature to the group.
There is a desire to see that the new policy is generally useful, and provides
a real optimization opportunity in practice.  There are particular concerns
that it appears useful in only a small subset of the parallel algorithms, and a
more targeted approach for those few special cases might be more appropriate.
</p>


<h2><a name="3.0">3 Propose Solution</a></h2>
<p>
Merge the wording for the <tt>unsequenced</tt> execution policy into C++20, but
leave it in place in the Parallelism V2 TS as it may be useful as a fallback
policy for algorirthms implementing the <tt>vector</tt> executiuon policy.
</p>


<h2><a name="4.0">4 Other Directions</a></h2>
<p>
The original proposal recommended merging both vectorization policies, and their
associated machinery, into C++20, and removing them from the Parallelism V2 TS.
SG1 rejected this approach as there is a desire to see real benefits from the
implementation of the vector wavefront policy in shipping libraries before
adopting in the standard.  There was a particular concern that very few existing
algorithms are expected to see a benefit from this policy, other than the new
algorithms specifically added in the TS for that policy.  We may want to revisit
the idea of whether that policy is a generic execution policy like the others,
or whether we add just a few specific overloads explicitly taking this policy
for the few algorithms that are expected to profit.
</p>


<h2><a name="5.0">5 Formal Wording</a></h2>

<blockquote>

<h3>23.19.2 Header <execution> synopsis [execution.syn]</h3>
<pre>
namespace std {
  <i>// 23.19.3, execution policy type trait</i>
  template&lt;class T&gt; struct is_execution_policy;
  template&lt;class T&gt; inline constexpr bool is_execution_policy_v = is_execution_policy&lt;T&gt;::value;
}

namespace std::execution {
  <i>// 23.19.4, sequenced execution policy</i>
  class sequenced_policy;

  <i>// 23.19.5, parallel execution policy</i>
  class parallel_policy;

  <i>// 23.19.6, parallel and unsequenced execution policy</i>
  class parallel_unsequenced_policy;

  <i><ins>// 23.19.7, unsequenced execution policy</i></ins>
  <ins>class unsequenced_policy;</ins>

  <i>// 23.19.<del>7</del><ins>8</ins>, execution policy objects</i>
  inline constexpr sequenced_policy seq{ <i>unspecified</i> };
  inline constexpr parallel_policy par{ <i>unspecified</i> };
  inline constexpr parallel_unsequenced_policy par_unseq{ <i>unspecified</i> };
  <ins>inline constexpr unsequenced_policy unseq{ <i>unspecified</i> };</ins>
}
</pre>

<h3><ins>23.19.7 Unsequenced execution policy [parallel.execpol.unseq]</ins></h3>
<pre>
<ins>class unsequenced_policy{ <i>unspecified</i> };</ins>
</pre>

<ol>
<li><ins>
The class <tt>unsequenced_policy</tt> is an execution policy type used as a
unique type to disambiguate parallel algorithm overloading and indicate that a
parallel algorithm's execution may be vectorized, e.g., executed on a single
thread using instructions that operate on multiple data items.
</ins></li>

<li><ins>
The invocations of element access functions in parallel algorithms invoked with
an execution policy of type <tt>unsequenced_policy</tt> are permitted to
execute in an unordered fashion in the calling thread, unsequenced with respect
to one another within the calling thread. [ <i>Note:</i> This means that
multiple function object invocations may be interleaved on a single thread.
&mdash; <i>end note</i> ]
</ins></li>

<li><ins>
[ <i>Note:</i> This overrides the usual guarantee from 4.6 [intro.execution] that
function executions do not overlap with one another. &mdash; <i>end note</i> ]
</ins></li>

<li><ins>
During the execution of a parallel algorithm with the
<tt>execution::unsequenced_policy</tt> policy, if the invocation of an element
access function exits via an uncaught exception, <tt>terminate()</tt> shall be
called.
</ins></li>
</ol>


<h3>23.19.<del>7</del><ins>8</ins> Execution policy objects [execpol.objects]</h3>
<pre>
inline constexpr execution::sequenced_policy            execution::seq{ <i>unspecified</i> };
inline constexpr execution::parallel_policy             execution::par{ <i>unspecified</i> };
inline constexpr execution::parallel_unsequenced_policy execution::par_unseq{ <i>unspecified</i> };
<ins>inline constexpr execution::unsequenced_policy          execution::unseq{ <i>unspecified</i> };</ins>
</pre>
<ol>
<li>
The header <tt>&lt;execution&gt;</tt> declares global objects associated with each type of execution policy.
</ol>
</blockquote>

<h2><a name="6.0">6 Acknowledgements</h2>
<p>
Thanks to all the troublemakers in Jacksonville who persuaded me there was time
to write this late paper!  Thanks to SG1 for finding the time to review it, and
provide feedback for a more appropriate timetable :)
</p>


<h2><a name="7.0">7 References</h2>
<ul>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/n4725">N4725</a> Working Draft, Technical Specification for C++ Extensions for Parallelism Version 2, Jared Hoberock</li>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0075r2">P0075R2</a> Template Library for Parallel For Loops, Pablo Halpern, Clark Nelson, Arch D. Robison, Robert Geva</li>
  <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0076r4">P0076R4</a> Vector and Wavefront Policies, Arch Robison, Pablo Halpern, Robert Geva, Clark Nelson, Jens Maurer</li>
</ul>


</body>
</html>
