<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link href='https://fonts.googleapis.com/css?family=Roboto' rel='stylesheet' type='text/css'>
    <link href='https://fonts.googleapis.com/css?family=Inconsolata:bold' rel='stylesheet' type='text/css'>
    <style>
      body {
        font-family: 'Roboto', sans-serif;
      }
      .body {
        margin: 0 auto;
        max-width: 80em;
      }
      a {
        color: #455A64;
      }
      h1, h2, h3, h4, h5, h6 {
        color: #37474F;
      }
      h1 a, h2 a, h3 a, h4 a, h5 a, h6 a {
        padding-left: 1em;
        padding-right: 3em;
        text-decoration: none;
        opacity: 0;
      }
      a.headerlink:hover {
        opacity: 1;
      }
      .field-name {
        text-align: right;
        padding-right: 1em;
      }
      tt, .highlight {
        color: #263238;
        background-color: #ECEFF1;
        font-family: 'Inconsolata', monospace;
        font-weight: bold;
      }
      tt {
        padding: 0em 0.5em;
      }
      .highlight {
        margin: 0em 1em;
        padding: 0.1em 1em;
      }
    </style>
    <title>P0193R1 Where is Vectorization in C++‽</title> 
  </head>
  <body>
    <div class="body">

  <div class="section" id="p0193r1-where-is-vectorization-in-c">
<h1>P0193R1 Where is Vectorization in C++‽<a class="headerlink" href="#p0193r1-where-is-vectorization-in-c" title="Permalink to this headline">¶</a></h1>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Author:</th><td class="field-body">JF Bastien</td>
</tr>
<tr class="field-even field"><th class="field-name">Contact:</th><td class="field-body"><a class="reference external" href="mailto:jfb&#37;&#52;&#48;google&#46;com">jfb<span>&#64;</span>google<span>&#46;</span>com</a></td>
</tr>
<tr class="field-odd field"><th class="field-name">Author:</th><td class="field-body">Hans Boehm</td>
</tr>
<tr class="field-even field"><th class="field-name">Contact:</th><td class="field-body"><a class="reference external" href="mailto:hboehm&#37;&#52;&#48;google&#46;com">hboehm<span>&#64;</span>google<span>&#46;</span>com</a></td>
</tr>
<tr class="field-odd field"><th class="field-name">Audience:</th><td class="field-body">SG1, SG14</td>
</tr>
<tr class="field-even field"><th class="field-name">Date:</th><td class="field-body">2016-03-20</td>
</tr>
<tr class="field-odd field"><th class="field-name">URL:</th><td class="field-body"><a class="reference external" href="https://github.com/jfbastien/papers/blob/master/source/P0193R1.rst">https://github.com/jfbastien/papers/blob/master/source/P0193R1.rst</a></td>
</tr>
</tbody>
</table>
<p>The C++ Standards Committee&#8217;s Concurrency and Parallelism sub-group (WG21/SG1)
has considered multiple approaches to user-aided vector programming and
Single-Instruction Multiple-Data (SIMD) computations. The Committee&#8217;s Game
Development &amp; Low Latency sub-group (WG21/SG14) has expressed interest in
participating in this design as well. This paper documents the state of SG1&#8217;s
exploration with the goal of moving the Committee towards a Technical
Specification in the near future.</p>
<p>This paper doesn&#8217;t deal with auto-vectorization in any form.</p>
<div class="section" id="design-approach">
<h2>Design Approach<a class="headerlink" href="#design-approach" title="Permalink to this headline">¶</a></h2>
<p>SG1 has preferred library-only solutions for user-aided vector programming. These
changes often integrate into the language more cleanly because they don&#8217;t
interact with the rest of the language and don&#8217;t introduce onerous
single-purpose syntax. This also allows the library implementation to use
vendor-specific extensions in a manner that is seamless to the developer, giving
leeway to implementations and allowing them to change their extensions over time
without needing to standardize a particular implementation approach. When
<em>implementors</em> peek under the hood the solution isn&#8217;t library-only, but the
Standard presents C++ <em>users</em> with what appears as a library-only solution.</p>
<p>A similar design approach was taken for the <tt class="docutils literal"><span class="pre">&lt;atomic&gt;</span></tt> library (as was the
original motivation for a library-only design for coroutines). In the
<tt class="docutils literal"><span class="pre">&lt;atomic&gt;</span></tt> case, interoperability with C was a strong design consideration,
allowing <tt class="docutils literal"><span class="pre">std::atomic</span></tt> to be implemented in terms of <tt class="docutils literal"><span class="pre">_Atomic</span></tt> or vice
versa, and allowing a program to use both correctly (despite <tt class="docutils literal"><span class="pre">_Atomic</span></tt> being
added to C after <tt class="docutils literal"><span class="pre">std::atomic</span></tt> was added to C++). A library approach allowed
the Standard to reason in terms of objects and their lifetime instead of C&#8217;s
more error-prone per-access specification where each access has to be consistent
lest it cause undefined behavior. It also allows implementations to specify
storage for <tt class="docutils literal"><span class="pre">std::atomic&lt;T&gt;</span></tt> which differs from <tt class="docutils literal"><span class="pre">T</span></tt>&#8217;s storage, and which may
have a different ABI. These library types further enable developers to use
common generic programming patterns. We&#8217;ve found that some generic programming
facilities were missing, such as <a class="reference external" href="http://wg21.link/n4509">constexpr atomic&lt;T&gt;::is_always_lock_free</a>,
and the library&#8217;s design has made their addition simple.</p>
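<p>As a small illustration of the generic-programming point above, the following sketch uses <tt class="docutils literal"><span class="pre">std::atomic&lt;T&gt;::is_always_lock_free</span></tt> (available since C++17) to select an implementation at compile time. The <tt class="docutils literal"><span class="pre">Counter</span></tt> class template, its members, and <tt class="docutils literal"><span class="pre">check_counter</span></tt> are hypothetical names introduced only for this example.</p>

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical counter (not from any paper): uses std::atomic<T> when the
// type is guaranteed lock-free, and a mutex-protected T otherwise.
// Requires C++17 for is_always_lock_free.
template <typename T, bool LockFree = std::atomic<T>::is_always_lock_free>
class Counter;

template <typename T>
class Counter<T, true> {
 public:
  static constexpr bool uses_atomic = true;
  T increment() { return value_.fetch_add(1) + 1; }

 private:
  std::atomic<T> value_{0};
};

template <typename T>
class Counter<T, false> {
 public:
  static constexpr bool uses_atomic = false;
  T increment() {
    std::lock_guard<std::mutex> lock(mutex_);
    return ++value_;
  }

 private:
  std::mutex mutex_;
  T value_{0};
};

// Self-check: whichever specialization is selected, counting works the same.
inline void check_counter() {
  Counter<int> c;
  assert(c.increment() == 1);
  assert(c.increment() == 2);
}
```

<p>Because the choice is made through a template parameter, callers write <tt class="docutils literal"><span class="pre">Counter&lt;int&gt;</span></tt> and never see which path was taken.</p>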
<p>The Committee in general has a strong leaning towards proposals which
standardize existing practice. Implementation experience and usage experience
are key. For vector programming, SG1 wants a Technical Specification which has
at least one solid implementation, and would rather standardize something with
multiple implementations and sustained usage. In the case of vector programming,
platform portability and vendor buy-in matter much more than for most other C++
language or library features: the features are intended to work well on a
variety of architectures with different fixed-width SIMD widths, with
variable-width capabilities, and to degrade gracefully to scalar operation on
architectures without vector capabilities.</p>
<p>The solution should work well for developers who target:</p>
<ol class="arabic simple">
<li>Multiple different ISAs with the same code base.</li>
<li>Multiple architectural implementations of the same ISA, sometimes supporting
wider vectors in newer revisions.</li>
<li>A single architectural implementation (such as HPC or consoles) to obtain
better performance by using their compiler&#8217;s tuning flags such as <tt class="docutils literal"><span class="pre">-march</span></tt>.</li>
<li>Separate compilation of different source files, with the ability to provide
vector code in libraries.</li>
</ol>
<p>Vector programming requiring specific code-generation techniques such as
Just-in-Time code generation would be a very unusual change for
C++. Ahead-of-Time compilation is also frequently used, which doesn&#8217;t preclude
implementations from using other approaches if they desire.</p>
<p>Some in the committee believe that the amount of code generated by a compiler
should not explode when new features are used, whereas others find the size
trade-off acceptable. Requiring functions to be compiled multiple times is a
no-go for some developers, especially if other language or library features
interact in a combinatorial manner. One design point SG1 has considered is
giving the developer control over the code size tradeoffs, as is possible with
OpenMP.</p>
<p>Existing practice takes two approaches to user-aided vector programming:</p>
<ol class="arabic simple">
<li>High-level library focusing on APIs and their operations on data ranges.</li>
<li>Low-level vector types and operations on <tt class="docutils literal"><span class="pre">N</span></tt> scalars.</li>
</ol>
<p>The former is typically easy to use while sometimes achieving optimal
speedups. The latter is often harder to use but grants developers more control,
can provide significant speedups, and is usable to build some higher-level
libraries. SG1 is comfortable moving forward with one Technical Specification in
each category.</p>
</div>
<div class="section" id="open-design-areas">
<h2>Open Design Areas<a class="headerlink" href="#open-design-areas" title="Permalink to this headline">¶</a></h2>
<p>Vectorization is inherently re-association and therefore interacts with
floating-point results. Beyond re-association, the solutions which SG1 has
explored have not exposed architectural differences in the result&#8217;s precision,
yet some applications benefit from significant speedups when some precision is
abandoned. It is usually up to the developer to figure out whether their
application remains correct with the loss of precision. It is unclear whether
SG1 is comfortable exploring such vector programming opportunities at this point
in time. For example:</p>
<ul class="simple">
<li>Image crossfade can be accelerated without loss of precision by using some
architecture-specific instructions, or with some loss of precision by dropping
the least-significant bits when blending.</li>
<li>Some architectures provide vector floating-point reciprocal and reciprocal
square root approximation instructions, whereas in other architectures fast
floating-point inverse square root can be obtained using integer multiply by a
constant.</li>
<li>Some architectures provide vectorized 16-bit floating-point, in some cases
simply as a storage type whereas in other cases most arithmetic is supported
on half-precision floating-point.</li>
<li>Some architectures support Fused-Multiply-Add which performs one less rounding
than the unfused equivalent, and is sometimes faster or less power-demanding
than the unfused equivalent. Floating-point contraction is meant to address
this, but the support for it isn&#8217;t generally useful.</li>
</ul>
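<p>The Fused-Multiply-Add point can be made concrete with a short sketch. It assumes IEEE 754 double precision with round-to-nearest; the function names are hypothetical, and the <tt class="docutils literal"><span class="pre">volatile</span></tt> temporary is only there to stop a compiler from contracting the unfused form on its own.</p>

```cpp
#include <cassert>
#include <cmath>

// Unfused: the product is rounded to double, then the sum is rounded again.
// The volatile temporary keeps the compiler from contracting the two
// operations into a single fused instruction.
inline double multiply_add_unfused(double a, double b, double c) {
  volatile double product = a * b;
  return product + c;
}

// Fused: std::fma computes a * b + c with a single rounding at the end.
inline double multiply_add_fused(double a, double b, double c) {
  return std::fma(a, b, c);
}

// Self-check, assuming IEEE 754 binary64 and round-to-nearest-even.
inline void check_fma_rounding() {
  const double eps = std::ldexp(1.0, -27);  // 2^-27
  const double a = 1.0 + eps;
  const double b = 1.0 - eps;
  // The exact product is 1 - 2^-54, which rounds to 1.0 as a double,
  // so the unfused result is exactly zero.
  assert(multiply_add_unfused(a, b, -1.0) == 0.0);
  // The fused form keeps the exact product and returns -2^-54.
  assert(multiply_add_fused(a, b, -1.0) == -std::ldexp(1.0, -54));
}
```

<p>Neither answer is wrong in isolation, but the two differ, which is why a developer has to decide whether the application tolerates the change.</p>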
<p>A core optimization which hasn&#8217;t been addressed directly by current
vector programming proposals is how to restructure data structures and
algorithms to better fit the current architecture&#8217;s SIMD width and memory
hierarchy. C++ data structures have a fixed layout at every stage in the
compiler&#8217;s pipeline, yet in many cases it is beneficial to change
Array-of-Structures to Structures-of-Arrays. Low-level vector types as well as
proposed reflection features (cf. WG21/SG7) can be used to synthesize optimal
data structures. This requires each programmer to express data structure layout
transformations which are suitable to their algorithms and target architectures.</p>
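<p>A minimal hand-written sketch of the Array-of-Structures to Structure-of-Arrays transformation follows. The <tt class="docutils literal"><span class="pre">Particle</span></tt> and <tt class="docutils literal"><span class="pre">ParticlesSoA</span></tt> types and both functions are hypothetical; they only illustrate the kind of layout change a programmer currently has to express manually.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-Structures: one object per particle, fields interleaved in memory.
struct Particle {
  float x, y, z;
};

// Structure-of-Arrays: one contiguous array per field, so consecutive x
// values sit next to each other in memory.
struct ParticlesSoA {
  std::vector<float> x, y, z;
};

// Copy an AoS container into the SoA layout.
inline ParticlesSoA to_soa(const std::vector<Particle>& aos) {
  ParticlesSoA soa;
  soa.x.reserve(aos.size());
  soa.y.reserve(aos.size());
  soa.z.reserve(aos.size());
  for (const Particle& p : aos) {
    soa.x.push_back(p.x);
    soa.y.push_back(p.y);
    soa.z.push_back(p.z);
  }
  assert(soa.x.size() == aos.size());
  return soa;
}

// A unit-stride loop over a single field: every access is contiguous.
inline void translate_x(ParticlesSoA& particles, float dx) {
  for (std::size_t i = 0; i < particles.x.size(); ++i) {
    particles.x[i] += dx;
  }
}
```

<p>In the AoS layout the same loop would touch every third <tt class="docutils literal"><span class="pre">float</span></tt>, which typically requires strided or gather-style accesses on vector hardware.</p>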
</div>
<div class="section" id="latest-papers">
<h2>Latest Papers<a class="headerlink" href="#latest-papers" title="Permalink to this headline">¶</a></h2>
<p>The following papers are the latest ones being considered by SG1 as of the
<a class="reference external" href="http://open-std.org/jtc1/sc22/wg21/docs/papers/2016#mailing2016-02">pre-Jacksonville mailing</a> with respect to vector programming. Other papers are
obsolete, have been abandoned, or haven&#8217;t been presented to SG1 recently.</p>
<p>Some of the following papers are explored in more detail in the next sections.</p>
<ul class="simple">
<li>The Parallelism TS1 has an API similar to the STL&#8217;s <tt class="docutils literal"><span class="pre">&lt;algorithm&gt;</span></tt> header and
was voted for inclusion into C++17. These algorithms are composable by the
developer, who provides iterators to the data and, in some cases, function
objects.<ul>
<li><a class="reference external" href="http://wg21.link/N4354">DTS Ballot Document</a>.</li>
<li><a class="reference external" href="http://wg21.link/p0024r0">Should be Standardized</a>.</li>
<li><a class="reference external" href="http://wg21.link/p0075r0">Template Library for Index-Based Loops</a>.</li>
<li><a class="reference external" href="http://wg21.link/p0076r0">Vector and wavefront policies</a> is moving to LEWG with the expectation of
being in the Parallelism TS2 with a few modifications.</li>
<li><a class="reference external" href="http://wg21.link/p0072r1">Light-Weight Execution Agents</a> discusses forward progress guarantees, with
special attention to the Parallelism TS.<ul>
<li><a class="reference external" href="http://wg21.link/P0296R0">Forward progress guarantees: Base definitions</a> provides some wording.</li>
<li><a class="reference external" href="http://wg21.link/P0299R0">Forward progress guarantees for the Parallelism TS v2</a> provides more.</li>
</ul>
</li>
</ul>
</li>
<li>The Parallelism TS2 hasn&#8217;t been published yet but is expected to contain
follow-ups to the Parallelism TS1.</li>
<li>The SIMD Types proposal exposes types which are fixed-width as well as
<tt class="docutils literal"><span class="pre">typedef</span></tt> for wider vectors as appropriate for the target architecture.<ul>
<li><a class="reference external" href="http://wg21.link/n4184">The Vector Type &amp; Operations</a>.</li>
<li><a class="reference external" href="http://wg21.link/n4185">The Mask Type &amp; Write-Masking</a>.</li>
<li><a class="reference external" href="http://wg21.link/n4395">ABI Considerations</a>.</li>
<li><a class="reference external" href="http://wg21.link/n4454">Example: Matrix Multiplication</a>.</li>
<li><a class="reference external" href="http://wg21.link/P0203R0">Considerations for the design of expressive portable SIMD vectors</a> raises
design questions on the above papers.</li>
<li><a class="reference external" href="http://wg21.link/n3571">A Proposal to add Single Instruction Multiple Data Computation to the
Standard Library</a> pre-dates The Vector Type &amp; Operations proposal and
has similar API principles with different API choices and naming. The
authors are discussing collaboration since The Vector Type &amp; Operations
proposal has received strong support from SG1.</li>
</ul>
</li>
<li><a class="reference external" href="http://wg21.link/p0035r0">Dynamic memory allocation for over-aligned data</a> allows allocating aligned
data.</li>
</ul>
</div>
<div class="section" id="parallelism-ts-details">
<h2>Parallelism TS Details<a class="headerlink" href="#parallelism-ts-details" title="Permalink to this headline">¶</a></h2>
<p>Of note in the Parallelism TS is the <tt class="docutils literal"><span class="pre">par_vec</span></tt> execution policy from section
2.6 [parallel.execpol.vec]:</p>
<blockquote>
<div><p>The class <tt class="docutils literal"><span class="pre">class</span> <span class="pre">parallel_vector_execution_policy</span></tt> is an execution policy
type used as a unique type to disambiguate parallel algorithm overloading and
indicate that a parallel algorithm&#8217;s execution may be vectorized and
parallelized.</p>
<p>This execution policy is defined as follows in section 4.1.2
[parallel.alg.general.exec]:</p>
<p>The invocations of element access functions in parallel algorithms invoked
with an execution policy of type <tt class="docutils literal"><span class="pre">parallel_vector_execution_policy</span></tt> are
permitted to execute in an unordered fashion in unspecified threads, and
unsequenced with respect to one another within each thread. [ <em>Note:</em> This
means that multiple function object invocations may be interleaved on a single
thread. — <em>end note</em> ]</p>
<p>[ <em>Note:</em> This overrides the usual guarantee from the C++ standard, Section
1.9 [intro.execution] that function executions do not interleave with one
another. — <em>end note</em> ]</p>
<p>Since <tt class="docutils literal"><span class="pre">parallel_vector_execution_policy</span></tt> allows the execution of element
access functions to be interleaved on a single thread, synchronization,
including the use of mutexes, risks deadlock. Thus the synchronization with
<tt class="docutils literal"><span class="pre">parallel_vector_execution_policy</span></tt> is restricted as follows:</p>
<p>A standard library function is <em>vectorization-unsafe</em> if it is specified to
synchronize with another function invocation, or another function invocation
is specified to synchronize with it, and if it is not a memory allocation or
deallocation function. Vectorization-unsafe standard library functions may not
be invoked by user code called from <tt class="docutils literal"><span class="pre">parallel_vector_execution_policy</span></tt>
algorithms.</p>
<p>[ <em>Note:</em> Implementations must ensure that internal synchronization inside
standard library routines does not induce deadlock. — <em>end note</em> ]</p>
</div></blockquote>
</div>
<div class="section" id="template-library-for-index-based-loops-details">
<h2>Template Library for Index-Based Loops Details<a class="headerlink" href="#template-library-for-index-based-loops-details" title="Permalink to this headline">¶</a></h2>
<p>The proposal adds the following to the Parallelism TS:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">for_loop</span></tt> and <tt class="docutils literal"><span class="pre">for_loop_strided</span></tt>.</li>
<li><tt class="docutils literal"><span class="pre">reduction</span></tt>, <tt class="docutils literal"><span class="pre">reduction_plus</span></tt>, <tt class="docutils literal"><span class="pre">reduction_multiplies</span></tt>, …</li>
<li><tt class="docutils literal"><span class="pre">induction</span></tt>.</li>
</ul>
</div>
<div class="section" id="vector-and-wavefront-policies-details">
<h2>Vector and wavefront policies Details<a class="headerlink" href="#vector-and-wavefront-policies-details" title="Permalink to this headline">¶</a></h2>
<p>The proposal adds two new execution policies to the Parallelism TS:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">unsequenced_execution_policy</span></tt>.</li>
<li><tt class="docutils literal"><span class="pre">vector_execution_policy</span></tt>.</li>
</ul>
<div class="section" id="data-races">
<h3>Data races<a class="headerlink" href="#data-races" title="Permalink to this headline">¶</a></h3>
<p>This paper is contentious in SG1 because examples such as the following have
<tt class="docutils literal"><span class="pre">par_vec</span></tt> data races:</p>
<div class="code c++ highlight-python"><div class="highlight"><pre>for_loop(vec, 0, n, [&amp;](int i) {
   y[i] += y[i+1];
});
</pre></div>
</div>
<p>In the implicit wavefront policy, this will work as expected: The load is
sequenced before the store, and the loaded location is only overwritten by a
later iteration. Operations that both appear earlier in the loop body (in the
sequenced before sense) and in an earlier (or same) iteration than in another
operation remain ordered in this model. This is the classic model followed by
many existing vectorizing compilers.</p>
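<p>The ordering argument above can be checked with a simplified, hand-written model. This is not an implementation of any proposed policy: the three functions are hypothetical and simulate, in plain serial code, a serial run, a wavefront-ordered run with a given number of lanes, and one ordering that a fully unordered execution might choose.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference: y[i] += y[i+1] for i in [0, n).
inline std::vector<int> run_serial(std::vector<int> y, std::size_t n) {
  assert(n < y.size());
  for (std::size_t i = 0; i < n; ++i) y[i] += y[i + 1];
  return y;
}

// Simplified model of a wavefront-ordered vector execution with `width`
// lanes: within a chunk, every lane performs its loads before any lane
// performs its store, and chunks run in increasing index order.
inline std::vector<int> run_wavefront(std::vector<int> y, std::size_t n,
                                      std::size_t width) {
  assert(width > 0 && n < y.size());
  std::vector<int> sums(width);
  for (std::size_t base = 0; base < n; base += width) {
    const std::size_t lanes = std::min(width, n - base);
    for (std::size_t l = 0; l < lanes; ++l)  // all loads first
      sums[l] = y[base + l] + y[base + l + 1];
    for (std::size_t l = 0; l < lanes; ++l)  // then all stores
      y[base + l] = sums[l];
  }
  return y;
}

// One ordering an unordered execution could pick: highest index first.
inline std::vector<int> run_reversed(std::vector<int> y, std::size_t n) {
  assert(n < y.size());
  for (std::size_t i = n; i-- > 0;) y[i] += y[i + 1];
  return y;
}
```

<p>For this loop body the wavefront model matches the serial result for any lane count, because each load of <tt class="docutils literal"><span class="pre">y[i+1]</span></tt> still happens before the later iteration overwrites it; the reversed ordering breaks that and produces different values.</p>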
<p>In the explicit wavefront model, this requires a new kind of ordering barrier to
explicitly ensure this ordering.</p>
<p>Three alternatives were discussed by a few SG1 members, outside of a full SG1
meeting. The first alternative is preferred by the small group at this point in
time.</p>
<ol class="arabic">
<li><p class="first">The implicit model continues to make some members slightly uncomfortable, but there
is agreement that we should nonetheless proceed with it. It is clearly the
center of existing practice. And the negatives seem to be more along the
lines of vague discomfort rather than precisely definable objections.</p>
<p>The fundamental issue with this model is that it introduces a context in
which certain otherwise safe transformations can no longer be applied
by a compiler before the code is vectorized. Before the code is vectorized,
the sequenced-before relation must, in the general case, be preserved. The
compiler is no longer allowed to reorder ordinary assignments touching
different memory locations. In other contexts such restrictions only arise in
the presence of synchronization or volatile operations. For a compiler that
immediately vectorizes before performing other transformations, this is not
an issue. The current belief is that this does commonly impact compiler
structure.</p>
<p>This transformation restriction of course also applies to the user:
reordering independent operations in a vector context affects semantics and
may not be correct. But users should expect that. In theory it applies to
libraries called from a vector context as well. But in practice the calling
code will either not be sensitive to such reordering, or the library routines
will have been written with the explicit expectation of being used in a
vector context.</p>
</li>
<li><p class="first">A narrow majority of SG1 previously favored the explicit model to largely
avoid this issue. By requiring explicit barriers of some sort, the implicit
compiler restrictions disappear. But the problem with this is pointed out by
the above example: there is no natural place to just put a barrier. In fact
it would have to order a textually later operation before an earlier one. One
would have to break the loop body up into multiple statements. This was
previously pointed out, and SG1 was mostly convinced that this is a serious
practical issue, at least for those already familiar with the implicit model.</p>
</li>
<li><p class="first">There was brief thinking about alternative non-barrier-like syntax to address
the problems with (2). But there wasn&#8217;t much enthusiasm for trying to invent
something new at this stage.</p>
</li>
</ol>
</div>
<div class="section" id="scatter-ordering">
<h3>Scatter ordering<a class="headerlink" href="#scatter-ordering" title="Permalink to this headline">¶</a></h3>
<p>Another contentious point was whether scatter operations should be ordered by
default or not. The committee currently leans towards <em>not</em> being ordered by
default.</p>
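<p>The difference only matters when two lanes target the same location. The sketch below uses hypothetical helper names and plain serial code to model an ordered scatter and to detect the conflicting case; an unordered scatter would be free to let either conflicting lane win.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ordered scatter: lanes are written in increasing lane order, so when two
// lanes target the same index the higher-numbered lane's value wins.
inline void scatter_ordered(std::vector<int>& out,
                            const std::vector<std::size_t>& indices,
                            const std::vector<int>& values) {
  assert(indices.size() == values.size());
  for (std::size_t lane = 0; lane < indices.size(); ++lane) {
    assert(indices[lane] < out.size());
    out[indices[lane]] = values[lane];
  }
}

// Returns true when two lanes write the same index. Only then can an
// unordered scatter produce a result different from the ordered one.
inline bool has_conflicting_lanes(const std::vector<std::size_t>& indices) {
  for (std::size_t i = 0; i < indices.size(); ++i)
    for (std::size_t j = i + 1; j < indices.size(); ++j)
      if (indices[i] == indices[j]) return true;
  return false;
}
```

<p>Guaranteeing the ordered result can require extra conflict handling on some hardware, which is one way to read the preference for unordered scatters by default.</p>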
</div>
</div>
<div class="section" id="the-vector-type-operations-details">
<h2>The Vector Type &amp; Operations Details<a class="headerlink" href="#the-vector-type-operations-details" title="Permalink to this headline">¶</a></h2>
<p>This paper describes a template class for portable SIMD Vector types. The class
is portable because its size depends on the target system and only operations
that are independent of the SIMD register size are part of the interface.</p>
<p>The <tt class="docutils literal"><span class="pre">Vector&lt;T&gt;</span></tt> type only has the vector element type <tt class="docutils literal"><span class="pre">T</span></tt> as a parameter. It
has <tt class="docutils literal"><span class="pre">constexpr</span></tt> members <tt class="docutils literal"><span class="pre">MemoryAlignment</span></tt> and <tt class="docutils literal"><span class="pre">Size</span></tt>.</p>
<p>The following APIs are supported:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">load</span></tt> and <tt class="docutils literal"><span class="pre">store</span></tt> based on a pointer to the scalar element type.<ul>
<li>With optional mask.</li>
<li>Some are based on a pointer to a different scalar type, leading to
conversion.</li>
<li>Optional flags specify alignment, temporality, and prefetching.</li>
</ul>
</li>
<li>Unary <tt class="docutils literal"><span class="pre">+</span></tt> and <tt class="docutils literal"><span class="pre">-</span></tt>.</li>
<li>Binary arithmetic, comparison, bitwise, and shift.</li>
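<li>Subscripting. It&#8217;s open whether subscripting should be an lvalue reference
or a smart reference, though smart reference is currently slightly preferred.</li>
<li>Gather and scatter.</li>
<li>At the Jacksonville meeting there was strong support for adding <tt class="docutils literal"><span class="pre">sqrt</span></tt>, <tt class="docutils literal"><span class="pre">abs</span></tt>,
<tt class="docutils literal"><span class="pre">min</span></tt>, <tt class="docutils literal"><span class="pre">max</span></tt>, <tt class="docutils literal"><span class="pre">minnum</span></tt>, <tt class="docutils literal"><span class="pre">maxnum</span></tt>, <tt class="docutils literal"><span class="pre">copysign</span></tt>, <tt class="docutils literal"><span class="pre">modf</span></tt>, and similar to the
initial published TS. There was little support for trigonometric functions;
these would therefore come in a later revision.</li>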
</ul>
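<p>To give a feel for how such an interface is used, here is a toy stand-in written with scalar loops. It is not the proposed <tt class="docutils literal"><span class="pre">Vector&lt;T&gt;</span></tt>: the name <tt class="docutils literal"><span class="pre">ToyVector</span></tt>, the fixed 16-byte width, the by-value subscript, and <tt class="docutils literal"><span class="pre">add_arrays</span></tt> are all assumptions of this sketch. Only the member names <tt class="docutils literal"><span class="pre">Size</span></tt>, <tt class="docutils literal"><span class="pre">MemoryAlignment</span></tt>, <tt class="docutils literal"><span class="pre">load</span></tt>, and <tt class="docutils literal"><span class="pre">store</span></tt> follow the description above.</p>

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a portable SIMD vector type. The 16-byte width is an
// assumption of this sketch; a real implementation would choose it per
// target and map the operations to vector instructions.
template <typename T>
class ToyVector {
  static_assert(sizeof(T) <= 16, "element type must fit in the toy register");

 public:
  static constexpr std::size_t Size = 16 / sizeof(T);
  static constexpr std::size_t MemoryAlignment = alignof(T) * Size;

  ToyVector() = default;

  // Load Size consecutive scalars starting at `mem`.
  void load(const T* mem) {
    for (std::size_t i = 0; i < Size; ++i) lanes_[i] = mem[i];
  }

  // Store all lanes to Size consecutive scalars starting at `mem`.
  void store(T* mem) const {
    for (std::size_t i = 0; i < Size; ++i) mem[i] = lanes_[i];
  }

  // Element-wise addition.
  friend ToyVector operator+(const ToyVector& a, const ToyVector& b) {
    ToyVector r;
    for (std::size_t i = 0; i < Size; ++i) r.lanes_[i] = a.lanes_[i] + b.lanes_[i];
    return r;
  }

  // Subscripting returns by value here, sidestepping the reference question.
  T operator[](std::size_t i) const {
    assert(i < Size);
    return lanes_[i];
  }

 private:
  T lanes_[Size] = {};
};

// Width-agnostic loop: out[i] = a[i] + b[i], with a scalar tail for the
// elements that do not fill a whole vector.
template <typename T>
void add_arrays(const T* a, const T* b, T* out, std::size_t n) {
  constexpr std::size_t W = ToyVector<T>::Size;
  std::size_t i = 0;
  for (; i + W <= n; i += W) {
    ToyVector<T> va, vb;
    va.load(a + i);
    vb.load(b + i);
    (va + vb).store(out + i);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

<p>The calling code never names the register width; it only refers to <tt class="docutils literal"><span class="pre">Size</span></tt>, which is what lets the same source adapt to targets with different vector lengths.</p>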
<p>There has been much discussion on how to handle ABI compatibility issues, given
that machines otherwise considered ABI-compatible routinely support different
vector instruction sets, and even different vector lengths. They will want to
use different default vector lengths, and pass vectors in different registers.
The current SIMD vector types proposal lets the programmer trade off best
performance from the latest hardware against compatibility with older hardware.</p>
</div>
<div class="section" id="acknowledgement">
<h2>Acknowledgement<a class="headerlink" href="#acknowledgement" title="Permalink to this headline">¶</a></h2>
<p>Thanks to Chandler Carruth, Joel Falcou, Michael Wong, and Robert Geva for their
review of the pre-publication paper.</p>
<p>Thanks to the many vector programming paper authors and SG1 for working
tirelessly on such a complex topic for years.</p>
</div>
</div>


    </div>
  </body>
</html>