<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

<style type="text/css">

body {
  color: #000000;
  background-color: #FFFFFF;
  counter-reset: section example;
  max-width: 50em;
}

del {
  text-decoration: line-through;
  color: #8B0040;
}
ins {
  text-decoration: underline;
  color: #005100;
}

h2:before {
  display: inline;
  content: counter(section) "  ";
  counter-increment: section;
}
h2 {
  counter-reset: subsection;
}

h3:before {
  display: inline;
  content: counter(section) "." counter(subsection) " ";
  counter-increment: subsection;
}
h3 {
  counter-reset: subsubsection;
}

h4:before {
  display: inline;
  content: counter(section) "." counter(subsection) "." counter(subsubsection) " ";
  counter-increment: subsubsection;
}

p {
  margin: 1em 0em 0em 0em;
}
p.example {
  margin: 1em 2em;
}

span.lastexample:before {
  content: "Example " counter(example);
}

pre{
  margin: 0.3ex 1em 0.3ex 1em;
}
pre.example {
  display: block;
  white-space: pre;
  background-color: #f0fff0;
  border-color: #e0ffe0;
  border-style: solid;
  border-top-width: 1px;
  border-bottom-width: 1px;
  border-right-width: 1px;
  border-left-width: 5px;
  padding-left: 1ex;
}
pre.example:before {
  white-space: pre;
  background-color: #e0ffe0;
  border: none;
  padding-left: 1ex;
  padding-top: 0ex;
  padding-bottom: 0ex;
  content: "Example " counter(example);
  counter-increment: example;
  display:block;
  font-family: sans-serif;
  font-size: 75%;
  text-align: center;
  margin-left: -1ex;
}
pre.impl {
  display: block;
  white-space: pre;
  background-color: #f0f0ff;
  border-color: #e0e0ff;
  border-style: solid;
  border-top-width: 1px;
  border-bottom-width: 1px;
  border-right-width: 1px;
  border-left-width: 5px;
  padding-left: 1ex;
  padding-top: 0.2ex;
  padding-bottom: 0.2ex;
}
div.example {
  margin: 2em;
}

code.extract {
  background-color: #F5F6A2;
}
pre.extract {
  margin: 2em;
  background-color: #F5F6A2;
  border: 1px solid #E1E28E;
}

pre.inline{
  margin: 0;
  display: inline;
}

p.function {
}

p.attribute {
  text-indent: 3em;
}

blockquote.std {
  color: #000000;
  background-color: #F1F1F1;
  border: 1px solid #D1D1D1;
  padding: 0.5em;
}

blockquote.stddel {
  text-decoration: line-through;
  color: #000000;
  background-color: #FFEBFF;
  border: 1px solid #ECD7EC;
  padding: 0.5em;
}

blockquote.stdins {
  text-decoration: underline;
  color: #000000;
  background-color: #C8FFC8;
  border: 1px solid #B3EBB3;
  padding: 0.5em;
}

table {
  border: 1px solid black;
  border-spacing: 0px;
  margin-left: auto;
  margin-right: auto;
}
th {
  text-align: left;
  vertical-align: top;
  padding: 0.2em;
  border: none;
}
td {
  text-align: left;
  vertical-align: top;
  padding: 0.2em;
  border: none;
}

</style>

<title>SIMD Vector Types</title>
</head>
<body>

<p> Document number: N3759
<br>Date: 2013-08-30
<br>Project: Programming Language C++, Library Working Group
<br>Reply-to: Matthias Kretz &lt;kretz@kde.org&gt; &lt;kretz@compeng.uni-frankfurt.de&gt;

<h1>SIMD Vector Types</h1>
<p>
<a href="#Introduction">Introduction</a><br>
<a href="#Motivation">Motivation</a><br>
<a href="#Problem">Problem</a><br>
<a href="#Proposal">Proposal &mdash; SIMD Types</a><br>
<a href="#Acknowledgements">Acknowledgements</a><br>
<a href="#References">References</a><br>
</p>
<h2><a name="Preface">Preface</a></h2>
<p>
In Bristol, N3571 (A Proposal to add Single Instruction Multiple Data Computation to the Standard Library) was discussed and the following straw poll was taken:
“Should C++ include a fixed length vector type to abstract vector registers?”
<ul>
<li> 2/2/1/4/6 SF/WF/N/WA/SA
<li> Consensus not to move forward
</ul>

<p>
This poll assumed a slightly different approach than the one presented in this proposal.
It did not make sense to take over the discussion at the time to talk about this approach.
In and after that session I was told that I should write a new proposal.

<h2><a name="Introduction">Introduction &mdash; SIMD Registers and Operations</a></h2>
<p>
For many years the number of SIMD instructions and the size of SIMD registers have been growing.
Newer microarchitectures introduce new instructions to optimize certain (common or specialized) operations.
Additionally, the size of SIMD registers has increased repeatedly and may increase further in the future.

<p>
The typical minimal set of SIMD instructions for a given scalar data type comes down to the following:
<ul>
<li>Load instructions: load N successive scalar values starting from a given address into a SIMD register. SSE examples:
<pre class="example">
movaps (%rax),%xmm0
movups 0x4(%rax),%xmm1
</pre>
<li>Store instructions: store from a SIMD register to N successive scalar values at a given address. SSE examples:
<pre class="example">
movaps %xmm0,(%rax)
movups %xmm1,0x4(%rax)
</pre>
<li>Arithmetic instructions. SSE examples:
<pre class="example">
addps %xmm0,%xmm1
mulps %xmm1,%xmm1
divps %xmm1,%xmm0
</pre>
<li>Compare instructions. SSE examples:
<pre class="example">
cmpeqps %xmm0,%xmm1
</pre>
<li>Bitwise instructions. SSE examples:
<pre class="example">
andps %xmm0,%xmm1
xorps %xmm0,%xmm0
</pre>
</ul>

<p>
The set of available operations can differ considerably between different microarchitectures of the same CPU family.
Furthermore there are different SIMD register sizes.
Future extensions will certainly add more instructions and larger SIMD registers.

<h2><a name="Motivation">Motivation</a></h2>
<p>
There is no need to motivate SIMD programming.
It is very much needed; the only open question is: “How?”

<p>
There have been several approaches to vectorization.
Here I would like to discuss only the merits of SIMD types.
<!--There is no need to discredit other approaches in this paper.-->

<p>
SIMD registers and operations are the low-level ingredients to SIMD programming.
Higher-level abstractions can be built on top of these.
<!--Once these can be accessed from C++ it is possible to build higher-level abstractions around them.-->
If the lowest-level access to SIMD is not provided, users of C++ will be constrained to work within the limits of the provided abstraction.

<p>
In some cases the compiler might generate better code if only the intent is stated instead of an exact sequence of operations.
Thus higher-level abstractions might seem preferable to low-level SIMD types.
In my experience this is not the case because programming with SIMD types makes intent very clear and compilers can optimize sequences of SIMD operations just like they can for scalar operations.
SIMD types do not lead to an easy and obvious answer to efficient and easily usable data structures, though.

<p>
One major benefit from SIMD types is that the programmer can gain an intuition for SIMD.
This subsequently influences further design of data structures and algorithms to better suit SIMD architectures.

<p>
There are already many users of SIMD intrinsics (and thus a primitive form of SIMD types).
Providing a cleaner and portable SIMD API would provide many of them with a better alternative.
Thus SIMD types in C++ would capture existing practice.

<p>
The challenge remains in providing <em>portable</em> SIMD types and operations.

<h2><a name="Problem">Problem</a></h2>
<p>
C++ has no means to use SIMD operations directly.
There are indirect uses through loop vectorization or optimized algorithms (that use extensions to C/C++ or assembler for their implementation).

<p>
All compiler vendors (that I have worked with) add intrinsics support to their compiler products to make these operations accessible from C.
These intrinsics are inherently non-portable and most of the time bound very directly to a specific instruction.
(Compilers are able to statically evaluate and optimize SIMD code written via intrinsics, though.)
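<p>
For illustration, such intrinsics code (here for SSE, via the x86-specific <code>&lt;xmmintrin.h&gt;</code> header) corresponds quite directly to the load, arithmetic, and store instructions listed in the introduction. This sketch assumes an x86 target; the function name is chosen for illustration only:

```cpp
#include <xmmintrin.h> // SSE intrinsics; x86-specific, not portable

// Computes r[i] = (a[i] + b[i]) * a[i] for four floats at once.
// _mm_loadu_ps/_mm_storeu_ps compile to movups, _mm_add_ps to addps,
// _mm_mul_ps to mulps.
inline void add_mul4(const float *a, const float *b, float *r) {
  __m128 va = _mm_loadu_ps(a);
  __m128 vb = _mm_loadu_ps(b);
  __m128 vsum = _mm_add_ps(va, vb);
  __m128 vres = _mm_mul_ps(vsum, va);
  _mm_storeu_ps(r, vres);
}
```

Note how every line names a specific SSE instruction: the code is bound to one instruction set and register width, which is exactly the portability problem described above.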

<h3>Algorithms on Large Datasets</h3>
<p>
Solutions that encode data-parallel operations via the application of operations or algorithms on a larger set of data quickly lead to inefficient use of the CPU's caches.
Consider valarray:
<pre class="example">
1  std::valarray&lt;float&gt; data(N);
2  // initialize it somehow
3  data *= 2.f;
4  data += 1.f;
</pre>
A compiler will be able to vectorize lines 3 and 4.

<p>
§ 6.2 [stmt.expr] (“All side effects from an expression statement are completed before the next statement is executed.”) does not necessarily stop the compiler from combining the operations on lines 3 and 4, since the as-if rule still applies.
Still, for a larger number of expressions and containers, it cannot reasonably be expected or required that compilers will combine the statements.
Thus, we must assume that line 3 will iterate over N values and only afterwards line 4 is allowed to iterate over the same memory again.
But modern CPUs require small working-sets to make efficient use of their caches.
Therefore N should not be too large.
General and exact bounds for N do not exist, though.

<p>
Any solution for achieving smaller working sets requires some form of loop construct in current C++.

<p>
The following is not a solution for this example unless expression templates are used in the implementation:
<pre class="example">
...
3  data = data * 2.f + 1.f;
</pre>
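<p>
In contrast, a plain loop fuses both operations, so every element is loaded and stored exactly once regardless of N (a minimal sketch; the function name is chosen for illustration only):

```cpp
#include <vector>

// One pass over the data: each element is read and written exactly once,
// so the working set per iteration stays within a single cache line.
void scale_and_shift(std::vector<float> &data) {
  for (float &x : data)
    x = x * 2.f + 1.f;
}
```

This is the cache-friendly shape that the loop-based SIMD approach in the following sections builds on.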

<!--
<p>
A different variation of this problem occurs when using STL algorithms:
<pre class="example"><code>
1  std::array&lt;float, N&gt; data;
2  // initialize it somehow
3  std::transform(data.begin(), data.end(), data.begin(), [](float x) { return x * 2.f; });
4  std::transform(data.begin(), data.end(), data.begin(), [](float x) { return x + 1.f; });
</pre>
The working-set problem is obviously fixable here:
<pre class="example">
...
3  std::transform(data.begin(), data.end(), data.begin(), [](float x) { return x * 2.f + 1.f; });
</pre>
It could possibly be "fixed" to work on a smaller working-set, though:
<pre class="example">
1  std::array&lt;float, N&gt; data;
2  // initialize it somehow
3  int i = 0;
4  for (; i &lt; data.size() - 15; i += 16) {
5    std::transform(&amp;data[i], &amp;data[i + 16], &amp;data[i], [](float x) { return x * 2.f; });
6    std::transform(&amp;data[i], &amp;data[i + 16], &amp;data[i], [](float x) { return x + 1.f; });
7  }
8  for (; i &lt; data.size(); ++i) {
9    std::transform(&amp;data[i], &amp;data[i + 1], &amp;data[i], [](float x) { return x * 2.f; });
10   std::transform(&amp;data[i], &amp;data[i + 1], &amp;data[i], [](float x) { return x + 1.f; });
11 }
</pre>
Which does not look like the code we want developers to write.
This looks like working around the API to fit modern hardware, rather than the API supporting development for modern hardware.
-->

<h3>The Array Building Blocks Approach</h3>
<p>
For the reasons sketched in the section above, Intel experimented with an approach that splits the algorithm into optimized working sets at runtime.
This required special sections of code which generated code at runtime that was subsequently executed to do the actual work.
The semantics in different sections of the code were thus slightly different:
something that can be really surprising and hard to understand for developers.

<p>
Whenever you take the approach of expressing the algorithm on the whole large data, you will either end up with inefficient cache usage or a solution resembling Intel Array Building Blocks.
(I'd be happy to learn of another &mdash; better &mdash; solution, of course.)

<h3>Cache Efficient on Large Data == Loops</h3>
<p>
Therefore, loops still remain the state of the art for cache-efficient processing of large data sets.
The loop count can be reduced by executing the loop body simultaneously on SIMD vectors.
(Additionally the loop can be divided onto multiple cores.)

<h3>Portability Considerations</h3>
<p>
The main portability concern stems from different SIMD register widths for different targets.
This is a real problem mainly when SIMD types are used in I/O.
But this is comparable to differing endianness.
It simply requires software to use a portable interchange format (e.g. SoA or AoS of scalar types).

<p>
Typically the compiler is told the target microarchitecture via flags.
Obviously this will create code that is not guaranteed to run on older/incompatible CPUs.
An implementation might decide to compile the code for several targets, though, and select the best one at runtime.
But it is also possible to achieve runtime selection of the target microarchitecture without help from the compiler (e.g. Krita uses Vc this way).

<h2><a name="Proposal">Proposal</a></h2>
<p>
This is a pure library proposal.
It does depend on implementation-specific language extensions, though (e.g. SIMD intrinsics/builtins).
The proposal builds upon the experience from the Vc library [<a href="#cite1">1</a>, <a href="#cite2">2</a>].

<h3>Room for Follow-up Proposals</h3>
This proposal focuses on the core class only.
Therefore a lot of interesting functionality that is present in Vc will not be discussed here.
Follow-up proposals can address the following issues:
<ul>
<li>Mask types and solutions for conditional assignment / predicates
<li>Load/store optimizations: streaming and prefetching
<li>Gather/scatter
<li>Shuffles, swizzles, vector shift/rotate
<li>Portable optimized (de)interleaving
<li>Iterators / Ranges (make iteration with SIMD types easier)
<li>Containers for SIMD
<li>“valarray&lt;simd_vector&lt;T&gt;, N&gt;”
<li>support for load/store with half-precision floats
<li>“fix” allocators to support SIMD types (over-aligned) throughout the C++ standard (<code>new</code>, <code>std::allocator</code>)
</ul>

<h3>Types</h3>
<p>
Provide at least the following new types:
<ul>
<li><code>int16_v</code>
<li><code>int32_v</code>
<li><code>uint16_v</code>
<li><code>uint32_v</code>
<li><code>float_v</code>
<li><code>double_v</code>
</ul>
These types should probably be provided as well (they are not provided in Vc &mdash; I never had a need for them):
<ul>
<li><code>int8_v</code>
<li><code>uint8_v</code>
<li><code>int64_v</code>
<li><code>uint64_v</code>
</ul>

<p>
Each class has a single data member, an implementation-specific object to access the SIMD register (this could be <code>__m256</code> with AVX intrinsics).
If the type is not supported for SIMD on the target platform, the data member will be a single scalar value.
For example, <code>double_v</code> has one <code>double</code> member on ARM NEON.

<p>
The sizes of these SIMD types therefore depend on the natural size suggested by the architecture of the execution environment.
This is similar to § 3.9.1 [basic.fundamental] p2: “Plain <code>int</code>s have the natural size suggested by the architecture of the execution environment”.

<p>
These types should be instantiations of a SIMD vector class template: <code>typedef simd_vector&lt;float&gt; float_v;</code>.
This makes generic code slightly easier to create, without the need for SFINAE:
<pre class="example">
template&lt;typename T&gt; void someFunction(simd_vector&lt;T&gt; v);
// vs.
template&lt;typename V&gt; typename std::enable_if&lt;is_simd_vector&lt;V&gt;::value, void&gt;::type someFunction(V v);
</pre>
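<p>
The <code>is_simd_vector</code> trait used in the second variant is not part of this proposal; it could be sketched as follows:

```cpp
#include <type_traits>

template<typename T> class simd_vector; // as proposed

// A possible implementation of the is_simd_vector trait (a sketch, not
// proposal text): true only for instantiations of simd_vector.
template<typename V> struct is_simd_vector : std::false_type {};
template<typename T> struct is_simd_vector<simd_vector<T>> : std::true_type {};
```

With such a trait the SFINAE variant works, but the version constrained directly on <code>simd_vector&lt;T&gt;</code> remains the simpler spelling.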

<p>
The SIMD types all need to be inside a namespace whose name depends on the ABI/target.
For instance, the symbols for SSE and AVX SIMD types must be different so that incompatible object files/libraries fail to link.
<pre class="impl">
inline namespace <i>ABI/target-dependent</i> {
</pre>

<p>
The <code>simd_vector&lt;T&gt;</code> class:
<pre class="impl">
template&lt;typename T&gt;
class simd_vector
{
  typedef <i>implementation-defined</i> simd_type;
  simd_type data;
public:
  typedef T value_type;

  // the number of values in the SIMD vector
  static constexpr size_t size = sizeof(simd_type) / sizeof(value_type);

  // in Vc operator new / delete are overloaded to work around the fact that new ignores the alignof
  // of the allocated type. If this does not get solved in the standard the following will be needed:
  void *operator new(size_t size);
  void *operator new(size_t, void *p) { return p; }
  void *operator new[](size_t size);
  void *operator new[](size_t , void *p) { return p; }
  void operator delete(void *ptr, size_t);
  void operator delete(void *, void *) {}
  void operator delete[](void *ptr, size_t);
  void operator delete[](void *, void *) {}

  // init to zero
  simd_vector();

  // copy
  simd_vector(const simd_vector &amp;);
  simd_vector &amp;operator=(const simd_vector &amp;);

  // implicit conversion from compatible simd_vector&lt;T&gt;
  template&lt;typename U&gt; simd_vector(simd_vector&lt;U&gt;, typename enable_if&lt;is_integral&lt;value_type&gt;::value
    &amp;&amp; is_integral&lt;U&gt;::value &amp;&amp; size == simd_vector&lt;U&gt;::size, void *&gt;::type = nullptr);

  // static_cast from vectors of possibly (depending on target) different size
  // (dropping values or filling with 0 if the size is not equal)
  template&lt;typename U&gt; explicit simd_vector(simd_vector&lt;U&gt;, typename enable_if&lt;!(
    is_integral&lt;value_type&gt;::value &amp;&amp; is_integral&lt;U&gt;::value &amp;&amp; size == simd_vector&lt;U&gt;::size),
    void *&gt;::type = nullptr);

  // broadcast with implicit conversions
  simd_vector(value_type);

  // load member functions
  void load(const value_type *mem);
  template&lt;typename U&gt; void load(const U *mem);

  // load ctors (optional)
  explicit simd_vector(const value_type *mem);
  template&lt;typename U&gt; explicit simd_vector(const U *mem);

  // store functions
  void store(value_type *mem) const;
  template&lt;typename U&gt; void store(U *mem) const;

  // unary operators
  simd_vector &amp;operator++();
  simd_vector  operator++(int);
  simd_vector &amp;operator--();
  simd_vector  operator--(int);

  simd_vector operator~() const;
  simd_vector operator+() const;
  simd_vector&lt;typename negate_type&lt;T&gt;::type&gt; operator-() const;

  // assignment operators
  simd_vector &amp;operator+=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator-=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator*=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator/=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator%=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator&amp;=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator|=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator^=(simd_vector&lt;T&gt; x);

  simd_vector &amp;operator&lt;&lt;=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator&gt;&gt;=(simd_vector&lt;T&gt; x);
  simd_vector &amp;operator&lt;&lt;=(int x);
  simd_vector &amp;operator&gt;&gt;=(int x);

  // scalar entries access
  <i>implementation-defined</i> &amp;operator[](size_t index);
  value_type operator[](size_t index) const;
};

typedef simd_vector&lt;   float&gt;  float_v;
typedef simd_vector&lt;  double&gt; double_v;
typedef simd_vector&lt; int64_t&gt;  int64_v;
typedef simd_vector&lt;uint64_t&gt; uint64_v;
typedef simd_vector&lt; int32_t&gt;  int32_v;
typedef simd_vector&lt;uint32_t&gt; uint32_v;
typedef simd_vector&lt; int16_t&gt;  int16_v;
typedef simd_vector&lt;uint16_t&gt; uint16_v;
typedef simd_vector&lt;  int8_t&gt;   int8_v;
typedef simd_vector&lt; uint8_t&gt;  uint8_v;

} // namespace
</pre>
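<p>
To illustrate the interface, the following is a minimal scalar-fallback sketch of a small subset of this class, as it could look on a target without SIMD support for <code>T</code> (so <code>size == 1</code>). It is illustrative only, not a proposed implementation:

```cpp
#include <cstddef>

// Scalar fallback: simd_type is the scalar type itself, so size == 1, as
// the proposal specifies for targets without SIMD support for T.
// Only a small subset of the proposed interface is shown.
template<typename T>
class simd_vector {
  typedef T simd_type; // implementation-defined in the real class
  simd_type data;
public:
  typedef T value_type;
  static constexpr std::size_t size = sizeof(simd_type) / sizeof(value_type);

  simd_vector() : data() {}              // init to zero
  simd_vector(value_type x) : data(x) {} // broadcast

  void load(const value_type *mem) { data = *mem; }
  void store(value_type *mem) const { *mem = data; }

  simd_vector &operator+=(simd_vector x) { data += x.data; return *this; }
  simd_vector &operator*=(simd_vector x) { data *= x.data; return *this; }

  value_type operator[](std::size_t) const { return data; }
};

typedef simd_vector<float> float_v;
```

A real implementation would replace <code>simd_type</code> with an intrinsic register type (e.g. <code>__m256</code>) and implement the operations with the corresponding intrinsics; the user-visible interface stays the same.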

<h4>Conversions</h4>
<p>
Implicit conversion between vectors and implicit conversion from scalars to vectors follows the rules of conversion between scalar types as closely as possible.
The rules are:
<ul>
<li>If <code>U</code> implicitly converts to <code>T</code> then implicit conversion from <code>U</code> to <code>simd_vector&lt;T&gt;</code> also works.
<li>If <code>simd_vector&lt;U&gt;::size == simd_vector&lt;T&gt;::size</code> is guaranteed to hold portably, then implicit conversion from <code>simd_vector&lt;U&gt;</code> to <code>simd_vector&lt;T&gt;</code> also works.
</ul>

<p>
In Vc the following guarantees are made:
<ul>
<li><code>float_v::size == int32_v::size == uint32_v::size</code>
<li><code>int16_v::size == uint16_v::size</code>
</ul>
It is very convenient to have an integer vector with the same number of entries as a float vector.
On the other hand, this does not map naturally to all targets (e.g. AVX).
A solution that is closer to hardware reality is to guarantee the size equality only between integer vectors:
<ul>
<li><pre class="inline">int64_v::size == uint64_v::size</pre>
<li><pre class="inline">int32_v::size == uint32_v::size</pre>
<li><pre class="inline">int16_v::size == uint16_v::size</pre>
<li><pre class="inline"> int8_v::size ==  uint8_v::size</pre>
</ul>
The convenience can be restored with an abstraction above the <code>simd_vector&lt;T&gt;</code> types.

<p>
Explicit conversions between vectors of possibly different size are allowed and will either drop the last values or fill the remaining values in the destination with zeros.
(Together with vector shift/rotate functions and a bit of care this allows portable code that casts between int and float vectors.)

<h4>Loads / Stores</h4>
<p>
The most important portable way to efficiently fill a complete SIMD vector is via loads.
A load requires a pointer to at least <code>size</code> consecutive values in memory.
Whether load constructors should be specified needs to be discussed.
Their use does not explicitly state the operation:
<pre class="example">
float *data = ...;
float_v a(data); // it is not obvious what this line of code does
float_v b;
b.load(data);    // this is a lot clearer
</pre>
On the other hand there should be a way to fill a vector on construction (e.g. to ease the use of <code>const</code>).

<p>
As for loads, the most important portable way to efficiently store the results from a SIMD vector is via store functions.

<p>
The overloads of load/store with <code>value_type*</code> exist to support arguments that have an implicit conversion to <code>value_type*</code>.

<p>
Loads and stores can optionally do a conversion from/to memory of a different type (e.g. <code>float_v fun(short *mem) { return float_v(mem); }</code>).
This feature is important because:
<ul>
<li>hardware may provide special support for converting loads/stores (e.g. Intel MIC)
<li>the pattern is otherwise much harder to use (e.g. using <code>float</code> for storage and <code>double_v</code> for calculation)
</ul>
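<p>
The second pattern in scalar form (a plain illustration, not proposal code): values are stored as <code>float</code> to save memory and bandwidth, while the computation runs in <code>double</code> precision. A converting SIMD load would do the same widening for a whole vector at once:

```cpp
#include <cstddef>

// Scalar form of the converting-load pattern: float storage, double math.
// The hypothetical sum_wide name is chosen for illustration only.
double sum_wide(const float *mem, std::size_t n) {
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    sum += static_cast<double>(mem[i]); // converting "load"
  return sum;
}
```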

<h3>Binary Operators</h3>
<p>
All arithmetic, logical, and bitwise operators are overloaded for combinations of different <code>simd_vector&lt;T&gt;</code> and builtin scalar types.
<pre class="impl">
template&lt;typename L, typename R&gt; typename determine_return_type&lt;L, R&gt;::type operator+(L &amp;&amp;x, R &amp;&amp;y)
{
  typedef typename determine_return_type&lt;L, R&gt;::type V;
  return <i>internal_add_function</i>(V(std::forward&lt;L&gt;(x)), V(std::forward&lt;R&gt;(y)));
}
</pre>

<p>
It is possible to get correct automatic conversions.
The <code>determine_return_type</code> trait decides which type combinations are allowed and which conversions are performed.

<p>
Consider:
<pre class="example">
void f(int32_v x, unsigned int y) {
  auto z = x * y;
  ...
</pre>
We expect the type of <code>z</code> in <span class="lastexample"></span> to be <code>uint32_v</code>, analogous to <code>1 * 1u</code> yielding an <code>unsigned int</code>.
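<p>
The scalar rule that <code>determine_return_type</code> is meant to mirror can be verified directly with <code>&lt;type_traits&gt;</code>:

```cpp
#include <type_traits>

// The usual arithmetic conversions: mixing int with unsigned int yields
// unsigned int, and mixing int with double yields double.
static_assert(std::is_same<decltype(1 * 1u), unsigned int>::value,
              "int * unsigned int promotes to unsigned int");
static_assert(std::is_same<decltype(1 * 1.0), double>::value,
              "int * double promotes to double");
```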

<p>
On the other hand the following example must always fail to compile:
<pre class="example">
void f(int32_v x, double_v y) {
  auto z = x * y;
  ...
</pre>
While <code>1 * 1.</code> is well-defined and yields a <code>double</code>, the issue with SIMD vectors is that their number of entries is not guaranteed to match.
(Consider x holding four values and y holding two values.
The semantics of a multiplication of <code>x</code> with <code>y</code> is unclear.
It could mean [x0 * y0, x1 * y0, x2 * y1, x3 * y1], or [x0 * y0, x1 * y1, x2, x3], or [x0 * y0, x1 * y1, x2 * y0, x3 * y1], or [x0 * y0, x1 * y1], or [x2 * y0, x3 * y1].
Or, for a different target, it could even simply mean [x0 * y0].)

<p>
Via SFINAE the operators can be defined such that only the following type combinations work:<br/>
<code>
simd_vector&lt;T&gt; &times; From, where is_convertible&lt;From, simd_vector&lt;T&gt;&gt; &amp;&amp; !is_narrowing_float_conversion&lt;From, T&gt;
</code>

<p>
A follow-up proposal will define the <code>determine_return_type</code> class.
This proposal could also discuss operator implementations for better error reporting (via <code>static_assert</code>) if a <code>simd_vector</code> is combined with an incompatible second operand.

<h3>Math Functions</h3>
<p>
All the functions in <code>&lt;cmath&gt;</code> can/should be overloaded to accept SIMD vectors as input.

<h3>Examples</h3>
<h4>Convert Many Points to Polar Coordinates</h4>
<p>
With the <code>simd_vector&lt;T&gt;</code> class it is possible to explicitly vectorize simple loops:

<pre class="example">
std::array&lt;float, 1024&gt; x_data, y_data, r_data, phi_data;
// fill x and y with random values from -1 to 1
// The following loop converts the x,y coordinates into r,phi polar coordinates.
for (int i = 0; i &lt; 1024; i += float_v::size) {
  const float_v x(&amp;x_data[i]);
  const float_v y(&amp;y_data[i]);
  const float_v r = sqrt(x * x + y * y);
  const float_v phi = atan2(y, x) * 57.29578f;
  r.store(&amp;r_data[i]);
  phi.store(&amp;phi_data[i]);
}
</pre>

<p>
<span class="lastexample"></span> already shows a few issues that will be solved in follow-up proposals:
<ol>
<li>The alignment of the float arrays is not guaranteed to work for aligned SIMD loads and stores.
<li>The loads and stores are not important to the algorithm, but still they take up most of the code in the loop.
<li>If the size of the arrays is not a multiple of the SIMD width, either out-of-bounds accesses result or an extra scalar epilogue is required.
</ol>

<p>
If this example is compiled for an x86 target with SSE2 instructions the loop body calculates four results in parallel.
If it is compiled for an x86 target with AVX instructions the loop body calculates eight results in parallel.
If the target does not support <code>float</code> SIMD vectors the loop body calculates just one result.

<h4>Kalman-Filter</h4>
<p>
This is a (slightly reduced) Kalman filter example from CERN/RHIC track reconstruction.
It is used to fit the track parameters of several particles in parallel.
<pre class="example">
struct Covariance {
  float_v C00,
          C10, C11,
          C20, C21, C22,
          C30, C31, C32, C33,
          C40, C41, C42, C43, C44;
};

struct Track {
  float_v x, y, tx, ty, qp, z;
  float_v chi2;
  Covariance C;

  float_v NDF;
  ...
};

struct HitInfo {
  float_v cos_phi, sin_phi, sigma2, sigma216;
};

void Filter(Track &amp;track, const HitInfo &amp;info, float_v m) {
  Covariance &amp;C = track.C;
  const float_v residual = info.cos_phi * track.x + info.sin_phi * track.y - m; // ζ = Hr - m
  const float_v F0 = info.cos_phi * C.C00 + info.sin_phi * C.C10; //  CHᵀ
  const float_v F1 = info.cos_phi * C.C10 + info.sin_phi * C.C11;
  const float_v F2 = info.cos_phi * C.C20 + info.sin_phi * C.C21;
  const float_v F3 = info.cos_phi * C.C30 + info.sin_phi * C.C31;
  const float_v F4 = info.cos_phi * C.C40 + info.sin_phi * C.C41;
  const float_v HCH = F0 * info.cos_phi + F1 * info.sin_phi;      // HCHᵀ
  const float_v wi = 1.f / (info.sigma2 + HCH);
  const float_v zetawi = residual * wi;                           // (V + HCHᵀ)⁻¹ ζ
  const float_v K0 = F0 * wi;
  const float_v K1 = F1 * wi;
  const float_v K2 = F2 * wi;
  const float_v K3 = F3 * wi;
  const float_v K4 = F4 * wi;
  track. x -= F0 * zetawi;                                        // r -= CHᵀ (V + HCHᵀ)⁻¹ ζ
  track. y -= F1 * zetawi;
  track.tx -= F2 * zetawi;
  track.ty -= F3 * zetawi;
  track.qp -= F4 * zetawi;
  C.C00 -= K0 * F0;                                               // C -= CHᵀ (V + HCHᵀ)⁻¹ HC
  C.C10 -= K1 * F0;
  C.C11 -= K1 * F1;
  C.C20 -= K2 * F0;
  C.C21 -= K2 * F1;
  C.C22 -= K2 * F2;
  C.C30 -= K3 * F0;
  C.C31 -= K3 * F1;
  C.C32 -= K3 * F2;
  C.C33 -= K3 * F3;
  C.C40 -= K4 * F0;
  C.C41 -= K4 * F1;
  C.C42 -= K4 * F2;
  C.C43 -= K4 * F3;
  C.C44 -= K4 * F4;
  track.chi2 += residual * zetawi;                                // χ² += ζ (V + HCHᵀ)⁻¹ ζ
  track.NDF += 1;
}
</pre>

In the <code>Filter</code> function a new measurement (<code>m</code>) is added to the <code>Track</code> state.
In the data structures used here, the objects each contain the data of <code>float_v::size</code> tracks.
So instead of working with one track at a time, the code explicitly states that multiple tracks can be filtered together and that their data is stored interleaved in memory.
(The memory layout of a <code>Track</code> object for tracks t<sub>n</sub> (with e.g. SSE) looks like this:
<code>[x<sub>t<sub>0</sub></sub> x<sub>t<sub>1</sub></sub> x<sub>t<sub>2</sub></sub> x<sub>t<sub>3</sub></sub>
y<sub>t<sub>0</sub></sub> y<sub>t<sub>1</sub></sub> y<sub>t<sub>2</sub></sub> y<sub>t<sub>3</sub></sub>
tx<sub>t<sub>0</sub></sub> ...
]</code>
With AVX:
<code>[x<sub>t<sub>0</sub></sub> x<sub>t<sub>1</sub></sub> x<sub>t<sub>2</sub></sub> x<sub>t<sub>3</sub></sub>
x<sub>t<sub>4</sub></sub> x<sub>t<sub>5</sub></sub> x<sub>t<sub>6</sub></sub> x<sub>t<sub>7</sub></sub>
y<sub>t<sub>0</sub></sub> y<sub>t<sub>1</sub></sub> y<sub>t<sub>2</sub></sub> y<sub>t<sub>3</sub></sub>
y<sub>t<sub>4</sub></sub> ...
]</code>)

<!-- requires masks and conditional assignment
<p>
The following example implements the inner loop of the Mandelbrot algorithm.
<pre class="example">
uint_v mandelbrotColorAt(float_v x, float y)
{
  const std::complex&lt;float_v&gt; c(x, y);
  std::complex&lt;float_v&gt; z = c;
  uint_v n = 0u;
  float_m inside = std::norm(z) &lt; S;
  while (any_of(inside &amp;&amp; n &lt; maxIt)) {
    z = z * z + c;
    n(inside) += 1;
    inside = std::norm(z) &lt; S;
  }
  uint_v colorValue = (255 - n) * 0x10101;
}
</pre>
-->

<h2><a name="Acknowledgements">Acknowledgements</a></h2>
<ul>
<li>Thanks to Chandler Carruth for a good discussion about SIMD in C++ and comments on the structure and focus of this proposal.
<li>Thanks to Matthias Bach, David Rohr, and Volker Lindenstruth for API discussions and encouragement.
<li>Thanks to Igor Kulakov and Maksym Zyzak for all the feedback about usage of Vc in RHIC software.
<li>This work was supported by GSI Helmholtzzentrum für Schwerionenforschung and
  the Hessian LOEWE initiative through the Helmholtz International Center for FAIR (HIC for FAIR).
</ul>

<h2><a name="References">References</a></h2>
<a name="cite1">[1]</a> <a href="http://dx.doi.org/10.1002/spe.1149">Kretz, M. and Lindenstruth, V. 2011. <i>Vc: A C++ library for explicit vectorization.</i> Software: Practice and Experience. (2011).</a><br/>
<a name="cite2">[2]</a> Vc Repository <a href="http://code.compeng.uni-frankfurt.de/projects/vc">http://code.compeng.uni-frankfurt.de/projects/vc</a><br/>

