<html>
<head><title>A constexpr bitwise operations library for C++</title></head>
<body>
<h1>A constexpr bitwise operations library for C++</h1>

<ul>
<li>Document Number: N3864</li>
<li>Date: 2014-01-08</li>
<li>Programming Language C++, Numerics Working Group</li>
<li>Reply-to: Matthew Fioravante <a href="&#109;&#x61;&#105;&#x6C;&#116;&#x6F;:&#x66;&#109;&#x61;&#116;&#116;&#104;&#x65;&#119;&#x35;&#56;&#x37;&#54;&#64;&#x67;&#x6D;&#x61;i&#108;&#46;&#x63;&#x6F;&#109;">&#x66;&#109;&#x61;&#116;&#116;&#104;&#x65;&#119;&#x35;&#56;&#x37;&#54;&#64;&#x67;&#x6D;&#x61;i&#108;&#46;&#x63;&#x6F;&#109;</a></li>
</ul>

<h1>Introduction</h1>

<p>This proposal adds support for low level bitwise and logical operations to C++.</p>

<h1>Impact on the standard</h1>

<p>This proposal is a pure library extension. 
It does not require any changes in the core language and does not depend on any other library extensions.
The proposal is composed entirely of free functions. The proposed functions are added to the <code>&lt;cmath&gt;</code> and <code>&lt;memory&gt;</code>
headers. No new headers are introduced.</p>

<p>While this proposal can be implemented entirely in standard C++14,
optimal implementations will require additional support from the compiler to detect and
replace function calls with native instructions when available.
See [<a href="#BitOpsRef">BitOpsRef</a>] for a reference implementation written in C++14.</p>

<h1>Motivation</h1>

<p>The C and C++ languages provide an abstraction over the machine.
The machine provides the common arithmetic and logical operations, which are
accessed using the built in operators inherited from C. These operations are the primitives
which are used to implement higher level abstractions.</p>

<p>We construct algorithms by combining these basic operations.
Sometimes significant performance benefits can be gained by
directly manipulating the binary quantities contained
within the registers which make up this numerical abstraction.
Many online and print references including [<a href="#Anderson01">Anderson01</a>],
[<a href="#Dietz01">Dietz01</a>], [<a href="#Neumann01">Neumann01</a>], [<a href="#Warren01">Warren01</a>], and [<a href="#HACKMEM">HACKMEM</a>]
are devoted to discovering these algorithms and implementing
them efficiently.</p>

<p>Hardware vendors have understood the importance of high performance
bitwise manipulation routines. Many of them have provided
additional hardware which can perform these bitwise operations directly with
a single instruction. These instructions are often much more efficient than
computing the algorithm manually in C or assembly.</p>

<p>Other bitwise manipulation algorithms can be implemented using
clever but non-intuitive combinations of arithmetic and logical operations.
Most importantly, for some bitwise algorithms, the most
efficient implementation varies between hardware platforms.
These differences create an unreasonably large maintenance burden on the programmer
who wishes to write efficient and portable code.</p>

<p>As a motivating example, consider the various implementations of
the count trailing zeroes algorithm presented in [<a href="#Kostjuchenko01">Kostjuchenko01</a>].
In order to implement an SSE2 optimized <code>strlen()</code> function,
the author had to implement, test, and profile many different versions of
count trailing zeroes.  None of them take advantage of native instructions.</p>

<p>One who wishes to exploit 
the <code>bsf</code> or <code>tzcnt</code> instructions on Intel must rely on non-standard
compiler intrinsics or inline assembly. One must also provide 
backup implementations for other platforms which do not have such instructions.
Adding support for native instructions requires a nest of <code>#ifdef</code>s and
deep knowledge of different processor architectures. This is a heavy cost
in programmer time.</p>

<p>Bitwise algorithms are general purpose tools which can be used in a wide
variety of domains and are the key to unlocking high performance in many
important algorithms. A bitwise operations library has been badly needed in 
the C and C++ standard libraries for many years.</p>

<p>We present a bitwise operations library which exposes these native instructions
wherever possible and provides backup implementations in C++ if they do not exist.
Because this library would be part of the standard, these commonly used routines
could be implemented once and profiled on each platform. Finally, this library 
offers an interface which takes full advantage of the latest processor features
including ARMv8 and Intel BMI2.</p>

<h1>Design Goals and Scope</h1>

<p>There are seemingly endless ways one can manipulate binary quantities. How does one go
about choosing which ones to include in a library and which ones to exclude? 
How does one choose proper names for each function? Which algorithms can
be trivially converted into single instructions by the optimizer and which
actually require the programmer to declare their use through a function call?
We will address these questions with the following design goals.</p>

<h2>Design Goal 1: Provide the programmer with better access to the machine</h2>

<p>In 1970, the Digital Equipment Corporation announced the PDP11. 
This 16 bit machine has 3 instructions of interest, <code>ROR</code> (rotate right), 
<code>ROL</code> (rotate left), and <code>SWAB</code> (swap bytes).
These operations along with their later 32 and 64 bit variants are provided 
by many more modern machines as will be shown. As of 2013,
the programmer still does not have direct access to these instructions in modern C and C++.</p>

<p>Therefore, the first and most important goal of this proposal is to provide the programmer with better
access to the machine via a new set of primitives which go beyond simple arithmetic
and logical operations. We will present new functions for the standard library which can be implemented
using only a few instructions if supported by the machine, using backup implementations
if no such support is provided by the hardware.</p>

<h2>Design Goal 2: Provide a reusable library of generic bitwise manipulation routines</h2>

<p>In designing this proposal, we wish not just to limit ourselves to operations which may have
native machine instruction implementations on at least one platform. We would like to provide
a standard library of primitives which are commonly found to be reimplemented time and time again in different code bases.
The standard library already provides a rich set of generic containers and algorithms. What is missing is a 
set of bitwise manipulation primitives.</p>

<p>Of particular emphasis are algorithms whose most efficient implementations depend on the implementations of
other bitwise operations. A motivating example is checking whether a number is a power of 2. </p>

<p>Consider the following implementations:</p>

<pre><code>bool ispow2(unsigned x) { return popcount(x) == 1; }
bool ispow2(unsigned x) { return x != 0 &amp;&amp; (x &amp; (x -1)) == 0; }
</code></pre>

<p>In the above example, <code>popcount()</code> is the population count or number of 1 bits in <code>x</code>. 
On a machine with a popcount instruction, the first implementation uses fewer instructions
and no branches. Without a popcount instruction, the second version is the better choice
as computing popcount requires much more than a few logical operations and comparisons 
[<a href="#Dietz01">Dietz01</a>]. In order to implement <code>ispow2()</code>, the programmer is faced with
the same set of dilemmas as with the count trailing zeroes example from the <a href="#motivation">Motivation</a>
section.</p>

<h1>Glossary of Terms</h1>

<p>The following terminology is used in the remainder of this document to describe the technical aspects of this proposal.</p>

<ul>
<li><em>set</em>: If we say that a bit has been "set", we mean that we will change its value to 1. We can also say "set the bit to x", which has the obvious meaning of changing the value to 0 or 1, depending on the value of x.</li>
<li><em>reset</em>: To reset a bit is to change its value to 0.</li>
<li><em>flip</em>: To flip a bit is to invert its value. That is, set the bit if it is currently 0 and reset the bit if it is currently 1.</li>
<li><em>test</em>: To test a bit is to return <code>true</code> if its value is 1, otherwise return <code>false</code>.</li>
<li><em>subword</em>: A collection of contiguous bits of a given size. Some commonly found examples:
<ul>
<li><em>nibble</em>: a subword of size 4</li>
<li><em>byte</em>: a subword of size <code>CHAR_BIT</code>, usually 8</li>
<li><em>word</em>: Depending on the platform terminology, often a subword of size 16, 32 or 64. </li>
</ul></li>
<li><em>most significant bit (msb)</em>: The high order bit in a binary quantity.</li>
<li><em>most significant X bit (msXb)</em>: The highest order bit in a binary quantity with a value of X.</li>
<li><em>least significant bit (lsb)</em>: The low order bit in a binary quantity.</li>
<li><em>least significant X bit (lsXb)</em>: The lowest order bit in a binary quantity with a value of X.</li>
<li><code>~T(0)</code>: This expression represents a quantity of type <code>T</code> where all of the bits are 1. We avoid the more commonly used <code>T(-1)</code> as it assumes 2's complement signed integers and is a little less intuitive to the uninitiated.</li>
</ul>

<h1>Technical Specification</h1>

<p>We will now describe the additions to <code>&lt;cmath&gt;</code> and <code>&lt;memory&gt;</code>. This is a procedural library implemented
entirely using <code>constexpr</code> templated free functions.
In addition, each function has been qualified with <code>noexcept</code>. These
operations are most often used in highly optimized numerical code where the overhead of exception
handling would be inappropriate. For functions which have pre-conditions on their inputs, we
have opted for undefined return values if these pre-conditions are ever violated.
The functions are classified into different groups to aid analysis and discussion
and each group will be presented one at a time. </p>

<p>We have chosen to support all signed and unsigned integral types in this proposal. It
is often suggested that signed integers represent "numbers" and unsigned integers
represent "bit fields" and that they should never be used together. While we 
generally agree with this philosophy, many of these algorithms have real use cases
for signed and unsigned integral values.
The primary danger of using both
signed and unsigned integers comes from the pitfalls of comparing signed and unsigned values. None of the functions in this
proposal require or encourage comparing or combining of signed and unsigned types.
The template arguments for each proposed function are named <code>integral</code> to indicate generic support
for all builtin integral types, signed and unsigned. Functions which take more than one
integral argument of different types will use a single letter suffix, for example
<code>integrall</code> and <code>integralr</code> for left and right hand arguments.</p>

<p>With regard to signed integers, this proposal does not require signed integers be
implemented using 2's complement. However, the design of this proposal considers
the practical reality that almost all modern hardware does in fact use 2's complement.
All example code, including the reference implementation [<a href="#BitOpsRef">BitOpsRef</a>],
assumes 2's complement signed integers with undefined behavior on overflow and underflow.
Adding support for other signed representations is an exercise left for the reader.</p>

<p>Each section will describe the full technical specifications of the functions in that group, noting their
return values and undefined behavior if any. We will also discuss the background
and justification for each of the functions and list some applications where
necessary. For the more complicated algorithms, examples will be provided to
help illustrate how they work.</p>

<h2>cmath Header Additions</h2>

<p>The following sections describe the additions to the <code>&lt;cmath&gt;</code> header.</p>

<h3>Explicit shifts</h3>

<p>Bit shifting is provided in C++ with <code>operator&lt;&lt;</code> and <code>operator&gt;&gt;</code> for integral types. It is
a very simplistic abstraction with many deficiencies and some subtle caveats.</p>

<p>First, as noted earlier, there is no primitive for rotational shifts even though these shifts can be found in the instruction 
set of almost every machine. Second, 
<code>operator&gt;&gt;</code> for signed types has implementation-defined behavior with regard to filling in the high order bits, making
it nearly useless when writing portable code.
Writing a portable arithmetic right shift is cumbersome at best and inefficient at worst. Finally, performing a logical right shift on a signed
quantity is also cumbersome because it requires casts which obscure the meaning of the code.</p>
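<p>To make the last two points concrete, the following is a minimal sketch of what a programmer might write by hand today. The namespace <code>sketch</code> and the function bodies are illustrative assumptions, not the reference implementation. The sketch assumes <code>0 &lt;= s</code> with <code>s</code> less than the bit width of the type, and that converting an out-of-range unsigned value to a signed type wraps as it does on 2's complement hardware.</p>

<pre><code>#include &lt;climits&gt;
#include &lt;cstdint&gt;
#include &lt;type_traits&gt;

namespace sketch {  // hypothetical namespace used only for illustration

// Logical right shift: the vacated high order bits are always 0.
template &lt;class T&gt;
constexpr T shlr(T x, int s) noexcept {
    using U = typename std::make_unsigned&lt;T&gt;::type;
    return static_cast&lt;T&gt;(static_cast&lt;U&gt;(x) &gt;&gt; s);
}

// Arithmetic right shift: the vacated high order bits copy the msb of x.
template &lt;class T&gt;
constexpr T shar(T x, int s) noexcept {
    using U = typename std::make_unsigned&lt;T&gt;::type;
    const U ux = static_cast&lt;U&gt;(x);
    const bool msb = ((ux &gt;&gt; (sizeof(T) * CHAR_BIT - 1)) &amp; 1) != 0;
    // If the msb is 1, shift the complement (0 bits enter at the top)
    // and complement the result so that 1 bits appear instead.
    return static_cast&lt;T&gt;(msb ? static_cast&lt;U&gt;(~(static_cast&lt;U&gt;(~ux) &gt;&gt; s))
                              : static_cast&lt;U&gt;(ux &gt;&gt; s));
}

}  // namespace sketch

static_assert(sketch::shar&lt;std::int32_t&gt;(-8, 1) == -4, "the sign bit is replicated");
static_assert(sketch::shlr&lt;std::int32_t&gt;(-8, 1) == 0x7FFFFFFC, "a 0 bit is shifted in");
</code></pre>

<p>Both functions need a cast to the unsigned type and back, which is the kind of noise the proposed <code>shlr()</code> and <code>shar()</code> are meant to hide.</p>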

<h4>List of Functions</h4>

<pre><code>//SHift Logical Left
template &lt;class integral&gt;
constexpr integral shll(integral x, int s) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x &lt;&lt; s</code></li>
<li><em>Remarks:</em> result is undefined if <code>s &lt; 0 || s &gt; sizeof(x) * CHAR_BIT</code></li>
</ul>

<!-- -->

<pre><code>//SHift Logical Right
template &lt;class integral&gt;
constexpr integral shlr(integral x, int s) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with all of its bits shifted right by <code>s</code> positions. The <code>s</code> high order bits of the result are reset.</li>
<li><em>Remarks:</em> result is undefined if <code>s &lt; 0 || s &gt; sizeof(x) * CHAR_BIT</code></li>
</ul>

<!-- -->

<pre><code>//SHift Arithmetic Left
template &lt;class integral&gt;
constexpr integral shal(integral x, int s) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x &lt;&lt; s</code></li>
<li><em>Remarks:</em> result is undefined if <code>s &lt; 0 || s &gt; sizeof(x) * CHAR_BIT</code></li>
<li><em>Remarks:</em> This function is identical to <code>shll()</code> and is only provided for symmetry.</li>
</ul>

<!-- -->

<pre><code>//SHift Arithmetic Right
template &lt;class integral&gt;
constexpr integral shar(integral x, int s) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with all of its bits shifted right by <code>s</code> positions. The <code>s</code> high order bits of the result are set to the value of the most significant bit of <code>x</code>.</li>
<li><em>Remarks:</em> result is undefined if <code>s &lt; 0 || s &gt; sizeof(x) * CHAR_BIT</code></li>
</ul>

<!-- -->

<pre><code>//ROTate Left
template &lt;class integral&gt;
constexpr integral rotl(integral x, int s) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with all of its bits shifted left by <code>s</code> positions.
The <code>s</code> low order bits are set to the <code>s</code> high order bits of <code>x</code>.</li>
<li><em>Remarks:</em> result is undefined if <code>s &lt; 0 || s &gt; sizeof(x) * CHAR_BIT</code></li>
</ul>

<!-- -->

<pre><code>//ROTate Right
template &lt;class integral&gt;
constexpr integral rotr(integral x, int s) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with all of its bits shifted right by <code>s</code> positions.
The <code>s</code> high order bits are set to the <code>s</code> low order bits of <code>x</code>.</li>
<li><em>Remarks:</em> result is undefined if <code>s &lt; 0 || s &gt; sizeof(x) * CHAR_BIT</code></li>
</ul>
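<p>As an illustration of the rotate semantics, here is one possible portable fallback for unsigned types. It is a sketch under stated assumptions, not the reference implementation: the namespace <code>sketch</code> is hypothetical, and the masking trick assumes the bit width of the type is a power of 2. Masking both shift counts avoids shifting by the full width when <code>s</code> is 0 or equal to the width.</p>

<pre><code>#include &lt;climits&gt;
#include &lt;cstdint&gt;
#include &lt;type_traits&gt;

namespace sketch {  // hypothetical namespace used only for illustration

template &lt;class U&gt;
constexpr U rotl(U x, int s) noexcept {
    static_assert(std::is_unsigned&lt;U&gt;::value, "sketch covers unsigned types only");
    constexpr int mask = sizeof(U) * CHAR_BIT - 1;  // width assumed to be a power of 2
    // Bits pushed out at the top re-enter at the bottom.
    return static_cast&lt;U&gt;((x &lt;&lt; (s &amp; mask)) | (x &gt;&gt; (-s &amp; mask)));
}

template &lt;class U&gt;
constexpr U rotr(U x, int s) noexcept {
    static_assert(std::is_unsigned&lt;U&gt;::value, "sketch covers unsigned types only");
    constexpr int mask = sizeof(U) * CHAR_BIT - 1;  // width assumed to be a power of 2
    // Bits pushed out at the bottom re-enter at the top.
    return static_cast&lt;U&gt;((x &gt;&gt; (s &amp; mask)) | (x &lt;&lt; (-s &amp; mask)));
}

}  // namespace sketch

static_assert(sketch::rotl&lt;std::uint8_t&gt;(0x81, 1) == 0x03, "msb wraps around to the lsb");
static_assert(sketch::rotr&lt;std::uint32_t&gt;(0x12345678u, 8) == 0x78123456u, "low byte moves to the top");
</code></pre>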

<h3>Bit Counting Algorithms</h3>

<p>Bit counting is used to construct efficient implementations of other higher level algorithms.
Many of these operations have native support on a wide variety of modern
and antiquated hardware.
Some example applications will be provided below.</p>

<h4>List of functions</h4>

<pre><code>//CouNT Trailing 0's
template &lt;class integral&gt;
constexpr int cntt0(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The number of trailing 0 bits in <code>x</code>, or <code>sizeof(x) * CHAR_BIT</code> if <code>x == 0</code>.</li>
</ul>

<!-- -->

<pre><code>//CouNT Leading 0's
template &lt;class integral&gt;
constexpr int cntl0(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The number of leading 0 bits in <code>x</code>, or <code>sizeof(x) * CHAR_BIT</code> if <code>x == 0</code>.</li>
</ul>

<!-- -->

<pre><code>//CouNT Trailing 1's
template &lt;class integral&gt;
constexpr int cntt1(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The number of trailing 1 bits in <code>x</code>, or <code>sizeof(x) * CHAR_BIT</code> if <code>x == ~integral(0)</code>.</li>
</ul>

<!-- -->

<pre><code>//CouNT Leading 1's
template &lt;class integral&gt;
constexpr int cntl1(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The number of leading 1 bits in <code>x</code>, or <code>sizeof(x) * CHAR_BIT</code> if <code>x == ~integral(0)</code>.</li>
</ul>

<!-- -->

<pre><code>//POPulation COUNT
template &lt;class integral&gt;
constexpr int popcount(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The number of 1 bits in <code>x</code>.</li>
</ul>

<!-- -->

<pre><code>//PARITY
template &lt;class integral&gt;
constexpr int parity(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> 1 if the number of 1 bits in <code>x</code> is odd, otherwise 0.</li>
</ul>
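<p>The return values above can be checked against a straightforward loop-based fallback. The following sketch is illustrative only: the namespace <code>sketch</code> is hypothetical, the code covers unsigned types, and a production implementation would be expected to use native instructions or faster table or divide-and-conquer techniques instead.</p>

<pre><code>#include &lt;climits&gt;
#include &lt;cstdint&gt;

namespace sketch {  // hypothetical namespace used only for illustration

// Number of 1 bits: clear the lowest 1 bit until nothing is left.
template &lt;class U&gt;
constexpr int popcount(U x) noexcept {
    int n = 0;
    for (; x != 0; x = static_cast&lt;U&gt;(x &amp; (x - 1))) ++n;
    return n;
}

// Trailing 0 bits: shift right until the lowest bit is 1.
template &lt;class U&gt;
constexpr int cntt0(U x) noexcept {
    if (x == 0) return static_cast&lt;int&gt;(sizeof(U) * CHAR_BIT);
    int n = 0;
    for (; (x &amp; 1) == 0; x = static_cast&lt;U&gt;(x &gt;&gt; 1)) ++n;
    return n;
}

// Leading 0 bits: start at the width and subtract one per occupied position.
template &lt;class U&gt;
constexpr int cntl0(U x) noexcept {
    int n = static_cast&lt;int&gt;(sizeof(U) * CHAR_BIT);
    for (; x != 0; x = static_cast&lt;U&gt;(x &gt;&gt; 1)) --n;
    return n;
}

// The 1-counting variants and parity reduce to the functions above.
template &lt;class U&gt; constexpr int cntt1(U x) noexcept { return cntt0(static_cast&lt;U&gt;(~x)); }
template &lt;class U&gt; constexpr int cntl1(U x) noexcept { return cntl0(static_cast&lt;U&gt;(~x)); }
template &lt;class U&gt; constexpr int parity(U x) noexcept { return popcount(x) &amp; 1; }

}  // namespace sketch

static_assert(sketch::cntt0&lt;std::uint32_t&gt;(8) == 3, "0b1000 has three trailing 0 bits");
static_assert(sketch::cntl0&lt;std::uint32_t&gt;(0) == 32, "zero input yields the bit width");
</code></pre>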

<h4>Applications</h4>

<p>One application of <code>cntt0</code> is in computing the greatest common divisor of 2 numbers. Credit goes
to Howard Hinnant for bringing this to our attention.</p>

<pre><code>// T is assumed to be an unsigned integral type
template &lt;typename T&gt;
T gcd(T x, T y)
{ 
    if (x == 0) 
        return y; 
    if (y == 0) 
        return x; 
    int cf2 = std::cntt0(x | y); 
    x &gt;&gt;= std::cntt0(x); 
    while (true) 
    { 
        y &gt;&gt;= std::cntt0(y); 
        if (x == y) 
            break; 
        if (x &gt; y) 
            std::swap(x, y); 
        if (x == 1) 
            break; 
        y -= x; 
    } 
    return x &lt;&lt; cf2; 
}
</code></pre>

<p>As mentioned earlier, we can use <code>popcount()</code> to detect whether or not an integer (signed or unsigned) is a power of 2.</p>

<pre><code>template &lt;typename integral&gt;
bool ispow2(integral x) {
  return x &gt; 0 &amp;&amp; popcount(x) == 1;
}
</code></pre>

<h3>Rightmost bit manipulation</h3>

<p>The following functions perform simple manipulations on the rightmost bits of the given quantity.
All of these operations can be trivially implemented using a few arithmetic and logical operations. Therefore,
these functions are only provided as usability wrappers in order to allow the programmer to avoid having
to spend time looking them up, reimplementing, and/or unit testing them. Because of their simple implementations,
we have included the implementations in the function listing. Credit goes to [<a href="#Warren01">Warren01</a>] and
[<a href="#ChessProg">ChessProg</a>] for providing these implementations and insight into their usage.</p>

<p>Most of the operations in this section are implemented in hardware on Intel and AMD
machines which have Intel BMI and/or AMD TBM extensions 
(see <a href="#survey-of-hardware-support">Survey of Hardware Support</a>).
All of these functions were tested on gcc 4.8 using the provided C++ implementations.
We found that on BMI and TBM enabled compilation, the optimizer
was able to compile each C++ expression into a single BMI or TBM instruction.
Therefore an implementation of this section can simply use the provided trivial implementations
and rely on the optimizer for hardware support if available.</p>

<h4>List of Functions</h4>

<pre><code>//ReSeT Least Significant 1 Bit
template &lt;class integral&gt;
constexpr integral rstls1b(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with its least significant 1 bit reset, or 0 if <code>x == 0</code>.</li>
<li><em>Implementation:</em> <code>x &amp; (x - 1)</code></li>
</ul>

<!-- -->

<pre><code>//SET Least Significant 0 Bit
template &lt;class integral&gt;
constexpr integral setls0b(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with its least significant 0 bit set, or <code>x</code> if <code>x == ~integral(0)</code>.</li>
<li><em>Implementation:</em> <code>x | (x + 1)</code></li>
</ul>

<!-- -->

<pre><code>//ISOlate Least Significant 1 Bit
template &lt;class integral&gt;
constexpr integral isols1b(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> A quantity where the bit in the position of the least significant 1 bit of <code>x</code> is set and all of the other bits reset. Returns 0 if <code>x == 0</code>.</li>
<li><em>Implementation:</em> <code>x &amp; (-x)</code></li>
</ul>

<!-- -->

<pre><code>//ISOlate Least Significant 0 Bit
template &lt;class integral&gt;
constexpr integral isols0b(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> A quantity where the bit in the position of the least significant 0 bit of <code>x</code> is set and all of the other bits reset. Returns 0 if <code>x == ~integral(0)</code>.</li>
<li><em>Implementation:</em> <code>(~x) &amp; (x + 1)</code></li>
</ul>

<!-- -->

<pre><code>//ReSeT Trailing 1's
template &lt;class integral&gt;
constexpr integral rstt1(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with all of its trailing 1 bits reset, or 0 if <code>x == ~integral(0)</code>.</li>
<li><em>Implementation:</em> <code>x &amp; (x + 1)</code></li>
</ul>

<!-- -->

<pre><code>//SET Trailing 0's
template &lt;class integral&gt;
constexpr integral sett0(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>x</code> with all of its trailing 0 bits set, or <code>~integral(0)</code> if <code>x == 0</code>.</li>
<li><em>Implementation:</em> <code>x | (x - 1)</code></li>
</ul>

<!-- -->

<pre><code>//MaSK Trailing 0's
template &lt;class integral&gt;
constexpr integral maskt0(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> a quantity where all of the bits corresponding to the trailing 0 bits of <code>x</code> are set, the remaining bits reset.</li>
<li><em>Implementation:</em> <code>(~x) &amp; (x - 1)</code></li>
</ul>

<!-- -->

<pre><code>//MaSK Trailing 1's
template &lt;class integral&gt;
constexpr integral maskt1(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> a quantity where all of the bits corresponding to the trailing 1 bits of <code>x</code> are set, the remaining bits reset.</li>
<li><em>Implementation:</em> <code>~((~x) | (x + 1))</code></li>
</ul>

<!-- -->

<pre><code>//MaSK Trailing 0's and Least Significant 1 Bit
template &lt;class integral&gt;
constexpr integral maskt0ls1b(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> a quantity where all of the bits corresponding to the trailing 0 bits and the least significant 1 bit of <code>x</code> are set, the remaining bits reset.</li>
<li><em>Implementation:</em> <code>x ^ (x - 1)</code></li>
</ul>

<!-- -->

<pre><code>//MaSK Trailing 1's and Least Significant 0 Bit
template &lt;class integral&gt;
constexpr integral maskt1ls0b(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> a quantity where all of the bits corresponding to the trailing 1 bits and the least significant 0 bit of <code>x</code> are set, the remaining bits are reset.</li>
<li><em>Implementation:</em> <code>x ^ (x + 1)</code></li>
</ul>
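<p>The effect of each function is easiest to see on concrete values. The sketch below wraps the one-line expressions for unsigned types and checks them on two 8-bit inputs, <code>0x58</code> (<code>01011000</code>) and <code>0xA7</code> (<code>10100111</code>). The namespace <code>sketch</code> is hypothetical, the casts only undo integer promotion for narrow types, and isolating the least significant 1 bit is written here as <code>x &amp; (0 - x)</code>.</p>

<pre><code>#include &lt;cstdint&gt;

namespace sketch {  // hypothetical namespace used only for illustration

template &lt;class U&gt; constexpr U rstls1b(U x) noexcept    { return static_cast&lt;U&gt;(x &amp; (x - 1)); }
template &lt;class U&gt; constexpr U setls0b(U x) noexcept    { return static_cast&lt;U&gt;(x | (x + 1)); }
template &lt;class U&gt; constexpr U isols1b(U x) noexcept    { return static_cast&lt;U&gt;(x &amp; (0 - x)); }  // x &amp; -x
template &lt;class U&gt; constexpr U isols0b(U x) noexcept    { return static_cast&lt;U&gt;(~x &amp; (x + 1)); }
template &lt;class U&gt; constexpr U rstt1(U x) noexcept      { return static_cast&lt;U&gt;(x &amp; (x + 1)); }
template &lt;class U&gt; constexpr U sett0(U x) noexcept      { return static_cast&lt;U&gt;(x | (x - 1)); }
template &lt;class U&gt; constexpr U maskt0(U x) noexcept     { return static_cast&lt;U&gt;(~x &amp; (x - 1)); }
template &lt;class U&gt; constexpr U maskt1(U x) noexcept     { return static_cast&lt;U&gt;(~(~x | (x + 1))); }
template &lt;class U&gt; constexpr U maskt0ls1b(U x) noexcept { return static_cast&lt;U&gt;(x ^ (x - 1)); }
template &lt;class U&gt; constexpr U maskt1ls0b(U x) noexcept { return static_cast&lt;U&gt;(x ^ (x + 1)); }

}  // namespace sketch

using u8 = std::uint8_t;

// x = 0x58 = 01011000: three trailing 0 bits, lowest 1 bit at position 3.
static_assert(sketch::rstls1b&lt;u8&gt;(0x58) == 0x50, "01010000");
static_assert(sketch::isols1b&lt;u8&gt;(0x58) == 0x08, "00001000");
static_assert(sketch::sett0&lt;u8&gt;(0x58) == 0x5F, "01011111");
static_assert(sketch::maskt0&lt;u8&gt;(0x58) == 0x07, "00000111");
static_assert(sketch::maskt0ls1b&lt;u8&gt;(0x58) == 0x0F, "00001111");

// y = 0xA7 = 10100111: three trailing 1 bits, lowest 0 bit at position 3.
static_assert(sketch::setls0b&lt;u8&gt;(0xA7) == 0xAF, "10101111");
static_assert(sketch::isols0b&lt;u8&gt;(0xA7) == 0x08, "00001000");
static_assert(sketch::rstt1&lt;u8&gt;(0xA7) == 0xA0, "10100000");
static_assert(sketch::maskt1&lt;u8&gt;(0xA7) == 0x07, "00000111");
static_assert(sketch::maskt1ls0b&lt;u8&gt;(0xA7) == 0x0F, "00001111");
</code></pre>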

<h3>Single Bit Manipulation</h3>

<p>Most programmers have at some point in their career needed to index and manipulate a single bit
within a given integral quantity.
These functions are trivial to implement and are provided only for usability.</p>

<h4>List of Functions</h4>

<pre><code>//SET BIT
template &lt;class integral&gt;
constexpr integral setbit(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Sets bit at position <code>b</code> of <code>x</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
<li><em>Implementation:</em> <code>x | (integral(1) &lt;&lt; b)</code></li>
</ul>

<!-- -->

<pre><code>//ReSeT BIT
template &lt;class integral&gt;
constexpr integral rstbit(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Resets bit at position <code>b</code> of <code>x</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
<li><em>Implementation:</em> <code>x &amp; ~(integral(1) &lt;&lt; b)</code></li>
</ul>

<!-- -->

<pre><code>//FLIP BIT
template &lt;class integral&gt;
constexpr integral flipbit(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Flips bit at position <code>b</code> of <code>x</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
<li><em>Implementation:</em> <code>x ^ (integral(1) &lt;&lt; b)</code></li>
</ul>

<!-- -->

<pre><code>//TEST BIT
template &lt;class integral&gt;
constexpr bool testbit(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>true</code> if the bit at position <code>b</code> of <code>x</code> is set, otherwise <code>false</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
<li><em>Implementation:</em> <code>bool(x &amp; (integral(1) &lt;&lt; b))</code></li>
</ul>
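<p>A short worked example may help confirm the bit numbering, with position 0 being the least significant bit. This is a sketch restricted to unsigned types in a hypothetical namespace <code>sketch</code>; the casts only undo integer promotion for narrow types.</p>

<pre><code>#include &lt;cstdint&gt;

namespace sketch {  // hypothetical namespace used only for illustration

template &lt;class U&gt; constexpr U setbit(U x, int b) noexcept  { return static_cast&lt;U&gt;(x | (U(1) &lt;&lt; b)); }
template &lt;class U&gt; constexpr U rstbit(U x, int b) noexcept  { return static_cast&lt;U&gt;(x &amp; ~(U(1) &lt;&lt; b)); }
template &lt;class U&gt; constexpr U flipbit(U x, int b) noexcept { return static_cast&lt;U&gt;(x ^ (U(1) &lt;&lt; b)); }
template &lt;class U&gt; constexpr bool testbit(U x, int b) noexcept { return (x &amp; (U(1) &lt;&lt; b)) != 0; }

}  // namespace sketch

static_assert(sketch::setbit&lt;std::uint8_t&gt;(0x00, 3) == 0x08, "bit 3 becomes 1");
static_assert(sketch::rstbit&lt;std::uint8_t&gt;(0xFF, 0) == 0xFE, "bit 0 becomes 0");
static_assert(sketch::flipbit&lt;std::uint8_t&gt;(0x0F, 7) == 0x8F, "bit 7 is inverted");
static_assert(sketch::testbit&lt;std::uint8_t&gt;(0x10, 4), "bit 4 is set");
static_assert(!sketch::testbit&lt;std::uint8_t&gt;(0x10, 3), "bit 3 is not set");
</code></pre>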

<h3>Range of bits manipulation</h3>

<p>The following operations manipulate ranges of bits above or below a given index.
One of them is implemented in hardware
(see <a href="#survey-of-hardware-support">Survey of Hardware Support</a>),
the rest are provided for usability and completeness.</p>

<h4>List of functions</h4>

<pre><code>//ReSeT BITS Greater than or Equal to
template &lt;class integral&gt;
constexpr integral rstbitsge(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Reset all bits of <code>x</code> in positions greater than or equal to <code>b</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
</ul>

<!-- -->

<pre><code>//ReSeT BITS Less than or Equal to
template &lt;class integral&gt;
constexpr integral rstbitsle(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Reset all bits of <code>x</code> in positions less than or equal to <code>b</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
</ul>

<!-- -->

<pre><code>//SET BITS Greater than or Equal to
template &lt;class integral&gt;
constexpr integral setbitsge(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Set all bits of <code>x</code> in positions greater than or equal to <code>b</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
</ul>

<!-- -->

<pre><code>//SET BITS Less than or Equal to
template &lt;class integral&gt;
constexpr integral setbitsle(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Set all bits of <code>x</code> in positions less than or equal to <code>b</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
</ul>

<!-- -->

<pre><code>//FLIP BITS Greater than or Equal to
template &lt;class integral&gt;
constexpr integral flipbitsge(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Flip all bits of <code>x</code> in positions greater than or equal to <code>b</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
</ul>

<!-- -->

<pre><code>//FLIP BITS Less than or Equal to
template &lt;class integral&gt;
constexpr integral flipbitsle(integral x, int b) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Flip all bits of <code>x</code> in positions less than or equal to <code>b</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>b &lt; 0 || b &gt;= sizeof(x) * CHAR_BIT</code>.</li>
</ul>
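<p>No implementations are listed for this group, so the following sketch shows one possible reading of the specifications for unsigned types. Everything in it is an assumption made for illustration: the namespace <code>sketch</code> and the helpers <code>mask_lt</code> and <code>mask_le</code> are hypothetical. The helpers are written so that no shift ever reaches the full width of the type, even when <code>b</code> is the highest valid position.</p>

<pre><code>#include &lt;cstdint&gt;

namespace sketch {  // hypothetical namespace used only for illustration

// Mask with 1 bits in all positions strictly below b.
template &lt;class U&gt; constexpr U mask_lt(int b) noexcept { return static_cast&lt;U&gt;((U(1) &lt;&lt; b) - 1); }
// Mask with 1 bits in all positions up to and including b.
template &lt;class U&gt; constexpr U mask_le(int b) noexcept { return static_cast&lt;U&gt;((U(1) &lt;&lt; b) | mask_lt&lt;U&gt;(b)); }

template &lt;class U&gt; constexpr U rstbitsge(U x, int b) noexcept  { return static_cast&lt;U&gt;(x &amp; mask_lt&lt;U&gt;(b)); }
template &lt;class U&gt; constexpr U rstbitsle(U x, int b) noexcept  { return static_cast&lt;U&gt;(x &amp; ~mask_le&lt;U&gt;(b)); }
template &lt;class U&gt; constexpr U setbitsge(U x, int b) noexcept  { return static_cast&lt;U&gt;(x | ~mask_lt&lt;U&gt;(b)); }
template &lt;class U&gt; constexpr U setbitsle(U x, int b) noexcept  { return static_cast&lt;U&gt;(x | mask_le&lt;U&gt;(b)); }
template &lt;class U&gt; constexpr U flipbitsge(U x, int b) noexcept { return static_cast&lt;U&gt;(x ^ ~mask_lt&lt;U&gt;(b)); }
template &lt;class U&gt; constexpr U flipbitsle(U x, int b) noexcept { return static_cast&lt;U&gt;(x ^ mask_le&lt;U&gt;(b)); }

}  // namespace sketch

// x = 0xB6 = 10110110, b = 4
static_assert(sketch::rstbitsge&lt;std::uint8_t&gt;(0xB6, 4) == 0x06, "00000110");
static_assert(sketch::rstbitsle&lt;std::uint8_t&gt;(0xB6, 4) == 0xA0, "10100000");
static_assert(sketch::setbitsge&lt;std::uint8_t&gt;(0xB6, 4) == 0xF6, "11110110");
static_assert(sketch::setbitsle&lt;std::uint8_t&gt;(0xB6, 4) == 0xBF, "10111111");
static_assert(sketch::flipbitsge&lt;std::uint8_t&gt;(0xB6, 4) == 0x46, "01000110");
static_assert(sketch::flipbitsle&lt;std::uint8_t&gt;(0xB6, 4) == 0xA9, "10101001");
</code></pre>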

<h3>Bitwise and Bytewise Permutations</h3>

<p>These functions provide a generic interface for permuting the bits and bytes in a word.
Each function takes the following form:</p>

<pre><code>template &lt;typename integral&gt;
constexpr integral permute_bits(integral x, /*arg2, arg3, ...*/ int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<p>In the above example, <code>x</code> is the value to permute, with additional arguments following it if needed.
Each function operates on subwords of size <code>subword_bits</code> measured in bits. In this case,
<code>permute_bits(x)</code> will perform the permutation on all of the bits of <code>x</code>, <code>permute_bits(x, 2)</code> will
permute each pair of bits in <code>x</code>, <code>permute_bits(x, 4)</code> each nibble in <code>x</code>, and finally <code>permute_bits(x, CHAR_BIT)</code>
will permute the bytes of <code>x</code>.</p>

<p>The <code>num_swar_words</code> parameter (Number of Simd Within A Register Words) enables parallel operation
on multiple words within <code>x</code>. The size in bits of each individual word to permute will be <code>sizeof(integral) * CHAR_BIT / num_swar_words</code>.
For example, <code>permute_bits&lt;uint32_t&gt;(x, 1, 2)</code> will independently permute the 16 high order bits of <code>x</code> 
and the 16 low order bits of <code>x</code>. Another example, <code>permute_bits&lt;uint32_t&gt;(x, 2, 4)</code> will permute the
pairs of bits in each byte (assuming <code>CHAR_BIT == 8</code>) of <code>x</code>.</p>

<p>Finally, for each bitwise permutation routine, we also provide a corresponding bytewise
permutation routine. These are provided for usability. They operate exactly like their
bitwise cousins, except that their subword size is computed in bytes instead of bits.</p>

<pre><code>template &lt;typename integral&gt;
constexpr integral permute_bytes(integral x, /*arg2, arg3, ...*/ int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<p>The bytewise routines are trivially implemented as simple wrappers over the bitwise routines where
we perform the simple conversion: <code>subword_bits = subword_bytes * CHAR_BIT</code>.</p>

<h4>List of Functions</h4>

<ul>
<li><em>Remarks:</em> For all functions defined in this section, the result is undefined if any of the following hold:
<ul>
<li><code>num_swar_words &lt;= 0 || num_swar_words &gt; sizeof(integral) * CHAR_BIT</code></li>
<li><code>(sizeof(integral) * CHAR_BIT) % num_swar_words != 0</code></li>
<li><code>subword_bits &lt;= 0 || subword_bits &gt; ((sizeof(integral) * CHAR_BIT) / num_swar_words)</code></li>
<li><code>((sizeof(integral) * CHAR_BIT) / num_swar_words) % subword_bits != 0</code></li>
</ul></li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral reverse_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently reverse the subwords of size <code>subword_bits</code> bits in each word.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral reverse_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>reverse_bits(x, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>
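<p>The interaction between <code>subword_bits</code> and <code>num_swar_words</code> can be illustrated with a naive reference loop. This is a sketch for unsigned types in a hypothetical namespace <code>sketch</code>; it moves one bit at a time for clarity and makes no attempt at the efficiency a real implementation would aim for. It assumes the arguments satisfy the preconditions stated at the start of this list.</p>

<pre><code>#include &lt;climits&gt;
#include &lt;cstdint&gt;

namespace sketch {  // hypothetical namespace used only for illustration

template &lt;class U&gt;
constexpr U reverse_bits(U x, int subword_bits = 1, int num_swar_words = 1) noexcept {
    const int total = static_cast&lt;int&gt;(sizeof(U) * CHAR_BIT);
    const int word_bits = total / num_swar_words;   // size of each independent word
    const int per_word = word_bits / subword_bits;  // subwords in each word
    U result = 0;
    for (int w = 0; w &lt; num_swar_words; ++w) {
        for (int i = 0; i &lt; per_word; ++i) {
            // Subword i of word w trades places with its mirror image in the same word.
            const int src = w * word_bits + i * subword_bits;
            const int dst = w * word_bits + (per_word - 1 - i) * subword_bits;
            for (int k = 0; k &lt; subword_bits; ++k) {
                const U bit = static_cast&lt;U&gt;((x &gt;&gt; (src + k)) &amp; 1);
                result = static_cast&lt;U&gt;(result | (bit &lt;&lt; (dst + k)));
            }
        }
    }
    return result;
}

template &lt;class U&gt;
constexpr U reverse_bytes(U x, int subword_bytes = 1, int num_swar_words = 1) noexcept {
    return reverse_bits(x, subword_bytes * CHAR_BIT, num_swar_words);
}

}  // namespace sketch

static_assert(sketch::reverse_bits&lt;std::uint8_t&gt;(0x01) == 0x80, "full bit reversal");
static_assert(sketch::reverse_bits&lt;std::uint8_t&gt;(0x1E, 4) == 0xE1, "nibble swap");
static_assert(sketch::reverse_bytes&lt;std::uint32_t&gt;(0x12345678u) == 0x78563412u, "byte swap");
static_assert(sketch::reverse_bytes&lt;std::uint32_t&gt;(0x11223344u, 1, 2) == 0x22114433u,
              "byte swap within each 16-bit half");
</code></pre>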

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral outer_perfect_shuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently outer perfect shuffle the subwords of size <code>subword_bits</code> bits in each word.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral outer_perfect_shuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>outer_perfect_shuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral outer_perfect_unshuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently outer perfect unshuffle the subwords of size <code>subword_bits</code> bits in each word.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral outer_perfect_unshuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>outer_perfect_unshuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>
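<p>The outer shuffle and unshuffle can be modelled with a simple loop. The sketch below handles only the default arguments (<code>subword_bits = 1</code>, <code>num_swar_words = 1</code>), and its bit placement is read off the Examples list in this section, where the outer shuffle of <code>ABCDEFGHb</code> yields <code>EAFBGCHDb</code>. The function names are hypothetical and the code is an assumption-laden illustration, not the proposal's implementation.</p>

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical model of the outer perfect shuffle for single-bit subwords.
// Placement follows this section's example ABCDEFGH -> EAFBGCHD:
// bit i of the low half lands at position 2*i + 1,
// bit i of the high half lands at position 2*i.
template <class T>
constexpr T outer_perfect_shuffle_bits_sketch(T x) noexcept {
    const int half = int(sizeof(T) * CHAR_BIT) / 2;
    T result = 0;
    for (int i = 0; i < half; ++i) {
        if ((x >> i) & 1)          result |= T(T(1) << (2 * i + 1));
        if ((x >> (half + i)) & 1) result |= T(T(1) << (2 * i));
    }
    return result;
}

// Inverse permutation: undoes the shuffle above.
template <class T>
constexpr T outer_perfect_unshuffle_bits_sketch(T x) noexcept {
    const int half = int(sizeof(T) * CHAR_BIT) / 2;
    T result = 0;
    for (int i = 0; i < half; ++i) {
        if ((x >> (2 * i + 1)) & 1) result |= T(T(1) << i);
        if ((x >> (2 * i)) & 1)     result |= T(T(1) << (half + i));
    }
    return result;
}

static_assert(outer_perfect_shuffle_bits_sketch<std::uint8_t>(0b11010010) == 0b01011001, "");
static_assert(outer_perfect_unshuffle_bits_sketch<std::uint8_t>(0b11010010) == 0b11001001, "");
```

<p>Applying the unshuffle to the result of the shuffle returns the original value, which is a convenient sanity check for any optimized implementation.</p>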

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral inner_perfect_shuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently inner perfect shuffle the subwords of size <code>subword_bits</code> bits in each word.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral inner_perfect_shuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>inner_perfect_shuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral inner_perfect_unshuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently inner perfect unshuffle the subwords of size <code>subword_bits</code> bits in each word.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral inner_perfect_unshuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>inner_perfect_unshuffle_bits(x, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral deposit_bits_right(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently deposit the subwords of size <code>subword_bits</code> bits in each word identified by the bits of <code>mask</code> to the low order subwords of the result. The remaining subwords are set to 0.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral deposit_bytes_right(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>deposit_bits_right(x, mask, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral deposit_bits_left(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently deposit the subwords of size <code>subword_bits</code> bits in each word identified by the bits of <code>mask</code> to the high order subwords of the result. The remaining subwords are set to 0.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral deposit_bytes_left(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>deposit_bits_left(x, mask, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral extract_bits_right(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently extract the low order subwords of size <code>subword_bits</code> bits in each word to the subwords of the result identified by the bits of <code>mask</code>. The remaining subwords are set to 0.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral extract_bytes_right(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>extract_bits_right(x, mask, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral extract_bits_left(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> Split <code>x</code> into <code>num_swar_words</code> "words" of equal size and then independently extract the high order subwords of size <code>subword_bits</code> bits in each word to the subwords of the result identified by the bits of <code>mask</code>. The remaining subwords are set to 0.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral extract_bytes_left(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>extract_bits_left(x, mask, subword_bytes * CHAR_BIT, num_swar_words);</code></li>
</ul>
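<p>To make the deposit and extract semantics concrete, here is a minimal loop-based model of the two right-justified variants for single-bit subwords and a single word. It follows the definitions given in this section: deposit gathers the bits of <code>x</code> selected by <code>mask</code> into the low order positions, while extract scatters the low order bits of <code>x</code> to the positions selected by <code>mask</code>. The function names are hypothetical and the code is only a sketch, not an optimized or official implementation.</p>

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical model of deposit_bits_right for 1-bit subwords:
// bits of x selected by mask are packed into the low-order end of the result.
template <class T>
constexpr T deposit_bits_right_sketch(T x, T mask) noexcept {
    T result = 0;
    int out = 0;  // next free low-order position in the result
    for (int i = 0; i < int(sizeof(T) * CHAR_BIT); ++i) {
        if ((mask >> i) & 1) {
            if ((x >> i) & 1) result |= T(T(1) << out);
            ++out;
        }
    }
    return result;
}

// Hypothetical model of extract_bits_right for 1-bit subwords:
// low-order bits of x are spread out to the positions selected by mask.
template <class T>
constexpr T extract_bits_right_sketch(T x, T mask) noexcept {
    T result = 0;
    int in = 0;  // next low-order bit of x to consume
    for (int i = 0; i < int(sizeof(T) * CHAR_BIT); ++i) {
        if ((mask >> i) & 1) {
            if ((x >> in) & 1) result |= T(T(1) << i);
            ++in;
        }
    }
    return result;
}

// x = 10010110b, i.e. A=1 B=0 C=0 D=1 E=0 F=1 G=1 H=0.
static_assert(deposit_bits_right_sketch<std::uint8_t>(0b10010110, 0b10001000) == 0b00000010, "");
static_assert(extract_bits_right_sketch<std::uint8_t>(0b10010110, 0b10101010) == 0b00101000, "");
```

<p>In this model the two operations are partial inverses: extracting the result of a deposit with the same mask reproduces <code>x &amp; mask</code>.</p>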

<h4>Examples</h4>

<p>The following table shows how some example 8 bit binary values would be permuted by each function.
We use the C++14 binary literal syntax here. The bits of a given value are represented by
letters to show how a generic value would be permuted.
For a more detailed treatment of these operations, refer to [<a href="#Neumann01">Neumann01</a>] and Chapter 7 of [<a href="#Warren01">Warren01</a>].</p>

<ul>
<li><code>reverse_bits(ABCDEFGHb)</code> -> <code>HGFEDCBAb</code></li>
<li><code>reverse_bits(ABCDEFGHb, 1, 2)</code> -> <code>DCBAHGFEb</code></li>
<li><code>reverse_bits(ABCDEFGHb, 1, 4)</code> -> <code>BADCFEHGb</code></li>
<li><code>reverse_bits(ABCDEFGHb, 2)</code> -> <code>GHEFCDABb</code></li>
<li><code>reverse_bits(ABCDEFGHb, 2, 2)</code> -> <code>CDABGHEFb</code></li>
<li><p><code>reverse_bits(ABCDEFGHb, 4)</code> -> <code>EFGHABCDb</code></p></li>
<li><p><code>outer_perfect_shuffle_bits(ABCDEFGHb)</code> -> <code>EAFBGCHDb</code></p></li>
<li><code>inner_perfect_shuffle_bits(ABCDEFGHb)</code> -> <code>AEBFCGDHb</code></li>
<li><code>outer_perfect_unshuffle_bits(ABCDEFGHb)</code> -> <code>BDFHACEGb</code></li>
<li><code>inner_perfect_unshuffle_bits(ABCDEFGHb)</code> -> <code>ACEGBDFHb</code></li>
<li><code>outer_perfect_unshuffle_bits(outer_perfect_shuffle_bits(x))</code> -> <code>x</code></li>
<li><p><code>inner_perfect_unshuffle_bits(inner_perfect_shuffle_bits(x))</code> -> <code>x</code></p></li>
<li><p><code>deposit_bits_right(ABCDEFGHb, 01110110b)</code> -> <code>000BCDFGb</code></p></li>
<li><code>deposit_bits_right(ABCDEFGHb, 10001000b)</code> -> <code>000000AEb</code></li>
<li><code>deposit_bits_left(ABCDEFGHb, 00010110b)</code> -> <code>DFG00000b</code></li>
<li><code>deposit_bits_left(ABCDEFGHb, 10111000b)</code> -> <code>ACDE0000b</code></li>
<li><code>extract_bits_right(ABCDEFGHb, 11000110b)</code> -> <code>EF000GH0b</code></li>
<li><code>extract_bits_right(ABCDEFGHb, 10101010b)</code> -> <code>E0F0G0H0b</code></li>
<li><code>extract_bits_left(ABCDEFGHb, 00110110b)</code> -> <code>00AB0CD0b</code></li>
<li><code>extract_bits_left(ABCDEFGHb, 11111001b)</code> -> <code>ABCDE00Fb</code></li>
</ul>

<h4>Applications</h4>

<ul>
<li>Endian conversion for binary protocols and networking [<a href="#N3646">N3646</a>].</li>
<li>Cryptography (nibble swapping) [<a href="#EmbedGuru01">EmbedGuru01</a>].</li>
<li>Network topology definitions and routing.</li>
<li>Bioinformatics, image processing, steganography, cryptanalysis, and coding [<a href="#Hilewitz01">Hilewitz01</a>]</li>
<li>Chess Board Programming [<a href="#ChessProg">ChessProg</a>]</li>
</ul>

<h3>Power of 2 manipulation</h3>

<p>The following functions detect and compute powers of 2.</p>

<h4>List of Functions</h4>

<pre><code>//IS POWer of 2
template &lt;class integral&gt;
constexpr bool ispow2(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>true</code> if <code>x</code> is a positive power of 2, otherwise <code>false</code>.</li>
</ul>

<!-- -->

<pre><code>//CEILing Power of 2
template &lt;class integral&gt;
constexpr integral ceilp2(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The smallest value <code>n</code> such that <code>ispow2(n) &amp;&amp; n &gt;= x</code>.</li>
<li><em>Remarks:</em> Result is undefined if the value of <code>n</code> is too large to be represented by type <code>integral</code>.</li>
</ul>

<!-- -->

<pre><code>//FLOOR Power of 2
template &lt;class integral&gt;
constexpr integral floorp2(integral x) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The largest value <code>n</code> such that <code>ispow2(n) &amp;&amp; n &lt;= x</code>.</li>
<li><em>Remarks:</em> Result is undefined if <code>x &lt;= 0</code>.</li>
</ul>
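<p>The three power of 2 functions can be modelled in a few lines of portable C++14. The sketch below uses hypothetical names and simple loops for readability; an optimized implementation would more likely build on a count leading zeros primitive. It assumes the preconditions stated above (<code>x &gt; 0</code> for the floor variant, and a representable result for the ceiling variant).</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: true only for positive values with exactly one bit set.
template <class T>
constexpr bool ispow2_sketch(T x) noexcept {
    return x > 0 && (x & (x - 1)) == 0;
}

// Hypothetical sketch: smallest power of 2 that is >= x.
// Precondition (assumed): the result is representable in T.
template <class T>
constexpr T ceilp2_sketch(T x) noexcept {
    T n = 1;
    while (n < x) n = T(n * 2);
    return n;
}

// Hypothetical sketch: largest power of 2 that is <= x.
// Precondition (assumed): x > 0.
template <class T>
constexpr T floorp2_sketch(T x) noexcept {
    T n = 1;
    while (n <= x / 2) n = T(n * 2);  // compare against x / 2 to avoid overflow
    return n;
}

static_assert(ispow2_sketch(64u) && !ispow2_sketch(0u) && !ispow2_sketch(48u), "");
static_assert(ceilp2_sketch(5u) == 8u && ceilp2_sketch(8u) == 8u, "");
static_assert(floorp2_sketch(5u) == 4u && floorp2_sketch(8u) == 8u, "");
```

<p>A typical use is rounding a requested capacity up before allocating a ring buffer, for example <code>ceilp2_sketch(requested)</code>, so that index wrapping can be done with a mask instead of a modulo.</p>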

<h4>Applications</h4>

<ul>
<li>Data structures whose capacity must be a power of 2 (example: circular queue).</li>
<li>Scaling an image's dimensions to the next power of 2 for texturing in 3D rendering.</li>
</ul>

<h3>Saturated arithmetic</h3>

<p>Saturated arithmetic is useful in digital signal processing applications [<a href="#EmbedGuru01">EmbedGuru01</a>].
It is also provided as a hardware instruction
on some machines. In our efforts to better expose hardware features, we have included saturated addition and subtraction functions in this proposal.</p>

<h4>List of Functions</h4>

<pre><code>//SATurated ADDition
template &lt;class integral_l, class integral_r&gt;
constexpr auto satadd(integral_l l, integral_r r) noexcept -&gt; decltype(l + r);
</code></pre>

<ul>
<li><em>Returns:</em> <code>l + r</code></li>
<li><em>Remarks:</em> On overflow, will return <code>std::numeric_limits&lt;decltype(l + r)&gt;::max()</code></li>
<li><em>Remarks:</em> On underflow, will return <code>std::numeric_limits&lt;decltype(l + r)&gt;::min()</code></li>
</ul>

<!-- -->

<pre><code>//SATurated SUBtraction
template &lt;class integral_l, class integral_r&gt;
constexpr auto satsub(integral_l l, integral_r r) noexcept -&gt; decltype(l - r);
</code></pre>

<ul>
<li><em>Returns:</em> <code>l - r</code></li>
<li><em>Remarks:</em> On overflow, will return <code>std::numeric_limits&lt;decltype(l - r)&gt;::max()</code></li>
<li><em>Remarks:</em> On underflow, will return <code>std::numeric_limits&lt;decltype(l - r)&gt;::min()</code></li>
</ul>
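<p>One portable way to obtain the saturating behaviour is to test for overflow before performing the operation. The following sketch uses hypothetical names and converts both operands to the common type <code>decltype(l + r)</code> first; how mixed signed and unsigned operands should behave is not spelled out above, so that conversion is an assumption of this example.</p>

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Hypothetical sketch of saturated addition: checks the range before adding,
// so the addition itself never overflows.
template <class L, class R>
constexpr auto satadd_sketch(L l, R r) noexcept -> decltype(l + r) {
    using T = decltype(l + r);
    const T a = l;
    const T b = r;
    if (b > 0 && a > std::numeric_limits<T>::max() - b) return std::numeric_limits<T>::max();
    if (b < 0 && a < std::numeric_limits<T>::min() - b) return std::numeric_limits<T>::min();
    return a + b;
}

// Hypothetical sketch of saturated subtraction, using the mirrored checks.
template <class L, class R>
constexpr auto satsub_sketch(L l, R r) noexcept -> decltype(l - r) {
    using T = decltype(l - r);
    const T a = l;
    const T b = r;
    if (b < 0 && a > std::numeric_limits<T>::max() + b) return std::numeric_limits<T>::max();
    if (b > 0 && a < std::numeric_limits<T>::min() + b) return std::numeric_limits<T>::min();
    return a - b;
}

static_assert(satadd_sketch(INT_MAX, 1) == INT_MAX, "");
static_assert(satadd_sketch(INT_MIN, -1) == INT_MIN, "");
static_assert(satsub_sketch(0u, 1u) == 0u, "");
static_assert(satsub_sketch(INT_MAX, -1) == INT_MAX, "");
```

<p>Note that operands narrower than <code>int</code> are promoted, so for example two <code>signed char</code> arguments produce an <code>int</code> result that cannot saturate in this sketch.</p>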

<h2>memory Header Additions</h2>

<p>This section describes the additions to the <code>&lt;memory&gt;</code> header.</p>

<h3>Alignment helpers</h3>

<p>These are primitives used for aligning objects in memory. They cover use cases that <code>std::align</code> is
not designed to handle. These are very useful operations and all of them have trivial implementations.
They can often be found in operating system kernels and device drivers reimplemented time and time
again as macros.</p>

<h4>List of Functions</h4>

<pre><code>template &lt;class integral&gt;
constexpr bool is_aligned(integral x, size_t align) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>true</code> if <code>x</code> is a multiple of <code>align</code>.</li>
<li><em>Implementation:</em> <code>(x &amp; (align - 1)) == 0</code></li>
</ul>

<!-- -->

<pre><code>bool is_aligned(void* val, size_t align) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>is_aligned(uintptr_t(val), align)</code>.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral align_up(integral x, size_t align) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The smallest value <code>n</code> such that <code>is_aligned(n, align) &amp;&amp; n &gt;= x</code>.</li>
<li><em>Implementation:</em> <code>(x + (align - 1)) &amp; -align</code></li>
</ul>

<!-- -->

<pre><code>void* align_up(void* val, size_t align) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>(void*)align_up(uintptr_t(val), align)</code>.</li>
</ul>

<!-- -->

<pre><code>template &lt;class integral&gt;
constexpr integral align_down(integral x, size_t align) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> The largest value <code>n</code> such that <code>is_aligned(n, align) &amp;&amp; n &lt;= x</code>.</li>
<li><em>Implementation:</em> <code>x &amp; (-align)</code></li>
</ul>

<!-- -->

<pre><code>void* align_down(void* val, size_t align) noexcept;
</code></pre>

<ul>
<li><em>Returns:</em> <code>(void*)align_down(uintptr_t(val), align)</code>.</li>
</ul>
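<p>The integral overloads can be sketched directly from the implementation notes above. The names below are hypothetical, and the code assumes that <code>align</code> is a power of 2, which the mask-based formulas require. It writes the mask as <code>~(align - 1)</code>, which is the same value as <code>-align</code> for an unsigned <code>size_t</code>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketches of the integral alignment helpers.
// Assumption: align is a power of 2 and x is non-negative.
template <class T>
constexpr bool is_aligned_sketch(T x, std::size_t align) noexcept {
    return (x & (align - 1)) == 0;
}

template <class T>
constexpr T align_up_sketch(T x, std::size_t align) noexcept {
    // Add align - 1, then clear the low bits to round up.
    return T((x + (align - 1)) & ~(align - 1));
}

template <class T>
constexpr T align_down_sketch(T x, std::size_t align) noexcept {
    // Clearing the low bits rounds down.
    return T(x & ~(align - 1));
}

static_assert(is_aligned_sketch(16u, 8) && !is_aligned_sketch(13u, 8), "");
static_assert(align_up_sketch(13u, 8) == 16u && align_up_sketch(16u, 8) == 16u, "");
static_assert(align_down_sketch(13u, 8) == 8u && align_down_sketch(16u, 8) == 16u, "");
```

<p>The pointer overloads described above would forward to these after converting the pointer to <code>uintptr_t</code>.</p>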

<h4>Applications and std::align</h4>

<p>We currently have <code>std::align</code> in the standard for doing alignment calculations.
The function <code>std::align</code>
has one very specific use case, that is to carve out an aligned buffer of a known size within a larger buffer.
In order to use <code>std::align</code>, the user must a priori know the size of the aligned buffer
they require. Unfortunately in some use cases, even calculating the size of this buffer
as an input to <code>std::align</code> itself requires doing alignment calculations.
Consider the following example of using aligned SIMD registers to process a memory buffer.
The alignment calculations here cannot be done with <code>std::align</code>.</p>

<pre><code>void process(char* b, char* e) {
  char* pb = std::min((char*)std::align_up(b, sizeof(simd16)), e);
  char* pe = (char*)std::align_down(e, sizeof(simd16));

  for(char* p = b; p &lt; pb; ++p) {
    process1(p);
  }
  for(char* p = pb; p &lt; pe; p += sizeof(simd16)) {
    simd16 x = simd16_aligned_load(p);
    process16(x);
    simd16_aligned_store(x, p);
  }
  for(char* p = pe; p &lt; e; ++p) {
    process1(p);
  }
}
</code></pre>

<p>We conclude that <code>std::align</code> is much too specific for general alignment calculations. It has a very narrow
use case and should only be considered as a helper function for when that use case is needed.</p>

<h1>Implementation</h1>

<h2>Guidelines for Implementors</h2>

<p>Those who wish to implement the functions provided by this proposal must consider the following guidelines:</p>

<ul>
<li>Implementors must at a minimum provide support for 1, 2, 4, and 8 byte integral signed and unsigned types.</li>
<li>Implementors should use optimized hardware instructions wherever possible.
<ul>
<li><em>Prefer</em> compiler intrinsics to inline assembly.
The former allows important optimizations while the latter does not.
As a motivating example, consider the count trailing zeros algorithm.
On older Intel machines, this is implemented with a <code>bsf</code> instruction followed by a <code>cmov</code>
instruction to handle the case where the input is 0. In many contexts
the optimizer is able to prove that the input is never 0 and thus the
cmov instruction can be omitted. These optimizations are often impossible with inline assembly.</li>
</ul></li>
<li>Implementors are encouraged to add support for wider integral types where it makes sense to do so.</li>
</ul>

<h2>Survey of Hardware Support</h2>

<p>The following is a list of compiler intrinsics and native instructions which can be used to implement
the proposal on various platforms. 
Several machine architectures were surveyed for their instruction references.
The purpose of this section is to demonstrate the current state
of the art on many different machines. We have also noted when one operation is
trivially implementable from another bitops proposal operation.</p>

<ul>
<li><p><code>cntt0(x)</code></p>

<ul>
<li>i386: <code>bsf</code>, <code>cmov</code></li>
<li>x86_64 w/ BMI1: <code>tzcnt</code></li>
<li>alpha: <code>cttz</code></li>
<li>gcc: <code>x == 0 ? sizeof(x) * CHAR_BIT : __builtin_ctz(x)</code></li>
<li>bitops: <code>cntt1(~x)</code></li>
</ul></li>
<li><p><code>cntl0(x)</code></p>

<ul>
<li>i386: <code>bsr</code>, <code>cmov</code></li>
<li>x86_64 w/ AMD SSE4a / Intel BMI1: <code>lzcnt</code></li>
<li>ARMv5: <code>CLZ</code></li>
<li>IA64: <code>clz</code></li>
<li>PowerPC: <code>cntlzd</code></li>
<li>MIPS: <code>CLZ</code></li>
<li>gcc: <code>(x == 0 ? sizeof(x) * CHAR_BIT : __builtin_clz(x))</code></li>
<li>bitops: <code>cntl1(~x)</code></li>
</ul></li>
<li><p><code>cntt1(x)</code></p>

<ul>
<li>bitops: <code>cntt0(~x)</code></li>
</ul></li>
<li><p><code>cntl1(x)</code></p>

<ul>
<li>ARMv8: <code>CLS</code></li>
<li>Blackfin: <code>SIGNBITS</code></li>
<li>C6X: <code>NORM</code></li>
<li>Picochip: <code>SBC</code></li>
<li>MIPS: <code>CLO</code></li>
<li>bitops: <code>cntl0(~x)</code></li>
</ul></li>
<li><p><code>popcount(x)</code></p>

<ul>
<li>x86_64 SSE4: <code>popcnt</code></li>
<li>IA64: <code>popcnt</code></li>
<li>Alpha: <code>CTPOP</code></li>
<li>PowerPC: <code>popcntb</code></li>
<li>SparcV9: <code>POPC</code></li>
<li>gcc: <code>__builtin_popcount(x)</code></li>
</ul></li>
<li><p><code>parity(x)</code></p>

<ul>
<li>gcc: <code>__builtin_parity(x)</code></li>
<li>bitops: <code>popcount(x) &amp; 1</code></li>
</ul></li>
<li><p><code>rstls1b(x)</code></p>

<ul>
<li>x86_64 w/ BMI1: <code>BLSR</code></li>
</ul></li>
<li><p><code>setls0b(x)</code></p>

<ul>
<li>x86_64 w/ AMD TBM: <code>BLCS</code></li>
</ul></li>
<li><p><code>isols0b(x)</code></p>

<ul>
<li>x86_64 w/ AMD TBM: <code>BLCI</code>, <code>NOT</code></li>
<li>x86_64 w/ AMD TBM: <code>BLCIC</code></li>
</ul></li>
<li><p><code>isols1b(x)</code></p>

<ul>
<li>x86_64 w/ BMI1: <code>BLSI</code></li>
<li>x86_64 w/ AMD TBM: <code>BLSIC</code>, <code>NOT</code></li>
</ul></li>
<li><p><code>rstt1(x)</code></p>

<ul>
<li>x86_64 w/ AMD TBM: <code>BLCFILL</code></li>
</ul></li>
<li><p><code>sett0(x)</code></p>

<ul>
<li>x86_64 w/ AMD TBM: <code>BLSFILL</code></li>
</ul></li>
<li><p><code>maskt0(x)</code></p>

<ul>
<li>x86_64 w/ AMD TBM: <code>TZMSK</code></li>
</ul></li>
<li><p><code>maskt1(x)</code></p>

<ul>
<li>x86_64 w/ AMD TBM: <code>T1MSKC</code>, <code>NOT</code></li>
</ul></li>
<li><p><code>maskt0ls1b(x)</code></p>

<ul>
<li>x86_64 w/ BMI1: <code>BLSMSK</code></li>
</ul></li>
<li><p><code>maskt1ls0b(x)</code></p>

<ul>
<li>x86_64 w/ AMD TBM: <code>BLCMSK</code></li>
</ul></li>
<li><p><code>rstbitsge(x, b)</code></p>

<ul>
<li>x86_64 w/ BMI2: <code>BZHI</code></li>
</ul></li>
<li><p><code>reverse_bits&lt;uint32_t&gt;(x)</code></p>

<ul>
<li>ARMv7: <code>RBIT</code></li>
<li>EPIPHANY: <code>BITR</code></li>
</ul></li>
<li><code>reverse_bits&lt;uint64_t&gt;(x)</code>
<ul>
<li>ARMv8: <code>RBIT</code></li>
</ul></li>
<li><code>reverse_bits&lt;uint8_t&gt;(x, 4)</code>
<ul>
<li>AVR: <code>SWAP</code></li>
</ul></li>
<li><code>reverse_bytes&lt;uint16_t&gt;(x)</code>
<ul>
<li>PDP11: <code>SWAB</code></li>
<li>gcc: <code>__builtin_bswap16(x)</code></li>
</ul></li>
<li><code>reverse_bytes&lt;uint32_t&gt;(x)</code>
<ul>
<li>i486: <code>bswap</code></li>
<li>ARMv6: <code>REV</code></li>
<li>gcc: <code>__builtin_bswap32(x)</code></li>
</ul></li>
<li><code>reverse_bytes&lt;uint64_t&gt;(x)</code>
<ul>
<li>x86_64: <code>bswap</code></li>
<li>ARMv8: <code>REV</code></li>
<li>gcc: <code>__builtin_bswap64(x)</code></li>
</ul></li>
<li><code>reverse_bytes&lt;uint32_t&gt;(x, 1, 2)</code>
<ul>
<li>ARMv6: <code>REV16</code></li>
</ul></li>
<li><code>reverse_bytes&lt;uint64_t&gt;(x, 1, 4)</code>
<ul>
<li>ARMv8: <code>REV16</code></li>
</ul></li>
<li><code>reverse_bytes&lt;uint64_t&gt;(x, 1, 2)</code>
<ul>
<li>ARMv8: <code>REV32</code></li>
</ul></li>
<li><code>reverse_bytes&lt;uint32_t&gt;(x, 2)</code>
<ul>
<li>MC68020: <code>SWAP</code></li>
</ul></li>
<li><code>int32_t(reverse_bytes&lt;int16_t&gt;(x))</code>
<ul>
<li>ARMv6: <code>REVSH</code></li>
</ul></li>
<li><code>int32_t(reverse_bytes&lt;uint16_t&gt;(x))</code>
<ul>
<li>ARMv5: <code>REVSH</code></li>
</ul></li>
<li><code>deposit_bits_right(x)</code>
<ul>
<li>x86_64 w/ BMI2: <code>PDEP</code></li>
</ul></li>
<li><code>extract_bits_right(x)</code>
<ul>
<li>x86_64 w/ BMI2: <code>PEXT</code></li>
</ul></li>
<li><code>satadd(l, r)</code>
<ul>
<li>ARMv7: <code>QADD</code></li>
</ul></li>
<li><code>satsub(l, r)</code>
<ul>
<li>ARMv7: <code>QSUB</code></li>
</ul></li>
</ul>
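<p>As an illustration of how the gcc entries in the list above might be wrapped, the following sketch implements the counting functions for <code>unsigned</code> only. It assumes GCC or Clang for the <code>__builtin_ctz</code> and <code>__builtin_clz</code> paths and falls back to a plain loop elsewhere. The <code>_sketch</code> names are hypothetical, and a real implementation would cover all integral widths.</p>

```cpp
#include <cassert>
#include <climits>

// Hypothetical sketch: count trailing 0 bits, returning the bit width for 0.
constexpr int cntt0_sketch(unsigned x) noexcept {
#if defined(__GNUC__) || defined(__clang__)
    // The builtin is undefined for 0, so that case is handled explicitly.
    return x == 0 ? int(sizeof(x) * CHAR_BIT) : __builtin_ctz(x);
#else
    int n = 0;
    for (unsigned bit = 1; bit != 0 && (x & bit) == 0; bit <<= 1) ++n;
    return n;
#endif
}

// Hypothetical sketch: count leading 0 bits, returning the bit width for 0.
constexpr int cntl0_sketch(unsigned x) noexcept {
#if defined(__GNUC__) || defined(__clang__)
    return x == 0 ? int(sizeof(x) * CHAR_BIT) : __builtin_clz(x);
#else
    int n = 0;
    for (unsigned bit = 1u << (sizeof(x) * CHAR_BIT - 1); bit != 0 && (x & bit) == 0; bit >>= 1) ++n;
    return n;
#endif
}

// The 1-counting variants follow from the 0-counting ones by complementing x.
constexpr int cntt1_sketch(unsigned x) noexcept { return cntt0_sketch(~x); }
constexpr int cntl1_sketch(unsigned x) noexcept { return cntl0_sketch(~x); }

static_assert(cntt0_sketch(8u) == 3, "");
static_assert(cntt0_sketch(0u) == int(sizeof(unsigned) * CHAR_BIT), "");
static_assert(cntt1_sketch(7u) == 3, "");
```

<p>Because the zero check is ordinary C++ rather than inline assembly, the optimizer remains free to drop it when it can prove the argument is non-zero, which is the behaviour the guidelines above ask for.</p>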

<h1>Open Questions</h1>

<h2>Naming</h2>

<p>Naming is one of the most difficult problems in software development. 
On one extreme are terse names such as <code>std::ctz()</code> for Count
Trailing Zeroes. This naming style mimics assembler mnemonics and is also
an artifact of the old days when programming languages had limits on the
length of names of identifiers.</p>

<p>These short names do have some merits. They reduce the amount of typing required by
the programmer and more importantly they can be used within complex expressions.
The downside is the ambiguity that can come with some short names. Consider
a hypothetical <em>Count Leading Sign bits</em> function <code>std::cls()</code>. This name could
be interpreted in other contexts such as <em>CLear Screen</em>.</p>

<p>On the other extreme are verbose names such as <code>std::mask_least_significant_1_bit_and_trailing_zeroes()</code>. While these names
remove all ambiguity they are very cumbersome to type. They also cannot be
used easily in complex expressions with other operations.</p>

<p>We have opted to make a compromise. The current naming scheme adheres to the
following rules:</p>

<ul>
<li>All functions start with a verb. Each verb can be abbreviated to a minimum of 3 characters. That is, instead of 'ctz', we would say 'cnttz'.</li>
<li>Use 0 and 1 for 0 and 1, not Z and O or some other combination of letters. While the z in <code>cnttz()</code> is pretty obviously zero, the o in <code>cntto()</code> is not so obviously meant to be 1. Therefore we stick to the numbers to remove all ambiguity.</li>
<li>Nouns can be abbreviated to one character per word, as long as they are reused consistently. We reuse the following nouns:
<ul>
<li>t0: trailing 0s</li>
<li>t1: trailing 1s</li>
<li>l0: leading 0s</li>
<li>l1: leading 1s</li>
<li>ls1b: least significant 1 bit</li>
<li>ms1b: most significant 1 bit</li>
<li>ls0b: least significant 0 bit</li>
<li>ms0b: most significant 0 bit</li>
</ul></li>
</ul>

<p>As always, the naming question is continuously up for debate and reconsideration.
Some other styles have been suggested on the std-proposals discussion forum.</p>

<ul>
<li><code>ctz()</code></li>
<li><code>ct0()</code></li>
<li><code>cntt0()</code></li>
<li><code>countt0()</code></li>
<li><code>count_t0()</code></li>
<li><code>count_trailing_zeroes()</code></li>
<li><code>count_trailing_0_bits()</code></li>
<li><code>count_trailing&lt;bool&gt;()</code></li>
</ul>

<h2>Support for std::bitset</h2>

<p>Many people have suggested adding support for <code>std::bitset</code>. While this is certainly a good idea, we believe it is outside the scope of
this proposal. Once this proposal is finished and the interface is agreed upon, adding a follow up proposal for <code>std::bitset</code> would be easy to do.</p>

<h2>Support for C</h2>

<p>This library would also be very useful for the C community. Many of these bitwise operations
are used by embedded developers, who often choose to implement in C. While C compatibility is a noble goal, we
do not want to make sacrifices to the C++ interface in the name of C compatibility, particularly with regard to
templates, overloading, and <code>constexpr</code>.
This is first and foremost a C++ proposal which takes advantage of the latest C++ techniques
to provide a modern procedural interface.</p>

<p>If the C community shows interest we may consider a C interface that uses the generic macro feature. This
may allow interoperability, using macros for C and templates for C++. The <code>constexpr</code> qualifier
could be used in the C++ version while <code>inline</code> is used in the C version. If the C community shows interest,
we will consider a joint C proposal and flesh out the technical details of the interface and compatibility.</p>

<h1>Acknowledgements</h1>

<p>Thank you to everyone on the std proposals forum for feedback and suggestions.</p>

<h1>References</h1>

<ul>
<li><a name="BitOpsRef"></a>[BitOpsRef] <em>GitHub: BitOps Proposal and Reference Implementation</em>, (still under development) Available online at
<a href="https://github.com/fmatthew5876/stdcxx">https://github.com/fmatthew5876/stdcxx</a></li>
<li><a name="Anderson01"></a>[Anderson01] Anderson, Sean Eron. <em>Bit Twiddling Hacks</em>, Available online at <a href="http://graphics.stanford.edu/~seander/bithacks.html">http://graphics.stanford.edu/~seander/bithacks.html</a></li>
<li><a name="Dietz01"></a>[Dietz01] Dietz, Henry Gordon. <em>The Aggregate Magic Algorithms</em>, University of Kentucky. 
Available online at <a href="http://aggregate.org/MAGIC/">http://aggregate.org/MAGIC/</a></li>
<li><a name="Neumann01"></a>[Neumann01] Neumann, Jasper. <em>Bit permutations</em>, Available online at
<a href="http://programming.sirrida.de/bit_perm.html">http://programming.sirrida.de/bit_perm.html</a></li>
<li><a name="Warren01"></a>[Warren01] Warren, Henry S. Jr. <em>Hacker's Delight Second Edition</em>,
Addison-Wesley, Oct 2012, ISBN 0-321-84268-5.</li>
<li><a name="pdp11"></a>[pdp11] <em>pdp11/40 processor handbook</em>, Digital Equipment Corporation, 1972.</li>
<li><a name="Kostjuchenko01"></a>[Kostjuchenko01] Kostjuchenko, Dmitry. <em>SSE2 optimized strlen</em>, Available online at
<a href="http://www.strchr.com/sse2_optimised_strlen">http://www.strchr.com/sse2_optimised_strlen</a></li>
<li><a name="N3646"></a>[N3646] Pratte, Robert. <em>Network Byte Order Conversion Document Number: N3646</em>.
Available online at <a href="http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3646.pdf">http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3646.pdf</a></li>
<li><a name="HACKMEM"></a>[HACKMEM] <em>HACKMEM, AI Memo 239</em>, Available online at <a href="http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html">http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html</a>.</li>
<li><a name="ChessProg"></a>[ChessProg] <em>Chess Programming WIKI</em>, Available online at <a href="http://chessprogramming.wikispaces.com/">http://chessprogramming.wikispaces.com/</a>.</li>
<li><a name="Hilewitz01"></a>[Hilewitz01] Hilewitz, Yedidya and Lee, Ruby B. <em>Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions</em>, 2006.</li>
<li><a name="EmbedGuru01"></a>[EmbedGuru01] <em>Optimizing for the CPU / compiler / Stack Overflow</em>, Available online at <a href="http://embeddedgurus.com/stack-overflow/2012/06/optimizing-for-the-cpu-compiler/">http://embeddedgurus.com/stack-overflow/2012/06/optimizing-for-the-cpu-compiler/</a>.</li>
</ul>
</body></html>