<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <title>Hashing User-Defined Types in C++1y</title>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"></script>
<script type="text/javascript">//<![CDATA[
$(function() {
    var next_id = 0
    function find_id(node) {
        // Look down the first children of 'node' until we find one
        // with an id. If we don't find one, give 'node' an id and
        // return that.
        var cur = node[0];
        while (cur) {
            if (cur.id) return cur.id;
            if (cur.tagName == 'A' && cur.name)
                return cur.name;
            cur = cur.firstChild;
        };
        // No id.
        node.attr('id', 'gensection-' + next_id++);
        return node.attr('id');
    };

    // Put a table of contents in the #toc nav.

    // This is a list of <ol> elements, where toc[N] is the list for
    // the current sequence of <h(N+2)> tags. When a header of an
    // existing level is encountered, all higher levels are popped,
    // and an <li> is appended to the level
    var toc = [$("<ol/>")];
    $(':header').not('h1').each(function() {
        var header = $(this);
        // For each <hN> tag, add a link to the toc at the appropriate
        // level.  When toc is one element too short, start a new list
        var levels = {H2: 0, H3: 1, H4: 2, H5: 3, H6: 4};
        var level = levels[this.tagName];
        if (typeof level == 'undefined') {
            throw 'Unexpected tag: ' + this.tagName;
        }
        // Truncate to the new level.
        toc.splice(level + 1, toc.length);
        if (toc.length < level) {
            // Omit TOC entries for skipped header levels.
            return;
        }
        if (toc.length == level) {
            // Add a <ol> to the previous level's last <li> and push
            // it into the array.
            var ol = $('<ol/>')
            toc[toc.length - 1].children().last().append(ol);
            toc.push(ol);
        }
        var header_text = header.text();
        toc[toc.length - 1].append(
            $('<li/>').append($('<a href="#' + find_id(header) + '"/>')
                              .text(header_text)));
    });
    $('#toc').append(toc[0]);
})
//]]></script>
<style type="text/css">
body {color: #000000; background-color: #FFFFFF;}

del {text-decoration: line-through; color: #8B0040;}
ins {text-decoration: underline; color: #005100;}


pre > code:only-child {display: inline-block}
.example, .implementation, .extract {margin: 1em 2em;}
pre.implementation > code {border: thin solid #bbf; background-color: #eef; padding: 1ex}
pre.example > code {border: thin solid #daf; background-color: #f8eeff; padding: 1ex}

p.function {}
p.attribute {text-indent: 3em;}

blockquote.std {
  color: #000000;
  background-color: #F1F1F1;
  border: 1px solid #D1D1D1;
  padding: 0.5em;
}

blockquote.stddel {
  text-decoration: line-through;
  color: #000000;
  background-color: #FFEBFF;
  border: 1px solid #ECD7EC;
  padding: 0.5em;
}

blockquote.stdins {
  text-decoration: underline;
  color: #000000;
  background-color: #C8FFC8;
  border: 1px solid #B3EBB3;
  padding: 0.5em;
}

table {
  border: 1px solid black;
  border-spacing: 0px;
  margin-left: auto;
  margin-right: auto;
}
th {
  text-align: left;
  vertical-align: top;
  padding: 0.2em;
  border: none;
}
td {
  text-align: left;
  vertical-align: top;
  padding: 0.2em;
  border: none;
}

address {float: right}
address p {margin: 0; text-align:right}

section {padding-left: 12px}
h2,h3,h4,h5,h6 {margin-left: -12px}

h2, h3, h4, h5, h6 { margin-bottom: .75em }
p {margin-top: .5em; margin-bottom: .5em}
p:first-child, ul, ol {margin-top: 0}
dt:not(:first-child) {margin-top: .5em}
p, li, dd {max-width: 80ex}

ol ol {list-style-type: lower-latin}

:target {background-color: #fed}
</style>
  </head>
<body>
<address>
  <p>Document number: N3333=12-0023</p>
  <p>Date: <time pubdate="">2012-01-13</time></p>
  <p>Jeffrey Yasskin &lt;<a href="mailto:jyasskin@google.com">jyasskin@google.com</a>&gt;</p>
  <p>Chandler Carruth &lt;<a href="mailto:chandlerc@google.com">chandlerc@google.com</a>&gt;</p>
</address>

<h1><a name="hashing">Hashing User-Defined Types in C++1y</a></h1>

<nav id="toc"></nav>

<section>
<h2><a name="background">Background</a></h2>

<p>C++11 defined a set of standard hashing containers, which require a specialization of <code>std::hash&lt;KeyType&gt;</code> in order to hash their keys. However, we provided no help to users trying to implement this hash function. We also required them to specialize a template in <code>namespace std</code> rather than in the namespace their code actually lives in. This leads to confused users and weak hash functions. For example:</p>
<pre class="example"><code>namespace my_namespace {
  struct OtherType; // Defined elsewhere.
  struct MyType {
    OtherType field1;
    int field2;
    char field3[500];
  };
}
namespace std {
  template&lt;&gt; struct hash&lt;::my_namespace::MyType&gt; {
    size_t operator()(const ::my_namespace::MyType&amp; val) const {
      return (hash&lt;::my_namespace::OtherType&gt;()(val.field1) ^
              // Wow, that's verbose, and the xor makes it weak.
              hash&lt;int&gt;()(val.field2) ^
              // Oh noes, a copy:
              hash&lt;std::string&gt;()(std::string(val.field3, val.field3 + 500)));
    }
  };
}</code></pre>

<p>This paper proposes an improvement to the situation. But first a digression on the purpose of "hash" functions:</p>
</section>

<section>
<h2><a name="fingerprinting">Hashing vs Fingerprinting</a></h2>
<p>There are several uses for this thing we refer to as hashing.</p>

<ol>
  <li>The use in the standard library is to look things up in a hash table that is local to the current process.</li>
  <li>We may want to build a similar table that's distributed across several machines running either several copies of the same binary, or several binaries.</li>
  <li>We may also want to save the table to disk and keep it for several years.</li>
  <li>We might hash some strings, and then compare <em>just</em> the hash values, trusting probability to protect us from collisions.</li>
  <li>We might wish to use the hash as part of a security-sensitive system. This generally requires a cryptographic hash.</li>
</ol>
<p>No single algorithm is optimal for all of the above use cases. If we try to make <code>std::hash&lt;&gt;</code> fill all of those roles, it will be the wrong choice for all of them instead.</p>

<p>This paper proposes that <code>std::hash&lt;&gt;</code> be designed for only the first use in the above list, and focuses on making it as simple as possible for the average user of standard library hashing containers. A later paper is likely to explore the more sophisticated interface needed to make user-defined types visible to other "fingerprinting" algorithms.</p>

<p>Hash tables only require that <code>std::hash&lt;T&gt;()(object)</code> return a stable value within a single program execution. In theory this allows the standard library to improve the hash implementation over time. However, we found at Google that without very clear documentation, we upgrade hash implementations rarely enough that developers save the values to disk anyway, which causes upgrades to break code. Therefore, we propose that hash functions <strong>return a different value in different processes</strong>, not just in different implementations, so that programs fail fast when they're incorrect. This has the secondary benefit that it frustrates pre-calculated collision tables used for denial-of-service attacks.</p>
</section>

<section>
<h2><a name="hashing.namespace">Defining hash functions easily and in a sensible namespace</a></h2>
<p>At Google, we find many users who are confused about what kinds of things they're allowed to put in <code>namespace std</code>. Requiring them to specialize <code>std::hash&lt;&gt;</code> encourages them to define new things in there, which causes undefined behavior.</p>

<p>Even more users dislike the verbosity caused by the need to close out their current namespace, open the std namespace, specialize hash, close the std namespace, and then re-open their original namespace.</p>

<p>To fix this, we propose finding <code>hash_value()</code> by argument-dependent lookup, like <code>swap</code>. <code>std::hash&lt;T&gt;::operator()(t)</code> is redefined to forward to <code>hash_value(t)</code>.</p>
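<p>The forwarding described above can be sketched as follows. This is an illustration only: <code>namespace sketch</code> is a hypothetical stand-in for <code>namespace std</code>, the functions return <code>size_t</code> for brevity, and the mixing arithmetic is a placeholder rather than a recommended hash.</p>

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {  // Hypothetical stand-in for namespace std.

// Placeholder overload for one primitive type; the multiplier is arbitrary.
inline std::size_t hash_value(int v) {
  return static_cast<std::size_t>(v) * 0x9E3779B1u;
}

// hash<T>::operator() forwards to an unqualified hash_value() call, so
// overloads in T's own namespace are found by argument-dependent lookup.
template <typename T>
struct hash {
  std::size_t operator()(const T& t) const { return hash_value(t); }
};

}  // namespace sketch

namespace my_namespace {

struct Point {
  int x;
  int y;
};

// Lives next to Point; no other namespace has to be reopened.
// The combination below is deliberately naive.
inline std::size_t hash_value(const Point& p) {
  return sketch::hash_value(p.x) * 31u + sketch::hash_value(p.y);
}

}  // namespace my_namespace

inline void usage_example() {
  my_namespace::Point p{1, 2};
  // The user-defined overload is picked up through ADL.
  assert(sketch::hash<my_namespace::Point>()(p) == hash_value(p));
  // Primitive types resolve to the overload in namespace sketch.
  assert(sketch::hash<int>()(7) == sketch::hash_value(7));
}
```

<p>Because the call inside <code>operator()</code> is unqualified, the author of <code>Point</code> never has to touch the library's namespace.</p>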

<p>This leaves the problem of how to hash built-in types. In <cite>c++std-lib-31719</cite>, Peter Dimov observed that a "practical problem with the hash_value design is that types implicitly convertible to bool such as shared_ptr have a hash_value [of] 0 or 1 unless overridden." C++11's <code>std::shared_ptr</code> only has an explicit conversion to bool, so this objection doesn't apply there, but it would be poor form to lay such traps for user-defined types that aren't using new C++11 features. Instead, define:</p>

<pre class="implementation"><code>namespace std {
  template&lt;typename Bool&gt;
  typename enable_if&lt;is_same&lt;Bool, bool&gt;::value, size_t&gt;::type
  hash_value(Bool b) { return ...; }
}</code></pre>

<pre class="example"><code>struct BoolConvertible { operator bool() { return true; } };
void f() {
  BoolConvertible bc;
  using std::hash_value;
  hash_value(bc);  // error
}</code></pre>

<p>This templating should be done for every <code>hash_value(<var>primitive</var>)</code> overload to avoid similar conversion mistakes. Other library, extension, or language techniques to prevent implicit conversions to the argument types should also be allowed.</p>
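<p>One way to check that the constrained overload really rejects convertible types is a small detection trait. The sketch below assumes a hypothetical <code>namespace sketch</code> in place of <code>namespace std</code>; <code>has_hash_value</code> is an invented helper, and the value returned for <code>bool</code> is a placeholder.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace sketch {  // Hypothetical stand-in for namespace std.

// Only participates in overload resolution when the argument type is
// exactly bool, so no implicit conversion can select it.
template <typename Bool>
typename std::enable_if<std::is_same<Bool, bool>::value, std::size_t>::type
hash_value(Bool b) {
  return b ? 1u : 0u;  // Placeholder value, not a real hash.
}

// Invented helper: true when hash_value(T) is a valid expression.
template <typename T, typename = void>
struct has_hash_value : std::false_type {};

template <typename T>
struct has_hash_value<T, decltype(void(hash_value(std::declval<T>())))>
    : std::true_type {};

}  // namespace sketch

struct BoolConvertible {
  operator bool() const { return true; }
};

static_assert(sketch::has_hash_value<bool>::value,
              "bool itself is hashable");
static_assert(!sketch::has_hash_value<BoolConvertible>::value,
              "types merely convertible to bool are rejected");

inline void usage_example() {
  assert(sketch::hash_value(true) != sketch::hash_value(false));
}
```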

<p>In order to keep compatibility with existing specializations of <code>std::hash</code>, the standard library should continue calling <code>std::hash&lt;T&gt;()(value)</code> where it needs a hash value, but user code that doesn't need such compatibility can move to the more natural <code>hash_value(value)</code>.</p>

<p>Templated user code needs to use the more verbose</p>
<pre class="example"><code>using std::hash_value;
hash_value(value);</code></pre>

<p>in order to pick up the definitions for primitive types. Perhaps the standard library should also include an <code>adl_hash_value()</code> with that definition to make this a bit less verbose. On the other hand, only hash table implementations are likely to call <code>hash_value</code>, so <code>adl_hash_value</code> may not be worth it.</p>
</section>

<section>
<h2><a name="helping.users">Helping users implement their <code>hash_value()</code></a></h2>
<p>If users are asked to implement a hash function for their own types with no guidance, they generally write bad hash functions. Instead, we should provide a simple function to pass hash-relevant member variables into, in order to define a decent hash function:</p>

<pre class="example"><code>namespace my_namespace {
  struct OtherType { ... };
  struct MyType {
    OtherType field1;
    int field2;
    char field3[500];
    float field4;  // Does not contribute to equality.
  };
  std::hash_code hash_value(const MyType&amp; val) {
    return std::hash_combine(val.field1,
                             val.field2,
                             val.field3);
  }
}</code></pre>

<p><code>std::hash_code</code> will be explained and defined later.</p>

<section>
<h3><a name="hash_combine"><code>hash_combine()</code></a></h3>
<p><code>hash_combine()</code> is declared as:</p>
<pre class="implementation"><code>namespace std {
  template&lt;typename T, typename... U&gt;
  hash_code hash_combine(const T&amp; arg1, const U&amp; ...args);
}</code></pre>
<p>and has an implementation-defined definition satisfying the following properties:</p>

<ol>
  <li><code>hash_combine(args...)</code> interacts with user-defined types by calling unqualified <code>hash_value()</code> on them.</li>
  <li>Within the same process, if <code>all_of((args1 == args2)...)</code> then <code>hash_combine(args1...) == hash_combine(args2...)</code>.</li>
  <li>If two calls to <code>hash_combine()</code> aren't defined to return equal values, then their return values must be completely different with high probability. See <a href="http://en.wikipedia.org/wiki/Avalanche_effect">http://en.wikipedia.org/wiki/Avalanche_effect</a> for a possible (but not the only possible) meaning of "completely different with high probability". Situations that don't require equal <code>hash_combine</code> results include:
  <ol>
    <li>Changing a single bit in any of <code>hash_combine()</code>'s arguments. (This includes changing <code>hash_combine(false)</code> to <code>hash_combine(true)</code>.)</li>
    <li>Calling <code>hash_combine</code> in a different execution of the same binary.</li>
    <li>Replacing <code>hash_combine(arg1, arg2, arg3)</code> with <code>hash_combine(hash_combine(arg1, arg2), arg3)</code>.</li>
  </ol>
  </li>
</ol>
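<p>To make the first two properties concrete, here is a minimal sketch of a variadic <code>hash_combine()</code>. Everything in it is an assumption made for illustration: <code>namespace sketch</code> stands in for <code>namespace std</code>, <code>hash_code</code> is a plain 64-bit integer, the seed is a fixed constant rather than a per-process value, and the mixing step borrows the SplitMix64 finalizer. It makes no claim to meet the avalanche requirement in general.</p>

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {  // Hypothetical stand-in for namespace std.

using hash_code = std::uint64_t;  // Simplification of the proposed class.

// Fixed seed for the sketch; the paper asks for a per-process seed.
constexpr hash_code kSeed = 0x9E3779B97F4A7C15ull;

// SplitMix64 finalizer: every step is invertible, so distinct inputs
// always produce distinct outputs.
inline hash_code mix(hash_code x) {
  x ^= x >> 30;
  x *= 0xBF58476D1CE4E5B9ull;
  x ^= x >> 27;
  x *= 0x94D049BB133111EBull;
  x ^= x >> 31;
  return x;
}

// Placeholder overload for int (bool and char promote to it).
inline hash_code hash_value(int v) { return static_cast<hash_code>(v); }

// Folds each argument into the running state, calling unqualified
// hash_value() so user-defined overloads are found by ADL.
inline hash_code combine_state(hash_code state) { return state; }

template <typename T, typename... U>
hash_code combine_state(hash_code state, const T& arg, const U&... args) {
  return combine_state(mix(state ^ hash_value(arg)), args...);
}

template <typename T, typename... U>
hash_code hash_combine(const T& arg1, const U&... args) {
  return combine_state(kSeed, arg1, args...);
}

}  // namespace sketch

namespace my_namespace {

struct MyType {
  int field1;
  int field2;
  float field4;  // Does not contribute to equality.
};

inline sketch::hash_code hash_value(const MyType& val) {
  return sketch::hash_combine(val.field1, val.field2);
}

}  // namespace my_namespace

inline void usage_example() {
  my_namespace::MyType a{1, 2, 0.5f};
  my_namespace::MyType b{1, 2, 9.0f};
  // Equal hash-relevant fields give equal results within one process.
  assert(hash_value(a) == hash_value(b));
  // User-defined types can be passed straight to hash_combine().
  assert(sketch::hash_combine(a, 5) == sketch::hash_combine(b, 5));
}
```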

<p>A downside to this approach is that modern hash functions contain initialization and finalization phases that will be re-run for each call to <code>hash_combine</code>, where they could be skipped if the hash algorithm knew it was only dealing with an intermediate result. The authors of <a href="http://code.google.com/p/cityhash/">CityHash</a> (Geoff Pike) and <a href="http://code.google.com/p/smhasher/wiki/MurmurHash">MurmurHash</a> (Austin Appleby) don't think this inefficiency will be a significant problem in practice.</p>
</section>

<section>
<h3><a name="per.process.seed">Different processes have different hash values</a></h3>
<p>Ensuring that the result of <code>hash_combine</code> is different in different processes requires picking a seed at process start time. We know of three ways to do this:</p>
<ol>
  <li>Use the technique of <code>&lt;iostream&gt;</code>: Define a static object in the header that defines <code>hash_combine()</code>, and have it initialize the global seed if it's the first such object to be constructed.</li>
  <li>Implement <code>hash_combine</code> like:
  <pre class="implementation"><code>atomic&lt;seed_t&gt; global_seed;
hash_combine(...) {
  seed_t seed = global_seed.load(memory_order_relaxed);
  if (seed == 0) seed = initialize_global_seed_to_nonzero();
  ...
}</code></pre>
  </li>
  <li>Implement a <code>get_seed()</code> function as:
  <pre class="implementation"><code>seed_t get_seed() {
  static const seed_t seed = initialize_global_seed();
  return seed;
}</code></pre>
  The implementation of this is likely to be similar to, but not quite as efficient as, option 2.</li>
</ol>

<p>Implementations should provide a way for users to set the seed explicitly, primarily to track down bugs that result from assuming a stable hash value.</p>
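<p>A possible shape for the function-local-static option combined with an explicit override is sketched below. The environment variable name <code>HASH_SEED</code> and the use of <code>std::random_device</code> as the entropy source are assumptions of this sketch, not part of the proposal.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <random>

namespace sketch {  // Hypothetical stand-in for implementation internals.

using seed_t = std::uint64_t;

// Uses the (hypothetical) HASH_SEED environment variable when it is set,
// so a failing run can be reproduced; otherwise draws a fresh seed.
inline seed_t initialize_global_seed() {
  if (const char* text = std::getenv("HASH_SEED")) {
    return std::strtoull(text, nullptr, 0);
  }
  std::random_device device;
  const seed_t high = device();
  const seed_t low = device();
  return (high << 32) ^ low;
}

// A function-local static is initialized exactly once in C++11, even
// when several threads make the first call concurrently.
inline seed_t get_seed() {
  static const seed_t seed = initialize_global_seed();
  return seed;
}

}  // namespace sketch

inline void usage_example() {
  // The seed is stable for the lifetime of the process.
  assert(sketch::get_seed() == sketch::get_seed());
}
```

<p>Running the same binary twice without the override would normally give two different seeds, which is the behavior that makes persisted hash values fail fast.</p>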

<p>This is actually stricter than the current requirements on the Hash concept and <code>std::hash</code>, which only require that <code>k1==k2</code> imply <code>h(k1)==h(k2)</code> for a single hash instance '<code>h</code>'. We're tightening this to require that two <code>std::hash&lt;T&gt;</code> instances in the same process return the same value for equal inputs.</p>
</section>

<section>
<h3><a name="hash_range">Hashing ranges</a></h3>
<p>It's possible to hash ranges of values using code like:</p>

<pre class="example"><code>hash_code hash_value(const MyType&amp; val) {
  hash_code result = 0;
  for (const auto&amp; v : val) {
    result = hash_combine(result, v);
  }
  return result;
}</code></pre>

<p>However, this code is likely to be slower than it needs to be because the hash algorithm needs to initialize and finalize for each call to <code>hash_combine</code>. Instead, we define:</p>

<pre class="implementation"><code>namespace std {
  template&lt;typename InputIterator&gt;
  hash_code hash_combine_range(InputIterator first,
                               InputIterator last);
}</code></pre>

<p>which combines hashes for each element in the range. If we had a <code>std::range&lt;Iterator&gt;</code> object, we could define <code>hash_value(std::range)</code> directly in terms of this. The semantics of the combining for both <code>hash_combine</code> and <code>hash_combine_range</code> are the same, leading to the following invariants:</p>
<pre class="example"><code>std::vector&lt;int&gt; v = {1, 2, 3};
assert(std::hash_combine(1, 2, 3) ==
       std::hash_combine_range(v.begin(), v.end()));
high_probability(std::hash_combine(1, 2, 3) !=
                 std::hash_combine(
                   std::hash_combine_range(v.begin(), v.end())));

// And likely following from the above:
assert(std::hash_combine(1, 2, 3) != std::hash_combine(v));</code></pre>

<p>Note that this requires that ranges of equal values hash equally, even if they have different types. This could require more implementation complexity in implementations that want to hash ranges of characters very efficiently, but it appears doable.</p>
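<p>The equality between <code>hash_combine</code> and <code>hash_combine_range</code> falls out naturally if both are built on one shared per-element step. The sketch below assumes a hypothetical <code>namespace sketch</code>, an integer <code>hash_code</code>, a fixed seed, and an invented helper named <code>absorb</code>; the mixing function is the SplitMix64 finalizer, chosen only for brevity.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {  // Hypothetical stand-in for namespace std.

using hash_code = std::uint64_t;
constexpr hash_code kSeed = 0x9E3779B97F4A7C15ull;  // Fixed for the sketch.

inline hash_code mix(hash_code x) {
  x ^= x >> 30;
  x *= 0xBF58476D1CE4E5B9ull;
  x ^= x >> 27;
  x *= 0x94D049BB133111EBull;
  x ^= x >> 31;
  return x;
}

// Placeholder overload for int (short and char promote to it).
inline hash_code hash_value(int v) { return static_cast<hash_code>(v); }

// Invented helper: the single step shared by both entry points.
template <typename T>
hash_code absorb(hash_code state, const T& value) {
  return mix(state ^ hash_value(value));
}

template <typename InputIterator>
hash_code hash_combine_range(InputIterator first, InputIterator last) {
  hash_code state = kSeed;
  for (; first != last; ++first) state = absorb(state, *first);
  return state;
}

inline hash_code combine_from(hash_code state) { return state; }

template <typename T, typename... U>
hash_code combine_from(hash_code state, const T& arg, const U&... args) {
  return combine_from(absorb(state, arg), args...);
}

template <typename T, typename... U>
hash_code hash_combine(const T& arg1, const U&... args) {
  return combine_from(kSeed, arg1, args...);
}

}  // namespace sketch

inline void usage_example() {
  std::vector<int> v = {1, 2, 3};
  assert(sketch::hash_combine(1, 2, 3) ==
         sketch::hash_combine_range(v.begin(), v.end()));
  // Ranges of equal values hash equally even when element types differ.
  std::vector<short> s = {1, 2, 3};
  assert(sketch::hash_combine_range(s.begin(), s.end()) ==
         sketch::hash_combine_range(v.begin(), v.end()));
}
```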
</section>

<section>
<h3><a name="hash_as_bytes">Hashing objects as byte arrays</a></h3>
<p>Users may also want to improve efficiency by hashing several sequential standard-layout and trivially-copyable fields as a single byte array, rather than passing them each individually to <code>hash_combine</code>. Unfortunately, those requirements aren't enough: the fields also have to be free of padding. To handle this, we propose the following type trait and a third hashing function.</p>

<pre class="implementation"><code>namespace std {
  template&lt;typename T&gt; struct is_contiguous_layout;
  template&lt;typename T&gt;
  hash_code hash_as_bytes(const T&amp; obj);
}</code></pre>

<p><code>std::is_contiguous_layout&lt;T&gt;</code> shall be a <i>UnaryTypeTrait</i> whose <i>BaseCharacteristic</i> is: <code>std::true_type</code> if all bits of the object representation participate in the value representation, or <code>std::false_type</code> otherwise. In other words, if there are any padding bits in the bytes making up an object of type <code>T</code>, then <code>is_contiguous_layout&lt;T&gt;</code> inherits from <code>false_type</code>. <code>is_contiguous_layout</code> would also be useful in optimizing uses of <code>compare_exchange</code>.</p>

<p><code>hash_as_bytes(obj)</code> is ill-formed (a diagnostic <strong>is</strong> required) if the type <code>T</code> of <code>obj</code> does not satisfy</p>
<pre class="implementation"><code>is_trivially_copyable&lt;T&gt;::value &amp;&amp;
is_standard_layout&lt;T&gt;::value &amp;&amp;
is_contiguous_layout&lt;T&gt;::value</code></pre>
<p><code>hash_as_bytes(obj)</code> is implemented as:</p>
<pre class="implementation"><code>const unsigned char* addr = reinterpret_cast&lt;const unsigned char*&gt;(&amp;obj);
return hash_combine_range(addr, addr + sizeof(obj));</code></pre>
<p>If a user-defined type has a subset of fields that satisfy the requirements for <code>hash_as_bytes()</code>, its author can place those fields in a sub-struct.</p>

<p><strong>Note:</strong> We considered allowing users to pass multiple objects to <code>hash_as_bytes</code>, and having it treat them as a single long byte array, optimizing cases where objects are adjacent. However, because fast hash functions operate block-at-a-time, this would be difficult to implement without either copying data or changing the resulting hash value.</p>

<p>It would be possible to perform the above optimization automatically inside some calls to <code>hash_combine()</code> where arguments are adjacent and of appropriate sizes, but Austin Appleby believed this was likely to surprise people, so we have a way for users to request it explicitly.</p>
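<p>The compile-time check on <code>hash_as_bytes</code> might look like the sketch below. It rests on two assumptions that are not part of this paper: C++17's <code>std::has_unique_object_representations</code> is used as an approximation of the proposed <code>is_contiguous_layout</code> (the two are similar in intent but not guaranteed to agree for every type), and a simple FNV-1a loop stands in for <code>hash_combine_range</code>. <code>namespace sketch</code> is hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

namespace sketch {  // Hypothetical stand-in for namespace std.

using hash_code = std::uint64_t;

// Assumption: approximates the proposed trait with a C++17 library trait.
template <typename T>
using is_contiguous_layout = std::has_unique_object_representations<T>;

// FNV-1a over a byte range, standing in for hash_combine_range().
inline hash_code hash_bytes(const unsigned char* first,
                            const unsigned char* last) {
  hash_code state = 14695981039346656037ull;  // FNV offset basis.
  for (; first != last; ++first) {
    state ^= *first;
    state *= 1099511628211ull;  // FNV prime.
  }
  return state;
}

template <typename T>
hash_code hash_as_bytes(const T& obj) {
  static_assert(std::is_trivially_copyable<T>::value &&
                    std::is_standard_layout<T>::value &&
                    is_contiguous_layout<T>::value,
                "hash_as_bytes requires a padding-free, trivially "
                "copyable, standard-layout type");
  const unsigned char* addr = reinterpret_cast<const unsigned char*>(&obj);
  return hash_bytes(addr, addr + sizeof(obj));
}

}  // namespace sketch

// Two same-sized integers leave no room for padding between them.
struct Packed {
  std::int32_t a;
  std::int32_t b;
};

static_assert(sketch::is_contiguous_layout<Packed>::value,
              "Packed has no padding bits");

inline void usage_example() {
  Packed x{1, 2};
  Packed y{1, 2};
  assert(sketch::hash_as_bytes(x) == sketch::hash_as_bytes(y));
}
```

<p>On typical ABIs a struct such as <code>struct { char c; std::int32_t i; }</code> contains padding after <code>c</code>, so passing it to this <code>hash_as_bytes</code> would trip the <code>static_assert</code> instead of silently hashing indeterminate bytes.</p>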
</section>

<section>
<h3><a name="hash_code">The <code>hash_code</code> type</a></h3>
<p>The authors of this paper initially believed that having all intermediate hash values bottleneck through <code>size_t</code> would increase the collision rate and reduce performance. Geoff Pike and Austin Appleby agreed that <code>size_t</code> does increase the collision rate and reduce performance slightly, but not enough to justify the extra complexity of avoiding it. So, algorithmically, it would be fine to have all hash functions return <code>size_t</code>.</p>

<p>However, Austin argued that "if <code><var>hash</var>()</code> doesn't behave like a function in the mathematical '<code>y = f(x)</code>' sense due to the intentionally unstable implementation, then <code>hash_code</code> should not behave like a number and should be as opaque as possible" in order to avoid user confusion. We've decided to have the three hash utility functions return a new <code>hash_code</code> object:</p>

<pre class="implementation"><code>namespace std {
  class hash_code {
    // Implementation-specific state; probably size_t.
  public:
    // For compatibility with existing hash functions
    hash_code(size_t value);
    // Copyable, assignable.
    hash_code(const hash_code&amp;);
    hash_code&amp; operator=(const hash_code&amp;);
    // Explicit conversion to size_t.
    explicit operator size_t() const;
  };
  bool operator==(const hash_code&amp;, const hash_code&amp;);
  bool operator!=(const hash_code&amp;, const hash_code&amp;);
  hash_code hash_value(const hash_code&amp;);
}</code></pre>
<p>For backward compatibility with existing hash tables using <code>std::hash&lt;T&gt;</code>, its <code>operator()</code> still needs to return <code>size_t</code>.</p>
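<p>The interplay between an opaque <code>hash_code</code> and a <code>size_t</code>-returning <code>hash&lt;T&gt;</code> can be sketched as follows. The class body, the placeholder <code>hash_value(int)</code>, and <code>namespace sketch</code> are all assumptions made for illustration; only the interface shape comes from the declaration above.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {  // Hypothetical stand-in for namespace std.

class hash_code {
  std::size_t value_;  // Implementation-specific state.

 public:
  // For compatibility with existing hash functions.
  hash_code(std::size_t value) : value_(value) {}
  // Leaving the opaque type requires an explicit cast.
  explicit operator std::size_t() const { return value_; }

  friend bool operator==(const hash_code& a, const hash_code& b) {
    return a.value_ == b.value_;
  }
  friend bool operator!=(const hash_code& a, const hash_code& b) {
    return !(a == b);
  }
};

// Placeholder overload for int; the multiplier is arbitrary.
inline hash_code hash_value(int v) {
  return hash_code(static_cast<std::size_t>(v) * 0x9E3779B1u);
}

// hash<T> keeps returning size_t so existing hash tables still work.
template <typename T>
struct hash {
  std::size_t operator()(const T& t) const {
    return static_cast<std::size_t>(hash_value(t));
  }
};

}  // namespace sketch

static_assert(!std::is_convertible<sketch::hash_code, std::size_t>::value,
              "hash_code does not silently decay to a number");

inline void usage_example() {
  assert(sketch::hash_value(5) == sketch::hash_value(5));
  assert(sketch::hash<int>()(5) ==
         static_cast<std::size_t>(sketch::hash_value(5)));
}
```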
</section>
</section>

<section>
<h2><a name="future">Future work</a></h2>
<p>A single hash function that's easy to implement isn't enough for all use cases. The standard library should offer a collection of stable fingerprint algorithms so that users can share their results between different implementations and get more specialized characteristics in the output. These algorithms need a way to traverse user-defined types, and the above <code>hash_value()</code> function doesn't directly generalize. It is outside the scope of this paper to discuss approaches to generic fingerprinting algorithms.</p>
</section>

<section>
<h2><a name="acknowledgements">Acknowledgements</a></h2>
<p>Thanks to Geoff Pike (the author of CityHash) and Austin Appleby (the author of MurmurHash) for providing expert guidance. Thanks to Lawrence Crowl, Matt Austern, and Richard Smith for helping with the C++ interfaces and editing this paper.</p>
</section>

</body>
</html>