<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=US-ASCII" />
    <title>Proposing std::split()</title>

    <style type="text/css">

    body { color: #000000; background-color: #FFFFFF; }
    del { text-decoration: line-through; color: #8B0040; }
    ins { text-decoration: underline; color: #005100; }

    p.example { margin-left: 2em; }
    pre.example { margin-left: 2em; }
    div.example { margin-left: 2em; }

    code.extract { background-color: #F5F6A2; }
    pre.extract { margin-left: 2em; background-color: #F5F6A2;
      border: 1px solid #E1E28E; }

    p.function { }
    .attribute { margin-left: 2em; }
    .attribute dt { float: left; font-style: italic;
      padding-right: 1ex; }
    .attribute dd { margin-left: 0em; }

    blockquote.std { color: #000000; background-color: #F1F1F1;
      border: 1px solid #D1D1D1;
      padding-left: 0.5em; padding-right: 0.5em; }
    blockquote.stddel { text-decoration: line-through;
      color: #000000; background-color: #FFEBFF;
      border: 1px solid #ECD7EC;
      padding-left: 0.5em; padding-right: 0.5em; }

    blockquote.stdins { text-decoration: underline;
      color: #000000; background-color: #C8FFC8;
      border: 1px solid #B3EBB3; padding: 0.5em; }

    table { border: 1px solid black; border-spacing: 0px;
      margin-left: auto; margin-right: auto; }
    th { text-align: left; vertical-align: top;
      padding-left: 0.8em; border: none; }
    td { text-align: left; vertical-align: top;
      padding-left: 0.8em; border: none; }

    </style>

    <script
      src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"
      type="text/javascript"> </script>

    <script type="text/javascript">$(function() {
        var next_id = 0
        function find_id(node) {
            // Look down the first children of 'node' until we find one
            // with an id. If we don't find one, give 'node' an id and
            // return that.
            var cur = node[0];
            while (cur) {
                if (cur.id) return cur.id;
                if (cur.tagName == 'A' && cur.name)
                    return cur.name;
                cur = cur.firstChild;
            };
            // No id.
            node.attr('id', 'gensection-' + next_id++);
            return node.attr('id');
        };

        // Put a table of contents in the #toc nav.

        // This is a list of <ol> elements, where toc[N] is the list for
        // the current sequence of <h(N+2)> tags. When a header of an
        // existing level is encountered, all higher levels are popped,
        // and an <li> is appended to the level
        var toc = [$("<ol/>")];
        $(':header').not('h1').each(function() {
            var header = $(this);
            // For each <hN> tag, add a link to the toc at the appropriate
            // level.  When toc is one element too short, start a new list
            var levels = {H2: 0, H3: 1, H4: 2, H5: 3, H6: 4};
            var level = levels[this.tagName];
            if (typeof level == 'undefined') {
                throw 'Unexpected tag: ' + this.tagName;
            }
            // Truncate to the new level.
            toc.splice(level + 1, toc.length);
            if (toc.length < level) {
                // Omit TOC entries for skipped header levels.
                return;
            }
            if (toc.length == level) {
                // Add a <ol> to the previous level's last <li> and push
                // it into the array.
                var ol = $('<ol/>')
                toc[toc.length - 1].children().last().append(ol);
                toc.push(ol);
            }
            var header_text = header.text();
            toc[toc.length - 1].append(
                $('<li/>').append($('<a href="#' + find_id(header) + '"/>')
                                  .text(header_text)));
        });
        $('#toc').append(toc[0]);
    })
    </script>

  </head>
  <body>
    <h1>Proposing std::split()</h1>

    <p>
    ISO/IEC JTC1 SC22 WG21 N3430 = 12-0120 - 2012-09-19
    </p>

    <address>
      Greg Miller, jgm@google.com
    </address>

    <div id="toc">
    <!-- Generated dynamically by javascript -->
    </div>

    <h2><a name="introduction">Introduction</a></h2>

    <p>
    Splitting strings into substrings is a common task in most general-purpose programming languages, and C++ is no exception.
    When the need arises, programmers need to search for an existing solution or write one of their own.
    A typical solution might look like the following:
    </p>

    <pre class="example">
    <code>std::vector&lt;std::string&gt; my_split(const std::string&amp; text, const std::string&amp; delimiter);</code>
    </pre>

    <p>
    A straightforward implementation of the above function would likely use <code>std::string::find</code> or <code>std::string::find_first_of</code> to identify substrings and move from one to the next, building the vector to return.
    This is a fine solution for simple needs, but it is deficient in the following ways:
    </p>

    <ul>
      <li>Must be reimplemented by each individual/organization</li>
      <li>Not adaptable to different types of delimiter, such as regular expressions</li>
      <li>Not adaptable to different return types, such as <code>std::set&lt;std::string&gt;</code></li>
      <li>Not configurable to do common things, such as skipping empty substrings</li>
    </ul>
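
    <p>
    For reference, the straightforward implementation described above might be sketched as follows.
    This is only an illustration, not code from any particular code base, and it assumes a non-empty delimiter.
    Note that the kind of delimiter, the return type, and the treatment of empty substrings are all fixed by the signature.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;vector&gt;

    // Illustrative implementation of the my_split() declaration shown above.
    // Assumption: delimiter is not empty (an empty one would loop forever).
    std::vector&lt;std::string&gt; my_split(const std::string&amp; text,
                                      const std::string&amp; delimiter) {
      std::vector&lt;std::string&gt; result;
      std::string::size_type begin = 0;
      std::string::size_type end;
      // Each find() locates the next delimiter; the text before it is one piece.
      while ((end = text.find(delimiter, begin)) != std::string::npos) {
        result.push_back(text.substr(begin, end - begin));
        begin = end + delimiter.size();
      }
      result.push_back(text.substr(begin));  // Text after the last delimiter.
      return result;
    }</code>
    </pre>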

    <p>
    These are real deficiencies that resulted in Google code accumulating more than 50 separate "Split" functions for various needs.
    For example, the following is a family of related split functions that are used in real Google code:
    </p>

    <pre class="example">
    <code>SplitStringUsing  <em>// Splits to std::vector&lt;string&gt;</em>
    SplitStringToHashsetUsing
    SplitStringToSetUsing
    SplitStringToHashmapUsing
    SplitStringAllowEmpty
    SplitStringToHashsetAllowEmpty
    SplitStringToSetAllowEmpty
    SplitStringToHashmapAllowEmpty</code>
    </pre>

    <p>
    Each of the above functions splits an input string using any of the single-byte delimiters given as the delimiter string.
    They differ only in the collection type they return and whether or not empty substrings are included in the output.
    The moment someone needs to split a string into a <code>std::unordered_set</code>, two
    new split functions will need to be written: one that skips empty substrings
    and one that allows them.
    </p>

    <p>
    To address the above deficiencies, Google has implemented and is internally using a new API for splitting strings.
    The new API has been very well received by internal engineers writing real code, and it is rapidly replacing the existing assortment of split functions.
    The following examples demonstrate Google's new string splitting API in a few common usage scenarios.
    </p>

    <pre class="example">
    <code>using std::set;
    using std::string;
    using std::string_ref;
    using std::vector;

    vector&lt;string&gt; v1 = strings::Split("a&lt;br&gt;b&lt;br&gt;c", "&lt;br&gt;");
    <em>// v1 is {"a", "b", "c"}</em>

    set&lt;string&gt; s1 = strings::Split("a,b,c,a,b,c", ",");
    <em>// s1 is {"a", "b", "c"}</em>

    vector&lt;string&gt; v2 = strings::Split("a,b;c-d", strings::AnyOf(",;-"));
    <em>// v2 is {"a", "b", "c", "d"}</em>

    vector&lt;string&gt; v3 = strings::Split("a,,c", ",", strings::SkipEmpty());
    <em>// v3 is {"a", "c"}</em>

    vector&lt;string_ref&gt; v4 = strings::Split("a,b,c", ",");
    <em>// v4 is {"a", "b", "c"} -- string_refs refer to the data passed in the first arg, avoiding data copies</em></code>
    </pre>

    <p>
    The rest of this paper describes Google's new string splitting API as it might appear in C++ in the <code>std::</code> namespace.
    </p>

    <h2><a name="new_api">New API</a></h2>

    <p>
    At a basic level, a string splitting API breaks text into substrings using a separator or delimiter.
    This simple description combined with real-world programmer needs drawn from the existence and usage of existing split functions has led to the following goals for a new string splitting API:
    </p>

    <ul>
      <li>The concept of a "delimiter" must be flexible and extensible, allowing simple substring and single character separators, as well as more complicated delimiters like regular expressions.</li>
      <li>The caller should be able to receive the results in any standard STL container.</li>
      <li>The algorithm should support common operations, such as skipping empty substrings.</li>
    </ul>

    <p>
    The above goals are realized in the following API:
    </p>

    <pre class="example">
    <code>namespace std {

      template &lt;typename Delimiter&gt;
      splitter&lt;Delimiter&gt; split(std::string_ref text, Delimiter d);

      template &lt;typename Delimiter, typename Predicate&gt;
      splitter&lt;Delimiter&gt; split(std::string_ref text, Delimiter d, Predicate p);

    }</code>
    </pre>

    <p>
    [<i>Footnote:</i>
    This API uses the <code>std::string_ref</code> API <a href="#string_ref">[string_ref]</a> to minimize string copies.
    This API could be written in terms of <code>std::string</code> instead if <code>std::string_ref</code> is not available.
    &mdash;<i>end footnote</i>]
    </p>

    <p>
    The Delimiter template parameter represents various ways to delimit strings, such as substrings, single characters, or even regular expressions.
    The Predicate, given in the second form, represents various ways to filter the results, such as skipping empty strings.
    The <code>splitter&lt;T&gt;</code> that is returned from <code>std::split()</code> has a templated conversion operator (<code>operator Container()</code>) that allows it to be implicitly converted to the container type specified by the caller.
    </p>

    <p>
    The text to be split is given as a <code>std::string_ref</code> object, which cannot modify the underlying data to which it refers.
    Thus, the input text to be split is effectively immutable.
    The split results may also be returned in a collection of <code>std::string_ref</code> objects.
    In this case, the resultant <code>std::string_ref</code> objects will refer to the text data that was given as input, eliminating all string data copies.
    Data are only copied if the caller requests to store results in a container of objects that copy the data, such as a container of <code>std::string</code> objects.
    </p>

    <h3><a name="delimiters">Delimiters</a></h3>

    <p>
    The general notion of a delimiter is not new.
    A delimiter (little d) marks the boundary between two substrings in a larger string.
    With this split API comes the formal concept of a <em>Delimiter</em> (big D).
    A Delimiter is an object with a <code>find()</code> member function that knows how to find the first occurrence of itself in a given <code>std::string_ref</code>.
    Objects that conform to the Delimiter concept represent specific kinds of delimiters, such as single characters, substrings, and regular expressions.
    </p>

    <p>
    The following example shows a simple object that models the Delimiter concept.
    It has a <code>find()</code> member function that is responsible for finding the next occurrence of a char in the given text.
    The <code>std::string_ref</code> returned from the <code>find()</code> member function must refer to a substring of <code>find()</code>'s argument text, or else it must be an empty <code>std::string_ref</code>.
    </p>

    <pre class="example">
    <code>struct char_delimiter {
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      <strong>std::string_ref find(std::string_ref text) const</strong> {
        std::string_ref::size_type pos = text.find(c_);
        if (pos == std::string_ref::npos)
          return std::string_ref();            <em>// Not found, returns empty std::string_ref.</em>
        return std::string_ref(text, pos, 1);  <em>// Returns a string_ref referring to the c_ that was found in the input string.</em>
      }
    };</code>
    </pre>

    <p>
    The following shows how the above delimiter could be used to split a string:
    </p>

    <pre class="example">
    <code>std::vector&lt;std::string&gt; v = std::split("a,b,c", <strong>char_delimiter(',')</strong>);
    <em>// v is {"a", "b", "c"}</em></code>
    </pre>
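
    <p>
    To make the role of <code>find()</code> concrete, the following sketch shows one way a split algorithm might drive a Delimiter.
    It is an assumption-laden illustration rather than the proposed implementation: <code>std::string_view</code> (C++17) stands in for <code>std::string_ref</code>, <code>text.substr(pos, 1)</code> replaces the three-argument constructor used above, and <code>split_pieces</code> is a hypothetical helper name.
    The sketch treats an empty result from <code>find()</code> as "not found", so it does not cover the empty-string delimiter.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string_view&gt;
    #include &lt;vector&gt;

    // Assumption: std::string_view stands in for the proposed std::string_ref.
    using string_ref = std::string_view;

    // The char_delimiter from the example above, adapted to string_view.
    struct char_delimiter {
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      string_ref find(string_ref text) const {
        string_ref::size_type pos = text.find(c_);
        if (pos == string_ref::npos) return string_ref();
        return text.substr(pos, 1);
      }
    };

    // Hypothetical helper (not part of the proposal): repeatedly asks the
    // Delimiter for its next occurrence and collects the text in between.
    template &lt;typename Delimiter&gt;
    std::vector&lt;string_ref&gt; split_pieces(string_ref text, Delimiter d) {
      std::vector&lt;string_ref&gt; pieces;
      for (;;) {
        string_ref found = d.find(text);
        if (found.empty()) {  // No further delimiter: the rest is one piece.
          pieces.push_back(text);
          return pieces;
        }
        // found refers into text, so pointer arithmetic yields its offset.
        std::size_t offset = static_cast&lt;std::size_t&gt;(found.data() - text.data());
        pieces.push_back(text.substr(0, offset));
        text.remove_prefix(offset + found.size());
      }
    }</code>
    </pre>

    <p>
    Because <code>find()</code> returns a reference into its argument, the caller learns both where the delimiter starts and how long it is, which is what allows delimiters of varying length.
    </p>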

    <p>
    The following are standard delimiter implementations that will be part of the splitting API.
    </p>

    <dl>
      <dt><code>std::literal</code></dt>
      <dd>A string delimiter.
      The default delimiter used if a string is given as the delimiter argument to <code>std::split()</code>.
      (Alternative name, <code>std::literal_delimiter</code>.)</dd>

      <dt><code>std::any_of</code></dt>
      <dd>Each character in the given string is a delimiter.
      This is different from the <code>std::any_of</code> algorithm [alg.any_of], but overload resolution should disambiguate them.
      (Alternative name, <code>std::any_of_delimiter</code>.)</dd>
    </dl>
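
    <p>
    As an illustration only, the <code>find()</code> members of these two delimiters might be implemented roughly as follows.
    The class names carry a <code>_sketch</code> suffix to make clear that they are hypothetical, and <code>std::string_view</code> (C++17) again stands in for <code>std::string_ref</code>.
    The special handling of an empty delimiter string is deliberately left out.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;string_view&gt;

    // Assumption: std::string_view stands in for the proposed std::string_ref.
    using string_ref = std::string_view;

    // Hypothetical sketch of a whole-string delimiter.
    class literal_sketch {
     public:
      explicit literal_sketch(string_ref sref) : delimiter_(sref) {}
      string_ref find(string_ref text) const {
        string_ref::size_type pos = text.find(delimiter_);
        if (pos == string_ref::npos) return string_ref();
        return text.substr(pos, delimiter_.size());  // The matched delimiter.
      }

     private:
      const std::string delimiter_;
    };

    // Hypothetical sketch of an any-of-these-characters delimiter.
    class any_of_sketch {
     public:
      explicit any_of_sketch(string_ref sref) : delimiters_(sref) {}
      string_ref find(string_ref text) const {
        string_ref::size_type pos = text.find_first_of(delimiters_);
        if (pos == string_ref::npos) return string_ref();
        return text.substr(pos, 1);  // Exactly one delimiter character.
      }

     private:
      const std::string delimiters_;
    };</code>
    </pre>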

    <h3><a name="predicates">Predicates</a></h3>

    <p>
    The predicates used in the splitting API are unary function objects that return true or false.
    These are normal STL predicates [<i>Footnote:</i> C++11[algorithms.general]p8 &mdash;<i>end footnote</i>].
    They are used to filter the results of a split operation by determining whether or not a resultant element should be included or filtered out.
    The following example shows a predicate that will omit empty strings from the results of a split.
    </p>

    <pre class="example">
    <code>struct skip_empty {
      bool operator()(std::string_ref sref) const {
        return !sref.empty();
      }
    };</code>
    </pre>

    <p>
    The above predicate could be used when splitting as follows:
    </p>

    <pre class="example">
    <code>std::vector&lt;std::string&gt; v = std::split("a,,c", ",", <strong>skip_empty()</strong>);
    <em>// v is {"a", "c"}</em></code>
    </pre>
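
    <p>
    The filtering step itself can be pictured as a plain <code>std::copy_if</code> over the candidate substrings: an element is kept only when the predicate returns true.
    The sketch below assumes <code>std::string_view</code> (C++17) in place of <code>std::string_ref</code>; <code>filter_pieces</code> and <code>skip_comments</code> are hypothetical names introduced only for this illustration.
    </p>

    <pre class="example">
    <code>#include &lt;algorithm&gt;
    #include &lt;iterator&gt;
    #include &lt;string_view&gt;
    #include &lt;vector&gt;

    // Assumption: std::string_view stands in for the proposed std::string_ref.
    using string_ref = std::string_view;

    // The skip_empty predicate from the example above.
    struct skip_empty {
      bool operator()(string_ref sref) const { return !sref.empty(); }
    };

    // Hypothetical user-defined predicate: drops pieces that start with '#'.
    struct skip_comments {
      bool operator()(string_ref sref) const {
        return sref.empty() || sref.front() != '#';
      }
    };

    // Hypothetical helper: keeps a piece only when p(piece) is true.
    template &lt;typename Predicate&gt;
    std::vector&lt;string_ref&gt; filter_pieces(const std::vector&lt;string_ref&gt;&amp; pieces,
                                          Predicate p) {
      std::vector&lt;string_ref&gt; kept;
      std::copy_if(pieces.begin(), pieces.end(), std::back_inserter(kept), p);
      return kept;
    }</code>
    </pre>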

    <h3><a name="splitter">The splitter&lt;T&gt;</a></h3>

    <p>
    The <code>std::split()</code> function returns an object of type <code>splitter&lt;T&gt;</code>.
    This object is responsible for returning the results in the caller-specified container, which can be done using a templated conversion operator.
    The <code>splitter&lt;T&gt;</code> will also have <code>begin()</code> and <code>end()</code> member functions so it can be used in range-based for loops.
    The <code>splitter&lt;T&gt;</code> is used to implement the behavior of the <code>std::split()</code> function&mdash;<em>it is not part of the public split API</em>.
    The following example shows what a <code>splitter&lt;T&gt;</code> might look like.
    </p>

    <pre class="example">
    <code>template &lt;typename Delimiter&gt;
    class splitter {
     public:
      &hellip;
      const iterator&amp; begin() const;
      const iterator&amp; end() const;

      template &lt;typename Container&gt;
      operator Container() {
        return Container(begin(), end());
      }
    };</code>
    </pre>

    <p>
    The example code above shows a possible splitter interface with support for range-based for loops and implicit conversion to caller-specified containers.
    The templated conversion operator in the example above shall not participate in overload resolution unless the Container has a constructor taking a begin and end iterator.
    </p>
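
    <p>
    One possible way to express that constraint is sketched below, using <code>std::enable_if_t</code> and <code>std::is_constructible_v</code> from C++17.
    The class is a simplified, hypothetical stand-in: it stores substrings that have already been computed instead of a Delimiter, so that only the conversion operator is on display, and it uses <code>std::string_view</code> in place of <code>std::string_ref</code>.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;string_view&gt;
    #include &lt;type_traits&gt;
    #include &lt;utility&gt;
    #include &lt;vector&gt;

    // Assumption: std::string_view stands in for the proposed std::string_ref.
    using string_ref = std::string_view;

    // Hypothetical, simplified splitter holding precomputed pieces.
    class splitter_sketch {
     public:
      using iterator = std::vector&lt;string_ref&gt;::const_iterator;

      explicit splitter_sketch(std::vector&lt;string_ref&gt; pieces)
          : pieces_(std::move(pieces)) {}

      iterator begin() const { return pieces_.begin(); }
      iterator end() const { return pieces_.end(); }

      // Participates in overload resolution only when Container can be
      // constructed from a pair of iterators.
      template &lt;typename Container,
                typename = std::enable_if_t&lt;
                    std::is_constructible_v&lt;Container, iterator, iterator&gt;&gt;&gt;
      operator Container() const {
        return Container(begin(), end());
      }

     private:
      std::vector&lt;string_ref&gt; pieces_;
    };</code>
    </pre>

    <p>
    With this constraint in place, a conversion to <code>std::vector&lt;std::string&gt;</code> is accepted, while a conversion to an unrelated type such as <code>int</code> is rejected during overload resolution rather than inside the body of the operator.
    </p>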

    <h2><a name="synopsis">API Synopsis</a></h2>

    <h3><a name="std_split">std::split()</a></h3>

    <p>
    The function called to split an input string into a collection of substrings.
    </p>

    <pre class="example">
    <code>namespace std {

      template &lt;typename Delimiter&gt;
      splitter&lt;Delimiter&gt; split(std::string_ref text, Delimiter d);

      template &lt;typename Delimiter, typename Predicate&gt;
      splitter&lt;Delimiter&gt; split(std::string_ref text, Delimiter d, Predicate p);

    }</code>
    </pre>

    <h3><a name="std_literal">std::literal (delimiter)</a></h3>

    <p>
    A string delimiter.
    This is the default delimiter used if a string is given as the delimiter argument to <code>std::split()</code>.
    Alternatively, this delimiter could be named differently, such as <code>std::literal_delimiter</code>.
    </p>

    <pre class="example">
    <code>namespace std {

      class literal {
       public:
        explicit literal(string_ref sref);
        <strong>string_ref find(string_ref text) const;</strong>

       private:
        const string delimiter_;
      };

    }</code>
    </pre>

    <h3><a name="std_any_of">std::any_of (delimiter)</a></h3>

    <p>
    Each character in the given string is a delimiter.
    This is different from the <code>std::any_of</code> algorithm [alg.any_of], but overload resolution should disambiguate this delimiter.
    Alternatively, this delimiter could be named differently, such as <code>std::any_of_delimiter</code>.
    </p>

    <pre class="example">
    <code>namespace std {

      class any_of {
       public:
        explicit any_of(string_ref sref);
        <strong>string_ref find(string_ref text) const;</strong>

       private:
        const string delimiters_;
      };

    }</code>
    </pre>

    <h3><a name="std_skip_empty">std::skip_empty (predicate)</a></h3>

    <p>
    Skips empty substrings in the <code>std::split()</code> output collection.
    </p>

    <pre class="example">
    <code>namespace std {

      struct skip_empty {
        bool operator()(string_ref sref) const {
          return !sref.empty();
        }
      };

    }</code>
    </pre>

    <h2><a name="api_usage">API Usage</a></h2>

    <p>
    The following using declarations are assumed for brevity:
    </p>

    <pre class="example">
    <code>using std::deque;
    using std::list;
    using std::set;
    using std::string;
    using std::string_ref;
    using std::vector;</code>
    </pre>

    <ol>

    <li>
    The default delimiter when not explicitly specified is <code>std::literal</code>.
    The following two calls to <code>std::split()</code> are equivalent.
    The first form is provided for convenience.
    <pre class="example">
    <code>vector&lt;string&gt; v1 = std::split("a,b,c", <strong>","</strong>);
    vector&lt;string&gt; v2 = std::split("a,b,c", <strong>std::literal(",")</strong>);</code>
    </pre>
    </li>

    <li>
    Empty substrings are included in the returned collection unless explicitly filtered out using a predicate.
    <pre class="example">
    <code>vector&lt;string&gt; v1 = std::split("a,,c", ",");
    assert(<strong>v1.size() == 3</strong>);  <em>// "a", "", "c"</em>

    vector&lt;string&gt; v2 = std::split("a,b,c", ",", <strong>std::skip_empty()</strong>);
    assert(<strong>v2.size() == 2</strong>);  <em>// "a", "c"</em></code>
    </pre>
    </li>

    <li>
    Results can be returned in various STL containers as specified by the caller.
    <pre class="example">
    <code>vector&lt;string&gt; v = std::split("a,b,c", ",");
    deque&lt;string&gt; d = std::split("a,b,c", ",");
    set&lt;string&gt; s = std::split("a,b,c", ",");
    list&lt;string&gt; l = std::split("a,b,c", ",");</code>
    </pre>
    </li>

    <li>
    A delimiter of the empty string results in each character in the input string becoming one element in the output collection.
    <pre class="example">
    <code>vector&lt;string&gt; v = std::split(<strong>"abc", ""</strong>);
    assert(<strong>v.size() == 3</strong>);  <em>// "a", "b", "c"</em></code>
    </pre>
    </li>

    <li>
    Results can also be returned in a container of <code>std::string_ref</code> objects rather than <code>std::string</code>s.
    The returned <code>std::string_ref</code>s will refer to the data that was given as input to the <code>std::split()</code> function.
    This eliminates all copies of string data while splitting.
    <pre class="example">
    <code>vector&lt;<strong>string_ref</strong>&gt; v = std::split("a,b,c", ",");  <em>// No data copied.</em>
    assert(v.size() == 3);  <em>// "a", "b", "c"</em>
    </code>
    </pre>
    </li>

    <li>
    The results of a split can be iterated directly in a range-based for loop.
    <pre class="example">
    <code>for (string_ref sref : std::split("a,b,c", ",")) {
      <em>// use sref</em>
    }</code>
    </pre>
    </li>

    </ol>
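
    <p>
    The zero-copy behavior described above can be checked with a small experiment.
    The sketch below is an illustration under stated assumptions: <code>std::string_view</code> (C++17) stands in for <code>std::string_ref</code>, and <code>split_views</code> and <code>pieces_alias_input</code> are hypothetical helpers that handle only a single-character delimiter.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;string_view&gt;
    #include &lt;vector&gt;

    // Assumption: std::string_view stands in for the proposed std::string_ref.
    using string_ref = std::string_view;

    // Hypothetical single-character splitter returning views into text.
    std::vector&lt;string_ref&gt; split_views(string_ref text, char delimiter) {
      std::vector&lt;string_ref&gt; pieces;
      string_ref::size_type begin = 0;
      for (;;) {
        string_ref::size_type end = text.find(delimiter, begin);
        if (end == string_ref::npos) {
          pieces.push_back(text.substr(begin));  // Text after the last delimiter.
          return pieces;
        }
        pieces.push_back(text.substr(begin, end - begin));
        begin = end + 1;
      }
    }

    // Returns true when every piece lies inside the buffer owned by input,
    // that is, when no character data was copied.
    bool pieces_alias_input(const std::string&amp; input) {
      for (string_ref piece : split_views(input, ',')) {
        if (piece.data() &lt; input.data() ||
            piece.data() + piece.size() &gt; input.data() + input.size()) {
          return false;
        }
      }
      return true;
    }</code>
    </pre>

    <p>
    A consequence of this design is that the caller must keep the input text alive for as long as the resulting views are in use.
    </p>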

    <h2><a name="references">References</a></h2>

    <dl>
      <dt>
      <a name="string_ref">[string_ref]</a>
      </dt>
      <dd>
      <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3442.html">N3442</a>
      (previously, <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3334.html">N3334</a>)
      </dd>
    </dl>

  </body>
</html>

