<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=US-ASCII" />
    <title>Proposing std::split()</title>

    <style type="text/css">

    body { color: #000000; background-color: #FFFFFF; }
    del { text-decoration: line-through; color: #8B0040; }
    ins { text-decoration: underline; color: #005100; }

    p.example { margin-left: 2em; }
    pre.example { margin-left: 2em; }
    div.example { margin-left: 2em; }

    code.extract { background-color: #F5F6A2; }
    pre.extract { margin-left: 2em; background-color: #F5F6A2;
      border: 1px solid #E1E28E; }

    p.function { }
    .attribute { margin-left: 2em; }
    .attribute dt { float: left; font-style: italic;
      padding-right: 1ex; }
    .attribute dd { margin-left: 0em; }

    blockquote.std { color: #000000; background-color: #F1F1F1;
      border: 1px solid #D1D1D1;
      padding-left: 0.5em; padding-right: 0.5em; }
    blockquote.stddel { text-decoration: line-through;
      color: #000000; background-color: #FFEBFF;
      border: 1px solid #ECD7EC;
      padding-left: 0.5em; padding-right: 0.5em; }

    blockquote.stdins { text-decoration: underline;
      color: #000000; background-color: #C8FFC8;
      border: 1px solid #B3EBB3; padding: 0.5em; }

    table { border: 1px solid black; border-spacing: 0px;
      margin-left: auto; margin-right: auto; }
    th { text-align: left; vertical-align: top;
      padding-left: 0.8em; border: none; }
    td { text-align: left; vertical-align: top;
      padding-left: 0.8em; border: none; }

    </style>

    <script
      src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"
      type="text/javascript"> </script>

    <script type="text/javascript">$(function() {
        var next_id = 0
        function find_id(node) {
            // Look down the first children of 'node' until we find one
            // with an id. If we don't find one, give 'node' an id and
            // return that.
            var cur = node[0];
            while (cur) {
                if (cur.id) return cur.id;
                if (cur.tagName == 'A' && cur.name)
                    return cur.name;
                cur = cur.firstChild;
            };
            // No id.
            node.attr('id', 'gensection-' + next_id++);
            return node.attr('id');
        };

        // Put a table of contents in the #toc nav.

        // This is a list of <ol> elements, where toc[N] is the list for
        // the current sequence of <h(N+2)> tags. When a header of an
        // existing level is encountered, all higher levels are popped,
        // and an <li> is appended to the level
        var toc = [$("<ol/>")];
        $(':header').not('h1').each(function() {
            var header = $(this);
            // For each <hN> tag, add a link to the toc at the appropriate
            // level.  When toc is one element too short, start a new list
            var levels = {H2: 0, H3: 1, H4: 2, H5: 3, H6: 4};
            var level = levels[this.tagName];
            if (typeof level == 'undefined') {
                throw 'Unexpected tag: ' + this.tagName;
            }
            // Truncate to the new level.
            toc.splice(level + 1, toc.length);
            if (toc.length < level) {
                // Omit TOC entries for skipped header levels.
                return;
            }
            if (toc.length == level) {
                // Add a <ol> to the previous level's last <li> and push
                // it into the array.
                var ol = $('<ol/>')
                toc[toc.length - 1].children().last().append(ol);
                toc.push(ol);
            }
            var header_text = header.text();
            toc[toc.length - 1].append(
                $('<li/>').append($('<a href="#' + find_id(header) + '"/>')
                                  .text(header_text)));
        });
        $('#toc').append(toc[0]);
    })
    </script>

  </head>
  <body>
    <h1><code>std::split()</code>: An algorithm for splitting strings</h1>

    <p>
    ISO/IEC JTC1 SC22 WG21 N3510 - 2013-01-10
    </p>

    <address>
      Greg Miller, jgm@google.com
    </address>

    <div id="toc">
    <!-- Generated dynamically by javascript -->
    </div>

    <h2><a name="introduction">Introduction</a></h2>

    <p>
    Splitting strings into substrings is a common task in most general-purpose programming languages, and C++ is no exception.
    When the need arises, programmers need to search for an existing solution or write one of their own.
    A typical solution might look like the following:
    </p>

    <pre class="example">
    <code>std::vector&lt;std::string&gt; my_split(const std::string&amp; text, const std::string&amp; delimiter);</code>
    </pre>

    <p>
    A straightforward implementation of the above function would likely use <code>std::string::find</code> or <code>std::string::find_first_of</code> to identify substrings and move from one to the next, building the vector to return.
    This is a fine solution for simple needs, but it is deficient in the following ways:
    </p>

    <ul>
      <li>Must be reimplemented by each individual/organization</li>
      <li>Not adaptable to different types of delimiter, such as regular expressions</li>
      <li>Not adaptable to different return types, such as <code>std::set&lt;std::string&gt;</code></li>
    </ul>
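
    <p>
    For reference, the straightforward implementation described above might look like the following sketch.
    It is illustrative only: the treatment of an empty delimiter (return the text unsplit) is an assumption made here, not something the signature above specifies.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;vector&gt;

    // Splits `text` on every occurrence of `delimiter`, keeping empty pieces
    // produced by adjacent, leading, or trailing delimiters.
    std::vector&lt;std::string&gt; my_split(const std::string&amp; text,
                                      const std::string&amp; delimiter) {
      std::vector&lt;std::string&gt; pieces;
      if (delimiter.empty()) {  // Assumption: an empty delimiter means "do not split".
        pieces.push_back(text);
        return pieces;
      }
      std::string::size_type start = 0;
      for (;;) {
        const std::string::size_type pos = text.find(delimiter, start);
        if (pos == std::string::npos) {
          pieces.push_back(text.substr(start));  // Last piece, possibly empty.
          return pieces;
        }
        pieces.push_back(text.substr(start, pos - start));
        start = pos + delimiter.size();
      }
    }</code>
    </pre>

    <p>
    Both the delimiter type and the return type are fixed by the signature, which is what makes this approach hard to adapt.
    </p>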

    <p>
    Google developed a flexible and fast string-splitting API to address these deficiencies.
    The new API has been very well received by internal engineers writing real code.
    The rest of this paper describes Google's string splitting API as it might appear in the C++ standard.
    </p>

    <p>
    This proposal depends on the following proposals:
    </p>
    <ul>
      <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3442.html">N3442 (<code>std::string_ref</code>)</a></li>
      <li><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3513.html">N3513 (Range support)</a></li>
    </ul>

    <h3>Changes in this revision</h3>
    <p>
    The previous version of this proposal is <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3430.html">N3430</a>.
    The <code>std::split()</code> function described in this proposal has been greatly simplified from the previous proposal.
    The following are the major changes in this revision:
    </p>

    <ul>
      <li>No implicit type conversion. <code>std::split()</code> will return a
      <em>Range</em>, which itself can be used with range-aware STL containers
      per the <a
      href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3513.html">Range proposal (N3513)</a>.</li>
      <li><code>std::split()</code> will return a range of <code>std::string_ref</code> objects only. Users will need to explicitly convert the returned values to <code>std::string</code> if desired.</li>
      <li>No split-specific predicates for filtering split results (i.e., no <code>std::skip_empty()</code>).
      Operations to filter or transform the returned range can be implemented separately.
      Any generic Range Adapter library (e.g., <a href="http://www.boost.org/doc/libs/1_47_0/libs/range/doc/html/range/reference/adaptors.html">Boost.RangeAdapters</a>) will work on the range returned from <code>std::split()</code>.</li>
    </ul>
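
    <p>
    As an illustration of the last two points, the sketch below shows the kind of caller-side code that could take over filtering and conversion.
    It uses C++17 <code>std::string_view</code> as a stand-in for the proposed <code>std::string_ref</code>, and the helper name <code>to_nonempty_strings</code> is hypothetical.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;string_view&gt;
    #include &lt;vector&gt;

    // Copies every non-empty piece into an owning std::string.
    // `pieces` stands in for the range that std::split() would return.
    template &lt;typename Range&gt;
    std::vector&lt;std::string&gt; to_nonempty_strings(const Range&amp; pieces) {
      std::vector&lt;std::string&gt; out;
      for (std::string_view piece : pieces) {
        if (!piece.empty()) out.emplace_back(piece);  // Explicit conversion.
      }
      return out;
    }</code>
    </pre>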


    <h2><a name="new_api">std::split() API</a></h2>

    <pre class="example">
    <code>namespace std {

      template &lt;typename Delimiter&gt;
      auto split(std::string_ref text, Delimiter d) -&gt; split_range&lt;Delimiter&gt;;

    }</code>
    </pre>

    <p>
    The <code>std::split()</code> algorithm takes a <code>std::string_ref</code> and a <code>Delimiter</code> as arguments, and it returns a <em>Range</em> of <code>std::string_ref</code> objects as output.
    The <code>std::string_ref</code> objects in the returned Range refer to substrings of the input text.
    The <code>Delimiter</code> object defines the boundaries between the returned substrings.
    This is fairly common splitting behavior that is followed in many programming languages.
    </p>

    <h3><a name="delimiters">Delimiters</a></h3>

    <p>
    The general notion of a delimiter (aka separator) is not new.
    A delimiter (little d) marks the boundary between two substrings in a larger string.
    With the <code>std::split()</code> API comes the generalized concept of a <em>Delimiter</em> (big D).
    A <em>Delimiter</em> is an object with a <code>find()</code> member function that can find the next occurrence of itself in a given <code>std::string_ref</code>.
    Objects that conform to the Delimiter concept represent specific kinds of delimiters, such as single characters, substrings, and regular expressions.
    </p>

    <p>
    The result of a Delimiter's <code>find()</code> member function must be a <code>std::string_ref</code> referring to one of the following:
    </p>
    <ul>
      <li>A substring of <code>find()</code>'s argument text. This is the delimiter/separator that was found.</li>
      <li>An empty <code>std::string_ref</code> with a null data member (e.g., <code>std::string_ref(nullptr, 0)</code>). This indicates that the delimiter/separator was not found.</li>
    </ul>

    <p>
    The following example shows a simple object that models the Delimiter concept.
    It has a <code>find()</code> member function that is responsible for finding the next occurrence of a char in the given text.
    </p>

    <pre class="example">
    <code>struct char_delimiter {
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      <strong>std::string_ref find(std::string_ref text)</strong> {
        std::size_t pos = text.find(c_);
        if (pos == std::string_ref::npos)
          return std::string_ref(nullptr, 0);  <em>// Not found, returns null std::string_ref.</em>
        return std::string_ref(text, pos, 1);  <em>// Returns a string_ref referring to the c_ that was found in the input string.</em>
      }
    };</code>
    </pre>

    <p>
    The following shows how the above delimiter could be used to split a string:
    </p>

    <pre class="example">
    <code>std::vector&lt;std::string_ref&gt; v{std::split("a-b-c", <strong>char_delimiter('-')</strong>)};
    <em>// v is {"a", "b", "c"}</em></code>
    </pre>
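
    <p>
    To make the <code>find()</code> contract concrete, the following sketch shows one way a splitting loop might consume a Delimiter's results.
    It uses C++17 <code>std::string_view</code> in place of the proposed <code>std::string_ref</code>, and the eager helper <code>split_to_vector</code> is hypothetical; the proposed <code>std::split()</code> returns a Range instead.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string_view&gt;
    #include &lt;vector&gt;

    // Same shape as char_delimiter above, written against std::string_view.
    struct char_delimiter {
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      std::string_view find(std::string_view text) const {
        const std::size_t pos = text.find(c_);
        if (pos == std::string_view::npos) return std::string_view{};  // Null: not found.
        return text.substr(pos, 1);
      }
    };

    // Hypothetical eager helper (not part of the proposal).
    // Assumption: the delimiter never matches zero characters at offset 0,
    // otherwise this loop would not advance.
    template &lt;typename Delimiter&gt;
    std::vector&lt;std::string_view&gt; split_to_vector(std::string_view text, Delimiter d) {
      std::vector&lt;std::string_view&gt; pieces;
      for (;;) {
        const std::string_view found = d.find(text);
        if (found.data() == nullptr) {  // No further delimiter: emit the tail.
          pieces.push_back(text);
          return pieces;
        }
        // `found` is a substring of `text`, so pointer subtraction gives its offset.
        const std::size_t offset =
            static_cast&lt;std::size_t&gt;(found.data() - text.data());
        pieces.push_back(text.substr(0, offset));
        text.remove_prefix(offset + found.size());
      }
    }</code>
    </pre>

    <p>
    Because the result of <code>find()</code> refers into its argument, the loop needs no position bookkeeping beyond the pointer difference.
    </p>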

    <p>
    The following are standard delimiter implementations that will be part of the splitting API.
    </p>
    <ul>
      <li><code>std::literal_delimiter</code></li>
      <li><code>std::any_of_delimiter</code></li>
    </ul>

    <h3><a name="split_range">The split_range&lt;T&gt;</a></h3>

    <p>
    <code>std::split()</code> returns a Range (i.e. an object with <code>begin()</code> and <code>end()</code> methods returning iterators) whose value type is <code>std::string_ref</code>.
    The actual type returned by <code>std::split()</code> will not be exposed as part of the API.
    </p>
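
    <p>
    Since the returned type is unspecified, the following is only one possible shape for it: a lazy range whose input iterator computes each substring on demand.
    The sketch uses C++17 <code>std::string_view</code> in place of <code>std::string_ref</code>, lives in a hypothetical <code>sketch</code> namespace, and assumes the Delimiter's <code>find()</code> is <code>const</code>.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;iterator&gt;
    #include &lt;string_view&gt;
    #include &lt;utility&gt;

    namespace sketch {

    struct char_delimiter {  // Minimal Delimiter used to exercise the range.
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      std::string_view find(std::string_view text) const {
        const std::size_t pos = text.find(c_);
        if (pos == std::string_view::npos) return std::string_view{};
        return text.substr(pos, 1);
      }
    };

    template &lt;typename Delimiter&gt;
    class split_range {
     public:
      class iterator {
       public:
        using iterator_category = std::input_iterator_tag;
        using value_type = std::string_view;
        using difference_type = std::ptrdiff_t;
        using pointer = const std::string_view*;
        using reference = const std::string_view&amp;;

        iterator() = default;  // Past-the-end iterator.
        iterator(std::string_view text, const Delimiter* d)
            : rest_(text), delimiter_(d), done_(false) {
          advance();  // Compute the first piece up front.
        }

        reference operator*() const { return current_; }
        pointer operator-&gt;() const { return &amp;current_; }
        iterator&amp; operator++() { advance(); return *this; }
        iterator operator++(int) { iterator old = *this; advance(); return old; }

        friend bool operator==(const iterator&amp; a, const iterator&amp; b) {
          if (a.done_ || b.done_) return a.done_ == b.done_;
          return a.current_.data() == b.current_.data();
        }
        friend bool operator!=(const iterator&amp; a, const iterator&amp; b) {
          return !(a == b);
        }

       private:
        void advance() {
          if (last_) { done_ = true; return; }
          const std::string_view found = delimiter_-&gt;find(rest_);
          if (found.data() == nullptr) {  // No more delimiters: emit the tail.
            current_ = rest_;
            last_ = true;
            return;
          }
          const std::size_t offset =
              static_cast&lt;std::size_t&gt;(found.data() - rest_.data());
          current_ = rest_.substr(0, offset);
          rest_.remove_prefix(offset + found.size());
        }

        std::string_view rest_;      // Text not yet consumed.
        std::string_view current_;   // Piece the iterator currently refers to.
        const Delimiter* delimiter_ = nullptr;
        bool last_ = false;          // current_ is the final piece.
        bool done_ = true;           // Iterator is past the end.
      };

      split_range(std::string_view text, Delimiter d)
          : text_(text), delimiter_(std::move(d)) {}
      iterator begin() const { return iterator(text_, &amp;delimiter_); }
      iterator end() const { return iterator(); }

     private:
      std::string_view text_;
      Delimiter delimiter_;
    };

    template &lt;typename Delimiter&gt;
    split_range&lt;Delimiter&gt; split(std::string_view text, Delimiter d) {
      return split_range&lt;Delimiter&gt;(text, std::move(d));
    }

    }  // namespace sketch</code>
    </pre>

    <p>
    In this sketch the iterators point back into the range for the delimiter, so they must not outlive the range object that produced them.
    </p>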

    <h3>Rvalue support</h3>

    <p>
    As described so far, <code>std::split()</code> may not work correctly if splitting a <code>std::string_ref</code> that refers to a temporary string.
    In particular, the following will not work:
    </p>

    <pre class="example">
    <code>for (std::string_ref s : std::split(ReturnTemporaryString(), "-")) {
        // s now refers to a temporary string that is no longer valid.
    }</code>
    </pre>

    <p>
    To address this, <code>std::split()</code> will move ownership of rvalues into the returned range.
    </p>
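
    <p>
    One way to picture this ownership transfer is a range that stores the moved-in string alongside the views into it.
    The sketch below is an assumption about how this could be done, not the proposed mechanism: it splits eagerly on a single <code>char</code> for brevity, uses C++17 <code>std::string_view</code> in place of <code>std::string_ref</code>, and the names <code>owning_split</code> and <code>split_owned</code> are hypothetical.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string&gt;
    #include &lt;string_view&gt;
    #include &lt;utility&gt;
    #include &lt;vector&gt;

    // Owns the text it was split from, so the views it yields stay valid for
    // as long as the owning_split object itself is alive.
    class owning_split {
     public:
      owning_split(std::string&amp;&amp; text, char delimiter) : text_(std::move(text)) {
        std::string_view rest(text_);
        for (;;) {
          const std::size_t pos = rest.find(delimiter);
          if (pos == std::string_view::npos) {
            pieces_.push_back(rest);
            break;
          }
          pieces_.push_back(rest.substr(0, pos));
          rest.remove_prefix(pos + 1);
        }
      }

      // The views point into text_, so copying or moving the object could
      // leave them dangling; both are disabled in this sketch.
      owning_split(const owning_split&amp;) = delete;
      owning_split&amp; operator=(const owning_split&amp;) = delete;

      std::vector&lt;std::string_view&gt;::const_iterator begin() const {
        return pieces_.begin();
      }
      std::vector&lt;std::string_view&gt;::const_iterator end() const {
        return pieces_.end();
      }

     private:
      std::string text_;
      std::vector&lt;std::string_view&gt; pieces_;
    };

    // Relies on C++17 guaranteed copy elision to return the non-movable object.
    inline owning_split split_owned(std::string&amp;&amp; text, char delimiter) {
      return owning_split(std::move(text), delimiter);
    }</code>
    </pre>

    <p>
    With such a range, a range-based for loop over <code>split_owned(ReturnTemporaryString(), '-')</code> keeps the string alive for the whole loop, because the loop extends the lifetime of the range object that owns it.
    </p>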


    <h2><a name="synopsis">API Synopsis</a></h2>

    <h3><a name="std_split">std::split()</a></h3>

    <p>
    The function called to split an input string into a range of substrings.
    </p>

    <pre class="example">
    <code>namespace std {

      template &lt;typename Delimiter&gt;
      auto split(std::string_ref text, Delimiter d) -&gt; split_range&lt;Delimiter&gt;;

    }</code>
    </pre>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      the <code>text</code> to be split and a <em>Delimiter</em> on which to split the text.
      The <code>Delimiter</code> argument may be an object that models the <em>Delimiter</em> concept.
      Or, if the <code>Delimiter</code> argument is a <code>std::string</code>, <code>std::string_ref</code>, <code>const char*</code>, or a single <code>char</code>, the <code>std::literal_delimiter</code> will be used by default.
      </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      a <em>Range</em> of <code>std::string_ref</code> objects, each referring to the split substrings within the given input <code>text</code>.
      </dd>
    </dl>
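
    <p>
    The default-delimiter rule could be implemented with a small dispatch step, sketched below under stated assumptions.
    The helper <code>make_delimiter</code> is hypothetical, <code>literal_delimiter</code> is reduced to a stub that only records its text, and C++17 <code>std::string_view</code> stands in for <code>std::string_ref</code>.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;string_view&gt;
    #include &lt;type_traits&gt;
    #include &lt;utility&gt;

    namespace sketch {

    // Stub: only records the delimiter text; find() is omitted here.
    class literal_delimiter {
     public:
      explicit literal_delimiter(std::string_view s) : text_(s) {}
      const std::string&amp; text() const { return text_; }

     private:
      std::string text_;
    };

    // Hypothetical helper: picks the Delimiter object for a split() argument.
    template &lt;typename T&gt;
    auto make_delimiter(T&amp;&amp; d) {
      using D = std::decay_t&lt;T&gt;;
      if constexpr (std::is_same_v&lt;D, char&gt;) {
        return literal_delimiter(std::string_view(&amp;d, 1));  // Single char.
      } else if constexpr (std::is_convertible_v&lt;D, std::string_view&gt;) {
        return literal_delimiter(std::string_view(d));      // String-like.
      } else {
        return D(std::forward&lt;T&gt;(d));  // Assumed to model Delimiter already.
      }
    }

    }  // namespace sketch</code>
    </pre>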

    [<b>Footnote:</b>
    <p>
    One question at this point is: why is this constrained to strings/string_refs? 
    One could imagine <code>std::split()</code> as an algorithm that transforms an input Range into an output Range of Ranges.
    This would make the algorithm more generally applicable.
    </p>

    <p>
    However, this generalization may also make <code>std::split()</code> less convenient in the expected common case: that of splitting string data.
    For example, the logic for detecting when to auto-construct a <code>std::literal_delimiter</code> may be more complicated, and it may not be clear that that is a reasonable default delimiter anyway.
    </p>
    &mdash;<b>end footnote</b>]

    <h3><a name="delimiter_synopsis"><em>Delimiter</em> template parameter</a></h3>
    <p>
    The second argument to <code>std::split()</code> may be an object that models the Delimiter concept.
    A Delimiter object must have the following member function:
    </p>

    <pre class="example">
    <code>std::string_ref find(std::string_ref text);</code>
    </pre>

    <p>
    This function is responsible for finding the next occurrence of the represented delimiter in the given <code>text</code>.
    </p>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      <code>text</code> is the remaining text to be split
      </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      a <code>std::string_ref</code> referring to the found delimiter within the given input <code>text</code>.
      Or <code>std::string_ref(nullptr, 0)</code> if the delimiter was not found.
      </dd>
    </dl>

    [<b>Footnote:</b>
    One could imagine a <em>Delimiter</em> implementation that needs more context into the overall string being split in order to work correctly.
    For example, a <code>regex_delimiter</code> might need this ability to correctly match word boundaries ("\\b") given that "\\b" will always match the beginning of the string.
    For this reason, it may be better to change the Delimiter API to require
    that Delimiter objects provide a member function like the following:

    <pre class="example">
    <code>std::string_ref find(std::string_ref text, <strong>size_t pos</strong>);</code>
    </pre>

    In this case, the delimiter will always be given the full input <code>text</code> being split along with the position in the text where it should start looking for the next delimiter.

    &mdash;<b>end footnote</b>]
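
    <p>
    A sketch of the two-argument alternative discussed in the footnote might look like the following.
    The names <code>char_delimiter_at</code> and <code>split_with_pos</code> are hypothetical, and C++17 <code>std::string_view</code> stands in for <code>std::string_ref</code>.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string_view&gt;
    #include &lt;vector&gt;

    // Hypothetical delimiter using the two-argument find() form.
    struct char_delimiter_at {
      char c_;
      explicit char_delimiter_at(char c) : c_(c) {}
      std::string_view find(std::string_view text, std::size_t pos) const {
        const std::size_t found = text.find(c_, pos);
        if (found == std::string_view::npos) return std::string_view{};
        return text.substr(found, 1);
      }
    };

    // Hypothetical eager driver: `text` is never trimmed; only `start` moves,
    // so the delimiter always sees the full input.
    template &lt;typename Delimiter&gt;
    std::vector&lt;std::string_view&gt; split_with_pos(std::string_view text,
                                                 const Delimiter&amp; d) {
      std::vector&lt;std::string_view&gt; pieces;
      std::size_t start = 0;
      for (;;) {
        const std::string_view found = d.find(text, start);
        if (found.data() == nullptr) {  // Not found: emit the tail.
          pieces.push_back(text.substr(start));
          return pieces;
        }
        const std::size_t offset =
            static_cast&lt;std::size_t&gt;(found.data() - text.data());
        pieces.push_back(text.substr(start, offset - start));
        start = offset + found.size();
      }
    }</code>
    </pre>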

    <h3><a name="std_literal_delimiter">std::literal_delimiter</a></h3>

    <p>
    A string delimiter.
    This is the default delimiter used if a string is given as the delimiter argument to <code>std::split()</code>.
    </p>

    <p>
    The delimiter representing the empty string (<code>std::literal_delimiter("")</code>) will be defined to return each individual character in the input string.
    This matches the behavior of splitting on the empty string "" in Perl.
    </p>

    <pre class="example">
    <code>namespace std {

      class literal_delimiter {
       public:
        explicit literal_delimiter(string_ref sref)
        : delimiter_(static_cast&lt;string&gt;(sref)) {}
        <strong>string_ref find(string_ref text) const;</strong>

       private:
        const string delimiter_;
      };

    }</code>
    </pre>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      <code>text</code> is the text to be split
      </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      A <code>std::string_ref</code> referring to the first substring of <code>text</code> that matches <code>delimiter_</code>.
      Or <code>std::string_ref(nullptr, 0)</code> if not found.
      </dd>
    </dl>
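
    <p>
    A possible definition of <code>find()</code> is sketched below, using C++17 <code>std::string_view</code> in place of <code>std::string_ref</code> and a hypothetical <code>sketch</code> namespace.
    How the empty delimiter is realized is an assumption made here: it is modeled as a zero-length boundary after the first character, with "not found" reported once at most one character remains.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string&gt;
    #include &lt;string_view&gt;

    namespace sketch {

    class literal_delimiter {
     public:
      explicit literal_delimiter(std::string_view s) : delimiter_(s) {}

      std::string_view find(std::string_view text) const {
        if (delimiter_.empty()) {
          // Assumption: "split into characters" is modeled as a zero-length
          // boundary after the first character of the remaining text.
          if (text.size() &lt;= 1) return std::string_view{};
          return text.substr(1, 0);
        }
        const std::size_t pos = text.find(delimiter_);
        if (pos == std::string_view::npos) return std::string_view{};
        return text.substr(pos, delimiter_.size());
      }

     private:
      const std::string delimiter_;
    };

    }  // namespace sketch</code>
    </pre>

    <p>
    A splitting loop that emits the text before each result and then skips past it would, under this assumption, yield one substring per character.
    </p>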

    <h3><a name="std_any_of_delimiter">std::any_of_delimiter</a></h3>

    <p>
    Each character in the given string is a delimiter.
    </p>

    <pre class="example">
    <code>namespace std {

      class any_of_delimiter {
       public:
        explicit any_of_delimiter(string_ref sref)
        : delimiters_(static_cast&lt;string&gt;(sref)) {}
        <strong>string_ref find(string_ref text) const;</strong>

       private:
        const string delimiters_;
      };

    }</code>
    </pre>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      <code>text</code> is the text to be split
      </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      A <code>std::string_ref</code> referring to the first occurrence of any character from <code>delimiters_</code> that is found in <code>text</code>.
      In this case, the length of the returned <code>std::string_ref</code> will be 1.
      Or <code>std::string_ref(nullptr, 0)</code> if none are found.
      </dd>
    </dl>
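
    <p>
    The corresponding <code>find()</code> could be a thin wrapper over <code>find_first_of</code>, as in the sketch below.
    It uses C++17 <code>std::string_view</code> in place of <code>std::string_ref</code> and a hypothetical <code>sketch</code> namespace; with an empty delimiter set it never finds a match, which is an assumption rather than specified behavior.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string&gt;
    #include &lt;string_view&gt;

    namespace sketch {

    class any_of_delimiter {
     public:
      explicit any_of_delimiter(std::string_view chars) : delimiters_(chars) {}

      std::string_view find(std::string_view text) const {
        const std::size_t pos = text.find_first_of(delimiters_);
        if (pos == std::string_view::npos) return std::string_view{};
        return text.substr(pos, 1);  // Always exactly one character.
      }

     private:
      const std::string delimiters_;
    };

    }  // namespace sketch</code>
    </pre>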


    <h2><a name="api_usage">API Usage</a></h2>

    <p>
    The following using declarations are assumed for brevity:
    </p>

    <pre class="example">
    <code>using std::deque;
    using std::list;
    using std::set;
    using std::string_ref;
    using std::vector;</code>
    </pre>

    <ol>

    <li>
    The default delimiter when not explicitly specified is <code>std::literal_delimiter</code>.
    The following two calls to <code>std::split()</code> are equivalent.
    The first form is provided for convenience.
    <pre class="example">
    <code>vector&lt;string_ref&gt; v1{std::split("a-b-c", <strong>"-"</strong>)};
    vector&lt;string_ref&gt; v2{std::split("a-b-c", <strong>std::literal_delimiter("-")</strong>)};</code>
    </pre>
    </li>

    <li>
    Empty substrings are included in the output.
    <pre class="example">
    <code>vector&lt;string_ref&gt; v{std::split("a--c", "-")};
    assert(<strong>v.size() == 3</strong>);  <em>// "a", "", "c"</em></code>
    </pre>
    </li>

    <li>
    The previous example showed that empty substrings are included in the output. 
    Leading and trailing delimiters result in leading and trailing empty strings in the output.
    <pre class="example">
    <code>vector&lt;string_ref&gt; v{std::split(<strong>"-a-b-c-", "-"</strong>)};
    assert(<strong>v.size() == 5</strong>);  <em>// "", "a", "b", "c", ""</em></code>
    </pre>

    </li>

    <li>
    Results can be assigned to STL containers that support the Range concept.
    <pre class="example">
    <code>vector&lt;string_ref&gt; v{std::split("a-b-c", "-")};
    deque&lt;string_ref&gt; d{std::split("a-b-c", "-")};
    set&lt;string_ref&gt; s{std::split("a-b-c", "-")};
    list&lt;string_ref&gt; l{std::split("a-b-c", "-")};</code>
    </pre>
    </li>

    <li>
    A delimiter of the empty string results in each character in the input string becoming one element in the output collection.
    This is a special case. It is done to match the behavior of splitting using the empty string in other programming languages (e.g., Perl).
    <pre class="example">
    <code>vector&lt;string_ref&gt; v{std::split(<strong>"abc", ""</strong>)};
    assert(<strong>v.size() == 3</strong>);  <em>// "a", "b", "c"</em></code>
    </pre>
    </li>

    <li>
    Iterating the results of a split in a range-based for loop.
    <pre class="example">
    <code>for (string_ref sref : std::split("a-b-c", "-")) {
      <em>// use sref</em>
    }</code>
    </pre>
    </li>

    <li>
    Splitting input text that is the empty string results in a collection containing one element that is the empty string.
    <pre class="example">
    <code>vector&lt;string_ref&gt; v{std::split(<strong>""</strong>, "any delimiter")};
    assert(<strong>v.size() == 1</strong>);  <em>// ""</em></code>
    </pre>

    [<b>Footnote:</b>
    This is logical behavior given that <code>std::split()</code> doesn't skip empty substrings.
    However, it might be surprising behavior to some users.
    Would it be better if the result of splitting an empty string resulted in an <em>empty</em> Range?
    &mdash;<b>end footnote</b>]
    </li>

    </ol>

  </body>
</html>

