<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=US-ASCII" />
    <title>Proposing std::split()</title>

    <style type="text/css">

    body { color: #000000; background-color: #FFFFFF; }
    del { text-decoration: line-through; color: #8B0040; }
    ins { text-decoration: underline; color: #005100; }

    p.example { margin-left: 2em; }
    pre.example { margin-left: 2em; }
    div.example { margin-left: 2em; }

    code.extract { background-color: #F5F6A2; }
    pre.extract { margin-left: 2em; background-color: #F5F6A2;
      border: 1px solid #E1E28E; }

    p.function { }
    .attribute { margin-left: 2em; }
    .attribute dt { float: left; font-style: italic;
      padding-right: 1ex; }
    .attribute dd { margin-left: 0em; }

    blockquote.std { color: #000000; background-color: #F1F1F1;
      border: 1px solid #D1D1D1;
      padding-left: 0.5em; padding-right: 0.5em; }
    blockquote.stddel { text-decoration: line-through;
      color: #000000; background-color: #FFEBFF;
      border: 1px solid #ECD7EC;
      padding-left: 0.5em; padding-right: 0.5em; }

    blockquote.stdins { text-decoration: underline;
      color: #000000; background-color: #C8FFC8;
      border: 1px solid #B3EBB3; padding: 0.5em; }

    table { border: 1px solid black; border-spacing: 0px;
      margin-left: auto; margin-right: auto; }
    th { text-align: left; vertical-align: top;
      padding-left: 0.8em; border: none; }
    td { text-align: left; vertical-align: top;
      padding-left: 0.8em; border: none; }

    </style>

    <script
      src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"
      type="text/javascript"> </script>

    <script type="text/javascript">$(function() {
        var next_id = 0
        function find_id(node) {
            // Look down the first children of 'node' until we find one
            // with an id. If we don't find one, give 'node' an id and
            // return that.
            var cur = node[0];
            while (cur) {
                if (cur.id) return cur.id;
                if (cur.tagName == 'A' && cur.name)
                    return cur.name;
                cur = cur.firstChild;
            };
            // No id.
            node.attr('id', 'gensection-' + next_id++);
            return node.attr('id');
        };

        // Put a table of contents in the #toc nav.

        // This is a list of <ol> elements, where toc[N] is the list for
        // the current sequence of <h(N+2)> tags. When a header of an
        // existing level is encountered, all higher levels are popped,
        // and an <li> is appended to the level
        var toc = [$("<ol/>")];
        $(':header').not('h1').each(function() {
            var header = $(this);
            // For each <hN> tag, add a link to the toc at the appropriate
            // level.  When toc is one element too short, start a new list
            var levels = {H2: 0, H3: 1, H4: 2, H5: 3, H6: 4};
            var level = levels[this.tagName];
            if (typeof level == 'undefined') {
                throw 'Unexpected tag: ' + this.tagName;
            }
            // Truncate to the new level.
            toc.splice(level + 1, toc.length);
            if (toc.length < level) {
                // Omit TOC entries for skipped header levels.
                return;
            }
            if (toc.length == level) {
                // Add a <ol> to the previous level's last <li> and push
                // it into the array.
                var ol = $('<ol/>')
                toc[toc.length - 1].children().last().append(ol);
                toc.push(ol);
            }
            var header_text = header.text();
            toc[toc.length - 1].append(
                $('<li/>').append($('<a href="#' + find_id(header) + '"/>')
                                  .text(header_text)));
        });
        $('#toc').append(toc[0]);
    })
    </script>

  </head>
  <body>
    <h1><code>std::split()</code>: An algorithm for splitting strings</h1>

    <p>
    ISO/IEC JTC1 SC22 WG21 N3593 - 2013-03-13
    </p>

    <address>
      Greg Miller, jgm@google.com
    </address>

    <div id="toc">
    <!-- Generated dynamically by javascript -->
    </div>

    <h2><a name="introduction">Introduction</a></h2>

    <p>
    Splitting strings into substrings is a common task in many applications.
    When the need arises in C++, programmers must search for an existing
    solution or write one of their own. A typical solution might look like the
    following:
    </p>

    <pre class="example">
    <code>std::vector&lt;std::string&gt; my_split(const std::string&amp; text, const std::string&amp; delimiter);</code>
    </pre>

    <p>
    A straightforward implementation of the above function would likely use
    <code>std::string::find</code> or <code>std::string::find_first_of</code> to
    identify substrings and move from one to the next, building the vector to
    return. This is a fine solution for simple needs, but it is deficient in the
    following ways:
    </p>

    <ul>
      <li>Must be reimplemented by each individual/organization</li>
      <li>Not adaptable to different types of delimiter, such as regular expressions</li>
      <li>Not adaptable to different return types, such as <code>std::set&lt;string&gt;</code></li>
    </ul>
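
    <p>
    For reference, the straightforward implementation described above might be
    sketched as follows. This is only an illustration built on
    <code>std::string::find</code>; the handling of an empty delimiter is an
    assumption made here, not something the signature above specifies.
    </p>

    <pre class="example">
    <code>#include &lt;string&gt;
    #include &lt;vector&gt;

    // Illustrative only: splits text on every occurrence of the whole
    // delimiter string and copies each piece into the result.
    std::vector&lt;std::string&gt; my_split(const std::string&amp; text,
                                      const std::string&amp; delimiter) {
      std::vector&lt;std::string&gt; result;
      if (delimiter.empty()) {  // Assumption: an empty delimiter never matches.
        result.push_back(text);
        return result;
      }
      std::string::size_type start = 0;
      std::string::size_type found;
      while ((found = text.find(delimiter, start)) != std::string::npos) {
        result.push_back(text.substr(start, found - start));
        start = found + delimiter.size();
      }
      result.push_back(text.substr(start));  // Trailing piece; may be empty.
      return result;
    }</code>
    </pre>

    <p>
    Note that both the delimiter type and the return type are fixed by the
    signature, which is the source of the last two deficiencies listed above.
    </p>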

    <p>
    Google developed a flexible and fast string-splitting API to address these
    deficiencies. The new API has been well received by internal engineers
    developing serious applications. The rest of this paper describes Google's
    string splitting API as it might appear in the C++ standard.
    </p>

    <p>
    This proposal depends on the following proposals:
    </p>
    <ul>
      <li>N3609 (<code>std::string_view</code>)</li>
      <li>N3513 (Range support)</li>
    </ul>

    <h3>Changes in this revision</h3>
    <p>
    The first version of this proposal was <a
    href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3430.html">N3430</a>,
    which included features such as Predicates and implicit result type
    conversions. A number of these complicated features were removed in the
    following proposal, which was <a
    href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3510.html">N3510</a>.
    The following are the major changes in this revision.
    </p>

    <ul>
      <li><em>Delimiter</em> objects now return a zero-length
      <code>std::string_view</code> referring to the input text's
      <code>end()</code> iterator to indicate Not Found. There are also
      alternative options listed.</li>

      <li>The <em>Delimiter</em> <code>find()</code> member function now takes
      a <code>size_t pos</code> argument indicating where to start looking for
      the next delimiter.</li>

    </ul>

    <h2><a name="new_api">std::split() API</a></h2>

    <pre class="example">
    <code>namespace std {

      template &lt;typename Delimiter&gt;
      auto split(std::string_view text, Delimiter d) -&gt; <em>unspecified</em>;

    }</code>
    </pre>

    <p>
    The <code>std::split()</code> algorithm takes a <code>std::string_view</code>
    and a <code>Delimiter</code> as arguments, and it returns a <em>Range</em>
    of <code>std::string_view</code> objects as output. The
    <code>std::string_view</code> objects in the returned Range will refer to
    substrings of the input text. The <code>Delimiter</code> object defines the
    boundaries between the returned substrings.
    </p>

    <h3><a name="delimiters">Delimiters</a></h3>

    <p> The general notion of a delimiter (aka separator) is not new. A
    delimiter (little d) marks the boundary between two substrings in a larger
    string. With the <code>std::split()</code> API comes the generalized concept
    of a <em>Delimiter</em> (big D). A <em>Delimiter</em> is an object with a
    <code>find()</code> member function that can find the next occurrence of
    itself in a given <code>std::string_view</code> starting at the given
    position. Objects that conform to the Delimiter concept represent specific
    kinds of delimiters. Some examples of Delimiter objects are an object that
    finds a specific character in a string, an object that finds a substring in
    a string, or even an object that finds regular expression matches in a given
    string. </p>

    <p>
    The result of a Delimiter's <code>find()</code> member function must be a
    <code>std::string_view</code> referring to one of the following:
    </p>
    <ul>
      <li>A substring of <code>find()</code>'s argument text referring to the
      delimiter/separator that was found.</li>

      <li>An empty <code>std::string_view</code> referring to
      <code>find()</code>'s argument's end iterator, (e.g.,
      <code>std::string_view(input_text.end(), 0)</code>). This indicates that
      the delimiter/separator was not found.</li>
    </ul>

    [<b>Footnote:</b>
    An alternative to having a Delimiter's <code>find()</code> function return a
    <code>std::string_view</code> is to instead have it return a
    <code>std::pair&lt;size_t, size_t&gt;</code> where the pair's first member
    is the position of the found delimiter, and the second member is the length
    of the found delimiter. In this case, Not Found could be represented as
    <code>std::make_pair(std::string_view::npos, 0)</code>.
    &mdash;<b>end footnote</b>]

    <p>
    The following example shows a simple object that models the Delimiter
    concept. It has a <code>find()</code> member function that is responsible
    for finding the next occurrence of the given character in the given text
    starting at the given position.
    </p>

    <pre class="example">
    <code>struct char_delimiter {
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      <strong>std::string_view find(std::string_view text, size_t pos)</strong> {
        std::string_view substr = text.substr(pos);
        size_t found = substr.find(c_);
        if (found == std::string_view::npos)
          return std::string_view(substr.end(), 0);  <em>// Not found.</em>
        return substr.substr(found, 1);  <em>// Returns a string_view referring to the c_ that was found in the input string.</em>
      }
    };</code>
    </pre>

    <p> The following shows how the above delimiter could be used to split a
    string: </p>

    <pre class="example">
    <code>std::vector&lt;std::string_view&gt; v{std::split("a-b-c", <strong>char_delimiter('-')</strong>)};
    <em>// v is {"a", "b", "c"}</em></code>
    </pre>
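
    <p>
    To make the interaction between the algorithm and a Delimiter concrete,
    the following is a minimal sketch of the loop that a split implementation
    might run. The names <code>split_sketch</code> and
    <code>char_delimiter_sketch</code> are hypothetical, the result is built
    eagerly into a <code>std::vector</code> rather than returned as a lazy
    Range, and the sketch assumes that every match has non-zero length.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string_view&gt;
    #include &lt;vector&gt;

    // Hypothetical Delimiter that finds a single character.
    struct char_delimiter_sketch {
      char c;
      std::string_view find(std::string_view text, std::size_t pos) const {
        std::size_t found = text.find(c, pos);
        if (found == std::string_view::npos)
          return text.substr(text.size());  // Empty view at the end: Not Found.
        return text.substr(found, 1);
      }
    };

    // Hypothetical eager stand-in for the proposed algorithm.
    template &lt;typename Delimiter&gt;
    std::vector&lt;std::string_view&gt; split_sketch(std::string_view text,
                                               Delimiter d) {
      std::vector&lt;std::string_view&gt; pieces;
      std::size_t pos = 0;
      for (;;) {
        std::string_view found = d.find(text, pos);
        // Offset of the match within text; text.size() means Not Found.
        std::size_t at = static_cast&lt;std::size_t&gt;(found.data() - text.data());
        pieces.push_back(text.substr(pos, at - pos));
        if (found.empty() &amp;&amp; at == text.size()) break;  // Last piece emitted.
        pos = at + found.size();  // Resume just past the delimiter.
      }
      return pieces;
    }</code>
    </pre>

    <p>
    Each piece is the text between the previous match and the next one, so
    the Not Found result naturally yields the final piece.
    </p>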

    <p> The following are standard delimiter implementations that will be part
    of the splitting API. </p>
    <ul>
      <li><a href="#std_literal_delimiter"><code>std::literal_delimiter</code></a></li>
      <li><a href="#std_any_of_delimiter"><code>std::any_of_delimiter</code></a></li>
    </ul>
    [<b>Footnote:</b>
    Here are a few more delimiters that might be worth including by default:
    <ul>
      <li><code>std::fixed_delimiter</code> &mdash; this Delimiter breaks the
      input string at fixed length intervals.</li>

      <li><code>std::limit_delimiter</code> &mdash; this Delimiter template
      would take another Delimiter and a size_t limiting the given delimiter to
      matching a maximum number of times. This is similar to the 3rd argument to
      perl's split() function. </li>

      <li><code>std::regex_delimiter</code> &mdash; this Delimiter would take a
      regex as an argument and would match everywhere the pattern matched in the
      input string.</li>
    </ul>
    &mdash;<b>end footnote</b>]

    <h3>Rvalue support</h3>

    <p> As described so far, <code>std::split()</code> may not work correctly if
    splitting a <code>std::string_view</code> that refers to a temporary string.
    In particular, the following will not work: </p>

    <pre class="example">
    <code>for (std::string_view s : std::split(GetTemporaryString(), "-")) {
        // s now refers to a temporary string that is no longer valid.
    }</code>
    </pre>

    <p> To address this, <code>std::split()</code> will move ownership of
    rvalues into the <em>Range</em> object that is returned from
    <code>std::split()</code>. </p>
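
    <p>
    The proposal does not say how this ownership transfer would be
    implemented. One possible shape, sketched here under that caveat, is a
    result object that stores the moved-in string together with offsets into
    it, so that the object can itself be moved or copied without leaving any
    dangling views. The names <code>owning_pieces</code> and
    <code>split_owned</code> are hypothetical, and only a single-character
    delimiter is handled.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string&gt;
    #include &lt;string_view&gt;
    #include &lt;utility&gt;
    #include &lt;vector&gt;

    // Hypothetical owning result: keeps the input alive and records each
    // piece as a (position, length) pair instead of a string_view.
    class owning_pieces {
      std::string text_;
      std::vector&lt;std::pair&lt;std::size_t, std::size_t&gt;&gt; spans_;
     public:
      owning_pieces(std::string&amp;&amp; text, char delim) : text_(std::move(text)) {
        std::size_t pos = 0;
        for (;;) {
          std::size_t found = text_.find(delim, pos);
          if (found == std::string::npos) {
            spans_.emplace_back(pos, text_.size() - pos);  // Final piece.
            break;
          }
          spans_.emplace_back(pos, found - pos);
          pos = found + 1;
        }
      }
      std::size_t size() const { return spans_.size(); }
      // Views are created on demand, so they always refer to text_.
      std::string_view operator[](std::size_t i) const {
        return std::string_view(text_).substr(spans_[i].first, spans_[i].second);
      }
    };

    // Overload that binds only to rvalue strings.
    inline owning_pieces split_owned(std::string&amp;&amp; text, char delim) {
      return owning_pieces(std::move(text), delim);
    }</code>
    </pre>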


    <h2><a name="synopsis">API Synopsis</a></h2>

    <h3><a name="std_split">std::split()</a></h3>

    <p>
    The function called to split an input string into a range of substrings.
    </p>

    <pre class="example">
    <code>namespace std {

      template &lt;typename Delimiter&gt;
      auto split(std::string_view text, Delimiter d) -&gt; <em>unspecified</em>;

    }</code>
    </pre>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      <code>text</code> &mdash; a <code>std::string_view</code> referring to the
      input string to be split.
      </dd>
      <dd>
      <em>Delimiter</em> &mdash; an object that implements the
      <em>Delimiter</em> concept. If the argument is instead a
      <code>std::string</code>, <code>std::string_view</code>, <code>const
      char*</code>, or <code>char</code>, a
      <code>std::literal_delimiter</code> is used by default. </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      a <em>Range</em> of <code>std::string_view</code> objects, each referring
      to the split substrings within the given input <code>text</code>. The
      object returned from <code>std::split()</code> will have
      <code>begin()</code> and <code>end()</code> member functions and will
      fully model the <em>Range</em> concept.
      </dd>
    </dl>

    [<b>Footnote:</b>
    <p>
    One question at this point is: why is this constrained to
    strings/string_views? One could imagine <code>std::split()</code> as an
    algorithm that transforms an input Range into an output Range of Ranges.
    This would make the algorithm more generally applicable.
    </p>

    <p>
    However, this generalization may also make <code>std::split()</code> less
    convenient in the expected common case: that of splitting string data. For
    example, the logic for detecting when to auto-construct a
    <code>std::literal_delimiter</code> may be more complicated, and it may not
    be clear that that is a reasonable default delimiter in the generic case.
    </p>
    <p>
    The current proposal limits <code>std::split</code> to strings/string_views
    to keep the function simple to use in the common case of splitting strings.
    </p>
    &mdash;<b>end footnote</b>]

    <h3><a name="delimiter_synopsis"><em>Delimiter</em> template parameter</a></h3>
    <p>
    The second argument to <code>std::split()</code> may be an object that
    models the Delimiter concept. A Delimiter object must have the following
    member function:
    </p>

    <pre class="example">
    <code>std::string_view find(std::string_view text, size_t pos);</code>
    </pre>

    <p>
    This function is responsible for finding the next occurrence of the
    represented delimiter in the given <code>text</code> at or after the given
    position <code>pos</code>.
    </p>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      <code>text</code> &mdash; the full input string that was originally passed
      to <code>std::split()</code>.
      </dd>
      <dd>
      <code>pos</code> &mdash; the position in <code>text</code> where the
      search for the represented delimiter should start.
      </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      a <code>std::string_view</code> referring to the found delimiter within the
      given input <code>text</code>, or <code>std::string_view(text.end(),
      0)</code> if the delimiter was not found.
      </dd>
    </dl>
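
    <p>
    As an illustration of the requirement above, an implementation could
    check at compile time whether a type has a suitable <code>find()</code>
    member. The trait <code>models_delimiter</code> and the two example types
    below are hypothetical and are not part of the proposed API.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string_view&gt;
    #include &lt;type_traits&gt;
    #include &lt;utility&gt;

    // Hypothetical trait: true when D has find(string_view, size_t) whose
    // result converts to std::string_view.
    template &lt;typename D, typename = void&gt;
    struct models_delimiter : std::false_type {};

    template &lt;typename D&gt;
    struct models_delimiter&lt;
        D, std::void_t&lt;decltype(std::declval&lt;D&amp;&gt;().find(
               std::declval&lt;std::string_view&gt;(), std::size_t{}))&gt;&gt;
        : std::is_convertible&lt;
              decltype(std::declval&lt;D&amp;&gt;().find(
                  std::declval&lt;std::string_view&gt;(), std::size_t{})),
              std::string_view&gt; {};

    struct comma_delimiter {  // Models the concept.
      std::string_view find(std::string_view text, std::size_t pos) const {
        std::size_t found = text.find(',', pos);
        return found == std::string_view::npos ? text.substr(text.size())
                                               : text.substr(found, 1);
      }
    };

    struct not_a_delimiter {};  // No find() member.

    static_assert(models_delimiter&lt;comma_delimiter&gt;::value, "has find()");
    static_assert(!models_delimiter&lt;not_a_delimiter&gt;::value, "lacks find()");</code>
    </pre>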

    <h3><a name="std_literal_delimiter">std::literal_delimiter</a></h3>

    <p>
    A string delimiter. This is the default delimiter used if a string is given
    as the delimiter argument to <code>std::split()</code>.
    </p>

    <p>
    The delimiter representing the empty string
    (<code>std::literal_delimiter("")</code>) will be defined to return each
    individual character in the input string. This matches the behavior of
    splitting on the empty string "" in perl.
    </p>

    <p>
    The following is an example of what the <code>std::literal_delimiter</code>
    might look like.
    </p>

    <pre class="example">
    <code>namespace std {

      class literal_delimiter {
        const string delimiter_;
       public:
        explicit literal_delimiter(string_view sview)
        : delimiter_(static_cast&lt;string&gt;(sview)) {}
        <strong>string_view find(string_view text, size_t pos) const;</strong>
      };

    }</code>
    </pre>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      <code>text</code> is the text to be split.
      <code>pos</code> is the position in text to start searching for the
      delimiter.
      </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      A <code>std::string_view</code> referring to the first substring of
      <code>text</code> that matches <code>delimiter_</code>, or
      <code>std::string_view(text.end(), 0)</code> if not found.
      </dd>
    </dl>
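
    <p>
    A possible definition of <code>find()</code> is sketched below, using the
    hypothetical name <code>literal_delimiter_sketch</code>. The treatment of
    the empty delimiter is an assumption: the text above only states that each
    character becomes its own substring, and reporting a zero-length match
    just past the next character is one way to obtain that result.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string&gt;
    #include &lt;string_view&gt;

    class literal_delimiter_sketch {
      const std::string delimiter_;
     public:
      explicit literal_delimiter_sketch(std::string_view sview)
          : delimiter_(sview) {}
      std::string_view find(std::string_view text, std::size_t pos) const {
        if (delimiter_.empty()) {
          // Assumption: match, with zero length, just after the next
          // character. After the last character this is the end of text,
          // which doubles as Not Found.
          std::size_t next = pos &lt; text.size() ? pos + 1 : text.size();
          return text.substr(next, 0);
        }
        std::size_t found = text.find(delimiter_, pos);
        if (found == std::string_view::npos)
          return text.substr(text.size());  // Empty view at the end: Not Found.
        return text.substr(found, delimiter_.size());
      }
    };</code>
    </pre>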

    <h3><a name="std_any_of_delimiter">std::any_of_delimiter</a></h3>

    <p>
    Each character in the given string is a delimiter. A
    <code>std::any_of_delimiter</code> with a string of length 1 behaves the same
    as a <code>std::literal_delimiter</code> with the same string of length 1.
    </p>

    <pre class="example">
    <code>namespace std {

      class any_of_delimiter {
        const string delimiters_;
       public:
        explicit any_of_delimiter(string_view sview)
        : delimiters_(static_cast&lt;string&gt;(sview)) {}
        <strong>string_view find(string_view text, size_t pos) const;</strong>
      };

    }</code>
    </pre>

    <dl>
      <dt><em>Requires:</em></dt>
      <dd>
      <code>text</code> is the text to be split.
      <code>pos</code> is the position in text to start searching for the
      delimiter.
      </dd>
      <dt><em>Returns:</em></dt>
      <dd>
      A <code>std::string_view</code> referring to the first occurrence of any
      character from <code>delimiters_</code> that is found in <code>text</code>
      at or after <code>pos</code>. The length of the returned
      <code>std::string_view</code> will always be 1. If no match is found, the
      result is <code>std::string_view(text.end(), 0)</code>.
      </dd>
    </dl>
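
    <p>
    A minimal sketch of such a <code>find()</code>, assuming it is built on
    <code>std::string_view::find_first_of</code>, might look like the
    following. The class name <code>any_of_delimiter_sketch</code> is
    hypothetical.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string&gt;
    #include &lt;string_view&gt;

    class any_of_delimiter_sketch {
      const std::string delimiters_;
     public:
      explicit any_of_delimiter_sketch(std::string_view sview)
          : delimiters_(sview) {}
      std::string_view find(std::string_view text, std::size_t pos) const {
        // First position at or after pos holding any of the delimiters.
        std::size_t found = text.find_first_of(delimiters_, pos);
        if (found == std::string_view::npos)
          return text.substr(text.size());  // Empty view at the end: Not Found.
        return text.substr(found, 1);       // Always a single character.
      }
    };</code>
    </pre>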


    <h2><a name="api_usage">API Usage</a></h2>

    <p>
    The following using declarations are assumed for brevity:
    </p>

    <pre class="example">
    <code>using std::deque;
    using std::list;
    using std::set;
    using std::string_view;
    using std::vector;</code>
    </pre>

    <ol>

    <li>
    The default delimiter when not explicitly specified is
    <code>std::literal_delimiter</code>. The following two calls to
    <code>std::split()</code> are equivalent. The first form is provided for
    convenience.
    <pre class="example">
    <code>vector&lt;string_view&gt; v1{std::split("a-b-c", <strong>"-"</strong>)};
    vector&lt;string_view&gt; v2{std::split("a-b-c", <strong>std::literal_delimiter("-")</strong>)};</code>
    </pre>
    </li>

    <li>
    Empty substrings are included in the output.
    <pre class="example">
    <code>vector&lt;string_view&gt; v{std::split("a--c", "-")};
    assert(<strong>v.size() == 3</strong>);  <em>// "a", "", "c"</em></code>
    </pre>
    </li>

    <li>
    The previous example showed that empty substrings are included in the
    output. Leading and trailing delimiters result in leading and trailing empty
    strings in the output.
    <pre class="example">
    <code>vector&lt;string_view&gt; v{std::split(<strong>"-a-b-c-", "-"</strong>)};
    assert(<strong>v.size() == 5</strong>);  <em>// "", "a", "b", "c", ""</em></code>
    </pre>

    </li>

    <li>
    Results can be assigned to STL containers that support the Range concept.
    <pre class="example">
    <code>vector&lt;string_view&gt; v{std::split("a-b-c", "-")};
    deque&lt;string_view&gt; d{std::split("a-b-c", "-")};
    set&lt;string_view&gt; s{std::split("a-b-c", "-")};
    list&lt;string_view&gt; l{std::split("a-b-c", "-")};</code>
    </pre>
    </li>

    <li>
    A delimiter of the empty string results in each character in the input
    string becoming one element in the output collection. This is a special
    case. It is done to match the behavior of splitting using the empty string
    in other programming languages (e.g., perl).
    <pre class="example">
    <code>vector&lt;string_view&gt; v{std::split(<strong>"abc", ""</strong>)};
    assert(<strong>v.size() == 3</strong>);  <em>// "a", "b", "c"</em></code>
    </pre>
    </li>

    <li>
    Iterating the results of a split in a range-based for loop.
    <pre class="example">
    <code>for (string_view sview : std::split("a-b-c", "-")) {
      <em>// use sview</em>
    }</code>
    </pre>
    </li>

    <li>
    Modifying the input text invalidates the result of a split from that point
    on.
    <pre class="example">
    <code>string s = "a-b-c";
    auto r = std::split(s, "-");
    s += "-d-e-f";  // This invalidates the results r
    for (std::string_view token : r) {  // Invalid
      // ...
    }</code>
    </pre>
    </li>

    <li>
    Splitting input text that is the empty string results in a collection
    containing one element that is the empty string.
    <pre class="example">
    <code>vector&lt;string_view&gt; v{std::split(<strong>""</strong>, <em>any-delimiter</em>)};
    assert(<strong>v.size() == 1</strong>);  <em>// ""</em></code>
    </pre>

    [<b>Footnote:</b>
    This is logical behavior given that <code>std::split()</code> doesn't skip
    empty substrings. However, it might be surprising behavior to some users.
    Would it be better if the result of splitting an empty string resulted in an
    <em>empty</em> Range?
    &mdash;<b>end footnote</b>]
    </li>
    </ol>

    <h2><a name="open_questions">Open Questions</a></h2>

    <ul>
      <li>
      Should a <em>Delimiter</em>'s find() function return a std::pair&lt;size_t,
      size_t&gt; instead? For example:
      <pre class="example">
      <code>std::pair&lt;size_t, size_t&gt; find(std::string_view text, size_t pos)</code>
      </pre>
      The returned pair's first and second members would refer to the found
      position and length, respectively. Not Found would be represented simply
      as <code>std::make_pair(std::string_view::npos, 0)</code>, which is a
      position of <code>npos</code> and a length of <code>0</code>. This seems
      quite natural.
      </li>
      <li>
      Should any of the following be included as standard Delimiters?
      <ul>
        <li><code>std::fixed_delimiter</code> &mdash; this Delimiter breaks the
        input string at fixed length intervals.</li>

        <li><code>std::limit_delimiter</code> &mdash; this Delimiter template
        would take another Delimiter and a size_t limiting the given delimiter
        to matching a maximum number of times. This is similar to the 3rd argument
        to perl's split() function. </li>

        <li><code>std::regex_delimiter</code> &mdash; this Delimiter would take a
        regex as an argument and would match everywhere the pattern matched in the
        input string.</li>
      </ul>
      </li>

      <li>
      Should the Delimiter API use <code>operator()</code> rather than a named
      <code>find()</code> member function? The Delimiter API currently requires
      a member function named <code>find</code>, but there is no technical
      reason the function must have a name. Perhaps it would be better for
      Delimiters to use <code>operator()</code>.
      </li>
    </ul>
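
    <p>
    To illustrate the first question above, a single-character Delimiter
    written against the <code>std::pair</code>-returning alternative might
    look like the following sketch. The name <code>char_pair_delimiter</code>
    is hypothetical.
    </p>

    <pre class="example">
    <code>#include &lt;cstddef&gt;
    #include &lt;string_view&gt;
    #include &lt;utility&gt;

    struct char_pair_delimiter {
      char c;
      // Returns (position, length) of the next match at or after pos,
      // or (npos, 0) when there is none.
      std::pair&lt;std::size_t, std::size_t&gt; find(std::string_view text,
                                               std::size_t pos) const {
        std::size_t found = text.find(c, pos);
        if (found == std::string_view::npos)
          return std::make_pair(std::string_view::npos, std::size_t{0});
        return std::make_pair(found, std::size_t{1});
      }
    };</code>
    </pre>

    <p>
    With this form, Not Found no longer depends on comparing a view's address
    with the end of the input text.
    </p>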

  </body>
</html>

