<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

<style type="text/css">

body { color: #000000; background-color: #FFFFFF; }
del { text-decoration: line-through; color: #8B0040; }
ins { text-decoration: underline; color: #005100; }

p.example { margin-left: 2em; }
pre.example { margin-left: 2em; }
div.example { margin-left: 2em; }

code.extract { background-color: #F5F6A2; }
pre.extract { margin-left: 2em; background-color: #F5F6A2;
  border: 1px solid #E1E28E; }

p.function { }
.attribute { margin-left: 2em; }
.attribute dt { float: left; font-style: italic;
  padding-right: 1ex; }
.attribute dd { margin-left: 0em; }

blockquote.std { color: #000000; background-color: #F1F1F1;
  border: 1px solid #D1D1D1;
  padding-left: 0.5em; padding-right: 0.5em; }
blockquote.stddel { text-decoration: line-through;
  color: #000000; background-color: #FFEBFF;
  border: 1px solid #ECD7EC;
  padding-left: 0.5em; padding-right: 0.5em; }

blockquote.stdins { text-decoration: underline;
  color: #000000; background-color: #C8FFC8;
  border: 1px solid #B3EBB3; padding: 0.5em; }

table { border: 1px solid black; border-spacing: 0px;
  margin-left: auto; margin-right: auto; }
th { text-align: left; vertical-align: top;
  padding-left: 0.8em; border: none; }
td { text-align: left; vertical-align: top;
  padding-left: 0.8em; border: none; }

</style>

<title>regex_iterator should be iterable</title>
</head>
<body>

<b>Document number:</b> P0757R0 <br>
<b>Date:</b> 2017-09-10 <br>
<b>Project:</b> ISO JTC1/SC22/WG21, Programming Language C++ <br>
<b>Audience:</b> Library Evolution Working Group <br>
<b>Reply to:</b> Arthur O'Dwyer &lt;arthur.j.odwyer@gmail.com&gt; <br>

<h1><code>regex_iterator</code> should be iterable</h1>

<p>
<a href="#Introduction">1. Introduction and motivation</a><br>
<a href="#Member_vs_nonmember">2. Member begin() versus nonmember begin()</a><br>
<a href="#Alternatives">3. Alternative syntax ideas</a><br>
<a href="#Proposed_wording">4. Proposed wording for C++20</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;<a href="#re.syn">28.4 Header &lt;regex> synopsis [re.syn]</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;<a href="#re.iter">28.12 Regular expression iterators [re.iter]</a><br>

<a href="#References">5. References</a><br>
</p>


<h2><a name="Introduction">1. Introduction and motivation</a></h2>

<p>
The C++17 <code>filesystem</code> library introduces a <code>directory_iterator</code>
that can be used as follows: <a href="#directory_iterator">[directory_iterator]</a>
</p>

<pre>
#include &lt;fstream>
#include &lt;iostream>
#include &lt;filesystem>
namespace fs = std::filesystem;

int main()
{
    fs::create_directories("sandbox/a/b");
    std::ofstream("sandbox/file1.txt");
    std::ofstream("sandbox/file2.txt");
    <b>for (auto& p : fs::directory_iterator("sandbox"))</b> {
        std::cout << p << '\n';
    }
    fs::remove_all("sandbox");
}
</pre>

<p>
The <code>directory_iterator</code> object is both an iterator and an iterable,
by the simple mechanism of providing a non-member overload of <code>begin()</code> that returns
its argument unchanged and a non-member overload of <code>end()</code> that returns a
default-constructed <code>directory_iterator</code>, which serves as the end iterator.
</p>
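<p>
The mechanism can be sketched in isolation. The following is only an illustration, not the
library's implementation: <code>demo::countdown_iterator</code> and <code>collect_countdown</code>
are hypothetical names invented here. The iterator's namespace supplies a <code>begin()</code>
that returns its argument and an <code>end()</code> that returns a default-constructed
iterator, and the range-based for-loop finds both by argument-dependent lookup.
</p>

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

namespace demo {

// Hypothetical input iterator that counts down from n to 1.
// A default-constructed countdown_iterator acts as the end iterator,
// in the same way that a default-constructed directory_iterator does.
class countdown_iterator {
    int remaining_ = 0;

public:
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = const int&;

    countdown_iterator() = default;
    explicit countdown_iterator(int n) : remaining_(n) {}

    reference operator*() const { return remaining_; }
    countdown_iterator& operator++() { --remaining_; return *this; }
    countdown_iterator operator++(int) { auto old = *this; ++*this; return old; }

    friend bool operator==(const countdown_iterator& a, const countdown_iterator& b) {
        return a.remaining_ == b.remaining_;
    }
    friend bool operator!=(const countdown_iterator& a, const countdown_iterator& b) {
        return !(a == b);
    }
};

// Range access: begin() returns its argument, end() returns the end iterator.
inline countdown_iterator begin(countdown_iterator iter) noexcept { return iter; }
inline countdown_iterator end(const countdown_iterator&) noexcept { return countdown_iterator(); }

} // namespace demo

// Hypothetical helper: the iterator itself is the range expression.
std::vector<int> collect_countdown(int n) {
    std::vector<int> out;
    for (const int& value : demo::countdown_iterator(n)) {
        out.push_back(value);
    }
    return out;
}
```

<p>
Because the class has no member <code>begin()</code> or <code>end()</code>, the loop falls back
to the non-member overloads in namespace <code>demo</code>.
</p>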

<p>
Since C++11, the standard library has also provided a <code>regex_iterator</code>
without this convenience feature. Its example usage on cppreference.com
<a href="#regex_iterator">[regex_iterator]</a> is:
</p>

<pre>
#include &lt;iostream>
#include &lt;regex>
#include &lt;string>

int main()
{
    const std::string s = "Quick brown fox.";

    std::regex words_regex("[^\\s]+");
    <b>auto words_begin =
        std::sregex_iterator(s.begin(), s.end(), words_regex);
    auto words_end = std::sregex_iterator();

    for (std::sregex_iterator i = words_begin; i != words_end; ++i)</b> {
        std::smatch match = *i;
        std::string match_str = match.str();
        std::cout << match_str << '\n';
    }
}
</pre>

<p>
cppreference.com also includes a warning note:
<blockquote>
It is the programmer's responsibility to ensure that the <code>std::basic_regex</code>
object passed to the iterator's constructor outlives the iterator. Because the iterator
stores a pointer to the regex, incrementing the iterator after the regex was destroyed
accesses a dangling pointer.
</blockquote>
This warning is apparently necessary in the case of <code>regex_iterator</code>,
even though the same reasoning ought to apply to e.g. <code>std::istream_iterator</code>
(it must not outlive its <code>istream</code>) and even <code>std::list::iterator</code>
(it must not outlive its <code>list</code>).
The reason we don't feel the need to warn people about lifetime bugs with <code>list</code>
is that we don't often keep <code>std::list::iterator</code> objects alive outside of their
natural (for-loop) scope. The "iterable iterator" idiom as seen in
<code>directory_iterator</code> can help to keep iterators localized into tight scopes.
</p>

<p>
I propose that the following code should be well-formed:
</p>

<pre>
#include &lt;iostream>
#include &lt;regex>
#include &lt;string>

int main()
{
    const std::string s = "Quick brown fox.";

    std::regex words_regex("[^\\s]+");

    <b>for (auto& match : std::sregex_iterator(s.begin(), s.end(), words_regex))</b> {
        std::string match_str = match.str();
        std::cout << match_str << '\n';
    }
}
</pre>

<h2><a name="Member_vs_nonmember">2. Member begin() versus nonmember begin()</a></h2>

<p>
Why does <code>directory_iterator</code> provide <code>begin()</code> and <code>end()</code>
as free functions instead of member functions? It's a mystery to me. But I'm content
to follow its precedent, rather than propose member <code>begin()</code> and <code>end()</code>
functions for <code>regex_iterator</code>.
</p>

<p>
Incidentally, although it's too late to change it now, I personally believe that both of
these iterators would have benefited from a different structure; that is, rather than the syntax
<pre>
    for (auto& p : fs::directory_iterator("sandbox")) { ... }
    for (auto& m : regex_iterator(begin(s), end(s), r)) { ... }
</pre>
I would have preferred a syntax that split up the "iterable range" object from the
"iterator" object, like this:
<pre>
    for (auto& p : fs::walk_directory("sandbox")) { ... }
    for (auto& m : regex_iterate(s, r)) { ... }
</pre>
</p>
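<p>
A minimal sketch of what such a split might look like for the regex case is given here.
The names <code>demo::regex_match_range</code>, <code>demo::regex_iterate</code>, and
<code>collect_matches</code> are hypothetical and are not part of this proposal; the sketch
assumes <code>std::string</code> input only.
</p>

```cpp
#include <regex>
#include <string>
#include <vector>

namespace demo {

// Hypothetical range type: it is not itself an iterator. It stores the
// first iterator and hands out iterators through member begin()/end().
class regex_match_range {
    std::sregex_iterator first_;

public:
    regex_match_range(const std::string& s, const std::regex& r)
        : first_(s.begin(), s.end(), r) {}

    std::sregex_iterator begin() const { return first_; }
    std::sregex_iterator end() const { return std::sregex_iterator(); }
};

// Hypothetical factory function. Both the string and the regex must
// outlive the returned range, because the iterators refer to them.
inline regex_match_range regex_iterate(const std::string& s, const std::regex& r) {
    return regex_match_range(s, r);
}

} // namespace demo

// Hypothetical helper showing the loop syntax the range type enables.
std::vector<std::string> collect_matches(const std::string& s, const std::regex& r) {
    std::vector<std::string> out;
    for (const std::smatch& m : demo::regex_iterate(s, r)) {
        out.push_back(m.str());
    }
    return out;
}
```

<p>
Unlike the "iterable iterator" idiom, this design uses member <code>begin()</code> and
<code>end()</code> on a separate range object, so no argument-dependent lookup is involved.
</p>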

<h2><a name="Alternatives">3. Alternative syntax ideas</a></h2>

<p>
cppreference uses this style for its loop:
<pre>
    auto B = std::sregex_iterator(b, e, rx);
    auto E = std::sregex_iterator();
    for (auto it = B; it != E; ++it) {
        auto m = *it;
        // use m[0]
    }
</pre>
This can be expressed more compactly today as:
<pre>
    for (std::sregex_iterator it(b, e, rx); it != std::sregex_iterator{}; ++it) {
        auto m = *it;
        // use m[0]
    }
</pre>
The infelicity with that loop is the repetition of the name <code>std::sregex_iterator</code>,
which would be avoidable even without the present proposal if we were to standardize an
<code>operator bool</code> for <code>regex_iterator</code>. Thus:
<pre>
    // Assuming that regex_iterator were convertible to bool with the obvious semantics
    for (std::sregex_iterator it(b, e, rx); it; ++it) {
        auto m = *it;
        // use m[0]
    }
</pre>
Another alternative today is:
<pre>
    auto it = std::sregex_iterator(b, e, rx);
    while (it != std::sregex_iterator{}) {
        auto m = *it++;
        // use m[0]
    }
</pre>
This feels more palatable than the compact for-loop above: although it repeats the name
<code>sregex_iterator</code> just as many times, at least the repetitions are not both on the
same line of code! However, practical experience shows that it is far too easy to
write <code>*it</code> instead of <code>*it++</code>, which leads to an infinite loop.
We should try to channel programmers into the common looping idioms where possible.
Therefore I propose that <code>regex_iterator</code> should be made iterable with
the range-based for-loop.
</p>

<p>
Should <code>regex_iterator</code>, <code>regex_token_iterator</code>, <code>directory_iterator</code>,
<code>istream_iterator</code>, and <code>istreambuf_iterator</code> all be treated in roughly the same
way?
</p>

<p>
Should these iterators be convertible to bool? (Right now none of them is.)
</p>

<p>
Should these iterators be iterable? (Right now <code>directory_iterator</code> is iterable but
the others are not.)
</p>


<h2><a name="Proposed_wording">4. Proposed wording for C++20</a></h2>

<p>
The wording in this section is relative to WG21 draft N4618 <a href="#N4618">[N4618]</a>,
that is, the current draft of the C++17 standard. The new wording is modeled on
the existing wording in N4618 section 27.10.13.2 [directory_iterator.nonmembers].
</p>

<h3><a name="re.syn">28.4 Header &lt;regex> synopsis [re.syn]</a></h3>

<p>
Edit paragraph 1 as follows.
</p>

<blockquote class="std">
<code>
<i>// 28.12.1, class template regex_iterator</i><br>
template &lt;class BidirectionalIterator,<br>
          class charT = typename iterator_traits&lt;<br>
            BidirectionalIterator>::value_type,<br>
          class traits = regex_traits&lt;charT>><br>
  class regex_iterator;<br>
<br>
using cregex_iterator  = regex_iterator&lt;const char*>;<br>
using wcregex_iterator = regex_iterator&lt;const wchar_t*>;<br>
using sregex_iterator  = regex_iterator&lt;string::const_iterator>;<br>
using wsregex_iterator = regex_iterator&lt;wstring::const_iterator>;<br>
<br>
<ins><i>// 28.12.2, range access for regex_iterator</i></ins><br>
<ins>template&lt;class B, class C, class T&gt;</ins><br>
<ins>regex_iterator&lt;B, C, T&gt; begin(regex_iterator&lt;B, C, T&gt;) noexcept;</ins><br>
<ins>template&lt;class B, class C, class T&gt;</ins><br>
<ins>regex_iterator&lt;B, C, T&gt; end(const regex_iterator&lt;B, C, T&gt;&amp;) noexcept;</ins><br>
<br>
<i>// <del>28.12.2</del> <ins>28.12.3</ins>, class template regex_token_iterator</i><br>
template &lt;class BidirectionalIterator,<br>
          class charT = typename iterator_traits&lt;<br>
            BidirectionalIterator>::value_type,<br>
          class traits = regex_traits&lt;charT>><br>
  class regex_token_iterator;<br>
<br>
using cregex_token_iterator  = regex_token_iterator&lt;const char*>;<br>
using wcregex_token_iterator = regex_token_iterator&lt;const wchar_t*>;<br>
using sregex_token_iterator  = regex_token_iterator&lt;string::const_iterator>;<br>
using wsregex_token_iterator = regex_token_iterator&lt;wstring::const_iterator>;<br>
<br>
<ins><i>// 28.12.4, range access for regex_token_iterator</i></ins><br>
<ins>template&lt;class B, class C, class T&gt;</ins><br>
<ins>regex_token_iterator&lt;B, C, T&gt; begin(regex_token_iterator&lt;B, C, T&gt;) noexcept;</ins><br>
<ins>template&lt;class B, class C, class T&gt;</ins><br>
<ins>regex_token_iterator&lt;B, C, T&gt; end(const regex_token_iterator&lt;B, C, T&gt;&amp;) noexcept;</ins><br>
</code>
</blockquote>

<h3><a name="re.iter">28.12 Regular expression iterators [re.iter]</a></h3>

<p>
Renumber section 28.12.2 [re.tokiter] to 28.12.3.
Insert a new section following section 28.12.1 as follows.
</p>

<blockquote class="std">
<ins>
<h3>28.12.2 <code>regex_iterator</code> non-member functions [re.iter.nonmember]</h3>
<p>1. These functions enable range access for <code>regex_iterator</code>.</p>
</ins>
<pre>
    <ins>template&lt;class B, class C, class T&gt;</ins>
    <ins>regex_iterator&lt;B, C, T&gt; begin(regex_iterator&lt;B, C, T&gt; iter) noexcept;</ins>
</pre>
<p><ins>Returns: <code>iter</code>.</ins></p>
<pre>
    <ins>template&lt;class B, class C, class T&gt;</ins>
    <ins>regex_iterator&lt;B, C, T&gt; end(const regex_iterator&lt;B, C, T&gt;&amp;) noexcept;</ins>
</pre>
<p><ins>Returns: <code>regex_iterator&lt;B, C, T&gt;()</code>.</ins></p>
</blockquote>
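<p>
The effect of these two functions can be approximated in user code today. The following is only
a sketch under stated assumptions: user code may not add overloads to namespace <code>std</code>,
so a hypothetical derived type <code>demo::iterable_sregex_iterator</code> stands in for
<code>std::sregex_iterator</code> and gives argument-dependent lookup a namespace in which to
find <code>begin()</code> and <code>end()</code>. The helper <code>collect_words</code> is
likewise hypothetical.
</p>

```cpp
#include <regex>
#include <string>
#include <vector>

namespace demo {

// Hypothetical stand-in for std::sregex_iterator. It adds no behavior;
// it exists only so that the overloads below live in a user namespace.
struct iterable_sregex_iterator : std::sregex_iterator {
    iterable_sregex_iterator(std::string::const_iterator first,
                             std::string::const_iterator last,
                             const std::regex& re)
        : std::sregex_iterator(first, last, re) {}
};

// Mirrors the proposed semantics: begin(iter) returns iter, and end(iter)
// returns a default-constructed iterator. The proposed declarations are
// additionally noexcept; this sketch leaves that off.
inline std::sregex_iterator begin(iterable_sregex_iterator iter) { return iter; }
inline std::sregex_iterator end(const iterable_sregex_iterator&) { return std::sregex_iterator(); }

} // namespace demo

// Hypothetical helper: the loop has the same shape as the motivating example.
std::vector<std::string> collect_words(const std::string& s) {
    const std::regex words_regex("[^\\s]+");
    std::vector<std::string> out;
    for (auto& match : demo::iterable_sregex_iterator(s.begin(), s.end(), words_regex)) {
        out.push_back(match.str());
    }
    return out;
}
```

<p>
Both functions return the plain <code>std::sregex_iterator</code> type, so the loop compiles
even under the C++11 rules that require <code>begin()</code> and <code>end()</code> to return
the same type.
</p>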

<p>
Insert a new section after the renumbered section 28.12.3 [re.tokiter] as follows.
</p>

<blockquote class="std">
<ins>
<h3>28.12.4 <code>regex_token_iterator</code> non-member functions [re.tokiter.nonmember]</h3>
<p>1. These functions enable range access for <code>regex_token_iterator</code>.</p>
</ins>
<pre>
    <ins>template&lt;class B, class C, class T&gt;</ins>
    <ins>regex_token_iterator&lt;B, C, T&gt; begin(regex_token_iterator&lt;B, C, T&gt; iter) noexcept;</ins>
</pre>
<p><ins>Returns: <code>iter</code>.</ins></p>
<pre>
    <ins>template&lt;class B, class C, class T&gt;</ins>
    <ins>regex_token_iterator&lt;B, C, T&gt; end(const regex_token_iterator&lt;B, C, T&gt;&amp;) noexcept;</ins>
</pre>
<p><ins>Returns: <code>regex_token_iterator&lt;B, C, T&gt;()</code>.</ins></p>
</blockquote>

<h2><a name="References">5. References</a></h2>

<dl>

<dt><a name="directory_iterator">[directory_iterator]</a></dt>
<dd>
"std::filesystem::directory_iterator" on cppreference.com.<br>
<a href="http://en.cppreference.com/w/cpp/filesystem/directory_iterator">
http://en.cppreference.com/w/cpp/filesystem/directory_iterator</a>.
</dd>

<dt><a name="N4618">[N4618]</a></dt>
<dd>
"Working Draft, Standard for Programming Language C++" (Nov 2016).<br>
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4618.pdf">
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4618.pdf</a>.
</dd>

<dt><a name="regex_iterator">[regex_iterator]</a></dt>
<dd>
"std::regex_iterator" on cppreference.com.<br>
<a href="http://en.cppreference.com/w/cpp/regex/regex_iterator">
http://en.cppreference.com/w/cpp/regex/regex_iterator</a>.
</dd>

</dl>

</body>
</html>
