<!DOCTYPE html>
<html lang="en"><head><meta charset="UTF-8">
<title>Making std::istream::ignore less surprising</title>
  <style type='text/css'>
  body {font-variant-ligatures: none;}
  p {text-align:justify}
  li {text-align:justify}
  blockquote.note, div.note
  {
          background-color:#E0E0E0;
          padding-left: 15px;
          padding-right: 15px;
          padding-top: 1px;
          padding-bottom: 1px;
  }
  p code {color:navy}
  ins p code {background-color:PaleGreen}
  p ins code {background-color:PaleGreen}
  p del code {background-color:LightPink}
  ins {background-color:PaleGreen}
  del {background-color:LightPink}
  table#boilerplate { border:0 }
  table#boilerplate td { padding-left: 2em }
  table.bordered, table.bordered th, table.bordered td {
    border: 1px solid;
    text-align: center;
  }
  ins.block {background-color:PaleGreen; text-decoration: none}
  del.block {background-color:LightPink; text-decoration: none}
  #hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
  </style>
<meta property="og:title" content="Making std::istream::ignore less surprising">
<meta property="og:type" content="website">
<meta property="og:image" content="https://isocpp.org/assets/images/cpp_logo.png">
<meta property="og:url" content="https://wg21.link/P3223R1">
</head><body>
<table id="boilerplate">
<tr><td>Document number</td><td>P3223R1</td></tr>
<tr><td>Date</td><td>2024-07-03</td></tr>
<tr><td>Project</td><td>Programming Language C++, Library Evolution Working Group</td></tr>
<tr><td>Reply-to</td><td>Jonathan Wakely &lt;cxx&#x40;kayari.org&gt;</td></tr>
</table><hr>
<h1>Making std::istream::ignore less surprising</h1>
<ul>
 <li>
 <ul>
  <li><a href="#Revision-History">Revision History</a></li>
  <li><a href="#Introduction">Introduction</a></li>
  <li><a href="#Discussion">Discussion</a>
  <ul>
   <li><a href="#Detailed-discussion">Detailed discussion</a></li>
   <li><a href="#Writing-correct-code-is-unnecessarily-hard">Writing correct code is unnecessarily hard</a></li>
  </ul>
  </li>
  <li><a href="#Possible-solutions">Possible solutions</a>
  <ul>
   <li><a href="#Change--3c-code-3e-ignore-3c--2f-code-3e--to-handle-negative--3c-code-3e-delim-3c--2f-code-3e--values">Change <code>ignore</code> to handle negative <code>delim</code> values</a></li>
   <li><a href="#Split--3c-code-3e-ignore-3c--2f-code-3e--into-two-functions-and-change--3c-code-3e-delim-3c--2f-code-3e--to--3c-code-3e-char_type-3c--2f-code-3e-">Split <code>ignore</code> into two functions and change <code>delim</code> to <code>char_type</code></a></li>
   <li><a href="#Add-an-overload-of--3c-code-3e-ignore-3c--2f-code-3e--that-takes-a--3c-code-3e-char_type-3c--2f-code-3e-">Add an overload of <code>ignore</code> that takes a <code>char_type</code></a></li>
   <li><a href="#As-above-2c--but-constrain-the-new-overload-to-prevent-ambiguties">As above, but constrain the new overload to prevent ambiguities</a></li>
   <li><a href="#Do-nothing">Do nothing</a></li>
  </ul>
  </li>
  <li><a href="#Proposed-wording">Proposed wording</a></li>
  <li><a href="#Acknowledgements">Acknowledgements</a></li>
  <li><a href="#References">References</a></li>
 </ul>
 </li>
</ul>
<a name="Revision-History"></a>
<h2>Revision History</h2>

<p>Changes in R1:</p>

<ul>
<li>Extend discussion to avoid implicit assumptions about <code>to_int_type</code> details.</li>
<li>Adjust proposed wording to add a <em>Constraints</em>: element to the new overload.</li>
<li>Adjust title to be specific to <code>std::istream</code> not <code>std::basic_istream</code>.</li>
</ul>


<a name="Introduction"></a>
<h2>Introduction</h2>

<p><code>std::istream::ignore(n, delim)</code> has surprising behaviour
if <code>delim</code> is a <code>char</code> with a negative value.
We can remove the surprise and make code more robust.</p>

<a name="Discussion"></a>
<h2>Discussion</h2>

<p>Passing a <code>char</code> to <code>std::istream::ignore</code> as the delimiter is a bug.
The delimiter argument should be an <code>int_type</code> value, so correct code
must do <code>is.ignore(n, std::char_traits&lt;char&gt;::to_int_type(delim))</code> instead.
There is no guarantee that an implicit conversion from a <code>char</code> value
to <code>int_type</code> yields the same value as calling <code>to_int_type</code>. On platforms
where <code>char</code> is signed, if the value of <code>EOF</code> fits in <code>char</code> then there
is always at least one <code>char</code> value (the one equal to <code>EOF</code>) that cannot
be used as a delimiter to <code>std::istream::ignore</code>, because <code>c == to_int_type(c)</code>
is false.</p>

<p>For example, on a platform where <code>char</code> is signed, <code>EOF</code> equals <code>-1</code>, and the
literal encoding is ISO-8859-1 or Windows-1252, the call <code>is.ignore(n, 'ÿ')</code>
will never match the delimiter, because <code>'ÿ' == EOF</code> is true.</p>

<p>In theory, the <code>to_int_type</code> mapping could be <code>~c</code> or <code>c + 256</code>, where the
former means that most <code>char</code> values will match unintended characters, and the
latter means that no <code>char</code> value passed directly to <code>ignore</code> will ever match.
In practice, real implementations use a more straightforward mapping, such
that calling <code>to_int_type(c)</code> is equivalent to <code>(int)(unsigned char)c</code> and
iostream functions that take an <code>int_type</code> argument expect values between -1
and 255.
This matches how the C functions like <code>isalpha(int)</code> and <code>toupper(int)</code> work.
So for all known implementations, an implicit conversion from <code>char</code> to
<code>int_type</code> is incorrect for any negative <code>char</code> value.</p>
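<p>As a minimal sketch of that mismatch, the two conversions can be compared directly.
The helper names <code>via_traits</code> and <code>via_implicit_conversion</code> are hypothetical,
and <code>signed char</code> stands in for a platform where <code>char</code> is signed.
The sketch assumes an 8-bit <code>char</code> and the <code>(int)(unsigned char)c</code> mapping described above.</p>

```cpp
#include <cassert>
#include <string>

// What the stream uses internally: the mapping defined by the traits type.
inline int via_traits(char c) {
    return std::char_traits<char>::to_int_type(c);
}

// What an implicit char-to-int conversion yields when char is signed.
// signed char is used so the result does not depend on the platform's
// choice of signedness for plain char.
inline int via_implicit_conversion(char c) {
    return static_cast<signed char>(c);
}
```

<p>Under those assumptions the two functions agree for every non-negative <code>char</code> value
and disagree for every negative one, which is exactly the set of values that goes wrong in <code>ignore</code>.</p>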

<a name="Detailed-discussion"></a>
<h3>Detailed discussion</h3>

<p>This section assumes an implementation where <code>to_int_type</code> is equivalent to
casting to <code>unsigned char</code>, but even for a hypothetical implementation where
that isn't true, <code>is.ignore(n, ch)</code> is still wrong in general, as shown above.</p>

<p>A <code>delim</code> value passed to <code>std::istream::ignore</code> is matched using
<code>traits_type::eq_int_type(rdbuf()-&gt;sgetc(), delim)</code>. The <code>sgetc()</code> function
never returns negative values (except at EOF). It extracts a character
<code>c</code> from the input sequence and then calls <code>traits_type::to_int_type(c)</code> to
convert it to <code>int_type</code>, which produces a non-negative value. So if <code>delim</code>
is negative, then the <code>eq_int_type</code> comparison always fails.</p>
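<p>The comparison can be reproduced outside the stream with a plain <code>std::stringbuf</code>.
This is only a sketch: <code>next_char_matches</code> is a hypothetical helper, and it assumes the
usual implementation where <code>to_int_type</code> behaves like a cast to <code>unsigned char</code>.</p>

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if `delim` compares equal to the next character of `input`,
// using the same eq_int_type(sgetc(), delim) test described for ignore().
inline bool next_char_matches(const std::string& input, int delim) {
    typedef std::char_traits<char> traits;
    std::stringbuf buf(input);
    return traits::eq_int_type(buf.sgetc(), delim);
}
```

<p>With the byte <code>0x80</code> at the front of the buffer, a delimiter of <code>128</code> matches,
whereas <code>-128</code> (what a signed <code>char</code> holding that byte converts to) does not.</p>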

<p>The <code>std::char_traits&lt;char&gt;::to_int_type(char c)</code> function converts the
character <code>c</code> to a non-negative integer, as if by <code>(int)(unsigned char)c</code>.
This allows the value <code>(int_type)-1</code> to be reserved for EOF without worrying
about whether <code>char</code> is signed and whether <code>(char)-1</code> can equal <code>EOF</code>.
But it means that any code dealing with raw <code>char</code> values from the stream must
consistently use <code>to_int_type</code> to convert the (possibly negative) <code>char</code> value
into a non-negative <code>int_type</code> so that all characters are represented in the
same form and can be compared like-for-like.</p>

<p>So if a negative delimiter <code>char</code> can never match, this means that users should
call <code>std::cin.ignore(n, std::char_traits&lt;char&gt;::to_int_type(c))</code> or
<code>std::cin.ignore(n, (unsigned char)c)</code>
in case <code>char</code> is signed on their platform, and <code>c</code> happens to have the most
significant bit set, i.e., is a negative value. For example, on a typical
x86_64 Linux system where <code>char</code> is signed,
<code>std::cin.ignore(std::numeric_limits&lt;std::streamsize&gt;::max(), '\x80')</code>
will always discard all input up to EOF, even if <code>\x80</code> is present in the
input.</p>
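<p>The following sketch makes that behaviour observable. The helper <code>rest_after_ignore</code> is
hypothetical and takes the delimiter as an <code>int</code>, so that the caller decides exactly
which <code>int_type</code> value reaches <code>ignore</code>. Passing <code>-128</code> models the implicit
conversion of <code>'\x80'</code> on a signed-<code>char</code> platform, and the usual
<code>to_int_type</code> mapping is assumed.</p>

```cpp
#include <cassert>
#include <iterator>
#include <limits>
#include <sstream>
#include <string>

// Skips input up to and including `delim`, then returns whatever is left.
inline std::string rest_after_ignore(const std::string& input, int delim) {
    std::istringstream in(input);
    in.ignore(std::numeric_limits<std::streamsize>::max(), delim);
    // Read the remainder straight from the stream buffer, so the result
    // does not depend on whether ignore() set eofbit.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

<p>For the input <code>"abc\x80xyz"</code>, a delimiter of <code>128</code> leaves <code>"xyz"</code> unread,
while a delimiter of <code>-128</code> consumes the whole input.</p>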

<a name="Writing-correct-code-is-unnecessarily-hard"></a>
<h3>Writing correct code is unnecessarily hard</h3>

<p>In generic code that works with streams of either <code>char</code> or <code>wchar_t</code>
you need to use the more verbose <code>to_int_type</code> form rather than casting to
<code>unsigned char</code>, because casting a <code>wchar_t</code> to <code>unsigned char</code> would be wrong.
That's unfortunate, because <code>std::char_traits&lt;wchar_t&gt;::to_int_type</code> doesn't
have the same "cast to unsigned" behaviour, and so passing a <code>wchar_t</code>
directly to <code>std::wistream::ignore</code> without conversion works correctly.
But to work with both <code>char</code> and <code>wchar_t</code>, generic code must assume the worst
and defend against the signed <code>char</code> trap.</p>
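<p>A generic wrapper along these lines is roughly what careful code has to write today.
The names <code>ignore_until</code> and <code>rest_after</code> are hypothetical and this is a sketch, not a
recommended library interface.</p>

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Skips up to n characters, stopping after the first `delim`.
// The conversion goes through Traits, so it is correct for char and wchar_t.
template <class CharT, class Traits>
std::basic_istream<CharT, Traits>&
ignore_until(std::basic_istream<CharT, Traits>& in, std::streamsize n, CharT delim) {
    return in.ignore(n, Traits::to_int_type(delim));
}

// Test helper: returns the part of `text` that follows the first `delim`.
template <class CharT>
std::basic_string<CharT> rest_after(const std::basic_string<CharT>& text, CharT delim) {
    std::basic_istringstream<CharT> in(text);
    ignore_until(in, static_cast<std::streamsize>(text.size()), delim);
    return std::basic_string<CharT>(std::istreambuf_iterator<CharT>(in),
                                    std::istreambuf_iterator<CharT>());
}
```

<p>The same template handles <code>rest_after(std::string("abc\x80xyz"), '\x80')</code> and
<code>rest_after(std::wstring(L"abc;xyz"), L';')</code>, at the cost of naming the traits type
in what should be a simple call.</p>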

<p>Even in non-generic code that only works with <code>char</code>, you still need to
remember that the trap exists, and remember to avoid it.
It's rare that I see anybody get that right for <code>std::isalpha(c)</code> et al,
and I wasn't even aware of the need to do it for <code>ignore</code> until this week.</p>

<p>I suspect that most users are not aware of the need to use <code>to_int_type</code> here,
which means that the <code>ignore</code> function is surprisingly fragile.
It's also quite ugly to have to cast or deal with the traits type directly in
a high-level <code>istream</code> API like <code>ignore</code>, which is not a low-level <code>streambuf</code>
member function. Several <code>basic_streambuf</code> members use <code>int_type</code> for arguments
and return values, but at the <code>basic_istream</code> level the other unformatted input
functions that take a delimiter (<code>get</code> and <code>getline</code>) take a <code>char_type</code>.
That means they are agnostic to whether <code>char</code> is signed or unsigned,
and any conversion to <code>int_type</code> is done by the stream, not expected to be
done by the caller.</p>

<a name="Possible-solutions"></a>
<h2>Possible solutions</h2>

<a name="Change--3c-code-3e-ignore-3c--2f-code-3e--to-handle-negative--3c-code-3e-delim-3c--2f-code-3e--values"></a>
<h3>Change <code>ignore</code> to handle negative <code>delim</code> values</h3>

<p>We could modify the spec for <code>basic_istream::ignore</code> so that negative values
(except for -1 which must be reserved for EOF)
are automatically fed through <code>to_int_type</code> to "clean" them, so that they're
in the same domain as the values returned by <code>sgetc()</code>.</p>

<p>The GNU C library takes this approach for its <code>&lt;ctype.h&gt;</code> functions,
which are specified to take <code>int</code>, but which have undefined behaviour if the
argument is not an <code>unsigned char</code> value or <code>EOF</code>. So <code>isalpha('\x80')</code> has
undefined behaviour if <code>char</code> is signed. But with Glibc, it works.
Values in the range [-128,-1) are handled as if converted to <code>unsigned char</code> automatically,
so that negative <code>char</code> values don't produce undefined behaviour. Obviously
it's still possible to misuse those functions, e.g., by passing an <code>int</code> value
outside the range of <code>char</code> <em>or</em> <code>unsigned char</code>, e.g., <code>isalpha(1000)</code>. But
the apparently simple <code>isalpha(c)</code> for a <code>char</code> value isn't undefined just
because <code>c</code> happens to be a negative value of a signed type.</p>

<p>Taking the same approach for <code>ignore</code> would remove the trap for cases like
<code>cin.ignore(n, '\x80')</code>, making it behave as
<code>cin.ignore(n, std::char_traits&lt;char&gt;::to_int_type('\x80'))</code>.
However, it would also "fix" cases like <code>cin.ignore(n, -10)</code> which are less
likely to be correct. There would be no way to tell the difference between
a <code>char</code> with the value <code>-10</code> and an <code>int</code> with the value <code>-10</code>, but the latter
seems odd, and possibly a bug. Depending how we specified it, this solution
might also give defined behaviour to <code>cin.ignore(n, +1000)</code> and
<code>cin.ignore(n, -9999)</code> which are just nonsense.</p>

<p>The other downside of this solution is that it only fixes negative delimiters
less than or equal to -2, because -1 still has to be reserved to mean EOF.
So on a platform where <code>char</code> is signed, <code>cin.ignore(n, '\xfe')</code> would work,
but <code>cin.ignore(n, '\xff')</code> would not, because that value is
<code>traits_type::eof()</code>. On a platform where <code>char</code> is unsigned, both work.
So this solution removes <em>most</em> surprises, but not all, and doesn't have
portable guarantees. Users should really still use
<code>in.ignore(n, std::char_traits&lt;char&gt;::to_int_type(c))</code> to work with arbitrary
delimiter characters on arbitrary platforms.</p>
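<p>To make the limitation concrete, here is one way the "cleaning" step might look.
This is a sketch under my own assumptions, not proposed wording: <code>clean_delim</code> is a
hypothetical function, and only values in the range of <code>signed char</code> are cleaned.</p>

```cpp
#include <cassert>
#include <limits>
#include <string>

// Maps negative char-like values into the domain returned by sgetc().
// The EOF value is left alone, as are non-negative and out-of-range values.
inline int clean_delim(int delim) {
    typedef std::char_traits<char> traits;
    if (traits::eq_int_type(delim, traits::eof()))
        return delim;  // must keep meaning "no delimiter"
    if (delim < 0 && delim >= std::numeric_limits<signed char>::min())
        return traits::to_int_type(static_cast<char>(delim));
    return delim;
}
```

<p>With the usual mapping, <code>clean_delim(-2)</code> yields <code>0xFE</code>, but a signed
<code>'\xff'</code> arrives as the <code>EOF</code> value and is passed through untouched, so that one
delimiter still cannot be matched.</p>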

<a name="Split--3c-code-3e-ignore-3c--2f-code-3e--into-two-functions-and-change--3c-code-3e-delim-3c--2f-code-3e--to--3c-code-3e-char_type-3c--2f-code-3e-"></a>
<h3>Split <code>ignore</code> into two functions and change <code>delim</code> to <code>char_type</code></h3>

<p>We could change <code>delim</code> to <code>char_type</code> and make <code>ignore</code> convert that to
<code>int_type</code> internally by using <code>to_int_type</code>.</p>

<pre><code>    basic_istream&amp; ignore(streamsize n = 1<del>, int_type = traits_type::eof()</del>);
    <ins>basic_istream&amp; ignore(streamsize n, char_type delim);</ins>
</code></pre>

<p>This would preserve the same behaviour for <code>in.ignore(n)</code>, i.e., ignore up
to <code>n</code> characters or up to EOF, whichever happens first. But it would break
explicit uses of eof as the second argument, e.g.,
<code>in.ignore(n, traits::eof())</code>. This would implicitly convert the <code>eof()</code>
value to <code>char_type</code> and treat it as a delimiter. If <code>(char)-1</code> happens to be
present in the next <code>n</code> characters of the input sequence, it would match and
we would stop ignoring too soon.
This would break too much code, and we can do better.</p>
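<p>The breakage can be simulated with two free functions. Both names are hypothetical;
the second one narrows its argument to <code>char</code> explicitly to mimic the implicit conversion
that the changed signature would perform. The sketch assumes that <code>EOF</code> is <code>-1</code> and that
<code>char</code> is 8 bits wide.</p>

```cpp
#include <cassert>
#include <iterator>
#include <limits>
#include <sstream>
#include <string>

// Returns whatever `in` has not consumed yet.
inline std::string unread(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Today's signature: the delimiter stays an int_type.
inline std::string rest_with_int_type_delim(const std::string& input, int delim) {
    std::istringstream in(input);
    in.ignore(std::numeric_limits<std::streamsize>::max(), delim);
    return unread(in);
}

// Simulated char_type-only signature: the argument is narrowed to char
// first, then converted back with to_int_type inside the function.
inline std::string rest_with_char_type_delim(const std::string& input, int delim) {
    const char narrowed = static_cast<char>(delim);
    std::istringstream in(input);
    in.ignore(std::numeric_limits<std::streamsize>::max(),
              std::char_traits<char>::to_int_type(narrowed));
    return unread(in);
}
```

<p>Given <code>"abc\xffxyz"</code> and <code>traits::eof()</code>, the first function discards everything,
as callers expect today, while the second stops after the <code>0xFF</code> byte and leaves <code>"xyz"</code>.</p>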

<a name="Add-an-overload-of--3c-code-3e-ignore-3c--2f-code-3e--that-takes-a--3c-code-3e-char_type-3c--2f-code-3e-"></a>
<h3>Add an overload of <code>ignore</code> that takes a <code>char_type</code></h3>

<p>We don't need to change the existing <code>ignore</code>, we can just add an overload
that does the correct conversion to <code>int_type</code>:</p>

<pre><code>    basic_istream&amp; ignore(streamsize n, char_type delim)
    { return ignore(n, traits_type::to_int_type(delim)); }
</code></pre>

<p>This does exactly what users expect when calling <code>ignore(n, c)</code> with a
<code>char_type</code> argument (even <code>(char)-1</code>), with no alarms and no surprises.
The behaviour is entirely consistent for signed or unsigned <code>char</code> and always
matches the given <code>delim</code> if it occurs in the input sequence.</p>
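<p>The resulting overload set can be approximated today with a derived stream class.
<code>patched_istringstream</code> and its <code>rest</code> member are hypothetical names used only for this
sketch, which again assumes the usual <code>to_int_type</code> mapping and <code>EOF == -1</code>.</p>

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>

// Adds a char overload next to the inherited ignore(streamsize, int_type).
class patched_istringstream : public std::istringstream {
public:
    explicit patched_istringstream(const std::string& s) : std::istringstream(s) {}

    using std::istringstream::ignore;  // keep the int_type overload visible

    patched_istringstream& ignore(std::streamsize n, char delim) {
        std::istringstream::ignore(n, traits_type::to_int_type(delim));
        return *this;
    }

    // Everything not yet consumed, for inspection.
    std::string rest() {
        return std::string(std::istreambuf_iterator<char>(*this),
                           std::istreambuf_iterator<char>());
    }
};
```

<p>A call with a <code>char</code> argument such as <code>'\xff'</code> selects the new overload and matches the
byte in the input, while a call with <code>traits_type::eof()</code> still selects the original overload
and reads to the end.</p>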

<p>The downside of this solution is that calls that pass a delimiter that is
neither <code>int_type</code> nor <code>char_type</code> become ambiguous,
e.g. <code>std::cin.ignore(n, 1ULL)</code> is valid today but would become ambiguous
with this new overload. Arguably, that's a good thing.
What is this code trying to do? What if it passes a value that doesn't even
fit in <code>int_type</code>, e.g. <code>numeric_limits&lt;int_type&gt;::max()+1LL</code>?
Maybe it's good for such calls to not compile.</p>
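<p>The ambiguity is easy to check with a stand-in type that has the same two overload shapes.
<code>mock_stream</code>, <code>can_call</code> and <code>unambiguous</code> are hypothetical names, and the stand-in
fixes <code>int_type</code> to <code>int</code> and <code>char_type</code> to <code>char</code>.</p>

```cpp
#include <cassert>
#include <ios>
#include <type_traits>
#include <utility>

// Same overload shapes as the existing function plus the new overload.
struct mock_stream {
    void ignore(std::streamsize, int) {}
    void ignore(std::streamsize, char) {}
};

// Chosen only if ignore(streamsize, T) resolves to exactly one overload.
template <class T>
auto can_call(int)
    -> decltype(std::declval<mock_stream&>().ignore(std::streamsize(), std::declval<T>()),
                std::true_type());

// Fallback when the call above is ill-formed, including when it is ambiguous.
template <class T>
std::false_type can_call(...);

template <class T>
constexpr bool unambiguous() {
    return decltype(can_call<T>(0))::value;
}
```

<p>Here <code>unambiguous&lt;char&gt;()</code> and <code>unambiguous&lt;int&gt;()</code> are true, and types that
promote to <code>int</code> such as <code>unsigned char</code> also resolve cleanly, but
<code>unambiguous&lt;unsigned long long&gt;()</code> is false because both overloads need an integral conversion.</p>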

<p>Since the problem only exists for <code>char</code> istreams and not <code>wchar_t</code>,
this new overload could be specified as present only when <code>char_type</code> is
<code>char</code>. That would avoid introducing any ambiguities for <code>std::wistream</code>.</p>

<a name="As-above-2c--but-constrain-the-new-overload-to-prevent-ambiguties"></a>
<h3>As above, but constrain the new overload to prevent ambiguities</h3>

<pre><code>    basic_istream&amp; ignore(streamsize n, same_as&lt;char_type&gt; auto delim)
    { return ignore(n, traits_type::to_int_type(delim)); }
</code></pre>

<p>This will only be selected by overload resolution when called with an argument
of type <code>char_type</code>. For all other argument types, the existing overload will
be selected and if the argument isn't of type <code>int_type</code> it will implicitly
convert to <code>int_type</code>, exactly as happens today.</p>

<p>This doesn't change the meaning of any existing code, except for calls with
negative <code>char_type</code> values which would start to work as users probably
expected them to all along. The downside is the additional complexity of
using a constrained overload, which would need to be emulated with SFINAE if
vendors wanted to backport this fix to older standards modes.</p>
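<p>A SFINAE emulation of the constrained overload might look like the following sketch.
The type <code>mock_constrained_stream</code> and the enumeration <code>chosen</code> are hypothetical and
exist only to report which overload was selected. Real implementations would of course
apply the same technique inside <code>basic_istream</code>.</p>

```cpp
#include <cassert>
#include <ios>
#include <type_traits>

enum class chosen { int_type_overload, char_type_overload };

struct mock_constrained_stream {
    typedef char char_type;
    typedef int  int_type;

    // Existing signature: accepts anything convertible to int_type.
    chosen ignore(std::streamsize, int_type) { return chosen::int_type_overload; }

    // Pre-C++20 stand-in for `same_as<char_type> auto delim`:
    // participates only when the argument type is exactly char_type.
    template <class C,
              typename std::enable_if<std::is_same<C, char_type>::value, int>::type = 0>
    chosen ignore(std::streamsize, C) { return chosen::char_type_overload; }
};
```

<p>With this shape, <code>ignore(n, 'x')</code> picks the new overload, and <code>ignore(n, 1ULL)</code> or
<code>ignore(n, -1)</code> quietly fall back to the existing one instead of becoming ambiguous.</p>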

<p>Personally, I think the non-constrained overload is the best option, and that
the cases which become ambiguous should probably be fixed to clarify what
they're intending to do. Making the conversion to <code>int_type</code> explicit
(and using <code>to_int_type</code> if appropriate) would probably be an improvement.</p>

<a name="Do-nothing"></a>
<h3>Do nothing</h3>

<p>Boo! Users deserve better. Well, most of them. Some don't.</p>

<a name="Proposed-wording"></a>
<h2>Proposed wording</h2>

<p>The edits are shown relative to <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/n4981.pdf" title="Working Draft - Programming Languages -- C++">N4981</a>.</p>

<p>Modify the class synopsis in [istream.general] as shown:</p>

<pre><code>    // [istream.unformatted], unformatted input
    streamsize gcount() const;
    int_type get();
    basic_istream&amp; get(char_type&amp; c);
    basic_istream&amp; get(char_type* s, streamsize n);
    basic_istream&amp; get(char_type* s, streamsize n, char_type delim);
    basic_istream&amp; get(basic_streambuf&lt;char_type, traits&gt;&amp; sb);
    basic_istream&amp; get(basic_streambuf&lt;char_type, traits&gt;&amp; sb, char_type delim);

    basic_istream&amp; getline(char_type* s, streamsize n);
    basic_istream&amp; getline(char_type* s, streamsize n, char_type delim);

    basic_istream&amp; ignore(streamsize n = 1, int_type delim = traits::eof());
    <ins>basic_istream&amp; ignore(streamsize n, char_type delim);</ins>
    int_type       peek();
    basic_istream&amp; read(char_type* s, streamsize n);
    streamsize     readsome(char_type* s, streamsize n);
</code></pre>

<p>Modify [istream.unformatted] as shown:</p>

<blockquote><p><code>basic_istream&amp; ignore(streamsize n = 1, int_type delim = traits::eof());</code></p>

<p>-25-
<em>Effects</em>: Behaves as an unformatted input function (as described above).
After constructing a sentry object, extracts characters and discards them.
Characters are extracted until any of the following occurs:</p>

<blockquote><p>(25.1) — <code>n != numeric_limits&lt;streamsize&gt;::max()</code> ([numeric.limits]) and <code>n</code>
characters have been extracted so far <br/>
(25.2) — end-of-file occurs on the input sequence (in which case the function
calls <code>setstate(eofbit)</code>, which may throw <code>ios_base::failure</code> ([iostate.flags]));<br/>
(25.3) — <code>traits::eq_int_type(traits::to_int_type(c), delim)</code> for the next
available input character <code>c</code> (in which case <code>c</code> is extracted).</p></blockquote>

<p>[<em>Note 1</em>: The last condition will never occur if <code>traits::eq_int_type(delim, traits::eof())</code>. — <em>end note</em>]</p>

<p>-26- <em>Returns</em>: <code>*this</code>.</p>

<p><ins><code>basic_istream&amp; ignore(streamsize n, char_type delim);</code></ins></p>

<p><ins>-?- <em>Constraints</em>: <code>is_same_v&lt;char_type, char&gt;</code> is <code>true</code>.</ins></p>

<p><ins>-?- <em>Effects</em>: Equivalent to:
<code>return ignore(n, traits_type::to_int_type(delim));</code></ins></p></blockquote>

<a name="Acknowledgements"></a>
<h2>Acknowledgements</h2>

<p>Thanks to Iain Sandoe and Ulrich Drepper for inspiring this.
Thanks to Tom Honermann for getting me to state the problem in the general case.</p>

<a name="References"></a>
<h2>References</h2>

<p><a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/n4981.pdf" title="Working Draft - Programming Languages -- C++">N4981</a>, Working Draft - Programming Languages -- C++, Thomas Köppe, 2024.</p>
</body></html>
