<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Issue 2959: char_traits&lt;char16_t&gt;::eof is a valid UTF-16 code unit</title>
<meta property="og:title" content="Issue 2959: char_traits&lt;char16_t&gt;::eof is a valid UTF-16 code unit">
<meta property="og:description" content="C++ library issue. Status: New">
<meta property="og:url" content="https://cplusplus.github.io/LWG/issue2959.html">
<meta property="og:type" content="website">
<meta property="og:image" content="http://cplusplus.github.io/LWG/images/cpp_logo.png">
<meta property="og:image:alt" content="C++ logo">
<style>
  p {text-align:justify}
  li {text-align:justify}
  pre code.backtick::before { content: "`" }
  pre code.backtick::after { content: "`" }
  blockquote.note
  {
    background-color:#E0E0E0;
    padding-left: 15px;
    padding-right: 15px;
    padding-top: 1px;
    padding-bottom: 1px;
  }
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  table.issues-index { border: 1px solid; border-collapse: collapse; }
  table.issues-index th { text-align: center; padding: 4px; border: 1px solid; }
  table.issues-index td { padding: 4px; border: 1px solid; }
  table.issues-index td:nth-child(1) { text-align: right; }
  table.issues-index td:nth-child(2) { text-align: left; }
  table.issues-index td:nth-child(3) { text-align: left; }
  table.issues-index td:nth-child(4) { text-align: left; }
  table.issues-index td:nth-child(5) { text-align: center; }
  table.issues-index td:nth-child(6) { text-align: center; }
  table.issues-index td:nth-child(7) { text-align: left; }
  table.issues-index td:nth-child(5) span.no-pr { color: red; }
  @media (prefers-color-scheme: dark) {
     html {
        color: #ddd;
        background-color: black;
     }
     ins {
        background-color: #225522
     }
     del {
        background-color: #662222
     }
     a {
        color: #6af
     }
     a:visited {
        color: #6af
     }
     blockquote.note
     {
        background-color: rgba(255, 255, 255, .10)
     }
  }
</style>
</head>
<body>
<hr>
<p><em>This page is a snapshot from the LWG issues list; see the <a href="lwg-active.html">Library Active Issues List</a> for more information and the meaning of <a href="lwg-active.html#New">New</a> status.</em></p>
<h3 id="2959"><a href="lwg-active.html#2959">2959</a>. <code>char_traits&lt;char16_t&gt;::eof</code> is a valid UTF-16 code unit</h3>
<p><b>Section:</b> 27.2.4.4 <a href="https://wg21.link/char.traits.specializations.char16.t">[char.traits.specializations.char16.t]</a> <b>Status:</b> <a href="lwg-active.html#New">New</a>
 <b>Submitter:</b> Jonathan Wakely <b>Opened:</b> 2017-05-05 <b>Last modified:</b> 2019-04-02</p>
<p><b>Priority: </b>3
</p>
<p><b>View all other</b> <a href="lwg-index.html#char.traits.specializations.char16.t">issues</a> in [char.traits.specializations.char16.t].</p>
<p><b>View all issues with</b> <a href="lwg-status.html#New">New</a> status.</p>
<p><b>Discussion:</b></p>
<p>
The standard requires that <code>char_traits&lt;char16_t&gt;::int_type</code> is
<code>uint_least16_t</code>, so when that type has the same representation as <code>char16_t</code>
every one of its values is already a possible code unit, and there are no bits left to represent a distinct <code>eof</code> value.
<p/>
27.2.4.4 <a href="https://wg21.link/char.traits.specializations.char16.t">[char.traits.specializations.char16.t]</a> says:
</p>
<blockquote>
<p>
&mdash; The member <code>eof()</code> shall return an implementation-defined constant
that cannot appear as a valid UTF-16 code unit.
</p>
</blockquote>
<p>
Existing practice is to use the "noncharacter" <code>u'\uffff'</code> for this
value, but the Unicode spec is clear that <code>U+FFFF</code> and other
noncharacters are valid, and their appearance in a UTF-16 string does
not make it ill-formed. See <a href="http://www.unicode.org/faq/private_use.html#nonchar7">here</a> and
<a href="http://www.unicode.org/faq/private_use.html#nonchar8">here</a>:
</p>
<blockquote>
<p>
<i>The fact that they are called "noncharacters" and are not intended for open interchange does not mean 
that they are somehow illegal or invalid code points which make strings containing them invalid.</i>
</p>
</blockquote>
<p>
In practice this means there's no way to tell if
<code>basic_streambuf&lt;char16_t&gt;::sputc(u'\uffff')</code> succeeded or not. If it
can insert the character it returns <code>to_int_type(u'\uffff')</code> and
otherwise it returns <code>eof()</code>, which is the same value.
<p/>
I believe that <code>char_traits&lt;char16_t&gt;::to_int_type(char_type c)</code> can be
defined to transform <code>U+FFFF</code> into <code>U+FFFD</code>, so that the invariant
<code>eq_int_type(eof(), to_int_type(c)) == false</code> holds for any <code>c</code> (and the
return value of <code>sputc</code> will be distinct from <code>eof</code>). I don't think any
implementation currently meets that invariant.
<p/>
I think at the very least we need to correct the statement "The member
<code>eof()</code> shall return an implementation-defined constant that cannot
appear as a valid UTF-16 code unit", because there are no such
constants if <code>sizeof(uint_least16_t) == sizeof(char16_t)</code>.
<p/>
This issue is closely related to LWG <a href="lwg-closed.html#1200" title="&quot;surprising&quot; char_traits&lt;T&gt;::int_type requirements (Status: NAD)">1200</a><sup><a href="https://cplusplus.github.io/LWG/issue1200" title="Latest snapshot">(i)</a></sup>, but there it's a
slightly different statement of the problem, and neither the
submitter's recommendation nor the proposed resolution there solves
the issue described here. It seems that LWG 1200 was closed as NAD before the Unicode
corrigendum existed, so at the time our standard just gave "surprising results"
but wasn't strictly wrong. Now it makes a normative statement that
conflicts with Unicode.
</p>
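<p>
As an illustration of the collision described above and of the suggested <code>to_int_type</code> remapping, here is a minimal sketch. The structs <code>naive_traits</code> and <code>remapping_traits</code> are hypothetical stand-ins written for this example, not any implementation's <code>char_traits&lt;char16_t&gt;</code>. Both assume the existing practice of using <code>0xFFFF</code> as the <code>eof()</code> value.
</p>
<blockquote><pre><code>#include &lt;cstdint&gt;

// Hypothetical stand-in modelling existing practice: eof() is 0xFFFF and
// to_int_type is a plain conversion. Not any library's actual char_traits.
struct naive_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;

    static constexpr int_type eof() noexcept {
        return static_cast&lt;int_type&gt;(0xFFFF);
    }
    static constexpr int_type to_int_type(char_type c) noexcept {
        return static_cast&lt;int_type&gt;(c);
    }
    static constexpr bool eq_int_type(int_type a, int_type b) noexcept {
        return a == b;
    }
};

// Hypothetical variant of the remapping idea: U+FFFF is reported as U+FFFD,
// so no char_type value converts to eof().
struct remapping_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;

    static constexpr int_type eof() noexcept {
        return static_cast&lt;int_type&gt;(0xFFFF);
    }
    static constexpr int_type to_int_type(char_type c) noexcept {
        return c == char_type(0xFFFF) ? static_cast&lt;int_type&gt;(0xFFFD)
                                      : static_cast&lt;int_type&gt;(c);
    }
    static constexpr bool eq_int_type(int_type a, int_type b) noexcept {
        return a == b;
    }
};

// With the naive traits, the int_type for U+FFFF is indistinguishable from
// eof(), so a successful sputc of U+FFFF would look like a failure.
static_assert(naive_traits::eq_int_type(
                  naive_traits::eof(),
                  naive_traits::to_int_type(char16_t(0xFFFF))),
              "U+FFFF collides with eof()");

// With the remapping, the same character no longer compares equal to eof().
static_assert(!remapping_traits::eq_int_type(
                  remapping_traits::eof(),
                  remapping_traits::to_int_type(char16_t(0xFFFF))),
              "U+FFFF is distinct from eof()");
</code></pre></blockquote>
<p>
One cost visible in this sketch is that <code>U+FFFF</code> and <code>U+FFFD</code> map to the same <code>int_type</code> value, so converting back to <code>char_type</code> cannot recover <code>U+FFFF</code>. The example does not settle whether that trade-off is acceptable.
</p>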

<p><i>[2017-07 Toronto Wed Issue Prioritization]</i></p>

<p>Priority 3</p>


<p id="res-2959"><b>Proposed resolution:</b></p>





</body>
</html>
