<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Issue 76: Can a codecvt facet always convert one internal character at a time?</title>
<meta property="og:title" content="Issue 76: Can a codecvt facet always convert one internal character at a time?">
<meta property="og:description" content="C++ library issue. Status: CD1">
<meta property="og:url" content="https://cplusplus.github.io/LWG/issue76.html">
<meta property="og:type" content="website">
<meta property="og:image" content="http://cplusplus.github.io/LWG/images/cpp_logo.png">
<meta property="og:image:alt" content="C++ logo">
<style>
  p {text-align:justify}
  li {text-align:justify}
  pre code.backtick::before { content: "`" }
  pre code.backtick::after { content: "`" }
  blockquote.note
  {
    background-color:#E0E0E0;
    padding-left: 15px;
    padding-right: 15px;
    padding-top: 1px;
    padding-bottom: 1px;
  }
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  table.issues-index { border: 1px solid; border-collapse: collapse; }
  table.issues-index th { text-align: center; padding: 4px; border: 1px solid; }
  table.issues-index td { padding: 4px; border: 1px solid; }
  table.issues-index td:nth-child(1) { text-align: right; }
  table.issues-index td:nth-child(2) { text-align: left; }
  table.issues-index td:nth-child(3) { text-align: left; }
  table.issues-index td:nth-child(4) { text-align: left; }
  table.issues-index td:nth-child(5) { text-align: center; }
  table.issues-index td:nth-child(6) { text-align: center; }
  table.issues-index td:nth-child(7) { text-align: left; }
  table.issues-index td:nth-child(5) span.no-pr { color: red; }
  @media (prefers-color-scheme: dark) {
     html {
        color: #ddd;
        background-color: black;
     }
     ins {
        background-color: #225522
     }
     del {
        background-color: #662222
     }
     a {
        color: #6af
     }
     a:visited {
        color: #6af
     }
     blockquote.note
     {
        background-color: rgba(255, 255, 255, .10)
     }
  }
</style>
</head>
<body>
<hr>
<p><em>This page is a snapshot from the LWG issues list, see the <a href="lwg-active.html">Library Active Issues List</a> for more information and the meaning of <a href="lwg-active.html#CD1">CD1</a> status.</em></p>
<h3 id="76"><a href="lwg-defects.html#76">76</a>. Can a <code>codecvt</code> facet always convert one internal character at a time?</h3>
<p><b>Section:</b> 28.3.4.2.5 <a href="https://wg21.link/locale.codecvt">[locale.codecvt]</a> <b>Status:</b> <a href="lwg-active.html#CD1">CD1</a>
 <b>Submitter:</b> Matt Austern <b>Opened:</b> 1998-09-25 <b>Last modified:</b> 2016-01-28</p>
<p><b>Priority: </b>Not Prioritized
</p>
<p><b>View all other</b> <a href="lwg-index.html#locale.codecvt">issues</a> in [locale.codecvt].</p>
<p><b>View all issues with</b> <a href="lwg-status.html#CD1">CD1</a> status.</p>
<p><b>Discussion:</b></p>
<p>This issue concerns the requirements on classes derived from
<code>codecvt</code>, including user-defined classes. What are the
restrictions on the conversion from external characters
(e.g. <code>char</code>) to internal characters (e.g. <code>wchar_t</code>)?
Or, alternatively, what assumptions about <code>codecvt</code> facets can
the I/O library make? </p>

<p>The question is whether it's possible to convert from internal
characters to external characters one internal character at a time,
and whether, given a valid sequence of external characters, it's
possible to pick off internal characters one at a time. Or, to put it
differently: given a sequence of external characters and the
corresponding sequence of internal characters, does a position in the
internal sequence correspond to some position in the external
sequence? </p>

<p>To make this concrete, suppose that <code>[first, last)</code> is a
sequence of <i>M</i> external characters and that <code>[ifirst,
ilast)</code> is the corresponding sequence of <i>N</i> internal
characters, where <i>N &gt; 1</i>. That is, <code>my_encoding.in()</code>,
applied to <code>[first, last)</code>, yields <code>[ifirst,
ilast)</code>. Now the question: does there necessarily exist a
subsequence of external characters, <code>[first, last_1)</code>, such
that the corresponding sequence of internal characters is the single
character <code>*ifirst</code>?
</p>

<p>(A &quot;no&quot; answer would mean that
<code>my_encoding</code> translates sequences only as blocks. There's a
sequence of <i>M</i> external characters that maps to a sequence of
<i>N</i> internal characters, but that external sequence has no
subsequence that maps to <i>N-1</i> internal characters.) </p>
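<p>The question can be phrased as a search over prefixes of the external
sequence. The following is a minimal sketch, not part of the issue text:
<code>hex_in_codecvt</code> is a hypothetical toy facet (two hex digits per
internal character) and <code>first_char_prefix</code> is a hypothetical
helper that looks for the shortest prefix <code>[first, first + k)</code>
that <code>in()</code> converts to exactly one internal character.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical toy encoding: each internal char is stored externally as two
// upper-case hex digits (for example "41" for 'A'). Stateless and 2-to-1.
class hex_in_codecvt : public std::codecvt<char, char, std::mbstate_t> {
    static int digit(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
protected:
    result do_in(state_type&, const char* from, const char* from_end,
                 const char*& from_next, char* to, char* to_end,
                 char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (from_end - from_next < 2) return partial;  // half a pair
            if (to_next == to_end) return partial;         // output is full
            const int hi = digit(from_next[0]);
            const int lo = digit(from_next[1]);
            if (hi < 0 || lo < 0) return error;
            *to_next++ = static_cast<char>(hi * 16 + lo);
            from_next += 2;
        }
        return ok;
    }
};

// Returns the length k of the shortest prefix [first, first + k) that in(),
// started from a fresh state, converts completely into exactly one internal
// character with result ok. Returns 0 if no such prefix exists.
template <class Facet>
std::size_t first_char_prefix(const Facet& cvt,
                              const typename Facet::extern_type* first,
                              const typename Facet::extern_type* last) {
    assert(first <= last);
    const std::size_t m = static_cast<std::size_t>(last - first);
    for (std::size_t k = 1; k <= m; ++k) {
        typename Facet::state_type state = typename Facet::state_type();
        typename Facet::intern_type out[4];  // spare room exposes extra output
        const typename Facet::extern_type* from_next = nullptr;
        typename Facet::intern_type* to_next = nullptr;
        const std::codecvt_base::result r =
            cvt.in(state, first, first + k, from_next, out, out + 4, to_next);
        if (r == std::codecvt_base::ok && from_next == first + k &&
            to_next == out + 1)
            return k;
    }
    return 0;
}
```

<p>For the external sequence <code>"4142"</code> the helper finds a prefix of
length 2, so the answer for this toy facet is "yes". A facet that translates
only whole blocks would make the helper return 0.</p>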

<p>Some of the wording in the standard, such as the description of
<code>codecvt::do_max_length</code> (28.3.4.2.5.3 <a href="https://wg21.link/locale.codecvt.virtuals">[locale.codecvt.virtuals]</a>,
paragraph 11) and <code>basic_filebuf::underflow</code> (31.10.3.5 <a href="https://wg21.link/filebuf.virtuals">[filebuf.virtuals]</a>, paragraph 3) suggests that it must always be
possible to pick off internal characters one at a time from a sequence
of external characters. However, this is never explicitly stated one
way or the other. </p>

<p>This issue seems (and is) quite technical, but it is important if
we expect users to provide their own encoding facets. This is an area
where the standard library calls user-supplied code, so a well-defined
set of requirements for the user-supplied code is crucial. Users must
be aware of the assumptions that the library makes. This issue affects
positioning operations on <code>basic_filebuf</code>, unbuffered input,
and several of <code>codecvt</code>'s member functions. </p>


<p id="res-76"><b>Proposed resolution:</b></p>
<p>Add the following text as a new paragraph, following 28.3.4.2.5.3 <a href="https://wg21.link/locale.codecvt.virtuals">[locale.codecvt.virtuals]</a> paragraph 2:</p>

<blockquote>
<p>A <code>codecvt</code> facet that is used by <code>basic_filebuf</code>
(31.10 <a href="https://wg21.link/file.streams">[file.streams]</a>) must have the property that if</p>
<pre>
    do_out(state, from, from_end, from_next, to, to_lim, to_next)
</pre>
<p>would return <code>ok</code>, where <code>from != from_end</code>, then </p>
<pre>
    do_out(state, from, from + 1, from_next, to, to_lim, to_next)
</pre>
<p>must also return <code>ok</code>, and that if</p>
<pre>
    do_in(state, from, from_end, from_next, to, to_lim, to_next)
</pre>
<p>would return <code>ok</code>, where <code>to != to_lim</code>, then</p>
<pre>
    do_in(state, from, from_end, from_next, to, to + 1, to_next)
</pre>
<p>must also return <code>ok</code>.  [<i>Footnote:</i> Informally, this
means that <code>basic_filebuf</code> assumes that the mapping from
internal to external characters is 1 to N: a <code>codecvt</code> that is
used by <code>basic_filebuf</code> must be able to translate characters
one internal character at a time.  <i>--End Footnote</i>]</p>
</blockquote>

<p><i>[Redmond: Minor change in proposed resolution.  Original
proposed resolution talked about "success", with a parenthetical
comment that success meant returning <code>ok</code>.  New wording
removes all talk about "success", and just talks about the
return value.]</i></p>
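<p>The <code>do_out</code> half of this requirement can be spot-checked for
one internal sequence. The sketch below is illustrative only and rests on
assumptions: <code>hex_out_codecvt</code> is a hypothetical toy facet,
<code>out_one_at_a_time</code> is a hypothetical helper, and the output buffer
is sized from <code>max_length()</code> on the assumption that no unshift
sequence is needed. The <code>do_in</code> half is deliberately left out.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <vector>

// Hypothetical toy facet: each internal char is written as two hex digits.
class hex_out_codecvt : public std::codecvt<char, char, std::mbstate_t> {
protected:
    result do_out(state_type&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        static const char digits[] = "0123456789ABCDEF";
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_end - to_next < 2) return partial;  // no room for a pair
            const unsigned char c = static_cast<unsigned char>(*from_next++);
            *to_next++ = digits[c >> 4];
            *to_next++ = digits[c & 15];
        }
        return ok;
    }
    int do_max_length() const noexcept override { return 2; }
};

// Checks one instance of the do_out property: if out() over [from, from_end)
// returns ok, then out() over [from, from + 1), started from an equal state
// and given the same output range, must return ok as well.
template <class Facet>
bool out_one_at_a_time(const Facet& cvt,
                       const typename Facet::intern_type* from,
                       const typename Facet::intern_type* from_end) {
    typedef typename Facet::intern_type intern_type;
    typedef typename Facet::extern_type extern_type;
    typedef typename Facet::state_type state_type;
    assert(from != from_end);  // the property only covers non-empty input

    const std::size_t n = static_cast<std::size_t>(from_end - from);
    std::vector<extern_type> buf(n * static_cast<std::size_t>(cvt.max_length()));
    extern_type* const to = buf.data();
    extern_type* const to_lim = to + buf.size();
    const intern_type* from_next = nullptr;
    extern_type* to_next = nullptr;

    state_type whole = state_type();
    if (cvt.out(whole, from, from_end, from_next, to, to_lim, to_next) !=
        std::codecvt_base::ok)
        return true;  // premise not met, so nothing is required

    state_type single = state_type();
    return cvt.out(single, from, from + 1, from_next, to, to_lim, to_next) ==
           std::codecvt_base::ok;
}
```

<p>Calling <code>out_one_at_a_time(cvt, text, text + 2)</code> with a
<code>hex_out_codecvt</code> and the internal sequence <code>"Hi"</code>
yields <code>true</code>. A passing check for one sequence does not prove the
property in general; it only fails to find a counterexample.</p>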




<p><b>Rationale:</b></p>

  <p>The proposed resolution says that conversions can be performed one
  internal character at a time.  This rules out some encodings that
  would otherwise be legal.  The alternative answer would mean there
  would be some internal positions that do not correspond to any
  external file position.</p>
  <p>
  An example of an encoding that this rules out is one where the
  <code>internT</code> and <code>externT</code> are of the same type, and
  where the internal sequence <code>c1 c2</code> corresponds to the
  external sequence <code>c2 c1</code>.
  </p>
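  <p>A minimal sketch of such an encoding follows, assuming a hypothetical
  facet <code>swap_codecvt</code> derived from
  <code>std::codecvt&lt;char, char, std::mbstate_t&gt;</code> and a
  hypothetical helper <code>out_result</code>. It is one possible way to
  write the pair-swapping encoding, not a reference implementation.</p>

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// Hypothetical facet for a pair-swapping encoding: internT and externT are
// both char, and each internal pair "c1 c2" is stored externally as "c2 c1".
class swap_codecvt : public std::codecvt<char, char, std::mbstate_t> {
    // Shared by both directions, since swapping a pair is its own inverse.
    static result swap_pairs(const char* from, const char* from_end,
                             const char*& from_next, char* to, char* to_end,
                             char*& to_next) {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            // A pair is translated only as a block: both input characters
            // and both output slots must be available.
            if (from_end - from_next < 2 || to_end - to_next < 2)
                return partial;
            to_next[0] = from_next[1];
            to_next[1] = from_next[0];
            from_next += 2;
            to_next += 2;
        }
        return ok;
    }
protected:
    result do_in(state_type&, const char* from, const char* from_end,
                 const char*& from_next, char* to, char* to_end,
                 char*& to_next) const override {
        return swap_pairs(from, from_end, from_next, to, to_end, to_next);
    }
    result do_out(state_type&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        return swap_pairs(from, from_end, from_next, to, to_end, to_next);
    }
};

// Converts the internal sequence [from, from + count) with ample output
// space and returns the result code reported by out().
std::codecvt_base::result out_result(const swap_codecvt& cvt,
                                     const char* from, int count) {
    assert(count >= 0 && count <= 8);
    std::mbstate_t state = std::mbstate_t();
    char buf[8];
    const char* from_next = nullptr;
    char* to_next = nullptr;
    return cvt.out(state, from, from + count, from_next, buf, buf + 8, to_next);
}
```

  <p>With the internal sequence <code>"ab"</code>, <code>out_result(cvt,
  internal, 2)</code> returns <code>ok</code> while <code>out_result(cvt,
  internal, 1)</code> returns <code>partial</code>, so this facet does not
  have the <code>do_out</code> property in the proposed resolution.</p>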
  <p>It was generally agreed that <code>basic_filebuf</code> relies
  on this property: it was designed under the assumption that
  the external-to-internal mapping is N-to-1, and it is not clear
  that <code>basic_filebuf</code> is implementable without that 
  restriction.
  </p>
  <p>
  The proposed resolution is expressed as a restriction on
  <code>codecvt</code> when used by <code>basic_filebuf</code>, rather
  than a blanket restriction on all <code>codecvt</code> facets,
  because <code>basic_filebuf</code> is the only other part of the 
  library that uses <code>codecvt</code>.  If a user wants to define
  a <code>codecvt</code> facet that implements a more general N-to-M
  mapping, there is no reason to prohibit it, so long as the user
  does not expect <code>basic_filebuf</code> to be able to use it.
  </p>





</body>
</html>
