<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Issue 305: Default behavior of codecvt&lt;wchar_t, char, mbstate_t&gt;::length()</title>
<meta property="og:title" content="Issue 305: Default behavior of codecvt&lt;wchar_t, char, mbstate_t&gt;::length()">
<meta property="og:description" content="C++ library issue. Status: CD1">
<meta property="og:url" content="https://cplusplus.github.io/LWG/issue305.html">
<meta property="og:type" content="website">
<meta property="og:image" content="http://cplusplus.github.io/LWG/images/cpp_logo.png">
<meta property="og:image:alt" content="C++ logo">
<style>
  p {text-align:justify}
  li {text-align:justify}
  pre code.backtick::before { content: "`" }
  pre code.backtick::after { content: "`" }
  blockquote.note
  {
    background-color:#E0E0E0;
    padding-left: 15px;
    padding-right: 15px;
    padding-top: 1px;
    padding-bottom: 1px;
  }
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  table.issues-index { border: 1px solid; border-collapse: collapse; }
  table.issues-index th { text-align: center; padding: 4px; border: 1px solid; }
  table.issues-index td { padding: 4px; border: 1px solid; }
  table.issues-index td:nth-child(1) { text-align: right; }
  table.issues-index td:nth-child(2) { text-align: left; }
  table.issues-index td:nth-child(3) { text-align: left; }
  table.issues-index td:nth-child(4) { text-align: left; }
  table.issues-index td:nth-child(5) { text-align: center; }
  table.issues-index td:nth-child(6) { text-align: center; }
  table.issues-index td:nth-child(7) { text-align: left; }
  table.issues-index td:nth-child(5) span.no-pr { color: red; }
  @media (prefers-color-scheme: dark) {
     html {
        color: #ddd;
        background-color: black;
     }
     ins {
        background-color: #225522
     }
     del {
        background-color: #662222
     }
     a {
        color: #6af
     }
     a:visited {
        color: #6af
     }
     blockquote.note
     {
        background-color: rgba(255, 255, 255, .10)
     }
  }
</style>
</head>
<body>
<hr>
<p><em>This page is a snapshot from the LWG issues list, see the <a href="lwg-active.html">Library Active Issues List</a> for more information and the meaning of <a href="lwg-active.html#CD1">CD1</a> status.</em></p>
<h3 id="305"><a href="lwg-defects.html#305">305</a>. Default behavior of codecvt&lt;wchar_t, char, mbstate_t&gt;::length()</h3>
<p><b>Section:</b> 28.3.4.2.6 <a href="https://wg21.link/locale.codecvt.byname">[locale.codecvt.byname]</a> <b>Status:</b> <a href="lwg-active.html#CD1">CD1</a>
 <b>Submitter:</b> Howard Hinnant <b>Opened:</b> 2001-01-24 <b>Last modified:</b> 2016-01-28</p>
<p><b>Priority: </b>Not Prioritized
</p>
<p><b>View all other</b> <a href="lwg-index.html#locale.codecvt.byname">issues</a> in [locale.codecvt.byname].</p>
<p><b>View all issues with</b> <a href="lwg-status.html#CD1">CD1</a> status.</p>
<p><b>Discussion:</b></p>
<p>22.2.1.5/3 introduces codecvt in part with:</p>

<blockquote><p>
  codecvt&lt;wchar_t,char,mbstate_t&gt; converts between the native
  character sets for tiny and wide characters. Instantiations on
  mbstate_t perform conversion between encodings known to the library
  implementor.
</p></blockquote>

<p>But 22.2.1.5.2/10 describes do_length in part with:</p>

<blockquote><p>
  ... codecvt&lt;wchar_t, char, mbstate_t&gt; ... return(s) the lesser of max and 
  (from_end-from).
</p></blockquote>

<p>
The semantics of do_in and do_length are linked.  What one does must
be consistent with what the other does.  22.2.1.5/3 leads me to
believe that the vendor is allowed to choose the algorithm that
codecvt&lt;wchar_t,char,mbstate_t&gt;::do_in performs so that it makes
his customers happy on a given platform.  But 22.2.1.5.2/10 explicitly
says what codecvt&lt;wchar_t,char,mbstate_t&gt;::do_length must
return.  And thus indirectly specifies the algorithm that
codecvt&lt;wchar_t,char,mbstate_t&gt;::do_in must perform.  I believe
that this is not what was intended and is a defect.
</p>

<p>Discussion from the -lib reflector:

<br/>This proposal would have the effect of making the semantics of
all of the virtual functions in <code>codecvt&lt;wchar_t, char,
mbstate_t&gt;</code> implementation specified.  Is that what we want, or
do we want to mandate specific behavior for the base class virtuals
and leave the implementation specified behavior for the codecvt_byname
derived class?  The tradeoff is that former allows implementors to
write a base class that actually does something useful, while the
latter gives users a way to get known and specified---albeit
useless---behavior, and is consistent with the way the standard
handles other facets.  It is not clear what the original intention
was.</p>

<p>
Nathan has suggest a compromise: a character that is a widened version
of the characters in the basic execution character set must be
converted to a one-byte sequence, but there is no such requirement
for characters that are not part of the basic execution character set.
</p>


<p id="res-305"><b>Proposed resolution:</b></p>
<p>
Change 22.2.1.5.2/5 from:
</p>
<p>
The instantiations required in Table 51 (lib.locale.category), namely
codecvt&lt;wchar_t,char,mbstate_t&gt; and
codecvt&lt;char,char,mbstate_t&gt;, store no characters. Stores no more
than (to_limit-to) destination elements. It always leaves the to_next
pointer pointing one beyond the last element successfully stored.
</p>
<p>
to:
</p>
<p>
Stores no more than (to_limit-to) destination elements, and leaves the
to_next pointer pointing one beyond the last element successfully
stored.  codecvt&lt;char,char,mbstate_t&gt; stores no characters.
</p>

<p>Change 22.2.1.5.2/10 from:</p>

<blockquote><p>
-10- Returns: (from_next-from) where from_next is the largest value in
the range [from,from_end] such that the sequence of values in the
range [from,from_next) represents max or fewer valid complete
characters of type internT. The instantiations required in Table 51
(21.1.1.1.1), namely codecvt&lt;wchar_t, char, mbstate_t&gt; and
codecvt&lt;char, char, mbstate_t&gt;, return the lesser of max and
(from_end-from).
</p></blockquote>

<p>to:</p>

<blockquote><p>
-10- Returns: (from_next-from) where from_next is the largest value in 
the range [from,from_end] such that the sequence of values in the range 
[from,from_next) represents max or fewer valid complete characters of 
type internT. The instantiation codecvt&lt;char, char, mbstate_t&gt; returns 
the lesser of max and (from_end-from). 
</p></blockquote>

<p><i>[Redmond: Nathan suggested an alternative resolution: same as
above, but require that, in the default encoding, a character from the
basic execution character set would map to a single external
character.  The straw poll was 8-1 in favor of the proposed
resolution.]</i></p>




<p><b>Rationale:</b></p>
<p>The default encoding should be whatever users of a given platform
would expect to be the most natural.  This varies from platform to
platform.  In many cases there is a preexisting C library, and users
would expect the default encoding to be whatever C uses in the default
"C" locale.  We could impose a guarantee like the one Nathan suggested
(a character from the basic execution character set must map to a
single external character), but this would rule out important
encodings that are in common use: it would rule out JIS, for
example, and it would rule out a fixed-width encoding of UCS-4.</p>

<p><i>[Cura&ccedil;ao: fixed rationale typo at the request of Ichiro Koshida;
&quot;shift-JIS&quot; changed to &quot;JIS&quot;.]</i></p>







</body>
</html>
