<html>
<head>
<meta charset="UTF-8">
<title>P2491R0: Text encodings follow-up</title>

<style type="text/css">
  ins { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
  .new { text-decoration:none; background-color:#D0FFD0 }
  del { text-decoration:line-through; background-color:#FFA0A0 }  
  strong { font-weight: inherit; color: #2020ff }
  table, td, th { border: 1px solid black; border-collapse:collapse; padding: 5px }
</style>
</head>

<body>
ISO/IEC JTC1 SC22 WG21 P2491R0<br/>
Author: Jens Maurer<br/>
Target audience: SG16, LEWG<br/>
2021-11-15<br/>

<h1>P2491R0: Text encodings follow-up</h1>

<h2>1 Abstract</h2>

This paper discusses some design decisions of
P1885 "Naming Text Encodings to Demystify Them" by Corentin Jabot
and Peter Brett that, in the view of the author, go in the wrong
direction on a number of seemingly small but crucial points.
<p>
In short,
<ul>
  <li>P1885 re-uses the encoding "UTF16" for a different purpose,
  which causes confusion.</li>
  <li>P1885 cannot properly represent the UCS-2 wide encoding used on
    Microsoft Windows before the switch to UTF-16.</li>
  <li>P1885 attempts to specify object representation, which breaks an
  important C++ abstraction barrier.</li>
</ul>

<h2>2 Paper history</h2>

<h3>R0: initial revision</h3>

<h2>3 Use cases for the <code>std::text_encoding</code> facility</h2>

<ul>
<li>Describe the encoding of text transmitted over the network or
stored in a file.</li>
<li>Describe the encoding for C++ ordinary and wide string literals.</li>
<li>Describe the encoding of the C++ environment (e.g. environment
variables or the console).</li>
<li>Facilitate conversion between encodings (an actual conversion
facility is not in scope).</li>
</ul>
  
In all of these cases, the character set (i.e. the set of individual
characters) supported in a specific context is indirectly defined by
the encoding, but not explicitly specified.

<h2>4 Building blocks</h2>

Text encodings are used in a variety of situations:
<ul>
  <li>Identifying the encoding of text files or network streams</li>
  <li>Identifying the C++ ordinary and wide literal encodings</li>
  <li>Identifying the environment (e.g. terminal) encoding</li>
</ul>

<h3>4.1 C++ ordinary and wide literal encodings</h3>

[lex.charset] specifies in the current working draft:

<blockquote>
A <em>code unit</em> is an integer value of character type
(6.8.2). Characters in a <em>character-literal</em> [...] or in
a <em>string-literal</em> are encoded as a sequence of one or more
code units [...]; this is termed the respective literal
encoding. The <em>ordinary literal encoding</em> is the encoding
applied to an ordinary character or string literal. The
<em>wide literal encoding</em> is the encoding applied to a wide
character or string literal.
</blockquote>

Then, [lex.string] p10.1 specifies

<blockquote>
The sequence of characters denoted by each contiguous sequence
of <em>basic-s-char</em>s, <em>r-char</em>s, <em>simple-escape-sequence</em>s
(5.13.3), and <em>universal-character-name</em>s (5.3) is encoded to a
code unit sequence using the
<em>string-literal</em>’s associated character encoding.
</blockquote>

Thus, an encoding for ordinary and wide literals in C++ relates a
sequence of characters with a sequence of integer values of the
respective character type (<code>char</code> or <code>wchar_t</code>).
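<p>
As an illustration only, the following sketch lists the code units of
an ordinary or wide string as integer values. The helper
name <code>code_units</code> is hypothetical and not part of any
proposal; the values are shown as unsigned integers purely for
readability.

```cpp
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical helper: returns the code units of a string as integer values.
// Works for char (ordinary encoding) and wchar_t (wide encoding) alike.
template <class CharT>
std::vector<unsigned long long> code_units(std::basic_string_view<CharT> text) {
    std::vector<unsigned long long> values;
    for (CharT c : text) {
        // Go through the unsigned counterpart so that a signed character
        // type does not sign-extend when widened.
        values.push_back(static_cast<std::make_unsigned_t<CharT>>(c));
    }
    return values;
}
```

For members of the basic literal character set, each character yields
exactly one code unit in either encoding; for other characters, the
number and values of the code units depend on the implementation's
literal encodings.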

<h3>4.2 IANA list of character sets</h3>

The IANA (Internet Assigned Numbers Authority) maintains a registry of
encodings (called "character sets") at
<a href="https://www.iana.org/assignments/character-sets/character-sets.xhtml">https://www.iana.org/assignments/character-sets/character-sets.xhtml</a>,
as instigated
by <a href="https://www.rfc-editor.org/rfc/rfc2978.html">RFC2978</a>.
<p>
As described by RFC2978:

<blockquote>
The term "charset" (referred to as a "character set" in previous
versions of this document) is used here to refer to a method of
converting a sequence of octets into a sequence of characters.
</blockquote>

Thus, an encoding in the IANA registry relates a sequence of octets
with a sequence of characters.

<h3>4.3 Unicode</h3>

Unicode provides the concept of an encoding form for the relationship
between a sequence of characters (specifically, a sequence of code
points) and a sequence of integer values.  Unicode further provides
the concept of an encoding scheme for the relationship between a
sequence of characters and a sequence of octets.
<p>
Regrettably, the specified encoding forms and encoding schemes have
overlapping naming; "UTF-16" refers both to an encoding form and an
encoding scheme.

<h4>UTF-8</h4>

ISO 10646:2020 section 10.2 specifies the encoding form as follows:

<blockquote>
UTF-8 is the UCS encoding form that assigns each UCS scalar value to
an octet sequence of one to four octets, as specified in table 2.
</blockquote>

The encoding scheme is defined as follows in section 11.2:

<blockquote>
The UTF-8 encoding scheme serializes a UTF-8 code unit sequence in
exactly the same order as the code unit sequence itself.
</blockquote>

Thus, for UTF-8, the code units are octets and those octets also
constitute the encoding scheme.  This encoding does not depend on
endianness (byte order in the object representation of an integer) at
all.
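<p>
A minimal sketch of the UTF-8 encoding form, assuming the well-known
bit layout of one to four octets per scalar value. The function
name <code>encode_utf8</code> is chosen for illustration only.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Illustrative sketch: encodes one UCS scalar value as 1 to 4 UTF-8 octets.
std::vector<std::uint8_t> encode_utf8(char32_t c) {
    // Surrogate code points and values above U+10FFFF are not scalar values.
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        throw std::invalid_argument("not a UCS scalar value");

    std::vector<std::uint8_t> out;
    if (c < 0x80) {
        out.push_back(static_cast<std::uint8_t>(c));
    } else if (c < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (c >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
    } else if (c < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (c >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (c >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
    }
    return out;
}
```

Because each code unit is a single octet, the result can be serialized
as is; no byte order decision is involved.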

<h4>UTF-16</h4>

ISO 10646:2020 section 10.3 specifies the encoding form called
"UTF-16" as follows:

<blockquote>
UTF-16 is the UCS encoding form that assigns each UCS scalar value to
a sequence of one to two unsigned 16-bit code units, as specified in
table 4.
</blockquote>
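<p>
A minimal sketch of the UTF-16 encoding form, producing 16-bit code
units as integer values and making no statement about their layout in
memory. The function name <code>encode_utf16</code> is chosen for
illustration only.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Illustrative sketch: encodes one UCS scalar value as one or two
// unsigned 16-bit code units (the UTF-16 encoding form).
std::vector<std::uint16_t> encode_utf16(char32_t c) {
    // Surrogate code points and values above U+10FFFF are not scalar values.
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        throw std::invalid_argument("not a UCS scalar value");

    if (c < 0x10000)
        return {static_cast<std::uint16_t>(c)};

    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const char32_t offset = c - 0x10000;
    return {static_cast<std::uint16_t>(0xD800 | (offset >> 10)),
            static_cast<std::uint16_t>(0xDC00 | (offset & 0x3FF))};
}
```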

The encoding scheme called "UTF-16" is specified in section 11.5 as follows:

<blockquote>
The UTF-16 encoding scheme serializes a UTF-16 code unit sequence by
ordering octets in a way that either the less significant octet
precedes or follows the more significant octet.  In the UTF-16
encoding scheme, the initial signature read as &lt;FE FF> indicates that
the more significant octet precedes the less significant octet,
and &lt;FF FE> the reverse. The signature is not part of the textual
data.  In the absence of signature, the octet order of the UTF-16
encoding scheme is that the more significant octet precedes the less
significant octet.
</blockquote>

The "initial signature" is otherwise known as a byte order mark
(BOM).
<p>

The
<a href="https://www.unicode.org/versions/Unicode14.0.0/">Unicode
standard version 14.0.0</a> specifies in section 3.10:

<blockquote>
UTF-16 encoding scheme: The Unicode encoding scheme that serializes a
UTF-16 code unit sequence as a byte sequence in either big-endian or
little-endian format.
<p>
[...]
<p>
In the UTF-16 encoding scheme, an initial byte sequence corresponding
to U+FEFF is interpreted as a byte order mark; it is used to
distinguish between the two byte orders. An initial byte sequence
&lt;FE FF> indicates big-endian order, and an initial byte sequence
&lt;FF FE> indicates little-endian order. The BOM is not considered
part of the content of the text.
<p>
The UTF-16 encoding scheme may or may not begin with a BOM. However,
when there is no BOM, and in the absence of a higher-level protocol,
the byte order of the UTF-16 encoding scheme is big-endian.
</blockquote>

Note the caveat of an undefined "higher-level protocol", which does
not exist in ISO 10646.

<p>
In either standard, there are also encoding schemes UTF-16LE and
UTF-16BE that do not interpret a signature (byte order mark) at all,
but use the given big-endian or little-endian layout unconditionally.
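<p>
To make the difference between encoding form and encoding scheme
concrete, the following sketch reads octets in the "UTF-16" encoding
scheme and yields UTF-16 code units. It follows the ISO 10646 rule
quoted above (big-endian in the absence of a signature) and does not
model a higher-level protocol. The function
name <code>read_utf16_scheme</code> is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch: deserializes the "UTF-16" encoding scheme (octets)
// into UTF-16 code units. A trailing odd octet is silently ignored here;
// a real decoder would report it as an error.
std::vector<std::uint16_t> read_utf16_scheme(const std::vector<std::uint8_t>& octets) {
    bool big_endian = true;  // ISO 10646 default when no signature is present
    std::size_t pos = 0;

    if (octets.size() >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            pos = 2;             // signature <FE FF>: big-endian, not part of the text
        } else if (octets[0] == 0xFF && octets[1] == 0xFE) {
            big_endian = false;  // signature <FF FE>: little-endian
            pos = 2;
        }
    }

    std::vector<std::uint16_t> units;
    for (; pos + 1 < octets.size(); pos += 2) {
        const std::uint8_t hi = big_endian ? octets[pos] : octets[pos + 1];
        const std::uint8_t lo = big_endian ? octets[pos + 1] : octets[pos];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}
```

A reader for UTF-16LE or UTF-16BE would skip the signature check and
use the fixed byte order unconditionally.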

<h4>UTF-32</h4>

The specification of UTF-32 is analogous to that of UTF-16.  There is no
provision for endianness other than big-endian or little-endian.

<h3>4.4 iconv</h3>

<h4>POSIX</h4>

<code>iconv</code> is a transcoding function specified by
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/">POSIX</a>:

<pre>
size_t iconv(iconv_t cd, char **restrict inbuf,
       size_t *restrict inbytesleft, char **restrict outbuf,
       size_t *restrict outbytesleft);
</pre>

The conversion descriptor (the first argument) is created using
<code>iconv_open</code>:

<pre>
iconv_t iconv_open(const char *tocode, const char *fromcode);
</pre>

with the following specification:

<blockquote>
The <code>iconv_open()</code> function shall return a conversion
descriptor that describes a conversion from the codeset specified by
the string pointed to by the <code>fromcode</code> argument to the
codeset specified by the string pointed to by the <code>tocode</code>
argument.  [...]
<p>
Settings of <code>fromcode</code> and <code>tocode</code> and their
permitted combinations are implementation-defined.
</blockquote>

<p>
As a non-normative note, the specification of <code>iconv</code> says:

<blockquote>
The objects indirectly pointed to by <code>inbuf</code>
and <code>outbuf</code> are not restricted to containing data that is
directly representable in the ISO C standard
language <code>char</code> data type. The type of <code>inbuf</code>
and <code>outbuf</code>, <code>char **</code>, does not imply that the
objects pointed to are interpreted as null-terminated C strings or
arrays of characters. Any interpretation of a byte
sequence that represents a character in a given character set encoding
scheme is done internally within the codeset converters. For example,
the area pointed to indirectly by inbuf and/or outbuf can contain all
zero octets that are not interpreted as string terminators but as
coded character data according to the respective codeset encoding
scheme. The type of the data
(<code>char</code>, <code>short</code>, <code>long</code>, and so on)
read or stored in the objects is not specified, but may be inferred
for both the input and output data by the converters determined by the
<code>fromcode</code> and <code>tocode</code> arguments
of <code>iconv_open()</code>.
</blockquote>

Thus,

<ul>
<li>A codeset value is understood to implicitly specify the object
  type for the character data, which may be different
  from <code>char</code>.</li>
<li>There is no normative requirement on which codesets must be
supported.</li>
</ul>
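<p>
The following sketch shows a typical call sequence. The codeset names
"UTF-8" and "UTF-16LE" are assumptions: POSIX leaves the supported
names implementation-defined, although common implementations such as
GNU iconv accept them. The wrapper
name <code>utf8_to_utf16le</code> is hypothetical. Note that the
16-bit code units of the result are delivered as pairs
of <code>char</code> objects.

```cpp
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative sketch: converts UTF-8 octets to UTF-16LE octets via POSIX iconv.
// Assumes the implementation accepts the codeset names "UTF-8" and "UTF-16LE".
std::vector<char> utf8_to_utf16le(std::string input) {
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");  // (tocode, fromcode)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Each UTF-8 octet expands to at most two octets of UTF-16.
    std::vector<char> output(input.size() * 2 + 2);

    char* in = input.data();
    std::size_t in_left = input.size();
    char* out = output.data();
    std::size_t out_left = output.size();

    const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed");

    output.resize(output.size() - out_left);  // keep only the octets written
    return output;
}
```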

<h4>GNU iconv</h4>

<a href="https://www.gnu.org/software/libc/manual/html_node/Generic-Charset-Conversion.html">GNU iconv</a>
implements POSIX iconv as follows:

<ul>
<li>It recognizes some of the dangers involved with the type-unsafety
of a <code>char*</code> parameter possibly pointing to objects of other
integer types.</li>
  
<li>The implementation of reading the "UTF-16" codeset uses the
platform endianness as the default in absence of a byte order mark,
which conforms to the Unicode encoding scheme "UTF-16" assuming that
the higher-level protocol is the platform endianness.  However, that
implementation does not conform to the ISO 10646 encoding scheme
"UTF-16", which requires big-endian encoding in the absence of a byte
order mark.</li>

<li>The implementation of writing the "UTF-16" codeset always writes a
byte order mark.</li>
</ul>

<h3>4.5 ICU</h3>

ICU also comes with an encoding converter; the list of supported aliases
is <a href="https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=">here</a>.


<h2>5 Special handling for UTF-16 and UTF-32</h2>

As described above, UTF-16 as an encoding scheme has the following different
interpretations:
<ul>
  <li>ISO 10646: BOM-aware, big-endian if absent</li>
  <li>Unicode 14: BOM-aware, higher-level protocol if absent, default big-endian</li>
  <li>IANA: BOM-aware, recommendation for big-endian if absent (see <a href="https://www.rfc-editor.org/rfc/rfc2781.html">RFC2781</a>)</li>
  <li>iconv: reading: BOM-aware, platform endianness if absent; writing: always writes a BOM</li>
  <li>C++ wide literal on Windows: wchar_t contains UTF-16 code units; encoding scheme is identical to UTF16LE</li>
</ul>

P1885 elects to map the correct UTF16LE/BE encoding scheme identifier
possibly returned from the
<code>std::text_encoding::wide_literal()</code> function to UTF16.
This is user-unfriendly for the following reasons:
<ul>
  <li>There is no differentiation of the encoding label between text
    that has arrived from the network and is tagged "UTF16" in the
    IANA sense (objects of type <code>char</code>, a BOM is expected
    to be present) vs. the UTF16LE/BE text that is produced from a
    wide literal (objects of type <code>wchar_t</code>, without a
    BOM).</li>
  <li><code>iconv</code> always creates a BOM when writing the UTF-16
    encoding.  If a user were to convert third-party text from
    e.g. UTF-8 to "UTF16" for use with <code>std::wstring</code> and
    string literals, BOMs are likely to end up in the middle of a
    string.</li>
</ul>
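<p>
The second point can be illustrated without calling <code>iconv</code>
itself. The function <code>convert_with_bom</code> below is a
hypothetical stand-in for any converter that always prefixes its
output with a BOM; <code>char16_t</code> is used instead
of <code>wchar_t</code> only to keep the sketch independent of the
platform.

```cpp
#include <string>

// Hypothetical stand-in for a converter that always emits a BOM first,
// as described above for iconv when writing "UTF-16".
// The input is assumed to consist of valid scalar values (not checked).
std::u16string convert_with_bom(const std::u32string& text) {
    std::u16string out(1, u'\uFEFF');  // leading byte order mark
    for (char32_t c : text) {
        if (c < 0x10000) {
            out.push_back(static_cast<char16_t>(c));
        } else {
            // Supplementary planes: emit a surrogate pair.
            const char32_t offset = c - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (offset >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (offset & 0x3FF)));
        }
    }
    return out;
}
```

Concatenating two such results, for
example <code>convert_with_bom(U"ab") + convert_with_bom(U"cd")</code>,
yields six code units with U+FEFF at indices 0 and 3; the second one
sits in the middle of the string.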

Further, <code>iconv</code>, presumably one of the premier consumers of
the object representation model (see below), was designed with the
understanding that the encoding name also conveys the object type for
each code unit (e.g. <code>char</code> or <code>int</code> or,
presumably, <code>wchar_t</code>).  This distinction is lost when both
network data (in a <code>char</code> buffer) and <code>wchar_t</code>
literals are expected to be described with the
same <code>std::text_encoding</code> value.
<p>
It is conceivable to introduce a new enumerator <code>UTF16NE</code>
that has the value of either of the existing
enumerators <code>UTF16LE</code> or <code>UTF16BE</code> (as
appropriate) and return that value
from <code>std::text_encoding::wide_literal()</code> on e.g. Windows
platforms.  This approach, as well as an earlier approach in P1885
that returns either UTF16LE or UTF16BE, but never UTF16, would
redundantly represent information about platform endianness in an
unrelated part of the standard.  Platform endianness should be handled
exclusively by the existing targeted facility <code>std::endian</code>
(see 26.5.8 [bit.endian]).

<p>
P1885 also elects to map UTF16LE/BE to UTF16 for the non-wide
<code>std::text_encoding::literal()</code>. Since <code>CHAR_BIT ==
8</code> is required for this function, the ordinary literal encoding
can never be UTF-16.  If it were, two consecutive <code>char</code>
elements would be used to represent a single code unit, but
some <code>char</code> elements might have the value 0 without
representing the null character. This is not a valid encoding per
[lex.charset]. The mapping is thus superfluous for the result of
<code>std::text_encoding::literal()</code>.
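<p>
The problem with zero-valued <code>char</code> elements can be seen in
a small sketch that stores the UTF-16LE octets for "AB" in
a <code>char</code> array. The array contents are written out by hand
for illustration; no implementation is claimed to produce such
literals.

```cpp
#include <cstddef>
#include <cstring>

// "AB" as UTF-16LE octets in char elements, followed by a two-octet terminator.
constexpr char utf16le_ab[] = {0x41, 0x00, 0x42, 0x00, 0x00, 0x00};

// Functions that expect a null-terminated ordinary string stop at the
// first zero-valued char, i.e. in the middle of the first code unit pair.
inline std::size_t apparent_length() {
    return std::strlen(utf16le_ab);
}
```

Here <code>apparent_length()</code> yields 1 although the array holds
two characters, because the second <code>char</code> has the value 0
without representing the null character.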
<p>
Everything said above also applies analogously to UTF-32.   

<h2>6 No special handling for UCS2</h2>

UCS-2 was effectively used on the Microsoft Windows (little-endian)
platform for a decade or so before that platform switched to UTF-16.
<p>
The usage situation is approximately the same as that for UTF-16, yet
P1885 does not even attempt to perform any mapping that could be
viewed as removing endianness assumptions from the name.  Adding to
that, the IANA registry appears to define "UCS2" as big-endian, but
does not make any allowance for a little-endian UCS-2 encoding scheme.
This leaves the relevant (admittedly outdated) Microsoft Windows
platforms conceptually unsupported.
  
<h2>7 Looking at the object representation breaks an abstraction barrier</h2>

The C++ object model carefully avoids considering the object
representation.  Where it must do so (e.g. for <code>bit_cast</code>),
lots of care needs to be applied to properly deal with padding bits,
partially uninitialized values, and other obscure situations.
<p>
I believe it is a mistake that P1885 talks about specifying the object
representation by applying an encoding scheme. The object
representation should never be in the focus of a user or the
specification of a user-facing facility.
<p>
The following alternative model avoids talking about the object
representation, naturally supports implementations with <code>CHAR_BIT
    > 8</code> or with <code>sizeof(wchar_t) == 1</code>, and allows proper
differentiation between literal encodings and network data.

<ul>
<li>The IANA encoding registry is understood to list encoding
  schemes, i.e. octet-based encodings.</li>
<li>An octet as provided by an IANA encoding is mapped to a single
  element of a string (i.e. a value of type <code>char</code>
  or <code>wchar_t</code>); each octet value of an IANA encoding is
  thus understood to be a code unit.</li>
<li>In addition to the IANA list, new encodings with new MIB values
   outside of the IANA-controlled number space are introduced to
   represent popular wide literal encodings, namely "WIDE.UTF16",
   "WIDE.UTF32", "WIDE.UCS2", and "WIDE.UCS4".
  </li>
</ul>

Observations:
<ul>
  <li> Per [intro.memory], a byte (equivalently, a <code>char</code>)
  is at least 8 bits and thus can hold the value of an octet.</li>
  
  <li>GNU <code>iconv</code> (and likely other implementations) does
  not currently support the "WIDE.*" names. This can reasonably be
  expected to change when the names are standardized.</li>

  <li>Some encodings do not make sense in certain situations.  For
  example, "UTF-8" is unlikely to make sense for
  a <code>std::wstring</code> if <code>sizeof(wchar_t) >
  1</code>.</li>

  <li>IANA-registered encodings can possibly be used
  for <code>wchar_t</code> strings if <code>sizeof(wchar_t) ==
  1</code>.</li>

  <li>The "WIDE.*" encodings can possibly be used
  for <code>char</code> strings if <code>CHAR_BIT >= 16</code>.  There is no
  difference regarding the string literal encoding approach between
  a <code>char</code> with 16 bits and a <code>wchar_t</code> with 16
  bits, regardless of whether the latter consists of one or two
  bytes.</li>

  <li>No endianness information is conveyed by the "WIDE.*" encodings.
  This probably makes them unsuitable for describing a network
  transmission, but allows the concerns of platform endianness to be
  properly separated from those of the code unit representation of
  string literals. (Historically, there have been platforms that are
  neither big-endian nor little-endian.)</li>
  
</ul>
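<p>
The octet-to-code-unit mapping of the alternative model can be
sketched as follows. The function
name <code>octets_to_code_units</code> is hypothetical; the point is
only that each octet of an IANA encoding becomes one string element,
whatever the size of the element type.

```cpp
#include <string>
#include <vector>

// Illustrative sketch: maps each octet of an IANA (octet-based) encoding
// to exactly one string element of type CharT (char or wchar_t).
// No object representation or byte order is involved.
template <class CharT>
std::basic_string<CharT> octets_to_code_units(const std::vector<unsigned char>& octets) {
    std::basic_string<CharT> out;
    out.reserve(octets.size());
    for (unsigned char octet : octets)
        out.push_back(static_cast<CharT>(octet));  // one octet, one code unit
    return out;
}
```

For example, the three UTF-8 octets E2 82 AC yield
a <code>std::wstring</code> of three elements under this mapping, which
also shows why "UTF-8" is an unlikely choice for wide strings
when <code>sizeof(wchar_t) > 1</code>.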

This proposal does not limit the choice of encoding for the platform,
but does make it possible to express all reasonable encodings even for fringe
(but valid) abstract machine parameters such as <code>CHAR_BIT >=
  16</code> or <code>sizeof(wchar_t) == 1</code>.

<h2>8 Wording plan</h2>

Relative to P1885, the wording should be adjusted as follows:

<ul>
  <li>Specify that the octets of the encoding schemes in the IANA
  registry are considered as code units for purposes
    of <code>std::text_encoding</code>.</li>

  <li>Specify additional encodings "WIDE.UTF16", "WIDE.UTF32",
   "WIDE.UCS2", and "WIDE.UCS4" with negative enumerator values to
   avoid conflict with present or future IANA assignments.</li>
  
  <li>Add a note that IANA encoding schemes cannot be returned from
  <code>std::text_encoding::(wide_)literal()</code>
  unless <code>sizeof(char_type) == 1</code>.</li>

  <li>Remove restrictions about <code>CHAR_BIT == 8</code>
  or <code>sizeof(wchar_t) > 1</code> (if any).</li>
  
  <li>Adjust existing conflicting wording as appropriate.</li>
</ul>
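<p>
As a rough sketch of the second item, the additional encodings could
appear as negative enumerators next to the IANA MIBenum values. The
type name <code>encoding_id</code>, the enumerator spellings, and the
specific negative values are placeholders for illustration; the actual
names and values would be settled in wording.

```cpp
#include <cstdint>

// Illustrative sketch only: names and negative values are placeholders.
enum class encoding_id : std::int_least32_t {
    // Additional wide encodings, outside the IANA-controlled number space.
    wide_ucs4  = -4,
    wide_ucs2  = -3,
    wide_utf32 = -2,
    wide_utf16 = -1,

    // A few existing IANA MIBenum values for comparison.
    UTF8    = 106,
    UCS2    = 1000,
    UCS4    = 1001,
    UTF16BE = 1013,
    UTF16LE = 1014,
    UTF16   = 1015,
    UTF32   = 1017,
};

// Negative values can never collide with present or future IANA assignments.
constexpr bool is_wide_extension(encoding_id id) {
    return static_cast<std::int_least32_t>(id) < 0;
}

static_assert(is_wide_extension(encoding_id::wide_utf16));
static_assert(!is_wide_extension(encoding_id::UTF16));
```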

</body>
</html>
