<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Adapting to Unicode</title>
<style type="text/css">
 ins {background-color:#A0FFA0}
 del {background-color:#FFA0A0}
</style>
</head>

<body>

  <table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="579">
    <tr>
      <td width="153" align="left" valign="top">Document number:</td>
      <td width="426">N3336 = 12-0026</td>
    </tr>
    <tr>
      <td width="153" align="left" valign="top">Date:</td>
      <td width="426">
      <!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%Y-%m-%d" startspan -->2012-01-13<!--webbot bot="Timestamp" endspan i-checksum="12045" --></td>
    </tr>
    <tr>
      <td width="153" align="left" valign="top">Project:</td>
      <td width="426">Programming Language C++, Library Working Group</td>
    </tr>
    <tr>
      <td width="153" align="left" valign="top">Reply-to:</td>
      <td width="426">Beman Dawes &lt;bdawes at acm dot org&gt;</td>
    </tr>
  </table>

<h1>Adapting Standard Library Strings and I/O to a Unicode World</h1>

<h2>Table of contents</h2>

<p><a href="#Introduction">Introduction</a><br>
<a href="#String-encoding-conversion-safety-rationale">String conversion safety 
rationale</a><br>
<a href="#Design-paths-not-taken">Design paths not taken</a><br>
<a href="#Acknowledgements">Acknowledgements</a><br>
<a href="#Problems-and-solutions">String interoperability problems and proposed 
solutions</a><br>
&nbsp;&nbsp;&nbsp; <a href="#Problem-1">Problem 1: Strings don't interoperate if 
encoding differs</a><br>
&nbsp;&nbsp;&nbsp; <a href="#Problem-2">Problem 2: Strings don't interoperate 
with I/O streams if encoding differs</a><br>
&nbsp;&nbsp;&nbsp; <a href="#Problem-3">Problem 3: String  conversion 
iterators are not provided</a><br>
  </p>

<h2><a name="Introduction">Introduction</a></h2>

<p>This paper proposes additions to the C++ standard library/TR2 to ease use of 
Unicode and other string encodings. The motivation  is a series of problems with 
the C++11 standard library.</p>

<p>The full statement of the problems with proposed solutions is given below in
<a href="#Problems-and-solutions">String interoperability problems and proposed 
solutions</a>.</p>

<p>The 
C++03 versions of these problems were first encountered while 
providing Unicode support for the internationalization of commercial GIS software. 
The problems appeared again while working on the 
Boost Filesystem Library. These problems have become more apparent as compiler 
support for C++11's additional Unicode support has made it easier to write programs 
that run up against current limitations.</p>

<p>The proposed solutions are pure additions to  the C++11 standard library. No C++03 
or C++11 compliant code is broken or otherwise affected by the additions.</p>

<p>This paper does not provide working paper wording. WP wording will be 
provided if this proposal is accepted in principle.</p>

<p>A &quot;proof-of-concept&quot; implementation of the proposals (and more) is 
available at <a href="http://github.com/Beman/string-interoperability">
github.com/Beman/string-interoperability</a>.</p>

<h2><a name="String-encoding-conversion-safety-rationale">String  conversion safety 
rationale</a></h2>
  <p>The proposed solutions below make the assumption that it is safe to convert 
  a string of any type and encoding to another type and encoding. The rationale 
  for that assumption follows.</p>
<p>Conversion in either direction between UTF-8 encoded std::string and UTF-32 
encoded std::u32string is safe because it is defined by the Unicode Consortium 
and ISO/IEC 10646 as unambiguous and lossless.</p>
<p dir="ltr">Conversion in either direction between UTF-16 encoded 
std::u16string and UTF-32 encoded std::u32string is safe because it is defined 
by the Unicode Consortium and ISO/IEC 10646 as unambiguous and lossless.</p>
<p>Conversion in either direction between UTF-8 encoded std::string and UTF-16 
encoded std::u16string is safe because it can be composed from the two previous 
known safe conversions via an intermediate conversion to and from UTF-32 encoded 
char32_t characters.</p>
<p>The cases of <code>std::string</code> and <code>std::wstring</code> are more 
complex in that the encoding is not implied by the <code>char</code> and <code>
wchar_t</code> value types.&nbsp; It is not necessary, however, to  know 
the encoding of these string types in advance as long as it is known how to convert them to 
one of the known encoding string types. The C++11 standard library requires
<code>codecvt&lt;char32_t,char,mbstate_t&gt;</code>&nbsp; and <code>codecvt&lt;wchar_t,char,mbstate_t&gt;</code> 
facets, so such conversions are always possible using the standard library. In practice, library 
implementations have additional knowledge that allow such conversions to be 
more efficient than just calling codecvt facets. To ensure safety, error handling 
does need to be 
provided, however, as conversions involving some <code>char</code> and <code>wchar_t</code> 
encodings can encounter errors. See <a href="#Problem-3">Problem 3</a> below for 
some requested error handling approaches. </p>
<p>Implicit conversion between single characters of different types, as opposed 
to strings, may require multi-character sequences. No such single character implicit conversions 
are proposed here.</p>

<h2><a name="Design-paths-not-taken">Design paths not taken</a></h2>

<p>This proposal deals with C++11 <code>std::basic_string</code> and 
character types, and with their encodings. The deeper attributes of Unicode 
characters are not addressed. See Mathias Gaunard's <a href="http://mathias.gaunard.com/unicode/doc/html/">
  Unicode project</a> for an example of deeper Unicode support.</p>

<p>This proposal does not suggest providing a string type guaranteed to provide 
UTF-8 encoding.&nbsp; Although experiments with <code>typedef 
basic_string&lt;unsigned char&gt; u8string;</code> worked well, benefits would be 
speculative and not based on existing practice.</p>

<p>Another approach would be to provide a <code>utf8_char_traits</code> class 
and then <code>typedef 
basic_string&lt;char, utf8_char_traits&gt; u8string;</code>. This approach has 
not been investigated. </p>

<h2><a name="Acknowledgements">Acknowledgements</a></h2>

<p>Peter Dimov inspired the idea of string interoperability by arguing that the 
Boost Filesystem library should treat a path is a single 
  type (i.e. not a template) regardless of  character size 
  and encoding.</p>

<p>John Maddock's Unicode conversion iterators demonstrated an 
  easier-to-use, more efficient, and STL friendlier way to perform character 
type and encoding conversions as an alternative to standard library <code>
codecvt</code> facets.</p>

<p>The C++11 standard deserves acknowledgement as it provides the underlying language and library features that allow 
Unicode string 
interoperability:</p>

<ul>
  <li><code>char16_t</code> and <code>char32_t</code>&nbsp; provide Unicode 
  character types and null-terminated characters strings with guaranteed 
  encodings.</li>
  <li><code>std::u16string</code> and <code>std::u32string</code> provide 
  library support for Unicode character types and encodings.</li>
  <li><code>u8</code>, <code>u</code>, and <code>U</code> character and string literals ease 
  programming with Unicode character types and encodings.</li>
</ul>

  <h1 align="left">
  <a name="String-interoperability-problems-and-proposed-solutions">String interoperability 
  problems and proposed solutions</a></h1>

<h2><a name="Problem-1">Problem 1</a>: Strings don't interoperate if encoding differs</h2>

<h3>Discussion</h3>

<p>Standard library strings with different character encodings have different 
types that do not interoperate.</p>

<h3>Example</h3>
<blockquote>
  <pre>u16string s16 = u&quot;&#24744;&#22909;&#19990;&#30028;&quot;;
u32string s32;
s32 = s16;           // error!
s32 = &quot;foo&quot;;         // error!
s32 = s16.c_str();   // error!
s32.assign(s16.cbegin(), s16.cend()); // error!</pre>
  <pre>void f(const string&amp;);
f(s32);              //error!</pre>
</blockquote>
<p>The encoding of basic_string instantiations can be determined for 
the types under discussion. It is either implicit in the string's value_type or 
can be determined via the locale.</p>
<h3>Existing practice</h3>

  <p>Boost Filesystem Version 3, and the filesystem proposal before the C++ 
  committee, class <code>path</code> solves some of the string 
  interoperability problems, albeit in limited context. A function that is 
  declared like this:</p>
  <blockquote>
    <pre>void f(const path&amp;);</pre>
</blockquote>
<p>Can be called like this:</p>
<blockquote>
  <pre>f(&quot;Meow&quot;);
f(L&quot;Meow&quot;);
f(u8&quot;Meow&quot;);
f(u&quot;Meow&quot;);
f(U&quot;Meow&quot;);
// ... many additional variations such as basic_strings and iterators</pre>
</blockquote>
<p>This string interoperability support has been a success. It does, however, 
raise the question of why <code>std::basic_string</code> isn't providing the 
interoperability support. Users are misusing paths as general string containers 
because they provide interoperability. The string interoperability cat is out of the bag. 
The toothpaste is out of the tube.</p>
<p>See Boost.Filesystem V3 class path for 
an example of how such interoperability might be achieved.</p>
<p>Experience with Boost.Filesystem V3 class path has demonstrated that string 
interoperability brings a considerable simplification and improvement to 
internationalized user code, but that having to provide interoperability without 
the resolution of the issues presented here is a band-aid.</p>
<h3>Relationship with interoperability iterators</h3>
<p>String interoperability will be easier to specify, implement, and use if the 
string interoperability iterators <a href="#Problem-3">proposed below</a> are accepted.</p>

<h3>Proposed Solution</h3>

<p dir="ltr">The approach is to add additional <code>std::basic_string</code> 
overloads to functions most likely to benefit from interoperability. The 
overloads are in the form of function templates with sufficient restrictions on 
overload resolution participation (i.e. enable_if) that the existing C++11 
functions are always selected if the value type of the argument is the same as 
or convertible to the <code>std::basic_string</code> type's <code>value_type</code>. 
The semantics of the added signatures are the same as original signatures except 
that arguments of the template parameter type have their value converted to the 
type and encoding of <code>
basic_string::value_type</code>.</p>

<p>The <code>std::basic_string</code> functions given additional overloads are:</p>
<ul>
  <li>Each constructor, <code>operator=</code>, <code>operator+=</code>, <code>
append</code>, and <code>assign</code> signature.<br>
&nbsp;</li>
  <li> <code>template &lt;class T&gt; <b><i>unspecified_iterator</i></b> c_str()</code>, 
  returning an unspecified iterator with <code>value_type</code> of <code>T</code>.
  <br>
&nbsp;</li>
  <li dir="ltr">

<p dir="ltr"><code>begin()</code> and <code>end()</code>. Similar to <code>c_str()</code>. </p>
  </li>
</ul>

<p dir="ltr">To keep the number and complexity of overloads manageable, the 
proof-of-concept implementation does not provide any way to specify error 
handling policies, or <code>string</code> and <code>wstring</code> encoding. 
Every one of the added signatures does not need to be able to control error 
handling and encoding. The need is particularly rare in environments where UTF-8 
is the narrow character encoding and UTF-16 is the wide character encoding. A 
subset, possibly just <code>c_str()</code>, <code>begin()</code>, and <code>
end()</code>, with error handling and encoding parameters or arguments, suitable 
defaulted, may well be sufficient.</p>

<p dir="ltr">Because full implicit interoperability involves a lot of additional 
signatures be added to basic_string, it will certainly be appropriate to discuss 
limiting changes to the key areas of need. For example, constructors and <code>
operator=</code> are much more likely to need interoperability than <code>operator+=</code>, <code>
append</code>, or <code>assign</code> signatures.</p>

<h2 dir="ltr"><a name="Problem-2">Problem 2</a>: Strings don't interoperate with I/O streams if encoding differs</h2>

<h3>Discussion</h3>

<p>I/O streams do not accept strings of different character types </p>

<p>A &quot;Hello World&quot; program using a C++11 Unicode string literal illustrates 
this 
frustration:</p>

<blockquote>
  <pre>#include &lt;iostream&gt;
int main()
{
  std::cout &lt;&lt; U&quot;&#24744;&#22909;&#19990;&#30028;&quot;;   // error in C++11!
}</pre>
</blockquote>

<p>This code should 
&quot;just work&quot;, even though the type of <code>U&quot;&#24744;&#22909;&#19990;&#30028;&quot;</code> is <code>const 
char32_t*</code>, not <code>const char*</code>, as long as the encoding of <code>char</code> 
supports &#24744;&#22909;&#19990;&#30028;. Even if those characters are not 
supported by default encodings, alternatives like UTF-8 are available. </p>
<p>The code does &quot;just work&quot; with the proof-of-concept implementation of this 
proposal. On Linux, with default <code>char</code> encoding of UTF-8,  execution 
produces the expected &#24744;&#22909;&#19990;&#30028; output. On Windows, the 
console doesn't support full UTF-8, so the output can be piped to a file or to a 
program which does handle UTF-8 correctly. And, yes, that does work correctly 
with the proof-of-concept implementation of this proposal.</p>

<h3>Proposed Solution</h3>
<p>Add additional function templates to those in 27.7.3.6.4 [ostream.inserters.character],
<i>Character inserter function templates</i>, to cover the case where the 
argument character type differs from charT and is not <code>char</code>, <code>
signed char</code>, <code>unsigned char</code>, <code>const char*</code>, <code>
const signed char*</code>, or <code>const unsigned char*</code>.&nbsp; (The 
specified types are excluded because they are covered by existing signatures.) 
The semantics of the added signatures are the same as original signatures except 
that arguments shall be converted to the type and encoding of the stream.</p>
<p>Do the same for the character extractors in 27.7.2.2.3 [istream::extractors],
<i>basic_istream::operator&gt;&gt;</i>.</p>
<p>Do the same for the two <code>std::basic_string</code> inserters and 
extractors in 21.4.8.9 [string.io], <i>Inserters and extractors</i>.</p>

<h2><a name="Problem-3">Problem 3</a>: String conversion iterators are 
not provided</h2>
<h3>Discussion</h3>

<p>Conversion between character types and their encodings using current standard 
library facilities such as <code>std::codecvt</code>, <code>std::locale</code>, 
and 
<code>std::wstring_convert</code> has multiple problems:</p>

<ul>
  <li>Interfaces are overly complex, difficult to learn, difficult to use, and error prone.</li>
  <li>Given <b><i>n</i></b> encodings, it is necessary to 
  providing <b>n<sup>2</sup></b>&nbsp; rather than <b>2n</b> codecs. In other 
  words, two <code>codecvt</code> facets don't easily compose into a complete 
  conversion from one encoding to another. Such composition is existing practice in C libraries like ICU. 
  UTF-32 is the obvious choice for the common encoding to pass between codecs.</li>
  <li>Interfaces don't work well with generic programming 
  techniques, particularly iterators.</li>
  <li>Interfaces work at the level of entire strings rather than characters, 
  resulting in unnecessary creation of temporary strings, with 
  attendant memory allocations/deallocations.</li>
  <li>Interfaces entangle <code>std::locale</code> and code conversion, even when these 
  are implementation details that should be hidden from the application.</li>
  <li>Difficult to control error actions. Choices requested by users and 
  provided by other interfaces include:<ul>
  <li>Throw exception.</li>
  <li>Replace offending character with default character. </li>
  <li>Replace offending character with specified character. Motivating example: 
  Filesystem need to use a replacement character that is acceptable to the 
  Windows codepage. See Boost issue #5769.</li>
  </ul>
  </li>
  </ul>

<h3>Example</h3>

<p dir="ltr">The generalization of the <code>std::basic_string</code> function
<code>c_str</code> is:</p>
  <blockquote>
    <pre>template &lt;class T&gt; <i>unspecified_iterator</i> c_str() const;</pre>
  </blockquote>
  <p>Give a <code>std::string</code> named <code>s8</code>, this allows a user 
  to write <code>s8.c_str&lt;char16_t&gt;()</code> to obtain an iterator with a value 
  type of <code>char16_t</code>.&nbsp; To implement this function generically 
  using the current standard library would be difficult, and would involve the 
  creation of a temporary sting. The full implementation with the proposed 
  solution is simply:</p>
  <blockquote>
    <pre>template &lt;class T&gt;
converting_iterator&lt;const_iterator, value_type, by_range, T&gt; c_str() const
{
  return converting_iterator&lt;const_iterator,
    value_type, by_range, T&gt;(cbegin(), cend());
}</pre>
  </blockquote>
  <p dir="ltr">No temporary string is created, and none of the other problems 
  listed above are present either. The solution 
  is generally useful for user defined types, and not just for implementations 
  of the standard library.</p>
  <p dir="ltr">Other problems become easier to solve with <code>converting_iterator.</code> 
  For example, the Filesystem library's class <code>path</code> in
  <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3239.html">
  N3239</a> has many functions with an argument in the form <code>const 
  codecvt_type&amp; cvt=codecvt()</code> that could be eliminated by either direct 
  or indirect use of <code>converting_iterator.</code></p>
  <h3>Existing practice</h3>

<p>Boost Regex for many years has included a set of Unicode conversion 
iterators as an implementation detail. Although these do not provide composition, they do demonstrate the 
technique of using encoding conversion iterators to avoid creation of temporary 
strings.</p>

<h3>Proposed Solution</h3>

<p>This solution is based on the
<a href="http://github.com/Beman/string-interoperability/blob/master/include/boost/interop/iterator_adapter.hpp">
proof-of-concept implementation</a>. Input iterator requirements can probably be 
loosened to bidirectional, but that hasn't been tested yet.</p>

<p>The preliminaries begin with end-detection policy classes, since strings used 
null termination, size, or half-open ranges to determine the end of a sequence.</p>
  <blockquote>
    <pre>template &lt;class InputIterator&gt; class by_null;
template &lt;class InputIterator&gt; class by_size;
template &lt;class InputIterator&gt; class by_range;</pre>
  </blockquote>
  <p>Codec templates handle actual conversion to and from UTF-32. The primary 
  templates are:</p>
  <blockquote>
    <pre>template &lt;class InputIterator, class FromCharT, template&lt;class&gt; class EndPolicy&gt; 
  class to32_iterator;
template &lt;class InputIterator, class ToCharT&gt; 
  class from32_iterator;</pre>
  </blockquote>
  <p dir="ltr">The standard library would provide specializations for <code>char</code>,
  <code>wchar_t</code>, <code>char16_t</code>, and <code>char32_t</code>. 
  Presumably users could provide specializations for UDTs, but that hasn't been 
  tested yet. The <code>char</code> and <code>wchar_t</code> specializations 
  provide mechanisms to select the encoding. Since this is a new component the
  <code>char</code> default encoding could be UTF-8 rather than locale based and 
  no existing code would be broken.</p>
  <p dir="ltr">The actual <code>converting_iterator</code> primary template is 
  simply:</p>
  <blockquote>
    <pre>template &lt;class InputIterator, class FromCharT, template&lt;class&gt; class EndPolicy,
          class ToCharT&gt; 
class converting_iterator
  : public from32_iterator&lt;to32_iterator&lt;InputIterator, FromCharT, EndPolicy&gt;,
      ToCharT&gt;
{
public:
  using from32_iterator::from32_iterator;
};</pre>
  </blockquote>
  <p>Specializations may be provided, but aren't required. The proof-of-concept 
  implementation doesn't use inherited constructors because of lack of compiler 
  support.</p>
  <hr>

</body>

</html>