<!DOCTYPE html>
<html>
<head>
<meta charset="ASCII">
<style type="text/css">
body { color: #000000; background-color: #FFFFFF; }
del { text-decoration: line-through; color: #8B0040; }
ins { text-decoration: underline; color: #005100; }
p.example { margin-left: 2em; }
pre.example { margin-left: 2em; }
div.example { margin-left: 2em; }
code.extract { background-color: #F5F6A2; }
pre.extract { margin-left: 2em; background-color: #F5F6A2;
  border: 1px solid #E1E28E; }
p.function { }
.attribute { margin-left: 2em; }
.attribute dt { float: left; font-style: italic;
  padding-right: 1ex; }
.attribute dd { margin-left: 0em; }
blockquote.std { color: #000000; background-color: #F1F1F1;
  border: 1px solid #D1D1D1;
  padding-left: 0.5em; padding-right: 0.5em; }
blockquote.stddel { text-decoration: line-through;
  color: #000000; background-color: #FFEBFF;
  border: 1px solid #ECD7EC;
  padding-left: 0.5em; padding-right: 0.5em; }
blockquote.stdins { text-decoration: underline;
  color: #000000; background-color: #C8FFC8;
  border: 1px solid #B3EBB3; padding: 0.5em; }
table { border: 1px solid black; border-spacing: 0px;
  margin-left: auto; margin-right: auto; }
th { text-align: left; vertical-align: top;
  padding-left: 0.8em; border: none; }
td { text-align: left; vertical-align: top;
  padding-left: 0.8em; border: none; }
  
.highlight .hll { background-color: #ffffcc }
.highlight  { background: #ffffff; }
.highlight .c { color: #888888 } /* Comment */
.highlight .err { color: #FF0000; background-color: #FFAAAA } /* Error */
.highlight .k { color: #008800; font-weight: bold } /* Keyword */
.highlight .o { color: #333333 } /* Operator */
.highlight .ch { color: #888888 } /* Comment.Hashbang */
.highlight .cm { color: #888888 } /* Comment.Multiline */
.highlight .cp { color: #557799 } /* Comment.Preproc */
.highlight .cpf { color: #888888 } /* Comment.PreprocFile */
.highlight .c1 { color: #888888 } /* Comment.Single */
.highlight .cs { color: #cc0000; font-weight: bold } /* Comment.Special */
.highlight .gd { color: #A00000 } /* Generic.Deleted */
.highlight .ge { font-style: italic } /* Generic.Emph */
.highlight .gr { color: #FF0000 } /* Generic.Error */
.highlight .gh { color: #000080; font-weight: bold } /* Generic.Heading */
.highlight .gi { color: #00A000 } /* Generic.Inserted */
.highlight .go { color: #888888 } /* Generic.Output */
.highlight .gp { color: #c65d09; font-weight: bold } /* Generic.Prompt */
.highlight .gs { font-weight: bold } /* Generic.Strong */
.highlight .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.highlight .gt { color: #0044DD } /* Generic.Traceback */
.highlight .kc { color: #008800; font-weight: bold } /* Keyword.Constant */
.highlight .kd { color: #008800; font-weight: bold } /* Keyword.Declaration */
.highlight .kn { color: #008800; font-weight: bold } /* Keyword.Namespace */
.highlight .kp { color: #003388; font-weight: bold } /* Keyword.Pseudo */
.highlight .kr { color: #008800; font-weight: bold } /* Keyword.Reserved */
.highlight .kt { color: #333399; font-weight: bold } /* Keyword.Type */
.highlight .m { color: #6600EE; font-weight: bold } /* Literal.Number */
.highlight .na { color: #0000CC } /* Name.Attribute */
.highlight .nb { color: #007020 } /* Name.Builtin */
.highlight .nc { color: #BB0066; font-weight: bold } /* Name.Class */
.highlight .no { color: #003366; font-weight: bold } /* Name.Constant */
.highlight .nd { color: #555555; font-weight: bold } /* Name.Decorator */
.highlight .ni { color: #880000; font-weight: bold } /* Name.Entity */
.highlight .ne { color: #FF0000; font-weight: bold } /* Name.Exception */
.highlight .nf { color: #0066BB; font-weight: bold } /* Name.Function */
.highlight .nl { color: #997700; font-weight: bold } /* Name.Label */
.highlight .nn { color: #0e84b5; font-weight: bold } /* Name.Namespace */
.highlight .nt { color: #007700 } /* Name.Tag */
.highlight .nv { color: #996633 } /* Name.Variable */
.highlight .ow { color: #000000; font-weight: bold } /* Operator.Word */
.highlight .w { color: #bbbbbb } /* Text.Whitespace */
.highlight .mb { color: #6600EE; font-weight: bold } /* Literal.Number.Bin */
.highlight .mf { color: #6600EE; font-weight: bold } /* Literal.Number.Float */
.highlight .mh { color: #005588; font-weight: bold } /* Literal.Number.Hex */
.highlight .mi { color: #0000DD; font-weight: bold } /* Literal.Number.Integer */
.highlight .mo { color: #4400EE; font-weight: bold } /* Literal.Number.Oct */
.highlight .sb { background-color: #fff0f0 } /* Literal.String.Backtick */
.highlight .sc { color: #0044DD } /* Literal.String.Char */
.highlight .sd { color: #DD4422 } /* Literal.String.Doc */
.highlight .s2 { background-color: #fff0f0 } /* Literal.String.Double */
.highlight .se { color: #666666; font-weight: bold; background-color: #fff0f0 } /* Literal.String.Escape */
.highlight .sh { background-color: #fff0f0 } /* Literal.String.Heredoc */
.highlight .si { background-color: #eeeeee } /* Literal.String.Interpol */
.highlight .sx { color: #DD2200; background-color: #fff0f0 } /* Literal.String.Other */
.highlight .sr { color: #000000; background-color: #fff0ff } /* Literal.String.Regex */
.highlight .s1 { background-color: #fff0f0 } /* Literal.String.Single */
.highlight .ss { color: #AA6600 } /* Literal.String.Symbol */
.highlight .bp { color: #007020 } /* Name.Builtin.Pseudo */
.highlight .vc { color: #336699 } /* Name.Variable.Class */
.highlight .vg { color: #dd7700; font-weight: bold } /* Name.Variable.Global */
.highlight .vi { color: #3333BB } /* Name.Variable.Instance */
.highlight .il { color: #0000DD; font-weight: bold } /* Literal.Number.Integer.Long */
</style>
<title>P0372R0 - A type for utf-8 data</title>
</head>
<body>

<h1>A type for utf-8 data</h1>
P0372R0<br />
May 30, 2016<br />
Michael Spencer &lt;bigcheesegs@gmail.com&gt;<br />
Davide C. C. Italiano &lt;dccitaliano@gmail.com&gt;<br />
Audience: EWG<br />

<section>
<h2 id="intro">Introduction</h2>
<p>We propose adding a new distinct type, <code>char8_t</code>, to represent
UTF-8 encoded data.</p>
</section>

<section>
<h2 id="problem">Problem</h2>
<p>The C++ standard currently confuses the native narrow encoding and UTF-8
encoding by representing them both as the type <code>char</code>. This makes it
difficult to write portable programs that interact with both the native narrow
encoding (most of the standard library) and UTF-8 (external libraries and some
parts of the standard library).</p>

<h3>Examples</h3>
<ul>
<li><dl>
<dt>File names and paths</dt>
<dd>The standard library functions that take paths or file names provide no
overloads for UTF-8 encoded arguments.</dd>
</dl></li>
<li><dl>
<dt><code>codecvt</code></dt>
<dd>The <code>codecvt</code> specializations for <code>char16_t</code> and
<code>char32_t</code> treat <code>char</code> as UTF-8 and provide no way to
convert to or from the native narrow encoding.
</dd></dl></li>
<li><dl>
<dt><code>filesystem::path</code></dt>
<dd>The <code>u8string()</code> member function returns a
<code>std::string</code> with UTF-8 encoding, which the type system cannot
distinguish from a string in the native narrow encoding.</dd>
</dl></li>
</ul>
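<p>As a minimal sketch of the problem (the names
<code>count_utf8_code_points</code>, <code>utf8_e_acute</code> and
<code>latin1_copyright</code> are hypothetical and not part of any library),
consider a routine that assumes its <code>std::string</code> argument is UTF-8.
Native narrow data has the same type, so nothing stops a caller from passing it:</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by counting every byte that is
// not a UTF-8 continuation byte (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8.
inline std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0u) != 0x80u) {
            ++n;
        }
    }
    return n;
}

// Both functions return std::string, so the type system cannot tell the
// two encodings apart.
inline std::string utf8_e_acute() {
    return "\xC3\xA9";  // U+00E9 encoded as UTF-8 (two bytes)
}

inline std::string latin1_copyright() {
    return "\xA9";      // U+00A9 encoded as ISO 8859-1 (one byte)
}
```

<p>Here <code>count_utf8_code_points(utf8_e_acute())</code> yields 1, while
<code>count_utf8_code_points(latin1_copyright())</code> compiles without
complaint and yields 0, because the single ISO 8859-1 byte looks like a UTF-8
continuation byte. A distinct UTF-8 character type would turn the second call
into a compile-time error.</p>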
</section>

<section>
<h2 id="solution">Solution</h2>
<p>Add <code>char8_t</code> as a unique unsigned type with the same alignment,
value representation and object representation as <code>unsigned char</code>.
The intent is to allow explicit casts between <code>char*</code> and
<code>char8_t*</code> for interoperability when the encoding is known.</p>
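<p>The interoperability cast might look like the following sketch. It uses a
stand-in type, <code>utf8_char</code>, because the proposed built-in is not
available; <code>utf8_char</code>, <code>as_narrow</code> and
<code>utf8_byte_length</code> are hypothetical names chosen for illustration,
and the enumeration only approximates the proposed type.</p>

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for the proposed char8_t (assumption: the real type would be a
// built-in with the same size and alignment as unsigned char).
enum class utf8_char : unsigned char {};

static_assert(sizeof(utf8_char) == sizeof(char), "same size as char");
static_assert(alignof(utf8_char) == alignof(char), "same alignment as char");

// Explicit, visible conversion point: the caller asserts that the legacy
// API accepts UTF-8 bytes. Reading any object through const char* is
// permitted, so this direction of the cast is well defined.
inline const char* as_narrow(const utf8_char* s) {
    return reinterpret_cast<const char*>(s);
}

// Hypothetical helper: byte length of a null-terminated UTF-8 string,
// obtained by handing the bytes to a char-based legacy function.
inline std::size_t utf8_byte_length(const utf8_char* s) {
    return std::strlen(as_narrow(s));
}
```

<p>The cast is spelled out at the boundary with the <code>char</code>-based
API, so a reader can see where the encoding assumption is made. The reverse
direction, from <code>char*</code> to the UTF-8 type, would rely on the
guarantees the proposal gives <code>char8_t</code>; the stand-in enumeration
does not model them.</p>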

<p>Make <code>u8"..."</code> strictly a UTF-8 string literal with the type
<code>const char8_t[]</code>.</p>

<p>Make <code>u8'.'</code> strictly a UTF-8 character literal with the type
<code>char8_t</code>.</p>

<p>Make UTF-8 string literals convertible to narrow string literals.</p>

<p>Make UTF-8 character literals convertible to narrow character literals.</p>

<h3>Examples</h3>
<div class="highlight">
<div class="syntax">
<pre><span></span><span class="c1">// In all cases the string is UTF-8.</span>
<span class="k">const</span> <span class="kt">char8_t</span>  <span class="n">ua</span><span class="p">[]</span> <span class="o">=</span> <span class="n">u8</span><span class="s">&quot;&quot;</span><span class="p">;</span> <span class="c1">// OK</span>
<span class="k">const</span> <span class="kt">char</span>     <span class="n">ca</span><span class="p">[]</span> <span class="o">=</span> <span class="n">u8</span><span class="s">&quot;&quot;</span><span class="p">;</span> <span class="c1">// OK</span>
<span class="k">const</span> <span class="kt">char8_t</span> <span class="o">*</span><span class="n">u</span>   <span class="o">=</span> <span class="n">u8</span><span class="s">&quot;&quot;</span><span class="p">;</span> <span class="c1">// OK</span>
<span class="k">const</span> <span class="kt">char</span>    <span class="o">*</span><span class="n">c</span>   <span class="o">=</span> <span class="n">u8</span><span class="s">&quot;&quot;</span><span class="p">;</span> <span class="c1">// OK</span>

<span class="k">const</span> <span class="kt">char</span>    <span class="o">*</span><span class="n">e</span>   <span class="o">=</span> <span class="n">u</span><span class="p">;</span> <span class="c1">// ERROR - pointers to different types</span>

<span class="kt">void</span> <span class="nf">f</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>

<span class="n">f</span><span class="p">(</span><span class="n">u8</span><span class="s">&quot;&quot;</span><span class="p">);</span> <span class="c1">// OK</span>
<span class="n">f</span><span class="p">(</span><span class="n">u</span><span class="p">);</span> <span class="c1">// ERROR - pointers to different types</span>

<span class="kt">void</span> <span class="nf">o</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">o</span><span class="p">(</span><span class="k">const</span> <span class="kt">char8_t</span><span class="o">*</span><span class="p">);</span>

<span class="n">o</span><span class="p">(</span><span class="n">u8</span><span class="s">&quot;&quot;</span><span class="p">);</span> <span class="c1">// OK - calls const char8_t*</span>
<span class="n">o</span><span class="p">(</span><span class="n">u</span><span class="p">);</span> <span class="c1">// OK - calls const char8_t*</span>
<span class="n">o</span><span class="p">(</span><span class="s">&quot;&quot;</span><span class="p">);</span> <span class="c1">// OK - calls const char*</span>
<span class="n">o</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// OK - calls const char*</span>
</pre></div>

</div>
</section>


<section>
<h2 id="usage">Where will it be used?</h2>
<p>This proposal only adds the type and changes the behavior of
<code>u8""</code> and <code>u8''</code>. Future library proposals will use
<code>char8_t</code> and friends to fill in basic Unicode support for existing
parts of the standard library, such as:
</p>

<ul>
<li><code>u8string</code></li>
<li><code>basic_fstream</code> filename parameter</li>
<li><code>basic_ios</code> Unicode character types</li>
<li><code>filesystem::path</code> constructors from UTF-8</li>
</ul>
</section>

<section>
<h2 id="library">Why not a library implementation?</h2>
<p><code>char8_t</code> could be implemented as:</p>
<div class="highlight">
<div class="syntax"><pre><span></span><span class="k">enum</span> <span class="k">class</span> <span class="nc">char8_t</span> <span class="o">:</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="p">{};</span>
</pre></div></div>
<p>However, this would require including a header to use it, and would make the
definitions of <code>u8""</code> and <code>u8''</code> depend on the library. It
would also have different conversion behavior from <code>char16_t</code> and
<code>char32_t</code>.</p>
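<p>The difference in conversion behavior can be checked with type traits. In
this sketch the library-style type is named <code>lib_char8</code> (a
hypothetical name, chosen so the example does not depend on whether the
compiler already reserves <code>char8_t</code>), and the
<code>constexpr</code> variable names are likewise illustrative.</p>

```cpp
#include <type_traits>

// Library-style stand-in for the enumeration shown above.
enum class lib_char8 : unsigned char {};

// char16_t and char32_t are integral types: they convert implicitly to
// other arithmetic types.
constexpr bool char16_converts_to_int =
    std::is_convertible<char16_t, int>::value;
constexpr bool char32_converts_to_ulong =
    std::is_convertible<char32_t, unsigned long>::value;

// A scoped enumeration never converts implicitly to an integer.
constexpr bool lib_char8_converts_to_int =
    std::is_convertible<lib_char8, int>::value;

static_assert(char16_converts_to_int, "char16_t -> int is implicit");
static_assert(char32_converts_to_ulong, "char32_t -> unsigned long is implicit");
static_assert(!lib_char8_converts_to_int, "scoped enum -> int requires a cast");
static_assert(std::is_integral<char16_t>::value, "char16_t is an integral type");
static_assert(!std::is_integral<lib_char8>::value, "a scoped enum is not integral");
```

<p>With the built-in types an expression such as <code>c16 + 1</code>
compiles as written; with the enumeration the equivalent requires a
<code>static_cast</code> on every use.</p>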
</section>

<section>
<h2 id="compat">Compatibility</h2>
<p>This change loudly breaks any current usage of the identifier
<code>char8_t</code>: such code fails to compile rather than silently changing
meaning. All uses we found in open-source code were
<code>typedef</code>s to <code>char</code>, <code>unsigned char</code> or an
equivalent type from <code>&lt;cstdint&gt;</code>, and the surrounding code also
used <code>char{16,32}_t</code>.</p>

<p>This change also breaks code that relies on the type deduced from
<code>u8""</code> and <code>u8''</code>. We were not able to find any instances
of this in open-source code.</p>
</section>

</body>
</html>
