<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css">
  p {text-align:justify}
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  blockquote.note
  {
   background-color:#E0E0E0;
   padding-left: 15px;
   padding-right: 15px;
   padding-top: 1px;
   padding-bottom: 1px;
  }
  div#refs p { padding-left: 32px; text-indent: -32px; }
</style>

<title>text_encoding::name() should never return null values</title>

</head>
<body>
<address style="text-align: left;">
Document number: P2862R0<br>
Date: 2023-05-07<br>
Audience: Library Evolution Working Group (Design), Library Working Group (Wording)<br>
Author: Daniel Kr&uuml;gler<br>
Reply-to: <a href="mailto:daniel.kruegler@gmail.com">Daniel Kr&uuml;gler</a>
</address>
<hr>
<h1 style="text-align: center;"><tt>text_encoding::name()</tt> should never return null values</h1>

<ul>
<li><a href="#Intro">Introduction</a></li>
<li><a href="#Discussion">Discussion</a></li>
<li><a href="#Rationale">Rationale</a></li>
<li><a href="#Implementation">Implementation</a></li>
<li><a href="#Proposed_resolution">Proposed Resolution</a></li>
<li><a href="#Akn">Acknowledgements</a></li>
<li><a href="#Bibliography">Bibliography</a></li>
</ul>

<h2><a name="Intro"></a>Introduction</h2>
<p>
This proposal suggests modifying one aspect of the most recent proposal <a href="https://wg21.link/p1885r12">P1885R12</a> 
("Naming Text Encodings to Demystify Them"), namely the part specifying that the <tt>name()</tt> member function
of <tt>text_encoding</tt> returns a null pointer value under certain circumstances.
</p>

<h2><a name="Discussion"></a>Discussion</h2>
<p>
<a href="https://wg21.link/p1885r12">P1885R12</a> introduces a highly useful text encoding facility. Among its
query functions it provides a <tt>name()</tt> member function that has a <tt>const char*</tt> result type, 
which returns the name of the text encoding.
<p/>
In certain circumstances no name exists, and the specification says that in this case the function shall
return a null pointer value.
<p/>
In the following I focus on the most recent wording update of P1885R12, which specifies the following invariants 
in [text.encoding.general]:
</p>
<blockquote style="border-left: 3px solid #ccc;padding-left: 15px;">
<p>
An object <tt>e</tt> of type <tt>text_encoding</tt> such that <tt>e.mib() == text_encoding::id::unknown</tt> is <tt>false</tt>
and <tt>e.mib() == text_encoding::id::other</tt> is <tt>false</tt> maintains the following invariants:
</p>
<ul>
<li><p><tt>e.name() == nullptr</tt> is <tt>false</tt>.</p></li>
<li><p><tt>e.mib() == text_encoding(e.name()).mib()</tt> is <tt>true</tt>.</p></li>
</ul>
</blockquote>
<p>
The question arises <em>why</em> the paper decided to use a <tt>nullptr</tt> result for the <tt>name()</tt>
function at all.
<p/>
This topic is mentioned in the paper and the only remaining trace 
<a href="https://wg21.link/p1885r12#page=19">of the discussion</a> is:
</p>
<blockquote>
<p>
<b>When constructed from the unknown mib, name returns a <tt>nullptr</tt> rather than an empty string.</b>
</p>
</blockquote>
<p>
The paper does not explain <em>why</em> returning a <tt>nullptr</tt> is preferred over, for example,
an empty string.
<p/>
This paper questions that particular part of the P1885 design with regard to null values and proposes 
to ensure that the <tt>name()</tt> member function of <tt>text_encoding</tt> <em>never</em> returns a null value.
</p>

<h2><a name="Rationale"></a>Rationale</h2>
<p>
The <a href="https://wg21.link/p1885r12">P1885R12</a> proposal emphasizes several times that the API it suggests
is intended to be compatible with C APIs related to text encodings, e.g. 
<a href="https://wg21.link/p1885r12#page=23">on page 23</a>:
</p>
<blockquote style="border-left: 3px solid #ccc;padding-left: 15px;">
<p>
One of the design goals is to be compatible with widely deployed libraries such as ICU and iconv,
which are, on most platforms, the defacto standards for text transformations, classification,
and transcoding. These are C APIs that expect null-terminated strings. [&hellip;]
EWG previously elected to use <tt>const char*</tt> in <tt>source_location</tt>, stack trace, etc.
</p>
</blockquote>
<p>
Often (but not always) such C APIs do not support null values for encoding names. For example, calling 
<tt>iconv_open(nullptr, "utf-8")</tt> typically leads to a segmentation fault.
<p/>
This is not restricted to C APIs. Even the following seemingly simple code lines will cause <b>undefined behaviour</b>,
assuming that <tt>te</tt> denotes a <tt>text_encoding</tt> value whose mib value is either 
<tt>text_encoding::id::unknown</tt> or <tt>text_encoding::id::other</tt> (and the provided name was empty in this latter case):
</p>
<blockquote><pre>
std::cout &lt;&lt; te.name();             <i>// Violates [ostream.inserters.character] p3</i>
std::format("Name: {}", te.name()); <i>// Violates [format.arg] p5</i>
""sv == te.name();                  <i>// Violates [string.view.cons] p2 since traits::length doesn't accept null values</i>
</pre></blockquote>
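<p>
Under the P1885R12 semantics, every such use would need a guard on the caller's side. The following is a minimal
sketch of such a guard, not part of either proposal: the helper name <tt>name_or_empty</tt> is hypothetical, plain
<tt>const char*</tt> values stand in for the result of <tt>te.name()</tt>, and C++17 is assumed.
</p>
<blockquote><pre>
#include &lt;string_view&gt;

<i>// Hypothetical helper: maps a possibly-null name to a string_view,</i>
<i>// so that the result can be compared, streamed, or formatted safely.</i>
constexpr std::string_view name_or_empty(const char* name) noexcept {
  return name ? std::string_view(name) : std::string_view();
}

<i>// A null pointer (no name under P1885R12) becomes an empty view.</i>
static_assert(name_or_empty(nullptr).empty());
<i>// A regular NTBS is passed through unchanged.</i>
static_assert(name_or_empty("UTF-8") == "UTF-8");
</pre></blockquote>
<p>
If <tt>name()</tt> always returned an NTBS, such a wrapper would be unnecessary, because the result could be
handed directly to any interface that expects an NTBS.
</p>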
<p>
Given our increased awareness of the need to reduce opportunities for undefined behaviour, this should alert us.
<p/>
In addition, existing practice in comparable APIs of the current working draft gives us some hints:
<p/>
Recently adopted types such as <tt>source_location</tt> <em>always</em>
return an NTBS from functions with a <tt>const char*</tt> result type. In particular, the specification of the 
<tt>function_name</tt> attribute says (emphasis mine):
</p>
<blockquote style="border-left: 3px solid #ccc;padding-left: 15px;">
<p>
A name of the current function such as in <tt>__func__</tt> (9.5.1 [dcl.fct.def.general]) if any, 
<b>an empty string otherwise</b>.
</p>
</blockquote>
<p>
For the new <tt>stacktrace</tt> facility, the wording finally agreed on for the <tt>description</tt> 
and <tt>source_file</tt> attributes uses <tt>std::string</tt> 
as result type, and it says that in all these cases an <em>empty</em> string should be returned if the 
corresponding information is not available (19.6.3.4 [stacktrace.entry.query] p2+p4).
<p/>
It might also be worth pointing out that for <tt>std::filesystem::path</tt> we likewise use empty
string content to denote "an empty path" as the degenerate case.
<p/>
The following edge case demonstrates that the current P1885R12 design choice can lead to an unexpected 
result from a user perspective:
<p/>
If the user creates a <tt>text_encoding</tt> object <tt>te</tt> from a valid character sequence <tt>enc</tt> 
denoting the encoding name, the resulting <tt>te.name()</tt> always satisfies <tt>te.name() == enc</tt>,
<em>except</em> when <tt>enc</tt> denotes an empty sequence: in this case the special empty-<tt>name_</tt> 
rule of the <i>Returns</i>: element of <tt>name()</tt> turns the empty name into a <tt>nullptr</tt>,
and the comparison has undefined behaviour.
<p/>
In the opinion of the author of this paper, it is advantageous to choose an empty string (instead of a <tt>nullptr</tt>
result) as the degenerate value for <tt>text_encoding::name()</tt> for the following reasons:
</p>
<ol>
<li><p>It is consistent with other C API-compatible parts of the C++ standard library that denote lack of information.</p></li>
<li><p>It prevents user code from unintentionally causing undefined behaviour when invoking typical C APIs related to
text encodings.</p></li>
<li><p>It relieves implementors from special-casing the return value of the <tt>name</tt> member and 
similarly reduces the need to special-case the result of <tt>name</tt> when the user forwards it to other functions.</p></li>
<li><p>It leads to the following implied invariant:</p>
<blockquote><p>
<tt>text_encoding(enc).name() == enc</tt> is <tt>true</tt> for every <tt>string_view</tt> value <tt>enc</tt> that 
is valid to construct a <tt>text_encoding</tt> object.
</p></blockquote>
</li>
</ol>
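<p>
The implied invariant of the last item can be illustrated with a minimal sketch. The class <tt>toy_encoding</tt>
and the function <tt>round_trips</tt> are hypothetical stand-ins that model only the name storage of
<tt>text_encoding</tt>; the member <tt>name_p1885r12</tt> mimics the <i>Returns</i>: element of P1885R12 for
comparison. The buffer size of 63 characters and the use of C++17 are assumptions of this sketch.
</p>
<blockquote><pre>
#include &lt;cstddef&gt;
#include &lt;string_view&gt;

<i>// Hypothetical stand-in for text_encoding; only the name is modelled.</i>
class toy_encoding {
  static constexpr std::size_t max_name_length = 63; <i>// assumed limit</i>
  char name_[max_name_length + 1] = {};              <i>// always an NTBS</i>
public:
  <i>// Assumed precondition: enc.size() &lt;= max_name_length and</i>
  <i>// enc contains no '\0' character.</i>
  constexpr explicit toy_encoding(std::string_view enc) noexcept {
    for (std::size_t i = 0; i &lt; enc.size(); ++i)
      name_[i] = enc[i];
  }

  <i>// P1885R12 semantics: an empty name is reported as a null pointer.</i>
  constexpr const char* name_p1885r12() const noexcept {
    return name_[0] != '\0' ? name_ : nullptr;
  }

  <i>// Semantics suggested by this paper: the stored NTBS is returned as is.</i>
  constexpr const char* name() const noexcept { return name_; }
};

<i>// The implied invariant: constructing from enc and asking for the name</i>
<i>// yields enc again.</i>
constexpr bool round_trips(std::string_view enc) {
  return std::string_view(toy_encoding(enc).name()) == enc;
}

static_assert(round_trips("UTF-8"));
static_assert(round_trips("")); <i>// holds for the empty name as well</i>
static_assert(toy_encoding("").name_p1885r12() == nullptr);
</pre></blockquote>
<p>
With <tt>name_p1885r12()</tt> in place of <tt>name()</tt>, the empty-name case of <tt>round_trips</tt> would
construct a <tt>string_view</tt> from a null pointer, which is exactly the undefined behaviour described above.
</p>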

<blockquote class="note">
<p>
[<i>Drafting note</i>: It is possible to argue that the possible null result of <tt>name()</tt> allows a quick test condition
such as "<tt>if (te.name())</tt>". While I do not consider this a strong argument in favor, I'd like to point out that 
with the revised semantics suggested by this paper the alternative test would only be <em>one</em> character longer by
writing "<tt>if (*te.name())</tt>" instead.]
</p>
</blockquote>
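<p>
The two test conditions from the drafting note can be compared in a minimal sketch. The function names below are
hypothetical, plain <tt>const char*</tt> values stand in for the result of <tt>te.name()</tt>, and C++17 is assumed.
</p>
<blockquote><pre>
<i>// Test under P1885R12 semantics: a null pointer means "no name".</i>
constexpr bool has_name_p1885r12(const char* name) noexcept {
  return name != nullptr;
}

<i>// Test under the semantics suggested by this paper: name is always an</i>
<i>// NTBS (precondition), and an empty NTBS means "no name".</i>
constexpr bool has_name_proposed(const char* name) noexcept {
  return *name != '\0';
}

static_assert(!has_name_p1885r12(nullptr));
static_assert(!has_name_proposed(""));
static_assert(has_name_p1885r12("UTF-8") &amp;&amp; has_name_proposed("UTF-8"));
</pre></blockquote>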

<h2><a name="Implementation"></a>Implementation</h2>
<p>
This specification change has been implemented as a 
<a href="https://github.com/Dani-Hub/encoding-identification/tree/text_encoding_without_null_names">special branch</a>
on top of the most recent original
<a href="https://github.com/cor3ntin/encoding-identification">cor3ntin/encoding-identification</a> trunk.
<p/>
The effective 
<a href="https://github.com/cor3ntin/encoding-identification/compare/master...Dani-Hub:encoding-identification:text_encoding_without_null_names">delta</a>
demonstrates the resulting simplification and the gain in code safety.
</p>

<h2><a name="Proposed_resolution"></a>Proposed resolution</h2>

<p>
The proposed wording changes refer to <a href="https://wg21.link/p1885r12">P1885R12</a>.
</p>

<blockquote class="note">
<p>
[<i>Drafting note</i>: The author of this proposal considers this specification change really important.
He would like to remark that if LEWG disagrees with the currently suggested wording change, he would
offer an alternative proposal that suggests two different <tt>name()</tt> attributes. For example,
it would be possible to introduce a new member function such as <tt>c_name()</tt> that returns 
the exposition-only member <tt>name_</tt> as shown below and to keep the <tt>name()</tt> function with the 
<a href="https://wg21.link/p1885r12">P1885R12</a> semantics. The concrete wording for this alternative is not
prepared in this proposal revision, but could be provided if requested.]
</p>
</blockquote>

<ol>

<li><p>Modify in [text.encoding.general] as indicated:</p>

<blockquote>
<p>
An object <tt>e</tt> of type <tt>text_encoding</tt> such that <tt>e.mib() == text_encoding::id::unknown</tt> is <tt>false</tt>
and <tt>e.mib() == text_encoding::id::other</tt> is <tt>false</tt> maintains the following invariants:
</p>
<ul>
<li><p><tt><ins>*</ins>e.name() == <ins>'\0'</ins><del>nullptr</del></tt> is <tt>false</tt>.</p></li>
<li><p><tt>e.mib() == text_encoding(e.name()).mib()</tt> is <tt>true</tt>.</p></li>
</ul>
</blockquote>
</li>

<li><p>Modify [text.encoding.members] as indicated:</p>

<blockquote>
<pre>
constexpr const char* name() const noexcept;
</pre>
<blockquote>
<p>
<i>Returns</i>: <tt>name_</tt><ins>.</ins><del>if <tt>(name_[0] != '\0')</tt>, <tt>nullptr</tt> otherwise;</del>
<p/>
<i>Remarks</i>: <del>If <tt>name() == nullptr</tt> is <tt>false</tt>,</del><tt>name()</tt> is an NTBS and 
accessing elements of <tt>name_</tt> outside of the range <tt>name() + [0, strlen(name()) + 1)</tt> is 
undefined behavior.
</p>
</blockquote>
</blockquote>
</li>

</ol>

<h2><a name="Akn"></a>Acknowledgements</h2>
<p>
Thanks to Corentin Jabot and Peter Brett for the otherwise excellent proposal <a href="https://wg21.link/p1885r12">P1885R12</a>.
<p/>
Thanks to Tim Song for asking the important 
"*why* does name() return a null pointer instead of an empty string if name_ is an empty string?" question 
<a href="https://lists.isocpp.org/lib/2023/03/25721.php">during the reflector discussions</a>.
</p>

<h2><a name="Bibliography"></a>Bibliography</h2>

<div id="refs">
<div id="ref-N4944">
<p>
[N4944] Thomas K&ouml;ppe: "Working Draft, Standard for Programming Language C++", 2023<br/>
<a href="https://wg21.link/n4944">https://wg21.link/n4944</a>
</p>
</div>

<div id="ref-P1885R12">
<p>
[P1885R12] Corentin Jabot, Peter Brett: "Naming Text Encodings to Demystify Them", 2023<br/>
<a href="https://wg21.link/p1885r12">https://wg21.link/p1885r12</a>
</p>
</div>
</div>

</body></html>