<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Issue 2546: Implementability of locale-sensitive UnicodeEscapeSequence matching</title>
<meta property="og:title" content="Issue 2546: Implementability of locale-sensitive UnicodeEscapeSequence matching">
<meta property="og:description" content="C++ library issue. Status: New">
<meta property="og:url" content="https://cplusplus.github.io/LWG/issue2546.html">
<meta property="og:type" content="website">
<meta property="og:image" content="http://cplusplus.github.io/LWG/images/cpp_logo.png">
<meta property="og:image:alt" content="C++ logo">
<style>
  p {text-align:justify}
  li {text-align:justify}
  pre code.backtick::before { content: "`" }
  pre code.backtick::after { content: "`" }
  blockquote.note
  {
    background-color:#E0E0E0;
    padding-left: 15px;
    padding-right: 15px;
    padding-top: 1px;
    padding-bottom: 1px;
  }
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  table.issues-index { border: 1px solid; border-collapse: collapse; }
  table.issues-index th { text-align: center; padding: 4px; border: 1px solid; }
  table.issues-index td { padding: 4px; border: 1px solid; }
  table.issues-index td:nth-child(1) { text-align: right; }
  table.issues-index td:nth-child(2) { text-align: left; }
  table.issues-index td:nth-child(3) { text-align: left; }
  table.issues-index td:nth-child(4) { text-align: left; }
  table.issues-index td:nth-child(5) { text-align: center; }
  table.issues-index td:nth-child(6) { text-align: center; }
  table.issues-index td:nth-child(7) { text-align: left; }
  table.issues-index td:nth-child(5) span.no-pr { color: red; }
  @media (prefers-color-scheme: dark) {
     html {
        color: #ddd;
        background-color: black;
     }
     ins {
        background-color: #225522
     }
     del {
        background-color: #662222
     }
     a {
        color: #6af
     }
     a:visited {
        color: #6af
     }
     blockquote.note
     {
        background-color: rgba(255, 255, 255, .10)
     }
  }
</style>
</head>
<body>
<hr>
<p><em>This page is a snapshot from the LWG issues list, see the <a href="lwg-active.html">Library Active Issues List</a> for more information and the meaning of <a href="lwg-active.html#New">New</a> status.</em></p>
<h3 id="2546"><a href="lwg-active.html#2546">2546</a>. Implementability of locale-sensitive <em>UnicodeEscapeSequence</em> matching</h3>
<p><b>Section:</b> 28.6.12 <a href="https://wg21.link/re.grammar">[re.grammar]</a> <b>Status:</b> <a href="lwg-active.html#New">New</a>
 <b>Submitter:</b> Hubert Tong <b>Opened:</b> 2015-10-08 <b>Last modified:</b> 2024-10-03</p>
<p><b>Priority: </b>4
</p>
<p><b>Discussion:</b></p>
<p>
In 28.6.12 <a href="https://wg21.link/re.grammar">[re.grammar]</a> paragraph 2:
</p>
<blockquote><p>
<code>basic_regex</code> member functions shall not call any locale dependent C or C++ API, including the formatted
string input functions. Instead they shall call the appropriate traits member function to achieve the required effect.
</p></blockquote>
<p>
Yet the required interface for a regular expression traits class (28.6.2 <a href="https://wg21.link/re.req">[re.req]</a>) does not appear to provide
any reliable way to determine whether a character, as encoded for the locale associated with the traits
instance, is the same as the character denoted by a <em>UnicodeEscapeSequence</em>. For example, assuming a sane
<code>ru_RU.koi8r</code> locale:
</p>
<blockquote><pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;locale&gt;
#include &lt;regex&gt;
#include &lt;string&gt;

// In KOI8-R, byte 0xB3 encodes U+0401.
const char data[] = "\xB3";
const char matchCyrillicCapitalLetterYo[] = R"(\u0401)";

int main(void)
{
  try {
    std::regex myRegex;
    myRegex.imbue(std::locale("ru_RU.koi8r"));

    // Attempt 1: match the character through a UnicodeEscapeSequence.
    myRegex.assign(matchCyrillicCapitalLetterYo, std::regex_constants::ECMAScript);
    printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());

    // Attempt 2: match the same character through a character class.
    myRegex.assign("[[:alpha:]]", std::regex_constants::ECMAScript);
    printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());
  } catch (const std::regex_error&amp;) {
    abort();
  }
  return 0;
}
</pre></blockquote>
<p>
The implementation I tried prints:
</p>
<blockquote><pre>
(&#x401;)
(E)
</pre></blockquote>
<p>
This means that the character class match succeeded, but the match against the <em>UnicodeEscapeSequence</em> did not.
</p>
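<p>
To illustrate what is missing, here is a minimal sketch of the kind of mapping an implementation would need
in order to match <code>\u0401</code> against KOI8-R data. The helper name <code>koi8r_from_code_point</code>
and its two-entry table are hypothetical and exist only for this illustration; nothing like it is part of the
traits requirements in [re.req], and a real mapping would have to cover the full character set of whatever
locale is imbued.
</p>

```cpp
#include <optional>

// Hypothetical helper, not part of any standard interface: maps a Unicode
// code point to its KOI8-R byte. Only the ASCII range and two Cyrillic
// letters are covered in this sketch.
std::optional<unsigned char> koi8r_from_code_point(char32_t cp)
{
  if (cp < 0x80)
    return static_cast<unsigned char>(cp);   // ASCII is shared with KOI8-R
  switch (cp) {
    case 0x0401:                             // capital Yo, as in the example above
      return static_cast<unsigned char>(0xB3);
    case 0x0451:                             // small yo
      return static_cast<unsigned char>(0xA3);
    default:
      return std::nullopt;                   // not covered by this sketch
  }
}
```

<p>
With such a mapping, <code>koi8r_from_code_point(0x0401)</code> yields the byte <code>0xB3</code> stored in
<code>data</code>, which is the comparison the regex engine has no portable way to perform through the traits class.
</p>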

<p><i>[2024-10-03; Jonathan comments]</i></p>

<p>
<code>std::basic_regex&lt;charT&gt;</code> only properly supports
matching single code units that fit in <code class='backtick'>charT</code>.
There's nothing in the spec that supports matching code points that
require multiple code units, let alone checking whether a character
in an arbitrary encoding corresponds to any given Unicode code point.
28.6.12 <a href="https://wg21.link/re.grammar">[re.grammar]</a> paragraph 12 appears to be an attempt to
allow implementations to fail to match here, but is insufficient.
When <code>is_unsigned_v&lt;char&gt;</code> is true, the CV of the
<i>UnicodeEscapeSequence</i> <code class='backtick'>"\u0080"</code> is not greater than <code class='backtick'>CHAR_MAX</code>,
but that doesn't help because U+0080 is encoded as two bytes in UTF-8.
Being able to represent <code class='backtick'>0x80</code> as <code class='backtick'>char</code> does not mean the CV can be
matched as a single <code class='backtick'>char</code>.
The API is unsuitable for Unicode-aware strings.
</p>
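<p>
The point about U+0080 can be checked with a small sketch that computes how many UTF-8 code units a code point
needs. The function name <code>utf8_length</code> is an assumption made for this illustration and is not a
standard library facility.
</p>

```cpp
#include <cstddef>

// Illustrative helper (hypothetical name): number of UTF-8 code units needed
// to encode a Unicode scalar value. Returns 0 for surrogates and for values
// above U+10FFFF, which are not valid scalar values.
constexpr std::size_t utf8_length(char32_t cp)
{
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
  if (cp < 0x10000) return 3;
  if (cp <= 0x10FFFF) return 4;
  return 0;
}

// U+007F fits in one code unit, but U+0080 already needs two, even though
// the value 0x80 is representable in an unsigned char.
static_assert(utf8_length(0x7F) == 1, "U+007F is one UTF-8 code unit");
static_assert(utf8_length(0x80) == 2, "U+0080 is two UTF-8 code units");
static_assert(utf8_length(0x0401) == 2, "U+0401 is two UTF-8 code units");
```

<p>
So a comparison of the CV against <code>CHAR_MAX</code> says nothing about whether the character occupies a single
<code>char</code> in the subject string.
</p>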



<p id="res-2546"><b>Proposed resolution:</b></p>





</body>
</html>
