<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Issue 523: regex case-insensitive character ranges are unimplementable as specified</title>
<meta property="og:title" content="Issue 523: regex case-insensitive character ranges are unimplementable as specified">
<meta property="og:description" content="C++ library issue. Status: Open">
<meta property="og:url" content="https://cplusplus.github.io/LWG/issue523.html">
<meta property="og:type" content="website">
<meta property="og:image" content="http://cplusplus.github.io/LWG/images/cpp_logo.png">
<meta property="og:image:alt" content="C++ logo">
<style>
  p {text-align:justify}
  li {text-align:justify}
  pre code.backtick::before { content: "`" }
  pre code.backtick::after { content: "`" }
  blockquote.note
  {
    background-color:#E0E0E0;
    padding-left: 15px;
    padding-right: 15px;
    padding-top: 1px;
    padding-bottom: 1px;
  }
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  table.issues-index { border: 1px solid; border-collapse: collapse; }
  table.issues-index th { text-align: center; padding: 4px; border: 1px solid; }
  table.issues-index td { padding: 4px; border: 1px solid; }
  table.issues-index td:nth-child(1) { text-align: right; }
  table.issues-index td:nth-child(2) { text-align: left; }
  table.issues-index td:nth-child(3) { text-align: left; }
  table.issues-index td:nth-child(4) { text-align: left; }
  table.issues-index td:nth-child(5) { text-align: center; }
  table.issues-index td:nth-child(6) { text-align: center; }
  table.issues-index td:nth-child(7) { text-align: left; }
  table.issues-index td:nth-child(5) span.no-pr { color: red; }
  @media (prefers-color-scheme: dark) {
     html {
        color: #ddd;
        background-color: black;
     }
     ins {
        background-color: #225522
     }
     del {
        background-color: #662222
     }
     a {
        color: #6af
     }
     a:visited {
        color: #6af
     }
     blockquote.note
     {
        background-color: rgba(255, 255, 255, .10)
     }
  }
</style>
</head>
<body>
<hr>
<p><em>This page is a snapshot from the LWG issues list, see the <a href="lwg-active.html">Library Active Issues List</a> for more information and the meaning of <a href="lwg-active.html#Open">Open</a> status.</em></p>
<h3 id="523"><a href="lwg-active.html#523">523</a>. regex case-insensitive character ranges are unimplementable as specified</h3>
<p><b>Section:</b> 28.6 <a href="https://wg21.link/re">[re]</a> <b>Status:</b> <a href="lwg-active.html#Open">Open</a>
 <b>Submitter:</b> Eric Niebler <b>Opened:</b> 2005-07-01 <b>Last modified:</b> 2020-07-17</p>
<p><b>Priority: </b>4
</p>
<p><b>View other</b> <a href="lwg-index-open.html#re">active issues</a> in [re].</p>
<p><b>View all other</b> <a href="lwg-index.html#re">issues</a> in [re].</p>
<p><b>View all issues with</b> <a href="lwg-status.html#Open">Open</a> status.</p>
<p><b>Discussion:</b></p>
<p>
A problem with TR1 regex is currently being discussed on the Boost 
developers list. It involves the handling of case-insensitive matching 
of character ranges such as [Z-a]. The proper behavior (according to the 
ECMAScript standard) is unimplementable given the current specification 
of the TR1 <code>regex_traits&lt;&gt;</code> class template. John Maddock, the author of 
the TR1 regex proposal, agrees there is a problem. The full discussion 
can be found at http://lists.boost.org/boost/2005/06/28850.php (first 
message copied below). We don't have any recommendations as yet.
</p>
<p>
-- Begin original message --
</p>
<p>
The situation of interest is described in the ECMAScript specification
(ECMA-262), section 15.10.2.15:
</p>
<p>
"Even if the pattern ignores case, the case of the two ends of a range
is significant in determining which characters belong to the range.
Thus, for example, the pattern /[E-F]/i matches only the letters E, F,
e, and f, while the pattern /[E-f]/i matches all upper and lower-case
ASCII letters as well as the symbols [, \, ], ^, _, and `."
</p>
<p>
A more interesting case is what should happen when doing a
case-insensitive match on a range such as [Z-a]. It should match z, Z,
a, A and the symbols [, \, ], ^, _, and `. This is not what happens with
Boost.Regex (it throws an exception from the regex constructor).
</p>
<p>
The tough pill to swallow is that, given the specification in TR1, I
don't think there is any effective way to handle this situation.
According to the spec, case-insensitivity is handled with
<code>regex_traits&lt;&gt;::translate_nocase(CharT)</code> &mdash; two characters are equivalent
if they compare equal after both are sent through the <code>translate_nocase</code>
function. But I don't see any way of using this translation function to
make character ranges case-insensitive. Consider the difficulty of
detecting whether "z" is in the range [Z-a]. Applying the transformation
to "z" has no effect (it is essentially <code>std::tolower</code>). And we're not
allowed to apply the transformation to the ends of the range, because as
ECMA-262 says, "the case of the two ends of a range is significant."
</p>
<p>
So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix
is to redefine translate_nocase to return a string_type containing all
the characters that should compare equal to the specified character. But
this function is hard to implement for Unicode, and it doesn't play nice
with the existing ctype facet. What a mess!
</p>
<p>
-- End original message --
</p>

<p><i>[
John Maddock adds:
]</i></p>


<p>
One small correction, I have since found that ICU's regex package does 
implement this correctly, using a similar mechanism to the current 
TR1.Regex.
</p>
<p>
Given an expression [c1-c2] that is compiled as case insensitive it:
</p>
<p>
Enumerates every character in the range c1 to c2 and converts it to it's 
case folded equivalent.  That case folded character is then used a key to a 
table of equivalence classes, and each member of the class is added to the 
list of possible matches supported by the character-class.  This second step 
isn't possible with our current traits class design, but isn't necessary if 
the input text is also converted to a case-folded equivalent on the fly.
</p>
<p>
ICU applies similar brute force mechanisms to character classes such as 
[[:lower:]] and [[:word:]], however these are at least cached, so the impact 
is less noticeable in this case.
</p>
<p>
Quick and dirty performance comparisons show that expressions such as 
"[X-\\x{fff0}]+" are indeed very slow to compile with ICU (about 200 times 
slower than a "normal" expression).  For an application that uses a lot of 
regexes this could have a noticeable performance impact.  ICU also has an 
advantage in that it knows the range of valid characters codes: code points 
outside that range are assumed not to require enumeration, as they can not 
be part of any equivalence class.  I presume that if we want the TR1.Regex 
to work with arbitrarily large character sets enumeration really does become 
impractical.
</p>
<p>
Finally note that Unicode has:
</p>
<p>
Three cases (upper, lower and title).
One to many, and many to one case transformations.
Character that have context sensitive case translations - for example an 
uppercase sigma has two different lowercase forms  - the form chosen depends 
on context(is it end of a word or not), a caseless match for an upper case 
sigma should match either of the lower case forms, which is why case folding 
is often approximated by tolower(toupper(c)).
</p>
<p>
Probably we need some way to enumerate character equivalence classes, 
including digraphs (either as a result or an input), and some way to tell 
whether the next character pair is a valid digraph in the current locale.
</p>
<p>
Hoping this doesn't make this even more complex that it was already,
</p>

<p><i>[
Portland: Alisdair: Detect as invalid, throw an exception.
Pete: Possible general problem with case insensitive ranges.
]</i></p>


<p><i>[
2009-07 Frankfurt
]</i></p>


<blockquote>
<p>
We agree that this is a problem, but we do not know the answer.
</p>
<p>
We are going to declare this NAD until existing practice leads us in some direction.
</p>
<p>
No objection to NAD Future.
</p>
<p>
Move to NAD Future.
</p>
</blockquote>

<p><i>[LEWG Kona 2017]</i></p>

<p>Recommend Open: Tim Shen proposes: forbid use of case-insensitive ranges with regex traits other than 
<code>std::regex_traits&lt;{char, wchar_t, char16_t, char32_t}&gt;</code> when regex_constants::collate is specified.</p>

<p><i>[2020-07-17; Priority set to 4 in telecon]</i></p>



<p id="res-523"><b>Proposed resolution:</b></p>





</body>
</html>
