<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Issue 2331: regex_constants::collate's effects are inaccurately summarized</title>
<meta property="og:title" content="Issue 2331: regex_constants::collate's effects are inaccurately summarized">
<meta property="og:description" content="C++ library issue. Status: Open">
<meta property="og:url" content="https://cplusplus.github.io/LWG/issue2331.html">
<meta property="og:type" content="website">
<meta property="og:image" content="http://cplusplus.github.io/LWG/images/cpp_logo.png">
<meta property="og:image:alt" content="C++ logo">
<style>
  p {text-align:justify}
  li {text-align:justify}
  pre code.backtick::before { content: "`" }
  pre code.backtick::after { content: "`" }
  blockquote.note
  {
    background-color:#E0E0E0;
    padding-left: 15px;
    padding-right: 15px;
    padding-top: 1px;
    padding-bottom: 1px;
  }
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  table.issues-index { border: 1px solid; border-collapse: collapse; }
  table.issues-index th { text-align: center; padding: 4px; border: 1px solid; }
  table.issues-index td { padding: 4px; border: 1px solid; }
  table.issues-index td:nth-child(1) { text-align: right; }
  table.issues-index td:nth-child(2) { text-align: left; }
  table.issues-index td:nth-child(3) { text-align: left; }
  table.issues-index td:nth-child(4) { text-align: left; }
  table.issues-index td:nth-child(5) { text-align: center; }
  table.issues-index td:nth-child(6) { text-align: center; }
  table.issues-index td:nth-child(7) { text-align: left; }
  table.issues-index td:nth-child(5) span.no-pr { color: red; }
  @media (prefers-color-scheme: dark) {
     html {
        color: #ddd;
        background-color: black;
     }
     ins {
        background-color: #225522
     }
     del {
        background-color: #662222
     }
     a {
        color: #6af
     }
     a:visited {
        color: #6af
     }
     blockquote.note
     {
        background-color: rgba(255, 255, 255, .10)
     }
  }
</style>
</head>
<body>
<hr>
<p><em>This page is a snapshot from the LWG issues list, see the <a href="lwg-active.html">Library Active Issues List</a> for more information and the meaning of <a href="lwg-active.html#Open">Open</a> status.</em></p>
<h3 id="2331"><a href="lwg-active.html#2331">2331</a>. <code>regex_constants::collate</code>'s effects are inaccurately summarized</h3>
<p><b>Section:</b> 28.6.4.2 <a href="https://wg21.link/re.synopt">[re.synopt]</a> <b>Status:</b> <a href="lwg-active.html#Open">Open</a>
 <b>Submitter:</b> Stephan T. Lavavej <b>Opened:</b> 2013-09-21 <b>Last modified:</b> 2016-01-28</p>
<p><b>Priority: </b>3
</p>
<p><b>Discussion:</b></p>
<p>
The table in 28.6.4.2 <a href="https://wg21.link/re.synopt">[re.synopt]</a>/1 says that <code>regex_constants::collate</code> "Specifies that character ranges of the form 
"<code>[a-b]</code>" shall be locale sensitive.", but 28.6.12 <a href="https://wg21.link/re.grammar">[re.grammar]</a>/14 says that it affects individual character comparisons 
too.
</p>
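<p>
To make the contrast concrete, the sketch below paraphrases the two comparisons in terms of <code>std::regex_traits</code>. The helper names <code>chars_equal</code> and <code>in_range</code> are hypothetical and not part of the standard library, and the logic is one reading of [re.grammar], not the normative text.
</p>

```cpp
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: single-character equality, consulting the traits
// object when icase or collate is set and comparing raw values otherwise.
bool chars_equal(const std::regex_traits<char>& traits,
                 std::regex_constants::syntax_option_type flags,
                 char c, char d) {
    namespace rc = std::regex_constants;
    if (flags & rc::icase)
        return traits.translate_nocase(c) == traits.translate_nocase(d);
    if (flags & rc::collate)
        return traits.translate(c) == traits.translate(d);
    return c == d;
}

// Hypothetical helper: is c inside the range [c1-c2]? Without collate this
// is a plain code-point comparison; with collate the sort keys produced by
// traits.transform are compared instead.
bool in_range(const std::regex_traits<char>& traits,
              std::regex_constants::syntax_option_type flags,
              char c1, char c2, char c) {
    namespace rc = std::regex_constants;
    if (!(flags & rc::collate))
        return c1 <= c && c <= c2;
    auto key = [&](char ch) {
        const std::string s(1, (flags & rc::icase) ? traits.translate_nocase(ch)
                                                   : traits.translate(ch));
        return traits.transform(s.begin(), s.end());
    };
    return key(c1) <= key(c) && key(c) <= key(c2);
}
```

<p>
Since <code>std::regex_traits::translate</code> returns its argument unchanged, the single-character branch for <code>collate</code> would only be observable with a user-supplied traits class; the range branch is where the imbued locale can change results.
</p>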

<p><i>[2014-02-12 Issaquah: recategorize as P3]</i></p>


<p>
Marshall Clow: 28.13 <a href="https://wg21.link/re.grammar">[re.grammar]</a>/14 only applies to ECMAScript
</p>

<p>
All: we're unsure
</p>

<p>
Jonathan Wakely: we should ask John Maddock
</p>

<p>
Move to P3
</p>

<p><i>[2014-05-14, John Maddock response]</i></p>

<p>
The original intention was the original wording: namely, that <code>collate</code> only made character ranges locale sensitive.
To be frank, it's a feature that's probably hardly ever used (though I have no hard data on that), and it is a leftover
from early POSIX standards, which <em>required</em> locale-sensitive collation for character ranges and later changed
this to implementation-defined, if I remember correctly (basically nobody implemented locale-dependent collation).
<p/>
So I guess the question is: do we gain anything by requiring all character comparisons to go through the locale when this bit
is set? Certainly it adds a great deal to the implementation effort (it's not what Boost.Regex has ever done). I guess the
question is whether differing code points that collate identically are an important use case. I guess there might be a few Unicode
code points that do that, but I don't know how to go about verifying that.
<p/>
STL:
<p/>
If this was unintentional, then 28.6.4.2 <a href="https://wg21.link/re.synopt">[re.synopt]</a>/1's table should be left alone, while 28.6.12 <a href="https://wg21.link/re.grammar">[re.grammar]</a>/14 
should be changed instead.
<p/>
Jeffrey Yasskin:
<p/>
<a href="http://www.unicode.org/reports/tr18/tr18-13.html#Tailored_Loose_Matches">This page</a>
mentions that [V] in Swedish should match "W" in a perfect world.
<p/>
However, the most recent version of <a href="http://www.unicode.org/reports/tr18/#Tailored_Loose_Matches">TR18</a> retracts
both language-specific loose matches <em>and</em> language-specific ranges
because "for most full-featured regular expression engines, it is
quite difficult to match under code point equivalences that are not
1:1" and "tailored ranges can be quite difficult to implement
properly, and can have very unexpected results in practice. For
example, languages may also vary whether they consider lowercase below
uppercase or the reverse. This can have some surprising results: [a-Z]
may not match anything if <code>Z &lt; a</code> in that locale."
<p/>
<a href="http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.15">ECMAScript</a> doesn't include collation at all.
<p/>
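IMO, +1 to changing 28.13 <a href="https://wg21.link/re.grammar">[re.grammar]</a> instead of 28.5.1 <a href="https://wg21.link/re.synopt">[re.synopt]</a>. It seems like we'd be on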
fairly solid ground if we wanted to remove <code>regex_constants::collate</code>
entirely, in favor of named character classes, but of course that's
not for this issue.
</p>
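<p>
The <code>[a-Z]</code> remark quoted from TR18 can be reproduced without any locale, because in ASCII code-point order <code>'Z'</code> (90) already sorts before <code>'a'</code> (97). The following is a minimal sketch that assumes an ASCII execution character set; <code>count_in_range</code> is a hypothetical helper used only for illustration.
</p>

```cpp
// Hypothetical helper: count the 7-bit characters c with lo <= c <= hi,
// i.e. how many characters a code-point range [lo-hi] would accept.
// Assumes an ASCII execution character set.
int count_in_range(char lo, char hi) {
    int n = 0;
    for (int c = 0; c < 128; ++c) {
        if (lo <= c && c <= hi)
            ++n;
    }
    return n;
}
```

<p>
Under that assumption <code>count_in_range('a', 'Z')</code> is zero while <code>count_in_range('a', 'z')</code> is 26. An implementation may also reject such a pattern outright when it is compiled, for example by reporting <code>std::regex_constants::error_range</code>.
</p>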



<p id="res-2331"><b>Proposed resolution:</b></p>
<p>This wording is relative to N3691.</p>

<ol>
<li><p>In 28.6.4.2 <a href="https://wg21.link/re.synopt">[re.synopt]</a>/1, Table 138 &mdash; "<code>syntax_option_type</code> effects", change as indicated:</p>

<blockquote>
<table border="1">
<caption>Table 138 &mdash; <code>syntax_option_type</code> effects</caption>
<tr>
<th align="center">Element</th>
<th align="center">Effect(s) if set</th>
</tr>

<tr>
<td colspan="2" align="center">
<code>&hellip;</code>
</td>
</tr>

<tr>
<td>
<code>collate</code>
</td>
<td>
Specifies that character <del>ranges of the form "<code>[a-b]</code>"</del><ins>comparisons and character range comparisons</ins> 
shall be locale sensitive.
</td>
</tr>

<tr>
<td colspan="2" align="center">
<code>&hellip;</code>
</td>
</tr>

</table>
</blockquote>
</li>
</ol>
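<p>
For context, here is a minimal sketch of how the flag in this table row is passed to <code>std::regex</code>. The helper <code>matches_with</code> is hypothetical; the classic "C" locale is used in the usage note only so that the outcome does not depend on which named locales are installed.
</p>

```cpp
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: compile `pattern` with `flags` under locale `loc`
// and report whether `text` matches it in full.
bool matches_with(const std::string& pattern, const std::string& text,
                  std::regex_constants::syntax_option_type flags,
                  const std::locale& loc) {
    std::regex re;
    re.imbue(loc);              // imbue first: the locale is consulted when the pattern is compiled
    re.assign(pattern, flags);  // e.g. ECMAScript | collate
    return std::regex_match(text, re);
}
```

<p>
With <code>std::locale::classic()</code> and <code>ECMAScript | collate</code>, the pattern <code>[a-z]</code> is expected to accept <code>"m"</code> and reject <code>"M"</code>, the same as without <code>collate</code>; any difference would only appear under a locale whose collation order departs from code-point order.
</p>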





</body>
</html>
