<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Issue 2584: &lt;regex&gt; ECMAScript IdentityEscape is ambiguous</title>
<meta property="og:title" content="Issue 2584: &lt;regex&gt; ECMAScript IdentityEscape is ambiguous">
<meta property="og:description" content="C++ library issue. Status: C++17">
<meta property="og:url" content="https://cplusplus.github.io/LWG/issue2584.html">
<meta property="og:type" content="website">
<meta property="og:image" content="http://cplusplus.github.io/LWG/images/cpp_logo.png">
<meta property="og:image:alt" content="C++ logo">
<style>
  p {text-align:justify}
  li {text-align:justify}
  pre code.backtick::before { content: "`" }
  pre code.backtick::after { content: "`" }
  blockquote.note
  {
    background-color:#E0E0E0;
    padding-left: 15px;
    padding-right: 15px;
    padding-top: 1px;
    padding-bottom: 1px;
  }
  ins {background-color:#A0FFA0}
  del {background-color:#FFA0A0}
  table.issues-index { border: 1px solid; border-collapse: collapse; }
  table.issues-index th { text-align: center; padding: 4px; border: 1px solid; }
  table.issues-index td { padding: 4px; border: 1px solid; }
  table.issues-index td:nth-child(1) { text-align: right; }
  table.issues-index td:nth-child(2) { text-align: left; }
  table.issues-index td:nth-child(3) { text-align: left; }
  table.issues-index td:nth-child(4) { text-align: left; }
  table.issues-index td:nth-child(5) { text-align: center; }
  table.issues-index td:nth-child(6) { text-align: center; }
  table.issues-index td:nth-child(7) { text-align: left; }
  table.issues-index td:nth-child(5) span.no-pr { color: red; }
  @media (prefers-color-scheme: dark) {
     html {
        color: #ddd;
        background-color: black;
     }
     ins {
        background-color: #225522
     }
     del {
        background-color: #662222
     }
     a {
        color: #6af
     }
     a:visited {
        color: #6af
     }
     blockquote.note
     {
        background-color: rgba(255, 255, 255, .10)
     }
  }
</style>
</head>
<body>
<hr>
<p><em>This page is a snapshot from the LWG issues list; see the <a href="lwg-active.html">Library Active Issues List</a> for more information and the meaning of <a href="lwg-active.html#C++17">C++17</a> status.</em></p>
<h3 id="2584"><a href="lwg-defects.html#2584">2584</a>. <code>&lt;regex&gt;</code> ECMAScript <code>IdentityEscape</code> is ambiguous</h3>
<p><b>Section:</b> 28.6.12 <a href="https://wg21.link/re.grammar">[re.grammar]</a> <b>Status:</b> <a href="lwg-active.html#C++17">C++17</a>
 <b>Submitter:</b> Billy O'Neal III <b>Opened:</b> 2016-01-13 <b>Last modified:</b> 2017-07-30</p>
<p><b>Priority: </b>2
</p>
<p><b>View other</b> <a href="lwg-index-open.html#re.grammar">active issues</a> in [re.grammar].</p>
<p><b>View all other</b> <a href="lwg-index.html#re.grammar">issues</a> in [re.grammar].</p>
<p><b>View all issues with</b> <a href="lwg-status.html#C++17">C++17</a> status.</p>
<p><b>Discussion:</b></p>
<p>
Stephan and I are seeing differences in implementation for how non-special characters should be handled in the 
<code>IdentityEscape</code> part of the ECMAScript grammar. For example:
</p>
<blockquote><pre>
#include &lt;iostream&gt;
// Define USE_BOOST to test Boost.Regex instead of the standard library.
#ifdef USE_BOOST
#include &lt;boost/regex.hpp&gt;
using namespace boost;
#else
#include &lt;regex&gt;
#endif
using namespace std;

int main() {
  try {
    // The pattern is \z: a backslash followed by a letter with no special meaning.
    const regex r("\\z");
    cout &lt;&lt; "Constructed \\z." &lt;&lt; endl;
    if (regex_match("z", r))
      cout &lt;&lt; "Matches z" &lt;&lt; endl;
  } catch (const regex_error&amp; e) {
    // Reached when the implementation rejects \z as an invalid escape.
    cout &lt;&lt; e.what() &lt;&lt; endl;
  }
}
</pre></blockquote>
<p>
libstdc++, Boost, and the browsers I tested (Microsoft Edge, Google Chrome) all happily interpret <code>\z</code>, which 
otherwise has no meaning, as an identity escape for the letter <code>z</code>.
libc++ and MSVC say that this is invalid, and throw <code>regex_error</code> with <code>error_escape</code>.
</p>
<p>
ECMAScript 3 (which is what C++ currently points to) seems to agree with libc++ and MSVC:
</p>
<blockquote>
<pre>
IdentityEscape ::
  SourceCharacter <b>but not</b> IdentifierPart

IdentifierPart ::
  IdentifierStart
  UnicodeCombiningMark
  UnicodeDigit
  UnicodeConnectorPunctuation
  \ UnicodeEscapeSequence

IdentifierStart ::
  UnicodeLetter
  $
  _
  \ UnicodeEscapeSequence
</pre>
</blockquote>
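<p>
As a rough illustration only (not part of the ECMAScript 3 text), the restriction above can be sketched as an ASCII-only predicate. 
The helper names <code>is_ascii_identifier_part</code> and <code>es3_allows_identity_escape</code> are hypothetical, and limiting the 
check to ASCII is an assumption made to keep the sketch short; the real grammar is defined over Unicode categories.
</p>
<blockquote><pre>
#include &lt;initializer_list&gt;
#include &lt;iostream&gt;

// Hypothetical helper, ASCII only (assumption): approximates ES3 IdentifierPart.
// For ASCII input, UnicodeLetter reduces to [A-Za-z], UnicodeDigit to [0-9],
// and UnicodeConnectorPunctuation to '_'; '$' is listed under IdentifierStart.
bool is_ascii_identifier_part(char c) {
  return (c &gt;= 'a' &amp;&amp; c &lt;= 'z') || (c &gt;= 'A' &amp;&amp; c &lt;= 'Z') ||
         (c &gt;= '0' &amp;&amp; c &lt;= '9') || c == '$' || c == '_';
}

// ES3: IdentityEscape :: SourceCharacter but not IdentifierPart.
// This only answers "is \c an IdentityEscape?"; other productions
// (for example \d or \b) are not modelled here.
bool es3_allows_identity_escape(char c) {
  return !is_ascii_identifier_part(c);
}

int main() {
  for (char c : {'z', '$', '_', '.', '*'}) {
    std::cout &lt;&lt; '\\' &lt;&lt; c &lt;&lt; ": "
              &lt;&lt; (es3_allows_identity_escape(c) ? "IdentityEscape"
                                                : "not an IdentityEscape")
              &lt;&lt; '\n';
  }
}
</pre></blockquote>
<p>
Under this reading, <code>\z</code>, <code>\$</code>, and <code>\_</code> are not identity escapes, while <code>\.</code> and 
<code>\*</code> are.
</p>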
<p>
But this doesn't make any sense: it prohibits things like <code>\$</code>, which users absolutely need to be able to escape. 
So let's look at ECMAScript 6. I believe this says much the same thing, but updates the spec to better handle Unicode by 
referencing what the Unicode standard says is an identifier character:
</p>
<blockquote>
<pre>
IdentityEscape ::
  SyntaxCharacter
  /
  SourceCharacter <b>but not</b> UnicodeIDContinue
  
UnicodeIDContinue ::
  any Unicode code point with the Unicode property "ID_Continue", "Other_ID_Continue", or "Other_ID_Start"
</pre>
</blockquote>
<p>
However, ECMAScript 6 also has an Annex B, defining "additional features for web browsers", which says:
</p>
<blockquote>
<pre>
IdentityEscape ::
  SourceCharacter <b>but not</b> c
</pre>
</blockquote>
<p>
which appears to agree with what libstdc++, Boost, and browsers are doing.
</p>
<p>
What should be the correct behavior here?
</p>
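<p>
To make the difference between the two ECMAScript 6 productions quoted above concrete, here is a minimal sketch that compares them 
for a few ASCII characters. The function names are hypothetical, and <code>is_ascii_id_continue</code> is an ASCII-only stand-in 
for <code>UnicodeIDContinue</code> (an assumption; the real property is defined by the Unicode standard).
</p>
<blockquote><pre>
#include &lt;initializer_list&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;

// ASCII-only stand-in for UnicodeIDContinue (assumption): letters, digits, '_'.
bool is_ascii_id_continue(char c) {
  return (c &gt;= 'a' &amp;&amp; c &lt;= 'z') || (c &gt;= 'A' &amp;&amp; c &lt;= 'Z') ||
         (c &gt;= '0' &amp;&amp; c &lt;= '9') || c == '_';
}

// ES6 SyntaxCharacter: one of ^ $ \ . * + ? ( ) [ ] { } |
bool is_syntax_character(char c) {
  static const std::string syntax = "^$\\.*+?()[]{}|";
  return syntax.find(c) != std::string::npos;
}

// Main-grammar production as quoted in the discussion:
// SyntaxCharacter, '/', or SourceCharacter but not UnicodeIDContinue.
bool es6_allows_identity_escape(char c) {
  return is_syntax_character(c) || c == '/' || !is_ascii_id_continue(c);
}

// Annex B production: SourceCharacter but not c.
bool annex_b_allows_identity_escape(char c) {
  return c != 'c';
}

int main() {
  std::cout &lt;&lt; "escape  main-grammar  annex-B\n";
  for (char c : {'z', '$', '_', '/', 'c'}) {
    std::cout &lt;&lt; '\\' &lt;&lt; c &lt;&lt; "      "
              &lt;&lt; (es6_allows_identity_escape(c) ? "yes" : "no ") &lt;&lt; "           "
              &lt;&lt; (annex_b_allows_identity_escape(c) ? "yes" : "no") &lt;&lt; '\n';
  }
}
</pre></blockquote>
<p>
In this sketch the two rules disagree on <code>\z</code> and <code>\_</code>, which only the Annex B production treats as identity 
escapes; both accept <code>\$</code> and <code>\/</code>, and both exclude <code>\c</code>.
</p>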

<p><i>[2016-08, Chicago]</i></p>

<p>Monday PM: Move to tentatively ready</p>


<p id="res-2584"><b>Proposed resolution:</b></p>
<p>
This wording is relative to N4567.
</p>

<ol>
<li><p>Change 28.6.12 <a href="https://wg21.link/re.grammar">[re.grammar]</a>/3 as indicated:</p>

<blockquote>
<p>
-3- The following productions within the ECMAScript grammar are modified as follows:
</p>
<blockquote><pre>
ClassAtom ::
  -
  ClassAtomNoDash
  ClassAtomExClass
  ClassAtomCollatingElement
  ClassAtomEquivalence
  
<ins>IdentityEscape ::
  SourceCharacter <b>but not</b> c</ins>
</pre></blockquote>
</blockquote>
</li>
</ol>
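<p>
One way to observe the effect of this wording on a given implementation is a small probe like the following sketch. The 
<code>Outcome</code> enumeration and the <code>probe</code> and <code>describe</code> helpers are hypothetical names introduced 
for illustration; the result printed for <code>\z</code> depends on whether the library in use implements the production above.
</p>
<blockquote><pre>
#include &lt;iostream&gt;
#include &lt;regex&gt;
#include &lt;string&gt;

enum class Outcome { Matched, NoMatch, Rejected };

// Hypothetical helper: compiles `pattern` with the default (ECMAScript) grammar
// and matches it against `subject`; a regex_error is reported as Rejected.
Outcome probe(const std::string&amp; pattern, const std::string&amp; subject) {
  try {
    const std::regex re(pattern);
    return std::regex_match(subject, re) ? Outcome::Matched : Outcome::NoMatch;
  } catch (const std::regex_error&amp;) {
    return Outcome::Rejected;
  }
}

const char* describe(Outcome o) {
  switch (o) {
    case Outcome::Matched:  return "matched";
    case Outcome::NoMatch:  return "no match";
    case Outcome::Rejected: return "rejected (regex_error)";
  }
  return "unknown";
}

int main() {
  // With "SourceCharacter but not c", \z should be accepted and match "z";
  // a library that predates this wording may reject the pattern instead.
  std::cout &lt;&lt; "\\z against \"z\": " &lt;&lt; describe(probe("\\z", "z")) &lt;&lt; '\n';
  // \$ is needed to match a literal dollar sign.
  std::cout &lt;&lt; "\\$ against \"$\": " &lt;&lt; describe(probe("\\$", "$")) &lt;&lt; '\n';
}
</pre></blockquote>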





</body>
</html>
