<!doctype html>
<html lang="en">
<head>
	<meta charset="utf-8">
	<title>regex with Unicode character types</title>
	<style type="text/css">
	body
	{
		color: #000000;
		background-color: #fcfcfc;
	}
	div.changes
	{
		margin-right: 0.3em;
		margin-bottom: 1.2em;
		padding: 0.3em 1.2em;
		background-color: #efefef;
	}
	div.changes ins
	{
/*
		text-decoration: none;
		border-bottom: 1px solid #00007f;
		background-color: #dfdfef;
*/
	}
	div.changes p.cft
	{
		padding-left: 1.2em;
	}
	div.changes p.indent
	{
		padding-left: 1.2em;
		position: relative;
		text-indent: 0;
	}
	div.changes p.indent span.num
	{
		position: absolute;
		left: -0.3em;
	}
	dl
	{
		border: 1px solid #bfbfbf;
		border-radius: 0.6em;
		margin-bottom: 1.2em;
	}
	dl dt
	{
		background-color: #dfdffc;
		padding: 2px 6px 2px 0.6em;
	}
	em.ul
	{
		font-style: normal;
		text-decoration: underline;
	}
	p
	{
		text-indent: 1.2em;
		line-height: 1.44em;
	}
	pre.code
	{
		border: 6px double #bfbfbf;
		padding: 6px 12px;
/*
		color: #ffffff;
		background: #000000;
*/
	}
	li p, dl#proposed p
	{
		text-indent: 0;
	}
	code.cft, div.changes p.cft
	{
		font-family: monospace;
		font-size: 90%;
		color: #3f3f3f;
		font-weight: bold;
	}
	ol#toc
	{
		list-style-type: upper-roman;
	}
	body > section
	{
/*
		border: 1px solid #bfbfbf;
		border-radius: 0.6em;
		margin-bottom: 2.4em;
*/
		padding: 0.6em 1.2em 1.2em 1.2em;
	}
	section > h2
	{
		margin-top: 0.6em;
	}
	section > section
	{
		border: 1px solid #bfbfbf;
		border-radius: 0.6em;
		margin-bottom: 1.2em;
		padding-left: 1.2em;
	}
	span.eor
	{
		font-style: italic;
	}
	span.title
	{
		font-style: italic;
	}
	strong
	{
		font-weight: normal;
		font-style: italic;
		text-decoration: none;
	}
	table#meta th
	{
		text-align: left;
	}
	table#meta th,
	table#meta td
	{
		padding: 0 12px;
	}
	table.defstyle
	{
		margin-top: 0.6em;
		margin-bottom: 1.2em;
		border: 1px solid #000000;
		border-spacing: 0;
	}
	table.defstyle th, table.defstyle td
	{
		border-top: 1px solid #000000;
		border-bottom: 1px solid #000000;
		padding: 6px 12px;
	}
	table#table140 th, table#table140 td
	{
		font-size: 0.88em;
		text-align: center;
		padding: 6px;
	}
	ul
	{
		margin-top: 1.2em;
		margin-bottom: 1.2em;
	}
	ul.changes
	{
		list-style-type: square;
	}
	ul.changes > li
	{
		margin-bottom: 1.8em;
	}
	ul.changes p.target
	{
/*
		background-color: #dfdffc;
		padding: 2px 6px 2px 0.6em;
		margin-bottom: 0;
*/
		font-weight: bold;
	}
	ul#ma
	{
		display: inline;
		margin-left: 0;
		padding-left: 0;
	}
	ul#ma li
	{
		display: inline;
		list-style-type: none;
	}
	ul#ma li .tanuki
	{
		display: none;
	}
	ul#ma li span.postp:after
	{
		content: ' . ';
	}
	ul#ma li span#am:before
	{
		content: ' at ';
	}
	ul#ma li span.prep:before
	{
		content: ' . ';
	}
	</style>
</head>
<body>

<table id="meta">
	<tr>
		<th>Document Number</th>
		<td>P0169R0</td>
	</tr>
	<tr>
		<th>Date</th>
		<td>2015-11-03</td>
	</tr>
	<tr>
		<th>Audience</th>
		<td>Library Evolution Working Group</td>
	</tr>
	<tr>
		<th>Reply-To</th>
		<td>
			<ul id="ma">
				<li>Nozomu Katō</li>
				<li>&lt;</li><li><span class="tanuki">ta</span>n<span class="tanuki">tata</span>o<span class="tanuki">ta</span><span class="postp">z</span><span class="tanuki">tata</span></li>
				<li><span class="tanuki">. tantan, tanuki.</span></li>
				<li><span class="tanuki">ta</span>k<span class="tanuki">tata</span>a<span class="tanuki">ta</span></li>
				<li><span id="am"> </span></li>
				<li><span class="tanuki">ta</span>a<span class="tanuki">tata</span>k<span class="tanuki">ta</span>e<span class="tanuki">tata</span>n<span class="tanuki">ta</span>o<span class="tanuki">tata</span>t<span class="tanuki">ta</span>s<span class="tanuki">tata</span>u<span class="tanuki">ta</span>k<span class="tanuki">tata</span>i<span class="tanuki">ta</span></li>
				<li><span class="tanuki">. tatatan, tanuki.</span></li>
				<li><span class="tanuki">ta</span><span class="prep">c</span><span class="tanuki">tata</span>o<span class="tanuki">ta</span>m<span class="tanuki">tata</span></li><li>&gt;</li>
			</ul>
		</td>
	</tr>
</table>

<h1>regex with Unicode character types</h1>

<section>
	<h2>Table of Contents</h2>
	<ol id="toc">
		<li><a href="#sec1">Introduction and Motivation</a></li>
		<li><a href="#sec2">Scope and Impact on the Standard</a></li>
		<li><a href="#sec3">&lt;regex&gt; with char16_t</a></li>
		<li><a href="#sec4">Technical Specifications</a></li>
		<li><a href="#sec5">Relevant Issues</a></li>
		<li><a href="#sec6">References</a></li>
	</ol>
</section>

<section>
	<h2 id="sec1">I. Introduction and Motivation</h2>
	<p>
	Of the four character types in C++, only <code class="cft">char</code> and <code class="cft">wchar_t</code> can be used with the regular expression library in the C++ standard. As a result, regular expression matching and searching against a Unicode string are available only in environments where a value of <code class="cft">char</code> or <code class="cft">wchar_t</code> denotes a UTF-32 character.
	</p>
	<p>
	It is unfortunate and inconvenient that while C++ has two character types, two string classes dedicated to Unicode (<code class="cft">char16_t</code> and <code class="cft">char32_t</code>, <code class="cft">u16string</code> and <code class="cft">u32string</code>), and a regular expression library (regex), they cannot be used together in all implementations.
	</p>
	<p>
	This paper proposes that the regular expression library in the C++ standard (henceforth, &lt;regex&gt;) should support sequences of Unicode character types at least at the same level as sequences of <code class="cft">char</code> and <code class="cft">wchar_t</code>.
	</p>
</section>

<section>
	<h2 id="sec2">II. Scope and Impact on the Standard</h2>
	<p>
	Since there are different problems in using &lt;regex&gt; with <code class="cft">char16_t</code> or <code class="cft">char32_t</code>, different measures are required for each of them:
	</p>
	<dl>
		<dt>&lt;regex&gt; with char32_t</dt>
		<dd>
			<p>
			The value of <code class="cft">char32_t</code> is practically a Unicode code point itself. In essence it should work with &lt;regex&gt; without special treatment; however, <code class="cft">basic_regex&lt;char32_t&gt;</code> is unavailable in most implementations based on the current standard. The core reason is that the class internally tries to use <code class="cft">regex_traits&lt;char32_t&gt;</code>, which is not available because it depends on several classes in &lt;locale&gt;, namely <code class="cft">ctype&lt;char32_t&gt;</code>, <code class="cft">collate&lt;char32_t&gt;</code>, and <code class="cft">collate_byname&lt;char32_t&gt;</code>, for which specializations are not defined in the standard.
			</p>
			<p>
			Thus, for &lt;regex&gt; to support <code class="cft">char32_t</code>, it is proposed to define specializations of these classes for <code class="cft">char32_t</code> in the standard.
			</p>
		</dd>

		<dt>&lt;regex&gt; with char16_t</dt>
		<dd>
			<p>
			Use of &lt;regex&gt; with <code class="cft">char16_t</code> has the following problems:
			</p>
			<ul>
				<li>
					<p>
					Regular expressions that represent a set of characters, such as [\u0000-\uFFFF] (character class), . (dot atom), \S (predefined character class) etc. can match a half of a surrogate pair instead of the whole pair that represents one Unicode character, since comparison is performed conceptually between a code unit in the sequence of regular expressions and a code unit in the input sequence passed to an algorithm.
					</p>
				</li>
				<li>
					<p>
					Like the case of <code class="cft">char32_t</code>, the specializations <code class="cft">regex_traits&lt;char16_t&gt;</code>, <code class="cft">ctype&lt;char16_t&gt;</code>, <code class="cft">collate&lt;char16_t&gt;</code>, and <code class="cft">collate_byname&lt;char16_t&gt;</code> are not available. However, unlike <code class="cft">char32_t</code>, it is difficult to define appropriate specializations of <code class="cft">ctype</code> for <code class="cft">char16_t</code>, because <code class="cft">ctype</code> has some member functions that take an argument of <code class="cft">charT</code>, i.e., <code class="cft">char16_t</code>, and return a value of the same type. Such functions cannot deal with a surrogate pair, so icase matching, which depends on one of them, <code class="cft">tolower()</code>, is not performed correctly by the algorithms of &lt;regex&gt;.
					</p>
					<p class="note">
					Note: UCS-2 is already <a href="http://www.unicode.org/faq/utf_bom.html#utf16-11">obsolete in the Unicode standard</a> and deprecated in ISO/IEC 10646. Newly added features must not support UCS-2 explicitly.
					</p>
				</li>
			</ul>
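			<p>
			The first problem can be illustrated with a minimal sketch. The helper names <code class="cft">count_units_in_range</code> and <code class="cft">is_surrogate</code> are hypothetical and are not part of &lt;regex&gt; or of this proposal; they only imitate a matcher that compares code unit by code unit. U+10000 is a single character, yet both of its UTF-16 code units fall inside the range [\u0000-\uFFFF].
			</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts how many code units of a UTF-16 string fall
// inside the code-unit range [lo, hi], the way a code-unit-based matcher
// would evaluate a character class such as [\u0000-\uFFFF].
inline std::size_t count_units_in_range(const std::u16string& s, char16_t lo, char16_t hi)
{
    std::size_t n = 0;
    for (char16_t cu : s)
        if (lo <= cu && cu <= hi)
            ++n;
    return n;
}

// True if the code unit is either half of a surrogate pair (0xD800..0xDFFF).
inline bool is_surrogate(char16_t cu)
{
    return (cu & 0xf800) == 0xd800;
}

// One character, two code units, and each half matches the class on its own.
inline bool surrogate_halves_match_separately()
{
    const std::u16string s = u"\U00010000";
    return s.size() == 2
        && is_surrogate(s[0]) && is_surrogate(s[1])
        && count_units_in_range(s, 0x0000, 0xffff) == 2;
}
```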
			<p>
			For &lt;regex&gt; to support <code class="cft">char16_t</code>, therefore, special treatment would be required. This is discussed in the next section, but in any case no existing library other than &lt;regex&gt; would be affected.
			</p>
		</dd>
	</dl>

	<p>
	There might be demand for more full-featured Unicode regular expression support, like that described in <a href="http://www.unicode.org/reports/tr18/">UTS #18</a>, to get into the C++ standard. But I propose, as a first step, that &lt;regex&gt; support sequences of Unicode character types at the same level as sequences of <code class="cft">char</code> and <code class="cft">wchar_t</code>, for the following reasons:
	</p>
	<ul>
		<li>
			<p>
			It can easily be imagined that regular expression matching operations considering normalization, composite characters, variation sequences, grapheme clusters, etc. are very slow. Even if they are supported in the future, it would be indispensable <strong>as one option</strong> for &lt;regex&gt; to support simple character-by-character comparison for <code class="cft">char16_t</code> and <code class="cft">char32_t</code>, as well as for <code class="cft">char</code> and <code class="cft">wchar_t</code>.
			</p>
		</li>
		<li>
			<p>
			These (normalization etc.) are not features specific to regular expressions but used generally in text matching, comparison and searching. They need to be considered in a comprehensive Unicode proposal.
			</p>
		</li>
	</ul>

	<p class="note">
	Note: As of October 2015, among six regular expression grammars referred to by the C++ standard, only RegExp of ECMAScript has explicit Unicode support and it performs character-by-character comparison where each character is either a code point or a code unit of UTF-16, depending upon whether the /u flag is set or not.
	</p>
</section>

<section>
	<h2 id="sec3">III. &lt;regex&gt; with char16_t</h2>

	<p>
	There are two options for <code class="cft">char16_t</code> support:
	</p>

	<dl>
		<dt>1. Provide UTF-16 to UTF-32 converting iterator</dt>
		<dd>
			<p>
			In this option the C++ standard does not support <code class="cft">std::u16regex</code>, but defines a bidirectional iterator that converts UTF-16 to UTF-32 on the fly for the algorithms of &lt;regex&gt;. This takes pointers or iterators pointing to the sequence [begin, end) of UTF-16 as input, its <code class="cft">operator*()</code> returns a value of <code class="cft">char32_t</code>, and its <code class="cft">operator++()</code> and <code class="cft">operator--()</code> move its position to the next and previous character respectively in the sequence. A very rough sketch of it is illustrated as follows:
			</p>
			<pre class="code">
template&lt;class BidiIterator&gt;
struct regex_u16u32conv_iterator
{
public:
    typedef bidirectional_iterator_tag iterator_category;

    regex_u16u32conv_iterator(BidiIterator begin, BidiIterator end) : boi(begin), eoi(end)
    {
    }

    char32_t operator*()
    {
        if ((*boi &amp; 0xfc00) == 0xd800)    //  lead (high) surrogate
        {
            BidiIterator trail = boi;
            if (++trail != eoi &amp;&amp; (*trail &amp; 0xfc00) == 0xdc00)
                return static_cast&lt;char32_t&gt;(((*boi &amp; 0x3ff) &lt;&lt; 10 | (*trail &amp; 0x3ff)) + 0x10000);
        }
        return static_cast&lt;char32_t&gt;(*boi);
    }

    regex_u16u32conv_iterator &amp;operator++()
    {
        const bool lead = (*boi &amp; 0xfc00) == 0xd800;
        ++boi;
        if (lead &amp;&amp; boi != eoi &amp;&amp; (*boi &amp; 0xfc00) == 0xdc00)    //  skip the trail (low) surrogate
            ++boi;

        return *this;
    }

    bool operator==(const regex_u16u32conv_iterator &amp;right) const
    {
        return boi == right.boi &amp;&amp; eoi == right.eoi;
    }

    operator BidiIterator() const
    {
        return boi;
    }

    //  other members...

private:
    BidiIterator boi;
    BidiIterator eoi;
};
typedef regex_u16u32conv_iterator&lt;char16_t*&gt; regex_u16cu32conv_iterator;
typedef regex_u16u32conv_iterator&lt;u16string::iterator&gt; regex_u16su32conv_iterator;

char16_t u16chars[] = u"\u3000\U00010000\u0040";  //  0x3000, 0xd800, 0xdc00, 0x0040
regex_u16cu32conv_iterator u16tou32(u16chars, u16chars + 4);
*u16tou32;                                  //  returns 0x3000 of <code class="cft">char32_t</code>
++u16tou32;
*u16tou32;                                  //  returns 0x10000 of <code class="cft">char32_t</code>
++u16tou32;
*u16tou32;                                  //  returns 0x40 of <code class="cft">char32_t</code>

//  A sequence of regular expressions in UTF-16 needs to be converted
//  into UTF-32 before being passed to <code class="cft">u32regex</code>.
u32string u32restr = U"(abc|def)[ghi]";
u32regex u32re(u32restr);

u16string u16text = u" long long text encoded in UTF-16... ";
regex_u16su32conv_iterator bos(u16text.begin(), u16text.end());
regex_u16su32conv_iterator eos(u16text.end(), u16text.end());
regex_search(bos, eos, u32re);
			</pre>
			<p>
			This iterator need not strictly satisfy all the requirements of a bidirectional iterator; it only needs to be accepted as one by all the algorithms of &lt;regex&gt;.
			</p>
			<p>
			An advantage of this approach is that a similar iterator can be provided for UTF-8 to UTF-32 conversion, too. It is possible to support all UTFs (UTF-32, UTF-16, and UTF-8) by the combination of adding support for <code class="cft">char32_t</code> to &lt;regex&gt; and defining converting iterators.
			</p>
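			<p>
			The decoding step that such a UTF-8 iterator would perform in its <code class="cft">operator*()</code> might look roughly like the following. This is only a sketch: the function name <code class="cft">decode_utf8</code> is hypothetical, and the input is assumed to be well-formed UTF-8 (no checks for overlong forms, stray continuation bytes, or encoded surrogates).
			</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the proposal: decodes the code point that
// starts at s[pos] and advances pos past it. Assumes well-formed UTF-8;
// validation of continuation bytes, overlong forms, etc. is omitted.
inline char32_t decode_utf8(const std::string& s, std::size_t& pos)
{
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80)
        return lead;                        // single-byte (ASCII) sequence

    // Number of continuation bytes and the payload bits of the lead byte.
    int extra = 0;
    char32_t cp = 0;
    if ((lead & 0xe0) == 0xc0)      { extra = 1; cp = lead & 0x1f; }
    else if ((lead & 0xf0) == 0xe0) { extra = 2; cp = lead & 0x0f; }
    else                            { extra = 3; cp = lead & 0x07; }

    // Each continuation byte contributes its low six bits.
    while (extra-- > 0 && pos < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3f);
    return cp;
}
```

			<p>
			Applied repeatedly to the UTF-8 encoding of U+3000, U+10000 and U+0040, this yields the same three <code class="cft">char32_t</code> values as the UTF-16 example above.
			</p>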
			<p>
			A disadvantage is that matching operations are likely to be slow, since code units are translated into UTF-32 through this iterator every time they are accessed by the regular expression algorithms. Where it is possible, converting the whole UTF-16 input sequence into UTF-32 before passing it to <code class="cft">u32regex</code> or the algorithms would clearly be faster than this option.
			</p>
		</dd>

		<dt>2. Do nothing for &lt;regex&gt; with char16_t</dt>
		<dd>
			<p>
			<code class="cft">char16_t</code> resembles <code class="cft">char32_t</code> in name, but the characteristics of their values are very different. UTF-16, held in <code class="cft">char16_t</code>, resembles UTF-8 rather than UTF-32, held in <code class="cft">char32_t</code>, in that UTF-16 and UTF-8 are variable-width encoding schemes whereas UTF-32 is not. Therefore, it would be a realistic option to do nothing for the time being about <code class="cft">char16_t</code>, which requires special consideration, while adding <code class="cft">char32_t</code> to the group of <code class="cft">char</code> and <code class="cft">wchar_t</code>.
			</p>
			<p>
			In this option, until good treatment of UTF-8 and UTF-16 strings gets into the standard, users are encouraged to convert them into UTF-32 strings and then pass those to <code class="cft">std::u32regex</code> and the regular expression algorithms.
			</p>
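			<p>
			A conversion done ahead of time could be sketched as follows. The function name <code class="cft">utf16_to_utf32</code> is hypothetical, and copying an unpaired surrogate through unchanged is an assumption made here for brevity; it is only one of several possible policies.
			</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-16 to UTF-32 in a single pass.
// Assumption: an unpaired surrogate is copied through unchanged.
inline std::u32string utf16_to_utf32(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char32_t cu = in[i];
        // A lead surrogate followed by a trail surrogate forms one code point.
        if ((cu & 0xfc00) == 0xd800 && i + 1 < in.size() && (in[i + 1] & 0xfc00) == 0xdc00)
        {
            const char32_t trail = in[++i];
            out.push_back(((cu & 0x3ff) << 10 | (trail & 0x3ff)) + 0x10000);
        }
        else
        {
            out.push_back(cu);
        }
    }
    return out;
}
```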
		</dd>
	</dl>
	<p>
	Either way, support for <code class="cft">basic_regex&lt;char32_t&gt;</code> is a precondition.
	</p>
</section>

<section>
	<h2 id="sec4">IV. Technical Specifications</h2>

	<section>
		<h3>1. &lt;regex&gt;</h3>
		<p>
		The following changes are proposed to support <code class="cft">basic_regex&lt;char32_t&gt;</code>:
		</p>
		<ul class="changes">
			<li>
				<p class="target">
				28.1 General [re.general]
				</p>
				<div class="changes">
					<p class="indent">
					<span class="num">2</span>
					The following subclauses describe a basic regular expression class template and its traits that can handle char-like template arguments, <del>two</del><ins>three</ins> specializations of this class template that handle sequences of <code class="cft">char</code><ins>,</ins><del> and</del> <code class="cft">wchar_t</code><ins> and <code class="cft">char32_t</code></ins>, a class template ...
					</p>
				</div>
			</li>

			<li>
				<p class="target">
				28.3 Requirements [re.req]
				</p>
				<div class="changes">
					<p class="indent">
					<span class="num">5</span>
					[ Note: ... when it is specialized for <code class="cft">char</code><ins>,</ins><del> or</del> <code class="cft">wchar_t</code><ins> or <code class="cft">char32_t</code></ins>. This class template is described ...
					</p>
				</div>
			</li>

			<li>
				<p class="target">
				28.4 Header &lt;regex&gt; synopsis [re.syn]
				</p>
				<div class="changes">
					<p class="indent">
					typedef basic_regex&lt;char&gt; regex;<br>
					typedef basic_regex&lt;wchar_t&gt; wregex;<br>
					<ins>
					typedef basic_regex&lt;char32_t&gt; u32regex;
					</ins>
					</p>
					<p class="indent">
					typedef sub_match&lt;const char*&gt; csub_match;<br>
					typedef sub_match&lt;const wchar_t*&gt; wcsub_match;<br>
					<ins>
					typedef sub_match&lt;const char32_t*&gt; u32csub_match;<br>
					</ins>
					typedef sub_match&lt;string::const_iterator&gt; ssub_match;<br>
					typedef sub_match&lt;wstring::const_iterator&gt; wssub_match;<br>
					<ins>
					typedef sub_match&lt;u32string::const_iterator&gt; u32ssub_match;
					</ins>
					</p>
					<p class="indent">
					typedef match_results&lt;const char*&gt; cmatch;<br>
					typedef match_results&lt;const wchar_t*&gt; wcmatch;<br>
					<ins>
					typedef match_results&lt;const char32_t*&gt; u32cmatch;<br>
					</ins>
					typedef match_results&lt;string::const_iterator&gt; smatch;<br>
					typedef match_results&lt;wstring::const_iterator&gt; wsmatch;<br>
					<ins>
					typedef match_results&lt;u32string::const_iterator&gt; u32smatch;
					</ins>
					</p>
					<p class="indent">
					typedef regex_iterator&lt;const char*&gt; cregex_iterator;<br>
					typedef regex_iterator&lt;const wchar_t*&gt; wcregex_iterator;<br>
					<ins>
					typedef regex_iterator&lt;const char32_t*&gt; u32cregex_iterator;<br>
					</ins>
					typedef regex_iterator&lt;string::const_iterator&gt; sregex_iterator;<br>
					typedef regex_iterator&lt;wstring::const_iterator&gt; wsregex_iterator;<br>
					<ins>
					typedef regex_iterator&lt;u32string::const_iterator&gt; u32sregex_iterator;
					</ins>
					</p>
					<p class="indent">
					typedef regex_token_iterator&lt;const char*&gt; cregex_token_iterator;<br>
					typedef regex_token_iterator&lt;const wchar_t*&gt; wcregex_token_iterator;<br>
					<ins>
					typedef regex_token_iterator&lt;const char32_t*&gt; u32cregex_token_iterator;<br>
					</ins>
					typedef regex_token_iterator&lt;string::const_iterator&gt; sregex_token_iterator;<br>
					typedef regex_token_iterator&lt;wstring::const_iterator&gt; wsregex_token_iterator;<br>
					<ins>
					typedef regex_token_iterator&lt;u32string::const_iterator&gt; u32sregex_token_iterator;
					</ins>
					</p>
				</div>
			</li>

			<li>
				<p class="target">
				28.7 Class template regex_traits [re.traits]
				</p>
				<div class="changes">
					<p class="indent">
					<span class="num">1</span>
					The specializations <code class="cft">regex_traits&lt;char&gt;</code><ins>,</ins><del> and</del> <code class="cft">regex_traits&lt;wchar_t&gt;</code><ins> and <code class="cft">regex_traits&lt;char32_t&gt;</code></ins> shall be valid and shall satisfy the requirements for a regular expression traits class (28.3).
					</p>
				</div>

				<div class="changes">
					<p class="indent">
					<span class="num">10</span>
					<span class="eor">Remarks:</span>
					... For <code class="cft">regex_traits&lt;wchar_t&gt;</code>, at least the wide character names in Table 140 shall be recognized. <ins>For <code class="cft">regex_traits&lt;char32_t&gt;</code>, at least the <code class="cft">char32_t</code> character names in Table 140 shall be recognized.</ins>
					</p>
				</div>

				<div class="changes">
					<table class="defstyle" id="table140">
						<caption>Table 140 ―― Character class names and corresponding <code class="cft">ctype</code> masks</caption>
						<tr>
							<th>Narrow character name</th>
							<th>Wide character name</th>
							<th><ins><code class="cft">char32_t</code> character name</ins></th>
							<th>Corresponding <code class="cft">ctype_base::mask</code> value</th>
						</tr>
						<tr><td>"alnum"</td><td>L"alnum"</td><td><ins>U"alnum"</ins></td><td>ctype_base::alnum</td></tr>
						<tr><td>"alpha"</td><td>L"alpha"</td><td><ins>U"alpha"</ins></td><td>ctype_base::alpha</td></tr>
						<tr><td>"blank"</td><td>L"blank"</td><td><ins>U"blank"</ins></td><td>ctype_base::blank</td></tr>
						<tr><td>"cntrl"</td><td>L"cntrl"</td><td><ins>U"cntrl"</ins></td><td>ctype_base::cntrl</td></tr>
						<tr><td>"digit"</td><td>L"digit"</td><td><ins>U"digit"</ins></td><td>ctype_base::digit</td></tr>
						<tr><td>"d"</td><td>L"d"</td><td><ins>U"d"</ins></td><td>ctype_base::digit</td></tr>
						<tr><td>"graph"</td><td>L"graph"</td><td><ins>U"graph"</ins></td><td>ctype_base::graph</td></tr>
						<tr><td>"lower"</td><td>L"lower"</td><td><ins>U"lower"</ins></td><td>ctype_base::lower</td></tr>
						<tr><td>"print"</td><td>L"print"</td><td><ins>U"print"</ins></td><td>ctype_base::print</td></tr>
						<tr><td>"punct"</td><td>L"punct"</td><td><ins>U"punct"</ins></td><td>ctype_base::punct</td></tr>
						<tr><td>"space"</td><td>L"space"</td><td><ins>U"space"</ins></td><td>ctype_base::space</td></tr>
						<tr><td>"s"</td><td>L"s"</td><td><ins>U"s"</ins></td><td>ctype_base::space</td></tr>
						<tr><td>"upper"</td><td>L"upper"</td><td><ins>U"upper"</ins></td><td>ctype_base::upper</td></tr>
						<tr><td>"w"</td><td>L"w"</td><td><ins>U"w"</ins></td><td>ctype_base::alnum</td></tr>
						<tr><td>"xdigit"</td><td>L"xdigit"</td><td><ins>U"xdigit"</ins></td><td>ctype_base::xdigit</td></tr>
					</table>
				</div>
			</li>
		</ul>
	</section>

	<section>
		<h3>2. &lt;locale&gt;</h3>

		<p>
		Relationship with &lt;regex&gt;:
		</p>
		<ul>
			<li><code class="cft">ctype::tolower()</code> is called by <code class="cft">regex_traits::translate_nocase()</code>,</li>
			<li><code class="cft">ctype::is()</code> is called by <code class="cft">regex_traits::isctype()</code>,</li>
			<li><code class="cft">collate::transform()</code> is called by <code class="cft">regex_traits::transform()</code>,</li>
			<li><code class="cft">collate_byname::transform()</code> is called by <code class="cft">regex_traits::transform_primary()</code>,</li>
			<li><code class="cft">collate&lt;charT&gt;</code> and <code class="cft">collate_byname&lt;charT&gt;</code> are referred to in <code class="cft">regex_traits::transform_primary()</code>. (Cf. <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#2338">Library Issue 2338</a>)</li>
		</ul>
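		<p>
		These dependencies can be observed with the existing <code class="cft">regex_traits&lt;char&gt;</code>. The sketch below assumes the default global locale ("C"); the function name <code class="cft">narrow_traits_demo</code> is hypothetical. The <code class="cft">char32_t</code> counterparts of these calls are what the facet specializations proposed in this section would make available.
		</p>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Exercises the regex_traits<char> members listed above.
// Assumption: the global locale is the default "C" locale.
inline bool narrow_traits_demo()
{
    std::regex_traits<char> traits;

    // translate_nocase() forwards to ctype<char>::tolower().
    const bool folded = traits.translate_nocase('A') == 'a';

    // isctype() forwards to ctype<char>::is(); "d" is one of the names in Table 140.
    const std::string name = "d";
    const auto digit_class = traits.lookup_classname(name.begin(), name.end());
    const bool classified = traits.isctype('7', digit_class)
                         && !traits.isctype('x', digit_class);

    // transform() forwards to collate<char>::transform() and returns a sort key;
    // keys compare in the same order as the strings collate.
    const std::string a = "abc";
    const std::string b = "abd";
    const bool ordered = traits.transform(a.begin(), a.end())
                       < traits.transform(b.begin(), b.end());

    return folded && classified && ordered;
}
```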

		<p>
		Thus, the following changes are proposed for support of <code class="cft">regex_traits&lt;char32_t&gt;</code>:
		</p>
		<ul class="changes">
			<li>
				<p class="target">
				22.3.1.1.1 Type locale::category [locale.category]
				</p>

				<div class="changes">
					<table class="defstyle" id="table80">
						<caption>Table 80 — Locale category facets</caption>
						<tr>
							<th>Category</th><th>Includes facets</th>
						</tr>
						<tr>
							<td>collate</td>
							<td><code class="cft">collate&lt;char&gt;, collate&lt;wchar_t&gt;<ins>, collate&lt;char32_t&gt;</ins></code></td>
						</tr>
						<tr>
							<td>ctype</td>
							<td><code class="cft">ctype&lt;char&gt;, ctype&lt;wchar_t&gt;<ins>, ctype&lt;char32_t&gt;</ins><br>...</code></td>
						</tr>
					</table>
				</div>

				<div class="changes">
					<table class="defstyle" id="table81">
						<caption>Table 81 — Required specializations</caption>
						<tr>
							<th>Category</th><th>Includes facets</th>
						</tr>
						<tr>
							<td>collate</td>
							<td><code class="cft">collate_byname&lt;char&gt;, collate_byname&lt;wchar_t&gt;<ins>, collate_byname&lt;char32_t&gt;</ins></code></td>
						</tr>
					</table>
				</div>
			</li>

			<li>
				<p class="target">
				22.4.1.1.2 ctype virtual functions [locale.ctype.virtuals]
				</p>

				<p>
				<code class="cft">do_toupper()</code> is not called by <code class="cft">regex_traits&lt;char32_t&gt;</code>, but the change is proposed for consistency with <code class="cft">do_tolower()</code>.
				</p>

				<div class="changes">
					<p class="cft">
					charT do_toupper(charT c) const;<br>
					const charT* do_toupper(charT* low, const charT* high) const;
					</p>
					<p class="indent">
					<span class="num">7</span>
					<span class="eor">Effects</span>:
					Converts a character or characters to upper case. The second form replaces each character *p in the range [low,high) for which a corresponding upper-case character exists, with that character.<br>
					<ins>
					When charT is <code class="cft">char32_t</code>, a character or characters should be converted to upper case in conformity with the data in <a href="http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt">UnicodeData.txt</a> provided by the Unicode Consortium.
					</ins>
					</p>
				</div>

				<div class="changes">
					<p class="cft">
					charT do_tolower(charT c) const;<br>
					const charT* do_tolower(charT* low, const charT* high) const;<br>
					</p>
					<p class="indent">
					<span class="num">9</span>
					<span class="eor">Effects</span>:
					Converts a character or characters to lower case. The second form replaces each character *p in
the range [low,high) and for which a corresponding lower-case character exists, with that character.<br>
					<ins>
					When charT is <code class="cft">char32_t</code>, a character or characters should be converted to lower case in conformity with the data in <a href="http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt">UnicodeData.txt</a> provided by the Unicode Consortium.
					</ins>
					</p>
				</div>

				<div class="changes">
					<p class="cft">
					bool do_is(mask m, charT c) const;<br>
					const charT* do_is(const charT* low, const charT* high, mask* vec) const;
					</p>
					<p class="indent">
					<span class="num">1</span>
					<span class="eor">Effects</span>:
					Classifies a character or sequence of characters. For each argument character, identifies a value M of type ctype_base::mask. The second form identifies a value M of type ctype_base::mask for each *p where (low&lt;=p &amp;&amp; p&lt;high), and places it into vec[p-low].<br>
					<ins>
					When charT is <code class="cft">char32_t</code>, the character classification should be in conformity with <a href="http://www.unicode.org/reports/tr18/#Compatibility_Properties">Unicode Technical Standard #18, Unicode Regular Expressions, Annex C: Compatibility Properties</a>.
					</ins>
					</p>
				</div>
			</li>

			<li>
				<p class="target">
				22.4.4.1 Class template collate [locale.collate]
				</p>
				<div class="changes">
					<p class="indent">
					<span class="num">1</span>
					... The specializations required in Table 80 (22.3.1.1.1), namely <code class="cft">collate&lt;char&gt;</code><ins>,</ins><del> and</del> <code class="cft">collate&lt;wchar_t&gt;</code><ins> and <code class="cft">collate&lt;char32_t&gt;</code></ins>, apply lexicographic ordering (25.4.8).
					</p>
				</div>
			</li>

			<li>
				<p class="target">
				22.4.4.1.2 collate virtual functions [locale.collate.virtuals]
				</p>
				<div class="changes">
					<p class="cft">
					int do_compare(const charT* low1, const charT* high1, const charT* low2, const charT* high2) const;
					</p>
					<p class="indent">
					<span class="num">1</span>
					<span class="eor">Returns:</span>
					... The specializations required in Table 80 (22.3.1.1.1), namely <code class="cft">collate&lt;char&gt;</code><ins>,</ins><del> and</del> <code class="cft">collate&lt;wchar_t&gt;</code><ins> and <code class="cft">collate&lt;char32_t&gt;</code></ins>, implement a lexicographical comparison (25.4.8).
					</p>
				</div>
			</li>
		</ul>
	</section>

	<section>
		<h3>Strict Option</h3>
		<p>
		For <code class="cft">translate_nocase(charT c)</code> in <code class="cft">class regex_traits</code>, the C++ specification says:
		</p>
		<ul>
			<li><span class="eor">Returns:</span> <code class="cft">use_facet&lt;ctype&lt;charT&gt; &gt;(getloc()).tolower(c).</code></li>
		</ul>
		<p>
		However, in terms of the Unicode standard, this way is not appropriate for making a character caseless (i.e., case-folding). <a href="http://www.unicode.org/policies/stability_policy.html#Case_Folding">Case Folding Stability</a> of Unicode says that "Case folding is not the same as lowercasing, and a case-folded string is not necessarily lowercase. In particular, as of Unicode 8.0, ..., Cherokee text case folds to the existing uppercase letters."
		</p>
		<p>
		If we strictly follow the Unicode standard, the specification in "28.7 Class template regex_traits [re.traits]" would be modified as follows:
		</p>
		<div class="changes">
			<p>
			<code class="cft">charT regex_traits&lt;char32_t&gt;::translate_nocase(charT c);</code>
			</p>
			<p class="indent">
			<span class="num">5</span>
			<span class="eor">Returns:</span>
			<code class="cft">use_facet&lt;ctype&lt;charT&gt; &gt;(getloc()).tolower(c)</code><ins>, if <code class="cft">charT</code> is not <code class="cft">char32_t</code></ins>.<br>
			<ins>When <code class="cft">charT</code> is <code class="cft">char32_t</code>, if <a href="http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a> of the Unicode Character Database provides a simple (S) or common (C) case folding mapping for <code class="cft">c</code>, then returns the result of applying that mapping to <code class="cft">c</code>; otherwise returns <code class="cft">c</code>. When the current locale is such that <code class="cft">tolower(U'I')</code> should return an integer corresponding to <code class="cft">U'ı'</code> instead of <code class="cft">U'i'</code>, the mappings with status T in CaseFolding.txt may be given priority.</ins>
			</p>
		</div>
		<p>
		In this case, <code class="cft">regex_traits&lt;char32_t&gt;::translate_nocase()</code> does not depend upon <code class="cft">ctype&lt;char32_t&gt;::tolower()</code>. The proposed changes to <code class="cft">do_toupper()</code> and <code class="cft">do_tolower()</code> can be removed from this proposal document.
		</p>
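		<p>
		The difference between lowercasing and case folding can be sketched with a heavily abridged lookup table. The names <code class="cft">fold_entry</code> and <code class="cft">fold_simple</code> are hypothetical; the four entries are taken from the C and S mappings of CaseFolding.txt (Unicode 8.0), and a real implementation would need the whole file and a faster lookup.
		</p>

```cpp
#include <cassert>

// Hypothetical, heavily abridged table: a few C/S entries copied from
// CaseFolding.txt (Unicode 8.0). A real implementation would cover the whole file.
struct fold_entry { char32_t from; char32_t to; };

inline char32_t fold_simple(char32_t c)
{
    static const fold_entry table[] = {
        { 0x0041, 0x0061 },   // LATIN CAPITAL LETTER A -> LATIN SMALL LETTER A
        { 0x00c4, 0x00e4 },   // LATIN CAPITAL LETTER A WITH DIAERESIS -> small form
        { 0x1e9e, 0x00df },   // LATIN CAPITAL LETTER SHARP S -> sharp s (status S)
        { 0xab70, 0x13a0 },   // CHEROKEE SMALL LETTER A -> CHEROKEE LETTER A
    };
    for (const fold_entry& e : table)
        if (e.from == c)
            return e.to;
    return c;   // no C or S mapping: the character folds to itself
}
```

		<p>
		Note that the Cherokee small letter folds to the uppercase letter, and the uppercase letter folds to itself, which a <code class="cft">tolower()</code>-based <code class="cft">translate_nocase()</code> would not reproduce.
		</p>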
	</section>
</section>

<section>
	<h2 id="sec5">V. Relevant Issues</h2>

	<ul>
		<li>
			<p>
			The version of <span class="title">ISO/IEC 10646</span> in Normative references in the C++ specification is too old. It should be replaced with a more recent version, preferably <span class="title">ISO/IEC 10646:2011</span> or newer in which it is mentioned that UCS-2 is deprecated.
			</p>
		</li>
		<li>
			<p>
			The version of <span class="title">ECMAScript Language Specification</span> in Normative references in the C++ specification is old. I would like to suggest replacing it with version 6.0/2015. Apparently, this specification is put in Normative references only for &lt;regex&gt;.
			</p>
		</li>
		<li>
			<p>
			Since version 6.0/2015, ECMAScript has adopted the new regular expression <a href="http://www.ecma-international.org/ecma-262/6.0/#sec-patterns">\u{h...}</a>, where h... is one to six hexadecimal digits that represent a Unicode code point. It is preferable that &lt;regex&gt;, once it can deal with <code class="cft">char32_t</code> (and <code class="cft">char16_t</code>), accept this expression when the <code class="cft">regex_constants::ECMAScript</code> option is specified. This is the reason why the update in the preceding item is suggested.
			</p>
			<p>
			\u{h...} is the only new regular expression added to RegExp of ECMAScript since the version to which the C++ specification currently refers. The other additions are flags rather than expressions: the /u flag for Unicode support, and the /y flag, which corresponds to the <code class="cft">regex_constants::match_continuous</code> that &lt;regex&gt; already has, if my understanding is correct. In other words, by supporting this expression together with character-by-character matching for Unicode sequences, &lt;regex&gt; catches up with RegExp of ECMAScript 6.0/2015.
			</p>
		</li>
	</ul>
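	<p>
	Recognizing this expression in a UTF-32 pattern is straightforward. The following is a minimal sketch under the description given above (one to six hexadecimal digits, value not exceeding U+10FFFF); the function name <code class="cft">parse_braced_u_escape</code> and its interface are hypothetical and are not proposed wording.
	</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: if pattern[pos...] starts with \u{h...} (one to six hex
// digits), stores the code point in cp, moves pos past the closing brace and
// returns true. Otherwise leaves pos and cp untouched and returns false.
inline bool parse_braced_u_escape(const std::u32string& pattern, std::size_t& pos, char32_t& cp)
{
    if (pos > pattern.size() || pattern.compare(pos, 3, U"\\u{") != 0)
        return false;
    std::size_t i = pos + 3;

    char32_t value = 0;
    std::size_t digits = 0;
    for (; i < pattern.size() && pattern[i] != U'}'; ++i, ++digits)
    {
        const char32_t ch = pattern[i];
        char32_t d;
        if (ch >= U'0' && ch <= U'9')      d = ch - U'0';
        else if (ch >= U'a' && ch <= U'f') d = ch - U'a' + 10;
        else if (ch >= U'A' && ch <= U'F') d = ch - U'A' + 10;
        else return false;                 // not a hexadecimal digit
        if (digits == 6)
            return false;                  // more than six digits
        value = value * 16 + d;
    }
    // Reject a missing closing brace, an empty \u{}, and values above U+10FFFF.
    if (i == pattern.size() || digits == 0 || value > 0x10ffff)
        return false;

    cp = value;
    pos = i + 1;
    return true;
}
```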
</section>

<section>
	<h2 id="sec6">VI. References</h2>
	<ul>
		<li>Ecma International, <span class="title"><a href="http://www.ecma-international.org/ecma-262/6.0/">ECMA-262 6th Edition, The ECMAScript 2015 Language Specification</a></span></li>
		<li>The IEEE and The Open Group, <span class="title"><a href="http://pubs.opengroup.org/onlinepubs/9699919799/">IEEE Std 1003.1, 2013 Edition</a></span></li>
		<li>ISO/IEC <span class="title"><a href="http://standards.iso.org/ittf/PubliclyAvailableStandards/">Information technology ―― Universal Coded Character Set (UCS) ISO/IEC 10646:2014</a></span></li>
		<li>The Unicode Consortium, <span class="title"><a href="http://www.unicode.org/faq/utf_bom.html">FAQ - UTF-8, UTF-16, UTF-32 &amp; BOM</a></span></li>
		<li>The Unicode Consortium, <span class="title"><a href="http://www.unicode.org/versions/Unicode8.0.0/">Unicode 8.0.0</a></span></li>
		<li>The Unicode Consortium, <span class="title">Unicode Character Database</span>, <a href="http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a> (CaseFolding-8.0.0.txt)</li>
		<li>The Unicode Consortium, <span class="title">Unicode Character Database</span>, <a href="http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt">UnicodeData.txt</a></li>
		<li>The Unicode Consortium, <span class="title"><a href="http://www.unicode.org/policies/stability_policy.html">Unicode Character Encoding Stability Policies</a></span></li>
		<li>The Unicode Consortium, <span class="title"><a href="http://www.unicode.org/reports/tr18/">Unicode Technical Standard #18, Unicode Regular Expressions</a></span></li>
	</ul>
</section>

</body>
</html>
