<HTML><HEAD><TITLE>N1500=03-0083, Regular Expressions: Internationalization and Customization</TITLE></HEAD><BODY>

<CENTER>
<H1><A NAME="N1500=03-0083, Regular Expressions: Internationalization and Customization">Regular Expressions: Internationalization and Customization</A></H1>
</CENTER>

<TABLE ALIGN="RIGHT" CELLSPACING="0" CELLPADDING="0">
<TR>
<TD ALIGN="RIGHT"><B><I>Document number:</I></B></TD>
<TD>&nbsp; N1500&nbsp;=&nbsp;03-0083</TD>
</TR>
<TR>
<TD ALIGN="RIGHT"><B><I>Date:</I></B></TD>
<TD>&nbsp; September 22, 2003</TD>
</TR>
<TR>
<TD ALIGN="RIGHT"><B><I>Project:</I></B></TD>
<TD>&nbsp; Programming Language C++</TD>
</TR>
<TR>
<TD ALIGN="RIGHT"><B><I>Reference:</I></B></TD>
<TD>&nbsp; ISO/IEC IS 14882:1998(E)</TD>
</TR>
<TR>
<TD ALIGN="RIGHT"><B><I>Reply to:</I></B></TD>
<TD>&nbsp; Pete Becker</TD>
</TR>
<TR>
<TD></TD>
<TD>&nbsp; Dinkumware, Ltd.</TD>
</TR>
<TR>
<TD></TD>
<TD>&nbsp; petebecker@acm.org</TD>
</TR>
</TABLE>
<BR CLEAR="ALL">

<HR>

<BLOCKQUOTE><I>Give me but one firm spot on which to stand, and I will move the earth.</I><BR>
-- Archimedes on the lever, quoted in <I>The Oxford Dictionary of Quotations</I>,
Second Edition, Oxford University Press, London, 1953, p. 14</BLOCKQUOTE>

<P>Regular expressions are a powerful tool that can be used to search text for
occurrences of patterns of characters. If the formal language used to define the search
pattern is too hard to understand or too hard to apply, however, the user has no firm spot
on which to stand, and it will be the fulcrum, not the earth, that moves. Designing this
formal language requires recognizing that flexibility and understandability
can easily become antagonists.</P>

<P>The regular expression proposal (N1429) is based on the regular expression support in
<A HREF="http://www.ecma-international.org/publications/standards/Ecma-262.htm">ECMAScript</A>
and in <A HREF="http://www.unix-systems.org/version3">POSIX</A>. It offers richer
internationalization and customization capabilities than either of these standards.
This paper reviews the internationalization support in the two underlying standards
as well as the support in the regular expression proposal, and recommends that we
be cautious in enhancing the capabilities of regular expressions beyond those provided
by these two standards.</P>

<H2><A NAME="Basics">Basics</A></H2>

<P>When using regular expressions, character sequences occur in two
distinct contexts: the <B><A NAME="pattern">pattern</A></B> text which
defines the regular expression itself and the <B><A NAME="target">target</A></B>
text which is the text to be searched for occurrences of the pattern. A regular
expression implementation typically <B><A NAME="compiles">compiles</A></B> the
pattern text into an intermediate representation which is used by the search
algorithms to find matches in the target text.</P>

<P>A regular expression pattern uses a set of
<B><A NAME="special characters">special characters</A></B> (e.g. the familiar
<CODE>&quot;.*&quot;</CODE>) to invoke special
matching rules. All characters that are not special characters are
<B><A NAME="ordinary characters">ordinary characters</A></B>. Ordinary characters
in the pattern text match characters in the target text only if the two characters
have exactly the same bit pattern. Other than in a POSIX bracket expression there are no
special considerations for variable width characters in the pattern text. Consequently,
a multi-character sequence in the pattern text matches the target text only if the same
sequence of characters occurs in the target text.</P>
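<P>As a rough illustration of the compile-then-search model described above, the
following sketch uses <CODE>std::regex</CODE> from the <CODE>&lt;regex&gt;</CODE> header that
was later standardized in C++11. That library is an assumption of this example and is not
identical to the interface in the proposal; the helper name <CODE>first_match</CODE> is
hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Compile the pattern text once, then search the target text with it.
// Returns the text of the first match, or an empty string if nothing matches
// (so an empty match and "no match" are not distinguished by this helper).
std::string first_match(const std::string& pattern, const std::string& target) {
    const std::regex compiled(pattern);   // pattern text -> intermediate form
    std::smatch m;
    return std::regex_search(target, m, compiled) ? m.str() : std::string();
}
```

<P>With this helper, the ordinary characters in the pattern <CODE>&quot;cat&quot;</CODE> find
<CODE>&quot;cat&quot;</CODE> inside <CODE>&quot;concatenate&quot;</CODE>, while
<CODE>&quot;Cat&quot;</CODE> finds nothing because the character values differ.</P>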

<H2><A NAME="Internationalization in POSIX">Internationalization in POSIX</A></H2>

<P>POSIX regular expressions support more flexible matching through the use of
<B><A NAME="bracket expressions">bracket expressions</A></B> that refer to named classes.
A <B><A NAME="named class">named class</A></B>
can be one of three types: a collating symbol, an equivalence class, or
a character class.</P>

<P>A <B><A NAME="collating symbol">collating symbol</A></B>, of the form
<CODE>[.ce.]</CODE>, matches a sequence of characters in the
target text only if that sequence of characters is the same collating element.
Thus, in a locale which defines the sequence <CODE>"ch"</CODE> as a collating element,
the bracket expression <CODE>[[.ch.]]</CODE> matches the entire text sequence
<CODE>"ch"</CODE>, while the ordinary bracket expression <CODE>[ch]</CODE>
matches the first character of that text sequence.</P>

<P>An <B><A NAME="equivalence class">equivalence class</A></B>, of the form
<CODE>[=cl=]</CODE>, matches a sequence of characters in the
target text only if that sequence of characters is a collating element
in the same equivalence class as <CODE>cl</CODE> according to the current
locale. Thus, the equivalence class <CODE>[=a=]</CODE> will typically match various
accented forms of the letter <CODE>'a'</CODE> in a locale that recognizes
accented letters.</P>

<P>A <B><A NAME="character class">character class</A></B>, of the form
<CODE>[:name:]</CODE>, matches a character in the target text
only if that character is in the set of characters named by <CODE>name</CODE>.
Character class names include, but are not limited to,
<CODE>d</CODE>,
<CODE>w</CODE>,
<CODE>s</CODE>,
<CODE>alnum</CODE>,
<CODE>alpha</CODE>,
<CODE>blank</CODE>,
<CODE>cntrl</CODE>,
<CODE>digit</CODE>,
<CODE>graph</CODE>,
<CODE>lower</CODE>,
<CODE>print</CODE>,
<CODE>punct</CODE>,
<CODE>space</CODE>,
<CODE>upper</CODE>,
and <CODE>xdigit</CODE>.
The contents of each of these named classes depend on the current locale.</P>
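<P>A minimal sketch of character classes inside bracket expressions, assuming the
C++11 <CODE>std::regex</CODE> library with its <CODE>std::regex::extended</CODE> (POSIX
extended) grammar and the default global locale. The helper name
<CODE>all_in_class</CODE> is hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the whole target consists of one or more characters that belong to
// the named POSIX character class, e.g. class_name = "digit".
bool all_in_class(const std::string& class_name, const std::string& target) {
    // Builds a pattern such as "[[:digit:]]+" and compiles it as POSIX extended.
    const std::regex re("[[:" + class_name + ":]]+", std::regex::extended);
    return std::regex_match(target, re);
}
```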

<H2><A NAME="Sets of Characters in ECMAScript">Sets of Characters in ECMAScript</A></H2>

<P>ECMAScript does not have explicit internationalization support. For the most
part it's not needed, because ECMAScript traffics only in Unicode characters.
It also does not have POSIX-style bracket expressions. It does, however,
have <B><A NAME="escape sequences">escape sequences</A></B> that define several
character classes.</P>

<P>The <B><A NAME="digit class">digit class</A></B>, of the form <CODE>&quot;\d&quot;</CODE>,
matches any character in the set of ASCII characters <CODE>[0-9]</CODE>.</P>

<P>The <B><A NAME="word class">word class</A></B>, of the form <CODE>&quot;\w&quot;</CODE>,
matches any character in the set of ASCII characters <CODE>[a-zA-Z0-9_]</CODE>.</P>

<P>The <B><A NAME="space class">space class</A></B>, of the form <CODE>&quot;\s&quot;</CODE>,
matches any of the characters
<CODE>&lt;TAB&gt;</CODE>,
<CODE>&lt;VT&gt;</CODE>,
<CODE>&lt;FF&gt;</CODE>,
<CODE>&lt;SP&gt;</CODE>,
<CODE>&lt;NBSP&gt;</CODE>,
<CODE>&lt;USP&gt;</CODE>,
<CODE>&lt;LF&gt;</CODE>,
<CODE>&lt;CR&gt;</CODE>,
<CODE>&lt;LS&gt;</CODE>,
<CODE>&lt;PS&gt;</CODE>.</P>

<P>A <B><A NAME="word boundary assert">word boundary assert</A></B>, of the form
<CODE>&quot;\b&quot;</CODE>, tests whether the current
position in the text is at a word boundary. A word boundary is a transition from
a character that is not in the word class to a character that is in the
word class or vice versa.</P>

<P>Each of these four escape sequences has a corresponding negation, represented by
a backslash followed by the uppercase letter corresponding to the desired escape.
Thus not digit is <CODE>&quot;\D&quot;</CODE>, not word is <CODE>&quot;\W&quot;</CODE>,
not space is <CODE>&quot;\S&quot;</CODE>, and not word boundary is
<CODE>&quot;\B&quot;</CODE>.</P>
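<P>The escape sequences above can be tried out with the C++11 <CODE>std::regex</CODE>
library, whose default grammar is derived from ECMAScript. This is only a sketch under
that assumption; the helper names <CODE>count_words</CODE> and <CODE>digits_only</CODE>
are hypothetical.</P>

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Count the words in `text`, where a word is a run of \w characters that is
// delimited on both sides by a \b word boundary.
int count_words(const std::string& text) {
    static const std::regex word(R"(\b\w+\b)");
    return static_cast<int>(std::distance(
        std::sregex_iterator(text.begin(), text.end(), word),
        std::sregex_iterator()));
}

// Remove every run of non-digit characters, using the negated class \D.
std::string digits_only(const std::string& text) {
    static const std::regex non_digit(R"(\D+)");
    return std::regex_replace(text, non_digit, "");
}
```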

<H2><A NAME="Internationalization in Regex Proposal">Internationalization in the Regex Proposal</A></H2>

<P>When the user <A HREF="#compiles">compiles</A> a regular expression, one of
the arguments specifies whether to treat the regular expression as
POSIX or ECMAScript.</P>

<P>When the user specifies ECMAScript the regular expression can
use both POSIX <A HREF="#named class">named classes</A> and
ECMAScript <A HREF="#escape sequences">escape sequences</A>. The three classes named
<CODE>[:d:]</CODE>,
<CODE>[:w:]</CODE>,
and <CODE>[:s:]</CODE>
are used to define the contents of the character sets named by the
ECMAScript escape sequences:</P>

<TABLE rules="groups">
<COLGROUP span="1">
<COLGROUP span="1">
<THEAD>
<TR><TD>ECMAScript<TD>POSIX
<TR><TD>escape<TD>named class
</THEAD>
<TBODY>
<TR><TD>\d<TD>[[:d:]]
<TR><TD>\w<TD>[[:w:]]
<TR><TD>\s<TD>[[:s:]]
<TR><TD>\D<TD>[^[:d:]]
<TR><TD>\W<TD>[^[:w:]]
<TR><TD>\S<TD>[^[:s:]]
</TBODY>
</TABLE>

<P>The ECMAScript escape sequences <CODE>\b</CODE> and <CODE>\B</CODE> are
defined, as in ECMAScript, in terms of transitions between characters in the set
defined by <CODE>\w</CODE> and by <CODE>\W</CODE>.</P>
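<P>The correspondence in the table can be checked with a short sketch. It assumes the
C++11 <CODE>std::regex</CODE> library, which kept the class names <CODE>d</CODE>,
<CODE>w</CODE>, and <CODE>s</CODE>, and the default global locale; the helper name
<CODE>full_match</CODE> is hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Match `target` in full against `pattern` using the default
// (ECMAScript-derived) grammar of std::regex.
bool full_match(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern));
}
```

<P>Each escape and its named-class spelling are expected to accept and reject the same
text; for example, both <CODE>\d+</CODE> and <CODE>[[:d:]]+</CODE> match
<CODE>&quot;1500&quot;</CODE>, and neither <CODE>\D+</CODE> nor <CODE>[^[:d:]]+</CODE>
does.</P>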

<P>This support for internationalization is provided through
member functions of a <CODE>regex_traits</CODE> object.</P>

<P><A HREF="#collating symbol">Collating symbols</A> are supported by the
member function <CODE>lookup_collatename</CODE>, which determines whether a
character sequence is a valid collation element and, if so, converts that
sequence of characters into a canonical representation. During a text
search, the canonical representation for the collation element in the pattern
will be compared to the canonical representation for the current text element
to determine whether the match succeeds.</P>

<P><A HREF="#equivalence class">Equivalence classes</A> are supported by the
member function <CODE>transform_primary</CODE>, which, like <CODE>strxfrm</CODE>,
converts a sequence of characters into a canonical representation. During a text
search, the canonical representation for the collation element in the pattern
will be compared to the canonical representation for the current text element
to determine whether the match succeeds.</P>

<P><A HREF="#character class">Character classes</A> are supported by the
member function <CODE>lookup_classname</CODE>, which converts the name of
a class to a numeric value that can be passed, along with the current character,
to <CODE>is_class</CODE>, which returns true if the character is in the class.</P>
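<P>The two-step lookup can be sketched with <CODE>std::regex_traits&lt;char&gt;</CODE>
from C++11. Note the assumption: in the standardized library the second function is
spelled <CODE>isctype</CODE> rather than <CODE>is_class</CODE>, so this illustrates the
idea rather than the proposal's exact interface. The helper name
<CODE>in_named_class</CODE> is hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `ch` belongs to the character class called `name` ("digit", "w", ...)
// according to a default-constructed std::regex_traits<char>.
bool in_named_class(char ch, const std::string& name) {
    const std::regex_traits<char> traits;
    // Step 1: convert the class name into an opaque classification value.
    const std::regex_traits<char>::char_class_type mask =
        traits.lookup_classname(name.begin(), name.end());
    // Step 2: test the character against that value.
    return traits.isctype(ch, mask);
}
```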

<!--
Compile-time specification of character type
fixed-width character matches (traits_inst.translate)
character sets through
    lookup_classname & is_class (POSIX character class, ECMAScript escape sequences)
    transform & transform_primary (equivalence class)
    lookup_collatename (collation element)
-->


<H2><A NAME="Customization in Regex Proposal">Customization in the Regex Proposal</A></H2>

<P>The regular expressions proposal also provides for customizing the rule
for matching an <A HREF="#ordinary characters">ordinary character</A> in
the pattern text and a character in the target text, as well as for remapping the
<A HREF="#special characters">special characters</A> used in writing
regular expressions. Just as for internationalization, these customizations
are provided through member functions of a <CODE>regex_traits</CODE> object.</P>

<P>Character matching is handled by the member function
<CODE>translate</CODE>. It takes two arguments, a character and a boolean flag, and
returns a character. The boolean flag indicates whether the translation should
ignore case (<CODE>false</CODE> requests a case sensitive translation). Two characters match if the characters returned by calling
<CODE>translate</CODE> with each of the two characters are equal.</P>
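<P>A minimal sketch of this matching rule follows. The class <CODE>sketch_traits</CODE>
and the function <CODE>chars_match</CODE> are hypothetical, and the flag is assumed to be
<CODE>true</CODE> for a case insensitive translation, consistent with the
<CODE>translate(..., false)</CODE> calls used for case sensitive comparison later in this
paper. Case folding here uses <CODE>std::tolower</CODE> in the C locale purely for
illustration.</P>

```cpp
#include <cassert>
#include <cctype>

// Hypothetical traits class; only the translate member is modeled.
struct sketch_traits {
    // Returns the character to use for comparison. When icase is true the
    // character is folded to lower case so that 'A' and 'a' compare equal.
    char translate(char ch, bool icase) const {
        if (!icase) return ch;
        return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
};

// The matching rule: two characters match if their translations are equal.
bool chars_match(const sketch_traits& traits, char pattern_ch, char target_ch,
                 bool icase) {
    return traits.translate(pattern_ch, icase) ==
           traits.translate(target_ch, icase);
}
```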

<P>Remapping special characters is handled by the member functions
<CODE>syntax_type</CODE> and <CODE>escape_syntax_type</CODE>. They are called only
when the pattern text is being <A HREF="#compiles">compiled</A>, and they return
enumerated types that tell the compiler what a character means. So, for example,
to recognize that any character can match the pattern at the current position,
the compiler code is <CODE>if (traits_inst.syntax_type(ch) == syntax_dot)</CODE>.</P>
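<P>To make the role of <CODE>syntax_type</CODE> concrete, here is a sketch of how a
pattern compiler might consult it. Only the enumerator <CODE>syntax_dot</CODE> appears
in the text above; the other enumerators, the class <CODE>sketch_syntax_traits</CODE>,
and the function <CODE>count_special</CODE> are hypothetical names invented for this
example.</P>

```cpp
#include <cassert>

// Hypothetical token kinds; only syntax_dot is named in the paper.
enum syntax_kind {
    syntax_char, syntax_dot, syntax_star, syntax_plus, syntax_question
};

// Hypothetical traits class that maps pattern characters to token kinds.
struct sketch_syntax_traits {
    syntax_kind syntax_type(char ch) const {
        switch (ch) {
            case '.': return syntax_dot;
            case '*': return syntax_star;
            case '+': return syntax_plus;
            case '?': return syntax_question;
            default:  return syntax_char;
        }
    }
};

// Count the pattern characters that the compiler would treat as operators
// rather than as ordinary characters.
int count_special(const sketch_syntax_traits& traits, const char* pattern) {
    int n = 0;
    for (; *pattern != '\0'; ++pattern)
        if (traits.syntax_type(*pattern) != syntax_char)
            ++n;
    return n;
}
```

<P>Because the compiler asks the traits object what each character means, replacing the
body of <CODE>syntax_type</CODE> is enough to change the operator alphabet.</P>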

<H2><A NAME="Discussion of Enhancements">Discussion of Enhancements</A></H2>

<H3><CODE>[:d:]</CODE>, <CODE>[:w:]</CODE>, <CODE>[:s:]</CODE> added to POSIX</H3>

<P>This enhancement is a conforming extension to the POSIX regular expression
specification.</P>

<H3>ECMAScript <A HREF="#escape sequences">escape sequences</A> are locale sensitive</H3>

<P>ECMAScript does not support locales. That's a reasonable (although a bit limiting)
choice for a regular expression specification that uses only Unicode. Since the C++ regular
expression proposal supports byte-sized character types, locales are necessary.</P>

<H3><A HREF="#named class">Named classes</A> added to ECMAScript</H3>

<P>This is technically a non-conforming change to ECMAScript. For example,
it quietly changes the meaning of the regular expression <CODE>"[[:d:]]"</CODE>.
Under ECMAScript, that expression matches the full text <CODE>"[]"</CODE> --
the bracket expression (which ends at the first ']') matches any of the
characters <CODE>'['</CODE>, <CODE>':'</CODE>, <CODE>'d'</CODE>, and the final
<CODE>']'</CODE> in the pattern matches the final <CODE>']'</CODE> in the
text. With this enhancement, the match fails. Of course, that regular expression
is a rather peculiar way to try to match those three characters.
The usual way to write that match would be <CODE>"[[:d]"</CODE>,
which becomes invalid with the enhancement. In practice this shouldn't
be a significant problem.</P>
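<P>The two readings of <CODE>"[[:d:]]"</CODE> can be compared side by side. This sketch
assumes the C++11 <CODE>std::regex</CODE> library, which recognizes named classes in its
ECMAScript-derived grammar. The plain ECMAScript reading is spelled out with escapes so
that it cannot be taken for a named class. Both function names are hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Reading 1: with named classes, [[:d:]] is a bracket expression that
// contains the class d, so it matches exactly one digit.
bool matches_with_named_class(const std::string& target) {
    static const std::regex re("[[:d:]]");
    return std::regex_match(target, re);
}

// Reading 2: the plain ECMAScript meaning of the same pattern text, written
// unambiguously: one of '[', ':', 'd', followed by a literal ']'.
bool matches_as_plain_ecmascript(const std::string& target) {
    static const std::regex re(R"([\[:d]\])");
    return std::regex_match(target, re);
}
```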

<P>However, arbitrary named character classes for large character sets can
be expensive to use. For example, testing whether a Unicode character is a
letter requires large tables and several lookups. This is probably why ECMAScript's
escape sequences are defined only for specific sets of ASCII code points. This
keeps the lookup tables small. Java's <CODE>regex</CODE> package also has
the usual POSIX named character classes, but they, too, only hold
ASCII code points, not all of the possible Unicode characters.
The Java package also provides named classes for the
various Unicode character classifications, which do need the large tables
and multiple lookups.</P>

<P>Since the regular expression proposal deals with arbitrary character types,
it isn't feasible to restrict named character classes to ASCII values (which might
not be the same characters under some locales). So the provision for unrestricted
named classes in this proposal is reasonable.</P>

<H3>Matching of ordinary characters is customizable</H3>

<P>Case insensitive matches need to use a locale, so calling
<CODE>translate</CODE> is a reasonable requirement for such matches. The
broader requirement that every character match be done with <CODE>translate</CODE>,
however, isn't so clearly useful, and can impose a significant
performance cost. For case sensitive comparisons, both ECMAScript and POSIX require
that the character values be the same. This, in turn, means that comparing long
sequences that contain only text can be done with <CODE>memcmp</CODE>:</P>

<PRE><CODE>    if (memcmp(cur, pat, nchrs))
        // match failed</CODE></PRE>

<P>Using <CODE>translate</CODE> in the comparison requires an explicit loop:</P>

<PRE><CODE>    for (int i = 0; i &lt; nchrs; ++i)
        if (traits_inst.translate(cur[i], false) != traits_inst.translate(pat[i], false))
            // match failed</CODE></PRE>

<P>I've suggested elsewhere that we change the interface to provide two versions
of <CODE>translate</CODE>, one for case sensitive matches and one for case insensitive
matches. With that change, the default implementation for case sensitive matches becomes
simply <CODE>return ch;</CODE>, which the compiler can easily inline. In effect, the
previous loop becomes:</P>

<PRE><CODE>    for (int i = 0; i &lt; nchrs; ++i)
        if (cur[i] != pat[i])
            // match failed</CODE></PRE>

<P>While that's a significant improvement over the previous version, it's not as good
as calling <CODE>memcmp</CODE>. A compiler might recognize this pattern and generate
fast code, but that's not something users can rely on. And, of course, a non-trivial
version of <CODE>translate</CODE> would be far slower.</P>
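<P>The fragments above can be assembled into a complete sketch for <CODE>char</CODE>
text. The class <CODE>split_traits</CODE> models the suggested split interface; its
member names <CODE>translate</CODE> and <CODE>translate_nocase</CODE>, and the two
comparison functions, are assumptions of this example. For character types wider than
<CODE>char</CODE> the byte count passed to <CODE>memcmp</CODE> would have to be scaled
by the size of the character type.</P>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical traits with one function per kind of comparison.
struct split_traits {
    // Case sensitive: the identity function, trivially inlined.
    char translate(char ch) const { return ch; }
    // Case insensitive: fold to lower case (C locale, for illustration only).
    char translate_nocase(char ch) const {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
};

// Case sensitive comparison of nchrs characters with a single library call.
bool equal_memcmp(const char* cur, const char* pat, std::size_t nchrs) {
    return std::memcmp(cur, pat, nchrs) == 0;
}

// The same comparison routed through translate, one character at a time.
bool equal_translate(const split_traits& traits, const char* cur,
                     const char* pat, std::size_t nchrs) {
    for (std::size_t i = 0; i < nchrs; ++i)
        if (traits.translate(cur[i]) != traits.translate(pat[i]))
            return false;   // match failed
    return true;
}
```

<P>With the identity <CODE>translate</CODE> the two functions give the same answers;
the difference is only in how much work the compiler must do to make the loop as fast as
the library call.</P>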

<P>The proposal says that <CODE>translate</CODE> is needed to support various locale-specific
forms of canonical or compatibility equivalence. That's what
<A HREF="#equivalence class">equivalence classes</A> do. Equivalence classes are a
clear signal in the pattern text that a potentially expensive comparison will be made.
With a user-defined <CODE>translate</CODE> there is no such signal.</P>

<H3>Special characters can be remapped</H3>

<P>The proposal says that remapping special characters allows users to
&quot;use Han-ideographs rather than Latin punctuation
symbols in place of the usual ? * + regular expression operators.&quot;
That's certainly true, but it's not an obvious benefit. The problem is that
users can't look at a regular expression and know what it means. As an extreme example,
what is the result of matching the pattern
<CODE>&quot;[abcd](efg)&quot;</CODE> against the target text
<CODE>&quot;abcdefg&quot;</CODE>? With the usual regular expression operators,
the match is <CODE>&quot;defg&quot;</CODE>, with a subexpression matching
<CODE>&quot;efg&quot;</CODE>. But with a simple change to the function
<CODE>syntax_type</CODE>, the match would be <CODE>&quot;abcde&quot;</CODE>,
with a subexpression matching <CODE>&quot;abcd&quot;</CODE>.</P>
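<P>Both outcomes can be reproduced with the C++11 <CODE>std::regex</CODE> library, which
does not support remapping. The remapped reading is therefore simulated by rewriting the
pattern with the roles of brackets and parentheses exchanged, that is, as
<CODE>&quot;(abcd)[efg]&quot;</CODE>. The type <CODE>search_result</CODE> and the
function <CODE>search_first</CODE> are hypothetical helpers.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// The text of the first match and of its first subexpression, if any.
struct search_result {
    std::string whole;
    std::string sub;
};

// Search `target` for `pattern` with the usual regular expression operators.
// Both fields are left empty if nothing matches.
search_result search_first(const std::string& pattern,
                           const std::string& target) {
    search_result r;
    std::smatch m;
    if (std::regex_search(target, m, std::regex(pattern))) {
        r.whole = m.str(0);
        if (m.size() > 1)
            r.sub = m.str(1);
    }
    return r;
}
```

<P>The same seven characters of pattern text thus select different parts of the target
depending on a mapping that is invisible at the point where the pattern is written.</P>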

<P>A regular expression is a statement in a programming language. Just as we do
not support remapping C++ operators, we should not support remapping regular
expression operators.</P>

<H2>Recommendations</H2>

<P>Under the principle that you don't pay for it if you don't use it,
character equality should be based on code points, without an intervening
function call, unless the user explicitly asks for slower but more
flexible comparisons at the time the regular expression is compiled.</P>

<P>Under the principle that code should say what it means, special characters
should not be remapped.</P>

</BODY></HTML>
