<html>
<head>
<TITLE>
ISO/IEC JTC1/SC22/WG21
N2401
</TITLE>
</head>
<body>
<h1>
<img align=top src="/pics/iso44.gif" alt="ISO/">
<img align=top src="/pics/iec44.gif" alt="IEC">
JTC1/SC22/WG21
N2401
</h1>
<pre>
ISO/IEC JTC1/SC22/WG21 N2401 = J16/07-0261
Code Conversion Facets for the Standard C++ Library

P.J. Plauger
Dinkumware, Ltd.
pjp@dinkumware.com

2007-09-03

With the acceptance of <a hRef="../2006/n2007.html">N2007</a> (Proposed Library Additions for Code Conversion)
we now have template classes wbuffer_convert and wstring_convert, as well
as basic_filebuf, that accept code-conversion facets as template parameters.
Unfortunately, the current draft C++ Standard defines only the default codecvt
facet, with weakly specified properties. This paper proposes the addition of
several facets that support the most common Unicode encodings.

Add the header &lt;codecvt> with the following definitions:

namespace std {
enum codecvt_mode {
	consume_header = 4,
	generate_header = 2,
	little_endian = 1};

template&lt;class Elem,
	unsigned long Maxcode = 0x10ffff,
	codecvt_mode Mode = (codecvt_mode)0>
	class codecvt_utf8
	: public std::codecvt&lt;Elem, char, mbstate_t>
	{	// facet for converting between Elem and UTF-8 byte sequences
	.....
	};

template&lt;class Elem,
	unsigned long Maxcode = 0x10ffff,
	codecvt_mode Mode = (codecvt_mode)0>
	class codecvt_utf16
	: public std::codecvt&lt;Elem, char, mbstate_t>
	{	// facet for converting between Elem and UTF-16 multibyte sequences
	.....
	};

template&lt;class Elem,
	unsigned long Maxcode = 0x10ffff,
	codecvt_mode Mode = (codecvt_mode)0>
	class codecvt_utf8_utf16
	: public std::codecvt&lt;Elem, char, mbstate_t>
	{	// facet for converting between UTF-16 Elem and UTF-8 byte sequences
	.....
	};
}	// namespace std

For each of the three code conversion facets codecvt_utf8, codecvt_utf16,
and codecvt_utf8_utf16:

-- Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

-- Maxcode is the largest wide-character code that the facet will read
or write without reporting a conversion error.

-- If (Mode & consume_header), the facet consumes an optional initial
header sequence when reading a multibyte sequence to determine the
endianness of the subsequent multibyte sequence to be read.

-- If (Mode & generate_header), the facet generates an initial header
sequence when writing a multibyte sequence to advertise the endianness
of the subsequent multibyte sequence to be written.

-- If (Mode & little_endian), the facet generates a multibyte sequence in
little-endian order, as opposed to the default big-endian order.
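
The Mode bits combine with bitwise OR, but the result is an int and must be
cast back to codecvt_mode before it can serve as a template argument. The
following is a minimal sketch, not part of the proposed wording, that pairs
codecvt_utf16 with wstring_convert from N2007 to show the effect of each
flag on U+0041. The names to_hex, utf16_converter, encode_utf16,
decode_utf16, le_with_header, and check_mode_flags are hypothetical, and the
sketch assumes an implementation that ships &lt;codecvt> as it was later
standardized in C++11 (and deprecated in C++17).

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: render bytes as lowercase hex, e.g. "fe ff 00 41".
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    return out;
}

// Hypothetical alias: a UCS-4 <-> UTF-16 byte converter for a given Mode.
template <std::codecvt_mode Mode>
using utf16_converter = std::wstring_convert<
    std::codecvt_utf16<char32_t, 0x10ffff, Mode>, char32_t>;

template <std::codecvt_mode Mode>
std::string encode_utf16(const std::u32string& text) {
    utf16_converter<Mode> conv;
    return conv.to_bytes(text);
}

template <std::codecvt_mode Mode>
std::u32string decode_utf16(const std::string& bytes) {
    utf16_converter<Mode> conv;
    return conv.from_bytes(bytes);
}

// The result of | is an int, so it is cast back to codecvt_mode.
constexpr std::codecvt_mode le_with_header =
    static_cast<std::codecvt_mode>(std::generate_header | std::little_endian);

void check_mode_flags() {
    const std::u32string a = U"A";  // U+0041
    // Default: big-endian, no header.
    assert(to_hex(encode_utf16<std::codecvt_mode(0)>(a)) == "00 41");
    // little_endian swaps the two bytes of each 16-bit code.
    assert(to_hex(encode_utf16<std::little_endian>(a)) == "41 00");
    // generate_header prepends a byte order mark in the chosen order.
    assert(to_hex(encode_utf16<std::generate_header>(a)) == "fe ff 00 41");
    assert(to_hex(encode_utf16<le_with_header>(a)) == "ff fe 41 00");
    // consume_header skips a leading big-endian mark when reading.
    assert(decode_utf16<std::consume_header>(
               std::string("\xfe\xff\x00\x41", 4)) == a);
}
```

Calling check_mode_flags() exercises all three flags. Each wstring_convert
object starts from a fresh conversion state, so the header is generated once
per to_bytes call in this sketch.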

For the facet codecvt_utf8:

-- The facet converts between UTF-8 multibyte sequences and UCS2 or UCS4
(depending on the size of Elem) within the program.

-- Endianness does not affect how multibyte sequences are read or written.

-- The multibyte sequence can be written as either a text or a binary file.
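
As an illustration of codecvt_utf8 and of the Maxcode parameter, the sketch
below converts UCS-4 text to UTF-8 bytes and back, and then lowers Maxcode to
0x7f so that any larger code is reported as a conversion error. It is a
sketch under stated assumptions, not proposed wording: the names to_utf8,
from_utf8, is_ascii_only, and check_utf8 are hypothetical, and it relies on
wstring_convert (N2007) throwing std::range_error when no error string was
supplied to its constructor.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: UCS-4 code points to UTF-8 bytes.
std::string to_utf8(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}

// Hypothetical helper: UTF-8 bytes to UCS-4 code points.
std::u32string from_utf8(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: true if every code point is at most 0x7f, tested by
// lowering Maxcode so that larger codes become conversion errors.
bool is_ascii_only(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t, 0x7f>, char32_t> conv;
    try {
        conv.to_bytes(text);
        return true;
    } catch (const std::range_error&) {
        return false;  // the facet reported a code above Maxcode
    }
}

void check_utf8() {
    const std::u32string euro = U"\u20ac";       // EURO SIGN
    const std::u32string face = U"\U0001f600";   // outside the BMP
    assert(to_utf8(euro) == "\xe2\x82\xac");     // three-byte sequence
    assert(to_utf8(face) == "\xf0\x9f\x98\x80"); // four-byte sequence
    assert(from_utf8(to_utf8(face)) == face);    // round trip
    assert(is_ascii_only(U"plain"));
    assert(!is_ascii_only(euro));
}
```

Because UTF-8 is a sequence of single bytes, none of the endianness flags
matter here, which is why the sketch leaves Mode at its default.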

For the facet codecvt_utf16:

-- The facet converts between UTF-16 multibyte sequences and UCS2 or UCS4
(depending on the size of Elem) within the program.

-- Endianness affects how multibyte sequences are read or written.

-- The multibyte sequence must be written as a binary file.
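
The paper does not spell out why a binary file is required. One likely
reason, offered here as an assumption, is that UTF-16 byte sequences
routinely contain bytes such as 0x00 and 0x0A, which a text-mode stream may
translate or which byte-oriented code may treat as terminators. The sketch
below makes that visible; the names to_utf16be, has_text_hostile_bytes, and
check_utf16_bytes are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UCS-4 code points to big-endian UTF-16 bytes
// (big-endian is the default when little_endian is not set).
std::string to_utf16be(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}

// Hypothetical helper: true if the bytes include NUL, LF, or CR, which a
// text-mode stream or a C string routine might not pass through unchanged.
bool has_text_hostile_bytes(const std::string& bytes) {
    return bytes.find('\0') != std::string::npos ||
           bytes.find('\n') != std::string::npos ||
           bytes.find('\r') != std::string::npos;
}

void check_utf16_bytes() {
    // U+010A encodes as 01 0A: the low byte equals a line feed.
    assert(to_utf16be(U"\u010a") == std::string("\x01\x0a", 2));
    assert(has_text_hostile_bytes(to_utf16be(U"\u010a")));
    // U+0041 encodes as 00 41: the high byte is NUL.
    assert(has_text_hostile_bytes(to_utf16be(U"A")));
    // U+4E2D encodes as 4E 2D, which happens to be harmless.
    assert(!has_text_hostile_bytes(to_utf16be(U"\u4e2d")));
    // U+1F600 needs a surrogate pair: D83D DE00.
    assert(to_utf16be(U"\U0001f600") ==
           std::string("\xd8\x3d\xde\x00", 4));
}
```

The same observation applies to the little-endian form, since only the order
of the two bytes in each 16-bit code changes.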

For the facet codecvt_utf8_utf16:

-- The facet converts between UTF-8 multibyte sequences and UTF-16 (one or
two 16-bit codes) within the program.

-- Endianness does not affect how multibyte sequences are read or written.

-- The multibyte sequence can be written as either a text or a binary file.
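
To contrast codecvt_utf8_utf16 with codecvt_utf8, the sketch below converts
a UTF-8 byte sequence to UTF-16 codes held in char16_t and back. A code point
outside the Basic Multilingual Plane becomes two 16-bit codes within the
program, as described above. The names utf8_to_utf16, utf16_to_utf8, and
check_utf8_utf16 are hypothetical, and the sketch is illustrative rather
than proposed wording.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes to UTF-16 codes held in char16_t.
std::u16string utf8_to_utf16(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: UTF-16 codes held in char16_t to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(text);
}

void check_utf8_utf16() {
    // U+20AC lies in the BMP: three UTF-8 bytes, one 16-bit code.
    const std::u16string euro = utf8_to_utf16("\xe2\x82\xac");
    assert(euro.size() == 1);
    assert(euro[0] == 0x20ac);

    // U+1F600 lies outside the BMP: four UTF-8 bytes, two 16-bit codes.
    const std::string face_utf8 = "\xf0\x9f\x98\x80";
    const std::u16string face = utf8_to_utf16(face_utf8);
    assert(face.size() == 2);
    assert(face[0] == 0xd83d);  // high surrogate
    assert(face[1] == 0xde00);  // low surrogate

    // Converting back yields the original four bytes.
    assert(utf16_to_utf8(face) == face_utf8);
}
```

With codecvt_utf8 and a 16-bit Elem, by comparison, each element is a single
UCS2 code, so U+1F600 has no representation there.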

</pre>
</body>
</html>
