<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="mpark/wg21" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <meta name="dcterms.date" content="2022-11-20" />
  <title>Unicode in the Library, Part 2: Normalization</title>
  <style>
      code{white-space: pre-wrap;}
      span.smallcaps{font-variant: small-caps;}
      span.underline{text-decoration: underline;}
      div.column{display: inline-block; vertical-align: top; width: 50%;}
      div.csl-block{margin-left: 1.5em;}
      ul.task-list{list-style: none;}
      pre > code.sourceCode { white-space: pre; position: relative; }
      pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
      pre > code.sourceCode > span:empty { height: 1.2em; }
      .sourceCode { overflow: visible; }
      code.sourceCode > span { color: inherit; text-decoration: inherit; }
      div.sourceCode { margin: 1em 0; }
      pre.sourceCode { margin: 0; }
      @media screen {
      div.sourceCode { overflow: auto; }
      }
      @media print {
      pre > code.sourceCode { white-space: pre-wrap; }
      pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
      }
      pre.numberSource code
        { counter-reset: source-line 0; }
      pre.numberSource code > span
        { position: relative; left: -4em; counter-increment: source-line; }
      pre.numberSource code > span > a:first-child::before
        { content: counter(source-line);
          position: relative; left: -1em; text-align: right; vertical-align: baseline;
          border: none; display: inline-block;
          -webkit-touch-callout: none; -webkit-user-select: none;
          -khtml-user-select: none; -moz-user-select: none;
          -ms-user-select: none; user-select: none;
          padding: 0 4px; width: 4em;
          color: #aaaaaa;
        }
      pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa;  padding-left: 4px; }
      div.sourceCode
        {  background-color: #f6f8fa; }
      @media screen {
      pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
      }
      code span { } /* Normal */
      code span.al { color: #ff0000; } /* Alert */
      code span.an { } /* Annotation */
      code span.at { } /* Attribute */
      code span.bn { color: #9f6807; } /* BaseN */
      code span.bu { color: #9f6807; } /* BuiltIn */
      code span.cf { color: #00607c; } /* ControlFlow */
      code span.ch { color: #9f6807; } /* Char */
      code span.cn { } /* Constant */
      code span.co { color: #008000; font-style: italic; } /* Comment */
      code span.cv { color: #008000; font-style: italic; } /* CommentVar */
      code span.do { color: #008000; } /* Documentation */
      code span.dt { color: #00607c; } /* DataType */
      code span.dv { color: #9f6807; } /* DecVal */
      code span.er { color: #ff0000; font-weight: bold; } /* Error */
      code span.ex { } /* Extension */
      code span.fl { color: #9f6807; } /* Float */
      code span.fu { } /* Function */
      code span.im { } /* Import */
      code span.in { color: #008000; } /* Information */
      code span.kw { color: #00607c; } /* Keyword */
      code span.op { color: #af1915; } /* Operator */
      code span.ot { } /* Other */
      code span.pp { color: #6f4e37; } /* Preprocessor */
      code span.re { } /* RegionMarker */
      code span.sc { color: #9f6807; } /* SpecialChar */
      code span.ss { color: #9f6807; } /* SpecialString */
      code span.st { color: #9f6807; } /* String */
      code span.va { } /* Variable */
      code span.vs { color: #9f6807; } /* VerbatimString */
      code span.wa { color: #008000; font-weight: bold; } /* Warning */
      code.diff {color: #898887}
      code.diff span.va {color: #006e28}
      code.diff span.st {color: #bf0303}
  </style>
  <style type="text/css">
body {
margin: 5em;
font-family: serif;

hyphens: auto;
line-height: 1.35;
text-align: justify;
}
@media screen and (max-width: 30em) {
body {
margin: 1.5em;
}
}
div.wrapper {
max-width: 60em;
margin: auto;
}
ul {
list-style-type: none;
padding-left: 2em;
margin-top: -0.2em;
margin-bottom: -0.2em;
}
a {
text-decoration: none;
color: #4183C4;
}
a.hidden_link {
text-decoration: none;
color: inherit;
}
li {
margin-top: 0.6em;
margin-bottom: 0.6em;
}
h1, h2, h3, h4 {
position: relative;
line-height: 1;
}
a.self-link {
position: absolute;
top: 0;
left: calc(-1 * (3.5rem - 26px));
width: calc(3.5rem - 26px);
height: 2em;
text-align: center;
border: none;
transition: opacity .2s;
opacity: .5;
font-family: sans-serif;
font-weight: normal;
font-size: 83%;
}
a.self-link:hover { opacity: 1; }
a.self-link::before { content: "§"; }
ul > li:before {
content: "\2014";
position: absolute;
margin-left: -1.5em;
}
:target { background-color: #C9FBC9; }
:target .codeblock { background-color: #C9FBC9; }
:target ul { background-color: #C9FBC9; }
.abbr_ref { float: right; }
.folded_abbr_ref { float: right; }
:target .folded_abbr_ref { display: none; }
:target .unfolded_abbr_ref { float: right; display: inherit; }
.unfolded_abbr_ref { display: none; }
.secnum { display: inline-block; min-width: 35pt; }
.header-section-number { display: inline-block; min-width: 35pt; }
.annexnum { display: block; }
div.sourceLinkParent {
float: right;
}
a.sourceLink {
position: absolute;
opacity: 0;
margin-left: 10pt;
}
a.sourceLink:hover {
opacity: 1;
}
a.itemDeclLink {
position: absolute;
font-size: 75%;
text-align: right;
width: 5em;
opacity: 0;
}
a.itemDeclLink:hover { opacity: 1; }
span.marginalizedparent {
position: relative;
left: -5em;
}
li span.marginalizedparent { left: -7em; }
li ul > li span.marginalizedparent { left: -9em; }
li ul > li ul > li span.marginalizedparent { left: -11em; }
li ul > li ul > li ul > li span.marginalizedparent { left: -13em; }
div.footnoteNumberParent {
position: relative;
left: -4.7em;
}
a.marginalized {
position: absolute;
font-size: 75%;
text-align: right;
width: 5em;
}
a.enumerated_item_num {
position: relative;
left: -3.5em;
display: inline-block;
margin-right: -3em;
text-align: right;
width: 3em;
}
div.para { margin-bottom: 0.6em; margin-top: 0.6em; text-align: justify; }
div.section { text-align: justify; }
div.sentence { display: inline; }
span.indexparent {
display: inline;
position: relative;
float: right;
right: -1em;
}
a.index {
position: absolute;
display: none;
}
a.index:before { content: "⟵"; }

a.index:target {
display: inline;
}
.indexitems {
margin-left: 2em;
text-indent: -2em;
}
div.itemdescr {
margin-left: 3em;
}
.bnf {
font-family: serif;
margin-left: 40pt;
margin-top: 0.5em;
margin-bottom: 0.5em;
}
.ncbnf {
font-family: serif;
margin-top: 0.5em;
margin-bottom: 0.5em;
margin-left: 40pt;
}
.ncsimplebnf {
font-family: serif;
font-style: italic;
margin-top: 0.5em;
margin-bottom: 0.5em;
margin-left: 40pt;
background: inherit; 
}
span.textnormal {
font-style: normal;
font-family: serif;
white-space: normal;
display: inline-block;
}
span.rlap {
display: inline-block;
width: 0px;
}
span.descr { font-style: normal; font-family: serif; }
span.grammarterm { font-style: italic; }
span.term { font-style: italic; }
span.terminal { font-family: monospace; font-style: normal; }
span.nonterminal { font-style: italic; }
span.tcode { font-family: monospace; font-style: normal; }
span.textbf { font-weight: bold; }
span.textsc { font-variant: small-caps; }
a.nontermdef { font-style: italic; font-family: serif; }
span.emph { font-style: italic; }
span.techterm { font-style: italic; }
span.mathit { font-style: italic; }
span.mathsf { font-family: sans-serif; }
span.mathrm { font-family: serif; font-style: normal; }
span.textrm { font-family: serif; }
span.textsl { font-style: italic; }
span.mathtt { font-family: monospace; font-style: normal; }
span.mbox { font-family: serif; font-style: normal; }
span.ungap { display: inline-block; width: 2pt; }
span.textit { font-style: italic; }
span.texttt { font-family: monospace; }
span.tcode_in_codeblock { font-family: monospace; font-style: normal; }
span.phantom { color: white; }

span.math { font-style: normal; }
span.mathblock {
display: block;
margin-left: auto;
margin-right: auto;
margin-top: 1.2em;
margin-bottom: 1.2em;
text-align: center;
}
span.mathalpha {
font-style: italic;
}
span.synopsis {
font-weight: bold;
margin-top: 0.5em;
display: block;
}
span.definition {
font-weight: bold;
display: block;
}
.codeblock {
margin-left: 1.2em;
line-height: 127%;
}
.outputblock {
margin-left: 1.2em;
line-height: 127%;
}
div.itemdecl {
margin-top: 2ex;
}
code.itemdeclcode {
white-space: pre;
display: block;
}
span.textsuperscript {
vertical-align: super;
font-size: smaller;
line-height: 0;
}
.footnotenum { vertical-align: super; font-size: smaller; line-height: 0; }
.footnote {
font-size: small;
margin-left: 2em;
margin-right: 2em;
margin-top: 0.6em;
margin-bottom: 0.6em;
}
div.minipage {
display: inline-block;
margin-right: 3em;
}
div.numberedTable {
text-align: center;
margin: 2em;
}
div.figure {
text-align: center;
margin: 2em;
}
table {
border: 1px solid black;
border-collapse: collapse;
margin-left: auto;
margin-right: auto;
margin-top: 0.8em;
text-align: left;
hyphens: none; 
}
td, th {
padding-left: 1em;
padding-right: 1em;
vertical-align: top;
}
td.empty {
padding: 0px;
padding-left: 1px;
}
td.left {
text-align: left;
}
td.right {
text-align: right;
}
td.center {
text-align: center;
}
td.justify {
text-align: justify;
}
td.border {
border-left: 1px solid black;
}
tr.rowsep, td.cline {
border-top: 1px solid black;
}
tr.even, tr.odd {
border-bottom: 1px solid black;
}
tr.capsep {
border-top: 3px solid black;
border-top-style: double;
}
tr.header {
border-bottom: 3px solid black;
border-bottom-style: double;
}
th {
border-bottom: 1px solid black;
}
span.centry {
font-weight: bold;
}
div.table {
display: block;
margin-left: auto;
margin-right: auto;
text-align: center;
width: 90%;
}
span.indented {
display: block;
margin-left: 2em;
margin-bottom: 1em;
margin-top: 1em;
}
ol.enumeratea { list-style-type: none; background: inherit; }
ol.enumerate { list-style-type: none; background: inherit; }

code.sourceCode > span { display: inline; }
</style>
  <link href="data:image/x-icon;base64,AAABAAIAEBAAAAEAIABoBAAAJgAAACAgAAABACAAqBAAAI4EAAAoAAAAEAAAACAAAAABACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA////AIJEAACCRAAAgkQAAIJEAACCRAAAgkQAVoJEAN6CRADegkQAWIJEAACCRAAAgkQAAIJEAACCRAAA////AP///wCCRAAAgkQAAIJEAACCRAAsgkQAvoJEAP+CRAD/gkQA/4JEAP+CRADAgkQALoJEAACCRAAAgkQAAP///wD///8AgkQAAIJEABSCRACSgkQA/IJEAP99PQD/dzMA/3czAP99PQD/gkQA/4JEAPyCRACUgkQAFIJEAAD///8A////AHw+AFiBQwDqgkQA/4BBAP9/PxP/uZd6/9rJtf/bybX/upd7/39AFP+AQQD/gkQA/4FDAOqAQgBc////AP///wDKklv4jlEa/3o7AP+PWC//8+3o///////////////////////z7un/kFox/35AAP+GRwD/mVYA+v///wD///8A0Zpk+NmibP+0d0T/8evj///////+/fv/1sKz/9bCs//9/fr//////+/m2/+NRwL/nloA/5xYAPj///8A////ANKaZPjRmGH/5cKh////////////k149/3UwAP91MQD/lmQ//86rhv+USg3/m1YA/5hSAP+bVgD4////AP///wDSmmT4zpJY/+/bx///////8+TV/8mLT/+TVx//gkIA/5lVAP+VTAD/x6B//7aEVv/JpH7/s39J+P///wD///8A0ppk+M6SWP/u2sf///////Pj1f/Nj1T/2KFs/8mOUv+eWhD/lEsA/8aee/+0glT/x6F7/7J8Rvj///8A////ANKaZPjRmGH/48Cf///////+/v7/2qt//82PVP/OkFX/37KJ/86siv+USg7/mVQA/5hRAP+bVgD4////AP///wDSmmT40ppk/9CVXP/69O////////7+/v/x4M//8d/P//7+/f//////9u7n/6tnJf+XUgD/nFgA+P///wD///8A0ppk+NKaZP/RmWL/1qNy//r07///////////////////////+vXw/9akdP/Wnmn/y5FY/6JfFvj///8A////ANKaZFTSmmTo0ppk/9GYYv/Ql1//5cWm//Hg0P/x4ND/5cWm/9GXYP/RmGH/0ppk/9KaZOjVnmpY////AP///wDSmmQA0ppkEtKaZI7SmmT60ppk/9CWX//OkVb/zpFW/9CWX//SmmT/0ppk/NKaZJDSmmQS0ppkAP///wD///8A0ppkANKaZADSmmQA0ppkKtKaZLrSmmT/0ppk/9KaZP/SmmT/0ppkvNKaZCrSmmQA0ppkANKaZAD///8A////ANKaZADSmmQA0ppkANKaZADSmmQA0ppkUtKaZNzSmmTc0ppkVNKaZADSmmQA0ppkANKaZADSmmQA////AP5/AAD4HwAA4AcAAMADAACAAQAAgAEAAIABAACAAQAAgAEAAIABAACAAQAAgAEAAMADAADgBwAA+B8AAP5/AAAoAAAAIAAAAEAAAAABACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA////AP///wCCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAAyCRACMgkQA6oJEAOqCRACQgkQAEIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAA////AP///wD///8A////AIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRABigkQA5oJEAP+CRAD/gkQA/4JEAP+CRADqgkQAZoJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAAD///8A////AP///wD///8Ag
kQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAA4gkQAwoJEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQAxIJEADyCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAP///wD///8A////AP///wCCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAWgkQAmIJEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAJyCRAAYgkQAAIJEAACCRAAAgkQAAIJEAACCRAAA////AP///wD///8A////AIJEAACCRAAAgkQAAIJEAACCRAAAgkQAdIJEAPCCRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAPSCRAB4gkQAAIJEAACCRAAAgkQAAIJEAAD///8A////AP///wD///8AgkQAAIJEAACCRAAAgkQASoJEANKCRAD/gkQA/4JEAP+CRAD/g0YA/39AAP9zLgD/bSQA/2shAP9rIQD/bSQA/3MuAP9/PwD/g0YA/4JEAP+CRAD/gkQA/4JEAP+CRADUgkQAToJEAACCRAAAgkQAAP///wD///8A////AP///wB+PwAAgkUAIoJEAKiCRAD/gkQA/4JEAP+CRAD/hEcA/4BBAP9sIwD/dTAA/5RfKv+viF7/vp56/76ee/+wiF7/lWAr/3YxAP9sIwD/f0AA/4RHAP+CRAD/gkQA/4JEAP+CRAD/gkQArIJEACaBQwAA////AP///wD///8A////AIBCAEBzNAD6f0EA/4NFAP+CRAD/gkQA/4VIAP92MwD/bSUA/6N1Tv/ezsL/////////////////////////////////38/D/6V3Uv9uJgD/dTEA/4VJAP+CRAD/gkQA/4JEAP+BQwD/fUAA/4FDAEj///8A////AP///wD///8AzJRd5qBlKf91NgD/dDUA/4JEAP+FSQD/cy4A/3YyAP/PuKP//////////////////////////////////////////////////////9K7qP94NQD/ciwA/4VJAP+CRAD/fkEA/35BAP+LSwD/mlYA6v///wD///8A////AP///wDdpnL/4qx3/8KJUv+PUhf/cTMA/3AsAP90LgD/4dK+/////////////////////////////////////////////////////////////////+TYxf91MAD/dTIA/31CAP+GRwD/llQA/6FcAP+gWwD8////AP///wD///8A////ANGZY/LSm2X/4ap3/92mcP+wdT3/byQA/8mwj////////////////////////////////////////////////////////////////////////////+LYxv9zLgP/jUoA/59bAP+hXAD/nFgA/5xYAPL///8A////AP///wD///8A0ppk8tKaZP/RmWL/1p9q/9ubXv/XqXj////////////////////////////7+fD/vZyG/6BxS/+gcUr/vJuE//r37f//////////////////////3MOr/5dQBf+dVQD/nVkA/5xYAP+cWAD/nFgA8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/SmWP/yohJ//jo2P//////////////////////4NTG/4JDFf9lGAD/bSQA/20kAP9kGAD/fz8S/+Xb0f//////5NG9/6txN/+LOgD/m1QA/51aAP+cWAD/m1cA/5xYAP+cWADy////AP///wD///8A////ANKaZPLSmmT/0ppk/8+TWf/Unmv//v37//////////////////////+TWRr/VwsA/35AAP+ERgD/g0UA/4JGAP9lHgD/kFga/8KXX/+TRwD/jT4A/49CAP+VTQD/n
10A/5xYAP+OQQD/lk4A/55cAPL///8A////AP///wD///8A0ppk8tKaZP/SmmT/y4tO/92yiP//////////////////////8NnE/8eCQP+rcTT/ez0A/3IyAP98PgD/gEMA/5FSAP+USwD/jj8A/5lUAP+JNwD/yqV2/694Mf+HNQD/jkAA/82rf/+laBj/jT4A8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/LiUr/4byY///////////////////////gupX/0I5P/+Wuev/Lklz/l1sj/308AP+QSwD/ol0A/59aAP+aVQD/k0oA/8yoh///////+fXv/6pwO//Lp3v///////Pr4f+oay7y////AP///wD///8A////ANKaZPLSmmT/0ppk/8uJSv/hvJj//////////////////////+G7l//Jhkb/0ppk/96nc//fqXX/x4xO/6dkFP+QSQD/llEA/5xXAP+USgD/yaOA///////38uv/qG05/8ijdv//////8efb/6ZpLPL///8A////AP///wD///8A0ppk8tKaZP/SmmT/zIxO/9yxh///////////////////////7dbA/8iEQf/Sm2X/0Zlj/9ScZv/eqHf/2KJv/7yAQf+XTgD/iToA/5lSAP+JNgD/yKFv/611LP+HNQD/jT8A/8qmeP+kZRT/jT4A8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/Pk1n/1J5q//78+//////////////////+/fv/1aFv/8iEQv/Tm2b/0ppl/9GZY//Wn2z/1pZc/9eldf/Bl2b/kUcA/4w9AP+OQAD/lUwA/59eAP+cWQD/jT8A/5ZOAP+eXADy////AP///wD///8A////ANKaZPLSmmT/0ppk/9KZY//KiEn/8d/P///////////////////////47+f/05tm/8iCP//KiEj/yohJ/8eCP//RmGH//vfy///////n1sP/rXQ7/4k4AP+TTAD/nVoA/5xYAP+cVwD/nFgA/5xYAPL///8A////AP///wD///8A0ppk8tKaZP/SmmT/0ptl/8uLTf/aq37////////////////////////////+/fz/6c2y/961jv/etY7/6Myx//78+v//////////////////////3MWv/5xXD/+ORAD/mFQA/51ZAP+cWAD/nFgA8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/SmmT/0ppk/8mFRP/s1b//////////////////////////////////////////////////////////////////////////////+PD/0JFU/7NzMv+WUQD/kUsA/5tXAP+dWQDy////AP///wD///8A////ANKaZP/SmmT/0ppk/9KaZP/Sm2X/z5NZ/8yMT//z5NX/////////////////////////////////////////////////////////////////9Ofa/8yNUP/UmGH/36p5/8yTWv+qaSD/kksA/5ROAPz///8A////AP///wD///8A0ppk5NKaZP/SmmT/0ppk/9KaZP/TnGf/zY9T/82OUv/t1sD//////////////////////////////////////////////////////+7Yw//OkFX/zI5R/9OcZ//SmmP/26V0/9ymdf/BhUf/ol8R6P///wD///8A////AP///wDSmmQ80ppk9tKaZP/SmmT/0ppk/9KaZP/TnGj/zpFW/8qJSv/dson/8uHS//////////////////////////////////Lj0//etIv/y4lL/86QVf/TnGj/0ppk/9KaZP/RmWP/05xn/9ymdfjUnWdC////AP///wD///8A////ANKaZADSmmQc0ppkotKaZP/SmmT/0ppk/9KaZP/Tm2b/0Zli/8qJSf/NjlH/16Z3/+G8mP/myKr/5
siq/+G8mP/Xp3f/zY5S/8qISf/RmGH/05tm/9KaZP/SmmT/0ppk/9KaZP/SmmSm0pljINWdaQD///8A////AP///wD///8A0ppkANKaZADSmmQA0ppkQtKaZMrSmmT/0ppk/9KaZP/SmmT/0ptl/9GYYf/Nj1P/y4lL/8qISP/KiEj/y4lK/82PU//RmGH/0ptl/9KaZP/SmmT/0ppk/9KaZP/SmmTO0ppkRtKaZADSmmQA0ppkAP///wD///8A////AP///wDSmmQA0ppkANKaZADSmmQA0ppkANKaZGzSmmTu0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmTw0ppkcNKaZADSmmQA0ppkANKaZADSmmQA////AP///wD///8A////ANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZBLSmmSQ0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppklNKaZBTSmmQA0ppkANKaZADSmmQA0ppkANKaZAD///8A////AP///wD///8A0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQy0ppkutKaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppkvtKaZDbSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkAP///wD///8A////AP///wDSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkXNKaZODSmmT/0ppk/9KaZP/SmmT/0ppk5NKaZGDSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA////AP///wD///8A////ANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkBtKaZIbSmmTo0ppk6tKaZIrSmmQK0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZAD///8A////AP/8P///+B///+AH//+AAf//AAD//AAAP/AAAA/gAAAHwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA+AAAAfwAAAP/AAAP/8AAP//gAH//+AH///4H////D//" rel="icon" />
  
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
</head>
<body>
<div class="wrapper">
<header id="title-block-header">
<h1 class="title" style="text-align:center">Unicode in the Library, Part
2: Normalization</h1>
<table style="border:none;float:right">
  <tr>
    <td>Document #:</td>
    <td>P2729R0</td>
  </tr>
  <tr>
    <td>Date:</td>
    <td>2022-11-20</td>
  </tr>
  <tr>
    <td style="vertical-align:top">Project:</td>
    <td>Programming Language C++</td>
  </tr>
  <tr>
    <td style="vertical-align:top">Audience:</td>
    <td>
      SG-16 Unicode<br>
      LEWG-I<br>
      LEWG<br>
    </td>
  </tr>
  <tr>
    <td style="vertical-align:top">Reply-to:</td>
    <td>
      Zach Laine<br>&lt;<a href="mailto:whatwasthataddress@gmail.com" class="email">whatwasthataddress@gmail.com</a>&gt;<br>
    </td>
  </tr>
</table>
</header>
<div style="clear:both">
<div id="TOC" role="doc-toc">
<h1 id="toctitle">Contents</h1>
<ul>
<li><a href="#motivation" id="toc-motivation"><span class="toc-section-number">1</span> Motivation<span></span></a></li>
<li><a href="#the-shortest-unicode-normalization-primer-i-can-manage" id="toc-the-shortest-unicode-normalization-primer-i-can-manage"><span class="toc-section-number">2</span> The shortest Unicode normalization
primer I can manage<span></span></a></li>
<li><a href="#the-stream-safe-format" id="toc-the-stream-safe-format"><span class="toc-section-number">3</span> The stream-safe
format<span></span></a>
<ul>
<li><a href="#unicode-reference" id="toc-unicode-reference"><span class="toc-section-number">3.1</span> Unicode
reference<span></span></a></li>
</ul></li>
<li><a href="#use-cases" id="toc-use-cases"><span class="toc-section-number">4</span> Use cases<span></span></a>
<ul>
<li><a href="#case-1-normalize-a-sequence-of-code-points-to-nfc" id="toc-case-1-normalize-a-sequence-of-code-points-to-nfc"><span class="toc-section-number">4.1</span> Case 1: Normalize a sequence of
code points to NFC<span></span></a></li>
<li><a href="#case-2-normalize-a-sequence-of-code-points-to-nfc-where-the-output-is-going-into-a-string-like-container" id="toc-case-2-normalize-a-sequence-of-code-points-to-nfc-where-the-output-is-going-into-a-string-like-container"><span class="toc-section-number">4.2</span> Case 2: Normalize a sequence of
code points to NFC, where the output is going into a string-like
container<span></span></a></li>
<li><a href="#case-3-modify-some-normalized-text-without-breaking-normalization" id="toc-case-3-modify-some-normalized-text-without-breaking-normalization"><span class="toc-section-number">4.3</span> Case 3: Modify some normalized
text without breaking normalization<span></span></a></li>
</ul></li>
<li><a href="#proposed-design" id="toc-proposed-design"><span class="toc-section-number">5</span> Proposed design<span></span></a>
<ul>
<li><a href="#dependencies" id="toc-dependencies"><span class="toc-section-number">5.1</span> Dependencies<span></span></a></li>
<li><a href="#add-unicode-version-observers" id="toc-add-unicode-version-observers"><span class="toc-section-number">5.2</span> Add Unicode version
observers<span></span></a></li>
<li><a href="#add-stream-safe-operations" id="toc-add-stream-safe-operations"><span class="toc-section-number">5.3</span> Add stream-safe
operations<span></span></a>
<ul>
<li><a href="#add-stream-safe-algorithms" id="toc-add-stream-safe-algorithms"><span class="toc-section-number">5.3.1</span> Add stream-safe
algorithms<span></span></a></li>
<li><a href="#add-stream_safe_iterator" id="toc-add-stream_safe_iterator"><span class="toc-section-number">5.3.2</span> Add
<code class="sourceCode default">stream_safe_iterator</code><span></span></a></li>
<li><a href="#add-stream_safe_view-and-as_stream_safe" id="toc-add-stream_safe_view-and-as_stream_safe"><span class="toc-section-number">5.3.3</span> Add
<code class="sourceCode default">stream_safe_view</code> and
<code class="sourceCode default">as_stream_safe()</code><span></span></a></li>
</ul></li>
<li><a href="#add-concepts-that-describe-the-constraints-on-parameters-to-the-normalization-api" id="toc-add-concepts-that-describe-the-constraints-on-parameters-to-the-normalization-api"><span class="toc-section-number">5.4</span> Add concepts that describe the
constraints on parameters to the normalization API<span></span></a></li>
<li><a href="#add-an-enumeration-listing-the-supported-normalization-forms" id="toc-add-an-enumeration-listing-the-supported-normalization-forms"><span class="toc-section-number">5.5</span> Add an enumeration listing the
supported normalization forms<span></span></a></li>
<li><a href="#add-a-generic-normalization-algorithm" id="toc-add-a-generic-normalization-algorithm"><span class="toc-section-number">5.6</span> Add a generic normalization
algorithm<span></span></a></li>
<li><a href="#add-an-append-version-of-the-normalization-algorithm" id="toc-add-an-append-version-of-the-normalization-algorithm"><span class="toc-section-number">5.7</span> Add an append version of the
normalization algorithm<span></span></a></li>
<li><a href="#add-normalization-aware-insertion-erasure-and-replacement-operations-on-strings" id="toc-add-normalization-aware-insertion-erasure-and-replacement-operations-on-strings"><span class="toc-section-number">5.8</span> Add normalization-aware insertion,
erasure, and replacement operations on strings<span></span></a></li>
<li><a href="#add-a-feature-test-macro" id="toc-add-a-feature-test-macro"><span class="toc-section-number">5.9</span> Add a feature test
macro<span></span></a></li>
</ul></li>
<li><a href="#implementation-experience" id="toc-implementation-experience"><span class="toc-section-number">6</span> Implementation
experience<span></span></a>
<ul>
<li><a href="#tldr" id="toc-tldr"><span class="toc-section-number">6.1</span> tl;dr<span></span></a></li>
</ul></li>
</ul>
</div>
<h1 data-number="1" id="motivation"><span class="header-section-number">1</span> Motivation<a href="#motivation" class="self-link"></a></h1>
<p>I’m proposing normalization interfaces that meet certain design
requirements that I think are important; I hope you’ll agree:</p>
<ul>
<li><p>Ranges are the future. We should have range-friendly ways of
doing normalization. This includes support for sentinels and lazy
views.</p></li>
<li><p>Iterators are the present. We should support generic programming,
whether it is done in terms of pointers, a particular iterator, or an
iterator type specified as a template parameter.</p></li>
<li><p>A null-terminated string should not be treated as a special case.
The ubiquity of such strings means that they should be treated as
first-class strings.</p></li>
<li><p>If there’s a specific algorithm specialization that operates
directly on UTF-8 or UTF-16, the top-level algorithm should use that
when appropriate. This is analogous to having multiple implementations
of the algorithms in <code class="sourceCode default">std</code> that
differ based on iterator category.</p></li>
<li><p>Input may come from UTF-8, UTF-16, or UTF-32 strings (though
UTF-32 is extremely uncommon in practice). There should be a single
overload of each normalization function, so that the user does not need
to change code when the input is changed from UTF-N to UTF-M. The most
efficient version of the algorithm (processing either UTF-8 or UTF-16)
will be selected (as mentioned above).</p></li>
<li><p>The Unicode algorithms are low-level tools that most C++ users
will not need to touch, even if their code needs to be Unicode-aware.
C++ users should also be provided higher-level, string-like abstractions
(provisionally called <code class="sourceCode default">std::text</code>)
that will handle all the messy Unicode details, leaving C++ users to
think about their program instead of Unicode.</p></li>
</ul>
<h1 data-number="2" id="the-shortest-unicode-normalization-primer-i-can-manage"><span class="header-section-number">2</span> The shortest Unicode
normalization primer I can manage<a href="#the-shortest-unicode-normalization-primer-i-can-manage" class="self-link"></a></h1>
<p>You can have different strings of code points that mean the same
thing. For example, you could have the code point “ä” (U+00E4 Latin
Small Letter A with Diaeresis), or you could have the two code points
“a” (U+0061 Latin Small Letter A) and “¨̈” (U+0308 Combining Diaeresis).
The former represents “ä” as a single code point, the latter as two.
Unicode considers the two sequences canonically equivalent, so both
must be treated as the same text.</p>
<p>To make such comparisons more efficient, Unicode has normalization
forms. If all the text you ever compare is in the same normalization
form, it doesn’t matter whether that form is composed, like “ä”, or
decomposed, like “a¨”: strings that represent the same text will
compare bitwise-equal to each other.</p>
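<p>As a rough illustration of the point above, here is a minimal sketch
that is not part of the proposed API. It shows that the two spellings of
“ä” are different code point sequences until both are brought to the same
form. The function <code class="sourceCode default">toy_compose</code> is
hypothetical and knows about this one pair only; a real NFC
implementation needs the full Unicode decomposition and composition
data.</p>

```cpp
#include <cassert>
#include <string>

// The three code points from the example in the text.
constexpr char32_t latin_small_a = 0x0061;        // "a"
constexpr char32_t combining_diaeresis = 0x0308;  // combining diaeresis
constexpr char32_t a_with_diaeresis = 0x00E4;     // precomposed "ä"

// HYPOTHETICAL toy composition step: it handles ONLY the pair
// <U+0061, U+0308> -> U+00E4 and copies everything else unchanged.
// It is an illustration, not an implementation of NFC.
std::u32string toy_compose(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == combining_diaeresis && !out.empty() &&
            out.back() == latin_small_a) {
            out.back() = a_with_diaeresis;  // merge the pair into one code point
        } else {
            out.push_back(cp);
        }
    }
    return out;
}
```

<p>Before composition, the one-code-point and two-code-point spellings
compare unequal even though they represent the same text. After both are
passed through <code class="sourceCode default">toy_compose</code>, a
plain code-point-by-code-point comparison is enough.</p>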
<p>There are four official normalization forms. The first two are NFC
(“Normalization Form: Composed”), and NFD (“Normalization Form:
Decomposed”). There are these two other forms NFKC and NFKD that you can
safely ignore; they are seldom-used variants of NFC and NFD,
respectively.</p>
<p>NFC is the most compact of these four forms. It is near-ubiquitous on
the web, as W3C recommends that web sites use it exclusively.</p>
<p>There’s this other form FCC, too. It’s really close to NFC, except
that it is not as compact in some corner cases (though it is identical
to NFC in most cases). It’s really handy when doing something called
collation, which is not yet proposed. It’s coming, though.</p>
<h1 data-number="3" id="the-stream-safe-format"><span class="header-section-number">3</span> The stream-safe format<a href="#the-stream-safe-format" class="self-link"></a></h1>
<p>Unicode text often contains sequences in which a noncombining code
point (e.g. ‘A’) is followed by one or more combining code points
(e.g. some number of umlauts). It is valid, though not useful, to have
an ‘A’ followed by 100 million umlauts. Unicode specifies something
called the Stream-Safe Format. Text in this format has an extra code
point (U+034F Combining Grapheme Joiner) inserted between combiners
wherever that is needed to ensure that there are never more than 30
combiners in a row. In practice, you should never need anywhere near 30
to represent meaningful text.</p>
<p>Long sequences of combining characters create a problem for
algorithms like normalization and grapheme breaking; the grapheme
breaking algorithm may be required to look ahead a very long way in
order to determine how to handle the current grapheme. To address this,
Unicode allows a conforming implementation to assume that a sequence of
code points contains graphemes of at most 31 code points. This is known
as the Stream-Safe Format assumption. All the proposed interfaces here
and in the papers to come make this assumption.</p>
<p>The stream-safe format is very important. Its use prevents the
Unicode algorithms from having to worry about unbounded-length
graphemes. This in turn allows the Unicode algorithms to use side
buffers of a small and fixed size to do their operations, which obviates
the need for most memory allocations.</p>
<p>For more info on the stream-safe format, see the appropriate <a href="https://unicode.org/reports/tr15/#Stream_Safe_Text_Format">part of
UAX15</a>.</p>
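<p>The insertion rule described in this section can be sketched as
follows. This is an illustration under simplifying assumptions, not the
interface proposed in this paper. In particular,
<code class="sourceCode default">toy_is_nonstarter</code> is a
hypothetical stand-in that treats only the block U+0300 through U+036F
(minus U+034F itself) as combiners, whereas UAX15 defines the process in
terms of the canonical combining classes of the decomposed text.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+034F COMBINING GRAPHEME JOINER: the code point that the stream-safe
// process in UAX15 inserts to break up long runs of combiners.
constexpr char32_t cgj = 0x034F;
constexpr std::size_t max_nonstarters = 30;

// ASSUMPTION: crude stand-in for a canonical-combining-class lookup.
// Only the Combining Diacritical Marks block is recognized; CGJ itself
// does not count as a combiner here.
bool toy_is_nonstarter(char32_t cp) {
    return 0x0300 <= cp && cp <= 0x036F && cp != cgj;
}

// Copies `in`, inserting a CGJ whenever a run of combiners would
// otherwise exceed max_nonstarters.
std::u32string toy_stream_safe(const std::u32string& in) {
    std::u32string out;
    std::size_t run = 0;  // length of the current run of combiners
    for (char32_t cp : in) {
        if (toy_is_nonstarter(cp)) {
            if (run == max_nonstarters) {
                out.push_back(cgj);  // break the run before it reaches 31
                run = 0;
            }
            ++run;
        } else {
            run = 0;  // any other code point ends the run
        }
        out.push_back(cp);
    }
    return out;
}
```

<p>With this sketch, an ‘A’ followed by 30 combiners is copied unchanged,
while an ‘A’ followed by 31 combiners gains one U+034F before the last
combiner. The loop needs only a single counter and no lookahead.</p>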
<h2 data-number="3.1" id="unicode-reference"><span class="header-section-number">3.1</span> Unicode reference<a href="#unicode-reference" class="self-link"></a></h2>
<p>See <a href="https://unicode.org/reports/tr15">UAX15 Unicode
Normalization Forms</a> for more information on Unicode
normalization.</p>
<h1 data-number="4" id="use-cases"><span class="header-section-number">4</span> Use cases<a href="#use-cases" class="self-link"></a></h1>
<h2 data-number="4.1" id="case-1-normalize-a-sequence-of-code-points-to-nfc"><span class="header-section-number">4.1</span> Case 1: Normalize a sequence of
code points to NFC<a href="#case-1-normalize-a-sequence-of-code-points-to-nfc" class="self-link"></a></h2>
<p>We want to make a normalized copy of
<code class="sourceCode default">s</code>, and we want the underlying
implementation to use a specialized UTF-8 version of the normalization
algorithm. This is the most flexible and general-purpose API.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>std<span class="op">::</span>string s <span class="op">=</span> <span class="co">/* ... */</span>; <span class="co">// using a std::string to store UTF-8</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="ot">assert</span><span class="op">(!</span>std<span class="op">::</span>uc<span class="op">::</span>is_normalized<span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>s<span class="op">)))</span>;</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="dt">char</span> <span class="op">*</span> nfc_s <span class="op">=</span> <span class="kw">new</span> <span class="dt">char</span><span class="op">[</span>s<span class="op">.</span>size<span class="op">()</span> <span class="op">*</span> <span class="dv">2</span><span class="op">]</span>;</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="co">// Have to use as_utf32(), because normalization operates on code points, not UTF-8.</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a><span class="kw">auto</span> out <span class="op">=</span> std<span class="op">::</span>uc<span class="op">::</span>normalize<span class="op">&lt;</span>std<span class="op">::</span>uc<span class="op">::</span>nf<span class="op">::</span>c<span class="op">&gt;(</span>std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>s<span class="op">)</span>, nfc_s<span class="op">)</span>;</span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a><span class="op">*</span>out <span class="op">=</span> <span class="ch">&#39;</span><span class="sc">\0</span><span class="ch">&#39;</span>;</span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a><span class="ot">assert</span><span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>is_normalized<span class="op">(</span>nfc_s, out<span class="op">))</span>;</span></code></pre></div>
<h2 data-number="4.2" id="case-2-normalize-a-sequence-of-code-points-to-nfc-where-the-output-is-going-into-a-string-like-container"><span class="header-section-number">4.2</span> Case 2: Normalize a sequence of
code points to NFC, where the output is going into a string-like
container<a href="#case-2-normalize-a-sequence-of-code-points-to-nfc-where-the-output-is-going-into-a-string-like-container" class="self-link"></a></h2>
<p>This is like the previous case, except that the results must go into
a string-like container, not just any old output iterator. The advantage
of doing things this way is that the code is a lot faster if you can
append the results in chunks.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>std<span class="op">::</span>string s <span class="op">=</span> <span class="co">/* ... */</span>; <span class="co">// using a std::string to store UTF-8</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="ot">assert</span><span class="op">(!</span>std<span class="op">::</span>uc<span class="op">::</span>is_normalized<span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>s<span class="op">)))</span>;</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>std<span class="op">::</span>string nfc_s;</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>nfc_s<span class="op">.</span>reserve<span class="op">(</span>s<span class="op">.</span>size<span class="op">())</span>;</span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="co">// Have to use as_utf32(), because normalization operates on code points, not UTF-8.</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>std<span class="op">::</span>uc<span class="op">::</span>normalize_append<span class="op">&lt;</span>std<span class="op">::</span>uc<span class="op">::</span>nf<span class="op">::</span>c<span class="op">&gt;(</span>std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>s<span class="op">)</span>, nfc_s<span class="op">)</span>;</span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="ot">assert</span><span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>is_normalized<span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>nfc_s<span class="op">)))</span>;</span></code></pre></div>
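<p>To illustrate why appending in chunks can beat writing through a generic
output iterator one element at a time, here is a minimal sketch that does not
use the proposed API at all. The names
<code class="sourceCode default">encode_utf8</code> and
<code class="sourceCode default">append_utf8_chunked</code>, and the 64-byte
buffer size, are assumptions made for illustration. The sketch only transcodes
code points to UTF-8 and performs no normalization; the point is that the
string sees one <code class="sourceCode default">append()</code> per buffer
rather than one write per code unit.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: writes the UTF-8 encoding of cp to out and returns the
// number of code units written (1 to 4). Assumes cp is a valid scalar value.
inline std::size_t encode_utf8(char32_t cp, char* out) {
    if (cp < 0x80) {
        out[0] = static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800) {
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = static_cast<char>(0xE0 | (cp >> 12));
        out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = static_cast<char>(0xF0 | (cp >> 18));
    out[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out[3] = static_cast<char>(0x80 | (cp & 0x3F));
    return 4;
}

// Hypothetical helper: appends cps to out as UTF-8, flushing a small local
// buffer with a single append() call whenever it is nearly full.
inline void append_utf8_chunked(const std::u32string& cps, std::string& out) {
    char buf[64];
    std::size_t used = 0;
    for (char32_t cp : cps) {
        if (sizeof(buf) - used < 4) { // not enough room for the longest encoding
            out.append(buf, used);
            used = 0;
        }
        used += encode_utf8(cp, buf + used);
    }
    out.append(buf, used); // flush whatever is left
}

inline void usage_example() {
    const std::u32string cps = {U'o', char32_t{0x0308}, char32_t{0x00F6}};
    std::string out;
    out.reserve(cps.size() * 4);
    append_utf8_chunked(cps, out);
    assert(out == "o\xCC\x88\xC3\xB6");
}
```

<p>An output-iterator interface cannot batch writes this way, because all it
knows about the destination is how to assign one element and advance.</p>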
<h2 data-number="4.3" id="case-3-modify-some-normalized-text-without-breaking-normalization"><span class="header-section-number">4.3</span> Case 3: Modify some normalized
text without breaking normalization<a href="#case-3-modify-some-normalized-text-without-breaking-normalization" class="self-link"></a></h2>
<p>You cannot modify arbitrary text that is already normalized without
risking breaking the normalization. For instance, let’s say I have some
NFC-normalized text. That means that all the combining code points that
could combine with one or more preceding code points have already done
so. For instance, if I see “ä” in the NFC text, then I know it’s code
point U+00E4 “Latin Small Letter A with Diaeresis”, <em>not</em> some
combination of “a” and a combining diaeresis.</p>
<p>Now, forget about the “ä” I just gave as an example. Let’s say that I
want to insert a single code point, U+0308 Combining Diaeresis, into NFC
text. Let’s also say that the insertion position is right after
a letter “o”. If I do the insertion and then walk away, I will have
broken the NFC normalization, because “o” followed by U+0308 is supposed to
combine to form “ö” (U+00F6 Latin Small Letter O with Diaeresis).</p>
<p>Similar things can happen when deleting text – sometimes the deletion
can leave two code points next to each other that should interact in
some way that did not apply when they were separate, before the
deletion.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>std<span class="op">::</span>string s <span class="op">=</span> <span class="co">/* ... */</span>;                            <span class="co">// using a std::string to store UTF-8</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="ot">assert</span><span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>is_normalized<span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>s<span class="op">)))</span>; <span class="co">// already normalized</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a>std<span class="op">::</span>string insertion <span class="op">=</span> <span class="co">/* ... */</span>;</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>std<span class="op">::</span>uc<span class="op">::</span>normalize_insert<span class="op">&lt;</span>std<span class="op">::</span>uc<span class="op">::</span>nf<span class="op">::</span>c<span class="op">&gt;(</span>s, s<span class="op">.</span>begin<span class="op">()</span> <span class="op">+</span> <span class="dv">2</span>, std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>insertion<span class="op">))</span>;</span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="ot">assert</span><span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>is_normalized<span class="op">(</span>std<span class="op">::</span>uc<span class="op">::</span>as_utf32<span class="op">(</span>s<span class="op">)))</span>;</span></code></pre></div>
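<p>The “o” plus U+0308 scenario can be made concrete with a toy model. This is
a sketch only: <code class="sourceCode default">toy_compose</code>,
<code class="sourceCode default">toy_is_composed</code>, and
<code class="sourceCode default">toy_insert</code> are hypothetical names, the
composition table knows exactly one pair, and real NFC also involves
decomposition and canonical reordering, which are ignored here. It shows a
naive insertion breaking the composed form, and a boundary-aware insertion
repairing it.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr char32_t combining_diaeresis = 0x0308;

// Toy composition table: knows only "o" + U+0308 -> U+00F6.
// Returns 0 when the pair does not compose (in this toy model).
inline char32_t toy_compose(char32_t first, char32_t second) {
    if (first == U'o' && second == combining_diaeresis) return 0x00F6;
    return 0;
}

// True if no adjacent pair in text could still be composed by the toy table.
inline bool toy_is_composed(const std::u32string& text) {
    for (std::size_t i = 1; i < text.size(); ++i)
        if (toy_compose(text[i - 1], text[i]) != 0) return false;
    return true;
}

// Inserts cp at pos, then re-examines the pairs that the insertion made
// adjacent (before and after the new code point), composing where possible.
inline void toy_insert(std::u32string& text, std::size_t pos, char32_t cp) {
    text.insert(pos, 1, cp);
    std::size_t i = pos > 0 ? pos - 1 : 0;
    std::size_t checks = pos > 0 ? 2 : 1;
    while (checks-- > 0 && i + 1 < text.size()) {
        if (char32_t composed = toy_compose(text[i], text[i + 1])) {
            text[i] = composed;
            text.erase(i + 1, 1);
        } else {
            ++i;
        }
    }
}

inline void usage_example() {
    std::u32string naive = U"hello";
    naive.push_back(combining_diaeresis); // insert and walk away
    assert(!toy_is_composed(naive));

    std::u32string repaired = U"hello";
    toy_insert(repaired, repaired.size(), combining_diaeresis);
    assert(toy_is_composed(repaired));
    assert(repaired.size() == 5 && repaired.back() == 0x00F6);
}
```

<p>The second check in <code class="sourceCode default">toy_insert</code>
covers the mirror-image case, where a starter is inserted directly in front of
a combining mark that had nothing to combine with before.</p>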
<h1 data-number="5" id="proposed-design"><span class="header-section-number">5</span> Proposed design<a href="#proposed-design" class="self-link"></a></h1>
<h2 data-number="5.1" id="dependencies"><span class="header-section-number">5.1</span> Dependencies<a href="#dependencies" class="self-link"></a></h2>
<p>This proposal depends on the existence of <a href="https://isocpp.org/files/papers/P2728R0.html">P2728</a> “Unicode
in the Library, Part 1: UTF Transcoding”.</p>
<h2 data-number="5.2" id="add-unicode-version-observers"><span class="header-section-number">5.2</span> Add Unicode version observers<a href="#add-unicode-version-observers" class="self-link"></a></h2>
<div class="sourceCode" id="cb4"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">inline</span> <span class="kw">constexpr</span> major_version <span class="op">=</span> <em>implementation defined</em>;</span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>  <span class="kw">inline</span> <span class="kw">constexpr</span> minor_version <span class="op">=</span> <em>implementation defined</em>;</span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>  <span class="kw">inline</span> <span class="kw">constexpr</span> patch_version <span class="op">=</span> <em>implementation defined</em>;</span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Unlike <a href="https://isocpp.org/files/papers/P2728R0.html">P2728</a> (Unicode
Part 1), the interfaces in this proposal refer to parts of the Unicode
standard that are allowed to change over time. The normalization of code
points is unlikely to change in Unicode N from what it was for those
same code points in Unicode N-1, but since new code points are
introduced with each new Unicode release, the normalization algorithms
must be updated to keep up.</p>
<p>I’m proposing that implementations provide support for whatever
version of Unicode they like, as long as they document which one is
supported via
<code class="sourceCode default">major_</code>-/<code class="sourceCode default">minor_</code>-/<code class="sourceCode default">patch_version</code>.</p>
<h2 data-number="5.3" id="add-stream-safe-operations"><span class="header-section-number">5.3</span> Add stream-safe operations<a href="#add-stream-safe-operations" class="self-link"></a></h2>
<p>As mentioned above, I consider most of the Unicode algorithms
presented in this proposal and the proposals to come to be low-level
tools that most C++ users will not need to touch. I would instead like
to see most C++ users use a higher-level, string-like abstraction
(provisionally called <code class="sourceCode default">std::text</code>)
that will handle all the messy Unicode details, leaving C++ users to
think about their program instead of Unicode. As such, most of the
interfaces in this proposal assume that their input is in stream-safe
format, but they do not enforce that. The exceptions are the
<code class="sourceCode default">normalize_insert</code>/-<code class="sourceCode default">_erase</code>/-<code class="sourceCode default">_replace</code>
algorithms, which are designed to be operations with which something
like <code class="sourceCode default">std::text</code> may be built.
These interfaces do not assume that inserted text is stream-safe; in
fact, they put inserted text <em>into</em> stream-safe format.</p>
<p>So, if users use something like <code class="sourceCode default">std::uc::normalize&lt;std::uc::nf::d&gt;()</code>,
they may know <em>a priori</em> that the input is in stream-safe format,
or they may not. If they do not, they can use the stream-safe operations
to meet the stream-safe precondition of the call to <code class="sourceCode default">std::uc::normalize&lt;std::uc::nf::d&gt;()</code>.</p>
<h3 data-number="5.3.1" id="add-stream-safe-algorithms"><span class="header-section-number">5.3.1</span> Add stream-safe algorithms<a href="#add-stream-safe-algorithms" class="self-link"></a></h3>
<div class="sourceCode" id="cb5"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>utf_iter I, std<span class="op">::</span>sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S<span class="op">&gt;</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> I stream_safe<span class="op">(</span>I first, S last<span class="op">)</span>;</span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>utf_range_like R<span class="op">&gt;</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <em>range-like-result-iterator</em><span class="op">&lt;</span>R<span class="op">&gt;</span> stream_safe<span class="op">(</span>R <span class="op">&amp;&amp;</span> r<span class="op">)</span>;</span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>utf_iter I, sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S, output_iterator<span class="op">&lt;</span><span class="dt">uint32_t</span><span class="op">&gt;</span> O<span class="op">&gt;</span></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> ranges<span class="op">::</span>copy_result<span class="op">&lt;</span>I, O<span class="op">&gt;</span> stream_safe_copy<span class="op">(</span>I first, S last, O out<span class="op">)</span>;</span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>utf_range_like R, output_iterator<span class="op">&lt;</span><span class="dt">uint32_t</span><span class="op">&gt;</span> O<span class="op">&gt;</span></span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> ranges<span class="op">::</span>copy_result<span class="op">&lt;</span><em>range-like-result-iterator</em><span class="op">&lt;</span>R<span class="op">&gt;</span>, O<span class="op">&gt;</span></span>
<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a>      stream_safe_copy<span class="op">(</span>R <span class="op">&amp;&amp;</span> r, O out<span class="op">)</span>;</span>
<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>utf_iter I, sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S<span class="op">&gt;</span></span>
<span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">bool</span> is_stream_safe<span class="op">(</span>I first, S last<span class="op">)</span>;</span>
<span id="cb5-17"><a href="#cb5-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-18"><a href="#cb5-18" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>utf_range_like R<span class="op">&gt;</span></span>
<span id="cb5-19"><a href="#cb5-19" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">bool</span> is_stream_safe<span class="op">(</span>R <span class="op">&amp;&amp;</span> r<span class="op">)</span>;</span>
<span id="cb5-20"><a href="#cb5-20" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p><code class="sourceCode default">stream_safe()</code> is like
<code class="sourceCode default">std::remove_if()</code> and related
algorithms. It writes the stream-safe subset of the given range into the
beginning, and leaves junk at the end. It returns the iterator to the
first junk element.</p>
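<p>The <code class="sourceCode default">std::remove_if()</code>-style contract
can be sketched as follows. This is an illustration under loud assumptions, not
the proposed implementation: <code class="sourceCode default">toy_ccc</code> is
a hypothetical stand-in that pretends every code point in U+0300..U+036F is a
nonstarter (real combining-class data is more nuanced), and
<code class="sourceCode default">toy_stream_safe</code> simply drops every
nonstarter past the 30th in a run.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a combining-class lookup. Toy data only: treats
// every code point in U+0300..U+036F as a nonstarter (nonzero), all else as 0.
inline int toy_ccc(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) ? 1 : 0;
}

constexpr std::size_t max_nonstarters = 30;

// Compacts the kept elements toward the front of [first, last) and returns
// the new logical end; elements from there to last are left as junk.
template <class It>
It toy_stream_safe(It first, It last) {
    It out = first;
    std::size_t run = 0; // length of the current run of nonstarters
    for (; first != last; ++first) {
        if (toy_ccc(*first) == 0) {
            run = 0;
        } else if (++run > max_nonstarters) {
            continue; // drop nonstarters past the 30th in this run
        }
        *out = *first;
        ++out;
    }
    return out;
}

inline void usage_example() {
    std::vector<char32_t> v{U'a'};
    v.insert(v.end(), 40, char32_t{0x0301}); // 40 combining acute accents
    v.push_back(U'b');

    auto new_end = toy_stream_safe(v.begin(), v.end());
    v.erase(new_end, v.end()); // erase-remove idiom, as with std::remove_if
    assert(v.size() == 32);    // 'a' + 30 nonstarters + 'b'
}
```

<p>As with <code class="sourceCode default">std::remove_if()</code>, the caller
decides what to do with the tail; here it is erased from the container.</p>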
<p>Note that <code class="sourceCode default"><em>range-like-result-iterator</em>&lt;R&gt;</code>
comes from <a href="https://isocpp.org/files/papers/P2728R0.html">P2728</a>. It
provides a <code class="sourceCode default">ranges::borrowed_iterator_t&lt;R&gt;</code>
or just a pointer, as appropriate, based on
<code class="sourceCode default">R</code>.</p>
<h3 data-number="5.3.2" id="add-stream_safe_iterator"><span class="header-section-number">5.3.2</span> Add
<code class="sourceCode default">stream_safe_iterator</code><a href="#add-stream_safe_iterator" class="self-link"></a></h3>
<div class="sourceCode" id="cb6"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="dt">int</span> <em>uc-ccc</em><span class="op">(</span><span class="dt">uint32_t</span> cp<span class="op">)</span>; <span class="co">// <em>exposition only</em></span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>code_point_iter I, sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S <span class="op">=</span> I<span class="op">&gt;</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">struct</span> stream_safe_iterator</span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>    <span class="op">:</span> iterator_interface<span class="op">&lt;</span>stream_safe_iterator<span class="op">&lt;</span>I, S<span class="op">&gt;</span>, forward_iterator_tag, <span class="dt">uint32_t</span>, <span class="dt">uint32_t</span><span class="op">&gt;</span> <span class="op">{</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> stream_safe_iterator<span class="op">()</span> <span class="op">=</span> <span class="cf">default</span>;</span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> stream_safe_iterator<span class="op">(</span>I first, S last<span class="op">)</span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a>      <span class="op">:</span> first_<span class="op">(</span>first<span class="op">)</span>, it_<span class="op">(</span>first<span class="op">)</span>, last_<span class="op">(</span>last<span class="op">)</span>,</span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a>        nonstarters_<span class="op">(</span>it_ <span class="op">!=</span> last_ <span class="op">&amp;&amp;</span> <em>uc-ccc</em><span class="op">(*</span>it_<span class="op">)</span> <span class="op">?</span> <span class="dv">1</span> <span class="op">:</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a>        <span class="op">{}</span></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">uint32_t</span> <span class="kw">operator</span><span class="op">*()</span> <span class="kw">const</span>;</span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> I base<span class="op">()</span> <span class="kw">const</span> <span class="op">{</span> <span class="cf">return</span> it_; <span class="op">}</span></span>
<span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> stream_safe_iterator<span class="op">&amp;</span> <span class="kw">operator</span><span class="op">++()</span>;</span>
<span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-19"><a href="#cb6-19" aria-hidden="true" tabindex="-1"></a>    <span class="kw">friend</span> <span class="kw">constexpr</span> <span class="dt">bool</span> <span class="kw">operator</span><span class="op">==(</span>stream_safe_iterator lhs, stream_safe_iterator rhs<span class="op">)</span></span>
<span id="cb6-20"><a href="#cb6-20" aria-hidden="true" tabindex="-1"></a>      <span class="op">{</span> <span class="cf">return</span> lhs<span class="op">.</span>it_ <span class="op">==</span> rhs<span class="op">.</span>it_; <span class="op">}</span></span>
<span id="cb6-21"><a href="#cb6-21" aria-hidden="true" tabindex="-1"></a>    <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> I, <span class="kw">class</span> S<span class="op">&gt;</span></span>
<span id="cb6-22"><a href="#cb6-22" aria-hidden="true" tabindex="-1"></a>      <span class="kw">friend</span> <span class="kw">constexpr</span> <span class="dt">bool</span> <span class="kw">operator</span><span class="op">==(</span><span class="kw">const</span> stream_safe_iterator<span class="op">&lt;</span>I, S<span class="op">&gt;&amp;</span> lhs, S rhs<span class="op">)</span></span>
<span id="cb6-23"><a href="#cb6-23" aria-hidden="true" tabindex="-1"></a>        <span class="op">{</span> <span class="cf">return</span> lhs<span class="op">.</span>base<span class="op">()</span> <span class="op">==</span> rhs; <span class="op">}</span></span>
<span id="cb6-24"><a href="#cb6-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-25"><a href="#cb6-25" aria-hidden="true" tabindex="-1"></a>    <span class="kw">using</span> base_type <span class="op">=</span>  <span class="co">// <em>exposition only</em></span></span>
<span id="cb6-26"><a href="#cb6-26" aria-hidden="true" tabindex="-1"></a>      iterator_interface<span class="op">&lt;</span>stream_safe_iterator<span class="op">&lt;</span>I, S<span class="op">&gt;</span>, forward_iterator_tag, <span class="dt">uint32_t</span>, <span class="dt">uint32_t</span><span class="op">&gt;</span>;</span>
<span id="cb6-27"><a href="#cb6-27" aria-hidden="true" tabindex="-1"></a>    <span class="kw">using</span> base_type<span class="op">::</span><span class="kw">operator</span><span class="op">++</span>;</span>
<span id="cb6-28"><a href="#cb6-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-29"><a href="#cb6-29" aria-hidden="true" tabindex="-1"></a>  <span class="kw">private</span><span class="op">:</span></span>
<span id="cb6-30"><a href="#cb6-30" aria-hidden="true" tabindex="-1"></a>    I first_;                      <span class="co">// <em>exposition only</em></span></span>
<span id="cb6-31"><a href="#cb6-31" aria-hidden="true" tabindex="-1"></a>    I it_;                         <span class="co">// <em>exposition only</em></span></span>
<span id="cb6-32"><a href="#cb6-32" aria-hidden="true" tabindex="-1"></a>    <span class="op">[[</span><span class="at">no_unique_address</span><span class="op">]]</span> S last_; <span class="co">// <em>exposition only</em></span></span>
<span id="cb6-33"><a href="#cb6-33" aria-hidden="true" tabindex="-1"></a>    <span class="dt">size_t</span> nonstarters_ <span class="op">=</span> <span class="dv">0</span>;       <span class="co">// <em>exposition only</em></span></span>
<span id="cb6-34"><a href="#cb6-34" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span>;</span>
<span id="cb6-35"><a href="#cb6-35" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p><code class="sourceCode default"><em>uc-ccc</em>()</code> returns the
<a href="https://unicode.org/reports/tr44/#Canonical_Combining_Class_Values">Canonical
Combining Class</a>, which indicates how and whether a code point
combines with other code points. For some code point
<code class="sourceCode default">cp</code>,
<code class="sourceCode default"><em>uc-ccc</em>(cp) == 0</code> iff
<code class="sourceCode default">cp</code> is a “starter”/“noncombiner”.
In arbitrary text, any number of “nonstarters”/“combiners” may follow a
starter (remember that the purpose of the stream-safe format is to cap
each such run at 30 combiners).</p>
<p>The behavior of this iterator should be left to the implementation,
as long as the result is in stream-safe format, and does not 1)
change or remove any starters, or 2) change the first 30 nonstarters
after any given starter. The Unicode standard shows a technique for
inserting special dummy-starters (that do not interact with most other
text) every 30 nonstarters, so that the original input is preserved. I
think this is silly – the longest possible meaningful sequence of
nonstarters is 17 code points, and that is only necessary for backwards
compatibility. Most meaningful sequences are much shorter. I think a
more reasonable implementation is simply to truncate any sequence of
nonstarters to 30 code points.</p>
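<p>The truncating behavior described above could look roughly like this when
done lazily, one increment at a time. This is a sketch under assumptions, not
the specification: <code class="sourceCode default">toy_ccc</code> is a
hypothetical stand-in for the combining-class lookup (it pretends all of
U+0300..U+036F are nonstarters), and
<code class="sourceCode default">truncating_iterator</code> is a
pointer-only simplification with an
<code class="sourceCode default">at_end()</code> helper in place of a real
sentinel comparison.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a combining-class lookup. Toy data only: treats
// every code point in U+0300..U+036F as a nonstarter (nonzero), all else as 0.
inline int toy_ccc(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) ? 1 : 0;
}

// Simplified forward-only adaptor over a [it, last) pointer range that skips
// every nonstarter after the 30th in a run.
class truncating_iterator {
public:
    truncating_iterator(const char32_t* it, const char32_t* last)
        : it_(it), last_(last),
          nonstarters_((it != last && toy_ccc(*it) != 0) ? 1 : 0) {}

    char32_t operator*() const { return *it_; }

    truncating_iterator& operator++() {
        ++it_;
        // Once 30 nonstarters have been produced, skip the rest of the run.
        while (it_ != last_ && toy_ccc(*it_) != 0 && nonstarters_ >= 30)
            ++it_;
        if (it_ != last_) {
            if (toy_ccc(*it_) == 0)
                nonstarters_ = 0; // a starter begins a fresh run
            else
                ++nonstarters_;
        }
        return *this;
    }

    bool at_end() const { return it_ == last_; }

private:
    const char32_t* it_;
    const char32_t* last_;
    std::size_t nonstarters_;
};

// Collects the truncated sequence into a new string.
inline std::u32string toy_as_stream_safe(const std::u32string& in) {
    std::u32string out;
    const char32_t* first = in.data();
    const char32_t* last = first + in.size();
    for (truncating_iterator it(first, last); !it.at_end(); ++it)
        out.push_back(*it);
    return out;
}

inline void usage_example() {
    std::u32string in = U"a";
    in.append(40, char32_t{0x0301}); // 40 combining acute accents
    in.push_back(U'b');
    std::u32string out = toy_as_stream_safe(in);
    assert(out.size() == 32); // 'a' + 30 nonstarters + 'b'
    assert(out.front() == U'a' && out.back() == U'b');
}
```

<p>Starters always pass through unchanged, and the first 30 nonstarters after
each starter are preserved, which matches the two constraints listed
above.</p>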
<h3 data-number="5.3.3" id="add-stream_safe_view-and-as_stream_safe"><span class="header-section-number">5.3.3</span> Add
<code class="sourceCode default">stream_safe_view</code> and
<code class="sourceCode default">as_stream_safe()</code><a href="#add-stream_safe_view-and-as_stream_safe" class="self-link"></a></h3>
<div class="sourceCode" id="cb7"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> T<span class="op">&gt;</span></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>  <span class="kw">concept</span> <em>stream-safe-iter</em> <span class="op">=</span> <em>see below</em>;  <span class="co">// <em>exposition only</em></span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> I, std<span class="op">::</span>sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S <span class="op">=</span> I<span class="op">&gt;</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">requires</span> <em>stream-safe-iter</em><span class="op">&lt;</span>I<span class="op">&gt;</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a>  <span class="kw">struct</span> stream_safe_view <span class="op">:</span> view_interface<span class="op">&lt;</span>stream_safe_view<span class="op">&lt;</span>I, S<span class="op">&gt;&gt;</span> <span class="op">{</span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a>    <span class="kw">using</span> iterator <span class="op">=</span> I;</span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a>    <span class="kw">using</span> sentinel <span class="op">=</span> S;</span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> stream_safe_view<span class="op">()</span> <span class="op">{}</span></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> stream_safe_view<span class="op">(</span>iterator first, sentinel last<span class="op">)</span> <span class="op">:</span></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a>      first_<span class="op">(</span>first<span class="op">)</span>, last_<span class="op">(</span>last<span class="op">)</span></span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a>    <span class="op">{}</span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> iterator begin<span class="op">()</span> <span class="kw">const</span> <span class="op">{</span> <span class="cf">return</span> first_; <span class="op">}</span></span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> sentinel end<span class="op">()</span> <span class="kw">const</span> <span class="op">{</span> <span class="cf">return</span> last_; <span class="op">}</span></span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a>    <span class="kw">friend</span> <span class="kw">constexpr</span> <span class="dt">bool</span> <span class="kw">operator</span><span class="op">==(</span>stream_safe_view lhs, stream_safe_view rhs<span class="op">)</span></span>
<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a>      <span class="op">{</span> <span class="cf">return</span> lhs<span class="op">.</span>first_ <span class="op">==</span> rhs<span class="op">.</span>first_ <span class="op">&amp;&amp;</span> lhs<span class="op">.</span>last_ <span class="op">==</span> rhs<span class="op">.</span>last_; <span class="op">}</span></span>
<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a>  <span class="kw">private</span><span class="op">:</span></span>
<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a>    iterator first_;                      <span class="co">// <em>exposition only</em></span></span>
<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a>    <span class="op">[[</span><span class="at">no_unique_address</span><span class="op">]]</span> sentinel last_; <span class="co">// <em>exposition only</em></span></span>
<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span>;</span>
<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a>  <span class="kw">struct</span> <em>as-stream-safe-t</em> <span class="op">:</span> range_adaptor_closure<span class="op">&lt;</span><em>as-stream-safe-t</em><span class="op">&gt;</span> <span class="op">{</span> <span class="co">// <em>exposition only</em></span></span>
<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a>    <span class="kw">template</span><span class="op">&lt;</span>utf_iter I, std<span class="op">::</span>sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S<span class="op">&gt;</span></span>
<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a>      <span class="kw">constexpr</span> <em>unspecified</em> <span class="kw">operator</span><span class="op">()(</span>I first, S last<span class="op">)</span> <span class="kw">const</span>;</span>
<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a>    <span class="kw">template</span><span class="op">&lt;</span>utf_range_like R<span class="op">&gt;</span></span>
<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a>      <span class="kw">constexpr</span> <em>unspecified</em> <span class="kw">operator</span><span class="op">()(</span>R <span class="op">&amp;&amp;</span> r<span class="op">)</span> <span class="kw">const</span>;</span>
<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span>;</span>
<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-34"><a href="#cb7-34" aria-hidden="true" tabindex="-1"></a>  <span class="kw">inline</span> <span class="kw">constexpr</span> <em>as-stream-safe-t</em> as_stream_safe;</span>
<span id="cb7-35"><a href="#cb7-35" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p><code class="sourceCode default"><em>stream-safe-iter</em>&lt;T&gt;</code>
is <code class="sourceCode default">true</code> iff
<code class="sourceCode default">T</code> is a specialization of
<code class="sourceCode default">stream_safe_iterator</code>.</p>
<p><code class="sourceCode default">as_stream_safe(/*...*/)</code>
returns a <code class="sourceCode default">stream_safe_view</code> of
the appropriate type, except that the range overload returns
<code class="sourceCode default">ranges::dangling{}</code> if <code class="sourceCode default">!is_pointer_v&lt;remove_reference_t&lt;R&gt;&gt; &amp;&amp; !ranges::borrowed_range&lt;R&gt;</code>
is <code class="sourceCode default">true</code>. If either overload is
called with a non-common range
<code class="sourceCode default">r</code>, the type of the second
template parameter to
<code class="sourceCode default">stream_safe_view</code> will be
<code class="sourceCode default">decltype(ranges::end(r))</code>,
<em>not</em> a specialization of
<code class="sourceCode default">stream_safe_iterator</code>.</p>
<p><code class="sourceCode default">as_stream_safe</code> can also be
used as a range adaptor, as in
<code class="sourceCode default">r | std::uc::as_stream_safe</code>.</p>
<h2 data-number="5.4" id="add-concepts-that-describe-the-constraints-on-parameters-to-the-normalization-api"><span class="header-section-number">5.4</span> Add concepts that describe the
constraints on parameters to the normalization API<a href="#add-concepts-that-describe-the-constraints-on-parameters-to-the-normalization-api" class="self-link"></a></h2>
<div class="sourceCode" id="cb8"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> T, <span class="kw">class</span> CodeUnit<span class="op">&gt;</span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>  <span class="kw">concept</span> <em>eraseable-insertable-sized-bidi-range</em> <span class="op">=</span> <span class="co">// <em>exposition only</em></span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>    ranges<span class="op">::</span>sized_range<span class="op">&lt;</span>T<span class="op">&gt;</span> <span class="op">&amp;&amp;</span></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a>    ranges<span class="op">::</span>bidirectional_range<span class="op">&lt;</span>T<span class="op">&gt;</span> <span class="op">&amp;&amp;</span></span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">requires</span><span class="op">(</span>T t, <span class="kw">const</span> CodeUnit<span class="op">*</span> it<span class="op">)</span> <span class="op">{</span></span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a>      <span class="op">{</span> t<span class="op">.</span>erase<span class="op">(</span>t<span class="op">.</span>begin<span class="op">()</span>, t<span class="op">.</span>end<span class="op">())</span> <span class="op">}</span> <span class="op">-&gt;</span> same_as<span class="op">&lt;</span>ranges<span class="op">::</span>iterator_t<span class="op">&lt;</span>T<span class="op">&gt;&gt;</span>;</span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a>      <span class="op">{</span> t<span class="op">.</span>insert<span class="op">(</span>t<span class="op">.</span>end<span class="op">()</span>, it, it<span class="op">)</span> <span class="op">}</span> <span class="op">-&gt;</span> same_as<span class="op">&lt;</span>ranges<span class="op">::</span>iterator_t<span class="op">&lt;</span>T<span class="op">&gt;&gt;</span>;</span>
<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span>;</span>
<span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> T<span class="op">&gt;</span></span>
<span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a>    <span class="kw">concept</span> utf8_string <span class="op">=</span></span>
<span id="cb8-13"><a href="#cb8-13" aria-hidden="true" tabindex="-1"></a>      utf8_code_unit<span class="op">&lt;</span>ranges<span class="op">::</span>range_value_t<span class="op">&lt;</span>T<span class="op">&gt;&gt;</span> <span class="op">&amp;&amp;</span></span>
<span id="cb8-14"><a href="#cb8-14" aria-hidden="true" tabindex="-1"></a>      <em>eraseable-insertable-sized-bidi-range</em><span class="op">&lt;</span>T, ranges<span class="op">::</span>range_value_t<span class="op">&lt;</span>T<span class="op">&gt;&gt;</span>;</span>
<span id="cb8-15"><a href="#cb8-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-16"><a href="#cb8-16" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> T<span class="op">&gt;</span></span>
<span id="cb8-17"><a href="#cb8-17" aria-hidden="true" tabindex="-1"></a>    <span class="kw">concept</span> utf16_string <span class="op">=</span></span>
<span id="cb8-18"><a href="#cb8-18" aria-hidden="true" tabindex="-1"></a>      utf16_code_unit<span class="op">&lt;</span>ranges<span class="op">::</span>range_value_t<span class="op">&lt;</span>T<span class="op">&gt;&gt;</span> <span class="op">&amp;&amp;</span></span>
<span id="cb8-19"><a href="#cb8-19" aria-hidden="true" tabindex="-1"></a>      <em>eraseable-insertable-sized-bidi-range</em><span class="op">&lt;</span>T, ranges<span class="op">::</span>range_value_t<span class="op">&lt;</span>T<span class="op">&gt;&gt;</span>;</span>
<span id="cb8-20"><a href="#cb8-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-21"><a href="#cb8-21" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> T<span class="op">&gt;</span></span>
<span id="cb8-22"><a href="#cb8-22" aria-hidden="true" tabindex="-1"></a>    <span class="kw">concept</span> utf_string <span class="op">=</span> utf8_string<span class="op">&lt;</span>T<span class="op">&gt;</span> <span class="op">||</span> utf16_string<span class="op">&lt;</span>T<span class="op">&gt;</span>;</span>
<span id="cb8-23"><a href="#cb8-23" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h2 data-number="5.5" id="add-an-enumeration-listing-the-supported-normalization-forms"><span class="header-section-number">5.5</span> Add an enumeration listing the
supported normalization forms<a href="#add-an-enumeration-listing-the-supported-normalization-forms" class="self-link"></a></h2>
<p><code class="sourceCode default">nf</code> is short for normalization
form, and the letter(s) of each enumerator indicate a form. The Unicode
normalization forms are NFC, NFD, NFKC, and NFKD. There is also an
important semi-official one called FCC (described in <a href="https://unicode.org/notes/tn5">Unicode Technical Note #5</a>).</p>
<p>Using this enumeration, a user would spell NFD as
<code class="sourceCode default">std::uc::nf::d</code>.</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">enum</span> <span class="kw">class</span> nf <span class="op">{</span></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>    c,</span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a>    d,</span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a>    kc,</span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a>    kd,</span>
<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a>    fcc</span>
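<span id="cb9-8"><a href="#cb9-8" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span>;</span>
<span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>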
<h2 data-number="5.6" id="add-a-generic-normalization-algorithm"><span class="header-section-number">5.6</span> Add a generic normalization
algorithm<a href="#add-a-generic-normalization-algorithm" class="self-link"></a></h2>
<p><code class="sourceCode default">normalize()</code> takes a sequence
of code points as input, and writes the normalized result, also in code
points, to the output iterator
<code class="sourceCode default">out</code>. Since
there are fast implementations that normalize UTF-16 and UTF-8
sequences, if the user passes a UTF-8 -&gt; UTF-32 or UTF-16 -&gt;
UTF-32 transcoding iterator to
<code class="sourceCode default">normalize()</code>, it is allowed to
get the underlying iterators out of
<code class="sourceCode default">[first, last)</code>, and do all the
normalization in UTF-8 or UTF-16.</p>
<p>You may expect <code class="sourceCode default">normalize()</code> to
return an alias of
<code class="sourceCode default">in_out_result</code>, like
<code class="sourceCode default">std::ranges::copy()</code>, or
<code class="sourceCode default">std::uc::transcode_to_utf8()</code>
from <a href="https://isocpp.org/files/papers/P2728R0.html">P2728</a>.
The reason it does not is that to do so would interfere with using ICU
to implement these algorithms. See the section on implementation
experience for why that is important.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>nf Normalization, utf_iter I, sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S, output_iterator<span class="op">&lt;</span><span class="dt">uint32_t</span><span class="op">&gt;</span> O<span class="op">&gt;</span></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> O normalize<span class="op">(</span>I first, S last, O out<span class="op">)</span>;</span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>nf Normalization, utf_range_like R, output_iterator<span class="op">&lt;</span><span class="dt">uint32_t</span><span class="op">&gt;</span> O<span class="op">&gt;</span></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> O normalize<span class="op">(</span>R<span class="op">&amp;&amp;</span> r, O out<span class="op">)</span>;</span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>nf Normalization, utf_iter I, sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S<span class="op">&gt;</span></span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">bool</span> is_normalized<span class="op">(</span>I first, S last<span class="op">)</span>;</span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>nf Normalization, utf_range_like R<span class="op">&gt;</span></span>
<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">bool</span> is_normalized<span class="op">(</span>R<span class="op">&amp;&amp;</span> r<span class="op">)</span>;</span>
<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h2 data-number="5.7" id="add-an-append-version-of-the-normalization-algorithm"><span class="header-section-number">5.7</span> Add an append version of the
normalization algorithm<a href="#add-an-append-version-of-the-normalization-algorithm" class="self-link"></a></h2>
<p>In performance tests, I found that appending multiple elements to the
output in one go was substantially faster than the more generic
<code class="sourceCode default">normalize()</code> algorithm, which
appends to the output one code point at a time. So, we should provide
support for that as well, in the form of
<code class="sourceCode default">normalize_append()</code>.</p>
<p>If transcoding is necessary when the result is appended,
<code class="sourceCode default">normalize_append()</code> does
automatic transcoding to UTF-N, where N is implied by the size of
<code class="sourceCode default">String::value_type</code>.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>nf Normalization, utf_iter I, sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S, utf_string String<span class="op">&gt;</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">void</span> normalize_append<span class="op">(</span>I first, S last, String<span class="op">&amp;</span> s<span class="op">)</span>;</span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>nf Normalization, utf_range_like R, utf_string String<span class="op">&gt;</span></span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">void</span> normalize_append<span class="op">(</span>R<span class="op">&amp;&amp;</span> r, String<span class="op">&amp;</span> s<span class="op">)</span>;</span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span>nf Normalization, utf_string String<span class="op">&gt;</span></span>
<span id="cb11-9"><a href="#cb11-9" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="dt">void</span> normalize_string<span class="op">(</span>String<span class="op">&amp;</span> s<span class="op">)</span>;</span>
<span id="cb11-10"><a href="#cb11-10" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h2 data-number="5.8" id="add-normalization-aware-insertion-erasure-and-replacement-operations-on-strings"><span class="header-section-number">5.8</span> Add normalization-aware
insertion, erasure, and replacement operations on strings<a href="#add-normalization-aware-insertion-erasure-and-replacement-operations-on-strings" class="self-link"></a></h2>
<p>If you need to insert text into a
<code class="sourceCode default">std::string</code> or other
STL-compatible container, you can use the erase/insert/replace API.
There are iterator and range overloads of each. Each one:</p>
<ul>
<li>normalizes the inserted text (if text is being inserted);</li>
<li>places the inserted text in Stream-Safe Format (if text is being
inserted);</li>
<li>performs the erase/insert/replace operation on the string;</li>
<li>ensures that the result is in Stream-Safe Format (if text is being
erased); and</li>
<li>normalizes the code points on either side of the affected
subsequence within the string.</li>
</ul>
<p>This last step is necessary because insertions and erasures may
leave code points that can combine adjacent to each other when they
were not before. Getting this right is complicated, so the user should
have a generic way of doing it without needing to know the
details.</p>
<p>This API is like the
<code class="sourceCode default">normalize_append()</code> overloads in
that it may operate on UTF-8 or UTF-16 containers, and deduces the
output UTF from the size of the mutated container’s
<code class="sourceCode default">value_type</code>.</p>
<p>About the need for
<code class="sourceCode default">replace_result</code>:
<code class="sourceCode default">replace_result</code> represents the
result of inserting a sequence of code points
<code class="sourceCode default">I</code> into an existing sequence of
code points <code class="sourceCode default">E</code>, ensuring proper
normalization. Since the insertion operation may need to change some
code points just before and/or just after the insertion due to
normalization, the code points described by
<code class="sourceCode default">replace_result</code> may be longer
than <code class="sourceCode default">I</code>.
<code class="sourceCode default">replace_result</code> values represent
the entire sequence of code points in
<code class="sourceCode default">E</code> that have changed – some
version of which may have already been present in the string before the
insertion.</p>
<p>Note that
<code class="sourceCode default">replace_result::iterator</code> refers
to the underlying sequence, which may not itself be a sequence of code
points. For example, the underlying sequence may be a sequence of
<code class="sourceCode default">char</code> which is interpreted as
UTF-8. We can’t return an iterator of the type
<code class="sourceCode default">I</code> passed to
<code class="sourceCode default">normalize_replace()</code>, for the
same reason we don’t return an
<code class="sourceCode default">in_out_result</code> from
<code class="sourceCode default">normalize()</code>.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> std<span class="op">::</span>uc <span class="op">{</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> I<span class="op">&gt;</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>  <span class="kw">struct</span> replace_result <span class="op">:</span> subrange<span class="op">&lt;</span>I<span class="op">&gt;</span> <span class="op">{</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a>    <span class="kw">using</span> iterator <span class="op">=</span> I;</span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> replace_result<span class="op">()</span> <span class="op">=</span> <span class="cf">default</span>;</span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> replace_result<span class="op">(</span>iterator first, iterator last<span class="op">)</span> <span class="op">:</span> subrange<span class="op">&lt;</span>I<span class="op">&gt;(</span>first, last<span class="op">)</span> <span class="op">{}</span></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-9"><a href="#cb12-9" aria-hidden="true" tabindex="-1"></a>    <span class="kw">constexpr</span> <span class="kw">operator</span> iterator<span class="op">()</span> <span class="kw">const</span> <span class="op">{</span> <span class="cf">return</span> <span class="kw">this</span><span class="op">-&gt;</span>begin<span class="op">()</span>; <span class="op">}</span></span>
<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span>;</span>
<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a>  <span class="kw">enum</span> insertion_normalization <span class="op">{</span></span>
<span id="cb12-13"><a href="#cb12-13" aria-hidden="true" tabindex="-1"></a>    insertion_normalized,</span>
<span id="cb12-14"><a href="#cb12-14" aria-hidden="true" tabindex="-1"></a>    insertion_not_normalized</span>
<span id="cb12-15"><a href="#cb12-15" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span>;</span>
<span id="cb12-16"><a href="#cb12-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-17"><a href="#cb12-17" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span></span>
<span id="cb12-18"><a href="#cb12-18" aria-hidden="true" tabindex="-1"></a>    nf Normalization,</span>
<span id="cb12-19"><a href="#cb12-19" aria-hidden="true" tabindex="-1"></a>    utf_string String,</span>
<span id="cb12-20"><a href="#cb12-20" aria-hidden="true" tabindex="-1"></a>    code_point_iter I,</span>
<span id="cb12-21"><a href="#cb12-21" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> StringIter <span class="op">=</span> ranges<span class="op">::</span>iterator_t<span class="op">&lt;</span>String<span class="op">&gt;&gt;</span></span>
<span id="cb12-22"><a href="#cb12-22" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> replace_result<span class="op">&lt;</span>StringIter<span class="op">&gt;</span> normalize_replace<span class="op">(</span></span>
<span id="cb12-23"><a href="#cb12-23" aria-hidden="true" tabindex="-1"></a>    String<span class="op">&amp;</span> string,</span>
<span id="cb12-24"><a href="#cb12-24" aria-hidden="true" tabindex="-1"></a>    StringIter str_first,</span>
<span id="cb12-25"><a href="#cb12-25" aria-hidden="true" tabindex="-1"></a>    StringIter str_last,</span>
<span id="cb12-26"><a href="#cb12-26" aria-hidden="true" tabindex="-1"></a>    I first,</span>
<span id="cb12-27"><a href="#cb12-27" aria-hidden="true" tabindex="-1"></a>    I last,</span>
<span id="cb12-28"><a href="#cb12-28" aria-hidden="true" tabindex="-1"></a>    insertion_normalization insertion_norm <span class="op">=</span> insertion_not_normalized<span class="op">)</span>;</span>
<span id="cb12-29"><a href="#cb12-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-30"><a href="#cb12-30" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span></span>
<span id="cb12-31"><a href="#cb12-31" aria-hidden="true" tabindex="-1"></a>    nf Normalization,</span>
<span id="cb12-32"><a href="#cb12-32" aria-hidden="true" tabindex="-1"></a>    utf_string String,</span>
<span id="cb12-33"><a href="#cb12-33" aria-hidden="true" tabindex="-1"></a>    code_point_iter I,</span>
<span id="cb12-34"><a href="#cb12-34" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> StringIter <span class="op">=</span> ranges<span class="op">::</span>iterator_t<span class="op">&lt;</span>String<span class="op">&gt;&gt;</span></span>
<span id="cb12-35"><a href="#cb12-35" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> replace_result<span class="op">&lt;</span>StringIter<span class="op">&gt;</span> normalize_insert<span class="op">(</span></span>
<span id="cb12-36"><a href="#cb12-36" aria-hidden="true" tabindex="-1"></a>    String<span class="op">&amp;</span> string,</span>
<span id="cb12-37"><a href="#cb12-37" aria-hidden="true" tabindex="-1"></a>    StringIter at,</span>
<span id="cb12-38"><a href="#cb12-38" aria-hidden="true" tabindex="-1"></a>    I first,</span>
<span id="cb12-39"><a href="#cb12-39" aria-hidden="true" tabindex="-1"></a>    I last,</span>
<span id="cb12-40"><a href="#cb12-40" aria-hidden="true" tabindex="-1"></a>    insertion_normalization insertion_norm <span class="op">=</span> insertion_not_normalized<span class="op">)</span>;</span>
<span id="cb12-41"><a href="#cb12-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-42"><a href="#cb12-42" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span></span>
<span id="cb12-43"><a href="#cb12-43" aria-hidden="true" tabindex="-1"></a>    nf Normalization,</span>
<span id="cb12-44"><a href="#cb12-44" aria-hidden="true" tabindex="-1"></a>    utf_string String,</span>
<span id="cb12-45"><a href="#cb12-45" aria-hidden="true" tabindex="-1"></a>    code_point_range R,</span>
<span id="cb12-46"><a href="#cb12-46" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> StringIter <span class="op">=</span> ranges<span class="op">::</span>iterator_t<span class="op">&lt;</span>String<span class="op">&gt;&gt;</span></span>
<span id="cb12-47"><a href="#cb12-47" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> replace_result<span class="op">&lt;</span>StringIter<span class="op">&gt;</span> normalize_insert<span class="op">(</span></span>
<span id="cb12-48"><a href="#cb12-48" aria-hidden="true" tabindex="-1"></a>    String<span class="op">&amp;</span> string,</span>
<span id="cb12-49"><a href="#cb12-49" aria-hidden="true" tabindex="-1"></a>    StringIter at,</span>
<span id="cb12-50"><a href="#cb12-50" aria-hidden="true" tabindex="-1"></a>    R<span class="op">&amp;&amp;</span> r,</span>
<span id="cb12-51"><a href="#cb12-51" aria-hidden="true" tabindex="-1"></a>    insertion_normalization insertion_norm <span class="op">=</span> insertion_not_normalized<span class="op">)</span>;</span>
<span id="cb12-52"><a href="#cb12-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-53"><a href="#cb12-53" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span></span>
<span id="cb12-54"><a href="#cb12-54" aria-hidden="true" tabindex="-1"></a>    nf Normalization,</span>
<span id="cb12-55"><a href="#cb12-55" aria-hidden="true" tabindex="-1"></a>    utf_string String,</span>
<span id="cb12-56"><a href="#cb12-56" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> StringIter <span class="op">=</span> ranges<span class="op">::</span>iterator_t<span class="op">&lt;</span>String<span class="op">&gt;&gt;</span></span>
<span id="cb12-57"><a href="#cb12-57" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> replace_result<span class="op">&lt;</span>StringIter<span class="op">&gt;</span> normalize_erase<span class="op">(</span></span>
<span id="cb12-58"><a href="#cb12-58" aria-hidden="true" tabindex="-1"></a>    String<span class="op">&amp;</span> string, StringIter str_first, StringIter str_last<span class="op">)</span>;</span>
<span id="cb12-59"><a href="#cb12-59" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h2 data-number="5.9" id="add-a-feature-test-macro"><span class="header-section-number">5.9</span> Add a feature test macro<a href="#add-a-feature-test-macro" class="self-link"></a></h2>
<p>Add the feature test macro
<code class="sourceCode default">__cpp_lib_unicode_normalization</code>.</p>
<h1 data-number="6" id="implementation-experience"><span class="header-section-number">6</span> Implementation experience<a href="#implementation-experience" class="self-link"></a></h1>
<p>All of these interfaces have been implemented in <a href="https://github.com/tzlaine/text">Boost.Text</a> (proposed – not
yet a part of Boost). They have been well-exercised by full-coverage
tests, and by other parts of Boost.Text that use normalization.</p>
<p>The first attempt at implementing the normalization algorithms was
fairly straightforward. I wrote code following the algorithms as
described in the Unicode standard and its accompanying Annexes, and got
all the Unicode-published tests to pass. However, comparing the
performance of the naive implementation to the performance of the
equivalent ICU normalization API showed that the naive implementation
was a <em>lot</em> slower – around a factor of 50!</p>
<p>I managed to optimize the initial implementation quite a lot, and got
the performance delta down to about a factor of 10. After that, I could
shave no more time off of the naive implementation. I looked at how ICU
performs normalization, and had a brief discussion about performance
with one of the ICU maintainers. It turns out that if you understand the
normalization-related Unicode data very deeply, you can take advantage
of certain patterns in those data to take shortcuts. In fact, it is only
necessary to perform the full algorithm as described in the Unicode
standard in a small minority of cases. ICU is maintained in lockstep
with the evolution of Unicode, so as new code points are added (often
from new languages), the new normalization data associated with those
new code points are designed so that they enable the shortcuts mentioned
above.</p>
<p>In the end, I looked at the ICU normalization algorithms, and
reimplemented them in a generic way, using templates and therefore
header-based code. Being generic, this reimplementation works for
numerous types of input (iterators, ranges, pointers to null-terminated
strings), not just the
<code class="sourceCode default">icu::UnicodeString</code> (a
<code class="sourceCode default">std::string</code>-like UTF-16 type)
and <code class="sourceCode default">icu::StringPiece</code> (a
<code class="sourceCode default">std::string_view</code>-like UTF-8
type) inputs that ICU supports. Being inline, header-only code, the
reimplementation optimizes better, and I managed about a 20% speedup
over the ICU implementation.</p>
<p>However, this reimplementation of ICU was a lot of work, and there’s
no guarantee that it will work for more than just the current version of
Unicode supported by Boost.Text. Since ICU and Unicode evolve in
lockstep, each time the Unicode version is updated, the corresponding
changes to the ICU implementation must be identified and applied to
the reimplementation.</p>
<h2 data-number="6.1" id="tldr"><span class="header-section-number">6.1</span> tl;dr<a href="#tldr" class="self-link"></a></h2>
<p>Standard library implementers will probably want to just use ICU to
implement the normalization algorithms. Since ICU only implements the
normalization algorithms for UTF-16 and UTF-8, and since it only
implements the algorithms for the exact types
<code class="sourceCode default">icu::UnicodeString</code> (for UTF-16)
and <code class="sourceCode default">icu::StringPiece</code> (for
UTF-8), copying may need to occur. There are implementation-detail
interfaces within ICU that more intrepid implementers may wish to use;
these interfaces can be made to work with iterators and pointers more
directly.</p>
</div>
</div>
</body>
</html>
