<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="mpark/wg21" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <meta name="dcterms.date" content="2025-01-13" />
  <title>Make the concurrent forward progress guarantee usable in bulk</title>
  <style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
</style>
  <style>
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ background-color: #f6f8fa; }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span { } 
code span.al { color: #ff0000; } 
code span.an { } 
code span.at { } 
code span.bn { color: #9f6807; } 
code span.bu { color: #9f6807; } 
code span.cf { color: #00607c; } 
code span.ch { color: #9f6807; } 
code span.cn { } 
code span.co { color: #008000; font-style: italic; } 
code span.cv { color: #008000; font-style: italic; } 
code span.do { color: #008000; } 
code span.dt { color: #00607c; } 
code span.dv { color: #9f6807; } 
code span.er { color: #ff0000; font-weight: bold; } 
code span.ex { } 
code span.fl { color: #9f6807; } 
code span.fu { } 
code span.im { } 
code span.in { color: #008000; } 
code span.kw { color: #00607c; } 
code span.op { color: #af1915; } 
code span.ot { } 
code span.pp { color: #6f4e37; } 
code span.re { } 
code span.sc { color: #9f6807; } 
code span.ss { color: #9f6807; } 
code span.st { color: #9f6807; } 
code span.va { } 
code span.vs { color: #9f6807; } 
code span.wa { color: #008000; font-weight: bold; } 
code.diff {color: #898887}
code.diff span.va {color: #00AA00}
code.diff span.st {color: #bf0303}
</style>
  <style type="text/css">
body {
margin: 5em;
font-family: serif;

hyphens: auto;
line-height: 1.35;
}
div.wrapper {
max-width: 60em;
margin: auto;
}
ul {
list-style-type: none;
padding-left: 2em;
margin-top: -0.2em;
margin-bottom: -0.2em;
}
a {
text-decoration: none;
color: #4183C4;
}
a.hidden_link {
text-decoration: none;
color: inherit;
}
li {
margin-top: 0.6em;
margin-bottom: 0.6em;
}
h1, h2, h3, h4 {
position: relative;
line-height: 1;
}
a.self-link {
position: absolute;
top: 0;
left: calc(-1 * (3.5rem - 26px));
width: calc(3.5rem - 26px);
height: 2em;
text-align: center;
border: none;
transition: opacity .2s;
opacity: .5;
font-family: sans-serif;
font-weight: normal;
font-size: 83%;
}
a.self-link:hover { opacity: 1; }
a.self-link::before { content: "§"; }
ul > li:before {
content: "\2014";
position: absolute;
margin-left: -1.5em;
}
:target { background-color: #C9FBC9; }
:target .codeblock { background-color: #C9FBC9; }
:target ul { background-color: #C9FBC9; }
.abbr_ref { float: right; }
.folded_abbr_ref { float: right; }
:target .folded_abbr_ref { display: none; }
:target .unfolded_abbr_ref { float: right; display: inherit; }
.unfolded_abbr_ref { display: none; }
.secnum { display: inline-block; min-width: 35pt; }
.header-section-number { display: inline-block; min-width: 35pt; }
.annexnum { display: block; }
div.sourceLinkParent {
float: right;
}
a.sourceLink {
position: absolute;
opacity: 0;
margin-left: 10pt;
}
a.sourceLink:hover {
opacity: 1;
}
a.itemDeclLink {
position: absolute;
font-size: 75%;
text-align: right;
width: 5em;
opacity: 0;
}
a.itemDeclLink:hover { opacity: 1; }
span.marginalizedparent {
position: relative;
left: -5em;
}
li span.marginalizedparent { left: -7em; }
li ul > li span.marginalizedparent { left: -9em; }
li ul > li ul > li span.marginalizedparent { left: -11em; }
li ul > li ul > li ul > li span.marginalizedparent { left: -13em; }
div.footnoteNumberParent {
position: relative;
left: -4.7em;
}
a.marginalized {
position: absolute;
font-size: 75%;
text-align: right;
width: 5em;
}
a.enumerated_item_num {
position: relative;
left: -3.5em;
display: inline-block;
margin-right: -3em;
text-align: right;
width: 3em;
}
div.para { margin-bottom: 0.6em; margin-top: 0.6em; text-align: justify; }
div.section { text-align: justify; }
div.sentence { display: inline; }
span.indexparent {
display: inline;
position: relative;
float: right;
right: -1em;
}
a.index {
position: absolute;
display: none;
}
a.index:before { content: "⟵"; }

a.index:target {
display: inline;
}
.indexitems {
margin-left: 2em;
text-indent: -2em;
}
div.itemdescr {
margin-left: 3em;
}
.bnf {
font-family: serif;
margin-left: 40pt;
margin-top: 0.5em;
margin-bottom: 0.5em;
}
.ncbnf {
font-family: serif;
margin-top: 0.5em;
margin-bottom: 0.5em;
margin-left: 40pt;
}
.ncsimplebnf {
font-family: serif;
font-style: italic;
margin-top: 0.5em;
margin-bottom: 0.5em;
margin-left: 40pt;
background: inherit; 
}
span.textnormal {
font-style: normal;
font-family: serif;
white-space: normal;
display: inline-block;
}
span.rlap {
display: inline-block;
width: 0px;
}
span.descr { font-style: normal; font-family: serif; }
span.grammarterm { font-style: italic; }
span.term { font-style: italic; }
span.terminal { font-family: monospace; font-style: normal; }
span.nonterminal { font-style: italic; }
span.tcode { font-family: monospace; font-style: normal; }
span.textbf { font-weight: bold; }
span.textsc { font-variant: small-caps; }
a.nontermdef { font-style: italic; font-family: serif; }
span.emph { font-style: italic; }
span.techterm { font-style: italic; }
span.mathit { font-style: italic; }
span.mathsf { font-family: sans-serif; }
span.mathrm { font-family: serif; font-style: normal; }
span.textrm { font-family: serif; }
span.textsl { font-style: italic; }
span.mathtt { font-family: monospace; font-style: normal; }
span.mbox { font-family: serif; font-style: normal; }
span.ungap { display: inline-block; width: 2pt; }
span.textit { font-style: italic; }
span.texttt { font-family: monospace; }
span.tcode_in_codeblock { font-family: monospace; font-style: normal; }
span.phantom { color: white; }

span.math { font-style: normal; }
span.mathblock {
display: block;
margin-left: auto;
margin-right: auto;
margin-top: 1.2em;
margin-bottom: 1.2em;
text-align: center;
}
span.mathalpha {
font-style: italic;
}
span.synopsis {
font-weight: bold;
margin-top: 0.5em;
display: block;
}
span.definition {
font-weight: bold;
display: block;
}
.codeblock {
margin-left: 1.2em;
line-height: 127%;
}
.outputblock {
margin-left: 1.2em;
line-height: 127%;
}
div.itemdecl {
margin-top: 2ex;
}
code.itemdeclcode {
white-space: pre;
display: block;
}
span.textsuperscript {
vertical-align: super;
font-size: smaller;
line-height: 0;
}
.footnotenum { vertical-align: super; font-size: smaller; line-height: 0; }
.footnote {
font-size: small;
margin-left: 2em;
margin-right: 2em;
margin-top: 0.6em;
margin-bottom: 0.6em;
}
div.minipage {
display: inline-block;
margin-right: 3em;
}
div.numberedTable {
text-align: center;
margin: 2em;
}
div.figure {
text-align: center;
margin: 2em;
}
table {
border: 1px solid black;
border-collapse: collapse;
margin-left: auto;
margin-right: auto;
margin-top: 0.8em;
text-align: left;
hyphens: none; 
}
td, th {
padding-left: 1em;
padding-right: 1em;
vertical-align: top;
}
td.empty {
padding: 0px;
padding-left: 1px;
}
td.left {
text-align: left;
}
td.right {
text-align: right;
}
td.center {
text-align: center;
}
td.justify {
text-align: justify;
}
td.border {
border-left: 1px solid black;
}
tr.rowsep, td.cline {
border-top: 1px solid black;
}
tr.even, tr.odd {
border-bottom: 1px solid black;
}
tr.capsep {
border-top: 3px solid black;
border-top-style: double;
}
tr.header {
border-bottom: 3px solid black;
border-bottom-style: double;
}
th {
border-bottom: 1px solid black;
}
span.centry {
font-weight: bold;
}
div.table {
display: block;
margin-left: auto;
margin-right: auto;
text-align: center;
width: 90%;
}
span.indented {
display: block;
margin-left: 2em;
margin-bottom: 1em;
margin-top: 1em;
}
ol.enumeratea { list-style-type: none; background: inherit; }
ol.enumerate { list-style-type: none; background: inherit; }

code.sourceCode > span { display: inline; }

div#refs p { padding-left: 32px; text-indent: -32px; }
</style>
  <link href="data:image/vnd.microsoft.icon;base64,AAABAAIAEBAAAAEAIABoBAAAJgAAACAgAAABACAAqBAAAI4EAAAoAAAAEAAAACAAAAABACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA////AIJEAACCRAAAgkQAAIJEAACCRAAAgkQAVoJEAN6CRADegkQAWIJEAACCRAAAgkQAAIJEAACCRAAA////AP///wCCRAAAgkQAAIJEAACCRAAsgkQAvoJEAP+CRAD/gkQA/4JEAP+CRADAgkQALoJEAACCRAAAgkQAAP///wD///8AgkQAAIJEABSCRACSgkQA/IJEAP99PQD/dzMA/3czAP99PQD/gkQA/4JEAPyCRACUgkQAFIJEAAD///8A////AHw+AFiBQwDqgkQA/4BBAP9/PxP/uZd6/9rJtf/bybX/upd7/39AFP+AQQD/gkQA/4FDAOqAQgBc////AP///wDKklv4jlEa/3o7AP+PWC//8+3o///////////////////////z7un/kFox/35AAP+GRwD/mVYA+v///wD///8A0Zpk+NmibP+0d0T/8evj///////+/fv/1sKz/9bCs//9/fr//////+/m2/+NRwL/nloA/5xYAPj///8A////ANKaZPjRmGH/5cKh////////////k149/3UwAP91MQD/lmQ//86rhv+USg3/m1YA/5hSAP+bVgD4////AP///wDSmmT4zpJY/+/bx///////8+TV/8mLT/+TVx//gkIA/5lVAP+VTAD/x6B//7aEVv/JpH7/s39J+P///wD///8A0ppk+M6SWP/u2sf///////Pj1f/Nj1T/2KFs/8mOUv+eWhD/lEsA/8aee/+0glT/x6F7/7J8Rvj///8A////ANKaZPjRmGH/48Cf///////+/v7/2qt//82PVP/OkFX/37KJ/86siv+USg7/mVQA/5hRAP+bVgD4////AP///wDSmmT40ppk/9CVXP/69O////////7+/v/x4M//8d/P//7+/f//////9u7n/6tnJf+XUgD/nFgA+P///wD///8A0ppk+NKaZP/RmWL/1qNy//r07///////////////////////+vXw/9akdP/Wnmn/y5FY/6JfFvj///8A////ANKaZFTSmmTo0ppk/9GYYv/Ql1//5cWm//Hg0P/x4ND/5cWm/9GXYP/RmGH/0ppk/9KaZOjVnmpY////AP///wDSmmQA0ppkEtKaZI7SmmT60ppk/9CWX//OkVb/zpFW/9CWX//SmmT/0ppk/NKaZJDSmmQS0ppkAP///wD///8A0ppkANKaZADSmmQA0ppkKtKaZLrSmmT/0ppk/9KaZP/SmmT/0ppkvNKaZCrSmmQA0ppkANKaZAD///8A////ANKaZADSmmQA0ppkANKaZADSmmQA0ppkUtKaZNzSmmTc0ppkVNKaZADSmmQA0ppkANKaZADSmmQA////AP5/AAD4HwAA4AcAAMADAACAAQAAgAEAAIABAACAAQAAgAEAAIABAACAAQAAgAEAAMADAADgBwAA+B8AAP5/AAAoAAAAIAAAAEAAAAABACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA////AP///wCCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAAyCRACMgkQA6oJEAOqCRACQgkQAEIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAA////AP///wD///8A////AIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRABigkQA5oJEAP+CRAD/gkQA/4JEAP+CRADqgkQAZoJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAAD///8A////A
P///wD///8AgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAA4gkQAwoJEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQAxIJEADyCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAAgkQAAP///wD///8A////AP///wCCRAAAgkQAAIJEAACCRAAAgkQAAIJEAACCRAAWgkQAmIJEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAJyCRAAYgkQAAIJEAACCRAAAgkQAAIJEAACCRAAA////AP///wD///8A////AIJEAACCRAAAgkQAAIJEAACCRAAAgkQAdIJEAPCCRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAP+CRAD/gkQA/4JEAPSCRAB4gkQAAIJEAACCRAAAgkQAAIJEAAD///8A////AP///wD///8AgkQAAIJEAACCRAAAgkQASoJEANKCRAD/gkQA/4JEAP+CRAD/g0YA/39AAP9zLgD/bSQA/2shAP9rIQD/bSQA/3MuAP9/PwD/g0YA/4JEAP+CRAD/gkQA/4JEAP+CRADUgkQAToJEAACCRAAAgkQAAP///wD///8A////AP///wB+PwAAgkUAIoJEAKiCRAD/gkQA/4JEAP+CRAD/hEcA/4BBAP9sIwD/dTAA/5RfKv+viF7/vp56/76ee/+wiF7/lWAr/3YxAP9sIwD/f0AA/4RHAP+CRAD/gkQA/4JEAP+CRAD/gkQArIJEACaBQwAA////AP///wD///8A////AIBCAEBzNAD6f0EA/4NFAP+CRAD/gkQA/4VIAP92MwD/bSUA/6N1Tv/ezsL/////////////////////////////////38/D/6V3Uv9uJgD/dTEA/4VJAP+CRAD/gkQA/4JEAP+BQwD/fUAA/4FDAEj///8A////AP///wD///8AzJRd5qBlKf91NgD/dDUA/4JEAP+FSQD/cy4A/3YyAP/PuKP//////////////////////////////////////////////////////9K7qP94NQD/ciwA/4VJAP+CRAD/fkEA/35BAP+LSwD/mlYA6v///wD///8A////AP///wDdpnL/4qx3/8KJUv+PUhf/cTMA/3AsAP90LgD/4dK+/////////////////////////////////////////////////////////////////+TYxf91MAD/dTIA/31CAP+GRwD/llQA/6FcAP+gWwD8////AP///wD///8A////ANGZY/LSm2X/4ap3/92mcP+wdT3/byQA/8mwj////////////////////////////////////////////////////////////////////////////+LYxv9zLgP/jUoA/59bAP+hXAD/nFgA/5xYAPL///8A////AP///wD///8A0ppk8tKaZP/RmWL/1p9q/9ubXv/XqXj////////////////////////////7+fD/vZyG/6BxS/+gcUr/vJuE//r37f//////////////////////3MOr/5dQBf+dVQD/nVkA/5xYAP+cWAD/nFgA8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/SmWP/yohJ//jo2P//////////////////////4NTG/4JDFf9lGAD/bSQA/20kAP9kGAD/fz8S/+Xb0f//////5NG9/6txN/+LOgD/m1QA/51aAP+cWAD/m1cA/5xYAP+cWADy////AP///wD///8A////ANKaZPLSmmT/0ppk/8+TWf/Unmv//v37//////////////////////+TWRr/VwsA/35AAP+ERgD/g0UA/4JGAP9lHgD/kFga/8KXX/+TRwD/jT4A/
49CAP+VTQD/n10A/5xYAP+OQQD/lk4A/55cAPL///8A////AP///wD///8A0ppk8tKaZP/SmmT/y4tO/92yiP//////////////////////8NnE/8eCQP+rcTT/ez0A/3IyAP98PgD/gEMA/5FSAP+USwD/jj8A/5lUAP+JNwD/yqV2/694Mf+HNQD/jkAA/82rf/+laBj/jT4A8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/LiUr/4byY///////////////////////gupX/0I5P/+Wuev/Lklz/l1sj/308AP+QSwD/ol0A/59aAP+aVQD/k0oA/8yoh///////+fXv/6pwO//Lp3v///////Pr4f+oay7y////AP///wD///8A////ANKaZPLSmmT/0ppk/8uJSv/hvJj//////////////////////+G7l//Jhkb/0ppk/96nc//fqXX/x4xO/6dkFP+QSQD/llEA/5xXAP+USgD/yaOA///////38uv/qG05/8ijdv//////8efb/6ZpLPL///8A////AP///wD///8A0ppk8tKaZP/SmmT/zIxO/9yxh///////////////////////7dbA/8iEQf/Sm2X/0Zlj/9ScZv/eqHf/2KJv/7yAQf+XTgD/iToA/5lSAP+JNgD/yKFv/611LP+HNQD/jT8A/8qmeP+kZRT/jT4A8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/Pk1n/1J5q//78+//////////////////+/fv/1aFv/8iEQv/Tm2b/0ppl/9GZY//Wn2z/1pZc/9eldf/Bl2b/kUcA/4w9AP+OQAD/lUwA/59eAP+cWQD/jT8A/5ZOAP+eXADy////AP///wD///8A////ANKaZPLSmmT/0ppk/9KZY//KiEn/8d/P///////////////////////47+f/05tm/8iCP//KiEj/yohJ/8eCP//RmGH//vfy///////n1sP/rXQ7/4k4AP+TTAD/nVoA/5xYAP+cVwD/nFgA/5xYAPL///8A////AP///wD///8A0ppk8tKaZP/SmmT/0ptl/8uLTf/aq37////////////////////////////+/fz/6c2y/961jv/etY7/6Myx//78+v//////////////////////3MWv/5xXD/+ORAD/mFQA/51ZAP+cWAD/nFgA8v///wD///8A////AP///wDSmmTy0ppk/9KaZP/SmmT/0ppk/8mFRP/s1b//////////////////////////////////////////////////////////////////////////////+PD/0JFU/7NzMv+WUQD/kUsA/5tXAP+dWQDy////AP///wD///8A////ANKaZP/SmmT/0ppk/9KaZP/Sm2X/z5NZ/8yMT//z5NX/////////////////////////////////////////////////////////////////9Ofa/8yNUP/UmGH/36p5/8yTWv+qaSD/kksA/5ROAPz///8A////AP///wD///8A0ppk5NKaZP/SmmT/0ppk/9KaZP/TnGf/zY9T/82OUv/t1sD//////////////////////////////////////////////////////+7Yw//OkFX/zI5R/9OcZ//SmmP/26V0/9ymdf/BhUf/ol8R6P///wD///8A////AP///wDSmmQ80ppk9tKaZP/SmmT/0ppk/9KaZP/TnGj/zpFW/8qJSv/dson/8uHS//////////////////////////////////Lj0//etIv/y4lL/86QVf/TnGj/0ppk/9KaZP/RmWP/05xn/9ymdfjUnWdC////AP///wD///8A////ANKaZADSmmQc0ppkotKaZP/SmmT/0ppk/9KaZP/Tm2b/0Zli/8qJSf/NjlH/16Z3/
+G8mP/myKr/5siq/+G8mP/Xp3f/zY5S/8qISf/RmGH/05tm/9KaZP/SmmT/0ppk/9KaZP/SmmSm0pljINWdaQD///8A////AP///wD///8A0ppkANKaZADSmmQA0ppkQtKaZMrSmmT/0ppk/9KaZP/SmmT/0ptl/9GYYf/Nj1P/y4lL/8qISP/KiEj/y4lK/82PU//RmGH/0ptl/9KaZP/SmmT/0ppk/9KaZP/SmmTO0ppkRtKaZADSmmQA0ppkAP///wD///8A////AP///wDSmmQA0ppkANKaZADSmmQA0ppkANKaZGzSmmTu0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmTw0ppkcNKaZADSmmQA0ppkANKaZADSmmQA////AP///wD///8A////ANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZBLSmmSQ0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppklNKaZBTSmmQA0ppkANKaZADSmmQA0ppkANKaZAD///8A////AP///wD///8A0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQy0ppkutKaZP/SmmT/0ppk/9KaZP/SmmT/0ppk/9KaZP/SmmT/0ppkvtKaZDbSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkAP///wD///8A////AP///wDSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkXNKaZODSmmT/0ppk/9KaZP/SmmT/0ppk5NKaZGDSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA////AP///wD///8A////ANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkBtKaZIbSmmTo0ppk6tKaZIrSmmQK0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZADSmmQA0ppkANKaZAD///8A////AP/8P///+B///+AH//+AAf//AAD//AAAP/AAAA/gAAAHwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA8AAAAPAAAADwAAAA+AAAAfwAAAP/AAAP/8AAP//gAH//+AH///4H////D//" rel="icon" />
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
  
</head>
<body>
<div class="wrapper">
<header id="title-block-header">
<h1 class="title" style="text-align:center">Make the concurrent forward
progress guarantee usable in <code>bulk</code></h1>

<table style="border:none;float:right">
  <tr>
    <td>Document #: </td>
    <td>P3564R0</td>
  </tr>
  <tr>
    <td>Date: </td>
    <td>2025-01-13</td>
  </tr>
  <tr>
    <td style="vertical-align:top">Project: </td>
    <td>Programming Language C++<br>
      LEWG<br>
    </td>
  </tr>
  <tr>
    <td style="vertical-align:top">Reply-to: </td>
    <td>
      Mark Hoemmen<br>&lt;<a href="mailto:mhoemmen@nvidia.com" class="email">mhoemmen@nvidia.com</a>&gt;<br>
      Bryce Adelstein Lelbach<br>&lt;<a href="mailto:brycelelbach@gmail.com" class="email">brycelelbach@gmail.com</a>&gt;<br>
      Michael Garland<br>&lt;<a href="mailto:mgarland@nvidia.com" class="email">mgarland@nvidia.com</a>&gt;<br>
    </td>
  </tr>
</table>

</header>
<div style="clear:both">
<div id="TOC" role="doc-toc">
<h1 id="toctitle">Contents</h1>
<ul>
<li><a href="#authors" id="toc-authors"><span class="toc-section-number">1</span> Authors</a></li>
<li><a href="#introduction" id="toc-introduction"><span class="toc-section-number">2</span> Introduction</a></li>
<li><a href="#why-saying-that-agents-are-distinct-fixes-bulks-forward-progress" id="toc-why-saying-that-agents-are-distinct-fixes-bulks-forward-progress"><span class="toc-section-number">3</span> Why saying that agents are distinct
fixes <code>bulk</code>’s forward progress</a>
<ul>
<li><a href="#summary" id="toc-summary"><span class="toc-section-number">3.1</span> Summary</a></li>
<li><a href="#execution-agents-relate-to-synchronization" id="toc-execution-agents-relate-to-synchronization"><span class="toc-section-number">3.2</span> Execution agents relate to
synchronization</a></li>
<li><a href="#execution-agents-are-more-general-than-threads" id="toc-execution-agents-are-more-general-than-threads"><span class="toc-section-number">3.3</span> Execution agents are more general
than threads</a></li>
<li><a href="#this-explains-what-it-means-for-a-single-threaded-execution-resource-to-have-a-forward-progress-guarantee" id="toc-this-explains-what-it-means-for-a-single-threaded-execution-resource-to-have-a-forward-progress-guarantee"><span class="toc-section-number">3.4</span> This explains what it means for a
single-threaded execution resource to have a forward progress
guarantee</a></li>
</ul></li>
<li><a href="#concurrent-forward-progress-is-not-currently-usable-in-bulk" id="toc-concurrent-forward-progress-is-not-currently-usable-in-bulk"><span class="toc-section-number">4</span> Concurrent forward progress is not
currently usable in <code>bulk</code></a>
<ul>
<li><a href="#summary-1" id="toc-summary-1"><span class="toc-section-number">4.1</span> Summary</a></li>
<li><a href="#why-we-want-bulk-to-have-the-strongest-possible-forward-progress-guarantee" id="toc-why-we-want-bulk-to-have-the-strongest-possible-forward-progress-guarantee"><span class="toc-section-number">4.2</span> Why we want <code>bulk</code> to
have the strongest possible forward progress guarantee</a>
<ul>
<li><a href="#users-need-the-freedom-to-express-their-intent" id="toc-users-need-the-freedom-to-express-their-intent"><span class="toc-section-number">4.2.1</span> Users need the freedom to
express their intent</a></li>
<li><a href="#sg1-wants-a-bulk-interface-that-creates-an-execution-agent-per-iteration" id="toc-sg1-wants-a-bulk-interface-that-creates-an-execution-agent-per-iteration"><span class="toc-section-number">4.2.2</span> SG1 wants a bulk interface that
“creates an execution agent per iteration”</a></li>
<li><a href="#run-on-n-execution-agents-parallelize-this-loop" id="toc-run-on-n-execution-agents-parallelize-this-loop"><span class="toc-section-number">4.2.3</span> “Run on N execution agents”
<code>!=</code> “parallelize this loop”</a></li>
<li><a href="#parallel-programming-models-give-users-a-way-to-ask-for-concurrent-forward-progress-where-possible" id="toc-parallel-programming-models-give-users-a-way-to-ask-for-concurrent-forward-progress-where-possible"><span class="toc-section-number">4.2.4</span> Parallel programming models give
users a way to ask for concurrent forward progress where
possible</a></li>
</ul></li>
<li><a href="#summary-of-design-history-for-bulk-executions-forward-progress" id="toc-summary-of-design-history-for-bulk-executions-forward-progress"><span class="toc-section-number">4.3</span> Summary of design history for bulk
execution’s forward progress</a></li>
</ul></li>
<li><a href="#proposed-solution" id="toc-proposed-solution"><span class="toc-section-number">5</span> Proposed solution</a>
<ul>
<li><a href="#specify-bulk-as-running-on-n-distinct-execution-agents" id="toc-specify-bulk-as-running-on-n-distinct-execution-agents"><span class="toc-section-number">5.1</span> Specify <code>bulk</code> as
running on N distinct execution agents</a>
<ul>
<li><a href="#summary-2" id="toc-summary-2"><span class="toc-section-number">5.1.1</span> Summary</a></li>
<li><a href="#alternative-designs" id="toc-alternative-designs"><span class="toc-section-number">5.1.2</span> Alternative designs</a></li>
</ul></li>
<li><a href="#make-it-ill-formed-to-use-default-bulk-with-a-scheduler-promising-concurrent-forward-progress" id="toc-make-it-ill-formed-to-use-default-bulk-with-a-scheduler-promising-concurrent-forward-progress"><span class="toc-section-number">5.2</span> Make it ill-formed to use default
<code>bulk</code> with a scheduler promising concurrent forward
progress</a></li>
<li><a href="#permit-a-bulk-customization-to-fail-if-it-cannot-fulfill-its-forward-progress-for-a-given-number-of-agents" id="toc-permit-a-bulk-customization-to-fail-if-it-cannot-fulfill-its-forward-progress-for-a-given-number-of-agents"><span class="toc-section-number">5.3</span> Permit a <code>bulk</code>
customization to fail if it cannot fulfill its forward progress for a
given number of agents</a>
<ul>
<li><a href="#summary-3" id="toc-summary-3"><span class="toc-section-number">5.3.1</span> Summary</a></li>
<li><a href="#discussion" id="toc-discussion"><span class="toc-section-number">5.3.2</span> Discussion</a></li>
<li><a href="#design-implications" id="toc-design-implications"><span class="toc-section-number">5.3.3</span> Design implications</a></li>
<li><a href="#alternative-designs-1" id="toc-alternative-designs-1"><span class="toc-section-number">5.3.4</span> Alternative designs</a></li>
</ul></li>
</ul></li>
<li><a href="#implementation-status" id="toc-implementation-status"><span class="toc-section-number">6</span>
Implementation status</a></li>
<li><a href="#wording" id="toc-wording"><span class="toc-section-number">7</span> Wording</a>
<ul>
<li><a href="#update-version-macro" id="toc-update-version-macro"><span class="toc-section-number">7.1</span> Update version macro</a></li>
<li><a href="#change-default-bulk" id="toc-change-default-bulk"><span class="toc-section-number">7.2</span> Change default
<code>bulk</code></a></li>
<li><a href="#specify-behavior-of-bulk-customizations" id="toc-specify-behavior-of-bulk-customizations"><span class="toc-section-number">7.3</span> Specify behavior of
<code>bulk</code> customizations</a></li>
</ul></li>
<li><a href="#appendix-a-design-history-for-bulk-executions-forward-progress" id="toc-appendix-a-design-history-for-bulk-executions-forward-progress"><span class="toc-section-number">8</span> Appendix A: Design history for bulk
execution’s forward progress</a>
<ul>
<li><a href="#predecessors-of-p0443" id="toc-predecessors-of-p0443"><span class="toc-section-number">8.1</span> Predecessors of P0443</a></li>
<li><a href="#earlier-versions-of-p0443" id="toc-earlier-versions-of-p0443"><span class="toc-section-number">8.2</span> Earlier versions of P0443</a></li>
<li><a href="#later-versions-of-p0443" id="toc-later-versions-of-p0443"><span class="toc-section-number">8.3</span> Later versions of P0443</a></li>
<li><a href="#p2300" id="toc-p2300"><span class="toc-section-number">8.4</span> P2300</a></li>
<li><a href="#how-our-proposal-fits-into-this-history" id="toc-how-our-proposal-fits-into-this-history"><span class="toc-section-number">8.5</span> How our proposal fits into this
history</a></li>
</ul></li>
<li><a href="#appendix-b-chunked-parallel-for_each" id="toc-appendix-b-chunked-parallel-for_each"><span class="toc-section-number">9</span> Appendix B: Chunked parallel
<code>for_each</code></a></li>
<li><a href="#references" id="toc-references"><span class="toc-section-number">10</span> References</a></li>
<li><a href="#revision-history" id="toc-revision-history"><span class="toc-section-number">11</span> Revision History</a></li>
</ul>
</div>
<h1 data-number="1" id="authors"><span class="header-section-number">1</span> Authors<a href="#authors" class="self-link"></a></h1>
<ul>
<li>Mark Hoemmen (mhoemmen@nvidia.com) (NVIDIA)</li>
</ul>
<h1 data-number="2" id="introduction"><span class="header-section-number">2</span> Introduction<a href="#introduction" class="self-link"></a></h1>
<p>Scheduler-generic <code>std::execution</code> code can never assume concurrent
forward progress in <code>bulk</code>’s function invocations, even if
the scheduler’s <code>get_forward_progress_guarantee</code> query
returns <code>concurrent</code>. This is because a scheduler’s forward
progress guarantee relates to distinct execution agents on the
scheduler, but nothing specifies when the different <code>f(k)</code>
invocations of the user’s function <code>f</code> execute on distinct
execution agents. Two function invocations that happen on the same agent
cannot have a forward progress guarantee stronger than parallel, because
they cannot perform blocking synchronization without the possibility of
deadlock.</p>
<p>We propose fixing this by specifying that each of <code>bulk</code>’s
function invocations happens in a distinct execution agent. This would
make <code>bulk</code> more effective as a basis function for
implementing asynchronous parallel algorithms. Since default
<code>bulk</code> is sequential, we propose making it ill-formed to use
default <code>bulk</code> with a scheduler that promises
<code>concurrent</code> forward progress.</p>
<p>We will begin by explaining why we think saying that
<code>bulk</code>’s function invocations happen in distinct execution
agents implies that those function invocations have the scheduler’s
forward progress guarantee. Then, we will explain our interpretation of
the current wording, and how it limits both users and implementers.
Next, we will go through our proposed solution and contrast it with
alternative designs. After that, we will present our proposed wording
changes.</p>
<h1 data-number="3" id="why-saying-that-agents-are-distinct-fixes-bulks-forward-progress"><span class="header-section-number">3</span> Why saying that agents are
distinct fixes <code>bulk</code>’s forward progress<a href="#why-saying-that-agents-are-distinct-fixes-bulks-forward-progress" class="self-link"></a></h1>
<h2 data-number="3.1" id="summary"><span class="header-section-number">3.1</span> Summary<a href="#summary" class="self-link"></a></h2>
<p>We propose saying that each of <code>bulk</code>’s <code>N</code>
function invocations happens in a distinct execution agent. This section
explains why we think that suffices to ensure that <code>bulk</code>’s
function invocations have the same forward progress guarantee as the
scheduler.</p>
<p>Execution agents do not necessarily correspond to hardware or
operating system features. “These operations happen on distinct
execution agents on the scheduler” is just the language that the
Standard gives us to express that “the scheduler’s forward progress
guarantee applies to these operations.” This establishes a context for
us to speak of forward progress of <code>bulk</code>’s function
invocations, even when <code>bulk</code> is executed on a
single-threaded execution resource.</p>
<p>For a scheduler using a single-threaded resource, <code>bulk</code>’s
function invocations would still execute on distinct execution agents,
but all those agents would run on the same thread. The agents would not
necessarily ever be reified as an actual hardware or software entity;
they are just a way to talk about forward progress of operations on the
scheduler. A single-threaded scheduler like this could promise no
stronger than parallel forward progress.</p>
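<p>To make the single-threaded case concrete, here is a minimal model of the
<code>get_forward_progress_guarantee</code> query. The enumerator names follow
[exec.get.fwd.progress], but the scheduler type is invented for illustration
and is not the real <code>std::execution</code> API.</p>

```cpp
// A toy model of the query described in [exec.get.fwd.progress]. The
// enumerators mirror std::execution::forward_progress_guarantee; the
// scheduler type is hypothetical.
enum class forward_progress_guarantee { concurrent, parallel, weakly_parallel };

struct single_thread_scheduler {
  // All agents this scheduler creates share one thread, so it can promise
  // at most parallel forward progress.
  friend forward_progress_guarantee
  get_forward_progress_guarantee(const single_thread_scheduler&) noexcept {
    return forward_progress_guarantee::parallel;
  }
};

// Generic code would branch on the query before relying on blocking
// synchronization between bulk's function invocations.
bool may_block_between_iterations(const single_thread_scheduler& sch) {
  return get_forward_progress_guarantee(sch)
         == forward_progress_guarantee::concurrent;
}
```

<p>Generic code that needs blocking synchronization between iterations would
check this query and fall back to a nonblocking formulation (or reject the
scheduler) when the answer is weaker than <code>concurrent</code>.</p>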
<h2 data-number="3.2" id="execution-agents-relate-to-synchronization"><span class="header-section-number">3.2</span> Execution agents relate to
synchronization<a href="#execution-agents-relate-to-synchronization" class="self-link"></a></h2>
<p>The Standard defines “execution agent” in the context of
synchronization. [thread.req.lockable.general] 1 defines <em>execution
agent</em> as “an entity such as a thread that may perform work in
parallel with other execution agents.” Note 1 points out that
“[i]mplementations or users can introduce other kinds of agents such as
processes or thread-pool tasks.” The section goes on to explain that a
“given execution agent” can “acquire or release ownership of a lock”
([thread.req.lockable.general] 3). The Standard generally uses
“execution agent” when referring to locks and synchronization. For
example, [thread.mutex.requirements.general] 1 says: “A mutex object
facilitates protection against data races and allows safe
synchronization of data between execution agents.” See also
[thread.lock.general] 1 and [thread.sharedmutex.requirements.general] 2;
the latter explains a situation in which “execution agents block” until
some condition is met.</p>
<p>We therefore read other uses of “execution agent” in the Standard as
relating to synchronization. For example, coroutines can be resumed on
an execution agent ([coroutine.handle.resumption] 1); Note 2 explains
that “[a] concurrent resumption of the coroutine can result in a data
race.” This informs our reading of [exec.get.fwd.progress] 1, 3, which
explains that the result of querying
<code>get_forward_progress_guarantee</code> on a scheduler describes
forward progress of distinct execution agents created by the
scheduler.</p>
<h2 data-number="3.3" id="execution-agents-are-more-general-than-threads"><span class="header-section-number">3.3</span> Execution agents are more
general than threads<a href="#execution-agents-are-more-general-than-threads" class="self-link"></a></h2>
<p>The phrase “execution agent” differs from <em>thread of
execution</em> ([intro.multithread.general] 1), “also known as a
<em>thread</em>,” which “is a single flow of control within a program,
including the initial invocation of a specific top-level function, and
recursively including every function invocation subsequently executed by
the thread.” Execution agents are more general. For example, a program
could use a SIMD (Single Instruction Multiple Data) lane as an
“execution agent,” even though multiple SIMD lanes may share a single
instruction stream (“flow of control”). This is, in fact, one of the
motivations for the weakly parallel forward progress guarantee, as
explained in the 2014 paper <a href="https://wg21.link/n4156">N4156</a>,
“Light-Weight Execution Agents.”</p>
<h2 data-number="3.4" id="this-explains-what-it-means-for-a-single-threaded-execution-resource-to-have-a-forward-progress-guarantee"><span class="header-section-number">3.4</span> This explains what it means for
a single-threaded execution resource to have a forward progress
guarantee<a href="#this-explains-what-it-means-for-a-single-threaded-execution-resource-to-have-a-forward-progress-guarantee" class="self-link"></a></h2>
<p>This interpretation of “execution agents” lets us speak of “forward
progress” even for a single-threaded execution resource. A scheduler on
a single-threaded resource would execute <code>bulk</code> as a
sequential <code>for</code> loop (for example, using default
<code>bulk</code>). If we regard each loop iteration as a distinct
execution agent, then we could say that this bulk implementation
promises at most parallel forward progress. That is, this scheduler’s
<code>bulk</code> could deadlock if any iteration blocks on any other
iteration.</p>
<p>In general, execution agents do not necessarily correspond to
hardware or operating system features. “These function invocations
happen on distinct execution agents on the scheduler” is just a way to
say “the scheduler’s forward progress guarantee applies to these
function invocations.” This is the context for us to talk about forward
progress of <code>bulk</code>.</p>
<h1 data-number="4" id="concurrent-forward-progress-is-not-currently-usable-in-bulk"><span class="header-section-number">4</span> Concurrent forward progress is
not currently usable in <code>bulk</code><a href="#concurrent-forward-progress-is-not-currently-usable-in-bulk" class="self-link"></a></h1>
<h2 data-number="4.1" id="summary-1"><span class="header-section-number">4.1</span> Summary<a href="#summary-1" class="self-link"></a></h2>
<p>Scheduler-generic code cannot assume that <code>bulk</code>’s
<code>N</code> function invocations have a forward progress guarantee
stronger than <code>parallel</code>, even if a scheduler’s
<code>get_forward_progress_guarantee</code> query returns
<code>concurrent</code>. The result of querying
<code>get_forward_progress_guarantee</code> on the scheduler describes
forward progress of distinct execution agents created by the scheduler
([exec.get.fwd.progress] 1, 3). The default <code>bulk</code> runs
sequentially on a single execution agent ([exec.bulk] 4) and permits any
number <code>N</code> of function invocations. As a result, even if the
scheduler promises concurrent forward progress, default
<code>bulk</code> cannot have forward progress stronger than parallel.
For a customization of <code>bulk</code>, the current Working Draft does
not specify how many agents it uses, just that those agents have the
scheduler’s forward progress guarantee. For example, if a scheduler
promises concurrent forward progress, but its custom <code>bulk</code>
“tiles” the <code>N</code> function invocations across fewer than
<code>N</code> execution agents, each invocation of <code>bulk</code>
could promise at most <code>parallel</code> forward progress.</p>
<h2 data-number="4.2" id="why-we-want-bulk-to-have-the-strongest-possible-forward-progress-guarantee"><span class="header-section-number">4.2</span> Why we want <code>bulk</code>
to have the strongest possible forward progress guarantee<a href="#why-we-want-bulk-to-have-the-strongest-possible-forward-progress-guarantee" class="self-link"></a></h2>
<h3 data-number="4.2.1" id="users-need-the-freedom-to-express-their-intent"><span class="header-section-number">4.2.1</span> Users need the freedom to
express their intent<a href="#users-need-the-freedom-to-express-their-intent" class="self-link"></a></h3>
<p>Users want to write various algorithms. Some of those will require
blocking synchronization between different activities, such as function
invocations in <code>bulk</code>. Right now, they have no way of doing
that in scheduler-generic code, except by launching
<code>bulk</code> operations back to back. We want to make it possible
for them to use the synchronization of their choice, if the scheduler
supports it.</p>
<p>One motivation for permitting synchronization in <code>bulk</code> is
to reduce algorithmic complexity. For example, bulk execution on some
platforms has a launch cost that is independent of the number of
execution agents. If we take away <code>bulk</code> and make users
express execution over <span class="math inline"><em>N</em></span>
agents via recursive calls to <code>when_all</code>, we force all
concurrent execution into a <span class="math inline">log <em>N</em></span> launch cost. If we force users
to launch multiple <code>bulk</code> operations, typical parallel
algorithms with tree-shaped local communication patterns end up costing
<span class="math inline">log <em>N</em></span> <code>bulk</code>
launches instead of a constant number of launches.</p>
<h3 data-number="4.2.2" id="sg1-wants-a-bulk-interface-that-creates-an-execution-agent-per-iteration"><span class="header-section-number">4.2.2</span> SG1 wants a bulk interface
that “creates an execution agent per iteration”<a href="#sg1-wants-a-bulk-interface-that-creates-an-execution-agent-per-iteration" class="self-link"></a></h3>
<p>SG1 polled with unanimous consent in its Wrocław 2024 review of
<a href="https://wg21.link/p3481r0">P3481R0, “Summarizing
<code>std::execution::bulk()</code> issues,”</a> that “[w]e need a
version of the bulk API that creates an execution agent per
iteration.”</p>
<h3 data-number="4.2.3" id="run-on-n-execution-agents-parallelize-this-loop"><span class="header-section-number">4.2.3</span> “Run on N execution agents”
<code>!=</code> “parallelize this loop”<a href="#run-on-n-execution-agents-parallelize-this-loop" class="self-link"></a></h3>
<p>A <code>bulk</code> that runs on N execution agents is a unique
facility. Nothing else in Standard C++ lets users execute on N distinct
execution agents “all at once.” We consider this a distinct capability
from parallelizing a loop, and therefore deserving of a separate
interface, for the following reasons.</p>
<ol type="1">
<li><p>Loop parallelization involves two separate tasks: mapping of loop
iterations to execution agents, and executing work on each of those
agents.</p></li>
<li><p>Users want, and popular programming models provide, more
complicated mappings of iterations to agents than simple chunking. Users
often want to <em>know</em> and even <em>specify</em> the
mapping.</p></li>
</ol>
<p>Regarding (1), one could build loop parallelization as a
straightforward composition of the mapping (a function from agent index
to a set of loop indices) with <code>bulk</code> (that executes a
function of agent index on a distinct agent). Appendix B shows an
example of how to implement a chunked parallel <code>for_each</code>
algorithm in this way.</p>
<p>Regarding (2), examples of more complicated mapping facilities
include the variety of <code>schedule</code> modifiers for OpenMP’s
<code>for</code> clause, the many array mapping options in High
Performance Fortran, and the generality of ScaLAPACK’s array
distributions. Users need control over data distribution for best
performance, so they can exploit knowledge of data locality relative to
the various execution agents. Giving “implementation freedom” to
<code>bulk</code> to hide loop distribution inside would take freedom
away from users to build their own chunking schemes on top of
<code>bulk</code>’s basic execution facility.</p>
<p>Regarding <a href="https://wg21.link/p3481r0">P3481R0’s</a> proposed
<code>bulk_chunked</code> interface: given an arbitrary mapping, it
would be easier to implement loop parallelization with <code>bulk</code>
than with <code>bulk_chunked</code>. For example, users might want to
assign loop index <code>k</code> to agent <code>k % N</code> (out of
<code>N</code> agents). As a result, each agent would only see
contiguous “chunks” of size one.</p>
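<p>The following sketch (with invented names, not a proposed API) shows
how simple this round-robin mapping is to express as a function from
agent index to loop indices, the kind of composition described in (1)
above.</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (a sketch, not a proposed API): under the
// round-robin mapping k -> k % num_agents, agent `agent` processes
// exactly the loop indices agent, agent + num_agents, agent + 2 *
// num_agents, ... Each agent's "chunks" are contiguous runs of size one.
std::vector<std::size_t> round_robin_indices(std::size_t agent,
                                             std::size_t num_agents,
                                             std::size_t num_iterations) {
  std::vector<std::size_t> indices;
  for (std::size_t k = agent; k < num_iterations; k += num_agents) {
    indices.push_back(k);
  }
  return indices;
}
```

<p>Composed with a <code>bulk</code> that runs one invocation per agent,
this is the entire loop distribution; a <code>bulk_chunked</code>
interface would have to express it as <code>N</code> chunks of size
one.</p>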
<h3 data-number="4.2.4" id="parallel-programming-models-give-users-a-way-to-ask-for-concurrent-forward-progress-where-possible"><span class="header-section-number">4.2.4</span> Parallel programming models
give users a way to ask for concurrent forward progress where possible<a href="#parallel-programming-models-give-users-a-way-to-ask-for-concurrent-forward-progress-where-possible" class="self-link"></a></h3>
<p>Customizations of <code>bulk</code> should be able to promise
concurrent forward progress, because popular parallel programming models
try to provide both concurrent and parallel progress, and give users a
way to ask for the one they want. Models offer users this choice in
three different ways.</p>
<ol type="1">
<li><p>They expose different programming constructs for “running on N
concurrent threads” versus “parallelizing a loop with M iterations over
N threads” (analogous to parallel <code>for_each</code>). Some models
<em>only</em> have a way to run on N concurrent threads.</p></li>
<li><p>They expose a way to ask for nondefault “extra” forward progress
guarantees, even though doing so might only have a performance benefit
in special cases.</p></li>
<li><p>They expose a more complicated programming model, so that users
can get concurrent forward progress between at least some execution
agents.</p></li>
</ol>
<p>Many parallel programming models expose different programming
constructs for “launching N concurrent threads” versus “parallelizing a
loop with M iterations over N threads.” The former lets users write
algorithms that assume concurrent forward progress, while the latter
makes the common case of distributing a large number of function
invocations over a smaller number of execution agents easier.</p>
<ul>
<li><p>OpenMP <a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> distinguishes parallel regions
(e.g., <code>#pragma omp parallel</code>, with concurrent forward
progress over <code>N</code> threads, with caveats; see below) from
parallel loops (e.g., <code>#pragma omp parallel for</code>).</p></li>
<li><p>MPI (the Message Passing Interface standard for
distributed-memory parallel programming, first released in June 1994)
<em>only</em> has a way to launch some number of processes (the
distributed-memory analog of execution agents) with the equivalent of a
concurrent forward progress guarantee. Users are expected to distribute
parallel loop iterations over processes by hand.<a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a></p></li>
<li><p>PVM (the Parallel Virtual Machine): MPI’s ability to start a
program with the desired number N of execution agents contrasts with
another programming model of its day, PVM. PVM users who wanted N
execution agents generally had to write boilerplate code that spawned
agents in a binary tree (analogous to recursive <code>when_all</code>
invocations), while MPI users simply got that from the beginning (after
calling <code>MPI_Init</code> or the equivalent). This was a factor in
MPI’s popularity over PVM.</p></li>
<li><p>HPX includes both parallel algorithms like those in the C++
Standard Library, and different ways to enumerate execution agents of
various kinds and execute work in bulk fashion with concurrent forward
progress on them. For example, users can define a “parallel section”
(analogous to a section of code executing on possibly multiple processes
of an MPI communicator), and perform blocking synchronization across
subsets of “images” (execution agents) in the parallel section. Users
can launch their parallel section (via <code>define_spmd_block</code>)
with a given number of images per “locality” (e.g., node of a cluster,
or NUMA (Non-Uniform Memory Access) domain).</p></li>
</ul>
<p>Some parallel programming models expose a way to ask for nondefault
“extra” forward progress guarantees, even though doing so might only
offer a performance benefit in special cases. CUDA<a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a>
distinguishes ordinary bulk execution (“device kernel”) launch (with at
best parallel forward progress across thread blocks) from so-called
“cooperative” bulk execution launch (with concurrent forward progress
across thread blocks) as exposed by
<a href="https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html">CUDA
run-time functions</a> such as <code>cudaLaunchCooperativeKernel</code>.
The cooperative launch feature is not “free”: it takes away the latency
hiding feature of ordinary kernel launch, and thus only has performance
benefits for specialized algorithms.</p>
<p>Finally, some programming models expose complication, so that users
can get concurrent forward progress between at least some execution
agents. This complication generally takes the form of a hierarchy, with
different levels having possibly different forward progress guarantees.
OpenMP 6.0 (released in November 2024) introduces “progress units” of
some implementation-defined number of consecutive hardware threads.
OpenMP’s parallel regions (<code>#pragma omp parallel</code>) offer
concurrent forward progress across progress units, but only weakly
parallel progress within a progress unit. Exposing this lets OpenMP
support more architectures, at the cost of users needing to query the
progress unit size (which might be one, indicating that all threads make
concurrent forward progress). CUDA exposes two levels of parallelism:
thread blocks and threads within a block. By default, thread blocks have
the parallel guarantee, while threads within a block have the concurrent
guarantee.<a href="#fn4" class="footnote-ref" id="fnref4" role="doc-noteref"><sup>4</sup></a> CUDA provides various
synchronization operations on groups of threads within a thread block,
including synchronous barriers, that would only make sense given
concurrent forward progress.</p>
<h2 data-number="4.3" id="summary-of-design-history-for-bulk-executions-forward-progress"><span class="header-section-number">4.3</span> Summary of design history for
bulk execution’s forward progress<a href="#summary-of-design-history-for-bulk-executions-forward-progress" class="self-link"></a></h2>
<p>The “design intent” of a facility like std::execution is not a
single, static thought. Instead, it represents desiderata of a variety
of coauthors from different institutions. These interests have been
carried over through three major consolidations of proposals: the
unification of three “executors” proposals that led to
<a href="https://wg21.link/p0443r0">P0443R0</a>, the second unification
(as expressed by <a href="https://wg21.link/p1658">P1658</a> and
<a href="https://wg21.link/p1660">P1660</a>) that led to
<a href="https://wg21.link/p0443r11">P0443R11</a>, and the third that
led to replacing P0443 with <a href="https://wg21.link/p2300">P2300</a>.
Nevertheless, one can observe a common set of principles from the
various proposals and their reviews.</p>
<ol type="1">
<li><p>Bulk execution is a “basis operation,” that is, one of some
minimal set of operations on a scheduler that could be used to construct
asynchronous parallel algorithms.</p></li>
<li><p>Bulk execution may have different forward progress guarantees
when running on different execution resources.</p></li>
<li><p>Bulk execution’s forward progress guarantees matter to users and
should be defined.</p></li>
</ol>
<p>Please see Appendix A for a review of this history.</p>
<h1 data-number="5" id="proposed-solution"><span class="header-section-number">5</span> Proposed solution<a href="#proposed-solution" class="self-link"></a></h1>
<h2 data-number="5.1" id="specify-bulk-as-running-on-n-distinct-execution-agents"><span class="header-section-number">5.1</span> Specify <code>bulk</code> as
running on N distinct execution agents<a href="#specify-bulk-as-running-on-n-distinct-execution-agents" class="self-link"></a></h2>
<h3 data-number="5.1.1" id="summary-2"><span class="header-section-number">5.1.1</span> Summary<a href="#summary-2" class="self-link"></a></h3>
<ol type="1">
<li><p><code>bulk</code> is not just another spelling of
<code>for_each</code>, but a way to create execution agents in order to
implement parallel algorithms. Therefore, <code>bulk</code> should offer
the strongest possible forward progress guarantee, by being specified as
each function invocation running in a distinct execution agent. As SG1
polled with unanimous consent in its Wrocław 2024 review of
<a href="https://wg21.link/p3481r0">P3481R0</a>, “[w]e need a version of
the bulk API that creates an execution agent per iteration.”</p></li>
<li><p><code>bulk</code> should not have a different forward progress
guarantee than any other operation on a scheduler, because otherwise it
hinders reasoning about the forward progress guarantee of an operation
defined by an arbitrary graph of senders.</p></li>
<li><p>Make it ill-formed to invoke default <code>bulk</code> with a
custom scheduler that claims the <code>concurrent</code> forward
progress guarantee. Make it ill-formed, no diagnostic required, to invoke
default <code>bulk</code> with a scheduler that claims the
<code>parallel</code> forward progress guarantee, unless default
<code>bulk</code> on that scheduler has that behavior.</p></li>
</ol>
<h3 data-number="5.1.2" id="alternative-designs"><span class="header-section-number">5.1.2</span> Alternative designs<a href="#alternative-designs" class="self-link"></a></h3>
<p>We considered the following alternative designs.</p>
<ol type="1">
<li><p>Status quo: <code>bulk</code> uses an unspecified number of
execution agents; those agents have their scheduler’s forward progress
guarantee, but individual function invocations in <code>bulk</code> have
at best parallel forward progress.</p></li>
<li><p>Let <code>bulk</code> take a minimum forward progress guarantee
requirement as a parameter. <code>bulk</code> fails if it cannot satisfy
the requirement.</p></li>
<li><p>Treat <code>bulk</code> as a special case, with a separate query
for its per-invocation forward progress guarantee. (This was P0443’s
design.)</p></li>
<li><p>Let each sender algorithm have its own scheduler-dependent
forward progress guarantee, so that the
<code>get_forward_progress_guarantee</code> query depends on the
operation as well as the scheduler.</p></li>
<li><p>Require <code>bulk</code> to be customized, and require
customizations to have the same forward progress guarantee as the
scheduler’s.</p></li>
</ol>
<p>We reject the status quo, Option (1), for reasons discussed above.
Both default <code>bulk</code> and <code>bulk</code> customizations must
provide parallel forward progress for their function invocations as long
as their scheduler promises at least parallel forward progress. This is
because <code>bulk</code> creates an asynchronous operation that
performs its function invocations on a value completion operation
([exec.bulk] 6), and asynchronous operations on a scheduler “execute
value completion operations on an execution agent belonging to the
scheduler’s associated execution resource” ([exec.async.ops] 10). For
example, it would not be legal for a scheduler’s
<code>get_forward_progress_guarantee</code> query to return
<code>parallel</code>, yet for <code>bulk</code> on that scheduler to
use an OpenMP “<code>simd</code>” construct (which can only promise
<code>weakly_parallel</code> forward progress).</p>
<p>None of Options (2), (3), and (4) would give users a way to know the
mapping from function invocation to execution agent. Users need to know
that for performance as well as correctness reasons. Given that loop
distributions can be arbitrary one-to-one functions from <span class="math inline">[</span> 0, <code>N</code> <span class="math inline">)</span> to the set of execution agents, an
interface for users to discover the implementation’s mapping would be
complicated. An interface for users to <em>specify</em> the mapping
(analogous to OpenMP’s various loop distribution options) would also be
complicated, <em>and</em> would be separable from an interface for
running on <code>N</code> execution agents. (OpenMP likewise separates
running on <code>N</code> execution agents,
<code>#pragma omp parallel</code>, from loop parallelization,
<code>#pragma omp parallel for</code>.) The fundamental interface is
“run on <code>N</code> execution agents.”</p>
<p>Regarding Option (2), <a href="https://wg21.link/p3481r0">P3481R0</a>
proposes making <code>bulk</code> take an execution policy. This would
have a similar effect, except that it would exclude concurrent forward
progress.<a href="#fn5" class="footnote-ref" id="fnref5" role="doc-noteref"><sup>5</sup></a> The effect of Option (2) is to make
<code>bulk</code> like parallel <code>for_each</code>, in that it has a
parameter analogous to an execution policy. However, if
<code>bulk</code> is a basis operation – a way to implement parallel
algorithms – and not just another way to spell parallel
<code>for_each</code>, then developers would want to use
<code>bulk</code> in a different way. When <em>using</em> parallel
algorithms, execution policies like <code>par</code> and
<code>par_unseq</code> communicate a “lower bound” forward progress
requirement, often to expose opportunities for optimization. When
<em>implementing</em> parallel algorithms, developers instead want to
communicate the exact requirement of a particular code path. They might
query the scheduler for its forward progress guarantee, and use it to
dispatch to different code paths, as in the example below. This is
because the parallel forward progress path might be less efficient than
the concurrent forward progress path, for example because it replaces
blocking synchronization with multiple <code>bulk</code>
invocations.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="kw">auto</span> g <span class="op">=</span> get_forward_progress_guarantee<span class="op">(</span>scheduler<span class="op">)</span>;</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="kw">constexpr</span> <span class="op">(</span>g <span class="op">==</span> forward_progress_guarantee<span class="op">::</span>concurrent<span class="op">)</span> <span class="op">{</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>  <span class="co">// ... concurrent implementation ...</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> <span class="cf">else</span> <span class="cf">if</span> <span class="kw">constexpr</span> <span class="op">(</span>g <span class="op">==</span> forward_progress_guarantee<span class="op">::</span>parallel<span class="op">)</span> <span class="op">{</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>  <span class="co">// ... parallel implementation ...</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> <span class="cf">else</span> <span class="op">{</span></span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span><span class="kw">false</span>, <span class="st">&quot;Need at least parallel forward progress&quot;</span><span class="op">)</span>;  </span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Regarding Option (4), it’s clear to us what a forward progress
guarantee means for <code>bulk</code>’s function invocations that
complete on a single scheduler. However, it’s not clear what that would
mean for an arbitrary graph composed of various asynchronous operations
that might complete on an arbitrary variety of schedulers. Thus, having the
<code>get_forward_progress_guarantee</code> query depend on the sender
algorithm would complicate the syntax without actually being useful for
other sender algorithms besides <code>bulk</code>. This is why forward
progress is a property of a scheduler.</p>
<p>Regarding Option (5), requiring <code>bulk</code> to be customized
could be done either by removing the default implementation entirely, or
by making it still selected for overload resolution but ill-formed.
Users of <code>bulk</code> would either need to rely on a scheduler’s
customization, or find another scheduler that customizes
<code>bulk</code> and then transition to that scheduler whenever they
use <code>bulk</code>. Either approach to removing default
<code>bulk</code> would conflict with the P2300 design principle that
every sender algorithm has a default. Sender algorithms are only
customized for specific schedulers, and only because doing so is
necessary for correctness or performance on that scheduler. If a
scheduler has nothing to say about <code>bulk</code>, it should just
leave the default in place. Removing default <code>bulk</code> would
also have the unfortunate effect of bifurcating the set of schedulers
into “bulk-capable” and “not-bulk-capable.” SG1 has polled to include
bulk execution in the “minimal set” of an executor’s operations, for
example in Kona 2017. In addition, removing default <code>bulk</code>
would also prevent users from doing reasonable things like benchmarking
default <code>bulk</code> against a customization.</p>
<p>Scheduler authors who want to promise concurrent forward progress but
do not care about <code>bulk</code> can do either of the following.</p>
<ol type="a">
<li><p>Nothing (do not customize <code>bulk</code>). This proposal will
make it ill-formed to use default <code>bulk</code> with their scheduler
(see the next section). The scheduler may thus ignore <code>bulk</code>
when reporting its forward progress guarantee.</p></li>
<li><p>Customize <code>bulk</code>, but have the customization always
complete with an error if the number of agents <code>N</code> is greater
than 1 (see the section after the next).</p></li>
</ol>
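<p>Option (b) could look like the following sketch, with invented names
standing in for the sender machinery (<code>on_value</code> and
<code>on_error</code> model the value and error completion
operations):</p>

```cpp
#include <cstddef>
#include <string>

// Sketch of option (b), with invented names: a bulk customization for a
// scheduler that promises concurrent forward progress but supports only
// a single agent. on_value / on_error stand in for the sender's value
// and error completion operations.
template <class F, class OnValue, class OnError>
void single_agent_bulk(std::size_t n, F f,
                       OnValue on_value, OnError on_error) {
  if (n > 1) {
    // Cannot run n > 1 agents with concurrent forward progress:
    // complete with a scheduler-dependent error instead of devolving
    // to a weaker guarantee.
    on_error(std::string("bulk: more than one agent requested"));
    return;
  }
  if (n == 1) {
    f(0);  // a single agent trivially has concurrent forward progress
  }
  on_value();
}
```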
<h2 data-number="5.2" id="make-it-ill-formed-to-use-default-bulk-with-a-scheduler-promising-concurrent-forward-progress"><span class="header-section-number">5.2</span> Make it ill-formed to use
default <code>bulk</code> with a scheduler promising concurrent forward
progress<a href="#make-it-ill-formed-to-use-default-bulk-with-a-scheduler-promising-concurrent-forward-progress" class="self-link"></a></h2>
<p>We propose that it be ill-formed to use default <code>bulk</code>
with a scheduler whose <code>get_forward_progress_guarantee</code> query
returns <code>concurrent</code>. This is because default
<code>bulk</code> cannot possibly fulfill concurrent forward progress
for any number of agents <code>N</code> other than one. This will
prevent a common error for scheduler authors who do not care about
<code>bulk</code>, yet still want to provide a <code>concurrent</code>
scheduler. It also makes default <code>bulk</code> usable for users who
don’t mind a weaker forward progress guarantee.</p>
<h2 data-number="5.3" id="permit-a-bulk-customization-to-fail-if-it-cannot-fulfill-its-forward-progress-for-a-given-number-of-agents"><span class="header-section-number">5.3</span> Permit a <code>bulk</code>
customization to fail if it cannot fulfill its forward progress for a
given number of agents<a href="#permit-a-bulk-customization-to-fail-if-it-cannot-fulfill-its-forward-progress-for-a-given-number-of-agents" class="self-link"></a></h2>
<h3 data-number="5.3.1" id="summary-3"><span class="header-section-number">5.3.1</span> Summary<a href="#summary-3" class="self-link"></a></h3>
<ol type="1">
<li><p>Permit a <code>bulk</code> customization to fail at run time if
it cannot provide the promised forward progress guarantee for a given
number of agents <code>N</code>.</p></li>
<li><p>This lets <code>bulk</code> still promise the strongest forward
progress guarantee that it can provide for any value of
<code>N</code>.</p></li>
<li><p><code>bulk</code> fails by propagating a scheduler-dependent
error completion.</p></li>
</ol>
<h3 data-number="5.3.2" id="discussion"><span class="header-section-number">5.3.2</span> Discussion<a href="#discussion" class="self-link"></a></h3>
<p>Letting <code>bulk</code> fail if it cannot fulfill its forward
progress guarantee for a given <code>N</code> would still permit
<code>bulk</code> to promise the strongest forward progress guarantee
that it can provide for any value of <code>N</code>. By analogy,
<code>std::thread</code> still promises an implementation-defined
forward progress guarantee, even though <code>std::thread</code>’s
constructor is permitted to fail. Just as <code>std::thread</code>’s
constructor does not devolve to running code inline if it fails to
launch a thread, <code>bulk</code> should not devolve to a weaker
forward progress guarantee if it cannot meet its promise.</p>
<p>The idea that execution should “indicate failure” if it cannot
provide the requested forward progress guarantee occurs in
<a href="http://wg21.link/p0058r1">P0058R1</a>, one of the three
predecessors of P0443 (see historical overview in Appendix A). SG1, in
<a href="https://wiki.edg.com/bin/view/Wg21issaquah2016/P0443r0">its
review of P0443R0 at the Issaquah 2016 meeting,</a> expressed the wish
for task launch to fail if it could not be launched with a concurrent
progress guarantee.</p>
<p>Permitting <code>bulk</code> to fail for <code>N</code> “too large”
makes this proposal trivial to implement. Scheduler authors who don’t
care about <code>bulk</code> have two minimal-effort choices.</p>
<ol type="1">
<li><p>Do nothing. Default <code>bulk</code> will be ill-formed if their
scheduler reports concurrent forward progress, and both well-formed and
correct if their scheduler reports weaker forward progress than
concurrent.</p></li>
<li><p>Customize <code>bulk</code> for their scheduler, and make
<code>bulk</code> fail at run time for <code>N</code> greater than
1.</p></li>
</ol>
<p>How should <code>bulk</code> fail? The idiomatic error reporting
method for senders in [exec] is to use the error channel. The sender
adapters <code>starts_on</code>, <code>schedule_from</code>, and
<code>on</code> all handle failing to schedule an operation on a
scheduler by executing an error completion on an unspecified execution
agent. Why shouldn’t other operations that may require the creation of
execution agents, like <code>bulk</code>, do the same? This behavior is
analogous to that of <code>std::thread</code>’s constructor, which
throws <code>std::system_error</code> “if unable to start the new
thread” ([thread.thread.constr] 8).</p>
<p>What error should <code>bulk</code> report? We should not require the
error to be an exception. This permits use of std::execution with custom
schedulers in freestanding environments, or environments where
exceptions are disabled. We should also let different schedulers report
different error types. For example, a scheduler that creates execution
agents one at a time using <code>jthread</code>’s constructor (not
necessarily wise from a performance perspective, though it would
guarantee concurrent forward progress) would reasonably just pass along
the thrown <code>system_error</code> if thread creation fails. Forcing
different schedulers to use the same error type would prevent custom
schedulers from passing along information that might help users handle
errors.</p>
<p>Implementing error handling by reporting down the error channel is
practical in at least some cases. For example, OpenMP has a run-time
function <code>omp_get_max_threads</code> that a <code>bulk</code>
implementation could use to check whether <code>N</code> is too large.</p>
<h3 data-number="5.3.3" id="design-implications"><span class="header-section-number">5.3.3</span> Design implications<a href="#design-implications" class="self-link"></a></h3>
<p>Customizations of <code>bulk</code> would need to know that they have
<code>N</code> distinct execution agents available, before any of those
agents may invoke the user’s function. Any error reporting would need to
happen before any function invocations. For example, it would be
nonconforming for a customization to start some smaller number
<code>n</code> of agents opportunistically, permit them to begin
invoking the function for indices <span class="math inline">[</span> 0,
<code>n</code> <span class="math inline">)</span>, and only thereafter
try creating more agents with the possibility of failure. In our view,
this is an implication of the current specification; this proposal would
not change that.</p>
<h3 data-number="5.3.4" id="alternative-designs-1"><span class="header-section-number">5.3.4</span> Alternative designs<a href="#alternative-designs-1" class="self-link"></a></h3>
<p>We do <em>not</em> accept any design permitting <code>bulk</code> to
change its forward progress guarantee depending on <code>N</code>, for
example by tiling function invocations across available execution agents
if <code>N</code> is greater than the maximum number of concurrent
agents. This would effectively make <code>bulk</code> a parallel loop
interface. As we discuss elsewhere, we consider “running on
<code>N</code> execution agents” separate functionality from
“parallelize this loop,” and think the two should have distinct
interfaces.</p>
<p>The alternative designs discussed below address the question of how
users would prevent errors due to the number of agents <code>N</code>
being too large. Different implementations’ upper bounds for
<code>N</code> may vary from 1 (an OpenMP implementation running on a
single-processor system that is not permitted to create threads) to
numbers too large to fit in 32 bits (the maximum product of the x, y,
and z dimensions of a grid of thread blocks in a CUDA kernel launch).
Here are three designs. We do <em>not</em> propose any of them, but if
we had to pick one, it would be the third design.</p>
<ol type="1">
<li><p>Let users query the scheduler’s maximum number of execution
agents that can execute with the promised forward progress guarantee:
<code>size_t get_max_num_agents()</code>.</p></li>
<li><p>Add a variant of <code>bulk</code> for which users do not specify
the number of execution agents <code>N</code>, but instead receive it as
a second argument of their function.</p></li>
<li><p>Add a variant of <code>bulk</code> for which users specify a
<em>maximum</em> number of execution agents <code>N</code>. They will
then receive the <em>actual</em> number of execution agents
<code>n</code>, where <code>n</code> is no larger than <code>N</code>,
as the second argument of their function.</p></li>
</ol>
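<p>For concreteness, the three options could be sketched as follows. Every name and signature here is hypothetical, and a plain sequential loop stands in for real execution agents:</p>

```cpp
#include <cassert>
#include <cstddef>

// Option (1): a query for the scheduler's maximum number of agents.
// toy_scheduler is a hypothetical illustration, not a real scheduler.
struct toy_scheduler {
  std::size_t get_max_num_agents() const { return 16; }
};

// Option (2): the scheduler picks N; the user's function receives (k, N).
template <class F>
void bulk_auto(F f) {
  const std::size_t N = 4;  // chosen by the scheduler, not the user
  for (std::size_t k = 0; k < N; ++k) f(k, N);
}

// Option (3): the user supplies a *maximum*; the function receives (k, n)
// with n <= max_agents chosen by the scheduler.
template <class F>
void bulk_at_most(std::size_t max_agents, F f) {
  const std::size_t n = max_agents < 4 ? max_agents : 4;
  for (std::size_t k = 0; k < n; ++k) f(k, n);
}
```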
<p>Regarding Option (1), we do not think a “maximum <code>N</code>”
query would be helpful for arbitrary user-defined schedulers. First,
users may not <em>need</em> to run on the maximum number of agents,
because they might only have a small number of tasks. Asking for more
agents than needed might waste resources. Second, the query could only
ever be an upper bound, because the actual maximum might change during
the program’s lifetime (as system resources are claimed by other
processes or become available again) and might depend on the context
(e.g., whether this is a nested <code>bulk</code> invocation, assuming
that those are allowed).<a href="#fn6" class="footnote-ref" id="fnref6" role="doc-noteref"><sup>6</sup></a> Third, the best upper bound for
performance, or even a practically attainable upper bound, might be
smaller than the maximum value. Just because
<code>std::vector&lt;T,Allocator&gt;::max_size()</code> is huge doesn’t
necessarily mean that you can allocate that much memory on every system.
Likewise, just because a scheduler <em>can</em> create millions of
concurrent execution agents doesn’t necessarily mean that this would
perform acceptably.</p>
<p>To elaborate on (2) and (3), <code>bulk</code>’s function would have
two parameters <code>Shape k</code>, <code>Shape N</code>. The argument
for <code>k</code> is the invocation index in <span class="math inline">[</span> 0, <code>N</code> <span class="math inline">)</span>, and the argument for <code>N</code> is the
actual number of execution agents.</p>
<p>Option (2) would correspond to using
<code>#pragma omp parallel</code> without a <code>num_threads(N)</code>
clause, where users can call <code>omp_get_num_threads</code> to query
the number of threads that the scheduler offered. The <code>bulk</code>
customization would provide some number <code>N</code> of execution
agents (not necessarily the maximum – e.g., the scheduler may reduce the
number to help with load balancing). Default <code>bulk</code> would
provide <code>N=1</code>, thus ensuring that it has the same progress
guarantee as any scheduler.</p>
<p>Option (2) has the same issue as Option (1): users may only have
a small number of function invocations to do, and getting more agents than
that might waste resources. Not giving users a way to specify
<code>N</code> would force a branch
<code>if (k &lt; N) do_work(k)</code> into the user’s function. Also,
the optimal <code>N</code> value might depend on some properties of the
user’s function (e.g., its stack size requirement, or how many registers
it uses) that the system might not be able to determine even at run time
(e.g., if the function was loaded from a dynamically linked library).
Thus, users still need a way to specify <code>N</code>.</p>
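<p>The forced branch can be seen in a small sketch of an Option (2)-style interface (<code>bulk_auto</code> and its agent count are hypothetical; a sequential loop stands in for real agents):</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical Option (2)-style entry point: the scheduler picks the agent
// count and passes (k, num_agents) to the user's function.
template <class F>
void bulk_auto(F f) {
  const std::size_t num_agents = 8;  // the scheduler's choice
  for (std::size_t k = 0; k < num_agents; ++k) f(k, num_agents);
}

// With only num_tasks pieces of work, any surplus agent has to branch
// around the work; this is the forced branch described in the text.
std::vector<int> run(std::size_t num_tasks) {
  std::vector<int> done(num_tasks, 0);
  bulk_auto([&](std::size_t k, std::size_t /*num_agents*/) {
    if (k < num_tasks) done[k] = 1;  // the forced branch
  });
  return done;
}
```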
<p>Option (3) – letting users specify a maximum number of agents – would
still force a branch in the user’s function. However, it would expose an
opportunity for users and the system to negotiate over the number of
agents. Our one concern is that this would turn “failure to fulfill the
forward progress guarantee” from an error into a non-error that is
nevertheless the user’s problem. Implementations could abuse this to
claim concurrent forward progress while only ever offering
<code>N = 1</code> execution agent. This is why we do not propose this
feature, though we object to it the least of all three options here.</p>
<h1 data-number="6" id="implementation-status"><span class="header-section-number">6</span> Implementation status<a href="#implementation-status" class="self-link"></a></h1>
<p>Implementations would only need to make three changes to conform with
this proposal.</p>
<ol type="1">
<li><p>Schedulers must only promise concurrent forward progress if they
have a <code>bulk</code> customization and if that customization
promises concurrent forward progress for its function
invocations.</p></li>
<li><p>Default <code>bulk</code> must be ill-formed when used with a
scheduler that promises concurrent forward progress.</p></li>
<li><p>If a scheduler has a <code>bulk</code> customization, and if the
<code>bulk</code> customization can fail to provide the scheduler’s
forward progress guarantee for a given <code>Shape</code> argument
<code>shape</code>, then the <code>bulk</code> customization must report
an error (via <code>set_error</code>).</p></li>
</ol>
<p>The custom schedulers in
<a href="https://github.com/NVIDIA/stdexec/">NVIDIA’s reference
implementation of std::execution</a> already conform with this proposal,
because they do not promise concurrent forward progress. For example,
its CPU-based <code>static_thread_pool::scheduler</code> reports
parallel forward progress, because it uses a thread pool with a fixed
number of threads. The GPU-based schedulers report weakly parallel
forward progress. Thus, the only change we have left to implement would
be (2).</p>
<h1 data-number="7" id="wording"><span class="header-section-number">7</span> Wording<a href="#wording" class="self-link"></a></h1>
<blockquote>
<p>Text in blockquotes is not proposed wording, but rather instructions
for generating proposed wording. The � character is used to denote a
placeholder section number which the editor shall determine.</p>
</blockquote>
<h2 data-number="7.1" id="update-version-macro"><span class="header-section-number">7.1</span> Update version macro<a href="#update-version-macro" class="self-link"></a></h2>
<blockquote>
<p>In <strong>[version.syn]</strong>, change the value of the
<code>__cpp_lib_senders</code> macro to <code>YYYYMML</code>, where the
placeholder value <code>YYYYMML</code> denotes this proposal’s date of
adoption.</p>
</blockquote>
<h2 data-number="7.2" id="change-default-bulk"><span class="header-section-number">7.2</span> Change default
<code>bulk</code><a href="#change-default-bulk" class="self-link"></a></h2>
<blockquote>
<p>Change [bulk.exec] 4 as follows, so that it is ill-formed to complete
default <code>bulk</code> if its completion scheduler promises
concurrent forward progress. (The only changes to this section are in
the body of the lambda assigned to
<em><code>impls-for</code></em><code>&lt;bulk_t&gt;::</code><em><code>complete</code></em>.)</p>
</blockquote>
<p><span class="marginalizedparent"><a class="marginalized">4</a></span>
The member
<em><code>impls-for</code></em><code>&lt;bulk_t&gt;​::</code><em><code>​complete</code></em>
is initialized with a callable object equivalent to the following
lambda:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">[]&lt;</span><span class="kw">class</span> Index, <span class="kw">class</span> State, <span class="kw">class</span> Rcvr, <span class="kw">class</span> Tag, <span class="kw">class</span><span class="op">...</span> Args<span class="op">&gt;</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>  <span class="op">(</span>Index, State<span class="op">&amp;</span> state, Rcvr<span class="op">&amp;</span> rcvr, Tag, Args<span class="op">&amp;&amp;...</span> args<span class="op">)</span> <span class="kw">noexcept</span> <span class="op">-&gt;</span> <span class="dt">void</span> <span class="kw">requires</span> <em>see below</em> <span class="op">{</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> <span class="kw">constexpr</span> <span class="op">(</span>same_as<span class="op">&lt;</span>Tag, set_value_t<span class="op">&gt;)</span> <span class="op">{</span></span></code></pre></div>
<div class="add" style="color: #00AA00">

<div class="sourceCode" id="cb3"><pre class="sourceCode default"><code class="sourceCode default"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>      constexpr auto guarantee = get_forward_progress_guarantee(decltype(get_completion_scheduler&lt;set_value_t&gt;(get_env(rcvr))));</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>      static_assert(guarantee != forward_progress_guarantee::concurrent);</span></code></pre></div>

</div>
<div class="sourceCode" id="cb4"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>      <span class="kw">auto</span><span class="op">&amp;</span> <span class="op">[</span>shape, f<span class="op">]</span> <span class="op">=</span> state;</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>      <span class="kw">constexpr</span> <span class="dt">bool</span> nothrow <span class="op">=</span> <span class="kw">noexcept</span><span class="op">(</span>f<span class="op">(</span><span class="kw">auto</span><span class="op">(</span>shape<span class="op">)</span>, args<span class="op">...))</span>;</span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>      <em>TRY-EVAL</em><span class="op">(</span>rcvr, <span class="op">[&amp;]()</span> <span class="kw">noexcept</span><span class="op">(</span>nothrow<span class="op">)</span> <span class="op">{</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>        <span class="cf">for</span> <span class="op">(</span><span class="kw">decltype</span><span class="op">(</span><span class="kw">auto</span><span class="op">(</span>shape<span class="op">))</span> i <span class="op">=</span> <span class="dv">0</span>; i <span class="op">&lt;</span> shape; <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>          f<span class="op">(</span><span class="kw">auto</span><span class="op">(</span>i<span class="op">)</span>, args<span class="op">...)</span>;</span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a>        <span class="op">}</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a>        Tag<span class="op">()(</span>std<span class="op">::</span>move<span class="op">(</span>rcvr<span class="op">)</span>, std<span class="op">::</span>forward<span class="op">&lt;</span>Args<span class="op">&gt;(</span>args<span class="op">)...)</span>;</span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a>      <span class="op">}())</span>;</span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span> <span class="cf">else</span> <span class="op">{</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a>      Tag<span class="op">()(</span>std<span class="op">::</span>move<span class="op">(</span>rcvr<span class="op">)</span>, std<span class="op">::</span>forward<span class="op">&lt;</span>Args<span class="op">&gt;(</span>args<span class="op">)...)</span>;</span>
<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span></code></pre></div>
<p><span class="marginalizedparent"><a class="marginalized">5</a></span>
The expression in the <em>requires-clause</em> of the lambda above is
<code>true</code> if and only if <code>Tag</code> denotes a type other
than <code>set_value_t</code> or if the expression
<code>f(auto(shape), args...)</code> is well-formed.</p>
<h2 data-number="7.3" id="specify-behavior-of-bulk-customizations"><span class="header-section-number">7.3</span> Specify behavior of
<code>bulk</code> customizations<a href="#specify-behavior-of-bulk-customizations" class="self-link"></a></h2>
<blockquote>
<p>Change [bulk.exec] 6 as follows, so that <code>bulk</code>
customizations complete with an error completion if the customization is
unable to execute on <code>shape</code> distinct execution agents. (The
definition of the pack <code>args</code> has been moved from
subparagraph 6.1 to paragraph 6. Subparagraph 6.1 has been rewritten and
now includes two new subparagraphs 6.1.1 and 6.1.2.)</p>
</blockquote>
<p><span class="marginalizedparent"><a class="marginalized">6</a></span>
Let the subexpression <code>out_sndr</code> denote the result of the
invocation <code>bulk(sndr, shape, f)</code> or an object equal to such,
and let the subexpression <code>rcvr</code> denote a receiver such that
the expression <code>connect(out_sndr, rcvr)</code> is well-formed.
<span class="add" style="color: #00AA00"><ins>Let
<span><code>args</code></span> denote a pack of lvalue subexpressions
referring to the value completion result datums of the input sender
<span><code>sndr</code></span>.</ins></span> The expression
<code>connect(out_sndr, rcvr)</code> has undefined behavior unless it
creates an asynchronous operation ([exec.async.ops]) that, when
started,</p>
<div class="add" style="color: #00AA00">
<p><span class="marginalizedparent"><a class="marginalized">(6.1)</a></span> if
<code>sndr</code> propagated a value completion operation, does exactly
one of the following:</p>
<ul>
<li><p><span class="marginalizedparent"><a class="marginalized">(6.1.1)</a></span>
completes with an error completion operation with a scheduler-specific
error on <code>rcvr</code>, if the asynchronous operation’s value
completion operation would have been unable to invoke
<code>f(i, args...)</code> on a distinct execution agent for every
<code>i</code> of type <code>Shape</code> in <code>[0, shape)</code>;
or</p></li>
<li><p><span class="marginalizedparent"><a class="marginalized">(6.1.2)</a></span>
invokes <code>f(i, args...)</code> on a distinct execution agent for
every <code>i</code> of type <code>Shape</code> in
<code>[0, shape)</code>; and</p></li>
</ul>

</div>
<div class="rm" style="color: #bf0303">
<p><span class="marginalizedparent"><a class="marginalized">(6.1)</a></span> on a
value completion operation, invokes <code>f(i, args...)</code> for every
<code>i</code> of type <code>Shape</code> in <code>[0, shape)</code>,
where <code>args</code> is a pack of lvalue subexpressions referring to
the value completion result datums of the input sender, and</p>
</div>
<p><span class="marginalizedparent"><a class="marginalized">(6.2)</a></span>
propagates all completion operations sent by <code>sndr</code>.</p>
<h1 data-number="8" id="appendix-a-design-history-for-bulk-executions-forward-progress"><span class="header-section-number">8</span> Appendix A: Design history for
bulk execution’s forward progress<a href="#appendix-a-design-history-for-bulk-executions-forward-progress" class="self-link"></a></h1>
<p>Appendix A reviews the history of std::execution in the context of
bulk execution.</p>
<h2 data-number="8.1" id="predecessors-of-p0443"><span class="header-section-number">8.1</span> Predecessors of P0443<a href="#predecessors-of-p0443" class="self-link"></a></h2>
<p><a href="https://wg21.link/p0443">P0443</a>, the predecessor proposal
of P2300, was itself a unification of three different “executors”
proposals:</p>
<ol type="1">
<li><p>“Google’s executor model for interfacing with thread pools”
<a href="http://wg21.link/n4414">N4414</a> (2015),</p></li>
<li><p>“Chris Kohlhoff’s executor model for the Networking TS”
<a href="http://wg21.link/n4370">N4370</a> (2015), and</p></li>
<li><p>“NVIDIA’s executor model for the Parallelism TS”
<a href="http://wg21.link/p0058">P0058</a>.</p></li>
</ol>
<p>Of these three proposals, only P0058 includes bulk execution. It
defines different “tags” for specifying different levels of forward
progress of bulk function invocations, including
<code>concurrent_execution_tag</code>, as well as different kinds of
executors (including <code>concurrent_executor</code>) that offer
different levels of forward progress. It explains that different
strengths of forward progress guarantee can be used to construct
algorithms that communicate in different ways.</p>
<blockquote>
<p>For example, parallel agents might communicate via a shared
<code>atomic&lt;T&gt;</code> parameter, while concurrent agents might
use a shared <code>barrier</code> object to synchronize their
communication (Section 4.5).</p>
</blockquote>
<blockquote>
<p><code>concurrent_execution_tag</code> - … The basic idea is that the
function invocations executed by a group of concurrent execution agents
are permitted to block each others’ forward progress. This guarantee
allows concurrent execution agents to communicate and synchronize
(Section 4.10).</p>
</blockquote>
<p>N4370 and its foundational proposals on asynchronous operations
<a href="https://wg21.link/n4045">N4045</a> (2014) and
<a href="https://wg21.link/n4242">N4242</a> (2014) do not speak of
forward progress. N4414 makes forward progress a property of the
specific executor – e.g., the <code>system_executor</code> has
concurrent forward progress. However, it leaves a query for that
property to future work.</p>
<blockquote>
<p>An interface like that suggested around Execution Agents in N4156
(and related papers) could provide a generic mechanism for checking the
traits of an executor (whether it is concurrent, parallel, or weakly
parallel, as well as the behavior of thread local storage). Discussion
of whether that concept should be applied to all executors is left for a
subsequent paper but some form of execution agent traits is reasonable
as [sic].</p>
</blockquote>
<p>It may help to refer to <a href="https://wg21.link/n4156">N4156</a>
(“Light-Weight Execution Agents,” 2014), which defines “execution agent”
in terms of the three levels of forward progress guarantees that later
became part of the C++ Standard. It gives examples of implementations
that provide different levels of forward progress for their execution
agents.</p>
<h2 data-number="8.2" id="earlier-versions-of-p0443"><span class="header-section-number">8.2</span> Earlier versions of P0443<a href="#earlier-versions-of-p0443" class="self-link"></a></h2>
<p>SG1’s discussions of P0443 cover forward progress guarantees
extensively, including guarantees of bulk execution. See e.g., SG1
discussion in
<a href="https://wiki.edg.com/bin/view/Wg21issaquah2016/P0443r0">Issaquah
2016</a>,
<a href="https://wiki.edg.com/bin/view/Wg21kona2017/P0443">Kona
2017</a>, and
<a href="https://wiki.edg.com/bin/view/Wg21albuquerque/NotesWedP0443">Albuquerque
2017</a>. Some reviewers mentioned wanting bulk execution to be able to
provide concurrent forward progress if possible, for example if a
bounded thread pool sometimes has enough threads for it. In Kona 2017,
SG1 polled “Should the bulk two-way<a href="#fn7" class="footnote-ref" id="fnref7" role="doc-noteref"><sup>7</sup></a> functions be part of the
minimal set for a final TS?” with results 10/3/1/0/1.</p>
<p><a href="https://wg21.link/p0688r0">P0688R0</a> responded to Kona
2017 feedback on P0443R1. That feedback aimed to simplify P0443. P0688
does not clarify the forward progress of bulk execution’s individual
function invocations; “blocking” for <code>bulk_*_execute</code> refers
to the caller, as in, “[m]ay block forward progress of the caller until
one or more invocations of <code>f</code> finish execution.”
<a href="https://wiki.edg.com/bin/view/Wg21toronto2017/P0688">Minutes of
SG1 discussion at the Toronto 2017 meeting</a> refer to the
guarantees “within the bulk group” as distinct from the properties of
the “spawning thread.” They also point out that whether an operation
blocks the caller (in P0443, operations could be blocking or
nonblocking; P2300 <em>only</em> has asynchronous operations) means
something different than the forward progress guarantee of the
operation.</p>
<h2 data-number="8.3" id="later-versions-of-p0443"><span class="header-section-number">8.3</span> Later versions of P0443<a href="#later-versions-of-p0443" class="self-link"></a></h2>
<p>Later versions of <a href="https://wg21.link/p0443">P0443</a>
introduced a customization point <code>bulk_execute</code> which is the
analog and predecessor of P2300’s <code>bulk</code>.
<code>bulk_execute</code> lets callers specify a separate forward
progress guarantee for bulk execution, as a <code>bulk_guarantee</code>
query. It has offered three different guarantees
(<code>unsequenced</code>, <code>sequenced</code>, and
<code>parallel</code>) since R4 of the proposal. The
<code>parallel</code> guarantee means parallel forward progress,
<code>sequenced</code> means a sequential loop, and
<code>unsequenced</code> means something like
<code>std::execution::par_unseq</code>.</p>
<p>The 2019 paper <a href="https://wg21.link/p1658">P1658</a> led to the
<code>bulk_execute</code> customization point design in later versions
of P0443 (starting with R11). P1658 summarizes <code>bulk_execute</code>
as “eagerly creat[ing] multiple execution agents in bulk.”
<a href="https://wg21.link/p1660r0">P1660R0</a> (“A Compromise Executor
Design Sketch”) talks about different ways to express bulk execution –
either eager, or lazy – but leaves the bulk execution interface as
future work for P0443.</p>
<p>SG1’s Prague 2020 review of
<a href="https://wg21.link/p1897r2">P1897R2, “Towards C++23 executors: A
proposal for an initial set of algorithms”</a> (see
<a href="https://wiki.edg.com/bin/view/Wg21prague/P1897R2SG1">minutes</a>)
reflects a debate on whether bulk execution is a fundamental building
block. Much of this debate revolves around P0443’s separation of bulk
executors and non-bulk executors. The final version
<a href="https://wg21.link/p1897r3">P1897R3</a> removes
<code>indexed_for</code> (the “cleaned up version of
<code>bulk_execute</code>”), with the suggestion that higher-level
algorithms over ranges could be built atop bulk execution.</p>
<p>Discussion of <a href="https://wg21.link/p1898">P1898, “Forward
progress delegation for executors,”</a> points out that forward progress
guarantees are not as well-defined for asynchronous algorithms as they
are for synchronous algorithms. This matters, for example, for launching
nested work, like bulk iteration being launched inside bulk
iteration.</p>
<p>We consider <a href="https://wg21.link/p2181">P2181, “Correcting the
Design of Bulk Execution”</a> (R0 published June 2020, R1 published
November 2020) to describe the last pre-P2300 state of affairs of bulk
execution. It proposes changes to P0443R14 (which was the last published
version of P0443).<a href="#fn8" class="footnote-ref" id="fnref8" role="doc-noteref"><sup>8</sup></a> P0443R14 includes an “Editorial note”
that says, “We should probably define what ‘execute N invocations of the
function object F on the executor S in bulk’ means more carefully.”
Section 4.1 of P2181 proposes defining that phrase in the following
way.</p>
<blockquote>
<p>A <em>group of execution agents</em> created in bulk has a
<em>shape</em>. Execution agents within a group are identified by
<em>indices</em>, whose unique values are the set of contiguous indices
spanned by the group’s shape.</p>
<p>An executor <em>bulk executes</em> an expression by scheduling the
creation of a group of execution agents on which the expression executes
in bulk. Invocable expressions are invoked with each execution agent’s
index. Bulk execution of expressions that are not invocables is
executor-defined.</p>
</blockquote>
<p>SG1 ended up giving feedback on P2181R1, but not on bulk’s forward
progress. P2181 review stopped there, as the effort was folded into
P2300.</p>
<h2 data-number="8.4" id="p2300"><span class="header-section-number">8.4</span> P2300<a href="#p2300" class="self-link"></a></h2>
<p>In P2300, the “Asynchronous inclusive scan” example that uses
<code>bulk</code> (Section 1.3.2) says that it “creat[es]” or “spawn[s]”
a number of “execution agents” equal to its input argument N. The
example goes on to build an asynchronous inclusive scan algorithm out of
a sequence of <code>bulk</code> and <code>then</code> operations. The
example tiles the <code>bulk</code> operations by hand. However, the
later nonwording description of <code>bulk</code> leaves unspecified the
number of execution agents that <code>bulk</code> uses.</p>
<blockquote>
<p>Each invocation of <code>function</code> runs in an execution agent
whose forward progress guarantees are determined by the scheduler on
which they are run. All agents created by a single use of
<code>bulk</code> execute with the same guarantee. The number of
execution agents used by <code>bulk</code> is not specified. This allows
a scheduler to execute some invocations of the <code>function</code> in
parallel.</p>
</blockquote>
<p>The above text does not appear until R6 of P2300. R2 elaborates on
the brief description in R0 and R1 with the following text that supports
the “N execution agents” intent of <code>bulk</code>.</p>
<blockquote>
<p>Each invocation of function runs in an execution agent whose forward
progress guarantees are determined by the scheduler on which they are
run. All agents created by a single use of <code>bulk</code> execute
with the same guarantee. This allows, for instance, a scheduler to
execute all invocations of the function in parallel.</p>
<p>The <code>bulk</code> operation is intended to be used at the point
where the number of agents to be created is known and provided to
<code>bulk</code> via its shape parameter.</p>
</blockquote>
<p>A possible reason for the change in R6 is that <code>bulk</code> has
always had a default sequential implementation since R0 of P2300. Making
sender algorithms customizable and providing default implementations for
them are key parts of P2300’s design, as explained in Section 5.4,
“Sender algorithms are customizable.” However, <code>bulk</code> having
a sequential default means that it cannot offer a concurrent forward
progress guarantee, even if the scheduler does.</p>
<h2 data-number="8.5" id="how-our-proposal-fits-into-this-history"><span class="header-section-number">8.5</span> How our proposal fits into this
history<a href="#how-our-proposal-fits-into-this-history" class="self-link"></a></h2>
<ol type="1">
<li><p>We want to restore <a href="https://wg21.link/p2181">P2181’s</a>
definition of bulk execution as running each function invocation on a
distinct execution agent.</p></li>
<li><p>We also want to preserve default <code>bulk</code>’s sequential
implementation.</p></li>
<li><p>We propose resolving this apparent contradiction by using
“execution agent” just as a way to talk about forward progress. This
means that default <code>bulk</code> promises at best parallel forward
progress.</p></li>
</ol>
<h1 data-number="9" id="appendix-b-chunked-parallel-for_each"><span class="header-section-number">9</span> Appendix B: Chunked parallel
<code>for_each</code><a href="#appendix-b-chunked-parallel-for_each" class="self-link"></a></h1>
<p><a href="https://godbolt.org/z/eejzn6cex">This Compiler Explorer
example</a> shows how to build loop parallelization as a straightforward
composition of the mapping (a function from agent index to a set of loop
indices) with <code>bulk</code> (that executes a function of agent index
on a distinct agent). We show the example’s complete source code
below.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;algorithm&gt;</span><span class="pp"> </span><span class="co">// ranges::for_each</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;cassert&gt;</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;functional&gt;</span><span class="pp"> </span><span class="co">// invoke</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;iostream&gt;</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;type_traits&gt;</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;ranges&gt;</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;vector&gt;</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> mystd <span class="op">{</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span><span class="op">&lt;</span><span class="kw">class</span> I, <span class="kw">class</span> Fun<span class="op">&gt;</span></span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> for_each_result <span class="op">{</span></span>
<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a>  I in;</span>
<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a>  Fun fun;</span>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span>;</span>
<span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-17"><a href="#cb5-17" aria-hidden="true" tabindex="-1"></a><span class="co">// Sequential implementation of something like ranges::for_each.</span></span>
<span id="cb5-18"><a href="#cb5-18" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> seq_for_each_fn <span class="op">{</span></span>
<span id="cb5-19"><a href="#cb5-19" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span></span>
<span id="cb5-20"><a href="#cb5-20" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>random_access_iterator I,</span>
<span id="cb5-21"><a href="#cb5-21" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S,</span>
<span id="cb5-22"><a href="#cb5-22" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> Proj <span class="op">=</span> std<span class="op">::</span>identity,</span>
<span id="cb5-23"><a href="#cb5-23" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>indirectly_unary_invocable<span class="op">&lt;</span>std<span class="op">::</span>projected<span class="op">&lt;</span>I, Proj<span class="op">&gt;&gt;</span> Fun</span>
<span id="cb5-24"><a href="#cb5-24" aria-hidden="true" tabindex="-1"></a>  <span class="op">&gt;</span></span>
<span id="cb5-25"><a href="#cb5-25" aria-hidden="true" tabindex="-1"></a>  for_each_result<span class="op">&lt;</span>I, Fun<span class="op">&gt;</span></span>
<span id="cb5-26"><a href="#cb5-26" aria-hidden="true" tabindex="-1"></a>  <span class="kw">operator</span><span class="op">()</span> <span class="op">(</span>I first, S last, Fun f, Proj proj <span class="op">=</span> <span class="op">{})</span> <span class="kw">const</span> <span class="op">{</span></span>
<span id="cb5-27"><a href="#cb5-27" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span>; first <span class="op">!=</span> last; <span class="op">++</span>first<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-28"><a href="#cb5-28" aria-hidden="true" tabindex="-1"></a>      std<span class="op">::</span>invoke<span class="op">(</span>f, std<span class="op">::</span>invoke<span class="op">(</span>proj, <span class="op">*</span>first<span class="op">))</span>;</span>
<span id="cb5-29"><a href="#cb5-29" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb5-30"><a href="#cb5-30" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> <span class="op">{</span>std<span class="op">::</span>move<span class="op">(</span>first<span class="op">)</span>, std<span class="op">::</span>move<span class="op">(</span>f<span class="op">)}</span>;</span>
<span id="cb5-31"><a href="#cb5-31" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb5-32"><a href="#cb5-32" aria-hidden="true" tabindex="-1"></a><span class="op">}</span>;</span>
<span id="cb5-33"><a href="#cb5-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-34"><a href="#cb5-34" aria-hidden="true" tabindex="-1"></a><span class="kw">inline</span> <span class="kw">constexpr</span> seq_for_each_fn seq_for_each;</span>
<span id="cb5-35"><a href="#cb5-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-36"><a href="#cb5-36" aria-hidden="true" tabindex="-1"></a><span class="co">// Given the index in [0, num_agents) of an agent,</span></span>
<span id="cb5-37"><a href="#cb5-37" aria-hidden="true" tabindex="-1"></a><span class="co">// return the half-open interval of indices [first, last)</span></span>
<span id="cb5-38"><a href="#cb5-38" aria-hidden="true" tabindex="-1"></a><span class="co">// in the range [0, num_elements)</span></span>
<span id="cb5-39"><a href="#cb5-39" aria-hidden="true" tabindex="-1"></a><span class="co">// for which that agent is responsible.</span></span>
<span id="cb5-40"><a href="#cb5-40" aria-hidden="true" tabindex="-1"></a><span class="co">//</span></span>
<span id="cb5-41"><a href="#cb5-41" aria-hidden="true" tabindex="-1"></a><span class="co">// If num_agents does not evenly divide num_elements,</span></span>
<span id="cb5-42"><a href="#cb5-42" aria-hidden="true" tabindex="-1"></a><span class="co">// balance the load so that the number of indices per agent</span></span>
<span id="cb5-43"><a href="#cb5-43" aria-hidden="true" tabindex="-1"></a><span class="co">// varies by at most one.</span></span>
<span id="cb5-44"><a href="#cb5-44" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span><span class="op">&lt;</span>std<span class="op">::</span>integral Integral<span class="op">&gt;</span></span>
<span id="cb5-45"><a href="#cb5-45" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="kw">auto</span> my_chunk<span class="op">(</span>Integral num_elements, Integral agent_index, Integral num_agents<span class="op">)</span></span>
<span id="cb5-46"><a href="#cb5-46" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb5-47"><a href="#cb5-47" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral ZERO<span class="op">{</span><span class="dv">0</span><span class="op">}</span>;</span>
<span id="cb5-48"><a href="#cb5-48" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral ONE<span class="op">{</span><span class="dv">1</span><span class="op">}</span>;</span>
<span id="cb5-49"><a href="#cb5-49" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-50"><a href="#cb5-50" aria-hidden="true" tabindex="-1"></a>  <span class="cf">if</span> <span class="op">(</span>num_elements <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-51"><a href="#cb5-51" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> std<span class="op">::</span>tuple<span class="op">{</span>ZERO, ZERO<span class="op">}</span>;</span>
<span id="cb5-52"><a href="#cb5-52" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb5-53"><a href="#cb5-53" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-54"><a href="#cb5-54" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> quotient <span class="op">=</span> num_elements <span class="op">/</span> num_agents;</span>
<span id="cb5-55"><a href="#cb5-55" aria-hidden="true" tabindex="-1"></a>  <span class="co">// The first `remainder` agents get `quotient + 1` elements each;</span></span>
<span id="cb5-56"><a href="#cb5-56" aria-hidden="true" tabindex="-1"></a>  <span class="co">// the remaining agents get `quotient` elements each.</span></span>
<span id="cb5-57"><a href="#cb5-57" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> remainder <span class="op">=</span> num_elements <span class="op">-</span> quotient <span class="op">*</span> num_agents;</span>
<span id="cb5-58"><a href="#cb5-58" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> my_chunk_size <span class="op">=</span> quotient <span class="op">+</span> <span class="op">(</span>agent_index <span class="op">&lt;</span> remainder <span class="op">?</span> ONE <span class="op">:</span> ZERO<span class="op">)</span>;</span>
<span id="cb5-59"><a href="#cb5-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-60"><a href="#cb5-60" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Cast back to Integral to undo effects of integer promotion;</span></span>
<span id="cb5-61"><a href="#cb5-61" aria-hidden="true" tabindex="-1"></a>  <span class="co">// otherwise, no matching call for std::min with Integral=short.</span></span>
<span id="cb5-62"><a href="#cb5-62" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> num_incremented <span class="op">=</span> std<span class="op">::</span>min<span class="op">(</span></span>
<span id="cb5-63"><a href="#cb5-63" aria-hidden="true" tabindex="-1"></a>    agent_index, <span class="kw">static_cast</span><span class="op">&lt;</span>Integral<span class="op">&gt;(</span>remainder<span class="op">))</span>;</span>
<span id="cb5-64"><a href="#cb5-64" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> num_original <span class="op">=</span> <span class="op">(</span>agent_index <span class="op">&gt;=</span> remainder<span class="op">)</span> <span class="op">?</span></span>
<span id="cb5-65"><a href="#cb5-65" aria-hidden="true" tabindex="-1"></a>    <span class="op">(</span>agent_index <span class="op">-</span> remainder<span class="op">)</span> <span class="op">:</span> ZERO;</span>
<span id="cb5-66"><a href="#cb5-66" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-67"><a href="#cb5-67" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> my_chunk_index <span class="op">=</span> num_incremented <span class="op">*</span> <span class="op">(</span>quotient <span class="op">+</span> ONE<span class="op">)</span> <span class="op">+</span> num_original <span class="op">*</span> quotient;</span>
<span id="cb5-68"><a href="#cb5-68" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span></span>
<span id="cb5-69"><a href="#cb5-69" aria-hidden="true" tabindex="-1"></a>    my_chunk_index <span class="op">&lt;</span> num_elements <span class="op">||</span> </span>
<span id="cb5-70"><a href="#cb5-70" aria-hidden="true" tabindex="-1"></a>    <span class="op">(</span>my_chunk_index <span class="op">==</span> num_elements <span class="op">&amp;&amp;</span> my_chunk_size <span class="op">==</span> ZERO<span class="op">)</span></span>
<span id="cb5-71"><a href="#cb5-71" aria-hidden="true" tabindex="-1"></a>  <span class="op">)</span>;</span>
<span id="cb5-72"><a href="#cb5-72" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-73"><a href="#cb5-73" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Cast back to Integral to undo effects of integer promotion</span></span>
<span id="cb5-74"><a href="#cb5-74" aria-hidden="true" tabindex="-1"></a>  <span class="co">// (for e.g., Integral=short).</span></span>
<span id="cb5-75"><a href="#cb5-75" aria-hidden="true" tabindex="-1"></a>  <span class="cf">return</span> std<span class="op">::</span>tuple<span class="op">{</span></span>
<span id="cb5-76"><a href="#cb5-76" aria-hidden="true" tabindex="-1"></a>    <span class="kw">static_cast</span><span class="op">&lt;</span>Integral<span class="op">&gt;(</span>my_chunk_index<span class="op">)</span>,</span>
<span id="cb5-77"><a href="#cb5-77" aria-hidden="true" tabindex="-1"></a>    <span class="kw">static_cast</span><span class="op">&lt;</span>Integral<span class="op">&gt;(</span>my_chunk_index <span class="op">+</span> my_chunk_size<span class="op">)</span></span>
<span id="cb5-78"><a href="#cb5-78" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span>;</span>
<span id="cb5-79"><a href="#cb5-79" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
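
<span class="co">// Worked example (added for illustration): with num_elements = 17</span>
<span class="co">// and num_agents = 5, quotient = 3 and remainder = 2, so agents 0</span>
<span class="co">// and 1 each get 4 indices and agents 2, 3, and 4 each get 3:</span>
<span class="co">//   my_chunk(17, 0, 5) == {0, 4},   my_chunk(17, 1, 5) == {4, 8},</span>
<span class="co">//   my_chunk(17, 2, 5) == {8, 11},  my_chunk(17, 3, 5) == {11, 14},</span>
<span class="co">//   my_chunk(17, 4, 5) == {14, 17}.</span>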
<span id="cb5-80"><a href="#cb5-80" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-81"><a href="#cb5-81" aria-hidden="true" tabindex="-1"></a><span class="co">// For the range [iterator, sentinel), return</span></span>
<span id="cb5-82"><a href="#cb5-82" aria-hidden="true" tabindex="-1"></a><span class="co">// the given agent&#39;s [my_iterator, my_sentinel).</span></span>
<span id="cb5-83"><a href="#cb5-83" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span><span class="op">&lt;</span></span>
<span id="cb5-84"><a href="#cb5-84" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>random_access_iterator I,</span>
<span id="cb5-85"><a href="#cb5-85" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S</span>
<span id="cb5-86"><a href="#cb5-86" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;</span></span>
<span id="cb5-87"><a href="#cb5-87" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="kw">auto</span> <span class="co">/* iterator, sentinel (not necessarily S) */</span></span>
<span id="cb5-88"><a href="#cb5-88" aria-hidden="true" tabindex="-1"></a>my_chunk<span class="op">(</span>I first, S last,</span>
<span id="cb5-89"><a href="#cb5-89" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>iter_difference_t<span class="op">&lt;</span>I<span class="op">&gt;</span> agent_index,</span>
<span id="cb5-90"><a href="#cb5-90" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>iter_difference_t<span class="op">&lt;</span>I<span class="op">&gt;</span> num_agents<span class="op">)</span></span>
<span id="cb5-91"><a href="#cb5-91" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb5-92"><a href="#cb5-92" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> num_elements <span class="op">=</span> std<span class="op">::</span>ranges<span class="op">::</span>distance<span class="op">(</span>first, last<span class="op">)</span>;</span>
<span id="cb5-93"><a href="#cb5-93" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> <span class="op">[</span>my_first, my_last<span class="op">]</span> <span class="op">=</span> my_chunk<span class="op">(</span>num_elements, agent_index, num_agents<span class="op">)</span>;</span>
<span id="cb5-94"><a href="#cb5-94" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-95"><a href="#cb5-95" aria-hidden="true" tabindex="-1"></a>  <span class="cf">return</span> std<span class="op">::</span>tuple<span class="op">{</span>first <span class="op">+</span> my_first, first <span class="op">+</span> my_last<span class="op">}</span>;</span>
<span id="cb5-96"><a href="#cb5-96" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb5-97"><a href="#cb5-97" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-98"><a href="#cb5-98" aria-hidden="true" tabindex="-1"></a><span class="co">// Implementation corresponding to default bulk in [exec.bulk].</span></span>
<span id="cb5-99"><a href="#cb5-99" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span><span class="op">&lt;</span>std<span class="op">::</span>integral Shape, std<span class="op">::</span>invocable<span class="op">&lt;</span>Shape<span class="op">&gt;</span> Fun<span class="op">&gt;</span></span>
<span id="cb5-100"><a href="#cb5-100" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> bulk<span class="op">(</span>Shape N, Fun f<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-101"><a href="#cb5-101" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span>Shape k <span class="op">=</span> Shape<span class="op">(</span><span class="dv">0</span><span class="op">)</span>; k <span class="op">&lt;</span> N; <span class="op">++</span>k<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-102"><a href="#cb5-102" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>invoke<span class="op">(</span>f, k<span class="op">)</span>;</span>
<span id="cb5-103"><a href="#cb5-103" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb5-104"><a href="#cb5-104" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb5-105"><a href="#cb5-105" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-106"><a href="#cb5-106" aria-hidden="true" tabindex="-1"></a><span class="co">// Chunked &quot;parallel&quot; implementation of something like ranges::for_each.</span></span>
<span id="cb5-107"><a href="#cb5-107" aria-hidden="true" tabindex="-1"></a><span class="co">// It uses the above &quot;bulk&quot; inside.</span></span>
<span id="cb5-108"><a href="#cb5-108" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> chunked_for_each_fn <span class="op">{</span></span>
<span id="cb5-109"><a href="#cb5-109" aria-hidden="true" tabindex="-1"></a>  <span class="kw">template</span><span class="op">&lt;</span></span>
<span id="cb5-110"><a href="#cb5-110" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>random_access_iterator I,</span>
<span id="cb5-111"><a href="#cb5-111" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>sentinel_for<span class="op">&lt;</span>I<span class="op">&gt;</span> S,</span>
<span id="cb5-112"><a href="#cb5-112" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> Proj <span class="op">=</span> std<span class="op">::</span>identity,</span>
<span id="cb5-113"><a href="#cb5-113" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>indirectly_unary_invocable<span class="op">&lt;</span>std<span class="op">::</span>projected<span class="op">&lt;</span>I, Proj<span class="op">&gt;&gt;</span> Fun</span>
<span id="cb5-114"><a href="#cb5-114" aria-hidden="true" tabindex="-1"></a>  <span class="op">&gt;</span></span>
<span id="cb5-115"><a href="#cb5-115" aria-hidden="true" tabindex="-1"></a>  for_each_result<span class="op">&lt;</span>I, Fun<span class="op">&gt;</span></span>
<span id="cb5-116"><a href="#cb5-116" aria-hidden="true" tabindex="-1"></a>  <span class="kw">operator</span><span class="op">()</span> <span class="op">(</span>I first, S last, std<span class="op">::</span>iter_difference_t<span class="op">&lt;</span>I<span class="op">&gt;</span> num_agents, Fun f, Proj proj <span class="op">=</span> <span class="op">{})</span> <span class="kw">const</span> <span class="op">{</span></span>
<span id="cb5-117"><a href="#cb5-117" aria-hidden="true" tabindex="-1"></a>    bulk<span class="op">(</span>num_agents, <span class="op">[&amp;]</span> <span class="op">(</span>std<span class="op">::</span>iter_difference_t<span class="op">&lt;</span>I<span class="op">&gt;</span> agent_index<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-118"><a href="#cb5-118" aria-hidden="true" tabindex="-1"></a>      <span class="kw">auto</span> <span class="op">[</span>my_first, my_last<span class="op">]</span> <span class="op">=</span> my_chunk<span class="op">(</span>first, last, agent_index, num_agents<span class="op">)</span>;</span>
<span id="cb5-119"><a href="#cb5-119" aria-hidden="true" tabindex="-1"></a>      seq_for_each<span class="op">(</span>my_first, my_last, f, proj<span class="op">)</span>;</span>
<span id="cb5-120"><a href="#cb5-120" aria-hidden="true" tabindex="-1"></a>    <span class="op">})</span>;</span>
<span id="cb5-121"><a href="#cb5-121" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-122"><a href="#cb5-122" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> <span class="op">{</span>std<span class="op">::</span>move<span class="op">(</span>first<span class="op">)</span>, std<span class="op">::</span>move<span class="op">(</span>f<span class="op">)}</span>;</span>
<span id="cb5-123"><a href="#cb5-123" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb5-124"><a href="#cb5-124" aria-hidden="true" tabindex="-1"></a><span class="op">}</span>;</span>
<span id="cb5-125"><a href="#cb5-125" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-126"><a href="#cb5-126" aria-hidden="true" tabindex="-1"></a><span class="kw">inline</span> <span class="kw">constexpr</span> chunked_for_each_fn chunked_for_each;</span>
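
<span class="co">// Hypothetical usage sketch: with a vector v and 4 agents,</span>
<span class="co">//   chunked_for_each(v.begin(), v.end(), 4, fun);</span>
<span class="co">// splits the range into 4 contiguous chunks via my_chunk and</span>
<span class="co">// applies fun to each element through seq_for_each.</span>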
<span id="cb5-127"><a href="#cb5-127" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-128"><a href="#cb5-128" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> <span class="co">// namespace mystd</span></span>
<span id="cb5-129"><a href="#cb5-129" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-130"><a href="#cb5-130" aria-hidden="true" tabindex="-1"></a><span class="kw">namespace</span> test <span class="op">{</span></span>
<span id="cb5-131"><a href="#cb5-131" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-132"><a href="#cb5-132" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span><span class="op">&lt;</span>std<span class="op">::</span>integral Integral<span class="op">&gt;</span></span>
<span id="cb5-133"><a href="#cb5-133" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> test_my_chunk_once<span class="op">(</span></span>
<span id="cb5-134"><a href="#cb5-134" aria-hidden="true" tabindex="-1"></a>  Integral num_elements, Integral num_agents<span class="op">)</span></span>
<span id="cb5-135"><a href="#cb5-135" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb5-136"><a href="#cb5-136" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral ZERO<span class="op">{</span><span class="dv">0</span><span class="op">}</span>;</span>
<span id="cb5-137"><a href="#cb5-137" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral ONE<span class="op">{</span><span class="dv">1</span><span class="op">}</span>;</span>
<span id="cb5-138"><a href="#cb5-138" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-139"><a href="#cb5-139" aria-hidden="true" tabindex="-1"></a>  Integral cur_first<span class="op">{</span><span class="dv">0</span><span class="op">}</span>;</span>
<span id="cb5-140"><a href="#cb5-140" aria-hidden="true" tabindex="-1"></a>  Integral cur_last<span class="op">{</span><span class="dv">0</span><span class="op">}</span>;</span>
<span id="cb5-141"><a href="#cb5-141" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span>Integral agent_index <span class="op">=</span> ZERO; agent_index <span class="op">&lt;</span> num_agents;</span>
<span id="cb5-142"><a href="#cb5-142" aria-hidden="true" tabindex="-1"></a>    <span class="op">++</span>agent_index<span class="op">)</span></span>
<span id="cb5-143"><a href="#cb5-143" aria-hidden="true" tabindex="-1"></a>  <span class="op">{</span></span>
<span id="cb5-144"><a href="#cb5-144" aria-hidden="true" tabindex="-1"></a>    <span class="kw">auto</span> <span class="op">[</span>first, last<span class="op">]</span> <span class="op">=</span> mystd<span class="op">::</span>my_chunk<span class="op">(</span>num_elements, agent_index, num_agents<span class="op">)</span>;</span>
<span id="cb5-145"><a href="#cb5-145" aria-hidden="true" tabindex="-1"></a>    <span class="ot">assert</span><span class="op">(</span>first <span class="op">==</span> cur_last<span class="op">)</span>;</span>
<span id="cb5-146"><a href="#cb5-146" aria-hidden="true" tabindex="-1"></a>    <span class="ot">assert</span><span class="op">(</span>last <span class="op">&gt;=</span> first<span class="op">)</span>;</span>
<span id="cb5-147"><a href="#cb5-147" aria-hidden="true" tabindex="-1"></a>    <span class="kw">auto</span> quotient <span class="op">=</span> num_elements <span class="op">/</span> num_agents;</span>
<span id="cb5-148"><a href="#cb5-148" aria-hidden="true" tabindex="-1"></a>    <span class="ot">assert</span><span class="op">(</span>last <span class="op">-</span> first <span class="op">==</span> quotient <span class="op">||</span> last <span class="op">-</span> first <span class="op">==</span> quotient <span class="op">+</span> ONE<span class="op">)</span>;</span>
<span id="cb5-149"><a href="#cb5-149" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-150"><a href="#cb5-150" aria-hidden="true" tabindex="-1"></a>    cur_first <span class="op">=</span> first;</span>
<span id="cb5-151"><a href="#cb5-151" aria-hidden="true" tabindex="-1"></a>    cur_last <span class="op">=</span> last;</span>
<span id="cb5-152"><a href="#cb5-152" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb5-153"><a href="#cb5-153" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>cur_last <span class="op">==</span> num_elements<span class="op">)</span>;</span>
<span id="cb5-154"><a href="#cb5-154" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>cur_first <span class="op">&lt;=</span> cur_last<span class="op">)</span>;</span>
<span id="cb5-155"><a href="#cb5-155" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb5-156"><a href="#cb5-156" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-157"><a href="#cb5-157" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span><span class="op">&lt;</span>std<span class="op">::</span>integral Integral<span class="op">&gt;</span></span>
<span id="cb5-158"><a href="#cb5-158" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> test_my_chunk_tmpl<span class="op">()</span> <span class="op">{</span></span>
<span id="cb5-159"><a href="#cb5-159" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral ZERO<span class="op">{</span><span class="dv">0</span><span class="op">}</span>;</span>
<span id="cb5-160"><a href="#cb5-160" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral ONE<span class="op">{</span><span class="dv">1</span><span class="op">}</span>;</span>
<span id="cb5-161"><a href="#cb5-161" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral TWO<span class="op">{</span><span class="dv">2</span><span class="op">}</span>;</span>
<span id="cb5-162"><a href="#cb5-162" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral FIVE<span class="op">{</span><span class="dv">5</span><span class="op">}</span>;</span>
<span id="cb5-163"><a href="#cb5-163" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral FIFTEEN<span class="op">{</span><span class="dv">15</span><span class="op">}</span>;</span>
<span id="cb5-164"><a href="#cb5-164" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> Integral SEVENTEEN<span class="op">{</span><span class="dv">17</span><span class="op">}</span>;</span>
<span id="cb5-165"><a href="#cb5-165" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-166"><a href="#cb5-166" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_once<span class="op">(</span>ZERO, ONE<span class="op">)</span>;</span>
<span id="cb5-167"><a href="#cb5-167" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_once<span class="op">(</span>ONE, ONE<span class="op">)</span>;</span>
<span id="cb5-168"><a href="#cb5-168" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_once<span class="op">(</span>ONE, SEVENTEEN<span class="op">)</span>;</span>
<span id="cb5-169"><a href="#cb5-169" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_once<span class="op">(</span>SEVENTEEN, ONE<span class="op">)</span>;</span>
<span id="cb5-170"><a href="#cb5-170" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_once<span class="op">(</span>SEVENTEEN, TWO<span class="op">)</span>;</span>
<span id="cb5-171"><a href="#cb5-171" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-172"><a href="#cb5-172" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_once<span class="op">(</span>FIFTEEN, FIVE<span class="op">)</span>;</span>
<span id="cb5-173"><a href="#cb5-173" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_once<span class="op">(</span>SEVENTEEN, FIVE<span class="op">)</span>;</span>
<span id="cb5-174"><a href="#cb5-174" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb5-175"><a href="#cb5-175" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-176"><a href="#cb5-176" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> test_my_chunk<span class="op">()</span> <span class="op">{</span></span>
<span id="cb5-177"><a href="#cb5-177" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_tmpl<span class="op">&lt;</span><span class="dt">short</span><span class="op">&gt;()</span>;</span>
<span id="cb5-178"><a href="#cb5-178" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_tmpl<span class="op">&lt;</span><span class="dt">unsigned</span> <span class="dt">short</span><span class="op">&gt;()</span>;</span>
<span id="cb5-179"><a href="#cb5-179" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_tmpl<span class="op">&lt;</span><span class="dt">int</span><span class="op">&gt;()</span>;</span>
<span id="cb5-180"><a href="#cb5-180" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_tmpl<span class="op">&lt;</span><span class="dt">unsigned</span> <span class="dt">int</span><span class="op">&gt;()</span>;</span>
<span id="cb5-181"><a href="#cb5-181" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_tmpl<span class="op">&lt;</span><span class="dt">long</span> <span class="dt">long</span><span class="op">&gt;()</span>;</span>
<span id="cb5-182"><a href="#cb5-182" aria-hidden="true" tabindex="-1"></a>  test_my_chunk_tmpl<span class="op">&lt;</span><span class="dt">unsigned</span> <span class="dt">long</span> <span class="dt">long</span><span class="op">&gt;()</span>;</span>
<span id="cb5-183"><a href="#cb5-183" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb5-184"><a href="#cb5-184" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-185"><a href="#cb5-185" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> <span class="co">// namespace test</span></span>
<span id="cb5-186"><a href="#cb5-186" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-187"><a href="#cb5-187" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">()</span> <span class="op">{</span></span>
<span id="cb5-188"><a href="#cb5-188" aria-hidden="true" tabindex="-1"></a>  test<span class="op">::</span>test_my_chunk<span class="op">()</span>;</span>
<span id="cb5-189"><a href="#cb5-189" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-190"><a href="#cb5-190" aria-hidden="true" tabindex="-1"></a>  <span class="kw">auto</span> x <span class="op">=</span> std<span class="op">::</span>ranges<span class="op">::</span>to<span class="op">&lt;</span>std<span class="op">::</span>vector<span class="op">&gt;(</span>std<span class="op">::</span>views<span class="op">::</span>iota<span class="op">(</span><span class="dv">1</span>, <span class="dv">21</span><span class="op">))</span>;</span>
<span id="cb5-191"><a href="#cb5-191" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-192"><a href="#cb5-192" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;std::ranges::for_each:</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-193"><a href="#cb5-193" aria-hidden="true" tabindex="-1"></a>  <span class="op">[[</span><span class="at">maybe_unused</span><span class="op">]]</span> <span class="kw">auto</span> result <span class="op">=</span></span>
<span id="cb5-194"><a href="#cb5-194" aria-hidden="true" tabindex="-1"></a>    std<span class="op">::</span>ranges<span class="op">::</span>for_each<span class="op">(</span>x<span class="op">.</span>begin<span class="op">()</span>, x<span class="op">.</span>end<span class="op">()</span>, <span class="op">[]</span> <span class="op">(</span><span class="dt">float</span> x_k<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-195"><a href="#cb5-195" aria-hidden="true" tabindex="-1"></a>      std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;x_k: &quot;</span> <span class="op">&lt;&lt;</span> x_k <span class="op">&lt;&lt;</span> <span class="st">&quot;</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-196"><a href="#cb5-196" aria-hidden="true" tabindex="-1"></a>    <span class="op">})</span>;</span>
<span id="cb5-197"><a href="#cb5-197" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-198"><a href="#cb5-198" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;</span><span class="sc">\n</span><span class="st">seq_for_each:</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-199"><a href="#cb5-199" aria-hidden="true" tabindex="-1"></a>  <span class="op">[[</span><span class="at">maybe_unused</span><span class="op">]]</span> <span class="kw">auto</span> result_seq_0 <span class="op">=</span></span>
<span id="cb5-200"><a href="#cb5-200" aria-hidden="true" tabindex="-1"></a>    mystd<span class="op">::</span>seq_for_each<span class="op">(</span>x<span class="op">.</span>begin<span class="op">()</span>, x<span class="op">.</span>end<span class="op">()</span>, <span class="op">[]</span> <span class="op">(</span><span class="dt">float</span> x_k<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-201"><a href="#cb5-201" aria-hidden="true" tabindex="-1"></a>      std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;x_k: &quot;</span> <span class="op">&lt;&lt;</span> x_k <span class="op">&lt;&lt;</span> <span class="st">&quot;</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-202"><a href="#cb5-202" aria-hidden="true" tabindex="-1"></a>    <span class="op">})</span>;</span>
<span id="cb5-203"><a href="#cb5-203" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-204"><a href="#cb5-204" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;</span><span class="sc">\n</span><span class="st">seq_for_each with nontrivial projection:</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-205"><a href="#cb5-205" aria-hidden="true" tabindex="-1"></a>  <span class="op">[[</span><span class="at">maybe_unused</span><span class="op">]]</span> <span class="kw">auto</span> result_seq_1 <span class="op">=</span></span>
<span id="cb5-206"><a href="#cb5-206" aria-hidden="true" tabindex="-1"></a>    mystd<span class="op">::</span>seq_for_each<span class="op">(</span>x<span class="op">.</span>begin<span class="op">()</span>, x<span class="op">.</span>end<span class="op">()</span>, <span class="op">[]</span> <span class="op">(</span><span class="dt">float</span> x_k<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-207"><a href="#cb5-207" aria-hidden="true" tabindex="-1"></a>      std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;10 * x_k: &quot;</span> <span class="op">&lt;&lt;</span> x_k <span class="op">&lt;&lt;</span> <span class="st">&quot;</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-208"><a href="#cb5-208" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span>, <span class="op">[]</span> <span class="op">(</span><span class="dt">float</span> x_k<span class="op">)</span> <span class="op">{</span> <span class="cf">return</span> <span class="dv">10</span> <span class="op">*</span> x_k; <span class="op">}</span> <span class="op">)</span>;</span>
<span id="cb5-209"><a href="#cb5-209" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-210"><a href="#cb5-210" aria-hidden="true" tabindex="-1"></a>  <span class="dt">int</span> num_agents <span class="op">=</span> <span class="dv">3</span>;</span>
<span id="cb5-211"><a href="#cb5-211" aria-hidden="true" tabindex="-1"></a>  std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;</span><span class="sc">\n</span><span class="st">chunked_for_each (num_agents=&quot;</span> <span class="op">&lt;&lt;</span> num_agents <span class="op">&lt;&lt;</span> <span class="st">&quot;):</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-212"><a href="#cb5-212" aria-hidden="true" tabindex="-1"></a>  <span class="op">[[</span><span class="at">maybe_unused</span><span class="op">]]</span> <span class="kw">auto</span> result3 <span class="op">=</span></span>
<span id="cb5-213"><a href="#cb5-213" aria-hidden="true" tabindex="-1"></a>    mystd<span class="op">::</span>chunked_for_each<span class="op">(</span>x<span class="op">.</span>begin<span class="op">()</span>, x<span class="op">.</span>end<span class="op">()</span>, num_agents, <span class="op">[]</span> <span class="op">(</span><span class="dt">float</span> x_k<span class="op">)</span> <span class="op">{</span></span>
<span id="cb5-214"><a href="#cb5-214" aria-hidden="true" tabindex="-1"></a>      std<span class="op">::</span>cout <span class="op">&lt;&lt;</span> <span class="st">&quot;x_k: &quot;</span> <span class="op">&lt;&lt;</span> x_k <span class="op">&lt;&lt;</span> <span class="st">&quot;</span><span class="sc">\n</span><span class="st">&quot;</span>;</span>
<span id="cb5-215"><a href="#cb5-215" aria-hidden="true" tabindex="-1"></a>    <span class="op">})</span>;</span>
<span id="cb5-216"><a href="#cb5-216" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-217"><a href="#cb5-217" aria-hidden="true" tabindex="-1"></a>  <span class="cf">return</span> <span class="dv">0</span>;</span>
<span id="cb5-218"><a href="#cb5-218" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h1 data-number="10" id="references"><span class="header-section-number">10</span> References<a href="#references" class="self-link"></a></h1>
<ul>
<li><p>L. S. Blackford et al., “ScaLAPACK Users’ Guide,” Society for
Industrial and Applied Mathematics, Philadelphia, PA, 1997. Available
online: https://netlib.org/scalapack/slug/ [last accessed
2025/01/02].</p></li>
<li><p>G. A. Geist, J. A. Kohl, and P. M. Papadopoulos, “PVM and MPI: a
Comparison of Features,” Calculateurs Paralleles, Vol. 8 No. 2, 1996.
Available online: http://www.csm.ornl.gov/pvm/PVMvsMPI.ps [last accessed
2024/12/31].</p></li>
<li><p>Hartmut Kaiser et al., “HPX - The C++ Standard Library for
Parallelism and Concurrency,” Journal of Open Source Software, September
2020. Available online:
https://joss.theoj.org/papers/10.21105/joss.02352 [last accessed
2025/01/10].</p></li>
<li><p>HPX Documentation, v1.10.0. Available online:
https://hpx-docs.stellar-group.org/latest/pdf/HPX.pdf [last accessed
2025/01/10].</p></li>
<li><p>Ken Kennedy, Charles Koelbel, and Hans Zima, “The Rise and Fall
of High Performance Fortran,” Communications of the ACM, Vol. 54,
No. 11, pp. 74 - 82, November 2011.</p></li>
<li><p>Message Passing Interface Forum, “MPI: A Message-Passing
Interface Standard Version 4.1”, November 2023. Available online:
https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report.pdf [last accessed
2024/12/31].</p></li>
<li><p>“OpenMP Application Programming Interface,” Version 6.0, November
2024. Available online:
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-6-0.pdf
[last accessed 2024/12/31].</p></li>
<li><p>Parallel Virtual Machine website. Available online:
https://www.csm.ornl.gov/pvm/pvm_home.html [last accessed
2024/12/31].</p></li>
</ul>
<h1 data-number="11" id="revision-history"><span class="header-section-number">11</span> Revision History<a href="#revision-history" class="self-link"></a></h1>
<ul>
<li>Revision 0 (pre-Hagenberg) to be submitted 2025-01-13</li>
</ul>
<aside id="footnotes" class="footnotes footnotes-end-of-document" role="doc-endnotes">
<hr />
<ol>
<li id="fn1"><p>OpenMP is a standard for thread-parallel programming,
whose first version was released in 1997 (for Fortran; 1998 for C and
C++).<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2"><p>Other parallel programming models contemporary to MPI
1.0 had built-in support for parallel data distributions and automatic
data redistribution. MPI’s lack of these features simplified its
implementation, and also forced users to write more specialized but
therefore more performant code for explicit data movement. This
performance difference helped MPI’s success relative to higher-level
programming models such as High Performance Fortran. See Kennedy et
al. 2011.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3"><p>CUDA is a programming model that can be used for
parallelism on NVIDIA’s graphics processing units (GPUs).<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn4"><p>CUDA changed with the release of the Volta architecture,
to support concurrent forward progress among threads within a thread
block. This move from SIMT (Single Instruction Multiple Threads) thread
scheduling to independent thread scheduling required both hardware
changes and backwards-incompatible changes to users’ code.<a href="#fnref4" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn5"><p>Fixing that would require introducing a new Standard
execution policy type, or defining a query on execution policies
(including implementation-defined ones) for their forward progress
guarantees.<a href="#fnref5" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn6"><p>An OpenMP-based implementation might have different
maximum <code>N</code> values for <code>bulk</code>, based on whether
the <code>bulk</code> is nested inside another <code>bulk</code>. This
is because nested parallelism may make fewer threads available, so that
OpenMP can keep its forward progress guarantee while preserving
performance.<a href="#fnref6" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn7"><p>“One-way” interfaces are “fire and forget,” with no
proposed standard way to synchronize or get the result. “Two-way”
interfaces return a future or have some other way to synchronize on the
result of the asynchronous operation.<a href="#fnref7" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn8"><p>See also <a href="https://wg21.link/p2224r0">P2224R0, “A
Better <code>bulk_schedule</code>,”</a> which explains the design change
between P2181R0 and P2181R1, as well as the
<a href="https://wiki.edg.com/pub/Wg21summer2020/SG1/P2181R0.md">minutes
from SG1’s summer 2020 discussion of P2181R0</a>, and the
<a href="https://wiki.edg.com/pub/Wg21fall2020/SG1/P2224r0_-_Minutes.pdf">minutes
from SG1’s December 16, 2020 discussion of P2224R0</a>.<a href="#fnref8" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</aside>
</div>
</div>
</body>
</html>
