<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <meta name="author" content="Jared Hoberock (jhoberock@nvidia.com)" />
  <meta name="author" content="Michael Garland (mgarland@nvidia.com)" />
  <meta name="dcterms.date" content="2020-11-13" />
  <title>Correcting the Design of Bulk Execution</title>
  <style>
    code{white-space: pre-wrap;}
    span.smallcaps{font-variant: small-caps;}
    span.underline{text-decoration: underline;}
    div.column{display: inline-block; vertical-align: top; width: 50%;}
    div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
    ul.task-list{list-style: none;}
    pre > code.sourceCode { white-space: pre; position: relative; }
    pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
    pre > code.sourceCode > span:empty { height: 1.2em; }
    code.sourceCode > span { color: inherit; text-decoration: inherit; }
    div.sourceCode { margin: 1em 0; }
    pre.sourceCode { margin: 0; }
    @media screen {
    div.sourceCode { overflow: auto; }
    }
    @media print {
    pre > code.sourceCode { white-space: pre-wrap; }
    pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
    }
    pre.numberSource code
      { counter-reset: source-line 0; }
    pre.numberSource code > span
      { position: relative; left: -4em; counter-increment: source-line; }
    pre.numberSource code > span > a:first-child::before
      { content: counter(source-line);
        position: relative; left: -1em; text-align: right; vertical-align: baseline;
        border: none; display: inline-block;
        -webkit-touch-callout: none; -webkit-user-select: none;
        -khtml-user-select: none; -moz-user-select: none;
        -ms-user-select: none; user-select: none;
        padding: 0 4px; width: 4em;
        background-color: #ffffff;
        color: #a0a0a0;
      }
    pre.numberSource { margin-left: 3em; border-left: 1px solid #a0a0a0;  padding-left: 4px; }
    div.sourceCode
      { color: #1f1c1b; background-color: #ffffff; }
    @media screen {
    pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
    }
    code span. { color: #1f1c1b; } /* Normal */
    code span.al { color: #bf0303; background-color: #f7e6e6; font-weight: bold; } /* Alert */
    code span.an { color: #ca60ca; } /* Annotation */
    code span.at { color: #0057ae; } /* Attribute */
    code span.bn { color: #b08000; } /* BaseN */
    code span.bu { color: #644a9b; font-weight: bold; } /* BuiltIn */
    code span.cf { color: #1f1c1b; font-weight: bold; } /* ControlFlow */
    code span.ch { color: #924c9d; } /* Char */
    code span.cn { color: #aa5500; } /* Constant */
    code span.co { color: #898887; } /* Comment */
    code span.cv { color: #0095ff; } /* CommentVar */
    code span.do { color: #607880; } /* Documentation */
    code span.dt { color: #0057ae; } /* DataType */
    code span.dv { color: #b08000; } /* DecVal */
    code span.er { color: #bf0303; text-decoration: underline; } /* Error */
    code span.ex { color: #0095ff; font-weight: bold; } /* Extension */
    code span.fl { color: #b08000; } /* Float */
    code span.fu { color: #644a9b; } /* Function */
    code span.im { color: #ff5500; } /* Import */
    code span.in { color: #b08000; } /* Information */
    code span.kw { color: #1f1c1b; font-weight: bold; } /* Keyword */
    code span.op { color: #1f1c1b; } /* Operator */
    code span.ot { color: #006e28; } /* Other */
    code span.pp { color: #006e28; } /* Preprocessor */
    code span.re { color: #0057ae; background-color: #e0e9f8; } /* RegionMarker */
    code span.sc { color: #3daee9; } /* SpecialChar */
    code span.ss { color: #ff5500; } /* SpecialString */
    code span.st { color: #bf0303; } /* String */
    code span.va { color: #0057ae; } /* Variable */
    code span.vs { color: #bf0303; } /* VerbatimString */
    code span.wa { color: #bf0303; } /* Warning */
  </style>
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
  <style>
  /* ==========================================================================
     Basic formatting of the whole article
     ========================================================================== */

  @media screen
  {
      html { font-size: 11pt; }
  }

  @media print
  {
      html { font-size: 10pt; }
  }


  body {
      width: 45em;
      margin-top: 2em;
      margin-left: auto;
      margin-right: auto;
      
      font-family: "Times", "Times New Roman", serif;
      font-style: normal;
      font-variant: normal;

      background-color: white;
      color: black;
  }

  /* Adjustments for printing */
  @page {
      size: portrait;
      margin-top:    12%;
      margin-left:   12%;
      margin-right:  12%;
      margin-bottom: 18%;
      orphans: 3;
      widows:  3;
  }


  /* ==========================================================================
     Text elements
     ========================================================================== */

  p { text-align: justify; }
  li p { text-align: left; }

  h1,h2,h3,h4,h5,h6 {
      font-family: "Helvetica Neue", "Helvetica", "Arial", sans-serif; 
      page-break-after: avoid;
  }

  h1 { font-size: 1.73em; }
  h2 { font-size: 1.44em; }
  h3 { font-size: 1.20em; }
  h4 { font-size: 1.00em; }

  /* Pandoc hard-codes the section number in a span */
  span.header-section-number {
      margin-right: 0.5em;
  }

  span.header-section-number:after {
      content: ".";
  }

  /* Setup the header information at the top of the document */
  header   { text-align: center; margin-bottom: 4em; }
  h1.title { font-family: sans-serif; font-size: 2.0em; }
  p.subtitle { font-family: sans-serif; font-size: 1.4em; }
  p.author { font-size: 1.2em; }
  p.date   { font-size: 1.2em; }
  p.author a.email { font-family: sans-serif; font-size: 0.8em; }
  header p { text-align: center; }

  /* Special rules for code block formatting */
  pre
  {
      padding-left:   1em;
      padding-top:    1ex;
      padding-bottom: 1ex;

      font-family: "Courier New", "Courier", monospace;

      page-break-inside: avoid;

      overflow: visible;
  }

  blockquote
  {
      border-left: solid 2pt #ddd;
      padding-left: 1em;
  }

  /* ==========================================================================
     Figures, equations, and other numbered elements
     ========================================================================== */


  /* Figures and figure captions */
  body { counter-reset: figure; }

  figcaption:before
  {
      counter-increment: figure;
      content: "Figure " counter(figure) ": ";
  }

  figcaption { margin-top: 0.5em; }

  figure
  {
      margin-top: 1.5em;
      margin-bottom: 3em;
  }

  /* Patch default Pandoc style settings */
  @media screen {
      div.sourceCode { overflow: visible; }
  }
  </style>
</head>
<body>
<header id="title-block-header">
<h1 class="title">Correcting the Design of Bulk Execution</h1>
<p class="subtitle">P2181r1</p>
<p class="author">Jared Hoberock (<a href="mailto:jhoberock@nvidia.com" class="email">jhoberock@nvidia.com</a>)</p>
<p class="author">Michael Garland (<a href="mailto:mgarland@nvidia.com" class="email">mgarland@nvidia.com</a>)</p>
<p class="date">November 13, 2020</p>
</header>
<h1 data-number="1" id="introduction"><span class="header-section-number">1</span> Introduction</h1>
<p>A bulk execution interface was introduced as a fundamental operation supported by executors in <a href="http://wg21.link/N4406">N4406</a> (“Parallel algorithms need executors”) and adopted in <a href="http://wg21.link/p0443r0">P0443r0</a>, the first unified executor proposal, in the form of a <code>bulk_execute</code> interface. This interface has been present in P0443 from the beginning because a properly designed <code>bulk_execute</code> interface accomplishes two goals of fundamental importance. It provides the basis for exploiting platforms that support efficient mechanisms for creating many execution agents simultaneously, and it encapsulates the (potentially platform-specific) means of doing so.</p>
<p>The design of P0443 has evolved significantly since its initial revision, most notably to adopt the sender/receiver approach for lazy execution. The design of <code>bulk_execute</code> has lagged behind these changes, and is presented with inconsistent signatures in <a href="http://wg21.link/p0443r14">P0443r14</a>. The lack of a consistently defined interface for bulk execution must be resolved before P0443 can be adopted.</p>
<p>In this paper, we propose a design for bulk execution that corrects this defect in P0443r14. Our proposal:</p>
<ul>
<li><p>Defines <code>bulk_execute</code> as an interface for eager work submission, following the semantics of <code>execute</code>.</p></li>
<li><p>Introduces a new <code>bulk_schedule</code> that provides a basis for lazy work submission, following the semantics of <code>schedule</code>.</p></li>
</ul>
<p>Adopting these proposals requires only minor changes to P0443. They do not change any of the concepts or mechanisms in P0443 aside from the defective definition of <code>bulk_execute</code>. They also broaden the scope of bulk execution by providing for both eager and lazy submission, rather than eager submission alone.</p>
<h2 data-number="1.1" id="changes-in-revision-1"><span class="header-section-number">1.1</span> Changes in Revision 1</h2>
<p>The following are the primary changes made in this revision of the paper:</p>
<ul>
<li><p>Adopted the interface proposed in <a href="http://wg21.link/p2224">P2224</a> (“A better <code>bulk_schedule</code>”), which supersedes the <code>bulk_schedule</code> interfaces explored in both <a href="http://wg21.link/p2181r0">Revision 0</a> and <a href="http://wg21.link/p2209">P2209</a>, while also eliminating the need for any form of <code>bulk_join</code>.</p></li>
<li><p>Eliminated the <code>many_receiver_of</code> concept, which is no longer needed in the revised design of <code>bulk_schedule</code>.</p></li>
<li><p>Specified that <code>connect</code> is called in each agent of a launch created by <code>bulk_schedule</code>.</p></li>
<li><p>Consistent with ongoing discussions in the design review of P0443r14, we have distinguished executors and schedulers more carefully and do not assume implicit conversion between them.</p></li>
<li><p>The default implementation of <code>bulk_execute</code> now makes a single call to <code>execute</code>.</p></li>
<li><p>Replaced types <code>executor_shape_t&lt;E&gt;</code> and <code>executor_index_t&lt;E&gt;</code> with a single type <code>executor_coordinate_t&lt;E&gt;</code>.</p></li>
<li><p>Added a new discussion section on a possible refactoring of <code>bulk_schedule</code> into two operators.</p></li>
</ul>
<p>The bulk of the changes occur in Section 3 (<em>Corrected Bulk Interface</em>) and Section 5 (<em>Discussion</em>).</p>
<h1 data-number="2" id="background"><span class="header-section-number">2</span> Background</h1>
<p>Every revision of P0443 has included <code>bulk_execute</code> as the lowest level primitive operation for creating work in bulk through an executor. Both P0443 and the interface of <code>bulk_execute</code> have evolved since its first revision, but the intended functionality of <code>bulk_execute</code> has remained unchanged: it is the basis for creating a group of function invocations in bulk in a single operation.</p>
<p>The design sketched in <a href="http://wg21.link/p1660r0">P1660r0</a> (“A compromise executor design sketch”) is the basis for the current specification in <a href="http://wg21.link/p0443r14">P0443r14</a>. While reaffirming the importance of bulk execution, it proposed only to:</p>
<blockquote>
<p>Introduce a customizable bulk execution API whose specific shape is left as future work.</p>
</blockquote>
<p>Section 5.3 of that paper provided some “highly speculative” suggestions, but no definitive design was given. P0443r14 also attempts to incorporate the proposal of <a href="http://wg21.link/p1993r1">P1993r1</a> (“Restore shared state to <code>bulk_execute</code>”) to return a sender result so that dependent work may be chained with a bulk task.</p>
<p>This results in the intended interface of <code>bulk_execute</code> in P0443r14:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb1-1"><a href="#cb1-1"></a>sender_of&lt;<span class="dt">void</span>&gt; <span class="kw">auto</span> bulk_execute(executor <span class="kw">auto</span> ex,</span>
<span id="cb1-2"><a href="#cb1-2"></a>                                  invocable <span class="kw">auto</span> f,</span>
<span id="cb1-3"><a href="#cb1-3"></a>                                  <span class="dt">executor_shape_t</span>&lt;<span class="kw">decltype</span>(ex)&gt; shape);</span></code></pre></div>
<p>This formulation creates <code>shape</code> invocations of function <code>f</code> on execution agents created by executor <code>ex</code>. A sender of <code>void</code> corresponding to the completion of these invocations is the result.</p>
<h2 data-number="2.1" id="inconsistent-definitions-in-p0443"><span class="header-section-number">2.1</span> Inconsistent definitions in P0443</h2>
<p>Despite this intent, the material addressing bulk execution in <a href="http://wg21.link/p0443r14">P0443r14</a> is not self-consistent. This inconsistency is particularly apparent in the envisioned return type of <code>bulk_execute</code>.</p>
<ul>
<li>Section 1.3 includes an example use of <code>bulk_execute</code> that returns a sender:</li>
</ul>
<div class="sourceCode" id="cb2"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb2-1"><a href="#cb2-1"></a>    sender <span class="kw">auto</span> s = execution::bulk_execute(ex, ...);</span></code></pre></div>
<ul>
<li><p>Section 2.2.3.9 specifies the customization point <code>execution::bulk_execute</code>, yet remains silent on its return type.</p></li>
<li><p>Section 2.5.5.5 specifies that the interface of <code>static_thread_pool</code> includes a <code>bulk_execute</code> method returning <code>void</code>:</p></li>
</ul>
<div class="sourceCode" id="cb3"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb3-1"><a href="#cb3-1"></a>    <span class="kw">template</span>&lt;<span class="kw">class</span> Function&gt;</span>
<span id="cb3-2"><a href="#cb3-2"></a>    <span class="dt">void</span> bulk_execute(Function&amp;&amp; f, <span class="dt">size_t</span> n) <span class="at">const</span>;</span></code></pre></div>
<p>Our proposal eliminates this inconsistency with a single, clearly defined interface for <code>bulk_execute</code>.</p>
<h2 data-number="2.2" id="shared-state-and-dependent-tasks"><span class="header-section-number">2.2</span> Shared state and dependent tasks</h2>
<p>Programs need to chain dependent tasks together, in both the singular and bulk cases. Furthermore, it is particularly important to provide a means for delivering shared state (e.g., barrier objects or shared output arrays) to all the constituent invocations of a bulk operation.</p>
<p>SG1 considered this issue at its February 2020 meeting in Prague, and decided that:</p>
<blockquote>
<p>Poll: We should add a sender argument and sender result to bulk execution functions (providing an opportunity to build shared state, established dependencies in/out)</p>
<pre><code>SF  F  N  A  SA
17  7  0  0  0</code></pre>
</blockquote>
<p>Our proposal fulfills this requirement with a new <code>bulk_schedule</code> interface.</p>
<h1 data-number="3" id="corrected-bulk-interface"><span class="header-section-number">3</span> Corrected Bulk Interface</h1>
<p>The inconsistent interfaces for bulk execution in <a href="http://wg21.link/p0443r14">P0443r14</a> arise from uncertainty about the means for integrating senders into the <code>bulk_execute</code> interface. The design for singular execution in P0443r14 avoids this confusion by providing <em>two</em> interfaces (<code>execute</code> and <code>schedule</code>) that disentangle the concerns of eager submission and lazy scheduling. The defects in the interface for bulk execution in P0443r14 are readily corrected by adopting a similar approach.</p>
<h2 data-number="3.1" id="design-synopsis"><span class="header-section-number">3.1</span> Design Synopsis</h2>
<p>The <code>bulk_execute</code> operation should be the mechanism for eager submission of work in bulk, a role analogous to <code>execute</code>. The interface sketched in P0443 should be replaced with one of the following form:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb5-1"><a href="#cb5-1"></a>    <span class="kw">template</span>&lt;executor E, invocable F&gt;</span>
<span id="cb5-2"><a href="#cb5-2"></a>    <span class="dt">void</span> bulk_execute(E ex,</span>
<span id="cb5-3"><a href="#cb5-3"></a>                      F&amp;&amp; f,</span>
<span id="cb5-4"><a href="#cb5-4"></a>                      <span class="dt">executor_coordinate_t</span>&lt;E&gt; shape);</span></code></pre></div>
<p>The invocable <code>f</code> accepts a single argument corresponding to its assigned coordinate in <code>shape</code>. The work to be done by this invocation has been submitted for execution in a group of the given shape before <code>bulk_execute</code> returns, but the point at which actual execution occurs is implementation defined. This interface provides no further mechanism for synchronizing with the completion of the submitted work. Therefore, some additional means of synchronization is required to determine when the bulk operation has completed. Such facilities could be provided by executor-specific interfaces or sequencing guarantees. In general, it is necessary to use a synchronization object such as the <code>std::barrier</code> used in the following example.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb6-1"><a href="#cb6-1"></a>    <span class="kw">auto</span> exec = ...</span>
<span id="cb6-2"><a href="#cb6-2"></a>    vector&lt;<span class="dt">int</span>&gt; ints = ...</span>
<span id="cb6-3"><a href="#cb6-3"></a>    barrier bar(ints.size() + <span class="dv">1</span>);</span>
<span id="cb6-4"><a href="#cb6-4"></a></span>
<span id="cb6-5"><a href="#cb6-5"></a>    <span class="co">// launch work to mutate a vector of integers</span></span>
<span id="cb6-6"><a href="#cb6-6"></a>    bulk_execute(exec,</span>
<span id="cb6-7"><a href="#cb6-7"></a>                [&amp;](<span class="dt">size_t</span> idx) { ints[idx] += <span class="dv">1</span>; bar.arrive(); },</span>
<span id="cb6-8"><a href="#cb6-8"></a>                ints.size());</span>
<span id="cb6-9"><a href="#cb6-9"></a></span>
<span id="cb6-10"><a href="#cb6-10"></a>    <span class="co">// wait for work to be completed</span></span>
<span id="cb6-11"><a href="#cb6-11"></a>    bar.arrive_and_wait();</span></code></pre></div>
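<p>The fragment above leaves the executor and the contents of the vector unspecified. As a rough, self-contained sketch of the same pattern, the following C++20 code assumes a hypothetical <code>thread_per_agent_executor</code> (not part of P0443) that creates one <code>std::jthread</code> per agent, together with a helper named <code>increment_all</code> that exists only for this illustration.</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp">#include &lt;barrier&gt;
#include &lt;cstddef&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

// Hypothetical executor, for illustration only: one std::jthread per agent.
// The threads are joined when the executor is destroyed.
class thread_per_agent_executor {
public:
    template &lt;class F&gt;
    void bulk_execute(F f, std::size_t shape) {
        for (std::size_t idx = 0; idx &lt; shape; ++idx) {
            // Each agent receives its own copy of f and its own index.
            threads_.emplace_back([f, idx]() mutable { f(idx); });
        }
    }

private:
    std::vector&lt;std::jthread&gt; threads_;
};

// Illustrative helper: adds 1 to every element using eager bulk submission.
std::vector&lt;int&gt; increment_all(std::vector&lt;int&gt; ints) {
    // One arrival per agent, plus one for the caller that waits below.
    std::barrier&lt;&gt; bar(static_cast&lt;std::ptrdiff_t&gt;(ints.size()) + 1);
    thread_per_agent_executor exec;  // declared after bar, so joined before bar dies

    exec.bulk_execute(
        [&amp;](std::size_t idx) { ints[idx] += 1; (void)bar.arrive(); },
        ints.size());

    // Returns only after every agent has arrived, so all increments are visible.
    bar.arrive_and_wait();
    return ints;
}</code></pre></div>
<p>Under these assumptions, <code>increment_all({1, 2, 3})</code> yields <code>{2, 3, 4}</code>. The barrier count includes the calling thread, which is why it is initialized with the number of agents plus one.</p>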
<p>A new interface is required for scheduling work for later submission. This interface should use senders as the means of composition and for ordering chains of dependent operations. This is the role of <code>schedule</code> for singular execution; therefore, we propose the addition of an analogous bulk operation. This new <code>bulk_schedule</code> operation should have an interface of the following form:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb7-1"><a href="#cb7-1"></a>    <span class="kw">template</span>&lt;scheduler S, invocable F, <span class="kw">class</span>... Ts&gt;</span>
<span id="cb7-2"><a href="#cb7-2"></a>    sender_of&lt;Ts...&gt; bulk_schedule(sender_of&lt;Ts...&gt; <span class="kw">auto</span>&amp;&amp; prologue,</span>
<span id="cb7-3"><a href="#cb7-3"></a>                                   S sched,</span>
<span id="cb7-4"><a href="#cb7-4"></a>                                   <span class="dt">executor_coordinate_t</span>&lt;S&gt; shape,</span>
<span id="cb7-5"><a href="#cb7-5"></a>                                   F&amp;&amp; factory);</span></code></pre></div>
<p>which was proposed in <a href="http://wg21.link/p2224">P2224</a> and adopted by SG1 at its meeting on October 12, 2020.</p>
<p>The returned object is a sender representing the entire computation of the bulk section. The invocable <code>factory</code> is responsible for constructing a sender that represents the computation to be performed in each agent of the bulk launch. It is a sender factory whose signature should be of the form:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb8-1"><a href="#cb8-1"></a>    <span class="kw">auto</span> factory(sender_of&lt;<span class="dt">executor_coordinate_t</span>&lt;S&gt;, Ts&amp;...&gt;) -&gt; sender_of&lt;<span class="dt">void</span>&gt;</span></code></pre></div>
<p>The factory is called with a single parameter: a sender representing the initiation of each agent. This sender delivers to its receiver both an agent index and the values (if any) provided by <code>prologue</code>. The factory must return a <code>sender_of&lt;void&gt;</code> representing the computation to be performed by each agent. This bulk operation will be submitted for execution when the sender returned by <code>bulk_schedule</code> is connected to a receiver and started. In each agent of the bulk launch, a receiver will be connected to the sender returned by the factory, or a copy thereof. The resulting operation returned by <code>connect</code> will subsequently be executed in that agent with <code>start</code>.</p>
<p>The “prologue” sender provided to <code>bulk_schedule</code> is intended to deliver state that should be shared across the group of execution agents created upon execution. Each agent is identified by an index sent via <code>set_value</code> along with the shared state (if any) delivered by the prologue. The following example illustrates the use of <code>bulk_schedule</code>, along with functionality proposed in <a href="http://wg21.link/p1897r3">P1897r3</a>, to share a collection of integers across a group of execution agents and mutate each element individually.</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb9-1"><a href="#cb9-1"></a>    <span class="kw">auto</span> sched = ...</span>
<span id="cb9-2"><a href="#cb9-2"></a>    <span class="bu">std::</span>vector&lt;<span class="dt">int</span>&gt; ints = ...</span>
<span id="cb9-3"><a href="#cb9-3"></a></span>
<span id="cb9-4"><a href="#cb9-4"></a>    <span class="co">// assemble a computation to mutate a vector of integers</span></span>
<span id="cb9-5"><a href="#cb9-5"></a>    <span class="kw">auto</span> increment =</span>
<span id="cb9-6"><a href="#cb9-6"></a>        bulk_schedule(just(ints),</span>
<span id="cb9-7"><a href="#cb9-7"></a>                      sched,</span>
<span id="cb9-8"><a href="#cb9-8"></a>                      ints.size(),</span>
<span id="cb9-9"><a href="#cb9-9"></a>                      transform([](<span class="dt">size_t</span> idx, <span class="bu">std::</span>vector&lt;<span class="dt">int</span>&gt;&amp; ints)</span>
<span id="cb9-10"><a href="#cb9-10"></a>                      {</span>
<span id="cb9-11"><a href="#cb9-11"></a>                          ints[idx] += <span class="dv">1</span>;</span>
<span id="cb9-12"><a href="#cb9-12"></a>                      }));</span>
<span id="cb9-13"><a href="#cb9-13"></a></span>
<span id="cb9-14"><a href="#cb9-14"></a>    <span class="co">// perform the computation and wait for its completion</span></span>
<span id="cb9-15"><a href="#cb9-15"></a>    execution::sync_wait( increment );</span></code></pre></div>
<p>Because <code>increment</code> is a sender, we can wait on its completion using a generic <code>sync_wait</code> operation rather than having to embed a synchronization object like a barrier directly in the code.</p>
<h2 data-number="3.2" id="specification-of-bulk_execute"><span class="header-section-number">3.2</span> Specification of <code>bulk_execute</code></h2>
<p>[<em>Editorial note:</em> Replace Section 2.2.3.9 (<code>execution::bulk_execute</code>) in P0443r14 with the material in this section. –<em>end editorial note</em>]</p>
<p>The name <code>execution::bulk_execute</code> denotes a customization point object. If:</p>
<pre><code>    is_convertible_v&lt;decltype(S), execution::executor_coordinate_t&lt;remove_cvref_t&lt;decltype(E)&gt;&gt;&gt;</code></pre>
<p>is true, then the expression <code>execution::bulk_execute(E, F, S)</code> for some subexpressions <code>E</code>, <code>F</code>, and <code>S</code> is expression-equivalent to:</p>
<ul>
<li><p><code>E.bulk_execute(F, S)</code>, if that expression is valid. If the function selected does not execute <code>F</code> in an <code>S</code>-shaped group of execution agents with forward progress <code>query(E, execution::bulk_guarantee)</code> on executor <code>E</code>, the program is ill-formed with no diagnostic required.</p></li>
<li><p>Otherwise, <code>bulk_execute(E, F, S)</code>, if that expression is valid, with overload resolution performed in a context that includes the declaration</p>
<pre><code>  void bulk_execute();</code></pre>
<p>and that does not include a declaration of <code>execution::bulk_execute</code>.</p>
<p>If the function selected does not bulk execute <code>F</code> with shape <code>S</code> on executor <code>E</code>, the program is ill-formed with no diagnostic required.</p></li>
<li><p>Otherwise, if the type of <code>E</code> models <code>executor</code>, and the type of <code>F</code> and <code>executor_coordinate_t&lt;remove_cvref_t&lt;decltype(E)&gt;&gt;</code> model <code>invocable</code>, and if <code>query(E, execution::bulk_guarantee)</code> equals <code>execution::bulk_guarantee.unsequenced</code>, then:</p>
<ul>
<li><p>If the type of <code>F</code> models <code>copy_constructible</code>, then equivalent to <code>execution::execute(E, [f=DECAY_COPY(F)]{ for(auto idx=0; idx&lt;S; ++idx) invoke(f, idx); })</code>.</p></li>
<li><p>Otherwise, equivalent to <code>execution::execute(E, [&amp;]{ for(auto idx=0; idx&lt;S; ++idx) invoke(F, idx); })</code>.</p></li>
</ul>
<p>[<em>Note:</em> The requirement of <code>bulk_guarantee.unsequenced</code> here means that the default implementation is not available to an executor that chooses to advertise a different guarantee. Such executors are required to provide implementations of <code>bulk_execute</code> that fulfill the advertised guarantee. –<em>end note</em>]</p></li>
<li><p>Otherwise, <code>execution::bulk_execute(E, F, S)</code> is ill-formed.</p></li>
</ul>
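<p>The default case described above, a single call to <code>execute</code> whose body loops over the indices, can be sketched roughly as follows. This is an illustration under stated assumptions rather than proposed wording: <code>inline_executor</code>, <code>default_bulk_execute</code>, and <code>visited_indices</code> are hypothetical names, the shape is assumed to be a <code>std::size_t</code>, and the <code>bulk_guarantee</code> query and customization point dispatch are omitted.</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp">#include &lt;cstddef&gt;
#include &lt;functional&gt;
#include &lt;type_traits&gt;
#include &lt;utility&gt;
#include &lt;vector&gt;

// Hypothetical executor, for illustration only: runs work on the calling thread.
struct inline_executor {
    template &lt;class F&gt;
    void execute(F&amp;&amp; f) const { std::forward&lt;F&gt;(f)(); }
};

// Sketch of the default: one call to execute whose body visits each index in order.
template &lt;class E, class F&gt;
void default_bulk_execute(const E&amp; ex, F&amp;&amp; f, std::size_t shape) {
    if constexpr (std::is_copy_constructible_v&lt;std::decay_t&lt;F&gt;&gt;) {
        // Copyable invocable: the submitted task owns a decayed copy.
        ex.execute([f = std::decay_t&lt;F&gt;(std::forward&lt;F&gt;(f)), shape]() mutable {
            for (std::size_t idx = 0; idx &lt; shape; ++idx) std::invoke(f, idx);
        });
    } else {
        // Non-copyable invocable: the task refers to the caller's object.
        ex.execute([&amp;f, shape] {
            for (std::size_t idx = 0; idx &lt; shape; ++idx) std::invoke(f, idx);
        });
    }
}

// Illustrative helper: records the indices in the order they are visited.
std::vector&lt;std::size_t&gt; visited_indices(std::size_t shape) {
    std::vector&lt;std::size_t&gt; seen;
    default_bulk_execute(inline_executor{},
                         [&amp;seen](std::size_t idx) { seen.push_back(idx); },
                         shape);
    return seen;
}</code></pre></div>
<p>With the inline executor assumed here, <code>visited_indices(4)</code> returns <code>{0, 1, 2, 3}</code>. Because all invocations run inside one submitted task, this default offers no parallelism of its own, which is consistent with restricting it to executors that advertise <code>bulk_guarantee.unsequenced</code>.</p>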
<h2 data-number="3.3" id="specification-of-bulk_schedule"><span class="header-section-number">3.3</span> Specification of <code>bulk_schedule</code></h2>
<p>[<em>Editorial note:</em> Introduce a new Section 2.2.3.10 (<code>execution::bulk_schedule</code>) containing the material in this section. –<em>end editorial note</em>]</p>
<p>The name <code>execution::bulk_schedule</code> denotes a customization point object. For some subexpressions <code>scheduler</code>, <code>shape</code>, <code>prologue</code>, and <code>factory</code>: let <code>E</code> be a type such that <code>decltype((scheduler))</code> is <code>E</code>, and let <code>S</code> be a type such that <code>decltype((shape))</code> is <code>S</code>, and let <code>P</code> be a type such that <code>decltype((prologue))</code> is <code>P</code>, and let <code>F</code> be a type such that <code>decltype((factory))</code> is <code>F</code>. The expression <code>execution::bulk_schedule(prologue, scheduler, shape, factory)</code> is ill-formed if <code>typed_sender&lt;P&gt;</code> is not <code>true</code>.</p>
<p>Otherwise, the expression <code>execution::bulk_schedule(prologue, scheduler, shape, factory)</code> is expression-equivalent to:</p>
<ul>
<li><p><code>scheduler.bulk_schedule(prologue, shape, factory)</code>, if that expression is valid and its type <code>R</code> satisfies <code>typed_sender&lt;R&gt;</code>, and if <code>sender_traits&lt;R&gt;::value_types&lt;tuple, variant&gt;</code> is <code>variant&lt;tuple&lt;executor_coordinate_t&lt;decltype(scheduler)&gt;, add_lvalue_reference_t&lt;Values&gt;...&gt;...&gt;</code> for all <code>Values...</code> parameter packs sent by <code>prologue</code>.</p></li>
<li><p>Otherwise, <code>bulk_schedule(prologue, scheduler, shape, factory)</code>, if that expression is valid with overload resolution performed in a context that includes the declaration</p>
<pre><code>void bulk_schedule();</code></pre>
<p>and that does not include a declaration of <code>execution::bulk_schedule</code>, and if that expression’s type <code>R</code> satisfies <code>typed_sender&lt;R&gt;</code>, and if <code>sender_traits&lt;R&gt;::value_types&lt;tuple, variant&gt;</code> is <code>variant&lt;tuple&lt;executor_coordinate_t&lt;decltype(scheduler)&gt;, add_lvalue_reference_t&lt;Values&gt;...&gt;...&gt;</code> for all <code>Values...</code> parameter packs sent by <code>prologue</code>.</p></li>
<li><p>Otherwise, if <code>scheduler&lt;E&gt;</code> is true and <code>executor_coordinate_t&lt;E&gt;</code> is <code>S</code>, returns a sender object <code>s</code> whose implementation-defined type <code>R</code> satisfies <code>typed_sender&lt;R&gt;</code>. <code>execution::connect(s,r)</code> returns an object <code>o</code> whose implementation-defined type satisfies <code>operation_state</code>.</p>
<ul>
<li><p>Let <code>values...</code> be a parameter pack of values sent by <code>prologue</code>, and <code>coord</code> be a coordinate spanned by <code>shape</code>. Let <code>__just(args...)</code> be an exposition-only operation that sends the parameter pack <code>args...</code> to a connected receiver, and let <code>__discard_receiver</code> be an exposition-only receiver whose <code>set_value</code>, <code>set_error</code>, and <code>set_done</code> functions have no effect. <code>execution::start(o)</code> calls <code>submit(factory(__just(coord, values...)), __discard_receiver{})</code> once for each <code>coord</code> spanned by <code>shape</code>. Upon completion of all such invocations, it calls <code>execution::set_value(move(r), values...)</code>.</p></li>
<li><p>Otherwise, let <code>error</code> be an error sent by <code>prologue</code>. <code>execution::start(o)</code> calls <code>execution::set_error(move(r), error)</code>.</p></li>
<li><p>Otherwise, <code>execution::start(o)</code> calls <code>execution::set_done(move(r))</code>.</p></li>
</ul></li>
<li><p>Otherwise, <code>execution::bulk_schedule(prologue, scheduler, shape, factory)</code> is ill-formed.</p></li>
</ul>
<h1 data-number="4" id="supporting-definitions"><span class="header-section-number">4</span> Supporting Definitions</h1>
<p>Each instantiation of the body in a bulk launch receives an <em>index</em> argument that identifies that instance. The index space of the launch itself is described by a <em>shape</em> parameter provided to <code>bulk_execute</code> and <code>bulk_schedule</code>. The current draft of P0443 uses separate types for these values: <code>executor_index_t&lt;E&gt;</code> and <code>executor_shape_t&lt;E&gt;</code>, respectively. Our implementation experience to date suggests that there is no benefit to differentiating these types. Therefore, our proposal replaces them with a single <code>executor_coordinate_t&lt;E&gt;</code> type that is used for both purposes.</p>
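<p>This paper does not spell out how <code>executor_coordinate_t&lt;E&gt;</code> is determined for a given executor. One possible shape for such a trait is sketched below, purely as an assumption: it looks for a hypothetical nested <code>coordinate_type</code> and falls back to <code>std::size_t</code> when none is present. The executor types <code>simple_executor</code> and <code>grid_executor</code> are likewise invented for the example.</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp">#include &lt;cstddef&gt;
#include &lt;type_traits&gt;

// Assumed fallback: std::size_t when the executor names no coordinate type.
template &lt;class E, class = void&gt;
struct executor_coordinate {
    using type = std::size_t;
};

// Assumed opt-in: the executor exposes a nested coordinate_type (hypothetical name).
template &lt;class E&gt;
struct executor_coordinate&lt;E, std::void_t&lt;typename E::coordinate_type&gt;&gt; {
    using type = typename E::coordinate_type;
};

// A single alias serves as both the shape of a launch and the index of an agent.
template &lt;class E&gt;
using executor_coordinate_t = typename executor_coordinate&lt;E&gt;::type;

// Hypothetical executors used only to exercise the trait.
struct simple_executor {};
struct grid_executor {
    struct coordinate_type { int x; int y; };
};

static_assert(std::is_same_v&lt;executor_coordinate_t&lt;simple_executor&gt;, std::size_t&gt;);
static_assert(std::is_same_v&lt;executor_coordinate_t&lt;grid_executor&gt;,
                             grid_executor::coordinate_type&gt;);</code></pre></div>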
<h2 data-number="4.1" id="definitions-of-execution"><span class="header-section-number">4.1</span> Definitions of execution</h2>
<p>An editorial note in <a href="http://wg21.link/P0443r14#executionexecute">P0443r14, Section 2.2.3.4</a> says that:</p>
<blockquote>
<p>We should probably define what “execute the function object F on the executor E” means more carefully.</p>
</blockquote>
<p>We suggest the following definition:</p>
<p>An executor <em>executes</em> an expression by scheduling the creation of an execution agent on which the expression executes. Invocable expressions are invoked by that execution agent. Execution of expressions that are not invocable is executor-defined.</p>
<p>Furthermore, we suggest adding the analogous definitions for bulk execution:</p>
<p>A <em>group of execution agents</em> created in bulk has a <em>shape</em>. Execution agents within a group are identified by <em>indices</em>, whose unique values are the set of contiguous indices spanned by the group’s shape.</p>
<p>An executor <em>bulk executes</em> an expression by scheduling the creation of a group of execution agents on which the expression executes in bulk. Invocable expressions are invoked with each execution agent’s index. Bulk execution of expressions that are not invocable is executor-defined.</p>
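<p>As a minimal illustration of these definitions, the sketch below shows a group of agents with a one-dimensional shape, where each agent is identified by one index in <code>[0, shape)</code>. The names <code>inline_bulk_executor</code> and <code>squares</code> are hypothetical, and the <code>std::size_t</code> coordinate type is an assumption made for brevity. Every "agent" here runs on the calling thread.</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical executor used only for illustration; it is not part of P0443.
// The coordinate type is assumed to be a one-dimensional std::size_t.
struct inline_bulk_executor {
    using coordinate_type = std::size_t;

    // Creates a group of `shape` agents, one per index in [0, shape),
    // and invokes f with each agent's index.
    template <class F>
    void bulk_execute(F f, coordinate_type shape) const {
        for (coordinate_type index = 0; index < shape; ++index) {
            f(index);
        }
    }
};

// Example use: each agent writes the square of its own index.
inline std::vector<std::size_t> squares(std::size_t n) {
    std::vector<std::size_t> out(n);
    inline_bulk_executor{}.bulk_execute(
        [&out](std::size_t i) { out[i] = i * i; }, n);
    return out;
}
```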
<h1 data-number="5" id="discussion"><span class="header-section-number">5</span> Discussion</h1>
<p>The preceding sections contain the entirety of our proposed corrections and additions to <a href="http://wg21.link/p0443r14">P0443r14</a>. This section provides some additional background explanation and highlights some additional proposals that may be considered separately.</p>
<h2 data-number="5.1" id="design-of-the-bulk-interface"><span class="header-section-number">5.1</span> Design of the bulk interface</h2>
<p>This proposal positions <code>bulk_execute</code> as the direct analogue of <code>execute</code>. Both are low-level interfaces for creating execution and are necessary to expose platform-level work creation interfaces, which may be implemented outside the standard library. Furthermore, individual executor types may provide important platform-provided forward progress guarantees, such as a guarantee of mutual concurrency among agents.</p>
<p>While the default implementation of the <code>bulk_execute</code> customization point decays to a loop in the absence of an executor-provided method, the <code>bulk_execute</code> operation is semantically distinct from a loop. Every loop construct in the standard is either explicitly sequential or permitted to fall back to a sequential equivalent at the sole discretion of the implementation. In contrast, executors may be used with <code>bulk_execute</code> to guarantee execution semantics that have no lowering onto sequential execution. For example, an executor whose <code>bulk_execute</code> method guarantees that all its created agents are concurrent with each other has no sequential equivalent.</p>
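<p>The following sketch illustrates why such a guarantee has no sequential lowering. It assumes a hypothetical <code>thread_per_agent_executor</code> that creates one <code>std::thread</code> per agent and, as a simplification, blocks until the group finishes. The body waits at a hypothetical single-use <code>rendezvous</code> until every agent in the group has arrived; run as a sequential loop, the first agent would wait forever.</p>

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical executor: one std::thread per agent, so all agents in a group
// are mutually concurrent. Blocking on join is a simplification.
struct thread_per_agent_executor {
    template <class F>
    void bulk_execute(F f, std::size_t shape) const {
        std::vector<std::thread> agents;
        agents.reserve(shape);
        for (std::size_t i = 0; i < shape; ++i) {
            agents.emplace_back([&f, i] { f(i); });
        }
        for (std::thread& agent : agents) {
            agent.join();
        }
    }
};

// Hypothetical single-use rendezvous point for a fixed number of agents.
class rendezvous {
public:
    explicit rendezvous(std::size_t expected) : remaining_(expected) {}

    // Returns only after `expected` agents have called this function.
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (--remaining_ == 0) {
            all_arrived_.notify_all();
        } else {
            all_arrived_.wait(lock, [this] { return remaining_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable all_arrived_;
    std::size_t remaining_;
};

// Counts the agents that get past the rendezvous. This terminates only
// because the executor guarantees that the agents run concurrently.
inline std::size_t count_agents_past_rendezvous(std::size_t shape) {
    rendezvous point(shape);
    std::mutex count_mutex;
    std::size_t passed = 0;
    thread_per_agent_executor{}.bulk_execute(
        [&](std::size_t) {
            point.arrive_and_wait();
            std::lock_guard<std::mutex> guard(count_mutex);
            ++passed;
        },
        shape);
    return passed;
}
```

<p>Substituting a loop-based executor for <code>thread_per_agent_executor</code> in this example would deadlock for any shape greater than one, which is the sense in which the operation cannot be treated as a loop.</p>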
<h2 data-number="5.2" id="execution-policies"><span class="header-section-number">5.2</span> Execution policies</h2>
<p>As in all prior revisions of P0443, the <code>bulk_execute</code> interface we propose does not include an execution policy argument. The use of execution policies in <code>bulk_execute</code> would be fundamentally inconsistent with their use throughout the rest of the library.</p>
<p>Execution policies were designed as a mechanism for customizing the execution of algorithms in the standard library in a way that could support the broadest possible range of architectures (see <a href="http://wg21.link/N3554">N3554</a>). As designed, they are suitable for customizing operations that can optionally change execution semantics (e.g., parallel execution in multiple threads). They are not, however, suitable for customizing low-level interfaces such as <code>bulk_execute</code> where mandatory execution semantics have already been specified in the form of an executor.</p>
<p>For every invocation of an algorithm with an execution policy, it is valid to replace the policy specified in the call with <code>execution::seq</code> without changing the meaning of the program. Similarly, conforming implementations are granted the freedom to fall back to sequential execution, regardless of the policy specified. This cannot be done with <code>bulk_execute</code> if the executor provides guarantees (e.g., non-blocking execution or concurrent forward progress) inconsistent with sequential execution in the calling thread.</p>
<p>The use of execution policies in the library is also designed to support a variety of vendor-supplied execution policies. Providing such vendor-specific policies to <code>bulk_execute</code> would typically have no meaning unless the executor is also a vendor-specific executor specifically designed to recognize that policy. In this case, all information provided by the policy could have been provided via the executor itself, making the policy parameter unnecessary. Once the executor semantics have been customized via the property-based <code>require</code> mechanism, any semantics implied by a policy are at best redundant and at worst contradictory.</p>
<h2 data-number="5.3" id="copying-invocables"><span class="header-section-number">5.3</span> Copying invocables</h2>
<p>Both <code>execute</code> and, by extension, <code>bulk_execute</code> allow non-copyable invocable types. This manifests in the third bullet point of the specification of <code>bulk_execute</code>, which has two cases. The first case opportunistically creates copies of the user’s invocable when it is possible to do so; each agent created by the executor receives one of these copies. Otherwise, if the invocable is not copyable, each agent receives a reference to the invocable instead of a copy. This policy was chosen to ensure that invocables containing non-copyable, non-movable types (e.g., synchronization objects) are still usable with <code>bulk_execute</code>. The caller of <code>execute</code> and/or <code>bulk_execute</code> must ensure that a non-copyable, non-movable invocable outlives the group of agents that invokes it and that overlapping invocations do not create data races.</p>
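<p>A minimal sketch of this copy-or-reference policy follows, assuming C++17 and a sequential loop in place of a real executor. The helper <code>sequential_bulk_invoke</code> and the two counter types are hypothetical names introduced only for illustration.</p>

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: gives each agent its own copy of the invocable when
// the type is copyable, and a reference to the caller's object otherwise.
template <class F>
void sequential_bulk_invoke(F&& f, std::size_t shape) {
    using invocable_type = std::remove_cv_t<std::remove_reference_t<F>>;
    for (std::size_t i = 0; i < shape; ++i) {
        if constexpr (std::is_copy_constructible_v<invocable_type>) {
            invocable_type copy(f);  // this agent's private copy
            copy(i);
        } else {
            f(i);  // all agents share the caller's object
        }
    }
}

// Copyable invocable: state changes happen in the per-agent copies.
struct copyable_counter {
    std::size_t calls = 0;
    void operator()(std::size_t) { ++calls; }
};

// Non-copyable, non-movable invocable: state changes are visible to the caller.
struct pinned_counter {
    std::size_t calls = 0;
    pinned_counter() = default;
    pinned_counter(const pinned_counter&) = delete;
    pinned_counter& operator=(const pinned_counter&) = delete;
    void operator()(std::size_t) { ++calls; }
};
```

<p>With a <code>copyable_counter</code>, the caller's object is left unchanged because each agent increments its own copy. With a <code>pinned_counter</code>, every agent increments the same object, which is why the caller must keep it alive and free of data races when the agents are concurrent.</p>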
<h2 data-number="5.4" id="default-implementations"><span class="header-section-number">5.4</span> Default implementations</h2>
<p>We follow the existing practice in P0443 and specify default implementations for the <code>bulk_execute</code> and <code>bulk_schedule</code> customization points when the executor/scheduler does not provide corresponding methods. These default implementations are meant as a fallback only. Executors/schedulers that aim to support parallel execution of some form should provide their own implementations of the bulk interfaces.</p>
<p>It would be valuable if the default implementation of <code>bulk_schedule</code> could rely upon <code>bulk_execute</code>. This would encapsulate the details of submission in a single place, and it would help guarantee semantic equivalence between eager and lazy mechanisms for work submission. The current design of <a href="http://wg21.link/p0443r14">P0443r14</a> prevents this because executors rely on exceptions for error reporting whereas schedulers rely on <code>set_error</code> calls on receivers. Bridging the gap between these two would likely result in a simpler and more seamless design. A candidate solution was sketched in <a href="http://wg21.link/p1660r0">P1660, Section 5.2</a>, which recommended allowing the caller of <code>execute</code> or <code>bulk_execute</code> to control the error delivery channel by providing either an invocable—resulting in the use of exceptions—or a receiver—resulting in delivery via <code>set_error</code>. A more complete solution is proposed in <a href="http://wg21.link/p2254">P2254</a>.</p>
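<p>The idea of letting the caller choose the error channel can be sketched as follows. This is an illustration under assumptions rather than the design of P1660 or P2254: <code>flaky_inline_executor</code>, <code>has_set_error</code>, and <code>recording_receiver</code> are hypothetical names, and a "receiver" is recognized here simply by the presence of a <code>set_error(std::exception_ptr)</code> member.</p>

```cpp
#include <exception>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Detects whether T has a member set_error(std::exception_ptr).
template <class T, class = void>
struct has_set_error : std::false_type {};

template <class T>
struct has_set_error<
    T, std::void_t<decltype(std::declval<T&>().set_error(
           std::declval<std::exception_ptr>()))>> : std::true_type {};

// Hypothetical inline executor whose submission step can be made to fail.
struct flaky_inline_executor {
    bool fail_on_submit = false;

    template <class Work>
    void execute(Work&& work) const {
        if constexpr (has_set_error<std::remove_reference_t<Work>>::value) {
            // Receiver: failures are delivered through set_error.
            if (fail_on_submit) {
                work.set_error(std::make_exception_ptr(
                    std::runtime_error("submission failed")));
            } else {
                work.set_value();
            }
        } else {
            // Plain invocable: failures are reported by throwing.
            if (fail_on_submit) {
                throw std::runtime_error("submission failed");
            }
            work();
        }
    }
};

// Hypothetical receiver that records which signal it observed.
struct recording_receiver {
    bool got_value = false;
    bool got_error = false;
    void set_value() { got_value = true; }
    void set_error(std::exception_ptr) noexcept { got_error = true; }
};
```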
<h2 data-number="5.5" id="additional-convenience-overloads"><span class="header-section-number">5.5</span> Additional convenience overloads</h2>
<p>The <code>bulk_schedule</code> interface may be marginally more convenient if an additional overload is provided without a prologue sender:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb13-1"><a href="#cb13-1"></a>    <span class="kw">template</span>&lt;scheduler S, invocable F, <span class="kw">class</span>... Ts&gt;</span>
<span id="cb13-2"><a href="#cb13-2"></a>    sender_of&lt;<span class="dt">void</span>&gt; bulk_schedule(S sched,</span>
<span id="cb13-3"><a href="#cb13-3"></a>                                  <span class="dt">executor_coordinate_t</span>&lt;S&gt; shape,</span>
<span id="cb13-4"><a href="#cb13-4"></a>                                  F&amp;&amp; factory);</span></code></pre></div>
<p>While an equivalent result can already be achieved by passing a suitable “empty” prologue sender through the interface we have proposed, this overload would be more convenient for the user of the interface.</p>
<p>It may also be worth considering adding an overload of <code>schedule</code> that accepts a prologue sender, mirroring the <code>bulk_schedule</code> interface we have proposed:</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb14-1"><a href="#cb14-1"></a>    <span class="kw">template</span>&lt;scheduler S, <span class="kw">class</span>... Ts&gt;</span>
<span id="cb14-2"><a href="#cb14-2"></a>    sender_of&lt;Ts...&gt; schedule(sender_of&lt;Ts...&gt; <span class="kw">auto</span>&amp;&amp; prologue,</span>
<span id="cb14-3"><a href="#cb14-3"></a>                              S sched);</span></code></pre></div>
<p>Neither of these changes is essential, but adding these options to the existing overloads for <code>schedule</code> and <code>bulk_schedule</code> in P0443r14 and our proposal above, respectively, would make the scheduling interface more convenient and more predictable.</p>
<h2 data-number="5.6" id="possible-refactoring-of-bulk_schedule"><span class="header-section-number">5.6</span> Possible refactoring of <code>bulk_schedule</code></h2>
<p>While implementing our proposed <code>bulk_schedule</code> interface, we discovered that our implementation could be simplified by decomposing the functionality of <code>bulk_schedule</code> into two primitive operations:</p>
<ol type="1">
<li><code>on(prologue, scheduler)</code></li>
<li><code>bulk(prologue, shape, sender_factory)</code></li>
</ol>
<p>such that <code>bulk_schedule</code> can be formed from their composition:</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb15-1"><a href="#cb15-1"></a><span class="kw">template</span>&lt;typed_sender P, scheduler S, invocable SF)</span>
<span id="cb15-2"><a href="#cb15-2"></a>typed_sender <span class="kw">auto</span> bulk_schedule(P&amp;&amp; prologue,</span>
<span id="cb15-3"><a href="#cb15-3"></a>                                S scheduler,</span>
<span id="cb15-4"><a href="#cb15-4"></a>                                <span class="dt">scheduler_coordinate_t</span>&lt;S&gt; shape,</span>
<span id="cb15-5"><a href="#cb15-5"></a>                                SF sender_factory)</span>
<span id="cb15-6"><a href="#cb15-6"></a>{</span>
<span id="cb15-7"><a href="#cb15-7"></a>    <span class="cf">return</span> on(prologue, scheduler) | bulk(shape, sender_factory);</span>
<span id="cb15-8"><a href="#cb15-8"></a>}</span></code></pre></div>
<p>In our current implementation, the <code>on</code> operation establishes a transition from a sender chain’s “upstream” execution agent to a new “downstream” execution agent. The <code>bulk</code> operation shares values sent by its <code>prologue</code> sender across <code>shape</code> execution agents created by the <code>scheduler</code> associated with the <code>prologue</code>. We accomplish this by assuming that all <code>typed_sender</code>s advertise their <code>scheduler</code> via a customization point <code>get_scheduler(sender) -&gt; scheduler</code>.</p>
<p>Aside from the substantial reduction in implementation complexity, this refactoring has the side benefit of providing the missing <code>scheduler</code> transition combinator <code>on</code>. If P0443 is revised to include only our proposed <code>bulk_execute</code> and <code>bulk_schedule</code> interfaces, the sole mechanism for a mid-chain transition would be to misuse <code>bulk_schedule</code> as a particularly complicated form of <code>on</code>.</p>
<h1 class="unnumbered" data-number="" id="refs">References</h1>
<div id="refs" class="references hanging-indent" role="doc-bibliography">
<div id="ref-P2224">
<p>Garland, Michael, Lee Howes, and Jared Hoberock. 2020. “A Better <code>bulk_schedule</code>.” <a href="http://wg21.link/p2224">http://wg21.link/p2224</a>.</p>
</div>
<div id="ref-P1993">
<p>Hoberock, Jared. 2020. “Restore Shared State to bulk_execute.” <a href="http://wg21.link/p1993r1">http://wg21.link/p1993r1</a>.</p>
</div>
<div id="ref-N4406">
<p>Hoberock, Jared, Michael Garland, and Olivier Giroux. 2015. “Parallel Algorithms Need Executors.” <a href="http://wg21.link/N4406">http://wg21.link/N4406</a>.</p>
</div>
<div id="ref-P0443">
<p>Hoberock, J., M. Garland, C. Kohlhoff, C. Mysen, C. Edwards, G. Brown, D. Hollman, et al. 2020. “A Unified Executors Proposal for C++.” <a href="http://wg21.link/p0443r14">http://wg21.link/p0443r14</a>.</p>
</div>
<div id="ref-P1660">
<p>Hoberock, J., M. Garland, B. Lelbach, M. Dominiak, E. Niebler, K. Shoop, L. Baker, L. Howes, D. Hollman, and G. Brown. 2019. “A Compromise Executor Design Sketch.” <a href="http://wg21.link/p1660r0">http://wg21.link/p1660r0</a>.</p>
</div>
<div id="ref-N3554">
<p>Hoberock, J., J. Marathe, M. Garland, O. Giroux, V. Grover, A. Laksberg, H. Sutter, and A. Robison. 2013. “A Parallel Algorithms Library.” <a href="http://wg21.link/N3554">http://wg21.link/N3554</a>.</p>
</div>
<div id="ref-P2209">
<p>Howes, Lee, Lewis Baker, Kirk Shoop, and Eric Niebler. 2020. “Bulk Schedule.” <a href="http://wg21.link/p2209">http://wg21.link/p2209</a>.</p>
</div>
</div>
</body>
</html>
