<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head>
    <title>Copy elision for subobjects</title>
    <meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
    <meta http-equiv="Content-Language" content="en-us">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    <style type="text/css">
        .addition { color: green; }
        .right { float:right }
        .changed-deleted { background-color: #CFF0FC ; text-decoration: line-through; display: none; }
        .addition.changed-deleted { color: green; background-color: #CFF0FC ; text-decoration: line-through; text-decoration: black double line-through; display: none; }
        .changed-added { background-color: #CFF0FC ;}
        .notes { background-color: gold ;}
        pre { line-height: 1.2; font-size: 10pt; margin-top: 25px; }
        .desc { margin-left: 35px; margin-top: 10px; padding:0; white-space: normal; }
        body {max-width: 1024px; margin-left: 25px;}
        .lmargin50{margin-left: 50px;}
        .cppkeyword { color: blue; }
        .asmcostly { color: red; }
        .cppcomment { color: green; }
        .cppcomment > .cppkeyword{ color: green; }
        .cpptext { color: #2E8B57; }
        .cppaddition { background-color: #CFC; }
        .cppdeletion {  text-decoration: line-through; background-color: #FCC; }
        .stdquote { background-color: #ECECEC; font-family: Consolas,monospace; }
    </style>
    </head>
    <body bgcolor="#ffffff">
    <address>Document number: P0878R0</address>
    <address>Project: Programming Language C++</address>
    <address>Audience: Evolution Working group</address>
    <address>&nbsp;</address>
    <address>Antony Polukhin &lt;<a href="mailto:antoshkka@gmail.com">antoshkka@gmail.com</a>&gt;, &lt;<a href="mailto:antoshkka@yandex-team.ru">antoshkka@yandex-team.ru</a>&gt;</address>
    <address>&nbsp;</address>
    <address>Date: 2018-01-08</address>
    <h1>Subobjects copy elision</h1>


    <h2>I. Quick Introduction</h2>
    <p>[class.copy.elision]
    in the current working draft [<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4713.pdf">N4713</a>] provides a whitelist of cases in which copy elision is permitted. Under these strict rules, copying cannot be elided
    when a subobject is returned,
    so C++ users have to write code like <code>return std::move(some_pair.second);</code> to get better performance. But even with <code>std::move</code>, the generated code
    may not be as good as it could be with the copy elided.</p>
    <p>This paper proposes to relax the [class.copy.elision] rules in order to allow compilers to produce better code.</p>

    <h2>II. Motivating example #1</h2>
    <p>Consider the following code:</p>
<pre class="lmargin50">
#include &lt;utility&gt;
#include &lt;string&gt;

std::pair&lt;std::string, std::string&gt; produce();

static std::string first_non_empty() {
    auto v = produce();
    if (!v.first.empty()) {
        return v.first;
    }
    return v.second;
}

int example_1() {
    return first_non_empty().size();
}
</pre>
    <p>Currently, compilers compile <code>example_1()</code> to the following steps:</p>
    <ol>
    <li>call <code>produce()</code> to initialize <code>v</code></li>
    <li>evaluate the condition of the <code>if</code> statement</li>
    <li>copy construct a <code>std::string</code> from either <code>v.first</code> or <code>v.second</code></li>
    <li>call the destructor for <code>v</code></li>
    <li>return the size of the copy constructed <code>std::string</code></li>
    </ol>
<p>A user may try to optimize <code>first_non_empty()</code> by adding <code>std::move</code>, which turns the copy construction of the string into a move construction:</p>

<pre class="lmargin50">
#include &lt;utility&gt;
#include &lt;string&gt;

std::pair&lt;std::string, std::string&gt; produce();

static std::string first_non_empty_move() {
    auto v = produce();
    if (!v.first.empty()) {
        return <span class="notes">std::move(</span>v.first<span class="notes">)</span>;
    }
    return <span class="notes">std::move(</span>v.second<span class="notes">)</span>;
}

int example_1_move() {
    return first_non_empty_move().size();
}
</pre>

<p>Unfortunately, the above improvement <b>requires a highly experienced user</b> and still <b>does not produce an optimal result</b>: compilers are still not always able to optimize out all the unnecessary operations, even after inlining.</p>
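<p>The difference between the two versions can also be observed without reading assembly. The following is a minimal sketch, not part of the proposal: <code>tracked</code>, <code>produce_tracked()</code>, <code>return_first()</code> and <code>return_first_move()</code> are hypothetical names introduced for illustration, and the counts in the comments assume C++17, where initializing <code>v</code> from the returned prvalue involves no copy or move.</p>
<pre class="lmargin50">
#include &lt;utility&gt;

// Hypothetical helper type: counts its own copy and move constructions.
struct tracked {
    static inline int copies = 0;
    static inline int moves = 0;
    tracked() = default;
    tracked(const tracked&amp;) { ++copies; }
    tracked(tracked&amp;&amp;) noexcept { ++moves; }
};

// Hypothetical stand-in for produce(); both members are built in place.
std::pair&lt;tracked, tracked&gt; produce_tracked() { return {}; }

tracked return_first() {
    auto v = produce_tracked();
    return v.first;             // lvalue subobject: one copy construction
}

tracked return_first_move() {
    auto v = produce_tracked();
    return std::move(v.first);  // xvalue subobject: one move construction
}
</pre>
<p>Under the current rules, each call performs exactly one construction of the returned object from the subobject: a copy in <code>return_first()</code> and a move in <code>return_first_move()</code>. Subobject copy elision would permit a compiler to perform neither.</p>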

<p>Ideally, compilers should be allowed to optimize away the copying of subobjects. In that case, the compiler would be able to transform the code into the following:</p>

<pre class="lmargin50">
#include &lt;utility&gt;
#include &lt;string&gt;

std::pair&lt;std::string, std::string&gt; produce();

int example_1_optimized() {
    auto v = produce();
    if (!v.first.empty()) {
        return v.first.size();
    }
    return v.second.size();
}
</pre>

<p>Compilers could perform the above optimization if the callee were inlined and subobject copy elision were allowed (as proposed in this paper).
In that case, a compiler could allocate enough stack space in the caller to store the owning object <code>v</code>,
call the destructors of the subobjects that are not returned from the callee, and proceed with the remaining subobject as the return value.</p>

<p>But [class.copy.elision] currently prevents that optimization:</p>
<pre class="lmargin50 stdquote">
This elision of copy/move operations, called copy elision, is permitted in the following
circumstances (which may be combined to eliminate multiple copies):

  - in a return statement in a function with a class return type, when the expression <b>is the name of
    a non-volatile automatic object</b> (other than a function parameter or a variable introduced by the
    exception-declaration of a handler (18.3)) with the same type (ignoring cv-qualification) as the function
    return type, the copy/move operation can be omitted by constructing the automatic object directly
    into the function call’s return object
</pre>
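<p>For contrast, here is a sketch of the case that the quoted rule does cover, next to the one it does not. The type <code>tracked</code> and both functions are hypothetical and exist only for this illustration.</p>
<pre class="lmargin50">
#include &lt;utility&gt;

// Hypothetical helper type: counts its own copy constructions.
struct tracked {
    static inline int copies = 0;
    tracked() = default;
    tracked(const tracked&amp;) { ++copies; }
    tracked(tracked&amp;&amp;) noexcept {}
};

tracked return_whole() {
    tracked t;
    return t;        // names a complete automatic object: elision is permitted,
                     // and otherwise the object is moved, never copied
}

tracked return_part() {
    std::pair&lt;tracked, tracked&gt; v;
    return v.first;  // names a subobject: no elision and no implicit move,
                     // so the copy constructor runs
}
</pre>
<p>Calling <code>return_whole()</code> never increments <code>tracked::copies</code>, while calling <code>return_part()</code> increments it once.</p>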



<p>Here are the compiler-optimized assemblies for the above 3 code snippets:</p>
<table border="1" rules="cols" style="width: 100%">
<tr><th style="width: 33%">example_1()</th><th style="width: 33%">example_1_move()</th><th style="width: 33%">example_1_optimized()</th></tr>
<tr><th><a href="https://godbolt.org/g/V6KqWv">https://godbolt.org/g/V6KqWv</a></th><th><a href="https://godbolt.org/g/3xZvtw">https://godbolt.org/g/3xZvtw</a></th><th><a href="https://godbolt.org/g/RiTmPE">https://godbolt.org/g/RiTmPE</a></th></tr>
<tr>
    <td valign="top" style="border-right: solid 1px #000;"><pre>
.LC0:
    .string "basic_string::_M_construct
        null not valid"
.LC1:
    .string "basic_string::_M_create"
&lt;skip&gt;::_M_construct(&lt;skip&gt;)
    push    rbp
    push    rbx
    mov     rbp, rdi
    sub     rsp, 24
    test    rsi, rsi
    jne     .L2
    test    rdx, rdx
    jne     .L19
.L2:
    mov     rbx, rdx
    sub     rbx, rsi
    cmp     rbx, 15
    ja      .L20
    mov     rdx, QWORD PTR [rbp+0]
    cmp     rbx, 1
    mov     rax, rdx
    je      .L21
    test    rbx, rbx
    jne     .L5
    mov     QWORD PTR [rbp+8], rbx
    mov     BYTE PTR [rdx+rbx], 0
    add     rsp, 24
    pop     rbx
    pop     rbp
    ret
.L21:
    movzx   eax, BYTE PTR [rsi]
    mov     BYTE PTR [rdx], al
    mov     rdx, QWORD PTR [rbp+0]
    mov     QWORD PTR [rbp+8], rbx
    mov     BYTE PTR [rdx+rbx], 0
    add     rsp, 24
    pop     rbx
    pop     rbp
    ret
.L20:
    test    rbx, rbx
    js      .L22
    lea     rdi, [rbx+1]
    mov     QWORD PTR [rsp+8], rsi
    call    operator new(unsigned long)
    mov     rsi, QWORD PTR [rsp+8]
    mov     QWORD PTR [rbp+0], rax
    mov     QWORD PTR [rbp+16], rbx
.L5:
    mov     rdx, rbx
    mov     rdi, rax
    call    memcpy
    mov     rdx, QWORD PTR [rbp+0]
    mov     QWORD PTR [rbp+8], rbx
    mov     BYTE PTR [rdx+rbx], 0
    add     rsp, 24
    pop     rbx
    pop     rbp
    ret
.L19:
    mov     edi, OFFSET FLAT:.LC0
    call    std::__throw_logic_error
.L22:
    mov     edi, OFFSET FLAT:.LC1
    call    std::__throw_length_error
example_1():
    push    rbp
    push    rbx
    sub     rsp, 104
    lea     rdi, [rsp+32]
    mov     rbx, rsp
    call    produce[abi:cxx11]()
    mov     rdx, QWORD PTR [rsp+40]
    lea     rax, [rbx+16]
    mov     QWORD PTR [rsp], rax
    test    rdx, rdx
    je      .L24
    mov     rsi, QWORD PTR [rsp+32]
    mov     rdi, rbx
    add     rdx, rsi
    call    &lt;skip&gt;::_M_construct(&lt;skip&gt;)
.L27:
    mov     rdi, QWORD PTR [rsp+64]
    lea     rax, [rsp+80]
    cmp     rdi, rax
    je      .L26
    call    operator delete(void*)
.L26:
    mov     rdi, QWORD PTR [rsp+32]
    lea     rax, [rsp+48]
    cmp     rdi, rax
    je      .L28
    call    operator delete(void*)
.L28:
    mov     rdi, QWORD PTR [rsp]
    add     rbx, 16
    mov     rbp, QWORD PTR [rsp+8]
    cmp     rdi, rbx
    je      .L23
    call    operator delete(void*)
.L23:
    add     rsp, 104
    mov     eax, ebp
    pop     rbx
    pop     rbp
    ret
.L24:
    mov     rsi, QWORD PTR [rsp+64]
    mov     rdx, QWORD PTR [rsp+72]
    mov     rdi, rbx
    add     rdx, rsi
    call    &lt;skip&gt;::_M_construct(&lt;skip&gt;)
    jmp     .L27
    mov     rdi, QWORD PTR [rsp+64]
    mov     rbx, rax
    lea     rax, [rsp+80]
    cmp     rdi, rax
    je      .L32
    call    operator delete(void*)
.L32:
    mov     rdi, QWORD PTR [rsp+32]
    lea     rax, [rsp+48]
    cmp     rdi, rax
    je      .L33
    call    operator delete(void*)
.L33:
    mov     rdi, rbx
    call    _Unwind_Resume
    </pre></td>
    <td valign="top" style="border-right: solid 1px #000;"><pre>
example_1_move():
    push    rbp
    push    rbx
    sub     rsp, 104
    lea     rdi, [rsp+32]
    mov     rbx, rsp
    call    produce[abi:cxx11]()
    mov     rax, QWORD PTR [rsp+40]
    test    rax, rax
    je      .L2
    lea     rdx, [rbx+16]
    lea     rcx, [rsp+48]
    mov     QWORD PTR [rsp], rdx
    mov     rdx, QWORD PTR [rsp+32]
    cmp     rdx, rcx
    je      .L13
    mov     QWORD PTR [rsp], rdx
    mov     rdx, QWORD PTR [rsp+48]
    mov     QWORD PTR [rsp+16], rdx
.L4:
    mov     QWORD PTR [rsp+8], rax
    lea     rax, [rsp+48]
    mov     rdi, QWORD PTR [rsp+64]
    mov     QWORD PTR [rsp+40], 0
    mov     BYTE PTR [rsp+48], 0
    mov     QWORD PTR [rsp+32], rax
    lea     rax, [rsp+80]
    cmp     rdi, rax
    je      .L6
    call    operator delete(void*)
    jmp     .L6
.L2:
    lea     rax, [rbx+16]
    lea     rdx, [rsp+80]
    mov     QWORD PTR [rsp], rax
    mov     rax, QWORD PTR [rsp+64]
    cmp     rax, rdx
    je      .L14
    mov     QWORD PTR [rsp], rax
    mov     rax, QWORD PTR [rsp+80]
    mov     QWORD PTR [rsp+16], rax
.L8:
    mov     rax, QWORD PTR [rsp+72]
    mov     QWORD PTR [rsp+8], rax
.L6:
    mov     rdi, QWORD PTR [rsp+32]
    lea     rax, [rsp+48]
    cmp     rdi, rax
    je      .L9
    call    operator delete(void*)
.L9:
    mov     rdi, QWORD PTR [rsp]
    add     rbx, 16
    mov     rbp, QWORD PTR [rsp+8]
    cmp     rdi, rbx
    je      .L1
    call    operator delete(void*)
.L1:
    add     rsp, 104
    mov     eax, ebp
    pop     rbx
    pop     rbp
    ret
.L14:
    movdqa  xmm0, XMMWORD PTR [rsp+80]
    movaps  XMMWORD PTR [rsp+16], xmm0
    jmp     .L8
.L13:
    movdqa  xmm0, XMMWORD PTR [rsp+48]
    movaps  XMMWORD PTR [rsp+16], xmm0
    jmp     .L4
    </pre></td>
    <td valign="top" style="border-right: solid 1px #000;"><pre>
example_1_optimized():
    push    rbx
    sub     rsp, 64
    mov     rdi, rsp
    call    produce[abi:cxx11]()
    mov     rax, QWORD PTR [rsp+8]
    test    rax, rax
    mov     ebx, eax
    jne     .L3
    mov     ebx, DWORD PTR [rsp+40]
.L3:
    mov     rdi, QWORD PTR [rsp+32]
    lea     rax, [rsp+48]
    cmp     rdi, rax
    je      .L4
    call    operator delete(void*)
.L4:
    mov     rdi, QWORD PTR [rsp]
    lea     rdx, [rsp+16]
    cmp     rdi, rdx
    je      .L1
    call    operator delete(void*)
.L1:
    add     rsp, 64
    mov     eax, ebx
    pop     rbx
    ret
    </pre></td>
</tr>
</table>
<h3 class="notes"><b>example_1_optimized() pros:</b> avoided copy construction, avoided destructor call, smaller binary, faster code.</h3>

    <h2>III. Motivating example #2</h2>
    <p>Let's change the above example to use a type, such as <code>std::variant</code>, with a non-trivial destructor:</p>
<pre class="lmargin50">
#include &lt;variant&gt;
#include &lt;string&gt;
std::variant&lt;int, std::string&gt; produce();

static std::string get_string() {
    auto v = produce();
    if (auto* p = std::get_if&lt;std::string&gt;(&amp;v)) {
        return *p;
    }
    return {};
}

int example_2() {
    return get_string().size();
}
</pre>

<p>Without user help, compilers are forced to copy construct a <code>std::string</code> from <code>*p</code>.</p>
<p>Such code often appears in practice: functions like <code>get_string()</code> are defined in a library header, and functions like <code>example_2()</code> are written by other developers in their own source files.</p>
<p>Experienced users may improve the above code:</p>

<pre class="lmargin50">
#include &lt;variant&gt;
#include &lt;string&gt;
std::variant&lt;int, std::string&gt; produce();

static std::string get_string_move() {
    auto v = produce();
    if (auto* p = std::get_if&lt;std::string&gt;(&amp;v)) {
        return <span class="notes">std::move(</span>*p<span class="notes">)</span>;
    }
    return {};
}

int example_2_move() {
    return get_string_move().size();
}
</pre>

<p>Again, the above optimization <b>requires a highly experienced user</b> and, as in the first example, still does not produce a perfect result.</p>
<p>I assume that allowing copy elision for subobjects would let compilers transform <code>example_2()</code> into code close to the following:</p>

<pre class="lmargin50">
#include &lt;variant&gt;
#include &lt;string&gt;
std::variant&lt;int, std::string&gt; produce();

int example_2_optimized() {
    auto v = produce();
    if (auto* p = std::get_if&lt;std::string&gt;(&amp;v)) {
        return p->size();
    }
    return 0;
}
</pre>

<p>Currently [class.copy.elision] prevents that optimization:</p>
<pre class="lmargin50 stdquote">
This elision of copy/move operations, called copy elision, is permitted in the following
circumstances (which may be combined to eliminate multiple copies):

  - in a return statement in a function with a class return type, when the expression <b>is the name of
    a non-volatile automatic object</b> (other than a function parameter or a variable introduced by the
    exception-declaration of a handler (18.3)) <b>with the same type</b> (ignoring cv-qualification) as the function
    return type, the copy/move operation can be omitted by constructing the automatic object directly
    into the function call’s return object
</pre>

<p>Here are the compiler-optimized assemblies for the above 3 code snippets:</p>
<table border="1" rules="cols" style="width: 100%">
<tr><th style="width: 33%">example_2()</th><th style="width: 33%">example_2_move()</th><th style="width: 33%">example_2_optimized()</th></tr>
<tr><th><a href="https://godbolt.org/g/BMC1QY">https://godbolt.org/g/BMC1QY</a></th><th><a href="https://godbolt.org/g/WaV79e">https://godbolt.org/g/WaV79e</a></th><th><a href="https://godbolt.org/g/aT1Zmm">https://godbolt.org/g/aT1Zmm</a></th></tr>
<tr>
    <td valign="top" style="border-right: solid 1px #000;"><pre>
__erased_dtor<&lt;skip&gt;, 0ul>:
    rep ret
__erased_dtor<&lt;skip&gt;, 1ul>:
    mov     rdx, QWORD PTR [rdi]
    lea     rax, [rdi+16]
    cmp     rdx, rax
    je      .L3
    mov     rdi, rdx
    jmp     operator delete(void*)
.L3:
    rep ret
.LC0:
    .string "basic_string::_M_construct
        null not valid"
.LC1:
    .string "basic_string::_M_create"
example_2():
    push    r12
    push    rbp
    push    rbx
    sub     rsp, 80
    lea     rdi, [rsp+32]
    mov     rbx, rsp
    call    produce[abi:cxx11]()
    movzx   eax, BYTE PTR [rsp+64]
    cmp     al, 1
    je      .L35
    lea     rdx, [rbx+16]
    mov     QWORD PTR [rsp+8], 0
    mov     BYTE PTR [rsp+16], 0
    mov     QWORD PTR [rsp], rdx
.L13:
    cmp     al, -1
    je      .L14
    lea     rdi, [rsp+32]
    call    [QWORD PTR _Variant_stg[0+rax*8]]
.L14:
    mov     rdi, QWORD PTR [rsp]
    add     rbx, 16
    mov     rbp, QWORD PTR [rsp+8]
    cmp     rdi, rbx
    je      .L5
    call    operator delete(void*)
.L5:
    add     rsp, 80
    mov     eax, ebp
    pop     rbx
    pop     rbp
    pop     r12
    ret
.L35:
    mov     r12, QWORD PTR [rsp+32]
    lea     rax, [rbx+16]
    mov     rbp, QWORD PTR [rsp+40]
    mov     QWORD PTR [rsp], rax
    mov     rax, r12
    add     rax, rbp
    je      .L7
    test    r12, r12
    je      .L36
.L7:
    cmp     rbp, 15
    ja      .L37
    cmp     rbp, 1
    je      .L38
    test    rbp, rbp
    jne     .L19
.L34:
    mov     rax, QWORD PTR [rsp]
.L12:
    mov     QWORD PTR [rsp+8], rbp
    mov     BYTE PTR [rax+rbp], 0
    movzx   eax, BYTE PTR [rsp+64]
    jmp     .L13
.L38:
    movzx   eax, BYTE PTR [r12]
    mov     BYTE PTR [rsp+16], al
    lea     rax, [rbx+16]
    jmp     .L12
.L37:
    test    rbp, rbp
    js      .L39
    lea     rdi, [rbp+1]
    call    operator new(unsigned long)
    mov     QWORD PTR [rsp], rax
    mov     QWORD PTR [rsp+16], rbp
.L10:
    mov     rdx, rbp
    mov     rsi, r12
    mov     rdi, rax
    call    memcpy
    jmp     .L34
.L19:
    lea     rax, [rbx+16]
    jmp     .L10
.L36:
    mov     edi, OFFSET FLAT:.LC0
    call    std::__throw_logic_error
    movzx   edx, BYTE PTR [rsp+64]
    mov     rbx, rax
    cmp     dl, -1
    je      .L18
    lea     rdi, [rsp+32]
    call    [QWORD PTR _Variant_stg[0+rdx*8]]
.L18:
    mov     rdi, rbx
    call    _Unwind_Resume
.L39:
    mov     edi, OFFSET FLAT:.LC1
    call    std::__throw_length_error
_Variant_stg:
    .quad   __erased_dtor<&lt;skip&gt;, 0ul>
    .quad   __erased_dtor<&lt;skip&gt;, 1ul>
    </pre></td>


    <td valign="top" style="border-right: solid 1px #000;"><pre>
__erased_dtor<&lt;skip&gt;, 0ul>:
    rep ret
__erased_dtor<&lt;skip&gt;, 1ul>:
    mov     rdx, QWORD PTR [rdi]
    lea     rax, [rdi+16]
    cmp     rdx, rax
    je      .L3
    mov     rdi, rdx
    jmp     operator delete(void*)
.L3:
    rep ret
example_2_move():
    push    rbp
    push    rbx
    sub     rsp, 88
    lea     rdi, [rsp+32]
    mov     rbx, rsp
    call    produce[abi:cxx11]()
    movzx   eax, BYTE PTR [rsp+64]
    lea     rdx, [rbx+16]
    mov     QWORD PTR [rsp], rdx
    cmp     al, 1
    je      .L16
    cmp     al, -1
    mov     QWORD PTR [rsp+8], 0
    mov     BYTE PTR [rsp+16], 0
    je      .L10
.L9:
    lea     rdi, [rsp+32]
    call    [QWORD PTR _Variant_stg[0+rax*8]]
.L10:
    mov     rdi, QWORD PTR [rsp]
    add     rbx, 16
    mov     rbp, QWORD PTR [rsp+8]
    cmp     rdi, rbx
    je      .L5
    call    operator delete(void*)
.L5:
    add     rsp, 88
    mov     eax, ebp
    pop     rbx
    pop     rbp
    ret
.L16:
    mov     rdx, QWORD PTR [rsp+32]
    lea     rcx, [rsp+48]
    cmp     rdx, rcx
    je      .L17
    mov     QWORD PTR [rsp], rdx
    mov     rdx, QWORD PTR [rsp+48]
    mov     QWORD PTR [rsp+16], rdx
.L8:
    mov     rdx, QWORD PTR [rsp+40]
    mov     BYTE PTR [rsp+48], 0
    mov     QWORD PTR [rsp+40], 0
    mov     QWORD PTR [rsp+8], rdx
    lea     rdx, [rsp+48]
    mov     QWORD PTR [rsp+32], rdx
    jmp     .L9
.L17:
    movdqa  xmm0, XMMWORD PTR [rsp+48]
    movaps  XMMWORD PTR [rsp+16], xmm0
    jmp     .L8
_Variant_stg:
    .quad   __erased_dtor<&lt;skip&gt;, 0ul>
    .quad   __erased_dtor<&lt;skip&gt;, 1ul>
    </pre></td>


    <td valign="top" style="border-right: solid 1px #000;"><pre>
__erased_dtor<&lt;skip&gt;, 0ul>:
    rep ret
__erased_dtor<&lt;skip&gt;, 1ul>:
    mov     rdx, QWORD PTR [rdi]
    lea     rax, [rdi+16]
    cmp     rdx, rax
    je      .L3
    mov     rdi, rdx
    jmp     operator delete(void*)
.L3:
    rep ret
example_2_optimized():
    push    rbx
    xor     ebx, ebx
    sub     rsp, 48
    mov     rdi, rsp
    call    produce[abi:cxx11]()
    movzx   edx, BYTE PTR [rsp+32]
    cmp     dl, -1
    je      .L5
    cmp     dl, 1
    je      .L12
.L7:
    mov     rdi, rsp
    call    [QWORD PTR _Variant_stg[0+rdx*8]]
.L5:
    add     rsp, 48
    mov     eax, ebx
    pop     rbx
    ret
.L12:
    mov     ebx, DWORD PTR [rsp+8]
    jmp     .L7
_Variant_stg:
    .quad   __erased_dtor<&lt;skip&gt;, 0ul>
    .quad   __erased_dtor<&lt;skip&gt;, 1ul>
    </pre></td>
</tr>
</table>

<h3 class="notes"><b>Pros:</b> Avoided copy construction, avoided destructor call, smaller binary, faster code.</h3>


















    <h2>IV. Motivating example #3</h2>
    <p>This time we'll look at some code where compilers show the same results with copy elision and with <code>std::move</code>:</p>
<pre class="lmargin50">
#include &lt;utility&gt;
#include &lt;string&gt;
#include &lt;memory&gt;

struct some_type {
    short i;
};

std::pair&lt;int, std::shared_ptr&lt;some_type&gt;&gt; produce();

static std::shared_ptr&lt;some_type&gt; return_second() {
    auto v = produce();
    return v.second;
}

int example_3() {
    return !!return_second();
}
</pre>


    <p>With <code>std::move</code>:</p>
<pre class="lmargin50">
#include &lt;utility&gt;
#include &lt;string&gt;
#include &lt;memory&gt;

struct some_type {
    short i;
};

std::pair&lt;int, std::shared_ptr&lt;some_type&gt;&gt; produce();

static std::shared_ptr&lt;some_type&gt; return_second_move() {
    auto v = produce();
    return std::move(v.second);
}

int example_3_move() {
    return !!return_second_move();
}
</pre>

    <p>Code that emulates copy elision:</p>
<pre class="lmargin50">
#include &lt;utility&gt;
#include &lt;string&gt;
#include &lt;memory&gt;

struct some_type {
    short i;
};

std::pair&lt;int, std::shared_ptr&lt;some_type&gt;&gt; produce();

int example_3_optimized() {
    return !!produce().second;
}
</pre>

<table border="1" rules="cols" style="width: 100%">
<tr><th style="width: 33%">example_3()</th><th style="width: 33%">example_3_move()</th><th style="width: 33%">example_3_optimized()</th></tr>
<tr><th><a href="https://godbolt.org/g/LcTD1y">https://godbolt.org/g/LcTD1y</a></th><th><a href="https://godbolt.org/g/HMqKAr">https://godbolt.org/g/HMqKAr</a></th><th><a href="https://godbolt.org/g/QK3nPF">https://godbolt.org/g/QK3nPF</a></th></tr>
<tr>
    <td valign="top" style="border-right: solid 1px #000;"><pre>
&lt;skip&gt;::_M_dispose():
    rep ret
&lt;skip&gt;_M_destroy():
    mov     rax, QWORD PTR [rdi]
    jmp     [QWORD PTR [rax+8]]
example_3():
    push    r13
    push    r12
    push    rbp
    push    rbx
    sub     rsp, 56
    lea     rdi, [rsp+16]
    call    produce()
    mov     rbx, QWORD PTR [rsp+32]
    mov     rax, QWORD PTR [rsp+24]
    test    rbx, rbx
    je      .L34
    test    rax, rax
    lea     rbp, [rbx+8]
    mov     r12d, OFFSET FLAT:__gthrw___pthread_key_create
    setne   al
    test    r12, r12
    mov     rcx, rbp
    movzx   eax, al
    je      .L7
    lock add        DWORD PTR [rbp+0], 1
    mov     r13, QWORD PTR [rsp+32]
    test    r13, r13
    lea     rcx, [r13+8]
    je      .L23
    test    r12, r12
    je      .L10
.L37:
    mov     edx, -1
    lock xadd       DWORD PTR [rcx], edx
    cmp     edx, 1
    je      .L35
.L23:
    test    r12, r12
    je      .L17
    mov     edx, -1
    lock xadd       DWORD PTR [rbp+0], edx
    cmp     edx, 1
    je      .L36
.L4:
    add     rsp, 56
    pop     rbx
    pop     rbp
    pop     r12
    pop     r13
    ret
.L7:
    add     DWORD PTR [rbx+8], 1
    test    r12, r12
    mov     r13, rbx
    jne     .L37
.L10:
    mov     edx, DWORD PTR [r13+8]
    lea     ecx, [rdx-1]
    cmp     edx, 1
    mov     DWORD PTR [r13+8], ecx
    jne     .L23
.L35:
    mov     rdx, QWORD PTR [r13+0]
    mov     rdx, QWORD PTR [rdx+16]
    cmp     rdx, OFFSET FLAT:&lt;skip&gt;::_M_dispose()
    jne     .L38
.L13:
    test    r12, r12
    je      .L14
    mov     edx, -1
    lock xadd       DWORD PTR [r13+12], edx
.L15:
    cmp     edx, 1
    jne     .L23
    mov     rdx, QWORD PTR [r13+0]
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, r13
    mov     rcx, QWORD PTR [rdx+24]
    cmp     rcx, OFFSET FLAT:&lt;skip&gt;_M_destroy()
    jne     .L16
    call    [QWORD PTR [rdx+8]]
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L23
.L34:
    test    rax, rax
    setne   al
    add     rsp, 56
    pop     rbx
    movzx   eax, al
    pop     rbp
    pop     r12
    pop     r13
    ret
.L17:
    mov     edx, DWORD PTR [rbx+8]
    lea     ecx, [rdx-1]
    cmp     edx, 1
    mov     DWORD PTR [rbx+8], ecx
    jne     .L4
.L36:
    mov     rdx, QWORD PTR [rbx]
    mov     rdx, QWORD PTR [rdx+16]
    cmp     rdx, OFFSET FLAT:&lt;skip&gt;::_M_dispose()
    jne     .L39
.L19:
    test    r12, r12
    je      .L20
    mov     edx, -1
    lock xadd       DWORD PTR [rbx+12], edx
.L21:
    cmp     edx, 1
    jne     .L4
    mov     rdx, QWORD PTR [rbx]
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, rbx
    mov     rcx, QWORD PTR [rdx+24]
    cmp     rcx, OFFSET FLAT:&lt;skip&gt;_M_destroy()
    jne     .L22
    call    [QWORD PTR [rdx+8]]
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L4
.L14:
    mov     edx, DWORD PTR [r13+12]
    lea     ecx, [rdx-1]
    mov     DWORD PTR [r13+12], ecx
    jmp     .L15
.L20:
    mov     edx, DWORD PTR [rbx+12]
    lea     ecx, [rdx-1]
    mov     DWORD PTR [rbx+12], ecx
    jmp     .L21
.L39:
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, rbx
    call    rdx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L19
.L38:
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, r13
    call    rdx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L13
.L16:
    call    rcx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L23
.L22:
    call    rcx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L4
    </pre></td>

    <td valign="top" style="border-right: solid 1px #000;"><pre>
&lt;skip&gt;::_M_dispose():
    rep ret
&lt;skip&gt;_M_destroy():
    mov     rax, QWORD PTR [rdi]
    jmp     [QWORD PTR [rax+8]]
example_3_move():
    push    rbp
    push    rbx
    sub     rsp, 56
    lea     rdi, [rsp+16]
    call    produce()
    xor     eax, eax
    cmp     QWORD PTR [rsp+24], 0
    mov     rbx, QWORD PTR [rsp+32]
    setne   al
    test    rbx, rbx
    je      .L4
    mov     ebp, OFFSET FLAT:__gthrw___pthread_key_create
    test    rbp, rbp
    je      .L7
    mov     edx, -1
    lock xadd       DWORD PTR [rbx+8], edx
    cmp     edx, 1
    je      .L18
.L4:
    add     rsp, 56
    pop     rbx
    pop     rbp
    ret
.L7:
    mov     edx, DWORD PTR [rbx+8]
    lea     ecx, [rdx-1]
    cmp     edx, 1
    mov     DWORD PTR [rbx+8], ecx
    jne     .L4
.L18:
    mov     rdx, QWORD PTR [rbx]
    mov     rdx, QWORD PTR [rdx+16]
    cmp     rdx, OFFSET FLAT:&lt;skip&gt;::_M_dispose()
    jne     .L19
.L10:
    test    rbp, rbp
    je      .L11
    mov     edx, -1
    lock xadd       DWORD PTR [rbx+12], edx
.L12:
    cmp     edx, 1
    jne     .L4
    mov     rdx, QWORD PTR [rbx]
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, rbx
    mov     rcx, QWORD PTR [rdx+24]
    cmp     rcx, OFFSET FLAT:&lt;skip&gt;_M_destroy()
    jne     .L13
    call    [QWORD PTR [rdx+8]]
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L4
.L11:
    mov     edx, DWORD PTR [rbx+12]
    lea     ecx, [rdx-1]
    mov     DWORD PTR [rbx+12], ecx
    jmp     .L12
.L19:
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, rbx
    call    rdx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L10
.L13:
    call    rcx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L4
    </pre></td>

    <td valign="top" style="border-right: solid 1px #000;"><pre>
&lt;skip&gt;::_M_dispose():
    rep ret
&lt;skip&gt;_M_destroy():
    mov     rax, QWORD PTR [rdi]
    jmp     [QWORD PTR [rax+8]]
example_3_optimized():
    push    rbp
    push    rbx
    sub     rsp, 56
    lea     rdi, [rsp+16]
    call    produce()
    xor     eax, eax
    cmp     QWORD PTR [rsp+24], 0
    mov     rbx, QWORD PTR [rsp+32]
    setne   al
    test    rbx, rbx
    je      .L4
    mov     ebp, OFFSET FLAT:__gthrw___pthread_key_create
    test    rbp, rbp
    je      .L7
    mov     edx, -1
    lock xadd       DWORD PTR [rbx+8], edx
    cmp     edx, 1
    je      .L18
.L4:
    add     rsp, 56
    pop     rbx
    pop     rbp
    ret
.L7:
    mov     edx, DWORD PTR [rbx+8]
    lea     ecx, [rdx-1]
    cmp     edx, 1
    mov     DWORD PTR [rbx+8], ecx
    jne     .L4
.L18:
    mov     rdx, QWORD PTR [rbx]
    mov     rdx, QWORD PTR [rdx+16]
    cmp     rdx, OFFSET FLAT:&lt;skip&gt;::_M_dispose()
    jne     .L19
.L10:
    test    rbp, rbp
    je      .L11
    mov     edx, -1
    lock xadd       DWORD PTR [rbx+12], edx
.L12:
    cmp     edx, 1
    jne     .L4
    mov     rdx, QWORD PTR [rbx]
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, rbx
    mov     rcx, QWORD PTR [rdx+24]
    cmp     rcx, OFFSET FLAT:&lt;skip&gt;_M_destroy()
    jne     .L13
    call    [QWORD PTR [rdx+8]]
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L4
.L11:
    mov     edx, DWORD PTR [rbx+12]
    lea     ecx, [rdx-1]
    mov     DWORD PTR [rbx+12], ecx
    jmp     .L12
.L19:
    mov     DWORD PTR [rsp+12], eax
    mov     rdi, rbx
    call    rdx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L10
.L13:
    call    rcx
    mov     eax, DWORD PTR [rsp+12]
    jmp     .L4
    </pre></td>
</tr>
</table>

<h3 class="notes"><b>Pros:</b> Avoided copy construction, avoided destructor call, smaller binary, faster code.</h3>

<p class="notes">Even when copy elision yields no benefit over <code>std::move</code>, it is better to get the optimization automatically, without requiring user intervention.</p>
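<p>The extra work in the first column comes from copying the <code>std::shared_ptr</code>, which has to increment the reference count (the <code>lock add</code> in the assembly); a move leaves the count alone. A minimal sketch of that difference, with <code>produce_shared()</code>, <code>owners_after_copy()</code> and <code>owners_after_move()</code> as hypothetical names introduced for illustration:</p>
<pre class="lmargin50">
#include &lt;memory&gt;
#include &lt;utility&gt;

struct some_type {
    short i;
};

// Hypothetical stand-in for produce(), defined here so the sketch links.
std::pair&lt;int, std::shared_ptr&lt;some_type&gt;&gt; produce_shared() {
    return {1, std::make_shared&lt;some_type&gt;()};
}

// Owner count observed right after copying v.second.
long owners_after_copy() {
    auto v = produce_shared();
    std::shared_ptr&lt;some_type&gt; r = v.second;             // count goes from 1 to 2
    return r.use_count();
}

// Owner count observed right after moving from v.second.
long owners_after_move() {
    auto v = produce_shared();
    std::shared_ptr&lt;some_type&gt; r = std::move(v.second);  // count stays at 1
    return r.use_count();
}
</pre>
<p>Here <code>owners_after_copy()</code> returns 2 and <code>owners_after_move()</code> returns 1. With subobject copy elision, the count would likewise never have to change.</p>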





















    <h2>V. Proposed wording 1: A simple solution for relaxing [class.copy.elision]</h2>
    <p>Modify [class.copy.elision] paragraph 1 to allow copy elision when returning a subobject of an object with a defaulted destructor:</p>
    <div class="lmargin50 stdquote">
    <p>This elision of copy/move operations, called copy elision, is permitted
in the following circumstances (which may be combined to eliminate multiple copies):</p>
    <p>&lt;...&gt;</p>
    <div class="cppaddition" style="font-family: Consolas,monospace">
    <p>&ndash; in a return statement in a function with a class return type, when</p>
    <ul>
        <li>the expression is a non-volatile subobject of a non-volatile local object</li>
        <li>and the non-volatile local object has a defaulted destructor</li>
        <li>and the type of the subobject is the same (ignoring cv-qualification) as the function return type</li>
        <li>and the subobject is not accessed between a copy/move construction of it and its destruction.</li>
    </ul>
    <p>the copy/move operation can be omitted by inlining the function call, constructing the automatic object directly
    in the caller,
    calling the destructors for non-returned subobjects and treating the target as another way to refer to the subobject.</p>
    </div>
    <p>&lt;...&gt;</p>
    </div>
    <p>This would allow optimizing code that uses <code>std::pair</code>, <code>std::tuple</code>, and all aggregate-initializable types.</p>


    <h2>VI. Proposed wording 2: A generic solution for relaxing [class.copy.elision]</h2>
    <p>Modify [class.copy.elision] paragraph 1 to allow copy elision when returning a subobject of a local object (this drops the "<i>defaulted destructor</i>" requirement):</p>
    <div class="lmargin50 stdquote">
    <p>This elision of copy/move operations, called copy elision, is permitted
in the following circumstances (which may be combined to eliminate multiple copies):</p>
    <p>&lt;...&gt;</p>
    <div class="cppaddition">
    <p>&ndash; in a return statement in a function with a class return type, when</p>
    <ul>
        <li>the expression is a non-volatile subobject of a non-volatile local object</li>
        <li>and the type of the subobject is the same (ignoring cv-qualification) as the function return type</li>
        <li>and the subobject is not accessed between a copy/move construction of it and its destruction.</li>
    </ul>
    <p>the copy/move operation can be omitted by inlining the function call, constructing the automatic object directly
    in the caller,
    calling the destructors for non-returned subobjects and treating the target as another way to refer to the subobject.</p>
    </div>
    <p>&lt;...&gt;</p>
    </div>
    <p>This will allow optimizing <code>std::pair</code>, <code>std::tuple</code>, all aggregate-initializable types, and probably also <code>std::variant</code>,
    <code>std::optional</code>, and many user-provided classes.</p>
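    <p>The following sketch illustrates a case that only this more generic wording would cover. The class <code>Logger</code>, the counter <code>destroyed_loggers</code> and the function <code>take_buffer</code> are hypothetical names chosen for illustration. The local object has a user-provided destructor, so it falls outside the "defaulted destructor" variant, yet the destructor does not touch the returned subobject.</p>

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical counter, used only to make the destructor non-trivial.
int destroyed_loggers = 0;

// Hypothetical user-provided class with a user-provided destructor.
struct Logger {
    std::string buffer;

    explicit Logger(std::string text) : buffer(std::move(text)) {}

    // User-provided destructor that deliberately does not access `buffer`,
    // so the returned subobject is not accessed between being copied and
    // being destroyed.
    ~Logger() { ++destroyed_loggers; }
};

// Returns a subobject of a local object whose destructor is user-provided.
// Under the current rules `log.buffer` is copy-constructed into the result.
std::string take_buffer() {
    Logger log("pending output");
    return log.buffer;
}
```

    <p>If the destructor read or modified <code>buffer</code>, the last condition of the wording would not hold and the copy would have to remain.</p>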


    <h2>VII. Proposed wording 3: An ultimate solution for relaxing [class.copy.elision]</h2>
    <p>Use the "Ultimate copy elision" wording from our companion paper P0889R0.</p>


    <h2>VIII. Minor points</h2>
    <p>Automatically applying <code>std::move</code> to returned subobjects may be a good idea, but it belongs in a separate proposal. As the first two examples showed, copy elision
    can be profitable even when the subobject is moved out.</p>
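    <p>To make the distinction concrete, here is a small sketch of what happens under the current rules when a subobject is returned with and without an explicit <code>std::move</code>. The instrumented type <code>Counted</code>, the counters and both functions are hypothetical and exist only for illustration.</p>

```cpp
#include <cassert>
#include <utility>

// Hypothetical counters recording which constructor ran.
int copy_count = 0;
int move_count = 0;

// Hypothetical instrumented type.
struct Counted {
    Counted() = default;
    Counted(const Counted&) { ++copy_count; }
    Counted(Counted&&) noexcept { ++move_count; }
};

// `local.first` is a member access, not the name of a local variable,
// so it is not implicitly moved: the copy constructor runs.
Counted return_by_copy() {
    std::pair<Counted, int> local;
    return local.first;
}

// The explicit std::move selects the move constructor instead.
Counted return_by_move() {
    std::pair<Counted, int> local;
    return std::move(local.first);
}
```

    <p>In both functions a constructor call and the later destruction of the local subobject still take place; eliding the operation altogether would remove both.</p>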

    <p>It would be wrong to guarantee or require the above optimizations, as they depend heavily on the compiler's inlining and aliasing analysis.</p>

    <p>Copy elision of subobjects could be useful in many cases. This paper aims to permit more optimizations rather than to mandate particular ones,
    so compiler vendors are free to find alternatives to the optimizations shown above.
    </p>


    <h2>IX. Acknowledgements</h2>
    <p>Many thanks to Walter E. Brown for fixing numerous issues in draft versions of this paper.</p>
    <p>Many thanks to Eyal Rozenberg for numerous useful comments.</p>
    <p>Many thanks to Vyacheslav Napadovsky for helping with the assembly and the optimizations.</p>
    <p>Thanks also to Marc Glisse for pointing me to the forbidden copy elision cases,
    and to Thomas Köppe, Peter Dimov, Richard Smith, Gabriel Dos Reis, and Anton Bikineev for
    participating in early discussions and thus helping me to assemble my thoughts.</p>


    <h2>X. Feature-testing macro</h2>
	<p>For the purposes of SG10, we recommend the feature-testing macro name <code>__cpp_subobject_copy_elision</code>.</p>
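    <p>A sketch of how user code might consult such a macro, assuming it is defined when the feature is available. The paper recommends only the name, so the check below uses <code>#ifdef</code> and assumes no particular value. The function <code>label</code> is a hypothetical example.</p>

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical function returning a subobject of a local pair.
std::string label() {
    std::pair<std::string, int> entry{"label", 1};
#ifdef __cpp_subobject_copy_elision
    // Feature reported as available: a plain return leaves the
    // implementation free to elide the copy of the subobject.
    return entry.first;
#else
    // Macro not defined: fall back to an explicit move to avoid
    // a copy of the string.
    return std::move(entry.first);
#endif
}
```

    <p>Both branches return the same value; the macro only selects how the subobject is handed to the caller.</p>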

    <script type="text/javascript">
        function colorize_texts(texts) {
        for (var i = 0; i < texts.length; ++i) {
            var text = texts[i].innerHTML;
            text = text.replace(/namespace|sizeof|long|enum|void|constexpr|extern|noexcept|bool|template|class |struct|auto|const|typename|explicit|public|private|operator|#include|inline| char|typedef|static_assert|static_cast|static/g,"<span class='cppkeyword'>$&<\/span>");
            text = text.replace(/\/\/[\s\S]+?\n/g,"<span class='cppcomment'>$&<\/span>");
            //text = text.replace(/\"[\s\S]+?\"/g,"<span class='cpptext'>$&<\/span>");
            texts[i].innerHTML = text;
        }
        }
/*
        colorize_texts(document.getElementsByTagName("pre"));
        colorize_texts(document.getElementsByTagName("code"));
*/


        function colorize_asm(texts) {
        for (var i = 0; i < texts.length; ++i) {
            var text = texts[i].innerHTML;
            text = text.replace(/operator new|std::__throw_length_error|std::__throw_logic_error|lock|operator delete|div/g,"<span class='asmcostly'>$&<\/span>");
            text = text.replace(/jmp|ja|je|mov[a-z]* |call|lea|cmp|rep|push|sub|add|ret|xor|pop|test|jne|js/g,"<span class='cppkeyword'>$&<\/span>");
            text = text.replace(/\/\/[\s\S]+?\n/g,"<span class='cppcomment'>$&<\/span>");
            //text = text.replace(/\"[\s\S]+?\"/g,"<span class='cpptext'>$&<\/span>");
            texts[i].innerHTML = text;
        }
        }

        colorize_asm(document.getElementsByTagName("td"));

        function show_hide_deleted() {
        var to_change = document.getElementsByClassName('changed-deleted');
        for (var i = 0; i < to_change.length; ++i) {
            to_change[i].style.display = (document.getElementById("show_deletions").checked ? 'block' : 'none');
        }
        }
        show_hide_deleted()
    </script>
</body></html>


