<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- ===================================================================== -->
<!--  File:       IntegerRepresentation.html                               -->
<!--  Author:     J. Kanze                                                 -->
<!--  Date:       14/05/2008                                               -->
<!--      Copyright (c) 2008 James Kanze                                   -->
<!-- _____________________________________________________________________ -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"/>
  <meta http-equiv="content-language" content="en"/>
  <meta name="author" content="J. Kanze"/>
  <meta name="date" content="2008-05-14T11:45:16MEST"/>
  <meta name="generator" content="vim"/>
  <title>Resolving the difference between C and C++ with regards to object representation of integers.</title>
  <style type="text/css">
    <!--
    .added { color:darkcyan; text-decoration:underline; }
    .fromc { color:blue; }
    .comment { color:black; font-style:italic; font-size:smaller;
      margin-right:5%; margin-left:5%; }
    .removed { color:red; text-decoration:line-through; }
    .clause { padding-left:5em; }
    .hlabel { padding-right:1em; text-align:right; }
    -->
  </style>
</head>
<body>
<h1>Resolving the difference between C and C++ with regards to object representation of integers.</h1>
<p>
<table rules="none" style="margin-left:60%">
  <tr><td class="hlabel">Doc. no.:</td><td>N2631=08-0141</td></tr>
  <tr><td class="hlabel">Date:</td><td>2008-05-14</td></tr>
  <tr><td class="hlabel">Author:</td><td>James Kanze</td></tr>
  <tr><td class="hlabel">email:</td><td><a href="mailto:james.kanze@gmail.com">james.kanze@gmail.com</a></td></tr>
</table>
</p>
<hr/>
<h2>Introduction</h2>
<p>
  In recent discussions in <tt>comp.lang.c++</tt>, it became clear that
  C and C++ have different requirements concerning the object
  representation of integers, and that at least one real implementation
  of C does not meet the C++ requirements.  The purpose of this paper is
  to suggest wording to align the C++ standard with C.
</p>
<p>
  It should be noted that the issue only concerns some fairly
  &ldquo;exotic&rdquo; hardware.  In this regard, it raises a somewhat
  larger issue: how far do we want to go in supporting exotic hardware?
  (In the discussions in the newsgroup, one noted expert expressed the
  opinion that we could even go so far as to require two's complement.)
  My personal opinion is that, at least with regard to integer types, we
  should remain 100% C compatible, and simply follow C.  That is,
  however, only a personal opinion.  (Perhaps some sort of vote would
  be in order, expressing the direction we want to take.)
</p>
<h2>The Problem</h2>
<p>
  The requirements concerning the object representation of integers are
  not the same in the current draft (and in all previous versions) of
  the C++ standard and in the C99 standard.  In [basic.types]/4, the C++
  standard says:
  <blockquote>
    [...]The value representation of an object is the set of bits that
    hold the value of type T.[...]
  </blockquote>
  and in [basic.fundamental]/3:
  <blockquote>
    For each of the standard signed integer types, there exists a
    corresponding (but different) standard unsigned integer type:
    [...]the value representation of each corresponding signed/unsigned
    type shall be the same.[...]
  </blockquote>
  This implies, indirectly, that the maximum value of an unsigned
  integral type must be greater than that of a signed integral type (or
  more precisely, that the sign bit in a signed integral type must
  participate in the value representation of the corresponding unsigned
  type&mdash;other constraints mean that it must, in fact, be the most
  significant bit).
  C90 didn't have this restriction, and C99 explicitly says
  (&sect;6.2.6.2):
  <blockquote>
    [...] For signed integer types, the bits of the object
    representation shall be divided into three groups: value bits,
    padding bits, and the sign bit.  [In other words, unlike C++, the
    sign bit is <em>not</em> part of the value representation.] [...](if there
    are M value bits in the signed type and N in the unsigned type, then
    M &le; N).
  </blockquote>
  In other words, in C, given an architecture which doesn't support
  unsigned arithmetic, the implementation can fake it by simply masking
  out the sign bit of a signed int.  Concretely, the C implementation
  for the Unisys MCP processors makes use of this.  From the Unisys C
  manual:
  <blockquote>
    <table>
      <tr>
        <th colspan="4">Range of Data Types</th>
      </tr>
      <tr>
        <th>Type</th>
        <th>Bits</th>
        <th><tt>sizeof</tt></th>
        <th>Range</th>
      </tr>
      <tr>
        <td><tt>char</tt></td>
        <td align=center>8</td>
        <td align=center>1</td>
        <td align=center>0 to 255</td>
      </tr>
      <tr>
        <td colspan="4">[...]</td>
      </tr>
      <tr>
        <td><tt>int</tt></td>
        <td align=center>48</td>
        <td align=center>6</td>
        <td align=center>1&minus;2**39 to 2**39&minus;1</td>
      </tr>
      <tr>
        <td><tt>signed int</tt></td>
        <td align=center>48</td>
        <td align=center>6</td>
        <td align=center>1&minus;2**39 to 2**39&minus;1</td>
      </tr>
      <tr>
        <td><tt>unsigned int</tt></td>
        <td align=center>48</td>
        <td align=center>6</td>
        <td align=center>0 to 2**39&minus;1</td>
      </tr>
    </table>
  </blockquote>
  I suspect that this difference is unintentional, and that it was never
  the intent of the C++ committee to be incompatible with C here, but as
  it stands, there is an incompatibility, and it affects at least one
  architecture currently being sold.
</p>
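<p>
  As a rough illustration of the difference (a sketch only, not part of
  the proposed wording):
  <tt>std::numeric_limits&lt;T&gt;::digits</tt> gives the number of
  non-sign value bits of an integer type, so the C rule
  <i>M</i>&nbsp;&le;&nbsp;<i>N</i> can be checked at compile time.  The
  names <tt>signed_value_bits</tt>, <tt>unsigned_value_bits</tt> and
  <tt>sign_bit_is_unsigned_value_bit</tt> are invented for this example.
</p>

```cpp
#include <limits>

// Number of value bits (excluding sign and padding bits) in int and in
// its corresponding unsigned type.
constexpr int signed_value_bits   = std::numeric_limits<int>::digits;
constexpr int unsigned_value_bits = std::numeric_limits<unsigned int>::digits;

// The C99 rule M <= N: this is expected to hold on every implementation.
static_assert(signed_value_bits <= unsigned_value_bits,
              "unsigned int must have at least as many value bits as int");

// True when the bit serving as the sign bit of int takes part in the
// value of unsigned int, which is what the current C++ wording implies.
constexpr bool sign_bit_is_unsigned_value_bit() {
    return unsigned_value_bits > signed_value_bits;
}
```

<p>
  On mainstream implementations <tt>sign_bit_is_unsigned_value_bit()</tt>
  yields <tt>true</tt>.  On an implementation with the ranges tabulated
  above, and assuming <tt>numeric_limits</tt> reflects that table, it
  would presumably yield <tt>false</tt>, since both types would have 39
  value bits.
</p>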
<h2>Proposed solution</h2>
<p>
  If C compatibility is desired, it seems to me that the simplest and
  surest way of attaining this is by incorporating the exact words from
  the C standard, in place of the current wording.  I thus propose that
  we adopt the wording from the C standard, as follows (text taken
  verbatim from the C standard is in blue; text that has been modified
  or added is dark cyan and underlined; inline comments concerning the
  text&mdash;which aren't meant to be incorporated into the text of the
  standard&mdash;are black and in italics):
</p>
<p>
  In [basic.types], after paragraph 4, add the following paragraph:
  <blockquote class="fromc">
    Certain object representations need not represent a value of the
    object type. If the stored value of an object has such a
    representation and is read by an lvalue expression that does not
    have character type, the behavior is undefined. If such a
    representation is produced by a side effect that modifies all or any
    part of the object by an lvalue expression that does not have
    character type, the behavior is undefined.  Such a representation is
    called a <i>trap representation</i>.
  </blockquote>
</p>
<p>
  In [basic.fundamental], replace paragraphs 1&ndash;4 with:
</p>
<blockquote class="fromc">
<p>
  An object declared as type <tt>char</tt> is large enough to store
  any member of the basic execution character set. If a member of the
  basic execution character set is stored in a <tt>char</tt> object,
  its value is guaranteed to be positive; in addition, the integral
  value of that character object is equal to the value of the single
  character literal form of that character.  If any other character is
  stored in a char object, the resulting value is
  implementation-defined but shall be within the range of values that
  can be represented in that type.
</p>
<p>
  There are five <i>standard signed integer types</i>, designated as
  <tt>signed char</tt>, <tt>short int</tt>, <tt>int</tt>, <tt>long
  int</tt>, and <tt>long long int</tt>. (These and other types may
  be designated in several additional ways, as described in [cstdint].)
  There may also be implementation-defined <i>extended signed integer
  types</i> [<i>Note:</i> Implementation-defined keywords shall have
  the form of an identifier reserved for any use as described in
  [global.names] &mdash;<i>end note</i>]. The standard and extended
  signed integer types are collectively called <i>signed integer
  types</i>.  [<i>Note:</i> Therefore, any statement in this
  Standard about signed integer types also applies to the extended
  signed integer types.  &mdash;<i>end note</i>]
</p>
<p>
  An object declared as type signed char occupies the same amount of
  storage as a &ldquo;plain&rdquo; char object. A &ldquo;plain&rdquo;
  int object has the natural size suggested by the architecture of the
  execution environment (large enough to contain any value in the
  range <tt>INT_MIN</tt> to <tt>INT_MAX</tt> as defined in the header
  <tt>&lt;climits&gt;</tt>).
</p>
<p>
  For each of the signed integer types, there is a corresponding (but
  different) unsigned integer type (designated with the keyword
  <tt>unsigned</tt>) that uses the same amount of storage (including
  sign information) and has the same alignment requirements. The
  unsigned integer types that correspond to the standard signed
  integer types are the <i>standard unsigned integer types</i>. The
  unsigned integer types that correspond to the extended signed
  integer types are the <i>extended unsigned integer types</i>.  The
  standard and extended unsigned integer types are collectively called
  <i>unsigned integer types</i>. [<i>Note:</i> Therefore, any
  statement in this Standard about unsigned integer types also applies
  to the extended unsigned integer types. &mdash;<i>end note</i>]
</p>
<p>
  The standard signed integer types and standard unsigned integer
  types are collectively called the <i>standard integer types</i>, the
  extended signed integer types and extended unsigned integer types
  are collectively called the <i>extended integer types</i>.
</p>
<p class="added">
  The standard types <tt>char</tt>, <tt>unsigned char</tt> and
  <tt>signed char</tt> are collectively called <i>standard character
  types</i>.  They shall not contain padding bits; all bits must
  participate in the value representation.
</p>
<p>
  For any two integer types with the same signedness and different
  integer conversion rank (see [conv.rank]), the range of values of
  the type with smaller integer conversion rank is a subrange of the
  values of the other type.
</p>
<p>
  The range of nonnegative values of a signed integer type is a
  subrange of the corresponding unsigned integer type, and the
  representation of the same value in each type is the
  same. [<i>Note:</i> The same representation and alignment
  requirements are meant to imply interchangeability as arguments to
  functions, return values from functions, and members of unions.
  &mdash;<i>end note</i>] A computation involving unsigned operands
  can never overflow, because a result that cannot be represented by
  the resulting unsigned integer type is reduced modulo the number
  that is one greater than the largest value that can be represented
  by the resulting type.
</p>
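<p class="comment">
  A small sketch of the modulo rule, not intended for the text of the
  standard.  The names <tt>wrapping_add</tt> and <tt>kMax</tt> are
  invented for the example.
</p>

```cpp
#include <limits>

// Adds two unsigned values.  Under the rule above the result is reduced
// modulo (largest representable value + 1) instead of overflowing.
constexpr unsigned int wrapping_add(unsigned int a, unsigned int b) {
    return a + b;
}

// Largest value representable in unsigned int (2^N - 1 for N value bits).
constexpr unsigned int kMax = std::numeric_limits<unsigned int>::max();

// (2^N - 1) + 1 is reduced modulo 2^N to 0.
static_assert(wrapping_add(kMax, 1u) == 0u, "wraps to zero");
```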
<p>
  For unsigned integer types <span class="added">(and plain
  <tt>char</tt>, if it takes on the same values as an <tt>unsigned
  char</tt>)</span>, the bits of the object representation shall
  be divided into two groups: value bits and padding bits (there need
  not be any of the latter, <span class="added">and shall not be in the case of <tt>unsigned
  char</tt> and <tt>char</tt></span>). If there are <i>N</i> value bits,
  each bit shall represent a different power of 2 between 1 and
  2<sup><small><i>N</i>&minus;1</small></sup>, so that objects of that
  type shall be capable of representing values from 0 to
  2<sup><small><i>N</i></small></sup>&minus; 1 using a pure binary
  representation; this shall be known as the value representation. The
  values of any padding bits are unspecified. [<i>Note:</i> Some
  combinations of padding bits might generate trap representations,
  for example, if one padding bit is a parity bit. Regardless, no
  arithmetic operation on valid values can generate a trap
  representation other than as part of an exceptional condition such
  as an overflow, and this cannot occur with unsigned types. All other
  combinations of padding bits are alternative object representations
  of the value specified by the value bits. &mdash;<i>end note</i>]
</p>
<p>
  For signed integer types (and plain <tt>char</tt>, if it takes on
  the same values as a <tt>signed char</tt>), the bits of the object
  representation shall be divided into three groups: value bits,
  padding bits, and the sign bit. There need not be any padding bits
  (and in the case of <tt>char</tt>, if it is signed, there shall not
  be); there shall be exactly one sign bit.  Each bit that is a value
  bit shall have the same value as the same bit in the object
  representation of the corresponding unsigned type (if there are
  <i>M</i> value bits in the signed type and <i>N</i> in the unsigned
  type, then <i>M</i>&le;<i>N</i>).  If the sign bit is zero, it shall
  not affect the resulting value. If the sign bit is one, the value
  shall be modified in one of the following ways:
  <ul>
    <li>
      the corresponding value with sign bit 0 is negated (sign and
      magnitude);
    </li>
    <li>
      the sign bit has the value
      &minus;(2<sup><small><i>N</i></small></sup>) (two's complement);
    </li>
    <li>
      the sign bit has the value
      &minus;(2<sup><small><i>N</i></small></sup> &minus; 1) (one's
      complement).
    </li>
  </ul>
  Which of these applies is implementation-defined, as is whether the
  value with sign bit 1 and all value bits zero (for the first two),
  or with sign bit and all value bits 1 (for one's complement), is a
  trap representation or a normal value. In the case of sign and
  magnitude and one's complement, if this representation is a normal
  value it is called a <i>negative zero</i>.
</p>
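<p class="comment">
  The three interpretations of the sign bit can be sketched as follows.
  This is an illustration only: the names <tt>SignRepresentation</tt>
  and <tt>decode</tt> are invented, and the exponent in the list above
  is taken to be the number of value bits of the signed type, which is
  an assumption about the intended reading.  The function needs C++14 or
  later because of the multi-statement <tt>constexpr</tt> body.
</p>

```cpp
#include <cstdint>

// Hypothetical names, used only for this illustration.
enum class SignRepresentation { SignAndMagnitude, TwosComplement, OnesComplement };

// Computes the value of a signed integer that has `value_bits` value bits
// holding the pattern `magnitude_bits`, plus one sign bit `sign_bit`.
// Assumes 1 <= value_bits <= 62 and magnitude_bits < 2^value_bits, so
// every intermediate result fits in std::int64_t.
constexpr std::int64_t decode(SignRepresentation rep, int value_bits,
                              std::uint64_t magnitude_bits, bool sign_bit) {
    const std::int64_t v = static_cast<std::int64_t>(magnitude_bits);
    const std::int64_t two_pow = std::int64_t{1} << value_bits;  // 2^value_bits
    if (!sign_bit) {
        return v;  // a zero sign bit does not affect the value
    }
    switch (rep) {
        case SignRepresentation::SignAndMagnitude:
            return -v;                   // value with sign bit 0, negated
        case SignRepresentation::TwosComplement:
            return v - two_pow;          // sign bit contributes -(2^value_bits)
        case SignRepresentation::OnesComplement:
            return v - (two_pow - 1);    // sign bit contributes -(2^value_bits - 1)
    }
    return v;  // not reached for valid enumerators
}
```

<p class="comment">
  With seven value bits, the pattern with sign bit 1 and all value bits
  1 decodes to &minus;127, &minus;1 and 0 respectively; the last of
  these is the one's complement pattern that is either a negative zero
  or a trap representation.
</p>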
<p>
  If the implementation supports negative zeros, they shall be
  generated only by:
  <ul>
    <li>
      the <tt>&amp;</tt>, <tt>|</tt>, <tt>^</tt>, <tt>~</tt>,
      <tt>&lt;&lt;</tt> and <tt>&gt;&gt;</tt> operators with arguments
      that produce such a value;
    </li>
    <li>
      the <tt>+</tt>, <tt>-</tt>, <tt>*</tt>, <tt>/</tt>, and
      <tt>%</tt> operators where one argument is a negative zero and
      the result is zero;
    </li>
    <li>
      compound assignment operators based on the above cases.
    </li>
  </ul>
  It is unspecified whether these cases actually generate a negative
  zero or a normal zero, and whether a negative zero becomes a normal
  zero when stored in an object.
</p>
<p>
  If the implementation does not support negative zeros, the behavior
  of the <tt>&amp;</tt>, <tt>|</tt>, <tt>^</tt>, <tt>~</tt>,
  <tt>&lt;&lt;</tt>, and <tt>&gt;&gt;</tt> operators with arguments
  that would produce such a value is undefined.
</p>
<p>
  The values of any padding bits are unspecified. [<i>Note:</i> Some
  combinations of padding bits might generate trap representations,
  for example, if one padding bit is a parity bit. Regardless, no
  arithmetic operation on valid values can generate a trap
  representation other than as part of an exceptional condition such
  as an overflow. All other combinations of padding bits are
  alternative object representations of the value specified by the
  value bits. &mdash;<i>end note</i>]  A valid (non-trap) object
  representation of a signed integer type where the sign bit is zero
  is a valid object representation of the corresponding unsigned type,
  and shall represent the same value.
</p>
<p class="added">
  The types <tt>unsigned char</tt> and <tt>char</tt> may be used for
  &ldquo;bitwise&rdquo; copy. [<i>Note:</i> this means that if
  <tt>signed char</tt> has a negative zero which is either a trapping
  value, or will be forced to positive zero on assignment, plain
  <tt>char</tt> must be unsigned. &mdash;<i>end note</i>]
</p> 
<p class="comment">
  I'm not sure that this was ever really clearly
  specified, but it does seem common practice in the C++ community to
  do bitwise copies through <tt>char*</tt>.  Of the two architectures
  with other than 2's complement that I'm aware of, both make plain
  char unsigned, presumably to allow this to work.  (Of course,
  logically, plain char should always be unsigned, but historical
  reasons mean that implementors cannot always be logical.)
</p>
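<p class="comment">
  The kind of &ldquo;bitwise&rdquo; copy meant here looks roughly like
  the following sketch.  The names <tt>bytewise_copy</tt> and
  <tt>self_check</tt> are invented for the example, and the code assumes
  a trivially copyable type; it uses <tt>unsigned char</tt>, which is
  the uncontroversial case.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Copies the object representation of `src` into `dst` one byte at a
// time, reading and writing only through unsigned char lvalues.
template <typename T>
void bytewise_copy(T& dst, const T& src) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte-wise copy is only meaningful for trivially copyable types");
    unsigned char* d = reinterpret_cast<unsigned char*>(&dst);
    const unsigned char* s = reinterpret_cast<const unsigned char*>(&src);
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        d[i] = s[i];
    }
}

// Small self-check: after a byte-wise copy of an int, the destination
// compares equal to the source.
inline void self_check() {
    int src = -12345;
    int dst = 0;
    bytewise_copy(dst, src);
    assert(dst == src);
}
```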
<p>
  The <i>precision</i> of an integer type is the number of bits it
  uses to represent values, excluding any sign and padding bits. The
  <i>width</i> of an integer type is the same but including any sign
  bit; thus for unsigned integer types the two values are the same,
  while for signed integer types the width is one greater than the
  precision.
</p>
</blockquote>
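<p>
  The definitions of precision and width at the end of the proposed
  wording map fairly directly onto <tt>std::numeric_limits</tt>.  The
  following is a sketch under that assumption; the function names
  <tt>precision</tt>, <tt>width</tt> and <tt>padding_bits</tt> are
  invented for the example, and the templates are only meant for integer
  types other than <tt>bool</tt>.
</p>

```cpp
#include <climits>
#include <limits>

// Precision: number of bits used to represent values, excluding any
// sign and padding bits.
template <typename T>
constexpr int precision() {
    return std::numeric_limits<T>::digits;
}

// Width: the precision plus the sign bit, if the type has one.
template <typename T>
constexpr int width() {
    return precision<T>() + (std::numeric_limits<T>::is_signed ? 1 : 0);
}

// Padding bits: whatever remains of the object representation.
template <typename T>
constexpr int padding_bits() {
    return static_cast<int>(sizeof(T) * CHAR_BIT) - width<T>();
}

// unsigned char is required to have no padding bits.
static_assert(padding_bits<unsigned char>() == 0,
              "unsigned char has no padding bits");
```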

<h2>Open points</h2>
<p>
  The C standard has several passages (e.g. &sect;6.2.6.1/6 and 7) which
  basically state that you cannot count on the contents of padding
  bytes in a struct or a union, but that they will never cause a
  trapping representation.  I don't think this is necessary per se in
  C++, since assignment and initialization are member-wise in C++,
  rather than byte-wise as in C.  On the other hand, perhaps we need
  something somewhere to say that the compiler-generated copy operations
  may change the values of these bytes (but not in such a way as to
  create a trapping representation if there wasn't one there
  previously), and that after initialization, it is guaranteed that
  there is not a trapping representation.
</p>
</body>
</html>
<!-- Local Variables:               === for emacs -->
<!-- mode: html                     === for emacs -->
<!-- tab-width: 8                   === for emacs -->
<!-- End:                           === for emacs -->
<!-- vim: set ts=8 filetype=html:   === for vim   -->

