HP-UX Floating-Point Guide
Chapter 2 41
Floating-Point Principles and the IEEE Standard for Binary Floating-Point Arithmetic
Floating-Point Formats
Because of this granularity in floating-point representation, most real
numbers cannot be represented exactly. The result of an arithmetic
operation (including the operation of converting from a decimal string
into IEEE format) usually must be rounded to a nearby representable
number. (For information on rounding, see “Inexact Result (Rounding)”
on page 53.)
Even some simple fractions cannot be represented exactly. Consider the
fraction 1/3. The exact value of this expression would require an infinite
number of bits, because the value is an infinitely repeating fraction
(0.33333… in decimal, 0.010101… in binary). Many values that can be
represented exactly in a few decimal digits cannot be represented exactly
in binary: for example, 1/10, which in decimal is 0.1, is in binary
0.000110011001100… Because simple numbers like 1/3 and 1/10 cannot
be represented exactly, no binary floating-point operation can ever yield
these exact values.
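The effect of storing 1/10 inexactly can be observed with a few lines of C. The following is a minimal sketch, not code from this guide: the names repeated_sum_equals and demo_inexact_fractions are hypothetical, and the sketch assumes that double is the IEEE double-precision format. It prints the stored approximations of 0.1 and 1/3 with more digits than a double carries, and then checks that ten copies of 0.1 do not sum to exactly 1, whereas eight copies of 0.125 (a power of 2) do.

```c
#include <assert.h>
#include <stdio.h>

/* Adds `term` to a running sum `n` times and reports whether the
 * result compares equal to `expected`. The volatile qualifier forces
 * each partial sum to be rounded to double. */
int repeated_sum_equals(double term, int n, double expected)
{
    volatile double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += term;
    return sum == expected;
}

/* Hypothetical demo routine: prints the stored approximations and
 * checks the two sums described above. */
void demo_inexact_fractions(void)
{
    /* Request more digits than a double carries to expose the rounding. */
    printf("0.1 is stored as %.25f\n", 0.1);
    printf("1/3 is stored as %.25f\n", 1.0 / 3.0);

    /* 1/8 is a power of 2, so eight copies sum to exactly 1. */
    assert(repeated_sum_equals(0.125, 8, 1.0));

    /* 1/10 is rounded when stored, and the error surfaces in the sum. */
    assert(!repeated_sum_equals(0.1, 10, 1.0));
}
```

Calling demo_inexact_fractions() from a main function runs both checks. The digits printed beyond roughly the 17th significant digit depend on the C library's printf implementation.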
Although most real numbers cannot be represented exactly in
floating-point arithmetic, a great many can. Any integer with a
magnitude of at most 2^24 (16,777,216, or roughly 16 million) can be
represented exactly in any format, and any 32-bit integer can be
represented exactly in double-precision or quad-precision. Also, any
number that can be written as an integer over a power of 2, such as
0.1875 (3/16) or 27.375 (219/8), can be represented exactly, provided
that the numerator needs no more significant bits than the chosen
precision supplies (24 for single-precision) and the exponent is in range.
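These exactness claims can be checked directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes that float and double are the IEEE single-precision and double-precision formats. It tests whether a 32-bit integer survives a round trip through each format, and whether a decimal literal equals a given integer divided by a power of 2.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if n converts to float and back without change.
 * The comparison is done in 64 bits so that a rounded-up value
 * such as 2^31 cannot overflow. */
int int_is_exact_in_float(int32_t n)
{
    volatile float f = (float)n;      /* may round to 24 significant bits */
    return (int64_t)f == (int64_t)n;
}

/* Same round-trip check for double, which has 53 significant bits. */
int int_is_exact_in_double(int32_t n)
{
    volatile double d = (double)n;
    return (int64_t)d == (int64_t)n;
}

/* Returns 1 if x is exactly numerator / 2^k. Multiplying by a power
 * of 2 is exact (barring overflow or underflow), so no rounding can
 * hide a mismatch. Assumes 0 <= k <= 30. */
int equals_dyadic(double x, long numerator, int k)
{
    double scale = (double)(1L << k);
    return x * scale == (double)numerator;
}

/* Hypothetical demo routine exercising the cases named in the text. */
void demo_exact_values(void)
{
    assert(int_is_exact_in_float(16777216));    /* 2^24 fits */
    assert(!int_is_exact_in_float(16777217));   /* 2^24 + 1 needs 25 bits */
    assert(int_is_exact_in_double(INT32_MAX));
    assert(int_is_exact_in_double(INT32_MIN));
    assert(equals_dyadic(0.1875, 3, 4));        /* 3/16 */
    assert(equals_dyadic(27.375, 219, 3));      /* 219/8 */
}
```

Some integers above 2^24, such as 16777218, are still exact in single-precision because they are even; the guarantee for every integer ends at 2^24.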
Normalized and Denormalized Values
Values that are represented by a sign bit, a fraction, and an exponent
whose bits are not all zeros and not all ones are called normalized
values (also called normal values). The size of the exponent field, and
the fact that the value in the exponent field of a normalized value cannot
be 0, determine the smallest magnitude that can be represented in
normalized form. For single-precision numbers, the minimum (most
negative) exponent of a normalized value is −126 (that is, 1 − 127); for
double-precision numbers, it is −1022 (that is, 1 − 1023); for
quad-precision numbers, it is −16382 (that is, 1 − 16383). The smallest
normalized magnitudes are therefore 2^−126, 2^−1022, and 2^−16382,
respectively.
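The minimum exponents can be related to the limits that the standard C header <float.h> reports. The following sketch is an illustration under stated assumptions, not part of the guide: the names two_to_minus and demo_smallest_normalized are hypothetical, and float and double are assumed to be the IEEE single-precision and double-precision formats. Quad-precision is left out because the type that provides it (commonly long double) varies between platforms.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Returns 2^(-n) for n >= 0 by repeated halving. Halving a power of 2
 * is exact in binary floating-point until the result drops below the
 * smallest denormalized value. */
double two_to_minus(int n)
{
    volatile double v = 1.0;
    for (int i = 0; i < n; i++)
        v *= 0.5;
    return v;
}

/* Hypothetical demo routine: compares the exponents given in the text
 * with the limits from <float.h>. */
void demo_smallest_normalized(void)
{
    printf("FLT_MIN = %g\n", (double)FLT_MIN);
    printf("DBL_MIN = %g\n", DBL_MIN);

    /* The smallest normalized values are 2^-126 and 2^-1022. */
    assert(two_to_minus(126) == (double)FLT_MIN);
    assert(two_to_minus(1022) == DBL_MIN);

    /* <float.h> scales the significand to the range [0.5, 1), so its
     * minimum exponents are one greater than the IEEE values. */
    assert(FLT_MIN_EXP - 1 == -126);
    assert(DBL_MIN_EXP - 1 == -1022);
}
```

Note the off-by-one between FLT_MIN_EXP (−125) and the IEEE exponent (−126): both describe the same value, under different conventions for where the binary point sits.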
Denormalized values (also called subnormal values) fill in the gap on
the number line between the smallest-magnitude normalized value and
zero. They also allow floating-point values to satisfy the arithmetic rule
that, for finite x and y, x is equal to y if and only if x − y is equal to 0.
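A short C sketch can show why denormalized values are needed for this rule. It is a hedged illustration rather than code from this guide: the names is_subnormal and demo_gradual_underflow are hypothetical, double is assumed to be IEEE double-precision, and the program is assumed to run with denormalized values enabled (that is, not in a flush-to-zero mode). Two distinct normalized values near the bottom of the normalized range are subtracted; their difference is too small to be normalized, yet it is not zero.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Returns 1 if d is nonzero but smaller in magnitude than the
 * smallest normalized double. */
int is_subnormal(double d)
{
    double mag = d < 0.0 ? -d : d;
    return mag != 0.0 && mag < DBL_MIN;
}

/* Hypothetical demo routine. Assumes the default (gradual underflow)
 * floating-point environment. */
void demo_gradual_underflow(void)
{
    volatile double x = 1.5 * DBL_MIN;   /* normalized */
    volatile double y = DBL_MIN;         /* smallest normalized value */
    volatile double diff = x - y;        /* exactly 0.5 * DBL_MIN */

    printf("x - y = %g\n", diff);

    assert(x != y);                 /* the operands differ ...           */
    assert(diff != 0.0);            /* ... and so the difference is not 0 */
    assert(is_subnormal(diff));     /* it falls in the denormalized range */
    assert(diff == 0.5 * DBL_MIN);  /* and no rounding occurred           */
}
```

If denormalized results were flushed to zero instead, x − y would be 0 even though x and y are different, and the rule would fail.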