HP-UX Floating-Point Guide

Floating-Point Principles and the IEEE Standard for Binary Floating-Point Arithmetic

Floating-Point Formats

Because of this granularity in floating-point representation, most real
numbers cannot be represented exactly. The result of an arithmetic
operation (including the operation of converting from a decimal string
into IEEE format) usually must be rounded to a nearby representable
number. (For information on rounding, see “Inexact Result (Rounding)”
on page 53.)

Even some simple fractions cannot be represented exactly. Consider the
fraction 1/3. The exact value of this expression would require an infinite
number of bits, because the value is an infinitely repeating fraction
(0.33333… in decimal, 0.010101… in binary). Many values that can be
represented exactly in a few decimal digits cannot be represented exactly
in binary: for example, 1/10, which in decimal is 0.1, is in binary
0.000110011001100… Because simple numbers like 1/3 and 1/10 cannot
be represented exactly, no floating-point operation can ever yield these
exact values.
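As an illustration of the point above, the following Python sketch shows the value that is actually stored when the decimal string 0.1 is converted to double-precision. Python is used here only for convenience and is not part of the original guide; the example assumes that the Python float type is an IEEE double, as it is on common platforms, and the helper name stored_value is hypothetical.

```python
from decimal import Decimal
from fractions import Fraction


def stored_value(x: float) -> Fraction:
    """Return the exact rational value held by the double x (assumes IEEE double)."""
    return Fraction(x)


tenth = stored_value(0.1)

# The stored value is a ratio with a power-of-2 denominator, not 1/10.
print(tenth == Fraction(1, 10))      # False
print(tenth.denominator == 2 ** 55)  # True

# Decimal(0.1) prints every decimal digit of the stored double.
print(Decimal(0.1))

# Rounding in each conversion and in the addition shows up in comparisons.
print(0.1 + 0.2 == 0.3)              # False
```

The same experiment with 1/3 gives a similar result: the stored value is close to, but not equal to, Fraction(1, 3).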

Although most real numbers cannot be represented exactly in
floating-point arithmetic, a great many can. Any integer with a
magnitude of no more than 2^24 (16,777,216, or roughly 16 million) can be
represented exactly in any format, and any 32-bit integer can be
represented exactly in double-precision or quad-precision. Also, all
numbers representable as an integer over a power of 2, such as 0.1875
(3/16) or 27.375 (219/8), can be represented exactly if the integer
numerator needs no more significant bits than the chosen precision
provides and the value lies within the exponent range of the format.
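The claims above can be checked with a short Python sketch that rounds a value to single-precision and widens it back. This is only an illustration under the assumption that the struct format code "f" maps to IEEE single-precision and that the Python float type is an IEEE double; the helper name to_single is hypothetical.

```python
import struct


def to_single(x: float) -> float:
    """Round x to the nearest single-precision value and widen it back to double."""
    return struct.unpack("f", struct.pack("f", x))[0]


# The fractions from the text survive the trip through single-precision.
print(to_single(0.1875) == 0.1875)          # True
print(to_single(27.375) == 27.375)          # True

# 2**24 is exact in single-precision; 2**24 + 1 needs 25 significant bits.
print(to_single(16777216.0) == 16777216.0)  # True
print(to_single(16777217.0) == 16777217.0)  # False

# Every 32-bit integer fits in the 53-bit significand of a double.
print(int(float(2 ** 31 - 1)) == 2 ** 31 - 1)  # True
print(int(float(-2 ** 31)) == -2 ** 31)        # True
```

Quad-precision is not covered here because standard Python has no quad-precision type.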

Normalized and Denormalized Values

Values that are represented by a sign bit, a fraction, and an exponent
whose bits are not all zeros and not all ones are called normalized
values (also called normal values). The size of the exponent field, and
the fact that the value in the exponent field of a normalized value cannot
be 0, determine the smallest magnitude that can be represented in
normalized form. For single-precision numbers, the largest-magnitude
negative exponent is −126 (that is, 1 − 127); for double-precision
numbers, it is −1022 (that is, 1 − 1023); for quad-precision numbers, it is
−16382 (that is, 1 − 16383).
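To make the exponent limits above concrete, the following Python sketch extracts the raw exponent field of the smallest normalized single-precision and double-precision values. It is an illustration only, assuming IEEE formats behind the struct codes "f" and "d"; the helper names are hypothetical. Quad-precision is omitted because standard Python has no such type.

```python
import struct
import sys


def exponent_field_double(x: float) -> int:
    """Return the raw 11-bit exponent field of x stored as a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF


def exponent_field_single(x: float) -> int:
    """Return the raw 8-bit exponent field of x stored as a single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF


smallest_double = 2.0 ** -1022  # exponent 1 - 1023
smallest_single = 2.0 ** -126   # exponent 1 - 127

# The smallest normalized values carry the smallest nonzero exponent field.
print(exponent_field_double(smallest_double))  # 1
print(exponent_field_single(smallest_single))  # 1

# The value 1.0 has exponent 0, so its field holds the bias itself.
print(exponent_field_double(1.0))              # 1023
print(exponent_field_single(1.0))              # 127

print(smallest_double == sys.float_info.min)   # True
```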

Denormalized values (also called subnormal values) fill in the gap on
the number line between the smallest-magnitude normalized value and
zero. They also allow floating-point values to satisfy the arithmetic rule
that x is equal to y if and only if x − y is equal to 0.
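The rule above can be observed with two distinct values near the bottom of the normalized range, whose difference is too small to be normalized. The Python sketch below is an illustration under two assumptions: that the Python float type is an IEEE double, and that the platform does not flush denormalized results to zero. The helper name exponent_field_double is hypothetical.

```python
import struct
import sys


def exponent_field_double(x: float) -> int:
    """Return the raw 11-bit exponent field of x stored as a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF


y = sys.float_info.min  # smallest normalized double, 2**-1022
x = 1.5 * y             # a distinct, exactly representable value

diff = x - y            # 2**-1023, below the normalized range

print(x != y)                       # True
print(diff != 0.0)                  # True
print(diff == y / 2)                # True
print(exponent_field_double(diff))  # 0, which marks a denormalized value
```

Without denormalized values, the difference would have to be rounded to zero even though x and y are not equal.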