Floating-Point Number System

Float number.

https://codingnest.com/the-little-things-comparing-floating-point-numbers/

First introduced through PMPP. Formally learned through CS370.

Representing a Real Number

Each positive number in $R$ can be represented in normalized form by $0. d_{1} d_{2} \dots d_{t} \times β^{p}$

where

$d_{k}$ are digits in base $β$ , that is, $0, 1, \dots, β - 1$ ;
‘normalized’ means $d_{1} \neq 0$ ;
exponent $p$ is an integer (positive, negative or zero).

The sequence of digits $0. d_{1} d_{2} d_{3} \dots$ is called the mantissa (also significand); the factor $β^{p}$ scales it by the exponent.
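For base $β = 2$, Python's `math.frexp` happens to return this same normalized form: a mantissa in $[0.5, 1)$ and an integer exponent. A minimal sketch (the helper name `normalized_form` is made up for illustration, and the type hint assumes Python 3.9+):

```python
import math

def normalized_form(x: float) -> tuple[float, int]:
    """Return (m, p) with x == m * 2**p and 0.5 <= |m| < 1 (for x != 0)."""
    return math.frexp(x)

m, p = normalized_form(6.0)
# 6 = 0.75 * 2**3, and 0.75 is 0.11 in binary, so d_1 = 1 as required
print(m, p)  # 0.75 3
```

The lower bound of 0.5 on the mantissa is exactly the condition $d_{1} \neq 0$ in base 2.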

Knowing this, we can define a Floating-Point Number System.

Floating Point Number System

Every floating point number system $F$ that we will consider can be characterized by four integer parameters: $F = \{β, t, L, U\}$

$β$ is the base (ex: 10)
$t$ is the number of digits in the mantissa (represents density / precision)
$L$ and $U$ are bounds on the exponent $p$ (represents extent / range)

Thus, the numbers in such a system are precisely those of the form $\pm 0. d_{1} d_{2} \dots d_{t} \times β^{p}$ for $L \leq p \leq U$ and $d_{1} \neq 0$, or $0$ (a very special floating point number).
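To make the definition concrete, here is a small sketch that lists every number in a toy system. The function name `enumerate_system` and the parameters $β = 2, t = 3, L = -1, U = 2$ are chosen purely for illustration; exact fractions are used so that no rounding sneaks in.

```python
from fractions import Fraction
from itertools import product

def enumerate_system(beta: int, t: int, L: int, U: int) -> list[Fraction]:
    """List every number in F = {beta, t, L, U}, in increasing order."""
    values = {Fraction(0)}  # zero is included as a special case
    for p in range(L, U + 1):
        for digits in product(range(beta), repeat=t):
            if digits[0] == 0:
                continue  # normalization requires d_1 != 0
            # mantissa = 0.d1 d2 ... dt in base beta
            mantissa = sum(Fraction(d, beta ** (k + 1)) for k, d in enumerate(digits))
            value = mantissa * Fraction(beta) ** p
            values.update((value, -value))
    return sorted(values)

toy = enumerate_system(beta=2, t=3, L=-1, U=2)
print(len(toy))                          # 33
print([str(v) for v in toy if v > 0])    # from 1/4 up to 7/2
```

The count matches $2 (β - 1) β^{t - 1} (U - L + 1) + 1$: two signs, $(β - 1) β^{t - 1}$ normalized mantissas, $U - L + 1$ exponents, plus zero.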

In practice, there are 2 common standardized floating point systems:

IEEE Single Precision (fp32)
IEEE Double Precision (fp64)

See IEEE Floating Point Standard.

IEEE single precision: $\{β = 2; t = 24; L = - 126; U = 127\}$
IEEE double precision: $\{β = 2; t = 53; L = - 1022; U = 1023\}$

There’s also FP16 (half-precision)

This is super confusing

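Go see the IEEE Standard, because it splits numbers differently from the form at the beginning of this note. IEEE uses a normalization convention in which the first non-zero digit lies to the left of the radix point ($1. d_{2} d_{3} \dots$) rather than to the right ($0. d_{1} d_{2} \dots$), so the exponent bounds $L$ and $U$ quoted above are shifted by one relative to this note's convention.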
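One way to see the two conventions side by side is `sys.float_info`, which reports exponent bounds for a mantissa to the right of the radix point (the convention at the top of this note). A sketch, assuming the platform's Python `float` is an IEEE double, which is true almost everywhere but not guaranteed by the language:

```python
import sys

info = sys.float_info
# Convention with the mantissa to the right of the radix point: 0.d1 d2 ... x radix**p
print(info.radix, info.mant_dig, info.min_exp, info.max_exp)

# Shift by one to get the IEEE-style exponent bounds (mantissa 1.d2 d3 ...)
ieee_L = info.min_exp - 1
ieee_U = info.max_exp - 1
print(ieee_L, ieee_U)
```

On an IEEE double platform the first line shows `2 53 -1021 1024` and the second `-1022 1023`, matching the double precision parameters listed above.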

Int vs Floats and Doubles

I was thinking of it like an integer, where the largest signed 32-bit int is $2^{31} - 1$ . But it’s actually more complicated for floats and doubles.

See IEEE Floating Point Standard for the 3 parts of the number (sign, exponent, mantissa)
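A quick sketch of why the integer intuition breaks down: a double reaches far beyond $2^{31} - 1$, but past $2^{53}$ it can no longer tell neighbouring integers apart. This assumes Python's `float` is an IEEE double.

```python
import sys

INT32_MAX = 2**31 - 1
print(INT32_MAX)           # largest 32-bit signed integer
print(sys.float_info.max)  # largest finite double, about 1.8e308

# Above 2**53, not every integer has its own double:
a = float(2**53)
b = float(2**53 + 1)
print(a == b)              # True: both round to the same double
```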

Properties of Floating Point Numbers

Relationship between Real and Floating Point Numbers

This is really important to understand.

In general, $fl(x) \neq x$ because most real numbers cannot be represented exactly. In order to compute $fl(x)$ , we must modify $x$ to become a valid representable floating point number, typically by eliminating some of its smaller digits.
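The classic example is $x = 0.1$, which has no finite binary expansion. A small sketch using exact fractions to compare $x$ with $fl(x)$ (the variable names are just for illustration):

```python
from fractions import Fraction

x = Fraction(1, 10)      # the real number 1/10, held exactly
fl_x = Fraction(0.1)     # exact value of the double nearest to 0.1

print(fl_x == x)         # False
print(fl_x - x)          # the tiny, nonzero difference
print(Fraction(0.5) == Fraction(1, 2))  # True: 1/2 is exactly representable
```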

One way to think about going from real to floating number is laying it out in slots.

Important

Floating Point Numbers are not uniformly spaced. This is due to the way they are designed (see the equation at the beginning of the page).

Numbers that are closer to zero, which have more negative exponents, are spaced more closely together. Conversely, numbers with larger magnitudes, characterized by more positive exponents, are spaced further apart. See Fixed-Point Number System for fixed spacing.

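You can see where the spacing comes from in the normalized form

$0. d_{1} d_{2} \dots d_{t} \times β^{p}$

With $t$ digits in the mantissa, adjacent numbers that share an exponent $p$ differ by one unit in the last digit, $β^{-t} \times β^{p} = β^{p - t}$ , so the gap grows with $p$ .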
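The non-uniform spacing can be checked directly with `math.ulp`, which returns the gap to the next representable double. A sketch, assuming Python 3.9+ (where `math.ulp` was added) and IEEE doubles:

```python
import math

# The gap between neighbouring doubles grows with the magnitude of x
for x in (2.0**-10, 1.0, 2.0**10, 2.0**52):
    print(x, math.ulp(x))

gap_at_1 = math.ulp(1.0)        # 2**-52
gap_at_1024 = math.ulp(1024.0)  # 2**-42, i.e. 1024 times wider
```

At $2^{52}$ the gap is already 1.0, so beyond that point doubles can only land on integers.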

The error introduced by converting a real number to a floating point number is called round-off error.

See Machine Epsilon.
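Round-off error is usually measured relative to $x$. A minimal sketch that computes $|fl(x) - x| / |x|$ exactly and compares it with $2^{-53}$, the bound I would expect for round-to-nearest IEEE doubles (the helper `relative_roundoff` and the sample values are illustrative, not from any course material):

```python
from fractions import Fraction

def relative_roundoff(x: Fraction) -> Fraction:
    """Exact relative error |fl(x) - x| / |x| when x is rounded to a double."""
    fl_x = Fraction(float(x))
    return abs(fl_x - x) / abs(x)

u = Fraction(1, 2**53)  # assumed bound for round-to-nearest doubles
for x in (Fraction(1, 10), Fraction(1, 3), Fraction(2, 7)):
    err = relative_roundoff(x)
    print(float(err), err <= u)
```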

🛠️ Steven Gong
