Programmers mostly fall into one of three categories in their understanding of floating point math: There are some who don’t know enough about it to recognize that its results are not completely reliable; there are some who know just enough about it to think that its results are never reliable; and there are a few who understand it thoroughly and know exactly how reliable it is. As Donald Knuth puts it:1
There’s a credibility gap: We don’t know how much of the computer’s answers to believe. Novice computer users solve this problem by implicitly trusting in the computer as an infallible authority; they tend to believe that all digits of a printed answer are significant. Disillusioned computer users have just the opposite approach; they are constantly afraid that their answers are almost meaningless.
Here in The Journeyman’s Shop we try to fit ourselves into yet another category: those who know enough about floating point math to understand and avoid the most common pitfalls, and to recognize the times when our knowledge isn’t sufficient to produce reliable results, the times when we need help from an expert.
I’m not an expert on floating point math. On a good day I can tell
you what `ulp’ means2, but if someone
starts explaining how to get an additional half an ulp in the
atanh(x) when x is close to 1 I’ll nod and
try to keep my eyes open while imagining that I’m on a beach soaking up
the sunshine. Don’t misunderstand me: I admire people who know enough to
be able to do this sort of analysis. But the intimate details of
producing maximally accurate results are not something that I need to
know for the work that I do, and they’re complex enough that they aren’t
something that I want to dig into for recreation3. For most of us it simply isn’t worthwhile to
spend the time needed to truly understand floating point math.
Nevertheless, we need to know something about it in order to use it with
a reasonable likelihood of success. From Knuth, once again:4
[E]very well-rounded programmer ought to have a knowledge of what goes on during the elementary steps of floating point arithmetic. This subject is not at all as trivial as most people think, and it involves a surprising amount of interesting information.
This month we’ll begin by looking at examples of some of the ways that floating point math can go astray, and then, in the interest of well-roundedness, we’ll look at the implementation of a class that supports floating point math. From time to time in future columns we’ll come back to that class to improve our understanding of some of the subtleties of floating point math.
Beginners are often surprised by the behavior of code like this:
for (float index = 0.0; index != 1.0; index += 0.1) printf("%f\n", index);
They expect it to loop ten times and then stop. It doesn’t, of course; it loops forever, or until they run out of patience and kill the program. And if they’ve thought to capture the output in a file, they see that it displayed exactly the values that they expected:
0.000000 0.100000 0.200000 0.300000 0.400000 0.500000 0.600000 0.700000 0.800000 0.900000 1.000000 1.100000 ...
Naturally, they want to know why the loop didn’t stop when index reached 1.0. Most of us know the answer: The increment value, 0.1, cannot be represented exactly in a binary floating point value, so each time through the loop the value of index increases by an amount that’s close to but not equal to 0.1. On the tenth time through the loop its value isn’t exactly 1.0, so the test for equality fails. The way to "fix" this is to change the loop condition from index != 1.0 to index < 1.0. Of course, that’s only half the answer: printf seems to think that the value of index the tenth time through the loop is exactly 1.0.
We all know that when we divide two positive integers the result is truncated: the result of the division is the largest integer that’s less than or equal to the result that we’d get if we used real numbers instead of integers. So nobody except a rank beginner is surprised that one divided by three is zero. Integers represent only the integral part of a value, and any fractional part is simply discarded.
The same sort of thing happens in floating point math. The discarding is more subtle, so we often don’t notice that it is happening. But consider the same division performed with floating point math: one divided by three is 0.333333, with however many threes are supported by the floating point package that is being used. We’re used to thinking of this sort of number as a shorthand representation for an infinitely repeating decimal, so it doesn’t surprise us when we multiply that result by three and get back our original value of one. Try it on your calculator.
Despite what the calculator shows, the result of the multiplication is not one. The calculator is rounding the result to the number of digits are being displayed, and when you round 0.999999 to a shorter representation it comes out as 1.0. If that’s what your calculator is showing you, try increasing the number of digits that it displays to the right of the decimal point. If I set my HP-48GX to display ten digits to the right of the decimal point it shows the result as 1.0000000000E0. If I increase it to 11 digits it shows as 9.99999999999E-1. If your calculator won’t display enough digits to show that the result really isn’t one, try subtracting one from the result.
Once you get the rounding out of the way, it’s clear that the result is actually 0.999999, with however many nines are supported by the floating point package in the calculator. And if we remember that the result of the original division is not an infinite sequence of threes, but a truncated sequence of threes, limited by the resolution of the calculator’s floating point math, then the result makes sense: (1.0/3.0)*3.0 is 0.999999, not 1.0. When you’re looking at the results of a floating point computation, don’t let rounding in the output fool you. The actual answer may be somewhat different from what is being displayed.5
To understand how floating point values are represented internally, it helps to think about what’s commonly known as "scientific notation". Very large and very small values are represented by showing the digits of their values, and a power of ten to multiply the digits by. For example, Avogadro’s number is 6.023*10^23. The convention here is to reduce the digits to a value that’s greater than or equal to one and less than ten, and to adjust the exponent accordingly. We could, if we wanted to, use 60.23*10^22 for this value, but that would raise eyebrows because it is not in the usual form.
In a floating point math package the same basic convention is applied. A number is represented as a set of digits and an exponent. The tricky part in understanding floating point representations is that the base for the digits and the exponent varies from system to system. Calculators typically use base 10, so conversion problems such as the internal representation of 0.1 that we looked at earlier rarely come up because the internal representation of the value is in the same form as the external one. Most floating point packages on computers, however, use some form of binary base, most often base 2. That is, the digits in the value must be converted to base 2, and the exponent must be converted from a power of 10 to the appropriate power of 2. Further, the usual convention for a floating point math package is that the digits are normalized so that their value is greater than 1/2 and less than 16. So the value 1.0*10^1 translated to a binary floating point value would be .101*2^4.
In order to be able to manipulate floating point values efficiently, a floating point package packs the parts of the value into a few bytes of memory. Intel chips, for example, support 32-bit, 64-bit, and 80-bit floating point values. These are stored in 4, 8, and 10 bytes, respectively, which makes them easy for the hardware to manipulate. Assignment between floating point values, for example, doesn’t require any knowledge of how the parts of the value are laid out in the memory block. Copying the block is all that’s needed. Arithmetic, on the other hand, requires pulling out the various parts of the value so that they can be manipulated separately. The details of the internal representation are implementation-specific -- the C and C++ standards do not impose requirements on how the parts of a floating point value should be laid out. You can get at those parts through the function frexp, but what you get from frexp does not necessarily have exactly the same bit patterns as the internal representation of the parts.
For the 32-bit representation that Intel uses, the high bit contains the sign of the value. A 0 means that the value is positive, and a 1 means that the value is negative. The low 23 bits contain the fractional part, that is, what I’ve been calling the "digits" in the previous paragraphs. There’s a quirk here, though, that can fool you. Remember that the digits are normalized, so that their value is always between ½ and 1. That is, the high bit in the digits will always be 1. In Intel’s internal representation this bit is not stored. The floating point package puts it back in when it’s needed in a computation. So the 23-bit fractional part actually gives 24 bits of precision. One more thing to keep in mind: the bits in the fractional part are the same for a positive value and its negative counterpart. Only the sign bit changes7.
Finally, the exponent is represented in the remaining 8 bits in biased form: 127 is added to the actual value. So an exponent of 0 is represented in the 32-bit format as the value 127, an exponent of 1 is 128, and an exponent of -1 is 126. To convert an internal value back to an actual exponent subtract 127 from the internal value. Since the largest value that can be stored in an 8-bit field is 255, the largest possible exponent in this 32-bit representation is 128. The smallest possible exponent is -127. These exponents will only rarely show up in normal computations, however. An exponent of 128 is used to represent infinite values (both positive and negative), and invalid values, known as NaNs, which stands for Not a Number. An exponent of -127 with a fraction of 0 represents the value 0. All other values with an exponent of -127 are denormalized values. We’ll talk about Infinities, NaNs, and denormalized values some other time. For now, the possible values of the exponent range from -126 to 127, and an exponent of -127 for the values 0.0 and -0.0.
Listing 1 contains the class
definition for the class
fp, which provides basic floating
point operations. It separates the three components of a floating point
value in order to make it easier to see how these parts are manipulated.
is_neg indicates whether the
value is negative. The
unsigned char member
exp holds the exponent, stored in biased form with a bias
value of 127. The
frac holds the
fractional part containing 15 bits. It does not drop the high order bit,
in order to simplify the code a bit.
I’ve chosen these particular types in order to produce a floating
point implementation that resembles Intel’s, but can be written in
portable C++. For example, the language definition guarantees that an
unsigned char can hold values from 0 to 255, so the values of exponents
for the class
fp are limited to this range, even though
some C++ implementations might support a larger range of values in an
unsigned char. Similarly, the fraction is limited to 15 bits in order to
guarantee that the result of multiplying two fractions are will fit in a
long. The goal here is not efficient use of space, but clarity. Please
don’t write to me about ways of packing more bits into this class.
That’s not its purpose.
Multiplying two floating point values is fairly easy, although there are a couple of edge conditions that we have to be careful of. To see how it’s done, let’s go back to scientific notation and look at how to compute the square of Avogadro’s number:
6.023*10^23 6.023*10^23 --------------- 36.276529*10^46
To get this raw result we multiplied the two digit parts and added the two exponents. To put it back into the usual form, though, we have to do two things: we have to adjust the digit part so that its value is less than 10, and we have to get rid of the extra digits that we got on the end of the value. We only had four significant figures in our original value, and we should have four significant figures in the result. We begin by normalizing the result: we replace 36.276529*10^46 with 3.6276529*10^47. The we round the resulting value to four significant digits, producing 3.628*10^47. We usually do these latter two steps as part of the multiplication itself, so instead of writing out the multiplication as we did above we’d write it like this:
6.023*10^23 6.023*10^23 ----------- 3.628*10^47
If you look at the implementation of
operator *= in Listing 2 you can see how this technique
is applied in the class
fp. The code begins by checking for
zeros. If either value in the multiplication is zero the result is set
to zero, with the appropriate sign. Otherwise it computes the exponent
of the result by adding the exponents of the two factors and subtracting
the bias, and it computes the fractional part of the result by
multiplying the fractional parts of the two factors. When it computes
the exponent the code has to subtract the bias to compensate for the
fact that both exponents are biased. This could also be done by
subtracting the bias from each exponent, adding the results, and adding
the bias back in, but subtracting the bias is simpler. When it computes
the fractional part of the result the code uses the member function lf
which combines the fraction with the sign and converts the result to a
long. Multiplying the two resulting signed fractions produces a result
that fits in a long. The results of this multiplication and the
computation of the new exponent are passed to the function
normalize, which takes the raw result and converts it into
a properly normalized form, just as we did above in scientific
normalize is where we deal with various
edge conditions. The first two
if statements recognize zero
values and overflowed fractions. Those don’t occur in multiplication or
division. We’ll look at them when we talk about addition. The
while loop that follows them normalizes fraction values
whose high bit is zero. Again, think about scientific notation: .1 times
.1 is .01, which has a leading zero digit. The same thing occurs in
binary multiplication. We require that the high bit of the fraction be
1, so we multiply the fraction by 2 and reduce the exponent by 1 as many
times as are needed. The next
if statement checks whether
the highest bit in the part of the fractional value that we’re going to
discard is a 1. If so, it rounds the value up or down, according to its
sign. Rounding can turn a normalized value into a value that is not
normalized. Looking again at scientific notation, consider 9.9999*10^0.
If we round it to two figures to the right of the decimal point the
result is 10.00*10^0. To make the digits part of this value less than 10
we need to move the decimal point to the left by one position and
increase the exponent by 1, producing the value 1.00*10^1. The floating
point code does the same thing: if bit 31 is 1, the fractional part is
too large to represent in our internal notation, so the code divides it
by 2 and increases the exponent by 1. Now that the code has done all the
adjustments needed to get the result into normalized form, it checks
whether the resulting exponent is too large or too small. If so, it
calls the member function
exception, which handles the
problem by displaying an error message and exiting from the program. In
a later column we’ll look at less drastic (but not necessarily better)
ways of handling these problems.
Finally, after all these adjustments and checks, the code is ready to
store the new exponent and fractional part into the object. There’s one
last adjustment needed to the fractional part, though. We started out
with two 15-bit fractions, and multiplied them to produce a 30-bit
value. All of the normalization code has dealt with this 30-bit value.
Before we can store this value into an fp object we have to convert it
back into a 15-bit fraction. That’s what the right shift in the next to
last line of
With the code for normalizing results in place, division is quite
easy. It’s done with
operator /= in Listing 2. Just like multiplication, the
code begins by checking for zeros. A zero numerator produces a result of
zero8. If the numerator is not zero but
the denominator is, the code calls
exception to report
division by zero. If neither of the values is zero, the code computes
the new exponent and the new fractional part and passes them to
normalize for all of the necessary adjustments. The new exponent is the
difference between the two exponents, again adjusted for the bias. The
computation of the fractional part looks tricky, but it isn’t really. It
shifts the numerator left by 15 bits to produce a 30-bit value, then
divides that value by the denominator, producing a 15 bit quotient, then
shifts the quotient left by 15 bits to produce a 30-bit value to pass to
To add two floating point values we have to adjust the one that’s closer to zero so that they both have the same exponent. This means shifting the fractional part of the smaller value to the right and increasing its exponent appropriately. We do the same thing when we use scientific notation. To compute 1.0*10^1 + 1.0*10^0 (that is, 10 + 1), we adjust 1.0*10^0 by dividing its digits by 10 and increasing its exponent by 1 to produce 0.1*10^1. This is not in the usual form, but we’re willing to tolerate that during the addition, so long as the result we get is properly adjusted. Now we finish the addition by adding 1.0*10^1 and 0.1*10^1, producing the result 1.1*10^1.
The code in Listing 2 goes through two layers of functions to perform
this additional adjustment. The code in
compares the exponents of the two values being added to decide which
should be adjusted. It calls the function
add with the
value that has the larger exponent as the first argument and the value
that has the smaller exponent as the second argument.
add begins by converting the fraction from
the first argument into a 30-bit value. It then looks at the difference
between the exponents of the two arguments. If the second argument’s
exponent so small that the second argument won’t affect the result it
ignores the second argument. Otherwise it gets the fraction from the
second argument, shifts it left to produce a 30-bit value, shifts that
result right by the difference between the two exponents, and adds the
resulting value to the 30-bit value that it got from the first argument.
The resulting value is passed to
normalize, along with the
exponent for the result.
Since the sum of two non-zero numbers can be zero,
normalize checks to see whether the fraction that was
passed to it is zero. If so, it stores a zero fraction and a zero
exponent in the
fp object in order to properly represent a
value of zero. Otherwise, it checks whether bit 31 in the fraction is
on. If so, it divides the fraction by 2 and increases the exponent by 1
to remove the overflow. This occurs when two values that are fairly
close together are added. In scientific notation, consider 9.0*10^1 +
8.0*10^1. The result is 17.0*10^1, which must be adjusted to 1.7*10^2.
The same thing can occur with the fraction part of a floating point
value: adding two 15-bit values can produce a 31-bit value, which has to
be adjusted back to 30 bits in order for the succeeding code to work
After these two adjustments,
normalize goes through all
of the steps that we looked at earlier. Some of those steps aren’t
actually needed for addition, but checking for them doesn’t do any
With the code for adding two floating point values in place,
subtraction is easy. The code in
operator -= uses the
a - b == a +
(-b): it negates the second argument, and adds the result to the
If you want to understand how floating point math works you need to
experiment with it. The exploration that we’ve gone through in this
column provides a sound basis for analyzing why floating point math
produces the results that it does, but until you’ve seen the details in
actual computations it’s hard to have a sound grasp of how it works. You
can download the code for the class
fp from fp.zip. The archive also includes a
four-function command-line calculator that you can use to explore
floating point computations in more detail than we’ve gone through here.
Here are a few things that you might want to look into:
What is the smallest positive value that can be added to 1.0 to produce a value that is different from 1.0?
When you evaluate the expression ((1-x^2)/(1-x)) for values of x that are very close to 1, why do the results differ from the value of the expression (1+x) for the same values of x?
We looked at the expression (1.0/3.0)*3.0, and concluded that its
value should be slightly less than 1.0. If you use the class
fp to evaluate this expression you get exactly 1.0.
1. Knuth, Donald E., The Art of Computer Programming, vol. 2, Seminumerical Algorithms, 3rd edition, 1998, ISBN 0-201-89683-4, p. 229.
2. No, it’s not the sound of the loud swallow that you make when you suddenly realize that you have no idea what’s going on. It stands for "unit in the last place," and it’s used to describe the accuracy of a floating point result.
3. I’m also sure that their eyes would glaze over if they had to listen to an explanation of some of the things that I think are fascinating
4. Knuth, op. cit., p. 214.
5. Use the C function
frexp to see
what’s really going on.
6. More generally, the fractional part must have a value that is less than 1 and greater than or equal to 1/b, where b is the base of the floating point package. For a binary package this becomes ¨. For a decimal package it would be 1/10.
7. Yes, this means that 0.0 and -0.0 have different representations. We’ll look at the implications of negative zeros in a later column.
8. This code ought to check for a zero denominator
before deciding that the result is actually zero. We’ll look at how to
handle 0.0/0.0 in a later column, when we talk about
9. Except, of course, for speed.
10. Be careful about assuming that common
identities apply to floating point math. For example, if
a+b==a+c, it does not follow that
identity that we’re using for subtraction does work for floating point
Copyright © 2000-2006 by Pete Becker. All rights reserved.