When Bad Things Happen to Good Numbers

The C/C++ Users Journal, October, 2000

One way to tell whether a piece of code was written by a beginner or by an experienced professional is to look at how it handles errors. Beginners often start by writing code that implements the desired algorithm, then, if they think of it, adding code to handle errors. The result can be a hodge-podge of ad hockeries, abandoning a computation when an input error produces noticeable symptoms, or attempting to patch a computation that has gone astray, rather than detecting input conditions early that will lead to problems later and taking appropriate steps to limit the damage before starting the real work. Here in The Journeyman’s Shop we’ve learned that error handling is often the most difficult part of implementing an algorithm. We’ve learned that spending time identifying potential sources of problems and deciding how to handle them before we start writing code reduces the time we spend implementing and debugging an algorithm. We’ve also learned that analyzing possible problems before coding can help us understand an algorithm better, sometimes producing insights that result in much simpler and cleaner code for the algorithm itself. We all enjoy writing code -- that’s the most obviously productive part of our job -- and we look forward to the feeling of satisfaction that we get when our code passes all of the rigorous tests that we’ve prepared for it. Experience teaches us that a furious rush toward this end can lead to disappointment, while a more deliberate pace, postponing completion until we’ve built a sound foundation, ultimately produces a more satisfying result.

Avoiding Numerical Errors

In numerical analysis this rule works overtime. The biggest difficulty is often not writing the code, but deciding what code to write. In an earlier column we looked at some of the sources of errors in floating point computations. A numerical analyst has to understand those sources of errors and use that understanding to design a computation so that those errors won’t invalidate the result. I’m not qualified to do that, so I won’t try to discuss it in detail. But I do want to give one example, to emphasize how easily numerical computations can go astray if we aren’t careful. Back in high school we all learned the closed form solution for the roots of a quadratic equation. Given an equation of the form


ax^2 + bx + c = 0

we can find its roots with the following formulae:


x1 = (-b + sqrt(b^2 - 4ac))/2a
x2 = (-b - sqrt(b^2 - 4ac))/2a

We also learned that if 4ac is greater than b^2 the equation has no real roots. Obviously, if we are going to write code to compute the roots of a quadratic equation we have to watch out for these cases. In Java we might be tempted to simply ignore the problem and let the square root function deal with a negative argument for us -- it quietly returns NaN. In C and C++ we don’t have that option. Calling sqrt with a negative number produces a domain error, and the value it returns is implementation-defined, so we have to write code to detect this case and handle it appropriately:


double disc = b*b - 4 * a * c;
if (disc < 0)
    printf("no real roots\n");
else
    {
    /* continue the computation */
    }

That’s also the right solution in Java; it makes the control flow clear, rather than letting a NaN propagate silently through the rest of the computation, and it doesn’t abuse the exception mechanism by using it to handle an ordinary, non-exceptional condition.

The possibility of having no real roots is inherent in solving quadratic equations. When we try to solve them using floating point math we introduce an additional set of complications that don’t come with the original problem. Rather, they arise from our use of limited precision representations of values that often cannot be expressed exactly in a finite number of bits. That is, we run into the roundoff errors that we discussed earlier. The simplest example is the case where the equation has only one root. In that case, as you probably remember, the value of the discriminant, b^2 - 4ac, is zero. But that’s only true in the theoretical world of abstract algebra. In the concrete world of finite representations of floating point values, the values of a, b, and c have been rounded to fit in floating point variables, so they might not have the exact values that theory would indicate. Further, the multiplications that we have to make to calculate the discriminant can introduce additional roundoff errors. When we get around to subtracting 4ac from b^2 we have probably introduced enough errors into those two values that the result of the subtraction will not be zero. So in our algorithm we will probably erroneously conclude that we do not have a single root, but that we either have no roots or two roots.
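
Here’s a quick illustration of how easily this happens, using ordinary doubles rather than the emulator (this example isn’t one of the column’s listings). The quadratic (x - 0.1)^2 = x^2 - 0.2x + 0.01 has the single root 0.1, so in exact arithmetic its discriminant is zero; but 0.1, 0.2, and 0.01 can’t be represented exactly in binary floating point:


#include <stdio.h>

int main()
    {
    double a = 1.0, b = -0.2, c = 0.01;   /* (x - 0.1)^2, multiplied out */
    double disc = b * b - 4.0 * a * c;
    /* on a typical IEEE-754 double implementation this prints a tiny
       non-zero value, not 0 */
    printf("discriminant = %g\n", disc);
    return 0;
    }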

Even if b^2 is much larger than 4ac we can still run into problems. There’s another subtraction here that can magnify roundoff errors and produce incorrect results. If b is positive, then the numerator in the calculation of x1 is equivalent to


sqrt(disc) - b

and if sqrt(disc) and b are fairly close in value roundoff errors can easily dominate the result. Similarly, if b is negative, the numerator in the calculation of x2 is


-b - sqrt(disc)

and when 4ac is small compared to b^2, -b and sqrt(disc) are both positive and fairly close in value, so again roundoff errors can easily dominate the result.
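
To make the effect concrete, here’s a small example -- again ordinary doubles, not the emulator, and not one of the column’s listings. The equation x^2 - 100000000x + 1 = 0 has one root near 100000000 and one near 0.00000001, and computing the small root with the textbook formula runs straight into the subtraction just described:


#include <math.h>
#include <stdio.h>

int main()
    {
    double a = 1.0, b = -1.0e8, c = 1.0;
    double disc = b * b - 4.0 * a * c;
    /* b is negative, so -b and sqrt(disc) are nearly equal positive
       values; the subtraction cancels most of their significant digits */
    double small_root = (-b - sqrt(disc)) / (2.0 * a);
    /* typically prints a value around 7.5e-009, although the true
       root is very nearly 1e-008 */
    printf("small root = %g\n", small_root);
    return 0;
    }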

The important thing to notice here is that one of the two results can only be obtained by subtracting two numbers that may be close together. So numerical analysts usually don’t use this formula for solving quadratic equations. They use other approaches, one of which is to divide both sides of the original equation by a, so that the equation takes this form:


x^2 + 2bx + c = 0

Note that we’ve also expressed the coefficient of x as 2b, rather than b. That is, the value of b that we’ll use in solving this equation is the coefficient of x, divided by 2. Now the two solutions can be written as:


x1 = -b + sqrt(b^2 - c)
x2 = -b - sqrt(b^2 - c)

This doesn’t eliminate the problem we saw earlier in calculating the discriminant, but with the additional observation that the product of the two roots will be c, we can eliminate the second subtraction. If b is negative, we first get the value of x1 from our formula. This involves the addition of two positive numbers, so it avoids the magnification of roundoff errors that subtraction produces. Once we have that result, we divide it into c to get the value of x2. Similarly, if b is positive, we first get the value of x2 from our formula, and divide it into c to get the value of x1. In both cases we’ve avoided one of the subtractions that could have led to erroneous results [1].
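
Here’s a sketch of that approach, once more with ordinary doubles rather than the emulator. The function name solve_reduced is made up for this example, and the code assumes the caller has already checked that b^2 - c is not negative and that b and c are not both zero:


#include <math.h>
#include <stdio.h>

/* roots of x*x + 2*b*x + c == 0; requires b*b - c >= 0 and
   not (b == 0 && c == 0) */
void solve_reduced(double b, double c, double *x1, double *x2)
    {
    double root = sqrt(b * b - c);
    if (b < 0)
        {                       /* -b is positive: we add, we don't subtract */
        *x1 = -b + root;
        *x2 = c / *x1;          /* the product of the two roots is c */
        }
    else
        {                       /* -b - root adds two values of the same sign */
        *x2 = -b - root;
        *x1 = c / *x2;
        }
    }

int main()
    {
    double x1, x2;
    /* x^2 - 100000000x + 1 = 0 again: b here is half the coefficient of x */
    solve_reduced(-5.0e7, 1.0, &x1, &x2);
    /* this time the small root comes out very close to 1e-008 */
    printf("x1 = %g, x2 = %g\n", x1, x2);
    return 0;
    }

The division that replaces the cancelling subtraction introduces at most an ordinary rounding error, so the small root now comes out as accurately as the large one.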

Detecting Numerical Errors

Sometimes the best approach to a calculation is simply to do it and check afterwards whether the result is sensible. For example, we may know two different ways of computing the result we’re looking for, one of which is fast when it works, but can end up dividing by zero for some input values, and the other of which is much slower, but works for all input values. In that case we might decide to try the first one, and if it fails, drop back to the second. To do this our code needs to be able to detect that a division by zero occurred, so that it can switch to the slower method.

If you try this with the floating point emulator that we’ve been building in the past few columns, you’ll see that it doesn’t handle this sort of situation gracefully. Giving the four-function calculator an expression that divides by zero terminates the program:


d:> 4bang 1.0/0.0
floating point error: divide by zero
d:>

That doesn’t help much if you need to detect the failure in order to try another approach. So these days, many floating point systems provide less abrupt failure mechanisms. Most of these follow, more or less, the specifications set out in the IEEE Standard for Binary Floating-Point Arithmetic (IEEE-754) [2]. The IEEE specification provides for error handling through exceptions, flags, traps, and special floating point values that indicate that errors have occurred. In the remainder of this column we’ll see how these error handling mechanisms are defined, by looking at how they can be implemented in our floating point emulator.

Floating Point Exceptions

A floating point exception is not the same thing as an exception in C++. They just happen to have the same name, and a certain superficial similarity, since they both deal with computations that have somehow gone astray. Don’t confuse the two terms: a floating point exception has nothing to do with a C++ exception [3].

A floating point exception occurs when something goes wrong in a floating point computation. IEEE-754 defines five types of floating point exception: invalid operation, division by zero, overflow, underflow, and inexact. For each of these exception types you can install your own handler, known as a "trap handler," which will be called whenever an exception of the corresponding type occurs. If you don’t install your own trap handler, the default behavior of an IEEE-754 implementation is to set a flag indicating the type of exception that occurred, and, depending on the type of exception, to produce a special value that indicates that an exception occurred. The flags are sometimes known as "sticky bits." They are set when an exception occurs, but floating point computations never clear them. That means that when a computation has finished you can check the flags to see if any exceptions occurred.
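
C99 gives programs direct access to those sticky flags through <fenv.h>. As an aside -- this isn’t part of the emulator, and it assumes an implementation that actually supports the C99 floating point environment -- the check-the-flags-afterwards style of error detection looks like this:


#include <fenv.h>
#include <stdio.h>

/* tell the compiler that we intend to examine the floating point flags */
#pragma STDC FENV_ACCESS ON

int main(void)
    {
    volatile double x = 1.0, y = 0.0;   /* volatile: keep the division at run time */
    feclearexcept(FE_ALL_EXCEPT);       /* start with every flag clear */
    double z = x / y;                   /* sets the divide-by-zero flag */
    if (fetestexcept(FE_DIVBYZERO))
        printf("divide by zero occurred; the result was %g\n", z);
    return 0;
    }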

With an updated emulator that supports IEEE-754 exceptions, the four-function calculator produces the following output for division by zero:


d:> 4bang 1.0/0.0
Inf (+,128,0)
d:>

That is, the result of 1.0/0.0 is the special value Infinity. Unlike the previous version of the emulator, this one does not terminate execution when we divide by zero, so we can continue with our computation without having to know that we divided by zero:


d:> 4bang 1.0/0.0+1.0
Inf (+,128,0)
d:>

Exception Handling

With that in mind, let’s look at the code that implements the default behavior of the five IEEE-754 categories of exceptions. Listing 1 is the new version of the header file fp.h. There’s been quite a bit added to support floating point exceptions. The enumerated type fp_exception, which indicates the type of the exception that occurred, has been moved from the private section to the public section. It is followed by a type declaration, to simplify trap handling, then five functions that deal with flags and two functions that deal with traps. We’ll look at those in the next column. The private section of class fp now has an additional enumerated type, fpx, and the static member function exception, which is called whenever an exception occurs, now takes a second argument, of type fp, and returns a result, also of type fp. These are followed by seven new static data members that are a convenient way of getting the special floating point values that we’ll be looking at later.

The static member exc_hndlrs is an array of pointers to functions. When an exception occurs, the function exception uses the type code to generate an index into this array and calls the corresponding exception handler.

The class fp also has five static member functions that implement the default handling for the five exception types. Their names take the form ex_XXX, where the letters XXX are replaced by a three-letter code for the exception type that the particular function handles. The class also declares five friend functions, with names of the form def_XXX, where XXX, again, is a code for the exception type that the function handles. This two-layer system makes trap handling simpler. The member functions ex_XXX implement the default trap handling. Since user-installed traps cannot be members of the class fp, the trap handling mechanism must be able to manage non-member functions. That’s the reason for the def_XXX functions: they are installed as the default trap handlers, and each of them simply calls the corresponding ex_XXX function. Since the function exception does not have to distinguish between default handling and user-installed trap handlers, it is very simple to write:


fp fp::exception(fpx except, fp arg)
    {   // handle floating point exceptions
    return exc_hndlrs[except]
        ((fp_exception)(1 << except), arg);
    }
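
As an illustration of the two layers (a sketch only -- the real declarations and definitions are in the listings), the forwarding function for divide by zero amounts to nothing more than this:


fp def_dbz(fp::fp_exception except, fp arg)
    {   // default trap handler: forward to the member function
    return fp::ex_dbz(except, arg);
    }

The array exc_hndlrs starts out holding pointers to the five def_XXX functions, so the default behavior and any user-installed trap handler are reached through exactly the same call.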

Default Handlers

The handling of the inexact exception is the simplest: when rounding drops non-zero low bits from a value the inexact flag is set. Here’s the code that does this:


fp fp::ex_ine(fp::fp_exception except, fp arg)
    {
    set_flag(except);
    return arg;
    }

The function set_flag, as its name suggests, sets the flag that indicates that an exception of the specified type occurred. All of the default exception handlers call this function. After calling set_flag, this handler just returns the value of its second argument. The value that an exception handler returns will be used in the floating point computation in place of the value that triggered the exception. In this case we want to use the rounded value that has already been computed, so the exception handler simply returns that value.

On division by zero, the default behavior is to set the flag and return positive or negative infinity, depending on the signs of the operands in the division. The default handler looks like this:


fp fp::ex_dbz(fp::fp_exception except, fp arg)
    {
    set_flag(except);
    return arg.is_neg ? ninf : pinf;
    }

This code relies on a trick that doesn’t conform to the IEEE-754 standard: the value passed in arg has its sign set according to the sign of the desired result. The function uses that sign value to determine the sign of the infinite value that it returns.

On overflow, the default behavior is to set a flag and to return a value that depends on the sign of the result and on the rounding mode. When the rounding mode is set to round toward zero, overflows produce the largest finite representable value with the appropriate sign. When the rounding mode is round down, an overflow in the positive direction produces the largest positive finite representable value. When the rounding mode is round up, an overflow in the negative direction produces the largest negative finite representable value. In all other cases an overflow produces an infinity with the appropriate sign. The code for the default overflow handler looks like this:


fp fp::ex_exo(fp::fp_exception except, fp arg)
    {
    set_flag(except);
    return rm == rm_nearest
            ? arg.is_neg ? ninf : pinf
        : rm == rm_zero
            ? arg.is_neg ? nmax : pmax
        : rm == rm_down
            ? arg.is_neg ? ninf : pmax
            : arg.is_neg ? nmax : pinf;
    }

Underflow requires us to introduce a new concept. In the previous versions of the emulator, when an arithmetic operation produced a value whose exponent was too small to represent, we converted the value to zero. IEEE-754 provides the notion of gradual underflow: when an exponent is too small to represent, the value is represented with the smallest representable exponent and a denormalized fraction, that is, a fraction in which the high bit is not one. In our emulator, the smallest representable exponent is -127. The smallest normalized value has an exponent of -127 and a fraction of 0x4000. If we divide this value by two, since we can’t decrease the exponent below -127, we decrease the fraction by dividing it by two, producing a value with an exponent of -127 and a fraction of 0x2000. The smallest positive non-zero value that can be represented has an exponent of -127 and a fraction of 1. If we divide this value by two we end up with an exponent of -127 and a fraction of 0, which is the representation for the value zero.
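
You can watch gradual underflow happen with ordinary hardware doubles, too. This isn’t emulator code, just an illustration: starting from the smallest normalized positive double and halving repeatedly, an IEEE-754 implementation works its way down through dozens of denormalized values before it finally reaches zero, instead of jumping to zero on the first halving:


#include <float.h>
#include <stdio.h>

int main(void)
    {
    double d = DBL_MIN;         /* smallest normalized positive double */
    int halvings = 0;
    while (d > 0.0)
        {
        d /= 2.0;               /* each step shifts one more fraction bit away */
        ++halvings;
        }
    /* with IEEE-754 gradual underflow this typically prints 53;
       a flush-to-zero system would print 1 */
    printf("reached zero after %d halvings\n", halvings);
    return 0;
    }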

The thing to keep in mind when dealing with denormals is that when the fraction has high-order bits equal to zero it has fewer significant bits than a normalized value. So gradual underflow is accompanied by gradual loss of precision. Sometimes that’s okay. If the low bits that are discarded in producing the denormalized fraction are all zero, then the denormalized fraction hasn’t actually lost any bits, and subsequent computations using this value won’t be compromised by the denormalization. Consequently, an underflow exception is generated only when the result of a computation is too small to represent in normalized form, and some low order non-zero bits were discarded in producing the denormalized value. The underflow handler is called with a floating point value whose exponent is 192 more than the exponent that would have been generated, and a normalized fraction. That is, the value that the handler gets is 2^192 times the correct value. This value can be represented with full precision in a normal floating point value. Our default handler simply removes this factor of 2^192 to produce the actual exponent, then shifts the fraction, denormalizing it:


fp fp::ex_exu(fp::fp_exception except, fp arg)
    {
    set_flag(except);
    int ex = arg.exp - 192;
    int fr = arg.frac >> -ex;
    return fp(0 - BIAS, fr, arg.is_neg);
    }

Finally, the invalid operation exception introduces another special value, NaN, short for "not a number." These come in two forms: signaling NaNs and quiet NaNs. Floating point computations never produce signaling NaNs. They only arise if you write code to create them. We’ll see later how NaNs propagate through floating point computations. The default handler for an invalid operation exception sets the flag to indicate that this exception occurred and returns a quiet NaN. It looks like this:


fp fp::ex_inv(fp::fp_exception except, fp)
    {
    set_flag(except);
    return qnan;
    }

Generating Exceptions

Now that we’ve seen the code that handles floating point exceptions we need to look at where these functions are called. The code changes from the old emulator are mostly tedious additions to the member functions fp::add, fp::operator *=, fp::operator /=, and fp::normalize.

The new version of fp::add is in Listing 2. The code begins by testing whether either argument is a NaN. If so, the result of the addition is also a NaN. Then it checks for addition of infinities of opposite signs. Such an operation doesn’t have any sensible result, so it generates an invalid operation exception. As we saw earlier, the default behavior for an invalid operation exception is to return a NaN. Next, the function checks whether its first argument is an infinity. If so, it returns that value. If not, it checks whether its second argument is an infinity. If so, it returns that value. Finally, if none of those special cases has occurred, it falls through into the addition code for ordinary floating point values.

The order of these tests is important. Any arithmetic operation on a NaN must produce a NaN [4], so we check for NaN values first. Any operations involving infinities but no NaNs must produce either a NaN (for addition of infinities of opposite signs) or an infinity, so we check next for infinities. Once we’ve dealt with any NaNs and infinities we’re left with two floating point values. We don’t have to look for denormals here -- ordinary arithmetic operations work correctly with denormals.
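
For comparison, hardware doubles follow the same rules that these tests implement. A quick check, using the INFINITY and NAN macros that C99 added to <math.h> (again, not emulator code; the exact spelling of the printed values varies from one implementation to another):


#include <math.h>
#include <stdio.h>

int main(void)
    {
    double pinf = INFINITY, qnan = NAN;
    printf("NaN + 1.0  = %g\n", qnan + 1.0);      /* NaN in, NaN out */
    printf("Inf + -Inf = %g\n", pinf + (-pinf));  /* invalid operation: quiet NaN */
    printf("Inf + 1.0  = %g\n", pinf + 1.0);      /* infinity plus finite is infinity */
    return 0;
    }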

The new version of fp::operator *= is in Listing 3. It begins by converting its arguments with the macro ARG. This macro checks whether its argument is a signaling NaN, and if so, raises an invalid operation exception [5]. As we saw earlier, the default result of an invalid operation exception is a quiet NaN [6]. After these adjustments, the code checks whether either of the function’s arguments is a NaN, and if so, assigns that NaN to the return value. If neither argument is a NaN, the code checks for the two cases of infinity times zero and zero times infinity. These trigger an invalid operation exception, resulting, again, in a NaN. If neither of these cases has occurred, the code checks whether either argument is zero. If so, it returns a value of zero with the appropriate sign. Next it checks whether either value is infinity, and if either one is, it returns a value of infinity with the appropriate sign. Otherwise it does the multiplication and returns the result.

The code for fp::operator /= is in Listing 4. It looks quite a bit like the code for fp::operator *=. The first difference is in the tests that produce an invalid operation exception. For multiplication, zero times infinity is an invalid operation. For division, infinity divided by infinity and zero divided by zero are invalid operations. The second difference is in the handling when the second argument is zero, which triggers a division by zero exception.

Finally, the function fp::normalize has been changed a bit from the previous version. The new code is in Listing 5. The function round has been changed to return a boolean value indicating whether any low-order bits will be discarded in the rounding. The code saves the result of the call to round for later use. The tests for underflow and overflow at the end of the function have to test more cases because of the additional rules imposed by IEEE-754.

The first test checks whether the exponent is smaller than the smallest representable exponent. If so, we have to denormalize the result. If rounding the result will discard non-zero bits we have an underflow exception. If not, we just generate a denormalized value.

The next test checks whether the exponent is larger than the largest representable exponent. If so, we have an overflow exception.

The next test checks whether rounding will discard non-zero bits. If so, we have an inexact exception.

If we get through all those tests, we have the pieces of a properly normalized value, and we simply generate that value and return.

Trying It Out

After you’ve downloaded the code in fp.zip, build the four-function calculator and try out some of the edge cases we’ve talked about. You’ll need to compile 4bang.cpp and fp.cpp. The four-function calculator is a bit limited in the form of input that it takes, so you sometimes have to be a little tricky to hit the edge cases. But with a little ingenuity and some carefully placed parentheses you can generate anything you’d like. Here are a few things you might try.

To generate infinity:


d:> 4bang 1.0/0.0
Inf (+,128,0)
d:>

To generate negative infinity:


d:> 4bang -1.0/0.0
-Inf (-,128,0)
d:>

To generate NaN:


d:> 4bang 0.0/0.0
NaN (+,128,1)
d:>

d:> 4bang (1.0/0.0)/(-1.0/0.0)
NaN (+,128,1)
d:>

Next Month

Next month we’ll wrap up this exploration of floating point math, looking at user-defined trap handlers and the changes made to the C language definition by the recent C99 revision.

1. This approach is taken from Forman S. Acton, "Real Computing Made Real," Princeton University Press 1996, ISBN 0-691-03663-2, p. 5.

2. Available at shop.ieee.org/store/.

3. Some compilers turn floating point exceptions into C++ exceptions, but this is not allowed by the C++ language definition.

4. We’ll see next month that comparisons of values involving NaNs produce boolean results.

5. This doesn’t conform to the IEEE-754 standard, because it ends up generating two exceptions when an argument to operator *= is a signaling NaN: one when the signaling NaN is converted to a quiet NaN, and another when the code later tests for a NaN.

6. The same check is performed in fp::operator+ and fp::operator- before these functions call add.