Article by Ken Shirriff on the Pentium floating point bug.
In 1993, Intel released the high-performance Pentium processor, the start of the long-running Pentium line. The Pentium had many improvements over the previous processor, the Intel 486, including a faster floating-point division algorithm. A year later, Professor Nicely, a number theory professor, was researching reciprocals of twin prime numbers when he noticed a problem: his Pentium sometimes generated the wrong result when performing floating-point division. Intel considered this “an extremely minor technical problem”, but much to Intel’s surprise, the bug became a large media story. After weeks of criticism, mockery, and bad publicity, Intel agreed to replace everyone’s faulty Pentium chips, costing the company $475 million.
In this article, I discuss the Pentium’s division algorithm, show exactly where the bug is on the Pentium chip, take a close look at the circuitry, and explain what went wrong. In brief, the division algorithm uses a lookup table. In 1994, Intel stated that the cause of the bug was that five entries were omitted from the table due to an error in a script. However, my analysis shows that 16 entries were omitted due to a mathematical mistake in the definition of the lookup table. Five of the missing entries trigger the bug (also called the FDIV bug after the floating-point division instruction “FDIV”), while 11 of the missing entries have no effect.
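For anyone who wants to see what a wrong division looks like in practice, here is a minimal Python sketch of the widely circulated FDIV check. The operand pair and the flawed quotient are the commonly reported figures, not values taken from the excerpt above, and the function names are made up for illustration.

```python
from fractions import Fraction

# Widely reported test operands for the FDIV bug (an assumption of this
# sketch; they do not appear in the excerpt above).
X, Y = 4195835, 3145727

# Quotient commonly reported for a flawed Pentium, kept here only for
# comparison. Modern hardware will not reproduce it.
REPORTED_FLAWED_QUOTIENT = 1.333739068902037589


def fdiv_residual(x, y):
    """Return x - (x / y) * y using the host's floating-point division.

    With a correctly rounded divide this is 0.0 for the operands above.
    """
    return x - (x / y) * y


def relative_error(quotient, x, y):
    """Relative error of a claimed quotient against the exact rational x/y."""
    exact = Fraction(x, y)
    return float(abs(Fraction(quotient) - exact) / exact)


if __name__ == "__main__":
    print("residual on this machine:", fdiv_residual(X, Y))
    print("host relative error:     ", relative_error(X / Y, X, Y))
    print("reported flawed error:   ", relative_error(REPORTED_FLAWED_QUOTIENT, X, Y))
```

On a flawed Pentium the residual was commonly reported as 256 rather than 0, which corresponds to a relative error of roughly 61 parts per million in the quotient.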
Say, does anyone know how thoroughly hardware multipliers are tested nowadays? I imagine the FDIV bug serves as a cautionary tale every time someone designs a multiplier. It is certainly the most famous, but I can think of another example:
With 32-bit ints it’s already hard to build a multiplication table.
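To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The rate of one billion checks per second is purely an assumption for illustration, as are the function names.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def operand_pairs(bits):
    """Number of (a, b) input pairs for a bits x bits multiplier."""
    return 1 << (2 * bits)


def table_bytes(bits):
    """Size of a full lookup table, storing one 2*bits-wide product per pair."""
    return operand_pairs(bits) * (2 * bits // 8)


def exhaustive_years(bits, checks_per_second=1e9):
    """Years needed to check every pair at an assumed checking rate."""
    return operand_pairs(bits) / checks_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for bits in (16, 32):
        print(bits, operand_pairs(bits), table_bytes(bits), exhaustive_years(bits))
```

Under that assumed rate, 16-bit operands take a few seconds to cover exhaustively, while 32-bit operands take on the order of centuries, and the full table would need 2^67 bytes.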
That’s an interesting one because it’s not an arithmetic multiplication bug, but something in the pipelining or internal state. Something one might miss even in an exhaustive search of arguments.
In my time, there was some limited use of formal methods, and a great deal of directed random testing. The verification team was similar in size to the implementation team. Verification suites used the great majority of computer time in our compute farm, and tapeouts would be gated on successful verification. (I think these days the data for earlier masks, the lower levels of the chip, will go out before full verification, because there are so many interconnect layers that most problems can be corrected in metal-only fixes.)
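As a rough illustration of what directed random testing means, here is a toy Python sketch: a bit-serial divider stands in for the device under test, Python's `divmod` is the reference model, and the operand generator is biased toward corner cases. Everything here (the 16-bit width, the corner list, the injected fault) is hypothetical and far simpler than a real verification flow.

```python
import random

WIDTH = 16
MASK = (1 << WIDTH) - 1
CORNERS = [0, 1, 2, MASK, MASK - 1, 1 << (WIDTH - 1)]


def restoring_divide(n, d):
    """Toy device under test: restoring division on WIDTH-bit unsigned ints."""
    q, r = 0, 0
    for i in range(WIDTH - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)  # shift in the next dividend bit
        q <<= 1
        if r >= d:                     # subtract succeeds: quotient bit is 1
            r -= d
            q |= 1
    return q, r


def buggy_divide(n, d):
    """Same divider with a hypothetical fault on a narrow divisor pattern."""
    q, r = restoring_divide(n, d)
    if d & 0xFF0 == 0xBF0 and n & 1:   # arbitrary rare trigger, for illustration
        q ^= 1
    return q, r


def directed_operand(rng):
    """Mix corner cases and near-powers-of-two with uniform random values."""
    choice = rng.random()
    if choice < 0.2:
        return rng.choice(CORNERS)
    if choice < 0.4:
        return ((1 << rng.randrange(WIDTH)) + rng.choice((-1, 0, 1))) & MASK
    return rng.randrange(MASK + 1)


def run_tests(dut, trials, seed=0):
    """Compare the device under test against divmod; return failing inputs."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        n = directed_operand(rng)
        d = directed_operand(rng) or 1  # skip division by zero
        if dut(n, d) != divmod(n, d):
            failures.append((n, d))
    return failures


if __name__ == "__main__":
    print("clean divider failures:", len(run_tests(restoring_divide, 20000)))
    print("buggy divider failures:", len(run_tests(buggy_divide, 20000)))
```

The injected fault only fires for about one divisor in 256, so a short run can miss it entirely. That is why the volume of tests, and the compute farm time behind it, matters so much.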
In other words, verification is at least equally difficult and important compared to design and implementation. Then again, corporate cultures differ, so my (limited) experience isn’t going to be typical. Likewise, projects differ in their internal complexity, and in the possibility of software fixes (as in this case).
Today’s complex processor designs require projects of the order of hundreds of person years’ effort, and most of this effort is verification-related rather than design-related.
In my experience people expect hardware to be perfect, but software gets a pass because it can be “easily” changed. Of course, as that letter from Pentium says, no microprocessor is perfect, but that doesn’t stop people from pretending it is. You could say that the FDIV bug was a blessing in disguise: it was severe enough to get discovered, whereas a more subtle bug could have gone unnoticed for longer and affected more results.