Floating-point Assist Fault on Intel Itanium 2
Aurélien Dumez - aurelien.dumez@inrialpes.fr
INRIA Rhône-Alpes - Apache, Remap, Sardes research projects

ADVICE: you should open this document in an editor using a fixed-width font.

This short document explains the "floating-point assist fault" message that
sometimes appears in the /var/log/messages file on the Intel Itanium platform.

Some theory
-----------

Floating-point numbers cannot be put in CPU registers as easily as integers.
They have to be normalized, as defined in the IEEE 754 standard. Each number
is cut into three elements: sign, exponent and fraction. The standard offers
two levels of precision, single and double, depending on the number of bits
used to code each element:

Type     Sign(S)   Exponent(E)   Fraction(F)   Total length   C type
----------------------------------------------------------------------
Single      1           8             23          32 bits     float
Double      1          11             52          64 bits     double
----------------------------------------------------------------------

When normalized, a floating-point number looks like:

    (-1)^S * (1 + F) * 2^(E - bias)

To get this form, the number must first be converted to binary, something
like (101.01101)b. Then it can be written in normalized form:
(101.01101)b -> (1.0101101E10)b. The final step is to extract the three
elements:

- The sign: 0 for positive numbers, 1 for negative ones.

- The exponent (unsigned): its format is said to be biased, that is, a
  constant value has been added to the real exponent. This value is +127
  for singles and +1023 for doubles. The all-zero and all-one bit patterns
  of the exponent field are reserved for special values, so the real
  (non-biased) exponent of a normalized number ranges over:
      [-126, 127]   for singles
      [-1022, 1023] for doubles

- The fraction: it is a bit tricky. The first digit (before the point) is
  always "one" in normalized format, so it does not appear in the coded
  fraction. In our example, we store "0101101" as the fraction, instead of
  "10101101". Warning: the leading zero is mandatory here.

Need an example? We want to code 9.75 (single precision).

Step 1: convert the number into binary format:
    9    -> (1001)b  (1*2^3 + 1*2^0)
    0.75 -> (0.11)b  (1*2^-1 + 1*2^-2)
    so 9.75 -> (1001.11)b

Step 2: normalize the number:
    (1001.11)b -> (1.00111E11)b

Step 3: extract each component. The sign: 9.75 is positive, so the sign bit
is zero. The exponent is 127 + (11)b = 130 and the fraction is 00111
(remember that the first digit is discarded because it is always one).

You have just computed your first IEEE 754 single-precision floating-point
number by hand. Fortunately, the computer does it on our behalf. As said
above, the standard defines reserved values to represent special numbers
such as infinities. For further information, have a look at the IEEE 754
standard.

What does happen in real life?
------------------------------

Many programs deal with floating-point numbers. Many programmers do not
know how these numbers are coded, and so they can make big mistakes.

An example. Do you think the number 0.0009765625 is hard to code? Many
people would say: let's round it to 0.00098, it's close enough. Notice that
the first number is 1/2^10, that is (1E-1010)b in normalized
representation: 0 for the sign, 0 for the fraction and 127 + (-10) = 117
for the exponent. Now try to compute the normalized representation of the
second number: many "zero" digits, some "one" digits, and a conversion that
never terminates. The precision is not so good.
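If you would rather let a program do the decomposition, the short C sketch
below (not part of the original text) prints the sign, biased exponent and
fraction bits of a float, so you can check the values given above for 9.75
and 0.0009765625. The helper name show_float() and the use of memcpy() to
read the bits are my own choices; the code only assumes the usual 32-bit
IEEE 754 single-precision format.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Print the IEEE754 fields of a single-precision float.
   Assumes "float" is the usual 32-bit IEEE754 single. */
static void show_float (float f)
{
    uint32_t bits;
    unsigned int sign, exponent, fraction;

    memcpy (&bits, &f, sizeof bits);    /* reinterpret the float's bits    */
    sign     = (bits >> 31) & 0x1;      /* 1 bit                           */
    exponent = (bits >> 23) & 0xFF;     /* 8 bits, biased by 127           */
    fraction =  bits        & 0x7FFFFF; /* 23 bits, implicit leading "one" */

    printf ("%.10f : sign=%u  biased exponent=%u (real exponent=%d)  fraction=0x%06X\n",
            f, sign, exponent, (int) exponent - 127, fraction);
}

int main (void)
{
    show_float (9.75f);          /* expect sign=0, biased exponent=130, fraction=0x1C0000 */
    show_float (0.0009765625f);  /* expect sign=0, biased exponent=117, fraction=0x000000 */
    return 0;
}

Compile and run it: 9.75 should show a biased exponent of 130 (real
exponent 3), and 0.0009765625 a biased exponent of 117 with a zero
fraction.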
A small C program can also illustrate the loss of precision:

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    float f;

    f = 0.0009765625;
    printf ("Really precise !\n");
    printf ("I assigned 0.0009765625 to f\n");
    printf ("f = %f\n", f);
    printf ("f = %.40f\n", f);

    f = 0.00098;
    printf ("Really precise ?\n");
    printf ("I assigned 0.00098 to f\n");
    printf ("f = %f\n", f);
    printf ("f = %.40f\n", f);

    exit (0);
}

Compile it and run it. Surprise! The number 0.00098 cannot be stored
exactly, even with double precision (try it). Always keep in mind what
precision means for floating-point numbers.

What about this floating-point assist fault?
--------------------------------------------

When one does computations involving floats, the result cannot always be
turned into the normalized representation. Such numbers are called
"denormals"; they can be thought of as really tiny numbers, almost zero.
The IEEE 754 standard handles these cases, but the floating-point unit does
not always do so in hardware. There are two ways to deal with this problem:

- silently ignore it (for example by turning the number into zero);
- inform the user that the result is a denormal and let him decide what to
  do with it (that is, we ask the user and his software to assist the FPU).

The Intel Itanium does not fully support IEEE denormals and requires
software assistance to handle them. Without further instructions, the ia64
GNU/Linux kernel reports a fault when denormals are computed. This is the
"floating-point software assist" (FPSWA) fault in the kernel messages. It
is the user's task to design his program so that such cases do not occur.

The goal of the following C program is to trigger a FPSWA fault.

#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>

#define ITERATIONS 1024

typedef struct {
    char sign;
    long frac;
    int  exp;
} sdouble;

/* Warning: the size of a double is hardcoded here, a very bad habit.
   This program is only meant to run on Itanium. */
sdouble cut_double (double d)
{
    sdouble ret;
    long *db;

    db = (long *) &d;
    ret.sign = (*db >> 63) & 0x0000000000000001;
    ret.exp  = ((*db >> 52) & 0x00000000000007FF) - 1023;
    ret.frac = (*db)        & 0x000FFFFFFFFFFFFF;
    return ret;
}

int main (int argc, char **argv)
{
    sdouble sa;
    double a;
    int i;

    a = 1;
    i = 1;
    openlog ("double_test", LOG_CONS, LOG_USER);
    syslog (LOG_USER | LOG_NOTICE, "starting");
    while (i <= ITERATIONS) {
        syslog (LOG_USER | LOG_NOTICE, "iteration #%d", i);
        a /= 2;
        sa = cut_double (a);
        printf ("iteration %d : a=(%d, %lx, %d) : ", i, sa.sign, sa.frac, sa.exp);
        printf ("\n");
        i++;
    }
    exit (0);
}

Its principle is very simple: at each iteration, a number is divided by 2.
The initial value of this number is 1 (easy to normalize). Remember the
normalized representation: each division only changes the exponent:
(1E0)b / 2 = (1E-1)b, (1E-1)b / 2 = (1E-10)b, and so on. So we just have to
keep an eye on the exponent value.

Let's guess the result. In double precision, the exponent of a normalized
number goes from -1022 to 1023 (1 to 2046 when biased); the biased values 0
and 2047 are reserved. So what happens when (1E-1111111110)b, that is
2^-1022, is divided by 2 once more? The biased exponent 0 is reserved, so
the result cannot be coded in normalized form: it is a denormal, and the
kernel should report a FPSWA fault. Let's check it. Compile the program
with the following command line and run it:

gcc -o test test.c

Read the file /var/log/messages: iteration 1024 triggers a FPSWA fault. The
theory was right. This fault is not an issue as long as you know what you
are doing.
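A program can also detect by itself that a value has drifted into the
denormal range. The sketch below is not part of the original example; it
repeats the same division loop and stops at the first denormal result,
using the C99 fpclassify() macro from <math.h> (the iteration limit 1100
and the file name are arbitrary). With IEEE 754 doubles, the first denormal
result appears at iteration 1023.

#include <stdio.h>
#include <math.h>

int main (void)
{
    double a = 1.0;
    int i;

    for (i = 1; i <= 1100; i++) {
        a /= 2;
        /* FP_SUBNORMAL means the value can no longer be normalized */
        if (fpclassify (a) == FP_SUBNORMAL) {
            printf ("iteration %d: %e is a denormal\n", i, a);
            return 0;
        }
    }
    printf ("no denormal produced\n");
    return 0;
}

Compile it with a C99-capable compiler, for example "gcc -std=c99 -o test2
test2.c", and run it.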
If you do not care about denormals, you can ignore them (and avoid the
kernel messages) by using compiler flags:

- GNU CC (tested with 3.3.2): -ffast-math. Rebuild the program with the
  command "gcc -ffast-math test.c -o test" and run it. No FPSWA fault, and
  the number is kept as 1E-1023 (increase the number of iterations to see
  it).

- Intel CC (tested with 8.0): -ftz. Rebuild the program with the command
  "icc -ftz test.c -o test" and run it. Same behaviour as above.

You should notice that the FPSWA fault message is written only four times
in the /var/log/messages file. This code was tested on GNU/Linux Red Hat
Advanced Server 2.1 with a 2.4.18 kernel. By default, if the FPSWA fault
occurs more than four times per second, only the first four messages are
logged and the following ones are silently discarded (see the file
/usr/src/linux/arch/ia64/kernel/traps.c).

FPSWA support is provided by Intel. It is an EFI application loaded during
the boot sequence; the GNU/Linux kernel calls the assist function when
needed. This function can be replaced, but the goal of this document is not
to explain how.

This document is about tiny numbers, almost zero, but you may wonder what
happens with big ones (positive or negative). When an operation produces an
overflow (a number that cannot fit in the defined range), the number is
turned into a special value (remember the reserved values): infinity (+inf
or -inf).

To conclude, I would like to stress that the programmer has to be careful
when dealing with floating-point numbers. Even with high precision, it is
easy to produce denormals and get strange behaviour.

Some links
----------

- Mosberger (D.), Eranian (S.), "IA-64 Linux Kernel: Design and
  Implementation", Prentice Hall.

- "Itanium Processor Floating-Point Software Assistance and Floating-Point
  Exception Handling" (Intel)
  http://www.intel.com/design/itanium/downloads/24541501.pdf
  (last seen: 2004-04-06)

- hp online technical resources - "What's up with those 'floating-point
  assist fault' messages?" - Linux on Itanium
  http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,165,00.hml
  (last seen: 2004-04-06)

- Microsoft Web site - "IEEE Standard 754 Floating Point Numbers"
  http://research.microsoft.com/~hollasch/cgindex/coding/ieeefloat.html
  (last seen: 2004-04-06)