IEEE 754 Single-Precision Numbers

Rev. 1.1 (060307)

Introduction

To represent both very large and very small values, C++ implementations generally use the floating-point representation specified in the IEEE 754 standard. The float data type represents a single-precision number, whereas the double data type represents a double-precision number. The two types do not differ in concept, only in the number of bits used to represent them in memory. This note describes the single-precision representation (a float in C++ terminology) in detail and briefly describes the double-precision representation.

The scientific notation

The number is stored in scientific notation using 2 as the base. This means that every number is stored in the form $k = \pm 1.M \cdot 2^{E}$. Since the base is fixed at 2, only three characteristic values need to be stored for any number: the sign S, the exponent E, and the mantissa M.
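As a rough illustration of this notation, the following sketch rewrites a value as a sign, a significand of the form 1.M, and a power of two. The `Normalized` type and the `normalize` function are hypothetical names introduced here, not part of the original note; the sketch relies on the standard library function `std::frexp` and assumes a finite, non-zero input.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type: k = sign * significand * 2^exponent,
// with 1 <= significand < 2.
struct Normalized {
    int sign;            // +1 or -1
    double significand;  // the "1.M" part
    int exponent;        // the power of two
};

Normalized normalize(double k) {
    // Assumption: zero, infinities and NaN have no 1.M form and are rejected.
    assert(std::isfinite(k) && k != 0.0);
    int exp = 0;
    // std::frexp returns f with 0.5 <= f < 1 and |k| = f * 2^exp.
    double f = std::frexp(std::fabs(k), &exp);
    // Shift by one binary place so that 1 <= significand < 2.
    return Normalized{k < 0 ? -1 : +1, f * 2.0, exp - 1};
}
```

For instance, `normalize(-4.625)` yields sign -1, significand 1.15625, and exponent 2, since 4.625 = 1.15625 · 2^2.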

The single-precision representation consists of 32 bits (4 bytes) divided into these three fields:

The sign bit S (bit 31)
The sign bit S is set to 0 if the number is positive and 1 if the number is negative.

The exponent E (bits 30 – 23)
The exponent E is the power to which 2 must be raised in the scientific notation of the number. The exponent is stored with a bias of 127, which means that the value stored in E is 127 for exponent = 0, 128 for exponent = 1, etc. This makes it possible to represent negative exponents (very small values) without storing the exponent itself as a signed number. The exponent range is -126 to 127 (E = 0x01 to 0xFE).

The mantissa M (bits 22 – 0)
The mantissa M is the part of the value (in scientific notation) that comes after the binary point. Since scientific notation with base 2 is used, a value will always be of the form 1.xxxx, so the “1.” part is left out (it is called “the hidden bit”). The mantissa is constructed by writing the after-the-point value ap as a sum of negative powers of 2, i.e.

$ap = \sum_{i=0}^{22} M_i \cdot 2^{-(23-i)}$

where $M_i$ is the value of the i'th bit in the mantissa (i = 0..22).
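The field layout and the mantissa sum above can be sketched in C++ as follows. The names `Fields`, `bits_of`, `split`, and `after_point` are hypothetical helpers introduced for illustration, and the sketch assumes a platform where float is a 32-bit IEEE 754 value (checked at compile time).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 &&
              sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit IEEE 754 float");

// Hypothetical helper type holding the three fields.
struct Fields {
    std::uint32_t sign;      // bit 31
    std::uint32_t exponent;  // bits 30..23, biased by 127
    std::uint32_t mantissa;  // bits 22..0
};

// Copy the bytes of the float into an integer of the same size.
std::uint32_t bits_of(float k) {
    std::uint32_t b = 0;
    std::memcpy(&b, &k, sizeof b);
    return b;
}

// Cut the 32-bit pattern into S, E and M with shifts and masks.
Fields split(float k) {
    std::uint32_t b = bits_of(k);
    return Fields{b >> 31, (b >> 23) & 0xFFu, b & 0x7FFFFFu};
}

// ap = sum of M_i * 2^-(23-i) over the mantissa bits i = 0..22.
double after_point(std::uint32_t mantissa) {
    assert(mantissa < (1u << 23));  // only 23 bits are meaningful
    double ap = 0.0;
    for (int i = 0; i <= 22; ++i) {
        if ((mantissa >> i) & 1u) {
            ap += std::ldexp(1.0, -(23 - i));  // adds 2^-(23-i)
        }
    }
    return ap;
}
```

Calling `split(-4.625f)` gives S = 1, E = 129, and a mantissa whose after-the-point value is 0.15625.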

The use of S, E, and M will become clearer through the following methods and a couple of examples.

Converting a real value k to its single-precision representation b

To convert a value k into its single-precision representation b, follow these steps:
(1) Deduce the sign bit from the sign of k.
(2) If |k| ≥ 2, repeatedly divide |k| by two until the value is at least 1 and less than 2. The exponent is the number of times |k| has been divided. Add 127 to the exponent to find E = b[30..23].
(3) If |k| < 1, repeatedly multiply |k| by two until the value is at least 1 and less than 2. The exponent is minus the number of times |k| has been multiplied. Add 127 to the exponent to find E. (If |k| is already at least 1 and less than 2, the exponent is 0 and E = 127.)
(4) Write the part of the divided/multiplied |k| after the binary point as a sum of negative powers of 2 to find M (e.g. 0.625 = 0.5 + 0.125 → M = 101000…).

Example: Convert k = -4.625 to its single-precision representation b. Following the method above:
(1) S = 1, since k is negative. Hence b[31] = 1.
(2) |k| is divided by 2 two times, yielding |k| = 4.625 = $1.15625 \cdot 2^{2}$. Hence the exponent is 2 and E = 2 + 127 = 129, and therefore b[30..23] = 10000001.
(3) Not applicable in this example.
(4) Since the divided value of |k| is 1.15625, the part after the binary point, 0.15625, must be written as a sum of negative powers of 2: 0.15625 = 0.125 + 0.03125 = $2^{-3} + 2^{-5}$. Hence the mantissa (b[22..0]) is 00101000…0.

The complete representation of k = -4.625 is therefore b = 11000000100101000000000000000000.

Converting a represented value b to a real number k

To convert a single-precision value b to the real number k it represents, follow these steps:
(1) Deduce the sign of k from the sign bit S = b[31].
(2) Read E from b[30..23]; the exponent of k is E − 127.
(3) Deduce the mantissa of k from b[22..0]: for each 1-bit in M, add $2^{pos-23}$ to the mantissa (pos is the position of the bit, from 22 down to 0).
(4) Use S, E, and M to construct k using the formula

$k = (-1)^{S} \cdot 1.M \cdot 2^{E-127}$

where E is the stored (biased) value of the exponent field.
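The four decoding steps can be sketched by hand in C++ as shown here. The function name `decode` is a hypothetical choice; the sketch assumes a normalized bit pattern (stored exponent between 1 and 254) and subtracts the bias of 127 from the stored exponent, as described for the exponent field. Special values are deliberately not handled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode a normalized single-precision bit pattern b following steps (1)-(4).
double decode(std::uint32_t b) {
    std::uint32_t S = b >> 31;            // step 1: sign bit b[31]
    std::uint32_t E = (b >> 23) & 0xFFu;  // step 2: stored exponent b[30..23]
    assert(E >= 1u && E <= 254u);         // assumption: no special values
    double M = 0.0;                       // step 3: value after the binary point
    for (int pos = 22; pos >= 0; --pos) {
        if ((b >> pos) & 1u) {
            M += std::ldexp(1.0, pos - 23);  // adds 2^(pos-23)
        }
    }
    // step 4: restore the hidden bit and apply the unbiased exponent
    double magnitude = std::ldexp(1.0 + M, static_cast<int>(E) - 127);
    return S ? -magnitude : magnitude;
}
```

Applied to the pattern derived in the earlier example, `decode(0xC0940000u)` gives back -4.625 (0xC0940000 is 11000000100101000000000000000000 written in hexadecimal).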

Example: Find the number represented by 1 10001010 10110100010000000000000. Following the method above:
(1) S = b[31] = 1, so k is negative.
(2) E = b[30..23] = 10001010 = 138, so the exponent is 138 − 127 = 11.
(3) M = b[22..0] = 10110100010000000000000 = $2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-10}$ = 0.7041015625
(4) $k = (-1)^{S} \cdot 1.M \cdot 2^{E-127} = -1.7041015625 \cdot 2^{138-127} = -1.7041015625 \cdot 2048 = -3490$
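For the opposite direction, the four steps for converting a real value k to its representation b can be sketched in the same style. The function name `encode` is hypothetical. The sketch assumes that k is finite, non-zero, and inside the normalized single-precision range, and it simply truncates any mantissa bits beyond the 23rd, whereas real hardware normally rounds.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Encode k into a 32-bit pattern following steps (1)-(4) of the method.
std::uint32_t encode(double k) {
    assert(std::isfinite(k) && k != 0.0);  // assumption: no special values
    std::uint32_t S = k < 0 ? 1u : 0u;     // step 1: sign bit
    double a = std::fabs(k);
    int exponent = 0;
    while (a >= 2.0) { a /= 2.0; ++exponent; }  // step 2: divide down to [1, 2)
    while (a < 1.0)  { a *= 2.0; --exponent; }  // step 3: multiply up to [1, 2)
    assert(exponent >= -126 && exponent <= 127);
    std::uint32_t E = static_cast<std::uint32_t>(exponent + 127);  // add the bias
    // step 4: write the part after the binary point as negative powers of 2
    double frac = a - 1.0;
    std::uint32_t M = 0;
    for (int pos = 22; pos >= 0; --pos) {
        double weight = std::ldexp(1.0, pos - 23);  // 2^(pos-23)
        if (frac >= weight) {
            M |= (1u << pos);
            frac -= weight;
        }
    }
    return (S << 31) | (E << 23) | M;
}
```

With this sketch, `encode(-4.625)` produces 0xC0940000 and `encode(-3490.0)` produces 0xC55A2000, which are the two bit patterns from the worked examples written in hexadecimal.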

Some special values

Some special combinations of S, E, and M are used to represent special values such as 0 and ±∞:

S        E          M                    Meaning
0 or 1   00000000   0000000…0            Zero
1        11111111   0000000…0            −∞
0        11111111   0000000…0            +∞
0 or 1   11111111   any non-zero value   Not-a-Number (NaN)
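A check for these special combinations might look like the sketch below. The `Kind` enumeration and the `classify` function are hypothetical names, and the sketch assumes, as IEEE 754 specifies, that a NaN is distinguished from an infinity by a non-zero mantissa.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the special values listed in the table.
enum class Kind { Zero, Infinity, NaN, Other };

// Classify a value from its stored exponent E (8 bits) and mantissa M (23 bits).
// The sign bit S only selects between +/- zero and +/- infinity.
Kind classify(std::uint32_t E, std::uint32_t M) {
    assert(E <= 0xFFu && M < (1u << 23));
    if (E == 0x00u && M == 0u) {
        return Kind::Zero;
    }
    if (E == 0xFFu) {
        return M == 0u ? Kind::Infinity : Kind::NaN;
    }
    return Kind::Other;  // everything not covered by the table
}
```

For example, `classify(0xFF, 0)` reports an infinity, while `classify(0xFF, 1)` reports a NaN.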

Double-precision numbers

Double-precision numbers are very similar to single-precision numbers but occupy 64 bits. The only differences are the size of the exponent (11 bits), the bias of the exponent (1023), the range of values of the exponent (-1022 to 1023), and the number of bits used to store the mantissa (52 bits).
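The same field extraction carries over to double. The sketch below assumes the standard IEEE 754 64-bit layout (sign in bit 63, an 11-bit exponent in bits 62..52 with a bias of 1023, and a 52-bit mantissa in bits 51..0); `DoubleFields`, `split_double`, and `true_exponent` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 &&
              sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes a 64-bit IEEE 754 double");

// Hypothetical helper type holding the three fields of a double.
struct DoubleFields {
    std::uint64_t sign;      // bit 63
    std::uint64_t exponent;  // bits 62..52, biased by 1023
    std::uint64_t mantissa;  // bits 51..0
};

// Copy the bytes of the double into an integer and cut out the fields.
DoubleFields split_double(double k) {
    std::uint64_t b = 0;
    std::memcpy(&b, &k, sizeof b);
    return DoubleFields{b >> 63, (b >> 52) & 0x7FFu,
                        b & ((std::uint64_t{1} << 52) - 1)};
}

// Remove the bias from the stored exponent of a normalized value.
int true_exponent(const DoubleFields& f) {
    assert(f.exponent >= 1u && f.exponent <= 2046u);  // no special values
    return static_cast<int>(f.exponent) - 1023;
}
```

For k = -4.625, this gives a sign of 1, a stored exponent of 1025, and a true exponent of 2, matching the single-precision example apart from the different bias.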