Information about the IEEE Floating-Point Standard

The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including negative zero and denormal numbers) and special values (infinities and NaNs) together with a set of floating-point operations that operate on these values. It also specifies four rounding modes and five exceptions (including when the exceptions occur, and what happens when they do occur).

IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (≥ 43-bit, not commonly used) and double-extended precision (≥ 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic be implemented, although sometimes it is optional. For example, the C programming language, which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (the C float type is typically IEEE single-precision and double is typically IEEE double-precision).

The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), and it is also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally the reference number was IEC 559:1989).[1] A later standard, IEEE 854-1987, covers "radix independent floating point", with the radix restricted to 2 or 10.

Anatomy of a floating-point number

Following is a description of the standard's format for floating-point numbers.

Bit conventions used in this article

Bits within a word of width W are indexed by integers in the range 0 to W−1 inclusive. The bit with index 0 is drawn on the right. The lowest indexed bit is usually the lsb (Least Significant Bit, the one that if changed would cause the smallest variation of the represented value).

General layout

The three fields in an IEEE 754 float
Binary floating-point numbers are stored in a sign-magnitude form where the most significant bit is the sign bit, exponent is the biased exponent, and "fraction" is the significand minus the most significant bit.

Exponent biasing

The exponent is biased by 2^(e−1) − 1, where e is the number of bits in the exponent field. See also Excess-N. Biasing is done because exponents have to be signed values in order to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored, by adjusting its value to put it within an unsigned range suitable for comparison.

For example, to represent a number which has an exponent of 17, the stored exponent is 17 + 2^(e−1) − 1. Assuming e = 8, the stored exponent is equal to 17 + 128 − 1 = 144.
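As a minimal sketch of the biasing rule above, the stored exponent can be computed from the field width e. The function names `bias` and `biased_exponent` are illustrative only and do not come from the standard.

```python
def bias(exponent_bits: int) -> int:
    """Bias for an exponent field that is `exponent_bits` wide: 2^(e-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


def biased_exponent(true_exponent: int, exponent_bits: int) -> int:
    """Value stored in the exponent field for a given true exponent."""
    return true_exponent + bias(exponent_bits)


print(bias(8))                 # 127, the single-precision bias
print(bias(11))                # 1023, the double-precision bias
print(biased_exponent(17, 8))  # 144, the worked example above
```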

Cases

The most significant bit of the significand (not stored) is determined by the value of exponent, where e is the width of the exponent field in bits. If 0 < exponent < 2^e − 1, the most significant bit of the significand is 1, and the number is said to be normalized. If exponent is 0, the most significant bit of the significand is 0 and the number is said to be denormalized. Three special cases arise:
  1. if exponent is 0 and fraction is 0, the number is ±0 (depending on the sign bit)
  2. if exponent = 2^e − 1 and fraction is 0, the number is ±infinity (again depending on the sign bit), and
  3. if exponent = 2^e − 1 and fraction is not 0, the number being represented is not a number (NaN).


This can be summarized as:

Type                  Exponent       Fraction
Zeroes                0              0
Denormalized numbers  0              non zero
Normalized numbers    1 to 2^e − 2   any
Infinities            2^e − 1        0
NaNs                  2^e − 1        non zero
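The classification above can be sketched for the 32-bit format as follows. This is an illustrative helper, not a standard function: the name `classify_single` is made up here, and it relies on Python's `struct` module to convert a value to single precision (which raises OverflowError for finite values too large for that format).

```python
import struct


def classify_single(x: float) -> str:
    """Classify x after conversion to IEEE 754 single precision."""
    # Reinterpret the 32-bit float as an unsigned integer (big-endian on both sides).
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent field
    fraction = bits & 0x7FFFFF       # 23-bit fraction field
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:             # all ones, i.e. 2^e - 1 with e = 8
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


for value in (0.0, -0.0, 1e-40, 1.0, float("inf"), float("nan")):
    print(value, classify_single(value))
```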

Single-precision 32 bit

A single-precision binary floating-point number is stored in 32 bits.
Bit values for the IEEE 754 32-bit float 0.15625


The exponent is biased by 2^(8−1) − 1 = 127 in this case (exponents in the range −126 to +127 are representable; see the above explanation to understand why biasing is done). An exponent of −127 would be biased to the value 0 but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255 but this is reserved to encode an infinity or not a number (NaN). See the chart above.

For normalised numbers, the most common, exponent is the biased exponent and fraction is the significand minus the most significant bit.

The number has value v:

v = s × 2^e × m

Where

s = +1 (positive numbers) when the sign bit is 0

s = −1 (negative numbers) when the sign bit is 1

e = Exp − 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")

m = 1.fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of the fraction). Therefore, 1 ≤ m < 2.

In the example shown above, the sign is zero, the exponent is −3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 × 2^−3, which is +0.15625.
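The formula for v can be checked with a short sketch. The helper name `decode_single` is hypothetical, and the bit pattern 0x3E200000 is simply the 0.15625 example written as a 32-bit integer (sign 0, Exp 0111 1100, fraction 010 followed by zeros). Only normalized numbers are handled.

```python
def decode_single(bits: int) -> float:
    """Evaluate v = s * 2**e * m for a normalized 32-bit pattern."""
    sign_bit = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if not 0 < exp_field < 255:
        raise ValueError("this sketch handles normalized numbers only")
    s = -1 if sign_bit else 1
    e = exp_field - 127          # remove the bias
    m = 1 + fraction / 2 ** 23   # implicit leading 1, so 1 <= m < 2
    return s * 2.0 ** e * m


print(decode_single(0x3E200000))  # 0.15625
```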

Notes:
  1. Denormalized numbers are the same except that e = −126 and m is 0.fraction. (e is NOT −127 : The fraction has to be shifted to the right by one more bit, in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to −126 for the calculation.)
  2. −126 is the smallest exponent for a normalized number
  3. There are two Zeroes, +0 (s is 0) and −0 (s is 1)
  4. There are two Infinities +∞ (s is 0) and −∞ (s is 1)
  5. NaNs may have a sign and a fraction, but these have no meaning other than for diagnostics; the first bit of the fraction is often used to distinguish signaling NaNs from quiet NaNs
  6. NaNs and Infinities have all 1s in the Exp field.
  7. The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are ±2^−149 ≈ ±1.4012985 × 10^−45
  8. The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are ±2^−126 ≈ ±1.175494351 × 10^−38
  9. The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the Exp field and all 1s in the fraction field) are ±(1 − (1/2)^24) × 2^128 [1] ≈ ±3.4028235 × 10^38


Here is the summary table from the previous section with some 32-bit single-precision examples:

Type                       Exponent   Significand                   Value
Zero                       0000 0000  000 0000 0000 0000 0000 0000  0.0
One                        0111 1111  000 0000 0000 0000 0000 0000  1.0
Small denormalized number  0000 0000  000 0000 0000 0000 0000 0001  1.4 × 10^−45
Large denormalized number  0000 0000  111 1111 1111 1111 1111 1111  1.18 × 10^−38
Large normalized number    1111 1110  111 1111 1111 1111 1111 1111  3.4 × 10^38
Small normalized number    0000 0001  000 0000 0000 0000 0000 0000  1.18 × 10^−38
Infinity                   1111 1111  000 0000 0000 0000 0000 0000  Infinity
NaN                        1111 1111  non zero                      NaN
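The rows of the table can be reproduced by assembling the three fields into a 32-bit word and reinterpreting it as a float. This is only a sketch: `single_from_fields` is a made-up helper name, and the conversion goes through Python's `struct` module.

```python
import math
import struct


def single_from_fields(exp_field: int, fraction: int, sign: int = 0) -> float:
    """Build a single-precision value from its three raw fields."""
    bits = (sign << 31) | (exp_field << 23) | fraction
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value


smallest_denormal = single_from_fields(0, 1)          # 2^-149
largest_denormal = single_from_fields(0, 0x7FFFFF)    # 2^-126 - 2^-149
smallest_normal = single_from_fields(1, 0)            # 2^-126
largest_normal = single_from_fields(254, 0x7FFFFF)    # (1 - 2^-24) * 2^128
infinity = single_from_fields(255, 0)
not_a_number = single_from_fields(255, 1)

print(smallest_denormal, largest_denormal, smallest_normal, largest_normal)
print(infinity, math.isnan(not_a_number))
```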

A more complex example

Bit values for the IEEE 754 32-bit float −118.625
Let us encode the decimal number −118.625 using the IEEE 754 system.
  1. First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".
  2. Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101. We get the 101 after the radix point like this:
     • 0.625 × 2 = 1.25, which means we write 1 after the radix point and continue with the remaining 0.25
     • 0.25 × 2 = 0.5, which means we write 0 and continue with 0.5
     • 0.5 × 2 = 1.0, which means we write 1 and stop, since no remainder is left
  3. Next, let's move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 × 2^6. This is a normalized floating-point number. The first 1 binary digit is dropped. The fraction is the part at the right of the radix point, filled with 0 on the right until we get all 23 bits. That is 11011010100000000000000.
  4. The exponent is 6, but we need to convert it to binary and bias it (so the most negative exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.
  5. Putting the three fields together, the encoding is 1 10000101 11011010100000000000000.
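The hand encoding of −118.625 can be cross-checked by letting Python produce the single-precision bit pattern and splitting it into fields. This is a small sketch using the standard `struct` module; the variable names are illustrative.

```python
import struct

# Convert to single precision and reinterpret the 4 bytes as an unsigned integer.
(bits,) = struct.unpack(">I", struct.pack(">f", -118.625))

sign = bits >> 31
exp_field = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF

print(sign)                      # 1
print(format(exp_field, "08b"))  # 10000101 (133 = 6 + 127)
print(format(fraction, "023b"))  # 11011010100000000000000
print(hex(bits))                 # 0xc2ed4000
```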

Double-precision 64 bit

The three fields in a 64-bit IEEE 754 float
Double precision is essentially the same except that the fields are wider: 1 sign bit, 11 exponent bits and 52 fraction bits.

The fraction part is much larger, while the exponent is only slightly larger. The standard creators believed precision is more important than range.

NaNs and Infinities are represented with Exp being all 1s (2047).

For normalized numbers the exponent bias is +1023 (so e is Exp − 1023). For denormalized numbers the exponent is −1022 (the minimum exponent for a normalized number; it is not −1023 because normalized numbers have a leading 1 digit before the binary point and denormalized numbers do not). As before, both infinity and zero are signed.

Notes:
  1. The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are ±2^−1074 ≈ ±5 × 10^−324
  2. The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are ±2^−1022 ≈ ±2.2250738585072014 × 10^−308
  3. The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the Exp field and all 1s in the fraction field) are ±(1 − (1/2)^53) × 2^1024 [1] ≈ ±1.7976931348623157 × 10^308
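These double-precision limits can be checked directly, assuming (as is the case on common platforms) that Python's float is an IEEE 754 double. The variable names below are illustrative.

```python
import math
import sys

smallest_denormal = math.ldexp(1.0, -1074)             # 2^-1074
smallest_normal = math.ldexp(1.0, -1022)               # 2^-1022
largest_finite = math.ldexp(1.0 - 2.0 ** -53, 1024)    # (1 - 2^-53) * 2^1024

print(smallest_denormal)                         # 5e-324
print(smallest_normal == sys.float_info.min)     # True
print(largest_finite == sys.float_info.max)      # True
```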


Comparing floating-point numbers

Every possible bit combination is either a NaN or a number with a unique value in the affinely extended real number system with its associated order, except for the two bit combinations negative zero and positive zero, which sometimes require special attention (see below). The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply.
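The sign-and-magnitude ordering described above can be sketched by mapping each non-NaN double to an integer key. The function name `ordering_key` is hypothetical; packing and unpacking both use big-endian byte order, which sidesteps the endianness issue mentioned in the text.

```python
import struct


def ordering_key(x: float) -> int:
    """Map a non-NaN double to an integer whose order matches the float order."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    magnitude = bits & 0x7FFFFFFFFFFFFFFF   # everything except the sign bit
    # For negative numbers a larger magnitude means a smaller value, so negate.
    # Both -0.0 and +0.0 map to 0, which makes them compare equal.
    return -magnitude if bits >> 63 else magnitude


values = [1.5, float("-inf"), 5e-324, -1.5, 0.0, float("inf"), -1e300]
print(sorted(values, key=ordering_key) == sorted(values))  # True
```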

Floating-point arithmetic is subject to rounding that may affect the outcome of comparisons on the results of the computations.

Although negative zero and positive zero are generally considered equal for comparison purposes, the relational operators and similar constructs of some programming languages treat them as distinct. According to the Java Language Specification[2], comparison and equality operators treat them as equal, but Math.min() and Math.max() distinguish them (officially starting with Java version 1.1 but actually with 1.1.1), as do the comparison methods equals(), compareTo() and even compare() of classes Float and Double. For C++, the standard does not have anything to say on the subject, so it is important to verify this (one environment tested treated them as equal when using a floating-point variable and treated them as distinct and with negative zero preceding positive zero when comparing floating-point literals).

Rounding floating-point numbers

The IEEE standard has four different rounding modes; the first is the default; the others are called directed roundings.
  • Round to Nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time (in IEEE 754r this mode is called roundTiesToEven to distinguish it from another round-to-nearest mode)
  • Round toward 0 – directed rounding towards zero
  • Round toward +∞ – directed rounding towards positive infinity
  • Round toward −∞ – directed rounding towards negative infinity.
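The default mode's tie-breaking rule can be observed in ordinary double arithmetic. The sketch below assumes Python floats are IEEE 754 doubles evaluated in the default round-to-nearest mode; the directed modes are not shown because the Python standard library offers no portable way to select them for binary floats.

```python
ulp = 2.0 ** -52    # spacing between adjacent doubles in [1, 2)
half = 2.0 ** -53   # exactly half of that spacing

# 1 + half lies midway between 1 (even last bit) and 1 + ulp (odd last bit).
tie_down = 1.0 + half
# (1 + ulp) + half lies midway between 1 + ulp (odd) and 1 + 2*ulp (even).
tie_up = (1.0 + ulp) + half

print(tie_down == 1.0)            # True: the tie went down to the even neighbor
print(tie_up == 1.0 + 2 * ulp)    # True: the tie went up to the even neighbor
```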

Extending the real numbers

The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, the projective mode was dropped, however. The Intel 8087 and Intel 80287 floating point co-processors both support this projective mode.[2][3][4]

Recommended functions and predicates

  • Under some C compilers, copysign(x,y) returns x with the sign of y, so abs(x) equals copysign(x,1.0). This is one of the few operations which operates on a NaN in a way resembling arithmetic. The function copysign is new in the C99 standard.
  • −x returns x with the sign reversed. This is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.
  • scalb (y, N)
  • logb (x)
  • finite (x) a predicate for "x is a finite value", equivalent to −Inf < x < Inf
  • isnan (x) a predicate for "x is a nan", equivalent to "x ≠ x"
  • x <> y which turns out to have different exception behavior than NOT(x = y).
  • unordered (x, y) is true when "x is unordered with y", i.e., either x or y is a NaN.
  • class (x)
  • nextafter(x,y) returns the next representable value from x in the direction towards y
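Several of these functions and predicates have close counterparts in Python's `math` module, which makes their behavior easy to try out. This is only an illustration in one language, assuming IEEE 754 doubles and the default rounding mode; it is not a statement about how the standard names or specifies them.

```python
import math

nan = float("nan")
inf = float("inf")

magnitude_with_sign = math.copysign(3.0, -0.0)   # 3.0 with the sign of -0.0
sign_of_negated_zero = math.copysign(1.0, -(0.0))     # -x reverses the sign of zero
sign_of_difference = math.copysign(1.0, 0.0 - 0.0)    # 0 - 0 is +0 in round-to-nearest

print(magnitude_with_sign)                        # -3.0
print(sign_of_negated_zero, sign_of_difference)   # -1.0 1.0
print(math.isnan(nan), nan != nan)                # True True
print(math.isfinite(1e308), math.isfinite(inf))   # True False
print(math.isnan(math.copysign(nan, -1.0)))       # True: copysign accepts a NaN
```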

References

1. Prof. W. Kahan (October 1, 1997). "Lecture Notes on the Status of IEEE 754" (PDF). Elect. Eng. & Computer Science, University of California. Retrieved on 2007-04-12.
2. John R. Hauser (March 1996). "Handling Floating-Point Exceptions in Numeric Programs" (PDF). ACM Transactions on Programming Languages and Systems 18 (2).
3. David Stevenson (March 1981). "IEEE Task P754: A proposed standard for binary floating-point arithmetic". Computer 14 (3): 51–62.
4. Kahan, W. and Palmer, J. (1979). "On a proposed floating-point standard". SIGNUM Newsletter 14 (Special): 13–21.

Revision of the standard

Note that the IEEE 754 standard is currently under revision. See: IEEE 754r

See also

  • minifloat for simple examples of properties of IEEE 754 floating point numbers
  • −0 (negative zero)
  • IEEE 754r working group to revise IEEE 754-1985.
  • Intel 8087 (early implementation effort)
  • Q (number format), for constant resolution



This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.