arXiv:1004.3374
# On the Precision Attainable with Various Floating-Point Number Systems ∗
Richard P. Brent †
## Abstract
For scientific computations on a digital computer the set of real numbers is usually approximated by a finite set F of 'floating-point' numbers. We compare the numerical accuracy possible with different choices of F having approximately the same range and requiring the same word length. In particular, we compare different choices of base (or radix) in the usual floating-point systems. The emphasis is on the choice of F , not on the details of the number representation or the arithmetic, but both rounded and truncated arithmetic are considered. Theoretical results are given, and some simulations of typical floating-point computations (forming sums, solving systems of linear equations, finding eigenvalues) are described. If the leading fraction bit of a normalized base 2 number is not stored explicitly (saving a bit), and the criterion is to minimise the mean square roundoff error, then base 2 is best. If unnormalized numbers are allowed, so the first bit must be stored explicitly, then base 4 (or sometimes base 8) is the best of the usual systems.
Index Terms : Base, floating-point arithmetic, radix, representation error, rms error, rounding error, simulation.
## 1 Introduction
A real number x is usually approximated in a digital computer by an element fl( x ) of a finite set F of 'floating-point' numbers. We regard the elements of F as exactly representable real numbers, and take fl( x ) as the floating-point number closest to x . The definition of 'closest', rules for breaking ties, and the possibility of truncating instead of rounding are discussed later.
We restrict our attention to binary computers in which floating-point numbers are represented in a word (or multiple word) of fixed length w bits, using some convenient (possibly redundant) code. Usually F is a set of numbers of the form
$$s\,\beta^{e} \sum_{i=1}^{t} d_i\,\beta^{-i}, \qquad (1.1)$$

where β = 2^k > 1 is the base (or radix), t > 0 is the number of digits, s = ±1 is a sign, and e is an exponent in some fixed range

$$m \le e \le M, \qquad (1.2)$$
∗ Earlier versions appeared as Report TR RC 3751, IBM Research (February 1972); and in IEEE Transactions on Computers , C-22 (June 1973), 601-607 (manuscript received May 15, 1972; revised May 31, 1972). Retyped (with corrections) in LaTeX by Frances Page, July 2000.
† Former address: Mathematical Sciences Department, IBM T.J. Watson Research Center, Yorktown Heights, N.Y. 10598. Current address: Mathematical Sciences Institute, Australian National University, Canberra. Copyright © 1972-2010, Richard P. Brent. rpb017 typeset using LaTeX.
and each d_i is a β-ary digit 0, 1, ..., β − 1. Other possible floating-point number systems (i.e., choices of F) are mentioned in Section 3.
Since the coding of the exponent e and the signed fraction (s; d_1, ..., d_t) must fit into w bits, there is a tradeoff between precision and range. (A discussion of precision and range requirements for general scientific computing may be found in Cody [7, 8].) We do not consider this tradeoff; instead we suppose that the range and word length are prescribed, and we study the dependence of the precision on the base β.
With higher bases fewer bits are needed for the exponent, so more are available for the fraction (see Section 2 for details). However, more leading fraction bits may be zero, so the best choice of base is not immediately obvious. Our aim is to compare the attainable precision of systems with different bases. Theoretical results are given in Sections 4 and 5, and some simulations are described in Sections 6 and 7. The conclusions are summarised in Section 7.
Since we are interested in the precision attainable with different number systems, we assume that the arithmetic is the best possible. In other words, if x, y ∈ F, and ⋆ is an arithmetic operation, we assume that x ⋆ y is found to sufficient accuracy to give the correct (rounded) result fl(x ⋆ y). Ensuring this may be too expensive in practice, but our conclusions should be valid provided several guard digits are used when computing fl(x ⋆ y). The reduction in precision caused by using only a small number of guard digits is discussed by Kuki and Cody [18].
## 2 The Usual Systems
A floating-point number of the form (1.1) may be written as

$$s \sum_{j=1}^{u} b_j\, 2^{ke-j}, \qquad (2.1)$$

where b_{k(i−1)+1} ··· b_{ki} is the binary form of the 2^k-ary digit d_i, and u = kt is the number of bits required to code d_1, ..., d_t. We use (2.1) in preference to (1.1), and do not insist that t must be an integer. The details of the coding of the exponent e and the signed fraction (s; b_1, ..., b_u) in a w-bit word do not concern us.
The representation (2.1) is said to be normalised if at least one of b_1, ..., b_k is nonzero. From (2.1) and the bound (1.2) on e, the largest and smallest floating-point numbers having a normalised representation are

$$f_{\max} = 2^{kM}(1 - 2^{-u}) \qquad (2.2)$$

and

$$f_{\min} = 2^{km}, \qquad (2.3)$$

respectively. If the range R of the system is defined to be log_2(f_max/f_min) then, neglecting the term 2^{-u} in (2.2),

$$k(M - m) = R. \qquad (2.4)$$
Thus, for systems with the same range, k(M − m) is invariant.

Goldberg [10], McKeeman [21], and others have observed that with base 2 the leading fraction bit b_1 can be implicit, provided only normalized representations of nonzero numbers are allowed and a special exponent is reserved for zero. Define

$$p = \begin{cases} 2, & \text{if } \beta = 2 \text{ and the leading fraction bit is implicit} \\ 1, & \text{otherwise,} \end{cases} \qquad (2.5)$$

so u − log_2 p bits are required to code the fraction (b_1, ..., b_u). One bit is required for the sign, and at least ⌈log_2(M − m)⌉ for the exponent. Thus

$$u - \log_2 p + 1 + \lceil \log_2 (M - m) \rceil \le w. \qquad (2.6)$$
For a sensible design, equality will hold in (2.6), and M − m will be a power of two (or one less if the exponent is coded in one's complement or a special exponent is reserved for zero, but such minor differences are unimportant). Thus (2.4) gives

$$2^{-u} k p = 2^{1-w} R. \qquad (2.7)$$
The right side depends only on the word length and the range, so (2.7) gives a useful relation between the fraction length u and the base β = 2 k .
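The tradeoff can be made concrete in a few lines of code. The following sketch (the function name `fraction_bits` is ours, not the paper's) counts the fraction bits left after one sign bit, ⌈log_2(M − m)⌉ exponent bits, and the log_2 p bits recovered by an implicit leading bit; with w = 32 and R = 512 it reproduces the systems S_1 to S_5 of Section 6.

```python
import math

def fraction_bits(w, R, k, p):
    """Fraction bits u available in a w-bit word of range R, base beta = 2**k,
    with p = 2 iff beta = 2 and the leading bit is implicit (cf. (2.5)-(2.7))."""
    M_minus_m = R // k                        # from (2.4): k(M - m) = R
    exp_bits = math.ceil(math.log2(M_minus_m))
    return w - 1 - exp_bits + int(math.log2(p))

# With w = 32 and R = 512, as in Section 6:
print(fraction_bits(32, 512, 1, 2))   # implicit base 2  -> 23 (S1)
print(fraction_bits(32, 512, 1, 1))   # explicit base 2  -> 22 (S3)
print(fraction_bits(32, 512, 2, 1))   # base 4           -> 23 (S2)
print(fraction_bits(32, 512, 4, 1))   # base 16          -> 24 (S4)
print(fraction_bits(32, 512, 8, 1))   # base 256         -> 25 (S5)
```

Note that the total 2^{-u} k p is the same (here 2^{-22}) in every case, as (2.7) requires.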
Many different systems of the class described here have actually been used. They include, with various word lengths, ranges and rounding (or truncating) rules: β = 2, p = 2 (e.g., PDP 11-45); β = 2, p = 1 (e.g., CDC 6400); β = 4 (Illiac II); β = 8 (Burroughs 5500); and β = 16 (IBM 360).
In some machines, bases other than β are used in the arithmetic unit. For example, in the ILLIAC III (Atkins [2]) multiplication and division are performed with base 256, but numbers are stored with base 16.
## 3 Other Systems
Morris [22] suggests using 'tapered' systems in which the division of bits between the exponent and the fraction depends on the exponent. The idea is to have a longer fraction for the (commonly occurring) numbers with exponents close to zero than for numbers with large exponents. We do not consider these interesting systems here.
Brown and Richman [5] assume that floating-point numbers are represented in a computer word with two sign bits and a fixed number of q -state devices, for some fixed q ≥ 2, and they compare bases of the form q k . Although the results of Sections 4 and 5 can be generalized easily to cover their assumptions, we restrict ourselves to q = 2, for this is the only case of practical importance.
Finally, we describe a 'logarithmic' system that is interesting for theoretical reasons (see Section 4), although it is impractical (because of the difficulty of performing floating-point additions). Let a and b be positive integers which, together with the word length w, characterize the system. The floating-point numbers are zero and all nonzero real numbers x such that a log_2 |x| + b is one of the integers 1, 2, ..., 2^{w−1} − 1. If

$$\lambda(x) = \begin{cases} 0, & \text{if } x = 0 \\ \mathrm{sign}(x)\,(a \log_2 |x| + b), & \text{if } x \ne 0, \end{cases} \qquad (3.1)$$

then the floating-point number x may be represented in a computer word by a convenient code for the integer λ(x). Since

$$x(\lambda) = \begin{cases} 0, & \text{if } \lambda = 0 \\ \mathrm{sign}(\lambda)\, 2^{(|\lambda| - b)/a}, & \text{if } \lambda \ne 0, \end{cases} \qquad (3.2)$$
the largest and smallest positive floating-point numbers are

$$f_{\max} = 2^{(2^{w-1} - 1 - b)/a} \qquad (3.3)$$

and

$$f_{\min} = 2^{(1 - b)/a}, \qquad (3.4)$$

respectively, and the range log_2(f_max/f_min) is

$$R = \frac{2^{w-1} - 2}{a}. \qquad (3.5)$$

For example, taking a = 2^{w−10} and b = 2^{w−2} gives f_max ≃ 2^{256}, f_min ≃ 2^{−256}, and R ≃ 512. If x and y are positive floating-point numbers with f_min ≤ xy ≤ f_max then, from (3.1),

$$\lambda(xy) = \lambda(x) + \lambda(y) - b. \qquad (3.6)$$
Thus, floating-point multiplication and division are easy to perform in a logarithmic system, and do not introduce any rounding errors. Unfortunately, there does not seem to be any easy way to perform floating-point addition.
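A minimal sketch of this system for positive numbers (the function names `lam` and `decode` are ours; sign handling is omitted), using the parameters a = 2^22 and b = 2^30 of system S_0 in Section 6. Multiplication is exact integer arithmetic on codes, as in (3.6); the only errors are the two encodings.

```python
import math

a, b = 2**22, 2**30          # the parameters of system S0 (w = 32, R ~ 512)

def lam(x):
    """Integer code lambda(x) of (3.1), for x > 0."""
    return round(a * math.log2(x) + b)

def decode(code):
    """Inverse map (3.2), for a positive code."""
    return 2.0 ** ((code - b) / a)

x, y = 3.7, 11.25
prod = decode(lam(x) + lam(y) - b)    # exact multiplication of the codes, (3.6)
print(prod, x * y)                    # agree to within the two encoding errors
```

Each encoding is in error by at most half a code step, a relative error of about (log 2)/(2a) ≈ 10^−7 here, so the decoded product agrees with x·y to roughly seven significant figures.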
## 4 The Worst Case Relative-Error Criterion
One measure of the precision of a floating-point number system is the worst relative error ε made in approximating a real number x (not too large or small) by fl(x), i.e.,

$$\epsilon = \sup_{f_{\min} \le |x| \le f_{\max}} \left| \frac{x - \mathrm{fl}(x)}{x} \right|. \qquad (4.1)$$

The 'worst case relative-error' criterion is simply to choose a number system (with the prescribed R and w) to minimise ε.
For the logarithmic systems described in Section 3, we see from (3.2) that

$$\epsilon = 2^{1/(2a)} - 1 \simeq \frac{\log 2}{2a}. \qquad (4.2)$$

(Here and later we neglect terms of order a^{−2} or 2^{−2u}, and logarithms are natural unless otherwise indicated.) If

$$\epsilon_0 = R\, 2^{-w} \log 2, \qquad (4.3)$$

then (3.5) and (4.2) give

$$\epsilon \simeq \epsilon_0 \qquad (4.4)$$

for the logarithmic systems.

Now consider any floating-point number system with range R and word length w. If ε is defined by (4.1), then

$$\epsilon \gtrsim \epsilon_0. \qquad (4.5)$$

(In a logarithmic system, the logarithms of positive floating-point numbers are uniformly spaced and all bit patterns are used.) Thus we use logarithmic systems as a standard of comparison for other, more practical, systems.

It is easy to show that

$$\epsilon = 2^{k-u-1} \qquad (4.6)$$

for the number systems of Section 2. From (2.7), (4.3) and (4.6),

$$\frac{\epsilon}{\epsilon_0} = \frac{2^k}{k p \log 2} = f_1(k, p), \qquad (4.7)$$

which shows how much ε exceeds the best possible value ε₀ for a number system with the same R and w. Table 1 gives f_1(k, p) for k = 1, 2, ..., 8.
TABLE 1 THEORETICAL WORST CASE AND RMS ERRORS*
| k | p | β = 2 k | f 1 ( k,p ) | f 2 ( k,p ) |
|-----|-----|-----------|---------------|---------------|
| 1 | 2 | 2 | 1.44 | 1.06 |
| 1 | 1 | 2 | 2.89 | 2.12 |
| 2 | 1 | 4 | 2.89 | 1.68 |
| 3 | 1 | 8 | 3.85 | 1.87 |
| 4 | 1 | 16 | 5.77 | 2.45 |
| 5 | 1 | 32 | 9.23 | 3.51 |
| 6 | 1 | 64 | 15.4 | 5.34 |
| 7 | 1 | 128 | 26.4 | 8.47 |
| 8 | 1 | 256 | 46.2 | 13.9 |
* See (4.7) and (5.8) for definitions of f 1 and f 2 .
The table shows that the implicit-first-bit base 2 systems are the best of those described in Section 2, and close to the best possible, on the worst case criterion. Of the explicit-first-bit systems, base 2 and base 4 are equally good. This may be explained as follows. Changing from base 2 to base 4 frees a bit from the exponent for the fraction. If the first 4-ary digit d 1 is 2 or 3, the first fraction bit b 1 is 1, and the extra fraction bit may increase the precision. However, if d 1 is 1 then b 1 is 0, and the bit gained is wasted. (According to Richman [24], Goldberg observed this independently.) If fl( x ) is defined by truncation rather than rounding then /epsilon1 is doubled, but the comparison between different bases is not changed.
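As a quick check, both columns of Table 1 can be recomputed directly from (4.7) and (5.8) (the helper names `f1` and `f2` mirror the table's notation):

```python
import math

ln2 = math.log(2.0)

def f1(k, p):
    """Worst-case ratio eps/eps0 of Eq. (4.7)."""
    return 2.0**k / (k * p * ln2)

def f2(k, p):
    """RMS ratio delta_rms/delta0 of Eq. (5.8)."""
    return math.sqrt((4.0**k - 1.0) / (2.0 * p**2 * (k * ln2)**3))

for k, p in [(1, 2), (1, 1), (2, 1), (3, 1), (4, 1)]:
    print(f"beta = {2**k:3d}, p = {p}:  f1 = {f1(k, p):5.2f},  f2 = {f2(k, p):4.2f}")
```

The printed values reproduce the first five rows of the table (1.44/1.06, 2.89/2.12, 2.89/1.68, 3.85/1.87, 5.77/2.45).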
## 5 The RMS Relative-Error Criterion
Consider forming the product of nonzero floating-point numbers x_0, ..., x_n (in one of the usual systems) by n floating-point multiplications, i.e., define p_0 = x_0 and p_i = fl(p_{i−1} x_i) for i = 1, ..., n. If δ_i = (p_{i−1} x_i − p_i)/(p_{i−1} x_i) is the relative error made in forming the i-th product, then the relative error in the final result is
$$\Delta = \frac{x_0 \cdots x_n - p_n}{x_0 \cdots x_n} = 1 - \prod_{i=1}^{n} (1 - \delta_i) = \sum_{i=1}^{n} \delta_i + \text{higher order terms}. \qquad (5.1)$$

Wilkinson [28] shows that

$$|\Delta| \lesssim n \epsilon, \qquad (5.2)$$

where ε is defined by (4.1), and we have neglected a term of order n²ε². Many other bounds on the rounding errors in algebraic processes are also of the form f(n)ε (see Wilkinson [28], [29]), which is a good reason for choosing a floating-point number system according to the worst case criterion of Section 4. However, the bound (5.2) is rather pessimistic, for the individual rounding errors δ_i in (5.1) usually tend to cancel rather than to reinforce each other. (We are assuming an unbiased rounding rule as described in Section 6. With truncation or biased rounding the bound (5.2) may be realistic.)
If the δ_i were independent random variables, distributed with mean 0 and variance σ_i², then Δ would be distributed with mean 0 and variance $\sum_{i=1}^{n} \sigma_i^2$. Thus a reasonable probabilistic measure of the precision of a floating-point number system is the root-mean-square (rms) value δ_rms of δ = (x − fl(x))/x, where x is distributed like the nonzero results of arithmetic operations performed during a typical floating-point computation.
The simulations described in Sections 6 and 7 suggest that the rms rounding error in floating-point computations involving many arithmetic operations is often roughly proportional to δ_rms (see also Weinstein [27]). Thus we prefer δ_rms to other probabilistic measures of precision such as the expected value of |δ| (McKeeman [21]), the expected value of log_2 |δ| (Kuki and Cody [18]), and the expected error in 'units in the last place' (Kahan [15]). We disregard errors in the conversion from internal floating-point results to decimal output, for the rms value of these errors depends on the number of decimal places rather than on the internal number system. (For the effect of repeated conversions back and forth, see Matula [20].)
What distribution should we assume for the nonzero real numbers x that are to be approximated by floating-point numbers? Hamming [11], Knuth [17], and others argue that we should assume that log | x | is uniformly distributed. There are two reasons why this assumption is only an approximation. Although log | x | may be approximately uniform locally, it is certainly not uniform on the entire interval [log f min , log f max ] . Also, the fine structure of the distribution is not uniform, for the numbers arising from multiplications or (more importantly) additions of floating-point numbers are really discrete rather than continuous variables. Nevertheless, we shall make Hamming's assumption in this section. It is certainly a much better approximation than assuming that x is uniformly distributed on some interval.
For the logarithmic systems, δ is uniformly distributed on [−ε₀, ε₀], where ε₀ is given by (4.3). Thus δ_rms = δ₀, where

$$\delta_0 = \frac{\epsilon_0}{\sqrt{3}} = \frac{R\, 2^{-w} \log 2}{\sqrt{3}}. \qquad (5.3)$$
Because the assumption that log | x | is uniform is only an approximation, there is no result corresponding to the inequality (4.5), but the logarithmic systems still provide a convenient standard of comparison for other, more practical, systems.
For the systems of Section 2, there is no loss of generality in assuming that x lies in [1/β, 1) and (by our assumption) log_β x is uniformly distributed on [−1, 0). Consider numbers y distributed uniformly on a small interval near x. The absolute error y − fl(y) is approximately uniform on (−2^{−u−1}, 2^{−u−1}). (It is certainly not logarithmically distributed, as is assumed to derive (18′) in Benschop and Ratz [3].) Hence α = (y − fl(y))/y is uniform on (−2^{−u−1}/x, 2^{−u−1}/x), and has probability density function (Feller [9])

$$g_x(\alpha) = \begin{cases} 2^u x, & \text{if } |\alpha| < 2^{-u-1}/x \\ 0, & \text{otherwise.} \end{cases} \qquad (5.4)$$

Integrating over the interval [1/β, 1), we see that δ is distributed with density

$$g(\delta) = \int_{1/\beta}^{1} g_x(\delta)\, \frac{dx}{x \log \beta} \qquad (5.5)$$

$$= \begin{cases} 2^u (1 - \beta^{-1})/\log \beta, & \text{if } |\delta| \le 2^{-u-1} \\ (2^u/\log \beta) \left( 2^{-u-1}/|\delta| - \beta^{-1} \right), & \text{if } 2^{-u-1} < |\delta| \le 2^{k-u-1} \\ 0, & \text{otherwise.} \end{cases} \qquad (5.6)$$
It is easy to find the expected value of δ, δ², |δ|, log_2 |δ|, etc. from (5.6). In particular, we find that δ is distributed with standard deviation

$$\delta_{\mathrm{rms}} = 2^{-u} \sqrt{\frac{4^k - 1}{24\, k \log 2}} \qquad (5.7)$$

and mean 0. (The mean is actually of order 2^{−2u}, but terms of this order have been neglected.)
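Formula (5.7) is easy to confirm empirically. The sketch below (function name `delta_rms_estimate` is ours) draws x with log_β x uniform on [−1, 0), rounds to the nearest u-bit fraction, and compares the observed rms of δ with the prediction:

```python
import math, random

def delta_rms_estimate(k, u, trials=200_000, seed=1):
    """Monte Carlo estimate of delta_rms for base beta = 2**k with a u-bit
    fraction, under Hamming's assumption that log|x| is uniform."""
    random.seed(seed)
    beta, total = 2**k, 0.0
    for _ in range(trials):
        x = beta ** (random.random() - 1.0)   # log-uniform on [1/beta, 1)
        fx = round(x * 2**u) / 2**u           # nearest point on the 2**-u grid
        total += ((x - fx) / x) ** 2
    return math.sqrt(total / trials)

k, u = 2, 23                                  # base 4 with a 23-bit fraction (S2)
predicted = 2.0**-u * math.sqrt((4.0**k - 1) / (24 * k * math.log(2)))
print(delta_rms_estimate(k, u) / predicted)   # close to 1
```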
From (2.7), (5.3) and (5.7),

$$\frac{\delta_{\mathrm{rms}}}{\delta_0} = \sqrt{\frac{4^k - 1}{2 p^2 (k \log 2)^3}} = f_2(k, p), \qquad (5.8)$$
and the last column of Table 1 gives f_2(k, p) for k = 1, ..., 8. The table shows that the implicit-first-bit base 2 systems are the best of the systems of Section 2 (and only 6 per cent worse than the logarithmic systems) on the rms relative-error criterion. Base 4 (closely followed by base 8) is best in the explicit-first-bit systems. The reason why base 4 is better than explicit base 2 is apparent from the discussion at the end of Section 4: |δ| is never greater for base 4 than for explicit base 2, and sometimes it is smaller. A similar argument shows that implicit base 2 is better than base 4.
Because of the different ranges possible with base 4 and base 8, there are some choices of minimal acceptable range for which base 8 is preferable to base 4, but bases higher than 8 are always inferior to base 4 on the rms relative-error criterion.
## 6 Simulation of Different Systems
Three classes of floating-point computations were run, using various number systems with w = 32 and R ≃ 512 (the same as for single-precision on the IBM 360 and many other computers). The systems were a logarithmic system S_0 with a = 2^{22} and b = 2^{30} (see Section 3), and the following examples of the systems described in Section 2.
S 1 : β = 2 , u = 23 , p = 2 (base 2 with a 23-bit fraction, the first bit implicit).
S 2 : β = 4 , u = 23 (base 4 with 23 bits or 11 1/2 digits).
S 3 : β = 2 , u = 22 , p = 1 (base 2 with 22 bits, all explicit).
S 4 : β = 16 , u = 24 (base 16 with 24 bits or 6 digits).
S 5 : β = 256 , u = 25 (base 256 with 25 bits or 3 1/8 digits).
S ′ 4 : The same as S 4 with truncation (towards zero) rather than rounding.
The rounding rule for systems S 1 to S 5 is the ' R *-mode' of Kuki and Cody [18]: fl( x ) is defined to be the floating-point number closest to x , and ties are broken by choosing fl( x ) so that its least significant fraction bit is one. Formally, if x is a nonzero real number with binary expansion
$$x = s \sum_{j=1}^{\infty} b_j\, 2^{ke-j} \qquad (6.1)$$

(taking the terminating expansion if there is one, normalizing so that one of b 1 , . . . , b k is nonzero, and neglecting the possibility of underflow or overflow), then

$$\mathrm{fl}(x) = \begin{cases} s \sum_{j=1}^{u} b_j\, 2^{ke-j}, & \text{if } b_{u+1} = 0, \text{ or if } b_u = 1 \text{ and } \sum_{j=u+2}^{\infty} b_j = 0 \\ s \left( \sum_{j=1}^{u} b_j\, 2^{ke-j} + 2^{ke-u} \right), & \text{otherwise.} \end{cases} \qquad (6.2)$$
The special case b_u b_{u+1} ··· = 11000··· is quite important, for it often occurs when x is the result of a floating-point addition, and neglecting it can lead to bias in the rounding.
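The R*-rule is simple to implement with exact rational arithmetic. The sketch below (the name `round_rstar` is ours) rounds a positive fraction to the nearest multiple of 2^−u, breaking exact ties so that the least significant retained bit is one:

```python
from fractions import Fraction

def round_rstar(x, u):
    """R*-mode rounding of a positive x to u bits: nearest multiple of
    2**-u, with exact ties sent to the neighbour whose last bit is one."""
    scaled = Fraction(x) * 2**u
    n = int(scaled)                          # truncation (x > 0)
    rem = scaled - n
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and n % 2 == 0):
        n += 1                               # round up; a tie lands on odd n
    return Fraction(n, 2**u)

# 5/16 lies exactly between 2/8 and 3/8; R* picks 3/8 (last bit one).
# 7/16 lies exactly between 3/8 and 4/8; R* again picks 3/8.
print(round_rstar(Fraction(5, 16), 3), round_rstar(Fraction(7, 16), 3))
```

Note that both ties resolve to an odd numerator, so repeated additions do not drift systematically up or down, which is the point of the rule.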
All the floating-point number systems were simulated on an IBM 360/91 computer, with arithmetic operations performed in double precision ( β = 16 , u = 56) before rounding or truncating appropriately. Thus, the number of guard digits used was effectively infinite. The data
were pseudorandom double-precision numbers distributed as described in Section 7, and 'exact' results were computed using double precision throughout.
Forming sums, solving systems of linear equations, and finding the eigenvalues of symmetric matrices were the chosen classes of floating-point computations. They appear to be fairly typical of computations in which the effect of rounding errors may be important. Details, and the results of the simulations, are given in Section 7. Other classes that have been considered include solving ordinary differential equations (Henrici [12], [13], Hull and Swenson [14]), fast Fourier transforms (Kaneko and Liu [16], Ramos [23], and Weinstein [27]), matrix iterative processes (Benschop and Ratz [3]), solving positive-definite linear systems (Tienari [25]), and forming products (Section 5).
## 7 Details and Results of the Simulations
## Sums
Let m and n be positive integers. A number z was drawn from a uniform distribution on [0, 1], then numbers x_1, ..., x_n were drawn independently from a uniform distribution on [−Z, Z], where Z = 256^z is a scale factor used to avoid a bias in favour of any of the number systems (see Kuki and Cody [18]). The approximate sums s_j of fl(x_1), ..., fl(x_n) were accumulated, in the usual way, with each of the number systems S_j described in Section 6, and the errors
$$\alpha_j = \frac{\sum_{i=1}^{n} x_i - s_j}{\sum_{i=1}^{n} |x_i|} \qquad (7.1)$$
were found. (The denominator is used in preference to ∑ n i =1 x i to ensure that α j is small.) The procedure was repeated m times and the rms values β j of the α j were found. For purposes of comparison between the systems, it is convenient to consider the normalized rms errors γ j = β j /β 0 . (Recall that β 0 is the rms error for the logarithmic system S 0 .)
Table 2 gives γ j for various choices of m and n . If the α j are considered as random variables drawn from a distribution with mean square B 2 j , then β j and γ j may be regarded as estimates of B j and B j /B 0 , respectively. m was chosen large enough to ensure that the standard error of the estimates γ j given in Tables 2-4 is less than five units in the last decimal place.
For n = 1 we are merely estimating the rms relative error in approximating x 1 by fl( x 1 ) , and the results agree with the predictions of Section 5 (see the last column of Table 1). Except for S ′ 4 , the effect of varying n is small, and does not affect the ranking of the systems.
It may be shown that

$$B_j = \begin{cases} O(n^{3/2}), & \text{for } S'_4 \\ O(n), & \text{for the other systems,} \end{cases} \qquad (7.2)$$
so it is not surprising that γ ′ 4 appears to grow like n 1 / 2 . (The same applies if truncation is downwards instead of towards zero.) Results for sums of positive numbers are similar, although B j is larger by a factor of order n 1 / 2 for all the systems.
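The growth of the truncation error relative to rounding is easy to observe in a small Monte Carlo sketch of the sums experiment. The parameters here (base 2, u = 12, 400 trials) and the function names are illustrative choices of ours, not those of the paper:

```python
import math, random

def to_u_bits(x, u, chop=False):
    """Reduce x to a u-bit binary fraction: truncation toward zero if chop,
    round-to-nearest otherwise."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * 2**u
    n = math.trunc(scaled) if chop else round(scaled)
    return math.ldexp(n / 2**u, e)

def rms_sum_error(n, u=12, trials=400, chop=False, seed=7):
    """RMS of the normalized error (7.1) when n terms are accumulated with
    the working precision cut to u bits after every operation."""
    random.seed(seed)
    acc = 0.0
    for _ in range(trials):
        xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
        s = 0.0
        for x in xs:
            s = to_u_bits(s + to_u_bits(x, u, chop), u, chop)
        acc += ((math.fsum(xs) - s) / sum(abs(x) for x in xs)) ** 2
    return math.sqrt(acc / trials)

for n in (8, 64):
    print(n, rms_sum_error(n, chop=True) / rms_sum_error(n, chop=False))
```

The ratio of truncated to rounded error grows with n, in line with the behaviour of γ′_4 in Table 2.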
TABLE 2 RESULTS FOR SUMS
| n | m/ 1000 | γ 1 | γ 2 | γ 3 | γ 4 | γ ′ 4 | γ 5 |
|-----|-----------|-------|-------|-------|-------|---------|-------|
| 1 | 1000 | 1.06 | 1.68 | 2.12 | 2.45 | 4.89 | 13.9 |
| 2 | 100 | 1.11 | 1.68 | 2.23 | 2.38 | 5.53 | 13.4 |
| 4 | 100 | 1.13 | 1.69 | 2.25 | 2.36 | 6.33 | 13.2 |
| 8 | 100 | 1.12 | 1.69 | 2.24 | 2.36 | 7.95 | 13.2 |
| 10 | 100 | 1.12 | 1.69 | 2.23 | 2.36 | 8.76 | 13.4 |
| 16 | 10 | 1.11 | 1.72 | 2.22 | 2.37 | 10.9 | 13.3 |
| 32 | 10 | 1.09 | 1.71 | 2.18 | 2.39 | 15.9 | 13.6 |
| 64 | 10 | 1.08 | 1.67 | 2.14 | 2.43 | 22.4 | 13.9 |
| 100 | 30 | 1.06 | 1.68 | 2.13 | 2.41 | 28.1 | 13.6 |
## Solving Systems of Linear Equations
z_1 and z_2 were drawn independently from a uniform distribution on [0, 1], giving scale factors Z_1 = 256^{z_1} and Z_2 = 256^{z_2}. Numbers a_{p,q} (p, q = 1, ..., n) were drawn independently from a uniform distribution on [−Z_1, Z_1]; and x_1, ..., x_n were drawn similarly from [−Z_2, Z_2]. For each of the number systems S_j, let A^{(j)} = (fl(a_{p,q})), A = (a_{p,q}), x = (x_p), b = (b_p) = Ax, and b^{(j)} = (fl(b_p)). The system of equations
$$A^{(j)} y = b^{(j)} \qquad (7.3)$$

was solved by Gaussian elimination with complete pivoting, giving the approximate solution y^{(j)}, and the error

$$\alpha_j = \frac{\| A y^{(j)} - b \|_2}{\| A \|_E\, \| x \|_2} \qquad (7.4)$$
was computed. (Here ‖A‖_E = (Σ_{p=1}^n Σ_{q=1}^n a_{p,q}²)^{1/2}. From results of Wilkinson [28], [29], α_j is small even if A is rather ill-conditioned.) The procedure was repeated m times, the rms values β_j of the α_j were computed, and the ratios γ_j = β_j/β_0 were found. The results for various m and n are given in Table 3.
TABLE 3 RESULTS FOR SYSTEMS OF LINEAR EQUATIONS
| n | m/ 1000 | γ 1 | γ 2 | γ 3 | γ 4 | γ ′ 4 | γ 5 |
|-----|-----------|-------|-------|-------|-------|---------|-------|
| 1 | 100 | 1.3 | 2.06 | 2.61 | 2.99 | 4.92 | 17 |
| 2 | 100 | 1.3 | 2.01 | 2.59 | 2.9 | 5.33 | 16.3 |
| 4 | 10 | 1.27 | 1.97 | 2.56 | 2.8 | 5.63 | 15.7 |
| 8 | 4 | 1.23 | 1.89 | 2.45 | 2.65 | 6.1 | 14.9 |
| 16 | 1 | 1.18 | 1.82 | 2.35 | 2.6 | 7.1 | 14.4 |
Multiplication and division are performed exactly in a logarithmic system, so β 0 is less than would otherwise be expected, and γ 1 , . . . , γ 5 are higher than for sums, especially for small values of n . However, the ratios of γ 1 , . . . , γ 5 are much the same as for sums, and the ranking of the systems is preserved. Results for positive a p,q and/or x p are similar.
It is interesting that γ′_4 < 2γ_4 for n = 1 and 2. When n = 1 and S′_4 is used, the errors made in forming fl(a_{1,1}) and fl(b_1) tend to cancel when fl(b_1)/fl(a_{1,1}) is computed. Presumably there is a similar, though less marked, effect for n > 1.
## Finding Eigenvalues of Symmetric Matrices
Numbers a_{p,q} (1 ≤ p ≤ q ≤ n) were drawn independently from a uniform distribution on [−Z, Z], where Z was a scale factor chosen as above. The other elements of A = (a_{p,q}) were defined by symmetry. For each number system S_j, the approximate eigenvalues λ^{(j)}_1 ≤ ··· ≤ λ^{(j)}_n of A^{(j)} = (fl(a_{p,q})) were computed by reducing A^{(j)} to tridiagonal form and then using the QR algorithm (Wilkinson [29]). We used translations of the Algol 60 procedures TRED1 (Martin et al [19]) and TQL1 (Bowdler et al [4]), except for some trivial modifications to avoid unnecessary rounding errors when n = 2. The stopping criterion for the QR algorithm was the same for all number systems. (The parameters macheps and tol of the procedures were set to 10^{−8} and 10^{−60}, respectively.) The errors
$$\alpha_j = \frac{\left( \sum_{i=1}^{n} \left( \lambda_i - \lambda^{(j)}_i \right)^2 \right)^{1/2}}{\| A \|_E} \qquad (7.5)$$

were computed. (Here λ_1 ≤ ··· ≤ λ_n are the exact eigenvalues of A.) The procedure was repeated m times, the rms values β_j of the α_j computed, and the ratios γ_j = β_j/β_0 found, as above. The results are given in Table 4.
TABLE 4 RESULTS FOR EIGENVALUES OF SYMMETRIC MATRICES
| n | m/ 1000 | γ 1 | γ 2 | γ 3 | γ 4 | γ ′ 4 | γ 5 |
|-----|-----------|-------|-------|-------|-------|---------|-------|
| 2 | 100 | 1.07 | 1.61 | 2.14 | 2.38 | 6.06 | 15.2 |
| 4 | 10 | 1.33 | 2.24 | 2.65 | 3.6 | 10.5 | 25.8 |
| 8 | 3 | 1.14 | 2.01 | 2.34 | 3.73 | 10.8 | 29.6 |
| 16 | 1 | 1 | 1.82 | 1.99 | 3.49 | 10.7 | 28.8 |
The method used for finding eigenvalues depends heavily on multiplications by matrices of the form

$$\begin{pmatrix} c & s \\ -s & c \end{pmatrix},$$

where c² + s² = 1. The numbers c and s are certainly not distributed as assumed in Section 5. This, along with other observations made above, may explain the interesting variations in the γ_j. Despite these variations, the ranking of the different systems is as predicted in Section 5.
## 8 Conclusions
Comparing γ ′ 4 with γ 4 in Tables 2-4 shows that the rms error for truncation is usually considerably more than twice as much as for rounding. However, truncation is often preferred because the usual implementation of rounding requires an extra carry propagation. An interesting compromise is the 'von Neumann round' (Burks et al [6], Urabe [26]), for which the result of an arithmetic operation is truncated, and then the least significant bit is set to one. (An exception could be made if the result is exactly representable; this would involve checking if the truncated bits were all zero.) No extra carry propagation is required, and the rms error is twice that for normal rounding, so considerably better than for truncation.
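The von Neumann round is a one-line bit manipulation. A base-2 sketch (the function name is ours; the exact-result exception mentioned above is deliberately omitted):

```python
import math

def von_neumann_round(x, u):
    """Truncate the normalized fraction of x to u bits, then force the least
    significant retained bit to one -- no carry can propagate."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    n = math.trunc(abs(m) * 2**u) | 1       # truncate, then set the last bit
    return math.copysign(math.ldexp(n / 2**u, e), x)

print(von_neumann_round(0.40625, 3))        # 0.40625 = 0.1101b * 2**-1;
                                            # keep 0.110b, set last bit:
                                            # 0.111b * 2**-1 = 0.4375
```

Because the forced bit never generates a carry, the adder needs no extra propagation stage, at the cost of perturbing even exactly representable results unless the exception is implemented.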
The most accurate practical systems are base 2 with the first fraction bit implicit. If the accuracy gained by having the first bit implicit is not considered sufficient compensation for the disadvantages entailed, then base 4 (or perhaps base 8) is the best choice.
The accuracy lost by using base 16 or higher is roughly as predicted in Section 5. High bases may have some implementation advantages (Anderson et al [1], Atkins [2]). In practice both factors should be considered. The number of guard digits used is also important. The use of high bases, only one guard digit, and truncation instead of rounding is probably acceptable on machines with a long floating-point word. However, to minimize the need for double-precision computations, it seems wise to try to squeeze out the last drop of accuracy on a computer with a short floating-point word (say 32-40 bits). The amount that can be squeezed out is often significant. For example, our simulations show that using system S 1 instead of S ′ 4 is roughly equivalent to carrying one more decimal place.
## Acknowledgement
The author wishes to thank W J Cody, the late Prof. G E Forsythe, and Prof. W Kahan for their comments on an earlier version of this paper.
## References
- [1] S F Anderson, J G Earle and R E Goldschmidt, 'The IBM system/360 model 91: Floating-point execution unit', IBM J Res Develop , vol 11, pp 34-53, January 1967.
- [2] D E Atkins, 'Design of arithmetic units of ILLIAC III : The use of redundancy and higher radix methods', IEEE Trans Comput , vol C-19, pp 720-733, August 1970.
- [3] N F Benschop and H C Ratz, 'A mean square estimate of the generated roundoff error in constant matrix iterative processes', J Ass Comput Mach , vol 18, pp 48-62, January 1971.
- [4] H Bowdler, R S Martin, C Reinsch and J H Wilkinson, 'The QR and QL algorithms for symmetric matrices', Numer Math , vol 11, pp 293-306, 1968.
- [5] W S Brown and P L Richman, 'The choice of base', Commun Ass Comput Mach , vol 12, pp 560-561, October 1969.
- [6] A W Burks, H H Goldstine and J von Neumann, 'Preliminary discussion of the logical design of an electronic computing instrument', in Collected Works of John von Neumann , vol 5. New York: Macmillan, 1963, pp 57-58. (Report prepared for the US Army, 1946.)
- [7] W J Cody, 'Desirable hardware characteristics for scientific computation', Preliminary Report to the SIGNUM Board of Directors, 1970.
- [8] W J Cody, 'Static and dynamic numerical characteristics of floating-point arithmetic', this issue [ IEEE Trans Comput , vol C-22, pp 598-601, June 1973].
- [9] W Feller, An Introduction to Probability Theory and its Applications , New York: Wiley, 1950.
- [10] I B Goldberg, '27 bits are not enough for 8-digit accuracy', Commun Ass Comput Mach , vol 10, pp 105-106, February 1967.
- [11] R W Hamming, 'On the distribution of numbers', Bell Syst Tech J , vol 49, pp 1609-1625, October 1970.
- [12] P Henrici, Discrete Variable Methods in Ordinary Differential Equations . New York: Wiley, 1962, pp 50-54.
- [13] P Henrici, 'Test of probabilistic models for the propagation of roundoff errors', Commun Ass Comput Mach , (Letter to the Editor), vol 9, pp 409-410, June 1966.
- [14] T E Hull and J R Swenson, 'Tests of probabilistic models for propagation of roundoff errors', Commun Ass Comput Mach , vol 9, pp 108-113, February 1966.
- [15] W Kahan, 'What is the best base for floating-point arithmetic? Is binary best?', Dep Comput Sci, Univ California, Berkeley, Lecture Notes, December 1970.
- [16] T Kaneko and B Liu, 'Accumulation of roundoff error in fast Fourier transforms', J Ass Comput Mach , vol 17, pp 637-654, October 1970.
- [17] D E Knuth, The Art of Computer Programming , vol 2, Reading, Mass: Addison-Wesley, 1969, pp 218-228.
- [18] H Kuki and W J Cody, 'A statistical study of the accuracy of floating-point number systems', Commun Ass Comput Mach , to be published. [Appeared in vol 16, pp 223-230, April 1973.]
- [19] R S Martin, C Reinsch and J H Wilkinson, 'Householder's tridiagonalization of a symmetric matrix', Numer Math , vol 11, pp 181-195, 1968.
- [20] D W Matula, ' A formalization of floating-point numeric base conversion', IEEE Trans Comput , vol C-19, pp 681-692, August 1970.
- [21] W McKeeman, 'Representation error for real numbers in binary computer arithmetic', IEEE Trans Electron Comput . (Short Notes), vol EC-16, pp 682-683, October 1967.
- [22] R Morris, 'Tapered floating-point: A new floating-point representation', IEEE Trans Comput . (Short Notes), vol C-20, pp 1578-1579, December 1971.
- [23] G U Ramos, 'Roundoff error analysis of the fast Fourier transform', Math Comput , vol 25, pp 757-768, October 1971.
- [24] P L Richman, 'Floating-point number representations: Base choice versus exponent range', Dep Comput Sci, Stanford Univ, Stanford, Calif, Tech Rep CS 64, 1967.
- [25] M Tienari, 'A statistical model of roundoff error for varying length floating-point arithmetic', BIT , vol 10, pp 355-365, 1970.
- [26] M Urabe, 'Roundoff error distribution in fixed-point multiplication and a remark about the rounding rule', SIAM J Numer Anal , vol 5, pp 202-210, 1968.
- [27] C J Weinstein, 'Roundoff noise in floating-point fast Fourier transform computation', IEEE Trans Audio Electroacoust , vol AU-17, pp 209-215, September 1969.
- [28] J H Wilkinson, Rounding Errors in Algebraic Processes . London: HMSO, 1963.
- [29] J H Wilkinson, The Algebraic Eigenvalue Problem . Oxford: Oxford, 1965.