Estimating Errors in Summing Arrays of Floating Point Numbers

January 15, 2018

Here I investigate how to calculate

\[ x = \sum _ {i=1} ^N x _ i \]

with floating point arithmetic.

Calculating this sum efficiently and accurately can be tricky, especially for large \( N \). The obvious algorithm

float xhat=0.0;
for(int i=0;i<N;i++){
  xhat+=x[i]; /* each addition can introduce a relative rounding error of size epsilon */
}

has a worst-case relative error of

\[ \frac{\hat{x} - x}{x} = O(\epsilon N) \]

where \( \epsilon \) depends on the precision of the floating point arithmetic; here I take \( \epsilon \approx 10^{-6} \). This is a very problematic bound when working on large datasets: scientific computing codes routinely sum arrays far larger than the sizes at which this pessimistic error bound starts to raise serious correctness alarms.
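To put a rough number on that: with \( \epsilon \approx 10^{-6} \), the bound allows a relative error of about

\[ \epsilon N \approx 10^{-6} \times 10^{4} = 10^{-2} \]

for an array of \( N = 10^{4} \) elements, i.e. only around two correct decimal digits in the worst case, and by \( N \approx 10^{6} \) the bound is \( O(1) \), which would mean no correct digits at all.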

There are algorithms which can reduce this bound to more useful levels. Examples include pairwise summation, which reduces the linear factor to a logarithmic one, and Kahan summation, which removes the \( N \)-dependent factor from the error estimate entirely. But these algorithms have their own problems too.

Pairwise summation and Kahan summation are more complex and hurt parallelism and vectorization opportunities unless they fall back to naive summation once the array size drops below some threshold.
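For reference, here is a minimal sketch of Kahan (compensated) summation in C; the function name and signature are my own, not taken from any code in this post:

float kahan_sum(const float *x, int N)
{
  float sum = 0.0f;
  float c = 0.0f;               /* running compensation for lost low-order bits */
  for (int i = 0; i < N; i++) {
    float y = x[i] - c;         /* correct the next element by the previous error */
    float t = sum + y;          /* low-order bits of y are lost in this addition */
    c = (t - sum) - y;          /* recover the lost part (algebraically zero) */
    sum = t;
  }
  return sum;
}

Note that the compensation step is algebraically a no-op, so aggressive optimisation flags such as -ffast-math can legally eliminate it; in practice the accumulator may need to be protected from such rewrites.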

What I’m curious to know is whether this horrendous error estimate for naive summation is ever actually realized in practice. My experience suggests not, as I have worked on very large-scale codes which use naive summation and they had no issues with explosively large relative errors. So I decided to test this numerically.

The experiment

One issue with naive summation on large problems is that it will likely be executed in parallel. Since floating point arithmetic is not associative, the errors produced by different runs of the same code (even given the same inputs) can differ. Ideally we would like some kind of guarantee that, even though these answers differ, they are within some prescribed error tolerance. Since the errors are not deterministic, this leaves us only with the worst-case bound, which is useless.

Thus I model the relative errors as a random variable with normal distribution:

\[ \frac{\hat{x} - x}{x\epsilon} \sim N(\mu,\sigma) \]

and attempt to find the parameters \( \mu, \sigma \). Originally I thought to use the Bayesian approach of my previous post; however, at the large sample sizes I needed for convergence I found severe numerical problems in evaluating quantities like the likelihood, which in turn ruins the numerical integration. I believe these problems are fixable, but I need to learn more about those techniques. So for now I simply use the maximum likelihood estimate. To simulate non-deterministic summation, as might happen in parallel execution, I lightly modify the naive summation as follows:

float xhat=0.0;
for(int i=0;i<N;i++){
  xhat+=x[perm[i]];
}

where the index array perm is a permutation of the original indices. I compute different permutations with the NAG routine nag_rand_permute. In what follows I use a fixed array x of \(N=10000\) floating point numbers. I compute the relative error by evaluating the permuted sum in 32 bit floating point arithmetic and the “correct” sum in 64 bit floating point with Kahan summation. The relative error itself is computed in 64 bit floating point by casting up the 32 bit result (a sketch of a single measurement is given below). I compute the mean and standard deviation from random permutation samples of sizes 1, 10, 100, 1000, 10000, 100000, and 1000000, hoping to see a convergence trend, and follow with histogram plots:
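For concreteness, here is a rough sketch of how one scaled relative error sample could be measured. The function name, the hard-coded \( \epsilon = 10^{-6} \), and the double precision Kahan reference are my own choices guided by the description above; the permutation array perm is assumed to be generated elsewhere (the post uses nag_rand_permute):

/* One sample of the scaled relative error (xhat - x) / (x * eps):
   naive 32 bit sum over a permuted order, compared against a 64 bit
   Kahan reference sum of the same data. */
double scaled_relative_error(const float *x, const int *perm, int N)
{
  const double eps = 1.0e-6;

  float xhat = 0.0f;                 /* naive 32 bit sum in permuted order */
  for (int i = 0; i < N; i++)
    xhat += x[perm[i]];

  double sum = 0.0, c = 0.0;         /* 64 bit Kahan reference sum */
  for (int i = 0; i < N; i++) {
    double y = (double)x[i] - c;
    double t = sum + y;
    c = (t - sum) - y;
    sum = t;
  }

  return ((double)xhat - sum) / (sum * eps);
}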

Finding parameters for normal distribution

Number of permutation samples    \( \mu \)    \( \sigma \)
1                                -1.115       0
10                               -1.072       4.97
100                              -1.650       8.59
1000                             -1.528       7.45
10000                            -0.912       7.37
100000                           -0.940       7.38
1000000                          -0.942       7.34
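For a normal distribution the maximum likelihood estimates are simply the sample mean and the (biased, \( 1/n \)) sample standard deviation of the \( n \) observed scaled errors \( e_j \):

\[ \hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} e_j, \qquad \hat{\sigma} = \sqrt{ \frac{1}{n} \sum_{j=1}^{n} \left( e_j - \hat{\mu} \right)^2 }, \]

and at these sample sizes the difference between the \( 1/n \) and \( 1/(n-1) \) divisors is negligible.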

In order to have confidence in the mean and standard deviation calculations I used the robust NAG routine nag_summary_stats_onevar. Next I show histograms of the computed relative errors to visually confirm that the distribution is approximately normal.

Histograms of relative errors

[Histograms of the scaled relative errors for 1, 10, 100, 1000, 10000, 100000, and 1000000 permutation samples.]

Determining the probability of worst-case error

The worst-case error in these scaled units is about \(N=10000\). With our fitted normal distribution I can see immediately that its probability is not even worth computing, because it will be vanishingly small.
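To see just how small: under the fitted distribution a scaled error of \( 10000 \) sits roughly

\[ \frac{10000 - (-0.942)}{7.34} \approx 1363 \]

standard deviations above the mean, far beyond anything a normal distribution assigns meaningful probability to.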

To compute other, more interesting probabilities I used the following Julia script:

using Distributions
using QuadGK                      # quadgk is provided by the QuadGK package in recent Julia versions
N=Normal(-0.942,7.34);            # fitted normal for the scaled relative error
function f(x)
  return pdf(N,x);
end
(P1,err)=quadgk(f,10,100);        # right tail, truncated at 100
(P2,err)=quadgk(f,-100,-10);      # left tail, truncated at -100
P=P1+P2;                          # two-sided probability that the scaled error exceeds 10
println(P);

Using this I compute some other probabilities:

\[ P \left( \left| \frac{\hat{x} - x}{x\epsilon} \right| > 10 \right ) \approx 0.18 \]

which means that

\[ P \left( \left| \frac{\hat{x} - x}{x\epsilon} \right| \leq 10 \right ) \approx 0.82 \]

We could also determine a new “probabilistic worst case” estimate by finding a bound which is exceeded only with very low probability. For example:

\[ P \left( \left| \frac{\hat{x} - x}{x\epsilon} \right| > 20 \right ) < 0.01 \]
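As a quick check with the fitted parameters, writing \( \Phi \) for the standard normal CDF:

\[ P \left( \left| \frac{\hat{x} - x}{x\epsilon} \right| > 20 \right ) = \Phi\left(\frac{-20 + 0.942}{7.34}\right) + 1 - \Phi\left(\frac{20 + 0.942}{7.34}\right) \approx 0.005 + 0.002 \approx 0.007, \]

comfortably below the \( 0.01 \) threshold.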

Conclusion

The purpose of this exercise was to see how likely we are to actually hit the worst-case error of the naive floating point summation algorithm. I also wanted to produce an estimate that would apply to the parallel summation case, where we don’t know in what order the partial summations will happen; I achieved this by randomly sampling permutations of the floating point array. In this particular case I reduced the worst-case error bound of \( \frac{|error|}{\epsilon} < 10000 \) to a much better \( \frac{|error|}{\epsilon} < 20 \) (with probability greater than \( 0.99 \)).

This analysis was very simple. An actual parallel algorithm won’t sample all permutations of the sum with equal probability, which is what I assumed in making these calculations. A more sophisticated analysis would therefore account for the fact that some evaluation orders are more likely than others. Another issue is the use of a fixed array x: the only thing allowed to change here was the evaluation order of the floating point summation, not the floating point numbers themselves. The analysis could be improved to address both of these weaknesses.

Future Work

For future work I can systematically account for the issues noted above: the non-uniform probabilities of different evaluation orders in a real parallel reduction, and the dependence of the results on the particular fixed array x.

Extras

I put the code for these computations on Github.

The large sample sizes are very computationally intensive, as evidenced by the following screengrab of “top”:

[Screengrab of top output]