Breaking Bayesian Inference

January 10, 2018

I don’t actually break Bayesian inference. I want to develop an intuition for Bayesian methods, and so here I push them to their limits to see what happens. First I describe what I will do with Bayesian inference, and how I will test it. I want to see how well it performs on a simple test.

I try here a simple test: inferring the distribution of a pseudorandom sequence of floating point numbers. I generate this sequence using the NAG routine nag_rand_normal and I follow a simple hypothesis testing procedure I first learned about by reading Machine Learning, A Probabilistic Perspective by Kevin P. Murphy. I outline the procedure below.

Hypothesis Testing

Hypothesis testing with Bayesian methods means you start with a prior belief about whether a hypothesis \( \mathbf{H} \) is true or not, expressed numerically as a probability \( P(\mathbf{H}) \), called a prior probability. Next you observe some data \( \{x _ k\} _ {k=1}^N \) and attempt to compute the new posterior probability \( P(\mathbf{H} \mid \{x _ k\} _ {k=1}^N ) \) (This means: The probability that the hypothesis \( \mathbf{H} \) is true given that we have observed the data \( \{x _ k\} _ {k=1}^N \) ). A high posterior probability expresses a high confidence in a hypothesis based on available information (where “available information” is encoded into the prior probability somehow). The following formula, based on Bayes’ Theorem, calculates the posterior probability

\[ P(\mathbf{H} \mid \{x _ k\} _ {k=1}^N ) = \frac{P(\{x _ k\} _ {k=1}^N \mid \mathbf{H}) P(\mathbf{H})}{P(\{x _ k\} _ {k=1}^N )} \].

This formula will allow us to combine prior information with observed data to try and infer the distribution used to generate a pseudorandom sequence of floating point numbers.

Inferring Distribution Used to Generate Pseudorandom Sequence

In what I consider here I have a sequence \( \{x _ k\} _ {k=1}^N \) generated by nag_rand_normal. I will try to infer the parameters to the normal distribution used to generate this sequence. This means I actually have infinitely many hypotheses \( \mathbf{H}(\mu,\sigma) \) where \(\mu\) is the mean of the normal distribution, and \( \sigma \) is the standard deviation of the normal distribution. I need to figure out which values of these parameters were input to nag_random_normal to generate the sequence \( \{x _ k\} _ {k=1}^N \). First let’s recast the formula for the posterior given above for this case. Given any \( \mu \) and \( \sigma \) the posterior formula gives

\[ P(\mathbf{H}(\mu,\sigma) \mid \{x _ k\} _ {k=1}^N ) = \frac{P(\{x _ k\} _ {k=1}^N \mid \mathbf{H}(\mu,\sigma)) P(\mathbf{H}(\mu,\sigma))}{P(\{x _ k\} _ {k=1}^N )} \].

To calculate \(P(\{x _ k\} _ {k=1}^N )\) we may appeal to the law of total probability and find that

\[(P(\{x _ k\} _ {k=1}^N ) = \int _{0}^{\infty} \int _{-\infty}^{\infty} P(\{x _ k\} _ {k=1}^N \mid \mathbf{H}(\mu,\sigma)) P(\mathbf{H}(\mu,\sigma)) d\mu d\sigma \]

The integrand may be implemented by calculating the likelihood and applying our prior, both computable and known quantities. To compute the actual integral I used the NAG routine nag_quad_2d_fin by supplying large integration bounds.

I now have everything I need to start doing calculations, let’s jump straight into my tests.

Numerical Results

In all tests I generate the pseudorandom sequence \( \{x _ k\} _ {k=1}^N \) with the normal distribution parameters \(\mu=0\) and \(\sigma=1\). I plot the posterior probability as a heat map in \(\mu,\sigma\) space, yellow probabilities indicate higher confidence values and blue probabilities indicate lower confidence values. I also place a black circle on the “true” values so that we may visually see the quality of the confidence values. I place an orange circle on the maximum posterior probability. Thus if the orange circle is close to the black circle, the bayesian hypothesis testing procedure has succeeded.

I produce this plot for sample sizes \( N=1,2,4,8,16,32,64 \) and for three different prior probabilities. I summarize them as follows:

A very helpful prior: force the Bayesian update formula to give us the right answer
An unhelpful prior: assign almost the same probability to all parameters
A very unhelpful prior: Give the right parameters prior probability almost zero.

Case 1: A very helpful prior

For this test I take the prior probability \( P(\mathbf{H}(\hat{\mu},\hat{\sigma})) \) to be a bivariate normal centered on the “true” mean and standard deviations \( \mu=0 \) and \(\sigma=1\) and with very small standard deviation. The idea here is a sanity check on the method, if it can’t preserve a correct answer that is fed to it then it has no chance on approximate inputs.

Case: \(N=1\) Samples

1 Sample

Case: \(N=2\) Samples

2 Sample

Case: \(N=4\) Samples

4 Sample

Case: \(N=8\) Samples

8 Sample

Case: \(N=16\) Samples

16 Sample

Case: \(N=32\) Samples

32 Sample

Case: \(N=64\) Samples

64 Sample

Case 2: An Unhelpful Prior

For this test I take the prior probability \( P(\mathbf{H}(\hat{\mu},\hat{\sigma})) \) to be a bivariate normal centered on the “true” mean and standard deviation, but this time I give it very large variances. This means the prior probability takes roughly the same value for any input mean and standard deviation. The idea of this test is to see how well our method can pick out the right paramters when the prior is not very helpful.