I don’t actually break Bayesian inference here. I want to develop an intuition for Bayesian methods, so I push them to their limits to see what happens. First I describe the inference procedure and how I will test it; then I examine how well it performs on a simple problem.

My test is simple: inferring the distribution of a pseudorandom sequence of floating point numbers. I generate this sequence using the NAG routine `nag_rand_normal` and I follow a simple hypothesis testing procedure I first learned about by reading *Machine Learning: A Probabilistic Perspective* by Kevin P. Murphy. I outline the procedure below.
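As a runnable sketch of the generation step, here is how such a sequence might be produced. I don't have the NAG library to hand, so numpy's `Generator.normal` stands in for `nag_rand_normal`; the function name and seed are illustrative choices, not NAG's API.

```python
import numpy as np

def generate_sequence(mu, sigma, n, seed=0):
    """Stand-in for nag_rand_normal: draw n pseudorandom samples
    from a normal distribution with mean mu and std dev sigma."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mu, scale=sigma, size=n)

# The sequence {x_k} used throughout, with the "true" parameters.
xs = generate_sequence(mu=0.0, sigma=1.0, n=64)
```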

Hypothesis testing with Bayesian methods means you start with a prior belief about whether a hypothesis \( \mathbf{H} \) is true or not, expressed numerically as a probability \( P(\mathbf{H}) \), called a *prior probability.* Next you observe some data \( \{x _ k\} _ {k=1}^N \) and compute the new *posterior* probability \( P(\mathbf{H} \mid \{x _ k\} _ {k=1}^N ) \): the probability that the hypothesis \( \mathbf{H} \) is true given that we have observed the data \( \{x _ k\} _ {k=1}^N \). A high posterior probability expresses a high confidence in a hypothesis based on available information (where “available information” is encoded into the prior probability somehow). The following formula, based on Bayes’ Theorem, calculates the posterior probability

\[ P(\mathbf{H} \mid \{x _ k\} _ {k=1}^N ) = \frac{P(\{x _ k\} _ {k=1}^N \mid \mathbf{H}) P(\mathbf{H})}{P(\{x _ k\} _ {k=1}^N )} \].

This formula will allow us to combine prior information with observed data to try and infer the distribution used to generate a pseudorandom sequence of floating point numbers.
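Before moving on to continuous parameters, the Bayes update is easy to see with a toy discrete example (my own illustration, not from the procedure above): two candidate generating distributions, equal priors, and a posterior computed directly from the formula.

```python
import numpy as np
from scipy.stats import norm

# Two hypothetical candidate hypotheses for the generating distribution,
# each a normal with parameters (mu, sigma), and equal prior belief.
hypotheses = {"H1": (0.0, 1.0), "H2": (5.0, 1.0)}
prior = {"H1": 0.5, "H2": 0.5}

data = np.array([0.2, -0.3, 0.1, 0.4])  # clearly looks like mu = 0

# Likelihood P(data | H): product of normal densities (i.i.d. samples).
lik = {h: norm.pdf(data, loc=m, scale=s).prod()
       for h, (m, s) in hypotheses.items()}

# Evidence P(data): sum over hypotheses (law of total probability).
evidence = sum(lik[h] * prior[h] for h in hypotheses)

# Bayes' Theorem gives the posterior for each hypothesis.
posterior = {h: lik[h] * prior[h] / evidence for h in hypotheses}
```

With data concentrated near zero, essentially all posterior mass lands on the hypothesis with \( \mu = 0 \).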

In what I consider here I have a sequence \( \{x _ k\} _ {k=1}^N \) generated by `nag_rand_normal.`

I will try to infer the parameters of the normal distribution used to generate this sequence. This means I actually have infinitely many hypotheses \( \mathbf{H}(\mu,\sigma) \), where \(\mu\) is the *mean* and \( \sigma \) is the *standard deviation* of the normal distribution. I need to figure out which values of these parameters were input to `nag_rand_normal` to generate the sequence \( \{x _ k\} _ {k=1}^N \). First let’s recast the formula for the posterior given above for this case. Given any \( \mu \) and \( \sigma \), the posterior formula gives

\[ P(\mathbf{H}(\mu,\sigma) \mid \{x _ k\} _ {k=1}^N ) = \frac{P(\{x _ k\} _ {k=1}^N \mid \mathbf{H}(\mu,\sigma)) P(\mathbf{H}(\mu,\sigma))}{P(\{x _ k\} _ {k=1}^N )} \].

To calculate \(P(\{x _ k\} _ {k=1}^N )\) we may appeal to the law of total probability and find that

\[ P(\{x _ k\} _ {k=1}^N ) = \int _{0}^{\infty} \int _{-\infty}^{\infty} P(\{x _ k\} _ {k=1}^N \mid \mathbf{H}(\mu,\sigma)) P(\mathbf{H}(\mu,\sigma)) \, d\mu \, d\sigma. \]

The integrand may be evaluated by computing the likelihood and applying our prior, both of which are known, computable quantities. To compute the integral itself I used the NAG routine `nag_quad_2d_fin`, supplying large finite integration bounds.
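The evidence integral can be sketched as follows. Since I don't have NAG available here, scipy's `dblquad` stands in for `nag_quad_2d_fin`, and the broad normal prior is an illustrative choice, not the post's exact prior.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import dblquad

data = np.array([0.1, -0.2, 0.3])  # a small example sequence

def likelihood(mu, sigma):
    """P(data | H(mu, sigma)): product of normal densities."""
    return norm.pdf(data, loc=mu, scale=sigma).prod()

def prior(mu, sigma):
    """A broad, nearly flat prior over (mu, sigma) -- an assumption
    for illustration, standing in for whichever prior is in use."""
    return norm.pdf(mu, 0.0, 10.0) * norm.pdf(sigma, 1.0, 10.0)

# dblquad integrates func(y, x) with x outer, y inner; here x = mu and
# y = sigma. Large finite bounds stand in for (-inf, inf) x (0, inf).
evidence, err = dblquad(lambda s, m: likelihood(m, s) * prior(m, s),
                        -10.0, 10.0,   # mu bounds (outer)
                        1e-3, 10.0)    # sigma bounds (inner)
```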

I now have everything I need to start doing calculations, so let’s jump straight into my tests.

In all tests I generate the pseudorandom sequence \( \{x _ k\} _ {k=1}^N \) with the normal distribution parameters \(\mu=0\) and \(\sigma=1\). I plot the posterior probability as a heat map in \(\mu,\sigma\) space: yellow regions indicate higher confidence values and blue regions indicate lower confidence values. I also place a black circle on the “true” values so that we can visually judge the quality of the confidence values, and an orange circle on the maximum posterior probability. Thus if the orange circle is close to the black circle, the Bayesian hypothesis testing procedure has succeeded.
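The heat-map computation can be sketched like this: evaluate the (unnormalised) log-posterior on a \((\mu,\sigma)\) grid and locate its maximum, which is where the orange circle would go. The grid extents and the nearly flat prior are illustrative choices of mine, not the post's exact settings.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=64)      # true mu = 0, sigma = 1

mus = np.linspace(-2.0, 2.0, 81)
sigmas = np.linspace(0.2, 3.0, 71)
M, S = np.meshgrid(mus, sigmas)

# Log-likelihood summed over the data at every grid point,
# plus a nearly flat log-prior (assumed for illustration).
loglik = np.sum(norm.logpdf(data[:, None, None], loc=M, scale=S), axis=0)
logprior = norm.logpdf(M, 0.0, 10.0) + norm.logpdf(S, 1.0, 10.0)
logpost = loglik + logprior                # unnormalised log-posterior

# The grid point of maximum posterior probability (the orange circle).
i, j = np.unravel_index(np.argmax(logpost), logpost.shape)
map_mu, map_sigma = M[i, j], S[i, j]
```

The heat map itself would then be something like `plt.pcolormesh(M, S, np.exp(logpost - logpost.max()))`, with markers at \((\texttt{map\_mu}, \texttt{map\_sigma})\) and at the true \((0, 1)\).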

I produce this plot for sample sizes \( N=1,2,4,8,16,32,64 \) and for three different prior probabilities. I summarize them as follows:

- A very helpful prior: force the Bayesian update formula to give us the right answer.
- An unhelpful prior: assign almost the same probability to all parameters.
- A *very* unhelpful prior: give the right parameters prior probability almost zero.

For this test I take the prior probability \( P(\mathbf{H}(\hat{\mu},\hat{\sigma})) \) to be a bivariate normal centered on the “true” mean and standard deviation \( \mu=0 \) and \(\sigma=1\), with very small variances. The idea here is a sanity check on the method: if it can’t preserve a correct answer that is fed to it, then it has no chance on approximate inputs.

For this test I take the prior probability \( P(\mathbf{H}(\hat{\mu},\hat{\sigma})) \) to be a bivariate normal centered on the “true” mean and standard deviation, but this time I give it very large variances. This means the prior probability takes roughly the same value for any input mean and standard deviation. The idea of this test is to see how well our method can pick out the right parameters when the prior is not very helpful.

For this test I take the prior probability \( P(\mathbf{H}(\hat{\mu},\hat{\sigma})) \) to be a bivariate normal centered on the *wrong* parameters, and I also give it very small variances. This models the situation where our prior belief is very wrong. I want to see what happens in this case as we increase the number of observations: does the method still work?
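The three priors can be written down concretely as bivariate normals over \((\mu,\sigma)\). The centres and covariances below are illustrative values of my own choosing; the post's exact numbers are not given.

```python
import numpy as np
from scipy.stats import multivariate_normal

true_center = [0.0, 1.0]   # the "true" (mu, sigma)

# Very helpful: tight bivariate normal on the true parameters.
helpful = multivariate_normal(mean=true_center, cov=0.01 * np.eye(2))

# Unhelpful: same centre but huge variances, so the density is
# nearly the same everywhere in the region of interest.
unhelpful = multivariate_normal(mean=true_center, cov=100.0 * np.eye(2))

# Very unhelpful: tight bivariate normal on the *wrong* parameters,
# so the true parameters get prior probability almost zero.
very_unhelpful = multivariate_normal(mean=[5.0, 3.0], cov=0.01 * np.eye(2))

# Prior density at the true parameters under each choice.
p_true = [p.pdf(true_center) for p in (helpful, unhelpful, very_unhelpful)]
```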

Let me summarize what the above figures show:

- Very helpful prior: The computed posterior immediately gives very high confidence around the correct distribution, even for just 1 sample.
- Unhelpful prior: The computed posterior requires more samples to converge around the correct distribution, but does still converge.
- Very unhelpful prior: The posterior requires many more samples for convergence, but does appear to converge.

This, however, is a very simple test, and there are a few things I do not know the answer to right now. The inference seems to work, but my hypotheses \( \mathbf{H} \) consider only *normal distributions*, and the data is generated from a normal distribution. I do not expect to see this nice alignment of hypothesis and data in practice. I would therefore need to try many different kinds of distributions as hypotheses, leading to many more parameters to search over. Furthermore, data may contain errors and noise, so it may be that no classical distribution truly describes it. The conclusions I draw here do not apply to these more complicated cases, but they do provide a necessary sanity check.