Abnormally Distributed

A Simple Introduction to Bayesian Inference

Brandon Vaughan

last update: 12 September 2018

Introduction

Bayesian statistics has become something of a buzzword. But what is it? Many online tutorials explain it only in the limited context of simplistic medical-testing scenarios. Other material online is saturated with integrals and derivatives, clearly written for those already “in the know” or trained in mathematics. The aim here is to explain what Bayesian statistics is, where it came from, and what benefits it has, with little in the way of statistical formulas beyond the basic statement of Bayes’ theorem. Examples are fully demonstrated with figures, with model code written in Stan. The goal is for the interested reader to fully understand the workflow of Bayesian inference and, if desired, program and run basic regression models. The programming exercises are not necessary to understand the concepts presented here, but they do make it possible to try out Bayesian methods with Stan. Data management and other R code is omitted to keep the focus on Stan, which is multi-platform: the code blocks will work in any Stan interface, be it RStan, PyStan, or CmdStan.

What is it?

In a Bayesian analysis the probability of a parameter value \(\small \theta\) given the model and data is called the posterior probability. The posterior depends on the model assumptions, represented in the prior, and on the likelihood function, which gives the probability of the data given the model. Dividing by the probability of the data over the whole sample space, known as the marginal evidence \(\small P(X|\pi)\), yields the probability of a parameter value relative to all the ways the data could have arisen. Usually \(\pi\), representing the model assumptions, is not written alongside \(\small \theta\); it is included here to emphasize that even in the prior, the probability of a particular value of \(\small \theta\) is contingent on how the analyst constructs the model.


\[ \color{purple}{P(\theta|X,\pi)} = \frac{\color{red}{P(X|\theta,\pi)} \cdot \color{blue}{P(\theta|\pi)}}{P(X|\pi)} = \frac{\color{red}{P(X|\theta,\pi)} \cdot \color{blue}{P(\theta|\pi)}}{\int \color{red}{P(X|\theta,\pi)} \cdot \color{blue}{P(\theta|\pi)} \ d\theta} \]

This is the only integral you will have to look at today (on this page, at least)! Fortunately, it can be visualized easily using Venn diagrams. Suppose that \(\small \theta\) is a single binary outcome, the presence or absence of an event, and \(\small X\) is a piece of information that indicates \(\small \theta\) may occur. The statement \(\small P(X|\theta) \cdot P(\theta)\) is equivalent to the joint probability \(\small P(\theta \cap X)\), meaning our posterior probability will be proportional to the joint probability of \(\small \theta\) and \(\small X\). The probability of \(\small \theta\) given \(\small X\) is then the area where the \(\small X\) and \(\small \theta\) circles overlap, relative to the total area of the circle where \(\small X\) occurs.
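The Venn-diagram arithmetic can be checked numerically. Below is a small Python sketch with made-up probabilities (the specific numbers are hypothetical, chosen only for illustration):

```python
# Hypothetical prior and likelihoods for a binary event theta and
# evidence X (illustrative numbers only).
p_theta = 0.10           # prior P(theta)
p_x_given_theta = 0.80   # likelihood P(X | theta)
p_x_given_not = 0.20     # P(X | not theta)

# Joint probabilities: P(X | theta) * P(theta) = P(theta and X)
joint_theta_x = p_x_given_theta * p_theta
joint_not_x = p_x_given_not * (1 - p_theta)

# Marginal P(X): the total area of the "X circle"
p_x = joint_theta_x + joint_not_x

# Posterior: the overlap region relative to the X circle
posterior = joint_theta_x / p_x
print(posterior)
```

The posterior is just the overlap area divided by the whole \(\small X\) circle; no integral is needed in the discrete case.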



In practice, we do this over a range of possible values for \(\small \theta\) to obtain a posterior distribution, which describes the probabilities of a range of candidate parameter values. Why use Bayes’ theorem? Bayesian inference is conditional on the observed data and model. That is, we have just one set of data, and our conclusions depend on the model. Frequentist inference considers the data as conditional on a fixed assumption, the null hypothesis, and seeks to demonstrate importance by obtaining a low probability of the data given that hypothesis. That is, frequentist inference proceeds by \(\small P(Data|\textit{Null Hypothesis})\) while Bayesian inference proceeds by \(\small P(Hypotheses | \textit{Data})\).

This is useful in situations where it doesn’t necessarily make sense to assume the data were randomly sampled from a population. In practice, data are often drawn from an ill-defined population, not sampled randomly, or both. Frequentist founding father Fisher himself insisted that statistical populations are convenient fictions (Fisher 1922, 1925). This situation is very common in the social, political, and economic sciences (Berk and Freedman 2003; Gill 2014), and in many physical sciences as well, including astronomy and meteorology. For example, asking for “the probability that it will rain tomorrow, given the current weather conditions” is awkward from a frequentist perspective. Days are not randomly sampled. Neither are weather patterns. In fact, one of the first major modern applications of Bayesian inference occurred in meteorology, where Bayesian methods revolutionized weather forecasting (Epstein 1962, 1966; Murphy and Epstein 1967; Lorenc and Hammon 1988).

In general, most inferential questions in science can be framed far more naturally as conclusions drawn about some outcome or quantity \(\small \theta\) from the observed data \(\small X\). We would often like to find the best estimate of \(\small \theta\), where “best” means “most probable”, conditional on a single observation or set of observations. This estimate is known as the maximum a posteriori estimate or MAP estimate.
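For a simple one-parameter model, the posterior and its MAP estimate can be computed directly on a grid. The Python sketch below uses made-up binomial data (7 successes in 10 trials) and a uniform prior:

```python
import numpy as np

# Grid approximation of the posterior for a success probability theta,
# given 7 successes in 10 trials (hypothetical data), uniform prior.
grid = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(grid)
likelihood = grid**7 * (1 - grid)**3          # binomial kernel
unnorm_post = likelihood * prior
posterior = unnorm_post / unnorm_post.sum()   # normalize over the grid

map_estimate = grid[np.argmax(posterior)]     # most probable theta
print(map_estimate)
```

With a uniform prior the MAP estimate coincides with the maximum likelihood estimate, here 7/10.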

Monte Carlo Sampling

Finding the MAP estimate in simple models is often analytically fairly simple and involves tractable integrals. For complex models the integrals needed to be solved to obtain the posterior distribution become difficult, often to the point that they are intractable. Historically this led to a negative view of Bayesian inference and the development of the frequentist approaches.

For models that are too difficult to solve analytically, a method called Markov chain Monte Carlo (MCMC) sampling is used. MCMC is a way to explore the solution to a difficult problem through probabilistic sampling of the parameter space defined by the model. There are many different algorithms, but the main ones are the Metropolis-Hastings algorithm, Gibbs sampling, and Hamiltonian Monte Carlo. The specifics are not terribly important here. The underlying idea behind all of them is to generate a series of proposals, each of which is either rejected out of hand or considered. If considered, a second decision is made about whether to accept or reject the value. Each MCMC method has its own rules for the consideration and acceptance decisions.



As the number of iterations increases, the sampling chain explores the target distribution with greater accuracy. The result is a full probability distribution. The amazing thing is that the difficult-to-compute denominator of Bayes’ theorem is not needed, only the joint probability of \(\small \theta\) and \(\small X\): \(\small P(X\cap\theta) = P(X|\theta, \pi) \cdot P(\theta|\pi)\). This is because MCMC requires only a function proportional to the desired quantity \(\small P(\theta | X, \pi)\). The samples drawn from the target distribution defined by this joint probability are the posterior probability distribution.
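A minimal random-walk Metropolis sampler makes this concrete. The sketch below (plain Python, with a standard normal target whose normalizing constant is deliberately dropped) shows that propose/accept-reject steps driven only by an unnormalized density still recover the target distribution:

```python
import math
import random

# Unnormalized log-density of the target: normal(0, 1) with its
# normalizing constant dropped, mimicking P(X|theta) * P(theta).
def log_unnorm(theta):
    return -0.5 * theta**2

random.seed(1)
theta, samples = 0.0, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 1.0)  # random-walk proposal
    # Accept with probability min(1, target(proposal) / target(current))
    if math.log(random.random()) < log_unnorm(proposal) - log_unnorm(theta):
        theta = proposal
    samples.append(theta)  # the current state is recorded either way

burned = samples[2000:]    # discard warmup iterations
mean = sum(burned) / len(burned)
print(mean)
```

The retained draws should have mean near 0 and variance near 1, even though the sampler never saw the normalizing constant.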

Building a Simple Model

The first step of building up a statistical model is to think about the data-generating process, which will form the likelihood of the model. Let’s start with a simple statement: observations \(y_i\) are drawn from a normal distribution with known mean \(\mu\) and standard deviation \(\sigma\). We can write this assumption as shown below.


\(y_i \sim normal(\mu, \sigma)\)


Next we can represent this statement with a diagram, known as a Kruschke diagram (Kruschke 2011). These were introduced as an alternative to the directed acyclic graphs sometimes used, which read like logic diagrams. A Kruschke diagram shows the shape and name of each distribution, with each level of the diagram corresponding to a model statement, making it clear exactly what the model is.


Next we draw observations \(y_i\) from the distribution. Stan code is provided below, with a histogram of draws \(y_i\).

parameters {
  real y;
}

model {
  y ~ normal(0,1);
}

However, in real analyses we have actual samples of \(y_i\) and need to estimate \(\mu\) and \(\sigma\). This is a situation where one might use a one-sample t-test in frequentist statistics. For now, we can assume non-informative priors on \(\mu\) and \(\sigma\), with \(\sigma\) having a lower bound of zero since a standard deviation cannot be negative. That is, we believe that \(y_i \sim normal(\mu, \sigma)\) as before, only this time we don’t know what the parameters are.


The Stan code for this model is given below.


data {
  int<lower=0> N;
  real y[N];
}

parameters {
  real mu;
  real<lower=0> sigma;
}

model {
  y ~ normal(mu, sigma);
}
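Because the priors here are flat, the posterior mode should coincide with the maximum likelihood values: the sample mean and the (MLE) sample standard deviation. This can be checked with a brute-force grid search in Python over simulated data (all numbers made up):

```python
import numpy as np

# With flat priors the log-posterior equals the log-likelihood (up to a
# constant), so its mode should sit at the sample mean and the MLE of
# sigma. Hypothetical data: 200 draws from normal(5, 2).
rng = np.random.default_rng(0)
y = rng.normal(5.0, 2.0, size=200)

mus = np.linspace(4.0, 6.0, 201)
sigmas = np.linspace(1.0, 3.0, 201)
M, S = np.meshgrid(mus, sigmas, indexing="ij")

# Log-likelihood of normal(mu, sigma), summed over the observations
ll = -len(y) * np.log(S) - ((y[:, None, None] - M) ** 2).sum(axis=0) / (2 * S**2)

i, j = np.unravel_index(ll.argmax(), ll.shape)
print(mus[i], sigmas[j])  # should match y.mean() and y.std() up to grid resolution
```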


This model can be extended further. Consider a situation where observations \(y_i\) are predicted by a variable. We think the mean of the vector of observations \(y\) is determined by a variable \(\small X\), and that this relationship is described by a baseline value \(\alpha\) plus the product of \(x_i\) and a term \(\small \beta\) describing the relationship of \(\small X\) and \(y\). This is of course just the equation of a line, \(y = \alpha + \beta \cdot x\). The residual standard deviation, the variation left unexplained by the predictors, defines the standard deviation of the distribution of \(y_i\).

This is simply a least squares linear model. In fact, although least squares is thought of as a frequentist procedure, Gauss originally justified it on the grounds that, so long as the residuals are distributed according to the Gaussian (normal) distribution, the coefficients are those at the mode of the posterior distribution and minimize posterior error (Stigler 1981, 2007). The only distributional assumption here is that the observations \(y_i\) are normal; the parameters implicitly receive uninformative, improper priors. They are called improper because they are unbounded and have infinite area under the curve, so they do not integrate to 1 as a probability distribution must.

Our least squares model then looks like the figure below, representing the model statement
\[ y_i \sim normal(\mu = \alpha + X \cdot \beta,\ \ \ \sigma) \]


The Stan code for this model is below. Note that the sampling statement in the model block corresponds directly to the model statement above; in Stan you can expect each rung of the model hierarchy to correspond to a line of code in the model block. We don’t need to write out the uniform priors, since Stan places flat (improper uniform) priors on unconstrained parameters by default.


data {
  int<lower=0> N;
  vector[N] X;
  real y[N];
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + X * beta, sigma);
}
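Under these improper uniform priors, the MAP estimates of \(\alpha\) and \(\beta\) should match ordinary least squares. A quick sanity check in Python on simulated data (the true values are made up for illustration):

```python
import numpy as np

# Simulate data from y = 1.5 + 2.0 * x + noise (hypothetical truth).
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 1.5 + 2.0 * x + rng.normal(0.0, 0.5, size=100)

# Ordinary least squares via the design matrix [1, x]
A = np.column_stack([np.ones_like(x), x])
alpha_hat, beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(alpha_hat, beta_hat)  # should land near 1.5 and 2.0
```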


Below are the 95% Bayesian credible intervals plotted alongside the 95% confidence intervals from the least squares fit. It is easy to see that the results are virtually identical (note that least squares does not directly fit a parameter for the standard deviation, however).



One benefit of the Bayesian approach is we get as many credible regression lines as we have posterior samples. We can plot these lines to get a visual assessment of the plausible relationships in the data.



Multiple Predictors

This model is easily extendable to multiple linear regression, and we can also add priors to the model. Only a few lines of code change to accommodate multiple predictors: the data block must now specify the number of predictors \(P\), the data \(\small X\) is now an N by P matrix, and in the parameters block beta is now a vector of P coefficients.

Suppose we wanted to add priors to the model, but don’t want to make strong claims about the coefficients. That is, we want to use weakly informative priors. These will minimally affect the parameter estimates, if at all. What they do is allow the use of Savage–Dickey density ratios (a method of calculating the likelihood ratio, or Bayes factor, for a parameter), which require proper prior distributions. This is worth mentioning, but Bayes factors will not be treated here. Proper priors also assist sampling in more complicated models, where uninformative improper priors let the Markov chains spend too much time exploring implausible parameter values.

Such a model would be stated as:

\[ \sigma \sim student's \ \ t^{+}_{\small{half}}(3,0,10) \\ \beta_j \sim student's \ \ t(3,0,10) \\ \alpha \sim student's \ \ t(3,0,10) \\ y_i \sim normal(\alpha + X_i \beta,\ \sigma) \]

Which corresponds to the Stan code shown below.


data {
  int<lower=0> P;
  int<lower=0> N;
  matrix[N,P] X;
  real y[N];
}
parameters {
  real alpha;
  vector[P] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ student_t(3,0,10);
  beta ~ student_t(3,0,10);
  sigma ~ student_t(3,0,10);
  y ~ normal(alpha + X * beta, sigma);
}


So what?

“So what, you just turned a simple linear regression into more work”, you might think. But consider that in modern science we often deal with large numbers of comparisons or variables. In a regression model with three or more variables the classical estimate is inadmissible; that is, there is always an estimator with lower mean squared error (Stein 1956; James and Stein 1961). Such estimators are known as shrinkage, penalized, or regularized estimators. They are extremely useful in situations where there is multicollinearity or where the number of variables is close to or greater than the number of observations. In these situations the estimates can become unstable, exaggerated in magnitude, wrong in sign, and have inflated standard errors.
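The Stein result is easy to see in simulation. The Python sketch below (dimension 10, a made-up mean vector, unit noise) compares the squared error of the plain estimate against the James-Stein shrinkage estimate:

```python
import numpy as np

# Compare the usual estimator z against the James-Stein shrinkage
# estimator for a 10-dimensional normal mean with unit noise.
rng = np.random.default_rng(7)
theta = np.ones(10)   # hypothetical true mean vector
p = theta.size

mse_mle, mse_js = 0.0, 0.0
trials = 2000
for _ in range(trials):
    z = theta + rng.normal(size=p)              # one noisy observation
    shrink = max(0.0, 1.0 - (p - 2) / (z @ z))  # positive-part shrinkage
    mse_mle += ((z - theta) ** 2).sum()
    mse_js += ((shrink * z - theta) ** 2).sum()

print(mse_mle / trials, mse_js / trials)  # shrinkage risk should be lower
```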

One tool to deal with this is ridge regression, which was presented briefly alongside a simplistic ridge-like Bayesian model in a previous post. Unfortunately, such estimators lack a straightforward way to get uncertainty estimates in frequentist theory. In the Bayesian approach the penalized estimates have posterior distributions which quantify the uncertainty about the parameter estimates conditional on the model and data (Kyung et al. 2010; Xiang and Jones 2017). The 95% density interval or “credible interval” describes a range of values which contain the true parameter value with 95% posterior probability.
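Given posterior draws for a parameter, the 95% credible interval is just a pair of percentiles. A sketch in Python, with mock draws standing in for the samples an MCMC fit would return:

```python
import random

# Mock posterior draws (a real analysis would use the MCMC samples).
random.seed(0)
draws = [random.gauss(2.0, 0.5) for _ in range(10000)]

# 95% percentile credible interval: the central 95% of the draws
draws.sort()
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(lower, upper)
```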

In Stan it’s very easy to fit a ridge-like model (but see the Stan programming manual for another implementation), which in this presentation estimates the shrinkage parameter from the data. A Gaussian prior with mean zero is used on each coefficient, rather than a weakly informative Student’s t distribution, and the coefficients’ common standard deviation is estimated from the data. Here a gamma distribution with shape parameter \(\small \frac{1}{2P}\) and rate parameter \(\small 1\) describes the shrinkage parameter \(\small \lambda\). As the number of predictors in the model increases, the pooled standard deviation of the coefficients is pulled further towards zero. Because the priors for \(\small \beta_i\) are all centered on zero, this means the certainty that each \(\small \beta_i\) is zero increases with the number of predictors.
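The pull towards zero follows directly from the prior’s mean. A short Python check of how the expected shrinkage scale falls as the predictor count \(P\) grows:

```python
# A gamma(shape = 1/(2P), rate = 1) prior on lambda has mean
# shape / rate = 1 / (2P), so the expected scale of the coefficients'
# common prior shrinks as the number of predictors P grows.
for P in [1, 5, 10, 50]:
    prior_mean = 1.0 / (2 * P)  # E[lambda]
    print(P, prior_mean)
```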

The model can be stated as:

\[ \sigma \sim student's \ \ t^{+}_{\small{half}}(3,0,10) \\ \lambda \sim gamma(\frac{1}{2P},\ 1) \\ \beta_j \sim normal(0,\lambda) \\ \alpha \sim student's \ \ t(3,0,10) \\ y_i \sim normal(\alpha + X_i \beta,\ \sigma) \]



data {
  int<lower=0> P;
  int<lower=0> N;
  matrix[N,P] X;
  real y[N];
  real tau; // pass tau = 1/(2*P)
}
parameters {
  real<lower=0> lambda;
  real alpha;
  vector[P] beta;
  real<lower=0> sigma;
}
model {
  sigma ~ student_t(3,0,10);
  lambda ~ gamma(tau, 1);
  beta ~ normal(0, lambda);
  alpha ~ student_t(3,0,10);
  y ~ normal(alpha + X * beta, sigma);
}


Below is a plot of the variance inflation factors for the variables in the diabetes data set. Collinearity is clearly a severe problem, leading to inflated standard errors and possibly unstable estimates.



This is easily confirmed by comparing the results from the ordinary Bayesian regression fit (using the weakly informative Student’s t priors on the parameters) as well as the Bayesian ridge-regression like model. Point estimates obtained using the R package glmnet are also shown (with the penalty parameter chosen by cross-validation) as a reference point to which we can compare the Bayes model.



For \(\small \beta_i\) which have stronger evidence (lower standard error) in favor of a particular value, this penalty matters little, if at all. For \(\small \beta_i\) affected by the aforementioned problems, the penalty is stronger in proportion to the severity of the inflation. As expected, the coefficients unaffected by variance inflation are scarcely different from those of an ordinary regression.

Conclusion

Beginning with the simple case of simulating observations from a normal distribution, you learned how to build up a model from a basic assumption about the distribution of observations, to estimating the mean and standard deviation, all the way up to a regularized shrinkage estimator in just a few short steps. This estimator reflects prior knowledge we would like to include in an analysis: we know that data sets with correlated variables tend to produce coefficients of the wrong sign and magnitude, with inflated standard errors.

The key to programming a good Bayesian model, or selecting an existing one, is to write out your assumptions as was done throughout the examples. Write a sentence justifying them in published research papers or presentations. Use exploratory data visualization to build your intuitions. If the outcome variable has many outliers, consider changing the distribution assumed to generate the observations from a Gaussian to a Student’s t distribution. If the data are significantly skewed, a skew-normal distribution may be in order (Umlauf and Kneib 2018). These examples are covered in the page on robust regression. If it is believed that there are only a few non-zero effects (or only a few practically different from zero), then a model with sparsity assumptions on the coefficients, such as the LASSO (least absolute shrinkage and selection operator) or horseshoe models, can be used (Tibshirani 1996; Park and Casella 2008; Carvalho, Polson, and Scott 2010; Datta and Ghosh 2013; Piironen and Vehtari 2017). Each level of the model requires justifiable assumptions and gives an opportunity to tailor the analysis to the needs of your data. Furthermore, the conclusions of a Bayesian model can be stated in intuitive, common-sense terms. These benefits allow a model to be built up from conceptual “legos” and make your assumptions transparent to yourself and to your audience.

References


Berk, R., and D. Freedman. 2003. “Statistical Assumptions as Empirical Commitments.” In T. G. Blomberg and S. Cohen (eds.), Punishment and Social Control: Essays in Honor of Sheldon Messinger, 2nd ed., 235–54. Aldine de Gruyter. https://tinyurl.com/yb4ynxdn.

Carvalho, C. M., N. G. Polson, and J. G. Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” Biometrika 97 (2): 465–80. https://doi.org/10.1093/biomet/asq017.

Datta, Jyotishka, and Jayanta. K. Ghosh. 2013. “Asymptotic Properties of Bayes Risk for the Horseshoe Prior.” Bayesian Analysis 8 (1): 111–32. https://doi.org/10.1214/13-BA805.

Epstein, E. 1962. “A Bayesian Approach to Decision Making in Applied Meteorology.” Journal of Applied Meteorology 1 (2): 169–77. http://journals.ametsoc.org/doi/abs/10.1175/1520-0450%281962%29001%3C0169%3AABATDM%3E2.0.CO%3B2.

———. 1966. “Quality Control for Probability Forecasts.” Monthly Weather Review 94 (8): 487–94. http://journals.ametsoc.org/doi/abs/10.1175/1520-0493%281966%29094%3C0487%3AQCFPF%3E2.3.CO%3B2.

Fisher, R. A. 1922. “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 222 (594-604): 309–68. https://doi.org/10.1098/rsta.1922.0009.

Fisher, R.A. 1925. Statistical Methods for Research Workers. 1st ed. Oliver; Boyd. http://psychclassics.yorku.ca/Fisher/Methods/.

Gill, Jeff. 2014. Bayesian Methods: A Social and Behavioral Sciences Approach. Third edition. Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences Series. Boca Raton: CRC Press, Taylor & Francis Group.

James, W., and Charles Stein. 1961. “Estimation with Quadratic Loss.” In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 361–79. https://projecteuclid.org/euclid.bsmsp/1200512173.

Kruschke, John K. 2011. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Burlington, MA: Academic Press.

Kyung, Minjung, Jeff Gill, Malay Ghosh, and George Casella. 2010. “Penalized Regression, Standard Errors, and Bayesian Lassos.” Bayesian Analysis 5 (2): 369–411. https://doi.org/10.1214/10-BA607.

Lorenc, A. C., and O. Hammon. 1988. “Objective Quality Control of Observations Using Bayesian Methods. Theory, and a Practical Implementation.” Quarterly Journal of the Royal Meteorological Society 114 (480): 515–43. https://doi.org/10.1002/qj.49711448012.

Murphy, Allan H., and Edward S. Epstein. 1967. “A Note on Probability Forecasts and ‘Hedging’.” Journal of Applied Meteorology 6 (6): 1002–4. https://doi.org/10.1175/1520-0450(1967)006<1002:ANOPFA>2.0.CO;2.

Park, Trevor, and George Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86. https://doi.org/10.1198/016214508000000337.

Piironen, Juho, and Aki Vehtari. 2017. “Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors.” Electronic Journal of Statistics 11 (2): 5018–51. https://doi.org/10.1214/17-EJS1337SI.

Stein, Charles. 1956. “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution.” In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 197–206. https://projecteuclid.org/euclid.bsmsp/1200501656.

Stigler, Stephen M. 1981. “Gauss and the Invention of Least Squares.” The Annals of Statistics 9 (3): 465–74. https://doi.org/10.1214/aos/1176345451.

———. 2007. “The Epic Story of Maximum Likelihood.” Statistical Science 22 (4): 598–620. https://doi.org/10.1214/07-STS249.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) 58 (1): 267–88. http://www.jstor.org/stable/2346178.

Umlauf, Nikolaus, and Thomas Kneib. 2018. “A Primer on Bayesian Distributional Regression.” Statistical Modelling 18 (3-4): 219–47. https://doi.org/10.1177/1471082X18759140.

Xiang, Ding, and Galin L. Jones. 2017. “Bayesian Penalized Regression.” arXiv:1706.07767 [Stat], June. http://arxiv.org/abs/1706.07767.