library(glmnet)
library(MASS)
library(robustbase)
library(car)
library(sandwich)
library(lmtest)
library(brms)
library(ggpubr)
library(ggthemes)
library(dplyr)
library(broom)
library(knitr)
library(kableExtra)
All statistical models have assumptions that must be met, at least approximately, for the analysis to be reliable and easily interpretable. Outliers and heteroskedasticity are common in real data sets and can pose problems for ordinary least squares or maximum likelihood estimation (for a simple normal distribution the MLE is equivalent to least squares, so only least squares will be discussed to avoid redundancy). Predictors entered into multiple linear regression models may also be correlated in practice, which can cause a number of problems for the resulting estimates. Fortunately, there are robust methods to deal with the former problems and regularized methods to deal with the latter.
Why use these methods? Common tests for heteroskedasticity or outliers can fail to identify deviations from the assumptions of linear regression. Data transformations (log, square root, etc.) can ameliorate some problems, but at the cost of losing what may be a more intuitive interpretation on the original scale. Discarding data points deemed “bad” may be questionable or unnecessary. I will review some non-Bayesian solutions here and then demonstrate Bayesian alternatives based on flexible model building.
Robust Regression
The main focus of robust regression is resistance to regression outliers, which are data points far away from the rest of the distribution of the dependent variable. Bad leverage points, which are outliers in the independent predictor variables, can also negatively affect the validity of classical models.
Heteroskedasticity means that the variability in one variable increases or decreases as a function of another variable. Classical regression works by finding the coefficient values that minimize the sum of squared residuals, which is why it is often called ordinary least squares. Heteroskedasticity creates problems for minimizing the sum of squares. The regression line should have the desirable property of modeling the average response of the dependent variable, reflecting the intuition that the average value of a variable should represent its most common expected value. When heteroskedasticity is present, the regression line may stop describing the expected value of the dependent variable as the value of the predictor changes.
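To make the idea concrete, here is a minimal simulation sketch (hypothetical data, not the data sets used later) in which the spread of \(y\) grows with \(x\):
# Simulate heteroskedastic data: the residual standard deviation grows with x
set.seed(1)
x = runif(200, 1, 10)
y = 2 + 0.5 * x + rnorm(200, sd = 0.5 * x)  # sd increases with x
plot(x, y)  # the widening "funnel" shape is the visual signature of heteroskedasticity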
Least Absolute Deviations
Least absolute deviations, also called median regression, is one way to conduct a regression analysis on data that include outliers or violate the assumption of homogeneity of variance. Median regression is just an instance of quantile regression where the quantile of interest is the 50th percentile. Please see the page Quantile regression for more information about this approach. There are, however, many more approaches, and the major ones will be reviewed here. It is worth noting that least absolute deviations, and quantile regression more generally, are L1 norm based methods: the L1 norm minimizes the absolute value of the difference between the observed \(y\) values and predicted \(y_{pred}\) values. By contrast, the L2 norm minimizes the sum of the squared differences, which is the basis of ordinary least squares and of most of the methods below.
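As a minimal sketch, assuming the quantreg package and the same simulated personality data loaded below, median regression takes one line:
library(quantreg)
# Median (50th percentile) regression, i.e. minimizing the sum of absolute residuals
lad.fit = rq(Neuroticism ~ Age, tau = 0.5, data = data)
summary(lad.fit)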
Weighted Least Squares
As mentioned before, heteroskedasticity increases the variance of the residuals in an uneven way. This makes the standard errors used in significance tests or confidence interval construction invalid. They may lead to the correct decision, but they may also cause a Type I or Type II error (in frequentist parlance).
One solution is to adopt a weighted least squares approach, with weights chosen to down-weight observations of \(x\) with larger magnitude and thereby compensate for the variance of \(y\) increasing with \(x\). A common choice in this situation is for the \(i_{\text{th}}\) weight to be the reciprocal of the square root of the \(i_{\text{th}}\) observation of predictor \(x\). The method of calculation is shown below.
\[ \text{VCOV}_{\text{wls}} = \underbrace{ \begin{pmatrix} \sigma^2_1&0&\cdots&0\\ 0&\sigma^2_2&\cdots&\vdots\\ \vdots&\cdots&\ddots&\vdots\\ 0&\cdots&\cdots&\sigma^2_i\\ \end{pmatrix}}_{\text{VCOV}_{\text{ols}}} ~~~~~~ \cdot ~~~~~~ \underbrace{ \begin{pmatrix} w_1&0&\cdots&0\\ 0&w_2&\cdots&\vdots\\ \vdots&\cdots&\ddots&\vdots\\ 0&\cdots&\cdots&w_i\\ \end{pmatrix}}_{\textit{w}_i ~ = ~ \frac{1}{\sqrt{x_i}}} \]
Below is a comparison of two model fits, one the ordinary least squares fit and the other the weighted least squares fit. The same simulated personality data from the quantile regression page will be used.
data = read.csv("https://tinyurl.com/ybkvf6e2", header=TRUE)
wls = lm(Neuroticism ~ Age,data, weights=1/sqrt(Age))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.6694371 | 0.4136421 | -1.618397 | 0.1080386 |
Age | 0.0706068 | 0.0219907 | 3.210764 | 0.0016740 |
ols = lm(Neuroticism ~ Age,data)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.6540123 | 0.4483924 | -1.458571 | 0.1471326 |
Age | 0.0698198 | 0.0229468 | 3.042683 | 0.0028458 |
Another solution is to use a heteroskedasticity-consistent variance-covariance matrix to calculate the standard errors. This is done through a family of estimators called “sandwich” estimators. Although this won’t change the coefficient estimates, it will produce standard errors (and subsequently p-values or confidence intervals) corrected for the heteroskedasticity. For the sake of space I will show how to obtain this correction, but won’t go into the math behind it.
vcov.rob = vcovHC(ols, type = "HC4m")  # heteroskedasticity-consistent (HC4m) covariance matrix
sammich = coeftest(ols, vcov. = vcov.rob)
kable(tidy(sammich)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.6540123 | 0.4513283 | -1.449083 | 0.1497598 |
Age | 0.0698198 | 0.0261474 | 2.670240 | 0.0085632 |
Least Trimmed Squares
One of the simplest approaches to robust regression is least trimmed squares (LTS). Like ordinary least squares, it is an L2 norm based method: the error term is the sum of squared residuals (see equation below), rather than absolute values.
\[ \sum_{i=1}^{k} \big(y - y_{pred}\big)^2_{(i)} \]
The error term in LTS is minimized by finding the subset \(k\) of the data set that produces the smallest value of the error term. Alternatively the amount of trimming can be specified manually, but this may not give optimal results. However, LTS has relatively low efficiency, meaning it takes more observations to achieve stable results. This is somewhat intuitive, since trimming observations does reduce the original sample size. LTS does have the advantage of a very high breakdown point of 0.5. This means that LTS is insensitive to influence by outliers, provided that the outliers constitute less than 50 % of the data (Mount et al. 2014). See Venables and Ripley (2002) and Wilcox (2012b) for more information on least trimmed squares regression.
For an example the diabetes data from Bradley Efron’s Computer Age Statistical Inference website will be used, with some “corruptions” added to the data for an example predicting ldl cholesterol from hdl. All variables are scaled (as z-scores).
diabetes = read.csv("https://tinyurl.com/ycl5qcz5", header=TRUE)
An LTS model can be fit with the ltsReg function in the robustbase package. Below is a comparison of the least squares and least trimmed squares fits. An LTS function is also available in the MASS package, but requires bootstrapping to obtain p-values.
lts = ltsReg(formula = ldl ~ hdl, data = diabetes)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
Intercept | -0.0826366 | 0.0358277 | -2.306501 | 0.0215674 |
hdl | -0.1774956 | 0.0370846 | -4.786228 | 0.0000024 |
ols = tidy(lm(formula = ldl ~ hdl, data = diabetes))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0000000 | 0.0475236 | 0.000000 | 1.0000000 |
hdl | -0.0633094 | 0.0475775 | -1.330659 | 0.1839903 |
M-Estimators and Extensions
Weighted least squares can also be applied to problems with regression outliers and leverage points. To do this, a set of optimal weights must be found. This is done through an iteratively weighted least squares (IWLS) approach in which a family of maximum-likelihood-like estimators (M-Estimators) is used to find an optimal set of weights.
Generally speaking, M-Estimators work by minimizing a particular function \(\xi\) on the residuals (which are scaled by a scale parameter \(\tau\)) (Wilcox 2012b).
\[ \sum_{i=1}^n \xi\Big(\frac{y-y_{{pred}}}{\tau}\Big)_i \]
The most popular class of \(\xi\) functions for M-Estimation is the \(\psi\) family of estimators. The two most common examples are Tukey’s bisquare and Huber’s psi. The particular mathematical details will not be reviewed here, but both Tukey’s and Huber’s measures are robust measures of central tendency. For more details Wilcox (2012a), Wilcox (2012b), and Venables and Ripley (2002) give thorough coverage of the particulars.
The purpose of \(\psi\) is to find a set of weights which reduce the influence of regression outliers on the final estimates. This is done as an iterative process for different values for \(w_i\).
\[\begin{align} w_i = \frac{\psi(x_i)}{x_i} & & \text{weights}\\ \sum_{i=1}^n w_i \cdot (y-y_{pred})_i^2 & & \text{weighted least squares} \end{align}\]

M-Estimators are useful for finding an optimal solution when there are a few regression outliers. They can fail more easily in the presence of bad leverage points. S-Estimators, or scale estimators, attempt to find values which minimize the scale of the residuals. This means the predictor variables must be taken into account to find the coefficients that minimize the residual variance. There are a few approaches to this, but generally S-Estimators depend on IWLS. However, S-Estimators can be inefficient compared to M-Estimators.
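To make the idea of \(\psi\)-based weights concrete, here is a minimal sketch using Huber's \(\psi\) with its conventional tuning constant (an illustration of the idea only, not the internal code of any package):
# Huber's psi: identity near zero, capped at +/- k for large scaled residuals
huber.psi = function(u, k = 1.345) pmin(k, pmax(-k, u))
# Implied IWLS weight: psi(u)/u, so outlying residuals receive smaller weights
huber.weight = function(u, k = 1.345) ifelse(u == 0, 1, huber.psi(u, k) / u)
huber.weight(c(0.1, 1, 5, 10))  # weights shrink as the scaled residual grows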
The solution that has emerged is the Modified M-Estimators, or MM-Estimators. MM-Estimators use S-Estimators to find coefficient values which minimize the residual standard deviation of M-Estimators. The process begins by calculating a candidate set of M-Estimator solutions. The M-Estimators need a measure of scale \(\tau\), so S-Estimators find an optimal solution to the candidate set. This increases the efficiency of S-Estimation because it does not have to search a parameter space that grows with sample size. Once the optimal scale is chosen, M-Estimation finds the most likely values for the coefficients. It is capable of achieving a breakdown point of .5, and confidence intervals and p-values can be obtained with much less effort and time than bootstrapping the LTS method.
As an example the same data used for the least trimmed squares example will be used. The function rlm can be found in the MASS package.
rlm = rlm(ldl ~ hdl, data = diabetes, psi="psi.huber", scale.est="Huber")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0454903 | 0.0409183 | -1.111733 | 0.2662528 |
hdl | -0.1442526 | 0.0409647 | -3.521389 | 0.0004293 |
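The rlm call above uses the default M-estimation with Huber's \(\psi\); the MM procedure described in this section can be requested through the method argument. A minimal sketch (estimates not shown here):
fit.mm = rlm(ldl ~ hdl, data = diabetes, method = "MM")  # S-estimate of scale, then M-estimation
summary(fit.mm)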
Regularized Regression
There are two major methods of regularized regression, the ridge regression approach and the LASSO (least absolute shrinkage and selection operator) approach. Ridge regression is typically used in situations where there is significant correlation (collinearity) between predictor variables and/or there are a great number of predictor variables but a smaller number of observations. Ridge regression shrinks the values of coefficients as a way of preventing them from taking on wildly large magnitudes.
LASSO is a modification of the ridge that reduces some coefficients in a large model to a value of exactly zero. It is typically used as a method of variable selection for building a regression model. It is useful when you have a large number of predictors and don’t know which ones would be useful to build a simple model. The LASSO will be covered in a future post on model selection and machine learning.
The estimate of each regression coefficient in a multiple regression model depends not only on the variable \(x_i\), but also on the other predictors in the model. Each regression coefficient represents the unique contribution of a predictor in predicting \(y\). When there is too much correlation between predictors it becomes difficult to separate the unique effects of the correlated predictors on \(y\). The variance of the effect estimate becomes inflated, leading to a large standard error. Furthermore, the effect estimate itself may bounce around wildly, sometimes even taking the wrong sign. This is not surprising: if the standard error is high, the confidence in an estimate is lower, meaning that any particular estimate obtained is less likely to be correct.
There are three practical consequences of this beyond estimation error. First, the coefficient estimates can change wildly in response to small perturbations of the data, making it very unlikely that the model would generalize to a new data sample. Second, larger standard errors mean a loss of power when trying to find “statistically significant” results. Third, even if the model does a good job of predicting new data points, the interpretation of the coefficients can become difficult because you don’t know how much unique variance for a given predictor is reflected in a coefficient. This isn’t a problem if prediction is the goal, but if scientific questions about the effects of \(x_i\) on \(y\) are the focus of an analysis it undermines the research goals.
The cause for these issues is quite easy to see. Consider the formulas below for the case of two predictors \(a\) and \(b\). The first formula shows how the unique effects of predictor \(a\) on \(y\) are calculated, and the second the standard error of that estimate.
\[\begin{align} \hat{\beta_a} = \frac{\sum(x^2_b)\sum(x_a \cdot y) - \sum(x_a \cdot x_b)\sum(x_b \cdot y)}{\sum(x^2_a)\cdot \sum(x^2_b) - \sum(x_b \cdot x_a)^2 } & & \text{Coefficient of predictor a} \\ \\ SE_{\hat{\beta_a}} = \sqrt{\frac{sd(y-y_{pred})^2}{\sum x^2_a \cdot \big(1-R^2_{a\sim b}\big)}} & & \text{Standard Error of}~ \hat{\beta}_a \end{align}\]

Notice the term \(1-R^2_{a\sim b}\) in the denominator of the standard error formula. This is the difference between the total possible explained variance (1) and the proportion of explained variance (the coefficient of determination, \(R^2\)) for a linear model of \(a\) predicted by co-predictor \(b\). When there is perfect collinearity between a and b, this term becomes 1 - 1 = 0, leading to a denominator of zero!
Below is a plot of what happens to the standard error as a function of increased predictor correlation, holding the vector of x values and numerator the same and allowing only the \(R^2\) for a~b to vary.
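A minimal sketch of such a plot, using illustrative (assumed) values for the residual variance and \(\sum x_a^2\):
# Standard error of beta_a as a function of the collinearity between a and b
r2.ab = seq(0, 0.99, by = 0.01)   # R-squared for the model a ~ b
resid.var = 1                      # assumed residual variance
ss.xa = 100                        # assumed sum of squares of predictor a
se.beta.a = sqrt(resid.var / (ss.xa * (1 - r2.ab)))
plot(r2.ab, se.beta.a, type = "l", xlab = "R-squared for a ~ b", ylab = "SE of coefficient a")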
Ridge Regression
Ridge regression is a solution to the problem of collinearity. It adds a penalty term (see equation below) to the sum of squared errors formula for each predictor. This regularizes the estimate so that there is greater tolerance to the effects of collinearity.
\[ \underbrace{\sum_{i=1}^n \Big(y_i - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2}_{\text{L2 norm SSE}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_\text{L2 Penalty} \]
The penalty is known as an L2 penalty because it is based on the sum of squares, rather than an absolute value based term penalty (L1 penalty). The penalty adds the product of lambda (a tuning constant used to select how strong the penalty should be) and the squared coefficient for the \(\textit{j}_{\textit{th}}\) predictor. This has the effect of adding shrinkage to each coefficient that pulls its value towards zero.
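As a concrete illustration of this objective, here is a didactic sketch (glmnet itself minimizes the same kind of quantity with an efficient coordinate-descent algorithm, not code like this):
# Ridge objective: L2 sum of squared errors plus the L2 penalty on the coefficients
# (intercept omitted for simplicity; in practice it is not penalized)
ridge.objective = function(beta, X, y, lambda) {
  sse = sum((y - X %*% beta)^2)
  penalty = lambda * sum(beta^2)
  sse + penalty
}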
An Example
Consider a data set based on the iris data set in R. The data can be downloaded at the link below directly into R. Ridge regression is evaluated based on its out of sample predictive error. Although ridge regression under-fits the model on the data it is given, this actually results in better out of sample predictive accuracy. The training set will be what the model sees, and the test set will be used to evaluate the model.
set.seed(2018)
iris.sample = read.csv("https://tinyurl.com/yaq7grsh", header=TRUE)
train_indices = 1:round(0.80 * nrow(iris.sample))
train = as.data.frame(iris.sample[train_indices, ])
test = as.data.frame(iris.sample[-train_indices,])
test = data.frame(lapply(test, jitter, amount=.5))
Below is shown the correlation matrix for the predictors, which indicates a high level of correlation.
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
Sepal.Length | 1.0000000 | -0.1476321 | 0.8734977 | 0.8191581 |
Sepal.Width | -0.1476321 | 1.0000000 | -0.4235719 | -0.3728722 |
Petal.Length | 0.8734977 | -0.4235719 | 1.0000000 | 0.9524889 |
Petal.Width | 0.8191581 | -0.3728722 | 0.9524889 | 1.0000000 |
High levels of correlation between predictors (collinearity) result in inflated variance and potentially incorrect estimates of \(\hat{\beta}\). The level of variance inflation in a model can be checked using the vif function of the car package. Let’s consider a model predicting sepal width from the other variables.
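A minimal sketch of fitting the OLS model on the training data and checking its variance inflation factors (the display code here is assumed, following the kable pattern used earlier):
ols.fit = lm(Sepal.Width ~ ., data = train)
kable(tidy(ols.fit)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
vif(ols.fit)  # values above roughly 5-10 signal problematic collinearity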
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0002896 | 0.0348456 | 0.0083107 | 0.9933726 |
Sepal.Length | 0.9367440 | 0.0707245 | 13.2449775 | 0.0000000 |
Petal.Length | -1.6622037 | 0.1334507 | -12.4555627 | 0.0000000 |
Petal.Width | 0.4484609 | 0.1126177 | 3.9821511 | 0.0000790 |
| | VIF |
|---|---|
Sepal.Length | 4.251253 |
Petal.Length | 15.076571 |
Petal.Width | 10.861367 |
These values are far from ideal and warrant the use of regularized regression. Ridge regression can be fit with the glmnet package, which allows fitting ridge, lasso, and elastic net (a combination of ridge and lasso) models. It requires that the model be given as a matrix of predictors and a vector of dependent variable values, which is easy to obtain using the model.matrix function.
Next, I will fit the model using k-fold cross-validation. This splits the data into \(k\) subsets in order to fine-tune the lambda parameter, which controls the amount of shrinkage. The returned model is then the model fit to the whole data set.
x.mat = model.matrix(Sepal.Width ~ ., train)[,-1]
y.var = train$Sepal.Width
cv.ridge = cv.glmnet(x.mat, y.var, alpha=0, nlambda=100, lambda.min.ratio=0.0001) # Set alpha to zero for ridge regression
A plot showing the candidate values of the tuning parameter \(\lambda\) can be produced. One dashed line marks the value of \(\lambda\) that gives the minimum mean cross-validated error; the other marks the largest value of \(\lambda\) for which the error is within one standard error of that minimum. Either one of these values can be used. The first is stored as lambda.min in the cv.ridge object and gives the less regularized of the two models. The second is stored as lambda.1se and results in a more heavily regularized (more shrunken) model. I prefer lambda.min.
plot(cv.ridge)
The coefficients for the chosen value of lambda can then be retrieved and made into a table. The ridge coefficients are shown first, followed by the results for the ordinary least squares fit for comparison.
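For instance, a minimal sketch of retrieving the coefficients at lambda.min (the value preferred above):
ridge.coefs = coef(cv.ridge, s = "lambda.min")  # sparse matrix of coefficients at lambda.min
ridge.coefs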
term | estimate |
---|---|
(Intercept) | 0.002 |
Sepal.Length | 0.632 |
Petal.Length | -0.907 |
Petal.Width | -0.017 |
term | estimate |
---|---|
(Intercept) | 0.000 |
Sepal.Length | 0.937 |
Petal.Length | -1.662 |
Petal.Width | 0.448 |
Ridge regression really shines when it comes to out of sample predictive error. Ordinary least squares can potentially over-fit the data, such that when presented with new cases it may not perform well. Ridge regression has the advantage that by under-fitting the model it generalizes to new samples more readily.
x.test = model.matrix(Sepal.Width ~ ., test)[,-1]
y.test = test$Sepal.Width
ridge.train = data.frame(predict(cv.ridge, x.test, s="lambda.1se"), y.test)
colnames(ridge.train)=c("predicted", "actual")
ridge.train$residual = ridge.train$actual-ridge.train$predicted
ols.train = data.frame(predict(lm(Sepal.Width ~ ., train), newdata=as.data.frame(x.test)), y.test)
colnames(ols.train)=c("predicted", "actual")
ols.train$residual = ols.train$actual-ols.train$predicted
Below is a comparison of the predictive performance of both models by several common measures calculated from the code above.
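These measures can be computed from the residuals above; a minimal sketch:
error.summary = function(res) c(MSE = mean(res^2), RMSE = sqrt(mean(res^2)), MAE = mean(abs(res)))
cbind(`Least Squares` = error.summary(ols.train$residual),
      Ridge = error.summary(ridge.train$residual))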
| | Least Squares | Ridge |
|---|---|---|
MSE | 0.8249143 | 0.6842070 |
RMSE | 0.9089825 | 0.8281906 |
MAE | 0.6696339 | 0.6075643 |
Bayesian Approaches
Modeling Heteroskedastic Variance
One benefit of Bayesian model building is that constructing a model where the variance of the dependent variable \(y\) depends on the value of \(x_i\) is quite easy. This falls under the general category of distributional Bayesian modeling (Umlauf and Kneib 2018). Distributional modeling estimates parameters such as the variance from the data, rather than treating them as nuisance parameters, and is a useful alternative to quantile regression in situations like this (Koenker, Leorato, and Peracchi 2013). The brms package can easily implement this model.
formula = bf(Neuroticism ~ Age, sigma ~ Age)
prior = get_prior(formula, data)
prior$prior[1] = "student_t(3, 0, 10)"
fit.hetvar = brm(formula, data=data, prior=prior, iter = 4000, cores = 4)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
b_Intercept | -0.6981544 | 0.3815397 | -1.5003434 | 0.0039283 |
b_sigma_Intercept | -0.5409742 | 0.2117965 | -0.9475223 | -0.1209372 |
b_Age | 0.0727339 | 0.0220293 | 0.0279566 | 0.1149342 |
b_sigma_Age | 0.0421517 | 0.0107176 | 0.0212218 | 0.0629651 |
prior = get_prior(Neuroticism ~ Age, data)
prior$prior[1] = "student_t(3,0,10)"
fit.normal = brm(Neuroticism ~ Age, data, prior=prior, iter = 4000, cores = 4)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
b_Intercept | -0.6480362 | 0.4529351 | -1.5242901 | 0.2387324 |
b_Age | 0.0696351 | 0.0232452 | 0.0236176 | 0.1145285 |
sigma | 1.3696693 | 0.0855839 | 1.2058130 | 1.5369421 |
Student T Model
A Bayesian alternative to MM-Estimators or Least Trimmed Squares is to use the Student’s t distribution as the family function in a regression model (Gelman et al. 2014). This assumes that the observations \(y_i\) are drawn from a t-distribution, which has heavier tails stretching farther from the center of the distribution’s mass. This is useful for representing situations where the bulk of the data come from an essentially Gaussian distribution but contain some data points a distance away from the rest. Unlike MM-Estimators or Least Trimmed Squares, this directly models the outlier-generating process. The model gives the most weight to the bulk of the distribution, paying less attention to the tails. This results in a lower standard error and decreases the chances that rarely occurring observations mislead you about the relationship of the predictors to the dependent variable. Furthermore, the normality parameter is estimated from the data to indicate how “robust” the model needs to be.
For example, consider the ldl variable from the corrupted diabetes data set. The outlying observations create long, thin tails in the distribution.
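A quick way to see those tails is a histogram of ldl; a minimal sketch using ggpubr (loaded above):
gghistogram(diabetes, x = "ldl", bins = 30)  # corrupted observations appear as long, thin tails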
The brms package can easily accommodate this situation with a Student’s t regression model. The results can be compared to the OLS, LTS, and MM-Estimator results from before.
prior = get_prior(ldl ~ hdl, data = diabetes, family=student)
prior$prior[1] = "student_t(3,0,5)"
fit.robust = brm(ldl ~ hdl, diabetes, prior=prior, family=student, iter = 4000, cores = 4)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
b_Intercept | -0.0518807 | 0.0403906 | -0.1335647 | 0.0246811 |
b_hdl | -0.1540760 | 0.0414662 | -0.2309634 | -0.0712865 |
sigma | 0.7321908 | 0.0386226 | 0.6595099 | 0.8103602 |
nu | 5.4133865 | 1.1911686 | 3.3439331 | 7.7970095 |
Skew Normal Model
Data may sometimes not quite follow a normal distribution even when they lack outliers. When there are more observations on one side of the Gaussian curve than the other, the result is a skew normal distribution (Arellano-Valle et al. 2008; Alhamide, Ibrahim, and Alodat 2015). If the observations are skewed badly enough, one might wish to account for this skewness in the regression model. This is fully achievable by using a skew normal probability density function in lieu of a normal p.d.f. for the likelihood.
For example, consider the bmi variable in the diabetes data set used in previous examples.
The distribution is clearly skewed. For an example, let’s fit a model of bmi with age and hdl as predictors.
prior = get_prior(bmi ~ age + hdl, data = diabetes, family=skew_normal)
prior$prior[1] = "student_t(7,0,5)"
prior$prior[2] = "student_t(3,0,10)"
fit.skew = brm(bmi ~ age + hdl, diabetes, prior=prior, family=skew_normal, iter = 4000, cores = 4)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
b_Intercept | -0.0038928 | 0.0457674 | -0.0927215 | 0.0861433 |
b_age | 0.2320552 | 0.0326219 | 0.1683877 | 0.2972522 |
b_hdl | -0.2080734 | 0.0393863 | -0.2843064 | -0.1303651 |
sigma | 0.9492181 | 0.0362647 | 0.8796179 | 1.0210291 |
alpha | 5.1128051 | 1.0480318 | 3.2134268 | 7.2100664 |
prior = get_prior(bmi ~ age + hdl, data = diabetes)
prior$prior[1] = "student_t(3,0,10)"
fit.gauss = brm(bmi ~ age + hdl, data = diabetes, prior=prior, iter = 4000, cores = 4)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
b_Intercept | -0.0003448 | 0.0449536 | -0.0879228 | 0.0858921 |
b_age | 0.1616910 | 0.0450829 | 0.0740293 | 0.2496552 |
b_hdl | -0.3173459 | 0.0438082 | -0.4039143 | -0.2313988 |
sigma | 0.9348709 | 0.0318879 | 0.8762074 | 1.0019232 |
When compared to the Gaussian model, the skew normal model’s coefficients are of moderately different strengths. If the sample size were smaller or the coefficients smaller in magnitude, these differences could make the difference between declaring a coefficient to be credibly non-zero and not making such a claim. If the data in the dependent variable are skewed, the skew normal model is the safest bet. If the skewness is not enough to matter the coefficients will come out the same as a typical normal model.
Regularized Regression
The ridge penalty is equivalent to placing a normally distributed (Gaussian) prior on each coefficient, \(\beta \sim \text{Normal}\big(\mu=0, ~ \sigma = \frac{1}{2p}\cdot \sqrt{1/I(\theta)}\big)\), where \(I\) is the expected Fisher information and \(p\) is the number of predictors.
The variance of an estimate is greater than or equal to the reciprocal of the Fisher information (a rule called the Cramér-Rao bound). The standard deviation of the estimate (or of the prior, in this case) is then at least the square root of the reciprocal of the Fisher information.
\[ \text{VAR}({\hat\theta}) \geq \frac{1}{I(\theta)} \\ \text{sd}({\hat\theta}) \geq \sqrt{\frac{1}{I(\theta)}} \]
Standardizing the variables to a mean of zero and standard deviation of 1 makes it easy to decide on the prior standard deviation for the coefficients: simply multiply \(\frac{1}{2p}\) by 1, giving \(\beta \sim \text{Normal}\big(\mu=0, ~ \sigma = \frac{1}{2p}\cdot 1\big)\).
In this example there are three predictors, making the prior \(\text{Normal}\big(0, ~ \frac{1}{6}\big)\). Now that the prior has been obtained, the model can be fit in brms.
prior = get_prior(Sepal.Width ~ ., train)
prior$prior[1] = "normal(0, 0.1666667)"
fit.ridge = brm(Sepal.Width ~ ., train, prior=prior, iter=4000, cores = 4)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
b_Intercept | 0.0020479 | 0.0373395 | -0.0700503 | 0.0741704 |
b_Sepal.Length | 0.6366583 | 0.0654164 | 0.5113319 | 0.7655038 |
b_Petal.Length | -0.9166436 | 0.0992095 | -1.1045157 | -0.7180442 |
b_Petal.Width | -0.0121009 | 0.0850504 | -0.1765799 | 0.1529008 |
sigma | 0.7897063 | 0.0266057 | 0.7414062 | 0.8439046 |
Notice that the estimates are nearly identical to those obtained using lambda.min in the glmnet ridge model. This is no coincidence; the prior was designed to reflect the properties of the ridge estimator, after all! So why use this over glmnet? The benefit of the Bayesian approach is that full posterior distributions are obtained for each parameter, allowing statistical inference about which parameters are credibly non-zero. There is no widely accepted standard for statistical inference from a ridge regression model fit in glmnet. Classical frequentist hypothesis tests largely rely on unbiased estimators to obtain test statistics that follow specific sampling distributions, whereas Bayesian inference proceeds by updating prior beliefs with the data to obtain a posterior distribution.
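For example, posterior summaries and 95% credible intervals for each coefficient can be pulled directly from the fitted model (a minimal sketch using the brms posterior_summary accessor):
posterior_summary(fit.ridge)  # posterior means, SDs, and 95% intervals; an interval excluding zero suggests a credibly non-zero coefficient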
The model’s predictions can also be compared against a test sample to evaluate its performance, just as with any other regression model. Below is the mean square error, which is comparable (if not slightly better) than what was obtained using glmnet.
pred = as.data.frame(predict(fit.ridge, newdata = as.data.frame(x.test)))
mean((pred$Estimate-y.test)^2)
## [1] 0.6780166
Conclusion
These methods can be applied to the most common situations where ordinary linear regression fails. While these are not the only methods, they are among the most widely used. Future posts will build on these concepts and methods for even more powerful methods in machine learning as well as mixed-effects linear models.
References
Alhamide, A. A., K. Ibrahim, and M. T. Alodat. 2015. “Multiple Linear Regression Estimators with Skew Normal Errors.” AIP Conference Proceedings 1678 (060013): 1–4. doi:10.1063/1.4931340.
Arellano-Valle, Reinaldo B., Luis M. Castro, Marc G. Genton, and Héctor W. Gómez. 2008. “Bayesian Inference for Shape Mixtures of Skewed Distributions, with Application to Regression Analysis.” Bayesian Analysis 3 (3): 513–39. doi:10.1214/08-BA320.
Gelman, Andrew, J Carlin, H Stern, D Dunson, A Vehtari, and D Rubin. 2014. Bayesian Data Analysis. Edited by A Gelman. Third edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: CRC Press.
Koenker, R, S Leorato, and F Peracchi. 2013. “Distributional Vs. Quantile Regression.” Einaudi Institute for Economics; Finance (EIEF). https://ideas.repec.org/p/eie/wpaper/1329.html.
Mount, David M., Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. 2014. “On the Least Trimmed Squares Estimator.” Algorithmica 69 (1): 148–83. doi:10.1007/s00453-012-9721-8.
Umlauf, Nikolaus, and Thomas Kneib. 2018. “A Primer on Bayesian Distributional Regression.” Statistical Modelling 18 (3-4): 219–47. doi:10.1177/1471082X18759140.
Venables, W. N., and B. D. Ripley. 2002. “Linear Statistical Models.” In Modern Applied Statistics with S, edited by J. Chambers, W. Eddy, W. Härdle, S. Sheather, and L. Tierney, 139–78. Statistics and Computing. New York, NY: Springer New York. doi:10.1007/978-0-387-21706-2.
Wilcox, R. 2012a. “A Foundation for Robust Methods.” In Introduction to Robust Estimation and Hypothesis Testing, edited by R Wilcox, 3rd ed., 23–42. Elsevier. doi:10.1016/B978-0-12-386983-8.00002-0.
———. 2012b. “Robust Regression.” In Introduction to Robust Estimation and Hypothesis Testing, edited by R Wilcox, 3rd ed., 471–532. Elsevier. doi:10.1016/B978-0-12-386983-8.00010-X.