Abnormally Distributed

Quantile (Percentile) Regression

Brandon Vaughan

last update: 20 May 2018

I’ll be using these packages:

library(ggpubr)
library(ggthemes)
library(brms)
library(broom)
library(dplyr)
library(knitr)
library(kableExtra)
library(quantreg)

What is quantile regression?

Ordinary linear regression models how the mean of a response (dependent) variable changes across different values of the predictor (independent) variables, while quantile regression instead models quantiles of the response variable.

Ordinary regression seeks the regression line that minimizes the sum of squared errors, so that the fitted values are as close as possible, in that sense, to the observed values of the response. In the case of a single vector \(y_i\) the expected value \(E[y_i]\) is simply the sample mean. This reflects the intuition that the mean of a normally distributed variable is the best single summary of that variable. In linear regression \(E[y_i]\) is modeled as a linear combination of the predictor variables with coefficients \(\beta_i\), and the error \(e_i\) is the difference between the fitted value \(y_{pred}\) on the regression line and the observed value of the response variable.

Quantile regression instead attempts to minimize the sum of absolute errors for a given quantile. The .50 quantile (50% Quantile) is the median, in which case quantile regression becomes median regression (Koenker 2005).

\[\begin{align} \sum_{i=1}^{n} (y_i-y_{pred})^{2} & & = \sum_{i=1}^{n}e_i^{2} & & \text{Sum of Squared Errors} \\ \\ \sum_{i=1}^{n} |y_i-y_{pred}| & & = \sum_{i=1}^{n}|e_i| & & \text{Sum of Absolute Errors} \end{align}\]

Technically speaking, the sum of absolute errors is the criterion for least absolute deviation regression, the median analog of ordinary least squares (mean) regression.

Quantile regression extends median regression to other quantiles by applying an asymmetric penalty to under-predictions and over-predictions (see the equation below; the left-hand term is the penalty for under-predictions and the right-hand term the penalty for over-predictions).

\[ \sum_{i:\, y_i \geq y_{pred}} q\,|e_i| ~~+~~ \sum_{i:\, y_i < y_{pred}} (1-q)\,|e_i| \]

Weighting the errors this way allows any quantile \(q\) to be estimated, rather than only the median. In the case of the median, \(q = 0.5\), so both sides receive the same weight and the whole expression is just half of \(\Sigma_i~|e_i|\); minimizing it is therefore equivalent to minimizing the sum of absolute errors. What this amounts to in layman’s terms is that the symmetry of the penalty results in finding the line which divides the data in half. For the 10th quantile, the asymmetric penalty finds the line through the data where 10% of the data fall below the line and 90% fall above it.
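To make the penalty concrete, here is a small illustrative R sketch (the toy vector y below is made up for the example, not taken from the data set used later in the post) showing that minimizing the squared-error criterion recovers the mean, minimizing the absolute-error criterion recovers the median, and minimizing the asymmetric penalty recovers an arbitrary quantile:

# toy demonstration: which single constant b minimizes each loss?
set.seed(1)
y <- rexp(1000)                              # any skewed sample will do

sse  <- function(b) sum((y - b)^2)           # sum of squared errors
sae  <- function(b) sum(abs(y - b))          # sum of absolute errors
qerr <- function(b, q) {                     # asymmetric (check) loss for quantile q
  e <- y - b
  sum(q * e[e >= 0]) + sum((1 - q) * -e[e < 0])
}

optimize(sse, range(y))$minimum              # approximately mean(y)
optimize(sae, range(y))$minimum              # approximately median(y)
optimize(qerr, range(y), q = 0.10)$minimum   # approximately quantile(y, 0.10)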

Why would you want to do this? Consider the following situation where the data display heteroskedasticity:

An ordinary linear regression model is a poor fit here: least squares will still produce a line, but the constant-variance assumption is violated, so a single mean regression line summarizes the data badly and its standard errors cannot be trusted. The following figure shows the distance between the observed data points and the predictions of a simple linear model fit with lm(Neuroticism ~ Age).
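If you want to check the heteroskedasticity yourself, a quick sketch (using the same simulated data set that is loaded again later in the post) is to fit the simple linear model and plot the residuals against the fitted values; a widening fan of residuals is the classic sign:

# the data set is loaded again further down; repeated here so the snippet is self-contained
data = read.csv("https://tinyurl.com/y9uct2c5", header = TRUE)

lm.fit = lm(Neuroticism ~ Age, data = data)
plot(lm.fit, which = 1)   # residuals vs. fitted values; a fan shape suggests heteroskedasticity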

With quantile regression we can find regression lines for each quantile. Furthermore, quantile regression is robust to outliers. Notice that the regression lines in the following figure are not pulled toward outliers, such as the one at the bottom left of the figure.

ggscatter(data, x="Age", y="Neuroticism", xlab = "Age", ylab="Neuroticism", color="Gender", palette = c("lightsteelblue4","navyblue")) + theme_stata() +  geom_quantile(quantiles=c(.10, .50, .90), method = "rq", color="darkviolet")

So heteroskedasticity and robustness to outliers are two reasons to spring for a quantile regression approach, but there is also good reason to use it whenever you hypothesize that a predictor variable may not have the same effect across all quantiles of the response variable. This is a common situation in econometrics and the ecological sciences, and may also be common in the biomedical sciences (Cade and Noon 2003; Cameron and Trivedi 2010).

The Likelihood Function for a Quantile Regression Model

There are a few ways to fit a quantile regression model, but one that uses a likelihood function built for this purpose is the asymmetric Laplace distribution model (Sánchez, Lachos, and Labra 2013).

In normal linear regression the data are modeled as being drawn from a normal distribution:

\[ y_i \sim normal(\mu, \sigma) \]

\(\mu\) is in turn presumed to be determined by an intercept \(\beta_0\) (the value of \(y_i\) when all predictors are zero) and a set of linear predictors with coefficients \(\beta_1, \ldots, \beta_k\). For modeling quantiles, the data are similarly modeled as distributed according to a location parameter \(\mathit{m}\) (like \(\mu\)), a scale parameter \(\lambda\) (like \(\sigma\)), and a skewness parameter \(q\) corresponding to the quantile of interest.

\[ y_i \sim asym.~laplace (\mathit{m}, \lambda, q) \]

The reason the asymmetric Laplace distribution works is that setting the skewness parameter to the quantile of interest results in \(q\%\) of the probability mass falling below the location \(\mathit{m}\) and \((1-q)\%\) falling above it, mirroring the asymmetric error penalty above. Again, if modeling the median, 50% of the data would fall on either side of the line. This holds regardless of what data generating process may be responsible for the data as a whole (Yang, Wang, and He 2016).

For those interested in more of the nitty-gritty math, the probability density functions for the normal distribution and the asymmetric Laplace distribution are given below. You can see the parallels between the two densities, and between each density and its corresponding error term. In the asymmetric Laplace P.D.F., \(q\) is a constant corresponding to the quantile and \(\rho_q\) is the asymmetric (check) loss function from the error term above.

\[\begin{align} f(y_i \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2} } \exp \Bigg\{{ -\frac{(y_i-\mu)^2}{2\sigma^2} }\Bigg\} & & \text{Normal P.D.F.} \\ \\ f(y_i \mid \mathit{m}, \lambda, q) = \frac{q(1-q)}{\lambda} \exp \Bigg\{{ -\rho_q\Big(\frac{y_i-\mathit{m}}{\lambda}\Big) }\Bigg\} & & \text{Asymmetric Laplace P.D.F.} \end{align}\]
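If you want to convince yourself of the claim about the skewness parameter numerically, here is an illustrative sketch of the asymmetric Laplace density coded directly from the P.D.F. above (dald is just a made-up name for this post, not a function from any package):

# asymmetric Laplace density, written from the P.D.F. above
dald = function(y, m = 0, lambda = 1, q = 0.5) {
  u   = (y - m) / lambda
  rho = u * (q - (u < 0))                 # the check (quantile loss) function
  q * (1 - q) / lambda * exp(-rho)
}

# the probability mass below the location m should equal q
integrate(dald, -Inf, 0, m = 0, lambda = 1, q = 0.10)$value    # approximately 0.10
integrate(dald, -Inf, Inf, m = 0, lambda = 1, q = 0.10)$value  # approximately 1, so it is a proper density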

Since we’re good Bayesians (if you aren’t, you should be!) we can approach this problem by placing probability distributions on the unknown parameters \(\mathit{m}\) and \(\lambda\), along with the value of each coefficient \(\beta\). Extensive work has been done validating the asymmetric Laplace model for quantile regression in a Bayesian context (Yu and Moyeed 2001; Kottas and Krnjajić 2009; Sriram, Ramamoorthi, and Ghosh 2013; Yang, Wang, and He 2016).

I will demonstrate how to do this using the R package brms, which requires the C++-based Stan language to be installed. If you do not have this installed, visit this link to learn how to get started. After installing Stan, simply install the brms package from CRAN in R.
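For reference, once the Stan toolchain is working the CRAN install is a one-liner:

install.packages("brms")   # also pulls in rstan and the other dependencies from CRAN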

Fitting a model with the brms package

The model structure and the priors I will use in the brms example are below:

\[\begin{align} y_i \sim asym.~Laplace (m_i, \lambda, q) & & \text{family function in the GLM } \\ m_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} & & \text{quantile 'location' determined by the } \beta \text{s} \\ \beta_0 \sim student~t(\nu= 3, \mu=0, \sigma=10) & & \text{intercept prior: t-distribution} \\ \beta_1, \ldots, \beta_k \sim student~t(\nu= 3, \mu=0, \sigma=10) & & \text{coefficient priors: t-distribution} \\ \lambda \sim half\text{-}student~t(\nu= 3, \mu=0, \sigma=10) & & \text{scale prior: half-t-distribution} \end{align}\]

The first step is to load the data. These data are simulated, so don’t worry about playing around with the data set. To make things more interesting, I’ll use both Gender and Digit Span as predictors in addition to Age, but I won’t include any interaction terms.

# load the simulated data set
data = read.csv("https://tinyurl.com/y9uct2c5", header = TRUE)

# get the default priors for the model, then set a weakly informative
# student-t prior on the regression coefficients (the first row of the prior frame)
prior = get_prior(bf(Neuroticism ~ Gender + Age + Digit_Span, quantile = 0.50),
                  data = data, family = asym_laplace(link = "identity"))
prior$prior[1] = "student_t(3, 0, 10)"

# fit the 50% quantile (median) regression
fit = brm(bf(Neuroticism ~ Gender + Age + Digit_Span, quantile = 0.50),
          data = data, family = asym_laplace(link = "identity"),
          prior = prior, sample_prior = TRUE, save_all_pars = TRUE,
          iter = 2000, cores = 4)

# refit the same model for the 10% and 90% quantiles
fit0.10 = update(fit, bf(Neuroticism ~ Gender + Age + Digit_Span, quantile = 0.10), iter = 4000, cores = 4)
fit0.90 = update(fit, bf(Neuroticism ~ Gender + Age + Digit_Span, quantile = 0.90), iter = 4000, cores = 4)

The first thing you’ll want to do is check the MCMC chains to make sure they are sufficiently fuzzy. Usually brms/Stan will warn you if there’s a problem with the MCMC sampling, but it’s always a good idea to check. The plot function in brms shows the posterior distribution and the sampling chains for each coefficient. For the sake of space, I’ll only show the plot for a couple of parameters from the 50% quantile model.

plot(fit, pars = c("b_Age", "b_Digit_Span"), theme = theme_stata(), N = 2, ask = FALSE, newpage = TRUE)

This looks good! The model clearly sampled well: the chains overlap and look like dense ‘fuzzy caterpillars’. If a trace instead drifts slowly or gets stuck, so that it looks more like a wandering ‘EEG’ wave than noise, there’s a problem. If you ever run into problems you can usually Google any error messages or ask for help at the Stan discourse site.
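Beyond eyeballing the traces, it’s worth confirming the numerical diagnostics. summary() always reports them, and recent brms versions also re-export the rhat() and neff_ratio() helpers, so a quick check looks something like this:

summary(fit)        # reports Rhat and effective sample size for every parameter
rhat(fit)           # values should be very close to 1 for all parameters
neff_ratio(fit)     # ratio of effective to total samples; very small values signal poor mixing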

You can also plot the predicted effects. For the sake of saving space, I’ll only show the plots for the 50% Quantile regression.

plot(marginal_effects(fit), points=TRUE, theme=theme_stata(), ask=FALSE) 

Making the results tables

Now that the models have been fit, let’s take a look at a table of coefficients. I will be using the broom, knitr, and kableExtra packages to create all of the following tables. Note that "html" is specified in the kable function to create HTML tables. These tables can also be made in LaTeX or markdown format by specifying "latex" or "markdown". See the vignettes for the kableExtra package to learn more about how to format these tables with your own preferred customizations.

# extract the MCMC samples
fitMCMC = as.mcmc(fit, pars="^b_", exact_match = FALSE, combine_chains = TRUE)
fit0.10MCMC = as.mcmc(fit0.10, pars="^b_", exact_match = FALSE, combine_chains = TRUE)
fit0.90MCMC = as.mcmc(fit0.90, pars="^b_", exact_match = FALSE, combine_chains = TRUE)

# tidy up the samples and get highest density credible intervals
median    = tidyMCMC(fitMCMC, conf.int = TRUE, droppars = "prior_sigma", conf.method = "HPDinterval", conf.level = .95)
tenpct    = tidyMCMC(fit0.10MCMC, conf.int = TRUE, droppars = "prior_sigma", conf.method = "HPDinterval", conf.level = .95)
ninetypct = tidyMCMC(fit0.90MCMC, conf.int = TRUE, droppars = "prior_sigma", conf.method = "HPDinterval", conf.level = .95)

quicktable = data.frame(tenpct$estimate, median$estimate, ninetypct$estimate)
colnames(quicktable) = c("10% Quantile", "50% Quantile", "90% Quantile")
rownames(quicktable) = median$term
quicktable = round(quicktable, digits = 3)

quicktable %>% 
mutate(
  term= c("Intercept", "Gender Male", "Age", "Digit Span")
  ) %>%
  select(term, `10% Quantile`, `50% Quantile`, `90% Quantile`) %>%
  kable("html",escape = FALSE, digits=3, align='r') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c("Quantile Regression Coefficients" = 4))
Quantile Regression Coefficients

term           10% Quantile   50% Quantile   90% Quantile
Intercept             1.550          3.940          4.100
Gender Male          -0.936         -0.692         -0.515
Age                  -0.004          0.063          0.095
Digit Span           -0.155         -0.305         -0.274

It’s also easy to make a results table showing the estimates and 95% Credible Intervals.

median[1:4, 1:5] %>%
  select(term, estimate, std.error, conf.low, conf.high) %>%
  kable("html",escape = FALSE, digits=2, align='r') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c(" " = 3, "95% Highest Density Interval " = 2)) %>%
  add_header_above(c("Quantile Regression Results - 50% Quantile" = 5))
Quantile Regression Results - 50% Quantile

                                        95% Highest Density Interval
term            estimate   std.error       conf.low      conf.high
b_Intercept         3.94        0.71           2.52           5.24
b_GenderMale       -0.69        0.19          -1.06          -0.32
b_Age               0.06        0.02           0.03           0.10
b_Digit_Span       -0.30        0.05          -0.40          -0.20

Bayes Factors

The hypothesis function in brms also allows specifying a list of hypotheses to test. With this, Bayes Factors can be obtained to add to the table. The Bayes Factors are reported as the strength of evidence in favor of the null hypothesis, \(BF_{01}= \frac{L(H_{0} \mid D)}{L(H_{\theta} \mid D)}\). Smaller values (\(BF_{01} \leq 0.10\)) mean stronger evidence against the null hypothesis, larger values (\(BF_{01} \geq 10\)) mean stronger evidence for the null hypothesis, and values near 1 mean the evidence is inconclusive.

hypotheses = c("b_Intercept=0", "b_GenderMale=0", "b_Age=0", "b_Digit_Span=0")
h = hypothesis(fit, hypotheses, class = NULL)
h = h[["hypothesis"]][["Evid.Ratio"]]
median.table = cbind(median, `Bayes Factor`= round(h, digits=2))
median.table[1:4, 1:6] %>%
mutate(
`Bayes Factor` = cell_spec(`Bayes Factor`, "html", bold=ifelse(`Bayes Factor` < 0.101, TRUE, FALSE), color = ifelse(`Bayes Factor` < 0.101, "blue", "grey"))
  ) %>%
  select(term, estimate, std.error, conf.low, conf.high, `Bayes Factor`) %>%
  kable("html",escape = FALSE, digits=2, align='r') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c(" " = 3, "95% Highest Density " = 2, " " = 1)) %>%
  add_header_above(c("Quantile Regression Results - 50% Quantile" = 6))
Quantile Regression Results - 50% Quantile

                                       95% Highest Density
term            estimate   std.error   conf.low   conf.high   Bayes Factor
b_Intercept         3.94        0.71       2.52        5.24              0
b_GenderMale       -0.69        0.19      -1.06       -0.32            0.1
b_Age               0.06        0.02       0.03        0.10           2.94
b_Digit_Span       -0.30        0.05      -0.40       -0.20              0

Empirical P-Values

Even though this is a Bayesian model fit, it’s possible to calculate frequentist p-values from the MCMC samples. From a frequentist perspective these p-values reflect a simulated likelihood from a biased estimator (which isn’t a bad thing, but that’s a discussion for another time). From a Bayesian perspective it’s a little funny, but many journal reviewers and colleagues feel more comfortable or familiar with p-values. Just remember: p-values tell you the probability of the observed estimate, or one more extreme, under the assumption that the null hypothesis is true, while posterior probabilities tell you the probability of an estimate given the observed data and the prior. Just source the script from the tinyurl link below to gain this ability; it contains a correction I added based on North, Curtis, and Sham (2002) and North, Curtis, and Sham (2003).

source("https://tinyurl.com/y8mvwbqy")
median.pvals = as.numeric(apply(fitMCMC[,-6], 2, mcmcpvalue))
ten.pvals = as.numeric(apply(fit0.10MCMC[,-6], 2, mcmcpvalue))
ninety.pvals = as.numeric(apply(fit0.90MCMC[,-6], 2, mcmcpvalue))

# Store the original tables with the p-values, which I am rounding to 4 digits.
median.table = cbind(median,`MCMC p-value`= round(median.pvals,digits = 4))
ten.table = cbind(tenpct,`MCMC p-value`= round(ten.pvals, digits= 4))
ninety.table = cbind(ninetypct,`MCMC p-value`= round(ninety.pvals,digits=4)) 
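The sourced script defines the mcmcpvalue() function used above. I won’t reproduce it here, but a rough sketch of the usual idea, which may differ from the sourced version in its details, is the proportion of posterior samples falling on the opposite side of zero from the estimate, doubled for a two-sided test, with the (r + 1)/(n + 1) correction from North, Curtis, and Sham (2002) so the p-value is never exactly zero:

# illustrative sketch only -- the actual sourced mcmcpvalue() may differ
mcmcpvalue.sketch = function(samples) {
  n = length(samples)
  r = min(sum(samples <= 0), sum(samples >= 0))   # samples on the 'wrong' side of zero
  min(1, 2 * (r + 1) / (n + 1))                   # two-sided, corrected, capped at 1
}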

A full regression table for the model with p-values can be made just like the one with Bayes Factors.

median.table[1:4, 1:6]  %>% 
mutate(
`MCMC p-value` = cell_spec(`MCMC p-value`, "html", color = ifelse(`MCMC p-value` < 0.05, "blue", "grey"))
  ) %>%
  select(term, estimate, std.error, conf.low, conf.high, `MCMC p-value`) %>%
  kable("html",escape = FALSE, digits=3,align='r') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  add_header_above(c(" " = 3, "95% Highest Density " = 2, " " = 1)) %>%
  add_header_above(c("Quantile Regression Results - 50% Quantile" = 6))
Quantile Regression Results - 50% Quantile

                                        95% Highest Density
term            estimate   std.error   conf.low   conf.high   MCMC p-value
b_Intercept        3.940       0.707      2.522       5.239          2e-04
b_GenderMale      -0.692       0.189     -1.062      -0.324          7e-04
b_Age              0.063       0.019      0.027       0.100          7e-04
b_Digit_Span      -0.305       0.050     -0.397      -0.203          2e-04

The quantreg package

The quantreg package uses frequentist methods to fit the quantile regression model. If you only want p-value based methods, this is the best package for the job. There are several methods by which p-values and confidence intervals can be calculated, most being non-parametric or semi-parametric hypothesis tests involving ranks, kernel density estimation, or bootstrapping techniques. The conclusions reached can differ depending on which approach you take, so tread carefully. The package documentation explains the different methods and can be quickly accessed by typing ?summary.rq into the console.

Here the quantiles are specified by the parameter tau.

rq = rq(Neuroticism ~ Gender + Age + Digit_Span, data=data, tau=c(.10,.50,.90))

rq.summary = as.data.frame(coef(rq))
colnames(rq.summary) = c("10% Quantile", "50% Quantile", "90% Quantile")
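To get standard errors, test statistics, and p-values like the ones in the second table below, pass the fitted object to summary(). As an example, the bootstrap method is one of the several options documented in ?summary.rq:

# one summary per tau; se = "boot" bootstraps the standard errors (R = number of replications)
rq.boot = summary(rq, se = "boot", R = 1000)
rq.boot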

The results can be put into tables using the same methods demonstrated previously, so I won’t show the whole code again.

Quantile Regression Coefficients

term           10% Quantile   50% Quantile   90% Quantile
Intercept             1.675          3.874          4.003
Gender Male          -0.748         -0.616         -0.521
Age                  -0.033          0.061          0.095
Digit Span           -0.133         -0.300         -0.268

Quantile Regression Results - 50% Quantile

term           estimate   std.error   statistic   p.value
Intercept         3.874       0.969       4.000     1e-04
Gender Male      -0.616       0.245      -2.515    0.0132
Age               0.061       0.023       2.705    0.0078
Digit Span       -0.300       0.062      -4.840         0

References

Cade, Brian S., and Barry R. Noon. 2003. “A Gentle Introduction to Quantile Regression for Ecologists.” Frontiers in Ecology and the Environment 1 (8): 412. doi:10.2307/3868138.

Cameron, Adrian Colin, and P. K. Trivedi. 2010. “Quantile Regression.” In Microeconometrics Using Stata, edited by P. K. Trivedi, 211–34. College Station, Tex: Stata Press.

Koenker, Roger. 2005. Quantile Regression. Econometric Society Monographs, no. 38. Cambridge ; New York: Cambridge University Press.

Kottas, Athanasios, and Milovan Krnjajić. 2009. “Bayesian Semiparametric Modelling in Quantile Regression.” Scandinavian Journal of Statistics 36 (2): 297–319. doi:10.1111/j.1467-9469.2008.00626.x.

North, B.V., D. Curtis, and P.C. Sham. 2002. “A Note on the Calculation of Empirical P Values from Monte Carlo Procedures.” The American Journal of Human Genetics 71 (2): 439–41. doi:10.1086/341527.

———. 2003. “A Note on the Calculation of Empirical P Values from Monte Carlo Procedures.” The American Journal of Human Genetics 72 (2): 498–99. doi:10.1086/346173.

Sriram, Karthik, R.V. Ramamoorthi, and Pulak Ghosh. 2013. “Posterior Consistency of Bayesian Quantile Regression Based on the Misspecified Asymmetric Laplace Density.” Bayesian Analysis 8 (2): 479–504. doi:10.1214/13-BA817.

Sánchez, Luis, Victor H. Lachos, and Filidor V. Labra. 2013. “Likelihood Based Inference for Quantile Regression Using the Asymmetric Laplace Distribution.” https://tinyurl.com/yaws5ddu.

Yang, Yunwen, Huixia Judy Wang, and Xuming He. 2016. “Posterior Inference in Bayesian Quantile Regression with Asymmetric Laplace Likelihood: Bayesian Quantile Regression.” International Statistical Review 84 (3): 327–44. doi:10.1111/insr.12114.

Yu, Keming, and Rana A. Moyeed. 2001. “Bayesian Quantile Regression.” Statistics & Probability Letters 54 (4): 437–47. doi:10.1016/S0167-7152(01)00124-9.