Here we will walk through the most common statistical tests and show how they can be represented in the linear regression format. This section is based on Jonas Lindeløv's post Common statistical tests are linear models (https://lindeloev.github.io/tests-as-linear/). Check it out for a more in-depth description of the non-parametric alternatives.


One sample t-test

The test evaluates whether the mean of a continuous variable differs from 0. It is equivalent to a linear regression with no predictors, which tests whether the intercept differs from 0:

\(y = \beta_0 \qquad \mathcal{H}_0: \beta_0 = 0\)

 

# number of observations
n <- 50

# set seed
set.seed(123)

# create variable with mean 1
y <- rnorm(n = n, mean = 1)

# run t test
(t <- t.test(y))
## 
##  One Sample t-test
## 
## data:  y
## t = 7.9, df = 49, p-value = 2.718e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7712742 1.2975329
## sample estimates:
## mean of x 
##  1.034404
# run equivalent linear regression
(lm_t <- summary(lm(y ~ 1)))
## 
## Call:
## lm(formula = y ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0010 -0.5937 -0.1070  0.6638  2.1345 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0344     0.1309     7.9 2.72e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9259 on 49 degrees of freedom

 

Let’s put the results from both tests together to compare them more closely:

model    estimate  p.value    t
t-test     1.0344        0  7.9
lm         1.0344        0  7.9
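
Such a comparison table can be assembled directly from the two model objects; a minimal sketch reusing the 't' and 'lm_t' objects created above (the column names are my own):

# extract estimate, p value and t statistic from both fits
data.frame(model = c("t-test", "lm"),
    estimate = c(unname(t$estimate), lm_t$coefficients[1, "Estimate"]),
    p.value = c(t$p.value, lm_t$coefficients[1, "Pr(>|t|)"]),
    t = c(unname(t$statistic), lm_t$coefficients[1, "t value"]))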

 

Note that, as there are no predictors in the model, a '1' takes the place of the predictors in the model formula ('y ~ 1').
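
As a quick sanity check, the intercept of this predictor-free model is just the sample mean of 'y' (a sketch reusing the data simulated above):

# the intercept-only estimate equals the sample mean
coef(lm(y ~ 1))

mean(y)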


Paired t-test

A paired t-test evaluates if the mean difference between two numeric variables is 0:

\(y_{1} - y_{2} = \beta_0 \qquad \mathcal{H}_0: \beta_0 = 0\)

 

The corresponding linear model is the same as the one for the one-sample t-test, but the input variable is the difference between the two variables (y1 - y2):

set.seed(123)

# sample size
n <- 50

# variable with mean 1
y1 <- rnorm(n = n, mean = 1)

# variable with mean 1.4
y2 <- rnorm(n = n, mean = 1.4)

# run paired t test
paired_t <- t.test(y1, y2, paired = TRUE)

paired_t
## 
##  Paired t-test
## 
## data:  y1 and y2
## t = -2.7468, df = 49, p-value = 0.008395
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8865856 -0.1374239
## sample estimates:
## mean of the differences 
##              -0.5120047
# difference between the 2 variables
diff_y <- y1 - y2

# run model
lm_paired_t <- summary(lm(formula = diff_y ~ 1))

lm_paired_t
## 
## Call:
## lm(formula = diff_y ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6003 -0.7010  0.1121  1.0348  2.9089 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.5120     0.1864  -2.747  0.00839 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.318 on 49 degrees of freedom

model          estimate  p.value        t
paired t-test    -0.512   0.0084  -2.7468
lm               -0.512   0.0084  -2.7468
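
The 95% confidence intervals should coincide as well; a quick check reusing the objects created above:

# confidence interval from the linear model
confint(lm(diff_y ~ 1))

# confidence interval from the paired t-test
paired_t$conf.int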

 


Two means t-test

Also called the independent samples t-test, it evaluates whether the means of two groups differ. The null hypothesis is that the difference between the group means is 0:

\(y = \beta_0 + \beta_1 * x_1 \qquad \mathcal{H}_0: \beta_1 = 0\)

 

# set seed
set.seed(123)

# number of observations
n <- 50
b0 <- -4
b1 <- 3
error <- rnorm(n = n, mean = 0, sd = 1)

# random variables
x1_num <- rbinom(n = n, size = 1, prob = 0.5)
y <- b0 + b1 * x1_num + error

x1 <- factor(x1_num, labels = c("a", "b"))

# create data frame
xy_data_cat <- data.frame(x1, x1_num, y)

# run independent (Welch) t test
indep_t <- t.test(xy_data_cat$y[xy_data_cat$x1 == "a"],
    xy_data_cat$y[xy_data_cat$x1 == "b"])

indep_t
## 
##  Welch Two Sample t-test
## 
## data:  xy_data_cat$y[xy_data_cat$x1 == "a"] and xy_data_cat$y[xy_data_cat$x1 == "b"]
## t = -12.276, df = 47.946, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.751804 -2.695745
## sample estimates:
##  mean of x  mean of y 
## -4.0774838 -0.8537091
# run regression model
lm_indep_t <- summary(lm(formula = y ~ x1, data = xy_data_cat))

lm_indep_t
## 
## Call:
## lm(formula = y ~ x1, data = xy_data_cat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1129 -0.5939 -0.1355  0.5519  2.2464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.0775     0.1857  -21.96   <2e-16 ***
## x1b           3.2238     0.2626   12.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9285 on 48 degrees of freedom
## Multiple R-squared:  0.7584, Adjusted R-squared:  0.7534 
## F-statistic: 150.7 on 1 and 48 DF,  p-value: < 2.2e-16

model           estimate  p.value         t
2 means t-test    3.2238        0  -12.2759
lm                3.2238        0   12.2759

Note the opposite signs of the t statistics: t.test() was given group 'a' first, so it tests mean(a) - mean(b), while the 'x1b' coefficient in the linear model estimates mean(b) - mean(a).
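
Also note that t.test() runs the Welch version by default, which does not assume equal variances; that is why its degrees of freedom (47.946) differ slightly from those of the linear model (48). The linear model is exactly equivalent to the classic equal-variance t-test, which can be requested with var.equal = TRUE (a sketch reusing the simulated data):

# classic (equal variance) two-sample t test; matches lm() exactly
t.test(y ~ x1, data = xy_data_cat, var.equal = TRUE)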

 


Three or more means: one-way ANOVA

Very similar to the two means t-test but with three or more levels in the categorical variable:

\(y = \beta_0 + \beta_1 * x_1 + \beta_2 * x_2 + ... \qquad \mathcal{H}_0: \beta_1 = \beta_2 = ... = 0\)

 

So it is a multiple regression model. The categorical variable is dummy coded, so it gets split into indicator predictors (each \(x_i\) is either 0 or 1).
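
model.matrix() exposes the indicator (dummy) columns that R builds behind the scenes; a minimal illustration with a toy 3-level factor:

# treatment (dummy) coding of a 3-level factor
f <- factor(c("a", "b", "c", "a"))
model.matrix(~f)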

A data set with a 3-level categorical variable can be simulated like this:

# set seed
set.seed(123)

# number of observations
n <- 50
b0 <- -4
b1 <- 3
error <- rnorm(n = n, mean = 0, sd = 1)

# random variables (x1_num takes the values 0, 1 or 2)
x1_num <- rbinom(n = n, size = 2, prob = c(0.33, 0.33))
y <- b0 + b1 * x1_num + error

x1 <- factor(x1_num, labels = c("a", "b", "c"))

# create data frame
xy_data_cat2 <- data.frame(x1, y)

head(xy_data_cat2)
x1        y
b   -1.5605
a   -4.2302
b    0.5587
c    2.0705
b   -0.8707
b    0.7151
# ANOVA function (note: the response modeled here is 'y1', created in the paired t-test section above)
anv_1w <- anova(aov(formula = y1 ~ x1, data = xy_data_cat2))

anv_1w
           Df   Sum Sq  Mean Sq  F value  Pr(>F)
x1          2   4.3728   2.1864   2.7307  0.0755
Residuals  47  37.6318   0.8007
# linear regression
lm_anv_1w <- summary(lm(formula = y1 ~ x1, data = xy_data_cat2))

lm_anv_1w
## 
## Call:
## lm(formula = y1 ~ x1, data = xy_data_cat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67208 -0.51953 -0.02887  0.48386  2.32657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.8424     0.1866   4.515 4.24e-05 ***
## x1b           0.5646     0.2774   2.035   0.0475 *  
## x1c          -0.1409     0.3673  -0.384   0.7030    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8948 on 47 degrees of freedom
## Multiple R-squared:  0.1041, Adjusted R-squared:  0.06598 
## F-statistic: 2.731 on 2 and 47 DF,  p-value: 0.07552

model        F.statistic  p.value
1 way anova       2.7307   0.0755
lm                2.7307   0.0755
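
Note that aov() is not strictly needed: calling anova() directly on the lm fit reproduces the same F table:

# one-way ANOVA table straight from the regression fit
anova(lm(y1 ~ x1, data = xy_data_cat2))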

 


Chi-square for contingency tables

This test looks for an association between two categorical variables based on the co-occurrence of their categories, measured as counts. That is exactly what a contingency table holds: the counts for every combination of categories of the two variables. For instance, mood by handedness:

# Contingency table modified from
# https://lindeloev.github.io/tests-as-linear/
mood_data <- matrix(data = c(100, 70, 30, 32, 110, 120), nrow = 2,
    dimnames = list(handedness = c("left_handed", "right_handed"),
        mood = c("happy", "meh", "sad")))

mood_data
##               mood
## handedness     happy meh sad
##   left_handed    100  30 110
##   right_handed    70  32 120

 

Although the coding is a bit more elaborate, the test boils down to a log-linear two-way ANOVA model:

\(\log(y_i) = \log(N) + \log(\alpha_i) + \log(\beta_j) + \log(\alpha_i\beta_j)\)

 

Here \(\alpha_i\) and \(\beta_j\) are the proportions of counts for the levels of the two categorical variables (handedness and mood in our example), and \(\alpha_i\beta_j\) is their interaction. Fortunately, the R package 'MASS' provides an implementation of this model that simplifies the coding:

# Built-in chi-square. It requires matrix format.
chi_mod <- chisq.test(mood_data)

chi_mod
## 
##  Pearson's Chi-squared test
## 
## data:  mood_data
## X-squared = 5.0999, df = 2, p-value = 0.07809
# convert to long
mood_data_long <- as.data.frame(as.table(mood_data))

# log linear model from MASS package
log_lm_mod <- MASS::loglm(Freq ~ handedness + mood, data = mood_data_long)

log_lm_mod
## Call:
## MASS::loglm(formula = Freq ~ handedness + mood, data = mood_data_long)
## 
## Statistics:
##                       X^2 df   P(> X^2)
## Likelihood Ratio 5.119915  2 0.07730804
## Pearson          5.099859  2 0.07808717

model       X.squared  p.value
chi-square     5.0999   0.0781
log lm         5.0999   0.0781
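
The same independence model can also be fit with base R's glm() and a Poisson family: the residual deviance of that fit is the likelihood-ratio statistic reported by loglm(). A sketch reusing 'mood_data_long':

# Poisson GLM version of the independence log-linear model
glm_mod <- glm(Freq ~ handedness + mood, family = poisson,
    data = mood_data_long)

# residual deviance and df give the likelihood ratio chi-square test
pchisq(deviance(glm_mod), df = df.residual(glm_mod), lower.tail = FALSE)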

 

‘Non-parametric’ alternatives as linear regression models

We will only feature one example of a 'non-parametric' test. Nonetheless, this single example should be enough to demonstrate the logic behind 'non-parametric' procedures. Most of them rely on a simple data transformation trick: rank-transforming the response. Rank transformation converts each value to its index (rank) when the values are sorted by magnitude:

# simulate variable
set.seed(123)
y <- round(rnorm(n = 5), 2)

y
## [1] -0.56 -0.23  1.56  0.07  0.13
# convert to ranks
rank_y <- rank(y)

rank_y
## [1] 1 2 5 3 4

 

Wilcoxon test for one mean

The one-sample Wilcoxon signed-rank test is the 'non-parametric' alternative to the one-sample t-test. It can be closely approximated by a linear model on the signed ranks of the response:

# set seed
set.seed(123)

# number of observations
n <- 50
# variable with mean 0.2
y <- rnorm(n = n, mean = 0.2, sd = 1)

# create data frame
y_data <- data.frame(y)


y
##  [1] -0.360475647 -0.030177489  1.758708314  0.270508391  0.329287735
##  [6]  1.915064987  0.660916206 -1.065061235 -0.486852852 -0.245661970
## [11]  1.424081797  0.559813827  0.600771451  0.310682716 -0.355841135
## [16]  1.986913137  0.697850478 -1.766617157  0.901355902 -0.272791408
## [21] -0.867823706 -0.017974915 -0.826004448 -0.528891229 -0.425039268
## [26] -1.486693311  1.037787044  0.353373118 -0.938136937  1.453814921
## [31]  0.626464221 -0.095071483  1.095125661  1.078133488  1.021581082
## [36]  0.888640254  0.753917654  0.138088289 -0.105962664 -0.180471001
## [41] -0.494706979 -0.007917278 -1.065396352  2.368955965  1.407961998
## [46] -0.923108583 -0.202884835 -0.266655354  0.979965118  0.116630934
# one-sample Wilcoxon signed-rank test
wilcx_mod <- wilcox.test(y_data$y)

wilcx_mod
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  y_data$y
## V = 797, p-value = 0.1248
## alternative hypothesis: true location is not equal to 0
# signed rank transformation
signed_rank <- function(x) sign(x) * rank(abs(x))

# approximately equivalent linear model on the signed ranks
wicx_lm_mod <- lm(signed_rank(y) ~ 1, data = y_data)

summary(wicx_lm_mod)
## 
## Call:
## lm(formula = signed_rank(y) ~ 1, data = y_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.38 -24.13   0.12  25.37  43.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    6.380      4.085   1.562    0.125
## 
## Residual standard error: 28.89 on 49 degrees of freedom

model            p.value
1 mean wilcoxon   0.1248
wilcoxon_lm       0.1248

 


References

Lindeløv, J. K. (2019). Common statistical tests are linear models (or: how to teach stats). https://lindeloev.github.io/tests-as-linear/

 

Session information

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_CR.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_CR.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_CR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_CR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] viridis_0.6.2     viridisLite_0.4.0 ggplot2_3.3.5     knitr_1.37       
## [5] kableExtra_1.3.4 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1  xfun_0.30         bslib_0.2.5.1     purrr_0.3.4      
##  [5] colorspace_2.0-3  vctrs_0.3.8       generics_0.1.0    htmltools_0.5.2  
##  [9] yaml_2.3.5        utf8_1.2.2        rlang_1.0.2       jquerylib_0.1.4  
## [13] pillar_1.7.0      glue_1.6.2        withr_2.5.0       DBI_1.1.1        
## [17] lifecycle_1.0.1   stringr_1.4.0     munsell_0.5.0     gtable_0.3.0     
## [21] rvest_1.0.1       evaluate_0.15     fastmap_1.1.0     fansi_1.0.2      
## [25] highr_0.9         scales_1.1.1      formatR_1.12      webshot_0.5.2    
## [29] jsonlite_1.7.2    systemfonts_1.0.2 gridExtra_2.3     digest_0.6.29    
## [33] stringi_1.7.6     dplyr_1.0.7       grid_4.1.1        cli_3.2.0        
## [37] tools_4.1.1       magrittr_2.0.2    sass_0.4.0        klippy_0.0.0.9500
## [41] tibble_3.1.6      crayon_1.5.0      pkgconfig_2.0.3   MASS_7.3-54      
## [45] ellipsis_0.3.2    xml2_1.3.2        assertthat_0.2.1  rmarkdown_2.10   
## [49] svglite_2.0.0     httr_1.4.2        rstudioapi_0.13   R6_2.5.1         
## [53] compiler_4.1.1