Here we will walk through the most common statistical tests and show how they can be represented in the linear regression format. This section is based on the post Common statistical tests are linear models by Jonas Lindeløv (https://lindeloev.github.io/tests-as-linear/); check it out for a more in-depth description of the non-parametric alternatives.
The one-sample t-test evaluates whether the mean of a continuous variable differs from 0. It is equivalent to a linear regression with no predictors, which tests whether the intercept differs from 0:
# number of observations
n <- 50
# set seed
set.seed(123)
# create variable with mean 1
y <- rnorm(n = n, mean = 1)
# run t test
(t <- t.test(y))
##
## One Sample t-test
##
## data: y
## t = 7.9, df = 49, p-value = 2.718e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.7712742 1.2975329
## sample estimates:
## mean of x
## 1.034404
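The equivalent linear model fits an intercept-only regression:

```r
# run the equivalent linear model: intercept-only regression
summary(lm(y ~ 1))
```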
##
## Call:
## lm(formula = y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0010 -0.5937 -0.1070 0.6638 2.1345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0344 0.1309 7.9 2.72e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9259 on 49 degrees of freedom
Let’s put the results from both tests together to compare them more closely:
model | estimate | p.value | t |
---|---|---|---|
t-test | 1.0344 | 0 | 7.9 |
lm | 1.0344 | 0 | 7.9 |
Note that, as there are no predictors in the model, we use a ‘1’ in place of the predictors in the model formula (‘y ~ 1’).
A paired t-test evaluates whether the mean difference between two numeric variables is 0:
The corresponding linear model is the same as that of the one-sample t-test, but the response variable is the difference between the two variables (y1 - y2):
set.seed(123)
# sample size
n <- 50
# variable with mean 1
y1 <- rnorm(n = n, mean = 1)
# variable with mean 1.4
y2 <- rnorm(n = n, mean = 1.4)
# run paired t test
paired_t <- t.test(y1, y2, paired = TRUE)
paired_t
##
## Paired t-test
##
## data: y1 and y2
## t = -2.7468, df = 49, p-value = 0.008395
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8865856 -0.1374239
## sample estimates:
## mean of the differences
## -0.5120047
# difference between the 2 variables
diff_y <- y1 - y2
# run model
lm_paired_t <- summary(lm(formula = diff_y ~ 1))
lm_paired_t
##
## Call:
## lm(formula = diff_y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6003 -0.7010 0.1121 1.0348 2.9089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5120 0.1864 -2.747 0.00839 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.318 on 49 degrees of freedom
model | estimate | p.value | t |
---|---|---|---|
paired t-test | -0.512 | 0.0084 | -2.7468 |
lm | -0.512 | 0.0084 | -2.7468 |
The two-means t-test, also called the independent t-test, evaluates whether the means of two groups differ. The null hypothesis is that the difference is 0:
# set seed
set.seed(123)
# number of observations
n <- 50
b0 <- -4
b1 <- 3
error <- rnorm(n = n, mean = 0, sd = 1)
# random variables
x1_num <- rbinom(n = n, size = 1, prob = 0.5)
y <- b0 + b1 * x1_num + error
x1 <- factor(x1_num, labels = c("a", "b"))
# create data frame
xy_data_cat <- data.frame(x1, x1_num, y)
# run independent t test
indep_t <- t.test(xy_data_cat$y[xy_data_cat$x1 == "a"], xy_data_cat$y[xy_data_cat$x1 ==
"b"])
indep_t
##
## Welch Two Sample t-test
##
## data: xy_data_cat$y[xy_data_cat$x1 == "a"] and xy_data_cat$y[xy_data_cat$x1 == "b"]
## t = -12.276, df = 47.946, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.751804 -2.695745
## sample estimates:
## mean of x mean of y
## -4.0774838 -0.8537091
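The equivalent linear model regresses y on the two-level factor:

```r
# run the equivalent linear model: y as a function of the two-level factor
summary(lm(y ~ x1, data = xy_data_cat))
```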
##
## Call:
## lm(formula = y ~ x1, data = xy_data_cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1129 -0.5939 -0.1355 0.5519 2.2464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.0775 0.1857 -21.96 <2e-16 ***
## x1b 3.2238 0.2626 12.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9285 on 48 degrees of freedom
## Multiple R-squared: 0.7584, Adjusted R-squared: 0.7534
## F-statistic: 150.7 on 1 and 48 DF, p-value: < 2.2e-16
model | estimate | p.value | t |
---|---|---|---|
2 means t-test | 3.2238 | 0 | -12.2759 |
lm | 3.2238 | 0 | 12.2759 |
Note that the sign of t differs only because t.test() computes mean(a) - mean(b), while the lm coefficient is the effect of level ‘b’ relative to ‘a’.
The one-way ANOVA is very similar to the two-means t-test, but with three or more levels in the categorical variable:
So it is a multiple regression model. The categorical variable is dummy coded, so it gets split into indicator predictors (each \(x_i\) is either 0 or 1).
A data set with a 3-level categorical variable can be simulated like this:
# set seed
set.seed(123)
# number of observations
n <- 50
b0 <- -4
b1 <- 3
error <- rnorm(n = n, mean = 0, sd = 1)
# random variables
x1_num <- rbinom(n = n, size = 2, prob = c(0.33, 0.33))
y <- b0 + b1 * x1_num + error
x1 <- factor(x1_num, labels = c("a", "b", "c"))
# create data frame
xy_data_cat2 <- data.frame(x1, y)
head(xy_data_cat2)
x1 | y |
---|---|
b | -1.5605 |
a | -4.2302 |
b | 0.5587 |
c | 2.0705 |
b | -0.8707 |
b | 0.7151 |
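The one-way ANOVA can then be run with aov(), or equivalently by calling anova() on the fitted linear model (a minimal sketch):

```r
# one-way ANOVA; anova(lm(...)) produces the same table as summary(aov(...))
anova(lm(y ~ x1, data = xy_data_cat2))
```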
term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
x1 | 2 | 4.3728 | 2.1864 | 2.7307 | 0.0755 |
Residuals | 47 | 37.6318 | 0.8007 | NA | NA |
##
## Call:
## lm(formula = y1 ~ x1, data = xy_data_cat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67208 -0.51953 -0.02887 0.48386 2.32657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8424 0.1866 4.515 4.24e-05 ***
## x1b 0.5646 0.2774 2.035 0.0475 *
## x1c -0.1409 0.3673 -0.384 0.7030
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8948 on 47 degrees of freedom
## Multiple R-squared: 0.1041, Adjusted R-squared: 0.06598
## F-statistic: 2.731 on 2 and 47 DF, p-value: 0.07552
model | F.statistic | p.value |
---|---|---|
1 way anova | 2.7307 | 0.0755 |
lm | 2.7307 | 0.0755 |
The chi-square test looks for an association between two categorical variables based on the co-occurrence of their categories, measured as counts. This is what a contingency table contains: counts for all combinations of categories between two variables. For instance, mood by handedness:
# Contingency table modified from
# https://lindeloev.github.io/tests-as-linear/
mood_data <- matrix(data = c(100, 70, 30, 32, 110, 120), nrow = 2,
dimnames = list(handedness = c("left_handed", "right_handed"),
mood = c("happy", "meh", "sad")))
mood_data
## mood
## handedness happy meh sad
## left_handed 100 30 110
## right_handed 70 32 120
Although the coding is a bit more elaborate, it can be narrowed down to a log-linear two-way ANOVA model:
\(log(y_i) = log(N) + log(\alpha_i) + log(\beta_j) + log(\alpha_i\beta_j)\)
Here \(N\) is the total count, and \(\alpha_i\) and \(\beta_j\) are the proportions of counts for the levels of each of the two categorical variables, with \(\alpha_i\beta_j\) their interaction. Fortunately, there is an implementation of this model in the R package ‘MASS’ that simplifies the coding:
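The chi-square test is run directly on the contingency table:

```r
# Pearson's chi-square test of independence on the contingency table
chisq.test(mood_data)
```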
##
## Pearson's Chi-squared test
##
## data: mood_data
## X-squared = 5.0999, df = 2, p-value = 0.07809
# convert to long
mood_data_long <- as.data.frame(as.table(mood_data))
# log linear model from MASS package
log_lm_mod <- MASS::loglm(Freq ~ handedness + mood, data = mood_data_long)
log_lm_mod
## Call:
## MASS::loglm(formula = Freq ~ handedness + mood, data = mood_data_long)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 5.119915 2 0.07730804
## Pearson 5.099859 2 0.07808717
model | X.squared | p.value |
---|---|---|
chi-square | 5.0999 | 0.0781 |
log lm | 5.0999 | 0.0781 |
We will only feature one example of a ‘non-parametric’ test. Nonetheless, this single example should be enough to demonstrate the logic behind ‘non-parametric’ procedures. Most ‘non-parametric’ tests rely on a simple data transformation trick: rank-transforming the response. Rank transformation converts values into indices according to their position when sorted by magnitude:
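For instance, ranking five values:

```r
# five values and their ranks
x <- c(-0.56, -0.23, 1.56, 0.07, 0.13)
x
rank(x)
```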
## [1] -0.56 -0.23 1.56 0.07 0.13
## [1] 1 2 5 3 4
The Wilcoxon signed-rank test (whose two-sample counterpart is the Mann-Whitney test) is the non-parametric alternative to the one-sample t-test:
# set seed
set.seed(123)
# number of observations
n <- 50
y <- rnorm(n = n, mean = 0.2, sd = 1)
# create data frame
y_data <- data.frame(y)
y
## [1] -0.360475647 -0.030177489 1.758708314 0.270508391 0.329287735
## [6] 1.915064987 0.660916206 -1.065061235 -0.486852852 -0.245661970
## [11] 1.424081797 0.559813827 0.600771451 0.310682716 -0.355841135
## [16] 1.986913137 0.697850478 -1.766617157 0.901355902 -0.272791408
## [21] -0.867823706 -0.017974915 -0.826004448 -0.528891229 -0.425039268
## [26] -1.486693311 1.037787044 0.353373118 -0.938136937 1.453814921
## [31] 0.626464221 -0.095071483 1.095125661 1.078133488 1.021581082
## [36] 0.888640254 0.753917654 0.138088289 -0.105962664 -0.180471001
## [41] -0.494706979 -0.007917278 -1.065396352 2.368955965 1.407961998
## [46] -0.923108583 -0.202884835 -0.266655354 0.979965118 0.116630934
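The one-sample Wilcoxon test is run with wilcox.test():

```r
# one-sample Wilcoxon signed rank test
wilcox.test(y_data$y)
```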
##
## Wilcoxon signed rank test with continuity correction
##
## data: y_data$y
## V = 797, p-value = 0.1248
## alternative hypothesis: true location is not equal to 0
signed_rank <- function(x) sign(x) * rank(abs(x))
# run as a linear model on the signed-rank transformed response
wicx_lm_mod <- lm(signed_rank(y) ~ 1, data = y_data)
summary(wicx_lm_mod)
##
## Call:
## lm(formula = signed_rank(y) ~ 1, data = y_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.38 -24.13 0.12 25.37 43.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.380 4.085 1.562 0.125
##
## Residual standard error: 28.89 on 49 degrees of freedom
model | p.value |
---|---|
1 mean wilcoxon | 0.1248 |
wilcoxon_lm | 0.1248 |
Session information
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_CR.UTF-8 LC_COLLATE=es_ES.UTF-8
## [5] LC_MONETARY=es_CR.UTF-8 LC_MESSAGES=es_ES.UTF-8
## [7] LC_PAPER=es_CR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_CR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.6.2 viridisLite_0.4.0 ggplot2_3.3.5 knitr_1.37
## [5] kableExtra_1.3.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.30 bslib_0.2.5.1 purrr_0.3.4
## [5] colorspace_2.0-3 vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2
## [9] yaml_2.3.5 utf8_1.2.2 rlang_1.0.2 jquerylib_0.1.4
## [13] pillar_1.7.0 glue_1.6.2 withr_2.5.0 DBI_1.1.1
## [17] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
## [21] rvest_1.0.1 evaluate_0.15 fastmap_1.1.0 fansi_1.0.2
## [25] highr_0.9 scales_1.1.1 formatR_1.12 webshot_0.5.2
## [29] jsonlite_1.7.2 systemfonts_1.0.2 gridExtra_2.3 digest_0.6.29
## [33] stringi_1.7.6 dplyr_1.0.7 grid_4.1.1 cli_3.2.0
## [37] tools_4.1.1 magrittr_2.0.2 sass_0.4.0 klippy_0.0.0.9500
## [41] tibble_3.1.6 crayon_1.5.0 pkgconfig_2.0.3 MASS_7.3-54
## [45] ellipsis_0.3.2 xml2_1.3.2 assertthat_0.2.1 rmarkdown_2.10
## [49] svglite_2.0.0 httr_1.4.2 rstudioapi_0.13 R6_2.5.1
## [53] compiler_4.1.1