Here we will walk through the most common statistical tests and show how they can be represented in the linear regression format. This section is based on the post Common statistical tests are linear models by Jonas Lindeløv (https://lindeloev.github.io/tests-as-linear/); check it out for a more in-depth description of the non-parametric alternatives.
The one-sample t-test evaluates whether the mean of a continuous variable differs from 0. It is equivalent to a linear regression with no predictors, which tests whether the intercept differs from 0:
# number of observations
n <- 50
# set seed
set.seed(123)
# create variable with mean 1
y <- rnorm(n = n, mean = 1)
# run t test
(t <- t.test(y))
##
## One Sample t-test
##
## data: y
## t = 7.9, df = 49, p-value = 2.718e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.7712742 1.2975329
## sample estimates:
## mean of x
## 1.034404
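The equivalent linear model fits an intercept-only regression:

```r
# run the equivalent linear model: intercept-only regression
summary(lm(y ~ 1))
```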
##
## Call:
## lm(formula = y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0010 -0.5937 -0.1070 0.6638 2.1345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0344 0.1309 7.9 2.72e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9259 on 49 degrees of freedom
Let’s put the results from both tests together to compare them more closely:
model | estimate | p.value | t |
---|---|---|---|
t-test | 1.0344 | 0 | 7.9 |
lm | 1.0344 | 0 | 7.9 |
Note that, as there are no predictors in the model, we use a ‘1’ in place of the predictors in the model formula (‘y ~ 1’).
A paired t-test evaluates whether the mean difference between two numeric variables is 0:
The corresponding linear model is the same as that of the one-sample t-test, but the response variable is the difference between the two variables (y1 - y2):
set.seed(123)
# sample size
n <- 50
# variable with mean 1
y1 <- rnorm(n = n, mean = 1)
# variable with mean 1.4
y2 <- rnorm(n = n, mean = 1.4)
# run paired t test
paired_t <- t.test(y1, y2, paired = TRUE)
paired_t
##
## Paired t-test
##
## data: y1 and y2
## t = -2.7468, df = 49, p-value = 0.008395
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8865856 -0.1374239
## sample estimates:
## mean of the differences
## -0.5120047
# difference between the 2 variables
diff_y <- y1 - y2
# run model
lm_paired_t <- summary(lm(formula = diff_y ~ 1))
lm_paired_t
##
## Call:
## lm(formula = diff_y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6003 -0.7010 0.1121 1.0348 2.9089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5120 0.1864 -2.747 0.00839 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.318 on 49 degrees of freedom
model | estimate | p.value | t |
---|---|---|---|
paired t-test | -0.512 | 0.0084 | -2.7468 |
lm | -0.512 | 0.0084 | -2.7468 |
The two-means t-test, also called the independent t-test, evaluates whether the means of two groups differ. The null hypothesis is that the difference is 0:
# set seed
set.seed(123)
# number of observations
n <- 50
b0 <- -4
b1 <- 3
error <- rnorm(n = n, mean = 0, sd = 1)
# random variables
x1_num <- rbinom(n = n, size = 1, prob = 0.5)
y <- b0 + b1 * x1_num + error
x1 <- factor(x1_num, labels = c("a", "b"))
# create data frame
xy_data_cat <- data.frame(x1, x1_num, y)
# run independent t test
indep_t <- t.test(xy_data_cat$y[xy_data_cat$x1 == "a"], xy_data_cat$y[xy_data_cat$x1 ==
"b"])
indep_t
##
## Welch Two Sample t-test
##
## data: xy_data_cat$y[xy_data_cat$x1 == "a"] and xy_data_cat$y[xy_data_cat$x1 == "b"]
## t = -12.276, df = 47.946, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.751804 -2.695745
## sample estimates:
## mean of x mean of y
## -4.0774838 -0.8537091
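The equivalent linear model regresses y on the two-level factor:

```r
# run the equivalent linear model: y as a function of the two-level factor
summary(lm(y ~ x1, data = xy_data_cat))
```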
##
## Call:
## lm(formula = y ~ x1, data = xy_data_cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1129 -0.5939 -0.1355 0.5519 2.2464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.0775 0.1857 -21.96 <2e-16 ***
## x1b 3.2238 0.2626 12.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9285 on 48 degrees of freedom
## Multiple R-squared: 0.7584, Adjusted R-squared: 0.7534
## F-statistic: 150.7 on 1 and 48 DF, p-value: < 2.2e-16
model | estimate | p.value | t |
---|---|---|---|
2 means t-test | 3.2238 | 0 | -12.2759 |
lm | 3.2238 | 0 | 12.2759 |
Note that the sign of t differs only because t.test() computes mean(a) - mean(b), while the lm coefficient is the effect of level ‘b’ relative to ‘a’.
The one-way ANOVA is very similar to the two-means t-test, but with three or more levels in the categorical variable:
So it is a multiple regression model. The categorical variable is dummy coded, so it gets split into indicator predictors (each \(x_i\) is either 0 or 1).
A data set with a 3-level categorical variable can be simulated like this:
# set seed
set.seed(123)
# number of observations
n <- 50
b0 <- -4
b1 <- 3
error <- rnorm(n = n, mean = 0, sd = 1)
# random variables
x1_num <- rbinom(n = n, size = 2, prob = c(0.33, 0.33))
y <- b0 + b1 * x1_num + error
x1 <- factor(x1_num, labels = c("a", "b", "c"))
# create data frame
xy_data_cat2 <- data.frame(x1, y)
head(xy_data_cat2)
x1 | y |
---|---|
b | -1.5605 |
a | -4.2302 |
b | 0.5587 |
c | 2.0705 |
b | -0.8707 |
b | 0.7151 |
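The one-way ANOVA can then be run with aov(), or equivalently by calling anova() on the fitted linear model (a minimal sketch):

```r
# one-way ANOVA; anova(lm(...)) produces the same table as summary(aov(...))
anova(lm(y ~ x1, data = xy_data_cat2))
```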
term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
x1 | 2 | 4.3728 | 2.1864 | 2.7307 | 0.0755 |
Residuals | 47 | 37.6318 | 0.8007 | NA | NA |
##
## Call:
## lm(formula = y1 ~ x1, data = xy_data_cat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67208 -0.51953 -0.02887 0.48386 2.32657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8424 0.1866 4.515 4.24e-05 ***
## x1b 0.5646 0.2774 2.035 0.0475 *
## x1c -0.1409 0.3673 -0.384 0.7030
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8948 on 47 degrees of freedom
## Multiple R-squared: 0.1041, Adjusted R-squared: 0.06598
## F-statistic: 2.731 on 2 and 47 DF, p-value: 0.07552
model | F.statistic | p.value |
---|---|---|
1 way anova | 2.7307 | 0.0755 |
lm | 2.7307 | 0.0755 |
The chi-square test looks for an association between two categorical variables based on the co-occurrence of their categories, measured as counts. This is what a contingency table contains: counts for all combinations of categories between two variables. For instance, mood by handedness:
# Contingency table modified from
# https://lindeloev.github.io/tests-as-linear/
mood_data <- matrix(data = c(100, 70, 30, 32, 110, 120), nrow = 2,
dimnames = list(handedness = c("left_handed", "right_handed"),
mood = c("happy", "meh", "sad")))
mood_data
## mood
## handedness happy meh sad
## left_handed 100 30 110
## right_handed 70 32 120
Although the coding is a bit more elaborate, it can be narrowed down to a log-linear two-way ANOVA model:
\(log(y_i) = log(N) + log(\alpha_i) + log(\beta_j) + log(\alpha_i\beta_j)\)
Here \(N\) is the total count, and \(\alpha_i\) and \(\beta_j\) are the proportions of counts for the levels of each of the two categorical variables, with \(\alpha_i\beta_j\) their interaction. Fortunately, there is an implementation of this model in the R package ‘MASS’ that simplifies the coding:
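The chi-square test is run directly on the contingency table:

```r
# Pearson's chi-square test of independence on the contingency table
chisq.test(mood_data)
```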
##
## Pearson's Chi-squared test
##
## data: mood_data
## X-squared = 5.0999, df = 2, p-value = 0.07809
# convert to long
mood_data_long <- as.data.frame(as.table(mood_data))
# log linear model from MASS package
log_lm_mod <- MASS::loglm(Freq ~ handedness + mood, data = mood_data_long)
log_lm_mod
## Call:
## MASS::loglm(formula = Freq ~ handedness + mood, data = mood_data_long)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 5.119915 2 0.07730804
## Pearson 5.099859 2 0.07808717
model | X.squared | p.value |
---|---|---|
chi-square | 5.0999 | 0.0781 |
log lm | 5.0999 | 0.0781 |
We will only feature one example of a ‘non-parametric’ test. Nonetheless, this single example should be enough to demonstrate the logic behind ‘non-parametric’ procedures. Most ‘non-parametric’ tests rely on a simple data transformation trick: rank-transforming the response. Rank transformation converts values into indices according to their position when sorted by magnitude:
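For instance, ranking five values:

```r
# five values and their ranks
x <- c(-0.56, -0.23, 1.56, 0.07, 0.13)
x
rank(x)
```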
## [1] -0.56 -0.23 1.56 0.07 0.13
## [1] 1 2 5 3 4
The Wilcoxon signed-rank test (whose two-sample counterpart is the Mann-Whitney test) is the non-parametric alternative to the one-sample t-test:
# set seed
set.seed(123)
# number of observations
n <- 50
y <- rnorm(n = n, mean = 0.2, sd = 1)
# create data frame
y_data <- data.frame(y)
y
## [1] -0.360475647 -0.030177489 1.758708314 0.270508391 0.329287735
## [6] 1.915064987 0.660916206 -1.065061235 -0.486852852 -0.245661970
## [11] 1.424081797 0.559813827 0.600771451 0.310682716 -0.355841135
## [16] 1.986913137 0.697850478 -1.766617157 0.901355902 -0.272791408
## [21] -0.867823706 -0.017974915 -0.826004448 -0.528891229 -0.425039268
## [26] -1.486693311 1.037787044 0.353373118 -0.938136937 1.453814921
## [31] 0.626464221 -0.095071483 1.095125661 1.078133488 1.021581082
## [36] 0.888640254 0.753917654 0.138088289 -0.105962664 -0.180471001
## [41] -0.494706979 -0.007917278 -1.065396352 2.368955965 1.407961998
## [46] -0.923108583 -0.202884835 -0.266655354 0.979965118 0.116630934
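The one-sample Wilcoxon test is run with wilcox.test():

```r
# one-sample Wilcoxon signed rank test
wilcox.test(y_data$y)
```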
##
## Wilcoxon signed rank test with continuity correction
##
## data: y_data$y
## V = 797, p-value = 0.1248
## alternative hypothesis: true location is not equal to 0
signed_rank <- function(x) sign(x) * rank(abs(x))
# run as a linear model on the signed-rank transformed response
wicx_lm_mod <- lm(signed_rank(y) ~ 1, data = y_data)
summary(wicx_lm_mod)
##
## Call:
## lm(formula = signed_rank(y) ~ 1, data = y_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.38 -24.13 0.12 25.37 43.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.380 4.085 1.562 0.125
##
## Residual standard error: 28.89 on 49 degrees of freedom
model | p.value |
---|---|
1 mean wilcoxon | 0.1248 |
wilcoxon_lm | 0.1248 |
Session information
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_CR.UTF-8 LC_COLLATE=es_ES.UTF-8
## [5] LC_MONETARY=es_CR.UTF-8 LC_MESSAGES=es_ES.UTF-8
## [7] LC_PAPER=es_CR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_CR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.6.2 viridisLite_0.4.0 ggplot2_3.3.5 knitr_1.37
## [5] kableExtra_1.3.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.30 bslib_0.2.5.1 purrr_0.3.4
## [5] colorspace_2.0-3 vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2
## [9] yaml_2.3.5 utf8_1.2.2 rlang_1.0.2 jquerylib_0.1.4
## [13] pillar_1.7.0 glue_1.6.2 withr_2.5.0 DBI_1.1.1
## [17] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
## [21] rvest_1.0.1 evaluate_0.15 fastmap_1.1.0 fansi_1.0.2
## [25] highr_0.9 scales_1.1.1 formatR_1.12 webshot_0.5.2
## [29] jsonlite_1.7.2 systemfonts_1.0.2 gridExtra_2.3 digest_0.6.29
## [33] stringi_1.7.6 dplyr_1.0.7 grid_4.1.1 cli_3.2.0
## [37] tools_4.1.1 magrittr_2.0.2 sass_0.4.0 klippy_0.0.0.9500
## [41] tibble_3.1.6 crayon_1.5.0 pkgconfig_2.0.3 MASS_7.3-54
## [45] ellipsis_0.3.2 xml2_1.3.2 assertthat_0.2.1 rmarkdown_2.10
## [49] svglite_2.0.0 httr_1.4.2 rstudioapi_0.13 R6_2.5.1
## [53] compiler_4.1.1