Coding and Reproducibility

Published

April 24, 2024

Objectives

Understand how coding can help make research reproducible
Learn programming practices that can improve reproducibility

0.1 Free Software and Code

Free and open-source programs allow users to inspect, modify, and enhance their design by providing access to their source code.
Open-source code is ideal for reproducible research because scripts can contain all the steps of the analysis (self-documentation).
Code, in general, allows colleagues to see what we have done and rerun or even modify our analyses.
Free tools can be used by anyone, unlike commercial tools.
Open-source code enables a detailed understanding of the analyses.

0.1.1 Why R?

[Figure: Why R. Source: www.traininginbangalore.com]

0.2 Tools for Reproducible Programming

0.2.1 Literate Programming

Literate programming involves documenting in detail what the problem is, how it is solved, how and why a particular analysis workflow was adopted, how it was optimized (if it was), and how it was implemented in the programming language.
Dynamic reports in R facilitate the use of literate programming to document data handling and statistical analysis (this file you are reading right now is a dynamic report created in R).
The main way R facilitates reproducible research is by allowing users to create a document that is a combination of content and data analysis code.
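As a minimal sketch of a dynamic report, the snippet below writes a tiny R Markdown file and renders it to Markdown with knitr::knit(). The report text and the object names (chunk_fence, rmd_lines, rmd_file, md_file) are made up for illustration, and the example assumes the knitr package is installed.

```r
# Minimal sketch: prose and analysis code live in the same source file.
# The report content below is invented for illustration only.
chunk_fence <- strrep("`", 3)  # the three backticks that delimit a code chunk

rmd_lines <- c(
  "# A tiny report",
  "",
  "The mean of three made-up measurements:",
  "",
  paste0(chunk_fence, "{r}"),
  "mean(c(1, 2, 3))",
  chunk_fence
)

rmd_file <- tempfile(fileext = ".Rmd")
md_file <- tempfile(fileext = ".md")
writeLines(rmd_lines, rmd_file)

# knit() runs every chunk and weaves the results into the output document
knitr::knit(input = rmd_file, output = md_file, quiet = TRUE)

report <- readLines(md_file)
cat(report, sep = "\n")
```

The rendered file contains the text, the code, and the computed result together, so the number in the report cannot drift away from the code that produced it.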

0.2.2 Reproducible Environments

Reproducibility is also about ensuring that someone else can reuse your code to get the same results.
For this, you need to provide more than just the code and the data.
Documenting and managing your project’s dependencies correctly can be complicated. However, even simple documentation that helps others understand the setup you used can have a significant impact.
Ideally, you should document the exact versions of all packages and software you used and the operating system.
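One possible way to collect these details with base R alone is sketched below. The package names in pkgs are placeholders (replace them with the packages your analysis uses), and env_record is a hypothetical object name.

```r
# Sketch: record the R version, operating system and package versions.
# "stats" and "utils" are placeholders; list the packages your analysis uses.
pkgs <- c("stats", "utils")

env_record <- data.frame(
  component = c("R", "OS", pkgs),
  version = c(
    paste(R.version$major, R.version$minor, sep = "."),
    paste(Sys.info()[["sysname"]], Sys.info()[["release"]]),
    vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
  ),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(env_record)
```

A small table like this can be pasted into a README or saved alongside the results.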

0.2.3 Session Information

The simplest way to document the environment (R + packages and their versions) in which an analysis was done is by using the sessionInfo() function:

Code

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Costa_Rica
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.35     assertthat_0.2.1  lubridate_1.9.3   fastmap_1.1.1    
 [5] xfun_0.43         magrittr_2.0.3    glue_1.7.0        stringr_1.5.1    
 [9] knitr_1.46        htmltools_0.5.8.1 timechange_0.2.0  generics_0.1.3   
[13] rmarkdown_2.26    lifecycle_1.0.4   cli_3.6.2         vctrs_0.6.5      
[17] compiler_4.3.2    purrr_1.0.2       emo_0.0.0.9000    rstudioapi_0.15.0
[21] tools_4.3.2       evaluate_0.23     yaml_2.3.8        formatR_1.14     
[25] rlang_1.1.3       jsonlite_1.8.8    crayon_1.5.2      htmlwidgets_1.6.4
[29] stringi_1.8.3

However, this documentation alone does not make the analyses reproducible: packages are updated frequently, and some may no longer be available after a while.
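Even so, saving the session information together with the results costs very little. A minimal sketch, assuming an arbitrary file name (session_info.txt) and using a temporary directory only for demonstration:

```r
# Sketch: store the session information in a plain-text file.
# In a real project, write to the project folder instead of tempdir().
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# The first line of the saved file identifies the R version
readLines(info_file, n = 1)
```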

0.2.4 Packrat: Reproducible Package Management in R

R packages (and their specific versions) used in an analysis can be difficult to replicate:

Have you ever had to use trial and error to figure out which R packages you need to install to make someone else’s code work?
Have you ever updated a package to make your project’s code work, only to find out that the updated package causes another project’s code to stop working?

With the packrat package, projects have several useful features in terms of reproducibility:

Isolated: Installing a new or updated package for one project will not affect your other projects, and vice versa. That’s because packrat gives each project its own private package library.
Portable: Easily move your projects from one computer to another, even across platforms. packrat makes it easy to install the packages your project depends on.
Reproducible: packrat records the exact package versions your project depends on and ensures that those exact versions are installed wherever you go.

In short, packrat is a dependency management system for R: it records the package versions each project uses so that the same versions can be reinstalled later, which makes your code more reproducible.

0.2.4.1 Using Packrat

Of course, first, we need to install the packrat package in R:

Code

# install package
install.packages("packrat")
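When this step is part of a script that is run repeatedly, it can help to install a package only if it is missing. The helper below is a hypothetical convenience function (ensure_installed is not part of packrat or base R), shown as a sketch:

```r
# Hypothetical helper: install a package only when it is not yet available
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # TRUE if the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is installed here
ensure_installed("stats")
```

Calling ensure_installed("packrat") would then install packrat only on machines where it is not already present.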

Now, let’s create a new R project (in a new directory).
After creating a project (or moving to an existing one) we can start monitoring and managing packages with packrat like this:

Code

# start packrat in project
packrat::init(project = "/project/directory")

If the working directory is already the project directory, the ‘project’ argument can be omitted:

Code

# start packrat in project
packrat::init()

After this, the use of packages in this project will be managed by packrat (you will see some differences in what the R console prints when installing packages). So, we are already using packrat. A packrat project contains some additional files and directories. The init() function creates these files and directories if they do not already exist:

packrat/packrat.lock: Lists the precise versions of the packages that were used to satisfy the dependencies, including dependencies of dependencies (should never be edited manually!).
packrat/packrat.opts: Project-specific packrat options. These can be queried and set with get_opts() and set_opts(); see the “packrat-options” help page for more information.
packrat/lib/: Private package library for this project.
packrat/src/: Source packages of all dependencies that have been reported to packrat.
.Rprofile: Tells R to use the private package library when started from the project directory.
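To verify that a project has this layout, the listed paths can be checked with base R. The function name check_packrat_files and the demo directory are assumptions made for this sketch; the demo directory only mimics part of a packrat project and is not created by packrat itself.

```r
# Hypothetical helper: report which packrat files and folders exist in a project
check_packrat_files <- function(project_dir = ".") {
  expected <- c(
    "packrat/packrat.lock",
    "packrat/packrat.opts",
    "packrat/lib",
    "packrat/src",
    ".Rprofile"
  )
  present <- file.exists(file.path(project_dir, expected))
  setNames(present, expected)
}

# Demonstration on a throwaway directory that imitates part of the layout
demo_dir <- file.path(tempdir(), "demo_packrat_project")
dir.create(file.path(demo_dir, "packrat", "lib"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "packrat", "packrat.lock"))

check_packrat_files(demo_dir)
```

In a real packrat project, running check_packrat_files() from the project directory should report every entry as present.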

Apart from these files, the only difference from other projects is that a packrat project has its own package library, located in /project/directory/packrat/lib. As an example, let’s install a new package; it can be one you are familiar with or the one we use here:

Code

install.packages("fun")

Every time we install one or more packages, it is necessary to update the tracking status of packrat. We do this as follows:

Code

# check current status
packrat::status()

# update packrat in project
packrat::snapshot()

With this package, we can play games in R:

Code

# example of an irrelevant game
library(fun)

if (.Platform$OS.type == "windows") x11() else x11(type = "Xlib")

mine_sweeper()

Or take an Alzheimer’s test:

Code

# another slightly less irrelevant game
x <- alzheimer_test()

If we remove a package that we used in the project, we can reinstall it using restore():

Code

# remove
remove.packages("fun")

# check current status
packrat::status()

# restore
packrat::restore()

New packages can be installed:

Code

# install
install.packages("cowsay")

# load
library(cowsay)

# print a message in a speech bubble
say("Hello world!")


 -------------- 
Hello world! 
 --------------
    \
      \
        \
            |\___/|
          ==) ^Y^ (==
            \  ^  /
             )=*=(
            /     \
            |     |
           /| | | |\
           \| | |_|/\
      jgs  //_// ___/
               \_)

Code

# print a random "fact" about Richard Stallman (rms)
say("rms")


 -------------- 
Richard Stallman doesn't really believe in open software, because it's not free enough. 
 --------------
    \
      \
        \
            |\___/|
          ==) ^Y^ (==
            \  ^  /
             )=*=(
            /     \
            |     |
           /| | | |\
           \| | |_|/\
      jgs  //_// ___/
               \_)

…and newly installed packages should be recorded by packrat in the same way:

Code

# check current status
packrat::status()

# update packrat in project
packrat::snapshot()

In this GitHub repository, there is an R project with packrat. We can clone it just to see how it works without needing to install the packages:

Code

# run this in a system terminal, not in the R console
git clone https://github.com/maRce10/ejemplo_packrat_repo.git

If you or someone else wants to reproduce your project, they can use the packrat::restore() function to install the exact versions of the packages listed in the packrat.lock file.

Code

packrat::restore()

This will ensure that the correct package versions are installed, maintaining consistency across different environments.
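To see which versions a lockfile pins, and whether the installed packages match them, something like the sketch below could be used. It rests on an assumption: that packrat.lock uses the DCF ("Field: value") layout, so base R's read.dcf() can parse it. The lockfile content here is a mock written for the demonstration (the package versions are taken from the session information on this page), and the object names are hypothetical.

```r
# Sketch: compare versions recorded in a packrat-style lockfile with the
# versions currently installed.
# Assumption: the lockfile is in DCF format; the content below is a mock.
mock_lock <- c(
  "PackratFormat: 1.4",
  "RVersion: 4.3.2",
  "",
  "Package: cowsay",
  "Source: CRAN",
  "Version: 0.9.0",
  "",
  "Package: fortunes",
  "Source: CRAN",
  "Version: 1.5-4"
)
lock_file <- tempfile(fileext = ".lock")
writeLines(mock_lock, lock_file)

records <- read.dcf(lock_file, fields = c("Package", "Version"))

# The header record has no Package field, so drop it
locked <- as.data.frame(
  records[!is.na(records[, "Package"]), , drop = FALSE],
  stringsAsFactors = FALSE
)

# Flag packages whose installed version differs from the locked one
installed <- installed.packages()[, "Version"]
locked$installed <- unname(installed[locked$Package])
locked$matches <- !is.na(locked$installed) & locked$installed == locked$Version

print(locked)
```

For a real project, lock_file would point to packrat/packrat.lock; rows where matches is FALSE indicate packages that packrat::restore() would need to install or change.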

Session Information

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Costa_Rica
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cowsay_0.9.0

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         knitr_1.46       
 [5] rlang_1.1.3       xfun_0.43         stringi_1.8.3     purrr_1.0.2      
 [9] generics_0.1.3    assertthat_0.2.1  jsonlite_1.8.8    glue_1.7.0       
[13] htmltools_0.5.8.1 formatR_1.14      rmarkdown_2.26    emo_0.0.0.9000   
[17] evaluate_0.23     fastmap_1.1.1     yaml_2.3.8        lifecycle_1.0.4  
[21] stringr_1.5.1     compiler_4.3.2    htmlwidgets_1.6.4 timechange_0.2.0 
[25] rstudioapi_0.15.0 fortunes_1.5-4    digest_0.6.35     rmsfact_0.0.3    
[29] magrittr_2.0.3    tools_4.3.2       lubridate_1.9.3
