graph LR; a[1_cleaning.R] --> b[2_tables.R] & c[3_figures.R] & d[4_models.R]
Not taught in statistics classes
Hard to find good examples
A matter of Statistical tool vs Programming language
Automate it once, and let the code do it for you!
Important
When you start an R session, R places itself somewhere on your computer. That exact location is called the working directory (wd).
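A minimal sketch of how to inspect the working directory from R itself, using only base R. The file name `my_file.csv` is hypothetical and only illustrates how a relative path is resolved.

```r
# Print the current working directory (the folder R is "located" in)
wd <- getwd()
print(wd)

# Relative paths are resolved against the working directory.
# "Data/my_file.csv" is a hypothetical file used only for illustration.
relative_path <- file.path("Data", "my_file.csv")
absolute_path <- file.path(wd, relative_path)
print(absolute_path)

# List what R can see in the working directory
print(list.files(wd))
```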
Important
RStudio always starts in the same place (unlike other editors). To find the exact location, go to:
Tools > Global Options > General
(in the tab “Basic”)
This is the location used when RStudio is not working within a project. See the next picture for a visual representation.
Take notes!
It is good practice to place all (or most) of your R code in the same folder.
In my case, I have a Code folder that contains one folder for every programming language I use (there is one for R). Inside each of these folders, there are at least two folders: Test(s), where I test new things, and Project, where I put all my serious projects, each in its dedicated folder. It is the Test(s) folder that we will set as our default working directory.
Exercises
Test
and Project
folders inside it.Test
directory.Project
folder.Test
folder, it will be useful later!Test
folder) and your serious works Project
ctrl/cmd + shift + f in RStudio
This command allows you to search all the R code in a specific folder. Since all our R code is in the same place, we can search the “archive” of our previous work. It comes in really handy when we need to remember how to use a particular piece of code. You can search for anything (function, comment, etc.).
When you search for a particular keyword, RStudio shows you every place where you used it and indicates in which folder and on which line. You can also click on a result and RStudio will open the selected file at the right line (really powerful).
See an example in the next slide
Code you write once is something you do not need to redo or remember!
You will spend a good amount of time in RStudio, so make it all the more enjoyable!
Resources:
Reproducibility!!!
Long-term reproducibility is enhanced when you turn this feature off and clear R’s memory at every restart. Starting with a blank slate provides timely feedback that encourages the development of scripts that are complete and self-contained. Posit
It makes the whole work smoother and more colorful
RStudio project
RStudio projects (.Rproj) are files that tell R where to work. They also allow you to set options and behaviours that are specific to the project.
renv
{renv} is a package in R that allows you to control the versions of your packages. It isolates your project from the rest of your R ecosystem and starts as if you had not installed any packages.
setwd()?
You need to find the path manually every time
You need to put it on every document or always start with the same document
Projects are only separated by files (not by the entire environment)
If you share your code, your colleague needs to change the setwd() path every time
An RStudio project defines the working directory once, easily, and it is shareable!
renv
. In this project try to use a library (for instance {dplyr})
Expert data scientists keep all the files associated with a given project together — input data, scripts, analytical results, and figures. This is such a wise and common practice that RStudio has built-in support for this via Projects. RStudio Projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents. Posit
setwd()
from your code!)
Sometimes, when a package is updated, major changes are introduced, which means that code that worked fine until now could produce errors or break. Since we cannot predict how packages will evolve, we can create an environment that is frozen in time. {renv} creates this context, isolated from your other packages, so that the packages inside it have their own versions. This also means that we have to reinstall all the packages.
If the project was not started with renv the first time
Update from time to time to keep track of the packages
Install a specific version of a package
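The three situations above probably map onto the {renv} calls sketched below. This is an assumption, since the commands themselves are not shown here. `renv::init()`, `renv::snapshot()`, `renv::install()` and `renv::restore()` are real {renv} functions, and the version number is only illustrative. The flag `run_renv_demo` is hypothetical and keeps the block from modifying a project by accident. In practice, each command is run on its own in the console when needed.

```r
# Set to TRUE to actually run the commands (they modify the current project)
run_renv_demo <- FALSE

if (run_renv_demo) {
  # If the project was not started with renv: set it up for this project
  renv::init()

  # From time to time: record the package versions in use in renv.lock
  renv::snapshot()

  # Install a specific version of a package (version number is illustrative)
  renv::install("dplyr@1.1.4")

  # On another machine: reinstall the versions recorded in renv.lock
  renv::restore()
}
```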
Data: Raw data
MyData: Cleaned data
Report: Report/presentation created with a notebook
Results: Folder for tables and figures
Scripts: Where the code is
Run code on RStudio, Tutorial RStudio Text Editor, R console and Shortcuts
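The folder layout listed above can be created from R rather than by hand. The sketch below uses only base R. `create_project_folders()` is a hypothetical helper, and the folder names follow the list, with `Results` spelled as in the save paths used by the scripts later on.

```r
# Hypothetical helper: create the project sub-folders under `root`
create_project_folders <- function(root = ".",
                                   folders = c("Data", "MyData", "Report",
                                               "Results", "Scripts")) {
  paths <- file.path(root, folders)
  for (path in paths) {
    # showWarnings = FALSE: stay silent if the folder already exists
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Try it in a temporary directory so nothing is written to a real project
demo_root <- file.path(tempdir(), "demo_project")
created <- create_project_folders(demo_root)
print(basename(created))
```

Inside an RStudio project, calling `create_project_folders()` with no argument would create the folders in the project's working directory.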
Objectives for clean code
(1) Files can be run entirely in one go (ctrl/cmd + alt + r)
setwd(), install.packages(), View())
(2) Files should be self-contained
(3) Files should be named, organized and used with a specific purpose
Create the file “1_cleaning.R” in your “Scripts” folder (or its equivalent) and open it.
You could also name it “01_the_data_cleaning_file.R”. The name you choose does not matter as long as it is ordered and self-explanatory.
Tidyverse style guide: Files
Note that we do not use spaces or special characters to follow the tidyverse style guide for files.
Should follow a convention: snake_case, CamelCase, etc.
No spaces and no special characters
Clear but not too long
Between expression (readability)
Make line breaks as frequent as possible
Good comments only explain the overall goal, not each specific line or step.
Sometimes I replace “Analysis” by “Work”
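As a rough illustration of such a script skeleton, the sketch below uses the section names that appear in the scripts of this deck (Libraries, Data, Analysis, Saving). In RStudio, a comment ending with four or more dashes creates a code section that shows up in the document outline. The data set and file name are placeholders, not part of the project.

```r
# Libraries ----
# Only base R is used in this sketch, so nothing needs to be loaded

# Data ----
cars_raw <- datasets::cars   # built-in example data (speed, dist)

# Analysis ----
# Overall goal: relate stopping distance to speed
cars_model <- lm(dist ~ speed, data = cars_raw)

# Saving ----
# A temporary file is used here; a real script would save into the project
output_file <- file.path(tempdir(), "cars_model.rds")
saveRDS(cars_model, file = output_file)
```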
Step 1 - set the libraries needed
Packages should only be loaded once, at the beginning of the code
Option (1)
# Libraries ----
library(tidyverse)
library(rio) # Should be installed before
library(janitor) # Should be installed before
Multiple cursors
Ctrl + Alt/Option + {Up/Down}
Ctrl + Alt/Option + Shift + {Direction}
Ctrl + Alt/Option + Shift + {Mouse}
Option (2)
It loads the package with library()
If it does not find it, it installs it with install.packages()
It updates packages only if specified
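The first two behaviours listed above can be sketched in base R as follows. Option (2) presumably refers to `pacman::p_load()`, which the scripts in this deck use. The helper `load_or_install()` below is hypothetical and is not how {pacman} is implemented.

```r
# Hypothetical helper mimicking the behaviour described above in base R
load_or_install <- function(packages) {
  for (pkg in packages) {
    # Install only when the package cannot be found
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # Attach the package, as library() would
    library(pkg, character.only = TRUE)
  }
  invisible(packages)
}

# "stats" and "utils" ship with R, so nothing is installed here
loaded <- load_or_install(c("stats", "utils"))
```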
Select the option that you like the most!
Too many packages/functions
Below are a few examples
# R
load("data.rda")
data <- readRDS("data.rds")
# csv
data <- read.csv("data.csv")
data <- readr::read_csv("data.csv")
# Excel
data <- xlsx::read.xlsx("data.xlsx", sheetIndex = 1) # a sheet index or name is required
data <- readxl::read_excel("data.xlsx")
# SPSS/Stata/SAS
data <- haven::read_spss("data.sav")
data <- haven::read_stata("data.dta")
data <- haven::read_sas("data.sas7bdat")
# etc.
# Arrow
data <- arrow::open_dataset("data.parquet")
# etc.
With {rio} you only need two functions: import() and export()
It works for all of the common data formats: rdata, rda, rds, csv, tsv, sas7bdat, sav, dta, xlsx, parquet, feather, json and many more!
It reduces the number of packages/functions to memorize and load.
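A small round trip can make this concrete. The sketch below assumes {rio} is installed and writes to temporary files, using the built-in `mtcars` data purely as an example.

```r
library(rio)  # assumed to be installed

# Write a built-in data frame to a temporary CSV file
csv_file <- file.path(tempdir(), "mtcars_demo.csv")
export(mtcars, csv_file)

# Read it back: import() picks the reader from the ".csv" extension
mtcars_csv <- import(csv_file)

# The same two functions handle another format, here RDS
rds_file <- file.path(tempdir(), "mtcars_demo.rds")
export(mtcars, rds_file)
mtcars_rds <- import(rds_file)

print(dim(mtcars_csv))
print(dim(mtcars_rds))
```

Only the file extension changes between the two formats, and the function calls stay the same.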
# Libraries ----
pacman::p_load(tidyverse, rio, janitor) # Should be installed before
# Data ----
andorra_raw <-
import("Data/WVS_Wave_7_Andorra_Stata_v5.0.dta",
setclass = "tbl_df") # {rio}
RStudio sometimes suggests options based on what you have already typed. Use the arrow keys or your mouse to choose one of the options and press either Enter or Tab to accept it.
For file paths: if your cursor is between a pair of double quotes, press Tab and RStudio will show you all the available folders and files in the working directory. You can navigate using the arrow keys or your mouse, go inside folders using Tab and select using Enter.
# Libraries ----
pacman::p_load(tidyverse, rio, janitor) # Should be installed before
# Data ----
andorra_raw <-
import("Data/WVS_Wave_7_Andorra_Stata_v5.0.dta",
setclass = "tbl_df") # {rio}
andorra_raw %>%
clean_names() # {janitor} standardize column names
The function clean_names() from {janitor} cleans the column names of the data frame by replacing spaces and special characters with “_”, “%” with “_percent_” and “#” with “_number_”
# Libraries ----
pacman::p_load(tidyverse, rio, janitor) # Should be installed before
# Data ----
andorra_raw <-
import("Data/WVS_Wave_7_Andorra_Stata_v5.0.dta",
setclass = "tbl_df") # {rio}
andorra_raw %>%
clean_names() %>% # {janitor} standardize column names
select(sex = q260, # Select and rename
age = q262, # Select and rename
emancipative = resemaval, # Select and rename
starts_with("h_"), # Select all variables that starts with "h_"
q1:q6) # Select from variable q1 to q6
To go deeper check the following link: Tidy-Select
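The selection helpers used above can be tried on a small toy table before applying them to the real data. The sketch assumes {dplyr} and {janitor} are installed. The column names imitate the survey variables, and the values are invented.

```r
library(dplyr)    # assumed to be installed
library(janitor)  # assumed to be installed

# Toy data with untidy column names (invented for illustration)
survey_raw <- tibble(
  `Q260`         = c(1, 2, 1),
  `Q262`         = c(34, 51, 27),
  `H Settlement` = c(1, 3, 2),
  `H Urbrural`   = c(1, 2, 1),
  `Q1`           = c(1, 2, 1),
  `Q2`           = c(2, 2, 3)
)

survey_selected <-
  survey_raw %>%
  clean_names() %>%            # "H Settlement" becomes "h_settlement"
  select(sex = q260,           # select and rename
         age = q262,           # select and rename
         starts_with("h_"),    # every column whose name starts with "h_"
         q1:q2)                # every column from q1 to q2

print(names(survey_selected))
```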
# Libraries ----
pacman::p_load(tidyverse, rio, janitor) # Should be installed before
# Data ----
andorra_raw <-
import("Data/WVS_Wave_7_Andorra_Stata_v5.0.dta",
setclass = "tbl_df") # {rio}
andorra_raw %>%
clean_names() %>% # {janitor} standardize column names
select(sex = q260, # Select and rename
age = q262, # Select and rename
emancipative = resemaval, # Select and rename
starts_with("h_"), # Select all variables that starts with "h_"
q1:q6) %>% # Select from variable q1 to q6
filter_all(all_vars(. >= 0)) %>% # Remove missing values
mutate(sex = factor(sex, labels = c("Male", "Female")), # Labels
h_settlement = factor(h_settlement, # Labels
labels = c("Capital city",
"Regional center",
"Another city",
"Village")),
h_urbrural = factor(h_urbrural, # Labels
labels = c("Urban",
"Rural"))) %>%
mutate_at(6:11, ~factor(.x,
labels = c("Very important",
"Rather important",
"Not very important",
"Not at all important")))
# Libraries ----
pacman::p_load(tidyverse, rio, janitor) # Should be installed before
# Data ----
andorra_raw <-
import("Data/WVS_Wave_7_Andorra_Stata_v5.0.dta",
setclass = "tbl_df") # {rio}
andorra_clean <- # Save the whole process here
andorra_raw %>%
clean_names() %>% # {janitor} standardize column names
select(sex = q260, # Select and rename
age = q262, # Select and rename
emancipative = resemaval, # Select and rename
starts_with("h_"), # Select all variables that starts with "h_"
q1:q6) %>% # Select from variable q1 to q6
filter_all(all_vars(. >= 0)) %>% # Remove missing values
mutate(sex = factor(sex, labels = c("Male", "Female")), # Labels
h_settlement = factor(h_settlement, # Labels
labels = c("Capital city",
"Regional center",
"Another city",
"Village")),
h_urbrural = factor(h_urbrural, # Labels
labels = c("Urban",
"Rural"))) %>%
mutate_at(6:11, ~factor(.x,
labels = c("Very important",
"Rather important",
"Not very important",
"Not at all important")))
# Libraries ----
pacman::p_load(tidyverse, rio, janitor) # Should be installed before
# Data ----
andorra_raw <-
import("Data/WVS_Wave_7_Andorra_Stata_v5.0.dta",
setclass = "tbl_df") # {rio}
andorra_clean <- # Save the whole process here
andorra_raw %>%
clean_names() %>% # {janitor} standardize column names
select(sex = q260, # Select and rename
age = q262, # Select and rename
emancipative = resemaval, # Select and rename
starts_with("h_"), # Select all variables that starts with "h_"
q1:q6) %>% # Select from variable q1 to q6
filter_all(all_vars(. >= 0)) %>% # Remove missing values
mutate(sex = factor(sex, labels = c("Male", "Female")), # Labels
h_settlement = factor(h_settlement, # Labels
labels = c("Capital city",
"Regional center",
"Another city",
"Village")),
h_urbrural = factor(h_urbrural, # Labels
labels = c("Urban",
"Rural"))) %>%
mutate_at(6:11, ~factor(.x,
labels = c("Very important",
"Rather important",
"Not very important",
"Not at all important")))
# Saving ----
export(andorra_clean, file = "MyData/andorra_clean.rds") # {rio}
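The `filter_all()`, `all_vars()` and `mutate_at()` functions used in the script above still work, but {dplyr} marks them as superseded in favour of `if_all()` and `across()`. The sketch below shows what the equivalent steps might look like on invented toy data. It assumes {dplyr} 1.0.4 or later and is not a drop-in replacement for the script.

```r
library(dplyr)  # assumed to be installed (version 1.0.4 or later for if_all())

# Toy data: negative codes stand for missing answers (invented values)
toy <- tibble(
  sex = c(1, 2, -1, 2),
  q1  = c(1, 4, 2, -2),
  q2  = c(2, 3, 1, 1)
)

toy_clean <-
  toy %>%
  # Keep rows where every column is >= 0 (same idea as filter_all(all_vars()))
  filter(if_all(everything(), ~ .x >= 0)) %>%
  # Turn a set of columns into labelled factors (same idea as mutate_at())
  mutate(across(q1:q2,
                ~ factor(.x,
                         levels = 1:4,
                         labels = c("Very important",
                                    "Rather important",
                                    "Not very important",
                                    "Not at all important"))))

print(toy_clean)
```

Giving `levels` explicitly is a design choice of this sketch: the labels then stay attached to the right codes even when one of the answers does not occur in the data.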
# Libraries ----
pacman::p_load(tidyverse, rio, janitor) # Should be installed before
# Data ----
andorra_raw <-
haven::read_dta("Data/WVS_Wave_7_Andorra_Stata_v5.0.dta") # {haven}
andorra_clean <- # Save the whole process here
andorra_raw %>%
clean_names() %>% # {janitor} standardize column names
select(sex = q260, # Select and rename
age = q262, # Select and rename
emancipative = resemaval, # Select and rename
starts_with("h_"), # Select all variables that starts with "h_"
q1:q6) %>% # Select from variable q1 to q6
filter_all(all_vars(. >= 0)) %>%
mutate_at(c(1,4:11), haven::as_factor) # Automatic factor
# Saving ----
export(andorra_clean, file = "MyData/andorra_clean.rds") # {rio}
Easier with haven::as_factor(), but we need to import the data with haven::read_dta() because of a compatibility problem!
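To see what `haven::as_factor()` does, here is a minimal sketch on a hand-made labelled vector, similar in spirit to the columns that `haven::read_dta()` produces. It assumes {haven} is installed, and the values and labels are invented.

```r
library(haven)  # assumed to be installed

# A labelled vector like those read_dta() creates (values invented)
sex <- labelled(c(1, 2, 2, 1), labels = c(Male = 1, Female = 2))

# as_factor() uses the value labels as factor levels
sex_factor <- as_factor(sex)
print(sex_factor)
```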
Run 1_cleaning.R with the command ctrl/cmd + alt + r to be sure that everything works fine.
Create 2_tables.R, 3_figures.R and 4_models.R and put them in the right folder.
Always start by loading the packages and the data. That way, the next time we open the project, we do not need to open the previous script to load the packages and the data, since every script is self-contained.
Many references: R graph gallery and D3.js graph gallery
# Libraries ----
pacman::p_load(tidyverse, rio, gtsummary)
# Data ----
andorra_clean <-
import("MyData/andorra_clean.rds") %>%
mutate_at(c(1,4:11), fct_drop)
# Work ----
# 1. Summary
# A simple summary table for all the variables
summary_table <-
andorra_clean %>%
tbl_summary()
# Save docx
summary_table %>%
as_gt() %>%
gt::gtsave(filename = "Results/summary_table.docx")
# Save data
export(summary_table, file = "MyData/summary_table.rds")
# 2. Emancipative: urban vs rural
# Test the difference in emancipative and variables related
# to the importance of value in life by urbanicity of region
# of residence.
emancipative_table <-
andorra_clean %>%
select(h_urbrural,
emancipative,
starts_with("q")) %>%
tbl_summary(by = h_urbrural) %>%
add_p()
# Save docx
emancipative_table %>%
as_gt() %>%
gt::gtsave(filename = "Results/emancipative_table.docx")
# Save data
export(emancipative_table, file = "MyData/emancipative_table.rds")
# Libraries ----
pacman::p_load(tidyverse, rio, gtsummary)
# Options ----
theme_set(theme_bw())
theme_update(legend.position = "top")
# Data ----
andorra_clean <- import("MyData/andorra_clean.rds")
# Visualization ----
# 1. Emancipative distribution
# Simple density plot for the emancipative part
emancipative_distribution <-
andorra_clean %>%
ggplot(aes(emancipative)) +
geom_density(fill = "cyan", alpha = 0.5) +
labs(title = "Emancipative index distribution")
# Save png
emancipative_distribution %>%
ggsave(plot = .,
filename = "Results/emancipative_distribution.png")
# Save ggplot
emancipative_distribution %>%
export(file = "MyData/emancipative_distribution.rds")
# 2. Emancipative/sex/urbrural
# Emancipative distribution by sex and urbanicity
# region of residence
emancipative_sex_urban <-
andorra_clean %>%
ggplot(aes(emancipative, fill = h_urbrural)) +
geom_density(alpha = 0.5) +
facet_wrap(~sex) +
labs(title = "Emancipative index distribution",
subtitle = "by sex and urbanicity of region of residence",
fill = "Urbanicity")
# Save png
emancipative_sex_urban %>%
ggsave(plot = .,
scale = 1.5,
filename = "Results/emancipative_sex_urban.png")
# Save ggplot
emancipative_sex_urban %>%
export(file = "MyData/emancipative_sex_urban.rds")
# Libraries ----
pacman::p_load(tidyverse, rio, sjPlot)
# Data ----
andorra_clean <-
import("MyData/andorra_clean.rds") %>%
mutate_at(c(1,4:11), fct_drop)
# Analysis ----
## Model ----
# A linear model testing the relationship between the emancipative
# index and predictors (sex, age, urbanicity of region of residence)
model <- lm(emancipative ~ sex + age + h_urbrural,
data = andorra_clean)
# Save data (model)
export(model, file = "MyData/model.rds")
# Regression table
# The model regression table (saved automatically)
tab_model(model,
collapse.ci = TRUE,
p.style = "stars",
file = "Results/regression_table.doc")
# Save data (regression table)
regression_table <-
tab_model(model,
collapse.ci = TRUE,
p.style = "stars")
export(regression_table, file = "MyData/regression_table.rds")
Now that every file is self-contained, we can run them as background jobs to save time.
But we still have a dependency tree problem!
graph LR; a[1_cleaning.R] --> b[2_tables.R] & c[3_figures.R] & d[4_models.R]
If 1_cleaning.R changes, we need to rerun all the files
The best tool that automatically handles which files to run is {targets}, but it is pretty advanced.
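Short of a dependency-tracking tool, a simple way to rerun everything in the right order is a small runner script. The sketch below uses base R only. `run_scripts()` is a hypothetical helper, and the two throwaway scripts written to a temporary folder stand in for the real ones.

```r
# Hypothetical helper: run scripts in order, each in its own environment
run_scripts <- function(scripts) {
  for (script in scripts) {
    message("Running ", script)
    # local = new.env() keeps objects from one script out of the next one
    source(script, local = new.env())
  }
  invisible(scripts)
}

# Demonstration with two throwaway scripts in a temporary folder
demo_dir <- file.path(normalizePath(tempdir(), winslash = "/"),
                      "pipeline_demo")
dir.create(demo_dir, showWarnings = FALSE)
cleaned_file <- file.path(demo_dir, "clean.rds")
table_file   <- file.path(demo_dir, "table.rds")

step_1 <- file.path(demo_dir, "1_cleaning.R")
step_2 <- file.path(demo_dir, "2_tables.R")
writeLines(sprintf('saveRDS(datasets::cars, "%s")', cleaned_file), step_1)
writeLines(sprintf('saveRDS(summary(readRDS("%s")), "%s")',
                   cleaned_file, table_file), step_2)

run_scripts(c(step_1, step_2))
```

In the project itself, the call would list the real files in dependency order, with the cleaning script first. Unlike a dependency-tracking tool, this reruns everything whether or not it changed.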
There are other tools to make your work better/efficient
{todor}: TODO, FIXME, etc. to keep track of your work
Help me help you with {reprex}
SPEED (data): {collapse}, {data.table}, {dtplyr}, {dbplyr}, {arrow}, {polars}, {tidypolars}
SPEED (computing): {future}, {furrr}, {Rcpp}, {JuliaCall}
Getting help with R: How do I know how it works?
Bookdown: free books online
{tidyverse}: The new coding standard in R
Awesome R: List of awesome packages/works/etc. in R
Metacran: All the available packages
One of the best tools for scientific publishing! Keep your code, results and writing together!
As a (data) scientist/researcher, you should save time when you need to publish. Quarto allows you to put your code, its output and your writing in a single file that can be exported to many formats.
Language agnostic (R, Python, Julia and JavaScript)
Highly portable (RStudio, VS Code, Jupyter, Neovim, etc.)
Versatile formats (articles, presentation, website, books, etc.)
Good defaults and highly customizable (template for journals)
Better interface in general (visual mode improved)
Better format handling!
Code completion in YAML and markdown
Easier to learn (everything is built in and features work across formats)
R Markdown documents are easily convertible (and so is your knowledge)
Quarto will continue to grow, R Markdown won’t!
As an R Markdown user, it is easy to start with Quarto (they are mostly the same!)
Now let’s add the results of our analysis to a notebook!
graph LR; a[1_cleaning.R] --> b[2_tables.R] & c[3_figures.R] & d[4_models.R] b & c & d --> e[report.qmd] e --> f[articles] & g[presentation] & h[website] & i[book]
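Because each script saved its results as .rds files, a code chunk in the report only has to read them back instead of recomputing them. The sketch below is one possible way to do that in base R. `read_result()` is a hypothetical helper, and a temporary folder stands in for the `MyData` folder.

```r
# Hypothetical helper: read a saved result back by name
read_result <- function(name, dir = "MyData") {
  readRDS(file.path(dir, paste0(name, ".rds")))
}

# Demonstration in a temporary folder standing in for "MyData"
demo_dir <- file.path(tempdir(), "MyData_demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(lm(dist ~ speed, data = datasets::cars),
        file.path(demo_dir, "model.rds"))

# Inside a code chunk of the report, this line would bring the model back
model <- read_result("model", dir = demo_dir)
print(coef(model))
```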
You can:
Discuss/ask questions
Try the whole workflow with your own projects
Leave
I can:
Show you optimization with functions (Do not repeat yourself)
Go in depth with some resources or subjects