Introduction: Welcome to the Tidyverse!
Imagine you’re a data wizard with a magic wand. The tidyverse is your spellbook, and each package is a different magical incantation you can use to transform and manipulate data! 🧙‍♂️✨
# Load the tidyverse - it's like summoning ALL your magic powers at once!
library(tidyverse)
The tidyverse is a collection of R packages that share a common philosophy and are designed to work together harmoniously. Think of it as the Avengers of data science - each package has its own superpower, but together they’re unstoppable!
The Core Tidyverse Packages
# The main tidyverse packages - your magical toolkit
library(dplyr) # Data manipulation - like having telekinesis for data!
library(ggplot2) # Data visualization - painting with data!
library(tidyr) # Data tidying - Marie Kondo for your datasets!
library(readr) # Data import - your portal to other data dimensions!
library(purrr) # Functional programming - clone yourself to do multiple tasks!
library(tibble) # Modern dataframes - your magical workbench!
library(stringr) # String manipulation - speak the language of text!
library(forcats) # Factor handling - taming wild categorical variables!
The Magic Pipe: %>%
The pipe operator %>% is like your magic wand: it lets you chain spells together in a logical sequence. It takes the output of one function and feeds it in as the first argument of the next function.
# Without the pipe - nested spells that are hard to read
round(mean(c(1, 2, 3, NA), na.rm = TRUE), digits = 2)
# With the pipe - a clear sequence of magical steps
c(1, 2, 3, NA) %>%
mean(na.rm = TRUE) %>%
round(digits = 2)
💡 Pro Tip: In RStudio, you can use the keyboard shortcut Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac) to insert the pipe.
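To make the wand-waving concrete, here is a minimal sketch showing that the piped spell and the nested spell produce exactly the same result. The vector name `scores` is only illustrative; it is not part of the tutorial's data.

```r
library(magrittr)  # provides %>%; library(tidyverse) makes it available too

scores <- c(1, 2, 3, NA)  # illustrative data

# Nested form: read from the inside out
nested <- round(mean(scores, na.rm = TRUE), digits = 2)

# Piped form: read from top to bottom; each result becomes the
# first argument of the next call
piped <- scores %>%
  mean(na.rm = TRUE) %>%
  round(digits = 2)

identical(nested, piped)  # TRUE
```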
Package Name Prefixes
When casting spells, sometimes you need to be specific about which spellbook you’re using:
# Different packages may have functions with the same name
stats::filter() # Time series filtering
dplyr::filter() # Row filtering for dataframes
# Each tidyverse package has consistent prefixes
readr::read_csv() # Reading CSV files
readr::write_csv() # Writing CSV files
stringr::str_detect() # String detection
forcats::fct_relevel() # Factor releveling
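Here is a small sketch of why the prefix matters, using only built-in data (`mtcars`) and illustrative object names. Once dplyr is attached, a bare `filter()` means dplyr's version, so the stats one has to be called by its full name.

```r
library(dplyr)  # attaching dplyr masks stats::filter()

# stats::filter() applies a linear filter to a numeric series;
# here, a centered 3-point moving average (the two ends come back as NA)
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)

as.numeric(smoothed)
nrow(four_cyl)  # 11
```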
Exercise 1: Your First Tidyverse Spell
Let’s start with a simple spell - creating and exploring a tibble:
Use tibble() to create a magical data table, then try exploring it with glimpse(). It’s like having X-ray vision for your data!
# Load the tidyverse spellbook
library(tidyverse)
# Create a magical creature dataset
magical_creatures <- tibble(
creature = c("Dragon", "Unicorn", "Phoenix", "Griffin", "Mermaid"),
magic_power = c(95, 80, 90, 75, 60),
habitat = c("Mountains", "Forest", "Volcano", "Sky", "Ocean"),
lifespan = c(1000, 500, 1500, 300, 200)
)
# Look at our magical dataset
magical_creatures
# Use the glimpse spell to see through its structure
glimpse(magical_creatures)
# Check the data type - it's a tibble, not a plain dataframe!
class(magical_creatures)
# Use the pipe to chain operations
magical_creatures %>%
filter(magic_power > 70) %>%
arrange(desc(lifespan))
Tibbles: The Modern Data Workbench
Tibbles are modern reimagined dataframes - they’re like regular dataframes but with superpowers! They don’t change variable names or types, they don’t create row names, and they make printing large datasets much more pleasant.
Why Use Tibbles?
Regular dataframes have some quirks that tibbles fix. Tibbles:
- never convert strings to factors (base R only stopped doing this by default in version 4.0)
- don’t mangle variable names
- print only the first 10 rows and the columns that fit on screen
- have consistent subsetting behavior
- give you better error messages
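Two of these quirks are easy to see side by side. The sketch below uses made-up objects (`df`, `tb`) and is only meant to illustrate name handling and partial matching, not to cover every difference.

```r
library(tibble)

# data.frame() rewrites non-syntactic names (unless check.names = FALSE)
df <- data.frame(`spell name` = c("Fireball", "Ice Lance"), power = c(80, 65))
tb <- tibble(`spell name` = c("Fireball", "Ice Lance"), power = c(80, 65))

names(df)  # "spell.name" "power"
names(tb)  # "spell name" "power"

# $ on a data frame silently does partial matching; a tibble does not
df$pow                    # 80 65 (matched "power")
suppressWarnings(tb$pow)  # NULL (the tibble warns about the unknown column)
```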
# Create a tibble from scratch - building your workbench!
wizards <- tibble(
name = c("Gandalf", "Dumbledore", "Merlin", "Dr. Strange"),
specialty = c("Fireworks", "Transfiguration", "Time Magic", "Reality Warping"),
power_level = c(95, 90, 99, 85)
)
# Convert existing dataframe to tibble - upgrade your workbench!
data(mtcars)
mtcars_tibble <- as_tibble(mtcars, rownames = "car_model")
# Creating a tibble row-by-row (like SAS CARDS/DATALINES)
spells <- tribble(
~spell_name, ~power, ~element, ~casting_time,
"Fireball", 80, "Fire", 3,
"Ice Lance", 65, "Water", 1,
"Earthquake", 90, "Earth", 5,
"Lightning Bolt", 75, "Air", 2
)
spells
Tibble Subsetting
Tibbles maintain consistent output types, which helps prevent errors in your code:
# Single bracket [ ] always returns a tibble
wizards["name"] # Still a tibble with 1 column
wizards[1:2, "name"] # Still a tibble with 1 column
# Double bracket [[ ]] or $ extracts a single column as a vector
wizards[["name"]] # Character vector
wizards$name # Character vector
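For contrast, here is a minimal sketch of the inconsistency that tibbles avoid. The objects `wizards_df` and `wizards_tb` are illustrative stand-ins, not the `wizards` tibble created earlier.

```r
library(tibble)

wizards_df <- data.frame(name = c("Gandalf", "Merlin"), power_level = c(95, 99))
wizards_tb <- as_tibble(wizards_df)

# A data frame drops to a plain vector when [ , ] selects a single column
class(wizards_df[, "name"])  # "character"

# A tibble stays a tibble, whatever the number of columns
class(wizards_tb[, "name"])  # "tbl_df" "tbl" "data.frame"

# To get the vector out of a tibble, ask for it explicitly
wizards_tb[["name"]]
```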
Exercise 2: Tibble Transformation
Transform this plain old dataframe into a shiny new tibble:
Use as_tibble() to convert a dataframe to a tibble. For extra magic, use rownames_to_column() to preserve row names!
# Load the tidyverse
library(tidyverse)
# Create a regular dataframe - the old rusty workbench
data(iris)
head(iris)
# Convert to tibble with as_tibble()
iris_tibble <- as_tibble(iris)
iris_tibble
# Another way - if your dataframe has rownames you want to keep
data(mtcars)
mtcars_tibble <- as_tibble(mtcars, rownames = "car_model")
# Or using rownames_to_column()
mtcars_tibble2 <- mtcars %>%
rownames_to_column("car_model") %>%
as_tibble()
# Print to see the difference
mtcars_tibble
# Create a tibble from scratch with tribble
potion_ingredients <- tribble(
~potion, ~ingredient, ~amount, ~unit,
"Health Potion", "Red Mushroom", 3, "pieces",
"Health Potion", "Spring Water", 100, "ml",
"Mana Potion", "Blue Flower", 2, "pieces",
"Mana Potion", "Moon Water", 100, "ml",
"Strength Potion","Dragon Scale", 1, "piece",
"Strength Potion","Volcano Ash", 50, "g"
)
potion_ingredients
Data Import & Export: Opening Portals to Other Dimensions
The tidyverse makes it super easy to import and export data from various file formats. It’s like having a magical portal that connects to many different data dimensions!
Reading Data with readr
The readr package provides a fast and friendly way to read rectangular data files:
# Reading data - opening a portal!
# CSV files
my_data <- read_csv("data.csv")
# TSV files
my_tsv_data <- read_tsv("data.tsv")
# Fixed width files
my_fixed_data <- read_fwf("data.txt",
col_positions = fwf_widths(c(10, 5, 8)))
# Delimited files with any delimiter
my_delim_data <- read_delim("data.txt", delim = "|")
Controlling Column Types
You can specify the types of columns you’re reading to ensure your data comes through the portal correctly:
# Specify column types
potions_data <- read_csv("potions.csv",
col_types = cols(
name = col_character(),
power = col_double(),
ingredients = col_integer(),
is_legendary = col_logical(),
discovery_date = col_date(format = "%Y-%m-%d")
)
)
# Preview the guessed column specification without importing the data
spec_csv("potions.csv")
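If you want to try column types without a file on disk, here is a self-contained sketch. It assumes readr 2.0 or later, where wrapping a string in `I()` tells `read_csv()` to treat it as literal data; the tiny dataset is made up for illustration.

```r
library(readr)

# I() marks the string as literal CSV data rather than a file path
csv_text <- I("name,power,is_legendary\nElixir,7.5,TRUE\nTonic,3,FALSE\n")

# Compact specification, one letter per column:
# c = character, d = double, l = logical
potions <- read_csv(csv_text, col_types = "cdl")

sapply(potions, class)
```

The compact string is a shorthand for the longer cols(...) form shown above; the long form is easier to read when there are many columns.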
Writing Data
Sending your magical creations to other dimensions is just as easy:
# Writing data
write_csv(my_data, "new_data.csv")
write_tsv(my_data, "new_data.tsv")
write_delim(my_data, "new_data.txt", delim = "|")
# Save R objects
saveRDS(my_data, "my_data.rds")
Other File Formats
The tidyverse ecosystem can also connect with other magical realms:
# Excel files (requires readxl package)
library(readxl)
excel_data <- read_excel("spellbook.xlsx", sheet = "Potions")
# Writing Excel files (requires writexl package)
library(writexl)
write_xlsx(my_data, "spellbook.xlsx")
# SAS files (requires haven package)
library(haven)
sas_data <- read_sas("wizard_data.sas7bdat")
Exercise 3: Data Portal Mastery
Practice your portal creation skills by importing and exporting data:
Use read_csv() to import CSV data and write_csv() to export it. Don’t forget to peek at your data with head() or glimpse()!
# Load the tidyverse
library(tidyverse)
# Create some sample data to export
potion_recipes <- tibble(
potion_name = c("Invisibility", "Strength", "Healing", "Flying", "Wisdom"),
primary_ingredient = c("Ghost Orchid", "Dragon Scale", "Phoenix Tear", "Eagle Feather", "Ancient Scroll"),
brewing_time_hours = c(12, 3, 8, 24, 72),
potency = c(8, 7, 10, 6, 9)
)
# Export our potion recipes to CSV
write_csv(potion_recipes, "potion_recipes.csv")
# Now import it back
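imported_potions <- read_csv("potion_recipes.csv")
# Let's check if our portal worked correctly
# (identical() would report FALSE here because read_csv() attaches extra
# attributes, such as the column specification, so compare the values instead)
all.equal(potion_recipes, imported_potions, check.attributes = FALSE)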
# Take a peek at our imported data
glimpse(imported_potions)
# Create a custom column specification
my_col_types <- cols(
potion_name = col_character(),
primary_ingredient = col_character(),
brewing_time_hours = col_integer(),
potency = col_double()
)
# Import with specification
imported_potions_spec <- read_csv("potion_recipes.csv", col_types = my_col_types)
glimpse(imported_potions_spec)
Subsetting and Sorting: Finding What You Need
Subsetting and sorting data is like having a magical filter and organizer for your data. With just a few spell words, you can find exactly what you need!
Filtering Rows with filter()
filter() allows you to select rows based on their values - it’s like having a magic sieve that only lets through the data you want!
# Load a dataset to play with
data(starwars, package = "dplyr")
starwars_tibble <- as_tibble(starwars)
# Filter for humans only - separating humans from aliens!
humans <- starwars_tibble %>%
filter(species == "Human")
# Multiple conditions - finding very tall droids!
tall_droids <- starwars_tibble %>%
filter(species == "Droid", height > 100)
# More complex conditions with logical operators
powerful_humans <- starwars_tibble %>%
filter(species == "Human" & (mass > 80 | height > 180))
# Excluding values
non_droids <- starwars_tibble %>%
filter(species != "Droid")
# Checking for multiple values
tatooine_naboo <- starwars_tibble %>%
filter(homeworld %in% c("Tatooine", "Naboo"))
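One thing the magic sieve does quietly: rows where the condition evaluates to NA are dropped, just like rows where it is FALSE. The sketch below uses a tiny made-up tibble (`tiny_creatures`) to show this and one way to keep the unknowns.

```r
library(dplyr)

tiny_creatures <- tibble(
  name = c("Dragon", "Unicorn", "Ghost"),
  height = c(300, 180, NA)
)

# NA > 200 is NA, so the Ghost row is dropped along with the Unicorn
tall_only <- tiny_creatures %>%
  filter(height > 200)

# Keep rows with an unknown height by asking for them explicitly
tall_or_unknown <- tiny_creatures %>%
  filter(height > 200 | is.na(height))

tall_only$name        # "Dragon"
tall_or_unknown$name  # "Dragon" "Ghost"
```

The same applies to the != example above: characters whose species is missing are removed as well.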
Slicing Rows
Sometimes you want to select rows by position rather than by values:
# Get the first 5 rows
starwars_tibble %>% slice(1:5)
# Get specific rows
starwars_tibble %>% slice(c(1, 3, 5))
# Get the last 5 rows
starwars_tibble %>% slice_tail(n = 5)
# Get 3 random rows
starwars_tibble %>% slice_sample(n = 3)
# Get 10% of the rows randomly
starwars_tibble %>% slice_sample(prop = 0.1)
# Get the 3 tallest characters
starwars_tibble %>% slice_max(height, n = 3)
# Get the 3 lightest characters with known mass
# (the argument is na_rm, not na.rm; it requires dplyr >= 1.1.0)
starwars_tibble %>% slice_min(mass, n = 3, na_rm = TRUE)
Selecting Columns with select()
select() lets you focus on just the variables you need - it’s like having a magical lens that only shows you what’s important!
# Select only certain columns - focusing your magical lens!
names_heights <- starwars_tibble %>%
select(name, height, mass)
# Remove columns - banishing unwanted information!
no_homeworld <- starwars_tibble %>%
select(-homeworld, -species)
# Select columns by position
first_three <- starwars_tibble %>%
select(1:3)
# Use helper functions to select columns matching patterns
measurements <- starwars_tibble %>%
select(starts_with("h"), contains("mass"))
# Select columns by data type
numeric_cols <- starwars_tibble %>%
select(where(is.numeric))
# Rename columns while selecting
renamed <- starwars_tibble %>%
select(character_name = name, height, weight = mass)
Selection Helpers
There are many helper functions that make selecting variables easier:
# Different ways to select variables
starwars_tibble %>% select(starts_with("h")) # Starts with "h"
starwars_tibble %>% select(ends_with("s")) # Ends with "s"
starwars_tibble %>% select(contains("o")) # Contains "o"
starwars_tibble %>% select(matches("..r.")) # Matches regex pattern
starwars_tibble %>% select(everything()) # All columns
starwars_tibble %>% select(last_col()) # Last column
Arranging Rows with arrange()
arrange() allows you to reorder your rows based on the values of selected columns:
# Sort by height - from shortest to tallest!
by_height <- starwars_tibble %>%
arrange(height)
# Sort by descending mass - heaviest first!
by_mass_desc <- starwars_tibble %>%
arrange(desc(mass))
# Multiple sort criteria - sort by species, then by height within species
by_species_height <- starwars_tibble %>%
arrange(species, height)
# Sort by species descending, then by height ascending
complex_sort <- starwars_tibble %>%
arrange(desc(species), height)
Renaming and Relocating Columns
Tidyverse also provides tools to rename or reposition your variables:
# Rename columns
starwars_tibble %>%
rename(character = name, weight = mass)
# Rename using a function (convert to uppercase)
starwars_tibble %>%
rename_with(toupper)
# Rename only some columns
starwars_tibble %>%
rename_with(toupper, starts_with("h"))
# Move columns to different positions
starwars_tibble %>%
relocate(species, homeworld, .before = name)
starwars_tibble %>%
relocate(name, species, .after = last_col())
Exercise 4: The Magic of Subsetting
Use your magical powers to find and sort specific creatures:
Use filter() to find rows meeting certain conditions, select() to choose columns, and arrange() to sort. Combine them with the magical %>% pipe!
# Load the tidyverse
library(tidyverse)
# We'll use the built-in starwars dataset
data(starwars, package = "dplyr")
starwars_tibble <- as_tibble(starwars)
# Find all characters taller than 200 cm
giants <- starwars_tibble %>%
filter(height > 200)
giants
# Select the name, species, height, and mass of characters from Tatooine
tatooine_chars <- starwars_tibble %>%
filter(homeworld == "Tatooine") %>%
select(name, species, height, mass)
tatooine_chars
# Find the 5 heaviest characters with known mass
heaviest_chars <- starwars_tibble %>%
filter(!is.na(mass)) %>%
arrange(desc(mass)) %>%
slice(1:5)
heaviest_chars
# Find all humans and sort them by height (tallest first)
sorted_humans <- starwars_tibble %>%
filter(species == "Human") %>%
arrange(desc(height))
sorted_humans
# Find characters from the same homeworld as Luke Skywalker
luke_homeworld <- starwars_tibble %>%
filter(name == "Luke Skywalker") %>%
pull(homeworld)
luke_neighbors <- starwars_tibble %>%
filter(homeworld == luke_homeworld) %>%
select(name, species, height) %>%
arrange(species, desc(height))
luke_neighbors
# Complex pipeline combining multiple operations
starwars_analysis <- starwars_tibble %>%
# Keep only characters with complete height and mass data
filter(!is.na(height), !is.na(mass)) %>%
# Calculate BMI
mutate(bmi = mass / ((height / 100)^2)) %>%
# Select relevant columns
select(name, species, gender, height, mass, bmi) %>%
# Sort by BMI
arrange(desc(bmi)) %>%
# Take top 10
slice_head(n = 10)
starwars_analysis
Creating Variables: Brewing New Data Potions
Sometimes you need to create new variables based on existing ones. This is like brewing a new potion by combining ingredients you already have!
Transforming Variables with mutate()
mutate() lets you create new variables while preserving existing ones - it’s like adding new magical properties to your potion without changing its base ingredients!
# Add a new column - brewing a new data potion!
starwars_bmi <- starwars_tibble %>%
filter(!is.na(height), !is.na(mass)) %>%
mutate(bmi = mass / ((height / 100)^2))
# Create multiple columns at once - advanced potion brewing!
starwars_stats <- starwars_tibble %>%
mutate(
height_m = height / 100,
height_ft = height / 30.48,
heavy = mass > 100
)
Conditional Transformations
You can create variables with values that depend on conditions:
# Simple if-else condition
starwars_tibble %>%
mutate(size_category = if_else(height > 180, "Tall", "Short", missing = "Unknown"))
# Multiple conditions with case_when
starwars_tibble %>%
mutate(
size_category = case_when(
is.na(height) ~ "Unknown",
height > 200 ~ "Very Tall",
height > 180 ~ "Tall",
height > 160 ~ "Average",
TRUE ~ "Short"
)
)
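The order of the conditions in case_when() matters: they are checked top to bottom and the first one that is TRUE wins. A minimal sketch, using an illustrative vector `heights` rather than the starwars data:

```r
library(dplyr)

heights <- c(210, 190, 150, NA)

# Most specific test first, so every category is reachable
right_order <- case_when(
  is.na(heights) ~ "Unknown",
  heights > 200 ~ "Very Tall",
  heights > 180 ~ "Tall",
  TRUE ~ "Short"
)

# With the broader test first, "Very Tall" can never be reached,
# and the NA falls through to the catch-all
wrong_order <- case_when(
  heights > 180 ~ "Tall",
  heights > 200 ~ "Very Tall",
  TRUE ~ "Short"
)

right_order  # "Very Tall" "Tall" "Short" "Unknown"
wrong_order  # "Tall" "Tall" "Short" "Short"
```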
Working Across Multiple Columns
Apply the same transformation to multiple columns at once:
# Apply the same function to multiple columns
starwars_tibble %>%
mutate(across(c(height, mass), ~ . / mean(., na.rm = TRUE)))
# Apply one function to every numeric column
starwars_tibble %>%
mutate(across(where(is.numeric), ~ round(., 1)))
# Apply multiple functions to the same columns
starwars_tibble %>%
mutate(across(
c(height, mass),
list(
centered = ~ . - mean(., na.rm = TRUE),
scaled = ~ . / sd(., na.rm = TRUE)
)
))
Replacing or Creating New Data Frames
Sometimes you want to keep only the variables you create and drop everything else:
# Keep only the listed variables with transmute()
# (superseded in dplyr 1.1.0 by mutate(.keep = "none"), but it still works)
starwars_tibble %>%
transmute(
name,
height_in_meters = height / 100,
weight_in_pounds = mass * 2.2
)
Special Transformation Functions
The tidyverse provides many functions for common transformations:
# Ranking
starwars_tibble %>%
mutate(
height_rank = min_rank(height),
height_dense_rank = dense_rank(height),
height_percent_rank = percent_rank(height)
)
# Offset values
starwars_tibble %>%
mutate(
next_mass = lead(mass),
prev_mass = lag(mass)
)
# Cumulative calculations (mass contains NAs, so results are NA from the first missing value onward)
starwars_tibble %>%
mutate(
cumulative_mass = cumsum(mass),
running_avg = cummean(mass)
)
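These helpers are easiest to understand on a tiny vector with a tie in it. The sketch below uses a made-up vector `power`; the comments show what each function returns for it.

```r
library(dplyr)

power <- c(10, 20, 20, 30)

# Ranking: the three functions differ in how they treat ties and gaps
min_rank(power)      # 1 2 2 4  (ties share the lowest rank, then a gap)
dense_rank(power)    # 1 2 2 3  (ties share a rank, no gap)
percent_rank(power)  # 0, 1/3, 1/3, 1  (min_rank rescaled to [0, 1])

# Offsets: shift values by one position, padding with NA
lag(power)   # NA 10 20 20
lead(power)  # 20 20 30 NA

# Cumulative functions propagate missing values
cumsum(c(1, NA, 3))  # 1 NA NA
```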
Exercise 5: Potion Brewing with mutate()
Brew some new variables from existing data:
Use mutate() to create new columns based on existing ones. You can create as many new columns as you want in a single mutate() spell!
# Load the tidyverse
library(tidyverse)
# Let's create a magical creatures dataset
creatures <- tibble(
name = c("Dragon", "Griffin", "Phoenix", "Unicorn", "Basilisk"),
age = c(250, 75, 500, 150, 200),
max_age = c(1000, 300, 2000, 500, 800),
weight_kg = c(2500, 450, 15, 350, 800),
magical_power = c(95, 75, 90, 80, 85)
)
# Now let's brew some new potions... I mean variables!
creatures_enhanced <- creatures %>%
mutate(
# Calculate age as percentage of maximum lifespan
age_percentage = (age / max_age) * 100,
# Classify creatures as ancient (over 50% of lifespan) or young
age_category = if_else(age_percentage > 50, "Ancient", "Young"),
# Calculate power-to-weight ratio (magical efficiency)
power_efficiency = magical_power / weight_kg * 100,
# Create a magical threat level
threat_level = case_when(
magical_power > 90 & weight_kg > 1000 ~ "Extreme",
magical_power > 80 | weight_kg > 500 ~ "High",
magical_power > 70 ~ "Moderate",
TRUE ~ "Low"
),
# Power rank compared to other creatures
power_rank = min_rank(desc(magical_power)),
# Normalized power (percentage of max)
power_normalized = magical_power / max(magical_power) * 100,
# Estimated years left to live
years_remaining = max_age - age,
# Calculate a weighted magical score
magical_score = (magical_power * 0.6) + (power_efficiency * 0.4)
)
# Let's see our enhanced creatures dataset!
creatures_enhanced %>%
arrange(power_rank)
Summaries: Distilling Magical Essences
Summarizing data is like distilling the essence of your dataset down to its most powerful components. It reveals the hidden patterns and secrets!
Summarizing with summarize()
summarize() (or summarise(), if you prefer British spelling) reduces your dataset to a single row of summary statistics:
# Calculate basic summaries - distilling the essence!
height_summary <- starwars_tibble %>%
summarize(
avg_height = mean(height, na.rm = TRUE),
max_height = max(height, na.rm = TRUE),
min_height = min(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
n_characters = n(),
n_with_height = sum(!is.na(height))
)
# Counting values - counting magical artifacts!
species_count <- starwars_tibble %>%
count(species, sort = TRUE)
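count() is roughly a shortcut for grouping, counting with n(), and sorting. The sketch below spells out both versions on the built-in mtcars data so it runs on its own; the object names are illustrative.

```r
library(dplyr)

# Shortcut: count rows per cylinder class, largest group first
quick <- mtcars %>%
  count(cyl, sort = TRUE)

# Long form: group, count with n(), then sort
long_form <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

quick$n                          # 14 11 7
identical(quick$n, long_form$n)  # TRUE
```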
Common Summary Functions
Here are some useful functions for creating summaries:
# Statistical functions
starwars_tibble %>%
summarize(
mean_height = mean(height, na.rm = TRUE),
median_height = median(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
var_height = var(height, na.rm = TRUE),
min_height = min(height, na.rm = TRUE),
max_height = max(height, na.rm = TRUE),
q25_height = quantile(height, 0.25, na.rm = TRUE),
q75_height = quantile(height, 0.75, na.rm = TRUE)
)
# Counting functions
starwars_tibble %>%
summarize(
n_rows = n(),
n_species = n_distinct(species),
n_homeworlds = n_distinct(homeworld)
)
# First, last, and nth values
starwars_tibble %>%
summarize(
first_character = first(name),
last_character = last(name),
tenth_character = nth(name, 10)
)
Summarizing Multiple Columns
You can summarize multiple columns at once using across():
# Apply the same summary function to multiple columns
# (passing na.rm through across()'s ... is deprecated since dplyr 1.1.0, so use a lambda)
starwars_tibble %>%
summarize(across(c(height, mass), ~ mean(., na.rm = TRUE)))
# Apply different summary functions to different columns
starwars_tibble %>%
summarize(
across(c(height, mass), list(avg = ~ mean(., na.rm = TRUE), med = ~ median(., na.rm = TRUE))),
across(species, list(n = n_distinct))
)
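When across() gets a named list of functions, the output columns are named from the column and the function. Here is a small sketch on a made-up tibble (`tiny_stats_data`) that also shows the `.names` argument for controlling that naming.

```r
library(dplyr)

tiny_stats_data <- tibble(
  height = c(100, 200, NA),
  mass = c(10, 30, 50)
)

stats <- tiny_stats_data %>%
  summarize(across(
    c(height, mass),
    list(avg = ~ mean(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE)),
    .names = "{.fn}_{.col}"  # function name first, then the column name
  ))

names(stats)  # "avg_height" "max_height" "avg_mass" "max_mass"
```

Without `.names`, the default pattern puts the column first, giving names such as height_avg.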
Exercise 6: The Art of Summary Magic
Practice your summarizing skills on this dataset:
Use summarize() to calculate statistics across the entire dataset, or group_by() then summarize() to get statistics for each group. The count() spell is great for quick frequency tables!
# Load the tidyverse
library(tidyverse)
# Let's work with the mpg dataset that ships with ggplot2
data(mpg)
mpg_tibble <- as_tibble(mpg)
# Overall summary statistics for continuous variables
overall_summary <- mpg_tibble %>%
summarize(
avg_mpg = mean(hwy),
max_mpg = max(hwy),
min_mpg = min(hwy),
median_mpg = median(hwy),
sd_mpg = sd(hwy),
total_cars = n(),
efficiency_ratio = mean(hwy) / mean(cty)
)
overall_summary
# Count the number of cars by manufacturer
manufacturer_counts <- mpg_tibble %>%
count(manufacturer, sort = TRUE)
manufacturer_counts
# Group by class and find average mpg
class_mpg <- mpg_tibble %>%
group_by(class) %>%
summarize(
avg_city_mpg = mean(cty),
avg_hwy_mpg = mean(hwy),
mpg_difference = mean(hwy - cty),
car_count = n(),
manufacturers = n_distinct(manufacturer)
) %>%
arrange(desc(avg_hwy_mpg))
class_mpg
# Find the most fuel-efficient car in each class
best_in_class <- mpg_tibble %>%
group_by(class) %>%
slice_max(order_by = hwy, n = 1) %>%
select(class, manufacturer, model, hwy) %>%
arrange(desc(hwy))
best_in_class
# Create a comprehensive efficiency report by manufacturer
manufacturer_report <- mpg_tibble %>%
group_by(manufacturer) %>%
summarize(
models = n_distinct(model),
avg_city = mean(cty),
avg_hwy = mean(hwy),
best_hwy = max(hwy),
worst_hwy = min(hwy),
range = max(hwy) - min(hwy),
total_cars = n()
) %>%
# Only include manufacturers with at least 3 cars
filter(total_cars >= 3) %>%
# Sort by average highway MPG
arrange(desc(avg_hwy))
manufacturer_report
Group Operations: Organizing Your Magical Creatures
Grouping allows you to perform operations on subsets of your data. It’s like organizing your magical creatures by species before studying them!
Grouping with group_by()
group_by() transforms your data frame into a grouped data frame, where operations are performed “by group”:
# Group by species and find average height/mass
species_stats <- starwars_tibble %>%
group_by(species) %>%
summarize(
count = n(),
avg_height = mean(height, na.rm = TRUE),
avg_mass = mean(mass, na.rm = TRUE)
) %>%
filter(count > 1) # Only include species with more than 1 character
# Find max height by gender and homeworld
max_heights <- starwars_tibble %>%
group_by(homeworld, gender) %>%
summarize(
tallest = max(height, na.rm = TRUE),
n = n()
) %>%
# max() returns -Inf (with a warning) for groups where every height is NA,
# so test with is.finite() rather than !is.na()
filter(is.finite(tallest), !is.na(homeworld))
Grouping by Multiple Variables
You can group by multiple variables to create nested groups:
# Group by species and gender
starwars_tibble %>%
group_by(species, gender) %>%
summarize(
count = n(),
avg_height = mean(height, na.rm = TRUE)
)
# Getting the number of groups
starwars_tibble %>%
group_by(species, gender) %>%
summarize(count = n()) %>%
nrow()
# Getting information about the groups
starwars_groups <- starwars_tibble %>% group_by(species, gender)
group_keys(starwars_groups)
n_groups(starwars_groups)
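A detail that often surprises new wizards: after grouping by several variables, summarize() only peels off the last grouping level, so the result is still grouped. The sketch below uses a tiny made-up tibble (`tiny_sales`) to show the default and the `.groups = "drop"` alternative.

```r
library(dplyr)

tiny_sales <- tibble(
  merchant = c("A", "A", "B", "B"),
  potion = c("Healing", "Love", "Healing", "Healing"),
  price = c(10, 5, 12, 14)
)

# Default: the last grouping variable (potion) is dropped, so the result
# is still grouped by merchant (dplyr prints a message saying so)
by_default <- tiny_sales %>%
  group_by(merchant, potion) %>%
  summarize(avg_price = mean(price))

group_vars(by_default)  # "merchant"

# .groups = "drop" returns an ungrouped tibble and silences the message
dropped <- tiny_sales %>%
  group_by(merchant, potion) %>%
  summarize(avg_price = mean(price), .groups = "drop")

group_vars(dropped)  # character(0)
```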
Group Mutations
You can use group_by() with mutate() to compute values within each group:
# Calculate z-scores within species groups
starwars_tibble %>%
group_by(species) %>%
filter(n() > 1) %>% # Only species with multiple members
mutate(
height_avg = mean(height, na.rm = TRUE),
height_sd = sd(height, na.rm = TRUE),
height_z = (height - height_avg) / height_sd
) %>%
select(name, species, height, height_avg, height_z) %>%
arrange(species, desc(height_z))
# Rank heights within each species
starwars_tibble %>%
group_by(species) %>%
filter(n() > 1) %>%
mutate(height_rank = min_rank(desc(height))) %>%
select(name, species, height, height_rank) %>%
arrange(species, height_rank)
Managing Groups
You can add or remove grouping variables:
# Add a grouping variable
starwars_tibble %>%
group_by(species) %>%
group_by(gender, .add = TRUE) # Keep species grouping and add gender
# Remove all grouping
starwars_tibble %>%
group_by(species, gender) %>%
ungroup()
Row-wise Operations
For operations across rows (rather than down columns), use rowwise():
# Calculate the sum of height and mass for each character
starwars_tibble %>%
rowwise() %>%
mutate(
height_plus_mass = sum(c(height, mass), na.rm = TRUE)
)
# Find the maximum value across several columns
starwars_tibble %>%
rowwise() %>%
mutate(
max_value = max(c(height, mass), na.rm = TRUE)
)
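Here is a small self-contained sketch of a row-wise maximum, on a made-up tibble (`duel_scores`). It also shows `pmax()`, a vectorized base R function that gives the same answer for this particular case without rowwise(); which one to use is a judgment call, not a rule.

```r
library(dplyr)

duel_scores <- tibble(
  name = c("Smaug", "Fawkes"),
  fire = c(9, 7),
  flight = c(6, 10)
)

# rowwise(): each row is treated as its own group, so max() sees
# one row's values at a time; c_across() gathers them into a vector
row_result <- duel_scores %>%
  rowwise() %>%
  mutate(best = max(c_across(c(fire, flight)))) %>%
  ungroup()

# Vectorized alternative for this specific case
vec_result <- duel_scores %>%
  mutate(best = pmax(fire, flight))

row_result$best                              # 9 10
identical(row_result$best, vec_result$best)  # TRUE
```

Remember to ungroup() after a rowwise() step, otherwise later verbs keep working one row at a time.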
Exercise 7: The Power of Grouping
Use grouping to analyze this dataset of magical creatures:
Use group_by() followed by summarize() to calculate statistics for each group. Try grouping by multiple variables to dig deeper!
# Load the tidyverse
library(tidyverse)
# Create a dataset of potions sold at a magical marketplace
set.seed(42) # Arbitrary seed so the random prices, quantities, and ratings are reproducible
potions_sales <- tibble(
potion_type = rep(c("Healing", "Strength", "Invisibility", "Love", "Wisdom"), each = 20),
merchant = rep(c("Elixir Emporium", "Witch's Brew", "Magical Mixtures", "Cauldron Creations"), times = 25),
price = c(
# Healing potions prices
runif(20, 10, 20),
# Strength potions prices
runif(20, 15, 30),
# Invisibility potions prices
runif(20, 25, 50),
# Love potions prices
runif(20, 5, 15),
# Wisdom potions prices
runif(20, 20, 40)
),
quantity_sold = sample(1:10, 100, replace = TRUE),
customer_rating = sample(1:5, 100, replace = TRUE, prob = c(0.05, 0.1, 0.2, 0.4, 0.25))
)
# Calculate average price by potion type
avg_prices <- potions_sales %>%
group_by(potion_type) %>%
summarize(
avg_price = mean(price),
median_price = median(price),
min_price = min(price),
max_price = max(price),
price_range = max_price - min_price,
total_sold = sum(quantity_sold),
avg_rating = mean(customer_rating)
) %>%
arrange(desc(avg_price))
avg_prices
# Find total revenue by merchant and potion type
merchant_revenue <- potions_sales %>%
mutate(revenue = price * quantity_sold) %>%
group_by(merchant, potion_type) %>%
summarize(
total_revenue = sum(revenue),
avg_price = mean(price),
total_sold = sum(quantity_sold),
avg_rating = mean(customer_rating)
) %>%
arrange(merchant, desc(total_revenue))
merchant_revenue
# Find the most profitable potion type for each merchant
best_potions <- potions_sales %>%
mutate(revenue = price * quantity_sold) %>%
group_by(merchant, potion_type) %>%
summarize(total_revenue = sum(revenue)) %>%
ungroup() %>%
group_by(merchant) %>%
slice_max(order_by = total_revenue, n = 1)
best_potions
# Calculate the average rating for each merchant and how it compares to overall average
rating_analysis <- potions_sales %>%
group_by(merchant) %>%
summarize(
avg_rating = mean(customer_rating),
total_ratings = n()
) %>%
ungroup() %>%
mutate(
overall_avg = mean(potions_sales$customer_rating),
rating_difference = avg_rating - overall_avg,
performance = case_when(
rating_difference > 0.5 ~ "Excellent",
rating_difference > 0 ~ "Above Average",
rating_difference > -0.5 ~ "Average",
TRUE ~ "Below Average"
)
) %>%
arrange(desc(avg_rating))
rating_analysis
# Advanced analysis: Find which merchants are specialized in certain potions
specialization_analysis <- potions_sales %>%
group_by(merchant, potion_type) %>%
summarize(
potion_count = n(),
potion_revenue = sum(price * quantity_sold)
) %>%
group_by(merchant) %>%
mutate(
total_potions = sum(potion_count),
total_revenue = sum(potion_revenue),
potion_percent = potion_count / total_potions * 100,
revenue_percent = potion_revenue / total_revenue * 100,
is_specialized = potion_percent > 30 | revenue_percent > 40
) %>%
filter(is_specialized) %>%
select(merchant, potion_type, potion_percent, revenue_percent) %>%
arrange(merchant, desc(revenue_percent))
specialization_analysis
Capstone Project: The Ultimate Tidyverse Spell
Now it’s time to combine all your tidyverse skills into one magnificent spell! Create a comprehensive analysis of magical creatures and their powers.
The Complete Tidyverse Wizard
A true tidyverse wizard can combine all their magical spells - tibbles, importing, filtering, arranging, mutating, summarizing, and grouping - into a single powerful workflow. Let’s put everything together!
Here’s what your capstone should demonstrate:
- Creating and transforming tibbles
- Importing and cleaning data
- Filtering and selecting relevant information
- Creating new variables
- Summarizing by groups
- Visualizing results (if desired)
- Exporting your processed data
Exercise 8: The Complete Tidyverse Magic System
Combine all the magical spells you’ve learned - create tibbles, import/export data, filter, select, arrange, mutate, summarize, and group. Think of it as creating your own complete magical analysis system!
# Load the tidyverse - our magical toolkit
library(tidyverse)
# Create a comprehensive magical creature database
magical_creatures <- tibble(
species = c("Dragon", "Phoenix", "Unicorn", "Griffin", "Mermaid", "Centaur",
"Basilisk", "Fairy", "Troll", "Werewolf", "Vampire", "Ghost",
"Dragon", "Unicorn", "Griffin", "Fairy", "Phoenix", "Mermaid"),
name = c("Smaug", "Fawkes", "Twilight", "Buckbeak", "Ariel", "Firenze",
"Slytherin", "Tinkerbell", "Grumpy", "Remus", "Dracula", "Casper",
"Norbert", "Silver", "Talon", "Periwinkle", "Ash", "Marina"),
age = c(250, 500, 150, 75, 120, 80, 200, 50, 100, 45, 300, 150,
100, 50, 120, 25, 300, 80),
power_level = c(95, 90, 70, 75, 60, 65, 85, 40, 60, 70, 80, 50,
80, 65, 70, 35, 85, 55),
habitat = c("Mountain", "Volcano", "Forest", "Mountain", "Ocean", "Forest",
"Cave", "Forest", "Mountain", "Forest", "Castle", "Haunted House",
"Mountain", "Forest", "Mountain", "Forest", "Volcano", "Ocean"),
element = c("Fire", "Fire", "Light", "Air", "Water", "Earth",
"Poison", "Light", "Earth", "Moon", "Blood", "Spirit",
"Fire", "Light", "Air", "Light", "Fire", "Water"),
is_friendly = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
# Export our original dataset to CSV
write_csv(magical_creatures, "magical_creatures.csv")
# Re-import and verify (in a real workflow, you might combine these steps)
creatures_imported <- read_csv("magical_creatures.csv")
# ===== STEP 1: DATA CLEANING AND ENRICHMENT =====
creatures_enhanced <- creatures_imported %>%
# Remove duplicates
distinct() %>%
# Add calculated fields
mutate(
# Create power categories
power_category = case_when(
power_level >= 85 ~ "Supreme",
power_level >= 70 ~ "High",
power_level >= 50 ~ "Moderate",
TRUE ~ "Low"
),
# Age categories
age_category = case_when(
age >= 200 ~ "Ancient",
age >= 100 ~ "Old",
age >= 50 ~ "Adult",
TRUE ~ "Young"
),
# Danger assessment
danger_level = if_else(
power_level > 80 & !is_friendly,
"Extremely Dangerous",
if_else(power_level > 60 & !is_friendly, "Dangerous",
if_else(!is_friendly, "Exercise Caution", "Generally Safe"))
),
# Normalized power (as a percentage of maximum)
power_normalized = round(power_level / max(power_level) * 100, 1),
# Create a magical power index
magic_index = (power_level * 0.6) + (age * 0.4 / 10)
)
# ===== STEP 2: SPECIES ANALYSIS =====
species_analysis <- creatures_enhanced %>%
group_by(species) %>%
summarize(
count = n(),
avg_power = mean(power_level),
max_power = max(power_level),
min_power = min(power_level),
power_range = max_power - min_power,
avg_age = mean(age),
pct_friendly = mean(is_friendly) * 100
) %>%
arrange(desc(avg_power))
# ===== STEP 3: HABITAT ANALYSIS =====
habitat_analysis <- creatures_enhanced %>%
group_by(habitat, element) %>%
summarize(
creature_count = n(),
avg_power = mean(power_level),
most_dangerous = max(power_level),
pct_friendly = mean(is_friendly) * 100
%>%
) arrange(habitat, desc(avg_power))
# ===== STEP 4: ELEMENT CHAMPIONS =====
element_champions <- creatures_enhanced %>%
  group_by(element) %>%
  slice_max(order_by = power_level, n = 1) %>%
  select(element, name, species, power_level, danger_level) %>%
  arrange(desc(power_level))
# ===== STEP 5: FRIENDSHIP ANALYSIS =====
friendship_analysis <- creatures_enhanced %>%
  group_by(power_category, is_friendly) %>%
  summarize(
    count = n(),
    avg_age = mean(age),
    avg_power = mean(power_level)
  ) %>%
  arrange(power_category, desc(is_friendly))
# ===== STEP 6: FEATURE CORRELATION =====
# Checking relationship between power and age
power_age_correlation <- cor(creatures_enhanced$power_level,
                             creatures_enhanced$age,
                             method = "pearson")
# ===== STEP 7: DANGER ASSESSMENT =====
danger_assessment <- creatures_enhanced %>%
  filter(danger_level == "Extremely Dangerous") %>%
  select(name, species, habitat, power_level, element)
# ===== STEP 8: ADVANCED FILTERING =====
# Find creatures matching specific criteria
special_creatures <- creatures_enhanced %>%
  filter(
    (element %in% c("Fire", "Water")) &
    (power_level > 70 | age > 200) &
    (habitat != "Cave")
  ) %>%
  select(name, species, element, habitat, power_level, age) %>%
  arrange(desc(power_level))
# ===== STEP 9: CREATE FINAL REPORT =====
magical_report <- list(
  dataset_summary = list(
    creature_count = nrow(creatures_enhanced),
    species_count = n_distinct(creatures_enhanced$species),
    habitat_count = n_distinct(creatures_enhanced$habitat),
    element_count = n_distinct(creatures_enhanced$element),
    avg_power_level = mean(creatures_enhanced$power_level),
    avg_age = mean(creatures_enhanced$age),
    friendly_pct = mean(creatures_enhanced$is_friendly) * 100,
    power_age_correlation = power_age_correlation
  ),
  most_powerful = creatures_enhanced %>%
    slice_max(order_by = power_level, n = 1) %>%
    select(name, species, power_level, element),
  oldest_creature = creatures_enhanced %>%
    slice_max(order_by = age, n = 1) %>%
    select(name, species, age, power_level),
  species_analysis = species_analysis,
  habitat_analysis = habitat_analysis,
  element_champions = element_champions,
  friendship_analysis = friendship_analysis,
  danger_assessment = danger_assessment,
  special_creatures = special_creatures
)
# Show the complete report
magical_report
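Step 1 of the case study stacks three if_else() calls to build danger_level. The same ladder of rules can be flattened into a single case_when(), which checks its conditions from top to bottom and uses the first one that matches. What follows is only a sketch: the four-row danger_demo tibble is invented for illustration (it is not part of the case study data), and it assumes power_level and is_friendly contain no missing values, since the two functions treat NA conditions differently.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-dataset, invented for this illustration only
danger_demo <- tibble(
  power_level = c(90, 70, 40, 95),
  is_friendly = c(FALSE, FALSE, FALSE, TRUE)
)

danger_demo <- danger_demo %>%
  mutate(
    # Nested if_else(), written the same way as in Step 1
    danger_nested = if_else(
      power_level > 80 & !is_friendly,
      "Extremely Dangerous",
      if_else(power_level > 60 & !is_friendly, "Dangerous",
        if_else(!is_friendly, "Exercise Caution", "Generally Safe"))
    ),
    # The same rules as a flat case_when(); the first matching condition wins
    danger_flat = case_when(
      power_level > 80 & !is_friendly ~ "Extremely Dangerous",
      power_level > 60 & !is_friendly ~ "Dangerous",
      !is_friendly ~ "Exercise Caution",
      TRUE ~ "Generally Safe"
    )
  )

# Check that both spellings agree row by row
identical(danger_demo$danger_nested, danger_demo$danger_flat)
```

Which spelling to prefer is a matter of taste; the flat version tends to be easier to extend when a new danger tier shows up.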
Advanced Tidyverse Topics: Mastering the Arcane Arts
For those who wish to continue their magical journey, here are some advanced tidyverse topics to explore:
The Magic of Joins
Combining datasets is like merging two magical potions to create something even more powerful:
# Create two datasets
wizards <- tibble(
  name = c("Gandalf", "Dumbledore", "Merlin", "Elminster"),
  element = c("Light", "Fire", "Earth", "Air"),
  power = c(95, 92, 99, 90)
)
spells <- tibble(
  caster = c("Gandalf", "Gandalf", "Dumbledore", "Merlin", "Unknown"),
  spell = c("Light Beam", "Flame Shield", "Phoenix Call", "Earth Shake", "Tempest"),
  power_cost = c(20, 35, 40, 50, 60)
)
# Inner join - only keeps matching rows
inner_join(wizards, spells, by = c("name" = "caster"))
# Left join - keeps all rows from the left table
left_join(wizards, spells, by = c("name" = "caster"))
# Right join - keeps all rows from the right table
right_join(wizards, spells, by = c("name" = "caster"))
# Full join - keeps all rows from both tables
full_join(wizards, spells, by = c("name" = "caster"))
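A quick way to see how the four joins differ is to count the rows each one returns. The sketch below rebuilds the same wizards and spells tibbles so it can run on its own. The join_rows vector and the anti_join() call are additions for illustration: anti_join() is a real dplyr function, but it is not one of the joins introduced above.

```r
library(dplyr)
library(tibble)

wizards <- tibble(
  name = c("Gandalf", "Dumbledore", "Merlin", "Elminster"),
  element = c("Light", "Fire", "Earth", "Air"),
  power = c(95, 92, 99, 90)
)
spells <- tibble(
  caster = c("Gandalf", "Gandalf", "Dumbledore", "Merlin", "Unknown"),
  spell = c("Light Beam", "Flame Shield", "Phoenix Call", "Earth Shake", "Tempest"),
  power_cost = c(20, 35, 40, 50, 60)
)

# Count the rows produced by each join type
join_rows <- c(
  inner = nrow(inner_join(wizards, spells, by = c("name" = "caster"))),
  left  = nrow(left_join(wizards, spells, by = c("name" = "caster"))),
  right = nrow(right_join(wizards, spells, by = c("name" = "caster"))),
  full  = nrow(full_join(wizards, spells, by = c("name" = "caster")))
)
join_rows
# inner = 4, left = 5, right = 5, full = 6

# anti_join() keeps the wizards that have no matching spell at all
wizards_without_spells <- anti_join(wizards, spells, by = c("name" = "caster"))
wizards_without_spells$name
# "Elminster"
```

Gandalf appears twice in every result because he matches two spells. Elminster (no spells) and the "Unknown" caster (no wizard) only survive in the joins that keep unmatched rows from their side.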
The Art of Pivoting
Reshaping data is like transforming your magical creatures into different forms:
# Wide to long format
measurements <- tibble(
  name = c("Dragon", "Phoenix", "Unicorn"),
  height = c(300, 120, 180),
  weight = c(2000, 15, 450),
  wingspan = c(500, 300, NA)
)
# Convert to long format
measurements_long <- measurements %>%
  pivot_longer(
    cols = c(height, weight, wingspan),
    names_to = "measurement",
    values_to = "value"
  )
# Long to wide format
measurements_wide <- measurements_long %>%
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
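To convince yourself that pivoting only changes the shape of the data and not its contents, you can check the row counts and compare the round trip with the original. The sketch below repeats the measurements example so it runs on its own. The measurements_long_complete object and the use of values_drop_na = TRUE are additions for illustration, showing one way the Unicorn's missing wingspan could be dropped while lengthening.

```r
library(dplyr)
library(tidyr)
library(tibble)

measurements <- tibble(
  name = c("Dragon", "Phoenix", "Unicorn"),
  height = c(300, 120, 180),
  weight = c(2000, 15, 450),
  wingspan = c(500, 300, NA)
)

# Wide to long: 3 creatures x 3 measurements = 9 rows
measurements_long <- measurements %>%
  pivot_longer(
    cols = c(height, weight, wingspan),
    names_to = "measurement",
    values_to = "value"
  )
nrow(measurements_long)

# Illustrative variant: drop rows whose value is NA (the Unicorn's wingspan)
measurements_long_complete <- measurements %>%
  pivot_longer(
    cols = c(height, weight, wingspan),
    names_to = "measurement",
    values_to = "value",
    values_drop_na = TRUE
  )
nrow(measurements_long_complete)

# Long back to wide, then compare with the starting table
measurements_wide <- measurements_long %>%
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
all.equal(as.data.frame(measurements), as.data.frame(measurements_wide))
```

The long table has 9 rows, the NA-free variant has 8, and the round trip should reproduce the original three-row table.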
Working with Nested Data
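Nested data is like having magical creatures with smaller creatures inside them. After nest(), each species occupies a single row, and a list-column named data holds that species' rows as a tibble of their own: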
# Group and nest data
nested_creatures <- magical_creatures %>%
  group_by(species) %>%
  nest()
# Work with nested data
nested_creatures %>%
  mutate(
    creature_count = map_int(data, nrow),
    power_stats = map(data, ~ summary(.$power_level)),
    max_power = map_dbl(data, ~ max(.$power_level))
  )
# Unnest data
nested_creatures %>%
  unnest(data)
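Here is a small, self-contained sketch of the same nest, map, and unnest cycle. The creatures_demo tibble and its three creatures are made up for illustration and stand in for magical_creatures; the ungroup() call is also an addition, so that the later steps work on a plain tibble.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

# Hypothetical stand-in for magical_creatures (invented rows)
creatures_demo <- tibble(
  name = c("Ember", "Ash", "Star"),
  species = c("Dragon", "Dragon", "Unicorn"),
  power_level = c(95, 88, 75)
)

# One row per species; the list-column `data` holds each species' rows
nested_demo <- creatures_demo %>%
  group_by(species) %>%
  nest() %>%
  ungroup()

# map_int() and map_dbl() run a function on every nested tibble
nested_summary <- nested_demo %>%
  mutate(
    creature_count = map_int(data, nrow),
    max_power = map_dbl(data, ~ max(.x$power_level))
  )

# unnest() expands the list-column back into ordinary rows
unnested_demo <- nested_demo %>%
  unnest(data)
```

In this sketch nested_demo has two rows (Dragon and Unicorn), nested_summary reports 2 and 1 creatures with maximum powers of 95 and 75, and unnested_demo is back to three rows.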
With these advanced techniques in your magical arsenal, there’s no data enchantment you can’t master!
Further Learning
To continue your journey to becoming a tidyverse archmage, consult these magical tomes:
- R for Data Science
- tidyverse.org
- RStudio Cheatsheets