This practical will teach you the basics of ggplot2. It is split into 4 parts, each dedicated to a chart family.
→ I strongly suggest using RStudio, the best way to develop R code.
→ Start a new script with File → New File → R Script. Save this file somewhere on your computer. You will store all your commands in it.
→ This practical requires several R packages.
Q0.1 Install the ggplot2 package if needed and load it. ggplot2 is a very powerful package for data visualization with R. It is the main topic of this practical.
# Install the package if needed
#install.packages("ggplot2")
# Load it
library(ggplot2)
Q0.2 Install and load dplyr. dplyr is part of the tidyverse and is very useful for data manipulation. It also provides the %>% operator, which lets you 'pipe' commands.
# Install the package if needed
#install.packages("dplyr")
# Load it
library(dplyr)
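To make the idea of piping concrete, here is a minimal sketch. The scores data frame is a hypothetical toy example, not part of the practical. It shows that a piped chain and the equivalent nested call return the same result.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
scores <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))

# Nested call: has to be read from the inside out
nested <- summarize(group_by(scores, group), mean_value = mean(value))

# Piped call: each step is read from top to bottom
piped <- scores %>%
  group_by(group) %>%
  summarize(mean_value = mean(value))

# Both forms give the same table
all.equal(as.data.frame(nested), as.data.frame(piped))
```

The pipe passes the object on its left as the first argument of the function on its right, which is why every step of the solutions in this practical starts on a new line.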
The first part of this practical will teach you how to build scatterplots and bubble charts: the 2 most common chart types to visualize correlations.
Q1.1 Load the gapminder dataset stored in the gapminder package. Have a look at the first 6 rows using the head() function. Briefly describe what you see as comments in your script. Check how many rows are available with nrow().
# Install the package if needed
#install.packages("gapminder")
# Load it
library(gapminder)
# Have a look at the first rows
head(gapminder)
# How many rows?
nrow(gapminder)
[1] 1704
Q1.2 How many years are available in this dataset? How many data-points for each year? Full code is provided for this question.
# Number of different years?
gapminder %>%
  select(year) %>%
  unique() %>%
  nrow()
# or
length(unique(gapminder$year))
# Number of countries available per year?
gapminder %>%
  group_by(year) %>%
  summarize( n=n() )
Q1.3 Build a scatterplot showing the relationship between gdpPercap and lifeExp in 1952. Use geom_point(). What do you observe?
# basic scatterplot
gapminder %>%
  filter(year=="1952") %>%
  ggplot( aes(x=gdpPercap, y=lifeExp)) +
  geom_point()
Q1.4 On the previous chart, one country is very different. Which one is it?
You should get something like:
# Which country has a gdpPercap above 90000 in 1952?
gapminder %>%
  filter(year=="1952" & gdpPercap>90000)
Q1.5 Build the same chart, but get rid of this country. What trend do you observe? Does it make sense? What’s missing? What could be better?
You should get a chart like this:
# basic scatterplot
gapminder %>%
  filter(year=="1952" & country!="Kuwait") %>%
  ggplot( aes(x=gdpPercap, y=lifeExp)) +
  geom_point()
Q1.6 Color dots according to their continent. In the aes() part of the code, use the color argument.
gapminder %>%
  filter(year=="1952" & country!="Kuwait") %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, color=continent)) +
  geom_point()
Q1.7 Let’s observe an additional variable: make the circle size proportional to the population (pop). This is done with the size argument of aes(). What do you call this kind of chart?
gapminder %>%
  filter(year=="1952" & country!="Kuwait") %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, color=continent, size=pop)) +
  geom_point()
Bonus Ahead of schedule? Try the following:
→ Apply the theme_ipsum theme of the hrbrthemes library.
→ Add transparency with the alpha argument.
→ Use the ggplotly() function of the plotly package to make this chart interactive.
You should get something like this chart:
# Additional packages:
library(hrbrthemes) # for general style
library(plotly) # to make the chart interactive
# Chart
p <- gapminder %>%
  filter(year=="1952" & country!="Kuwait") %>%
  arrange(desc(pop)) %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, fill=continent, size=pop)) +
  # shape 21 is a filled circle: fill sets the inside, color sets the border
  geom_point(alpha=0.7, color="white", shape=21) +
  theme_ipsum()
# Interactive mode
ggplotly(p)
This second part is dedicated to the visualization of distributions. It is split into 2 parts.
The example dataset provides the AirBnb nightly prices of about 10,000 apartments on the French Riviera. Data is stored on GitHub and can be loaded in R as follows:
# Load dataset from github
data <- read.table("DATA/1_OneNum.csv", header=TRUE)
Q2.1.1 How many rows are in the dataset? (use nrow()) What is the min? The max? (use summary()). Do you see anything weird? What kind of chart would you build to visualize this kind of data?
nrow(data)
[1] 9995
summary(data)
price
Min. : 11.0
1st Qu.: 69.0
Median : 103.0
Mean : 179.4
3rd Qu.: 172.0
Max. :17242.0
Q2.1.2 Build a histogram of the data with geom_histogram(). Are you happy with the output? How can we improve it?
data %>%
  ggplot( aes(x=price)) +
  geom_histogram()
Q2.1.3 Build a histogram without prices over 1500 euros. ggplot2 displays a warning message, why? What does it mean? What’s the main caveat of histograms?
data %>%
  filter(price<1500) %>%
  ggplot( aes(x=price)) +
  geom_histogram()
Q2.1.4 Build the histogram with different values of binwidth, for prices <400. What do you observe? Is it important to play with this parameter?
data %>%
  filter(price<400) %>%
  ggplot( aes(x=price)) +
  geom_histogram(binwidth = 2)
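To see why binwidth matters, the sketch below counts how many bars ggplot2 draws for two bin widths. The prices are simulated (an assumption, not the AirBnb data) and count_bins() is a hypothetical helper written for this illustration.

```r
library(ggplot2)

# Hypothetical skewed prices, simulated for illustration only
set.seed(1)
prices <- data.frame(price = rexp(1000, rate = 1 / 100))

# Hypothetical helper: number of bins ggplot2 computes for a given binwidth
count_bins <- function(binwidth) {
  p <- ggplot(prices, aes(x = price)) +
    geom_histogram(binwidth = binwidth)
  nrow(layer_data(p))  # layer_data() returns one row per bin
}

bins_wide   <- count_bins(50)  # few, coarse bins
bins_narrow <- count_bins(5)   # many, fine bins
c(wide = bins_wide, narrow = bins_narrow)
```

A wide bin hides local structure, while a very narrow one turns the histogram into noise, so it is worth trying several values before drawing conclusions.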
Q2.1.5 Use geom_density() to build a density chart. Use the fill argument to set the color. Use the help() function to find the equivalent of binwidth for density charts. Check its effect using different values.
data %>%
  filter(price<1000) %>%
  ggplot( aes(x=price)) +
  geom_density(color="transparent", fill="#69b3a2", bw=5)
Dataset: participants were asked questions like "What probability would you assign to the phrase Highly likely?". Answers were given in the range 0-100. This helps us understand how people perceive probability vocabulary. Data is stored on GitHub and can be loaded in R as follows:
# Load dataset from github
data <- read.table("DATA/probability.csv", header=TRUE, sep=",")
Q2.2.1 As usual, check the data's main features with nrow(), summary() or any other function you think is useful.
# Data size?
nrow(data)
[1] 368
# occurrence of each word:
table(data$text)
      About Even Almost Certainly Almost No Chance       Improbable
              46               46               46               46
          Likely     Probably Not         Unlikely Very Good Chance
              46               46               46               46
Q2.2.2 What kind of chart would you do to compare the 8 categories?
Q2.2.3 Build a basic boxplot using the default options of geom_boxplot()
ggplot(data, aes(x=text, y=value, fill=text)) +
geom_boxplot()
Q2.2.4 What do you observe? Can you improve this chart? What would you change? Do you remember what the different parts of the box represent?
Q2.2.5 Apply the following modifications to the previous boxplot:
→ Order the groups by their median value. This is done thanks to the forcats package. Code is provided.
→ Flip the axes (coord_flip())
→ Remove the legend (theme)
# Library forcats to reorder data
library(forcats)
# Reorder data
data %>%
  mutate(text = fct_reorder(text, value, .fun = median)) %>%
  ggplot(aes(x=text, y=value, fill=text)) +
  geom_boxplot() +
  theme(
    legend.position = "none"
  ) +
  coord_flip()
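The reordering step can be checked on its own. The sketch below uses a hypothetical answers data frame (not the real survey data) to compare the default, alphabetical factor levels with the levels produced by fct_reorder().

```r
library(forcats)

# Hypothetical answers, for illustration only
answers <- data.frame(
  text  = c("Likely", "Likely", "Unlikely", "Unlikely", "About Even", "About Even"),
  value = c(70, 80, 10, 20, 50, 50)
)

# Default factor levels are sorted alphabetically
default_levels <- levels(factor(answers$text))

# fct_reorder() sorts the levels by the median of value within each level
reordered <- fct_reorder(factor(answers$text), answers$value, .fun = median)

default_levels
levels(reordered)
```

ggplot2 draws discrete axes in the order of the factor levels, so changing the levels is what changes the order of the boxes.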
Q2.2.6 What is the main caveat with boxplot? How can we do better?
Q2.2.7 Let’s show individual data points using the geom_jitter() function. Explain exactly what this function does. Try to get a nice output using the width, size, alpha and color options.
# Library forcats to reorder data
library(forcats)
# Reorder data
data %>%
  mutate(text = fct_reorder(text, value, .fun = median)) %>%
  ggplot(aes(x=text, y=value, fill=text)) +
  geom_boxplot() +
  geom_jitter(color="grey", width=.4, size=.5, alpha=.8) +
  theme(
    legend.position = "none"
  ) +
  coord_flip()
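As a rough illustration of what jittering does, the sketch below inspects the positions ggplot2 actually draws. The toy data frame is hypothetical. Setting height = 0 is a choice made here so that only the horizontal position receives random noise and the plotted values stay exact.

```r
library(ggplot2)

# Hypothetical data: two groups, five values each
set.seed(42)
toy <- data.frame(
  text  = rep(c("A", "B"), each = 5),
  value = c(10, 12, 12, 15, 20, 40, 41, 41, 45, 50)
)

# width sets the maximum horizontal noise; height = 0 leaves values untouched
p <- ggplot(toy, aes(x = text, y = value)) +
  geom_jitter(width = 0.2, height = 0)

drawn <- layer_data(p)

# Horizontal offset of each point from its group position (1 for A, 2 for B)
offsets <- as.numeric(drawn$x) - round(as.numeric(drawn$x))
range(offsets)
```

Each point is shifted by a random amount of at most width on either side of its group, which separates overlapping points without changing the values read on the other axis.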
Bonus Too fast? Try the following:
→ Replace the boxplot with a violin plot using geom_violin()
Let’s talk about the quantity of weapons exported by the 50 largest exporters in 2017 (source). The dataset is available on GitHub. Load it in R:
# Load dataset from github
data <- read.table("DATA/7_OneCatOneNum.csv", header=TRUE, sep=",")
Q3.1 Have a quick look at the dataset. Describe it. What kind of chart can you build with this dataset? Which one would be the best in your opinion? Which countries are at the top of the ranking?
head(data)
nrow(data)
[1] 51
data %>%
  arrange(desc(Value)) %>%
  head(5)
Q3.2 Start with a basic barplot using geom_bar().
Note: by default geom_bar() takes only one categorical variable as input, used for the x axis. It counts the number of cases at each x position and displays it on the y axis. In our case, we want to provide a y value for each group. This is why we need to specify stat="identity".
data %>%
  ggplot( aes(x=Country, y=Value) ) +
  geom_bar(stat="identity")
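The difference between counting rows and using a provided value can be seen on a small example. Both data frames below are hypothetical and only serve the illustration.

```r
library(ggplot2)

# Hypothetical data, for illustration only
raw <- data.frame(Country = c("A", "A", "B"))                 # one row per observation
agg <- data.frame(Country = c("A", "B"), Value = c(120, 30))  # one row per group

# Default stat = "count": bar height is the number of rows per Country
p_count <- ggplot(raw, aes(x = Country)) +
  geom_bar()

# stat = "identity": bar height is taken directly from Value
p_identity <- ggplot(agg, aes(x = Country, y = Value)) +
  geom_bar(stat = "identity")

counted <- layer_data(p_count)$count   # rows counted per group
passed  <- layer_data(p_identity)$y    # values passed through unchanged
counted
passed
```

ggplot2 also provides geom_col(), which behaves like geom_bar(stat="identity") and can be used when the data already holds one value per group.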
Q3.3 Color all bars with the same color: #69b3a2. Don’t like the color? Pick another one. Do you have to use fill or color? Why?
data %>%
  ggplot( aes(x=Country, y=Value) ) +
  geom_bar(stat="identity", fill="#69b3a2")
Q3.4 Set a different color for each bar. Do you like the output? Is it useful? Do you understand the difference between adding an option inside or outside aes()?
data %>%
  ggplot( aes(x=Country, y=Value, fill=Country) ) +
  geom_bar(stat="identity")
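One way to check the inside/outside aes() distinction is to look at the fill colors ggplot2 computes for each bar. The agg data frame below is a hypothetical toy example.

```r
library(ggplot2)

# Hypothetical data, for illustration only
agg <- data.frame(Country = c("A", "B", "C"), Value = c(3, 2, 1))

# Outside aes(): a fixed value applied to every bar
p_fixed <- ggplot(agg, aes(x = Country, y = Value)) +
  geom_bar(stat = "identity", fill = "#69b3a2")

# Inside aes(): a mapping from a variable to colors, with a legend
p_mapped <- ggplot(agg, aes(x = Country, y = Value, fill = Country)) +
  geom_bar(stat = "identity")

fixed_fills  <- unique(layer_data(p_fixed)$fill)   # a single color
mapped_fills <- unique(layer_data(p_mapped)$fill)  # one color per country
fixed_fills
length(mapped_fills)
```

Inside aes(), fill expects a variable of the dataset and ggplot2 picks the colors. Outside aes(), fill expects the color itself.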
Q3.5 Previous barplots are a bit disappointing aren’t they? What can you improve?
Q3.6 Try the following:
→ Use coord_flip() to get a horizontal version
You should get that kind of output:
data %>%
  filter(!is.na(Value)) %>%
  arrange(Value) %>%
  # Set factor levels in the sorted row order so bars are drawn by Value
  mutate(Country=factor(Country, Country)) %>%
  ggplot( aes(x=Country, y=Value) ) +
  geom_bar(stat="identity", fill="#69b3a2") +
  coord_flip() +
  xlab("")
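The factor(Country, Country) step is what controls the order of the bars. The base R sketch below reproduces it on a hypothetical exports data frame: rows are sorted by Value, then the factor levels are set to that row order.

```r
# Hypothetical data, for illustration only
exports <- data.frame(
  Country = c("B", "C", "A"),
  Value   = c(20, 5, 50),
  stringsAsFactors = FALSE
)

# Sort rows by Value, then fix the factor levels in that row order
sorted <- exports[order(exports$Value), ]
sorted$Country <- factor(sorted$Country, levels = sorted$Country)

default_levels <- levels(factor(exports$Country))  # alphabetical order
sorted_levels  <- levels(sorted$Country)           # smallest Value first
default_levels
sorted_levels
```

Sorting the rows alone is not enough, because ggplot2 orders a discrete axis by factor levels, not by row position.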
Q3.7 A lollipop plot is used in the same conditions as a barplot. Build it with:
→ geom_segment() for the stems. Arguments needed are x, xend, y and yend.
→ geom_point() for the circles. Needs x and y only.
You should get:
data %>%
  filter(!is.na(Value)) %>%
  arrange(Value) %>%
  mutate(Country=factor(Country, Country)) %>%
  ggplot( aes(x=Country, y=Value) ) +
  geom_segment( aes(x=Country ,xend=Country, y=0, yend=Value), color="grey") +
  geom_point(size=3, color="#69b3a2") +
  coord_flip() +
  xlab("")
Bonus Try the following:
→ Apply theme_ipsum. Be creative to make it even better.
# Package
library(treemap)
# Drop rows with missing values to keep the plot simple
data <- na.omit(data)
# Plot
treemap(data,
# data
index="Country",
vSize="Value",
type="index",
# Main
title="",
palette="Dark2",
# Borders:
border.col=c("black"),
border.lwds=1,
# Labels
# fontsize.labels=0.5,
fontcolor.labels="white",
fontface.labels=1,
bg.labels=c("transparent"),
align.labels=c("left", "top"),
overlap.labels=0.5,
inflate.labels=T # If true, labels are bigger when rectangle is bigger.
)
Let’s consider the evolution of the bitcoin price between April 2013 and April 2018. Data are stored on github. Load the dataset using the following code:
# Load dataset from github
data <- read.table("DATA/3_TwoNumOrdered.csv", header=T)
# Convert the date column from text to the Date class
data$date <- as.Date(data$date)
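The conversion to the Date class matters because text read from a file is not understood as a date. The sketch below assumes dates written as yyyy-mm-dd, which as.Date() parses by default. The two values are hypothetical.

```r
# Hypothetical date strings, assumed to be in yyyy-mm-dd format
raw_dates <- c("2013-04-28", "2018-04-23")
class(raw_dates)          # plain text

dates <- as.Date(raw_dates)
class(dates)              # now a Date vector

# Dates support arithmetic, which plain text does not
elapsed <- diff(dates)
elapsed
```

With a Date column, ggplot2 uses a continuous date axis with sensible labels instead of treating every day as a separate category.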
Q4.1 Build a basic line chart showing the bitcoin price evolution using geom_line().
data %>%
  ggplot( aes(x=date, y=value)) +
  geom_line(color="#69b3a2")
Q4.2 Switch to an area chart using geom_area(). Use the color and fill arguments to customize chart colors.
data %>%
  ggplot( aes(x=date, y=value)) +
  geom_area(color="#69b3a2", fill="#69b3a2")
Q4.3 Select the last 10 values using tail(). Build a connected scatterplot using geom_point(), geom_line() and geom_area().
data %>%
  tail(10) %>%
  ggplot( aes(x=date, y=value)) +
  geom_area(fill="#69b3a2", alpha=0.5) +
  geom_line(color="#69b3a2") +
  geom_point()
Bonus Visit the time series section of the R graph gallery. Try to use the HTML widget called dygraphs to build an interactive version of this line plot.
You should get something like this.
# Library
library(dygraphs)
library(xts) # To convert between data frame and xts format
library(lubridate)
# Then you can create the xts format
don <- xts(x = data$value, order.by = data$date)
# graph
dygraph(don) %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
A work by a practical by We Data