Chapter 5 Part of a Whole
5.1 Circular Packing
Circular packing, or circular treemap, visualizes a hierarchical organization. It is an equivalent of a treemap or a dendrogram, where each node of the tree is represented as a circle and its sub-nodes are represented as circles inside of it.
5.1.0.1 One Level - packcircles and ggplot2
If your dataset has no hierarchy (it is basically just a few entities with attributed numeric values), the packcircles package is the best way to build a circular packing chart in R. The package computes the position of each bubble, so that the chart can then be built with ggplot2.
5.1.1 Basic Circle Packing with One Level
This page aims to describe how to build a basic circle packing chart with only one level of hierarchy. It uses the packcircles package for circle position, and ggplot2 for drawing. Basically, you just represent each entity or individual of your dataset with a circle, its size depending on a provided value.
It is like a barplot, but you use circle size instead of bar length. It is close to a bubble plot, but X and Y positions do not mean anything. It is a circle version of a treemap.
Calculating the arrangement of dots is not a trivial problem. The packcircles library solves it and outputs the coordinates of the points along each circle's edge.
Finally, ggplot2 draws the shapes thanks to geom_polygon().
# Libraries
library(packcircles)
library(ggplot2)
# Create data
data <- data.frame(group=paste("Group", letters[1:20]), value=sample(seq(1,100),20))
# Generate the layout. This function returns a data frame with one line per bubble.
# It gives its center (x and y) and its radius, proportional to the value
packing <- circleProgressiveLayout(data$value, sizetype='area')
# We can add this packing information to the initial data frame
data <- cbind(data, packing)
# Check the relationship between radius and value. We don't want a linear relationship, since it is the AREA that must be proportional to the value
# plot(data$radius, data$value)
# The next step is to go from one center + a radius to the coordinates of a circle that
# is drawn by a multitude of straight lines.
dat.gg <- circleLayoutVertices(packing, npoints=50)
# Make the plot
ggplot() +
# Make the bubbles
geom_polygon(data = dat.gg, aes(x, y, group = id, fill=as.factor(id)), colour = "black", alpha = 0.6) +
# Add text in the center of each bubble + control its size
geom_text(data = data, aes(x, y, size=value, label = group)) +
scale_size_continuous(range = c(1,4)) +
# General theme:
theme_void() +
theme(legend.position="none") +
coord_equal()
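The reason for sizetype='area' can be checked by hand. The sketch below is base R only and assumes that the area of each circle should equal its value; radius_for_area is a hypothetical helper written for this illustration, not a packcircles function.

```r
# Hypothetical helper (not part of packcircles):
# radius of a circle whose area equals `value`
radius_for_area <- function(value) sqrt(value / pi)

values <- c(10, 40, 90)
radii  <- radius_for_area(values)

# Areas recovered from the radii match the input values
areas <- pi * radii^2

# A value 4 times larger gives a radius only 2 times larger
ratio <- radii[2] / radii[1]
```

This is why a plot of radius against value is curved rather than linear when the area carries the information.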
5.1.2 Circle Packing Customization with R
This page is dedicated to one level circle packing customization with R. It notably shows how to use different color palettes and provides reproducible code snippets.
5.1.2.1 Using the Viridis Color Scale
This chart follows the previous most basic circle packing section.
It shows how to use the viridis package to build color scales, a very good alternative to the usual ColorBrewer palettes.
Note that magma is used here, but you could use the same code with inferno or viridis instead.
# libraries
library(packcircles)
library(ggplot2)
library(viridis)
# Create data
data <- data.frame(group=paste("Group", letters[1:20]), value=sample(seq(1,100),20))
# Generate the layout. sizetype can be area or radius, depending on which one should be proportional to value.
packing <- circleProgressiveLayout(data$value, sizetype='area')
data <- cbind(data, packing)
dat.gg <- circleLayoutVertices(packing, npoints=50)
# Basic color customization
ggplot() +
geom_polygon(data = dat.gg, aes(x, y, group = id, fill=as.factor(id)), colour = "black", alpha = 0.6) +
scale_fill_manual(values = magma(nrow(data))) +
geom_text(data = data, aes(x, y, size=value, label = group)) +
scale_size_continuous(range = c(1,4)) +
theme_void() +
theme(legend.position="none") +
coord_equal()
5.1.3 Map Color to Bubble Value
It is a common task to make the bubble color lighter or darker according to its value.
This is possible by adding the variable of interest to the data frame that is read by ggplot2, and specifying it in the aes().
# First I need to add the 'value' of each group to dat.gg.
# Here I repeat each value 51 times since I create my polygons with 50 lines
dat.gg$value <- rep(data$value, each=51)
# Plot
ggplot() +
# Make the bubbles
geom_polygon(data = dat.gg, aes(x, y, group = id, fill=value), colour = "black", alpha = 0.6) +
scale_fill_distiller(palette = "BuPu", direction = 1 ) +
# Add text in the center of each bubble + control its size
geom_text(data = data, aes(x, y, size=value, label = group)) +
scale_size_continuous(range = c(1,4)) +
# General theme:
theme_void() +
theme(legend.position="none") +
coord_equal()
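The rep(..., each=51) step relies on every polygon having exactly 51 rows. A possible alternative, sketched below, is to look the value up through the id column so that the row count no longer matters. The small dat.gg here is a hand-made stand-in for the output of circleLayoutVertices(), and it assumes that id numbers the circles in the order of the rows of data.

```r
# Stand-in data: 3 groups, and a vertex table with an uneven number of rows per circle
data   <- data.frame(group = c("a", "b", "c"), value = c(12, 45, 7))
dat.gg <- data.frame(x = 0, y = 0, id = c(1, 1, 1, 2, 2, 3, 3, 3, 3))

# Look up the value of each vertex through its id
# instead of repeating each value a fixed number of times
dat.gg$value <- data$value[dat.gg$id]
```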
5.1.4 Background Customization
Change the background thanks to the theme() function and its plot.background argument.
ggplot() +
# Make the bubbles
geom_polygon(data = dat.gg, aes(x, y, group = id, fill=value), colour = "grey", alpha = 0.6, size=.5) +
scale_fill_distiller(palette = "Spectral", direction = 1 ) +
# Add text in the center of each bubble + control its size
geom_label(data = data, aes(x, y, size=value, label = group)) +
scale_size_continuous(range = c(1,4)) +
# General theme:
theme_void() +
theme(
  legend.position="none",
  plot.background = element_rect(fill="black"),
  plot.title = element_text(color="white")
) +
coord_equal() +
ggtitle("A custom circle packing with\nblack background")
5.1.5 Space between Bubbles
This chart is just a customization of chart #305, which describes the basic process to make a one-level circle packing chart. I personally like to add a bit of space between each circle.
Basically, all you have to do is reduce the radius in your data once it has been calculated. Just multiply it by a number below 1, and it will decrease the circle size.
If you have read this far, you probably want to check the interactive version of the chart!
# libraries
library(packcircles)
library(ggplot2)
library(viridis)
# Create data
data <- data.frame(group=paste("Group", letters[1:20]), value=sample(seq(1,100),20))
# Generate the layout
packing <- circleProgressiveLayout(data$value, sizetype='area')
packing$radius <- 0.95*packing$radius
data <- cbind(data, packing)
dat.gg <- circleLayoutVertices(packing, npoints=50)
# Plot
ggplot() +
geom_polygon(data = dat.gg, aes(x, y, group = id, fill=id), colour = "black", alpha = 0.6) +
scale_fill_viridis() +
geom_text(data = data, aes(x, y, size=value, label = group), color="black") +
theme_void() +
theme(legend.position="none")+
coord_equal()
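The effect of the 0.95 factor can be worked out on two circles. This is a minimal base R sketch with made-up numbers: the centers stay where the layout put them and only the radii shrink, which is what opens a gap.

```r
# Two touching circles: centers 5 units apart, radii 2 and 3
centers_distance <- 5
radius <- c(2, 3)

# Gap between the two edges before shrinking (0 means they touch)
gap_before <- centers_distance - sum(radius)

# Shrink the radii by 5% while leaving the centers unchanged
shrunk    <- 0.95 * radius
gap_after <- centers_distance - sum(shrunk)
```

A factor above 1 would make the circles overlap instead, which is why the multiplier has to stay below 1.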
5.1.6 Interactive Circle Packing with R
This section describes how to build an interactive circle packing chart with R and the ggiraph package. Hovering over a bubble reveals additional information.
This chart follows sections #305 and #306, which explain how to build a static version of circle packing and how to customize it.
This interactive version is very close to the static one. It uses the ggiraph library to transform the ggplot2 code into something interactive. The steps are quite easy:
- First you need to prepare a column in the data frame with the text you want to display while hovering.
- Second, you need to change the geometries to use the interactive geometries of ggiraph.
Check the code below:
# libraries
library(packcircles)
library(ggplot2)
library(viridis)
library(ggiraph)
# Create data
data <- data.frame(group=paste("Group_", sample(letters, 70, replace=T), sample(letters, 70, replace=T), sample(letters, 70, replace=T), sep="" ), value=sample(seq(1,70),70))
# Add a column with the text you want to display for each bubble:
data$text <- paste("name: ",data$group, "\n", "value:", data$value, "\n", "You can add a story here!")
# Generate the layout
packing <- circleProgressiveLayout(data$value, sizetype='area')
data <- cbind(data, packing)
dat.gg <- circleLayoutVertices(packing, npoints=50)
# Make the plot with a few differences compared to the static version:
p <- ggplot() +
  geom_polygon_interactive(data = dat.gg, aes(x, y, group = id, fill=id, tooltip = data$text[id], data_id = id), colour = "black", alpha = 0.6) +
  scale_fill_viridis() +
  geom_text(data = data, aes(x, y, label = gsub("Group_", "", group)), size=2, color="black") +
theme_void() +
theme(legend.position="none", plot.margin=unit(c(0,0,0,0),"cm") ) +
coord_equal()
# Turn it interactive
# (recent ggiraph releases provide girafe(ggobj = p, ...) in place of ggiraph())
widg <- ggiraph(ggobj = p, width_svg = 7, height_svg = 7)
widg
# save the widget
library(htmlwidgets)
saveWidget(widg, file=paste0( getwd(), "/HtmlWidget/circular_packing_interactive.html"))
5.1.7 Basic Circle Packing with Several Hierarchy Levels
This page is dedicated to multi level circle packing. It explains how to build one using R and the ggraph package.
5.1.7.1 Several Levels - ggraph
If your dataset is a hierarchy, it is time to switch to other tools. For static versions, the ggraph package is the best option. It follows the grammar of graphics and makes it a breeze to customize the appearance following the same logic as ggplot2.
5.1.8 Input & Concept
Circular packing represents a hierarchy: the biggest circle (origin of the hierarchy) contains several big circles (nodes of level 1), which contain smaller circles (level 2), and so on. The nodes of the last level are called leaves.
The input data is a list of edges between nodes, with one row per edge giving the node it starts from and the node it points to. Moreover, we usually accompany this table with another one that gives features for each node.
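As an illustration of these two tables, here is a minimal base R sketch. The node names and the size column are made up for the example; only the from/to layout and the one-row-per-node layout come from the description above.

```r
# Hypothetical hierarchy: one origin, two groups, four leaves
edges <- data.frame(
  from = c("origin", "origin", "group1", "group1", "group2", "group2"),
  to   = c("group1", "group2", "leaf1", "leaf2", "leaf3", "leaf4"),
  stringsAsFactors = FALSE
)

# One row per node, with a feature (here a made-up size) for each
vertices <- data.frame(
  name = unique(c(edges$from, edges$to)),
  size = c(0, 0, 0, 10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Every node named in the edge list has exactly one row in the vertex table
all_nodes_described <- setequal(vertices$name, c(edges$from, edges$to))
```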
5.1.8.1 Most Basic Circular Packing with ggraph
The ggraph package makes it a breeze to build a circular packing from an edge list. Here is an example based on the flare dataset provided with the package.
The first step is to transform the data frame to a graph object thanks to the graph_from_data_frame() function of the igraph package. Then, ggraph offers the geom_node_circle() function that will build the chart.
# Libraries
library(ggraph)
library(igraph)
library(tidyverse)
# We need a data frame giving a hierarchical structure. Let's consider the flare dataset:
edges <- flare$edges
# Usually we associate another dataset that gives information about each node of the dataset:
vertices <- flare$vertices
# Then we have to make a 'graph' object using the igraph library:
mygraph <- graph_from_data_frame( edges, vertices=vertices )
# Make the plot
ggraph(mygraph, layout = 'circlepack') +
geom_node_circle() +
theme_void()
5.1.9 Switching to another Chart Type
Note that the ggraph library makes it easy to go from one type of representation to another. Indeed, several types of representation are suitable for hierarchical data: dendrogram (can be circular), treemap, sunburst diagram or network!
library(ggraph)
ggraph(mygraph, layout='dendrogram', circular=TRUE) +
geom_edge_diagonal() +
theme_void() +
theme(legend.position="none")
ggraph(mygraph, layout='dendrogram', circular=FALSE) +
geom_edge_diagonal() +
theme_void() +
theme(legend.position="none")
ggraph(mygraph, 'treemap', weight = size) +
geom_node_tile(aes(fill = depth), size = 0.25) +
theme_void() +
theme(legend.position="none")
ggraph(mygraph, 'partition', circular = TRUE) +
geom_node_arc_bar(aes(fill = depth), size = 0.25) +
theme_void() +
theme(legend.position="none")
ggraph(mygraph) +
geom_edge_link() +
geom_node_point() +
theme_void() +
theme(legend.position="none")
5.1.10 Customized Circle Packing with R and ggraph
This page follows the previous introduction that explained the basics of circle packing with R and the ggraph library. It describes how to customize color, size, labels and more.
5.1.10.1 Bubble Size Proportional to a Variable
Mapping the bubble size to a numeric variable adds an additional layer of information to the chart.
Here, the vertices data frame has a size column that is used for the bubble size. Basically, it just needs to be passed to the weight argument of the ggraph() function.
# Libraries
library(ggraph)
library(igraph)
library(tidyverse)
library(viridis)
# We need a data frame giving a hierarchical structure. Let's consider the flare dataset:
edges <- flare$edges
vertices <- flare$vertices
mygraph <- graph_from_data_frame(edges, vertices=vertices)
# Control the size of each circle: (use the size column of the vertices data frame)
ggraph(mygraph, layout = 'circlepack', weight=size) +
geom_node_circle() +
theme_void()
5.1.11 Map Color to Hierarchy Depth
Adding color to circular packing definitely makes sense. The first option is to map color to depth: the origin will have one color, level 1 another one, and so on.
As usual, you can play with the colour palette to fit your needs. Here are 2 examples with the viridis and the RColorBrewer palettes:
# Left: color depends on depth
p <- ggraph(mygraph, layout = 'circlepack', weight=size) +
  geom_node_circle(aes(fill = depth)) +
  theme_void() +
  theme(legend.position="none")
p
# Adjust color palette: viridis
p + scale_fill_viridis()
# Adjust color palette: colorBrewer
p + scale_fill_distiller(palette = "RdPu")
5.1.12 Add Labels
To add more insight to the plot, we often need to add labels to the circles. However, you can do it only if the number of circles is not too big. Note that you can use geom_node_text (left) or geom_node_label (right) to annotate leaves of the circle packing:
# Create a subset of the dataset (I remove 1 level)
edges <- flare$edges %>%
  filter(to %in% from) %>%
  droplevels()
vertices <- flare$vertices %>%
  filter(name %in% c(edges$from, edges$to)) %>%
  droplevels()
vertices$size <- runif(nrow(vertices))
# Rebuild the graph object
mygraph <- graph_from_data_frame( edges, vertices=vertices )
# Left
ggraph(mygraph, layout = 'circlepack', weight=size ) +
  geom_node_circle(aes(fill = depth)) +
  geom_node_text( aes(label=shortName, filter=leaf, fill=depth, size=size)) +
  theme_void() +
  theme(legend.position="none") +
  scale_fill_viridis()
# Right
ggraph(mygraph, layout = 'circlepack', weight=size ) +
  geom_node_circle(aes(fill = depth)) +
  geom_node_label( aes(label=shortName, filter=leaf, size=size)) +
  theme_void() +
  theme(legend.position="none") +
  scale_fill_viridis()
5.1.14 Hiding the First Level for Better Styling
I personally do not like to display the big circle that surrounds the whole chart (level 0, origin). This circle does not provide any information, and the chart looks better without it in my opinion.
To get rid of it, just specify a color equal to the background color in the scale_fill_manual() and scale_color_manual() functions. Following the same idea, you can get rid of as many levels of hierarchy as you like.
# Libraries
library(ggraph)
library(igraph)
library(tidyverse)
library(viridis)
# We need a data frame giving a hierarchical structure. Let's consider the flare dataset:
edges <- flare$edges
vertices <- flare$vertices
mygraph <- graph_from_data_frame( edges, vertices=vertices )
# Hide the first level (right)
ggraph(mygraph, layout = 'circlepack', weight=size) +
  geom_node_circle(aes(fill = as.factor(depth), color = as.factor(depth) )) +
  scale_fill_manual(values=c("0" = "white", "1" = viridis(4)[1], "2" = viridis(4)[2], "3" = viridis(4)[3], "4"=viridis(4)[4])) +
  scale_color_manual( values=c("0" = "white", "1" = "black", "2" = "black", "3" = "black", "4"="black") ) +
  theme_void() +
  theme(legend.position="none")
# Second one: hide 2 first levels
ggraph(mygraph, layout = 'circlepack', weight=size) +
  geom_node_circle(aes(fill = as.factor(depth), color = as.factor(depth) )) +
  scale_fill_manual(values=c("0" = "white", "1" = "white", "2" = magma(4)[2], "3" = magma(4)[3], "4"=magma(4)[4])) +
  scale_color_manual( values=c("0" = "white", "1" = "white", "2" = "black", "3" = "black", "4"="black") ) +
  theme_void() +
  theme(legend.position="none")
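The named vectors passed to scale_fill_manual() above can also be built programmatically, which helps when the number of levels changes. The following is only a sketch: it assumes depths 0 to 4 as in the calls above, and it uses base R's hcl.colors() (R 3.6 or later) as a stand-in for viridis(4) so that it runs without extra packages.

```r
# Depth levels assumed present in the hierarchy (0 = origin)
depths <- 0:4

# White for the origin, then one color per remaining level.
# hcl.colors() is used here as a stand-in for viridis(4).
fill_values <- setNames(c("white", hcl.colors(4, palette = "viridis")), depths)

# To hide the first two levels instead, overwrite their entries by name
fill_two_hidden <- fill_values
fill_two_hidden[c("0", "1")] <- "white"
```

Either vector could then be passed as the values argument of scale_fill_manual().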
5.1.15 Add Labels to a Specific Level of the Hierarchy
A related problem is to add labels for one specific level of the hierarchy only. For instance, you may want to display the names of the groups of level 2, but not of level 3, to avoid cluttering the chart.
To solve this issue, the trickiest part is to determine the level of each node in the edge list data frame. Fortunately, the data.tree library is here to help with its FromDataFrameNetwork() function. It isolates the level of each node, making it a breeze to select the labels to display.
# Add the data.tree library
library(data.tree)
# Rebuild the data
edges <- flare$edges
vertices <- flare$vertices
# Transform it in a 'tree' format
tree <- FromDataFrameNetwork(edges)
# Then I can easily get the level of each node, and add it to the initial data frame:
mylevels <- data.frame( name=tree$Get('name'), level=tree$Get("level") )
vertices <- vertices %>%
  left_join(., mylevels, by=c("name"="name"))
# Now we can add labels for level 2 only, for example:
vertices <- vertices %>%
  mutate(new_label=ifelse(level==2, shortName, NA))
mygraph <- graph_from_data_frame( edges, vertices=vertices )
# Make the graph
ggraph(mygraph, layout = 'circlepack', weight=size) +
  geom_node_circle(aes(fill = as.factor(depth), color = as.factor(depth) )) +
  scale_fill_manual(values=c("0" = "white", "1" = viridis(4)[1], "2" = viridis(4)[2], "3" = viridis(4)[3], "4"=viridis(4)[4])) +
  scale_color_manual( values=c("0" = "white", "1" = "black", "2" = "black", "3" = "black", "4"="black") ) +
  geom_node_label( aes(label=new_label), size=4) +
  theme_void() +
  theme(legend.position="none", plot.margin = unit(rep(0,4), "cm"))
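To make the notion of level concrete, here is a base R sketch that derives it directly from an edge list by walking up from each node to the root. The edge list and the node_level helper are hypothetical, and the convention that the root has level 1 is an assumption meant to mirror the level field returned by data.tree.

```r
# Hypothetical edge list: origin -> groups -> leaves
edges <- data.frame(
  from = c("origin", "origin", "group1", "group2"),
  to   = c("group1", "group2", "leaf1", "leaf2"),
  stringsAsFactors = FALSE
)

# Level of a node: 1 for the root, parent level + 1 otherwise
# (assumed to match the convention of data.tree)
node_level <- function(node, edges) {
  level  <- 1
  parent <- edges$from[match(node, edges$to)]
  while (!is.na(parent)) {
    level  <- level + 1
    parent <- edges$from[match(parent, edges$to)]
  }
  level
}

nodes <- unique(c(edges$from, edges$to))
levels_by_node <- vapply(nodes, node_level, numeric(1), edges = edges)

# Keep a label only for nodes of level 2
new_label <- ifelse(levels_by_node == 2, nodes, NA)
```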
5.1.16 Zoomable Circle Packing with R and circlepackeR
The circlepackeR package builds interactive and zoomable circle packing charts. This section explains how to use the package with different kinds of input datasets. Click on a group, and a smooth zoom will reveal the subgroups behind it.
Circle packing is a visualization method for hierarchical data. This kind of data can be stored in 2 main ways:
- Nested data frame
- Edge list
5.1.16.1 Circular Packing from Nested Data Frame
In a nested data frame, each line represents a leaf of the organization. Each column represents a level of the organization.
This data format will require the data.tree library to reformat the input dataset into something readable by circlepackeR.
# Circlepacker package
library(circlepackeR)
# devtools::install_github("jeromefroe/circlepackeR") # If needed
# create a nested data frame giving the info of a nested dataset:
data <- data.frame(
  root=rep("root", 15),
  group=c(rep("group A",5), rep("group B",5), rep("group C",5)),
  subgroup= rep(letters[1:5], each=3),
  subsubgroup=rep(letters[1:3], 5),
  value=sample(seq(1:15), 15)
)
# Change the format. This uses the data.tree library. This library needs a column that looks like root/group/subgroup/..., so I build it
library(data.tree)
data$pathString <- paste("world", data$group, data$subgroup, data$subsubgroup, sep = "/")
population <- as.Node(data)
# Make the plot
#circlepackeR(population, size = "value")
# You can customize the minimum and maximum value of the color range.
p <- circlepackeR(population, size = "value", color_min = "hsl(56,80%,80%)", color_max = "hsl(341,30%,40%)")
p
# save the widget
# library(htmlwidgets)
# saveWidget(p, file=paste0( getwd(), "/HtmlWidget/circular_packing_circlepackeR2.html"))
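The pathString column is the key step of the conversion. The base R sketch below builds it on a tiny made-up data frame and checks that every leaf ends up at the same depth; the root name "world" is taken from the code above, the rest of the data is hypothetical.

```r
# Tiny nested data frame: one row per leaf, one column per level
data <- data.frame(
  group    = c("group A", "group A", "group B"),
  subgroup = c("a", "b", "c"),
  value    = c(3, 5, 2),
  stringsAsFactors = FALSE
)

# One path per leaf: root name, then each level, separated by "/"
data$pathString <- paste("world", data$group, data$subgroup, sep = "/")

# Depth of each leaf = number of separators in its path
depth <- lengths(strsplit(data$pathString, "/", fixed = TRUE)) - 1
```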
5.1.17 Circular Packing from Edge List
The edge list format has at least 2 columns. It describes all the edges of the data.
This format is widespread. In this example, we just convert it to a nested data frame before plotting it as seen above.
# Circlepacker package
library(circlepackeR)
# devtools::install_github("jeromefroe/circlepackeR") # If needed
# Let's use the 'flare dataset' (stored in the ggraph library)
library(ggraph)
data_edge <- flare$edges
data_edge$from <- gsub(".*\\.","",data_edge$from)
data_edge$to <- gsub(".*\\.","",data_edge$to)
head(data_edge) # This is an edge list
# We need to convert it to a nested data frame. The data.tree library is our best friend for that:
library(data.tree)
data_tree <- FromDataFrameNetwork(data_edge)
data_nested <- ToDataFrameTree(data_tree,
  level1 = function(x) x$path[2],
  level2 = function(x) x$path[3],
  level3 = function(x) x$path[4],
  level4 = function(x) x$path[5])[-1,-1]
data_nested <- na.omit(data_nested)
# Now we can plot it as seen before!
data_nested$pathString <- paste("roots", data_nested$level1, data_nested$level2, data_nested$level3, data_nested$level4, sep = "/")
data_nested$value=1
data_Node <- as.Node(data_nested)
p <- circlepackeR(data_Node, size = "value")
p
# save the widget
# library(htmlwidgets)
# saveWidget(p, file=paste0( getwd(), "/HtmlWidget/circular_packing_circlepackeR1.html"))
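The two gsub() calls above shorten the dotted node names. A quick base R check on a few hand-written names (hypothetical examples in the style of the flare dataset) shows what the pattern does: ".*\\." is greedy, so it removes everything up to and including the last dot.

```r
full_names <- c("flare.analytics.cluster", "flare.vis.data.Tree", "flare")

# Drop everything up to and including the last dot;
# a name without any dot is left unchanged
short_names <- gsub(".*\\.", "", full_names)
```

Note that shortening names this way can make two different nodes share the same name if the same last component appears in two branches.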
5.2 Circular Stacked Barchart
A barchart can look pretty good using a circular layout, even if there are some caveats associated. If it interests you, visit the circular barchart section.
5.2.1 Circular Stacked Barplot
A circular barplot is a barplot where bars are displayed along a circle instead of a line. This page aims to teach you how to make a grouped and stacked circular barplot with R and ggplot2. I highly recommend visiting graphs #295, #296 and #297 before diving into this code, which is a bit rough.
I tried to add as many comments as possible in the code, and thus hope that the method is understandable. If it is not, please comment and ask for further explanations.
You first need to understand how to make a stacked barplot with ggplot2. Then understand how to properly add labels: calculating the right angles, flipping them if necessary, and adjusting their position. The trickiest part is probably adding space between each group. All these steps are described one by one in the circular barchart section.
# library
library(tidyverse)
library(viridis)
# Create dataset
# (group is stored as a factor because nlevels() and levels() are used on it below)
data <- data.frame(
  individual=paste( "Mister ", seq(1,60), sep=""),
  group=factor(c( rep('A', 10), rep('B', 30), rep('C', 14), rep('D', 6))) ,
  value1=sample( seq(10,100), 60, replace=T),
  value2=sample( seq(10,100), 60, replace=T),
  value3=sample( seq(10,100), 60, replace=T)
)
# Transform data in a tidy format (long format)
data <- data %>% gather(key = "observation", value="value", -c(1,2))
# Set a number of 'empty bar' to add at the end of each group
empty_bar <- 2
nObsType <- nlevels(as.factor(data$observation))
to_add <- data.frame( matrix(NA, empty_bar*nlevels(data$group)*nObsType, ncol(data)) )
colnames(to_add) <- colnames(data)
to_add$group <- rep(levels(data$group), each=empty_bar*nObsType )
data <- rbind(data, to_add)
data <- data %>% arrange(group, individual)
data$id <- rep( seq(1, nrow(data)/nObsType) , each=nObsType)
# Get the name and the y position of each label
label_data <- data %>% group_by(id, individual) %>% summarize(tot=sum(value))
number_of_bar <- nrow(label_data)
angle <- 90 - 360 * (label_data$id-0.5) /number_of_bar # I subtract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0)
label_data$hjust <- ifelse( angle < -90, 1, 0)
label_data$angle <- ifelse(angle < -90, angle+180, angle)
# prepare a data frame for base lines
base_data <- data %>%
  group_by(group) %>%
  summarize(start=min(id), end=max(id) - empty_bar) %>%
  rowwise() %>%
  mutate(title=mean(c(start, end)))
# prepare a data frame for grid (scales)
grid_data <- base_data
grid_data$end <- grid_data$end[ c( nrow(grid_data), 1:nrow(grid_data)-1)] + 1
grid_data$start <- grid_data$start - 1
grid_data <- grid_data[-1,]
# Make the plot
p <- ggplot(data) +
# Add the stacked bar
geom_bar(aes(x=as.factor(id), y=value, fill=observation), stat="identity", alpha=0.5) +
scale_fill_viridis(discrete=TRUE) +
# Add val=200/150/100/50/0 lines. I do it at the beginning to make sure barplots are OVER it.
geom_segment(data=grid_data, aes(x = end, y = 0, xend = start, yend = 0), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_data, aes(x = end, y = 50, xend = start, yend = 50), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_data, aes(x = end, y = 100, xend = start, yend = 100), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_data, aes(x = end, y = 150, xend = start, yend = 150), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_data, aes(x = end, y = 200, xend = start, yend = 200), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
# Add text showing the value of each 200/150/100/50/0 line
ggplot2::annotate("text", x = rep(max(data$id),5), y = c(0, 50, 100, 150, 200), label = c("0", "50", "100", "150", "200") , color="grey", size=6 , angle=0, fontface="bold", hjust=1) +
ylim(-150,max(label_data$tot, na.rm=T)) +
theme_minimal() +
theme(
  legend.position = "none",
  axis.text = element_blank(),
  axis.title = element_blank(),
  panel.grid = element_blank(),
  plot.margin = unit(rep(-1,4), "cm")
) +
coord_polar() +
# Add labels on top of each bar
geom_text(data=label_data, aes(x=id, y=tot+10, label=individual, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=5, angle= label_data$angle, inherit.aes = FALSE ) +
# Add base line information
geom_segment(data=base_data, aes(x = start, y = -5, xend = end, yend = -5), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE ) +
geom_text(data=base_data, aes(x = title, y = -18, label=group), hjust=c(1,1,0,0), colour = "black", alpha=0.8, size=4, fontface="bold", inherit.aes = FALSE)
p
# Save as png
#ggsave(p, file="output.png", width=10, height=10)
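The label angle computation in the code above can be isolated and tried on a small case. The sketch below wraps the same formula in a hypothetical helper, label_angle, and applies it to a circle of 4 bars so that the flipping on the left half of the circle is easy to follow.

```r
# Angle and horizontal justification for the label of bar `id` out of `number_of_bar`
label_angle <- function(id, number_of_bar) {
  # 0.5 is subtracted so that the label follows the center of the bar
  angle <- 90 - 360 * (id - 0.5) / number_of_bar
  data.frame(
    id    = id,
    # Labels on the left half are right-aligned and rotated by 180 degrees
    hjust = ifelse(angle < -90, 1, 0),
    angle = ifelse(angle < -90, angle + 180, angle)
  )
}

labels <- label_angle(1:4, number_of_bar = 4)
```

With 4 bars, the raw angles are 45, -45, -135 and -225 degrees; the last two fall below -90, so they are flipped to 45 and -45 and get hjust = 1.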
5.2.2 Stacked Barplot for Evolution
Stacked barplots are sometimes used to study an evolution, using each group on the X axis as a timestamp. There are many alternatives to that, like the streamgraph or the area chart:
5.2.3 Base R
A stacked area chart showing the evolution of a few baby names in the US. Zoom on a specific time frame through brushing. Highlight a specific group by hovering the legend. Double click to unzoom.
5.2.4 Stacking Barplot
# Libraries
library(tidyverse)
library(babynames)
library(streamgraph)
library(viridis)
library(hrbrthemes)
library(plotly)
# Load dataset from the babynames package
data <- babynames %>%
  filter(name %in% c("Amanda", "Jessica", "Patricia", "Deborah", "Dorothy", "Helen")) %>%
  filter(sex=="F")
# Plot
p <- data %>%
  ggplot( aes(x=year, y=n, fill=name, text=name)) +
  geom_area( ) +
  scale_fill_viridis(discrete = TRUE) +
  theme(legend.position="none") +
  ggtitle("Popularity of American names in the previous 30 years") +
  theme_ipsum() +
  theme(legend.position="none")
ggplotly(p, tooltip="text")
5.2.5 Three Periods Stacked Barplot
# create dummy data
don <- data.frame(
  x = rep(seq(2000,2005), 3),
  value = c( 75, 73, 68, 57, 36, 0, 15, 16, 17, 18, 19, 20, 10, 11, 15, 25, 45, 80),
  group = rep(c("A", "B", "C"), each=6)
)
# plot
don %>%
  ggplot( aes(x=x, y=value, fill=group)) +
  geom_area( ) +
  scale_fill_viridis(discrete = TRUE) +
  theme(legend.position="none") +
  theme_ipsum() +
  theme(legend.position="none")
5.2.6 Base R
5.2.7 Grouped, Stacked and Percent Stacked Barplot in Base R
This section explains how to build grouped, stacked and percent stacked barplot with base R. It provides a reproducible example with code for each type.
5.2.7.1 Grouped Barchart
A grouped barplot displays a numeric value for a set of entities split in groups and subgroups. Before trying to build one, check how to make a basic barplot with R and ggplot2.
A few explanations about the code below:
- Input dataset must be a numeric matrix. Each group is a column. Each subgroup is a row.
- The barplot() function will recognize this format, and automatically perform the grouping for you.
- The beside argument toggles between the grouped and the stacked barchart.
# Create data
set.seed(112)
data <- matrix(sample(1:30,15) , nrow=3)
colnames(data) <- c("A","B","C","D","E")
rownames(data) <- c("var1","var2","var3")
# Grouped barplot
barplot(data,
col=colors()[c(23,89,12)] ,
border="white",
font.axis=2,
beside=T,
legend=rownames(data),
xlab="group",
font.lab=2)
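The effect of beside can be seen without drawing anything, because barplot() returns the bar midpoints and accepts plot = FALSE. The sketch below uses a small made-up matrix with 2 subgroups (rows) and 3 groups (columns).

```r
# 2 subgroups (rows) x 3 groups (columns)
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("var1", "var2"), c("A", "B", "C")))

# plot = FALSE only computes the bar midpoints, nothing is drawn
mid_grouped <- barplot(m, beside = TRUE,  plot = FALSE)
mid_stacked <- barplot(m, beside = FALSE, plot = FALSE)
```

In the grouped case there is one bar per cell of the matrix, so the midpoints come back as a 2 x 3 matrix. In the stacked case there is one bar per column, so they come back as a vector of length 3.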
5.2.8 Grouped Stacked Barchart
A stacked barplot is very similar to the grouped barplot above. The subgroups are just displayed on top of each other, not beside. The stacked barchart is the default option of the barplot() function in base R, so you don’t need to use the beside argument.
# Create data
set.seed(112)
data <- matrix(sample(1:30,15) , nrow=3)
colnames(data) <- c("A","B","C","D","E")
rownames(data) <- c("var1","var2","var3")
# Get the stacked barplot
barplot(data,
col=colors()[c(23,89,12)] ,
border="white",
space=0.04,
font.axis=2,
xlab="group")
5.2.9 Percent Stacked Barplot
A percent stacked barchart displays the evolution of the proportion of each subgroup. The sum is always equal to 100%.
In base R, you have to manually compute the percentages, using the apply() function. This is more straightforward using ggplot2.
Note that here, a custom color palette is used, thanks to the RColorBrewer package.
# Create data
set.seed(1124)
data <- matrix(sample(1:30,15) , nrow=3)
colnames(data) <- c("A","B","C","D","E")
rownames(data) <- c("var1","var2","var3")
# create color palette:
library(RColorBrewer)
coul <- brewer.pal(3, "Pastel2")
# Transform this data in %
data_percentage <- apply(data, 2, function(x){x*100/sum(x,na.rm=T)})
# Make a stacked barplot--> it will be in %!
barplot(data_percentage, col=coul , border="white", xlab="group")
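The percentage step can be checked on a small matrix. The sketch below, with made-up numbers, repeats the apply() computation and compares it with base R's prop.table(), which gives the same column-wise proportions in one call.

```r
# 3 subgroups (rows) x 2 groups (columns)
data <- matrix(c(10, 20, 30, 5, 5, 10), nrow = 3,
               dimnames = list(c("var1", "var2", "var3"), c("A", "B")))

# Column-wise percentages, as in the code above
pct_apply <- apply(data, 2, function(x) x * 100 / sum(x, na.rm = TRUE))

# Same result with prop.table(): margin = 2 means "by column"
pct_prop <- prop.table(data, margin = 2) * 100

# Each column now sums to 100
col_totals <- colSums(pct_prop)
```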
5.2.10 Barplot for likert Type Items
This section shows how to use the likert R package. It builds 0-centered stacked barplots to study Likert type items.
Likert is an R package designed to help analyze and visualize Likert type items. It has been developed by Jason Bryer and Kim Speerschneider.
This barplot comes from the demo page and has been sent by Carlos Ortega.
It analyses the reading attitudes of a panel of people. Each line represents a question. The barplot shows how people feel about this question.
# library
library(likert)
# Use a provided dataset
data(pisaitems)
items28 <- pisaitems[, substr(names(pisaitems), 1, 5) == "ST24Q"]
# Build plot
p <- likert(items28)
plot(p)
5.3 Dendrogram
A dendrogram (or tree diagram) is a network structure. It consists of a root node that gives birth to several nodes connected by edges or branches. The last nodes of the hierarchy are called leaves. Many options are available to build one with R. This section aims to lead you toward the best strategy for your data.
5.3.0.1 Two Types of Dendrogram
Dendrograms can be built from:
- Hierarchical dataset: Think about a CEO managing team leads managing employees and so on.
- Clustering result: clustering divides a set of individuals into groups according to their similarity. Its result can be visualized as a tree.
5.3.1 Dendrogram from Hierarchical Data
The ggraph package is the best option to build a dendrogram from hierarchical data with R. It is based on the grammar of graphics and thus follows the same logic as ggplot2.
5.3.2 Dendrogram from Edge List
This section aims to describe how to make a basic dendrogram representing hierarchical data with the ggraph library. Two input formats are considered:
- edge list - 2 columns, one row is one connection.
- nested data frame - one row is one path from root to leaf. As many columns as the number of levels in the hierarchy.
See the customization section below to learn how to customize these dendrograms. If you want to create a dendrogram from a clustering result, visit the dendrogram section of the gallery.
Edge list is the most convenient format to use ggraph. Follow those steps:
- Transform the input dataframe to a graph object using the graph_from_data_frame() function from the igraph library.
- Use the dendrogram layout of ggraph with layout = 'dendrogram'.
# libraries
library(ggraph)
library(igraph)
library(tidyverse)
# create an edge list data frame giving the hierarchical structure of your individuals
d1 <- data.frame(from="origin", to=paste("group", seq(1,5), sep=""))
d2 <- data.frame(from=rep(d1$to, each=5), to=paste("subgroup", seq(1,25), sep="_"))
edges <- rbind(d1, d2)
# Create a graph object
mygraph <- graph_from_data_frame( edges )
# Basic tree
ggraph(mygraph, layout = 'dendrogram', circular = FALSE) +
geom_edge_diagonal() +
geom_node_point() +
theme_void()
5.3.3 Dendrogram from a Nested Dataframe
Another common format is the nested data frame. The code below shows how to easily transform it into an edge list. Once that is done, just apply the code described above once more.
# libraries
library(ggraph)
library(igraph)
library(tidyverse)
# create a data frame
data <- data.frame(
  level1="CEO",
  level2=c( rep("boss1",4), rep("boss2",4)),
  level3=paste0("mister_", letters[1:8])
)
# transform it to an edge list!
edges_level1_2 <- data %>% select(level1, level2) %>% unique %>% rename(from=level1, to=level2)
edges_level2_3 <- data %>% select(level2, level3) %>% unique %>% rename(from=level2, to=level3)
edge_list <- rbind(edges_level1_2, edges_level2_3)
# Now we can plot that
mygraph <- graph_from_data_frame( edge_list )
ggraph(mygraph, layout = 'dendrogram', circular = FALSE) +
  geom_edge_diagonal() +
  geom_node_point() +
  theme_void()
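The two-step conversion above handles exactly three levels. As a minimal sketch, assuming the columns of the nested data frame are ordered from root to leaf, the same idea can be written for any number of levels in base R. The helper name nested_to_edges is hypothetical and is not part of ggraph or igraph.

```r
# Hypothetical helper (not part of ggraph or igraph): build an edge list
# from a nested data frame whose columns are ordered from root to leaf.
nested_to_edges <- function(df) {
  stopifnot(ncol(df) >= 2)
  # One from/to table per pair of adjacent columns
  pairs <- lapply(seq_len(ncol(df) - 1), function(i) {
    data.frame(from = as.character(df[[i]]),
               to   = as.character(df[[i + 1]]),
               stringsAsFactors = FALSE)
  })
  # Stack them and drop repeated connections
  unique(do.call(rbind, pairs))
}

nested <- data.frame(
  level1 = "CEO",
  level2 = c(rep("boss1", 4), rep("boss2", 4)),
  level3 = paste0("mister_", letters[1:8])
)
edge_list <- nested_to_edges(nested)
edge_list
```

The resulting data frame has the same from and to columns as the edge lists used earlier, so it should be usable as input to graph_from_data_frame().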
5.3.4 Dendrogram Customization with R and ggraph
This section follows the previous introduction to ggraph and dendrograms. It shows how to customize the dendrogram: layout, edge style, node features and more.
Start by creating a dataset and a graph object using the igraph package.
# Libraries
library(ggraph)
library(igraph)
library(tidyverse)
theme_set(theme_void())
# data: edge list
d1 <- data.frame(from="origin", to=paste("group", seq(1,7), sep=""))
d2 <- data.frame(from=rep(d1$to, each=7), to=paste("subgroup", seq(1,49), sep="_"))
edges <- rbind(d1, d2)
# We can add a second data frame with information for each node!
name <- unique(c(as.character(edges$from), as.character(edges$to)))
vertices <- data.frame(
  name=name,
  group=c( rep(NA,8) , rep( paste("group", seq(1,7), sep=""), each=7)),
  cluster=sample(letters[1:4], length(name), replace=T),
  value=sample(seq(10,30), length(name), replace=T)
)
# Create a graph object
mygraph <- graph_from_data_frame( edges, vertices=vertices)
5.3.5 Circular or Linear layout
First of all, you can choose between a linear and a circular representation using the circular argument of ggraph().
Note: a customized version of the circular dendrogram is available here, with more node features and labels.
# Left
ggraph(mygraph, layout = 'dendrogram', circular = FALSE) +
geom_edge_diagonal()
# Right
ggraph(mygraph, layout = 'dendrogram', circular = TRUE) +
geom_edge_diagonal()
5.3.6 Edge Style
Then you can choose between different styles for your edges. The ggraph package comes with 2 main functions: geom_edge_link() and geom_edge_diagonal().
Note that the most usual elbow representation is not implemented for hierarchical data yet.
# Left
ggraph(mygraph, layout = 'dendrogram') +
geom_edge_link()
# Right
ggraph(mygraph, layout = 'dendrogram') +
geom_edge_diagonal()
5.3.7 Labels and Nodes
You probably want to add labels to give more insight into your tree, and possibly nodes. This can be done using geom_node_text() and geom_node_point() respectively.
Note: adding labels is a bit trickier for a circular dendrogram; a solution is suggested in graph #339.
# Left
ggraph(mygraph, layout = 'dendrogram') +
geom_edge_diagonal() +
geom_node_text(aes( label=name, filter=leaf) , angle=90 , hjust=1, nudge_y = -0.01) +
ylim(-.4, NA)
# Right
ggraph(mygraph, layout = 'dendrogram') +
geom_edge_diagonal() +
geom_node_text(aes( label=name, filter=leaf) , angle=90 , hjust=1, nudge_y = -0.04) +
geom_node_point(aes(filter=leaf) , alpha=0.6) +
ylim(-.5, NA)
5.3.8 Customize Aesthetics
It is a common task to add colors or shapes to your dendrogram. This shows the organization of the dataset more clearly.
ggraph works the same way as ggplot2. In the aesthetics part of each component, you can map a column of your initial data frame to a shape, a color, a size or another feature.
ggraph(mygraph, layout = 'dendrogram') +
geom_edge_diagonal() +
geom_node_text(aes( label=name, filter=leaf, color=group) , angle=90 , hjust=1, nudge_y=-0.1) +
geom_node_point(aes(filter=leaf, size=value, color=group) , alpha=0.6) +
ylim(-.6, NA) +
theme(legend.position="none")
5.3.9 Circular Dendrogram with R and ggraph
This section shows how to build a customized circular dendrogram using R and the ggraph package. It provides explanation and reproducible code.
The circular dendrogram of the ggraph library deserves its own page because adjusting the labels can be a bit tricky. They need a good angle, must be flipped upside down on the left part of the chart, and their alignment needs to be adjusted as well.
The following piece of code should allow you to display them correctly. Note that graph #334 gives the basics of dendrograms with ggraph, and that graph #335 presents a few possible customizations.
# Libraries
library(ggraph)
library(igraph)
library(tidyverse)
library(RColorBrewer)
# create a data frame giving the hierarchical structure of your individuals
d1=data.frame(from="origin", to=paste("group", seq(1,10), sep=""))
d2=data.frame(from=rep(d1$to, each=10), to=paste("subgroup", seq(1,100), sep="_"))
edges=rbind(d1, d2)
# create a vertices data.frame. One line per object of our hierarchy
vertices = data.frame(
  name = unique(c(as.character(edges$from), as.character(edges$to))) ,
  value = runif(111)
)
# Let's add a column with the group of each name. It will be useful later to color points
vertices$group = edges$from[ match( vertices$name, edges$to ) ]
# Let's add information concerning the label we are going to add: angle, horizontal adjustment and potential flip
# calculate the ANGLE of the labels
vertices$id=NA
myleaves=which(is.na( match(vertices$name, edges$from) ))
nleaves=length(myleaves)
vertices$id[ myleaves ] = seq(1:nleaves)
vertices$angle= 90 - 360 * vertices$id / nleaves
# calculate the alignment of labels: right or left
# If I am on the left part of the plot, my labels have currently an angle < -90
vertices$hjust<-ifelse( vertices$angle < -90, 1, 0)
# flip angle by 180 degrees to make them readable
vertices$angle<-ifelse(vertices$angle < -90, vertices$angle+180, vertices$angle)
# Create a graph object
mygraph <- graph_from_data_frame( edges, vertices=vertices )
# Make the plot
ggraph(mygraph, layout = 'dendrogram', circular = TRUE) +
geom_edge_diagonal(colour="grey") +
scale_edge_colour_distiller(palette = "RdPu") +
geom_node_text(aes(x = x*1.15, y=y*1.15, filter = leaf, label=name, angle = angle, hjust=hjust, colour=group), size=2.7, alpha=1) +
geom_node_point(aes(filter = leaf, x = x*1.07, y=y*1.07, colour=group, size=value, alpha=0.2)) +
scale_colour_manual(values= rep( brewer.pal(9,"Paired") , 30)) +
scale_size_continuous( range = c(0.1,10) ) +
theme_void() +
theme(
  legend.position="none",
  plot.margin=unit(c(0,0,0,0),"cm")
) +
expand_limits(x = c(-1.3, 1.3), y = c(-1.3, 1.3))
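The label logic above (angle, alignment, flip) can be checked in isolation on a handful of leaves. The sketch below is illustrative only: leaf_label_layout is a hypothetical helper name, and it assumes leaves are numbered 1 to n and spread evenly around the circle, as in the code above.

```r
# Hypothetical helper: label angle and alignment for n leaves placed
# evenly around a circle, following the formula used in the code above.
leaf_label_layout <- function(n) {
  id <- seq_len(n)
  angle <- 90 - 360 * id / n
  # Labels on the left half of the plot have an angle below -90
  flip <- angle < -90
  data.frame(
    id    = id,
    hjust = ifelse(flip, 1, 0),              # right-align flipped labels
    angle = ifelse(flip, angle + 180, angle) # rotate them to stay readable
  )
}

layout8 <- leaf_label_layout(8)
layout8
```

After the flip, every angle falls between -90 and 90 degrees, which is why no label ends up upside down.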
5.3.10 Interactive Dendrogram with R
This section describes how to use the collapsibleTree package to build an interactive tree diagram. Explanation and reproducible code provided.
The collapsibleTree package is the best option to build an interactive dendrogram with R.
The input must be a data frame that stores the hierarchical information. Basically, each row describes a complete path from the root to the leaf. In this example, the warpbreaks dataset has 3 columns: wool, tension and breaks.
# Load library
# install.packages("collapsibleTree")
library(collapsibleTree)
# input data must be a nested data frame:
head(warpbreaks)
# Represent this tree:
p <- collapsibleTree( warpbreaks, c("wool", "tension", "breaks"))
p
# save the widget
# library(htmlwidgets)
# saveWidget(p, file=paste0( getwd(), "/HtmlWidget/dendrogram_interactive.html"))
5.3.11 Dendrogram from Clustering Result
Hierarchical clustering is a common task in data science and can be performed with the hclust() function in R. The following examples will guide you through the process, showing how to prepare the data, how to run the clustering and how to build an appropriate chart to visualize its result.
5.3.11.1 Most Basic Dendrogram with R
- Input dataset is a matrix where each row is a sample, and each column is a variable. Keep in mind you can transpose a matrix using the t() function if needed.
- Clustering is performed on a square matrix (sample x sample) that provides the distance between samples. It can be computed using the dist() or the cor() function, depending on the question you are asking.
- The hclust() function is used to perform the hierarchical clustering.
- Its output can be visualized directly with the plot() function. See possible customization.
# Dataset
data <- matrix( sample(seq(1,2000),200), ncol = 10 )
rownames(data) <- paste0("sample_" , seq(1,20))
colnames(data) <- paste0("variable",seq(1,10))
# Euclidean distance
dist <- dist(data[ , c(4:8)] , diag=TRUE)
# Hierarchical Clustering with hclust
hc <- hclust(dist)
# Plot the result
plot(hc)
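The example above uses Euclidean distance. As a rough sketch of the correlation-based alternative mentioned in the steps, one common convention (an assumption here, not the only possible choice) is to use 1 minus the correlation as a dissimilarity. The data and object names below are made up for illustration.

```r
set.seed(1)
# Made-up data: 20 samples in rows, 10 variables in columns
mat <- matrix(rnorm(200), ncol = 10)
rownames(mat) <- paste0("sample_", seq_len(20))

# cor() correlates columns, so transpose to get a sample x sample matrix
sample_cor <- cor(t(mat))

# Turn similarity into dissimilarity and convert it to a "dist" object
cor_dist <- as.dist(1 - sample_cor)

# Cluster and plot as before
hc_cor <- hclust(cor_dist)
plot(hc_cor)
```

With this distance, two samples are close when their profiles rise and fall together, even if their absolute values differ.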
5.3.11.2 Hierarchical clustering principle:
- Take distances between objects.
- Seek the smallest distance between 2 objects.
- Aggregate the 2 objects in a cluster.
- Replace them with their barycenter.
- Repeat until only one cluster remains, containing every point.
There are several ways to calculate the distance between 2 clusters: using the max between 2 points of the clusters (complete linkage, the hclust() default), the mean, the min, or Ward's criterion.
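To see how the choice of between-cluster distance changes the tree, the sketch below runs hclust() with several values of its method argument on made-up points. The data and object names are illustrative only.

```r
set.seed(42)
# Made-up data: 20 points in 2 dimensions
pts <- matrix(rnorm(40), ncol = 2)
d <- dist(pts)

# Same distances, four ways of measuring the gap between clusters
methods <- c("complete", "average", "single", "ward.D2")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Height of the last merge, where everything joins into one cluster
last_merge <- sapply(trees, function(h) max(h$height))
last_merge

# The method actually used when none is given
default_method <- hclust(d)$method
default_method
```

Heights are not directly comparable across methods, since each one defines the distance between clusters differently, but the shape of the trees can still be compared.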
5.3.12 Zoom on a Group
It is possible to zoom in on a specific part of the tree. Select the group of interest using the [[..]] operator:
# store the dendrogram in an object
dhc <- as.dendrogram(hc)
# set the margin
par(mar=c(4,4,2,2))
# Plot the Second group
plot(dhc[[2]] , main= "zoom on a part of the dendrogram")
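The [[..]] operator follows the successive splits of the tree. When it is easier to think in terms of height, the cut() method for dendrograms offers another way in. The sketch below uses the built-in USArrests dataset rather than the random matrix above, and cutting at half the total height is an arbitrary choice.

```r
hc_demo <- hclust(dist(USArrests))
dend_demo <- as.dendrogram(hc_demo)

# Cut at half the total height: $upper is the top of the tree,
# $lower is a list of the sub-dendrograms hanging below the cut
cut_height <- attr(dend_demo, "height") / 2
parts <- cut(dend_demo, h = cut_height)

# Number of branches below the cut, and a zoom on the first one
length(parts$lower)
plot(parts$lower[[1]], main = "first branch below the cut")
```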
5.3.13 Dendrogram with Color and Legend in R
This section describes how to apply a clustering method to a dataset and visualize the result as a dendrogram with colors and legends.
This is an upgrade of the basic dendrogram presented in figure #29. Please refer to that section to understand how a dendrogram works.
In this example, we just show how to add specific colors to leaves and sample names. This makes it possible to check whether the expected groups are indeed found after clustering.
# Build dataset (just copy and paste, this is NOT interesting)
sample <- paste(rep("sample_",24) , seq(1,24) , sep="")
specie <- c(rep("dicoccoides" , 8) , rep("dicoccum" , 8) , rep("durum" , 8))
treatment <- rep(c(rep("High",4 ) , rep("Low",4)),3)
data <- data.frame(sample,specie,treatment)
for (i in seq(1:5)){
  gene=sample(c(1:40) , 24 )
  data=cbind(data , gene)
  colnames(data)[ncol(data)]=paste("gene_",i,sep="")
}
data[data$treatment=="High" , c(4:8)]=data[data$treatment=="High" , c(4:8)]+100
data[data$specie=="durum" , c(4:8)]=data[data$specie=="durum" , c(4:8)]-30
rownames(data) <- data[,1]
# Have a look at the dataset
# head(data)
# Compute Euclidean distance between samples
dist=dist(data[ , c(4:8)] , diag=TRUE)
# Perform clustering with hclust
hc <- hclust(dist)
dhc <- as.dendrogram(hc)
# Actually, each leaf of the tree has several attributes, like the color, the shape.. Have a look at it:
specific_leaf <- dhc[[1]][[1]][[1]]
# specific_leaf
# attributes(specific_leaf)
# So if I want to color each leaf of the tree, I have to change the attributes of each leaf. This can be done using the dendrapply function.
# So I create a function that adds 3 attributes to the leaf: one for the color (lab.col), one for the font (lab.font) and one for the size (lab.cex).
i=0
colLab <- function(n){
  if(is.leaf(n)){
    # I take the current attributes
    a=attributes(n)
    # I deduce the line in the original data, and so the treatment and the specie.
    ligne=match(attributes(n)$label,data[,1])
    treatment=data[ligne,3];
    if(treatment=="Low"){col_treatment="blue"};if(treatment=="High"){col_treatment="red"}
    specie=data[ligne,2];
    if(specie=="dicoccoides"){col_specie="red"};if(specie=="dicoccum"){col_specie="Darkgreen"};if(specie=="durum"){col_specie="blue"}
    # Modification of leaf attribute
    attr(n,"nodePar")<-c(a$nodePar,list(cex=1.5,lab.cex=1,pch=20,col=col_treatment,lab.col=col_specie,lab.font=1,lab.cex=1))
  }
  return(n)
}
# Finally I just have to apply this to my dendrogram
dL <- dendrapply(dhc, colLab)
# And the plot
plot(dL , main="structure of the population")
legend("topright",
legend = c("High Nitrogen" , "Low Nitrogen" , "Durum" , "Dicoccoides" , "Dicoccum"),
col = c("red", "blue" , "blue" , "red" , "Darkgreen"),
pch = c(20,20,4,4,4), bty = "n", pt.cex = 1.5, cex = 0.8 ,
text.col = "black", horiz = FALSE, inset = c(0, 0.1))
5.3.14 More Customization with Dendextend
The dendextend package goes one step further in terms of dendrogram customization. Here is a set of examples showing the main possibilities, like adding a color bar at the bottom, drawing 2 trees face to face and more.
5.3.15 Customized Dendrogram with R and the Dendextend Package
The dendextend package makes it possible to apply all kinds of customization to a dendrogram: coloring nodes and labels, putting several trees face to face and more.
5.3.15.1 Basic Dendrogram
First of all, let’s recall how to build a basic dendrogram with R:
- Input dataset is a data frame with individuals in rows and features in columns.
- dist() is used to compute the distance between samples.
- hclust() performs the hierarchical clustering.
- The plot() function can plot the output directly as a tree.
# Library
library(tidyverse)
library(dendextend)
# Data
head(mtcars)
# Clusterization using 3 variables
mtcars %>%
  select(mpg, cyl, disp) %>%
  dist() %>%
  hclust() %>%
  as.dendrogram() -> dend
# Plot
par(mar=c(7,3,1,1)) # Increase bottom margin to have the complete label
plot(dend)
5.3.16 The set() Function
The set() function of dendextend modifies the attributes of a specific part of the tree.
You can customize cex, lwd, col and lty for branches and labels, for example. You can also customize the nodes or the leaves. The code below illustrates this concept:
# library
library(dendextend)
# Chart (left)
dend %>%
  # Custom branches
  set("branches_col", "grey") %>% set("branches_lwd", 3) %>%
  # Custom labels
  set("labels_col", "orange") %>% set("labels_cex", 0.8) %>%
  plot()
# Middle
dend %>%
  set("nodes_pch", 19) %>%
  set("nodes_cex", 0.7) %>%
  set("nodes_col", "orange") %>%
  plot()
# Right
dend %>%
  set("leaves_pch", 19) %>%
  set("leaves_cex", 0.7) %>%
  set("leaves_col", "skyblue") %>%
  plot()
5.3.17 Highlight Clusters
The dendextend library has some good functionalities to highlight the tree clusters.
You can color branches and labels according to their cluster, specifying the number of clusters you want. The rect.dendrogram() function can even highlight one or several specific clusters with a rectangle.
# Color in function of the cluster
par(mar=c(1,1,1,7))
dend %>%
  set("labels_col", value = c("skyblue", "orange", "grey"), k=3) %>%
  set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3) %>%
  plot(horiz=TRUE, axes=FALSE)
abline(v = 350, lty = 2)
# Highlight a cluster with rectangle
par(mar=c(9,1,1,1))
dend %>%
  set("labels_col", value = c("skyblue", "orange", "grey"), k=3) %>%
  set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3) %>%
  plot(axes=FALSE)
rect.dendrogram( dend, k=3, lty = 5, lwd = 0, x=1, col=rgb(0.1, 0.2, 0.4, 0.1) )
5.3.18 Comparing with an Expected Clustering
It is a common task to compare the clusters you get with an expected distribution.
In the mtcars dataset we used to build our dendrogram, there is an am column that is a binary variable. We can check whether this variable is consistent with the clusters we got using the colored_bars() function.
# Create a vector of colors: forestgreen if am is 0, green if 1.
my_colors <- ifelse(mtcars$am==0, "forestgreen", "green")
# Make the dendrogram
par(mar=c(10,1,1,1))
dend %>%
  set("labels_col", value = c("skyblue", "orange", "grey"), k=3) %>%
  set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3) %>%
  set("leaves_pch", 19) %>%
  set("nodes_cex", 0.7) %>%
  plot(axes=FALSE)
# Add the colored bar
colored_bars(colors = my_colors, dend = dend, rowLabels = "am")
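The colored bar gives a visual check. A numeric version of the same comparison can be sketched with base R only, by cutting the tree into 3 clusters with cutree() and cross-tabulating the result with am. The object names are illustrative, and the clustering is rebuilt here so the block runs on its own.

```r
# Rebuild the clustering on the same 3 variables, without dendextend
hc_cars <- hclust(dist(mtcars[, c("mpg", "cyl", "disp")]))

# Cluster membership of each car when the tree is cut into 3 groups
cluster <- cutree(hc_cars, k = 3)

# Cross-tabulate clusters against the am variable
agreement <- table(cluster = cluster, am = mtcars$am)
agreement

# Share of each am value within each cluster
round(prop.table(agreement, margin = 1), 2)
```

A cluster whose row is dominated by one am value is consistent with that variable, while a row split evenly suggests no relationship.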
5.3.19 Comparing 2 Dendrograms with tanglegram()
It is possible to compare 2 dendrograms using the tanglegram() function.
Here it illustrates a very important concept: when you calculate your distance matrix and when you run your hierarchical clustering algorithm, you cannot simply use the default options without thinking about what you’re doing. Have a look at the differences between 2 different clustering methods.
# Make 2 dendrograms, using 2 different clustering methods
d1 <- USArrests %>% dist() %>% hclust( method="average" ) %>% as.dendrogram()
d2 <- USArrests %>% dist() %>% hclust( method="complete" ) %>% as.dendrogram()
# Customize these dendrograms, and place them in a list
dl <- dendlist(
  d1 %>%
    set("labels_col", value = c("skyblue", "orange", "grey"), k=3) %>%
    set("branches_lty", 1) %>%
    set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3),
  d2 %>%
    set("labels_col", value = c("skyblue", "orange", "grey"), k=3) %>%
    set("branches_lty", 1) %>%
    set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3)
)
# Plot them together
tanglegram(dl,
common_subtrees_color_lines = FALSE, highlight_distinct_edges = TRUE, highlight_branches_lwd=FALSE,
margin_inner=7,
lwd=2
)
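Besides the visual comparison, one way to put a number on how much two trees agree is to correlate their cophenetic distances, using the base R cophenetic() function. This is a sketch of one possible measure among several, and the object names are illustrative.

```r
d <- dist(USArrests)
hc_average  <- hclust(d, method = "average")
hc_complete <- hclust(d, method = "complete")

# Cophenetic distance: the height at which two observations first join
coph_average  <- cophenetic(hc_average)
coph_complete <- cophenetic(hc_complete)

# Correlation between the two trees, and of each tree with the raw distances
tree_agreement <- cor(coph_average, coph_complete)
fit_average    <- cor(d, coph_average)
fit_complete   <- cor(d, coph_complete)
c(tree_agreement, fit_average, fit_complete)
```

A value close to 1 means the two trees group observations at similar relative heights. Lower values point to the kind of differences the tanglegram makes visible.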
5.4 Donut Chart
A donut or doughnut chart is a ring divided into sectors that each represent a proportion of the whole. It is very close to a pie chart and thus suffers from the same problems. In R, it can be built with both ggplot2 and base R. There is no specific geom to build donut charts with ggplot2.
5.4.1 Most Basic Doughnut Chart with ggplot2
The ggplot2 package allows you to build donut charts. Note however that this relies on a hack, since no specific function has been created for this kind of chart. (This is deliberate, to discourage donut charts, which are considered bad dataviz practice.)
Here is the process:
* Input data provides a numeric variable for a set of entities.
* Absolute numeric values must be translated to proportions.
* Group positions must be stacked: we are going to display them one after the other.
* geom_rect() is used to plot each group as a rectangle.
* coord_polar() is used to switch from stacked rectangles to a ring.
* xlim() switches from pie to donut: it adds the empty circle in the middle.
# load library
library(ggplot2)
# Create test data.
data <- data.frame(
  category=c("A", "B", "C"),
  count=c(10, 60, 30)
)
# Compute percentages
data$fraction = data$count / sum(data$count)
# Compute the cumulative percentages (top of each rectangle)
data$ymax = cumsum(data$fraction)
# Compute the bottom of each rectangle
data$ymin = c(0, head(data$ymax, n=-1))
# Make the plot
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
geom_rect() +
coord_polar(theta="y") + # Try to remove that to understand how the chart is built initially
xlim(c(2, 4)) # Try to remove that to see how to make a pie chart
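The stacking arithmetic used above can be checked on its own, without drawing anything. The sketch below wraps it in a small helper; donut_positions is a hypothetical name used only for this illustration.

```r
# Hypothetical helper: compute the stacking columns needed by geom_rect()
donut_positions <- function(count) {
  fraction <- count / sum(count)      # share of the whole
  ymax <- cumsum(fraction)            # top of each rectangle
  ymin <- c(0, head(ymax, n = -1))    # bottom = top of the previous one
  data.frame(count, fraction, ymin, ymax)
}

pos <- donut_positions(c(10, 60, 30))
pos
```

Each rectangle starts where the previous one ends, and its height (ymax minus ymin) equals its fraction, so the last ymax is always 1.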
5.4.2 Customization
Here are a couple of things you can do to improve your donut chart style:
- Use theme_void() to get rid of the unnecessary background, axis, labels and so on.
- Use a better color palette.
- Don’t use a legend, add labels to groups directly.
# load library
library(ggplot2)
# Create test data.
data <- data.frame(
  category=c("A", "B", "C"),
  count=c(10, 60, 30)
)
# Compute percentages
data$fraction <- data$count / sum(data$count)
# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)
# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))
# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2
# Compute a good label
data$label <- paste0(data$category, "\n value: ", data$count)
# Make the plot
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
geom_rect() +
geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
scale_fill_brewer(palette=4) +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "none")
5.4.3 Donut Thickness
It is important to understand that donut charts are just stacked rectangles that are made circular thanks to coord_polar().
Thus, the empty circle that makes it a donut chart is just the space between the initial Y axis and the left part of the rectangle.
- If the xlim left boundary is big (close to the left edge of the rectangles), there is no empty circle: you get a pie chart.
- If the xlim left boundary is low, the ring becomes thinner.
If you don’t get it, just plot the chart without coord_polar().
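The link between the xlim left boundary and the size of the hole can be sketched numerically. The helper below is hypothetical and rests on a stated assumption: the radius is mapped linearly from the lower xlim (the center) to the upper xlim (the outer edge), with no extra padding added by the scale. It returns the radius of the hole as a fraction of the outer radius, for rectangles that start at xmin = 3 as in the examples.

```r
# Hypothetical helper: relative radius of the hole, assuming the radius is
# mapped linearly from xlim[1] (center) to xlim[2] (outer edge).
hole_ratio <- function(xlim, xmin = 3) {
  ratio <- (xmin - xlim[1]) / (xlim[2] - xlim[1])
  max(0, ratio)
}

pie_like   <- hole_ratio(c(3, 4))   # left boundary at xmin: no hole
donut      <- hole_ratio(c(2, 4))   # the setting used in the basic example
thin_ring  <- hole_ratio(c(-1, 4))  # low left boundary: large hole, thin ring
c(pie_like, donut, thin_ring)
```

Under that assumption, lowering the left boundary pushes the rectangles toward the outside of the disc, which is why the ring gets thinner.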
# load library
library(ggplot2)
# Create test data.
data <- data.frame(
  category=c("A", "B", "C"),
  count=c(10, 60, 30)
)
# Compute percentages
data$fraction <- data$count / sum(data$count)
# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)
# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))
# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2
# Compute a good label
data$label <- paste0(data$category, "\n value: ", data$count)
# Make the plot
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
geom_rect() +
geom_text( x=2, aes(y=labelPosition, label=label, color=category), size=6) + # x here controls label position (inner / outer)
scale_fill_brewer(palette=3) +
scale_color_brewer(palette=3) +
coord_polar(theta="y") +
xlim(c(-1, 4)) +
theme_void() +
theme(legend.position = "none")
5.4.4 Donut Chart with Base R
It is also possible to build your donut chart without using any library. The example shows how, providing a reusable function that you can quickly apply to your input dataset.
To draw a donut plot, the easiest way is to use ggplot2, as suggested in graph #128.
If you want to stick to base R however, the function given below should allow you to get there.
Important: this function comes from here.
# The doughnut function permits to draw a donut plot
doughnut <-
function (x, labels = names(x), edges = 200, outer.radius = 0.8,
          inner.radius=0.6, clockwise = FALSE,
          init.angle = if (clockwise) 90 else 0, density = NULL,
          angle = 45, col = NULL, border = FALSE, lty = NULL,
          main = NULL, ...)
{
    if (!is.numeric(x) || any(is.na(x) | x < 0))
        stop("'x' values must be positive.")
    if (is.null(labels))
        labels <- as.character(seq_along(x))
    else labels <- as.graphicsAnnot(labels)
    x <- c(0, cumsum(x)/sum(x))
    dx <- diff(x)
    nx <- length(dx)
    plot.new()
    pin <- par("pin")
    xlim <- ylim <- c(-1, 1)
    if (pin[1L] > pin[2L])
        xlim <- (pin[1L]/pin[2L]) * xlim
    else ylim <- (pin[2L]/pin[1L]) * ylim
    plot.window(xlim, ylim, "", asp = 1)
    if (is.null(col))
        col <- if (is.null(density))
            palette()
        else par("fg")
    col <- rep(col, length.out = nx)
    border <- rep(border, length.out = nx)
    lty <- rep(lty, length.out = nx)
    angle <- rep(angle, length.out = nx)
    density <- rep(density, length.out = nx)
    twopi <- if (clockwise)
        -2 * pi
    else 2 * pi
    t2xy <- function(t, radius) {
        t2p <- twopi * t + init.angle * pi/180
        list(x = radius * cos(t2p),
             y = radius * sin(t2p))
    }
    for (i in 1L:nx) {
        n <- max(2, floor(edges * dx[i]))
        P <- t2xy(seq.int(x[i], x[i + 1], length.out = n),
                  outer.radius)
        polygon(c(P$x, 0), c(P$y, 0), density = density[i],
                angle = angle[i], border = border[i],
                col = col[i], lty = lty[i])
        Pout <- t2xy(mean(x[i + 0:1]), outer.radius)
        lab <- as.character(labels[i])
        if (!is.na(lab) && nzchar(lab)) {
            lines(c(1, 1.05) * Pout$x, c(1, 1.05) * Pout$y)
            text(1.1 * Pout$x, 1.1 * Pout$y, labels[i],
                 xpd = TRUE, adj = ifelse(Pout$x < 0, 1, 0),
                 ...)
        }
        ## Add white disc
        Pin <- t2xy(seq.int(0, 1, length.out = n*nx),
                    inner.radius)
        polygon(Pin$x, Pin$y, density = density[i],
                angle = angle[i], border = border[i],
                col = "white", lty = lty[i])
    }
    title(main = main, ...)
    invisible(NULL)
}
# Let's use the function, it works like PiePlot !
# inner.radius controls the width of the ring!
doughnut( c(3,5,9,12) , inner.radius=0.5, col=c(rgb(0.2,0.2,0.4,0.5), rgb(0.8,0.2,0.4,0.5), rgb(0.2,0.9,0.4,0.4) , rgb(0.0,0.9,0.8,0.4)) )
5.5 Grouped and Stacked Barplot
Grouped and stacked barplots display a numeric value for several entities, organised in groups and subgroups. It is probably better to have a solid understanding of the basic barplot first.
5.5.0.1 Step by Step - ggplot2
ggplot2 is probably the best option to build grouped and stacked barcharts. The input data frame must have 2 categorical variables that will be passed to the x and fill arguments of the aes() function. Toggling from grouped to stacked is pretty easy thanks to the position argument.
5.5.1 Grouped, Stacked and Percent Stacked Barplot in ggplot2
This section explains how to build grouped, stacked and percent stacked barplots with R and ggplot2. It provides a reproducible example with code for each type.
5.5.1.1 Grouped Barchart
A grouped barplot displays a numeric value for a set of entities split into groups and subgroups. Before trying to build one, check how to make a basic barplot with R and ggplot2.
A few explanations about the code below:
- Input dataset must provide 3 columns: the numeric value (value), and 2 categorical variables for the group (specie) and the subgroup (condition) levels.
- In the aes() call, x is the group (specie), and the subgroup (condition) is given to the fill argument.
- In the geom_bar() call, position="dodge" must be specified to have the bars one beside each other.
# library
library(ggplot2)
# create a dataset
specie <- c(rep("sorgho" , 3) , rep("poacee" , 3) , rep("banana" , 3) , rep("triticum" , 3) )
condition <- rep(c("normal" , "stress" , "Nitrogen") , 4)
value <- abs(rnorm(12 , 0 , 15))
data <- data.frame(specie,condition,value)
# Grouped
ggplot(data, aes(fill=condition, y=value, x=specie)) +
geom_bar(position="dodge", stat="identity")
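The 3-column input described above is a long format. If your data comes as a wide table instead (one row per group, one column per subgroup), it can be reshaped with base R. The sketch below uses a made-up matrix, and the numbers carry no meaning.

```r
# Made-up wide table: one row per specie, one column per condition
wide <- matrix(
  c( 5, 12,  8,
     3,  9, 14,
     7,  2, 11,
    10,  6,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    specie    = c("sorgho", "poacee", "banana", "triticum"),
    condition = c("normal", "stress", "Nitrogen")
  )
)

# One row per (specie, condition) pair, with the value in its own column
long <- as.data.frame(as.table(wide), responseName = "value")
head(long)
```

The resulting data frame has the specie, condition and value columns expected by the ggplot2 calls in this section.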
5.5.2 Stacked Barchart
A stacked barplot is very similar to the grouped barplot above. The subgroups are just displayed on top of each other, not beside.
The only thing to change to get this figure is to switch the position argument to "stack".
# library
library(ggplot2)
# create a dataset
specie <- c(rep("sorgho" , 3) , rep("poacee" , 3) , rep("banana" , 3) , rep("triticum" , 3) )
condition <- rep(c("normal" , "stress" , "Nitrogen") , 4)
value <- abs(rnorm(12 , 0 , 15))
data <- data.frame(specie,condition,value)
# Stacked
ggplot(data, aes(fill=condition, y=value, x=specie)) +
geom_bar(position="stack", stat="identity")
5.5.3 Percent Stacked Barchart
Once more, there is not much to do to switch to a percent stacked barplot. Just switch to position="fill". Now, the percentage of each subgroup is represented, which makes it possible to study how their proportions change across groups.
# library
library(ggplot2)
# create a dataset
specie <- c(rep("sorgho" , 3) , rep("poacee" , 3) , rep("banana" , 3) , rep("triticum" , 3) )
condition <- rep(c("normal" , "stress" , "Nitrogen") , 4)
value <- abs(rnorm(12 , 0 , 15))
data <- data.frame(specie,condition,value)
# Stacked + percent
ggplot(data, aes(fill=condition, y=value, x=specie)) +
geom_bar(position="fill", stat="identity")
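With position="fill", ggplot2 computes the shares for you. If you need the numbers themselves, for labels or a table, the same shares can be sketched in base R with ave(). The seed and the share column name are choices made for this illustration.

```r
set.seed(123)
data <- data.frame(
  specie    = rep(c("sorgho", "poacee", "banana", "triticum"), each = 3),
  condition = rep(c("normal", "stress", "Nitrogen"), 4),
  value     = abs(rnorm(12, 0, 15))
)

# Share of each condition within its specie:
# ave() returns the group total repeated on every row of the group
data$share <- data$value / ave(data$value, data$specie, FUN = sum)

# Shares add up to 1 inside each specie
share_totals <- tapply(data$share, data$specie, sum)
share_totals
```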
5.5.4 Grouped Barchart Customization
As usual, some customization is often necessary to make the chart look better and more personal. Let’s:
- Add a title
- Use a theme
- Change the color palette. See more here.
- Customize axis titles
# library
library(ggplot2)
library(viridis)
library(hrbrthemes)
# create a dataset
specie <- c(rep("sorgho" , 3) , rep("poacee" , 3) , rep("banana" , 3) , rep("triticum" , 3) )
condition <- rep(c("normal" , "stress" , "Nitrogen") , 4)
value <- abs(rnorm(12 , 0 , 15))
data <- data.frame(specie,condition,value)
# Customized chart
ggplot(data, aes(fill=condition, y=value, x=specie)) +
geom_bar(position="stack", stat="identity") +
scale_fill_viridis(discrete = T) +
ggtitle("Studying 4 species..") +
theme_ipsum() +
xlab("")
5.5.5 Small Multiple
Small multiples can be used as an alternative to stacking or grouping. They are straightforward to make thanks to the facet_wrap() function.
# library
library(ggplot2)
library(viridis)
library(hrbrthemes)
# create a dataset
specie <- c(rep("sorgho" , 3) , rep("poacee" , 3) , rep("banana" , 3) , rep("triticum" , 3) )
condition <- rep(c("normal" , "stress" , "Nitrogen") , 4)
value <- abs(rnorm(12 , 0 , 15))
data <- data.frame(specie,condition,value)
# Graph
ggplot(data, aes(fill=condition, y=value, x=condition)) +
geom_bar(position="dodge", stat="identity") +
scale_fill_viridis(discrete = T, option = "E") +
ggtitle("Studying 4 species..") +
facet_wrap(~specie) +
theme_ipsum() +
theme(legend.position="none") +
xlab("")
5.6 Piechart
A piechart is a circle divided into sectors that each represent a proportion of the whole. It is highly criticized in dataviz, for good reasons. This section teaches how to build one using R, using the pie() function or the ggplot2 package.
5.6.0.1 Step by Step - The pie() Function
The pie() function is natively provided in R. It allows you to build a nice piechart in seconds. Here is an overview of how it works:
5.6.1 Most Basic Piechart with pie()
R natively offers the pie() function that builds pie charts. The input is just a numeric vector, each value providing the size of one group of the piechart.
Important note: pie charts are widely known as a bad way to visualize information. Check this section for reasons and alternatives.
# Create Data
Prop <- c(3,7,9,1,2)
# Make the default Pie Plot
pie(Prop)
5.6.2 Change Labels with labels
Provide a vector of labels to the labels argument to add names to piechart groups:
# You can also custom the labels:
pie(Prop , labels = c("Gr-A","Gr-B","Gr-C","Gr-D","Gr-E"))
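Since the labels argument accepts any character vector, it can also carry the share of each group. The sketch below builds such labels by hand. The rounding to whole percentages and the label layout are choices made for this illustration.

```r
Prop <- c(3, 7, 9, 1, 2)
groups <- c("Gr-A", "Gr-B", "Gr-C", "Gr-D", "Gr-E")

# Share of each group, rounded to a whole percentage
pct <- round(100 * Prop / sum(Prop))

# Combine the group name and its share in one label
lbls <- paste0(groups, " (", pct, "%)")

pie(Prop, labels = lbls)
```

Rounded percentages do not always add up to exactly 100, so keep an extra decimal if that matters for your chart.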
5.6.3 Non-Circular Piechart with edges
Decrease the value of the edges argument to get angles around your piechart.
# If you give a low value to the "edge" argument, you go from something circular to a shape with edges
pie(Prop , labels = c("Gr-A","Gr-B","Gr-C","Gr-D","Gr-E") , edges=10)
5.6.4 Add Stripes with density
The density argument adds stripes.
You can control the angle of those stripes with angle.
# The density arguments adds stripes. You can control the angle of this lines with "angle"
pie(Prop , labels = c("Gr-A","Gr-B","Gr-C","Gr-D","Gr-E") , density=10 , angle=c(20,90,30,10,0))
5.6.5 Color with col and border
Change group color with col, and border color with border.
Here, the RColorBrewer package is used to build a nice color palette.
# Prepare a color palette. Here with R color brewer:
library(RColorBrewer)
myPalette <- brewer.pal(5, "Set2")
# You can change the border of each area with the classical parameters:
pie(Prop , labels = c("Gr-A","Gr-B","Gr-C","Gr-D","Gr-E"), border="white", col=myPalette)
5.6.6 Ggplot2 Piechart
A pie chart is a circle divided into sectors that each represent a proportion of the whole. This page explains how to build one with the ggplot2 package.
5.6.6.1 Step by Step - The ggplot2 Package
There is no specific geom to build a piechart with ggplot2. The trick is to build a barplot and use coord_polar() to make it circular. This is why the pie() function described above is probably a better alternative.
5.6.6.2 Most Basic Piechart
ggplot2 does not offer any specific geom to build piecharts. The trick is the following:
- Input data frame has 2 columns: the group names (group here) and their values (value here).
- Build a stacked barchart with one bar only using the geom_bar() function.
- Make it circular with coord_polar().
The result is far from optimal yet, keep reading for improvements.
# Load ggplot2
library(ggplot2)
# Create Data
data <- data.frame(
  group=LETTERS[1:5],
  value=c(13,7,9,21,2)
)
# Basic piechart
ggplot(data, aes(x="", y=value, fill=group)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0)
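As a side note, geom_col() is ggplot2's shorthand for geom_bar(stat="identity"), so the same basic piechart can be sketched a little more compactly. The names pie_data and p below are chosen for illustration only.

```r
# Load ggplot2
library(ggplot2)

# Same toy data as above (pie_data is an illustrative name)
pie_data <- data.frame(
  group = LETTERS[1:5],
  value = c(13, 7, 9, 21, 2)
)

# One stacked bar, drawn with geom_col(), then bent into a circle
p <- ggplot(pie_data, aes(x = "", y = value, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", start = 0)

print(p)
```

Storing the plot in an object before printing makes it easy to add further layers or themes later on.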
5.6.7 Improve Appearance
The previous version looks pretty bad. We need to:
- Remove useless numeric labels.
- Remove grid and grey background.
It’s better now; we just need to add labels directly on the chart.
# Load ggplot2
library(ggplot2)
# Create Data
data <- data.frame(
  group=LETTERS[1:5],
  value=c(13,7,9,21,2)
)
# Basic piechart
ggplot(data, aes(x="", y=value, fill=group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() # remove background, grid, numeric labels
5.6.8 Adding Labels with geom_text()
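The tricky part is to compute the y position of labels using this weird coord_polar transformation. Each label is placed in the middle of its slice: the cumulative sum of the proportions up to and including that slice, minus half of the slice's own proportion.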
# Load ggplot2
library(ggplot2)
library(dplyr)
# Create Data
data <- data.frame(
  group=LETTERS[1:5],
  value=c(13,7,9,21,2)
)
# Compute the position of labels
data <- data %>%
  arrange(desc(group)) %>%
  mutate(prop = value / sum(data$value) *100) %>%
  mutate(ypos = cumsum(prop)- 0.5*prop )
# Basic piechart
ggplot(data, aes(x="", y=prop, fill=group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() +
  theme(legend.position="none") +
  geom_text(aes(y = ypos, label = group), color = "white", size=6) +
  scale_fill_brewer(palette="Set1")
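To make the label arithmetic above more concrete, here is a minimal base R sketch of the same computation, without dplyr. It assumes the same five values; the names value_rev, prop and ypos are illustrative.

```r
# Values for groups A to E, as in the example above
value <- c(A = 13, B = 7, C = 9, D = 21, E = 2)

# Reverse the order so that E comes first, mirroring arrange(desc(group))
value_rev <- rev(value)

# Share of each group, in percent
prop <- value_rev / sum(value_rev) * 100

# Middle of each slice: cumulative sum minus half of the slice itself
ypos <- cumsum(prop) - 0.5 * prop

round(ypos, 2)
#     E     D     C     B     A
#  1.92 24.04 52.88 68.27 87.50
```

Group A covers 25% of the total and is stacked last, so its label sits at 100 - 12.5 = 87.5.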
5.7 Treemap
A treemap displays hierarchical data as a set of nested rectangles. Each group is represented by a rectangle whose area is proportional to its value. Visit data-to-viz.com for a more theoretical explanation of what it is. For an R implementation, see below.
5.7.0.1 Step by Step - The treemap Package
The treemap package is probably the best way to build treemaps in R. The 3 examples below will teach you how to build a very basic treemap, how to deal with subgroups (https://www.r-graph-gallery.com/235-treemap-with-subgroups), and how to customize the figure. Note that once you master this package, you can very easily build an interactive version as described below.
5.7.1 Most Basic Treemap with R
This section explains how to build a very basic treemap with R. It uses the treemap package, provides reproducible code and explains how input data must be formatted.
5.7.1.1 Most Basic Treemap
This is the most basic treemap you can do. The input dataset is simple: we just have 3 groups, and each has a value which we map to an area.
It is a good way to learn the syntax of the treemap library: you need to provide at least a dataset (data here), the column that provides groups (index), and the column that gives the size of each group (vSize).
See graph #235 to learn how to add subgroups, and graph #236 to customize the chart appearance.
# library
library(treemap)
# Create data
group <- c("group-1","group-2","group-3")
value <- c(13,5,22)
data <- data.frame(group,value)
# treemap
treemap(data,
  index="group",
  vSize="value",
  type="index"
)
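Since each rectangle's area is proportional to its value, the share of the plot taken by each group can be checked by hand. The base R sketch below uses the same three values; the name share is illustrative.

```r
# Same toy data as above
group <- c("group-1", "group-2", "group-3")
value <- c(13, 5, 22)

# Fraction of the total area that each rectangle should cover
share <- value / sum(value)
names(share) <- group

round(share, 3)
# group-1 group-2 group-3
#   0.325   0.125   0.550
```

Here group-3 should fill a little more than half of the treemap, which is a quick sanity check when reading the figure.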
5.7.2 Treemap with Subgroups
This section explains how to build a treemap with subgroups in R. It uses the treemap package, provides reproducible code and explains how input data must be formatted.
This is a treemap with several levels. We have 3 groups, each containing several subgroups. Each subgroup has a value which we map to an area.
In the index argument, you need to specify levels in the order of importance: group > subgroup > sub-subgroup.
Note: If you have one level of grouping only, see chart #234.
Note: Showing more than 2 levels often results in a cluttered and unreadable figure. Why not consider an interactive version?
# library
library(treemap)
# Build Dataset
group <- c(rep("group-1",4),rep("group-2",2),rep("group-3",3))
subgroup <- paste("subgroup" , c(1,2,3,4,1,2,1,2,3), sep="-")
value <- c(13,5,22,12,11,7,3,1,23)
data <- data.frame(group,subgroup,value)
# treemap
treemap(data,
  index=c("group","subgroup"),
  vSize="value",
  type="index"
)
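With nested rectangles, a group's rectangle contains all of its subgroups, so its area corresponds to the sum of their values. The base R sketch below computes those group totals for the dataset above; the name totals is illustrative.

```r
# Same dataset as above
group <- c(rep("group-1", 4), rep("group-2", 2), rep("group-3", 3))
subgroup <- paste("subgroup", c(1, 2, 3, 4, 1, 2, 1, 2, 3), sep = "-")
value <- c(13, 5, 22, 12, 11, 7, 3, 1, 23)
data <- data.frame(group, subgroup, value)

# Sum of subgroup values within each group
totals <- tapply(data$value, data$group, sum)

totals
# group-1 group-2 group-3
#      52      18      27
```

group-1 should therefore take a bit more than half of the figure (52 out of 97).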
5.7.3 Customize your R Treemap
How to customize your treemap built with R? Learn how to control borders, labels, and more. Several examples with reproducible code are provided.
5.7.3.1 Labels
This page aims to explain how to customize R treemaps. Make sure you already understand how to build a basic treemap with R.
The first step is to control label appearance. All the options are explained in the code below. Note that you can apply a different feature to each level of the treemap, for example using white for group labels, and orange for subgroup labels.
# library
library(treemap)
# Create data
group <- c(rep("group-1",4),rep("group-2",2),rep("group-3",3))
subgroup <- paste("subgroup" , c(1,2,3,4,1,2,1,2,3), sep="-")
value <- c(13,5,22,12,11,7,3,1,23)
data <- data.frame(group,subgroup,value)
# Custom labels:
treemap(data, index=c("group","subgroup"), vSize="value", type="index",
  fontsize.labels=c(15,12), # Size of labels. Give the size per level of aggregation: size for group, size for subgroup, sub-subgroups...
  fontcolor.labels=c("white","orange"), # Color of labels
  fontface.labels=c(2,1), # Font of labels: 1,2,3,4 for normal, bold, italic, bold-italic...
  bg.labels=c("transparent"), # Background color of labels
  align.labels=list(
    c("center", "center"),
    c("right", "bottom")
  ), # Where to place labels in the rectangle?
  overlap.labels=0.5, # Number between 0 and 1 that determines the tolerance of the overlap between labels. 0 means that labels of lower levels are not printed if higher level labels overlap, 1 means that labels are always printed. In-between values, for instance the default value .5, mean that lower level labels are printed if other labels do not overlap with more than .5 times their area size.
  inflate.labels=FALSE # If TRUE, labels are bigger when rectangle is bigger.
)
5.7.4 Borders
You can control the border:
- Color with border.col
- Width with border.lwds
Remember that you can still provide a vector to each option: it gives the value for groups, subgroups and so on.
# Custom borders:
treemap(data, index=c("group","subgroup"), vSize="value", type="index",
  border.col=c("black","white"), # Color of borders of groups, of subgroups, of subsubgroups ....
  border.lwds=c(7,2) # Width of borders
)
5.7.5 General Features
A few other arguments allow more general customization. The palette argument accepts any palette from RColorBrewer.
# General features:
treemap(data, index=c("group","subgroup"), vSize="value",
  type="index", # How you color the treemap. Type help(treemap) for more info
  palette = "Set1", # Select your color palette from the RColorBrewer presets or make your own.
  title="My Treemap", # Customize your title
  fontsize.title=12 # Size of the title
)
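The "make your own" option mentioned in the comment above can be sketched by passing a vector of hexadecimal colors to palette instead of a preset name. The three colors and the names my_colors and tm are arbitrary choices for illustration, with one color per top-level group.

```r
# library
library(treemap)

# Same dataset as in the previous examples
group <- c(rep("group-1", 4), rep("group-2", 2), rep("group-3", 3))
subgroup <- paste("subgroup", c(1, 2, 3, 4, 1, 2, 1, 2, 3), sep = "-")
value <- c(13, 5, 22, 12, 11, 7, 3, 1, 23)
data <- data.frame(group, subgroup, value)

# Hand-picked palette: one hexadecimal color per group (illustrative values)
my_colors <- c("#1b9e77", "#d95f02", "#7570b3")

# Pass the color vector instead of a preset palette name
tm <- treemap(data, index = c("group", "subgroup"), vSize = "value",
  type = "index",
  palette = my_colors,
  title = "My Treemap"
)
```

Keeping the result in an object (tm here) is optional; the chart is drawn either way.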
5.7.5.1 Interactivity with d3treeR
The d3treeR package allows you to build interactive treemaps with R. Click on a group to zoom in and reveal subgroups. Click on the group name on top to unzoom and come back to the previous state. Note that the syntax is exactly the same as in the previous charts. Only one more line of code is needed, give it a go!
5.7.6 Interactive Treemap in R
With a large amount of data, a treemap can get cluttered and unreadable. Interactivity helps to keep a clean and insightful figure. This section shows how to build an interactive treemap with R and the d3treeR package.
This section follows the previous charts #234, #235 and #236, which describe how to build and customize treemaps with the treemap package.
The idea is to make the chart interactive: you can now click on a group to zoom in and show its subgroups. Click on the group name on top to unzoom and come back to the previous state.
This is done thanks to the d3treeR package:
# library
library(treemap)
library(d3treeR)
# dataset
group <- c(rep("group-1",4),rep("group-2",2),rep("group-3",3))
subgroup <- paste("subgroup" , c(1,2,3,4,1,2,1,2,3), sep="-")
value <- c(13,5,22,12,11,7,3,1,23)
data <- data.frame(group,subgroup,value)
# basic treemap
p <- treemap(data,
  index=c("group","subgroup"),
  vSize="value",
  type="index",
  palette = "Set2",
  bg.labels=c("white"),
  align.labels=list(
    c("center", "center"),
    c("right", "bottom")
  )
)
# make it interactive ("rootname" becomes the title of the plot):
inter <- d3tree2( p , rootname = "General" )
# save the widget
# library(htmlwidgets)
# saveWidget(inter, file=paste0( getwd(), "/HtmlWidget/interactiveTreemap.html"))