Plotting in R for Biologists

Plotting in R for Biologists is a beginner course in data analysis and plotting with R, designed for biologists as a starting point for plotting your own data. This course can also be a good starting point for learning bioinformatics and computational biology because R is still what I use for most of my plots ranging from first getting my head around a dataset up to publication. R is what I recommend as the programming language for data analysis, statistics, and plotting for most scientists, including biologists and bioinformaticians.

I originally made this course to be a paid product, but I decided to give it away for free so more people can get use out of it. I do appreciate donations through Gumroad where you can also download the videos, scripts, and example data. Thank you so much to everyone who decided to contribute!

Maria


Getting started:

Install R and RStudio


Lesson 1: Quick start

This lesson takes you from opening R for the first time and loading data to creating your first plot and saving the plot as a file on your computer

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

# ==========================================================
#
#      Lesson 1 -- Hit the ground running
#      •   Reading in data
#      •   Creating a quick plot
#      •   Saving publication-quality plots in multiple
#          file formats (.png, .jpg, .pdf, and .tiff)
#
# ==========================================================

# Go to the packages tab in the bottom right part of Rstudio, click "Install" at the top, type in ggplot2, and hit Install
# Go to the Files tab in the bottom right part of Rstudio, navigate to where you can see the Lesson-01 folder.
# then click "More" and choose "Set As Working Directory"

library(ggplot2)

filename <- "Lesson-01/Encode_HMM_data.txt"

# Select a file and read the data into a data-frame
my_data <- read.csv(filename, sep="\t", header=FALSE)
# if this gives an error, make sure you have followed the steps above to set your working directory to the folder that contains the file you are trying to open

head(my_data)

# Rename the columns so we can plot things more easily without looking up which column is which
names(my_data)[1:4] <- c("chrom","start","stop","type")

# At any time, you can see what your data looks like using the head() function:
head(my_data)

# Now we can make an initial plot and see how it looks
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

# Save the plot to a file

# Different file formats:
png("Lesson-01/plot.png")
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()
dev.off()

tiff("Lesson-01/plot.tiff")
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()
dev.off()

jpeg("Lesson-01/plot.jpg")
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()
dev.off()

pdf("Lesson-01/plot.pdf")
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()
dev.off()

# High-resolution:
png("Lesson-01/plot_hi_res.png",1000,1000)
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()
dev.off()

Lesson 2: Importing data from Excel, text files, and online sources

This lesson covers how to get various kinds of data into R in the first place. I show you step by step how to import data from an Excel spreadsheet into R.

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

# ==========================================================
#
#      Lesson 2 -- Importing and downloading data
#      •   Importing data from Excel
#      •   Downloading from UCSC
#      •   Downloading from ENSEMBL
#      •   Downloading from ENCODE
#
# ==========================================================

# Getting data from Excel
# Get the excel file from this paper: "Gene expression profiling of breast cell lines identifies potential new basal markers". Supplementary table 1
# Go into excel and save it as "Tab Delimited Text (.txt)"

filename <- "Lesson-02/micro_array_results_table1.txt"

my_data <- read.csv(filename, sep="\t", header=TRUE)
head(my_data)

# Where to find publicly available big data
# UCSC -- RefSeq genes from table browser
# Ensembl -- Mouse regulatory features MultiCell
# ENCODE -- HMM: wgEncodeBroadHmmGm12878HMM.bed

genes <- read.csv("Lesson-02/RefSeq_Genes", sep="\t", header=TRUE)
head(genes)
dim(genes)

regulatory_features <- read.csv("Lesson-02/RegulatoryFeatures_MultiCell.gff", sep="\t", header=FALSE)
head(regulatory_features)
dim(regulatory_features)

chromHMM <- read.csv("Lesson-02/wgEncodeBroadHmmGm12878HMM.bed", sep="\t", header=FALSE)
head(chromHMM)
dim(chromHMM)

Lesson 3: Interrogating your data

Here we cover several aspects of investigating your data in R:

  1. How to get a quick view at a large file
  2. How to get the number of rows and columns of the dataset
  3. How to get mean, median, standard deviation, min, and max of numerical columns in your dataset
  4. How to get counts by category, like how many times each chromosome shows up in the dataset

We also critically evaluate our first plot to see what aesthetic issues we are going to change in the next lesson

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

# ==========================================================
#
#      Lesson 3 -- Interrogating your data
#      •   Counting categorical variables
#      •   Mean, median, standard deviation
#      •   Finding issues
#
# ==========================================================

#############     We start this the same as in lesson 1     ###############

library(ggplot2)

filename <- "Lesson-03/Encode_HMM_data.txt"

# Select a file and read the data into a data-frame
my_data <- read.csv(filename, sep="\t", header=FALSE)
# if this gives an error, make sure you have followed the steps above to set your working directory to the folder that contains the file you are trying to open

head(my_data)

# Rename the columns so we can plot things more easily without looking up which column is which
names(my_data)[1:4] <- c("chrom","start","stop","type")

# At any time, you can see what your data looks like using the head() function:
head(my_data)

# Now we can make an initial plot and see how it looks
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

#############     Now let's take a closer look at our data     ###############

# We can see how big our data is:
dim(my_data)

# We can ask our data some questions:
summary(my_data)

# We can break these down by column to see more:
summary(my_data$chrom)

summary(my_data$type)

summary(my_data$start)
summary(my_data$stop)

# We can even make a new column by doing math on the other columns
my_data$size = my_data$stop - my_data$start

# So now there's a new column
head(my_data)

# Basic statistics:
summary(my_data$size)

mean(my_data$size)
sd(my_data$size)

median(my_data$size)
max(my_data$size)
min(my_data$size)

# Let's think about the issues and in the next lesson we will learn how to deal with them
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

# 1)    Chromosomes in the wrong order, and the "chr" prefixes don't fit on the x-axis
# 2)    Too many types
# 3)    Bad names for the types

Lesson 4: Filtering and cleaning up data

This is where things start to get hairy. I encountered several problems over and over again working with especially genomic data in R. This lesson will cover:

  1. My best trick to fix chromosome ordering in a plot
  2. Removing categories from a column, for instance containing feature or variant types
  3. Renaming categories of a column (this is called renaming the levels of a factor)

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)


# ==========================================================
#
#      Lesson 4 -- Filtering and cleaning up data
#      •   Removing "chr" prefixes
#      •   Getting chromosomes in the right order
#      •   Filtering out excess data
#      •   Renaming categories
#
# ==========================================================

#############     We start this the same as in lesson 1     ###############

library(ggplot2)

filename <- "Lesson-04/Encode_HMM_data.txt"

# Select a file and read the data into a data-frame
my_data <- read.csv(filename, sep="\t", header=FALSE)
# if this gives an error, make sure you have followed the steps above to set your working directory to the folder that contains the file you are trying to open

head(my_data)

# Rename the columns so we can plot things more easily without looking up which column is which
names(my_data)[1:4] <- c("chrom","start","stop","type")

# At any time, you can see what your data looks like using the head() function:
head(my_data)

# Now we can make an initial plot and see how it looks
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

#############     In Lesson 3 we decided on some issues to fix     #############

# 1)    Remove "chr" prefix from chromosome names
# 2)    Order the chromosomes correctly
# 3)    Pick a subset of the types and rename them

summary(my_data$chrom)
summary(my_data$type)

#############     Now let's address those issues     ###############

# Remove the "chr" prefix
my_data$chrom <- factor(gsub("chr", "", my_data$chrom))
# levels = possibilities/categories

# See the result
summary(my_data$chrom)

ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

# Reorder the chromosomes numerically
seq(1,22)
c(seq(1,22),"X","Y")

my_data$chrom <- factor(my_data$chrom, levels=c(seq(1,22),"X","Y"))

summary(my_data$chrom)

ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

# Much better!

# Now let's do something about those types.

summary(my_data$type)

# Filter to just a few types I'm interested in
my_data <- my_data[my_data$type %in% c("1_Active_Promoter","4_Strong_Enhancer","8_Insulator"), ]

summary(my_data$type)

# Rename the types
library(plyr) # this library has a useful revalue() function
my_data$type <- revalue(my_data$type, c("1_Active_Promoter"="Promoter", "4_Strong_Enhancer"="Enhancer","8_Insulator"="Insulator"))

summary(my_data$type)

# Check the plot again
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()


Lesson 5: Tweaking everything in a plot

And I do mean everything:

  1. Plot title, axis title, legend labels
  2. Font, font size
  3. Color schemes, where to find color palettes, mixing palettes from ColorBrewer to get a larger color palette
  4. How to shuffle colors and then keep your best mix
  5. Finding color-blind safe color palettes with ColorBrewer
  6. Background color
  7. Border around the plot and axes
  8. Grid-lines
  9. Tick-marks
  10. Legend (remove legend, remove title of legend)

I also show you how to create a publication-style plot and apply the same theme to all your plots.

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

# ==========================================================
#
#      Lesson 5 -- Tweaking everything in your plots
#      •   Text, axis labels
#      •   Legends
#      •   Color palettes
#      •   Sizes, fonts, line-widths, tick-marks,
#          grid-lines, and background-colors
#
# ==========================================================

# Loading, filtering, etc. from Lessons 1-4:
#            (shortened version)

library(ggplot2)

filename <- "Lesson-05/Encode_HMM_data.txt"
my_data <- read.csv(filename, sep="\t", header=FALSE)
names(my_data)[1:4] <- c("chrom","start","stop","type")

# Reorder chromosomes and cut their names down
my_data$chrom <- factor(gsub("chr", "", my_data$chrom, fixed=TRUE), levels=c(seq(1,22),"X","Y"))

# Filter to just a few types I'm interested in
my_data <- my_data[my_data$type %in% c("1_Active_Promoter","4_Strong_Enhancer","8_Insulator"),]

# Rename the types
library(plyr) # this library has a useful revalue() function
my_data$type <- revalue(my_data$type, c("1_Active_Promoter"="Promoter", "4_Strong_Enhancer"="Enhancer","8_Insulator"="Insulator"))

# Check the plot again
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

# ==========================================================

# The basics
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar()

# Add a plot title
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar() + labs(title="Regulatory features by chromosome")

# Change axis and legend labels
ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar() + labs(x = "Chromosome",y="Count",fill="Feature")

# Save the plot to easily try new things:
basic <- ggplot(my_data,aes(x=chrom,fill=type)) + geom_bar() + labs(x = "Chromosome",y="Count",fill="Feature")
# Now when we run "basic" it makes the plot
basic

# Add theme with modifications to the basic plot, for instance with bigger text
basic + theme_gray(base_size = 20)
# But it only affects that plot, so the next plot we make is back to normal
basic

# You can also set a theme that will affect all the plots you make from now on
theme_set(theme_gray(base_size = 20))
basic

# To recover the default theme:
theme_set(theme_gray())
basic

# I prefer larger text myself
theme_set(theme_gray(base_size = 16))
basic

#==============================================================================
# Color palettes:

library(RColorBrewer)
display.brewer.all()

basic + scale_fill_brewer(palette="Set1")
basic + scale_fill_brewer(palette="Pastel1")
basic + scale_fill_brewer(palette="YlOrRd")

basic + scale_fill_manual(values = c("green","blue","red"))

colors()

# What happens if we need a lot of colors?
chrom_plot <- ggplot(my_data,aes(x=type,fill=chrom)) + geom_bar()
chrom_plot

# rainbow is confusing, but color palettes are too short:
chrom_plot + scale_fill_brewer(type="qual",palette=1)

# to get the colors from a palette:
palette1 <- brewer.pal(9,"Set1")
palette1

palette2 <- brewer.pal(8,"Set2")
palette2

palette3 <- brewer.pal(9,"Set3")
palette3

# We can use a quick pie chart to see the colors:
pie(rep(1,length(palette1)),col=palette1)
pie(rep(1,length(palette2)),col=palette2)
pie(rep(1,length(palette3)),col=palette3)

# We can just stick a few color palettes together
big_palette <- c(palette1,palette2,palette3)
big_palette

# Pie chart of all the colors together:
pie(rep(1,length(big_palette)),col=big_palette)

chrom_plot + scale_fill_manual(values = big_palette)

# To shuffle the colors:
chrom_plot + scale_fill_manual(values = sample(big_palette))

# if you want to keep the colors the same every time you plot,
# use set.seed()
set.seed(5)
# use different numbers until you find your favorite colors
chrom_plot + scale_fill_manual(values = sample(big_palette))

# This is possible, because:
# Fun fact: "Random" numbers from a computer aren't really random

# Color-blind safe palettes:
display.brewer.all(colorblindFriendly=TRUE)
# Mixing them might remove the color-friendly-ness so be careful
# Finding a set of 23 colors that a color-blind person can distinguish is a challenge

basic + scale_fill_brewer(palette="Set2")

# Done with colors
#======================================================================

# Default:
theme_set(theme_gray())

# Basic, normal plot:
basic

# Two basic themes:
basic + theme_gray() # the default
basic + theme_bw() # black and white

# Fonts and font sizes for everything at once
basic + theme_gray(base_size = 24, base_family = "Times New Roman")

# Font size for labels, tick labels, and legend separately ##############################
basic + theme(axis.text=element_text(size=20)) # numbers on axes
basic + theme(axis.title=element_text(size=20)) # titles on axes
basic + theme(legend.title=element_text(size=20)) # legend title
basic + theme(legend.text=element_text(size=20,family="Times New Roman"))
    # legend category labels

basic + theme(
    legend.text=element_text(size=20,family="Times New Roman"),
    axis.title=element_text(size=30),
    axis.text=element_text(size=20)
    ) # Mix and match

# Change background color
basic + theme(panel.background = element_rect(fill="pink"))
basic + theme(panel.background = element_rect(fill="white"))

# Change grid-lines
basic + theme(panel.grid.major = element_line(colour = "blue"), panel.grid.minor = element_line(colour = "red"))

    # Remove all gridlines:
basic + theme(panel.grid.major = element_line(NA),
              panel.grid.minor = element_line(NA))

    # Thin black major gridlines on y-axis, the others are removed
basic + theme(panel.grid.major.y = element_line(colour = "black",size=0.2),
              panel.grid.major.x = element_line(NA),
              panel.grid.minor = element_line(NA))

# Change tick-marks
basic # normal ticks
basic + theme(axis.ticks = element_line(size=2))
basic + theme(axis.ticks = element_line(NA))
basic + theme(axis.ticks = element_line(color="blue",size=2))
basic + theme(axis.ticks = element_line(size=2), # affects both x and y
              axis.ticks.x = element_line(color="blue"), # x only
              axis.ticks.y = element_line(color="red"))  # y only

# Place legend in different locations
basic + theme(legend.position="top")
basic + theme(legend.position="bottom")
basic + theme(legend.position=c(0,0)) # bottom left
basic + theme(legend.position=c(1,1)) # top right
basic + theme(legend.position=c(0.8,0.8)) # near the top right

# Remove legend title
basic + labs(fill="")
basic + labs(fill="") + theme(legend.position=c(0.8,0.8))

# Remove legend completely
basic + guides(fill=FALSE)

# clear background, axis lines but no box, no grid lines, basic colors, no legend
publication_style <- basic + guides(fill=FALSE) +  theme(axis.line = element_line(size=0.5),panel.background = element_rect(fill=NA,size=rel(20)), panel.grid.minor = element_line(colour = NA), axis.text = element_text(size=16), axis.title = element_text(size=18))

publication_style

publication_style + scale_y_continuous(expand=c(0,0)) # to stop the bars from floating above the x-axis

# You can set the theme with all these changes and have it apply to all the future plots
theme_set(theme_gray()+theme(axis.line = element_line(size=0.5),panel.background = element_rect(fill=NA,size=rel(20)), panel.grid.minor = element_line(colour = NA), axis.text = element_text(size=16), axis.title = element_text(size=18)))

basic

# These tweaks aren't part of the theme, so you will still have to add them separately to each plot
basic + scale_y_continuous(expand=c(0,0)) + guides(fill=FALSE)

# and you can always reset to defaults with:
theme_set(theme_gray())
basic

Lesson 6: Plot anything

Here is where you see the true magic of ggplot. It allows you to take the same data and quickly plot it in a bunch of different ways.

  1. Bar charts
  2. Histograms
  3. Scatter plots
  4. Box plots (box-and-whisker)
  5. Violin plots
  6. Log-scaling any plots
  7. Dot plots (showing small count data, unfortunately not the same “dot plot” as we make for genome assemblies, though if you have a genome assembly and want to make a dot plot for it, I encourage you to check out GenomeRibbon.com, a genomics visualization web app I made as part of my PhD research 🙂 )
  8. Line plots (time-course plot)
  9. Polar coordinate plots
  10. Pie charts
  11. Venn diagrams (intersections of gene lists)

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

# ==========================================================
#
#      Lesson 6 -- Plot anything
#      •   Bar plots
#      •   Histograms
#      •   Scatter plots
#      •   Box plots
#      •   Violin plots
#      •   Density plots
#      •   Dot plots
#      •   Line plots for time-course data
#      •   Pie charts
#      •   Venn diagrams (compare two or more lists of genes)
#
# ==========================================================

# ==========================================================
#
#      New data set for exploring plotting: variants
#
# ==========================================================

library(ggplot2)
theme_set(theme_gray(base_size = 18))

# Loading the data
filename <- "Lesson-06/variants_from_assembly.bed"

my_data <- read.csv(filename, sep="\t", quote='', stringsAsFactors=TRUE,header=FALSE)

head(my_data)

names(my_data)

names(my_data) <- c("chrom","start","stop","name","size","strand","type","ref.dist","query.dist")

head(my_data)
summary(my_data$chrom)

# Filtering and polishing data
my_data <- my_data[my_data$chrom %in% c(seq(1,22),"X","Y","MT"),]

    # ordering chromosomes
my_data$chrom <- factor(my_data$chrom, levels=c(seq(1,22),"X","Y","MT"))
    # ordering types
my_data$type <- factor(my_data$type, levels=c("Insertion","Deletion","Expansion","Contraction"))

####################################################################

# Creating a bar plot
ggplot(my_data, aes(x=chrom,fill=type)) + geom_bar()

# Creating a histogram
ggplot(my_data, aes(x=size,fill=type)) + geom_bar()
ggplot(my_data, aes(x=size,fill=type)) + geom_bar() + xlim(0,500)
ggplot(my_data, aes(x=size,fill=type)) + geom_bar(binwidth=5) + xlim(0,500)

# Creating scatter plot
ggplot(my_data, aes(x=ref.dist,y=query.dist)) + geom_point()
    # color by type (categorical)
ggplot(my_data, aes(x=ref.dist,y=query.dist,color=type)) + geom_point()
ggplot(my_data, aes(x=ref.dist,y=query.dist,color=type)) + geom_point() + xlim(-500,500) + ylim(-500,500)
    # color by size (numerical)
ggplot(my_data, aes(x=ref.dist,y=query.dist,color=size)) + geom_point() + xlim(-500,500) + ylim(-500,500)

ggplot(my_data, aes(x=ref.dist,y=query.dist,color=size)) + geom_point() + xlim(-500,500) + ylim(-500,500) + scale_colour_gradient(limits=c(0,500))

# Creating box plots
ggplot(my_data, aes(x=type,y=size)) + geom_boxplot()
ggplot(my_data, aes(x=type,y=size,fill=type)) + geom_boxplot()
ggplot(my_data, aes(x=type,y=size,fill=type)) + geom_boxplot() + coord_flip()

# Creating violin plots
ggplot(my_data, aes(x=type,y=size)) + geom_violin()
ggplot(my_data, aes(x=type,y=size,fill=type)) + geom_violin() + ylim(0,1000) + guides(fill=FALSE)
ggplot(my_data, aes(x=type,y=size,fill=type)) + geom_violin(adjust=0.2) + ylim(0,1000) + guides(fill=FALSE)
    # default adjust is 1, lower means finer resolution

# You can log-scale any numerical axis on any plot
ggplot(my_data, aes(x=type,y=size,fill=type)) + geom_violin() +
    scale_y_log10()

# Creating a density plot
ggplot(my_data, aes(x=size,fill=type)) + geom_density() + xlim(0,500)
    # similar to this histogram:
ggplot(my_data, aes(x=size,fill=type)) + geom_bar(binwidth=5) + xlim(0,500)

ggplot(my_data, aes(x=size,fill=type)) + geom_density(position="stack") + xlim(0,500)
ggplot(my_data, aes(x=size,fill=type)) + geom_density(alpha=0.5) + xlim(0,500)

# Sneak peak at Lesson 7: multifaceted plots:
ggplot(my_data, aes(x=size,fill=type)) + geom_density() + xlim(0,500) + facet_grid(type ~ .)

# Creating dot plots
ggplot(my_data, aes(x=size,fill=type)) + geom_dotplot()
    # a dot plot makes more sense with fewer observations where each individual item matters,
    # so let's grab the largest events only
large_data <- my_data[my_data$size>5000,  ] # [rows,columns]
ggplot(large_data, aes(x=size,fill=type)) + geom_dotplot(method="histodot")
# careful, they don't stack automatically, so some of the dots are behind others
ggplot(large_data, aes(x=size,fill=type)) + geom_dotplot(method="histodot",stackgroups=TRUE)

# Time-course data for line plot
filename <- "Lesson-06/time_course_data.txt"
time_course <- read.csv(filename, sep=",", quote='', stringsAsFactors=TRUE,header=TRUE)
time_course

# line plot:
ggplot(time_course, aes(x=seconds,y=value,colour=sample)) + geom_line()
ggplot(time_course, aes(x=seconds,y=value,colour=sample)) + geom_line(size=3)

# For fun:
# Any plot can be made in polar coordinates:
    # line plot
ggplot(time_course, aes(x=seconds,y=value,colour=sample)) + geom_line(size=3) + coord_polar()

    # violin plot
ggplot(my_data, aes(x=type,y=size,fill=type)) + geom_violin(adjust=0.5) + ylim(0,1000) + coord_polar()
    # bar plot
ggplot(my_data, aes(x=size,fill=type)) + geom_bar(binwidth=5) + xlim(0,500) + coord_polar()

# Pie charts
type_counts = summary(my_data$type)
type_counts

pie(type_counts)
pie(type_counts,col=brewer.pal(length(type_counts),"Set1"))

# Gene lists for Venn Diagram
listA <- read.csv("Lesson-06/genes_list_A.txt",header=FALSE)
A <- listA$V1
A

listB <- read.csv("Lesson-06/genes_list_B.txt",header=FALSE)
B <- listB$V1
B

listC <- read.csv("Lesson-06/genes_list_C.txt",header=FALSE)
C <- listC$V1
C

listD <- read.csv("Lesson-06/genes_list_D.txt",header=FALSE)
D <- listD$V1
D

length(A)
length(B)
length(C)
length(D)

# install package VennDiagram
library(VennDiagram)

# This function only works by saving directly to a file

venn.diagram(list("list C"=C, "list D"=D), fill = c("yellow","cyan"), cex = 1.5, filename="Lesson-06/Venn_diagram_genes_2.png")

venn.diagram(list(A = A, C = C, D = D), fill = c("yellow","red","cyan"), cex = 1.5,filename="Lesson-06/Venn_diagram_genes_3.png")

venn.diagram(list(A = A, B = B, C = C, D = D), fill = c("yellow","red","cyan","forestgreen"), cex = 1.5,filename="Lesson-06/Venn_diagram_genes_4.png")


Lesson 7: Multifaceted and inset figures

Splitting up your data by category into multiple plots arranged in rows, columns, or even tables.
Also making a figure inset into another figure, such as a zoomed in view in the upper right corner.

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

# ==========================================================
#
#      Lesson 7 -- Multifaceted figures
#      •   Grids of plots by chromosome, variant type, etc.
#      •   Make a figure as an inset in another figure
#          automatically -- without Photoshop or Illustrator
#
# ==========================================================

library(ggplot2)
# reset theme
theme_set(theme_gray())

# Loading the data
filename <- "Lesson-07/variants_from_assembly.bed"
my_data <- read.csv(filename, sep="\t", quote='', stringsAsFactors=TRUE,header=FALSE)
names(my_data) <- c("chrom","start","stop","name","size","strand","type","ref.dist","query.dist")

head(my_data)

# Filtering and polishing data
my_data <- my_data[my_data$chrom %in% c(seq(1,22),"X","Y","MT"),]

    # ordering chromosomes
my_data$chrom <- factor(gsub("chr", "", my_data$chrom), levels=c(seq(1,22),"X","Y","MT"))

    # ordering types
my_data$type <- factor(my_data$type, levels=c("Insertion","Deletion","Expansion","Contraction"))

############# Multifacetting figures ############

ggplot(my_data, aes(x=size,fill=type)) + geom_density(alpha=0.5) + xlim(0,500)

ggplot(my_data, aes(x=size,fill=type)) + geom_density() + xlim(0,500)  + facet_grid(type ~ .)

ggplot(my_data, aes(x=size,fill=type)) + geom_density() + xlim(0,500)  + facet_grid(. ~ type)

# plot + facet_grid(rows ~ columns)

# Facet on type and chrom
ggplot(my_data, aes(x=size,fill=type)) + geom_density() + xlim(0,500)  + facet_grid(chrom ~ type)

ggplot(my_data, aes(x=size,fill=type)) + geom_density() + xlim(0,500)  + facet_grid(type ~ chrom)

# Not always pretty, but it sure is fun
ggplot(my_data, aes(x=size,fill=type)) + geom_bar() + xlim(0,500)  + facet_grid(chrom ~ type)

ggplot(my_data, aes(x=type,y=size,color=type,fill=type)) + geom_boxplot() + facet_grid(chrom ~ .)

ggplot(my_data, aes(x=type,y=size,color=type,fill=type)) + geom_violin() + facet_grid(chrom ~ .)

ggplot(my_data, aes(x=ref.dist,y=query.dist,color=type,fill=type)) + xlim(0,500) + ylim(0,500) + geom_point() + facet_grid(chrom ~ type)

ggplot(my_data, aes(x=size,fill=type)) + geom_dotplot() + xlim(5000,10000) + facet_grid(chrom ~ type)

# Inset figures:

# Our special publication-style theme from Lesson 5 "Tweaking everything in your plots"
theme_set(theme_gray() +
              theme(
                  axis.line = element_line(size=0.5),
                  panel.background = element_rect(fill=NA,size=rel(20)),
                  panel.grid.minor = element_line(colour = NA),
                  axis.text = element_text(size=16),
                  axis.title = element_text(size=18)
                  )
          )

big_plot <-  ggplot(my_data, aes(x=size,fill=type)) +
    geom_bar(binwidth=100) +
    guides(fill=FALSE) +
    scale_y_continuous(expand=c(0,0)) # Move bars down to X-axis

big_plot

small_plot <- ggplot(my_data, aes(x=size,fill=type)) + geom_bar(binwidth=5) + xlim(0,500) + theme(axis.title=element_blank()) +  scale_y_continuous(expand=c(0,0))
small_plot

# Where to put the smaller plot:
library(grid)
vp <- viewport(width = 0.8, height = 0.7, x = 0.65, y = 0.65)
                # width, height, x-position, y-position of the smaller plot

png("Lesson-07/inset_plot.png")
print(big_plot)
print(small_plot, vp = vp)
dev.off()

Lesson 8: Heatmaps

How to create everything from simple heatmaps to adding different clustering and trees, partitions, and labels on the sides.

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

# ==========================================================
#
#      Lesson 8 -- Heatmaps
#      •   Dendrograms
#      •   To cluster or not to cluster
#      •   Flipping axes
#      •   Specifying distance calculation and clustering methods
#      •   Annotate sides of heatmap with associated data
#          (chromosomes, genders, samples)
#      •   Coloring branches in dendrograms
#
# ==========================================================
# The ComplexHeatmap package is downloaded from the bioconductor website, not from the packages tab like we did with the other packages
# https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html

# Run the first time to install:
# source("https://bioconductor.org/biocLite.R")
# biocLite("ComplexHeatmap")

library(ComplexHeatmap)

# Heatmap from single-cell copy number data

# Select a data file
filename <- "Lesson-08/copy_number_data.txt"

# Read the data into a data.frame
my_data <- read.table(filename, sep="\t", quote="", stringsAsFactors=FALSE,header=TRUE)

head(my_data)

dim(my_data) # (rows columns)

nrow(my_data) # rows: locations (bins) in genome
ncol(my_data) # columns: cells

# Make the heatmap data into a matrix
my_matrix <- as.matrix(my_data[  ,c(4:100)]) # [all rows, columns 4-100]
# leave out the first 3 columns (chrom,start,end) since they don't belong in the heatmap itself

# We can check the classes:
class(my_data)
class(my_matrix)

head(my_matrix)

# Save chromosome column for annotating the heatmap later
chromosome_info <- data.frame(chrom = my_data$CHR)
chromosome_info

## Now we make our first heatmap

# Default parameters
Heatmap(my_matrix)

# Flip rows and columns around
my_matrix <- t(my_matrix)  # "transpose"
Heatmap(my_matrix)

# Keep genome bins in order, not clustered
Heatmap(my_matrix, cluster_columns=FALSE)

fontsize <- 0.6

# Put cell labels on the left side
Heatmap(my_matrix, cluster_columns=FALSE,
        row_names_side = "left",
        row_hclust_side = "left",
        row_names_gp=gpar(cex=fontsize))

# Make the dendrogram wider
Heatmap(my_matrix,
        cluster_columns=FALSE,
        row_names_side = "left",
        row_hclust_side = "left",
        row_names_gp=gpar(cex=fontsize),
        row_hclust_width = unit(3, "cm"))

# Different distance calculation methods
# "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", "pearson", "spearman", "kendall"
# euclidean is the default

# Different clustering methods
# "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC)

# Watch the dendrogram and heatmap change when we change the methods
Heatmap(my_matrix,
        cluster_columns=FALSE,
        row_names_side = "left",
        row_hclust_side = "left",
        row_names_gp=gpar(cex=fontsize),
        row_hclust_width = unit(3, "cm"),
        clustering_distance_rows ="maximum",
        clustering_method_rows = "ward.D")

# Coloring clusters in dendrogram

# install dendextend
library(dendextend)
# Need to build dendrogram first so we can use it for the color_brances() function
# 1. calculate distances (method="maximum")
# 2. cluster (method="ward.D")
dend = hclust(dist(my_matrix,method="maximum"),method="ward.D")

Heatmap(my_matrix,
        cluster_columns=FALSE,
        row_names_side = "left",
        row_hclust_side = "left",
        row_names_gp=gpar(cex=fontsize),
        row_hclust_width = unit(3, "cm"),
        cluster_rows = color_branches(dend, k = 3))

# We can split the heatmap into clusters

Heatmap(my_matrix,
        cluster_columns=FALSE,
        row_names_side = "left",
        row_hclust_side = "left",
        row_names_gp=gpar(cex=fontsize),
        row_hclust_width = unit(3, "cm"),
        clustering_distance_rows ="maximum",
        clustering_method_rows = "ward.D",
        km=2) # number of clusters you want

# Split columns of plot up into chromosomes using extra_info
chromosome_info

chromosome.colors <- c(rep(c("black","white"),11),"red")
chromosome.colors

names(chromosome.colors) <- paste("chr",c(seq(1,22),"X"),sep="")
chromosome.colors

Heatmap(my_matrix,
        cluster_columns=FALSE,
        row_names_side = "left",
        row_hclust_side = "left",
        row_names_gp=gpar(cex=fontsize),
        row_hclust_width = unit(3, "cm"),
        clustering_distance_rows ="maximum",
        clustering_method_rows = "ward.D",
        km=2, # number of clusters you want
        bottom_annotation = HeatmapAnnotation(chromosome_info,col = list(chrom=chromosome.colors),show_legend=FALSE)
        )

Download practice data, scripts, and video files for offline viewing (for all 8 lessons)

Hi there!

Whew, you made it all the way to the bottom. Congrats!
I hope you have enjoyed this course on Plotting in R for Biologists.
If you found it useful and you want contribute something to help me make more videos in the future you can buy this course and get all the scripts, practice data, and videos that you can download and watch offline.
Do you have any friends or colleagues who might like to take this course?
Please share it with them.

Thanks a bunch!
Maria

Get weekly emails with new videos and behind-the-scenes updates

Email_signup_thumbnail

Sign up to get free bioinformatics video content, guides, and other good stuff delivered straight to your email inbox

Powered by ConvertKit