Data visualisation involves producing images that communicate relationships among the represented data to an audience. This is achieved through a systematic mapping between graphic marks and data values, using elements such as charts, graphs and maps. Data visualisation provides an accessible way to see and understand trends, outliers and patterns in large datasets.
In bioinformatics, data visualisation techniques are essential for analysing massive amounts of genomic information and for supporting decisions based on it.
Set working directory:
setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/3.DSB - Data Science for Bioinformatics/Practice/DSB W9 - Data Visualisation")
Install packages and Load Libraries:
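install.packages("tidyverse")  # only needed once per R installation
library(tidyverse)             # attaches ggplot2, dplyr and other tidyverse packages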
qplot() function
This function is similar to the basic plot() function from the R base package. It can be used to create and combine different types of plots easily. However, it remains less flexible than the ggplot() function.
As an example, we will use the mtcars dataset.
mpg cyl wt
Mazda RX4 21.0 6 2.620
Mazda RX4 Wag 21.0 6 2.875
Datsun 710 22.8 4 2.320
Hornet 4 Drive 21.4 6 3.215
Hornet Sportabout 18.7 8 3.440
Valiant 18.1 6 3.460
mpg cyl wt
Porsche 914-2 26.0 4 2.140
Lotus Europa 30.4 4 1.513
Ford Pantera L 15.8 8 3.170
Ferrari Dino 19.7 6 2.770
Maserati Bora 15.0 8 3.570
Volvo 142E 21.4 4 2.780
Description: The data comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
The full data frame has 32 observations on 11 variables; the preview above shows 3 of them (mpg, cyl and wt).
qplot() works like the plot() function in the base graphics system: it looks for data in a data frame, similar to lattice, or in the parent environment.
Plots are made up of aesthetics (size, shape, colour) and geoms (points, lines).
Factors are important for indicating subsets of the data; they should be labelled.
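As a minimal sketch of the point about labelling factors, the snippet below converts the numeric cyl column of mtcars into a labelled factor. The copy named cars, the column name cyl_f and the label wording are illustrative choices, not part of the original exercise.

```r
# Copy mtcars so the built-in dataset is left untouched
cars <- mtcars

# cyl is stored as a number (4, 6, 8); convert it to a labelled factor
# (cyl_f and the label wording are illustrative assumptions)
cars$cyl_f <- factor(cars$cyl,
                     levels = c(4, 6, 8),
                     labels = c("4 cylinders", "6 cylinders", "8 cylinders"))

# Count how many cars fall into each labelled group
table(cars$cyl_f)
```

Mapping colour to a labelled factor like this should give a discrete legend with readable group names rather than a continuous colour bar.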
Example 1:
qplot(mpg, wt, data = mtcars, colour = cyl)
qplot(mpg, wt, data = mtcars, facets = vs ~ am)
qplot(displ, hwy, data = mpg, colour = drv)
f <- function() {
  a <- 1:10
  b <- a ^ 2
  qplot(a, b)
}
f()  # qplot() finds a and b in the function's environment
ggplot() is the core function and is flexible enough to do things that qplot() cannot.
A common workflow is:
1. Call ggplot().
2. Supply a dataset.
3. Supply an aesthetic mapping (with aes()).
4. Add layers (like geom_point() or geom_histogram()).
5. Add scales (like scale_colour_brewer()).
6. Add faceting specifications (like facet_wrap()).
7. Add coordinate systems (like coord_flip()).
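The seven steps can be sketched on the mtcars data used earlier. This is only an illustration: the choice of variables, the "Set1" palette and the object names p and p_full are assumptions made for the example.

```r
library(ggplot2)

# Steps 1-3: create the plot, supply the data, map variables to aesthetics
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl)))

# Steps 4-7: add a layer, a scale, facets and a coordinate system
p_full <- p +
  geom_point() +                           # layer
  scale_colour_brewer(palette = "Set1") +  # scale
  facet_wrap(~ am) +                       # one panel per transmission type
  coord_flip()                             # swap the x and y axes

p_full
```

Because each step is added with +, any of the later components can be dropped or swapped without touching the data and mapping defined in p.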
There are more predefined geoms that can be added as layers. Between the brackets of each geom we need to specify aesthetics:
- required: the variable the geom represents
- optional: attributes such as colour and size
qplot(hwy, data = mpg, fill = drv)
qplot(displ, hwy, data = mpg, facets = . ~ drv)
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
The file landdata-counties.xlsx is used in this exercise.
housing <- readxl::read_excel("landdata-counties.xlsx")  # named housing to match the plotting code that follows
# A tibble: 6 x 5
County region Date Home.Value Structure.Cost
<chr> <chr> <dbl> <dbl> <dbl>
1 BD West 2010. 224952 160599
2 BD West 2010. 225511 160252
3 BD West 2010. 225820 163791
4 BD West 2010 224994 161787
5 BD West 2008 234590 155400
6 BD West 2008. 233714 157458
The $ sign extracts all values from the Home.Value column, for example housing$Home.Value.
A basic histogram bins a variable into fixed-width buckets and returns the number of data points that fall into each bucket.
- For example, you could group your customers by age range, in intervals of five years: 20-25, 25-30, 30-35, and so on.
- Customers at a boundary age would go into the higher bucket: 25-year-olds go into the 25-30 bucket.
- For each bucket, you then count how many customers are in that bucket.
Now, let's plot a histogram with ggplot2:
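The bucketing rule described above can be reproduced in base R with cut() and table(). This is a small sketch on made-up ages; the vector ages and the break points are hypothetical and only serve to show where a boundary value lands.

```r
# Hypothetical customer ages (made up for illustration)
ages <- c(21, 25, 25, 29, 30, 34, 38)

# Five-year buckets; right = FALSE makes each interval [lower, upper),
# so a 25-year-old is placed in the 25-30 bucket, not the 20-25 bucket
buckets <- cut(ages, breaks = seq(20, 40, by = 5), right = FALSE)

# Count the customers in each bucket
counts <- table(buckets)
counts
```

Note that base hist() closes intervals on the right by default (right = TRUE), so a boundary value goes into the lower bucket there unless right = FALSE is supplied.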
ggplot(housing, aes(x = Home.Value)) +
  geom_histogram()
It's possible to save your plot to a folder: click the Export button at the top of the Plots pane and save to a folder of your choice. You can also use the navigation buttons to view the previous plot and compare.
Compared to base graphics, ggplot2:
- is more verbose for simple / canned graphics
- is less verbose for complex / custom graphics
- does not have methods (data should always be in a data.frame)
- uses a different system for adding plot elements
Aesthetic Mapping
In ggplot land, aesthetic means “something you can see”. Examples include:
- position (i.e., on the x and y axes)
- color (“outside” color)
- fill (“inside” color)
- shape (of points)
- linetype
- size
Each type of geom accepts only a subset of all aesthetics.
Aesthetic mappings are set with the aes() function.
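A short sketch of the difference between mapping an aesthetic to a variable and setting it to a fixed value, using mtcars rather than the housing data. The object names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

# Mapping: colour varies with a variable, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)))

# Setting: every point gets the same fixed colour and size,
# so these arguments go outside aes()
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "red", size = 2)
```

Putting a constant such as "red" inside aes() treats it as data: ggplot2 creates a legend for it and picks the colour from the default scale, which is usually not what was intended.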
Geometric Objects (geom)
Geometric objects are the actual marks we put on a plot. Examples include:
- points (geom_point, for scatter plots, dot plots, etc)
- lines (geom_line, for time series, trend lines, etc)
- boxplot (geom_boxplot, for, well, boxplots!)
A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.
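To illustrate adding geoms with +, the sketch below builds a plot in stages on mtcars. The object names are illustrative, and the straight-line fit is just one possible second geom.

```r
library(ggplot2)

# Start with the data and aesthetic mappings only (no marks yet)
base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add one geom: a scatter plot
scatter <- base + geom_point()

# Add a second geom on top: a straight-line fit with its confidence band
scatter_fit <- scatter + geom_smooth(method = "lm", formula = y ~ x)

scatter_fit
```

Each + returns a new plot object, so intermediate stages such as scatter can be reused as the starting point for other variations.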
plot(Home.Value ~ Date,
     col = factor(County),
     data = filter(housing, County %in% c("MA", "TX")))
# Add a legend; the "topleft" position is an assumption, adjust as needed
legend("topleft",
       legend = c("MA", "TX"),
       col = c("black", "red"),
       pch = 1)
ggplot(filter(housing, County %in% c("MA", "TX")),
       aes(x = Date, y = Home.Value, color = County)) +
  geom_point()
A scatterplot requires mappings for both x and y coordinates. Let's use the filter() function to choose the rows (cases) that we want to plot.
hp2001Q1 = read.csv('hp2001Q1.csv')
ggplot(hp2001Q1, aes(y=Structure.Cost, x=Home.Value)) + geom_point()
ggplot(hp2001Q1, aes(y=Structure.Cost, x=log(Land.Value))) + geom_point()
hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))
p1 + geom_point(aes(colour = Home.Value)) + geom_line(aes(y = pred.SC))
We have assigned our plot as p1, so we can just call p1 and add a smoother…
p1 + geom_point(aes(colour = Home.Value)) + geom_smooth()
We can plot the county details on p1 by using the geom_text() function to label or annotate the points.
Some of the labels overlap, so we can use the ggrepel package, which provides text and label geoms for ggplot2 that help to avoid overlapping text labels. Labels repel away from each other and away from the data points.
p1 + geom_point() + ggrepel::geom_text_repel(aes(label = County), size = 3)  # requires the ggrepel package to be installed
p1 + geom_point(colour = "red", size = 2)  # fixed values are set outside aes()
p1 + geom_point(aes(colour = Home.Value, shape = region))  # variables are mapped inside aes()
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl