Data Visualization - R

Beatriz Manso
2021-12-03

Data Visualisation involves producing images that communicate relationships among the represented data to the audience. This is achieved through the use of a systematic mapping between graphic marks and data values in the creation of visualisations with elements like charts, graphs and maps. Data visualisation provides an accessible way to see and understand trends, outliers and patterns in big data.

In bioinformatics, data visualisation techniques are essential for analysing massive amounts of genomic information and for supporting the decisions based on it.

Set working directory:

setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/3.DSB - Data Science for Bioinformatics/Practice/DSB W9 - Data Visualisation")

Install packages and Load Libraries:

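install.packages("tidyverse")  # install.packages() has no Force argument
library(tidyverse)             # loads ggplot2, dplyr and the other core packages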

1. Genesis of applying deep philosophy of visualisation

qplot() function

This function is similar to the basic plot() function from base R. It makes it easy to create and combine different types of plots, but it is less flexible than ggplot(). Note that qplot() has been deprecated since ggplot2 3.4.0, so recent versions print a deprecation warning when it is called.

As an example we will use mtcars dataset.

data(mtcars)
df <- mtcars [, c("mpg", "cyl", "wt")]
head(df)
                   mpg cyl    wt
Mazda RX4         21.0   6 2.620
Mazda RX4 Wag     21.0   6 2.875
Datsun 710        22.8   4 2.320
Hornet 4 Drive    21.4   6 3.215
Hornet Sportabout 18.7   8 3.440
Valiant           18.1   6 3.460
tail(df)
                mpg cyl    wt
Porsche 914-2  26.0   4 2.140
Lotus Europa   30.4   4 1.513
Ford Pantera L 15.8   8 3.170
Ferrari Dino   19.7   6 2.770
Maserati Bora  15.0   8 3.570
Volvo 142E     21.4   4 2.780

Description: The data comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

The subset df is a data frame with 32 observations on 3 variables (the full mtcars dataset has 11).

Usage of qplot() function

qplot(mpg, wt, data = mtcars, colour = cyl)

qplot(mpg, wt, data = mtcars, facets = vs ~ am)

qplot(displ, hwy, data = mpg, colour = drv) 

f <- function() {
 a <- 1:10
 b <- (a ^ 2)
 qplot(a, b)
}
f()

qplot(displ, hwy, data = mpg, geom = c("point", "smooth")) 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() is the core function of ggplot2 and is flexible enough to do things that qplot() cannot.
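To make the comparison concrete, here is a minimal sketch of how one of the qplot() calls above could be written with ggplot(), followed by a plot that uses a different data subset in each layer. The object names p_scatter and p_layers are illustrative, and the highlighted subset (8-cylinder cars) is an arbitrary choice.

```r
library(ggplot2)

# ggplot() version of qplot(mpg, wt, data = mtcars, colour = cyl)
p_scatter <- ggplot(mtcars, aes(x = mpg, y = wt, colour = cyl)) +
  geom_point()

# Each layer can take its own data: all cars in grey,
# then the 8-cylinder cars drawn again on top in red
p_layers <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point(colour = "grey60") +
  geom_point(data = subset(mtcars, cyl == 8), colour = "red", size = 3)

print(p_scatter)
print(p_layers)
```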

Key steps in ggplot2 data visualisation

A typical plot is built up in the following order:
1. Start with ggplot().
2. Supply a dataset.
3. Define the aesthetic mapping (with aes()).
4. Add layers (like geom_point() or geom_histogram()).
5. Add scales (like scale_colour_brewer()).
6. Add faceting specifications (like facet_wrap()).
7. Add coordinate systems (like coord_flip()).
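The steps above can be sketched in a single call. This example uses the mpg dataset that ships with ggplot2; the choice of variables and of the "Set1" palette is an assumption made for illustration only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +  # steps 1-3: ggplot(), data, aes()
  geom_point() +                                           # step 4: a layer
  scale_colour_brewer(palette = "Set1") +                  # step 5: a scale
  facet_wrap(~ class) +                                    # step 6: faceting
  coord_flip()                                             # step 7: coordinate system

print(p)
```

Each component is optional apart from the data, the mapping and at least one layer, so the plot can be grown one `+` at a time.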

There are many more predefined geoms that can be added.

Between the brackets we need to specify aesthetics:
- required: the variable(s) the geom represents
- optional: attributes such as colour and size
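One way to check which aesthetics a geom requires is to inspect the geom object itself. The sketch below looks at GeomPoint and GeomText (the objects behind geom_point() and geom_text()) and then draws a plot that supplies the required x and y plus two optional aesthetics. The variable choices are illustrative.

```r
library(ggplot2)

# Aesthetics that must be supplied for each geom
point_required <- GeomPoint$required_aes   # "x" "y"
text_required  <- GeomText$required_aes    # "x" "y" "label"

# x and y are required; colour (mapped to a variable) and size (fixed) are optional
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv), size = 2)

print(p)
```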

qplot(hwy, data = mpg, fill = drv) 

qplot(displ, hwy, data = mpg, facets = . ~ drv) 

qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)

2. Compare ggplot2 with base graphics

landdata-counties.xlsx file is used in this exercise.

library(readxl)

landdata <- read_excel("landdata-counties.xlsx")
housing<-landdata
head(housing[1:5])
# A tibble: 6 x 5
  County region  Date Home.Value Structure.Cost
  <chr>  <chr>  <dbl>      <dbl>          <dbl>
1 BD     West   2010.     224952         160599
2 BD     West   2010.     225511         160252
3 BD     West   2010.     225820         163791
4 BD     West   2010      224994         161787
5 BD     West   2008      234590         155400
6 BD     West   2008.     233714         157458

The $ operator extracts all values from the Home.Value column.
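As a small self-contained illustration, the same extraction can be tried on the built-in mtcars data frame (used here only because the housing file may not be at hand). The names mpg_values and same_values are illustrative.

```r
data(mtcars)

# $ returns the column as a plain vector, one value per row
mpg_values <- mtcars$mpg

# [[ ]] with the column name is an equivalent way to extract it
same_values <- mtcars[["mpg"]]

length(mpg_values)   # 32
head(mpg_values)
```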

Examples:

hist(housing$Home.Value)

A basic histogram bins a variable into fixed-width buckets and returns the number of data points that fall into each bucket.
- For example, you could group your customers by age range, in intervals of five years: 20-25, 25-30, 30-35, and so on.
- Customers at a boundary age would go into the higher bucket: 25-year-olds go into the 25-30 bucket.
- For each bucket, you then count how many customers are in that bucket.

Now, let's plot a histogram with ggplot2:
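The bucketing described above can be sketched with cut() and table(). The ages vector is made-up illustration data. Note that right = FALSE is what sends a boundary value such as 25 into the higher bucket; by default hist() uses right = TRUE, which closes intervals on the right instead.

```r
# Hypothetical customer ages, including two exactly on the 25 boundary
ages <- c(20, 23, 25, 25, 29, 30, 34)
breaks <- seq(20, 35, by = 5)

# Left-closed intervals: [20,25), [25,30), [30,35)
buckets <- cut(ages, breaks = breaks, right = FALSE)
counts <- table(buckets)
print(counts)

# hist() gives the same counts when told to use left-closed intervals
hist_counts <- hist(ages, breaks = breaks, right = FALSE, plot = FALSE)$counts
```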

library(ggplot2)
ggplot(housing, aes(x = Home.Value)) +
  geom_histogram()

A plot can be saved to a folder: click the Export button at the top of the Plots pane (in RStudio) and choose a destination. The navigation arrows in the same pane let you step back to a previous plot and compare.
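Saving can also be scripted with ggsave(). The sketch below writes a histogram of the built-in mpg data to a temporary folder; the file name, the dimensions and the use of tempdir() are assumptions for illustration, so substitute a folder of your choice.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Output path is illustrative; the extension (.pdf) selects the file type
out_file <- file.path(tempdir(), "hwy_histogram.pdf")

# width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)
```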

Compared to base graphics, ggplot2
- is more verbose for simple / canned graphics
- is less verbose for complex / custom graphics
- does not have methods (data should always be in a data.frame)
- uses a different system for adding plot elements

Aesthetic Mapping

In ggplot land aesthetic means “something you can see”. Examples include:
- position (i.e., on the x and y axes)
- color (“outside” color)
- fill (“inside” color)
- shape (of points)
- linetype
- size

Each type of geom accepts only a subset of all aesthetics

Aesthetic mappings are set with the aes() function.

Geometric Objects (geom)

Geometric objects are the actual marks we put on a plot. Examples include:
- points (geom_point, for scatter plots, dot plots, etc)
- lines (geom_line, for time series, trend lines, etc)
- boxplot (geom_boxplot, for, well, boxplots!)

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator
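A short sketch of adding geoms with the + operator, using the built-in mpg data. The object names are illustrative; the point is that each + appends one more layer to the plot.

```r
library(ggplot2)

# Data and mapping only: no geom yet, so only empty axes would be drawn
base_plot <- ggplot(mpg, aes(x = class, y = hwy))

# One geom
one_geom <- base_plot + geom_boxplot()

# Two geoms: the raw points are jittered on top of the boxplots
two_geoms <- one_geom + geom_jitter(width = 0.2, alpha = 0.4)

print(two_geoms)
```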

library(dplyr)  # filter() below comes from dplyr
plot(Home.Value ~ Date,
 col = factor(County) ,
 data = filter(housing, County %in% c("MA","TX")))
legend("topleft",
 legend = c("MA","TX"),
 col = c("black", "red"),
 pch = 1)

ggplot(filter(housing, County %in% c("MA", "TX")),
       aes(x = Date, y = Home.Value, color = County)) +
  geom_point()

Scatterplot

A scatterplot requires mappings for both the x and y coordinates. The filter() function can be used to choose the rows/cases that we want to plot; here the subset to plot is read from a prepared file, hp2001Q1.csv.

library(dplyr)
library(ggplot2)

hp2001Q1 <- read.csv("hp2001Q1.csv")

ggplot(hp2001Q1, aes(y=Structure.Cost, x=Home.Value)) + geom_point()

ggplot(hp2001Q1, aes(y=Structure.Cost, x=log(Land.Value))) + geom_point()

Prediction Line (Regression lines):

hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))
p1 + geom_point(aes(colour = Home.Value)) + geom_line(aes(y = pred.SC))
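The same pattern (fit a model, store the predictions, draw them with geom_line()) can be tried without the housing file. This sketch swaps in the built-in mtcars data, so the variables mpg, wt and hp and the names cars_df, fit and p_fit are stand-ins rather than part of the exercise.

```r
library(ggplot2)

cars_df <- mtcars

# Fit a linear model on a log-transformed predictor and store the fitted values
fit <- lm(mpg ~ log(wt), data = cars_df)
cars_df$pred_mpg <- predict(fit)

# Points coloured by a third variable, with the prediction line on top
p_fit <- ggplot(cars_df, aes(x = log(wt), y = mpg)) +
  geom_point(aes(colour = hp)) +
  geom_line(aes(y = pred_mpg))

print(p_fit)
```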

Smoothers with a ribbon:

We have assigned our plot as p1, so we can just call p1 and add a smoother…

p1 + geom_point(aes(colour = Home.Value)) + geom_smooth()
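geom_smooth() accepts a few arguments that control the fitted curve and its ribbon. The sketch below uses the built-in mtcars data as a stand-in for the housing data; the object names are illustrative.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Loess smoother with the confidence ribbon (se = TRUE is the default)
p_loess <- p_base + geom_smooth(method = "loess", formula = y ~ x)

# Straight-line fit with a 95% confidence ribbon
p_lm <- p_base + geom_smooth(method = "lm", formula = y ~ x, level = 0.95)

# Same fit with the ribbon switched off
p_noribbon <- p_base + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_lm)
```

Passing method and formula explicitly also avoids the informational message that geom_smooth() prints when it has to pick them itself.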

Plotting Text (Label Points)

We can plot the county details on p1 by using the geom_text() function to label each point with its county.

p1 + geom_text(aes(label=County), size = 3)

Some of the labels overlap, so we can use the ggrepel package, which provides text and label geoms for ggplot2 that help to avoid overlapping text labels. Labels repel away from each other and away from the data points.

install.packages("ggrepel")
library("ggrepel")
p1 + geom_point() + geom_text_repel(aes(label=County), size = 3)

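# Fixed values such as "red" are set outside aes(); inside aes() the string would be treated as data
p1 + geom_point(colour = "red", size = 2)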

p1 + geom_point(aes(colour=Home.Value, shape = region))
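The difference between mapping an aesthetic to a variable and setting it to a fixed value can be checked directly. This sketch uses the built-in mtcars data; the names p_mapped and p_fixed are illustrative. layer_data() returns the values ggplot2 actually computed for a layer.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Inside aes(): "red" is treated as a one-level data value,
# so it gets a legend and a colour from the default palette
p_mapped <- base + geom_point(aes(colour = "red"))

# Outside aes(): the points are literally drawn in red
p_fixed <- base + geom_point(colour = "red", size = 2)

mapped_colour <- unique(layer_data(p_mapped)$colour)
fixed_colour  <- unique(layer_data(p_fixed)$colour)
```

Comparing mapped_colour with fixed_colour shows that only the second plot uses the colour that was asked for.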

Basic box plot

x <- "1"
y <- rnorm(100)
qplot(x, y, geom="boxplot")

head(PlantGrowth)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl
qplot(group, weight, data = PlantGrowth, 
 geom=c("boxplot"))

Dot plot

qplot(group, weight, data = PlantGrowth, 
 geom=c("dotplot"), 
 stackdir = "center", binaxis = "y")

Violin plot

qplot(group, weight, data = PlantGrowth,
 geom=c("violin"), trim = FALSE)
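For reference, the three PlantGrowth plots above can also be written with ggplot() and the corresponding geoms, which may be preferable in recent ggplot2 versions. This is a sketch of one possible translation; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(PlantGrowth, aes(x = group, y = weight))

# Box plot
p_box <- base + geom_boxplot()

# Dot plot: bin along y and stack the dots around the centre of each group
p_dot <- base + geom_dotplot(binaxis = "y", stackdir = "center")

# Violin plot with untrimmed tails
p_violin <- base + geom_violin(trim = FALSE)

print(p_box)
print(p_dot)
print(p_violin)
```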