RNA-Seq data Analysis - Expression Analysis

Beatriz Manso
2022-04-07

Introduction

By quantifying the activity of RNA in a biological sample, gene expression can provide valuable insights into disease nature and its treatment. The RNA-Seq assay is one of the fastest growing Next Generation Sequencing (NGS) approaches used to assess gene expression and alternative splicing, where both known and novel features can be detected in a single assay, allowing for the identification of transcript isoforms, gene fusions, and single nucleotide variations as well as other features without a prior knowledge of the sequence.

Methods

1. Set Working Directory, Install necessary packages and load libraries:

setwd("C:/Users/manso/Desktop/practical")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rsubread")
BiocManager::install("edgeR")
BiocManager::install("affy")
BiocManager::install("pheatmap")
BiocManager::install("corrplot")

library(edgeR)
library(Rsubread)
library(Rsamtools) 
library(affy)
library(GEOquery)
library(tidyverse)
library(pheatmap)
library(stats)
library(corrplot)
library(knitr)

2. Alignment

2.1.List the fastq files:

fastq.files <- list.files(path = "C:/Users/manso/Desktop/practical", pattern = ".fastq$", full.names = TRUE)

fastq.files
[1] "C:/Users/manso/Desktop/practical/SRR14094802_1.fastq"
[2] "C:/Users/manso/Desktop/practical/SRR14094802_2.fastq"

2.2. Build the index:

Lets use the Saccharomyces reference genome for now because the human genome takes a long time to complete.

buildindex(basename="ht",reference="GCF_000146045.2_R64_genomic.fna")

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.8.2

//================================= setting ==================================\\
||                                                                            ||
||                Index name : ht                                             ||
||               Index space : base space                                     ||
||               Index split : no-split                                       ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : no                                             ||
||                                                                            ||
||       Free / total memory : 16.6GB / 31.4GB                                ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o GCF_000146045.2_R64_genomic.fna              ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
|| 11 uninformative subreads were found.                                      ||
|| These subreads were excluded from index building.                          ||
|| Estimate the index size...                                                 ||
||    8%,   0 mins elapsed, rate=3530.8k bps/s                                ||
||   16%,   0 mins elapsed, rate=4254.2k bps/s                                ||
||   24%,   0 mins elapsed, rate=4567.0k bps/s                                ||
||   33%,   0 mins elapsed, rate=4751.8k bps/s                                ||
||   41%,   0 mins elapsed, rate=4879.5k bps/s                                ||
||   49%,   0 mins elapsed, rate=4932.2k bps/s                                ||
||   58%,   0 mins elapsed, rate=4987.9k bps/s                                ||
||   66%,   0 mins elapsed, rate=5030.6k bps/s                                ||
||   74%,   0 mins elapsed, rate=5072.8k bps/s                                ||
||   83%,   0 mins elapsed, rate=5109.0k bps/s                                ||
||   91%,   0 mins elapsed, rate=5139.5k bps/s                                ||
|| 3.0 GB of memory is needed for index building.                             ||
|| Build the index...                                                         ||
||    8%,   0 mins elapsed, rate=831.4k bps/s                                 ||
||   16%,   0 mins elapsed, rate=1266.5k bps/s                                ||
||   24%,   0 mins elapsed, rate=1534.9k bps/s                                ||
||   33%,   0 mins elapsed, rate=1717.9k bps/s                                ||
||   41%,   0 mins elapsed, rate=1850.4k bps/s                                ||
||   49%,   0 mins elapsed, rate=1946.9k bps/s                                ||
||   58%,   0 mins elapsed, rate=2023.3k bps/s                                ||
||   66%,   0 mins elapsed, rate=2088.1k bps/s                                ||
||   74%,   0 mins elapsed, rate=2141.6k bps/s                                ||
||   83%,   0 mins elapsed, rate=2185.7k bps/s                                ||
||   91%,   0 mins elapsed, rate=2222.3k bps/s                                ||
|| Save current index block...                                                ||
||  [ 0.0% finished ]                                                         ||
||  [ 10.0% finished ]                                                        ||
||  [ 20.0% finished ]                                                        ||
||  [ 30.0% finished ]                                                        ||
||  [ 40.0% finished ]                                                        ||
||  [ 50.0% finished ]                                                        ||
||  [ 60.0% finished ]                                                        ||
||  [ 70.0% finished ]                                                        ||
||  [ 80.0% finished ]                                                        ||
||  [ 90.0% finished ]                                                        ||
||  [ 100.0% finished ]                                                       ||
||                                                                            ||
||                      Total running time: 0.3 minutes.                      ||
||     Index C:\Users\manso\Desktop\practical\ht was successfully built.      ||
||                                                                            ||
\\============================================================================//

2.3. Aligning reads with reference genome

reads1 <- list.files( path = "C:/Users/manso/Desktop/practical", pattern = "*_1.fastq$" )

reads2 <- list.files( path = "C:/Users/manso/Desktop/practical", pattern = "*_2.fastq$" )

align(index="ht",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.8.2

//================================= setting ==================================\\
||                                                                            ||
|| Function      : Read alignment (RNA-Seq)                                   ||
|| Input file 1  : SRR14094802_1.fastq                                        ||
|| Input file 2  : SRR14094802_2.fastq                                        ||
|| Output file   : SRR14094802_1.fastq.subread.BAM (BAM)                      ||
|| Index name    : ht                                                         ||
||                                                                            ||
||                    ------------------------------------                    ||
||                                                                            ||
||                               Threads : 16                                 ||
||                          Phred offset : 33                                 ||
||               # of extracted subreads : 10                                 ||
||                        Min read1 vote : 3                                  ||
||                        Min read2 vote : 1                                  ||
||                     Max fragment size : 600                                ||
||                     Min fragment size : 50                                 ||
||                        Max mismatches : 3                                  ||
||                      Max indel length : 5                                  ||
||            Report multi-mapping reads : yes                                ||
|| Max alignments per multi-mapping read : 1                                  ||
||                                                                            ||
\\============================================================================//

//================ Running (05-May-2022 02:30:34, pid=10796) =================\\
||                                                                            ||
|| Check the input reads.                                                     ||
|| The input file contains base space reads.                                  ||
|| Initialise the memory objects.                                             ||
|| Estimate the mean read length.                                             ||
|| The range of Phred scores observed in the data is [2,40]                   ||
|| Create the output BAM file.                                                ||
|| Check the index.                                                           ||
|| Init the voting space.                                                     ||
|| Global environment is initialised.                                         ||
|| Load the 1-th index block...                                               ||
|| The index block has been loaded.                                           ||
|| Start read mapping in chunk.                                               ||
||    7% completed, 0.3 mins elapsed, rate=67.4k fragments per second         ||
||   14% completed, 0.4 mins elapsed, rate=71.4k fragments per second         ||
||   22% completed, 0.4 mins elapsed, rate=72.6k fragments per second         ||
||   30% completed, 0.4 mins elapsed, rate=73.7k fragments per second         ||
||   37% completed, 0.4 mins elapsed, rate=73.9k fragments per second         ||
||   45% completed, 0.5 mins elapsed, rate=74.0k fragments per second         ||
||   53% completed, 0.5 mins elapsed, rate=74.4k fragments per second         ||
||   61% completed, 0.5 mins elapsed, rate=74.4k fragments per second         ||
||   Estimated fragment length : 182 bp                                       ||
||   69% completed, 0.5 mins elapsed, rate=29.6k fragments per second         ||
||   73% completed, 0.6 mins elapsed, rate=29.8k fragments per second         ||
||   76% completed, 0.6 mins elapsed, rate=30.0k fragments per second         ||
||   79% completed, 0.6 mins elapsed, rate=30.2k fragments per second         ||
||   83% completed, 0.6 mins elapsed, rate=30.4k fragments per second         ||
||   86% completed, 0.6 mins elapsed, rate=30.6k fragments per second         ||
||   89% completed, 0.7 mins elapsed, rate=30.8k fragments per second         ||
||   93% completed, 0.7 mins elapsed, rate=31.0k fragments per second         ||
||   96% completed, 0.7 mins elapsed, rate=31.2k fragments per second         ||
||   99% completed, 0.7 mins elapsed, rate=31.3k fragments per second         ||
||                                                                            ||
||                           Completed successfully.                          ||
||                                                                            ||
\\====================================    ====================================//

//================================   Summary =================================\\
||                                                                            ||
||             Total fragments : 1372071                                      ||
||                      Mapped : 923639 (67.3%)                               ||
||             Uniquely mapped : 740587                                       ||
||               Multi-mapping : 183052                                       ||
||                                                                            ||
||                    Unmapped : 448432                                       ||
||                                                                            ||
||             Properly paired : 865412                                       ||
||         Not properly paired : 58227                                        ||
||                   Singleton : 43703                                        ||
||                    Chimeric : 7596                                         ||
||       Unexpected strandness : 998                                          ||
||  Unexpected fragment length : 5099                                         ||
||       Unexpected read order : 831                                          ||
||                                                                            ||
||                      Indels : 499                                          ||
||                                                                            ||
||                Running time : 0.7 minutes                                  ||
||                                                                            ||
\\============================================================================//
                                SRR14094802_1.fastq.subread.BAM
Total_fragments                                         1372071
Mapped_fragments                                         923639
Uniquely_mapped_fragments                                740587
Multi_mapping_fragments                                  183052
Unmapped_fragments                                       448432
Properly_paired_fragments                                865412
Singleton_fragments                                       43703
More_than_one_chr_fragments                                7596
Unexpected_strandness_fragments                             998
Unexpected_template_length                                 5099
Inversed_mapping                                            831
Indels                                                      499

3. Post alignment quality control

3.1. Setup a BamFile object:

bamFile <- BamFile(file = "SRR14094802_1.sorted.bam", index= "SRR14094802_1.sorted.bam.bai")

bamFile
class: BamFile 
path: SRR14094802_1.sorted.bam
index: SRR14094802_1.sorted.bam.bai
isOpen: FALSE 
yieldSize: NA 
obeyQname: FALSE 
asMates: FALSE 
qnamePrefixEnd: NA 
qnameSuffixStart: NA 

In Ubunto terminal - To create a BAM index:
- You must first sort the BAM file to create a sorted.bam - Run samtools index with the sorted.bam as input - This will create a file named sorted.bam.bai which
contains the index

samtools sort  SRR14094802_1.fastq.subread.BAM -o  SRR14094802_1.sorted.bam

samtools index SRR14094802_1.sorted.bam SRR14094802_1.sorted.bam.bai

3.2. High-level information can be accessed with seqinfo()

seqinfo(bamFile)

3.3. Read aligned reads using scanBam()

aln <- scanBam(bamFile)
length(aln)
[1] 1
class(aln)
[1] "list"

In the case of scanBam() it’s possible to get output from multiple genomic regions,here we get back a list of length 1. If it returns a list >1, we subset that list to a single list and then display that information.

aln <- aln[[1]]
names(aln)
 [1] "qname"  "flag"   "rname"  "strand" "pos"    "qwidth" "mapq"  
 [8] "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"  
lapply(aln, function(xx) xx[1])
$qname
[1] "SRR14094802.107538"

$flag
[1] 129

$rname
[1] NC_001133.9
17 Levels: NC_001133.9 NC_001134.8 NC_001135.5 ... NC_001224.1

$strand
[1] +
Levels: + - *

$pos
[1] 1

$qwidth
[1] 25

$mapq
[1] 0

$cigar
[1] "25M"

$mrnm
[1] NC_001133.9
17 Levels: NC_001133.9 NC_001134.8 NC_001135.5 ... NC_001224.1

$mpos
[1] 230104

$isize
[1] 230128

$seq
DNAStringSet object of length 1:
    width seq
[1]    25 CCACACCACACCCACACACCCACAC

$qual
PhredQuality object of length 1:
    width seq
[1]    25 GAGAGIGGGIIIIGGIIIIIGGGAG

3.4. Get BAM flag sumary

                                group |    nb of |    nb of | mean / max
                                   of |  records |   unique | records per
                              records | in group |   QNAMEs | unique QNAME
All records........................ A |  2744142 |  1372071 |    2 / 2
  o template has single segment.... S |        0 |        0 |   NA / NA
  o template has multiple segments. M |  2744142 |  1372071 |    2 / 2
      - first segment.............. F |  1372071 |  1372071 |    1 / 1
      - last segment............... L |  1372071 |  1372071 |    1 / 1
      - other segment.............. O |        0 |        0 |   NA / NA

Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.

Details for group M:
  o record is mapped.............. M1 |  1803575 |   923639 | 1.95 / 2
      - primary alignment......... M2 |  1803575 |   923639 | 1.95 / 2
      - secondary alignment....... M3 |        0 |        0 |   NA / NA
  o record is unmapped............ M4 |   940567 |   492135 | 1.91 / 2

Details for group F:
  o record is mapped.............. F1 |   908542 |   908542 |    1 / 1
      - primary alignment......... F2 |   908542 |   908542 |    1 / 1
      - secondary alignment....... F3 |        0 |        0 |   NA / NA
  o record is unmapped............ F4 |   463529 |   463529 |    1 / 1

Details for group L:
  o record is mapped.............. L1 |   895033 |   895033 |    1 / 1
      - primary alignment......... L2 |   895033 |   895033 |    1 / 1
      - secondary alignment....... L3 |        0 |        0 |   NA / NA
  o record is unmapped............ L4 |   477038 |   477038 |    1 / 1

4. Counting and normalisation methods

For this step we’ll use The Human sample SRR8472776.bam

4.1. Counting - Quantification with featureCounts

SRRcount <- featureCounts(files="SRR8472776.bam",
                       annot.ext="gencode.v29.annotation.gtf",
                       isGTFAnnotationFile=TRUE, GTF.featureType="exon", 
                       GTF.attrType="gene_id")

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.8.2

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           SRR8472776.bam                                   ||
||                                                                            ||
||              Paired-end : no                                               ||
||        Count read pairs : no                                               ||
||              Annotation : gencode.v29.annotation.gtf (GTF)                 ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file gencode.v29.annotation.gtf ...                        ||
||    Features : 1262773                                                      ||
||    Meta-features : 58721                                                   ||
||    Chromosomes/contigs : 25                                                ||
||                                                                            ||
|| Process BAM file SRR8472776.bam...                                         ||
||    Single-end reads are included.                                          ||
||    Total alignments : 2347456                                              ||
||    Successfully assigned alignments : 2279419 (97.1%)                      ||
||    Running time : 0.04 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||
\\============================================================================//

View the count file:

View(SRRcount[["counts"]])
SRR8472776 <- SRRcount[["counts"]]
write.csv(SRR8472776, "SRR8472776_counts.csv")
#Read files SRP029880.raw_counts.tsv as count file
Count_file <- read.table(file="SRP029880.raw_counts.tsv",
                        sep = '\t', header = TRUE, 
                        fill = TRUE)

counts <- as.matrix(Count_file)

summary(counts)
     CASE_1              CASE_2              CASE_3         
 Min.   :        0   Min.   :        0   Min.   :        0  
 1st Qu.:     5155   1st Qu.:     6464   1st Qu.:     3972  
 Median :    80023   Median :    85064   Median :    64145  
 Mean   :   295932   Mean   :   273099   Mean   :   263045  
 3rd Qu.:   252164   3rd Qu.:   245484   3rd Qu.:   210788  
 Max.   :205067466   Max.   :105248041   Max.   :222511278  
     CASE_4              CASE_5              CTRL_1         
 Min.   :        0   Min.   :        0   Min.   :        0  
 1st Qu.:     7088   1st Qu.:     5741   1st Qu.:     6905  
 Median :    99129   Median :    85530   Median :    86801  
 Mean   :   364558   Mean   :   346312   Mean   :   339093  
 3rd Qu.:   302838   3rd Qu.:   274448   3rd Qu.:   286483  
 Max.   :161624257   Max.   :127063679   Max.   :273078584  
     CTRL_2              CTRL_3              CTRL_4         
 Min.   :        0   Min.   :        0   Min.   :        0  
 1st Qu.:     4750   1st Qu.:     4761   1st Qu.:     2720  
 Median :    63088   Median :    61488   Median :    37227  
 Mean   :   273367   Mean   :   264306   Mean   :   182936  
 3rd Qu.:   215286   3rd Qu.:   211596   3rd Qu.:   134252  
 Max.   :398526040   Max.   :282352241   Max.   :263287148  
     CTRL_5              width        
 Min.   :        0   Min.   :    117  
 1st Qu.:     4558   1st Qu.:   9668  
 Median :    62271   Median :  27090  
 Mean   :   275877   Mean   :  67685  
 3rd Qu.:   219417   3rd Qu.:  70374  
 Max.   :292416098   Max.   :2473537  

4.2. Normalisation for RNA sequencing data

#To compute the CPM values for each sample (excluding the width column):
cpm <- apply(subset(counts, select = c(-width)), 
             2, function(x) x/sum(as.numeric(x)) * 10^6)

head(cpm)
              CASE_1     CASE_2      CASE_3      CASE_4     CASE_5
TSPAN6   133.0525426 69.0267115 118.0347011 63.45319990 75.1705899
TNMD       0.2541349  0.1496685   0.5774069  0.04131475  0.1603471
DPM1      62.5344859 50.9433750  47.9546579 51.61492360 47.6835767
SCYL3     17.7536255 18.1282742  18.9680587 16.32642166 14.9280985
C1ORF112  15.5608340 10.9995218  11.8488915 15.14762970  6.5660321
FGR        9.7817077 27.0411643   5.1667799 23.44215745 23.2660030
             CTRL_1     CTRL_2     CTRL_3     CTRL_4      CTRL_5
TSPAN6   83.6818109 90.8357832 63.7170750 66.1201154 116.5649088
TNMD      0.6925826  0.3495018  0.8603467  0.5436178   0.7308802
DPM1     31.5816791 22.8556382 21.7976235 18.6670761  36.4577954
SCYL3    24.0552454 19.8292196 20.5182320 17.9989890  18.6607898
C1ORF112  7.8228489  6.3511382  7.3077662  5.1345687  10.6632034
FGR       8.9317887  4.5418541  4.1459346  5.1711610   4.3532958
colSums(cpm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5 CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5 
 1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06 

Use the “ExpData.csv” file for this method We have to change the Gene_id column as the row names

Read the ExpData.csv count file:

Counts_NP1PE1 <- read.csv(file="ExpData.csv" , header =TRUE)

Change the Gene_id column as row names:

rownames(Counts_NP1PE1) <- Counts_NP1PE1[,1]

Counts_NP1PE1[,1] <- NULL

Computing CPM

# Creates a DGEList object from a table of counts (rows=features, columns=samples), group indicator for each column

group <-(c(rep("NP", 8), rep("PE", 8)))

y <- DGEList(counts=Counts_NP1PE1, group=group)


dge <- calcNormFactors(y) # Calculate norm. factors

normcounts <- cpm(dge, log =FALSE) # Get cpm normalized counts

Computing RPKM

# create a vector of gene lengths
geneLengths <- as.vector(subset(counts, select = c(width)))  

# compute rpkm
rpkm <- apply(X = subset(counts, select = c(-width)),
              MARGIN = 2, 
FUN = function(x) 10^9 * x / geneLengths /
 sum(as.numeric(x))) 

head(rpkm)
              CASE_1      CASE_2     CASE_3      CASE_4     CASE_5
TSPAN6   10.32776081 5.357968757 9.16205085 4.925343468 5.83486687
TNMD      0.01684798 0.009922336 0.03827943 0.002738979 0.01063028
DPM1      2.63981113 2.150507621 2.02434285 2.178856161 2.01289952
SCYL3     0.39773339 0.406126625 0.42494027 0.365759833 0.33443328
C1ORF112  0.08101479 0.057267105 0.06168920 0.078863509 0.03418491
FGR       0.42137106 1.164864492 0.22257172 1.009828442 1.00224016
             CTRL_1     CTRL_2     CTRL_3     CTRL_4     CTRL_5
TSPAN6   6.49552207 7.05082537 4.94582589 5.13235390 9.04796312
TNMD     0.04591505 0.02317037 0.05703704 0.03603937 0.04845400
DPM1     1.33317907 0.96482073 0.92015803 0.78800608 1.53901792
SCYL3    0.53890820 0.44423280 0.45966871 0.40323026 0.41805654
C1ORF112 0.04072831 0.03306610 0.03804662 0.02673224 0.05551612
FGR      0.38475871 0.19565151 0.17859631 0.22276045 0.18752890

Check the sample sizes of RPKM. Notice that the sums of samples are all different:

colSums(rpkm)
  CASE_1   CASE_2   CASE_3   CASE_4   CASE_5   CTRL_1   CTRL_2 
158291.0 153324.2 161775.4 173047.4 172761.4 210032.6 301764.2 
  CTRL_3   CTRL_4   CTRL_5 
241418.3 291674.5 252005.7 

Find gene length normalized values:

rpk <- apply( subset(counts, select = c(-width)), 2, 
function(x) x/(geneLengths/1000))

#normalize by the sample size using rpk values
head(rpk)
             CASE_1     CASE_2     CASE_3      CASE_4      CASE_5
TSPAN6   60267.4843 28853.9160 47523.4029 35406.89280 39845.92098
TNMD        98.3161    53.4341   198.5548    19.68974    72.59348
DPM1     15404.5760 11580.9870 10500.2322 15663.17700 13745.95804
SCYL3     2320.9669  2187.0869  2204.1580  2629.34337  2283.82284
C1ORF112   472.7605   308.3968   319.9808   566.92733   233.44648
FGR       2458.9041  6273.0680  1154.4757  7259.36935  6844.23193
             CTRL_1     CTRL_2     CTRL_3      CTRL_4     CTRL_5
TSPAN6   43432.7408 38007.6845 25776.9153 18514.01071 49221.0665
TNMD       307.0141   124.9006   297.2686   130.00530   263.5906
DPM1      8914.3906  5200.8949  4795.7280  2842.58517  8372.2825
SCYL3     3603.4456  2394.6502  2395.7255  1454.57804  2274.2344
C1ORF112   272.3325   178.2438   198.2934    96.43158   302.0086
FGR       2572.7147  1054.6653   930.8176   803.56681  1020.1602

Computing TPM

tpm <- apply(rpk, 2, function(x) x / sum(as.numeric(x)) * 10^6)
head(tpm)
             CASE_1      CASE_2     CASE_3      CASE_4      CASE_5
TSPAN6   65.2454175 34.94535342 56.6343869 28.46238940 33.77412727
TNMD      0.1064367  0.06471474  0.2366208  0.01582791  0.06153155
DPM1     16.6769528 14.02588411 12.5132919 12.59109195 11.65132402
SCYL3     2.5126726  2.64880948  2.6267297  2.11363915  1.93580977
C1ORF112  0.5118093  0.37350334  0.3813262  0.45573348  0.19787349
FGR       2.6620030  7.59739431  1.3758069  5.83555858  5.80129545
             CTRL_1      CTRL_2     CTRL_3      CTRL_4     CTRL_5
TSPAN6   30.9262602 23.36534574 20.4865448 17.59616734 35.9038047
TNMD      0.2186092  0.07678302  0.2362582  0.12356021  0.1922734
DPM1      6.3474872  3.19726681  3.8114683  2.70166227  6.1070760
SCYL3     2.5658315  1.47211886  1.9040346  1.38246645  1.6589171
C1ORF112  0.1939142  0.10957595  0.1575963  0.09165093  0.2202971
FGR       1.8319002  0.64835886  0.7397796  0.76372950  0.7441455

Check the sample sizes of tpm. Notice that the sums of samples are all equal to 10^6:

colSums(tpm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5 CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5 
 1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06  1e+06 

4.3. Normalisation for Microarray data (Robust Multiarray Averaging (RMA))

Script to perform RMA normalization:

# get supplementary files
getGEOSuppFiles("GSE148537")
                                                                 size
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 17704960
                                                             isdir
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar FALSE
                                                             mode
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar  666
                                                                           mtime
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 2022-05-05 02:31:58
                                                                           ctime
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 2022-05-05 02:07:54
                                                                           atime
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 2022-05-05 02:31:58
                                                             exe
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar  no
# untar files
untar("GSE148537/GSE148537_RAW.tar", exdir="data/")
# reading in .cel files
raw.data <- ReadAffy(celfile.path = "data/")
# performing RMA normalization
normalized.data <- rma(raw.data)
Background correcting
Normalizing
Calculating Expression
# get expression estimates
normalized.expr <- as.data.frame(exprs(normalized.data))
# map probe IDs to gene symbols
gse <- getGEO("GSE148537", GSEMatrix = TRUE)
# fetch feature data to get ID - gene symbol mapping
feature.data <- gse$GSE148537_series_matrix.txt.gz@featureData@data
# subset
feature.data <- feature.data[,c(1,11)]

normalized.expr <- normalized.expr %>%
 rownames_to_column(var = 'ID') %>%
 inner_join(., feature.data, by = 'ID')

5. Exploratory analysis of the read count table

- Clustering

Compute the variance of each gene across samples:

V <- apply(tpm, 1, var)

Sort the results by variance in decreasing order and select the top 100 genes:

selectedGenes <- names(V[order(V, decreasing = T)][1:100])

Now we can produce a heatmap where samples and genes are clustered:

pheatmap(tpm[selectedGenes,], scale = 'row', show_rownames = FALSE)

Correlation plots

correlationMatrix <- cor(tpm)

Have a look at how the correlation matrix looks:

kable(correlationMatrix,booktabs = TRUE)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5 CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5
CASE_1 1.0000000 0.9924606 0.9959093 0.9934098 0.9901892 0.9594011 0.9635760 0.9549607 0.9548669 0.9302553
CASE_2 0.9924606 1.0000000 0.9887290 0.9935931 0.9898379 0.9725646 0.9793835 0.9674813 0.9739136 0.9448425
CASE_3 0.9959093 0.9887290 1.0000000 0.9950905 0.9928878 0.9648812 0.9617747 0.9555745 0.9551676 0.9401090
CASE_4 0.9934098 0.9935931 0.9950905 1.0000000 0.9922521 0.9739553 0.9748582 0.9643073 0.9709639 0.9510741
CASE_5 0.9901892 0.9898379 0.9928878 0.9922521 1.0000000 0.9631663 0.9660868 0.9502160 0.9587173 0.9313136
CTRL_1 0.9594011 0.9725646 0.9648812 0.9739553 0.9631663 1.0000000 0.9879862 0.9902278 0.9874452 0.9864399
CTRL_2 0.9635760 0.9793835 0.9617747 0.9748582 0.9660868 0.9879862 1.0000000 0.9819794 0.9970251 0.9642249
CTRL_3 0.9549607 0.9674813 0.9555745 0.9643073 0.9502160 0.9902278 0.9819794 1.0000000 0.9850769 0.9902712
CTRL_4 0.9548669 0.9739136 0.9551676 0.9709639 0.9587173 0.9874452 0.9970251 0.9850769 1.0000000 0.9738649
CTRL_5 0.9302553 0.9448425 0.9401090 0.9510741 0.9313136 0.9864399 0.9642249 0.9902712 0.9738649 1.0000000

We can also draw more colourful correlation plots using the corrplot package:

# The correlation plot order by the results of the hierarchical clustering 
corrplot(correlationMatrix, order = 'hclust')

Pairwise correlation scores on the plot:

corrplot(correlationMatrix, order = 'hclust', addrect = 2, addCoef.col = 'white')

Plot the correlation matrix as a heatmap:

pheatmap(correlationMatrix)