By quantifying the activity of RNA in a biological sample, gene expression can provide valuable insights into disease nature and its treatment. The RNA-Seq assay is one of the fastest growing Next Generation Sequencing (NGS) approaches used to assess gene expression and alternative splicing, where both known and novel features can be detected in a single assay, allowing for the identification of transcript isoforms, gene fusions, and single nucleotide variations as well as other features without a prior knowledge of the sequence.
setwd("C:/Users/manso/Desktop/practical")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rsubread")
BiocManager::install("edgeR")
BiocManager::install("affy")
BiocManager::install("pheatmap")
BiocManager::install("corrplot")
library(edgeR)
library(Rsubread)
library(Rsamtools)
library(affy)
library(GEOquery)
library(tidyverse)
library(pheatmap)
library(stats)
library(corrplot)
library(knitr)
fastq.files <- list.files(path = "C:/Users/manso/Desktop/practical", pattern = ".fastq$", full.names = TRUE)
fastq.files
[1] "C:/Users/manso/Desktop/practical/SRR14094802_1.fastq"
[2] "C:/Users/manso/Desktop/practical/SRR14094802_2.fastq"
Lets use the Saccharomyces reference genome for now because the human genome takes a long time to complete.
buildindex(basename="ht",reference="GCF_000146045.2_R64_genomic.fna")
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 2.8.2
//================================= setting ==================================\\
|| ||
|| Index name : ht ||
|| Index space : base space ||
|| Index split : no-split ||
|| Repeat threshold : 100 repeats ||
|| Gapped index : no ||
|| ||
|| Free / total memory : 16.6GB / 31.4GB ||
|| ||
|| Input files : 1 file in total ||
|| o GCF_000146045.2_R64_genomic.fna ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Check the integrity of provided reference sequences ... ||
|| No format issues were found ||
|| Scan uninformative subreads in reference sequences ... ||
|| 11 uninformative subreads were found. ||
|| These subreads were excluded from index building. ||
|| Estimate the index size... ||
|| 8%, 0 mins elapsed, rate=3530.8k bps/s ||
|| 16%, 0 mins elapsed, rate=4254.2k bps/s ||
|| 24%, 0 mins elapsed, rate=4567.0k bps/s ||
|| 33%, 0 mins elapsed, rate=4751.8k bps/s ||
|| 41%, 0 mins elapsed, rate=4879.5k bps/s ||
|| 49%, 0 mins elapsed, rate=4932.2k bps/s ||
|| 58%, 0 mins elapsed, rate=4987.9k bps/s ||
|| 66%, 0 mins elapsed, rate=5030.6k bps/s ||
|| 74%, 0 mins elapsed, rate=5072.8k bps/s ||
|| 83%, 0 mins elapsed, rate=5109.0k bps/s ||
|| 91%, 0 mins elapsed, rate=5139.5k bps/s ||
|| 3.0 GB of memory is needed for index building. ||
|| Build the index... ||
|| 8%, 0 mins elapsed, rate=831.4k bps/s ||
|| 16%, 0 mins elapsed, rate=1266.5k bps/s ||
|| 24%, 0 mins elapsed, rate=1534.9k bps/s ||
|| 33%, 0 mins elapsed, rate=1717.9k bps/s ||
|| 41%, 0 mins elapsed, rate=1850.4k bps/s ||
|| 49%, 0 mins elapsed, rate=1946.9k bps/s ||
|| 58%, 0 mins elapsed, rate=2023.3k bps/s ||
|| 66%, 0 mins elapsed, rate=2088.1k bps/s ||
|| 74%, 0 mins elapsed, rate=2141.6k bps/s ||
|| 83%, 0 mins elapsed, rate=2185.7k bps/s ||
|| 91%, 0 mins elapsed, rate=2222.3k bps/s ||
|| Save current index block... ||
|| [ 0.0% finished ] ||
|| [ 10.0% finished ] ||
|| [ 20.0% finished ] ||
|| [ 30.0% finished ] ||
|| [ 40.0% finished ] ||
|| [ 50.0% finished ] ||
|| [ 60.0% finished ] ||
|| [ 70.0% finished ] ||
|| [ 80.0% finished ] ||
|| [ 90.0% finished ] ||
|| [ 100.0% finished ] ||
|| ||
|| Total running time: 0.3 minutes. ||
|| Index C:\Users\manso\Desktop\practical\ht was successfully built. ||
|| ||
\\============================================================================//
reads1 <- list.files( path = "C:/Users/manso/Desktop/practical", pattern = "*_1.fastq$" )
reads2 <- list.files( path = "C:/Users/manso/Desktop/practical", pattern = "*_2.fastq$" )
align(index="ht",readfile1=reads1,readfile2=reads2,input_format="FASTQ",output_format="BAM",nthreads=16)
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 2.8.2
//================================= setting ==================================\\
|| ||
|| Function : Read alignment (RNA-Seq) ||
|| Input file 1 : SRR14094802_1.fastq ||
|| Input file 2 : SRR14094802_2.fastq ||
|| Output file : SRR14094802_1.fastq.subread.BAM (BAM) ||
|| Index name : ht ||
|| ||
|| ------------------------------------ ||
|| ||
|| Threads : 16 ||
|| Phred offset : 33 ||
|| # of extracted subreads : 10 ||
|| Min read1 vote : 3 ||
|| Min read2 vote : 1 ||
|| Max fragment size : 600 ||
|| Min fragment size : 50 ||
|| Max mismatches : 3 ||
|| Max indel length : 5 ||
|| Report multi-mapping reads : yes ||
|| Max alignments per multi-mapping read : 1 ||
|| ||
\\============================================================================//
//================ Running (05-May-2022 02:30:34, pid=10796) =================\\
|| ||
|| Check the input reads. ||
|| The input file contains base space reads. ||
|| Initialise the memory objects. ||
|| Estimate the mean read length. ||
|| The range of Phred scores observed in the data is [2,40] ||
|| Create the output BAM file. ||
|| Check the index. ||
|| Init the voting space. ||
|| Global environment is initialised. ||
|| Load the 1-th index block... ||
|| The index block has been loaded. ||
|| Start read mapping in chunk. ||
|| 7% completed, 0.3 mins elapsed, rate=67.4k fragments per second ||
|| 14% completed, 0.4 mins elapsed, rate=71.4k fragments per second ||
|| 22% completed, 0.4 mins elapsed, rate=72.6k fragments per second ||
|| 30% completed, 0.4 mins elapsed, rate=73.7k fragments per second ||
|| 37% completed, 0.4 mins elapsed, rate=73.9k fragments per second ||
|| 45% completed, 0.5 mins elapsed, rate=74.0k fragments per second ||
|| 53% completed, 0.5 mins elapsed, rate=74.4k fragments per second ||
|| 61% completed, 0.5 mins elapsed, rate=74.4k fragments per second ||
|| Estimated fragment length : 182 bp ||
|| 69% completed, 0.5 mins elapsed, rate=29.6k fragments per second ||
|| 73% completed, 0.6 mins elapsed, rate=29.8k fragments per second ||
|| 76% completed, 0.6 mins elapsed, rate=30.0k fragments per second ||
|| 79% completed, 0.6 mins elapsed, rate=30.2k fragments per second ||
|| 83% completed, 0.6 mins elapsed, rate=30.4k fragments per second ||
|| 86% completed, 0.6 mins elapsed, rate=30.6k fragments per second ||
|| 89% completed, 0.7 mins elapsed, rate=30.8k fragments per second ||
|| 93% completed, 0.7 mins elapsed, rate=31.0k fragments per second ||
|| 96% completed, 0.7 mins elapsed, rate=31.2k fragments per second ||
|| 99% completed, 0.7 mins elapsed, rate=31.3k fragments per second ||
|| ||
|| Completed successfully. ||
|| ||
\\==================================== ====================================//
//================================ Summary =================================\\
|| ||
|| Total fragments : 1372071 ||
|| Mapped : 923639 (67.3%) ||
|| Uniquely mapped : 740587 ||
|| Multi-mapping : 183052 ||
|| ||
|| Unmapped : 448432 ||
|| ||
|| Properly paired : 865412 ||
|| Not properly paired : 58227 ||
|| Singleton : 43703 ||
|| Chimeric : 7596 ||
|| Unexpected strandness : 998 ||
|| Unexpected fragment length : 5099 ||
|| Unexpected read order : 831 ||
|| ||
|| Indels : 499 ||
|| ||
|| Running time : 0.7 minutes ||
|| ||
\\============================================================================//
SRR14094802_1.fastq.subread.BAM
Total_fragments 1372071
Mapped_fragments 923639
Uniquely_mapped_fragments 740587
Multi_mapping_fragments 183052
Unmapped_fragments 448432
Properly_paired_fragments 865412
Singleton_fragments 43703
More_than_one_chr_fragments 7596
Unexpected_strandness_fragments 998
Unexpected_template_length 5099
Inversed_mapping 831
Indels 499
bamFile <- BamFile(file = "SRR14094802_1.sorted.bam", index= "SRR14094802_1.sorted.bam.bai")
bamFile
class: BamFile
path: SRR14094802_1.sorted.bam
index: SRR14094802_1.sorted.bam.bai
isOpen: FALSE
yieldSize: NA
obeyQname: FALSE
asMates: FALSE
qnamePrefixEnd: NA
qnameSuffixStart: NA
In Ubunto terminal - To create a BAM index:
- You must first sort the BAM file to create a sorted.bam - Run samtools
index with the sorted.bam as input - This will create a file named
sorted.bam.bai which
contains the index
samtools sort SRR14094802_1.fastq.subread.BAM -o SRR14094802_1.sorted.bam
samtools index SRR14094802_1.sorted.bam SRR14094802_1.sorted.bam.bai
seqinfo(bamFile)
In the case of scanBam() it’s possible to get output from multiple genomic regions,here we get back a list of length 1. If it returns a list >1, we subset that list to a single list and then display that information.
aln <- aln[[1]]
names(aln)
[1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq"
[8] "cigar" "mrnm" "mpos" "isize" "seq" "qual"
lapply(aln, function(xx) xx[1])
$qname
[1] "SRR14094802.107538"
$flag
[1] 129
$rname
[1] NC_001133.9
17 Levels: NC_001133.9 NC_001134.8 NC_001135.5 ... NC_001224.1
$strand
[1] +
Levels: + - *
$pos
[1] 1
$qwidth
[1] 25
$mapq
[1] 0
$cigar
[1] "25M"
$mrnm
[1] NC_001133.9
17 Levels: NC_001133.9 NC_001134.8 NC_001135.5 ... NC_001224.1
$mpos
[1] 230104
$isize
[1] 230128
$seq
DNAStringSet object of length 1:
width seq
[1] 25 CCACACCACACCCACACACCCACAC
$qual
PhredQuality object of length 1:
width seq
[1] 25 GAGAGIGGGIIIIGGIIIIIGGGAG
quickBamFlagSummary(bamFile)
group | nb of | nb of | mean / max
of | records | unique | records per
records | in group | QNAMEs | unique QNAME
All records........................ A | 2744142 | 1372071 | 2 / 2
o template has single segment.... S | 0 | 0 | NA / NA
o template has multiple segments. M | 2744142 | 1372071 | 2 / 2
- first segment.............. F | 1372071 | 1372071 | 1 / 1
- last segment............... L | 1372071 | 1372071 | 1 / 1
- other segment.............. O | 0 | 0 | NA / NA
Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.
Details for group M:
o record is mapped.............. M1 | 1803575 | 923639 | 1.95 / 2
- primary alignment......... M2 | 1803575 | 923639 | 1.95 / 2
- secondary alignment....... M3 | 0 | 0 | NA / NA
o record is unmapped............ M4 | 940567 | 492135 | 1.91 / 2
Details for group F:
o record is mapped.............. F1 | 908542 | 908542 | 1 / 1
- primary alignment......... F2 | 908542 | 908542 | 1 / 1
- secondary alignment....... F3 | 0 | 0 | NA / NA
o record is unmapped............ F4 | 463529 | 463529 | 1 / 1
Details for group L:
o record is mapped.............. L1 | 895033 | 895033 | 1 / 1
- primary alignment......... L2 | 895033 | 895033 | 1 / 1
- secondary alignment....... L3 | 0 | 0 | NA / NA
o record is unmapped............ L4 | 477038 | 477038 | 1 / 1
For this step we’ll use The Human sample SRR8472776.bam
SRRcount <- featureCounts(files="SRR8472776.bam",
annot.ext="gencode.v29.annotation.gtf",
isGTFAnnotationFile=TRUE, GTF.featureType="exon",
GTF.attrType="gene_id")
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 2.8.2
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| ||
|| SRR8472776.bam ||
|| ||
|| Paired-end : no ||
|| Count read pairs : no ||
|| Annotation : gencode.v29.annotation.gtf (GTF) ||
|| Dir for temp files : . ||
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Multimapping reads : counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file gencode.v29.annotation.gtf ... ||
|| Features : 1262773 ||
|| Meta-features : 58721 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file SRR8472776.bam... ||
|| Single-end reads are included. ||
|| Total alignments : 2347456 ||
|| Successfully assigned alignments : 2279419 (97.1%) ||
|| Running time : 0.04 minutes ||
|| ||
|| Write the final count table. ||
|| Write the read assignment summary. ||
|| ||
\\============================================================================//
View the count file:
View(SRRcount[["counts"]])
SRR8472776 <- SRRcount[["counts"]]
write.csv(SRR8472776, "SRR8472776_counts.csv")
#Read files SRP029880.raw_counts.tsv as count file
Count_file <- read.table(file="SRP029880.raw_counts.tsv",
sep = '\t', header = TRUE,
fill = TRUE)
counts <- as.matrix(Count_file)
summary(counts)
CASE_1 CASE_2 CASE_3
Min. : 0 Min. : 0 Min. : 0
1st Qu.: 5155 1st Qu.: 6464 1st Qu.: 3972
Median : 80023 Median : 85064 Median : 64145
Mean : 295932 Mean : 273099 Mean : 263045
3rd Qu.: 252164 3rd Qu.: 245484 3rd Qu.: 210788
Max. :205067466 Max. :105248041 Max. :222511278
CASE_4 CASE_5 CTRL_1
Min. : 0 Min. : 0 Min. : 0
1st Qu.: 7088 1st Qu.: 5741 1st Qu.: 6905
Median : 99129 Median : 85530 Median : 86801
Mean : 364558 Mean : 346312 Mean : 339093
3rd Qu.: 302838 3rd Qu.: 274448 3rd Qu.: 286483
Max. :161624257 Max. :127063679 Max. :273078584
CTRL_2 CTRL_3 CTRL_4
Min. : 0 Min. : 0 Min. : 0
1st Qu.: 4750 1st Qu.: 4761 1st Qu.: 2720
Median : 63088 Median : 61488 Median : 37227
Mean : 273367 Mean : 264306 Mean : 182936
3rd Qu.: 215286 3rd Qu.: 211596 3rd Qu.: 134252
Max. :398526040 Max. :282352241 Max. :263287148
CTRL_5 width
Min. : 0 Min. : 117
1st Qu.: 4558 1st Qu.: 9668
Median : 62271 Median : 27090
Mean : 275877 Mean : 67685
3rd Qu.: 219417 3rd Qu.: 70374
Max. :292416098 Max. :2473537
#To compute the CPM values for each sample (excluding the width column):
cpm <- apply(subset(counts, select = c(-width)),
2, function(x) x/sum(as.numeric(x)) * 10^6)
head(cpm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5
TSPAN6 133.0525426 69.0267115 118.0347011 63.45319990 75.1705899
TNMD 0.2541349 0.1496685 0.5774069 0.04131475 0.1603471
DPM1 62.5344859 50.9433750 47.9546579 51.61492360 47.6835767
SCYL3 17.7536255 18.1282742 18.9680587 16.32642166 14.9280985
C1ORF112 15.5608340 10.9995218 11.8488915 15.14762970 6.5660321
FGR 9.7817077 27.0411643 5.1667799 23.44215745 23.2660030
CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5
TSPAN6 83.6818109 90.8357832 63.7170750 66.1201154 116.5649088
TNMD 0.6925826 0.3495018 0.8603467 0.5436178 0.7308802
DPM1 31.5816791 22.8556382 21.7976235 18.6670761 36.4577954
SCYL3 24.0552454 19.8292196 20.5182320 17.9989890 18.6607898
C1ORF112 7.8228489 6.3511382 7.3077662 5.1345687 10.6632034
FGR 8.9317887 4.5418541 4.1459346 5.1711610 4.3532958
colSums(cpm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5 CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
Use the “ExpData.csv” file for this method We have to change the Gene_id column as the row names
Read the ExpData.csv count file:
Counts_NP1PE1 <- read.csv(file="ExpData.csv" , header =TRUE)
Change the Gene_id column as row names:
rownames(Counts_NP1PE1) <- Counts_NP1PE1[,1]
Counts_NP1PE1[,1] <- NULL
Computing CPM
# Creates a DGEList object from a table of counts (rows=features, columns=samples), group indicator for each column
group <-(c(rep("NP", 8), rep("PE", 8)))
y <- DGEList(counts=Counts_NP1PE1, group=group)
dge <- calcNormFactors(y) # Calculate norm. factors
normcounts <- cpm(dge, log =FALSE) # Get cpm normalized counts
Computing RPKM
# create a vector of gene lengths
geneLengths <- as.vector(subset(counts, select = c(width)))
# compute rpkm
rpkm <- apply(X = subset(counts, select = c(-width)),
MARGIN = 2,
FUN = function(x) 10^9 * x / geneLengths /
sum(as.numeric(x)))
head(rpkm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5
TSPAN6 10.32776081 5.357968757 9.16205085 4.925343468 5.83486687
TNMD 0.01684798 0.009922336 0.03827943 0.002738979 0.01063028
DPM1 2.63981113 2.150507621 2.02434285 2.178856161 2.01289952
SCYL3 0.39773339 0.406126625 0.42494027 0.365759833 0.33443328
C1ORF112 0.08101479 0.057267105 0.06168920 0.078863509 0.03418491
FGR 0.42137106 1.164864492 0.22257172 1.009828442 1.00224016
CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5
TSPAN6 6.49552207 7.05082537 4.94582589 5.13235390 9.04796312
TNMD 0.04591505 0.02317037 0.05703704 0.03603937 0.04845400
DPM1 1.33317907 0.96482073 0.92015803 0.78800608 1.53901792
SCYL3 0.53890820 0.44423280 0.45966871 0.40323026 0.41805654
C1ORF112 0.04072831 0.03306610 0.03804662 0.02673224 0.05551612
FGR 0.38475871 0.19565151 0.17859631 0.22276045 0.18752890
Check the sample sizes of RPKM. Notice that the sums of samples are all different:
colSums(rpkm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5 CTRL_1 CTRL_2
158291.0 153324.2 161775.4 173047.4 172761.4 210032.6 301764.2
CTRL_3 CTRL_4 CTRL_5
241418.3 291674.5 252005.7
Find gene length normalized values:
rpk <- apply( subset(counts, select = c(-width)), 2,
function(x) x/(geneLengths/1000))
#normalize by the sample size using rpk values
head(rpk)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5
TSPAN6 60267.4843 28853.9160 47523.4029 35406.89280 39845.92098
TNMD 98.3161 53.4341 198.5548 19.68974 72.59348
DPM1 15404.5760 11580.9870 10500.2322 15663.17700 13745.95804
SCYL3 2320.9669 2187.0869 2204.1580 2629.34337 2283.82284
C1ORF112 472.7605 308.3968 319.9808 566.92733 233.44648
FGR 2458.9041 6273.0680 1154.4757 7259.36935 6844.23193
CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5
TSPAN6 43432.7408 38007.6845 25776.9153 18514.01071 49221.0665
TNMD 307.0141 124.9006 297.2686 130.00530 263.5906
DPM1 8914.3906 5200.8949 4795.7280 2842.58517 8372.2825
SCYL3 3603.4456 2394.6502 2395.7255 1454.57804 2274.2344
C1ORF112 272.3325 178.2438 198.2934 96.43158 302.0086
FGR 2572.7147 1054.6653 930.8176 803.56681 1020.1602
Computing TPM
tpm <- apply(rpk, 2, function(x) x / sum(as.numeric(x)) * 10^6)
head(tpm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5
TSPAN6 65.2454175 34.94535342 56.6343869 28.46238940 33.77412727
TNMD 0.1064367 0.06471474 0.2366208 0.01582791 0.06153155
DPM1 16.6769528 14.02588411 12.5132919 12.59109195 11.65132402
SCYL3 2.5126726 2.64880948 2.6267297 2.11363915 1.93580977
C1ORF112 0.5118093 0.37350334 0.3813262 0.45573348 0.19787349
FGR 2.6620030 7.59739431 1.3758069 5.83555858 5.80129545
CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5
TSPAN6 30.9262602 23.36534574 20.4865448 17.59616734 35.9038047
TNMD 0.2186092 0.07678302 0.2362582 0.12356021 0.1922734
DPM1 6.3474872 3.19726681 3.8114683 2.70166227 6.1070760
SCYL3 2.5658315 1.47211886 1.9040346 1.38246645 1.6589171
C1ORF112 0.1939142 0.10957595 0.1575963 0.09165093 0.2202971
FGR 1.8319002 0.64835886 0.7397796 0.76372950 0.7441455
Check the sample sizes of tpm. Notice that the sums of samples are all equal to 10^6:
colSums(tpm)
CASE_1 CASE_2 CASE_3 CASE_4 CASE_5 CTRL_1 CTRL_2 CTRL_3 CTRL_4 CTRL_5
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
Script to perform RMA normalization:
# get supplementary files
getGEOSuppFiles("GSE148537")
size
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 17704960
isdir
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar FALSE
mode
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 666
mtime
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 2022-05-05 02:31:58
ctime
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 2022-05-05 02:07:54
atime
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar 2022-05-05 02:31:58
exe
C:/Users/manso/Desktop/practical/GSE148537/GSE148537_RAW.tar no
# untar files
untar("GSE148537/GSE148537_RAW.tar", exdir="data/")
# reading in .cel files
raw.data <- ReadAffy(celfile.path = "data/")
# performing RMA normalization
normalized.data <- rma(raw.data)
Background correcting
Normalizing
Calculating Expression
# get expression estimates
normalized.expr <- as.data.frame(exprs(normalized.data))
# map probe IDs to gene symbols
gse <- getGEO("GSE148537", GSEMatrix = TRUE)
# fetch feature data to get ID - gene symbol mapping
feature.data <- gse$GSE148537_series_matrix.txt.gz@featureData@data
# subset
feature.data <- feature.data[,c(1,11)]
normalized.expr <- normalized.expr %>%
rownames_to_column(var = 'ID') %>%
inner_join(., feature.data, by = 'ID')
- Clustering
Compute the variance of each gene across samples:
V <- apply(tpm, 1, var)
Sort the results by variance in decreasing order and select the top 100 genes:
Now we can produce a heatmap where samples and genes are clustered:
pheatmap(tpm[selectedGenes,], scale = 'row', show_rownames = FALSE)
Correlation plots
correlationMatrix <- cor(tpm)
Have a look at how the correlation matrix looks:
kable(correlationMatrix,booktabs = TRUE)
CASE_1 | CASE_2 | CASE_3 | CASE_4 | CASE_5 | CTRL_1 | CTRL_2 | CTRL_3 | CTRL_4 | CTRL_5 | |
---|---|---|---|---|---|---|---|---|---|---|
CASE_1 | 1.0000000 | 0.9924606 | 0.9959093 | 0.9934098 | 0.9901892 | 0.9594011 | 0.9635760 | 0.9549607 | 0.9548669 | 0.9302553 |
CASE_2 | 0.9924606 | 1.0000000 | 0.9887290 | 0.9935931 | 0.9898379 | 0.9725646 | 0.9793835 | 0.9674813 | 0.9739136 | 0.9448425 |
CASE_3 | 0.9959093 | 0.9887290 | 1.0000000 | 0.9950905 | 0.9928878 | 0.9648812 | 0.9617747 | 0.9555745 | 0.9551676 | 0.9401090 |
CASE_4 | 0.9934098 | 0.9935931 | 0.9950905 | 1.0000000 | 0.9922521 | 0.9739553 | 0.9748582 | 0.9643073 | 0.9709639 | 0.9510741 |
CASE_5 | 0.9901892 | 0.9898379 | 0.9928878 | 0.9922521 | 1.0000000 | 0.9631663 | 0.9660868 | 0.9502160 | 0.9587173 | 0.9313136 |
CTRL_1 | 0.9594011 | 0.9725646 | 0.9648812 | 0.9739553 | 0.9631663 | 1.0000000 | 0.9879862 | 0.9902278 | 0.9874452 | 0.9864399 |
CTRL_2 | 0.9635760 | 0.9793835 | 0.9617747 | 0.9748582 | 0.9660868 | 0.9879862 | 1.0000000 | 0.9819794 | 0.9970251 | 0.9642249 |
CTRL_3 | 0.9549607 | 0.9674813 | 0.9555745 | 0.9643073 | 0.9502160 | 0.9902278 | 0.9819794 | 1.0000000 | 0.9850769 | 0.9902712 |
CTRL_4 | 0.9548669 | 0.9739136 | 0.9551676 | 0.9709639 | 0.9587173 | 0.9874452 | 0.9970251 | 0.9850769 | 1.0000000 | 0.9738649 |
CTRL_5 | 0.9302553 | 0.9448425 | 0.9401090 | 0.9510741 | 0.9313136 | 0.9864399 | 0.9642249 | 0.9902712 | 0.9738649 | 1.0000000 |
We can also draw more colourful correlation plots using the corrplot package:
# The correlation plot order by the results of the hierarchical clustering
corrplot(correlationMatrix, order = 'hclust')
Pairwise correlation scores on the plot:
corrplot(correlationMatrix, order = 'hclust', addrect = 2, addCoef.col = 'white')
Plot the correlation matrix as a heatmap:
pheatmap(correlationMatrix)