Among the epigenetic modifications, DNA methylation is important for regulating gene expression. As a gold standard technology, bisulfite sequencing provides data on genomic methylation patterns as a result of the conversion of the unmethylated cytosines into uracils that are represented as Ts in the data.
Set working directory:
setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 8 - DNA methylation analysis")
Install packages and load libraries:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("methylKit")
BiocManager::install("genomation")
library(methylKit)
library(genomation)
Get Files:
file.list=list( system.file("extdata",
"test1.myCpG.txt", package = "methylKit"),
system.file("extdata",
"test2.myCpG.txt", package = "methylKit"),
system.file("extdata",
"control1.myCpG.txt", package = "methylKit"),
system.file("extdata",
"control2.myCpG.txt", package = "methylKit") )
# read the files to a methylRawList object: myobj
myobj=methRead(file.list,
sample.id=list("test1","test2","ctrl1","ctrl2"),
assembly="hg18",
treatment=c(1,1,0,0),
context="CpG"
)
## inside the methylRawList object
length(myobj)
[1] 4
head(myobj[[1]])
chr start end strand coverage numCs numTs
1 chr21 9764513 9764513 - 12 0 12
2 chr21 9764539 9764539 - 12 3 9
3 chr21 9820622 9820622 + 13 0 13
4 chr21 9837545 9837545 + 11 0 11
5 chr21 9849022 9849022 + 124 90 34
6 chr21 9853296 9853296 + 17 10 7
getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
Check for general coverage statistics:
getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
Filter:
filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL,
hi.count=NULL,hi.perc=99.9)
Use :: notation to make sure unite() function from methylKit is called:
meth=methylKit::unite(myobj, destrand=FALSE)
head(meth)
chr start end strand coverage1 numCs1 numTs1 coverage2
1 chr21 9853296 9853296 + 17 10 7 333
2 chr21 9853326 9853326 + 17 12 5 329
3 chr21 9860126 9860126 + 39 38 1 83
4 chr21 9906604 9906604 + 68 42 26 111
5 chr21 9906616 9906616 + 68 52 16 111
6 chr21 9906619 9906619 + 68 59 9 111
numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
1 268 65 18 16 2 395 341 54
2 249 79 16 14 2 379 284 95
3 78 5 83 83 0 41 40 1
4 97 14 23 18 5 37 33 4
5 104 7 23 14 9 37 27 10
6 109 2 22 18 4 37 29 8
Create a methylBase object, where only CpGs covered with at least 1 sample per group will be returned there were two groups defined by the treatment vector, given during the creation of myobj: treatment=c(1,1,0,0)
meth.min=unite(myobj,min.per.group=1L)
pm=percMethylation(meth) # get percent methylation matrix
mds=matrixStats::rowSds(pm) # calculate standard deviation of CpGs
head(meth[mds>20,])
chr start end strand coverage1 numCs1 numTs1 coverage2
11 chr21 9906681 9906681 + 21 12 9 60
12 chr21 9906694 9906694 + 21 9 12 60
13 chr21 9906700 9906700 + 13 6 7 53
14 chr21 9906714 9906714 + 14 3 11 41
18 chr21 9906873 9906873 + 12 8 4 41
23 chr21 9927527 9927527 + 17 5 12 40
numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
11 56 4 37 14 23 26 11 15
12 53 7 39 16 23 26 15 11
13 43 10 30 8 22 23 10 13
14 37 4 25 19 6 21 19 2
18 33 8 15 4 11 22 7 15
23 22 18 32 32 0 14 11 3
hist(mds,col="cornflowerblue",xlab="Std. dev. per CpG")
methylKit object to a GRanges object to remove C->T mutations locations. These locations should be removed from the analysis as they do not represent bisulfite treatment associated conversions.
Select CpGs that do not overlap with mutations:
nrow(sub.meth)
[1] 961
Cluster and create dendrogram to group data points by their similarity:
clusterSamples(meth, dist="correlation", method="ward", plot=TRUE)
Call:
hclust(d = d, method = HCLUST.METHODS[hclust.method])
Cluster method : ward.D
Distance : pearson
Number of objects: 4
Setting the plot=FALSE will return a dendrogram
clusterSamples(meth, dist="correlation", method="ward", plot=FALSE)
Call:
hclust(d = d, method = HCLUST.METHODS[hclust.method])
Cluster method : ward.D
Distance : pearson
Number of objects: 4
Perform Principal component analysis (PCA) to mathematically assess possible correlated variables and uncorrelated variables:
PCASamples(meth, screeplot=TRUE)
Plot PC1 and PC2 axis as scatter plot:
pc=PCASamples(meth,obj.return = TRUE, adj.lim=c(1,1))
getSampleID(meth)
[1] "test1" "test2" "ctrl1" "ctrl2"
new.meth=reorganize(meth,sample.ids=c("test1","ctrl1"),treatment=c(1,0))
dmf=calculateDiffMeth(new.meth)
Pool the samples from the same group by adding up the number of Cs and Ts per group:
pooled.meth=pool(meth,sample.ids=c("test","control"))
dm.pooledf=calculateDiffMeth(pooled.meth)
#Get differentially methylated bases/regions with specific cutoffs:
all.diff=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="all")
#Get hyper-methylated
hyper=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="hyper")
#Get hypo-methylated
hypo=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="hypo")
#Using [ ] notation
hyper2=dm.pooledf[dm.pooledf$qvalue < 0.01 & dm.pooledf$meth.diff > 25,]
#Logistic regression based tests
dm.lr=calculateDiffMeth(meth,overdispersion = "MN",test ="Chisq")
#with Betabinomial distribution based tests
dm.dss=calculateDiffMethDSS(meth)
Using internal DSS code...
# Differential methylation for regions rather than base-pairs
tiles=tileMethylCounts(myobj,win.size=1000,step.size=1000)
head(tiles[[1]],3)
chr start end strand coverage numCs numTs
1 chr21 9764001 9765000 * 24 3 21
2 chr21 9820001 9821000 * 13 0 13
3 chr21 9837001 9838000 * 11 0 11
Summarizes the methylation information over a given set of promoter regions and outputs a methylRaw or methylRawList object:
Read the gene BED file:
#
gene.obj=readTranscriptFeatures(system.file("extdata", "refseq.hg18.bed.txt",
package = "methylKit"))
Adding covariates:
# Adding covariates
covariates=data.frame(age=c(30,80,34,30,80,40))
sim.methylBase=dataSim(replicates=6,sites=1000,treatment=c(rep(1,3),rep(0,3)),
covariates=covariates,sample.ids=c(paste0("test",1:3),paste0("ctrl",1:3)))
my.diffMeth3=calculateDiffMeth(sim.methylBase, covariates=covariates,
overdispersion="MN",test="Chisq",mc.cores=1)
Read Data:
methFile=system.file("extdata","H1.chr21.chr22.rds", package="compGenomRData")
mbw=readRDS(methFile)
Segment the methylation data:
res=methSeg(mbw, minSeg=10, G=1:4, join.neighbours = TRUE)
Plot mean methylation value and the length of the segment as a scatter plot to examine the characteristics of the segment classes:
Read the gene BED file:
transcriptBED=system.file("extdata", "refseq.hg18.bed.txt",
package = "methylKit")
gene.obj=readTranscriptFeatures(transcriptBED)
Annotate differentially methylated CpGs with promoter/exon/intron using annotation data:
annotateWithGeneParts(as(all.diff,"GRanges"),gene.obj)
promoter exon intron intergenic
28.24 15.27 33.59 58.02
promoter exon intron intergenic
28.24 0.00 13.74 58.02
promoter exon intron
0.29 0.03 0.17
Min. 1st Qu. Median Mean 3rd Qu. Max.
5 815 49918 52410 94645 313528
We can read the CpG island annotation and annotate our differentially methylated bases/regions with them:
cpg.file=system.file("extdata", "cpgi.hg18.bed.txt",
package = "methylKit")
cpg.obj=readFeatureFlank(cpg.file,
feature.flank.name=c("CpGi","shores"))
Convert methylDiff object to GRanges and annotate:
diffCpGann=annotateWithFeatureFlank(as(all.diff,"GRanges"),
cpg.obj$CpGi,cpg.obj$shores,
feature.name="CpGi",flank.name="shores")