DNA Methylation analysis

Beatriz Manso
2022-05-12

Introduction

Among the epigenetic modifications, DNA methylation is important for regulating gene expression. As a gold standard technology, bisulfite sequencing provides data on genomic methylation patterns as a result of the conversion of the unmethylated cytosines into uracils that are represented as Ts in the data.

Set working directory:

setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 8 - DNA methylation analysis")

Install packages and load libraries:

if (!require("BiocManager", quietly = TRUE))
 install.packages("BiocManager")
BiocManager::install("methylKit")
BiocManager::install("genomation")

library(methylKit)
library(genomation)

1. Data and Quality measures

Data filtering and exploratory analysis

Get Files:

file.list=list( system.file("extdata",
"test1.myCpG.txt", package = "methylKit"),
system.file("extdata",
"test2.myCpG.txt", package = "methylKit"),
system.file("extdata",
"control1.myCpG.txt", package = "methylKit"),
system.file("extdata",
"control2.myCpG.txt", package = "methylKit") )
# read the files to a methylRawList object: myobj
myobj=methRead(file.list,
sample.id=list("test1","test2","ctrl1","ctrl2"),
assembly="hg18",
treatment=c(1,1,0,0),
context="CpG"
)


## inside the methylRawList object
length(myobj)
[1] 4
head(myobj[[1]])
    chr   start     end strand coverage numCs numTs
1 chr21 9764513 9764513      -       12     0    12
2 chr21 9764539 9764539      -       12     3     9
3 chr21 9820622 9820622      +       13     0    13
4 chr21 9837545 9837545      +       11     0    11
5 chr21 9849022 9849022      +      124    90    34
6 chr21 9853296 9853296      +       17    10     7

Quality check

getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)

Check for general coverage statistics:

getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE)

Filter:

filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL,
hi.count=NULL,hi.perc=99.9)

Merge samples into a single table:

Use :: notation to make sure unite() function from methylKit is called:

meth=methylKit::unite(myobj, destrand=FALSE)
head(meth)
    chr   start     end strand coverage1 numCs1 numTs1 coverage2
1 chr21 9853296 9853296      +        17     10      7       333
2 chr21 9853326 9853326      +        17     12      5       329
3 chr21 9860126 9860126      +        39     38      1        83
4 chr21 9906604 9906604      +        68     42     26       111
5 chr21 9906616 9906616      +        68     52     16       111
6 chr21 9906619 9906619      +        68     59      9       111
  numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
1    268     65        18     16      2       395    341     54
2    249     79        16     14      2       379    284     95
3     78      5        83     83      0        41     40      1
4     97     14        23     18      5        37     33      4
5    104      7        23     14      9        37     27     10
6    109      2        22     18      4        37     29      8

Create a methylBase object, where only CpGs covered with at least 1 sample per group will be returned there were two groups defined by the treatment vector, given during the creation of myobj: treatment=c(1,1,0,0)

meth.min=unite(myobj,min.per.group=1L)

Filtering CpGs

pm=percMethylation(meth) # get percent methylation matrix
mds=matrixStats::rowSds(pm) # calculate standard deviation of CpGs
head(meth[mds>20,])
     chr   start     end strand coverage1 numCs1 numTs1 coverage2
11 chr21 9906681 9906681      +        21     12      9        60
12 chr21 9906694 9906694      +        21      9     12        60
13 chr21 9906700 9906700      +        13      6      7        53
14 chr21 9906714 9906714      +        14      3     11        41
18 chr21 9906873 9906873      +        12      8      4        41
23 chr21 9927527 9927527      +        17      5     12        40
   numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
11     56      4        37     14     23        26     11     15
12     53      7        39     16     23        26     15     11
13     43     10        30      8     22        23     10     13
14     37      4        25     19      6        21     19      2
18     33      8        15      4     11        22      7     15
23     22     18        32     32      0        14     11      3
hist(mds,col="cornflowerblue",xlab="Std. dev. per CpG")

methylKit object to a GRanges object to remove C->T mutations locations. These locations should be removed from the analysis as they do not represent bisulfite treatment associated conversions.

library(GenomicRanges)
mut=GRanges(seqnames=c("chr21","chr21"),
ranges=IRanges(start=c(9853296, 9853326),
end=c( 9853296,9853326)))

Select CpGs that do not overlap with mutations:

sub.meth=meth[! as(meth,"GRanges") %over% mut,]
nrow(meth)
[1] 963
nrow(sub.meth)
[1] 961

Clustering samples

Cluster and create dendrogram to group data points by their similarity:

clusterSamples(meth, dist="correlation", method="ward", plot=TRUE)


Call:
hclust(d = d, method = HCLUST.METHODS[hclust.method])

Cluster method   : ward.D 
Distance         : pearson 
Number of objects: 4 

Setting the plot=FALSE will return a dendrogram

clusterSamples(meth, dist="correlation", method="ward", plot=FALSE)

Call:
hclust(d = d, method = HCLUST.METHODS[hclust.method])

Cluster method   : ward.D 
Distance         : pearson 
Number of objects: 4 

PCA - Principal component analysis

Perform Principal component analysis (PCA) to mathematically assess possible correlated variables and uncorrelated variables:

PCASamples(meth, screeplot=TRUE)

Plot PC1 and PC2 axis as scatter plot:

pc=PCASamples(meth,obj.return = TRUE, adj.lim=c(1,1))

2. Extracting interesting regions: segmentation and differential methylation

Differential methylation with Fisher’s exact test

getSampleID(meth)
[1] "test1" "test2" "ctrl1" "ctrl2"
new.meth=reorganize(meth,sample.ids=c("test1","ctrl1"),treatment=c(1,0))
dmf=calculateDiffMeth(new.meth)

Pool the samples from the same group by adding up the number of Cs and Ts per group:

pooled.meth=pool(meth,sample.ids=c("test","control"))
dm.pooledf=calculateDiffMeth(pooled.meth)
#Get differentially methylated bases/regions with specific cutoffs:
all.diff=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="all")

#Get hyper-methylated
hyper=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="hyper")

#Get hypo-methylated
hypo=getMethylDiff(dm.pooledf,difference=25,qvalue=0.01,type="hypo")

#Using [ ] notation
hyper2=dm.pooledf[dm.pooledf$qvalue < 0.01 & dm.pooledf$meth.diff > 25,]

#Logistic regression based tests
dm.lr=calculateDiffMeth(meth,overdispersion = "MN",test ="Chisq")

#with Betabinomial distribution based tests
dm.dss=calculateDiffMethDSS(meth)
Using internal DSS code... 
# Differential methylation for regions rather than base-pairs
tiles=tileMethylCounts(myobj,win.size=1000,step.size=1000)

head(tiles[[1]],3)
    chr   start     end strand coverage numCs numTs
1 chr21 9764001 9765000      *       24     3    21
2 chr21 9820001 9821000      *       13     0    13
3 chr21 9837001 9838000      *       11     0    11

Summarizes the methylation information over a given set of promoter regions and outputs a methylRaw or methylRawList object:

Read the gene BED file:

# 
gene.obj=readTranscriptFeatures(system.file("extdata", "refseq.hg18.bed.txt",
package = "methylKit"))

Adding covariates:

# Adding covariates
covariates=data.frame(age=c(30,80,34,30,80,40))

sim.methylBase=dataSim(replicates=6,sites=1000,treatment=c(rep(1,3),rep(0,3)),
                       covariates=covariates,sample.ids=c(paste0("test",1:3),paste0("ctrl",1:3)))

my.diffMeth3=calculateDiffMeth(sim.methylBase, covariates=covariates,
                               overdispersion="MN",test="Chisq",mc.cores=1)

Methylation segmentation

Read Data:

methFile=system.file("extdata","H1.chr21.chr22.rds", package="compGenomRData")
mbw=readRDS(methFile)

Segment the methylation data:

res=methSeg(mbw, minSeg=10, G=1:4, join.neighbours = TRUE)

Plot mean methylation value and the length of the segment as a scatter plot to examine the characteristics of the segment classes:

plot(res$seg.mean, log10(width(res)), pch=20,
     col=scales::alpha(rainbow(4)[as.numeric(res$seg.group)], 0.2),
     ylab="log10(length)",
     xlab="methylation proportion")

Working with large files

myobj=methRead( file.list, sample.id=list("test1","test2","ctrl1","ctrl2"),
                assembly="hg18",treatment=c(1,1,0,0), dbtype="tabix")

Annotation of DMRs/DMCs and segments

Read the gene BED file:

transcriptBED=system.file("extdata", "refseq.hg18.bed.txt",
                          package = "methylKit")
gene.obj=readTranscriptFeatures(transcriptBED)

Annotate differentially methylated CpGs with promoter/exon/intron using annotation data:

annotateWithGeneParts(as(all.diff,"GRanges"),gene.obj)
  promoter       exon     intron intergenic 
     28.24      15.27      33.59      58.02 
  promoter       exon     intron intergenic 
     28.24       0.00      13.74      58.02 
promoter     exon   intron 
    0.29     0.03     0.17 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      5     815   49918   52410   94645  313528 

We can read the CpG island annotation and annotate our differentially methylated bases/regions with them:

cpg.file=system.file("extdata", "cpgi.hg18.bed.txt",
                     package = "methylKit")

cpg.obj=readFeatureFlank(cpg.file,
                         feature.flank.name=c("CpGi","shores"))

Convert methylDiff object to GRanges and annotate:

diffCpGann=annotateWithFeatureFlank(as(all.diff,"GRanges"),
                                    cpg.obj$CpGi,cpg.obj$shores,
                                    feature.name="CpGi",flank.name="shores")