Introduction

Researchers used microarrays to study gene expression of cells in the ’00s in a transcriptome-wide manner. While RNA-seq has become the preferred method for genome-wide transcriptomics due to decreasing sequencing costs, microarrays are still used for their relative simplicity of analysis. Furthermore, there are a large number of existing microarray data sets that are still relevant for analysis. The purpose of this tutorial is to introduce the techniques of analyzing gene expression microarray data.

Methods

Set working directory:

setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 5 - Microarray analysis/practical")

Install necessary packages and load libraries:

if (!require("BiocManager", quietly = TRUE))
 install.packages("BiocManager")
BiocManager::install("affy")

library(affy)
library(ggplot2)

1. Data Exploation

Get the .CEL files:

celpath = "C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 5 - Microarray analysis/practical/Microarray data"
data = ReadAffy(celfile.path = celpath)
data # data is the name of your AffyBatch object

AffyBatch object
size of arrays=536x536 features (21 kb)
cdf=Hu6800 (7129 affyids)
number of samples=6
number of genes=7129
annotation=hu6800
notes=

CEL files contain raw intensities for each probe on the array. The data is now an AffyBatch object containing the data from your CEL files.

To view specific items of a simple object: refer to their location using [ ] e.g. - Object[1,2] will retrieve the item on the first row and second column - Object[1:2,] will retrieve all columns of the first two rows

To view an entire column of a data frame: use $ e.g. dataframe$columnname To access slots of complex objects: use object@ followed by the name of the slot.

exprs() and intensity() methods extract the intensities of all probes both PM and MM probes) from the AffyBatch

expr = exprs(data)
int = intensity(data)
expr[1:5,] #limit the view to the first 5 probes

  GSM1688666_MG1999060202AA.CEL.gz GSM1688667_MG2000030311AA.CEL.gz
1                           1206.5                           2184.0
2                          11187.5                          43209.5
3                           1301.3                           2017.0
4                          12234.5                          43422.0
5                            893.3                            708.5
  GSM1688668_MG1999052812AA.CEL.gz GSM1688669_MG1999052811AA.CEL.gz
1                           1007.5                           2495.8
2                           8083.0                          13405.5
3                           1126.3                           2830.3
4                           9114.0                          13769.3
5                            778.5                           1122.5
  GSM1688670_MG1999062405AA.CEL.gz GSM1688671_MG1999060203AA.CEL.gz
1                            688.3                           1112.5
2                           8949.3                           4943.3
3                            667.5                            951.0
4                           9328.0                           4082.8
5                            557.3                            728.5

Look at the intensities of the PM probes using the pm() method

pm(data)[1:5,]

    GSM1688666_MG1999060202AA.CEL.gz GSM1688667_MG2000030311AA.CEL.gz
542                            972.5                            679.5
543                           1055.5                            805.0
544                           1757.5                           1913.0
545                           2134.3                           3714.0
546                           3055.0                           1945.3
    GSM1688668_MG1999052812AA.CEL.gz GSM1688669_MG1999052811AA.CEL.gz
542                            893.0                           2284.5
543                           1000.0                           2257.3
544                           1494.3                           3321.5
545                           2454.3                           5854.3
546                           1105.5                           4398.8
    GSM1688670_MG1999062405AA.CEL.gz GSM1688671_MG1999060203AA.CEL.gz
542                            589.8                            881.8
543                            746.0                            912.0
544                            713.8                            986.8
545                            840.0                           1331.0
546                            873.3                           1194.5

Retrieve sample annotation using affy

Microarray data sets have information about the samples that were hybridized to the arrays, e.g. sample labels. To retrieve phenodata use:

ph = data@phenoData
ph

An object of class 'AnnotatedDataFrame'
  sampleNames: GSM1688666_MG1999060202AA.CEL.gz
    GSM1688667_MG2000030311AA.CEL.gz ...
    GSM1688671_MG1999060203AA.CEL.gz (6 total)
  varLabels: sample
  varMetadata: labelDescription

Select a column the dataframe using the $ sign.

Find the names of the columns in varLabels: there is one column named sample:

ph$sample

[1] 1 2 3 4 5 6

To look at all the data in the data frame ask for the data slot.

ph@data

                                 sample
GSM1688666_MG1999060202AA.CEL.gz      1
GSM1688667_MG2000030311AA.CEL.gz      2
GSM1688668_MG1999052812AA.CEL.gz      3
GSM1688669_MG1999052811AA.CEL.gz      4
GSM1688670_MG1999062405AA.CEL.gz      5
GSM1688671_MG1999060203AA.CEL.gz      6

pData() method is anonther way to look at the sample annotation:

pData(data)

                                 sample
GSM1688666_MG1999060202AA.CEL.gz      1
GSM1688667_MG2000030311AA.CEL.gz      2
GSM1688668_MG1999052812AA.CEL.gz      3
GSM1688669_MG1999052811AA.CEL.gz      4
GSM1688670_MG1999062405AA.CEL.gz      5
GSM1688671_MG1999060203AA.CEL.gz      6

Retrieve probe annotation using affy

Microarray data sets include information on the probes. AffyBatches have a slot called featureData, a data frame that contains labels for the probes.

feat = data@featureData
feat

An object of class 'AnnotatedDataFrame': none

Retrieve experiment annotation using affy

Microarray data sets include information on the experiment. AffyBatches have a slot for this called experimentData.

exp = data@experimentData
exp

Experiment data
  Experimenter name:  
  Laboratory:  
  Contact information:  
  Title:  
  URL:  
  PMIDs:  
  No abstract available.
  Information is available on: preprocessing 
  notes:
   :

2. Quality control of microarray data

Define that the first column of the data slot in the phenoData corresponds to the vector containing the sample names created by the c() command:

ph@data[ ,1] = c("wt1","wt2","wt3","mut1","mut2","mut3")
ph

An object of class 'AnnotatedDataFrame'
  sampleNames: GSM1688666_MG1999060202AA.CEL.gz
    GSM1688667_MG2000030311AA.CEL.gz ...
    GSM1688671_MG1999060203AA.CEL.gz (6 total)
  varLabels: sample
  varMetadata: labelDescription

In this dataset we have 2 groups of 3 replicates: 3 WT controls and 3 mutants

If we had 3 groups of 3 replicates:

ph@data[ ,1] = c("control1","control2","control3","iso1","iso2","iso3","swim1","swim2","swim3")
ph@data

3. Create plots to assess the quality of the data

Microarray pictures can show large inconsistencies on individual arrays. If you have a set of small arrays, e.g., 6 arrays and you want to plot them on a single plot, you can use the following code for the plotting:

for (i in 1:6)
{
 name = paste("image",i,".jpg",sep="")
 jpeg(name)
 image(data[,i],main=ph@data$sample[i])
 dev.off()
}

op = par(mfrow = c(2,3))
for (i in 1:6){image(data[,i],main=ph@data$sample[i])}

par(mfrow = c(2,3)) defines that you want to plot the 6 figures in a grid of 2 rows and 3 columns

If we have 3 groups of 3 replicates and want to plot them on a single plot:

op = par(mfrow = c(3,3))
for (i in 1:9){image(data[,i],main=ph@data$sample[i])}

Histograms and Boxplots

Histograms are plotted as a quality control check to determine if normalisation is required. We will plot the distribution of log base 2 intensities (log2(PMij) for array i and probe j) of perfect match probes for comparison of probe intensity behaviour between different arrays. If we see differences in shape or centre of the distributions, it means that normalization is required.

op = par(mfrow = c(2,3))
for(i in 1:6){hist(data[,i],lwd=2,which='pm',ylab='Density',xlab='Log2 
intensities',main=ph@data$sample[i])}

Create one histogram plot containing the histograms of all samples:

colour=c('green','green','green','red','red','red')
hist(data[,1:6],lwd=2,which='pm',col=colour,ylab='Density',xlab='Log2 
intensities',main='Histogram of raw data')

Boxplots and histograms show the same differences in probe intensity behavior between arrays. In order to perform meaningful statistical analysis and inferences from the data, you need to ensure that all the samples are comparable. To examine and compare the overall distribution of log transformed PM intensities between the samples you can use a histogram but you will get a clearer view with a box plot.

boxplot() method:

name = "boxplot.jpg"
jpeg(name)
boxplot( data, which='pm', col='red', names=ph@data$sample)
#dev.off()

The which argument allows you to specify if you want to use: • perfect match probes only: which=‘pm’ • mismatch probes only: which=‘mm’ • both: which=‘both’ Since the use of MM probes is highly controversial, we will work with PM probes only. The col argument specifies the color of the boxes, the names argument the labels on the X-axis.

Create box plot of the raw data using ggplot

Using the PM intensities:

pmexp = pm(data)

Create two empty vectors that will serve as the two columns of the data frame: - one to store the sample names in, called sampleNames - one to store the log intensities in , called logs

sampleNames = vector()
logs = vector()

for (i in 1:6)
{
sampleNames = c(sampleNames,rep(ph@data[i,1],dim(pmexp)[1]))
logs = c(logs,log2(pmexp[,i]))
}

If there’s 3 groups of 3 replicates, the code is as follows:

sampleNames = vector()
logs = vector()
15
for (i in 1:9)
{
sampleNames = c(sampleNames,rep(ph@data[i,1],dim(pmexp)[1]))
logs = c(logs,log2(pmexp[,i]))

After we have filled the vectors with the data we need, we combine sample names and log intensities into a single data frame:

logData = data.frame(logInt=logs,sampleName=sampleNames)

#Now we create the plot
dataBox = ggplot(logData,aes(sampleName,logInt))
dataBox + geom_boxplot()

Create a boxplot of normalized intensities

Use same code as for raw intensities but use the normalized intensities as input (data.matrix) instead of the raw data:

name = "boxplotnorm.jpg"
jpeg(name)
boxplot(data.matrix,col='red',names=ph@data$sample)
dev.off()

Create a box plot of normalized intensities using ggplot

sampleNames = vector()
normlogs = vector()

for (i in 1:6)
{
sampleNames = c(sampleNames,rep(ph@data[i,1],dim(data.matrix)[1]))
normlogs = c(normlogs,data.matrix[,i])
}

If you have 3 groups of 3 samples, the code is as follows:

sampleNames = vector()
normlogs = vector()
for (i in 1:9)
{
sampleNames = c(sampleNames,rep(ph@data[i,1],dim(data.matrix)[1]))
normlogs = c(normlogs,data.matrix[,i])
}

After filling the vectors with data, we combine sample names and normalized intensities into a single data frame:

normData = data.frame(norm_logInt=normlogs,sampleName=sampleNames)


dataBox = ggplot(normData, aes(sampleName,norm_logInt))
dataBox + geom_boxplot() + ylim(2,16) + ggtitle("after normalization")

dataBox = ggplot(logData,aes(sampleName,logInt))
dataBox + geom_boxplot() + ylim(2,16) + ggtitle("before normalization")

MicroArray Analysis in R