Class Prediction Analysis for Gene Expression Data

Beatriz Manso
2022-04-20

Introduction

The classpredict package predicts the class of each sample using a multivariate predictor. The multivariate classification methods it offers include the Compound Covariate Predictor, Diagonal Linear Discriminant Analysis, Nearest Neighbor Predictor, Nearest Centroid Predictor, and Support Vector Machine Predictor. The package also provides a graphical representation of how accurately each multivariate class predictor performs.

The whole procedure is evaluated by cross-validation: leave-one-out cross-validation, k-fold cross-validation, or 0.632+ bootstrap validation. Each classifier's performance is reported together with a cross-validated estimate of its misclassification rate. The predictors built on the full dataset can then be used to classify new samples.
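To make the idea of a cross-validated misclassification rate concrete, here is a minimal base-R sketch of leave-one-out cross-validation with a simple nearest-centroid rule. It is an illustration only: the data are simulated, the helper predict_centroid is a hypothetical name, and the sketch omits the gene-selection step that a full analysis would repeat inside each fold. It is not the implementation used by classpredict.

```r
# Simulated toy data (not from the package): 50 genes x 20 samples, two classes
set.seed(1)
n_genes <- 50
labels <- rep(c("A", "B"), each = 10)
expr <- matrix(rnorm(n_genes * 20), nrow = n_genes)
# Shift the first 5 genes in class B so that the classes differ
expr[1:5, labels == "B"] <- expr[1:5, labels == "B"] + 3

# Hypothetical helper: assign a sample to the class with the closest centroid
predict_centroid <- function(train, train_labels, test_sample) {
  classes <- unique(train_labels)
  dists <- sapply(classes, function(cl) {
    centroid <- rowMeans(train[, train_labels == cl, drop = FALSE])
    sum((test_sample - centroid)^2)  # squared Euclidean distance
  })
  classes[which.min(dists)]
}

# Leave-one-out: each sample is predicted from a model fitted to all the others
loocv_pred <- sapply(seq_along(labels), function(i) {
  predict_centroid(expr[, -i, drop = FALSE], labels[-i], expr[, i])
})

# Cross-validated estimate of the misclassification rate
misclassification_rate <- mean(loocv_pred != labels)
misclassification_rate
```

Because every prediction is made for a sample the model has not seen, the resulting rate is a less optimistic estimate than the error measured on the training data itself.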

Methods

Set working directory:

setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 5 - Microarray analysis/practical")

Install packages and load libraries:

if (!require("BiocManager", quietly = TRUE))
 install.packages("BiocManager")
BiocManager::install("ROC")

#Class prediction package
install.packages("https://brb.nci.nih.gov/BRB-ArrayTools/RPackagesAndManuals/classpredict_0.2.tar.gz", repos = NULL, type="source")

library(classpredict)

1. Get Built-in sample data - Cancer data

Expression data:

dataset <- "Brca"

x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "classpredict"), header = FALSE)

head(x)
           V1         V2          V3         V4          V5
1 -1.39854932 -3.0817938 -2.73039293 -1.8744690 -2.28824496
2  0.39940688  0.2781018 -0.20113993 -0.5334322 -0.57929373
3 -0.02509096  0.4375801  0.10479617  0.9533499 -0.22050031
4 -0.13006058 -0.8389376 -0.23562828  0.6195197  0.81221521
5 -0.10309340 -0.4340958  0.06756324  0.7655347 -0.09386685
6 -0.46566358 -0.6667566 -0.61996901  0.4760281  0.10934740
          V6          V7         V8         V9         V10
1 -0.3453870 -1.42321134 -1.7776077 -0.2410080 -0.29195589
2 -0.2874397 -0.88264304 -0.4150376 -1.0223678 -0.74802077
3  0.3532323 -0.67318958  0.5109619 -0.1643868  0.02185956
4 -0.4181434 -0.52509099  0.2630344  0.6429682  1.45843005
5 -0.4181434  0.38414347 -0.1926452 -0.5145731 -0.62403196
6 -0.6036991  0.04809438 -0.6214885 -0.4374053 -1.01105523
          V11        V12         V13        V14        V15
1  0.24146917 -1.6374298 -1.01260006 -1.9765410 -1.4594316
2 -1.16699564 -0.1349296 -1.56008780  0.1882037 -0.6693548
3  0.24146917 -0.4323155 -0.02531105  0.2590872  0.7629608
4 -0.04146478  0.6189098  0.60572112 -0.3915786 -0.2762098
5 -0.01761806  0.2038723 -0.63603663 -0.4877939 -0.3301487
6 -0.27593011  0.1233824 -0.63603663 -0.5384200 -1.2370392
          V16         V17        V18        V19        V20        V21
1 -1.44625616 -2.26303434 -1.6828099 -0.8214026 -0.5618789 -0.4611339
2 -0.09575906 -0.97615325 -1.0000000 -0.8614801 -1.6322682 -0.7737241
3  0.11445862  0.15919858  0.9940752  0.4066253  0.4381211  0.4116309
4 -0.70209515  0.03562388  0.7697023  1.3286228  1.3737305  0.5574818
5 -0.81398809  0.15919858 -0.2725259  1.3330686  1.2422009  0.0402640
6 -0.22902554 -0.35230181 -0.1625530 -0.9668331  0.5901242  0.3164738
          V22
1 -0.93288577
2 -0.33342373
3  1.25153875
4  1.02272010
5  0.03394729
6 -0.15710096

2. Class Information

expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "classpredict"), as.is = TRUE)

head(expdesign)
  Patient.Array PID BRCA1.v.BRCA2.v.Sporadic BRCA1.V.BRCA2
1         s1321  20                 Sporadic              
2         s1996   1                    BRCA1         BRCA1
3         s1822   5                    BRCA1         BRCA1
4         s1714   3                    BRCA1         BRCA1
5         s1224   7                    BRCA1         BRCA1
6         s1252   2                    BRCA1         BRCA1
  BRCA1.v.Sporadic BRCA2.v.Sporadic BRCA1.v.notBRCA1 BRCA2.v.notBRCA2
1                                           notBRCA1         notBRCA2
2            BRCA1                             BRCA1         notBRCA2
3            BRCA1                             BRCA1         notBRCA2
4            BRCA1                             BRCA1         notBRCA2
5            BRCA1                             BRCA1         notBRCA2
6            BRCA1                             BRCA1         notBRCA2
  group predictTest
1     a    training
2     a    training
3     b    training
4     b    training
5     c    training
6     c    training
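As a small illustration of how a class column from this table might be used, the sketch below rebuilds a few columns of the six rows printed above and keeps only the samples that carry a label in BRCA1.V.BRCA2. Treating a blank entry as "not part of this comparison" is an assumption made for the sketch; the object name expdesign_head is hypothetical.

```r
# Selected columns of the six rows printed above, typed in by hand
expdesign_head <- data.frame(
  Patient.Array = c("s1321", "s1996", "s1822", "s1714", "s1224", "s1252"),
  BRCA1.V.BRCA2 = c("", "BRCA1", "BRCA1", "BRCA1", "BRCA1", "BRCA1"),
  predictTest   = rep("training", 6),
  stringsAsFactors = FALSE
)

# Assumption: a blank label means the sample is left out of BRCA1 vs BRCA2
in_comparison <- expdesign_head$BRCA1.V.BRCA2 != ""
class_labels <- expdesign_head$BRCA1.V.BRCA2[in_comparison]

# Number of samples per class among the retained rows
table(class_labels)
```

Applied to the full expdesign data frame, the same two lines would give the class sizes for whichever comparison column is chosen.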

3. Class Prediction Analysis

The “classPredict” function builds multiple classifiers that are used to predict the class of a new sample, implementing the multi-method class prediction tool of BRB-ArrayTools. The package also provides test.classPredict for a quick start of class prediction analysis on one of the built-in sample datasets (“Brca”, “Perou”, or “Pomeroy”).

res1 <- test.classPredict('Brca', outputName = "ClassPrediction_Brca", 
generateHTML = TRUE)
Getting analysis results ...
res2 <- test.classPredict('Pomeroy', outputName = "ClassPrediction_Pomeroy", 
generateHTML = TRUE)
Getting analysis results ...
res3 <- test.classPredict('Perou', outputName = "ClassPrediction_Perou", 
generateHTML = TRUE)
Getting analysis results ...

4. List Objects In The Results

names(res1)
 [1] "performClass"        "percentCorrectClass" "predNewSamples"     
 [4] "classifierTable"     "probInClass"         "CCPSenSpec"         
 [7] "LDASenSpec"          "K1NNSenSpec"         "K3NNSenSpec"        
[10] "CentroidSenSpec"     "SVMSenSpec"          "BCPPSenSpec"        
[13] "probNew"             "weightLinearPred"    "thresholdLinearPred"
[16] "GRPCentroid"         "pmethod"             "workPath"           
names(res2)
 [1] "performClass"        "percentCorrectClass" "classifierTable"    
 [4] "probInClass"         "CCPSenSpec"          "LDASenSpec"         
 [7] "K1NNSenSpec"         "K3NNSenSpec"         "CentroidSenSpec"    
[10] "SVMSenSpec"          "BCPPSenSpec"         "weightLinearPred"   
[13] "thresholdLinearPred" "GRPCentroid"         "pmethod"            
[16] "workPath"           
names(res3)
[1] "performClass"        "percentCorrectClass" "classifierTable"    
[4] "pmethod"             "workPath"           

Explanation about each object:

res$performClass is a data frame with the performance of classifiers during cross-validation:

res1$performClass[1:11,]
   Array id Class label Mean Number of genes in classifier
1     s1996       BRCA1                                 16
2     s1822       BRCA1                                 20
3     s1714       BRCA1                                 28
4     s1224       BRCA1                                 15
5     s1252       BRCA1                                 28
6     s1510       BRCA1                                 20
7     s1905       BRCA1                                 20
8     s1900       BRCA2                                 13
9     s1787       BRCA2                                 17
10    s1721       BRCA2                                 10
11    s1486       BRCA2                                 17
   CCP Correct? DLDA Correct? 1NN Correct? 3NN Correct?
1           YES           YES          YES          YES
2           YES           YES          YES          YES
3           YES           YES          YES          YES
4           YES           YES          YES          YES
5           YES            NO          YES           NO
6           YES           YES          YES          YES
7           YES           YES          YES          YES
8           YES           YES          YES           NO
9           YES           YES          YES          YES
10          YES           YES          YES          YES
11           NO            NO          YES           NO
   Nearest Centroid Correct? SVM Correct? BCCP Correct?
1                        YES          YES           YES
2                        YES          YES           YES
3                        YES          YES           YES
4                        YES          YES           YES
5                        YES          YES           YES
6                        YES          YES           YES
7                        YES          YES           YES
8                         NO          YES           YES
9                        YES          YES           YES
10                       YES          YES           YES
11                        NO           NO            NO
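The per-sample YES/NO flags can be turned into per-class performance measures. The sketch below does this for the CCP column, using the labels and outcomes typed in from the 11 rows printed above. Two assumptions are made: these 11 rows are all the cross-validated samples, and BRCA1 is treated as the "positive" class. The numbers are derived here by hand and are not read from the package's own output objects.

```r
# Class labels and CCP outcomes copied from the 11 rows printed above
true_class  <- c(rep("BRCA1", 7), rep("BRCA2", 4))
ccp_correct <- c(rep(TRUE, 10), FALSE)

# With two classes, an incorrect prediction must be the other class
other_class <- ifelse(true_class == "BRCA1", "BRCA2", "BRCA1")
predicted   <- ifelse(ccp_correct, true_class, other_class)

# Confusion-matrix counts, with BRCA1 as the positive class (a convention)
tp <- sum(predicted == "BRCA1" & true_class == "BRCA1")
fn <- sum(predicted == "BRCA2" & true_class == "BRCA1")
tn <- sum(predicted == "BRCA2" & true_class == "BRCA2")
fp <- sum(predicted == "BRCA1" & true_class == "BRCA2")

ccp_perf <- c(
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
ccp_perf
```

The same calculation applies to any other classifier column by replacing ccp_correct with that column's outcomes.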

res$percentCorrectClass is a data frame with the mean percentage of samples classified correctly by each prediction method:

res1$percentCorrectClass
  CCP Correct? DLDA Correct? 1NN Correct? 3NN Correct?
1           91            82          100           73
  Nearest Centroid Correct? SVM Correct? BCCP Correct?
1                        82           91            91
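These percentages can be checked against the YES/NO table shown for res1$performClass. The sketch below assumes that the 11 rows printed there are all the cross-validated samples; the positions of the NO entries are typed in by hand from that output, and the short method names are chosen for the sketch.

```r
# Row numbers marked NO for each method in the 11 rows of res1$performClass
n_samples <- 11
wrong <- list(
  CCP      = 11,
  DLDA     = c(5, 11),
  `1NN`    = integer(0),
  `3NN`    = c(5, 8, 11),
  Centroid = c(8, 11),
  SVM      = 11,
  BCCP     = 11
)

# Percentage of samples classified correctly, rounded to a whole number
percent_correct <- sapply(wrong, function(idx) {
  round(100 * (n_samples - length(idx)) / n_samples)
})
percent_correct
```

The rounded values agree with the table above: for example, CCP misclassifies one of 11 samples, giving 10/11, or about 91%.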

res$predNewSamples is a data frame with predicted class for each new sample. NC means that a sample is not classified. In this example, there are four new samples:

res1$predNewSamples[1:4,]
  ExpID TrueClass   CCP   LDA    K1    K3 Centroid   SVM  BCCP
1 s1816   predict BRCA2 BRCA2 BRCA2 BRCA2    BRCA2 BRCA2 BRCA2
2 s1616   predict BRCA2 BRCA1 BRCA2 BRCA1    BRCA2 BRCA2    NC
3 s1063   predict BRCA1 BRCA1 BRCA1 BRCA1    BRCA1 BRCA1 BRCA1
4 s1936   predict BRCA2 BRCA2 BRCA2 BRCA2    BRCA2 BRCA2 BRCA2

res$probNew is a data frame with the predicted probability of each new sample belonging to the class (BRCA1) from the Bayesian Compound Covariate method:

res1$probNew[1:4,]
  Array id Class Probability
1    s1816 BRCA1  p < 1.0e-3
2    s1616 BRCA1       0.344
3    s1063 BRCA1           1
4    s1936 BRCA1  p < 1.0e-3

Note:

res$classifierTable Data frame with composition of classifiers such as geometric means of values in each class, p-values and Gene IDs
res$probInClass Data frame with predicted probability of each training sample belonging to a class during cross-validation from the Bayesian Compound Covariate
res$CCPSenSpec Data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Compound Covariate Predictor Classifier
res$LDASenSpec Data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Diagonal Linear Discriminant Analysis Classifier
res$K1NNSenSpec Data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 1-Nearest Neighbor Classifier
res$K3NNSenSpec Data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 3-Nearest Neighbor Classifier
res$CentroidSenSpec Data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Nearest Centroid Classifier
res$SVMSenSpec Data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Support Vector Machine Classifier
res$BCPPSenSpec Data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Bayesian Compound Covariate Classifier
res$weightLinearPred Data frame with gene weights for linear predictors such as Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Support Vector Machine
res$thresholdLinearPred Contains the thresholds for the linear prediction rules related to res$weightLinearPred. Each prediction rule is defined by the inner sum of the weights ($w_i$) and log expression values ($x_i$) of significant genes. In this case, a sample is classified to the class BRCA1 if the sum is greater than the threshold; that is, $\sum_i w_i x_i > \text{threshold}$
res$GRPCentroid Data frame with centroid of each class for each predictor gene
res$pmethod Vector of prediction methods that are specified
res$workPath Path for Fortran and other intermediate outputs
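The linear prediction rule described for res$weightLinearPred and res$thresholdLinearPred can be sketched in a few lines. The gene names, weights, expression values and threshold below are hypothetical values invented for illustration; they are not taken from the Brca results, and the class assigned when the sum does not exceed the threshold (BRCA2) is assumed from the two-class setting.

```r
# Hypothetical gene weights, log expression values and threshold (illustration only)
w <- c(geneA = 1.5, geneB = -2.0, geneC = 0.7)
x <- c(geneA = 0.4, geneB = -0.9, geneC = 1.1)
threshold <- 1.0

# Inner sum of weights and log expression values over the significant genes
score <- sum(w * x[names(w)])

# Classify as BRCA1 when the sum exceeds the threshold, otherwise BRCA2
predicted_class <- if (score > threshold) "BRCA1" else "BRCA2"
predicted_class
```

Indexing x by names(w) keeps the genes aligned even if the two vectors are stored in a different order.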

5. Producing ROC Curves

Cross-validation ROC curves are provided for Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Bayesian Compound Covariate Classifiers.

plotROCCurve(res1,"ccp")

plotROCCurve(res1,"dlda")

plotROCCurve(res1,"bcc")
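For readers who want to see what such a curve is built from, the following base-R sketch computes the points of an ROC curve and the area under it from a vector of scores and true labels. The scores and labels are hypothetical values made up for the example, and the calculation is a generic one; it is not claimed to reproduce what plotROCCurve does internally.

```r
# Hypothetical cross-validated scores and true labels (1 = positive class)
scores <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20)
truth  <- c(1, 1, 0, 1, 1, 0, 0, 0)

# Lower the decision threshold one sample at a time, from the highest score down
ord <- order(scores, decreasing = TRUE)
roc <- data.frame(
  fpr = c(0, cumsum(truth[ord] == 0) / sum(truth == 0)),  # 1 - specificity
  tpr = c(0, cumsum(truth[ord] == 1) / sum(truth == 1))   # sensitivity
)

# Area under the curve by the trapezoidal rule
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
auc
```

Calling plot(roc$fpr, roc$tpr, type = "l") would draw the curve; an area close to 1 indicates that positive samples tend to receive higher scores than negative ones.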