The classpredict package builds multivariate predictors that determine which class each sample belongs to. It offers several multivariate classification methods, including the Compound Covariate Predictor, Diagonal Linear Discriminant Analysis, Nearest Neighbor Predictor, Nearest Centroid Predictor, and Support Vector Machine Predictor. The package also reports, with graphical output, how accurately these multivariate class predictors assign the classes.
Predictor performance is evaluated by cross-validation: leave-one-out cross-validation, k-fold validation, or 0.632+ bootstrap validation. For each classifier, the package reports a cross-validated estimate of the misclassification rate. A multivariate predictor built on the full dataset can then be used to classify new samples.
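To make the idea of a cross-validated misclassification rate concrete, here is a minimal base-R sketch of leave-one-out cross-validation with a nearest-centroid classifier. It does not use classpredict: the data are simulated, and the class labels "A" and "B" and the helper `nearest_centroid` are hypothetical names chosen for illustration. In a real analysis, any gene selection step would also need to be repeated inside each fold.

```r
# Leave-one-out cross-validation of a nearest-centroid classifier (base R only).
# Simulated data: 20 samples x 5 genes, two hypothetical classes "A" and "B".
set.seed(1)
n_per_class <- 10
x <- rbind(
  matrix(rnorm(n_per_class * 5, mean = 0), ncol = 5),
  matrix(rnorm(n_per_class * 5, mean = 2), ncol = 5)
)
y <- rep(c("A", "B"), each = n_per_class)

# Assign a sample to the class whose centroid is closest in Euclidean distance.
nearest_centroid <- function(train_x, train_y, new_x) {
  classes <- unique(train_y)
  dists <- sapply(classes, function(cl) {
    centroid <- colMeans(train_x[train_y == cl, , drop = FALSE])
    sqrt(sum((new_x - centroid)^2))
  })
  classes[which.min(dists)]
}

# Hold out each sample in turn, train on the rest, predict the held-out sample.
loo_pred <- vapply(seq_len(nrow(x)), function(i) {
  nearest_centroid(x[-i, , drop = FALSE], y[-i], x[i, ])
}, character(1))

# Fraction of held-out samples that were assigned to the wrong class.
misclassification_rate <- mean(loo_pred != y)
print(misclassification_rate)
```

k-fold validation follows the same pattern, except that a block of samples is held out at each step instead of a single one.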
Set working directory:
setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 5 - Microarray analysis/practical")
Install packages and load libraries:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ROC")
#Class prediction package
install.packages("https://brb.nci.nih.gov/BRB-ArrayTools/RPackagesAndManuals/classpredict_0.2.tar.gz", repos = NULL, type="source")
library(classpredict)
Expression data:
dataset <- "Brca"
x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "classpredict"), header = FALSE)
head(x)
V1 V2 V3 V4 V5
1 -1.39854932 -3.0817938 -2.73039293 -1.8744690 -2.28824496
2 0.39940688 0.2781018 -0.20113993 -0.5334322 -0.57929373
3 -0.02509096 0.4375801 0.10479617 0.9533499 -0.22050031
4 -0.13006058 -0.8389376 -0.23562828 0.6195197 0.81221521
5 -0.10309340 -0.4340958 0.06756324 0.7655347 -0.09386685
6 -0.46566358 -0.6667566 -0.61996901 0.4760281 0.10934740
V6 V7 V8 V9 V10
1 -0.3453870 -1.42321134 -1.7776077 -0.2410080 -0.29195589
2 -0.2874397 -0.88264304 -0.4150376 -1.0223678 -0.74802077
3 0.3532323 -0.67318958 0.5109619 -0.1643868 0.02185956
4 -0.4181434 -0.52509099 0.2630344 0.6429682 1.45843005
5 -0.4181434 0.38414347 -0.1926452 -0.5145731 -0.62403196
6 -0.6036991 0.04809438 -0.6214885 -0.4374053 -1.01105523
V11 V12 V13 V14 V15
1 0.24146917 -1.6374298 -1.01260006 -1.9765410 -1.4594316
2 -1.16699564 -0.1349296 -1.56008780 0.1882037 -0.6693548
3 0.24146917 -0.4323155 -0.02531105 0.2590872 0.7629608
4 -0.04146478 0.6189098 0.60572112 -0.3915786 -0.2762098
5 -0.01761806 0.2038723 -0.63603663 -0.4877939 -0.3301487
6 -0.27593011 0.1233824 -0.63603663 -0.5384200 -1.2370392
V16 V17 V18 V19 V20 V21
1 -1.44625616 -2.26303434 -1.6828099 -0.8214026 -0.5618789 -0.4611339
2 -0.09575906 -0.97615325 -1.0000000 -0.8614801 -1.6322682 -0.7737241
3 0.11445862 0.15919858 0.9940752 0.4066253 0.4381211 0.4116309
4 -0.70209515 0.03562388 0.7697023 1.3286228 1.3737305 0.5574818
5 -0.81398809 0.15919858 -0.2725259 1.3330686 1.2422009 0.0402640
6 -0.22902554 -0.35230181 -0.1625530 -0.9668331 0.5901242 0.3164738
V22
1 -0.93288577
2 -0.33342373
3 1.25153875
4 1.02272010
5 0.03394729
6 -0.15710096
expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "classpredict"), as.is = TRUE)
head(expdesign)
Patient.Array PID BRCA1.v.BRCA2.v.Sporadic BRCA1.V.BRCA2
1 s1321 20 Sporadic
2 s1996 1 BRCA1 BRCA1
3 s1822 5 BRCA1 BRCA1
4 s1714 3 BRCA1 BRCA1
5 s1224 7 BRCA1 BRCA1
6 s1252 2 BRCA1 BRCA1
BRCA1.v.Sporadic BRCA2.v.Sporadic BRCA1.v.notBRCA1 BRCA2.v.notBRCA2
1 notBRCA1 notBRCA2
2 BRCA1 BRCA1 notBRCA2
3 BRCA1 BRCA1 notBRCA2
4 BRCA1 BRCA1 notBRCA2
5 BRCA1 BRCA1 notBRCA2
6 BRCA1 BRCA1 notBRCA2
group predictTest
1 a training
2 a training
3 b training
4 b training
5 c training
6 c training
The “classPredict” function builds multiple classifiers that are used to predict the class of a new sample. It implements the class prediction tool from BRB-ArrayTools, which supports multiple methods. The package also provides test.classPredict for a quick start: it runs a class prediction analysis on one of the built-in sample datasets (“Brca”, “Perou”, or “Pomeroy”).
res1 <- test.classPredict('Brca', outputName = "ClassPrediction_Brca",
generateHTML = TRUE)
Getting analysis results ...
res2 <- test.classPredict('Pomeroy', outputName = "ClassPrediction_Pomeroy",
generateHTML = TRUE)
Getting analysis results ...
res3 <- test.classPredict('Perou', outputName = "ClassPrediction_Perou",
generateHTML = TRUE)
Getting analysis results ...
names(res1)
[1] "performClass" "percentCorrectClass" "predNewSamples"
[4] "classifierTable" "probInClass" "CCPSenSpec"
[7] "LDASenSpec" "K1NNSenSpec" "K3NNSenSpec"
[10] "CentroidSenSpec" "SVMSenSpec" "BCPPSenSpec"
[13] "probNew" "weightLinearPred" "thresholdLinearPred"
[16] "GRPCentroid" "pmethod" "workPath"
names(res2)
[1] "performClass" "percentCorrectClass" "classifierTable"
[4] "probInClass" "CCPSenSpec" "LDASenSpec"
[7] "K1NNSenSpec" "K3NNSenSpec" "CentroidSenSpec"
[10] "SVMSenSpec" "BCPPSenSpec" "weightLinearPred"
[13] "thresholdLinearPred" "GRPCentroid" "pmethod"
[16] "workPath"
names(res3)
[1] "performClass" "percentCorrectClass" "classifierTable"
[4] "pmethod" "workPath"
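As the three listings show, the results do not all contain the same elements, so code that processes them may want to check for an element before using it. The sketch below uses a hypothetical stand-in list with the element names from the names(res3) output; `get_element` is an illustrative helper, not part of classpredict.

```r
# Hypothetical stand-in for a list returned by test.classPredict(); only the
# element names (taken from the names(res3) output) matter here.
res <- list(
  performClass = data.frame(),
  percentCorrectClass = data.frame(),
  classifierTable = data.frame(),
  pmethod = character(0),
  workPath = ""
)

# Return the element if present, otherwise NULL. `[[` matches names exactly,
# whereas `$` on a list allows partial matching.
get_element <- function(res, name) {
  if (name %in% names(res)) res[[name]] else NULL
}

new_samples <- get_element(res, "predNewSamples")
if (is.null(new_samples)) {
  message("No 'predNewSamples' element in this result.")
} else {
  print(head(new_samples))
}
```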
res$performClass is a data frame with the performance of classifiers during cross-validation:
res1$performClass[1:11,]
Array id Class label Mean Number of genes in classifier
1 s1996 BRCA1 16
2 s1822 BRCA1 20
3 s1714 BRCA1 28
4 s1224 BRCA1 15
5 s1252 BRCA1 28
6 s1510 BRCA1 20
7 s1905 BRCA1 20
8 s1900 BRCA2 13
9 s1787 BRCA2 17
10 s1721 BRCA2 10
11 s1486 BRCA2 17
CCP Correct? DLDA Correct? 1NN Correct? 3NN Correct?
1 YES YES YES YES
2 YES YES YES YES
3 YES YES YES YES
4 YES YES YES YES
5 YES NO YES NO
6 YES YES YES YES
7 YES YES YES YES
8 YES YES YES NO
9 YES YES YES YES
10 YES YES YES YES
11 NO NO YES NO
Nearest Centroid Correct? SVM Correct? BCCP Correct?
1 YES YES YES
2 YES YES YES
3 YES YES YES
4 YES YES YES
5 YES YES YES
6 YES YES YES
7 YES YES YES
8 NO YES YES
9 YES YES YES
10 YES YES YES
11 NO NO NO
res$percentCorrectClass is a data frame with the percentage of samples correctly classified during cross-validation by each prediction method:
res1$percentCorrectClass
CCP Correct? DLDA Correct? 1NN Correct? 3NN Correct?
1 91 82 100 73
Nearest Centroid Correct? SVM Correct? BCCP Correct?
1 82 91 91
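These percentages can be checked against the per-sample results. The sketch below re-enters the YES/NO columns from the res1$performClass output by hand (the short column names are chosen here for brevity) and computes the share of YES entries per method, assuming the percentages are rounded to whole numbers.

```r
# Per-sample cross-validation results copied from the res1$performClass output.
perform <- data.frame(
  CCP      = c(rep("YES", 10), "NO"),
  DLDA     = c("YES", "YES", "YES", "YES", "NO", "YES", "YES", "YES", "YES", "YES", "NO"),
  `1NN`    = rep("YES", 11),
  `3NN`    = c("YES", "YES", "YES", "YES", "NO", "YES", "YES", "NO", "YES", "YES", "NO"),
  Centroid = c("YES", "YES", "YES", "YES", "YES", "YES", "YES", "NO", "YES", "YES", "NO"),
  SVM      = c(rep("YES", 10), "NO"),
  BCCP     = c(rep("YES", 10), "NO"),
  check.names = FALSE
)

# Percentage of the 11 samples classified correctly by each method.
percent_correct <- round(100 * colMeans(perform == "YES"))
print(percent_correct)
```

The result (91, 82, 100, 73, 82, 91, 91) matches the values reported in res1$percentCorrectClass.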
res$predNewSamples is a data frame with predicted class for each new sample. NC means that a sample is not classified. In this example, there are four new samples:
res1$predNewSamples[1:4,]
ExpID TrueClass CCP LDA K1 K3 Centroid SVM BCCP
1 s1816 predict BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2
2 s1616 predict BRCA2 BRCA1 BRCA2 BRCA1 BRCA2 BRCA2 NC
3 s1063 predict BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1
4 s1936 predict BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2
res$probNew is a data frame with the predicted probability that each new sample belongs to the class (BRCA1), as estimated by the Bayesian Compound Covariate method:
res1$probNew[1:4,]
Array id Class Probability
1 s1816 BRCA1 p < 1.0e-3
2 s1616 BRCA1 0.344
3 s1063 BRCA1 1
4 s1936 BRCA1 p < 1.0e-3
Note: the remaining elements of the result are described below.
res$classifierTable | Data frame with the composition of the classifiers, such as geometric means of values in each class, p-values, and gene IDs |
res$probInClass | Data frame with the predicted probability of each training sample belonging to a class during cross-validation, from the Bayesian Compound Covariate method |
res$CCPSenSpec | Data frame with the performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Compound Covariate Predictor Classifier |
res$LDASenSpec | Data frame with the performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Diagonal Linear Discriminant Analysis Classifier |
res$K1NNSenSpec | Data frame with the performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 1-Nearest Neighbor Classifier |
res$K3NNSenSpec | Data frame with the performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 3-Nearest Neighbor Classifier |
res$CentroidSenSpec | Data frame with the performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Nearest Centroid Classifier |
res$SVMSenSpec | Data frame with the performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Support Vector Machine Classifier |
res$BCPPSenSpec | Data frame with the performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Bayesian Compound Covariate Classifier |
res$weightLinearPred | Data frame with gene weights for linear predictors such as the Compound Covariate Predictor, Diagonal Linear Discriminant Analysis, and Support Vector Machine |
res$thresholdLinearPred | Thresholds for the linear prediction rules associated with res$weightLinearPred. Each prediction rule is defined by the inner product of the weights ($w_i$) and the log expression values ($x_i$) of the significant genes. In this case, a sample is classified as BRCA1 if the sum is greater than the threshold; that is, $\sum_i w_i x_i > \text{threshold}$ |
res$GRPCentroid | Data frame with the centroid of each class for each predictor gene |
res$pmethod | Vector of the prediction methods that were specified |
res$workPath | Path for Fortran and other intermediate outputs |
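The linear prediction rule described for res$thresholdLinearPred can be illustrated with a few lines of base R. Everything in this sketch is hypothetical: the gene names, weights, expression values, and threshold are made up and are not taken from the package output, and BRCA2 is assumed to be the alternative class.

```r
# Hypothetical gene weights, log expression values, and threshold
# (illustrative numbers only, not package output).
w <- c(geneA = 1.5, geneB = -0.8, geneC = 2.1)
x_new <- c(geneA = 0.4, geneB = -1.2, geneC = 0.3)
threshold <- 1.0

# Weighted sum over the genes, matching weights to expression values by name:
# 1.5 * 0.4 + (-0.8) * (-1.2) + 2.1 * 0.3 = 2.19
score <- sum(w * x_new[names(w)])

# Classify as BRCA1 when the sum exceeds the threshold (BRCA2 assumed otherwise).
predicted <- if (score > threshold) "BRCA1" else "BRCA2"
print(score)
print(predicted)
```

Indexing `x_new` by `names(w)` keeps the weights and expression values aligned even if the two vectors list the genes in a different order.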
Cross-validation ROC curves are provided for Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Bayesian Compound Covariate Classifiers.
plotROCCurve(res1,"ccp")
plotROCCurve(res1,"dlda")
plotROCCurve(res1,"bcc")
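For readers who want to see what such a curve is built from, the following base-R sketch computes an ROC curve and its area from a set of scores and true labels. It is independent of plotROCCurve and may differ from how the package draws its curves; the scores and labels are hypothetical.

```r
# Hypothetical cross-validated scores (e.g. class probabilities) and true labels.
scores <- c(0.95, 0.90, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20)
truth  <- c(1, 1, 1, 0, 1, 0, 0, 0)   # 1 = positive class, 0 = negative class

# Sweep the decision threshold from the highest score to the lowest.
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(thr) mean(scores[truth == 1] >= thr))
fpr <- sapply(thresholds, function(thr) mean(scores[truth == 0] >= thr))

# Start the curve at the origin (nothing is called positive).
fpr <- c(0, fpr)
tpr <- c(0, tpr)

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)

plot(fpr, tpr, type = "b", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve (hypothetical scores)")
abline(0, 1, lty = 2)   # reference line for a random classifier
```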