Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately.
Bioinformatics work in genomics often throws some challenging questions that are mostly best solved with machine learning, where algorithms will learn a mathematical model of the input data in order to make decisions about similar data, previously unseen by the model. Machine Learning provides algorithms which are extensively useful for the tasks of prediction, classification, and feature selection in bioinformatics.
For example, to be able to assign a function to a novel gene or model a disease we often use algorithms to predict a medical or biological variable of interest using molecular signatures obtained via genomics methods.
install.packages("devtools")
package 'devtools' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\manso\AppData\Local\Temp\Rtmp6LPuq0\downloaded_packages
#getting file path for our data
devtools::install_github("compgenomr/compGenomRData", force = TRUE)
fileLGGexp=system.file("extdata",
"LGGrnaseq.rds",
package="compGenomRData")
fileLGGann=system.file("extdata",
"patient2LGGsubtypes.rds",
package="compGenomRData")
TCGA-CS-4941 TCGA-CS-4944 TCGA-CS-5393 TCGA-CS-5394
A1BG 72.2326 24.7132 46.3789 37.9659
A1CF 0.0000 0.0000 0.0000 0.0000
A2BP1 524.4997 105.4092 323.5828 19.7390
A2LD1 144.0856 18.0154 29.0942 7.5945
A2ML1 521.3941 159.3746 164.6157 63.5664
A2M 17944.7205 10894.9590 16480.1130 9217.7919
TCGA-CS-5395
A1BG 19.5162
A1CF 0.0000
A2BP1 299.5375
A2LD1 202.1231
A2ML1 953.4106
A2M 10801.8461
dim(gexp)
[1] 20501 184
subtype
TCGA-FG-8185 CIMP
TCGA-DB-5276 CIMP
TCGA-P5-A77X CIMP
TCGA-IK-8125 CIMP
TCGA-DU-A5TR CIMP
TCGA-E1-5311 CIMP
dim(patients)
[1] 184 1
It essential to examine how variables and samples relate to each other.
We will use the caret::preProcess() function and base R functions
boxplot(gexp[,1:50],outline=FALSE,col="cornflowerblue")
Figure 1: Gene expression distribution for the 5th patient (left). log transformed Gene expression distribution for the same patient (right)
# Install the caret package
install.packages("caret")
for the columns at least 85% of the values are the same this function creates the filter but doesn’t apply it yet
nzv=preProcess(tgexp,method="nzv",uniqueCut = 15)
# apply the filter using "predict" function return the filtered dataset and assign it to nzv_tgexp variable
nzv_tgexp=predict(nzv,tgexp)
# choose arbitrary cutoffs for variability. For example, we can choose to take the top 1000 variable predictors
SDs=apply(tgexp,2,sd )
topPreds=order(SDs,decreasing = TRUE)[1:1000]
tgexp=tgexp[,topPreds]
Use preProcess()
library(caret)
processCenter=preProcess(tgexp, method = c("center"))
tgexp=predict(processCenter,tgexp)
If two variables are highly correlated only one of them is removed
corrFilt=preProcess(tgexp, method = "corr",cutoff = 0.9)
tgexp=predict(corrFilt,tgexp)
missing_tgexp=tgexp
missing_tgexp[1,1]=NA
anyNA(missing_tgexp) # check if there are NA values
[1] TRUE
# impute the missing value(s) by median imputation
library(caret)
mImpute=preProcess(missing_tgexp,method="medianImpute")
imputedGexp=predict(mImpute,missing_tgexp)
install.packages("RANN")
package 'RANN' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\manso\AppData\Local\Temp\Rtmp6LPuq0\downloaded_packages
# by nearest neighbours
library(RANN)
knnImpute=preProcess(missing_tgexp,method="knnImpute")
knnimputedGexp=predict(knnImpute,missing_tgexp)
First, create a single data frame with predictors and response variables
Now the response variable or the class label is merged with our dataset we can split it to test and training set with caret::createPartition().
set.seed(3031) # set the random number seed for reproducibility
# get indices for 70% of the data set
intrain <- createDataPartition(y = tgexp[,1], p= 0.7)[[1]]
# seperate test and training sets
training <- tgexp[intrain,]
testing <- tgexp[-intrain,]
Assess performance by using a confusion matrix (use caret::confusionMatrix() function). CM tries to reduce the classification error, or increase the training accuracy
# gte k-NN prediction on the training data itself, with k=5
knnFit=knn3(x=training[,-1], # training set#
y=training[,1], # training set class labels
k=5)
# predictions on the training set
trainPred=predict(knnFit,training[,-1],type="class")
# compare the predicted labels to real labels and get different performance metrics
install.packages('e1071', dependencies=TRUE)
package 'e1071' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\manso\AppData\Local\Temp\Rtmp6LPuq0\downloaded_packages
confusionMatrix(data=training[,1],reference=trainPred)
Confusion Matrix and Statistics
Reference
Prediction CIMP noCIMP
CIMP 65 0
noCIMP 1 64
Accuracy : 0.9923
95% CI : (0.9579, 0.9998)
No Information Rate : 0.5077
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9846
Mcnemar's Test P-Value : 1
Sensitivity : 0.9848
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9846
Prevalence : 0.5077
Detection Rate : 0.5000
Detection Prevalence : 0.5000
Balanced Accuracy : 0.9924
'Positive' Class : CIMP
Test set accuracy again using the knn function and the confusionMatrix() function on the predicted and real classes.
# predictions on the test set, return class labels
testPred=predict(knnFit,testing[,-1],type="class")
# compare the predicted labels to real labels
# get different performance metrics
confusionMatrix(data=testing[,1],reference=testPred)
Confusion Matrix and Statistics
Reference
Prediction CIMP noCIMP
CIMP 26 1
noCIMP 1 26
Accuracy : 0.963
95% CI : (0.8725, 0.9955)
No Information Rate : 0.5
P-Value [Acc > NIR] : 8.249e-14
Kappa : 0.9259
Mcnemar's Test P-Value : 1
Sensitivity : 0.9630
Specificity : 0.9630
Pos Pred Value : 0.9630
Neg Pred Value : 0.9630
Prevalence : 0.5000
Detection Rate : 0.4815
Detection Prevalence : 0.5000
Balanced Accuracy : 0.9630
'Positive' Class : CIMP
library(pROC)
# get k-NN class probabilities prediction probabilities on the test set
testProbs=predict(knnFit,testing[,-1])
# get the roc curve
rocCurve <- pROC::roc(response = testing[,1],predictor =
testProbs[,1], #This function assumes that the second class is the class of interest, so we reverse the labels.
levels = rev(levels(testing[,1])))
# plot the curve
plot(rocCurve, legacy.axes = TRUE)
# return area under the curve
pROC::auc(rocCurve)
Area under the curve: 0.9733
## Area under the curve: 0.999
set.seed(101)
k=1:12 # set k values
trainErr=c() # set vector for training errors
for( i in k){
knnFit=knn3(x=training[,-1], # training set
y=training[,1], # training set class labels
k=i)
# predictions on the training set
class.res=predict(knnFit,training[,-1],type="class")
# training error
err=1-confusionMatrix(training[,1],class.res)$overall[1]
trainErr[i]=err
}
# plot training error vs k
plot(k,trainErr,type="p",col="#CC0000",pch=20)
# add a smooth line for the trend
lines(loess.smooth(x=k, trainErr,degree=2),col="#CC0000")
set.seed(31)
k=1:12
testErr=c()
for( i in k){
knnFit=knn3(x=training[,-1], # training set
y=training[,1], # training set class labels
k=i)
# predictions on the training set
class.res=predict(knnFit,testing[,-1],type="class")
testErr[i]=1-confusionMatrix(testing[,1],
class.res)$overall[1]
}
# plot training error
plot(k,trainErr,type="p",col="#CC0000",
ylim=c(0.000,0.08),
ylab="prediction error (1-accuracy)",pch=19)
# add a smooth line for the trend
lines(loess.smooth(x=k, trainErr,degree=2), col="#CC0000")
# plot test error
points(k,testErr,col="#00CC66",pch=19)
lines(loess.smooth(x=k,testErr,degree=2), col="#00CC66")
# add legend
legend("bottomright",fill=c("#CC0000","#00CC66"),
legend=c("training","test"),bty="n")
Run a k-NN model with cross-validation using caret::train() function
set.seed(17) # this method controls everything about training
# we will just set up 10-fold cross validation
trctrl <- trainControl(method = "cv",number=10)
# train k-NN model
knn_fit <- train(subtype~., data = training,
method = "knn",
trControl=trctrl,
tuneGrid = data.frame(k=1:12))
# best k value by cross-validation accuracy
knn_fit$bestTune
k
1 1
## k
## 6 6
# plot k vs prediction error
plot(x=1:12,1-knn_fit$results[,2],pch=19,
ylab="prediction error",xlab="k")
lines(loess.smooth(x=1:12,1-knn_fit$results[,2],degree=2),
col="#CC0000")
set.seed(17)
# this method controls everything about training
# we will just set up 100 bootstrap samples and for each bootstrap OOB samples to test the error
trctrl <- trainControl(method = "boot",number=20,
returnResamp="all")
# we will now train k-NN model
knn_fit <- train(subtype~., data = training,
method = "knn",
trControl=trctrl,
tuneGrid = data.frame(k=1:12))