Supervised machine learning in R

Introduction

Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately.

Bioinformatics work in genomics often throws some challenging questions that are mostly best solved with machine learning, where algorithms will learn a mathematical model of the input data in order to make decisions about similar data, previously unseen by the model. Machine Learning provides algorithms which are extensively useful for the tasks of prediction, classification, and feature selection in bioinformatics.

For example, to be able to assign a function to a novel gene or model a disease we often use algorithms to predict a medical or biological variable of interest using molecular signatures obtained via genomics methods.

Methods

install.packages("devtools") 
package 'devtools' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\manso\AppData\Local\Temp\Rtmp6LPuq0\downloaded_packages

1. Get the data

#getting file path for our data
devtools::install_github("compgenomr/compGenomRData", force = TRUE)
fileLGGexp=system.file("extdata",
                       "LGGrnaseq.rds",
                       package="compGenomRData")
fileLGGann=system.file("extdata",
                       "patient2LGGsubtypes.rds",
                       package="compGenomRData")

Check gene expression values

gexp=readRDS(fileLGGexp)
head(gexp[,1:5])
      TCGA-CS-4941 TCGA-CS-4944 TCGA-CS-5393 TCGA-CS-5394
A1BG       72.2326      24.7132      46.3789      37.9659
A1CF        0.0000       0.0000       0.0000       0.0000
A2BP1     524.4997     105.4092     323.5828      19.7390
A2LD1     144.0856      18.0154      29.0942       7.5945
A2ML1     521.3941     159.3746     164.6157      63.5664
A2M     17944.7205   10894.9590   16480.1130    9217.7919
      TCGA-CS-5395
A1BG       19.5162
A1CF        0.0000
A2BP1     299.5375
A2LD1     202.1231
A2ML1     953.4106
A2M     10801.8461

Check number of samples

dim(gexp)
[1] 20501   184

Check patient annotation

patients=readRDS(fileLGGann)
head(patients)
             subtype
TCGA-FG-8185    CIMP
TCGA-DB-5276    CIMP
TCGA-P5-A77X    CIMP
TCGA-IK-8125    CIMP
TCGA-DU-A5TR    CIMP
TCGA-E1-5311    CIMP
dim(patients)
[1] 184   1

2. Data Pre-processing

It essential to examine how variables and samples relate to each other.

We will use the caret::preProcess() function and base R functions

Transform the data (plot the first 50 tumour samples so you can view with clarity)

boxplot(gexp[,1:50],outline=FALSE,col="cornflowerblue")

Normalisation

par(mfrow=c(1,2))
hist(gexp[,5],xlab="gene expression", main="", border="blue4", col="cornflowerblue")
hist(log10(gexp+1)[,5], xlab="gene expression log scale",main="",
 border="blue4",col="cornflowerblue")

Figure 1: Gene expression distribution for the 5th patient (left). log transformed Gene expression distribution for the same patient (right)

# tame the extreme values
gexp=log10(gexp+1)

# transpose the data set
tgexp <- t(gexp)

3. Filter and scale data

# Install the caret package
install.packages("caret")

Remove near zero variation

for the columns at least 85% of the values are the same this function creates the filter but doesn’t apply it yet

nzv=preProcess(tgexp,method="nzv",uniqueCut = 15)

# apply the filter using "predict" function return the filtered dataset and assign it to nzv_tgexp variable
nzv_tgexp=predict(nzv,tgexp)

# choose arbitrary cutoffs for variability. For example, we can choose to take the top 1000 variable predictors
SDs=apply(tgexp,2,sd )
topPreds=order(SDs,decreasing = TRUE)[1:1000]
tgexp=tgexp[,topPreds]

Centre the data

Use preProcess()

library(caret)
processCenter=preProcess(tgexp, method = c("center"))
tgexp=predict(processCenter,tgexp)

Create a filter for removing higly correlated variables

If two variables are highly correlated only one of them is removed

corrFilt=preProcess(tgexp, method = "corr",cutoff = 0.9)
tgexp=predict(corrFilt,tgexp)

4. Dealing with missing values

missing_tgexp=tgexp
missing_tgexp[1,1]=NA
anyNA(missing_tgexp) # check if there are NA values
[1] TRUE
# checks which values are NA in the matrix
gexpnoNA=missing_tgexp[ , colSums(is.na(missing_tgexp)) == 0]
# impute the missing value(s) by median imputation
library(caret)
mImpute=preProcess(missing_tgexp,method="medianImpute")
imputedGexp=predict(mImpute,missing_tgexp)

install.packages("RANN")
package 'RANN' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\manso\AppData\Local\Temp\Rtmp6LPuq0\downloaded_packages
# by nearest neighbours
library(RANN)
knnImpute=preProcess(missing_tgexp,method="knnImpute")
knnimputedGexp=predict(knnImpute,missing_tgexp)

5. Splitting the data

First, create a single data frame with predictors and response variables

tgexp=merge(patients,tgexp,by="row.names")
# push sample ids back to the row names
rownames(tgexp)=tgexp[,1]
tgexp=tgexp[,-1]

Now the response variable or the class label is merged with our dataset we can split it to test and training set with caret::createPartition().

set.seed(3031) # set the random number seed for reproducibility
# get indices for 70% of the data set
intrain <- createDataPartition(y = tgexp[,1], p= 0.7)[[1]]
# seperate test and training sets
training <- tgexp[intrain,]
testing <- tgexp[-intrain,]

6. Predict class or subtype with k-nearest neighbours

library(caret)
knnFit=knn3(x=training[,-1], # training set
 y=training[,1], # training set class labels
 k=5)
# predictions on the test set
trainPred=predict(knnFit,training[,-1])

7. Assess performance of the model

Assess performance by using a confusion matrix (use caret::confusionMatrix() function). CM tries to reduce the classification error, or increase the training accuracy

# gte k-NN prediction on the training data itself, with k=5
knnFit=knn3(x=training[,-1], # training set#
            y=training[,1], # training set class labels
            k=5)

# predictions on the training set
trainPred=predict(knnFit,training[,-1],type="class")

# compare the predicted labels to real labels and get different performance metrics
install.packages('e1071', dependencies=TRUE)
package 'e1071' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\manso\AppData\Local\Temp\Rtmp6LPuq0\downloaded_packages
confusionMatrix(data=training[,1],reference=trainPred)
Confusion Matrix and Statistics

          Reference
Prediction CIMP noCIMP
    CIMP     65      0
    noCIMP    1     64
                                          
               Accuracy : 0.9923          
                 95% CI : (0.9579, 0.9998)
    No Information Rate : 0.5077          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9846          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9848          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9846          
             Prevalence : 0.5077          
         Detection Rate : 0.5000          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.9924          
                                          
       'Positive' Class : CIMP            
                                          

Test set accuracy again using the knn function and the confusionMatrix() function on the predicted and real classes.

# predictions on the test set, return class labels
testPred=predict(knnFit,testing[,-1],type="class")

# compare the predicted labels to real labels
# get different performance metrics
confusionMatrix(data=testing[,1],reference=testPred)
Confusion Matrix and Statistics

          Reference
Prediction CIMP noCIMP
    CIMP     26      1
    noCIMP    1     26
                                          
               Accuracy : 0.963           
                 95% CI : (0.8725, 0.9955)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : 8.249e-14       
                                          
                  Kappa : 0.9259          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9630          
            Specificity : 0.9630          
         Pos Pred Value : 0.9630          
         Neg Pred Value : 0.9630          
             Prevalence : 0.5000          
         Detection Rate : 0.4815          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.9630          
                                          
       'Positive' Class : CIMP            
                                          
library(pROC)
# get k-NN class probabilities prediction probabilities on the test set
testProbs=predict(knnFit,testing[,-1])

# get the roc curve
rocCurve <- pROC::roc(response = testing[,1],predictor =
                        testProbs[,1], #This function assumes that the second class is the class of interest, so we reverse the labels.

                      levels = rev(levels(testing[,1])))

# plot the curve
plot(rocCurve, legacy.axes = TRUE)
# return area under the curve
pROC::auc(rocCurve)
Area under the curve: 0.9733
## Area under the curve: 0.999

8. Model tuning and avoiding overfitting

Pick the best k?

set.seed(101)
k=1:12 # set k values
trainErr=c() # set vector for training errors

for( i in k){
  knnFit=knn3(x=training[,-1], # training set
              y=training[,1], # training set class labels
              k=i)
 # predictions on the training set
 class.res=predict(knnFit,training[,-1],type="class")
 
 
 # training error
 err=1-confusionMatrix(training[,1],class.res)$overall[1]
 trainErr[i]=err
}

# plot training error vs k
plot(k,trainErr,type="p",col="#CC0000",pch=20)

# add a smooth line for the trend
lines(loess.smooth(x=k, trainErr,degree=2),col="#CC0000")

Check the effect of k in training error, as k increases

set.seed(31)
k=1:12
testErr=c()

for( i in k){
  knnFit=knn3(x=training[,-1], # training set
 y=training[,1], # training set class labels
 k=i)
 
  # predictions on the training set
 class.res=predict(knnFit,testing[,-1],type="class")
 testErr[i]=1-confusionMatrix(testing[,1],
 class.res)$overall[1]
}
# plot training error
plot(k,trainErr,type="p",col="#CC0000",
     ylim=c(0.000,0.08),
     ylab="prediction error (1-accuracy)",pch=19)

# add a smooth line for the trend
lines(loess.smooth(x=k, trainErr,degree=2), col="#CC0000")

# plot test error
points(k,testErr,col="#00CC66",pch=19)
lines(loess.smooth(x=k,testErr,degree=2), col="#00CC66")

# add legend
legend("bottomright",fill=c("#CC0000","#00CC66"),
legend=c("training","test"),bty="n")

Cross-validation

Run a k-NN model with cross-validation using caret::train() function

set.seed(17) # this method controls everything about training

# we will just set up 10-fold cross validation
trctrl <- trainControl(method = "cv",number=10)

# train k-NN model
knn_fit <- train(subtype~., data = training,
                 method = "knn",
                 trControl=trctrl,
                 tuneGrid = data.frame(k=1:12))

# best k value by cross-validation accuracy
knn_fit$bestTune
  k
1 1
## k
## 6 6
# plot k vs prediction error
plot(x=1:12,1-knn_fit$results[,2],pch=19,
     ylab="prediction error",xlab="k")
lines(loess.smooth(x=1:12,1-knn_fit$results[,2],degree=2),
      col="#CC0000")

Try bootstrap resampling and check the prediction error

set.seed(17)
# this method controls everything about training

# we will just set up 100 bootstrap samples and for each bootstrap OOB samples to test the error
trctrl <- trainControl(method = "boot",number=20,
                       returnResamp="all")

# we will now train k-NN model
knn_fit <- train(subtype~., data = training,
                 method = "knn",
                 trControl=trctrl,
                 tuneGrid = data.frame(k=1:12))