Exploratory Data Analysis and Unsupervised Machine Learning - R

Beatriz Manso
2021-12-17

Introduction

To answer many genomics question we need to define a distance or similarity metric between, for example, patients’ expression profiles to find groups of patients that are more similar to each other in some specific circumstances than another group Clustering is a ubiquitous procedure in bioinformatics as well as any field that deals with high-dimensional data. It is very likely every genomics paper containing multiple samples have some sort of clustering.

The first required step for clustering is the distance metric (DM), which is simply a measurement of how similar gene expressions to each other are.

Methods

1. Set Working Directory and Load Libraries:

setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/3.DSB - Data Science for Bioinformatics/Practice/DSB - W11 - Exploratory Data Analysis with Unsupervised Machine Learning")

2. Get the data

df<- read_excel("Table1.xlsx")
df
# A tibble: 4 x 4
  ...1      IRX4  OCT4  PAX6
  <chr>    <dbl> <dbl> <dbl>
1 patient1    11    10     1
2 patient2    13    13     3
3 patient3     2     4    10
4 patient4     1     3     9

3. Calculate the Manhattan distance (L1) metrics using dist() function

dist(df, method = "manhattan")
          1         2         3
2  9.333333                    
3 32.000000 36.000000          
4 33.333333 37.333333  4.000000

4. Calculate the Euclidean distance (L2) metrics using dist() function

dist (df,method ="euclidean")
          1         2         3
2  4.760952                    
3 16.248077 18.293897          
4 16.852300 19.321836  2.000000

Explain the differences observed in the Manhattan and Euclidean distances

- Manhattan distance - is calculated as the sum of the absolute differences between the two vectors. The Manhattan distance is related to the L1 vector norm and the sum absolute error and mean absolute error metric.

- Euclidean distance - is the straight line distance between two points in Euclidean space

The Euclidean algorithm works as follows: for each cell, the distance to each source cell is determined by calculating the hypotenuse with x_max and y_max as the other two legs of the triangle.

The output values for the Euclidean distance raster are floating-point distance values. • Pearson coefficient of correlation -

5. Clustering

Create a data set:

y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))

#View y:
y
            t1          t2         t3          t4         t5
g1  -1.1050914 -0.03255145  0.4886729  0.05542672 -0.6811119
g2  -0.4948357  0.41822185  1.3224804  0.63875445  1.7701429
g3   0.6935929 -0.76267307  1.0035828 -0.20582995  0.6817219
g4  -0.6417948  2.54723670  0.1293014  1.90813632  0.4685572
g5  -0.6168166 -1.87090026 -1.4572187 -0.10407031  0.6962025
g6  -0.9696614  0.47608708 -0.3606395  0.23568370  0.1279677
g7  -1.8309041  1.54014027 -0.9065775 -2.12153214 -0.1946871
g8   1.1371366 -2.28515881 -0.2865203  1.09100199  0.3910011
g9  -0.5781371 -1.01783617  1.0602050 -0.59709402  0.4364838
g10 -1.2417472 -1.00305243 -0.8573725  0.48098203 -0.9777267

6. Data centring and scaling

Scale of the vectors in our expression matrix can affect the distance calculation. So we normalize with scale() so the values are in comparable scales

The function scale() centers and/or scales numeric matrices column-wise. When used with its default settings, the function returns columns that have a mean close to zero and a standard deviation of one.

For row-wise scaling the data set needs to be transposed first using the t() function.

Another transposing step is necessary to return the data set in its original orientation (same row/column assignment).

Centering and scaling are common data transformation steps for many clustering techniques.

yscaled <- t(scale(t(y))) # Centers and scales y row-wise
apply(yscaled, 1, sd)
 g1  g2  g3  g4  g5  g6  g7  g8  g9 g10 
  1   1   1   1   1   1   1   1   1   1 
dist(yscaled, method = "euclidean")
           g1        g2        g3        g4        g5        g6
g2  2.1423004                                                  
g3  3.0939624 2.4072087                                        
g4  2.0360592 2.7091729 3.9223695                              
g5  3.4636224 2.3025598 2.3299010 3.1901607                    
g6  2.0634260 2.0629333 3.7088007 0.8690975 2.8471441          
g7  2.5694345 2.4684101 3.4354028 1.9763827 3.4016650 1.8269697
g8  3.3763018 2.9806680 1.9050503 3.5668461 1.4972514 3.5193820
g9  2.3426390 1.4884626 1.2271947 3.5032928 2.5708838 3.1407352
g10 2.1196635 2.6649801 3.2855210 2.0182119 2.4256394 2.1433597
           g7        g8        g9
g2                               
g3                               
g4                               
g5                               
g6                               
g7                               
g8  3.9302463                    
g9  3.0389920 2.5546948          
g10 3.4181184 2.3028769 3.0512764
write.csv(y, "Table2.csv")
scale(t(y))
           g1         g2         g3         g4          g5         g6
t1 -1.3429111 -1.4063813  0.5575287 -1.1616452  0.05213032 -1.5167381
t2  0.3512702 -0.3588049 -1.4154543  1.2690132 -1.16429675  0.9992673
t3  1.1745948  0.6786761  0.9775103 -0.5739209 -0.76303688 -0.4568701
t4  0.4902403 -0.1057817 -0.6610303  0.7818952  0.54948023  0.5808984
t5 -0.6731942  1.1922918  0.5414457 -0.3153424  1.32572308  0.3934426
           g7         g8         g9        g10
t1 -0.7687101  0.8003801 -0.5115228 -0.7613645
t2  1.5282002 -1.6286986 -1.0240220 -0.4131914
t3 -0.1389066 -0.2101039  1.3980763 -0.2006949
t4 -0.9667337  0.7676347 -0.5336184  1.7515007
t5  0.3461502  0.2707877  0.6710868 -0.3762499
attr(,"scaled:center")
          g1           g2           g3           g4           g5 
-0.254931036  0.730952780  0.282078929  0.882287360 -0.670560666 
          g6           g7           g8           g9          g10 
-0.098112487 -0.702712125  0.009492101 -0.139275698 -0.719783364 
attr(,"scaled:scale")
       g1        g2        g3        g4        g5        g6        g7 
0.6330727 0.8715905 0.7381036 1.3120032 1.0309567 0.5746206 1.4676431 
       g8        g9       g10 
1.4088862 0.8579508 0.6855638 
yscaled <- t(scale(t(y)))
apply(yscaled, 1, sd)
 g1  g2  g3  g4  g5  g6  g7  g8  g9 g10 
  1   1   1   1   1   1   1   1   1   1 

7. Distance matrix

Computes and returns a distance matrix for the rows in the data matrix ‘y’ using the specific distance measure.

dist(y, method = "euclidean")
          g1       g2       g3       g4       g5       g6       g7
g2  2.760392                                                      
g3  2.441116 2.192385                                             
g4  3.428319 3.046873 4.244694                                    
g5  3.053992 3.832339 3.001908 5.112573                           
g6  1.298264 2.433291 2.581399 2.747781 2.696972                  
g7  3.150023 4.414999 4.445532 4.492148 4.275073 2.798673         
g8  3.595079 3.829758 2.438309 5.230992 2.477870 3.587384 5.872823
g9  1.802753 2.333229 1.377984 4.456185 2.716093 2.278727 3.834474
g10 1.743979 3.860743 3.239129 4.250474 2.155885 1.947141 3.768765
          g8       g9
g2                   
g3                   
g4                   
g5                   
g6                   
g7                   
g8                   
g9  3.035404         
g10 3.142339 2.698136

Task 2:

Repeat with Manhattan, Maximum, Canberra, and binary:

dist(y, method = "manhattan")
           g1        g2        g3        g4        g5        g6
g2   4.929419                                                  
g3   4.667806  4.621226                                        
g4   6.404835  6.040121  7.846710                              
g5   5.809327  7.007567  4.995678  8.269487                    
g6   2.482717  4.261057  5.261505  4.901999  4.704400          
g7   6.357138  9.412161  9.529582  7.924997  8.084121  5.151105
g8   7.377717  7.775743  4.843685  7.921839  4.839184  6.060515
g9   3.853888  4.351142  2.220017  7.096938  4.161910  4.447586
g10  3.175372  7.253681  6.382935  8.010354  4.351606  3.598951
           g7        g8        g9
g2                               
g3                               
g4                               
g5                               
g6                               
g7                               
g8  11.211619                    
g9   7.933135  6.062900          
g10  6.567108  6.210590  5.088258
dist(y, method = "maximum")
           g1        g2        g3        g4        g5        g6
g2  2.4512549                                                  
g3  1.7986843 1.1884286                                        
g4  2.5797881 2.1290149 3.3099098                              
g5  1.9458916 2.7796991 2.4608015 4.4181370                    
g6  0.8493124 1.6831199 1.6632543 2.0711496 2.3469873          
g7  2.1769589 2.7602866 2.5244970 4.0296685 3.4110405 2.3572158
g8  2.2526074 2.7033807 1.5224857 4.8323955 1.7539531 2.7612459
g9  1.1175957 1.4360580 1.2717300 3.5650729 2.5174237 1.4939232
g10 1.3460454 2.7478696 1.9353401 3.5502891 1.6739292 1.4791395
           g7        g8        g9
g2                               
g3                               
g4                               
g5                               
g6                               
g7                               
g8  3.8252991                    
g9  2.5579764 1.7152736          
g10 2.6025142 2.3788838 1.9175775
dist(y, method = "canberra")
          g1       g2       g3       g4       g5       g6       g7
g2  3.682111                                                      
g3  4.263187 3.581013                                             
g4  3.790289 2.748911 3.957046                                    
g5  4.249364 3.545167 2.759679 3.215290                           
g6  3.684480 2.715040 4.683909 3.239600 3.515097                  
g7  3.802619 4.147356 4.823122 3.727287 3.635434 3.266086         
g8  4.875216 3.899605 3.012820 3.362684 3.051770 3.266071 4.519704
g9  3.620079 2.792100 1.877358 2.870217 2.260121 3.799532 4.080766
g10 2.967515 3.571006 4.136136 3.915880 2.897373 2.873137 2.887524
          g8       g9
g2                   
g3                   
g4                   
g5                   
g6                   
g7                   
g8                   
g9  3.438654         
g10 3.277011 3.371960
dist(y, method = "binary")
    g1 g2 g3 g4 g5 g6 g7 g8 g9
g2   0                        
g3   0  0                     
g4   0  0  0                  
g5   0  0  0  0               
g6   0  0  0  0  0            
g7   0  0  0  0  0  0         
g8   0  0  0  0  0  0  0      
g9   0  0  0  0  0  0  0  0   
g10  0  0  0  0  0  0  0  0  0

Explain why the values from the different distance matrices differ.

The values for the different distance matrices differ between them because the calculate diferent measures:

- Manhattan - calculates the absolute distance between the two vectors

- maximum - calculates the maximum distance between two components of xx and yy

- canberra - is the numerical measure of the distance between pairs of points in a vector space

- binary - calculates the proportion of bits (vectors) in which only one is on amongst those in which at least one is on.

Using the distance matrix obtained from Euclidean distance calculate the correlation-base d distances.

Compute first a correlation matrix with the cor() function. This step will iterate a cross matrix columns, so transpose ‘t()’ step to obtain row distances.

Use the distance matrix object of class “dist” with the as.dist() function.

c <- cor(t(y), method="spearman")
d <- as.dist(1-c)
d
              g1           g2           g3           g4           g5
g2  6.000000e-01                                                    
g3  9.000000e-01 8.000000e-01                                       
g4  7.000000e-01 9.000000e-01 1.900000e+00                          
g5  1.300000e+00 5.000000e-01 9.000000e-01 1.200000e+00             
g6  7.000000e-01 9.000000e-01 1.900000e+00 2.220446e-16 1.200000e+00
g7  1.100000e+00 8.000000e-01 1.300000e+00 6.000000e-01 1.400000e+00
g8  1.500000e+00 1.300000e+00 7.000000e-01 1.600000e+00 4.000000e-01
g9  8.000000e-01 4.000000e-01 1.000000e-01 1.700000e+00 7.000000e-01
g10 2.000000e-01 4.000000e-01 1.000000e+00 7.000000e-01 7.000000e-01
              g6           g7           g8           g9
g2                                                     
g3                                                     
g4                                                     
g5                                                     
g6                                                     
g7  6.000000e-01                                       
g8  1.600000e+00 1.800000e+00                          
g9  1.700000e+00 1.100000e+00 9.000000e-01             
g10 7.000000e-01 1.400000e+00 1.100000e+00 8.000000e-01

10. Cluster with hclust for Hierarchical Clustering (HC)

Generate the HC tree:

hr <- hclust(d, method = "complete", members=NULL)
names(hr)
[1] "merge"       "height"      "order"       "labels"     
[5] "method"      "call"        "dist.method"

Plot the generated tree with the plot() function.

When the hang argument is set to ‘-1’ then all leaves end on one line and their labels hang down from 0.

par(mfrow = c(1, 2))
plot(hr, hang = 0.1)
plot(hr, hang = -1)

Plot the hclust tree horizontally by transforming into a dendrogram object.

plot(as.dendrogram(hr), edgePar=list(col=3, lwd=4), horiz=T)

unclass(hr) # Prints the full content of the hclust object.
$merge
      [,1] [,2]
 [1,]   -4   -6
 [2,]   -3   -9
 [3,]   -1  -10
 [4,]   -5   -8
 [5,]   -2    3
 [6,]   -7    1
 [7,]    2    4
 [8,]    5    6
 [9,]    7    8

$height
[1] 2.220446e-16 1.000000e-01 2.000000e-01 4.000000e-01 6.000000e-01
[6] 6.000000e-01 9.000000e-01 1.400000e+00 1.900000e+00

$order
 [1]  3  9  5  8  2  1 10  7  4  6

$labels
 [1] "g1"  "g2"  "g3"  "g4"  "g5"  "g6"  "g7"  "g8"  "g9"  "g10"

$method
[1] "complete"

$call
hclust(d = d, method = "complete", members = NULL)

$dist.method
NULL
str(as.dendrogram(hr)) # Prints dendrogram structure as text.
--[dendrogram w/ 2 branches and 10 members at h = 1.9]
  |--[dendrogram w/ 2 branches and 4 members at h = 0.9]
  |  |--[dendrogram w/ 2 branches and 2 members at h = 0.1]
  |  |  |--leaf "g3" 
  |  |  `--leaf "g9" 
  |  `--[dendrogram w/ 2 branches and 2 members at h = 0.4]
  |     |--leaf "g5" 
  |     `--leaf "g8" 
  `--[dendrogram w/ 2 branches and 6 members at h = 1.4]
     |--[dendrogram w/ 2 branches and 3 members at h = 0.6]
     |  |--leaf "g2" 
     |  `--[dendrogram w/ 2 branches and 2 members at h = 0.2]
     |     |--leaf "g1" 
     |     `--leaf "g10" 
     `--[dendrogram w/ 2 branches and 3 members at h = 0.6]
        |--leaf "g7" 
        `--[dendrogram w/ 2 branches and 2 members at h = 2.22e-16]
           |--leaf "g4" 
           `--leaf "g6" 
hr$labels[hr$order] # Prints the row labels in the order they appear in the tree.
 [1] "g3"  "g9"  "g5"  "g8"  "g2"  "g1"  "g10" "g7"  "g4"  "g6" 
par(mfrow=c(1,2))
hrd1 <- as.dendrogram(hr)
plot(hrd1)
hrd2 <- reorder(hrd1, sample(1:10))
plot(hrd2)
labels(hrd1)
 [1] "g3"  "g9"  "g5"  "g8"  "g2"  "g1"  "g10" "g7"  "g4"  "g6" 
labels(hrd2) # Example to reorder a dendrogram and print out its labels.
 [1] "g3"  "g9"  "g8"  "g5"  "g4"  "g6"  "g7"  "g2"  "g10" "g1" 

Install R Package ‘ctc’

The ‘hc2Newick’ function of the Bioc ‘ctc’ library can convert a hclust object into the Newick tree file format for export into external programs.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ctc")
[1] "(((g3:0.0500000000000001,g9:0.0500000000000001):0,(g5:0.2,g8:0.2):0):0,((g2:0.2,(g1:0.1,g10:0.1):0.2):0,(g7:0.3,(g4:1.11022302462516e-16,g6:1.11022302462516e-16):0.3):0):0);"