To answer many genomics question we need to define a distance or similarity metric between, for example, patients’ expression profiles to find groups of patients that are more similar to each other in some specific circumstances than another group Clustering is a ubiquitous procedure in bioinformatics as well as any field that deals with high-dimensional data. It is very likely every genomics paper containing multiple samples have some sort of clustering.
The first required step for clustering is the distance metric (DM), which is simply a measurement of how similar gene expressions to each other are.
setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/3.DSB - Data Science for Bioinformatics/Practice/DSB - W11 - Exploratory Data Analysis with Unsupervised Machine Learning")
df<- read_excel("Table1.xlsx")
df
# A tibble: 4 x 4
...1 IRX4 OCT4 PAX6
<chr> <dbl> <dbl> <dbl>
1 patient1 11 10 1
2 patient2 13 13 3
3 patient3 2 4 10
4 patient4 1 3 9
dist(df, method = "manhattan")
1 2 3
2 9.333333
3 32.000000 36.000000
4 33.333333 37.333333 4.000000
dist (df,method ="euclidean")
1 2 3
2 4.760952
3 16.248077 18.293897
4 16.852300 19.321836 2.000000
- Manhattan distance - is calculated as the sum of the absolute differences between the two vectors. The Manhattan distance is related to the L1 vector norm and the sum absolute error and mean absolute error metric.
- Euclidean distance - is the straight line distance between two points in Euclidean space
The Euclidean algorithm works as follows: for each cell, the distance to each source cell is determined by calculating the hypotenuse with x_max and y_max as the other two legs of the triangle.
The output values for the Euclidean distance raster are floating-point distance values. • Pearson coefficient of correlation -
Create a data set:
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
#View y:
y
t1 t2 t3 t4 t5
g1 -1.1050914 -0.03255145 0.4886729 0.05542672 -0.6811119
g2 -0.4948357 0.41822185 1.3224804 0.63875445 1.7701429
g3 0.6935929 -0.76267307 1.0035828 -0.20582995 0.6817219
g4 -0.6417948 2.54723670 0.1293014 1.90813632 0.4685572
g5 -0.6168166 -1.87090026 -1.4572187 -0.10407031 0.6962025
g6 -0.9696614 0.47608708 -0.3606395 0.23568370 0.1279677
g7 -1.8309041 1.54014027 -0.9065775 -2.12153214 -0.1946871
g8 1.1371366 -2.28515881 -0.2865203 1.09100199 0.3910011
g9 -0.5781371 -1.01783617 1.0602050 -0.59709402 0.4364838
g10 -1.2417472 -1.00305243 -0.8573725 0.48098203 -0.9777267
Scale of the vectors in our expression matrix can affect the distance calculation. So we normalize with scale() so the values are in comparable scales
The function scale() centers and/or scales numeric matrices column-wise. When used with its default settings, the function returns columns that have a mean close to zero and a standard deviation of one.
For row-wise scaling the data set needs to be transposed first using the t() function.
Another transposing step is necessary to return the data set in its original orientation (same row/column assignment).
Centering and scaling are common data transformation steps for many clustering techniques.
g1 g2 g3 g4 g5 g6 g7 g8 g9 g10
1 1 1 1 1 1 1 1 1 1
dist(yscaled, method = "euclidean")
g1 g2 g3 g4 g5 g6
g2 2.1423004
g3 3.0939624 2.4072087
g4 2.0360592 2.7091729 3.9223695
g5 3.4636224 2.3025598 2.3299010 3.1901607
g6 2.0634260 2.0629333 3.7088007 0.8690975 2.8471441
g7 2.5694345 2.4684101 3.4354028 1.9763827 3.4016650 1.8269697
g8 3.3763018 2.9806680 1.9050503 3.5668461 1.4972514 3.5193820
g9 2.3426390 1.4884626 1.2271947 3.5032928 2.5708838 3.1407352
g10 2.1196635 2.6649801 3.2855210 2.0182119 2.4256394 2.1433597
g7 g8 g9
g2
g3
g4
g5
g6
g7
g8 3.9302463
g9 3.0389920 2.5546948
g10 3.4181184 2.3028769 3.0512764
write.csv(y, "Table2.csv")
g1 g2 g3 g4 g5 g6
t1 -1.3429111 -1.4063813 0.5575287 -1.1616452 0.05213032 -1.5167381
t2 0.3512702 -0.3588049 -1.4154543 1.2690132 -1.16429675 0.9992673
t3 1.1745948 0.6786761 0.9775103 -0.5739209 -0.76303688 -0.4568701
t4 0.4902403 -0.1057817 -0.6610303 0.7818952 0.54948023 0.5808984
t5 -0.6731942 1.1922918 0.5414457 -0.3153424 1.32572308 0.3934426
g7 g8 g9 g10
t1 -0.7687101 0.8003801 -0.5115228 -0.7613645
t2 1.5282002 -1.6286986 -1.0240220 -0.4131914
t3 -0.1389066 -0.2101039 1.3980763 -0.2006949
t4 -0.9667337 0.7676347 -0.5336184 1.7515007
t5 0.3461502 0.2707877 0.6710868 -0.3762499
attr(,"scaled:center")
g1 g2 g3 g4 g5
-0.254931036 0.730952780 0.282078929 0.882287360 -0.670560666
g6 g7 g8 g9 g10
-0.098112487 -0.702712125 0.009492101 -0.139275698 -0.719783364
attr(,"scaled:scale")
g1 g2 g3 g4 g5 g6 g7
0.6330727 0.8715905 0.7381036 1.3120032 1.0309567 0.5746206 1.4676431
g8 g9 g10
1.4088862 0.8579508 0.6855638
g1 g2 g3 g4 g5 g6 g7 g8 g9 g10
1 1 1 1 1 1 1 1 1 1
Computes and returns a distance matrix for the rows in the data matrix ‘y’ using the specific distance measure.
dist(y, method = "euclidean")
g1 g2 g3 g4 g5 g6 g7
g2 2.760392
g3 2.441116 2.192385
g4 3.428319 3.046873 4.244694
g5 3.053992 3.832339 3.001908 5.112573
g6 1.298264 2.433291 2.581399 2.747781 2.696972
g7 3.150023 4.414999 4.445532 4.492148 4.275073 2.798673
g8 3.595079 3.829758 2.438309 5.230992 2.477870 3.587384 5.872823
g9 1.802753 2.333229 1.377984 4.456185 2.716093 2.278727 3.834474
g10 1.743979 3.860743 3.239129 4.250474 2.155885 1.947141 3.768765
g8 g9
g2
g3
g4
g5
g6
g7
g8
g9 3.035404
g10 3.142339 2.698136
Task 2:
Repeat with Manhattan, Maximum, Canberra, and binary:
dist(y, method = "manhattan")
g1 g2 g3 g4 g5 g6
g2 4.929419
g3 4.667806 4.621226
g4 6.404835 6.040121 7.846710
g5 5.809327 7.007567 4.995678 8.269487
g6 2.482717 4.261057 5.261505 4.901999 4.704400
g7 6.357138 9.412161 9.529582 7.924997 8.084121 5.151105
g8 7.377717 7.775743 4.843685 7.921839 4.839184 6.060515
g9 3.853888 4.351142 2.220017 7.096938 4.161910 4.447586
g10 3.175372 7.253681 6.382935 8.010354 4.351606 3.598951
g7 g8 g9
g2
g3
g4
g5
g6
g7
g8 11.211619
g9 7.933135 6.062900
g10 6.567108 6.210590 5.088258
dist(y, method = "maximum")
g1 g2 g3 g4 g5 g6
g2 2.4512549
g3 1.7986843 1.1884286
g4 2.5797881 2.1290149 3.3099098
g5 1.9458916 2.7796991 2.4608015 4.4181370
g6 0.8493124 1.6831199 1.6632543 2.0711496 2.3469873
g7 2.1769589 2.7602866 2.5244970 4.0296685 3.4110405 2.3572158
g8 2.2526074 2.7033807 1.5224857 4.8323955 1.7539531 2.7612459
g9 1.1175957 1.4360580 1.2717300 3.5650729 2.5174237 1.4939232
g10 1.3460454 2.7478696 1.9353401 3.5502891 1.6739292 1.4791395
g7 g8 g9
g2
g3
g4
g5
g6
g7
g8 3.8252991
g9 2.5579764 1.7152736
g10 2.6025142 2.3788838 1.9175775
dist(y, method = "canberra")
g1 g2 g3 g4 g5 g6 g7
g2 3.682111
g3 4.263187 3.581013
g4 3.790289 2.748911 3.957046
g5 4.249364 3.545167 2.759679 3.215290
g6 3.684480 2.715040 4.683909 3.239600 3.515097
g7 3.802619 4.147356 4.823122 3.727287 3.635434 3.266086
g8 4.875216 3.899605 3.012820 3.362684 3.051770 3.266071 4.519704
g9 3.620079 2.792100 1.877358 2.870217 2.260121 3.799532 4.080766
g10 2.967515 3.571006 4.136136 3.915880 2.897373 2.873137 2.887524
g8 g9
g2
g3
g4
g5
g6
g7
g8
g9 3.438654
g10 3.277011 3.371960
dist(y, method = "binary")
g1 g2 g3 g4 g5 g6 g7 g8 g9
g2 0
g3 0 0
g4 0 0 0
g5 0 0 0 0
g6 0 0 0 0 0
g7 0 0 0 0 0 0
g8 0 0 0 0 0 0 0
g9 0 0 0 0 0 0 0 0
g10 0 0 0 0 0 0 0 0 0
The values for the different distance matrices differ between them because the calculate diferent measures:
- Manhattan - calculates the absolute distance between the two vectors
- maximum - calculates the maximum distance between two components of xx and yy
- canberra - is the numerical measure of the distance between pairs of points in a vector space
- binary - calculates the proportion of bits (vectors) in which only one is on amongst those in which at least one is on.
Compute first a correlation matrix with the cor() function. This step will iterate a cross matrix columns, so transpose ‘t()’ step to obtain row distances.
Use the distance matrix object of class “dist” with the as.dist() function.
g1 g2 g3 g4 g5
g2 6.000000e-01
g3 9.000000e-01 8.000000e-01
g4 7.000000e-01 9.000000e-01 1.900000e+00
g5 1.300000e+00 5.000000e-01 9.000000e-01 1.200000e+00
g6 7.000000e-01 9.000000e-01 1.900000e+00 2.220446e-16 1.200000e+00
g7 1.100000e+00 8.000000e-01 1.300000e+00 6.000000e-01 1.400000e+00
g8 1.500000e+00 1.300000e+00 7.000000e-01 1.600000e+00 4.000000e-01
g9 8.000000e-01 4.000000e-01 1.000000e-01 1.700000e+00 7.000000e-01
g10 2.000000e-01 4.000000e-01 1.000000e+00 7.000000e-01 7.000000e-01
g6 g7 g8 g9
g2
g3
g4
g5
g6
g7 6.000000e-01
g8 1.600000e+00 1.800000e+00
g9 1.700000e+00 1.100000e+00 9.000000e-01
g10 7.000000e-01 1.400000e+00 1.100000e+00 8.000000e-01
[1] "merge" "height" "order" "labels"
[5] "method" "call" "dist.method"
When the hang argument is set to ‘-1’ then all leaves end on one line and their labels hang down from 0.
plot(as.dendrogram(hr), edgePar=list(col=3, lwd=4), horiz=T)
unclass(hr) # Prints the full content of the hclust object.
$merge
[,1] [,2]
[1,] -4 -6
[2,] -3 -9
[3,] -1 -10
[4,] -5 -8
[5,] -2 3
[6,] -7 1
[7,] 2 4
[8,] 5 6
[9,] 7 8
$height
[1] 2.220446e-16 1.000000e-01 2.000000e-01 4.000000e-01 6.000000e-01
[6] 6.000000e-01 9.000000e-01 1.400000e+00 1.900000e+00
$order
[1] 3 9 5 8 2 1 10 7 4 6
$labels
[1] "g1" "g2" "g3" "g4" "g5" "g6" "g7" "g8" "g9" "g10"
$method
[1] "complete"
$call
hclust(d = d, method = "complete", members = NULL)
$dist.method
NULL
str(as.dendrogram(hr)) # Prints dendrogram structure as text.
--[dendrogram w/ 2 branches and 10 members at h = 1.9]
|--[dendrogram w/ 2 branches and 4 members at h = 0.9]
| |--[dendrogram w/ 2 branches and 2 members at h = 0.1]
| | |--leaf "g3"
| | `--leaf "g9"
| `--[dendrogram w/ 2 branches and 2 members at h = 0.4]
| |--leaf "g5"
| `--leaf "g8"
`--[dendrogram w/ 2 branches and 6 members at h = 1.4]
|--[dendrogram w/ 2 branches and 3 members at h = 0.6]
| |--leaf "g2"
| `--[dendrogram w/ 2 branches and 2 members at h = 0.2]
| |--leaf "g1"
| `--leaf "g10"
`--[dendrogram w/ 2 branches and 3 members at h = 0.6]
|--leaf "g7"
`--[dendrogram w/ 2 branches and 2 members at h = 2.22e-16]
|--leaf "g4"
`--leaf "g6"
hr$labels[hr$order] # Prints the row labels in the order they appear in the tree.
[1] "g3" "g9" "g5" "g8" "g2" "g1" "g10" "g7" "g4" "g6"
par(mfrow=c(1,2))
hrd1 <- as.dendrogram(hr)
plot(hrd1)
hrd2 <- reorder(hrd1, sample(1:10))
plot(hrd2)
labels(hrd1)
[1] "g3" "g9" "g5" "g8" "g2" "g1" "g10" "g7" "g4" "g6"
labels(hrd2) # Example to reorder a dendrogram and print out its labels.
[1] "g3" "g9" "g8" "g5" "g4" "g6" "g7" "g2" "g10" "g1"
The ‘hc2Newick’ function of the Bioc ‘ctc’ library can convert a hclust object into the Newick tree file format for export into external programs.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ctc")
[1] "(((g3:0.0500000000000001,g9:0.0500000000000001):0,(g5:0.2,g8:0.2):0):0,((g2:0.2,(g1:0.1,g10:0.1):0.2):0,(g7:0.3,(g4:1.11022302462516e-16,g6:1.11022302462516e-16):0.3):0):0);"