Unsupervised Machine Learning in Python

Beatriz Manso

Introduction

Machine learning (ML) is a branch of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values. ML is used in many fields, including bioinformatics, where it is applied to genomics, proteomics, microarrays, ...

1. Calculating Distance Matrices

Load Table 1 into Spyder as a data frame using Pandas and view the contents:
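A minimal sketch of this step. The file name and contents of Table 1 are not given here, so a tiny embedded CSV with made-up column names and values stands in for it; with a real file, the path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for Table 1 (made-up names and values, illustration only).
table1_csv = io.StringIO(
    "Patient,Gene1,Gene2,Gene3\n"
    "P1,2,4,6\n"
    "P2,3,5,7\n"
    "P3,10,1,4\n"
)

# With a real file this would be something like pd.read_csv("table1.csv")
# (the file name is an assumption).
df = pd.read_csv(table1_csv)

# View the contents; in Spyder the Variable Explorer shows the same table.
print(df)
```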

Remove the first column with patient names from the data frame, so that it does not get included in the array:
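One possible way to do this is to select every column except the first by position, which avoids depending on the column's name. The data frame below is a made-up stand-in for Table 1.

```python
import pandas as pd

# Hypothetical stand-in for Table 1 (illustrative values only).
df = pd.DataFrame({
    "Patient": ["P1", "P2", "P3"],
    "Gene1": [2, 3, 10],
    "Gene2": [4, 5, 1],
    "Gene3": [6, 7, 4],
})

# Keep all rows and every column from position 1 onwards,
# which drops the first (patient name) column.
df_values = df.iloc[:, 1:]
print(df_values)
```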

Convert Table 1 from a data frame to a NumPy array of integers; this also removes the header row:
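A short sketch using `DataFrame.to_numpy`, assuming the remaining columns hold whole numbers. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric part of Table 1 (patient names already removed).
df_values = pd.DataFrame({
    "Gene1": [2, 3, 10],
    "Gene2": [4, 5, 1],
    "Gene3": [6, 7, 4],
})

# to_numpy returns only the cell values, so the column headers are left behind.
data = df_values.to_numpy(dtype=int)
print(data)
print(data.dtype)
```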

Using the NumPy array, calculate the Euclidean distance matrix:
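The library used for this step is not stated; the sketch below assumes SciPy's `pdist` and `squareform`, applied to a small made-up array in which each row is one patient.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 3 patients (rows) x 2 measurements (columns).
data = np.array([
    [0, 0],
    [3, 4],
    [6, 8],
])

# pdist returns the pairwise row distances in condensed (1-D) form;
# squareform expands them into a symmetric n x n matrix.
euclid = squareform(pdist(data, metric="euclidean"))
print(euclid)
```

Entry `[i, j]` of the result is the straight-line distance between rows `i` and `j`, so the diagonal is zero.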

Optionally, convert the array to a data frame for easier reading:
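A brief sketch of the conversion. The row and column labels are hypothetical patient identifiers, and the distance values are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder distance matrix (illustrative values only).
euclid = np.array([
    [0.0, 5.0, 10.0],
    [5.0, 0.0, 5.0],
    [10.0, 5.0, 0.0],
])

# Hypothetical labels; using the same labels for rows and columns
# makes each pairwise distance easy to look up.
labels = ["P1", "P2", "P3"]
euclid_df = pd.DataFrame(euclid, index=labels, columns=labels)
print(euclid_df)
```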

Calculate the Manhattan (city-block) distance matrix:
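Assuming SciPy is used, the Manhattan distance corresponds to the `"cityblock"` metric of `pdist`. The array below is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 3 patients (rows) x 2 measurements (columns).
data = np.array([
    [0, 0],
    [3, 4],
    [6, 8],
])

# City-block distance is the sum of absolute coordinate differences.
manhattan = squareform(pdist(data, metric="cityblock"))
print(manhattan)
```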

2. Clustering

Load in Table 2 and view it:
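A sketch of loading Table 2, whose file name and layout are not given here. It assumes the first column holds sample labels and reads it as the index (`index_col=0`), so that only numeric columns remain; the embedded CSV is a made-up stand-in.

```python
import io
import pandas as pd

# Hypothetical stand-in for Table 2 (made-up names and values).
table2_csv = io.StringIO(
    "Sample,FeatureA,FeatureB\n"
    "S1,1.0,10.0\n"
    "S2,2.0,20.0\n"
    "S3,3.0,30.0\n"
)

# index_col=0 is an assumption about the table layout.
df2 = pd.read_csv(table2_csv, index_col=0)

print(df2.head())   # first rows
print(df2.shape)    # (number of samples, number of features)
```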

Scale data (normalisation):
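The scaling method is not specified; the sketch below assumes z-score standardisation (each column shifted to mean 0 and scaled to standard deviation 1), written directly in NumPy. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` does by default. The data frame is a made-up stand-in for Table 2.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Table 2 (illustrative values only).
df2 = pd.DataFrame(
    {"FeatureA": [1.0, 2.0, 3.0], "FeatureB": [10.0, 20.0, 30.0]},
    index=["S1", "S2", "S3"],
)

values = df2.to_numpy(dtype=float)

# Column-wise z-score: subtract each column's mean, divide by its
# (population, ddof=0) standard deviation.
scaled = (values - values.mean(axis=0)) / values.std(axis=0)
print(scaled)
```

After scaling, both columns are on the same footing, so a feature measured in large units no longer dominates the distances.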

Calculate Euclidean distance:
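A sketch assuming z-score scaling and SciPy's `pdist`. The raw values are made up; the condensed output is kept as well because SciPy's clustering functions accept it directly.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical raw data: 3 samples (rows) x 2 features (columns).
raw = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
])
# Assumed scaling: column-wise z-score.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Condensed form: one value per pair of samples.
condensed = pdist(scaled, metric="euclidean")

# Square form: full symmetric matrix for inspection.
dist = squareform(condensed)
print(dist)
```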

(Optionally) Convert the distance matrix to a data frame for easier reading:
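A brief sketch with placeholder distances and hypothetical sample labels. Rounding is added only to make the printed table easier to scan.

```python
import numpy as np
import pandas as pd

# Placeholder distance matrix between 3 samples (illustrative values).
dist = np.array([
    [0.0, 1.7320508, 3.4641016],
    [1.7320508, 0.0, 1.7320508],
    [3.4641016, 1.7320508, 0.0],
])

# Hypothetical sample labels used for both rows and columns.
samples = ["S1", "S2", "S3"]
dist_df = pd.DataFrame(dist, index=samples, columns=samples).round(2)
print(dist_df)
```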

Calculate Spearman's rank correlation:
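A sketch using `scipy.stats.spearmanr`. It assumes the correlation is wanted between samples (rows), hence `axis=1`; with the default `axis=0` the columns would be compared instead. The array is a made-up stand-in for the scaled data.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical stand-in for the scaled data: 3 samples (rows) x 4 features.
scaled = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 3.0, 2.0, 4.0],
    [3.0, 4.0, 2.0, 1.0],
])

# axis=1 treats each row as one variable, so rho is a 3 x 3 matrix of
# rank correlations between samples; pvalue has the same shape.
rho, pvalue = spearmanr(scaled, axis=1)
print(rho)
```

Values near 1 mean two samples rank their features in almost the same order, and values near -1 mean the order is nearly reversed.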

(Optionally) Convert the Spearman’s rank correlation matrix to a data frame for easier reading:
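A brief sketch with a placeholder correlation matrix and hypothetical sample labels.

```python
import numpy as np
import pandas as pd

# Placeholder Spearman correlation matrix between 3 samples.
rho = np.array([
    [1.0, 0.8, -0.8],
    [0.8, 1.0, -0.4],
    [-0.8, -0.4, 1.0],
])

# Hypothetical sample labels used for both rows and columns.
samples = ["S1", "S2", "S3"]
rho_df = pd.DataFrame(rho, index=samples, columns=samples)
print(rho_df)
```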

Perform hierarchical clustering of scaled data using Euclidean distance:
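A sketch with SciPy's `linkage` and `fcluster`. The linkage method is not stated, so average linkage is an assumption, as are the z-score scaling, the choice of two clusters and the made-up data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical raw data: 4 samples (rows) x 2 features, two obvious groups.
raw = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 9.0],
    [8.5, 9.5],
])
# Assumed scaling: column-wise z-score.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Agglomerative clustering on Euclidean distances between rows.
# method="average" is an assumption; "single", "complete" and "ward"
# are other options that linkage accepts.
Z = linkage(scaled, method="average", metric="euclidean")

# Cut the tree into at most 2 flat clusters (2 is an illustrative choice).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree with matplotlib.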

Hierarchical clustering of scaled data using Spearman’s rank correlation:
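A correlation is a similarity, not a distance, so it has to be converted before clustering. The sketch below assumes the common choice of `1 - rho` as the distance, average linkage, and correlation between samples (rows); none of these choices is stated in the text, and the data are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Hypothetical stand-in for the scaled data: 4 samples (rows) x 5 features.
scaled = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 3.0, 2.0, 4.0, 5.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [5.0, 4.0, 3.0, 1.0, 2.0],
])

# Spearman correlation between samples (axis=1: each row is a variable).
rho, _ = spearmanr(scaled, axis=1)

# Assumed conversion to a distance: 0 for identical ranking, 2 for reversed.
# Clipping and zeroing the diagonal guard against floating-point noise.
dist = np.clip(1.0 - rho, 0.0, 2.0)
np.fill_diagonal(dist, 0.0)

# linkage expects a condensed distance vector when given precomputed distances.
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="average")

# Cut the tree into at most 2 flat clusters (illustrative choice).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With this distance, samples whose features rise and fall in the same order are grouped together regardless of their absolute values.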