Supervised Machine Learning in Python

Beatriz Manso

Set working directory:

Load Libraries:

1. Load the Data

DataSet Description

14 features (columns), including the target:
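A minimal sketch of loading the data and inspecting its shape and column names. The column list below is an assumption, chosen to match the feature names mentioned later in this report (cp, thalach, slope, sex, exang, ca, thal, oldpeak, target). The two rows are illustrative values only, and an in-memory string stands in for the real CSV file.

```python
import io
import pandas as pd

# Illustrative two-row excerpt standing in for the real CSV file.
# The column names are an assumption, not taken from the original notebook.
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
"""

# With a real file this would be pd.read_csv(<path to the data file>)
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names, with the target last
```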

2. Check duplicate rows in data

There are now 0 duplicate rows in the data. Next, we check for null values.
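The duplicate check can be sketched as follows. This is a minimal example on a toy DataFrame with made-up values, not the original notebook code: it counts duplicated rows, drops them, and confirms that none remain.

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values only)
df = pd.DataFrame({
    "age": [63, 37, 41, 37],
    "chol": [233, 250, 204, 250],
    "target": [1, 1, 1, 1],
})

# Count rows that are identical to an earlier row
n_dupes = int(df.duplicated().sum())

# Drop the repeats, keeping the first occurrence of each row
df_clean = df.drop_duplicates().reset_index(drop=True)
remaining = int(df_clean.duplicated().sum())

print(f"duplicates before: {n_dupes}, after: {remaining}")
```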

3. Checking the Null Values

There are no null values in this dataset.

All features are numeric, so we do not need to create dummy variables or apply one-hot encoding before fitting any algorithm.
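Both checks (missing values and data types) can be sketched like this, assuming the data sits in a pandas DataFrame. The toy frame and its values are illustrative only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame standing in for the real data (illustrative values only)
df = pd.DataFrame({
    "age": [63, 37, 41],
    "oldpeak": [2.3, 3.5, 1.4],
    "target": [1, 1, 0],
})

# Number of missing values per column
null_counts = df.isnull().sum()

# True only if every column has a numeric dtype
all_numeric = all(is_numeric_dtype(df[col]) for col in df.columns)

print(null_counts)
print(df.dtypes)
print("all numeric:", all_numeric)
```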

4. Detecting Outliers

a. Detecting Outliers using IQR (InterQuartile Range)

Outliers can be removed using two methods:

  1. Inter-Quartile Range (IQR)

  2. Z-score

b. Removing outliers using Inter-Quartile Range

After removing outliers using IQR, the data contains 227 records.
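The two outlier rules named above can be sketched as below. The helper names, the multiplier k = 1.5 and the z-score threshold of 3 are common defaults assumed here for illustration; the original notebook may have used different settings or columns.

```python
import pandas as pd

def remove_outliers_iqr(df, columns, k=1.5):
    """Keep rows lying within [Q1 - k*IQR, Q3 + k*IQR] in every listed column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    within = ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)
    return df[within]

def remove_outliers_zscore(df, columns, threshold=3.0):
    """Keep rows whose absolute z-score is at most `threshold` in every listed column."""
    z = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    # Constant columns give NaN z-scores (0/0); treat them as 0
    z = z.fillna(0.0)
    return df[(z.abs() <= threshold).all(axis=1)]

# Toy data with one extreme cholesterol value (illustrative only)
df = pd.DataFrame({
    "chol": [200, 210, 220, 230, 240, 250, 600],
    "thalach": [150, 155, 160, 165, 170, 175, 180],
})
filtered = remove_outliers_iqr(df, ["chol", "thalach"])

# Second toy column for the z-score rule
z_df = pd.DataFrame({"chol": [10.0] * 20 + [100.0]})
z_filtered = remove_outliers_zscore(z_df, ["chol"])

print(len(df), "->", len(filtered))
print(len(z_df), "->", len(z_filtered))
```

With very few rows the z-score rule rarely flags anything, because a single extreme value also inflates the standard deviation; the IQR rule is less affected by this.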

5. Data visualisation

The target is positively correlated with cp, thalach and slope, and negatively correlated with sex, exang, ca, thal and oldpeak.
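Correlations of this kind can be read from the `target` column of the correlation matrix. A minimal sketch on toy data follows (values are made up so that thalach rises and oldpeak falls with the target); Pearson correlation, the pandas default, is assumed.

```python
import pandas as pd

# Toy data: thalach increases with target, oldpeak decreases (illustrative only)
df = pd.DataFrame({
    "thalach": [120, 130, 140, 150, 160, 170],
    "oldpeak": [3.0, 2.5, 2.0, 1.0, 0.5, 0.0],
    "target": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of each feature with the target, strongest positive first
corr_with_target = (
    df.corr()["target"]
    .drop("target")
    .sort_values(ascending=False)
)
print(corr_with_target)
```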

To visualize the relationships between features and identify any linear relation between them, we use pair plots.

Box-and-whisker plots:

Box-and-whisker plots are useful for finding outliers in our data. If outliers remain, we have to remove or correct them; otherwise they act as noise in the training data.

Visualize the features and their relation with the target (Heart Disease or No Heart Disease)

There are 154 males and 74 females in our data.

Chest pain type

Chest pain type takes 4 values, ranging from 0 to 3.

Cross tables:
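Counts per category and cross tables against the target can be produced with `value_counts` and `pd.crosstab`. The sketch below uses a toy DataFrame with made-up values; the pairing of cp with target is chosen for illustration.

```python
import pandas as pd

# Toy data (illustrative values only)
df = pd.DataFrame({
    "sex": [1, 1, 0, 1, 0, 0],
    "cp": [0, 2, 1, 3, 2, 0],
    "target": [0, 1, 1, 1, 1, 0],
})

# Number of rows per value of sex
sex_counts = df["sex"].value_counts()

# Rows: chest pain type, columns: target class, cells: number of records
cp_by_target = pd.crosstab(df["cp"], df["target"])

print(sex_counts)
print(cp_by_target)
```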

6. Preparing the data for model

Comparison Between Unscaled and Scaled DataFrame

BEFORE SCALING DATA:

a. Scaling the data:

AFTER SCALING DATA:
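A minimal sketch of the scaling step. The report does not say which scaler was used; `StandardScaler` (zero mean, unit variance per feature) is assumed here, and the toy values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative values only)
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

# Scale the features only; the target stays as it is
features = df.drop(columns="target")
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print("Before scaling:\n", features.describe().loc[["mean", "std"]])
print("After scaling:\n", scaled.describe().loc[["mean", "std"]])
```

When the data is split into training and test sets, fitting the scaler on the training part only and reusing it on the test part avoids leaking test-set statistics into the model.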

7. Fitting the Data

We apply the following machine learning algorithms for predictive modelling:

1. Logistic Regression Classifier

2. Decision Tree Classifier

3. Random Forest Classifier

4. K Nearest Neighbours Classifier
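The four classifiers listed above can be fitted and scored in one loop. This is a sketch under stated assumptions: synthetic data from `make_classification` stands in for the real features, and the 80/20 split, the random seeds and the hyperparameters are illustrative choices, not the settings of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 13 features and the binary target
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and record its accuracy on the held-out test split
accuracies = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    accuracies[name] = model.score(X_test_s, y_test)

for name, acc in sorted(accuracies.items(), key=lambda item: -item[1]):
    print(f"{name}: {acc:.3f}")
```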

8. Assessing Accuracies of each model

The Logistic Regression Classifier had the highest accuracy.

9. Confusion Matrix

Create a confusion matrix from the predictions of the model with the highest accuracy.
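A minimal sketch of this step, using logistic regression as the best model. Synthetic data again stands in for the real dataset, so the counts it prints are not the results of this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The diagonal (TN and TP) holds the correct predictions, so accuracy equals (TN + TP) divided by the total number of test records.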