ML model development based on Brazil’s Covid-19 dataset

Machine Learning - Individual Assignment

Beatriz Manso

student ID: 21485094

Description of the data:

https://www.kaggle.com/S%C3%ADrio-Libanes/covid19

In total there are 54 features, expanded when pertinent to the mean, median, max, min, diff and relative diff.

The data is already scaled.

The goal of this notebook is to develop an ML model that predicts whether a patient with a confirmed COVID-19 case will require admission to the ICU.


Set working directory:

Installing Specific Packages:


Data at First Sight

Obs:

Of the 385 patients in our dataset, 195 were admitted to the ICU and 190 were not. The two classes are close enough in size that we can treat the target variable as well balanced.
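A per-patient count like the one above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's original cell; the column names `PATIENT_VISIT_IDENTIFIER` and `ICU` are assumptions about how the real file is laid out (several time-window rows per patient).

```python
import pandas as pd

# Toy stand-in for the real data: two time-window rows per patient.
# Column names are assumptions, not taken from the notebook.
df = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 1, 1, 2, 2, 3, 3],
    "ICU": [0, 1, 0, 0, 1, 1, 0, 0],
})

# A patient counts as admitted if ICU == 1 in any of their windows.
admitted = df.groupby("PATIENT_VISIT_IDENTIFIER")["ICU"].max()

counts = admitted.value_counts()                 # patients per class
shares = admitted.value_counts(normalize=True)   # class proportions

print(pd.DataFrame({"count": counts, "share": shares}))
```

Shares close to 0.5 for each class are what "well balanced" means here.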

Obs:

The graph above shows the number of admissions to the ICU by window of time.

Among the patients admitted to the ICU, the largest share were admitted more than 12 hours after arriving at the hospital.


Metadata


Exploratory Data Analysis (EDA)

Data Quality Issues

Data Duplications

Obs: There are no duplicates in our data.
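A duplicate check of this kind can be sketched with pandas as below. The frame is toy data with made-up columns, used only to show the calls.

```python
import pandas as pd

# Toy frame with one fully duplicated row (rows 1 and 2 are identical).
df = pd.DataFrame({"a": [1, 2, 2, 3], "b": ["x", "y", "y", "z"]})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = df.drop_duplicates()

print(f"Duplicate rows: {n_duplicates}")
```

On the real dataset the same count would be 0, matching the observation above.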

Missing Values

Obs: A large percentage of the data in the dataset is missing.

There are 225 variables with missing values and in total, 223863 NaN values.
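Counts like these can be obtained with `isna()`. The sketch below uses a small toy frame; the column names are hypothetical placeholders, not a claim about the real schema.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are illustrative only.
df = pd.DataFrame({
    "HEART_RATE_MEAN": [0.1, np.nan, np.nan, 0.3],
    "TEMPERATURE_MEAN": [np.nan, 0.2, 0.2, 0.2],
    "AGE_ABOVE65": [1, 0, 1, 0],
})

missing_per_column = df.isna().sum()

# Number of variables that contain at least one NaN.
columns_with_missing = int((missing_per_column > 0).sum())

# Total number of NaN cells in the frame.
total_missing = int(missing_per_column.sum())

# Percentage of missing values per column.
missing_percent = (df.isna().mean() * 100).round(1)

print(columns_with_missing, total_missing)
print(missing_percent)
```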

Missing values are a common issue in medical data. However, according to the authors of the dataset, we can assume that patients without a recorded measurement are clinically stable and are likely to have vital signs and blood labs similar to those in neighboring windows. Therefore, in the data preparation step, we can deal with missing values by filling them with the next or previous entry.

Univariate Exploration

Binary Features

Obs:

(A thread in the discussion section of the Kaggle competition explains that 0 values in the gender variable represent males and 1 values represent females.)

In our dataset there are 243 males and 142 females.

Obs: With the exception of disease group 6, patients with a disease have a higher frequency of ICU admission.

Categorical features

Obs:

Real (interval) features

Bivariate Exploration

Real (interval) features

We will visualise the relationships between the real-valued features using a correlation matrix.

Because there are a lot of features, it is hard to extract useful information from the graph above.

Let's check the correlation between the features and the ICU column:
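One way to rank features by their correlation with the target is sketched below. The data and the `feature_*` names are invented for illustration; only the `ICU` column name follows the text.

```python
import pandas as pd

# Toy data: feature_b is 1 - feature_a, so the two have opposite correlations.
df = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.8, 0.9, 0.3, 0.7],
    "feature_b": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "feature_c": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],
    "ICU": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every feature with the ICU column.
icu_corr = df.corr()["ICU"].drop("ICU")

# Order by absolute strength so strong negative correlations are not buried.
ranked = icu_corr.reindex(icu_corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Sorting by absolute value is a design choice: a strongly negative correlation is as informative for prediction as a strongly positive one.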


Data Preparation

Data Cleaning

As we observed in the missing values step, there are 225 variables with missing values and, in total, 223863 NaN values.

We will deal with this by imputing the missing values with the next and previous values, as suggested by the authors of the dataset.
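The next/previous-value imputation can be sketched as a forward fill followed by a backward fill. The version below fills within each patient so that values are not carried from one patient to another; that grouping, and the column names, are assumptions rather than something the notebook states.

```python
import numpy as np
import pandas as pd

# Toy data: three time windows per patient; column names are assumptions.
df = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 0, 1, 1, 1],
    "WINDOW": ["0-2", "2-4", "4-6", "0-2", "2-4", "4-6"],
    "HEART_RATE_MEAN": [np.nan, 0.4, np.nan, 0.7, np.nan, np.nan],
})

feature_cols = ["HEART_RATE_MEAN"]

filled = df.copy()
# Within each patient: take the previous value first (ffill),
# then the next value for any gaps left at the start (bfill).
filled[feature_cols] = (
    df.groupby("PATIENT_VISIT_IDENTIFIER")[feature_cols]
      .transform(lambda s: s.ffill().bfill())
)

print(filled)
```

A patient with no measurement at all for a feature would still have NaN after this step, which is one possible reason for dropping some rows afterwards.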

Dropping Columns and Rows

1572 rows and 2 columns have been dropped.


ML Model Comparison

Logistic Regression

Decision Tree Classifier

Random Forest Classifier

SVM

  1. Logistic Regression - AUC test/train: 0.69 - 0.84
  2. Decision Tree Classifier - AUC test/train: 0.62 - 1.00
  3. Random Forest - AUC test/train: 0.67 - 1.00
  4. Support Vector Machine - AUC test/train: 0.72 - 0.78

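SVM has the best test AUC (0.72) and the smallest gap between train and test AUC. The Decision Tree and Random Forest both reach a train AUC of 1.00 with a much lower test AUC, which suggests they overfit the training data.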
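A comparison of this shape can be sketched with scikit-learn as follows. It runs on synthetic data from `make_classification`, uses default hyperparameters, and scores AUC from predicted probabilities; all of these are assumptions, so the numbers it prints will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset (not the real data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    # probability=True enables predict_proba for SVC.
    "Support Vector Machine": SVC(probability=True, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC is computed from the predicted probability of the positive class.
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    results[name] = (auc_test, auc_train)
    print(f"{name} - AUC test/train: {auc_test:.2f} - {auc_train:.2f}")

# Pick the model with the highest test AUC.
best_model = max(results, key=lambda name: results[name][0])
print("Best test AUC:", best_model)
```

Reporting train AUC next to test AUC makes overfitting visible: an unpruned decision tree typically scores 1.00 on the data it was trained on, whatever its test score.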

Analysing the best model: SVM