Setup Google Colab local drive

You may skip this section if you are working with Anaconda Jupyter Notebook on your local computer.

Installing specific analysis packages

Loading analysis packages

Loading data

Making Copies of the data

Data at first sight

Observation:

Observation:

Observation

Metadata

Metadata (version 1) for train data

Metadata (version 2) for train data

Metadata (version 1) for Extra data

Metadata (version 2) for extra data

Looking at the metadata for train dataset

We can see no clear correlation between any of the above features in the table.

Observation for Metadata:

Exploratory Data Analysis (EDA)

Visualization

Observations:

Both pie charts show that the majority of confirmed cases and deaths due to COVID-19 were found in the United States.

From this we can infer that the US is most likely driving the growth of cases. However, this could be due to multiple factors, such as the population size of the country and when the country went into lockdown.

Observation:

Data Quality Issues

Data duplications

Observations:

Missing Values

Missing values in Train data

Observations

Missing values for County and Province_State are highly correlated, meaning that rows with missing County information are highly likely to also have missing Province_State information.
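One way to quantify this kind of relationship is to correlate the missingness indicators of each column. The sketch below is illustrative only: the toy DataFrame and its values are made up, and only the column names County and Province_State come from the notebook.

```python
import pandas as pd

# Toy frame standing in for the train data; all values are made up for illustration.
df = pd.DataFrame({
    "County": ["A", None, None, "D", None, "F"],
    "Province_State": ["X", None, None, "Y", None, "Z"],
    "Country_Region": ["US", "IT", "FR", "US", "ES", "US"],
    "Population": [100, None, 300, 400, 500, 600],
})

# 1 where a value is missing, 0 otherwise.
missing = df.isna().astype(int)

# Drop columns with no missing values: a constant indicator has no defined correlation.
missing = missing.loc[:, missing.sum() > 0]

# Pairwise correlation between the missingness indicators.
corr = missing.corr()
print(corr.round(2))
```

A value close to 1 for a pair of columns means that they tend to be missing in the same rows.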

Missing values in the extra dataset

Observations:

Observations:

Univariate Exploration

Cardinality

Cardinality of train data

Cardinality of extra data

Real (Interval) features

Real type data in extra data

Observations

The figure above shows that there are outliers in the features population density and reproduction rate.

These columns would need to be scaled before they are passed to an ML model.
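As a minimal sketch of how outliers can be flagged and a skewed column scaled, the example below uses Tukey's IQR fences and median/IQR scaling. Both choices are assumptions for illustration, not necessarily the scaling used later in this notebook, and the sample values are made up.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True for points outside the Tukey fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def robust_scale(values):
    """Centre on the median and divide by the IQR (assumes the IQR is non-zero)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (values - median) / (q3 - q1)

# Made-up population densities with one extreme value.
population_density = [12.0, 35.0, 48.0, 60.0, 83.0, 110.0, 7900.0]

mask = iqr_outlier_mask(population_density)
scaled = robust_scale(population_density)

print("outliers:", [v for v, m in zip(population_density, mask) if m])
print("scaled:", scaled.round(2))
```

Scaling by the median and IQR keeps a single extreme value from compressing the rest of the column, which is what happens with min-max scaling.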

Bivariate Exploration

Real (Interval) features

Observations:

No clear observations can be made from this graph as there are too many features

Observation:

Observation.

Observation.

This graph shows that the number of cases decreases as poverty increases. However, it is possible that in some countries cases go unreported because of a lack of medical amenities and the high cost of health care.

Observation.

Observation:

No observation can be made, as it is difficult to infer anything from such a small graph.

Integer (Ordinal) features

Observations:

This graph of the ordinal feature test_units shows that the majority of the records (around 60,500) are missing.

Other EDAs

Other EDAs that are not in the lab work

Time Series - Seasonality

| Location | Trend | Seasonality | Residual |
|---|---|---|---|
| Global - Confirmed Cases | The trend spikes in mid-March 2020, with cases rising from under 10,000 per day to almost 150,000 per day at the start of April. It stays around 150,000 until June, when it spikes again. | Follows a weekly pattern that fluctuates between +10,000 and -7,500 | |
| Global - Fatal Cases | Follows the same pattern as confirmed cases: a spike from under 1,000 in mid-March to just under 13,000 at the beginning of April. However, it declines in mid-April back down to around 6,000 per day. | Follows the same weekly pattern, between +750 and -1,250 | |

| Location | Trend | Seasonality | Residual |
|---|---|---|---|
| Africa - Confirmed | Sub 1,000 cases to mid-March, spikes to 100,000 at the start of April, slowly declines to 75,000 in June | | |
| Africa - Fatal | Similar trend to confirmed; the spike is in mid-April with 7,000 daily fatalities | | |
| Asia - Confirmed | Asia is the only continent with a spike in cases between mid-January and mid-February, up to 10,000 a day. Cases drop to sub 1,000 until mid-March, then steadily increase to 30,000 daily in June | | |
| Asia - Fatal | Same trend as confirmed, but with a large spike in mid-April to 700 that drops to normal levels in roughly 1 week. Flattens during May at around 400 fatalities, then increases to 650 in June | | |
| Europe - Confirmed | Sub 1,000 cases to mid-March, spikes to 35,000 at the start of April, slowly declines to 15,000 in June | | |
| Europe - Fatal | Same pattern up to April with 4,000 fatalities, steeper decrease between April and June | | |
| North America - Confirmed | Low cases until mid-March, steady increase to 7,000 cases per day in June | | |
| North America - Fatal | Fatal cases increase at the beginning of April, steady climb to 800 per day in June | | |
| Oceania - Confirmed | Cases spike in mid-March to 800, drop to under 50 by mid-April | | |
| Oceania - Fatal | Fatalities climb in mid-March, peak at the start of April with 8 daily, drop to normal levels at the start of May | | |
| South America - Confirmed | Cases steadily increase from mid-March to 40,000 daily in June | | |
| South America - Fatal | Fatal cases follow the same pattern, peaking at 1,400 daily in June | | |

The trends for the continent data begin at different points in time. Cases first appear in the following order:
Asia,
Oceania,
Europe,
North America,
Africa,
South America.

Some continents have been hit harder than others. Africa is the worst off, with a peak over 100,000, and South America is second with a peak of 40,000.
Oceania had the lowest peak with 800, followed by North America with 7,000.

Seasonality follows a weekly pattern in all graphs.
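The decomposition code itself is not shown in this section. As a rough sketch of how a trend, a weekly seasonal component and a residual can be separated, the example below applies a classical additive decomposition with plain pandas. The series is synthetic, and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend plus a known weekly pattern (sums to zero).
dates = pd.date_range("2020-03-01", periods=70, freq="D")
weekly_pattern = np.array([300, 200, 100, 0, -100, -200, -300])
values = 1000 + 50 * np.arange(70) + weekly_pattern[np.arange(70) % 7]
series = pd.Series(values, index=dates, dtype=float)

# Trend: centred 7-day moving average, which averages out any weekly cycle.
trend = series.rolling(window=7, center=True).mean()

# Seasonality: mean detrended value per day of the week, re-centred around zero.
detrended = series - trend
day_of_week = pd.Series(series.index.dayofweek, index=series.index)
seasonal_means = detrended.groupby(day_of_week).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = day_of_week.map(seasonal_means)

# Residual: what is left after removing trend and seasonality.
residual = series - trend - seasonal

print(seasonal_means.round(1))
```

The first and last three days have no trend value because the centred window is incomplete there, so the residual is also undefined at the edges.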

SIR Graphical Visualisation: Japan

The above graph shows the fit of an SIR-type model (susceptible, infected, recovered and fatalities) for Japan over the dataset. In our research we found that this could be an effective way to fit a model to the data and predict health and infection trends. Unfortunately, we ran out of time to apply it to our dataset beyond the graph above. In future work we would include SIR modelling.
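For reference, the basic SIR model tracks three compartments with dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I and dR/dt = gamma*I. The sketch below integrates these equations with a simple forward-Euler step. The function name and the parameter values are illustrative assumptions; they are not fitted to the Japan data, and a fatalities compartment is not included.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with forward Euler.

    Returns one (S, I, R) tuple per day, starting with day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    steps_per_day = int(round(1 / dt))
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history

# Illustrative parameters only (basic reproduction number beta / gamma = 3).
history = simulate_sir(population=1_000_000, initial_infected=10,
                       beta=0.3, gamma=0.1, days=160)

# Day on which the number of infected people is largest.
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("peak day:", peak_day, "infected at peak:", round(history[peak_day][1]))
```

Fitting the model to observed data would mean searching for the beta and gamma that minimise the error between the simulated and the reported curves.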

Data Preparation

Dropping Columns

Label Encoding

Adding SMAs

Scaling

Splitting Dataset for modeling

Experimenting Shift/mean feature generation

The graph isn't meant to be pretty; it is only used to check that groupby has behaved as expected.
We can see that many different means have been created for each location point.
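A minimal sketch of per-location shift and mean features is given below. The toy data and the column names (Location, Date, ConfirmedCases, cases_lag_1, cases_sma_3) are assumptions for illustration and may differ from the notebook's own.

```python
import pandas as pd

# Toy data: two locations, five days each; values are made up.
df = pd.DataFrame({
    "Location": ["A"] * 5 + ["B"] * 5,
    "Date": list(pd.date_range("2020-04-01", periods=5)) * 2,
    "ConfirmedCases": [1, 2, 4, 8, 16, 10, 10, 20, 20, 30],
}).sort_values(["Location", "Date"]).reset_index(drop=True)

grouped = df.groupby("Location")["ConfirmedCases"]

# Lag feature: the previous day's value within the same location.
df["cases_lag_1"] = grouped.shift(1)

# Simple moving average of the previous 3 days.
# Shifting first keeps the current day's value out of its own feature.
df["cases_sma_3"] = grouped.transform(lambda s: s.shift(1).rolling(window=3).mean())

print(df)
```

Because both features are computed per group, the first rows of location B are NaN rather than being filled from the last rows of location A.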

LinearRegression shift/mean feature generation

XGBRegressor shift/mean feature generation

The cell below takes ~50 minutes to run.

Validation predictions LR

Validation predictions XGB

ML Model

XGBRegressor + Hyperopt

MSE: 3.5773795607209973
RMSE: 54.37627441780612
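As a reminder of how the two metrics relate, the sketch below computes both on made-up values. When both are computed on the same predictions and on the same scale, RMSE is the square root of MSE. The helper names are illustrative, not the notebook's own.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over paired observations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print("MSE:", mse_value, "RMSE:", rmse_value)
```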

Polynomial Features

Building the Multi-Layer Perceptron regression model

RandomForest

LightGBM

Linear Regression Model

Decision Tree Regressor

Hyperparameter Tuning for Decision Tree Regressor

Training Decision Tree With Best Hyperparameters

Gradient Boosting