COVID-19 pandemic in Brazil: Regression and PCA

This project seeks to understand the COVID-19 pandemic in Brazil. To that end, it uses a regression analysis and a principal component analysis (PCA) to assess the impact of socioeconomic factors on SARS-CoV-2 mortality across the Brazilian states.

Objective

Understand the relationship between socioeconomic factors and mortality caused by the SARS-CoV-2 virus
Identify which variables have the greatest impact on mortality
Highlight important aspects where public policy should focus
Run a cluster analysis and a PCA to show how the variables relate and how the Brazilian states differ
Fit a linear regression and identify the variables that are significant in explaining mortality

Workflow

Data

Socioeconomic data - obtained from IBGE
IDHM data - obtained from PNUD Brazil
COVID-19 deaths from the beginning of the pandemic in Brazil until 31 December 2020 - obtained from the national website https://covid.saude.gov.br/
Shapefile of Brazil - obtained from IBGE: https://www.ibge.gov.br/geociencias/organizacao-do-territorio/malhas-territoriais/15774-malhas.html?=&t=acesso-ao-produto

Analysis

Cluster Analysis:

The data was standardized with the scale() function
The kmeans() function was run with different numbers of centers, and each solution was visualized (fviz_cluster) to judge the best number of clusters. The choice was verified with the elbow method (fviz_nbclust).
Once the number of clusters was defined, the cluster assignment was added to the data frame
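
The clustering steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the built-in USArrests data set stands in for the project's socioeconomic indicators, k = 4 is an arbitrary choice, and a plain plot() of the within-cluster sum of squares replaces the factoextra plots (fviz_nbclust, fviz_cluster) used in the project.

```r
# Stand-in data: the built-in USArrests data set replaces the socioeconomic
# indicators per Brazilian state, which are not reproduced here.
data(USArrests)
df <- USArrests

# Standardize every variable to mean 0 and standard deviation 1.
df_std <- scale(df)

# Elbow method: total within-cluster sum of squares for k = 1..8.
set.seed(123)  # k-means starts from random centers
wss <- sapply(1:8, function(k) {
  kmeans(df_std, centers = k, nstart = 25)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the chosen solution (k = 4 is an arbitrary choice for this sketch)
# and attach the cluster label to the original data frame.
k_chosen <- 4
km <- kmeans(df_std, centers = k_chosen, nstart = 25)
df$cluster <- factor(km$cluster)

# Number of observations per cluster.
table(df$cluster)
```

The "elbow" is the value of k after which the curve flattens, meaning that additional clusters reduce the within-cluster variation only marginally.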

Linear Regression Analysis:

The cluster was transformed into a dummy variable by the function dummy_columns()
Correlation analysis was done (corr and chart.Correlation)
Linear regression was fitted with the lm() function, using the log of deaths per 100,000 inhabitants as the response
The result was then passed through a stepwise procedure (step), and a final regression was fitted
Shapiro-Francia test to check the normality of the residuals
Residuals were plotted in a histogram and a qqplot
OLS was calculated
Breusch-Pagan test to check heteroscedasticity
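
A minimal sketch of the regression steps above, using base R only. The assumptions are loud ones: the built-in mtcars data set stands in for the project data, log(mpg) plays the role of the log mortality rate, and the factor cyl is a hypothetical replacement for the cluster label (lm() dummy-codes factors internally, so dummy_columns() is not needed here). The Shapiro-Wilk test (shapiro.test) is swapped in for the Shapiro-Francia test, which is available as sf.test() in the nortest package, and the Breusch-Pagan statistic is computed by hand in its studentized (Koenker) form, which is also the default of bptest() in the lmtest package.

```r
# Stand-in data: mtcars replaces the project data set.
data(mtcars)
df <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]
df$cluster <- factor(mtcars$cyl)  # hypothetical cluster label

# Full model on the log of the response; the factor is dummy-coded by lm().
full <- lm(log(mpg) ~ wt + hp + disp + qsec + cluster, data = df)

# Stepwise selection by AIC, then inspect the final model.
final <- step(full, direction = "both", trace = 0)
summary(final)

# Residual diagnostics: normality test, histogram and Q-Q plot.
res <- residuals(final)
shapiro.test(res)  # Shapiro-Wilk, used here instead of Shapiro-Francia
hist(res, main = "Residuals", xlab = "Residual")
qqnorm(res)
qqline(res)

# Breusch-Pagan test (studentized form): regress the squared residuals
# on the regressors of the final model; n * R^2 is asymptotically
# chi-squared with as many degrees of freedom as there are regressors.
res2 <- res^2
X <- model.matrix(final)[, -1, drop = FALSE]  # drop the intercept column
aux <- lm(res2 ~ X)
bp_stat <- length(res2) * summary(aux)$r.squared
bp_df <- ncol(X)
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value in the Breusch-Pagan test suggests that the residual variance depends on the regressors, that is, heteroscedasticity.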

Cluster Map:

The clusters were plotted on a map of Brazil to show which cluster each state belongs to
The shapefile of Brazil was read in (readOGR) and the cluster data was added to it
The map was plotted with the tmap package

PCA Analysis:

Correlation was checked with a rho matrix
Kaiser-Meyer-Olkin (KMO) statistic
Bartlett's test of sphericity (cortest.bartlett)
The PCA was done with standardized data (scale) through the prcomp() function
The loading (weight) of each variable on each principal component was plotted
The final PCA plot was done by the autoplot() function
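
The PCA steps above can be sketched in base R as follows. As assumptions for illustration only: the built-in USArrests data set stands in for the project data, the overall KMO statistic and Bartlett's test of sphericity are computed by hand from the correlation matrix (the psych package offers KMO() and cortest.bartlett() for the same purpose), and biplot() replaces the autoplot() call used in the project.

```r
# Stand-in data: USArrests replaces the project's socioeconomic variables.
data(USArrests)
X <- USArrests
n <- nrow(X)
p <- ncol(X)

# Correlation (rho) matrix of the variables.
R <- cor(X)

# Bartlett's test of sphericity: H0 is that R is the identity matrix.
chi2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
dof <- p * (p - 1) / 2
p_value <- pchisq(chi2, df = dof, lower.tail = FALSE)

# Overall KMO: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal elements only).
R_inv <- solve(R)
partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
off <- row(R) != col(R)
kmo <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

# PCA on standardized data.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)

# Loadings: weight of each variable on each principal component.
pca$rotation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
barplot(var_explained, names.arg = colnames(pca$rotation),
        ylab = "Proportion of variance explained")

# Observations and variable loadings on the first two components.
biplot(pca)
```

A significant Bartlett test and a KMO value well above 0.5 are the usual indications that the variables are correlated enough for a PCA to be worthwhile.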

For further information, check the GitHub repository with the scripts

Results

Work in progress

References

Link to the GitHub repository

Sofia Foladori Invernizzi