COVID-19 pandemic in Brazil: Regression and PCA
This project studies the COVID-19 pandemic in Brazil. It uses regression analysis and principal component analysis (PCA) to assess how socioeconomic factors relate to SARS-CoV-2 mortality across the Brazilian states.
Objective
- Understand the relationship between socioeconomic factors and mortality caused by SARS-CoV-2
- Identify which variables have the largest impact on mortality
- Highlight the areas where public policy should focus
- Use cluster analysis and PCA to show how the variables relate and how the Brazilian states differ
- Fit a linear regression and identify the variables that are significant in explaining mortality
Workflow
Data
- Socioeconomic data - obtained from IBGE
- IDHM data - obtained from PNUD Brazil
- COVID-19 deaths from the beginning of the pandemic in Brazil until 31 December 2020 - obtained from the national website https://covid.saude.gov.br/
- Shapefile of Brazil - obtained from IBGE: https://www.ibge.gov.br/geociencias/organizacao-do-territorio/malhas-territoriais/15774-malhas.html?=&t=acesso-ao-produto
Analysis
Cluster Analysis:
- The data were standardized with scale()
- kmeans() was run with different numbers of centers, and the results were visualized (fviz_cluster) to find the best number of clusters. The choice was verified with the elbow method (fviz_nbclust).
- Once the number of clusters was defined, each state's cluster assignment was added to the data frame
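The clustering steps above can be sketched in base R. This is a minimal illustration, not the project's script: the built-in USArrests data set stands in for the Brazilian socioeconomic table, the choice of k = 4 is arbitrary, and the elbow curve is computed by hand instead of with fviz_nbclust.

```r
# Stand-in data (assumption): USArrests replaces the state-level
# socioeconomic table used in the project.
set.seed(42)                      # kmeans() uses random starting centers
df <- USArrests
df_std <- scale(df)               # z-scores: mean 0, sd 1 for every column

# Elbow method by hand: total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(df_std, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))
if (interactive()) {
  plot(1:8, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Total within-cluster sum of squares")
}

# k = 4 is a hypothetical choice for illustration only.
k <- 4
fit <- kmeans(df_std, centers = k, nstart = 25)

# Attach the cluster label to the original (unscaled) data frame.
df$cluster <- factor(fit$cluster)
print(table(df$cluster))
```

The number of clusters is read off the point where the curve of `wss` flattens; with the factoextra package installed, `fviz_nbclust(df_std, kmeans, method = "wss")` draws the same curve.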
Linear Regression Analysis:
- The cluster label was converted into dummy variables with dummy_columns()
- A correlation analysis was run (corr and chart.Correlation)
- A linear regression was fitted with lm(), using the log of deaths per 100,000 inhabitants as the response
- The fitted model was then passed through a stepwise selection procedure (step) and a final regression was fitted
- Shapiro-Francia test to check the normality of the residuals
- The residuals were plotted in a histogram and a Q-Q plot
- OLS was calculated
- Breusch-Pagan test to check heteroscedasticity
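The regression workflow above can be sketched with base R only. Several things here are assumptions made for illustration: mtcars stands in for the state-level data, log(mpg) plays the role of the log death rate, factor(cyl) plays the role of the cluster label, shapiro.test() (Shapiro-Wilk) replaces the Shapiro-Francia test, and the Breusch-Pagan statistic is computed by hand in Koenker's studentized form.

```r
# Stand-in data (assumption): mtcars replaces the state-level table.
dat <- mtcars
dat$cluster <- factor(dat$cyl)    # hypothetical stand-in for the cluster label

# lm() expands a factor into dummy columns internally, so this sketch
# does not need explicit dummy_columns() output.
full <- lm(log(mpg) ~ wt + hp + disp + qsec + cluster, data = dat)

# Stepwise selection by AIC, in both directions.
final <- step(full, direction = "both", trace = 0)
print(summary(final))

# Normality of the residuals. Shapiro-Wilk is a base-R stand-in;
# nortest::sf.test(res) would run the Shapiro-Francia test instead.
res <- residuals(final)
print(shapiro.test(res))
if (interactive()) {
  hist(res, main = "Residuals")
  qqnorm(res); qqline(res)
}

# Breusch-Pagan test (studentized form): regress the squared residuals on
# the same predictors; n * R^2 is asymptotically chi-squared.
dat$res2 <- res^2
aux <- update(final, res2 ~ .)
bp_stat <- nobs(aux) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
cat("BP statistic:", bp_stat, " df:", bp_df, " p-value:", bp_p, "\n")
```

A small p-value in the last step would suggest heteroscedastic residuals. With the lmtest package installed, `bptest(final)` should report the same studentized statistic.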
Cluster Map:
- The clusters were plotted on a map of Brazil to show which cluster each state belongs to
- The shapefile of Brazil was loaded (readOGR) and the cluster data was joined to it
- The map was plotted with the tmap package
PCA Analysis:
- Correlation was checked with a rho matrix
- Kaiser-Meyer-Olkin (KMO) statistic
- Bartlett test (cortest.bartlett)
- The PCA was run on standardized data (scale) with prcomp()
- The loading (weight) of each variable on each principal component was plotted
- The final PCA plot was produced with autoplot()
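The PCA checks and the decomposition itself can be sketched as follows. This is an illustration under stated assumptions: USArrests again stands in for the project data, Bartlett's sphericity test and the overall KMO statistic are computed by hand from the correlation matrix instead of with cortest.bartlett and a KMO helper, and biplot() replaces autoplot().

```r
# Stand-in data (assumption): USArrests replaces the project data.
X <- USArrests
n <- nrow(X)
p <- ncol(X)
R <- cor(X)                        # correlation (rho) matrix
print(round(R, 2))

# Bartlett's test of sphericity: H0 is that R is the identity matrix.
chi2 <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df_b <- p * (p - 1) / 2
p_b <- pchisq(chi2, df = df_b, lower.tail = FALSE)
cat("Bartlett chi2:", chi2, " df:", df_b, " p-value:", p_b, "\n")

# Overall KMO: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal elements only).
Rinv <- solve(R)
P <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))   # partial correlations
off <- row(R) != col(R)
kmo <- sum(R[off]^2) / (sum(R[off]^2) + sum(P[off]^2))
cat("Overall KMO:", kmo, "\n")

# PCA on standardized data; scale. = TRUE is equivalent to calling scale() first.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))     # share of variance per component
print(round(pca$rotation, 3))      # loadings: weight of each variable per PC

if (interactive()) biplot(pca)     # base-R stand-in for autoplot()
```

A significant Bartlett test and a KMO value well above 0.5 are the usual signs that the variables are correlated enough for PCA to be worthwhile.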
For further information, check the GitHub repository with the scripts
Results
Work in progress.