This project examines the COVID-19 pandemic in Brazil. It uses regression analysis and principal component analysis (PCA) to assess how socioeconomic factors relate to SARS-CoV-2 mortality across the Brazilian states.

Objective

  • Understand the relationship between socioeconomic factors and mortality caused by SARS-CoV-2
  • Identify which variables have the greatest impact on mortality
  • Highlight areas where public policy should focus
  • Use cluster analysis and PCA to show how the variables and the Brazilian states differ
  • Fit a linear regression to find the variables that are significant in explaining mortality

Workflow

Data

Analysis

Cluster Analysis:

  • Data was standardized with the scale() function
  • kmeans() was run with different numbers of centers and the results were visualized (fviz_cluster) to find the best number of clusters. The choice was verified with the elbow method (fviz_nbclust).
  • Once the number of clusters was defined, each observation's cluster assignment was added to the data frame
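
The clustering steps above can be sketched in base R. This is a minimal illustration, not the project's script: it uses the built-in `swiss` data set as a stand-in for the Brazilian state data, picks k = 3 arbitrarily, and draws the elbow curve by hand instead of calling fviz_nbclust.

```r
# Stand-in data: the built-in `swiss` data set (47 provinces, six
# socioeconomic indicators). It is NOT the project's data.
x <- scale(swiss)  # centre each column to mean 0 and scale to sd 1

# Elbow method: total within-cluster sum of squares for k = 1..8
set.seed(42)  # k-means starts from random centres
wss <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the chosen k (3 is an arbitrary value for illustration) and
# attach the cluster label to the original data frame
k <- 3
fit <- kmeans(x, centers = k, nstart = 25)
clustered <- cbind(swiss, cluster = factor(fit$cluster))
head(clustered)
```

The "elbow" is the value of k after which the curve flattens, meaning additional clusters no longer reduce the within-cluster variation by much.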

Linear Regression Analysis:

  • The cluster was transformed into a dummy variable by the function dummy_columns()
  • Correlation analysis was done (corr and chart.Correlation)
  • Linear regression was fitted with the lm() function, using the log of deaths per 100,000 inhabitants as the response
  • The model then went through a stepwise procedure (step) and a final regression was fitted
  • Shapiro-Francia test for normality of the residuals
  • Residuals were plotted in a histogram and a qqplot
  • OLS was calculated
  • Breusch-Pagan test to check heteroscedasticity
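
A rough sketch of the regression workflow in base R follows. Several things here are assumptions made only to keep the example self-contained: the built-in `swiss` data set replaces the project's data, `Infant.Mortality` plays the role of the response, the Shapiro-Wilk test (shapiro.test) stands in for Shapiro-Francia, and the Breusch-Pagan statistic is computed by hand in its studentized form instead of through a package function.

```r
dat <- swiss  # stand-in data, not the project's data

# Full model: log of the response on all remaining variables
full <- lm(log(Infant.Mortality) ~ ., data = dat)

# Stepwise selection by AIC, then inspect the reduced model
final <- step(full, direction = "both", trace = 0)
summary(final)

# Normality of the residuals. Shapiro-Wilk is used here only because it
# ships with base R; it is a stand-in for the Shapiro-Francia test.
res <- residuals(final)
shapiro.test(res)
hist(res, main = "Residuals", xlab = "Residual")
qqnorm(res)
qqline(res)

# Breusch-Pagan test (studentized form) by hand: regress the squared
# residuals on the model's predictors; the statistic is n * R^2 and is
# compared against a chi-squared distribution.
aux_dat <- model.frame(final)[, -1, drop = FALSE]  # predictors only
aux_dat$res2 <- res^2
aux <- lm(res2 ~ ., data = aux_dat)
bp_stat <- nrow(aux_dat) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value in the last step suggests that the residual variance depends on the predictors, that is, heteroscedasticity.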

Cluster Map:

  • The clusters were plotted on a map of Brazil to show which cluster each state belongs to
  • A shapefile of Brazil was read (readOGR) and the cluster data was added to it
  • The map was plotted with the tmap package

PCA Analysis:

  • Correlation was checked with a rho matrix
  • Kaiser-Meyer-Olkin (KMO) statistic
  • Bartlett test (cortest.bartlett)
  • The PCA was done with standardized data (scale) through the prcomp() function
  • The weight of each variable in each principal component in the PCA was plotted
  • The final PCA plot was done by the autoplot() function
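
The PCA steps can be illustrated with base R alone. As a sketch under stated assumptions: the built-in `swiss` data set again stands in for the project's data, Bartlett's test and the overall KMO statistic are computed by hand from their standard formulas instead of through package functions, and biplot() replaces autoplot() for the final plot.

```r
x <- scale(swiss)  # stand-in data, standardized
R <- cor(x)        # correlation ("rho") matrix
n <- nrow(x)
p <- ncol(x)

# Bartlett's test of sphericity: H0 is that R is the identity matrix,
# i.e. the variables are uncorrelated and PCA would not be useful
bart_stat <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
bart_df <- p * (p - 1) / 2
bart_p <- pchisq(bart_stat, df = bart_df, lower.tail = FALSE)

# Overall KMO: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal elements only)
Rinv <- solve(R)
partial <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))
off <- row(R) != col(R)
kmo <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

# PCA on standardized data
pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
summary(pca)   # variance explained by each component
pca$rotation   # weight (loading) of each variable in each component

# Loadings of the first component, then a base-R biplot
barplot(pca$rotation[, 1], las = 2, main = "PC1 loadings")
biplot(pca)
```

KMO values closer to 1 and a small Bartlett p-value both indicate that the variables share enough correlation for PCA to be worthwhile.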

For further information, check the GitHub repository with the scripts

Results

Work in progress

References

Link to the GitHub repository