Analyses
The Covid-19 pandemic has severely affected economies worldwide, lowering both GDP and GDP per capita. Georgiou et al. (2020) observed that the Covid-19 outbreak and pandemic changed every aspect of human activity to an exceptional extent. Hutagalung et al. (2021) reported that the number of confirmed cases and deaths due to Covid-19 was increasing in Southeast Asia; using deaths data collected from the WHO, they applied K-means clustering to divide the data into three clusters: “High”, “Medium”, and “Low”.
The aim of this paper is to understand and analyze the association between the severity of the Covid-19 pandemic and the WDI indicators. Secondary data were collected from the World Bank database. Before analysis, the dataset requires preprocessing to make it ready for the clustering, regression, and classification algorithms. Clustering is performed to understand how countries group on the WDI indicators and to compare the aggregate results of the clusters. Logistic regression, LDA, QDA, and multiclass logistic regression are used in RStudio to classify Covid-19 deaths into two and four categories.
The dataset includes 20 variables and 186 records, each with a unique country name.
The data contain many missing values as well as incorrectly typed variables. The variables covid.deaths and comp.education are numerical but were stored as strings, so their data type was changed from character to numeric in R (see Appendix Image 1.0). For variables without outliers, missing values were filled with the mean of the variable. The variables “Covid.deaths”, “Life.expec”, “Birth.rate”, and “Water.services” have no outliers.
Fig 1.0: Boxplot of Variables with No outliers
For variables with outliers, missing values were filled with the median of the variable. All the remaining variables, such as “Elect.access”, “Pop.growth”, and “Pop.density”, contain outliers, so their missing values were treated with the median.
Fig 1.1: Boxplot with Outliers
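The imputation rule described above (mean for variables without outliers, median for variables with outliers) can be illustrated with a minimal Python sketch. This is not the R code used in the study: the 1.5 × IQR boxplot rule for detecting outliers, the function names, and the sample values are all assumptions made for illustration.

```python
from statistics import mean, median, quantiles

def has_outliers(values):
    """Return True if any observed value lies outside the 1.5*IQR whiskers.

    The 1.5*IQR rule is the usual boxplot convention and is assumed here.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return any(v < low or v > high for v in observed)

def impute(values):
    """Fill missing entries (None) with the median if the variable has
    outliers, otherwise with the mean."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if has_outliers(values) else mean(observed)
    return [fill if v is None else v for v in values]

# Made-up values, not taken from the World Bank dataset.
life_expec = [70.0, 72.0, None, 74.0, 76.0]
pop_density = [10.0, 12.0, 11.0, None, 13.0, 14.0, 15.0, 16.0, 500.0]

life_expec_filled = impute(life_expec)    # no outliers: mean is used
pop_density_filled = impute(pop_density)  # 500.0 is an outlier: median is used
```

The median is preferred for the second variable because a single extreme value would pull the mean far away from the typical country.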
Outliers can distort a model, so they were treated with the capping method. At the 10% level of significance, some outliers were treated. If the range is extended from 10% to 15%, almost all outliers are treated, although a few variables still show extreme values; in that case those values are not outliers but belong to a different group (cluster) of observations. Raising the level further is not advisable because it increases the tolerance for error, so the level of significance was kept at a maximum of 10%.
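A minimal Python sketch of capping is given below. It reads the "10% level" as capping at the 10th and 90th percentiles, which is an assumption: the exact cut-offs used in the R analysis are not stated. The function name and the sample values are hypothetical.

```python
from statistics import quantiles

def cap_outliers(values, lower_pct=10, upper_pct=90):
    """Cap values below the lower percentile and above the upper percentile.

    Assumption: a '10% level' means the 10th and 90th percentiles.
    """
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Made-up values with one extreme observation.
raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
capped = cap_outliers(raw)
```

Capping keeps every record in the dataset, which matters here because each of the 186 rows is a distinct country.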
Q1] Data Pre-Processing
The correlation plot shows the pairwise relationships between all the variables. Covid deaths are moderately to highly negatively correlated with “Mortality Rate”, “Population growth”, and “Birth Rate”, and moderately to highly positively correlated with “Life.expec”, “Elect.access”, “Primary”, and “water services”. The correlation between “Mortality rate” and “Life.expec” is highly negative (refer to Fig 1.2).
Fig 1.2: Correlation Plot
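Each cell of a correlation plot is a pairwise correlation coefficient. The sketch below computes the Pearson coefficient in plain Python; it is assumed (not stated in the analysis) that Pearson correlation was the measure plotted, and the sample values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Made-up values: life expectancy falls as the mortality rate rises.
mortality_rate = [10.0, 20.0, 30.0, 40.0, 50.0]
life_expectancy = [80.0, 75.0, 70.0, 65.0, 60.0]
r = pearson(mortality_rate, life_expectancy)
```

A value of r close to -1 corresponds to a "highly negative" relationship, a value close to +1 to a "highly positive" one.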
K-means clustering is an unsupervised machine learning algorithm that is widely used in data mining. It partitions the data into K clusters. The value of K is chosen with the gap statistic: the optimal number of clusters is taken as K. As per Fig 1.3, K = 7.
Fig 1.3: Optimal Number of Clusters
Seven clusters were built, and the 186 countries were divided among them (refer to Fig 1.4).
Fig 1.4: K-mean Cluster Plot
The clusters were built using the Euclidean distance. Before applying the algorithm, the preprocessed data were scaled so that differences in units of measurement do not distort the clustering.
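The two steps described above, scaling followed by Euclidean-distance clustering, can be sketched in plain Python as follows. This is an illustrative implementation of Lloyd's algorithm, not the R routine used in the study; the initialization with the first k rows is an assumption made for reproducibility, and the sample rows are made up.

```python
from math import dist
from statistics import mean, stdev

def standardize(rows):
    """Scale every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, centers, spreads)]
            for row in rows]

def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm with Euclidean distance.

    Assumption: the initial centroids are the first k rows; practical
    implementations use random or smarter starting points.
    """
    centroids = [list(row) for row in rows[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every row to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(row, centroids[c]))
                      for row in rows]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels, centroids

# Made-up rows (life expectancy, population density) on very different scales.
rows = [[55.0, 40.0], [80.0, 900.0], [57.0, 60.0],
        [82.0, 950.0], [56.0, 50.0], [81.0, 920.0]]
labels, centroids = kmeans(standardize(rows), k=2)
```

Without the scaling step, the second column would dominate the Euclidean distance simply because its values are numerically larger.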
Fig 1.5: Aggregate of K-mean Clusters
As per Fig 1.5, the average number of Covid-19 deaths is highest in cluster 4 and lowest in cluster 5. Average life expectancy is highest in cluster 2 and lowest in cluster 5. Average electricity access is highest in clusters 2 and 3 and lowest in cluster 5. Fig 1.5 reports the average of each variable for every cluster, and each cluster contains a distinct set of countries. Countries within a cluster have similar (homogeneous) characteristics, whereas characteristics differ (are heterogeneous) between clusters.
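The per-cluster averages reported in Fig 1.5 amount to grouping a variable by cluster label and taking the mean of each group. A minimal Python sketch follows; the function name, labels, and values are hypothetical and are not the figures from the study.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(labels, values):
    """Average of one variable within each cluster."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return {label: mean(vals) for label, vals in sorted(groups.items())}

# Made-up cluster labels and Covid-19 death values for five countries.
cluster_labels = [4, 5, 4, 5, 2]
covid_deaths = [600.0, 50.0, 500.0, 70.0, 300.0]
averages = cluster_means(cluster_labels, covid_deaths)
highest = max(averages, key=averages.get)  # cluster with the highest average
lowest = min(averages, key=averages.get)   # cluster with the lowest average
```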
Logistic regression is a supervised machine learning algorithm, used here as a binary classifier: the target (dependent) variable has two categories. The covid.deaths variable was transformed into a binary variable using its mean: a value below 292 is categorized as “Low” (low number of deaths), and a value greater than or equal to 292 is categorized as “High” (high number of deaths) (refer to Table 1.0). The categories “Low” and “High” were encoded as 0 and 1 using a label encoder in R.
Table 1.0: Statistics of Covid.deaths Variable
Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum
46 | 167.5 | 307 | 292.3 | 307 | 669
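The mean-threshold encoding of covid.deaths described above can be sketched in a few lines of Python. The threshold of 292 and the 0/1 coding come from the text; the function name and the sample values are hypothetical.

```python
def binarize(deaths, threshold=292):
    """Encode deaths as 0 ("Low", below the threshold) or 1 ("High",
    greater than or equal to the threshold)."""
    return [1 if d >= threshold else 0 for d in deaths]

# Sample values chosen around the threshold.
sample_deaths = [46, 291, 292, 669]
encoded = binarize(sample_deaths)
```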
The data were divided into 70% training data and 30% test data. The logistic model was trained on the 70% and tested on the remaining 30%. Accuracy is 82.31% on the training data and 85.71% on the test data (refer to Appendix Image 1.1).
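The split-and-score procedure can be sketched in Python as below. Only the 70/30 proportion comes from the text; the random seed, the function names, and the use of a shuffled split are assumptions, and the model fitting itself is omitted.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# 186 placeholder row indices stand in for the 186 countries.
train_rows, test_rows = train_test_split(range(186))
example_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

With 186 rows, this sketch yields 130 training rows and 56 test rows.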
Q2] K-mean Cluster
Multiclass algorithms classify a target (dependent) variable that has more than two categories. LDA, QDA, and logistic regression were used as multiclass algorithms. Four categories were created from the statistics of the covid.deaths variable. As per Table 1.0, a value less than or equal to 46 is categorized as “Excessively Low” (excessively low number of deaths); a value greater than 46 and less than or equal to 167 is categorized as “Low” (low number of deaths); a value greater than 167 and less than or equal to 307 is categorized as “High” (high number of deaths); and a value greater than 307 is categorized as “Excessively High” (excessively high number of deaths). For the LDA algorithm this variable was encoded with a label encoder: 0: “High”, 1: “Low”, 2: “Excess High”, 3: “Excess Low”.
For the QDA and logistic regression algorithms, four dummy variables named “Excess_High”, “High”, “low”, and “Excess_Low” were created from the dependent variable (Covid.deaths.cat4). QDA (Quadratic Discriminant Analysis) requires a sufficient number of observations in every class, but the defined categories have unequal numbers of samples. To avoid errors, the dummy variables were used to build four separate QDA models (refer to Appendix). LDA (Linear Discriminant Analysis) is a multiclass algorithm that does not have this issue, so a single LDA model was built with all four categories of the dependent variable.
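The four-category rule and the LDA label encoding described above can be written as a short Python sketch. The cut-offs (46, 167, 307) and the integer codes are taken from the text; the function and constant names are hypothetical.

```python
# Integer codes used for the LDA model, as listed in the text.
LDA_CODES = {"High": 0, "Low": 1, "Excess High": 2, "Excess Low": 3}

def categorize(deaths):
    """Map a covid.deaths value to one of the four categories."""
    if deaths <= 46:
        return "Excess Low"
    if deaths <= 167:
        return "Low"
    if deaths <= 307:
        return "High"
    return "Excess High"

# Sample values chosen at and around the cut-offs.
sample = [46, 47, 167, 168, 307, 308]
categories = [categorize(d) for d in sample]
codes = [LDA_CODES[c] for c in categories]
```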
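Building one dummy variable per category turns the four-class problem into four binary (one-vs-rest) problems. A minimal Python sketch follows; the dummy names are those given in the text, while the function name and the sample labels are hypothetical.

```python
DUMMY_NAMES = ["Excess_High", "High", "low", "Excess_Low"]

def one_vs_rest(categories, names=DUMMY_NAMES):
    """Create one 0/1 dummy variable per category name."""
    return {name: [1 if c == name else 0 for c in categories]
            for name in names}

# Made-up category labels for five countries.
cat4 = ["High", "low", "Excess_High", "High", "Excess_Low"]
dummies = one_vs_rest(cat4)
```

Each dummy variable can then serve as the binary target of its own QDA or logistic regression model.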
As per Table 1.1,
- Train accuracy of the LDA model is 72.31%, and test accuracy of the LDA model is 60.71%. Three linear discriminants (LD) with different proportions were obtained (refer to Appendix Image 1.2): the 1st LD proportion is 0.8786, the 2nd LD proportion is 0.0746, and the 3rd LD proportion is 0.0468.
- Train accuracy of the QDA_ExcessHigh model is 90%, and test accuracy of the QDA_ExcessHigh model is 71.43%.
- Train accuracy of the QDA_High model is 89.23%, and test accuracy of the QDA_High model is 80.36%.
- Train accuracy of the QDA_ExcessLow_Low model is 80.77%, and test accuracy of the QDA_ExcessLow_Low model is 80.36%. (To avoid a sample size error in QDA, the Excess Low and Low categories were combined to build this model; refer to Appendix Image 1.5.)
- Train accuracy of the QDA_Low model is 98.46%, and test accuracy of the QDA_Low model is 89.29%.
- Train accuracy of the LogisticRegression_ExcessHigh model is 81.54%, and test accuracy of the LogisticRegression_ExcessHigh model is 73.21%.
- Train accuracy of the LogisticRegression_High model is 80.77%, and test accuracy of the LogisticRegression_High model is 73.21%.
- Train accuracy of the LogisticRegression_ExcessLow model is 93.08%, and test accuracy of the LogisticRegression_ExcessLow model is 87.5%.
- Train accuracy of the LogisticRegression_Low model is 92.31%, and test accuracy of the LogisticRegression_Low model is 78.57%.
Table 1.1: Accuracy of Multiclass Algorithms
Models | Train.Accuracy | Test.Accuracy
LDA | 72.31% | 60.71%
QDA_ExcessHigh | 90% | 71.43%
QDA_High | 89.23% | 80.36%
QDA_ExcessLow_Low | 80.77% | 80.36%
QDA_Low | 98.46% | 89.29%
LogisticRegression_ExcessHigh | 81.54% | 73.21%
LogisticRegression_High | 80.77% | 73.21%
LogisticRegression_ExcessLow | 93.08% | 87.5%
LogisticRegression_Low | 92.31% | 78.57%
For the binary logistic regression model, the result of the McNemar test is displayed in the confusion matrix report. It is a non-parametric test based on the chi-square distribution. The null hypothesis for this test is “There is no significant difference between Low and High number of deaths across the countries”, and the alternative hypothesis is “There is a significant difference between Low and High number of deaths across the countries”. The p-values for the train and test data are 0.4042 and 0.28884 respectively, both much greater than 0.05, so the test fails to reject the null hypothesis and we conclude that “There is no significant difference between Low and High number of deaths across the countries”. In the model, only the coefficients of “Intercept”, “GDP capita”, “Population total”, and “Birth rate” are significant: their p-values are less than 0.05, which leads to rejection of the null hypothesis that the coefficient is zero. The coefficients of the remaining variables are not significant.
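For reference, the McNemar statistic of a 2×2 confusion matrix depends only on the two off-diagonal counts (the two kinds of misclassification). The Python sketch below uses the continuity-corrected form and the chi-square distribution with one degree of freedom; it is an illustration under these assumptions, not the routine that produced the p-values above, and the counts are made up.

```python
from math import erfc, sqrt

def mcnemar(b, c):
    """Continuity-corrected McNemar test.

    b and c are the off-diagonal counts of a 2x2 confusion matrix.
    Returns the chi-square statistic and its p-value (1 degree of freedom).
    """
    if b + c == 0:
        raise ValueError("McNemar test is undefined when b + c is zero")
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Made-up counts: 15 cases misclassified one way, 5 the other way.
statistic, p_value = mcnemar(15, 5)
```

A p-value above 0.05 means that the two kinds of misclassification do not occur at significantly different rates.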
Q3] Logistic Regression
The results of the McNemar test for all the multiclass models are displayed in the confusion matrix reports (refer to Appendix).
- The McNemar test is not suitable for LDA because it works with a 2×2 contingency table, whereas LDA has a 4×4 contingency table. The prior probabilities in LDA are 0.53077 for High, 0.17692 for Low, 0.19231 for Excess High, and 0.1 for Excess Low.
- The McNemar p-value is significant for the QDA_ExcessHigh and QDA_ExcessLow_Low models, but not for the QDA_High and QDA_Low models.
- The McNemar p-value is significant for the LogisticRegression_ExcessHigh and LogisticRegression_Low models, but not for the LogisticRegression_High and LogisticRegression_ExcessLow models.
- The AIC of the LogisticRegression_Low model is lower than that of the other logistic models, so it can be concluded that LogisticRegression_Low is the best-fitting model.
- In the LogisticRegression_ExcessLow model no coefficients are significant because all p-values are greater than 0.05. The LogisticRegression_Low model shows significance for the variables “Health.exp” and “Birth rate”; the LogisticRegression_High model shows significance for the variables “Life.expec”, “Elect.access”, “Unemployement”, and “Comp.education”; and the LogisticRegression_ExcessHigh model shows significance only for the variable “Comp.education”.
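The AIC comparison used above follows the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; the model with the lowest AIC is preferred. A minimal Python sketch is given below. The model names, log-likelihood values, and parameter counts are placeholders, not results from the study.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood

def best_by_aic(models):
    """Return the name of the model with the lowest AIC.

    models maps a name to a (log_likelihood, n_parameters) pair.
    """
    return min(models, key=lambda name: aic(*models[name]))

# Placeholder values for two hypothetical fitted models.
fitted = {"model_a": (-60.0, 5), "model_b": (-48.5, 5)}
best = best_by_aic(fitted)
```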
Conclusions
As per the above analysis, Covid deaths are negatively correlated with the mortality rate, i.e. where Covid deaths are higher (lower), the mortality rate tends to be lower (higher). Covid deaths are also negatively correlated with the birth rate, i.e. where Covid deaths are higher (lower), the birth rate tends to be lower (higher). The correlation between mortality rate and life expectancy is highly negative, i.e. when the mortality rate increases (decreases), life expectancy tends to decrease (increase). The optimal number of clusters for the countries, based on the other WDI indicators, is 7. According to the accuracy scores (refer to Table 1.1), the best models for categorizing Covid deaths around the world from the WDI indicators are multiclass QDA and multiclass logistic regression; the QDA algorithm works very well on these data for classifying the categories.
References
Abdullah, D., Susilo, S., Ahmar, A. S., Rusli, R., & Hidayat, R. (2021). The application of K-means clustering for province clustering in Indonesia of the risk of the COVID-19 pandemic based on COVID-19 data. Quality & Quantity, 1-9. https://doi.org/10.1007/s11135-021-01176-w
Georgiou, K., Mittas, N., Angelis, L., & Chatzigeorgiou, A. (2020, August). A preliminary study of knowledge sharing related to covid-19 pandemic in stack overflow. In 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (pp. 517-520). IEEE. doi: 10.1109/SEAA51224.2020.00086.
Hutagalung, J., Ginantra, N. L. W. S. R., Bhawika, G. W., Parwita, W. G. S., Wanto, A., & Panjaitan, P. D. (2021, February). COVID-19 Cases and Deaths in Southeast Asia Clustering using K-Means Algorithm. In Journal of Physics: Conference Series (Vol. 1783, No. 1, p. 012027). IOP Publishing.
Liang, J., Bi, G., & Zhan, C. (2020). Multinomial and ordinal Logistic regression analyses with multi-categorical variables using R. Annals of translational medicine, 8(16). doi: 10.21037/atm-2020-57
Mahanty, C., Kumar, R., & Mishra, B. K. (2021). Analyses the effects of COVID-19 outbreak on human sexual behaviour using ordinary least-squares based multivariate logistic regression. Quality & Quantity, 55(4), 1239-1259. https://doi.org/10.1007/s11135-020-01057-8
Medina-Mendieta, J. F., Cortés-Cortés, M., & Cortés-Iglesias, M. (2020). COVID-19 forecasts for Cuba using logistic regression and gompertz curves. MEDICC review, 22(3), 32-39. https://doi.org/10.37757/MR2020.V22.N3.8
Niyakan, S., & Qian, X. (2021). COVID-Datathon: Biomarker identification for COVID-19 severity based on BALF scRNA-seq data. arXiv preprint arXiv:2110.04986. https://doi.org/10.48550/arXiv.2110.04986
Pastrana, T., De Lima, L., Pettus, K., Ramsey, A., Napier, G., Wenk, R., & Radbruch, L. (2021). The impact of COVID-19 on palliative care workers across the world: A qualitative analysis of responses to open-ended questions. Palliative & supportive care, 19(2), 187-192. https://doi.org/10.1017/S1478951521000298
Sunori, S. K., Negi, P. B., Maurya, S., Juneja, P., & Rana, A. (2021, January). K-Means Clustering of Ambient Air Quality Data of Uttarakhand, India during Lockdown Period of Covid-19 Pandemic. In 2021 6th International Conference on Inventive Computation Technologies (ICICT) (pp. 1254-1259). IEEE. doi: 10.1109/ICICT50816.2021.9358627.
Tena, A., Clarià, F., & Solsona, F. (2022). Automated detection of COVID-19 cough. Biomedical Signal Processing and Control, 71, 103175. https://doi.org/10.1016/j.bspc.2021.103175
Zhang, T. (2020). Data mining can play a critical role in COVID-19 linked mental health studies. Asian journal of psychiatry, 54, 102399. doi: 10.1016/j.ajp.2020.102399
Zhang, Z. (2016). Residuals and regression diagnostics: focusing on logistic regression. Annals of translational medicine, 4(10). doi: 10.21037/atm.2016.03.36