Research Scopes
This report investigates health conditions and socio-demographic patterns in East Asia and Pacific countries. The researcher considered several health and socio-demographic indicators: Adolescent Fertility Rate, Crude Death Rate, Crude Birth Rate, Immunization Rate, GNI per capita and Total Alcohol Consumption. The data set covers the 15 years from 2001 to 2015. The analysis looks for associations among the socio-demographic variables and the health indicators. The findings are relevant to business representatives, government officials and policy planners.
The analysis aims to give a clear overview of the demographic and financial characteristics of the developed, developing and underdeveloped countries of the East Asia and Pacific region over the period 2001 to 2015. It also aims to identify the factors that drive or mediate development in these countries.
The research data were gathered from the World Bank website and are publicly available. Because the data set has many missing values, pre-processing and cleansing are essential. Once the data set has been processed and structured, it is ready for analysis. The researcher carried out an exploratory data analysis, which includes calculations and visualizations for “One variable analysis” (histograms and boxplots), “Two variable analysis” (grouped boxplots), “K-means clustering” and “Linear Regression” (scatterplots).
The analysis is limited to the few selected variables and to the “East Asia and Pacific” countries.
First, the raw Health and Population data are loaded into RStudio. The data set contains many missing values, indicated by “..”. With the help of several packages, the data are cleaned and structured, and the structured data are then brought into the workspace for analysis.
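The cleaning step described above can be sketched as follows. This is a minimal illustration in Python, not the R code used for the report: the column names "Country Name" and "Series Name", the wide one-column-per-year layout and every number in the sample are assumptions made for the example. It treats the “..” marker as a missing value and reshapes the wide table into long records.

```python
import csv
import io

# Hypothetical extract in a wide layout (one column per year).
# Column names are assumed and every number here is made up.
RAW = """Country Name,Series Name,2001,2002,2003
Japan,Crude death rate,7.7,..,8.0
Kiribati,Crude death rate,..,..,7.1
"""

def to_float(cell):
    """Return None for the '..' missing-value marker, else a float."""
    cell = cell.strip()
    return None if cell in ("..", "") else float(cell)

def load_long(text):
    """Reshape wide rows (one column per year) into long records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for key, cell in row.items():
            if key.isdigit():  # year columns only
                records.append({
                    "country": row["Country Name"],
                    "series": row["Series Name"],
                    "year": int(key),
                    "value": to_float(cell),
                })
    return records

records = load_long(RAW)
complete = [r for r in records if r["value"] is not None]
print(len(records), "cells,", len(complete), "with a value")
```

Keeping missing cells as None rather than dropping them at load time leaves the choice of how to handle them (omit, impute) to each later analysis step.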
The R packages used to clean and analyse the data set are “data.table”, “cluster”, “reshape2”, “dplyr”, “ggplot2”, “psych”, “lattice” and “factoextra” (Knezevic, Streibig and Ritz 2007).
The distribution of “Adolescent Fertility Rate” (births per 1,000 women aged 15 to 19) across the “East Asia and Pacific” countries in 2015 is visualized with a histogram. The histogram shows that:
- In 2015, the largest group of countries has an adolescent fertility rate of 10 to 20 (6 countries), followed by 45 to 55 (5 countries).
- The highest adolescent fertility rates are observed in two of the selected countries.
- Few countries have an adolescent fertility rate in the interval from 20 to 45 or below 15.
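The interval counts read off a histogram come from simple binning. The sketch below shows the idea in Python for illustration only (the report's plots were produced in R); the function name `bin_counts`, the bin width of 10 and the sample rates are all assumptions, not the report's data.

```python
from collections import Counter

def bin_counts(values, width, start=0.0):
    """Count values per half-open interval [lo, lo + width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)
        lo = start + index * width
        counts[(lo, lo + width)] += 1
    return dict(sorted(counts.items()))

# Made-up rates, not the report's data.
rates = [4.1, 12.0, 13.5, 18.9, 22.4, 47.0, 51.2, 63.8]
for (lo, hi), n in bin_counts(rates, width=10).items():
    print(f"[{lo:g}, {hi:g}): {n}")
```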
The distribution of GNI per capita of the “East Asia and Pacific” countries in 2015 shows that:
- The data set has a significant number of outliers.
- The three quartiles (Q1, median and Q3) of GNI per capita lie in the interval 0 US$ to 10,000 US$.
- The median GNI per capita is below 5,000 US$.
- The maximum GNI per capita of any individual country in 2015 is greater than 50,000 US$.
- The distribution of GNI per capita in 2015 is not symmetric: with a median below 5,000 US$ and a maximum above 50,000 US$, it has a long upper tail (positive skew).
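The quantities a boxplot displays (quartiles and outliers) can be computed directly. The following Python sketch is illustrative only and is not the report's R code; it assumes the common 1.5 * IQR whisker rule for flagging outliers, and the GNI values are made up.

```python
import statistics

def box_summary(values):
    """Quartiles plus the points beyond the 1.5 * IQR whiskers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Made-up GNI per capita values in US$, for illustration only.
gni = [900, 1500, 2100, 3000, 3800, 4600, 6000, 9500, 38000, 52000]
summary = box_summary(gni)
print(summary)

# A median much closer to q1 than to q3 points to a long upper tail.
upper_gap = summary["q3"] - summary["median"]
lower_gap = summary["median"] - summary["q1"]
print("right-skewed:", upper_gap > lower_gap)
```

Different software uses slightly different quantile conventions, so quartiles from this sketch need not match another tool's output to the last digit.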
The visualization of the variable “Total Alcohol Consumption per Capita” (litres of pure alcohol) indicates that:
- The largest numbers of countries have a total alcohol consumption per capita in the intervals 1.5 to 4.5 and 7.5 to 9.
- The smallest number of countries have a total alcohol consumption per capita in the range 9 to 11.5.
The grouped box plots show the distributions of the crude birth rate (per 1,000 people) in the East Asia and Pacific countries from 2001 to 2014. The crude birth rate is significantly higher in Timor-Leste than in any other country, followed by Solomon Islands. It is significantly lower in Hong Kong and Japan. The distribution is most concentrated in Kiribati and also shows little scatter in Hong Kong, Mongolia and Myanmar.
Research Methodology
The grouped box plots show the distributions of the crude death rate (per 1,000 people) in the East Asia and Pacific countries from 2001 to 2014. The crude death rate is higher in Japan than in any other country, followed by the Republic of Korea. It is lowest in Brunei Darussalam, followed by Singapore. The distribution is most concentrated in Macao and Australia and most scattered in Hong Kong, Mongolia and Myanmar.
“K-means clustering” is a clustering algorithm that groups the observations of a data set by similarity (Kanungo et al. 2002). It minimises the mean squared distance between the observations and the centroids of the clusters to which they are assigned. K-means clustering is executed in three steps: 1) determination of the number of clusters, 2) updating of the cluster centroids and 3) systematic separation of the data set into clusters (Ding and He 2004). Each observation is assigned to the centroid to which its distance is smallest.
The visualization above ranks the candidate numbers of clusters according to their importance. The appropriate number of clusters for “Crude Birth Rate” and “Crude Death Rate” is taken to be 3.
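The assign-and-update loop described above can be sketched in a few lines. This is a plain Python illustration of Lloyd's algorithm, not the R routine used in the report. It assumes the number of clusters is already chosen and that the starting centroids are supplied by hand (R's implementation typically picks random starts); the (death rate, birth rate) pairs are made up.

```python
import math

def kmeans(points, initial_centroids, iterations=100):
    """Lloyd's algorithm on equal-length tuples with given start centroids."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels

# Made-up (crude death rate, crude birth rate) pairs in three loose groups.
points = [
    (6.5, 11.0), (6.8, 10.5), (6.2, 12.1),
    (5.6, 19.0), (5.9, 20.2), (5.5, 18.7),
    (6.9, 29.0), (6.4, 30.5), (6.8, 28.8),
]
centroids, labels = kmeans(points, [points[0], points[3], points[6]])
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, a poor start can leave the loop in a worse local solution; running several starts and keeping the best one is a common safeguard.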
The sizes of the 3 clusters are 10, 11 and 10 respectively, so the largest cluster contains 11 countries. The crude death rates of the three centroids are 5.697921, 6.694909 and 6.546897, and the corresponding crude birth rates are 19.34917, 29.48784 and 11.43986. The ratio of the “Between Sum of Squares” to the “Total Sum of Squares” is 84.9%, which means that most of the variation in the data lies between the clusters rather than within them.
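The ratio quoted above compares the variation between clusters with the total variation. A minimal Python sketch of that calculation follows; it is an illustration rather than the report's R output, the function name `ss_ratio` is hypothetical, and the four sample points are made up so that the arithmetic is easy to follow by hand.

```python
def ss_ratio(points, labels):
    """Between-cluster sum of squares divided by total sum of squares."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand = mean(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, assigned in zip(points, labels) if assigned == lab]
        centre = mean(members)
        within_ss += sum(sq_dist(p, centre) for p in members)
    # Total SS splits into within SS plus between SS.
    return (total_ss - within_ss) / total_ss

# Made-up example: two tight clusters far apart on the first axis.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
labels = [0, 0, 1, 1]
ratio = ss_ratio(points, labels)
print(f"{ratio:.1%}")
```

In this toy case the total sum of squares is 68 and the within-cluster sum is 4, so the ratio is 64/68. A value near 100% means the clusters are compact relative to the spread of the whole data set.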
The graph above shows three clusters of countries by crude death rate and crude birth rate (per 1,000 people) between 2001 and 2014:
- Low crude birth rate and high crude death rate (such as Japan, Australia and Singapore).
- Moderate crude birth rate and moderate crude death rate (such as Myanmar, Brunei Darussalam and Guam).
- High crude birth rate and low crude death rate (such as Lao, Kiribati and Solomon Islands).
In simple terms, linear regression analysis estimates the linear association between two variables and tests whether it is significant (Montgomery, Peck and Vining 2012). One variable is the “Dependent variable” and the other is the “Independent variable”. The independent variable is used to estimate the dependent variable. The linear regression model is given as:
y = a + b*x
Here, y = Dependent variable, x = Independent variable, a = Intercept and b = Slope (Neter, Wasserman and Kutner 1989).
The slope signifies the change in the value of “y” for each unit change in the value of “x”.
Two linear regression models are fitted. The first considers “Crude Birth Rate” as the dependent variable and “Adolescent Fertility Rate” as the independent variable. The second considers “Crude Death Rate” as the dependent variable and “Immunization Rate of BCG” as the independent variable. Note that all years except 2015 are considered in the regression analysis.
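The intercept and slope of such a model are usually obtained by ordinary least squares. The sketch below shows the textbook formulas in Python for illustration; it is not the R call used for the report, the function name `fit_line` is hypothetical, and the data points are made up to lie exactly on a known line.

```python
def fit_line(x, y):
    """Least-squares intercept a, slope b and R-squared for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope: change in y per unit change in x
    a = mean_y - b * mean_x    # intercept: fitted y at x = 0
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Made-up points that lie exactly on y = 2 + 0.5 * x.
x = [0.0, 2.0, 4.0, 6.0]
y = [2.0, 3.0, 4.0, 5.0]
a, b, r2 = fit_line(x, y)
print(a, b, r2)
```

The third return value, R-squared, is the share of the variability of y that the fitted line explains.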
Exploratory Data Analysis
1) Crude Birth Rate and Adolescent Fertility Rate:
The linear regression model can be described as:
Crude Birth Rate = 14.21101 + 0.211774 * Adolescent Fertility Rate (Draper and Smith 2014)
According to the multiple R² value, “Adolescent Fertility Rate” explains 34.59% of the variability of “Crude Birth Rate”. The p-value of the overall model is 0.0025 (F-statistic 11.64), so the overall model is significant. The p-value reported for the independent variable (0.003969) is less than 0.01; therefore, the independent variable has a significant linear association with the dependent variable. Notably, “Adolescent Fertility Rate” has a positive association with “Crude Birth Rate” (b ≈ 0.212): as “Adolescent Fertility Rate” increases, “Crude Birth Rate” also increases, and vice versa (Seber and Lee 2012).
The fitted scatter plot confirms the association between these variables. The fitted trend line for “Crude Birth Rate” and “Adolescent Fertility Rate” shows that countries with a greater adolescent fertility rate also have a greater crude birth rate (per 1,000 people). Hence, it is evident that a greater adolescent fertility rate goes with a higher crude birth rate.
2) Crude Death Rate and Immunization Rate:
The linear regression model could be described as-
Crude Death Rate = 7.83514 – 0.01498 * Immunization Rate
According to the multiple R² value, “Immunization Rate” explains only 0.94% of the variability of “Crude Death Rate”. The p-value of the overall model is 0.6526 (F-statistic 0.2083), so the overall model is not statistically significant. The p-value (0.6526) is far greater than 0.05; therefore, the independent variable does not have a significant linear association with the dependent variable. The estimated slope is negative (b = -0.01498), that is, “Crude Death Rate” tends to decrease as “Immunization Rate” increases, but this direction should be read with caution because the association is not significant.
The fitted scatter plot can be used to verify the association between these variables. The trend line fits the data poorly and the association is not significant. However, the trend line for “Crude Death Rate” and “Immunization Rate” suggests that countries with a greater immunization rate have a lower crude death rate (per 1,000 people).
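As a quick arithmetic check on the two sets of reported statistics: for a regression with a single predictor, the F-statistic and R² are linked by F = R² / (1 - R²) * df, where df is the residual degrees of freedom. The Python sketch below (an illustration, not part of the report's R workflow; the function name is hypothetical) solves this for df using the R² and F values quoted above. The rounding of the reported figures limits the precision of the result.

```python
def implied_residual_df(r_squared, f_statistic):
    """Residual degrees of freedom implied by F = R2 / (1 - R2) * df
    for a regression with a single predictor."""
    return f_statistic * (1 - r_squared) / r_squared

# R-squared and F values as reported for the two models above.
reported = [
    ("birth rate model", 0.3459, 11.64),
    ("death rate model", 0.0094, 0.2083),
]
implied = {name: implied_residual_df(r2, f) for name, r2, f in reported}
for name, df in implied.items():
    print(name, round(df, 1))
```

Both models give nearly the same implied value, which would be consistent with the two regressions having been fitted on the same number of observations.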
Research Conclusion:
The analysis investigates the nature of, and the hidden factors behind, the selected variables over the period 2001 to 2015 in the East Asia and Pacific countries. The one-variable analysis indicates that crude birth and death rates in the East Asia and Pacific countries are positively skewed. GNI per capita and adolescent fertility rates in 2015 are highly scattered and asymmetric across the countries. Per capita alcohol consumption in 2015 is also asymmetric and uneven. In most countries the crude death rate is lower than the crude birth rate, with exceptions such as Australia and Japan. For most countries, GNI per capita is below 10,000 US$. The two-variable analysis indicates that the countries with a higher crude birth rate outnumber those with a lower crude birth rate, while the opposite holds for the crude death rate. Based on their crude birth rates and crude death rates in 2001-2014, the countries are divided into three clusters. Moreover, a significant positive link was found between “Adolescent Fertility Rate” and “Crude Birth Rate”. An insignificant negative link was obtained between “Immunization Rate (BCG)” and “Crude Death Rate”.
When working with the raw data, the analyst found it difficult to select the correct attributes. The missing values were a nuisance. Cleaning and framing this data set was not an easy task, and the “melting” and “casting” steps (reshaping the data between wide and long layouts) were tough. The data cleaning and merging could also be done in other ways and with other packages. The analysis could not cover all the variables, which is a limitation of the research. In addition, the analyst found it difficult to determine the appropriate number of clusters. However, the analytical outcomes are satisfactory overall and give a fair reflection of reality.
References:
Ding, C. and He, X., 2004, July. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning (p. 29). ACM.
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R. and Wu, A.Y., 2002. An efficient k-means clustering algorithm: Analysis and implementation. IEEE transactions on pattern analysis and machine intelligence, 24(7), pp.881-892.
Knezevic, S.Z., Streibig, J.C. and Ritz, C., 2007. Utilizing R software package for dose-response studies: the concept and data analysis. Weed Technology, 21(3), pp.840-848.
Montgomery, D.C., Peck, E.A. and Vining, G.G., 2012. Introduction to linear regression analysis (Vol. 821). John Wiley & Sons.
Neter, J., Wasserman, W. and Kutner, M.H., 1989. Applied linear regression models.
Seber, G.A. and Lee, A.J., 2012. Linear regression analysis (Vol. 329). John Wiley & Sons.