Data Cleaning and Processing
The purpose of this data analysis report is to explore health and population statistics using a secondary dataset retrieved from the World Bank, covering the years 2011 to 2015. The data relate to the East Asia and Pacific region. The aim is to perform a thorough statistical analysis using the statistical software “R”.
This dataset on vital statistics provides the opportunity to perform both one-variable and two-variable analysis, along with graphical representations of the variables under consideration. Beyond these analyses, the researcher can also carry out regression analysis and segmentation of the data.
For the clustering analysis, k-means clustering is performed in R after loading the data from a CSV file and installing the necessary libraries.
The dataset is secondary in origin and is restricted to data collected from the World Bank only.
The dataset, which is contained in a CSV file (‘HealthandPopulationStatistics.csv’), is first processed in Excel. Many data fields do not contain a value; instead they contain ‘..’, which represents unrecorded data. All data cleaning and reshaping is done in R. Several R packages are used to run the entire program.
The packages required for this problem, installed in R (version 3.5.0), are listed below:
- “data.table”
- “reshape2”
- “psych”
- “ggplot2”
- “lattice”
- “dplyr”
As the first step, the data is cleaned and processed so that the required statistical procedures can be run. The following code shows the data cleaning.
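The handling of the ‘..’ placeholders can be sketched as follows. This is a minimal illustration, not the report's own code: the inline extract, its column names, and the object name `health` are assumptions made for the example, and only base R is used.

```r
# Hypothetical extract in a World Bank style layout; ".." marks unrecorded values.
csv_text <- 'Country Code,Series Name,2011,2012
AAA,Crude birth rate,20.1,..
AAA,Crude death rate,..,7.2
BBB,Crude birth rate,12.4,12.0'

# na.strings turns every ".." into NA, so the year columns are parsed as numbers.
# For the real file, the first argument would be the CSV file name instead of text =.
health <- read.csv(text = csv_text, na.strings = "..",
                   check.names = FALSE, stringsAsFactors = FALSE)

# Count the missing values in each column.
colSums(is.na(health))
```

Setting `check.names = FALSE` keeps the year columns named `2011` and `2012` instead of letting R prefix them with `X`.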
Next, it can be seen that the given dataset holds the attributes in rows and the years in columns. The data therefore has to be reshaped. The following code shows the reshaping of the data using the “reshape2” package in R.
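The reshaping step can be illustrated with a small, dependency-free sketch. Note that it uses base R's `reshape()` rather than “reshape2”, whose `melt()` and `dcast()` functions do the same job; the table `wide` and all of its values are hypothetical.

```r
# Hypothetical wide table: one row per country and indicator, one column per year.
wide <- data.frame(
  country   = c("AAA", "AAA", "BBB", "BBB"),
  indicator = c("birth_rate", "death_rate", "birth_rate", "death_rate"),
  X2011     = c(20, 7, 12, 9),
  X2012     = c(19, 7, 11, 9),
  stringsAsFactors = FALSE
)

# Step 1: stack the year columns into a single "year" column (long format).
long <- reshape(wide, direction = "long",
                varying = c("X2011", "X2012"), v.names = "value",
                timevar = "year", times = c(2011, 2012),
                idvar = c("country", "indicator"))

# Step 2: spread the indicators into columns, giving one row per country and year.
by_indicator <- reshape(long, direction = "wide",
                        idvar = c("country", "year"),
                        timevar = "indicator", v.names = "value")
rownames(by_indicator) <- NULL
by_indicator
```

After the second step each indicator has its own column (`value.birth_rate`, `value.death_rate`), which is the layout that plotting and modelling functions generally expect.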
The box-plot is used here to display the adolescent fertility rate, that is, births per thousand women aged 15 to 19. The second quartile (the median) of the distribution, shown by the black line in the plot, is around 28. The mean of the fertility rate is higher than the median, which suggests that the distribution is skewed to the right. The rate ranges from 0 to approximately 75. No outliers are seen in the box-plot, which indicates that all the values lie within the whiskers.
The required R code shows how the Crude birth rate is analysed, along with a histogram of the distribution of the data.
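A box-plot of this kind, and the comparison of mean and median, can be sketched in base R as follows. The vector `fertility` holds made-up values chosen only to span a similar range; it is not the report's data.

```r
# Hypothetical adolescent fertility rates (births per 1,000 women); not the report's data.
fertility <- c(4, 9, 14, 20, 26, 28, 31, 40, 52, 63, 75)

# Draw the box-plot; the thick line inside the box marks the median.
boxplot(fertility, horizontal = TRUE,
        main = "Adolescent fertility rate (illustrative values)")

# A mean above the median points to a longer right-hand tail.
c(mean = mean(fertility), median = median(fertility))
```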
Box-Plot Analysis of Fertility Rate
Histogram of Crude Birth Rate
The data is represented as a series of rates across several countries. The histogram shows that the distribution is skewed to the left, that is, the data is negatively skewed. The highest crude birth rate shown lies at or above roughly 80 to 85 births per 1,000 people.
The following R code shows the analysis of total life expectancy (in years) together with its box-plot.
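The direction of skew read off a histogram can be cross-checked numerically. The sketch below uses hypothetical values with a long left tail and a small helper, `skewness()`, written here for illustration (base R has no built-in skewness function).

```r
# Hypothetical crude birth rates (per 1,000 people) with a long left tail.
birth_rate <- c(8, 12, 15, 17, 18, 19, 19, 20, 20, 21, 21, 22)

hist(birth_rate, main = "Crude birth rate (illustrative values)",
     xlab = "Births per 1,000 people")

# Moment-based skewness: negative means left-skewed, positive means right-skewed.
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
}
skewness(birth_rate)
```

For a left-skewed sample such as this one, the mean also falls below the median.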
Box-plot of Total Life Expectancy (years)
It can be seen that there are no outliers in the data. The second quartile, or median, is 72, and the mean and the median are almost equal. The lowest life expectancy is around 58 and the highest is 83.
The total unemployment rate is analysed by country, showing how the rate differs from one country to another. The distribution of the unemployment rate for each East Asian and Pacific country is represented graphically with a side-by-side box-plot. The code is shown below along with the box-plot.
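Whether a box-plot contains outliers can also be checked directly with `boxplot.stats()`, which applies the same 1.5 × IQR rule that `boxplot()` uses. The values in `life_exp` are hypothetical and were chosen only to match the summary figures quoted above.

```r
# Hypothetical life expectancies in years; not the report's data.
life_exp <- c(58, 63, 66, 68, 70, 72, 72, 74, 76, 79, 83)

stats <- boxplot.stats(life_exp)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values beyond 1.5 * IQR from the hinges; empty means no outliers
```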
The side-by-side box-plot is used when one variable is quantitative and the other is categorical. Here the total unemployment rate is quantitative, while the list of countries is qualitative. The diagram contains 37 box-plots, one showing the distribution of the total unemployment rate for each of the 37 countries. There are outliers for country codes ‘CHN’, ‘LAO’, ‘MYS’, ‘SLB’, ‘THA’, and ‘VNM’. Country code ‘SLB’ has two outliers, and each of the remaining countries has only one. The unemployment rate is highest for country code ‘PHL’. The spread of the total unemployment rate is very small for country codes ‘MMR’, ‘PRK’, and ‘SLB’.
The analysis of total health expenditure for each country is similar to the previous analysis of the total unemployment rate. Here too, the distribution of health expenditure is represented with a side-by-side box-plot. The following code shows the analysis of total health expenditure for the 37 countries.
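A side-by-side box-plot of one quantitative variable against one categorical variable can be sketched with the formula interface of `boxplot()`. The data frame `unemp`, its three country codes, and its values are all hypothetical.

```r
# Hypothetical unemployment rates (%) for three made-up country codes, five years each.
unemp <- data.frame(
  country = rep(c("AAA", "BBB", "CCC"), each = 5),
  rate    = c(3.1, 3.0, 3.2, 3.1, 6.0,   # AAA: one unusually high year
              7.0, 7.4, 6.8, 7.2, 7.1,
              1.0, 1.1, 1.0, 1.2, 1.1)
)

# The formula interface draws one box per level of the categorical variable.
boxplot(rate ~ country, data = unemp,
        xlab = "Country code", ylab = "Total unemployment rate (%)")

# Number of points flagged as outliers (beyond 1.5 * IQR) in each country.
n_out <- tapply(unemp$rate, unemp$country,
                function(v) length(boxplot.stats(v)$out))
n_out
```

Counting the flagged points per group gives the same information as reading the individual outlier markers off the plot.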
Side-by-side box-plots of Total health expenditure.
Among the 37 box-plots representing the distribution of health expenditure for each of the 37 countries, the ones to note are those for country codes ‘KHM’, ‘KIR’, ‘PLW’, ‘SGP’, ‘TLS’, ‘TUV’, ‘VUT’, and ‘WSM’. The spread between the two whiskers is largest for country code ‘NRU’. The smallest spread is shown by country code ‘MMR’.
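The spread between the whiskers can be compared across countries numerically as well as visually. The sketch below is one possible way to do this, using hypothetical data in `health_exp`; the measure `whisker_span` is a name introduced here for illustration.

```r
# Hypothetical health expenditure (% of GDP) for three made-up country codes.
health_exp <- data.frame(
  country = rep(c("AAA", "BBB", "CCC"), each = 4),
  pct_gdp = c(4.0, 4.1, 4.0, 4.2,
              3.0, 6.5, 9.0, 12.0,
              5.0, 5.5, 6.0, 6.4)
)

# Whisker-to-whisker spread per country: upper whisker minus lower whisker.
whisker_span <- tapply(health_exp$pct_gdp, health_exp$country, function(v) {
  s <- boxplot.stats(v)$stats
  s[5] - s[1]
})

# Countries ordered from widest to narrowest spread.
sort(whisker_span, decreasing = TRUE)
```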
Explanation of K-means Clustering
K-means clustering is a well-known non-hierarchical clustering technique that is widely used for segmentation (Cohen et al. 2015). The technique partitions the observations into groups so as to minimise the squared distance between each observation and the centroid of its cluster (Silverman 2018).
In this study, the clustering is performed on the Crude death rate and the Crude birth rate. The following code shows the k-means clustering, and the following diagram shows the clustering graphically.
The cluster analysis groups the observations by Crude death rate and Crude birth rate. First, the data is scaled. There was no data for the year 2015, so a subset of the data excluding 2015 is taken. The data is then reshaped again. After that, the plot is prepared with the range of the x-variable set from 0 to 10.
The diagram shows that there are two optimal clusters, drawn in red and green. Each cluster is a group of country codes rather than a single variable: the red cluster contains the country codes in which a higher death rate goes with a higher birth rate, while the green cluster contains the country codes that have a lower birth rate but a higher death rate.
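The scale-then-cluster sequence described above can be sketched with the base R functions `scale()` and `kmeans()`. The data frame `rates` is hypothetical, and the choice of `nstart = 25` is an assumption made for the example, not a setting taken from the report.

```r
set.seed(42)  # k-means starts from random centroids, so fix the seed for repeatability

# Hypothetical crude birth and death rates (per 1,000 people) for ten made-up countries.
rates <- data.frame(
  birth = c(10, 11, 12, 11, 10, 30, 32, 31, 29, 33),
  death = c( 9,  8,  9, 10,  9,  6,  7,  6,  5,  7)
)

# Standardise both variables so that neither dominates the Euclidean distance.
scaled <- scale(rates)

# Fit two clusters; nstart repeats the random start and keeps the best solution.
fit <- kmeans(scaled, centers = 2, nstart = 25)

fit$cluster        # cluster label for each country
fit$tot.withinss   # total within-cluster sum of squares that k-means minimises

# Scatter-plot coloured by cluster membership.
plot(rates$birth, rates$death, col = c("red", "green")[fit$cluster], pch = 19,
     xlab = "Crude birth rate", ylab = "Crude death rate")
```

Which group receives label 1 and which receives label 2 is arbitrary, so the colours identify groups of countries and carry no meaning beyond that.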
Linear regression helps to determine whether a linear relationship exists between two variables (Faraway 2016). The linear regression equation is Y = a + bX, where Y is the dependent variable, ‘a’ is the y-intercept, ‘b’ is the slope, and X is the explanatory or independent variable. Linear regression predicts the value of the dependent variable from the independent variable (Darlington and Hayes 2016).
The following code shows the fitting of the regression equation for Female unemployment rate against Female tertiary school enrollment, together with its diagrammatic representation.
In the above scatterplot, the regression line shows a negative linear relationship between the two variables. It indicates that a one-unit increase in the Female school enrollment rate is associated with a decrease in the Female unemployment rate. The points are more or less scattered around the regression line.
The required code for the regression of Crude death rate on Immunization rate is shown below along with the scatter-plot and the regression line.
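The roles of ‘a’ and ‘b’ in Y = a + bX can be seen in a minimal sketch with `lm()`. The data are constructed to follow a known line exactly, so the estimated coefficients can be checked against it; the variable names echo the first regression in this report, but the values are hypothetical.

```r
# Hypothetical data that follow Y = 12 - 0.1 * X exactly.
enrollment   <- c(10, 20, 30, 40, 50, 60)
unemployment <- 12 - 0.1 * enrollment

# lm() estimates the intercept a and the slope b of Y = a + bX by least squares.
fit <- lm(unemployment ~ enrollment)
coef(fit)   # intercept close to 12, slope close to -0.1

# Scatter-plot with the fitted regression line.
plot(enrollment, unemployment,
     xlab = "Female tertiary school enrollment", ylab = "Female unemployment rate")
abline(fit)

# Predicted Y for a new X value.
predict(fit, newdata = data.frame(enrollment = 35))
```

With real data the points do not fall exactly on the line, and the coefficients are then the values that minimise the sum of squared vertical distances.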
Scatter-plot for regression
In the above scatter-plot, the regression line shows a negative relationship between the two variables: the slope of the independent variable, Immunization rate, is negative. This indicates that a one-unit increase in the Immunization rate is associated with a decrease in the Crude death rate. However, the values are widely scattered around the regression line, which suggests that the fit is not good despite the negative slope.
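How closely the points follow a regression line can be quantified with the coefficient of determination (R-squared). The sketch below uses hypothetical values for `immunization` and `death_rate` that give a negative slope with wide scatter; it is not the report's data.

```r
# Hypothetical immunization rates (%) and crude death rates (per 1,000 people).
immunization <- c(60, 65, 70, 75, 80, 85, 90, 95)
death_rate   <- c(9.5, 6.0, 9.0, 5.5, 8.0, 5.0, 7.5, 4.5)

fit <- lm(death_rate ~ immunization)

coef(fit)[["immunization"]]   # negative slope: death rate falls as immunization rises
summary(fit)$r.squared        # share of variance explained; low values mean wide scatter

# Residuals are the vertical distances of the points from the regression line.
round(residuals(fit), 2)
```

A negative slope together with a low R-squared corresponds to a downward-sloping line with widely scattered points.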
Conclusion
From the data analysis above, it can be concluded that the variables lend themselves well to one-variable and two-variable analysis. The k-means clustering shows that two optimal clusters are obtained when Crude death rate and Crude birth rate are taken into account. The two regression lines, in turn, show negative linear relationships between Female unemployment rate and Female tertiary school enrollment, and between Immunization rate and Crude death rate. These conclusions are reflected in the graphical representations.
This data analysis report demonstrates prediction using linear regression analysis. R has been used for the entire derivation and graphical representation. However, more statistical analysis could be done on this data, and better interpretation could be obtained if there were no missing values.
References
Cohen, M.B., Elder, S., Musco, C., Musco, C. and Persu, M., 2015, June. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (pp. 163-172). ACM.
Darlington, R.B. and Hayes, A.F., 2016. Regression analysis and linear models: Concepts, applications, and implementation. Guilford Publications.
Faraway, J.J., 2016. Extending the linear model with R: generalized linear, mixed effects and nonparametric regression models (Vol. 124). CRC press.
Silverman, B.W., 2018. Density estimation for statistics and data analysis. Routledge.