Data Cleaning and Exploration
The main purpose of this report is to explore data on health and population statistics. The data was retrieved from the World Bank, which makes it a secondary source. The dataset covers health and population indicators for countries across East Asia and the Pacific over the years 2001 to 2015. The analysis is performed in R, using RStudio.
The dataset contains many missing values, including for variables that are assumed to be important.
The dataset supports one-variable and two-variable analyses, each with a graphical representation. For further analysis, k-means cluster analysis and regression analysis can also be performed.
Exploratory data analysis will be performed with descriptive statistics, and the advanced analysis with k-means clustering and regression analysis.
The World Bank Health and Population dataset is in .csv format. It contains numerous attributes, but values are not present for every attribute in every year. These unrecorded values have to be cleaned before the analysis. All cleaning and extraction of the data is performed in R (RStudio) with the help of various packages. The packages required for the analysis are as follows:
- data.table: Used to provide an enhanced data frame and faster data manipulation
- reshape2: Used to reshape the data
- psych: Used for descriptive statistics and multivariate analysis
- ggplot2: Used for plotting data
- lattice: Graphics package
- dplyr: Used for data cleaning and manipulation
The R code used for data cleaning is given in the following table:
###===============Importing Data File in R=================###
# ".." marks unrecorded values in the World Bank export
health <- read.csv(file.choose(), header = TRUE, sep = ",", na.strings = "..", blank.lines.skip = TRUE)
###===================Libraries Used===================###
library(data.table)
library(reshape2)
library(psych)
library(ggplot2)
library(lattice)
library(dplyr)
###======================Data Cleaning======================###
health1 <- data.table(health)
# Keep only the series used in the analysis
health1 <- health1[Series.Code %in% c("SP.ADO.TFRT", "SP.DYN.CBRT.IN", "SP.DYN.LE00.IN", "SL.UEM.TOTL.ZS", "SL.UEM.TOTL.FE.ZS", "NY.GNP.PCAP.CD", "SH.XPD.TOTL.ZS")]
# Drop the 2015 column, then drop every row that still has a missing value
health1 <- health1[, "X2015..YR2015." := NULL]
health1 <- na.omit(health1)
str(health1)
View(health1)
###=============Converting Years in Rows and Attributes in Columns=============###
# No id.vars are given, so melt() keeps the non-numeric columns as identifiers and stacks the year columns
health1 <- melt(health1)
View(health1)
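The conversion of year columns into rows can be illustrated on a small table. The following is a minimal sketch only: the country codes and values are hypothetical, and naming the identifier columns through id.vars is an assumption about how the step could be written, not the code used for the report.

```r
library(reshape2)  # provides melt()

# Hypothetical wide table: one row per country and series, one column per year
wide <- data.frame(
  Country.Code = c("AAA", "BBB"),
  Series.Code = c("SP.DYN.CBRT.IN", "SP.DYN.CBRT.IN"),
  X2001..YR2001. = c(20.1, 15.3),
  X2002..YR2002. = c(19.8, NA)
)

# Stack the year columns into rows, keeping the two code columns as identifiers
long <- melt(wide,
             id.vars = c("Country.Code", "Series.Code"),
             variable.name = "year",
             value.name = "value")

# Recover the numeric year from column names such as "X2001..YR2001."
long$year <- as.integer(sub("^X([0-9]{4}).*$", "\\1", as.character(long$year)))

# Drop the rows whose value was not recorded
long <- na.omit(long)
print(long)
```

Stating id.vars explicitly makes the result independent of how melt() guesses the identifier columns.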
The R code in table 3.1 and the boxplot in figure 3.1 show the summary measures of the variable adolescent fertility rate.
The boxplot shows the distribution of the fertility rate in women between 15 and 19 years of age. The black line in the plot shows the median of the distribution, which is found to be less than the mean. Thus, the distribution is positively skewed and more than half of the observations lie below the average rate. There are no outliers in the data.
Table 3.1
###============Adolescent Fertility Rate================###
exp1 <- health1[Series.Code %in% "SP.ADO.TFRT"]
describe(exp1$value)
fill <- "green"
exp_plot1 <- ggplot(exp1, aes(x = factor(0), y = value)) + geom_boxplot(fill = fill)
exp_plot1 <- exp_plot1 + xlab("Adolescent fertility rate (births per 1,000 women ages 15-19)") + scale_x_discrete(breaks = NULL)
exp_plot1 <- exp_plot1 + ggtitle("Distribution of Adolescent fertility rate (births per 1,000 women ages 15-19)") + theme_bw()
plot(exp_plot1)
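The comparison of the median with the mean used in the interpretation above can also be checked numerically. The sketch below uses a hypothetical sample, not the World Bank values, and relies only on base R.

```r
# Hypothetical positively skewed sample (not taken from the dataset)
rates <- c(5, 8, 10, 12, 15, 18, 22, 30, 45, 60)

m <- mean(rates)
med <- median(rates)

# Share of observations that lie below the mean
share_below_mean <- mean(rates < m)

cat("mean:", m, " median:", med, " share below mean:", share_below_mean, "\n")
```

When the median is below the mean, more than half of the observations necessarily lie below the mean, which is the pattern described for the adolescent fertility rate.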
Summary of Crude Birth Rate
The R code in table 3.3 and the histogram in figure 3.2 show the summary measures of the variable Crude Birth Rate. The histogram shows the distribution of the Crude Birth Rate. The bars show that the data is negatively skewed, which indicates that in most cases the crude birth rate is high.
###============Crude Birth Rate================###
exp2 <- health1[Series.Code %in% "SP.DYN.CBRT.IN"]
describe(exp2$value)
View(exp2)
n <- length(exp2$value)
r <- diff(range(exp2$value))
barfill <- "blue"
barlines <- "black"
exp_plot2 <- ggplot(exp2, aes(x = value)) + geom_histogram(binwidth = r/(log2(n)+1), colour = barlines, fill = barfill)
exp_plot2 <- exp_plot2 + scale_x_continuous("Birth rate, crude (per 1,000 people)", breaks = seq(0,24,3), limits = c(0,21)) + scale_y_continuous("Count") + theme_bw()
exp_plot2 <- exp_plot2 + ggtitle("Distribution of Birth rate, crude per 1,000 people") + theme(plot.title = element_text(hjust = 0.5))
plot(exp_plot2)
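The bin width in the histogram code is the range of the data divided by log2(n) + 1, a Sturges-style rule for the number of bins. A minimal sketch of that calculation on hypothetical values (not the dataset) is given below, using base R only.

```r
# Hypothetical birth-rate values (16 observations between 10 and 20)
values <- c(10, 11, 12, 12.5, 13, 14, 14.5, 15,
            15.5, 16, 17, 17.5, 18, 19, 19.5, 20)

n <- length(values)          # number of observations
r <- diff(range(values))     # range of the data
k <- log2(n) + 1             # Sturges-style number of bins
binwidth <- r / k            # width of each bin

# Count the observations per bin without drawing the histogram
h <- hist(values, breaks = seq(min(values), max(values), by = binwidth), plot = FALSE)

cat("bins:", k, " bin width:", binwidth, "\n")
print(h$counts)
```

Because log2(n) + 1 is generally not a whole number, the bins produced this way need not divide the range exactly in real data; the example uses n = 16 so that the arithmetic is easy to follow.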
Summary of Total Life Expectancy at Birth
The R code in table 3.5 and the boxplot in figure 3.3 show the summary measures of the variable Total Life Expectancy at Birth.
The boxplot shows the distribution of the Total Life Expectancy at Birth. The black line in the plot shows the median of the distribution, which is almost equal to the mean. Thus, it can be said that the Total Life Expectancy at Birth is approximately symmetrically distributed. There are no outliers in the data.
Table 3.5
###============Total Life Expectancy at Birth================###
exp3 <- health1[Series.Code %in% "SP.DYN.LE00.IN"]
describe(exp3$value)
fill <- "green"
exp_plot3 <- ggplot(exp3, aes(x = factor(0), y = value)) + geom_boxplot(fill = fill)
exp_plot3 <- exp_plot3 + xlab("Life expectancy at birth, total (years)") + scale_x_discrete(breaks = NULL)
exp_plot3 <- exp_plot3 + ggtitle("Life expectancy at birth, total (years)") + theme_bw()
plot(exp_plot3)
Comparison of Adolescent Fertility Rate Across Countries
The total unemployment rate is summarised for each country. The comparison is made with boxplots, which show how the unemployment rate varies within and between countries.
A boxplot is used here because one variable is numerical and the other is categorical. Across the 37 boxplots, one per country, outliers appear for the countries coded CHN, LAO, MYS, SLB, THA and VNM; SLB has two outliers. PHL has the highest rate of unemployment. MMR, PRK and SLB have the lowest unemployment rates.
The total health expenditure is summarised for each country. The comparison is made with boxplots, which show how the total health expenditure varies within and between countries.
A boxplot is used here because one variable is numerical and the other is categorical. Across the 37 boxplots, one per country, outliers appear for the countries coded KHM, KIR, PLW, SGP, TLS, TUV, VUT and WSM; KHM and KIR have two outliers each. NRU has the highest health expenditure. MMR has the lowest health expenditure.
###============Total Health Expenditure (Country Wise)================###
exp6 <- health1[Series.Code %in% "SH.XPD.TOTL.ZS"]
fill <- "pink"
# Refer to the columns by name inside aes() rather than through exp6$...
exp_plot5 <- ggplot(exp6, aes(x = Country.Code, y = value)) + geom_boxplot(fill = fill)
exp_plot5 <- exp_plot5 + scale_x_discrete(name = "Country") + scale_y_continuous(name = "Total Health Expenditure") + theme_bw()
# Rotate the country codes so that they do not overlap
exp_plot5 <- exp_plot5 + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Total Health Expenditure")
plot(exp_plot5)
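The outliers read from the country-wise boxplots can also be listed programmatically. The sketch below is an illustration under stated assumptions: the two country codes and their values are hypothetical, and it uses base R's boxplot.stats(), whose hinges can differ slightly from the quartiles that ggplot2 draws.

```r
# Hypothetical long-format data: seven yearly values for two made-up countries
toy <- data.frame(
  Country.Code = rep(c("AAA", "BBB"), each = 7),
  value = c(4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 9.0,
            6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6)
)

# Values beyond 1.5 times the box length from the hinges, per country
outliers <- lapply(split(toy$value, toy$Country.Code),
                   function(v) boxplot.stats(v)$out)

# Median per country, for ranking the countries
medians <- tapply(toy$value, toy$Country.Code, median)

print(outliers)
print(medians)
```

Here only the value 9.0 for the hypothetical country AAA is flagged, while BBB has no outliers.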
In this method, the data is segmented on the basis of group means: each observation is assigned to the group whose mean is closest to it (Chatfield, 2018).
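The idea of assigning observations to the nearest group mean can be sketched with base R's kmeans() on a small, well-separated example. The country labels and rates below are hypothetical and serve only to show the mechanics.

```r
set.seed(42)  # kmeans() uses random starting centres

# Hypothetical crude death rates (cdr) and crude birth rates (cbr)
toy <- data.frame(
  country = paste0("C", 1:6),
  cdr = c(5.0, 5.5, 6.0, 9.0, 9.5, 10.0),
  cbr = c(12, 13, 11, 30, 28, 32)
)

# Two clusters; nstart = 10 repeats the algorithm from 10 random starts
fit <- kmeans(toy[, c("cdr", "cbr")], centers = 2, nstart = 10)

toy$cluster <- fit$cluster
print(fit$centers)                        # the group means
print(split(toy$country, toy$cluster))    # countries in each group
```

The cluster numbers themselves are arbitrary labels; only the grouping of the observations is meaningful.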
In this study, k-means clustering is performed on the Crude Birth Rate (CBR) and the Crude Death Rate (CDR), with countries as the units being grouped. As there was no data for the year 2015, the whole column was removed from the data. In the figure, each country code is coloured by its cluster, red or green. The green countries show higher birth and death rates while the red countries show higher death rate and lower birth rate.
The code is given in the following table (Husson, Lê & Pagès, 2017):
###============K-Means Clustering (Country Wise)================###
cluster <- filter(health, Series.Code %in% c("SP.DYN.CDRT.IN", "SP.DYN.CBRT.IN", "SH.IMM.IBCG"))
cluster <- subset(cluster, select = -(X2015..YR2015.))
# No id.vars are given, so melt() keeps the non-numeric columns as identifiers and stacks the year columns
cluster <- melt(cluster)
# One row per country and one column per series, averaged over the years
cluster <- dcast(cluster, formula = Country.Code ~ Series.Code, fun.aggregate = mean)
cluster <- na.omit(cluster)
cluster
group <- kmeans(cluster[, c("SP.DYN.CDRT.IN", "SP.DYN.CBRT.IN")], centers = 2, nstart = 10)
group
ord <- order(group$cluster)
data.frame(cluster$Country.Code[ord], group$cluster[ord])
# Base graphics: draw an empty plot first, then add the country codes coloured by cluster
plot(cluster$SP.DYN.CDRT.IN, cluster$SP.DYN.CBRT.IN, type = "n", xlim = c(0, 10), xlab = "Crude Death Rate", ylab = "Crude Birth Rate")
text(x = cluster$SP.DYN.CDRT.IN, y = cluster$SP.DYN.CBRT.IN, labels = cluster$Country.Code, col = group$cluster + 1)
The relationship between two numerical variables is established with the help of regression analysis (Fox, 2015). The general equation of linear regression is given by:
Y = a + bX
Here, X and Y are the independent and the dependent variables respectively, a is the intercept (the value of the dependent variable when the independent variable is zero) and b is the slope of the regression line (Draper & Smith, 2014).
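The roles of a and b can be illustrated with base R's lm() on a small example. The values of x and y below are hypothetical and are not taken from the dataset; they only show how the intercept and the slope are read from a fitted model.

```r
# Hypothetical data with a decreasing trend
x <- c(1, 2, 3, 4, 5)
y <- c(9.1, 7.9, 7.2, 5.8, 5.0)

fit <- lm(y ~ x)

a <- unname(coef(fit)["(Intercept)"])  # intercept: fitted Y when X = 0
b <- unname(coef(fit)["x"])            # slope: change in Y per unit of X

# Fitted value for a new observation, Y = a + bX
y_new <- unname(predict(fit, newdata = data.frame(x = 6)))

cat("a =", a, " b =", b, " prediction at x = 6:", y_new, "\n")
```

A negative value of b corresponds to a downward-sloping regression line.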
From the analysis, it can be seen that the regression line shows a negative relationship between the independent and the dependent variables: as female school enrolment increases, the female unemployment rate decreases.
The code is attached in the following table (Berk, 2016):
From the analysis, it can be seen that the regression line shows a very weak relationship between the independent and the dependent variables. Thus, the analysis gives no evidence that immunization affects the Crude Death Rate.
The code is attached in the following table:
###============Regression-Immunization Rate and CDR================###
# Fit the linear model; reg is assumed to be the data frame holding both series as columns
reg_fit2 <- lm(formula = SP.DYN.CDRT.IN ~ SH.IMM.IBCG, data = reg)
summary(reg_fit2)
# Plot the observations together with the fitted regression line
reg_plot2 <- ggplot(reg, aes(x = SH.IMM.IBCG, y = SP.DYN.CDRT.IN)) + geom_point(shape = 2) + scale_x_continuous(name = "Immunization, BCG (% of one-year-old children)") + scale_y_continuous(name = "Crude Death rate per 1,000 people") + geom_smooth(method = "lm") + theme_bw() + ggtitle("Relation of Crude Death Rate to Immunization BCG rate of one-year-old children")
plot(reg_plot2)
Conclusion
It can thus be concluded that the variables have been analyzed both individually and two at a time. The k-means clustering analysis groups the countries according to their birth and death rates. A negative relationship has been observed between female education and female unemployment, and no relationship has been observed between immunization and death rate.
The main difficulty was the selection of the variables, as most of them have a large number of missing values. The results were obtained with a selection of variables and could have been better with fewer missing values.
Reference List
Chatfield, C. (2018). Introduction to multivariate analysis. Routledge.
Husson, F., Lê, S., & Pagès, J. (2017). Exploratory multivariate analysis by example using R. Chapman and Hall/CRC.
Fox, J. (2015). Applied regression analysis and generalized linear models. Sage Publications.
Draper, N. R., & Smith, H. (2014). Applied regression analysis (Vol. 326). John Wiley & Sons.
Berk, R. A. (2016). Statistical learning from a regression perspective. New York: Springer.