Overview of the Data Set and Study Objectives
The assignment requires a business report to be presented to a senior manager of a human capital management company. It combines statistical analysis with a concise, non-technical business report.
Margaret, the CEO of a human capital management company in Melbourne, recently attended a TED talk in which Michael Green explained the importance of the United Nations Sustainable Development Goals for 2030. Michael Green described the challenges that countries around the world would face in meeting the UN sustainable development targets and explained the role of the Social Progress Index in this process. After the talk, Margaret decided to study the Social Progress Index further and to compare countries according to their performance in each sub-category of the index.
Jason, a newly appointed research officer, performs the analysis as instructed by Margaret. Margaret downloaded the 2017 Social Progress Index data set and asked Jason to carry out the data analysis. The data set covers 182 countries and contains 50 variables under 12 categories.
First, a random sample of 100 countries is drawn from the 182 countries. One variable is then extracted from each even-numbered category (categories 2, 4, 6, 8, 10 and 12). The new data set therefore has eight variables in total, two of which are Continent and Country name, with 100 observations per variable. Some variables have missing values; in such cases the entire row is omitted from the calculations. After removing rows with missing values, data for 80 countries remain and are used for all further calculations.
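The sampling and cleaning steps described above can be sketched as follows. This is a minimal illustration only: the function name, the seed, and the toy rows are assumptions made for the example and do not come from the Social Progress Index file.

```python
import random

def draw_and_clean(rows, sample_size, seed=0):
    """Draw a simple random sample of rows, then drop any row with a missing value."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sampled = rng.sample(rows, sample_size)  # sampling without replacement
    # Listwise deletion: keep a row only if every field is present.
    return [row for row in sampled if all(v is not None for v in row.values())]

# Hypothetical toy data: every fifth country has a missing sanitation value.
countries = [
    {"country": f"Country {i}", "sanitation": None if i % 5 == 0 else 50.0 + i}
    for i in range(1, 21)
]

cleaned = draw_and_clean(countries, sample_size=10)
print(len(cleaned), "complete rows kept out of 10 sampled")
```

Dropping whole rows keeps every variable on the same set of countries, at the cost of a smaller sample (here, 80 of the 100 sampled countries).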
Categorical Variables:
- Continent:
The frequency distribution in table 1 shows that the largest number of selected countries is from Asia (28), followed by Europe (23). The frequency of America is third highest (18). The number of countries is least from America (11) as per figure 1.
- Political Terror
As per figure 2, the frequency distribution of political terror shows that the most common level is level 3 (frequency = 20). A considerable number of countries face level 1 political terror (frequency = 14) and level 2 political terror (frequency = 12). Table 1 shows that only 12 countries (= 8 + 1 + 3) out of 80 have a political terror level of 4 or more.
- Access to improved sanitation facilities:
Out of 80 countries, 38 have access to improved sanitation facilities in the interval 90 to 100, as per figure 3. According to table 3, access to improved sanitation facilities ranges from 10.88 to 100, with an average of 75.43. A considerable number of countries have an access level of 100. As per figure 3, the distribution is left skewed, since most countries cluster near the upper end with a long tail towards low values (Data and Bartz 1988).
- Press freedom index:
Of the 80 countries, 65 have a press freedom index below 50. The interval 20 to 30 has the highest frequency, with 20 countries, as per table 4. The average press freedom index across all countries is 34.89 as per table 5, and the median is 30.94. The press freedom index of the 80 countries lies between 8.59 and 80.96, a range of 72.37.
- Outdoor air pollution attributable deaths:
Only 1 country has more than 150 outdoor air pollution attributable deaths as per table 6. The largest group of countries (38) has fewer than 50 outdoor air pollution attributable deaths. The average is 59, with a median of 51. Outdoor air pollution attributable deaths lie between 11.02 and 232.48, a range of 221.46, as per table 7.
- Corruption:
The mean corruption level is 44.99. The median corruption index of the 80 countries is 40.5, which means that half of the countries lie above 40.5 and half below it (Huisken and Sinestrari 1999). The corruption level ranges from 14 to 89, a range of 75, as per table 9. The most common corruption level is 30. According to figure 6, 49 countries have a corruption level between 20 and 50.
- Years of tertiary schooling:
The average years of tertiary schooling is 0.5803, with a median of 0.5008. The minimum is 0.0289 and the maximum is 1.9437, as per table 14. Only 9 of the 80 countries have more than 1.2 years of tertiary schooling. The largest group (35 countries) has less than 0.4 years. As per figure 8, the distribution is right skewed: the mean exceeds the median and a few countries with high values form a long upper tail.
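The summaries reported for each variable (frequency counts for categorical variables; mean, median, range and direction of skew for numeric ones) could be reproduced along the following lines. This is a sketch under stated assumptions: the helper names and the toy values are hypothetical, and the skewness measure used here is the simple moment coefficient, which may differ from what a spreadsheet reports.

```python
import statistics
from collections import Counter

def frequency_table(categories):
    """Return (category, count) pairs, most frequent first."""
    return Counter(categories).most_common()

def describe(values):
    """Return basic descriptive statistics for a numeric variable."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    # Moment coefficient of skewness: positive means a longer right tail.
    skewness = sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "skewness": skewness,
    }

# Hypothetical values, not taken from the Social Progress Index data.
continents = ["Asia", "Europe", "Asia", "Africa", "Europe", "Asia"]
schooling_years = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 1.9]

print(frequency_table(continents))
summary = describe(schooling_years)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A positive skewness together with a mean above the median indicates a right-skewed distribution; a negative skewness with a mean below the median indicates a left-skewed one.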
1) The average value of access to improved sanitation facilities is 75.427. With 95% confidence, the mean access to improved sanitation facilities lies between 69.10 and 81.75 (Norušis 2006).
2) The average value of years of tertiary schooling is 0.58. The lower and upper limits of the 95% confidence interval for mean years of tertiary schooling are 0.47 and 0.69.
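The two intervals above are confidence intervals for a mean, which are usually computed as the sample mean plus or minus a t critical value times the standard error. The sketch below shows that calculation; the function name and the toy sample are hypothetical, and the critical value is passed in by hand because the Python standard library has no t distribution (for 80 countries, i.e. 79 degrees of freedom, the two-sided 95% value is approximately 1.99).

```python
import math
import statistics

def mean_confidence_interval(values, t_critical):
    """Return (lower, upper) limits of a t-based confidence interval for the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    margin = t_critical * standard_error
    return mean - margin, mean + margin

# Hypothetical sample of 8 values; 2.365 is the approximate two-sided
# 95% critical value of the t distribution with 7 degrees of freedom.
sample = [62.0, 75.5, 88.0, 91.2, 100.0, 45.3, 99.0, 70.1]
lower, upper = mean_confidence_interval(sample, t_critical=2.365)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

The interval is symmetric about the sample mean, which is why the midpoints of the reported intervals (75.43 and 0.58) match the reported averages.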
The hypotheses are:
Null hypothesis: The difference between the mean years of tertiary schooling in Europe and in Africa is 0.
Alternative hypothesis: The average years of tertiary schooling in Europe is greater than the average years of tertiary schooling in Africa.
Test applied: Two sample t-test assuming unequal variances (Johansen 1991).
Level of significance: 5%
T-statistic and df: 9.196 and 28.
Two-tail p-value: approximately 0.000 (Keselman et al. 2004).
Interpretation: The calculated p-value is less than the level of significance (0.000 < 0.05). As a result, the null hypothesis is rejected at the 5% level of significance (Cressie and Whitford 1986).
Conclusion: Hence, the mean years of tertiary schooling in Europe is significantly higher than the mean years of tertiary schooling in Africa.
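A two-sample t-test assuming unequal variances is commonly known as Welch's t-test. The sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name and the two toy samples are hypothetical, and the p-value is left out because it requires the t distribution, which a statistics library such as SciPy provides.

```python
import math
import statistics

def welch_t_test(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard errors of each group mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df

# Hypothetical years-of-schooling values for two groups of countries.
group_a = [1.0, 1.2, 0.9, 1.4, 1.1]
group_b = [0.2, 0.3, 0.1, 0.4, 0.25, 0.15]

t_stat, df = welch_t_test(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

Spreadsheet tools typically report the degrees of freedom as a whole number, which would explain integer values such as 28, 20 and 21 in this report.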
Null hypothesis: The average political terror level in Asia is equal to the average political terror level in America.
Alternative hypothesis: The average political terror level in Asia is not equal to the average political terror level in America.
Test applied: Two sample t-test assuming unequal variances.
Level of significance: 5%
T-statistic and df: 0.9387 and 20.
Two-tail p-value: 0.359.
Interpretation: The calculated p-value is greater than the level of significance (0.359 > 0.05). Hence, the null hypothesis cannot be rejected at the 5% level of significance.
Conclusion: Hence, there is no significant evidence that the average political terror level in Asia differs from the average political terror level in America.
Null hypothesis: The average outdoor air pollution attributable deaths in America is equal to the average outdoor air pollution attributable deaths in Europe.
Alternative hypothesis: The average outdoor air pollution attributable deaths in America is not equal to the average outdoor air pollution attributable deaths in Europe.
Test applied: Two sample t-test assuming unequal variances.
Level of significance: 5%
T-statistic and df: 0.3251 and 21.
Two-tail p-value: 0.748.
Interpretation: The calculated p-value is greater than the level of significance (0.748 > 0.05). Therefore, the null hypothesis cannot be rejected at the 5% level of significance.
Conclusion: Hence, there is no significant evidence that the average outdoor air pollution attributable deaths in America differs from that in Europe.
- The correlation coefficient between two variables, outdoor air pollution attributable deaths and access to improved sanitation facilities is -0.59672. Therefore, the correlation is moderate and negative (Sedgwick 2012).
- The correlation coefficient between the two variables corruption and press freedom index is -0.54182. The correlation is moderate and negative (Wang 2013).
- The simple linear regression model with outdoor air pollution attributable deaths as the dependent variable and access to improved sanitation facilities as the independent variable gives a coefficient of determination (R2) of 0.356. That is, the independent variable explains 35.6% of the variability in the dependent variable (Seber and Lee 2012).
- The simple linear regression model with corruption as the dependent variable and press freedom index as the independent variable gives a coefficient of determination (R2) of 0.294. That is, the independent variable explains 29.4% of the variability in the dependent variable.
- The estimated linear model in case of 1st regression is:
Outdoor air pollution attributable deaths = 119.227 – 0.793* Access to improved sanitation facilities (Montgomery, Peck and Vining 2012).
- The estimated linear model in case of 2nd regression is:
Corruption = 67.025 – 0.632*Press Freedom Index.
- The p-values of the first and second regression models are both approximately 0.0. In each case the null hypothesis is that there is no significant linear relationship between the dependent and independent variables. Both p-values are less than 0.05. Therefore, at the 5% level of significance, the null hypothesis is rejected for both models. Hence, access to improved sanitation facilities has a significant linear association with outdoor air pollution attributable deaths, and the press freedom index has a significant linear association with corruption.
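The reported correlation and regression figures can be cross-checked against each other, since in a simple linear regression R2 equals the squared correlation, the fitted line passes through the two sample means, and the slope's t-statistic follows from r and the sample size. The sketch below runs those three checks on the numbers quoted in this report. It assumes both regressions were fitted by ordinary least squares with an intercept on all 80 countries; the helper names are hypothetical, and 1.99 is the approximate two-sided 5% critical value for 78 degrees of freedom.

```python
import math

N_COUNTRIES = 80  # sample size after removing rows with missing values

# Correlation coefficients reported in the text.
r_deaths_sanitation = -0.59672
r_corruption_press = -0.54182

def predict(intercept, slope, x):
    """Evaluate a fitted simple linear regression line at x."""
    return intercept + slope * x

def slope_t_statistic(r, n):
    """t-statistic for the slope of a simple regression, from r and n (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Check 1: R2 should equal the squared correlation.
r2_first = round(r_deaths_sanitation ** 2, 3)
r2_second = round(r_corruption_press ** 2, 3)

# Check 2: predictions at the mean of x should be close to the mean of y
# (reported means: sanitation 75.427, press freedom index 34.89).
deaths_at_mean = predict(119.227, -0.793, 75.427)
corruption_at_mean = predict(67.025, -0.632, 34.89)

# Check 3: slope t-statistics, to compare with the critical value of about 1.99.
t_first = slope_t_statistic(r_deaths_sanitation, N_COUNTRIES)
t_second = slope_t_statistic(r_corruption_press, N_COUNTRIES)

print("R2:", r2_first, r2_second)
print("Predicted at mean x:", round(deaths_at_mean, 2), round(corruption_at_mean, 2))
print("Slope t-statistics:", round(t_first, 2), round(t_second, 2))
```

Under these assumptions the squared correlations reproduce the reported R2 values, the predictions at the mean fall close to the reported average deaths (59) and average corruption level (44.99), and both t-statistics are far beyond the critical value, which is consistent with p-values that round to 0.0.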
Conclusion:
In conclusion, for the 80 countries studied, European countries are far ahead of African countries in years of tertiary schooling. American and Asian countries do not differ significantly in political terror level, and European and American countries do not differ significantly in outdoor air pollution attributable deaths. Countries with better access to improved sanitation facilities tend to have fewer outdoor air pollution attributable deaths, and a higher press freedom index is associated with a lower corruption level. The average access to improved sanitation facilities is estimated at roughly 70 to 80 (95% confidence interval 69.10 to 81.75), whereas the average years of tertiary schooling is estimated at roughly 0.5 to 0.7 (95% confidence interval 0.47 to 0.69).
Limitations:
- The data contain many missing values. Had these values been available, the sample size would have been 100, which might have given more reliable results.
- Only 100 of the 182 countries were chosen for the analysis. Including all the countries could have improved the analysis.
- Not all of the 50 variables under the 12 categories were considered. Including all the variables could have improved the analysis.
Reference List:
Cressie, N.A.C. and Whitford, H.J., 1986. How to Use the Two Sample t-Test. Biometrical Journal, 28(2), pp.131-148.
Data, S. and Bartz, A.E., 1988. Basic statistical concepts. New York: Macmillan. Devore, J., and Peck.
Huisken, G. and Sinestrari, C., 1999. Convexity estimates for mean curvature flow and singularities of mean convex surfaces. Acta mathematica, 183(1), pp.45-70.
Johansen, S., 1991. Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica: Journal of the Econometric Society, pp.1551-1580.
Keselman, H.J., Othman, A.R., Wilcox, R.R. and Fradette, K., 2004. The new and improved two-sample t test. Psychological Science, 15(1), pp.47-51.
Montgomery, D.C., Peck, E.A. and Vining, G.G., 2012. Introduction to linear regression analysis (Vol. 821). John Wiley & Sons.
Norušis, M.J., 2006. SPSS 14.0 guide to data analysis. Upper Saddle River, NJ: Prentice Hall.
Seber, G.A. and Lee, A.J., 2012. Linear regression analysis (Vol. 329). John Wiley & Sons.
Sedgwick, P., 2012. Pearson’s correlation coefficient. BMJ: British Medical Journal (Online), 345.
Wang, J., 2013. Pearson correlation coefficient. In Encyclopedia of Systems Biology (pp. 1671-1671). Springer New York.