Descriptive Statistics
The purpose of this assignment is to study and analyze the concept of statistical assumptions, hypothesis testing, and statistical analysis. The objective is to identify and resolve the critical issues that arise when statistical theory is applied to a dataset using the JASP statistical software. The dataset has 5 variables and 2920 observations, collected from 01-01-2018 to 31-12-2019. The variables are ‘Date’, ‘Year’, ‘Terminal’, ‘Average time taken’, and ‘Luggage’. Records were collected from 4 different terminals: ‘T1’, ‘T2’, ‘T3’, and ‘T4’. Each of the 4 terminal categories contains information about the average time taken at that terminal.
- The first objective is to identify and analyze whether a relationship exists between ‘Average time taken’ and ‘Terminal’, that is, to identify the pattern and correlation between these variables.
- The second objective is to analyze whether the average time taken and the luggage are the same across the different terminal categories.
- The third objective is to check the statistical assumptions and to identify the differences between parametric and non-parametric results.
- The final aim is to estimate and predict the average time taken and to assess whether all the estimates are significant.
Visualization techniques and parametric and non-parametric tests have been used to answer the research questions. Visualization techniques help in checking the assumptions of parametric statistical tests and in identifying patterns in the data. ‘Average time taken’ depends on the number of luggage and on the terminal. The variables ‘Luggage’, ‘Year’, and ‘Day’ are independent variables.
As per Fig 1.0, the mean of ‘Average time taken’ is 5.798, the minimum is 3.37, and the maximum is 8.0. The mean of ‘Luggage’ is 6701.392 (approximately 6701), the minimum is 2065, and the maximum is 186889. The observations are equally distributed across years and across terminals.
Fig 1.0: Descriptive Statistics
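Summary figures of this kind can be cross-checked outside JASP. The following is a minimal sketch in Python, assuming the values of one column are available as a plain list; the `describe` helper and the sample values are hypothetical and are not taken from the dataset.

```python
import statistics

def describe(values):
    """Return the mean, minimum and maximum of a sequence of numbers."""
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical toy sample standing in for the 'Average time taken' column
toy_avg_time = [4.1, 5.2, 6.1, 6.5, 7.6]
summary = describe(toy_avg_time)
print(summary)
```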
Some data visualization techniques explained by Embarak, Embarak and Karkal, 2018 are presented below in graphical form.
Distribution plots help in identifying the distribution and the skewness of the data. As per Fig 1.1, the histograms of ‘Average time taken’ and ‘Luggage’ show that these variables do not follow a normal distribution.
Fig 1.1: Distribution Plots
The QQ plot is used to check whether the data are normally distributed, which is one of the parametric test assumptions: if a variable is normally distributed, its points fall along the straight reference line. As per Fig 1.2, none of the variables follows this line, because the data points do not all lie on it. Some outliers exist in the variables, which can make the results of statistical tests unreliable.
Fig 1.2: QQ plot of Data
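The construction behind a QQ plot can be sketched in a few lines. This is an illustrative sketch only, assuming a normal reference distribution fitted with the sample mean and standard deviation and the plotting positions (i - 0.5)/n; the `qq_points` helper and the toy values are hypothetical, and JASP may use different plotting positions.

```python
from statistics import NormalDist, fmean, stdev

def qq_points(values):
    """Pair each sorted observation with the matching normal quantile."""
    n = len(values)
    ordered = sorted(values)
    fitted = NormalDist(fmean(values), stdev(values))
    # Plotting positions (i - 0.5) / n avoid the undefined quantiles at 0 and 1
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical toy sample with one extreme value, NOT the real luggage data
toy_luggage = [2065, 3000, 4000, 5000, 186889]
points = qq_points(toy_luggage)
for expected, observed in points:
    print(f"theoretical={expected:12.1f}  observed={observed}")
```

When the observed values drift away from the theoretical quantiles, as the extreme value does in this toy sample, the points leave the reference line.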
A scatter plot helps in identifying the relationship between two variables. It shows whether the relationship is strong or weak and whether it is positive or negative. If the variables are correlated, the data points fall along a line or form a curve. As per Fig 1.3, the relationships between the variables are weak, partly because some variables, such as terminal and year, are categorical.
Fig 1.3: Scatter Plot
A box plot helps in identifying the distribution of a variable and whether it is skewed. The day and year data appear symmetric, as the median lies at the centre of the box. ‘Average time taken’ and ‘Luggage’ are not normally distributed: the data are left skewed, and the luggage data have many outliers.
Fig 1.4: Box Plots
A parametric test is used when the data satisfy the assumptions of that test; otherwise, a non-parametric test is used. Non-parametric tests rely on basic mathematical rules (typically ranks) rather than on distributional assumptions, whereas parametric tests are valid only under pre-defined assumptions.
In the data, ‘Luggage’ and ‘Year’ are treated as independent groups because they do not depend on any other factor in the dataset. The number of luggage depends on the passengers and not on any other variable, so ‘Luggage’ is an independent variable. ‘Year’ also does not depend on any other factor, so both variables are independent variables.
Hypothesis statement:
The hypotheses of the two independent samples t-test are explained by Gerald, B., 2018 as follows:
Null hypothesis (H0): There is no significant difference between the two population means.
Alternate Hypothesis (H1): There is a significant difference between the two population means.
The hypotheses of the Mann-Whitney test are explained by Karadimitriou, Marshall, and Knox, 2018 as follows:
Null hypothesis (H0): There is no significant difference between two populations.
Alternate Hypothesis (H1): There is a significant difference between two populations.
Parametric or Non-Parametric test:
The ‘Luggage’ and ‘Year’ variables call for a non-parametric test because they do not satisfy the assumptions of parametric tests, such as normality, homogeneity of variance, and independence. In this section both a parametric test (independent samples t-test) and a non-parametric test (Mann-Whitney test) have been used, and the aim is to compare the results of the two methods.
As per Fig 1.5, the p-value of the Shapiro-Wilk test is well below 0.05, which leads to rejection of the null hypothesis and the conclusion that the data are not normally distributed. The p-value of Levene’s test is also well below 0.05, which leads to rejection of the null hypothesis and the conclusion that the variances are not equal.
Fig 1.5: Two Independent Sample T-test and Mann Whitney Test
According to the two independent samples t-test, the parametric p-value is well below 0.05, which leads to rejection of the null hypothesis and the conclusion that the two population means are not equal. As per the Mann-Whitney result, the p-value is well above 0.05, so the null hypothesis cannot be rejected, and the conclusion is that there is no significant difference between the two populations.
The parametric and non-parametric tests therefore give different results.
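The two tests are built on different quantities, which is one way they can disagree. The sketch below computes only the test statistics (not the p-values) from first principles, assuming the Welch form of the t statistic and average ranks for ties; the helper names and the toy samples are hypothetical, and JASP's exact options may differ.

```python
from statistics import fmean, variance

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def welch_t(a, b):
    """t statistic comparing two means without assuming equal variances."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (fmean(a) - fmean(b)) / standard_error

def mann_whitney_u(a, b):
    """Mann-Whitney U statistic (the smaller of the two U values)."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    return min(u_a, len(a) * len(b) - u_a)

# Hypothetical toy samples, NOT the real data
group_2018 = [1, 2, 3]
group_2019 = [4, 5, 6]
print(welch_t(group_2018, group_2019), mann_whitney_u(group_2018, group_2019))
```

The t statistic uses the raw values through means and variances, while U uses only the ordering of the observations, so extreme values affect the two statistics differently.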
In the data, ‘Terminal’ and ‘Average time taken’ are treated as dependent groups because both depend on the number of luggage. Which terminal a passenger uses depends on the number of luggage they have, and the average time taken also depends on the number of luggage. So, ‘Terminal’ and ‘Average time taken’ are dependent groups.
Hypothesis statement:
- The hypothesis statement for the one-way ANOVA test is explained by Ross and Willson, 2017 as follows:
Null hypothesis (H0): The mean average time taken is equal across all terminal categories.
Alternate Hypothesis (H1): At least one difference exists among the group means.
- The hypothesis statement for the Kruskal-Wallis test is explained by Cleophas and Zwinderman, 2016 as follows:
Null hypothesis (H0): The median average time taken is equal across all terminal categories.
Alternate Hypothesis (H1): The median average time taken is not equal across all terminal categories.
Parametric or Non-Parametric test:
One-way ANOVA is a parametric test. It is used when a single variable has more than two groups. When the data do not satisfy the assumptions of one-way ANOVA, such as normality, homogeneity of variance, and linearity, the non-parametric alternative is the Kruskal-Wallis test, which serves the same purpose as one-way ANOVA but works on different mathematical rules (ranks) rather than on distributional assumptions. Both tests have been used to compare the results.
A one-way ANOVA test has been conducted between the two dependent variables ‘Terminal’ and ‘Average time taken’. The p-value for the terminal factor is well below 0.05, which leads to rejection of the null hypothesis and the conclusion that at least one difference exists among the group means (terminals).
Fig 1.6: One Way ANOVA Test
The post-hoc test is part of the ANOVA procedure and is used to compare the group means of the terminal variable pairwise. The SE and mean difference between terminals T1 and T2 are very small, so the most significant pair is T1 and T2 (refer to Fig 1.7).
Fig 1.7: Post-Hoc Tests
When the data do not satisfy the assumptions of one-way ANOVA, the Kruskal-Wallis test, a non-parametric test, is used. Here, too, the data violate the assumptions of the ANOVA test. The Kruskal-Wallis test statistic is 2455.110 with p = 0.001 < 0.05, which leads to rejection of the null hypothesis and the conclusion that the median average time taken is not equal across the terminal categories (refer to Fig 1.8).
Fig 1.8: Kruskal Wallis Test
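How the two statistics are formed can be shown on a small example. The following is a rough sketch, assuming the textbook formulas for the one-way ANOVA F ratio and for the Kruskal-Wallis H statistic without tie correction (so all values must be distinct); the function names and the toy groups are hypothetical and do not come from the dataset.

```python
from statistics import fmean

def anova_f(groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def kruskal_h(groups):
    """Kruskal-Wallis H without tie correction (assumes distinct values)."""
    pooled = sorted(x for g in groups for x in g)
    rank = {x: i + 1 for i, x in enumerate(pooled)}
    n = len(pooled)
    weighted = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)

# Hypothetical toy groups standing in for three terminals
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(anova_f(toy_groups), kruskal_h(toy_groups))
```

ANOVA compares group means using the raw values, while Kruskal-Wallis compares the rank sums of the groups, which is why it does not need the normality assumption.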
In the data, ‘Luggage’, ‘Year’, and ‘Day’ are treated as independent groups because they do not depend on any other factor in the dataset. The number of luggage depends on the passengers and not on any other variable, so ‘Luggage’ is an independent variable. Day and year also do not depend on any other factor, so these variables are independent variables as well.
Hypothesis statement:
The hypothesis statements for the homogeneity test, the normality test, and linear regression are as follows:
- Levene Test (Homogeneity test):
Null hypothesis (H0): Variability among the terminal groups is equal.
Alternate Hypothesis (H1): Variability among the terminal groups is not equal.
- Shapiro Test (Normality test)
Null hypothesis (H0): The sample data are normally distributed.
Alternate Hypothesis (H1): The sample data are not normally distributed.
- Linear Regression
Null hypothesis (H0): There is no significant relationship between the coefficients of the independent variables and the dependent variable.
Alternate Hypothesis (H1): There is a significant relationship between the coefficients of the independent variables and the dependent variable.
Parametric or Non-Parametric test:
The variables do not satisfy some assumptions of parametric tests, such as normality and linearity, and they violate the assumption of homogeneity. The normality and homogeneity tests have been conducted below using the Shapiro-Wilk and Levene tests. To estimate ‘Average time taken’, linear regression, a parametric method, has been used to see what happens when the data do not satisfy the assumptions.
Levene Test (Homogeneity test):
Levene’s test is useful for testing the homogeneity of variance. The Levene test statistic is 494.722 with p = 0.001 << 0.05, which leads to rejection of the null hypothesis and the conclusion that the variability among the groups is not equal (refer to Fig 1.9).
Fig 1.9: Levene’s Test
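The idea behind Levene’s test can be sketched as a one-way ANOVA on absolute deviations. This is a minimal sketch assuming the classic mean-centred form of the statistic; statistical packages may instead centre on the median, and the `levene_w` helper and toy groups are hypothetical.

```python
from statistics import fmean

def levene_w(groups):
    """Levene's W: ANOVA F ratio computed on |x - group mean|."""
    deviations = [[abs(x - fmean(g)) for x in g] for g in groups]
    n_total = sum(len(d) for d in deviations)
    k = len(deviations)
    grand_mean = fmean([z for d in deviations for z in d])
    between = sum(len(d) * (fmean(d) - grand_mean) ** 2 for d in deviations)
    within = sum((z - fmean(d)) ** 2 for d in deviations for z in d)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical toy groups: same spread, then different spread
same_spread = [[1, 2, 3], [11, 12, 13]]
different_spread = [[1, 2, 3], [0, 2, 4]]
print(levene_w(same_spread), levene_w(different_spread))
```

Groups with the same spread give W = 0 regardless of their means, and W grows as the spreads diverge.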
Shapiro Test (Normality test):
The Shapiro-Wilk test is useful for checking the normality of the data. The Shapiro-Wilk test statistic is 0.882 with p = 0.001 << 0.05, which leads to rejection of the null hypothesis and the conclusion that the sample data are not normally distributed.
Fig 2.0: Shapiro Test
Linear regression has been used to predict ‘Average time taken’. This variable depends on ‘Luggage’ (the number of luggage with the passenger), and ‘Year’ and ‘Terminal’ have been added as independent variables, with ‘Luggage’ used as a covariate. The null model therefore includes the Terminal and Year variables, and the alternate model (H1) additionally includes luggage. The number of luggage is not of primary interest here, but it has an effect on ‘Average time taken’.
Fig 2.1: Linear Regression Model
The linear regression equation for the null model is:
Average time taken (Y) = 5.016 - 0.819*T2 + 2.277*T3 + 2.293*T4 - 0.312*Year(2019)
The p-values for all the coefficients of the independent variables in the null model are less than 0.05, which leads to rejection of the null hypothesis and the conclusion that there is a significant relationship between the coefficients of the independent variables of the null model and the dependent variable, i.e. the null model is significant.
The linear regression equation for the alternate model is:
Average time taken (Y) = 5.016 - 0.713*T2 + 2.282*T3 + 2.278*T4 - 0.317*Year(2019) - 1.139e-5*Luggage
The p-values for all the coefficients of the independent variables in the alternate model are less than 0.05, which leads to rejection of the null hypothesis and the conclusion that there is a significant relationship between the coefficients of the independent variables of the alternate model and the dependent variable, i.e. the alternate model is significant.
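To show how the alternate-model equation is applied, the sketch below evaluates it for given inputs. The coefficients are those reported above; the function and dictionary names are hypothetical, and the sketch assumes that T1 and 2018 are the reference levels of the dummy coding (they have no coefficient in the equation).

```python
INTERCEPT = 5.016
TERMINAL_EFFECT = {"T1": 0.0, "T2": -0.713, "T3": 2.282, "T4": 2.278}  # T1 assumed as reference
YEAR_2019_EFFECT = -0.317  # 2018 assumed as reference
LUGGAGE_SLOPE = -1.139e-5

def predict_avg_time(terminal, year, luggage):
    """Evaluate the alternate-model regression equation for one observation."""
    prediction = INTERCEPT + TERMINAL_EFFECT[terminal]
    if year == 2019:
        prediction += YEAR_2019_EFFECT
    return prediction + LUGGAGE_SLOPE * luggage

# Example: terminal T3 in 2019 with roughly the average luggage count
print(round(predict_avg_time("T3", 2019, 6701), 3))
```

With zero luggage at T1 in 2018, the prediction reduces to the intercept, which is a quick sanity check on the coding assumption.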
R2 and adjusted R2 are goodness-of-fit measures that give the proportion of the variability of the dependent variable that is explained by the independent variables. The R2 value is always greater than or equal to the adjusted R2 value. The R2 values of the null model and the alternate model are 0.953 and 0.954, and the lowest RMSE value is 0.375, which indicates that the model fits the data well.
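The relation between R2 and adjusted R2 can be made concrete with the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below is illustrative only: the `adjusted_r2` helper is hypothetical, and the count of p = 4 predictors for the null model (three terminal dummies plus one year dummy) is an assumption read off the equation, not a figure reported by JASP.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R2 from R2, the sample size and the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R2 of the null model and the dataset size come from the text; p = 4 is assumed
null_adjusted = adjusted_r2(0.953, 2920, 4)
print(round(null_adjusted, 4))
```

Because the penalty factor (n - 1)/(n - p - 1) is at least 1, the adjusted value cannot exceed R2, and with 2920 observations the two are almost identical.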
Conclusion
As per the above analysis:
- Little or no relationship exists between ‘Average time taken’ and ‘Terminal’; as per the scatter plot, the other variables are also either unrelated or only weakly related to one another. Only ‘Average time taken’ and ‘Luggage’ show a fairly strong relationship.
- The ANOVA and Kruskal-Wallis tests show that the mean/median ‘Average time taken’ and luggage are not equal across the different terminal categories.
- The data do not satisfy the statistical assumptions, i.e. they do not follow normality, homogeneity, or linearity. Parametric tests rely on these assumptions, while non-parametric tests work with mathematical rules.
- As per the linear regression analysis, all the estimates (coefficients) of the independent variables are significant, i.e. the overall model is significant with a good R2 value.
References
Ali, Z. and Bhaskar, S.B., 2016. Basic statistical tools in research and data analysis. Indian journal of anaesthesia, 60(9), p.662.
Arnoux, P.H., Xu, A., Boyette, N., Mahmud, J., Akkiraju, R. and Sinha, V., 2017, May. 25 tweets to know you: A new model to predict personality with social media. In Proceedings of the international AAAI conference on web and social media (Vol. 11, No. 1, pp. 472-475).
Cleophas, T.J. and Zwinderman, A.H., 2016. Non-parametric tests for Three or more samples (friedman and kruskal-Wallis). In Clinical data analysis on a pocket calculator (pp. 193-197). Springer, Cham.
Embarak, D.O., Embarak and Karkal, 2018. Data analysis and visualization using python. Berkeley, CA, USA: Apress.
Gerald, B., 2018. A brief review of independent, dependent and one sample t-test. International journal of applied mathematics and theoretical physics, 4(2), pp.50-54.
Giao, H.N.K., 2021. Customer Satisfaction of Vietnam Airline Domestic Services.
Guttag, J.V., 2016. Introduction to computation and programming using Python: With application to understanding data. MIT Press.
Haslwanter, T., 2016. An Introduction to Statistics with Python: With Applications in the Life Sciences. Switzerland: Springer International Publishing.
Karadimitriou, S.M., Marshall, E. and Knox, C., 2018. Mann-Whitney U test. Sheffield: Sheffield Hallam University.
Le, V.T., Zhang, J., Johnstone, M., Nahavandi, S. and Creighton, D., 2012, October. A generalised data analysis approach for baggage handling systems simulation. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 1681-1687). IEEE.
Massaron, L. and Boschetti, A., 2016. Regression analysis with Python. Packt Publishing Ltd.
Mircioiu, C. and Atkinson, J., 2017. A comparison of parametric and non-parametric methods applied to a Likert scale. Pharmacy, 5(2), p.26.
Orcan, F., 2020. Parametric or non-parametric: Skewness to test normality for mean comparison. International Journal of Assessment Tools in Education, 7(2), pp.255-265.
Ross, A. and Willson, V.L., 2017. One-way anova. In Basic and advanced statistical tests (pp. 21-24). SensePublishers, Rotterdam.