Data sets description
1.a)
This assignment demonstrates techniques for collecting and analysing data sets with “MS-Excel” software. It deals with both a primary and a secondary data set, and applies a range of statistical methods to obtain descriptive and inferential statistics.
The data set gathered from the “Australian Taxation Office (ATO)” helps readers to examine the “Gender gap” in salaries and wages. The proposed cause of the “Gender gap” is discrimination in hiring. It is also claimed that salary and occupational preference vary significantly with gender. The data set summarises the job profiles of males and females across various industries. The researcher treats this file as the main data file; it contains 1000 samples in total.
To check the validity of the findings from the first data set, the researcher collected a second data set by simple random sampling. The research questions established for the second data set are:
- “Is there any difference in the mean salary amount between “Males” and “Females” in the second data set?”
- “Are the proportions of “Males” and “Females” working in the occupation “Machinery Operators and drivers” equal or not?”
1. b)
The data set provided in the datasheet is “Secondary data”, collected by the Australian Taxation Office (ATO) while compiling detailed information for the 2013-2014 lodgement session.
The data set under analysis has four variables: 1) “Gender”, 2) “Occupation code”, 3) “Salary and Wage amount” and 4) “Gift amount”. The “Numerical” or “Quantitative” variables are 1) “Salary and Wage amount” and 2) “Gift amount”. The qualitative variables are 1) “Gender” and 2) “Occ_code”. “Gender” is a “Categorical” variable with two levels, “Males” and “Females”. “Occupation code” is a “Categorical” variable that has been coded numerically.
Table 1: The first five cases of data set 1 are shown below
1. c)
The second data set was collected by the “Random Sampling Method”. The researcher surveyed 100 international students living in Australia, so the data are primary data. The data set has three variables: “Gender”, “Sw_amt” and “Occ_code”. “Sw_amt” is a quantitative variable and the other two are categorical variables.
2. a)
Figure 1: Visualization of “Occupational code” and “Gender”
“Female” employees mainly work in two occupations: “Clerical and Administrative workers” and “Professionals”. “Male” employees mainly work in two occupations: “Technicians and Trade Workers” and “Professionals”. The number of female workers is lowest in the occupation “Machinery Operators and drivers”. The number of male workers is lowest in the occupation “Sales worker”.
Research questions
Figure 2: Visualization of “Gender” and “Salary and Wage amount”
The mean salary and the standard deviation of salaries are both higher for “Male” employees than for “Female” employees.
Figure 3: Histogram of “Salary and wage amount” of “Males”
(Jelen 2010)
Figure 4: Histogram of “Salary and Wage amount” of “Females”
Table 2: Table of “Numerical Summary” of “Salary and Wage amount” gender wise
- The lowest salary of both males and females is 0.
- The highest salaries of males and females are equal (308183).
- On average, males earn a higher salary and wage than females (48181.46 > 33841.72).
- The standard deviation of salaries is greater for males than for females (46863.41 > 33428.35).
- The total amount earned by males is much greater than that earned by females (25150721 > 16176341).
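The gender-wise summary in Table 2 was produced in MS-Excel; the same quantities can be sketched in Python as a cross-check. The records below are hypothetical placeholders (the real file has 1000 rows), and the function name `summarise` is an illustrative choice, not part of the original analysis.

```python
from statistics import mean, stdev

# Hypothetical (gender, salary) records standing in for the ATO file.
records = [
    ("Male", 52000.0), ("Male", 0.0), ("Male", 81000.0),
    ("Female", 34000.0), ("Female", 0.0), ("Female", 47000.0),
]

def summarise(rows):
    """Return min, max, mean, sample standard deviation and sum per gender."""
    groups = {}
    for gender, salary in rows:
        groups.setdefault(gender, []).append(salary)
    return {
        gender: {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "sd": stdev(values),  # sample SD (n - 1 denominator), as in Excel's STDEV.S
            "sum": sum(values),
        }
        for gender, values in groups.items()
    }

summary = summarise(records)
for gender, stats in summary.items():
    print(gender, stats)
```

As a consistency check on the reported figures, the sum should equal the mean times the group size: 48181.46 × 522 ≈ 25150722 for males and 33841.72 × 478 ≈ 16176342 for females, which agree with the totals above up to rounding of the means.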
Figure 5: Scatter plot of “Gift amount” and “Salary and Wage amount”
According to the scatter plot, the two quantitative variables “Salary and wage amount” and “Gift amount” show no linear relationship with each other.
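A visual reading of a scatter plot can be backed by the Pearson correlation coefficient, where a value close to 0 indicates little linear association. The sketch below uses hypothetical salary and gift values, since the underlying rows are not reproduced in this report; `pearson_r` is an illustrative helper.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical values, not taken from the ATO data set.
salary = [0.0, 25000.0, 40000.0, 62000.0, 90000.0]
gift = [0.0, 120.0, 0.0, 15.0, 60.0]

r = pearson_r(salary, gift)
print(f"Pearson r = {r:.3f}")
```

In Excel the equivalent calculation is available through the `CORREL` function.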
Table 3: The “Median” salaries of top four “Occupations”
Table 4: Pivot table of “Gender” and “Occupations with top four median salaries”
(Jelen and Alexander 2010)
Ranked by median salary, the top four occupations are:
- “Managers”.
- “Professionals”.
- “Machinery operators and drivers”.
- “Technicians and trade workers”.
The proportions of “Male” and “Female” employees across all four occupations are 0.65 and 0.35 respectively. The proportions of “Male” and “Female” “Managers” are 0.64 and 0.36 respectively. The proportions of “Male” and “Female” “Professionals” are 0.47 and 0.53 respectively. The proportions of “Male” and “Female” “Technicians and Trade Workers” are 0.81 and 0.19 respectively. The proportions of “Male” and “Female” “Machinery operators and drivers” are 0.96 and 0.04 respectively.
Hypotheses:
Null hypothesis (H0): “The proportion of “Male” employees in the profession of “Machinery operators and drivers” is 0.8”.
Alternative hypothesis (HA): “The proportion of “Male” in the profession of “Machinery Operators and drivers” is higher than 0.8”.
Table 6: “One sample proportional Z-test”
Among the 46 employees who work as “Machinery Operators and drivers”, 44 are “Male” and the remaining 2 are “Female”. The calculated proportion of males is 0.9565. Allowing for the margin of error, the 95% confidence interval for the proportion of male “Machinery operators and drivers” runs from 0.89759 to 1. A one-sample proportion z-test at “Level of significance” = 0.05 is applied to test the hypothesis. The calculated Z-statistic is 2.654. The critical value used is 1.959964, which is the two-tailed 5% value; the one-tailed 5% value that matches this alternative hypothesis is smaller (1.645), so the decision is the same with either. As 2.654 > 1.959964, “Z-calculated” is greater than “Z-critical”. Therefore, the analyst can reject the null hypothesis at the 5% level of significance (Lehmann and Romano 2006), and the data support the alternative hypothesis.
Data Analysis
Conclusion: It can be concluded that the proportion of males in the profession of “Machinery Operators and drivers” is higher than 80%.
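The one-sample proportion test above can be reproduced from the counts given in the text (44 males out of 46, tested against 0.8). The following is a minimal sketch using only the Python standard library; the function name is illustrative, and the interval is the simple Wald interval truncated at 1, which is assumed to be how the reported margin of error was obtained.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_proportion_z(successes, n, p0):
    """Return the sample proportion and the z statistic for H0: p = p0."""
    p_hat = successes / n
    se_null = sqrt(p0 * (1 - p0) / n)  # standard error under H0
    return p_hat, (p_hat - p0) / se_null

n = 46
p_hat, z = one_sample_proportion_z(44, n, 0.8)

std_normal = NormalDist()
z_crit_one_tailed = std_normal.inv_cdf(0.95)   # upper-tailed test at alpha = 0.05
z_crit_two_tailed = std_normal.inv_cdf(0.975)  # the 1.959964 value quoted in the text
p_value = 1 - std_normal.cdf(z)                # upper-tail p-value

# 95% Wald confidence interval, truncated at 1 because a proportion cannot exceed 1.
margin = z_crit_two_tailed * sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - margin
upper = min(1.0, p_hat + margin)

print(f"p_hat = {p_hat:.4f}, z = {z:.3f}, one-tailed p = {p_value:.4f}")
print(f"95% CI: {lower:.5f} to {upper:.5f}")
```

The z statistic exceeds both the one-tailed and the two-tailed critical value, so the rejection of the null hypothesis does not depend on which of the two is used.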
Hypotheses:
Null hypothesis (H0): “The difference between the mean “Salary and Wage” amounts of “Male” and “Female” employees is equal to 0”.
Alternative hypothesis (HA): “The difference between the mean “Salary and Wage” amounts of “Male” and “Female” employees is not equal to 0”.
Table 7: Table of “Two-sample t-test assuming unequal variances”
(Yuen 1974)
Among the 1000 people sampled, 478 are “Females” and 522 are “Males”. The mean “Salary and Wage” of females is 33841.72 and the mean “Salary and Wage” of males is 48181.46. A two-sample t-test with unequal variances is applied to test the null hypothesis at “Level of significance” = 0.05. The calculated t-statistic is -5.605 with 943 degrees of freedom. The calculated p-value is approximately 0 (reported as 0.0). As the two-tailed p-value of the t-statistic is less than 5%, the null hypothesis can be rejected at the 5% level of significance, and the data support the alternative hypothesis.
Conclusion: It can be concluded that the mean “Salary and Wage amount” of “Male” employees is higher than the mean “Salary and Wage amount” of “Female” employees at the 5% level of significance.
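The t-statistic and degrees of freedom of this test can be recomputed from the summary statistics in Table 2 (means, standard deviations and group sizes). The sketch below implements the unequal-variance (Welch) statistic with the Welch-Satterthwaite degrees of freedom. The standard library has no t-distribution, so the p-value shown is a normal approximation, which is an assumption that is reasonable only because the degrees of freedom are large.

```python
from math import sqrt
from statistics import NormalDist

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2 mean
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof

# Females first, males second, matching the sign of the reported statistic.
t_stat, dof = welch_t(33841.72, 33428.35, 478, 48181.46, 46863.41, 522)

# Two-tailed p-value via the normal approximation (assumption: large df).
p_approx = 2 * NormalDist().cdf(-abs(t_stat))

print(f"t = {t_stat:.3f}, df = {dof:.1f}, approximate two-tailed p = {p_approx:.2e}")
```

The negative sign only reflects the order of subtraction (females minus males); the direction of the difference is read from the two means.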
Hypotheses:
Null hypothesis (H0): “The difference between the mean “Salary and Wage” amounts of “Male” and “Female” employees is 0 for the surveyed data”.
Alternative hypothesis (HA): “The difference between the mean “Salary and Wage” amounts of “Male” and “Female” employees is not equal to 0 for the surveyed data”.
Table 8: Tables of two-sample t-test assuming unequal variances
The second data set has a total of 100 samples. The mean “Salary and Wage” amount of “Male” employees is 61421.02 and the mean “Salary and Wage” amount of “Female” employees is 33423.79245. The “Two samples t-test” with unequal variances is applied to test the difference between the two means at “Level of significance” = 0.05. The calculated t-statistic is -2.714 with 64 degrees of freedom (Romano and Lehmann 2005). The calculated two-tailed p-value is 0.008, which is less than 5%. Hence, the analyst can reject the null hypothesis at the 5% level of significance.
Conclusion: It can be concluded that the mean “Salary and Wage” amount of “Male” employees is greater than the mean “Salary and Wage” amount of “Female” employees at the 5% level of significance.
Hypotheses:
Null hypothesis (H0): “The difference between the proportions of males and females working as “Machinery drivers and operators” is 0”.
Alternative hypothesis (HA): “The proportion of “Male” employees working as “Machinery Drivers and operators” is higher than the proportion of “Female” employees working as “Machinery drivers and operators””.
Table 9: Table of two-samples proportions Z-test
Out of 53 females, only 2 (proportion = 3.774%) work as “Machinery drivers and operators”, and out of 47 males, only 6 (proportion = 12.766%) work as “Machinery drivers and operators”. A two-sample proportion Z-test is applied to test the difference between the proportions (Panik 2012). The calculated proportions give a “Z-statistic” of 1.654. The two-tailed “Z-critical” at the 5% level of significance is 1.9599, which is greater than the calculated Z-statistic (Cressie and Whitford 1986), so against that value the result would not be significant. The alternative hypothesis here is one-sided, however, and the corresponding one-tailed 5% critical value is 1.645. As 1.654 > 1.645, the researcher can reject the null hypothesis, although only narrowly.
Conclusion: It can be concluded that the proportion of males in the profession of “Machinery drivers and operators” is greater than the proportion of females.
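The two-sample proportion test can be checked from the counts in the text (6 of 47 males and 2 of 53 females). This is a minimal sketch with the pooled standard error under the null hypothesis; the function name is illustrative, and the pooled form is assumed to be the one used in the Excel calculation because it reproduces the reported statistic.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Males first, females second, matching the one-sided alternative p_male > p_female.
z = two_proportion_z(6, 47, 2, 53)

std_normal = NormalDist()
z_crit_one_tailed = std_normal.inv_cdf(0.95)   # one-sided test at alpha = 0.05
z_crit_two_tailed = std_normal.inv_cdf(0.975)  # two-sided value quoted in the text
p_one_tailed = 1 - std_normal.cdf(z)

print(f"z = {z:.3f}, one-tailed p = {p_one_tailed:.4f}")
print(f"critical values: one-tailed {z_crit_one_tailed:.4f}, two-tailed {z_crit_two_tailed:.4f}")
```

The statistic falls between the two critical values, so the decision depends on whether the test is treated as one-sided or two-sided. With only 8 machinery operators in the sample, the normal approximation is also rough, and the result should be read with caution.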
Section 4: Discussion and Conclusion
4. a)
The main facts that came to light through the analysis are:
- Females prefer to work as “Clerical and administrative staff” and “Professional employees”.
- Males prefer to work as “Technicians and trade workers” and “Professional employees”.
- Out of the nine occupations considered, the median “Salary and Wages” are highest in the professions 1) “Managers”, 2) “Professionals”, 3) “Machinery operators and drivers” and 4) “Technicians and Trades Workers”, in that order.
- The amount of “Salary and Wage” is higher for males than for females.
- The graphical visualization of the previous facts is supported by inferential results from the “Testing of Hypothesis”.
- This inference is also validated by the surveyed data set, as a significant difference between the mean salaries of “Male” and “Female” employees is found in the primary data analysis.
- The occupation “Machinery Operators and drivers” is dominated by “Male” employees, as the proportion of “Male” employees is higher than 80%.
- Hence, the analysis finds that both “Occupation” type and “Salary and wage amount” play a vital role in identifying “Gender discrimination”.
4. b)
The future scope of the research is as follows:
- The number of samples in the surveyed data could be larger (close to 1000), so that it can be compared directly with the secondary data set.
- The reasons behind the “Salary and wages” amounts of males, females or both could be identified and distinguished. The corresponding variables and parameters should be included in the data sets.
- More information could be mined from the data set if the variables “Age”, “Years of experience”, “Educational level” and “Monthly working hours” were included.
References:
Cressie, N.A.C. and Whitford, H.J., 1986. How to Use the Two Sample t-Test. Biometrical Journal, 28(2), pp.131-148.
Jelen, B. and Alexander, M., 2010. Pivot Table Data Crunching: Microsoft Excel 2010. Pearson Education.
Jelen, B., 2010. Charts and Graphs: Microsoft Excel 2010. Que Publishing.
Lehmann, E.L. and Romano, J.P., 2006. Testing statistical hypotheses. Springer Science & Business Media.
Panik, M.J., 2012. Testing Statistical Hypotheses. Statistical Inference: A Short Course, pp.184-216.
Romano, J.P. and Lehmann, E.L., 2005. Testing statistical hypotheses.
Yuen, K.K., 1974. The two-sample trimmed t for unequal population variances. Biometrika, 61(1), pp.165-170.