Variables and their Levels of Measurement
The provided dataset (dataset 1) is a sample of 2013-2014 individual tax returns from the Australian Taxation Office (ATO), and it records the lodgment method used for each return. The lodgment method is either a tax agent or self-preparation: each taxpayer lodges the return for a financial year in one of these two ways. The assignment analyzes the lodgment method for gender and for different age groups, using information from individual tax returns lodged after the end of the financial year.
Two types of data can be used for analysis. The first is primary data, which is collected directly from the customers of an organization on the basis of a questionnaire. The other is secondary data, which is collected from the official website of the organization or from other official websites related to it (Goodwin, 2012, p.130). Dataset 1 is a subset of Australian Taxation Office data collected from the official website of the ATO. Because the data was collected by a government site and not by the specific user, it is secondary data. Data can be categorized as qualitative or quantitative: qualitative data can be further categorized as binary, nominal or ordinal level of measurement, and quantitative data can be further categorized as interval or ratio level of measurement (Morgan, 2013, p.9).
The variable gender has two categories (0=male and 1=female), so it is a nominal-level variable. It indicates the gender of a taxpayer.
The variable age_range contains integer values, so it is treated as a quantitative variable. It indicates the age group of a taxpayer.
The variable lodgment_method has two categories (A=Tax agent and B=Self prepare), so it is a nominal-level variable. It indicates the lodgment category of a taxpayer.
The variable tot_inc_amt contains integer values, so it is a quantitative variable. It indicates the total income amount on which a taxpayer pays tax.
The variable tot_ded_amt contains integer values, so it is a quantitative variable. It indicates the total amount deducted from a taxpayer's actual income.
The first five cases of dataset 1 are shown below:
Gender | age_range | Lodgment_method | Tot_inc_amt | Tot_ded_amt |
0 | 5 | A | 49612 | 8184 |
0 | 6 | A | 131313 | 7686 |
1 | 6 | S | 53320 | 1201 |
0 | 9 | S | 56748 | 95 |
1 | 6 | A | 84863 | 2016 |
Statistical methods are essentially a process of collecting, summarizing and analyzing data and interpreting the results. A questionnaire is built around the factors that are important to the study of the organization. The study design contains a specific plan and a structure for getting answers from the respondents. The questionnaire contains open-ended and closed-ended questions and variables at the nominal, ordinal, interval and ratio levels. The analysis of the data collected through the questionnaire indicates the strengths, weaknesses, opportunities and threats of the factors of the study. The statistical output gives a summary of the analysis, which contains a graphical representation of each factor, a numerical summary of each factor and the final principal components of the study (Brace, 2008, p.45).
Data Collection and Analysis Procedure
The procedure for data collection and analysis can be outlined as follows:
- Topic selection: The scope should not be too wide, and the level of measurement should be accurate.
- Determination of hypothesis: It states the objective of the study.
- Sampling method: Select an appropriate sampling method for the study.
- Data collection: Data should be collected through direct interviews or from the data of other similar companies.
- Data handling: Code the responses and assign them to levels of measurement.
- Statistical analysis: It includes the appropriate statistical model for the analysis.
- Gathering of results: It includes the graphical and numerical representation of the data.
- Conclusions: Determine the findings related to the hypothesis of the study (Bethlehem, 2009, p.1).
For dataset 2, I collected data on 503 individuals by using a survey. The sampling process involved calculating a representative sample size for the population, and the sample size for this survey was set at 503. A large sample is more likely to be representative of the population and to give unbiased results, provided that the method of data collection is also unbiased. It is primary data, collected directly from the respondents on the basis of a questionnaire.
The survey includes the following variables:
- Gender: It indicates the gender of an individual (0=male and 1=female). It is a nominal-level variable.
- Age: It indicates the age of the respondent, which contains integer values.
- Lodgment method: It indicates the preferred method as “A=Tax agent” or “B=Self prepare”.
- Total income amount: It indicates the total income of a respondent in a financial year.
- Total deduction amount: It indicates the total amount deducted from a taxpayer's actual income in a financial year.
Section 2:
The variable lodgment method is a qualitative variable with two categories, “A=Tax agent” and “B=Self prepare”, so a pie chart is suitable for it. The percentage frequency of each lodgment method is shown below:
So, 740 people hired a tax agent to lodge their tax return and 260 people lodged it by self-preparation.
The sample size is 1000, and 740 people hired a tax agent. The sample proportion $\hat{p}$ is a point estimate of the population proportion $p$. So, the point estimate of the proportion is:
$$\hat{p} = \frac{x}{n} = \frac{740}{1000} = 0.74$$
Now use the Z-statistic to calculate the 95% confidence interval. The formula for the 95% confidence interval is shown below:
$$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Here, $\hat{p}$ is the sample proportion, $n$ is the sample size and $Z$ is the critical value at the specified level of significance. The critical value at the 5% level of significance is 1.96. So, the confidence interval is calculated as:
$$0.74 \pm 1.96 \sqrt{\frac{0.74 \times 0.26}{1000}}$$
Hence, the 95% confidence interval for the proportion of taxpayers who lodge the tax return by using a tax agent is (0.712, 0.767).
The lower limit of the confidence interval is 0.712 and the upper limit is 0.767. The interval is centered on the sample proportion (0.74) by construction, so, assuming the sample of 1000 people is representative of the population, it can be said with 95% confidence that the population proportion lies between 0.712 and 0.767.
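The interval calculation above can be sketched in a few lines of Python. This is a minimal sketch of the normal-approximation interval for a proportion, not the worksheet used for the assignment; the function name `proportion_ci` and its parameters are hypothetical and chosen only for illustration.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Returns (p_hat, lower, upper). z=1.96 is the 95% critical value
    quoted in the text.
    """
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

# Dataset 1: 740 of 1000 taxpayers lodged through a tax agent.
p_hat, lower, upper = proportion_ci(740, 1000)
print(f"p_hat = {p_hat:.3f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The limits agree with the reported interval up to rounding in the third decimal place. The same function can be reused for dataset 2 by passing its counts instead.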
Section 3:
In dataset 2, the variable lodgment method is again a qualitative variable with two categories, “A=Tax agent” and “B=Self prepare”, so a pie chart is suitable for it. The percentage frequency of each lodgment method is shown below:
So, 245 people hired a tax agent to lodge their tax return and 258 people lodged it by self-preparation.
The sample size is 503, and 245 people hired a tax agent. The sample proportion $\hat{p}$ is a point estimate of the population proportion $p$. So, the point estimate of the proportion is:
$$\hat{p} = \frac{x}{n} = \frac{245}{503} \approx 0.487$$
Now use the Z-statistic to calculate the 95% confidence interval. The formula for the 95% confidence interval is shown below:
$$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Here, $\hat{p}$ is the sample proportion, $n$ is the sample size and $Z$ is the critical value at the specified level of significance. The critical value at the 5% level of significance is 1.96. So, the confidence interval is calculated as:
$$0.487 \pm 1.96 \sqrt{\frac{0.487 \times 0.513}{503}}$$
Hence, the 95% confidence interval for the proportion of taxpayers who lodge the tax return by using a tax agent is (0.443, 0.530).
Analysis of Lodgment Method by Gender
The lower limit of the confidence interval is 0.443 and the upper limit is 0.530. The interval is centered on the sample proportion (48.7%) by construction, so, assuming the sample of 503 people is representative of the population, it can be said with 95% confidence that the population proportion lies between 0.443 and 0.530.
Thus, the proportion of people who prefer to hire a tax agent is greater in dataset 1 than in dataset 2. Dataset 2 indicates that almost equal numbers of people prefer a tax agent and self-preparation, while dataset 1 indicates that most people prefer to hire a tax agent rather than self-prepare.
The age group is a quantitative variable and the lodgment method is a qualitative variable. The histogram and the frequency of each lodgment method for each age group, obtained by using Excel, are shown below:
Count of Lodgment_method
Age group | Agent | Self Prepared | Grand Total |
0 | 41 | 16 | 57 |
1 | 34 | 15 | 49 |
2 | 57 | 11 | 68 |
3 | 78 | 16 | 94 |
4 | 85 | 22 | 107 |
5 | 86 | 15 | 101 |
6 | 75 | 17 | 92 |
7 | 82 | 27 | 109 |
8 | 74 | 30 | 104 |
9 | 61 | 38 | 99 |
10 | 51 | 41 | 92 |
11 | 16 | 12 | 28 |
Grand Total | 740 | 260 | 1000 |
The histogram for row percentages is shown below:
So, age group 5 has the highest percentage of taxpayers who use a tax agent (86 of 101, about 85%).
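The row percentages behind the histogram can be recomputed from the frequency table. The sketch below assumes the counts are typed in by hand from that table; the names `OBSERVED_COUNTS` and `agent_row_percentages` are hypothetical and are not part of the Excel worksheet.

```python
# Observed counts per age group as (agent, self_prepared),
# copied from the frequency table of dataset 1.
OBSERVED_COUNTS = {
    0: (41, 16), 1: (34, 15), 2: (57, 11), 3: (78, 16),
    4: (85, 22), 5: (86, 15), 6: (75, 17), 7: (82, 27),
    8: (74, 30), 9: (61, 38), 10: (51, 41), 11: (16, 12),
}

def agent_row_percentages(table):
    """Percentage of each age group's returns that were lodged by a tax agent."""
    return {group: 100 * agent / (agent + self_prep)
            for group, (agent, self_prep) in table.items()}

row_pct = agent_row_percentages(OBSERVED_COUNTS)
for group, pct in row_pct.items():
    print(f"age group {group:>2}: {pct:5.1f}% agent")

# Age group with the largest share of agent-lodged returns.
highest_group = max(row_pct, key=row_pct.get)
print("highest agent share in age group", highest_group)
```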
The chi-square test is applied to test the association between categorical variables. The analysis of age group against lodgment method was done in the dataset 1 Excel worksheet.
The formula for the test statistic is given below:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Here, $E$ is the expected frequency and $O$ is the observed frequency. The chi-square test can be used if every expected frequency is greater than or equal to 5. The formula to calculate the expected frequencies is shown below:
$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$
The calculated expected frequencies for all the age groups corresponding to the lodgment method are shown below:
Expected frequencies of Lodgment_method
Age group | Agent | Self Prepared | Grand Total |
0 | 42.18 | 14.82 | 57 |
1 | 36.26 | 12.74 | 49 |
2 | 50.32 | 17.68 | 68 |
3 | 69.56 | 24.44 | 94 |
4 | 79.18 | 27.82 | 107 |
5 | 74.74 | 26.26 | 101 |
6 | 68.08 | 23.92 | 92 |
7 | 80.66 | 28.34 | 109 |
8 | 76.96 | 27.04 | 104 |
9 | 73.26 | 25.74 | 99 |
10 | 68.08 | 23.92 | 92 |
11 | 20.72 | 7.28 | 28 |
Grand Total | 740 | 260 | 1000 |
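The expected frequencies in this table follow from the row totals, column totals and grand total of the observed counts. The following is a minimal sketch under the assumption that the observed counts are entered by hand from the frequency table; `OBSERVED` and `expected_frequencies` are illustrative names, not the Excel formulas used in the worksheet.

```python
# Observed counts for age groups 0-11 as (agent, self_prepared).
OBSERVED = [
    (41, 16), (34, 15), (57, 11), (78, 16), (85, 22), (86, 15),
    (75, 17), (82, 27), (74, 30), (61, 38), (51, 41), (16, 12),
]

def expected_frequencies(rows):
    """Expected count of each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

expected = expected_frequencies(OBSERVED)
for group, (e_agent, e_self) in enumerate(expected):
    print(f"{group:>2}  {e_agent:6.2f}  {e_self:6.2f}")

# Rule of thumb quoted in the text: every expected count should be at least 5.
all_at_least_five = all(e >= 5 for row in expected for e in row)
print("all expected counts >= 5:", all_at_least_five)
```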
All of the expected frequencies are greater than 5, so the chi-square test for association can be used for the analysis. The null and alternate hypotheses are shown below:
Null hypothesis: There is no association between age group and lodgment method.
Alternate hypothesis: There is an association between age group and lodgment method.
The chi-square statistic calculations are shown below:
Chi-square contributions of Lodgment_method
Age group | Agent | Self Prepared | Grand Total |
0 | 0.033 | 0.094 | 0.127 |
1 | 0.141 | 0.401 | 0.542 |
2 | 0.887 | 2.524 | 3.411 |
3 | 1.024 | 2.915 | 3.939 |
4 | 0.428 | 1.218 | 1.645 |
5 | 1.696 | 4.828 | 6.525 |
6 | 0.703 | 2.002 | 2.705 |
7 | 0.022 | 0.063 | 0.086 |
8 | 0.114 | 0.324 | 0.438 |
9 | 2.052 | 5.839 | 7.891 |
10 | 4.285 | 12.196 | 16.481 |
11 | 1.075 | 3.060 | 4.135 |
Grand Total | 12.460 | 35.464 | 47.924 |
The degrees of freedom for the test, with 12 age groups (rows) and 2 lodgment methods (columns), are:
$$df = (r - 1)(c - 1) = (12 - 1)(2 - 1) = 11$$
The p-value for the chi-square test is less than 0.0005.
According to the results obtained, the value of the chi-square test statistic is 47.92. The p-value of the test is less than the 0.05 level of significance, so the null hypothesis is rejected. Hence, it can be concluded that there is an association between age group and lodgment method.
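The test statistic and the degrees of freedom can be checked with a short script. This is a sketch only: the observed counts are typed in from the frequency table, the names are hypothetical, and instead of computing a p-value it compares the statistic with 19.675, the upper 5% critical value of the chi-square distribution with 11 degrees of freedom taken from a standard table (an assumption not stated in the worksheet).

```python
# Observed counts for age groups 0-11 as (agent, self_prepared).
OBSERVED = [
    (41, 16), (34, 15), (57, 11), (78, 16), (85, 22), (86, 15),
    (75, 17), (82, 27), (74, 30), (61, 38), (51, 41), (16, 12),
]

# Assumption: 5% critical value for 11 degrees of freedom from a standard table.
CRITICAL_VALUE_DF11 = 19.675

def chi_square_independence(rows):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for row, rt in zip(rows, row_totals):
        for obs, ct in zip(row, col_totals):
            exp = rt * ct / grand_total       # expected count for this cell
            chi2 += (obs - exp) ** 2 / exp    # cell contribution (O - E)^2 / E
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, dof

chi2, dof = chi_square_independence(OBSERVED)
reject_null = chi2 > CRITICAL_VALUE_DF11
print(f"chi-square = {chi2:.2f}, df = {dof}, reject null at 5%: {reject_null}")
```

Comparing with a tabulated critical value keeps the sketch free of third-party libraries; a statistics package would report the p-value directly.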
The total income is a quantitative variable and the lodgment method is a qualitative variable. The boxplot obtained by using StatKey is shown below:
Analysis of Lodgment Method by Age Group
The above boxplot indicates outliers in total income for both lodgment methods, agent and self-prepared.
The obtained dot plot is shown below:
So, the largest number of people who hire a tax agent have a total income between 0 and 50000.
The obtained summary statistics are shown below:
Statistics | A | S | Overall |
Sample Size | 740 | 260 | 1000 |
Mean | 60601.249 | 43878.846 | 56253.424 |
Standard Deviation | 70226.303 | 42013.481 | 64495.602 |
Minimum | -7752 | 0 | -7752 |
Q1 | 25320.50 | 18216.00 | 23017.50 |
Median | 46077.50 | 37318.00 | 44113.50 |
Q3 | 73555.00 | 57724.50 | 70593.00 |
Maximum | 1052414 | 352377 | 1052414 |
The average total income of people who prefer a tax agent is 60601.25 and the maximum total income is 1052414.
The average total income of people who prefer to self-prepare is 43878.85 and the maximum total income is 352377.
The distribution of income for each lodgment method is positively skewed, as most incomes lie on the left side of the distribution with a long tail to the right. So, the data for total income is skewed and not normally distributed. The boxplot shows outliers in the dataset, which is a further indication that income is not normally distributed.
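The skewness and outlier remarks can be checked against the summary statistics alone. The sketch below assumes the common 1.5 × IQR rule for flagging outliers in a boxplot (the rule is an assumption here, not something stated in the report), and the names `SUMMARY`, `tukey_fences` and `describe_shape` are hypothetical.

```python
# Summary statistics of total income, copied from the table above.
SUMMARY = {
    "A": {"mean": 60601.249, "median": 46077.50, "q1": 25320.50,
          "q3": 73555.00, "min": -7752, "max": 1052414},
    "S": {"mean": 43878.846, "median": 37318.00, "q1": 18216.00,
          "q3": 57724.50, "min": 0, "max": 352377},
}

def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def describe_shape(stats):
    """Flag right skew (mean above median) and values beyond the fences."""
    lower, upper = tukey_fences(stats["q1"], stats["q3"])
    return {
        "mean_above_median": stats["mean"] > stats["median"],
        "has_high_outliers": stats["max"] > upper,
        "has_low_outliers": stats["min"] < lower,
        "upper_fence": upper,
    }

shape = {method: describe_shape(stats) for method, stats in SUMMARY.items()}
for method, result in shape.items():
    print(method, result)
```

For both lodgment methods the mean exceeds the median and the maximum lies above the upper fence, which is consistent with a positively skewed distribution with high-income outliers.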
A scatterplot is a visual representation of the relationship between two quantitative variables; it indicates how the variables are associated and how strong the relationship is. One variable is considered the explanatory variable and the other the response variable. A positive trend in the scatterplot indicates a positive association between the variables: as the value of one variable increases, the corresponding value of the other variable also increases (Rubin, 2009, p.209). A negative trend in the scatterplot indicates a negative association between the variables: as the value of one variable increases, the corresponding value of the other variable decreases.
No trend in the scatterplot indicates no association between the variables. Correlation is a measure of the relationship between two variables. It measures the strength of the relationship between two normally distributed interval or ratio level variables. The coefficient of correlation is denoted by r, and its value lies between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.
In general, if the absolute value of r lies between 0 and 0.19, the relationship between the two variables is very weak. If it lies between 0.20 and 0.39, the relationship is weak. If it lies between 0.40 and 0.59, the relationship is moderate. If it lies between 0.60 and 0.79, the relationship is strong. And, if it lies between 0.80 and 1.00, the relationship is very strong (Israel, 2009, p.111). The scatterplots of total income amount against total deduction amount for people who hire a tax agent and for people who self-prepare are shown below:
Association between Age Group and Lodgment Method
Thus, in both scatterplots, as the value of the variable on the horizontal axis increases, the corresponding value of the variable on the vertical axis tends to increase only slightly. So, there is a weak positive association between the variables in each group.
The value of the correlation coefficient for the relationship between total income amount and total deduction amount is 0.385 for people who hire a tax agent and 0.396 for people who self-prepare.
Thus, there is a weak positive association between total income amount and total deduction amount both for people who hire a tax agent and for people who self-prepare.
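The correlation coefficient and the strength scale described above can be sketched as follows. This is an illustrative implementation of the Pearson coefficient with hypothetical function names; the five data points are only the first five cases of dataset 1, which mix both lodgment methods, so the value they give is not expected to match the coefficients reported for the full groups.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| with the bands quoted in the text (band edges as read here)."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"

# First five cases of dataset 1: total income and total deduction amounts.
income = [49612, 131313, 53320, 56748, 84863]
deduction = [8184, 7686, 1201, 95, 2016]
r_sample = pearson_r(income, deduction)
print(f"r (first five cases only) = {r_sample:.3f} -> {strength(r_sample)}")

# Coefficients reported for the full groups in the text.
for label, r in (("tax agent", 0.385), ("self-prepared", 0.396)):
    print(f"{label}: r = {r} -> {strength(r)}")
```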
In dataset 1, 740 people hired a tax agent to lodge their tax return and 260 people lodged it by self-preparation. The 95% confidence interval for the proportion of taxpayers who lodge the tax return by using a tax agent is (0.712, 0.767), so, assuming the sample of 1000 people is representative, the population proportion is estimated to lie in this range.
In dataset 2, 245 people hired a tax agent to lodge their tax return and 258 people lodged it by self-preparation. The 95% confidence interval for the proportion of taxpayers who lodge the tax return by using a tax agent is (0.443, 0.530), so, assuming the sample of 503 people is representative, the population proportion is estimated to lie in this range.
There is an association between age group and lodgment method.
The largest number of people who hire a tax agent have a total income between 0 and 50000, and the largest number of people who self-prepare also have a total income between 0 and 50000. The average total income of people who prefer a tax agent is 60601.25 and the maximum is 1052414, while the average total income of people who prefer to self-prepare is 43878.85 and the maximum total income for an individual is 352377.
The correlation coefficient between total income amount and total deduction amount is 0.385 for people who hire a tax agent and 0.396 for people who self-prepare.
Thus, there is a weak positive association between total income amount and total deduction amount in both groups.
The proportion of people who prefer to hire a tax agent is greater in dataset 1 than in dataset 2. The distribution of income for each lodgment method is positively skewed, as most incomes lie on the left side of the distribution. So, the data for total income in dataset 1 is skewed and not normally distributed, and analysis of it may lead to wrong findings. Thus, the researcher should collect the data again before doing further research.
References:
Goodwin, S. (2012) SAGE Secondary Data Analysis. India: SAGE Publications Pvt. Ltd.
Morgan, D. (2013) Integrating Qualitative and Quantitative Methods: A Pragmatic Approach. India: SAGE Publications Pvt. Ltd.
Bethlehem, J. (2009) Applied Survey Methods. United States of America: John Wiley & Sons, Inc.
Brace, I. (2008) Questionnaire Design: How to Plan, Structure and Write Survey Material for Effective Market Research. Second edition. USA: Kogan Page Publishers.
Rubin, A. (2009) Statistics for Evidence-based Practice and Evaluation. Second edition. Canada: Cengage Learning.
Israel, D. (2009) Data Analysis in Business Research. India: SAGE Publications Pvt. Ltd.