Data Types and Collection Methods
The dataset is a sample from the Australian Taxation Office (ATO) for the 2013-2014 financial year, collected on the basis of lodgment method. The lodgment method has two categories: tax agent and self-prepare. People in Australia must lodge a tax return after the end of the financial year, and they can do so in one of two ways: through a tax agent or by preparing the return themselves. The study analyses the tax returns lodged after the end of the financial year. The sample covers 1000 people and records gender, age group, lodgment method, total income amount for the financial year, and total deduction amount claimed for the financial year.
Datasets can be classified into two types: primary data and secondary data. Primary data are collected for a specific purpose from the relevant population, for example by questionnaire. Secondary data are taken from existing sources, such as internet resources, and were originally collected for some other purpose. Dataset 1 is taken from the Australian Taxation Office, an internet resource, without conducting any survey for the purpose of this study, so it is secondary data (Thomas, 2011).
Categorical data can be recorded in numeric or text form and have two or more categories; the categories are measured at either the nominal or the ordinal level (Maciejewski, 2011).
The variables in the selected dataset are described below.
The variable gender has two categories, male and female, so it is a nominal-level variable.
The variable age_range holds a numeric code for the age group of each person. In this report it is treated as a quantitative variable.
The variable lodgment_method has two categories, tax agent (A) and self-prepare (S), so it is a nominal-level variable.
The variable tot_inc_amt holds the total income of a person in the financial year, so it is a quantitative variable.
The variable tot_ded_amt holds the total deduction amount of a person in the financial year, so it is a quantitative variable.
The first five cases are:
Gender | age_range | Lodgment_method | Tot_inc_amt | Tot_ded_amt
1 | 6 | A | 63357 | 1894
1 | 6 | S | 0 | 0
0 | 1 | A | 20384 | 578
1 | 8 | A | 109473 | 4654
1 | 9 | S | 47405 | 2809
Variable Description
To collect data on the lodgment of tax returns, a survey sampling method is used. First, the components of the questionnaire are decided. The questionnaire is built from relevant questions such as “Is the respondent male or female?”, “What is the age of the respondent?”, “What is the lodgment method?”, “What is the total income amount in the financial year?” and “What is the total amount of deductions from total income in the financial year?”. After the questionnaire is drafted, a pilot survey is conducted to identify any changes needed to the questions and to decide the sample size. The sample size is calculated with the z-based formula. For this survey, I am using a sample size of 300. The place of the survey (the relevant population) is then chosen so that the sample is not biased and is representative of the population. The data can be collected by an online survey or an offline survey. I am using the offline method, so that respondents can easily understand what the questionnaire requires (Ian, 2008).
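The sample-size step can be sketched with the usual z-based formula for a proportion, n = z² p(1 − p) / e². The planning proportion p = 0.5 and the target margin of error e = 0.05 below are assumptions chosen for illustration, not values taken from this report; the second function shows the margin of error that a sample of 300 would give under the same assumed p.

```python
import math


def sample_size_for_proportion(p, margin, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) is at most `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


def margin_of_error(p, n, z=1.96):
    """Half-width of the z-based interval for a proportion with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)


# Assumed planning values (hypothetical, not from the report):
# p = 0.5 is the most conservative choice, margin = 0.05 is a common target.
planned_n = sample_size_for_proportion(p=0.5, margin=0.05)

# Margin of error implied by the sample size of 300 used in the survey.
moe_300 = margin_of_error(p=0.5, n=300)

print("required n:", planned_n)
print("margin of error with n = 300:", round(moe_300, 4))
```

Under these assumptions a sample of 300 gives a slightly wider margin than the 0.05 target, which a pilot survey could confirm or revise.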
The data are collected by survey for a specific purpose, so this is primary data.
The variables are defined below.
The variable gender has two categories, male and female, so it is a nominal-level variable.
The variable age_range holds a numeric code for the age group of each person. In this report it is treated as a quantitative variable.
The variable lodgment_method has two categories, tax agent and self-prepare, so it is a nominal-level variable.
The variable tot_inc_amt holds the total income of a person in the financial year, so it is a quantitative variable.
The variable tot_ded_amt holds the total deduction amount of a person in the financial year, so it is a quantitative variable.
The variable lodgment_method has two categories, tax agent and self-prepare, so it is a nominal-level variable. The frequency of each lodgment method was calculated with a pivot table and plotted as a graph.
Out of the 1000 people, 777 lodged their return through a tax agent and 223 self-prepared.
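The pivot-table count can be sketched in a few lines. This is only an illustration: it uses the Lodgment_method values of the first five cases listed earlier rather than the full 1000-row file, and the variable names are my own.

```python
from collections import Counter

# Lodgment_method of the first five cases shown earlier in this report
# (A = tax agent, S = self-prepare). The full file has 1000 such values.
lodgment_method = ["A", "S", "A", "A", "S"]

# Counter plays the role of the pivot table: one count per category.
frequency = Counter(lodgment_method)

# Relative frequency (proportion) of each category.
n = len(lodgment_method)
proportion = {method: count / n for method, count in frequency.items()}

print(frequency)
print(proportion)
```

Applied to the full dataset, the same count gives the 777 and 223 reported above.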
The z-interval is used to calculate the 95% confidence interval for the proportion of taxpayers who lodge their return through an agent. The sample proportion is,
$\hat{p} = 777/1000 = 0.777$
The formula to calculate the 95% confidence interval is,
$\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}$
The critical value is 1.96. So,
$0.777 \pm 1.96\sqrt{0.777 \times 0.223/1000} = 0.777 \pm 0.0258$
The 95% confidence interval is (0.7512, 0.8028). The lower limit is 0.7512 and the upper limit is 0.8028.
Survey Sampling Method
For the survey sample (dataset 2), the variable lodgment_method again has two categories, tax agent and self-prepare, so it is a nominal-level variable. The frequency of each lodgment method was calculated with a pivot table and plotted as a graph.
Out of the 300 people, 146 lodged their return through a tax agent and 154 self-prepared.
The z-interval is used to calculate the 95% confidence interval for the proportion of taxpayers in the survey sample who lodge their return through an agent. The sample proportion is,
$\hat{p} = 146/300 \approx 0.486$ (truncated to three decimals)
The formula to calculate the 95% confidence interval is,
$\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}$
The critical value is 1.96. So,
$0.486 \pm 1.96\sqrt{0.486 \times 0.514/300} = 0.486 \pm 0.0566$
The 95% confidence interval is (0.429, 0.543). The lower limit is 0.429 and the upper limit is 0.543. For comparison, the 95% confidence interval for dataset 1 is (0.7512, 0.8028), with lower limit 0.7512 and upper limit 0.8028.
Both limits of the confidence interval for the secondary data (dataset 1) are higher than those for the primary data (dataset 2), and the two intervals do not overlap. The proportion of taxpayers who lodge their return through an agent is 0.777 in dataset 1 and 0.486 in dataset 2.
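The two intervals and their comparison can be checked with a short sketch. The function name `proportion_ci` is my own, and the code uses the unrounded proportion 146/300 for dataset 2, so its limits may differ from the reported ones in the third decimal place.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return (p_hat, lower, upper) for a normal-approximation interval."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


p1, low1, high1 = proportion_ci(777, 1000)  # dataset 1 (secondary data)
p2, low2, high2 = proportion_ci(146, 300)   # dataset 2 (primary data)

# Two intervals overlap only if each one starts before the other ends.
intervals_overlap = low1 <= high2 and low2 <= high1

print(f"dataset 1: p = {p1:.3f}, CI = ({low1:.4f}, {high1:.4f})")
print(f"dataset 2: p = {p2:.3f}, CI = ({low2:.4f}, {high2:.4f})")
print("intervals overlap:", intervals_overlap)
```

Since the intervals are far apart, the agent-lodgment proportion in the survey sample is clearly lower than in the ATO sample.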
The variable lodgment method is a nominal-level categorical variable, and each case also has a value of age_range, which is treated here as a quantitative variable. STATKEY is used to produce the graphical and numerical summaries. The histogram is shown below,
The highest age-group code, 11, occurs for both lodgment methods: some people in this group hire an agent and some self-prepare. Age group 11 contains people younger than 20 years, who may be unaware of the tax rates and so hire a tax agent, or may be knowledgeable about them and so self-prepare.
The summary statistics results are,
Statistics | A | S | Overall
Sample Size | 777 | 223 | 1000
Mean | 5.691 | 6.700 | 5.916
Standard Deviation | 3.030 | 3.134 | 3.081
Minimum | 0 | 0 | 0
Q1 | 3.00 | 4.00 | 3.00
Median | 6.00 | 7.00 | 6.00
Q3 | 8.00 | 9.00 | 8.00
Maximum | 11 | 11 | 11
The median age group for people who hire a tax agent is 6 (40-44 years), and the median age group for people who self-prepare is 7 (35-39 years).
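The means and medians in the summary table can be reproduced from a frequency table. The sketch below assumes the per-age-group counts from the chi-square table that appears later in this report; the helper name `expand` is my own.

```python
from statistics import mean, median

# Counts of people per age_range code (0 to 11), taken from the
# chi-square table later in this report. A = tax agent, S = self-prepare.
agent_counts = {0: 40, 1: 47, 2: 53, 3: 68, 4: 64, 5: 85,
                6: 88, 7: 80, 8: 94, 9: 62, 10: 70, 11: 26}
self_counts = {0: 11, 1: 6, 2: 9, 3: 21, 4: 12, 5: 17,
               6: 11, 7: 26, 8: 24, 9: 41, 10: 30, 11: 15}


def expand(counts):
    """Turn a frequency table into a sorted list of individual coded values."""
    return [code for code, count in sorted(counts.items()) for _ in range(count)]


agent_values = expand(agent_counts)
self_values = expand(self_counts)

agent_mean, agent_median = mean(agent_values), median(agent_values)
self_mean, self_median = mean(self_values), median(self_values)

print("A:", len(agent_values), round(agent_mean, 3), agent_median)
print("S:", len(self_values), round(self_mean, 3), self_median)
```

Because age_range is a code for an age band, these figures summarise the codes, not ages in years.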
The chi-square test is applied to two or more nominal- or ordinal-level variables that hold qualitative data. It is used to test whether there is an association between the variables, and also to test goodness of fit (Alan, 2013). To examine the relationship between age group and lodgment method, the chi-square test of association is used. The results obtained from STATKEY are shown below:
Lodgment_method by age_range | 6 | 1 | 8 | 9 | 7 | 10 | 5 | 11 | 0 | 3 | 2 | 4 | Total
A | 88 | 47 | 94 | 62 | 80 | 70 | 85 | 26 | 40 | 68 | 53 | 64 | 777
S | 11 | 6 | 24 | 41 | 26 | 30 | 17 | 15 | 11 | 21 | 9 | 12 | 223
Total | 99 | 53 | 118 | 103 | 106 | 100 | 102 | 41 | 51 | 89 | 62 | 76 | 1000
Observed, Expected, Contribution to χ2
All the expected frequencies are greater than 5. The hypotheses for the test are:
Null hypothesis: There is no association between age group and lodgment method.
Alternative hypothesis: There is an association between age group and lodgment method.
n = 1000, χ2 = 43.871
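The reported statistic can be reproduced directly from the observed counts. The following is a minimal sketch in plain Python, not STATKEY's own procedure. For a table with 2 rows and 12 columns the degrees of freedom are (2 − 1)(12 − 1) = 11, and the constant 19.675 is the tabled chi-square critical value for a 5% significance level with 11 degrees of freedom.

```python
# Observed counts, in the column order of the table in this report
# (age_range codes 6, 1, 8, 9, 7, 10, 5, 11, 0, 3, 2, 4).
observed = {
    "A": [88, 47, 94, 62, 80, 70, 85, 26, 40, 68, 53, 64],  # tax agent
    "S": [11, 6, 24, 41, 26, 30, 17, 15, 11, 21, 9, 12],    # self-prepare
}

row_totals = {method: sum(counts) for method, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
n = sum(row_totals.values())

chi_square = 0.0
min_expected = float("inf")
for method, counts in observed.items():
    for count, col_total in zip(counts, col_totals):
        # Expected count under independence: row total * column total / n.
        expected = row_totals[method] * col_total / n
        min_expected = min(min_expected, expected)
        chi_square += (count - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(col_totals) - 1)

CRITICAL_VALUE = 19.675  # tabled value for alpha = 0.05 and df = 11
reject_null = chi_square > CRITICAL_VALUE

print(f"n = {n}, chi-square = {chi_square:.3f}, df = {df}")
print("smallest expected count:", round(min_expected, 2))
print("reject null hypothesis:", reject_null)
```

The smallest expected count also confirms the condition that every expected frequency exceeds 5.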
The table has 2 rows and 12 columns, so the test has (2 − 1)(12 − 1) = 11 degrees of freedom. The critical value at the 5% significance level (95% confidence level) with 11 degrees of freedom is 19.675.
The chi-square test statistic, 43.871, is larger than the critical value, so it falls in the rejection region. Thus, the null hypothesis is rejected, and we conclude that there is an association between age group and lodgment method.
Statistical Analyses
The variable lodgment method is a nominal-level categorical variable, and each case also has a value of total income amount, which is a quantitative variable. STATKEY is used to produce the graphical and numerical summaries. The boxplot is shown below,
So, most people in both groups, whether they hire a tax agent or self-prepare, have a total income below about 100000.
The summary statistics results are,
Statistics | A | S | Overall
Sample Size | 777 | 223 | 1000
Mean | 60767.079 | 46072.726 | 57490.238
Standard Deviation | 60318.302 | 38491.063 | 56505.149
Minimum | -48994 | 0 | -48994
Q1 | 26207.00 | 16387.50 | 22988.00
Median | 45740.00 | 40166.00 | 44665.00
Q3 | 80404.00 | 64420.50 | 77330.50
Maximum | 813632 | 199524 | 813632
The mean total income is 60767.08 for people who hire a tax agent and 46072.73 for people who self-prepare.
The distribution of total income amount for each lodgment method is positively skewed, as most of the observations lie on the left side of the distribution. The standard deviation of total income is 60318.302 for people who hire a tax agent and 38491.063 for people who self-prepare; these values are very large, which shows a wide spread in the data. The boxplot indicates a large number of outliers in the total income data.
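The skewness and outlier statements can be checked against the summary table alone. The sketch below assumes the common 1.5 × IQR boxplot rule for flagging outliers; that rule is my assumption, not something stated in the report. It also uses "mean above median" as a rough sign of positive skew.

```python
# Summary statistics of total income from the table in this report.
summary = {
    "A": {"mean": 60767.079, "min": -48994, "q1": 26207.00,
          "median": 45740.00, "q3": 80404.00, "max": 813632},
    "S": {"mean": 46072.726, "min": 0, "q1": 16387.50,
          "median": 40166.00, "q3": 64420.50, "max": 199524},
}


def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the assumed 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


results = {}
for method, s in summary.items():
    lower_fence, upper_fence = tukey_fences(s["q1"], s["q3"])
    results[method] = {
        # A mean well above the median suggests a long right tail.
        "mean_above_median": s["mean"] > s["median"],
        "upper_fence": upper_fence,
        "has_high_outliers": s["max"] > upper_fence,
        "has_low_outliers": s["min"] < lower_fence,
    }

for method, result in results.items():
    print(method, result)
```

Under this rule the outliers sit on the high-income side for both groups, which agrees with the positive skew described above.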
The Pearson correlation test (a test of association) is applied to test whether there is a linear relationship in the population. It is applied to quantitative samples, and the two variables should be measured at the interval or ratio level. The result of the test indicates whether or not the population correlation coefficient is 0 (Gravetter & Wallnau, 2010).
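As an illustration of how such a test works, the statistic for the hypothesis that the population correlation is 0 is t = r √(n − 2) / √(1 − r²), with n − 2 degrees of freedom. The sketch below applies it to the correlations and sample sizes reported further down in this report (0.373 with n = 777, and 0.344 with n = 223). The cut-off of 1.96 is a large-sample approximation that I am assuming; the report itself does not carry out this step.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0 (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Sample correlations and sizes reported later in this report.
t_agent = correlation_t_statistic(0.373, 777)   # people who hire a tax agent
t_self = correlation_t_statistic(0.344, 223)    # people who self-prepare

# Assumed large-sample two-sided 5% cut-off (normal approximation).
APPROX_CRITICAL = 1.96

print(f"tax agent:    t = {t_agent:.2f}, significant: {abs(t_agent) > APPROX_CRITICAL}")
print(f"self-prepare: t = {t_self:.2f}, significant: {abs(t_self) > APPROX_CRITICAL}")
```

A weak correlation can still be statistically different from 0 when the sample is large, so the size of r and the outcome of the test answer different questions.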
To examine the relationship between total income amount and total deduction amount for people who hire a tax agent, a scatterplot is used. The scatterplot, made with STATKEY, is,
The scatterplot indicates a positive relationship between total income amount and total deduction amount for people who hire a tax agent.
The obtained numerical summary is,
Statistic | Tot_inc_amt (Tax agent) | Tot_ded_amt (Tax agent)
Mean | 60767.079 | 2997.122
Standard Deviation | 60318.302 | 6155.260
Sample Size | 777 |
Correlation | 0.373 |
Slope | 0.038 |
Intercept | 684.405 |
The value of the correlation coefficient is 0.373, which shows a weak positive relationship between the variables total income amount and total deduction amount for the people who hire a tax agent.
Frequency Tables
To examine the relationship between total income amount and total deduction amount for people who self-prepared, a scatterplot is used. The scatterplot, made with STATKEY, is,
The scatterplot indicates a weak positive relationship between total income amount and total deduction amount for people who self-prepared.
The obtained numerical summary is,
Statistic | Tot_inc_amt (Self-Prepare) | Tot_ded_amt (Self-Prepare)
Mean | 46072.726 | 1357.955
Standard Deviation | 38491.063 | 2848.796
Sample Size | 223 |
Correlation | 0.344 |
Slope | 0.025 |
Intercept | 183.379 |
The value of the correlation coefficient is 0.344, which shows a weak positive relationship between the variables total income amount and total deduction amount for the people who self-prepared to lodge tax.
It can be said that there is a weak positive relationship between total income amount and total deduction amount, both for people who hire a tax agent and for people who self-prepare.
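The slope and intercept in the two numerical summaries follow from the correlation, means and standard deviations through slope = r · s_y / s_x and intercept = ȳ − slope · x̄. The sketch below recomputes them from the rounded table values, so small differences from the STATKEY output are expected; the function name is my own.

```python
def least_squares_line(r, mean_x, sd_x, mean_y, sd_y):
    """Slope and intercept of the regression of y on x from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# x = total income amount, y = total deduction amount (values from the tables).
groups = {
    "tax agent": dict(r=0.373, mean_x=60767.079, sd_x=60318.302,
                      mean_y=2997.122, sd_y=6155.260),
    "self-prepare": dict(r=0.344, mean_x=46072.726, sd_x=38491.063,
                         mean_y=1357.955, sd_y=2848.796),
}

lines = {name: least_squares_line(**stats) for name, stats in groups.items()}

# r squared: share of the variation in deductions explained by income.
r_squared = {name: stats["r"] ** 2 for name, stats in groups.items()}

for name, (slope, intercept) in lines.items():
    print(f"{name}: slope = {slope:.3f}, intercept = {intercept:.1f}, "
          f"r^2 = {r_squared[name]:.3f}")
```

The small r² values are another way to see why the relationship is described as weak in both groups.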
Out of the 1000 people in dataset 1, 777 lodged their return through a tax agent and 223 self-prepared. Out of the 300 people in dataset 2, 146 lodged through a tax agent and 154 self-prepared.
The 95% confidence interval for the proportion of taxpayers who lodge through an agent is (0.7512, 0.8028) for dataset 1 and (0.429, 0.543) for dataset 2. Both limits for the secondary data are higher than those for the primary data. The proportion of taxpayers who lodge through an agent is 0.777 in dataset 1 and 0.486 in dataset 2. Age group 11 contains people younger than 20 years, who may be unaware of the tax rates and so hire a tax agent, or may be knowledgeable about them and so self-prepare. The chi-square test indicates that there is an association between age group and lodgment method. The mean total income is 60767.08 for people who hire a tax agent and 46072.73 for people who self-prepare.
The distribution of total income amount for each lodgment method is positively skewed, as most of the observations lie on the left side of the distribution. The standard deviation of total income is 60318.302 for people who hire a tax agent and 38491.063 for people who self-prepare; these values are very large, which shows a wide spread in the data. The boxplot indicates a large number of outliers in the total income data. There is a weak positive relationship between total income amount and total deduction amount, both for people who hire a tax agent and for people who self-prepare.
Because the distribution of total income amount is positively skewed, with most observations on the left side, the data may not have been collected accurately, or the level of measurement may not suit the questions asked. The income amounts may also not be normally distributed. For further research, the researcher should therefore design an appropriate questionnaire, so that people provide correct information about their total income and total deduction amounts for the financial year.
References
Alan, A. (2013). Categorical Data Analysis. John Wiley & Sons.
Gravetter, F., & Wallnau, L. (2010). Essentials of Statistics for the Behavioral Sciences. Cengage Learning.
Ian, B. (2008). Questionnaire Design: How to Plan, Structure and Write Survey Material for Effective Market Research. Kogan Page Publishers.
Maciejewski, R. (2011). Data Representations, Transformations and Statistics for Visual Reasoning. Morgan & Claypool Publishers.
Thomas, P. V. (2011). Secondary Data Analysis. Oxford University Press.