The assignment data file (LGAData.xls) contains LGA data (sourced from the ABS) for a population of 400 LGAs in Australia. We are required to select a random sample of 50 LGAs from this population.
To create our sample of 50 LGAs, we use the student ID of one group member, Kapil Yogi (MIT170802). The last two digits of this ID are 02, so our dataset contains LGAs 02 to 51 (counting both ends, this gives 50 LGAs).
The soft copy of our sample data is named LGA_sample.
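The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_lga_numbers` is hypothetical and not part of the assignment files, and it assumes the LGAs are numbered consecutively.

```python
def sample_lga_numbers(student_id, sample_size=50):
    """Return consecutive LGA numbers starting at the last two ID digits.

    Hypothetical helper for illustration; assumes LGAs are numbered 1..400.
    """
    start = int(student_id[-2:])  # "MIT170802" -> 2
    return list(range(start, start + sample_size))


lga_numbers = sample_lga_numbers("MIT170802")
first, last, size = lga_numbers[0], lga_numbers[-1], len(lga_numbers)
```

With the ID quoted in the text, `first` is 2, `last` is 51 and `size` is 50, which matches the LGA 02-51 range used for the sample.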
Task 1
Figure 1: Histogram showing type of transport of the 50 sampled LGAs
The variable “V21” gives the type of transport for each LGA, coded 0 for “Car” and 1 for “Public transport”.
Using Excel (see the frequency chart), we find that 28 LGAs in our sample use public transport.
We store the values of “V21” in column B.
[ Excel Formula: =COUNTIF(B2:B51,1) ]
Figure 2: Pie diagram showing type of occupation of the 50 sampled LGAs
The variable “V11” gives the type of occupation for each LGA, coded as follows: Managers = 1; Professionals = 2; Sales workers = 3; Administration = 4.
Using Excel, we find that the occupation type occurring most frequently in our sample is Administration, coded “4” (this is also evident from the pie diagram).
We store the values of “V11” in column C.
[ Excel Formula: =MODE(C2:C51) ]
Figure 3: Pie diagram showing age group of the 50 sampled LGAs
The variable “V3” gives the age group for each LGA, coded as follows: 30-34 = 1; 35-39 = 2; 39-44 = 3; 45-50 = 4.
Of the 50 sampled LGAs, 8 fall in the 30-34 years age group.
Hence, the required proportion = 8/50 = 0.16 (see the relative frequency pie diagram).
We store the values of “V3” in column E.
[ Excel formula: =COUNTIF(E2:E51,1) ]
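As a rough cross-check of the COUNTIF step and the proportion, the same calculation can be sketched in Python. The list `example_codes` below is made-up illustration data, not the real V3 column; only the count of 8 ones out of 50 mirrors the text, and the helper name `code_proportion` is hypothetical.

```python
from collections import Counter


def code_proportion(codes, target):
    """Return (count, proportion) of `target` among the coded values."""
    counts = Counter(codes)
    return counts[target], counts[target] / len(codes)


# Made-up codes for illustration only: 8 ones, padded to 50 values.
example_codes = [1] * 8 + [2] * 14 + [3] * 14 + [4] * 14
count, proportion = code_proportion(example_codes, 1)
```

Here `count` plays the role of `=COUNTIF(E2:E51,1)` and `proportion` is the count divided by the sample size.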
Task 2
a)
In our sample, “V15” gives Occupation 4, i.e. the number of people in Administration.
We first sort our sample “occupation 4” data in ascending order.
In Excel, the data were sorted using the following path: Home->Sort & Filter
The percentile location formula is given by \( L_P = \frac{P}{100}(n + 1) \).
Here, P represents the percentile rank and n denotes the number of observations under consideration.
Here n = 50.
For the 70th percentile, P = 70, so we get \( L_{70} = \frac{70}{100}(50 + 1) = 35.7 \).
So the integer part is 35 and the fractional part is 0.7. The 35th and 36th observations of the ordered data are 11 and 13 respectively, so the 70th percentile is 11 + (13 - 11) * 0.7 = 12.4.
For the first quartile, P = 25, so we get \( L_{25} = \frac{25}{100}(50 + 1) = 12.75 \).
So the integer part is 12 and the fractional part is 0.75. The 12th and 13th observations of the ordered data are 0 and 0 respectively, so the first quartile is 0 + (0 - 0) * 0.75 = 0.
For the third quartile, P = 75, so we get \( L_{75} = \frac{75}{100}(50 + 1) = 38.25 \). So the integer part is 38 and the fractional part is 0.25. The 38th and 39th observations of the ordered data are 17 and 18 respectively, so the third quartile is 17 + (18 - 17) * 0.25 = 17.25.
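The percentile steps above can be sketched as a small Python function. This is a minimal sketch of the (n + 1) location rule with linear interpolation; the function name `percentile_n_plus_1` and the list `toy_data` are hypothetical, and `toy_data` is made-up data that only reproduces the 35th and 36th ordered values quoted in the text (11 and 13), not the real sample.

```python
def percentile_n_plus_1(values, p):
    """Percentile using location p/100 * (n + 1) and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = p * (n + 1) / 100
    whole = int(location)          # integer part of the location
    fraction = location - whole    # fractional part of the location
    if whole < 1:                  # location falls before the first value
        return ordered[0]
    if whole >= n:                 # location falls at or after the last value
        return ordered[-1]
    lower = ordered[whole - 1]     # the whole-th observation (1-based)
    upper = ordered[whole]         # the next observation
    return lower + (upper - lower) * fraction


# Made-up data: 34 zeros, then 11 in position 35 and 13 from position 36 on.
toy_data = [0] * 34 + [11] + [13] * 15
p70 = percentile_n_plus_1(toy_data, 70)
```

For `toy_data` the location is 35.7, so the result interpolates 70% of the way from 11 to 13, as in the hand calculation.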
b)
The 70th percentile tells us that, in our sample data, 70% of the LGAs have 12.4 (12 if rounded) or fewer people in occupation “4”, i.e. Administration.
c)
Let \( Q_1 \) and \( Q_3 \) denote the first and third quartiles.
The interquartile range is given by \( IQR = Q_3 - Q_1 = 17.25 - 0 = 17.25 \); it gives the spread of the middle 50% of our data.
So the middle 50% of the LGAs have between 0 and 17.25 people in Administration. This measure of spread is not influenced by extremely small or large values.
We store the values of “V15” in column A.
[ Excel Formula: =QUARTILE.EXC(A2:A51,3)-QUARTILE.EXC(A2:A51,1) ] (QUARTILE.EXC uses the same (n + 1) location rule as the hand calculation.)
Task 3
a)
Using Excel we find out the following descriptive statistics table of our sample “occupation 4” data.
Statistic | Value
Mean | 8.46
Standard Error | 1.190167
Median | 6.5
Mode | 0
Standard Deviation | 8.415753
Sample Variance | 70.8249
Kurtosis | -0.94489
Skewness | 0.666345
Range | 25
Minimum | 0
Maximum | 25
Sum | 423
Count | 50
Table 1: Descriptive statistics of the Occupation 4 sample data
[Excel Path: Data->Data Analysis->Descriptive Statistics ]
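Several entries in Table 1 are tied together by simple identities, and a short Python check can confirm that the reported values are mutually consistent. This is only a sketch that reuses the numbers from the table; the variable names are illustrative.

```python
import math

# Values copied from Table 1.
total, n = 423, 50
sd = 8.415753

mean = total / n                    # Sum / Count
sample_variance = sd ** 2           # variance is the squared standard deviation
standard_error = sd / math.sqrt(n)  # standard error of the mean
```

The computed `mean`, `sample_variance` and `standard_error` agree with the Mean, Sample Variance and Standard Error rows of the table up to rounding.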
b)
From Task 2 we have Q1 = 0, Q3 = 17.25 and IQR = 17.25.
Now,
The upper inner fence limit: IFUL = Q3 + 1.5 x IQR = 17.25 + 1.5 x 17.25 = 43.125
The lower inner fence limit: IFLL = Q1 - 1.5 x IQR = 0 - 1.5 x 17.25 = -25.875
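The fence calculation can be sketched in Python as follows. The helper name `inner_fences` is hypothetical; the quartiles come from Task 2 and the minimum and maximum from Table 1.

```python
def inner_fences(q1, q3, k=1.5):
    """Return (lower, upper) inner fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower_fence, upper_fence = inner_fences(0, 17.25)

# Minimum and maximum from Table 1; a value outside the fences is an outlier.
minimum, maximum = 0, 25
has_outliers = minimum < lower_fence or maximum > upper_fence
```

Because both the minimum and the maximum fall inside the fences, `has_outliers` is `False` for this sample.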
c)
Considering all the measures computed previously for our sample “occupation 4” data:
The minimum and maximum values in the data are 0 and 25 respectively. Both lie within the limits IFLL and IFUL, so there are no outliers.
An appropriate measure of central tendency is the median, because the skewness of 0.666345 indicates that the data are skewed. Here the median = 6.5.
An appropriate measure of dispersion is the interquartile range, again because the data are skewed. Here IQR = 17.25.
d)
The variable under consideration is “Number of administration”. Here the mean = 8.46, median = 6.5, first quartile = 0, third quartile = 17.25, standard deviation = 8.4 (approx.) and IQR = 17.25.
Here \( Q_3 - \text{Median} = 17.25 - 6.5 = 10.75 \) and \( \text{Median} - Q_1 = 6.5 - 0 = 6.5 \).
Since the distance from the median to the third quartile is larger than the distance from the first quartile to the median, the data are positively skewed, i.e. the longer tail is towards larger values of the variable under consideration.
The mean of 8.46 means that, on average, a randomly picked LGA would have about 8.46 (approximately 8) people in Administration.
The standard deviation of 8.4 is a measure of spread that takes all values of the variable into account; it measures how far the data typically deviate from the mean.
The IQR, i.e. the interquartile range, gives the range containing the middle 50% of the values. Here that range is [0, 17.25].
Task 4
a)
From Table 1, the kurtosis is -0.94489, so the data are platykurtic, i.e. the distribution is flatter and has lighter tails than the normal distribution.
The skewness is 0.666345, so the data are positively skewed.
For a normal distribution, mean = median = mode, but here mean = 8.46, median = 6.5 and mode = 0.
These three pieces of evidence suggest that our sample “occupation 4” data were not obtained from a normally distributed population.
b)
According to the standard normal table, P(Z < 1.5) = 0.9332, where Z follows the standard normal distribution.
Hence P(-1.5 < Z < 1.5) = 2(0.9332 - 0.5) = 0.8664 (as Z is symmetric about 0).
So 50 * 0.8664 = 43.32, i.e. about 43.
So, if the data were normally distributed, approximately 43 of the 50 values should lie within 1.5 standard deviations of the mean.
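The table lookup above can be reproduced without a printed table. The sketch below uses the standard identity P(-k < Z < k) = erf(k / sqrt(2)) from Python's `math` module; the function name `prob_within` is illustrative.

```python
import math


def prob_within(k):
    """P(-k < Z < k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))


probability = prob_within(1.5)       # close to the table value 0.8664
expected_count = 50 * probability    # expected number of values out of 50
```

The result differs from the table-based figure only in the later decimal places, because the table value 0.9332 is itself rounded.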
c)
According to the descriptive statistics table, mean = 8.46 and standard deviation = 8.415753.
The interval within 1.5 standard deviations of the mean is [8.46 - 1.5 * 8.415753, 8.46 + 1.5 * 8.415753] = [-4.1636295, 21.0836295].
Going through the data, we observe that 44 of the 50 observations lie within this interval, which agrees with the result in (b) (a difference of one observation is negligible). Hence this result does not confirm our conclusion in (a).
[Ross, Sheldon (2010). Introductory Statistics, Academic Press, USA.]
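The interval and the count in part c) can be sketched in Python as below. The helper name `within_k_sd` is hypothetical, and `toy_values` is a made-up list used only to show how the counting works; the real count of 44 comes from the sample data, which is not reproduced here.

```python
def within_k_sd(values, mean, sd, k=1.5):
    """Return (low, high, count) for the interval mean +/- k*sd."""
    low, high = mean - k * sd, mean + k * sd
    inside = sum(1 for v in values if low <= v <= high)
    return low, high, inside


# Mean and standard deviation from Table 1; toy_values is illustration only.
toy_values = [0, 21, 25, -5]
low, high, inside = within_k_sd(toy_values, 8.46, 8.415753)
```

For `toy_values`, the values 0 and 21 fall inside the interval while 25 and -5 fall outside, so `inside` is 2.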
Task 5
a)
Using Excel, we obtain the following descriptive statistics for our sample “occupation 4” data.
Here we show only those statistics required to compute the confidence interval.
Statistic | Value
Mean | 8.46
Standard Error | 1.190167
Standard Deviation | 8.415753
Sample Variance | 70.8249
Count | 50
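Hence:
i)
A point estimate of the population mean of “Occupation 4” is given by the sample mean,
i.e. 8.46.
ii)
A 90% confidence interval estimate of the population mean of “Occupation 4” is given by
\( \left[ \bar{x} - t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \right] \)
Here \( \bar{x} = 8.46 \), n = 50, s = 8.415753 and \( \alpha = 0.10 \).
\( t_{\alpha/2,\,n-1} \) = upper \( 100(\alpha/2) \)% point of a t distribution with (n - 1) degrees of freedom.
\( t_{0.05,\,49} \) = 1.676551 [ Excel Formula: =T.INV(0.95,49) ]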
Here, with standard error \( s/\sqrt{n} = 1.190167 \),
Upper CI = 8.46 + 1.676551 * 1.190167 = 10.45537601
Lower CI = 8.46 - 1.676551 * 1.190167 = 6.464623986
Hence the 90% confidence interval is [6.464623986, 10.45537601], i.e. [6.46, 10.46] (to 2 decimal places).
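The interval arithmetic can be sketched in Python as follows. Python's standard library has no inverse t distribution, so this sketch simply takes the critical value 1.676551 reported from Excel's `=T.INV(0.95,49)` as an input; the function name `t_confidence_interval` is illustrative.

```python
import math


def t_confidence_interval(mean, sd, n, t_crit):
    """Return (lower, upper) = mean -/+ t_crit * sd / sqrt(n)."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Sample statistics from the table; t_crit is the value reported from Excel.
lower, upper = t_confidence_interval(8.46, 8.415753, 50, 1.676551)
```

Rounded to two decimal places, `lower` and `upper` give the interval [6.46, 10.46].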
iii)
In the context of this variable, if we repeatedly drew samples from the population and built a 90% confidence interval from each, about 90% of those intervals would contain the population mean number of people in Administration. Our interval is [6.46, 10.46], i.e. roughly [6, 10].
b)
The 90% confidence interval for the mean number of people in Administration is [6.46, 10.46], i.e. roughly [6, 10]. It does not contain the value 59, so we would not consider the interval estimate obtained in (a) to be satisfactory.
Task 6
a)
Here we are interested in the values of “V8”, i.e. income category. In our data the categories are coded as follows: $650-$800 = 0; $801-$850 = 1.
In this case we focus on the $650-$800 income earners.
Using Excel, we find that 24 of the 50 LGAs are in the $650-$800 income category.
We store the values of “V8” in column F.
[ Excel Formula: =COUNTIF(F2:F51,0) ]
i)
A point estimate of the proportion of $650-$800 income earners in the population is the sample proportion, \( \hat{p} = 24/50 = 0.48 \).
ii)
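A 99% confidence interval estimate of the proportion of $650-$800 income earners in the population is given by
\( \left[ \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right] \)
Here \( \hat{p} \) is the observed proportion of $650-$800 income earners in our sample.
\( z_{\alpha/2} \) is the upper \( 100(\alpha/2) \)% point of a standard normal distribution, and n is the sample size, i.e. 50.
For a 99% confidence interval, \( \alpha = 0.01 \), n = 50, \( \hat{p} = 24/50 = 0.48 \) and \( z_{\alpha/2} = 2.575829 \).
The value of \( z_{\alpha/2} \) is obtained using [ Excel Formula: =NORM.INV(0.995,0,1) ]
So,
Upper CI = \( 0.48 + 2.575829 \sqrt{\frac{0.48 \times 0.52}{50}} \) = 0.661992846
Lower CI = \( 0.48 - 2.575829 \sqrt{\frac{0.48 \times 0.52}{50}} \) = 0.298007153
Hence the 99% confidence interval is
[0.298007153, 0.661992846], i.e. [0.30, 0.66] (to 2 decimal places).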
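The same interval can be sketched in Python using `statistics.NormalDist` from the standard library in place of Excel's NORM.INV. The function name `proportion_ci` is illustrative; the inputs 24 and 50 are the counts reported above.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    # Upper alpha/2 point of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


ci_low, ci_high = proportion_ci(24, 50, 0.99)
```

Rounded to two decimal places, `ci_low` and `ci_high` give [0.30, 0.66], in line with the Excel result.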
b)
Let P denote the population proportion of $650-$800 income earners.
The sample proportion \( \hat{p} \) is approximately normally distributed with mean P (estimated by \( \hat{p} = 0.48 \)) and standard deviation estimated by \( \sqrt{\hat{p}(1-\hat{p})/n} = 0.070654086 \).
By the empirical rule, about 95% of the values of a normal distribution lie within 2 standard deviations of the mean.
So the 95% confidence interval (based on the empirical rule) for the proportion of $650-$800 income earners in the population is [0.48 - 2 * 0.070654086, 0.48 + 2 * 0.070654086] = [0.338692, 0.621308], i.e. [0.34, 0.62] (to 2 decimal places).
[Akobeng AK. Confidence intervals and p-values in clinical decision making. Acta Paediatr. 2008;97:1004–1007]
c)
The 99% confidence interval for the proportion of $650-$800 income earners in the population is [0.30, 0.66] (to 2 decimal places), whereas the 95% confidence interval (based on the empirical rule) is [0.34, 0.62] (to 2 decimal places).
Hence the length of the 95% confidence interval is 0.62 - 0.34 = 0.28 and the length of the 99% confidence interval is 0.66 - 0.30 = 0.36, so the 99% interval is longer. If we repeatedly collect samples from our population, intervals built at the 99% level will contain the population proportion about 99% of the time, whereas intervals built at the 95% level will contain it about 95% of the time. So the higher the confidence level, the wider the interval. This direction is expected: to be more confident that an interval estimate captures the population proportion, we must widen it.
[Altman D, Bland JM. Confidence intervals illuminate absence of evidence. BMJ. 2004;328:1016–1017]
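The width comparison in part c) can be checked with a short Python sketch. It assumes the same sample proportion (0.48) and sample size (50) as above, uses the multiplier 2 for the empirical-rule interval and 2.575829 for the 99% interval, and works from unrounded endpoints, so the widths differ slightly from the 0.28 and 0.36 obtained from the rounded intervals.

```python
import math

p_hat, n = 0.48, 50
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

# Width of an interval p_hat +/- z * SE is 2 * z * SE.
width_95 = 2 * 2 * standard_error          # empirical rule: 2 standard deviations
width_99 = 2 * 2.575829 * standard_error   # z value reported from Excel
```

Only the multiplier changes between the two intervals, which is why a higher confidence level always gives a wider interval for the same sample.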
References
· Ross, Sheldon (2010). Introductory Statistics. Academic Press, USA.
· Hoel, P. G. (1971). Introduction to Mathematical Statistics, Fourth Edition, USA.
· Feller, William (2013). An Introduction to Probability Theory and Its Applications, Volume I, Third Edition, U.K.
· Du Prel, J.-B., Hommel, G., Röhrig, B., & Blettner, M. (2009). Confidence Interval or P-Value?: Part 4 of a Series on Evaluation of Scientific Publications. Deutsches Ärzteblatt International, 106(19), 335–339.
· Akobeng AK. Confidence intervals and p-values in clinical decision making. Acta Paediatr. 2008;97:1004–1007.
· Altman D, Bland JM. Confidence intervals illuminate absence of evidence. BMJ. 2004;328:1016–1017.