ITMD 527 Assignment
Data Analytics
Department of Information Technology and Management
1. (10 points) Use your own language/text to answer the following questions:
1). What is the difference between a one-tailed test and a two-tailed test? How do you determine which one applies, and why is it important to determine that?
Answer
One-Tailed Test
- It is called one-tailed because the alternative hypothesis places the rejection region in only one end (tail) of the graph.
- The critical region is in either the left tail or the right tail.
- It is called a directional hypothesis test.
- The whole significance level sits at the extreme end of one tail of the graph.
- If we take Alpha as 5 percent, the shaded region of the graph contains the full 5 percent in a single tail.
Two-Tailed Test
- It is called two-tailed because the alternative hypothesis places the rejection region on both sides of the graph.
- The critical regions are in both tails.
- It is called a nondirectional hypothesis test.
- The significance level is divided between both tails.
- If we take Alpha as 5 percent, the shaded region is shared by both tails by dividing that 5 percent into two parts of 2.5 percent each.
In a one-tailed test
Null hypothesis (H0): the parameter is smaller than or equal to zero, and the alternative hypothesis (H1): the parameter is larger than zero.
OR
Null hypothesis (H0): the parameter is larger than or equal to zero, and the alternative hypothesis (H1): the parameter is smaller than zero.
In a two-tailed test
Null hypothesis (H0): the parameter is equal to zero, and the alternative hypothesis (H1): the parameter is not equal to zero (it may be smaller or larger).
Also, the exact form of the hypotheses varies with the type of test you carry out.
To determine whether a given hypothesis test is one-tailed or two-tailed, we look at how the question is worded.
It depends solely on the question:
- For example, if the null hypothesis says that the student height is 6 feet and the alternative hypothesis says that the student height is not 6 feet, then the test is two-sided / two-tailed, because we do not know in advance whether the height is less than 6 or more than 6, so both directions have to be considered.
- Similarly, if the null hypothesis says that the student height is 6 feet and the alternative hypothesis says that the student height is greater than 6 feet, then the test is a one-tailed hypothesis test.
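The two height scenarios above can be sketched in R with the `alternative` argument of `t.test`. The height values below are hypothetical and only serve to show how the wording of the alternative hypothesis maps to the function call.

```r
# Hypothetical student heights in feet (illustrative values, not assignment data).
heights <- c(5.8, 6.1, 6.3, 5.9, 6.4, 6.2, 6.0, 6.5)

# Two-tailed: H0: mu = 6 versus H1: mu != 6
two_sided <- t.test(heights, mu = 6, alternative = "two.sided")

# One-tailed (right): H0: mu <= 6 versus H1: mu > 6
right_tail <- t.test(heights, mu = 6, alternative = "greater")

# Both tests share the same t statistic; only the tail area differs.
two_sided$p.value
right_tail$p.value
```

Because the sample mean here lies above 6, the two-tailed p-value is exactly twice the right-tailed one.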
Importance of determining one-tailed or two-tailed
- It is necessary to determine whether the test is one-tailed or two-tailed so that Alpha (1 minus the confidence level) is placed in the correct tail or tails.
- If the test is one-tailed, the whole Alpha stays in one tail; if it is two-tailed, Alpha is divided by 2 between the two tails.
- Furthermore, it determines which tail area is used when finding the p-value.
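As a small illustration of how the tail choice changes the cutoff, the sketch below computes standard normal critical values for Alpha = 0.05. It assumes a z-based test; the variable names are illustrative.

```r
alpha <- 0.05

# One-tailed: all of alpha sits in a single tail.
z_one_tailed <- qnorm(1 - alpha)       # about 1.645

# Two-tailed: alpha is split into alpha / 2 per tail.
z_two_tailed <- qnorm(1 - alpha / 2)   # about 1.960

c(one_tailed = z_one_tailed, two_tailed = z_two_tailed)
```

The two-tailed cutoff is further from zero, so the same test statistic can be significant in a one-tailed test and not in a two-tailed one.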
2). What is meant by a p-value? Interpret the p-value in a one-tailed/one-sample hypothesis test.
Ans
- The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. It measures how strong the evidence against the null hypothesis is. To calculate it we need to know whether the test is one-tailed or two-tailed; to make a decision we then compare it with the Alpha value.
- H0 (null hypothesis) is rejected if the obtained p-value is smaller than the significance level Alpha (α).
- H0 is not rejected, so H1 (alternative hypothesis) is not supported, if the obtained p-value is greater than the significance level Alpha (α).
Interpret the p-value in one-tailed/one-sided
- In a one-tailed test, the p-value is the area in a single tail of the distribution beyond the observed test statistic, in the direction stated by the alternative hypothesis.
- In a one-tailed test you do not divide the Alpha value.
- So, the p-value you obtain is not multiplied by 2.
- So, the decision drawn from the p-value depends on the Alpha value, which is 1 minus the confidence level.
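A minimal sketch of the one-tailed versus two-tailed p-value, assuming a standard normal test statistic. The value z = -1.73 is used purely for illustration.

```r
z <- -1.73

# One-tailed (left): area under the normal curve to the left of z.
p_one_tailed <- pnorm(z)

# Two-tailed: area beyond |z| in both tails, i.e. twice the one-tail area.
p_two_tailed <- 2 * pnorm(-abs(z))

round(c(one_tailed = p_one_tailed, two_tailed = p_two_tailed), 4)
```

With Alpha = 0.05, the one-tailed p-value of about 0.0418 would lead to rejecting H0, while the two-tailed p-value of about 0.0836 would not.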
2. (35 points) Manually solve the problem below:
A bank branch located in a commercial district of a city has the business objective of improving the process for serving customers during the noon to 1 PM lunch period. The waiting time (defined as the time from when the customer enters the line until he or she reaches the teller window) of a random sample of 15 customers is collected, and the results are organized and stored as below:
4.21, 5.55, 3.02, 5.13, 4.77, 2.34, 3.54, 3.20, 4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79
a) Calculate the mean and standard deviation, and find Q1 and Q3 from the values above. Is the distribution symmetric? Why? [5]
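Ans –
Calculating the following:
Mean: we calculate the mean by summing all the observations and dividing by the total number of observations.
64.30 / 15 = 4.2867 -> 4.28 (truncated to two decimals)
Standard Deviation: we calculate the SD by taking the square root of the variance.
First calculate the variance by subtracting the mean from each observation, squaring each difference, summing the squares, and dividing the sum by (n-1).
Variance:
s^2 = sum of (x - x-bar)^2 / (n - 1) = 37.562 / 14 = 2.6830
Standard Deviation:
s = sqrt(2.6830) = 1.638 -> 1.63 (truncated to two decimals)
Sorted data: 0.38, 2.34, 3.02, 3.20, 3.54, 3.79, 4.21, 4.50, 4.77, 5.12, 5.13, 5.55, 6.10, 6.19, 6.46
Q1 3.20 (the 4th of the 15 sorted values, at position (n+1)/4 = 4)
Q3 5.55 (the 12th of the 15 sorted values, at position 3(n+1)/4 = 12)
(0.38 and 6.46 are the minimum and maximum, not the quartiles.)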
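The summary statistics can be cross-checked in R. This is only a sketch: the vector name `wait` is illustrative, and `type = 6` is an assumption chosen because it places the quartiles at positions (n+1)/4 and 3(n+1)/4 of the sorted data, which for n = 15 are the 4th and 12th values. R's default (`type = 7`) interpolates and gives slightly different quartiles.

```r
# Waiting times from the problem statement (minutes).
wait <- c(4.21, 5.55, 3.02, 5.13, 4.77, 2.34, 3.54, 3.20,
          4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)

m <- mean(wait)   # sum of observations / n
v <- var(wait)    # sample variance, divides by n - 1
s <- sd(wait)     # square root of the sample variance

# Quartiles at positions (n + 1) / 4 and 3 (n + 1) / 4 of the sorted data.
q <- quantile(wait, probs = c(0.25, 0.75), type = 6)

round(c(mean = m, variance = v, sd = s), 4)
q
sort(wait)
```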
The given sample data do not follow a symmetric distribution, for the following reasons:
- When we sort the data into ascending order and plot a graph, we observe that more of the data lies to the right of the mean, while a few small values (such as 0.38) stretch out the left side.
- So, the tail of the curve extends towards the left side.
- Hence, we can say that the distribution is left-skewed.
- Also, in a symmetric distribution the mean and median have the same value, but in our distribution the mean (4.28) is smaller than the median (4.5).
- Thus, we can say that the distribution is not symmetric.
Mean 4.28
Median 4.5
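The mean versus median comparison can be checked with a short R sketch. The coefficient below, 3 (mean - median) / sd, is Pearson's second skewness coefficient; it is added here as an illustration and is not part of the original answer. A negative value points to left skew.

```r
# Waiting times from the problem statement (minutes).
wait <- c(4.21, 5.55, 3.02, 5.13, 4.77, 2.34, 3.54, 3.20,
          4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)

m   <- mean(wait)
med <- median(wait)

# Pearson's second skewness coefficient: sign follows (mean - median).
skew2 <- 3 * (m - med) / sd(wait)

c(mean = m, median = med, skew2 = skew2)
```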
b) Are there any assumptions about the population distribution needed in order to use sample statistics to estimate the population statistics? [5]
Ans
Assumption of Number of Samples
- When we look at the sample of customer waiting times, we see that the sample size is less than 30.
- The sample size is denoted by "n" (so, n < 30).
[...]
H0: µ = 5    H1: µ < 5
The null hypothesis H0 says that the average waiting time is 5 minutes.
The alternative hypothesis H1 claims that the average waiting time is less than 5 minutes, which is why the test below is left-tailed.
Now here we have the following values available:
x-bar 4.28
µ 5
S 1.63
Confidence level 95 %
√n Sqrt of 15 3.87
- Now we test the hypothesis using the p-value.
- The p-value is used as the evidence to reject H0 (null hypothesis).
- The p-value is the area under the curve beyond the observed test statistic.
- So before calculating the p-value we need to compute the test statistic.
- Using the test statistic (x-bar - µ) / (S / √n), we get -1.73.
- After calculating the test statistic, we find the one-tailed p-value.
If alpha > p-value: Reject H0
If alpha < p-value: Fail to reject H0
- The generated p-value for -1.73 (z-statistic) is 0.0418.
- When we observe the p-value plot we see that
Alpha > p-value
(0.10 > 0.0418)
THUS, SINCE alpha > p-value, WE HAVE ENOUGH EVIDENCE TO REJECT THE NULL HYPOTHESIS.
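The calculation can be cross-checked in R. This is a sketch rather than the method used above. It recomputes from the unrounded data, so the statistic comes out at about -1.69 instead of the hand-rounded -1.73. It also shows the left-tail p-value under both a normal reference (as used above) and a t reference with n - 1 = 14 degrees of freedom, which is the more common choice for a small sample with an estimated standard deviation. Variable names are illustrative.

```r
# Waiting times from the problem statement (minutes).
wait <- c(4.21, 5.55, 3.02, 5.13, 4.77, 2.34, 3.54, 3.20,
          4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)

n   <- length(wait)
mu0 <- 5                                   # hypothesized mean under H0

# Test statistic: (x-bar - mu0) / (s / sqrt(n))
t_stat <- (mean(wait) - mu0) / (sd(wait) / sqrt(n))

p_norm <- pnorm(t_stat)                    # left-tail area, normal reference
p_t    <- pt(t_stat, df = n - 1)           # left-tail area, t reference (14 df)

alpha <- 0.10                              # level used in the comparison above
round(c(statistic = t_stat, p_norm = p_norm, p_t = p_t), 4)
c(reject_norm = p_norm < alpha, reject_t = p_t < alpha)
```

At Alpha = 0.10 both reference distributions lead to the same decision. The t-based p-value falls between 0.05 and 0.10, so the choice of distribution would matter at Alpha = 0.05.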
e). Use just one method to solve the problem in part d) but use 99% as the confidence level. Did you get different results? What are the reasons if you get different results? [5]
Ans
H0: µ = 5    H1: µ < 5
The null hypothesis H0 says that the average waiting time is 5 minutes.
The alternative hypothesis H1 claims that the average waiting time is less than 5 minutes.
Now here we have the following values:
x-bar 4.28
µ 5
S 1.63
Confidence level 99 %
√n Sqrt of 15 3.87
- Since the alternative hypothesis is that the average is less than 5 minutes, the graph is left-tailed.
- First we find the t-critical value.
- We used an online tool to find the t-value from the significance level (α), which is 1 – 0.99 = 0.01 (since the confidence level is 99%), and the degrees of freedom (df), which is n-1 = 14.
- We got the t-value as 2.624.
- Since our graph is left-tailed, we use -2.624.
- After finding the t-value we calculate the t-statistic of the data.
- To calculate the t-statistic we used the following formula: t = (x-bar - µ) / (S / √n)
- We already obtained the t-statistic as -1.73 previously.
- When we plot the left-tailed graph, we observe that -1.73 does not fall beyond the critical value -2.624, i.e. (-1.73 > -2.624), so it lies outside the rejection region.
- So, H0 is not rejected at the 99% confidence level.
- This differs from the result in part d) because the smaller significance level (0.01) pushes the critical value further into the tail, so stronger evidence is needed to reject H0.
- So, we can say that
WE DO NOT HAVE SUFFICIENT EVIDENCE, SO WE FAIL TO REJECT THE NULL HYPOTHESIS.
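The critical-value comparison above can be sketched in R with `qt` in place of an online table. The statistic -1.73 is taken from the text; variable names are illustrative.

```r
alpha <- 1 - 0.99          # significance level for 99% confidence
df    <- 15 - 1            # degrees of freedom, n - 1

# Left-tail critical value: the point with area alpha to its left.
t_crit <- qt(alpha, df)    # about -2.624

t_stat <- -1.73            # test statistic reported in the text

# Reject H0 only if the statistic falls beyond the critical value in the left tail.
reject_h0 <- t_stat < t_crit

c(t_crit = t_crit, t_stat = t_stat)
reject_h0
```

Here `reject_h0` is `FALSE`, matching the conclusion that H0 is not rejected at the 0.01 level.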
3. (55 points) The Chicago Ventra Transit Card can be used on CTA buses, the metro, and Pace buses.
We are going to explore a resident's average monthly cost on CTA transportation. In this case,
we performed a survey and collected the monthly cost on CTA transit from 30 people. Their monthly costs are listed as follows:
12, 12, 12, 15, 24, 35, 14, 12, 120, 55, 45, 30, 40, 40, 40, 60, 60, 40, 50, 22, 36, 28, 21, 50, 39, 60, 90, 100, 110, 100
1). [10] To further understand the distribution, we draw a boxplot as follows. Interpret the boxplot.
Ans
- Here we have a boxplot of the monthly cost on CTA transit (CTA bus, metro and Pace buses).
- The box is defined by three quartiles, named Q1, Q2 and Q3.
- So, for our given data
12, 12, 12, 15, 24, 35, 14, 12, 120, 55, 45, 30, 40, 40, 40, 60, 60, 40, 50, 22, 36, 28, 21, 50, 39, 60, 90, 100, 110, 100
- The total number of samples is 30.
- Before drawing a boxplot, we sort the samples into ascending order.
12, 12, 12, 12, 14, 15, 21, 22, 24, 28, 30, 35, 36, 39, 40, 40, 40, 40, 45, 50, 50, 55, 60, 60, 60, 90, 100, 100, 110, 120
Reading the quartiles from the boxplot, we can say that
Q1 22.5
Q2 40
Q3 58.75
Similarly, we can also calculate:
- The median is equal to Q2, which is 40.
- Further, we can compute the following values:
Q3 – Q2 18.75
Q2 – Q1 17.50
In boxplot terminology:
Lower Quartile is 22.5
Upper Quartile is 58.75
Median is 40
Minimum value – 12
Maximum value – 120
- When we observe the boxplot, we see a long tail of values on the upper side, above the upper quartile (reaching up to 120), and only a short range of values below the lower quartile (down to 12).
- So, the distribution has a long tail towards the right side of the graph, with values spread far beyond the upper quartile.
- The distances (Q1, Q2) and (Q2, Q3) are similar, so the middle 50% of the data is roughly symmetric about the median. Because of the long upper tail, however, the distribution as a whole is right-skewed rather than normal.
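The boxplot numbers can be reproduced with a short R sketch. The upper fence uses the conventional 1.5 × IQR rule, which is an assumption here because the plot itself is not reproduced in the text. Variable names are illustrative.

```r
# Monthly CTA cost for the 30 surveyed residents, from the problem statement.
cost <- c(12, 12, 12, 15, 24, 35, 14, 12, 120, 55, 45, 30, 40, 40, 40,
          60, 60, 40, 50, 22, 36, 28, 21, 50, 39, 60, 90, 100, 110, 100)

# Quartiles with R's default interpolation (type 7).
q <- unname(quantile(cost, probs = c(0.25, 0.50, 0.75)))

iqr         <- q[3] - q[1]
upper_fence <- q[3] + 1.5 * iqr            # points above this are flagged as outliers
outliers    <- cost[cost > upper_fence]

c(Q1 = q[1], Q2 = q[2], Q3 = q[3], IQR = iqr, upper_fence = upper_fence)
c(mean = mean(cost), median = median(cost))
outliers
```

The mean is larger than the median and the largest value lies above the upper fence, both of which are consistent with a right-skewed distribution.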
3). [15] Use the sample statistics to estimate the average monthly cost on CTA transits by Chicago residents by using 95% as the confidence level. You can use R to solve the problem. Do not forget to give details and conclusions, and paste your snapshot.
Ans
Here we are asked to estimate the average monthly cost on CTA transit for Chicago residents.
From the problem statement we have the following information:
95% confidence level -> Significance level (α) -> 0.05.
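For reference, when the population standard deviation is unknown, a confidence interval for the mean usually takes the t-based form below. This is a general formula given as an assumption about the setup, not necessarily the exact routine used in the R session that follows. A z-based interval would replace the t quantile with z_{α/2} = 1.96.

```latex
\[
\bar{x} \;\pm\; t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},
\qquad \alpha = 0.05,\quad n = 30,\quad n - 1 = 29
\]
```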
To solve this problem we used RStudio.
First, we installed and loaded the following package:
install.packages("PASWR2")
library(PASWR2)
Then we created the data frame of the available data of CTA smadke