Data Collection
To say that data analysis is critical to organizations would be putting it mildly. Indeed, no business can survive without analyzing the data available to it (Siegel, 2016). Consider the following situations:
- A pharmaceutical company is running trials on a number of patients to test a new drug for treating tumors. The number of patients in the trial is well over 500.
- A company needs to launch a new variant of its existing line of fruit juice. It needs to carry out survey analysis and arrive at a meaningful conclusion.
- The sales director of a company knows that something is wrong with one of its successful products, but has not yet carried out any market research data analysis. How, and what, does he conclude?
These situations are indicative enough to conclude that data analysis is the lifeline of any business (Anderson et al. 2014). Whether one needs to arrive at marketing decisions or fine-tune a new product launch strategy, data analysis is the key to all of these problems. What is the importance of data analysis? Rather, one should ask what is not important about data analysis (Anderson et al. 2016).
Merely analyzing data is not sufficient for making a decision. How one interprets the analyzed data matters more. Accordingly, data analysis is not a decision-making system but a decision-support system. Data analysis can offer the following benefits:
- Structuring the findings from survey research or other means of data collection
- Breaking a macro picture down into a micro one
- Acquiring meaningful insights from the data set
- Basing critical decisions on the findings
- Ruling out human bias through proper statistical treatment
The purpose of this report is to explore statistical knowledge related to data analysis and modeling with the help of a car data set compiled by the researcher (David et al. 2015). The researcher gathered the data from online sources. The first step in preparing the data set was to identify a list of car names. As shown in the data set screenshot below, the researcher searched the internet to collect at least 25 car models from different companies. The next step was to identify the country to which each company belongs. Subsequently, the researcher collected data on mileage (miles per gallon), acceleration, horsepower, weight, cylinders, year of production start, and price. The data type of each variable is listed below:
| Variable | Data Type |
| --- | --- |
| Country | Categorical |
| Miles per Gallon | Numeric |
| Acceleration | Numeric |
| Horsepower | Numeric |
| Weight | Numeric |
| Cylinders | Categorical |
| Year | Numeric |
| Price | Numeric |
The chosen data set is a sample intended to represent the wider car industry. The following sections present a detailed analysis of the data set using descriptive and inferential statistics.
Analysis of Quantitative Variable
This section analyzes the quantitative variables. Specifically, the researcher has considered miles per gallon and acceleration. To perform the analysis, the researcher has used the summary statistics table below, together with histograms, to judge whether each variable is symmetric or skewed. The histogram of miles per gallon shows that this variable is right-skewed, and the skewness value in the summary table (1.34) supports the same conclusion. No outliers were found. Similarly, the histogram of acceleration indicates that the distribution of this variable is almost symmetric (Anderson et al. 2014), although it is slightly right-skewed, with a skewness value of 0.31. Here too, no outliers were identified.
| Statistic | Miles per Gallon | Acceleration |
| --- | --- | --- |
| Mean | 23.984 | 15.876 |
| Standard Deviation | 7.643498326 | 2.486677837 |
| Minimum | 17.5 | 11.2 |
| First Quartile | 19.2 | 13.7 |
| Median | 20.2 | 15.8 |
| Third Quartile | 27.5 | 17.2 |
| Maximum | 43.1 | 21.5 |
| Skewness | 1.342987603 | 0.305588459 |
Table 1: Summary Statistics Table
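As a rough cross-check on the direction of skew, the mean, median, and standard deviation from Table 1 can be fed into Pearson's second (median) skewness coefficient, 3 × (mean − median) / standard deviation. This is a minimal sketch and an assumption on our part: the report does not say how its skewness values were computed, and they were most likely obtained from a moment-based sample formula, so the numbers below are not expected to match the table exactly. The function name `pearson_median_skewness` is illustrative only.

```python
# Cross-check of skew direction from the summary statistics in Table 1.
# Assumption: Pearson's second skewness coefficient is used here as a quick
# approximation; it is NOT the formula that produced the table's skewness row.

def pearson_median_skewness(mean: float, median: float, std_dev: float) -> float:
    """Return 3 * (mean - median) / std_dev."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return 3.0 * (mean - median) / std_dev

# Values copied from Table 1.
summary = {
    "Miles per Gallon": {"mean": 23.984, "median": 20.2, "std_dev": 7.643498326},
    "Acceleration": {"mean": 15.876, "median": 15.8, "std_dev": 2.486677837},
}

skew_estimates = {
    name: pearson_median_skewness(**stats) for name, stats in summary.items()
}

for name, value in skew_estimates.items():
    print(f"{name}: approximate skewness = {value:.3f}")
```

Both estimates are positive, with miles per gallon far larger than acceleration, which agrees with the reading of the histograms: miles per gallon is clearly right-skewed, while acceleration is close to symmetric.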
This section analyzes one categorical variable, for which the researcher has selected country. From the frequency and relative frequency table below, it can be concluded that most of the sample cars are from the USA (17 cars, 68%) and that the sample cars come from three countries.
| Country | Frequency | Relative Frequency |
| --- | --- | --- |
| Germany | 4 | 16.0% |
| Japan | 4 | 16.0% |
| USA | 17 | 68.0% |
| Grand Total | 25 | 100.0% |
Table 2: Frequency and relative frequency table of Country
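The relative frequencies in Table 2 are simply each country's count divided by the grand total. A minimal sketch of that arithmetic, using the counts from the table (the variable names are illustrative):

```python
# Relative frequency = category count / total count, using the counts in Table 2.
country_counts = {"Germany": 4, "Japan": 4, "USA": 17}

total_cars = sum(country_counts.values())
relative_frequency = {
    country: count / total_cars for country, count in country_counts.items()
}

for country, share in relative_frequency.items():
    print(f"{country}: {country_counts[country]} cars, {share:.1%}")
print(f"Grand Total: {total_cars} cars, {sum(relative_frequency.values()):.1%}")
```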
Two categorical variables are related when the category in which a subject is classified for one variable affects the category in which it is classified for the other. In other words, the category a subject falls into for one variable depends on the category it falls into for the other. The chi-square (χ2) test of independence is used to test for this relationship. Related measures for categorical data include risk and odds, relative risk, percent change in risk, and odds ratios (Anderson et al. 2016). The starting point for analyzing the relationship is to create a two-way table of counts. The rows are the categories of one variable and the columns are the categories of the second variable. We count how many observations fall into each combination of row and column categories. When one variable is clearly the explanatory variable in the relationship, the convention is to use the explanatory variable to define the rows and the response variable to define the columns. This is not a rigid rule, however (Freed et al. 2014).
Analysis of Categorical Variable
This section investigates the relationship between two categorical variables. For this purpose, the researcher has selected country and cylinders as the two categorical variables from the data set. To assess the association between them, the researcher has employed the chi-square test of independence, with the following hypotheses:
H0: In the population, the two categorical variables are independent.
Ha: In the population, the two categorical variables are dependent.
The first step in performing the chi-square test of independence is to identify the observed values (obtained through cross-tabulation) and the expected values (obtained using the formula: row total × column total / grand total).
The two tables below show the observed and expected values respectively.
Observed values (rows: Country; columns: Cylinders)
| Country | 4 | 6 | 8 | Grand Total |
| --- | --- | --- | --- | --- |
| Germany | 3 | 0 | 1 | 4 |
| Japan | 4 | 0 | 0 | 4 |
| USA | 2 | 10 | 5 | 17 |
| Grand Total | 9 | 10 | 6 | 25 |
Table 3: Observed table
Expected values (rows: Country; columns: Cylinders)
| Country | 4 | 6 | 8 | Grand Total |
| --- | --- | --- | --- | --- |
| Germany | 1.44 | 1.6 | 0.96 | 4 |
| Japan | 1.44 | 1.6 | 0.96 | 4 |
| USA | 6.12 | 6.8 | 4.08 | 17 |
| Grand Total | 9 | 10 | 6 | 25 |
Table 4: Expected value
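The expected values in Table 4 follow from the margins of Table 3 through the formula row total × column total / grand total. The sketch below reproduces that step from the observed counts; the function name `expected_counts` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Expected counts under independence: row total * column total / grand total.
# Observed counts are copied from Table 3 (columns: 4, 6, 8 cylinders).
observed_counts = {
    "Germany": [3, 0, 1],
    "Japan": [4, 0, 0],
    "USA": [2, 10, 5],
}

def expected_counts(table: dict) -> dict:
    """Return the expected count for every cell of a two-way table."""
    row_totals = {row: sum(cells) for row, cells in table.items()}
    column_totals = [sum(column) for column in zip(*table.values())]
    grand_total = sum(row_totals.values())
    return {
        row: [row_totals[row] * col_total / grand_total for col_total in column_totals]
        for row in table
    }

expected = expected_counts(observed_counts)
for country, cells in expected.items():
    print(country, [round(value, 2) for value in cells])
```

For example, the Germany / 4-cylinder cell is 4 × 9 / 25 = 1.44, matching Table 4.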
The next step is to calculate the chi-square statistic. Each cell contributes the squared difference between its observed and expected value, divided by the expected value:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
Chi-square contributions by cell (rows: Country; columns: Cylinders)
| Country | 4 | 6 | 8 |
| --- | --- | --- | --- |
| Germany | 1.690000 | 1.600000 | 0.001667 |
| Japan | 4.551111 | 1.600000 | 0.960000 |
| USA | 2.773595 | 1.505882 | 0.207451 |
Chi-square statistic = 14.889706
The next step is to identify the degrees of freedom, using the formula below.
Degrees of freedom
df = (number of rows − 1) × (number of columns − 1) = (3 − 1) × (3 − 1) = 4
For a chi-square distribution with 4 degrees of freedom, the p-value is approximately 0.0049, which is less than 0.05. Hence, at the 5% significance level, the null hypothesis is rejected. In other words, the test is significant, and it can be inferred that country and cylinders are not independent in this data set. Because most of the expected counts in Table 4 are below 5, the chi-square approximation is rough for a sample of this size and the result should be interpreted with caution.
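The whole test can be sketched in a few lines from the observed counts in Table 3, using the standard Pearson form of the statistic, the sum over cells of (O − E)² / E. This is a minimal sketch rather than the researcher's own procedure: the function names are illustrative, and the p-value helper relies on the closed-form chi-square tail probability that holds only for an even number of degrees of freedom (which is the case here, df = 4), so that no third-party library is assumed.

```python
import math

# Observed counts from Table 3 (rows: Germany, Japan, USA; columns: 4, 6, 8 cylinders).
observed = [
    [3, 0, 1],
    [4, 0, 0],
    [2, 10, 5],
]

def chi_square_independence(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-square variable; valid for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form requires a positive, even dof")
    half = x / 2.0
    series = sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return math.exp(-half) * series

statistic, dof = chi_square_independence(observed)
p_value = chi2_sf_even_dof(statistic, dof)
print(f"chi-square = {statistic:.4f}, df = {dof}, p-value = {p_value:.4f}")
```

If SciPy is available, `scipy.stats.chi2_contingency` performs the same calculation for any table shape and also returns the expected counts.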
Categorical variables are also called qualitative variables or attribute variables. The values of a categorical variable can be placed into a countable number of categories or distinct groups. Categorical data may or may not have a logical order (Lee and Peters, 2015). A quantitative variable, by contrast, is a measurement. This section provides a side-by-side histogram to show the association between miles per gallon and cylinders. The side-by-side histogram indicates a strong negative association between these two variables (Siegel, 2016). The summary statistics and the correlation value (-0.72203) also support this conclusion.
Correlation and regression are concerned with analyzing the relationship between two or more quantitative variables. Three tools are used to describe, picture, and quantify the relationship between quantitative variables:
- Scatterplot, a two-dimensional graph of data values for two quantitative variables.
- Correlation, a statistic that measures the strength and direction of a linear relationship between two quantitative variables.
- Regression equation, an equation that describes the average relationship between a quantitative response variable and a quantitative explanatory variable.
Here, there is a negative relationship between horsepower and acceleration.
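The correlation coefficient and the least-squares regression line described above can be sketched as follows. The raw car data are not reproduced in this report, so the horsepower and acceleration values in the example are hypothetical placeholders chosen only to illustrate a negative relationship; they are not the researcher's data, and the function name `pearson_r_and_line` is illustrative.

```python
import math

def pearson_r_and_line(x, y):
    """Return (Pearson r, slope, intercept) for the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# HYPOTHETICAL values for illustration only (not taken from the report's data set).
horsepower = [70, 90, 110, 150, 200]
acceleration = [19.0, 17.5, 16.0, 13.5, 11.5]

r, slope, intercept = pearson_r_and_line(horsepower, acceleration)
print(f"r = {r:.3f}; acceleration = {intercept:.2f} + ({slope:.4f}) * horsepower")
```

A negative `r` and a negative slope both indicate that, in the illustrative values, the acceleration figure falls as horsepower rises, which is the direction of the relationship reported for the car data set.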
Conclusion
To conclude, the study has analyzed the gathered data and provides an elementary overview of statistics, probability, and data analysis. Topics such as data preparation, classification, and summarization; basic statistics; the common distributions used in statistics; and correlation and scatter plots have been covered through this assignment.
References
Anderson, D., Sweeney, D. and Williams, T., 2014. Modern business statistics with Microsoft Excel. Nelson Education.
Anderson, D.R., Sweeney, D.J., Williams, T.A., Camm, J.D. and Cochran, J.J., 2016. Statistics for business & economics. Nelson Education.
David, R.A., Dennis, J.S. and Thomas, A.W., 2015. Modern Business Statistics with Microsoft Excel.
Freed, N., Bergquist, T. and Jones, S., 2014. Understanding business statistics. John Wiley & Sons.
Lee, N. and Peters, M., 2015. Business statistics using EXCEL and SPSS. Sage.
Siegel, A., 2016. Practical business statistics. Academic Press.