Population and Sample
This paper discusses the concepts of statistics and data analysis. It covers statistical terminology, the tools used in analyzing data, and general statistics. Statistics is defined as the methodology that mathematicians and statisticians use for collecting, analyzing, and interpreting data and for making inferences from a sample of data (Aberson, 2010). It is clear from this definition that statistics is more than the tabulation of numbers and the graphical presentation of information. In detail, statistical methods are used to determine:
- What kind of data, and how much, needs to be collected
- How the data should be organized and summarized
- Which analyses should be carried out, how, and what conclusions can be drawn
- How to assess the strength of the conclusions and evaluate their uncertainty
In summary, statistics provides the methodology for:
- Design: planning how to carry out research studies
- Description: summarizing and exploring data
- Inference: making predictions and generalizing about the phenomenon represented by the data
Population and sample are basic concepts used in statistics. A population is the set of all individuals, subjects, or objects that an investigator is interested in during a study. A sample is the subset of individuals from the population that is actually involved in the study (Agarwal, n.d.).
Descriptive and inferential statistics are the major types of statistics. Descriptive statistics is the branch of statistics devoted to summarizing and describing data, while inferential statistics is the branch concerned with making inferences about a population (Fraser, 2012). In general, descriptive statistics consists of methods used to organize and summarize information, while inferential statistics consists of methods used to draw conclusions about the population under study and to measure the reliability of those conclusions (Brase, 2013). Descriptive statistics includes measures of central tendency (mean, median, and mode) and measures of dispersion (range, minimum and maximum values, variance, and standard deviation). It also includes the construction of tables, charts, and graphs. Inferential statistics consists of methods such as point estimation, interval estimation, and hypothesis testing, all of which are based on probability theory (Friedman, 2010).
Features of the population under investigation are summarized as numerical parameters. The research problem therefore becomes an investigation of the values of these parameters (Givens, 2013). Population parameters are usually unknown, so sample statistics are used to make inferences about them. In general, a statistic is used to make an inference about an unknown parameter (Daniel, 2010).
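The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The population below is simulated and purely hypothetical (it is not the dataset analyzed in this paper); the names `population_ages`, `parameter`, and `statistic` are chosen only for illustration.

```python
import random
from statistics import mean

# Hypothetical population: 10,000 simulated ages (an assumption for illustration only).
rng = random.Random(42)
population_ages = [rng.randint(30, 80) for _ in range(10_000)]

# Parameter: a numerical summary of the whole population (normally unknown in practice).
parameter = mean(population_ages)

# Sample: 100 subjects drawn without replacement from the population.
sample_ages = rng.sample(population_ages, 100)

# Statistic: the same summary computed on the sample, used to estimate the parameter.
statistic = mean(sample_ages)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

In a real study only the sample is observed, so the statistic is the only available estimate of the parameter.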
The main objective of statistics is to understand what the data contains. Below are the steps to be followed in any data analysis:
A variable is defined as any measurable characteristic that varies among individual members of a population. The main types of variables in statistics are quantitative and qualitative variables. Quantitative variables include height, weight, length, and width, and may be classified as continuous or discrete. Qualitative variables include eye color, marital status, sex, and hair color, and may be classified as either nominal or ordinal (Field, 2014).
The data used in this paper are from an experimental study that investigated the relationship between age, gender, type of chest pain, blood sugar, and the class of the subject, whether sick or healthy (Knopov, 2012). The data were obtained from a web resource: https://mercury.webster.edu/aleshunas/Data%20Sets/Supplemental%20Excel%20Data%20Sets.htm
The dataset comprises 100 subjects with the following variables: age, gender, chest pain type, blood pressure, whether the fasting blood sugar is less than 120, and the class of the patient. Age and blood pressure are quantitative variables, while gender, chest pain type, the fasting blood sugar indicator, and the class of the subject are qualitative variables.
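As a rough sketch, the variables of the dataset can be recorded together with their type. The classification below follows the way the later tables treat the variables (blood pressure as quantitative, the fasting blood sugar indicator as a true/false qualitative variable); the names `Kind`, `VARIABLES`, and `names_of` are hypothetical and used only for illustration.

```python
from enum import Enum


class Kind(Enum):
    QUANTITATIVE = "quantitative"
    QUALITATIVE = "qualitative"


# Variable names as described in the text; the mapping itself is an illustrative assumption.
VARIABLES = {
    "age": Kind.QUANTITATIVE,
    "blood pressure": Kind.QUANTITATIVE,
    "gender": Kind.QUALITATIVE,
    "chest pain type": Kind.QUALITATIVE,
    "fasting blood sugar < 120": Kind.QUALITATIVE,
    "class": Kind.QUALITATIVE,
}


def names_of(kind):
    """Return the names of all variables of the given kind, in declaration order."""
    return [name for name, k in VARIABLES.items() if k is kind]


print("quantitative:", names_of(Kind.QUANTITATIVE))
print("qualitative: ", names_of(Kind.QUALITATIVE))
```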
Descriptive and Inferential Statistics
Table 1
| N | age | blood pressure |
| Valid | 100 | 100 |
| Missing | 0 | 0 |
Table 1 above shows the sample size of the study. There were 100 subjects in the study, with no missing values for age or blood pressure.
Table 2
Descriptive Statistics
| | N | Range | Minimum | Maximum | Mean | Std. Deviation | Variance | Kurtosis | Kurtosis Std. Error |
| age | 100 | 34 | 37 | 71 | 54.76 | 8.316 | 69.154 | -.882 | .478 |
| blood pressure | 100 | 76 | 104 | 180 | 132.37 | 15.048 | 226.437 | .098 | .478 |
| Valid N (listwise) | 100 | | | | | | | | |
Table 2 above presents the descriptive statistics of the quantitative variables age and blood pressure. The youngest subject was 37 years old and the oldest was 71 years old. The mean age in the study was 54.76 years, or approximately 55 years, with a standard deviation of 8.316. The highest blood pressure recorded was 180 and the lowest was 104, with a mean of 132.37 and a standard deviation of 15.048.
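Several of the values in Table 2 are linked by simple identities: the range is the maximum minus the minimum, and the variance is the square of the standard deviation. The sketch below checks these identities against the reported figures and computes the standard error of kurtosis from the sample size alone, using the usual large-software formula (assumed here to be the one behind the reported .478). The raw data are not reproduced in this paper, so only the published summary values are used; the names `reported`, `checks`, and `kurtosis_std_error` are hypothetical.

```python
import math


def kurtosis_std_error(n):
    """Standard error of sample excess kurtosis; depends only on the sample size n."""
    skew_se = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return 2.0 * skew_se * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Summary values copied from Table 2.
reported = {
    "age": {"min": 37, "max": 71, "range": 34, "sd": 8.316, "variance": 69.154},
    "blood pressure": {"min": 104, "max": 180, "range": 76, "sd": 15.048, "variance": 226.437},
}

checks = {}
for name, r in reported.items():
    checks[name] = {
        # Range should equal maximum minus minimum.
        "range_ok": r["max"] - r["min"] == r["range"],
        # Variance should equal the squared standard deviation, up to rounding.
        "variance_ok": abs(r["sd"] ** 2 - r["variance"]) < 0.01,
    }

kurtosis_se = kurtosis_std_error(100)
print(checks)
print(f"standard error of kurtosis for n = 100: {kurtosis_se:.3f}")
```

Because the standard error of kurtosis depends only on n, it is the same for both variables.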
Table 3
Statistics
| | age | blood pressure |
| N Valid | 100 | 100 |
| N Missing | 0 | 0 |
| Mean | 54.76 | 132.37 |
| Median | 56.00 | 130.00 |
| Mode | 44 (a) | 130 |
| Std. Deviation | 8.316 | 15.048 |
| Variance | 69.154 | 226.437 |
| Range | 34 | 76 |
a. Multiple modes exist. The smallest value is shown
Table 3 above shows the measures of central tendency of the quantitative variables. Age had a median of 56, a mode of 44, and a range of 34. Because multiple modes exist, 44 years is only the smallest of the most frequently occurring ages; it does not mean that the majority of subjects were aged 44. Blood pressure had a median of 130 and a mode of 130, so 130 was the most frequently recorded blood pressure.
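The footnote "Multiple modes exist. The smallest value is shown" can be illustrated with a small Python sketch. The ages below are hypothetical (the raw data are not reproduced in this paper); they are chosen only so that two values tie for the highest frequency.

```python
from statistics import multimode

# Hypothetical ages in which 44 and 58 both occur three times (illustration only).
hypothetical_ages = [44, 44, 44, 58, 58, 58, 51, 62, 39]

# multimode returns every value that ties for the highest frequency.
modes = multimode(hypothetical_ages)

# Reporting only the smallest mode mirrors the convention in the table footnote.
reported_mode = min(modes)

# Share of subjects at the reported mode: a mode is not the same as a majority.
mode_share = hypothetical_ages.count(reported_mode) / len(hypothetical_ages)

print(modes, reported_mode, round(mode_share, 2))
```

Here the reported mode covers only a third of the hypothetical subjects, which is why a mode should be read as "most frequent value" rather than "majority".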
Fig 1 and Fig 2 below present histograms of age and blood pressure, respectively. Neither variable appears markedly skewed, which suggests that the data are approximately normally distributed. Blood pressure has two outlying values, while age has none.
Fig 1 Age histogram
Fig 2 Blood Pressure histogram
Table 4
sex
| | Frequency | Percent | Valid Percent | Cumulative Percent |
| Female | 29 | 29.0 | 29.0 | 29.0 |
| Male | 71 | 71.0 | 71.0 | 100.0 |
| Total | 100 | 100.0 | 100.0 | |
Table 4 above shows the distribution of subjects by gender. This is illustrated in Fig 3 below.
Fig 3 Distribution of subjects by gender
Fig 3 indicates that males were the majority (71%), while females made up 29%.
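The percent and cumulative percent columns of Table 4 follow directly from the counts. A minimal sketch, using the two counts reported in the table (the function name `frequency_table` is hypothetical):

```python
def frequency_table(counts):
    """Build (category, count, percent, cumulative percent) rows from a dict of counts."""
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category, count in counts.items():
        percent = 100.0 * count / total
        cumulative += percent
        rows.append((category, count, round(percent, 1), round(cumulative, 1)))
    return rows


# Counts taken from Table 4.
sex_counts = {"Female": 29, "Male": 71}
table = frequency_table(sex_counts)

for row in table:
    print(row)
```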
The inferential statistics discussed in this study include the Chi-Square test of independence, linear regression, and tests of means. Before embarking on inferential statistics, it is wise to test whether the data are normally distributed in order to decide whether to use parametric or non-parametric techniques in the analysis (Lee, n.d.).
The hypotheses for testing normality are as follows:
H0: The data follow a normal distribution
H1: The data do not follow a normal distribution
Table 5
Tests of Normality
| | Kolmogorov-Smirnov (a) Statistic | Kolmogorov-Smirnov (a) df | Kolmogorov-Smirnov (a) Sig. | Shapiro-Wilk Statistic | Shapiro-Wilk df | Shapiro-Wilk Sig. |
| blood pressure | .113 | 100 | .423 | .971 | 100 | .526 |
a. Lilliefors Significance Correction
Table 5 above shows the results of the normality tests for blood pressure. The Shapiro-Wilk p-value (0.526) is greater than the 0.05 level of significance. Therefore, we fail to reject the null hypothesis and conclude that the data are consistent with a normal distribution. The Shapiro-Wilk test is used instead of the Kolmogorov-Smirnov test because of the sample size: since the sample size is greater than 25, we use the Shapiro-Wilk test (Machin, 2010).
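To show what the Kolmogorov-Smirnov statistic in Table 5 measures, the sketch below computes the largest gap between the empirical distribution of a sample and a normal distribution fitted to that sample. It is only an illustration under stated assumptions: the values are simulated (the raw blood pressure data are not reproduced here), the function name `ks_distance_to_normal` is hypothetical, and the code computes the distance only, not the Lilliefors-corrected significance or the Shapiro-Wilk statistic.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    distance = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        distance = max(distance, (i + 1) / n - cdf, cdf - i / n)
    return distance


# Simulated stand-in for the blood pressure data, using the reported mean and SD.
rng = random.Random(0)
hypothetical_bp = [rng.gauss(132.37, 15.048) for _ in range(100)]

d_stat = ks_distance_to_normal(hypothetical_bp)
print(f"KS distance for the simulated sample: {d_stat:.3f}")
```

A small distance means the sample's cumulative distribution stays close to the fitted normal curve; the significance value then says whether a gap of that size is unusual for normally distributed data.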
Chi-Square Test
The Chi-Square test is a statistical test used to test the association between two categorical variables (Paulk, 2012). The hypotheses used in testing for association are as follows:
H0: Gender and fasting blood sugar status (less than 120 or not) are independent; there is no significant association between them
H1: Gender and fasting blood sugar status (less than 120 or not) are not independent; there is a significant association between them
Table 6
sex * Fasting blood sugar <120 Crosstabulation
| sex | | Fasting blood sugar <120: False | Fasting blood sugar <120: True | Total |
| Female | Count | 26 | 3 | 29 |
| Female | % within sex | 89.7% | 10.3% | 100.0% |
| Female | % within Fasting blood sugar <120 | 29.9% | 23.1% | 29.0% |
| Female | % of Total | 26.0% | 3.0% | 29.0% |
| Male | Count | 61 | 10 | 71 |
| Male | % within sex | 85.9% | 14.1% | 100.0% |
| Male | % within Fasting blood sugar <120 | 70.1% | 76.9% | 71.0% |
| Male | % of Total | 61.0% | 10.0% | 71.0% |
| Total | Count | 87 | 13 | 100 |
| Total | % within sex | 87.0% | 13.0% | 100.0% |
| Total | % within Fasting blood sugar <120 | 100.0% | 100.0% | 100.0% |
| Total | % of Total | 87.0% | 13.0% | 100.0% |
Table 6 above indicates that most subjects of both sexes (89.7% of females and 85.9% of males) did not have a fasting blood sugar below 120.
Chi-Square Test Table
Table 7
Chi-Square Tests
| | Value | df | Asymp. Sig. (2-sided) | Exact Sig. (2-sided) | Exact Sig. (1-sided) |
| Pearson Chi-Square | .255 (a) | 1 | .614 | | |
| Continuity Correction (b) | .031 | 1 | .860 | | |
| Likelihood Ratio | .265 | 1 | .607 | | |
| Fisher’s Exact Test | | | | .751 | .444 |
| N of Valid Cases | 100 | | | | |
a. 1 cells (25.0%) have expected count less than 5. The minimum expected count is 3.77.
b. Computed only for a 2×2 table
Table 7 presents the various tests reported under Chi-Square; our interest is in the “Pearson Chi-Square”. The Pearson Chi-Square value is 0.255 with a p-value of 0.614. Since the p-value is greater than the 0.05 level of significance, we fail to reject the null hypothesis and conclude that there is no statistically significant association between gender and whether the fasting blood sugar is less than 120 (Pons, n.d.). Because one cell has an expected count below 5, Fisher’s Exact Test is also relevant; its two-sided p-value of 0.751 leads to the same conclusion.
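The Pearson Chi-Square value in Table 7 can be reproduced from the four counts in Table 6. The sketch below is a minimal standard-library implementation, not the software used for the original analysis; the function name `pearson_chi_square` is hypothetical, and the p-value shortcut via the complementary error function is valid only for one degree of freedom.

```python
import math


def pearson_chi_square(table):
    """Return (chi-square statistic, expected counts) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected


# Observed counts from Table 6: rows are Female, Male; columns are False, True.
observed = [[26, 3], [61, 10]]
chi2, expected = pearson_chi_square(observed)

# For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2.0))
min_expected = min(min(row) for row in expected)

print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}, minimum expected count = {min_expected:.2f}")
```

The smallest expected count, 3.77, is the one flagged in footnote a of Table 7.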
Table 8
Symmetric Measures
| | | Value | Approx. Sig. |
| Nominal by Nominal | Phi | .050 | .614 |
| Nominal by Nominal | Cramer’s V | .050 | .614 |
| N of Valid Cases | | 100 | |
Both Phi and Cramer’s V measure the strength of association between two variables. In Table 8, the strength of association between the two variables (0.050) is very weak.
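Cramer’s V is derived from the Chi-Square value, the sample size, and the dimensions of the crosstabulation; for a 2×2 table it coincides with Phi. The sketch below recomputes it from the rounded values reported in this paper, so small rounding differences are expected. The function name `cramers_v` is hypothetical, and the 26 rows used for the blood pressure by sex test are inferred from its reported 25 degrees of freedom with two sex categories.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Gender by fasting blood sugar: chi-square .255, n = 100, 2 x 2 table.
v_sugar = cramers_v(0.255, 100, 2, 2)

# Blood pressure by sex: chi-square 27.510, n = 100, 26 x 2 table (df = 25).
v_pressure = cramers_v(27.510, 100, 26, 2)

print(f"V (gender by fasting blood sugar): {v_sugar:.3f}")
print(f"V (blood pressure by sex):         {v_pressure:.3f}")
```

Because V is computed from the Chi-Square value, its significance is the same as that of the Chi-Square test; a large V on its own does not make an association statistically significant.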
Table 9
blood pressure * sex Crosstabulation
| blood pressure | | Female | Male | Total |
| 104 | Count | 0 | 1 | 1 |
| 104 | % within blood pressure | .0 | 1.0 | 1.0 |
| 104 | % within sex | .0 | .0 | .0 |
| 104 | % of Total | .0 | .0 | .0 |
| 105 | Count | 1 | 0 | 1 |
| 105 | % within blood pressure | 1.0 | .0 | 1.0 |
| 105 | % within sex | .0 | .0 | .0 |
| 105 | % of Total | .0 | .0 | .0 |
| 108 | Count | 1 | 0 | 1 |
| 108 | % within blood pressure | 1.0 | .0 | 1.0 |
| 108 | % within sex | .0 | .0 | .0 |
| 108 | % of Total | .0 | .0 | .0 |
| 110 | Count | 0 | 7 | 7 |
| 110 | % within blood pressure | .0 | 1.0 | 1.0 |
| 110 | % within sex | .0 | .1 | .1 |
| 110 | % of Total | .0 | .1 | .1 |
| 112 | Count | 0 | 2 | 2 |
| 112 | % within blood pressure | .0 | 1.0 | 1.0 |
| 112 | % within sex | .0 | .0 | .0 |
| 112 | % of Total | .0 | .0 | .0 |
| 115 | Count | 0 | 1 | 1 |
| 115 | % within blood pressure | .0 | 1.0 | 1.0 |
| 115 | % within sex | .0 | .0 | .0 |
| 115 | % of Total | .0 | .0 | .0 |
| Total | Count | 29 | 71 | 100 |
| Total | % within blood pressure | .3 | .7 | 1.0 |
| Total | % within sex | 1.0 | 1.0 | 1.0 |
| Total | % of Total | .3 | .7 | 1.0 |
Table 9 above suggests that the distribution of blood pressure readings differs between males and females.
Chi-Square Tests Table
The Chi-Square test in Table 10 below indicates that there is no statistically significant association between gender and blood pressure (p = 0.331).
Table 10
Chi-Square Tests
| | Value | df | Asymp. Sig. (2-sided) |
| Pearson Chi-Square | 27.510 (a) | 25 | .331 |
| Likelihood Ratio | 33.753 | 25 | .113 |
| N of Valid Cases | 100 | | |
a. 48 cells (92.3%) have expected count less than 5. The minimum expected count is .29.
Table 11
Symmetric Measures
| | | Value | Approx. Sig. |
| Nominal by Nominal | Phi | .525 | .331 |
| Nominal by Nominal | Cramer’s V | .525 | .331 |
| N of Valid Cases | | 100 | |
Table 11 appears to differ on the association between gender and blood pressure. The Cramer’s V value (0.525) indicates a strong association between the two variables (Vogt, 2012), but its significance value (0.331) is the same as that of the Chi-Square test, so the association is not statistically significant at the 0.05 level. In addition, 92.3% of the cells have an expected count below 5, so the chi-square approximation should be interpreted with caution.
Conclusion
The study above reveals interesting facts about the relationship between gender, age, blood pressure levels, and the classification of subjects as either sick or healthy. However, further studies should be conducted to provide more substantial evidence for these findings.
References
Aberson. (2010). Applied power analysis for the behavioral sciences. New York: Routledge Academic.
Agarwal, B. L. (n.d.). Basic Statistics.
Brase, C. H. (2013). Understanding basic statistics. Australia: Cole Cengage Learning.
Daniel, W. W. (2010). Biostatistics. Chichester: John Wiley.
Field, A. P. (2014). Discovering statistics using R. London: Sage.
Fraser. (2012). Business Statistics for competitive advantage with Excel 2010. New York: Springer.
Friedman, L. M. (2010). Fundamentals of clinical trials. New York: Springer.
Givens, G. H. (2013). Computational statistics. Hoboken: Wiley.
Knopov, P. S. (2012). Regression Analysis Under A Priori Parameter Restrictions. New York: Springer-Verlag.
Lee, E. T. (n.d.). Statistical methods for survival data analysis.
Machin, D. A. (2010). Randomized clinical trials. West Sussex: Wiley-Blackwell.
Paulk, A. (2012). Understanding regression analysis. New Delhi: Orange Apple.
Pons, O. (n.d.). Inequalities in analysis and probability.
Vogt, W. P. (2012). Correlation and regression analysis. Los Angeles: Sage.