Descriptive Statistics
Demographics play a vital role in the sales of retail businesses, especially supermarket chains. To estimate the sales that a supermarket can generate from a regional catchment area, it is imperative to consider purchasing power along with other crucial factors, such as age and gender, which tend to alter customer preferences. Against this backdrop, the objective of this analysis is to examine the demographics of the selected sample data using a range of descriptive statistics. Additionally, inferential statistical techniques are used to draw conclusions about the underlying population from the samples. Correlation and regression analysis are then used to highlight the relationships between selected variables and their implications for the supermarket. Based on the analysis of the sample data, recommendations are offered that the supermarket chain can use to enhance its business.
The population data provided covers various demographic aspects of 1000 persons. The key variables are age, gender, marital status, mortgage, children, salary, debt, location and amount spent during the given year. As explained above, these demographic variables are of the essence for the supermarket chain. Instead of using the full population for the statistical analysis, two samples of 25 people each have been selected for further research. Both samples were obtained through random sampling, but by different methods. The first was obtained by simple random sampling, in which a sample of 25 was drawn from the population of 1000 persons using Excel as the enabling tool. The second was obtained by systematic random sampling, in which observations are selected from the population at equal intervals.
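The two sampling methods described above can be sketched in Python as follows. This is only an illustrative sketch and not the Excel procedure actually used: the function names, the seed and the list of row numbers standing in for the 1000 person records are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random start in the first interval, then take every k-th item."""
    rng = random.Random(seed)
    k = len(population) // n      # sampling interval (40 when N = 1000, n = 25)
    start = rng.randrange(k)      # random starting index between 0 and k - 1
    return [population[start + i * k] for i in range(n)]

# Hypothetical stand-in for the population: row numbers 1..1000.
person_ids = list(range(1, 1001))

srs = simple_random_sample(person_ids, 25, seed=1)
sys_sample = systematic_sample(person_ids, 25, seed=1)
```

With a population of 1000 and a sample of 25, the systematic sample takes every 40th record after the random start, so consecutive selected row numbers always differ by 40.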
The descriptive statistics of the various variables in the two samples are discussed below.
Zone
Sample 1
The majority of the customers in sample 1 are from Zone C, while the remaining customers are evenly distributed across the other three zones, namely A, B and D.
Sample 2
The majority of the customers in sample 2 are from Zone C, while the remaining customers are evenly distributed across the other three zones, namely A, B and D.
Children
Sample 1
The majority of people in the sample do not have children. Of those who do, most have only one child, and very few have two or three.
Sample 2
The sample is split almost equally between people with 0, 1 and 2 children, and only a few persons have three children.
Salary
Sample 1
The above distribution of salary is clearly non-normal and positively skewed, owing to the presence of a few persons with very high salaries.
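Positive skew of this kind can also be checked numerically. The sketch below computes the moment-based skewness coefficient, which is positive when a few large values stretch the right tail. The salary figures are hypothetical values chosen only to illustrate the calculation and are not taken from either sample.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical salaries: most are moderate, one is very high.
salaries = [30000, 32000, 35000, 38000, 40000, 42000, 45000, 150000]

salary_skew = skewness(salaries)        # positive for a long right tail
symmetric_skew = skewness([1, 2, 3, 4, 5])  # zero for symmetric data
```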
Sample 2
The above distribution of salary is clearly non-normal and positively skewed, owing to the presence of a few persons with very high salaries.
Gender
Sample 1
The majority of the customers in sample 1 are female, as indicated above.
Sample 2
About two-thirds of the customers in sample 2 are female, as indicated above.
Debt
Sample 1
Nearly half of the sample has debt lower than $10,000. Compared with sample 2, debt levels in sample 1 are lower, which augurs well for the supermarket.
Sample 2
In sample 2, the majority of people have debt lower than $15,000, and only very few have debt in excess of $15,000, ranging up to $25,000.
Inferential Statistics
To ascertain whether the population meets management's requirement for the minimum average annual spend, hypothesis testing is used to draw a conclusion about the population from the sample characteristics.
Sample 1
The requisite hypotheses are as indicated below.
Null Hypothesis: µ ≤ $2,387
Alternative Hypothesis: µ > $2,387
The relevant test statistic is Z, since the population standard deviation is known in this case. The test is a right-tailed test. The relevant output from Excel is shown below.
The level of significance is taken as 5%.
The p value is 0.9145, which exceeds the level of significance. The available evidence is therefore insufficient to reject the null hypothesis, and the alternative hypothesis cannot be accepted. As a result, it cannot be concluded that the average yearly spending of the population exceeds $2,387.
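The mechanics of a right-tailed Z test with a known population standard deviation can be sketched as follows. The sample mean and standard deviation used here are hypothetical placeholders, since the Excel output is not reproduced in the text, so the numbers produced by this sketch are not the results reported above. Only the hypothesised mean of $2,387, the sample size of 25 and the 5% significance level come from the analysis.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_test(sample_mean, mu0, sigma, n):
    """Return (z, p) for testing whether the population mean exceeds mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)   # area in the right tail beyond z
    return z, p_value

# sample_mean and sigma are HYPOTHETICAL; mu0, n and alpha follow the text.
z, p = right_tailed_z_test(sample_mean=2300.0, mu0=2387.0, sigma=500.0, n=25)
alpha = 0.05
reject_null = p < alpha
```

A sample mean below the hypothesised value gives a negative Z statistic and a right-tail p value above 0.5, so the null hypothesis is not rejected in such a case.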
Sample 2
The requisite hypotheses are as indicated below.
Null Hypothesis: µ ≤ $2,387
Alternative Hypothesis: µ > $2,387
The relevant test statistic is Z, since the population standard deviation is known in this case. The test is a right-tailed test. The relevant output from Excel is shown below.
The p value is 0.9145, which exceeds the 5% level of significance. The available evidence is therefore insufficient to reject the null hypothesis, and the alternative hypothesis cannot be accepted. As a result, it cannot be concluded that the average yearly spending of the population exceeds $2,387.
Correlation and Regression Analysis
Correlation and regression analysis are vital tools for analysing the linear relationship between two variables. They are frequently used to assess whether an association exists and to gauge its nature and extent, although they do not by themselves establish causation.
Sample 1
A correlation matrix is a useful technique for summarising the correlations between the various numerical variables. It is indicated as follows.
The sign of the correlation coefficient highlights the nature of the relationship between the variables. A positive relationship implies that the two variables move in the same direction, while a negative relationship implies that they move in opposite directions. Further, the magnitude of the correlation coefficient highlights the strength of the relationship. The magnitude varies between 0 and 1, with 0 indicating no linear relationship and 1 indicating a perfect linear relationship. In the given case, one potentially significant relationship is between amount spent and salary. The positive value of the correlation coefficient implies that amount spent and salary tend to move in the same direction. This makes sense, since people with a higher salary would tend to have a higher annual spend. Another key relationship is between the number of children and the amount spent: people with children tend to spend more on average. This information is valuable to the supermarket, as it would ideally like its target customers to have a higher salary and children.
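The correlation coefficients in such a matrix are Pearson coefficients, which can be computed as sketched below. The three columns of data are hypothetical values invented for illustration; they are not the sample data, and the helper names are assumptions of this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every pair of column names to its Pearson coefficient."""
    return {a: {b: pearson_r(columns[a], columns[b]) for b in columns}
            for a in columns}

# HYPOTHETICAL data for six people, for illustration only.
data = {
    "salary":       [30000, 42000, 55000, 61000, 75000, 90000],
    "children":     [0, 1, 0, 2, 1, 3],
    "amount_spent": [1500, 2100, 1900, 2900, 2600, 3800],
}

matrix = correlation_matrix(data)
```

Each entry lies between -1 and 1, the diagonal entries equal 1, and the matrix is symmetric because the correlation of x with y equals that of y with x.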
A linear regression analysis has been performed between salary (independent variable) and amount spent (dependent variable). The output obtained from Excel is shown below.
The linear regression equation is as follows.
Amount Spent = 811.9 + 0.02 * Salary
The intercept coefficient is 811.9, which represents the average annual spending of a person whose salary is zero. Further, the slope of 0.02 implies that as a person's salary increases by $1, the yearly amount spent tends to increase by $0.02. The supermarket chain can use this information for targeting its customers.
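As a worked illustration of the fitted equation, the sketch below applies the reported coefficients to a salary value. The salary of $50,000 is a hypothetical input chosen for the example, and the coefficients are used as rounded in the text, so the result is approximate.

```python
INTERCEPT = 811.9   # reported intercept of the sample 1 regression
SLOPE = 0.02        # reported slope: extra dollars spent per extra dollar of salary

def predicted_spend(salary):
    """Predicted yearly amount spent for a given salary, per the fitted line."""
    return INTERCEPT + SLOPE * salary

# Hypothetical salary of $50,000: 811.9 + 0.02 * 50000
example = predicted_spend(50000)
print(f"{example:.2f}")   # prints 1811.90
```

Each additional $1,000 of salary raises the predicted yearly spend by $20 under this equation.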
Also, the slope of the regression equation is statistically significant: the p value corresponding to the salary slope is 0.00, which leads to rejection of the null hypothesis that the slope is zero. Further, the ANOVA output indicates that the linear regression model as a whole is significant.
Further, the coefficient of determination, or R² value, is 0.4055, which implies that variation in salary accounts for 40.55% of the variation in amount spent. The remaining 59.45% of the variation cannot be explained by the given model, and hence it is imperative to introduce other suitable independent variables.
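The coefficient of determination can be sketched as one minus the ratio of the residual sum of squares to the total sum of squares, computed after a least-squares fit. The data below are hypothetical values used only to show the calculation; they do not reproduce the reported value of 0.4055.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y, intercept, slope):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# HYPOTHETICAL salary and amount-spent values, for illustration only.
salary = [30000, 42000, 55000, 61000, 75000, 90000]
spent = [1500, 2300, 1700, 2900, 2200, 3100]

a, b = fit_line(salary, spent)
r2 = r_squared(salary, spent, a, b)
unexplained = 1.0 - r2   # share left to other variables and noise
```

In a simple regression with one independent variable, R² equals the square of the Pearson correlation coefficient between the two variables.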
Sample 2
A correlation matrix is a useful technique for summarising the correlations between the various numerical variables. It is indicated as follows.
In this sample also, the amount spent has a significant positive correlation with children and salary. A detailed explanation of these correlations has been offered for the previous sample. Another significant correlation is observed between mortgage and invested. This suggests that people who take a mortgage to buy a home are more likely to invest in retirement planning. Also, the bigger the mortgage, the higher the percentage of combined income that is diverted towards retirement planning.
A linear regression analysis has been performed between children (independent variable) and amount spent (dependent variable). The output obtained from Excel is shown below.
The linear regression equation is as follows.
Amount Spent = 1966.45 + 338.50 * Children
The intercept coefficient is 1966.45, which represents the average annual spending of a person who has no children. Further, the slope of 338.50 implies that each additional child is associated with an increase of $338.50 in the yearly amount spent.
Further, the coefficient of determination, or R² value, is 0.2215, which implies that variation in the number of children accounts for 22.15% of the variation in amount spent. The remaining 77.85% of the variation cannot be explained by the given model, and hence it is imperative to introduce other suitable independent variables.
Conclusion and Recommendations
The data provided in the given case seem sensible, since for the majority of the variables the observations from the two samples converge, which lends credence to the view that the sample trends are representative of the population. This is quite apparent from the correlation matrices for the two samples, where amount spent shows a significant relationship with the same two variables. Considering that the majority of the customers hail from Zone C, it makes sense for the supermarket to choose a location in that zone. However, in this regard, consideration needs to be given to key variables such as children and salary, which need to be understood zone by zone. This is not possible using the given samples, as they are quite small and can therefore be potentially non-representative of the actual population.
With regard to the results obtained from the two samples, the inferential analysis and the relationship analysis yield comparable results. However, the summary statistics show obvious differences between the two samples, which is along expected lines considering that the sample size is quite small. Ideally, the results would have been more similar if a minimum sample size had been estimated from the underlying population parameters and the samples had been drawn in accordance with this size.
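One common way to estimate such a minimum sample size for a mean, when the population standard deviation is known, is n = (z × σ / E)², rounded up, where E is the acceptable margin of error. The sketch below illustrates this formula. The standard deviation and margin of error are hypothetical inputs, since the population parameters are not stated in the text, and the finite population correction is ignored for simplicity.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(sigma, margin_of_error, confidence=0.95):
    """Smallest n for which the confidence interval half-width is <= margin."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)  # two-sided critical value
    return ceil((z * sigma / margin_of_error) ** 2)

# HYPOTHETICAL inputs: sigma of $500 in yearly spend, margin of error of $100.
n_required = min_sample_size(sigma=500.0, margin_of_error=100.0)
```

Halving the margin of error roughly quadruples the required sample size, which is why small samples such as 25 give wide intervals around the estimated mean.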