Introduction To The Consultancy Brief
Part 1 of the report summarised the findings from an analysis of weather data from the Szeged station in Hungary. That part gave an overview of the descriptive analysis and data cleaning, and included visual displays of the variables to show how the data were distributed. In this second report, the researcher undertakes regression modelling and correlation analysis to study the association between the various variables and the actual and apparent temperature.
The data set consists of weather data from the Szeged station in Hungary. It is a subset of the weather data recorded between 2006 and 2016.
Preliminary data analysis assisted in understanding the distribution of the data in the variables under study. The researcher conducted the analysis using descriptive statistics and visual displays of the continuous variables. For the categorical variables, tables were used to summarise the data.
From the dataset provided, the variables actual temperature, apparent temperature, humidity, wind speed, wind bearing, visibility and pressure were classified as continuous. A descriptive analysis of these variables is presented below.
In the ten years between 2006 and 2016, the maximum actual temperature recorded was 34.9 degrees Celsius, while apparent temperature reached a maximum of 35.3 degrees Celsius. The Szeged site had an average actual temperature of 11.6 degrees Celsius and an average apparent temperature of 10.7 degrees Celsius. Humidity reached a maximum of 1 over the period, with an average of 0.76. Wind speed went as high as 26.76 km/hr, though the average was 9.67 km/hr. Visibility was at its best at 16 km and at its worst at 0.322 km. The air pressure at Szeged ranged between 997.2 (minimum) and 1038.7 (maximum), with an average of 1017.1.
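As an illustration of how such a descriptive summary can be produced, the following is a minimal sketch in Python. The function name `describe` and the sample readings are hypothetical and are not taken from the Szeged data.

```python
from statistics import mean

def describe(values):
    """Return the minimum, maximum and mean of a list of numbers."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}

# Hypothetical temperature readings in degrees Celsius (not the Szeged data)
sample_temps = [2.5, 11.0, 14.5, 20.0, 27.0]
summary = describe(sample_temps)
print(summary)
```

The same function could be applied to each continuous variable in turn to build a summary table.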
To observe the actual distribution of the data, a histogram was drawn for each of the continuous variables. Figure 1 displays a histogram for the actual temperature data.
Figure 1. Histogram of the actual temperature data
Figure 1 is a unimodal histogram with the modal range between 10 and 15 degrees Celsius. The shape of the histogram suggests that the data is approximately normally distributed. Figure 2 is a visual display of the histogram for the apparent temperature data.
Figure 2. Histogram of the apparent temperature data
The histogram is bimodal, with the higher mode in the temperature range 10 to 15 degrees Celsius. The same modal range was observed for the actual temperature data. As with actual temperature, the apparent temperature data is approximately normally distributed.
Figure 3 presents the histogram for the humidity variable.
The histogram shows that the data is not normally distributed; rather, it is skewed to the left. Most of the days recorded a high humidity. The skewness means that the data contains a higher number of larger values compared to the smaller values.
The variables summary and precipitation type are classified as categorical. The researcher analysed and summarised these variables using tables.
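The direction of skew read from a histogram can also be checked numerically. The sketch below computes a moment-based skewness coefficient, where a negative value indicates left skew. The function name and the humidity readings are hypothetical, chosen only to illustrate the idea.

```python
def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical humidity readings with a long tail of low values
humidity = [0.30, 0.70, 0.80, 0.85, 0.90, 0.92, 0.95]
left_skew = skewness(humidity)
print(left_skew < 0)  # a negative coefficient indicates left skew
```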
Introduction To the Data Set
Of the weather descriptions recorded on a daily basis, most of the observations were summarised as partly cloudy. Foggy weather was the least common, with just a few days being summarised using that code. Between 2006 and 2016, the weather pattern at Szeged, Hungary was mostly summarised as either mostly cloudy or partly cloudy.
The most common precipitation type at the weather site was rain. Of the two main precipitation types, snow was experienced on just a few of the days.
The researcher used summary statistics to review the data and identify unusual observations. The signs of unusual observations that the researcher checked for were the presence of NAs, null values and outliers.
Table 3 showed that variables such as wind bearing, visibility and pressure had some null values. The records with null values were viewed as incomplete and removed from the data before conducting further analysis.
Each of the variables had at least one entry labelled NA. The data rows with such entries were also classified as incomplete and eliminated from the data.
NA entries were treated as unusual since an observation had to be made each day. The researcher viewed such entries as originating from errors.
The summary variable had several classifications whose interpretation did not give a clear distinction between the summarised weather conditions. After observing the data, the summaries were grouped into four categories, namely clear, foggy, overcast and cloudy. Table 5 provides a descriptive analysis of the new summary variable.
The table indicates that most of the weather observations were classified as cloudy. Only a small proportion of the days were categorised as foggy.
This section of the report looked at the relationship between the data variables and the two variables of interest, that is, actual and apparent temperature.
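The removal of incomplete records described above can be sketched as follows. This is a minimal illustration, assuming each record is held as a dictionary and that missing entries appear as `None`, an empty string or the text "NA". The field names and values are hypothetical.

```python
MISSING = (None, "", "NA")

def drop_incomplete(rows):
    """Keep only the rows in which every field holds a usable value."""
    return [row for row in rows if not any(v in MISSING for v in row.values())]

# Hypothetical records: the second and third each have one missing entry
records = [
    {"pressure": 1015.2, "visibility": 9.9},
    {"pressure": None, "visibility": 11.3},
    {"pressure": 1012.8, "visibility": "NA"},
]
cleaned = drop_incomplete(records)
print(len(cleaned))
```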
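The report does not list which original summaries were assigned to each of the four groups, so the sketch below uses a keyword rule that is purely an assumption. It shows one way such a regrouping could be expressed, with unmatched labels left for manual review.

```python
def regroup(summary):
    """Map a raw weather summary to one of four groups (assumed keyword rule)."""
    label = summary.lower()
    rules = (("fog", "foggy"), ("overcast", "overcast"),
             ("cloudy", "cloudy"), ("clear", "clear"))
    for keyword, group in rules:
        if keyword in label:
            return group
    return None  # no keyword matched: leave for manual review

# Hypothetical raw labels
labels = ["Partly Cloudy", "Foggy", "Breezy and Overcast", "Clear"]
grouped = [regroup(label) for label in labels]
print(grouped)
```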
Effect of Humidity
The diagram shows that the points move downwards from left to right. This is a sign of a negative relationship between the two variables. A correlation analysis is expected to result in a value between -1 and 0. As humidity increases, actual temperature tends to go down.
The points are randomly distributed with no defined pattern. A correlation analysis is expected to show little to no relationship between wind speed and actual temperature.
Just like the case of wind speed, the points are random with no pattern shown. A study of the correlation between wind bearing and actual temperature is expected to show no relationship.
The scatter plot displayed in figure 11 describes the relationship between apparent temperature and visibility. The points are randomly scattered, which means correlation analysis is expected to show that there is no relationship between apparent temperature and visibility.
The mean actual temperature is observed to be higher for the rain type of precipitation than for snow. The same trend is observed for apparent temperature. The findings suggest that precipitation type is related to both apparent and actual temperature. Studying the relationship between summary and apparent/actual temperature indicates that the variables are dependent on each other. The mean values of both apparent and actual temperature differ significantly across the summary categories. When the weather is cloudy, temperature tends to be high in both cases, with lower temperature values observed when the weather is summarised as foggy.
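The comparison of mean temperature across categories can be sketched as a simple grouped mean. The function name, field names and values below are hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict
from statistics import mean

def group_means(rows, group_key, value_key):
    """Return the mean of value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {name: mean(values) for name, values in groups.items()}

# Hypothetical observations (not the Szeged data)
rows = [
    {"precip": "rain", "temp": 12.0},
    {"precip": "rain", "temp": 16.0},
    {"precip": "snow", "temp": -2.0},
    {"precip": "snow", "temp": 0.0},
]
means = group_means(rows, "precip", "temp")
print(means)
```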
Initial Analysis
The one sample hypothesis test looked at the relationship between precipitation type and pressure. The researcher tested whether the difference between the mean pressure during rain and that during snow is significantly different from 1000. The test is a two-sided hypothesis test.
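The hypotheses tested are
$H_0: \mu_{\text{rain}} - \mu_{\text{snow}} = 1000$
$H_1: \mu_{\text{rain}} - \mu_{\text{snow}} \neq 1000$
where $\mu_{\text{rain}}$ is the mean pressure during rain and $\mu_{\text{snow}}$ is the mean pressure during snow.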
The hypothesis test was conducted at a 95% confidence level, meaning the significance level (alpha) was 0.05.
The test obtained a p value less than 0.05 hence the decision was to reject the null hypothesis. The researcher concluded that the difference in mean pressure is significantly different from 1000.
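A two-sided test of a difference in means against a hypothesised value can be sketched as below. This is not necessarily the procedure the researcher ran: it uses unpooled standard errors and, as a stated simplification, a large-sample normal approximation in place of the t distribution. The function name and the pressure readings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def diff_in_means_test(a, b, hypothesised_diff):
    """Return (z, two-sided p) for H0: mean(a) - mean(b) = hypothesised_diff."""
    std_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = ((mean(a) - mean(b)) - hypothesised_diff) / std_error
    # Normal approximation to the sampling distribution (assumes large samples)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical pressure readings for the two precipitation types
rain = [1015.0, 1016.0, 1017.0]
snow = [1019.0, 1020.0, 1021.0]
z_stat, p_value = diff_in_means_test(rain, snow, 1000)
print(p_value < 0.05)
```

With real data, a statistics library that provides the t distribution would give a more accurate p value for small samples.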
The researcher conducted the two-sample hypothesis test to analyse the difference in the mean pressure values under the different summary categories. The test used was the ANOVA test. The test was done at a 95% confidence level, meaning the value of alpha was 0.05. The hypotheses tested were
$H_0$: the mean pressure is the same in every summary category
$H_1$: at least one mean is different
The test resulted in a p value of 0.000306, which is less than alpha (0.05).
Since the p value is less than alpha, the researcher rejected the null hypothesis. The test can be used to conclude that at least one of the mean values is different from the others. The hypothesis test indicates that pressure values differ depending on the weather summary category.
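The F statistic behind a one-way ANOVA can be sketched as the ratio of between-group to within-group mean squares. The example below computes only the statistic and its degrees of freedom; converting it to a p value requires the F distribution, which is assumed to come from a statistics library. The function name and the groups are hypothetical.

```python
def anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical values for three categories
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = anova_f(groups)
print(f_stat, df_b, df_w)
```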
The objectives are to create regression models for actual temperature and apparent temperature using the data discussed in a previous report.
This section analyses the correlation between the continuous variables and the two variables of interest (apparent and actual temperature).
The relationship between humidity and actual temperature is summarised by a Pearson correlation value of -0.6438. The value means that humidity and actual temperature have a strong negative correlation: higher humidity is associated with lower actual temperature. On the other hand, actual temperature and wind speed have a Pearson correlation value of 0.1373, which means that wind speed has a weak positive correlation with actual temperature. Higher wind speed is associated with higher actual temperature, though only on a minimal scale. As with wind speed, wind bearing also has a weak positive correlation with actual temperature (correlation value of 0.1655). Visibility and actual temperature have a Pearson correlation value of 0.304, which means that visibility has a weak positive association with actual temperature: higher actual temperature is associated with greater visibility. For pressure, the correlation value is -0.3925, which is interpreted as a weak negative correlation: as actual temperature increases, the values of pressure tend to drop.
The relationship between apparent temperature and humidity is described as a strong negative relationship (Pearson correlation value of -0.6151): higher apparent temperature is associated with lower humidity. For wind speed and wind bearing, the correlation with apparent temperature is described as a weak positive association. Visibility and apparent temperature have a weak positive association, while pressure has a weak negative relationship with apparent temperature.
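For reference, the Pearson correlation coefficient used in this section can be computed as in the sketch below. The function name and the paired readings are hypothetical. The coefficient measures linear association only and does not by itself show that one variable causes the other.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)

# Hypothetical paired readings: temperature falls as humidity rises
humidity = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
temperature = [25.0, 22.0, 20.0, 15.0, 12.0, 8.0]
r = pearson(humidity, temperature)
print(r < 0)
```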
The regression model for the actual temperature variable is presented below.
Continuous Variable
The model selection was done using a stepwise regression approach. The model obtained was as presented below.
The model provides the variables that significantly predict the actual temperature. From the model, the actual temperature values can be presented using the linear equation
The model has an adjusted R square value of 0.6846, which means it is a better predictor than the initial model obtained under the regression section.
The normal Q-Q plot above was used to check the normality assumption. The residuals are approximately normal, as the majority fall along the straight line. Thus, the normality assumption has been met.
The scale-location plot is used to check the homogeneity of variance assumption. Ideally, the points should be evenly spread along a horizontal line. This is not the case here, implying that heteroskedasticity is a problem.
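The points of a normal Q-Q plot can be computed by pairing the sorted, standardised residuals with the matching quantiles of a standard normal distribution. The sketch below assumes the plotting positions (i + 0.5) / n, which is one common convention among several. The function name and the residuals are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair standard normal quantiles with sorted standardised residuals."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    observed = sorted((r - centre) / spread for r in residuals)
    # Plotting positions (i + 0.5) / n are an assumed convention
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

# Hypothetical residuals
residuals = [-1.2, -0.4, 0.1, 0.3, 1.2]
points = qq_points(residuals)
print(len(points))
```

If the residuals are approximately normal, the pairs lie close to the line where the two values are equal.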
The regression model for the apparent temperature is displayed below.
Assumptions check
From the Q-Q plot above, the normality assumption is met as most of the residuals fall along the straight line along the diagonal.
The homogeneity assumption is violated as the points do not fall evenly along a horizontal line.
Forward stepwise regression was used to select the optimum set of variables for the regression model with actual temperature as the dependent variable. It uses AIC to identify the best model after stepwise addition of variables. The model with the lowest AIC is the best model. The output is as shown below.
As seen, the final model eliminates the variable visibility to give the lowest AIC.
Again, forward stepwise regression was used in the selection of the best model. The output is as shown below.
Again, the variable visibility is eliminated to give the lowest AIC.
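The logic of forward selection by AIC can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the researcher's actual procedure: AIC is taken in its Gaussian form up to an additive constant, n ln(RSS/n) + 2k, the least-squares fit is obtained from the normal equations, and the toy predictors `x1` and `x2` are hypothetical.

```python
import math

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def rss(columns, y):
    """Fit y on an intercept plus columns; return the residual sum of squares."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def aic(columns, y):
    """Gaussian AIC up to a constant: n * ln(RSS / n) + 2 * (coefficients)."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)

def forward_select(predictors, y):
    """Add one predictor at a time for as long as doing so lowers the AIC."""
    selected, best = [], aic([], y)
    while len(selected) < len(predictors):
        scores = {name: aic([predictors[s] for s in selected + [name]], y)
                  for name in predictors if name not in selected}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:
            break
        selected.append(candidate)
        best = scores[candidate]
    return selected, best

# Hypothetical toy data: y depends on x1 only; x2 is unrelated to y
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
}
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
y = [2.0 * x + e for x, e in zip(predictors["x1"], noise)]
chosen, chosen_aic = forward_select(predictors, y)
print(chosen)
```

A variable that does not reduce the residual sum of squares enough to offset the penalty of 2 per coefficient is left out, which is the sense in which a predictor is dropped from the final model.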
The regression model has an adjusted R square value of 0.679. The value means that the independent variables in the regression model can explain 67.9% of the variation in the actual temperature. The F statistic has a p value which is lower than 0.05. Hence, at the 5% level of significance, the model is a good fit. Analysis of the individual independent variables was done using the p values of the t statistics. A p value greater than 0.05 is treated as not significant at the 5% level of significance. Using this criterion, the variables precipitation type, humidity, wind bearing and pressure are significant predictors of actual temperature.
The general model has an adjusted R square value of 0.6569. The value means that the model can explain up to 65.69% of the variation in apparent temperature values. The F statistic p value is less than 0.05, so the model is interpreted as a good fit. Analysis of the coefficients of the independent variables using the p values of the t statistics shows that the variables precipitation type, humidity, wind speed, wind bearing and pressure are significant predictors of apparent temperature.
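The adjusted R square values quoted for both models follow from the ordinary R square by penalising the number of predictors. A minimal sketch of the two calculations is given below. The function names and the numbers in the example are hypothetical.

```python
def r_squared(actual, fitted):
    """Share of the variation in actual that is explained by fitted."""
    centre = sum(actual) / len(actual)
    ss_residual = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_total = sum((a - centre) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjust R square for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical example: R square of 0.5 from 11 observations and 2 predictors
adjusted = adjusted_r_squared(0.5, 11, 2)
print(adjusted)
```

Because the adjustment grows with the number of predictors, adjusted R square is the fairer figure when comparing models of different sizes.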
The researcher’s findings indicate that higher humidity is associated with lower actual temperature. On the other hand, wind speed has a weak positive correlation with actual temperature. As with wind speed, wind bearing also has a weak positive correlation with actual temperature. Moreover, visibility has a weak positive association with actual temperature. For pressure, as actual temperature increases, the values of pressure tend to drop.
Higher apparent temperature is associated with lower humidity. For wind speed and wind bearing, the correlation with apparent temperature is described as a weak positive association. Visibility and apparent temperature have a weak positive association, while pressure has a weak negative relationship with apparent temperature.
The regression model has an adjusted R square value of 0.679. The value means that the independent variables in the regression model can explain 67.9% of the variation in the actual temperature. The F statistic has a p value which is lower than 0.05. Hence, at the 5% level of significance, the model is a good fit. Analysis of the individual independent variables was done using the p values of the t statistics. A p value greater than 0.05 is treated as not significant at the 5% level of significance. A better model was selected using a stepwise regression approach. From the model, the actual temperature values can be presented using the linear equation