Total Trips Made For Each Household
There are two datasets under consideration. One consists of transport data for 858 households and the other contains data on trips made by the individuals of each household. The variable, total trips made for each household, is the total count of trips made by all individuals in a household. The following table gives the frequency distribution of the total trip count. The mean was found to be 11.38.
TOTAL TRIPS | FREQUENCY
0-9 | 428
10-19 | 323
20-29 | 77
30-39 | 25
40-49 | 4
50-59 | 1
Grand Total | 858
Table 1: Frequency distribution of total trips for a household
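As a rough cross-check of the reported mean, the grouped frequencies in Table 1 can be turned into an approximate mean by placing every household at its class midpoint. This is only a sketch: the midpoint rule is an assumption made here, and the reported value of 11.38 was presumably computed from the raw data, so the two figures should be close but need not agree exactly.

```python
# Class intervals and frequencies copied from Table 1.
bins = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]
freqs = [428, 323, 77, 25, 4, 1]


def grouped_mean(bins, freqs):
    """Approximate the mean by placing every observation at its class midpoint."""
    total = sum(freqs)
    weighted = sum((lo + hi) / 2 * f for (lo, hi), f in zip(bins, freqs))
    return weighted / total


n_households = sum(freqs)
approx_mean = grouped_mean(bins, freqs)
print(n_households, round(approx_mean, 2))
```

The grouped estimate comes out a little below the reported 11.38, which is what one would expect when raw counts are replaced by midpoints.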
Count data are commonly modelled with a Poisson distribution, and the frequency distribution suggests the same here. The histogram shows high frequencies at the lower values, which fall rapidly as the count increases. Thus it is reasonable to assume that the variable, total trips per household, follows a Poisson distribution. The mean of the variable was found to be 11.38. Since this mean is greater than 10, it can be considered large enough for the normal approximation to the Poisson distribution to apply. Thus the total number of trips by a household is assumed to follow a normal distribution with mean 11.38 and variance 11.38.
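To see how close the normal approximation is for a mean of this size, the following sketch compares the exact Poisson cumulative probabilities with those of a normal distribution that has the same mean and variance. The continuity correction (evaluating the normal CDF at k + 0.5) and the range of counts checked are choices made here for illustration, not part of the original analysis.

```python
import math

LAM = 11.38  # mean reported in the text, used as the Poisson rate


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) obtained from P(X = i - 1)
        total += term
    return total


def normal_cdf(x, mean, var):
    """P(Y <= x) for Y ~ Normal(mean, var)."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


# Largest gap between the exact CDF and the continuity-corrected normal CDF.
max_gap = max(
    abs(poisson_cdf(k, LAM) - normal_cdf(k + 0.5, LAM, LAM))
    for k in range(0, 41)
)
print(round(max_gap, 3))
```

A small maximum gap supports treating the count as approximately normal, under the assumption that the Poisson model itself is appropriate.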
Again, the following table gives the frequency distribution of the time interval or length of time taken for each trip by an individual. The mean was found to be 16.379.
TIME | FREQUENCY
1-100 | 8235
101-200 | 56
201-300 | 9
301-400 | 4
401-500 | 2
501-600 | 1
601-700 | 1
701-800 | 2
Grand Total | 8310
Table 2: Frequency distribution of time taken for each trip
The frequency distribution here also suggests a Poisson distribution. The histogram shows high frequencies at the lower values, which fall rapidly as the value increases. Thus it is reasonable to assume that the variable, time taken per trip, follows a Poisson distribution. The mean of the variable was found to be 16.379. Since this mean is greater than 10, it can be considered large enough for the normal approximation to the Poisson distribution to apply. Thus the time taken per trip is assumed to follow a normal distribution with mean 16.379 and variance 16.379. The R code used for the computations is provided in the APPENDIX.
With the normal distribution as the assumed underlying probability distribution, the analysis is simplified. As noted above, the normal distribution is a good approximation for large samples. It is a symmetric, bell-shaped distribution with a number of convenient mathematical properties that make calculations easy. One of its most important properties is its role in the central limit theorem, which relates the shape of the population distribution to the shape of the sampling distribution of the mean: the sampling distribution of the mean approaches a normal distribution as the sample size increases. The normal (Gaussian) distribution also provides a basis for measuring the probability of random errors in the data. Methods such as confidence intervals, t-tests, F-tests, chi-squared tests and ANOVA, as well as the theory of linear regression, are based on the assumption of normality. Therefore, assuming a probability model based on the normal distribution offers a number of advantages.
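The central limit theorem mentioned above can be illustrated with a small simulation. The sketch below draws repeated samples from a strongly skewed population and looks at the distribution of their means. The Exponential(1) population, the sample size of 50 and the 2000 repetitions are all assumptions chosen for illustration and have nothing to do with the trip data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def sample_means(n, reps):
    """Means of `reps` samples of size `n` from a skewed Exponential(1) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


means = sample_means(n=50, reps=2000)
centre = statistics.fmean(means)  # expected to be close to the population mean, 1
spread = statistics.stdev(means)  # expected to be close to 1 / sqrt(50), about 0.141
print(round(centre, 3), round(spread, 3))
```

Even though the population is far from symmetric, the sample means cluster around the population mean with a spread that shrinks like one over the square root of the sample size.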
Figure: Frequency Distribution of Total Trips
Simple random sampling (SRS) is the simplest process of sampling from a large population. The advantage of using an SRS is that every member of the population has an equal chance of being selected. The process involves selecting n households from the population of the city. In our case the sample size is 2000. Thus the records identifying the households, i.e., HOUSEID, need to be acquired. A random number is assigned to each HOUSEID, and the HOUSEIDs are sorted on the basis of these random numbers. Next, the HOUSEIDs corresponding to the first or last 2000 random numbers are selected. This process yields a simple random sample of 2000 households.
Stratification is not essential in selecting an SRS. The whole population is considered as one stratum.
The sampling error for an SRS is the difference between the mean of the sample and the mean of the population.
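The selection procedure described above (tag each HOUSEID with a random number, sort, keep the first 2000) can be sketched as follows. The function name, the seed and the population of 50,000 made-up IDs are hypothetical and serve only to make the example runnable.

```python
import random


def simple_random_sample(house_ids, n, seed=None):
    """Draw n households by tagging each ID with a random number and sorting on it."""
    if n > len(house_ids):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    # Step 1: assign a random number to every HOUSEID.
    tagged = [(rng.random(), house_id) for house_id in house_ids]
    # Step 2: sort the records by their random numbers.
    tagged.sort(key=lambda pair: pair[0])
    # Step 3: keep the IDs attached to the first n random numbers.
    return [house_id for _, house_id in tagged[:n]]


# Hypothetical population of 50,000 household IDs.
population = [f"H{i:05d}" for i in range(50_000)]
sample = simple_random_sample(population, n=2000, seed=42)
print(len(sample), len(set(sample)))
```

In practice `random.sample(population, 2000)` draws the same kind of sample in one call; the longer version is shown because it mirrors the steps in the text.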
The sampling error for the sample
Since sample size = 2000, Hence sampling error =
The confidence level depends on the z-value. The higher the z-value, the more confident we can be that the sample represents the population.
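The formula for the sampling error is not shown in the text. One common formulation, assumed here and not necessarily the one the original used, takes the standard error of the sample mean as the square root of the variance divided by n, and the margin of error as z times that standard error. The sketch below applies it with the variance of 11.38 assumed earlier for total trips and a z-value of 1.96 (95% confidence), which is also an assumption.

```python
import math


def margin_of_error(variance, n, z=1.96):
    """Return (standard error, margin of error) for a sample mean under SRS."""
    standard_error = math.sqrt(variance / n)
    return standard_error, z * standard_error


# Assumed inputs: variance 11.38 (normal model for total trips), n = 2000, z = 1.96.
se, moe = margin_of_error(variance=11.38, n=2000)
print(round(se, 4), round(moe, 4))
```

This ignores any finite population correction, which would matter only if 2000 households were a sizeable fraction of the city.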
Stratified sampling is useful when there are variations between subpopulations. When the household populations of the city vary, the population needs to be grouped into strata. The strata are mutually exclusive, i.e., a household is assigned to only one stratum. A sample is drawn from each stratum. In proportional allocation, the final sample is based on the weights of the strata. Thus, if there are two strata, x and y, and the total population is n, then the sample drawn from each stratum is in proportion to that stratum's share of n.
In this process, both the size of the sample and the variability of the strata are taken into account.
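Proportional allocation can be sketched as below: each stratum receives a share of the total sample equal to its share of the population. The stratum sizes are hypothetical, since the text gives none, and the largest-remainder rounding is a choice made here so that the allocations always add up to the requested sample size.

```python
def proportional_allocation(strata_sizes, n):
    """Split a total sample of n across strata in proportion to stratum size."""
    population = sum(strata_sizes.values())
    exact = {name: n * size / population for name, size in strata_sizes.items()}
    alloc = {name: int(share) for name, share in exact.items()}  # round down first
    # Hand the leftover units to the strata with the largest fractional parts.
    leftover = n - sum(alloc.values())
    by_fraction = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical strata sizes; the text does not give real ones.
allocation = proportional_allocation({"x": 30_000, "y": 20_000}, n=2000)
print(allocation)
```

With these made-up sizes, stratum x holds 60% of the population and therefore receives 60% of the 2000 sampled households.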
To investigate differences in trip distances between males and females, an independent-samples t-test was carried out.
Statistic | Males | Females
Mean | 10.123 | 7.416
Variance | 1302.801 | 326.931
Observations | 3639 | 4635
Hypothesized Mean Difference | 0 |
df | 5059 |
t Stat | 4.136 |
P(T<=t) one-tail | 0.000 |
t Critical one-tail | 1.645 |
P(T<=t) two-tail | 0.000 |
t Critical two-tail | 1.960 |
The average trip miles of males is 10.123 with a variance of 1302.801.
The average trip miles of females is 7.416 with a variance of 326.931.
The t-test shows a significant difference between the trip miles of males and females (p-value < 0.001) at the α = 0.05 level of significance.
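The test statistic can be reproduced from the summary statistics alone. The reported degrees of freedom (5059) are consistent with the unequal-variance (Welch) form of the two-sample t-test, so that form is assumed in the sketch below. Because the inputs are the rounded means and variances from the table, the result may differ from the reported values in the last digit.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom from summary statistics."""
    v1, v2 = var1 / n1, var2 / n2  # squared standard errors of the two means
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Summary statistics for males and females from the table above.
t_stat, df = welch_t(10.123, 1302.801, 3639, 7.416, 326.931, 4635)
print(round(t_stat, 2), round(df))
```

Both values should land close to the t Stat and df rows of the table, and the statistic is well above the two-tail critical value of 1.960.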
A prediction model for the total trips per household is to be determined from the available data. First, the tentative predictors of the response variable, the total number of trips, are determined. To do so, the correlation of the response, labelled TOTAL in the model, with each candidate variable is examined. The following table gives the correlation coefficient of the response with the predictors. The variables include HBWORK or home to work trips, HBSCHOOL or home to school trips, HBSHOP or home to shop trips, NHBASED or non-home-based trips, and HBOTHER or home to other reason related trips. AGE5_17 is the number of individuals in the age group 5 to 17, AGE18_24 the number in the age group 18 to 24, AGE25_34 the number in the age group 25 to 34, AGE35_49 the number in the age group 35 to 49, AGE50_64 the number in the age group 50 to 64, and AGE65PLS the number of individuals aged 65 and above. The variable WTHHFIN is the cumulated weight of all individuals in the household. WORKERS is the number of workers in the household, DRIVERS is the number of drivers, HHSIZE is the number of people in the household, POPDEN is the population density and HHINCCAT is the category of household income, ranging from 1 for less than $20,000 to 5 for greater than $80,000.
Variable | Correlation
HBWORK | 0.351006
HBSCHOOL | 0.490538
HBSHOP | 0.423805
HBOTHER | 0.781459
NHBASED | 0.780442
HHSIZE | 0.671719
VEHICLES | 0.401147
WORKERS | 0.490245
DRIVERS | 0.472093
POPDEN | -0.04012
HHINCCAT | 0.27701
AGE0_4 | 0.051391
AGE5_17 | 0.634981
AGE18_24 | 0.13261
AGE25_34 | 0.082308
AGE35_49 | 0.406922
AGE50_64 | -0.02761
AGE65PLS | -0.21717
WTHHFIN | -0.10913
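The screening applied to these correlations in the analysis (keep variables whose correlation with TOTAL exceeds 0.3 in magnitude, then drop the trip-purpose counts because they add up to TOTAL) can be sketched as follows. The correlation values are copied from the table; the variable and constant names in the code are illustrative.

```python
# Correlations with TOTAL, copied from the table above.
correlations = {
    "HBWORK": 0.351006, "HBSCHOOL": 0.490538, "HBSHOP": 0.423805,
    "HBOTHER": 0.781459, "NHBASED": 0.780442, "HHSIZE": 0.671719,
    "VEHICLES": 0.401147, "WORKERS": 0.490245, "DRIVERS": 0.472093,
    "POPDEN": -0.04012, "HHINCCAT": 0.27701, "AGE0_4": 0.051391,
    "AGE5_17": 0.634981, "AGE18_24": 0.13261, "AGE25_34": 0.082308,
    "AGE35_49": 0.406922, "AGE50_64": -0.02761, "AGE65PLS": -0.21717,
    "WTHHFIN": -0.10913,
}
THRESHOLD = 0.3  # cut-off on the magnitude of the correlation

# The trip-purpose counts sum to TOTAL, so they are excluded as predictors.
COMPONENTS_OF_TOTAL = {"HBWORK", "HBSCHOOL", "HBSHOP", "HBOTHER", "NHBASED"}

screened = [name for name, r in correlations.items() if abs(r) > THRESHOLD]
candidates = [name for name in screened if name not in COMPONENTS_OF_TOTAL]
print(candidates)
```

The remaining names are the household-level variables that go forward to the regression.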
Figure: Time Interval or Length of Time Taken for Each Trip
The variables HBWORK, HBSCHOOL, HBSHOP, HBOTHER, NHBASED, HHSIZE, VEHICLES, WORKERS, DRIVERS, AGE5_17 and AGE35_49 were all found to have correlations with magnitude greater than 0.3 and were thus considered as tentative predictors. The variables HBWORK, HBSCHOOL, HBSHOP, HBOTHER and NHBASED were, however, discarded since they sum to TOTAL and are therefore dependent on it, which violates the requirement for predictors to be independent. Since the variable TOTAL was taken to be normal, as explained in task 1, the condition of normality is deemed to hold. The summary of the regression of TOTAL on HHSIZE, VEHICLES, WORKERS, DRIVERS, AGE5_17 and AGE35_49 is as follows:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.41542    0.57747   2.451  0.01449 *
HHSIZE       0.78521    0.29885   2.627  0.00879 **
VEHICLES     0.72177    0.31599   2.284  0.02266 *
WORKERS      1.38897    0.49276   2.819  0.00496 **
DRIVERS      1.29210    0.32660   3.956  8.4e-05 ***
AGE5_17      3.70064    0.38252   9.674  < 2e-16 ***
AGE35_49    -0.07437    0.31677  -0.235  0.81446
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.385 on 688 degrees of freedom
Multiple R-squared: 0.5503, Adjusted R-squared: 0.5464
F-statistic: 140.3 on 6 and 688 DF, p-value: < 2.2e-16
The variables HHSIZE, VEHICLES, WORKERS, DRIVERS, AGE5_17 were found to be significant predictors. The final model where all predictors were found to be significant at 0.05 level of significance, was then found to be:
TOTAL = 1.4121 + 0.7886 HHSIZE + 0.7228 VEHICLES + 1.3792 WORKERS + 1.2759 DRIVERS + 3.6735 AGE5_17
The adjusted R-squared was found to be 0.547 which is a reasonable degree of variation in the response being explained by the five predictors in the model. The following table gives the summary of the above model:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.4121     0.5769   2.448  0.01462 *
HHSIZE        0.7886     0.2983   2.644  0.00839 **
VEHICLES      0.7228     0.3157   2.289  0.02237 *
WORKERS       1.3792     0.4907   2.811  0.00508 **
DRIVERS       1.2759     0.3190   3.999 7.04e-05 ***
AGE5_17       3.6735     0.3644  10.082  < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.381 on 689 degrees of freedom
Multiple R-squared: 0.5503, Adjusted R-squared: 0.547
F-statistic: 168.6 on 5 and 689 DF, p-value: < 2.2e-16
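To show how the fitted equation is used, the sketch below evaluates the final model for one household. The coefficients are copied from the summary above; the function name and the example household are hypothetical.

```python
# Coefficients of the final model, copied from the summary above.
INTERCEPT = 1.4121
COEFFICIENTS = {
    "HHSIZE": 0.7886,
    "VEHICLES": 0.7228,
    "WORKERS": 1.3792,
    "DRIVERS": 1.2759,
    "AGE5_17": 3.6735,
}


def predict_total_trips(household):
    """Predicted TOTAL for a household given as a dict of predictor values."""
    return INTERCEPT + sum(coef * household[name] for name, coef in COEFFICIENTS.items())


# Hypothetical household: four people, two vehicles, two workers,
# two drivers and two children aged 5 to 17.
example = {"HHSIZE": 4, "VEHICLES": 2, "WORKERS": 2, "DRIVERS": 2, "AGE5_17": 2}
print(round(predict_total_trips(example), 2))
```

The residual standard error of about 5.4 trips indicates that individual predictions of this kind carry considerable uncertainty.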
Next, the kilometres travelled by an individual by vehicle are to be predicted, and the prediction model is determined using the data on trips. The file records trip distance in miles as TRPMILES; this is converted to kilometres by multiplying by the factor 1.6304 used in this analysis (the exact conversion factor is 1.609344 km per mile). The problem targets vehicular, or car, trips. The mode of travel is recorded in the variable MODE, which has 7 levels, two of which, “POV-driver” (category level 1) and “POV-passenger” (category level 2), refer to transport by car. A new dummy variable car is thus defined, which takes the value 1 if MODE is 1 or 2 and 0 otherwise. For the data with car equal to 1, the relationship between the kilometres travelled and the other independent variables, except MODE and car, is then examined. The distance travelled is assumed to follow a normal distribution, hence a linear model is fitted and the relevance of each independent variable in explaining the distance travelled is considered. The following table gives the summary of the regression.
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.212e+00  2.920e+00  -2.128  0.03342 *
R_AGE       -7.972e-02  1.974e-02  -4.039 5.49e-05 ***
R_SEX        6.222e-01  5.838e-01   1.066  0.28659
WORKER      -1.039e+00  7.846e-01  -1.325  0.18534
DRIVER      -4.043e+00  1.393e+00  -2.903  0.00371 **
HHSIZE      -1.473e-01  2.514e-01  -0.586  0.55785
VEHICLES     1.652e-01  4.272e-01   0.387  0.69891
STRTTIME    -1.130e-03  6.062e-04  -1.865  0.06232 .
TIME         1.334e+00  1.278e-02 104.388  < 2e-16 ***
WTTRDFIN     1.695e-06  9.002e-06   0.188  0.85063
DRIVERS      2.847e-01  6.390e-01   0.446  0.65592
TRAVDAY      4.669e-01  1.986e-01   2.351  0.01877 *
WORKERS     -4.958e-02  5.284e-01  -0.094  0.92525
VTR_FLG      1.017e+00  9.893e-01   1.028  0.30403
HHINCCAT     6.809e-02  2.490e-01   0.273  0.78451
POPDEN       2.579e-01  3.625e-01   0.711  0.47688
PURPOSE      8.861e-01  1.996e-01   4.440 9.27e-06 ***
BUS_DIST     5.704e-02  7.141e-02   0.799  0.42450
HOMEOWN     -1.031e-01  7.254e-02  -1.421  0.15530
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.89 on 3670 degrees of freedom (3930 observations deleted due to missingness)
Multiple R-squared: 0.7526, Adjusted R-squared: 0.7513
F-statistic: 620.1 on 18 and 3670 DF, p-value: < 2.2e-16
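The data preparation that precedes this regression (converting TRPMILES to kilometres, defining the car dummy from MODE and keeping only car trips) can be sketched as below. The column name TRPKM, the function name and the three trip records are hypothetical; the conversion factor is the one quoted in the text.

```python
MILES_TO_KM = 1.6304  # factor quoted in the text; the exact value is 1.609344
CAR_MODES = {1, 2}    # 1 = POV-driver, 2 = POV-passenger


def prepare_trip(trip):
    """Return a copy of a trip record with the `car` dummy and the distance in kilometres."""
    prepared = dict(trip)
    prepared["car"] = 1 if trip["MODE"] in CAR_MODES else 0
    prepared["TRPKM"] = trip["TRPMILES"] * MILES_TO_KM  # TRPKM is a hypothetical column name
    return prepared


# Hypothetical trip records for illustration.
trips = [
    {"MODE": 1, "TRPMILES": 10.0},
    {"MODE": 2, "TRPMILES": 2.5},
    {"MODE": 5, "TRPMILES": 1.0},
]
car_trips = [t for t in map(prepare_trip, trips) if t["car"] == 1]
print(len(car_trips), car_trips[0]["TRPKM"])
```

Only the records with car equal to 1 would then be passed to the linear model.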
Thus the variables PURPOSE, TRAVDAY, TIME, DRIVER and R_AGE were found to be significant at the 5% level of significance. The variable PURPOSE is the reason for the travel, coded 1 for home to work, 2 for home to school, 3 for home to shop, 4 for home to other and 5 for non-home-based. TRAVDAY stands for the day of the week of travel, for example 2 for Monday and 6 for Friday. DRIVER indicates whether the individual making the trip is the driver himself or herself, with 1 being yes and 2 being no. R_AGE stands for the age of the driver and TIME is the length of time taken for the trip. The model was then run using these variables.
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.685541   1.105645  -2.429   0.0152 *
PURPOSE      0.645157   0.116393   5.543 3.08e-08 ***
TRAVDAY     -0.008565   0.120852  -0.071   0.9435
TIME         1.283095   0.007959 161.203  < 2e-16 ***
DRIVER      -3.153209   0.552216  -5.710 1.17e-08 ***
R_AGE       -0.078782   0.010640  -7.404 1.46e-13 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.34 on 7372 degrees of freedom (241 observations deleted due to missingness)
Multiple R-squared: 0.7801, Adjusted R-squared: 0.7799
F-statistic: 5230 on 5 and 7372 DF, p-value: < 2.2e-16
The variables PURPOSE, TIME, DRIVER and R_AGE were found to be significant at the 5% level of significance, and thus the variable TRAVDAY was discarded from the model. Finally, the model was obtained as:
TRAVEL DISTANCE = -2.719885 + 0.645229 PURPOSE + 1.283082 TIME - 3.153540 DRIVER - 0.078772 R_AGE
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.719885   0.993723  -2.737  0.00621 **
PURPOSE      0.645229   0.116381   5.544 3.06e-08 ***
TIME         1.283082   0.007957 161.252  < 2e-16 ***
DRIVER      -3.153540   0.552159  -5.711 1.16e-08 ***
R_AGE       -0.078772   0.010638  -7.405 1.46e-13 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.34 on 7373 degrees of freedom (241 observations deleted due to missingness)
Multiple R-squared: 0.7801, Adjusted R-squared: 0.78
F-statistic: 6538 on 4 and 7373 DF, p-value: < 2.2e-16
The summary table above shows the specifications of the final model. The model was found to have adjusted R-squared value of 0.78, which is a good measure of variation being explained by the predictors chosen for the model with all predictors having significant effect on the response travel distance in kilometres.
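As an illustration of how the final travel-distance model is applied, the sketch below evaluates it for a single trip using the coefficients from the summary table, including the negative signs on DRIVER and R_AGE. The function name and the example trip are hypothetical, and PURPOSE is treated as a numeric code because the model has a single coefficient for it.

```python
def predict_distance_km(purpose, time, driver, r_age):
    """Predicted travel distance in kilometres from the final fitted model."""
    return (
        -2.719885
        + 0.645229 * purpose
        + 1.283082 * time
        - 3.153540 * driver
        - 0.078772 * r_age
    )


# Hypothetical trip: home to work (PURPOSE = 1), TIME = 20,
# the individual is the driver (DRIVER = 1), aged 40.
print(round(predict_distance_km(purpose=1, time=20, driver=1, r_age=40), 2))
```

TIME dominates the prediction: each additional unit of trip time adds about 1.28 km, while the other terms shift the result by a few kilometres at most.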