Analysis Of Transport Data: Normal Distribution And Regression Model

Total Trips Made For Each Household

Two datasets are under consideration. One consists of transport data for 858 households, and the other contains data on the trips made by the individuals of each household. The variable total trips made for each household is the total count of trips by all individuals in a household. The following table gives the frequency distribution of the total trip count. The mean was found to be 11.38.

TOTAL TRIPS	FREQUENCY
0-9	428
10-19	323
20-29	77
30-39	25
40-49	4
50-59	1
Grand Total	858

Table 1: Frequency distribution of total trips for a household

Count data are commonly modelled with a Poisson distribution, and the frequency distribution suggests this as well: the histogram shows a high frequency at the lower values, which falls rapidly as the count increases. It is therefore reasonable to assume that the variable total trips per household follows a Poisson distribution. The mean of the variable was found to be 11.38. Since this is greater than 10, it can be considered reasonably large for a normal approximation to the Poisson distribution. Thus the total number of trips by a household is assumed to follow a normal distribution with mean 11.38 and variance 11.38, the variance being equal to the mean under the Poisson model.
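
The quality of this approximation can be checked numerically. The following is a minimal Python sketch, not the code from the APPENDIX, that compares cumulative Poisson probabilities with their normal counterparts for the reported mean of 11.38. The function names and the use of a continuity correction are illustrative assumptions.

```python
import math
from statistics import NormalDist

MEAN_TRIPS = 11.38  # mean reported in the text; a Poisson model sets variance = mean


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability P(X = k), computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_cdf(k: int, lam: float) -> float:
    """Poisson probability P(X <= k)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


# Normal approximation with the same mean and variance as the Poisson model.
normal = NormalDist(mu=MEAN_TRIPS, sigma=math.sqrt(MEAN_TRIPS))

for k in (5, 10, 15, 20):
    exact = poisson_cdf(k, MEAN_TRIPS)
    approx = normal.cdf(k + 0.5)  # continuity correction (an illustrative choice)
    print(f"P(X <= {k:2d}): Poisson = {exact:.4f}, normal approx = {approx:.4f}")
```

The two columns printed by the loop should agree closely when the mean is large enough for the approximation to be usable.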

Again, the following table gives the frequency distribution of the time interval or length of time taken for each trip by an individual. The mean was found to be 16.379.

Time	FREQUENCY
1-100	8235
101-200	56
201-300	9
301-400	4
401-500	2
501-600	1
601-700	1
701-800	2
Grand Total	8310

Table 2: Frequency distribution of time taken per trip

The frequency distribution again suggests a Poisson distribution: the histogram shows a high frequency at the lower values, which falls rapidly as the value increases. It is therefore reasonable to assume that the variable time per trip follows a Poisson distribution. The mean of the variable was found to be 16.379. Since this is greater than 10, it can be considered reasonably large for a normal approximation to the Poisson distribution. Thus the time taken per trip is assumed to follow a normal distribution with mean 16.379 and variance 16.379. The R code used for the computations is provided in the APPENDIX.

With the normal distribution as the assumed underlying probability distribution, the analysis process is simplified. As stated above, the normal distribution is a good approximation for large samples. It is a symmetric, bell-shaped distribution and has a number of convenient mathematical properties that make calculations easy. One of its most important properties is connected with the central limit theorem, which relates the shape of the population distribution to the shape of the sampling distribution of the mean: the sampling distribution of the mean approaches a normal distribution as the sample size increases. The normal (Gaussian) distribution also provides a basis for measuring the probability of random errors in the data. Confidence intervals, t-tests, F-tests, chi-squared tests and ANOVA, as well as the theory of linear regression, are based on the assumption of normality. Assuming a normal probability model therefore offers a number of advantages.
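
The behaviour of the sampling distribution of the mean can be illustrated by simulation. The sketch below is a hypothetical Python example: it draws repeated samples from a right-skewed exponential population whose mean is set to the reported trip-time mean of 16.379, and compares the spread of the sample means with the value the theory predicts. The exponential population, the sample size and the number of replications are assumptions chosen for illustration and are not taken from the data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

POP_MEAN = 16.379     # mean trip time reported in the text
SAMPLE_SIZE = 50      # hypothetical sample size
REPLICATIONS = 2000   # hypothetical number of repeated samples


def sample_mean(n: int) -> float:
    """Mean of n draws from a right-skewed exponential population."""
    return statistics.fmean(random.expovariate(1 / POP_MEAN) for _ in range(n))


means = [sample_mean(SAMPLE_SIZE) for _ in range(REPLICATIONS)]

centre = statistics.fmean(means)
spread = statistics.stdev(means)
# For an exponential population the standard deviation equals the mean,
# so the predicted spread of the sample mean is POP_MEAN / sqrt(n).
predicted_spread = POP_MEAN / SAMPLE_SIZE ** 0.5

print(f"mean of sample means = {centre:.2f}")
print(f"sd of sample means   = {spread:.2f} (predicted {predicted_spread:.2f})")
```

Although the population is strongly skewed, the sample means cluster around the population mean with a spread close to the predicted value.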

Frequency Distribution of Total Trips

Simple random sampling (SRS) is the simplest process of sampling from a large population. Its advantage is that every member of the population has an equal chance of being selected. The process involves selecting n units from the population of the city; in our case the sample size is 2000. First, the records identifying the households, i.e., HOUSEID, need to be acquired. A random number is then assigned to each HOUSEID, and the HOUSEIDs are sorted on the basis of these random numbers. Finally, the HOUSEIDs corresponding to the first or last 2000 random numbers are selected. This gives a simple random sample of 2000 households.
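
The selection procedure described above can be sketched in Python as follows. The population of household IDs, its size and the function name are hypothetical; only the sample size of 2000 and the assign, sort and select steps come from the text.

```python
import random


def simple_random_sample(house_ids, n, seed=None):
    """Draw n IDs by attaching a random number to each ID and sorting on it."""
    rng = random.Random(seed)
    keyed = [(rng.random(), house_id) for house_id in house_ids]
    keyed.sort()  # order the households by their random numbers
    return [house_id for _, house_id in keyed[:n]]  # keep the first n


# Hypothetical population of 50,000 household IDs.
population = [f"H{i:05d}" for i in range(50_000)]
sample = simple_random_sample(population, 2000, seed=42)
print(len(sample), len(set(sample)))
```

Because each household receives an independent random number, every household has the same chance of landing among the first 2000 after sorting.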

Stratification is not required when selecting an SRS. The whole population is treated as a single stratum.

The sampling error for SRS is the difference between the mean of the sample and the mean of the population.

The sampling error for the sample

Since sample size = 2000, Hence sampling error =

The confidence level depends on the z-value. The higher the z-value, the more confident we can be that the sample represents the population.
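
The link between the z-value, the sample size and the sampling error can be illustrated with the usual margin-of-error expression for a mean, z * sigma / sqrt(n). The Python sketch below assumes that this standard expression is the one intended; the population standard deviation used is a hypothetical value, since none is stated in the text.

```python
import math
from statistics import NormalDist


def margin_of_error(sigma: float, n: int, confidence: float = 0.95) -> float:
    """Half-width z * sigma / sqrt(n) of a confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided z-value
    return z * sigma / math.sqrt(n)


SIGMA = 8.0   # hypothetical population standard deviation (not from the source)
N = 2000      # sample size stated in the text

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: +/- {margin_of_error(SIGMA, N, level):.3f}")
```

A higher confidence level gives a larger z-value and hence a wider margin for the same sample size.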

Stratified sampling is useful when there are variations between subpopulations. If there are variations between the household populations of the city, the population needs to be grouped into strata. The strata are mutually exclusive, i.e., a household is assigned to only one stratum, and a sample is drawn from each stratum. In proportional allocation, the final sample is based on the weight of each stratum. Thus, if there are two strata “x” and “y” and the total population is “n”, the strata contribute to the sample in the proportions x/n and y/n.

This process takes into account the size of the sample and the variability of the strata.
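
Proportional allocation can be sketched in Python as below. The stratum sizes and the function name are hypothetical; only the overall sample size of 2000 is taken from the text.

```python
def proportional_allocation(strata_sizes: dict, sample_size: int) -> dict:
    """Allocate the sample to strata in proportion to their population share."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}


# Hypothetical stratum sizes; only the sample size of 2000 comes from the text.
strata = {"x": 30_000, "y": 20_000}
allocation = proportional_allocation(strata, 2000)
print(allocation)
```

With shares that do not divide evenly, rounding can leave the allocated counts one or two units away from the target sample size, so a final adjustment may be needed in practice.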

To investigate differences in trip distances between males and females, an independent-samples t-test was carried out.

	Males	Females
Mean	10.123	7.416
Variance	1302.801	326.931
Observations	3639	4635
Hypothesized Mean Difference	0
df	5059
t Stat	4.136
P(T<=t) one-tail	0.000
t Critical one-tail	1.645
P(T<=t) two-tail	0.000
t Critical two-tail	1.960

The average trip miles of males is 10.123 with a variance of 1302.801.

The average trip miles of females is 7.416 with a variance of 326.931.

The t-test shows a significant difference between the trip miles of males and females (p-value < 0.001 at the α = 0.05 level of significance).
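
The test statistic can be recomputed from the summary values in the table. The Python sketch below assumes the unequal-variance (Welch) form of the test, a reading that is consistent with the tabulated degrees of freedom but is not stated explicitly in the text; the function name is illustrative.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se1, se2 = var1 / n1, var2 / n2  # squared standard errors of the two means
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Means, variances and group sizes for males and females, copied from the table.
t_stat, df = welch_t(10.123, 1302.801, 3639, 7.416, 326.931, 4635)
print(f"t = {t_stat:.3f}, df = {df:.0f}")
```

The result should land close to the tabulated t statistic and degrees of freedom, with small differences attributable to rounding of the reported means and variances.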

A prediction model for the total trips per household is to be determined from the available data. First, the tentative predictors of the response variable, the total number of trips (labelled “TOTAL” in the model), are identified by examining the correlation of the response with each candidate variable. The following table gives these correlation coefficients. The variables include HBWORK (home to work trips), HBSCHOOL (home to school trips), HBSHOP (home to shop trips), NHBASED (non-home-based trips) and HBOTHER (home to other-reason trips); AGE5_17, AGE18_24, AGE25_34, AGE35_49 and AGE50_64, which are the numbers of individuals in the age groups 5 to 17, 18 to 24, 25 to 34, 35 to 49 and 50 to 64 respectively; and AGE65PLS, the number of individuals aged 65 and above. The variable WTHHFIN is the cumulated weight of all individuals in the household. WORKERS is the number of workers in the household, DRIVERS is the number of drivers, HHSIZE is the number of people in the household, POPDEN is the population density and HHINCCAT is the category of household income, ranging from 1 (less than $20,000) to 5 (greater than $80,000).

Variable	Correlation
HBWORK	0.351006
HBSCHOOL	0.490538
HBSHOP	0.423805
HBOTHER	0.781459
NHBASED	0.780442
HHSIZE	0.671719
VEHICLES	0.401147
WORKERS	0.490245
DRIVERS	0.472093
POPDEN	-0.04012
HHINCCAT	0.27701
AGE0_4	0.051391
AGE5_17	0.634981
AGE18_24	0.13261
AGE25_34	0.082308
AGE35_49	0.406922
AGE50_64	-0.02761
AGE65PLS	-0.21717
WTHHFIN	-0.10913
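
A screening rule can be applied to these correlations, for instance keeping the variables whose correlation magnitude exceeds a cutoff and dropping the trip-purpose counts that add up to TOTAL. The Python sketch below uses the table values with a cutoff of 0.3, the threshold used in this analysis; the variable and constant names in the code are illustrative.

```python
# Correlations of each variable with TOTAL, copied from the table.
correlations = {
    "HBWORK": 0.351006, "HBSCHOOL": 0.490538, "HBSHOP": 0.423805,
    "HBOTHER": 0.781459, "NHBASED": 0.780442, "HHSIZE": 0.671719,
    "VEHICLES": 0.401147, "WORKERS": 0.490245, "DRIVERS": 0.472093,
    "POPDEN": -0.04012, "HHINCCAT": 0.27701, "AGE0_4": 0.051391,
    "AGE5_17": 0.634981, "AGE18_24": 0.13261, "AGE25_34": 0.082308,
    "AGE35_49": 0.406922, "AGE50_64": -0.02761, "AGE65PLS": -0.21717,
    "WTHHFIN": -0.10913,
}

# Trip-purpose counts sum to TOTAL, so they cannot serve as predictors.
TRIP_COMPONENTS = {"HBWORK", "HBSCHOOL", "HBSHOP", "HBOTHER", "NHBASED"}
CUTOFF = 0.3

screened = [name for name, r in correlations.items() if abs(r) > CUTOFF]
predictors = [name for name in screened if name not in TRIP_COMPONENTS]
print(predictors)
```

Using the magnitude of the correlation rather than its sign keeps strongly negative relationships in consideration as well.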

Time Interval or Length of Time Taken for Each Trip

The variables HBWORK, HBSCHOOL, HBSHOP, HBOTHER, NHBASED, HHSIZE, VEHICLES, WORKERS, DRIVERS, AGE5_17 and AGE35_49 all have a correlation with TOTAL of magnitude greater than 0.3 and were thus considered as tentative predictors. The variables HBWORK, HBSCHOOL, HBSHOP, HBOTHER and NHBASED were, however, discarded: they sum to TOTAL and are therefore dependent on it, which violates the requirement that the predictors be independent. Since the variable TOTAL was taken to be normal, as explained in task 1, the condition of normality is deemed to hold. The summary of the regression of TOTAL on HHSIZE, VEHICLES, WORKERS, DRIVERS, AGE5_17 and AGE35_49 is as follows:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.41542 0.57747 2.451 0.01449 *

HHSIZE 0.78521 0.29885 2.627 0.00879 **

VEHICLES 0.72177 0.31599 2.284 0.02266 *

WORKERS 1.38897 0.49276 2.819 0.00496 **

DRIVERS 1.29210 0.32660 3.956 8.4e-05 ***

AGE5_17 3.70064 0.38252 9.674 < 2e-16 ***

AGE35_49 -0.07437 0.31677 -0.235 0.81446

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.385 on 688 degrees of freedom

Multiple R-squared: 0.5503, Adjusted R-squared: 0.5464

F-statistic: 140.3 on 6 and 688 DF, p-value: < 2.2e-16

The variables HHSIZE, VEHICLES, WORKERS, DRIVERS and AGE5_17 were found to be significant predictors, while AGE35_49 was not (p = 0.814) and was dropped. The final model, in which all predictors are significant at the 0.05 level of significance, was then found to be:

TOTAL = 1.4121 + 0.7886 HHSIZE + 0.7228 VEHICLES + 1.3792 WORKERS + 1.2759 DRIVERS + 3.6735 AGE5_17

The adjusted R-squared was found to be 0.547, meaning that about 55% of the variation in the response is explained by the five predictors in the model, which is a reasonable degree. The following table gives the summary of the above model:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.4121 0.5769 2.448 0.01462 *

HHSIZE 0.7886 0.2983 2.644 0.00839 **

VEHICLES 0.7228 0.3157 2.289 0.02237 *

WORKERS 1.3792 0.4907 2.811 0.00508 **

DRIVERS 1.2759 0.3190 3.999 7.04e-05 ***

AGE5_17 3.6735 0.3644 10.082 < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.381 on 689 degrees of freedom

Multiple R-squared: 0.5503, Adjusted R-squared: 0.547

F-statistic: 168.6 on 5 and 689 DF, p-value: < 2.2e-16
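
As an illustration of how the fitted household model would be used, the Python sketch below evaluates the final equation for one household. The coefficients are those reported above; the household itself and the function name are hypothetical.

```python
# Coefficients of the final household model, as reported in the summary table.
INTERCEPT = 1.4121
COEFFICIENTS = {
    "HHSIZE": 0.7886,
    "VEHICLES": 0.7228,
    "WORKERS": 1.3792,
    "DRIVERS": 1.2759,
    "AGE5_17": 3.6735,
}


def predict_total_trips(household: dict) -> float:
    """Predicted TOTAL trips for a household under the fitted linear model."""
    return INTERCEPT + sum(coef * household[name]
                           for name, coef in COEFFICIENTS.items())


# Hypothetical household: four people, two vehicles, two workers,
# two drivers and two children aged 5 to 17.
household = {"HHSIZE": 4, "VEHICLES": 2, "WORKERS": 2, "DRIVERS": 2, "AGE5_17": 2}
predicted = predict_total_trips(household)
print(f"predicted total trips = {predicted:.2f}")
```

The large coefficient on AGE5_17 means that, under this model, each school-age child adds more predicted trips than any other single unit change in the predictors.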

Next, the kilometres travelled by an individual by vehicle are to be predicted, and the prediction model is determined using the data on trips. The file records distance in miles in TRPMILES; this is converted to kilometres by multiplying by the factor 1.6304. The problem targets vehicular, i.e., car, trips. The mode of travel is recorded in the variable MODE, which has 7 levels, two of which refer to transport by car: “POV-driver” (level 1) and “POV-passenger” (level 2). A new dummy variable car is therefore defined, which takes the value 1 if MODE is 1 or 2 and 0 otherwise. For the records with car equal to 1, the relationship between the kilometres travelled and the other independent variables, except MODE and car, is examined. The distance travelled is assumed to follow a normal distribution, hence a linear model is fitted and the relevance of each independent variable in explaining the distance travelled is considered. The following table gives the summary of the regression.

Estimate Std. Error t value Pr(>|t|)

(Intercept) -6.212e+00 2.920e+00 -2.128 0.03342 *

R_AGE -7.972e-02 1.974e-02 -4.039 5.49e-05 ***

R_SEX 6.222e-01 5.838e-01 1.066 0.28659

WORKER -1.039e+00 7.846e-01 -1.325 0.18534

DRIVER -4.043e+00 1.393e+00 -2.903 0.00371 **

HHSIZE -1.473e-01 2.514e-01 -0.586 0.55785

VEHICLES 1.652e-01 4.272e-01 0.387 0.69891

STRTTIME -1.130e-03 6.062e-04 -1.865 0.06232 .

TIME 1.334e+00 1.278e-02 104.388 < 2e-16 ***

WTTRDFIN 1.695e-06 9.002e-06 0.188 0.85063

DRIVERS 2.847e-01 6.390e-01 0.446 0.65592

TRAVDAY 4.669e-01 1.986e-01 2.351 0.01877 *

WORKERS -4.958e-02 5.284e-01 -0.094 0.92525

VTR_FLG 1.017e+00 9.893e-01 1.028 0.30403

HHINCCAT 6.809e-02 2.490e-01 0.273 0.78451

POPDEN 2.579e-01 3.625e-01 0.711 0.47688

PURPOSE 8.861e-01 1.996e-01 4.440 9.27e-06 ***

BUS_DIST 5.704e-02 7.141e-02 0.799 0.42450

HOMEOWN -1.031e-01 7.254e-02 -1.421 0.15530

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.89 on 3670 degrees of freedom

(3930 observations deleted due to missingness)

Multiple R-squared: 0.7526, Adjusted R-squared: 0.7513

F-statistic: 620.1 on 18 and 3670 DF, p-value: < 2.2e-16
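
Before the regression summarised above was fitted, the trip data were prepared by converting miles to kilometres and flagging car trips. A minimal Python sketch of that preparation is given below. The trip records and the column name TRPKM are hypothetical; MODE, TRPMILES, the dummy variable car and the conversion factor are taken from the text.

```python
MILES_TO_KM = 1.6304  # factor quoted in the text (the exact value is 1.609344)
CAR_MODES = {1, 2}    # 1 = POV-driver, 2 = POV-passenger


def prepare_trip(trip: dict) -> dict:
    """Add a kilometre distance and a car dummy to one trip record."""
    prepared = dict(trip)
    prepared["TRPKM"] = trip["TRPMILES"] * MILES_TO_KM  # TRPKM is a hypothetical name
    prepared["car"] = 1 if trip["MODE"] in CAR_MODES else 0
    return prepared


# Hypothetical trip records; only the column names come from the text.
trips = [
    {"MODE": 1, "TRPMILES": 10.0},
    {"MODE": 4, "TRPMILES": 2.5},
    {"MODE": 2, "TRPMILES": 5.0},
]

# Keep only the car trips, which are the records used for the regression.
car_trips = [t for t in map(prepare_trip, trips) if t["car"] == 1]
print(car_trips)
```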

Thus the variables PURPOSE, TRAVDAY, TIME, DRIVER and R_AGE were found to be significant at the 5% level of significance. PURPOSE is the reason for the travel, coded as 1 for home to work, 2 for home to school, 3 for home to shop, 4 for home to other and 5 for non-home-based. TRAVDAY is the day of the week of travel, for example 2 for Monday and 6 for Friday. DRIVER indicates whether the individual making the trip is the driver himself or herself, with 1 being yes and 2 being no; R_AGE is the age of the driver; and TIME is the length of the trip. The model was then run using these variables.

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.685541 1.105645 -2.429 0.0152 *

PURPOSE 0.645157 0.116393 5.543 3.08e-08 ***

TRAVDAY -0.008565 0.120852 -0.071 0.9435

TIME 1.283095 0.007959 161.203 < 2e-16 ***

DRIVER -3.153209 0.552216 -5.710 1.17e-08 ***

R_AGE -0.078782 0.010640 -7.404 1.46e-13 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.34 on 7372 degrees of freedom

(241 observations deleted due to missingness)

Multiple R-squared: 0.7801, Adjusted R-squared: 0.7799

F-statistic: 5230 on 5 and 7372 DF, p-value: < 2.2e-16

The variables PURPOSE, TIME, DRIVER and R_AGE were found to be significant at the 5% level of significance, whereas TRAVDAY was not (p = 0.9435) and was therefore discarded from the model. The final model was obtained as:

TRAVEL DISTANCE = -2.7199 + 0.6452 PURPOSE + 1.2831 TIME - 3.1535 DRIVER - 0.0788 R_AGE

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.719885 0.993723 -2.737 0.00621 **

PURPOSE 0.645229 0.116381 5.544 3.06e-08 ***

TIME 1.283082 0.007957 161.252 < 2e-16 ***

DRIVER -3.153540 0.552159 -5.711 1.16e-08 ***

R_AGE -0.078772 0.010638 -7.405 1.46e-13 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.34 on 7373 degrees of freedom

(241 observations deleted due to missingness)

Multiple R-squared: 0.7801, Adjusted R-squared: 0.78

F-statistic: 6538 on 4 and 7373 DF, p-value: < 2.2e-16

The summary table above shows the specification of the final model. The model has an adjusted R-squared of 0.78, so about 78% of the variation in the response, travel distance in kilometres, is explained by the chosen predictors, all of which have a significant effect.
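
The reported adjusted R-squared values can be cross-checked from the multiple R-squared, the number of predictors and the residual degrees of freedom. The Python sketch below does this for both final models; the function name is illustrative, and the number of observations is inferred from the residual degrees of freedom in the summaries.

```python
def adjusted_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# n_obs = residual degrees of freedom + number of predictors + 1 (intercept).
household_model = adjusted_r_squared(0.5503, 689 + 5 + 1, 5)
travel_model = adjusted_r_squared(0.7801, 7373 + 4 + 1, 4)

print(f"household model: {household_model:.3f}")
print(f"travel model:    {travel_model:.2f}")
```

With samples this large relative to the number of predictors, the adjustment changes the multiple R-squared only slightly.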
