The Challenges of Bicycle Sharing Systems
A bike sharing system is an evolution of the traditional bike rental system, in which users register for membership and can then rent and return bikes (Kiefer & Behrendt 2016, pp. 79-88).
This process has now been automated in modern bike sharing systems, which are becoming increasingly useful in urban centres throughout the world (Kumar et al. 2016, p. 21597), since bikes provide cheap and convenient transport over short distances. Nonetheless, managing a bike sharing system presents problems.
The major problem is rebalancing of the bikes (Rivers & Koedinger 2017, pp. 37-64). An imbalance arises in the system when users generate an asymmetric demand pattern, so for the system to work effectively, the bikes at each station must be rebalanced. Machine learning algorithms are useful for solving the routing problems that arise, especially during rush hour (Jian et al. 2016, pp. 602-613).
For continuous operation of a bike sharing system, dynamic clustering techniques should be implemented to predict patterns of excess demand for bikes (Carpenter et al. 2017).
Handling the Demand Imbalance Problem
For bike rebalancing to be effective, the inventory target levels must be accurately predicted. In this project, three regression models have been implemented on a bike sharing dataset from Kaggle, as provided in the assignment dataset (bike sharing dataset) (Orfanakis & Papadakis 2016). The algorithms are as follows (a comparison sketch is given after the list):
- Decision tree algorithm
- Gradient boosting algorithm
- Linear regression algorithm
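As a preview of how the three models can be compared on a common footing, the sketch below scores each model by 10-fold cross-validated RMSE. This is an illustration only: the feature columns chosen here are an assumption, using the Kaggle-format CSV that appears later in this report.

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Assumed setup: the Kaggle-format bike sharing CSV used in the modeling sections below
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)
bikes['hour'] = bikes.index.hour
X = bikes[['hour', 'workingday', 'temp', 'humidity', 'windspeed']]  # illustrative feature choice
y = bikes['count']

models = {
    'decision tree': DecisionTreeRegressor(max_depth=7, random_state=1),
    'gradient boosting': GradientBoostingRegressor(random_state=1),
    'linear regression': LinearRegression(),
}
for name, model in models.items():
    mse = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
    print(name, np.mean(np.sqrt(-mse)))  # cross-validated RMSE per model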
Dataset Description
The dataset was retrieved from the UCI Machine Learning Repository at the following URL: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset . It has been enriched with seasonal and weather-related information; this enrichment was done at the University of Porto. The dataset contains hourly data and daily data files with column names as headers. The headers are as shown below in the output from the coding section of this assignment:
import pandas as pd

data_path = 'C:/Users/ROSANA/Desktop/bike/hour.csv'
rides = pd.read_csv(data_path)
rides.head()
|   | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---------|--------|--------|----|------|----|---------|---------|------------|------------|------|--------|------|-----------|--------|------------|-----|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
# One-hot encode the categorical fields
dummy_fields = ['season', 'weathersit', 'mnth', 'hr', 'weekday']
for field in dummy_fields:
    dummies = pd.get_dummies(rides[field], prefix=field)
    rides = pd.concat([rides, dummies], axis=1)

# Drop the original categorical columns and other unused fields
fields_to_drop = ['instant', 'dteday', 'season', 'weathersit', 'mnth', 'hr', 'weekday', 'atemp', 'workingday']
data = rides.drop(fields_to_drop, axis=1)
data.head()
|   | yr | holiday | temp | hum | windspeed | casual | registered | cnt | season_1 | season_2 | … | hr_21 | hr_22 | hr_23 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 | weekday_5 | weekday_6 |
|---|----|---------|------|------|-----------|--------|------------|-----|----------|----------|---|-------|-------|-------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | 0 | 0 | 0.24 | 0.81 | 0.0 | 3 | 13 | 16 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0.22 | 0.80 | 0.0 | 8 | 32 | 40 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0.22 | 0.80 | 0.0 | 5 | 27 | 32 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0.24 | 0.75 | 0.0 | 3 | 10 | 13 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0.24 | 0.75 | 0.0 | 0 | 1 | 1 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

5 rows × 59 columns
Scaling target variables
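The exact scaling code is not reproduced in this report. A minimal sketch, assuming each continuous column is standardized to zero mean and unit variance (which is consistent with the scaled output shown below), would be:

# Standardize the continuous columns; keep each column's mean and standard
# deviation so predictions can later be converted back to rental counts
quant_features = ['casual', 'registered', 'cnt', 'temp', 'hum', 'windspeed']
scaled_features = {}
for each in quant_features:
    mean, std = data[each].mean(), data[each].std()
    scaled_features[each] = [mean, std]
    data.loc[:, each] = (data[each] - mean) / std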
After the target variables were scaled, the following was the output:
|   | yr | holiday | temp | hum | windspeed | casual | registered | cnt | season_1 | season_2 | … | hr_21 | hr_22 | hr_23 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 | weekday_5 | weekday_6 |
|---|----|---------|-----------|----------|-----------|-----------|------------|-----------|----------|----------|---|-------|-------|-------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | 0 | 0 | -1.334609 | 0.947345 | -1.553844 | -0.662736 | -0.930162 | -0.956312 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | -1.438475 | 0.895513 | -1.553844 | -0.561326 | -0.804632 | -0.823998 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | -1.438475 | 0.895513 | -1.553844 | -0.622172 | -0.837666 | -0.868103 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | -1.334609 | 0.636351 | -1.553844 | -0.662736 | -0.949983 | -0.972851 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | -1.334609 | 0.636351 | -1.553844 | -0.723582 | -1.009445 | -1.039008 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Since this assignment involves plotting the given dataset, a Python notebook has been used.
The dataset contains a total of 17,379 records at an hourly granularity.
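This count can be confirmed with a quick check (a small sketch, using the rides frame loaded earlier):

print(len(rides))  # 17379 hourly records in the UCI file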
Building the Regression Models
PART 1
- Decision Trees
The following code snippets implement the decision tree algorithm:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# read the data and set "datetime" as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)

# rename the "count" column and derive an "hour" feature from the index
bikes.rename(columns={'count': 'total'}, inplace=True)
bikes['hour'] = bikes.index.hour
bikes.head()
bikes.tail()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | total | hour |
|----------|--------|---------|------------|---------|-------|--------|----------|-----------|--------|------------|-------|------|
| 2012-12-19 19:00:00 | 4 | 0 | 1 | 1 | 15.58 | 19.695 | 50 | 26.0027 | 7 | 329 | 336 | 19 |
| 2012-12-19 20:00:00 | 4 | 0 | 1 | 1 | 14.76 | 17.425 | 57 | 15.0013 | 10 | 231 | 241 | 20 |
| 2012-12-19 21:00:00 | 4 | 0 | 1 | 1 | 13.94 | 15.910 | 61 | 15.0013 | 4 | 164 | 168 | 21 |
| 2012-12-19 22:00:00 | 4 | 0 | 1 | 1 | 13.94 | 17.425 | 61 | 6.0032 | 12 | 117 | 129 | 22 |
| 2012-12-19 23:00:00 | 4 | 0 | 1 | 1 | 13.12 | 16.665 | 66 | 8.9981 | 4 | 84 | 88 | 23 |
# 10-fold cross-validation of a depth-7 decision tree; X and y are the feature
# matrix and target (their construction is shown in the linear regression
# section below)
treereg = DecisionTreeRegressor(max_depth=7, random_state=1)
scores = cross_val_score(treereg, X, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

OUTPUT: 107.64196789476493

# Refit a shallower tree so that it can be inspected
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)

OUTPUT: DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1, splitter='best')
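The export_graphviz import in the decision tree snippet suggests the fitted tree was also visualized; a minimal sketch of that step (the output file name here is an assumption):

# Write the fitted depth-3 tree to a Graphviz .dot file for inspection
export_graphviz(treereg, out_file='tree_bikeshare.dot',
                feature_names=list(X.columns), filled=True)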
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the training split; train_x and train_y are the
# training features and target from a train/test split (see the sketch below)
rfr = RandomForestRegressor().fit(train_x, train_y)
prediction_rfr = rfr.predict(train_x)
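The gradient boosting regressor named in the algorithm list is imported above but never fitted; a minimal sketch of that step (the hyperparameters shown are illustrative assumptions, not tuned values):

gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=1)
gbr.fit(train_x, train_y)
prediction_gbr = gbr.predict(train_x)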
data_path = 'C:/Users/ROSANA/Desktop/bike/hour.csv'
# Read the training data (the output below matches the Kaggle bike sharing format)
train_data = pd.read_csv(data_path)
train_data.head(3)
|   | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|----------|--------|---------|------------|---------|------|--------|----------|-----------|--------|------------|-------|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
import matplotlib.pyplot as plt

# Compare training-set predictions with the actual counts
prediction_rfr = rfr.predict(train_x)
plt.figure(figsize=(5, 5))
plt.scatter(prediction_rfr, train_y)
plt.plot([0, 1000], [0, 1000], color='red')  # identity line: perfect predictions
plt.xlim(-100, 1000)
plt.ylim(-100, 1000)
plt.xlabel('prediction')
plt.ylabel('train_y')
plt.title('Random Forest Regressor Model')
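In this plot, the red identity line marks perfect predictions: points above it are hours where actual demand exceeded the prediction, and points below it are over-predictions. Because the model is evaluated on its own training data here, a tight fit around the line should be read as an upper bound on performance rather than as evidence of generalization.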
import pandas as pd
from datetime import date

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)

# Number of whole months elapsed since the start of the data (January 2011)
def calculate_period(timestamp):
    initial_date = date(2011, 1, 1)
    current_date = timestamp.date()
    return (current_date.year - initial_date.year) * 12 + (current_date.month - initial_date.month)

possible_features = [
    'season', 'holiday', 'workingday', 'weather',
    'temp', 'atemp', 'windspeed', 'month',
    'hour', 'year', 'week_day']
target = 'count'
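The feature list above refers to derived columns ('month', 'hour', 'year', 'week_day') that are not present in the raw CSV. A minimal sketch of how they could be derived from the datetime index (an assumption; the original derivation code is not shown):

# Derive calendar features from the datetime index (assumed step)
bikes['month'] = bikes.index.month
bikes['hour'] = bikes.index.hour
bikes['year'] = bikes.index.year
bikes['week_day'] = bikes.index.weekday
# Months elapsed since January 2011, using the helper defined above
bikes['period'] = [calculate_period(ts) for ts in bikes.index]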
Building a linear regression model
# Start with temperature as the only feature
feature_cols = ['temp']
X = bikes[feature_cols]
y = bikes.total

# Plot the mean rental count for each hour of the day
bikes.groupby('hour').total.mean().plot()

# Use the hour and working-day indicators as features instead
feature_cols = ['hour', 'workingday']
X = bikes[feature_cols]
y = bikes.total

linreg = LinearRegression()
linreg.fit(X, y)
linreg.coef_
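Note that encoding 'hour' as a single numeric feature forces the model to assume that rentals change linearly over the course of the day, while the hourly averages plotted above are strongly non-linear, with morning and evening commute peaks; this limits how far the fitted coefficients can be trusted.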
Use 10-fold cross-validation for the linear regression model.

scores = cross_val_score(linreg, X, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))

Output: 165.2232866891297
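This cross-validated RMSE (about 165.2) is noticeably worse than the decision tree's 107.6 reported above, which is consistent with the non-linear hourly demand pattern that a linear model with a numeric hour feature cannot capture.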
Conclusion
In conclusion, this project, together with the many other studies carried out on the bike sharing dataset, demonstrates that machine learning algorithms can be used to solve the prediction problem faced by bike sharing systems in cities around the world (Diamond & Boyd 2016, pp. 2909-2913). User behaviour and bike-usage patterns can be observed in the implemented regression models. The many experiments performed on this real dataset show how effective regression models can be in addressing the bike sharing problem (Salvatier, Wiecki & Fonnesbeck 2016, p. 55).
References
Kiefer, C. and Behrendt, F., 2016. Smart e-bike monitoring system: real-time open source and open hardware GPS assistance and sensor data for electrically-assisted bicycles. IET Intelligent Transport Systems, 10(2), pp.79-88.
Jian, N., Freund, D., Wiberg, H.M. and Henderson, S.G., 2016, December. Simulation optimization for a large-scale bike-sharing system. In Proceedings of the 2016 Winter Simulation Conference (pp. 602-613). IEEE Press.
Rivers, K. and Koedinger, K.R., 2017. Data-driven hint generation in vast solution spaces: a self-improving python programming tutor. International Journal of Artificial Intelligence in Education, 27(1), pp.37-64.
Orfanakis, V. and Papadakis, S., 2016, December. Teaching basic programming concepts to novice programmers in secondary education using Twitter, Python, Arduino and a coffee machine. In Hellenic Conference on Innovating STEM Education (HISTEM), Greece.
Salvatier, J., Wiecki, T.V. and Fonnesbeck, C., 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, p.55.
Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A., 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).
Kumar, S., Vo, A.D., Qin, F. and Li, H., 2016. Comparative assessment of methods for the fusion transcripts detection from RNA-Seq data. Scientific Reports, 6, p.21597.
Diamond, S. and Boyd, S., 2016. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1), pp.2909-2913.