Bikesharing

Bike sharing systems are a means of renting bicycles in which obtaining membership, renting, and returning a bike are automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike from one location and return it to a different one on an as-needed basis.

The data generated by these systems make them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network that can be used to study mobility in a city. Data source: Kaggle.

This project uses historical usage patterns and weather data to forecast the hourly number of users.

Data Fields

datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals
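
The 'hour' and 'month' features used later in the regression are not among the fields above, so they presumably come from the datetime field. A minimal sketch of that derivation, assuming timestamps look like "YYYY-MM-DD HH:MM:SS" (the helper name `add_time_features` and the sample row are hypothetical, made up for illustration):

```python
from datetime import datetime

def add_time_features(row):
    """Return a copy of row with 'hour' and 'month' derived from 'datetime'."""
    # Assumption: timestamps use the layout "YYYY-MM-DD HH:MM:SS".
    ts = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M:%S")
    out = dict(row)
    out["hour"] = ts.hour    # 0..23
    out["month"] = ts.month  # 1..12
    return out

# Illustrative row, not taken from the real dataset.
sample = {"datetime": "2011-01-01 08:00:00", "season": 1, "count": 8}
features = add_time_features(sample)
print(features["hour"], features["month"])
```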

Data visualization

The distribution of the number of users per hour
The dataset contains 10886 samples in total, and the samples more than 3 standard deviations from the mean account for less than 1% of the data. These outliers are filtered out.
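
One way such a filter could be written is sketched below. It assumes that "outside 3 std" means more than three population standard deviations from the mean of the count; the function name `filter_outliers` and the toy counts are hypothetical.

```python
import statistics

def filter_outliers(counts, n_std=3.0):
    """Keep only the values within n_std standard deviations of the mean."""
    mean = statistics.fmean(counts)
    std = statistics.pstdev(counts)  # population standard deviation (an assumption)
    return [c for c in counts if abs(c - mean) <= n_std * std]

# Toy data: 99 ordinary hours and one extreme hour.
counts = [10] * 50 + [12] * 49 + [1000]
kept = filter_outliers(counts)
print(len(counts), len(kept))
```

Here only the single extreme value lies outside the 3 std band, so it is the only one dropped.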
Correlation matrix
There is no obvious linear relationship between temperature, humidity, windspeed, and user count.
Hourly mean statistics by weather
This picture shows the number of users per hour in different weather. There are more users in good weather. The morning peak is at 8:00 and the evening peak is at 17:00.
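
The numbers behind a plot like this can be obtained with a pandas group-by. The sketch below is only an illustration: the column names follow the data fields, 'hour' is assumed to have been derived from datetime, and the values are made up.

```python
import pandas as pd

# Made-up rows: two hours of the day observed under weather 1 and 3.
df = pd.DataFrame({
    "hour":    [8, 8, 8, 17, 17, 17],
    "weather": [1, 1, 3, 1, 3, 3],
    "count":   [400, 420, 150, 500, 200, 180],
})

# Mean user count per hour, one column per weather category.
hourly_mean = df.groupby(["weather", "hour"])["count"].mean().unstack("weather")
print(hourly_mean)
```

Each column of `hourly_mean` can then be drawn as one line of the hourly plot.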
Month statistics

This picture shows the mean user count per hour in different months and different weather. January has the fewest users.

Season statistics

This picture shows the user count in different seasons and different weather. There are more users in summer and fall and fewer in spring and winter.

Regression

First, preprocess the data: filter out the outliers and add dummy variables for the categorical features 'season', 'holiday', 'workingday', 'weather', 'hour', and 'month'. Then choose an appropriate regression model by comparing the accuracy of different regression models, using grid search based on k-fold cross-validation error to select each model's parameters.
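
The preprocessing and evaluation steps could look roughly like the sketch below. It uses pandas `get_dummies` and scikit-learn `cross_validate` on synthetic data; the number of folds (5), the shuffling, and the synthetic target are assumptions, not settings taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the real data (values are random, for illustration only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "season": rng.integers(1, 5, n),
    "holiday": rng.integers(0, 2, n),
    "workingday": rng.integers(0, 2, n),
    "weather": rng.integers(1, 5, n),
    "hour": rng.integers(0, 24, n),
    "month": rng.integers(1, 13, n),
    "temp": rng.uniform(0, 40, n),
    "humidity": rng.uniform(0, 100, n),
    "windspeed": rng.uniform(0, 50, n),
})
df["count"] = (5 * df["temp"] + 10 * df["hour"] + rng.normal(0, 5, n)).round()

# Replace each categorical column with one 0/1 dummy column per category.
categorical = ["season", "holiday", "workingday", "weather", "hour", "month"]
X = pd.get_dummies(df.drop(columns="count"), columns=categorical, dtype=float)
y = df["count"]

# k-fold cross-validated R2 and MSE for one candidate model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(LinearRegression(), X, y, cv=cv,
                        scoring=("r2", "neg_mean_squared_error"))
mean_r2 = scores["test_r2"].mean()
mean_mse = -scores["test_neg_mean_squared_error"].mean()
print(f"k-fold R2: {mean_r2:.5f}, mse: {mean_mse:.5f}")
```

Repeating the last step with each candidate model gives comparable k-fold scores.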

Linear Regression

Multiple linear regression gave a k-fold R2 score of 0.61841 and an MSE of 10011.33622. To reduce the effect of collinearity among the independent variables, ridge regression was applied, with grid search selecting the optimal regularization strength "alpha" = 1.5. Ridge regression gave a k-fold R2 score of 0.62404 and an MSE of 10331.75954. Backward elimination was also applied before regression, which gave a k-fold R2 score of 0.61899 and an MSE of 10017.96790.
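
A grid search over "alpha" might be set up as in the sketch below. The data are synthetic, with one deliberately near-collinear column. The candidate alpha values and the 5-fold split are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data; the last column is almost a copy of the one before it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 3] + rng.normal(scale=0.01, size=200)
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.5]) + rng.normal(scale=0.5, size=200)

# Candidate regularization strengths (hypothetical grid).
param_grid = {"alpha": [0.1, 0.5, 1.0, 1.5, 2.0, 5.0]}
search = GridSearchCV(Ridge(), param_grid, scoring="r2",
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print(best_alpha, search.best_score_)
```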

Support vector regression (SVR) was also used to predict the number of users. The SVR was trained with a Gaussian (RBF) kernel, with grid search selecting the optimal parameters "C" = 2000 (penalty parameter of the error term) and "epsilon" = 0.1 (epsilon in the epsilon-SVR model). SVR gave a k-fold R2 score of 0.63185 and an MSE of 9382.43674.
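
As a rough sketch, an RBF-kernel SVR can be tuned over "C" and "epsilon" with scikit-learn as follows. The data are synthetic, and the grid values other than C = 2000 and epsilon = 0.1 are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic non-linear target, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 2))
y = 50 * np.sin(X[:, 0]) + 20 * X[:, 1] ** 2 + rng.normal(scale=5, size=150)

# Hypothetical grid around the values reported above.
param_grid = {"C": [1, 100, 2000], "epsilon": [0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_mean_squared_error",
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)

best = search.best_params_
cv_mse = -search.best_score_  # the scorer is negated MSE, so flip the sign
print(best, cv_mse)
```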

Decision Tree

Decision tree regression was used for prediction, with grid search selecting the optimal parameter 'min_samples_split' = 0.02 (the minimum number of samples required to split an internal node; a float value is interpreted as a fraction of the training samples). Decision tree regression gave a k-fold R2 score of 0.61579 and an MSE of 9095.94262.
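
In scikit-learn, a float 'min_samples_split' is treated as a fraction: a node is split only if it holds at least ceil(fraction * n_samples) samples. The sketch below checks this on synthetic data, where the sample size of 500 is arbitrary.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic hourly-looking data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(500, 1))
y = 100 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(scale=10, size=500)

tree = DecisionTreeRegressor(min_samples_split=0.02, random_state=0).fit(X, y)

# With 500 samples, 0.02 corresponds to at least 10 samples per split.
min_split = math.ceil(0.02 * len(X))

# Internal (split) nodes are those with a left child; leaves are marked -1.
internal = tree.tree_.children_left != -1
smallest_split_node = tree.tree_.n_node_samples[internal].min()
print(min_split, smallest_split_node)
```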

Random Forest Regression

Random forest regression was tuned by grid search in two steps. First, the parameter 'n_estimators' = 100 (the number of trees in the forest) was determined. Then the parameters 'max_features' = 0.6 (the number of features to consider when looking for the best split, here given as a fraction of all features) and 'max_depth' = 26 (the maximum depth of the tree) were determined. Random forest regression gave a k-fold R2 score of 0.82300, an MSE of 4818.77715, and an error rate of 0.65858.
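
The two-step search can be sketched as below. The grids are kept deliberately small so that the example runs quickly. The candidate values, the 3-fold split, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 4))
y = 100 * X[:, 0] * X[:, 1] + 50 * X[:, 2] + rng.normal(scale=2, size=150)

cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Step 1: choose the number of trees with the other parameters at their defaults.
stage1 = GridSearchCV(RandomForestRegressor(random_state=0),
                      {"n_estimators": [10, 30]}, scoring="r2", cv=cv)
stage1.fit(X, y)
n_trees = stage1.best_params_["n_estimators"]

# Step 2: fix the number of trees and tune max_features and max_depth together.
stage2 = GridSearchCV(RandomForestRegressor(n_estimators=n_trees, random_state=0),
                      {"max_features": [0.6, 1.0], "max_depth": [4, 8]},
                      scoring="r2", cv=cv)
stage2.fit(X, y)
print(n_trees, stage2.best_params_, stage2.best_score_)
```

Searching in stages trains far fewer forests than a full grid over all three parameters, at the cost of possibly missing the jointly best combination.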

XGBoost

XGBRegressor was used for regression prediction, with grid search selecting the optimal parameters 'max_depth' = 10 (the maximum depth of a tree), 'min_child_weight' = 8 (the minimum sum of instance weights required in a child node), 'colsample_bytree' = 0.8 (the fraction of columns sampled at random for each tree), and 'gamma' = 4 (the minimum loss reduction required to split a node). XGBRegressor gave the highest k-fold R2 score (0.83835), the lowest mean squared error (4414.09082), and an error rate of 0.63869.

Conclusion

In this project, the data were processed through data visualization, data analysis, and data preprocessing, followed by model selection and parameter optimization. Because the distribution of the data is skewed, future work is to apply a log transformation to the data and then fit the different models again.
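
One possible way to try the log transformation is scikit-learn's `TransformedTargetRegressor`, which fits the model on log1p(count) and maps predictions back with expm1. This is only a sketch of the future work, using synthetic skewed data and a linear model as placeholders.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed counts that grow exponentially with temperature.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=300)
y = np.round(np.exp(0.1 * temp + rng.normal(scale=0.3, size=300)))
X = temp.reshape(-1, 1)

# Fit on log1p(y); predictions are mapped back to counts with expm1.
model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   func=np.log1p, inverse_func=np.expm1)
model.fit(X, y)
preds = model.predict(X)
print(preds[:5])
```

Any of the regressors above could be passed as `regressor` in place of the linear model.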

suangzi123 / bikesharing
