bayareabikeshareanalysis's Introduction

Introduction

In this project, I analyze the SF Bay Area Bike Share dataset. To do so, I use the following datasets:

  1. Trips Dataset
  2. Station Dataset
  3. Weather Dataset
  4. Yelp Customer Reviews Dataset

Through this analysis, I aim to better understand how different factors affect bike sharing and how Bay Area Bike Share (BABS) has performed over time.

Click here to go to the dataset.

Technology Stack

  1. Hadoop MapReduce
  2. HBase
  3. Hive
  4. Pig
  5. Weka
  6. Tableau

Analysis results

Note: All the charts are plotted using Tableau

Analysis 1: Binning pattern to determine distribution of rides in a day

This analysis finds how rides are spread over the hours of the day.

Binning_Rides_Distribution_1
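As a rough illustration of the binning idea, the sketch below assigns each trip to one of 24 hourly bins and counts the trips per bin. It is plain Python rather than the project's Hadoop job, and the function names and the `DATE_FORMAT` timestamp layout are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout (e.g. "8/29/2013 14:13"); adjust to the actual CSV.
DATE_FORMAT = "%m/%d/%Y %H:%M"


def hour_bin(start_date: str) -> int:
    """Map step: assign a trip to one of 24 hourly bins."""
    return datetime.strptime(start_date, DATE_FORMAT).hour


def rides_per_hour(start_dates):
    """Reduce step: count the trips in each bin, including empty bins."""
    counts = Counter(hour_bin(d) for d in start_dates)
    return {hour: counts.get(hour, 0) for hour in range(24)}
```

Calling `rides_per_hour` on the start-date column yields one count per hour, which is the shape of data a bar chart over the hours of the day needs.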

Analysis 2: Subscriber vs. Customer during each year

In this analysis, I used custom writable objects to store the running total of the count for each kind of user. The result is shown in the following Tableau chart:

Counting_Pattern_Subs_Cust_Distribution
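The role of the custom writable can be sketched in plain Python as a small value object that carries both running totals and knows how to merge with another instance. `UserTypeCounts`, `map_trip`, and `reduce_counts` are hypothetical names for this example, and the `"Subscriber"` label is an assumption about the data; the actual Hadoop `Writable` class in the project may look different.

```python
from dataclasses import dataclass


@dataclass
class UserTypeCounts:
    """Stand-in for a custom Writable holding two running totals."""
    subscribers: int = 0
    customers: int = 0

    def add(self, other: "UserTypeCounts") -> None:
        self.subscribers += other.subscribers
        self.customers += other.customers


def map_trip(year: int, subscription_type: str):
    """Map step: emit (year, counts) with a 1 in the matching slot."""
    if subscription_type == "Subscriber":  # assumed label in the data
        return year, UserTypeCounts(subscribers=1)
    return year, UserTypeCounts(customers=1)


def reduce_counts(pairs):
    """Reduce step: merge the per-trip counts into one total per year."""
    totals = {}
    for year, counts in pairs:
        totals.setdefault(year, UserTypeCounts()).add(counts)
    return totals
```

Packing both totals into one value means a single pass produces the subscriber and customer counts for each year together.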

Analysis 3: Distribution of rides over months

To determine the distribution of rides across the months of the year, I used the counting with counters pattern.

Rides_over_months
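In the counting with counters pattern, each record increments a named counter and no reduce step is needed. The sketch below mimics that in plain Python with one counter per month; `count_rides_by_month` and the default timestamp layout are assumptions for this example, not the project's Hadoop code.

```python
from collections import Counter
from datetime import datetime

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def count_rides_by_month(start_dates, date_format="%m/%d/%Y %H:%M"):
    """Increment one named counter per trip, keyed by month name.

    The default date_format is an assumed layout; adjust it to the CSV.
    """
    counters = Counter()
    for start_date in start_dates:
        month = datetime.strptime(start_date, date_format).month
        counters[MONTHS[month - 1]] += 1
    return counters
```

Because there are only twelve possible counter names, the counter set stays small, which is the situation this pattern is suited to.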

Analysis 4: Determining Top 6 stations based on the number of rides

Here, we employ the Top K pattern to determine the 6 busiest stations in the Bay Area, that is, the stations where the most rides start.

TopK_stations
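A minimal single-process sketch of the Top K idea: count the rides per start station, then keep only the K largest counts. In a real MapReduce job each mapper would typically keep a local top K and a single reducer would merge them; the function name and the in-memory counting here are simplifications for illustration.

```python
import heapq
from collections import Counter


def top_k_stations(start_stations, k=6):
    """Return the k (station, ride_count) pairs with the most rides."""
    counts = Counter(start_stations)
    # nlargest avoids sorting every station when only k are needed.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])
```

The result is ordered from busiest to least busy, so it can be charted directly.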

Analysis 5: Distribution of rides based on temperature

In this analysis, we determine how the mean temperature of a day affects how many people use the bike share. The analysis is restricted to the top 5 busiest stations.
To achieve this, we join two datasets using an inner join. We also employ secondary sorting, the Top K pattern, and job chaining.

Rides_based_on_temperature
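The join step can be sketched as follows, assuming the trips are reduced to `(date, start_station)` pairs and the weather data to `(date, mean_temperature)` pairs. This uses an in-memory hash join in plain Python as a stand-in for the MapReduce join, and a final sort as a stand-in for secondary sorting; all names and the input shapes are assumptions for this example.

```python
from collections import Counter, defaultdict


def rides_by_temperature(trips, weather, top_stations):
    """Count rides per (station, mean temperature) for the given stations.

    trips:   iterable of (date, start_station)
    weather: iterable of (date, mean_temperature)
    """
    temp_by_date = dict(weather)
    counts = defaultdict(Counter)
    for date, station in trips:
        # Inner join: keep a trip only if its date has a weather record.
        if station in top_stations and date in temp_by_date:
            counts[station][temp_by_date[date]] += 1
    # Sort each station's results by temperature (secondary-sort stand-in).
    return {station: sorted(c.items()) for station, c in counts.items()}
```

Trips on dates without a weather record are dropped, which is the defining behavior of an inner join.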

Analysis 6: Sentiment Analysis of users' Yelp reviews

I used the Yelp API to fetch customer reviews of the BABS service. A very naïve method is used to analyze user sentiment.

Sentiment_Analysis
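The README does not say which naïve method was used. One common naïve approach, shown here purely as an assumption, is to count words from small positive and negative word lists and compare the totals. The word lists and the `sentiment` function below are hypothetical and are not taken from the project.

```python
import re

# Hypothetical word lists; a real analysis would use a larger lexicon.
POSITIVE = {"good", "great", "love", "easy", "convenient", "fun"}
NEGATIVE = {"bad", "broken", "terrible", "expensive", "slow", "hate"}


def sentiment(review: str) -> str:
    """Label a review by comparing positive and negative word counts."""
    words = re.findall(r"[a-z']+", review.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

This kind of scoring ignores negation and context ("not good" counts as positive), which is why it is only a rough first pass.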

Analysis 7: Using Pig to find duration of trip for start station, total trips started and average trip duration

I ran Pig in local mode for this analysis, using the following commands:

  1. Start Pig

./pig -x local
  2. Load table data
-- Assumption: the schema must list the columns in the same order as trips.csv; duration is assumed to follow trip_id.
tripdata = LOAD '/Users/mansijain/Desktop/BABS-Dataset/trips/trips.csv' USING PigStorage(',') as (trip_id: int, duration: int, start_date: chararray, start_station: chararray, start_term: chararray, end_date: chararray, end_station: chararray, end_term: int, bike, zip_code: int);
  3. Group data by start station column
split_station = GROUP tripdata BY start_station;
  4. Perform aggregate function
results = FOREACH split_station GENERATE group AS start_station, SUM(tripdata.duration), AVG(tripdata.duration), COUNT(tripdata.trip_id);
  5. Store results in local fs
STORE results INTO '/Users/mansijain/Desktop/BABS-Dataset/results/start_stn' USING PigStorage(',');

The following chart was obtained:

Station_Analysis_Pig
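To cross-check the Pig output on a small sample, the same group-and-aggregate logic can be sketched in plain Python. The function name and the `(station, duration)` input shape are assumptions for this example.

```python
from collections import defaultdict


def station_stats(trips):
    """Return {station: (total_duration, average_duration, trip_count)}.

    trips: iterable of (station, duration)
    """
    totals = defaultdict(lambda: [0, 0])  # station -> [duration sum, count]
    for station, duration in trips:
        totals[station][0] += duration
        totals[station][1] += 1
    return {
        station: (total, total / count, count)
        for station, (total, count) in totals.items()
    }
```

Each output tuple mirrors the three aggregates in the Pig script: the sum, the average, and the count per station.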

Analysis 8: Using Pig to find duration of trip for end station, total trips ended and average trip duration

A similar Pig query is used to determine the data associated with the end station.

  1. Group data
split_station = GROUP tripdata BY end_station;
  2. Aggregate data
results = FOREACH split_station GENERATE group AS end_station, SUM(tripdata.duration), AVG(tripdata.duration), COUNT(tripdata.trip_id);
  3. Store data
STORE results INTO '/Users/mansijain/Desktop/BABS-Dataset/results/end_stn' USING PigStorage(',');

Station_Analysis_Pig

Analysis 9: Using Hive to analyze total stations and dock counts in all the cities of Bay Area

I used Hive for this analysis, following these steps:

  1. Create table
CREATE TABLE stationTable (id int, name string, lat double, long double, dock_count int, landmark string, date_inst date) row format delimited fields terminated by ',' STORED AS TEXTFILE;
  2. Show tables
show tables;
  3. Write the aggregated values to a local directory
INSERT OVERWRITE local directory '/Users/mansijain/Desktop/BABS-Dataset/hive' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT landmark, count(id), sum(dock_count) FROM stationTable GROUP BY landmark;

Dock_Analysis
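The aggregation in the Hive query (stations and docks per landmark) can be sketched in plain Python for a quick sanity check on a small sample. The function name and the `(station_id, landmark, dock_count)` input shape are assumptions for this example.

```python
def docks_by_landmark(stations):
    """Return {landmark: (station_count, total_dock_count)}.

    stations: iterable of (station_id, landmark, dock_count)
    """
    summary = {}
    for _station_id, landmark, dock_count in stations:
        station_count, docks = summary.get(landmark, (0, 0))
        summary[landmark] = (station_count + 1, docks + dock_count)
    return summary
```

Each value pairs the equivalent of `count(id)` with `sum(dock_count)` for one landmark.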

Analysis 10: Analyze the spread of all the stations in the cities of Bay Area

Using the distinct filtering pattern, I extracted every unique station in the Bay Area. I used the latitude and longitude columns to plot the station locations on a geomap.

spread_of_station
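The distinct pattern emits each unique key exactly once. A minimal plain-Python sketch, assuming each record is reduced to a `(name, latitude, longitude)` tuple, is shown below; the function name is hypothetical.

```python
def distinct_stations(records):
    """Return each unique (name, latitude, longitude) once, in first-seen order."""
    seen = set()
    unique = []
    for name, lat, lon in records:
        key = (name, lat, lon)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

In MapReduce the same effect comes from emitting the record as the key with an empty value, so that the shuffle groups duplicates and the reducer writes each key once.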

Analysis 11: Analysis of trips based on start and end station

The following analysis determines the total number of rides between each pair of stations. I used composite keys, combining the start and end station, to determine these values.

trip_analysis
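A composite key simply combines several fields into one grouping key. In plain Python a tuple plays that role, as in this sketch; the function name and the `(start_station, end_station)` input shape are assumptions for this example.

```python
from collections import Counter


def rides_between(trips):
    """Count rides per (start_station, end_station) composite key.

    trips: iterable of (start_station, end_station)
    """
    return Counter((start, end) for start, end in trips)
```

The key is directional here, so A to B and B to A are counted separately. Whether the project treats them as the same pair is not stated in the README.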

Contributors

jainmansi
