Giter Site home page Giter Site logo

walt1012 / traffic-congestion-eda Goto Github PK

View Code? Open in Web Editor NEW

This project forked from ankushjain2001/traffic-congestion-eda

0.0 2.0 0.0 6.73 MB

The exploratory data analysis for the road traffic data-sets used in the Traffic Congestion project.

Python 0.17% Jupyter Notebook 99.83%

traffic-congestion-eda's Introduction

Traffic Congestion: Exploratory Data Analysis

The objective of this project is to study the effectiveness of various machine learning models for the analysis and forecasting of a road traffic data timeseries. Subsequently, the most suitable model for the use-case has been determined and a prototype application called Traffic Congestion Management System is developed using R and Shiny Dashboard. The web application can predict the traffic conditions for a particular road intersection by training with the traffic data in a timeseries format.

This repository is the exploratory data analysis for the road traffic dataset used in the Traffic Congestion project. The EDA is performed using Python and its libraries - statsmodel, matplotlib, pandas and numpy. Check the Jupyter Notebook here.

Table of Content

Introduction

Timeseries Description

  • The timeseries used for this road traffic data analysis is obtained from data.gov.uk.
  • The data is taken from the year 2010 to 2014 (1826 days) at a duration of 15 minutes.
  • For every 24 hours we have 96 observations. Hence, a total of 1826*96 = 175296 observations are recorded.
  • Following is a simple line plot of all the 175296 observations.

Proposed Machine Learning Model for the problem

For the given use case, the Autoregressive-Integrated-Moving-Average Model (ARIMA) is considered for forecasting. The results of AR, MA, ARMA and ARIMA models will be analyzed. However, before moving forward to the forecasting stage, we will study the pre-conditions of these models.

  • Hypothesis for the trend and seasonality in the timeseries will be tested
  • Stationarity of the timeseries will be tested using Dicky-Fuller Test
  • Decomposition of timeseries will be performed to obtain the residual timeseries
  • ACF and PACF analysis will be performed

Hypothesis Testing

1. Traffic Flow Trend

Claim: With increase in time, the population of the area will probably increase. Hence, an upward trend can be expected in the traffic flow as the time goes by.

Conclusion

  • No trend exists. Alternative hypothesis holds true.

2. Daily Seasonality

Claim: The road traffic flow will dependent on the different hours of the day (rush hours vs off hours). Hence, daily seasonality can be expected.

Conclusion

  • It can be concluded that there is a daily seasonality in the data as similar patterns can be observed in traffic frequency for particular time intervals.
  • The rush hours for weekdays are around 11:45 in the morning and 16:15 in the evening.
  • Null Hypothesis holds true.

3. Weekly Seasonality

Claim: The road traffic flow on weekdays will be greater than weekends. Hence weekly seasonality can be expected.

Conclusion

  • It can be concluded that there is a weekly seasonality in the data as similar patterns can be observed in traffic frequency for each week at a monthly scale.
  • The weekdays have more traffic with the highest reaching on Friday. The weekends have lesser traffic.
  • Null Hypothesis holds true.

4. Annual Seasonality

Claim: The road traffic flow will dependent on the different months of the year, due to change in weather conditions. Hence annual seasonality can be expected.

Conclusion

  • It can be concluded that there is a annual seasonality in the data as similar patterns can be observed in traffic frequency for each month at an yearly scale.
  • The average traffic flow rises till August and then decreases till December-January.
  • Null Hypothesis holds true.

Stationarity of Timeseries

The stationarity of the given timeseries is determined using the Augmented Dickey-Fuller stationarity test. Following is the premise of the test.

Null-Hypothesis of the Dickey-Fuller test states that the timeseries is non-stationary (presence of unit root).

Alternate Hypothesis of Dickey-Fuller test states that the times series is stationary (absence of unit root).

Results of Augmented Dickey-Fuller Test:

The given graph represents the data of a 3 months subset. This will ease the visual interpretation of the plots.

Statistic Value
Test Statistic -26.380273
p-value 0.000000
# Lags Used 37.000000
Number of Observations Used 8602.000000
Critical Value / Significance Level (1%) -3.431110
Critical Value / Significance Level (5%) -2.861876
Critical Value / Significance Level (10%) -2.566949

Conclusion

  • The p-value of 0 which is less than our 1% significance level value. This indicates that the null-hypothesis is invalid, with a certainity of 99%. Hence, timeseries is stationary.
  • The Test Statistic is much lower than Critical Values. Hence, the null-hypothesis can be rejected and the timeseries is indeed stationary.
  • There is no need for applying any de-trending transformation or differencing the timeseries as it is stationary.

Decomposition of Timeseries

ACF and PACF Analysis

We will use the most suitable model out of AR, MA, ARMA and ARIMA models for forecasting in the modeling stage. So, in this section we will determine the various parameters that are required for the ARIMA model.

  • 'p' or the number of Auto-Regressive terms: These are the lags of forecasted variable
  • 'q' or the number of Moving-Average terms: These are the lagging forecasted error
  • 'd' or number of differences: This is the number of non seasonal differences

A chart describing the right approach for selecting the AR or MA processes:

ACF PACF
AR Tails off gradually Significant till 'n' lags
MA Significant till 'n' lags Tails off gradually
ARMA Tails off gradually Tails off gradually

Conclusion

  • The most suitable model for the timeseries of given use-case appears to be AR or autoregressive model.
  • However, for modeling with ARIMA, following parameters can be used.
    • From the given PACF plot, we can determine that the value of 'p' is 2.
    • From the given ACF plot, we can determine that the value of 'q' is 10.

traffic-congestion-eda's People

Contributors

ankushjain2001 avatar ankur3107 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.