priscillaboyd / spat_prediction

Predict traffic signal phase and timing in fixed and adaptively controlled environment using historical traffic data

License: Apache License 2.0

Python 100.00%

spat_prediction's Introduction

Signal Phase & Timing (SPaT) Prediction

This project predicts signal phase and timing (SPaT) in fixed and adaptive environments by combining machine learning techniques with historical traffic data.

It has been developed to support a dissertation titled "A study of machine learning algorithms and their suitability for predicting traffic signal timing" towards an MSc in Software Engineering at the University of Oxford.

Features

Takes historical traffic controller signal phase and detection data to create datasets suitable for machine learning analyses
Enables feature extraction from data to provide signal state and phase duration
Implements the Classification and Regression Tree (CART) for SPaT prediction
Implements a Recurrent Neural Network with Long Short-Term Memory for SPaT prediction
Supports the creation of plots for data analysis

Getting started

The software has been divided into five packages:

Pre-Processing: processes the data in the expected format, creating datasets for usage with Decision Tree and Neural Network model creation
Analysis: manipulates the data for analysis, creating plots for further understanding
Decision Tree: implements the Classification and Regression Tree (CART) algorithm and Gradient Boosting Regression (GBR) ensemble algorithm to predict SPaT
Neural Network: implements a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) to predict SPaT
Tools: provides a number of helper functions that are re-used throughout the other packages

Data format

The application makes use of two types of data.

Siemens IC4 Tool's Emulator Data

Data generated by the Siemens IC4 Tool's Emulator can be provided to the PreProcessing module for formatting into the suitable historical traffic data used for the learning and prediction processes.

For information on the application itself and the data format it produces, please read the Siemens handbook.

Historical Traffic Data

The prediction engine takes historical traffic data in four comma-separated values (CSV) formats to create the learning models.

Format ID	Description	Used in
1	Timestamped phase/stage with detection I/O state information	RNN LSTM
2	Timestamped phase/stage with detection I/O state information (numerical values only)	CART/GBR
3	Timestamped phase/stage without detection I/O state information (numerical values only)	CART/GBR
4	Dated phase/stage with start/end times and duration information (numerical values only)	CART/GBR

Notes:

Formats 1 and 2 differ only in how the data is presented (format 2 uses numerical values only, due to limitations of the platform used).
Examples of the data in the 4 formats as above are included in the project (within the 'data' folder).

Pre-requisites

The following versions (or newer) are required to run SPaT Prediction:

Keras - v2.0.3
matplotlib - v2.0.2
NumPy - v1.13.1
Pandas - v0.20.3
Python - v3.5.2
seaborn - v0.8
scikit-learn - v0.19.0
TensorFlow - v1.0.0

Author

Priscilla Nagashima Boyd - priscillaboyd

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for further information.

Citation

If you use SPaT Prediction, we would appreciate a citation 😊 :

A study of machine learning algorithms and their suitability for predicting traffic signal timing, Nagashima Boyd, P., University of Oxford, 2017.

spat_prediction's Issues

[RNN LSTM] Improve plotting description

At the moment, the matplotlib graph presented is not very descriptive. Ideally, it should have:

A description of the X and Y axis
A title
Further timing granularity in the X axis

[RNN LSTM] Adapt to use Pandas for efficiency

At present, the RNN LSTM module uses 'vanilla' Python functions to load the dataset. Ideally, the application should store the data in a Pandas DataFrame for efficiency.

Combine data pre-processing modules

The DataExtract, DataCleaning and DataMerge functions should be rationalised given the expected sequential actions that link them together.

Create set of unit tests for data processing functionality

The application requires a set of unit tests for the data processing functionality (once issue #1 is resolved).

[DA] Adopt Utils module to reduce duplication

At the moment, the Data Analysis module uses functionality that is already defined in the DPP's Utils module. This needs to be refactored to reduce duplication and increase loose coupling.

Read valid phases from config file

At the moment, the list of signal phases is hardcoded. This should be more dynamic, reading the applicable phases from the IC4 .8SD config file.

Example of how this looks in the file itself:
[Phase]
PhaseNo:9
RealPhase:0
PhaseRef:J
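
A minimal sketch of how the phase list might be read, assuming the .8SD file is plain text in which each `[Phase]` header is followed by `key:value` lines, as in the excerpt above. The function name `parse_phases` is hypothetical, and the rest of the file layout is not known from this issue, so any other section is simply skipped.

```python
def parse_phases(config_text):
    """Collect the key:value pairs of every [Phase] section in the text."""
    phases = []
    current = None  # dict for the [Phase] section being read, if any
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line.startswith("[") and line.endswith("]"):
            # Any section header ends the previous section.
            current = {} if line == "[Phase]" else None
            if current is not None:
                phases.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return phases


sample = """[Phase]
PhaseNo:9
RealPhase:0
PhaseRef:J
"""
phases = parse_phases(sample)
phase_refs = [phase["PhaseRef"] for phase in phases]
```

Values are kept as strings here; whether `PhaseNo` or `PhaseRef` is the right identifier for the rest of the pipeline is left open.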

[DT-CART] Allow selection of dataset file by user

At the moment, the dataset location is hard-coded. Ideally, the user should be able to select which dataset to load.

Encapsulate DT and RNN modules into single package

For best practice, combine the DT and RNN implementations into a single ML-oriented package.

[RNN] Save model to 'models' folder

At the moment, the .h5 model is being saved to the same directory as the code base. Instead, the model should be saved to the "models" folder (as with the DT model) for a given set of results.

[RNN LSTM] Ignore first dataset row when dividing set into training/test

The first row of the dataset (with column titles) must be ignored when dividing the data into training and test sets.
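
One way to handle this, sketched with the standard library only: read the header separately, then split the remaining rows in order. The function name `split_rows`, the default fraction and the column names in the sample are assumptions for illustration, not values taken from the project.

```python
import csv
import io


def split_rows(csv_text, train_fraction=0.67):
    """Return (header, train_rows, test_rows) with the header kept out of both sets."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # column titles, never part of the data
    rows = list(reader)
    cut = int(len(rows) * train_fraction)
    # Rows stay in file order, which suits time-ordered signal data.
    return header, rows[:cut], rows[cut:]


sample = "time,phase,state\n1,A,0\n2,A,0\n3,A,3\n4,A,3\n"
header, train_rows, test_rows = split_rows(sample, train_fraction=0.75)
```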

[DT] Save model testing results to file

When performing the tests against the trained model, the accuracy results / scores are output to the command line. These should be saved to file together with:

Date/time of analysis performed
Data source filename
Model filename (i.e. saved to file)
Accuracy results

[DT] Use latest created sklearn dataset suitable

At the moment the application is taking a hardcoded CSV location to create DTs. This isn't ideal for obvious reasons. The application should take, as a minimum, the latest created sklearn dataset suitable to create a DT from.

Support for decision trees to predict signal phase and timing

Add support for a decision tree algorithm (suitable for the problem, e.g. CART or C4.5) that can help predict signal phase and timing.

Keep history of final datasets created

At present, the application overwrites any output data - i.e. it does not keep a history of datasets processed. Ideally, final datasets (i.e. only the final dataset.csv and not their raw/processed files) should be kept for comparison and testing purposes.
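
A possible approach, sketched under the assumption that the final dataset is a single file that can be copied into a history folder under a timestamped name. `archive_dataset` and both path arguments are hypothetical names.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_dataset(dataset_path, history_dir):
    """Copy the final dataset into history_dir under a timestamped file name."""
    dataset_path = Path(dataset_path)
    history_dir = Path(history_dir)
    history_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = history_dir / f"{dataset_path.stem}_{stamp}{dataset_path.suffix}"
    shutil.copy2(dataset_path, target)  # the original file is left in place
    return target
```

The timestamp has one-second resolution, so two runs within the same second would overwrite each other; a counter or a finer timestamp would be needed if that matters.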

[RNN LSTM] Archive model as HDF5 file

At present, any model created is stored as a single model.h5 file. This archiving method needs to be improved, with associated basic info, i.e.:

What dataset it relates to
When it was created
Accuracy level achieved
The activation function used
The loss function used
The optimiser used
Number of epochs
Batch size
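
One option, sketched here as an assumption rather than the project's approach, is to write a small JSON file next to the .h5 file that records the items listed above. `write_model_metadata` and the keyword names passed to it are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_model_metadata(model_path, **info):
    """Write a JSON sidecar (same name, .json suffix) describing a saved model."""
    model_path = Path(model_path)
    record = {
        "model_file": model_path.name,
        "created": datetime.now(timezone.utc).isoformat(),
        **info,  # e.g. dataset, accuracy, activation, loss, optimiser, epochs, batch_size
    }
    sidecar = model_path.with_suffix(".json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

A call such as `write_model_metadata("models/model.h5", dataset="dataset.csv", epochs=10)` would produce `models/model.json`, provided the `models` folder already exists.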

[DT-CART] 1s records have duration showing as 86399.0

Due to the logic in the application, there is an issue where 1s long records have the duration being calculated as 86399.0 seconds. This is because the end time is less than the start time (hence Pandas gives the 86399.0 result when performing the time delta operation). There's a workaround in place for now, but this needs fixing.
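
The figure 86399 is what a difference of minus one second looks like when only the `seconds` component of a time delta is read: the value is stored as minus one day plus 86399 seconds. The sketch below reproduces this with the standard library. Whether this is the exact code path in the application is an assumption; the timestamps are made up for illustration.

```python
from datetime import datetime

start = datetime(2017, 1, 1, 10, 0, 1)
end = datetime(2017, 1, 1, 10, 0, 0)  # one second *before* the start

delta = end - start                 # stored as days=-1, seconds=86399
wrapped = delta.seconds             # reads only the seconds component
signed = delta.total_seconds()      # keeps the sign of the whole difference
```

Reading the signed total instead of the component makes reversed start/end pairs show up as negative durations, which are easier to detect and correct than a value close to a full day.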

[DT-CART] First and last record persisting, causing high value to be output

The very first and the very last records persist through the duration calculation, which means high values are output for them (they are also duplicates, as these records are already accounted for).

[DT] Record algorithm run time

The algorithm run time needs to be recorded and displayed for analysis.
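
A minimal sketch of one way to capture the run time, using a hypothetical helper `timed` that wraps any callable (for example a model's training function) and returns its result together with the elapsed wall-clock seconds.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in practice this would be the training or prediction call.
total, seconds = timed(sum, range(1000))
```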

Allow user to select where dataset is stored when running model creation

At the moment, the datasets are stored in the latest results folder. The application should allow the user to define where the dataset is stored (whether the latest results folder or otherwise).

Read detector names from controller config file

At present, the detector names are hardcoded into the application. Ideally, the detector names should be taken from the IC4 controller config file.

Example from the .8SD file (where items in bold refer to detector names):

IOLine0:ASL1,0,I,0,1,2 LT1,A1,0,0,0,A,0,1,0,0,0,0,0,0,0,1,0,2,1,1,2,0,0,0,0,0,0,
IOLine1:BSL1,0,I,1,2,2 LT1,A2,0,0,0,A,0,2,0,0,0,0,0,0,0,2,0,2,1,1,2,0,0,0,0,0,0,
IOLine2:CSL1,0,I,2,4,2 LT1,A3,0,0,0,A,0,4,0,0,0,0,0,0,0,4,0,2,1,1,2,0,0,0,0,0,0,
IOLine3:DSL1,0,I,3,8,2 LT1,A4,0,0,0,A,0,8,0,0,0,0,0,0,0,8,0,2,1,1,2,0,0,0,0,0,0,
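
A sketch of how these lines might be parsed. Because the bold formatting is not visible here, it assumes that the detector name is the first comma-separated field after the `IOLineN:` prefix (`ASL1`, `BSL1`, and so on); that assumption should be checked against the IC4 documentation. `parse_detector_names` is a hypothetical name.

```python
import re

# Captures the line index and the first field after the colon.
IOLINE_PATTERN = re.compile(r"^IOLine(\d+):([^,]+),")


def parse_detector_names(config_text):
    """Map each IOLine index to the first field of that line."""
    names = {}
    for line in config_text.splitlines():
        match = IOLINE_PATTERN.match(line.strip())
        if match:
            names[int(match.group(1))] = match.group(2)
    return names


sample = (
    "IOLine0:ASL1,0,I,0,1,2 LT1,A1,0,0,0,A,0,1,0,0,0,0,0,0,0,1,0,2,1,1,2,0,0,0,0,0,0,\n"
    "IOLine1:BSL1,0,I,1,2,2 LT1,A2,0,0,0,A,0,2,0,0,0,0,0,0,0,2,0,2,1,1,2,0,0,0,0,0,0,\n"
)
detectors = parse_detector_names(sample)
```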

[RNN LSTM] Remove NumPy seed

At present, the NumPy seed is kept for test purposes. When taken to production, this should be removed.

Clear SUP mode data from emulated dataset

When data is emulated, the first records will have the mode stream set to "8 - SUP". These must be removed when processing an emulated dataset as they relate to sample / test records that may affect accuracy of the system.
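
A minimal sketch of the filtering step, assuming each record is a dictionary and that the mode stream is held under a key named `mode`. Both the key name and the function name `drop_sup_records` are hypothetical; only the value "8 - SUP" comes from the description above.

```python
SUP_MODE = "8 - SUP"


def drop_sup_records(rows, mode_key="mode"):
    """Return the rows whose mode stream is not the SUP start-up mode."""
    return [row for row in rows if row.get(mode_key) != SUP_MODE]


rows = [
    {"time": "10:00:00", "mode": "8 - SUP"},
    {"time": "10:00:01", "mode": "8 - SUP"},
    {"time": "10:00:02", "mode": "other"},  # placeholder mode value
]
cleaned = drop_sup_records(rows)
```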

Change signal phase representation to numeric data type

The signal phase representation must use a numeric data type in order to be processed against the RNN LSTM module. At present, the phase representation is done using a string data type (i.e. 'Red', 'RedAmber', 'Amber' and 'Green').

Numeric representation should be:

Red = 0
RedAmber = 1
Amber = 2
Green = 3
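
The mapping above can be sketched as a lookup table. The names `PHASE_CODES` and `encode_phase` are hypothetical; rejecting unknown strings is a design choice made here so that unexpected values in the data are noticed rather than silently encoded.

```python
PHASE_CODES = {"Red": 0, "RedAmber": 1, "Amber": 2, "Green": 3}


def encode_phase(state):
    """Translate a signal state string into its numeric code."""
    try:
        return PHASE_CODES[state]
    except KeyError:
        raise ValueError(f"Unknown signal state: {state!r}") from None


encoded = [encode_phase(s) for s in ["Red", "RedAmber", "Green", "Amber"]]
```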

Create dataset with duration of signal phase

At the moment, the historical signal phase data has timestamped records. For some algorithms (e.g. decision trees), the timestamps need to be aggregated to give an idea of duration per phase. Ideally, the application should create a dataset with the duration of each phase (with the start/end time for troubleshooting as well as the phase and state).
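
A sketch of the aggregation for a single phase, assuming the input is a time-ordered list of `(timestamp, state)` samples. Consecutive samples with the same state are collapsed into one run, and a run is taken to end when the next one starts. The final run is dropped because its end is not known from the data. The function name and the sample values are illustrative only.

```python
from datetime import datetime
from itertools import groupby


def state_durations(records):
    """Collapse (timestamp, state) samples into runs with start, end and duration."""
    # First timestamp of each run of identical consecutive states.
    runs = [
        (state, next(group)[0])
        for state, group in groupby(records, key=lambda record: record[1])
    ]
    durations = []
    for (state, start), (_, next_start) in zip(runs, runs[1:]):
        durations.append({
            "state": state,
            "start": start,
            "end": next_start,
            "duration": (next_start - start).total_seconds(),
        })
    return durations


def at(second):
    return datetime(2017, 1, 1, 10, 0, second)


samples = [
    (at(0), "Red"), (at(1), "Red"), (at(2), "RedAmber"),
    (at(3), "Green"), (at(4), "Green"), (at(5), "Amber"),
]
result = state_durations(samples)
```

With these samples the runs are Red (2 s), RedAmber (1 s) and Green (2 s); the trailing Amber run is omitted.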

[RNN] Refactor to increase modularity

At the moment, the RNN functionality sits within a single (monolithic) Python file. Ideally, this should be split so that generic neural network functions can be re-used (for example, if NNs other than RNNs are evaluated for the problem in future).