civictechto / ttc_subway_times

A scraper to grab and publish TTC subway arrival times.

License: GNU General Public License v3.0


TTC Subway times

This project implements a scraper to grab and publish TTC subway arrival times, feeding citizen-led performance metrics that analyze service timeliness and reliability. The end goal is to maintain and publish a performance dashboard that builds consensus on improving the quality of public transit service in Toronto.

State of the project

We have a Python scraper running on AWS that has been saving predicted subway arrivals since late February 2017. We need to process this data to generate observed station arrival times for each train at each station. There was a couple-month hiatus in scraping between August and November because the database was full, but the scraper is now humming along nicely again.

We're still trying to process the predicted arrival times obtained from the API into a reasonably reliable set of observed arrivals. This work is happening in the Jupyter Notebook Filtering Observed Arrivals.ipynb; it is mostly SQL, despite living in a Python notebook.

Take a look at How to Get Involved to see where your expertise fits. Feel free to follow along; your informed feedback will surely lead to better data.

First Step Metric

Buffer Time - A measure of reliability useful for the rider:

  • how much extra time a rider needs to budget on top of the typical travel time so that they arrive on time in 19 trips out of 20 (i.e., they are late at most 1 trip in 20); a sketch of the calculation follows below.
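As a rough illustration (an assumption about the metric's definition, not code from this repo), buffer time can be computed as the 95th-percentile travel time minus the median travel time for an origin-destination pair:

# Illustrative sketch only: compute buffer time from observed travel times
# (in minutes) for one origin-destination pair. The project's exact
# definition may differ.
import numpy as np

def buffer_time(travel_times_minutes):
    times = np.asarray(travel_times_minutes, dtype=float)
    p95 = np.percentile(times, 95)  # on time 19 trips out of 20
    median = np.median(times)       # typical trip
    return p95 - median             # extra minutes a rider should budget

print(buffer_time([22, 23, 23, 24, 25, 26, 30, 41]))  # roughly 13 extra minutes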

Documentation

Review the doc/ folder for the Jupyter Notebooks explaining how we explored the data and developed this project. Jupyter Notebooks are a way of combining code, descriptive text, and output in one document to present a narrative around data exploration.

If you're interested in a Data Dictionary, click here (it's also in the doc/ folder).

Presentations

You can watch Raphael present this project to CivicTech Toronto on January 15, 2019 in this (blurry) YouTube video; the slides are here.

Data Flow and Data Structure

The scraper runs every minute during TTC service hours. Each run is termed a poll and has a start and end time. During one run of the scraper, each station's predicted arrivals are returned as a request, identified by its stationid and lineid. For each request, 3 predicted arrivals are recorded for each line and direction at that station. These responses include the train's traindirection, its unique ID trainid, the time until the train's arrival timint, and a train_message indicating whether the train is arriving, at the station, or delayed.
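As a rough illustration of this hierarchy (field names follow the data dictionary, but the values and exact JSON layout here are made up), one poll nests requests, which nest responses:

# Illustrative sketch of the poll -> request -> response hierarchy.
# Values are invented; the real JSON layout may differ.
poll = {
    "poll_start": "2019-03-25 08:00:03",
    "poll_end": "2019-03-25 08:00:41",
    "requests": [
        {
            "stationid": 1,
            "lineid": 1,
            "responses": [  # 3 predicted arrivals per line and direction
                {"traindirection": "North", "trainid": 117,
                 "timint": 2.5, "train_message": "Arriving"},
                {"traindirection": "North", "trainid": 121,
                 "timint": 6.1, "train_message": "Delayed"},
                {"traindirection": "North", "trainid": 109,
                 "timint": 9.8, "train_message": "AtStation"},
            ],
        },
        # ...one request per station and line
    ],
}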

For more info: review the API exploration notebook under doc/API_exploration.ipynb

Analyzing the Data

⚠️ We're not entirely sure how to make this work; you can get data by finding us in the #transportation channel on Slack. If you think you can help with this, check out issue #59.

Historical data is stored in the s3://ttc.scrape bucket; the fetch_s3.py script can be used to automatically fetch and assemble this data. Its usage is:

python3 fetch_s3.py --bucket ttc.scrape --start_date 2018-12-02 --end_date 2018-12-05 --output_dir out/

If the end date is omitted, it is taken to be the latest available date. This script can only select whole days.

This will generate three CSVs in the output_dir, which can be copied into the Postgres database with the following SQL commands:

COPY polls FROM '/path/to/polls.csv' DELIMITER ',' CSV HEADER;
COPY requests FROM '/path/to/requests.csv' DELIMITER ',' CSV HEADER;
COPY ntas_data FROM '/path/to/responses.csv' DELIMITER ',' CSV HEADER;
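If you just want a quick look at the data without a Postgres instance, a sketch along these lines (assuming pandas is available; the key column names are assumptions, so check the CSV headers) joins the three CSVs in memory:

# Sketch: inspect the fetched CSVs with pandas instead of loading Postgres.
# Key column names are assumptions; check the CSV headers / data dictionary.
import pandas as pd

polls = pd.read_csv("out/polls.csv")
requests = pd.read_csv("out/requests.csv")
responses = pd.read_csv("out/responses.csv")

df = (responses
      .merge(requests, on="requestid", how="left")
      .merge(polls, on="pollid", how="left"))
print(df.head())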

Automating the Scraper Runs

There are two ways the scraper can be automated: via Docker/cron, or via the Serverless framework with AWS Lambda. No matter how you choose to run it, the scraper runs through ttc_api_scraper.py.

Storage Backends

There are two ways the data can be stored once it has been scraped: AWS S3 and Postgres.

AWS S3 stores each scrape as JSON, collected by service day (see the Consolidate function). This requires an AWS account and can be enabled with the --s3 flag. The advantage of S3 is that it requires no persistent server and is extremely cheap. Its main disadvantage is that the data is not as easily queryable as a SQL database, and some steps are required before it can be queried in SQL (see Analyzing the Data). This storage method is well suited to the AWS Lambda scraping mode.

Postgres requires a running Postgres instance. While the data is immediately queryable, a Postgres server must always be running, which increases the work, risk, and cost of the project. This backend can be selected with the --postgres flag and is most often used with the Docker/cron scraping mode.

Both modes store the same data, in largely the same structure (nested JSONs vs tables with joinable keys).

AWS Lambda Scraping

There is a mode which will allow scraping via AWS Lambda with logging added to AWS Cloudwatch. This mode uses the Serverless framework.

The Serverless framework is a suite of tooling which allows the easy deployment and management of serverless code.

This allows us to run the code without having to spin up and monitor an instance manually, and since we only pay while the code is running, compute costs are nearly zero.

Setup

In addition to installing the Python requirements, we need to install the Serverless framework with npm (see the install instructions for Ubuntu) by running npm install in the project root.

On Ubuntu you'll need to add the serverless binary to your PATH variable with PATH="$PATH:/[PATH/TO]/ttc_subway_times/node_modules/serverless/bin/" (replace [PATH/TO] with the absolute path to the repository). This only lasts for your current session; if you're going to use Serverless for other projects, you should probably install it globally with sudo npm install -g serverless.

Tell Serverless which AWS creds you would like to use with:

serverless config credentials --provider aws --key <AWS_ACCESS_KEY> --secret <AWS_SECRET_KEY>

Creating these credentials must be done through your AWS account. A good guide to this process can be found on the Serverless website.

Configure the serverless.yml

The template serverless.yml is configured to use dev and prod environments. This will create functions with dev- or prod- prepended to their names, and will direct output to the buckets defined in the custom section.

Move serverless.yml.template to serverless.yml.

Create an S3 bucket. Replace the angle-bracketed bucket names (under the custom: properties at the end of the document) with the names of the buckets you created.

At the time of writing the schedule line in serverless.yml is set as

    rate: cron(* 0-2,5-23 * * ? *)

which means it should run every minute from 5 am to 2 am every day. More information on this cron syntax can be found in the AWS documentation; note that its references to UTC should be ignored, because we use the serverless-local-schedule package, which allows us to specify crons in local time rather than UTC (otherwise the behaviour would change during daylight saving time).

Deploy

You'll need to install docker for this step (Ubuntu install instructions).

Finally deploy the function with

serverless deploy -v

This command deploys to the dev environment by default; the environment can be specified on the command line with the --stage flag (acceptable values for this project are dev and prod).

Logs are automatically persisted to Cloudwatch.

Consolidate Function

If you are scraping with the AWS Lambda function and the S3 data destination, each run writes a JSON file under s3://<BUCKET>/<SERVICE DATE>. As the scraper runs once per minute, this results in a very large number of small files, which are inefficient and expensive to store and transfer.

To remedy this, there is a second 'consolidate' serverless function which runs at the end of every service day and combines the previous day's files into a .tar.gz, storing it at s3://<BUCKET>/<SERVICE_DATE>.tar.gz.

This isn't relevant if you are storing the data in Postgres with the Lambda scraper, but this configuration will require you to modify serverless.yml.
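For reference, combining one service day's per-minute JSON files into a single archive boils down to something like the sketch below. This is illustrative only; the repo's actual consolidate function (src/ttc_api_scraper/consolidate.py) also downloads from and uploads back to S3 and differs in the details.

# Illustrative sketch: tar and gzip one service day's JSON files.
import tarfile
from pathlib import Path

def consolidate_day(scrape_dir, service_date):
    archive = Path(scrape_dir) / f"{service_date}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for json_file in sorted(Path(scrape_dir, service_date).glob("*.json")):
            tar.add(json_file, arcname=f"{service_date}/{json_file.name}")
    return archive

# e.g. consolidate_day("/tmp/scrapes", "2019-10-29")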

Docker/Cron Scraping

Another way to run the scraper is to run it periodically via cron in a Docker container. This is useful if you would like to scrape from a computer that is always left on.

First, set up a Python 3 environment and install the project requirements.

Running inside Docker: follow the instructions at https://docs.docker.com/compose/install/ to get docker-compose.

All you need to do is docker-compose run --rm scraper. This will set up a database container, initialize the tables, and then run the initial scrape.

To get CLI access to the data, you can use docker-compose exec db psql -U postgres -d ttc. Commands of interest:

  • \?: list the help information for all the special commands
  • \h: list all the SQL commands that are available
  • \q: quit the console

Linux/Unix

On Mac or Linux, add the following to your crontab. Don't forget to change /path/to/ttc_api_scraper.py.

Note: These commands assume postgres as a data destination.

# m h  dom mon dow   command
* 5-23 * * 1-5 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --postgres
* 0-1 * * 1-5 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --postgres
* 5-23 * * 6-7 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --postgres
* 0-2 * * 6-7 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --postgres

Or, to run every 20 seconds while filtering out any "arriving" records:

* 5-23 * * 1-5 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres
* 0-1 * * 1-5 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres
* 5-23 * * 6-7 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres
* 0-2 * * 6-7 cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres
* 5-23 * * 1-5 (sleep 20; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered  --postgres)
* 0-1 * * 1-5 (sleep 20; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered  --postgres)
* 5-23 * * 6-7 (sleep 20; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres)
* 0-2 * * 6-7 (sleep 20; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres)
* 5-23 * * 1-5 (sleep 40; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres)
* 0-1 * * 1-5 (sleep 40; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres)
* 5-23 * * 6-7 (sleep 40; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres)
* 0-2 * * 6-7 (sleep 40; cd /path/to/repo/ttc_subway_times/ && python3 ttc_api_scraper.py --filter --schemaname filtered --postgres)

Windows users

Use Task Scheduler.

cronic.py

If the above sounds complicated, here's a simple looping script that calls ttc_api_scraper.py every minute during the TTC's operating hours. Just start it in your command line with

python cronic.py

And let it collect the data.
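For reference, the core of such a loop might look like the sketch below (illustrative only; the real cronic.py may invoke the scraper and handle service hours differently):

# Minimal sketch of a cronic-style loop: call the scraper once a minute
# during approximate TTC service hours (roughly 5 am to 2 am).
import subprocess
import time
from datetime import datetime

while True:
    hour = datetime.now().hour
    if hour >= 5 or hour < 2:  # rough service window
        subprocess.run(["python3", "ttc_api_scraper.py", "--postgres"])
    time.sleep(60)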

Database setup

If you would like to use Postgres as the data repository, install PostgreSQL; instructions to obtain the latest and greatest version are here. After setting up your database, you can run the contents of create_tables.sql in a pgAdmin query window (or run it as a SQL query).

You will also need to edit db.cfg:

[DBSETTINGS]
database=ttc
host=host.ip.address
user=yourusername
password=pw
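For reference, a hedged sketch of how this config can be read and turned into a connection, assuming the psycopg2 driver (the scraper's own connection code may differ):

# Sketch: read db.cfg and connect to Postgres. Assumes psycopg2 is installed.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("db.cfg")
dbset = config["DBSETTINGS"]

con = psycopg2.connect(
    database=dbset["database"],
    host=dbset["host"],
    user=dbset["user"],
    password=dbset["password"],
)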

How to Get Involved

We discuss the project in the #transportation channel of CivicTechTO's Slack team. This is probably the best place to introduce yourself and ask how you can participate. There are links in that channel to get access to ~1 month of the raw data as a CSV or a PostgreSQL dump. You can also ask about getting read access to the database. Alternatively, you can set up the scraper yourself and build your own archive to play with locally. Hack away!

If you're exploring the data, please write up your exploration in a Jupyter Notebook, RMarkdown, or Markdown file, place it in the doc/ folder, and submit a Pull Request with details on what you explored.

Otherwise, have a look at the open issues and comment on anything you think you could contribute to, or open your own issue if you notice something in the code to improve.

Sources of Inspiration

Boldly following in others' footsteps. See more on the Resources page.

ttc_subway_times's People

Contributors

aays, bowenwen, carolyz, dependabot[bot], kkathc, patcon, perobertson, peterschan, radumas, rami-codes, raspyweather, rihabr, samkodes


ttc_subway_times's Issues

Do NTAS trainids in scraped data match the Delay Data vehicle #s?

We can also get Delay Data from OpenData, which is logged by operators. You can find the data here (this link is going to break really shortly).

trainids appear to be max 3 digits whereas subway vehicle numbers appear to be 4 digits starting with 5

Do these two datasets correspond? Do the IDs match, and can we gain insight on delay duration from the real-time data?

Configure the S3 bucket to be requester-pays

It is possible to make a bucket publicly readable while any person who downloads the data pays the transfer costs ("requester pays").

A) I don't entirely understand how to set this up
B) It would be great if this could get folded into the serverless.yml configuration/deployment.

Lint the scraper

Thanks to @perobertson's efforts to get continuous integration going, we now have automated linting. However, this has surfaced a lot of errors and failures. #42 silences many linting errors by adding their codes to the ignore variable in setup.cfg.

Start a PR and remove these lines one by one, committing with the linting error code each time. The CI will run and should explain which lines in the scraping code are causing these warnings to be raised. Then determine whether that style issue is worth fixing or not.

Generate Stop Arrival Times

A PostgreSQL function that gets called on every scraper run, following the GTFS spec:

 {trip_id, arrival_time, departure_time, stop_id, stop_sequence}

Note: The times would actually be timestamps.

Which brings up:

  • check the current TTC gtfs schema for trip_ids, stop_ids

Reduce data size in the database

Having burned through my 12 months of AWS Free Tier, running this operation currently costs ~$30/month because of the size of the RDS instance. We need a way to store backups of the database data in compressed monthly files that others can access, preferably hosted in S3 buckets on AWS. Until then, prepare a command to archive a month of all three tables to CSV, compress them, and store them in SpiderOak.

S3 token expired

Running python3 fetch_s3.py --bucket ttc.scrape --start_date 2019-04-01 --end_date 2019-05-01 --output_dir some_local_dir from the README throws a ClientError

Full error message:

botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the ListObjectsV2 operation: The provided token has expired.

I attempted different date ranges and networks and ran into the same issue. I ended up getting the data from the links in the channel instead, so it's all good.

Improve the Documentation

While the API is to some degree documented in a notebook under doc/, an entity-relationship diagram of the 3 tables would be helpful to newcomers, plus any other info you think would help someone jumping into this project!

Improve request processing time

This may just be my Pi struggling a little, but according to a 2-hour data sample (and bear in mind that create_date is truncated to the second), the completion time of each station request is on average 2.6 s, with a median of 2 s (truncated). There are 68 stations, so each run is taking longer than the run frequency (I think).

We could investigate using a different requests library to get non-blocking IO (example); see the sketch below.

Or, bundle inserts, which should reduce lost time from creating connections to the database.
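A hedged sketch of the non-blocking approach, using aiohttp with a semaphore to cap concurrency (the URL and query parameters are placeholders, not the real NTAS endpoint):

# Illustrative sketch of non-blocking requests with bounded concurrency.
# NTAS_URL and its parameters are placeholders, not the real API endpoint.
import asyncio
import aiohttp

NTAS_URL = "https://example.com/ntas"  # placeholder

async def fetch_station(session, sem, station_id, line_id):
    async with sem:  # limit concurrent requests to avoid hammering the API
        async with session.get(NTAS_URL, params={"stationid": station_id,
                                                 "lineid": line_id}) as resp:
            return await resp.json()

async def poll_all(station_line_pairs):
    sem = asyncio.Semaphore(5)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_station(session, sem, s, l) for s, l in station_line_pairs]
        return await asyncio.gather(*tasks)

# results = asyncio.run(poll_all([(1, 1), (2, 1), (3, 1)]))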

Bug trying to consolidate October Data

Extracting 2019-10-29.tar.gz
 98%|███████████████████████████████████▏| 40000/40983 [06:34<00:09, 100.08it/s]
Traceback (most recent call last):
  File "fetch_s3.py", line 212, in <module>
    fetch_s3()
  File "/home/rad/.local/share/virtualenvs/ttc_subway_times-ZmuzQ-JX/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/rad/.local/share/virtualenvs/ttc_subway_times-ZmuzQ-JX/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/rad/.local/share/virtualenvs/ttc_subway_times-ZmuzQ-JX/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rad/.local/share/virtualenvs/ttc_subway_times-ZmuzQ-JX/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "fetch_s3.py", line 208, in fetch_s3
    _fetch_s3(aws_access_key_id, aws_secret_access_key, output_dir, start_date, end_date, bucket)
  File "fetch_s3.py", line 197, in _fetch_s3
    fetch_and_transform(to_download, output_dir)
  File "fetch_s3.py", line 80, in fetch_and_transform
    jsons_to_csv(tmpdir, output_dir)
  File "fetch_s3.py", line 117, in jsons_to_csv
    pd.DataFrame.from_records(requests, columns=requests[0]._fields).to_csv(
IndexError: list index out of range

Map `station_char` from API to GTFS `stops`

From the API we get (see Data Dictionary) station_id and station_char. For comparing with scheduled performance, and doing stuff like mapping station and line locations, it would be useful to have a mapping between GTFS stops and the data from the API.

GTFS can be downloaded from here. Tools for dealing with GTFS data can be found here. I created two SQL files to create the structure in SQL and process the data:

Looks like the GTFS has at least one row per station-direction; here's a sample.

lineid stop_id stop_code stop_name
1 14457 13863 BLOOR STATION - NORTHBOUND PLATFORM
1 14414 13864 BLOOR STATION - SOUTHBOUND PLATFORM
1 14455 13808 COLLEGE STATION - NORTHBOUND PLATFORM
1 14416 13807 COLLEGE STATION - SOUTHBOUND PLATFORM
1 14461 13797 DAVISVILLE STATION - NORTHBOUND PLATFORM
1 14410 13798 DAVISVILLE STATION - SOUTHBOUND PLATFORM
1 15698 15664 DOWNSVIEW PARK STATION - NORTHBOUND PLATFORM
1 15699 15665 DOWNSVIEW PARK STATION - SOUTHBOUND PLATFORM
1 14454 13809 DUNDAS STATION - NORTHBOUND PLATFORM
1 14417 13810 DUNDAS STATION - SOUTHBOUND PLATFORM
1 14428 13828 DUPONT STATION - NORTHBOUND PLATFORM
1 14443 13827 DUPONT STATION - SOUTHBOUND PLATFORM
1 14462 13796 EGLINTON STATION - NORTHBOUND PLATFORM
1 14409 13795 EGLINTON STATION - SOUTHBOUND PLATFORM

Create survey

Create a user survey

Proposed questions (early draft):

  1. How often do you use the subway?
    Commute +, Commute, Weekly, Monthly, Less frequently

  2. Which lines do you use?
    Yonge-University, Bloor, Sheppard, Scarborough

  3. How frequently do you experience delays?
    more than 1/5, 1/10, 1/50, 1/100

  4. How much of a problem are delays?
    None, Minor, Moderate, Major, Severe

  5. Would you like to know what the history of delays is so that you can improve your trip planning?
    Not at all, Somewhat, Very much, Absolutely

  6. Are the current methods of announcing delays adequate?
    Not at all, Mostly, Absolutely

Document how to contribute

Now that testing and.... stuff has become a little bit more formalized, it would be helpful to new people to have instructions on how to submit Pull Requests. This would include the development tool set so they can familiarize themselves with what testing gets run and how when submitting PRs.

This should be in a CONTRIBUTING.md file in the (to be created) .github/ folder

Train arrival time algorithm

Start by selecting all the samples for a train ordered by timestamp.
Merge all samples for a train in the same sample period to get a good estimate of the location.
Use those location estimates to generate arrival time estimates.

API Throttling

We've noticed that the API seems to lock us out if we use the async method of sending requests. This appears to send too many requests too quickly.

The current serverless version of the data pipeline is using serial requests to the API instead and that seems fine.

Dunno if there's a way to have a sleep timer on the async, which seems a liiiiittle counter-intuitive.

Replicate and Improve Ntas API: GTFS-RT format

Having the scraper generate GTFS-RT data would have two benefits:

  1. Anyone creating apps that require GTFS-RT (a standard) would have access to a feed
  2. We could use and improve upon tools built on GTFS-RT data

This requires generating GTFS-RT in real... time, and then also reprocessing the archive of data.

Record sequence position of train observations

It would be helpful in algorithm design if the scraper recorded, for each train observation in a given request and direction, its position: 1st, 2nd, or 3rd.
This would be stored in the ntas_data table.

Document how to set up Jupyter Notebook

A lot of our data exploration and documentation is in Jupyter Notebooks; this is not an obvious format for new users.

  • Explain how to install Jupyter
  • Document how to set up the necessary environment to run the notebooks.

Contribute to the Wiki

I've added a page for news articles relevant to the project, since some journalists did some analysis of the delay OpenData. If you find more feel free to share. It would also be good to find articles about delays that were epic enough to get written about, so we can see what those service failures look like in the data.

There is also a Resources page where I've linked to some transit resources for tech tools, local projects, and research.

Malformed JSON error not the most descriptive

Periodically we get the below errors (the number in (275) is the line number in ttc_api_scraper.py printing that log message).

2018-11-20 20:08:09,876 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,876 (276): <generator object ClientResponse.text at 0x7f851c4ecba0>
2018-11-20 20:08:09,886 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:09,886 (274): Malformed JSON for station 52 on line 2
2018-11-20 20:08:09,886 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,876 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,876 (276): <generator object ClientResponse.text at 0x7f851c4ecba0>
2018-11-20 20:08:09,886 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:09,886 (274): Malformed JSON for station 52 on line 2
2018-11-20 20:08:09,886 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,886 (276): <generator object ClientResponse.text at 0x7f851c4d7f10>
2018-11-20 20:08:09,967 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:09,968 (274): Malformed JSON for station 59 on line 2
2018-11-20 20:08:09,968 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,968 (276): <generator object ClientResponse.text at 0x7f851c4e72b0>
2018-11-20 20:08:10,034 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,034 (274): Malformed JSON for station 66 on line 4
2018-11-20 20:08:10,034 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,034 (276): <generator object ClientResponse.text at 0x7f851c4ec620>
2018-11-20 20:08:10,059 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,059 (274): Malformed JSON for station 65 on line 4
2018-11-20 20:08:10,059 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,059 (276): <generator object ClientResponse.text at 0x7f851c4ec360>
2018-11-20 20:08:10,075 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,076 (274): Malformed JSON for station 60 on line 2
2018-11-20 20:08:10,076 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,076 (276): <generator object ClientResponse.text at 0x7f851c4e7518>
2018-11-20 20:08:10,098 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,098 (274): Malformed JSON for station 58 on line 2
2018-11-20 20:08:10,098 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,098 (276): <generator object ClientResponse.text at 0x7f851c4dffc0>
2018-11-20 20:08:10,099 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,099 (274): Malformed JSON for station 7 on line 1
2018-11-20 20:08:10,100 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,100 (276): <generator object ClientResponse.text at 0x7f851c4ecf10>

But the scraper then sleeps for 2s for that station, and tries 3 more times, and typically succeeds.

  • This could take up 1 line instead of 3, and we should try showing the actual HTML response we're getting from the API (see the sketch below).
  • Also this could be documented somewhere because I forgot
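A hedged sketch of what a single-line log message could look like, assuming an aiohttp-style response object (the function and variable names here are placeholders, not the scraper's actual code):

# Sketch: log the malformed-JSON case in one line, including a snippet of the
# HTML body actually returned by the API. Names are placeholders.
import logging

logger = logging.getLogger(__name__)

async def log_malformed_json(resp, station_id, line_id):
    body = await resp.text()
    logger.error(
        "Malformed JSON for station %s on line %s (HTTP %s): %r",
        station_id, line_id, resp.status, body[:200],  # first 200 chars of HTML
    )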

Migrate scraper and database to the cloud?

The scraper isn't necessarily the fastest on the Pi. And there is certainly a limit to the amount of data that can be stored on the SD card.

If I get a few hours I could set up my old desktop as a server. Otherwise open to suggestions for cheap hosting options.

serverless scraper appears to consistently produce less data

Comparing 2019-03-25 from the old python scraper with the same date from the serverless one.

I had to convert the folder of JSON files into CSVs using the jsons_to_csv() function from fetch_s3.py.

WITH new_data AS (SELECT date_trunc('hour', create_date) request_ts, COUNT(id) AS new_ids
FROM requests_serverless 
RIGHT OUTER JOIN ntas_data_serverless serverless ON serverless.requestid =  requests_serverless.requestid
WHERE create_date >= '2019-03-25 '  AND 
create_date < '2019-03-26'
GROUP BY request_ts),
old_data AS (SELECT date_trunc('hour', requests.create_date) request_ts, COUNT(id) AS old_ids
FROM requests
RIGHT OUTER JOIN ntas_data USING(requestid)
WHERE requests.create_date >= '2019-03-25'  AND 
requests.create_date < '2019-03-26'
GROUP BY request_ts)

SELECT request_ts, old_ids, new_ids
FROM new_data
RIGHT OUTER JOIN old_data USING(request_ts)
ORDER BY request_ts

'request_ts';'old_ids';'new_ids'
'2019-03-25 00:00:00';23,985;
'2019-03-25 01:00:00';21,499;
'2019-03-25 04:00:00';2;
'2019-03-25 05:00:00';10,433;6,212
'2019-03-25 06:00:00';24,546;13,126
'2019-03-25 07:00:00';25,393;13,575
'2019-03-25 08:00:00';25,444;13,509
'2019-03-25 09:00:00';24,943;13,395
'2019-03-25 10:00:00';25,279;13,582
'2019-03-25 11:00:00';25,342;13,489
'2019-03-25 12:00:00';24,901;13,393
'2019-03-25 13:00:00';25,218;13,415
'2019-03-25 14:00:00';24,899;13,355
'2019-03-25 15:00:00';25,107;13,355
'2019-03-25 16:00:00';25,061;13,488
'2019-03-25 17:00:00';25,301;13,551
'2019-03-25 18:00:00';25,309;13,520
'2019-03-25 19:00:00';24,948;13,534
'2019-03-25 20:00:00';25,401;13,605
'2019-03-25 21:00:00';24,932;13,438
'2019-03-25 22:00:00';24,393;13,051
'2019-03-25 23:00:00';23,879;12,854

Migrate scraper to use AWS Lambda

As another cost-saving measure, instead of having a dedicated EC2 instance running the data-scraping script, we could be hosting this on AWS Lambda, which would presumably be cheaper.

GRAND VISION

Discuss the grand vision for this project

Loosely basing this vision on the below framework

[image: framework diagram]

Scraper requires Python 3.5

The pycares dependency doesn't seem to work in 3.6. Should include details for setting up a virtualenv in the README.

Thots on processing NTAS data

Hi Raphael – I have a few brief thoughts on beginning to process the NTAS data that I thought I’d share in case they’re helpful. I’d be happy to start playing with implementing this processing once some data is collected (even a day’s worth).

There are two general approaches that I think could be fruitfully combined. The first approach tries to reconstruct the system’s prediction model by extracting predicted travel times between stations and looking for typical and exceptional patterns (if the system is really dumb, there will be no exceptional patterns and all we’ll get is constant travel times between stations; if the system is smart, we’ll get more information – see below). The second approach tracks variation in predicted times for each train as it moves through the system.

Both approaches assume a database that stores a single train prediction per record, with some pre-processing done to create a field called PAT (predicted arrival time) – just createDate + timeInt. So a record would have stationId, trainId, PAT, createDate, etc. I’m assuming a trainId refers to a “run” of the train, as Sammi deGuzman’s blog suggests. If the same trainId appears on multiple runs, some time-based filtering will have to happen below to make sure we’re picking up only a single run of a train.

  1. Reconstructing the system’s prediction model by extracting predicted travel times between stations.

Suppose we have two records with the same trainId and different stationId’s. Then subtracting PATs gives us a travel time estimate (TTE) between those stations (technically, it also includes load/unload times).

If the system is stupid, TTEs between any pair of stations will be constant. This means that there’s a very high degree of redundancy in the NTAS data and there’s no reason to save observations of the same train from multiple stations for future analysis (or alternatively, observations of the same train from multiple stations at different times can be combined very easily).

If the system is smart, TTEs could vary for a number of reasons:
- High passenger volume periods increase load/unload time
- High train volume means slower train speeds
- Traffic adjustment might mean slower train speeds
- Delays (for any reason) might demand traffic adjustment – meaning slower train speeds

Simply making a histogram of TTEs for any pair of stations should tell us whether the system is smart or not and what kinds of variations it might be picking up. If the system is smart, looking at unusual TTEs and seeing how they move around between stations might give us insight into how local delays propagate through the prediction model.

If building a table of TTEs, it’s probably a good idea to record the data the TTEs came from – i.e. the two original records that generate each TTE. The table should also contain a creationDate, though it’s not clear what that date should be if the records used to create the TTE have different times (they certainly will, since we’re doing low-frequency sampling). So record both creationDates?

Some filtering will be required when creating TTEs to use only records sampled close together in time (say, choose the closest times possible, and enforce some maximum time difference); this avoids junk estimates produced if traffic conditions change between the sampling of the two original records.

  2. Tracking train predictions

Suppose we have multiple records with the same trainId and stationId. Order them by creationDate and subtract the first PAT from all the others (alternatively, could calculate running differences); augment each record by putting this difference in a field called “localDelay”. This seems good enough to start identifying problems. Comparing local delays across stations will also help describe how they propagate through the prediction model.
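A hedged sketch of both computations with pandas (column names follow the discussion above, e.g. createDate and timeInt; the real tables use different names such as create_date and timint):

# Illustrative sketch of the two approaches above, using pandas.
import pandas as pd

df = pd.read_csv("ntas_sample.csv", parse_dates=["createDate"])
df["PAT"] = df["createDate"] + pd.to_timedelta(df["timeInt"], unit="min")

# 1. TTEs between a pair of stations for the same train, pairing records
#    sampled close together in time (here, at most 2 minutes apart).
a = df[df["stationId"] == 1][["trainId", "createDate", "PAT"]]
b = df[df["stationId"] == 2][["trainId", "createDate", "PAT"]]
pairs = pd.merge_asof(a.sort_values("createDate"), b.sort_values("createDate"),
                      on="createDate", by="trainId", suffixes=("_a", "_b"),
                      tolerance=pd.Timedelta("2min"))
pairs["TTE"] = pairs["PAT_b"] - pairs["PAT_a"]

# 2. localDelay: how a train's predicted arrival at one station drifts
#    relative to its first observed prediction.
df = df.sort_values("createDate")
df["localDelay"] = df.groupby(["trainId", "stationId"])["PAT"].transform(
    lambda s: s - s.iloc[0])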

Consolidate past dates

After #55, it would be useful to consolidate past dates. Consolidate should therefore take a date as a parameter.

Record poll number in database

It would be helpful in algorithm design to be able to treat each cycle through the stations as a single "poll" of the entire system.
I propose assigning each poll a unique sequential number, similar to requestid.
This number could be stored in the requests table and be used to pull together all requests from a single poll.

Consolidate function failing on AWS

[ERROR] ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
Traceback (most recent call last):
  File "/var/task/src/ttc_api_scraper/consolidate.py", line 110, in handler
    consolidate()
  File "/var/task/src/ttc_api_scraper/consolidate.py", line 52, in consolidate
    download_dir(client, s3_bucket, "{consoli_date}/".format(consoli_date=consoli_date), scrape_path)
  File "/var/task/src/ttc_api_scraper/consolidate.py", line 93, in download_dir
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
  File "/var/runtime/botocore/paginate.py", line 255, in __iter__
    response = self._make_request(current_kwargs)
  File "/var/runtime/botocore/paginate.py", line 332, in _make_request
    return self._method(**current_kwargs)
  File "/var/runtime/botocore/client.py", line 320, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 623, in _make_api_call
    raise error_class(parsed_response, operation_name)

I think it might be something with the IAM user I configured; I got this policy from.... somewhere.

{
    "Statement": [
        {
            "Action": [
                "apigateway:*",
                "cloudformation:CancelUpdateStack",
                "cloudformation:ContinueUpdateRollback",
                "cloudformation:CreateChangeSet",
                "cloudformation:CreateStack",
                "cloudformation:CreateUploadBucket",
                "cloudformation:DeleteStack",
                "cloudformation:Describe*",
                "cloudformation:EstimateTemplateCost",
                "cloudformation:ExecuteChangeSet",
                "cloudformation:Get*",
                "cloudformation:List*",
                "cloudformation:PreviewStackUpdate",
                "cloudformation:UpdateStack",
                "cloudformation:UpdateTerminationProtection",
                "cloudformation:ValidateTemplate",
                "dynamodb:CreateTable",
                "dynamodb:DeleteTable",
                "dynamodb:DescribeTable",
                "ec2:AttachInternetGateway",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CreateInternetGateway",
                "ec2:CreateNetworkAcl",
                "ec2:CreateNetworkAclEntry",
                "ec2:CreateRouteTable",
                "ec2:CreateSecurityGroup",
                "ec2:CreateSubnet",
                "ec2:CreateTags",
                "ec2:CreateVpc",
                "ec2:DeleteInternetGateway",
                "ec2:DeleteNetworkAcl",
                "ec2:DeleteNetworkAclEntry",
                "ec2:DeleteRouteTable",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteSubnet",
                "ec2:DeleteVpc",
                "ec2:Describe*",
                "ec2:DetachInternetGateway",
                "ec2:ModifyVpcAttribute",
                "events:DeleteRule",
                "events:DescribeRule",
                "events:ListRuleNamesByTarget",
                "events:ListRules",
                "events:ListTargetsByRule",
                "events:PutRule",
                "events:PutTargets",
                "events:RemoveTargets",
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:DeleteRolePolicy",
                "iam:GetRole",
                "iam:PassRole",
                "iam:PutRolePolicy",
                "iot:CreateTopicRule",
                "iot:DeleteTopicRule",
                "iot:DisableTopicRule",
                "iot:EnableTopicRule",
                "iot:ReplaceTopicRule",
                "kinesis:CreateStream",
                "kinesis:DeleteStream",
                "kinesis:DescribeStream",
                "lambda:*",
                "logs:CreateLogGroup",
                "logs:DeleteLogGroup",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:FilterLogEvents",
                "logs:GetLogEvents",
                "s3:CreateBucket",
                "s3:DeleteBucket",
                "s3:DeleteBucketPolicy",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:PutBucketNotification",
                "s3:PutBucketPolicy",
                "s3:PutBucketTagging",
                "s3:PutBucketWebsite",
                "s3:PutEncryptionConfiguration",
                "s3:PutObject",
                "sns:CreateTopic",
                "sns:DeleteTopic",
                "sns:GetSubscriptionAttributes",
                "sns:GetTopicAttributes",
                "sns:ListSubscriptions",
                "sns:ListSubscriptionsByTopic",
                "sns:ListTopics",
                "sns:SetSubscriptionAttributes",
                "sns:SetTopicAttributes",
                "sns:Subscribe",
                "sns:Unsubscribe",
                "states:CreateStateMachine",
                "states:DeleteStateMachine"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}
