
openaddresses / batch


OpenAddresses/Machine-based ETL processing on AWS Batch

Home Page: https://batch.openaddresses.io/

License: MIT License

JavaScript 67.30% Shell 0.14% Dockerfile 0.20% HTML 0.11% Vue 32.24%
addresses geocoder geocoding geospatial gis openaddresses

batch's Introduction

OpenAddresses

Brief

A global collection of address, cadastral parcel and building footprint data sources, open and free to use. Join, download and contribute. We're just getting started.

This repository is a collection of references to address, cadastral parcel and building footprint data sources.

Contributing addresses

  • Open an issue and give information about where to find more address data. Be sure to include a link to the data and a description of the coverage area for the data.
  • You can also create a pull request to the sources directory.
  • More details in CONTRIBUTING.md.

Why collect addresses?

Street address data is essential infrastructure. Street names, house numbers, and post codes, combined with geographic coordinates, connect digital to physical places. Free and open addresses are rocket fuel for civic and commercial innovation.

Contributors

Code Contributors

This project exists thanks to all the people who contribute. [Contribute].

Financial Contributors

Become a financial contributor and help us sustain our community. [Contribute]

Individuals

Organizations

Support this project with your organization. Your logo will show up here with a link to your website. [Contribute]

License

The data produced by the OpenAddresses processing pipeline (available on batch.openaddresses.io) is not relicensed from the original sources. Individual sources will have their own licenses. The OpenAddresses team does its best to summarize the source licenses in the source JSON for each source. For example, the source JSON for the County of San Francisco contains a link to the County of San Francisco's open data license.

The source JSON in this repo (in the sources/ directory) is licensed under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication as described in the license file there. The rest of the repository is licensed under the BSD 3-Clause License.

batch's People

Contributors

dependabot[bot], iandees, ingalls, missinglink, rzmk


batch's Issues

Bin#Match

Context

The Bin#match function is called when a job from a live run is finished successfully. It takes the source and updates the map with coverage information as necessary.

Actions

Ensure the Bin#match function is able to successfully match all of the following (a rough sketch of the coverage levels follows this list):

  • Country Level Sources
  • Region Level Sources (Province/State/Territory)
  • District Level Sources
  • Custom GeoJSON
  • Add point support to map backend
  • Add point support to map UI
  • Add full test coverage
  • Add zoom based scaling of points
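
Purely as a hypothetical illustration of the coverage levels listed above (the real Bin#match in this repo may look quite different), source names follow the country/region/city convention of the sources directory, e.g. de/countrywide or ca/ab/province:

```js
// Hypothetical illustration only; the real Bin#match may differ substantially.
// Source names look like "de/countrywide", "ca/ab/province", or "us/ca/san_francisco".
function coverageLevel(name) {
    const parts = name.replace(/\.json$/, '').split('/');
    const last = parts[parts.length - 1];

    if (last === 'countrywide') return 'country';
    if (last === 'statewide' || last === 'province' || parts.length === 2) return 'region';
    if (parts.length >= 3) return 'district';
    return 'custom'; // anything else would fall back to custom GeoJSON coverage
}
```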

Filter by Map Click

Context

The interactive map on the /data page should allow click events to filter the list of sources (a query sketch follows the actions below).

Actions

  • Add bounds column to job table
  • Populate bounds column with stats data from batch task
  • Make job => map relational instead of transactional updates
  • Add point filter to data endpoint
  • Update UI to show click event & perform filter
  • Update UI to be able to cancel filter
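
A hedged sketch of the point filter, assuming the proposed bounds column is a PostGIS geometry (SRID 4326) and a node-postgres pool is in scope; table and column names are assumptions:

```js
// Sketch: return jobs whose coverage bounds contain a clicked point.
async function jobsAtPoint(pool, lng, lat) {
    const res = await pool.query(`
        SELECT id, source_name, status
            FROM job
            WHERE bounds IS NOT NULL
                AND ST_Contains(bounds, ST_SetSRID(ST_MakePoint($1, $2), 4326))
    `, [lng, lat]);

    return res.rows;
}
```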

Summary Statistics

Context

Although the results.openaddresses.io stats have been broken for some time, we should create a chart that is similar.

(Screenshot attached, 2020-08-22)

Login Component

Context

The login component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling
  • When a page is initially loaded - determine if a user is authenticated and set auth object

Don't use "optimal" in batch compute environments

The compute environments that run the fetch tasks use the "optimal" instance type in the AWS Batch configuration. That defaults to using some pretty big and old instance types. It'd be better if we could be more specific with instance types ("t3.small") or even just an instance type family ("t3"). We should continue using Spot instances, though. This would save significant amounts of money.
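
Roughly, the change could look like the following sketch using the AWS SDK for JavaScript; the environment name, subnets, roles, and vCPU limits are placeholders rather than the project's real configuration:

```js
// Sketch only: pin the compute environment to a small instance family and
// keep Spot purchasing. All names, subnets, and roles below are placeholders.
const AWS = require('aws-sdk');
const batch = new AWS.Batch({ region: 'us-east-1' });

batch.createComputeEnvironment({
    computeEnvironmentName: 'batch-prod-fetch',   // placeholder name
    type: 'MANAGED',
    serviceRole: 'arn:aws:iam::123456789012:role/AWSBatchServiceRole', // placeholder
    computeResources: {
        type: 'SPOT',                               // keep using Spot instances
        bidPercentage: 60,
        instanceTypes: ['t3.small', 't3.medium'],   // instead of 'optimal'
        minvCpus: 0,                                // scale to zero when idle
        maxvCpus: 16,
        subnets: ['subnet-0123456789abcdef0'],      // placeholder
        securityGroupIds: ['sg-0123456789abcdef0'], // placeholder
        instanceRole: 'ecsInstanceRole'             // placeholder
    }
}, (err, res) => {
    if (err) throw err;
    console.error(res.computeEnvironmentArn);
});
```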

Unchanged Data

Context

If data hasn't changed in over a year, add it to the warn list

Verify Email

Context

Not unexpectedly, we immediately got a large number of spam email accounts. At the very least we should send an email verification

Stats Check

Context

There is currently no protection against an address source degrading significantly. We should at least flag a source whose total address count drops significantly between runs (a sketch of the comparison follows the actions below).

Actions

  • Add stats UI to job page
  • Add stats diff UI to job page
  • Add BBOX UI to job page
  • Add BBOX UI diff to job page
  • Add Warn type to job status
  • Add session management
  • Add basic Admin Component
  • Add tab for marking Warn jobs as Success or Failure (Authenticated)
  • Task runner must perform stats comparison and potentially create a job error
  • If Job is part of live run and fails, create a job error
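
A minimal sketch of the comparison the task runner could perform, assuming per-run stats expose a total address count; the field names and the 10% threshold are assumptions:

```js
// Sketch: compare the current run's address count against the previous run
// and flag a Warn status when the drop exceeds a tolerance.
function checkStats(previous, current, tolerance = 0.1) {
    if (!previous || !previous.count) return { status: 'Success' };

    const drop = (previous.count - current.count) / previous.count;
    if (drop > tolerance) {
        return {
            status: 'Warn',
            message: `address count dropped ${(drop * 100).toFixed(1)}% (${previous.count} => ${current.count})`
        };
    }

    return { status: 'Success' };
}
```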

Find another way to build dotmap

The dotmap on openaddresses.io is sourced from this code that uploads the complete listing of all addresses in OA to Mapbox. This costs a bunch of money, so I disabled the cronjob that runs the dotmap + upload process in AWS.

We need to find another way of generating this layer and keeping it up to date.

Set the correct Content-Type on S3 uploads

e.g. for https://v2.openaddresses.io/batch-prod/job/1004/source.png the Content-Type is not set, so it defaults to an octet stream and the browser downloads the image instead of displaying it.
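
A sketch of the fix, assuming an aws-sdk putObject call is where the upload happens; the bucket name simply mirrors the example URL above and is a placeholder:

```js
// Sketch: set ContentType explicitly so browsers render the PNG instead of
// downloading an octet stream. Bucket and key mirror the example URL above.
const AWS = require('aws-sdk');
const fs = require('fs');
const s3 = new AWS.S3();

async function uploadPreview(pngPath) {
    const body = await fs.promises.readFile(pngPath);

    return s3.putObject({
        Bucket: 'v2.openaddresses.io',              // placeholder bucket
        Key: 'batch-prod/job/1004/source.png',
        Body: body,
        ContentType: 'image/png'
    }).promise();
}
```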

Failed Data Jobs

Context

To make the process of fixing broken sources easier, there should be a page where the user is able to view recently failing runs

Actions

  • Add job endpoint to only retrieve live runs
  • Add job endpoint to filter by status
  • Update UI to use these filters on a recent job failures page

Collection Size

Context

Track the collection size in the database and display it via the UI

Actions

  • Add size col in collection table
  • Self report the collection size upon collection update
  • Display the collection size via the UI

Job Errors Loading

Context

The JobErrors page does not show a loading bar and instead flashes "No Errors found" every time a user navigates to the page.

Actions

  • Add better loading behavior
  • Update the error count in the upper tab when the user refreshes the list
  • Don't reload page on suppress, just splice out of results

Upload Papercuts

Context

The upload function now works as expected but could use a couple of improvements.

Actions

  • Add a close button in the upper right-hand corner while uploading that will abort the XHR
  • Add a warning if the user tries to navigate away from the page while uploading, notifying them that this will kill the current upload (see the sketch after this list)
  • Add an API for listing past uploads
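
A minimal sketch of the navigation warning, using the standard beforeunload event; the uploading flag stands in for whatever state the upload component actually keeps:

```js
// Sketch: warn before navigating away while an upload is in flight.
let uploading = false; // set to true when the XHR starts, false when it settles

window.addEventListener('beforeunload', (event) => {
    if (!uploading) return;
    event.preventDefault();
    event.returnValue = ''; // required by most browsers to actually show the prompt
});
```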

5xx Error Master Ticket

Context

(Screenshot attached, 2021-02-27)

I'm seeing a small but consistent number of 5xx errors from the API that all appear to be from the Job Error API.

Need to track this down so I stop getting emails.

Data download links point to wrong files

As reported by an external data user:

The files downloaded do not match the file labels. For instance, attempting to download the Canadian province of Alberta (ca/ab/province) instead opens a file for a city in Brazil. (See attached screenshot.)


Skip Sources

Context

If a source has the skip: true property, don't even fire a batch task
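
A minimal sketch of the guard, assuming a submitJob helper that fires the AWS Batch task (both names are placeholders):

```js
// Sketch: short-circuit before firing an AWS Batch task when a source
// layer carries skip: true.
async function maybeSubmit(source, submitJob) {
    if (source.skip === true) {
        console.error(`skipping ${source.name}: skip flag set`);
        return false;
    }

    await submitJob(source);
    return true;
}
```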

Register Component

Context

The register component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling

Rerun Restrictions

Context

Allow GitHub reruns indefinitely, but only allow job reruns if the job is less than ~1 week old, to prevent a very old job from overwriting a newer one.
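
A minimal sketch of the age check; the job field names and the exact one-week window are assumptions:

```js
// Sketch: GitHub-triggered reruns are always allowed, job reruns only
// within roughly a week of the job's creation.
const MAX_RERUN_AGE = 7 * 24 * 60 * 60 * 1000; // ~1 week in ms

function canRerun(job) {
    if (job.source === 'github') return true;
    return Date.now() - new Date(job.created).getTime() <= MAX_RERUN_AGE;
}
```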

Error Handling

Context

At the moment the Login component is the only component with solid error handling. Every API call should inform the user if it cannot fall back to a safe backup option (see the sketch after the actions below).

Actions

  • Ensure every fetch has a catch that potentially triggers the Error component
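
A hedged sketch of the pattern; the /api/job path is illustrative and the err event name is an assumption about how the Error component would listen:

```js
// Sketch only: every fetch gets a catch that surfaces the failure
// instead of failing silently.
async function getJobs(vm) {
    try {
        const res = await fetch(`${window.location.origin}/api/job`);
        if (!res.ok) throw new Error(await res.text());
        return await res.json();
    } catch (err) {
        vm.$emit('err', err); // assumed contract for triggering the Error component
        return [];
    }
}
```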

User Paging

Context

The admin page is starting to overflow due to the number of users

Actions

  • Add default limit to returned usernames
  • Add paging system for users (see the sketch after this list)
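
A minimal sketch of limit/offset paging on the user list, assuming a node-postgres pool and a users table; query parameter names and the 100-row cap are placeholders:

```js
// Sketch: default limit plus simple page offsets on the user list endpoint.
async function listUsers(pool, query) {
    const limit = Math.min(parseInt(query.limit, 10) || 100, 100); // default + cap
    const page = parseInt(query.page, 10) || 0;

    const res = await pool.query(`
        SELECT id, username, email
            FROM users
            ORDER BY username
            LIMIT $1
            OFFSET $2
    `, [limit, limit * page]);

    return res.rows;
}
```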

Reduce costs

After switching to batch, our costs have increased dramatically. At first this was because we were using "optimal" in the AWS Batch configuration, which started rather large instances and kept them running unnecessarily. After #80 we switched to c5 instances, but AWS Batch is still starting larger instances and keeping them around longer than needed. This makes our AWS bill roughly twice what it was before we switched to batch.

Can we try to use c5.large instances and reduce the maximum number of vCPUs in the batch compute environment? Maybe reduce the requested memory or CPU for each task?

Don't run CI again on merge

The data please bot seems to run on merge to master, resulting in another scrape + image going on the PR after it was merged. It shouldn't run on merge.

Source Removal

Context

On each weekly run, the currently stored data results should be compared against the source JSONs. If there are source JSONs that no longer exist, the admin should be prompted to remove the data sources from the batch platform

Fully Doc & Host API Endpoints

Context

Fully document and host in-code generated API documentation.

Actions

  • Investigate APIDoc
  • Document existing routes
  • Document POST/PATCH bodies
  • Document general return JSON
  • Document important non-generic Error states

Warn On Non-200 Website

Context

In the check_sources portion of the task, attempt to curl the website and check for a 200 status code.
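
A minimal sketch of the check, assuming the source JSON carries a website field and a global fetch (Node 18+) is available in the task; the returned strings are illustrative warning messages:

```js
// Sketch: probe the source's website and report anything other than a 200
// as a warning rather than a hard failure.
async function checkWebsite(source) {
    if (!source.website) return null;

    try {
        const res = await fetch(source.website, { method: 'HEAD', redirect: 'follow' });
        if (res.status !== 200) return `website returned ${res.status}`;
    } catch (err) {
        return `website unreachable: ${err.message}`;
    }

    return null;
}
```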

Run Auto Live

Context

The current results page does not show CI data that is successfully merged into master, meaning that users are forced to scrape the CI runs page to get the latest data if they can't wait for the scheduled runs.

Actions

  • Monitor GH Actions for merge event
  • If a PR is merged into master, mark the run as live

DotMap

Context

OpenAddresses/Machine currently performs the update of the openaddresses.io dotmap. We should create a new scheduled event that downloads the global collection and performs the dotmap update (a tippecanoe sketch follows the actions below).

Actions

  • Create dotmap event
  • Download global collection
  • Use tippecanoe to create vector layer
  • Upload layer tile-by-tile via new Mapbox API to avoid size restrictions
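
A sketch of the tippecanoe step only (the tile-by-tile Mapbox upload is not shown); paths and flags are illustrative:

```js
// Sketch: run tippecanoe over the downloaded global collection to build
// the vector layer for the dotmap.
const { execFileSync } = require('child_process');

execFileSync('tippecanoe', [
    '-o', '/tmp/dotmap.mbtiles',        // placeholder output path
    '-zg',                              // let tippecanoe choose a max zoom
    '--drop-densest-as-needed',         // thin points rather than overflow tiles
    '/tmp/collection-global.geojson'    // placeholder input path
], { stdio: 'inherit' });
```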

Job Error

Show job error on Job Page if one exists

preview first 10 features in job output

Is your feature request related to a problem? Please describe.
When uploading a new source, or making changes to an existing one, ideally I would verify that it's being processed as expected before the PR is merged. The job preview at https://batch.openaddresses.io/job/21189/ shows a map, which is really helpful for a quick scan to make sure the projection is roughly correct and that most of the data is being loaded, but it doesn't show whether the attributes are parsed correctly.

Describe the solution you'd like
I can download the processed data; however, for large sources that means downloading the whole dataset when I really just want to see the first 5 or 10 features, which is usually enough for a first-pass check that the parsing is correct.
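
A minimal sketch of how the API could return only the first few features, assuming the job output is gzipped line-delimited GeoJSON streamed from storage:

```js
// Sketch: read only the first N lines of a gzipped line-delimited GeoJSON
// stream instead of downloading the whole artifact.
const zlib = require('zlib');
const readline = require('readline');

async function previewFeatures(stream, n = 10) {
    const rl = readline.createInterface({ input: stream.pipe(zlib.createGunzip()) });
    const features = [];

    for await (const line of rl) {
        features.push(JSON.parse(line)); // each line is assumed to be one feature
        if (features.length >= n) break;
    }

    rl.close();
    return features;
}
```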

Data Backfill

Context

We should backfill the v2 service with all of the last good runs from the results service.

Actions

  • Write a script to get the list of S3 locations of the latest runs (see the sketch below)
  • Write script to download and convert to GeoJSONLD
  • Directly override these files into the database - skipping the runner

cc/ @iandees
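
A minimal sketch of the S3 listing step, using the aws-sdk; the bucket and prefix are placeholders:

```js
// Sketch: page through S3 and collect the keys of the latest run outputs.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function listRuns(prefix) {
    const keys = [];
    let token;

    do {
        const res = await s3.listObjectsV2({
            Bucket: 'data.openaddresses.io',    // placeholder bucket
            Prefix: prefix,
            ContinuationToken: token
        }).promise();

        keys.push(...res.Contents.map((obj) => obj.Key));
        token = res.NextContinuationToken;      // undefined on the last page
    } while (token);

    return keys;
}
```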

Invalid Coordinates

Context

As I was trying to add new Wyoming sources, they succeeded but had invalid coordinates. The stats module should also track the number of valid vs. invalid lat/lngs (a sketch of the counter follows the actions below).

Actions

  • Track number of valid coords
  • Issue a warning or failure if the invalid share is above a certain %
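
A minimal sketch of the counter the stats module could keep; the bounds checks and the 0,0 guard are assumptions about what should count as invalid:

```js
// Sketch: count features with plausible vs. implausible coordinates.
function coordStats(features) {
    let valid = 0;
    let invalid = 0;

    for (const feat of features) {
        const coords = (feat.geometry && feat.geometry.coordinates) || [];
        const [lng, lat] = coords;

        if (Number.isFinite(lng) && Number.isFinite(lat)
            && Math.abs(lng) <= 180 && Math.abs(lat) <= 90
            && !(lng === 0 && lat === 0)) {
            valid++;
        } else {
            invalid++;
        }
    }

    return { valid, invalid };
}
```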

GH Issue Bot

Context

The Github CI integration should post images & stats to the PR once a job completes

Actions

  • Create an issue on ci job success
  • Include picture in issue
  • Include stats table in issue

Year Tag

Context

Many sources have a year tag for ensuring static data is updated. If a year tag is older than 1 year, start to consistently WARN the source.

US-TX-McLennan Will Run for Days

Context

US-TX-McLennan will run for days if not manually terminated

Job: https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/cc97737d-a9c7-479d-9423-d3d11d48b2e3
CWL: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fbatch$252Fjob/log-events/batch-prod-job$252Fdefault$252Fddd79241891746339c201b6dd93052d2

Actions

  • Add default timeout value (see the sketch after this list)
  • Add ability to increase timeout value in schema
  • Fix McLennan to not consume infinite resources
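
A minimal sketch of the default timeout, using the per-job timeout AWS Batch already supports; the queue, definition, and duration values are placeholders:

```js
// Sketch: submit the job with an attempt duration so runaway sources are
// killed automatically instead of running for days.
const AWS = require('aws-sdk');
const batch = new AWS.Batch({ region: 'us-east-1' });

batch.submitJob({
    jobName: 'oa-us-tx-mclennan',
    jobQueue: 'batch-prod-queue',                       // placeholder
    jobDefinition: 'batch-prod-job',                    // placeholder
    timeout: { attemptDurationSeconds: 6 * 60 * 60 }    // kill after ~6 hours
}, (err) => {
    if (err) throw err;
});
```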

Generate Weekly Data Dumps

Context

Most users download our large data dumps, but generating these dumps is not currently supported.

Actions

  • Add global dump
  • Add config definition for data dumps
  • Add data dump batch task
  • Add data dump API -> fire batch task
  • Add lambda schedule to hit data dump API
  • Add UI panel for displaying data dumps
