
openaddresses / batch


OpenAddresses/Machine-based ETL processing on AWS Batch

Home Page: https://batch.openaddresses.io/

License: MIT License

JavaScript 67.30% Shell 0.14% Dockerfile 0.20% HTML 0.11% Vue 32.24%
addresses geocoder geocoding geospatial gis openaddresses

batch's Introduction

OpenAddresses

Brief

A global collection of address, cadastral parcel and building footprint data sources, open and free to use. Join, download and contribute. We're just getting started.

This repository is a collection of references to address, cadastral parcel and building footprint data sources.

Contributing addresses

  • Open an issue and give information about where to find more address data. Be sure to include a link to the data and a description of the coverage area for the data.
  • You can also create a pull request to the sources directory.
  • More details in CONTRIBUTING.md.

Why collect addresses?

Street address data is essential infrastructure. Street names, house numbers, and post codes, combined with geographic coordinates, connect digital to physical places. Free and open addresses are rocket fuel for civic and commercial innovation.

Contributors

Code Contributors

This project exists thanks to all the people who contribute. [Contribute].

Financial Contributors

Become a financial contributor and help us sustain our community. [Contribute]

Individuals

Organizations

Support this project with your organization. Your logo will show up here with a link to your website. [Contribute]

License

The data produced by the OpenAddresses processing pipeline (available on batch.openaddresses.io) is not relicensed from the original sources. Individual sources will have their own licenses. The OpenAddresses team does its best to summarize the source licenses in the source JSON for each source. For example, the source JSON for the County of San Francisco contains a link to the County of San Francisco's open data license.

The source JSON in this repo (in the sources/ directory) is licensed under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication as described in the license file there. The rest of the repository is licensed under the BSD 3-Clause License.

batch's People

Contributors

dependabot[bot], iandees, ingalls, missinglink, rzmk


batch's Issues

Bin#Match

Context

The Bin#match function is called when a job from a live run is finished successfully. It takes the source and updates the map with coverage information as necessary.

Actions

Ensure the Bin#match function is able to successfully match all of the following (a rough sketch of the coverage levels follows this list):

  • Country Level Sources
  • Region Level Sources (Province/State/Territory)
  • District Level Sources
  • Custom GeoJSON
  • Add point support to map backend
  • Add point support to map UI
  • Add full test coverage
  • Add zoom based scaling of points
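
Purely as a hypothetical illustration of the coverage levels listed above (the real Bin#match in this repo may look quite different), source names follow the country/region/city convention of the sources directory, e.g. de/countrywide or ca/ab/province:

```js
// Hypothetical illustration only; the real Bin#match may differ substantially.
// Source names look like "de/countrywide", "ca/ab/province", or "us/ca/san_francisco".
function coverageLevel(name) {
    const parts = name.replace(/\.json$/, '').split('/');
    const last = parts[parts.length - 1];

    if (last === 'countrywide') return 'country';
    if (last === 'statewide' || last === 'province' || parts.length === 2) return 'region';
    if (parts.length >= 3) return 'district';
    return 'custom'; // anything else would fall back to custom GeoJSON coverage
}
```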

Filter by Map Click

Context

The interactive map on the /data page should allow click events to filter the list of sources (a query sketch follows the actions below).

Actions

  • Add bounds column to job table
  • Populate bounds column with stats data from batch task
  • Make job => map relational instead of transactional updates
  • Add point filter to data endpoint
  • Update UI to show click event & perform filter
  • Update UI to be able to cancel filter
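
A hedged sketch of the point filter, assuming the proposed bounds column is a PostGIS geometry (SRID 4326) and a node-postgres pool is in scope; table and column names are assumptions:

```js
// Sketch: return jobs whose coverage bounds contain a clicked point.
async function jobsAtPoint(pool, lng, lat) {
    const res = await pool.query(`
        SELECT id, source_name, status
            FROM job
            WHERE bounds IS NOT NULL
                AND ST_Contains(bounds, ST_SetSRID(ST_MakePoint($1, $2), 4326))
    `, [lng, lat]);

    return res.rows;
}
```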

Summary Statistics

Context

Although the results.openaddresses.io stats have been broken for some time, we should create a chart that is similar.

(Screenshot attached, 2020-08-22)

Login Component

Context

The login component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling
  • When a page is initially loaded - determine if a user is authenticated and set auth object

Don't use "optimal" in batch compute environments

The compute environments that run the fetch tasks use the "optimal" instance type in the AWS Batch configuration. That defaults to using some pretty big and old instance types. It'd be better if we could be more specific with instance types ("t3.small") or even just an instance type family ("t3"). We should continue using Spot instances, though. This would save significant amounts of money.
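
Roughly, the change could look like the following sketch using the AWS SDK for JavaScript; the environment name, subnets, roles, and vCPU limits are placeholders rather than the project's real configuration:

```js
// Sketch only: pin the compute environment to a small instance family and
// keep Spot purchasing. All names, subnets, and roles below are placeholders.
const AWS = require('aws-sdk');
const batch = new AWS.Batch({ region: 'us-east-1' });

batch.createComputeEnvironment({
    computeEnvironmentName: 'batch-prod-fetch',   // placeholder name
    type: 'MANAGED',
    serviceRole: 'arn:aws:iam::123456789012:role/AWSBatchServiceRole', // placeholder
    computeResources: {
        type: 'SPOT',                               // keep using Spot instances
        bidPercentage: 60,
        instanceTypes: ['t3.small', 't3.medium'],   // instead of 'optimal'
        minvCpus: 0,                                // scale to zero when idle
        maxvCpus: 16,
        subnets: ['subnet-0123456789abcdef0'],      // placeholder
        securityGroupIds: ['sg-0123456789abcdef0'], // placeholder
        instanceRole: 'ecsInstanceRole'             // placeholder
    }
}, (err, res) => {
    if (err) throw err;
    console.error(res.computeEnvironmentArn);
});
```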

Unchanged Data

Context

If data hasn't changed in over a year, add it to the warn list

Verify Email

Context

Not unexpectedly, we immediately got a large number of spam email accounts. At the very least we should send an email verification

Stats Check

Context

There is currently no protection against an address source degrading significantly. We should at least flag a source whose total address count drops significantly between runs (a sketch of the comparison follows the actions below).

Actions

  • Add stats UI to job page
  • Add stats diff UI to job page
  • Add BBOX UI to job page
  • Add BBOX UI diff to job page
  • Add Warn type to job status
  • Add session management
  • Add basic Admin Component
  • Add tab for marking Warn jobs as Success or Failure (Authenticated)
  • Task runner must perform stats comparison and potentially create a job error
  • If Job is part of live run and fails, create a job error
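
A minimal sketch of the comparison the task runner could perform, assuming per-run stats expose a total address count; the field names and the 10% threshold are assumptions:

```js
// Sketch: compare the current run's address count against the previous run
// and flag a Warn status when the drop exceeds a tolerance.
function checkStats(previous, current, tolerance = 0.1) {
    if (!previous || !previous.count) return { status: 'Success' };

    const drop = (previous.count - current.count) / previous.count;
    if (drop > tolerance) {
        return {
            status: 'Warn',
            message: `address count dropped ${(drop * 100).toFixed(1)}% (${previous.count} => ${current.count})`
        };
    }

    return { status: 'Success' };
}
```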

Find another way to build dotmap

The dotmap on openaddresses.io is sourced from this code that uploads the complete listing of all addresses in OA to Mapbox. This costs a bunch of money, so I disabled the cronjob that runs the dotmap + upload process in AWS.

We need to find another way of generating this layer and keeping it up to date.

Set the correct Content-Type on S3 uploads

e.g. for https://v2.openaddresses.io/batch-prod/job/1004/source.png the Content-Type is not set, so it defaults to an octet stream and the browser downloads the image instead of displaying it.
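
A sketch of the fix, assuming an aws-sdk putObject call is where the upload happens; the bucket name simply mirrors the example URL above and is a placeholder:

```js
// Sketch: set ContentType explicitly so browsers render the PNG instead of
// downloading an octet stream. Bucket and key mirror the example URL above.
const AWS = require('aws-sdk');
const fs = require('fs');
const s3 = new AWS.S3();

async function uploadPreview(pngPath) {
    const body = await fs.promises.readFile(pngPath);

    return s3.putObject({
        Bucket: 'v2.openaddresses.io',              // placeholder bucket
        Key: 'batch-prod/job/1004/source.png',
        Body: body,
        ContentType: 'image/png'
    }).promise();
}
```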

Failed Data Jobs

Context

To make the process of fixing broken sources easier, there should be a page where the user is able to view recently failing runs

Actions

  • Add job endpoint to only retrieve live runs
  • Add job endpoint to filter by status
  • Update UI to use these filters on a recent job failures page

Collection Size

Context

Track the collection size in the database and display it via the UI

Actions

  • Add size col in collection table
  • Self report the collection size upon collection update
  • Display the collection size via the UI

Job Errors Loading

Context

The JobErrors page does not show a loading bar and instead flashes "No Errors found" every time a user navigates to the page.

Actions

  • Add better loading behavior
  • Update the error count in the upper tab when the user refreshes the list
  • Don't reload page on suppress, just splice out of results

Upload Papercuts

Context

The upload function now works as expected but could use a couple of improvements.

Actions

  • Add a close button in the upper right-hand corner while uploading that will abort the XHR
  • Add a warning if the user tries to navigate away from the page while uploading, notifying them that this will kill the current upload (see the sketch after this list)
  • Add an API for listing past uploads
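
A minimal sketch of the navigation warning, using the standard beforeunload event; the uploading flag stands in for whatever state the upload component actually keeps:

```js
// Sketch: warn before navigating away while an upload is in flight.
let uploading = false; // set to true when the XHR starts, false when it settles

window.addEventListener('beforeunload', (event) => {
    if (!uploading) return;
    event.preventDefault();
    event.returnValue = ''; // required by most browsers to actually show the prompt
});
```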

5xx Error Master Ticket

Context

(Screenshot attached, 2021-02-27)

I'm seeing a small but consistent number of 5xx errors from the API that all appear to be from the Job Error API.

Need to track this down so I stop getting emails.

Data download links point to wrong files

As reported by an external data user:

The files downloaded do not match the file labels. For instance, attempting to download the Canadian province of Alberta (ca/ab/province) instead opens a file for a city in Brazil. (See attached screenshot.)


Skip Sources

Context

If a source has the skip: true property, don't even fire a batch task
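
A minimal sketch of the guard, assuming a submitJob helper that fires the AWS Batch task (both names are placeholders):

```js
// Sketch: short-circuit before firing an AWS Batch task when a source
// layer carries skip: true.
async function maybeSubmit(source, submitJob) {
    if (source.skip === true) {
        console.error(`skipping ${source.name}: skip flag set`);
        return false;
    }

    await submitJob(source);
    return true;
}
```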

Register Component

Context

The register component has neither error nor success reporting

Actions

  • Add error handling
  • Add red color to input fields that are empty/invalid
  • Add success handling

Rerun Restrictions

Context

Allow GitHub reruns indefinitely, but only allow job reruns if the job is less than ~1 week old, to prevent a very old job from overwriting a newer one.
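
A minimal sketch of the age check; the job field names and the exact one-week window are assumptions:

```js
// Sketch: GitHub-triggered reruns are always allowed, job reruns only
// within roughly a week of the job's creation.
const MAX_RERUN_AGE = 7 * 24 * 60 * 60 * 1000; // ~1 week in ms

function canRerun(job) {
    if (job.source === 'github') return true;
    return Date.now() - new Date(job.created).getTime() <= MAX_RERUN_AGE;
}
```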

Error Handling

Context

At the moment the Login component is the only component with solid error handling. Every API call should inform the user if it cannot fall back to a safe backup option (see the sketch after the actions below).

Actions

  • Ensure every fetch has a catch that potentially triggers the Error component
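
A hedged sketch of the pattern; the /api/job path is illustrative and the err event name is an assumption about how the Error component would listen:

```js
// Sketch only: every fetch gets a catch that surfaces the failure
// instead of failing silently.
async function getJobs(vm) {
    try {
        const res = await fetch(`${window.location.origin}/api/job`);
        if (!res.ok) throw new Error(await res.text());
        return await res.json();
    } catch (err) {
        vm.$emit('err', err); // assumed contract for triggering the Error component
        return [];
    }
}
```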

User Paging

Context

The admin page is starting to overflow due to the number of users

Actions

  • Add default limit to returned usernames
  • Add paging system for users (see the sketch after this list)
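
A minimal sketch of limit/offset paging on the user list, assuming a node-postgres pool and a users table; query parameter names and the 100-row cap are placeholders:

```js
// Sketch: default limit plus simple page offsets on the user list endpoint.
async function listUsers(pool, query) {
    const limit = Math.min(parseInt(query.limit, 10) || 100, 100); // default + cap
    const page = parseInt(query.page, 10) || 0;

    const res = await pool.query(`
        SELECT id, username, email
            FROM users
            ORDER BY username
            LIMIT $1
            OFFSET $2
    `, [limit, limit * page]);

    return res.rows;
}
```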

Reduce costs

After switching to batch, our costs have increased dramatically. At first this was because we were using "optimal" in the AWS Batch configuration, which started rather large instances and kept them running unnecessarily. After #80 we switched to c5 instances, but AWS Batch is still starting larger instances and keeping them around longer than needed. This makes our AWS bill roughly twice what it was before we switched to batch.

Can we try to use c5.large instances and reduce the maximum number of vCPUs in the batch compute environment? Maybe reduce the requested memory or CPU for each task?

Don't run CI again on merge

The data please bot seems to run on merge to master, resulting in another scrape + image going on the PR after it was merged. It shouldn't run on merge.

Source Removal

Context

On each weekly run, the currently stored data results should be compared against the source JSONs. If there are source JSONs that no longer exist, the admin should be prompted to remove the data sources from the batch platform

Fully Doc & Host API Endpoints

Context

Fully document and host in-code generated API documentation.

Actions

  • Investigate APIDoc
  • Document existing routes
  • Document POST/PATCH bodies
  • Document general return JSON
  • Document important non-generic Error states

Warn On Non-200 Website

Context

In the check_sources portion of the task, attempt to curl the website and check for a 200 status code.
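
A minimal sketch of the check, assuming the source JSON carries a website field and a global fetch (Node 18+) is available in the task; the returned strings are illustrative warning messages:

```js
// Sketch: probe the source's website and report anything other than a 200
// as a warning rather than a hard failure.
async function checkWebsite(source) {
    if (!source.website) return null;

    try {
        const res = await fetch(source.website, { method: 'HEAD', redirect: 'follow' });
        if (res.status !== 200) return `website returned ${res.status}`;
    } catch (err) {
        return `website unreachable: ${err.message}`;
    }

    return null;
}
```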

Run Auto Live

Context

The current results page does not show CI data that is successfully merged into master, meaning that users are forced to scrape the CI runs page to get the latest data if they can't wait for the scheduled runs.

Actions

  • Monitor GH Actions for merge event
  • If a PR is merged into master, mark the run as live

DotMap

Context

OpenAddresses/Machine currently performs the update of the openaddresses.io dotmap. We should create a new scheduled event that downloads the global collection and performs the dotmap update (a tippecanoe sketch follows the actions below).

Actions

  • Create dotmap event
  • Download global collection
  • Use tippecanoe to create vector layer
  • Upload layer tile-by-tile via new Mapbox API to avoid size restrictions
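
A sketch of the tippecanoe step only (the tile-by-tile Mapbox upload is not shown); paths and flags are illustrative:

```js
// Sketch: run tippecanoe over the downloaded global collection to build
// the vector layer for the dotmap.
const { execFileSync } = require('child_process');

execFileSync('tippecanoe', [
    '-o', '/tmp/dotmap.mbtiles',        // placeholder output path
    '-zg',                              // let tippecanoe choose a max zoom
    '--drop-densest-as-needed',         // thin points rather than overflow tiles
    '/tmp/collection-global.geojson'    // placeholder input path
], { stdio: 'inherit' });
```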

Job Error

Show job error on Job Page if one exists

preview first 10 features in job output

Is your feature request related to a problem? Please describe.
When uploading a new source, or making changes to an existing one, ideally I would verify that it's being processed as expected before the PR is merged. The job preview at https://batch.openaddresses.io/job/21189/ shows a map, which is really helpful for a quick scan to make sure the projection is roughly correct and that most of the data is being loaded, but it doesn't show whether the attributes are parsed correctly.

Describe the solution you'd like
I can download the processed data; however, for large sources that means downloading the whole dataset when I really just want to see the first 5 or 10 features, which is usually enough for a first-pass check that the parsing is correct.
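
A minimal sketch of how the API could return only the first few features, assuming the job output is gzipped line-delimited GeoJSON streamed from storage:

```js
// Sketch: read only the first N lines of a gzipped line-delimited GeoJSON
// stream instead of downloading the whole artifact.
const zlib = require('zlib');
const readline = require('readline');

async function previewFeatures(stream, n = 10) {
    const rl = readline.createInterface({ input: stream.pipe(zlib.createGunzip()) });
    const features = [];

    for await (const line of rl) {
        features.push(JSON.parse(line)); // each line is assumed to be one feature
        if (features.length >= n) break;
    }

    rl.close();
    return features;
}
```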

Data Backfill

Context

We should backfill the v2 service with all of the last good runs from the results service.

Actions

  • Write a script to get the list of S3 locations of the latest runs (see the sketch below)
  • Write script to download and convert to GeoJSONLD
  • Directly override these files into the database - skipping the runner

cc/ @iandees
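
A minimal sketch of the S3 listing step, using the aws-sdk; the bucket and prefix are placeholders:

```js
// Sketch: page through S3 and collect the keys of the latest run outputs.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function listRuns(prefix) {
    const keys = [];
    let token;

    do {
        const res = await s3.listObjectsV2({
            Bucket: 'data.openaddresses.io',    // placeholder bucket
            Prefix: prefix,
            ContinuationToken: token
        }).promise();

        keys.push(...res.Contents.map((obj) => obj.Key));
        token = res.NextContinuationToken;      // undefined on the last page
    } while (token);

    return keys;
}
```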

Invalid Coordinates

Context

As I was trying to add new Wyoming sources, they succeeded but had invalid coordinates. The stats module should also track the number of valid vs. invalid lat/lngs (a sketch of the counter follows the actions below).

Actions

  • Track number of valid coords
  • Issue a warning or failure if the invalid share is above a certain %
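
A minimal sketch of the counter the stats module could keep; the bounds checks and the 0,0 guard are assumptions about what should count as invalid:

```js
// Sketch: count features with plausible vs. implausible coordinates.
function coordStats(features) {
    let valid = 0;
    let invalid = 0;

    for (const feat of features) {
        const coords = (feat.geometry && feat.geometry.coordinates) || [];
        const [lng, lat] = coords;

        if (Number.isFinite(lng) && Number.isFinite(lat)
            && Math.abs(lng) <= 180 && Math.abs(lat) <= 90
            && !(lng === 0 && lat === 0)) {
            valid++;
        } else {
            invalid++;
        }
    }

    return { valid, invalid };
}
```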

GH Issue Bot

Context

The Github CI integration should post images & stats to the PR once a job completes

Actions

  • Create an issue on ci job success
  • Include picture in issue
  • Include stats table in issue

Year Tag

Context

Many sources have a year tag for ensuring static data is updated. If a year tag is older than 1 year, start to consistently WARN the source.

US-TX-McLennan Will Run for Days

Context

US-TX-McLennan will run for days if not manually terminated

Job: https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/cc97737d-a9c7-479d-9423-d3d11d48b2e3
CWL: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fbatch$252Fjob/log-events/batch-prod-job$252Fdefault$252Fddd79241891746339c201b6dd93052d2

Actions

  • Add default timeout value (see the sketch after this list)
  • Add ability to increase timeout value in schema
  • Fix McLennan to not consume infinite resources
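
A minimal sketch of the default timeout, using the per-job timeout AWS Batch already supports; the queue, definition, and duration values are placeholders:

```js
// Sketch: submit the job with an attempt duration so runaway sources are
// killed automatically instead of running for days.
const AWS = require('aws-sdk');
const batch = new AWS.Batch({ region: 'us-east-1' });

batch.submitJob({
    jobName: 'oa-us-tx-mclennan',
    jobQueue: 'batch-prod-queue',                       // placeholder
    jobDefinition: 'batch-prod-job',                    // placeholder
    timeout: { attemptDurationSeconds: 6 * 60 * 60 }    // kill after ~6 hours
}, (err) => {
    if (err) throw err;
});
```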

Generate Weekly Data Dumps

Context

Most users download our large data dumps, but generating these dumps is not currently supported.

Actions

  • Add global dump
  • Add config definition for data dumps
  • Add data dump batch task
  • Add data dump API -> fire batch task
  • Add lambda schedule to hit data dump API
  • Add UI panel for displaying data dumps
