batch-machine's Introduction

OpenAddresses

Brief

A global collection of address, cadastral parcel and building footprint data sources, open and free to use. Join, download and contribute. We're just getting started.

This repository is a collection of references to address, cadastral parcel and building footprint data sources.

Contributing addresses

  • Open an issue and give information about where to find more address data. Be sure to include a link to the data and a description of the coverage area for the data.
  • You can also create a pull request to the sources directory.
  • More details in CONTRIBUTING.md.

Why collect addresses?

Street address data is essential infrastructure. Street names, house numbers, and post codes combined with geographic coordinates connect digital to physical places. Free and open addresses are rocket fuel for civic and commercial innovation.

Contributors

Code Contributors

This project exists thanks to all the people who contribute. [Contribute].

Financial Contributors

Become a financial contributor and help us sustain our community. [Contribute]

Individuals

Organizations

Support this project with your organization. Your logo will show up here with a link to your website. [Contribute]

License

The data produced by the OpenAddresses processing pipeline (available on batch.openaddresses.io) is not relicensed from the original sources. Individual sources will have their own licenses. The OpenAddresses team does its best to summarize the source licenses in the source JSON for each source. For example, the source JSON for the County of San Francisco contains a link to the County of San Francisco's open data license.

The source JSON in this repo (in the sources/ directory) is licensed under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication as described in the license file there. The rest of the repository is licensed under the BSD 3-Clause License.

batch-machine's People

Contributors

albarrentine, andrewharvey, bertday, dependabot[bot], erictheise, geobrando, hannesj, iandees, ingalls, jalessio, kreed, lowks, migurski, mikedillion, minicodemonkey, nelsonminar, sbma44, slibby, stefanb, sudoprime, trescube, waldoj, zyphlar

batch-machine's Issues

Line delimited GeoJSON

Line-delimited GeoJSON (one feature per line) is growing in use, as some applications find it easier to read and write than regular GeoJSON wrapped in a FeatureCollection.

I tried feeding line-delimited GeoJSON to OpenAddresses, but parsing failed, so it would be nice to have support for this.
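As a rough illustration of what reading this format involves, here is a minimal sketch that parses one feature per line. The function name `iter_features` and the sample records are made up for this example; this is not how batch-machine parses sources.

```python
import io
import json

def iter_features(lines):
    """Yield GeoJSON Feature dicts from an iterable of text lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Only pass through records that are GeoJSON Features.
        if record.get("type") == "Feature":
            yield record

# Two made-up features, one per line, with no wrapping FeatureCollection.
sample = io.StringIO(
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 0]}, "properties": {"number": "1"}}\n'
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [1, 1]}, "properties": {"number": "2"}}\n'
)
features = list(iter_features(sample))
```

Because each line is parsed on its own, a reader like this never needs to hold the whole file in memory.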

regexp function not working as expected

When a match isn't found for a regexp pattern, I would expect it to return nothing. However, it looks like batch-machine is returning the full field if no matches are found.

For example in openaddresses/openaddresses#6981, some address points only include a zip code, city, state, and country in REV_LongLabel. Since no match is found for the regexp pattern I set for number for these points, it returns the full field.
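To make the expectation concrete, here is a small sketch of the behaviour I would expect: an empty result when the pattern does not match. The helper `extract_or_empty`, the pattern, and the sample labels are all hypothetical; this is not batch-machine's actual regexp code.

```python
import re

def extract_or_empty(pattern, value):
    """Return the first capture group of pattern in value, or '' if there is no match."""
    match = re.search(pattern, value)
    if match is None:
        return ""  # no match: return nothing rather than the full field
    return match.group(1)

# Hypothetical pattern for a leading house number followed by whitespace.
number_pattern = r"^(\d+)\s"

with_number = extract_or_empty(number_pattern, "123 Main St, Springfield, IL, 62701, USA")
without_number = extract_or_empty(number_pattern, "62701, Springfield, IL, USA")
```

Here `with_number` is `"123"` and `without_number` is the empty string, since the zip code is followed by a comma rather than whitespace.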

improve memory efficiency for single line files

with open(filepath, 'rb') as file:
    for line in file:
        fingerprint.update(line)

It looks like this code reads the source data line by line. However, as shown by the MemoryErrors reported in openaddresses/openaddresses#5541 and openaddresses/openaddresses#5490, where the source data is gigabytes of GeoJSON on a single line, this can exhaust the system's memory, because the entire file is then read as one line.

Perhaps we could instead do something like this?

with open(filepath, 'rb') as file:
    while chunk := file.read(8192):
        fingerprint.update(chunk)
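A self-contained sketch of the chunked approach is below, assuming `fingerprint` is a `hashlib` hash object. The choice of SHA-1 and the function name `file_fingerprint` are assumptions for this example, not taken from batch-machine. The `:=` operator requires Python 3.8 or later.

```python
import hashlib
import os
import tempfile

def file_fingerprint(filepath, chunk_size=8192):
    """Hash a file in fixed-size chunks so memory use does not depend on line length."""
    fingerprint = hashlib.sha1()  # assumption: the real hash algorithm may differ
    with open(filepath, "rb") as file:
        # read() returns b'' at end of file, which ends the loop.
        while chunk := file.read(chunk_size):
            fingerprint.update(chunk)
    return fingerprint.hexdigest()

# Demo: a file with no newlines at all, standing in for single-line GeoJSON.
payload = b'{"type": "FeatureCollection", "features": []}' * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name
try:
    digest = file_fingerprint(tmp_path)
finally:
    os.remove(tmp_path)

# Chunked hashing gives the same digest as hashing the bytes in one go.
matches_whole = digest == hashlib.sha1(payload).hexdigest()
```

Feeding a hash object in pieces is equivalent to feeding it the concatenated data, so the resulting fingerprint should be unchanged while peak memory stays at roughly one chunk.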

chain functions broken due to oa prefix in new batch-machine

https://github.com/openaddresses/openaddresses/blob/master/ATTRIBUTE_FUNCTIONS.md#chain documents the use of chain functions with a _wip variable. The example given is:

"street": {
    "function": "chain",
    "variable": "street_wip",
    "functions": [
        {
            "function": "postfixed_street",
            "field": "Prop_Addr"
        },
        {
            "function": "remove_postfix",
            "field": "street_wip",
            "field_to_remove": "Prop_Addr_Unit"
        }
    ]
}

This used to work fine with https://github.com/openaddresses/machine, but once that was deprecated in favour of https://github.com/openaddresses/batch-machine, a breaking change was introduced, which meant this example would now fail, as evidenced by the errors in openaddresses/openaddresses#5496.

After a few months of not understanding the issue, I took another look, comparing the old conform https://github.com/openaddresses/machine/blob/05db17d8492b3d8f4064f0f5b0ca9c68041c535a/openaddr/conform.py with the new one https://github.com/openaddresses/batch-machine/blob/8bb6ecfb20beec56913ae7ae267cf5489c9c35ad/openaddr/conform.py. This revealed an issue at:

def row_fxn_chain(sc, row, key, fxn):
    functions = fxn["functions"]
    var = fxn.get("variable")
    original_key = key
    if var and var.upper().lstrip('OA:') not in sc.SCHEMA and var not in row:
        row['oa:' + var] = u''
        key = var
    for func in functions:
        row = row_function(sc, row, key, func)
        if row.get('oa:' + key):
            row[key] = row['oa:' + key]
    row['oa:{}'.format(original_key.lower())] = row['oa:{}'.format(key)]
    return row

The code here now creates these intermediate variables with the oa: prefix. Once I added this prefix to the _wip variable fields at https://github.com/openaddresses/openaddresses/pull/5828/files#diff-8441ea2e82820b7decef018f2d606c8a543ad86411a2c49aedab254a0f72f848, it worked.
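A toy model of why the unprefixed name stops resolving is sketched below. It only mimics the setup line `row['oa:' + var] = u''` from the quoted function; `start_chain` and `lookup` are hypothetical stand-ins, and the assumption that later steps look up their `field` by its literal name is mine, not confirmed from the batch-machine source.

```python
def start_chain(row, var):
    """Mimic the quoted setup step: create the intermediate under 'oa:' + var."""
    row = dict(row)  # copy so the input row is left untouched
    row["oa:" + var] = ""
    return row

def lookup(row, field):
    """Hypothetical stand-in for a later chain step reading its input field by name."""
    return row.get(field)

# Made-up source row using the field name from the documented example.
row = start_chain({"Prop_Addr": "123 MAIN ST UNIT 4"}, "street_wip")

unprefixed = lookup(row, "street_wip")     # None: no key with that exact name
prefixed = lookup(row, "oa:street_wip")    # '': the intermediate variable exists
```

Under that assumption, a conform that refers to `street_wip` finds nothing, while one that refers to `oa:street_wip` finds the intermediate value.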

So the question for you @ingalls is, is this either

a) a bug in batch-machine that we can fix, or
b) a change in how these conform files are processed and we should update the documentation and existing sources which broke

I'm happy to help out, but I need to first understand some of the background and motivation of this change.

Running Process-One.py locally

Hi all,

I am trying to run a single process on a source to better understand how the tool actually works, but I seem to be getting an error when running the openaddr-process-one command.

Here are my steps for getting Docker running and triggering the command:

docker build -t batchmachine .
docker run -it batchmachine bash
openaddr-process-one openaddr/sources/airdrie airdrie

I'm not sure where to go from here. Am I getting a "directory not found" error due to permissions? My thought was that Docker runs all containers as root unless specified otherwise.
