datamade / usaddress

🇺🇸 A Python library for parsing unstructured United States address strings into address components

Home Page: https://parserator.datamade.us/usaddress

License: MIT License

Python 100.00%
python-library address nlp parserator python address-parser natural-language-processing machine-learning conditional-random-fields crf

usaddress's Introduction

usaddress

usaddress is a Python library for parsing unstructured United States address strings into address components, using advanced NLP methods.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying address components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify address components with perfect accuracy, nor can it verify that a given address is correct/valid.

It also does not normalize the address. However, this library built on top of usaddress does.

Tools built with usaddress

A RESTful API built on top of usaddress for programmers who don't use python. Requires an API key and the first 1,000 parses are free.

Parserator: Parse and Split Addresses allows you to easily split addresses into separate columns by street, city, state, zipcode and more right in Google Sheets.

How to use the usaddress python library

  1. Install usaddress with pip, a tool for installing and managing python packages (beginner's guide here).

In the terminal,

pip install usaddress
  2. Parse some addresses!

Note that parse and tag are different methods:

import usaddress
addr='123 Main St. Suite 100 Chicago, IL'

# The parse method will split your address string into components, and label each component.
# expected output: [(u'123', 'AddressNumber'), (u'Main', 'StreetName'), (u'St.', 'StreetNamePostType'), (u'Suite', 'OccupancyType'), (u'100', 'OccupancyIdentifier'), (u'Chicago,', 'PlaceName'), (u'IL', 'StateName')]
usaddress.parse(addr)

# The tag method will try to be a little smarter
# it will merge consecutive components, strip commas, & return an address type
# expected output: (OrderedDict([('AddressNumber', u'123'), ('StreetName', u'Main'), ('StreetNamePostType', u'St.'), ('OccupancyType', u'Suite'), ('OccupancyIdentifier', u'100'), ('PlaceName', u'Chicago'), ('StateName', u'IL')]), 'Street Address')
usaddress.tag(addr)
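
To make the difference concrete, here is a minimal sketch of the post-processing that the comments above describe for tag: merging consecutive components that share a label and stripping commas. It works on the parse-style list shown above and does not call usaddress. `merge_components` is a hypothetical helper written for illustration, not the library's implementation, and it does not try to work out an address type.

```python
from collections import OrderedDict

def merge_components(labeled_tokens):
    """Merge consecutive tokens that share a label and strip commas.

    Hypothetical helper for illustration; not usaddress's own code.
    """
    merged = OrderedDict()
    previous_label = None
    for token, label in labeled_tokens:
        token = token.strip(" ,;")
        if not token:
            continue
        if label == previous_label:
            # Same label as the previous token: extend that component.
            merged[label] += " " + token
        elif label in merged:
            # The label came back after a different label; merging is ambiguous.
            raise ValueError("label %r appears in two separate places" % label)
        else:
            merged[label] = token
        previous_label = label
    return merged

parsed = [
    ("123", "AddressNumber"), ("Main", "StreetName"), ("St.", "StreetNamePostType"),
    ("Suite", "OccupancyType"), ("100", "OccupancyIdentifier"),
    ("Chicago,", "PlaceName"), ("IL", "StateName"),
]
components = merge_components(parsed)
print(components["PlaceName"])  # Chicago
```

The sketch raises an error when a label appears in two separate places, because there is no obvious way to merge such components into a single value.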

How to use this development code (for the nerds)

usaddress uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train the usaddress parser's model (a .crfsuite settings file) on labeled training data, and provides tools for adding new labeled training data.

Building & testing the code in this repo

To build a development version of usaddress on your machine, run the following code in your command line:

git clone https://github.com/datamade/usaddress.git  
cd usaddress  
pip install -r requirements.txt  
python setup.py develop  
parserator train training/labeled.xml usaddress  

Then run the testing suite to confirm that everything is working properly:

nosetests .

Having trouble building the code? Open an issue and we'd be glad to help you troubleshoot.

Adding new training data

If usaddress is consistently failing on particular address patterns, you can adjust the parser's behavior by adding new training data to the model. Follow our guide in the training directory, and be sure to make a pull request so that we can incorporate your contribution into our next release!

Important links

Team

Bad Parses / Bugs

Report issues in the issue tracker

If an address was parsed incorrectly, please let us know! You can either open an issue or (if you're adventurous) add new training data to improve the parser's model. When possible, please send over a few real-world examples of similar address patterns, along with some info about the source of the data - this will help us train the parser and improve its performance.

If something in the library is not behaving intuitively, it is a bug, and should be reported.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches!

Copyright

Copyright (c) 2014 Atlanta Journal Constitution. Released under the MIT License.

usaddress's People

Contributors

brentpayne, cathydeng, daguar, derekeder, fgregg, jeancochrane, markbaas, mbatchkarov, mlissner, ohiat, rj-lovering, tanyaschlusser

usaddress's Issues

Easily confused if StreetNamePostType omitted

Skipping a "St." or "Dr." causes it to interpret city or town names as StreetNamePostType:

>>> usaddress.parse("123 Main Sometown, New York 10101")
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('Sometown,', 'StreetNamePostType'), ('New', 'PlaceName'), ('York', 'StateName'), ('10101', 'ZipCode')]
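
One way to spot this kind of mislabeling after the fact is to check every token labeled StreetNamePostType against a list of known street suffixes. The sketch below is only an illustration: `KNOWN_POST_TYPES` is a hypothetical and deliberately short list, and `suspicious_post_types` is not part of usaddress.

```python
# Hypothetical, deliberately short suffix list; a real check would need a fuller one.
KNOWN_POST_TYPES = {
    "st", "street", "ave", "avenue", "dr", "drive", "rd", "road",
    "blvd", "ct", "cir", "pl", "way",
}

def suspicious_post_types(labeled_tokens):
    """Return tokens labeled StreetNamePostType that are not known suffixes."""
    flagged = []
    for token, label in labeled_tokens:
        normalized = token.strip(".,").lower()
        if label == "StreetNamePostType" and normalized not in KNOWN_POST_TYPES:
            flagged.append(token)
    return flagged

parsed = [('123', 'AddressNumber'), ('Main', 'StreetName'), ('Sometown,', 'StreetNamePostType'),
          ('New', 'PlaceName'), ('York', 'StateName'), ('10101', 'ZipCode')]
print(suspicious_post_types(parsed))  # ['Sometown,']
```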

Specify prerequisite packages?

You might consider specifying that a couple of packages have to be installed in Ubuntu/Debian for this to work. (I'm running Ubuntu 12.04.) Attempting to install it throws cryptic errors unless g++ and python-dev have been installed, like so:

apt-get install g++
apt-get install python-dev

OTOH, I only have about six months of serious Python experience, and that's mostly in writing Python (not participating in the Python software ecosystem), so perhaps this is common knowledge.

Install error on Python 3 x64

I'm using Anaconda3 Python Distribution and install bombs out:

C:\Anaconda3>pip install usaddress
Downloading/unpacking usaddress
  Running setup.py (path:C:\Users\shitals\AppData\Local\Temp\pip_build_shitals\u
saddress\setup.py) egg_info for package usaddress

Downloading/unpacking python-crfsuite>=0.7 (from usaddress)
  Running setup.py (path:C:\Users\shitals\AppData\Local\Temp\pip_build_shitals\p
ython-crfsuite\setup.py) egg_info for package python-crfsuite

Installing collected packages: usaddress, python-crfsuite
  Running setup.py install for usaddress

  Running setup.py install for python-crfsuite
    error: [WinError 2] The system cannot find the file specified
    Complete output from command C:\Anaconda3\python.exe -c "import setuptools,
tokenize;__file__='C:\\Users\\shitals\\AppData\\Local\\Temp\\pip_build_shitals\\
python-crfsuite\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__
).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\sh
itals\AppData\Local\Temp\pip-xu5ktq5w-record\install-record.txt --single-version
-externally-managed --compile:
    running install

running build

running build_py

creating build

creating build\lib.win-amd64-3.4

creating build\lib.win-amd64-3.4\pycrfsuite

copying pycrfsuite\_dumpparser.py -> build\lib.win-amd64-3.4\pycrfsuite

copying pycrfsuite\_logparser.py -> build\lib.win-amd64-3.4\pycrfsuite

copying pycrfsuite\__init__.py -> build\lib.win-amd64-3.4\pycrfsuite

running build_ext

error: [WinError 2] The system cannot find the file specified

----------------------------------------
Cleaning up...
Command C:\Anaconda3\python.exe -c "import setuptools, tokenize;__file__='C:\\Us
ers\\shitals\\AppData\\Local\\Temp\\pip_build_shitals\\python-crfsuite\\setup.py
';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n',
'\n'), __file__, 'exec'))" install --record C:\Users\shitals\AppData\Local\Temp\
pip-xu5ktq5w-record\install-record.txt --single-version-externally-managed --com
pile failed with error code 1 in C:\Users\shitals\AppData\Local\Temp\pip_build_s
hitals\python-crfsuite
Storing debug log for failure in C:\Users\shitals\pip\pip.log

Use conform.merge property to find failing addresses

Some of the OpenAddresses source files have a conform.merge property that describes the subcomponents available in the source data, e.g. Westminster, CO.

We can use these properties to identify sources of tagged, or partially tagged, addresses.

Let's build some code to

  1. walk through https://github.com/openaddresses/openaddresses/tree/master/sources
  2. if the source is American and has conform.merge properties, then download the cached data, if it exists
  3. compare our tagging with the source tagging for identifying potential failures of our parser.
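
Steps 2 and 3 could be sketched roughly as follows. The layout of the source dict (a `coverage.country` field next to `conform.merge`) is an assumption about the source files rather than a verified schema, and both function names are hypothetical. Downloading the cached data is left out.

```python
def usable_source(source):
    """True if a source dict is American and declares conform.merge fields.

    The dict layout (coverage.country, conform.merge) is an assumption,
    not a verified schema.
    """
    country = source.get("coverage", {}).get("country", "")
    merge_fields = source.get("conform", {}).get("merge", [])
    return country.lower() == "us" and bool(merge_fields)

def mismatches(source_components, tagged_components):
    """Return labels where the parser's value differs from the source's value."""
    return {
        label: (expected, tagged_components.get(label))
        for label, expected in source_components.items()
        if tagged_components.get(label) != expected
    }

source = {"coverage": {"country": "us"}, "conform": {"merge": ["number", "street"]}}
print(usable_source(source))  # True

diff = mismatches(
    {"AddressNumber": "123", "StreetName": "Main"},
    {"AddressNumber": "123", "StreetName": "Main St"},
)
print(diff)  # {'StreetName': ('Main', 'Main St')}
```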

Complete the building of our improvement feedback loop

To build a really great parser we need to set up a feedback loop for ourselves.

The loop will include

  1. Testing the performance of the parser on addresses
  2. Identifying the failures
  3. Altering the parser by
    1. Changing the training corpus by
      1. changing the amount of training data
      2. changing the variety of training data
      3. changing the representativeness of the training data
    2. changing the features we use for the model
    3. changing the model form

After making adjustments to the parser (3), we need to be able to evaluate (1) the performance, and identify current difficult cases (2) with one command.

In the next round, adjustments to the parser (3) need to be based upon our current failures.

  • Testing (1 & 2)
    • Start our test corpus: #8
  • Automated build and test #7
  • Patterns for ingesting new training/test data
    • #6 Deciding on data format
    • Directory structure for raw_data, data_scripts, training_data

Not robust for these examples

Here are some of the examples I tried. In general,

  • Street directional prefix detection is not robust (people often make mistakes with this).
  • Occupancy identifier doesn't seem to take special characters into account.
  • Country name is interpreted as street name.
  • Intersection detection doesn't work with "&".
  • No robustness for hints specified in brackets.
  • Complex city names are not handled well.
  • Things like "P.O. BOX" are not detected as a single token.
    Etc.
>>> import usaddress
>>> usaddress.parse('123 Main St., Flr #100, Chicago, IL')
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Flr', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 Main NE St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('NE', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 Main NorthWest St, Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('NorthWest', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NE Main St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 Norh East Main St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('Norh', 'StreetName'), ('East', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100 1/2, Chicago, IL')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100 1/2, Chicago, Illinoi, United States of America')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi,', 'PlaceName'), ('United', 'PlaceName'), ('States', 'PlaceName'), ('of', 'StreetName'), ('America', 'StreetName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100 1/2, Chicago, Illinoi, US of A')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi,', 'PlaceName'), ('US', 'StateName'), ('of', 'StreetNamePreType'), ('A', 'StreetName')]
>>> usaddress.parse('123 NE Main St (near Lincon Cinema), Cabin #100 1/2, Chicago, Illinoi, US of A')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St', 'StreetName'), ('near', 'StreetName'), ('Lincon', 'StreetName'), ('Cinema),', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi,', 'PlaceName'), ('US', 'StateName'), ('of', 'StreetNamePreType'), ('A', 'StreetName')]
>>> usaddress.parse('123 NE Main St & 551A Juno Ave, Cabin #100 1/2, Chicago, Illinoi')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St', 'StreetName'), ('&', 'StreetName'), ('551A', 'StreetName'), ('Juno', 'StreetName'), ('Ave,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('123 NE Main St and 551A Juno Ave, Cabin #100 1/2, Chicago, Illinoi')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St', 'StreetNamePostType'), ('and', 'IntersectionSeparator'), ('551A', 'StreetName'), ('Juno', 'StreetName'), ('Ave,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('1231 1/2 NE Main St, Cabin #100 1/2, Chicago, Illinoi')
[('1231', 'AddressNumber'), ('1/2', 'AddressNumberSuffix'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('1231A 1/2 NE Main St, Cabin #100 1/2, Chicago, Illinoi')
[('1231A', 'AddressNumber'), ('1/2', 'StreetName'), ('NE', 'StreetName'), ('Main', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('1231A-B 1/2 NE Main St, Cabin #100 1/2, Chicago, Illinoi')
[('1231A-B', 'AddressNumber'), ('1/2', 'StreetName'), ('NE', 'StreetName'), ('Main', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('501 US Hwy 81 Exit 4, Trailer #100, Chicago, Illinoi')
[('501', 'AddressNumber'), ('US', 'StreetNamePreType'), ('Hwy', 'StreetNamePreType'), ('81', 'StreetName'), ('Exit', 'StreetName'), ('4,', 'StreetName'), ('Trailer', 'StreetName'), ('#', 'StreetNamePostType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('501 Main Blvd, Sun-lo-City on the river, Illinoi')
[('501', 'AddressNumber'), ('Main', 'StreetName'), ('Blvd,', 'StreetNamePostType'), ('Sun-lo-City', 'PlaceName'), ('on', 'StreetName'), ('the', 'StreetName'), ('river,', 'StreetName'), ('Illinoi', 'PlaceName')]
>>> usaddress.parse('P.O. BOX 2555, Sun-lo-City on the river, Illinoi')
[('P.O.', 'USPSBoxType'), ('BOX', 'USPSBoxType'), ('2555,', 'USPSBoxID'), ('Sun-lo-City', 'BuildingName'), ('on', 'BuildingName'), ('the', 'BuildingName'), ('river,', 'BuildingName'), ('Illinoi', 'BuildingName')]

us-ia-linn labeling errors

Tested the parser on OpenAddresses data for Linn County, Iowa: 1836 failures out of 95164 addresses (1.9%).

Since these are all the addresses within a county, many of the failures are essentially the same errors, repeated for various address/unit numbers. I scrolled through all the failures, and here's a representative sample (address string, predicted labels, true labels):
#1.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('180 S 19th Street Ct Marion IA 52302\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#2.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1115 Indian Creek Cir Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#3.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3942 21st Avenue Pl SW Unit 4 Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#4.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('4101 16th Ave SW Trlr 26 Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#5.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1341 39th Street Pl Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#6.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2510 Heather View Cir Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#7.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('916 E Ave NW Cedar Rapids IA 52405\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#8.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1113 6th St SE D Cedar Rapids IA 52401\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#9.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1238 O Avenue Pl NE Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#10.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1300 Oakland Rd NE Bldg 5 Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#11.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3043 1/2 Leonard St NE Cedar Rapids IA 52402\n ', ('AddressNumber', 'AddressNumberSuffix', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#12.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1702 Hunters Creek Way Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#13.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('9 Chapelridge Cir Apt E Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#14.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2351 Blairs Ferry Rd NE Bldg S3 Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#15.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2415 Grey Wolf Hiawatha IA 52233\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#16.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('550 West Side Pl SW Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#17.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3397 C Avenue Ext Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#18.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('4997 Hwy 13 Central City IA 52214\n ', ('AddressNumber', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetNamePreType', 'StreetName', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
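
Since many failures are the same error repeated, grouping them by their (predicted, true) label sequences gives a quicker overview than scrolling. The sketch below uses three of the failures listed above (#2, #6, and #18). `group_failures` is a hypothetical helper, not part of the test suite. The last line just rechecks the failure rate quoted at the top.

```python
from collections import Counter

def group_failures(failures):
    """Count failures by their (predicted, true) label sequences."""
    return Counter((tuple(predicted), tuple(true)) for _, predicted, true in failures)

# Label sequences copied from failures #2, #6, and #18 above.
PRED_A = ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName',
          'PlaceName', 'StateName', 'ZipCode')
TRUE_A = ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType',
          'PlaceName', 'StateName', 'ZipCode')
PRED_B = ('AddressNumber', 'StreetNamePreType', 'StreetName', 'StreetName',
          'PlaceName', 'StateName', 'ZipCode')
TRUE_B = ('AddressNumber', 'StreetNamePreType', 'StreetName', 'PlaceName',
          'PlaceName', 'StateName', 'ZipCode')

failures = [
    ('1115 Indian Creek Cir Marion IA 52302', PRED_A, TRUE_A),
    ('2510 Heather View Cir Marion IA 52302', PRED_A, TRUE_A),
    ('4997 Hwy 13 Central City IA 52214', PRED_B, TRUE_B),
]
groups = group_failures(failures)
print(groups.most_common(1)[0][1])   # 2
print(round(100 * 1836 / 95164, 1))  # 1.9
```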

Script to import OSM data and parse into our format

Two scripts.

  1. Import OSM data that has addr:full and addr:* components, so we can tag naturalistic address data
  2. Import OSM data that has addr:street:* components (addr:street:direction, addr:street:name, etc). We can create synthetic natural addresses from these components. Let's do this second one first.

Steps.

  1. Grab data from http://overpass.osm.rambler.ru/ (see also http://stackoverflow.com/questions/16625584/trying-to-send-xml-file-using-python-requests-to-openstreetmap-overpass-api)
  2. Parse
  3. Profit
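
For the second script, building a synthetic address from components might look like the sketch below. Only addr:street:direction and addr:street:name are named above. The other keys, the label chosen for each key, and the output order are illustrative guesses, and `synthesize` is a hypothetical name. Fetching data from the Overpass API is not shown.

```python
# Assumed mapping from OSM-style keys to usaddress labels, in output order.
# Apart from addr:street:direction and addr:street:name, the keys and the
# label picked for each key are illustrative guesses.
KEY_TO_LABEL = [
    ("addr:housenumber", "AddressNumber"),
    ("addr:street:direction", "StreetNamePreDirectional"),
    ("addr:street:name", "StreetName"),
    ("addr:street:type", "StreetNamePostType"),
    ("addr:city", "PlaceName"),
    ("addr:state", "StateName"),
    ("addr:postcode", "ZipCode"),
]

def synthesize(tags):
    """Build an address string and its labeled tokens from component tags."""
    labeled = []
    for key, label in KEY_TO_LABEL:
        # Split multi-word values so every token carries its own label.
        for token in tags.get(key, "").split():
            labeled.append((token, label))
    return " ".join(token for token, _ in labeled), labeled

tags = {
    "addr:housenumber": "916", "addr:street:name": "E", "addr:street:type": "Ave",
    "addr:city": "Cedar Rapids", "addr:state": "IA", "addr:postcode": "52405",
}
address, labeled = synthesize(tags)
print(address)  # 916 E Ave Cedar Rapids IA 52405
```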

generating xml when first token is tagged NULL/None

e.g. ('#', None), ('1', 'AddressNumber') -> # 1

addr2XML in manual_labeling.py can't handle this b/c when generating XML, token text tagged as None is appended to the tail of the previous token. this breaks when the first token is tagged as None b/c there is no previous token. Not sure how to address this w/ ElementTree (how to set a child of AddressString that is not an xml element & also not the tail of a child xml element)

in the case above, '#' should probably actually be labeled as AddressNumber, in which case it wouldn't be an issue. are there cases when an address would have its first token validly tagged None? Should we just get rid of the None tag & tail stuff and put all tokens inside xml tags?
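
On the ElementTree question: text that comes before the first child of an element is stored in the parent's own `.text` attribute, so a leading None-tagged token can go there instead of in a tail. A minimal sketch, assuming the (token, label) list format used above. `list_to_xml` is a hypothetical stand-in, not the addr2XML function in manual_labeling.py.

```python
import xml.etree.ElementTree as ET

def list_to_xml(labeled_tokens):
    """Build an <AddressString> element from (token, label) pairs.

    Tokens labeled None become plain text: the tail of the previous child,
    or the parent's own .text when there is no previous child yet.
    """
    parent = ET.Element("AddressString")
    previous = None
    for token, label in labeled_tokens:
        if label is None:
            if previous is None:
                parent.text = (parent.text or "") + token + " "
            else:
                previous.tail = (previous.tail or "") + token + " "
        else:
            child = ET.SubElement(parent, label)
            child.text = token
            child.tail = " "
            previous = child
    # Drop the trailing space after the last child.
    if previous is not None and previous.tail:
        previous.tail = previous.tail.rstrip()
    return parent

element = list_to_xml([("#", None), ("1", "AddressNumber"), ("Pinto", "StreetName")])
print(ET.tostring(element, encoding="unicode"))
# <AddressString># <AddressNumber>1</AddressNumber> <StreetName>Pinto</StreetName></AddressString>
```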

New feature

Lowercase each word and remove leading or trailing punctuation (maybe interior punctuation too).
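
A minimal sketch of such a normalization step, with interior punctuation removal as an option. `normalize_token` and its `strip_interior` parameter are hypothetical names, not existing usaddress functions.

```python
import string

def normalize_token(token, strip_interior=False):
    """Lowercase a token and remove leading/trailing punctuation.

    With strip_interior=True, punctuation inside the token is removed too.
    """
    cleaned = token.lower().strip(string.punctuation)
    if strip_interior:
        cleaned = "".join(ch for ch in cleaned if ch not in string.punctuation)
    return cleaned

print(normalize_token("St.,"))                      # st
print(normalize_token("P.O."))                      # p.o
print(normalize_token("P.O.", strip_interior=True)) # po
```

Note that a token made only of punctuation, such as "#", normalizes to an empty string, so a feature built this way would need to handle that case.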

missing usaddr.crfsuite

getting an error when I import usaddress:

File "usaddress/__init__.py", line 6, in <module>
    + '/usaddr.crfsuite')
IOError: [Errno 2] No such file or directory: '/Users/derekeder/projects/datamade/us-address-parser/usaddress/usaddr.crfsuite'

Is this file supposed to be generated locally somehow?

Test class for list2xml

Please write a test class for testing list2xml

Among other things, please test that

('#', 'foo'), ('1', 'foo'), ('Pinto', 'foo') --> <foo>#</foo> <foo>1</foo> <foo>Pinto</foo>

and

('#', NULL), ('1', 'foo'), ('Pinto', 'foo') --> # <foo>1</foo> <foo>Pinto</foo>

The tagger needs to be trained on atomic tokens

Right now some of the 'tokens' that you are using are phrases like

[('Homer Spit Road,', 'street'), ('Homer,', 'city'), ('AK', 'state'), ('99603', 'zip')]

In general, we won't know that 'Homer Spit Road' all belongs together. Indeed, that's what we are trying to learn.

So, use training data of this form:

[('Homer', 'street'), ('Spit', 'street'), ('Road,', 'street'), ('Homer,', 'city'), ('AK', 'state'), ('99603', 'zip')]
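
Converting phrase-level data into this form can be done by splitting each phrase on whitespace and repeating its label. A small sketch (`atomize` is a hypothetical helper name):

```python
def atomize(labeled_phrases):
    """Split each (phrase, label) pair into one (token, label) pair per word."""
    return [
        (token, label)
        for phrase, label in labeled_phrases
        for token in phrase.split()
    ]

phrases = [('Homer Spit Road,', 'street'), ('Homer,', 'city'), ('AK', 'state'), ('99603', 'zip')]
atomic = atomize(phrases)
print(atomic)
# [('Homer', 'street'), ('Spit', 'street'), ('Road,', 'street'), ('Homer,', 'city'), ('AK', 'state'), ('99603', 'zip')]
```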

Standardize training data

The sources of training data will come in a variety of formats. We should choose a particular format to standardize to. The original data should be kept in a 'raw' data directory.

I would suggest an XML format at this point.

  • test.py
  • us50.test.tagged
  • us50.train.tagged
