datamade / usaddress

A Python library for parsing unstructured United States address strings into address components

Home Page: https://parserator.datamade.us/usaddress

License: MIT License


usaddress's Introduction

usaddress

usaddress is a Python library for parsing unstructured United States address strings into address components, using a probabilistic model (conditional random fields) rather than hand-written rules.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying address components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify address components with perfect accuracy, nor can it verify that a given address is correct/valid.

It also does not normalize the address. However, this library built on top of usaddress does.

Tools built with usaddress

Parserator API

A RESTful API built on top of usaddress for programmers who don't use Python. It requires an API key; the first 1,000 parses are free.

Parserator Google Sheets App

Parserator: Parse and Split Addresses allows you to easily split addresses into separate columns by street, city, state, zipcode and more right in Google Sheets.

How to use the usaddress python library

Install usaddress with pip, a tool for installing and managing python packages (beginner's guide here).

In the terminal,

pip install usaddress

Parse some addresses!

Note that parse and tag are different methods:

import usaddress
addr='123 Main St. Suite 100 Chicago, IL'

# The parse method will split your address string into components, and label each component.
# expected output: [(u'123', 'AddressNumber'), (u'Main', 'StreetName'), (u'St.', 'StreetNamePostType'), (u'Suite', 'OccupancyType'), (u'100', 'OccupancyIdentifier'), (u'Chicago,', 'PlaceName'), (u'IL', 'StateName')]
usaddress.parse(addr)

# The tag method will try to be a little smarter
# it will merge consecutive components, strip commas, & return an address type
# expected output: (OrderedDict([('AddressNumber', u'123'), ('StreetName', u'Main'), ('StreetNamePostType', u'St.'), ('OccupancyType', u'Suite'), ('OccupancyIdentifier', u'100'), ('PlaceName', u'Chicago'), ('StateName', u'IL')]), 'Street Address')
usaddress.tag(addr)
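
To make the difference between the two outputs concrete, here is a minimal sketch of the merging step described in the comments above: consecutive tokens with the same label are joined, and trailing commas are stripped. It works on the list that parse returns and does not import usaddress. The function name merge_parsed and the ValueError are assumptions for illustration, not the library's own implementation.

```python
from collections import OrderedDict


def merge_parsed(parsed):
    """Join consecutive tokens that share a label and strip trailing commas.

    Hypothetical helper: it imitates the behavior described for tag,
    it is not the code usaddress itself runs.
    """
    merged = OrderedDict()
    previous_label = None
    for token, label in parsed:
        token = token.rstrip(',')
        if label == previous_label:
            merged[label] += ' ' + token
        elif label in merged:
            # The same label came back after a different one; a flat
            # mapping cannot represent that, so fail loudly.
            raise ValueError('label %r repeats non-consecutively' % label)
        else:
            merged[label] = token
        previous_label = label
    return merged


parsed = [('123', 'AddressNumber'), ('Main', 'StreetName'),
          ('St.', 'StreetNamePostType'), ('Suite', 'OccupancyType'),
          ('100', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'),
          ('IL', 'StateName')]
result = merge_parsed(parsed)
```

A flat mapping only makes sense when each label appears in one contiguous run, which is why the sketch raises when a label comes back later in the string.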

How to use this development code (for the nerds)

usaddress uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train the usaddress parser's model (a .crfsuite settings file) on labeled training data, and provides tools for adding new labeled training data.

Building & testing the code in this repo

To build a development version of usaddress on your machine, run the following code in your command line:

git clone https://github.com/datamade/usaddress.git  
cd usaddress  
pip install -r requirements.txt  
python setup.py develop  
parserator train training/labeled.xml usaddress

Then run the testing suite to confirm that everything is working properly:

nosetests .

Having trouble building the code? Open an issue and we'd be glad to help you troubleshoot.

Adding new training data

If usaddress is consistently failing on particular address patterns, you can adjust the parser's behavior by adding new training data to the model. Follow our guide in the training directory, and be sure to make a pull request so that we can incorporate your contribution into our next release!

Important links

Web Interface: https://parserator.datamade.us/usaddress
Python Package Distribution: https://pypi.python.org/pypi/usaddress
Python Package Documentation: https://usaddress.readthedocs.io/
API Documentation: https://parserator.datamade.us/api-docs
Repository: https://github.com/datamade/usaddress
Issues: https://github.com/datamade/usaddress/issues
Blog post: http://datamade.us/blog/parsing-addresses-with-usaddress

Team

Forest Gregg, DataMade
Cathy Deng, DataMade
Miroslav Batchkarov, University of Sussex
Jean Cochrane, DataMade

Bad Parses / Bugs

Report issues in the issue tracker

If an address was parsed incorrectly, please let us know! You can either open an issue or (if you're adventurous) add new training data to improve the parser's model. When possible, please send over a few real-world examples of similar address patterns, along with some info about the source of the data - this will help us train the parser and improve its performance.

If something in the library is not behaving intuitively, it is a bug, and should be reported.

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Send us a pull request. Bonus points for topic branches!

Copyright


usaddress's Issues

handle address components in parentheses

Feature model

Hey @cathydeng, take a look at @miria's paper on using probabilistic methods to parse addresses. https://github.com/miria/AddressTagger/blob/master/StatisticalAddressParsing.pdf

It has some nice ideas for features:

the word length,
whether the word was in all capital letters,
whether the term is completely numeric
whether the word contains at least one number.
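
The four features listed above could be computed per token roughly as follows. This is a sketch only: the function name token_features and the dictionary keys are made up here, and the real feature set in usaddress may look different.

```python
def token_features(token):
    """Return the four features from the list above for a single token."""
    return {
        'length': len(token),
        # True when every cased character is uppercase (and there is one).
        'is_all_caps': token.isupper(),
        # True only when the whole token consists of digits.
        'is_numeric': token.isdigit(),
        'has_digit': any(ch.isdigit() for ch in token),
    }


features = token_features('NW')
```

A dictionary of named features per token is the shape that CRF toolkits commonly accept, which is why the sketch returns one.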

There are also a number of valuable references we should check out for other feature ideas.

Might want to grab newline \n as token.

create a way to easily add training data

given a list of address strings, tag address tokens via the command line and produce xml file in standardized training format

Writing to xml doesn't preserve leading punctuation

in manual_labeling.py if we label

#1 Pinto drive

It gets saved as

<AddressNumber>1</AddressNumber> <StreetName>Pinto....

and the '#' gets lost.

Given full addresses, use the tagger's estimated probabilities of sequences to identify potential failures.

Depends on #13

Easily confused if StreetNamePostType omitted

Skipping a "St." or "Dr." causes it to interpret city or town names as StreetNamePostType:

>>> usaddress.parse("123 Main Sometown, New York 10101")
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('Sometown,', 'StreetNamePostType'), ('New', 'PlaceName'), ('York', 'StateName'), ('10101', 'ZipCode')]

Specify prerequisite packages?

You might consider specifying that a couple of packages have to be installed in Ubuntu/Debian for this to work. (I'm running Ubuntu 12.04.) Attempting to install it throws cryptic errors unless g++ and python-dev have been installed, like so:

apt-get install g++
apt-get install python-dev

OTOH, I only have about six months of serious Python experience, and that's mostly in writing Python (not participating in the Python software ecosystem), so perhaps this is common knowledge.

How to tag " CORNER OF CAVE & ZERKEL STS"

How to tag "905 North Kings Highway, Cherry Hill , NJ 8034"

Should "Kings Highway" really be split into street name and street type?

Given full addresses, use disagreements between our parser and a mature, rule based parser to identify hard cases and potential failures.

In particular, we can compare our parser to https://github.com/jjensenmike/python-streetaddress

indent using four spaces

@cathydeng, style issue. In our python projects, we indent using four spaces.

In testing data, don't tag '#'

Install error on Python 3 x64

I'm using Anaconda3 Python Distribution and install bombs out:

C:\Anaconda3>pip install usaddress
Downloading/unpacking usaddress
  Running setup.py (path:C:\Users\shitals\AppData\Local\Temp\pip_build_shitals\u
saddress\setup.py) egg_info for package usaddress

Downloading/unpacking python-crfsuite>=0.7 (from usaddress)
  Running setup.py (path:C:\Users\shitals\AppData\Local\Temp\pip_build_shitals\p
ython-crfsuite\setup.py) egg_info for package python-crfsuite

Installing collected packages: usaddress, python-crfsuite
  Running setup.py install for usaddress

  Running setup.py install for python-crfsuite
    error: [WinError 2] The system cannot find the file specified
    Complete output from command C:\Anaconda3\python.exe -c "import setuptools,
tokenize;__file__='C:\\Users\\shitals\\AppData\\Local\\Temp\\pip_build_shitals\\
python-crfsuite\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__
).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\sh
itals\AppData\Local\Temp\pip-xu5ktq5w-record\install-record.txt --single-version
-externally-managed --compile:
    running install

running build

running build_py

creating build

creating build\lib.win-amd64-3.4

creating build\lib.win-amd64-3.4\pycrfsuite

copying pycrfsuite\_dumpparser.py -> build\lib.win-amd64-3.4\pycrfsuite

copying pycrfsuite\_logparser.py -> build\lib.win-amd64-3.4\pycrfsuite

copying pycrfsuite\__init__.py -> build\lib.win-amd64-3.4\pycrfsuite

running build_ext

error: [WinError 2] The system cannot find the file specified

----------------------------------------
Cleaning up...
Command C:\Anaconda3\python.exe -c "import setuptools, tokenize;__file__='C:\\Us
ers\\shitals\\AppData\\Local\\Temp\\pip_build_shitals\\python-crfsuite\\setup.py
';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n',
'\n'), __file__, 'exec'))" install --record C:\Users\shitals\AppData\Local\Temp\
pip-xu5ktq5w-record\install-record.txt --single-version-externally-managed --com
pile failed with error code 1 in C:\Users\shitals\AppData\Local\Temp\pip_build_s
hitals\python-crfsuite
Storing debug log for failure in C:\Users\shitals\pip\pip.log

Let the command line manual_labeler append examples to existing xml file

From commandline

> python manual_labeler.py address_strings_to_be_labeled.csv existing_training_file.xml

handle punctuation properly in creating/parsing xml files in training format

training format here: https://github.com/datamade/us-address-parser/blob/master/training/data/README.md

in parse.py, need to fix:
-osmNaturalToTraining (creating training xml)
-osmSyntheticToTraining (creating training xml)
-parseTrainingData (parsing training xml)

Good sources of messy addresses

The contributor's address from the Illinois campaign finance data http://electionmoney.org/#receipts

How to tag "1800 M Street, NW North Tower, Ste. 700, Washington, DC 20036"

How do we handle "North Tower"?

handle intersection addresses

create IntersectionSeparator tag

decide on a common format for addresses to accept

Use conform.merge property to find failing addresses

Some of the openaddress source files have a conform.merge property that describes the subcomponents available in the source data, e.g. Westminster, CO

We can use these properties to identify sources of tagged, or partially tagged, addresses.

Let's build some code to

walk through https://github.com/openaddresses/openaddresses/tree/master/sources
if the source is American and has conform.merge properties then download the cached data, if exists
compare our tagging with the source tagging for identifying potential failures of our parser.
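
The filtering step in that plan might look like the sketch below, which works on a source file that has already been loaded into a dictionary. Only the conform.merge property comes from the description above. The 'us-' file-name prefix is an assumption based on source names such as us-ia-linn, and the field names in the sample are hypothetical.

```python
def merge_fields(source):
    """Return the list stored under conform.merge, or [] when it is absent."""
    conform = source.get('conform') or {}
    return list(conform.get('merge') or [])


def is_candidate(file_name, source):
    """True for sources that look American and declare merge fields.

    Assumption: American sources have file names starting with 'us-'.
    """
    return file_name.startswith('us-') and bool(merge_fields(source))


# Hypothetical source contents, for illustration only.
sample = {'conform': {'merge': ['NUMBER', 'STREET']}}
candidate = is_candidate('us-co-westminster.json', sample)
```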

Tests for splitting address strings

Please write a new test class for testing splitting.

Tests should include, among others,

#1 Pinto Drive--> #, 1, Pinto, Drive
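
A test class along these lines could start from the sketch below. The tokenize function is a stand-in written for this example (it treats '#' as its own token and otherwise splits on whitespace); the real splitting code in the repo should be substituted for it.

```python
import re
import unittest


def tokenize(address):
    """Hypothetical tokenizer: '#' is a separate token, the rest splits on whitespace."""
    return re.findall(r'#|[^\s#]+', address)


class TestTokenizing(unittest.TestCase):

    def test_hash_is_separate_token(self):
        self.assertEqual(tokenize('#1 Pinto Drive'),
                         ['#', '1', 'Pinto', 'Drive'])

    def test_plain_whitespace_split(self):
        self.assertEqual(tokenize('123 Main St.'),
                         ['123', 'Main', 'St.'])


if __name__ == '__main__':
    unittest.main()
```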

Complete the building of our improvement feedback loop

To build a really great parser we need to set up a feedback loop for ourselves.

The loop will include

Testing the performance of the parser on addresses
Identifying the failures
Altering the parser by
1. Changing the training corpus by
  1. changing the amount of training data
  2. changing the variety of training data
  3. changing the representativeness of the training data
2. changing the features we use for the model
3. changing the model form

After making adjustments to the parser (3), we need to be able to evaluate (1) the performance, and identify current difficult cases (2) with one command.

In the next round, adjustments to the parser (3) need to be based upon our current failures.
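
Steps (1) and (2) of the loop can be combined in a single function, sketched below: it scores a parser on labeled addresses and returns the failing cases for inspection. The parser is passed in as a callable that returns (token, label) pairs, in the same shape as usaddress.parse. The names evaluate and toy_parser and the two sample addresses are made up for this example.

```python
def evaluate(parser, labeled_examples):
    """Return (accuracy, failures) for a parser on labeled addresses.

    labeled_examples is a list of (address_string, expected_labels).
    Each failure is (address_string, predicted_labels, expected_labels).
    """
    failures = []
    for address, expected in labeled_examples:
        predicted = [label for _, label in parser(address)]
        if predicted != list(expected):
            failures.append((address, predicted, list(expected)))
    total = len(labeled_examples)
    accuracy = (total - len(failures)) / total if total else 0.0
    return accuracy, failures


def toy_parser(address):
    """Stand-in parser: digits are address numbers, everything else a street name."""
    return [(tok, 'AddressNumber' if tok.isdigit() else 'StreetName')
            for tok in address.split()]


examples = [
    ('123 Main', ['AddressNumber', 'StreetName']),
    ('123 Main St', ['AddressNumber', 'StreetName', 'StreetNamePostType']),
]
accuracy, failures = evaluate(toy_parser, examples)
```

Returning the failures alongside the score is what lets the next round of changes be based on the current hard cases.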

Testing (1 & 2)
- Start our test corpus: #8
Automated build and test #7
Patterns for ingesting new training/test data
- #6 Deciding on data format
- Directory structure for raw_data, data_scripts, training_data

decide on tags to use for address tokens

Provide options for returning probabilities of sequences and marginal probabilities per label for tagged output

The python-crfsuite.Tagger object has probability and marginal methods for this. Could be useful to expose this.

https://github.com/tpeng/python-crfsuite/blob/26ee2ce5d228b96f3d307ae9cea83b71c6ee8d58/pycrfsuite/_pycrfsuite.pyx#L541

Add the test examples from us50 to tests

depends on #3

How to tag 'Hwys 16 And 63 North, Spring Valley, MN 55975'

This is an intersection

Hwys - StreetNamePreType
16 - StreetName
And - IntersectionSeparator
63 - StreetName
North - StreetNamePostDirection

Not robust for these examples

Here are some of the examples I tried. In general,

Street directional prefix detection is not robust (people often make mistakes here).
Occupancy identifier doesn't seem to take special characters into account.
Country name is interpreted as street name.
Intersection detection doesn't work with "&".
No robustness for hints specified in brackets.
Complex city names are not handled well.
Things like "P.O. BOX" are not detected as a single token.
Etc.

>>> import usaddress
>>> usaddress.parse('123 Main St., Flr #100, Chicago, IL')
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Flr', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 Main NE St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('NE', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 Main NorthWest St, Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('NorthWest', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NE Main St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 Norh East Main St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('Norh', 'StreetName'), ('East', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100, Chicago, IL')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100 1/2, Chicago, IL')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100 1/2, Chicago, Illinoi, United States of America')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi,', 'PlaceName'), ('United', 'PlaceName'), ('States', 'PlaceName'), ('of', 'StreetName'), ('America', 'StreetName')]
>>> usaddress.parse('123 NorhEast Main St., Cabin #100 1/2, Chicago, Illinoi, US of A')
[('123', 'AddressNumber'), ('NorhEast', 'StreetName'), ('Main', 'StreetName'), ('St.,', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi,', 'PlaceName'), ('US', 'StateName'), ('of', 'StreetNamePreType'), ('A', 'StreetName')]
>>> usaddress.parse('123 NE Main St (near Lincon Cinema), Cabin #100 1/2, Chicago, Illinoi, US of A')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St', 'StreetName'), ('near', 'StreetName'), ('Lincon', 'StreetName'), ('Cinema),', 'StreetNamePostType'), ('Cabin', 'OccupancyType'), ('#', 'OccupancyIdentifier'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi,', 'PlaceName'), ('US', 'StateName'), ('of', 'StreetNamePreType'), ('A', 'StreetName')]
>>> usaddress.parse('123 NE Main St & 551A Juno Ave, Cabin #100 1/2, Chicago, Illinoi')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St', 'StreetName'), ('&', 'StreetName'), ('551A', 'StreetName'), ('Juno', 'StreetName'), ('Ave,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('123 NE Main St and 551A Juno Ave, Cabin #100 1/2, Chicago, Illinoi')
[('123', 'AddressNumber'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St', 'StreetNamePostType'), ('and', 'IntersectionSeparator'), ('551A', 'StreetName'), ('Juno', 'StreetName'), ('Ave,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('1231 1/2 NE Main St, Cabin #100 1/2, Chicago, Illinoi')
[('1231', 'AddressNumber'), ('1/2', 'AddressNumberSuffix'), ('NE', 'StreetNamePreDirectional'), ('Main', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('1231A 1/2 NE Main St, Cabin #100 1/2, Chicago, Illinoi')
[('1231A', 'AddressNumber'), ('1/2', 'StreetName'), ('NE', 'StreetName'), ('Main', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('1231A-B 1/2 NE Main St, Cabin #100 1/2, Chicago, Illinoi')
[('1231A-B', 'AddressNumber'), ('1/2', 'StreetName'), ('NE', 'StreetName'), ('Main', 'StreetName'), ('St,', 'StreetNamePostType'), ('Cabin', 'PlaceName'), ('#', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('1/2,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('501 US Hwy 81 Exit 4, Trailer #100, Chicago, Illinoi')
[('501', 'AddressNumber'), ('US', 'StreetNamePreType'), ('Hwy', 'StreetNamePreType'), ('81', 'StreetName'), ('Exit', 'StreetName'), ('4,', 'StreetName'), ('Trailer', 'StreetName'), ('#', 'StreetNamePostType'), ('100,', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('Illinoi', 'StateName')]
>>> usaddress.parse('501 Main Blvd, Sun-lo-City on the river, Illinoi')
[('501', 'AddressNumber'), ('Main', 'StreetName'), ('Blvd,', 'StreetNamePostType'), ('Sun-lo-City', 'PlaceName'), ('on', 'StreetName'), ('the', 'StreetName'), ('river,', 'StreetName'), ('Illinoi', 'PlaceName')]
>>> usaddress.parse('P.O. BOX 2555, Sun-lo-City on the river, Illinoi')
[('P.O.', 'USPSBoxType'), ('BOX', 'USPSBoxType'), ('2555,', 'USPSBoxID'), ('Sun-lo-City', 'BuildingName'), ('on', 'BuildingName'), ('the', 'BuildingName'), ('river,', 'BuildingName'), ('Illinoi', 'BuildingName')]

Set learning parameters based on out-of-set testing or cross validation

python-crfsuite can set training parameters (see trainer.set_params in this example).

Let's set the L1 and L2 parameters by using out-of-set testing or cross validation.
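
One way to do this is a small grid search with k-fold cross validation, sketched below. It assumes the L1 and L2 coefficients are the c1 and c2 parameters passed to trainer.set_params. The train_and_score callable is hypothetical: it stands for "train on these examples with these coefficients, then return a score on the held-out ones". Here it is replaced by a fake scorer so the sketch runs without python-crfsuite.

```python
from itertools import product


def k_fold_splits(items, k):
    """Yield (training, held_out) pairs, using each of k folds once as held-out data."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


def grid_search(items, train_and_score, c1_values, c2_values, k=5):
    """Return the (params, mean_score) pair with the best cross-validated score."""
    best_params, best_score = None, float('-inf')
    for c1, c2 in product(c1_values, c2_values):
        scores = [train_and_score(train, test, c1, c2)
                  for train, test in k_fold_splits(items, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = {'c1': c1, 'c2': c2}, mean_score
    return best_params, best_score


def fake_score(train, test, c1, c2):
    """Stand-in scorer that peaks at c1=0.1, c2=0.01; replace with real training."""
    return -abs(c1 - 0.1) - abs(c2 - 0.01)


best_params, best_score = grid_search(
    list(range(10)), fake_score,
    c1_values=[0.01, 0.1, 1.0], c2_values=[0.001, 0.01, 0.1], k=5)
```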

Can't manually tag "Office of General Counsel, 820 N. Michigan Avenue, Suite 750, Chicago, IL 60611"

How to handle "Office of General Counsel" (NULL type)?

Add license information for current example data

us-ia-linn labeling errors

tested the parser on openaddress data for Linn county Iowa. 1836 failures out of 95164 addresses (1.9%).

since these are all the addresses within a county, many of the failures are essentially the same errors, repeated for various address/unit numbers. I scrolled through all the failures, and here's a representative sample (address string, predicted labels, true labels):
#1.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('180 S 19th Street Ct Marion IA 52302\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#2.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1115 Indian Creek Cir Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#3.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3942 21st Avenue Pl SW Unit 4 Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#4.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('4101 16th Ave SW Trlr 26 Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#5.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1341 39th Street Pl Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#6.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2510 Heather View Cir Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#7.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('916 E Ave NW Cedar Rapids IA 52405\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#8.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1113 6th St SE D Cedar Rapids IA 52401\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#9.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1238 O Avenue Pl NE Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#10.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1300 Oakland Rd NE Bldg 5 Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#11.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3043 1/2 Leonard St NE Cedar Rapids IA 52402\n ', ('AddressNumber', 'AddressNumberSuffix', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#12.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1702 Hunters Creek Way Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#13.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('9 Chapelridge Cir Apt E Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#14.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2351 Blairs Ferry Rd NE Bldg S3 Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#15.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2415 Grey Wolf Hiawatha IA 52233\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#16.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('550 West Side Pl SW Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#17.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3397 C Avenue Ext Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
———————————————————————————————————
#18.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('4997 Hwy 13 Central City IA 52214\n ', ('AddressNumber', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetNamePreType', 'StreetName', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
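
Failures in this format are easier to read when only the disagreeing positions are shown. The sketch below takes the three parts of a failure (address string, predicted labels, true labels) and lists the tokens where the two label sequences differ. It assumes plain whitespace tokenization, which matches the samples above but not necessarily the parser's own tokenizer; the name label_diff is made up here.

```python
def label_diff(address, predicted, expected):
    """Return (index, token, predicted_label, true_label) where the labels differ."""
    tokens = address.split()
    return [(i, token, p, e)
            for i, (token, p, e) in enumerate(zip(tokens, predicted, expected))
            if p != e]


# Failure #1 from the sample above.
diff = label_diff(
    '180 S 19th Street Ct Marion IA 52302\n ',
    ('AddressNumber', 'StreetNamePreDirectional', 'StreetName',
     'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'),
    ('AddressNumber', 'StreetNamePreDirectional', 'StreetName',
     'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'))
# diff == [(3, 'Street', 'StreetNamePostType', 'StreetName'),
#          (4, 'Ct', 'PlaceName', 'StreetNamePostType')]
```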

How to tag '2029 East Highway 356 (Irving Boulevard), Irving, TX 75038'

2029 - AddressNumber
East - StreetNamePreDirectional
Highway - StreetNamePreType
356 - StreetName
(Irving Boulevard) - Null

Better handling of special characters

I hacked together a temporary solution for ampersands, 9cc63ad#diff-5253508931d87b3c5f701f1d50a83d6bR33

We need something better.

Can't manually tag "127 Public Square, 4th Flr, Cleveland, OH 44114"

"Public Square" is what?

Identify sources for tagged address strings

For example,

https://github.com/fgregg/crf/tree/master/samples/data/us50
https://github.com/jjensenmike/python-streetaddress/blob/master/test.py

We need more. The sources also need to have an MIT-compatible license.

How to tag "2020 N. LINCOLN PARK WEST"

What do we do with the west?

Script to import OSM data and parse into our format

Two scripts.

Import OSM data that has addr:full and addr:* components, so we can tag naturalistic address data
Import OSM data that has addr:street:* components (addr:street:direction, addr:street:name, etc). We can create synthetic natural addresses from these components. Let's do this second one first.

Steps.

Grab data from http://overpass.osm.rambler.ru/ http://stackoverflow.com/questions/16625584/trying-to-send-xml-file-using-python-requests-to-openstreetmap-overpass-api
Parse
Profit

generating xml when first token is tagged NULL/None

e.g. ('#', None), ('1', 'AddressNumber') -> # 1

addr2XML in manual_labeling.py can't handle this b/c when generating XML, token text tagged as None is appended to the tail of the previous token. this breaks when the first token is tagged as None b/c there is no previous token. Not sure how to address this w/ ElementTree (how to set a child of AddressString that is not an xml element & also not the tail of a child xml element)

in the case above, '#' should probably actually be labeled as AddressNumber, in which case it wouldn't be an issue. are there cases when an address would have its first token validly tagged None? Should we just get rid of the None tag & tail stuff and put all tokens inside xml tags?
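
On the ElementTree question: text that appears before the first child element is stored in the parent's text attribute, so a leading untagged token does not need a previous sibling. The sketch below uses that. It is not the repo's addr2XML; the function name tokens_to_xml is made up, and only the AddressString element name is taken from the description above.

```python
import xml.etree.ElementTree as ET


def tokens_to_xml(labeled_tokens):
    """Serialize (token, label) pairs; tokens labeled None become plain text."""
    root = ET.Element('AddressString')
    last_child = None
    for token, label in labeled_tokens:
        if label is not None:
            last_child = ET.SubElement(root, label)
            last_child.text = token
            last_child.tail = ' '
        elif last_child is None:
            # No previous element yet: leading text belongs to the parent.
            root.text = (root.text or '') + token + ' '
        else:
            # Untagged text after an element goes in that element's tail.
            last_child.tail += token + ' '
    # Drop the trailing space left after the final token.
    if last_child is not None:
        last_child.tail = last_child.tail.rstrip() or None
    elif root.text:
        root.text = root.text.rstrip()
    return ET.tostring(root, encoding='unicode')


xml_text = tokens_to_xml([('#', None), ('1', 'AddressNumber'),
                          ('Pinto', 'StreetName')])
```

For the example in this issue, xml_text keeps the '#' as text directly inside AddressString, ahead of the AddressNumber element.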

Split "6257A" -> ['6257', 'A']

also '123-B' to ['123', 'B']
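
A regular expression can cover both cases. The sketch below is one possible rule, not the tokenizer's actual behavior: it only splits when digits are followed by a single uppercase letter (with an optional hyphen), so ordinals such as '19th' stay whole. The function name split_number_suffix is made up here.

```python
import re

# Digits, an optional hyphen, then exactly one uppercase letter.
_NUMBER_SUFFIX = re.compile(r'^(\d+)-?([A-Z])$')


def split_number_suffix(token):
    """Split '6257A' or '123-B' into number and letter; leave other tokens alone."""
    match = _NUMBER_SUFFIX.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


pieces = split_number_suffix('6257A')
```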

New feature

Lowercase the word and remove leading or trailing punctuation (maybe interior punctuation too).
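
A sketch of that feature, with interior punctuation removal as an option since the issue leaves it open. The function name normalize_token and the strip_interior flag are made up for this example.

```python
import string


def normalize_token(token, strip_interior=False):
    """Lowercase a token and strip punctuation from both ends.

    With strip_interior=True, punctuation inside the token is removed too.
    """
    word = token.lower().strip(string.punctuation)
    if strip_interior:
        word = ''.join(ch for ch in word if ch not in string.punctuation)
    return word


normalized = normalize_token('St.,')
```

With the default settings 'P.O.' becomes 'p.o'; with strip_interior=True it becomes 'po'.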

comma should separate "5835 Peachtree Corners East,Suite A"

missing usaddr.crfsuite

getting an error when I import usaddress:

File "usaddress/__init__.py", line 6, in <module>
    + '/usaddr.crfsuite')
IOError: [Errno 2] No such file or directory: '/Users/derekeder/projects/datamade/us-address-parser/usaddress/usaddr.crfsuite'

Is this file supposed to be generated locally somehow?

In test data, let's put periods inside of tags

<StreetNamePredirectional>S.</StreetNamePredirectional>

instead of

<StreetNamePredirectional>S</StreetNamePredirectional>.

Test class for list2xml

Please write a test class for testing list2xml

Among others, please test that

('#', 'foo'), ('1', 'foo'), ('Pinto', 'foo') --> <foo>#</foo> <foo>1</foo> <foo>Pinto</foo>

and

('#', NULL), ('1', 'foo'), ('Pinto', 'foo') --> # <foo>1</foo> <foo>Pinto</foo>

Write a build script and test script.

In root repository, simple script to train current model and run tests.

The tagger needs to be trained on atomic tokens

Right now some of the 'tokens' that you are using are phrases like

[('Homer Spit Road,', 'street'), ('Homer,', 'city'), ('AK', 'state'), ('99603', 'zip')]

In general, we won't know that 'Homer Spit Road' all belongs together. Indeed, that's what we are trying to learn.

So, use training data of this form:

[('Homer', 'street'), ('Spit', 'street'), ('Road,', 'street'), ('Homer,', 'city'), ('AK', 'state'), ('99603', 'zip')]
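
Converting the existing phrase-level data into this form is mechanical: split each phrase on whitespace and repeat its label for every word. A minimal sketch, with the function name to_atomic made up here:

```python
def to_atomic(tagged_phrases):
    """Expand (phrase, label) pairs into one (word, label) pair per word."""
    return [(word, label)
            for phrase, label in tagged_phrases
            for word in phrase.split()]


phrases = [('Homer Spit Road,', 'street'), ('Homer,', 'city'),
           ('AK', 'state'), ('99603', 'zip')]
atomic = to_atomic(phrases)
```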

test.py
us50.test.tagged
us50.train.tagged