
graal-research / deepparse


Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

License: GNU Lesser General Public License v3.0

Python 99.39% Shell 0.44% Dockerfile 0.17%
addresses-parsing machine-learning python

deepparse's People

Contributors

ajndkr, davebulaval, dependabot[bot], freud14, gamesetandmatch, hayatosempai, mayas3, yogeshchandrasekharuni


deepparse's Issues

Generating training data for USPS address parsing

Note: I may be able to make whatever training dataset I create available under an open license; it's a goal for sure.

I'm trying to create a good dataset for training a parser for USPS (United States) addresses using proper USPS address parts. I have both an enormous dataset (millions) of clean, correctly labeled addresses and a smaller set (hundreds of thousands) of dirty addresses that have been matched to clean addresses and labeled accordingly.

Those clean addresses all use official abbreviations for Street Type, Street Pre Direction, Unit Type, etc.; they also have few missing address parts. I imagine I'd want to synthetically generate non-abbreviated representations of each possible abbreviated term; my questions are:

  1. Given I have labeled addresses for 90% of the entire US, how do I determine the point of diminishing returns for the size of the training set?
  2. What portion of the training dataset should consist of the alternative non-abbreviated terms?

My initial idea is to randomly select 20 addresses from every 5-digit zip code, resulting in ~1M addresses, then synthetically modify about 50% of them to use the alternative, non-abbreviated forms of the official abbreviated terms.
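A rough sketch of that sampling plus abbreviation expansion (hypothetical column names and file path, nothing deepparse-specific):

import pandas as pd

# Illustrative subset of official USPS abbreviations; the real mapping is much larger.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE", "BLVD": "BOULEVARD"}

df = pd.read_csv("clean_addresses.csv")  # placeholder path

# Up to 20 addresses per 5-digit ZIP code (~1M rows nationwide).
sample = df.groupby("zip5", group_keys=False).apply(lambda g: g.sample(min(len(g), 20)))

# Expand the abbreviated street type on roughly half of the sampled rows.
to_expand = sample.sample(frac=0.5).index
sample.loc[to_expand, "street_type"] = sample.loc[to_expand, "street_type"].replace(ABBREVIATIONS)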

'tuple' object has no attribute 'lower' in address_parser

Describe the bug
I am running the example from the documentation (https://deepparse.org/examples/parse_addresses.html) but I get an error when the library tries to convert a tuple object to lower case.

To Reproduce
parsed_addresses = address_parser(test_data[0:300])
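A possible cause (an assumption, not confirmed in the issue): the example's test data is loaded in the retraining format, i.e. (address, tags) tuples, while the parser expects plain address strings. A minimal sketch of the workaround under that assumption:

# Keep only the address strings before calling the parser.
addresses = [address for address, tags in test_data[0:300]]
parsed_addresses = address_parser(addresses)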

[BUG] Converting the output of multiple addresses to a table

I am trying to use the following code to extract the individual address elements and put them into a dataframe:

from deepparse.parser import AddressParser
address_parser = AddressParser(model_type="bpemb", device=0)
import pandas as pd

# parse multiple addresses
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "350 rue des Lilas Ouest Québec Québec G1L 1B6"])

print(parsed_address) #to look at the output

pd.DataFrame(parsed_address["ParsedAddress"].to_list(), columns=['street_number', 'street_name', 'municipality', 'province','postal_code'])

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 #df3 = pd.DataFrame(columns=['street_number', 'street_name', 'municipality', 'province','postal_code'])
      2 #d_list = []
----> 3 pd.DataFrame(parsed_address["ParsedAddress"].to_list(), columns=['street_number', 'street_name', 'municipality', 'province','postal_code'])

TypeError: 'ParsedAddress' object is not subscriptable

Any ideas on what I should I do? Thank you!
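One possible direction (a sketch, not an official deepparse answer): iterate over the returned ParsedAddress objects and build each row from address_parsed_components, the attribute shown in the comma-performance issue further down, assuming it yields (word, (tag, probability)) pairs when with_prob=True:

import pandas as pd
from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="bpemb", device=0)

parsed = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "350 rue des Lilas Ouest Québec Québec G1L 1B6"],
    with_prob=True)

# One row per address: tag -> space-joined words carrying that tag.
rows = []
for parsed_address in parsed:
    row = {}
    for word, (tag, _prob) in parsed_address.address_parsed_components:
        row[tag] = f"{row[tag]} {word}" if tag in row else word
    rows.append(row)

df = pd.DataFrame(rows)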

Repeated words get overwritten in the output.

See the example shown on the website (screenshot omitted).

It seems that the second "Québec" overwrites the first one, which is supposed to be the municipality. It's probably because of the dict that associates the words with the class.
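A minimal illustration of the suspected cause, in plain Python: a dict keyed by word keeps only the last tag when the same word appears twice in an address.

tags = {}
for word, tag in [("Québec", "Municipality"), ("Québec", "Province")]:
    tags[word] = tag
print(tags)  # {'Québec': 'Province'} – the Municipality entry is overwritten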

[improvements] Models improvements

Data augmentations:

  • commas in addresses (#56)
    - [ ] order noise [WIP]
  • incomplete address noise (no postal code, no city) (#78)

Tags improvement:

- [ ] Adding a Country tag

  • Retrain new tags (#81)

New models:

  • attention model (#105)

Smaller models:

  • half size for hidden size (#103)

Retrain Issue with "bpemb" model

While retraining with the "bpemb" model_type, the model is not running at all. It gets stuck forever. No checkpoints are being created either. What could be the reason?


eval method not necessary in networks

Hi,
I noticed that your networks all contain an eval() method. However, this method is not necessary since PyTorch already applies eval() recursively to all submodules. I think these overrides should be removed.
Thank you.
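A small demonstration of the PyTorch behaviour referred to here: calling eval() on a parent module already switches every submodule to evaluation mode, so per-network overrides are redundant.

import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
model.eval()
print(all(not m.training for m in model.modules()))  # True: every submodule is in eval mode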

[BUG] Cutting of predictions

Since we don't sort the sequences (and their lengths), we were taking the length of the first element as the maximum length. So in most cases we were cutting the sequences to tag.
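A minimal sketch of the bug and the fix, with hypothetical variable names: the first length is only the maximum when the sequences are sorted by length, so the maximum over all lengths should be used instead.

sequences = [["350", "rue", "des", "Lilas"], ["G1L", "1B6"], ["350", "rue", "des", "Lilas", "Ouest"]]
lengths = [len(seq) for seq in sequences]
buggy_max_len = lengths[0]    # 4 – would truncate the 5-token sequence
fixed_max_len = max(lengths)  # 5 – nothing gets cut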

Use memory mapping when loading embeddings

One idea for a future release would be to load the embeddings via memory mapping instead of loading them all into memory.

For fasttext, it seems that the Fasttext API does not support memory mapping. However, gensim seems to support it but not with the fasttext format. So, either we save the current embeddings in a format readable by memory mapping in the gensim API and we upload them somewhere (GRAIL website server???) or we take embeddings provided by gensim and we retrain a model with them.

For BPEmb, I haven't checked but it's less bad with regard to memory usage.
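A sketch of the gensim side of the idea, assuming the current embeddings are first re-saved in gensim's native KeyedVectors format (placeholder paths, not actual deepparse files):

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# One-time conversion to gensim's native format.
load_facebook_vectors("fasttext.bin").save("embeddings.kv")

# Afterwards the vectors can be memory-mapped instead of fully loaded into RAM.
mmap_vectors = KeyedVectors.load("embeddings.kv", mmap="r")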

Few comments

Hi,
Here is a few comments on the library.

Let's start with the big one. I noticed that the documentation website documents almost everything in the library. Is it on purpose? I mean, in software engineering, if you document it, it means that you intend to support all these different parts and that their interfaces should remain stable. In my opinion, you should document only the parts that are essential to the use of the library and keep the other parts only as backends. Maybe it is because you want to support training, as we can see in issue #11, so you intend for the user to use these parts for training? Anyway, just some thoughts for you. I would like to know what your intentions are.

Now, a few comments on the code I looked at.

https://github.com/MAYAS3/deepparse/blob/67674b892aa0809e3d8d4c1303624212aec13d7d/deepparse/parser/address_parser.py#L47

I think there should be default values for the parameters of the AddressParser class. What I would suggest is that it use device 0 by default if it exists, and otherwise just use the CPU. Maybe choose a default model too.
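A sketch of that default using only public PyTorch calls (the parameter names match the AddressParser calls shown elsewhere in these issues):

import torch
from deepparse.parser import AddressParser

# Hypothetical default: first GPU when available, otherwise fall back to the CPU.
default_device = 0 if torch.cuda.is_available() else "cpu"
address_parser = AddressParser(model_type="bpemb", device=default_device)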

https://github.com/MAYAS3/deepparse/blob/67674b892aa0809e3d8d4c1303624212aec13d7d/deepparse/parser/address_parser.py#L90

When tagging an address, it would be nice if the return was a dictionary where the keys are the tags and the values are the words. For instance, instead of this:

{'350 rue des Lilas Ouest Québec Québec G1L 1B6': {'350': 'StreetNumber',
  'rue': 'StreetName',
  'des': 'StreetName',
  'Lilas': 'StreetName',
  'Ouest': 'StreetName',
  'Québec': 'Province',
  'G1L': 'PostalCode',
  '1B6': 'PostalCode'}}

it could be something like this:

{'350 rue des Lilas Ouest Québec Québec G1L 1B6': {'StreetNumber': '350',
  'StreetName': 'rue des Lilas Ouest',
  'Province': 'Québec',
  'PostalCode': 'G1L 1B6'}}

Notice how some tags were merged and how the keys and values are inverted. Maybe there could be a flag if you want it the other way. Or, better yet, you could return an object.
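A sketch of the suggested inversion in plain Python (not a deepparse API): merge consecutive words that share a tag and key the result by tag instead of by word.

word_tags = [("350", "StreetNumber"), ("rue", "StreetName"), ("des", "StreetName"),
             ("Lilas", "StreetName"), ("Ouest", "StreetName"), ("Québec", "Province"),
             ("G1L", "PostalCode"), ("1B6", "PostalCode")]

inverted = {}
for word, tag in word_tags:
    inverted[tag] = f"{inverted[tag]} {word}" if tag in inverted else word
print(inverted)
# {'StreetNumber': '350', 'StreetName': 'rue des Lilas Ouest',
#  'Province': 'Québec', 'PostalCode': 'G1L 1B6'}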

Alright, that's it for now.

[BUG] Received "TypeError: can't pickle fasttext_pybind.fasttext objects" when trying to retrain

Describe the bug

I was following the retrain instruction on the page, https://deepparse.org/examples/fine_tuning.html and I received the below error messages.

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\janch.conda\envs\py36\lib\site-packages\deepparse\parser\address_parser.py", line 327, in retrain
callbacks=callbacks)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 477, in train
return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 618, in _train
return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 575, in fit_generator
self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 652, in _fit_generator_one_batch_per_step
for step, (x, y) in train_step_iterator:
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 75, in iter
for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 19, in cycle
for x in iterable:
File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 355, in iter
return self._get_iterator()
File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 914, in init
w.start()
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\popen_spawn_win32.py", line 65, in init
reduction.dump(process_obj, to_child)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle fasttext_pybind.fasttext objects

  • OS: Windows
  • Python 3.6
  • Running on CPU only
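A likely workaround on Windows (an assumption based on the traceback, not a confirmed fix): DataLoader workers are started with spawn and must pickle the fastText object, so keeping data loading in the main process avoids the error. The retrain call accepts num_workers, as shown in the retrain issue further down.

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8, num_workers=0)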

[BUG] ImportError when importing AddressParser

Hi,

I am using Windows and the latest version of Anaconda

from deepparse.parser import AddressParser

Generates the following errors:

C:\Users\Owner\Anaconda3\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
----> 1 from deepparse.parser import AddressParser

~\Anaconda3\lib\site-packages\deepparse\parser\__init__.py in
1 # pylint: disable=wildcard-import
----> 2 from .address_parser import *
3 from .parsed_address import *

~\Anaconda3\lib\site-packages\deepparse\parser\address_parser.py in
10 from .. import CACHE_PATH, handle_checkpoint, indices_splitting
11 from .. import load_tuple_to_device, download_fasttext_magnitude_embeddings
---> 12 from ..converter import TagsConverter
13 from ..converter import fasttext_data_padding, bpemb_data_padding, DataTransform
14 from ..dataset_container import DatasetContainer

~\Anaconda3\lib\site-packages\deepparse\converter\__init__.py in
2 from .data_padding import *
3 from .target_converter import *
----> 4 from .data_transform import *

~\Anaconda3\lib\site-packages\deepparse\converter\data_transform.py in
3 from . import fasttext_data_padding_teacher_forcing, bpemb_data_padding_teacher_forcing,
4 bpemb_data_padding_with_target, fasttext_data_padding_with_target
----> 5 from ..vectorizer import TrainVectorizer
6
7

~\Anaconda3\lib\site-packages\deepparse\vectorizer\__init__.py in
1 # pylint: disable=wildcard-import
----> 2 from .bpemb_vectorizer import *
3 from .fasttext_vectorizer import *
4 from .magnitude_vectorizer import *
5 from .train_vectorizer import *

~\Anaconda3\lib\site-packages\deepparse\vectorizer\bpemb_vectorizer.py in
3 import numpy as np
4
----> 5 from .vectorizer import Vectorizer
6 from ..embeddings_models.embeddings_model import EmbeddingsModel
7

~\Anaconda3\lib\site-packages\deepparse\vectorizer\vectorizer.py in
2 from typing import List
3
----> 4 from ..embeddings_models.embeddings_model import EmbeddingsModel
5
6

~\Anaconda3\lib\site-packages\deepparse\embeddings_models\__init__.py in
2 from .bpemb_embeddings_model import *
3 from .embeddings_model import *
----> 4 from .fasttext_embeddings_model import *
5 from .magnitude_embeddings_model import *

~\Anaconda3\lib\site-packages\deepparse\embeddings_models\fasttext_embeddings_model.py in
1 import platform
2
----> 3 from gensim.models.fasttext import load_facebook_vectors
4 from numpy.core.multiarray import ndarray
5

ImportError: cannot import name 'load_facebook_vectors' from 'gensim.models.fasttext' (C:\Users\Owner\Anaconda3\lib\site-packages\gensim\models\fasttext.py)

Any ideas? Thank you!

Sincerely,

tom
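A plausible cause (an assumption): load_facebook_vectors only exists in gensim 3.8 and later, so an older gensim pulled in by Anaconda would raise exactly this ImportError. A quick check:

import gensim
print(gensim.__version__)  # if this is older than 3.8, upgrade gensim
from gensim.models.fasttext import load_facebook_vectors  # should then import cleanly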


[Question] Training noisy data from another country?

If I have a large dataset with noisy raw addresses and the correctly parsed result for each one, how do I get started training deepparse on it?

The raw+result data I have is currently in CSV format, but with a bit of scripting I can easily transform it into another format. I just don't completely understand how to train Deepparse on this.
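A sketch of that conversion (hypothetical CSV column names): the retraining format described in the doc, and in the retrain issue below, is a pickled list of ('address text', [one tag per word]) tuples.

import pickle
import pandas as pd

df = pd.read_csv("noisy_addresses.csv")  # placeholder path

# Assumes a raw_address column and a tags column holding space-separated tags.
data = [(row["raw_address"], row["tags"].split()) for _, row in df.iterrows()]

with open("training_data.p", "wb") as f:
    pickle.dump(data, f)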

[performance] comma

Addresses with comma seem to lower performance.

from deepparse.parser import AddressParser

dp = AddressParser(model_type="bpemb", device=0)

dp("2020 boul. René-Lévesques, Montréal, QC, Canada", with_prob=True).address_parsed_components
#> [('2020', ('PostalCode', 0.8566)),
#>  ('boul.', ('Province', 0.7204)),
#>  ('René-Lévesques,', ('StreetName', 0.7636)),
#>  ('Montréal,', ('StreetName', 0.9614)),
#>  ('QC,', ('StreetName', 0.7382)),
#>  ('Canada', ('Province', 0.5126))]

parsed_address = dp("2020 boul. René-Lévesques Montréal QC", with_prob=True)

>>> print(parsed_address.address_parsed_components)

[('2020', ('PostalCode', 0.9467)), ('boul.', ('StreetName', 0.9895)), ('René-Lévesques', ('StreetName', 0.9602)), ('Montréal', ('Municipality', 0.9965)), ('QC', ('Province', 0.9999))]
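A simple workaround sketch until the models are retrained with comma noise (plain string cleanup, not a deepparse feature):

raw = "2020 boul. René-Lévesques, Montréal, QC, Canada"
parsed_address = dp(raw.replace(",", ""), with_prob=True)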

fasttext is the lightest and bpemb is the most accurate?

Hi,
The fasttext model needs 9 GB of RAM to be loaded, but it is the most accurate if my memory serves? So it would be the most accurate but not the lightest. Also, I think it should be mentioned somewhere in the doc that it takes that much memory to be loaded.
Thank you.

fastText pre-trained embeddings

Since the download of the pre-trained fastText embeddings is pretty long, I think we should add a note in the doc to make this clearer to the user.

Also, should we print a message or warning when we download the model, as gensim does? The reason being that, in my case, when using a remote terminal in PyCharm, the updated print does not show.

@MAYAS3 @freud14

LGPL licence

I'd like to contribute to the project, but the LGPL license is quite restrictive because all derivatives can only be redistributed under the LGPL v3 license (as I understand it). I'm not sure it is a complete blocker, but it might be an issue down the road. Would you consider relicensing the project under a slightly more permissive license, such as GPL or MIT (or another)?

[RuntimeError] Retrain Error

Hi, I got this error when I tried to retrain the model. What could be possible causes?

RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1

I used this code setting

address_parser = AddressParser(model_type="best", device=0)
lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])

I have transformed my training data into a pickle file with the right format, as in the example in the doc: a list of tuples ('address text', [list of tags corresponding to each word]). Moreover, I have already made sure that the number of words in each address matches the number of elements in its corresponding tag list.
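One thing worth ruling out programmatically across the whole file (a sketch with a placeholder path): a single row where the word count and tag count disagree can produce exactly this kind of tensor size mismatch during batch padding.

import pickle

with open("training_data.p", "rb") as f:
    data = pickle.load(f)

for address, tags in data:
    assert len(address.split()) == len(tags), f"length mismatch for: {address}"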
