
graal-research / deepparse


Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

License: GNU Lesser General Public License v3.0

Python 99.39% Shell 0.44% Dockerfile 0.17%
addresses-parsing machine-learning python

deepparse's People

Contributors

ajndkr, davebulaval, dependabot[bot], freud14, gamesetandmatch, hayatosempai, mayas3, yogeshchandrasekharuni


deepparse's Issues

Generating training data for USPS address parsing

Note: I may be able to make whatever training dataset I create available under an open license; it's a goal for sure.

I'm trying to create a good dataset for training a parser for USPS (United States) addresses using proper USPS address parts. I have both an enormous dataset (millions) of clean, correctly labeled addresses and a smaller set (hundreds of thousands) of dirty addresses that have been matched to clean addresses and labeled accordingly.

Those clean addresses all use official abbreviations for Street Type, Street Pre Direction, Unit Type, etc.; they also have few missing address parts. I imagine I'd want to synthetically generate non-abbreviated representations of each possible abbreviated term; my questions are:

  1. Given I have labeled addresses for 90% of the entire US, how do I determine the point of diminishing returns for the size of the training set?
  2. What portion of the training dataset should consist of the alternative non-abbreviated terms?

My initial idea is to randomly select 20 addresses from every 5-digit zip code, resulting in ~1M addresses, then synthetically modify about 50% of them to use the alternative, non-abbreviated forms of the official abbreviated terms.
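A rough sketch of that sampling plus abbreviation expansion (hypothetical column names and file path, nothing deepparse-specific):

import pandas as pd

# Illustrative subset of official USPS abbreviations; the real mapping is much larger.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE", "BLVD": "BOULEVARD"}

df = pd.read_csv("clean_addresses.csv")  # placeholder path

# Up to 20 addresses per 5-digit ZIP code (~1M rows nationwide).
sample = df.groupby("zip5", group_keys=False).apply(lambda g: g.sample(min(len(g), 20)))

# Expand the abbreviated street type on roughly half of the sampled rows.
to_expand = sample.sample(frac=0.5).index
sample.loc[to_expand, "street_type"] = sample.loc[to_expand, "street_type"].replace(ABBREVIATIONS)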

'tuple' object has no attribute 'lower' in address_parser

Describe the bug
I am running the example from the documentation (https://deepparse.org/examples/parse_addresses.html) but I get an error when the library tries to convert a tuple object to lower case.

To Reproduce
parsed_addresses = address_parser(test_data[0:300])
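A possible cause (an assumption, not confirmed in the issue): the example's test data is loaded in the retraining format, i.e. (address, tags) tuples, while the parser expects plain address strings. A minimal sketch of the workaround under that assumption:

# Keep only the address strings before calling the parser.
addresses = [address for address, tags in test_data[0:300]]
parsed_addresses = address_parser(addresses)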

[BUG] Converting the output of multiple addresses to a table

I am trying to use the following code to extract the individual address elements and put them into a dataframe:

from deepparse.parser import AddressParser
address_parser = AddressParser(model_type="bpemb", device=0)
import pandas as pd

# parse multiple addresses
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "350 rue des Lilas Ouest Québec Québec G1L 1B6"])

print(parsed_address) #to look at the output

pd.DataFrame(parsed_address["ParsedAddress"].to_list(), columns=['street_number', 'street_name', 'municipality', 'province','postal_code'])

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 #df3 = pd.DataFrame(columns=['street_number', 'street_name', 'municipality', 'province','postal_code'])
      2 #d_list = []
----> 3 pd.DataFrame(parsed_address["ParsedAddress"].to_list(), columns=['street_number', 'street_name', 'municipality', 'province','postal_code'])

TypeError: 'ParsedAddress' object is not subscriptable

Any ideas on what I should I do? Thank you!
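One possible direction (a sketch, not an official deepparse answer): iterate over the returned ParsedAddress objects and build each row from address_parsed_components, the attribute shown in the comma-performance issue further down, assuming it yields (word, (tag, probability)) pairs when with_prob=True:

import pandas as pd
from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="bpemb", device=0)

parsed = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "350 rue des Lilas Ouest Québec Québec G1L 1B6"],
    with_prob=True)

# One row per address: tag -> space-joined words carrying that tag.
rows = []
for parsed_address in parsed:
    row = {}
    for word, (tag, _prob) in parsed_address.address_parsed_components:
        row[tag] = f"{row[tag]} {word}" if tag in row else word
    rows.append(row)

df = pd.DataFrame(rows)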

Repeated words get overwritten in the output.

See the example shown on the website (screenshot omitted).

It seems that the second "Québec" overwrites the first one, which is supposed to be the municipality. It's probably because of the dict that associates the words with the class.
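A minimal illustration of the suspected cause, in plain Python: a dict keyed by word keeps only the last tag when the same word appears twice in an address.

tags = {}
for word, tag in [("Québec", "Municipality"), ("Québec", "Province")]:
    tags[word] = tag
print(tags)  # {'Québec': 'Province'} – the Municipality entry is overwritten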

[improvements] Models improvements

Data augmentations:

  • commas in addresses (#56)
    - [ ] order noise [WIP]
  • incomplete address noise (no postal code, no city) (#78)

Tags improvement:

- [ ] Adding a Country tag

  • Retrain new tags (#81)

New models:

  • attention model (#105)

Smaller models:

  • half size for hidden size (#103)

Retrain Issue with "bpemb" model

While retraining with the "bpemb" model_type, the model is not running at all. It gets stuck forever. No checkpoints are being created either. What could be the reason?


eval method not necessary in networks

Hi,
I noticed that your networks all contain an eval() method. However, this method is not necessary since PyTorch already applies eval() recursively to all submodules. I think these overrides should be removed.
Thank you.
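A small demonstration of the PyTorch behaviour referred to here: calling eval() on a parent module already switches every submodule to evaluation mode, so per-network overrides are redundant.

import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
model.eval()
print(all(not m.training for m in model.modules()))  # True: every submodule is in eval mode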

[BUG] Cutting of predictions

Since we don't sort the sequences (and their lengths), we were taking the length of the first element as the maximum length. So in most cases we were cutting the sequences to tag.
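A minimal sketch of the bug and the fix, with hypothetical variable names: the first length is only the maximum when the sequences are sorted by length, so the maximum over all lengths should be used instead.

sequences = [["350", "rue", "des", "Lilas"], ["G1L", "1B6"], ["350", "rue", "des", "Lilas", "Ouest"]]
lengths = [len(seq) for seq in sequences]
buggy_max_len = lengths[0]    # 4 – would truncate the 5-token sequence
fixed_max_len = max(lengths)  # 5 – nothing gets cut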

Use memory mapping when loading embeddings

One idea for a future release would be to load the embeddings via memory mapping instead of loading them all into memory.

For fasttext, it seems that the Fasttext API does not support memory mapping. However, gensim seems to support it but not with the fasttext format. So, either we save the current embeddings in a format readable by memory mapping in the gensim API and we upload them somewhere (GRAIL website server???) or we take embeddings provided by gensim and we retrain a model with them.

For BPEmb, I haven't checked but it's less bad with regard to memory usage.
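A sketch of the gensim side of the idea, assuming the current embeddings are first re-saved in gensim's native KeyedVectors format (placeholder paths, not actual deepparse files):

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# One-time conversion to gensim's native format.
load_facebook_vectors("fasttext.bin").save("embeddings.kv")

# Afterwards the vectors can be memory-mapped instead of fully loaded into RAM.
mmap_vectors = KeyedVectors.load("embeddings.kv", mmap="r")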

Few comments

Hi,
Here is a few comments on the library.

Let's start with the big one. I noticed that the documentation website documents almost everything in the library. Is it on purpose? I mean, in software engineering, if you document it, it means that you intend to support all these different parts and that their interfaces should remain stable. In my opinion, you should document only the parts that are essential to the use of the library and keep the other parts only as backends. Maybe it is because you want to support training, as we can see in issue #11, so you intend for the user to use these parts for training? Anyway, just some thoughts for you. I would like to know what your intentions are.

Now, a few comments on the code I looked at.

https://github.com/MAYAS3/deepparse/blob/67674b892aa0809e3d8d4c1303624212aec13d7d/deepparse/parser/address_parser.py#L47

I think there should be default values for the parameters of the AddressParser class. What I would suggest is that it use device 0 by default if it exists, and otherwise just use the CPU. Maybe choose a default model too.
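A sketch of that default using only public PyTorch calls (the parameter names match the AddressParser calls shown elsewhere in these issues):

import torch
from deepparse.parser import AddressParser

# Hypothetical default: first GPU when available, otherwise fall back to the CPU.
default_device = 0 if torch.cuda.is_available() else "cpu"
address_parser = AddressParser(model_type="bpemb", device=default_device)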

https://github.com/MAYAS3/deepparse/blob/67674b892aa0809e3d8d4c1303624212aec13d7d/deepparse/parser/address_parser.py#L90

When tagging an address, it would be nice if the return was a dictionary where the keys are the tags and the values are the words. For instance, instead of this:

{'350 rue des Lilas Ouest Québec Québec G1L 1B6': {'350': 'StreetNumber',
  'rue': 'StreetName',
  'des': 'StreetName',
  'Lilas': 'StreetName',
  'Ouest': 'StreetName',
  'Québec': 'Province',
  'G1L': 'PostalCode',
  '1B6': 'PostalCode'}}

it could be something like this:

{'350 rue des Lilas Ouest Québec Québec G1L 1B6': {'StreetNumber': '350',
  'StreetName': 'rue des Lilas Ouest',
  'Province': 'Québec',
  'PostalCode': 'G1L 1B6'}}

Notice how some tags were merged and how the keys and values are inverted. Maybe there could be a flag if you want it the other way. Or, better yet, you could return an object.
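A sketch of the suggested inversion in plain Python (not a deepparse API): merge consecutive words that share a tag and key the result by tag instead of by word.

word_tags = [("350", "StreetNumber"), ("rue", "StreetName"), ("des", "StreetName"),
             ("Lilas", "StreetName"), ("Ouest", "StreetName"), ("Québec", "Province"),
             ("G1L", "PostalCode"), ("1B6", "PostalCode")]

inverted = {}
for word, tag in word_tags:
    inverted[tag] = f"{inverted[tag]} {word}" if tag in inverted else word
print(inverted)
# {'StreetNumber': '350', 'StreetName': 'rue des Lilas Ouest',
#  'Province': 'Québec', 'PostalCode': 'G1L 1B6'}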

Alright, that's it for now.

[BUG] Received "TypeError: can't pickle fasttext_pybind.fasttext objects" when trying to retrain

Describe the bug

I was following the retrain instruction on the page, https://deepparse.org/examples/fine_tuning.html and I received the below error messages.

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\janch.conda\envs\py36\lib\site-packages\deepparse\parser\address_parser.py", line 327, in retrain
callbacks=callbacks)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 477, in train
return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 618, in _train
return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 575, in fit_generator
self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 652, in _fit_generator_one_batch_per_step
for step, (x, y) in train_step_iterator:
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 75, in iter
for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 19, in cycle
for x in iterable:
File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 355, in iter
return self._get_iterator()
File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 914, in init
w.start()
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\popen_spawn_win32.py", line 65, in init
reduction.dump(process_obj, to_child)
File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle fasttext_pybind.fasttext objects

  • OS: Windows
  • Python 3.6
  • Running on CPU only
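A likely workaround on Windows (an assumption based on the traceback, not a confirmed fix): DataLoader workers are started with spawn and must pickle the fastText object, so keeping data loading in the main process avoids the error. The retrain call accepts num_workers, as shown in the retrain issue further down.

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8, num_workers=0)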

[BUG] ImportError when importing AddressParser

Hi,

I am using Windows and the latest version of Anaconda

from deepparse.parser import AddressParser

Generates the following errors:

C:\Users\Owner\Anaconda3\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
----> 1 from deepparse.parser import AddressParser

~\Anaconda3\lib\site-packages\deepparse\parser\__init__.py in
1 # pylint: disable=wildcard-import
----> 2 from .address_parser import *
3 from .parsed_address import *

~\Anaconda3\lib\site-packages\deepparse\parser\address_parser.py in
10 from .. import CACHE_PATH, handle_checkpoint, indices_splitting
11 from .. import load_tuple_to_device, download_fasttext_magnitude_embeddings
---> 12 from ..converter import TagsConverter
13 from ..converter import fasttext_data_padding, bpemb_data_padding, DataTransform
14 from ..dataset_container import DatasetContainer

~\Anaconda3\lib\site-packages\deepparse\converter\__init__.py in
2 from .data_padding import *
3 from .target_converter import *
----> 4 from .data_transform import *

~\Anaconda3\lib\site-packages\deepparse\converter\data_transform.py in
3 from . import fasttext_data_padding_teacher_forcing, bpemb_data_padding_teacher_forcing,
4 bpemb_data_padding_with_target, fasttext_data_padding_with_target
----> 5 from ..vectorizer import TrainVectorizer
6
7

~\Anaconda3\lib\site-packages\deepparse\vectorizer\__init__.py in
1 # pylint: disable=wildcard-import
----> 2 from .bpemb_vectorizer import *
3 from .fasttext_vectorizer import *
4 from .magnitude_vectorizer import *
5 from .train_vectorizer import *

~\Anaconda3\lib\site-packages\deepparse\vectorizer\bpemb_vectorizer.py in
3 import numpy as np
4
----> 5 from .vectorizer import Vectorizer
6 from ..embeddings_models.embeddings_model import EmbeddingsModel
7

~\Anaconda3\lib\site-packages\deepparse\vectorizer\vectorizer.py in
2 from typing import List
3
----> 4 from ..embeddings_models.embeddings_model import EmbeddingsModel
5
6

~\Anaconda3\lib\site-packages\deepparse\embeddings_models\__init__.py in
2 from .bpemb_embeddings_model import *
3 from .embeddings_model import *
----> 4 from .fasttext_embeddings_model import *
5 from .magnitude_embeddings_model import *

~\Anaconda3\lib\site-packages\deepparse\embeddings_models\fasttext_embeddings_model.py in
1 import platform
2
----> 3 from gensim.models.fasttext import load_facebook_vectors
4 from numpy.core.multiarray import ndarray
5

ImportError: cannot import name 'load_facebook_vectors' from 'gensim.models.fasttext' (C:\Users\Owner\Anaconda3\lib\site-packages\gensim\models\fasttext.py)

Any ideas? Thank you!

Sincerely,

tom
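A plausible cause (an assumption): load_facebook_vectors only exists in gensim 3.8 and later, so an older gensim pulled in by Anaconda would raise exactly this ImportError. A quick check:

import gensim
print(gensim.__version__)  # if this is older than 3.8, upgrade gensim
from gensim.models.fasttext import load_facebook_vectors  # should then import cleanly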


[Question] Training noisy data from another country?

If I have a large dataset with noisy raw addresses and the correctly parsed result for each one, how do I get started training deepparse on it?

The raw+result data I have is currently in CSV format, but with a bit of scripting I can easily transform it into another format. I just don't completely understand how to train Deepparse on this.
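A sketch of that conversion (hypothetical CSV column names): the retraining format described in the doc, and in the retrain issue below, is a pickled list of ('address text', [one tag per word]) tuples.

import pickle
import pandas as pd

df = pd.read_csv("noisy_addresses.csv")  # placeholder path

# Assumes a raw_address column and a tags column holding space-separated tags.
data = [(row["raw_address"], row["tags"].split()) for _, row in df.iterrows()]

with open("training_data.p", "wb") as f:
    pickle.dump(data, f)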

[performance] comma

Addresses with comma seem to lower performance.

from deepparse.parser import AddressParser

dp = AddressParser(model_type="bpemb", device=0)

dp("2020 boul. René-Lévesques, Montréal, QC, Canada", with_prob=True).address_parsed_components
#> [('2020', ('PostalCode', 0.8566)),
#>  ('boul.', ('Province', 0.7204)),
#>  ('René-Lévesques,', ('StreetName', 0.7636)),
#>  ('Montréal,', ('StreetName', 0.9614)),
#>  ('QC,', ('StreetName', 0.7382)),
#>  ('Canada', ('Province', 0.5126))]

parsed_address = dp("2020 boul. René-Lévesques Montréal QC", with_prob=True)

>>> print(parsed_address.address_parsed_components)

[('2020', ('PostalCode', 0.9467)), ('boul.', ('StreetName', 0.9895)), ('René-Lévesques', ('StreetName', 0.9602)), ('Montréal', ('Municipality', 0.9965)), ('QC', ('Province', 0.9999))]
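A simple workaround sketch until the models are retrained with comma noise (plain string cleanup, not a deepparse feature):

raw = "2020 boul. René-Lévesques, Montréal, QC, Canada"
parsed_address = dp(raw.replace(",", ""), with_prob=True)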

fasttext is the lightest and bpemb is the most accurate?

Hi,
The fasttext model needs 9 GB of RAM to be loaded, but it is the most accurate if my memory serves? So it would be the most accurate but not the lightest. Also, I think it should be mentioned somewhere in the doc that it takes that much memory to be loaded.
Thank you.

fastText pre-trained embeddings

Since the download of the pre-trained fastText embeddings is pretty long, I think we should add a note in the doc to make this clearer to the user.

Also, should we print a message or warning when we download the model, as gensim does? The reason being that, in my case, when using a remote terminal in PyCharm, the updated print does not show.

@MAYAS3 @freud14

LGPL licence

I'd like to contribute to the project, but the LGPL license is quite restrictive because all derivatives can only be redistributed under the LGPL v3 license (as I understand it). I'm not sure it is a complete blocker, but it might be an issue down the road. Would you consider relicensing the project under a slightly more permissive license, such as GPL or MIT (or another)?

[RuntimeError] Retrain Error

Hi, I got this error when I tried to retrain the model. What could be possible causes?

RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1

I used this code setting

address_parser = AddressParser(model_type="best", device=0)
lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])

I have transformed my training data into a pickle file with the right format, as in the example in the doc: a list of tuples ('address text', [list of tags corresponding to each word]). Moreover, I have already made sure that the number of words in each address matches the number of elements in its corresponding tag list.
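One thing worth ruling out programmatically across the whole file (a sketch with a placeholder path): a single row where the word count and tag count disagree can produce exactly this kind of tensor size mismatch during batch padding.

import pickle

with open("training_data.p", "rb") as f:
    data = pickle.load(f)

for address, tags in data:
    assert len(address.split()) == len(tags), f"length mismatch for: {address}"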
