novetta / adaptnlp Goto Github PK

An easy to use Natural Language Processing library and framework for predicting, training, fine-tuning, and serving up state-of-the-art NLP models.

Home Page: https://novetta.github.io/adaptnlp/

License: Apache License 2.0

Dockerfile 0.24% Python 11.63% Shell 0.05% Jupyter Notebook 87.68% Roff 0.38% Makefile 0.02%

nlp pytorch transformers natural-language-processing machine-learning deep-learning deep-learning-tutorial docker fine-tuning language-models

adaptnlp's People

Contributors

Stargazers

Watchers

adaptnlp's Issues

max_len Attribute Error

The LMFineTuner throws an attribute error with gpt2. AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'.
This should be changed to model_max_length.

AttributeError                            Traceback (most recent call last)
<ipython-input-6-b19979ceb9c2> in <module>
----> 1 finetuner.train(
      2     training_args=training_args,
      3     train_file=train_path,
      4     eval_file=test_path,
      5     mlm=False,

/adaptnlp/adaptnlp/language_model.py in train(self, training_args, train_file, eval_file, line_by_line, mlm, mlm_probability, plm_probability, max_span_length, block_size, overwrite_cache)
    127         # Check block size for Dataset
    128         if block_size <= 0:
--> 129             block_size = self.tokenizer.max_len
    130         else:
    131             block_size = min(block_size, self.tokenizer.max_len)

AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'

This error occurs while following the Fine-Tuning a Language Model tutorial

Add FastAPI deployment to CI workflow

AdaptNLP's current workflow only builds the source package docker image with tutorials. We should add a workflow for building the FastAPI docker image as well.

Integrating Huggingface's Tokenizer Across AdaptNLP Interface

Keep up with latest iterations of tokenizer library

Spot potential integrations with Flair and AdaptNLP framework.

Sequence classification on fine-tuned BERT: CUDA out of memory

Describe the bug
After fine-tuning a model (bert-base-cased) and trying to use it for sequence classification, the learning rate finder uses all available GPU (1x K80) memory. The fine-tuned model is < 0.5 GB, data used for training is a few megabytes, and embeddings are CPU loaded.

Despite the small model size, loading it alone with Transformers' BertModel.from_pretrained() uses nearly all available GPU memory.

System:

OS: Ubuntu 18.04
Version: 0.1.5

Additional context
I have successfully used a bert-base-cased fine-tuned model for sequence classification. Using the same code but switching the data seems to cause this issue despite both datasets being similarly sized and structured.

Add `EasyTranslator` Module

Adding module for pre-trained translation based language models. Using the T5 model for now.

Look into bringing in fastai-suite libraries

I think we can reduce a ton of code if we bring in some fastcore and fastprogress. So far:

Use fastcore.script for scripting rather than argparse
Helper functions in fastcore.xtras and fastcore.basics to reduce the logic
Use fastprogress for the progress bar should not only reduce clutter but also give a prettier progress bar

Hugging Face's Model Repo Integration

Allow Easy Modules to pull in pre-trained models that have been uploaded to huggingface.co/models.

- Sequence Classification
- Question Answering
- Token Tagging
* Embeddings Done with flair new transformers embedding class

Pin to FastAPI 0.49.2 to deal with new Starlette and FastAPI Updates in a controlled manner

FastAPI 0.52.0 released.

Translation and Summarization Modules Output Format Correction

Unnecessary nested list layer for the translation and summarization modules.

Significant slowdown in EasyTokenTagger release 0.2.0

I'm experiencing a slowdown in NER performance using EasyTokenTagger and 'ner-ontonotes' after updating to release 0.20. Has there been any underlying changes to how the tagger object works?

Specifically, I am dealing with a very large chunk of text. Prior to this release, the NER tagging took around 15 seconds for this particular text. Now, it's taking 15+ minutes the first time but subsequent calls on that text are very quick. Is there some sort of caching or indexing that's being done now? I'd imagine this could create a lot of overhead for large chunks of text.

Sequence classification using REST API fails with models except en-sentiment

Sequence classification over REST API using any model except for en-sentiment fails with:

File "/usr/local/lib/python3.6/dist-packages/starlette/routing.py", line 41, in app response = await func(request) File "/usr/local/lib/python3.6/dist-packages/fastapi/routing.py", line 197, in app dependant=dependant, values=values, is_coroutine=is_coroutine File "/usr/local/lib/python3.6/dist-packages/fastapi/routing.py", line 147, in run_endpoint_function return await dependant.call(**values) File "./app/main.py", line 87, in sequence_classifier text=text, mini_batch_size=1, model_name_or_path=_SEQUENCE_CLASSIFICATION_MODEL File "/adaptnlp/adaptnlp/sequence_classification.py", line 285, in tag_text return classifier.predict(text=text, mini_batch_size=mini_batch_size, **kwargs,) File "/adaptnlp/adaptnlp/sequence_classification.py", line 140, in predict text_sent.add_label(label) TypeError: add_label() missing 1 required positional argument: 'value'

Reproducable with:
docker run -itp 5000:5000 -e TOKEN_TAGGING_MODE='ner' \ -e TOKEN_TAGGING_MODEL='ner-ontonotes-fast' \ -e SEQUENCE_CLASSIFICATION_MODEL='nlptown/bert-base-multilingual-uncased-sentiment' \ achangnovetta/adaptnlp-rest:latest \ bash

Windows and MacOS Support with CI

Some users may want to install the AdaptNLP package on windows and mac OSs locally.

MacOS should be able to successfully install AdaptNLP, but there is current workflow for testing macOS builds.

Windows will require more direction, especially with PyTorch installs.

Transformers 2.6.0 and 2.7.0 Update

Updating AdaptNLP to the latest Transformers v2.7.0 release.

New models:

BART
T5

Additional features will be posted in this thread including tokenizers and additional NLP tasks.

Note: This issue consolidates issue #5 concerning v2.5.0

Tutorials for Summarization and Translation Modules

Adding tutorials in the documentation for EasySummarizer and EasyTranslator and using Hugging Face's implementations of the Bart and T5 Models.

Notebooks and colab links will also be provided.

Save context in QuestionAnswering and re-use it

I notices when we run any code snippet, it convert the text to vectors or some similar thing.
For example in this code snippet

from adaptnlp import EasyQuestionAnswering 
from pprint import pprint

## Example Query and Context 
query = "What is the meaning of life?"
context = "Machine Learning is the meaning of life."
top_n = 5

## Load the QA module and run inference on results 
qa = EasyQuestionAnswering()
best_answer, best_n_answers = qa.predict_qa(query=query, context=context, n_best_size=top_n, mini_batch_size=1, model_name_or_path="distilbert-base-uncased-distilled-squad")

## Output top answer as well as top 5 answers
print(best_answer)
pprint(best_n_answers)

It convert both query and context to vectors first. What if we have very long context and we have a lot of queries, each time it will convert the context to vector. I think there should be a way to save context vector and re-use it instead of creating again and again.

LMFineTuner learning_rate_finder_configs wrong positional argument

Describe the bug
When using the LMFineTuner and specifying the learning_rate_finder_configs , an error is thrown when passing these configs to finetuner.find_learning_rate() as suggested in the documentation and in the Colab example. It looks like the base_path argument may have been renamed to output_dir but this is not reflected in the sources previously mentioned.

TypeError: find_learning_rate() missing 1 required positional argument: 'output_dir'

To Reproduce
Steps to reproduce the behavior:

Import LMFineTuner
Specify an output directory (OUTPUT_DIR)
Define finetuner with the suggested ft_configs and call the freeze method using these configs
Define learning_rate with the suggested learning_rate_finder_configs and call the find_learning_rate method using these configs

Expected behavior
I expect all the config arguments names to be correct and present for the find_learning_rate method as suggested in the example.

Desktop:

OS: macOS 10.15
Browser: Firefox 74
Version: AdaptNLP 0.1.5

EasySequenceClassifier tag_text function returns None for FlairSequenceClassifier model

Hi!
I tried to follow the tutorial for training custom sequence classifier: https://novetta.github.io/adaptnlp/tutorial/training-sequence-classification.html
The last step returns empty sentences while expected labels:
sentences = classifier.tag_text(example_text, model_name_or_path=OUTPUT_DIR)

To Reproduce the behavior:

from adaptnlp import EasySequenceClassifier
from flair.data import Sentence

OUTPUT_DIR = "…/best-model.pt"    # my custom model
classifier = EasySequenceClassifier()

ex_text = "This is a good text example"
example_text=[Sentence(ex_text)]

sentences = classifier.tag_text(text=example_text, model_name_or_path=OUTPUT_DIR, mini_batch_size=1)
print("Label output:\n")
print(sentences)

Returns

2020-12-28 17:44:31,111 loading file .../best-model.pt
Label output:

None

Surprisingly labels got added to example_text
print(example_text)
Returns
[Sentence: " This is a good text example " [− Tokens: 17 − Sentence-Labels: {'label': [0 (0.8812)]}]]

Proposed explanation/ contribution:
I think I know the reason for unexpected behavior and will be happy to help.
classifier.tag_text creates FlairSequenceClassifier classifier.
FlairSequenceClassifier initiates flair.models.TextClassifier classifier and uses TextClassifier predict method within its own predict method.
But flair.models.TextClassifier predict method returns None because the labels are directly added to the sentences. I can re-write FlairSequenceClassifier predict method to return Sentences with labels instead of None.

EasyDocumentEmbeddings and pre-trained models in local dir

Describe the bug
EasyDocumentEmbeddings does not load a fine-tuned model saved in local dir.

To Reproduce
Steps to reproduce the behavior:

Go to https://novetta.github.io/adaptnlp/tutorial/fine-tuning-language-model.html
Follow all steps in the above documentation page:
2.1 Load a pre-trained "bert-base-cased" model

ft_configs = {
              "train_data_file": train_data_file,
              "eval_data_file": eval_data_file,
              "model_type": "bert",
              "model_name_or_path": "bert-base-cased",
              "mlm": True,
              "mlm_probability": 0.15,
              "config_name": None,
              "tokenizer_name": None,
              "cache_dir": None,
              "block_size": -1,
              "no_cuda": False,
              "overwrite_cache": False,
              "seed": 42,
              "fp16": False,
              "fp16_opt_level": "O1",
              "local_rank": -1,
             }
finetuner = LMFineTuner(**ft_configs)
finetuner.freeze()

2.2 Find suitable learning rate

learning_rate_finder_configs = {
    "base_path": OUTPUT_DIR,
    "file_name": "learning_rate.tsv",
    "start_learning_rate": 1e-7,
    "end_learning_rate": 10,
    "iterations": 100,
    "mini_batch_size": 8,
    "stop_early": True,
    "smoothing_factor": 0.7,
    "adam_epsilon": 1e-8,
    "weight_decay": 0.0,
}
learning_rate = finetuner.find_learning_rate(**learning_rate_finder_configs)
finetuner.freeze()

2.3 Fine tune it with One cycle policy. OUTPUT_DIR is where model is saved

train_configs = {
    "output_dir": OUTPUT_DIR,
    "should_continue": False,
    "overwrite_output_dir": True,
    "evaluate_during_training": True,
    "per_gpu_train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "learning_rate": learning_rate,
    "weight_decay": 0.0,
    "adam_epsilon": 1e-8,
    "max_grad_norm": 1.0,
    "num_train_epochs": 10.0,
    "max_steps": -1,
    "warmup_steps": 0,
    "logging_steps": 50,
    "save_steps": 50,
    "save_total_limit": None,
    "use_tensorboard": False,
}
finetuner.train_one_cycle(**train_configs)

2.4
Instantiate a downstream task as

from adaptnlp import EasyDocumentEmbeddings
doc_embeddings = EasyDocumentEmbeddings(OUTPUT_DIR)

I get the following error

May need a couple moments to instantiate...
Corresponding flair embedding module not found for ...my_directory\finetunedmodel\
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-38-3f24c4b91aac> in <module>
      1 from adaptnlp import EasyDocumentEmbeddings
      2 
----> 3 doc_embeddings = EasyDocumentEmbeddings(OUTPUT_DIR)

~\.conda\envs\adaptnlp\lib\site-packages\adaptnlp\embeddings.py in __init__(self, methods, configs, *embeddings)
    353                 )
    354 
--> 355         assert len(self.embedding_stack) != 0
    356         if "pool" in methods:
    357             self.pool_embeddings = DocumentPoolEmbeddings(

AssertionError:

Expected behavior
Expected is to have EasyDocumentEmbeddings instantiated with the finetuned model, the one saved in OUTPUT_DIR. Am I missing any step?

Desktop (please complete the following information):

OS: Window 10
Packages
-- python 3.6.10
-- adaptnlp '0.1.4'

Deploy mkdocs with github pages

Shift away from temporary static S3 documentation hosting and use github pages.

Edit readme links.
Add documentation workflow

EasyTokenTagger Tests

Write tests for EasyTokenTagger class and methods.

EasyTokenTagger
tag_text
tag_all

ImportError: cannot import name 'EasyTokenTagger'

Describe the bug
A clear and concise description of what the bug is.
I tried to run the code in the tutorial

from adaptnlp import EasyTokenTagger


## Example Text
example_text = "Novetta's headquarters is located in Mclean, Virginia."

## Load the token tagger module and tag text with the NER model 
tagger = EasyTokenTagger()
sentences = tagger.tag_text(text=example_text, model_name_or_path="ner")

## Output tagged token span results in Flair's Sentence object model
for sentence in sentences:
    for entity in sentence.get_spans("ner"):
        print(entity)

and it gave me the error:

...
  File "/home/rajiv/Documents/dev/python/nltk-trial/adaptnlp.py", line 2, in <module>
    from adaptnlp import EasyTokenTagger
ImportError: cannot import name 'EasyTokenTagger'

Desktop (please complete the following information):

OS: Ubuntu
Version: 20.04
Python: 3.6.9

Make documentation about release etc more clear with Sequence Classification

Unified Training API

Training API will use fastai under the hood, and we'll make a few functions to build general datasets.

Tasks and sample datasets to use:

Language Models:
- IMDB_SAMPLE text (fastai)
NER/Token Classification
- Annotated Corpus for Named Entity Recognition
Question/Answering
- SQuAD dataset (HuggingFace Datasets)
Sequence Classification
- IMDB_SAMPLE classification (fastai)
Summarization
- Amazon Fine Food Reviews
Translation
- English to French

Other Information

Task API's should have a simple user interface, IE high-level can only input specific options, while midlevel has access to the full fastai Learner params.

Example mid-level API I'm thinking about:

dls = some_build_data_thing()
tuner = QAFineTuner(dls, 'bert-base-cased')
tuner.tune(
  scheduler = 'fit_flat_cos',
  n_epochs = 3,
  lr = None,
  suggest_method = 'valley', # Triggers if lr is None
  additional_callbacks = []
)

And its high-level:

tuner = QAFineTuner.from_csv(
  question_column_name = "question",
  answer_column_name = "answer",
  model = "bert-base-cased"
)
tuner.tune(...)

We should automatically pull in proper metrics for each task, but users have the option to bring in their own as well and pass it to QAFineTuner (good defaults)

Tuners should also have a func like QAFineTuner.from_csv() to build the dataset in-house

Can't load big dataset

Describe the bug
It happens when I want to
learning_rate = finetuner.find_learning_rate(**learning_rate_finder_configs)
in the tutorial. I have a big dataset with 200k rows and each of them has a text with around 200 words.

In your code when you instantiate the TextDataset, the line
tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
takes an eternity for a text of 20 million words. Do you think it can be achieved in the better/faster way like by keeping the rows like they are ?

For the record:
Time for 100 characters: 0.0003399848937988281s
Time for 1000 characters: 0.00124359130859375s
Time for 10 000 characters: 0.012135982513427734s
Time for 100 000 characters: 0.2131056785583496s
Time for 1 000 000 characters: 8.782422542572021s
Time for 10 000 000 characters: 734.5397665500641s

Can't reach the end of the full TextDataset (109 610 928 characters).

To Reproduce
Tutorial with a big dataset

FastText Embeddings with Flair's Classic Word Embeddings

Low lift to incorporate Flair's classic word embeddings, which include FastText embeddings and glove embeddings.

These will result in word embeddings, stacked embeddings and document embeddings using Flair's classic word embeddings module.

Memory Efficiency for Testing

We'll need to deallocate loaded models in tests so they don't clog up memory during CI builds.

HuggingFace adjusted how to get access to bart-large-cnn

Following the summary tutorial, we will hit a 404 Client-Error as bart-large-cnn is now located at https://huggingface.co/facebook/bart-large-cnn

Need to look into what adjustments need to be made

Nightly build failing due to circleci server capacity...need to upgrade

Need to upgrade circleci account with more ram to run tests successfully for builds.

multi-label classification / paperswithcode dataset

Hi guys,

Hope you are all well !

I was wondering if adaptnlp can handle multi-label classification with 1560 labels.

More precisely, I would like to apply it to paperswithcode dataset where labels are called tasks.

Refs:

Thanks for any insights or inputs on that.

Cheers,
X

Data API

We probably should have a data API of some form, that ties into #128

Ideally it should simply prep a dataset for tokenization of a model, or tokenize the data itself.

For now we cover two inputs:

Individual texts
CSV

We should support something akin to fastai's get_y, but with decent defaults so that customization is available, but not needed.

Ideally something like:

dset = TaskDataset.from_df(
  df,  # Can be fname or dataframe
  get_x = ColReader('text'),
  get_y = ColReader('label'),
  splitter = RandomSplitter(),
  model = 'bert-base-uncased', # The name/type of downstream model
  task = "ner" # Or use a `Task.NER` namespace class
)

And further:

dset.dataloaders(bs=8, collate_fn=data_collator)

It reads extremely similar to the fastai API, but we do not use the fastai API, as for text doing it like this is a bit easier.

The highest level API would look like so:

dls = TaskDataLoaders.from_df(df, 'text', 'label', model='bert-base-uncased')

We should note the model used, and when integrating it with the tuning API if something is off with the model entered, we make note of that

Add cron job in circleci workflow for nightly builds

Worried about bugs/breaks as time passes.

Nightly builds should alert us of any issues coming from updated dependencies, model repos, etc.

Add FastAPI workflow

Manual build for rest services should be swapped out for a CI workflow for the fastapi dockerfile.

AttributeError: 'EasyDocumentEmbeddings' object has no attribute 'rnn_embeddings'

Cannot use pool option to generate embeddings (instead of the default rnn).

A snippet for the problem:

embedding_type='albert-xxlarge-v2'
embedding_methods=["pool"]
doc_embeddings = EasyDocumentEmbeddings(embedding_type, methods = embedding_methods)

This is the error I get:

  File "env/lib/python3.7/site-packages/adaptnlp/training.py", line 91, in __init__
   self._initial_setup(self.label_dict, **kwargs)
 File "env/lib/python3.7/site-packages/adaptnlp/training.py", line 97, in _initial_setup
   document_embeddings: DocumentRNNEmbeddings = self.encoder.rnn_embeddings
AttributeError: 'EasyDocumentEmbeddings' object has no attribute 'rnn_embeddings'

Expected behavior would be to successfully obtain an easy document embeddings object with no errors

Running on debian buster, python3.7

If someone could give me a fix or a workaround or if I'm using this incorrectly, then please let me know

Make evaluating a fine-tuned model more intuitive and clear

Expectation: Pass in the model directory that the trained model was saved to, and run inference on it

Current Reality: You need to pass in the model architecture name

Why this is an issue: If I unload the model, it destroys the Trainer, so there's no way to recover it

Potential solutions: Post training, also save something that can load the trainer state back

Update README tutorial links (maybe try fastlink checker?)

EasyWordEmbeddings breaks on newest `transformers` version

Describe the bug
When following along the GPT example, we will run into a type error stating:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

To Reproduce
Steps to reproduce the behavior:

from adaptnlp import EasyWordEmbeddings

example_text = "This is Albert.  My last name is Einstein.  I like physics and atoms."
embeddings = EasyWordEmbeddings()
sentences = embeddings.embed_text(example_text, model_name_or_path="gpt2")

Expected behavior
Should return embedded sentences

Stack Trace

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-c8dc3dc5d3d7> in <module>
----> 1 sentences = embeddings.embed_text(example_text, model_name_or_path="gpt2")

<ipython-input-11-1e6d4d6f8dd9> in embed_text(self, text, model_name_or_path)
     51                         return Sentence("")
     52         embedding = self.models[model_name_or_path]
---> 53         return embedding.embed(sentences)
     54 
     55     def embed_all(

/opt/venv/lib/python3.8/site-packages/flair/embeddings/base.py in embed(self, sentences)
     58 
     59         if not everything_embedded or not self.static_embeddings:
---> 60             self._add_embeddings_internal(sentences)
     61 
     62         return sentences

/opt/venv/lib/python3.8/site-packages/flair/embeddings/token.py in _add_embeddings_internal(self, sentences)
    875         # embed each micro-batch
    876         for batch in sentence_batches:
--> 877             self._add_embeddings_to_sentences(batch)
    878 
    879         return sentences

/opt/venv/lib/python3.8/site-packages/flair/embeddings/token.py in _add_embeddings_to_sentences(self, sentences)
    940             while subtoken_ids_sentence:
    941                 nr_sentence_parts += 1
--> 942                 encoded_inputs = self.tokenizer.encode_plus(subtoken_ids_sentence,
    943                                                             max_length=self.max_subtokens_sequence_length,
    944                                                             stride=self.stride,

/opt/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2418         )
   2419 
-> 2420         return self._encode_plus(
   2421             text=text,
   2422             text_pair=text_pair,

/opt/venv/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py in _encode_plus(self, *args, **kwargs)
    167         )
    168 
--> 169         return super()._encode_plus(*args, **kwargs)
    170 
    171     def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:

/opt/venv/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    453 
    454         batched_input = [(text, text_pair)] if text_pair else [text]
--> 455         batched_output = self._batch_encode_plus(
    456             batched_input,
    457             is_split_into_words=is_split_into_words,

/opt/venv/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py in _batch_encode_plus(self, *args, **kwargs)
    157         )
    158 
--> 159         return super()._batch_encode_plus(*args, **kwargs)
    160 
    161     def _encode_plus(self, *args, **kwargs) -> BatchEncoding:

/opt/venv/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    380         )
    381 
--> 382         encodings = self._tokenizer.encode_batch(
    383             batch_text_or_text_pairs,
    384             add_special_tokens=add_special_tokens,

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

AdaptNLP v0.2.x Additional Features Discussion

There are a lot of ideas that may be floating for feature implementations, so this thread just provides a mini roadmap and environment to think about adaptnlp's progression.

Ideas can be stated freely in this thread and do not replace feature-request issue posts.

Tokenizer Start integrating tokenizers all across adaptnlp for speed and performance enhancements for training and inference.
Summarization Add NLP-task of summarization using document-level encoder based on transformer language models
GPU Multi-GPU and mixed-precision is prevalent in AdaptNLP, but its implementation can be improved and debugged
FastAPI Batch-Serving Improve on the concurrent calls with batch processing from the NLP models (maybe try to make it CPU and GPU agnostic for ease-of-use)
Model Downloading Start structuring a way to download and potentially upload pre-trained NLP-task models

Stretch Goals

HuggingFace raw embeddings over Flair
Try and integrate Callbacks for text generation and other classes that aren't using it

Note: Didn't do this for text generation, more complex than its worth
Use fastrelease (with conda)
Improve test coverage
GH CI for testing Mac, Windows, and Linux, similar to how fastai has it setup
nbdev?
Windows support
Use Pipeline for inference

Note: Pipeline is slower on many tasks that AdaptNLP covers, tests are in place to ensure that this is always true
1.0.0: Unified training framework for at least 4 NLP tasks

Transformers 2.5.0 Update

Leverage tokenizer library for transformers and adaptnlp

Compare fast tokenizers with flair defaults and wherever tokenizers are used for adaptnlp

Address any incompatibility issues in this issues thread.

BrokenPipeError when using cells

Describe the bug
When running code in cells (as below), receive a BrokenPipeError. Code runs fine in line by line.

To Reproduce
Steps to reproduce the behavior:
Run the following in cell format in Spyder 4.1.2 running Python 3.7
from adaptnlp import EasyQuestionAnswering

#%%

Example Query and Context

query = "What is the meaning of life?"
context = "Machine Learning is the meaning of life."
top_n = 5

#%%

Load the QA module and run inference on results

qa = EasyQuestionAnswering()
best_answer, best_n_answers = qa.predict_qa(query=query, context=context, n_best_size=top_n, mini_batch_size=1, model_name_or_path="distilbert-base-uncased-distilled-squad")

#%%

Output top answer as well as top 5 answers

print(best_answer)
print(best_n_answers)

Expected behavior
Print best_answer and best_n_answers.

Desktop (please complete the following information):

OS: Win10 on VM
Version: Spyder 4.1.2 w/ Python 3.7

Additional context
Only occurs in Spyder, not in Jupyter Notebooks.

AttributeError: 'CamembertForMaskedLM' object has no attribute 'cls'

Describe the bug
Trying to freeze a LMFinetuner based on Camembert weights and get:

AttributeError Traceback (most recent call last)
in
6 }
7 finetuner = LMFineTuner(**ft_configs)
----> 8 finetuner.freeze()

~/anaconda3/envs/pe_adaptnlp/lib/python3.8/site-packages/adaptnlp/transformers/finetuning.py in freeze(self)
1630 """Freeze last classification layer group only
1631 """
-> 1632 layers_len = len(list(self.model.cls.parameters()))
1633 self.freeze_to(-layers_len)
1634

~/anaconda3/envs/pe_adaptnlp/lib/python3.8/site-packages/torch/nn/modules/module.py in getattr(self, name)
573 if name in modules:
574 return modules[name]
--> 575 raise AttributeError("'{}' object has no attribute '{}'".format(
576 type(self).name, name))
577

AttributeError: 'CamembertForMaskedLM' object has no attribute 'cls'

To Reproduce

from adaptnlp import LMFineTuner
train_file = "path/to/train" 
valid_file = "path/to/valid"
ft_configs = {
              "train_data_file": train_file,
              "eval_data_file": valid_file,
              "model_type": "camembert",
              "model_name_or_path": "camembert-base",
             }
finetuner = LMFineTuner(**ft_configs)
finetuner.freeze()

Expected behavior
No error

Desktop (please complete the following information):

OS: Amazon Linux
Browser Chrome

Apex errors with torch.save serializing in finetuner

Serializing error when using LMFineTuner with fp16 param set to True for mixed precision training.

Reproduce bug by setting fp16 to True after installing NVIDIA's Apex library.

Solution: Address bug by deep copying training args/parameters gathered by the built-in locals() method.

Clean docker images and update to nvcr.io cuda containers

Need a cleaner development and runtime setup of an adaptnlp container with explicitly specified virtuelenvs.

Should use the nvcr.io cuda images for extensive gpu compatibility

PyTorch 1.7 + python 3.6 incompatibility

Describe the bug
Latest flair release has value error issues as seen here

To Reproduce
Steps to reproduce the behavior:

from adaptnlp import EasyTokenTagger
tagger = EasyTokenTagger()
tagger.tag_text("example", model_name_or_path="ner-fast")

Expected behavior
Entities are tagged.

Desktop (please complete the following information):

python 3.6

Additional context
Should be addressed in the next flair release, but other steps can be taken for temporary fixes:

downgrade torch to 1.6 or less
use python 3.7+

Add `EasySummarizer` Module

Using T5 and Bart language models, we can use Hugging Face's conditional generation modules with language model heads to easily provide abstractive summarizations within AdaptNLP's easy API

Label columns being handled by trainer data collator and only takes in ["label", "labels", "label_ids"]

Label columns being handled by trainer data collator and only takes in ["label", "labels", "label_ids"] in model forward pass.

Need to change this to accept any label column name in train param.

Add ELECTRA Models

With Transformers 2.8.0 being released, we can now incorporate ELECTRA models into AdaptNLP.

Partition requirements for docs/test/dev in setup.py

Specify dependencies for documentation, testing, development in setup.py

This will get rid of unnecessary dependencies when a simple adaptnlp is required as well as when package files are uploaded to PyPI

cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler'

Describe the bug
Your demo Colab Notebook "Custom Fine-Tuning and Training with Transformer Models" doesn't work and generates the following error:

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Version [e.g. 22]

Smartphone (please complete the following information):

Device: [e.g. iPhone6]
OS: [e.g. iOS8.1]
Browser [e.g. stock browser, safari]
Version [e.g. 22]

Additional context
Add any other context about the problem here.

EasyDocumentEmbeddings with large text

Describe the bug
I want to embed a long document and I am following the tutorial available here https://github.com/Novetta/adaptnlp/blob/master/tutorials/3.%20Embeddings/embeddings.ipynb.
It works up to a certain text lenght, e.g. about 300 tokens. After I get a CUDA error. How to embed long text? What's then best strategy?

To Reproduce
Steps to reproduce the behavior:

Go to ttps://github.com/Novetta/adaptnlp/blob/master/tutorials/3.%20Embeddings/embeddings.ipynb. and follow all steps there, with a much longer text.

1.1 Instantiate EasyDocumentEmbeddings :

# Instantiate with a language model
from adaptnlp import EasyDocumentEmbeddings
embeddings = EasyDocumentEmbeddings("bert-base-cased")

1.2 Document Pool embedding

example_text = "This is Albert.  My last name is Einstein.  I like physics and atoms.  \
Here is another sentence. And I will have many many more sentences..."

# Document Pool embedding
sentences = embeddings.embed_pool(example_text)
>> [Sentence: "This is Albert. My last name is Einstein. I like physics and atoms. Here is another sentence. And I will have many many more sentences..." - 25 Tokens]

The above provides only one flair Sentence object.

1.3 Get embeddings

# Get the text/document embedding
for sentence in sentences:
    print(sentence)
    print(sentence.get_embedding().shape)
>> Sentence: "This is Albert. My last name is Einstein. I like physics and atoms. Here is another sentence. And I will have many many more sentences..." - 25 Tokens
torch.Size([3072])

The above works for documents with not many tokens.
It will generate the following error for longer text:

Get error

RuntimeError: CUDA error: device-side assert triggered

Expected behavior
Expect to have more than one sentence object and the entire document to be embedded.

Desktop (please complete the following information):

OS: Windows 10
-- python 3.6.10
-- adaptnlp '0.1.4'

Additional context
In the comments of this flair issue flairNLP/flair#1323 I see that different sentences are provided to different Sentence objects:

sentences = [Sentence("A Finnish sentence."), Sentence("Another one.")]

matrix = list()
for sentence in sentences:
    document_embeddings.embed(sentence)
    embedding = sentence.get_embedding()
    matrix.append(embedding)

Is this the best approach to embed long documents?
Should it be automatically taken into account in sentences = embeddings.embed_pool(example_text) ?

novetta / adaptnlp Goto Github PK

adaptnlp's People

Contributors

Stargazers

Watchers

Forkers

adaptnlp's Issues

Tasks and sample datasets to use:

Other Information

Example Query and Context

Load the QA module and run inference on results

Output top answer as well as top 5 answers

Recommend Projects

Recommend Topics

Recommend Org