
TextAugment: Text Augmentation Library

License: MIT License


textaugment's Introduction


You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob and plays nicely with them.

Acknowledgements

Cite this paper when using this library. An arXiv version is also available.

@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}


Features

  • Generates synthetic data for improving model performance without manual effort
  • Simple, lightweight, easy-to-use library
  • Plugs into any machine learning framework (e.g. PyTorch, TensorFlow, Scikit-learn)
  • Supports textual data

Citation Paper

Improving short text classification through global augmentation methods.


Requirements

  • Python 3

The following software packages are dependencies and will be installed automatically.

$ pip install numpy nltk gensim==3.8.3 textblob googletrans 

The following code downloads the NLTK WordNet corpus.

import nltk
nltk.download('wordnet')

The following code downloads the NLTK punkt tokenizer. This tokenizer divides a text into a list of sentences using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

nltk.download('punkt')

The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

nltk.download('averaged_perceptron_tagger')

Use gensim to load a pre-trained word2vec model, such as the Google News vectors available on Google Drive.

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

You can also use gensim to load Facebook's FastText English and multilingual models:

import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')

Or train one from scratch using your own data or a public dataset.
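
For illustration, a minimal sketch of training a word2vec model from scratch with gensim (the toy corpus and hyperparameters below are placeholders, not recommendations):

import gensim

# Toy corpus: a list of tokenised sentences. Replace with your own data.
sentences = [["the", "stories", "are", "good"],
             ["john", "is", "going", "to", "town"]]

# min_count=1 keeps every word; real corpora usually warrant a higher value.
model = gensim.models.Word2Vec(sentences, min_count=1)
model.save("my_word2vec.model")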

Installation

Install from pip [Recommended]

$ pip install textaugment
or install the latest release
$ pip install git+git@github.com:dsfsi/textaugment.git

Install from source

$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install

How to use

There are four types of augmentations which can be used:

  • word2vec: from textaugment import Word2vec
  • fasttext: from textaugment import Fasttext
  • wordnet: from textaugment import Wordnet
  • translate (requires internet access): from textaugment import Translate

Fasttext/Word2vec-based augmentation

See this notebook for an example

Basic example

>>> from textaugment import Word2vec, Fasttext
>>> t = Word2vec(model='path/to/gensim/model')  # or pass a loaded gensim model object
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model')  # or pass a loaded gensim model object
>>> t.augment('The stories are good')
The films are good

Advanced example

>>> runs = 1 # Number of augmentation passes. 1 by default.
>>> v = False # Verbose mode replaces all replaceable words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5 # The probability of success of an individual trial (0.1 < p < 1.0), default 0.5. Used by the geometric distribution to select words from a sentence.

>>> word = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # or pass a loaded gensim model object
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # or pass a loaded gensim model object
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town

Advanced example

>>> v = True # Enable verb augmentation. True by default.
>>> n = False # Enable noun augmentation. False by default.
>>> runs = 1 # Number of times to augment a sentence. 1 by default.
>>> p = 0.5 # The probability of success of an individual trial (0.1 < p < 1.0), default 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon, Joseph is going to town.

RTT-based (round-trip translation) augmentation

Example

>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town

EDA: Easy data augmentation techniques for boosting performance on text classification tasks

This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

See this notebook for an example

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
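
For intuition, a minimal sketch of the operation itself, with stop-word filtering omitted for brevity (wordnet_synonyms and synonym_replacement_sketch are hypothetical helpers, not the library's API):

import random
from nltk.corpus import wordnet

def wordnet_synonyms(word):
    # All distinct WordNet lemmas for the word, excluding the word itself.
    lemmas = {l.name().replace('_', ' ') for s in wordnet.synsets(word) for l in s.lemmas()}
    return sorted(lemmas - {word})

def synonym_replacement_sketch(sentence, n=1):
    # Replace each of n randomly chosen words with a random synonym, if one exists.
    words = sentence.split()
    for i in random.sample(range(len(words)), k=min(n, len(words))):
        candidates = wordnet_synonyms(words[i])
        if candidates:
            words[i] = random.choice(candidates)
    return ' '.join(words)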

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
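
A minimal sketch of the operation, assuming a non-empty sentence (random_deletion_sketch is a hypothetical helper, not the library's API):

import random

def random_deletion_sketch(sentence, p=0.2):
    # Keep each word with probability 1 - p; never return an empty sentence.
    words = sentence.split()
    kept = [w for w in words if random.random() > p]
    return ' '.join(kept) if kept else random.choice(words)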

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
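
A minimal sketch of the operation, assuming a sentence of at least two words (random_swap_sketch is a hypothetical helper, not the library's API):

import random

def random_swap_sketch(sentence, n=1):
    # Swap two randomly chosen positions, n times.
    words = sentence.split()
    for _ in range(n):
        i, j = random.sample(range(len(words)), k=2)
        words[i], words[j] = words[j], words[i]
    return ' '.join(words)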

Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
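
A minimal sketch of the operation, reusing the hypothetical wordnet_synonyms helper from the synonym-replacement sketch above:

import random

def random_insertion_sketch(sentence, n=1):
    # Insert a random synonym of a random word at a random position, n times.
    words = sentence.split()
    for _ in range(n):
        candidates = wordnet_synonyms(random.choice(words))  # helper sketched earlier
        if candidates:
            words.insert(random.randint(0, len(words)), random.choice(candidates))
    return ' '.join(words)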

AEDA: An easier data augmentation technique for text classification

This is the implementation of AEDA by Karimi et al., a variant of EDA based on the random insertion of punctuation marks.

https://aclanthology.org/2021.findings-emnlp.234.pdf

Implementation

See this notebook for an example

Random Insertion of Punctuation Marks

Basic example

>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
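
For intuition, a minimal sketch of the AEDA idea, assuming the punctuation set used in the paper (aeda_sketch is a hypothetical helper, not the library's API):

import random

PUNCTUATION = ['.', ';', '?', ':', '!', ',']

def aeda_sketch(sentence, ratio=0.3):
    # Insert between 1 and ratio * len(words) punctuation marks at random positions.
    words = sentence.split()
    n = random.randint(1, max(1, int(ratio * len(words))))
    for _ in range(n):
        words.insert(random.randint(0, len(words)), random.choice(PUNCTUATION))
    return ' '.join(words)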

Mixup augmentation

This is the implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz, adapted to NLP.

Used in Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in-between training examples.
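
For intuition, a minimal numpy sketch of the principle on dense inputs (e.g. sentence embeddings) with one-hot labels; mixup_batch is a hypothetical helper, not the library's API:

import numpy as np

def mixup_batch(x, y, alpha=0.2):
    # x: (batch, features) inputs; y: (batch, classes) one-hot labels.
    lam = np.random.beta(alpha, alpha, size=len(x))  # one mixing coefficient per pair
    index = np.random.permutation(len(x))            # random partner for each example
    mixed_x = x * lam[:, None] + x[index] * (1.0 - lam[:, None])
    mixed_y = y * lam[:, None] + y[index] * (1.0 - lam[:, None])
    return mixed_x, mixed_y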

Implementation

See this notebook for an example

Built with ❤

Authors


Licence

MIT licensed. See the bundled LICENCE file for more details.

textaugment's People

Contributors

c-juhwan, categitau, dependabot[bot], haowenhou, hlmu, josephsefara, vukosim


textaugment's Issues

Possible conflict with gensim 4.0.1

This error is raised when I use textaugment with gensim 4 but not gensim 3:

  File "aug.py", line 15, in <module>
    data_df['paraphrased_text'] = data_df['text'].progress_apply(lambda x: w2v.augment(x))
  File "/home/anhvd/.local/lib/python3.7/site-packages/tqdm/std.py", line 770, in inner
    return getattr(df, df_function)(wrapper, **kwargs)
  File "/home/anhvd/miniconda3/envs/textaug/lib/python3.7/site-packages/pandas/core/series.py", line 4138, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
  File "/home/anhvd/.local/lib/python3.7/site-packages/tqdm/std.py", line 765, in wrapper
    return func(*args, **kwargs)
  File "aug.py", line 15, in <lambda>
    data_df['paraphrased_text'] = data_df['text'].progress_apply(lambda x: w2v.augment(x))
  File "/home/anhvd/miniconda3/envs/textaug/lib/python3.7/site-packages/textaugment/word2vec.py", line 146, in augment
    similar_words_and_weights = [(syn, t) for syn, t in self.model.wv.most_similar(w[1])]
AttributeError: 'KeyedVectors' object has no attribute 'wv'
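
A hedged note: this traceback is consistent with the gensim 3-to-4 migration, which removed the deprecated .wv self-reference from KeyedVectors, so pinning gensim to the version the README installs may avoid it.

$ pip install gensim==3.8.3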

Translation: HTTPError: HTTP Error 404: Not Found

I am unable to translate text from one language to other.

Code:

from textaugment import Translate

t = Translate(src="en", to="fr")
t.augment('In the afternoon, John is going to town')

Error:
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
in
2
3 t = Translate(src="en", to="fr")
----> 4 t.augment('In the afternoon, John is going to town')

/usr/local/lib/python3.7/site-packages/textaugment/translate.py in augment(self, data)
134 data = TextBlob(data.lower())
135 try:
--> 136 data = data.translate(from_lang=self.src, to=self.to)
137 data = data.translate(from_lang=self.to, to=self.src)
138 except NotTranslated:

/usr/local/lib/python3.7/site-packages/textblob/blob.py in translate(self, from_lang, to)
545 """
546 return self.__class__(self.translator.translate(self.raw,
--> 547 from_lang=from_lang, to_lang=to))
548
549 def detect_language(self):

/usr/local/lib/python3.7/site-packages/textblob/translate.py in translate(self, source, from_lang, to_lang, host, type_)
52 tk=calculate_tk(source),
53 )
---> 54 response = self._request(url, host=host, type_=type_, data=data)
55 result = json.loads(response)
56 if isinstance(result, list):

/usr/local/lib/python3.7/site-packages/textblob/translate.py in _request(self, url, host, type_, data)
90 if host or type_:
91 req.set_proxy(host=host, type=type_)
---> 92 resp = request.urlopen(req)
93 content = resp.read()
94 return content.decode('utf-8')

/usr/local/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
220 else:
221 opener = _opener
--> 222 return opener.open(url, data, timeout)
223
224 def install_opener(opener):

/usr/local/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
529 for processor in self.process_response.get(protocol, []):
530 meth = getattr(processor, meth_name)
--> 531 response = meth(req, response)
532
533 return response

/usr/local/lib/python3.7/urllib/request.py in http_response(self, request, response)
639 if not (200 <= code < 300):
640 response = self.parent.error(
--> 641 'http', request, response, code, msg, hdrs)
642
643 return response

/usr/local/lib/python3.7/urllib/request.py in error(self, proto, *args)
567 if http_err:
568 args = (dict, 'default', 'http_error_default') + orig_args
--> 569 return self._call_chain(*args)
570
571 # XXX probably also want an abstract factory that knows when it makes

/usr/local/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
501 for handler in handlers:
502 func = getattr(handler, meth_name)
--> 503 result = func(*args)
504 if result is not None:
505 return result

/usr/local/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
647 class HTTPDefaultErrorHandler(BaseHandler):
648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp)
650
651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found


Questions about the mixup strategy

Hello,

I dive into the mixup code here.
https://github.com/dsfsi/textaugment/blob/master/examples/mixup_example_using_IMDB_sentiment.ipynb
First of all, mixup does not improve the validation loss or validation accuracy. I tried it myself; here is the result.
[screenshot: validation loss and accuracy curves]

Another question: the mixup implementation seems wrong.
In the code https://github.com/dsfsi/textaugment/blob/master/textaugment/mixup.py


lam_vector = np.random.beta(alpha, alpha, batch_size)
index = np.random.permutation(batch_size)
mixed_x = (x.T * lam_vector).T + (x[index, :].T * (1.0 - lam_vector)).T
output_x.append(mixed_x)
if y is None:
    return np.concatenate(output_x, axis=0)
mixed_y = (y.T * lam_vector).T + (y[index].T * (1.0 - lam_vector)).T
output_y.append(mixed_y)
return np.concatenate(output_x, axis=0), np.concatenate(output_y, axis=0)

Here x holds the word IDs of the original sentence, and it is wrong to mix word IDs via lam_vector, which is drawn from a beta distribution.

It should be the word embeddings or the sentence embeddings that are mixed up within each batch.

UnpicklingError: invalid load key, '<'.

The code snippet I used:

from textaugment import Word2vec
t = Word2vec(model='/content/drive/My Drive/ewiki8/enwik9')
#t.augment('The stories are good')

Random State, reproducible augmentations

Hey, is there any way to make the augmentations reproducible? I am using this for a competition and I want to generate the same set of texts every time the code is run.

Thank you
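
A hedged workaround sketch, assuming the augmenters draw their randomness from Python's random and numpy.random modules (not confirmed against the library's internals): seed both generators before augmenting.

import random
import numpy as np

random.seed(42)     # Python-level randomness
np.random.seed(42)  # numpy-level randomness (e.g. geometric word sampling)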

Augmentation support for Chinese text

  • Synonym-based augmentation will require synonyms in Chinese.

  • RTT-based augmentation is already supported: users can translate from Chinese to another language and back to Chinese, as sketched after this list.

  • Word2vec-based augmentation is already supported: users will have to build the embeddings using Chinese text.
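
A minimal sketch of the RTT route for Chinese, assuming googletrans' 'zh-cn' language code (the example sentence is illustrative):

from textaugment import Translate

t = Translate(src="zh-cn", to="en")  # round trip: Chinese -> English -> Chinese
t.augment('约翰下午要去镇上')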

AttributeError: 'FastTextKeyedVectors' object has no attribute 'wv', presumably due to the migration from Gensim 3.x to 4

Hello, thank you for making this module!
It's so helpful.

I have a problem using it for the Indonesian language with FastText.
I downloaded the model for Indonesian and followed this example notebook as instructed, but got the error below when I tried to execute the following code.

  • Code:
    t = Word2vec(model=model.wv)
    output = t.augment('Ceritanya bagus')

  • Error:

AttributeError: 'FastTextKeyedVectors' object has no attribute 'wv'

After doing some research, it seems there are changes due to the migration from Gensim 3.x to 4.
The change log is documented here.

Then I tried to change my code as below and got a different error:

  • Code 2:
    t = Word2vec(model=model.wv.most_similar)
    output = t.augment('Ceritanya bagus. Aku suka sekali')

  • Error 2:

TypeError: Model path must be a string. Or type of model must be a gensim.models.word2vec.Word2Vec or gensim.models.keyedvectors.Word2VecKeyedVectors or gensim.models.keyedvectors.FastTextKeyedVectors type. To load a model use gensim.models.Word2Vec.load('path')

Could someone please tell me how I can fix the error?
It seems the augmentation workflow changed when Gensim migrated to version 4.

Thank you!
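
A hedged sketch of one possible workaround, based on the README's pinned dependency (gensim==3.8.3), under which KeyedVectors still exposes the deprecated .wv alias that textaugment calls internally; the model path is illustrative:

import gensim
from textaugment import Word2vec

# Assumes gensim==3.8.3; under gensim 4.x the internal model.wv lookup fails.
model = gensim.models.fasttext.load_facebook_model('./cc.id.300.bin.gz')  # hypothetical path
t = Word2vec(model=model.wv)
output = t.augment('Ceritanya bagus')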
