
mosh98 / swe_aug


Distributed Text Augmentation Techniques (appeared at AAAI 2023)

Home Page: https://knowledge-nlp.github.io/aaai2023/papers/019-augmentation-poster.pdf

Python 100.00%
augmentation swedish-language text-augmentation text-classification swedish

swe_aug's Introduction

Swedish Augmentation Packages

A collection of text augmentation packages for Swedish.


How do I set it up?

Step 1

!git clone https://github.com/mosh98/swe_aug.git

This is built on top of a Swedish word2vec model [1]. Make sure you download it first.

Step 2

!wget https://www.ida.liu.se/divisions/hcs/nlplab/swectors/swectors-300dim.txt.bz2
!bzip2 -dk /content/swectors-300dim.txt.bz2
!pip install -r reqs.txt


word_vec_path = '/content/swectors-300dim.txt'  # path to the txt vector file

# You can also point this to your own pretrained word2vec vectors (make sure it's a txt file)
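
Before loading a large vector file, it can help to check that it has the layout you expect. The sketch below assumes the common word2vec text layout: a header line with the vocabulary size and dimension, then one word per line followed by its values. The function name `peek_word2vec_txt` and the tiny demo file are made up for illustration and are not part of swe_aug; the real file may differ from this layout.

```python
import os
import tempfile


def peek_word2vec_txt(path, max_rows=3):
    """Read the header and the first few rows of a word2vec-style text file."""
    with open(path, encoding="utf-8") as fh:
        first = fh.readline().split()
        # Assumption: the first line is "<vocab_size> <dimensions>".
        vocab_size, dim = int(first[0]), int(first[1])
        rows = {}
        for line in fh:
            if len(rows) >= max_rows:
                break
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(v) for v in parts[1:]]
            if len(values) != dim:
                raise ValueError(f"{word!r} has {len(values)} values, expected {dim}")
            rows[word] = values
    return vocab_size, dim, rows


# Tiny made-up file so the sketch runs without the real download.
demo = "2 3\nhund 0.1 0.2 0.3\nkatt 0.2 0.1 0.4\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(demo)
vocab_size, dim, rows = peek_word2vec_txt(tmp.name)
os.remove(tmp.name)
print(vocab_size, dim, list(rows))
```

To try it on the downloaded vectors, call `peek_word2vec_txt(word_vec_path)` instead of using the demo file.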

Then use your desired augmentation package.


EDA

EDA: Easy Data Augmentation in Swedish

What is EDA? [2]

A data augmentation method that is easy to understand and use. It has 4 main components:

  1. Random Synonym Replacement
  2. Random Word Swap
  3. Random Word Deletion
  4. Random Word Insertion
from swe_aug import EDA
aug = EDA.Enkel_Data_Augmentation(word_vec_path)

txt = "Enter your desired text. It can be a sentence or a paragraph."
augmented_sentences = aug.enkel_augmentation(txt, alpha_sr=0.1, 
                                             alpha_ri=0.3, alpha_rs=0.2, 
                                             alpha_rd=0.1, num_aug=4)
#returns a list of augmented sentences
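
To make the four operations concrete, here is a minimal pure-Python sketch following the EDA paper [2], where each alpha sets the fraction of words an operation touches (or the deletion probability). It is not swe_aug's implementation: the function names and the hand-written `toy_synonyms` table are assumptions for illustration, whereas the package draws on the word2vec vectors.

```python
import random


def synonym_replacement(words, n, synonyms, rng):
    """Replace up to n distinct words that have an entry in the synonym table."""
    words = list(words)
    positions = [i for i, w in enumerate(words) if w in synonyms]
    for i in rng.sample(positions, min(n, len(positions))):
        words[i] = rng.choice(synonyms[words[i]])
    return words


def random_insertion(words, n, synonyms, rng):
    """Insert a synonym of a random word at a random position, n times."""
    words = list(words)
    candidates = [w for w in words if w in synonyms]
    for _ in range(n if candidates else 0):
        source = rng.choice(candidates)
        words.insert(rng.randint(0, len(words)), rng.choice(synonyms[source]))
    return words


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
toy_synonyms = {"hund": ["vovve"], "snabb": ["kvick"]}  # hypothetical table
words = "en snabb hund springer".split()
replaced = synonym_replacement(words, 2, toy_synonyms, rng)
inserted = random_insertion(words, 1, toy_synonyms, rng)
swapped = random_swap(words, 1, rng)
deleted = random_deletion(words, 0.25, rng)
print(replaced, inserted, swapped, deleted, sep="\n")
```

Passing a seeded `random.Random` instance keeps the augmentations reproducible between runs.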

Text Fragmenter

from swe_aug.Other_Techniques import Text_Cropping

frag = Text_Cropping.cropper(percent = 0.25)
list_of_fragmented_sentence = frag.text_fragmeter(txt)
# With percent = 0.25, the text is chopped into 4 fragments.
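
As a rough illustration of the idea, the sketch below splits a text into consecutive word chunks, each holding about `percent` of the words. Splitting on whitespace and the name `fragment_text` are assumptions made for this example; the package's own cropper may cut the text differently.

```python
import math


def fragment_text(text, percent=0.25):
    """Split text into consecutive word chunks, each about `percent` of the words."""
    if not 0 < percent <= 1:
        raise ValueError("percent must be in (0, 1]")
    words = text.split()
    size = max(1, math.ceil(len(words) * percent))
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


fragments = fragment_text("jag vill dela upp denna korta mening nu", percent=0.25)
print(fragments)
```

With eight words and `percent=0.25`, each chunk holds two words, so the sentence yields four fragments.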

Type Specific Similar Word Replacement

The idea is to replace a word with words that are close to it in the embedding space and have the same POS tag. [4]

# Token types you can perturb (case sensitive!):
# "NOUN", "VERB", "ADJ", "ADV", "PROPN", "CONJ"

from swe_aug.Other_Techniques import Type_SR
aug = Type_SR.type_DA(word_vec_path)

list_of_augs = aug.type_synonym_sr(txt, token_type = "NOUN", n = 2)
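
The sketch below illustrates the idea on toy data: for each word carrying the requested tag, it ranks other words with the same tag by cosine similarity and substitutes the nearest ones. Everything here is an assumption for illustration: the vectors, the POS lookup, and the hand-tagged sentence are invented, and the function names are not swe_aug's API. A real pipeline would get tags from a POS tagger and vectors from the word2vec file.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_same_type(word, vectors, pos, token_type, n):
    """Return up to n words with tag `token_type`, closest to `word` first."""
    candidates = [w for w in vectors if w != word and pos.get(w) == token_type]
    candidates.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return candidates[:n]


def replace_by_type(tagged, vectors, pos, token_type, n):
    """Build n variants; variant k uses the k-th nearest same-type neighbour."""
    variants = []
    for rank in range(n):
        out = []
        for word, tag in tagged:
            if tag == token_type and word in vectors:
                neighbours = nearest_same_type(word, vectors, pos, token_type, n)
                out.append(neighbours[rank] if rank < len(neighbours) else word)
            else:
                out.append(word)
        variants.append(" ".join(out))
    return variants


# Toy embeddings and tags, invented for this example.
vectors = {"hund": [1.0, 0.1], "katt": [0.9, 0.2], "bil": [0.1, 1.0], "springer": [0.8, 0.3]}
pos = {"hund": "NOUN", "katt": "NOUN", "bil": "NOUN", "springer": "VERB"}
tagged = [("en", "DET"), ("hund", "NOUN"), ("springer", "VERB")]

variants = replace_by_type(tagged, vectors, pos, token_type="NOUN", n=2)
print(variants)
```

Filtering candidates by tag before ranking keeps a close verb such as "springer" from replacing a noun, which is the point of making the replacement type specific.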

References

[1] Swedish word2vec: https://www.ida.liu.se/divisions/hcs/nlplab/swectors/

[2] EDA: https://aclanthology.org/D19-1670/

[3] Text Fragmenter: That was me

[4] Type Specific: That was me too

Cite?

@software{Mahamud2022,
  author = {Mahamud, Mosleh},
  title = {Swedish Augmentation Packages},
  year = {2022},
  publisher = {GitHub},
  journal = {Not Decided yet},
  howpublished = {\url{https://github.com/mosh98/swe_aug}},
}
