ricardol1u / chinesenlpdataaugmentation4paddle

ChineseNLPDataAugmentation4Paddle

Chinese NLP data augmentation with BERT contextual augmentation or EDA, customized for PaddleNLP.

NLP data augmentation for Baidu's PaddlePaddle framework (using BERT or EDA).

How it works

BERT Part

  1. Randomly insert several [MASK] tokens into the original text, or replace some original tokens with [MASK]:

    before: 时间往往能打败大多数人
    insert: 时间[MASK]往往能打败大多数人
    replace: 时间往往[MASK]打败大多数人
    

    We use jieba, a Chinese word segmentation module, so that [MASK] is never inserted inside a word, as in "时[MASK]间往往能打败大多数人".

  2. Use BertForMaskedLM to predict which token each [MASK] should be.

  3. Keep the best top-k combinations of predictions as the results.
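
The three steps above can be sketched in plain Python. This is a minimal sketch, not the code in this repo: the helper names (insert_mask, replace_mask, top_k_combinations) are hypothetical, the word list is an assumed segmentation (in practice it would come from something like jieba.lcut(text)), and ranking combinations by the product of their probabilities is an assumption, since the scoring rule is not spelled out here. The model call itself (BertForMaskedLM) is left out; the sketch takes its candidate predictions as input.

```python
import itertools
import random

MASK = "[MASK]"


def insert_mask(words, rng):
    """Insert one [MASK] at a random word boundary, never inside a word."""
    pos = rng.randint(0, len(words))  # boundaries include both ends
    return words[:pos] + [MASK] + words[pos:]


def replace_mask(words, rng):
    """Replace one whole word with a single [MASK] (assumed behaviour)."""
    pos = rng.randrange(len(words))
    return words[:pos] + [MASK] + words[pos + 1:]


def top_k_combinations(candidates, k):
    """Rank every combination of per-[MASK] candidates.

    candidates holds one list of (token, probability) pairs per [MASK].
    A combination is scored by the product of its probabilities
    (an assumption), and the k best are returned as (tokens, score).
    """
    scored = []
    for combo in itertools.product(*candidates):
        score = 1.0
        for _, prob in combo:
            score *= prob
        scored.append(([token for token, _ in combo], score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

For example, with the assumed segmentation ["时间", "往往", "能", "打败", "大多数", "人"], insert_mask can only place [MASK] between words or at either end, so a result like "时[MASK]间..." cannot occur.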

EDA Part

TBD

How to use

  1. Install the required packages:

    • PaddleNLP
    • PaddlePaddle
    • jieba
    • synonyms (only required for the EDA part)
  2. Run python augumentor.py --input /path/to/sentences.txt

    The content of sentences.txt should look like this:

    帮我查一下航班信息
    保研没有大多数人想象中的那么难
    时间往往能打败大多数人
    

    One sentence per line.
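
To check an input file before running the script, a reader along these lines may help. It is only a sketch of the one-sentence-per-line format described above: read_sentences is a hypothetical helper, not a function from this repo, and skipping blank lines and assuming UTF-8 encoding are both assumptions.

```python
from pathlib import Path


def read_sentences(path):
    """Return the non-empty lines of a UTF-8 text file, one sentence each."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and drop blank lines (assumed behaviour).
    return [line.strip() for line in lines if line.strip()]
```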

Output

input: 帮我查一下航班信息  

output: {'score': [0.15944890677928925, 0.03266862779855728, 0.16812720894813538], 'insert_index': [1, 2, 3], 'token': [6435, 3221, 872], 'token_str': ['请', '是', '你'], 'sequence': '请 是 你 帮 我 查 一 下 航 班 信 息'}

input: 时间往往能打败大多数人 

output: {'score': [0.054044950753450394, 0.925567626953125], 'insert_index': [3, 4], 'token': [1045, 2518], 'token_str': ['光', '往'], 'sequence': '时 光 往 往 能 打 败 大 多 数 人'}
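
The printed results above look like Python dict literals, so they can probably be read back with the standard library. The sketch below assumes that format holds for every result; parse_output and the added "text" key are hypothetical names, not part of this repo.

```python
import ast


def parse_output(line):
    """Parse one printed result dict and add the sentence without spaces."""
    result = ast.literal_eval(line)  # safe parsing of a Python literal
    # 'sequence' separates tokens with spaces; join them back into a sentence.
    result["text"] = result["sequence"].replace(" ", "")
    return result
```

Applied to the second result above, the "text" field would hold the augmented sentence 时光往往能打败大多数人, and zip(result["token_str"], result["score"]) pairs each predicted token with its score.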
