
level1-semantictextsimilarity-nlp-03's Introduction

🏆Level1 Project - STS(Semantic Text Similarity)

✍️Competition Overview

| Item | Description |
|---|---|
| Host | A Level 1 domain-fundamentals competition in the NLP track of the 6th Naver Boostcamp AI Tech cohort. |
| Task | Given two sentences, predict their STS (Semantic Text Similarity). The competition was run in a leaderboard format, similar to Kaggle and Dacon. |
| Data | Sentences from Slack conversations, Naver movie reviews, and national petitions. Train (9,324), Dev (550), Test (1,100). |
| Evaluation metric | Model performance was measured with the Pearson correlation coefficient. |
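
To make the evaluation metric concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the sample scores are illustrative assumptions, not taken from the competition code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up gold labels and model outputs on the 0-5 scale (assumption).
labels = [0.0, 1.2, 2.4, 3.6, 5.0]
preds = [0.3, 1.0, 2.9, 3.1, 4.8]
score = pearson(labels, preds)
print(f"Pearson: {score:.4f}")
```

Because the metric only measures linear agreement, a model whose predictions are shifted or scaled relative to the labels can still score well.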

🎖️Leader Board

🥈Private Leader Board (2nd place)

🥉Public Leader Board (3rd place)


👨‍💻Team & Members

  • Team name : 369 [NLP Team 3]

🧚Members

김동현, 김유민, 박산야, 이종원, 황기중, 황예원

💯Our Team's Goal

Because this was a team project, we divided up as many different experiments as possible so that our work would not overlap. Rather than splitting tasks by strict criteria, we collaborated with the goal that everyone would freely experience the whole pipeline end to end, from EDA through preprocessing, model experiments, and model tuning. Our members share the trait of "once hooked on something, dig to the very end," and thanks to their strong sense of responsibility and tenacious(?) affection for their ideas, we were able to finish the project successfully.


👼Member's role

| Member | Role |
|---|---|
| 김동현 | EDA (dataset characteristics analysis), data augmentation (back translation), modeling and tuning (Bert, Roberta, Albert, SBERT, WandB) |
| 김유민 | EDA (label-pred distribution analysis), data augmentation (back translation / nnp_sl_masking / word-order inversion / simple duplication), model tuning (roberta-large, kr-electra-discriminator) |
| 박산야 | EDA (label distribution and sentence length analysis), data augmentation (sentence swap), modeling and tuning (KoSimCSE-roberta, and a Siamese Network model built on it) |
| 이종원 | EDA (label distribution analysis, label-pred distribution analysis), data preprocessing (hanspell), data augmentation (swap sentence / copied sentence / SR / random masking), model tuning (roberta-large, electra-kor-base, kr-electra-discriminator), ensemble (soft voting, weight voting), code refactoring |
| 황기중 | Data preprocessing (spacing normalization), data augmentation (adverb / proper-noun removal augmentation), modeling (KoSimCSE-roberta), ensemble (variance-based ensemble) |
| 황예원 | Modeling and tuning (RoBERTa, T5, SBERT), model compression (Roberta-large with deepspeed) |

🏃Project process

🖥️ Project Introduction

| Overview | Description |
|---|---|
| Project topic | STS (Semantic Text Similarity): a task that predicts the degree of similarity between two sentences as a numeric score |
| Project goal | Build an AI model that, given two sentences (sentence1, sentence2), predicts their similarity as a score between 0 and 5. |
| Evaluation metric | Pearson correlation coefficient between the actual and predicted values |
| Development environment | GPU : 6 Tesla V100 servers, IDE : Vscode, Jupyter Notebook |
| Collaboration environment | Notion (progress sharing), Figma (visual progress sharing), Github (code and data sharing), Slack (real-time communication) |

📅Project TimeLine

  • The project ran for about 11 days, from 2023-12-11 to 2023-12-21.


🕵️What we did

  • The experiments and techniques we tried at each stage of the project are listed below.

| Process | What we did |
|---|---|
| EDA | Data distribution analysis, analysis of the gap between baseline model predictions and actual values |
| Preprocessing | emotion normalize, repeat normalize, special character removal, English lowercasing, hanspell (spell check) |
| Augmentation | SR (Synonym Replacement), Swap Sentence, Copied Sentence, NNP, SL Masking, Back Translation, word-order inversion, simple duplication |
| Experiment Model | klue/RoBERTa-base, klue/RoBERTa-large, klue/bert-base, monologg/KoELECTRA-base, KETI-AIR/ke-t5-base, xlm-roberta-large, snunlp/KR-SBERT-V40K-klueNLI-augSTS, kykim/electra-kor-base, snunlp/KR-ELECTRA-discriminator, BM-K/KoSimCSE-roberta, rurupang/roberta-base-finetuned-sts |
| Hyperparameter tuning & Monitoring | Wandb Sweep |
| Ensemble | weight voting, soft voting |
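
As an illustration of the Preprocessing row, the following is a minimal sketch of three of the listed steps (English lowercasing, special character removal, and repeat normalization) using only the standard `re` module. It is not the project's actual code: the function name `preprocess`, the set of characters kept, and the `max_repeat` default are all assumptions, and spell checking with hanspell is left out.

```python
import re


def preprocess(text, max_repeat=2):
    """Lowercase, drop special characters, and shorten repeated characters."""
    # English lowercasing (Hangul is unaffected).
    text = text.lower()
    # Special character removal: keep digits, ASCII letters, Hangul syllables
    # and jamo, whitespace, and basic punctuation (assumed whitelist).
    text = re.sub(r"[^0-9a-z가-힣ㄱ-ㅎㅏ-ㅣ\s.,?!]", " ", text)
    # Repeat normalization: cut any run longer than max_repeat down to max_repeat.
    text = re.sub(
        r"(.)\1{%d,}" % max_repeat,
        lambda m: m.group(1) * max_repeat,
        text,
    )
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Hello!!!!  WORLD ㅋㅋㅋㅋㅋ @@"))
```

Applying the same function to the train, dev, and test sentences keeps the model's inputs consistent between training and inference.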

📊DataSet

| Version | Description |
|---|---|
| AugmentationV1 | Simple augmentation of the original samples with label>=4. |
| AugmentationV2 | Original data + spell-checked data + SR + Swap Sentence + Copied Sentence |
| AugmentationV3 | AugmentationV2 + NNP, SL Masking |

  • During augmentation, we adjusted the augmentation ratio per label to balance the label distribution.
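
The idea of augmenting at a different ratio per label can be sketched as follows, using Swap Sentence as the example technique. This is a hedged illustration only: the function name `swap_sentence_augment`, the dict-based row format, binning labels by their integer part, and the ratios in the usage example are assumptions rather than the project's implementation.

```python
import random
from collections import defaultdict


def swap_sentence_augment(rows, ratio_by_bin, seed=0):
    """Append swapped-sentence copies, sampling a different share per label bin."""
    rng = random.Random(seed)
    # Group rows by the integer part of the label (bins 0..5, an assumption).
    by_bin = defaultdict(list)
    for row in rows:
        by_bin[int(row["label"])].append(row)

    augmented = list(rows)
    for label_bin, bucket in by_bin.items():
        ratio = ratio_by_bin.get(label_bin, 0.0)
        n_new = min(round(len(bucket) * ratio), len(bucket))
        for row in rng.sample(bucket, n_new):
            # Similarity is symmetric, so the label is kept when swapping.
            augmented.append({
                "sentence1": row["sentence2"],
                "sentence2": row["sentence1"],
                "label": row["label"],
            })
    return augmented


# Made-up rows: augment only the under-represented high-label bin.
rows = [
    {"sentence1": "a", "sentence2": "b", "label": 0.0},
    {"sentence1": "c", "sentence2": "d", "label": 0.4},
    {"sentence1": "e", "sentence2": "f", "label": 4.2},
]
balanced = swap_sentence_augment(rows, {4: 1.0})
print(len(rows), "->", len(balanced))
```

Giving rare label bins a higher ratio and frequent bins a ratio of zero moves the label histogram toward a flatter shape without touching the original rows.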

🤖Ensemble Model

  • The final ensemble used five models.

| Model | Learning Rate | Batch Size | Loss | Epoch | Data Cleaning | Data Augmentation | Public Pearson | Scheduler | Ensemble Weight |
|---|---|---|---|---|---|---|---|---|---|
| klue/RoBERTa-large | 1e-5 | 16 | L1 | 5 | Spell Check | AugmentationV2 | 0.9125 | | 0.9125 |
| klue/RoBERTa-large | 1e-5 | 16 | MSE | 2 | Spell Check | AugmentationV3 | 0.9166 | | 0.9166 |
| kykim/electra-kor-base | 2e-5 | 32 | L1 | 23 | Spell Check | AugmentationV2 | 0.9216 | CosineAnnealingWarmRestarts | 0.9216 |
| snunlp/KR-ELECTRA-discriminator | 1e-5 | 32 | L1 | 15 | | AugmentationV1 | 0.9179 | | 0.9179 |
| snunlp/KR-ELECTRA-discriminator | 2e-5 | 32 | L1 | 15 | Spell Check | AugmentationV2 | 0.9217 | CosineAnnealingWarmRestarts | 0.9217 |
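
One way to read the Ensemble Weight column is as the weights of a weighted average over the five models' predictions. The sketch below shows that interpretation under stated assumptions: the function names `weighted_vote` and `soft_vote`, normalizing by the sum of weights, treating soft voting as an unweighted mean, and the prediction values are all illustrative and not taken from the project's code.

```python
def weighted_vote(predictions, weights):
    """Weighted average of per-model predictions (one list of scores per model)."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the weight sum so the result stays on the 0-5 scale.
    return [
        sum(w * p[i] for p, w in zip(predictions, weights)) / total
        for i in range(n_samples)
    ]


def soft_vote(predictions):
    """Unweighted mean of the models' predictions (assumed meaning of soft voting)."""
    return weighted_vote(predictions, [1.0] * len(predictions))


# Weights copied from the Ensemble Weight column; predictions are made up.
weights = [0.9125, 0.9166, 0.9216, 0.9179, 0.9217]
predictions = [
    [0.2, 3.1, 4.8],
    [0.4, 2.9, 4.6],
    [0.1, 3.3, 5.0],
    [0.3, 3.0, 4.7],
    [0.2, 3.2, 4.9],
]
print(weighted_vote(predictions, weights))
```

With weights this close to each other, the weighted result is nearly identical to the plain mean, so the choice between the two mostly matters when model quality differs more widely.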

📁Project Structure

📁Directory Structure

  • Training data path : ./data
  • Training code path : ./code
  • Model config path : ./code/config
  • Trained model output path : ./model
  • Inference and ensemble result output path : ./result

📁Code Structure

  1. Data augmentation code : ./code/augmentation.py

    • Augmented data output path : ./data/
  2. Config code path : ./code/config/

  3. Model training code : ./code/train.py

    • Model .pt file output path : ./model/
  4. Inference & ensemble code : ./code/inference.py

    • Inference result .csv file output path : ./result/
📁level1_semantictextsimilarity-nlp-3
├─code
│  │  augmentation.py
│  │  dataloader.py
│  │  dataset.py
│  │  inference.py
│  │  learner.py
│  │  model.py
│  │  train.py
│  │  utils.py
│  │
│  ├─config
│  │      kr_electraV1_config.py
│  │      kr_electraV2_config.py
│  │      kykim_config.py
│  │      roberta_large_config.py
│  │      roberta_large_nnp_config.py
│  │
│  └─wrappers
│          config_wrapper.py
│          train_wrapper.py
│
├─data
│      dev.csv
│      dev_spellcheck.csv
│      sample_submission.csv
│      test.csv
│      test_spellcheck.csv
│      train.csv
│      train_augmentV1.csv
│      train_augmentV2.csv
│      train_augmentV3.csv
│      wordnet.pickle
│
├─model
│      klue-roberta-large-nnp.pt
│      klue-roberta-large.pt
│      kykim-electra-kor-base.pt
│      snunlp-KR-ELECTRA-discriminator-V1.pt
│      snunlp-KR-ELECTRA-discriminator-V2.pt
│
└─result
        ensemle.csv
        klue-roberta-large-nnp.csv
        klue-roberta-large.csv
        kykim-electra-kor-base.csv
        snunlp-KR-ELECTRA-discriminator-V1.csv
        snunlp-KR-ELECTRA-discriminator-V2.csv

💻How to Start

📊Make Dataset

# Data augmentation
> python ./code/augmentation.py

🤖Train Model

# Set the training model's config in ./code/config
> python ./code/train.py

🤖Infer or Ensemble Model

# Infer
> python ./code/inference.py --mode inference

# Ensemble
> python ./code/inference.py --mode ensemble
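
For readers curious how a single script can serve both commands, here is a hypothetical sketch of a `--mode` switch built with the standard `argparse` module. Only the flag name and its two values come from the commands above; the function names, the help text, and the placeholder return values are assumptions and do not reflect the actual contents of inference.py.

```python
import argparse


def build_parser():
    """Parser with a single required --mode flag (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Run inference or ensembling.")
    parser.add_argument(
        "--mode",
        choices=["inference", "ensemble"],
        required=True,
        help="inference: predict with one model; ensemble: combine saved results",
    )
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.mode == "inference":
        # Placeholder: a real script would load a model and write a .csv here.
        return "run inference"
    # Placeholder: a real script would merge per-model .csv files here.
    return "run ensemble"


if __name__ == "__main__":
    print(main())
```

Restricting the flag with `choices` makes a mistyped mode fail immediately with a usage message instead of silently running the wrong step.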
