This project forked from languagetechnologylab/offmix-3l

This is a dataset for the offensive language detection task. It contains about 1,000 naturally code-mixed instances in three languages: Bangla, English, and Hindi.

License: GNU Affero General Public License v3.0

OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification

Publication: The 11th International Workshop on Natural Language Processing for Social Media (SocialNLP) under AACL-2023.

Read on arXiv: arXiv:2310.18387


📝 Citation

When using the OffMix-3L dataset, please cite the following:

@article{goswami2023offmix,
  title={OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification},
  author={Goswami, Dhiman and Raihan, Md Nishat and Mahmud, Antara and Anastasopoulos, Antonios and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2310.18387},
  year={2023}
}

📖 Introduction

Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although code-mixing among multiple languages is very common, most available datasets contain data code-mixed between only two languages. In this paper, we introduce OffMix-3L, a novel dataset for offensive language identification containing code-mixed data between three languages: Bangla, English, and Hindi.


📊 Dataset Details

We introduce OffMix-3L, a novel three-language code-mixed test dataset in Bangla-Hindi-English with gold-standard labels for the task of offensive language identification, containing 1,001 instances.

We are presenting this dataset exclusively as a test set due to the unique and specialized nature of the task. Such data is very difficult to gather and requires significant expertise to access. The size of the dataset, while limiting for training purposes, offers a high-quality testing environment with gold-standard labels that can serve as a benchmark in this domain.


📈 Dataset Statistics

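| | All | Bangla | English | Hindi | Other |
|---|---|---|---|---|---|
| Tokens | 87,190 | 31,228 | 6,690 | 14,694 | 34,578 |
| Types | 18,787 | 7,714 | 1,135 | 1,413 | 8,645 |
| Max. in instance | 173 | 62 | 20 | 47 | 93 |
| Min. in instance | 41 | 4 | 3 | 2 | 8 |
| Avg | 87.10 | 31.20 | 6.68 | 14.68 | 34.54 |
| Std Dev | 20.58 | 8.60 | 3.05 | 5.74 | 10.98 |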

OffMix-3L Data Card. The row "Avg" gives the average number of tokens per instance, and the row "Std Dev" gives the corresponding standard deviation.
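As an illustration of how the rows of such a data card can be derived, here is a minimal sketch. It assumes plain whitespace tokenisation and a population standard deviation; the README does not say which tokeniser or which standard deviation variant the authors used. The function name `data_card` and the toy instances are hypothetical and are not part of the dataset.

```python
from statistics import mean, pstdev


def data_card(instances):
    """Summarise token counts for a list of text instances.

    Assumption: tokens are whitespace-separated; the authors' actual
    tokenisation may differ.
    """
    tokenised = [text.split() for text in instances]
    lengths = [len(tokens) for tokens in tokenised]
    # "Types" are the distinct tokens across the whole collection.
    vocabulary = {token for tokens in tokenised for token in tokens}
    return {
        "tokens": sum(lengths),
        "types": len(vocabulary),
        "max": max(lengths),
        "min": min(lengths),
        "avg": round(mean(lengths), 2),
        # Population standard deviation is an assumption made here.
        "std_dev": round(pstdev(lengths), 2),
    }


# Hypothetical toy instances, only to exercise the function.
toy = ["a b c d", "a b", "c d e f g h"]
card = data_card(toy)
print(card)
```

The same function could be applied per language by first filtering each instance down to the tokens tagged with that language, provided token-level language tags are available.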


📉 Results

| Model | F1 Score |
|---|---|
| BanglishBERT | 0.68 |
| BERT | 0.66 |
| mBERT | 0.63 |
| HingBERT | 0.60 |
| MuRIL | 0.60 |
| HateBERT | 0.60 |
| fBERT | 0.58 |
| RoBERTa | 0.58 |
| XLM-R | 0.57 |
| DistilBERT | 0.57 |
| GPT 3.5 Turbo | 0.57 |
| BanglaBERT | 0.54 |
| IndicBERT | 0.55 |
| HindiBERT | 0.43 |

Weighted F1 scores for different models, trained on synthetic data and tested on natural data (OffMix-3L).
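For readers unfamiliar with the metric, the following is a minimal sketch of a weighted F1 score: the F1 of each class is averaged with weights proportional to that class's share of the gold labels. The label names `OFF` and `NOT` and the toy predictions are assumptions for illustration only; they are not taken from the dataset or from any model's output.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(gold)  # number of gold instances per class
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = count - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive
        # because every class in `support` has at least one gold instance.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * count
    return total / len(gold)


# Hypothetical labels and predictions, only to exercise the function.
gold = ["OFF", "OFF", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT", "OFF"]
score = weighted_f1(gold, pred)
print(round(score, 2))
```

Weighting by support keeps a majority class from being under-represented in the average, which matters when the two classes are not balanced.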
