aliksarkar / bengali-pos-tagging-using-bnlp

Machine Learning approach to Bengali Parts of Speech Tagging using BNLP.

License: GNU General Public License v3.0

About the Project:

This project was completed as part of a Minor Project submission at Heritage Institute of Technology (HIT-K) under the mentorship of Prof. Sandipan Ganguly.

Introduction to BNLP (Bengali Natural Language Processing) Toolkit:

BNLP is a library with pre-trained models for POS tagging, word embedding, Named Entity Recognition, and FastText, along with Bengali stopwords, Bengali corpus class recognition, and more.

Installation

  • PyPI package installer (tested on Python 3.6, 3.7, and 3.8):

    pip install bnlp_toolkit

    or, to upgrade:

    pip install -U bnlp_toolkit

Methodology:

Raw Text -> Tokenization -> POS Tagging

  • First, we used the Natural Language Toolkit (NLTK) library to define and apply basic POS tagging on an English corpus.

  • Next, we took a small Bengali corpus and tokenized each Bengali word in its sentences individually using BasicTokenizer from BNLP, a rule-based approach. We then applied the same method to two larger Bengali corpora.

  • We then used NLTKTokenizer from BNLP, also rule-based, to tokenize the small Bengali corpus in two phases: word tokenizing and sentence tokenizing. The word tokenizer split the text into Bengali words, while the sentence tokenizer separated it into individual sentences. We then applied the same method to the two larger Bengali corpora.

  • Next, we used SentencePieceTokenizer to apply unsupervised learning on the two Bengali corpora.

  • We then used the POS function with a pre-trained model from BNLP on a small Bengali corpus to tag Bengali words and categorize them into parts of speech, a Conditional Random Field (CRF) based approach.

  • Finally, we embedded the Bengali words of a corpus using BengaliWord2Vec with a pre-trained model from BNLP to obtain the vector shape of the words and their values, a deep learning approach.
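The rule-based word and sentence tokenization steps above can be illustrated with a minimal, dependency-free sketch. This is not the BasicTokenizer or NLTKTokenizer implementation from BNLP; the function names `word_tokenize` and `sentence_tokenize`, the punctuation set, and the sample sentences are assumptions chosen only for illustration.

```python
import re

# Assumption: a small, hand-picked punctuation set. The danda "।" is the
# usual Bengali full stop; a real tokenizer handles many more symbols.
WORD_RE = re.compile(r"[^\s।,;:?!]+|[।,;:?!]")

# Assumption: a sentence ends at a danda, question mark, or exclamation mark.
SENTENCE_RE = re.compile(r"[^।?!]+[।?!]*")


def word_tokenize(text):
    """Split text into words, keeping each punctuation mark as its own token.

    A token is a run of characters that are neither whitespace nor
    punctuation. This avoids relying on \\w, so Bengali vowel signs stay
    attached to their consonants.
    """
    return WORD_RE.findall(text)


def sentence_tokenize(text):
    """Split text into sentences, each keeping its closing mark."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


# Hypothetical sample text, not taken from the project's corpora.
sample = "আমি ভাত খাই। আমরা গান করি।"

for sentence in sentence_tokenize(sample):
    print(sentence, "->", word_tokenize(sentence))
```

The word tokens produced this way are the units that a tagger such as the CRF-based POS model would then label, one tag per token.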


[Figure: pie chart of the evaluated result of BNLP]


Confusion Matrix

We also found false positive results and computed confusion matrices to obtain precision, recall, and F1 values.

We used a dataset from NLTR and obtained 90% accuracy.
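As a rough sketch of how such metrics can be derived, the snippet below builds a confusion matrix from gold and predicted tag sequences and computes accuracy plus per-tag precision, recall, and F1. The function name `per_tag_metrics`, the tag labels, and the toy data are hypothetical; they are not the project's evaluation code or results.

```python
from collections import Counter


def per_tag_metrics(gold, predicted):
    """Return (accuracy, {tag: (precision, recall, f1)}) for two tag lists."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")

    # confusion[(g, p)] counts tokens whose gold tag is g and predicted tag is p.
    confusion = Counter(zip(gold, predicted))
    tags = sorted(set(gold) | set(predicted))

    accuracy = sum(confusion[(t, t)] for t in tags) / len(gold)

    metrics = {}
    for tag in tags:
        tp = confusion[(tag, tag)]
        # False positives: predicted as `tag` but the gold tag differs.
        fp = sum(confusion[(g, tag)] for g in tags if g != tag)
        # False negatives: gold tag is `tag` but something else was predicted.
        fn = sum(confusion[(tag, p)] for p in tags if p != tag)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[tag] = (precision, recall, f1)

    return accuracy, metrics


# Toy example with made-up tags, for illustration only.
gold = ["NN", "VB", "NN", "JJ", "NN"]
pred = ["NN", "VB", "JJ", "JJ", "NN"]

accuracy, metrics = per_tag_metrics(gold, pred)
print(f"accuracy = {accuracy:.2f}")
for tag, (p, r, f1) in metrics.items():
    print(f"{tag}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy data, one noun is mislabeled as an adjective, which counts as a false positive for JJ and a false negative for NN.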

Tools:

  1. Jupyter Notebook/Google Colab
  2. BNLP library by Prof. Sagor Sarker (Bangladesh), available on GitHub.
  3. Research papers on Bengali POS tagging.

Mentor: Prof. Sandipan Ganguly (HIT-K).

Developers:

  1. Alik Sarkar
  2. Arghyadeep Banerjee
  3. Soham Chakraborty
  4. Tanmay Guchhait
  5. Debabrata Maity
  6. Rajdeep Das
  7. Sanju Manna

Read on ResearchGate:

https://www.researchgate.net/publication/359257508_Machine_Learning_approach_to_POS_Tagging_in_Bengali_Language_Project_Report

References taken from:

  1. https://bnlp.readthedocs.io/en/latest/
  2. https://github.com/sagorbrur/bnlp
  3. https://www.researchgate.net/publication/348957805_BNLP_Natural_language_processing_toolkit_for_Bengali_language
  4. https://medium.com/analytics-vidhya/bengali-pos-part-of-speech-tagging-using-indian-corpus-e85f47d3ad65
  5. https://nltr.itewb.gov.in/

BNLP Developer Credit: Prof. Sagor Sarker (https://github.com/sagorbrur)
