Giter Site home page Giter Site logo

word-embedding-model-for-bangla's Introduction

Word embedding (word2vec) model for Bangla


For many NLP tasks, we need linguistic resources (e.g. linguistically motivated features, dictionary) in order to design fundamental tools such as parts-of-speech tagger, named entity recognizer. Such resources are scarce and at the same time digital content, such as text, are growing very fast, can also partially be available. The idea is utilizes the source and design model can be useful in many Bangla language specific NLP tasks.

In order to design the word embedding model, one needs to have a large collection of text. For example, google-embedding model (GoogleNews-vectors-negative300.bin.gz) contains 100 billion words. For Bangla, publicly available text data is not very large. One way is to collect is scrap the text data from the wab. For this task, we have used several sources including prothom-alo news corpus, transcriptions of CRBLP speech corpus, Bangla textbook corpus, and wiki dump (wikidump accessed on 20th of July, 2016). It contains 84.25 millions of words, 3.2 millions word types, and 7.5 millions of sentences. Please check the paper below for more details.

Due to the license limitation, we can not publicly release this dataset.

Prerequisites

  • Python 2.7
  • Gensim (Tested on 1.0.1)

Data preparation for word2vec model training

Since data sources is different, therefore it is necessary to clean the data first (e.g., removing html tags). We are not including that step here. Once data is cleaned, the follwing tool can be used to prepare the data for training word embedding model. It basically split the text into sentences and writes to a text file and also multiple files. The sentense delimeter include "?" and "\u0964" characters. It also annotate all numbers into DIGIT tag.

Please check the Readme in DataProcessor module in order to run it. Here is an example,

java -classpath dist/DataProcessor.jar dataprocessor.DataProcessor -i bn_data_files_list.txt -d bn_data_preprocessed/ -o bn_data_preprocessed.txt

Model training parameter

For training the model there are many tools available. We have explored word2vec and gensim implementation.

word2vec implementation:

./word2vec -train bn_data_preprocessed.txt -output bn_w2v_model.txt -cbow 0 -size 300 -window 5 -negative 5 -hs 1 -sample 1e-4 -threads 40 -binary 0 -iter 15 -min-count 3 -save-vocab bn_w2v_model_vocab.txt

gensim implementation of word2vec. Model training parameters include the following:

gensim.models.Word2Vec(size=300, window=5, min_count=3, workers=40,seed=1,cbow=0,sample=1e-4,hs=1,negative=5,iter=15)

How to use:

Find the top ten most similar words of the word 'প্রধানমন্ত্রী'

import gensim
model = gensim.models.Word2Vec.load('bn_w2v_model.bin')
words=model.most_similar(positive=['প্রধানমন্ত্রী'], negative=[], topn=10)

for w in words:
  print w[0]
# The most similar top ten words that we are getting are reasonable, as shown below.
মনমোহন
প্রধানমন্ত্রীর
হাসিনা
রাষ্ট্রপতি
হাসিনার
প্রধানমন্ত্রীকে
শান্তিপথে
বিদেশমন্ত্রী
মনমোহনের
এসডিএফের
#Finding the word vector
vector = model['প্রধানমন্ত্রী']

The trained model using word2vec can be found here binary version and text version.

The vocab size of the model is 436126 with 300-dimensional vectors.

Please cite the following paper if you are using this model:

Firoj Alam, Shammur Absar Chowdhury and Sheak Rashed Haider Noori, Bidirectional LSTMs - CRFs Networks for Bangla POS Tagging, ICCIT 2016.

@inproceedings{alam2016bidirectional,
  title={Bidirectional LSTMs—CRFs networks for bangla POS tagging},
  author={Alam, Firoj and Chowdhury, Shammur Absar and Noori, Sheak Rashed Haider},
  booktitle={19th International Conference on Computer and Information Technology (ICCIT), 2016},
  pages={377--382},
  year={2016},
  organization={IEEE}
}

word-embedding-model-for-bangla's People

Contributors

firojalam avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.