
aliosm / semantic-question-similarity


Official implementation of: Tha3aroon at NSURL-2019 Task 8: Semantic Question Similarity in Arabic

License: MIT License

Python 100.00%
semantic-textual-similarity ordered-neurons-lstm neural-networks attention elmo bert onlstm


Semantic-Question-Similarity

The official implementation of our paper, Tha3aroon at NSURL-2019 Task 8: Semantic Question Similarity in Arabic, which was part of Task 8 (Arabic Semantic Question Similarity) of the NSURL-2019 workshop.

0. Prerequisites

  • Python >= 3.6
  • Install the required packages listed in requirements.txt
    • pip install -r requirements.txt
  • To use ELMo embeddings:
    • Clone the ELMoForManyLangs repository
      • git clone https://github.com/HIT-SCIR/ELMoForManyLangs.git
    • Install the package:
      • cd ELMoForManyLangs
      • python setup.py install
      • cd ..
    • Download and unzip the Arabic pre-trained ELMo model
      • wget http://vectors.nlpl.eu/repository/11/136.zip -O elmo_dir/136.zip
      • unzip elmo_dir/136.zip -d elmo_dir
      • cp ELMoForManyLangs/configs/cnn_50_100_512_4096_sample.json elmo_dir/cnn_50_100_512_4096_sample.json

1. Data Preprocessing

This preprocessing step separates punctuation marks from words.

python 1_preprocess.py --dataset-split train
python 1_preprocess.py --dataset-split test
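
As a rough illustration of what separating punctuation from words can look like, here is a minimal sketch. It is not the code in 1_preprocess.py: the function name separate_punctuation and the punctuation set (ASCII marks plus the Arabic comma, semicolon and question mark) are assumptions made for this example.

```python
import re
import string

# Assumption: ASCII punctuation plus three common Arabic marks.
# The actual script may handle a different set of characters.
ARABIC_PUNCTUATION = "،؛؟"
_PUNCT_RE = re.compile(
    "([" + re.escape(string.punctuation + ARABIC_PUNCTUATION) + "])"
)


def separate_punctuation(text):
    """Surround each punctuation mark with spaces, then collapse whitespace."""
    spaced = _PUNCT_RE.sub(r" \1 ", text)
    return " ".join(spaced.split())
```

With this sketch, a question ending in "؟" comes out with the mark as its own whitespace-delimited token, so a whitespace tokenizer sees the word and the mark separately.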

2. Data Enlarging

This step enlarges the data using both the positive and the negative transitive properties (described in the paper).

python 2_enlarge.py
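
The sketch below shows one way to read the two transitive properties. It assumes that the positive property means "if A is similar to B and B is similar to C, then A is similar to C" and that the negative property means "if A is similar to B and B is not similar to C, then A is not similar to C". The function enlarge and its data layout are hypothetical, and 2_enlarge.py may work differently; the paper is the authoritative description.

```python
from itertools import combinations, product


def enlarge(pairs):
    """Expand labelled pairs (q1, q2, label), where 1 = similar, 0 = not similar.

    Returns a set of (q1, q2, label) tuples with q1 <= q2.
    """
    pairs = list(pairs)
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Group questions connected by "similar" pairs into clusters.
    for q1, q2, label in pairs:
        r1, r2 = find(q1), find(q2)
        if label == 1 and r1 != r2:
            parent[r1] = r2

    clusters = {}
    for q in list(parent):
        clusters.setdefault(find(q), []).append(q)

    enlarged = set()
    # Positive transitivity: every two questions in a cluster are similar.
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            enlarged.add((a, b, 1))

    # Negative transitivity: a "not similar" pair separates whole clusters.
    for q1, q2, label in pairs:
        if label == 0:
            r1, r2 = find(q1), find(q2)
            if r1 == r2:
                continue  # contradictory annotation, skipped in this sketch
            for a, b in product(clusters[r1], clusters[r2]):
                enlarged.add((*sorted((a, b)), 0))
    return enlarged
```

For example, the three input pairs (A, B, similar), (B, C, similar) and (C, D, not similar) expand to six pairs: three similar pairs among A, B and C, and three dissimilar pairs between each of them and D.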

3. Generating Word Embeddings

To make the training step faster, we pre-generate word embeddings from either the ELMo or the BERT model and store them in a pickle file.

python 3_build_embeddings_dict.py --embeddings-type elmo # For ELMo
python 3_build_embeddings_dict.py --embeddings-type bert # For BERT

We chose ELMoForManyLangs over bert-embedding because it yields better results.
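
The idea of pre-generating embeddings once and reloading them from a pickle file can be sketched as follows. This is not the code in 3_build_embeddings_dict.py: the function names are hypothetical, keying the dictionary by sentence is an assumption, and toy_embed is a stand-in for a real ELMo or BERT encoder so that the example runs without any model files.

```python
import pickle
from pathlib import Path


def build_embeddings_dict(sentences, embed_fn):
    """Return {sentence: per-token vectors}, embedding each distinct sentence once."""
    return {s: embed_fn(s.split()) for s in dict.fromkeys(sentences)}


def save_embeddings(embeddings, path):
    """Write the dictionary to a pickle file."""
    with Path(path).open("wb") as f:
        pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_embeddings(path):
    """Read the dictionary back, for example at the start of training."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def toy_embed(tokens):
    # Hypothetical stand-in encoder: one 2-dimensional vector per token.
    return [[float(len(tok)), float(i)] for i, tok in enumerate(tokens)]
```

The expensive encoder then runs once per distinct sentence, and each training run only pays the cost of loading the pickle file.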

4. Model Training

Train the model using ELMo embeddings with a dropout rate of 0.2, a batch size of 256, 100 epochs and a dev set size of 2000:

python 4_train.py --embeddings-type elmo --dropout-rate 0.2 --batch-size 256 --epochs 100 --dev-split 2000

This hyperparameter setup gives the best results according to our experiments; change the values to experiment further.
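
To make the dev set size concrete, here is a minimal sketch of holding out a fixed number of examples, such as the 2000 passed to --dev-split above. The helper split_train_dev is hypothetical, and whether 4_train.py shuffles before splitting, or which seed it uses, is an assumption of this example.

```python
import random


def split_train_dev(examples, dev_size, seed=0):
    """Shuffle a copy of the examples and hold out dev_size of them."""
    shuffled = list(examples)
    if not 0 <= dev_size <= len(shuffled):
        raise ValueError("dev_size must be between 0 and the number of examples")
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]
```

Calling split_train_dev(data, 2000) returns a training list and a 2000-example dev list that together contain every input example exactly once.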

5. Model Inference

Predictions for the test set are generated from the checkpoint at the given path. The default threshold is 0.5, which can be changed using the optional argument --threshold.

python 5_infer.py --model-path checkpoints/epoch100.h5
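
The role of the threshold can be sketched as below. The helper to_labels is hypothetical, and treating a score exactly equal to the threshold as "similar" is an assumption; 5_infer.py may compare differently.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted similarity scores to 1 (similar) or 0 (not similar)."""
    # Assumption: a score equal to the threshold counts as similar.
    return [1 if p >= threshold else 0 for p in probabilities]
```

Raising the threshold makes the classifier label fewer pairs as similar, and lowering it has the opposite effect.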

Model Structure

The following figure illustrates our best model structure.

Note: All code in this repository was tested on Ubuntu 18.04.

Contributors

  1. Ali Hamdi Ali Fadel.
  2. Ibraheem Tuffaha.
  3. Mahmoud Al-Ayyoub.

License

The project is available as open source under the terms of the MIT License.
