TOEIC-BERT

76% correct rate on TOEIC with ONLY a pre-trained BERT model!!

This project solves TOEIC (Test of English for International Communication) problems using the pytorch-pretrained-BERT model. I used huggingface's pytorch-pretrained-BERT because it makes pre-training and fine-tuning easy. I solved only the fill-in-the-blank questions, not the whole test. There are two types of blank questions:

1. Selecting the correct grammar type.

Q) The teacher had me _________ scales several times a day.
  1. play (Answer)
  2. to play
  3. played
  4. playing

2. Selecting the correct vocabulary type.

Q) The wet weather _________ her from going shopping.
  1. interrupted
  2. obstructed
  3. impeded
  4. discouraged (Answer)

BERT Testing

1. Input
{
    "1" : {
        "question" : "Business experts predict that the upward trend in consumer spending is _ to continue until the end of this year.",
        "answer" : "likely",
        "1" : "potential",
        "2" : "likely",
        "3" : "safety",
        "4" : "seemed"
    }
}
2. Output
=============================
Question : Business experts predict that the upward trend in consumer spending is _ to continue until the end of this year.

Real Answer : likely

1) potential 2) likely 3) safety 4) seemed

BERT's Answer => [likely]

Why BERT?

Pre-trained BERT contains contextual information, so it can tell which candidate makes a sentence more contextually and grammatically natural, even when the difference is subtle. I was inspired by the grammar-checker idea in this blog post:

Can We Use BERT as a Language Model to Assign a Score to a Sentence?

BERT uses a bidirectional encoder to encapsulate a sentence from left to right and from right to left. Thus, it learns two representations of each word, one from left to right and one from right to left, and then concatenates them for many downstream tasks.
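The post's title suggests one way to do this: a pseudo-log-likelihood score, where each token is masked in turn and BERT's log-probabilities for the original tokens are summed. Below is a minimal sketch of that idea, assuming the pytorch-pretrained-BERT API; the helper name sentence_score is illustrative and not part of this project.

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()  # disable dropout for deterministic scores

def sentence_score(sentence):
    # Mask one token at a time and sum BERT's log-probability of the original token.
    tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
    ids = tokenizer.convert_tokens_to_ids(tokens)
    mask_id = tokenizer.convert_tokens_to_ids(['[MASK]'])[0]
    score = 0.0
    for i in range(1, len(tokens) - 1):        # skip [CLS] and [SEP]
        masked_ids = list(ids)
        masked_ids[i] = mask_id
        input_tensor = torch.tensor([masked_ids])
        with torch.no_grad():
            predictions = model(input_tensor, torch.zeros_like(input_tensor))
        log_probs = torch.log_softmax(predictions[0, i], dim=-1)
        score += log_probs[ids[i]].item()
    return score

# A grammatical sentence should score higher than an ungrammatical one.
print(sentence_score("the teacher had me play scales ."))
print(sentence_score("the teacher had me played scales ."))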

Evaluation

I evaluated with only the pre-trained BERT model (no fine-tuning) to check for grammatical or lexical errors. In the expression below, X is a question sentence with its blank replaced by [MASK], n ranges over the four candidates {a, b, c, d}, T_n is the set of tokens of candidate n (e.g., T of warranty is ['warrant', '##y']), and V is the total vocabulary over which BERT's masked-LM distribution P(t | X) is defined:

$$L_n(T_n) = \frac{1}{|T_n|} \sum_{t \in T_n} P(t \mid X)$$

Candidates longer than one token are a problem; I solved this by averaging the prediction values over each candidate's tokens, e.g. is being formed tokenizes to ['is', 'being', 'formed'].

Then we choose the candidate with the highest score, $\arg\max_n L_n(T_n)$:

predictions = model(question_tensors, segment_tensors)
# predictions : [batch_size, sequence_length, vocab_size]

# average the masked-LM scores of the candidate's tokens at the masked position
predictions_candidates = predictions[0, masked_index, candidate_ids].mean()
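Putting it together, here is a minimal end-to-end sketch of the scoring step above, assuming the pytorch-pretrained-BERT API; variable names mirror the snippet, and the example question is the vocabulary question from this README.

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

question = "the wet weather [MASK] her from going shopping ."
candidates = ["interrupted", "obstructed", "impeded", "discouraged"]

tokens = tokenizer.tokenize(question)       # '[MASK]' is kept as a single token
masked_index = tokens.index('[MASK]')
question_tensors = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_tensors = torch.zeros_like(question_tensors)

with torch.no_grad():
    predictions = model(question_tensors, segment_tensors)  # [1, seq_len, vocab_size]

scores = []
for candidate in candidates:
    candidate_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(candidate))
    # average over word pieces so multi-token candidates are comparable
    scores.append(predictions[0, masked_index, candidate_ids].mean().item())

print("BERT's Answer =>", candidates[scores.index(max(scores))])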

Result of Evaluation.

Fantastic results with only the pre-trained BERT models:

  • bert-base-uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • bert-large-uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • bert-base-cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • bert-large-cased: 24-layer, 1024-hidden, 16-heads, 340M parameters

Total 7067 questions, evaluated deterministically with model.eval() (dropout disabled):

|             | bert-base-uncased | bert-base-cased | bert-large-uncased | bert-large-cased |
|-------------|-------------------|-----------------|--------------------|------------------|
| Correct Num | 5192              | 5398            | 5321               | 5148             |
| Percent     | 73.46%            | 76.38%          | 75.29%             | 72.84%           |

Quick Start with Python pip Package.

Start with pip

$ pip install toeicbert

Run & Option

$ python -m toeicbert --model bert-base-uncased --file test.json
  • -m, --model : BERT model name from huggingface's pytorch-pretrained-BERT: bert-base-uncased, bert-large-uncased, bert-base-cased, or bert-large-cased.

  • -f, --file : JSON file to evaluate; see the JSON format below (test.json).

    The keys question, 1, 2, 3, 4 are required; answer is optional.

    An _ in question will be replaced with [MASK].

{
    "1" : {
        "question" : "The teacher had me _ scales several times a day.",
        "answer" : "play",
        "1" : "play",
        "2" : "to play",
        "3" : "played",
        "4" : "playing"
    },
    "2" : {
        "question" : "The teacher had me _ scales several times a day.",
        "1" : "play",
        "2" : "to play",
        "3" : "played",
        "4" : "playing"
    }
}
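For reference, here is a minimal sketch of how a file in this format could be read before scoring; the file name test.json matches the example above, and the loop is illustrative rather than the package's internal code.

import json

with open('test.json', encoding='utf-8') as f:
    problems = json.load(f)

for key, item in problems.items():
    question = item['question'].replace('_', '[MASK]')  # '_' marks the blank
    candidates = [item[str(i)] for i in range(1, 5)]    # keys "1".."4" are required
    answer = item.get('answer')                         # "answer" is optional
    print(key, question, candidates, answer)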

Author

  • Tae Hwan Jung(Jeff Jung) @graykode, Kyung Hee Univ CE(Undergraduate).
  • Author Email : [email protected]

Thanks to Hwan Suk Gang (Kyung Hee Univ.) for collecting the dataset (7,114 questions).
