banboooo044 / bert

This project forked from google-research/bert

TensorFlow code and pre-trained models for BERT

Home Page: https://arxiv.org/abs/1810.04805

License: Apache License 2.0

BERT experiments

The original version is here: https://github.com/google-research/bert

First, I needed to apply BERT to Japanese sentences, so I prepared a dataset called JAS. Then I replaced run_classifier.py with a version that adds these lines:

# Requires `import pandas as pd` at the top of run_classifier.py.
class JasProcessor(DataProcessor):
  """Processor for the JAS data set."""

  def read_tsv(self, path):
    df = pd.read_csv(path, sep="\t")
    return [(str(text), str(label)) for text,label in zip(df['text'], df['label'])]

  
  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self.read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self.read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self.read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1", "2", "3", "4", "5"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training, dev, and test sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[0])
      label = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
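
For reference, read_tsv above expects a tab-separated file with a header row containing text and label columns, and labels drawn from "0" to "5". The following is a minimal sketch of that layout, not the author's code: it uses the standard-library csv module as a stand-in for pandas, and the function name read_text_label_tsv and the two sample rows are hypothetical.

```python
import csv
import os
import tempfile


def read_text_label_tsv(path):
    """Return (text, label) string pairs from a TSV with 'text' and 'label' columns."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [(row["text"], row["label"]) for row in reader]


# Hypothetical sample rows; the real JAS data is not shown in this README.
sample_rows = [
    ("今日は天気がいい", "3"),
    ("この映画は退屈だった", "1"),
]

tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.tsv")

# Write the header row first, then one example per line.
with open(train_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["text", "label"])
    writer.writerows(sample_rows)

rows = read_text_label_tsv(train_path)
print(rows)
```

Both fields come back as strings, which matches the str(...) conversion in read_tsv and the string labels returned by get_labels.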

Then I ran it. The result:

eval_accuracy = 0.363
eval_loss = 1.5512451
global_step = 937
loss = 1.5512451

That is a decent result, given that the dataset is not very large. The baseline model, JAS_old/run.py, scores lower (accuracy: 0.333).
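
To put these numbers in context, here is a small sketch that compares them with uniform random guessing over the six labels. It assumes the reported loss is a mean cross-entropy in natural-log units and says nothing about class balance, which this README does not describe; if the classes are imbalanced, a majority-class baseline could be higher than 1/6. The helper name relative_gain is hypothetical.

```python
import math

NUM_LABELS = 6  # get_labels() returns "0" through "5"

# Accuracy and cross-entropy of a model that guesses uniformly at random.
chance_accuracy = 1.0 / NUM_LABELS
chance_loss = math.log(NUM_LABELS)

# Values copied from the results reported above.
bert_accuracy = 0.363
bert_loss = 1.5512451
baseline_accuracy = 0.333


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


print("chance accuracy: %.3f" % chance_accuracy)
print("chance loss:     %.3f" % chance_loss)
print("BERT vs. baseline accuracy gain: %.1f%%"
      % (100 * relative_gain(bert_accuracy, baseline_accuracy)))
```

Under these assumptions, the reported loss sits below the uniform-guessing loss and the accuracy is roughly 9% higher than the baseline in relative terms.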

vocab.txt for SentencePiece

I modified tokenization.py to work with a SentencePiece vocabulary: https://github.com/sugiyamath/bert/blob/master/tokenization.py

Modified lines: 159, 211-219

The change disables "text normalization" and "character-based tokenization for Chinese characters", because I want a larger vocabulary for Japanese.
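
A rough sketch of why these two steps matter for Japanese follows. It is a standard-library approximation of what the upstream tokenizer does as I understand it (NFD normalization followed by dropping combining marks, and padding CJK ideographs with spaces), not a copy of the modified lines. The function names are hypothetical, the ideograph check covers only the main CJK block, and mapping "text normalization" to accent stripping is my assumption.

```python
import unicodedata


def strip_accents(text):
    """NFD-decompose the text, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_cjk_ideograph(ch):
    """Simplified check: only the main CJK Unified Ideographs block (assumption)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def space_out_cjk(text):
    """Surround every CJK ideograph with spaces so each one becomes its own token."""
    return "".join(" %s " % ch if is_cjk_ideograph(ch) else ch for ch in text)


# The voiced sound mark is a combining character, so it gets stripped.
print(strip_accents("が"))  # か

# Kanji are split one per token, while the kana run stays together.
print(space_out_cjk("日本語がすき").split())  # ['日', '本', '語', 'がすき']
```

With both steps active, voiced kana collapse onto their unvoiced forms and multi-kanji words can never match a single vocabulary entry, which limits how much a SentencePiece vocabulary can help.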

Then I created pre_example.sh: https://github.com/sugiyamath/bert/blob/master/pre_example.sh
