ipo_sponsor_ner_model's Introduction

Introduction of Repository

This repository is a fork and reproduction of https://github.com/Gxzzz/BiLSTM-CRF, although the input data and data types differ from those the original project was built for.

This is a PyTorch implementation of BiLSTM-CRF for Named Entity Recognition (NER), as described in "Bidirectional LSTM-CRF Models for Sequence Tagging".

The corpus is built by extracting the "parties involved" and "underwriting" sections from the English versions of IPO prospectuses published on HKEX. The Chinese versions of the prospectuses are not handled, so no Chinese NLP tools are used in this repository.

Main dependencies

Usage of this repository

  • The repository is part of the IPO sponsor parser project, https://github.com/etnetapp-dev/ipo_pdfparsing_server.
  • The deployment script of the NER model is embedded in api_server.py of the IPO_sponsor_parser project.
  • The model aims to identify designated named entities in the "parties involved" and "underwriting" sections of an IPO prospectus.
  • The pre-defined named entities are person, ORG (organization) and title (e.g. sponsors, underwriters, etc.).
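As an illustration of how tagged output maps back to the entity types listed above, the following is a minimal sketch that groups IOBES-tagged tokens into (label, text) spans. The label spellings `PERSON`, `ORG` and `TITLE`, the function name `iobes_to_spans` and the sample tokens are assumptions made for this example; the actual labels are whatever the training corpus uses.

```python
from typing import List, Tuple


def iobes_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOBES-tagged tokens into (label, entity text) pairs."""
    spans = []
    current, current_label = [], None
    for token, tag in zip(tokens, tags):
        # "B-ORG" -> ("B", "-", "ORG"); "O" -> ("O", "", "")
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((label, token))
            current, current_label = [], None
        elif prefix == "B":
            # Start of a multi-token entity.
            current, current_label = [token], label
        elif prefix in ("I", "E") and current and label == current_label:
            current.append(token)
            if prefix == "E":
                # End of the entity: emit it and reset.
                spans.append((label, " ".join(current)))
                current, current_label = [], None
        else:
            # "O" or a malformed sequence: drop any partial entity.
            current, current_label = [], None
    return spans


# Hypothetical tokens and tags, not taken from the real corpus.
tokens = ["ABC", "Capital", "Limited", "as", "Sole", "Sponsor"]
tags = ["B-ORG", "I-ORG", "E-ORG", "O", "B-TITLE", "E-TITLE"]
print(iobes_to_spans(tokens, tags))
```

Partial entities (for example a `B-` tag that is never closed by an `E-` tag) are silently dropped in this sketch; a production parser may prefer to log or repair them.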

Target portions in IPO prospectus

Key components of repository

  • raw textual data and labeled data (data folder)
  • corpus creation and textual data conversion to IOBES format (preprocessing folder)
  • model training (train jupyter notebook)
  • model testing (model_test.py)

Corpus creation

    1. Extract all text and paragraphs from the "parties involved" and "underwriting" sections of the English versions of HKEX IPO prospectuses and write the results to txt files.
    2. Load the txt data into Excel and label the records column by column; alternatively, label the data with a tool such as https://github.com/Wadaboa/ner-annotator.
    3. Save the labelled data to a txt file and convert it to IOB format ('B' = Begin, 'I' = Inside, 'O' = Outside) with the script text2iob1.py in the preprocessing folder.
    4. Convert the data from IOB format to IOBES format ('B' = Begin, 'I' = Inside, 'E' = End, 'S' = Single, 'O' = Outside).
  • If you use another labelling tool, e.g. https://github.com/Wadaboa/ner-annotator, please export the output in IOBES format.
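The IOB to IOBES conversion in the last step can be sketched as follows. This is not the repository's own preprocessing script; the function name `iob_to_iobes` and the example labels are assumptions, and tags are assumed to have the form `B-LABEL` / `I-LABEL` / `O`.

```python
from typing import List


def iob_to_iobes(tags: List[str]) -> List[str]:
    """Convert one sentence of IOB tags (B-X, I-X, O) to IOBES tags."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- tag with no continuation is a single-token entity.
            iobes.append(f"B-{label}" if continues else f"S-{label}")
        elif prefix == "I":
            # An I- tag with no continuation closes the entity.
            iobes.append(f"I-{label}" if continues else f"E-{label}")
        else:
            raise ValueError(f"unexpected IOB tag: {tag!r}")
    return iobes


# Hypothetical tag sequence for illustration.
print(iob_to_iobes(["B-ORG", "I-ORG", "I-ORG", "O", "B-PERSON", "O"]))
# ['B-ORG', 'I-ORG', 'E-ORG', 'O', 'S-PERSON', 'O']
```

Only the tag column changes; the tokens themselves are left untouched, so the conversion can be applied sentence by sentence to the labelled file.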

Manual data labelling (in excel file)

Corpus format: characters and tags are separated by \t
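A corpus in this format can be read with a few lines of Python. The sketch below assumes one token/tag pair per line and a blank line between sentences; the blank-line convention, the function name `read_corpus` and the sample lines are assumptions for illustration, not something confirmed by the repository.

```python
from typing import Iterable, List, Tuple


def read_corpus(lines: Iterable[str]) -> List[Tuple[List[str], List[str]]]:
    """Parse tab-separated token/tag lines into (tokens, tags) per sentence."""
    sentences = []
    tokens, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence (assumed convention).
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # Split on the last tab so the tag is always the final field.
        token, tag = line.rsplit("\t", 1)
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Hypothetical sample in the tab-separated format.
sample = "ABC\tB-ORG\nLimited\tE-ORG\n\nSole\tB-TITLE\nSponsor\tE-TITLE\n"
for sent_tokens, sent_tags in read_corpus(sample.splitlines()):
    print(sent_tokens, sent_tags)
```

With a real file, the same function can be called as `read_corpus(open(path, encoding="utf-8"))`, where `path` points to a corpus file in the data folder.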

Model training

  • Detailed training descriptions and documentation are given in train.ipynb. The notebook was run on Google Colab Pro, where all major dependencies, e.g. PyTorch and sklearn, are pre-installed, so no additional library installation is required on Colab.
  • The overall training process takes 2 to 4 hours, depending on the volume of data and the type of GPU instance (Google Colab randomly assigns an Nvidia K80 or a more advanced GPU to the notebook).
  • Please edit the data and model folder paths to match your Google Drive file structure.

Model testing and deployment

  • Three key files must be stored locally to reuse and test the NER model after training: model.pth, sent_vocab.json and tag_vocab.json.
  • The CPU version of PyTorch 1.8 must be installed on the local machine to run the model.
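Before running a test script, it can help to confirm that the three files are in place. The sketch below uses only the standard library; the helper names `missing_artifacts` and `load_vocabs` and the directory name are hypothetical, and the internal structure of the two JSON files is not assumed. Loading model.pth itself depends on how the training notebook saved the model and is not shown here.

```python
import json
from pathlib import Path

# File names as listed in this README.
REQUIRED_FILES = ("model.pth", "sent_vocab.json", "tag_vocab.json")


def missing_artifacts(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def load_vocabs(model_dir):
    """Load both vocabulary JSON files after checking all artifacts exist."""
    root = Path(model_dir)
    missing = missing_artifacts(root)
    if missing:
        raise FileNotFoundError(f"missing model artifacts: {missing}")
    with open(root / "sent_vocab.json", encoding="utf-8") as f:
        sent_vocab = json.load(f)
    with open(root / "tag_vocab.json", encoding="utf-8") as f:
        tag_vocab = json.load(f)
    return sent_vocab, tag_vocab


if __name__ == "__main__":
    # "model" is a placeholder directory name for this example.
    print(missing_artifacts("model"))
```

Failing early with a clear list of missing files is usually easier to debug than an error raised later while the model is being restored.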

Key model component for deployment and testing
