
ner_gs

Automatically extracting gene and species names from abstract text

Recognizing entities in text is the first step towards machines that can extract insights from enormous document repositories like PubMed.

Getting Started

Prerequisites

This project uses the Python NLP library spaCy to extract genes from PubMed text.

  • Anaconda and Jupyter Notebook
  • spaCy - Open-source library for industrial-strength Natural Language Processing (NLP) in Python

Installation

Anaconda and Jupyter Notebook:

  1. Download and install Anaconda from https://repo.anaconda.com/archive/Anaconda3-2019.07-Windows-x86_64.exe. Select the default options when prompted during the installation.
  2. Open “Anaconda Prompt” by finding it in the Windows (Start) Menu.
  3. Type the command python --version to verify that Anaconda was installed correctly.
  4. Type the command jupyter notebook to start Jupyter Notebook.

spaCy:
spaCy is compatible with 64-bit CPython 2.7 / 3.5+ and runs on Unix/Linux, macOS/OS X and Windows. The latest spaCy releases are available over pip and conda.

Windows & OS X & Linux

  • Run the command below in the Command Prompt
    (make sure Python is added to PATH)
pip install -U spacy
  • Or run the command below in the Anaconda Prompt
    (run as administrator)
conda install -c conda-forge spacy
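The visualization step at the end of this walkthrough loads spaCy's pretrained English pipeline en_core_web_sm. If it is not already installed, it can be downloaded with spaCy's standard download command (shown here for a spaCy 2.x setup):
python -m spacy download en_core_web_sm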

Getting started step by step

  • Step 1 - Open Jupyter Notebook.
  • Step 2 - Run the Training spaCy’s Statistical Models notebook. It will output a trained model called Demo_1; you can assign your own test data at line 104.
  • Step 3 - Run the Testing notebook. You can assign your own test data at line 55.

Code tutorial and processing walkthrough

  • Load the model, or create an empty model
    We can create an empty model and train it with our annotated dataset, or we can use an existing spaCy model and re-train it with our annotated data.
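In the notebook, model is a variable holding either None (to train a blank English model from scratch) or the name/path of an existing spaCy model to re-train; for example (the value actually used in the notebook may differ):
model = None  # or e.g. "en_core_web_sm" to re-train an existing pretrained pipeline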
if model is not None:
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")

if 'ner' not in nlp.pipe_names:
    # add a new NER component to the pipeline
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    # reuse the existing NER component
    ner = nlp.get_pipe("ner")
  • Adding Labels or entities
# add labels
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

other_pipe = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

# Only training NER
with nlp.disable_pipes(*other_pipe):
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
  • Training and updating the model
    Training data: annotated data containing both the text and its labels.
    Text: the input text the model should predict labels for.
    Label: the label the model should predict.
# spaCy training data format
train_data = [
    ("Text 1", {"entities": [(start, end, "Label 1"), (start, end, "Label 2"), (start, end, "Label 3")]}),
    ("Text 2", {"entities": [(start, end, "Label 1"), (start, end, "Label 2")]}),
    ("Text 3", {"entities": [(start, end, "Label 1"), (start, end, "Label 2"),
                             (start, end, "Label 3"), (start, end, "Label 4")]}),
]
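For example, a hypothetical annotated sentence for this gene/species task might look like the following (the GENE and SPECIES label names and the character offsets are illustrative, not taken from the project's dataset):
train_data = [
    ("BRCA1 mutations are studied in Mus musculus models.",
     {"entities": [(0, 5, "GENE"), (31, 43, "SPECIES")]}),
]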
  1. We will train the model for a number of iterations so that it can learn from the data effectively.
for itn in range(iteration):
    print("Starting iteration " + str(itn))
    random.shuffle(train_data)
    losses = {}
  2. At each iteration the training data is shuffled so the model does not make generalisations based on the order of the examples.
  3. We will update the model at each iteration using nlp.update().
    for text, annotation in train_data:
        nlp.update(
            [text],
            [annotation],
            drop=0.2,
            sgd=optimizer,
            losses=losses
        )
    # print(losses)
new_model = nlp
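Step 2 of the walkthrough says training outputs a model called Demo_1; a minimal sketch of persisting the trained pipeline with spaCy's standard serialization (the directory name is an assumption) is:
new_model.to_disk("Demo_1")  # save the trained pipeline to a directory
# it can later be reloaded with spacy.load("Demo_1")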
  • Evaluate the model
# spaCy test data format (same structure as the training data)
test_data = [
    ("Text 1", {"entities": [(start, end, "Label 1")]}),
    ("Text 2", {"entities": [(start, end, "Label 1"), (start, end, "Label 2")]}),
]
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        # build a Doc carrying the gold-standard entity annotations
        doc_gold_text = model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        # run the trained model on the same text and score its predictions
        pred_value = model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

test_result = evaluate(new_model, test_data)
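The dictionary returned by spaCy v2's Scorer includes entity-level precision, recall and F-score, so the result can be inspected with, for example:
print("Precision:", test_result['ents_p'])
print("Recall:", test_result['ents_r'])
print("F-score:", test_result['ents_f'])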
  • Visualization
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # model name
doc = nlp("""Your test text""")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, jupyter=True, style='ent')
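Beyond rendering, the recognized entities can also be read directly off the Doc object, which is useful when the gene and species names need to be collected rather than just displayed:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)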

Reference

See the spaCy Tutorials for more details and examples.
[1] How to create custom NER in Spacy
[2] How to extract genes from text with Sysrev and spaCy
[3] Custom Named Entity Recognition Using spaCy
