arshsekhon / pubtator_loader

A Python 🐍 package to load PubTator Documents 🧾, tokenize and convert them to BILUO Format.

License: GNU General Public License v3.0

pubtator-format pubtator-loader medmentions pubtator

pubtator_loader's Introduction

PubTator Loader

pubtator_loader is a Python module that loads a corpus in PubTator format and lets you manipulate its documents as Python objects. It can also be combined with spaCy to tokenize the documents and convert them to BILUO tags for use in different NLP tasks.

PubTator Format

A PubTator file is laid out as follows, with a blank line separating consecutive documents:

<PMID>|t|<TITLE>
<PMID>|a|<ABSTRACT>
<PMID>	<START OFFSET 1>	<LAST OFFSET 1>	<MENTION 1>	<TYPE 1>	<IDENTIFIER 1>
<PMID>	<START OFFSET 2>	<LAST OFFSET 2>	<MENTION 2>	<TYPE 2>	<IDENTIFIER 2>

<PMID>|t|<TITLE>
<PMID>|a|<ABSTRACT>
<PMID>	<START OFFSET 1>	<LAST OFFSET 1>	<MENTION 1>	<TYPE 1>	<IDENTIFIER 1>
<PMID>	<START OFFSET 2>	<LAST OFFSET 2>	<MENTION 2>	<TYPE 2>	<IDENTIFIER 2>

where:

  • The first line (marked |t|) contains the title of the paper.
  • The second line (marked |a|) contains the abstract of the paper.
  • The subsequent lines contain the annotations for the entities in a tab-separated format:
    • PMID
    • Start Offset
    • End Offset
    • Mention (entity text)
    • Type of Entity
    • Identifier (normalized form)
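To make the layout concrete, here is a minimal standalone sketch of a parser for it. It does not reproduce how pubtator_loader works internally: the names `Annotation`, `Document` and `parse_pubtator` are hypothetical and exist only for this illustration, and the sketch assumes every annotation line has exactly six tab-separated fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    pmid: str
    start: int
    end: int
    mention: str
    entity_type: str
    identifier: str


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    annotations: List[Annotation] = field(default_factory=list)


def parse_pubtator(text: str) -> List[Document]:
    """Parse PubTator-formatted text into a list of Document objects."""
    documents: List[Document] = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the document being read.
            if current is not None:
                documents.append(current)
                current = None
            continue
        # Title/abstract lines: <PMID>|t|<TITLE> or <PMID>|a|<ABSTRACT>.
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a") and "\t" not in parts[0]:
            pmid, kind, content = parts
            if current is None:
                current = Document(pmid=pmid)
            if kind == "t":
                current.title = content
            else:
                current.abstract = content
            continue
        # Anything else is treated as a tab-separated annotation line.
        fields = line.split("\t")
        if len(fields) != 6 or current is None:
            raise ValueError(f"Unexpected line: {line!r}")
        pmid, start, end, mention, entity_type, identifier = fields
        current.annotations.append(
            Annotation(pmid, int(start), int(end), mention, entity_type, identifier)
        )
    if current is not None:
        documents.append(current)
    return documents


sample = (
    "25763772|t|DCTN4 as a modifier\n"
    "25763772|a|Pseudomonas aeruginosa infection in cystic fibrosis\n"
    "25763772\t0\t5\tDCTN4\tT103\tUMLS:C4308010\n"
)
docs = parse_pubtator(sample)
print(docs[0].title, docs[0].annotations[0].mention)
```

The offsets are character positions, so slicing the title with the first annotation's start and end offsets recovers the mention text.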

Usage

from pubtator_loader import PubTatorCorpusReader
dataset_reader = PubTatorCorpusReader('./sample_pubator_input.txt')

corpus = dataset_reader.load_corpus() 
# corpus will be a List[PubTatorDocument]

for doc in corpus:
    print(doc)
"""
Console Output:
    {
  "id": 25763772,
  "title_text": "DCTN4 as a modifier of chronic ....",
  "abstract_text": "Pseudomonas aeruginosa (Pa) infection in cystic fibrosis .....",
  "entities": [
    {
      "document_id": 25763772,
      "start_index": 0,
      "end_index": 5,
      "text_segment": "DCTN4",
      "semantic_type_id": "T103",
      "entity_id": "UMLS:C4308010"
    },
    .
    .
    .
    {
      "document_id": 25763772,
      "start_index": 67,
      "end_index": 82,
      "text_segment": "cystic fibrosis",
      "semantic_type_id": "T038",
      "entity_id": "UMLS:C0010674"
    }
  ]
}
"""


import spacy
import scispacy

# load the scispacy model
nlp = spacy.load('en_core_sci_lg')

# Convert PubTator document to BILUO format.
doc_in_BILUO = doc.tokenize_and_convert_to_bilou(nlp)

for idx, (token, semantic_type_id, entity_id) in enumerate(doc_in_BILUO):
    print(f'{idx}\t{token}\t{semantic_type_id}\t{entity_id}')

"""
Console Output:

0         <START>          <START>     <START>
1           DCTN4      U-T116,T123  U-C4308010
2              as                O           O
3               a                O           O
4        modifier                O           O
5              of                O           O
6         chronic           B-T047  B-C0854135
7     Pseudomonas           I-T047  I-C0854135
8      aeruginosa           I-T047  I-C0854135
9       infection           L-T047  L-C0854135
10             in                O           O
11         cystic           B-T047  B-C0010674
12       fibrosis           L-T047  L-C0010674
13    Pseudomonas           B-T047  B-C0854135
14     aeruginosa           I-T047  I-C0854135
15              (           I-T047  I-C0854135
16             Pa           I-T047  I-C0854135
17              )           I-T047  I-C0854135
18      infection           L-T047  L-C0854135
19             in                O           O
20         cystic           B-T047  B-C0010674
21       fibrosis           L-T047  L-C0010674
.               .                .           .
.               .                .           .
.               .                .           .
.               .                .           .


"""
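For readers unfamiliar with the BILUO scheme shown in the output above (B = beginning, I = inside, L = last, U = single-token unit, O = outside), the following is a minimal sketch of the tagging idea. It is not the library's implementation: it uses plain whitespace tokenization instead of spaCy, the function name `biluo_tags` is hypothetical, and the character offsets and labels in the example are illustrative.

```python
from typing import List, Tuple


def biluo_tags(text: str, entities: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Tag whitespace-separated tokens with BILUO labels.

    `entities` holds (start, end, label) character spans, end exclusive.
    """
    # Whitespace tokenization, keeping each token's character offsets.
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end

    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie fully inside the entity span.
        inside = [
            i for i, (_, s, e) in enumerate(tokens)
            if s >= ent_start and e <= ent_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"
        else:
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"
            tags[inside[-1]] = f"L-{label}"
    return [(word, tag) for (word, _, _), tag in zip(tokens, tags)]


title = (
    "DCTN4 as a modifier of chronic Pseudomonas aeruginosa "
    "infection in cystic fibrosis"
)
# Illustrative spans: a one-token entity and two multi-token entities.
spans = [(0, 5, "T103"), (23, 63, "T047"), (67, 82, "T038")]
tagged = biluo_tags(title, spans)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

A single-token entity receives a U- tag, while a multi-token entity starts with B-, ends with L-, and uses I- for everything in between.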

pubtator_loader's Issues

PubTatorCorpusReader.load_corpus() can't handle multiple/composite mentions

I am trying to parse the NCBI Disease Corpus train set, but I get an error for mentions that include multiple MeSH terms (e.g. "colon and some other cancers" -> D003110|D009369). Are there any suggestions on how to handle this, aside from removing the lines that include "CompositeMention"?

Dataset

10192393|t|A common human skin tumour is caused by activating mutations in beta-catenin.
10192393|a|WNT signalling orchestrates... but a small percentage of colon and some other cancers harbour...
10192393        15      26      skin tumour     DiseaseClass    D012878
10192393        443     449     cancer  DiseaseClass    D009369
10192393        483     496     colon cancers   DiseaseClass    D003110
10192393        539     565     adenomatous polyposis coli      SpecificDisease D011125
10192393        567     570     APC     SpecificDisease D011125
10192393        670     698     colon and some other cancers    CompositeMention        D003110|D009369
10192393        855     867     skin tumours    DiseaseClass    D012878
10192393        879     893     pilomatricomas  SpecificDisease D018296
10192393        1021    1035    pilomatricomas  SpecificDisease D018296
10192393        1210    1221    skin tumour     DiseaseClass    D012878
10192393        1262    1268    tumour  Modifier        D009369
10192393        1312    1326    pilomatricomas  SpecificDisease D018296
10192393        1385    1392    tumours DiseaseClass    D009369
10192393        1615    1622    tumours DiseaseClass    D009369

Error

     77         prev_line_type = curr_line_type
     78     except Exception as e:
---> 79         raise Exception('ERROR occured when parsing line'
     80                         f' #{line_number}. Exception {e}')
     82 if self.__document_being_read is not None:
     83     self.corpus.append(self.__document_being_read)

Exception: ERROR occured when parsing line #8. Exception Unexpected content received on line #8, the line/data may have been corrupted. Content: '10192393	670	698	colon and some other cancers	CompositeMention	D003110|D009369
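One possible workaround is to preprocess the file before loading it. The sketch below assumes, without confirmation from the library, that the parser trips on the `|`-separated identifier field; it rewrites each such annotation line as one line per identifier. The helper name `expand_composite_mentions` is hypothetical, and whether pubtator_loader accepts the resulting duplicate spans has not been tested.

```python
def expand_composite_mentions(pubtator_text: str) -> str:
    """Duplicate annotation lines whose identifier field holds several IDs."""
    out = []
    for line in pubtator_text.splitlines():
        fields = line.split("\t")
        # Annotation lines have six tab-separated fields; title and
        # abstract lines contain no tabs and pass through unchanged.
        if len(fields) == 6 and "|" in fields[5]:
            for identifier in fields[5].split("|"):
                out.append("\t".join(fields[:5] + [identifier]))
        else:
            out.append(line)
    return "\n".join(out)


raw = (
    "10192393|t|A common human skin tumour is caused by activating "
    "mutations in beta-catenin.\n"
    "10192393\t670\t698\tcolon and some other cancers\t"
    "CompositeMention\tD003110|D009369"
)
cleaned = expand_composite_mentions(raw)
print(cleaned)
```

The cleaned text can then be written to a new file and passed to the reader in place of the original.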

input format

Hi,

May I ask what the input format of the tool is? I only see a reference to a file named 'sample_pubator_reader_input', but the file itself is not available.

Error on pip install

When I try and install the package, I get the following error:

  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [164 lines of output]
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
          int length
      
      
      cdef class Vocab:
          cdef Pool mem
          cpdef readonly StringStore strings
                ^
      ------------------------------------------------------------
      
      spacy/vocab.pxd:28:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
      
      
      cdef class Vocab:
          cdef Pool mem
          cpdef readonly StringStore strings
          cpdef public Morphology morphology
                ^
      ------------------------------------------------------------
      
      spacy/vocab.pxd:29:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
      
      cdef class Vocab:
          cdef Pool mem
          cpdef readonly StringStore strings
          cpdef public Morphology morphology
          cpdef public object vectors
                ^
      ------------------------------------------------------------
      
      spacy/vocab.pxd:30:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
      cdef class Vocab:
          cdef Pool mem
          cpdef readonly StringStore strings
          cpdef public Morphology morphology
          cpdef public object vectors
          cpdef public object _lookups
                ^
      ------------------------------------------------------------
      
      spacy/vocab.pxd:31:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
          cdef Pool mem
          cpdef readonly StringStore strings
          cpdef public Morphology morphology
          cpdef public object vectors
          cpdef public object _lookups
          cpdef public object writing_system
                ^
      ------------------------------------------------------------
      
      spacy/vocab.pxd:32:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
          cpdef readonly StringStore strings
          cpdef public Morphology morphology
          cpdef public object vectors
          cpdef public object _lookups
          cpdef public object writing_system
          cpdef public object get_noun_chunks
                ^
      ------------------------------------------------------------
      
      spacy/vocab.pxd:33:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
          cdef float prior_prob
      
      
      cdef class KnowledgeBase:
          cdef Pool mem
          cpdef readonly Vocab vocab
                ^
      ------------------------------------------------------------
      
      spacy/kb.pxd:31:10: Variables cannot be declared with 'cpdef'. Use 'cdef' instead.
      Copied /tmp/pip-install-2b1stxhf/spacy_240f1e06418342eb88af85a39eba4527/setup.cfg -> /tmp/pip-install-2b1stxhf/spacy_240f1e06418342eb88af85a39eba4527/spacy/tests/package
      Copied /tmp/pip-install-2b1stxhf/spacy_240f1e06418342eb88af85a39eba4527/pyproject.toml -> /tmp/pip-install-2b1stxhf/spacy_240f1e06418342eb88af85a39eba4527/spacy/tests/package
      Cythonizing sources
      Compiling spacy/training/example.pyx because it changed.
      Compiling spacy/parts_of_speech.pyx because it changed.
      Compiling spacy/strings.pyx because it changed.
      Compiling spacy/lexeme.pyx because it changed.
      Compiling spacy/vocab.pyx because it changed.
      Compiling spacy/attrs.pyx because it changed.
      Compiling spacy/kb.pyx because it changed.
      Compiling spacy/ml/parser_model.pyx because it changed.
      Compiling spacy/morphology.pyx because it changed.
      Compiling spacy/pipeline/dep_parser.pyx because it changed.
      Compiling spacy/pipeline/morphologizer.pyx because it changed.
      Compiling spacy/pipeline/multitask.pyx because it changed.
      Compiling spacy/pipeline/ner.pyx because it changed.
      Compiling spacy/pipeline/pipe.pyx because it changed.
      Compiling spacy/pipeline/trainable_pipe.pyx because it changed.
      Compiling spacy/pipeline/sentencizer.pyx because it changed.
      Compiling spacy/pipeline/senter.pyx because it changed.
      Compiling spacy/pipeline/tagger.pyx because it changed.
      Compiling spacy/pipeline/transition_parser.pyx because it changed.
      Compiling spacy/pipeline/_parser_internals/arc_eager.pyx because it changed.
      Compiling spacy/pipeline/_parser_internals/ner.pyx because it changed.
      Compiling spacy/pipeline/_parser_internals/nonproj.pyx because it changed.
      Compiling spacy/pipeline/_parser_internals/_state.pyx because it changed.
      Compiling spacy/pipeline/_parser_internals/stateclass.pyx because it changed.
      Compiling spacy/pipeline/_parser_internals/transition_system.pyx because it changed.
      Compiling spacy/pipeline/_parser_internals/_beam_utils.pyx because it changed.
      Compiling spacy/tokenizer.pyx because it changed.
      Compiling spacy/training/align.pyx because it changed.
      Compiling spacy/training/gold_io.pyx because it changed.
      Compiling spacy/tokens/doc.pyx because it changed.
      Compiling spacy/tokens/span.pyx because it changed.
      Compiling spacy/tokens/token.pyx because it changed.
      Compiling spacy/tokens/span_group.pyx because it changed.
      Compiling spacy/tokens/graph.pyx because it changed.
      Compiling spacy/tokens/morphanalysis.pyx because it changed.
      Compiling spacy/tokens/_retokenize.pyx because it changed.
      Compiling spacy/matcher/matcher.pyx because it changed.
      Compiling spacy/matcher/phrasematcher.pyx because it changed.
      Compiling spacy/matcher/dependencymatcher.pyx because it changed.
      Compiling spacy/symbols.pyx because it changed.
      Compiling spacy/vectors.pyx because it changed.
      [ 1/41] Cythonizing spacy/attrs.pyx
      [ 2/41] Cythonizing spacy/kb.pyx
      Traceback (most recent call last):
        File "/mnt/home/lotrecks/anaconda3/envs/graphs/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/mnt/home/lotrecks/anaconda3/envs/graphs/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/mnt/home/lotrecks/anaconda3/envs/graphs/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-ytg7xk32/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-ytg7xk32/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-ytg7xk32/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 341, in run_setup
          exec(code, locals())
        File "<string>", line 224, in <module>
        File "<string>", line 211, in setup_package
        File "/tmp/pip-build-env-ytg7xk32/overlay/lib/python3.10/site-packages/Cython/Build/Dependencies.py", line 1154, in cythonize
          cythonize_one(*args)
        File "/tmp/pip-build-env-ytg7xk32/overlay/lib/python3.10/site-packages/Cython/Build/Dependencies.py", line 1321, in cythonize_one
          raise CompileError(None, pyx_file)
      Cython.Compiler.Errors.CompileError: spacy/kb.pyx
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

This looks like it may be related to spaCy rather than to pubtator_loader directly, but I have never hit this error when installing spaCy alongside other packages. Have you seen this error before?

Can't import the library: No module named 'pubtator_loader.models'

Hi,

While trying to use your library, I failed at the first step: importing it. Can you spot what I could be doing wrong?

Find the traceback below.

...: from pubtator_loader.pubtator_corpus_reader import PubTatorCorpusReader

ModuleNotFoundError Traceback (most recent call last)
in
----> 1 from pubtator_loader.pubtator_corpus_reader import PubTatorCorpusReader

~/anaconda3/envs/know-nlp-tf2/lib/python3.6/site-packages/pubtator_loader-0.1.1-py3.6.egg/pubtator_loader/__init__.py in
----> 1 from .models import PubTatorEntity, PubTatorDocument # noqa

ModuleNotFoundError: No module named 'pubtator_loader.models'
