bayer-ie's Introduction

CSAIL NLP Bayer Project

This is a Python-based project for parsing and extracting relevant portions of EMA/FDA/EPA PDF documents. The modules in this project are set up so that they can be used through a command-line interface (CLI).

Installation

Use the package manager pip to install the packages specified in requirements.txt.

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Processing native PDF files depends on xpdf's implementation of pdftohtml (version 0.62.0). For the modules in this project to work properly, download pdftohtml and add it to your PATH.
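Before running the modules, it can help to confirm that the executable is actually visible on your PATH. The snippet below is a minimal sketch using only the standard library; the helper name `find_pdftohtml` is hypothetical and not part of this project, and the check does not verify the pdftohtml version.

```python
import shutil


def find_pdftohtml(executable="pdftohtml"):
    """Return the full path to the executable if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_pdftohtml()
    if path is None:
        print("pdftohtml was not found on PATH")
    else:
        print(f"pdftohtml found at {path}")
```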

The review PDFs for the retrieval task are processed with Grobid.

Pretrained models

Due to GitHub/GitLab's file size constraints, the pretrained models have been compressed using gzip. To use them, first decompress them with the following commands:

gzip -d <path to compressed EMA model>
gzip -d <path to compressed FDA model>
gzip -d <path to compressed EPA model>
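If the gzip command-line tool is not available (for example on some Windows setups), the same step can be done from Python. This is a sketch using the standard library only; `decompress_model` is a hypothetical helper, not part of this project, and it assumes the compressed model files end in `.gz`.

```python
import gzip
import shutil
from pathlib import Path


def decompress_model(compressed_path):
    """Decompress a .gz file next to the original and return the new path.

    Unlike `gzip -d`, this keeps the compressed file in place.
    """
    src = Path(compressed_path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dst = src.with_suffix("")  # strips only the trailing ".gz"
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Call it once per model, passing the path to each compressed file.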

Usage - Extraction

The code is located in src/extraction.

Currently, there are five available commands. They are accessed through the main.py module and are as follows:

  • segment

    • Positional Arguments:

      • source: Data source (EMA or FDA)
      • dir: Path to directory containing input documents (PDFs or XMLs)
      • output_dir: Path to desired output file(s)
    • Optional Arguments:

      • --pool-workers: Number of pool workers to be used.
      • --separate-documents: Separate segmentation on a per-document basis.
  • mapLabels

    • Positional Arguments:
      • data_dir: Path to segmented files.
      • output_dir: Path to output directory.
      • mapping_file: Path to json file with label mappings.
    • Optional Arguments:
      • --separate-documents: Separate segmentations on a per-document basis.
  • train

    • Positional Arguments:
      • source: Data source (EMA or FDA).
      • data_dir: Path to segmented file(s).
      • output_dir: Path to desired output directory.
    • Optional Arguments:
      • --rationales_path: Path to rationales file.
  • predict

    • Positional Arguments:
      • source: Data source (EMA or FDA).
      • data_dir: Path to segmented file(s).
      • models_path: Path to trained models file.
      • output_dir: Path to desired output directory.
    • Optional Arguments:
      • --separate_documents: Separate segmentations on a per-document basis.
  • xvalidate

    • Positional Arguments:
      • source: Data source (EMA or FDA).
      • data_dir: Path to segmented file(s).
      • output_dir: Path to desired output directory.
      • num_folds: Number of folds to use in cross validation.
    • Optional Arguments:
      • --rationales_path: Path to rationales file.
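To make the argument layout above concrete, here is a rough sketch of how two of the commands (segment and xvalidate) could be declared with argparse. It is not taken from the project's main.py, which may be organized differently. The function name `build_parser`, the argument types, the defaults, and the treatment of --separate-documents as a boolean flag are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented segment and xvalidate commands."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    seg = sub.add_parser("segment")
    seg.add_argument("source", choices=["EMA", "FDA"])
    seg.add_argument("dir")
    seg.add_argument("output_dir")
    # Type and default are assumptions; the README does not state them.
    seg.add_argument("--pool-workers", type=int, default=1)
    # Assumed to be a boolean switch.
    seg.add_argument("--separate-documents", action="store_true")

    xval = sub.add_parser("xvalidate")
    xval.add_argument("source", choices=["EMA", "FDA"])
    xval.add_argument("data_dir")
    xval.add_argument("output_dir")
    xval.add_argument("num_folds", type=int)
    xval.add_argument("--rationales_path", default=None)
    return parser


# Example: parse a segment invocation with placeholder paths.
args = build_parser().parse_args(
    ["segment", "EMA", "pdfs/", "out/", "--pool-workers", "4"]
)
```

With a layout like this, each command reads its positional arguments in the order listed above, and the optional arguments can appear anywhere after the command name.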

Usage - Retrieval

The code is located in src/retrieval.

pdfToXML.py converts the PDFs to XML files. A Grobid server must be running beforehand. Examples can be found in example_data/retrieval.
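Since the conversion fails without a running Grobid server, a quick health check before calling pdfToXML.py can save time. The sketch below queries Grobid's /api/isalive endpoint. It assumes the server listens at Grobid's default address, http://localhost:8070, which may differ in your setup. `grobid_is_alive` is a hypothetical helper, not part of this project.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if a Grobid server answers its /api/isalive health check."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```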

main.py runs and evaluates the algorithm. Use the --method [BOW|WMD] option to specify the method. The predictions and evaluation results are saved to an Excel file.

python src/retrieval/main.py --method [BOW|WMD] --excel_path [path to the xlsx file] \
        --xml_path [path to the XML folder] --output_path [output xlsx]
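For readers unfamiliar with the two method names: BOW presumably stands for bag-of-words and WMD for Word Mover's Distance, though the README does not spell them out. As an illustration of the general bag-of-words idea only, the sketch below ranks passages against a query by cosine similarity of token counts. It is not the project's implementation; the tokenization and the function names `bow_vector`, `cosine_similarity`, and `rank_passages` are assumptions.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercase, split on non-alphanumerics, and count tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query, passages):
    """Return (index, score) pairs sorted from most to least similar."""
    q = bow_vector(query)
    scores = [(i, cosine_similarity(q, bow_vector(p))) for i, p in enumerate(passages)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A bag-of-words score only rewards exact token overlap, which is one reason a distance over word embeddings such as WMD is a common alternative.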

bayer-ie's People

Contributors

juanmoo, irin99
