
gsarti / covid-papers-browser



Home Page: http://covidbrowser.areasciencepark.it

License: GNU General Public License v2.0

Topics: biobert, scibert, bionlp, natural-language-processing, covid-19, search-engine

covid-papers-browser's Introduction

Covid-19 Semantic Browser: Browse Covid-19 & SARS-CoV-2 Scientific Papers with Transformers 🦠 📖

Covid-19 Semantic Browser is an interactive experimental tool leveraging a state-of-the-art language model to search relevant content inside the COVID-19 Open Research Dataset (CORD-19) recently published by the White House and its research partners. The dataset contains over 44,000 scholarly articles about COVID-19, SARS-CoV-2 and related coronaviruses.

Various models already fine-tuned on Natural Language Inference (NLI) are available to perform the search:

  • scibert-nli, SciBERT [1] fine-tuned on NLI.

  • biobert-nli, BioBERT [2] fine-tuned on NLI.

  • covidbert-nli, CovidBERT fine-tuned on NLI.

All models are trained on SNLI [3] and MultiNLI [4] using the sentence-transformers library [5] to produce universal sentence embeddings [6]. Embeddings are subsequently used to perform semantic search on CORD-19.
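The search step itself reduces to a nearest-neighbour lookup over sentence embeddings. A minimal sketch of that idea, using plain cosine similarity over toy vectors (the hard-coded vectors below stand in for actual model outputs; this is not the repository's implementation):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, corpus_vecs, top_k=2):
    # Rank corpus entries by similarity to the query embedding.
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy embeddings standing in for abstract embeddings.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
print(semantic_search(query, corpus))
```

In the actual tool, the query and corpus vectors come from the sentence-transformers model, but the ranking logic is the same.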

Currently supported operations are:

  • Browse paper abstracts with interactive queries.

  • Reproduce SciBERT-NLI, BioBERT-NLI and CovidBERT-NLI training results.

Setup

Python 3.6 or higher is required to run the code. First, install the required libraries with pip, then download the en_core_web_sm language pack for spaCy and data for NLTK:

pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt

Using the Browser

First, download a model fine-tuned on NLI from HuggingFace's cloud repository.

python scripts/download_model.py --model scibert-nli

Second, download the data from the Kaggle challenge page and place it in the data folder.

Finally, simply run:

python scripts/interactive_search.py

to enter the interactive demo. Using a GPU is recommended, since creating embeddings for the entire corpus can otherwise be time-consuming. Both the corpus and the embeddings are cached on disk after the first execution of the script, so subsequent runs are much faster.
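The caching behaviour described above can be sketched as follows; the cache path and embedding function here are illustrative stand-ins, not the script's actual names:

```python
import os
import pickle

def load_or_compute_embeddings(texts, embed_fn, cache_path):
    # Reuse cached embeddings if present; otherwise compute and persist them.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    embeddings = [embed_fn(text) for text in texts]
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```

On the first run the (expensive) embed_fn is called once per text; on later runs the pickle is loaded instead.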

Use the interactive demo as follows:

Demo GIF

Reproducing Training Results for Transformers

First, download a pretrained model from HuggingFace's cloud repository.

python scripts/download_model.py --model scibert

Second, download the NLI datasets used for training and the STS dataset used for testing.

python scripts/get_finetuning_data.py

Finally, run the finetuning script by adjusting the parameters depending on the model you intend to train (default is scibert-nli).

python scripts/finetune_nli.py

The model will be evaluated against the test portion of the Semantic Text Similarity (STS) benchmark dataset at the end of training. Please refer to my model cards for additional references on parameter values.

References

[1] Beltagy et al. 2019, "SciBERT: A Pretrained Language Model for Scientific Text"

[2] Lee et al. 2020, "BioBERT: a pre-trained biomedical language representation model for biomedical text mining"

[3] Bowman et al. 2015, "A large annotated corpus for learning natural language inference"

[4] Williams et al. 2018, "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference"

[5] Reimers et al. 2019, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"

[6] Conneau et al. 2017, "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data"

covid-papers-browser's People

Contributors

gsarti, manueltonneau, mgorsk1, mirkolai


covid-papers-browser's Issues

Unable to download scibert-nli after latest changes

Trying:

python scripts/download_model.py --model scibert-nli

gets:

Traceback (most recent call last):
  File "scripts/download_model.py", line 50, in <module>
    tokenizer = AutoTokenizer.from_pretrained(MODELS_PRETRAINED[args.model])
KeyError: 'scibert-nli'
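A KeyError like this can be surfaced more helpfully by validating --model against the known keys up front, e.g. via argparse's choices. The model table below is a hypothetical stand-in for the script's actual mapping:

```python
import argparse

# Hypothetical subset of the download script's model table.
MODELS = {
    "scibert": "allenai/scibert_scivocab_cased",
    "scibert-nli": "gsarti/scibert-nli",
}

parser = argparse.ArgumentParser()
# choices= makes argparse reject unknown names with a clear message
# instead of letting the lookup fail later with a bare KeyError.
parser.add_argument("--model", choices=sorted(MODELS), default="scibert-nli")

args = parser.parse_args(["--model", "scibert-nli"])
print(MODELS[args.model])
```

Passing an unlisted name then fails at argument parsing with a usage message listing the valid choices.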

error in download_scibert.py (os.rmtree)

Hello, great library! I am getting the error AttributeError: module 'os' has no attribute 'rmtree' using Python 3.7.

This is fixed if os.rmtree(path) is replaced with shutil.rmtree(path)
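For reference, the standard library puts recursive directory removal in shutil, not os; a minimal demonstration of the fix:

```python
import os
import shutil
import tempfile

path = tempfile.mkdtemp()
open(os.path.join(path, "dummy.txt"), "w").close()

# os has no rmtree, and os.rmdir only removes empty directories;
# shutil.rmtree deletes a directory tree recursively.
shutil.rmtree(path)
print(os.path.exists(path))  # False
```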

Add argparse parametrization for the finetuning script

Similar to what is currently available in download_model.py, add argparse parametrization to finetune_nli.py with the following parameters:

  • model_name, default 'models/scibert', type str

  • batch_size, default 64, type int

  • model_save_path, default 'models/scibert_nli', type str

  • num_epochs, default 2, type int

  • warmup_steps, default None, not required

  • do_mean_pooling with action='store_true'

  • do_cls_pooling with action='store_true'

  • do_max_pooling with action='store_true'

Then:

  • Add a check that at most one pooling option is set (raise an AttributeError if more than one is). If none is specified, default to mean pooling.

  • Check whether warmup_steps was set by the user before defaulting it to 10% of the training steps: if it was, keep the user-defined value.
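The requested parametrization could look roughly like this (a sketch of the issue's spec, not the repository's actual implementation):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Fine-tune a model on NLI.")
    parser.add_argument("--model_name", type=str, default="models/scibert")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--model_save_path", type=str, default="models/scibert_nli")
    parser.add_argument("--num_epochs", type=int, default=2)
    parser.add_argument("--warmup_steps", type=int, default=None)
    parser.add_argument("--do_mean_pooling", action="store_true")
    parser.add_argument("--do_cls_pooling", action="store_true")
    parser.add_argument("--do_max_pooling", action="store_true")
    return parser

def resolve_options(args, train_steps):
    # At most one pooling strategy may be chosen; default is mean pooling.
    chosen = [args.do_mean_pooling, args.do_cls_pooling, args.do_max_pooling]
    if sum(chosen) > 1:
        raise AttributeError("Choose at most one pooling strategy.")
    pooling = "cls" if args.do_cls_pooling else "max" if args.do_max_pooling else "mean"
    # Keep a user-supplied warmup_steps; otherwise use 10% of training steps.
    warmup = args.warmup_steps if args.warmup_steps is not None else int(0.1 * train_steps)
    return pooling, warmup
```

With no flags, resolve_options yields mean pooling and a warmup of 10% of the training steps; an explicit --warmup_steps value is preserved.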

Open Source Helps!

Thanks for your work to help the people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects all open source projects related to COVID-19, including maps, data, news, api, analysis, medical and supply information, etc. Please share to anyone who might need the information in the list, or will possibly contribute to some of those projects. You are also welcome to recommend more projects.

http://open-source-covid-19.weileizeng.com/

Cheers!

Error in pip install of requirements, but still worked?

I'm on Ubuntu; I made a fresh conda environment, conda-installed pip within it, then ran pip install -r requirements.txt

I got this

ERROR: sentence-transformers 0.2.5.1 has requirement transformers==2.3.0, but you'll have transformers 2.5.1 which is incompatible.

But then the pip install continued and apparently succeeded.
