
BioGPT

This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.

Requirements and Installation

  • PyTorch version == 1.12.0
  • Python version == 3.10
  • fairseq version == 0.12.0:
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..
  • Moses
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder
  • fastBPE
git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
  • sacremoses
pip install sacremoses
  • sklearn
pip install scikit-learn

Remember to set the environment variables MOSES and FASTBPE to the paths of Moses and fastBPE respectively, as they will be required later.
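
You can sanity-check the setup before moving on with a small Python snippet (a sketch, assuming the directory layout created by the commands above); the Moses tokenizer script and the fast binary are what the downstream preprocessing scripts resolve through these variables:

import os

# Both variables must point at real directories.
for var in ("MOSES", "FASTBPE"):
    path = os.environ.get(var)
    assert path and os.path.isdir(path), f"{var} is not set correctly"

# The Moses tokenizer script and the built fastBPE binary should exist.
assert os.path.isfile(os.path.join(os.environ["MOSES"], "scripts/tokenizer/tokenizer.perl"))
assert os.path.isfile(os.path.join(os.environ["FASTBPE"], "fast"))
print("MOSES and FASTBPE look good")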

Getting Started

Pre-trained models

We provide our pre-trained BioGPT model checkpoints along with fine-tuned checkpoints for downstream tasks, available both via direct URL download and on the Hugging Face 🤗 Hub.

| Model | Description | URL | 🤗 Hub |
|---|---|---|---|
| BioGPT | Pre-trained BioGPT model checkpoint | link | link |
| BioGPT-Large | Pre-trained BioGPT-Large model checkpoint | link | link |
| BioGPT-QA-PubMedQA-BioGPT | Fine-tuned BioGPT for question answering task on PubMedQA | link | |
| BioGPT-QA-PubMedQA-BioGPT-Large | Fine-tuned BioGPT-Large for question answering task on PubMedQA | link | |
| BioGPT-RE-BC5CDR | Fine-tuned BioGPT for relation extraction task on BC5CDR | link | |
| BioGPT-RE-DDI | Fine-tuned BioGPT for relation extraction task on DDI | link | |
| BioGPT-RE-DTI | Fine-tuned BioGPT for relation extraction task on KD-DTI | link | |
| BioGPT-DC-HoC | Fine-tuned BioGPT for document classification task on HoC | link | |

Download them and extract them to the checkpoints folder of this project.

For example:

mkdir checkpoints
cd checkpoints
wget "https://msralaphilly2.blob.core.windows.net/release/BioGPT/checkpoints/Pre-trained-BioGPT.tgz?sp=r&st=2023-11-13T15:37:35Z&se=2099-12-30T23:37:35Z&spr=https&sv=2022-11-02&sr=b&sig=3CcG1TOhqJPBhkVutvVn3PtUq0vPyLBgwggUfojypfY%3D"
tar -zxvf Pre-trained-BioGPT.tgz

Example Usage

Use pre-trained BioGPT model in your code:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "checkpoints/Pre-trained-BioGPT", 
        "checkpoint.pt", 
        "data",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)
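
Equivalently, the fairseq hub interface offers a one-call helper that encodes, generates, and decodes internally (sample is part of fairseq's GeneratorHubInterface; a minimal sketch):

# one-liner alternative: encode + beam-search generate + decode in one call
print(m.sample(["COVID-19 is"], beam=5)[0])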

Use fine-tuned BioGPT model on KD-DTI for drug-target-interaction in your code:

import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
m = TransformerLanguageModelPrompt.from_pretrained(
        "checkpoints/RE-DTI-BioGPT", 
        "checkpoint_avg.pt", 
        "data/KD-DTI/relis-bin",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)
m.cuda()
src_text = ""  # input text, e.g., a PubMed abstract
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=1)[0]  # beam=1 matches the value passed to from_pretrained above
output = m.decode(generate[0]["tokens"])
print(output)

For more downstream tasks, please see below.

Downstream tasks

See the corresponding folders under examples.

Hugging Face 🤗 Usage

BioGPT has also been integrated into the Hugging Face transformers library, and model checkpoints are available on the Hugging Face Hub.

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import BioGptTokenizer, BioGptForCausalLM
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
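
The snippet above returns next-token logits. If you want the contextual token embeddings instead, a minimal sketch using the standard transformers output_hidden_states flag (the variable names are illustrative):

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, sequence_length, hidden_size); the last one is the final layer.
features = output.hidden_states[-1]
print(features.shape)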

Beam-search decoding:

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

sentence = "COVID-19 is"
inputs = tokenizer(sentence, return_tensors="pt")

set_seed(42)

with torch.no_grad():
    beam_output = model.generate(**inputs,
                                 min_length=100,
                                 max_length=1024,
                                 num_beams=5,
                                 early_stopping=True
                                )
tokenizer.decode(beam_output[0], skip_special_tokens=True)
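
Beam search is deterministic; for more varied continuations you can sample instead. A minimal sketch (the do_sample/top_k/top_p values below are arbitrary illustrative choices, not from the original README):

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
inputs = tokenizer("COVID-19 is", return_tensors="pt")

set_seed(42)
with torch.no_grad():
    sample_output = model.generate(**inputs,
                                   do_sample=True,  # sample instead of beam search
                                   top_k=50,        # illustrative value
                                   top_p=0.95,      # illustrative value
                                   max_length=100)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))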

For more information, please see the documentation on the Hugging Face website.

Demos

Check out the demos on Hugging Face Spaces.

License

BioGPT is MIT-licensed. The license applies to the pre-trained models as well.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


Issues

Unable to download the checkpoints

Hi,
I tried to download the checkpoints with wget https://msralaphilly2.blob.core.windows.net/ml-la/release/BioGPT/checkpoints.tgz but got ERROR 404: The specified resource does not exist... So I wonder whether there is some problem with the download link. I would be grateful if you could check it. Thanks!

Download Links except BIOGPT not working

Hey, I tried downloading the pretrained models and, for some reason, none of them except the first one is working. Any reason?

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>BlobNotFound</Code>
<Message>The specified blob does not exist. RequestId:be06d49f-701e-0046-5564-2ad7cb000000 Time:2023-01-17T11:10:03.7923850Z</Message>
</Error>

Huggingface version?

Dear authors,

This was a pleasure to read your paper, this model looks very promising for biomedical applications. Do you plan on releasing your model on the Huggingface Hub soon? I think this would make it slightly easier to reuse.

All the best,
Valentin

colab notebook template?

I precisely followed the great README.md and shared it in a Colab notebook here: github gist

All went fine until I actually ran the inference:

path1 = '/content/BioGPT/'

from src.transformer_lm_prompt import TransformerLanguageModelPrompt

m = TransformerLanguageModelPrompt.from_pretrained(
        path1 + "checkpoints/RE-DTI-BioGPT", 
        "checkpoint_avg.pt", 
        path1 + "data/KD-DTI/relis-bin",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)

Upon running this I get the following error:


TypeError                                 Traceback (most recent call last)
<ipython-input-21-41726c6a2fdf> in <module>
      3 from src.transformer_lm_prompt import TransformerLanguageModelPrompt
      4 
----> 5 m = TransformerLanguageModelPrompt.from_pretrained(
      6         path1 + "checkpoints/RE-DTI-BioGPT",
      7         "checkpoint_avg.pt",

2 frames
/usr/local/lib/python3.8/dist-packages/fairseq/models/fairseq_model.py in from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    265         from fairseq import hub_utils
    266 
--> 267         x = hub_utils.from_pretrained(
    268             model_name_or_path,
    269             checkpoint_file,

/usr/local/lib/python3.8/dist-packages/fairseq/hub_utils.py in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     64         "vocab.json": "bpe_vocab",
     65     }.items():
---> 66         path = os.path.join(model_path, file)
     67         if os.path.exists(path):
     68             kwargs[arg] = path

/usr/lib/python3.8/posixpath.py in join(a, *p)
     74     will be discarded.  An empty last part will result in a path that
     75     ends with a separator."""
---> 76     a = os.fspath(a)
     77     sep = _get_sep(a)
     78     path = a

TypeError: expected str, bytes or os.PathLike object, not NoneType

I'm happy I made it so far, but I cannot really solve this myself. Could you tell me what I did wrong?

Only the first checkpoint link works

Are the links for pretrained models beyond the first intentionally dead?

Would love access to the Large Q&A version. Is there an ETA if 'coming soon'?

Best

README example fails: 'NoneType' object has no attribute 'tokenizer'

Hi there, I followed the README example set up script exactly, and received the following error. May you please help me resolve?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 m = TransformerLanguageModel.from_pretrained(
      2         "checkpoints/Pre-trained-BioGPT", 
      3         "checkpoint.pt", 
      4         "data",
      5         tokenizer='moses', 
      6         bpe='fastbpe', 
      7         bpe_codes="data/bpecodes",
      8         min_len=100,
      9         max_len_b=1024)

File ~/miniforge3/envs/test_env/lib/python3.10/site-packages/fairseq/models/fairseq_model.py:261, in BaseFairseqModel.from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    238 """
    239 Load a :class:`~fairseq.models.FairseqModel` from a pre-trained model
    240 file. Downloads and caches the pre-trained model file if needed.
   (...)
    257         model archive path.
    258 """
    259 from fairseq import hub_utils
--> 261 x = hub_utils.from_pretrained(
    262     model_name_or_path,
    263     checkpoint_file,
    264     data_name_or_path,
    265     archive_map=cls.hub_models(),
    266     **kwargs,
    267 )
    269 cls.upgrade_args(x["args"])
    271 logger.info(x["args"])

File ~/miniforge3/envs/test_env/lib/python3.10/site-packages/fairseq/hub_utils.py:70, in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     67 if "user_dir" in kwargs:
     68     utils.import_user_module(argparse.Namespace(user_dir=kwargs["user_dir"]))
---> 70 models, args, task = checkpoint_utils.load_model_ensemble_and_task(
     71     [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
     72     arg_overrides=kwargs,
     73 )
     75 return {
     76     "args": args,
     77     "task": task,
     78     "models": models,
     79 }

File ~/miniforge3/envs/test_env/lib/python3.10/site-packages/fairseq/checkpoint_utils.py:279, in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards)
    277 if not PathManager.exists(filename):
    278     raise IOError("Model file not found: {}".format(filename))
--> 279 state = load_checkpoint_to_cpu(filename, arg_overrides)
    280 if shard_idx == 0:
    281     args = state["args"]

File ~/miniforge3/envs/test_env/lib/python3.10/site-packages/fairseq/checkpoint_utils.py:231, in load_checkpoint_to_cpu(path, arg_overrides)
    229 if arg_overrides is not None:
    230     for arg_name, arg_val in arg_overrides.items():
--> 231         setattr(args, arg_name, arg_val)
    232 state = _upgrade_state_dict(state)
    233 return state

AttributeError: 'NoneType' object has no attribute 'tokenizer'

My environment variables are set as:

(test_env) an583@PHS030015 project_dir % conda env config vars list
MOSES = /Users/ayush/project_dir/mosesdecoder
FASTBPE = /Users/ayush/project_dir/fastBPE

I would be grateful for assistance to help resolve. Thanks!

Please post pre-print of the manuscript

I cannot access the Briefings in Bioinformatics paper through my institution, and many other interested readers will have the same problem. Please consider posting a pre-print of the manuscript (e.g. in arXiv). Thanks!

Download Error

Once downloaded, the installation fails, and it fails to run on GitHub.

Training code?

Is it possible to see the data/code used to actually pretrain the model?

Thanks!

"data" is not found after executing the code on Github

Hello,

Could anybody please guide me on how I can run the standard BioGPT model using the code below?

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "checkpoints/Pre-trained-BioGPT",
        "checkpoint.pt",
        "data",
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)

After running this, I always get the error that the data is not found. Not sure if I have to download the data from an external source separately or not.

Thanks

AssertionError

Hi, when I run the code (screenshot attached), an error is raised (screenshot attached). I don't know how to address this error. Would you please give a hand?

Azure ML

How can I use BioGPT in Azure ML? Please suggest.

Cannot run inference on PubMedQA-Large

Using your pre-trained model, the infer_large.sh script is failing as follows:

KeyError: "'_name'"
sed: can't read ../../checkpoints/QA-PubMedQA-BioGPT-Large/generate_checkpoint_avg.pt: No such file or directory
infer_large.sh: line 31: ../../checkpoints/QA-PubMedQA-BioGPT-Large/generate_checkpoint_avg.pt: No such file or directory
Traceback (most recent call last):
  File "/mnt/d/ml/biogpt/examples/QA-PubMedQA/postprocess.py", line 37, in <module>
    with open(out_file, "r", encoding="utf8") as fr:
FileNotFoundError: [Errno 2] No such file or directory: '../../checkpoints/QA-PubMedQA-BioGPT-Large/generate_checkpoint_avg.pt.detok'
Traceback (most recent call last):
  File "/mnt/d/ml/biogpt/examples/QA-PubMedQA/hard_match_evaluation.py", line 37, in <module>
    main()
  File "/mnt/d/ml/biogpt/examples/QA-PubMedQA/hard_match_evaluation.py", line 19, in main
    with open(pred_file) as reader:
FileNotFoundError: [Errno 2] No such file or directory: '../../checkpoints/QA-PubMedQA-BioGPT-Large/generate_checkpoint_avg.pt.detok.extracted.txt'

Please let me know if you have any suggestions to get it working. There seems to be a problem generating the output file.

Error at loading: TransformerLanguageModel.from_pretrained in google colab

I am following the instructions in the README on Google Colab.

path1 = '/content/fairseq/'
import os
path = os.path.join(path1, "checkpoints/Pre-trained-BioGPT")

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        path, 
        "checkpoint.pt", 
        "data",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      5 import torch
      6 from fairseq.models.transformer_lm import TransformerLanguageModel
----> 7 m = TransformerLanguageModel.from_pretrained(
      8         path,
      9         "checkpoint.pt",

7 frames
/usr/local/lib/python3.8/dist-packages/fairseq/models/transformer_lm.py in build_embedding(cls, args, dictionary, embed_dim, path)
    319 @classmethod
    320 def build_embedding(cls, args, dictionary, embed_dim, path=None):
--> 321     embed_tokens = Embedding(len(dictionary), embed_dim, dictionary.pad())
    322     return embed_tokens
    323

BioGPT-Large Exception: Could not infer language pair, please provide it explicitly

Hi, I installed BioGPT in a Docker container (repbioinfo/biogpt). I have downloaded the Pre-trained BioGPT-Large model checkpoint, and here is the script (/BioGPT/script.py):

import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
m = TransformerLanguageModelPrompt.from_pretrained(
        "/scratch/QA-PubMedQA-BioGPT-Large/", 
        "checkpoint_avg.pt", 
        "/BioGPT/data/BioGPT-Large",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="/BioGPT/data/bpecodes",
        max_len_b=1024,
        beam=1)
m.cuda()
src_text="what DNA is today"
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=args.beam)[0]
output = m.decode(generate[0]["tokens"])

But unfortunately I get this error:

Thanks!

2023-03-01 20:50:43 | INFO | fairseq.file_utils | loading archive file /scratch/QA-PubMedQA-BioGPT-Large/
2023-03-01 20:50:43 | INFO | fairseq.file_utils | loading archive file /BioGPT/data/BioGPT-Large
Traceback (most recent call last):
  File "/BioGPT/script.py", line 3, in <module>
    m = TransformerLanguageModelPrompt.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/fairseq/models/fairseq_model.py", line 267, in from_pretrained
    x = hub_utils.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/fairseq/hub_utils.py", line 73, in from_pretrained
    models, args, task = checkpoint_utils.load_model_ensemble_and_task(
  File "/usr/local/lib/python3.10/dist-packages/fairseq/checkpoint_utils.py", line 432, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/usr/local/lib/python3.10/dist-packages/fairseq/tasks/__init__.py", line 46, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "/BioGPT/src/language_modeling_prompt.py", line 133, in setup_task
    raise Exception(
Exception: Could not infer language pair, please provide it explicitly
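
For reference: from_pretrained forwards extra keyword arguments as checkpoint overrides, so one workaround sometimes suggested is to pass the language pair explicitly. A hedged sketch of the same call; the "x"/"y" pair is an assumption based on how this repository's preprocessing scripts binarize data (see the fairseq-preprocess log in a later issue), so verify it against the file suffixes in your binarized data directory:

import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt

m = TransformerLanguageModelPrompt.from_pretrained(
        "/scratch/QA-PubMedQA-BioGPT-Large/",
        "checkpoint_avg.pt",
        "/BioGPT/data/BioGPT-Large",
        source_lang="x",   # assumption: match the *.x suffix of the binarized files
        target_lang="y",   # assumption: match the *.y suffix
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="/BioGPT/data/bpecodes",
        max_len_b=1024,
        beam=1)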

How to use Pre-trained-BioGPT-Large model?

I was able to use Pre-trained BioGPT in accordance with the use case.
Could you give us example code for using Pre-trained BioGPT-Large?

I tried this code.

import os
os.chdir('/home/******/BioGPT')

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel

m = TransformerLanguageModel.from_pretrained(
        "checkpoints/Pre-trained-BioGPT-Large", 
        "checkpoint.pt", 
        "data",
        tokenizer='moses', 
        bpe='biogpt-large-fastbpe', 
        bpe_codes="data/biogpt_large_bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)

And the result was as below.

RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
	size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([57717, 1600]) from checkpoint, the shape in current model is torch.Size([42384, 1600]).
	size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([57717, 1600]) from checkpoint, the shape in current model is torch.Size([42384, 1600]).

Thank you!

Unable to cat and decompress files

wget https://msralaphilly2.blob.core.windows.net/ml-la/release/BioGPT/checkpoints.tgz has failed on every machine I have tried it on due to read errors.

Thus, I tried for i in `seq 0 21`; do wget `printf "https://msralaphilly2.blob.core.windows.net/ml-la/release/BioGPT/checkpoints.tgz.%02d" $i`; done

This works, but cat checkpoints.tgz.* | tar -zxvf - fails with the following error:

gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Any ideas as to why?

Error when executig

File "C:\Users\Gilgamesh\Documents\Advanced A.I\Bio\BioGPT\bio.py", line 3, in
m = TransformerLanguageModel.from_pretrained(
File "C:\Python310\lib\site-packages\fairseq\models\fairseq_model.py", line 275, in from_pretrained
return hub_utils.GeneratorHubInterface(x["args"], x["task"], x["models"])
File "C:\Python310\lib\site-packages\fairseq\hub_utils.py", line 108, in init
self.bpe = encoders.build_bpe(cfg.bpe)
File "C:\Python310\lib\site-packages\fairseq\registry.py", line 61, in build_x
return builder(cfg, *extra_args, **extra_kwargs)
File "C:\Python310\lib\site-packages\fairseq\data\encoders\fastbpe.py", line 27, in init
self.bpe = fastBPE.fastBPE(codes)
AttributeError: module 'fastBPE' has no attribute 'fastBPE'

I get this error when running this code

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
"checkpoints",
"checkpoint.pt",
"data",
tokenizer='moses',
bpe='fastbpe',
bpe_codes="data/BioGPT",
min_len=100,
max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)

Unable to find the PubMed/data-bin file

Hi,

I am trying to use BioGPT for generating some biomedical text.
However, the PubMed/data-bin file is needed to load the pre-trained checkpoint of the BioGPT model.
I could not find a way to download this file.
Let me know if I can get this.

Thanks

colab: preprocessing (RE-DDI)

Please see the public GitHub gist I created; I try to run !bash preprocess.sh from within %cd /content/BioGPT/examples/RE-DDI.

I followed the instructions in this repository's README.md, but running !bash preprocess.sh produces very strange output:

Traceback (most recent call last):
  File "/usr/local/bin/fairseq-preprocess", line 8, in <module>
    sys.exit(cli_main())
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 389, in cli_main
    main(args)
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 372, in main
    _make_all(args.source_lang, src_dict, args)
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 185, in _make_all
    _make_dataset(
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 178, in _make_dataset
    _make_binary_dataset(
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 119, in _make_binary_dataset
    final_summary = FileBinarizer.multiprocess_dataset(
  File "/usr/local/lib/python3.8/dist-packages/fairseq/binarizer.py", line 100, in multiprocess_dataset
    offsets = find_offsets(input_file, num_workers)
  File "/usr/local/lib/python3.8/dist-packages/fairseq/file_chunker_utils.py", line 25, in find_offsets
    with open(filename, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/DDI/raw/relis_train.tok.bpe.x'

Following PMID in ../../data/DDI/raw/train.json has no extracted triples:
DDI-DrugBank.d519 DDI-MedLine.d18 DDI-DrugBank.d491 DDI-MedLine.d4 DDI-DrugBank.d134 DDI-DrugBank.d230 DDI-DrugBank.d259 DDI-DrugBank.d293 DDI-MedLine.d64 DDI-MedLine.d100 DDI-DrugBank.d295 DDI-DrugBank.d402 DDI-MedLine.d101 DDI-DrugBank.d190 DDI-MedLine.d140 DDI-MedLine.d112 DDI-MedLine.d9 DDI-DrugBank.d301 DDI-DrugBank.d128 DDI-DrugBank.d101 DDI-DrugBank.d28 DDI-DrugBank.d376 DDI-MedLine.d28 DDI-DrugBank.d93 DDI-MedLine.d88 DDI-DrugBank.d539 DDI-DrugBank.d525 DDI-DrugBank.d540 DDI-DrugBank.d461 DDI-MedLine.d132 DDI-DrugBank.d360 DDI-MedLine.d43 DDI-MedLine.d121 DDI-DrugBank.d262 DDI-DrugBank.d164 DDI-DrugBank.d534 DDI-DrugBank.d385 DDI-DrugBank.d408 DDI-MedLine.d96 DDI-DrugBank.d285 DDI-DrugBank.d473 DDI-MedLine.d57 DDI-DrugBank.d557 DDI-DrugBank.d161 DDI-DrugBank.d24 DDI-DrugBank.d67 DDI-DrugBank.d490 DDI-DrugBank.d421 DDI-MedLine.d65 DDI-DrugBank.d342 DDI-DrugBank.d264 DDI-MedLine.d10 DDI-DrugBank.d312 DDI-MedLine.d117 DDI-MedLine.d135 DDI-DrugBank.d255 DDI-DrugBank.d390 DDI-DrugBank.d68 DDI-MedLine.d11 DDI-MedLine.d14 DDI-MedLine.d75 DDI-DrugBank.d541 DDI-DrugBank.d118 DDI-MedLine.d50 DDI-DrugBank.d218 DDI-DrugBank.d370 DDI-DrugBank.d201 DDI-DrugBank.d244 DDI-MedLine.d138 DDI-MedLine.d33 DDI-DrugBank.d553 DDI-DrugBank.d125 DDI-DrugBank.d366 DDI-DrugBank.d147 DDI-MedLine.d71 DDI-DrugBank.d363 DDI-MedLine.d32 DDI-MedLine.d76 DDI-DrugBank.d290 DDI-MedLine.d38 DDI-MedLine.d77 DDI-DrugBank.d80 DDI-DrugBank.d27 DDI-MedLine.d120 DDI-DrugBank.d52 DDI-DrugBank.d302 DDI-DrugBank.d486 DDI-DrugBank.d472 DDI-MedLine.d6 DDI-MedLine.d123 DDI-DrugBank.d173 DDI-DrugBank.d570 DDI-DrugBank.d126 DDI-DrugBank.d156 DDI-MedLine.d13 DDI-MedLine.d91 DDI-DrugBank.d349 DDI-DrugBank.d436 DDI-DrugBank.d300 DDI-DrugBank.d432 DDI-MedLine.d52 DDI-DrugBank.d554 DDI-MedLine.d19 DDI-DrugBank.d109 DDI-DrugBank.d63 DDI-DrugBank.d168 DDI-DrugBank.d37 DDI-DrugBank.d50 DDI-DrugBank.d455 DDI-DrugBank.d70 DDI-MedLine.d48 DDI-DrugBank.d515 DDI-DrugBank.d406 DDI-MedLine.d127 DDI-MedLine.d22 DDI-DrugBank.d418 DDI-MedLine.d78 DDI-MedLine.d80 DDI-MedLine.d129 DDI-DrugBank.d61 DDI-DrugBank.d524 DDI-DrugBank.d189 DDI-MedLine.d92 DDI-DrugBank.d6 DDI-DrugBank.d278 DDI-MedLine.d66 DDI-DrugBank.d383 DDI-MedLine.d15 DDI-MedLine.d60 DDI-MedLine.d31 DDI-MedLine.d58 DDI-MedLine.d137 DDI-DrugBank.d555 DDI-DrugBank.d58 DDI-DrugBank.d433 DDI-DrugBank.d375 DDI-DrugBank.d102 DDI-DrugBank.d268 DDI-DrugBank.d391 DDI-MedLine.d83 DDI-DrugBank.d243 DDI-DrugBank.d119 DDI-DrugBank.d49 DDI-MedLine.d139 DDI-DrugBank.d513 DDI-DrugBank.d451 DDI-DrugBank.d38 DDI-DrugBank.d182 DDI-MedLine.d118 DDI-DrugBank.d319 DDI-MedLine.d141 DDI-MedLine.d70 DDI-MedLine.d109 DDI-MedLine.d98 DDI-DrugBank.d214 DDI-DrugBank.d193 DDI-DrugBank.d152 DDI-MedLine.d40 DDI-DrugBank.d535 DDI-DrugBank.d167 DDI-MedLine.d108 DDI-DrugBank.d445 DDI-DrugBank.d235 DDI-DrugBank.d317 DDI-DrugBank.d251 DDI-DrugBank.d496 DDI-DrugBank.d117 DDI-DrugBank.d203 DDI-DrugBank.d532 DDI-DrugBank.d361 DDI-DrugBank.d294 DDI-MedLine.d37 DDI-MedLine.d72 DDI-MedLine.d95 DDI-DrugBank.d280 DDI-MedLine.d26 DDI-MedLine.d74 DDI-DrugBank.d407 DDI-DrugBank.d343 DDI-DrugBank.d209 DDI-DrugBank.d159 DDI-DrugBank.d239 DDI-DrugBank.d155 DDI-DrugBank.d474 DDI-DrugBank.d271 DDI-DrugBank.d403 DDI-DrugBank.d447 DDI-MedLine.d136 DDI-DrugBank.d90 DDI-DrugBank.d136 DDI-MedLine.d41 DDI-DrugBank.d292 DDI-DrugBank.d1 DDI-DrugBank.d92 DDI-DrugBank.d127 
664 samples in ../../data/DDI/raw/train.json has been processed with 195 samples has no triples extracted.
Following PMID in ../../data/DDI/raw/valid.json has no extracted triples:
DDI-DrugBank.d348 DDI-DrugBank.d520 DDI-DrugBank.d248 DDI-MedLine.d122 DDI-MedLine.d103 DDI-MedLine.d35 DDI-MedLine.d24 DDI-DrugBank.d169 DDI-DrugBank.d221 
50 samples in ../../data/DDI/raw/valid.json has been processed with 9 samples has no triples extracted.
191 samples in ../../data/DDI/raw/test.json has been processed with 0 samples has no triples extracted.
Preprocessing train
Can't open perl script "/scripts/tokenizer/tokenizer.perl": No such file or directory
Can't open perl script "/scripts/tokenizer/tokenizer.perl": No such file or directory
preprocess.sh: line 27: /fast: No such file or directory
preprocess.sh: line 28: /fast: No such file or directory
Preprocessing valid
Can't open perl script "/scripts/tokenizer/tokenizer.perl": No such file or directory
Can't open perl script "/scripts/tokenizer/tokenizer.perl": No such file or directory
preprocess.sh: line 27: /fast: No such file or directory
preprocess.sh: line 28: /fast: No such file or directory
Preprocessing test
Can't open perl script "/scripts/tokenizer/tokenizer.perl": No such file or directory
Can't open perl script "/scripts/tokenizer/tokenizer.perl": No such file or directory
preprocess.sh: line 27: /fast: No such file or directory
preprocess.sh: line 28: /fast: No such file or directory
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
2023-02-14 04:07:49 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2023-02-14 04:07:49 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='../../data/DDI/relis-bin', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='x', srcdict='../../data/DDI/raw/dict.txt', suppress_crashes=False, target_lang='y', task='translation', tensorboard_logdir=None, testpref='../../data/DDI/raw/relis_test.tok.bpe', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='../../data/DDI/raw/relis_train.tok.bpe', use_plasma_view=False, user_dir=None, validpref='../../data/DDI/raw/relis_valid.tok.bpe', wandb_project=None, workers=8)
2023-02-14 04:07:50 | INFO | fairseq_cli.preprocess | [x] Dictionary: 42384 types
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-preprocess", line 8, in <module>
    sys.exit(cli_main())
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 389, in cli_main
    main(args)
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 372, in main
    _make_all(args.source_lang, src_dict, args)
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 185, in _make_all
    _make_dataset(
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 178, in _make_dataset
    _make_binary_dataset(
  File "/usr/local/lib/python3.8/dist-packages/fairseq_cli/preprocess.py", line 119, in _make_binary_dataset
    final_summary = FileBinarizer.multiprocess_dataset(
  File "/usr/local/lib/python3.8/dist-packages/fairseq/binarizer.py", line 100, in multiprocess_dataset
    offsets = find_offsets(input_file, num_workers)
  File "/usr/local/lib/python3.8/dist-packages/fairseq/file_chunker_utils.py", line 25, in find_offsets
    with open(filename, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/DDI/raw/relis_train.tok.bpe.x'

I feel like this must be a very small error that blocks the execution. It seems, however, that it is difficult for me to solve on my own.

Tokenizer encodes special tokens as two element list

Using the Hugging Face implementation of the BioGPT tokenizer, I expected only 1 element but got 2.

from transformers import BioGptTokenizer
tokenizer= BioGptTokenizer.from_pretrained('microsoft/biogpt')
tokenizer.encode(tokenizer.eos_token)
output: [2, 2]
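
For reference, BioGPT's tokenizer prepends its sep/BOS token (</s>, id 2) to every encoded sequence, so encoding the EOS token yields two ids. A minimal sketch of how to get the single raw id, using the standard add_special_tokens flag:

from transformers import BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained('microsoft/biogpt')
# Disable the automatically prepended special token to encode the raw token:
print(tokenizer.encode(tokenizer.eos_token, add_special_tokens=False))  # [2]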

Clarification on model output

We have tried running QA (non-large), document classification, and the 3 RE models. All of them seem to have learned1, learned2, ..., learned9 in the output.
What do they stand for, or is there any step that we are missing?
We don't face this issue for text generation (non-large), though.
e.g.
re_bc5cdr output
"It is critical to understand factors associated with nasopharyngeal carcinoma (NPC) metastasis. To track the evolutionary route of metastasis, here we perform an integrative genomic analysis of 163 matched blood and primary, regional lymph node metastasis and distant metastasis tumour samples, combined with single-cell RNA-seq on 11 samples from two patients. The mutation burden, gene mutation frequency, mutation signature, and copy number frequency are similar between metastatic tumours and primary and regional lymph node tumours. There are two distinct evolutionary routes of metastasis, including metastases evolved from regional lymph nodes (lymphatic route, 61.5%, 8 / 13) and from primary tumours (hematogenous route, 38.5%, 5 / 13). learned1 learned2 learned3 learned4 learned5 learned6 learned7 learned8 learned9 the relation between cisplatin and NPC exists; the relation between cisplatin and metastasis exists;."

Unable to convert BioGpt slow tokenizer to fast: token out of vocabulary

Hi @themanojkumar ,
I was trying to use the BioGpt model in my QA task for fine-tuning. I would like to use the tokenizer as a fast tokenizer, so that I could use the offsets_mapping to know from which words the tokens originate. But unfortunately, when creating a BiogptTokenizerFast from the PreTrainedTokenizerFast via convert_slow_tokenizer, the following error occurs: Error while initializing BPE: Token -@</w> out of vocabulary.
According to this issue https://github.com/huggingface/transformers/issues/9290, this problem might be caused by some missing tokens. Could you please check it? Thank you very much!

Environment

transformers version: 4.25.0

Error trace

Traceback (most recent call last):
  File "run.py", line 124, in <module>
    trainer, predict_dataset = get_trainer(args)
  File "***/tasks/qa/get_trainer.py", line 31, in get_trainer
    tokenizer = BioGptTokenizerFast.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "***/model/biogpt/tokenization_biogpt_fast.py", line 117, in __init__
    super().__init__(
  File "***/model/biogpt/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "***/model/biogpt/convert_slow_tokenizer.py", line 1198, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "***/model/biogpt/convert_slow_tokenizer.py", line 273, in converted
    BPE(
Exception: Error while initializing BPE: Token `-@</w>` out of vocabulary

Installation Error

I followed the install instructions and the following error occurred. How can I resolve this?
Thank you

$ python setup.py build_ext --inplace
Traceback (most recent call last):
  File "/home/miniconda3/envs/biogpt/lib/python3.10/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/home/miniconda3/envs/biogpt/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/miniconda3/envs/biogpt/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fairseq/setup.py", line 87, in <module>
    from torch.utils import cpp_extension
  File "/home/miniconda3/envs/biogpt/lib/python3.10/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File "/home/miniconda3/envs/biogpt/lib/python3.10/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File "/home/miniconda3/envs/biogpt/lib/python3.10/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/home/miniconda3/envs/biogpt/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/miniconda3/envs/biogpt/lib/python3.10/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

Can't initialize `PubMedQA`

I currently have 36309e2 cloned and downloaded the PubMedQA checkpoint on 2/3/2023:

$ md5sum checkpoints/QA-PubMedQA-BioGPT.tgz
8d05745c9cd93ce3a7b4d87251823b67  checkpoints/QA-PubMedQA-BioGPT.tgz

Following the advice under #23, I was able to make some progress, but this is what I get when I try to initialize PubMedQA:

import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt

m = TransformerLanguageModelPrompt.from_pretrained(
        "checkpoints/QA-PubMedQA-BioGPT",
        "checkpoint_avg.pt",
        data="data/PubMedQA/biogpt-ansis-bin",
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 m = TransformerLanguageModelPrompt.from_pretrained(
      2         "checkpoints/QA-PubMedQA-BioGPT",
      3         "checkpoint_avg.pt",
      4         data="data/PubMedQA/biogpt-ansis-bin",
      5         tokenizer='moses',
      6         bpe='fastbpe',
      7         bpe_codes="data/bpecodes",
      8         min_len=100,
      9         max_len_b=1024)

File ~/miniconda3/envs/biogpt/lib/python3.10/site-packages/fairseq/models/fairseq_model.py:267, in BaseFairseqModel.from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    244 """
    245 Load a :class:`~fairseq.models.FairseqModel` from a pre-trained model
    246 file. Downloads and caches the pre-trained model file if needed.
   (...)
    263         model archive path.
    264 """
    265 from fairseq import hub_utils
--> 267 x = hub_utils.from_pretrained(
    268     model_name_or_path,
    269     checkpoint_file,
    270     data_name_or_path,
    271     archive_map=cls.hub_models(),
    272     **kwargs,
    273 )
    274 logger.info(x["args"])
    275 return hub_utils.GeneratorHubInterface(x["args"], x["task"], x["models"])

File ~/miniconda3/envs/biogpt/lib/python3.10/site-packages/fairseq/hub_utils.py:73, in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     70 if "user_dir" in kwargs:
     71     utils.import_user_module(argparse.Namespace(user_dir=kwargs["user_dir"]))
---> 73 models, args, task = checkpoint_utils.load_model_ensemble_and_task(
     74     [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
     75     arg_overrides=kwargs,
     76 )
     78 return {
     79     "args": args,
     80     "task": task,
     81     "models": models,
     82 }

File ~/miniconda3/envs/biogpt/lib/python3.10/site-packages/fairseq/checkpoint_utils.py:432, in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards, state)
    427     raise RuntimeError(
    428         f"Neither args nor cfg exist in state keys = {state.keys()}"
    429     )
    431 if task is None:
--> 432     task = tasks.setup_task(cfg.task)
    434 if "task_state" in state:
    435     task.load_state_dict(state["task_state"])

File ~/miniconda3/envs/biogpt/lib/python3.10/site-packages/fairseq/tasks/__init__.py:46, in setup_task(cfg, **kwargs)
     40         task = TASK_REGISTRY[task_name]
     42 assert (
     43     task is not None
     44 ), f"Could not infer task type from {cfg}. Available argparse tasks: {TASK_REGISTRY.keys()}. Available hydra tasks: {TASK_DATACLASS_REGISTRY.keys()}"
---> 46 return task.setup_task(cfg, **kwargs)

File ~/Projects/biogpt/src/language_modeling_prompt.py:134, in LanguageModelingPromptTask.setup_task(cls, args, **kwargs)
    132     args.source_lang, args.target_lang = data_utils.infer_language_pair(paths[0])
    133 if args.source_lang is None or args.target_lang is None:
--> 134     raise Exception(
    135         "Could not infer language pair, please provide it explicitly"
    136     )
    138 dictionary, output_dictionary = cls.setup_dictionary(args, **kwargs)
    139 prompt = cls.setup_prompt(args, dictionary)

Exception: Could not infer language pair, please provide it explicitly

I was wondering if it would be possible to add example code pieces (like the one for the basic model) to the repo for different models to make it easier to get started. Happy to help with the documentation if you have any pointers for me.

Thank you!

Cannot load the model TransformerLanguageModelPrompt.from_pretrained

Hi there,

I cannot load the model RE-DTI-BioGPT by running

import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
m = TransformerLanguageModelPrompt.from_pretrained(
        "checkpoints/RE-BC5CDR-BioGPT", 
        "checkpoint_avg.pt", 
        "data/BC5CDR/relis-bin",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)

and got the error

2023-02-14 15:08:06 | INFO | fairseq.file_utils | loading archive file checkpoints/RE-BC5CDR-BioGPT
2023-02-14 15:08:06 | INFO | fairseq.file_utils | loading archive file data/BC5CDR/relis-bin
2023-02-14 15:08:10 | INFO | src.language_modeling_prompt | dictionary: 42384 types
2023-02-14 15:08:13 | INFO | fairseq.models.fairseq_model | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': '../../src', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'pytorch_ddp', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'distributed_num_procs': 1}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': 1024, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 1024, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 100, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [32], 'lr': [1e-05], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': '../../checkpoints/RE-BC5CDR-BioGPT', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 
1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 1, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 1024, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': {'_name': 'transformer_lm_prompt_biogpt', 'activation_fn': 'gelu', 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'relu_dropout': 0.0, 'decoder_embed_dim': 1024, 'decoder_output_dim': 1024, 'decoder_input_dim': 1024, 'decoder_ffn_embed_dim': 4096, 'decoder_layers': 24, 'decoder_attention_heads': 16, 'decoder_normalize_before': True, 'no_decoder_final_norm': False, 'adaptive_softmax_cutoff': None, 'adaptive_softmax_dropout': 0.0, 'adaptive_softmax_factor': 4.0, 'no_token_positional_embeddings': False, 'share_decoder_input_output_embed': True, 'character_embeddings': False, 'character_filters': '[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', 'character_embedding_dim': 4, 'char_embedder_highway_layers': 2, 'adaptive_input': False, 'adaptive_input_factor': 4.0, 'adaptive_input_cutoff': None, 'tie_adaptive_weights': False, 'tie_adaptive_proj': False, 'decoder_learned_pos': True, 'layernorm_embedding': False, 'no_scale_embedding': False, 'checkpoint_activations': False, 'offload_activations': False, 'decoder_layerdrop': 0.0, 'decoder_layers_to_keep': None, 'quant_noise_pq': 0.0, 'quant_noise_pq_block_size': 8, 'quant_noise_scalar': 0.0, 'min_params_to_wrap': 100000000, 'base_layers': 0, 'base_sublayers': 1, 'base_shuffle': 0, 'add_bos_token': False, 'tokens_per_sample': 1024, 'max_target_positions': 1024, 'tpu': False, 'scale_fc': False, 'scale_attn': False, 'scale_heads': False, 'scale_resids': False}, 'task': {'_name': 'language_modeling_prompt', 'data': 'data/BC5CDR/relis-bin', 'sample_break_mode': 'none', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': 1024, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'source_lang': None, 'target_lang': None, 'max_source_positions': 640, 'manual_prompt': None, 'learned_prompt': 9, 'learned_prompt_pattern': 'learned', 'prefix': False, 'sep_token': 
'<seqsep>'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'tpu': False, 'lr': [1e-05]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 100, 'warmup_init_lr': 1e-07, 'lr': [1e-05]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': {'_name': 'fastbpe', 'bpe_codes': 'data/bpecodes'}, 'tokenizer': {'_name': 'moses', 'source_lang': 'en', 'target_lang': 'en', 'moses_no_dash_splits': False, 'moses_no_escape': False}}
Traceback (most recent call last):
  File "/Users/kaka/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3378, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-639ebf247e35>", line 1, in <module>
    m = TransformerLanguageModelPrompt.from_pretrained(
  File "/Users/kaka/lib/python3.8/site-packages/fairseq/models/fairseq_model.py", line 275, in from_pretrained
    return hub_utils.GeneratorHubInterface(x["args"], x["task"], x["models"])
  File "/Users/kaka/lib/python3.8/site-packages/fairseq/hub_utils.py", line 108, in __init__
    self.bpe = encoders.build_bpe(cfg.bpe)
  File "/Users/kaka/lib/python3.8/site-packages/fairseq/registry.py", line 61, in build_x
    return builder(cfg, *extra_args, **extra_kwargs)
  File "/Users/kaka/lib/python3.8/site-packages/fairseq/data/encoders/fastbpe.py", line 27, in __init__
    self.bpe = fastBPE.fastBPE(codes)
AttributeError: module 'fastBPE' has no attribute 'fastBPE'

Any idea why this is happening?

Cannot run example code: Couldn't find "data"

I ran all steps in a new conda env. I wondered what "data" path we are supposed to pass when instantiating the model; fairseq wonders as well:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        'checkpoints/Pre-trained-BioGPT', 
        'checkpoint.pt', 
        "data",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024
  )

yields

2022-12-20 19:41:01 | INFO | fairseq.file_utils | loading archive file checkpoints/Pre-trained-BioGPT
2022-12-20 19:41:01 | INFO | fairseq.file_utils | Archive name 'data' was not found in archive name list. We assumed 'data' was a path or URL but couldn't find any file associated to this path or URL.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 m = TransformerLanguageModel.from_pretrained(
      2         'checkpoints/Pre-trained-BioGPT', 
      3         'checkpoint.pt', 
      4         "data",
      5         tokenizer='moses', 
      6         bpe='fastbpe', 
      7         bpe_codes="data/bpecodes",
      8         min_len=100,
      9         max_len_b=1024
     10   )

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/models/fairseq_model.py:267, in BaseFairseqModel.from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    244 """
    245 Load a :class:`~fairseq.models.FairseqModel` from a pre-trained model
    246 file. Downloads and caches the pre-trained model file if needed.
   (...)
    263         model archive path.
    264 """
    265 from fairseq import hub_utils
--> 267 x = hub_utils.from_pretrained(
    268     model_name_or_path,
    269     checkpoint_file,
    270     data_name_or_path,
    271     archive_map=cls.hub_models(),
    272     **kwargs,
    273 )
    274 logger.info(x["args"])
    275 return hub_utils.GeneratorHubInterface(x["args"], x["task"], x["models"])

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/hub_utils.py:73, in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     70 if "user_dir" in kwargs:
     71     utils.import_user_module(argparse.Namespace(user_dir=kwargs["user_dir"]))
---> 73 models, args, task = checkpoint_utils.load_model_ensemble_and_task(
     74     [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
     75     arg_overrides=kwargs,
     76 )
     78 return {
     79     "args": args,
     80     "task": task,
     81     "models": models,
     82 }

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/checkpoint_utils.py:469, in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards, state)
    467 argspec = inspect.getfullargspec(task.build_model)
    468 if "from_checkpoint" in argspec.args:
--> 469     model = task.build_model(cfg.model, from_checkpoint=True)
    470 else:
    471     model = task.build_model(cfg.model)

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/tasks/language_modeling.py:191, in LanguageModelingTask.build_model(self, args, from_checkpoint)
    190 def build_model(self, args, from_checkpoint=False):
--> 191     model = super().build_model(args, from_checkpoint)
    192     for target in self.targets:
    193         if target not in model.supported_targets:

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/tasks/fairseq_task.py:671, in LegacyFairseqTask.build_model(self, args, from_checkpoint)
    659 """
    660 Build the :class:`~fairseq.models.BaseFairseqModel` instance for this
    661 task.
   (...)
    667     a :class:`~fairseq.models.BaseFairseqModel` instance
    668 """
    669 from fairseq import models, quantization_utils
--> 671 model = models.build_model(args, self, from_checkpoint)
    672 model = quantization_utils.quantize_model_scalar(model, args)
    673 return model

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/models/__init__.py:106, in build_model(cfg, task, from_checkpoint)
     98             ARCH_CONFIG_REGISTRY[model_type](cfg)
    100 assert model is not None, (
    101     f"Could not infer model type from {cfg}. "
    102     "Available models: {}".format(MODEL_DATACLASS_REGISTRY.keys())
    103     + f" Requested model type: {model_type}"
    104 )
--> 106 return model.build_model(cfg, task)

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/models/transformer_lm.py:300, in TransformerLanguageModel.build_model(cls, args, task)
    289     embed_tokens = AdaptiveInput(
    290         len(task.source_dictionary),
    291         task.source_dictionary.pad(),
   (...)
    297         args.quant_noise_pq_block_size,
    298     )
    299 else:
--> 300     embed_tokens = cls.build_embedding(
    301         args, task.source_dictionary, args.decoder_input_dim
    302     )
    304 if args.tie_adaptive_weights:
    305     assert args.adaptive_input

File /anaconda/envs/bioGPT/lib/python3.10/site-packages/fairseq/models/transformer_lm.py:321, in TransformerLanguageModel.build_embedding(cls, args, dictionary, embed_dim, path)
    319 @classmethod
    320 def build_embedding(cls, args, dictionary, embed_dim, path=None):
--> 321     embed_tokens = Embedding(len(dictionary), embed_dim, dictionary.pad())
    322     return embed_tokens

TypeError: object of type 'NoneType' has no len()
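
The closing TypeError means fairseq built the task without a dictionary, which happens when no dict.txt is found under the path passed as the third argument. A minimal sketch of one way to avoid this, assuming the cloned BioGPT repo ships data/dict.txt and data/bpecodes (the paths below are illustrative, not fixed):

import os
from fairseq.models.transformer_lm import TransformerLanguageModel

repo_root = "/path/to/BioGPT"  # hypothetical clone location; adjust to yours
data_dir = os.path.join(repo_root, "data")

# The TypeError above fires when this file is absent, because the task is
# then built with a None dictionary.
assert os.path.isfile(os.path.join(data_dir, "dict.txt")), "dict.txt not found"

m = TransformerLanguageModel.from_pretrained(
        os.path.join(repo_root, "checkpoints/Pre-trained-BioGPT"),
        "checkpoint.pt",
        data_dir,
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes=os.path.join(data_dir, "bpecodes"),
        min_len=100,
        max_len_b=1024)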

Missing hardware requirements per model?

Hello, I'm checking whether I can run these models on my PC, but I would like to know the system requirements of each model, such as the amount of GPU RAM needed to load it.

Thank you.
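
The repository does not list per-model memory figures, but a rough lower bound can be computed from parameter counts. A back-of-the-envelope sketch, assuming the counts reported in the BioGPT paper (about 347M parameters for BioGPT and 1.5B for BioGPT-Large):

def fp32_weight_gib(n_params: float) -> float:
    # 4 bytes per fp32 parameter; excludes activations, the generation cache,
    # and optimizer state (fine-tuning roughly triples or quadruples this)
    return n_params * 4 / 2**30

for name, n in [("BioGPT", 347e6), ("BioGPT-Large", 1.5e9)]:
    print(f"{name}: ~{fp32_weight_gib(n):.1f} GiB of GPU RAM for weights alone")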

load and finetune

Hi, I am finding it difficult to use the fairseq framework. Do you know how to load the models directly with PyTorch and fine-tune them?
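
One way to sidestep fairseq is the Hugging Face checkpoints listed in the table above. A minimal sketch, assuming a transformers version that includes the BioGPT classes and the microsoft/biogpt Hub id:

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, num_beams=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))

From there the model is a plain torch.nn.Module, so fine-tuning follows the usual PyTorch loop: model(**batch, labels=batch["input_ids"]).loss gives a causal-LM loss to backpropagate with any optimizer.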

Cannot run example code

Hello, I am trying to evaluate the models provided here, but the dependency installation instructions are slightly unclear. After running through them in the way I assume was meant (cloning fairseq and Moses into the BioGPT directory, rather than naively following the instructions and cloning Moses into the fairseq directory, as the missing cd in the instructions implies), I get the following error:

Traceback (most recent call last):
  File "/opt/workspace/BioGPT/test.py", line 2, in <module>
    from fairseq.models.transformer_lm import TransformerLanguageModel
  File "/usr/local/lib/python3.9/site-packages/fairseq/__init__.py", line 32, in <module>
    import fairseq.criterions  # noqa
  File "/usr/local/lib/python3.9/site-packages/fairseq/criterions/__init__.py", line 36, in <module>
    importlib.import_module("fairseq.criterions." + file_name)
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.9/site-packages/fairseq/criterions/label_smoothed_cross_entropy_latency_augmented.py", line 6, in <module>
    from examples.simultaneous_translation.utils.latency import LatencyTraining
ModuleNotFoundError: No module named 'examples.simultaneous_translation'

I would be happy to contribute better instructions and resolve the dependencies if someone can provide the output of pip freeze from an environment that they know can successfully run the examples in the readme of this directory; otherwise, any advice is greatly appreciated. Thanks!
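
A hedged guess at the cause: fairseq's label_smoothed_cross_entropy_latency_augmented criterion imports examples.simultaneous_translation, which lives in the fairseq source tree rather than in the installed package, so the import fails when an unrelated examples/ directory (such as BioGPT's own) shadows it on sys.path. One workaround sketch, assuming the fairseq clone sits at the hypothetical path shown:

import sys

# Put the fairseq checkout first on sys.path so that
# `examples.simultaneous_translation` resolves to fairseq's copy
# rather than to BioGPT's unrelated examples/ folder.
sys.path.insert(0, "/opt/workspace/BioGPT/fairseq")  # hypothetical clone path

from fairseq.models.transformer_lm import TransformerLanguageModel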

Error when running first example

I followed the installation instructions but failed to run the first example.
I would be glad to contribute a script that installs all the requirements with one command once I can run the example successfully.

TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 m = TransformerLanguageModel.from_pretrained(
      2         "checkpoints/Pre-trained-BioGPT",
      3         "checkpoint.pt",
      4         "data",
      5         tokenizer='moses',
      6         bpe='fastbpe',
      7         bpe_codes="data/bpecodes",
      8         min_len=100,
      9         max_len_b=1024)

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/models/fairseq_model.py:267, in BaseFairseqModel.from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    244 """
    245 Load a :class:`~fairseq.models.FairseqModel` from a pre-trained model
    246 file. Downloads and caches the pre-trained model file if needed.
   (...)
    263         model archive path.
    264 """
    265 from fairseq import hub_utils
--> 267 x = hub_utils.from_pretrained(
    268     model_name_or_path,
    269     checkpoint_file,
    270     data_name_or_path,
    271     archive_map=cls.hub_models(),
    272     **kwargs,
    273 )
    274 logger.info(x["args"])
    275 return hub_utils.GeneratorHubInterface(x["args"], x["task"], x["models"])

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/hub_utils.py:73, in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     70 if "user_dir" in kwargs:
     71     utils.import_user_module(argparse.Namespace(user_dir=kwargs["user_dir"]))
---> 73 models, args, task = checkpoint_utils.load_model_ensemble_and_task(
     74     [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
     75     arg_overrides=kwargs,
     76 )
     78 return {
     79     "args": args,
     80     "task": task,
     81     "models": models,
     82 }

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/checkpoint_utils.py:469, in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards, state)
    467 argspec = inspect.getfullargspec(task.build_model)
    468 if "from_checkpoint" in argspec.args:
--> 469     model = task.build_model(cfg.model, from_checkpoint=True)
    470 else:
    471     model = task.build_model(cfg.model)

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/tasks/language_modeling.py:191, in LanguageModelingTask.build_model(self, args, from_checkpoint)
    190 def build_model(self, args, from_checkpoint=False):
--> 191     model = super().build_model(args, from_checkpoint)
    192     for target in self.targets:
    193         if target not in model.supported_targets:

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/tasks/fairseq_task.py:671, in LegacyFairseqTask.build_model(self, args, from_checkpoint)
    659 """
    660 Build the :class:`~fairseq.models.BaseFairseqModel` instance for this
    661 task.
   (...)
    667     a :class:`~fairseq.models.BaseFairseqModel` instance
    668 """
    669 from fairseq import models, quantization_utils
--> 671 model = models.build_model(args, self, from_checkpoint)
    672 model = quantization_utils.quantize_model_scalar(model, args)
    673 return model

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/models/__init__.py:106, in build_model(cfg, task, from_checkpoint)
     98             ARCH_CONFIG_REGISTRY[model_type](cfg)
    100 assert model is not None, (
    101     f"Could not infer model type from {cfg}. "
    102     "Available models: {}".format(MODEL_DATACLASS_REGISTRY.keys())
    103     + f" Requested model type: {model_type}"
    104 )
--> 106 return model.build_model(cfg, task)

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/models/transformer_lm.py:300, in TransformerLanguageModel.build_model(cls, args, task)
    289     embed_tokens = AdaptiveInput(
    290         len(task.source_dictionary),
    291         task.source_dictionary.pad(),
   (...)
    297         args.quant_noise_pq_block_size,
    298     )
    299 else:
--> 300     embed_tokens = cls.build_embedding(
    301         args, task.source_dictionary, args.decoder_input_dim
    302     )
    304 if args.tie_adaptive_weights:
    305     assert args.adaptive_input

File ~/venvs/biogpt/lib/python3.10/site-packages/fairseq/models/transformer_lm.py:321, in TransformerLanguageModel.build_embedding(cls, args, dictionary, embed_dim, path)
    319 @classmethod
    320 def build_embedding(cls, args, dictionary, embed_dim, path=None):
--> 321     embed_tokens = Embedding(len(dictionary), embed_dim, dictionary.pad())
    322     return embed_tokens

TypeError: object of type 'NoneType' has no len()

This is my installation script at the moment:

export ve_name='biogpt'
export py_version=3.10
curl bit.ly/cfgvelinux -L | bash
. activate_ve $ve_name
ve_data_path=$HOME/venvs/$ve_name/data
ve_code_path=$HOME/venvs/$ve_name/code
mkdir $ve_code_path
mkdir $ve_data_path


cd $ve_code_path
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace

cd $ve_code_path
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=$ve_code_path/mosesdecoder

cd $ve_code_path
git clone https://github.com/glample/fastBPE.git
export FASTBPE=$ve_code_path/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

pip install sacremoses
pip install scikit-learn
pip install torch==1.12.0

mkdir $ve_data_path/checkpoints
cd ~/venvs/biogpt/data/checkpoints
wget https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/Pre-trained-BioGPT.tgz
tar -zxvf Pre-trained-BioGPT.tgz

Traceback (most recent call last):

Hello, when I execute the following code I get the following error (Windows 11):

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "checkpoints",
        "checkpoint_avg.pt",
        "data",
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)

yields:

2023-02-22 19:39:41 | INFO | fairseq.file_utils | loading archive file checkpoints
2023-02-22 19:39:41 | INFO | fairseq.file_utils | loading archive file data
Traceback (most recent call last):
  File "C:\Users\Gilgamesh\Documents\Advanced A.I\Bio\code\BioGPT\bio.py", line 3, in <module>
    m = TransformerLanguageModel.from_pretrained(
  File "C:\Python310\lib\site-packages\fairseq\models\fairseq_model.py", line 267, in from_pretrained
    x = hub_utils.from_pretrained(
  File "C:\Python310\lib\site-packages\fairseq\hub_utils.py", line 73, in from_pretrained
    models, args, task = checkpoint_utils.load_model_ensemble_and_task(
  File "C:\Python310\lib\site-packages\fairseq\checkpoint_utils.py", line 432, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "C:\Python310\lib\site-packages\fairseq\tasks\__init__.py", line 42, in setup_task
    assert (
AssertionError: Could not infer task type from {'_name': 'language_modeling_prompt', 'data': 'data', 'sample_break_mode': 'none', 'tokens_per_sample': 2048, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': 2048, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'source_lang': None, 'target_lang': None, 'max_source_positions': 1900, 'manual_prompt': None, 'learned_prompt': 9, 'learned_prompt_pattern': 'learned', 'prefix': False, 'sep_token': ''}. Available argparse tasks: dict_keys(['audio_pretraining', 'audio_finetuning', 'cross_lingual_lm', 'denoising', 'speech_to_text', 'text_to_speech', 'frm_text_to_speech', 'hubert_pretraining', 'language_modeling', 'legacy_masked_lm', 'masked_lm', 'multilingual_denoising', 'multilingual_language_modeling', 'multilingual_masked_lm', 'speech_unit_modeling', 'translation', 'multilingual_translation', 'online_backtranslation', 'semisupervised_translation', 'sentence_prediction', 'sentence_prediction_adapters', 'sentence_ranking', 'simul_speech_to_text', 'simul_text_to_text', 'speech_to_speech', 'translation_from_pretrained_bart', 'translation_from_pretrained_xlm', 'translation_lev', 'translation_multi_simple_epoch', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['audio_pretraining', 'audio_finetuning', 'hubert_pretraining', 'language_modeling', 'masked_lm', 'multilingual_language_modeling', 'speech_unit_modeling', 'translation', 'sentence_prediction', 'sentence_prediction_adapters', 'simul_text_to_text', 'translation_from_pretrained_xlm', 'translation_lev', 'dummy_lm', 'dummy_masked_lm'])

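A hedged reading of this error: checkpoint_avg.pt is a fine-tuned checkpoint whose task, language_modeling_prompt, is registered by BioGPT's own src module rather than by stock fairseq, so the plain TransformerLanguageModel cannot rebuild it. A sketch of the pattern the readme uses for fine-tuned models, assuming the script runs from the BioGPT repo root so that src is importable:

import torch
# Importing BioGPT's module registers the language_modeling_prompt task
# with fairseq before the checkpoint is loaded.
from src.transformer_lm_prompt import TransformerLanguageModelPrompt

m = TransformerLanguageModelPrompt.from_pretrained(
        "checkpoints",        # paths as in the snippet above; the readme's
        "checkpoint_avg.pt",  # fine-tuned examples point the third argument
        "data",               # at the task's data-bin directory instead
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)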

Why did you decide to use GPT instead of BERT for extractive Q/A?

It seems to me that nearly all extractive Q/A models are based on encoder or encoder-decoder networks; was there any particular reason to use a generative decoder-only network for this task instead?

Was it because pre-training GPT on biomedical text performs better in general than pre-training networks such as BERT, i.e., because GPT yields better representations for fine-tuning on a more specific downstream task?

Cannot run interactive.py in text_generation dir.

I ran the interactive.py file in the text_generation dir and got the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/********/BioGPT/data/PubMed/data-bin/dict.txt'
The corresponding directory indeed does not appear to exist.

Instead, changing the path referenced in interactive.py as shown below lets it run without problems:
parser.add_argument("--data_dir", type=str, default="../../data/PubMedQA/raw") #try to fix

Could you please check this error?
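
For reference, a hedged sketch of how such a data-bin directory is normally produced with the stock fairseq CLI; the input file names here are illustrative, and the example's real preprocessing script may differ:

fairseq-preprocess \
    --only-source \
    --trainpref corpus.train.bpe \
    --validpref corpus.valid.bpe \
    --srcdict data/dict.txt \
    --destdir data/PubMed/data-bin \
    --workers 8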

first example fails

Why might this be happening?

Red Hat, torch 1.12.1+cu102, Python 3.9.12

MRE:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "checkpoints/Pre-trained-BioGPT", 
        "checkpoint.pt", 
        "data",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)

yields:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/models/fairseq_model.py", line 267, in from_pretrained
    x = hub_utils.from_pretrained(
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/hub_utils.py", line 73, in from_pretrained
    models, args, task = checkpoint_utils.load_model_ensemble_and_task(
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/checkpoint_utils.py", line 378, in load_model_ensemble_and_task
    model = task.build_model(cfg.model)
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/tasks/language_modeling.py", line 180, in build_model
    model = super().build_model(args)
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/tasks/fairseq_task.py", line 633, in build_model
    model = models.build_model(args, self)
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/models/__init__.py", line 96, in build_model
    return model.build_model(cfg, task)
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/models/transformer_lm.py", line 264, in build_model
    embed_tokens = cls.build_embedding(
  File "/fs/clip-scratch/sschulho/fairseq/fairseq/models/transformer_lm.py", line 285, in build_embedding
    embed_tokens = Embedding(len(dictionary), embed_dim, dictionary.pad())
TypeError: object of type 'NoneType' has no len()
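
This looks like the same failure mode as the earlier reports: the dictionary comes back as None when dict.txt cannot be found under the path given as the third argument. A quick check, assuming the file layout the readme example references:

import os

# Both files are what the readme example points at; adjust paths to your layout.
print(os.path.isfile("data/dict.txt"))   # False here would explain len(None)
print(os.path.isfile("data/bpecodes"))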

checkpoint downloads often break

Congratulations on your work. Could you please split the checkpoints into several smaller packages and provide separate download links for them? I wanted to experiment with your checkpoints, but the download often broke when I used wget.

Checkpoint download issue

The pre-trained and fine-tuned checkpoint downloads are slow and often break. Could you please look into this and fix it? I am trying to reproduce the results.
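
Until the hosting improves, resuming interrupted transfers client-side may help. A small sketch (the placeholder stands for any of the checkpoint links above):

# -c resumes a partial download instead of restarting it;
# --tries=0 with --retry-connrefused keeps retrying flaky connections
wget -c --tries=0 --retry-connrefused "<checkpoint-url>" -O Pre-trained-BioGPT.tgz
tar -zxvf Pre-trained-BioGPT.tgz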

Problem importing src

Running the "KD-DTI" example results in the error "src module not found"; the "BioGPT" directory needs to be added to the path...
