
specter's Introduction


SPECTER: Document-level Representation Learning using Citation-informed Transformers

SPECTER | Pretrained models | Training your own model | SciDocs | Public API | Paper | Citing

This repository contains the code, links to pretrained models, instructions for using SPECTER, and a link to the SciDocs evaluation framework.

***** New Jan 2021: HuggingFace models *****

Specter is now accessible through HuggingFace's transformers library.

Thanks to @zhipenghoustat for providing the Huggingface training scripts and the checkpoint.

See below:

How to use the pretrained model

1- Through Huggingface Transformers Library

Requirement: pip install --upgrade transformers==4.2

from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
model = AutoModel.from_pretrained('allenai/specter')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)
# take the first token ([CLS]) of each sequence as the embedding
embeddings = result.last_hidden_state[:, 0, :]
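The resulting embeddings tensor holds one 768-dimensional vector per input paper. As a quick sanity check, you can compare two papers with cosine similarity (the similarity step below is an illustration, not part of the original snippet):

import torch.nn.functional as F

# cosine similarity between the two paper embeddings computed above
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.4f}")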

A sample script to run the model in batch mode on a dataset of papers is provided under scripts/embed_papers_hf.py

How to use:

CUDA_VISIBLE_DEVICES=0 python scripts/embed_papers_hf.py \
--data-path path/to/paper-metadata.json \
--output path/to/write/output.json \
--batch-size 8

** Note that the Huggingface model yields slightly higher average results than those reported in the paper. To reproduce our exact numbers, use our original implementation (see "How to reproduce our results" below).

Expected SciDocs results from the huggingface model:

mag-f1 mesh-f1 co-view-map co-view-ndcg co-read-map co-read-ndcg cite-map cite-ndcg cocite-map cocite-ndcg recomm-ndcg recomm-P@1 Avg
79.4 87.7 83.4 91.4 85.1 92.7 92.0 96.6 88.0 94.7 54.6 20.9 80.5

2- Through this repo

Errata for paper: In the paper we mentioned that we take the representation corresponding to the [CLS] token as the aggregate representation of the sequence. However, in the AllenNLP v0.9 implementation of the BERT embedder, each token representation is a scalar mix of all layer representations, and average pooling is used to get the aggregate representation of the input in a single vector. Therefore, the original SPECTER model uses scalar mixing of layers and average pooling to embed a given document, as opposed to taking the final-layer representation of the [CLS] token. The Huggingface model above uses the final-layer representation of [CLS]. In practice this doesn't impact the results and both models perform comparably.

1 - Clone the repo and download the pretrained model and supporting files:


Download the tar file at: download [833 MiB]
The compressed archive includes a model.tar.gz file (the pretrained model) as well as supporting files inside a data/ directory.

Here are the commands to run:

git clone git@github.com:allenai/specter.git

cd specter

wget https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/specter/archive.tar.gz

tar -xzvf archive.tar.gz 

2 - Install the environment:

conda create --name specter python=3.7 setuptools  

conda activate specter  

# if you don't have gpus, remove cudatoolkit argument
conda install pytorch cudatoolkit=10.1 -c pytorch   

pip install -r requirements.txt  

python setup.py install

3 - Embed papers or documents using SPECTER

Specter requires two main input files to embed documents: a text file with the ids of the documents you want to embed, and a JSON metadata file containing the title and abstract for each document. Sample files are provided in the data/ directory to get you started. The input data format is as follows:

metadata.json format:

{
    'doc_id': {'title': 'representation learning of scientific documents',
               'abstract': 'we propose a new model for representing abstracts'},
}
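For example, the two input files could be generated from Python as follows (a sketch; the file names are placeholders, and per an issue further down this page a paper_id field may also be required in practice):

import json

papers = {
    "doc_id": {"title": "representation learning of scientific documents",
               "abstract": "we propose a new model for representing abstracts",
               "paper_id": "doc_id"},
}

# metadata file: doc_id -> {title, abstract, ...}
with open("data/my-metadata.json", "w") as f:
    json.dump(papers, f)

# ids file: one document id per line
with open("data/my.ids", "w") as f:
    f.write("\n".join(papers))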

To embed your data with SPECTER, use the following command:

python scripts/embed.py \
--ids data/sample.ids --metadata data/sample-metadata.json \
--model ./model.tar.gz \
--output-file output.jsonl \
--vocab-dir data/vocab/ \
--batch-size 16 \
--cuda-device -1

Change --cuda-device to 0 (or your GPU of choice) for faster inference.
The model runs inference on the provided input and writes the output to the file given by --output-file (output.jsonl in the above example).
This is a JSON-lines file where each line is a key-value pair consisting of the id of the embedded document and its SPECTER representation.
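A sketch for loading the output back into Python (the field names paper_id and embedding are assumptions; check one line of your output to confirm them):

import json

import numpy as np

embeddings = {}
with open("output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        embeddings[record["paper_id"]] = np.asarray(record["embedding"])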

Public API

A collection of public APIs for retrieving pre-computed Specter embeddings for papers in the Semantic Scholar corpus is available at: https://www.semanticscholar.org/product/api, and an API for embedding a given paper title and abstract using Specter is available at: allenai/paper-embedding-public-apis

NOTE: Embeddings that are retrieved from the public APIs will not match the embeddings that can be generated by running the model on this repo. They are produced by two different versions of the SPECTER model. Although embeddings from the two different sets cannot be mixed and matched within the same task, the sets perform similarly on downstream tasks.
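For illustration, the paper-embedding endpoint can be called like this (a sketch; the URL matches the one used in an issue further down this page, and allenai/paper-embedding-public-apis remains the authoritative reference):

import requests

papers = [{"paper_id": "A",
           "title": "BERT",
           "abstract": "We introduce a new language representation model called BERT"}]

response = requests.post(
    "https://model-apis.semanticscholar.org/specter/v1/invoke", json=papers)
embedding = response.json()["preds"][0]["embedding"]  # list of floats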

How to reproduce our results

To reproduce our results, please refer to the SciDocs repo, where we provide the embeddings for the evaluation tasks and instructions on how to run the benchmark to get the results.

Advanced: Training your own model

First follow steps 1 and 2 from the Pretrained models section to download the supporting files and install the environment.

Next, you need to create pickled training instances using the specter/data_utils/create_training_files.py script, and then use the resulting files as input to the scripts/run-exp-simple.sh script.

You will need the following files:

  • data.json containing the document ids and their relationships.
  • metadata.json containing a mapping of document ids to textual fields (e.g., title, abstract).
  • train.txt, val.txt, test.txt containing the document ids of the train/val/test sets (one doc id per line).

The data.json file should have the following structure (a nested dict):

{"docid1" : {  "docid11": {"count": 1}, 
               "docid12": {"count": 5},
               "docid13": {"count": 1}, ....
            }
"docid2":   {  "docid21": {"count": 1}, ....
....}

Here the docids are ids of documents in your data, and count is a measure of the importance of the relationship between two documents. In our dataset we used citations as the indicator of a relationship, where count=5 means a direct citation and count=1 refers to a citation of a citation.
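As a sketch, a data.json of this shape could be assembled from a per-document citation map like this (the input variable cites is illustrative):

import json

# cites: doc_id -> set of directly cited doc_ids (illustrative input)
cites = {"docid1": {"docid11", "docid12"},
         "docid11": {"docid13"}}

data = {}
for doc, refs in cites.items():
    data[doc] = {ref: {"count": 5} for ref in refs}   # direct citations
    for ref in refs:                                  # citations of citations
        for ref2 in cites.get(ref, ()):
            if ref2 != doc and ref2 not in data[doc]:
                data[doc][ref2] = {"count": 1}

with open("data/training/data.json", "w") as f:
    json.dump(data, f)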

The create_training_files.py script processes this structure with a triplet sampler that selects both easy and hard negatives (as described in the paper) according to the count value in the above structure. For example, papers with count=5 are considered positive candidates, papers with count=1 are considered hard negatives, and other papers that are not cited are easy negatives. You can control the number of hard negatives by setting the --ratio_hard_negatives argument of the script.
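In outline, the sampling described above works roughly as follows (an illustrative sketch, not the repo's actual implementation):

import random

def sample_triplets(query, relations, all_ids, n_hard=2, n_easy=3):
    """relations is the data.json entry for query, e.g. {"docid11": {"count": 5}, ...}"""
    positives = [d for d, v in relations.items() if v["count"] == 5]
    hard_negatives = [d for d, v in relations.items() if v["count"] == 1]
    easy_pool = [d for d in all_ids if d != query and d not in relations]

    triplets = []
    for pos in positives:
        for neg in random.sample(hard_negatives, min(n_hard, len(hard_negatives))):
            triplets.append((query, pos, neg))   # hard negative triplet
        for neg in random.sample(easy_pool, min(n_easy, len(easy_pool))):
            triplets.append((query, pos, neg))   # easy negative triplet
    return triplets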

  • Create preprocessed training files:
python specter/data_utils/create_training_files.py \
--data-dir data/training \
--metadata data/training/metadata.json \
--outdir data/preprocessed/

After preprocessing the data you will have three pickled files containing training instances, as well as a metrics.json showing the number of examples in each set. Use the following script to start training the model:

  • Run the training script
./scripts/run-exp-simple.sh -c experiment_configs/simple.jsonnet \
-s model-output/ --num-epochs 2 --batch-size 4 \
--train-path data/preprocessed/data-train.p --dev-path data/preprocessed/data-val.p \
--num-train-instances 55 --cuda-device -1

In this example, the model's checkpoint and logs will be stored in model-output/.
Note that you need to set the correct --num-train-instances for your dataset. This number is stored in the metrics.json file output from the preprocessing step. You can monitor the training progress using tensorboard:
tensorboard --logdir model-output/ --bind_all
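As for --num-train-instances, the value can be read straight from the preprocessing metrics (a sketch; the file name data-metrics.json and the key train follow the output shown in an issue further down this page):

import json

# number of training instances produced by the preprocessing step
with open("data/preprocessed/data-metrics.json") as f:
    metrics = json.load(f)
print(metrics["train"])  # pass this as --num-train-instances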

SciDocs benchmark

The SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level representations.

Link to SciDocs: https://github.com/allenai/scidocs

Citation

Please cite the SPECTER paper as:

@inproceedings{specter2020cohan,
  title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
  author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
  booktitle={ACL},
  year={2020}
}


specter's Issues

How to correctly define "--included-text-fields" when embedding data

Hi there,
I have used the pretrained model to embed my own data. In scripts/embed.py, there is an argument --included-text-fields which seems to be a user-configurable option. I would like to compare the embeddings of the title alone, title+abstract, and the full content of an article (including results, conclusion, etc.); however, it doesn't work when I override the value with something like abstract title results. Besides, I found that when title is given to --included-text-fields, the embedding differs from the same setting with abstract removed.
Is there any limitation here?

The corresponding input format:

metadata.json
{
    doc_id: { "title": "..........",
              "abstract": "..........",
              "results": "..........",
              "conclusion": "..........",
              "other_content": "..........",
              "paper_id": doc_id },
     ...
}

How to set num_train_instances when using your pickled training data

Hi~, I am trying to train from scratch using the pickled data in #2.

But the training process gets stuck at 15143it with batch size 32. I guess I used the wrong value for num_train_instances (680000).

The error shows as follows:
./scripts/run-exp-simple.sh: line 152: 18438 Killed python -m allennlp.run train $config_file --include-package specter -s $serialization_dir

How to change the sequence length?

Hi, I'd like to change the max sequence length in order to embed larger documents.

Is there an extra argument I can give to embed.py to do this?

I notice that embed_papers_hf.py has a max_length parameter, but to use that script I need some way to specify that I don't have a GPU.

Would appreciate any help with either of these scripts. :)

Matching articles from SPECTER's dataset with S2ORC IDs

Hi,

I want to match articles used in SPECTER's training and validation sets with the articles from S2ORC.
The problem is that article IDs in SPECTER's training and validation sets are not used in the S2ORC dataset, i.e., S2ORC uses different paper IDs compared to SPECTER.

For example, this article can be found in SPECTER's validation set and its ID there is: 793efec2096f6511c45430ff5f2f08a362dcf3eb.
Corpus ID of this paper is 11967120 and this Corpus ID is used in S2ORC as paper_id. (I've found this Corpus ID on the Semantic Scholar's webpage linked above)

Is there any easy way to obtain these Corpus IDs for articles from SPECTER's dataset?
I'm aware I could use Semantic Scholar's API for this, but I think that would be very time-consuming (SPECTER's dataset contains over 165k unique article IDs if I calculated correctly).
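The per-paper lookup I would like to avoid looks roughly like this (a sketch; the endpoint and field names follow the public Semantic Scholar Graph API and should be verified against its docs):

import requests

sha = "793efec2096f6511c45430ff5f2f08a362dcf3eb"
resp = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/{sha}",
    params={"fields": "externalIds"})
corpus_id = resp.json()["externalIds"]["CorpusId"]  # e.g. 11967120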

Thanks!

Huggingface model produces different embeddings

Hi @armancohan

the available SciDocs embeddings and the embeddings returned from the API do not match the embeddings derived from the Huggingface model. Are the model weights the same? Or any idea why this is the case?

I'm using the following code:

tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
model = AutoModel.from_pretrained('allenai/specter')

papers = [{'paper_id': 'A', 'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'}]

# concatenate title and abstract
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
model.eval()
with torch.no_grad():
    result = model(**inputs)
# take the first token in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]

pred_embed = embeddings.numpy()
# => array([[-8.39614451e-01,  1.14658070e+00, -5.10430574e-01, ...

# Embeddings via API
response = requests.post("https://model-apis.semanticscholar.org/specter/v1/invoke", json=papers)

true_embed = np.array([response.json()['preds'][0]['embedding']])
# =>  array([[-3.30647826e-02, -4.52146339e+00,  4.61816311e-01, ...

(true_embed==pred_embed).all()
# => False

Streaming data for inference

Hi! I'm trying to embed some 100M papers using SPECTER. However, there's some kind of a memory leak that makes the whole process extremely inefficient. I see that AllenNLP models support JSONL input format.

What is the simplest way to replace the ids and metadata args with a single JSONL file or stdin?

How to create the vocab files? (tokens.txt, non_padded_namespaces.txt, venue.txt)

Hi,

I'm testing the SPECTER training method with a model other than SciBERT (a non-English model on the Hugging Face hub).

To do that, I downloaded the SciBERT tar file, untarred it, and replaced its files with the other BERT model's files.

Here are the files after untarring:

data/
data/scibert_scivocab_uncased/
data/scibert_scivocab_uncased/scibert.tar.gz
data/scibert_scivocab_uncased/vocab.txt
data/vocab/
data/vocab/non_padded_namespaces.txt
data/vocab/tokens.txt
data/vocab/venue.txt

As we can see, the SciBERT tar file has a vocab folder with 3 files.

How were these files created?
How can I create them from a non-English BERT model on the Hugging Face hub?

Format specified for json file containing title and abstract is wrong.

The format specified in the README is

{
    'doc_id': {'title': 'representation learning of scientific documents',
    'abstract': 'we propose a new model for representing abstracts'},
}

However, the correct format is

{
    'doc_id': {'title': 'representation learning of scientific documents',
    'abstract': 'we propose a new model for representing abstracts',
    'paper_id': 'doc_id'},
}

query regarding information retrieval using text query

Hi! Thanks for sharing your amazing work. I had a query.
I wanted to perform a similarity match between a query sentence and a larger text (title and abstract from arXiv papers). I created embeddings for the database. However, I was unsure how to create a comparable embedding for the query text. I tried keeping the query as the title with an empty abstract, but the results were not good. Although I did notice this in the paper:

We observe that removing the abstract from the textual input and relying only on the title results in a substantial decrease in performance

Do you have any suggestions as to what I can try out to improve? Thanks :)

PS: the embeddings for the papers themselves are pretty good as I tried finding the closest ones for a couple of samples and they all seem to be very consistent

Fine-tuning SPECTER?

  1. Is there any way to fine-tune directly from SPECTER instead of training from SciBERT? It seems that the format of SPECTER's model weights differs from SciBERT's.

  2. How do I fine-tune SPECTER like SciBERT on classification tasks?

How to train my own model with other checkpoints?

Hi,

First, thank you for the excellent work.

I want to train my own model with SPECTER's objective, e.g. starting from huggingface's bert-base-uncased instead of SciBERT. If possible, what should I do?

Thanks!

Can we get calculated embeddings directly into variable?

The embed.py script calculates embeddings for the inputs in sample-metadata.json and stores them in the output file output.jsonl:

python scripts/embed.py \
--ids data/sample.ids --metadata data/sample-metadata.json \
--model ./model.tar.gz \
--output-file output.jsonl \
--vocab-dir data/vocab/ \
--batch-size 16 \
--cuda-device -1

Is there any way to call the function directly from a Python script and get the output embeddings in a variable? Something like what the web API does, but locally and offline.
I do not want to load the model for each evaluation call.
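One way to do this with the HuggingFace route from the top of this page (a minimal sketch; the model is loaded once and reused for every call):

import torch
from transformers import AutoTokenizer, AutoModel

# load once, reuse for every call
tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter").eval()

def embed(title, abstract=""):
    text = title + tokenizer.sep_token + abstract
    inputs = tokenizer([text], padding=True, truncation=True,
                       return_tensors="pt", max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[0, 0, :]  # 768-d [CLS] vector

vec = embed("representation learning of scientific documents",
            "we propose a new model for representing abstracts")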

Create preprocessed training files: metadata.json is missing ids in the train.txt, test.txt and val.txt

When I run the following -

python specter/data_utils/create_training_files.py \
--data-dir data/training \
--metadata data/training/metadata.json \
--outdir data/preprocessed/

I get done getting triplets, success rate:0.00%

and my data-metrics.json looks like -

{
  "train": 0,
  "val": 0,
  "test": 0
}

I debugged the code and found that there is a key error where self.metadata is accessed.
It looks like the ids in train.txt, val.txt and test.txt are not in the metadata.json file.

Please help and share the correct metadata.json file

Error when training: "None" is not a <Class 'allennlp.data.fields.field.Field'>

Hi guys, thanks for your hard work.

I am trying to reproduce your model by training with the original dataset using the run-exp-simple.sh:

./scripts/run-exp-simple.sh -c experiment_configs/simple.jsonnet \
-s model-output/ --num-epochs 2 --batch-size 4 \
--train-path data/preprocessed/train.pkl --dev-path data/preprocessed/val.pkl \
--num-train-instances 55 --cuda-device 0

but got the following error:

TypeError: ArrayField.empty_field: return type None is not a <class 'allennlp.data.fields.field.Field'>.

Do you have any idea what this is about?

I have ensured that the training pickle file does exist.

Thanks, cheers

Using trained model: which tokenizer?

I have trained a new model following the guideline in README.md. The model was trained on my own dataset of scientific articles. Now, in order to use the trained model, I need a tokenizer. Which one should I use? Do I need to load the vocabulary from disk in case the vocabulary used during training differs from the pretrained ones?

How to deal with this bug?

When I run this code you provide:

 python scripts/embed.py \
--ids data/sample.ids --metadata data/sample-metadata.json \
--model ./model.tar.gz \
--output-file output.jsonl \
--vocab-dir data/vocab/ \
--batch-size 16 \
--cuda-device -1

[error screenshot omitted]
I have already configured the environment. When I dug into this bug, I found that the dict passed to json.loads(evaluate_snippet("", serialized_overrides, ext_vars=ext_vars)) is "overrides".
How should I deal with this?
Please help me, thank you!

Confused about the Huggingface Usage (using ' ' instead of '[SEP]' for concatenating)

Hi, thanks very much for your code, very interesting work. I am a little bit confused about some points in your code.

In your pytorch training file, it is clearly stated that you concatenate title and abstract and separate them with [SEP]:

title_field = instance.fields.get(f'{paper_type}_title')
abst_field = instance.fields.get(f'{paper_type}_abstract')
if title_field:
    tokens.extend(title_field.tokens)
if tokens:
    tokens.extend([Token('[SEP]')])
if abst_field:
    tokens.extend(abst_field.tokens)

So I expected each input sequence to follow this strategy. But in your example usage of the Huggingface toolkit, you use a space ' ' to concatenate title and abstract:

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract
title_abs = [d['title'] + ' ' + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)
# take the first token in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]

This confuses me. Why did you change from '[SEP]' to ' ' at inference time, and how is that compatible with your training scenario (since you use [SEP] all the time during training)?

I tried the tokenizer (AutoTokenizer.from_pretrained('allenai/specter')) to check the tokenized results on your given example:
{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'} ->

'BERT We introduce a new language representation model called BERT' ->

['[CLS]', 'ber', '##t', 'we', 'introduce', 'a', 'new', 'language', 'representation', 'model', 'called', 'ber', '##t', '[SEP]', '[PAD]', '[PAD]']

No [SEP] seems to have been inserted between title and abstract.
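For comparison, concatenating with tokenizer.sep_token (as the usage example at the top of this page now does) does keep a [SEP] between the two fields; a quick check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
text = 'BERT' + tokenizer.sep_token + 'We introduce a new language representation model called BERT'
print(tokenizer.tokenize(text))
# ['ber', '##t', '[SEP]', 'we', 'introduce', ...]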

Many thanks for any reply on this

RoBERTa instead of BERT: what changes in SPECTER scripts?

Hi @armancohan.

The SPECTER scripts use BERT.

Could you describe the changes needed to use a RoBERTa model (for example: there is no vocab.txt file for a RoBERTa model, the tokenizer is not WordPiece, etc.)? Thank you.

List of scripts:

Regarding the script create_training_files.py, here are the lines where something presumably needs to be changed:

  • 17
    from allennlp.data.token_indexers import SingleIdTokenIndexer, PretrainedBertIndexer
  • 19
    from allennlp.data.tokenizers.word_splitter import WordSplitter, SimpleWordSplitter, BertBasicWordSplitter
  • 47
    "do_lowercase": "true",
  • 48
    "pretrained_model": "data/scivocab_scivocab_uncased/vocab.txt",
  • 121
    _tokenizer = WordTokenizer(word_splitter=BertBasicWordSplitter(do_lower_case=bert_params["do_lowercase"]))
  • 122
    _token_indexers = {"bert": PretrainedBertIndexer.from_params(Params(bert_params))}
  • 417
def main(data_files, train_ids, val_ids, test_ids, metadata_file, outdir, n_jobs=1, njobs_raw=1,
         margin_fraction=0.5, ratio_hard_negatives=0.3, samples_per_query=5, comment='', bert_vocab='',
         concat_title_abstract=False, included_text_fields='title abstract'):
  • 502
    ap.add_argument('--bert_vocab', help='path to bert vocab', default='data/scibert_scivocab_uncased/vocab.txt')
  • 519
main([data_file], [train_ids], [val_ids], [test_ids], metadata_file, args.outdir, args.njobs, args.njobs_raw,
         margin_fraction=args.margin_fraction, ratio_hard_negatives=args.ratio_hard_negatives,
         samples_per_query=args.samples_per_query, comment=args.comment, bert_vocab=args.bert_vocab,
         concat_title_abstract=args.concat_title_abstract, included_text_fields=args.included_text_fields
         )

As for the script pytorch_lightning_training_script/train.py, since it uses the AutoTokenizer and AutoModel classes, is there nothing to change?

Help required for training data preparation

I want to train SPECTER on my data.

I checked the following script for generating pickled training files: specter/data_utils/create_training_files.py

As per my understanding, it requires the following inputs:

  • data.json (line 489)
  • metadata.json
  • train.csv, val.csv, test.csv

Please help me with the schema required for the data.json and metadata.json files, the columns in train.csv, val.csv, and test.csv, and instructions about any other required files or pre-processing steps.

where can I get the dataset used in the paper?

Where can I get the dataset used in the paper, including the paper metadata and user activity? Without them, how can I reproduce the experimental results in the paper? Thanks for your response.

ERROR: Allennlp params.py ValueError: Cannot convert variable to bool: all

Running the example from the repo, why does this error appear? How do I solve it?

File "specter/predict_command.py", line 225, in
run()
File "specter/predict_command.py", line 221, in run
main(prog="allennlp")
File "specter/predict_command.py", line 215, in main
args.func(args)
File "specter/predict_command.py", line 159, in predict
predictor = get_predictor(args)
File "specter/predict_command.py", line 148, in get_predictor
overrides=args.overrides)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/models/archi val.py", line 230, in load_archive
cuda_device=cuda_device)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/models/model .py", line 327, in load
return cls.by_name(model_type).load(config, serialization_dir, weights_file , cuda_device)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/models/model .py", line 265, in load
model = Model.from_params(vocab=vocab, params=model_params)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from
params.py", line 365, in from_params
return subclass.from_params(params=params, **extras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from
params.py", line 386, in from_params
kwargs = create_kwargs(cls, params, **extras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from
params.py", line 133, in create_kwargs
kwargs[name] = construct_arg(cls, name, annotation, param.default, params, * *extras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from
params.py", line 229, in construct_arg
return annotation.from_params(params=subparams, **subextras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from
params.py", line 365, in from_params
return subclass.from_params(params=params, **extras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/modules/text field_embedders/basic_text_field_embedder.py", line 160, in from_params
for name, subparams in token_embedder_params.items()
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/modules/text field_embedders/basic_text_field_embedder.py", line 160, in
for name, subparams in token_embedder_params.items()
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from
params.py", line 365, in from_params
return subclass.from_params(params=params, **extras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from
params.py", line 386, in from_params
kwargs = create_kwargs(cls, params, **extras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from_ params.py", line 133, in create_kwargs
kwargs[name] = construct_arg(cls, name, annotation, param.default, params, * *extras)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/from_ params.py", line 243, in construct_arg
return params.pop_bool(name, default) if optional else params.pop_bool(name)
File "/home/delta/miniconda3/lib/python3.7/site-packages/allennlp/common/param s.py", line 304, in pop_bool
raise ValueError("Cannot convert variable to bool: " + value)
ValueError: Cannot convert variable to bool: all

Positive paper sampling

In your paper it says:

The positive paper is a paper that is not cited by the query paper.

Does this mean that citations are considered as unidirectional such that a paper citing the query could be a negative sample for the query?

Thanks!

Question regarding train / val / test split

Please help me with the following queries on the train / val / test dataset split:

  1. What was the basis for splitting the corpus into train, val, and test sets?

  2. In the training script provided (as below), is there an option to include a test-path so that the best model (saved based on val results) can be evaluated on the test set?

./scripts/run-exp-simple.sh -c experiment_configs/simple.jsonnet \
-s [output-dir] --num-epochs [num-epochs] --batch-size [batch-size] \
--train-path [path-to-train.pkl] --dev-path [path-to-dev.pkl] \
--cuda-device 0 --num-train-instances [num-instances]

cannot import name 'import_submodules' from 'allennlp.common.util'

I installed specter following the provided instructions but when I run the creation script with the provided demo command

python scripts/embed.py \
--ids data/sample.ids --metadata data/sample-metadata.json \
--model ./model.tar.gz \
--output-file output.jsonl \
--vocab-dir data/vocab/ \
--batch-size 16 \
--cuda-device -1

I get this error from the allennlp library. Do you have any suggestions how to fix that?

Traceback (most recent call last):
  File "specter/predict_command.py", line 20, in <module>
    from allennlp.common.util import lazy_groups_of, import_submodules
ImportError: cannot import name 'import_submodules' from 'allennlp.common.util' (/home/horsmann/miniconda3/envs/specter/lib/python3.7/site-packages/allennlp/common/util.py)

P.S.:
I am on Ubuntu

Training Dataset for SPECTER

Hi all, great job developing such a robust model.

I am trying to replicate your model for an assignment at my uni. I read in your paper that you are releasing your training dataset; however, I am still unable to find it anywhere. I also understand that SciDocs is just for evaluation purposes. If you are unable to release the dataset, can you at least share the method you used to determine the dataset you are using.

Cheers

ArrayField.empty_field problem with DatasetReader

I followed your instructions and I'm simply trying to get the predictor working with your dataset.

I see an error when initializing: from allennlp.data.dataset_readers import DatasetReader

Exception has occurred: TypeError
ArrayField.empty_field: return type `None` is not a `<class 'allennlp.data.fields.field.Field'>`.
  File "/mnt/d/github/public/specter/src/allennlp/allennlp/data/fields/array_field.py", line 50, in ArrayField
    def empty_field(self):  # pylint: disable=no-self-use
  File "/mnt/d/github/public/specter/src/allennlp/allennlp/data/fields/array_field.py", line 10, in <module>
    class ArrayField(Field[numpy.ndarray]):
  File "/mnt/d/github/public/specter/src/allennlp/allennlp/data/fields/__init__.py", line 7, in <module>
    from allennlp.data.fields.array_field import ArrayField
  File "/mnt/d/github/public/specter/src/allennlp/allennlp/data/instance.py", line 3, in <module>
    from allennlp.data.fields.field import DataArray, Field
  File "/mnt/d/github/public/specter/src/allennlp/allennlp/data/dataset_readers/dataset_reader.py", line 8, in <module>

Here's how I am using it. I am launching predict_command.py with the following arguments (see below). I even tried running embed.py with the same arguments you showed in the Readme.md and I get the same error, which is why I am investigating it this way. It doesn't even get to reading the arguments before the program crashes.

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Predict_command.py",
            "type": "python",
            "request": "launch",
            "program": "specter/predict_command.py",
            "args": [
                "archive_file=./model.tar.gz",
                "input_file=data/sample-metadata.json",
                "--output-file=output.json",
                "--batch_size=16",
                "--cuda_device=-1",
                "--predictor=specter_predictor",
            ],
            "console": "integratedTerminal"
        } ]
}

Use custom dataset without hard negatives

Hi,

I'm trying to create preprocessed training files using my custom data. My data doesn't include any hard negatives, and when I use your script create_training_files.py, errors show up saying no triplets are constructed:

2021-08-20 14:30:58,836,836 INFO [create_training_files.py:453] loading metadata: ../../data/specter/metadata.json
2021-08-20 14:30:58,907,907 INFO [create_training_files.py:457] loading data file: ../../data/specter/data.json
2021-08-20 14:30:59,040,40 INFO [create_training_files.py:466] getting instances for `data` and `train` set
2021-08-20 14:30:59,041,41 INFO [create_training_files.py:468] writing output ../../data/specter/preprocessed/data-train.p
2021-08-20 14:30:59,101,101 INFO [create_training_files.py:303] Generating triplets ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 85452/85452 [01:00<00:00, 1404.12it/s]
INFO:/home/guoao/anaconda3/envs/specter/lib/python3.7/site-packages/specter-0.0.1-py3.7.egg/specter/data_utils/triplet_sampling.py:Done generating triplets, #successful queries: 0,#skipped queries: 85452
2021-08-20 14:32:01,745,745 INFO [create_training_files.py:365] done getting triplets, success rate:0.00%,total: 0
2021-08-20 14:32:01,746,746 INFO [create_training_files.py:407] converting raw instances to allennlp instances:
0it [00:00, ?it/s]

Then I dove into the script specter/data_utils/triplet_sampling.py to use TripletGenerator and see what happens (since I can't use breakpoints in multiprocess programs). I found that since there are no hard negatives, the margin here becomes 0.0, making candidates_pos an empty list.

If I change the line to if candidates[j][1] >= margin + candidates[-1][1]:, the function works. I don't really understand the meaning of margin and I'm not sure whether changing the line will affect the generated results, so I wonder if it's safe to do so?

Thanks!

KeyError: 'authors' while trying to get article embeddings

I am trying to get SPECTER embeddings for some articles. I am using the following structure for the input metadata.json:

{
    'doc_id1': {'title': 'representation learning of scientific documents',
                'abstract': 'we propose a new model for representing abstracts'},
    'doc_id2': {'title': ' learning of scientific documents',
                'abstract': 'we propose a new model'}
}

I used the embed.py command as given in the README to get the embeddings but ran into the following error.
Please help me solve this issue.
I was able to successfully get embeddings for sample-metadata.json. In that file, additional metadata keys like 'authors', 'year', and 'cited_by' are given. But as per the documentation in the README, only 'title' and 'abstract' are required fields for each article, right?

  0%|                                                                                                                                                                                      | 0/625 [00:00<?, ?batches/s]
Traceback (most recent call last):
  File "specter/predict_command.py", line 225, in <module>
    run()
  File "specter/predict_command.py", line 221, in run
    main(prog="allennlp")
  File "specter/predict_command.py", line 215, in main
    args.func(args)
  File "specter/predict_command.py", line 172, in _predict
    manager.run()
  File "specter/predict_command.py", line 98, in run
    for model_input_json, result in zip(batch_json, self._predict_json(batch_json)):
  File "/opt/conda/envs/specter/lib/python3.7/site-packages/allennlp/commands/predict.py", line 153, in _predict_json
    results = self._predictor.predict_batch_json(batch_data)
  File "./specter/predictor.py", line 76, in predict_batch_json
    instances.append(self._dataset_reader.text_to_instance(json_dict))
  File "./specter/data.py", line 441, in text_to_instance
    source_author, source_author_position = self._get_author_field(source_paper['authors'])
KeyError: 'authors'

To get Embeddings for research papers and reviews

Can we also get representations for review text through the model? If yes, could you give some guidance on how? Also, the sample metadata file in the README has title and abstract information, but in ./data it also has other keys like cited_by, etc. Plus, where should I specify the path to the actual paper text data?

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Hi, when I run the script with the provided demo command:

python scripts/embed.py --ids data/sample.ids --metadata data/sample-metadata.json --model ./model.tar.gz --output-file output.jsonl --vocab-dir data/vocab/ --batch-size 16 --cuda-device 0

I get this error, and I don't know how to fix it. Could you please give me some advice?

Traceback (most recent call last):
  File "specter/predict_command.py", line 226, in <module>
    run()
  File "specter/predict_command.py", line 222, in run
    main(prog="allennlp")
  File "specter/predict_command.py", line 216, in main
    args.func(args)
  File "specter/predict_command.py", line 160, in _predict
    predictor = _get_predictor(args)
  File "specter/predict_command.py", line 149, in _get_predictor
    overrides=args.overrides)
  File "e:\ding-project\specter and gender\specter\src\allennlp\allennlp\models\archival.py", line 214, in load_archive
    config = Params.from_file(os.path.join(serialization_dir, CONFIG_NAME), overrides)
  File "e:\ding-project\specter and gender\specter\src\allennlp\allennlp\common\params.py", line 492, in from_file
    overrides_dict = parse_overrides(params_overrides)
  File "e:\ding-project\specter and gender\specter\src\allennlp\allennlp\common\params.py", line 168, in parse_overrides
    return unflatten(json.loads(evaluate_snippet("", serialized_overrides, ext_vars=ext_vars)))
  File "E:\ProgramData\Anaconda3\envs\specter\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "E:\ProgramData\Anaconda3\envs\specter\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "E:\ProgramData\Anaconda3\envs\specter\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Label namespace warning

When I run the sample code in the readme.md

python scripts/embed.py \
--ids data/sample.ids --metadata data/sample-metadata.json \
--model ./model.tar.gz \
--output-file output.jsonl \
--vocab-dir data/vocab/ \
--batch-size 16 \
--cuda-device -1

I get a warning

WARNING:allennlp.data.fields.label_field:Your label namespace was 'year'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for `non_padded_namespaces` parameter in Vocabulary.

I still seem to generate meaningful embeddings, though. It's consistent across Windows 10 and OS X 10.15.6.

Thanks!
