
dpr's Introduction

Dense Passage Retrieval

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the following paper:

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.

If you find this work useful, please cite the following paper:

@inproceedings{karpukhin-etal-2020-dense,
    title = "Dense Passage Retrieval for Open-Domain Question Answering",
    author = "Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.550",
    doi = "10.18653/v1/2020.emnlp-main.550",
    pages = "6769--6781",
}

If you're interested in reproducing the experimental results in the paper based on our model checkpoints (i.e., you don't want to train the encoders from scratch), you might consider using the Pyserini toolkit, which has the experiments nicely packaged and installable via pip. Their toolkit also reports higher BM25 and hybrid scores.

Features

  1. Dense retriever model based on a bi-encoder architecture.
  2. Extractive Q&A reader & ranker joint model inspired by this paper.
  3. Related data pre- and post-processing tools.
  4. Dense retriever inference-time component based on a FAISS index.

New (March 2021) release

DPR codebase is upgraded with a number of enhancements and new models. Major changes:

  1. Hydra-based configuration for all the command line tools except the data loader (to be converted soon)
  2. Pluggable data processing layer to support custom datasets
  3. New retrieval model checkpoint with better performance.

New (March 2021) retrieval model

A new bi-encoder model trained on the NQ dataset only is now provided: a new checkpoint, training data, retrieval results and Wikipedia embeddings. The new training data combines the original DPR NQ train set with a version of it in which hard negatives are mined from the DPR index itself, using the previous NQ checkpoint. A bi-encoder model is then trained from scratch on this new training data combined with our original NQ training data. This training scheme gives a nice retrieval performance boost.
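
Conceptually, the mining step takes retrieval results produced with the previous checkpoint and keeps top-ranked passages that do not contain the answer as hard negatives. Below is a minimal Python sketch of the idea (our illustration, not the exact script used to build the released data; it assumes retrieval results in the dense_retriever.py output format described later in this README):

import json

def mine_hard_negatives(retriever_results_file, max_hard_negatives=30):
    # Retrieval results follow the dense_retriever.py output format:
    # each item has "question", "answers" and a ranked list of "ctxs"
    # with a "has_answer" flag per passage.
    with open(retriever_results_file) as f:
        results = json.load(f)

    train_samples = []
    for item in results:
        positives = [ctx for ctx in item["ctxs"] if ctx["has_answer"]]
        # Top-ranked passages that do NOT contain the answer become hard negatives.
        negatives = [ctx for ctx in item["ctxs"] if not ctx["has_answer"]]
        if positives and negatives:
            train_samples.append({
                "question": item["question"],
                "answers": item["answers"],
                "positive_ctxs": positives[:1],
                "negative_ctxs": [],
                "hard_negative_ctxs": negatives[:max_hard_negatives],
            })
    return train_samples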

New vs. old top-k document retrieval accuracy on the NQ test set (3610 questions):

Top-k passages    Original DPR NQ model    New DPR model
1                 45.87                    52.47
5                 68.14                    72.24
20                79.97                    81.33
100               85.87                    87.29

New model downloadable resource names (see how to use the download_data script below):

Checkpoint: checkpoint.retriever.single-adv-hn.nq.bert-base-encoder

New training data: data.retriever.nq-adv-hn-train

Retriever results for NQ test set: data.retriever_results.nq.single-adv-hn.test

Wikipedia embeddings: data.retriever_results.nq.single-adv-hn.wikipedia_passages

Installation

Installation from source. Python virtual or Conda environments are recommended.

git clone git@github.com:facebookresearch/DPR.git
cd DPR
pip install .

DPR is tested on Python 3.6+ and PyTorch 1.2.0+. DPR relies on third-party libraries for encoder implementations. It currently supports Huggingface (version <= 3.1.0) BERT, Pytext BERT and Fairseq RoBERTa encoder models. For generality of the tokenization process, DPR uses Huggingface tokenizers as of now, so Huggingface is the only required dependency; Pytext & Fairseq are optional. Install them separately if you want to use those encoders.

Resources & Data formats

First, you need to prepare data for either retriever or reader training. Each of the DPR components has its own input/output data formats. You can see format descriptions below. DPR provides NQ & Trivia preprocessed datasets (and model checkpoints) to be downloaded from the cloud using our dpr/data/download_data.py tool. One needs to specify the resource name to be downloaded. Run 'python data/download_data.py' to see all options.

python data/download_data.py \
	--resource {key from download_data.py's RESOURCES_MAP}  \
	[optional --output_dir {your location}]

The resource name matching is prefix-based. So if you need to download all data resources, just use --resource data.
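
Internally, this amounts to a prefix filter over the keys of RESOURCES_MAP. A simplified Python sketch (the map entries shown here are illustrative, not the full list):

RESOURCES_MAP = {
    "data.retriever.nq-train": "...",
    "data.retriever.nq-dev": "...",
    "data.retriever.qas.nq-test": "...",
}

def resolve_resources(prefix: str):
    # Every resource whose key starts with the given prefix gets downloaded.
    return [key for key in RESOURCES_MAP if key.startswith(prefix)]

print(resolve_resources("data.retriever"))  # matches all three keys above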

Retriever input data format

The default data format of the Retriever training data is JSON. It contains pools of 2 types of negative passages per question, as well as positive passages and some additional information.

[
  {
	"question": "....",
	"answers": ["...", "...", "..."],
	"positive_ctxs": [{
		"title": "...",
		"text": "...."
	}],
	"negative_ctxs": ["..."],
	"hard_negative_ctxs": ["..."]
  },
  ...
]

The element structure for negative_ctxs & hard_negative_ctxs is exactly the same as for positive_ctxs. The preprocessed data available for download also contains some extra attributes which may be useful for model modifications (like BM25 scores per passage), but they are not currently used by DPR.
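
For reference, here is a minimal Python sketch that loads such a file and inspects one training sample (the file path assumes the default download location):

import json

with open("data/retriever/nq-train.json") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["question"], sample["answers"])
print("positives:", len(sample["positive_ctxs"]))
print("hard negatives:", len(sample["hard_negative_ctxs"]))
# Every context entry is a dict with at least "title" and "text" fields.
print(sample["positive_ctxs"][0]["title"])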

You can download the prepared NQ dataset used in the paper with the 'data.retriever.nq' key prefix. Only dev & train subsets are available in this format. We also provide question & answer-only CSV data files for all train/dev/test splits. Those are used for model evaluation, since our NQ preprocessing step loses part of the original sample set. Use the 'data.retriever.qas.*' resource keys to get the respective sets for evaluation.

python data/download_data.py \
	--resource data.retriever \
	[optional --output_dir {your location}]

DPR data formats and custom processing

You can use your own data format and custom data parsing & loading logic by inheriting from DPR's Dataset classes in the dpr/data/{biencoder|retriever|reader}_data.py files and implementing the load_data() and __getitem__() methods. See the DPR Hydra configuration instructions.
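
As a rough sketch of what such a custom dataset might look like (class and attribute names other than load_data()/__getitem__() are assumptions; check dpr/data/biencoder_data.py for the actual base class and sample types):

import json
from dpr.data.biencoder_data import Dataset, BiEncoderSample  # actual names/locations may differ

class MyJsonlQADataset(Dataset):
    """Loads question/passage training samples from a custom JSONL file."""

    def __init__(self, file: str):
        super().__init__()
        self.file = file
        self.data = []

    def load_data(self):
        with open(self.file) as f:
            self.data = [json.loads(line) for line in f]

    def __getitem__(self, index: int) -> BiEncoderSample:
        entry = self.data[index]
        sample = BiEncoderSample()
        sample.query = entry["question"]
        sample.positive_passages = entry["positive_ctxs"]
        sample.negative_passages = entry.get("negative_ctxs", [])
        sample.hard_negative_passages = entry.get("hard_negative_ctxs", [])
        return sample

    def __len__(self):
        return len(self.data)

A dataset like this is then typically registered via a Hydra config entry so it can be referenced by name in train_datasets/dev_datasets (see the DPR Hydra configuration instructions).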

Retriever training

Retriever training quality depends on the effective batch size. The setup reported in the paper used 8 x 32GB GPUs. To start training on one machine:

python train_dense_encoder.py \
train_datasets=[list of train datasets, comma separated without spaces] \
dev_datasets=[list of dev datasets, comma separated without spaces] \
train=biencoder_local \
output_dir={path to checkpoints dir}

Example for NQ dataset

python train_dense_encoder.py \
train_datasets=[nq_train] \
dev_datasets=[nq_dev] \
train=biencoder_local \
output_dir={path to checkpoints dir}

DPR uses HuggingFace BERT-base as the encoder by default. Other ready options include Fairseq's RoBERTa and Pytext BERT models. You can select them by either changing the encoder configuration file (conf/encoder/hf_bert.yaml) or providing a new configuration file in the conf/encoder dir and enabling it with the encoder={new file name} command line parameter.

Notes:

  • If you want to use Pytext BERT or Fairseq RoBERTa, you will need to download pre-trained weights and specify the encoder.pretrained_file parameter. Specify the dir location of the files downloaded with the 'pretrained.fairseq.roberta-base' resource prefix for the RoBERTa model, or the file path for Pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • Validation and checkpoint saving happen according to the train.eval_per_epoch parameter value.
  • There is no stop condition besides a specified number of epochs to train (train.num_train_epochs configuration parameter).
  • Every evaluation saves a model checkpoint.
  • The best checkpoint is logged in the train process output.
  • Regular NLL classification loss validation for bi-encoder training can be replaced with average rank evaluation. It aggregates passage and question vectors from the input data passage pools, computes a large similarity matrix for those representations, and then averages the rank of the gold passage for each question. We found this metric correlates better with the final retrieval performance than the NLL classification loss (a simplified sketch is given below). Note, however, that this average rank validation works differently in DistributedDataParallel vs DataParallel PyTorch modes. See the train.val_av_rank_* set of parameters to enable this mode and modify its settings.
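
For intuition, the core of the average rank computation can be sketched as follows (a simplified single-process version; the actual implementation batches the similarity calculation and handles the DistributedDataParallel/DataParallel differences mentioned above):

import torch

def average_gold_rank(q_vectors: torch.Tensor, ctx_vectors: torch.Tensor, gold_idx: torch.Tensor) -> float:
    # q_vectors: (num_questions, dim), ctx_vectors: (num_passages, dim)
    # gold_idx[i] is the index of the gold passage for question i
    scores = q_vectors @ ctx_vectors.T                    # full similarity matrix
    gold_scores = scores.gather(1, gold_idx.view(-1, 1))  # score of each gold passage
    ranks = (scores > gold_scores).sum(dim=1).float()     # passages ranked above the gold one
    return ranks.mean().item()                            # lower is better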

See the section 'Best hyperparameter settings' below for an end-to-end example of our best setups.

Retriever inference

Generating representation vectors for the static documents dataset is a highly parallelizable process which can take up to a few days if computed on a single GPU. You might want to use multiple available GPU servers by running the script on each of them independently and specifying their own shards.

python generate_dense_embeddings.py \
	model_file={path to biencoder checkpoint} \
	ctx_src={name of the passages resource, set to dpr_wiki to use our original wikipedia split} \
	shard_id={shard_num, 0-based} num_shards={total number of shards} \
	out_file={result files location + name PREFIX}

The ctx_src parameter takes the name of a passages resource, i.e. a source name from the conf/ctx_sources/default_sources.yaml file.

Note: you can use a much larger batch size here compared to training mode. For example, setting batch_size=128 for a 2-GPU (16GB) server should work fine. You can download already generated Wikipedia embeddings from our original model (trained on the NQ dataset) using the resource key 'data.retriever_results.nq.single.wikipedia_passages'. The embeddings resource name for the new, better model is 'data.retriever_results.nq.single-adv-hn.wikipedia_passages'.

We generally use the following params on 50 2-gpu nodes: batch_size=128 shard_id=0 num_shards=50
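
Assuming the shard split is a simple contiguous partition of the passage list (a sketch of the idea, not the exact DPR code), each job then encodes roughly total/num_shards passages:

import math

def shard_bounds(total_passages: int, shard_id: int, num_shards: int):
    # Contiguous partition: shard 0 takes the first chunk, shard 1 the next, etc.
    per_shard = math.ceil(total_passages / num_shards)
    start = shard_id * per_shard
    end = min(start + per_shard, total_passages)
    return start, end

# e.g. ~21 million Wikipedia passages over 50 shards -> ~420k passages per shard
print(shard_bounds(21_000_000, shard_id=0, num_shards=50))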

Retriever validation against the entire set of documents:

python dense_retriever.py \
	model_file={path to a checkpoint downloaded from our download_data.py as 'checkpoint.retriever.single.nq.bert-base-encoder'} \
	qa_dataset={the name of the test source} \
	ctx_datatsets=[{list of passage sources' names, comma separated without spaces}] \
	encoded_ctx_files=[{list of encoded document files glob expression, comma separated without spaces}] \
	out_file={path to output json file with results} 
	

For example, if your generated embeddings for two passage sets are saved as ~/myproject/embeddings_passages1/wiki_passages_* and ~/myproject/embeddings_passages2/wiki_passages_* files and you want to evaluate on the NQ dataset:

python dense_retriever.py \
	model_file={path to a checkpoint file} \
	qa_dataset=nq_test \
	ctx_datatsets=[dpr_wiki] \
	encoded_ctx_files=[\"~/myproject/embeddings_passages1/wiki_passages_*\",\"~/myproject/embeddings_passages2/wiki_passages_*\"] \
	out_file={path to output json file with results} 

The tool writes retrieved results for subsequent reader model training into the specified out_file. It is a JSON file with the following format:

[
    {
        "question": "...",
        "answers": ["...", "...", ...],
        "ctxs": [
            {
                "id": "...",  # passage id from database tsv file
                "title": "",
                "text": "....",
                "score": "...",  # retriever score
                "has_answer": true|false
            },
            ...
        ]
    },
    ...
]

Results are sorted by their similarity score, from most relevant to least relevant.
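
Since every retrieved passage carries a has_answer flag and the lists are already sorted, top-k retrieval accuracy (as in the accuracy table earlier in this README) can be recomputed from out_file with a short script like this (a sketch; the file name is a placeholder):

import json

def topk_accuracy(results_file: str, k_values=(1, 5, 20, 100)):
    with open(results_file) as f:
        results = json.load(f)
    for k in k_values:
        # A question counts as a hit if any of its top-k passages contains the answer.
        hits = sum(any(ctx["has_answer"] for ctx in item["ctxs"][:k]) for item in results)
        print(f"Top-{k} accuracy: {100.0 * hits / len(results):.2f}")

topk_accuracy("nq_test_retriever_results.json")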

By default, dense_retriever uses an exhaustive search process, but you can opt in to lossy index types. We provide HNSW and HNSW_SQ index options. Enable them with the indexer=hnsw or indexer=hnsw_sq command line arguments. Note that using these indexes may be of limited use from a research point of view, since their fast retrieval comes at the cost of much longer indexing time and higher RAM usage. The similarity score provided is the dot product for the default case of exhaustive search (indexer=flat) and an L2 distance in a modified representation space in the case of the HNSW index.
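
To illustrate the difference between the two index types outside of DPR's own wrappers, here is a minimal FAISS sketch (random vectors; the dimensions and HNSW parameters are illustrative only):

import numpy as np
import faiss

dim = 768  # DPR bert-base embedding size
passage_vectors = np.random.rand(10000, dim).astype("float32")
question_vectors = np.random.rand(5, dim).astype("float32")

# indexer=flat: exhaustive search, scores are raw dot products
flat_index = faiss.IndexFlatIP(dim)
flat_index.add(passage_vectors)
dot_scores, ids = flat_index.search(question_vectors, 100)

# indexer=hnsw: approximate graph search; FAISS HNSW works in L2 space,
# which is why DPR modifies the representations for this index type
hnsw_index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph neighbors per node
hnsw_index.add(passage_vectors)
l2_scores, hnsw_ids = hnsw_index.search(question_vectors, 100)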

Reader model training

python train_extractive_reader.py \
	encoder.sequence_length=350 \
	train_files={path to the retriever train set results file} \
	dev_files={path to the retriever dev set results file}  \
	output_dir={path to output dir}

Default hyperparameters are set for a single node with an 8-GPU setup. Modify them as needed in the conf/train/extractive_reader_default.yaml and conf/extractive_reader_train_cfg.yaml configuration files, or override specific parameters from the command line. The first run will preprocess train_files & dev_files, convert them into a serialized set of .pkl files in the same location, and use them on all subsequent runs.

Notes:

  • If you want to use Pytext BERT or Fairseq RoBERTa, you will need to download pre-trained weights and specify the encoder.pretrained_file parameter. Specify the dir location of the files downloaded with the 'pretrained.fairseq.roberta-base' resource prefix for the RoBERTa model, or the file path for Pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • The reader training pipeline does model validation every train.eval_step batches.
  • Like the bi-encoder, it saves model checkpoints on every validation.
  • Like the bi-encoder, there is no stop condition besides a specified number of epochs to train.
  • Like the bi-encoder, there is no best checkpoint selection logic, so one needs to select it based on dev set validation performance, which is logged in the train process output.
  • Our current code only calculates the Exact Match metric.

Reader model inference

In order to run inference, run train_extractive_reader.py without specifying train_files. Make sure to specify model_file with the path to the checkpoint, passages_per_question_predict with the number of passages per question (used when saving the prediction file), and eval_top_docs with a list of top-passage threshold values from which to choose the question's answer span (to be printed in the logs). An example command line is as follows.

python train_extractive_reader.py \
  prediction_results_file={path to a file to write the results to} \
  eval_top_docs=[10,20,40,50,80,100] \
  dev_files={path to the retriever results file to evaluate} \
  model_file={path to the reader checkpoint} \
  train.dev_batch_size=80 \
  passages_per_question_predict=100 \
  encoder.sequence_length=350

Distributed training

Use Pytorch's distributed training launcher tool:

python -m torch.distributed.launch \
	--nproc_per_node={WORLD_SIZE}  {non-distributed script name & parameters}

Note:

  • All batch-size-related parameters are specified per GPU in distributed mode (DistributedDataParallel) and for all available GPUs in DataParallel (single node, multi-GPU) mode. For example, batch_size=16 on 8 GPUs gives an effective batch size of 128 in DistributedDataParallel mode, but only 16 in DataParallel mode.

Best hyperparameter settings

An end-to-end example with the best settings for the NQ dataset.

1. Download all retriever training and validation data:

python data/download_data.py --resource data.wikipedia_split.psgs_w100
python data/download_data.py --resource data.retriever.nq
python data/download_data.py --resource data.retriever.qas.nq

2. Biencoder (retriever) training in single-set mode.

We used distributed training mode on a single 8 GPU x 32 GB server:

python -m torch.distributed.launch --nproc_per_node=8 \
train_dense_encoder.py \
train=biencoder_nq \
train_datasets=[nq_train] \
dev_datasets=[nq_dev] \
output_dir={your output dir}

New model training combines two NQ datasets:

python -m torch.distributed.launch --nproc_per_node=8 \
train_dense_encoder.py \
train=biencoder_nq \
train_datasets=[nq_train,nq_train_hn1] \
dev_datasets=[nq_dev] \
output_dir={your output dir}

Training for 40 epochs takes about a day. Validation switches to average rank evaluation on epoch 30, and the value should be around 25 or less at the end. The best checkpoint for the bi-encoder is usually the last one, but it should not be very different if you take any checkpoint after epoch ~25.

3. Generate embeddings for Wikipedia.

Just use the instructions from the "Retriever inference" section above. It takes about 40 minutes to produce representation vectors for the 21 million passages on 50 2-GPU servers.

4. Evaluate retrieval accuracy and generate top passage results for each of the train/dev/test datasets.

python dense_retriever.py \
	model_file={path to the best checkpoint or use our provided checkpoints (resource names like checkpoint.retriever.*)} \
	qa_dataset=nq_test \
	ctx_datatsets=[dpr_wiki] \
	encoded_ctx_files=["{glob expression for generated embedding files}"] \
	out_file={path to the output file}

Adjust batch_size based on the available number of GPUs; 64-128 should work for a 2-GPU server.

5. Reader training

We trained the reader model for large datasets using a single 8 GPU x 32 GB server. All the default parameters are already set to our best NQ settings. Please also download the data.gold_passages_info.nq_train & data.gold_passages_info.nq_dev resources for the NQ dataset - they are used for special NQ-only heuristics when preprocessing the data for NQ reader training. If you have already run reader training on NQ data without gold_passages_src & gold_passages_src_dev specified, please delete the corresponding .pkl files so that they will be regenerated.

python train_extractive_reader.py \
	encoder.sequence_length=350 \
	train_files={path to the retriever train set results file} \
	dev_files={path to the retriever dev set results file}  \
	gold_passages_src={path to data.gold_passages_info.nq_train file} \
	gold_passages_src_dev={path to data.gold_passages_info.nq_dev file} \
	output_dir={path to output dir}

We found that the default learning rate works best with a static schedule, so one needs to stop training manually based on evaluation performance dynamics. Our best results were achieved at 16-18 training epochs, or after ~60k model updates.

We provide all input and intermediate results for the end-to-end pipeline for the NQ dataset, and most of the similar resources for Trivia.

Misc.

  • TREC validation requires regexp-based matching. We support only retriever validation in regexp mode; see the --match parameter option (a simplified sketch of the two matching modes is given below).
  • WebQ validation requires entity normalization, which is not included as of now.
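
For reference, the two matching modes mentioned above can be sketched as follows (a simplified illustration, not DPR's exact answer normalization logic):

import re

def string_match(answer: str, passage: str) -> bool:
    # --match string (default): case-insensitive substring test
    return answer.lower() in passage.lower()

def regexp_match(answer_pattern: str, passage: str) -> bool:
    # --match regexp (used for TREC): the gold answer is a regular expression
    return re.search(answer_pattern, passage, flags=re.IGNORECASE) is not None

print(regexp_match(r"(J\.? ?K\.?|Joanne) Rowling", "Harry Potter was written by J.K. Rowling"))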

License

DPR is CC-BY-NC 4.0 licensed as of now.

dpr's Issues

Error in Download Data

Hi, I'm reproducing your work.
Before starting, a naming error for the download path occurred in the download script when downloading all of data.retriever.
The error showed:

$ python data/download_data.py --resource data.retriever
Loading from  https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz
100% [......................................................................] 256239282 / 256239282Saved to  ./data/retriever/nq-dev.tmp
Uncompressing  ./data/retriever/nq-dev.tmp
Saved to  ./data/retriever/nq-dev.json
Loading from  https://dl.fbaipublicfiles.com/dpr/nq_license/LICENSE
100% [..............................................................................] 21065 / 21065Saved to  ./data/retriever/LICENSE
Loading from  https://dl.fbaipublicfiles.com/dpr/nq_license/README
100% [..................................................................................] 506 / 506Saved to  ./data/retriever/README
Loading from  https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz
100% [....................................................................] 2314892908 / 2314892908Saved to  ./data/retriever/nq-train.tmp
Uncompressing  ./data/retriever/nq-train.tmp
Saved to  ./data/retriever/nq-train.json
Loading from  https://dl.fbaipublicfiles.com/dpr/nq_license/LICENSE
100% [..............................................................................] 21065 / 21065Traceback (most recent call last):
  File ".../conda_env_torch1.5/lib/python3.7/shutil.py", line 566, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/data/disk1/private/liyizhi/DPR/data/retriever/LICENSEf_je2ioi.tmp' -> ' (1)./data/retriever/LICENSE'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "data/download_data.py", line 348, in <module>
    main()
  File "data/download_data.py", line 340, in main
    download(args.resource, args.output_dir)
  File "data/download_data.py", line 304, in download
    download(key, out_dir)
  File "data/download_data.py", line 325, in download
    download_file(license_files[0], save_root_dir, 'LICENSE')
  File "data/download_data.py", line 294, in download_file
    wget.download(s3_url, out=local_file)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/wget.py", line 534, in download
    shutil.move(tmpfile, filename)
  File ".../conda_env_torch1.5/lib/python3.7/shutil.py", line 580, in move
    copy_function(src, real_dst)
  File ".../conda_env_torch1.5/lib/python3.7/shutil.py", line 266, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File ".../conda_env_torch1.5/lib/python3.7/shutil.py", line 121, in copyfile...tFoundError: [Errno 2] No such file or directory: ' (1)./data/retriever/LICENSE'

I've tried some temporary changes, forcing out_dir to be fixed in the download_file function, but they did not work.

def download_file(s3_url: str, out_dir: str, file_name: str):
    print('Loading from ', s3_url)
    out_dir='./data/retriever'
    local_file = os.path.join(out_dir, file_name)
    wget.download(s3_url, out=local_file)
    print('Saved to ', local_file)
def download_file(s3_url: str, out_dir: str, file_name: str):
    print('Loading from ', s3_url)
    local_file = local_file.strip(' (1)') # this one not working
    wget.download(s3_url, out=local_file)
    print('Saved to ', local_file)

Finally, I gave up on downloading the license files and it worked.

#     download_file(license_files[0], save_root_dir, 'LICENSE')
#     download_file(license_files[1], save_root_dir, 'README')

Hope you can figure out the bug for further usage, thanks.

Training but not showing any GPU usage

Hi,

I use the following code to train on my dataset.

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1,2,3 /usr/bin/python3.6 train_dense_encoder.py --encoder_model_type hf_bert --model_file checkpoint/retriever/multiset/bert-base-encoder.cp --pretrained_model_cfg bert-base-uncased --train_file ../data/DPR_question_retriever_training --dev_file ../data/DPR_question_retriever_training --output_dir checkpoint/retriever/multiset

But I cannot see any GPU usage (it is extremely low). Also, the training seems to stop at epoch 1. What could be the problem? Thank you!

Pre-trained model and dataset for SQuAD reproduction

Hi, I found that there is no option for reproducing the results of the SQuAD v1.1 experiments.
Though the dataset has some bias compared to NQ, I think it's still an important part of the paper since this dataset is widely used.
Could you provide the dataset in DPR format and the pre-trained models for reproduction? Thanks

Fine-tune using trained model on nq

Hello,

Thank you for the amazing paper and also for open-sourcing the code.
We tried to train the retriever model using the NQ dataset, but we have only 1 GPU with 12GB. From issue #64 I understood we need more than one GPU to train the model.
I was wondering if it's possible to use the model already trained on the NQ dataset and then fine-tune it on my own dataset (which is much smaller)?
We were able to use the DPR model via Huggingface Transformers, but didn't find any docs on fine-tuning it on my own, much smaller dataset.

Thank you!

Fine tune DPR

I have around 5000 question and passage pairs and would like to fine-tune DPR on my data.
I didn't find any script to convert QA pairs to the format which DPR uses for training (with positive and negative contexts).

Can someone please help me with this?

Question regarding dense_retriever arguments

What is the "encoded_ctx_file" for the dense_retriever.py script?


I am trying to use the pretrained retriever to create the input files for the reader model.
I've used the download_data.py script to download:

The retriever checkpoint into: checkpoint/retriever/single/nq/bert-base-encoder.cp
Wikipedia tsv file into: data/wikipedia_split/psgs_w100.tsv
QA data into: data/retriever/qas/nq-*.csv

But I am not sure what the "encoded_ctx_file" referred to in the following running instruction is:

python dense_retriever.py --model_file ${path to biencoder checkpoint} --ctx_file {path to all documents .tsv file} --qa_file {path to test|dev .csv file} --encoded_ctx_file "{encoded document files glob expression}" --out_file {path to output json file with results} --n-docs 200

GPU out of memory

Hi, thanks for the cool work!

I try to train the reader using

python src/train_reader.py \
--encoder_model_type hf_bert \
--seed 42 \
--learning_rate 1e-4 \
--eval_step 2000 \
--do_lower_case \
--eval_top_docs 50 \
--warmup_steps 0 \
--sequence_length 350 \
--batch_size 1 \
--passages_per_question 24 \
--num_train_epochs 3 \
--dev_batch_size 72 \
--passages_per_question_predict 50 \
--pretrained_model_cfg bert-base-uncased \
--train_file train.0.pkl \
--dev_file test.0.pkl \
--output_dir train_output

But even if I set batch_size=1, the GPU still runs out of memory...
I use a single V100 16GB GPU.
Can you help me? Any help will be appreciated!

Retriever Inference Time

Thanks for the great work and for making it available to us. I was able to set up and run the DPR system with pre-trained models. However, I observed that it takes more than 20 seconds to fetch the top 10 passages from the end-to-end system for a given query. I'm using the faiss package which is integrated with the retriever. Am I missing something here? If this is expected, is there any way to reduce the inference time to make it effective in real-time QA applications?

Looking forward to your valuable feedback.

About sampling on positive and negative passages in the DPR reader

In the training of the reader model, a passage selection score is computed over $[p_1, p_2, \ldots, p_{24}]$ in your implementation. But I find that the positive passage will always be put as $p_1$ (see switch_labels).

DPR/dpr/models/reader.py

Lines 78 to 80 in d7ba973

relevance_logits = relevance_logits.view(N, M)
switch_labels = torch.zeros(N, dtype=torch.long).cuda()
switch_loss = torch.sum(loss_fct(relevance_logits, switch_labels))

Will this introduce some bias in the training stage? The model will tend to take the first passage from the retriever as the positive one, which might make the reranking less effective.

Confusion about reader data and retrieval results

I have some confusion regarding the reader data provided in data.reader.nq.single and the retrieval results provided in data.retriever_results.nq.single.[train|dev|test].json.

From my understanding, the reader data is obtained by running dense_retriever.py on the dense embeddings generated via generate_dense_embeddings.py. According to Table 2 of the paper, Top-20 and Top-100 retrieval accuracy for NQ (Single-DPR) are 78.4% and 85.4%, respectively. Thus, my interpretation is that, for k <= 100, almost all of the Top-k passages could be used as reader positives, since they contain the answer span.

However, for each question in the provided reader data, almost all of the passages are negative. Specifically, for both reader train and dev data, the average number of positive passages per question is less than 4.

Also, when I checked the provided retriever results, I found that the retrieval accuracy (i.e., average percentage of Top-100 passages per question with has_answer = True) for these datasets is only around 8%, which seems a lot lower than the 85.4% reported in Table 2.

Could you please help me better understand what is happening here? Thanks!

Questions about the data of TriviaQA

Greetings,

I have 2 questions regarding the data of TriviaQA as follows:

  1. [Reader training] If I understand correctly, multiset refers to training the retriever on multiple datasets rather than the reader. Then the pickle files data.reader.trivia.multi-hybrid.train_[0-8].pkl (total size is 66,544) are questions from TriviaQA only, right? Could you provide DPR's json results similar to data.retriever_results.nq.single.[train/dev/test] in addition to the pkl files, since some samples are filtered out in the pkl files for reader training (78,785 -> 66,544)?

  2. [Retriever Training] data.retriever.nq-train (size 58,880) is the subset after pruning but data.retriever.trivia-train (size 78,785) appears to be the whole training set (referring to 60,413 in Table 1)?

Thank you!

How is (ICT) Wikipedia Pre-training carried out in section 4.1 of paper?

Hi @vlad-karpukhin ,

I understand the Wikipedia pre-processing method in Section 4.1 of the paper. However, I couldn't find a mention of how the model is trained from these 21,015,324 passages that are used for the pre-training of DPR models. Do you train using ICT, i.e. break the passage (100 words) into sentences and try to predict the context given the sentence? If yes, do you choose the sentence at random? I trained models choosing a sentence at random, but it leads to worse performance, so I was wondering where I went wrong.

If it's not pre-trained like ICT, could you explain how the pre-training of the model is done?

Kind Regards,
Nandan

BM25 baselines replication reported in the paper

Hi @vlad-karpukhin ,

I wanted to replicate the BM25 retrieval results mentioned in Table 2 of the DPR paper. When I read footnote 8 -

Lucene implementation. BM25 parameters b = 0.4 (document length normalization) and k1 = 0.9 (term frequency
scaling) are tuned using development sets

I, unfortunately, find nothing on reproducing the BM25 benchmarks within this repository and I wish to run the same BM25 benchmark for NQ and SQuAD datasets -- could you direct me if there is any open-sourced code available? Do you implement Anserini BM25 or something else (code reference if available would be really helpful)?

Kind Regards,
Nandan Thakur

Can not reproduce retriever results on NQ dataset.

Hi, thanks for sharing the source code!

I'm trying to reproduce the retriever results on the NQ dataset. I use similar hyper-parameters but the final average rank is 93 (it should be around 25, as the README says). I only changed nproc_per_node to 2 due to device limitations. Both val_av_rank_hard_neg and val_av_rank_other_neg are 30 as by default.

Best hyperparameters for all datasets

Just as you did with Natural Questions, can you please also share the best hyperparameter settings for TriviaQA, WebQuestions, CuratedTREC, and SQuAD? This would help a lot in reproducing the results from the paper.

Reader model inference - passage score confusing the reader model?

Hello!

First, thanks for sharing this work!

I have a setup where I have successfully reproduced the DPR retriever accuracy at top-20 and top-100 retrieved passages using vespa.ai, but I have some trouble with the reader checkpoint and the EM metric/results it produces in my setup, and I wanted to ask if you have any idea. I suspect it's due to the score range distribution. With Vespa.ai the passage score is calculated as 1/(1 + l2_distance) (after transforming the inner product space to Euclidean space for HNSW). Steps to reproduce with two questions in the NQ open test set:

python3 data/download_data.py --resource  checkpoint.reader.nq-single.hf-bert-base
Loading from  https://dl.fbaipublicfiles.com/dpr/checkpoint/reader/nq-single/hf_bert_base.cp
100% [....................................................................] 1313930953 / 1313930953Saved to  ./checkpoint/reader/nq-single/hf-bert-base.cp

Reader model inference

 python3 train_reader.py --prediction_results_file out.json  --eval_top_docs 10 100 --model_file data/checkpoint/reader/nq-single-subset/hf-bert-base.cp --dev_file results-filtered.json --passages_per_question_predict 100 --sequence_length 350

Where results-filtered.json is available at https://gist.githubusercontent.com/jobergum/33f6fd7f612dbb923e1414682fc66f5b/raw/9be34f2518c5d9ef132fa60f0e7689c5f724ec13/results-filtered.json. Variants of those two questions exist in the train set (https://arxiv.org/pdf/2008.02637.pdf), so I was hoping for a good EM score.

Reader inference Output

on device=cpu, n_gpu=0, world size=1
16-bits training: False 
 **************** CONFIGURATION **************** 
adam_betas                     -->   (0.9, 0.999)
adam_eps                       -->   1e-08
batch_size                     -->   2
checkpoint_file_name           -->   dpr_reader
dev_batch_size                 -->   4
dev_file                       -->   results-filtered.json
device                         -->   cpu
distributed_world_size         -->   1
do_lower_case                  -->   False
dropout                        -->   0.1
encoder_model_type             -->   None
eval_step                      -->   2000
eval_top_docs                  -->   [10, 100]
fp16                           -->   False
fp16_opt_level                 -->   O1
fully_resumable                -->   False
gold_passages_src              -->   None
gold_passages_src_dev          -->   None
gradient_accumulation_steps    -->   1
learning_rate                  -->   1e-05
local_rank                     -->   -1
log_batch_step                 -->   100
max_answer_length              -->   10
max_grad_norm                  -->   1.0
max_n_answers                  -->   10
model_file                     -->   data/checkpoint/reader/nq-single-subset/hf-bert-base.cp
n_gpu                          -->   0
no_cuda                        -->   False
num_train_epochs               -->   3.0
num_workers                    -->   16
output_dir                     -->   None
passages_per_question          -->   2
passages_per_question_predict  -->   100
prediction_results_file        -->   out.json
pretrained_file                -->   None
pretrained_model_cfg           -->   None
projection_dim                 -->   0
seed                           -->   0
sequence_length                -->   350
train_file                     -->   None
train_rolling_loss_step        -->   100
warmup_steps                   -->   100
weight_decay                   -->   0.0
 **************** CONFIGURATION **************** 
***** Initializing components for training *****
Reading saved model from data/checkpoint/reader/nq-single-subset/hf-bert-base.cp
model_state_dict keys odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])
Overriding args parameter value from checkpoint state. Param = pretrained_model_cfg, value = bert-base-uncased
Overriding args parameter value from checkpoint state. Param = encoder_model_type, value = hf_bert
Overriding args parameter value from checkpoint state. Param = sequence_length, value = 350
..
Loading checkpoint @ batch=3940 and epoch=19
Loading model weights from saved state ...
Loading saved optimizer state ...
No train files are specified. Run validation.
Validation ...
Data files: ['results-filtered.json']
Found preprocessed files. ['results-filtered.0.pkl', 'results-filtered.1.pkl']
Reading file results-filtered.0.pkl
Aggregated data size: 1
Reading file results-filtered.1.pkl
Aggregated data size: 2
Total data size: 2
n=10	EM 0.00
n=100	EM 0.00
Saving prediction results to  out.json

out.json

[
    {
        "gold_answers": [
            "Jakeem Grant",
            "John Ross"
        ],
        "predictions": [
            {
                "prediction": {
                    "passage": "4. 18 run by within the same week added some support to the legitimacy of the times.'s was hand - timed by a scout as running a 4. 10 in 2016, potentially beating's record. ran a 4. 27 - second 40 - yard dash in 1989. 2013, recorded a time of 4. 22 at a facility during a workout. 2017 sprinter ran a time of 4. 12 seconds on turf in response to claims that players are as fast as. is a list of the official 40 - yard",
                    "passage_idx": 0,
                    "relevance_score": -3.989269256591797,
                    "score": -5.150297403335571,
                    "text": "scout"
                },
                "top_k": 10
            },
            {
                "prediction": {
                    "passage": "was a high school track champion and record holder in the late 1940s. was a standout sprinter for in,.'s speed was such that the athletic director steered him toward track, not wanting him to get hurt playing football. soon became a statewide success and was followed by the state's largest newspaper, the \" \". ran the 40 yard dash in 4. 2 seconds, a schoolboy record at the time, the 100 yard dash in a wind - assisted 9. 2 seconds, and the 220 yard sprint",
                    "passage_idx": 28,
                    "relevance_score": -2.3115363121032715,
                    "score": 8.016469955444336,
                    "text": ""
                },
                "top_k": 100
            }
        ],
        "question": "who ran the fastest 40 yard dash in the nfl"
    },
    {
        "gold_answers": [
            "Kevin McKeon as Young Pink",
            "Bob Geldof as Pink",
            "David Bingham as Little Pink",
            "Bob Geldof",
            "David Bingham",
            "Kevin McKeon"
        ],
        "predictions": [
            {
                "prediction": {
                    "passage": "the tour, band relationships dropped to an all - time low ; four were parked in a circle, with the doors facing away from the centre. used his own vehicle to arrive at the venue, and stayed in separate hotels from the rest of the band., returning to perform his duties as a salaried musician, was the only member of the band to profit from the tour, which lost about \u00a3400, 000. film adaptation, \" \u2013, \" was released in 1982. was written by and directed by, with as.",
                    "passage_idx": 2,
                    "relevance_score": -11.945453643798828,
                    "score": -4.854901194572449,
                    "text": "band"
                },
                "top_k": 10
            },
            {
                "prediction": {
                    "passage": "\" \", alongside and in 2016. - is married to. first child, a son, was born in 2015. has a brother and sister, &. grew up on the cost of. 2016, - revealed that shortly after relocating to in 2012, he began receiving treatment from. actor told of \" \" that he decided to attend because he did not like the personality changes he suffered as a result of consuming alcohol. -.",
                    "passage_idx": 32,
                    "relevance_score": -7.610385417938232,
                    "score": 7.239379405975342,
                    "text": ""
                },
                "top_k": 100
            }
        ],
        "question": "who played pink in pink floyd the wall"
    }
]

reader inference infinite running with very high CPU utilization and memory

I just followed the README to do reader inference, running the following commands:

export CUDA_VISIBLE_DEVICES=0
python train_reader.py \
  --prediction_results_file ./results.json \
  --eval_top_docs 10 20 40 50 80 100 \
  --dev_file test.0.pkl \
  --model_file hf_bert_base.cp \
  --log_batch_step 1 \
  --dev_batch_size 80 \
  --passages_per_question_predict 100 \
  --sequence_length 350

However, the process gets stuck at this point for an hour:

(screenshot)

And I find GPU utilization is 0, while CPU utilization and memory consumption are extremely high.

(screenshot)

I did not modify any code. The env is

python == 3.7.7
CUDA == 10.1
torch == 1.6.0
transformers == 2.11.0

Limited number of passages for the Reader

Looking at the preprocessed NQ data files for the reader module, I see that the number of passages per question varies between 5 and 100, with an average of 57 passages per question. Can you share why this is so?

Inquiry about GPU distributed training

Hi, I've started running model training with GPUs that have less RAM (4 x 11GB).

The batch_size is set to 4 in order to avoid memory error. This runs well.

nohup python -m torch.distributed.launch --nproc_per_node=4 --master_port=19234 train_dense_encoder.py --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_model_cfg bert-base-uncased --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 4 --do_lower_case --train_file ./data/retriever/nq-train.json --dev_file ./data/retriever/nq-dev.json --output_dir ./checkpoints --learning_rate 2e-05 --num_train_epochs 40 --dev_batch_size 4 --eval_per_epoch 2 > log/nohup.retriver_encoder.log6.ser23_3.4xGPU 2>&1 &

The code also runs fine if I make only one GPU visible, and the batch_size can then be 8.

nohup python train_dense_encoder.py --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_model_cfg bert-base-uncased --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 8 --do_lower_case --train_file ./data/retriever/nq-train.json --dev_file ./data/retriever/nq-dev.json --output_dir ./checkpoints --learning_rate 2e-05 --num_train_epochs 40 --dev_batch_size 8 --eval_per_epoch 2 > log/nohup.retriver_encoder.log6.ser23_3.4xGPU 2>&1 &

However, when the number of visible GPUs is set to 4 (or more than 1), this error occurs. It seems that distributed training can only be launched with python -m torch.distributed.launch. I think this should be clarified in the README.

  File "train_dense_encoder.py", line 564, in <module>
    main()
  File "train_dense_encoder.py", line 554, in main
    trainer.run_train()
  File "train_dense_encoder.py", line 129, in run_train
    self._train_epoch(scheduler, epoch, eval_step, train_iterator)
  File "train_dense_encoder.py", line 324, in _train_epoch
    loss, correct_cnt = _do_biencoder_fwd_pass(self.biencoder, biencoder_batch, self.tensorizer, args)
  File "train_dense_encoder.py", line 472, in _do_biencoder_fwd_pass
    input.ctx_segments, ctx_attn_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../DPR/dpr/models/biencoder.py", line 85, in forward
    question_attn_mask, self.fix_q_encoder)
  File ".../DPR/dpr/models/biencoder.py", line 77, in get_representation
    sequence_output, pooled_output, hidden_states = sub_model(ids, segments, attn_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1_mnt/private/liyizhi/DPR/dpr/models/hf_models.py", line 125, in forward
    attention_mask=attention_mask)
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/transformers/modeling_bert.py", line 707, in forward
    attention_mask, input_shape, self.device
  File ".../conda_env_torch1.5/lib/python3.7/site-packages/transformers/modeling_utils.py", line 113, in device
    return next(self.parameters()).device
StopIteration

PS: I wonder if you could provide the checkpoint files for the fine-tuned model parameters? I've looked through the README several times and couldn't find such instructions. It would be nice to have them for a simple inference run to test the result before reproducing from scratch. I'm worried that the smaller batch size imposed by lower-profile devices might not completely reproduce your work, thanks.

Request for test set of SQuAD

With data/download_data.py, we can download the train and dev sets of SQuAD by specifying the resource as data.retriever.squad1-train and data.retriever.squad1-dev, respectively.
I guess the QA files of SQuAD, like data.retriever.qas.trivia-test, can be made by dumping tsv files that contain only the "question" and "answers" of those files.
However, I could not find the link to the test set for SQuAD from the download script. Could you please provide the qas file for the test set of SQuAD?

Adding more details to this issue:

  • I noticed that the test set size of SQuAD indicated in the DPR paper is the same as the official dev set size of SQuAD v1.1 (dev-v1.1.json), so I made a file named squad1-test.csv by processing dev-v1.1.json. However, I could not reproduce the retrieval result reported in the paper. Therefore, I came to wonder if the test split used for the paper is a different file or was processed in a different way.
  • To reproduce the retriever, I used the hyperparameters specified in the README file as "Best hyperparameter settings - 2. Biencoder(Retriever) training in single set mode." referring to the comment in #22. I could successfully reproduce the scores of TriviaQA, but could not get the numbers for SQuAD with the tsv file I processed by myself.

(I'm sorry for initially submitting the issue in an incomplete form; I accidentally hit the enter button while I was typing...)

Questions about the Retriever input data format

Hi, thank you so much for open-sourcing DPR! I have some questions about the Retriever input data format.

Given the paper, the best performance comes from the Gold setting + 1 BM25 paragraph, in which (from my understanding) the negative examples are in-batch gold paragraphs and 1 BM25 paragraph. On the other hand, in the provided retriever's nq_train.json data, there are multiple positive_ctxs, 50 negative_ctxs and a lot of hard_negative_ctxs, while it seems that those negative_ctxs will not be used by default and only one paragraph from hard_negative_ctxs will be used.

First, what is the difference between negative_ctxs and hard_negative_ctxs?
Second, how are those negative paragraphs selected?
Also, there are multiple positive_ctxs in nq_train.json. According to the paper, the positive examples for NQ and SQuAD are the preprocessed paragraphs corresponding to the reference paragraphs in the original NQ / SQuAD datasets. How are the positive paragraphs in nq_train.json selected?

For SQuAD and Natural Questions, since the original passages have been split and processed differently than our pool of candidate passages, we match and replace each gold passage with the corresponding passage in the candidate
pool.

Number of passages for each question used in Table 5 of DPR paper

Hi @vlad-karpukhin,
Thanks for sharing your work on GitHub.

I have a question regarding your End-to-End QA setup used to obtain EM values reported in Table 5 of the paper:
What is the number of retrieved passages (for each question) by the retriever component which are then passed to the reader component?
Have you used the top-100? Or have you changed this number?
Thanks

A question on hardware requirements to run retriever validation

Hi, thank you so much for open-sourcing DPR! I have a question about hardware requirements.

I would like to run validation on the retrieval task, and I downloaded the checkpoints as well as the generated vectors for NQ following the instructions in this issue. Yet, when I tried to run validation with this command, the program was terminated due to a memory error while building the index with faiss_indexer (related: facebookresearch/faiss#180):

Traceback (most recent call last):
  File "dense_retriever.py", line 285, in <module>
    main(args)
  File "dense_retriever.py", line 226, in main
    retriever.index_encoded_data(input_paths, buffer_size=index_buffer_sz)
  File "dense_retriever.py", line 94, in index_encoded_data
    self.index.index_data(buffer)
  File "/home/akari/projects/DPR/dpr/indexer/faiss_indexers.py", line 77, in index_data
    self.index.add(vectors)
  File "/home/akari/projects/DPR/dpr_env/lib/python3.6/site-packages/faiss/__init__.py", line 138, in replacement_add
    self.add_c(n, swig_ptr(x))
  File "/home/akari/projects/DPR/dpr_env/lib/python3.6/site-packages/faiss/swigfaiss.py", line 1454, in add
    return _swigfaiss.IndexFlat_add(self, n, x)
MemoryError: std::bad_alloc

Would you tell me how much memory our machine should have to run the validation? (I do not need to re-train the model right now.) My machine has 62GB of RAM. If it takes more than that, are there any options to reduce RAM usage during indexing?

Calculation of eval_step

In train_dense_encoder.py and train_reader.py, eval_step is calculated in the following way:

updates_per_epoch = train_iterator.max_iterations // args.gradient_accumulation_steps
eval_step = math.ceil(updates_per_epoch / args.eval_per_epoch)

If I understand correctly, this does not work if args.gradient_accumulation_steps > 1 and args.eval_per_epoch > 1.

For example, let train_iterator.max_iterations=100, args.gradient_accumulation_steps=2, and args.eval_per_epoch=2. It follows that updates_per_epoch=50 and eval_step=25.

However, the problem is that train_iterator.max_iterations is never actually changed to updates_per_epoch and is still 100. Thus, with an eval_step of 25, validation is performed four times per epoch, instead of twice.

Retriever inference result

Hi!

I ran the retriever inference as instructed, but I do not understand its output. Are there further instructions on how to use the generated dense embeddings?

Thank you very much!

About inference for the reader model

Hi DPR authors, thanks for the great work! I am wondering why the reader model does inference by first selecting the top reranked passage within the passage thresholds, and then selecting the best span within that top-ranked passage.

DPR/train_reader.py

Lines 334 to 336 in d7ba973

best_spans = get_best_spans(self.tensorizer, p_start_logits, p_end_logits, ctx_ids, max_answer_length,
passage_idx, relevance_logits[q, passage_idx].item(), top_spans=10)
nbest.extend(best_spans)

DPR/train_reader.py

Lines 343 to 344 in d7ba973

curr_nbest = [pred for pred in nbest if pred.passage_index < n]
passage_rank_matches[n] = curr_nbest[0]

In my understanding, spans in nbest are sorted according to their corresponding passage selection score, so the best span from the passage with the highest passage selection score will be chosen. This makes the reader model depend highly on the top-1 accuracy of passage selection (reranking).

Would it be better if the reader reranked all spans from all passages according to the sum of the relevance score and the span score?

BTW, I have two minor implementation questions:

  • The question-title-passage encoding looks like this: [CLS] question [SEP] title [SEP] passage [PAD] [PAD] ... [PAD]. Is there a missing [SEP] at the end of the passage?
  • The reader model is trained by maximizing the marginal likelihood of all matched answer spans. At inference time, if answers from different passages have the same string, why not merge them by summing up their scores?

Question in "reader model input data pre-processing"

Hi, thanks for sharing!

When I run preprocess_reader_data.py, after splitting the data into 16 chunks, I get the following error:

Traceback (most recent call last):
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/mnt/dqa/renruiyang/env/DPR/dpr/data/reader_data.py", line 425, in _preprocess_reader_samples_chunk
    for i, r in enumerate(iterator):
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/mnt/dqa/renruiyang/env/DPR/dpr/data/reader_data.py", line 152, in preprocess_retriever_data
    is_train_set,
  File "/mnt/dqa/renruiyang/env/DPR/dpr/data/reader_data.py", line 280, in _select_reader_passages
    answers_token_ids = [tensorizer.text_to_tensor(a, add_special_tokens=False) for a in answers]
  File "/mnt/dqa/renruiyang/env/DPR/dpr/data/reader_data.py", line 280, in <listcomp>
    answers_token_ids = [tensorizer.text_to_tensor(a, add_special_tokens=False) for a in answers]
  File "/mnt/dqa/renruiyang/env/DPR/dpr/models/hf_models.py", line 154, in text_to_tensor
    pad_to_max_length=False, truncation=True)
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 816, in encode
    **kwargs,
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 908, in encode_plus
    first_ids = get_input_ids(text)
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 898, in get_input_ids
    return self.convert_tokens_to_ids(self.tokenize(text, **kwargs))

  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 723, in tokenize
    tokenized_text = split_on_tokens(added_tokens, text)
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 717, in split_on_tokens
    for token in tokenized_text
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 717, in <genexpr>
    for token in tokenized_text
TypeError: _tokenize() got an unexpected keyword argument 'truncation'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/preprocess_reader_data.py", line 49, in <module>
    main(args)
  File "src/preprocess_reader_data.py", line 28, in main
    tensorizer, args.num_workers)
  File "/mnt/dqa/renruiyang/env/DPR/dpr/data/reader_data.py", line 215, in convert_retriever_results
    for file_name in workers.map(_parse_batch, chunks):
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "//mnt/dqa/lihongyu04/python3.7_torch1.1/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
TypeError: _tokenize() got an unexpected keyword argument 'truncation'

My settings are:

python src/preprocess_reader_data.py \
        --retriever_results ./official-reproduce/nq-train.json \
        --gold_passages_src ./official-reproduce/nq-train_gold_info.json \
        --do_lower_case \
        --pretrained_model_cfg bert-base-uncased \
        --encoder_model_type hf_bert \
        --out_file ./ \
       --is_train_set

Please help me solve this bug...
Thank you!

How inter-dependent are the DPR and the reader?

Greetings,

I am wondering how the performance of the reader depends on the retriever.

  1. In the paper Sec 6.1, it says "For training, we sample one positive and m − 1 negative passages for each question at each iteration". Is this one positive passage retrieved by DPR, or will any passage (described in Sec 4.2, Selecting positive passages) with has_answer=True do? Or rather, is the reader training independent of the retriever training?

  2. In the same paragraph, it says "we use the highest-ranked passage from BM25 that contains the answer as the positive passage". Do you use A or Q+A as the query for BM25?

  3. I assume the order of retrieved passages doesn't matter for the reader, since its selection probability is normalized by a softmax over all retrieved passages. Is this correct?

Many thanks!
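On point 3: a tiny illustration (my own sketch, not the repo's code) of why the ordering should not matter, since a softmax over passage scores is permutation-equivariant, so each passage keeps its probability regardless of where it appears in the list.

import torch

scores = torch.tensor([3.1, 0.7, 2.4])   # hypothetical passage selection scores
perm = torch.tensor([2, 0, 1])           # same passages presented in a different order
p_orig = torch.softmax(scores, dim=0)
p_perm = torch.softmax(scores[perm], dim=0)
assert torch.allclose(p_orig[perm], p_perm)  # probabilities follow the passages, not the positions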

Compute Requirements for Retriever Validation

I am trying to run retriever validation against the entire set of documents for the NQ dataset, using the checkpoint model and Wikipedia embeddings downloaded with download_data.py.

With 4 V100 GPUs (32GB) and 128GB RAM, I'm getting a segmentation fault. How much memory and compute do I need to run retriever validation?

Initialized host <HOST> as d.rank -1 on device=cuda, n_gpu=4, world size=1
16-bits training: False
 **************** CONFIGURATION ****************
batch_size                     -->   32
ctx_file                       -->   <FILEPATH>/psgs_w100.tsv
device                         -->   cuda
distributed_world_size         -->   1
do_lower_case                  -->   False
encoded_ctx_file               -->   <FILEPATH>/single
encoder_model_type             -->   None
fp16                           -->   False
fp16_opt_level                 -->   O1
hnsw_index                     -->   False
index_buffer                   -->   50000
local_rank                     -->   -1
match                          -->   string
model_file                     -->   <FILEPATH>/bert-base-encoder.cp
n_docs                         -->   200
n_gpu                          -->   4
no_cuda                        -->   False
out_file                       -->   <FILEPATH>/output
pretrained_file                -->   None
pretrained_model_cfg           -->   None
projection_dim                 -->   0
qa_file                        -->   <FILEPATH>/nq-test.csv
save_or_load_index             -->   False
sequence_length                -->   512
validation_workers             -->   16
 **************** CONFIGURATION ****************
Reading saved model from <FILEPATH>/bert-base-encoder.cp
model_state_dict keys odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])
Overriding args parameter value from checkpoint state. Param = do_lower_case, value = True
Overriding args parameter value from checkpoint state. Param = pretrained_model_cfg, value = bert-base-uncased
Overriding args parameter value from checkpoint state. Param = encoder_model_type, value = hf_bert
Overriding args parameter value from checkpoint state. Param = sequence_length, value = 256
[repeated numpy FutureWarning messages from tensorflow/python/framework/dtypes.py and tensorboard/compat/tensorflow_stub/dtypes.py omitted]
Segmentation fault (core dumped)
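For a rough sense of scale (my own back-of-the-envelope numbers, assuming the full ~21M-passage Wikipedia split and 768-dimensional embeddings): the exact flat index alone needs on the order of 60GB of RAM, on top of the embedding files being read during indexing and the passage text loaded for answer matching, so 128GB can be tight.

# Rough memory estimate for the exact (flat) FAISS index; the counts are assumptions.
num_passages = 21_000_000          # full Wikipedia split
dim = 768                          # BERT-base embedding size
bytes_per_float = 4
index_gib = num_passages * dim * bytes_per_float / 1024**3
print(f"flat index alone: ~{index_gib:.0f} GiB")   # ~60 GiB, excluding buffers and passage text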

Retrieval results for TriviaQA

The retrieval results for NQ seem to be stored in data.retriever_results.nq.single.train/dev/test, but I couldn't find the retrieval results for TriviaQA. Could you specify how to access those? Thanks.

Interactive mode

Hi and thanks for the great repo.

I am wondering if there is an easy way to have an interactive script like DrQA's.

For instance:

python scripts/retriever/interactive.py 

>>> process('question answering', k=5)

+------+-------------------------------+-----------+
| Rank |             Doc Id            | Doc Score |
+------+-------------------------------+-----------+
|  1   |       Question answering      |   327.89  |
|  2   |       Watson (computer)       |   217.26  |
|  3   |          Eric Nyberg          |   214.36  |
|  4   |   Social information seeking  |   212.63  |
|  5   | Language Computer Corporation |   184.64  |
+------+-------------------------------+-----------+

where the score is the dot product instead of the TF-IDF score.
Is there an easy way to do this? And is there a function that, given the doc_id (or directly), returns the corresponding document embedding?

I feel this could be of great use to the research community.

Thanks again for your great work

Andrea
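There doesn't seem to be an official interactive script, but a rough hack along these lines is possible on top of the saved embeddings (the file names, glob pattern, and encode_question helper are assumptions on my part, not DPR API):

import glob
import pickle

import faiss
import numpy as np

ids, vectors = [], []
for path in sorted(glob.glob("output/dpr_ctx_*")):   # shards written by generate_dense_embeddings.py
    with open(path, "rb") as f:
        for doc_id, vec in pickle.load(f):
            ids.append(doc_id)
            vectors.append(vec)
vectors = np.vstack(vectors).astype(np.float32)

index = faiss.IndexFlatIP(vectors.shape[1])          # exact dot-product index
index.add(vectors)

def process(question, encode_question, k=5):
    # encode_question is a hypothetical helper wrapping the DPR question encoder;
    # it should return a (1, dim) float32 numpy array for the query.
    query_vector = encode_question(question)
    scores, idx = index.search(query_vector, k)
    return [(ids[i], float(s)) for i, s in zip(idx[0], scores[0])]

With the vectors kept around like this, looking up the embedding for a given doc_id is just vectors[ids.index(doc_id)].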

dense_retriever not enough values to unpack

Hello,

While running dense_retriever.py, I ran into the problem below and am a bit confused by it. Is this issue caused by setting --num_shards=20 but only running --shard_id=0 and --shard_id=19 for generate_dense_embeddings.py? In other words, if I set --num_shards=2, run --shard_id=0 and --shard_id=1 for generate_dense_embeddings.py, and then rerun dense_retriever.py, should the issue be fixed? Thanks very much!

loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home1/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
Loading saved model state ...
Encoder vector_size=768
Reading all passages data from files: ['output/dpr_ctx.index_meta.dpr', 'output/dpr_ctx_0', 'output/dpr_ctx.index.dpr', 'output/dpr_ctx_19']
Reading file output/dpr_ctx.index_meta.dpr
Traceback (most recent call last):
  File "dense_retriever.py", line 284, in <module>
    main(args)
  File "dense_retriever.py", line 223, in main
    retriever.index.index_data(input_paths)
  File "/work/DPR/dpr/indexer/faiss_indexers.py", line 34, in index_data
    for i, item in enumerate(iterate_encoded_files(vector_files)):
  File "/work/DPR/dpr/indexer/faiss_indexers.py", line 190, in iterate_encoded_files
    db_id, doc_vector = doc
ValueError: not enough values to unpack (expected 2, got 1)
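For what it's worth, the "Reading all passages data from files" line above shows the glob picked up the serialized index files (dpr_ctx.index.dpr and dpr_ctx.index_meta.dpr) alongside the embedding shards, and those files are not pickled (id, vector) tuples, which would explain the unpack failure regardless of how many shards were generated. One way to keep only the shard files (the naming pattern is an assumption based on the log above):

import glob

# Keep only the embedding shards (dpr_ctx_<shard_id>); skip previously saved
# FAISS index/meta files, which do not contain (id, vector) tuples.
input_paths = [p for p in sorted(glob.glob("output/dpr_ctx*"))
               if not p.endswith((".index.dpr", ".index_meta.dpr"))]

Alternatively, passing a pattern that only matches the shards (e.g. output/'dpr_ctx_*') to --encoded_ctx_file should have the same effect.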

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte" when doing inference.

Hello, sorry for this quick-and-dirty issue.

Problem is when I try to run this:
python generate_dense_embeddings.py --model_file checkpoint/retriever/multiset/bert-base-encoder.cp --ctx_file compressed-data/wikipedia_split/psgs_w100.tsv.gz --out_file myout

I get this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Previously, I ran these commands:
python data/download_data.py --resource checkpoint.retriever.multiset.bert-base-encoder
python data/download_data.py --resource compressed-data.wikipedia_split.psgs_w100

My conda env:

# packages in environment at /home/electron/miniconda3/envs/dpr:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
blis                      0.4.1                    pypi_0    pypi
ca-certificates           2020.7.22                     0  
catalogue                 1.0.0                    pypi_0    pypi
certifi                   2020.6.20                py38_0  
chardet                   3.0.4                    pypi_0    pypi
click                     7.1.2                    pypi_0    pypi
cymem                     2.0.3                    pypi_0    pypi
cython                    0.29.21                  pypi_0    pypi
dpr                       0.1.0                    pypi_0    pypi
faiss-cpu                 1.6.3                    pypi_0    pypi
filelock                  3.0.12                   pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
idna                      2.10                     pypi_0    pypi
joblib                    0.17.0                   pypi_0    pypi
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20191231         h14c3975_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
murmurhash                1.0.2                    pypi_0    pypi
ncurses                   6.2                  he6710b0_1  
numpy                     1.19.2                   pypi_0    pypi
openssl                   1.1.1h               h7b6447c_0  
packaging                 20.4                     pypi_0    pypi
pip                       20.2.3                   py38_0  
plac                      1.1.3                    pypi_0    pypi
preshed                   3.0.2                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
python                    3.8.5                h7579374_1  
readline                  8.0                  h7b6447c_0  
regex                     2020.9.27                pypi_0    pypi
requests                  2.24.0                   pypi_0    pypi
sacremoses                0.0.43                   pypi_0    pypi
sentencepiece             0.1.91                   pypi_0    pypi
setuptools                49.6.0                   py38_1  
six                       1.15.0                   pypi_0    pypi
spacy                     2.3.2                    pypi_0    pypi
sqlite                    3.33.0               h62c20be_0  
srsly                     1.0.2                    pypi_0    pypi
thinc                     7.4.1                    pypi_0    pypi
tk                        8.6.10               hbc83047_0  
tokenizers                0.8.1rc1                 pypi_0    pypi
torch                     1.6.0                    pypi_0    pypi
tqdm                      4.50.0                   pypi_0    pypi
transformers              3.0.2                    pypi_0    pypi
urllib3                   1.25.10                  pypi_0    pypi
wasabi                    0.8.0                    pypi_0    pypi
wget                      3.2                      pypi_0    pypi
wheel                     0.35.1                     py_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3  

Downloads from the previously run commands were successful.

Should I manually edit generate_dense_embeddings.py to solve this? What else can I do? Is this solvable by changing the Python version, or am I doing something wrong?
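One guess rather than a confirmed diagnosis: 0x8b in position 1 is the second byte of the gzip magic number, and the command passes psgs_w100.tsv.gz directly while the script reads a plain TSV. Decompressing the dump first (a rough sketch below; the paths are taken from the command above) and pointing --ctx_file at the resulting .tsv may be all that's needed, rather than editing generate_dense_embeddings.py or changing the Python version.

import gzip
import shutil

# One-off decompression so the script can read a plain TSV.
src = "compressed-data/wikipedia_split/psgs_w100.tsv.gz"
dst = "compressed-data/wikipedia_split/psgs_w100.tsv"
with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
    shutil.copyfileobj(fin, fout)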

Error when running train_reader -- ValueError: a must be greater than 0 unless no samples are taken

Hi! I get the following error when running train_reader.py:

Total iterations per epoch=1237
 Total updates=24720
  Eval step = 2000
***** Training *****
***** Epoch 0 *****
Traceback (most recent call last):
  File "train_reader.py", line 507, in <module>
    main()
  File "train_reader.py", line 498, in main
    trainer.run_train()
  File "train_reader.py", line 126, in run_train
    global_step = self._train_epoch(scheduler, epoch, eval_step, train_iterator, global_step)
  File "train_reader.py", line 225, in _train_epoch
    is_train=True, shuffle=True)
  File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 134, in create_reader_input
    is_random=shuffle)
  File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 193, in _create_question_passages_tensors
    positive_idx = _get_positive_idx(positives, max_len, is_random)
  File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 175, in _get_positive_idx
    positive_idx = np.random.choice(len(positives)) if is_random else 0
  File "mtrand.pyx", line 894, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken

This error occurs right after train_reader.py successfully loads all of the preprocessed .pkl reader data files. Could you please help me resolve this issue?
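In case it helps narrow things down, np.random.choice raises this ValueError when it is asked to pick from an empty list, so at least one training sample apparently reached the reader with no positive passages. A quick check over the preprocessed files along these lines (the file pattern and the positive_passages attribute are assumptions based on the traceback, not a documented format) can confirm whether that is the case:

import glob
import pickle

# Count preprocessed reader samples that ended up with no positive passage;
# any such sample would trigger np.random.choice over an empty list.
empty, total = 0, 0
for path in glob.glob("reader_data/*.pkl"):
    with open(path, "rb") as f:
        samples = pickle.load(f)
    for sample in samples:
        total += 1
        if not sample.positive_passages:
            empty += 1
print(f"{empty}/{total} samples have no positive passage")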

Load pretrained checkpoints when tuning?

Hi,

I searched and didn't find an existing issue for this (but maybe I missed it).

I get a 'Checkpoint files []' message when tuning on my own dataset, so I wonder whether this means I am tuning the original bert-base-uncased model on my dataset.

Is there an option for loading a checkpoint trained on NQ when tuning? I see there is a pretrained_file option, but no matter what I pass to it, I still get the 'Checkpoint files []' message...

Thank you!

dpr_all_documents is not defined

I am encountering a "dpr_all_documents is not defined" error when running the dense_retriever.py script, specifically in the "validate" function right before the results are saved.

Reducing memory usage of dense_retriever.py when hnsw index option is given

index_buffer_sz = -1 # encode all at once

Currently, the code forces the index buffer size to -1 when the HNSW index is used, which loads all 21M vectors into RAM at once and keeps them there until the indexing is done. This takes up a significant amount of memory.
Because the 21M vectors are only needed to calculate the max norm (phi), while the FAISS indexing itself can be done in batches, we can significantly reduce memory usage by separating the indexing step from the max-norm calculation; see the sketch after the references below.

ref. calculating the max norm:

phi = max(phi, norms)

ref. indexing:

self.index.add(hnsw_vectors)
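A minimal two-pass sketch of that idea (my own rough code, not a patch; vector_batches is assumed to be any iterable that re-reads the serialized embedding files in fixed-size chunks):

import numpy as np

def compute_phi(vector_batches):
    # First pass: only the maximum squared norm is needed, so each batch can be
    # discarded as soon as its norms have been inspected.
    phi = 0.0
    for vectors in vector_batches:                 # each batch: np.ndarray [n, d]
        norms = (vectors ** 2).sum(axis=1)
        phi = max(phi, float(norms.max()))
    return phi

def add_in_batches(index, vector_batches, phi):
    # Second pass: augment each batch with the auxiliary dimension that turns
    # inner-product search into L2 search, then add it to the HNSW index.
    for vectors in vector_batches:
        norms = (vectors ** 2).sum(axis=1)
        aux = np.sqrt(phi - norms).reshape(-1, 1)
        index.add(np.hstack([vectors, aux]).astype(np.float32))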

Can pre-trained model checkpoints be loaded as hugging-face transformer model?

Hello,

I wanted to load the checkpoint "checkpoint.retriever.multiset.bert-base-encoder" as a Hugging Face transformers model, i.e., will I be able to load it with something similar to the example code below?

This is something which I would ideally like to have -
from transformers import AutoModel
model = AutoModel.from_pretrained(path_to_checkpoint)

At the moment, I see that in dense_retriever.py and generate_dense_embeddings.py the checkpoint model is loaded as the bi-encoder class defined in models/biencoder.py:
print(type(encoder))
>>> <class 'dpr.models.biencoder.BiEncoder'>

Kind Regards,
Nandan Thakur
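Loading the .cp file directly with AutoModel.from_pretrained won't work as-is, but one approach that could work (a rough sketch with an assumed key layout; verify against your checkpoint) is to strip the bi-encoder prefix from the saved weights and feed them to a plain BertModel:

import torch
from transformers import BertModel

state = torch.load("bert-base-encoder.cp", map_location="cpu")
prefix = "question_model."        # or "ctx_model." for the passage encoder
encoder_state = {k[len(prefix):]: v
                 for k, v in state["model_dict"].items()
                 if k.startswith(prefix)}
model = BertModel.from_pretrained("bert-base-uncased", state_dict=encoder_state)

The 'model_dict' key and the question_model./ctx_model. prefixes are assumptions about the checkpoint layout and worth double-checking before relying on the converted model.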

OSError: [Errno 12] Cannot allocate memory

Hello,

I am running dense_retriever.py for retriever validation on nq-train; the command I used and the error are below. Is this error caused by the num_shards setting in the previous generate_dense_embeddings run, where I ran --shard_id 0 and --shard_id 19 with --num_shards 20 and produced only dpr_ctx_0 and dpr_ctx_19? Or is it due to my machine's RAM (machine information also below)? Thanks very much!

Error:
Total encoded queries tensor torch.Size([79168, 768])
index search time: 3956.004276 sec.
Reading data from: output/data/wikipedia_split/psgs_w100.tsv
Matching answers in top docs...
Exception in thread Thread-4949:
Traceback (most recent call last):
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/pool.py", line 412, in _handle_workers
    pool._maintain_pool()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/pool.py", line 248, in _maintain_pool
    self._repopulate_pool()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Command:
python3 dense_retriever.py \
    --model_file output/checkpoint/retriever/multiset/bert-base-encoder.cp \
    --ctx_file output/data/wikipedia_split/psgs_w100.tsv \
    --qa_file output/data/retriever/qas/nq-train.csv \
    --encoded_ctx_file output/'dpr_ctx*' \
    --out_file output/dpr_retrieval/nq-train.json \
    --n-docs 100 \
    --validation_workers 32 \
    --batch_size 64

Machine information:

Accelerators: | 4 NVIDIA Quadro RTX 5000 / node
CUDA Parallel Processing Cores: | 3072 / card
NVIDIA Tensor Cores: | 384 / card
GPU Memory: | 16GB GDDR6 / card
CPUs: | 2 Intel Xeon E5-2620 v4 (“Broadwell”)
RAM: | 128GB (2133 MT/s) DDR4
Local storage: | 144GB /tmp partition on a 240GB SSD.

Reproducing Table 2 of DPR paper

Hi! Does the repo include code for reproducing Table 2 ("Top-20 & Top-100 retrieval accuracy on test sets...") of the DPR paper?

Best results reproduction instruction

Hello,

I am trying to train a model following your instructions and ran train_dense_encoder.py. The instructions refer to --dev_file {path to downloaded data.retriever.qas.nq-dev resource}, but it is unclear which file is meant.

Is it retriever/qas/nq-dev.csv or retriever/nq-dev.json? The first option fails because the code expects a JSON file, but the second one doesn't look like a "retriever.qas" resource based on its name.

retriever bug: pickle.load() ValueError: could not convert string to int

Reading file /storage03/users/nq_open/retriver/DPR/dpr_ctx_10
Reading file /storage03/users//nq_open/retriver/DPR/dpr_ctx.index.dpr
Traceback (most recent call last):
  File "dense_retriever.py", line 303, in <module>
    main(args)
  File "dense_retriever.py", line 242, in main
    retriever.index_encoded_data(input_paths, buffer_size=index_buffer_sz)
  File "dense_retriever.py", line 92, in index_encoded_data
    for i, item in enumerate(iterate_encoded_files(vector_files)):
  File "dense_retriever.py", line 193, in iterate_encoded_files
    doc_vectors = pickle.load(reader)
ValueError: could not convert string to int

How to generate gold_passage_info?

Hi, thanks for the great work. May I know how gold_passages_info.nq_{train|dev|test} is generated? I guess it might come from the NQ dataset, but is there a standard way to generate these files?

dense_retriever -- MemoryError: std::bad_alloc

Hi! It seems that no matter what value I set index_buffer to, I get the following error when running dense_retriever.py:

Traceback (most recent call last):
  File "dense_retriever.py", line 331, in <module>
    main(args)
  File "dense_retriever.py", line 268, in main
    retriever.index_encoded_data(input_paths, buffer_size=index_buffer_sz)
  File "dense_retriever.py", line 100, in index_encoded_data
    self.index.index_data(buffer)
  File "/home/aarchan/qa-aug/qa-aug/dpr/indexer/faiss_indexers.py", line 93, in index_data
    self.index.add(vectors)
  File "/home/aarchan/anaconda2/envs/qa-aug/lib/python3.8/site-packages/faiss/__init__.py", line 138, in replacement_add
    self.add_c(n, swig_ptr(x))
  File "/home/aarchan/anaconda2/envs/qa-aug/lib/python3.8/site-packages/faiss/swigfaiss.py", line 1454, in add
    return _swigfaiss.IndexFlat_add(self, n, x)
MemoryError: std::bad_alloc

For reference, the machine I'm running this on has 128GB RAM, but it doesn't seem to be enough. Could you please help me with this issue? Thanks!
