ConAE

This repo provides the code for reproducing the experiments in "Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder".

Requirements

To install requirements, run the following commands:

git clone https://github.com/NEUIR/ConAE/
cd ConAE
python setup.py install

Data Download

To download all the needed data, run:

bash commands/data_download.sh 

Original embeddings generation

All embeddings are produced by the ANCE encoder as 768-dimensional dense vectors. You can encode the datasets yourself, or download pre-computed embeddings from other repositories such as Pyserini. If you generate the embeddings yourself or obtain them elsewhere, make sure they match the format expected by this repository; otherwise the scripts will fail. The commands for generating the embeddings we use are given below.
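
To catch format mismatches early, the following minimal sketch checks an embedding file. It assumes (this is an assumption, not a documented format) that each .pkl file stores a pair of an id list and an (N, 768) vector array; adjust the unpacking to whatever get_query_embedding.py actually writes.

# Sketch: sanity-check an embedding pickle before training.
# Assumed layout (not verified against this repo): (ids, vectors),
# where vectors is a numpy array of shape (N, 768).
import pickle

import numpy as np

with open("../data/datasets/MSMARCO/qembed.pkl", "rb") as f:
    ids, vectors = pickle.load(f)  # adjust if the real layout differs

vectors = np.asarray(vectors)
print(f"{len(ids)} embeddings, dimension {vectors.shape[1]}")
assert vectors.shape[1] == 768, "ANCE embeddings should be 768-dimensional"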

1. To encode queries, run:

python -u -m torch.distributed.launch ../data/get_query_embedding.py \
	--dataset MSMARCO \
	--tokenizer castorini/ance-msmarco-passage \
	--encoder castorini/ance-msmarco-passage \
	--output_path ../data/datasets/MSMARCO \
	--data_path ../data/datasets/MSMARCO
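
As an alternative to the script above, query embeddings can also be obtained directly from Pyserini, which ships an ANCE query encoder. A minimal sketch, assuming a recent Pyserini release (in older releases the class lives in pyserini.dsearch):

# Sketch: encode a single query with Pyserini's ANCE query encoder.
from pyserini.search.faiss import AnceQueryEncoder

encoder = AnceQueryEncoder("castorini/ance-msmarco-passage")
vector = encoder.encode("what is a dense retriever")  # 768-dimensional numpy vector
print(vector.shape)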

2. To encode docs, run:

python ../data/get_doc_embedding.py \
	--index ../data/datasets/MSMARCO/dindex-msmarco-passage-ance-bf-20210224-060cef \
	--output_path ../data/datasets/MSMARCO \
	--partition 1
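
The dindex-* directories follow Pyserini's prebuilt dense index layout: a flat FAISS index file named "index" plus a "docid" file with one document id per line. Assuming that layout, the stored ANCE document vectors can be pulled back out as in this sketch:

# Sketch: read document vectors from a Pyserini-style prebuilt ANCE index.
import os

import faiss

index_dir = "../data/datasets/MSMARCO/dindex-msmarco-passage-ance-bf-20210224-060cef"
index = faiss.read_index(os.path.join(index_dir, "index"))
with open(os.path.join(index_dir, "docid")) as f:
    docids = [line.strip() for line in f]

print(index.ntotal, "documents,", index.d, "dimensions")  # expect index.d == 768
first_vectors = index.reconstruct_n(0, 5)                 # (5, 768) array of raw embeddings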

Training

MSMARCO

1. To train the ConAE model from the paper on MSMARCO, first retrieve candidates for the training queries:

python ../train/MSMARCO/get_candidates.py \
	--index ../data/datasets/MSMARCO/dindex-msmarco-passage-ance-bf-20210224-060cef \
	--qembed_path ../data/datasets/MSMARCO/qembed.pkl \
	--output_path ../data/datasets/MSMARCO \
	--topk 500 --batch_size 128 --threads 40
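
Candidate generation is essentially a batched top-500 FAISS search of the query embeddings against the document index. The sketch below illustrates the idea; the output file name and JSON format are illustrative only and need not match what get_candidates.py actually writes:

# Sketch: batched top-k candidate retrieval over the ANCE document index.
import json
import os
import pickle

import faiss
import numpy as np

index_dir = "../data/datasets/MSMARCO/dindex-msmarco-passage-ance-bf-20210224-060cef"
index = faiss.read_index(os.path.join(index_dir, "index"))
faiss.omp_set_num_threads(40)

with open(os.path.join(index_dir, "docid")) as f:
    docids = [line.strip() for line in f]
with open("../data/datasets/MSMARCO/qembed.pkl", "rb") as f:
    qids, qvecs = pickle.load(f)  # assumed (ids, vectors) layout, see the format note above
qvecs = np.asarray(qvecs, dtype="float32")

topk, batch_size = 500, 128
with open("../data/datasets/MSMARCO/candidates.jsonl", "w") as out:  # illustrative name
    for start in range(0, len(qids), batch_size):
        _, hit_ids = index.search(qvecs[start:start + batch_size], topk)
        for qid, row in zip(qids[start:start + batch_size], hit_ids):
            out.write(json.dumps({"qid": qid, "candidates": [docids[i] for i in row]}) + "\n")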

2. Then randomly sample 50,000 queries from the raw MSMARCO training set as the development set:

python ../train/MSMARCO/split_train.py
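
Conceptually the split just shuffles the per-query training records and holds out 50,000 of them; a minimal sketch, assuming one JSON record per line and using the file names from step 3 (the input name train_retrieval.json is hypothetical):

# Sketch: hold out 50,000 random queries as the development split.
import random

with open("../data/datasets/MSMARCO/train_retrieval.json") as f:  # hypothetical input name
    records = f.readlines()

random.seed(13)
random.shuffle(records)
dev, train = records[:50000], records[50000:]

with open("../data/datasets/MSMARCO/train_retrieval_small.json", "w") as f:
    f.writelines(dev)
with open("../data/datasets/MSMARCO/train_retrieval_large.json", "w") as f:
    f.writelines(train)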

3. Finally, run:

python ../train/MSMARCO/train.py \
	--train_path ../data/datasets/MSMARCO/train_retrieval_large.json \
	--valid_path ../data/datasets/MSMARCO/train_retrieval_small.json \
	--query_embed_path ../data/datasets/MSMARCO/qembed.pkl \
	--train_doc_embed_path ../data/datasets/MSMARCO/embed/dembed.pkl \
	--output_dim 64 \
	--outdir ../data/datasets/MSMARCO/save_model_64

NQ

1. To train the ConAE model from the paper on NQ, first retrieve candidates for the training queries:

python ../train/NQ/get_candidates.py \
	--index ../data/datasets/NQ/dindex-wikipedia-ance_multi-bf-20210224-060cef \
	--qembed_path ../data/datasets/NQ/qembed.pkl \
	--output_path ../data/datasets/NQ \
	--topk 500 --threads 40

2. Because of memory limitations, we split the document embeddings into dev and train sets. If you have enough memory, you can skip this step:

python ../train/NQ/split_doc.py

3. Finally, run:

python ../train/NQ/train.py \
	--train_path ../data/datasets/NQ/train_retrieval.json \
	--valid_path ../data/datasets/NQ/dev_retrieval.json \
	--query_embed_path ../data/datasets/NQ/qembed.pkl \
	--train_doc_embed_path ../data/datasets/NQ/dembed_train.pkl \
	--valid_doc_embed_path ../data/datasets/NQ/dembed_dev.pkl \
	--output_dim 64 \
	--outdir ../data/datasets/NQ/save_model_64

Compress embeddings

1. To compress dense embeddings (MSMARCO) using ConAE, run:

python ../get_embbedings/get_emb.py \
  --query_embed_path ../data/datasets/MSMARCO/qembed.pkl \
  --doc_embed_path ../data/datasets/MSMARCO/embed/dembed.pkl \
  --checkpoint ../data/checkpoint/ConAE/MSMARCO/ConAE_256/model.best.pt  \
  --output_dim 256 \
  --output_path ../data/datasets/MSMARCO/compress_emb/

2. To compress dense embeddings (NQ) using ConAE, run:

python ../get_embbedings/get_nq_emb.py \
  --query_embed_path ../data/datasets/NQ/query-embedding-ance_multi-nq-test/ \
  --doc_embed_path ../data/datasets/NQ/dindex-wikipedia-ance_multi-bf-20210224-060cef/ \
  --checkpoint ../data/checkpoint/ConAE/NQ/ConAE_256/model.best.pt \
  --output_dim 256 \
  --output_path ../data/datasets/NQ/compress_emb/
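
The compressed embeddings can be searched the same way as the original ones, only in the lower-dimensional space. A minimal sketch with a flat inner-product FAISS index; the file names and the (ids, vectors) pickle layout are assumptions, so adapt them to the actual output of get_emb.py:

# Sketch: retrieval over compressed embeddings with a flat inner-product index.
import pickle

import faiss
import numpy as np

with open("../data/datasets/MSMARCO/compress_emb/dembed_256.pkl", "rb") as f:  # assumed name
    doc_ids, doc_vecs = pickle.load(f)
with open("../data/datasets/MSMARCO/compress_emb/qembed_256.pkl", "rb") as f:  # assumed name
    query_ids, query_vecs = pickle.load(f)

doc_vecs = np.asarray(doc_vecs, dtype="float32")
query_vecs = np.asarray(query_vecs, dtype="float32")

index = faiss.IndexFlatIP(doc_vecs.shape[1])    # 256-dimensional compressed space
index.add(doc_vecs)
scores, ranks = index.search(query_vecs, 1000)  # top-1000, e.g. for Recall@1k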

Our checkpoints can be downloaded here.

Evaluation

Evaluation uses the same commands as the Compress embeddings step above; just add --evaluation so that the program evaluates the compressed embeddings after they are produced.
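
If you want to score a run outside the built-in --evaluation path, MRR@10 (the main MSMARCO metric reported below) can be computed directly from ranked candidate lists and qrels; a small self-contained sketch:

# Sketch: MRR@10 over ranked candidate lists.
# run:   {qid: [docid, ...]} ranked best-first; qrels: {qid: {relevant docids}}.
def mrr_at_10(run, qrels):
    total = 0.0
    for qid, ranking in run.items():
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)

# Toy usage: the relevant document sits at rank 2, so MRR@10 = 0.5.
print(mrr_at_10({"q1": ["d9", "d3", "d7"]}, {"q1": {"d3"}}))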

Results

MSMARCO     MRR@10   NDCG@10   Recall@1k
ANCE-768    0.3302   0.3877    0.9584
ConAE-256   0.3294   0.3864    0.9560
ConAE-128   0.3245   0.3816    0.9523
ConAE-64    0.2862   0.3376    0.9222

NQ          Top20    Top100
ANCE-768    0.8224   0.8787
ConAE-256   0.8053   0.8723
ConAE-128   0.8064   0.8687
ConAE-64    0.7604   0.8460

TREC-COVID  NDCG@10
ANCE-768    0.6529
ConAE-256   0.6405
ConAE-128   0.6381
ConAE-64    0.5006
