terrierteam / pyterrier_dr

Python 100.00%

pyterrier_dr's Issues

TypeError: from_file() missing 1 required positional argument: 'shared'

torch.__version__: '1.13.0'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_69880/976215978.py in <module>
      1 retr_pipeline = model >> pyterrier_dr.TorchIndex('/nfstrecdl/msmarco-passage.tct-hnp')
----> 2 retr_pipeline.search('Hello Terrier')

/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier/transformer.py in search(self, query, qid, sort)
    214         import pandas as pd
    215         queryDf = pd.DataFrame([[qid, query]], columns=["qid", "query"])
--> 216         rtr = self.transform(queryDf)
    217         if "qid" in rtr.columns and "rank" in rtr.columns:
    218             rtr = rtr.sort_values(["qid", "rank"], ascending=[True,True])

/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier/transformer.py in transform(self, topics)
    885     def transform(self, topics):
    886         for m in self.models:
--> 887             topics = m.transform(topics)
    888         return topics
    889 

/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier_dr/indexes.py in transform(self, inp)
    657             query_vecs = query_vecs.half()
    658 
--> 659         self.docnos_and_data()
    660 
    661         step = self._cuda_data.shape[0]

/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier_dr/indexes.py in docnos_and_data(self)
    637                 'f2': (torch.HalfStorage,  torch.HalfTensor,  torch.cuda.HalfTensor,  2),
    638             }[self.dtype]
--> 639             self._cpu_data = TType(SType.from_file(str(self.index_path/'index.npy'), size=meta['count'] * meta['vec_size'])).reshape(meta['count'], meta['vec_size'])
    640             doc_batch_size = self.idx_mem//SIZE//meta['vec_size']
    641             if self.half:

TypeError: from_file() missing 1 required positional argument: 'shared'
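
The traceback shows `from_file()` being called with only a filename and `size`, while the installed torch requires `shared` as well, so passing `shared=False` explicitly at that call site would likely satisfy the signature. As an alternative that avoids the storage API altogether, a flat vector file can be memory-mapped with NumPy. The sketch below assumes a headerless, row-major float32 file; `load_flat_vectors`, the demo file name, and the demo sizes are hypothetical (in the traceback the sizes come from `meta['count']` and `meta['vec_size']`).

```python
import os
import tempfile

import numpy as np

def load_flat_vectors(path, count, vec_size, dtype=np.float32):
    """Memory-map a headerless, row-major vector file as a (count, vec_size) array."""
    # mode="r" maps the file read-only; pages are loaded lazily on access.
    return np.memmap(path, dtype=dtype, mode="r", shape=(count, vec_size))

# Hypothetical demo data standing in for a real index file.
demo_count, demo_vec_size = 5, 4
demo_vectors = np.arange(demo_count * demo_vec_size, dtype=np.float32)
demo_vectors = demo_vectors.reshape(demo_count, demo_vec_size)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_index.bin")
demo_vectors.tofile(demo_path)  # writes raw values with no header

loaded = load_flat_vectors(demo_path, demo_count, demo_vec_size)
```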

apply over an index

In the FCRS use case (with @yashonwu), we need to take a (dense) index and apply a "model" to it to make a new index. I think it would make sense to replace our custom index and ranking code with pyterrier_dr.

Currently, that just means applying a torch.nn.Linear, which can be done batchwise on a GPU. More generically, it could be any function. What would be a reasonable API for PyTerrier DR?

index_new = index.downcast(model.img_transform, batchsize=K, device=torch.device('cuda'))

I wouldn't expect all indices to support this. I think some FAISS implementations could, but all I really care about is the NumPy index implementation.
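
To make the proposal concrete, here is a minimal NumPy sketch of applying a function batchwise over a matrix of document vectors to produce a new matrix. It is not an existing pyterrier_dr API: `apply_over_vectors` is a hypothetical helper, and the linear projection (`weight`, `bias`) stands in for something like `torch.nn.Linear`.

```python
import numpy as np

def apply_over_vectors(vectors, fn, batch_size=1024):
    """Apply fn to row batches of `vectors` and stack the results into a new array."""
    out_batches = []
    for start in range(0, vectors.shape[0], batch_size):
        out_batches.append(fn(vectors[start:start + batch_size]))
    return np.concatenate(out_batches, axis=0)

# Made-up data: 10 vectors of dimension 8, projected down to dimension 3.
rng = np.random.default_rng(0)
old_vectors = rng.standard_normal((10, 8)).astype(np.float32)
weight = rng.standard_normal((8, 3)).astype(np.float32)
bias = np.zeros(3, dtype=np.float32)

new_vectors = apply_over_vectors(
    old_vectors, lambda batch: batch @ weight + bias, batch_size=4
)
```

Because only one batch is passed to `fn` at a time, the same loop would work if `fn` moved each batch to a GPU and back.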

Save a model

Hi,

How do I save a trained model? I looked for a 'save' method but couldn't find one.

Thanks.

Indexing Robust

Hey,

I'm having some problems indexing the Robust collection using NumpyIndex. I tried to index it by following the example given in the PyTerrier documentation:

index = pyterrier_dr.NumpyIndex('./indices/robust_MB')
files = pt.io.find_files("path/to/robust")
idx_pipeline = model >> index
idx_pipeline.index(files)

But it doesn't work. I think it's because I need to pass the fields and meta when indexing, but NumpyIndex doesn't accept these parameters.

Any help is welcome. Thanks.
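
One thing that may help: the TorchIndex example further down this page passes `dataset.get_corpus_iter()` to `index()`, which is an iterable of document dicts rather than a list of file paths. The sketch below turns TREC-style text into such dicts. It assumes the pipeline expects `docno` and `text` keys and that the files are uncompressed `<DOC>` records; `iter_trec_docs` is a hypothetical helper, not part of PyTerrier.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

def iter_trec_docs(raw):
    """Yield {'docno', 'text'} dicts from a string holding TREC-style <DOC> records."""
    for doc in DOC_RE.finditer(raw):
        body = doc.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip malformed records without a DOCNO
        text = " ".join(m.group(1).strip() for m in TEXT_RE.finditer(body))
        yield {"docno": docno.group(1), "text": text}

# Made-up sample in the assumed format.
sample = """<DOC>
<DOCNO> FT911-1 </DOCNO>
<TEXT>
First document text.
</TEXT>
</DOC>
<DOC>
<DOCNO> FT911-2 </DOCNO>
<TEXT>
Second document text.
</TEXT>
</DOC>"""

docs = list(iter_trec_docs(sample))
```

A generator built this way over the contents of each file could then be passed to `idx_pipeline.index(...)` in place of `files`.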

TctColBert.reverse()'s function

Hey,

I'm a bit confused about what the method TctColBert.reverse() does. Is it used internally by any other method?
Also, in this line of code:
Q = Q[:, 4:, :].mean(dim=1) # remove the first 4 tokens (representing [CLS] [ Q ]), and average
It may seem a silly question, but why is [CLS] [ Q ] represented by 4 tokens and not 2?

Thank you!
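
A likely explanation, assuming `[Q]` is not a single entry in the BERT vocabulary: WordPiece would then split it into the three pieces `[`, `q`, `]`, which together with `[CLS]` gives four positions. The pooling step in the quoted line can be sketched in NumPy as follows; `pool_query` and the array values are made up for illustration.

```python
import numpy as np

PREFIX_LEN = 4  # assumed: [CLS] plus the three word pieces "[", "q", "]"

def pool_query(token_embs, prefix_len=PREFIX_LEN):
    """Average token embeddings over the sequence axis, skipping the prefix positions."""
    return token_embs[:, prefix_len:, :].mean(axis=1)

# Made-up batch: 2 queries, 6 token positions, 3 dimensions.
token_embs = np.zeros((2, 6, 3), dtype=np.float32)
token_embs[:, :4, :] = 100.0  # prefix positions, excluded from the average
token_embs[:, 4, :] = 1.0
token_embs[:, 5, :] = 3.0

pooled = pool_query(token_embs)  # each entry is the mean of 1.0 and 3.0
```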

index should have a get_embs() transformer

Which calls .docnos_and_data() if needed.

Memory requirement for TorchIndex

Hi,

NumpyIndex does not require the full index to be in memory, but can the same be said of TorchIndex?
I'm trying to run an experiment on the MS MARCO dev set with TorchIndex and I get an error saying CUDA is out of memory.

Thanks for any help!

Here's the code:
```
# Training
model = pyterrier_dr.TctColBert(model_name='distilbert-base-uncased')
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
model.fit(dataset=dataset, steps=1000)

# Evaluation
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index = pyterrier_dr.TorchIndex('index.torch')
idx_pipeline = model >> index
idx_pipeline.index(dataset.get_corpus_iter())

retr_pipeline = model >> index
pt.Experiment(
    [retr_pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg", "ndcg_cut_10"]
)
```

And I get this error:
```
Traceback (most recent call last):
  File "/XXX/DR_cuda/DR_cuda.py", line 22, in <module>
    pt.Experiment(
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier/pipelines.py", line 450, in Experiment
    time, evalMeasuresDict = _run_and_evaluate(
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier/pipelines.py", line 170, in _run_and_evaluate
    res = system.transform(topics)
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier/ops.py", line 335, in transform
    topics = m.transform(topics)
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier_dr/indexes.py", line 676, in transform
    scores = query_vecs @ self._cuda_data[:bsize].T
RuntimeError: CUDA out of memory. Tried to allocate 33.70 GiB (GPU 0; 10.92 GiB total capacity; 1.63 GiB already allocated; 8.27 GiB free; 2.00 GiB reserved in total by PyTorch)

srun: error: gpu-nc06: task 0: Exited with exit code 1
```
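
The failing line, `scores = query_vecs @ self._cuda_data[:bsize].T`, scores every query against a whole batch of documents at once, so the allocation grows with the number of queries times the number of documents per batch. One way to bound it is to split the queries into chunks as well and keep a running top-k. The NumPy sketch below illustrates that idea only; it is not how pyterrier_dr is implemented, and `chunked_top_k` and its parameters are hypothetical.

```python
import numpy as np

def chunked_top_k(query_vecs, doc_vecs, k=10, query_chunk=256, doc_chunk=4096):
    """Exact top-k inner-product search that holds at most
    query_chunk x (doc_chunk + k) candidate scores in memory at once."""
    n_queries = query_vecs.shape[0]
    best_scores = np.full((n_queries, k), -np.inf, dtype=np.float32)
    best_ids = np.full((n_queries, k), -1, dtype=np.int64)
    for q0 in range(0, n_queries, query_chunk):
        q = query_vecs[q0:q0 + query_chunk]
        q1 = q0 + q.shape[0]
        for d0 in range(0, doc_vecs.shape[0], doc_chunk):
            scores = q @ doc_vecs[d0:d0 + doc_chunk].T
            ids = np.broadcast_to(np.arange(d0, d0 + scores.shape[1]), scores.shape)
            # Merge this block with the running best and keep the k largest.
            cand_scores = np.concatenate([best_scores[q0:q1], scores], axis=1)
            cand_ids = np.concatenate([best_ids[q0:q1], ids], axis=1)
            order = np.argsort(-cand_scores, axis=1)[:, :k]
            best_scores[q0:q1] = np.take_along_axis(cand_scores, order, axis=1)
            best_ids[q0:q1] = np.take_along_axis(cand_ids, order, axis=1)
    return best_scores, best_ids

# Made-up data: 7 queries and 50 documents of dimension 16.
rng = np.random.default_rng(0)
queries = rng.standard_normal((7, 16)).astype(np.float32)
docs = rng.standard_normal((50, 16)).astype(np.float32)
top_scores, top_ids = chunked_top_k(queries, docs, k=5, query_chunk=3, doc_chunk=8)
```

With the experiment above, running the queries through the pipeline in smaller groups would have a similar effect on the size of the score matrix.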

Error code 139

Hi,

While trying to index the ArguAna collection and then run an experiment on it, I encounter error code 139. From what I understand it's a segmentation violation, but I don't understand where it is coming from.

Here's my code:
```
from pyterrier.measures import RR  # needed for the RR@10 measure below

dataset = pt.get_dataset('irds:beir/arguana')
index = pyterrier_dr.NumpyIndex('index_arg.np')
idx_pipeline = model >> index
idx_pipeline.index(dataset.get_corpus_iter())

retr_pipeline = model >> index
pt.Experiment(
    [retr_pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", RR@10, "ndcg", "ndcg_cut_10"]
)
```

And I get this:
beir/arguana documents: 100%|██████████| 8674/8674 [15:44<00:00, 9.19it/s]
srun: error: gpu-nc07: task 0: Exited with exit code 139

If anyone knows anything about this please let me know.
Thank you.
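
For reference, shells conventionally report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11, which is SIGSEGV on Linux. The small sketch below decodes such a status with the standard library; `describe_exit_code` is a hypothetical helper name.

```python
import signal

def describe_exit_code(code):
    """Translate a shell-style exit status into a signal name when it encodes one."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None  # not a known signal number on this platform
    return None

name = describe_exit_code(139)
```

Running the script with `python -X faulthandler` makes the interpreter print the Python-level stack when the crash happens, which can help locate the call that triggers it.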

tests & gh action

We should add some unit tests and a GitHub Action to run them.
