pyterrier_dr's Issues
TypeError: from_file() missing 1 required positional argument: 'shared'
torch.version: '1.13.0'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_69880/976215978.py in <module>
1 retr_pipeline = model >> pyterrier_dr.TorchIndex('/nfstrecdl/msmarco-passage.tct-hnp')
----> 2 retr_pipeline.search('Hello Terrier')
/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier/transformer.py in search(self, query, qid, sort)
214 import pandas as pd
215 queryDf = pd.DataFrame([[qid, query]], columns=["qid", "query"])
--> 216 rtr = self.transform(queryDf)
217 if "qid" in rtr.columns and "rank" in rtr.columns:
218 rtr = rtr.sort_values(["qid", "rank"], ascending=[True,True])
/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier/transformer.py in transform(self, topics)
885 def transform(self, topics):
886 for m in self.models:
--> 887 topics = m.transform(topics)
888 return topics
889
/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier_dr/indexes.py in transform(self, inp)
657 query_vecs = query_vecs.half()
658
--> 659 self.docnos_and_data()
660
661 step = self._cuda_data.shape[0]
/opt/conda/envs/dr/lib/python3.7/site-packages/pyterrier_dr/indexes.py in docnos_and_data(self)
637 'f2': (torch.HalfStorage, torch.HalfTensor, torch.cuda.HalfTensor, 2),
638 }[self.dtype]
--> 639 self._cpu_data = TType(SType.from_file(str(self.index_path/'index.npy'), size=meta['count'] * meta['vec_size'])).reshape(meta['count'], meta['vec_size'])
640 doc_batch_size = self.idx_mem//SIZE//meta['vec_size']
641 if self.half:
TypeError: from_file() missing 1 required positional argument: 'shared'
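The failing line maps the index file into a tensor of shape (count, vec_size) without reading it all up front. As a minimal sketch of the same idea outside torch, the snippet below uses numpy.memmap. It assumes the file is a headerless array of float32 values, which the traceback does not confirm. The `count` and `vec_size` values are stand-ins for `meta['count']` and `meta['vec_size']`, and the temporary file is purely illustrative.

```python
import os
import tempfile

import numpy as np

count, vec_size = 6, 4  # stand-ins for meta['count'] and meta['vec_size']

# Write a small headerless float32 file to play the role of the index data
# (assumption: the real file has the same raw layout).
path = os.path.join(tempfile.mkdtemp(), "index.npy")
np.arange(count * vec_size, dtype=np.float32).tofile(path)

# Map the file lazily; data is only read from disk when rows are accessed.
data = np.memmap(path, dtype=np.float32, mode="r", shape=(count, vec_size))

# Copy just the first two vectors into ordinary memory.
first_rows = np.array(data[:2])
```

This does not fix the torch call itself; it only shows the memory-mapping pattern the line relies on.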
apply over an index
In the FCRS use case (w/ @yashonwu), we need to take a (dense) index and apply a "model" to it to make a new index. I think it would make sense to replace our custom index and ranking code with pyterrier_dr.
Currently, that just means applying a torch.nn.Linear. This can be done batchwise on a GPU. More generically, it could be any function. What would be a reasonable API for PyTerrier DR?
index_new = index.downcast( model.img_transform, batchsize=K, device=torch.device('cuda') )
I wouldn't expect all indices to support this. I think some FAISS implementations could support it, but all I really care about is the numpy index implementation.
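To make the request concrete, here is a minimal sketch of applying a function batchwise over a matrix of document vectors to produce a new matrix. It uses plain numpy rather than torch, and `apply_over_vectors`, `W` and `b` are hypothetical names for illustration only; none of this is an existing pyterrier_dr API.

```python
import numpy as np

def apply_over_vectors(vecs, fn, batch_size=1024):
    """Apply fn to vecs one batch at a time and stack the results."""
    out = []
    for start in range(0, vecs.shape[0], batch_size):
        batch = vecs[start:start + batch_size]
        out.append(np.asarray(fn(batch)))
    return np.concatenate(out, axis=0)

# Toy data: 10 vectors of size 8, projected down to size 4 by a linear map
# (the numpy analogue of a torch.nn.Linear with weight W and bias b).
rng = np.random.default_rng(0)
old_vecs = rng.standard_normal((10, 8)).astype(np.float32)
W = rng.standard_normal((8, 4)).astype(np.float32)
b = np.zeros(4, dtype=np.float32)

new_vecs = apply_over_vectors(old_vecs, lambda x: x @ W + b, batch_size=3)
```

Because `vecs` is only ever sliced, the same loop would also work over a memory-mapped array, which is one reason the numpy index looks like the easiest implementation to support first.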
Save a model
Hi,
How do I save a trained model? I looked for a 'save' method but couldn't find one.
Thanks.
Indexing Robust
Hey,
I'm having some problems indexing the Robust collection using NumpyIndex. I tried to index it by following the example given in the PyTerrier documentation:
index = pyterrier_dr.NumpyIndex('./indices/robust_MB')
files = pt.io.find_files("path/to/robust")
idx_pipeline = model >> index
idx_pipeline.index(files)
But it doesn't work. I think it's because I need to pass the fields and meta when indexing, but NumpyIndex doesn't accept these parameters.
Any help is welcome. Thanks.
TctColBert.reverse()'s function
Hey,
I'm a bit confused about what the method TctColBert.reverse() does. Is it used by any other method internally?
Also, in this line of code:
Q = Q[:, 4:, :].mean(dim=1) # remove the first 4 tokens (representing [CLS] [ Q ]), and average
Seems stupid to ask, but why is [CLS] [ Q ] represented in 4 tokens and not 2?
Thank you!
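For reference, the quoted line drops the first four token positions and averages the rest. The sketch below reproduces that operation with numpy on a toy array; the shapes and values are made up for illustration. A prefix length of 4 would be consistent with a tokenizer that emits [CLS] and then splits "[Q]" into three word pieces ("[", "Q", "]"), but that reading is an assumption and is not confirmed in this thread.

```python
import numpy as np

# Toy token embeddings: 2 queries, 7 token positions, 3 dimensions.
batch, seq_len, dim = 2, 7, 3
Q = np.arange(batch * seq_len * dim, dtype=np.float32).reshape(batch, seq_len, dim)

prefix_len = 4  # leading token positions to drop, as in the quoted line

# Average the remaining token vectors into one vector per query.
pooled = Q[:, prefix_len:, :].mean(axis=1)  # shape: (batch, dim)
```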
index should have a get_embs() transformer
Which calls .docnos_and_data() if needed.
Memory requirement for TorchIndex
Hi,
NumpyIndex does not require the full index to be in memory, but can the same be said of TorchIndex?
I'm trying to run an experiment on the MS MARCO dev set with TorchIndex and I get an error saying CUDA is out of memory.
Thanks for any help!
Here's the code:
# Training
model = pyterrier_dr.TctColBert(model_name='distilbert-base-uncased')
dataset = pt.get_dataset('irds:msmarco-passage/train/judged')
model.fit(dataset=dataset, steps=1000)
# Evaluation
dataset = pt.get_dataset('irds:msmarco-passage/dev/judged')
index = pyterrier_dr.TorchIndex('index.torch')
idx_pipeline = model >> index
idx_pipeline.index(dataset.get_corpus_iter())
retr_pipeline = model >> index
pt.Experiment(
    [retr_pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg", "ndcg_cut_10"]
)
And I get this error:
Traceback (most recent call last):
  File "/XXX/DR_cuda/DR_cuda.py", line 22, in <module>
    pt.Experiment(
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier/pipelines.py", line 450, in Experiment
    time, evalMeasuresDict = _run_and_evaluate(
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier/pipelines.py", line 170, in _run_and_evaluate
    res = system.transform(topics)
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier/ops.py", line 335, in transform
    topics = m.transform(topics)
  File "/XXX/dr_env/lib/python3.8/site-packages/pyterrier_dr/indexes.py", line 676, in transform
    scores = query_vecs @ self._cuda_data[:bsize].T
RuntimeError: CUDA out of memory. Tried to allocate 33.70 GiB (GPU 0; 10.92 GiB total capacity; 1.63 GiB already allocated; 8.27 GiB free; 2.00 GiB reserved in total by PyTorch)
srun: error: gpu-nc06: task 0: Exited with exit code 1
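The allocation that fails is the score matrix, whose size grows with the number of queries times the number of documents in the current batch. One way to bound it, sketched below with numpy, is to score queries in batches as well and keep only a running top-k per query. This is not how pyterrier_dr implements TorchIndex; `batched_topk` and its parameters are hypothetical names for illustration.

```python
import numpy as np

def batched_topk(query_vecs, doc_vecs, k=10, q_batch=2, d_batch=4):
    """Exact top-k dot-product search that only ever materialises a
    (q_batch x d_batch) block of scores plus the running top-k."""
    n_q = query_vecs.shape[0]
    top_scores = np.full((n_q, k), -np.inf, dtype=query_vecs.dtype)
    top_ids = np.full((n_q, k), -1, dtype=np.int64)
    for qs in range(0, n_q, q_batch):
        q = query_vecs[qs:qs + q_batch]
        best_s = top_scores[qs:qs + q_batch]
        best_i = top_ids[qs:qs + q_batch]
        for ds in range(0, doc_vecs.shape[0], d_batch):
            block = q @ doc_vecs[ds:ds + d_batch].T
            ids = np.arange(ds, ds + block.shape[1])
            # Merge the new block with the current best and keep the k largest.
            cand_s = np.concatenate([best_s, block], axis=1)
            cand_i = np.concatenate([best_i, np.broadcast_to(ids, block.shape)], axis=1)
            order = np.argsort(-cand_s, axis=1)[:, :k]
            best_s = np.take_along_axis(cand_s, order, axis=1)
            best_i = np.take_along_axis(cand_i, order, axis=1)
        top_scores[qs:qs + q_batch] = best_s
        top_ids[qs:qs + q_batch] = best_i
    return top_scores, top_ids

# Toy data: 5 queries and 20 documents with 8-dimensional vectors.
rng = np.random.default_rng(0)
queries = rng.standard_normal((5, 8))
docs = rng.standard_normal((20, 8))
scores, ids = batched_topk(queries, docs, k=3)
```

With this layout the peak score memory is set by `q_batch * d_batch` rather than by the total number of queries, at the cost of more, smaller matrix multiplications.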
Error code 139
Hi,
While trying to index the Arguana collection and then run an experiment on it, I encounter error code 139. From what I understand it's a segmentation violation error, but I don't understand where it is coming from.
Here's my code:
from pyterrier.measures import RR  # needed for RR@10 below

dataset = pt.get_dataset('irds:beir/arguana')
index = pyterrier_dr.NumpyIndex('index_arg.np')
idx_pipeline = model >> index
idx_pipeline.index(dataset.get_corpus_iter())
retr_pipeline = model >> index
pt.Experiment(
    [retr_pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", RR@10, "ndcg", "ndcg_cut_10"]
)
and I get this:
beir/arguana documents: 100%|██████████| 8674/8674 [15:44<00:00, 9.19it/s]
srun: error: gpu-nc07: task 0: Exited with exit code 139
If anyone knows anything about this please let me know.
Thank you.
tests & gh action
Should add some unit tests and a GitHub Action to run them.