njunlp / knn-box
an easy-to-use knn-mt toolkit
License: MIT License
Traceback (most recent call last):
File "/home/nlp/anaconda3/envs/knn/lib/python3.7/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 564, in run_script
exec(code, module.__dict__)
File "/home/nlp/y2020/yzh/knn-box-master/knnbox-scripts/vanilla-knn-mt-visual/src/app.py", line 215, in <module>
knn_main()
File "/home/nlp/y2020/yzh/knn-box-master/knnbox-scripts/vanilla-knn-mt-visual/src/app.py", line 126, in knn_main
k = 1, lambda=0.0, temperature=1.0
File "/home/nlp/y2020/yzh/knn-box-master/knnbox-scripts/vanilla-knn-mt-visual/../../knnbox-scripts/vanilla-knn-mt-visual/src/function.py", line 510, in translate_using_knn_model
assert len(results) == 1, "interactive mode, should have only one sentence"
AssertionError: interactive mode, should have only one sentence
The above problem appeared after coming back to the toolkit after a while.
Hello, I ran into this error when running vanilla-knn-mt/build_datastore.sh with your code. My base NMT model was trained with the latest fairseq v0.12. When loading the model parameters, the following error is raised:
AttributeError: 'NoneType' object has no attribute 'user_dir'
This error occurs because the new version of fairseq no longer uses state["args"]; it has been changed to state["cfg"].
When I change the code to state["cfg"], another error is raised:
In 0.10, state["args"] is of type argparse.Namespace.
Fixing this seems to require changes in many places. This is just one of the version-mismatch problems I have found, and I expect there will be more like it; the toolkit does not seem very friendly to models trained with newer fairseq versions. Is there a good way to solve this? Thanks.
Thank you for pointing this out. After comparing the checkpoint files, we confirmed the behavior you describe. You can make a small modification to the checkpoint, as in the example below, so that it loads normally; the kNN-BOX code itself does not need to change:
import torch
new_version_ckpt = torch.load("<path to the checkpoint saved by the new fairseq version>")
new_version_ckpt["args"] = new_version_ckpt["cfg"]["model"]
torch.save(new_version_ckpt, "<path for the checkpoint converted for the old version>")
Please fill in the file names inside the angle brackets as needed. Some additional notes: in the new version, fairseq saves more configuration information into the checkpoint; what the old ckpt["args"] contained (only the model configuration) is now stored in ckpt["cfg"]["model"], and its type is still argparse.Namespace.
Originally posted by @Maxwell-Lyu in #18 (comment)
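If it is unclear whether a given checkpoint needs this conversion, a minimal check is to inspect which configuration key the file carries; the path below is a placeholder:

import torch

ckpt = torch.load("<path to checkpoint>", map_location="cpu")
if ckpt.get("args") is not None:
    print("old-style checkpoint: kNN-BOX can load it directly")
elif "cfg" in ckpt:
    print("new-style checkpoint: copy ckpt['cfg']['model'] into ckpt['args'] first")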
Thanks! After converting the model, running build_datastore.sh produced another error, shown below.
In previous work, I used fairseq.logging.meters.StopwatchMeter to time NMT prediction and KNN retrieval, roughly in this flow:
timerNMT.start()
x = NMTModel(input)
timerNMT.stop()
timerKNN.start()
knn_retrieve(x)
timerKNN.stop()
But the results were very strange: even with a very small datastore, the GPU-accelerated faiss KNN retrieval took up most of the inference time. I later found that retrieve_k_nearest in https://github.com/NJUNLP/knn-box/blob/master/knnbox/retriever/utils.py temporarily moves the tensor to the CPU, and this operation first performs a CUDA synchronization: it has to wait for all of the tensor's preceding operators to finish before the value can be copied to the CPU.
Then I realized that, according to the PyTorch documentation, CUDA kernels are executed asynchronously by default. In other words, the CUDA operations may not have completed by the time the Python-side function returns. I changed the timing flow to:
torch.cuda.synchronize()
timerNMT.start()
x = NMTModel(input)
torch.cuda.synchronize()
timerNMT.stop()
torch.cuda.synchronize()
timerKNN.start()
knn_retrieve(x)
torch.cuda.synchronize()
timerKNN.stop()
and the results looked much more plausible.
Another, simpler example (run on my RTX8000 GPU):
import torch
import time
a = torch.randn(10000,100000,device='cuda')
b = torch.randn(100000,10000,device='cuda')
#torch.cuda.synchronize()
start = time.perf_counter()
c = a@b
#torch.cuda.synchronize()
print(time.perf_counter() - start)
Output: 0.27494998497422785
After uncommenting the torch.cuda.synchronize() calls:
import torch
import time
a = torch.randn(10000,100000,device='cuda')
b = torch.randn(100000,10000,device='cuda')
torch.cuda.synchronize()
start = time.perf_counter()
c = a@b
torch.cuda.synchronize()
print(time.perf_counter() - start)
Output: 2.4091989540029317
In other words, if we want to measure how long an algorithm takes, we should perform a CUDA stream synchronization both before starting and before stopping the timer, so that the measured time is the actual time from running the code under test to producing its result.
So I modified fairseq's StopwatchMeter, adding a CUDA synchronization in start and stop:
import time
import torch

class StopWatchTimer:
    '''
    Timer for time measurement.
    '''
    def __init__(self, cudaSyncOnEvents=False, cudaStream: torch.cuda.Stream = None) -> None:
        '''
        Args:
            cudaSyncOnEvents: call cuda synchronize whenever start, stop, reset or elapsedTime is called
            cudaStream: if given, synchronize only this stream instead of the whole device
        '''
        self.startTime = None
        self.totalTime = 0
        self.itemCount = 0
        if cudaSyncOnEvents and (not torch.cuda.is_available()):
            raise RuntimeError("cuda is not available")
        self.cudaSyncOnEvents = cudaSyncOnEvents
        self.cudaStream = cudaStream
        if cudaStream:
            self.cudaSyncFunction = cudaStream.synchronize
        else:
            self.cudaSyncFunction = torch.cuda.synchronize

    def __cudaSync(self):
        if self.cudaSyncOnEvents:
            self.cudaSyncFunction()

    def start(self):
        self.__cudaSync()
        self.startTime = time.perf_counter()

    def stop(self, itemCount=0):
        if self.startTime is not None:
            self.__cudaSync()
            dtime = time.perf_counter() - self.startTime
            self.totalTime += dtime
            self.itemCount += itemCount

    def reset(self):
        self.itemCount = 0
        self.totalTime = 0
        self.start()

    def elapsedTime(self) -> float:
        if self.startTime is None:
            return 0.0
        self.__cudaSync()
        return time.perf_counter() - self.startTime
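An alternative worth noting is timing with torch.cuda.Event, which records markers on the CUDA stream and measures GPU time between them; a minimal sketch:

import torch

# Sketch: time a CUDA region with events instead of host-side synchronize calls.
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

a = torch.randn(10000, 1000, device='cuda')
b = torch.randn(1000, 10000, device='cuda')

start_evt.record()                              # enqueue the start marker
c = a @ b                                       # the operation being measured
end_evt.record()                                # enqueue the end marker
torch.cuda.synchronize()                        # wait until both events have actually happened
print(start_evt.elapsed_time(end_evt), "ms")    # GPU time between the two markers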
Hmm, this might actually be a fairseq problem.....
Hopefully this can help some of the work on speeding up KNN-MT.
Thanks for sharing the great tool.
I am wondering if the tool supports multi-processing when saving keys and values to the datastore (i.e. multi-GPU inference and key/value saving). It may help for huge datastore applications.
The "dump" and "load" in knn-box is not symmetric when it comes to faiss_index
build_faiss_index
method saves faiss_index shape to config.jsondump
method does notload
method tries to load faiss_index shape from config.jsonfaiss's Index class saves vector dimention and vector counts in faiss_index file, knn-box need not to save them.
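As a small illustration of the last point, the dimension and count can be read straight back from a dumped index file; the path below is a placeholder:

import faiss

# faiss stores the vector dimension and count inside the index file itself.
index = faiss.read_index("<path to dumped faiss_index file>")
print(index.d)        # vector dimension
print(index.ntotal)   # number of stored vectors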
Traceback (most recent call last):
File "/data0/lvyz/knn-box/knnbox-scripts/plac-knn-mt/../../knnbox-scripts/plac-knn-mt/save_drop_index.py", line 39, in <module>
mt_known = retriever.retrieve(query=query, return_list=["mt_known"])["mt_known"]
File "/data0/lvyz/knn-box/knnbox/retriever/retriever.py", line 21, in retrieve
self.datastore.load_faiss_index("keys", move_to_gpu=True)
File "/data0/lvyz/knn-box/knnbox/datastore/datastore.py", line 155, in load_faiss_index
shape = config["data_infos"][filename]["faiss_index_shape"]
KeyError: 'faiss_index_shape'
Before this step, I had already successfully run "Run base neural machine translation model (our baseline)". The error reported is "You should set pad mask first". The source code is
My environment is Windows 10 with torch 1.13.
The command I ran is:
python knnbox-scripts/common/validate.py data-bin/medical --task translation --path pretrain-models/wmt19.de-en.ffn8192.pt --model-overrides "{'eval_bleu': False, 'required_seq_len_multiple':1, 'load_alignments': False}" --dataset-impl mmap --valid-subset train --skip-invalid-size-inputs-valid-test --max-tokens 4096 --bpe fastbpe --user-dir knnbox/models --arch "adaptive_knn_mt@transformer_wmt19_de_en" --knn-mode "build_datastore" --knn-datastore-path datastore/vanilla/medical
Could you advise how to set the pad mask?
A question: my understanding of knn-mt is that the trained model runs inference over the training data in order to record key-value pairs. But is it possible that the output at inference time does not exactly match the reference labels, leading to mismatched or differing numbers of keys and values in the datastore? Or is this already handled and I have simply not understood the code well enough?
The fairseq version specified in the code is 0.10.0; can the toolkit be used with fairseq 0.12.0?
What is the cause of the following problem?
2023-02-09 08:23:10 | INFO | fairseq.tasks.translation | /data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../data-bin/zh2en-ziyan-03 train zh-en 721 examples
2023-02-09 08:23:11 | INFO | train | | valid on 'train' subset | loss 2.994 | nll_loss 1.346 | ppl 2.54 | wps 0 | wpb 16975 | bsz 721
[vals.npy: (16975,) saved successfully ^_^ ]
||| {'vals': <knnbox.common_utils.memmap.Memmap object at 0x7f9b55f62990>} <class 'knnbox.common_utils.memmap.Memmap'>
Traceback (most recent call last):
File "/data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../knnbox-scripts/common/validate.py", line 252, in <module>
cli_main()
File "/data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../knnbox-scripts/common/validate.py", line 246, in cli_main
distributed_utils.call_main(args, main, override_args=override_args)
File "/data/home/likai/NMT-offline/knn-box/fairseq/distributed_utils.py", line 301, in call_main
main(args, **kwargs)
File "/data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../knnbox-scripts/common/validate.py", line 192, in main
datastore.build_faiss_index("keys", use_gpu=(not args.build_faiss_index_with_cpu)) # build faiss index
File "/data/home/likai/NMT-offline/knn-box/knnbox/datastore/datastore.py", line 177, in build_faiss_index
if not isinstance(self.datas[name], Memmap):
KeyError: 'keys'
The KNN-MT paper (i.e. Vanilla KNN-MT) reports that adding KNN retrieval makes generation about two orders of magnitude slower.
Using knn-box on an RTX8000, I measured inference speed on the IT domain translation task, following the commands in the README:
Baseline (NMT only): 14.1s
Vanilla KNN-MT: 17.7s
I noticed that some work built on the original KNN-MT code, e.g. Revised-Key-KNNMT, runs with an overall GPU utilization of only about 25%. The authors attribute this to "retrieving from the datastore requires moving the vectors to the CPU first, and then moving the retrieved representations back to the GPU". But https://github.com/NJUNLP/knn-box/blob/master/knnbox/retriever/utils.py appears to do the same thing, yet when running Vanilla KNN-MT the GPU utilization stays close to 100%. Does the original KNN-MT code at https://github.com/urvashik/knnmt suffer from low GPU utilization? (I could not get the original KNN-MT code running; I do not know what the HOME environment variable should be set to, so I can only come here and ask.)
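For reference, the CPU round-trip described above looks roughly like the following sketch (not the actual knn-box implementation; the function name is illustrative):

import faiss
import torch

def retrieve_cpu_roundtrip(index: faiss.Index, query: torch.Tensor, k: int = 8):
    # faiss search on a CPU index needs a float32 numpy array, so the query must
    # leave the GPU (this copy implicitly synchronizes the CUDA stream) ...
    q = query.detach().cpu().float().numpy()
    distances, ids = index.search(q, k)
    # ... and the retrieved distances/ids are copied back to the GPU afterwards.
    return (torch.from_numpy(distances).to(query.device),
            torch.from_numpy(ids).to(query.device))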
If the original KNN-MT is two orders of magnitude slower than the baseline only because the original code is poorly optimized, while something like knn-box that keeps the GPU near 100% utilization does not have this problem, is research on speeding up KNN-MT still worthwhile?...
Hello, I would like to ask a question.
When I run bash train_metak.sh for adaptive knn-mt,
it fails with the following error:
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptrfaiss::gpu::GpuResources, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /root/miniconda3/conda-bld/faiss-pkg_1669821591485/work/faiss/gpu/GpuIndex.cu:58: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 0
It looks as if the GPU is not being recognized, but nvidia-smi shows GPU indices 0 and 1.
Also, vanilla knn-mt runs fine.
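One quick check is to compare how many GPUs torch and faiss each see; the "Invalid GPU device 0" assertion suggests faiss itself reports zero devices (e.g. a CPU-only faiss build), even though nvidia-smi and torch see them:

import torch
import faiss

# Compare device visibility between torch and faiss.
print("torch sees:", torch.cuda.device_count(), "GPU(s)")
print("faiss sees:", faiss.get_num_gpus(), "GPU(s)")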
Hi, thanks for creating and sharing this codebase, it has been really helpful to me.
I'm interested in replicating multilingual experiments from your paper (Table 2) but I'm having some issues.
Are my following assumptions correct:
Are there any other details I should be aware of to reproduce your results?
When I test the base model it gives me a BLEU score of 32.5, but when I use vanilla-knn it is always 0.
To build the datastore I used this configuration:
CUDA_VISIBLE_DEVICES=1 python $PROJECT_PATH/knnbox-scripts/common/validate.py $DATA_PATH \
    --task translation \
    --path $BASE_MODEL \
    --source-lang en --target-lang ar \
    --model-overrides "{'eval_bleu': False, 'required_seq_len_multiple':1, 'load_alignments': False}" \
    --dataset-impl mmap \
    --valid-subset train \
    --skip-invalid-size-inputs-valid-test \
    --max-tokens 2048 \
    --bpe fastbpe \
    --user-dir $PROJECT_PATH/knnbox/models \
    --arch vanilla_knn_mt@transformer_wmt19_de_en \
    --knn-mode build_datastore \
    --knn-datastore-path $DATASTORE_SAVE_PATH
and to test the model I used this configuration:
CUDA_VISIBLE_DEVICES=1 python $PROJECT_PATH/knnbox-scripts/common/generate.py $DATA_PATH \
    --task translation \
    --path $BASE_MODEL \
    --dataset-impl mmap \
    --beam 4 --lenpen 0.6 --max-len-a 1.2 --max-len-b 10 --source-lang en --target-lang ar \
    --gen-subset test \
    --max-tokens 2048 \
    --encoder-embed-dim 768 \
    --decoder-embed-dim 768 \
    --dropout 0.2 \
    --attention-dropout 0.0 \
    --encoder-layerdrop 0 \
    --decoder-layerdrop 0 \
    --encoder-ffn-embed-dim 2048 \
    --decoder-ffn-embed-dim 2048 \
    --scoring sacrebleu \
    --tokenizer moses \
    --remove-bpe \
    --user-dir $PROJECT_PATH/knnbox/models \
    --arch vanilla_knn_mt@transformer_wmt19_de_en \
    --knn-mode inference \
    --knn-datastore-path $DATASTORE_LOAD_PATH \
    --knn-k 8 \
    --knn-lambda 0.7 \
    --knn-temperature 10.0
Thanks for this repo gathering the knn-series code. I have successfully reproduced vanilla-knn under its guidance.
But an error when trying to reproduce Adaptive-knn at [stage 2. train meta-k network] has been bothering me for a long time; the error info shows:
train.py: error: argument --arch/-a: invalid choice: 'transformer_wmt19_de_en' (choose from 'transformer_tiny', 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big' ... )
It seems there is no arch named 'transformer_wmt19_de_en', which is what the script passes as the argument. I wonder if it is an architecture you modified yourselves or one from an older version of fairseq? (I noticed you recommend fairseq 0.12.2, and I'm pretty sure I installed it successfully.) From the list I can only find 'transformer_wmt_en_de', which is not compatible with the wmt19.de-en.ffn8192.pt file either, and the other 'de_en' arches only relate to IWSLT datasets... (T_T)