njunlp / knn-box
an easy-to-use knn-mt toolkit
License: MIT License
Traceback (most recent call last):
File "/home/nlp/anaconda3/envs/knn/lib/python3.7/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 564, in run_script
exec(code, module.__dict__)
File "/home/nlp/y2020/yzh/knn-box-master/knnbox-scripts/vanilla-knn-mt-visual/src/app.py", line 215, in <module>
knn_main()
File "/home/nlp/y2020/yzh/knn-box-master/knnbox-scripts/vanilla-knn-mt-visual/src/app.py", line 126, in knn_main
k = 1, lambda=0.0, temperature=1.0
File "/home/nlp/y2020/yzh/knn-box-master/knnbox-scripts/vanilla-knn-mt-visual/../../knnbox-scripts/vanilla-knn-mt-visual/src/function.py", line 510, in translate_using_knn_model
assert len(results) == 1, "interactive mode, should have only one sentence"
AssertionError: interactive mode, should have only one sentence
The above problem appeared after coming back to the toolkit after a while.
Hello, I ran into this error when running vanilla-knn-mt/build_datastore.sh with your code. My base NMT model was trained with the latest fairseq v0.12. When loading the model parameters, the following error is raised:
AttributeError: 'NoneType' object has no attribute 'user_dir'
This error occurs because the new version of fairseq no longer uses state["args"]; it has been changed to state["cfg"].
When I change the code to state["cfg"], another error is raised:
In 0.10, state["args"] is of type argparse.Namespace.
Fixing this seems to require changes in many places. This is just one of the version-mismatch problems I have found, and I expect there will be more like it; the toolkit does not seem very friendly to models trained with newer fairseq versions. Is there a good way to solve this? Thanks.
Thank you for pointing this out. After comparing the checkpoint files, we confirmed the behavior you describe. You can make a small modification to the checkpoint, as in the example below, so that it loads normally; the kNN-BOX code itself does not need to change:
import torch
new_version_ckpt = torch.load("<path to the checkpoint saved by the new fairseq version>")
new_version_ckpt["args"] = new_version_ckpt["cfg"]["model"]
torch.save(new_version_ckpt, "<path for the checkpoint converted for the old version>")
Please fill in the file names inside the angle brackets as needed. Some additional notes: in the new version, fairseq saves more configuration information into the checkpoint; what the old ckpt["args"] contained (only the model configuration) is now stored in ckpt["cfg"]["model"], and its type is still argparse.Namespace.
Originally posted by @Maxwell-Lyu in #18 (comment)
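If it is unclear whether a given checkpoint needs this conversion, a minimal check is to inspect which configuration key the file carries; the path below is a placeholder:

import torch

ckpt = torch.load("<path to checkpoint>", map_location="cpu")
if ckpt.get("args") is not None:
    print("old-style checkpoint: kNN-BOX can load it directly")
elif "cfg" in ckpt:
    print("new-style checkpoint: copy ckpt['cfg']['model'] into ckpt['args'] first")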
Thanks! After converting the model, running build_datastore.sh produced another error, shown below.
In previous work, I used fairseq.logging.meters.StopwatchMeter to time NMT prediction and KNN retrieval, roughly in this flow:
timerNMT.start()
x = NMTModel(input)
timerNMT.stop()
timerKNN.start()
knn_retrieve(x)
timerKNN.stop()
But the results were very strange: even with a very small datastore, the GPU-accelerated faiss KNN retrieval took up most of the inference time. I later found that retrieve_k_nearest in https://github.com/NJUNLP/knn-box/blob/master/knnbox/retriever/utils.py temporarily moves the tensor to the CPU, and this operation first performs a CUDA synchronization: it has to wait for all of the tensor's preceding operators to finish before the value can be copied to the CPU.
Then I realized that, according to the PyTorch documentation, CUDA kernels are executed asynchronously by default. In other words, the CUDA operations may not have completed by the time the Python-side function returns. I changed the timing flow to:
torch.cuda.synchronize()
timerNMT.start()
x = NMTModel(input)
torch.cuda.synchronize()
timerNMT.stop()
torch.cuda.synchronize()
timerKNN.start()
knn_retrieve(x)
torch.cuda.synchronize()
timerKNN.stop()
and the results looked much more plausible.
Another, simpler example (run on my RTX8000 GPU):
import torch
import time
a = torch.randn(10000,100000,device='cuda')
b = torch.randn(100000,10000,device='cuda')
#torch.cuda.synchronize()
start = time.perf_counter()
c = a@b
#torch.cuda.synchronize()
print(time.perf_counter() - start)
Output: 0.27494998497422785
After uncommenting the torch.cuda.synchronize() calls:
import torch
import time
a = torch.randn(10000,100000,device='cuda')
b = torch.randn(100000,10000,device='cuda')
torch.cuda.synchronize()
start = time.perf_counter()
c = a@b
torch.cuda.synchronize()
print(time.perf_counter() - start)
Output: 2.4091989540029317
In other words, if we want to measure how long an algorithm takes, we should perform a CUDA stream synchronization both before starting and before stopping the timer, so that the measured time is the actual time from running the code under test to producing its result.
So I modified fairseq's StopwatchMeter, adding a CUDA synchronization in start and stop:
import time
import torch

class StopWatchTimer:
    '''
    Timer for time measurement.
    '''
    def __init__(self, cudaSyncOnEvents=False, cudaStream: torch.cuda.Stream = None) -> None:
        '''
        Args:
            cudaSyncOnEvents: call cuda synchronize whenever start, stop, reset or elapsedTime is called
            cudaStream: if given, synchronize only this stream instead of the whole device
        '''
        self.startTime = None
        self.totalTime = 0
        self.itemCount = 0
        if cudaSyncOnEvents and (not torch.cuda.is_available()):
            raise RuntimeError("cuda is not available")
        self.cudaSyncOnEvents = cudaSyncOnEvents
        self.cudaStream = cudaStream
        if cudaStream:
            self.cudaSyncFunction = cudaStream.synchronize
        else:
            self.cudaSyncFunction = torch.cuda.synchronize

    def __cudaSync(self):
        if self.cudaSyncOnEvents:
            self.cudaSyncFunction()

    def start(self):
        self.__cudaSync()
        self.startTime = time.perf_counter()

    def stop(self, itemCount=0):
        if self.startTime is not None:
            self.__cudaSync()
            dtime = time.perf_counter() - self.startTime
            self.totalTime += dtime
            self.itemCount += itemCount

    def reset(self):
        self.itemCount = 0
        self.totalTime = 0
        self.start()

    def elapsedTime(self) -> float:
        if self.startTime is None:
            return 0.0
        self.__cudaSync()
        return time.perf_counter() - self.startTime
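An alternative worth noting is timing with torch.cuda.Event, which records markers on the CUDA stream and measures GPU time between them; a minimal sketch:

import torch

# Sketch: time a CUDA region with events instead of host-side synchronize calls.
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

a = torch.randn(10000, 1000, device='cuda')
b = torch.randn(1000, 10000, device='cuda')

start_evt.record()                              # enqueue the start marker
c = a @ b                                       # the operation being measured
end_evt.record()                                # enqueue the end marker
torch.cuda.synchronize()                        # wait until both events have actually happened
print(start_evt.elapsed_time(end_evt), "ms")    # GPU time between the two markers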
Hmm, this might actually be a fairseq problem.....
Hopefully this can help some of the work on speeding up KNN-MT.
Thanks for sharing the great tool.
I am wondering if the tool supports multi-processing when saving keys and values to the datastore (i.e. multi-GPU inference and key/value saving). It may help for huge datastore applications.
The "dump" and "load" in knn-box is not symmetric when it comes to faiss_index
build_faiss_index
method saves faiss_index shape to config.jsondump
method does notload
method tries to load faiss_index shape from config.jsonfaiss's Index class saves vector dimention and vector counts in faiss_index file, knn-box need not to save them.
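As a small illustration of the last point, the dimension and count can be read straight back from a dumped index file; the path below is a placeholder:

import faiss

# faiss stores the vector dimension and count inside the index file itself.
index = faiss.read_index("<path to dumped faiss_index file>")
print(index.d)        # vector dimension
print(index.ntotal)   # number of stored vectors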
Traceback (most recent call last):
File "/data0/lvyz/knn-box/knnbox-scripts/plac-knn-mt/../../knnbox-scripts/plac-knn-mt/save_drop_index.py", line 39, in <module>
mt_known = retriever.retrieve(query=query, return_list=["mt_known"])["mt_known"]
File "/data0/lvyz/knn-box/knnbox/retriever/retriever.py", line 21, in retrieve
self.datastore.load_faiss_index("keys", move_to_gpu=True)
File "/data0/lvyz/knn-box/knnbox/datastore/datastore.py", line 155, in load_faiss_index
shape = config["data_infos"][filename]["faiss_index_shape"]
KeyError: 'faiss_index_shape'
Before this step, I had already successfully run "Run base neural machine translation model (our baseline)". The error reported is "You should set pad mask first". The source code is
My environment is Windows 10 with torch 1.13.
The command I ran is:
python knnbox-scripts/common/validate.py data-bin/medical --task translation --path pretrain-models/wmt19.de-en.ffn8192.pt --model-overrides "{'eval_bleu': False, 'required_seq_len_multiple':1, 'load_alignments': False}" --dataset-impl mmap --valid-subset train --skip-invalid-size-inputs-valid-test --max-tokens 4096 --bpe fastbpe --user-dir knnbox/models --arch "adaptive_knn_mt@transformer_wmt19_de_en" --knn-mode "build_datastore" --knn-datastore-path datastore/vanilla/medical
Could you advise how to set the pad mask?
A question: my understanding of knn-mt is that the trained model runs inference over the training data in order to record key-value pairs. But is it possible that the output at inference time does not exactly match the reference labels, leading to mismatched or differing numbers of keys and values in the datastore? Or is this already handled and I have simply not understood the code well enough?
The fairseq version specified in the code is 0.10.0; can the toolkit be used with fairseq 0.12.0?
What is the cause of the following problem?
2023-02-09 08:23:10 | INFO | fairseq.tasks.translation | /data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../data-bin/zh2en-ziyan-03 train zh-en 721 examples
2023-02-09 08:23:11 | INFO | train | | valid on 'train' subset | loss 2.994 | nll_loss 1.346 | ppl 2.54 | wps 0 | wpb 16975 | bsz 721
[vals.npy: (16975,) saved successfully ^_^ ]
||| {'vals': <knnbox.common_utils.memmap.Memmap object at 0x7f9b55f62990>} <class 'knnbox.common_utils.memmap.Memmap'>
Traceback (most recent call last):
File "/data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../knnbox-scripts/common/validate.py", line 252, in <module>
cli_main()
File "/data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../knnbox-scripts/common/validate.py", line 246, in cli_main
distributed_utils.call_main(args, main, override_args=override_args)
File "/data/home/likai/NMT-offline/knn-box/fairseq/distributed_utils.py", line 301, in call_main
main(args, **kwargs)
File "/data/home/likai/NMT-offline/knn-box/knnbox-scripts/vanilla-knn-mt/../../knnbox-scripts/common/validate.py", line 192, in main
datastore.build_faiss_index("keys", use_gpu=(not args.build_faiss_index_with_cpu)) # build faiss index
File "/data/home/likai/NMT-offline/knn-box/knnbox/datastore/datastore.py", line 177, in build_faiss_index
if not isinstance(self.datas[name], Memmap):
KeyError: 'keys'
The KNN-MT paper (i.e. Vanilla KNN-MT) reports that adding KNN retrieval makes generation about two orders of magnitude slower.
Using knn-box on an RTX8000, I measured inference speed on the IT domain translation task, following the commands in the README:
Baseline (NMT only): 14.1s
Vanilla KNN-MT: 17.7s
I noticed that some work built on the original KNN-MT code, e.g. Revised-Key-KNNMT, runs with an overall GPU utilization of only about 25%. The authors attribute this to "retrieving from the datastore requires moving the vectors to the CPU first, and then moving the retrieved representations back to the GPU". But https://github.com/NJUNLP/knn-box/blob/master/knnbox/retriever/utils.py appears to do the same thing, yet when running Vanilla KNN-MT the GPU utilization stays close to 100%. Does the original KNN-MT code at https://github.com/urvashik/knnmt suffer from low GPU utilization? (I could not get the original KNN-MT code running; I do not know what the HOME environment variable should be set to, so I can only come here and ask.)
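For reference, the CPU round-trip described above looks roughly like the following sketch (not the actual knn-box implementation; the function name is illustrative):

import faiss
import torch

def retrieve_cpu_roundtrip(index: faiss.Index, query: torch.Tensor, k: int = 8):
    # faiss search on a CPU index needs a float32 numpy array, so the query must
    # leave the GPU (this copy implicitly synchronizes the CUDA stream) ...
    q = query.detach().cpu().float().numpy()
    distances, ids = index.search(q, k)
    # ... and the retrieved distances/ids are copied back to the GPU afterwards.
    return (torch.from_numpy(distances).to(query.device),
            torch.from_numpy(ids).to(query.device))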
If the original KNN-MT is two orders of magnitude slower than the baseline only because the original code is poorly optimized, while something like knn-box that keeps the GPU near 100% utilization does not have this problem, is research on speeding up KNN-MT still worthwhile?...
Hello, I would like to ask a question.
When I run bash train_metak.sh for adaptive knn-mt,
it fails with the following error:
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptrfaiss::gpu::GpuResources, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /root/miniconda3/conda-bld/faiss-pkg_1669821591485/work/faiss/gpu/GpuIndex.cu:58: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 0
It looks as if the GPU is not being recognized, but nvidia-smi shows GPU indices 0 and 1.
Also, vanilla knn-mt runs fine.
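One quick check is to compare how many GPUs torch and faiss each see; the "Invalid GPU device 0" assertion suggests faiss itself reports zero devices (e.g. a CPU-only faiss build), even though nvidia-smi and torch see them:

import torch
import faiss

# Compare device visibility between torch and faiss.
print("torch sees:", torch.cuda.device_count(), "GPU(s)")
print("faiss sees:", faiss.get_num_gpus(), "GPU(s)")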
Hi, thanks for creating and sharing this codebase, it has been really helpful to me.
I'm interested in replicating multilingual experiments from your paper (Table 2) but I'm having some issues.
Are my following assumptions correct:
Are there any other details I should be aware of to reproduce your results?
When I test the base model it gives me a BLEU score of 32.5, but when I use vanilla-knn it is always 0.
To build the datastore I used this configuration:
CUDA_VISIBLE_DEVICES=1 python $PROJECT_PATH/knnbox-scripts/common/validate.py $DATA_PATH \
    --task translation \
    --path $BASE_MODEL \
    --source-lang en --target-lang ar \
    --model-overrides "{'eval_bleu': False, 'required_seq_len_multiple':1, 'load_alignments': False}" \
    --dataset-impl mmap \
    --valid-subset train \
    --skip-invalid-size-inputs-valid-test \
    --max-tokens 2048 \
    --bpe fastbpe \
    --user-dir $PROJECT_PATH/knnbox/models \
    --arch vanilla_knn_mt@transformer_wmt19_de_en \
    --knn-mode build_datastore \
    --knn-datastore-path $DATASTORE_SAVE_PATH
and to test the model I used this configuration:
CUDA_VISIBLE_DEVICES=1 python $PROJECT_PATH/knnbox-scripts/common/generate.py $DATA_PATH \
    --task translation \
    --path $BASE_MODEL \
    --dataset-impl mmap \
    --beam 4 --lenpen 0.6 --max-len-a 1.2 --max-len-b 10 --source-lang en --target-lang ar \
    --gen-subset test \
    --max-tokens 2048 \
    --encoder-embed-dim 768 \
    --decoder-embed-dim 768 \
    --dropout 0.2 \
    --attention-dropout 0.0 \
    --encoder-layerdrop 0 \
    --decoder-layerdrop 0 \
    --encoder-ffn-embed-dim 2048 \
    --decoder-ffn-embed-dim 2048 \
    --scoring sacrebleu \
    --tokenizer moses \
    --remove-bpe \
    --user-dir $PROJECT_PATH/knnbox/models \
    --arch vanilla_knn_mt@transformer_wmt19_de_en \
    --knn-mode inference \
    --knn-datastore-path $DATASTORE_LOAD_PATH \
    --knn-k 8 \
    --knn-lambda 0.7 \
    --knn-temperature 10.0
Thanks for this repo gathering the knn-series code. I have successfully reproduced vanilla-knn under its guidance.
But an error when trying to reproduce Adaptive-knn at [stage 2. train meta-k network] has been bothering me for a long time; the error info shows:
train.py: error: argument --arch/-a: invalid choice: 'transformer_wmt19_de_en' (choose from 'transformer_tiny', 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big' ... )
It seems there is no arch named 'transformer_wmt19_de_en', which is what the script passes as the argument. I wonder if it is an architecture you modified yourselves or one from an older version of fairseq? (I noticed you recommend fairseq 0.12.2, and I'm pretty sure I installed it successfully.) From the list I can only find 'transformer_wmt_en_de', which is not compatible with the wmt19.de-en.ffn8192.pt file either, and the other 'de_en' arches only relate to IWSLT datasets... (T_T)