open-compass / opencompass Goto Github PK

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets.

Home Page: https://opencompass.org.cn/

License: Apache License 2.0

Python 100.00%

evaluation benchmark large-language-model chatgpt llm llama2 openai llama3

opencompass's People

Contributors

Stargazers

Watchers

Forkers

jimmyma99 ezra-yu yingfhu tau-j zimix0 zwwwayne yyk-wew wangbo-zhao zhangyuanhan-ai st-rnd apollohuang1 gaotongxiao haorand go-with-me000 lyhiving liushz mayi140611 vansin assassindesign gaoyang07 techsuni2023 jasonzess leymore apx103 lzhgrla acebot712 v-xchen-v techthiyanes yanqiangmiffy yukinowan l1183325308 expert68 kennymckormick adai-5090 anakin-skywalker-joseph itsharex tonysy ngc7292 zhouzaida arajan10 soaringsoul spurslipu felixgithub2017 ishaan-jaff fangyixiao18 eltociear lidonyx helanhu luodian tiansiyuan bestpredicts pleisto sudoxxhh cdpath runningleon jsweber xmxoxo kevinnunu ranchizhao zhangyikaii haochenye industryessentials wangxingjun778 0605 c1rn09 wangxidong06 sam-ai xlsean cuteythyme allentdan rin-nn so2liu lvhan028 hobeedzc chenbohua3 feiward ianz2020 reneguo tangbotony mahyarhabibi kustomzone cyzlovedream dzgd papayofen ruivers spico197 open-mmlab-12 hit-cwh wangfudong65 saakshii12 jingmingzhuo sanster personalized-agent hkgai bittersweet1999 frankweijue ayushrakesh odoochain shresthasurav pjlab-sys4nlp

opencompass's Issues

描述该错误

的

环境信息

的

其他信息

No response

描述该错误

解码会偶现失败最后的评估会因为数量不一致无法评估请问如何回复其中一条或者剔除这一条正常进行评估

环境信息

{'CUDA available': True,
'CUDA_HOME': '/home/work/cuda-11.3',
'GCC': 'gcc (GCC) 8.2.0',
'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-SXM4-40GB',
'MMEngine': '0.8.2',
'NVCC': 'Cuda compilation tools, release 11.3, V11.3.58',
'OpenCV': '4.8.0',
'PyTorch': '2.0.1',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2023.1-Product Build 20230303 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v2.7.3 (Git Hash '
'6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.8\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
' - CuDNN 8.7\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
'CUDNN_VERSION=8.7.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
'-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
'-DUSE_FBGEMM -DUSE_QNNPACK '
'-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
'-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
'-Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function '
'-Wno-unused-result -Wno-strict-overflow '
'-Wno-strict-aliasing '
'-Wno-error=deprecated-declarations '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]',
'TorchVision': '0.15.2',
'numpy_random_seed': 2147483648,
'opencompass': '0.1.0+4b0aa80',
'sys.platform': 'linux'}

其他信息

No response

[Feature] 请问chatglm目前支持吗，是否有相关文档参考

Describe the feature

自己试了下，报错蛮多的，想问下是否有官方文档支持清华chatglm系列

Will you implement it?

I would like to implement this feature and create a PR!

问题咨询：关于ceval和nq这两类数据是如何计算分数的？

描述该错误

ceval是单选题，每个科目题目数目不一样，模型推测出结果之后，如果做对一题就加分吗？然后再归一化为百分制？
nq之类问答题，是如何计算分数的呢？

在discussion区提问，响应比较慢，提到issue区试试。多谢。

环境信息

python3.10

其他信息

No response

在ceval,mmlu, gaokao 上都速度异常（3个任务2卡22h还未结束），单个样本速度正常。

          我遇到了类似问题，在ceval,mmlu, gaokao 上都速度异常（3个任务2卡22h还未结束），单个样本速度正常。

outputs 输出发现评估重复，比如profession_law只有170个，但输出文件中有 15 个 xxxx_mmlu_profession_law_[0-15]，每个95个样本，且样本有重复。请问是评估方式不对吗，还是 sizepartition 这里有 bug？

python run.py ./configs/eval_xxx.py --mode eval -w outputs/ --slurm --partition xxx --debug

Originally posted by @Desein-Yang in #30 (comment)

[Bug] Cannot reproduce results for internlm-7b-chat and MMLU

Describe the bug

The result I get is listed below.

dataset                      version    metric         mode    internlm-chat-7b-hf
---------------------------  ---------  -------------  ------  ---------------------
--------- 考试 Exam ---------  -          -              -       -
mmlu                         -          naive_average  gen     50.52

But according to the leaderboard, it should be 50.8

Environment

...

Other information

No response

[Bug] 显存不断重新加载并且解码速度很慢

描述该错误

使用llama_7B_hf 这个模型进行解码评测观测nvidia-smi 里面显卡的使用会不断有显存开辟释放八块a100 两个小时只解了300条速度很慢求助是否有设置上的问题
下图是模型设置

显卡观测

环境信息

其他信息

No response

[Feature] Add few-shot demo by selecting sample in Random / TopK / VoteK

Describe the feature

As proposed in #80 (comment)

Will you implement it?

I would like to implement this feature and create a PR!

[documentation] 希望增加一个完整链路级别的文档指导如何测评一个指定模型

描述该错误

https://opencompass.readthedocs.io/en/latest/advanced_guides/new_model.html
这个文档中已经有关于如何新增model支持的描述，但是对于如何完成评测一个新增model，这部分文档还比较分散，比如需要：

新增model类的几个方法的实现（以api为例）
新增对应config
希望文档能有一个整体性的描述，对小白用户比较友好

环境信息

无

其他信息

No response

[Feature] test

描述该功能

test

是否希望自己实现该功能？

我希望自己来实现这一功能，并向 OpenCompass 贡献代码！

[Bug]

描述该错误

07/28 16:33:32 - OpenCompass - INFO - Task [opencompass.models.MyModel_opencompass_customize model test/agieval-gaokao-physics]
/opt/conda/envs/opencompass/lib/python3.10/site-packages/mmengine/utils/manager.py:113: UserWarning: <class 'mmengine.logging.logger.MMLogger'> instance named of OpenCompass has been created, the method get_instance should not accept any other arguments
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 122, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/mnt/workspace/opencompass/opencompass/datasets/base.py", line 12, in init
self.dataset = self.load(**kwargs)
File "/mnt/workspace/opencompass/opencompass/datasets/agieval/agieval.py", line 52, in load
dataset = Dataset.from_list(data)
File "/opt/conda/envs/opencompass/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 951, in from_list
return cls.from_dict(mapping, features, info, split)
File "/opt/conda/envs/opencompass/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 911, in from_dict
pa_table = InMemoryTable.from_pydict(mapping=mapping)
File "/opt/conda/envs/opencompass/lib/python3.10/site-packages/datasets/table.py", line 799, in from_pydict
return cls(pa.Table.from_pydict(*args, **kwargs))
File "pyarrow/table.pxi", line 3849, in pyarrow.lib.Table.from_pydict
File "pyarrow/table.pxi", line 5401, in pyarrow.lib._from_pydict
File "pyarrow/array.pxi", line 357, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/opt/conda/envs/opencompass/lib/python3.10/site-packages/datasets/arrow_writer.py", line 189, in arrow_array
out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "pyarrow/array.pxi", line 327, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/workspace/opencompass/opencompass/tasks/openicl_infer.py", line 147, in
inferencer.run()
File "/mnt/workspace/opencompass/opencompass/tasks/openicl_infer.py", line 66, in run
self.dataset = build_dataset_from_cfg(self.dataset_cfg)
File "/mnt/workspace/opencompass/opencompass/utils/build.py", line 13, in build_dataset_from_cfg
return LOAD_DATASET.build(dataset_cfg)
File "/opt/conda/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/opt/conda/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 144, in build_from_cfg
raise type(e)(
pyarrow.lib.ArrowTypeError: class AGIEvalDataset_v2 in opencompass/datasets/agieval/agieval.py: Expected bytes, got a 'list' object

The 'agieval-gaokao-physics' dataset loads failed

环境信息

{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
'GPU 0,1,2,3': 'Tesla V100-SXM2-32GB',
'MMEngine': '0.8.2',
'NVCC': 'Cuda compilation tools, release 11.3, V11.3.109',
'OpenCV': '4.8.0',
'PyTorch': '1.13.1+cu116',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201402\n'
' - Intel(R) Math Kernel Library Version '
'2020.0.0 Product Build 20191122 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v2.6.0 (Git Hash '
'52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.6\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n'
' - CuDNN 8.3.2 (built against CUDA 11.5)\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=11.6, '
'CUDNN_VERSION=8.3.2, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
'-fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-DEDGE_PROFILER_USE_KINETO -O2 -fPIC '
'-Wno-narrowing -Wall -Wextra '
'-Werror=return-type -Werror=non-virtual-dtor '
'-Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function '
'-Wno-unused-result -Wno-strict-overflow '
'-Wno-strict-aliasing '
'-Wno-error=deprecated-declarations '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, '
'USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, '
'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, '
'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, '
'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]',
'TorchVision': '0.14.1+cu116',
'numpy_random_seed': 2147483648,
'opencompass': '0.1.0+2cce271',
'sys.platform': 'linux'}

其他信息

No response

[Bug]internLM-7b 跑openai_humaneval数据，在eval阶段跑错

描述该错误

控制台打印：

outputs/internlm_7b/20230726_232201/logs/eval/internlm-7b-hf/openai_humaneval.out文件的输入为：

完整log为：
openai_humaneval.out.txt

帮忙看下是怎么回事？多谢

环境信息

python3.10
cuda11.8

其他信息

No response

运行时，出现 No module named 'opencompass'错误

Describe the bug

按照第一个例子执行：
python run.py configs/eval_demo.py -w outputs/demo
上报如下错误：
Traceback (most recent call last):
File "/root/project/opencompass-0.1.0/opencompass/tasks/openicl_eval.py", line 10, in
from opencompass.registry import (ICL_EVALUATORS, MODELS, TASKS,
ModuleNotFoundError: No module named 'opencompass'

看了下opencompass-0.1.0/opencompass/tasks/openicl_eval.py的代码：
from opencompass.registry import (ICL_EVALUATORS, MODELS, TASKS,TEXT_POSTPROCESSORS)
感觉文件夹的层次不对？
这块需要怎么配置，能让openicl_eval.py直接看到两层之外的opencompass目录呢？

python小白，请大侠帮忙，感谢

Environment

python3.10

Other information

No response

[Bug] 无法指定gpu

描述该错误

无法指定gpu，无论在shell中使用CUDA_VISIBLE_DEVICES=1，还是在python中使用os.environ["CUDA_VISIBLE_DEVICES"]=1，都无法让程序只占用1卡，程序永远都会优先占用0卡。

环境信息

shell中

CUDA_VISIBLE_DEVICES=1 python run.py configs/eval_llama_7b.py -w outputs/llama_7b --gpu_idx=1

python中

parser = argparse.ArgumentParser(description='Run an evaluation task')
parser.add_argument('--gpu_idx', type=str, default=None)
args = parser.parse_args()
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_idx

其他信息

No response

[Feature] Speedup for opencompass\icl_retriever\icl_mdl_retriever\MDLRetriever

描述该功能

MDLRetriever uses the gpt2-xl model to process the commonsenseqa dataset.
In def topk_search(self): ,it took me nearly 2 hours to process to get rtr_idx_list on V100. If I don't care about the random seed, can I pickle the result of the first processing into a binary file, which is convenient for testing and calling quickly later?

是否希望自己实现该功能？

我希望自己来实现这一功能，并向 OpenCompass 贡献代码！

[Bug] FileNotFoundError: Couldn't find a module script at /mnt/workspace/xxx/evaluation/opencompass/accuracy/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.

描述该错误

Hello, I encountered the following problems when evaluating a custom model on the Alibaba Cloud server, how to solve it?

FileNotFoundError: Couldn't find a module script at /mnt/workspace/xxx/evaluation/opencompass/accuracy/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.

环境信息

python run.py ./config/eval_my_model.py --debug

其他信息

No response

问题咨询：关于ceval数据集以及新增类似ceval数据集的问题

描述该错误

1、data/ceval/包含formal_ceval和release_ceval两个文件夹，分别都是什么用途？评测internLM-7b时，使用的是formal_ceval下的数据。使用release_ceval的数据在数据集合配置glm文件夹下，是评测GLM的模型的时候使用release_ceval数据吗？
2、data/ceval/formal_ceval下分别有dev，test，val三个文件夹，CEvalDataset类将这些数据load到了下面变量：
return DatasetDict({ 'val': val_dataset, 'dev': dev_dataset, 'test': test_dataset })
这块后续是如何评测的呢？测试集、开发集，验证集这三类数据如何使用？
按照ceval官方的说法，测试机没有答案，需要将测试结果提交他们网站才能拿到得分，opencompass是如何操作的呢？
3、opencompass是否支持问答题的评测，如果支持，哪个数据集是问答题的例子？

建议opencompass开个论坛，我们可以到论坛里去交流相关技术问题。

感谢~

环境信息

无。

其他信息

No response

InternLM-7B base model results?

Describe the feature

Hi OpenCompass team, thank you for this excellent project. I see the results for InternLM-Chat-7B reported in the OpenCompass leaderboard, but I don't see any results for the base model InternLM-7B. Would you be willing to share those numbers as well?

I think it would be very useful to the community to understand how much of the performance comes from the base model vs. the chat finetuning.

Will you implement it?

I would like to implement this feature and create a PR!

[Bug] One mistake in the pipeline of MMBench

Describe the bug

In https://opencompass.readthedocs.io/en/latest/MMBench.html:

if data_sample['context'] is None:
    prompt = data_sample['context'] + ' ' + data_sample['question'] + ' ' + data_sample['options']
else:
    prompt = data_sample['question'] + ' ' + data_sample['options']

It should be

if data_sample['context'] is not None:

Environment

No anomaly

Other information

Also, the evaluation pipeline of MMbench is not complete. Can you release the code of Circular Evaluation Strategy and ChatGPT-involved Choice Extraction? How should we submit the results to the leaderboard? The hyperlink of instruction https://opencompass.org.cn/mmbench seems not correct.

在评测hunameval数据时，遇到一些问题，请帮忙看下；

描述该错误

评测humaneval数据，推理完成，eval阶段出现下面异常，可能是哪里出了问题？
Traceback (most recent call last):
File "/root/project/code/opencompass-3715be6/opencompass/tasks/openicl_eval.py", line 180, in
inferencer.run()
File "/root/project/code/opencompass-3715be6/opencompass/tasks/openicl_eval.py", line 54, in run
self._score()
File "/root/project/code/opencompass-3715be6/opencompass/tasks/openicl_eval.py", line 121, in _score
result = icl_evaluator.score(
File "/root/project/code/opencompass-3715be6/opencompass/datasets/humaneval.py", line 36, in score
score = self.eval(out_dir,
File "/root/project/code/opencompass-3715be6/human-eval-master/human_eval/evaluation.py", line 51, in evaluate_functional_correctness
problems = read_problems(problem_file)
File "/root/project/code/opencompass-3715be6/human-eval-master/human_eval/data.py", line 12, in read_problems
return {task["task_id"]: task for task in stream_jsonl(evalset_file)}
File "/root/project/code/opencompass-3715be6/human-eval-master/human_eval/data.py", line 12, in
return {task["task_id"]: task for task in stream_jsonl(evalset_file)}
File "/root/project/code/opencompass-3715be6/human-eval-master/human_eval/data.py", line 21, in stream_jsonl
with open(filename, "rb") as gzfp:
FileNotFoundError: [Errno 2] No such file or directory: '/root/project/code/opencompass-3715be6/human-eval-master/human_eval/../data/HumanEval.jsonl.gz'

环境信息

python3.10
humanEval使用master分之

其他信息

No response

[Feature] Loading the same model multiple times in opencompass for evaluation on MMLU dataset

Describe the feature

I am currently using LLaMa to evaluate PPL's ACC on MMLU dataset and I have a question about opencompass toolkit. When I use opencompass to divide the evaluation task into 40 sub-tasks, does this mean that the toolkit loads the same model 40 times for evaluation? From my observation, the time it takes to load the model is equivalent to the inference time. Therefore, I am concerned that the evaluation may take a long time due to the repeated loading of the model.

Additionally, I am wondering if it is possible to configure the opencompass config to support batch inference? This could potentially improve the efficiency of the evaluation process.

I would greatly appreciate it if you could provide detailed guidance on these issues.

Will you implement it?

I would like to implement this feature and create a PR!

[Bug]

Describe the bug

test

Environment

test

Other information

No response

[Bug]

描述该错误

运行评测脚本时，出现以下错误

Traceback (most recent call last):
File "/d1/pub/wyzh/evaluation/opencompass/run.py", line 339, in
main()
File "/d1/pub/wyzh/evaluation/opencompass/run.py", line 210, in main
tasks = partitioner(cfg)
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/partitioners/base.py", line 48, in call
tasks = self.partition(models, datasets, work_dir, self.out_dir)
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/partitioners/size.py", line 67, in partition
datasets = sorted(datasets,
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/partitioners/size.py", line 68, in
key=lambda x: self.get_cost(x),
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/partitioners/size.py", line 188, in get_cost
dataset = build_dataset_from_cfg(dataset)
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/utils/build.py", line 13, in build_dataset_from_cfg
return LOAD_DATASET.build(dataset_cfg)
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 98, in build_from_cfg
obj_cls = registry.get(obj_type)
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/mmengine/registry/registry.py", line 451, in get
self.import_from_location()
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/mmengine/registry/registry.py", line 376, in import_from_location
import_module(loc)
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/datasets/init.py", line 1, in
from .afqmcd import * # noqa: F401, F403
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/datasets/afqmcd.py", line 7, in
from .base import BaseDataset
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/datasets/base.py", line 6, in
from opencompass.openicl import DatasetReader
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/openicl/init.py", line 2, in
from .icl_evaluator import * # noqa
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/openicl/icl_evaluator/init.py", line 4, in
from .icl_hf_evaluator import * # noqa
File "/d1/pub/wyzh/evaluation/opencompass/opencompass/openicl/icl_evaluator/icl_hf_evaluator.py", line 4, in
import evaluate
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/evaluate/init.py", line 29, in
from .evaluation_suite import EvaluationSuite
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/evaluate/evaluation_suite/init.py", line 10, in
from ..evaluator import evaluator
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/evaluate/evaluator/init.py", line 17, in
from transformers.pipelines import SUPPORTED_TASKS as SUPPORTED_PIPELINE_TASKS
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/transformers/pipelines/init.py", line 44, in
from .audio_classification import AudioClassificationPipeline
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/transformers/pipelines/audio_classification.py", line 21, in
from .base import PIPELINE_INIT_ARGS, Pipeline
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/transformers/pipelines/base.py", line 36, in
from ..modelcard import ModelCard
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/transformers/modelcard.py", line 48, in
from .training_args import ParallelMode
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/transformers/training_args.py", line 67, in
from accelerate import PartialState
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/init.py", line 3, in
from .accelerator import Accelerator
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/accelerator.py", line 33, in
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/checkpointing.py", line 24, in
from .utils import (
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/utils/init.py", line 109, in
from .launch import (
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/utils/launch.py", line 23, in
from ..commands.config.config_args import SageMakerConfig
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/commands/config/init.py", line 19, in
from .config import config_command_parser
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/commands/config/config.py", line 25, in
from .sagemaker import get_sagemaker_input
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/accelerate/commands/config/sagemaker.py", line 35, in
import boto3 # noqa: F401
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/boto3/init.py", line 17, in
from boto3.session import Session
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/boto3/session.py", line 17, in
import botocore.session
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/session.py", line 26, in
import botocore.client
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/client.py", line 15, in
from botocore import waiter, xform_name
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/waiter.py", line 18, in
from botocore.docs.docstring import WaiterDocstring
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/docs/init.py", line 15, in
from botocore.docs.service import ServiceDocumenter
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/docs/service.py", line 14, in
from botocore.docs.client import ClientDocumenter, ClientExceptionsDocumenter
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/docs/client.py", line 17, in
from botocore.docs.example import ResponseExampleDocumenter
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/docs/example.py", line 13, in
from botocore.docs.shape import ShapeDocumenter
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/docs/shape.py", line 19, in
from botocore.utils import is_json_value_header
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/utils.py", line 37, in
import botocore.httpsession
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/botocore/httpsession.py", line 45, in
from urllib3.contrib.pyopenssl import (
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/urllib3/contrib/pyopenssl.py", line 50, in
import OpenSSL.crypto
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/OpenSSL/init.py", line 8, in
from OpenSSL import crypto, SSL
File "/home/llm/anaconda3/envs/safe-rlhf/lib/python3.10/site-packages/OpenSSL/crypto.py", line 3268, in
_lib.OpenSSL_add_all_algorithms()
AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'

环境信息

{'CUDA available': True,
'CUDA_HOME': '/home/llm/anaconda3/envs/safe-rlhf',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-SXM4-40GB',
'MMEngine': '0.8.3',
'NVCC': 'Cuda compilation tools, release 11.7, V11.7.99',
'OpenCV': '4.8.0',
'PyTorch': '2.0.1',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2023.1-Product Build 20230303 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v2.7.3 (Git Hash '
'6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.7\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n'
' - CuDNN 8.5\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=11.7, '
'CUDNN_VERSION=8.5.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
'-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
'-DUSE_FBGEMM -DUSE_QNNPACK '
'-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
'-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
'-Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function '
'-Wno-unused-result -Wno-strict-overflow '
'-Wno-strict-aliasing '
'-Wno-error=deprecated-declarations '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) '
'[GCC 11.3.0]',
'TorchVision': '0.15.2+cu117',
'numpy_random_seed': 2147483648,
'opencompass': '0.1.0+e9b7b8a',
'sys.platform': 'linux'}

其他信息

No response

[Bug]

Describe the bug

test

Environment

test

Other information

No response

[Bug] Failed to prepare the AGIEval dataset for evaluation

Describe the bug

Command:

python3 run.py configs/eval_internlm_7b.py

Dataprep steps:

cd ./data/AGIEval
git clone https://github.com/microsoft/AGIEval.git
cp -R AGIEval/data data

Logs:

pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-SXM4-80GB',
 'MMEngine': '0.8.1',
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.66',
 'OpenCV': '4.8.0',
 'PyTorch': '2.0.1',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2023.1-Product Build 20230303 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.7.3 (Git Hash '
                              '6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.8\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
                              '  - CuDNN 8.7\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
                              'CUDNN_VERSION=8.7.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
                              '-DUSE_FBGEMM -DUSE_QNNPACK '
                              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
                              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
                              '-Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0]',
 'TorchVision': '0.15.2',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.1.0+18efcb0',
 'sys.platform': 'linux'}

Other information

No response

[Bug]

描述该错误

测试运行报错，log里感觉像缺少文件：
opencompass-main/accuracy/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.

环境信息

{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda',
'GCC': 'x86_64-linux-gnu-gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0',
'GPU 0,1,2,3,4,5,6,7': 'Tesla V100-SXM2-32GB',
'MMEngine': '0.8.2',
'NVCC': 'Cuda compilation tools, release 11.6, V11.6.112',
'OpenCV': '4.8.0',
'PyTorch': '2.0.1+cu117',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2022.2-Product Build 20220804 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v2.7.3 (Git Hash '
'6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.7\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n'
' - CuDNN 8.5\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=11.7, '
'CUDNN_VERSION=8.5.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
'-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
'-DUSE_FBGEMM -DUSE_QNNPACK '
'-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
'-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
'-Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function '
'-Wno-unused-result -Wno-strict-overflow '
'-Wno-strict-aliasing '
'-Wno-error=deprecated-declarations '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.11 (main, Apr 5 2023, 14:15:30) [GCC 7.5.0]',
'TorchVision': '0.15.2+cu117',
'numpy_random_seed': 2147483648,
'opencompass': '0.1.0+',
'sys.platform': 'linux'}

其他信息

No response

[Feature] Golden answer in prediction

Describe the feature

Like title.

Will you implement it?

I would like to implement this feature and create a PR!

这个推理评估速度正常吗？帮忙看下

描述该错误

运行配置：
python run.py configs/eval_internlm_7b.py -w outputs/internlm_7b

代码几乎都是库里默认的，没做大的改动。数据集配置：.datasets.collections.base_medium

运行环境：
P100 GPU 16G 两块；

下面是目前生成的log：
20230725_222436.tar.gz

GPU使用情况：

目前在推理，速度如下：

请问这个速度正常吗？这么慢吗？

环境信息

{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
'GPU 0,1': 'Tesla P100-PCIE-16GB',
'MMEngine': '0.8.2',
'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89',
'OpenCV': '4.8.0',
'PyTorch': '2.0.0+cu118',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2022.2-Product Build 20220804 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v2.7.3 (Git Hash '
'6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.8\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
' - CuDNN 8.7\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
'CUDNN_VERSION=8.7.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
'-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
'-DUSE_FBGEMM -DUSE_QNNPACK '
'-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
'-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
'-Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function '
'-Wno-unused-result -Wno-strict-overflow '
'-Wno-strict-aliasing '
'-Wno-error=deprecated-declarations '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.10 (main, Jul 24 2023, 16:03:07) [GCC 9.4.0]',
'TorchVision': '0.15.1+cu118',
'numpy_random_seed': 2147483648,
'opencompass': '0.1.0+',
'sys.platform': 'linux'}

其他信息

对代码几乎没做改动，唯一动的地方是 base_medium.py注释掉了下面数据集，因为没有找到相关数据；
from ..bbh.bbh_gen_5b92b0 import bbh_datasets
from ..agieval.agieval_mixed_2f14ad import agieval_datasets
from ..Xsum.Xsum_gen_31397e import Xsum_datasets
from ..TheoremQA.TheoremQA_gen_ef26ca import TheoremQA_datasets
from ..triviaqa.triviaqa_gen_2121ce import triviaqa_datasets

为感觉不应该这么慢。

[Feature] Any plans to rank for the 13B models?

Describe the feature

We are curious how far can 13B models go compared with 7B models. Do you have plans to list them in the leader-boards? Some recommendations for the models: LLAMA-13B, Baichuan-13B, WizardLM-13B, Vicuna-13B and etc.

Will you implement it?

I would like to implement this feature and create a PR!

[Bug] 测试用例数据集Winograd失败

描述该错误

根据Docs中的demo示例，测试在 SIQA 和 Winograd 这两个数据，其中Winograd数据集与本地dataset代码不兼容，跑不通，注释掉即可正常使用，希望进行排查

即使把这里的名字改为‘winogrande_xl’，又会报其他错，把这数据注释掉，就没有问题

环境信息

{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda-11.1',
'GCC': 'gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)',
'GPU 0,1,2,3': 'NVIDIA GeForce RTX 3090',
'MMEngine': '0.8.2',
'NVCC': 'Cuda compilation tools, release 11.1, V11.1.74',
'OpenCV': '4.8.0',
'PyTorch': '2.0.1',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2023.1-Product Build 20230303 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v2.7.3 (Git Hash '
'6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.8\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
' - CuDNN 8.7\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
'CUDNN_VERSION=8.7.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
'-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
'-DUSE_FBGEMM -DUSE_QNNPACK '
'-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
'-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
'-Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function '
'-Wno-unused-result -Wno-strict-overflow '
'-Wno-strict-aliasing '
'-Wno-error=deprecated-declarations '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]',
'TorchVision': '0.15.2',
'numpy_random_seed': 2147483648,
'opencompass': '0.1.0+',
'sys.platform': 'linux'}

其他信息

No response

[Bug] 使用mmbench测试miniGPT4的样例程序报错

Describe the bug

描述
修改文件minigpt_4_7b_mmbench.py的minigpt4_model的llama_model参数和minigpt_4_load_from参数，执行python run.py ./configs/multimodal/minigpt_4/minigpt_4_7b_mmbench.py --mm-eval --debug -w outputs/minigpt_4命令，报错如下。
Traceback (most recent call last):
File "/home/taas/lidan6/opencompass/run.py", line 372, in
main()
File "/home/taas/lidan6/opencompass/run.py", line 213, in main
tasks = partitioner(cfg)
File "/home/taas/lidan6/opencompass/opencompass/partitioners/mm_naive.py", line 99, in call
models = cfg['models']
File "/home/taas/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/config.py", line 1483, in getitem
return self._cfg_dict.getitem(name)
File "/home/taas/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/config.py", line 131, in getitem
return self.build_lazy(super().getitem(key))
File "/home/taas/anaconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/config/config.py", line 98, in missing
raise KeyError(name)
KeyError: 'models'
定位问题：错误原因是没有读到模型，感觉框架内部参数传递中间格式错了

Environment

环境配置无误，LLM模型评测可以跑通

Other information

No response

了解下，如何处理 TheoremQA, TriviaQA and Xsum 这些数据集用于本框架的预测呢？

描述该错误

了解下，如何处理 TheoremQA, TriviaQA and Xsum 这些数据集用于本框架的预测呢？

环境信息

无

其他信息

无

请教下ceval数据集中计算PPL的方法

描述该错误

中文描述：
opencompass-3715be6/opencompass/models/huggingface.py中的

def _get_ppl(self, inputs: List[str], mask_length: Optional[List[int]] = None) -> List[float]: outputs, inputs = self.get_logits(inputs) .......
这里给你一个input列表，就可以计算出PPL的值；
首先调用get_logits，返回模型预测的token和input token
这个input 就是按照prompt模板组成的prompt，以ceval为例，就是five-shot的例子加上新的题目以及答案是一个label “A”
我的问题是：为啥将这个input给模型，调用模型的前向传播方法，获取output[0]就是模型的预测呢？
“fiveshot + 新的选择题目+answer ：A“ 这个给模型是想让模型预测啥呢？

大模型小白，希望高手赐教。

English description：
In the opencompass-3715be6/opencompass/models/huggingface.py file, there is a function _get_ppl that takes a list of inputs and calculates the value of Perplexity (PPL). It first calls the get_logits function to get the model's predicted tokens and input tokens.

The input list provided to the function is constructed based on a prompt template, such as the "ceval" template. In this case, it consists of a five-shot example, a new question, and the answer labeled as "A".

The question is why this input is given to the model and why the output[0] of the model's forward pass is considered as the model's prediction. What is the intention behind providing the input "fiveshot + new question + answer: A" to the model?

As a novice in large models, I would appreciate guidance from experts.

环境信息

python3.10

其他信息

No response

[documentation] 文档链接失效

描述该错误

首页readme文档：
下面展示了快速安装的步骤。有部分第三方功能可能需要额外步骤才能正常运行，详细步骤请参考安装指南。
该链接not found了

环境信息

None

其他信息

No response

[Bug] comsenseqa infer/eval bug

Describe the bug

When I test my LLM on the commonsenseqa dataset, I get array out of bounds. But I don't know how to fix it.

from mmengine.config import read_base
from opencompass.partitioners import SizePartitioner, NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
with read_base():
    from .datasets.commonsenseqa.commonsenseqa_ppl_5545e2 import commonsenseqa_datasets

Traceback (most recent call last):
  File "/home/opencompass/opencompass/tasks/openicl_infer.py", line 147, in <module>
    inferencer.run()
  File "/home/opencompass/opencompass/tasks/openicl_infer.py", line 76, in run
    self._inference()
  File "/home/opencompass/opencompass/tasks/openicl_infer.py", line 124, in _inference
    inferencer.inference(retriever,
  File "/home/opencompass/opencompass/openicl/icl_inferencer/icl_ppl_inferencer.py", line 108, in inference
    prompt = retriever.generate_label_prompt(
  File "/home/opencompass/opencompass/openicl/icl_retriever/icl_base_retriever.py", line 146, in generate_label_prompt
    self.test_ds[idx], ice, label, remain_sep)
  File "/root/miniconda3/envs/opencompass/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2803, in __getitem__
    return self._getitem(key)
  File "/root/miniconda3/envs/opencompass/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2787, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/root/miniconda3/envs/opencompass/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 583, in query_table
    _check_valid_index_key(key, size)
  File "/root/miniconda3/envs/opencompass/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 611 is out of bounds for size 611

I output partial intermediate results.
In BaseRetriever lines 40-41:
len of self.index_ds in commonsenseqa : 9741
len of self.test_ds in commonsenseqa : 611
but the len of ice_idx_list is 1221. It is too long for self.test_ds.

Environment

``{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
'GPU 0': 'Tesla V100-SXM2-32GB',
'MMEngine': '0.8.2',
'NVCC': 'Cuda compilation tools, release 12.1, V12.1.66',
'OpenCV': '4.8.0',
'PyTorch': '2.0.1',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2023.1-Product Build 20230303 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v2.7.3 (Git Hash '
'6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.8\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
' - CuDNN 8.7\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
'CUDNN_VERSION=8.7.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
'-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
'-DUSE_FBGEMM -DUSE_QNNPACK '
'-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
'-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
'-Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function '
'-Wno-unused-result -Wno-strict-overflow '
'-Wno-strict-aliasing '
'-Wno-error=deprecated-declarations '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast '
'-fdiagnostics-color=always -faligned-new '
'-Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]',
'TorchVision': '0.15.2',
'numpy_random_seed': 2147483648,
'opencompass': '0.1.0+',
'sys.platform': 'linux'}

Other information

No response

huggingface数据集离线加载

描述该功能

在加载huggingface上的数据集时，需要服务器联网去huggingface上下载；
而我的服务器不能上网，所以在执行：
from datasets import load_dataset
dataset = load_dataset(**kwargs)
会发生ConnectTimeout错误；

能否提供一种离线的数据加载方式？
或者如何修改代码更加容易从本地disk加载数据集？

请给点建议，非常感谢。

是否希望自己实现该功能？

我希望自己来实现这一功能，并向 OpenCompass 贡献代码！

[Feature]

描述该功能

Support TydiQA?

是否希望自己实现该功能？

我希望自己来实现这一功能，并向 OpenCompass 贡献代码！

为何看不到每一个数据集的平均分值？

Describe the feature

运行评估测试
python run.py configs/test_internlm.py -w outputs/001
cat configs/test_internlm.py
from mmengine.config import read_base

with read_base():
from .datasets.ceval.ceval_gen import ceval_datasets
from .datasets.mmlu.mmlu_ppl import mmlu_datasets

datasets = [*ceval_datasets, *mmlu_datasets]

from opencompass.models import HuggingFaceCausalLM

_meta_template = dict(
round=[
dict(role='HUMAN', begin='<|User|>:', end='\n'),
dict(role='BOT', begin='<|Bot|>:', end='\n', generate=True),
],
)

models = [
dict(
type=HuggingFaceCausalLM,
abbr='internlm-chat-7b-8k-hf',
path="/media/root/sdc/data/model/internlm-chat-7b-8k",
tokenizer_path='/media/root/sdc/data/model/internlm-chat-7b-8k',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
use_fast=False,
trust_remote_code=True,
),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
meta_template=_meta_template,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=1, num_procs=1),
)
]
然后获得评测结果
测试结果里面看不到ceval或者mmlu的测试平均值，需要如何设置，还是需要自己实现代码

Will you implement it?

I would like to implement this feature and create a PR!

[Feature] 历史 repo 和版本太多，希望增加对 lmdeploy+puyu+gRPC 接口支持

Describe the feature

之前给的 pjeval/add_grpc_support 现在没法跑了。需要

llama_service 切 v0.4.3
unpack 也报错还得修

  File "/workspace/GitProjects/pjeval/opencompass/models/grpc_api.py", line 108, in _generate
    for status, result, tokens in self.grpcapi.generate(
TypeError: cannot unpack non-iterable StatusCode object

然后发现是 llama_service v0.4.3 版本似乎不对...彻底没法维护了。

现在 lmdeploy 里面就有 llama_service，干脆适配一次 lmdeploy，废弃掉内部的吧。
需要用这个来测非对称量化版本精度。

Will you implement it?

I would like to implement this feature and create a PR!

请问leaderboard中数据集的评测方式是gen还是ppl?

描述该功能

部分数据集下有 ‘ppl’ 和 ‘gen’ 两种评估方式，请问在leaderboard中是哪种评估方式？比如mmlu、ceval、gsm8k、hellaswag...

是否希望自己实现该功能？

我希望自己来实现这一功能，并向 OpenCompass 贡献代码！

[Bug] MMBench测试有bug

描述该错误

一直卡在evaluating，持续好几个小时。

环境信息

None

其他信息

No response

在评估internlm7b在c-eval gaokao上评测速度很慢，不知道是为什么？多卡多线程下好像更慢

描述该错误

比较慢

环境信息

比较慢

其他信息

No response

请教下，如何设置完成多级多卡间并行的测试相同的模型的不同checkpoint呢？

Describe the feature

无

Will you implement it?

I would like to implement this feature and create a PR!

[Feature] test notification

描述该功能

是否希望自己实现该功能？

我希望自己来实现这一功能，并向 OpenCompass 贡献代码！

奇怪的bug

[Details Correction] Shikra is using Vicuna-7/13B but not LLaMA 7B

A minor error: Table 15 in the paper and table in the benchmark show that Shikra used LLaMA 7B as the language backbone. However, it should be Vicuna-7/13B according to the paper.

关于mode为eval的疑问

Describe the feature

之前评测了好几个数据集然后进程中断了。
文件夹下有这三个文件：

prediction里为每个任务的输出，我该怎么评测结果，尝试加上--mode eval -r 20230724_021658参数后生成的csv文件都是‘--’值

Will you implement it?

I would like to implement this feature and create a PR!

[Bug] Benchmarks for SFT models on some datasets are not correctly implemented

Describe the bug

Running opencompass on my own SFT model, we saw some suspicious results.

dataset                               version    metric            mode    model-hf
------------------------------------  ---------  ----------------  ------  -----------------
--------- 语言 Language ---------       -          -                 -       -
WiC                                   d06864     accuracy          gen     0.00
WSC                                   6dc406     accuracy          gen     0.00
winogrande                            a9ede5     accuracy          gen     0.00
--------- 推理 Reasoning ---------      -          -                 -       -
AX_g                                  68aac7     accuracy          gen     0.00
RTE                                   68aac7     accuracy          gen     0.00
openai_humaneval                      8e312c     humaneval_pass@1  gen     0.00
mbpp                                  1e1056     score             gen     0.00

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-SXM4-80GB',
 'MMEngine': '0.8.1',
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.66',
 'OpenCV': '4.8.0',
 'PyTorch': '2.0.1',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2023.1-Product Build 20230303 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.7.3 (Git Hash '
                              '6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.8\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
                              '  - CuDNN 8.7\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
                              'CUDNN_VERSION=8.7.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
                              '-DUSE_FBGEMM -DUSE_QNNPACK '
                              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
                              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
                              '-Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0]',
 'TorchVision': '0.15.2',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.1.0+18efcb0',
 'sys.platform': 'linux'}

Other information

Take WiC as example, the prediction of the model is either Yes or No.

{
        "origin_prompt": "Human: Sentence 1: As he called the role he put a check mark by each student's name.\nSentence 2: A check on its dependability under stress.\nAre 'check' in the above two sentenses the same?\nA. Yes\nB. No\nAnswer: \n\nAssistant:",
        "prediction": "Yes"
    }

But the labels are true or false, as can be seen below.

{"word": "brush", "sentence1": "She gave her hair a quick brush.", "sentence2": "The dentist recommended two brushes a day.", "idx": 3, "label": "true", "start1": 26, "start2": 28, "end1": 31, "end2": 35, "version": 1.1}

请问leaderboard上的GAOKAO-Bench 结果是用的全部还是只用了子集？

Describe the feature

如果是子集的话是哪些？

Will you implement it?

I would like to implement this feature and create a PR!

执行python run.py configs/eval_demo.py -w outputs/demo 速度异常

描述该错误

按照文档中的快速上手，数据及使用的是ceval，模型为OPT-125M；
运行python run.py configs/eval_demo.py -w outputs/demo时，特别慢，没有在GPU上运行，不知道是我哪里的问题；
log打印：

GPU情况：

一直卡着，没有进度。

环境信息

python3.10

其他信息

No response