CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations

📃Paper 🏰Project Page 🏆Leaderboard ✨Findings

Construction of CHARM

Comparison of commonsense reasoning benchmarks

| Benchmarks | CN-Lang | CSR | CN-specifics | Dual-Domain | Rea-Mem |
| --- | --- | --- | --- | --- | --- |
| Most benchmarks in davis2023benchmarks | ✘ | ✔ | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |
| CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
| CHARM (ours) | ✔ | ✔ | ✔ | ✔ | ✔ |

"CN-Lang" indicates that the benchmark is presented in Chinese. "CSR" means the benchmark is designed to focus on commonsense reasoning. "CN-specifics" indicates that the benchmark includes elements unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates that the benchmark covers both Chinese-specific and global-domain tasks, with questions presented in a similar style and format. "Rea-Mem" indicates that the benchmark includes closely interconnected reasoning and memorization tasks.

🚀 What's New

  • [2024.6.06] Leaderboard updated! LLaMA-3, GPT-4o, Gemini-1.5, Yi1.5, Qwen1.5, etc. are evaluated.
  • [2024.5.24] CHARM has been open-sourced! 🔥🔥🔥
  • [2024.5.15] CHARM has been accepted to the main conference of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)! 🔥🔥🔥
  • [2024.3.21] Paper available on arXiv.

🧾 TODO

  • Support inference and evaluation on OpenCompass.

🛠️ Inference and Evaluation on OpenCompass

Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation.

1. OpenCompass Environment Setup

Refer to the installation steps for OpenCompass.

2. Download CHARM

git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}

cd ${path_to_opencompass}
mkdir -p data  # -p: do not fail if data/ already exists
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
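
For setups that prepare the environment from a script, the linking step above can be sketched in Python. This is only an illustration, not part of CHARM or OpenCompass: the function name `link_charm_data` and its arguments are hypothetical, and it assumes a POSIX-like system where symlinks are available. Unlike `ln -snf`, it refuses to touch an existing `data/CHARM` that is a real directory rather than a symlink.

```python
from pathlib import Path


def link_charm_data(charm_repo: Path, opencompass_dir: Path) -> Path:
    """Create <opencompass_dir>/data/CHARM as a symlink to <charm_repo>/data/CHARM.

    Hypothetical helper mirroring the `mkdir` + `ln -snf` commands above.
    """
    source = (charm_repo / "data" / "CHARM").resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"CHARM data directory not found: {source}")

    data_dir = opencompass_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p data`

    link = data_dir / "CHARM"
    if link.is_symlink():
        link.unlink()  # replace a stale link, similar to `ln -snf`
    elif link.exists():
        # A real directory or file is left alone instead of being overwritten.
        raise FileExistsError(f"{link} exists and is not a symlink")

    link.symlink_to(source, target_is_directory=True)
    return link
```

Calling `link_charm_data(Path(path_to_CHARM_repo), Path(path_to_opencompass))` more than once is safe: an existing link is simply recreated.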

3. Run Inference and Evaluation

cd ${path_to_opencompass}

# modify config file `configs/eval_charm_rea.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_rea.py -r --dump-eval-details

# modify config file `configs/eval_charm_mem.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_mem.py -r --dump-eval-details

The inference and evaluation results are written to ${path_to_opencompass}/outputs. The layout looks like this (timestamps and model names will differ between runs):

outputs
├── CHARM_mem
│   └── chat
│       └── 20240605_151442
│           ├── predictions
│           │   ├── internlm2-chat-1.8b-turbomind
│           │   ├── llama-3-8b-instruct-lmdeploy
│           │   └── qwen1.5-1.8b-chat-hf
│           ├── results
│           │   ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│           │   ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│           │   └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│           └── summary
│               └── 20240605_205020 # MEMORY_SUMMARY_DIR
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│                   └── judged-by--GPT-3.5-turbo-0125.csv # MEMORY_SUMMARY_CSV
└── CHARM_rea
    └── chat
        └── 20240605_152359
            ├── predictions
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            ├── results # REASON_RESULTS_DIR
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            └── summary
                ├── summary_20240605_205328.csv # REASON_SUMMARY_CSV
                └── summary_20240605_205328.txt
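
The four paths marked in the tree above (`REASON_RESULTS_DIR`, `REASON_SUMMARY_CSV`, `MEMORY_SUMMARY_DIR`, `MEMORY_SUMMARY_CSV`) are needed in the next step. A minimal sketch for locating them automatically is given below. It is not part of the CHARM tooling: the function names are hypothetical, and it assumes the directory layout is exactly as shown above, with timestamped directory names that sort chronologically, so the lexicographically last entry is the most recent run.

```python
from pathlib import Path
from typing import Dict, Iterable


def _latest(paths: Iterable[Path]) -> Path:
    """Return the lexicographically last path.

    Names such as 20240605_151442 sort in chronological order.
    """
    ordered = sorted(paths)
    if not ordered:
        raise FileNotFoundError("no matching entries found")
    return ordered[-1]


def locate_charm_outputs(outputs: Path) -> Dict[str, Path]:
    """Find the newest reasoning and memorization result paths under `outputs`."""
    rea_run = _latest(p for p in (outputs / "CHARM_rea" / "chat").iterdir() if p.is_dir())
    mem_run = _latest(p for p in (outputs / "CHARM_mem" / "chat").iterdir() if p.is_dir())

    reason_summary_csv = _latest((rea_run / "summary").glob("summary_*.csv"))
    memory_summary_dir = _latest(p for p in (mem_run / "summary").iterdir() if p.is_dir())
    # Only the CSV carries a .csv suffix; the per-task entries next to it do not.
    memory_summary_csv = _latest(memory_summary_dir.glob("judged-by--*.csv"))

    return {
        "REASON_RESULTS_DIR": rea_run / "results",
        "REASON_SUMMARY_CSV": reason_summary_csv,
        "MEMORY_SUMMARY_DIR": memory_summary_dir,
        "MEMORY_SUMMARY_CSV": memory_summary_csv,
    }
```

If several judge models were used, more than one `judged-by--*.csv` file may exist and this sketch would simply pick the last one by name; in that case the path is better chosen by hand.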

4. Generate Analysis Results

cd ${path_to_CHARM_repo}

# generate Table5, Table6, Table9 and Table10 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}

# generate Figure3 and Figure9 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}

# generate Table7, Table12, Table13 and Figure11 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
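
The three analysis commands above can also be driven from a single Python script. The sketch below only assembles and runs the same command lines; the helper names `build_analysis_commands` and `run_analysis` are hypothetical and not part of the CHARM repository. It assumes the scripts are invoked from the root of the CHARM repo with `PYTHONPATH=.`, as in the shell version.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def build_analysis_commands(
    reason_summary_csv: Path,
    memory_summary_csv: Path,
    reason_results_dir: Path,
    memory_summary_dir: Path,
    python: str = "python",
) -> List[List[str]]:
    """Return the three analysis commands in the order shown above."""
    return [
        [python, "tools/summarize_reasoning.py", str(reason_summary_csv)],
        [python, "tools/summarize_mem_rea.py",
         str(reason_summary_csv), str(memory_summary_csv)],
        [python, "tools/analyze_mem_indep_rea.py", "data/CHARM",
         str(reason_results_dir), str(memory_summary_dir), str(memory_summary_csv)],
    ]


def run_analysis(charm_repo: Path, commands: List[List[str]]) -> None:
    """Run each command from the CHARM repo root, stopping at the first failure."""
    env = dict(os.environ, PYTHONPATH=".")  # same as the `PYTHONPATH=.` prefix
    for cmd in commands:
        subprocess.run(cmd, cwd=charm_repo, env=env, check=True)
```

Because the commands run with the CHARM repo as the working directory, the result paths should be passed as absolute paths; relative paths would be resolved against the repo rather than against the OpenCompass `outputs` directory.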

🖊️ Citation

@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations}, 
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

💳 License

This project is released under the Apache 2.0 license.
