ieit-yuan / yuan2.0-m32

Mixture-of-Experts (MoE) Language Model

License: Apache License 2.0


Yuan2.0-M32: Mixture of Experts with Attention Router

👾 ModelScope • 🤗 Hugging Face • 💬 WeChat • 📎 Yuan2.0-M32 Paper

English | 简体中文


0. Latest News 🎉🎉

  • [2024-06-28] Yuan2.0-M32-HF-INT8 released, high performance, lossless generation accuracy 🎗️🎗️🎗️
  • [2024-06-18] Yuan2.0-M32-HF-INT4 released 🎗️🎗️
  • [2024-05-28] Yuan2.0-M32 released

1. Introduction

Yuan2.0-M32 is a Mixture-of-Experts (MoE) language model with 32 experts, of which 2 are active per token. A new router network, the Attention Router, is proposed and adopted for more efficient expert selection, boosting accuracy by 3.8% over a model using a classical router network. Yuan2.0-M32 is trained from scratch on 2000B tokens, and its training computation is only 9.25% of that required by a dense model of the same parameter scale. It demonstrates competitive capabilities in coding, math, and various specialized fields while activating only 3.7B of its 40B total parameters, with a forward computation of 7.4 GFLOPs per token, just 1/19th of Llama3-70B's requirement. Yuan2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, achieving accuracies of 55.9% and 95.8%, respectively. The basic information of the Yuan2.0-M32 model is as follows:

  • Total Parameters: 40B
  • Experts: 32
  • Active Experts: 2
  • Active Parameters: 3.7B
  • Pretrained Tokens: 2000B tokens
  • Sequence Length: 16K
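For illustration, here is a minimal sketch of how an attention-style router differs from a classical linear router. The shapes, weights, and scoring details below are hypothetical and do not reproduce the paper's exact formulation; see the technical report for the actual Attention Router design.

```python
import numpy as np

def classical_router(x, W, k=2):
    # Classical router: a single linear layer scores all experts for the token.
    logits = x @ W                               # (d,) @ (d, n_experts) -> (n_experts,)
    top = np.argsort(logits)[::-1][:k]           # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())  # softmax over the selected experts
    return top, w / w.sum()

def attention_style_router(x, Wq, Wk, k=2):
    # Attention-style router (sketch): the token forms a query and each
    # expert has a learned key; scaled dot-product scores replace the
    # single linear scoring above. Wq: (d, d_r), Wk: (n_experts, d_r).
    q = x @ Wq                                   # (d_r,)
    scores = (Wk @ q) / np.sqrt(Wq.shape[1])     # (n_experts,)
    top = np.argsort(scores)[::-1][:k]
    w = np.exp(scores[top] - scores[top].max())
    return top, w / w.sum()

rng = np.random.default_rng(0)
d, d_r, n_experts = 64, 16, 32
x = rng.normal(size=d)
top, gate = attention_style_router(x, rng.normal(size=(d, d_r)),
                                   rng.normal(size=(n_experts, d_r)))
print(top, gate)  # 2 of 32 experts selected; gate weights sum to 1
```

Only the 2 selected experts run a forward pass for this token, which is why the active parameter count (3.7B) is so much smaller than the total (40B).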

The technical report for the Yuan2.0-M32 model has been released, and you can find more detailed technical information and evaluation results by referring to the paper.

Fig.1: Yuan 2.0-M32 Architecture

2. Model Downloads

Model                  Sequence Length  Type         Download
Yuan2.0-M32            16K              Megatron     ModelScope | HuggingFace | Netdisk | Wisemodel
Yuan2.0-M32-HF         16K              HuggingFace  ModelScope | HuggingFace | Netdisk | Wisemodel
Yuan2.0-M32-GGUF       16K              GGUF         ModelScope | HuggingFace | Netdisk | Wisemodel
Yuan2.0-M32-GGUF-INT4  16K              GGUF         ModelScope | HuggingFace | Netdisk | Wisemodel
Yuan2.0-M32-HF-INT4    16K              HuggingFace  ModelScope | HuggingFace | Netdisk | Wisemodel
Yuan2.0-M32-HF-INT8    16K              HuggingFace  ModelScope | HuggingFace | Netdisk | Wisemodel

* Yuan2.0-M32-HF-INT4: for the quantization and inference method, refer to the guide.

3. Evaluation

3.1 Benchmarks 🏆

We conducted a thorough evaluation of the Yuan2.0-M32 model across a range of benchmarks, including HumanEval, GSM8K, MMLU, MATH, and ARC-Challenge. These benchmarks test the model's proficiency in key areas such as natural language understanding, knowledge acquisition, mathematical computation and reasoning, and code generation. Yuan2.0-M32 shows a consistent and significant advantage over models like Llama3-8B and Mistral-8×7B, excelling in all evaluated tasks. Remarkably, its overall performance is on par with the much larger Llama3-70B model. The detailed evaluation results are outlined in the table below.

Model          HumanEval  GSM8K   MMLU    Math   ARC-C*
Llama3-70B     81.7%      93.0%   80.3%   50.4%  93.3%
Llama3-8B      62.2%      79.6%   68.4%   30.0%  78.6%
Phi-3-medium   62.2%      91.0%   78.0%   -      91.6%
Phi-3-small    61.0%      89.6%   75.7%   -      90.7%
Phi-3-mini     58.5%      82.5%   68.8%   -      84.9%
Mistral-8*22B  45.1%      78.6%   77.8%   41.8%  91.3%
Mistral-8*7B   40.2%      58.4%   70.86%  28.4%  85.9%
Yuan2.0-M32    74.4%      92.7%   72.2%   55.9%  95.8%

* ARC-C: The AI2 Reasoning Challenge (ARC) benchmark is divided into Easy and Challenge sets, with the latter containing more complex questions that require further reasoning. We test our model on the Challenge set.


3.2 Computational Utilization

Model          Params (B)  Active Params (B)  GFLOPs/token (Inference)  GFLOPs/token (Fine-tune)  Mean Accuracy  Mean Accuracy / GFLOPs per token (Inference)
Llama3-70B     70          70                 140                       420                       79.25          0.57
Llama3-8B      8           8                  16                        48                        64.15          4.00
Mistral-8*22B  141         39                 78                        234                       72.38          0.93
Mistral-8*7B   47          12.9               25.8                      77.3                      60.83          2.36
Yuan2.0-M32    40          3.7                7.4                       22.2                      79.15          10.69
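The FLOPs columns in the table are consistent with the common rule of thumb of roughly 2 × (active parameters) FLOPs per token for inference, with the fine-tune column at 3× the inference column. A quick sanity check (assuming that rule, which is an approximation, not the authors' stated method):

```python
# Active parameters in billions, taken from the table above.
active_params_b = {
    "Llama3-70B": 70, "Llama3-8B": 8,
    "Mistral-8*22B": 39, "Mistral-8*7B": 12.9, "Yuan2.0-M32": 3.7,
}
for name, p in active_params_b.items():
    infer = 2 * p        # ~GFLOPs/token for inference
    finetune = 3 * infer  # fine-tune column is 3x the inference column
    print(f"{name}: inference {infer:.1f}, fine-tune {finetune:.1f}")

# Efficiency column for Yuan2.0-M32: mean accuracy per GFLOPs/token.
print(f"79.15 / 7.4 = {79.15 / 7.4:.2f}")  # ~10.7, matching the table up to rounding
```

This also makes the headline 1/19th claim concrete: 140 / 7.4 ≈ 18.9.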

4. Quick Start

4.1 Environment Config

We strongly recommend using the latest release of the Yuan2.0-M32 Docker images. You can launch an instance of the Yuan2.0 container with the following commands:

docker pull yuanmodel/yuan2.0:m32
docker run --gpus all --privileged --ulimit stack=68719476736 --shm-size=1000G -itd \
    -v /path/to/yuan_2.0:/workspace/yuan_2.0 \
    -v /path/to/dataset:/workspace/dataset \
    -v /path/to/checkpoints:/workspace/checkpoints \
    --name your_name yuanmodel/yuan2.0:m32
docker exec -it your_name bash

4.2 Data Preprocess

We have provided the data preprocess script. See documentation here.

4.3 Model Pretrain

We've provided several scripts for pretraining in the examples. See the documentation here for details.

4.4 Inference Service

5. Statement of Agreement

The use of the source code in this repository requires compliance with the Apache 2.0 open source license. The Yuan2.0 model supports commercial use and does not require authorization; please understand and comply with the 《Yuan2.0 Model License Agreement》. Do not use the open source model and code, or derivatives of this open source project, for any purposes that may cause harm to the country and society, or for any services that have not undergone security assessment and filing. Although we have taken measures to ensure the compliance and accuracy of the training data, the model has a huge number of parameters and is affected by probability and randomness, so we cannot guarantee the accuracy of its output, and the model is easily misled by input instructions. This project does not assume any risks or responsibilities arising from data security or public opinion issues, or from the open source model and code being misled, abused, spread, or improperly utilized. You will be solely responsible for the risks and consequences arising from the use, copying, distribution, and modification of the model in this open source project.

6. Contact Us

If you have any questions, please raise an issue or contact us at [email protected]

7. Join Us

We are currently recruiting experts in large model framework development, inference performance optimization, and open-source community operations.

You can send your resume to [email protected] with the email subject: [Yuan Team Application] - [Your Name].

yuan2.0-m32's People

Contributors: huifu1018, iei-mjx, zhaoxudong01, hicollj, chendarrcy, ljg-ieisystem

yuan2.0-m32's Issues

Error when running inference with llama-cpp-python

Inference with llama-cpp-python fails with: unknown model architecture: 'yuan2_moe'
llama-cpp-python version: 0.2.76
Model being inferred: Yuan2-moe_0526_int4.gguf

Incorrect inference results when running Yuan2.0 with the vLLM inference framework with tensor parallelism enabled and enforce_eager=True

The inference script is as follows:

from vllm import LLM, SamplingParams
import time
from transformers import LlamaTokenizer

# The tokenizer here is used only to count generated tokens below.
tokenizer = LlamaTokenizer.from_pretrained('/yuan-2b-hf/', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>','<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

prompts = ["青岛旅游推荐?","长江有多长?"]
# top_k=1 with top_p=0 makes decoding effectively greedy.
sampling_params = SamplingParams(max_tokens=300, temperature=1, top_p=0, top_k=1, min_p=0.0, length_penalty=1.0, repetition_penalty=1.0, stop="<eod>", )

# tensor_parallel_size=4 with enforce_eager=True is the configuration that triggers the bug.
llm = LLM(model="/yuan-2b-hf/", trust_remote_code=True, enforce_eager=True, tensor_parallel_size=4, gpu_memory_utilization=0.8, disable_custom_all_reduce=True, max_num_seqs=2)

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()
total_tokens = 0
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    num_tokens = len(tokenizer.encode(generated_text, return_tensors="pt")[0])
    total_tokens += num_tokens
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

print("inference_time:", (end_time - start_time))
print("total_tokens:", total_tokens)

The output is as follows:

Prompt: '青岛旅游推荐?', Generated text: ' 青岛旅游推荐如下:\n1.\n- 青岛旅游推荐青岛旅游。青岛旅游。青岛旅游推荐?\n青岛旅游。青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2.青岛旅游推荐?\n2旅游推荐?\n2.青岛旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2.青岛旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2\n2.青岛旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2旅游推荐?\n2\n'
Prompt: '长江有多长?', Generated text: ' 长江的长度约为6,3,3,即6,即6,即长江有6,即6,即6,即长江的长度为6,即6,即长江的 6,即长江的 6,即6,即长江的 6,即长江的 6,即6,即6,即长江的 6,即6,即6,即长江的 6,即6,即长江的6,即6,即长江的 6,即6,即长江的 6,即6,即长江的,即6,即,即,即6,即长江的,即6,即6,即,即6,即6,即,即,即6,即6,即,即6,即,即6,即6,即,即6,即,即6,即6,即,即6,即6,即,即6,即,即6,即6,即,即6,即6,即6,即,即6,即,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6,即6'

vLLM inference fails to start: NameError: name 'tokenizer_revision' is not defined

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cu122-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 201, in <module>
    openai_serving_chat = OpenAIServingChat(engine, model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cu122-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 43, in __init__
    super().__init__(engine=engine,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cu122-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_engine.py", line 41, in __init__
    self.tokenizer = get_tokenizer(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cu122-py3.10-linux-x86_64.egg/vllm/transformers_utils/tokenizer.py", line 101, in get_tokenizer
    tokenizer_revision=tokenizer_revision,
NameError: name 'tokenizer_revision' is not defined
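The traceback shows get_tokenizer referencing tokenizer_revision without that name being bound in its scope. A minimal reproduction of the error pattern (hypothetical code, not vLLM's actual implementation):

```python
def get_tokenizer(model_name, revision=None, **kwargs):
    # Bug pattern: the parameter is named `revision`, but the body refers
    # to `tokenizer_revision`, which was never bound in this scope, so the
    # call raises NameError at the point of reference.
    return {"model": model_name, "tokenizer_revision": tokenizer_revision}

try:
    get_tokenizer("yuan2-m32")
except NameError as e:
    print(e)  # prints: name 'tokenizer_revision' is not defined
```

Such errors typically come from a rename or refactor that missed one usage, and are fixed by making the body use the parameter that is actually in scope.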
