
glm-130b's Introduction

🌐 Blog • ⏬ Download Model • 🪧 Demo • ✉️ Email • 📃 Paper [ICLR 2023]

💬 Google Group (Updates) or WeChat Group or Slack channel (Discussions)

GLM-130B: An Open Bilingual Pre-Trained Model

GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the General Language Model (GLM) algorithm. It is designed to support inference with the full 130B parameters on a single A100 (40G * 8) or V100 (32G * 8) server. With INT4 quantization, the hardware requirements can be further reduced to a single server with 4 * RTX 3090 (24G) with almost no performance degradation. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English), and it has the following unique features:

  • Bilingual: supports both English and Chinese.
  • Performance (EN): better than GPT-3 175B (+4.0%), OPT-175B (+5.5%), and BLOOM-176B (+13.0%) on LAMBADA and slightly better than GPT-3 175B (+0.9%) on MMLU.
  • Performance (CN): significantly better than ERNIE TITAN 3.0 260B on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE datasets (+12.75%).
  • Fast Inference: supports fast inference on both SAT and FasterTransformer (up to 2.5X faster) with a single A100 server.
  • Reproducibility: all results (30+ tasks) can be easily reproduced with open-sourced code and model checkpoints.
  • Cross-Platform: supports training and inference on NVIDIA, Hygon DCU, Ascend 910, and Sunway (Will be released soon).

This repository mainly focuses on the evaluation of GLM-130B. If you find our work and our open-sourced efforts useful, please ⭐️ the repository to encourage our future development! :)

News

  • [2023.06.25] Release of ChatGLM2-6B, an updated version of ChatGLM-6B which introduces Stronger Performance (MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%)), Longer Context (from 2K in ChatGLM-6B to 32K, trained with a context length of 8K during dialogue alignment), and More Efficient Inference (42% faster under the official implementation; the dialogue length supported by 6GB of GPU memory has increased from 1K to 8K). For more details, please refer to ChatGLM2-6B.
  • [2023.06.14] We release WebGLM, a research project that enables efficient and accurate web-enhanced question answering. All code and data are released!
  • [2023.03.14] We are happy to introduce ChatGLM, a bilingual dialogue language model based on GLM-130B, and its open-sourced version ChatGLM-6B, which can run with only 6GB of GPU memory!
  • [2023.01.21] GLM-130B has been accepted to ICLR 2023!
  • [2022.10.06] Our paper for GLM-130B is out!
  • [2022.08.24] We are proud to publish the quantized version of GLM-130B. While keeping the activations in FP16, the model weights can be quantized down to INT4 with almost no degradation in performance, further reducing the hardware requirements of GLM-130B to a single server with 4 * RTX 3090 (24G)! See Quantization of GLM-130B for details.

For smaller models, please find the monolingual GLMs (English: 10B/2B/515M/410M/335M/110M, Chinese: 10B/335M) and a 1B multilingual GLM (104 languages).

Getting Started

Environment Setup

Hardware

Hardware        | GPU Memory | Quantization | Weight Offload
8 * A100        | 40 GB      | No           | No
8 * V100        | 32 GB      | No           | Yes (BMInf)
8 * V100        | 32 GB      | INT8         | No
8 * RTX 3090    | 24 GB      | INT8         | No
4 * RTX 3090    | 24 GB      | INT4         | No
8 * RTX 2080 Ti | 11 GB      | INT4         | No

It is recommended to use an A100 (40G * 8) server, as all reported GLM-130B evaluation results (~30 tasks) can be easily reproduced with a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on a single server with 4 * RTX 3090 (24G) is possible; see Quantization of GLM-130B for details. Combining quantization and weight-offloading techniques, GLM-130B can also run inference on servers with even less GPU memory; see Low-Resource Inference for details.

Software

The GLM-130B code is built on top of SAT (SwissArmyTransformer). We recommend using Miniconda to manage your environment and installing additional dependencies via pip install -r requirements.txt. Here are the recommended environment configurations:

  • Python 3.9+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ / Apex (installation with CUDA and C++ extensions is required, see here)
  • SwissArmyTransformer>=0.2.11 is required for quantization
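
A minimal setup sketch (the environment name and Python version below are examples only; Apex still needs to be built separately with its CUDA and C++ extensions as noted above):

conda create -n glm-130b python=3.9 -y
conda activate glm-130b
pip install -r requirements.txt
# SwissArmyTransformer >= 0.2.11 is only needed if you plan to use quantization
pip install "SwissArmyTransformer>=0.2.11"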

Model weights

Download the GLM-130B model checkpoint from here, make sure all 60 chunks are downloaded completely, then use the following commands to merge them into a single archive file and extract it:

cat glm-130b-sat.tar.part_* > glm-130b-sat.tar
tar xvf glm-130b-sat.tar

Set CHECKPOINT_PATH in configs/model_glm_130b.sh to the path of the extracted folder. Since the checkpoint file is about 260 GB, using an SSD or RAM disk is recommended to reduce checkpoint loading time. The checkpoint we distribute is partitioned for 8-way tensor parallelism; a conversion script is also provided if you need to change the tensor parallel dimension.

python tools/convert_tp.py \
    --input-folder <SRC_CKPT_PATH>  \
    --output-folder <DST_CKPT_PATH> \
    --target-tp <TARGET_TP>
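
For example, to repartition the released 8-way checkpoint for a 4-GPU server (the paths below are placeholders; substitute your actual download and output locations):

python tools/convert_tp.py \
    --input-folder /data/glm-130b-sat \
    --output-folder /data/glm-130b-sat-tp4 \
    --target-tp 4

After conversion, point CHECKPOINT_PATH in configs/model_glm_130b.sh at the converted folder.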

Left-To-Right Generation / Blank Filling

bash scripts/generate.sh --input-source interactive

You can also specify an input file by --input-source input.txt.

GLM-130B uses two different mask tokens: [MASK] for short blank filling and [gMASK] for left-to-right long text generation. When the input does not contain any MASK token, [gMASK] will be automatically appended to the end of the text.
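
A small sketch of file-based input, assuming one prompt per line (the file name and prompts are just examples, reusing the samples below): the first line triggers blank filling via [MASK], while the second has no mask token, so [gMASK] is appended automatically for left-to-right generation.

cat > input.txt <<'EOF'
Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ).
Who is the greatest artist? The greatest artist is
EOF
bash scripts/generate.sh --input-source input.txt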

Left-To-Right Generation Examples

Example 1

Input: Who is the greatest artist? The greatest artist is [gMASK]

Output: Who is the greatest artist? The greatest artist is not the one who paints the most beautiful picture, or writes the best poem, or plays the best symphony. The greatest artist is the one whose work has the greatest impact on the lives of others. The Greatest Artist is the artist who inspires us to be our best selves. The Great Artist is not the artist whose work is flawless. The great artist is not one who is perfect. The true great artist has the ability to make us feel something. The artist who makes us feel is the great artist. The art of the great artists is to make you feel something that you have never felt before. The greatness of an artist is in the ability of their work to touch the lives and hearts of others, and to make them feel something they have never experienced before. This is the true greatness of the artist. 2. The Significance of the Great Artist The significance of the greatest artist cannot be overstated. The significance and impact of the art of a great artist cannot even be measured. The impact of a single work of art can change the course of history. A single work can change a person’s life. A work of great art can make you laugh, make you cry, make your heart race, or make you think. A great work of artistic expression has the power to change the way you look at the world. The work of a truly great artist can change your life forever. 3. The Impact of the Greatest Artist The greatest artists have the ability and the power of their art to change people’s lives. They have the power and the ability, through their art, to make people feel something new. The power of the work of the truly great artists cannot be measured, but it can be felt. The greater the impact of an artists work, the greater the power it has to change lives. 4. The Influence of the Art of the Artist The art and the work created by the greatest artists has the potential to change a life. The influence of the artists work cannot be quantified, but is felt by those who experience it. The truly great work has an ability to touch people’ lives in a way that no other work can.

Example 2 (Chinese)

Input: 问题:冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 回答: [gMASK]

Output: 问题:冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 回答:避寒的话,建议去海南三亚。 三亚位于海南岛的最南端,是中国最南部的热带滨海旅游城市,中国海滨城市,是中国空气质量最好的城市、全国最长寿地区(平均寿命80岁)。三亚市别称鹿城,又被称为"东方夏威夷",位居中国四大一线旅游城市"三威杭厦"之首,拥有全岛最美丽的海滨风光。三亚拥有被无数城市嫉妒的清新空气,柔和海滩。在沙滩悠闲散步、沐浴傍晚温和阳光,在海边玩耍,在雨林里面呼吸健康,欣赏自然奇观,一切都是那么令人享受。 气候 三亚地处热带,属热带季风气候区域,终年气温高,寒暑变化不大,年平均气温25.5度。三亚最热的月份是6月,平均气温为28.5°C,极端最高气温为35.7°C ,三亚没有明显的冬季,而且冬季更是到海滨旅游的最好时候,所以三亚是全年全天候避寒、消暑、度假、旅游的好地方。不过,三亚最旺的旅游时间是从10月份到农历春节,这个时候的人特别多,房间也不好订,建议最好避开高峰。三亚的三个旅游高峰期是"五一"、"十一"、"春节",这三个节日期间的房价都会上浮几倍,如果你选择了这三个假期来三亚的话要提前一个星期定房和咨询。

Blank Filling Examples

Example 1

Input: Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.

Output: Ng is an adjunct professor at Stanford University (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.

Example 2 (Chinese)

Input: 凯旋门位于意大利米兰市古城堡旁。1807年为纪念[MASK]而建,门高25米,顶上矗立两武士青铜古兵车铸像。

Output: 凯旋门位于意大利米兰市古城堡旁。1807年为纪念拿破仑胜利而建,门高25米,顶上矗立两武士青铜古兵车铸像。

Arguments Useful in Generation
  • --input-source [path] or "interactive" The path of the input file; it can also be "interactive", which launches a CLI.
  • --output-path [path] The folder in which to save the results.
  • --out-seq-length [int] The maximum sequence length for generation (including the context).
  • --min-gen-length [int] The minimum generation length for each MASK.
  • --sampling-strategy "BaseStrategy" or "BeamSearchStrategy" The sampling strategy to use.
    • For BeamSearchStrategy:
      • --num-beams [int] The number of beams.
      • --length-penalty [float] The length penalty applied during beam search.
      • --no-repeat-ngram-size [int] Prohibit repeated n-grams of this size.
      • --print-all-beam Print the generated results for all beams.
    • For BaseStrategy:
      • --top-k [int] Top-k sampling.
      • --top-p [float] Top-p (nucleus) sampling.
      • --temperature [float] The sampling temperature.
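
For illustration, several of the flags above can be combined in a single call to the generation script (the values here are arbitrary examples, not recommended defaults):

bash scripts/generate.sh \
    --input-source input.txt \
    --output-path ./samples \
    --out-seq-length 256 \
    --min-gen-length 16 \
    --sampling-strategy BaseStrategy \
    --top-p 0.7 \
    --temperature 0.9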

Evaluation

We use YAML files to define tasks. You can pass multiple tasks or folders at a time for evaluation, and the evaluation script will recursively collect all YAML files under those folders.

bash scripts/evaluate.sh task1.yaml task2.yaml dir1 dir2 ...

Download our evaluation dataset here, and set DATA_PATH in scripts/evaluate.sh to your local dataset directory. The tasks folder contains the YAML files for the 30+ tasks we evaluated for GLM-130B. Take the CoLA task as an example: run bash scripts/evaluate.sh tasks/bloom/glue_cola.yaml, which reports an accuracy of ~65% for the best prompt and ~57% for the median.

Expected Output
MultiChoiceTaskConfig(name='glue_cola', type=<TaskType.MULTICHOICE: 'mul'>, path='/thudm/LargeScale/data/zeroshot/bloom/glue_cola', module=None, metrics=['Accuracy'], use_task_mask=False, use_multitask_encoding=False, unidirectional=False, max_seq_length=2048, file_pattern={'validation': '**/validation.jsonl'}, micro_batch_size=8)
Evaluating task glue_cola:
  Evaluating group validation:
      Finish Following_sentence_acceptable/mul/validation.jsonl, Accuracy = 42.665
      Finish Make_sense_yes_no/mul/validation.jsonl, Accuracy = 56.951
      Finish Previous_sentence_acceptable/mul/validation.jsonl, Accuracy = 65.197
      Finish editing/mul/validation.jsonl, Accuracy = 57.622
      Finish is_this_correct/mul/validation.jsonl, Accuracy = 65.197
Evaluation results of task glue_cola:
  Group validation Accuracy: max = 65.197, median = 57.622, average = 57.526
Finish task glue_cola in 101.2s. 
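
For orientation only, a hypothetical minimal task definition might look like the sketch below. The keys are inferred from the MultiChoiceTaskConfig dump above and may not match the exact YAML schema, so treat every field name here as an assumption and consult Evaluate Your Own Tasks for the authoritative format.

# All keys below are inferred from the config dump, not verified against the real schema
cat > tasks/my_suite/my_task.yaml <<'EOF'
name: my_task
type: mul                     # multiple-choice task
path: my_suite/my_task        # relative to DATA_PATH
metrics: [Accuracy]
file-pattern:
  validation: "**/validation.jsonl"
micro-batch-size: 8
EOF
bash scripts/evaluate.sh tasks/my_suite/my_task.yaml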

Multi-node evaluation can be configured by setting HOST_FILE_PATH (required by the DeepSpeed launcher) in scripts/evaluate_multiple_node.sh. Set DATA_PATH in scripts/evaluate_multiple_node.sh and run the following command to evaluate all the tasks in the ./tasks directory.

bash scripts/evaluate_multiple_node.sh ./tasks
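
A DeepSpeed hostfile lists one node per line in the form "<hostname> slots=<num_gpus>". A minimal sketch (the host names are placeholders):

cat > hostfile <<'EOF'
node1 slots=8
node2 slots=8
EOF

Then set HOST_FILE_PATH in scripts/evaluate_multiple_node.sh to the path of this file.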

See Evaluate Your Own Tasks for details on how to add new tasks.

2.5X faster Inference using FasterTransformer

By adapting the GLM-130B model to FasterTransformer, a highly optimized transformer model library by NVIDIA, we can reach up to a 2.5X speedup in generation; see Inference with FasterTransformer for details.

License

This repository is licensed under the Apache-2.0 license. The use of GLM-130B model weights is subject to the Model License.

Citation

If you find our work useful, please consider citing GLM-130B:

@article{zeng2022glm,
  title={Glm-130b: An open bilingual pre-trained model},
  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
  journal={arXiv preprint arXiv:2210.02414},
  year={2022}
}

You may also consider citing GLM's original work:

@inproceedings{du2022glm,
  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={320--335},
  year={2022}
}

glm-130b's People

Contributors

duzx16, erjanmx, prnake, sengxian, sleepychord, xiao9905


glm-130b's Issues

Inference with 3090*16

Hi,

I want to deploy GLM-130B to two 3090 * 8 nodes for inference (3090*16).

I think the memory is enough, but I'm not familiar with distributed inference.

Maybe I need to do the following things:

  1. model parallel and pipeline parallel
  2. a distributed API server
  3. ...

Could you provide me with some ideas or materials?

Thanks.

INT4 version

Is there a download link for the INT4 version of GLM-130B?

Tensor parallel dimension conversion script fails

Hello!

It seems the script for converting the tensor parallel dimension fails

Running for instance

python tools/convert_tp.py --input-folder "../glm/glm-130b-sat" --output-folder "../glm/four-div-glm-130b-sat" --target-tp 4

Yields

Traceback (most recent call last):
  File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 154, in <module>
    main(args)
  File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 149, in main
    torch.save(create_checkpoint(sd_list, i, original_tp, args.target_tp, args.quantization_bit_width), save_path)
  File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 121, in create_checkpoint
    new_sd[key], new_sd[f"{key}_scale"] = new_sd[key]
ValueError: too many values to unpack (expected 2)

Any advice here? Thanks 🙏🏻

Exception when running the FasterTransformer demo

Thank you for your awesome work!
When I follow the steps provided here, I hit the following exception:

Traceback (most recent call last):
  File "/FasterTransformer/examples/pytorch/glm/glm_server.py", line 101, in <module>
    if not glm.load(ckpt_path=args.ckpt_path):
  File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 319, in load
    is_load = self.weights.load(ckpt_path, tensor_para_rank=self.tensor_para_rank,
  File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 190, in load
    scale.extend([module[f'transformer.layers.{i}.attention.query_key_value.weight_scale'].reshape(head_num, num_splits, size_per_head).permute(1, 0, 2).reshape(3, local_dim) for i in range(layer_num)])
  File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 190, in <listcomp>
    scale.extend([module[f'transformer.layers.{i}.attention.query_key_value.weight_scale'].reshape(head_num, num_splits, size_per_head).permute(1, 0, 2).reshape(3, local_dim) for i in range(layer_num)])
KeyError: 'transformer.layers.0.attention.query_key_value.weight_scale'

It seems that the state_dict is missing some keys.

Question about sample concatenation during training

Hi,

Thanks for your work and open-source!

There's one point I'm confused about: In your paper (section 2.3, last paragraph), you said

For the [MASK] and multi-task objectives, we use a context window of 512 and concatenate four samples together to cater the 2,048-sequence-length

I wonder if there is a special attention mask to ensure that each sample only attends to itself and not to other samples?
(e.g., something like a block-diagonal attention mask as in the screenshot below, where each block corresponds to one sample?)
[screenshot: block-diagonal attention mask]

Otherwise it would be strange to concatenate multiple independent samples together just for computational efficiency, or am I missing something here? (There is no training code in the repo yet.)

The config for glm-10b

I need to use glm-10b with scripts/generate.sh, so I set MODEL_TYPE='glm-10b' in the config file in configs.

However, I still get errors: [Errno 2] No such file or directory: 'XXX/glm-10b-en/126000/mp_rank_01_model_states.pt' and [Errno 2] No such file or directory: 'XXX/glm-10b-en/126000/mp_rank_02_model_states.pt', maybe because the glm-130b setting is still being used.

How should the config file in configs be modified to use glm-10b instead of glm-130b?

Looking forward to your reply. Thanks.

Does GLM-130B support newline (\n)?

I found that the tokenizer removes newlines (\n) by default. Is '\n' included in the training corpora?
I was trying to use '\n' to separate multiple samples (few-shot learning), and since I am comparing with other models it is better not to change the prompt. Is it recommended to set the tokenizer's ignore_linebreak=False, in which case '\n' is encoded as 20004?
Thank you very much!

Does GLM-130B support [sMASK] for sentence generation?

I'm developing a chat bot on top of GLM-130B.

Currently I'm using "[MASK]" at the end of the dialogue to generate the bot's response.
[gMASK] is too slow for me on my 8 * V100 server.

Your GLM repo https://github.com/THUDM/GLM reports that [sMASK] can be used for sentence generation.
But I didn't find any documentation about it in this repo. Does GLM-130B support [sMASK] for sentence generation?

Do you have any plans to expose more APIs for GLM-130B, such as computing LM perplexity, multiple-choice selection, or other features? Since you have already tested the model on FewCLUE, there must be ways to use those features.

Building an API for GLM-130B

Hi!
I am trying to build an API for the GLM-130B model. So far, I have tried to run the GLM model and a FastAPI server from the generate.sh script without success. I also tried to load the GLM model on FastAPI's startup event, again without success. Is there a way I can use the model to generate responses through an API?
Thanks

Good replacement for `\n`

Related: #17

Hi, since \n characters are ignored, what would be the next best option to use instead when prompting GLM with in-context examples?

For example, for other models where \n is not ignored, we input prompts that look like this:

Passage: The triangle is above the red sphere.
The pink rectangle is to the left of the red sphere.
Question: Is the triangle to the left of the pink rectangle?
Answer: no

Passage: The chest is bigger than the suitcase.
The box is bigger than the suitcase.
The chest fits inside the box.
The suitcase is bigger than the box of chocolates.
The container fits inside the box.
Question: Does the suitcase fit in the box?
Answer: yes

Passage: Mary travelled to the bedroom.
Daniel travelled to the office.
Daniel journeyed to the hallway.
Mary travelled to the hallway.
Sandra travelled to the kitchen.
Mary travelled to the kitchen.
John journeyed to the garden.
Daniel went to the bathroom.
Question: Where is Sandra?
Answer: kitchen

Passage: The hallway is west of the kitchen.
The office is east of the kitchen.
Question: What is the kitchen west of?
Answer: office

Passage: This morning Fred moved to the school.
Julie went back to the cinema yesterday.
Mary travelled to the bedroom yesterday.
Fred journeyed to the bedroom yesterday.
Bill travelled to the kitchen yesterday.
This afternoon Fred journeyed to the office.
Fred travelled to the park this evening.
Mary went to the office this morning.
This afternoon Mary went back to the cinema.
This morning Julie travelled to the office.
Question: Where was Mary before the office?
Answer: bedroom

Passage: The hallway is north of the office.
The bathroom is south of the office.
Question: What is north of the office?
Answer:

I was wondering what the best practice for prompt construction for GLM was, especially for the case where there are in-context examples.

Why is GPU utilization 100% with no input text for generation?

With no input text provided for generation, why is GPU utilization at 100%?

Fri Oct 21 11:05:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   38C    P0    53W / 300W |  20392MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   41C    P0    66W / 300W |  20392MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   40C    P0    59W / 300W |  20248MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   40C    P0    67W / 300W |  20248MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14678      C   /opt/conda/bin/python           20389MiB |
|    1   N/A  N/A     14679      C   /opt/conda/bin/python           20389MiB |
|    2   N/A  N/A     14682      C   /opt/conda/bin/python           20245MiB |
|    3   N/A  N/A     14686      C   /opt/conda/bin/python           20245MiB |
+-----------------------------------------------------------------------------+

1xA100 80GB inference in INT4?

Thanks for making such a powerful model widely available! Very impressive work to get it to run on a single node using all open source methods.

I took it for a spin on an 8x A100 40GB machine and got some nice results.

Have you tried running the model on a single A100 80GB or an H100? Can it run without off-loading the weights to CPU?

I looked at the low resource info and did some simple calculations and it looks like

  1. The FP16 model has 260GB of weights and runs smoothly on 320GB of VRAM (eg an 8x A100 40GB or 4x A100 80GB).
  2. The INT4 model has 65GB of weights, so it should run smoothly on 65 * 320/260 = 80 GB VRAM.

If that's the case, it'd be great to know because single-card setups are even easier to work with than single-node, and the H100s are coming soon.

How to cite the repo

Hi there!

Thanks for the great work!
I was wondering how we can cite this work?

Thanks

Mismatch error when loading the INT4 model

When I load the INT4 model, I get the following error.
The run command is: bash scripts/generate.sh --input-source input.txt
I am using two A6000 GPUs (2 * 48 GB).

Traceback (most recent call last):
  File "/ssd1/xingyum/GLM-130B/generate.py", line 210, in <module>
    main(args)
  File "/ssd1/xingyum/GLM-130B/generate.py", line 156, in main
    model, tokenizer = initialize_model_and_tokenizer(args)
  File "/ssd1/xingyum/GLM-130B/initialize.py", line 72, in initialize_model_and_tokenizer
    load_checkpoint(model, args)
  File "/home/xingyum/anaconda3/envs/vis/lib/python3.10/site-packages/SwissArmyTransformer/training/model_io.py", line 181, in load_checkpoint
    missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
  File "/home/xingyum/anaconda3/envs/vis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GLM130B:
        size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([18816, 12288]) from checkpoint, the shape in current model is torch.Size([75264, 12288]).
        size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([4608, 12288]) from checkpoint, the shape in current model is torch.Size([18432, 12288]).
        size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([4608]) from checkpoint, the shape in current model is torch.Size([18432]).
        size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 6144]).

BIG-Bench evaluation?

BIG-Bench (paper, code) is a large and diverse collaborative benchmark testing multiple capabilities of LLMs. I think it would be very beneficial to the community to see an evaluation of GLM on this benchmark.

Generate script

Can I apply the generation script scripts/generate.sh to the GLM-10B Chinese checkpoint in the GLM repository?

INT8 inference

Hi, in your paper you describe storing the weights in INT8 but casting them to FP16 for computation. I was wondering whether, at inference time, you actually compute in INT8 (rather than FP16), given that you are using FasterTransformer, which has kernels that use INT8 tensor cores, to obtain a speed improvement.

GLM-10B and GLM-130B

Hi, I see that GLM-130B incorporates instruction tuning in the style of ExT5. Does GLM-10B also include instruction tuning?

FasterTransformer benchmark-generation.sh bug

I tried to run the GLM FasterTransformer benchmark-generation.sh (without loading a model checkpoint), but encountered the following error:

CUDA error: invalid argument
Exception raised from alloc_block at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1037 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f8738f8063c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x25dd2 (0x7f8738fdfdd2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2b278 (0x7f8738fe5278 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2cd8c (0x7f8738fe6d8c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2d2f8 (0x7f8738fe72f8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x103 (0x7f873c3e00a3 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x35079fb (0x7f873c5179fb in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x3507a8f (0x7f873c517a8f in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x1d5c77f (0x7f878593677f in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::empty_memory_format::call(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1e5 (0x7f87856e3ac5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::empty(c10::ArrayRef<long>, c10::TensorOptions, c10::optional<c10::MemoryFormat>) + 0x1d3 (0x7f86cdd75643 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #11: fastertransformer::Allocator<(fastertransformer::AllocatorType)2>::malloc(unsigned long, bool) + 0xe6 (0x7f86cdd89046 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #12: fastertransformer::GlmContextDecoder<__half>::allocateBuffer() + 0x70 (0x7f86cddc6a00 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #13: fastertransformer::GlmContextDecoder<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, std::vector<fastertransformer::GlmDecoderLayerWeight<__half>*, std::allocator<fastertransformer::GlmDecoderLayerWeight<__half>*> > const*, fastertransformer::LayerNormWeight<__half> const*) + 0x1f0 (0x7f86cddcb8e0 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #14: fastertransformer::Glm<__half>::encode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) + 0x1517 (0x7f86cdda6f07 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #15: torch_ext::FTGlm<__half>::encode(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int) + 0xd44 (0x7f86cdd90134 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #16: torch_ext::GlmOp::encode(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, long) + 0x10f (0x7f86cdd6e31f in /root/FasterTransformer/build/lib/libth_glm.so)
frame #17: <unknown function> + 0x7344b (0x7f86cdd8a44b in /root/FasterTransformer/build/lib/libth_glm.so)
frame #18: <unknown function> + 0x69ee6 (0x7f86cdd80ee6 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #19: PyCFunction_Call + 0x54 (0x55f91235f914 in /opt/conda/bin/python)
frame #20: _PyObject_MakeTpCall + 0x31e (0x55f912362ebe in /opt/conda/bin/python)
frame #21: <unknown function> + 0x1b85de (0x55f9123e85de in /opt/conda/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x4d33 (0x55f9124043c3 in /opt/conda/bin/python)
frame #23: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #24: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #25: <unknown function> + 0x1b848c (0x55f9123e848c in /opt/conda/bin/python)
frame #26: PyObject_Call + 0x5e (0x55f912351b6e in /opt/conda/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x21bf (0x55f91240184f in /opt/conda/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #29: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #30: _PyObject_FastCallDict + 0x2fd (0x55f9123d1d2d in /opt/conda/bin/python)
frame #31: _PyObject_Call_Prepend + 0xcf (0x55f9123d229f in /opt/conda/bin/python)
frame #32: <unknown function> + 0x1a2329 (0x55f9123d2329 in /opt/conda/bin/python)
frame #33: _PyObject_MakeTpCall + 0x31e (0x55f912362ebe in /opt/conda/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x55f5 (0x55f912404c85 in /opt/conda/bin/python)
frame #35: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #36: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x947 (0x55f9123fffd7 in /opt/conda/bin/python)
frame #38: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #39: PyEval_EvalCodeEx + 0x39 (0x55f9123e7499 in /opt/conda/bin/python)
frame #40: PyEval_EvalCode + 0x1b (0x55f912482ecb in /opt/conda/bin/python)
frame #41: <unknown function> + 0x252f63 (0x55f912482f63 in /opt/conda/bin/python)
frame #42: <unknown function> + 0x26f033 (0x55f91249f033 in /opt/conda/bin/python)
frame #43: <unknown function> + 0x274022 (0x55f9124a4022 in /opt/conda/bin/python)
frame #44: PyRun_SimpleFileExFlags + 0x1b2 (0x55f9124a4202 in /opt/conda/bin/python)
frame #45: Py_RunMain + 0x36d (0x55f9124a477d in /opt/conda/bin/python)
frame #46: Py_BytesMain + 0x39 (0x55f9124a4939 in /opt/conda/bin/python)
frame #47: __libc_start_main + 0xf3 (0x7f87d07a30b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #48: <unknown function> + 0x1e8f39 (0x55f912418f39 in /opt/conda/bin/python)

Following is my environment:

  • docker image: nvcr.io/nvidia/pytorch:21.09-py3 or nvcr.io/nvidia/pytorch:22.05-py3
  • GPU: 3090(Driver Version: 470.57.02 CUDA Version: 11.4) or A100(Driver Version: 470.57.02 CUDA Version: 11.7)
  • CUDA_LAUNCH_BLOCKING=1

4x 80gb A100 vs 8x 40gb A100

GCP prices 8x 40gb A100's at 50% more than 4x 80gb A100's. Would I be able to accomplish the same results with a little tweaking of the default config?

Hugging Face transformers integration

Greetings,

Are there any plans for integrating GLM-130B into the transformers library? (It seems only the smaller glm-10b is available at the moment.)

We are trying to use the generated output to send additional queries to the model in batch mode, and the current setup of the generate.sh script is difficult to integrate with existing code, at least compared to BLOOM and similar models.

Thanks,

Alfredo

Inference with FasterTransformer with GLM-130B

Hi!
I am trying to configure the GLM-130B model with FasterTransformer, and I need to convert the GLM checkpoint files, so where can I get the model_optim_rng.pt file?
I'm also facing this error:
CMake Error at cmake/Modules/FindNCCL.cmake:153 (message): Found NCCL header version and library version do not match! (include: /home/ubuntu/anaconda3/envs/glm/include, library: /home/ubuntu/anaconda3/envs/glm/lib/libnccl.so) Please set NCCL_INCLUDE_DIR and NCCL_LIB_DIR manually. Call Stack (most recent call first): CMakeLists.txt:41 (find_package)
while trying to build with this command: cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..

My main goal is to minimize inference time. I also configured the THUDM/GLM-130B main branch and set MAX_OUTPUT_LENGTH=64; it takes about 55 s to generate a response.
Machine Specs: (V100) 8 * 32GB
Thanks

FasterTransformer conda issue

Thanks a lot for sharing the code.
I followed the steps mentioned here for running it locally without docker, but I am getting the following error.

Traceback (most recent call last):
  File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/glm_server.py", line 105, in <module>
    glm.init_model(512,# output_len,
  File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 375, in init_model
    self.cuda()
  File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 359, in cuda
    self.model = self.Glm(get_torch_default_comm(), self.rank, self.head_num, self.size_per_head, self.head_num * self.size_per_head * 8 // 3,
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. libth_glm.Glm(arg0: c10d::ProcessGroupNCCL, arg1: int, arg2: int, arg3: int, arg4: int, arg5: int, arg6: int, arg7: int, arg8: int, arg9: int, arg10: int, arg11: int, arg12: int, arg13: List[at::Tensor], arg14: List[at::Tensor], arg15: List[at::Tensor])

Run generate.sh with "model_glm_130b_int4.sh" configuration, still reporting an error, memory 157G (physical memory) + 195G (virtual memory, swap), 4*V100 graphics card.

Run generate.sh with "model_glm_130b_int4.sh" configuration, still reporting an error, memory 157G (physical memory) + 195G (virtual memory, swap), 4*V100 graphics card.

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
/workspace/generate.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-10-20_08:16:53
  host      : 8bdf70b6de4a
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 1286)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1286
=====================================================

Using GLM-130B for machine translation

Hi!
After reading a lot of material, it seems that in machine translation it is more common to fine-tune with a small amount of parallel corpus, which may work better for some low-resource languages; however, it seems difficult to improve performance for languages with rich corpora.

I have checked the GLM papers and found no performance analysis on machine translation. Is it possible to use GLM-130B to improve English-Chinese machine translation? Are there any experiments or best practices for this?

How to use the code for multinode inference?

Hi really appreciate the great work!

I am wondering, is there a straightforward way to adapt the code for multinode inference?

I have 3 A100 machines, each with 3 GPUs of 40 GB memory.

Does this code naturally support multi-node inference? If so, where in the code should I change it?

Thanks!

GLM-130B + CodeGeeX

Hello, I have tried GLM-130B and CodeGeeX and the results are impressive. Have you considered combining the two into one model, for example by continuing pre-training of GLM-130B on the CodeGeeX dataset?

language model evaluation at idx = 0

Hi, I'm still looking into computing perplexity with GLM.

I just looked into the recent updates to evaluation/{dataset.py, tasks.py} regarding the language model task.

The code inside dataset.py:LanguageModelTaskDataset:297 is

        if idx == 0 or self.config.unidirectional:
            prompt, text = tokens[:1], tokens[1:]
        else:
            prompt_length = self.config.max_seq_length - 1 - self.config.generation_length
            prompt, text = tokens[:prompt_length], tokens[prompt_length:]

        # ..... skip ....
        return {
            "tokens": np.array(prompt + [mask_id, sop_id] + text[:-1], dtype=np.int64),
            "targets": np.array(prompt + [mask_id] + text, dtype=np.int64),
            "position_ids": np.arange(0, seq_length, dtype=np.int64),
            "attention_mask": attention_mask < 0.5,
            "loss_masks": np.array([0] * (len(prompt) + 1) + [1] * len(text), dtype=np.int64),
        }

At idx == 0, you take the full text as the prompt input and also as the output text.
This would lead to an artificially lower PPL, because the model has a full view of what it needs to predict.
Why not set the prompt to an empty list?

Some questions about the training logs

I have the following questions:

  1. Is LargeScale an open-source toolkit? I could not find any direct information about it via search engines or GitHub.
  2. In the throughput tests, a larger global batch size gives higher throughput, so why was 4224 chosen in the end? Also, BSZ = 176 * 24 = 4224, where 24 is exactly the data-parallel size; does 176 require gradient accumulation? Does gradient accumulation behave noticeably differently for large models compared with small ones?
  3. As quoted below, both the Chinese and English data are plain text. Did you switch back from the multi-task data to the original Chinese/English data, and reshuffle the whole dataset? Wouldn't that cause the model to train on duplicate data? Does this reshuffle really have such a large effect on training stability for large models?

The analysis is that the distribution shift may still be too drastic, so first switch back to plain text + reshuffle and try training again

  4. What does warmup-samples-after-loading do? Is it a gradual transition from balanced multi-task sampling to the weighted multi-task distribution?
  5. When the loss "explodes", does the loss become NaN, or does it just suddenly jump by orders of magnitude?

BIG-bench-lite evaluation code

Hi, thanks for the great work!

Is there a plan to share the code and data you used specifically for evaluating BIG-bench-lite?

It may be important for reproducing the results, given the decision points around prompt design, etc.
