s-jol / open-llama Goto Github PK

The complete training code of the open-source high-performance Llama model, including the full process from pre-training to RLHF.

Home Page: https://huggingface.co/s-JoL/Open-Llama-V2

License: MIT License

Python 93.80% Shell 6.20%

open-llama's People

Contributors

Stargazers

Watchers

Forkers

randaldkennedy

open-llama's Issues

Model Checkpoint + Adapter fine-tune?

This is sort of 2 questions:

Are the weights available anywhere, do you intend to make them available anywhere soon? (eg. on huggingface?)

Second: is it possible to introduce some recent fine-tuning work to this? (eg. specifically https://github.com/ZrrSkywalker/LLaMA-Adapter would be really great)

Personal Interest: I'd love to be able to have an easy to use starting place to get weights + apply custom fine-tuning with multi-modal inputs

基于预训练模型在自有数据进行instruct-tuning效果问题

您好，我在您的V1版预训练模型上做SFT，用2批数据，对比在BLOOM 3B上跑SFT，结果，您这个发布的预训练基座，效果奇差，不知道是什么原因？您这边有什么建议吗

我现在开始用stage3来train，不过刚开始就卡住，看显存和内存，不像是不够的样子。这是为什么呢？
accelerate launch --config_file configs/accelerate_configs/ds_stage3_offload.yaml train_lm.py --train_config configs/pretrain_config.yaml --model_config configs/model_configs/7B.json

I0510 07:28:04.417845 139863945226048 distributed_c10d.py:337] Waiting in store based barrier to initialize process group for rank: 2, key: store_based_barrier_key:2 (world_size=8, worker_count=7, timeout=0:30:00)
I0510 07:28:04.428285 140051304576832 distributed_c10d.py:337] Waiting in store based barrier to initialize process group for rank: 7, key: store_based_barrier_key:2 (world_size=8, worker_count=7, timeout=0:30:00)

How to do V2.0 pre-training?

Hi there,

Thanks a lot for the excellent work V2.0 release.

Could you please tell me if we need to re-process all data from scratch, since the data format got changed?

What are the scripts that we should run sequentially?

That's to say, what data preparation steps(scripts) shall we run before executing the following command?

accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/pretrain_config.yaml

Thanks again!

v2的checkpoint好像是pretrain模型，不是instruct tuning后的模型

https://huggingface.co/s-JoL/Open-Llama-V2，我使用了v2的checkpoint，回复如下所示，传上来的应该是预训练模型，麻烦作者大大更新一下模型。

what ‘s difference between 10w_vocab_wudao5_pile10.model and llama_tokenizer_extended.model

关于预训练中的eval阶段

您好，关于预训练过程中的evaluate阶段的代码，我有三个问题希望能得到您的解答：

预训练过程中的evaluate起到的作用是什么？我看您在pretrain_llama.py里的evaluate仅仅只是把输出结果通过wandb进行记录，似乎并没有涉及到计算准确率和进行early-stop。那是不是不加evaluate阶段也可以？
想问一下如果不进行early-stop，是否会导致预训练阶段的过拟合？
我在3090上尝试用您的pretrain代码预训练一个小模型，仅使用单卡时代码能够正常跑通，8卡运行时会在evaluate阶段的model.generate处卡住，想问一下这种会是由于什么原因导致的？
我所使用的模型参数为：
hidden_size = 1024
num_hidden_layers = 24
num_attention_heads = 16
intermediate_size = 2048
use_stable_embedding = False

希望能得到您的解答，十分感谢

Can this train regular llama models or just openllama?

Question in title ^

加载huggingface上的checkpoint时提示KeyError: 'open-llama'

运行代码：
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("s-JoL/Open-Llama-V1", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("s-JoL/Open-Llama-V1").cuda()

inputs = tokenizer('user:implement quick sort in python\nsystem:', return_tensors='pt', return_attention_mask=False)
for k, v in inputs.items():
inputs[k] = v.cuda()
pred = model.generate(**inputs, max_new_tokens=512, do_sample=True)
print(tokenizer.decode(pred.cpu()[0]).strip())

报错：
Traceback (most recent call last):
File "/data/rooter_use/Open-Llama/test.py", line 4, in
model = AutoModelForCausalLM.from_pretrained("s-JoL/Open-Llama-V1").cuda()
File "/data/wangyanbing/.conda/envs/cyz_py39/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 441, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/data/wangyanbing/.conda/envs/cyz_py39/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 937, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/data/wangyanbing/.conda/envs/cyz_py39/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 643, in getitem
raise KeyError(key)
KeyError: 'open-llama'

transformers是通过您在README里的pip install git+https://github.com/s-JoL/transformers.git@dev来进行的安装
猜测是因为config.json里的architectures和model_type与transformers里的字典不匹配？

Open-Llama-V2-pretrain使用train_lm.py进行Instruction-Tuning报错：RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx

Open-Llama-V2-pretrain使用train_lm.py进行Instruction-Tuning，命令CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' accelerate launch --config_file configs/accelerate_configs/ds_stage1.yaml train_lm.py --train_config configs/instruct_config.yaml --model_config configs/model_configs/7B.json，运行失败

具体报错这样的

Traceback (most recent call last):
File "/home/fenbi/machao/git/Open-Llama/train_lm.py", line 118, in
app.run(main)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/fenbi/machao/git/Open-Llama/train_lm.py", line 114, in main
trainer.train()
File "/home/fenbi/machao/git/Open-Llama/solver/trainer.py", line 157, in train
losses = self.train_step(batch)
File "/home/fenbi/machao/git/Open-Llama/solver/trainer.py", line 124, in train_step
out = self.model(**batch)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1724, in forward
loss = self.module(*inputs, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 726, in forward
outputs = self.model(
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 318, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 211, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

环境：
torch==1.13.1
Cuda==11.6

关于v2

Hi，

很高兴看到性能在v2能有大幅的提升。有几个疑问。

本次更新主要包含以下几个方面，相对于v1版本提升有效训练速度50%，其中pad从30%减少至5%，训练速度从3200token/s提升至3600token/s。

这里pad从30%减少至5%哪里可以看的出来？有打印可以debug相关的pad百分比吗？
训练速度是xx token/s，这个是从耗时计算出来的吗？在代码中可以有相关的debug代码吗？个人觉得这样更加直观一些。

关于训练开销

请问你这边在finetune llama的时候，各个参数如7B，13B，30B各个节点所用资源的数量，有统计过吗？
比方单卡时gpu显存，cpu内存占用？八卡时gpu显存，cpu内存占用？

我现在发现全参数finetune的时候cpu内存占用特别厉害，不知道你这边有没有什么特别的优化没？

如何将预训练好的pt文件转化为bin格式的文件？

如何将预训练好的pt文件转化为transformers可以读取的bin格式的文件？

33B和65B模型训练

作者大大
我的机器配置是8x80G A100, 内存512G
启动33b模型使用命令
accelerate launch --main_process_port 30001 --config_file configs/accelerate_configs/ds_stage3.yaml train_lm.py --train_config configs/pretrain_config.yaml --model_config configs/model_configs/33B.json 会oom
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.83 GiB (GPU 3; 79.20 GiB total capacity; 67.98 GiB already allocated; 1.75 GiB free; 76.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CON
启动65b模型
accelerate launch --main_process_port 30001 --config_file configs/accelerate_configs/ds_stage3_offload.yaml train_lm.py --train_config configs/pretrain_config.yaml --model_config configs/model_configs/65B.json
会内存爆炸卡死

dataset download fail

两个数据集下载都失败了。the pipe和wudao
update-alternatives: warning: skip creation of /usr/share/man/man1/unrar.1.gz because associated file /usr/share/man/man1/unrar-nonfree.1.gz (of link group unrar) doesn't exist
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

有hugging face的下载链接不？或者能否提供其他的下载链接？
感谢您的分享

支持基于 llama-7b权重做二次增量预训练吗？

请问支持基于 llama-7b权重做二次增量预训练吗？支持的话，有相关步骤吗

zero3 如何训练65b模型

zero3应该不支持pipeline parallelism, 65B单卡能训练起来吗

这个代码库可以支持训练更大的模型吗？比如60B？

instruct finetune报错，初始化的模型和加载的83200.pt好像结构上对不齐

RuntimeError: Error(s) in loading state_dict for OpenLlamaForCausalLM:
Missing key(s) in state_dict: "lm_head.weight".
Unexpected key(s) in state_dict: "model.embed_layer_norm.weight", "model.embed_layer_norm.bias".

training speed questions: multi-node training, CPU offload

First of all, thank you for this project! I had some questions about how training was done, as I've struggled to scale up training larger model sizes when using transformers + deepspeed on the original llama weights.

I'm curious if you've tested multi-node distributed training? From what I understand, all the training speed numbers listed in your 2023-05-08 v2.1 release are from training on a single node with 8x A100 80GB. Is that correct?
Could you share other aspects of the machine configuration (eg, number of CPUs and amount of CPU memory)? In particular, when using CPU offloading for 65B, how much CPU memory was needed?
From the v2.1 release notes, it looks like you have very close to linear slowdown when increasing the model size. It's surprising to me that, even as you have to move from DS-1 to DS-3 (for 13, 33, 65B) and enable CPU offloading (for 65B), you don't get more of a slowdown due to additional overhead. Can you comment on how you were able to achieve this?

Thank you!

How to tell if data are downloaded correctly?

Downloading Pile and Wudao data needs a lot of time. During this long time period, network connection can break and various types of errors can occur.

For example, I am getting the following error message and not sure if it is okay:
"curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOAL_ERROR(err 1)"

Is there a way to double-check and ensure the data are downloaded correctly?

Is zero2&zero3 with cpuoffload tested?

I am wondering the performance but I have no gpu resource to test it.

Released model on demo

Hi there,

thank you for your work! I see some performance divergence between your online demo and my local model instance. Are those the same (Open-Llama-V1)?

wudao数据集的下载预处理脚本问题

首先是下载链接我试了自己账号申请的链接无法下载，只能用scidb的链接，不需要登录，然后用curl下载老是出错（下完了文件md5不一致，也没法解压），就换成了wget，终于下载成功。我用的下载代码是（没有循环）：

wget -v -c 'https://download.scidb.cn/download?fileId=63a30383fed6a8a9e8454302&dataSetType=organization&fileName=WuDaoCorporaText-2.0-open.rar' -O data/WuDaoCorpus2.0_base_200G.rar

然后解压的命令没有指定保存路径，如果是在项目根目录运行这个sh文件的话会解压到根目录里（Open-LLama/WuDaoCorpus2.0_base_200G/）。需要将其移到data文件里，或者修改data/preprocess_wudao.py里的路径。
另外pile真的很难下（还得翻墙）……

中文预训练模型的精度有评测吗？

@s-JoL 非常感谢分享这么好的工作，中文预训练模型真的好稀缺。我想了解一下，目前的预训练模型，有在一些评测集上测试过指标吗？

where is model implement？

how to modify model structure

感谢大佬

看了那么多代码，就感觉你给的代码最好，让我成功搞定了很多问题

关于V2版代码数据处理的两个问题

您好，我在使用您的V2版本代码时，遇到如下两个问题，应该是数据处理过程中存在一些没有处理的异常情况：

问题1:
pretrain_transform函数batch["text"] = [batch["title"][0] + "\n" + batch["content"][0]]报错：
TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'str'
可能的出错原因：
load_dataset加载数据集时并没有把所有的数据都转成str格式
解决方法：改成batch["text"] = [str(batch["title"][0]) + "\n" + str(batch["content"][0])]后，暂时未遇到类似的问题

问题2:
遍历dataloader时，dataloader._process_data(data)报错，最终定位到torch/utils/data/_utils/collate.py128行，
return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
报错KeyError: 'id'
可能的出错原因：
您的v2版代码里，pretrain_transform返回的是字典，而非像v1版一样返回tokenize结果，因此wudao和pile数据返回的key不一致，导致出错。
解决方法：
可能需要修改一下preprocess_the_pile和preprocess_wudao的代码？在数据预处理的时候就将两份数据的key进行对齐

CUDA 11.7

当前代码版本里的FusedAdam依赖于CUDA11.7，请问能实现较低版本的cuda依赖嘛？比如cuda10.2和cuda11.5

关于开源协议

请问下这里的预训练权重是使用的原始llama的协议（受限协议），还是独立的预训练权重呢（那么是哪种开源协议）

可以share没有经过instruction tuning的checkpoint吗

instruction tuning使用了chatgpt的数据，不太符合openai的规范

有计划加入RedPajama-Data-1T相关数据训练模型吗？

RedPajama-Data-1T 宣称自己对标LLAMA的数据，应该是比pile要好不少的，有计划使用来训练及发布预训练好的checkpoint嘛

Thailand example in readme.md is a pure hallucination

There is no "Lumphini beach" and there is no monkey forrest to play in Bangkok. It seems like a bad example to have as main example for the repo. Maybe one that spits out accurate data

v2版训练代码在forward阶段报错

您好，我今天在尝试您的v2版训练代码时，遇到了如下的错误：

Traceback (most recent call last):
File "/data/rooter_use/cyz_open_llama/Open-Llama/train_lm.py", line 94, in
app.run(main)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/data/rooter_use/cyz_open_llama/Open-Llama/train_lm.py", line 90, in main
trainer.train()
File "/data/rooter_use/cyz_open_llama/Open-Llama/solver/trainer.py", line 153, in train
losses = self.train_step(batch)
File "/data/rooter_use/cyz_open_llama/Open-Llama/solver/trainer.py", line 120, in train_step
out = self.model(**batch)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1675, in forward
loss = self.module(*inputs, **kwargs)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 725, in forward
outputs = self.model(
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 612, in forward
layer_outputs = decoder_layer(
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 315, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 232, in forward
attn_output = xops.memory_efficient_attention(query_states, key_states, value_states,
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/xformers/ops/fmha/init.py", line 197, in memory_efficient_attention
return _memory_efficient_attention(
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/xformers/ops/fmha/init.py", line 298, in _memory_efficient_attention
return _fMHA.apply(
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/xformers/ops/fmha/init.py", line 43, in forward
out, op_ctx = _memory_efficient_attention_forward_requires_grad(
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/xformers/ops/fmha/init.py", line 323, in _memory_efficient_attention_forward_requires_grad
op = _dispatch_fw(inp)
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/xformers/ops/fmha/dispatch.py", line 95, in _dispatch_fw
return _run_priority_list(
File "/data/rooter_use/conda/envs/llama-env39/lib/python3.9/site-packages/xformers/ops/fmha/dispatch.py", line 70, in _run_priority_list
raise NotImplementedError(msg)
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
query : shape=(2, 2048, 4, 192) (torch.bfloat16)
key : shape=(2, 2048, 4, 192) (torch.bfloat16)
value : shape=(2, 2048, 4, 192) (torch.bfloat16)
attn_bias : <class 'xformers.ops.fmha.common.LowerTriangularMask'>
p : 0.1
cutlassF is not supported because:
dropout > 0.0
flshattF is not supported because:
max(query.shape[-1] != value.shape[-1]) > 128
tritonflashattF is not supported because:
max(query.shape[-1] != value.shape[-1]) > 128
dropout > 0.0
requires A100 GPU
smallkF is not supported because:
dtype=torch.bfloat16 (supported: {torch.float32})
max(query.shape[-1] != value.shape[-1]) > 32
attn_bias type is <class 'xformers.ops.fmha.common.LowerTriangularMask'>
unsupported embed per head: 192

想请教一下，这个错误日志里除了requires A100以外，其它的问题是由什么导致的？我使用的accelerate config和您提供的一样，模型config为：

data:
mode: "pretrain"
data:
wudao: "/data/rooter_use/Open-Llama/data/pretrain_data/part-wudao*.jsonl.zst"
# 由于加载了Llama模型的ckpt所以只使用少量英文数据
# the_pile: "data/pretrain_data/part-pile-1*.jsonl.zst"
pad_to_max: False
sequence_sample_mode: "none"
concat_multiple_sequence: True
num_sequences: 10
seq_length: 2048
tokenizer_model_path: "configs/llama_tokenizer_extended.model"
model:
initializer_range: 1.0e-2
hidden_dropout_prob: 0.1
attention_dropout_prob: 0.1
use_stable_embedding: False
shared_input_output_embedding: False
hidden_size: 768
num_hidden_layers: 6
num_attention_heads: 4
intermediate_size: 2048
train:
train_batch_size: 2
num_training_steps: 500000
num_warmup_steps: 2000
initializer_range: 1.0e-2
lr: 2.0e-4
weight_decay: 1.0e-1
# 加载预训练权重，从头训练设为null
# ckpt: "data/llama_raw_ckpt/7B/extended.pth"
ckpt: null
train_num_workers: 16
gradient_accumulation_steps: 12
prefetch_factor: 100
log_interval: 5
eval_interval: 500
save_interval: 1000
work_dir: "data/saved_ckpt/small"
project_name: "Llama Pretrain"

按照报错日志来看，似乎有一些问题是由于模型config的配置不符合xformers要求导致的？

可以提供一个dockerfile吗？

环境现在太多，有的时候设置有问题，可以提供一个验证过的dockerfile环境吗？
多谢~

有计划放出13B及以上的模型么？

关于83200.pt？

请问，您这个预训练的模型83200.pt，是指跑了83200step吗？Open-Llama-V1是基于这个训练的吗？

继续预训练的问题

由于新版代码用了huggingface dataset api，有没有办法监控每次run被用过的zst文件？这样的话在继续预训练的时候才有办法跳过这些文件

做微调时，保存权重出了问题

您好，非常感谢您的这个项目！我在使用您的预训练权重时，采用的是fastchat的微调代码，但是在保存的时候，出现了报错：

/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224: UserWarning: Failed to clone() tensor with name _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. Error: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 6; 39.42 GiB total capacity; 36.01 GiB already allocated; 16.31 MiB free; 37.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224: UserWarning: Failed to clone() tensor with name _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. Error: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 3; 39.42 GiB total capacity; 36.36 GiB already allocated; 22.31 MiB free; 38.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224: UserWarning: Failed to clone() tensor with name _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. Error: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 2; 39.42 GiB total capacity; 36.36 GiB already allocated; 22.31 MiB free; 38.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224: UserWarning: Failed to clone() tensor with name _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. Error: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 39.42 GiB total capacity; 36.36 GiB already allocated; 22.31 MiB free; 38.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224: UserWarning: Failed to clone() tensor with name _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of _fsdp_wrapped_module._fpw_module.model.layers.30.self_attn.q_proj.weight. Error: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 4; 39.42 GiB total capacity; 36.30 GiB already allocated; 34.31 MiB free; 38.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(

而且权重的读取也出现报错：

File /usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:415, in load_state_dict(checkpoint_file)
    414 try:
--> 415     return torch.load(checkpoint_file, map_location="cpu")
    416 except Exception as e:

File /usr/local/lib/python3.8/dist-packages/torch/serialization.py:789, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
    788                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
--> 789         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
    790 if weights_only:

File /usr/local/lib/python3.8/dist-packages/torch/serialization.py:1131, in _load(zip_file, map_location, pickle_module, pickle_file, **pickle_load_args)
   1130 unpickler.persistent_load = persistent_load
-> 1131 result = unpickler.load()
   1133 torch._utils._validate_loaded_sparse_tensors()

File /usr/local/lib/python3.8/dist-packages/torch/_utils.py:153, in _rebuild_tensor_v2(storage, storage_offset, size, stride, requires_grad, backward_hooks)
    150 def _rebuild_tensor_v2(
    151     storage, storage_offset, size, stride, requires_grad, backward_hooks
    152 ):
--> 153     tensor = _rebuild_tensor(storage, storage_offset, size, stride)
    154     tensor.requires_grad = requires_grad

File /usr/local/lib/python3.8/dist-packages/torch/_utils.py:147, in _rebuild_tensor(storage, storage_offset, size, stride)
    146 t = torch.tensor([], dtype=storage.dtype, device=storage.untyped().device)
--> 147 return t.set_(storage.untyped(), storage_offset, size, stride)

RuntimeError: Trying to resize storage that is not resizable

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
File /usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:419, in load_state_dict(checkpoint_file)
    418 with open(checkpoint_file) as f:
--> 419     if f.read(7) == "version":
    420         raise OSError(
    421             "You seem to have cloned a repository without having git-lfs installed. Please install "
    422             "git-lfs and run `git lfs install` followed by `git lfs pull` in the folder "
    423             "you cloned."
    424         )

File /usr/lib/python3.8/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
    323 # keep undecoded input until the next call

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
Cell In[1], line 32
     28 model_config.pad_token_id = tokenizer.pad_token_id
     29 #model = OpenLlamaForCausalLM(model_config).cuda()
---> 32 model=  OpenLlamaForCausalLM.from_pretrained(model_name, from_tf= False).cuda()

File /usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:2647, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2637     if dtype_orig is not None:
   2638         torch.set_default_dtype(dtype_orig)
   2640     (
   2641         model,
   2642         missing_keys,
   2643         unexpected_keys,
   2644         mismatched_keys,
   2645         offload_index,
   2646         error_msgs,
-> 2647     ) = cls._load_pretrained_model(
   2648         model,
   2649         state_dict,
   2650         loaded_state_dict_keys,  # XXX: rename?
   2651         resolved_archive_file,
   2652         pretrained_model_name_or_path,
   2653         ignore_mismatched_sizes=ignore_mismatched_sizes,
   2654         sharded_metadata=sharded_metadata,
   2655         _fast_init=_fast_init,
   2656         low_cpu_mem_usage=low_cpu_mem_usage,
   2657         device_map=device_map,
   2658         offload_folder=offload_folder,
   2659         offload_state_dict=offload_state_dict,
   2660         dtype=torch_dtype,
   2661         load_in_8bit=load_in_8bit,
   2662         keep_in_fp32_modules=keep_in_fp32_modules,
   2663     )
   2665 model.is_loaded_in_8bit = load_in_8bit
   2667 # make sure token embedding weights are still tied if needed

File /usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:2956, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, load_in_8bit, keep_in_fp32_modules)
   2954 if shard_file in disk_only_shard_files:
   2955     continue
-> 2956 state_dict = load_state_dict(shard_file)
   2958 # Mistmatched keys contains tuples key/shape1/shape2 of weights in the checkpoint that have a shape not
   2959 # matching the weights in the model.
   2960 mismatched_keys += _find_mismatched_keys(
   2961     state_dict,
   2962     model_state_dict,
   (...)
   2966     ignore_mismatched_sizes,
   2967 )

File /usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:431, in load_state_dict(checkpoint_file)
    426             raise ValueError(
    427                 f"Unable to locate the file {checkpoint_file} which is necessary to load this pretrained "
    428                 "model. Make sure you have saved the model properly."
    429             ) from e
    430 except (UnicodeDecodeError, ValueError):
--> 431     raise OSError(
    432         f"Unable to load weights from pytorch checkpoint file for '{checkpoint_file}' "
    433         f"at '{checkpoint_file}'. "
    434         "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True."
    435     )

OSError: Unable to load weights from pytorch checkpoint file for '/data/gck/model_save_dir/hf_Open-Llama-V1_SFT_v1/checkpoint-200/pytorch_model-00001-of-00003.bin' at '/data/gck/model_save_dir/hf_Open-Llama-V1_SFT_v1/checkpoint-200/pytorch_model-00001-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

请问这是什么原因呢？

基于v2的预训练模型进行指令微调时，LOSS总是忽大忽小，波动范围较大，不知道是否是正常情况？

低的时候有0.8，高的时候有2.3，不知道大家训练的时候是不是也是这种情况？是因为训练还不充分的缘故吗？（目前训练了大概2个小时，总共5G的指令微调数据）

[2023-05-12 20:30:13,489] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
I0512 20:30:13.490093 140284242999104 logging.py:47] DeepSpeed Model and Optimizer saved to output dir data/saved_ckpt/7B_instruction/pytorch_model
I0512 20:30:13.491180 140284242999104 logging.py:47] Scheduler state saved in data/saved_ckpt/7B_instruction/scheduler.bin
I0512 20:30:13.492729 140284242999104 logging.py:47] Random states saved in data/saved_ckpt/7B_instruction/random_states_0.pkl
Epoch: 0, Global Step: 3051, Data Step: 3050, Loss: 0.8125, Token per second per gpu: 564.6297147981206
Epoch: 0, Global Step: 3101, Data Step: 3100, Loss: 2.03125, Token per second per gpu: 2999.5104936194425
Epoch: 0, Global Step: 3151, Data Step: 3150, Loss: 2.078125, Token per second per gpu: 2999.4936411123576
Epoch: 0, Global Step: 3201, Data Step: 3200, Loss: 2.15625, Token per second per gpu: 2997.8913711548653
Epoch: 0, Global Step: 3251, Data Step: 3250, Loss: 2.203125, Token per second per gpu: 2999.0908188699273
Epoch: 0, Global Step: 3301, Data Step: 3300, Loss: 1.125, Token per second per gpu: 3000.028080336949
Epoch: 0, Global Step: 3351, Data Step: 3350, Loss: 1.8984375, Token per second per gpu: 2998.797114319698
Epoch: 0, Global Step: 3401, Data Step: 3400, Loss: 1.796875, Token per second per gpu: 2997.457181530268
Epoch: 0, Global Step: 3451, Data Step: 3450, Loss: 0.439453125, Token per second per gpu: 2997.4025205262483
Epoch: 0, Global Step: 3501, Data Step: 3500, Loss: 2.15625, Token per second per gpu: 2997.202249851822
Epoch: 0, Global Step: 3551, Data Step: 3550, Loss: 0.9765625, Token per second per gpu: 2996.3489861889375
Epoch: 0, Global Step: 3601, Data Step: 3600, Loss: 1.8515625, Token per second per gpu: 2998.468727956593
Epoch: 0, Global Step: 3651, Data Step: 3650, Loss: 1.4375, Token per second per gpu: 2997.1825473994422
Epoch: 0, Global Step: 3701, Data Step: 3700, Loss: 1.3671875, Token per second per gpu: 2997.2918867307167
Epoch: 0, Global Step: 3751, Data Step: 3750, Loss: 1.1015625, Token per second per gpu: 2998.0614931246614
Epoch: 0, Global Step: 3801, Data Step: 3800, Loss: 2.359375, Token per second per gpu: 2997.3599415502686

请问一下指令微调（instruct finetune）可以使用32GB显存的v100吗？

请问一下指令微调（instruct finetune）可以使用32GB显存的v100吗？目前有4卡的v100，每张卡显存是32G，有没有办法可以使用4卡的v100进行指令微调操作？试过了stage2还是报显存不够。

试过ds_stage3的配置文件，发现报了如下错误，有人知道是什么原因吗？非常感谢
启动命令如下：accelerate launch --config_file configs/accelerate_configs/ds_stage3.yaml train_lm.py --train_config configs/instruct_config.yaml --model_config configs/model_configs/7B.json

报错如下：
File "/home/fenbi/miniconda3/envs/mc-model/lib/python3.9/site-packages/transformers/models/open_llama/modeling_open_llama.py", line 385, in init_weights
module.weight.data[module.padding_idx].zero()
IndexError: index 32000 is out of bounds for dimension 0 with size 0

关于训练成本

你好，对于你描述的复现这个open-llma的训练成本比较好奇，因为之前的你在示例图片底下说复现这样一个模型大概是2W刀，但是到高性能部分变成了9w刀，不太确定具体价格？

还有就是想再请教下训练所需显存？如果不考虑时间， 4卡a100 * 80g 能复现这个结果吗？

Help! Lots of skippings in running dataset/train-tokenizer.py

How to solve this problem?
Thanks a lot!

Loading data from data/pretrain_data/part-wudao-3029.jsonl.zst
trainer_interface.cc(367) LOG(INFO) Reserved chars are found. Skipped: 月建
月建，一般是指十二支而言，如建子、建丑、建寅等等。“建”代表北斗七星斗柄顶端的指向。古老天文学称北斗星斗柄所指为“建”，干支历将一岁划分为十二辰，斗柄旋转而依次指为十二辰（十二月令），称为“十二月建”。十二月建是依据二十四节气而来的节气月。 [1-2] 此外，建又指十二个表示当下态势的字，即''建，除，满，平，定，执，破，危，成，收，开，闭”，月建指摇卦当月的月地支，也称月令、月将。月建在当月三十日中当权得令，对卦爻的旺衰有着直接影响。五行在四季十二月中呈旺相休囚的周期性变化，当令的月地支便为月建。概念概念一：“月建”，又称月令，简称节，是古人干支纪月的一种方法。“建”代表北斗七星斗柄顶端的指向，是干支历法中十二个月之始。古代天文学称北斗星斗柄所指为建，一岁之中斗柄旋转而依次指为十二辰，称为“十二月建”。十二月建即：一月建寅，二月建卯，三月建辰，四月建巳，五月建午，六月建未，七月建申，八月建酉，九月建戌，十月建亥，十一月建子，十二月建丑。 [2-4] 概念二：月建是指十二个表示当下态势的字，即''建，除，满，平，定，执，破，危，成，收，开，闭”。月建指摇卦当月的月地支，也称月令、月将。月建在当月三十日中当权得令，对卦爻的旺衰有着直接影响。五行在四季十二月中呈旺相休囚的周期性变化，当令的月地支便为月建。释义《汉书.律历志》：“辰者，日月之会而建所指也。”就说明了辰，月建所指。《易·系辞》：寒暑相推而岁成。古人以“岁”来表示寒暑交替。岁，“岁”又名为“摄提”、“太岁”，是天皇氏时代创制的纪元星名。《盘古王表》：“天皇氏始制干支之名，以定岁之所在”。《三皇本纪》：“天皇氏，木德王，岁起摄提”。《太平御览》卷17引《释名》曰：“岁，越也。越，故限也，年进也，进而前也，祀巳也，新气生，故气巳也，载载生物也。”《开元占经》卷67引《淮南鸿烈间诂》曰：“斗杓为小岁。（注：）岁之言越历十二辰而行。”古老天文学称北斗星斗柄所指为“建”。斗柄旋转一圈（从立春到下一立春），谓之一十二月建，“十二月建”是干支历法的基本内容，依据北斗星斗柄的指向确定。以北斗星斗柄所指的方位作为确定月份的标准，称为“斗建”，“建”代表北斗七星斗柄顶端的指向。北斗星斗柄绕东、南、西、北旋转一圈谓之一岁（摄提），干支历法将一岁划分为十二辰（或“十二月令”），每月令含两个节气，斗柄旋转而依次指向“十二辰”，称为“十二月建”。十二月建亦即是十二个节气月。古人的所谓“月建”，就是把一年十二个月和天上的十二辰联系起来。十二辰是古代天文学的一个概念，就是把黄道（即太阳一年在天空中移动一圈的路线）附近的一周天十二等分，由东向西配以子，丑，寅，卯，辰，巳，午，未，申，酉，戌，亥十二支。十二支和十二月相配，依序称为建子月，建丑月，建寅月等等，这就叫“月建”。那么十二月份是怎样确定的呢？这个问题与太阳的周年视运动引起的北斗星转动有关。北斗星是大熊星座的七颗明亮的星，分布成斗形，因此**古代称为北斗星。天枢，天璇（xuan音旋），天玑（ji音机），天权组成斗身。玉衡，开阳，摇光三星组成斗柄，古曰杓（biao音彪）。用直线把天璇，天枢两星连接起来向斗口方向延长约五倍的距离就遇到小熊座a星，即北极星。北极星是出现在天空北部的一颗亮星，距地球北极很近，差不多正对着地轴，从地球上看，它的位置几乎不变。航海和旅行的人常靠它来辨别方向。因为北斗星围绕北极星转动，因次北斗星亦用来辨方向，定季节。人们可以根据初昏时北斗星斗柄所指的方向来决定季节：斗柄指东，天下皆春；斗柄指南，天下皆夏；斗柄指西，天下皆秋；斗柄指北，天下皆冬。同样，也可以根据北斗星斗柄所指十二辰中的不同位置来确定十二月份。以日南至（即冬至）所在之朔望月的日月相会日（朔日），北斗斗柄指辰位为“子”位，为建子月；日月之会日的斗柄所指“丑”位，为建丑月；日月之会日的斗柄所指“寅”，为建寅月。依此类推，日月之会日的斗柄所指十二辰中的那一支，就是建该支月，称为“月建”。然而随着时间的流逝，据说古人发现北斗星逐渐偏离原来的位置，于是改用赤道上直接定的十二支。一般所谓月建是指十二支而言，如建子、建丑、建寅等等。有人认为月建是指（农历）月的大小而言，所以有大建和小建的谬论。 [1] 所谓“正月建寅”，其实是农历的正月（初一或朔）从天象中的寅位开始向卯排列。农历是一种阴阳历，采用加设闰月调整，闰月使用上月月建，其原因就是在闰月的朔日所指的辰位与上月相同。在古代，月建对判断占事至关重要。月建即月令，掌一月之权，可三旬之令，一月三十日当权得令，掌管万物之提纲，巡察六爻之善恶，操生杀之权柄。月建能助卦爻之衰弱，挫爻象之旺强，制服动变之爻，扶起飞伏之神。月建当权为主帅纲领，爻之衰弱者，能生之合之，拱之扶之，衰而亦旺。爻之强旺者，能克之冲之，刑之破之，旺而亦衰，卦有变爻克制动爻者，月建能制服变爻，卦有动爻克制静爻者，月建能制服动爻。用神伏藏，飞神压住者，月建能冲克飞神，扶助伏神而为用。爻之月合而有用，爻逢月破而无功。月建合爻，则为月合而有用；月冲之爻，即为月破无用之爻。月建不入爻，亦为有用；月建一入爻，愈见刚强。卦无用神，即以月建为用神，不必寻伏神。月建入卦动而作原神者，为福更大。动而作忌神者，为祸更凶。不入卦者，缓之；入卦者，速之。在卦象中凡有卦爻地支与月地支相同，称为临月建。与月地支同类，称为临月扶。与月地支相冲，称为临月破。月建生卦爻，称为临月生，月建克卦爻，称为临月克。徐墨斋师傅为他人占卦，如图：摇卦月是寅月,卦中父母爻（初九爻）为临月扶，妻财爻（九五爻）为临月破,兄弟爻（九四爻）为临月生,子孙丑土（六二爻）和子孙戌土（上九爻）为临月克，若卦爻地支中有寅木的，就为临月建。例：寅月庚戌日，测求财得大有卦，乾宫：火天大有（归魂）六神【本卦】螣蛇 ▅▅▅▅▅ 官鬼己巳火应勾陈 ▅▅ ▅▅ 父母己未土朱雀 ▅▅▅▅▅ 兄弟己酉金青龙 ▅▅▅▅▅ 父母甲辰土世玄武 ▅▅▅▅▅ 妻财甲寅木白虎 ▅▅▅▅▅ 子孙甲子水财爻寅木为用，临月建，而克世必得，但本旬为空，要到甲寅旬中，寅爻出空可得，果于甲寅日得财。寓意月建干支，利于用神为善。月建干支，不利于用神为恶。月建干支，利于用神，但为局中他神克运河或合住，善而不善，然亦不恶，平庸而已。月建干支，不利于用神，但为局中他神克去或合住，恶而不恶，然亦不善，平庸而已。应用月建与流年月建善，流年亦善，则更妙。月建善，流年恶。则善中有恶。月建恶，流年亦恶，则更恶。月建恶，流年善，则恶中有善。月建善，惟被局中某神克合，若流年制住克合之神，则仍佳妙。月建恶，惟被局中某神克合，若流年制住克合之神，则仍蹇劣。月建善，惟被局中某神克合，若流年生辅克合之神，则凶多吉少。月建恶，惟被局中某神克合，若流年生辅克合之神，则吉多凶少。月建善，流年再生助之，则更善。月建恶，流年再生助之，则更恶月建善，流年若克挫之，则善力减轻。月建恶，流年若克挫之，则恶力减轻。月建之干支月建看法，月干重于月支，因干流动，而支固定。月建即流月也。或以干为上半月，支为下半月，不甚可信，总宜干支合看。亦有以月支所藏人元，分何者当旺几天，而定几天吉凶，更不足信，盖完全偏重于月支。成如此说，则月建可无须月干矣。智德师傅提醒，看命中强弱，且不能以建中人元旺几天而标定，况流年中之流动月建耶。月建与时令在干支历里：正月为寅，二月为卯，三月为辰，四月为巳，五月为午，六月为未，七月为申，八月为酉，九月为戌，十月为亥，十一月为子，十二月为丑；月支固定，故不若月干之重视。然而时令与月建，颇有关系焉。特叙述如后。（一）春令木旺，甲寅月、乙卯月、甲辰月，则木更盛。丙寅月、丁卯月、丙辰月，则火得木生而亦强。戊寅月、己卯月、戊辰月，则土被木克而不健。庚寅月、辛卯月、庚辰月，则金为木挫而无力。壬寅月、癸卯月、壬辰月，则水受木泄亦疲弱。（二）夏令火旺。丁巳月、丙午月、丁未月，则火更盛。己巳月、戊午月、己未月，则土得火生而亦强。辛巳月、庚午月、辛未月，则金被火熔而不健。癸巳月、壬午月、癸未月，则水为火灼而无力。乙巳月、甲午月、乙未月则木受火泄亦疲弱。（三）秋令金旺。庚申月、辛酉月、庚戌月，则金更盛。壬申月、癸酉月、壬戌月，则水得金生而亦强。甲申月、乙酉月、甲戌月，则木被金克而不健。丙申月、丁酉月、丙戌月，则火为金磨而无力。戊申月、己酉月、戊戌月土受金泄亦疲弱。（四）冬令水旺，癸亥月、壬子月、癸丑月，则水更盛。乙亥月，甲子月，乙丑月，则木得水生而亦强。丁亥月、丙子月、丁丑月，则火被水克而不健。己亥月、戊子月、己丑月，则土因水泛而无力。辛亥月、庚子月、辛丑月，则金受水泄亦疲弱。（五）四立前各十八天土旺。戊辰月、己未月、戊戌月、己丑月，则土更甚。庚辰月、辛未月、庚戌月、辛丑月，则金得土生而亦强。壬辰月、癸未月、壬戌月、癸丑月，则水被土克而不健。甲辰月、乙未月、甲戌月、乙丑月，则木为土折而无力。丙辰月、丁未月、丙戌月、丁丑月，则火受土泄亦疲弱。
Loading data from data/pretrain_data/part-wudao-7.jsonl.zst
trainer_interface.cc(367) LOG(INFO) Reserved chars are found. Skipped: 问测2016年运
男、1982年农历11月生,在一家民营企业上班。预测2016年:1、事业工作情况,工作上能否有机会转提升,自己投资的茶楼、培训学校是否都有很大的发展,财利如何 2、现在有一个喜欢的人,出现了矛盾没有继续,是否可以修复,是否会遇到合适的人结婚, 3、2016年文运如何,参加比赛可以获奖吗。公历起卦时间:2016年1月8日16时18分 (手工指定)农历:乙未年十一月廿九日申时小寒:2016年01月06日06时47分立春: 2016年02月04日18时00分干支: 乙未年己丑月己丑日壬申时 (日空:午未)神煞:驿马-亥桃花-午日禄-午贵人-子,申乾宫:风地观离宫:风水涣六神伏神本卦变卦勾陈妻财辛卯木 ▅▅▅▅▅ 妻财辛卯木 ▅▅▅▅▅ 朱雀兄弟壬申金官鬼辛巳火 ▅▅▅▅▅ 官鬼辛巳火 ▅▅▅▅▅ 世青龙父母辛未土 ▅▅ ▅▅ 世父母辛未土 ▅▅ ▅▅ 玄武妻财乙卯木 ▅▅ ▅▅ 官鬼戊午火 ▅▅ ▅▅ 白虎官鬼乙巳火 ▅▅ ▅▅ ╳→ 父母戊辰土 ▅▅▅▅▅ 应腾蛇子孙甲子父母乙未土 ▅▅ ▅▅ 应妻财戊寅木 ▅▅ ▅▅ 目前工作上平静,死水一潭,不下不上,你也没怎么跳槽的心思,就是有想法也不会真正付诸行动,难得找也懒得找茶馆经营总体顺当,指15年及16年都基本上差不多了情况,16年春、夏冬季生意比较好些,秋季惨淡一些。世父为书面作品:并到应爻顾客那里,卖画方面还行,不过每幅价位都不算高。16年应买家要求画、写的那种比较多,也容易成交。没有显示会有经营方式的大改变,维持目前形式,就可有稳定额度的收益。子爻不上卦又克弱,16年招生学员比15年少,还有中途退出的,教学生方面人数比15年减少起码近3成妻克世,处的那个女友关键她心不定,也可以说不够爱你吧,2人性格也存在某些本质差异不合,所以没看到能发展到真正在一起。正2月矛盾多摩擦多,或会结束交往或决定放弃巳官动生世,单位农4、5月有一次提拨一批人之象,但巳化辰乃真兄爻竞争者,最终另外的同事抢得名额,世爻空,你会落空。在家中之西南方摆放一只瓷羊工艺品,上半年多穿红色系衣物,有争取助运之效流月大概运,农历月份为准正月体质偏差,防感冒及肠胃病,与女友别扭不少。二月与正月以上延续之外,财上赚得多花的也快,存不下来多少。卯妻动化官,女友或会提出分开、与你结束交往之类,并且另有考虑的人三月平平,无大喜亦无忧,与人合作事宜遇到某些问题难以达成一致四、五月单位开始研究提拔一批人,领导或会跟你透点风,勉励之意,但是泛泛之举,你也没多大竞争的劲头,自己生意上事宜正忙活中六月劳心劳神,工作、生意、教学、书画都会比较忙乱事多。
trainer_interface.cc(367) LOG(INFO) Reserved chars are found. Skipped: 杭州下城区治疗溢脂性脱发医院
杭州下城区治疗溢脂性脱发医院【杭州脱发医院-杭州植发医院】▃▅▆▇杭州杭城植发医院▇▆▅▃【24小时免费咨询杭城植发专家专业指导】许多患者在面临脱发严重之后感到自卑、焦虑和烦恼,导致自信下降和社交孤独感,害怕引起他人的注意,对他人有敌对情绪,对人际关系过度敏感,长期的精神压抑令患者产生"社交危机感",对社会活动产生焦虑、畏惧的心理.当您面临这样的困境时,考虑一下到一家好的植发医院做植发手术来破解这样的困境吧。人们常说烦恼三千丝,而现在愈加多的人脱掉了这三千丝,却还是有数不清的烦恼。那么当有人正在遭遇或者将要遭遇掉发的困扰时,我们应该怎样护理头发呢？不要用脱脂性强或碱性洗发剂这类洗发剂的脱脂性和脱水太强,易使头发干燥、头皮坏死。应选用对头皮和头发无刺激性的天然洗发剂,或根据自己的发质选用。最好不要一直使用某种洗发剂。目前,杭州杭城植发医院植拥有三位在国内享有盛誉的植发专家。陈海涛。杭州杭城医院植发科主任,国内著名植发专家,fue**第一人,杭城lhs高端技术的研发者。李长恒。技术顾问,原上海华山皮肤科主任、学科带头人,上海皮肤科诊疗界的泰斗,国务院特殊津贴教授,皮肤病国家科技进步二等奖获得者,杭城lhs高端技术的研发者。诸慕兰。特邀专家,杭州第三人民医院专家,杭城lhs高端技术的研发者,国务院特殊津贴教授。这三位专家在植发领域拥有多年的治疗脱发成功经验。做到贴心为您服务！近些年来,植发技术的不断成熟,让越来越多的脱发患者对植发有所了解。植发进入**已经数十年,从传统植发技术逐渐发展到现在的第三代fue无痕植发技术,已经慢慢被脱发患者所认可。如今,大多数脱发患者都会选择通过植发技术来彻底治疗自己的脱发问题。那么,头发种植真的能解决脱发的问题吗？相信这是所以脱发患者在选择头发种植前最多的疑问。这里我们请到杭州杭城植发医院从事多年临床经验的专家为我们解答。头发种植真的能解决脱发问题吗？点击直接咨询头发种植治疗脱发的原理专家指出:头发种植亦属于自体毛发种植的一种,是通过运用显微外科手术,从人体后枕部提取部分健康的毛囊单位,经过特殊的分离、活化后按照脱发患者原本毛发生长的方向种植到脱发区域。待种植后的毛囊存活后便能长出新的头发来。后枕部的头发提取后,会有影响吗？因为种植的毛囊单位来自于人体后枕部的毛囊,所以很多脱发患者会担心后枕部的毛囊提取后不就变的少了吗？这样不就影响到后枕部的头发密度了吗？针对这一问题,专家解释:医学研究表明,人体后枕部的毛囊组织不受雄性激素及外界的影响永不坏死,且后枕部的头发密度是整体头发密度的2到4倍。所以,头发种植才会提取这部分的毛囊,且不会影响供体区(后枕部)头发的密度。头发种植后能达到好的效果吗？专家介绍:现在的头发种植技术已经到了很成熟的地步,也是医学界公认的唯一有效治疗脱发问题的方法。因提取的后枕部的毛囊具有永不脱落的特性,所以种植后生长出来的头发仍保持了这一特性,不会再次脱落。且因毛囊来源于患者本身,所以新长出的头发会与原有头发融为一体,自然、美观。而我院所拥有的lhs无痕植发技术更能做到手术:无痛、无痕、无痂,带给脱发患者好的植发效果。头发种植真的能解决脱发问题吗？相信通过杭州杭城植发医院专家的介绍,您对头发种植应该有了更深的了解。正受脱发困扰的您何不相信科学,为未来的自己挑战一把,杭州杭城植发医院头发种植等着您。杭城植发医院温馨提示: 杭州杭城植发医院专家指出:患有毛发问题不要轻视,要及时咨询医生并确定最佳的解决方案。如您还有更多疑问或是有来院的需求,可以通过以下方式与杭州杭城植发专家进行在线咨询！(如果您对来院路线不清楚,建议您用拍照方式保存或用小纸条记录收藏好,方便您更快来院诊疗,早日健康完美！
trainer_interface.cc(367) LOG(INFO) Reserved chars are found. Skipped: 杭州下城区治疗溢脂性脱发哪家医院好

WARNING: Too long lines are skipped in the training. The maximum length can be changed

Could you please tell us what the best max-length is here?

Thanks a lot!

(env_OpenLlama) [root@ Open-Llama]# python3 dataset/train_tokenizer.py
Loading data from data/pretrain_data/part-wudao-770.jsonl.zst
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input_format:
model_prefix:
model_type: BPE
vocab_size: 100000
self_test_sample_size: 0
character_coverage: 0.9995
input_sentence_size: 0
shuffle_input_sentence: 0
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 16384
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 1
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 1
required_chars:
byte_fallback: 1
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 1
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 0
bos_id: 1
eos_id: 2
pad_id: 3
unk_piece:
bos_piece:
~~eos_piece:~~
pad_piece:
unk_surface: ⁇
}
normalizer_spec {
name: nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 0
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(356) LOG(WARNING) Found too long line (17205 > 16384).
trainer_interface.cc(358) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(359) LOG(WARNING) The maximum length can be changed with --max_sentence_length= flag.
Loading data from data/pretrain_data/part-wudao-2598.jsonl.zst

instruction tuning失败

accelerate launch --config_file configs/default_config.yaml inctruction_tuning.py

File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 251, in init
super(_open_file, self).init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data/saved_ckpt/83200.pt'

83200.pt 这个checkpoint是需要经过预训练生成出来的吗？

RoPE bias

llama应该默认没有启用bias项。但按照苏神最新思路，把q,k的bias项加回来可以明显提升长度外推性能，作者考虑预训练测试一下不
https://kexue.fm/archives/9577

Help! Strange messages in running dataset/train-tokenizer.py

Hi there,

Could you please help - what's problem with these strange messages?

Thanks a lot!

Loading data from data/pretrain_data/part-pile-603.jsonl.zst
trainer_interface.cc(367) LOG(INFO) Reserved chars are found. Skipped: Q:

Using sed transliterate command in python

So there is this sed command that allows you to transform the quality code in ASCII into bar symbols:
sed -e 'n;n;n;y/!"#$%&'''()*+,-./0123456789:;<=>?@ABCDEFGHIJKL/▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████/' myfile.fastq

I have been checking ways to do the same in python, but I have not found a solution I can use. Maybe pysed or re.sub, but I do not even know how to write the ASCII code in a string without python getting mixed up the characters.

So, you want to transliterate characters in the 3rd line of your FASTQ file?
You can use str.translate on translation table built with str.maketrans:
#!/usr/bin/env python3
lut = str.maketrans('''!"#$%&'()*+,-./0123456789:;<=>?@abcdefghijkl''',
'''▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████''')

with open('/path/to/fastq') as f:
line3 = f.readlines()[3].strip()

print(line3.translate(lut))

For a sample file from Wikipedia:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''((((+))%%%++)(%%%%).1*-+*''))**55CCF>>>>>>CCCCCCC65

the Python script above will produce:
▁▁▁▂▁▁▁▁▂▂▂▂▂▂▁▁▁▂▂▂▁▁▁▁▁▂▃▃▂▂▂▂▂▂▁▁▂▂▂▂▄▄▇▇▇▆▆▆▆▆▆▇▇▇▇▇▇▇▄▄

However, note that according to FASTQ format description on Wikipedia, your translation table is incorrect. The character ! represents the lowest quality while ~ is the highest (not L as you have).
Also note that quality value characters directly map the ASCII character range !-~ to the quality value. In other words, we can build the translation table programmatically:
span = ord('█') - ord('▁') + 1
src = ''.join(chr(c) for c in range(ord('!'), ord('~')+1))
dst = ''.join(chr(ord('▁') + span*(ord(c)-ord('!'))//len(src)) for c in src)
lut = str.maketrans(src, dst)
Loading data from data/pretrain_data/part-pile-3372.jsonl.zst
Loading data from data/pretrain_data/part-pile-1704.jsonl.zst

RLHF

感谢作者的开源，请问RLHF有计划嘛？大概什么时候能开源相关代码

s-jol / open-llama Goto Github PK

open-llama's People

Contributors

Stargazers

Watchers

Forkers

open-llama's Issues

Thanks a lot!

Recommend Projects

Recommend Topics

Recommend Org