
Comments (15)

jameswu2014 commented on May 21, 2024

I opened a PR against llama.cpp: ggerganov/llama.cpp#3009
First convert the model following the Baichuan2 -> Baichuan1 lm_head conversion described in the Baichuan2 README; after that you can use the changes in the linked PR.
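For context, the Baichuan2 README's conversion recipe boils down to L2-normalizing the rows of the lm_head weight (Baichuan2 normalizes logits at inference time, Baichuan1 does not, so baking the normalization into the weights makes the checkpoint Baichuan1-compatible). A minimal NumPy sketch of that operation, using a small random matrix as a hypothetical stand-in for the real [125696, 5120] weight:

```python
import numpy as np

# Hypothetical stand-in for Baichuan2's lm_head weight (vocab_size x hidden_size);
# the real matrix is [125696, 5120].
rng = np.random.default_rng(0)
lm_head = rng.standard_normal((8, 4)).astype(np.float32)

# Row-wise L2 normalization -- the same operation the README recipe performs
# with torch.nn.functional.normalize(model.lm_head.weight.data).
norms = np.linalg.norm(lm_head, axis=-1, keepdims=True)
lm_head_normalized = lm_head / np.maximum(norms, 1e-12)

# Every row now has unit L2 norm.
print(np.linalg.norm(lm_head_normalized, axis=-1))
```

In practice you would apply this to the loaded checkpoint with PyTorch and save it back before running the GGUF conversion script.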

from baichuan2.

songkq commented on May 21, 2024

@jameswu2014 @dlutsniper Quantizing the GGUF model failed on an RTX 3090 (driver version 525.105.17, CUDA version 12.0). Could you please give some advice on this issue?

./quantize /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-f16.gguf /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-Q8_0.gguf 7

CUDA error 804 at /llama.cpp/ggml-cuda.cu:5522: forward compatibility was attempted on non supported HW
current device: 0

Solved by building a Docker image from nvidia/cuda:12.0.0-devel-ubuntu22.04.
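CUDA error 804 ("forward compatibility was attempted on non supported HW") usually means the host's CUDA user-space libraries do not match the driver; running inside a container with a matching CUDA 12.0 toolchain avoids that. A rough sketch of the workaround, assuming the image tag from the comment above; the clone URL, build flag, and mount paths are illustrative assumptions, not taken from this thread:

```shell
# Build llama.cpp inside a CUDA 12.0 container so the user-space CUDA
# libraries match the toolkit, then quantize the mounted model.
docker run --rm --gpus all \
    -v /workspace/llama.cpp/models:/models \
    nvidia/cuda:12.0.0-devel-ubuntu22.04 \
    bash -c '
      apt-get update && apt-get install -y git build-essential &&
      git clone https://github.com/ggerganov/llama.cpp /llama.cpp &&
      cd /llama.cpp && make LLAMA_CUBLAS=1 quantize &&
      ./quantize /models/Baichuan2-13B-Chat-ggml-model-f16.gguf \
                 /models/Baichuan2-13B-Chat-ggml-model-Q8_0.gguf 7
    '
```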


dereklll commented on May 21, 2024

(alpaca_env) chunzhamini@chunzhamini llama.cpp % ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p '从前有一只小狐狸,他' --temp 0 -ngl 1 Log start main: warning: changing RoPE frequency base to 0 (default 10000.0) main: warning: scaling RoPE frequency by 0 (default 1.0) main: build = 1270 (c091cdf) main: built with Apple clang version 14.0.3 (clang-1403.0.22.14.1) for arm64-apple-darwin22.5.0 main: seed = 1695699630 llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin (version GGUF V2 (latest)) llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 5120, 125696, 1, 1 ] llama_model_loader: - tensor 1: blk.0.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 2: blk.0.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 3: blk.0.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 4: blk.0.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 5: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 6: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 7: blk.1.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 8: blk.1.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 9: blk.1.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 10: blk.1.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 11: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 12: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 13: blk.2.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 14: blk.2.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 15: blk.2.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 16: blk.2.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 17: 
blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 18: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 19: blk.3.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 20: blk.3.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 21: blk.3.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 22: blk.3.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 23: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 24: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 25: blk.4.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 26: blk.4.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 27: blk.4.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 28: blk.4.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 29: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 30: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 31: blk.5.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 32: blk.5.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 33: blk.5.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 34: blk.5.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 35: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 36: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 37: blk.6.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 38: blk.6.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 39: blk.6.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 40: blk.6.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 41: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - 
tensor 42: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 43: blk.7.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 44: blk.7.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 45: blk.7.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 46: blk.7.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 47: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 48: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 49: blk.8.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 50: blk.8.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 51: blk.8.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 52: blk.8.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 53: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 54: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 55: blk.9.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 56: blk.9.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 57: blk.9.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 58: blk.9.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 59: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 60: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 61: blk.10.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 62: blk.10.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 63: blk.10.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 64: blk.10.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 65: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 66: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] 
llama_model_loader: - tensor 67: blk.11.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 68: blk.11.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 69: blk.11.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 70: blk.11.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 71: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 72: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 73: blk.12.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 74: blk.12.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 75: blk.12.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 76: blk.12.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 77: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 78: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 79: blk.13.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 80: blk.13.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 81: blk.13.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 82: blk.0.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 83: blk.0.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 84: blk.0.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 85: blk.1.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 86: blk.1.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 87: blk.1.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 88: blk.2.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 89: blk.2.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 90: blk.2.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 91: blk.3.attn_q.weight q4_0 [ 
5120, 5120, 1, 1 ] llama_model_loader: - tensor 92: blk.3.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 93: blk.3.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 94: blk.4.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 95: blk.4.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 96: blk.4.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 97: blk.5.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 98: blk.5.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 99: blk.5.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 100: blk.6.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 101: blk.6.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 102: blk.6.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 103: blk.7.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 104: blk.7.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 105: blk.7.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 106: blk.8.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 107: blk.8.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 108: blk.8.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 109: blk.9.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 110: blk.9.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 111: blk.9.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 112: blk.10.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 113: blk.10.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 114: blk.10.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 115: blk.11.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 116: blk.11.attn_k.weight q4_0 [ 
5120, 5120, 1, 1 ] llama_model_loader: - tensor 117: blk.11.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 118: blk.12.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 119: blk.12.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 120: blk.12.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 121: blk.13.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 122: blk.13.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 123: blk.13.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 124: blk.13.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 127: blk.14.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 128: blk.14.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 129: blk.14.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 130: blk.14.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 131: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 132: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 133: blk.15.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 134: blk.15.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 135: blk.15.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 136: blk.15.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 137: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 138: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 139: blk.16.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 140: blk.16.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] 
llama_model_loader: - tensor 141: blk.16.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 142: blk.16.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 143: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 144: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 145: blk.17.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 146: blk.17.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 147: blk.17.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 148: blk.17.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 149: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 150: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 151: blk.18.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 152: blk.18.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 153: blk.18.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 154: blk.18.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 155: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 156: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 157: blk.19.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 158: blk.19.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 159: blk.19.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 160: blk.19.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 161: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 162: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 163: blk.20.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 164: blk.20.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: 
- tensor 165: blk.20.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 166: blk.20.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 167: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 168: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 169: blk.21.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 170: blk.21.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 171: blk.21.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 172: blk.21.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 173: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 174: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 175: blk.22.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 176: blk.22.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 177: blk.22.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 178: blk.22.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 179: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 180: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 181: blk.23.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 182: blk.23.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 183: blk.23.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 184: blk.23.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 185: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 186: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 187: blk.24.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 188: blk.24.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 189: 
blk.24.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 190: blk.24.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 191: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 192: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 193: blk.25.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 194: blk.25.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 195: blk.25.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 196: blk.25.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 197: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 198: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 199: blk.26.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 200: blk.26.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 201: blk.26.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 202: blk.26.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 203: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 204: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 205: blk.27.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 206: blk.27.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 207: blk.27.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 208: blk.27.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 209: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 210: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 211: blk.28.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 212: blk.28.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 213: 
blk.28.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 214: blk.28.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 215: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 216: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 217: blk.29.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 218: blk.29.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 219: blk.14.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 220: blk.14.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 221: blk.14.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 222: blk.15.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 223: blk.15.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 224: blk.15.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 225: blk.16.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 226: blk.16.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 227: blk.16.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 228: blk.17.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 229: blk.17.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 230: blk.17.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 231: blk.18.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 232: blk.18.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 233: blk.18.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 234: blk.19.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 235: blk.19.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 236: blk.19.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 237: blk.20.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 238: blk.20.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 239: blk.20.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 240: blk.21.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 241: blk.21.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 242: blk.21.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 243: blk.22.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 244: blk.22.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 245: blk.22.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 246: blk.23.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 247: blk.23.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 248: blk.23.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 249: blk.24.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 250: blk.24.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 251: blk.24.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 252: blk.25.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 253: blk.25.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 254: blk.25.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 255: blk.26.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 256: blk.26.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 257: blk.26.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 258: blk.27.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 259: blk.27.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 260: blk.27.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 261: blk.28.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 262: blk.28.attn_k.weight 
q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 263: blk.28.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 264: blk.29.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 265: blk.29.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 266: blk.29.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 267: blk.29.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 268: blk.29.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 271: blk.30.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 272: blk.30.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 273: blk.30.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 274: blk.30.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 275: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 276: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 277: blk.31.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 278: blk.31.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 279: blk.31.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 280: blk.31.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 281: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 282: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 283: blk.32.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 284: blk.32.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 285: blk.32.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 286: blk.32.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 
] llama_model_loader: - tensor 287: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 288: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 289: blk.33.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 290: blk.33.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 291: blk.33.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 292: blk.33.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 293: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 294: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 295: blk.34.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 296: blk.34.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 297: blk.34.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 298: blk.34.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 299: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 300: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 301: blk.35.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 302: blk.35.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 303: blk.35.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 304: blk.35.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 305: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 306: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 307: blk.36.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 308: blk.36.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 309: blk.36.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 310: blk.36.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] 
llama_model_loader: - tensor 311: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 312: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 313: blk.37.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 314: blk.37.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 315: blk.37.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 316: blk.37.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 317: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 318: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 319: blk.38.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 320: blk.38.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 321: blk.38.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 322: blk.38.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 323: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 324: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 325: blk.39.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 326: blk.39.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 327: blk.39.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 328: blk.39.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 329: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 330: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 331: output_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 332: output.weight q6_K [ 5120, 125696, 1, 1 ] llama_model_loader: - tensor 333: blk.30.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 334: blk.30.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 335: 
blk.30.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 336: blk.31.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 337: blk.31.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 338: blk.31.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 339: blk.32.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 340: blk.32.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 341: blk.32.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 342: blk.33.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 343: blk.33.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 344: blk.33.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 345: blk.34.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 346: blk.34.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 347: blk.34.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 348: blk.35.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 349: blk.35.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 350: blk.35.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 351: blk.36.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 352: blk.36.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 353: blk.36.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 354: blk.37.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 355: blk.37.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 356: blk.37.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 357: blk.38.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 358: blk.38.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 359: blk.38.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 360: blk.39.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 361: blk.39.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 362: blk.39.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - kv 0: general.architecture str llama_model_loader: - kv 1: general.name str llama_model_loader: - kv 2: baichuan.tensor_data_layout str llama_model_loader: - kv 3: baichuan.context_length u32 llama_model_loader: - kv 4: baichuan.embedding_length u32 llama_model_loader: - kv 5: baichuan.block_count u32 llama_model_loader: - kv 6: baichuan.feed_forward_length u32 llama_model_loader: - kv 7: baichuan.rope.dimension_count u32 llama_model_loader: - kv 8: baichuan.attention.head_count u32 llama_model_loader: - kv 9: baichuan.attention.head_count_kv u32 llama_model_loader: - kv 10: baichuan.attention.layer_norm_rms_epsilon f32 llama_model_loader: - kv 11: tokenizer.ggml.model str llama_model_loader: - kv 12: tokenizer.ggml.tokens arr llama_model_loader: - kv 13: tokenizer.ggml.scores arr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 llama_model_loader: - kv 18: general.quantization_version u32 llama_model_loader: - kv 19: general.file_type u32 llama_model_loader: - type f32: 81 tensors llama_model_loader: - type q4_0: 281 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_print_meta: format = GGUF V2 (latest) llm_load_print_meta: arch = baichuan llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 125696 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 4096 llm_load_print_meta: n_ctx = 512 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_head = 40 llm_load_print_meta: n_head_kv = 40 llm_load_print_meta: n_layer = 40 llm_load_print_meta: n_rot = 128 
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.90 B
llm_load_print_meta: model size = 7.44 GiB (4.60 BPW)
llm_load_print_meta: general.name = Baichuan2-13B-Chat
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 7614.46 MB (+ 400.00 MB per state)
...........................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: loading '/Volumes/WD_sn770/LLAMA2/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x119507430 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x119507c60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x119508180 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x1195087b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x119508cd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x1195091f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x119509710 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x119509c30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x13cf059a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x13ce07530 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x13ce07b70 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x13ce08340 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x13ce089f0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x13ce090a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13ce09750 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13ce09e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x13ce0a4b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13ce0ab60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13ce0b210 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13ce0ba30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13ce0c0e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13ce0c790 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x13ce0ce50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x13ce0d680 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x13ce0dee0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13ce0e740 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x13ce0efa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x13ce0fa00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13ce10160 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13ce10b20 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x13ce11280 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x13ce119e0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x13ce11f00 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x13ce12660 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x13ce12dc0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x13ce13520 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x13ce13d30 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x13ce14540 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x13ce14d50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x13ce15560 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x13ce15d70 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x13ce16580 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x13ce16d90 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x13ce175a0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x11950a320 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x11950ac50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope 0x11950b3d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x11950bfa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11950c830 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11950d0c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x11950d950 | th_max = 1024 | th_width = 32
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 256.97 MB
llama_new_context_with_model: max tensor size = 503.47 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 7617.11 MB, ( 7617.61 / 10922.67)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.48 MB, ( 7619.09 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 8021.09 / 10922.67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 255.52 MB, ( 8276.61 / 10922.67)
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
zsh: abort ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p 0

Following the steps above, GPU inference errors out while CPU inference works fine. Could anyone help take a look? Mac mini M2 @jameswu2014
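As a quick cross-check of the loader output above (an illustrative sketch, not part of llama.cpp): the reported `model size = 7.44 GiB (4.60 BPW)` follows directly from the parameter count, since size = params × bits-per-weight / 8.

```python
# Cross-check llm_load_print_meta's size line for Baichuan2-13B-Chat Q4_0.
params = 13.90e9         # "model params = 13.90 B" from the log
bits_per_weight = 4.60   # "4.60 BPW" from the log

size_gib = params * bits_per_weight / 8 / 2**30  # bits -> bytes -> GiB
print(f"{size_gib:.2f} GiB")  # prints "7.44 GiB", matching the log
```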

I'm hitting the same problem: CPU works fine, but GPU doesn't.
CUDA error 9 at ggml-cuda.cu:6829: invalid configuration argument
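For reference, the Metal abort above is consistent with the model's own metadata: the assert message at ggml-metal.m:1146 says "only power-of-two n_head implemented", and `llm_load_print_meta` shows `n_head = 40` for Baichuan2-13B, which is not a power of two. A minimal sketch of that constraint (hypothetical Python, mirroring the check the backend presumably performs):

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two iff exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0

# n_head = 40 for Baichuan2-13B (see llm_load_print_meta above)
print(is_power_of_two(40))  # False -> the GGML_ASSERT fires on Metal
print(is_power_of_two(32))  # True  -> a head count that would pass
```

This is why the same GGUF runs on CPU (which has no such restriction) but aborts as soon as layers are offloaded to the Metal backend.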


chunzha1 commented on May 21, 2024

(alpaca_env) chunzhamini@chunzhamini llama.cpp % ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p '从前有一只小狐狸,他' --temp 0 -ngl 1
Log start
main: warning: changing RoPE frequency base to 0 (default 10000.0)
main: warning: scaling RoPE frequency by 0 (default 1.0)
main: build = 1270 (c091cdf)
main: built with Apple clang version 14.0.3 (clang-1403.0.22.14.1) for arm64-apple-darwin22.5.0
main: seed = 1695699630
llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 5120, 125696, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 7: blk.1.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 8: blk.1.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 9: blk.1.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 13: blk.2.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 14: blk.2.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 15: blk.2.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 16: blk.2.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 17: blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.3.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 20: blk.3.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 21: blk.3.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 22: blk.3.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 23: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 24: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 25: blk.4.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 26: blk.4.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 27: blk.4.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 28: blk.4.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 29: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 30: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 31: blk.5.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 32: blk.5.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 33: blk.5.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 34: blk.5.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 35: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.6.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 38: blk.6.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 39: blk.6.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 40: blk.6.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 41: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 42: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 43: blk.7.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 44: blk.7.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 45: blk.7.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 46: blk.7.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 47: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 48: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 49: blk.8.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 50: blk.8.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 51: blk.8.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 52: blk.8.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 53: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.9.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 56: blk.9.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 57: blk.9.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 58: blk.9.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 59: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 60: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 61: blk.10.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 62: blk.10.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 63: blk.10.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 64: blk.10.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 65: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 66: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 67: blk.11.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 68: blk.11.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 69: blk.11.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 70: blk.11.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 71: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.12.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 74: blk.12.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 75: blk.12.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 76: blk.12.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 77: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 78: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 79: blk.13.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 80: blk.13.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 81: blk.13.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 82: blk.0.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 83: blk.0.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 84: blk.0.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 85: blk.1.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 86: blk.1.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 87: blk.1.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 88: blk.2.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 89: blk.2.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 90: blk.2.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 91: blk.3.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 92: blk.3.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 93: blk.3.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 94: blk.4.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 95: blk.4.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 96: blk.4.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 97: blk.5.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 98: blk.5.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 99: blk.5.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 100: blk.6.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 101: blk.6.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 102: blk.6.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 103: blk.7.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 104: blk.7.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 105: blk.7.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 106: blk.8.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 107: blk.8.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 108: blk.8.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 109: blk.9.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 110: blk.9.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 111: blk.9.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 112: blk.10.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 113: blk.10.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 114: blk.10.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 115: blk.11.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 116: blk.11.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 117: blk.11.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 118: blk.12.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 119: blk.12.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 120: blk.12.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 133: blk.15.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 134: blk.15.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 135: blk.15.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 139: blk.16.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 140: blk.16.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 141: blk.16.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 142: blk.16.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 143: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.17.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 146: blk.17.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 147: blk.17.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 148: blk.17.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 149: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 150: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 151: blk.18.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 152: blk.18.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 153: blk.18.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 154: blk.18.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 155: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 156: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 157: blk.19.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 158: blk.19.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 159: blk.19.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 160: blk.19.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 161: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.20.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 164: blk.20.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 165: blk.20.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 166: blk.20.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 167: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 168: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 169: blk.21.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 170: blk.21.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 171: blk.21.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 172: blk.21.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 173: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 174: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 175: blk.22.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 176: blk.22.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 177: blk.22.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 178: blk.22.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 179: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.23.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 182: blk.23.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 183: blk.23.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 184: blk.23.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 185: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 186: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 187: blk.24.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 188: blk.24.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 189: blk.24.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 190: blk.24.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 191: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 192: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 193: blk.25.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 194: blk.25.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 195: blk.25.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 196: blk.25.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 197: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 198: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 199: blk.26.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 200: blk.26.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 201: blk.26.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 202: blk.26.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 203: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 204: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 205: blk.27.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 206: blk.27.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 207: blk.27.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 208: blk.27.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 209: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 210: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 211: blk.28.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 212: blk.28.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 213: blk.28.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 214: blk.28.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 215: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 216: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 217: blk.29.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 218: blk.29.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 219: blk.14.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 220: blk.14.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 221: blk.14.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 222: blk.15.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 223: blk.15.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 224: blk.15.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 225: blk.16.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 226: blk.16.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 227: blk.16.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 228: blk.17.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 229: blk.17.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 230: blk.17.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 231: blk.18.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 232: blk.18.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 233: blk.18.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 234: blk.19.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 235: blk.19.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 236: blk.19.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 237: blk.20.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 238: blk.20.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 239: blk.20.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 240: blk.21.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 241: blk.21.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 242: blk.21.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 243: blk.22.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 244: blk.22.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 245: blk.22.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 246: blk.23.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 247: blk.23.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 248: blk.23.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 249: blk.24.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 250: blk.24.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 251: blk.24.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 252: blk.25.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 253: blk.25.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 254: blk.25.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 255: blk.26.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 256: blk.26.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 257: blk.26.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 258: blk.27.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 259: blk.27.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 260: blk.27.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 262: blk.28.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 263: blk.28.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 271: blk.30.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 272: blk.30.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 277: blk.31.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 278: blk.31.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 279: blk.31.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 280: blk.31.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 281: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 283: blk.32.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 284: blk.32.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 285: blk.32.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 286: blk.32.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 287: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 288: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 289: blk.33.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 290: blk.33.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 291: blk.33.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 292: blk.33.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 293: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 294: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 295: blk.34.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 296: blk.34.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 297: blk.34.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 298: blk.34.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 299: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 300: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 301: blk.35.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 302: blk.35.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 303: blk.35.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 304: blk.35.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 305: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 306: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 307: blk.36.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 308: blk.36.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 309: blk.36.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 310: blk.36.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 311: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 312: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 313: blk.37.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 314: blk.37.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 315: blk.37.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 316: blk.37.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 317: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 318: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 319: blk.38.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 320: blk.38.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 321: blk.38.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 322: blk.38.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 323: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 324: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 325: blk.39.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 326: blk.39.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 327: blk.39.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 328: blk.39.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 329: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 330: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 331: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 332: output.weight q6_K [ 5120, 125696, 1, 1 ]
llama_model_loader: - tensor 333: blk.30.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 334: blk.30.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 335: blk.30.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 336: blk.31.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 337: blk.31.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 338: blk.31.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 339: blk.32.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 340: blk.32.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 341: blk.32.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 342: blk.33.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 343: blk.33.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 344: blk.33.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 345: blk.34.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.34.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 347: blk.34.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 348: blk.35.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 349: blk.35.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 350: blk.35.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 351: blk.36.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 352: blk.36.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 353: blk.36.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 354: blk.37.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.37.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 356: blk.37.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 357: blk.38.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 358: blk.38.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.38.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 361: blk.39.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 362: blk.39.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: baichuan.tensor_data_layout str
llama_model_loader: - kv 3: baichuan.context_length u32
llama_model_loader: - kv 4: baichuan.embedding_length u32
llama_model_loader: - kv 5: baichuan.block_count u32
llama_model_loader: - kv 6: baichuan.feed_forward_length u32
llama_model_loader: - kv 7: baichuan.rope.dimension_count u32
llama_model_loader: - kv 8: baichuan.attention.head_count u32
llama_model_loader: - kv 9: baichuan.attention.head_count_kv u32
llama_model_loader: - kv 10: baichuan.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - kv 19: general.file_type u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = baichuan
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 125696
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.90 B
llm_load_print_meta: model size = 7.44 GiB (4.60 BPW)
llm_load_print_meta: general.name = Baichuan2-13B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 7614.46 MB (+ 400.00 MB per state)
...........................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: loading '/Volumes/WD_sn770/LLAMA2/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x119507430 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x119507c60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x119508180 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x1195087b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x119508cd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x1195091f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x119509710 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x119509c30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x13cf059a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x13ce07530 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x13ce07b70 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x13ce08340 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x13ce089f0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x13ce090a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13ce09750 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13ce09e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x13ce0a4b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13ce0ab60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13ce0b210 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13ce0ba30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13ce0c0e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13ce0c790 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x13ce0ce50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x13ce0d680 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x13ce0dee0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13ce0e740 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x13ce0efa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x13ce0fa00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13ce10160 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13ce10b20 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x13ce11280 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x13ce119e0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x13ce11f00 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x13ce12660 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x13ce12dc0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x13ce13520 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x13ce13d30 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x13ce14540 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x13ce14d50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x13ce15560 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x13ce15d70 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x13ce16580 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x13ce16d90 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x13ce175a0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x11950a320 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x11950ac50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope 0x11950b3d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x11950bfa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11950c830 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11950d0c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x11950d950 | th_max = 1024 | th_width = 32
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 256.97 MB
llama_new_context_with_model: max tensor size = 503.47 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 7617.11 MB, ( 7617.61 / 10922.67)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.48 MB, ( 7619.09 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 8021.09 / 10922.67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 255.52 MB, ( 8276.61 / 10922.67)
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
zsh: abort ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p 0
Following the steps above, GPU inference aborts with this error while CPU inference works fine. Could anyone take a look? Mac mini M2 @jameswu2014
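For context on the abort: the GGML_ASSERT text comes from llama.cpp's Metal backend (it appears to be the ALiBi kernel), which at the time only implemented head counts that are powers of two. Baichuan-13B has n_head = 40 (see the log above), so the check fires on the GPU path but not on CPU. A minimal illustration:

```python
# The assert text says it all: only power-of-two head counts are
# implemented. Baichuan-13B's n_head = 40 is not one; 32 would be.
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

print(is_power_of_two(40))  # False: trips the GGML_ASSERT on Metal
print(is_power_of_two(32))  # True
```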

from baichuan2.

wzp123123 commented on May 21, 2024

Same problem here; any ideas on how to solve it?


dlutsniper commented on May 21, 2024

Latest version of llama.cpp

Install Python dependencies

python3 -m pip install -r requirements.txt

Latest development version of gguf

cd llama.cpp/gguf-py
pip install --editable .

Convert

python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf
27.8GB

Quantize

./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0
7.99GB

Run

./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding

Inference works perfectly on a 2015 MacBook Pro.
One small question: I'm not sure whether the prompt template needs adjusting @jameswu2014
(image)
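On the prompt-template question: Baichuan2-Chat was trained with special role tokens rather than a textual template (user_token_id=195 and assistant_token_id=196 in its generation_config). The sketch below shows one way to assemble a plain-text prompt for main/server; the literal `<reserved_...>` strings are an assumption about how the tokenizer renders those ids, so verify them against your tokenizer before relying on this.

```python
# Assumed surface forms of Baichuan2's role-token ids 195 / 196;
# check these against the actual tokenizer vocabulary.
USER_TOK = "<reserved_106>"
ASSISTANT_TOK = "<reserved_107>"

def build_chat_prompt(user_msg: str, history=()) -> str:
    """Wrap turns in Baichuan2-Chat's role tokens for a text prompt."""
    parts = []
    for user_turn, assistant_turn in history:
        parts.append(f"{USER_TOK}{user_turn}{ASSISTANT_TOK}{assistant_turn}")
    # The trailing assistant token cues the model to start answering.
    parts.append(f"{USER_TOK}{user_msg}{ASSISTANT_TOK}")
    return "".join(parts)

prompt = build_chat_prompt("你好")
print(prompt)  # <reserved_106>你好<reserved_107>
```

The same string can be passed to `./main -p` or to the server's completion endpoint.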


dlutsniper commented on May 21, 2024

For some reason, inference through llama.cpp's server mode gives worse results than the command line. The command line output is fine, but the server output is quite strange.
What could be causing this? @jameswu2014

Model: Baichuan2-13B-Chat
Latest version of llama.cpp

main inference command line
(image)
main inference test result
(image)

server inference command line
(image)
server inference test result
(image)


songkq commented on May 21, 2024

@jameswu2014 @dlutsniper quantize gguf model failed on RTX3090 with Driver Version: 525.105.17 CUDA Version: 12.0. Could you please give some advice for this issue?

./quantize /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-f16.gguf /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-Q8_0.gguf 7

CUDA error 804 at /llama.cpp/ggml-cuda.cu:5522: forward compatibility was attempted on non supported HW
current device: 0


zhangqiangauto commented on May 21, 2024

Latest version of llama.cpp

Install Python dependencies

python3 -m pip install -r requirements.txt

Latest development version of gguf

cd llama.cpp/gguf-py && pip install --editable .

Convert

python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf (27.8GB)

Quantize

./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0 (7.99GB)

Run

./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding

Inference works perfectly on a 2015 MacBook Pro. One small question: I'm not sure whether the prompt template needs adjusting @jameswu2014 (image)

I quantized the chat version of Baichuan2-7B following these steps, but most of the answers come out in English and the quality is very poor; I don't know why.
The log is attached: debug.txt


zhangqiangauto commented on May 21, 2024

(quoting the quantization steps and my previous question above)

Problem solved. You need to follow @jameswu2014's steps and convert Baichuan2 to Baichuan1 first. So the current version of llama.cpp cannot directly convert Baichuan2 models.
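For reference, the Baichuan2 -> Baichuan1 conversion the Baichuan2 README describes is a row-wise L2 normalization of the lm_head weight (torch.nn.functional.normalize applied to lm_head.weight). A dependency-free sketch of that operation on a toy matrix:

```python
import math

def normalize_rows(mat, eps=1e-12):
    """Row-wise L2 normalization -- the same math as
    torch.nn.functional.normalize(w) applied to the lm_head weight."""
    normed = []
    for row in mat:
        norm = math.sqrt(sum(x * x for x in row))
        normed.append([x / max(norm, eps) for x in row])
    return normed

# Toy 2x3 "lm_head" weight: every row has unit L2 norm afterwards.
w = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
w_normed = normalize_rows(w)
print(w_normed[0])  # [0.6, 0.8, 0.0]
```

In practice you would load the checkpoint, apply this to the lm_head weight tensor, and save it back before running the GGUF conversion.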


aisensiy commented on May 21, 2024

    I submitted a PR to llama.cpp: ggerganov/llama.cpp#3009. First modify the model with the Baichuan2 -> Baichuan1 lm_head conversion from the Baichuan2 README; then the changes in the linked PR apply.

This works for Q8_0, Q5_0 and Q4_0, but fails for the other types with this error message:

$ ./quantize /models/baichuan2-13b-chat.gguf /models/baichuan2-13b-chat-Q4_K_M.gguf Q4_K

...
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llama_model_quantize_internal: meta size = 2883232 bytes
[   1/ 363]                    token_embd.weight - [ 5120, 125696,     1,     1], type =    f16, quantizing to q4_K .. size =  1227.50 MB ->   345.23 MB | hist: 
[   2/ 363]             blk.0.attn_output.weight - [ 5120,  5120,     1,     1], type =    f16, quantizing to q4_K .. size =    50.00 MB ->    14.06 MB | hist: 
[   3/ 363]                blk.0.ffn_gate.weight - [ 5120, 13696,     1,     1], type =    f16, quantizing to q4_K .. size =   133.75 MB ->    37.62 MB | hist: 
[   4/ 363]                blk.0.ffn_down.weight - [13696,  5120,     1,     1], type =    f16, 

get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for k-quants
llama_model_quantize: failed to quantize: Unsupported tensor size encountered

main: failed to quantize model from '/output/baichuan2-13b-chat.gguf'
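The failure above is purely arithmetic: k-quants (Q4_K and friends) pack weights in 256-element super-blocks, so every quantized row length must be divisible by 256. Baichuan-13B's FFN dimension of 13696 is not, while Q4_0/Q5_0/Q8_0 use 32-element blocks and therefore pass. A quick check:

```python
def quantizable(n_cols: int, block_size: int) -> bool:
    """True if a row of n_cols weights splits evenly into quant blocks."""
    return n_cols % block_size == 0

print(quantizable(13696, 32))   # True:  Q4_0/Q5_0/Q8_0 use 32-wide blocks
print(quantizable(13696, 256))  # False: k-quants need 256-wide super-blocks
print(13696 % 256)              # 128, the remainder the error complains about
```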


wzp123123 commented on May 21, 2024

ggerganov/llama.cpp#3740


chunzha1 commented on May 21, 2024

That doesn't seem to be quite the same problem as mine. What device are you using?


guoqiangqi commented on May 21, 2024

After quantizing with the latest llama.cpp, the server inference results are inaccurate. I did not convert baichuan2-13b-chat to Baichuan1 form before quantizing; could that be the cause?
Uploading llama_cpp.png…


VJJJJJJ1 commented on May 21, 2024

    I submitted a PR to llama.cpp: ggerganov/llama.cpp#3009. First modify the model with the Baichuan2 -> Baichuan1 lm_head conversion from the Baichuan2 README; then the changes in the linked PR apply.

Can a fine-tuned Baichuan2 also be accelerated with this method?

