Comments (15)
I've opened a PR against llama.cpp: ggerganov/llama.cpp#3009
First convert the model's `lm_head` following the Baichuan2 → Baichuan1 conversion described in the Baichuan2 README; after that, the changes in the linked PR can be used.
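The Baichuan2 README's Baichuan2 → Baichuan1 conversion boils down to L2-normalizing each row of the `lm_head` weight before running the GGUF conversion. A minimal sketch of that single step, applied to a stand-in tensor (loading the real checkpoint via `transformers` is omitted, and the small shape here is illustrative only):

```python
import torch

# Stand-in for model.lm_head.weight.data (vocab_size x hidden_size);
# for Baichuan2-13B-Chat the real shape is 125696 x 5120.
lm_head = torch.randn(8, 4)

# Baichuan2 -> Baichuan1 conversion step from the Baichuan2 README:
# L2-normalize each row of lm_head, then save the model and proceed
# with the usual convert/quantize pipeline.
lm_head = torch.nn.functional.normalize(lm_head)

# After normalization every row has unit L2 norm.
print(torch.allclose(lm_head.norm(dim=-1), torch.ones(8), atol=1e-5))
```

On a real checkpoint you would assign the normalized tensor back to `model.lm_head.weight.data` and call `model.save_pretrained(...)` before converting.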
@jameswu2014 @dlutsniper Quantizing the GGUF model failed on an RTX 3090 (Driver Version: 525.105.17, CUDA Version: 12.0). Could you please give some advice on this issue?

```
./quantize /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-f16.gguf /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-Q8_0.gguf 7

CUDA error 804 at /llama.cpp/ggml-cuda.cu:5522: forward compatibility was attempted on non supported HW
current device: 0
```
Solved by building a Docker image from `nvidia/cuda:12.0.0-devel-ubuntu22.04`.
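A sketch of such an image, assuming the llama.cpp Makefile of that period (the `LLAMA_CUBLAS=1` flag and `quantize`/`main` targets are assumptions based on contemporary llama.cpp builds, not taken from this thread):

```dockerfile
# Build llama.cpp inside the CUDA 12.0 devel image so the CUDA runtime the
# binaries link against matches what the host driver supports (avoiding the
# "forward compatibility" error 804 above).
FROM nvidia/cuda:12.0.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y git build-essential
WORKDIR /workspace
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /workspace/llama.cpp
# LLAMA_CUBLAS=1 enables the CUDA backend in the Makefile of that era.
RUN make LLAMA_CUBLAS=1 quantize main
```

Run the resulting container with `docker run --gpus all ...` so the host GPU is visible to `quantize`.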
```
(alpaca_env) chunzhamini@chunzhamini llama.cpp % ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p '从前有一只小狐狸,他' --temp 0 -ngl 1
Log start
main: warning: changing RoPE frequency base to 0 (default 10000.0)
main: warning: scaling RoPE frequency by 0 (default 1.0)
main: build = 1270 (c091cdf)
main: built with Apple clang version 14.0.3 (clang-1403.0.22.14.1) for arm64-apple-darwin22.5.0
main: seed = 1695699630
llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin (version GGUF V2 (latest))
llama_model_loader: - tensor   0: token_embd.weight       q4_0 [ 5120, 125696, 1, 1 ]
llama_model_loader: - tensor   1: blk.0.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor   2: blk.0.ffn_gate.weight   q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor   3: blk.0.ffn_down.weight   q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor   4: blk.0.ffn_up.weight     q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor   5: blk.0.attn_norm.weight  f32  [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor   6: blk.0.ffn_norm.weight   f32  [ 5120, 1, 1, 1 ]
[... tensors 7-330: the same per-block pattern repeated for blk.0-blk.39 — attn_q/attn_k/attn_v/attn_output, ffn_gate/ffn_down/ffn_up in q4_0 and attn_norm/ffn_norm in f32 ...]
llama_model_loader: - tensor 331: output_norm.weight      f32  [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 332: output.weight           q6_K [ 5120, 125696, 1, 1 ]
[... tensors 333-362: blk.30-blk.39 attn_q/attn_k/attn_v weights in q4_0 ...]
llama_model_loader: - kv  0: general.architecture str
llama_model_loader: - kv  1: general.name str
llama_model_loader: - kv  2: baichuan.tensor_data_layout str
[... kv 3-19: baichuan.* hyperparameters, tokenizer.ggml.* vocab/scores/token types and special token ids, general.quantization_version, general.file_type ...]
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = baichuan
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 125696
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff           = 13696
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model params   = 13.90 B
llm_load_print_meta: model size     = 7.44 GiB (4.60 BPW)
llm_load_print_meta: general.name   = Baichuan2-13B-Chat
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token  = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 7614.46 MB (+ 400.00 MB per state)
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: loading '/Volumes/WD_sn770/LLAMA2/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x119507430 | th_max = 1024 | th_width = 32
[... dozens of further "loaded kernel_*" lines; the pasted log is truncated partway through the Metal kernel list ...]
```
th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x13ce12660 | th_max = 576 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x13ce12dc0 | th_max = 640 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x13ce13520 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x13ce13d30 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x13ce14540 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x13ce14d50 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x13ce15560 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x13ce15d70 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x13ce16580 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x13ce16d90 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x13ce175a0 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x11950a320 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x11950ac50 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_rope 0x11950b3d0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_alibi_f32 0x11950bfa0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_f16 0x11950c830 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_f32 0x11950d0c0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f16_f16 0x11950d950 | th_max = 1024 | th_width = 32 ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB ggml_metal_init: maxTransferRate = built-in GPU llama_new_context_with_model: compute buffer total size = 256.97 MB llama_new_context_with_model: max tensor size = 503.47 MB ggml_metal_add_buffer: allocated 'data ' buffer, size = 7617.11 MB, ( 7617.61 / 10922.67) ggml_metal_add_buffer: 
allocated 'eval ' buffer, size = 1.48 MB, ( 7619.09 / 10922.67) ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 8021.09 / 10922.67) ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 255.52 MB, ( 8276.61 / 10922.67) GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented" GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented" zsh: abort ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p 0 按照上面的步骤,GPU推理报错,CPU下正常。大佬可以帮忙看下吗?MAC MINI M2 @jameswu2014
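For context on the abort above: `llm_load_print_meta` reports `n_head = 40`, and 40 is not a power of two, which is exactly the case the `ggml-metal.m` assert rejects. A minimal sketch of the power-of-two test (my own illustration, not llama.cpp code):

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two iff exactly one bit is set,
    # i.e. clearing the lowest set bit (n & (n - 1)) leaves zero.
    return n > 0 and (n & (n - 1)) == 0

# Baichuan2-13B has 40 attention heads (see "n_head = 40" in the log above),
# so the Metal path that only handles power-of-two head counts asserts.
print(is_power_of_two(40))  # False -> GGML_ASSERT fires
print(is_power_of_two(32))  # True  -> e.g. a LLaMA-7B head count would pass
```

So this is a limitation of the Metal backend at that build, not a problem with the converted GGUF file, which is consistent with CPU inference working.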
I'm hitting the same problem: CPU inference works, but GPU inference fails with:
CUDA error 9 at ggml-cuda.cu:6829: invalid configuration argument
llama_model_loader: - tensor 330: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 331: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 332: output.weight q6_K [ 5120, 125696, 1, 1 ]
llama_model_loader: - tensor 333: blk.30.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 334: blk.30.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 335: blk.30.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 336: blk.31.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 337: blk.31.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 338: blk.31.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 339: blk.32.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 340: blk.32.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 341: blk.32.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 342: blk.33.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 343: blk.33.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 344: blk.33.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 345: blk.34.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.34.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 347: blk.34.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 348: blk.35.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 349: blk.35.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 350: blk.35.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 351: blk.36.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 352: blk.36.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 353: blk.36.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 354: blk.37.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.37.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 356: blk.37.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 357: blk.38.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 358: blk.38.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.38.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 361: blk.39.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 362: blk.39.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: baichuan.tensor_data_layout str
llama_model_loader: - kv 3: baichuan.context_length u32
llama_model_loader: - kv 4: baichuan.embedding_length u32
llama_model_loader: - kv 5: baichuan.block_count u32
llama_model_loader: - kv 6: baichuan.feed_forward_length u32
llama_model_loader: - kv 7: baichuan.rope.dimension_count u32
llama_model_loader: - kv 8: baichuan.attention.head_count u32
llama_model_loader: - kv 9: baichuan.attention.head_count_kv u32
llama_model_loader: - kv 10: baichuan.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - kv 19: general.file_type u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = baichuan
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 125696
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.90 B
llm_load_print_meta: model size = 7.44 GiB (4.60 BPW)
llm_load_print_meta: general.name = Baichuan2-13B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 7614.46 MB (+ 400.00 MB per state)
...........................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: loading '/Volumes/WD_sn770/LLAMA2/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x119507430 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x119507c60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x119508180 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x1195087b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x119508cd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x1195091f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x119509710 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x119509c30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x13cf059a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x13ce07530 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x13ce07b70 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x13ce08340 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x13ce089f0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x13ce090a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13ce09750 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13ce09e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x13ce0a4b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13ce0ab60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13ce0b210 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13ce0ba30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13ce0c0e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13ce0c790 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x13ce0ce50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x13ce0d680 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x13ce0dee0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13ce0e740 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x13ce0efa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x13ce0fa00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13ce10160 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13ce10b20 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x13ce11280 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x13ce119e0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x13ce11f00 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x13ce12660 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x13ce12dc0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x13ce13520 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x13ce13d30 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x13ce14540 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x13ce14d50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x13ce15560 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x13ce15d70 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x13ce16580 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x13ce16d90 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x13ce175a0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x11950a320 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x11950ac50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope 0x11950b3d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x11950bfa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11950c830 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11950d0c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x11950d950 | th_max = 1024 | th_width = 32
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 256.97 MB
llama_new_context_with_model: max tensor size = 503.47 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 7617.11 MB, ( 7617.61 / 10922.67)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.48 MB, ( 7619.09 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 8021.09 / 10922.67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 255.52 MB, ( 8276.61 / 10922.67)
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
zsh: abort ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p 0
Following the steps above, GPU inference crashes while CPU inference works fine. Could someone take a look? Mac mini M2 @jameswu2014
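For context on the abort above: the assert message in the log says the Metal alibi kernel at this build only implements power-of-two head counts, and the model metadata shows n_head = 40. A quick standalone check of that condition (the helper name is mine, not from llama.cpp):

```python
# Why the Metal backend aborts: at this llama.cpp build, the alibi kernel in
# ggml-metal.m asserts "only power-of-two n_head implemented", and
# Baichuan2-13B has 40 attention heads (see "n_head = 40" in the log above).

def is_power_of_two(n: int) -> bool:
    """True iff n is a positive power of two (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0

n_head = 40  # from llm_load_print_meta above
print(is_power_of_two(n_head))  # False -> the GGML_ASSERT fires on Metal
print(is_power_of_two(32))      # True  -> a 32-head model would be fine
```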
from baichuan2.
Same problem here. Any ideas on how to solve it?
Latest version of llama.cpp
install Python dependencies
python3 -m pip install -r requirements.txt
Latest development version of gguf
cd llama.cpp/gguf-py
pip install --editable .
Convert
python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf
27.8GB
Quantize
./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0
7.99GB
Run
./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding
Inference works perfectly on a 2015 MacBook Pro.
One small question: I'm not sure whether the prompt template needs adjusting. @jameswu2014
For some reason, inference results from the llama.cpp server are not as good as from the command line. The command-line output is fine, but the server output is strange.
Could someone explain what's going on here? @jameswu2014
Model: Baichuan2-13B-Chat
llama.cpp: latest version
@jameswu2014 @dlutsniper Quantizing the gguf model failed on an RTX3090 with Driver Version: 525.105.17, CUDA Version: 12.0. Could you please give some advice for this issue?
./quantize /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-f16.gguf /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-Q8_0.gguf 7
CUDA error 804 at /llama.cpp/ggml-cuda.cu:5522: forward compatibility was attempted on non supported HW
current device: 0
Latest version of llama.cpp
install Python dependencies
python3 -m pip install -r requirements.txt
Latest development version of gguf
cd llama.cpp/gguf-py
pip install --editable .
Convert
python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf (27.8GB)
Quantize
./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0 (7.99GB)
Run
./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding
Inference works perfectly on a 2015 MacBook Pro. One small question: not sure whether the prompt template needs adjusting. @jameswu2014

I followed these steps to quantize the chat version of Baichuan2-7B, but most of the Q&A output comes out in English and the quality is very poor. I don't know why.
The log is attached: debug.txt
Latest version of llama.cpp
install Python dependencies
python3 -m pip install -r requirements.txt
Latest development version of gguf
cd llama.cpp/gguf-py
pip install --editable .
Convert
python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf (27.8GB)
Quantize
./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0 (7.99GB)
Run
./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding
Inference works perfectly on a 2015 MacBook Pro. One small question: not sure whether the prompt template needs adjusting. @jameswu2014 I followed these steps to quantize the chat version of Baichuan2-7B, but most of the Q&A output comes out in English and the quality is very poor. I don't know why. The log is attached: debug.txt
The problem is solved. You need to follow @jameswu2014's steps and convert Baichuan2 to Baichuan1 first. In other words, the current version of llama.cpp cannot convert a Baichuan2 model directly.
I've opened a PR against llama.cpp: ggerganov/llama.cpp#3009. First modify the model with the Baichuan2 -> Baichuan1 lm_head conversion from the Baichuan2 README; then you can use the changes in the link above.
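The lm_head conversion referred to here (from the Baichuan2 README) bakes Baichuan2's NormHead into the checkpoint by L2-normalizing each row of lm_head.weight, after which the model can be treated like Baichuan1. A minimal sketch of just that operation, using NumPy so it runs standalone; the README does it with torch.nn.functional.normalize, and loading/re-saving the real checkpoint is omitted here:

```python
import numpy as np

def normalize_lm_head(weight: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise L2 normalization, mirroring torch's F.normalize(w, p=2, dim=1).

    For the real conversion this would be applied to the [125696, 5120]
    lm_head.weight tensor of Baichuan2-13B before re-saving the checkpoint.
    """
    norms = np.linalg.norm(weight, axis=1, keepdims=True)
    return weight / np.maximum(norms, eps)  # eps guards against zero rows

# Tiny stand-in matrix instead of the real lm_head.weight:
w = np.array([[3.0, 4.0], [0.6, 0.8]])
w_norm = normalize_lm_head(w)
print(np.linalg.norm(w_norm, axis=1))  # each row now has unit length
```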
This works for Q8_0, Q5_0 and Q4_0, but fails for the other formats with this error message:
$ ./quantize /models/baichuan2-13b-chat.gguf /models/baichuan2-13b-chat-Q4_K_M.gguf Q4_K
...
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llama_model_quantize_internal: meta size = 2883232 bytes
[ 1/ 363] token_embd.weight - [ 5120, 125696, 1, 1], type = f16, quantizing to q4_K .. size = 1227.50 MB -> 345.23 MB | hist:
[ 2/ 363] blk.0.attn_output.weight - [ 5120, 5120, 1, 1], type = f16, quantizing to q4_K .. size = 50.00 MB -> 14.06 MB | hist:
[ 3/ 363] blk.0.ffn_gate.weight - [ 5120, 13696, 1, 1], type = f16, quantizing to q4_K .. size = 133.75 MB -> 37.62 MB | hist:
[ 4/ 363] blk.0.ffn_down.weight - [13696, 5120, 1, 1], type = f16,
get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for k-quants
llama_model_quantize: failed to quantize: Unsupported tensor size encountered
main: failed to quantize model from '/output/baichuan2-13b-chat.gguf'
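The underlying cause is arithmetic: k-quants such as Q4_K pack weights in 256-element superblocks, so every tensor row must be divisible by 256, while Q4_0/Q5_0/Q8_0 use 32-element blocks. Baichuan's hidden size 5120 satisfies both, but the FFN size 13696 satisfies only the latter, which matches the error above:

```python
# k-quants (Q4_K, Q5_K, ...) use 256-element superblocks; the simpler
# Q4_0/Q5_0/Q8_0 formats use 32-element blocks. A tensor row must be an
# exact multiple of the block size to be quantized in that format.
for cols in (5120, 13696):
    print(f"{cols}: cols % 256 = {cols % 256}, cols % 32 = {cols % 32}")
# 5120 divides evenly by both, but 13696 % 256 = 128, so the
# blk.*.ffn_down.weight rows can only use the 32-element formats.
```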
from baichuan2.
That doesn't look quite like my problem. What hardware are you running on?
With the latest llama.cpp, server inference results are inaccurate after quantization. I did not convert baichuan2-13b-chat to the Baichuan1 format before quantizing. Could that be the cause?
I've opened a PR against llama.cpp: ggerganov/llama.cpp#3009. First modify the model with the Baichuan2 -> Baichuan1 lm_head conversion from the Baichuan2 README; then you can use the changes in the link above.
Can a fine-tuned Baichuan2 model also be accelerated with this method?