ggerganov / ggml Goto Github PK

Tensor library for machine learning

License: MIT License

CMake 1.42% C 34.96% Shell 0.53% Cuda 11.79% Objective-C 3.61% Metal 4.48% C++ 40.15% Zig 0.59% Swift 0.03% Python 2.45%

automatic-differentiation large-language-models machine-learning tensor-algebra

ggml's Introduction

ggml

Roadmap / Manifesto

Tensor library for machine learning

Note that this project is under active development.
Some of the development is currently happening in the llama.cpp and whisper.cpp repos

Features

Written in C
16-bit float support
Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
Automatic differentiation
ADAM and L-BFGS optimizers
Optimized for Apple Silicon
On x86 architectures utilizes AVX / AVX2 intrinsics
On ppc64 architectures utilizes VSX intrinsics
No third-party dependencies
Zero memory allocations during runtime

Updates

GPT inference (example)

With ggml you can efficiently run GPT-2 and GPT-J inference on the CPU.

Here is how to run the example programs:

# Build ggml + examples
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2-backend gpt-j

# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2-backend -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
../examples/gpt-j/download-ggml-model.sh 6B
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"

# Install Python dependencies
python3 -m pip install -r ../requirements.txt

# Run the Cerebras-GPT 111M model
# Download from: https://huggingface.co/cerebras
python3 ../examples/gpt-2/convert-cerebras-to-ggml.py /path/to/Cerebras-GPT-111M/
./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"

The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:

Model	Size	Time / Token
GPT-2	117M	5 ms
GPT-2	345M	12 ms
GPT-2	774M	23 ms
GPT-2	1558M	42 ms
---	---	---
GPT-J	6B	125 ms

For more information, checkout the corresponding programs in the examples folder.

Using Metal (only with GPT-2)

For GPT-2 models, offloading to GPU is possible. Note that it will not improve inference performances but will reduce power consumption and free up the CPU for other tasks.

To enable GPU offloading on MacOS:

cmake -DGGML_METAL=ON -DBUILD_SHARED_LIBS=Off ..

# add -ngl 1
./bin/gpt-2 -t 4 -ngl 100 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

Using cuBLAS

# fix the path to point to your CUDA compiler
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..

Using hipBLAS

cmake -DCMAKE_C_COMPILER="$(hipconfig -l)/clang" -DCMAKE_CXX_COMPILER="$(hipconfig -l)/clang++" -DGGML_HIPBLAS=ON

Using SYCL

# linux
source /opt/intel/oneapi/setvars.sh
cmake -G "Ninja" -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL=ON ..

# windows
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
cmake -G "Ninja" -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DGGML_SYCL=ON ..

Compiling for Android

Download and unzip the NDK from this download page. Set the NDK_ROOT_PATH environment variable or provide the absolute path to the CMAKE_ANDROID_NDK in the command below.

cmake .. \
   -DCMAKE_SYSTEM_NAME=Android \
   -DCMAKE_SYSTEM_VERSION=33 \
   -DCMAKE_ANDROID_ARCH_ABI=arm64-v8a \
   -DCMAKE_ANDROID_NDK=$NDK_ROOT_PATH
   -DCMAKE_ANDROID_STL_TYPE=c++_shared

# Create directories
adb shell 'mkdir /data/local/tmp/bin'
adb shell 'mkdir /data/local/tmp/models'

# Push the compiled binaries to the folder
adb push bin/* /data/local/tmp/bin/

# Push the ggml library
adb push src/libggml.so /data/local/tmp/

# Push model files
adb push models/gpt-2-117M/ggml-model.bin /data/local/tmp/models/


# Now lets do some inference ...
adb shell

# Now we are in shell
cd /data/local/tmp
export LD_LIBRARY_PATH=/data/local/tmp
./bin/gpt-2-backend -m models/ggml-model.bin -p "this is an example"

Resources

GGML - Large Language Models for Everyone: a description of the GGML format provided by the maintainers of the llm Rust crate, which provides Rust bindings for GGML
marella/ctransformers: Python bindings for GGML models.
go-skynet/go-ggml-transformers.cpp: Golang bindings for GGML models
smspillaz/ggml-gobject: GObject-introspectable wrapper for use of GGML on the GNOME platform.

ggml's People

Contributors

Stargazers

Watchers

Forkers

sizzles pcannon67 straitrobot ttldtor ishine neopunisher grivet trholding praveenmunagapati celsopitta icodein takuya-takeuchi codefromanywhere katsu560 jimbog tiendung suryatmodulus ocordeiro segmond akamil-etsy compilade guberti amorjnyh stjordanis arulk birthbaum xieren58 ukaserge jeffersonmf qjchen1972 smpanaro loichau1997 thomasfsteeples sburnicki machine-learning-works tpoisonooo limorelin yakimakhali pavelkucherov hidenorly archiwed kallisti5 rjsabet co-simulation jdehorty pikot bollu dorgonman selose maihd cualquiercosa327 martialdelastic allsulli19s augustrush ericbioinf codingonion papasega jabariholder yanghuaxuan kustomzone rane2021 anylee2021 djinn doandongnguyen hertera1 siran evelynmitchell koogle eltociear sudofusion theolivenbaum hbcbh1999 de30 pikalover6 david-gemmell miknoj tylerchr cromwellian yunfanye sadjad manyoso iacore tonysvault hymilex pervrosen ravenscroftj mackenbaron wyklq lostruins ezhomelabs apollohuang1 botogoske e-tjan147 pugzly sanity pub-ai 0of emilianoeloi sawradip inu-ai

ggml's Issues

GGML CLIP?

I saw quite a few apps like rclip built to search through huge large galleries of photos with CLIP, but most wasted so much power and compute on nothing. Maybe such a C++ port would be rather useful for this application.

[Bug] Zero temperature yields incorrect results

Passing the parameter --temp 0 causes GPT-J (and I suspect all other models) to behave very strangely. See the following output:

guberti@Precision-7540:~/ggml/build$ ./bin/gpt-j -p "Once upon a time there was a" --temp 0
main: seed = 1678657256
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 1
gptj_model_load: ggml ctx size = 13334.86 MB
gptj_model_load: memory_size =  1792.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 11542.79 MB / num tensors = 285
main: number of tokens in prompt = 7

Once upon a time there was aGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

No matter what prompt is used, the model repeatedly generates the token G.

Solution

This is caused by a division-by-zero error in examples/utils.cpp:

{
    const double scale = 1.0/temp;
    for (int i = 0; i < n_logits; ++i) {
        logits_id.push_back(std::make_pair(logits[i]*scale, i));
    }
}

Happy to contribute a simple fix if @ggerganov is busy.

Models output <|endoftext|> when n is a larger value

The models sometimes (but frequent enough to notice) output <|endoftext|> when n is a value such as 1000 after the generated text. It would be nice if that <|endoftext|> token could be filtered out.

Repetition penalty?

Is there a way to add/use repetition penalty to the generator code?

GPT-J broken after transpose changes in ggml_new_tensor_2d

I am receiving this error when loading GPT-J on AWS Graviton2 EC2 node.
gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]

The error seems to be issuing from ggml_new_tensor_2d and specifically transpose activity in recently checked in code. Are there any ways to either modify the model or use it as is with current code.

Implementing Roberta

I wanted to check the feasibility of implementing roberta-large-mnli model through ggml. Is there anything that could be a potential hurdle ?

Proposal for integrating ggml and tiny-dnn and extending whisper.cpp and llama.cpp

Dear Georgi Gerganov,

I am a fan of your work on ggml, whisper.cpp and llama.cpp. I think you have done an amazing job of creating efficient and portable deep learning libraries and models in C/C++. I am also interested in tiny-dnn, a header-only, dependency-free deep learning framework in C++ that supports various types of neural network layers, activation functions, loss functions and optimization algorithms.

I have a proposal for integrating ggml and tiny-dnn and extending whisper.cpp and llama.cpp with training and fine-tuning abilities. I think this would bring many benefits for both projects and the users, such as:

Reducing the memory footprint and improving the inference speed of deep learning models using ggml’s 4-bit integer quantization support. This would be useful for tiny-dnn, which aims to run on limited computational resource and embedded systems. It would also enable tiny-dnn to run more complex models such as GPT-J and LLaMA that require large amounts of memory.
Enhancing the performance of tiny-dnn on different platforms and devices using ggml’s optimized inference for models such as GPT-2, GPT-J, Whisper, LLaMA, and RWKV using NEON intrinsics and Accelerate framework on Apple silicon, and AVX intrinsics on x86 architectures. This would allow tiny-dnn to support more types of natural language models that use attention mechanisms and transformers.
Experimenting with different network architectures and hyperparameters to improve the accuracy and robustness of whisper.cpp and llama.cpp models using tiny-dnn’s various types of neural network layers, activation functions, loss functions and optimization algorithms. This would enable whisper.cpp and llama.cpp to support training and fine-tuning, which are currently not available. It would also allow whisper.cpp and llama.cpp to leverage the existing knowledge and resources from the tiny-dnn community.
Leveraging the existing pre-trained models from Caffe using tiny-dnn’s ability to import models from Caffe. This would enable whisper.cpp and llama.cpp to support more types of models from different sources. It would also allow whisper.cpp and llama.cpp to benefit from the popularity and availability of Caffe models online.

I hope you find this proposal interesting and worthwhile. I would love to hear your feedback and thoughts on this idea. I think it would be a great opportunity to collaborate and create something awesome together.

Thank you for your time and attention.

Sincerely, Amir Rasti

Suggestion: CLIP Interrogator

This is using the ViT-L-14/openai model, which converts images to text with stunning results.

Here is how the recognised text will look on midjourney.

https://huggingface.co/spaces/pharma/CLIP-Interrogator

Keeping the model loaded on RAM

Is there a way to keep the model loaded in the RAM between successive runs? I have an api like setup, and every time a prompt comes in, the model has to be loaded into RAM again, which takes a while for GPT-J.
I'm using python and basically just running the ./bin/gpt-j command via subprocess.Popen.

Suggestion: RWKV Language Model

Hi I am the dev of https://github.com/BlinkDL/ChatRWKV and it is a RNN (so faster and saves VRAM) that can match transformer performance (and already scaled to 14B params. more to come).

Let me know if you will be interested in supporting it :) The inference is very simple:
https://github.com/BlinkDL/ChatRWKV/blob/main/src/model_run.py

Temp not used in gpt utils

I set temp = 0 and saw that I was still getting different outputs, so I checked the code and found that temperature isn't being used anywhere.

https://github.com/ggerganov/ggml/blob/master/examples/utils.cpp#L252-L335

Build failiure on armv7 (docker)

Hi! Thank you for your work on ggml and whisper.cpp, these two amazing projects really did wonders for the performance. I tried compiling the examples to see if they can be run on an Ubuntu Touch phone, which is essentially an Ubuntu 16.04 LTS Core on arm. Sadly, the build failed due to not finding immintrin.h, which as far as I understood is present only on x84 systems. As you can tell, I am a novice, so I'd appreciate any help regarding this. Once again, thanks for all your work! :)

Logs

root@d1fa92e4b731:~/ggml/build# make -j4 gpt-2 gpt-j
[ 50%] Built target ggml_utils
[ 50%] Building C object src/CMakeFiles/ggml.dir/ggml.c.o
/root/ggml/src/ggml.c:153:23: fatal error: immintrin.h: No such file or directory
compilation terminated.
src/CMakeFiles/ggml.dir/build.make:62: recipe for target 'src/CMakeFiles/ggml.dir/ggml.c.o' failed
make[3]: *** [src/CMakeFiles/ggml.dir/ggml.c.o] Error 1
CMakeFiles/Makefile2:85: recipe for target 'src/CMakeFiles/ggml.dir/all' failed
make[2]: *** [src/CMakeFiles/ggml.dir/all] Error 2
CMakeFiles/Makefile2:622: recipe for target 'examples/gpt-2/CMakeFiles/gpt-2.dir/rule' failed
make[1]: *** [examples/gpt-2/CMakeFiles/gpt-2.dir/rule] Error 2
Makefile:329: recipe for target 'gpt-2' failed
make: *** [gpt-2] Error 2
root@d1fa92e4b731:~/ggml/build# uname -a
Linux d1fa92e4b731 5.19.0-26-generic #27-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov 23 20:44:15 UTC 2022 armv7l armv7l armv7l GNU/Linux

GPT-2: Wrong shape in model file: got [768, 2304], expected [2304, 768]

Hello,
I met the wrong shape in the model file, does anyone know how to resolve this?

➜  build git:(master) make -j4 gpt-2
Consolidate compiler generated dependencies of target ggml_utils
Consolidate compiler generated dependencies of target ggml
[ 33%] Built target ggml_utils
[ 66%] Built target ggml
Consolidate compiler generated dependencies of target gpt-2
[100%] Built target gpt-2
➜  build git:(master) ../examples/gpt-2/download-ggml-model.sh 117M
Downloading ggml model 117M ...
models/gpt-2-117M/ggml-mod 100%[======================================>] 239.58M  41.3MB/s    in 6.2s
Done! Model '117M' saved in 'models/gpt-2-117M/ggml-model.bin'
You can now use it like this:

  $ ./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

➜  build git:(master) ./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

main: seed = 1680664186
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 768
gpt2_model_load: n_head  = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: f16     = 1
gpt2_model_load: ggml ctx size = 384.74 MB
gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
gpt2_model_load: tensor 'model/h0/attn/c_attn/w' has wrong shape in model file: got [768, 2304], expected [2304, 768]
main: failed to load model from 'models/gpt-2-117M/ggml-model.bin'

core dumps when running examples

I got core dumps when running both of your inference examples, e.g.

gptj_model_load: f16     = 1
gptj_model_load: ggml ctx size = 13334.86 MB
Illegal instruction (core dumped)

This was in a Linux Mint guest in VirtualBox with over 32 GB of memory. Any suggestions about what the problem might be?

Illegal instruction on Intel sandy bridge

Following the instructions on the readme, I get all the way to running the model. Then ./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example" gives me this output:

main: seed = 1677652321
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx = 1024
gpt2_model_load: n_embd = 768
gpt2_model_load: n_head = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: f16 = 1
gpt2_model_load: ggml ctx size = 311.12 MB
Illegal instruction (core dumped)

According to everything I had read, my processor should support all of the required instruction sets. I have installed libmkl-avx21 and tried make clean/make all. Specs are i7 2600k and Ubuntu 22.04

Text to speech models in GGML?

@ggerganov do you have any interest in producing more models in GGML format?

I'm now convinced your approach of zero dependency, no memory allocation cpu-first ideaology will make it accessible to everyone.

You've got LLAMA, whisper, what is remaining is the reverse of whisper.

What are your thoughts?

Segmentation fault

Hello,

I tried to run GPT-2 and it worked fine. GPT-J however gives a Segmentation fault error (see attachment)

Could this be due to not having enough RAM? I only have 16 GB RAM and read that the model itself requires 16 GB RAM, so the 16 GB RAM would not enough (since only 60-70% is actually available for things other than the OS) but I figured the lacking GB's would be compensated with virtual RAM/swap/pagefile.

If it's due to insufficient RAM, I'll go and upgrade it to 32 GB later this month. But I'd like confirmation for that first before I spend the money! :)

Thanks!

Getting error on make -j4 gpt-2

Consolidate compiler generated dependencies of target ggml
[ 33%] Built target ggml_utils
[ 50%] Building C object src/CMakeFiles/ggml.dir/ggml.c.o
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:99,
from /home/kumapra/MACHINE/ggml/src/ggml.c:153:
/home/kumapra/MACHINE/ggml/src/ggml.c: In function ‘ggml_vec_dot_f16’:
/usr/lib/gcc/x86_64-redhat-linux/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
_mm256_cvtph_ps (__m128i __A)
^~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:543:33: note: called from here
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((__m128i )(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:553:37: note: in expansion of macro ‘GGML_F32Cx8_LOAD’
#define GGML_F16_VEC_LOAD(p, i) GGML_F32Cx8_LOAD(p)
^~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:906:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
ay[j] = GGML_F16_VEC_LOAD(y + i + jGGML_F16_EPR, j);
^~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:99,
from /home/kumapra/MACHINE/ggml/src/ggml.c:153:
/usr/lib/gcc/x86_64-redhat-linux/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
_mm256_cvtph_ps (__m128i __A)
^~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:543:33: note: called from here
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((__m128i )(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:553:37: note: in expansion of macro ‘GGML_F32Cx8_LOAD’
#define GGML_F16_VEC_LOAD(p, i) GGML_F32Cx8_LOAD(p)
^~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:905:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
ax[j] = GGML_F16_VEC_LOAD(x + i + jGGML_F16_EPR, j);
^~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:99,
from /home/kumapra/MACHINE/ggml/src/ggml.c:153:
/usr/lib/gcc/x86_64-redhat-linux/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
_mm256_cvtph_ps (__m128i __A)
^~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:543:33: note: called from here
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((__m128i )(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:553:37: note: in expansion of macro ‘GGML_F32Cx8_LOAD’
#define GGML_F16_VEC_LOAD(p, i) GGML_F32Cx8_LOAD(p)
^~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:905:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
ax[j] = GGML_F16_VEC_LOAD(x + i + jGGML_F16_EPR, j);
^~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:99,
from /home/kumapra/MACHINE/ggml/src/ggml.c:153:
/usr/lib/gcc/x86_64-redhat-linux/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
_mm256_cvtph_ps (__m128i __A)
^~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:543:33: note: called from here
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((__m128i )(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:553:37: note: in expansion of macro ‘GGML_F32Cx8_LOAD’
#define GGML_F16_VEC_LOAD(p, i) GGML_F32Cx8_LOAD(p)
^~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:906:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
ay[j] = GGML_F16_VEC_LOAD(y + i + jGGML_F16_EPR, j);
^~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:99,
from /home/kumapra/MACHINE/ggml/src/ggml.c:153:
/usr/lib/gcc/x86_64-redhat-linux/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
_mm256_cvtph_ps (__m128i __A)
^~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:543:33: note: called from here
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((__m128i )(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:553:37: note: in expansion of macro ‘GGML_F32Cx8_LOAD’
#define GGML_F16_VEC_LOAD(p, i) GGML_F32Cx8_LOAD(p)
^~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:905:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
ax[j] = GGML_F16_VEC_LOAD(x + i + jGGML_F16_EPR, j);
^~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-redhat-linux/8/include/immintrin.h:99,
from /home/kumapra/MACHINE/ggml/src/ggml.c:153:
/usr/lib/gcc/x86_64-redhat-linux/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
_mm256_cvtph_ps (__m128i __A)
^~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:543:33: note: called from here
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((__m128i )(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:553:37: note: in expansion of macro ‘GGML_F32Cx8_LOAD’
#define GGML_F16_VEC_LOAD(p, i) GGML_F32Cx8_LOAD(p)
^~~~~~~~~~~~~~~~
/home/kumapra/MACHINE/ggml/src/ggml.c:906:21: note: in expansion of macro ‘GGML_F16_VEC_LOAD’
ay[j] = GGML_F16_VEC_LOAD(y + i + jGGML_F16_EPR, j);
^~~~~~~~~~~~~~~~~
make[3]: *** [src/CMakeFiles/ggml.dir/build.make:76: src/CMakeFiles/ggml.dir/ggml.c.o] Error 1
make[2]: *** [CMakeFiles/Makefile2:205: src/CMakeFiles/ggml.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:499: examples/gpt-2/CMakeFiles/gpt-2.dir/rule] Error 2
make: *** [Makefile:322: gpt-2] Error 2

Quantizing GPT-J Produces Nonsense

Hey thanks for the great package! When I try to quantize an fp16 ggml file of GPT-J, the outputs from chat are nonsense. Also the outputs of the gpt-j-quantize bin seem to be off as I'd expect the hist to have non-zero values (as seen in other examples like llama.cpp quantize)

 0.000 0.000 
                     transformer.h.0.ln_1.weight - [ 4096,     1], type =    f32 size =    0.016 MB
                       transformer.h.0.ln_1.bias - [ 4096,     1], type =    f32 size =    0.016 MB
              transformer.h.0.attn.k_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
              transformer.h.0.attn.v_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
              transformer.h.0.attn.q_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
            transformer.h.0.attn.out_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
                transformer.h.0.mlp.fc_in.weight - [ 4096, 16384], type =    f16 quantizing .. size =   256.00 MB ->    40.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
                  transformer.h.0.mlp.fc_in.bias - [16384,     1], type =    f32 size =    0.062 MB
               transformer.h.0.mlp.fc_out.weight - [16384,  4096], type =    f16 quantizing .. size =   256.00 MB ->    40.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
                 transformer.h.0.mlp.fc_out.bias - [ 4096,     1], type =    f32 size =    0.016 MB
                     transformer.h.1.ln_1.weight - [ 4096,     1], type =    f32 size =    0.016

Is quantizing from fp16 not possible for GPT-J?

Instruct GPT-J

Someone fine-tuned GPT-J on the Alpaca instruction dataset using PETF:

peft_model_id = "crumb/Instruct-GPT-J"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto', revision='sharded')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

# This example is in the alpaca training set
batch = tokenizer("Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: How can we reduce air pollution? ### Response:", return_tensors='pt')

Recently I have successfully tried GPT-J model itself on GGML, using converted binary provided, so I suppose InstructGPT-J it should work off the shelf converting the checkpoint and then doing quantization.

Model adapter is here

Suggestion: Implement Text Completion and Compression using GPT-2

Fabrice Bellard's https://bellard.org/libnc/gpt2tc.html gpt2tc is an excellent opensource tool to compress text to very tiny sizes with GPT-2 however it uses https://bellard.org/libnc/ https://bellard.org/libnc/libnc.html which is distributed only as a binary blob / lib. Maybe it could be possible to re-implement gpt2tc with ggml. Would be awesome.

gpt2c performs excellently on very small texts and produces the smallest file sizes.

Another compressor that is based on libnc is nncp and by the same author, it currently holds the world record in text compression:
http://www.mattmahoney.net/dc/text.html

nncp

[nncp](https://bellard.org/nncp/) is a free, experimental file compressor by Fabrice Bellard, released May 8, 2019. It uses a neural network model with dictionary preprocessing described in the paper [Lossless Data Compression with Neural Networks](https://bellard.org/nncp/nncp.pdf). Compression of enwik9 uses the options:
./preprocess c out.words enwik9 out.pre 16384 512
./nncp -n_layer 7 -hidden_size 384 -n_embed_out 5 -n_symb 16388 -full_connect 1 -lr 6e-3 c out.pre out.bin
Version 2019-11-16 was released Nov. 16, 2019. It was run in 8 threads.

Version 2 was released Jan. 3, 2021. It uses a [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) architecture, a recurrent neural network with attention mechanism to allow parallelism. The algorithm is described briefly [here](https://bellard.org/nncp/nncp_v2.pdf). It uses the same dictionary preprocessing as earlier versions. It was tested with an [Intel Xeon E3-1230 v6](https://ark.intel.com/content/www/us/en/ark/products/97474/intel-xeon-processor-e3-1230-v6-8m-cache-3-50-ghz.html) at 3.5 GHz and a [Geforce RTX 3090 GPU](https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3090/) with 10,496 Cuda cores and 24 GB RAM.

nncp v2.1 was released Feb. 6, 2021. It is the same code as v2 except for a larger model and slightly different hyperparameters.

nncp v3 was released Apr. 24, 2021. This new version is coded in C and supports recent NVIDIA GPUs. It is much faster (3x) due to algorithmic improvements and requires less memory. The Transformer model is similar (199M parameters) but the hyperparameters have been tuned.

nncp v3.1 was released June 1, 2021.

            Compression     Compressed size      Decompresser  Total size   Time (ns/byte)
Program       Options      enwik8      enwik9     size (zip)   enwik9+prog  Comp   Decomp   Mem  Alg Notes
-------       -------    ----------  -----------  -----------  -----------  ------ ------  ----- --- -----
nncp 2019-05-08          16,791,077  125,623,896    161,133 xd 125,785,029  420168 602409   2040 LSTM 84
nncp 2019-11-16          16,292,774  119,167,224    238,452 xd 119,405,676  826048 1156467  5360 LSTM 84
nncp v2                  15,600,675  114,317,255     99,671 xd 114,317,255  308645 313468  17000 Transformer 88     
nncp v2.1                15,020,691  112,219,309    100,046 xd 112,319,355  508332 515401  23000 Transformer 88
nncp v3                  15,206,966  110,034,293    197,491 xd 110,231,784  161812 158982   6000 Transformer 88
nncp v3.1                14,969,569  108,378,032    201,620 xd 108,579,652  212766 210970   6000 Transformer 88

[gpt-J] swap space support

is it possible to have swap space support? ( I heard about ' Handling big models for inference' and was wondering if ggml can support a similar feature or store part of the large model in swap.)

Sorry, this is just a request from a novice programmer :(

I've been spending a lot of time on text to speech, do you mind if I want you to make a text to speech example using ggml? using cpp/c

Sorry I troubled you

GPT Benchmarks

GPT models without KV cache have to recalculate values and thus time to compute grows exponentially given a longer input.

Thus, for your benchmarks, how many tokens were generated, and with how many total? Does this support a caching system?

Int 8 / FP 8 quantization support similar to bnb

Hi there @ggerganov and great work. The performance on CPU is just amazing.
Would it be possible in the future to also implement Int 8 / FP 8 loading of models (a few layers still must be loaded with their original fp16 or fp32 weights) similar to bitsandbytes library: https://github.com/TimDettmers/bitsandbytes
This would allow loading of bigger models on some systems with limited amount of cpu ram. Or perhaps even faster inference for models like GPT-J.
In theory on a mac system or x64 (AVX2 or AVX512) system with 128GB cpu ram you would be able load a 120B model this way... Wouldnt that be amazing :)))

[Feature Request] rinna's Japanese GPT model support

Thanks @ggerganov for your sharing.

I want to use GPT on my local pc.
rinna Co.,Ltd is Japanese AI company. rinna provides some GPT models on huggingface.
https://huggingface.co/rinna/japanese-gpt-1b
https://huggingface.co/rinna/japanese-gpt2-xsmall
https://huggingface.co/rinna/japanese-gpt2-small
https://huggingface.co/rinna/japanese-gpt2-medium
I'd like to use these models with ggml gpt.
But, I can't convert these models for ggml.
I think these models are pytorch bin model and tensorflow H5 weight.

Could you convert and support these models ?

T0pp GGML port?

It seems to be a tiny and somewhat capable model, making it perfect to run locally. A GGML port may even allow mid-range phones to run it.

Related to #12.

[Question] DeOldify compatibility

I have not done much research on jantic/DeOldify apart that it uses pytorch, but would ggml theoretically support the model?

[Feature Request] Support for GALACTICA & EleutherAI Neo & Neo X models

Support for the GALACTICA & EleutherAI Neo, Neo X models would be an awesome addition.

GALACTICA seems like a ChatGPT for scientific stuff.

Info:
https://galactica.org/
https://huggingface.co/facebook/galactica-120b
https://the-decoder.com/galactica-is-an-open-source-language-model-for-scientific-progress/

https://huggingface.co/EleutherAI/gpt-neo-125M
https://huggingface.co/EleutherAI/gpt-neo-1.3B
https://huggingface.co/EleutherAI/gpt-neo-2.7B
https://huggingface.co/EleutherAI/gpt-neox-20b

gpt2 and gpt-j load error: tensor has wrong shape in model file

Hello after latest update I get an error when running the gpt-2 model:

 % ./bin/gpt-2 -m models/gpt-2-1558M/ggml-model.bin -p "Donald Trump is"
main: seed = 1680541533
gpt2_model_load: loading model from 'models/gpt-2-1558M/ggml-model.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 1600
gpt2_model_load: n_head  = 25
gpt2_model_load: n_layer = 48
gpt2_model_load: f16     = 1
gpt2_model_load: ggml ctx size = 3729.46 MB
gpt2_model_load: memory size =   600.00 MB, n_mem = 49152
gpt2_model_load: tensor 'model/h0/attn/c_attn/w' has wrong shape in model file: got [1600, 4800], expected [4800, 1600]
main: failed to load model from 'models/gpt-2-1558M/ggml-model.bin'

and even the gpt-j model:

% ./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "James Dean was"
main: seed = 1680541576
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 1
gptj_model_load: ggml ctx size = 13334.86 MB
gptj_model_load: memory_size =  1792.00 MB, n_mem = 57344
gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]
main: failed to load model from 'models/gpt-j-6B/ggml-model.bin'
gptj_model_load: %

slow inference with very long input embeddings

I've been able to run a large multimodal model with your library but this means I have easily very large tensors as input (148x4096 for example).
this led me to find out that the inference time is proportional to the input length but why is it?
I think that in the ideal case, the bottleneck here should definitely be the memory bandwidth as for every tensor multiplication the cpu needs to retrieve a large chunk of weights data from memory and then multiply it very fast in the upper cache levels and all of this shouldn't exceed the L3 cache size for efficiency but I'm no expert in tensor multiplication.
for example, in pytorch the inference time is only slightly slower with larger chunks of data but it doesn't scale linearly.
is there something that can be done to improve cache locality for very long input embeddings?

Embeddings models?

Is it possible to use GGML for faster and more portable calculating of sentence embeddings? That might make for a useful offline text search tool.

Pytorch model converting

is converting from PyTorch possible? similar to TensorFlow conversion.

[Feature Request] Salesforce Codegen model

Hi, could you give us a hint on how to convert the Salesforce codegen model to ggml? The models are here https://github.com/salesforce/CodeGen and they allow Github copilot style code complete locally.
Thank you!

[Feature request] Implement 8-bit GPT-J

Results in ~11Gb weights vs. 16Gb, implemented in PyTorch now as load_in_8bit=True:

https://huggingface.co/hivemind/gpt-j-6B-8bit

test cases finished with exit code 8

I test some cases in the folder "tests", but some cases can't be run correctly, such as "test-grad0.c".
The exit code is 8. When debugging, i found div by zero.

[Documentation request] How to add a new ISA target?

I got a NEC VectorEngine Co-processor which uses the "ve" ISA which operates on 16348 bit vectors. This target is not yet supported by this library however depending on the effort for porting it might be an interesting library (although it only supports 32 bit floats). The vector engine can either execute programs natively or it can be used for offloading.
How would one go about adding support for a new ISA in this library?

If the ISA docs motivates people familiar with the library to port it to target more then the idea of documenting how to add a new i can give access to hardware running that ISA.

If i read the code correctly instead of "GGML_F32Cx8_ADD" the implementation would use "GGML_F32Cx512_ADD" which is as fun as it sounds. ve provids a VL register which can be used to change the vector length at run time which can be useful for the last loop iteration. The FP16<->FP32 conversion is not supported by the VectorEngine architecture.

Built-in support for pytorch models?

Hey there, I'm wondering if there's a way to directly load weights from a pytorch model without converting them first. I'd like to try running inference with it, I have some finetuned GPT-J models in pytorch fp16, but having to covert to this format makes it difficult.

Build on Windows

the examples (using make) are for Mac, I think. How would I go about building this on Windows?

[Feature request] Implement GPT-JT

e.g. https://www.together.xyz/blog/releasing-v1-of-gpt-jt-powered-by-open-source-ai

Illegal instruction (core dumped)

Hello!
Thanks for reading. Please see below: Illegal instruction (core dumped)
I upgraded to 32GB RAM on an i7 3.4 Ghz CPU, Ubuntu 20.04. 374G unused harddrive space.

Here is the output from terminal:

#:~/ggml-master/build$ ../examples/gpt-j/download-ggml-model.sh 6B
Downloading ggml model 6B ...
models/gpt-j-6B/gg 100%[==============>]  11.27G  7.56MB/s    in 24m 53s 
Done! Model '6B' saved in 'models/gpt-j-6B/ggml-model.bin'
You can now use it like this:

$ ./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"

$ ~/ggml-master/build$ ./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"
main: seed = 1677098741
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 1
gptj_model_load: ggml ctx size = 13334.86 MB
Illegal instruction (core dumped)

Thank you again!

Suggestion: Interactive Demo (second example code)

Is it possible to modify the C++ to create a second example source code file, that loads the model once, before sending new prompts read in a loop from STDIN?

After backing up the original C++ source code file, I modified the code to read a prompt from STDIN in a loop, instead of argv. There were no errors going through the loop, generating responses, except I seem to be processing new responses from the first prompt read from STDIN, over again, instead processing the new subsequent prompts read from STDIN.

The funny part, is these unintended results may be useful for prompt engineering in the future, to keep the context. But first, the goal would be to try save time by avoiding reloads of the model for generating responses to each new prompt in a loop. Lastly, this is a suggestion for a second separate example source code file. The first example code source code file is correct and very useful.

Failing to load Cerebras-gpt after 4-bit quantization.

I simply followed all of the instructions provided, and I used the 2.7B model from huggingface. When trying to run it like this:
/build/bin/gpt-2 -m models/Cerebras-GPT-2.7B/ggml-model-q4_1.bin -p "Once upon a time"
It returns this error:
gpt2_model_load: tensor 'model/lm_head' has wrong size in model file: got 96493440, expected 257315840
I am on an i7 4790 with tons of ram. The model works fine before quantization.

How to fine tune it?

I am a noob. Can you describe how I can fine-tune it with your program? It is possible? Maybe some articles.

Readme: add comparisons to similar frameworks like Flashlight

There are a few other C++ machine learning frameworks:

When stumbling upon such a library like ggml, my first thought is always, how does it compare to other similar libraries. I would suggest that you can maybe add such a comparison section in your readme.

Suggestion: Stable Diffusion Model

Been following along with your speed increases on Whisper using ggml, which have been amazing

Would be interesting to see how stable diffusion runs on CPUs using ggml

Here are current benchmarks:

Popular way to run SD:
https://github.com/huggingface/diffusers

Cerebras-GPT whitespace □ tokens emitted

I have downloaded GPT-1.3B from Cerebras here https://huggingface.co/cerebras/Cerebras-GPT-1.3B
After converting the model weights to GGML with

python3 ./examples/gpt-2/convert-cerebras-to-ggml.py /Users/loretoparisi/Downloads/Cerebras-GPT/pytorch_model.bin

inference worked just fine. I therefore did quantization:

./bin/gpt-2-quantize models/Cerebras-GPT/1.3B/ggml-model-f16.bin models/Cerebras-GPT/1.3B/ggml-model-q4_1.bin 3

but when running inference the model outputs "blank" tokens apparently (represented by char □ in the example below for better understanding - actually being a simple whitespace):

% ./bin/gpt-2 -m models/Cerebras-GPT/1.3B/ggml-model-q4_1.bin -p "Java is a programming "
main: seed = 1680545329
gpt2_model_load: loading model from 'models/Cerebras-GPT/1.3B/ggml-model-q4_1.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: f16     = 3
gpt2_model_load: ggml ctx size = 1797.76 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1029.69 MB
main: prompt: 'Java is a programming '
main: number of tokens in prompt = 5, first 8 tokens: 29584 318 257 8300 220 

Java is a programming□□□□□□□□□□□□□□□□□□□□□□□□□                                                                                                                                                                                                   
□□□□□□□□

main: mem per token =  9581308 bytes
main:     load time =   373.08 ms
main:   sample time =    27.11 ms
main:  predict time =  7583.80 ms / 37.18 ms per token
main:    total time =  8110.62 ms

This behaviour does not happen all times, in fact in a consecutive inference:

% ./bin/gpt-2 -m models/Cerebras-GPT/1.3B/ggml-model-q4_1.bin -p "Java is a programming " -n 100                 
main: seed = 1680545760
gpt2_model_load: loading model from 'models/Cerebras-GPT/1.3B/ggml-model-q4_1.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: f16     = 3
gpt2_model_load: ggml ctx size = 1797.76 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1029.69 MB
main: prompt: 'Java is a programming '
main: number of tokens in prompt = 5, first 8 tokens: 29584 318 257 8300 220 

Java is a programming 
language that can be used in computer graphics, computer games, and many other purposes including computer programming, graphics systems, video games, computer programming, video games, video game systems, computer programming systems, video game systems, computer simulation systems, video game systems, computer programming systems, computer programming systems, computer game systems, computer game systems, computer software systems, computer software systems, computer software systems, computer software systems, computer programming systems, computer software systems, computer software systems, computer game systems

main: mem per token =  9581308 bytes
main:     load time =   377.92 ms
main:   sample time =    14.40 ms
main:  predict time =  3556.80 ms / 34.20 ms per token
main:    total time =  4081.12 ms

ggml_init: Illegal instruction at address 0x<address>

When build tests/test0.c program, it's crashed when run on Windows compile with zig cc. Because in ggml_time_us() timer_freq are not inialized. I fixed by adding ggml_time_init().
Src: https://github.com/maihd/fun-with-zig/blob/main/ggml_bindings/src/main.zig

Add support for bloomz model

I t would be great if you add support for bloomz model because the model use is not restricted like llama and also it has been trained on 43 languages

Cerberas 2.7B yields garbage tokens after quantizing to 4bits

I'm getting garbage-looking tokens (&>,32>G$F7"=%0.173)@++*$16*:=!32%;:2@$5")0!!DGDA(:F*G$!")=9&9D69C9H-4.>&<A+1>.;6D7^C) after quantizing an f16 Cerberas model like this:

../../build/bin/gpt-2-quantize ./cerebras-gpt2.7b-alpaca-sp/ggml-model-f16.bin ./cerebras-gpt2.7b-alpaca-sp/ggml-model-int4.bin 2

Example:

(llama-lora) lxe@lxepc:~/ggml/examples/gpt-2$ ../../build/bin/gpt-2 -m cerebras-gpt2.7b-alpaca-sp/ggml-model-int4.bin -p "Human: How old is the Sun?\nAssistant:"
main: seed = 1680242724
gpt2_model_load: loading model from 'cerebras-gpt2.7b-alpaca-sp/ggml-model-int4.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2560
gpt2_model_load: n_head  = 32
gpt2_model_load: n_layer = 32
gpt2_model_load: f16     = 2
gpt2_model_load: ggml ctx size = 2957.55 MB
gpt2_model_load: memory size =  1280.00 MB, n_mem = 65536
gpt2_model_load: model size  =  1677.45 MB
main: prompt: 'Human: How old is the Sun?\nAssistant:'
main: number of tokens in prompt = 15, first 8 tokens: 20490 25 1374 1468 318 262 3825 30

Human: How old is the Sun?\nAssistant:&>,32>G$F7"=%0.173)@++*$16*:=!32%;:2@$5")0!!DGDA(:F*G$!")=9&9D69C9H-4.>&<A+1>.;6D7^C

The f16 model loads and works fine.