atome-fe / llama-node

Believe in AI democratization. llama-node is a Node.js library for LLaMA backed by llama-rs, llama.cpp and rwkv.cpp; it works locally on your laptop CPU and supports llama/alpaca/gpt4all/vicuna/rwkv models.

Home Page: https://llama-node.vercel.app/

License: Apache License 2.0

JavaScript 20.66% Rust 45.49% TypeScript 30.97% Python 1.43% C 0.01% CSS 0.61% Makefile 0.12% HTML 0.26% MDX 0.45%
ai gpt large-language-models llama llama-rs llm napi napi-rs nodejs llama-node embeddings llamacpp langchain rwkv

llama-node's Introduction

LLaMA Node

llama-node: Node.js Library for Large Language Model


Picture generated by Stable Diffusion.



Introduction

This project is in an early stage and is not production ready; we do not follow semantic versioning. The Node.js API may change in the future, so use it with caution.

This is a Node.js library for inferencing LLaMA, RWKV, or LLaMA-derived models. It is built on top of llm (originally llama-rs), llama.cpp, and rwkv.cpp. It uses napi-rs to pass messages between the Node.js and LLaMA threads.

Supported models

llama.cpp backend supported models (in GGML format):

llm (llama-rs) backend supported models (in GGML format):

  • GPT-2
  • GPT-J
  • LLaMA: LLaMA, Alpaca, Vicuna, Koala, GPT4All v1, GPT4-X, Wizard
  • GPT-NeoX: GPT-NeoX, StableLM, RedPajama, Dolly v2
  • BLOOM: BLOOMZ

rwkv.cpp backend supported models (in GGML format):

Supported platforms

  • darwin-x64
  • darwin-arm64
  • linux-x64-gnu (glibc >= 2.31)
  • linux-x64-musl
  • win32-x64-msvc

Node.js version: >= 16


Installation

  • Install the llama-node npm package:
    npm install llama-node
  • Install at least one of the inference backends (a minimal usage sketch follows this list):
    • llama.cpp: npm install @llama-node/llama-cpp
    • llm: npm install @llama-node/core
    • rwkv.cpp: npm install @llama-node/rwkv-cpp
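
A minimal usage sketch with the llama.cpp backend, pieced together from the examples further down this page (the model path and parameter values are placeholders, not canonical defaults):

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

// Path to a GGML model file on disk (placeholder).
const model = path.resolve(process.cwd(), "./ggml-model-q4_0.bin");
const llama = new LLM(LLamaCpp);

const run = async () => {
    // Load the model; the fields mirror llama.cpp's load options.
    await llama.load({
        path: model,
        enableLogging: true,
        nCtx: 1024,
        nParts: -1,
        seed: 0,
        f16Kv: false,
        logitsAll: false,
        vocabOnly: false,
        useMlock: false,
        embedding: false,
        useMmap: true,
        nGpuLayers: 0,
    });

    // Stream tokens from a completion via the callback.
    await llama.createCompletion(
        {
            nThreads: 4,
            nTokPredict: 256,
            topK: 40,
            topP: 0.1,
            temp: 0.2,
            repeatPenalty: 1,
            prompt: "How are you?",
        },
        (response) => process.stdout.write(response.token)
    );
};

run();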

Manual compilation

Please see the contribution guide for how to get started with manual compilation.


CUDA support

Please read the documentation on our site to get started with manual compilation for CUDA support.


Acknowledgments

This library is published under the MIT/Apache-2.0 license. However, we strongly recommend that you cite our work and our dependencies' work if you wish to reuse code from this library.

Models/Inferencing tools dependencies

Some source code comes from


Community

Join our Discord community now! Click to join llama-node Discord

llama-node's People

Contributors

dinex-dev, fardjad, hlhr202, tommoffat, triestpa, yorkzero831


llama-node's Issues

Segmentation fault local cuda build

I have successfully built llama-cpp.linux-x64-gnu.node with CUDA enabled.

When I try to load it, there is a segmentation fault.

I have installed segfault-handler and I get the following log:

PID 130046 received SIGSEGV for address: 0x1
.../tmp/test-llama-server/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3236)[0x7f82483dc236]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7f8248083140]
/lib/x86_64-linux-gnu/libc.so.6(+0x15d319)[0x7f8247ff9319]
<home>/.llama-node/libllama.so(llama_init_from_file+0x522)[0x7f82442c2e72]
.../tmp/test-llama-server/llama-node/packages/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node(+0x45a0a)[0x7f8244402a0a]
.../tmp/test-llama-server/llama-node/packages/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node(+0x35cbb)[0x7f82443f2cbb]
.../tmp/test-llama-server/llama-node/packages/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node(+0x9d550)[0x7f824445a550]
.../tmp/test-llama-server/llama-node/packages/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node(+0x97b01)[0x7f8244454b01]
.../tmp/test-llama-server/llama-node/packages/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node(+0x963c6)[0x7f82444533c6]
.../tmp/test-llama-server/llama-node/packages/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node(+0xa33a5)[0x7f82444603a5]
.../tmp/test-llama-server/llama-node/packages/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node(+0x8ea85)[0x7f824444ba85]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7)[0x7f8248077ea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f8247f97a2f]

My index.js is

import { LLM } from "./llama-node/dist/index.js";
import { LLamaCpp } from "./llama-node/dist/llm/llama-cpp.js";

const llama = new LLM(LLamaCpp);

await llama.load({
    path: `./models/wizard-mega-13B.ggml.q5_0.bin`,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
    nGpuLayers: 40,
});

Any idea what I can do to find more details?

use with Next.js (webpack fails to load binary file)

I'm trying to use it in a Next.js API action, but I'm getting this error:

- error ./node_modules/@llama-node/llama-cpp/@llama-node/llama-cpp.darwin-arm64.node
Module parse failed: Unexpected character '�' (1:0)
You may need an appropriate loader to handle this file type, currently no loaders are configured to process this file. See https://webpack.js.org/concepts#loaders
(Source code omitted for this binary file)

Import trace for requested module:
./node_modules/@llama-node/llama-cpp/@llama-node/llama-cpp.darwin-arm64.node
./node_modules/@llama-node/llama-cpp/index.js
./node_modules/llama-node/dist/llm/llama-cpp.js

Do I need to add some binary loader rule to webpack config, or somehow skip its loading?
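
One possible workaround (an assumption based on general Next.js/webpack behaviour, not confirmed in this thread) is to keep webpack from bundling the native addon and let Node.js require it at runtime instead, e.g. via externals in next.config.js:

// next.config.js (sketch)
module.exports = {
  webpack: (config, { isServer }) => {
    if (isServer) {
      // Treat the native packages as CommonJS externals so webpack does not
      // try to parse the binary .node file itself.
      config.externals.push({
        "@llama-node/llama-cpp": "commonjs @llama-node/llama-cpp",
        "llama-node": "commonjs llama-node",
      });
    }
    return config;
  },
};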

[Error: Could not load model] { code: 'GenericFailure' }

Getting this error [Error: Could not load model] { code: 'GenericFailure' } when trying to load a model:

$ node ./bin/llm/llm.js --model ~/models/gpt4-alpaca-lora-30B.ggml.q5_1.bin
[Error: Could not load model] { code: 'GenericFailure' }

I've modified the example a bit to take a --model argument:

import minimist from 'minimist';
import { LLM } from "llama-node";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
import path from "path";
const args = minimist(process.argv.slice(2));
const modelPath = args.model;
const model = path.resolve(modelPath);
const llama = new LLM(LLamaRS);
const template = `how are you`;
const prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

${template}

### Response:`;

const params = {
    prompt,
    numPredict: 128,
    temp: 0.2,
    topP: 1,
    topK: 40,
    repeatPenalty: 1,
    repeatLastN: 64,
    seed: 0,
    feedPrompt: true,
};
const run = async () => {
    try {
        await llama.load({ path: model });
        await llama.createCompletion(params, (response) => {
            process.stdout.write(response.token);
        });
    } catch (err) {
        console.error(err);
    }
};

run();

Issue to build GPU version

I am trying to build a GPU version of llama-node/llama-cpp but I'm running into an issue with the napi package. The package does not get installed automatically, and when I try to install it I get an error that the supported platform for napi is Node 0.4. Am I missing something? Has anyone built this successfully for an Nvidia GPU?

[llama-cpp]$ pnpm build:cuda

Checking environment...

Checking rustc...✅
Checking cargo...✅
Checking cmake...✅
Checking llvm...✅
Checking clang.../bin/sh: clang: command not found
Checking gcc...✅
Checking nvcc...✅

Compiling...

/bin/sh: napi: command not found
[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "127".] {
code: 'ERR_UNHANDLED_REJECTION'
}

Napi Error:

npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: '[email protected]',
npm WARN EBADENGINE required: { node: '~0.4' },
npm WARN EBADENGINE current: { node: 'v16.14.0', npm: '8.3.1' }
npm WARN EBADENGINE }

foreign exception error

llama.cpp: loading model from /home/ubuntu/models/ggml-gpt4all-j-v1.3-groovy.bin
fatal runtime error: Rust cannot catch foreign exceptions
Aborted
2023-05-18T11:49:39.270396Z  INFO surrealdb::net: SIGTERM received. Start graceful shutdown...
2023-05-18T11:49:39.270450Z  INFO surrealdb::net: Shutdown complete. Bye!    

==> /var/log/descriptive-web.log <==

llama-node/llama-cpp uses more memory than standalone llama.cpp with the same parameters

I'm trying to process a large text file. For the sake of reproducibility, let's use this. The following code:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "node:path";
import fs from "node:fs";

const model = path.resolve(
    process.cwd(),
    "/path/to/model.bin"
);
const llama = new LLM(LLamaCpp);
const prompt = fs.readFileSync("./path/to/file.txt", "utf-8");

await llama.load({
    enableLogging: true,
    modelPath: model,

    nCtx: 4096,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: false,
    nGpuLayers: 0,
});

await llama.createCompletion(
    {
        nThreads: 8,
        nTokPredict: 256,
        topK: 40,
        prompt,
    },
    (response) => {
        process.stdout.write(response.token);
    }
);

Crashes the process with a segfault error:

ggml_new_tensor_impl: not enough space in the scratch memory
segmentation fault  node index.mjs

When I compile the exact same version of llama.cpp and run it with the following args:

./main -m /path/to/ggml-vic7b-q5_1.bin -t 8 -c 4096 -n 256 -f ./big-input.txt

It runs perfectly fine (of course with a warning that the context is larger than what the model supports but it doesn't crash with a segfault).

Comparing the logs:

llama-node Logs
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 2052.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 4096.00 MB
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::context] - AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::llama] - tokenized_stop_prompt: None
ggml_new_tensor_impl: not enough space in the scratch memory
llama.cpp Logs
main: warning: model does not support context sizes greater than 2048 tokens (4096 specified);expect poor results
main: build = 561 (5ea4339)
main: seed  = 1685284790
llama.cpp: loading model from ../my-llmatic/models/ggml-vic7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 2048.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 256, n_keep = 0

Looks like the context size in llama-node is set to 4GBs and the kv self size is twice as large as what llama.cpp used.

I'm not sure if I'm missing something in my Load/Invocation config or if that's an issue in llama-node. Can you please have a look?

Bumping from 0.1.5 to 0.1.6 resulting with `Error: invariant broken`

Got an LLM running with GPT4All models (tried with ggml-gpt4all-j-v1.3-groovy.bin and ggml-gpt4all-l13b-snoozy.bin).

Version 0.1.5 - Works
Version 0.1.6 - Results with Error: invariant broken: 999255479 <= 2 in Some("{PATH_TO}/ggml-gpt4all-j-v1.3-groovy.bin")

Package versions:

"@llama-node/core": "0.1.6",
"@llama-node/llama-cpp": "0.1.6",
"llama-node": "0.1.6",
/* eslint-disable @typescript-eslint/no-unused-vars */
/* eslint-disable @typescript-eslint/no-var-requires */
import { ModelType } from '@llama-node/core';
import { LLM } from 'llama-node';
// @ts-expect-error
import { LLMRS } from 'llama-node/dist/llm/llm-rs.cjs';
import path from 'path';

const modelPath = path.join(
  __dirname,
  '..',
  'models',
  'ggml-gpt4all-j-v1.3-groovy.bin',
);
const llama = new LLM(LLMRS);

const toChatTemplate = (prompt: string) => `### Instruction:
${prompt}

### Response:`;

export const createCompletion = async (
  prompt: string,
  onData: (data: string) => void,
  onDone: () => void,
) => {
  const params = {
    prompt: toChatTemplate(prompt),
    numPredict: 128,
    temperature: 0.8,
    topP: 1,
    topK: 40,
    repeatPenalty: 1,
    repeatLastN: 64,
    seed: 0,
    feedPrompt: true,
  };
  await llama.load({ modelPath, modelType: ModelType.GptJ });
  await llama.createCompletion(params, (response) => {
    if (response.completed) {
      return onDone();
    } else {
      onData(response.token);
    }
  });
};

llama-node does not display prompt in Deno environment

I am currently trying to use the [email protected] library with Deno. However, when I execute my program, I am not getting the expected prompt result. Instead, the Rust backend is returning Ok(()). I suspect the issue might be a compatibility problem between [email protected] and the Deno runtime environment, or am I missing something?

Any help or guidance on how to resolve this issue would be greatly appreciated. Thank you in advance for your assistance.

  • llama-node: [email protected]
  • Runtime: deno 1.32.5
  • Model: ggml-model-q4_0.bin / 7B alpaca 4-bit quantized (ran flawlessly with llama-rs itself)

main.ts code:

import { resolve } from "https://deno.land/[email protected]/path/mod.ts";
import { LLama } from "npm:[email protected]";
import { LLamaRS } from "npm:[email protected]/dist/llm/llama-rs.js";

const llama = new LLama(LLamaRS);

llama.load({
  path: resolve(Deno.cwd(), "./models/7B/ggml-model-q4_0.bin"),
});

const template = `How are you?`;

const prompt = `### Human:

${template}

### Assistant:`;

llama.createCompletion(
  {
      prompt,
      numPredict: 128,
      temp: 0.2,
      topP: 1,
      topK: 40,
      repeatPenalty: 1,
      repeatLastN: 64,
      seed: 0,
      feedPrompt: true,
  },
  (response) => {
      console.log(response.token)
  }
);

console.log output:

DEBUG RS - deno_runtime::worker:418 - received module evaluate Ok(
    Ok(
        (),
    ),
)

getting error after second run

thread 'tokio-runtime-worker' panicked at 'Builder::init should not be called after logger initialized: SetLoggerError(())', /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/env_logger-0.10.0/src/lib.rs:816:14
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1750:5
   3: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
   4: tokio::runtime::task::raw::poll
   5: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   6: tokio::runtime::task::raw::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

dyld[50013]: missing symbol called


import { LLama } from "llama-node";
import { LLamaCpp, LoadConfig } from "llama-node/dist/llm/llama-cpp.js";
import path from "path"; // missing in the original snippet, required for path.resolve below


const model = path.resolve(process.cwd(), "./src/ggml-vicuna-13b-1.1-q4_0.bin");


const llama = new LLama(LLamaCpp);

const config: LoadConfig = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
};

llama.load(config);

const template = `How are you`;

const prompt = `### Human:

${template}

### Assistant:`;

llama.createCompletion(
    {
        nThreads: 4,
        nTokPredict: 2048,
        topK: 40,
        topP: 0.1,
        temp: 0.2,
        repeatPenalty: 1,
        stopSequence: "### Human",
        prompt,
    },
    (response) => {
        process.stdout.write(response.token);
    }
);

That is how my code was written. When I run it, I get:

dyld[50013]: missing symbol called
sh: line 1: 50013 Abort trap: 6 node dist/app.js

I am getting this error. What is the reason, and what is the solution?

[ERROR] cublas `TypeError: this.instance.inference is not a function`

Since I compiled with CUDA support, I first had to add nGpuLayers (which seems logical, as it's an option available in llama.cpp).

Then I get this error:

TypeError: this.instance.inference is not a function
    at file:///root/git/llama-selfbot/node_modules/llama-node/dist/llm/llama-cpp.js:54:23
    at new Promise (<anonymous>)
    at LLamaCpp.<anonymous> (file:///root/git/llama-selfbot/node_modules/llama-node/dist/llm/llama-cpp.js:53:14)
    at Generator.next (<anonymous>)
    at file:///root/git/llama-selfbot/node_modules/llama-node/dist/llm/llama-cpp.js:33:61
    at new Promise (<anonymous>)
    at __async (file:///root/git/llama-selfbot/node_modules/llama-node/dist/llm/llama-cpp.js:17:10)
    at LLamaCpp.createCompletion (file:///root/git/llama-selfbot/node_modules/llama-node/dist/llm/llama-cpp.js:50:12)
    at LLM.<anonymous> (/root/git/llama-selfbot/node_modules/llama-node/dist/index.cjs:56:23)
    at Generator.next (<anonymous>)

Did I miss something or do something wrong?

Get started guideline for contributors

This issue just notes down the getting-started guidelines for contributors; it will be closed after we move these to the documentation.

To set up the Node.js environment for contributors:

  1. prepare Node.js >= 16
  2. prepare pnpm: npm install -g pnpm
  3. install without scripts (to skip the preinstall binary build): pnpm install --ignore-scripts
  4. prepare the GitHub CLI: https://cli.github.com/
  5. download binary artifacts: pnpm run artifacts

For native addon development, install:

  1. rust
  2. cmake
  3. clang/gcc

Failed to load model

llama.cpp: loading model from C:/Users/bazha/work/ai/models/llama/ggml-vic13b-uncensored-q8_0.bin
error loading model: unrecognized tensor type 8

llama_init_from_file: failed to load model
[2023-04-28T12:32:46Z INFO  llama_node_cpp::context] AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
[2023-04-28T12:32:46Z INFO  llama_node_cpp::llama] tokenized_stop_prompt: None

Here is the code; I just copied it from the website:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = "C:/Users/bazha/work/ai/models/llama/ggml-vic13b-uncensored-q8_0.bin"
const llama = new LLM(LLamaCpp);
const config = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
};
llama.load(config);

const template = `How are you?`;
const prompt = `A chat between a user and an assistant. USER: ${template}; ASSISTANT:`;

llama.createCompletion({
    nThreads: 4,
    nTokPredict: 2048,
    topK: 40,
    topP: 0.1,
    temp: 0.2,
    repeatPenalty: 1,
    prompt,
}, (response) => {
    process.stdout.write(response.token);
});

node: v18.12.1
@llama-node/llama-cpp: ^0.0.31
llama-node: ^0.0.31

model i downloaded

Version 0.0.34 gives drastically different output when paired with Vicuna.

The same code, when used with 0.0.34 versus 0.0.26, gives very weird output when paired with the Vicuna 7B model.

0.0.26 works perfectly, so I'm happy to stay on it, but whatever 0.0.34 shipped has broken something.

See the example screenshots below.

Prompt: "Count from 1 to 10"

0.0.26:

Screenshot 2023-05-04 at 22 22 39

0.0.34:

Screenshot 2023-05-04 at 22 21 58

As you can see the model goes haywire, and this is the same across many different prompts.

Code only using 4 CPUs when I have 16 CPUs

This is the code that I am using

import {RetrievalQAChain} from 'langchain/chains';
import {HNSWLib} from "langchain/vectorstores";
import {RecursiveCharacterTextSplitter} from 'langchain/text_splitter';
import {LLamaEmbeddings} from "llama-node/dist/extensions/langchain.js";
import {LLM} from "llama-node";
import {LLamaCpp} from "llama-node/dist/llm/llama-cpp.js";
import * as fs from 'fs';
import * as path from 'path';

const txtFilename = "TrainData";
const txtPath = `./${txtFilename}.txt`;
const VECTOR_STORE_PATH = `${txtFilename}.index`;
const model = path.resolve(process.cwd(), './h2ogptq-oasst1-512-30B.ggml.q5_1.bin');
const llama = new LLM(LLamaCpp);
const config = {
path: model,
enableLogging: true,
nCtx: 1024,
nParts: -1,
seed: 0,
f16Kv: false,
logitsAll: false,
vocabOnly: false,
useMlock: false,
embedding: true,
useMmap: true,
};
var vectorStore;
const run = async () => {
await llama.load(config);
if (fs.existsSync(VECTOR_STORE_PATH)) {
console.log('Vector Exists..');
vectorStore = await HNSWLib.fromExistingIndex(VECTOR_STORE_PATH, new LLamaEmbeddings({maxConcurrency: 1}, llama));
} else {
console.log('Creating Documents');
const text = fs.readFileSync(txtPath, 'utf8');
const textSplitter = new RecursiveCharacterTextSplitter({chunkSize: 1000});
const docs = await textSplitter.createDocuments([text]);
console.log('Creating Vector');
vectorStore = await HNSWLib.fromDocuments(docs, new LLamaEmbeddings({maxConcurrency: 1}, llama));
await vectorStore.save(VECTOR_STORE_PATH);
}
console.log('Testing Vector via Similarity Search');
const resultOne = await vectorStore.similaritySearch("what is a template", 1);
console.log(resultOne);
console.log('Testing Vector via RetrievalQAChain');
const chain = RetrievalQAChain.fromLLM(llama, vectorStore.asRetriever());
const res = await chain.call({
query: "what is a template",
});
console.log({res});
};
run();

It is only using 4 CPUs at the time of "vectorStore = await HNSWLib.fromDocuments(docs, new LLamaEmbeddings({maxConcurrency: 1}, llama));"

Can we change anything for it to use more than 4 CPUs?

tsx: not found while running npm install

Hi there! Nice package.

I'm trying to install this in a brand new project, but I'm facing this issue:

npm ERR! code 127
npm ERR! path /MYPATHHERE/node_modules/@llama-node/core
npm ERR! command failed
npm ERR! command sh -c -- tsx scripts/build.ts
npm ERR! sh: 1: tsx: not found

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/jonit/.npm/_logs/2023-04-07T01_54_20_183Z-debug-0.log

Any thoughts?

Is it a typo?

Shouldn't it be tsc scripts/build.ts?

Error: Missing field `nGpuLayers`

Hello guys, I'm trying to run the mpt-7b model and I am getting this error. I appreciate any help; here are the details.

Node.js v19.5.0

node_modules\llama-node\dist\llm\llama-cpp.cjs:82
this.instance = yield import_llama_cpp.LLama.load(path, rest, enableLogging);
^

Error: Missing field `nGpuLayers`
    at LLamaCpp.<anonymous> (<path>\node_modules\llama-node\dist\llm\llama-cpp.cjs:82:52)
    at Generator.next (<anonymous>)
    at <path>\node_modules\llama-node\dist\llm\llama-cpp.cjs:50:61
    at new Promise (<anonymous>)
    at __async (<path>\node_modules\llama-node\dist\llm\llama-cpp.cjs:34:10)
    at LLamaCpp.load (<path>\node_modules\llama-node\dist\llm\llama-cpp.cjs:80:12)
    at LLM.load (<path>\node_modules\llama-node\dist\index.cjs:52:21)
    at run (file:///<path>/index.mjs:27:17)
    at file:///<path>/index.mjs:42:1
    at ModuleJob.run (node:internal/modules/esm/module_job:193:25) {
  code: 'InvalidArg'
}

Folder structure: (screenshot omitted)

index.mjs

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.cjs"
import path from "path"

const model = path.resolve(process.cwd(), "./model/ggml-mpt-7b-base.bin");
const llama = new LLM(LLamaCpp);
const config = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
};

const template = `How are you?`;
const prompt = `A chat between a user and an assistant.
USER: ${template}
ASSISTANT:`;

const run = async () => {
    await llama.load(config);

    await llama.createCompletion({
        nThreads: 4,
        nTokPredict: 2048,
        topK: 40,
        topP: 0.1,
        temp: 0.2,
        repeatPenalty: 1,
        prompt,
    }, (response) => {
        process.stdout.write(response.token);
    });
}

run();

thank you for your time
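
Based on the error message, one likely fix (not confirmed in this thread) is to add the nGpuLayers field to the load config, as the newer examples on this page do:

const config = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
    // Newer llama.cpp backends expect this field; 0 keeps everything on the CPU.
    nGpuLayers: 0,
};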

TypeError: llm._modelType is not a function

This is the code I am using

import {RetrievalQAChain} from 'langchain/chains';
import {HNSWLib} from "langchain/vectorstores";
import {RecursiveCharacterTextSplitter} from 'langchain/text_splitter';
import {LLamaEmbeddings} from "llama-node/dist/extensions/langchain.js";
import {LLM} from "llama-node";
import {LLamaCpp} from "llama-node/dist/llm/llama-cpp.js";
import * as fs from 'fs';
import * as path from 'path';

const txtFilename = "TrainData";
const txtPath = `./${txtFilename}.txt`;
const VECTOR_STORE_PATH = `${txtFilename}.index`;
const model = path.resolve(process.cwd(), './h2ogptq-oasst1-512-30B.ggml.q5_1.bin');
const llama = new LLM(LLamaCpp);
const config = {
path: model,
enableLogging: true,
nCtx: 1024,
nParts: -1,
seed: 0,
f16Kv: false,
logitsAll: false,
vocabOnly: false,
useMlock: false,
embedding: true,
useMmap: true,
};
var vectorStore;
const run = async () => {
await llama.load(config);
if (fs.existsSync(VECTOR_STORE_PATH)) {
console.log('Vector Exists..');
vectorStore = await HNSWLib.fromExistingIndex(VECTOR_STORE_PATH, new LLamaEmbeddings({maxConcurrency: 1}, llama));
} else {
console.log('Creating Documents');
const text = fs.readFileSync(txtPath, 'utf8');
const textSplitter = new RecursiveCharacterTextSplitter({chunkSize: 1000});
const docs = await textSplitter.createDocuments([text]);
console.log('Creating Vector');
vectorStore = await HNSWLib.fromDocuments(docs, new LLamaEmbeddings({maxConcurrency: 1}, llama));
await vectorStore.save(VECTOR_STORE_PATH);
}
console.log('Testing Vector via RetrievalQAChain');
const chain = RetrievalQAChain.fromLLM(llama, vectorStore.asRetriever());
const res = await chain.call({
query: "what is a template",
});
console.log({res});
};
run();

At this line "const chain = RetrievalQAChain.fromLLM(llama, vectorStore.asRetriever());" It is throwing this error

file:///root/project/node_modules/langchain/dist/chains/prompt_selector.js:34
return llm._modelType() === "base_chat_model";
^

TypeError: llm._modelType is not a function
at isChatModel (file:///root/project/node_modules/langchain/dist/chains/prompt_selector.js:34:16)
at ConditionalPromptSelector.getPrompt (file:///root/project/node_modules/langchain/dist/chains/prompt_selector.js:23:17)
at loadQAStuffChain (file:///root/project/node_modules/langchain/dist/chains/question_answering/load.js:20:41)
at RetrievalQAChain.fromLLM (file:///root/project/node_modules/langchain/dist/chains/retrieval_qa.js:69:25)
at run (file:///root/project/index.js:47:36)

How can we fix this issue?
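
The error suggests langchain expects an object implementing its own LLM interface rather than llama-node's LLM class. Below is a rough sketch of an adapter, assuming langchain's custom-LLM base class (LLM from "langchain/llms/base" with _call and _llmType) and reusing the completion callback shape shown elsewhere on this page; the class name and parameter handling are illustrative assumptions, not a confirmed fix:

import { LLM as LangchainLLM } from "langchain/llms/base";

// Hypothetical adapter: wraps an already-loaded llama-node instance so that
// langchain chains such as RetrievalQAChain can call it like a langchain LLM.
class LlamaNodeLLM extends LangchainLLM {
    constructor(llama, params, fields = {}) {
        super(fields);
        this.llama = llama;   // llama-node LLM instance, already load()-ed
        this.params = params; // backend-specific completion params
    }

    _llmType() {
        return "llama-node";
    }

    async _call(prompt) {
        let text = "";
        await this.llama.createCompletion({ ...this.params, prompt }, (response) => {
            if (!response.completed) text += response.token;
        });
        return text;
    }
}

// Usage sketch:
// const wrapped = new LlamaNodeLLM(llama, { nThreads: 4, nTokPredict: 256, temp: 0.2 });
// const chain = RetrievalQAChain.fromLLM(wrapped, vectorStore.asRetriever());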

"libstdc++.so.6: version `GLIBCXX_3.4.29' not found" error with docker

I've added llama-node to my project and it's working locally on my Windows machine, but when I try to run it in my Docker container I get this error:

Error: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /app/node_modules/@llama-node/llama-cpp/@llama-node/llama-cpp.linux-x64-gnu.node)
    at Object.Module._extensions..node (node:internal/modules/cjs/loader:1361:18)
    at Module.load (node:internal/modules/cjs/loader:1133:32)
    at Function.Module._load (node:internal/modules/cjs/loader:972:12)
    at Module.require (node:internal/modules/cjs/loader:1157:19)
    at require (node:internal/modules/helpers:119:18)
    at Object.<anonymous> (/app/node_modules/@llama-node/llama-cpp/index.js:188:31)
    at Module._compile (node:internal/modules/cjs/loader:1275:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1329:10)
    at Module.load (node:internal/modules/cjs/loader:1133:32)
    at Function.Module._load (node:internal/modules/cjs/loader:972:12)

I've tried a ton of things: different base images, updating libstdc++6, etc.

How to use the GPU?

Hi, I am looking for a way for this library to trigger the inference generation using the GPU.
Is this supported?

no kernel image is available for execution on the device

~/llama-node/packages/llama-cpp$ node example/mycode.ts
llama.cpp: loading model from /llama-node/packages/llama-cpp/ggml-vic7b-uncensored-q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 72.75 KB
llama_model_load_internal: mem required = 6612.59 MB (+ 2052.00 MB per state)
llama_model_load_internal: [cublas] offloading 8 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 1158 MB
llama_init_from_file: kv self size = 1024.00 MB
[Fri, 26 May 2023 09:45:06 +0000 - INFO - llama_node_cpp::context] - AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
[Fri, 26 May 2023 09:45:06 +0000 - INFO - llama_node_cpp::llama] - tokenized_stop_prompt: None
CUDA error 209 at /llama-node/packages/llama-cpp/llama-sys/llama.cpp/ggml-cuda.cu:693: no kernel image is available for execution on the device

I have a Tesla K80 card running on Ubuntu. Any advice on what to do and where to look would be appreciated.

Llama.cpp Typescript: Cannot find name 'LoadModel'

There is an error when compiling TypeScript code using llama.cpp:

$ tsc -p .
node_modules/@llama-node/llama-cpp/index.d.ts:137:31 - error TS2304: Cannot find name 'LoadModel'.

137   static load(params: Partial<LoadModel>, enableLogger: boolean): Promise<LLama>

I found that changing static load(params: Partial<LoadModel> to static load(params: Partial<ModelLoad> on line 137 of llama-cpp/index.d.ts works. It is probably a typo.
I found another reference to LoadModel in llama-cpp/src/lib.rs, but unfortunately, as I don't know Rust, I cannot suggest a valid fix in a PR.

Error: Failed to convert napi value Function into rust type `f64`

Getting this error: Error: Failed to convert napi value Function into rust type `f64`

import { LLM } from "llama-node";
//import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";
import os from 'os';

export default class AI {
    constructor(chat, msg) {
        this.chat = chat;
        this.msg = msg;
        this.model = path.resolve(os.homedir(), 'models', path.basename(this.chat.model));
        this.llama = new LLM(LLamaCpp);
        this.template = this.msg.body;
        this.prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

${this.template}

### Response:`;


        this.cppConfig = {
            enableLogging: true,
            nCtx: 1024,
            nParts: -1,
            seed: 0,
            f16Kv: false,
            logitsAll: false,
            vocabOnly: false,
            useMlock: false,
            embedding: false,
            useMmap: true,
        };

        this.cppParams = {
            prompt: this.prompt,
            nThreads: 4,
            nTokPredict: 2048,
            topK: 40,
            topP: 0.1,
            temp: this.model.temp || 0.2,
            repeatPenalty: this.model.repeat || 1,
        };

        this.getAIResponse = this.getAIResponse.bind(this);
    }


    async getAIResponse() {
        console.log('calling ai: ', this.chat);
        try {
            await this.llama.load({ path: this.model, ...this.cppConfig });
            await this.llama.createCompletion(this.cppParams, (response) => {
                process.stdout.write(JSON.stringify({ prompt: this.prompt, response: response.token }));
                process.stdout.write(response.token);

                return {
                    prompt: this.prompt,
                    response: response.token
                };
            });
        } catch (err) {
            console.error(err);
        }
    }
}

Model is: WizardLM-7B-uncensored.ggml.q5_1.bin

How do I force the end of text generation?

I am writing a Telegram bot that works as a chat bot based on llama. When a user writes something, the bot replies to the user and generates text in real time. But llama very often crashes and starts communicating with itself, for example:
"""
ASSISTANT: Hello! How can I assist you today?
!!!Starts communicating with itself!!!
HUMAN: I don't know what to do.
ASSIST: Oh, that's okay! We can start with something simple. Can you tell me about your day so far?
HUMAN: I've been trying to write some code, but I keep getting errors.
"""
So I decided to catch the word HUMAN: and, if such a word is detected, forcibly stop generating text. But here is the problem: I don't understand how to do this with this library. I'm not very experienced, so don't scold me too much :)
Here is the piece of code responsible for answering the user:

bot.on("text", ctx => {
    const input_text = ctx.message.text;

    var conv = plus_res("HUMAN", input_text)
    var res = get_res_ai(conv)
    plus_res("ASSISTANT", res)
    ctx.reply("Starting generate")

    
     function get_res_ai(prompt){
        var textAll = ""
        const newMessageId = ctx.message.message_id + 1
        llama.createCompletion({
            nThreads: 8,
            nTokPredict: 2048,
            topK: 40,
            topP: 0.95,
            temp: 0.8,
            repeatPenalty: 1,
            prompt,
        }, (response) => {
            textAll += response.token
            if (textAll.includes("HUMAN:")) {
                textAll = textAll.replace(/HUMAN:/g, '')
                ctx.telegram.editMessageText(ctx.chat.id, newMessageId, undefined, textAll).catch(error => console.error(error));
                //This is where the generation must stop
            }

            ctx.telegram.editMessageText(ctx.chat.id, newMessageId, undefined, textAll).catch(error => console.error(error));
        });
        return textAll
    }
    
});
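
One option, based on other examples on this page (the parameter appears there with the llama.cpp backend, so treating it as applicable here is an assumption): pass a stopSequence so generation stops as soon as the model starts a new HUMAN: turn, instead of filtering it afterwards:

llama.createCompletion({
    nThreads: 8,
    nTokPredict: 2048,
    topK: 40,
    topP: 0.95,
    temp: 0.8,
    repeatPenalty: 1,
    // Stop generating once the model tries to speak for the user.
    stopSequence: "HUMAN:",
    prompt,
}, (response) => {
    if (!response.completed) {
        textAll += response.token;
    }
});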

Option to get value as buffer - Support emoji

When you get tokens as text, some information is lost; for example, emojis are built from more than one byte.
A token can contain half an emoji, and UTF-8 decoding turns that into an unknown character.

I thought about 2 possible solutions:

  1. send the token as a buffer to ensure no information is lost, and let the client handle it
  2. if the last character decodes to an unknown character, save the bytes for the next token (see the sketch below)

I implemented the second solution for something similar; maybe it could help: massage-builder
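
A minimal sketch of the second approach, assuming the backend could expose raw token bytes (for example as a Buffer) instead of already-decoded strings: TextDecoder in streaming mode buffers an incomplete multi-byte sequence (such as half an emoji) until the remaining bytes arrive with the next token.

const decoder = new TextDecoder("utf-8");

function onTokenBytes(tokenBytes /* Uint8Array | Buffer */) {
    // stream: true => trailing partial code points are held back internally
    // instead of being replaced with U+FFFD.
    const text = decoder.decode(tokenBytes, { stream: true });
    if (text.length > 0) {
        process.stdout.write(text);
    }
}

// At the end of generation, flush whatever is left in the decoder.
function onCompleted() {
    const rest = decoder.decode();
    if (rest.length > 0) {
        process.stdout.write(rest);
    }
}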

zsh: illegal hardware instruction

I get zsh: illegal hardware instruction when executing an inference script like the following with Node v14.17.3:

import { LLama } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "../gpt4all/gpt4all-converted.bin");

const llama = new LLama(LLamaCpp);

const config = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
};

llama.load(config);

const template = `How are you`;

const prompt = `### Human:

${template}

### Assistant:`;

llama.createCompletion(
    {
        nThreads: 4,
        nTokPredict: 2048,
        topK: 40,
        topP: 0.1,
        temp: 0.2,
        repeatPenalty: 1,
        stopSequence: "### Human",
        prompt,
    },
    (response) => {
        process.stdout.write(response.token);
    }
);

I'm using

% tsc --version                        
Version 5.0.4
% npm --version
9.6.4
% node --version
v14.17.3

How can I install Types?

I got a lot of error messages like this one:

Cannot find module 'llama-node/dist/llm/llama-rs' or its corresponding type declarations.

How can I install the types?

app crashes when input is too long

thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Input too long', packages/llama-cpp/src/llama.rs:99:10
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1750:5
   3: tokio::runtime::task::raw::poll

the software has no reaction with no errors

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";
import fs from 'fs';

process.on('unhandledRejection', error => {
console.error('Unhandled promise rejection:', error);
});
const model = path.resolve(process.cwd(), "../llama.cpp/models/13B/ggml-model-q4_0.bin");

if (!fs.existsSync(model)) {
console.error("Model file does not exist: ", model);
}
const llama = new LLM(LLamaCpp);
//console.log("model:", model)
const config = {
modelPath: model,
enableLogging: true,
nCtx: 1024,
seed: 0,
f16Kv: false,
logitsAll: false,
vocabOnly: false,
useMlock: false,
embedding: true,
useMmap: true,
nGpuLayers: 0
};
//console.log("config:", config)
const prompt = "Who is the president of the United States?";
const params = {
nThreads: 4,
nTokPredict: 2048,
topK: 40,
topP: 0.1,
temp: 0.2,
repeatPenalty: 1.1,
prompt,
};
//console.log("params:", params)

try {
console.log("Loading model...");
await llama.load(config);
console.log("Model loaded");
} catch (error) {
console.error("Error loading model: ", error);
}

const response = await llama.createCompletion(params);
console.log(response)

const run = async () => {
try {
await llama.load(config);
console.log("load complete")
await llama.getEmbedding(params).then(console.log);
} catch (error) {
console.error("Error loading model or generating embeddings: ", error);
}
};
run();

I added a lot of things to debug it and found that it stops at line 44: await llama.load(config); the sequence just stops there and the program terminates. No errors were caught.

MacBook Pro with M1 Max
macOS 13.4 (22F66)
Node.js v20.3.0

basic unified API for all backends

  • the syntax for different backends is similar, but many config properties (that do exactly the same thing) have different names, for example numPredict (llm-rs) vs n_tok_predict (llama.cpp)

  • A "universal api" would have the same property names for every backend, so that switching to a different one would be as simple as changing one parameter.

  • It would only support the basic features all backends have (prompt, number of tokens, stop sequence(?), temperature) (hence "basic" in the title)

  • Backend-specific features (save/loadSession from llm-rs, mmap from llama.cpp) would be unavailable, or throw an error for backends that haven't implemented them yet (bad because throwing errors for obscure reasons?)

  • Parameters that are optional for some backends but mandatory for others (like topP?) could be replaced with "sane" default values.
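
A minimal sketch of what such a mapping layer could look like, using the two parameter names mentioned above; the helper name and the unified field names are illustrative assumptions, not an existing API:

// Hypothetical adapter: translate one set of "universal" completion params
// into the names each backend currently expects.
function toBackendParams(backend, params) {
    const { prompt, maxTokens, temperature, stopSequence } = params;
    switch (backend) {
        case "llm-rs":
            return { prompt, numPredict: maxTokens, temp: temperature };
        case "llama.cpp":
            return { prompt, nTokPredict: maxTokens, temp: temperature, stopSequence };
        default:
            throw new Error(`Unknown backend: ${backend}`);
    }
}

// Example: the same call site works for either backend.
// llama.createCompletion(toBackendParams("llama.cpp", { prompt, maxTokens: 128, temperature: 0.2 }), onToken);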

req: support async/await

Can we support async/await instead of callbacks?

const response = await llama.createCompletion(params);
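
In the meantime, a small wrapper can provide an awaitable result on top of the current callback API; a sketch, assuming the response shape (token, completed) shown in other examples on this page:

// Collects streamed tokens into a single string and resolves when the
// backend reports completion.
function createCompletionAsync(llama, params) {
    return new Promise((resolve, reject) => {
        let text = "";
        llama.createCompletion(params, (response) => {
            if (response.completed) {
                resolve(text);
            } else {
                text += response.token;
            }
        }).catch(reject);
    });
}

// const response = await createCompletionAsync(llama, params);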

Error: Too many tokens predicted

Why is this being thrown as an error?

return Err(napi::Error::from_reason("Too many tokens predicted"));

Forgive my ignorance if I'm missing something, but it seems like this should just set response.completed = true;

Context for the error:
prompt is a string, e.g. "Hello World, "
tokens is an integer, e.g. 128

this.model.createCompletion({
    nThreads: 4,
    nTokPredict: tokens,
    topK: 40,
    topP: 0.1,
    temp: 0.2,
    repeatPenalty: 1,
    prompt,
}, (response) => {
    completition.push(response.token)
    console.log(response.token)
    if(response.completed || count == tokens) {
        resolve(completition)
        return;
    }
}).catch((error) => {
    reject(new Error("Failed to generate completition: " + error));
});   

Again, if I'm not missing something, it seems clunky to have to check whether the error is actually just because we have hit the end token.

Ggml v3 support in Llama.cpp

Hi, thanks for this nice package. LLama.cpp made a new breaking change in their quantization methods recently (PR ref).

Would it be possible to get an update for the llama-node package so it can use the GGML v3 models? Actually, the new GGML models that come out are all using this format.

Interactive

I'm new to LLMs and llama but learning fast. I've written a small piece of code to chat via the CLI, but it seems not to follow the context (i.e. work in interactive mode).

import { LLM } from "llama-node";
import readline from "readline";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
const saveSession = path.resolve(process.cwd(), "./tmp/session.bin");
const loadSession = path.resolve(process.cwd(), "./tmp/session.bin");

import path from "path";
const model = path.resolve(process.cwd(), "./ggml-vic7b-q4_1.bin"); 

const llama = new LLM(LLamaRS);
llama.load({ path: model });


var rl = readline.createInterface(process.stdin, process.stdout);
console.log("Chatbot started!");
rl.setPrompt("> ");
rl.prompt();
rl.on("line", async function (line) {
    const prompt = `A chat between a user and an assistant.
    USER: ${line}
    ASSISTANT:`;
    llama.createCompletion({
        prompt,
        numPredict: 128,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
        feedPrompt: true,
        saveSession,
        loadSession,
    }, (response) => {
        if(response.completed) {
            process.stdout.write('\n'); 
            rl.prompt(); 
        } else {
            process.stdout.write(response.token);
        }  
    });
});

Am I missing something?

Expand on install/usage instructions

The readme states, install:

npm install llama-node
and provides some usage examples, which don't actually work; they are TypeScript, so there are more requirements.

Please provide all prerequisite installation instructions so that there is actually a working example.

It might all be second nature to experts, but for the less fortunate it is not as straightforward. (i.e. don't redirect someone to some examples directory with a config file, as that still requires prerequisite steps.)

Illegal instruction (core dumped)

I am unable to use llama-node on a Xeon E5-2690 v2; I get the error:

Illegal instruction (core dumped)

I am assuming it is because it doesn't support AVX2, as llama-node works on my i7 12700F. Is there any way to get it to work?

langchain integration

Hello, I'm trying to use the langchain integration but I cannot figure out how to use it. I'm following some examples from langchain:

import { LLM } from "llama-node";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
import readline from "readline";
import fs from "fs";
import path from "path";
import { SerpAPI } from 'langchain/tools';
import {initializeAgentExecutorWithOptions} from 'langchain/agents';
import { Calculator } from 'langchain/tools/calculator';
import { LLamaEmbeddings } from "llama-node/dist/extensions/langchain.js";

const SERPAPI_KEY = '';

const model = path.resolve(process.cwd(), "./ggml-vic7b-q4_1.bin"); 
const llama = new LLM(LLamaRS);
llama.load({ path: model });

const tools =[
    new SerpAPI(SERPAPI_KEY,{
        hl:'en',
        gl:'us'
    }),
    new Calculator(),
]

const executor = await initializeAgentExecutorWithOptions(tools, llama, {
    agentType: 'chat-zero-shot-react-description'
});
console.log('initialized')
const ret = await executor.call({
    input: "Who is Olivia Wilde's boyfrient? What is his age raised to the 0.23 power?"
});

console.log('ret:', ret.output);

but I got:

TypeError: this.llm.generatePrompt is not a function
    at LLMChain._call (file:///Users/lvx/dalai/node_modules/langchain/dist/chains/llm_chain.js:80:48)
    at async LLMChain.call (file:///Users/lvx/dalai/node_modules/langchain/dist/chains/base.js:65:28)
    at async LLMChain.predict (file:///Users/lvx/dalai/node_modules/langchain/dist/chains/llm_chain.js:98:24)
    at async ChatAgent._plan (file:///Users/lvx/dalai/node_modules/langchain/dist/agents/agent.js:197:24)
    at async AgentExecutor._call (file:///Users/lvx/dalai/node_modules/langchain/dist/agents/executor.js:82:28)
    at async AgentExecutor.call (file:///Users/lvx/dalai/node_modules/langchain/dist/chains/base.js:65:28)
    at async file:///Users/lvx/dalai/agent.js:35:13

I understand that this is because the LLM model does not have this function. Is there any method to call it, or do I have to create a translation class?

Embeddings.js file does not work correctly

This is the code:

import { LLM } from "llama-node";
import { LLMRS } from "llama-node/dist/llm/llm-rs.js";
import path from "path";
import fs from "fs";
const model = path.resolve(process.cwd(), "../../models/ggml-old-vic13b-q4_2.bin");
const llama = new LLM(LLMRS);
const getWordEmbeddings = async (prompt, file) => {
    const data = await llama.getEmbedding({
        prompt,
        numPredict: 128,
        temperature: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
    });
    console.log(prompt, data);
    await fs.promises.writeFile(path.resolve(process.cwd(), file), JSON.stringify(data));
};
const run = async () => {
    await llama.load({ modelPath: model, modelType: "Llama" /* ModelType.Llama */ });
    const dog1 = `My favourite animal is the dog`;
    await getWordEmbeddings(dog1, "./example/semantic-compare/dog1.json");
    const dog2 = `I have just adopted a cute dog`;
    await getWordEmbeddings(dog2, "./example/semantic-compare/dog2.json");
    const cat1 = `My favourite animal is the cat`;
    await getWordEmbeddings(cat1, "./example/semantic-compare/cat1.json");
};
run();

Output:

thread 'tokio-runtime-worker' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `5120`'

Failed to convert napi value into rust type `bool`

inference.mjs

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "../ggml-vic7b-uncensored-q5_1.bin");
const llama = new LLM(LLamaCpp);

const config = {
    path: model,
    //modelPath: model, // does not exist in llama-cpp.js; path is used instead
    enableLogging: true,
    nCtx: 1024,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
    nGpuLayers: 0
};

params passed

{
    nCtx: 1024,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
    nGpuLayers: 0
}
llama-node/node_modules/llama-node/dist/llm/llama-cpp.js:65
this.instance = yield LLama.load(path, rest, enableLogging);
^

Error: Failed to convert napi value into rust type `bool`
    at LLamaCpp.<anonymous> (file:///home/simon/llama-node/node_modules/llama-node/dist/llm/llama-cpp.js:65:35)
    at Generator.next (<anonymous>)
    at file:///home/simon/llama-node/node_modules/llama-node/dist/llm/llama-cpp.js:33:61
    at new Promise (<anonymous>)
    at __async (file:///home/simon/llama-node/node_modules/llama-node/dist/llm/llama-cpp.js:17:10)
    at LLamaCpp.load (file:///home/simon/llama-node/node_modules/llama-node/dist/llm/llama-cpp.js:61:14)
    at LLM.load (/home/simon/llama-node/node_modules/llama-node/dist/index.cjs:52:21)
    at run (file:///home/simon/llama-node/packages/llama-cpp/example-js/inference.mjs:39:17)
    at file:///home/simon/llama-node/packages/llama-cpp/example-js/inference.mjs:45:1
    at ModuleJob.run (node:internal/modules/esm/module_job:194:25) {
  code: 'BooleanExpected'
}

I guess something isn't correct with the params passed. Any chance to know what the correct way to pass them is?

The model used: ggml-vic7b-uncensored-q5_1.bin
