I am using the Alpaca 30B model and feeding it a list of prompts from an external CSV file. After three prompts are processed, I get a segmentation fault, even though I made sure that large prompts are filtered out.
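(Roughly, the filter just estimates token counts from character counts and drops anything that would exceed the context size; the 4-characters-per-token ratio is only a rough heuristic I assumed, not an exact count:)

def looks_too_long(prompt: str, n_ctx: int = 512) -> bool:
    # ~4 characters per token is only a crude estimate, not real tokenization
    return len(prompt) / 4 > n_ctx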
The curious thing is that on the first iteration the model handled a prompt string of length 912, yet a few iterations later it segfaulted on another prompt of length 517. Considering I still had 4 GB of free RAM plus swap available when this happened, I am confused.
Could it be that I am forgetting to clear memory or something after each prompt is processed?
Here is my simple Python code. To save the model output, I also had to slightly modify the streaming_fn callback. Could you verify whether I am doing something wrong here?
import pandas as pd
import sys, time
from pathlib import Path

sys.path.append("../build/")
import fastLlama

# PROJECT_ROOT_PATH and SEP are defined elsewhere in my script;
# shown here so the snippet is self-contained.
PROJECT_ROOT_PATH = Path(__file__).resolve().parent
SEP = "|"  # field separator used in the results file

MODEL_PATH = PROJECT_ROOT_PATH / "../../alpaca-lora-30B-ggml/ggml-model-q4_0.bin"

model_output = []  # tokens streamed by stream_token are collected here
def gen_output(instruction_input, writable):
    global model_output

    # Feed the prompt into the model's context
    res = model.ingest(instruction_input)

    start = time.time()
    res = model.generate(
        num_tokens=120,             # number of tokens to generate
        top_p=0.95,                 # top-p sampling (Optional) > increased from 0.92
        temp=0.1,                   # temperature (Optional) > reduced from 0.65
        repeat_penalty=1.1,         # repetition penalty (Optional) > changed from 1.3
        streaming_fn=stream_token,  # streaming function
        stop_word=[".\n", "# "]     # stop generation when one of these is encountered (Optional)
    )
    tot_time = round(time.time() - start, 3)

    # Write the collected model output to the results file
    model_output_str = ''.join(model_output)
    writable_output = f"{model_output_str}{SEP}{tot_time}{SEP}"
    writable.write(writable_output)
    writable.flush()

    # Reset the collected tokens for the next prompt
    model_output = []
def stream_token(x: str) -> None:
    """
    This function is called by the llama library to stream tokens.
    """
    global model_output
    model_output.append(x)
    print(x, end='', flush=True)
if __name__ == '__main__':
    # Load the prompts
    prompts_df = pd.read_csv("./prompts.csv")

    # Load the model
    print("Loading the model ...")
    model = fastLlama.Model(
        id="ALPACA-LORA-30B",
        path=str(MODEL_PATH.resolve()),  # path to model
        num_threads=16,  # number of threads to use
        n_ctx=512,       # context size of model
        last_n_size=64,  # size of last n tokens (used for repetition penalty) (Optional)
        seed=0           # seed for random number generator (Optional)
    )

    alpaca_output = open('alpaca_inferences_results.txt', 'a')

    print('Starting Inference Generation ...')
    for row_ind, row_info in prompts_df.iterrows():
        if row_ind % 10 == 0:
            print(f'Processed {row_ind} prompts.')
        prompt_id = row_info['prompt_id']
        prompt = row_info['prompt_text']
        gen_output(prompt, alpaca_output)
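In case it matters, the only workaround I could think of is tearing the model down and re-creating it every few prompts, roughly like this (just a sketch; I am not sure whether fastLlama actually frees the old context when the object is replaced, and REINIT_EVERY is an arbitrary number I picked):

# Sketch: re-create the model every few prompts in case state accumulates.
REINIT_EVERY = 3  # arbitrary cadence, not a tested value

def fresh_model():
    # Same constructor arguments as in the main block above
    return fastLlama.Model(
        id="ALPACA-LORA-30B",
        path=str(MODEL_PATH.resolve()),
        num_threads=16,
        n_ctx=512,
        last_n_size=64,
        seed=0
    )

# ...inside the loop:
#     if row_ind % REINIT_EVERY == 0:
#         model = fresh_model()

Would that even be the right approach, or is there a proper way to reset the context between prompts?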