
Comments (20)

rasbt commented on August 20, 2024

Oh wow, thanks so much for figuring this out. I tried lots of things but somehow didn't think of this. It's kind of weird that Ollama doesn't raise an error if the options are passed in the wrong place (but then silently ignores them). In any case, I can confirm that the responses are now deterministic. But it still seems they are not deterministic across operating systems (but that's ok).
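For illustration, the fix seems to come down to where the settings live in the request payload. Below is a minimal sketch of the pitfall (hypothetical payloads, mirroring the working code later in this thread): settings nested under "options" are applied, while the same settings placed at the top level are silently ignored.

# Sketch: Ollama's /api/chat reads generation settings from the nested "options" dict.
payload_ignored = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What do llamas eat?"}],
    "temperature": 0.0,  # top-level setting -> silently ignored, output stays non-deterministic
}

payload_applied = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What do llamas eat?"}],
    "options": {"temperature": 0.0},  # nested under "options" -> actually applied
}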


d-kleine commented on August 20, 2024

I just tested the eval notebook with gemma2; it seems to be reproducible across OSes. Maybe the params have an effect? https://ollama.com/library/gemma2

I have also tested it with the instruction-finetuned model for llama3 (llama3:instruct); unfortunately, the outputs are inconsistent there too. https://ollama.com/library/llama3 and https://ollama.com/library/llama3:instruct


rasbt commented on August 20, 2024

Haha wow that's kind of cool. I knew that the huge 256k vocab is good for something, haha


rasbt commented on August 20, 2024

Thanks for going down that rabbit hole. That sounds plausible! So yeah, temperature 0 is an attempt to force-disable the sampling, but disabling do_sample by setting it to False would be the better route (maybe temperature 0 is not truly zero because they add a small coefficient to prevent division-by-zero errors)
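To make that intuition concrete, here is a small sketch (with a made-up epsilon) of how a temperature-scaled sampler differs from plain greedy argmax: clamping the temperature near zero makes the distribution almost one-hot, but the sampling code path still runs, whereas argmax bypasses it entirely.

import torch

def sample_with_temperature(logits, temperature):
    eps = 1e-5  # hypothetical epsilon to avoid division by zero; the real value is implementation-specific
    scaled = logits / max(temperature, eps)
    probs = torch.softmax(scaled, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # still goes through the stochastic sampling path

def greedy_decode(logits):
    return torch.argmax(logits, dim=-1, keepdim=True)  # deterministic, no sampling involved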


rasbt commented on August 20, 2024

Thanks for the comment, and yes, that's correct, I used the instruction-finetuned variant because I think otherwise it wouldn't work well. I mentioned it in the info box:

[Screenshot of the info box from the notebook]


rasbt commented on August 20, 2024

Thanks for keeping an eye out for this!


d-kleine commented on August 20, 2024

Oh wow, thanks so much for figuring this out. I tried lots of things but somehow didn't think of this.

Tbh, I am really happy that the model is deterministic now, so the evaluation scores also differ less between runs than before 🙂

It's kind of weird that Ollama doesn't raise an error if the options are passed in the wrong place (but then silently ignores them).

Yeah, I was thinking the same...

In any case, I can confirm that the responses are now deterministic. But it still seems they are not deterministic across operating systems (but that's ok).

Yeah, I can confirm that. I have tested it on Windows 10 and with my Ubuntu image on Docker; the generated output on the same OS is deterministic and reproducible, but across different OSes it is inconsistent. The same also seems to happen when restarting the kernel: the first execution's output is different from the following ones (which are then consistent). My assumption is that this is not an issue with the model itself, but rather one in Ollama (probably even in llama.cpp in the backend).

I have opened a GH issue on this: ollama/ollama#5321


d-kleine commented on August 20, 2024

Thanks for updating the code!


rasbt commented on August 20, 2024

Oh nice! After reading the Gemma 2 paper, maybe the logit capping can also have an effect (not sure if it's implemented in the Ollama version though). I think deviations often happen in the extreme(r)-value realm, and logit capping could help with that somewhat.
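For reference, the soft-capping described in the Gemma 2 paper squashes the final logits into a bounded range, roughly as in the sketch below; whether the Ollama/llama.cpp build actually applies it is the open question here.

import torch

def soft_cap_logits(logits, cap=30.0):
    # Gemma 2 caps the final logits via cap * tanh(logits / cap),
    # bounding them to (-cap, cap) and taming extreme values
    return cap * torch.tanh(logits / cap)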


d-kleine commented on August 20, 2024

Good point (just saw your post on that)!

What I found funny about gemma2 was that it even generated a llama icon in the output 🦙 😄

d-kleine commented on August 20, 2024

Yesterday, I explored the issue in llama.cpp, which Ollama is based on, in more detail. I have a vague idea why there might be random output sometimes: Llama 3 supports various sampling methods, and sampling is enabled by default (do_sample=True), which means it might not use fully greedy decoding.
https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig

Currently, in llama.cpp (and therefore in Ollama), you cannot control the params num_beams (setting it to 1 ensures that the model does not use beam search) and do_sample (setting it to False ensures that the model does not use sampling methods like multinomial sampling):
ggerganov/llama.cpp#8265
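For comparison, explicit greedy decoding with the Hugging Face generate API looks roughly like the sketch below (the checkpoint name and token limit are placeholders); this is the kind of control the linked issue asks for on the llama.cpp side.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; assumes access to this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("What do llamas eat?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,   # disable multinomial sampling -> greedy decoding
    num_beams=1,       # disable beam search
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))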

What do you think about this?


d-kleine commented on August 20, 2024

maybe temperature 0 is not truly zero because they add a small coefficient to prevent division-by-zero errors

Good point!

BTW, I also think that in the code the seed parameter is irrelevant once the model is using greedy decoding (with "temperature": 0) and the output is deterministic. Setting a seed is primarily useful when using non-deterministic sampling methods, such as Top-K or Top-p sampling, to ensure reproducibility.
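As a small illustration (parameter values are arbitrary), the seed only comes into play once sampling is active:

# Both dicts would go into the "options" field of the Ollama request.
options_greedy = {"temperature": 0.0}  # greedy decoding: deterministic, a seed adds nothing

options_sampling = {                   # sampling active: the seed makes runs reproducible
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "seed": 123,
}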


d-kleine commented on August 20, 2024

@rasbt Could you please test whether the following code is deterministic across different OSes for you?

import urllib.request
import json


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "options": {     # Settings below are required for deterministic responses
            "temperature": 0.0
            "num_ctx": 2048, 
            "num_keep": 0 # no tokens saved in KV cache
        }
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data


prompt = """What do LLamas eat?"""
result = query_model(prompt)
print(result)

My output is:

Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: Hay is a staple in a llama's diet. They enjoy eating timothy hay, alfalfa hay, or other types of hay as a source of fiber and nutrients.
3. Grains: Llamas may also eat grains like oats, barley, or corn, although these should be given in moderation due to their high calorie content.
4. Fruits and vegetables: Llamas can enjoy fruits and veggies as treats or supplements, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas need access to minerals like calcium, phosphorus, and salt to maintain strong bones and overall health.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, like willow, alder, or birch.
2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or cottonwood.
3. Mosses: Llamas might snack on mosses, which are non-vascular plants that grow in dense clusters.

In captivity, llama owners typically provide a balanced diet consisting of hay, grains, and supplements specifically formulated for llamas. It's essential to consult with a veterinarian or experienced llama breeder to determine the best diet for your llama based on its age, size, and health status.

Scores:

model 1 response
Number of scores: 100 of 100
Average score: 78.58

Scoring entries: 100%|██████████| 100/100 [03:48<00:00,  2.29s/it]
model 2 response
Number of scores: 99 of 100
Average score: 65.46
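As an aside, a quick way to sanity-check determinism on a single machine (not part of the notebook) would be to call the function twice and compare the responses:

# Hypothetical check: with the options above, both calls should return the identical string
# on the same machine/OS.
result_1 = query_model(prompt)
result_2 = query_model(prompt)
print("Identical responses:", result_1 == result_2)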


rasbt commented on August 20, 2024

Arg, close but not quite :(

model 1 response
Number of scores: 100 of 100
Average score: 78.48

Scoring entries: 100%|████████████████████████| 100/100 [01:14<00:00,  1.34it/s]

model 2 response
Number of scores: 99 of 100
Average score: 64.98

(The results do match the previous results I got with the old setting, though. At first I thought I had forgotten to change the setting, but I double-checked twice.)

My guess is that because Ollama is based on llama.cpp, there may be some compilation-induced differences. But it's close!


d-kleine commented on August 20, 2024

(The results do match the previous results I got with the old setting, though. At first I thought I had forgotten to change the setting, but I double-checked twice.)

Thanks for checking. Unfortunately, I don't have a way to test it on macOS. At least on Windows and Ubuntu the outputs are 1:1 the same; even the scores match on both OSes.

My guess is that because Ollama is based on llama.cpp, there may be some compilation-induced differences. But it's close!

Yeah, I think so too. May I ask which Ollama version you used? I am running the most recent Ollama version (0.2.7, which was released today) on both of my OSes.

Updating Ollama pulls in llama.cpp commits too, so this makes sure that we run the code under the same conditions.


rasbt commented on August 20, 2024

I just updated before I ran it yesterday. Just checking, it was also 0.2.7 :(


d-kleine commented on August 20, 2024

I just updated before I ran it yesterday. Just checking, it was also 0.2.7 :(

Hm, strange... thanks for testing though!


d-kleine commented on August 20, 2024

@rasbt BTW I just figured out that ollama run llama3 uses the instruction-finetuned variant of Llama3 (Meta-Llama-3-8B-Instruct), not the vanilla model (Meta-Llama-3-8B). So ollama run llama3 is practically ollama run llama3:instruct.
https://ollama.com/library/llama3/blobs/6a0746a1ec1a

I think that suits this chapter well, but if you want to use the model without instruction finetuning, you would need to use ollama run llama3:text
https://ollama.com/library/llama3:text/blobs/cebceffdc781


d-kleine commented on August 20, 2024

Alright, thanks 🙂


d-kleine commented on August 20, 2024

I am currently waiting for PR ollama/ollama#5760 to be merged and shipped in the Ollama app. With this, you would be able to disable the KV cache in Ollama, which might be the cause of the non-deterministic outputs due to variations in memory management, concurrency handling, hardware differences, system-level optimizations, and resource access.

I will let you know once there is an update on that, so we can then test whether the outputs are deterministic across all OSes.

