
Comments (20)

rasbt commented on August 20, 2024

Oh wow, thanks so much for figuring this out. I tried lots of things but somehow didn't think of this. It's kind of weird that Ollama doesn't raise an error if the options are passed in the wrong place (but then silently ignores them). In any case, I can confirm that the responses are now deterministic. But it still seems they are not deterministic across operating systems (but that's ok).
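For illustration, the fix seems to come down to where the settings live in the request payload. Below is a minimal sketch of the pitfall (hypothetical payloads, mirroring the working code later in this thread): settings nested under "options" are applied, while the same settings placed at the top level are silently ignored.

# Sketch: Ollama's /api/chat reads generation settings from the nested "options" dict.
payload_ignored = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What do llamas eat?"}],
    "temperature": 0.0,  # top-level setting -> silently ignored, output stays non-deterministic
}

payload_applied = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "What do llamas eat?"}],
    "options": {"temperature": 0.0},  # nested under "options" -> actually applied
}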


d-kleine commented on August 20, 2024

I just tested the eval notebook with gemma2; it seems to be reproducible across OSes. Maybe the params have an effect? https://ollama.com/library/gemma2

I have also tested it with the instruction-finetuned model for llama3 (llama3:instruct); unfortunately, the outputs are inconsistent there too. https://ollama.com/library/llama3 and https://ollama.com/library/llama3:instruct


rasbt commented on August 20, 2024

Haha wow that's kind of cool. I knew that the huge 256k vocab is good for something, haha


rasbt commented on August 20, 2024

Thanks for going down that rabbit hole. That sounds plausible! So yeah, temperature 0 is an attempt to force-disable the sampling, but disabling do_sample by setting it to False would be the better route (maybe temperature 0 is not truly zero because they add a small coefficient to prevent division-by-zero errors)
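To make that intuition concrete, here is a small sketch (with a made-up epsilon) of how a temperature-scaled sampler differs from plain greedy argmax: clamping the temperature near zero makes the distribution almost one-hot, but the sampling code path still runs, whereas argmax bypasses it entirely.

import torch

def sample_with_temperature(logits, temperature):
    eps = 1e-5  # hypothetical epsilon to avoid division by zero; the real value is implementation-specific
    scaled = logits / max(temperature, eps)
    probs = torch.softmax(scaled, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # still goes through the stochastic sampling path

def greedy_decode(logits):
    return torch.argmax(logits, dim=-1, keepdim=True)  # deterministic, no sampling involved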


rasbt commented on August 20, 2024

Thanks for the comment, and yes, that's correct, I used the instruction-finetuned variant because I think otherwise it wouldn't work well. I mentioned it in the info box:

[Screenshot of the info box from the notebook]


rasbt commented on August 20, 2024

Thanks for keeping an eye out for this!


d-kleine commented on August 20, 2024

Oh wow, thanks so much for figuring this out. I tried lots of things but somehow didn't think of this.

Tbh, I am really happy that the model is deterministic now, so the evaluation scores also differ less between runs than before 🙂

It's kind of weird that Ollama doesn't raise an error if the options are passed in the wrong place (but then silently ignores them).

Yeah, I was thinking the same...

In any case, I can confirm that the responses are now deterministic. But it still seems they are not deterministic across operating systems (but that's ok).

Yeah, I can confirm that. I have tested it on Windows 10 and with my Ubuntu image on Docker; the generated output on the same OS is deterministic and reproducible, but across different OSes it is inconsistent. The same also seems to happen when restarting the kernel: the first execution's output is different from the following ones (which are then consistent). My assumption is that this is not an issue with the model itself, but rather one in Ollama (probably even in llama.cpp in the backend).

I have opened a GH issue on this: ollama/ollama#5321


d-kleine commented on August 20, 2024

Thanks for updating the code!


rasbt commented on August 20, 2024

Oh nice! After reading the Gemma 2 paper, maybe the logit capping can also have an effect (not sure if it's implemented in the Ollama version though). I think deviations often happen in the extreme(r)-value realm, and logit capping could help with that somewhat.
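For reference, the soft-capping described in the Gemma 2 paper squashes the final logits into a bounded range, roughly as in the sketch below; whether the Ollama/llama.cpp build actually applies it is the open question here.

import torch

def soft_cap_logits(logits, cap=30.0):
    # Gemma 2 caps the final logits via cap * tanh(logits / cap),
    # bounding them to (-cap, cap) and taming extreme values
    return cap * torch.tanh(logits / cap)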


d-kleine commented on August 20, 2024

Good point (just saw your post on that)!

What I found funny about gemma2 was that it even generated a llama icon in the output 🦙 😄

d-kleine commented on August 20, 2024

Yesterday, I explored the issue in llama.cpp, which Ollama is based on, in more detail. I have a vague idea why there might be random output sometimes: Llama 3 supports various sampling methods, and sampling is enabled by default (do_sample=True), which means it might not use fully greedy decoding.
https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig

Currently, in llama.cpp (and therefore in Ollama), you cannot control the params num_beams (setting it to 1 ensures that the model does not use beam search) and do_sample (setting it to False ensures that the model does not use sampling methods like multinomial sampling):
ggerganov/llama.cpp#8265
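For comparison, explicit greedy decoding with the Hugging Face generate API looks roughly like the sketch below (the checkpoint name and token limit are placeholders); this is the kind of control the linked issue asks for on the llama.cpp side.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; assumes access to this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("What do llamas eat?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,   # disable multinomial sampling -> greedy decoding
    num_beams=1,       # disable beam search
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))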

What do you think about this?


d-kleine commented on August 20, 2024

maybe temperature 0 is not truly zero because they add a small coefficient to prevent division-by-zero errors

Good point!

BTW, I also think that in the code the seed parameter is irrelevant once the model is using greedy decoding (with "temperature": 0) and the output is deterministic. Setting a seed is primarily useful when using non-deterministic sampling methods, such as Top-K or Top-p sampling, to ensure reproducibility.
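As a small illustration (parameter values are arbitrary), the seed only comes into play once sampling is active:

# Both dicts would go into the "options" field of the Ollama request.
options_greedy = {"temperature": 0.0}  # greedy decoding: deterministic, a seed adds nothing

options_sampling = {                   # sampling active: the seed makes runs reproducible
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "seed": 123,
}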


d-kleine commented on August 20, 2024

@rasbt Could you please test whether the following code is deterministic across different OSes for you?

import urllib.request
import json


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "options": {     # Settings below are required for deterministic responses
            "temperature": 0.0
            "num_ctx": 2048, 
            "num_keep": 0 # no tokens saved in KV cache
        }
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data


prompt = """What do LLamas eat?"""
result = query_model(prompt)
print(result)

My output is:

Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: Hay is a staple in a llama's diet. They enjoy eating timothy hay, alfalfa hay, or other types of hay as a source of fiber and nutrients.
3. Grains: Llamas may also eat grains like oats, barley, or corn, although these should be given in moderation due to their high calorie content.
4. Fruits and vegetables: Llamas can enjoy fruits and veggies as treats or supplements, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas need access to minerals like calcium, phosphorus, and salt to maintain strong bones and overall health.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, like willow, alder, or birch.
2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or cottonwood.
3. Mosses: Llamas might snack on mosses, which are non-vascular plants that grow in dense clusters.

In captivity, llama owners typically provide a balanced diet consisting of hay, grains, and supplements specifically formulated for llamas. It's essential to consult with a veterinarian or experienced llama breeder to determine the best diet for your llama based on its age, size, and health status.

Scores:

model 1 response
Number of scores: 100 of 100
Average score: 78.58

Scoring entries: 100%|██████████| 100/100 [03:48<00:00,  2.29s/it]
model 2 response
Number of scores: 99 of 100
Average score: 65.46
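As an aside, a quick way to sanity-check determinism on a single machine (not part of the notebook) would be to call the function twice and compare the responses:

# Hypothetical check: with the options above, both calls should return the identical string
# on the same machine/OS.
result_1 = query_model(prompt)
result_2 = query_model(prompt)
print("Identical responses:", result_1 == result_2)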


rasbt commented on August 20, 2024

Arg, close but not quite :(

model 1 response
Number of scores: 100 of 100
Average score: 78.48

Scoring entries: 100%|████████████████████████| 100/100 [01:14<00:00,  1.34it/s]

model 2 response
Number of scores: 99 of 100
Average score: 64.98

(The results do match the previous results I got with the old setting, though. At first I thought I had forgotten to change the setting, but I double-checked twice.)

My guess is that because Ollama is based on llama.cpp, there may be some compilation-induced differences. But it's close!


d-kleine commented on August 20, 2024

(The results do match the previous results I got with the old setting, though. At first I thought I had forgotten to change the setting, but I double-checked twice.)

Thanks for checking. Unfortunately, I don't have a way to test it on macOS. At least on Windows and Ubuntu the outputs are 1:1 the same; even the scores match on both OSes.

My guess is that because Ollama is based on llama.cpp, there may be some compilation-induced differences. But it's close!

Yeah, I think so too. May I ask which Ollama version you used? I am running the most recent Ollama version (0.2.7, which was released today) on both of my OSes.

Updating Ollama pulls in llama.cpp commits too, so this makes sure that we run the code under the same conditions.


rasbt commented on August 20, 2024

I just updated before I ran it yesterday. Just checking, it was also 0.2.7 :(


d-kleine commented on August 20, 2024

I just updated before I ran it yesterday. Just checking, it was also 0.2.7 :(

Hm, strange... thanks for testing though!


d-kleine commented on August 20, 2024

@rasbt BTW I just figured out that ollama run llama3 uses the instruction-finetuned variant of Llama3 (Meta-Llama-3-8B-Instruct), not the vanilla model (Meta-Llama-3-8B). So ollama run llama3 is practically ollama run llama3:instruct.
https://ollama.com/library/llama3/blobs/6a0746a1ec1a

I think that suits this chapter well, but if you want to use the model without instruction finetuning, you would need to use ollama run llama3:text
https://ollama.com/library/llama3:text/blobs/cebceffdc781


d-kleine commented on August 20, 2024

Alright, thanks 🙂


d-kleine commented on August 20, 2024

I am currently waiting for PR ollama/ollama#5760 to be merged and shipped in the Ollama app. With this, you would be able to disable the KV cache in Ollama, which might be the cause of the non-deterministic outputs due to variations in memory management, concurrency handling, hardware differences, system-level optimizations, and resource access.

I will let you know once there is an update on that, so we can then test whether the outputs are deterministic across all OSes.

