
learn-langchain's Introduction

learn-langchain

AI Agent with Vicuna

This repository is a playground that makes it easy to use Zero-Shot / Few-Shot prompts based on the ReAct framework with LLaMA-based models, such as Vicuna, through the langchain framework.

Installation

Disclaimer: This may not be the most effective way to install, but it's how I've done it. The installation process may not work exactly as expected, and you might have to install additional requirements. If that's the case, feel free to open a PR to fix it or an issue describing the problem.

NVIDIA Driver / Toolkit

First, install the NVIDIA toolkit: https://developer.nvidia.com/cuda-11-8-0-download-archive

In my case, I installed it using the deb repository, which ended up pulling in CUDA toolkit 12.1.

If you don't have a GPU, you should skip this, of course.

Installing Python dependencies

In bash, run:

chmod +x ./install_on_virtualenv_and_pip.sh 
./install_on_virtualenv_and_pip.sh

If you're not running bash, or you don't have a virtualenv (or created it in another way), you can run the equivalent commands directly, adapting them to your OS/shell:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip3 install -r requirements.txt

Just note that this might clobber your main Python installation; if possible, use a virtualenv!
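
A quick way to verify that the GPU setup worked is to check PyTorch from Python (a minimal sketch, assuming the CUDA 11.8 wheels above installed correctly):

import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should print True if the driver and CUDA wheels match
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the detected NVIDIA GPU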

Git LFS for cloning quantized models

For the quantized models, you also need git lfs installed: https://git-lfs.com/

Running the server

Option 1 - Text Generation WebUI (new on 30.04.2023)

This is the recommended approach for most users. You should only use option 2 if you need prompt logging.

With this approach you can load quantized models (see how to run the newest models as described here: #24)

You can also change any of the available parameters:

def default_parameters():
    return {
        "max_new_tokens": 250,
        "do_sample": True,
        "temperature": 0.001,
        "top_p": 0.1,
        "typical_p": 1,
        "repetition_penalty": 1.2,
        "top_k": 1,
        "min_length": 32,
        "no_repeat_ngram_size": 0,
        "num_beams": 1,
        "penalty_alpha": 0,
        "length_penalty": 1,
        "early_stopping": False,
        "seed": -1,
        "add_bos_token": True,
        "truncation_length": 2048,
        "ban_eos_token": False,
        "skip_special_tokens": True,
        "stopping_strings": [STOP_TOKEN + ":"],
    }

If you define your own parameters, you should pass them in the LLM builder function:

def build_text_generation_web_ui_client_llm(
    prompt_url="http://localhost:5000/api/v1/generate", parameters=None  # override these parameters
):
    if parameters is None:
        parameters = default_parameters()

    return HTTPBaseLLM(
        prompt_url=prompt_url,
        parameters=parameters,
        stop_parameter_name="stopping_strings",
        response_extractor=response_extractor,
    )

Also make sure to update the response_extractor if you modify the stopping_strings parameter, or else things will break in unexpected ways.
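
For example, a minimal sketch of overriding a couple of parameters while keeping the rest of the defaults (this assumes default_parameters is importable from the same module as the builder; adjust the import to where it actually lives in your checkout):

from langchain_app.models.text_generation_web_ui import (
    build_text_generation_web_ui_client_llm,
    default_parameters,  # assumed to sit next to the builder function
)

parameters = default_parameters()
parameters["temperature"] = 0.7         # more diverse sampling
parameters["repetition_penalty"] = 1.3  # often helps with code output (see Examples below)

llm = build_text_generation_web_ui_client_llm(parameters=parameters)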

Steps to install:

  1. Use https://github.com/oobabooga/text-generation-webui as the backend.
  2. Install the text-generation-webui as instructed in the repository README.
  3. Download a model and start the server / UI.
  4. In the UI, go to Interface Mode -> Available Extensions -> api (tick this one). Click on Apply and restart the interface (a quick way to check that the API is reachable is sketched below).
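
Once the api extension is enabled, a quick sanity check that the endpoint responds (a minimal sketch; the request shape mirrors what the HTTP client below sends, and the raw JSON response is printed as-is):

import requests

resp = requests.post(
    "http://localhost:5000/api/v1/generate",  # default prompt_url used by the client below
    json={"prompt": "Hello, my name is", "max_new_tokens": 16},
)
resp.raise_for_status()
print(resp.json())  # the client's response_extractor normally parses this for you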

Important: small code modification required

Some code samples will not work out of the box with this option. To use the Text Generation WebUI you should use the correct LLM client:

from langchain_app.models.text_generation_web_ui import build_text_generation_web_ui_client_llm

llm = build_text_generation_web_ui_client_llm()

You can see a Chuck Norris example using it here: langchain_app/agents/chuck_norris_test_web_generation_textui.py

To execute it and test it:

python3 -m langchain_app.agents.chuck_norris_test_web_generation_textui
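
For reference, this is roughly how such an agent is wired up with a 2023-era langchain (a sketch only; the actual example script may differ, but the "Python REPL" tool name matches the agent traces shown below):

from langchain.agents import Tool, initialize_agent
from langchain.utilities import PythonREPL
from langchain_app.models.text_generation_web_ui import build_text_generation_web_ui_client_llm

llm = build_text_generation_web_ui_client_llm()

python_repl = PythonREPL()
tools = [
    Tool(
        name="Python REPL",
        func=python_repl.run,
        description="A Python shell. Use this to execute Python commands.",
    )
]

agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
agent.run("Tell me a Chuck Norris joke using the chucknorris.io API")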

Option 2 - Use this repo's web server

Update 13.05.2023: I don't recommend this option at the moment, as there seem to be some open bugs with the quantized version, and I'm not planning to fix them anytime soon. Please use the Text Generation WebUI instead if you want to use quantized models. Open bugs:

  1. #25
  2. #28

This option only makes sense if you want to use my server's prompt logging feature to generate datasets. Currently it only works with HF models (at most 8-bit quantization).

Default Parameters

If you have the virtualenv, activate it by running: source learn-langchain/bin/activate

Then:

uvicorn servers.vicuna_server:app 

If you want to load a different model than the default:

export USE_13B_MODEL=true && export USE_4BIT=true && uvicorn servers.vicuna_server:app

On Windows

If you've somehow managed to install everything on Windows (congrats!), please feel free to contribute and extend this README.md. So far, we know that the following change is needed:

The command above to use a different model returns an error if you try to run it on Windows:

export USE_13B_MODEL=true && export USE_4BIT=true && uvicorn servers.vicuna_server:app

export and && chaining aren't supported there. Instead, you can use the set command and format the line like this:

set USE_13B_MODEL=true; set USE_4BIT=true;uvicorn servers.vicuna_server:app

Thanks to @unoriginalscreenname for sharing
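
A shell-agnostic alternative that works the same on Windows and Linux is to set the variables and launch uvicorn from a small Python script (a sketch; host and port are just illustrative defaults):

import os
import uvicorn

# Set the variables before the server module is imported, so the Config picks them up.
os.environ["USE_13B_MODEL"] = "true"
os.environ["USE_4BIT"] = "true"

uvicorn.run("servers.vicuna_server:app", host="127.0.0.1", port=8000)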

Downloading quantized models

When you run it for the first time, the server might throw an error saying that the model was not found. Follow the instruction in the error message, for instance by cloning:

git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g

Make sure you have installed git lfs first. You can also run this command beforehand if you already know which model version you want to use.

Config (Update 25.04)

This repository has again been reorganized, adding support for 4-bit models (via gptq_for_llama: https://github.com/qwopqwop200/GPTQ-for-LLaMa).

You can now change the server behavior by setting environment variables:

class Config:
    def __init__(self) -> None:
        self.base_model_size = "13b" if os.getenv("USE_13B_MODEL") else "7b"
        self.use_for_4bit = True if os.getenv("USE_FOR_4BIT") else False
        self.use_fine_tuned_lora = True if os.getenv("USE_FINE_TUNED_LORA") else False
        self.lora_weights = os.getenv("LORA_WEIGHTS")
        self.device = "cpu" if os.getenv("USE_CPU") else "cuda"
        self.model_path = os.getenv("MODEL_PATH", "")
        self.model_checkpoint = os.getenv("MODEL_CHECKPOINT", "")

Some options are incompatible with each other; the code does not check for all combinations.
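
For example, to point the server at a different Hugging Face model (the same mechanism used for WizardLM further down), set MODEL_PATH before launching. A small sketch to inspect what the server will pick up; the servers.load_config import path is assumed from the model loader snippet in the issues below and may differ in your checkout:

import os

os.environ["MODEL_PATH"] = "TheBloke/wizardLM-7B-HF"  # value reused from run_server.sh below

from servers.load_config import Config  # import path assumed, see note above

config = Config()
print(vars(config))  # shows which model size, quantization options and device will be used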

This repository's web server supports the following models:

  • Vicuna 7b unquantized, HF format (16-bits) - this is the default (https://huggingface.co/eachadea/vicuna-7b-1.1)
  • Vicuna 7b LoRA fine-tune (8-bits)
  • Vicuna 7b GPTQ 4-bit, group size 128
  • Vicuna 13b unquantized, HF format (16-bits)
  • Vicuna 13b GPTQ 4-bit, group size 128

Examples

Note: The coding prompts are currently unreliable, and what works with one model might not work with another. Changing the model parameters also greatly affects the output. For instance, increasing the repetition penalty to 1.3 seems to improve the quality of the generated code when using the WizardLM model.

Cat Jokes

Use this to start the server:

export USE_4BIT=true
export USE_13B_MODEL=true
uvicorn servers.vicuna_server:app 

You can get this output:

(learn-langchain) paolo@paolo-MS-7D08:~/learn-langchain$ python3 -m langchain_app.agents.cat_joke

> Entering new AgentExecutor chain...
I must use the Python REPL to write a script that generates cat jokes and saves them to a CSV file called 'catjokes.csv'.
Action: Python REPL
Action Input:
import csv # line 1
jokes = ["Why did the cat go to the vet?", "Because it was meowt!"] # line 2
with open("catjokes.csv", "w", newline="") as csvfile: # line 3
    writer = csv.writer(csvfile) # line 4
    writer.writerow(jokes) # line 5

Observation: 
Thought:I now know the answer
Final Answer: I have executed the task successfully.

Final Answer: I have executed the task successfully.

> Finished chain.
(learn-langchain) paolo@paolo-MS-7D08:~/learn-langchain$ cat catjokes.csv 
Why did the cat go to the vet?,Because it was meowt!

Chuck Norris Prompt

Note: The original prompt from the article no longer works, as I might have modified something along the way.

The model that seems to perform better at this sort of task is the new WizardLM: https://huggingface.co/TheBloke/wizardLM-7B-HF

To start the server, modify the file 'run_server.sh' to contain:

export MODEL_PATH=TheBloke/wizardLM-7B-HF
uvicorn servers.vicuna_server:app

Then, running the Chuck Norris example produces output like this:

> Entering new AgentExecutor chain...
I should use requests library to fetch the data from the API
Action: Python REPL
Action Input: 
import requests
Observation: 
Thought:I should define the endpoint URL
Action: Python REPL
Action Input: 
url = 'https://api.chucknorris.io/jokes/random'
Observation: 
Thought:I should send a GET request to the endpoint
Action: Python REPL
Action Input: 
response = requests.get(url)
Observation: 
Thought:I should check if the response status code is 200
Action: Python REPL
Action Input: 
if response.status_code == 200:
    # extract the joke from the response
    data = response.json()
    joke = data['value']
    print(joke)
else:
    print('Error fetching joke')
Observation: Chuck Norris once ran a one-minute mile. He did it dragging an 18-wheeler while running in a field of wet cement.

Thought:I have successfully fetched and printed the joke.
Final Answer: Chuck Norris once ran a one-minute mile. He did it dragging an 18-wheeler while running in a field of wet cement.

> Finished chain.
(learn-langchain) paolo@paolo-MS-7D08:~/learn-langchain$ 

Answer about Germany Q/A

python -m langchain_app.agents.answer_about_germany

Sample output:


Type your question: Where is Germany Located?


> Entering new AgentExecutor chain...
I should always think about what to do
Action: Search
Action Input: Germany
Observation: [(Document(page_content="'''Germany''',{{efn|{{lang-de|Deutschland}}, {{IPA-de|ˈdɔʏtʃlant|pron|De-Deutschland.ogg}}}} officially the '''Federal Republic of Germany''',{{efn|{{Lang-de|Bundesrepublik Deutschland}}, {{IPA-de|ˈbʊndəsʁepuˌbliːk ˈdɔʏtʃlant|pron|De-Bundesrepublik Deutschland.ogg}}<ref>{{cite book|title=Duden, Aussprachewörterbuch|publisher=Dudenverlag|year=2005|isbn=978-3-411-04066-7|editor-last=Mangold|editor-first=Max|edition=6th|pages=271, 53f|language=de}}</ref>}} is a country in [[Central Europe]]. It is the [[List of European countries by population|second-most populous country]] in Europe after [[Russia]], and the most populous [[member state of the European Union]]. Germany is situated between the [[Baltic Sea|Baltic]] and [[North Sea|North]] seas to the north, and the [[Alps]] to the south. Its 16 [[States of Germany|constituent states]] are bordered by [[Denmark]] to the north, [[Poland]] and the [[Czech Republic]] to the east, [[Austria]] and [[Switzerland]] to the south, and [[France]], [[Luxembourg]], [[Belgium]], and the [[Netherlands]] to the west. The nation's capital and [[List of cities in Germany by population|most populous city]] is [[Berlin]] and its main financial centre is [[Frankfurt]]; the largest urban area is the [[Ruhr]].", metadata={'source': '1'}), 0.8264833092689514)]
Thought:I now know the final answer
Final Answer: Germany is a country located in Central Europe, bordered by the Baltic and North seas to the north, the Alps to the south, and bordered by Denmark, Poland, the Czech Republic, Austria, Switzerland, France, Luxembourg, Belgium, and the Netherlands to the west. Its capital and most populous city is Berlin and its main financial center is Frankfurt. The largest urban area is the Ruhr.
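
Under the hood, this example builds a small local vector store over the source text. A minimal sketch of that retrieval step, assuming langchain's HuggingFaceEmbeddings wrapper around sentence-transformers and the Chroma vector store (the file name and embedding model are illustrative, not necessarily what answer_about_germany.py uses):

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

with open("germany.txt") as f:  # illustrative source document
    book = f.read()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(book)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
docsearch = Chroma.from_texts(
    texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]
)

docs = docsearch.similarity_search_with_score("Where is Germany located?", k=1)
print(docs[0])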

Experimental: Code Editor Tool / Code-it task executor

Nothing is guaranteed to work here.

Code-it task executor

This is an experimental task executor I've been developing at: https://github.com/paolorechia/code-it

You can try it using this example:

python -m langchain_app.agents.coder_plot_chart_executor_test

I'll explain this better at some point :)

Matplotlib Prompt

Install matplotlib and run:

python -m langchain_app.agents.coder_plot_chart

If all works well, the bot will create an example_chart.png file and a persistent_source.py file with the generated source code.

Medium Articles

Introduction

https://medium.com/@paolorechia/creating-my-first-ai-agent-with-vicuna-and-langchain-376ed77160e3

Q/A with Sentence Transformer + Vicuna

https://medium.com/@paolorechia/building-a-question-answer-bot-with-langchain-vicuna-and-sentence-transformers-b7f80428eadc

Fine tuning for Python REPL

https://medium.com/@paolorechia/fine-tuning-my-first-wizardlm-lora-ca75aa35363d

Example run

https://gist.github.com/paolorechia/0b8b5e08b38040e7ec10eef237caf3a5

Acknowledgements

learn-langchain's People

Contributors

bigjeager, paolorechia


learn-langchain's Issues

Running code generation on your own API Docs

@paolorechia Did you consider using something ready to use like this dataset to finetune the models:
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca

I'm working on a similar project currently (as you described in the closed issue), where the LLM generates Python code using my API. The code is evaluated and executed on the source request server. It works using GPT-3.5-turbo, and now I'm solving the puzzle of making it work with an open model.
Biggest problems? 2k tokens are not enough to fit the docs (API docs) within the prompt. Langchain, on the other hand, should be useful here: if the right embedding method is used, the right snippets of the documentation should be passed within the prompt. But there is still quite a chance that the context won't be fed the right chunks from the db but something random instead. I already faced that lately and I'm not sure how much it depends on the structure of the docs.

I'm considering fine-tuning, just as you described, but it would take a couple of weeks to prepare the dataset properly on your own data. Also, when I extend the API I will have to train again, which takes another couple of hours, and the results are not predictable every single time.

So I believe the best approach would be to fine-tune some simple model like Wizard 7B or wizard-vicuna-13b (getting the best results so far) to write Python code properly (using an external, ready-made dataset), and then pass the right chunks of the API docs in the context (langchain style).

What do you think about it?

langchain autogpt examples without openai embeddings and faiss vector store

So I appear to have the basic integration working with oobabooga, but i've been struggling a little bit with some of your agent examples. I took a look at the langchain autogpt example here to see what some of the differences were: https://python.langchain.com/en/latest/use_cases/autonomous_agents/autogpt.html

It does a few things differently than your files, and it seems not to require the big prompt templates you are using. I have yet to successfully get any of the example agents in this repo to do more complex things than a basic single instruction. However, these examples are still tied to the OpenAI embeddings and OpenAI APIs.

I was trying to figure out how to take this example and convert it off of openai embeddings to use the sentence transformers and chroma for the memory. These might be really really great examples to re-implement in your setup and show people how to use it with local models instead of all the openai stuff.

I'm kind of close to getting it to work? I've tried porting your Chroma example over to replace the FAISS version, but I've mostly failed.

[help] Vicuna with local document

Hi,
I have a local document and I want to know how to use langchain and vicuna to read it, so that I can then talk to vicuna and discuss the content of the document.

Linux install, No module named 'gptq_for_llama'

I finally gave up on windows and installed everything on a fresh ubuntu wsl. I have everything setup properly including the nvidia toolkit, however if you try and run the server with 4 bit true, you get this error:

from gptq_for_llama.llama_inference import load_quant
ModuleNotFoundError: No module named 'gptq_for_llama'

vicuna_request_llm.py error stop + ["Observation:"]

I've modified your vicuna_chroma example to try to make it work with PDF documents. To do this I've added a load_qa_chain and passed in your VicunaLLM() as the model. The PDF data was loading in properly, but then I kept getting an error when trying to call the model in the chain. Here's what my code looks like:

chain = load_qa_chain(llm, chain_type="stuff")

query = "What is the title of this document"

docs = docsearch.similarity_search(query)

response = chain.run(input_documents=docs, question=query)

print(response)

and here's the error:

vicuna_request_llm.py", line 19, in _call
"stop": stop + ["Observation:"],
TypeError: unsupported operand type(s) for +: 'NoneType' and 'list'

If I comment out the "stop" parameter in that file, it actually works. Any idea what I'm doing wrong?

Also, your setup is working quite well! I'm able to run this locally on my GPU and I'm getting speedy responses.

error running 13b and 4bit

Hi, so I'm really interested in trying out your new document QA example. This is exactly what I was looking to prototype with a local llama model! I have found that the 4-bit quantized safetensors model loads and performs very quickly using the Oobabooga framework, so I'm really excited to see that you're helping develop code that supports it outside of the UI.

However, when I try to run your server script with 4bit set to true and the 13b model flagged, I get an error that says "triton not installed" and then "No module named 'toml'".

I created a local venv for this project and ran the install requirements. Could there be extra installation requirements for the gptq_for_llama code?

Load different models command on windows

Really love the repo here and the approach you're taking. I'm trying to get it to work and found one suggestion for your instructions. Your command to use a different model returns an error if you try to run it on Windows:

export USE_13B_MODEL=true && export USE_4BIT=true && uvicorn servers.vicuna_server:app

The export command and && chaining aren't supported. I found that you can use the set command and format it like this:

set USE_13B_MODEL=true; set USE_4BIT=true;uvicorn servers.vicuna_server:app

How to get Agent Execution running, no output from server

I'm having an issue where I'm trying to run an example using a zero shot agent and a basic tool via your short_instruction example.

If I load in the OpenAI API as the LLM and run all the other code in the example, I get exactly what I'd expect printed out in the console:

Entering new AgentExecutor chain...
I need to create a csv file and save the jokes to it.
Action: Python REPL
Action Input:
import csv

with open('catjokes.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(['Joke 1', 'Joke 2', 'Joke 3'])
    writer.writerow(['Why did the cat go to the party?', 'Because he was feline fine!', 'Why did the cat cross the road? To get to the meowtel!'])
Observation:
Thought: I now know how to save cat jokes to a csv file.

However, if I switch to the Vicuna server and run everything, with the only difference being the llm is now coming from the local server, I get nothing back in the console and my GPU gets stuck processing something but I can't tell what's going on.

Are you able to run these examples locally? I have this feeling like there's some piece of information being left out here. All agent based examples running locally through here exhibit the same behavior. It must be something to do with what's being passed into the model server, but I can't figure it out.

Thoughts?

Why create your own server?

Just thinking out loud here, but I don't understand why everybody seems to be trying to make their own server and load the models themselves. This oobabooga (https://github.com/oobabooga) repository has clearly figured it all out and they've got everything loading and running on the GPU and they even have an API. When I look at the AutoGPT and BabyAGI repos, they are all struggling to get local llama support and it makes no sense to me. Why wouldn't everybody just assume oobabooga as the back-end model serving api to replace all the OpenAI API calls? I mean, aren't we all just trying to have a drop in replacement for the OpenAI api?

I see that someone is also trying to do this: Significant-Gravitas/AutoGPT#348 (comment)

But nobody is going to beat what oobabooga has setup right now. It's super fast and it makes everything really easy, including downloading and managing the models. I see people copying the download model code as well, which is insane because you've got a great UI to do it in oobabooga.

Honest question here, but why wouldn't you plug into all that so you could focus on the actual langchain and agent aspect of this project? What is preventing everybody from just using oobabooga? Is it the embedding api? I don't know, after using oobabooga and seeing how far along it is I just don't get why other projects like AutoGPT don't have this as an option.

pip install -r requirements.txt

The installation command breaks existing python libraries. Is anyone else having this issue? I can get the server to run, but it breaks the text generation webui.

Colab

Can we have a Colab for this? So those without the option of GPU based machines can try it out?

Not pulling a model from a repo

I get this error from the latest code: HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '../learn-vicuna/vicuna-7b/'. Use repo_type argument if needed.

I want to use a 4-bit model in any case so it fits on my 8GB GPU. I have the actual model file locally; do you know if there is a way for your vicuna_server to support this?

"Unexpected MMA layout version found" assertion failed (gptq_for_llama)

I'm running into this issue when trying to run this on a Paperspace Linux instance with an NVIDIA Quadro P5000.
CUDA: Cuda compilation tools, release 10.1, V10.1.243

The same problem was reported here: oobabooga/text-generation-webui#1461

root@np2hkqa6bg:/notebooks/learn-langchain# uvicorn servers.vicuna_server:app
Using config: {'base_model_size': '13b', 'use_4bit': True, 'use_fine_tuned_lora': False, 'lora_weights': None, 'device': 'cuda', 'model_path': None, 'checkpoint_path': None}
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
  0%| | 0/12 [00:00<?, ?it/s]
python3.9: /project/lib/Analysis/Utility.cpp:136: bool mlir::supportMMA(mlir::Value, int): Assertion `(version == 1 || version == 2) && "Unexpected MMA layout version found"' failed.

UnboundLocalError: local variable 'stop_list' referenced before assignment

When trying to perform the following:

from langchain import PromptTemplate, LLMChain
from langchain_app.models.text_generation_web_ui import (
    build_text_generation_web_ui_client_llm,
)


# Create an instance of the language model with default parameters
llm = build_text_generation_web_ui_client_llm(parameters=None)


template = """
...
"""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = """
...
"""
llm_chain.run(question)

Was getting:

UnboundLocalError                         Traceback (most recent call last)
Cell In[27], line 22
     11 llm_chain = LLMChain(prompt=prompt, llm=llm)
     13 question = """
     14 
     21 """
---> 22 llm_chain.run(question)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/langchain/chains/base.py:213, in Chain.run(self, *args, **kwargs)
    211     if len(args) != 1:
    212         raise ValueError("`run` supports only one positional argument.")
--> 213     return self(args[0])[self.output_keys[0]]
    215 if kwargs and not args:
    216     return self(kwargs)[self.output_keys[0]]

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/langchain/chains/base.py:116, in Chain.__call__(self, inputs, return_only_outputs)
    114 except (KeyboardInterrupt, Exception) as e:
    115     self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 116     raise e
    117 self.callback_manager.on_chain_end(outputs, verbose=self.verbose)
    118 return self.prep_outputs(inputs, outputs, return_only_outputs)
...
     40     },
     41 )
     42 response.raise_for_status()

UnboundLocalError: local variable 'stop_list' referenced before assignment

Fix in learn-langchain/langchain_app/models/http_llm.py

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # Initialize stop_list with the current value or an empty list
        stop_list = self.parameters.get(self.stop_parameter_name, [])

        # Merge passed stop list with class parameters
        if isinstance(stop, list):
            stop_list = list(
                set(stop).union(set(self.parameters[self.stop_parameter_name]))
            )

        params = deepcopy(self.parameters)
        params[self.stop_parameter_name] = stop_list

        response = requests.post(
            self.prompt_url,
            json={
                "prompt": prompt,
                **params,
            },
        )
        response.raise_for_status()
        return self.response_extractor(response.json())

This seems to fix the error; however, I'm not sure if it is the right way to do it.

example error

I get an error while trying to use llama for embedding in your example.

embedding_function=self._embedding_function.embed_documents
AttributeError: 'function' object has no attribute 'embed_documents'


from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

import transformers

base_model= "yahma/llama-7b-hf"


model = transformers.AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = transformers.AutoTokenizer.from_pretrained(base_model,model_max_length=512,padding_side="right",use_fast=False)
def embeddings(prompt_request: EmbeddingRequest):
    params = {"prompt": prompt_request.prompt}
    print("Received prompt: ", params["prompt"])
    output = get_embeddings(model, tokenizer, params["prompt"])
    return {"response": [float(x) for x in output]}

def get_embeddings(model, tokenizer, prompt):
    input_ids = tokenizer(prompt).input_ids
    input_embeddings = model.get_input_embeddings()
    embeddings = input_embeddings(torch.LongTensor([input_ids]))
    mean = torch.mean(embeddings[0], 0).cpu().detach()
    return mean

with open("german.txt") as f:
    book = f.read()
    
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(book)
docsearch = Chroma.from_texts(
    texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]
)


while True:
    query = input("Type your search: ")
    docs = docsearch.similarity_search_with_score(query, k=1)
    for doc in docs:
        print(doc)
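
For context, the AttributeError indicates that Chroma.from_texts expects an embeddings object exposing embed_documents (and embed_query), not a bare function. A hypothetical adapter along these lines, built on the get_embeddings helper above, would satisfy that interface (a sketch only, assuming a 2023-era langchain Embeddings base class):

from typing import List
from langchain.embeddings.base import Embeddings

class LlamaMeanEmbeddings(Embeddings):
    """Hypothetical adapter around the get_embeddings() helper defined above."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def embed_query(self, text: str) -> List[float]:
        return [float(x) for x in get_embeddings(self.model, self.tokenizer, text)]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]

# docsearch = Chroma.from_texts(texts, LlamaMeanEmbeddings(model, tokenizer), metadatas=...)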

I found a way to use these models directly with Text Generation WebUI

From the README
"If you try an unsupported model, you'll see "gibberish output".
This happens for instance with https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g
If you know how to use these models directly with Text Generation WebUI please share your expertise :)"

I managed to get this working locally on Linux with:
https://huggingface.co/4bit/vicuna-13B-1.1-GPTQ-4bit-128g
https://huggingface.co/TheBloke/wizard-vicuna-13B-GPTQ
https://huggingface.co/4bit/gpt4-x-alpaca-13b-native-4bit-128g-cuda
https://huggingface.co/4bit/stable-vicuna-13B-GPTQ

If that helps, my setup:

load with:
python server.py --model vicuna-13B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type Llama --api

Currently I'm running models on an NVIDIA A2000 and consuming them from langchain just using the API endpoint... but simple stuff, no agents. Alpaca and vicuna 1.1 are the best ones for me so far.
I was about to try to use embeddings and found your repo... great work!
Trying to understand how you managed to get embeddings working xD.

Error from `load_quant`

I am using an AWS P3 8xlarge instance. I was trying to run your code and I'm getting the following error:

Loading model Models/vicuna-7B-1.1-GPTQ-4bit-128g checkpoint Models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
0%| python3: project/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::
Aborted

Specifying a local model

I really like the download model feature that oobabooga uses and wanted to test linking directly to another model using your model_path and checkpoint_path in the config. However, I can't seem to get it to work. I downloaded TheBloke's 13b model via the oobabooga downloader and linked to the directory and the file using the config variables, but I get this error:

learn-langchain\gptq_for_llama\quant\quant_linear.py", line 267, in matmul248
matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
NameError: name 'matmul_248_kernel' is not defined

I honestly find all of these "model" files super confusing. There are safetensors files, bin files, and .pt files. It's a real mess. Do you have any help or tips here? Could you provide an example of linking to a local model file? I think what happens by default is that your code uses the cached Hugging Face model download.

Model Loader

Hey, I made a model loader class that takes an LLM model file as a new parameter from the config and then dynamically looks for the function or class to load. I think this is kinda cool because it lets you swap out a different model in all of your examples without having to import the model directly; you just change one config file. Maybe this is a bit too much, but I thought I'd share.

import importlib
import inspect
from servers.load_config import Config

def load_llm(config: Config = None):
    if config is None:
        config = Config()

    try:
        module = importlib.import_module(f"langchain_app.models.{config.model_loader}")
        
        build_function = None
        found_class = None
        
        for name, obj in inspect.getmembers(module):
            if inspect.isfunction(obj) and name.startswith("build_"):
                build_function = obj
                break
            elif inspect.isclass(obj) and found_class is None and obj.__module__ == module.__name__:
                found_class = obj

        if build_function is not None:
            return build_function()
        elif found_class is not None:
            return found_class()
        else:
            raise ValueError(f"Invalid model loader: {config.model_loader}")
    except ImportError:
        raise ValueError(f"Invalid model loader: {config.model_loader}")

from langchain_app.models.model_loader import load_llm

llm = load_llm()
