jerryjliu / llama_index

LlamaIndex is a data framework for your LLM applications

Home Page: https://docs.llamaindex.ai

License: MIT License

Python 55.87% Makefile 1.36% Jupyter Notebook 41.82% Shell 0.08% Dockerfile 0.02% Starlark 0.51% JavaScript 0.33%
agents application data fine-tuning framework llamaindex llm rag vector-database

llama_index's Introduction

🗂️ LlamaIndex 🦙


LlamaIndex (GPT Index) is a data framework for your LLM application. Building with LlamaIndex typically involves working with LlamaIndex core and a chosen set of integrations (or plugins). There are two ways to start building with LlamaIndex in Python:

  1. Starter: llama-index (https://pypi.org/project/llama-index/). A starter Python package that includes core LlamaIndex as well as a selection of integrations.

  2. Customized: llama-index-core (https://pypi.org/project/llama-index-core/). Install core LlamaIndex and add your chosen LlamaIndex integration packages on LlamaHub that are required for your application. There are over 300 LlamaIndex integration packages that work seamlessly with core, allowing you to build with your preferred LLM, embedding, and vector store providers.

The LlamaIndex Python library is namespaced so that import statements containing core refer to the core package, while import statements without core refer to an integration package.

# typical pattern
from llama_index.core.xxx import ClassABC  # core submodule xxx
from llama_index.xxx.yyy import (
    SubclassABC,
)  # integration yyy for submodule xxx

# concrete example
from llama_index.core.llms import LLM
from llama_index.llms.openai import OpenAI

Important Links

LlamaIndex.TS (TypeScript/JavaScript): https://github.com/run-llama/LlamaIndexTS.

Documentation: https://docs.llamaindex.ai/en/stable/.

Twitter: https://twitter.com/llama_index.

Discord: https://discord.gg/dGcwcsnxhU.

Ecosystem

🚀 Overview

NOTE: This README is not updated as frequently as the documentation. Please check out the documentation above for the latest updates!

Context

  • LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data.
  • How do we best augment LLMs with our own private data?

We need a comprehensive toolkit to help perform this data augmentation for LLMs.

Proposed Solution

That's where LlamaIndex comes in. LlamaIndex is a "data framework" to help you build LLM apps. It provides the following tools:

  • Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.).
  • Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
  • Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
  • Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, ChatGPT, anything else).

LlamaIndex provides tools for both beginner users and advanced users. Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code. Our lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.

💡 Contributing

Interested in contributing? Contributions to LlamaIndex core as well as contributing integrations that build on the core are both accepted and highly encouraged! See our Contribution Guide for more details.

📄 Documentation

Full documentation can be found here: https://docs.llamaindex.ai/en/latest/.

Please check it out for the most up-to-date tutorials, how-to guides, references, and other resources!

💻 Example Usage

# custom selection of integrations to work with core
pip install llama-index-core
pip install llama-index-llms-openai
pip install llama-index-llms-replicate
pip install llama-index-embeddings-huggingface

Examples are in the docs/examples folder. Indices are in the indices folder (see list of indices below).

To build a simple vector store index using OpenAI:

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)

To build a simple vector store index using non-OpenAI LLMs, e.g. Llama 2 hosted on Replicate, where you can easily create a free trial API token:

import os

os.environ["REPLICATE_API_TOKEN"] = "YOUR_REPLICATE_API_TOKEN"

from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.replicate import Replicate
from transformers import AutoTokenizer

# set the LLM
llama2_7b_chat = "meta/llama-2-7b-chat:8e6975e5ed6174911a6ff3d60540dfd4844201974602551e10e9e87ab143d81e"
Settings.llm = Replicate(
    model=llama2_7b_chat,
    temperature=0.01,
    additional_kwargs={"top_p": 1, "max_new_tokens": 300},
)

# set tokenizer to match LLM
Settings.tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf"
)

# set the embed model
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(
    documents,
)

To query:

query_engine = index.as_query_engine()
response = query_engine.query("YOUR_QUESTION")
print(response)

By default, data is stored in-memory. To persist to disk (under ./storage):

index.storage_context.persist()

To reload from disk:

from llama_index.core import StorageContext, load_index_from_storage

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./storage")
# load index
index = load_index_from_storage(storage_context)

🔧 Dependencies

We use poetry as the package manager for all Python packages. As a result, the dependencies of each Python package can be found by referencing the pyproject.toml file in each of the package's folders.

cd <desired-package-folder>
pip install poetry
poetry install --with dev

📖 Citation

Reference to cite if you use LlamaIndex in a paper:

@software{Liu_LlamaIndex_2022,
  author = {Liu, Jerry},
  doi = {10.5281/zenodo.1234},
  month = {11},
  title = {{LlamaIndex}},
  url = {https://github.com/jerryjliu/llama_index},
  year = {2022}
}


llama_index's Issues

UTF-8 decoding error when trying to read a data folder on macOS

    documents = SimpleDirectoryReader('data').load_data()
  File "/usr/local/lib/python3.10/site-packages/gpt_index/readers/file.py", line 33, in load_data
    data = f.read()
  File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

on this data: https://github.com/awsdocs/amazon-quicksight-user-guide/tree/main/doc_source

Use custom prompt

I was wondering how I can use a custom prompt/template with the index.query() method.
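
For reference, here is a minimal sketch of passing a custom QA prompt using the current llama_index.core API (the legacy index.query() interface has since been replaced by query engines); the template text and question are illustrative, and index is assumed to be a previously built vector index:

from llama_index.core import PromptTemplate

# custom QA template; {context_str} and {query_str} are filled in by the query engine
qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, answer the question: {query_str}\n"
)

query_engine = index.as_query_engine(text_qa_template=qa_prompt)
response = query_engine.query("YOUR_QUESTION")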

Build embedding-integrated indices (e.g. FAISS)

Integrate FAISS embeddings with existing data structures (list, tree). For instance, we could have a FaissList where every document chunk is a vector, we accumulate the top document chunks by dot product, and then use the retrieved chunks to synthesize an answer.

We could even have a FaissTree where we use a similar summarization procedure as GPTTreeIndex, but we convert each piece of text to an additional embedding. Then traversal becomes dot product against embeddings rather than purely text-based reasoning.

The original design exercise of GPT Index was to do text-based-only traversal, but now we can try focusing this on practical use.

query mode "embedding" not supported by GPTSimpleVectorIndex?

According to the documentation, GPTSimpleVectorIndex may be queried with "embedding" mode (last example of https://gpt-index.readthedocs.io/en/latest/how_to/embeddings.html). However when I do so, I get an error.

embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
index = GPTSimpleVectorIndex.load_from_disk(
    save_path="index.json",
    embed_model=embed_model,
)

response = index.query(
    query_string,     
    mode="embedding",
)

error:

  File "/usr/local/lib/python3.10/site-packages/gpt_index/indices/base.py", line 334, in query
    return query_runner.query(query_str, self._index_struct)
  File "/usr/local/lib/python3.10/site-packages/gpt_index/indices/query/query_runner.py", line 100, in query
    query_cls = get_query_cls(index_struct_type, mode)
  File "/usr/local/lib/python3.10/site-packages/gpt_index/indices/query/query_map.py", line 79, in get_query_cls
    return MODE_TO_QUERY_MAP_SIMPLE[mode]

Is this an actual error or is it not possible/recommended to use embeddings with GPTSimpleVectorIndex? If I use Tree or List indices, then when I save them to disk, there are no actual embeddings, so I am confused on the correct set of classes to use for top-k embedding based queries.
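
For reference, in current llama_index versions top-k embedding retrieval is the default query path for a vector index; a minimal sketch (the top-k value and data directory are illustrative):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# retrieve the 3 most similar chunks by embedding, then synthesize an answer
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("YOUR_QUESTION")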

GPTSQLStructStoreIndex does not work with views

SQL Alchemy does not load view metadata on:

SQLDatabase

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        """Init params."""
        super().__init__(*args, **kwargs)
        self.metadata_obj = MetaData(bind=self._engine)
        self.metadata_obj.reflect()

Which causes the initialization of the index to fail with a KeyError

GPTSQLStructStoreIndex

 def __init__(...):
    ...
    table = self.sql_database.metadata_obj.tables[table_name]   <--- XXX no such table name
    ...

installation with pip does not work

Something seems wrong with import paths. With a pip install gpt_index I see this error:

#+BEGIN_SRC jupyter-python
from gpt_index import GPTTreeIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('../../../../bibliography/literature-summaries/').load_data()
documents
#+END_SRC

#+RESULTS:
:RESULTS:
# [goto error]
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/var/folders/3q/ht_2mtk52hl7ydxrcr87z2gr0000gn/T/ipykernel_18523/2206965719.py in <cell line: 1>()
----> 1 from gpt_index import GPTTreeIndex, SimpleDirectoryReader
      2 documents = SimpleDirectoryReader('../../../../bibliography/literature-summaries/').load_data()
      3 documents

~/opt/anaconda3/lib/python3.8/site-packages/gpt_index/__init__.py in <module>
     23 # readers
     24 from gpt_index.readers.file import SimpleDirectoryReader
---> 25 from gpt_index.readers.google.gdocs import GoogleDocsReader
     26 from gpt_index.readers.mongo import SimpleMongoReader
     27 from gpt_index.readers.notion import NotionPageReader

ModuleNotFoundError: No module named 'gpt_index.readers.google'
:END:

Faiss load_from_disk doesn't actually work

Here's a minimal example:

  1. First, create and save the FAISS index:

import faiss
from gpt_index import GPTFaissIndex, SimpleDirectoryReader

faiss_index = faiss.IndexFlatL2(1536)
documents = SimpleDirectoryReader("data").load_data()
index = GPTFaissIndex(documents, faiss_index=faiss_index)
index.save_to_disk("index_faiss.json", faiss_index_save_path="index_faiss_core.index")

  2. Now, try to load the same index:

new_index = GPTFaissIndex.load_from_disk(
    save_path="index_faiss.json", faiss_index_save_path="index_faiss_core.index"
)
print(new_index._faiss_index.ntotal)

Notice that 0 documents exist.

And just to verify that loading it regularly works, try this:

import faiss

faiss_index = faiss.read_index("index_faiss_core.index")
print(faiss_index.ntotal)

This was a weird one to track down, because it doesn't actually "fail". Instead, we always return Node 0 with a distance of infinity.

I poked around the code a bit but couldn't figure out where exactly the bug is. Separately, we should also include a quick sanity check to ensure that the faiss_index isn't empty if we're loading it from disk.

specifying the model name isn't working in the latest version

Sample Code -

llm = OpenAI(temperature=0.7, model_name="text-curie-001")
llm_predictor = LLMPredictor(llm)
prompt_helper = PromptHelper.from_llm_predictor(llm_predictor)

index = GPTListIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index.query(prompt, response_mode="tree_summarize")

The breaking change seems to have happened in this commit. The response builder in query is created using the default davinci llm_predictor here instead of the passed in predictor. This is because query_runner doesn't pass in the llm_predictor while creating the query obj and sets it later.

Error Recovery Part (ii): pick up where you left off

Now that we've added retries with exponential backoff in #215, it would be cool to add support for "picking up where you left off". From the example in #210:

>>> index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
> Building index from nodes: 502 chunks
0/5029
10/5029
20/5029
30/5029
40/5029
50/5029
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
// stack trace and error

If we run index = GPTTreeIndex(documents, prompt_helper=prompt_helper), we'd have to start from the beginning. With 502 chunks above, that's a lot of computation we'd be redoing, not to mention token budget gone to waste!

It would be cool if this happened instead:

>>> index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
> Building index from nodes: 502 chunks
> continuing from chunk 50:
50/5029
60/5029
...
// hopefully no errors this time

I can think of two ways this might be done:

  1. Some global in gpt_index tracks state that stores the results of the computation from the failed run. This might not be great, since tracking state like this is confusing.
  2. If we run into errors during, say, the build step of the index, we return some result anyway, which can then be fed into the next call of GPTTreeIndex(documents, prompt_helper=prompt_helper). This might be possible today with index composability?

If we added support for this, I believe it would give developers more confidence to index larger sets of documents.

Going over the maximum context length when building an index

Hey, I'm getting an error when building an index because gpt_index is trying to go over the maximum context length.

openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 4169 tokens (3913 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

to reproduce:

from gpt_index import GPTTreeIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()
index = GPTTreeIndex(documents)

# save to disk
index.save_to_disk('index.json')

on this data: https://github.com/awsdocs/amazon-quicksight-user-guide

Process finished with exit code 0

Following the code in the gpt_index starter tutorial:

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader
from IPython.display import Markdown, display

documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex(documents)

response = index.query("What did the author do growing up?")
print(response)

I am receiving Process finished with exit code 0 as opposed to "The author wrote short stories and tried to program on an IBM 1401.", which I am supposed to receive according to the tutorial.

I don't think I've provided an API key yet. That sounds like something I would have needed to have done to get a response from the LLM? However, I don't think the tutorial prompted me to put my key anywhere just yet.

can't load data

I have a txt file in data/ and I'm trying to load it. However, it's failing:
[screenshot of the error omitted]

Manual Restoration of Index silently fails

Hi! Hope all is well!

I'm writing and loading my index to AWS S3 using the following:

def __serialize_index(self, gpt_idx: GPTListIndex):
    out_dict: Dict[str, dict] = {
        "index_struct": gpt_idx.index_struct.to_dict(),
        "docstore": gpt_idx.docstore.to_dict(),
    }

    return json.dumps(out_dict)

def __deserialize_index(self, serialized_idx: str):
    idx_dict = json.loads(serialized_idx)

    index_struct = GPTSimpleVectorIndex.index_struct_cls.from_dict(
        idx_dict["index_struct"]
    )
    docstore = DocumentStore.from_dict(idx_dict["docstore"])
    print(index_struct, docstore)
    return GPTSimpleVectorIndex(docstore=docstore, index_struct=index_struct)

these two functions are used here

def get(self, key):
    if key in self.cache:
        return self.cache[key]

    idx_json = self.s3_service.get_index_from_s3(key)

    if idx_json is None:
        return None
    print("json", idx_json)
    deserialized_idx = self.__deserialize_index(idx_json)

    self.__add_to_cache(key, deserialized_idx)

    return deserialized_idx

def flush(self, key, evict=True):
    prev_idx = self.cache[key]

    if evict:
        prev_idx = self.cache.pop(key)

    serialized_idx = self.__serialize_index(prev_idx)

    self.s3_service.write_index_to_s3(key, serialized_idx)

Unfortunately, when I restore my app, everything gets loaded and it seems like it works, but every time I try to make a query I get the following:

Empty Response

reading into the logs I found this

> [query] Total LLM token usage: 0 tokens
> [query] Total embedding token usage: 3 tokens

which weirdly seems like the embedding is working but the LLM is not.

Hoping someone might have some insight! 🙂

RateLimitError when using GPTSimpleVectorIndex

I'm getting a RateLimitError when constructing a GPTSimpleVectorIndex as given below:

from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex
documents = SimpleDirectoryReader('./data').load_data()

index = GPTSimpleVectorIndex(documents)
index.save_to_disk('./index.json')

I'm currently using OpenAI's free trial for testing and it has a 150,000 tokens / minute hard limit. Is there some way to add a delay between API calls?
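
One possible workaround is to retry the build with exponential backoff whenever the rate limit is hit. A rough sketch, assuming the legacy openai SDK (<1.0) that gpt_index used at the time; the wait times are arbitrary:

import time

import openai
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader


def build_with_backoff(documents, max_retries=5):
    # crude workaround: restart the build with exponentially longer waits
    # whenever the free-trial rate limit is hit
    for attempt in range(max_retries):
        try:
            return GPTSimpleVectorIndex(documents)
        except openai.error.RateLimitError:
            wait = 10 * 2 ** attempt  # 10s, 20s, 40s, ...
            print(f"Rate limited; sleeping {wait}s before retrying")
            time.sleep(wait)
    raise RuntimeError("Still rate limited after retries")


documents = SimpleDirectoryReader('./data').load_data()
index = build_with_backoff(documents)
index.save_to_disk('./index.json')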

ImportError: cannot import name 'Version' from 'packaging.version'

requirements.txt

gpt_index
numpy==1.23.5
pandas==1.5.2
torch
tensorflow
slack_sdk==3.19.5

generate_index.py

import gpt_index
reader = gpt_index.SlackReader(slack_token='XXX')
documents = reader.load_data(channel_ids=[
    'XXX',
    'XXX',
])
index = gpt_index.GPTTreeIndex(documents)
index.save_to_disk('gpt-index.json')

python3.10 ./generate_index.py returns error

Traceback (most recent call last):
  File "/mnt/c/Users/Slach/Downloads/altinity.staff/src/github.com/altinity/slack-qa/./generate_index.py", line 1, in <module>
    import gpt_index
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/__init__.py", line 9, in <module>
    from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/indices/keyword_table/base.py", line 15, in <module>
    from gpt_index.indices.base import (
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/indices/base.py", line 17, in <module>
    from gpt_index.indices.data_structs import IndexStruct
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/indices/data_structs.py", line 9, in <module>
    from dataclasses_json import DataClassJsonMixin
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/dataclasses_json/__init__.py", line 2, in <module>
    from dataclasses_json.api import (DataClassJsonMixin,
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/dataclasses_json/api.py", line 6, in <module>
    from dataclasses_json.cfg import config, LetterCase  # noqa: F401
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/dataclasses_json/cfg.py", line 5, in <module>
    from marshmallow.fields import Field as MarshmallowField
  File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/marshmallow/__init__.py", line 3, in <module>
    from packaging.version import Version
ImportError: cannot import name 'Version' from 'packaging.version' (/home/slach/venv/slack-qa/lib/python3.10/site-packages/packaging/version.py)

Could you suggest the proper library versions?

requested number of tokens exceed the max supported by the model

Sample code

text = "...."    # > 40k characters

documents = [Document(text)]
llm = OpenAI(temperature=0.7, model_name="text-curie-001")
llm_predictor = LLMPredictor(llm)
prompt_helper = PromptHelper.from_llm_predictor(llm_predictor)

index = GPTListIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index.query(prompt, response_mode="tree_summarize")

Surprisingly, this seems to be happening for me for all long texts. It doesn't happen when davinci is used, though, and it went unnoticed at first due to #182. Any way I can help in debugging? Which function/file should I look into?

Stack Trace

File "/Users/tushar/PycharmProjects/transcription/transcription/gptindex.py", line 35, in _list_index_summarize
    return str(index.query(prompt, response_mode="tree_summarize"))
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/base.py", line 322, in query
    return query_runner.query(query_str, self._index_struct)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/query_runner.py", line 106, in query
    return query_obj.query(query_str, verbose=self._verbose)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/utils.py", line 113, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/base.py", line 233, in query
    response = self._query(query_str, verbose=verbose)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/base.py", line 222, in _query
    response_str = self._give_response_for_nodes(
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/base.py", line 183, in _give_response_for_nodes
    response = self.response_builder.get_response(
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/response/builder.py", line 239, in get_response
    return self._get_response_tree_summarize(
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/response/builder.py", line 210, in _get_response_tree_summarize
    root_nodes = index_builder.build_index_from_nodes(
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/common/tree/base.py", line 103, in build_index_from_nodes
    new_summary, _ = self._llm_predictor.predict(
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/langchain_helpers/chain_wrapper.py", line 96, in predict
    llm_prediction = self._predict(prompt, **prompt_args)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/langchain_helpers/chain_wrapper.py", line 82, in _predict
    llm_prediction = llm_chain.predict(**full_prompt_args)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 103, in predict
    return self(kwargs)[self.output_key]
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 146, in __call__
    raise e
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 142, in __call__
    outputs = self._call(inputs)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 87, in _call
    return self.apply([inputs])[0]
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 78, in apply
    response = self.generate(input_list)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 73, in generate
    response = self.llm.generate(prompts, stop=stop)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/llms/base.py", line 81, in generate
    raise e
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/llms/base.py", line 77, in generate
    output = self._generate(prompts, stop=stop)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/llms/openai.py", line 155, in _generate
    response = self.client.create(prompt=_prompts, **params)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_resources/completion.py", line 25, in create
    return super().create(*args, **kwargs)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 115, in create
    response, _, api_key = requestor.request(
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_requestor.py", line 181, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_requestor.py", line 396, in _interpret_response
    self._interpret_response_line(
  File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_requestor.py", line 429, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 2049 tokens, however you requested 3159 tokens (2903 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

Manipulating GPT Responses (Prefixes & Prompting for Formatting)

I'm querying GPT with a table like:

Name:
Date:
Location:
SF:

I would love for the response to just be the answer alone, since I am writing the answer to an Excel file that already has the table; my result currently ends up looking like:

Name: Name: Bob
Date: Date: 1/1/2023
Location: Location: New York
SF: SF: 2000 square feet
Months: Months: 20 months

  1. How do I prevent it from repeating the question in this way
  2. How do I prevent it from providing units (i.e 2000 "square feet" vs. "2000")

Thank you!

Can this be used for text generation?

I just ran into gpt_index, awesome project!

I took a quick look at the prompts docs and I see that it's geared towards summarization and QA. I have a project where I need to pass in a lot of context for text generation, so I was wondering if gpt_index could be used for that as well? I don't fully understand yet how gpt_index works, so excuse me if this is a dumb question ;)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 8: character maps to

pip install gpt-index leads to the error below:

Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      Traceback (most recent call last):
        File "F:\Python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 351, in <module>
          main()
        File "F:\Python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 333, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "F:\Python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
          self.run_setup()
        File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 484, in run_setup
          super(_BuildMetaLegacyBackend,
        File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
          exec(code, locals())
        File "<string>", line 11, in <module>
        File "F:\Python39\lib\encodings\cp1252.py", line 23, in decode
          return codecs.charmap_decode(input,self.errors,decoding_table)[0]
      UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 8: character maps to <undefined>
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Gatsby index may be out of date

Exciting project! I'm looking to build an AI assistant that can answer questions using hundreds of thousands of words of loosely organized notes as context. gpt-index seems like a promising route.


Attempting to load index_gatsby.json from disk yields the following KeyError.

In [12]: index = GPTTreeIndex.load_from_disk('examples/gatsby/index_gatsby.json')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-65a5668c7898> in <module>
----> 1 index = GPTTreeIndex.load_from_disk('examples/gatsby/index_gatsby.json')

~/code/github/jerryjliu/gpt_index/gpt_index/indices/base.py in load_from_disk(cls, save_path, **kwargs)
    340         with open(save_path, "r") as f:
    341             result_dict = json.load(f)
--> 342             index_struct = cls.index_struct_cls.from_dict(result_dict["index_struct"])
    343             docstore = DocumentStore.from_dict(result_dict["docstore"])
    344             return cls(index_struct=index_struct, docstore=docstore, **kwargs)

KeyError: 'index_struct'

This could be because index_gatsby.json is from an outdated version of the project. The Paul Graham essay example index loads just fine for me.

Google credential URI changes every time

The issue with the document reader for Google Docs is that it uses a new port every time, and hence the underlying URI for the request changes. Can we add a port option?
[screenshot omitted]

Add "simpler" data structures (to table data structure, to start)

Currently the table and tree indices all call GPT when building the index. This is not optimal, because calls to GPT incur latency and cost. There are ways to build indices without needing to invoke GPT, and only invoke GPT during query time (when traversing the index).

We can start with the keyword table, for instance. We can develop a simple keyword extractor that extracts keywords without invoking GPT at all (both during index creation and query). We only need to invoke GPT when synthesizing an answer.
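
A sketch of what a GPT-free keyword extractor could look like, using simple frequency counts and a small stopword list (purely illustrative, not the actual implementation):

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "that", "this", "with", "we", "as", "by"}


def extract_keywords(text, max_keywords=10):
    # lowercase word tokens, drop stopwords and very short tokens,
    # and keep the most frequent remaining words as "keywords"
    tokens = re.findall(r"[a-z][a-z0-9_-]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(max_keywords)]


print(extract_keywords("GPT Index builds a keyword table index over document chunks."))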

How many tokens per search ?

I am planning on integrating gpt_index into a project of mine. I would like to have semantic search over a document. Is this library as expensive as the pricing mentioned here -> https://gpttools.com/searchtokens?
Or is it that the initial build costs a lot of tokens, and after the initial build queries are much cheaper? Thanks.

SimpleDirectoryReader subfolders

Unless I'm doing something wrong it doesn't seem that the SimpleDirectoryReader loads documents in subfolders. Would it be possible to add this?

Thank you! 🙏
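
For what it's worth, newer llama_index releases let SimpleDirectoryReader walk subfolders via a recursive flag; a minimal sketch (check the signature of your installed version):

from llama_index.core import SimpleDirectoryReader

# recursive=True also picks up files in subfolders of data/
documents = SimpleDirectoryReader("data", recursive=True).load_data()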

ValueError: Error raised by inference API: Input is too long for this model, shorten your input or use 'parameters': {'truncation': 'only_first'} to run the model only on the first part. when using gpt2 from huggingface

I run into the following error when using gpt2 from huggingface -

ValueError: Error raised by inference API: Input is too long for this model, shorten your input or use 'parameters': {'truncation': 'only_first'} to run the model only on the first part.

Can the index not be built chunk by chunk? Or am I missing something?

ValueError: text_id not found in id_map when loading an index from disk with GPTFaissIndex.load_from_disk

In the example notebook, FaissIndexDemo.ipynb when loading an index from disk and querying the index the following error is thrown:

     76 if not query_obj._llm_predictor_set:
     77     query_obj.set_llm_predictor(self._llm_predictor)
---> 79 return query_obj.query(query_str, verbose=self._verbose)
...
--> 180         raise ValueError("text_id not found in id_map")
    181     int_id = self.id_map[text_id]
    182     if int_id not in self.nodes_dict:

ValueError: text_id not found in id_map

Building a new index and querying it works.

Indexing should recover from errors more gracefully

I've been trying to generate a tree index, but I'm hitting OAI ratelimits. The problem is that this forces me to start the index from scratch again, which is time-consuming and expensive.

If a ratelimit gets hit, the index should retry, or at the very least save some sort of intermediate state so you can resume indexing later.

GPT Index and Typescript / Node.js

I work mostly with TypeScript and Node.js, and I think many others do. Any ideas on how to make this compatible? Is your API getting stable already? I guess it must be possible, then, to create a wrapper package in Node.js. I'd like to get in touch with anyone interested in this; I might make it.

UPDATE

What are some of the principles of GPT Index?

It seems to be a library that contains the result of a massive amount of work over months. It is certainly the start of "AI search" that helps you answer questions appropriately against large knowledge bases. It is the thing I need.

Can I use GPT Index in node.js easily?

No. We would need to create a TypeScript wrapper around their Python API if we want to use it within Node.js. This would be an enormous task, and we would also become dependent on an ever-changing library that might make choices I don't like. For such a crucial part, I think I would be better off creating my own Node.js implementation. Especially looking at the long term, this seems better.

Conclusion

I figured it would be a hassle to make this compatible with TypeScript, and since GPT indexing is at the core of my company, I decided I will at least try to make my own implementation in Node.js that uses similar principles.

I will try to replicate GPT Index as much as possible and as needed in a TypeScript Node.js package. Anyone that wants to help me: please get in touch; https://calendly.com/karsens

I'll regularly update my work in https://github.com/CodeFromAnywhere/gpt-index-js

Build connector with mongodb

Define an interface where we can construct an index from mongo and save it to mongo. This will help us hook up data connectors to our core abstractions.

(Later on once we add insert/delete to indices we can also hook it up to this data store).
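
A minimal sketch of what such a connector could look like, reading documents straight out of a collection with pymongo (the connection details and field names are placeholders, not a final interface):

from pymongo import MongoClient
from gpt_index import Document, GPTSimpleVectorIndex


def load_documents_from_mongo(uri, db_name, collection_name, text_field="text"):
    # pull raw records out of Mongo and wrap each one as a Document
    collection = MongoClient(uri)[db_name][collection_name]
    return [Document(record[text_field]) for record in collection.find({})]


docs = load_documents_from_mongo("mongodb://localhost:27017", "my_db", "my_collection")
index = GPTSimpleVectorIndex(docs)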

Refactor: outer and inner methods for insert/query/build_index

Follow up to this PR comment: https://github.com/jerryjliu/gpt_index/pull/85/files#r1045482090

When the PR above lands, the placement of the @llm_token_counter() decorator will be confusing, because it's done on methods found both in the base and implementation classes. Better to have inner (defined in subclass) and outer (defined in base class) methods. Pasting a chat from somewhere else

we could define "outer" methods (build_index, query, insert) on the base class, and these outer methods call "inner" methods (_build_index, _query, _insert) that are abstract and implemented by subclasses. Then token counting could take place in the outer method since it's shared among subclasses.
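
A simplified sketch of the proposed pattern (illustrative class names, not the actual gpt_index code):

from abc import ABC, abstractmethod


class BaseIndex(ABC):
    def __init__(self):
        self.total_tokens_used = 0

    def query(self, query_str):
        # "outer" method: shared bookkeeping (e.g. token counting) lives here
        start = self.total_tokens_used
        response = self._query(query_str)
        print(f"tokens used by query: {self.total_tokens_used - start}")
        return response

    @abstractmethod
    def _query(self, query_str):
        # "inner" method: each index subclass implements only the core logic
        ...


class ListIndexExample(BaseIndex):
    def _query(self, query_str):
        self.total_tokens_used += 42  # stand-in for real LLM calls
        return f"answer to {query_str!r}"


print(ListIndexExample().query("What is GPT Index?"))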

KeyError when using MockLLMPredictor in GPTTreeIndex

I ran the TokenPredictor.ipynb given in the examples and got the following error:

from gpt_index import GPTTreeIndex, MockLLMPredictor, SimpleDirectoryReader
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()
llm_predictor = MockLLMPredictor(max_tokens=256)
index = GPTTreeIndex(documents, llm_predictor=llm_predictor)
KeyError                                  Traceback (most recent call last)
d:\Capture\Python\OpenAI\GPT Index\examples\cost_analysis\TokenPredictor.ipynb Cell 7 in <cell line: 1>()
----> [1](vscode-notebook-cell:/d%3A/Capture/Python/OpenAI/GPT%20Index/examples/cost_analysis/TokenPredictor.ipynb#W6sZmlsZQ%3D%3D?line=0) index = GPTTreeIndex(documents, llm_predictor=llm_predictor)

File c:\Users\mmz\AppData\Local\Programs\Python\Python310\lib\site-packages\gpt_index\indices\tree\base.py:65, in GPTTreeIndex.__init__(self, documents, index_struct, summary_template, insert_prompt, num_children, llm_predictor, build_tree, **kwargs)
     63 self.insert_prompt: TreeInsertPrompt = insert_prompt or DEFAULT_INSERT_PROMPT
     64 self.build_tree = build_tree
---> 65 super().__init__(
     66     documents=documents,
     67     index_struct=index_struct,
     68     llm_predictor=llm_predictor,
     69     **kwargs,
     70 )

File c:\Users\mmz\AppData\Local\Programs\Python\Python310\lib\site-packages\gpt_index\indices\base.py:86, in BaseGPTIndex.__init__(self, documents, index_struct, llm_predictor, docstore, prompt_helper, chunk_size_limit, verbose)
     84 self._validate_documents(documents)
     85 # TODO: introduce document store outside __init__ function
---> 86 self._index_struct = self.build_index_from_documents(
     87     documents, verbose=verbose
     88 )

File c:\Users\mmz\AppData\Local\Programs\Python\Python310\lib\site-packages\gpt_index\utils.py:113, in llm_token_counter.<locals>.wrap.<locals>.wrapped_llm_predict(_self, *args, **kwargs)
    111 def wrapped_llm_predict(_self: Any, *args: Any, **kwargs: Any) -> Any:
    112     start_token_ct = _self._llm_predictor.total_tokens_used
...
---> 18 num_text_tokens = len(globals_helper.tokenizer(prompt_args["text"]))
     19 token_limit = min(num_text_tokens, max_tokens)
     20 return " ".join(["summary"] * token_limit)

KeyError: 'text'

Cost estimator

More of a nice-to-have.

Since building the index involves calls to GPT-3, it might be nice to have a verbose output option that counts the number of tokens involved, which will help devs get a feel for the costs involved with using this package.
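
A rough sketch of the idea using tiktoken to count tokens up front (the price constant is a placeholder to be filled in from OpenAI's price list, and the encoding name is an assumption):

import tiktoken

PRICE_PER_1K_TOKENS = 0.02  # placeholder: fill in the real rate for your model


def estimate_build_cost(texts, encoding_name="cl100k_base"):
    # count tokens across all document chunks and convert to a dollar estimate
    enc = tiktoken.get_encoding(encoding_name)
    total_tokens = sum(len(enc.encode(text)) for text in texts)
    return total_tokens, total_tokens / 1000 * PRICE_PER_1K_TOKENS


tokens, cost = estimate_build_cost(["some document text", "another chunk"])
print(f"~{tokens} tokens, ~${cost:.4f}")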

Run index operations in the background

Today, if I build a gpt index like this:

>>> index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
> Building index from nodes: 502 chunks
0/5029
10/5029
...

This may take a while, and I'm blocked from doing anything else before then. (The same can be said for querying).

If I'm building some app on top of GPT index, and have an endpoint to start the build, like below:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/build', methods=['POST'])
def build():
    # get documents and prompt_helper
    index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
    # do something with index
    return {
        "message": "build complete"
    }

then I have to wait for the build to complete before getting a response.

I'm looking for ideas on how to support running the build in the background, and checking status, something like below:

# same flask boilerplate
@app.route('/build', methods=['POST'])
def build():
    # stuff
    index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
    # more stuff + produces an id to check on later
    return {
        "message": "started building the index",
        "task_id": id
    }

# stuff
def status():
    ...
    # get id
    # returns a message that says that the build is complete, or is x% done

This should be possible with Python's threading library, or with a task queue like Celery. However, it probably gets complicated depending on your application, e.g. if you have more than one process.

I'm currently thinking of ways to support this within gpt_index, whether it's adding extra functionality (without bloating the library), or adding some code samples somewhere so that no one's starting from scratch. If you have ideas, please feel free to add them here!
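
A sketch of the threading-based approach mentioned above (simplified; documents and prompt_helper are assumed to be created as in the earlier snippets, and a production app would likely reach for Celery or similar):

import threading
import uuid

from flask import Flask, jsonify
from gpt_index import GPTTreeIndex

app = Flask(__name__)
tasks = {}  # task_id -> {"status": ..., "index": ...}


def build_index(task_id, documents, prompt_helper):
    # runs in a background thread so the HTTP request returns immediately
    tasks[task_id]["status"] = "building"
    index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
    tasks[task_id].update(status="done", index=index)


@app.route('/build', methods=['POST'])
def build():
    # get documents and prompt_helper as before
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "queued"}
    threading.Thread(
        target=build_index, args=(task_id, documents, prompt_helper), daemon=True
    ).start()
    return jsonify({"message": "started building the index", "task_id": task_id})


@app.route('/status/<task_id>', methods=['GET'])
def status(task_id):
    return jsonify({"status": tasks.get(task_id, {}).get("status", "unknown")})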
