[Bug]: index.ref_doc_info does not work with chromadb about llama_index HOT 6 OPEN

Rust-Ninja-Sabi commented on August 16, 2024

[Bug]: index.ref_doc_info does not work with chromadb

Comments (6)

dosubot commented on August 16, 2024

Hey @Rust-Ninja-Sabi! I'm here to help you with any bugs, questions, or contributions you have for this repository. I'll assist you while waiting for a human maintainer.

index.ref_doc_info does not work with chromadb in version 0.10.41 because the ChromaVectorStore class does not have an attribute or method named ref_doc_info. The class is designed to handle operations related to nodes and their embeddings within a ChromaDB collection, but it does not include any functionality related to ref_doc_info ^[1]^[2].

The specific error message encountered is:

raise NotImplementedError(
    "Vector store integrations that store text in the vector store are "
    "not supported by ref_doc_info yet."
)
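
The guard behind this error can be illustrated with a minimal sketch. It assumes a simplified model in which the index checks a `stores_text` flag on its vector store; `ToyVectorStore`, `ToyIndex`, and `doc_to_nodes` are hypothetical names for illustration, not the llama_index classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyVectorStore:
    """Hypothetical stand-in for a vector store integration."""

    stores_text: bool


@dataclass
class ToyIndex:
    """Hypothetical, simplified index; not the llama_index implementation."""

    vector_store: ToyVectorStore
    # Maps ref_doc_id -> node ids; only populated when the index keeps its own records.
    doc_to_nodes: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def ref_doc_info(self) -> Dict[str, List[str]]:
        # Stores that keep node text themselves bypass the index-side records,
        # so there is nothing to report and the property refuses to answer.
        if self.vector_store.stores_text:
            raise NotImplementedError(
                "Vector store integrations that store text in the vector store are "
                "not supported by ref_doc_info yet."
            )
        return dict(self.doc_to_nodes)


# A store that does not hold text: the index-side mapping is available.
local = ToyIndex(ToyVectorStore(stores_text=False), {"doc-1": ["n1", "n2"]})
print(local.ref_doc_info)

# A Chroma-like store that holds text itself: the property raises.
chroma_like = ToyIndex(ToyVectorStore(stores_text=True))
try:
    chroma_like.ref_doc_info
except NotImplementedError as exc:
    print("raised:", exc)
```

Under this assumption the error says nothing about the Chroma collection being broken; it only signals that the index holds no document records of its own to return.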

For managing document references and metadata in vector stores that do not support ref_doc_info, you can follow recommended practices such as storing nodes in the document store and ensuring metadata is properly managed ^[3]^[4].

To continue talking to Dosu, mention @dosu.

dosubot commented on August 16, 2024

Document management is available for the chromadb vector database. The provided code demonstrates how to create, add, delete, and query documents within a ChromaVectorStore using chromadb.

Here is an example of how to set up and use the ChromaVectorStore:

import chromadb
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.vector_stores.chroma import ChromaVectorStore

# Create an in-memory Chroma client and a collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("example_collection")

# Wrap the collection in a ChromaVectorStore
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Example of adding nodes
nodes = [...]  # List of BaseNode instances with embeddings
vector_store.add(nodes)

# Example of querying nodes
query = VectorStoreQuery(query_embedding=[...], similarity_top_k=5)
result = vector_store.query(query)

# Example of deleting nodes
vector_store.delete_nodes(node_ids=["node_id_1", "node_id_2"])

Additionally, the ChromaReader class from the llama_index library allows you to retrieve documents from existing persisted Chroma collections. Here is an example of how to use the ChromaReader:

from llama_index.legacy.readers.chroma import ChromaReader

# Initialize the ChromaReader
chroma_reader = ChromaReader(
    collection_name="example_collection",
    persist_directory="/path/to/persisted/collection"
)

# Load data from the collection
documents = chroma_reader.load_data(query_embedding=[...], limit=10)

For persistent storage, you can use the PersistentClient to manage your Chroma collections. Here is an example:

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

def get_vector_store(docset_id: str) -> ChromaVectorStore:
    # Open (or create) the on-disk Chroma database and wrap one collection
    db = chromadb.PersistentClient(path="/path/to/persisted/collection")
    chroma_collection = db.get_or_create_collection(docset_id)
    return ChromaVectorStore(chroma_collection=chroma_collection)

This setup ensures that your document management with chromadb is both flexible and persistent ^[1]^[2]^[3]^[4].

dosubot commented on August 16, 2024

To query and retrieve all nodes from the ChromaVectorStore, you can use the query method with an empty query. Here is an example:

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.vector_stores.types import VectorStoreQuery

# Create a Chroma client and collection
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("example_collection")

# Set up the ChromaVectorStore
vector_store = ChromaVectorStore.from_collection(collection)

# Retrieve all nodes
query = VectorStoreQuery(query_embedding=[], similarity_top_k=0)
all_nodes_result = vector_store.query(query)

# all_nodes_result.nodes will contain the list of all nodes in the collection
print(all_nodes_result.nodes)

This code will retrieve all nodes from the ChromaVectorStore by calling the query method with an empty query embedding and similarity_top_k set to 0. The query method will return a VectorStoreQueryResult object containing all nodes in the collection ^[1]^[2]^[3].

Rust-Ninja-Sabi commented on August 16, 2024

@dosu When will it work with ChromaDB? I use LlamaIndex to have one API for all available stores. But as far as I understand, document management is not available for chromadb. Is it available for any vector database? Thanks.

Rust-Ninja-Sabi commented on August 16, 2024

@dosu vector_store.get_nodes() is not implemented. Can you show me the query to get all nodes? Thanks.

logan-markewich commented on August 16, 2024

@Rust-Ninja-Sabi this is correct. ref_doc_info relies on having a docstore, but most vector DB integrations disable the docstore because they store all the nodes in the vector DB itself (which simplifies storage).

I'd recommend using an ingestion pipeline with a vector store and docstore attached:

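from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# docstore, vector_store, and documents are assumed to be defined already
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    docstore=docstore,
    vector_store=vector_store,
)

pipeline.run(documents=documents)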

https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/?h=ingestion
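
To make the role of the attached docstore concrete, here is a rough sketch of the bookkeeping it enables: recording which nodes came from which source document, so that a whole document can later be listed or removed. `ToyDocstore` and its methods are hypothetical and only model the idea; they are not the llama_index docstore API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyDocstore:
    """Hypothetical bookkeeping layer; not llama_index's docstore API."""

    def __init__(self) -> None:
        self._nodes_by_doc: Dict[str, List[str]] = defaultdict(list)

    def add_nodes(self, nodes: List[Tuple[str, str]]) -> None:
        # Each node is given as (node_id, ref_doc_id).
        for node_id, ref_doc_id in nodes:
            self._nodes_by_doc[ref_doc_id].append(node_id)

    def ref_doc_info(self) -> Dict[str, List[str]]:
        # Return a copy so callers cannot mutate the internal records.
        return {doc: list(ids) for doc, ids in self._nodes_by_doc.items()}

    def delete_ref_doc(self, ref_doc_id: str) -> List[str]:
        # Returns the node ids that should also be deleted from the vector store.
        return self._nodes_by_doc.pop(ref_doc_id, [])


docstore_sketch = ToyDocstore()
docstore_sketch.add_nodes([("n1", "doc-a"), ("n2", "doc-a"), ("n3", "doc-b")])
print(docstore_sketch.ref_doc_info())

stale_node_ids = docstore_sketch.delete_ref_doc("doc-a")
print(stale_node_ids)
```

The vector store alone only sees individual nodes; it is this document-to-node record, kept next to it, that lets a caller act on whole documents.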
