cinnamon / kotaemon
An open-source RAG-based tool for chatting with your documents.
Home Page: https://cinnamon.github.io/kotaemon/
License: Apache License 2.0
Now that we have started versioning this and making releases, we need the releases to be stable. Dependency version control is a step in that direction.
It's easier to start by hard-pinning and loosening the ranges as we go.
Having 2 packages living inside a big package makes versioning hard. Currently, we have to maintain versions of 3 packages: kotaemon-app (the root), kotaemon and ktem. They have the following functions:
kotaemon: core RAG components such as LLMs, Embedding Models, Reader, Indexer, etc.
ktem: kotaemon + app-building stuff
kotaemon-app: ktem + docs and tests + setup scripts
Consider changing them to submodules instead:
kotaemon -> kotaemon.core
ktem -> kotaemon.app
kotaemon-app -> kotaemon
Then we would only need to version the root kotaemon package.
Azure AI Document Intelligence is currently used with the Markdown output format. We could use the JSON output format from Azure DI to extract figure information instead.
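As a rough illustration, figure metadata could be pulled out of the JSON analyze result like this. This is only a sketch: the `figures`, `caption`, and `boundingRegions` field names follow the preview Layout API schema and are assumptions here, so verify them against the actual response.

```python
# Hedged sketch: extract figure metadata from an Azure DI analyze-result JSON.
# Field names ("figures", "caption", "boundingRegions") are assumptions based
# on the preview Layout API schema; check the real response before relying on them.
import json

def extract_figures(analyze_result_json: str) -> list:
    """Return a list of {id, caption, regions} dicts for each detected figure."""
    result = json.loads(analyze_result_json)
    figures = []
    for fig in result.get("figures", []):
        figures.append({
            "id": fig.get("id"),
            "caption": (fig.get("caption") or {}).get("content"),
            "regions": fig.get("boundingRegions", []),
        })
    return figures
```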
Hey there! Super cool project. Thought I'd add some of the (yet to be documented) steps that I took to get the application working on my MacBook Pro with an M1 chip.
I did not use the Docker image because it is out of date and because its amd64 build does not match my arm64 platform.
git clone https://github.com/Cinnamon/kotaemon
cd kotaemon
python -m venv kotaemon-env
source kotaemon-env/bin/activate
libs/kotaemon/pyproject.toml
brew install libmagic
import nltk
nltk.download('averaged_perceptron_tagger')
sources: https://www.nltk.org/data.html
https://stackoverflow.com/questions/35861482/nltk-lookup-error
pip install -e "libs/kotaemon[all]"
pip install -e "libs/ktem"
python app.py
Hope this helps someone else! I've been having a lot of fun using this.
I can add a PR documenting these steps and adding system requirements
On the password change page, you will receive this error when trying to change the password:
from ktem.pages.admin.user import validate_password
ModuleNotFoundError: No module named 'ktem.pages.admin'
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
output = await app.get_blocks().process_api(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/blocks.py", line 1923, in process_api
result = await self.call_function(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/blocks.py", line 1508, in call_function
prediction = await anyio.to_thread.run_sync( # type: ignore
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/utils.py", line 818, in wrapper
response = f(*args, **kwargs)
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/ktem/pages/settings.py", line 232, in change_password
from ktem.pages.admin.user import validate_password
ModuleNotFoundError: No module named 'ktem.pages.admin'
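Until the missing `ktem.pages.admin` module is shipped in the installed package, a stand-in validator with a compatible call shape can unblock the page. This is a hypothetical sketch: the signature and the password rules below are assumptions, not the project's actual `validate_password`.

```python
# Hypothetical stand-in for ktem.pages.admin.user.validate_password, which the
# traceback shows is missing from the installed package. The signature and
# rules here are assumptions for illustration only.
import re

def validate_password(password: str, password_confirm: str) -> str:
    """Return an error message, or "" when the password is acceptable."""
    if password != password_confirm:
        return "Passwords do not match."
    if len(password) < 8:
        return "Password must be at least 8 characters."
    if not re.search(r"\d", password):
        return "Password must contain at least one digit."
    return ""
```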
Currently only LLMs and Embedding models are configurable in the UI. Adding this for vision models would be nice.
Related reddit discussion: https://www.reddit.com/r/LocalLLaMA/comments/1f25wo0/comment/lk4nli5/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
Add a Markdown loader. LlamaIndex and LangChain both have very good readers for Markdown.
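A minimal loader of that kind can be sketched in pure Python. The `MarkdownReader` class and `Document` container below are hypothetical stand-ins, not kotaemon's or LlamaIndex's actual API; they just show the shape of splitting a Markdown file into per-section documents.

```python
# A minimal Markdown loader sketch (class and Document type are hypothetical,
# not the actual kotaemon/LlamaIndex API): one Document per top-level heading.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

class MarkdownReader:
    """Split a Markdown file into one Document per top-level '# ' section."""

    def load_data(self, path: str) -> list:
        sections, current = [], []
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if line.startswith("# ") and current:
                sections.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            sections.append("\n".join(current))
        return [Document(text=s, metadata={"source": path}) for s in sections]
```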
Hi,
This is a great project, looking forward to trying it out more and experimenting with some workflows!
I'm looking at the portions of the code that utilize networkx for the graph retrieval and/or visualization, and I was thinking: the existing use of Pandas DataFrames makes this workflow very amenable to using Kùzu, an embedded graph database that's very similar to DuckDB and LanceDB in philosophy (easy to deploy, and fully open source). Using a graph database with persistence and durability guarantees, rather than an in-memory library like NetworkX, is preferable. And the fact that Kùzu is embedded and open source makes it that much more simple and user-friendly, in the same way that LanceDB and DuckDB are.
It's trivial to read data into a Kùzu graph via Pandas, as described here. The benefit of using Kùzu over NetworkX, imo, is that Kùzu can scale very well to out-of-memory data, and it's the perfect complement to LanceDB for users who are familiar with that database from a vector-search perspective.
Additionally, it's also trivial to convert a Kùzu graph into a networkx Graph or DiGraph object, which can be used for all downstream workflows that require networkx objects.
The Microsoft GraphRAG repo also uses LanceDB as its default vector store, and the reason Kùzu isn't used there (they also leverage NetworkX for their graph computations) is that at the time they wrote their code, Kùzu wasn't well known enough. I think that's changing, as Kùzu is becoming more and more popular (disclaimer: I work at Kùzu).
I just wanted to create this issue so that this could be something that's on the roadmap, and I'd be happy to try out the framework more and offer my inputs as this project grows. Cheers!
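The flow described above (tabular edges → embedded, persistent store → in-memory graph object for analysis) can be sketched with stdlib pieces only. Note the deliberate substitutions: sqlite3 stands in for Kùzu and a dict-based adjacency list stands in for NetworkX, purely so the snippet stays self-contained; it is not the Kùzu API.

```python
# Sketch of the tabular-edges -> persistent store -> in-memory graph flow.
# sqlite3 and a plain adjacency dict are stand-ins for Kuzu and NetworkX here,
# chosen only to keep the example stdlib-only and runnable.
import sqlite3
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("a", "c")]

# Persist the edge table (durable, embedded -- the role Kuzu would play).
conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

# Rehydrate an in-memory graph for downstream analysis (NetworkX's role).
graph = defaultdict(set)
for src, dst in conn.execute("SELECT src, dst FROM edges"):
    graph[src].add(dst)
```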
I have an error when installing graphrag. In order to run graphrag, I try to run:
pip install graphrag future
But get the following error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
1. Go to 'project root folder'
2. Run 'pip install graphrag future'
3. Get the error: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
macOS
When clicking on a file, users can see the file content. Depending on the resources, it can be:
Given that I have followed all the steps, from adding the AI model to selecting the uploaded file.
I have the following error while uploading and indexing a file (both in Docker and in a direct installation):
1. Go to 'Files --> File Collection'
2. Drag and drop a pdf file and Click on 'Upload and index'
3. See error
As seen at https://hub.docker.com/r/taprosoft/kotaemon/tags, there is no Docker image for the arm architecture. Waiting for it.
I want to pull it to my Raspberry Pi 4B and deploy it there, but I need to find another solution for now.
It's unnecessary for end users to download the whole repo.
The latest zip extracts to the following files:
After installation, a conda virtual environment is created, but when app.py is run it gives the following error:
File "C:\kotaemon-app\app.py", line 3, in <module>
from theflow.settings import settings as flowsettings
ModuleNotFoundError: No module named 'theflow'
After inspecting, both app.py and flowsettings.py located at the root import from theflow.settings, and there is no theflow directory or file.
How should I run the project as a user?
Indexing takes time, especially when indexing lots of files. Without showing the progress, we are giving the impression that the app isn't running.
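Streaming a per-file status message from the indexing loop is one way to keep the UI alive. The sketch below is a hypothetical generator (function name and message format are assumptions, loosely echoing the "Indexing [1/1]: file.pdf" lines seen in the logs), not kotaemon's actual pipeline code.

```python
# Sketch of yielding per-file progress from an indexing loop so the UI can
# stream status instead of appearing frozen. Function name and message format
# are hypothetical, modeled on the log lines elsewhere in this page.
from typing import Iterable, Iterator

def index_with_progress(files: Iterable[str]) -> Iterator[str]:
    files = list(files)
    for i, name in enumerate(files, start=1):
        # ... the actual chunking/embedding work would happen here ...
        yield f"Indexing [{i}/{len(files)}]: {name}"
    yield f"Finished indexing {len(files)} file(s)"
```

A Gradio handler can consume such a generator directly and update a status textbox on each yield.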
Resulting in these errors:
Setting up quick upload event
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
User "admin" already exists
Setting up quick upload event
User-id: None, can see public conversations: False
User-id: 1, can see public conversations: True
len(results)=0, len(file_list)=1
len(results)=0, len(file_list)=1
Overriding with default loaders
use_quick_index_mode False
reader_mode default
Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f8d1a655e10>
use_quick_index_mode True
reader_mode default
Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f8d1a655e10>
No module named 'unstructured'
Traceback (most recent call last):
File "/app/libs/ktem/ktem/index/file/pipelines.py", line 724, in stream
file_id, docs = yield from pipeline.stream(
File "/app/libs/ktem/ktem/index/file/pipelines.py", line 586, in stream
docs = self.loader.load_data(file_path, extra_info=extra_info)
File "/app/libs/kotaemon/kotaemon/loaders/unstructured_loader.py", line 70, in load_data
from unstructured.partition.auto import partition
ModuleNotFoundError: No module named 'unstructured'
/usr/local/lib/python3.10/site-packages/gradio/components/dropdown.py:188: UserWarning:
The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: None or set allow_custom_value=True.
When using the following to run kotaemon in docker, the container exits. (Host is M2 Mac with Rosetta)
docker run --platform linux/amd64 -e GRADIO_SERVER_NAME=0.0.0.0 -e GRADIO_SERVER_PORT=7860 -p 7860:7860 -it taprosoft/kotaemon:v1.0
The following output is printed to the screen:
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/app' '/app'
[nltk_data] Downloading package punkt_tab to
[nltk_data] /usr/local/lib/python3.10/site-
[nltk_data] packages/llama_index/core/_static/nltk_cache...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
Not sure what I'm missing.
A little issue here: does this have any relation to anything? I'm testing the latest v0.3.4.
I saw a couple of CMake errors during the build, but I didn't see the flow. Is there any prerequisite for the environment before using the latest release?
Any insights and help greatly appreciated. Thank you.
We need an extension-based framework to install advanced features on top of the existing installation, which means the user should be able to install additional features after the app is installed.
For example: several advanced readers use unstructured, which is a bit tricky to implement and takes up a decent amount of storage. So this can be an optional feature that users can install on demand.
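Detecting whether an optional feature's backing package is present can be done with the stdlib alone. A minimal sketch, assuming a hypothetical `adv` extra; the install hint is illustrative, not kotaemon's real packaging:

```python
# Sketch of on-demand feature detection using only the stdlib. The "adv" extra
# name and the install hint are assumptions, not kotaemon's actual packaging.
import importlib.util

def has_extra(module_name: str) -> bool:
    """True when the optional feature's backing package is importable."""
    return importlib.util.find_spec(module_name) is not None

if not has_extra("unstructured"):
    print("Advanced readers unavailable; install with: pip install 'kotaemon[adv]'")
```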
Hi, I followed the installation instructions, but I'm not a software person. I tried to launch it, but it didn't open in the browser. Here is the message I got:
Requirements are already installed
Setting up a local model
LOCAL_MODEL not set in the .env file.
Launching Kotaemon in your browser, please wait...
Traceback (most recent call last):
File "/Users/johnnybonk/Downloads/kotaemon-app/app.py", line 15, in
app = App()
File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/ktem/app.py", line 57, in init
self.register_reasonings()
File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/ktem/app.py", line 83, in register_reasonings
reasoning_cls = import_dotted_string(value, safe=False)
File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/theflow/utils/modules.py", line 47, in import_dotted_string
return getattr(module, obj_name)
AttributeError: module 'ktem.reasoning.simple' has no attribute 'FullDecomposeQAPipeline'
Will exit now...
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
We might have to find a better way to handle errors and inform the user: when there is an issue in the embedding process, the UI shows no error, and I could only refer back to the terminal to check.
related to qdrant/fastembed#187
Source: https://github.com/philschmid/clipper.js
Upon preliminary usage, the tool can convert an HTML page into a reasonable Markdown format. Running it requires Node.js to be installed, treating it as a CLI command.
Approach:
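Treating it as a CLI command could look like a thin subprocess wrapper. A sketch only: the `clipper clip -u <url>` invocation is taken from the linked repo's README and is an assumption here.

```python
# Sketch of shelling out to the clipper Node.js CLI from Python. The
# "clipper clip -u" command shape is an assumption based on the repo's README.
import shutil
import subprocess

def html_to_markdown(url: str) -> str:
    """Convert a web page to Markdown via the clipper CLI."""
    if shutil.which("clipper") is None:
        raise RuntimeError("clipper CLI not found; install it via npm first")
    result = subprocess.run(
        ["clipper", "clip", "-u", url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```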
Hi,
First of all, thank you for this great project! I have been experimenting with it and encountered an issue that I hope you can help me with.
Description
I cloned the repository and followed the instructions to upload and index a PDF file. The file was successfully uploaded and processed into chunks, but the process failed at the create_base_entity_graph step. Below are the details:
1. Uploaded a PDF file.
2. The file was converted to text and processed into 432 chunks.
3. The indexing process started and created base text units and extracted entities.
4. The process failed at the create_base_entity_graph step.
Indexing [1/1]: ewfacve.pdf
=> Converting ewfacve.pdf to text
=> Converted ewfacve.pdf to text
=> [ewfacve.pdf] Processed 432 chunks
=> Finished indexing ewfacve.pdf
[GraphRAG] Creating index... This can take a long time.
Logging enabled at /app/ktem_app_data/user_data/files/graphrag/56bbcd91-b07f-4ca4-a904-cce71bdf4571/output/20240828-110528/reports/indexing-engine.log
🚀 create_base_text_units
id ... n_tokens
0 b85337bedc1f1de271961c7251d46b5c ... 918
1 0b9e954415cd9c6376528587903a9a96 ... 748
2 d03cbffc5b626210553dd466e022d2ee ... 891
3 d28fa2ce915c137677eaaa2aabdaa304 ... 732
4 3a7d2941f7ba32940874d0f06f9f62f4 ... 1141
.. ... ... ...
132 54d5b9a2b71895341def789d88113ef4 ... 1200
133 53f5ed0fb3870cf8e59a8f55c1271f3a ... 398
134 ea40a2366f7ad18819cfa804827eb4d4 ... 1183
135 f1e686b8519e7d93ce6e461d9fa8d1a0 ... 83
136 9eac5a3d46b4b013fef67b5c78330d89 ... 886
[275 rows x 5 columns]
🚀 create_base_extracted_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
❌ create_base_entity_graph
None
⠋ GraphRAG Indexer
├── Loading Input (text) - 216 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph❌ Errors occurred during the pipeline run, see logs for more details.
Chrome
Windows
Collecting git+https://github.com/Cinnamon/[email protected]
Cloning https://github.com/Cinnamon/kotaemon.git (to revision v0.3.6) to c:\users\micha\appdata\local\temp\pip-req-build-5dizcuak
ERROR: Error [WinError 2] The system cannot find the file specified. while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?