cinnamon / kotaemon
An open-source RAG-based tool for chatting with your documents.
Home Page: https://cinnamon.github.io/kotaemon/
License: Apache License 2.0
Now that we have started versioning this and making releases, we need the releases to be stable. Dependency version control is a step in that direction.
It's easier to start by hard-pinning and loosening the ranges as we go.
Having 2 packages living inside a big package makes versioning hard. Currently, we have to maintain versions of 3 packages: kotaemon-app (the root), kotaemon and ktem. They have the following functions:
kotaemon: core RAG components such as LLMs, Embedding Models, Reader, Indexer, etc.
ktem: kotaemon + app-building stuff
kotaemon-app: ktem + docs and tests + setup scripts
Consider changing them to submodules instead:
kotaemon -> kotaemon.core
ktem -> kotaemon.app
kotaemon-app -> kotaemon
Then we would only need to version the root kotaemon package.
Azure AI Document Intelligence is currently used with the Markdown output format. We could use the JSON output format from Azure DI to extract figure information instead.
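As a rough illustration, figure metadata could be pulled out of the JSON analyze result like this. This is only a sketch: the `figures`, `caption`, and `boundingRegions` field names follow the preview Layout API schema and are assumptions here, so verify them against the actual response.

```python
# Hedged sketch: extract figure metadata from an Azure DI analyze-result JSON.
# Field names ("figures", "caption", "boundingRegions") are assumptions based
# on the preview Layout API schema; check the real response before relying on them.
import json

def extract_figures(analyze_result_json: str) -> list:
    """Return a list of {id, caption, regions} dicts for each detected figure."""
    result = json.loads(analyze_result_json)
    figures = []
    for fig in result.get("figures", []):
        figures.append({
            "id": fig.get("id"),
            "caption": (fig.get("caption") or {}).get("content"),
            "regions": fig.get("boundingRegions", []),
        })
    return figures
```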
Hey there! Super cool project. Thought I'd add some of the (yet to be documented) steps that I took to get the application working on my MacBook Pro with an M1 chip.
I did not use the Docker image because it is out of date and because its amd64 build does not match my arm64 platform.
git clone https://github.com/Cinnamon/kotaemon
cd kotaemon
python -m venv kotaemon-env
source kotaemon-env/bin/activate
libs/kotaemon/pyproject.toml
brew install libmagic
import nltk
nltk.download('averaged_perceptron_tagger')
sources: https://www.nltk.org/data.html
https://stackoverflow.com/questions/35861482/nltk-lookup-error
pip install -e "libs/kotaemon[all]"
pip install -e "libs/ktem"
python app.py
Hope this helps someone else! I've been having a lot of fun using this.
I can add a PR documenting these steps and adding system requirements
On the password change page, you will receive this error when trying to change the password:
from ktem.pages.admin.user import validate_password
ModuleNotFoundError: No module named 'ktem.pages.admin'
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
output = await app.get_blocks().process_api(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/blocks.py", line 1923, in process_api
result = await self.call_function(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/blocks.py", line 1508, in call_function
prediction = await anyio.to_thread.run_sync( # type: ignore
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/utils.py", line 818, in wrapper
response = f(*args, **kwargs)
File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/ktem/pages/settings.py", line 232, in change_password
from ktem.pages.admin.user import validate_password
ModuleNotFoundError: No module named 'ktem.pages.admin'
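Until the missing `ktem.pages.admin` module is shipped in the installed package, a stand-in validator with a compatible call shape can unblock the page. This is a hypothetical sketch: the signature and the password rules below are assumptions, not the project's actual `validate_password`.

```python
# Hypothetical stand-in for ktem.pages.admin.user.validate_password, which the
# traceback shows is missing from the installed package. The signature and
# rules here are assumptions for illustration only.
import re

def validate_password(password: str, password_confirm: str) -> str:
    """Return an error message, or "" when the password is acceptable."""
    if password != password_confirm:
        return "Passwords do not match."
    if len(password) < 8:
        return "Password must be at least 8 characters."
    if not re.search(r"\d", password):
        return "Password must contain at least one digit."
    return ""
```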
Currently only LLMs and Embedding models are configurable in the UI. Adding this for vision models would be nice.
Related reddit discussion: https://www.reddit.com/r/LocalLLaMA/comments/1f25wo0/comment/lk4nli5/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
Add a Markdown loader. LlamaIndex and LangChain both have very good readers for Markdown.
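A minimal loader of that kind can be sketched in pure Python. The `MarkdownReader` class and `Document` container below are hypothetical stand-ins, not kotaemon's or LlamaIndex's actual API; they just show the shape of splitting a Markdown file into per-section documents.

```python
# A minimal Markdown loader sketch (class and Document type are hypothetical,
# not the actual kotaemon/LlamaIndex API): one Document per top-level heading.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

class MarkdownReader:
    """Split a Markdown file into one Document per top-level '# ' section."""

    def load_data(self, path: str) -> list:
        sections, current = [], []
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if line.startswith("# ") and current:
                sections.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            sections.append("\n".join(current))
        return [Document(text=s, metadata={"source": path}) for s in sections]
```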
Hi,
This is a great project, looking forward to trying it out more and experimenting with some workflows!
I'm looking at the portions of the code that utilize networkx for the graph retrieval and/or visualization, and I was thinking: the existing use of Pandas DataFrames makes this workflow very amenable to using Kùzu, an embedded graph database that's very similar to DuckDB and LanceDB in philosophy (easy to deploy, and fully open source). Using a graph database with persistence and durability guarantees, rather than an in-memory library like NetworkX, is preferable. And the fact that Kùzu is embedded and open source makes it that much more simple and user-friendly, in the same way that LanceDB and DuckDB are.
It's trivial to read data into a Kùzu graph via Pandas, as described here. The benefit of using Kùzu over NetworkX, imo, is that Kùzu can scale very well to out-of-memory data, and it's the perfect complement to LanceDB for users who are familiar with that database from a vector-search perspective.
Additionally, it's also trivial to convert a Kùzu graph into a networkx Graph or DiGraph object, which can be used for all downstream workflows that require networkx objects.
The Microsoft GraphRAG repo also uses LanceDB as its default vector store, and the reason Kùzu isn't used there (they also leverage NetworkX for their graph computations) is that at the time they wrote their code, Kùzu wasn't well known enough. I think that's changing, as Kùzu is becoming more and more popular (disclaimer: I work at Kùzu).
I just wanted to create this issue so that this could be something that's on the roadmap, and I'd be happy to try out the framework more and offer my inputs as this project grows. Cheers!
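The flow described above (tabular edges → embedded, persistent store → in-memory graph object for analysis) can be sketched with stdlib pieces only. Note the deliberate substitutions: sqlite3 stands in for Kùzu and a dict-based adjacency list stands in for NetworkX, purely so the snippet stays self-contained; it is not the Kùzu API.

```python
# Sketch of the tabular-edges -> persistent store -> in-memory graph flow.
# sqlite3 and a plain adjacency dict are stand-ins for Kuzu and NetworkX here,
# chosen only to keep the example stdlib-only and runnable.
import sqlite3
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("a", "c")]

# Persist the edge table (durable, embedded -- the role Kuzu would play).
conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

# Rehydrate an in-memory graph for downstream analysis (NetworkX's role).
graph = defaultdict(set)
for src, dst in conn.execute("SELECT src, dst FROM edges"):
    graph[src].add(dst)
```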
I have an error when installing graphrag. In order to run graphrag, I try to run:
pip install graphrag future
But get the following error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
1. Go to 'project root folder'
2. Run 'pip install graphrag future'
3. Get the error: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
macOS
When clicking on a file, users can see the file content. Depending on the resources, it can be:
Given that I have followed all the steps, from adding the AI model to selecting the uploaded file.
I have the following error while uploading and indexing a file (both in Docker and in a direct installation):
1. Go to 'Files --> File Collection'
2. Drag and drop a pdf file and Click on 'Upload and index'
3. See error
As seen at https://hub.docker.com/r/taprosoft/kotaemon/tags, there is no Docker image for the arm architecture. Waiting for it.
I want to pull it to my Raspberry Pi 4B and deploy it there, but I need to find another solution for now.
It's unnecessary for end users to download the whole repo.
The latest zip extracts to the following files:
After installation, a conda virtual environment is created, but when app.py is run it gives the following error:
File "C:\kotaemon-app\app.py", line 3, in <module>
from theflow.settings import settings as flowsettings
ModuleNotFoundError: No module named 'theflow'
After inspecting, both app.py and flowsettings.py located at the root import from theflow.settings, and there is no theflow directory or file.
How should I run the project as a user?
Indexing takes time, especially when indexing lots of files. Without showing the progress, we are giving the impression that the app isn't running.
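Streaming a per-file status message from the indexing loop is one way to keep the UI alive. The sketch below is a hypothetical generator (function name and message format are assumptions, loosely echoing the "Indexing [1/1]: file.pdf" lines seen in the logs), not kotaemon's actual pipeline code.

```python
# Sketch of yielding per-file progress from an indexing loop so the UI can
# stream status instead of appearing frozen. Function name and message format
# are hypothetical, modeled on the log lines elsewhere in this page.
from typing import Iterable, Iterator

def index_with_progress(files: Iterable[str]) -> Iterator[str]:
    files = list(files)
    for i, name in enumerate(files, start=1):
        # ... the actual chunking/embedding work would happen here ...
        yield f"Indexing [{i}/{len(files)}]: {name}"
    yield f"Finished indexing {len(files)} file(s)"
```

A Gradio handler can consume such a generator directly and update a status textbox on each yield.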
Resulting in these errors:
Setting up quick upload event
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
User "admin" already exists
Setting up quick upload event
User-id: None, can see public conversations: False
User-id: 1, can see public conversations: True
len(results)=0, len(file_list)=1
len(results)=0, len(file_list)=1
Overriding with default loaders
use_quick_index_mode False
reader_mode default
Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f8d1a655e10>
use_quick_index_mode True
reader_mode default
Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f8d1a655e10>
No module named 'unstructured'
Traceback (most recent call last):
File "/app/libs/ktem/ktem/index/file/pipelines.py", line 724, in stream
file_id, docs = yield from pipeline.stream(
File "/app/libs/ktem/ktem/index/file/pipelines.py", line 586, in stream
docs = self.loader.load_data(file_path, extra_info=extra_info)
File "/app/libs/kotaemon/kotaemon/loaders/unstructured_loader.py", line 70, in load_data
from unstructured.partition.auto import partition
ModuleNotFoundError: No module named 'unstructured'
/usr/local/lib/python3.10/site-packages/gradio/components/dropdown.py:188: UserWarning:
The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: None or set allow_custom_value=True.
When using the following to run kotaemon in docker, the container exits. (Host is M2 Mac with Rosetta)
docker run --platform linux/amd64 -e GRADIO_SERVER_NAME=0.0.0.0 -e GRADIO_SERVER_PORT=7860 -p 7860:7860 -it taprosoft/kotaemon:v1.0
The following output is printed to the screen:
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/app' '/app'
[nltk_data] Downloading package punkt_tab to
[nltk_data] /usr/local/lib/python3.10/site-
[nltk_data] packages/llama_index/core/_static/nltk_cache...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
Not sure what I'm missing.
A little issue here: does this have any relation to anything? I'm testing the latest v0.3.4.
I saw a couple of CMake errors during the build, but I didn't see the flow. Is there any prerequisite for the environment before using the latest release?
Any insights and help greatly appreciated. Thank you.
We need an extension-based framework to install advanced features on top of the existing installation, which means the user should be able to install additional features after the app is installed.
For example: several advanced readers use unstructured, which is a bit tricky to implement and takes up a decent amount of storage. So this can be an optional feature that users can install on demand.
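Detecting whether an optional feature's backing package is present can be done with the stdlib alone. A minimal sketch, assuming a hypothetical `adv` extra; the install hint is illustrative, not kotaemon's real packaging:

```python
# Sketch of on-demand feature detection using only the stdlib. The "adv" extra
# name and the install hint are assumptions, not kotaemon's actual packaging.
import importlib.util

def has_extra(module_name: str) -> bool:
    """True when the optional feature's backing package is importable."""
    return importlib.util.find_spec(module_name) is not None

if not has_extra("unstructured"):
    print("Advanced readers unavailable; install with: pip install 'kotaemon[adv]'")
```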
Hi, I followed the installation instructions, but I'm not a software person. I tried to launch it, but it didn't open in the browser. Here is the message I got:
Requirements are already installed
Setting up a local model
LOCAL_MODEL not set in the .env file.
Launching Kotaemon in your browser, please wait...
Traceback (most recent call last):
File "/Users/johnnybonk/Downloads/kotaemon-app/app.py", line 15, in
app = App()
File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/ktem/app.py", line 57, in init
self.register_reasonings()
File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/ktem/app.py", line 83, in register_reasonings
reasoning_cls = import_dotted_string(value, safe=False)
File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/theflow/utils/modules.py", line 47, in import_dotted_string
return getattr(module, obj_name)
AttributeError: module 'ktem.reasoning.simple' has no attribute 'FullDecomposeQAPipeline'
Will exit now...
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
We might have to find a better way to handle errors and inform the user: when there is an issue in the embedding process, the UI shows no error, and I could only refer back to the terminal to check.
related to qdrant/fastembed#187
Source: https://github.com/philschmid/clipper.js
Upon preliminary usage, the tool can convert an HTML page into a reasonable Markdown format. Running it requires Node.js to be installed, treating it as a CLI command.
Approach:
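Treating it as a CLI command could look like a thin subprocess wrapper. A sketch only: the `clipper clip -u <url>` invocation is taken from the linked repo's README and is an assumption here.

```python
# Sketch of shelling out to the clipper Node.js CLI from Python. The
# "clipper clip -u" command shape is an assumption based on the repo's README.
import shutil
import subprocess

def html_to_markdown(url: str) -> str:
    """Convert a web page to Markdown via the clipper CLI."""
    if shutil.which("clipper") is None:
        raise RuntimeError("clipper CLI not found; install it via npm first")
    result = subprocess.run(
        ["clipper", "clip", "-u", url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```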
Hi,
First of all, thank you for this great project! I have been experimenting with it and encountered an issue that I hope you can help me with.
Description
I cloned the repository and followed the instructions to upload and index a PDF file. The file was successfully uploaded and processed into chunks, but the process failed at the create_base_entity_graph step. Below are the details:
1. Uploaded a PDF file.
2. The file was converted to text and processed into 432 chunks.
3. The indexing process started and created base text units and extracted entities.
4. The process failed at the create_base_entity_graph step.
Indexing [1/1]: ewfacve.pdf
=> Converting ewfacve.pdf to text
=> Converted ewfacve.pdf to text
=> [ewfacve.pdf] Processed 432 chunks
=> Finished indexing ewfacve.pdf
[GraphRAG] Creating index... This can take a long time.
Logging enabled at /app/ktem_app_data/user_data/files/graphrag/56bbcd91-b07f-4ca4-a904-cce71bdf4571/output/20240828-110528/reports/indexing-engine.log
🚀 create_base_text_units
id ... n_tokens
0 b85337bedc1f1de271961c7251d46b5c ... 918
1 0b9e954415cd9c6376528587903a9a96 ... 748
2 d03cbffc5b626210553dd466e022d2ee ... 891
3 d28fa2ce915c137677eaaa2aabdaa304 ... 732
4 3a7d2941f7ba32940874d0f06f9f62f4 ... 1141
.. ... ... ...
132 54d5b9a2b71895341def789d88113ef4 ... 1200
133 53f5ed0fb3870cf8e59a8f55c1271f3a ... 398
134 ea40a2366f7ad18819cfa804827eb4d4 ... 1183
135 f1e686b8519e7d93ce6e461d9fa8d1a0 ... 83
136 9eac5a3d46b4b013fef67b5c78330d89 ... 886
[275 rows x 5 columns]
🚀 create_base_extracted_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
❌ create_base_entity_graph
None
⠋ GraphRAG Indexer
├── Loading Input (text) - 216 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph❌ Errors occurred during the pipeline run, see logs for more details.
Chrome
Windows
Collecting git+https://github.com/Cinnamon/[email protected]
Cloning https://github.com/Cinnamon/kotaemon.git (to revision v0.3.6) to c:\users\micha\appdata\local\temp\pip-req-build-5dizcuak
ERROR: Error [WinError 2] The system cannot find the file specified. while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?