cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.

Home Page: https://cinnamon.github.io/kotaemon/

License: Apache License 2.0

Python 90.49% HTML 3.84% CSS 0.41% JavaScript 0.48% Mako 0.07% Shell 2.83% Batchfile 1.67% Dockerfile 0.23%
chatbot llms open-source rag

kotaemon's People

Contributors

anush008, cin-albert, ducminhle, gofullthrottle, jacky0218, lone17, ngnhng, osushinekotan, phv2312, taprosoft, trducng, zc277584121


kotaemon's Issues

pin dependencies version

Now that we have started versioning this project and making releases, we need the releases to be stable. Pinning dependency versions is a step in that direction.

It's easiest to start with hard pins and loosen the ranges as we go.
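A sketch of what hard pinning could look like in a package's `pyproject.toml` (the package names and versions below are placeholders, not the actual dependency list):

```toml
[project]
dependencies = [
    # start with exact pins for every dependency...
    "some-llm-client==1.2.3",
    # ...then loosen to a compatible range once releases prove stable:
    "some-reader-lib>=0.15,<0.16",
]
```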

restructure the packages in libs to submodules

Having two packages living inside a bigger package makes versioning hard. Currently we have to maintain versions for three packages: kotaemon-app (the root), kotaemon, and ktem. They have the following roles:

  • kotaemon: core RAG components such as LLMs, embedding models, readers, indexers, etc.
  • ktem: kotaemon + app-building components
  • kotaemon-app: ktem + docs, tests, and setup scripts

Consider turning them into submodules instead:

  • kotaemon -> kotaemon.core
  • ktem -> kotaemon.app
  • kotaemon-app -> kotaemon

Then we would only need to version the root kotaemon package.
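As a sketch, the resulting single-package layout could look like this (directory names follow the mapping above):

```
kotaemon/                # single versioned root (replaces kotaemon-app)
├── pyproject.toml       # one version to maintain
└── kotaemon/
    ├── core/            # was libs/kotaemon: LLMs, embeddings, readers, indexers
    └── app/             # was libs/ktem: app-building components
```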

Steps I took to get local app working

Hey there! Super cool project. I thought I'd write up some of the (yet to be documented) steps I took to get the application working on my MacBook Pro with an M1 chip.

I did not use the Docker image because it is out of date and because the published amd64 image does not match my arm64 (Apple Silicon) platform.

  1. Clone the repo
git clone https://github.com/Cinnamon/kotaemon
cd kotaemon
  2. Create a virtual environment and activate it
python -m venv kotaemon-env
source kotaemon-env/bin/activate
  3. Add "unstructured==0.15.8" to the dependency array of libs/kotaemon/pyproject.toml
  4. brew install libmagic
  5. Open a Python session and run
import nltk
nltk.download('averaged_perceptron_tagger')

sources: https://www.nltk.org/data.html
https://stackoverflow.com/questions/35861482/nltk-lookup-error
  6. Install the packages in editable mode
pip install -e "libs/kotaemon[all]"
pip install -e "libs/ktem"
  7. Start up the app using python app.py
  8. SET YOUR API KEYS IN THE APPLICATION, NOT THE .env! Do this for both the LLM resource and the embedding resource.

Hope this helps someone else! I've been having a lot of fun using this.

I can add a PR documenting these steps and adding system requirements

Embedding Model

When I follow the default embedding model setup, it shows:
ImportError: cannot import name 'TextEmbedding' from 'fastembed' (unknown location)

But when I register an OpenAI embedding model and delete the old one, it shows:
KeyError: 'local-bge-base-en-v1.5'

I have attached the detailed errors as screenshots.
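For anyone hitting the same ImportError: `TextEmbedding` only exists in relatively recent fastembed releases, so this usually points to an old or broken install rather than a kotaemon bug (my assumption from the error text). A small diagnostic sketch:

```python
import importlib.util

def fastembed_ok() -> bool:
    """Return True only if fastembed is installed and exposes TextEmbedding."""
    if importlib.util.find_spec("fastembed") is None:
        return False  # not installed at all
    import fastembed
    return hasattr(fastembed, "TextEmbedding")

print(fastembed_ok())
```

If this prints False, running `pip install -U fastembed` in the same environment is worth trying before re-registering the model.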

[BUG] - Password change is not working

Description

On the password change page, you will receive this error when trying to change the password:
from ktem.pages.admin.user import validate_password
ModuleNotFoundError: No module named 'ktem.pages.admin'

Logs

File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/blocks.py", line 1508, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/gradio/utils.py", line 818, in wrapper
    response = f(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/kotaemon-QmUhOJgz/lib/python3.10/site-packages/ktem/pages/settings.py", line 232, in change_password
    from ktem.pages.admin.user import validate_password
ModuleNotFoundError: No module named 'ktem.pages.admin'

Browsers

No response

OS

No response

Additional information

No response

Add Kùzu graph database as a graph store

Hi,

This is a great project, looking forward to trying it out more and experimenting with some workflows!

I'm looking at the portions of the code that use networkx for graph retrieval and/or visualization, and the existing use of Pandas DataFrames makes this workflow very amenable to Kùzu, an embedded graph database that's very similar to DuckDB and LanceDB in philosophy (easy to deploy, fully open source). Using a graph database with persistence and durability guarantees, rather than an in-memory structure like NetworkX, is preferable. And the fact that Kùzu is embedded and open source makes it that much simpler and more user-friendly, in the same way that LanceDB and DuckDB are.

It's trivial to read data into a Kùzu graph via Pandas, as described here. The benefit of Kùzu over NetworkX, imo, is that it scales very well to out-of-memory data, and it's the perfect complement to LanceDB for users who already know that database from a vector search perspective.

Additionally, it's also trivial to convert a Kùzu graph into a networkx Graph or DiGraph object, which can be used for all downstream workflows that require networkx objects.

The Microsoft GraphRAG repo also uses LanceDB as its default vector store, and the reason Kùzu isn't used there (they also leverage NetworkX for their graph computations) is that at the time of them writing their code, Kùzu wasn't well known enough. I think that's changing, as Kùzu is becoming more and more popular (disclaimer: I work at Kùzu).

I just wanted to create this issue so that this could be something that's on the roadmap, and I'd be happy to try out the framework more and offer my inputs as this project grows. Cheers!
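To make the DataFrame-to-graph workflow concrete, here is the current in-memory path as a minimal sketch (the edge list is made up; kotaemon's actual columns may differ). The same DataFrame could be bulk-loaded into Kùzu instead, with Kùzu handing back a networkx object for downstream steps:

```python
import pandas as pd
import networkx as nx

# Hypothetical entity/relationship edge list of the kind a GraphRAG-style
# pipeline produces.
edges = pd.DataFrame({
    "source": ["doc1", "doc1", "doc2"],
    "target": ["entityA", "entityB", "entityA"],
    "weight": [0.9, 0.7, 0.8],
})

# Today's in-memory path: build a networkx DiGraph straight from the DataFrame.
g = nx.from_pandas_edgelist(
    edges, "source", "target", edge_attr="weight", create_using=nx.DiGraph
)
print(g.number_of_nodes(), g.number_of_edges())  # 4 3
```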

[BUG] - Error when installing graphrag

Description

I have an error when installing graphrag.
In order to run graphrag, I tried pip install graphrag future but got the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

Reproduction steps

1. Go to 'project root folder'
2. Run on 'pip install graphrag future'
3. Get the error: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

Screenshots


Logs

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

Browsers

No response

OS

MacOS

Additional information

No response

Show file content in the file index

When clicking on a file, users should be able to see the file content. Depending on the available resources, this can be:

  • The raw file content.
  • The indexed chunks.
  • etc.

Feature request: display LaTeX in chat output

When chatting about STEM-related PDF documents that include LaTeX syntax, the LaTeX does not render properly into math symbols.

Please see if you can use MathJax to render LaTeX inside the chat panel.

[BUG] - Error while uploading and indexing

Description

I have the following error while uploading and indexing a file (both in Docker and in the direct installation):

Reproduction steps

1. Go to 'Files --> File Collection'
2. Drag and drop a pdf file and Click on 'Upload and index'
3. See error

Screenshots

No response

Logs

No response

Browsers

No response

OS

No response

Additional information

No response

Not able to run the project as end user

The latest zip extracts to C:\kotaemon-app.

After installation, a conda virtual environment is created, but when app.py is run it gives the following error:

 File "C:\kotaemon-app\app.py", line 3, in <module>
    from theflow.settings import settings as flowsettings
ModuleNotFoundError: No module named 'theflow'

On inspection, both app.py and flowsettings.py at the root import from theflow.settings, and there is no theflow directory or file in the zip.

How should I run the project as a user?

Show progress while indexing files

Indexing takes time, especially with lots of files. Without showing progress, we give the impression that the app isn't running.
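One lightweight way to surface progress is to have the indexing loop yield a status line per file that the UI can stream (a sketch with a hypothetical helper, not the actual pipeline API):

```python
from typing import Iterator

def index_files(paths: list[str]) -> Iterator[str]:
    """Yield a human-readable progress line for each file (hypothetical)."""
    total = len(paths)
    for i, path in enumerate(paths, 1):
        # ...real chunking/embedding work would happen here...
        yield f"[{i}/{total}] indexed {path}"

for msg in index_files(["a.pdf", "b.pdf"]):
    print(msg)  # e.g. "[1/2] indexed a.pdf"
```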

Container image is missing 'unstructured' pip package

Resulting in these errors:

Setting up quick upload event
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
User "admin" already exists
Setting up quick upload event
User-id: None, can see public conversations: False
User-id: 1, can see public conversations: True
len(results)=0, len(file_list)=1
len(results)=0, len(file_list)=1
Overriding with default loaders
use_quick_index_mode False
reader_mode default
Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f8d1a655e10>
use_quick_index_mode True
reader_mode default
Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f8d1a655e10>
No module named 'unstructured'
Traceback (most recent call last):
  File "/app/libs/ktem/ktem/index/file/pipelines.py", line 724, in stream
    file_id, docs = yield from pipeline.stream(
  File "/app/libs/ktem/ktem/index/file/pipelines.py", line 586, in stream
    docs = self.loader.load_data(file_path, extra_info=extra_info)
  File "/app/libs/kotaemon/kotaemon/loaders/unstructured_loader.py", line 70, in load_data
    from unstructured.partition.auto import partition
ModuleNotFoundError: No module named 'unstructured'
/usr/local/lib/python3.10/site-packages/gradio/components/dropdown.py:188: UserWarning:

The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: None or set allow_custom_value=True.
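The failing import happens lazily inside the loader, so the container starts fine and only breaks at index time. A guarded version of that pattern (a sketch, not the project's actual code) would surface an actionable message instead:

```python
def partition_file(path: str):
    """Partition a document with unstructured, failing with a clear hint
    if the optional dependency is missing (sketch)."""
    try:
        from unstructured.partition.auto import partition  # optional dep
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            "The 'unstructured' package is not installed in this image; "
            "add it with: pip install unstructured"
        ) from exc
    return partition(filename=path)
```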

Unable to run in docker on ARM Mac

When using the following command to run kotaemon in Docker, the container exits. (The host is an M2 Mac with Rosetta.)

docker run --platform linux/amd64 -e GRADIO_SERVER_NAME=0.0.0.0 -e GRADIO_SERVER_PORT=7860 -p 7860:7860 -it taprosoft/kotaemon:v1.0

The following output is printed to the screen:

Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/app' '/app'
[nltk_data] Downloading package punkt_tab to
[nltk_data] /usr/local/lib/python3.10/site-
[nltk_data] packages/llama_index/core/_static/nltk_cache...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.

Not sure what I'm missing.

Environment Issue


  1. A little issue here: does this have any relation to anything? I'm testing the latest v0.3.4.

  2. I saw a couple of CMake errors during the build, but I didn't see the flow. Are there any prerequisites for the environment before using the latest release?

Any insights and help are greatly appreciated. Thank you!

[Feature Request] mechanism to install advanced features on top of current installation

We need an extension-based mechanism to install advanced features on top of the existing installation, meaning the user should be able to install additional features after the app is installed.

For example: several advanced readers use unstructured, which is a bit tricky to install and takes up a decent amount of storage. This could be an optional feature that users install on demand.
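Short of a full plugin framework, Python packaging's optional extras already cover the simple cases; a sketch of how such an on-demand feature could be declared (the extra name `adv` is hypothetical):

```toml
# libs/kotaemon/pyproject.toml (sketch)
[project.optional-dependencies]
adv = [
    "unstructured",  # heavy optional readers
]
```

Users would then opt in with `pip install -e "libs/kotaemon[adv]"` after the base install.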

Cannot launch (non-technical person)

Hi, I followed the installation instructions, but I'm not a software person. I tried to launch the app, but it didn't open in the browser. Here is the message I got:

Requirements are already installed

Setting up a local model

LOCAL_MODEL not set in the .env file.

Launching Kotaemon in your browser, please wait...

Traceback (most recent call last):
  File "/Users/johnnybonk/Downloads/kotaemon-app/app.py", line 15, in <module>
    app = App()
  File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/ktem/app.py", line 57, in __init__
    self.register_reasonings()
  File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/ktem/app.py", line 83, in register_reasonings
    reasoning_cls = import_dotted_string(value, safe=False)
  File "/Users/johnnybonk/Downloads/kotaemon-app/install_dir/env/lib/python3.10/site-packages/theflow/utils/modules.py", line 47, in import_dotted_string
    return getattr(module, obj_name)
AttributeError: module 'ktem.reasoning.simple' has no attribute 'FullDecomposeQAPipeline'

Will exit now...

Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.

[Process completed]

Showing Error Message

We should find a better way to handle errors and inform the user. When there is an issue in the embedding process, the UI shows no error, and I could only go back to the terminal to check.

Provide clipper.js as a HTML reader

Source: https://github.com/philschmid/clipper.js
On preliminary usage, the tool can convert an HTML page into reasonably clean Markdown. Running it requires nodejs to be installed, and it is treated as a CLI command.

Approach:

  • Treat clipper.js as a plugin that user can enable.
  • If nodejs is unavailable, inform the user to install nodejs and cancel plugin activation.
  • Install clipper.js (maybe using npx).
  • HTMLClipperJS will wrap around the command line.
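A minimal sketch of what the `HTMLClipperJS` wrapper could look like; the exact `clipper` CLI name and flags are assumptions based on the clipper.js README and would need checking:

```python
import shutil
import subprocess

class HTMLClipperJS:
    """Wrap the clipper.js CLI to turn an HTML page into Markdown (sketch)."""

    def __init__(self) -> None:
        if shutil.which("node") is None:
            # Per the approach above: tell the user and cancel activation.
            raise RuntimeError("nodejs is required for the clipper.js plugin.")

    def load(self, url: str) -> str:
        # Hypothetical invocation; adjust the command to the installed version.
        result = subprocess.run(
            ["npx", "clipper", "clip", "-u", url],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
```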

Issue: Error in create_base_entity_graph Step During Indexing

Description

Hi,
First of all, thank you for this great project! I have been experimenting with it and encountered an issue that I hope you can help me with.

I cloned the repository and followed the instructions to upload and index a PDF file. The file was successfully uploaded and processed into chunks, but the process failed at the create_base_entity_graph step. Below are the details:

Reproduction steps

1. Uploaded a PDF file.
2. The file was converted to text and processed into 432 chunks.
3. The indexing process started, created base text units, and extracted entities.
4. The process failed at the create_base_entity_graph step.

Screenshots

No response

Logs

Indexing [1/1]: ewfacve.pdf
 => Converting ewfacve.pdf to text
 => Converted ewfacve.pdf to text
 => [ewfacve.pdf] Processed 432 chunks
 => Finished indexing ewfacve.pdf
[GraphRAG] Creating index... This can take a long time.
Logging enabled at /app/ktem_app_data/user_data/files/graphrag/56bbcd91-b07f-4ca4-a904-cce71bdf4571/output/20240828-110528/reports/indexing-engine.log

🚀 create_base_text_units

                                   id  ... n_tokens
0    b85337bedc1f1de271961c7251d46b5c  ...      918
1    0b9e954415cd9c6376528587903a9a96  ...      748
2    d03cbffc5b626210553dd466e022d2ee  ...      891
3    d28fa2ce915c137677eaaa2aabdaa304  ...      732
4    3a7d2941f7ba32940874d0f06f9f62f4  ...     1141
..                                ...  ...      ...
132  54d5b9a2b71895341def789d88113ef4  ...     1200
133  53f5ed0fb3870cf8e59a8f55c1271f3a  ...      398
134  ea40a2366f7ad18819cfa804827eb4d4  ...     1183
135  f1e686b8519e7d93ce6e461d9fa8d1a0  ...       83
136  9eac5a3d46b4b013fef67b5c78330d89  ...      886

[275 rows x 5 columns]

🚀 create_base_extracted_entities

                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...

🚀 create_summarized_entities

                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...

❌ create_base_entity_graph
None

⠋ GraphRAG Indexer
├── Loading Input (text) - 216 files loaded (0 filtered) ━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.

Browsers

Chrome

OS

Windows

Additional information

No response
