gerevai / gerev

🧠 AI-powered enterprise search engine 🔎

Home Page: https://app.klu.so/signup?utm_source=github_gerevai

License: MIT License

Dockerfile 0.17% Python 60.24% HTML 0.79% CSS 0.61% JavaScript 0.59% TypeScript 37.15% Shell 0.22% Mako 0.23%

ai search-engine workplace-search enterprise-search chatgpt semantic-search-engine similarity-search confluence machine-learning vector-search

gerev's Issues

Suggestion: Change data source file structure

TL;DR - Make each data source a package instead of a module.

Change the data_sources folder to contain a folder, instead of a single file, for each source.

It will make the code less "monolithic" without creating a mess: you could implement your own clients or wrappers for those clients, which would make the code much easier to maintain, add files for testing, and so on.

Have a look at CloudQuery's solution. (although it's an extremely different product for a different use case..)
I guess it's better to do that sooner than later ;/

Google Drive source fails indexing due to missing 'parents'

The document in question was shared from another user's Drive (in case that matters).

2023-03-28 16:40:40,412 | INFO | google_drive.py:138 | processing file [redacted]
2023-03-28 16:40:41,468 | ERROR | base_data_source.py:120 | Error while indexing data source
Traceback (most recent call last):
  File "/app/data_source/api/base_data_source.py", line 118, in index
    self._feed_new_documents()
  File "/app/data_source/sources/google_drive/google_drive.py", line 105, in _feed_new_documents
    self._feed_drive(drive=drive)
  File "/app/data_source/sources/google_drive/google_drive.py", line 131, in _feed_drive
    self._feed_file(file)
  File "/app/data_source/sources/google_drive/google_drive.py", line 172, in _feed_file
    parent_name = self._get_parents_string(file)
  File "/app/data_source/sources/google_drive/google_drive.py", line 101, in _get_parents_string
    return self._get_parent_name(file['parents'][0]) if file['parents'] else ''
KeyError: 'parents'
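
One possible way to avoid this KeyError, assuming (as the report suggests) that a file shared from another user's Drive can arrive without a `parents` key: read the key with `dict.get` and fall back to an empty string. `get_parents_string` and the `get_parent_name` callback below are hypothetical stand-ins modeled on the method names in the traceback, not the project's actual code.

```python
from typing import Callable


def get_parents_string(file: dict, get_parent_name: Callable[[str], str]) -> str:
    """Return the first parent's name, or '' when the file lists no parents."""
    # .get() tolerates a missing 'parents' key; 'or []' also covers None.
    parents = file.get('parents') or []
    return get_parent_name(parents[0]) if parents else ''


def lookup(parent_id: str) -> str:
    # Stand-in for the real Drive lookup (hypothetical).
    return f"folder-{parent_id}"


print(get_parents_string({'parents': ['abc']}, lookup))
print(get_parents_string({'name': 'shared.doc'}, lookup))
```

With this shape, a file without parents is indexed with an empty location instead of aborting the whole indexing run.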

[bug] docker container killed and not work

Hey, I tried to run the application on an Ubuntu server, using the command:

docker run -p 4000:80 -v ~/.gerev/storage:/opt/storage gerev/gerev

But port 80 cannot be accessed, and the container exits on its own after a while.

docker logs
---
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
running online
Killed

CONTAINER ID   IMAGE                               COMMAND                  CREATED          STATUS                 
7296daa1314a   gerev/gerev                         "/bin/sh -c ./run.sh"    4 minutes ago    Exited (137) 3 minutes ago

However, there is no additional information in the log that would facilitate debugging. If there are other means to aid diagnosis, I can provide assistance.

Add email as a data source

People should be able to connect their email to Gerev so that any incoming emails are added to their data.

Slack         :  10 million users
Notion        :  10 million users
Confluence    :  25 million users
Email         : 4.4 BILLION users

Ideally, this should be able to access emails sent before joining Gerev too. I think this could be done through the user's Google account. That might trigger security concerns in some people, but if the end product is good enough, people will use it despite security concerns. We can just have a subheading that says:

your data is safe

(Also, we should actually have good enough security to avoid a data leak.)

Invite link to discord is broken

UI is a little glitched on small laptop screens

GitHub integration

To find everything that resides on GH, including organization repos, files, branches, tags, releases, issues, PRs, and discussions.

ValueError: time data '2017-12-05T15:00:24.972+08:00' does not match format '%Y-%m-%dT%H:%M:%S.%fZ'

time format error.
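
The format string in the error ends with a literal `Z`, so it rejects timestamps that carry a numeric offset such as `+08:00`. A minimal sketch of a more tolerant parser, assuming Python 3.7 or later (where `%z` accepts both `+08:00` and `Z`); `parse_timestamp` is a hypothetical helper name, not a function from the project:

```python
from datetime import datetime, timezone

# %z replaces the literal 'Z' so numeric UTC offsets are accepted as well.
ISO_WITH_OFFSET = '%Y-%m-%dT%H:%M:%S.%f%z'


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp with fractional seconds and a UTC offset or 'Z'."""
    return datetime.strptime(value, ISO_WITH_OFFSET)


parsed = parse_timestamp('2017-12-05T15:00:24.972+08:00')
# Normalize to UTC so timestamps from different sources compare consistently.
print(parsed.astimezone(timezone.utc).isoformat())
```

The result is a timezone-aware `datetime`, so code that compares it with naive datetimes would need to be adjusted as well.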

Email Spam

Stop with the email spam, thanks.

docker image doesn't work with ports other than 80

I have other containers running, notably a Heimdall landing page, which takes precedence. I tried running docker run --rm -p 9898:80 -v ~/.gerev/storage:/opt/storage gerev/gerev and, whilst this worked for showing the webpage, it did not allow me to add data sources. Switching from -p 9898:80 to -p 80:80 solved this issue. From the looks of it, it seems to be a CORS issue.

Google Drive (Shared workspace drive) bug.

This currently doesn't work because of an invalid conditional; we should fix it.

Google Drive source file processing fails due to missing lastModifyingUser.photoLink field

The following access to the lastModifyingUser.photoLink field fails

gerev/app/data_sources/google_drive.py

Line 184 in 20d0245

author_image_url=file['lastModifyingUser']['photoLink'],

2023-03-27 09:26:51,039 | INFO | google_drive.py:140 | processing file ********************
2023-03-27 09:26:52,481 | ERROR | base_data_source.py:95 | Error while indexing data source
Traceback (most recent call last):
  File "/app/data_source_api/base_data_source.py", line 93, in index
    self._feed_new_documents()
  File "/app/data_sources/google_drive.py", line 199, in _feed_new_documents
    self._index_files_from_drive(drive)
  File "/app/data_sources/google_drive.py", line 184, in _index_files_from_drive
    author_image_url=file['lastModifyingUser']['photoLink'],
KeyError: 'photoLink'

Google's documentation mentions that it may not be defined:
lastModifyingUser.photoLink string A link to the user's profile photo, if available.
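
Since the field is optional, one defensive option is to read it with `dict.get` rather than by direct indexing. The helper below is a sketch only; `get_user_field` is a hypothetical name, and it assumes the file metadata is a plain dict as in the traceback above.

```python
from typing import Optional


def get_user_field(file: dict, field: str, default: Optional[str] = None) -> Optional[str]:
    """Read a field of lastModifyingUser, tolerating missing keys."""
    # Either the user object or the individual field may be absent.
    user = file.get('lastModifyingUser') or {}
    return user.get(field, default)


file = {'lastModifyingUser': {'displayName': 'Ada'}}  # no photoLink present
print(get_user_field(file, 'photoLink'))
print(get_user_field(file, 'displayName'))
```

The caller can then pass the result, possibly `None`, as the author image URL and let the UI decide how to render a missing avatar.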

Saudi ?

How to handle updated documents

Let's say I have a GitLab data source that indexes issues, and one day I update one of the issues. Does the index need to overwrite the version that's already indexed, or does it just add "duplicate" data, leaving two sources of "truth"?
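
To illustrate the "overwrite" option, here is a minimal sketch of an upsert keyed by the document's ID in its source. This is an assumption about one possible design, not a description of how Gerev behaves; `Document` and `InMemoryIndex` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Document:
    data_source: str
    id_in_data_source: str
    content: str


class InMemoryIndex:
    """Hypothetical index that keeps exactly one version per source document."""

    def __init__(self) -> None:
        self._docs: Dict[Tuple[str, str], Document] = {}

    def upsert(self, doc: Document) -> None:
        # Same (source, id) key: the newer version replaces the older one.
        self._docs[(doc.data_source, doc.id_in_data_source)] = doc

    def __len__(self) -> int:
        return len(self._docs)


index = InMemoryIndex()
index.upsert(Document('gitlab', 'issue-1', 'original text'))
index.upsert(Document('gitlab', 'issue-1', 'edited text'))
print(len(index))
```

Keying on the source-side ID avoids two competing versions of the same issue, at the cost of losing the older text.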

./ui/build is not exists!

In Dockerfile, line 18:
COPY ./ui/build /ui

But ./ui/build does not exist now!

Container won't start

I have tried twice to install Gerev but each time I get this problem:

INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 9c2f5b290b16, Add fields to DataSourceType model
INFO  [alembic.runtime.migration] Running upgrade 9c2f5b290b16 -> 792a820e9374, document id_in_data_source
INFO  [alembic.runtime.migration] Running upgrade 792a820e9374 -> 513db5127df7, Your migration message here
INFO  [alembic.runtime.migration] Running upgrade 513db5127df7 -> 4d9562314bd3, parnet
INFO  [alembic.runtime.migration] Running upgrade 4d9562314bd3 -> 836a5f803c4d, status
running online
(sqlite3.OperationalError) duplicate column name: is_active
[SQL: ALTER TABLE document ADD COLUMN is_active BOOLEAN]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Killed
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
running online
Killed

Spelling error

Tired of searching for that one docuement you know exists somewhere, but not sure exactly where?

docuement --> document

Chat message icon should contain user avatar + data source logo

Instead of showing just data source logo for each search result,
we want to have avatar+logo for chat results.

Figma design:

Relevant file: search-result.tsx

Support wikirpc

Support for wikirpc, like DokuWiki's implementation (https://www.dokuwiki.org/devel:xmlrpc), would be good.

Support PDF in GDrive

See similar parsers here:

Add new parser:
e.g: docx, txt, html, pptx etc...
https://github.com/GerevAI/gerev/tree/main/app/parsers
add file type:
app/data_source_api/basic_document.py
Google Drive support: app/data_sources/google_drive.py
...

Problem with accessing Gerev behind an SSL proxy

baseUrl seems to be hardcoded with the http scheme; we need to change that, otherwise users are getting:

What amount of data can this engine handle?

Will I be able to index more than 1 million documents, each approximately 150 - 200 KB in size? I noticed that SQLite is "under the hood" and it raised some concerns...

Support answer.dev

The best open-source competitor of Stackoverflow in my opinion is https://answer.dev/

https://github.com/answerdev/answer

Can we hope for an integration?

Support MediaWiki

The engine used by Wikipedia and many other Wikipedia-like sites.

Suggestion: wrap data_source clients

Using SDKs and official APIs is nice and easy to read, implement, and maintain, but where the code calls requests directly, I think wrapping those requests in a client would make future changes easier.
The same goes for the official APIs and SDKs, though those are much less of a pain when reading the code itself.

Creating a client.py file or a data_class/client folder seems right to me. Ideally, when something changes in the API itself, you want your code to keep working; when the API is wrapped in your own class, you only need to touch the client, and much less code changes in the data source itself.

As for making the code safer: this folder could be a nice place to start writing tests and mocks for each client, but that's a whole other topic.
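
A rough sketch of what such a wrapper could look like. Everything here is hypothetical (`WikiClient`, the `/spaces/.../pages` path, the payload shape); the point is only that the data source talks to the client, and the client hides the HTTP details behind an injectable transport that tests can replace.

```python
from typing import Callable, Dict, List

# Transport: takes a URL path and returns parsed JSON (here, a list of dicts).
Transport = Callable[[str], List[Dict]]


class WikiClient:
    """Hypothetical thin client; the data source never builds URLs itself."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def list_pages(self, space: str) -> List[str]:
        # If the API path or payload shape changes, only this method changes.
        return [page['title'] for page in self._transport(f"/spaces/{space}/pages")]


def fake_transport(path: str) -> List[Dict]:
    # Test double standing in for real HTTP requests.
    return [{'title': f"page from {path}"}]


client = WikiClient(fake_transport)
print(client.list_pages('docs'))
```

In production the transport would wrap the real HTTP call; in tests it is a plain function, which is where the mocks mentioned above would live.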

Receive "Confluence returned status code 429 for document xxxx" warning

I am receiving a "Confluence returned status code 429 for document xxxx" warning in the logs while trying to index a self-hosted Confluence instance, meaning that a lot of the pages in the space are skipped. I don't see any rate-limiting in the code, and I can't see a way around this either. Got any hints?
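
One common approach to HTTP 429 is to retry with a delay instead of skipping the page, honoring the server's Retry-After hint when it is present. The sketch below keeps the HTTP call abstract: `fetch` is a hypothetical callable returning `(status_code, retry_after_seconds_or_None, body)`, so nothing here reflects the project's actual Confluence code.

```python
import time
from typing import Callable, Optional, Tuple

# A fetch function returns (status_code, retry_after_seconds_or_None, body).
Fetch = Callable[[], Tuple[int, Optional[float], str]]


def fetch_with_backoff(fetch: Fetch, max_attempts: int = 5, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry on HTTP 429, honoring Retry-After when the server sends it."""
    for attempt in range(max_attempts):
        status, retry_after, body = fetch()
        if status != 429:
            return body
        if attempt == max_attempts - 1:
            break  # out of attempts, do not sleep again
        # Prefer the server's hint; otherwise back off exponentially.
        delay = retry_after if retry_after is not None else base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")


# Simulated responses: two rate-limit replies, then success.
responses = iter([(429, None, ''), (429, 0.5, ''), (200, None, 'page body')])
delays = []
print(fetch_with_backoff(lambda: next(responses), sleep=delays.append))
print(delays)
```

Injecting `sleep` keeps the example testable without real waiting; a real data source would use the default `time.sleep`.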

Support plain Website

Add support to index any website with plain HTML.

Use-case: Index a documentation website like https://docs.appuio.cloud/ to ask questions about this documentation. There are countless documentation pages like this and it would be a great addition.

Discord invite link expired

Hi,

I tried to join Discord to get an early access code, but the invite link expired.

Can you please advise?

Thanks

Support Atlassian Confluence Cloud

Issue

Currently, only self-hosted Atlassian Confluence is supported. However, Atlassian Cloud is widely adopted and has a nearly identical API endpoint structure.

Currently Gerev suggests creating a PAT in order for the system to authenticate and pull data. That PAT is then used as a bearer token to the upstream Atlassian API. For Cloud installations, account-specific API tokens are allowed, which are used as basic auth where the username is the account holder's email address, and the password is the generated API key.

From there, the same endpoint structure and payloads are used to scrape the /space endpoint

API Tokens are managed here

Further Questions

How do we distinguish auth types? It doesn't make sense to duplicate the entire data source logic just to support a different auth scheme.
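
One way to avoid duplicating the data source is to isolate the difference in a single function that builds the Authorization header. In this sketch, the presence of an account email selects basic auth (Cloud) and its absence selects bearer auth (self-hosted PAT), following the description above; `build_auth_header` and that selection rule are assumptions, not existing project code.

```python
import base64
from typing import Dict, Optional


def build_auth_header(token: str, email: Optional[str] = None) -> Dict[str, str]:
    """Bearer auth for a PAT; basic auth when an account email is supplied."""
    if email:
        # Cloud: email as the username, API token as the password.
        credentials = base64.b64encode(f"{email}:{token}".encode()).decode()
        return {'Authorization': f'Basic {credentials}'}
    # Self-hosted: personal access token sent as a bearer token.
    return {'Authorization': f'Bearer {token}'}


print(build_auth_header('my-pat'))
print(build_auth_header('my-api-token', email='user@example.com'))
```

The rest of the data source could then pass these headers to every request and stay unaware of which scheme is in use.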

feat: Nextcloud Support

After seeing what was done with the Google Drive support, I would like to recommend, and if possible somehow sponsor, a Nextcloud feature. It would pair up nicely with the new Bookstack integration.

No image found for data source causes UI bugs

When there's no image for at least one of the data sources, the UI won't load any data source.
This bug originates in app/api/data_source.py at line 33:

@staticmethod
def from_data_source_class(name: str, data_source_class: BaseDataSource) -> 'DataSourceTypeDto':
    with open(f"static/data_source_icons/{name}.png", "rb") as file:
        encoded_string = base64.b64encode(file.read())
        image_base64 = f"data:image/png;base64,{encoded_string.decode()}"

    return DataSourceTypeDto(
        name=name,
        display_name=data_source_class.get_display_name(),
        config_fields=data_source_class.get_config_fields(),
        image_base64=image_base64,
        has_prerequisites=data_source_class.has_prerequisites()
    )

The function is lacking error handling, handle the errors and fix it.
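
A possible shape for that error handling, sketched as a standalone helper: try the source's own icon, then a fallback icon, and finally return an empty string so that one missing file cannot break the whole listing. `load_icon_base64`, the `default` fallback name, and the empty-string convention are all assumptions, not the project's code.

```python
import base64
from pathlib import Path


def load_icon_base64(name: str, icon_dir: Path = Path("static/data_source_icons"),
                     fallback: str = "default") -> str:
    """Return a data URI for the icon, trying a fallback icon before giving up."""
    for candidate in (name, fallback):
        try:
            data = (icon_dir / f"{candidate}.png").read_bytes()
        except OSError:
            # Missing or unreadable icon: try the next candidate.
            continue
        return f"data:image/png;base64,{base64.b64encode(data).decode()}"
    # No icon at all: return an empty string instead of raising.
    return ""


print(repr(load_icon_base64("no_such_source", icon_dir=Path("no_such_dir"))))
```

The DTO factory could call this helper in place of the bare `open`, and the UI would then need to tolerate an empty `image_base64`.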

Google Drive source file processing fails due to missing lastModifyingUser.displayName field

The following access to the lastModifyingUser.displayName field fails

gerev/app/data_sources/google_drive.py

Line 183 in 20d0245

author=file['lastModifyingUser']['displayName'],

2023-03-27 10:05:46,340 | INFO | google_drive.py:140 | processing file ********************
2023-03-27 10:05:47,523 | ERROR | base_data_source.py:95 | Error while indexing data source
Traceback (most recent call last):
  File "/app/data_source_api/base_data_source.py", line 93, in index
    self._feed_new_documents()
  File "/app/data_sources/google_drive.py", line 199, in _feed_new_documents
    self._index_files_from_drive(drive)
  File "/app/data_sources/google_drive.py", line 183, in _index_files_from_drive
     author=file['lastModifyingUser']['displayName'],
KeyError: 'displayName'

Google's documentation DOES NOT mention that it may not be defined, but it was for me:
lastModifyingUser.displayName string A plain text displayable name for this user.

It is probably related to a user in a shared Google workspace that was deleted.

Support Phabricator

Phabricator is a web application collaboration tool, which includes a wiki, a code review tool, diffusion repository browser, a bug tracker, kanban, etc. It has integration with Git, Mercurial, and Subversion.

Add UI functionality to delete data sources

Add a pencil icon at the top right of the Data Source screen; clicking it should turn on edit mode.
In edit mode, connected data sources should have a " - " button instead of a " + ".
When hovering over the "minus" button on a connected data source in edit mode, it should turn red or something similar.
On click, call a function that makes an "axios.post" to delete the integration by its name.
In the backend, add an empty endpoint that just returns 200 OK.
Show a toast on success.

What does the main repo image text mean?

The main repo image

says

Steve Job* searches with gerev.ai
*similarly named to someone famous, we paid him to do so

I get from the tone that this is meant to be a joke of some kind, but I don't get it. I'm a native English speaker and this goes over my head.

Is the idea that the name of the AI is "Steve Job" (like Siri or Dall-E) and it's a play on "Steve Jobs"?
Is the idea that Steve Jobs likes this tool so much he's using it from beyond the grave?
Is the joke that you paid a dead person (which isn't possible)?

I only mention it as you may want to modify this image to something that more people understand. Also, it's a little icky to be making a joke about a dead guy using your product, if indeed that is the joke.