
gerevai / gerev


🧠 AI-powered enterprise search engine 🔎

Home Page: https://app.klu.so/signup?utm_source=github_gerevai

License: MIT License

Dockerfile 0.17% Python 60.24% HTML 0.79% CSS 0.61% JavaScript 0.59% TypeScript 37.15% Shell 0.22% Mako 0.23%
ai search-engine workplace-search enterprise-search chatgpt semantic-search-engine similarity-search confluence machine-learning vector-search


gerev's Issues

Suggestion: Change data source file structure

TL;DR - Make each data source a package instead of a module.

Change the data_sources folder to contain a folder for each source instead of a file.

It will make the code less "monolithic" without creating a mess. You could implement your own clients/wrappers for each source, which will make the code much easier to maintain, add files for testing, and so on.

Have a look at CloudQuery's solution. (although it's an extremely different product for a different use case..)
I guess it's better to do that sooner than later ;/

Google Drive source fails indexing due to missing 'parents'

The document in question was shared from another user's Drive (in case that matters).

2023-03-28 16:40:40,412 | INFO | google_drive.py:138 | processing file [redacted]
2023-03-28 16:40:41,468 | ERROR | base_data_source.py:120 | Error while indexing data source
Traceback (most recent call last):
  File "/app/data_source/api/base_data_source.py", line 118, in index
    self._feed_new_documents()
  File "/app/data_source/sources/google_drive/google_drive.py", line 105, in _feed_new_documents
    self._feed_drive(drive=drive)
  File "/app/data_source/sources/google_drive/google_drive.py", line 131, in _feed_drive
    self._feed_file(file)
  File "/app/data_source/sources/google_drive/google_drive.py", line 172, in _feed_file
    parent_name = self._get_parents_string(file)
  File "/app/data_source/sources/google_drive/google_drive.py", line 101, in _get_parents_string
    return self._get_parent_name(file['parents'][0]) if file['parents'] else ''
KeyError: 'parents'
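
One possible guard, as a minimal sketch: read the key with dict.get so that a file without a 'parents' entry falls back to an empty string. The function get_parents_string and the get_parent_name callback below are hypothetical stand-ins for the methods named in the traceback, not the project's actual code.

```python
from typing import Callable


def get_parents_string(file: dict, get_parent_name: Callable[[str], str]) -> str:
    """Return the first parent's name, or '' when the file has no parents."""
    # The traceback shows 'parents' can be absent from the file metadata,
    # so use .get() instead of indexing the key directly.
    parents = file.get("parents") or []
    return get_parent_name(parents[0]) if parents else ""


# Hypothetical parent lookup used only for this example.
names = {"folder-1": "Team Docs"}
print(get_parents_string({"id": "a", "parents": ["folder-1"]}, names.__getitem__))
print(repr(get_parents_string({"id": "b"}, names.__getitem__)))
```

The same pattern covers both a missing key and an empty list, so the caller does not need a separate check.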

[bug] docker container killed and not work

Hey, I tried to run the application on an Ubuntu server, using the command:

docker run -p 4000:80 -v ~/.gerev/storage:/opt/storage gerev/gerev

But port 80 cannot be accessed, and the container exits on its own after a period of time.

docker logs
---
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
running online
Killed
CONTAINER ID   IMAGE                               COMMAND                  CREATED          STATUS                 
7296daa1314a   gerev/gerev                         "/bin/sh -c ./run.sh"    4 minutes ago    Exited (137) 3 minutes ago

However, there is no additional information in the log that would facilitate debugging. If there are other means to aid diagnosis, I can provide assistance.
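
The exit status itself carries a hint. By convention, a status above 128 means the process was terminated by a signal whose number is the status minus 128. A minimal sketch that decodes it follows; signal_from_exit_status is a hypothetical helper. Whether the kernel's out-of-memory killer sent the signal in this case is an assumption that would need checking on the host.

```python
import signal


def signal_from_exit_status(status: int) -> str:
    """Map an exit status above 128 to the name of the terminating signal."""
    if status <= 128:
        return f"exited normally with status {status}"
    # Raises ValueError if the number does not correspond to a known signal.
    return signal.Signals(status - 128).name


print(signal_from_exit_status(137))
```

On Linux this reports SIGKILL for status 137, which matches the bare "Killed" line in the log.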

Add email as a data source

People should be able to connect their email to Gerev, and any incoming emails will be added to their data.

Slack         :  10 million users
Notion        :  10 million users
Confluence    :  25 million users
Email         : 4.4 BILLION users

Ideally, this should be able to access emails sent before joining Gerev too. I think this could be done through the user's Google account. That might trigger security concerns in some people, but if the end product is good enough, people will use it despite security concerns. We can just have a subheading that says:

your data is safe

(Also, we should actually have good enough security to avoid a data leak.)

GitHub integration

To find everything that resides on GH, including organization repos, files, branches, tags, releases, issues, PRs, and discussions.

docker image doesn't work with ports other than 80

I have other containers running, notably a Heimdall landing page which takes precedence. I tried running docker run --rm -p 9898:80 -v ~/.gerev/storage:/opt/storage gerev/gerev and, while this did show the webpage, it did not allow me to add data sources. Switching from -p 9898:80 to -p 80:80 solved this issue. From the looks of it, it seems to be a CORS issue.


Google Drive source file processing fails due to missing lastModifyingUser.photoLink field

The following access to the lastModifyingUser.photoLink field fails

author_image_url=file['lastModifyingUser']['photoLink'],

2023-03-27 09:26:51,039 | INFO | google_drive.py:140 | processing file ********************
2023-03-27 09:26:52,481 | ERROR | base_data_source.py:95 | Error while indexing data source
Traceback (most recent call last):
  File "/app/data_source_api/base_data_source.py", line 93, in index
    self._feed_new_documents()
  File "/app/data_sources/google_drive.py", line 199, in _feed_new_documents
    self._index_files_from_drive(drive)
  File "/app/data_sources/google_drive.py", line 184, in _index_files_from_drive
    author_image_url=file['lastModifyingUser']['photoLink'],
KeyError: 'photoLink'

Google's documentation mentions that it may not be defined:
lastModifyingUser.photoLink string A link to the user's profile photo, if available.

How to handle updated documents

Let's say I have a GitLab data source that indexes issues, and one day I update one of the issues. Does the new version overwrite the one that's already indexed, or is it added as "duplicate" data, leaving two sources of "truth"?
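
One common way to avoid duplicates, sketched here under assumptions, is to key every indexed document on a stable identifier taken from the source and overwrite on re-index. IndexedDocument, upsert, and the in-memory dict are hypothetical names for illustration; this is not necessarily how Gerev handles updates.

```python
from dataclasses import dataclass


@dataclass
class IndexedDocument:
    source: str        # e.g. "gitlab"
    id_in_source: str  # stable ID assigned by the data source
    content: str


def upsert(index: dict, doc: IndexedDocument) -> bool:
    """Insert or overwrite a document; return True if an older version was replaced."""
    key = (doc.source, doc.id_in_source)
    replaced = key in index
    index[key] = doc  # overwriting keeps a single source of truth
    return replaced


index: dict = {}
upsert(index, IndexedDocument("gitlab", "issue-42", "original text"))
upsert(index, IndexedDocument("gitlab", "issue-42", "edited text"))
print(len(index), index[("gitlab", "issue-42")].content)
```

With a vector store behind the index, the overwrite step would also need to delete the old embeddings for that key before adding the new ones.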

Container won't start

I have tried twice to install Gerev but each time I get this problem:

INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 9c2f5b290b16, Add fields to DataSourceType model
INFO  [alembic.runtime.migration] Running upgrade 9c2f5b290b16 -> 792a820e9374, document id_in_data_source
INFO  [alembic.runtime.migration] Running upgrade 792a820e9374 -> 513db5127df7, Your migration message here
INFO  [alembic.runtime.migration] Running upgrade 513db5127df7 -> 4d9562314bd3, parnet
INFO  [alembic.runtime.migration] Running upgrade 4d9562314bd3 -> 836a5f803c4d, status
running online
(sqlite3.OperationalError) duplicate column name: is_active
[SQL: ALTER TABLE document ADD COLUMN is_active BOOLEAN]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Killed
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
running online
Killed
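
The "duplicate column name" error suggests the ALTER TABLE statement ran against a database file in which the column already existed, for example a storage directory left over from an earlier attempt. A minimal sketch of a guard using only the standard sqlite3 module follows; add_column_if_missing is a hypothetical helper, not part of Alembic or Gerev.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str,
                          column: str, ddl_type: str) -> bool:
    """Add a column unless it already exists; return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    # Identifiers cannot be bound as parameters, so pass only trusted names.
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY)")
print(add_column_if_missing(conn, "document", "is_active", "BOOLEAN"))
print(add_column_if_missing(conn, "document", "is_active", "BOOLEAN"))
```

The first call adds the column and the second is a no-op, so rerunning the step against an existing database does not fail.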

Spelling error

Tired of searching for that one docuement you know exists somewhere, but not sure exactly where?

docuement --> document

What amount of data can this engine handle?

Will I be able to index more than 1 million documents, each approximately 150 - 200 KB in size? I noticed that SQLite is "under the hood" and it raised some concerns...

Suggestion: wrap data_source clients

Using SDKs and official APIs is nice and very easy to read/implement/maintain, but when using requests directly, I think wrapping those requests in a client could make future changes easier.
The same goes for the official APIs and SDKs, but those are much less of a pain when reading the code itself.

Creating a client.py file or a data_class/client folder seems right to me. Ideally, when something changes in the API itself, you'd want your code to keep working. When you wrap the API in your own class, you need to change much less code in the data source itself and only touch the client side.

As for making the code safer - adding this folder could be a nice place to start writing tests and mocks for each client but that's a whole other topic.
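
A minimal sketch of what such a wrapper could look like, assuming a transport function is injected so it can be replaced by a mock in tests. WikiClient, the /api/pages path, and the response shape are all hypothetical and do not describe any real data source API.

```python
from typing import Callable, List

# A transport takes (method, url) and returns parsed JSON. In real code it
# would wrap requests or an SDK; injecting it makes the client easy to mock.
Transport = Callable[[str, str], dict]


class WikiClient:
    """Hypothetical client that hides endpoint paths and response shapes."""

    def __init__(self, base_url: str, transport: Transport) -> None:
        self._base_url = base_url.rstrip("/")
        self._transport = transport

    def list_page_titles(self) -> List[str]:
        # If the remote API changes, only this method needs to be updated.
        payload = self._transport("GET", f"{self._base_url}/api/pages")
        return [page["title"] for page in payload.get("results", [])]


def fake_transport(method: str, url: str) -> dict:
    """Mock used in place of a real HTTP call."""
    return {"results": [{"title": "Onboarding"}, {"title": "Runbook"}]}


client = WikiClient("https://wiki.example.com/", fake_transport)
print(client.list_page_titles())
```

The data source would then depend only on list_page_titles, and the test suite can exercise it without network access.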

Receive "Confluence returned status code 429 for document xxxx" warning

I am receiving a "Confluence returned status code 429 for document xxxx" warning in the logs while trying to index a self-hosted Confluence instance, meaning that a lot of the pages in the space are skipped. I don't see any rate-limiting in the code, and I can't see a way around this either. Got any hints?
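
One possible workaround, as a sketch rather than a description of Gerev's code: retry a rate-limited request after waiting, honoring a Retry-After value when the server provides one and falling back to exponential backoff otherwise. The fetch callable and its return shape are assumptions made so the example runs without a network.

```python
import time
from typing import Callable, Optional, Tuple

# fetch(url) returns (status_code, retry_after_seconds_or_None, body).
Fetch = Callable[[str], Tuple[int, Optional[float], str]]


def fetch_with_backoff(fetch: Fetch, url: str, max_attempts: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry on HTTP 429; any other status is returned to the caller as-is."""
    for attempt in range(max_attempts):
        status, retry_after, body = fetch(url)
        if status != 429:
            return body
        # Prefer the server's hint; otherwise wait 1x, 2x, 4x ... base_delay.
        delay = retry_after if retry_after is not None else base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts: {url}")


# Simulated server: two 429 responses, then success.
responses = iter([(429, None, ""), (429, 3.0, ""), (200, None, "page body")])
waits = []
print(fetch_with_backoff(lambda url: next(responses),
                         "https://confluence.example.com/page/1",
                         sleep=waits.append))
print(waits)
```

Passing sleep as a parameter keeps the example fast and lets a test assert on the delays instead of actually waiting.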

Support plain Website

Add support to index any website with plain HTML.

Use-case: Index a documentation website like https://docs.appuio.cloud/ to ask questions about this documentation. There are countless documentation pages like this and it would be a great addition.
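
As a rough illustration of the text-extraction step such a data source would need, here is a minimal sketch using only the standard library's html.parser. TextExtractor and html_to_text are hypothetical names; crawling, link discovery, and chunking are left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Docs</h1><p>Install the CLI.</p></body></html>")
print(html_to_text(page))
```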

Discord invite link expired

Hi,

I tried to join Discord to get an early access code, but the invite link expired.

Can you please advise?

Thanks

Support Atlassian Confluence Cloud

Issue

Currently only self-hosted Atlassian Confluence is supported. However Atlassian Cloud is widely adopted and has a nearly identical API endpoint structure.

Currently Gerev suggests creating a PAT in order for the system to authenticate and pull data. That PAT is then used as a bearer token to the upstream Atlassian API. For Cloud installations, account-specific API tokens are allowed, which are used as basic auth where the username is the account holder's email address, and the password is the generated API key.

From there, the same endpoint structure and payloads are used to scrape the /space endpoint

API Tokens are managed here

Further Questions

How do we distinguish auth types? It doesn't make sense to duplicate the entire data source logic just to support a different auth scheme.
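
One way this could work without duplicating the data source, sketched under assumptions: make the account email an optional config field and choose the header from it. build_auth_header is a hypothetical helper, and treating the presence of an email as the switch between the two schemes is an assumption, not an existing Gerev setting.

```python
import base64
from typing import Optional


def build_auth_header(token: str, email: Optional[str] = None) -> dict:
    """Bearer auth for a PAT alone; basic auth when an account email is supplied."""
    if email:
        # Cloud-style: email as the username, API token as the password.
        credentials = base64.b64encode(f"{email}:{token}".encode()).decode()
        return {"Authorization": f"Basic {credentials}"}
    # Self-hosted style: personal access token sent as a bearer token.
    return {"Authorization": f"Bearer {token}"}


print(build_auth_header("my-pat"))
print(build_auth_header("my-api-token", email="user@example.com"))
```

Everything downstream of the header (endpoints, payload parsing) can then stay shared between the two deployment types.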

feat: Nextcloud Support

After seeing what was done with the Google Drive support, I would like to recommend, and if possible somehow sponsor, a Nextcloud feature. It would pair up nicely with the new Bookstack integration.

No image found for data source causes UI bugs

When there's no image for at least one of the data sources, the UI won't load any data source.
This bug originates in app/api/data_source.py in line 33:

@staticmethod
def from_data_source_class(name: str, data_source_class: BaseDataSource) -> 'DataSourceTypeDto':
    with open(f"static/data_source_icons/{name}.png", "rb") as file:
        encoded_string = base64.b64encode(file.read())
        image_base64 = f"data:image/png;base64,{encoded_string.decode()}"

    return DataSourceTypeDto(
        name=name,
        display_name=data_source_class.get_display_name(),
        config_fields=data_source_class.get_config_fields(),
        image_base64=image_base64,
        has_prerequisites=data_source_class.has_prerequisites()
    )

The function lacks error handling. Handle the missing-file error to fix it.
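
A minimal sketch of one way to handle it, assuming the DTO can accept a missing image (or that the caller substitutes a default icon). load_icon_base64 is a hypothetical helper that isolates the file read from the function above.

```python
import base64
from pathlib import Path
from typing import Optional


def load_icon_base64(name: str,
                     icons_dir: str = "static/data_source_icons") -> Optional[str]:
    """Return a data URI for the icon, or None if the file cannot be read."""
    path = Path(icons_dir) / f"{name}.png"
    try:
        encoded = base64.b64encode(path.read_bytes()).decode()
    except OSError:
        # Missing or unreadable icon: let the caller fall back to a default
        # instead of failing the whole data source listing.
        return None
    return f"data:image/png;base64,{encoded}"


print(load_icon_base64("no_such_source"))
```

Catching OSError covers FileNotFoundError as well as permission problems, so one broken icon no longer prevents the other data sources from loading.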

Google Drive source file processing fails due to missing lastModifyingUser.displayName field

The following access to the lastModifyingUser.displayName field fails

author=file['lastModifyingUser']['displayName'],

2023-03-27 10:05:46,340 | INFO | google_drive.py:140 | processing file ********************
2023-03-27 10:05:47,523 | ERROR | base_data_source.py:95 | Error while indexing data source
Traceback (most recent call last):
  File "/app/data_source_api/base_data_source.py", line 93, in index
    self._feed_new_documents()
  File "/app/data_sources/google_drive.py", line 199, in _feed_new_documents
    self._index_files_from_drive(drive)
  File "/app/data_sources/google_drive.py", line 183, in _index_files_from_drive
     author=file['lastModifyingUser']['displayName'],
KeyError: 'displayName'

Google's documentation DOES NOT mention that it may not be defined, but it was undefined for me:
lastModifyingUser.displayName string A plain text displayable name for this user.

It is probably related to a user in a shared Google workspace that was deleted.
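
A minimal sketch of a tolerant lookup for the lastModifyingUser fields, assuming a placeholder author name is acceptable when displayName is absent. extract_author and the "Unknown" placeholder are hypothetical, not the project's actual code.

```python
def extract_author(file: dict) -> tuple:
    """Return (author, author_image_url), tolerating missing user fields."""
    user = file.get("lastModifyingUser") or {}
    author = user.get("displayName", "Unknown")  # hypothetical placeholder name
    image_url = user.get("photoLink", "")        # documented as "if available"
    return author, image_url


print(extract_author({"lastModifyingUser": {"displayName": "Dana"}}))
print(extract_author({"lastModifyingUser": {}}))
print(extract_author({}))
```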

Support Phabricator

Phabricator is a web application collaboration tool, which includes a wiki, a code review tool, diffusion repository browser, a bug tracker, kanban, etc. It has integration with Git, Mercurial, and Subversion.

Add UI functionality to delete data sources

Add a pencil icon at the top right of the Data Source screen; clicking it should turn on edit mode.
In edit mode, connected data sources should have a " - " button instead of a " + ".
When hovering over the "minus" button on a connected data source in edit mode, the button should turn red or be otherwise highlighted.
On click, call a function that makes an "axios.post" request to delete the integration by its name.
In backend, add an empty endpoint that just returns 200 OK.
Show a toast on success.

What does the main repo image text mean?

The main repo image says:

Steve Job* searches with gerev.ai
*similarly named to someone famous, we paid him to do so

I get from the tone that this is meant to be a joke of some kind, but I don't get it. I'm a native English speaker and this goes over my head.

  • Is the idea that the name of the AI is "Steve Job" (like Siri or Dall-E) and it's a play on "Steve Jobs"?
  • Is the idea that Steve Jobs likes this tool so much he's using it from beyond the grave?
  • Is the joke that you paid a dead person (which isn't possible)?

I only mention it as you may want to modify this image to something that more people understand. Also, it's a little icky to be making a joke about a dead guy using your product, if indeed that is the joke.

UI for adding CA certificates

I think there should be a way to add a CA certificate to the container, or at least an option to disable cert checks. I can't add my Bookstack instance because of cert errors.

Support notion

implement NotionDataSource, look at BasicDocument. Good luck!
