gerevai / gerev
AI-powered enterprise search engine
Home Page: https://app.klu.so/signup?utm_source=github_gerevai
License: MIT License
Change the data_sources folder to contain a folder per source instead of a single file per source.
This would make the code less "monolithic" without creating a mess: you could implement your own clients/wrappers for each source, which would make the code much easier to maintain, add files for testing, and so on.
Have a look at CloudQuery's solution (although it's an extremely different product for a different use case).
I guess it's better to do that sooner rather than later ;/
The document in question was shared from another user's Drive (in case that matters).
2023-03-28 16:40:40,412 | INFO | google_drive.py:138 | processing file [redacted]
2023-03-28 16:40:41,468 | ERROR | base_data_source.py:120 | Error while indexing data source
Traceback (most recent call last):
File "/app/data_source/api/base_data_source.py", line 118, in index
self._feed_new_documents()
File "/app/data_source/sources/google_drive/google_drive.py", line 105, in _feed_new_documents
self._feed_drive(drive=drive)
File "/app/data_source/sources/google_drive/google_drive.py", line 131, in _feed_drive
self._feed_file(file)
File "/app/data_source/sources/google_drive/google_drive.py", line 172, in _feed_file
parent_name = self._get_parents_string(file)
File "/app/data_source/sources/google_drive/google_drive.py", line 101, in _get_parents_string
return self._get_parent_name(file['parents'][0]) if file['parents'] else ''
KeyError: 'parents'
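A minimal sketch of a defensive lookup, assuming the Drive API simply omits the `parents` key for files shared from another user's Drive. `parents_string` and `get_parent_name` are hypothetical stand-ins for `_get_parents_string` and `_get_parent_name` in the traceback; this is not the project's actual fix.

```python
from typing import Callable, Mapping


def parents_string(file: Mapping, get_parent_name: Callable[[str], str]) -> str:
    """Return the first parent's name, or '' when the file has no parents."""
    # .get() avoids the KeyError when the 'parents' key is absent entirely,
    # and `or []` also covers an explicit None or empty list.
    parents = file.get('parents') or []
    return get_parent_name(parents[0]) if parents else ''
```

With this shape, a file without `parents` is indexed with an empty parent string instead of aborting the whole indexing run.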
Hey, I tried to run the application on an Ubuntu server, using the command:
docker run -p 4000:80 -v ~/.gerev/storage:/opt/storage gerev/gerev
But port 80 cannot be accessed, and the container exits on its own after a while.
docker logs
---
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
running online
Killed
CONTAINER ID IMAGE COMMAND CREATED STATUS
7296daa1314a gerev/gerev "/bin/sh -c ./run.sh" 4 minutes ago Exited (137) 3 minutes ago
However, there is no additional information in the log that would facilitate debugging. If there are other means to aid diagnosis, I can provide assistance.
People should be able to connect their email to Gerev so that any incoming emails are added to their data.
Slack : 10 million users
Notion : 10 million users
Confluence : 25 million users
Email : 4.4 BILLION users
Ideally, this should be able to access emails sent before joining Gerev too. I think this could be done through the user's google account. That might trigger security concerns in some people, but, if the end product is good enough people will use it despite security concerns. We can just have a subheading that says.
(Also, we should actually have good enough security to avoid a data leak)
.
To find everything that resides on GH, including organization repos, files, branches, tags, releases, issues, PRs, and discussions.
time format error.
Stop with the email spam, thanks.
I have other containers running, notably a Heimdall landing page, which takes precedence. I tried running docker run --rm -p 9898:80 -v ~/.gerev/storage:/opt/storage gerev/gerev and, whilst this worked for showing the webpage, it did not allow me to add data sources. Switching from -p 9898:80 to -p 80:80 solved this issue. From the looks of it, it seems to be a CORS issue.
This currently doesn't work because of an invalid conditional; we should have this fixed.
The following access to the lastModifyingUser.photoLink field fails:
gerev/app/data_sources/google_drive.py
Line 184 in 20d0245
2023-03-27 09:26:51,039 | INFO | google_drive.py:140 | processing file ********************
2023-03-27 09:26:52,481 | ERROR | base_data_source.py:95 | Error while indexing data source
Traceback (most recent call last):
File "/app/data_source_api/base_data_source.py", line 93, in index
self._feed_new_documents()
File "/app/data_sources/google_drive.py", line 199, in _feed_new_documents
self._index_files_from_drive(drive)
File "/app/data_sources/google_drive.py", line 184, in _index_files_from_drive
author_image_url=file['lastModifyingUser']['photoLink'],
KeyError: 'photoLink'
Google's documentation mentions that it may not be defined:
lastModifyingUser.photoLink string A link to the user's profile photo, if available.
Let's say I have a GitLab data source that indexes issues, and one day I update one of the issues. Does the update need to overwrite the version that's already indexed, or does it just add "duplicate" data, leaving two sources of "truth"?
In Dockerfile, line 18:
COPY ./ui/build /ui
But ./ui/build does not exist now!
I have tried twice to install Gerev but each time I get this problem:
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> 9c2f5b290b16, Add fields to DataSourceType model
INFO [alembic.runtime.migration] Running upgrade 9c2f5b290b16 -> 792a820e9374, document id_in_data_source
INFO [alembic.runtime.migration] Running upgrade 792a820e9374 -> 513db5127df7, Your migration message here
INFO [alembic.runtime.migration] Running upgrade 513db5127df7 -> 4d9562314bd3, parnet
INFO [alembic.runtime.migration] Running upgrade 4d9562314bd3 -> 836a5f803c4d, status
running online
(sqlite3.OperationalError) duplicate column name: is_active
[SQL: ALTER TABLE document ADD COLUMN is_active BOOLEAN]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Killed
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
running online
Killed
Tired of searching for that one docuement you know exists somewhere, but not sure exactly where?
docuement --> document
Support for wikirpc, like DokuWiki's implementation (https://www.dokuwiki.org/devel:xmlrpc), would be good.
See similar parsers here:
Add new parser:
e.g: docx, txt, html, pptx etc...
https://github.com/GerevAI/gerev/tree/main/app/parsers
add file type:
Google Drive support: app/data_sources/google_drive.py
...
Will I be able to index more than 1 million documents, each approximately 150 - 200 KB in size? I noticed that SQLite is "under the hood" and it raised some concerns...
The best open-source competitor of Stackoverflow in my opinion is https://answer.dev/
https://github.com/answerdev/answer
Can we hope for an integration?
The engine used by Wikipedia and many other Wikipedia-like sites.
Using SDKs and official APIs is nice and very easy to read, implement, and maintain, but where the code calls requests directly, I think wrapping those calls in a client would make future changes easier.
The same goes for the official APIs and SDKs, although those are much less of a pain when reading the code itself.
Creating a client.py file or a data_class/client folder seems right to me. Ideally, when something changes in the API itself you'd want your code to keep working; when the API is wrapped in your own class, you change much less code in the data source itself and only touch the client side.
As for making the code safer, this folder could be a nice place to start writing tests and mocks for each client, but that's a whole other topic.
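The proposal above could look roughly like the sketch below. Everything in it is illustrative: `WikiClient`, the `/spaces/{space}/pages` path, and the `results`/`title` JSON layout are hypothetical and not taken from Gerev or any real API. The transport is injected so tests can replace the network with a fake, and the default transport uses the standard library rather than requests to keep the example self-contained.

```python
import json
from typing import Callable, Dict, List
from urllib.request import Request, urlopen

# A transport takes (url, headers) and returns the raw response body.
Transport = Callable[[str, Dict[str, str]], bytes]


def urllib_transport(url: str, headers: Dict[str, str]) -> bytes:
    """Default transport backed by the standard library."""
    with urlopen(Request(url, headers=headers)) as response:
        return response.read()


class WikiClient:
    """Hypothetical client: the only place that knows URLs and payload shapes."""

    def __init__(self, base_url: str, token: str,
                 transport: Transport = urllib_transport) -> None:
        self._base_url = base_url.rstrip('/')
        self._headers = {'Authorization': f'Bearer {token}'}
        self._transport = transport

    def list_page_titles(self, space: str) -> List[str]:
        # If the API changes its path or JSON layout, only this method changes;
        # the data source keeps calling list_page_titles().
        url = f'{self._base_url}/spaces/{space}/pages'
        body = self._transport(url, self._headers)
        return [page['title'] for page in json.loads(body)['results']]
```

A data source would then depend only on `list_page_titles`, and a test can pass a fake transport that returns canned JSON bytes.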
I am receiving a "Confluence returned status code 429 for document xxxx" warning in the logs while trying to index a self-hosted Confluence instance, meaning that a lot of the pages in the space are skipped. I don't see any rate-limiting in the code, and I can't see a way around this either. Got any hints?
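One possible way around this is to retry rate-limited requests with backoff. The sketch below assumes the server answers with HTTP 429 and may send a numeric `Retry-After` header; the `fetch` callable and the `(status, headers, body)` tuple are hypothetical simplifications, not Gerev's or Confluence's actual interfaces.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# (status_code, headers, body): a simplified, hypothetical response shape.
Response = Tuple[int, Mapping[str, str], str]


def fetch_with_backoff(
    fetch: Callable[[], Response],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Response]:
    """Call fetch(), retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_retries + 1):
        response = fetch()
        status, headers, _ = response
        if status != 429:
            return response
        if attempt == max_retries:
            break
        # Prefer the server's Retry-After hint (seconds) when it is numeric;
        # otherwise double the delay on every attempt.
        retry_after = headers.get('Retry-After', '')
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return None  # still rate limited after all retries
```

Returning `None` lets the caller decide whether to skip the page or fail the run, rather than silently dropping it.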
Add support to index any website with plain HTML.
Use-case: Index a documentation website like https://docs.appuio.cloud/ to ask questions about this documentation. There are countless documentation pages like this and it would be a great addition.
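As a rough illustration of the text-extraction step such a data source would need, here is a sketch using only Python's standard `html.parser`. It assumes the page is already downloaded and that skipping `<script>` and `<style>` is enough; `TextExtractor` and `html_to_text` are hypothetical names, and crawling, link following, and chunking are out of scope.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return ' '.join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```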
Hi,
I tried to join Discord to get an early access code, but the invite link expired.
Can you please advise?
Thanks
Currently only self-hosted Atlassian Confluence is supported. However Atlassian Cloud is widely adopted and has a nearly identical API endpoint structure.
Currently Gerev suggests creating a PAT in order for the system to authenticate and pull data. That PAT is then used as a bearer token to the upstream Atlassian API. For Cloud installations, account-specific API tokens are allowed, which are used as basic auth where the username is the account holder's email address, and the password is the generated API key.
From there, the same endpoint structure and payloads are used to scrape the /space endpoint.
API Tokens are managed here
How do we distinguish auth types? It doesn't make sense to duplicate the entire data source logic just to support a different auth scheme.
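One way to avoid duplicating the data source is to isolate the difference in a small header builder. The sketch below assumes the two schemes described above (bearer PAT for self-hosted, email plus API token as basic auth for Cloud); the function name and the "email present means Cloud" rule are hypothetical design choices, not existing Gerev behavior.

```python
import base64
from typing import Dict, Optional


def confluence_auth_headers(token: str, email: Optional[str] = None) -> Dict[str, str]:
    """Build the Authorization header for either auth scheme.

    With no email, the token is treated as a self-hosted PAT (bearer token).
    With an email, it is treated as a Cloud API token sent via basic auth.
    """
    if email is None:
        return {'Authorization': f'Bearer {token}'}
    # Basic auth is base64("username:password"), here email and API token.
    credentials = base64.b64encode(f'{email}:{token}'.encode()).decode()
    return {'Authorization': f'Basic {credentials}'}
```

The rest of the data source would then pass these headers to every request without caring which scheme is in use.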
After seeing what was done with the Google Drive support, I would like to recommend, and if possible somehow sponsor, a Nextcloud feature. Together with the new BookStack integration, this would pair up nicely.
When at least one of the data sources has no image, the UI won't load any data source.
This bug originates in app/api/data_source.py, line 33:
    @staticmethod
    def from_data_source_class(name: str, data_source_class: BaseDataSource) -> 'DataSourceTypeDto':
        with open(f"static/data_source_icons/{name}.png", "rb") as file:
            encoded_string = base64.b64encode(file.read())
            image_base64 = f"data:image/png;base64,{encoded_string.decode()}"
        return DataSourceTypeDto(
            name=name,
            display_name=data_source_class.get_display_name(),
            config_fields=data_source_class.get_config_fields(),
            image_base64=image_base64,
            has_prerequisites=data_source_class.has_prerequisites()
        )
The function lacks error handling for a missing icon file; handle the error to fix it.
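A sketch of one possible fix, under the assumption that a missing or unreadable icon should fall back to a generic one instead of failing the whole listing. `load_icon_data_uri` and the `default` fallback icon name are hypothetical; the icon directory is the one from the snippet above.

```python
import base64
from pathlib import Path
from typing import Optional


def load_icon_data_uri(name: str,
                       icons_dir: str = 'static/data_source_icons',
                       fallback: str = 'default') -> Optional[str]:
    """Return a base64 data URI for the icon, the fallback icon, or None."""
    for candidate in (name, fallback):
        path = Path(icons_dir) / f'{candidate}.png'
        try:
            encoded = base64.b64encode(path.read_bytes()).decode()
        except OSError:
            # Missing or unreadable icon: try the next candidate.
            continue
        return f'data:image/png;base64,{encoded}'
    return None
```

`from_data_source_class` could then pass the result as `image_base64`, provided the DTO and the UI accept a missing image, which is also an assumption here.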
The following access to the lastModifyingUser.displayName field fails:
gerev/app/data_sources/google_drive.py
Line 183 in 20d0245
2023-03-27 10:05:46,340 | INFO | google_drive.py:140 | processing file ********************
2023-03-27 10:05:47,523 | ERROR | base_data_source.py:95 | Error while indexing data source
Traceback (most recent call last):
File "/app/data_source_api/base_data_source.py", line 93, in index
self._feed_new_documents()
File "/app/data_sources/google_drive.py", line 199, in _feed_new_documents
self._index_files_from_drive(drive)
File "/app/data_sources/google_drive.py", line 183, in _index_files_from_drive
author=file['lastModifyingUser']['displayName'],
KeyError: 'displayName'
Google's documentation DOES NOT mention that it may not be defined, but it was for me:
lastModifyingUser.displayName string A plain text displayable name for this user.
It is probably related to a user in a shared Google workspace that was deleted.
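Since both `displayName` and `photoLink` have been reported missing, a tolerant accessor may be safer than direct indexing. This is a sketch only: `author_fields` and the `'Unknown'` default are hypothetical, and it assumes the indexer can accept a missing image URL.

```python
from typing import Mapping, Optional, Tuple


def author_fields(file: Mapping,
                  default_name: str = 'Unknown') -> Tuple[str, Optional[str]]:
    """Return (author, author_image_url) without raising on missing keys."""
    # lastModifyingUser itself may be absent, so fall back to an empty dict.
    user = file.get('lastModifyingUser') or {}
    return user.get('displayName', default_name), user.get('photoLink')
```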
Phabricator is a web application collaboration tool, which includes a wiki, a code review tool, diffusion repository browser, a bug tracker, kanban, etc. It has integration with Git, Mercurial, and Subversion.
Add a pencil icon at the top right of the Data Source screen; on click, it should turn on edit mode.
In edit mode, connected data sources should have a " - " button instead of a " + ".
When hovering over the "minus" button on a connected data source in edit mode, it should turn red or something.
On click, call a function that makes an "axios.post" to delete the integration by its name.
In the backend, add an empty endpoint that just returns 200 OK.
Show a toast on success.
The main repo image says:
Steve Job* searches with gerev.ai
*similarly named to someone famous, we paid him to do so
I get from the tone that this is meant to be a joke of some kind, but I don't get it. I'm a native English speaker and this goes over my head.
I only mention it as you may want to modify this image to something that more people understand. Also, it's a little icky to be making a joke about a dead guy using your product, if indeed that is the joke.
I think there should be a way to add a CA certificate to the container, or at least an option to disable certificate checks. I can't add my BookStack instance because of certificate errors.
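For illustration, both options could be expressed with Python's standard `ssl` module as below. How Gerev would expose them (for example through configuration fields or environment variables) is not specified in the request, so `build_ssl_context` and its parameters are hypothetical.

```python
import ssl
from typing import Optional


def build_ssl_context(ca_file: Optional[str] = None,
                      verify: bool = True) -> ssl.SSLContext:
    """Create an SSL context that trusts an extra CA file or skips checks."""
    # With ca_file=None the system's default trust store is loaded.
    context = ssl.create_default_context(cafile=ca_file)
    if not verify:
        # Order matters: check_hostname must be disabled before CERT_NONE is set.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

The resulting context can be passed to `urllib.request.urlopen(url, context=context)`. Disabling verification removes protection against impersonation, so trusting a specific CA file is the safer of the two options.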
Implement NotionDataSource, look at BasicDocument. Good luck!
It would be helpful if Gerev could find everything stored in ClickUp :)
Possibly on other OSes as well, but on Chrome/Safari/Firefox on macOS, clicking 'copy' doesn't work, and the UI doesn't show the manifest so that you could copy and paste it manually.