kantord / seagoat

local-first semantic code search engine

Home Page: https://kantord.github.io/SeaGOAT/

License: MIT License

Python 99.64% Shell 0.36%
ai code-search code-search-engine embeddings grep grep-like llm regular-expression ripgrep vector-database

seagoat's Introduction

SeaGOAT

A code search engine for the AI age. SeaGOAT is a local search tool that leverages vector embeddings to enable you to search your codebase semantically.

Getting started

Install SeaGOAT

In order to install SeaGOAT, you need to have the following dependencies already installed on your computer:

  • Python 3.11 or newer
  • ripgrep
  • bat (optional, highly recommended)

When bat is installed, it is used to display results as long as color is enabled. When SeaGOAT is used as part of a pipeline, a grep-like output format is used. When color is enabled but bat is not installed, SeaGOAT highlights the output using pygments. Using bat is recommended.

To install SeaGOAT using pipx, use the following command:

pipx install seagoat

System requirements

Hardware

Should work on any decent laptop.

Operating system

SeaGOAT is designed to work on Linux (tested ✅), macOS (partly tested, help 🙏) and Windows (help needed 🙏).

Start SeaGOAT server

In order to use SeaGOAT in your project, you have to start the SeaGOAT server using the following command:

seagoat-server start /path/to/your/repo

Search your repository

If you have the server running, you can simply use the gt or seagoat command to query your repository. For example:

gt "Where are the numbers rounded"

You can also use regular expressions in your queries, for example:

gt "function calc_.* that deals with taxes"

Stopping the server

You can stop the running server using the following command:

seagoat-server stop /path/to/your/repo

Configuring SeaGOAT

SeaGOAT can be tailored to your needs through YAML configuration files, either globally or project-specifically with a .seagoat.yml file. For instance:

# .seagoat.yml

server:
  port: 31134  # Specify server port

Check out the documentation for more details!

Development

Requirements:

Install dependencies

After cloning the repository, install dependencies using the following command:

poetry install

Running tests

Watch mode (recommended)

poetry run ptw

Test changed files

poetry run pytest . --testmon

Test all files

poetry run pytest .

Manual testing

You can test any SeaGOAT command manually in your local development environment. For example, to test the development version of the seagoat-server command, you can run:

poetry run seagoat-server start ~/path/an/example/repository

FAQ

The points in this FAQ are indications of how SeaGOAT works, but are not a legal contract. SeaGOAT is licensed under an open source license and if you are in doubt about the privacy/safety/etc implications of SeaGOAT, you are welcome to examine the source code, raise your concerns, or create a pull request to fix a problem.

How does SeaGOAT work? Does it send my data to ChatGPT?

SeaGOAT does not rely on 3rd party APIs or any remote APIs and executes all functionality locally using the SeaGOAT server that you are able to run on your own machine.

Instead of relying on APIs or "connecting to ChatGPT", it uses the vector database called ChromaDB, with a local vector embedding engine and telemetry disabled by default.

Apart from that, SeaGOAT also uses ripgrep, a regular-expression based code search engine, in order to provide regular expression/keyword based matches in addition to the "AI-based" matches.

While the current version of SeaGOAT does not send your data to remote servers, it might be possible that in the future there will be optional features that do so, if any further improvement can be gained from that.

Why does SeaGOAT need a server?

SeaGOAT needs a server in order to provide a speedy response. SeaGOAT heavily relies on vector embeddings and vector databases, which at the moment cannot be replaced with an architecture that processes files on the fly.

It's worth noting that you can run the SeaGOAT server entirely locally, and it works even if you don't have an internet connection. This use case does not require you to share data with a remote server. It is also possible to run a SeaGOAT server and allow other computers to connect to it, if you so wish.

Does SeaGOAT create AI-derived work? Is SeaGOAT ethical?

If you are concerned about the ethical implications of using AI tools, keep in mind that SeaGOAT is not a code generator but a code search engine, and therefore it does not create AI-derived work.

That being said, a language model is being used to generate vector embeddings. At the moment SeaGOAT uses ChromaDB's default model for calculating vector embeddings, and I am not aware of this being an ethical concern.

What programming languages are supported?

Currently SeaGOAT is hard coded to only process files in the following formats:

  • Text Files (*.txt)
  • Markdown (*.md)
  • Python (*.py)
  • C (*.c, *.h)
  • C++ (*.cpp, *.cc, *.cxx, *.hpp)
  • TypeScript (*.ts, *.tsx)
  • JavaScript (*.js, *.jsx)
  • HTML (*.html)
  • Go (*.go)
  • Java (*.java)
  • PHP (*.php)
  • Ruby (*.rb)

Why is SeaGOAT processing files so slowly while barely using my CPU?

Since processing files for large repositories can take a long time, SeaGOAT is designed to allow you to use your computer while processing files. It is an intentional design choice to avoid blocking/slowing down your computer.

This design decision does not affect the performance of queries.

By the way, you are able to use SeaGOAT to query your repository while it's processing your files! When you make a query, and the files are not processed yet, you will receive a warning with an estimation of the accuracy of your results. Also, regular expression/full text search based results will be displayed from the very beginning!

What character encodings are supported?

The preferred character encoding is UTF-8. Most other character encodings should also work. Only text files are supported, SeaGOAT ignores binary files.

Where does SeaGOAT store its database/cache?

Where SeaGOAT stores databases and cache depends on your operating system. For your convenience, you can use the seagoat-server server-info command to find out where these files are stored on your system.

Can I host SeaGOAT server on a different computer?

Yes, if you would like to use SeaGOAT without having to run the server on the same computer, you can simply self-host SeaGOAT server on a different computer or in the cloud, and configure the seagoat/gt command to connect to this remote server through the internet.

Keep in mind that SeaGOAT itself does not enforce any security as it is primarily designed to run locally. If you have private code that you do not wish to leak, you will have to make sure that only trusted people have access to the SeaGOAT server. This could be done by making it only available through a VPN that only your teammates can access.

Can I ignore files/directories?

SeaGOAT already ignores all files/directories ignored in your .gitignore. If you wish to ignore additional files but keep them in git, you can use the ignorePatterns attribute from the server configuration.

seagoat's People

Contributors

0scvr, actions-user, ashishdatta, bhargavshirin, danipozo, elouafiqali, eltociear, hkabig, ka1bi4, kantord, ldhough, lukehinds, mb, photonbit, renovate[bot], sh4d0wy, tanishq-dubey, tbetous, tsultanov

seagoat's Issues

Configure gt using custom host and port

I'm running seagoat-server in the cloud and want to configure gt to use an external host and port instead of http://0.0.0.0:35257. Please consider adding this feature.

remote seagoat-server and local gt client path mismatch causes FileNotFoundError

I'm trying to run seagoat-server on a remote host. I've got it running and exposed, and my client is configured correctly. When I run the server in the foreground, I can see my query hit the server, but the query ends with a FileNotFoundError. The error references a path that definitely exists in the remote repo, and the file also exists in the local repo at a different path. The two repos are structurally the same, of course, but sit under different parent paths.

If I ln -s the repo to the same parent on the server side and then run the server from the symlink path, I don't receive an error.

Not sure if this is expected and if so whether this workaround should be documented somewhere, or if respecting a repo root would be more effective.
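
One way to read the "respecting a repo root" idea is to re-anchor every path the server reports under the client's own repository root. The sketch below is only an illustration: to_local_path and its parameters are hypothetical names, not part of SeaGOAT, and it assumes POSIX-style paths on both sides.

```python
from pathlib import PurePosixPath


def to_local_path(server_path: str, server_root: str, local_root: str) -> PurePosixPath:
    """Re-anchor a path reported by the server under the local repository root.

    Hypothetical helper: raises ValueError if server_path is not inside server_root.
    """
    relative = PurePosixPath(server_path).relative_to(PurePosixPath(server_root))
    return PurePosixPath(local_root) / relative
```

For example, to_local_path("/srv/repos/app/src/main.py", "/srv/repos/app", "/home/me/code/app") yields /home/me/code/app/src/main.py, so the client never needs the server's parent directory to exist locally.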

Originally posted by @cori in #269

Support config files

It would be useful to allow SeaGOAT to be configured using config files.

This is a good first issue if you are experienced with Python, as it does not require deep knowledge of this project, you just have to focus on the parts that currently parse command line arguments.

Namely, these 2 places in the code:

SeaGOAT/seagoat/cli.py

Lines 109 to 119 in 1edae8e

@click.option(
    "-C",
    "--context",
    type=int,
    default=None,
    help="Include this many lines of context after and before each result",
)
@click.version_option(version=__version__, prog_name="seagoat")
def seagoat(
    query, repo_path, no_color, max_results, context_above, context_below, context
):

SeaGOAT/seagoat/server.py

Lines 126 to 128 in 1edae8e

@click.group()
@click.version_option(version=__version__, prog_name="seagoat")
def server():

Keep in mind that the server and the client will both need configuration, the best way to deal with this could be to have different sections in the configuration files.

We already have appdirs as a dependency which should help find a good default location for config files, and click for parsing command line options. There should be a simple way to support configuring the same options through config files.

It would be nice to also have these options:

  • --config= allow specifying a different config file than the default
  • --no-config disable the default config file and force the use of default configs. If the user specifies command line options, those should still take effect

Keep in mind that command line options/flags, when provided, should always be considered more important than the data in the configuration files!

An extra feature would be allowing project-level configuration files in addition to the default ones. Basically, if there is a configuration file in the git repo root (it should be a hidden file, so start the name with .), it should overwrite the data in the default configuration file.
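
The precedence described in this issue (defaults, then the global config file, then the project-level file, then command line options) can be sketched as a chain of recursive dictionary merges. This is a minimal illustration using only the standard library. The function names are hypothetical, and the sketch assumes the config files and the provided command line options have already been parsed into nested dictionaries.

```python
from typing import Any


def merge_config(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(
    defaults: dict[str, Any],
    global_file: dict[str, Any],
    project_file: dict[str, Any],
    cli_options: dict[str, Any],
) -> dict[str, Any]:
    """Apply the layers from lowest to highest priority.

    `cli_options` is assumed to contain only options the user actually passed.
    """
    config = defaults
    for layer in (global_file, project_file, cli_options):
        config = merge_config(config, layer)
    return config
```

With separate "server" and "client" sections, a project file that only sets the server port leaves every client setting from the global file untouched, while a flag passed on the command line still overrides both.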

Support Windows. Help wanted 🙏🪟

At all steps, SeaGOAT is developed with the intention of guaranteeing Windows support. That being said, as I don't have a Windows device and I don't have expertise regarding Windows development workflows, I have not been able to verify that it works properly on Windows.

The main goal is to support WSL, but supporting Windows natively is also a goal if it's possible!

Help wanted!

Here is a rough checklist of the current status of Windows support:

  • Enable CI checks for Windows
  • Enable tests for WSL if it's possible and necessary with GitHub actions
  • Make sure that all CI tests pass for Windows
  • Manually test SeaGOAT on Windows
  • Document installation steps on Windows
  • Make sure that any required Windows-specific features are implemented
  • Ensure it's easy to install on Windows using common tools

Currently, the main problem is that I have not been able to try and run SeaGOAT manually on Windows.

Also, I have discovered that some of the tests are failing on Windows. So, I created a CI file specifically for Windows that only checks some of the tests that are passing without issue:

- name: Run pytest
  run: |
    poetry run pytest tests/test_repository.py -vvs --timeout=60

For comparison, the Linux tests look like this:

- name: Run pytest
  run: |
    poetry run pytest . -vvs --timeout=60

Support Mac OS. Help wanted 🙏🍎

At all steps, SeaGOAT is developed with the intention of guaranteeing Mac OS support. That being said, as I don't have a Mac OS device and I don't have expertise regarding Mac OS development workflows, I have not been able to verify that it works properly.

Help wanted!

Here is a rough checklist of the current status of Mac OS support:

  • Enable CI checks for Mac OS
  • Make sure that all CI tests pass for Mac
  • Manually test SeaGOAT on Mac
  • Document installation steps on Mac
  • Make sure that any required Mac-specific features are implemented
  • Make sure that it's easy to install on Mac using common tools (homebrew?)

Currently the main problem is that I have not been able to try and run SeaGOAT manually on OSX.

Also, I have discovered that some of the tests are failing, so I created a CI file specifically for OSX that only checks some of the tests that are passing without issue:

- name: Run pytest
  run: |
    poetry run pytest tests/test_repository.py -vvs --timeout=60

For comparison the Linux tests look like this:

- name: Run pytest
  run: |
    poetry run pytest . -vvs --timeout=60

IndexError: list index out of range

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kantord/repos/SeaGOAT/seagoat/queue/base_queue.py", line 77, in _worker_function
    self._handle_task(context, task)
  File "/home/kantord/repos/SeaGOAT/seagoat/queue/base_queue.py", line 58, in _handle_task
    result = handler(context, *task.args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/seagoat/queue/task_queue.py", line 80, in handle_query
    context["seagoat_engine"].fetch_sync(
  File "/home/kantord/repos/SeaGOAT/seagoat/engine.py", line 156, in fetch_sync
    loop.run_until_complete(self.fetch(*args, **kwargs))
  File "/home/kantord/repos/SeaGOAT/.venv/lib/python3.11/site-packages/nest_asyncio.py", line 99, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib/python3.11/asyncio/tasks.py", line 267, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/seagoat/engine.py", line 140, in fetch
    self._results.extend(source["fetch"](self.query_string, limit_clue))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/seagoat/sources/chroma.py", line 50, in fetch
    files[path].add_line(line, distance)
  File "/home/kantord/repos/SeaGOAT/seagoat/result.py", line 99, in add_line
    self.line_texts[line - 1],
    ~~~~~~~~~~~~~~~^^^^^^^^^^
IndexError: list index out of range

Looks like this could be due to outdated data being in the database
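
If stale database entries are indeed the cause, one defensive option is to check the line number against the file's current length before indexing. The following is a hedged sketch and not the project's actual fix: line_text_or_none is a hypothetical helper that mirrors the 1-based lookup seen in the traceback (self.line_texts[line - 1]).

```python
from typing import Optional


def line_text_or_none(line_texts: list[str], line: int) -> Optional[str]:
    """Return the text of the 1-based `line`, or None if the file no longer has that line."""
    if 1 <= line <= len(line_texts):
        return line_texts[line - 1]
    return None
```

A caller could then skip (or schedule re-analysis of) any chunk whose lookup returns None instead of crashing the worker.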

Run across multiple repos

Ideally, there is a folder called /repos that contains multiple git repositories. seagoat-server start /path/to/folder/repos/ would monitor all the subdirectories that are git repositories.
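
As a rough illustration of the discovery step this request implies, the sketch below lists the immediate subdirectories of a folder that look like git working trees. The function name is hypothetical, and treating "contains a .git entry" as the test for a repository is an assumption (it covers regular clones and also worktrees or submodules, where .git is a file).

```python
from pathlib import Path


def find_git_repositories(parent: Path) -> list[Path]:
    """Return immediate subdirectories of `parent` that contain a .git entry, sorted by path."""
    return sorted(
        child
        for child in parent.iterdir()
        if child.is_dir() and (child / ".git").exists()
    )
```

Each returned path could then be handed to whatever currently handles a single repository path.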

Search in all past versions of files

At the moment, SeaGOAT is designed to only search in one specific version of a file. This actually leads to problems when for example a chunk is cached for an older version of the file and we are trying to retrieve it from the current version of the file: #226 (That issue can be worked around, but that is an imperfect solution)

Instead, SeaGOAT should be modified as such:

  • git hash-object should be used to calculate current versions of the file
  • All past versions should be analyzed (with low priority) with their corresponding hash as well
  • The ripgrep source should search in all git blobs with their corresponding hash, and later match them to the most appropriate filename
  • If a file matches the search in multiple versions, all versions should be returned in the JSON, but in the CLI output only the latest version should be shown
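
For reference, the object ID that git hash-object prints for a blob can be reproduced in a few lines of Python, which may be handy for keying cached chunks by content. This sketch assumes a SHA-1 repository and raw file bytes; when git hash-object is given a file path it may first apply filters such as line-ending conversion, which this function does not do.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the SHA-1 object ID git assigns to a blob with exactly this content."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

Two versions of a file with identical bytes get the same ID, so a chunk analyzed for one version can be reused for any other path or commit that contains the same blob.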

Show warnings when remote server returns files that are not available locally

When the remote server returns files that are not available locally, they are simply hidden from the results locally. That kind of makes sense; after all, the file does not exist. However, for the most part that issue could be fixed by a simple git fetch --all. So we could show a warning that certain files were not shown locally and that it might be fixed by getting the latest version from the server.
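
A minimal sketch of the proposed behavior: split the server's result paths into those that exist locally and those that do not, then build a warning for the latter. Both function names and the wording of the message are illustrative assumptions, not existing SeaGOAT code.

```python
from pathlib import Path


def split_by_local_availability(
    repo_root: Path, result_paths: list[str]
) -> tuple[list[str], list[str]]:
    """Split repo-relative result paths into (available locally, missing locally)."""
    available: list[str] = []
    missing: list[str] = []
    for relative_path in result_paths:
        target = available if (repo_root / relative_path).is_file() else missing
        target.append(relative_path)
    return available, missing


def missing_files_warning(missing: list[str]) -> str:
    """Build a warning for hidden results, or an empty string if nothing is missing."""
    if not missing:
        return ""
    listed = ", ".join(missing)
    return (
        f"Warning: {len(missing)} result file(s) are not available locally "
        f"and were hidden: {listed}. Getting the latest version from the server might fix this."
    )
```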

Set up automated line length formatting for YAML files

We have rules that limit line width for YAML files, which is good, but it is annoying because it is not an autofixable error. This can be implemented on an individual basis by configuring code editors to do it, but it's much better if it's done automatically the same way all the other autoformatting/linting is done.

Here is our configuration for pre-commit: https://github.com/kantord/SeaGOAT/blob/main/.pre-commit-config.yaml, which would be the tool we need to use to implement this autoformatting.

Here is the website of pre-commit: https://pre-commit.com/

A hook should already exist for this, but if not, it can probably be added manually, as pre-commit should be able to run arbitrary commands.

Exception "raise Empty" starting server for new repo

I decided to try Seagoat on a different repo, a smaller one. (The other server process was not running anymore.) This time it failed immediately with two exceptions.

$  seagoat-server start ~/Projects/xxxxxx
2023-09-24 11:12:56,738 Creating server...
2023-09-24 11:12:56,739 Starting worker thread...
2023-09-24 11:12:56,741 Serving on http://0.0.0.0:61607
2023-09-24 11:12:57,971 Checking repository for new changes
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 76, in _worker_function
    task = self._task_queue.get(timeout=1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 81, in _worker_function
    self.handle_maintenance(context)
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 50, in handle_maintenance
    remaining_chunks_to_analyze = context["seagoat_engine"].analyze_codebase(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 84, in analyze_codebase
    return self._create_vector_embeddings(minimum_chunks_to_analyze)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 106, in _create_vector_embeddings
    for chunk in file.get_chunks():
                 ^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/file.py", line 81, in get_chunks
    lines = self._get_file_lines()
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/file.py", line 36, in _get_file_lines
    for i, line in enumerate(source_code_file.read().splitlines())
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8e in position 510: character maps to <undefined>

KeyError on line 212 in engine.py on request.

Just installed it on my work laptop (running macOS).

The server usually crashes once I make a request, on line 212 in engine.py, complaining about a KeyError on one of the files.

I had it working one time on my 3rd try starting the server. Not sure I did anything different though.

I can't post the log here unfortunately :(

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Detected dependencies

github-actions
.github/workflows/docs.yml
  • actions/checkout v4@692973e3d937129bcbf40652eb9f2f61becf3332
  • actions/setup-python v5@39cd14951b08e74b54015e9e001cdefcf80e669f
  • snok/install-poetry v1@76e04a911780d5b312d89783f7b1cd627778900a
.github/workflows/frizbee.yml
  • actions/checkout v4
  • stacklok/frizbee-action v0.0.2
.github/workflows/lint.yml
  • actions/checkout v4@692973e3d937129bcbf40652eb9f2f61becf3332
  • actions/setup-python v5@39cd14951b08e74b54015e9e001cdefcf80e669f
  • snok/install-poetry v1@76e04a911780d5b312d89783f7b1cd627778900a
.github/workflows/release.yml
  • actions/checkout v4@692973e3d937129bcbf40652eb9f2f61becf3332
  • python-semantic-release/python-semantic-release v9.8.7@708671d0eb33bcbea78c5a3d81ae04c60deeddf3
  • pypa/gh-action-pypi-publish fb9fc6a4e67ca27a7a76b17bbf90be83c2d3c716
  • KSXGitHub/github-actions-deploy-aur v3.0.0@32e74b369e077d605a823d29574313d894dd0f31
.github/workflows/test.yml
  • actions/checkout v4@692973e3d937129bcbf40652eb9f2f61becf3332
  • actions/setup-python v5@39cd14951b08e74b54015e9e001cdefcf80e669f
  • snok/install-poetry v1@76e04a911780d5b312d89783f7b1cd627778900a
  • codecov/codecov-action v4@e28ff129e5465c2c0dcc6f003fc735cb6ae0c673
pep621
pyproject.toml
poetry
pyproject.toml
  • python ^3.10, < 3.13
  • chromadb ^0.5.3
  • gitpython ^3.1.31
  • tqdm ^4.65.0
  • appdirs ^1.4.4
  • click ^8.1.3
  • blessed ^1.20.0
  • pygments ^2.15.1
  • nest-asyncio ^1.5.6
  • requests ^2.31.0
  • setuptools ^72.0.0
  • psutil ^6.0.0
  • orjson ^3.9.5
  • waitress ^3.0.0
  • chardet ^5.2.0
  • pyyaml ^6.0.1
  • jsonschema ^4.19.1
  • deepmerge ^1.1.0
  • stop-words ^2018.7.23
  • flask ^3.0.0
  • pytest ^8.0.0
  • pre-commit ^3.3.3
  • pyright ^1.1.314
  • pytest-watch ^4.2.0
  • freezegun ^1.2.2
  • syrupy ^4.0.4
  • pytest-asyncio ^0.23.0
  • ipython ^8.14.0
  • exceptiongroup ^1.1.2
  • pytest-mock ^3.11.1
  • pytest-fast-first ^1.0.5
  • pytest-testmon ^2.0.12
  • pytest-leaks ^0.3.1
  • mkdocs-material ^9.1.19
  • markdown-include ^0.8.1
  • python-semantic-release ^9.0.0
  • pytest-sugar ^1.0.0
  • pytest-profiling ^1.7.0
  • pytest-timeout ^2.1.0
  • psutil ^6.0.0
  • pytest-cov ^5.0.0
  • matplotlib ^3.8.0
  • ipykernel ^6.25.2
  • jupyterlab-widgets ^3.0.9
  • pandas ^2.1.1
  • locust ^2.17.0
  • pytest-clarity ^1.0.1
  • flask-basicauth ^0.2.0
  • ruff ^0.6.0
  • seaborn ^0.13.0
pyenv
.python-version
  • python 3.12

allow listing/clearing old caches

There should be a command that lists all existing cache folders along with their cache version. Outdated cache folders should be deleted automatically by this command, and there should be an option to delete all cache folders. There should also be options to measure how big each cache is, and to show which git repository each one belongs to and when it was last used.
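
The listing part of this request could look roughly like the sketch below. It assumes, purely for illustration, that each cache lives in its own subfolder of a single cache root, and it approximates "last use" by the newest file modification time. CacheInfo, describe_cache and list_caches are hypothetical names; cache versions and the owning git repository are not covered because the source does not say where they are recorded.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class CacheInfo:
    path: Path
    size_bytes: int
    last_used: datetime  # approximated by the newest file modification time


def describe_cache(folder: Path) -> CacheInfo:
    """Measure one cache folder: total size of its files and newest modification time."""
    files = [p for p in folder.rglob("*") if p.is_file()]
    size = sum(p.stat().st_size for p in files)
    newest = max((p.stat().st_mtime for p in files), default=folder.stat().st_mtime)
    return CacheInfo(folder, size, datetime.fromtimestamp(newest))


def list_caches(cache_root: Path) -> list[CacheInfo]:
    """Describe every cache folder directly under `cache_root`, sorted by path."""
    return [describe_cache(p) for p in sorted(cache_root.iterdir()) if p.is_dir()]
```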

poetry run seagoat "definition" . leads to error

Seems like paths are not normalized properly: running the command above leads to an error, while running it with the full path or even ~/repos/SeaGOAT worked on my computer.

The error

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/kantord/repos/SeaGOAT/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/seagoat/cli.py", line 169, in seagoat
    _, __, ___, server_address = load_server_info(server_info_file)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kantord/repos/SeaGOAT/seagoat/server.py", line 112, in load_server_info
    with open(server_info_file, "r", encoding="utf-8") as file:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/kantord/.cache/seagoat-servers/..json'
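
A common way to avoid this class of bug is to turn whatever the user typed (".", "~/repos/SeaGOAT", a relative path) into one canonical absolute path before deriving anything from it. The helper below is a sketch with a hypothetical name. How SeaGOAT actually derives the server info file name from the repository path is not shown in this issue, so that part is left out.

```python
from pathlib import Path


def normalize_repo_path(repo_path: str) -> Path:
    """Expand '~' and resolve '.', '..' and symlinks into one absolute path."""
    return Path(repo_path).expanduser().resolve()
```

With this, "." and the full path to the current directory normalize to the same value, so any lookup keyed on the repository path behaves identically for both spellings.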

Support encodings other than utf-8

Hi, running seagoat-server start with a repository known to include strange characters (I don't know which, but it shouldn't be relevant; I've had similar errors by running OpenAI code embedding notebook) I get the following:

2023-09-20 16:14:22,735 Creating server...
2023-09-20 16:14:22,737 Starting worker thread...
2023-09-20 16:14:22,746 Serving on http://0.0.0.0:49999
2023-09-20 16:14:24,363 Checking repository for new changes
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "C:\Users\user.local\pipx\venvs\seagoat\lib\site-packages\seagoat\queue\base_queue.py", line 76, in _worker_function
    task = self._task_queue.get(timeout=1)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\user.local\pipx\venvs\seagoat\lib\site-packages\seagoat\queue\base_queue.py", line 81, in _worker_function
    self.handle_maintenance(context)
  File "C:\Users\user.local\pipx\venvs\seagoat\lib\site-packages\seagoat\queue\task_queue.py", line 50, in handle_maintenance
    remaining_chunks_to_analyze = context["seagoat_engine"].analyze_codebase(
  File "C:\Users\user.local\pipx\venvs\seagoat\lib\site-packages\seagoat\engine.py", line 82, in analyze_codebase
    self.repository.analyze_files()
  File "C:\Users\user.local\pipx\venvs\seagoat\lib\site-packages\seagoat\repository.py", line 43, in analyze_files
    for line in iter(proc.stdout.readline, ""):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3682: character maps to <undefined>

I think seagoat should report the error, discard such files from the search, and go on.
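
The "report, discard, go on" behavior suggested here can be sketched as a generator that logs a warning for each undecodable file and keeps iterating. This is an illustration only: read_decodable_files is a hypothetical name, and the traceback above actually fails while reading a subprocess's output rather than a file, so a real fix would need the same idea applied at that point.

```python
import logging
from pathlib import Path
from typing import Iterator

logger = logging.getLogger(__name__)


def read_decodable_files(
    paths: list[Path], encoding: str = "utf-8"
) -> Iterator[tuple[Path, str]]:
    """Yield (path, text) for files that decode cleanly; log and skip the rest."""
    for path in paths:
        try:
            yield path, path.read_text(encoding=encoding)
        except UnicodeDecodeError as error:
            # Report the problem, but keep processing the remaining files.
            logger.warning("Skipping %s: %s", path, error)
```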

"UnicodeDecodeError: 'charmap' codec can't decode byte 0x81"

Me again :) I erased the server cache, upgraded with pipx to 0.31.0, rebuilt the index and tried another query. This time I got a different exception.

Looks like a source file is in an unexpected text encoding and has bytes that fail to decode using cp1254, which Wikipedia tells me is a Windows text encoding primarily used for Turkish(!)

The filename doesn’t appear in the backtrace or server logs, and at first I couldn’t figure out why there might be Turkish in the repo, but then I remembered the repo has a submodule that handles language-aware word stemming for full-text search. One of the files handles Turkish, although I didn’t spot any non-ASCII characters in it.

Anyway, back in the dark days before everything was UTF-8, I remember that to parse arbitrary files without errors the heuristic was to (a) attempt to parse as UTF-8, (b) if that fails, parse as ISO-8859-1 (Latin-1), which is often conflated with CP-1252. ISO-8859-1 will never fail since it maps every byte value to a character, and it’s a superset of ASCII and the most common non-UTF-8 text format.
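
The two-step heuristic above can be sketched in a few lines. One caveat for Python specifically: its cp1252 codec leaves a handful of bytes (including 0x81 and 0x9d) undefined and raises on them, so this sketch falls back to ISO-8859-1 (latin-1), which maps all 256 byte values. decode_with_fallback is a hypothetical helper name.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise as ISO-8859-1, which cannot fail."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a character to every byte value, so this never raises.
        return raw.decode("iso-8859-1")
```

The fallback may produce the wrong characters for files in some other legacy encoding, but the ASCII portion of the text (which is most of a typical source file) still comes through intact and searchable.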

Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 79, in _worker_function
    self._handle_task(context, task)
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 66, in _handle_task
    result = handler(context, *task.args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 78, in handle_query
    context["seagoat_engine"].fetch_sync(
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 163, in fetch_sync
    loop.run_until_complete(self.fetch(*args, **kwargs))
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/nest_asyncio.py", line 99, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py", line 269, in __step
    result = coro.throw(exc)
             ^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 149, in fetch
    results = await asyncio.gather(*async_tasks)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py", line 339, in __wakeup
    future.result()
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/sources/ripgrep.py", line 45, in fetch
    return _fetch(query_text, str(path), limit)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/sources/ripgrep.py", line 33, in _fetch
    files[relative_path] = Result(str(relative_path), absolute_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/result.py", line 77, in __init__
    self.line_texts = self._read_lines()
                      ^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/result.py", line 87, in _read_lines
    return source_code_file.read().splitlines()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 37401: character maps to <undefined>
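
The UTF-8-first fallback heuristic described above could look roughly like the sketch below. It is an illustration only, not SeaGOAT's code: the helper name `decode_with_fallback` is made up for this example, and Latin-1 is chosen as the fallback because Python's `latin-1` codec accepts every byte value.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 when that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to each of the 256 byte values,
        # so this second attempt cannot raise.
        return raw.decode("latin-1")
```

Reading the file in binary mode and decoding afterwards also avoids depending on the platform's default encoding, which is what picked `cp1254` in the traceback above.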

Only return good results from vector embeddings

Regardless of limit, we should not return all results from vector embeddings. We should create a reasonable threshold for similarity that all results have to meet to be included in the final output
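
A minimal sketch of such a cutoff, assuming results arrive as `(path, distance)` pairs where a smaller distance means a closer match. The function name and the `0.5` threshold are placeholders for illustration; a sensible value would have to be found by experiment.

```python
def select_results(results, limit, max_distance=0.5):
    """Drop results beyond the distance cutoff, then apply the limit."""
    good = [item for item in results if item[1] <= max_distance]
    good.sort(key=lambda item: item[1])  # closest matches first
    return good[:limit]
```

With this ordering, a large `limit` can no longer pull in weak matches: the threshold is applied first and the limit only trims what is left.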

Ignore certain filetypes

Seagoat is taking a very long while to scan a repository - and by the looks of it, it is stuck on the translation files which would add little value to the task at hand. Is there a way to ignore them?

Log: https://dpaste.com/4UK6GPF8D (started at 8:40, stopped it at 10:40 with little progress done)

Ignore chunks that repeat in several files

Some chunks might be repeating across several files. These will probably not lead to useful results and might not even contain a lot of useful information. We could experiment with ignoring these chunks or at least measuring them. Probably anything that shows up more than 4 times will be some useless boilerplate code
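
Measuring the repetition could start with something like the sketch below. It assumes chunks are available as plain strings grouped by file path; `find_repeated_chunks` and its parameter are hypothetical names, and the default of 4 simply mirrors the guess above.

```python
from collections import Counter


def find_repeated_chunks(chunks_by_file, max_occurrences=4):
    """Return the chunks that appear in more than max_occurrences files."""
    counts = Counter()
    for chunks in chunks_by_file.values():
        # A set ensures each chunk is counted at most once per file.
        counts.update(set(chunks))
    return {chunk for chunk, n in counts.items() if n > max_occurrences}
```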

Stopping server does not actually kill server process

This should be fairly simple to solve. We already have a test for stopping the server:

@pytest.mark.usefixtures("server")
def test_stop(repo):
    subprocess.run(
        ["python", "-m", "seagoat.server", "stop", repo.working_dir],
        capture_output=True,
        text=True,
        check=False,
    )
    result = subprocess.run(
        ["python", "-m", "seagoat.server", "status", repo.working_dir],
        capture_output=True,
        text=True,
        check=False,
    )
    assert result.returncode == 0
    assert "Server is not running" in result.stdout

The only thing wrong with this test is that it does not check that the server process is actually killed. And it seems that the implementation of the stop command does not actually kill the server process; it just does some things that fool the test:

SeaGOAT/seagoat/server.py

Lines 184 to 199 in 0356687

@server.command()
@click.argument("repo_path")
def stop(repo_path):
    """Stops the server."""
    try:
        remove_server_info(repo_path)
    except ServerDoesNotExist:
        click.echo(
            f"No server information found for {repo_path}. It might not be running or was never started."
        )
        return
    click.echo(
        "Server stopped. If it was running, it will stop after finishing current tasks."
    )

The correct way to solve it is by:

  1. Making sure that the test fails because the server process is not being killed
  2. After seeing that the test "fails correctly", fixing the implementation and making sure that the test now passes

The get_server_info function should already return the process ID, so you should be able to use that in the test in order to make sure that the process is running before the server is killed and that it is not running after it's killed. Also, you should be able to use that to actually kill the server process in the implementation!
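
As a rough sketch of the two building blocks this needs, here is one way to check whether a PID is alive and to terminate it on POSIX systems. The helper names are invented for this example, and how the PID is read out of `get_server_info` is left open, since its return format is not shown here.

```python
import os
import signal


def is_process_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_process(pid: int) -> None:
    """Ask the process to exit by sending SIGTERM."""
    os.kill(pid, signal.SIGTERM)
```

In a test, `is_process_running` would be asserted true before the stop command and false afterwards. Note that a child process that has exited but has not yet been waited on still counts as existing.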

Deserialize data coming from the server

To avoid things like what happened in #18, it would be nice to deserialize results coming from the server and make more extensive use of static typing for the data structures and classes already defined.
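
One possible shape for this is sketched below with dataclasses. The JSON layout (`results`, `path`, `lines`, `line`, `text`) and the class names are assumptions made purely for illustration; they are not taken from SeaGOAT's actual server response.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultLine:
    line: int
    text: str


@dataclass(frozen=True)
class FileResult:
    path: str
    lines: list[ResultLine]


def parse_results(payload: str) -> list[FileResult]:
    """Turn a JSON response (hypothetical schema) into typed objects."""
    data = json.loads(payload)
    return [
        FileResult(
            path=item["path"],
            lines=[ResultLine(line=entry["line"], text=entry["text"]) for entry in item["lines"]],
        )
        for item in data["results"]
    ]
```

A missing or misspelled key then fails at the parsing boundary with a `KeyError` instead of surfacing later in the display code.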

queue.Empty exception crash

When performing

yshen@L-2JDM8S2:~/dev/icsr-study/projects/icsr-root$ seagoat-server start .
2023-09-22 11:25:47,414 Creating server...
2023-09-22 11:25:47,416 Starting worker thread...
2023-09-22 11:25:47,430 Serving on http://0.0.0.0:34373
2023-09-22 11:25:49,086 Checking repository for new changes

got the following:

2023-09-22 11:25:47,430 Serving on http://0.0.0.0:34373
2023-09-22 11:25:49,086 Checking repository for new changes
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 76, in _worker_function
    task = self._task_queue.get(timeout=1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yshen/miniconda3/envs/seagoat-python311/lib/python3.11/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yshen/miniconda3/envs/seagoat-python311/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/home/yshen/miniconda3/envs/seagoat-python311/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 81, in _worker_function
    self.handle_maintenance(context)
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 50, in handle_maintenance
    remaining_chunks_to_analyze = context["seagoat_engine"].analyze_codebase(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 84, in analyze_codebase
    return self._create_vector_embeddings(minimum_chunks_to_analyze)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 106, in _create_vector_embeddings
    for chunk in file.get_chunks():
                 ^^^^^^^^^^^^^^^^^
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/file.py", line 79, in get_chunks
    lines = self._get_file_lines()
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yshen/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/file.py", line 34, in _get_file_lines
    for i, line in enumerate(source_code_file.read().splitlines())
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Running on Ubuntu 24.4 under WSL2/Windows 11. The directory in which I ran seagoat-server is a Git repo of Java code, with all files checked in.
This may or may not be relevant:
When the exception happened, I was running kill -9 %4 on a SeaGOAT server serving a different path, which had raised a KeyError.

Merge almost continuous code blocks

When there are almost continuous code blocks in the result (for example, only 1 or 2 lines are missing to combine two code blocks into a single one), it could be useful to combine them into a single code block.

This could be implemented using a command line argument which controls how many extra lines can be added in order to achieve this. This value could default to 0 which would in effect disable this feature.
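
The merging step itself could be sketched as follows, assuming each code block is an inclusive `(start, end)` line range. The function name and the `max_gap` parameter (standing in for the proposed command line argument) are hypothetical.

```python
def merge_line_ranges(ranges, max_gap=0):
    """Merge (start, end) ranges separated by at most max_gap missing lines."""
    merged = []
    for start, end in sorted(ranges):
        # Number of lines missing between the previous block and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With `max_gap=0` only blocks that already touch or overlap are joined, so no extra lines are ever added.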

reduce weight of "boring" code lines such as import statements

things like an import statement can seem semantically highly relevant however they are unlikely to be what the user is looking for

these lines should probably not be completely hidden, especially if they are results of exact matches. however they could be given lower weight and conditionally hidden if there are more relevant results from the same file

also if they are the only result from a file, that file should have a very low weight compared to files that have it in 'real' code
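
a rough sketch of the per-line part of this idea, assuming a higher score means a more relevant line. the regular expression, the `0.5` penalty and the function name are placeholders for illustration and only cover a few import-like syntaxes:

```python
import re

# Matches lines that start with a few common import-like keywords.
BORING_LINE = re.compile(r"^\s*(?:import|from|#include)\b")


def adjusted_score(line_text, score, penalty=0.5, exact_match=False):
    """Lower the score of import-like lines unless they are exact matches."""
    if exact_match or not BORING_LINE.match(line_text):
        return score
    return score * penalty
```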

Monorepo/big repo support

This is a list of ideas to support huge repositories and monorepos

  • Allow creating separate caches for subfolders
  • Allow querying specific folders. In this case, the top list of files should be filtered just for this folder
  • Get top files from any folder with more than X files. In addition to guaranteeing that the overall top N files are cached, guarantee that these files are also cached
  • Detect delimiters of "subprojects" such as git subrepos, package.json files and other files that belong to independent packages. Apart from getting the overall top list of files, get top lists from these folders and also consider this in sorting the results.
  • Try to find "names" for subpackages and if possible, include these as context for chunks. This way the user could be able to specify that they are looking for something from that package

Use `networkx` based heuristics to improve the results

Currently the search results are mostly being sorted based on their semantic relevance, and partly on relevance heuristics based on patterns in Git history.

To improve the result sorting, it could be interesting to also experiment with analyzing the structure of the codebase with regards to which files reference each other and such. Based on this, centrality scores could be calculated, or even a score similar to pagerank could be used.

https://networkx.org/
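
To make the idea concrete, here is a small dependency-free sketch of a PageRank-style score over a file reference graph. It is a plain power iteration written for illustration, not a networkx call, and the input format (a dict mapping each file to the files it references) is an assumption; how those references would be extracted from the codebase is left open.

```python
def pagerank(references, damping=0.85, iterations=50):
    """Score files by how often they are referenced, PageRank-style."""
    nodes = set(references) | {t for targets in references.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # Files that reference nothing spread their rank over all files.
            targets = references.get(node) or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

Files that many others reference, such as shared utility modules, end up with a higher score, which could then be blended into the existing result ordering.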

Deleting files breaks search (server throws exceptions)

TL;DR:

I stopped the server during initial indexing, used rm -r to delete a subdirectory of my repo, restarted the server; and now whenever I search the server throws file-not-found exceptions trying to read deleted files.

What I Did

  1. I installed SeaGOAT on an ARM MacBook Pro (understanding that macOS is unsupported.)
  2. Started the server from a shell.
  3. Noticed it had a huge number (30000+) of files to index, and that it was scanning a huge directory of Doxygen-generated HTML/js that I don’t care about.
  4. Hit Ctrl-C twice to stop the server.
  5. rm -r docs/
  6. Started server again and let it finish indexing.
  7. In another shell in the repo directory, entered a gt command to search

Results

The gt command hung, not printing anything.

Over in the other window, the server had thrown a Python exception:

2023-09-23 10:18:06,773 Handling task: query
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 79, in _worker_function
    self._handle_task(context, task)
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 66, in _handle_task
    result = handler(context, *task.args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 78, in handle_query
    context["seagoat_engine"].fetch_sync(
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 163, in fetch_sync
    loop.run_until_complete(self.fetch(*args, **kwargs))
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/nest_asyncio.py", line 99, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py", line 267, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 147, in fetch
    self._results.extend(source["fetch"](self.query_string, limit_clue))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/sources/chroma.py", line 58, in fetch
    files[path] = Result(path, Path(repository.path) / path)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/result.py", line 75, in __init__
    self.line_texts = self._read_lines()
                      ^^^^^^^^^^^^^^^^^^
  File "/Users/snej/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/result.py", line 84, in _read_lines
    with open(self.full_path, encoding="utf-8") as source_code_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/snej/path/to/repo/docs/C/html/search/enumvalues_0.js'

I can’t get the problem to go away — to get the server to notice a change I tried mkdir docs, then touch docs/foo/html, but nothing happened (didn’t see any output from the server.)

I’d wipe my state by deleting the server’s database, but I don’t know where it is...

'Penalize' text file in vector embeddings

Looks like text files get much better results typically than code files. This is annoying because often the first result is a markdown file or other text file, but it does not generally seem to be what the user is looking for.

This is probably because code is "less dense" in meaning. One solution could be to add a multiplier to vector distances based on file type, to 'penalize' a vector coming from a text file compared to a code file
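
A minimal sketch of such a multiplier, assuming a smaller distance means a better match. The extension list, the `1.5` factor and the function name are illustrative guesses that would need tuning against real queries.

```python
from pathlib import Path

# Hypothetical per-extension multipliers; anything not listed is left alone.
TEXT_FILE_PENALTY = {".md": 1.5, ".txt": 1.5, ".rst": 1.5}


def penalized_distance(path, distance):
    """Scale the vector distance up for text files so code ranks higher."""
    return distance * TEXT_FILE_PENALTY.get(Path(path).suffix.lower(), 1.0)
```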

limit number of results from one file

sometimes there can be way too many results from a file especially from documentation files for example.

there should be some limit established regarding the number of lines one file can include, at least as long as they are not direct matches
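
one way to sketch such a cap, assuming results come as `(path, line, is_exact_match)` tuples already sorted best first. the function name and the default of 5 lines are made up for this example:

```python
from collections import defaultdict


def limit_lines_per_file(results, max_lines=5):
    """Keep all exact matches, but cap the other lines taken from each file."""
    kept = []
    used = defaultdict(int)  # non-exact lines kept so far, per file
    for path, line, is_exact_match in results:
        if is_exact_match:
            kept.append((path, line, is_exact_match))
        elif used[path] < max_lines:
            used[path] += 1
            kept.append((path, line, is_exact_match))
    return kept
```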

Use `bat` to display files whenever possible

When bat is available, it should be used as the default syntax highlighting method instead of pygments as it is faster and has more features.

It can be used like this: bat ~/repos/ledger-cli-next/components/charts/LedgerPieChartWithDetail/index.tsx --line-range 10:15 --line-range 20:24 to highlight specific lines in a file
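
Building that command from a list of line ranges could look like the sketch below. The helper names are hypothetical; only the `--line-range start:end` option shown above is relied on, and `shutil.which` is used to decide whether bat is installed.

```python
import shutil


def bat_available() -> bool:
    """Return True if a bat executable is found on PATH."""
    return shutil.which("bat") is not None


def build_bat_command(path, line_ranges):
    """Build the argument list for showing the given line ranges with bat."""
    command = ["bat", str(path)]
    for start, end in line_ranges:
        command += ["--line-range", f"{start}:{end}"]
    return command
```

The resulting list can be passed to `subprocess.run` when `bat_available()` is true, with pygments kept as the fallback otherwise.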

Document the location of the database

I can’t find anything in the README that says where the server stores its data. And I’ve looked in the repo dir, my home dir and the SeaGoat installation dir (~/.local/pipx…) and found nothing that looks like it, even invisible files.

Since I imagine the database gets pretty big, users should know how to remove it if they uninstall SeaGoat or need to free up storage (or have a damaged install as I seem to have.)

It would also be nice if the server had a subcommand to delete an index for a repo.

There is no current event loop in thread

❯ seagoat-server start ~/Projects/middleware/control/pong-front-end
2023-09-23 11:21:57,045 Creating server...
2023-09-23 11:21:57,046 Starting worker thread...
2023-09-23 11:21:57,047 Serving on http://0.0.0.0:57187
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/homebrew/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 72, in _worker_function
    context = self._get_context()
              ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 39, in _get_context
    from seagoat.engine import Engine
  File "/opt/homebrew/lib/python3.11/site-packages/seagoat/engine.py", line 40, in <module>
    nest_asyncio.apply()
  File "/opt/homebrew/lib/python3.11/site-packages/nest_asyncio.py", line 18, in apply
    loop = loop or asyncio.get_event_loop()
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/nest_asyncio.py", line 45, in _get_event_loop
    loop = events.get_event_loop_policy().get_event_loop()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/events.py", line 677, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Thread-1 (_worker_function)'.
^CServer started at http://localhost:57187
Server running.

SeaGoat always assumes that input files are valid UTF-8 and fails if that's not true

If SeaGoat tries to analyse a code base that isn't in UTF-8, it fails at the first invalid UTF-8 byte it finds (in my case, a Java code repository in ISO-8859-1):

2023-09-21 07:54:16,620 Checking repository for new changes
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/home/luis/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 76, in _worker_function
    task = self._task_queue.get(timeout=1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/home/luis/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/base_queue.py", line 81, in _worker_function
    self.handle_maintenance(context)
  File "/home/luis/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/queue/task_queue.py", line 50, in handle_maintenance
    remaining_chunks_to_analyze = context["seagoat_engine"].analyze_codebase(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luis/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 84, in analyze_codebase
    return self._create_vector_embeddings(minimum_chunks_to_analyze)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luis/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/engine.py", line 106, in _create_vector_embeddings
    for chunk in file.get_chunks():
                 ^^^^^^^^^^^^^^^^^
  File "/home/luis/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/file.py", line 79, in get_chunks
    lines = self._get_file_lines()
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luis/.local/pipx/venvs/seagoat/lib/python3.11/site-packages/seagoat/file.py", line 34, in _get_file_lines
    for i, line in enumerate(source_code_file.read().splitlines())
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 63: invalid continuation byte

At the very least, this behaviour should be documented in the README.

gt reports that SeaGOAT server is not running when in fact it is

I followed the README and started a server in one tab. It indexed the repo and then printed:

Analyzing source code: 0it [00:00, ?it/s]
2023-09-20 20:37:42,312 Analyzed the minimum number of chunks needed to operate. 
2023-09-20 20:37:42,312 Analyzed all chunks!

In another tab, running an example query fails:

gt "Where are the numbers rounded"
The SeaGOAT server is not running. Please start the server using the following command:

seagoat-server start /home/git/repo

Separate results from the same file into multiple groups

While grouping results from the same file together is useful and should be kept, in some cases this can result in subpar quality.

For example, a very large file with many results might keep the result screen very busy and draw attention from another file that contains a good result.

Results with only a few lines in the same file should probably always be kept together, as should results that have very similar relevance scores.

However, when there are more than a few lines in a result, and their relevance scores are not very similar, then probably they should be grouped separately.
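
The grouping rule above might be sketched like this, assuming the results for one file are `(line_number, score)` pairs. The function name and both thresholds (`few_lines`, `score_tolerance`) are hypothetical and would need tuning.

```python
def split_file_results(lines, few_lines=3, score_tolerance=0.1):
    """Split one file's results into groups of similarly scored lines."""
    if len(lines) <= few_lines:
        return [list(lines)]  # small results always stay together
    groups = []
    for item in sorted(lines, key=lambda entry: entry[1]):
        # Start a new group when the score jumps by more than the tolerance.
        if groups and abs(item[1] - groups[-1][-1][1]) <= score_tolerance:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups
```

Each returned group could then be displayed as its own entry, so a large file with mixed-quality matches no longer crowds out a strong result from another file.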
