
llama-coder's Introduction

Llama Coder

Llama Coder is a self-hosted GitHub Copilot replacement for VS Code. It uses Ollama and Codellama to provide autocompletion that runs entirely on your own hardware. Works best with Apple Silicon (M1/M2/M3) Macs or an RTX 4090.

VS Code Plugin

Features

  • 🚀 As good as Copilot
  • ⚡️ Fast. Works well on consumer GPUs. Apple Silicon or an RTX 4090 is recommended for best performance.
  • 🔐 No telemetry or tracking
  • 🔬 Works with any language, coding or human.

Recommended hardware

Minimum required RAM: 16GB; more is better, since even the smallest model takes about 5GB. The best setup is a dedicated machine with an RTX 4090: install Ollama on it and point the endpoint in the extension settings at that machine to offload inference to it. The second-best option is a MacBook with an M1/M2/M3 and enough RAM (more is better, but about 10GB of headroom is enough). Windows notebooks run fine with a decent GPU, but a dedicated machine with a good GPU is recommended. A dedicated gaming PC is perfect.

Local Installation

Install Ollama on your local machine, then launch the extension in VS Code; everything should work out of the box.
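
If you want to sanity-check the local setup from a terminal first, you can confirm Ollama is running and see which models it already has (standard Ollama CLI and API; 11434 is Ollama's default port):

# list the models Ollama already has locally
ollama list

# or query the same API the extension talks to
curl http://127.0.0.1:11434/api/tags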

Remote Installation

Install Ollama on a dedicated machine and point the endpoint in the extension settings at it. Ollama listens on port 11434 and binds to 127.0.0.1 by default; to make it reachable from other machines, set OLLAMA_HOST to 0.0.0.0.
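
A minimal sketch of the remote setup (the IP address below is a placeholder for your dedicated machine):

# on the dedicated machine: make Ollama listen on all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve

# from your development machine: check that the endpoint is reachable
# (replace 192.168.1.50 with the address of the dedicated machine),
# then put http://192.168.1.50:11434 into the extension's endpoint setting
curl http://192.168.1.50:11434/api/tags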

Models

Currently Llama Coder supports only Codellama. Each model is quantized in different ways, but our tests show that q4 quantization is the optimal way to run the network. Bigger models perform better, so always pick the largest model and the highest-precision quantization your machine can handle (see the pull example after the table). The default is stable-code:3b-code-q4_0, which should work everywhere and outperforms most other models of its size.

Name                        RAM/VRAM   Notes
stable-code:3b-code-q4_0    3GB
codellama:7b-code-q4_K_M    5GB
codellama:7b-code-q6_K      6GB        m
codellama:7b-code-fp16      14GB       g
codellama:13b-code-q4_K_M   10GB
codellama:13b-code-q6_K     14GB       m
codellama:34b-code-q4_K_M   24GB
codellama:34b-code-q6_K     32GB       m

  • m - slow on macOS
  • g - slow on older NVIDIA cards (pre-30xx)
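
If you prefer to fetch a model ahead of time instead of letting the extension download it, you can pull it with Ollama using the tags from the table above:

ollama pull stable-code:3b-code-q4_0     # default model, ~3GB
ollama pull codellama:13b-code-q4_K_M    # larger model, needs ~10GB RAM/VRAM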

Troubleshooting

Most problems can be diagnosed from the plugin's output in the VS Code extension output panel.

Changelog

[0.0.14]

  • Ability to pause completion (by @bkyle)
  • Bearer token support for remote inference (by @Sinan-Karakaya)

[0.0.13]

  • Fix remote files support

[0.0.12]

  • Remote support
  • Fix codellama prompt preparation
  • Add trigger delay
  • Add jupyter notebooks support

[0.0.11]

  • Added Stable Code model
  • Pause download only for specific model instead of all models

[0.0.10]

  • Adding ability to pick a custom model
  • Asking user if they want to download model if it is not available

[0.0.9]

  • Adding deepseek 1b model and making it default

[0.0.8]

  • Improved DeepSeek support and language detection

[0.0.7]

  • Added DeepSeek support
  • Ability to change temperature and top p
  • Fixed some bugs

[0.0.6]

  • Fix ollama links
  • Added more models

[0.0.4]

  • Initial release of Llama Coder

llama-coder's People

Contributors

bkyle, corinfinite, drblury, dre-on, ex3ndr, fuad00, kevsnz, sahandevs, spinespine, staff0rd, themcsebi, wrapss


llama-coder's Issues

Allow toggle completion instead of autocompletion

It would be nice to be able to actively tell the plugin "I want an autocompletion for the current line" instead of "do autocomplete all the time" with a shortcut.

So instead of automatically querying the model 250 ms after the last keypress, just wait for the user to press a shortcut.
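
For what it's worth, VS Code already ships a built-in command for manually triggering inline suggestions, which covers the "trigger on demand" half of this request (the keybinding shown is just an example; it does not disable the automatic completions):

{
    "key": "alt+\\",
    "command": "editor.action.inlineSuggest.trigger",
    "when": "editorTextFocus"
}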

Astro files not getting completions

For some reason the extension doesn't work with .astro files?

The tab on the bottom doesn't show, and there is no code completion. Seems like a bug with vscode maybe? All other file types I've tried work.

Avoid downloading models if already installed using ollama

I already installed the codellama and deepseek-llm models using Ollama, but your VS Code extension llama-coder is now displaying "downloading...", so I guess it's downloading additional models it needs. Would it be possible to check whether the models already exist on the system and make downloading them optional, to save time and space?

Move model to external harddrive and symlink it

This is actually a question not an issue:

Where does the VS Code extension download the model to?
Is there a way to choose the location where the model can be saved to?

I am testing lots of UIs for LLMs and therefore often end up downloading the same model over and over. This eats disk space very quickly. For things like Stable Diffusion with Automatic1111, ComfyUI, Fooocus, StableSwarm etc. it is easy to either set a custom source folder via the software's settings or to manually create a symlinked folder. Then I only need the models once.

Here it is not really clear where the model is saved to, which is something that tends to create distrust. I would very much like to be able to see where things are downloaded to, and even better: I would like to set options for things like that.

Would that be possible to implement here, too?

Also, since this is not an issue: can I motivate you to open the "Discussions" panel here in your GitHub repo? That would provide a "Questions" section, where this topic would fit better. Thank you! :-)
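
For context: the extension downloads models through Ollama, which by default stores them under ~/.ollama/models (on macOS/Linux) and also honors an OLLAMA_MODELS environment variable for a custom location. A symlink workaround might look like this (paths are placeholders):

# move Ollama's model store to an external drive and link it back
mv ~/.ollama/models /mnt/external/ollama-models
ln -s /mnt/external/ollama-models ~/.ollama/models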

Larger models just seem to return metadata

When I run a model like codellama:34b-code-q6_K it does seem to spin up my GPUs, but then I end up with unusable output. Running the latest Ollama and extension version on Ubuntu 22.04.

2024-03-12 01:57:27.392 [info] Running AI completion...
2024-03-12 01:57:31.655 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.654294365Z","response":" ","done":false}
2024-03-12 01:57:31.700 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.699862247Z","response":"\n","done":false}
2024-03-12 01:57:31.746 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.745519433Z","response":"\u003c/","done":false}
2024-03-12 01:57:31.790 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.790407012Z","response":"PRE","done":false}
2024-03-12 01:57:31.836 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.835927564Z","response":"\u003e","done":false}
2024-03-12 01:57:31.881 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.881386364Z","response":"","done":true,"total_duration":4488985784,"load_duration":270215,"prompt_eval_count":1427,"prompt_eval_duration":4261101000,"eval_count":6,"eval_duration":226981000}
2024-03-12 01:57:31.882 [info] AI completion completed:

[screenshot]

Not working with Remote-SSH from Microsoft, works fine on local files

I work primarily remotely using the ms-vscode-remote.remote-ssh-edit extension. I'm not able to get the llama-coder autocomplete to work on those projects. It works just fine for projects on my local machine. Not sure how to debug further, but it seems that llama-coder does not trigger when I stop typing.

Extension Literally Does Nothing

I have spent some time trying to find a usage guide but failed miserably. I can see the extension is installed and doing something as I can see the spinning wheel in the footer of the VSCode app.

I am also sure I have the model on my machine, as I can see it in Ollama and can run an instance via ollama run <MODEL_NAME>

When I go to the output tab of VS Code I can see the following:

2024-02-26 14:21:32.734 [info] Llama Coder is activated.
2024-02-26 14:26:51.699 [info] Canceled after AI completion.
2024-02-26 14:26:52.006 [info] Running AI completion...
2024-02-26 14:27:00.153 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:27:00.151644Z","response":".","done":false}
2024-02-26 14:27:00.158 [info] AI completion completed: 
2024-02-26 14:27:00.158 [info] Canceled after AI completion.
2024-02-26 14:27:00.160 [info] Canceled before AI completion.
2024-02-26 14:28:34.455 [info] Running AI completion...
2024-02-26 14:28:47.386 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:47.384057Z","response":"","done":true,"total_duration":12925537752,"load_duration":5852886428,"prompt_eval_count":130,"prompt_eval_duration":7072116000,"eval_count":1,"eval_duration":22000}
2024-02-26 14:28:47.388 [info] AI completion completed: 
2024-02-26 14:28:47.388 [info] Canceled after AI completion.
2024-02-26 14:28:47.392 [info] Canceled before AI completion.
2024-02-26 14:28:47.392 [info] Canceled before AI completion.
2024-02-26 14:28:47.394 [info] Canceled before AI completion.
2024-02-26 14:28:47.404 [info] Running AI completion...
2024-02-26 14:28:47.726 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:47.725568Z","response":"\n\n","done":false}
2024-02-26 14:28:47.829 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:47.828662Z","response":"","done":true,"total_duration":422186670,"load_duration":762292,"prompt_eval_count":4,"prompt_eval_duration":317920000,"eval_count":2,"eval_duration":102965000}
2024-02-26 14:28:47.830 [info] AI completion completed: 


2024-02-26 14:28:50.926 [info] Running AI completion...
2024-02-26 14:28:51.185 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:51.184114Z","response":"","done":true,"total_duration":257221422,"load_duration":439644,"prompt_eval_count":3,"prompt_eval_duration":256262000,"eval_count":1,"eval_duration":24000}
2024-02-26 14:28:51.186 [info] AI completion completed: 

However, nothing is ever outputted in the terminal or within the filesystem

Error during inference: fetch failed

I have an ollama container running the stable-code:3b-code-q4_0 model. I'm able to interact with the model via curl:

curl -d '{"model":"stable-code:3b-code-q4_0", "prompt": "c++"}' https://notarealurl.io/api/generate

and get a response in a terminal in wsl where I'm running vscode:

[screenshot]

However when I set the Ollama Server Endpoint to https://notarealurl.io/ I just get [warning] Error during inference: fetch failed

Ollama support and decouple codellama

Ollama is very popular and very easy to setup. It would be great if the extension didn't bundle with codellama and instead we could drop in the ollama endpoint to begin completions

Can't use custom model

I want to use a custom model via Ollama, to which I gave the name copilot. I did ollama create copilot -f copilot and running ollama run copilot works in the terminal, and other extensions also work well with it. However, when I start Llama Coder and set the custom model to 'copilot,' the extension tells me that copilot wasn't downloaded and asks if I want to download it. If I click on 'No,' nothing happens. Looking at the logs, I see "Ingoring since the user asked to ignore download.".

How does llama-coder check if a model exists with ollama?

How can I use this model, as I already manually created it using the gguf file and ollama create?
Thanks in advance!
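
For reference, based on a later issue in this list, the extension's model check goes through Ollama's /api/tags endpoint; you can inspect what Ollama reports and compare it against the exact name you configured:

curl http://127.0.0.1:11434/api/tags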

Open-WebUI compatibility

Ran into an issue with using the Ollama API through Open-WebUI.

They add an id message at the beginning of the stream that does not include a '.response' part, which causes llama-coder to have issues because it still tries to iterate over 'tokens.response'.

It can be easily fixed by adding the following code to Line 28 of the autocomplete.js file.

if ('response' in tokens) {
    ...
    // This is just to show where the closing bracket goes; this is existing code.
    if (totalLines > args.maxLines && blockStack.length === 0) {
        (0, log_1.info)('Too many lines, breaking.');
        break;
    }
}

Feature Request: Integrate llama-github for Enhanced GitHub Retrieval in Simple Mode

Hello,

I hope this message finds you well. I am the maintainer of llama-github, an open-source Python library designed to empower LLM Chatbots, AI Agents, and Auto-dev Solutions by providing intelligent retrieval of code snippets, issues, and repository information from GitHub.

Proposal:
I believe that integrating llama-github into llama-coder could significantly enhance its functionality by enabling efficient retrieval of relevant GitHub content. This would align with llama-coder's philosophy of using local models, as llama-github can operate in a "simple mode" that does not require GPT-4, thus maintaining the spirit of local processing.

Benefits:

  • Efficient Retrieval: llama-github's advanced retrieval techniques can quickly provide relevant code snippets and repository information, enhancing the coding assistance provided by llama-coder.
  • Local Processing: By using the simple mode of llama-github, you can avoid external OpenAI calls, ensuring that all LLM processing remains local, which is in line with the design principles of llama-coder.
  • Repository Pool: llama-github features a repository pool mechanism that helps conserve users' GitHub API quota by efficiently managing and reusing repository data. This can be particularly beneficial for llama-coder users who may have limited API quota.
  • Enhanced Context: Integrating llama-github can provide richer context and more comprehensive answers to coding queries, improving the overall user experience.

Example Usage:
Here is a simple mode example of how llama-github can be integrated into llama-coder:

# Install first: pip install llama-github
from llama_github import GithubRAG

# Initialize GithubRAG with your credentials
github_rag = GithubRAG(
    github_access_token="your_github_access_token"
)

# Retrieve context for a coding question
query = "How to create a NumPy array in Python?"
context = github_rag.retrieve_context(query, simple_mode=True)
print(context)

Additional Information:
You can find more details and documentation about llama-github here. I would be more than happy to assist with the integration process if you find this proposal valuable.

Thank you for considering this request!

Best regards,
Jet Xu

Unknown model undefined

Always shows "Unknown model undefined" when I try to use it. Ollama is running and different models have been tried.
Reinstalled several times with no help.

More flexibility with remote hosts

Hi,

I use VS Code's Remote SSH/Remote Containers plugin for most of my development.

Testing out llama-coder initially I ran ollama on another machine on my local network, separate from my development laptop. This worked great!

However when I use VS Code to connect to a remote development server, that server doesn't have access to the local machine I use for inference. VS Code only allows llama-coder to run in the remote host. Unfortunately VS Code can't use reverse SSH tunnels, which was the first solution I tried.

I suspect that this could be resolved by adding "extensionKind": ["workspace", "ui"] to the extension manifest, as described here. This would leave the default behavior of the extension running on the remote host, but allow it to be switched to running locally if desired.

Let me know what you think, the above comes from reading over the linked pages but I don't have any experience writing VS Code extensions so I could be wrong about something here.
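
For reference, the suggested manifest change would be roughly the following (package.json excerpt only; the rest of the manifest is omitted):

{
    "extensionKind": ["workspace", "ui"]
}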

Documentation on how to use?

I've downloaded Ollama and pulled the model locally, but I'm not sure what I should expect to happen. There is no guidance on what is supposed to happen or how to use it.

Is it supposed to run on save? Is there a way to run a command or trigger generation?

Unable to switch models

  • Mac Mini M1 (Silicon)
  • macOS 13.6 Ventura
  • Cursor 0.26.2 (Cursor is a VS Code clone and seems to be compatible with Llama Coder)
  • Llama Coder

I started using Llama Coder with the default stable-code:3b-code-q4_0 model.

I then inadvertently downloaded the codellama:7b-code-q4_K_M model and set Llama coder to use that. This gave me the following error...

[screenshot]

Seeing that the q4_K_M model was not recommended for Macs, I deleted the model from Ollama, and I tried to revert to the stable-code:3b-code-q4_0 model again, in the settings...

[screenshot]

The problem is that Llama Coder still seems to think that I am using the codellama:7b-code-q4_K_M model, and nothing I can do seems to get rid of that reference...

[screenshot]

I've even tried closing Cursor and restarting it, but it still continues to reference the wrong model and give the aforementioned error message.

Unsupported document: vscode-vfs

I'm working on a GitHub repository in VS Code and I get this error in the Llama Coder output:

[info] Unsupported document: vscode-vfs://github/docker/dockercraft/docker-compose.yaml ignored.

[screenshot]

Prevent checking if model exists on every autocompletion

Hi!

Thanks for coding this little wonder of an extension. Kudos! I've been using it for a bit, and I have noticed that every autocompletion generates an extra request to the /api/tags endpoint in Ollama:

[screenshot]

I suspect it comes from the call to ollamaCheckModel() in provideInlineCompletionItems():

let modelExists = await ollamaCheckModel(inferenceConfig.endpoint, inferenceConfig.modelName);

In my view it should not be necessary to send a request to the /api/tags endpoint every time. I am aware the latency it introduces is orders of magnitude lower than the /api/generate call, but still ... it's extra work that (in my view) the extension does not need to do.

I'd suggest going for a different strategy 🤔 Perhaps do the check once and save the list of available models to check locally, then check again whenever the configuration changes, or every now and then (a sketch follows below).

Thanks!
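
A minimal sketch of that caching idea (illustrative only; ollamaListModels and the surrounding names are hypothetical, not the extension's actual code):

// Cache the /api/tags result instead of fetching it before every completion.
// `ollamaListModels` is a hypothetical helper that wraps GET /api/tags.
let cachedModels = null;
let cachedAt = 0;
const CACHE_TTL_MS = 60 * 1000;

async function modelExists(endpoint, modelName) {
    const now = Date.now();
    if (!cachedModels || now - cachedAt > CACHE_TTL_MS) {
        cachedModels = await ollamaListModels(endpoint); // hypothetical
        cachedAt = now;
    }
    return cachedModels.includes(modelName);
}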

How to run it in VS Code?

Hi.

I installed it locally on my M1 and it works in CLI.
When I click on Llama Coder in the top right corner (status bar) of VS Code it does nothing.
Sorry for the question; maybe it's too obvious.

Inconsistent Tab Behavior Due to Suggestions Ending with Newline

[screenshot]

I've noticed that code suggestions from the extension sometimes include a trailing newline, as shown in the screenshot, leading to an issue where pressing the Tab key inserts a tab character instead of accepting the suggestion. This disrupts the coding flow. Maybe this could be fixed by trimming the newline from the end of suggestions before they're returned, as shown below:

if (res.endsWith('\n')) {
    res = res.slice(0, -1);
}
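
A slightly more general variant of the same idea (sketch only) that strips any run of trailing newlines instead of just one:

res = res.replace(/\n+$/, '');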

Add setting to reduce autocomplete suggestions

I noticed that on every little change in the file, the extension gives me an autocompletion suggestion. Often the suggestions are not necessary and feel "random", since they lack context.

Additionally this causes a lot of overhead and makes Ollama crash easily on my PC. So a feature that regulates the number of requests the extension makes to the model would be nice. Something like an additional setting that can be adjusted by the user (e.g. only request a completion 0.5 seconds after the user stops typing in the editor).
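
A minimal debounce sketch of that idea in plain JavaScript (illustrative only, not the extension's actual implementation):

// Only fire the completion request once the user has been idle for
// `delayMs`; any earlier pending request is cancelled.
let pending = null;
function scheduleCompletion(run, delayMs = 500) {
    if (pending) clearTimeout(pending);
    pending = setTimeout(() => {
        pending = null;
        run();
    }, delayMs);
}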

Vscode: add help text to model selection dropdown

There are a lot of models to choose from.
I can't tell from their names (e.g. deepseek-coder:6.7b-base-q) what the difference between all these models is.
It would be good to at least link to a help page in that section of the settings that explains the differences and use cases of each model.

As a user I would like to be able to disable autocomplete when running on battery

When my computer is running on battery I would like to be able to disable autocomplete so that I can save power. This behaviour should be configurable so that users can decide whether they want to optimize for power efficiency or productivity.

  • In order to preserve the first-run experience the extension should default to performing autocomplete regardless of power source.
  • If autocomplete is disabled due to this behaviour it should be indicated in the status bar item.
  • This behaviour should take precedence over whether a user has decided to pause autocomplete (see #13).
  • If autocomplete is disabled due to this setting and the user manually pauses autocomplete, autocomplete should remain paused (for the duration of the session) after either disabling this setting or attaching the computer to a power source.

Need clarification: Ollama and codellama-70b running. Will Llama Coder use this?

As title says
If I've already pulled the new (as of 2024-01-30) codellama-70b from meta (or python variant)
Will Llama Coder use this?
Or does it download the 34b and run that?

Does it just run whatever I'm running in Ollama and the list of models you provide are more "recommendations"?
Instructions seem to contradict each other or are not clear.
On one hand it simply says:

Local Installation
Install Ollama on local machine and then launch the extension in VSCode, everything should work as it is.

But then below that, the list of models doesn't go up to 70b and probably doesn't include the new Meta ones: 70b, 70b python and 70b instruct?

Since my machine is capable of running it, I would prefer to.
Successfully running (and quite fast!) ollama run codellama:70b from here: https://ollama.ai/library/codellama:70b

Only reason I haven't simply installed and launched is because I don't want to end up with a 34b download in an unspecified location despite already running 70b :)

Thanks

Ignoring download reset

It seems that ignoring the download of a model will disable any download, even when the model is changed. There should be a way to reset this.
