
llama-coder's Introduction

Llama Coder

Llama Coder is a self-hosted GitHub Copilot replacement for VS Code. It uses Ollama and Codellama to provide autocompletion that runs entirely on your own hardware. Works best with Apple Silicon (M1/M2/M3) Macs or an RTX 4090.

VS Code Plugin

Features

  • 🚀 As good as Copilot
  • ⚡️ Fast. Works well on consumer GPUs. Apple Silicon or an RTX 4090 is recommended for best performance.
  • 🔐 No telemetry or tracking
  • 🔬 Works with any language, coding or human.

Recommended hardware

Minimum required RAM: 16GB; more is better, since even the smallest model takes about 5GB. The best setup is a dedicated machine with an RTX 4090: install Ollama on it and point the endpoint in the extension settings at that machine to offload inference to it. The second-best option is a MacBook with an M1/M2/M3 and enough RAM (more is better, but about 10GB of headroom is enough). Windows notebooks run fine with a decent GPU, but a dedicated machine with a good GPU is recommended. A dedicated gaming PC is perfect.

Local Installation

Install Ollama on your local machine, then launch the extension in VS Code; everything should work out of the box.
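
If you want to sanity-check the local setup from a terminal first, you can confirm Ollama is running and see which models it already has (standard Ollama CLI and API; 11434 is Ollama's default port):

# list the models Ollama already has locally
ollama list

# or query the same API the extension talks to
curl http://127.0.0.1:11434/api/tags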

Remote Installation

Install Ollama on a dedicated machine and point the endpoint in the extension settings at it. Ollama listens on port 11434 and binds to 127.0.0.1 by default; to make it reachable from other machines, set OLLAMA_HOST to 0.0.0.0.
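
A minimal sketch of the remote setup (the IP address below is a placeholder for your dedicated machine):

# on the dedicated machine: make Ollama listen on all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve

# from your development machine: check that the endpoint is reachable
# (replace 192.168.1.50 with the address of the dedicated machine),
# then put http://192.168.1.50:11434 into the extension's endpoint setting
curl http://192.168.1.50:11434/api/tags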

Models

Currently Llama Coder supports only Codellama. Each model is quantized in different ways, but our tests show that q4 quantization is the optimal way to run the network. Bigger models perform better, so always pick the largest model and the highest-precision quantization your machine can handle (see the pull example after the table). The default is stable-code:3b-code-q4_0, which should work everywhere and outperforms most other models of its size.

Name                        RAM/VRAM   Notes
stable-code:3b-code-q4_0    3GB
codellama:7b-code-q4_K_M    5GB
codellama:7b-code-q6_K      6GB        m
codellama:7b-code-fp16      14GB       g
codellama:13b-code-q4_K_M   10GB
codellama:13b-code-q6_K     14GB       m
codellama:34b-code-q4_K_M   24GB
codellama:34b-code-q6_K     32GB       m

  • m - slow on macOS
  • g - slow on older NVIDIA cards (pre-30xx)
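
If you prefer to fetch a model ahead of time instead of letting the extension download it, you can pull it with Ollama using the tags from the table above:

ollama pull stable-code:3b-code-q4_0     # default model, ~3GB
ollama pull codellama:13b-code-q4_K_M    # larger model, needs ~10GB RAM/VRAM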

Troubleshooting

Most problems can be diagnosed from the plugin's output in the VS Code extension output panel.

Changelog

[0.0.14]

  • Ability to pause completion (by @bkyle)
  • Bearer token support for remote inference (by @Sinan-Karakaya)

[0.0.13]

  • Fix remote files support

[0.0.12]

  • Remote support
  • Fix codellama prompt preparation
  • Add trigger delay
  • Add jupyter notebooks support

[0.0.11]

  • Added Stable Code model
  • Pause download only for specific model instead of all models

[0.0.10]

  • Adding ability to pick a custom model
  • Asking user if they want to download model if it is not available

[0.0.9]

  • Adding deepseek 1b model and making it default

[0.0.8]

  • Improved DeepSeek support and language detection

[0.0.7]

  • Added DeepSeek support
  • Ability to change temperature and top p
  • Fixed some bugs

[0.0.6]

  • Fix ollama links
  • Added more models

[0.0.4]

  • Initial release of Llama Coder

llama-coder's People

Contributors

bkyle, corinfinite, drblury, dre-on, ex3ndr, fuad00, kevsnz, sahandevs, spinespine, staff0rd, themcsebi, wrapss


llama-coder's Issues

Allow toggle completion instead of autocompletion

It would be nice to be able to actively tell the plugin "I want an autocompletion for the current line" instead of "do autocomplete all the time" with a shortcut.

So instead of automatically querying the model 250 ms after the last keypress, just wait for the user to press a shortcut.
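
For what it's worth, VS Code already ships a built-in command for manually triggering inline suggestions, which covers the "trigger on demand" half of this request (the keybinding shown is just an example; it does not disable the automatic completions):

{
    "key": "alt+\\",
    "command": "editor.action.inlineSuggest.trigger",
    "when": "editorTextFocus"
}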

Astro files not getting completions

For some reason the extension doesn't work with .astro files?

The tab on the bottom doesn't show, and there is no code completion. Seems like a bug with vscode maybe? All other file types I've tried work.

Avoid downloading models if already installed using ollama

I already installed the codellama and deepseek-llm models using Ollama, but your VS Code extension llama-coder is now displaying "downloading...", so I guess it's downloading additional models it needs. Would it be possible to check whether the models already exist on the system and make downloading them optional, to save time and space?

Move model to external harddrive and symlink it

This is actually a question not an issue:

Where does the VS Code extension download the model to?
Is there a way to choose the location where the model can be saved to?

I am testing lots of UIs for LLMs and therefore often end up downloading the same model over and over. This eats disk space very quickly. For things like Stable Diffusion with Automatic1111, ComfyUI, Fooocus, StableSwarm etc. it is easy to either set a custom source folder via the software's settings or to manually create a symlinked folder. Then I only need the models once.

Here it is not really clear where the model is saved to, which is something that tends to create distrust. I would very much like to be able to see where things are downloaded to, and even better: I would like to set options for things like that.

Would that be possible to implement here, too?

Also, since this is not an issue: can I motivate you to open the "Discussions" panel here in your GitHub repo? That would provide a "Questions" section, where this topic would fit better. Thank you! :-)
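
For context: the extension downloads models through Ollama, which by default stores them under ~/.ollama/models (on macOS/Linux) and also honors an OLLAMA_MODELS environment variable for a custom location. A symlink workaround might look like this (paths are placeholders):

# move Ollama's model store to an external drive and link it back
mv ~/.ollama/models /mnt/external/ollama-models
ln -s /mnt/external/ollama-models ~/.ollama/models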

Larger models just seem to return metadata

When I run a model like codellama:34b-code-q6_K it does seem to spin up my GPUs, but then I end up with unusable output. Running the latest Ollama and extension version on Ubuntu 22.04.

2024-03-12 01:57:27.392 [info] Running AI completion...
2024-03-12 01:57:31.655 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.654294365Z","response":" ","done":false}
2024-03-12 01:57:31.700 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.699862247Z","response":"\n","done":false}
2024-03-12 01:57:31.746 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.745519433Z","response":"\u003c/","done":false}
2024-03-12 01:57:31.790 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.790407012Z","response":"PRE","done":false}
2024-03-12 01:57:31.836 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.835927564Z","response":"\u003e","done":false}
2024-03-12 01:57:31.881 [info] Receive line: {"model":"codellama:34b-code-q6_K","created_at":"2024-03-12T05:57:31.881386364Z","response":"","done":true,"total_duration":4488985784,"load_duration":270215,"prompt_eval_count":1427,"prompt_eval_duration":4261101000,"eval_count":6,"eval_duration":226981000}
2024-03-12 01:57:31.882 [info] AI completion completed:

[screenshot]

Not working with Remote-SSH from Microsoft, works fine on local files

I work primarily remotely using the ms-vscode-remote.remote-ssh-edit extension. I'm not able to get the llama-coder autocomplete to work on those projects. It works just fine for projects on my local machine. Not sure how to debug further, but it seems that llama-coder does not trigger when I stop typing.

Extension Literally Does Nothing

I have spent some time trying to find a usage guide but failed miserably. I can see the extension is installed and doing something as I can see the spinning wheel in the footer of the VSCode app.

I am also sure I have the model on my machine, as I can see it in Ollama and can run an instance via ollama run <MODEL_NAME>

When I go to the output tab of VS Code I can see the following:

2024-02-26 14:21:32.734 [info] Llama Coder is activated.
2024-02-26 14:26:51.699 [info] Canceled after AI completion.
2024-02-26 14:26:52.006 [info] Running AI completion...
2024-02-26 14:27:00.153 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:27:00.151644Z","response":".","done":false}
2024-02-26 14:27:00.158 [info] AI completion completed: 
2024-02-26 14:27:00.158 [info] Canceled after AI completion.
2024-02-26 14:27:00.160 [info] Canceled before AI completion.
2024-02-26 14:28:34.455 [info] Running AI completion...
2024-02-26 14:28:47.386 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:47.384057Z","response":"","done":true,"total_duration":12925537752,"load_duration":5852886428,"prompt_eval_count":130,"prompt_eval_duration":7072116000,"eval_count":1,"eval_duration":22000}
2024-02-26 14:28:47.388 [info] AI completion completed: 
2024-02-26 14:28:47.388 [info] Canceled after AI completion.
2024-02-26 14:28:47.392 [info] Canceled before AI completion.
2024-02-26 14:28:47.392 [info] Canceled before AI completion.
2024-02-26 14:28:47.394 [info] Canceled before AI completion.
2024-02-26 14:28:47.404 [info] Running AI completion...
2024-02-26 14:28:47.726 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:47.725568Z","response":"\n\n","done":false}
2024-02-26 14:28:47.829 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:47.828662Z","response":"","done":true,"total_duration":422186670,"load_duration":762292,"prompt_eval_count":4,"prompt_eval_duration":317920000,"eval_count":2,"eval_duration":102965000}
2024-02-26 14:28:47.830 [info] AI completion completed: 


2024-02-26 14:28:50.926 [info] Running AI completion...
2024-02-26 14:28:51.185 [info] Receive line: {"model":"stable-code:3b-code-q4_0","created_at":"2024-02-26T14:28:51.184114Z","response":"","done":true,"total_duration":257221422,"load_duration":439644,"prompt_eval_count":3,"prompt_eval_duration":256262000,"eval_count":1,"eval_duration":24000}
2024-02-26 14:28:51.186 [info] AI completion completed: 

However, nothing is ever outputted in the terminal or within the filesystem

Error during inference: fetch failed

I have an ollama container running the stable-code:3b-code-q4_0 model. I'm able to interact with the model via curl:

curl -d '{"model":"stable-code:3b-code-q4_0", "prompt": "c++"}' https://notarealurl.io/api/generate

and get a response in a terminal in wsl where I'm running vscode:

[screenshot]

However when I set the Ollama Server Endpoint to https://notarealurl.io/ I just get [warning] Error during inference: fetch failed

Ollama support and decouple codellama

Ollama is very popular and very easy to setup. It would be great if the extension didn't bundle with codellama and instead we could drop in the ollama endpoint to begin completions

Can't use custom model

I want to use a custom model via Ollama, to which I gave the name copilot. I did ollama create copilot -f copilot and running ollama run copilot works in the terminal, and other extensions also work well with it. However, when I start Llama Coder and set the custom model to 'copilot,' the extension tells me that copilot wasn't downloaded and asks if I want to download it. If I click on 'No,' nothing happens. Looking at the logs, I see "Ingoring since the user asked to ignore download.".

How does llama-coder check if a model exists with ollama?

How can I use this model, as I already manually created it using the gguf file and ollama create?
Thanks in advance!
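
For reference, based on a later issue in this list, the extension's model check goes through Ollama's /api/tags endpoint; you can inspect what Ollama reports and compare it against the exact name you configured:

curl http://127.0.0.1:11434/api/tags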

Open-WebUI compatibility

Ran into an issue with using the Ollama API through Open-WebUI.

They add an id message at the beginning of the stream that does not include a '.response' part, which causes llama-coder to have issues because it still tries to iterate over 'tokens.response'.

It can be easily fixed by adding the following code to Line 28 of the autocomplete.js file.

if ('response' in tokens) {
    ...
    // This is just to show where the closing bracket goes; this is existing code.
    if (totalLines > args.maxLines && blockStack.length === 0) {
        (0, log_1.info)('Too many lines, breaking.');
        break;
    }
}

Feature Request: Integrate llama-github for Enhanced GitHub Retrieval in Simple Mode

Hello,

I hope this message finds you well. I am the maintainer of llama-github, an open-source Python library designed to empower LLM Chatbots, AI Agents, and Auto-dev Solutions by providing intelligent retrieval of code snippets, issues, and repository information from GitHub.

Proposal:
I believe that integrating llama-github into llama-coder could significantly enhance its functionality by enabling efficient retrieval of relevant GitHub content. This would align with llama-coder's philosophy of using local models, as llama-github can operate in a "simple mode" that does not require GPT-4, thus maintaining the spirit of local processing.

Benefits:

  • Efficient Retrieval: llama-github's advanced retrieval techniques can quickly provide relevant code snippets and repository information, enhancing the coding assistance provided by llama-coder.
  • Local Processing: By using the simple mode of llama-github, you can avoid external OpenAI calls, ensuring that all LLM processing remains local, which is in line with the design principles of llama-coder.
  • Repository Pool: llama-github features a repository pool mechanism that helps conserve users' GitHub API quota by efficiently managing and reusing repository data. This can be particularly beneficial for llama-coder users who may have limited API quota.
  • Enhanced Context: Integrating llama-github can provide richer context and more comprehensive answers to coding queries, improving the overall user experience.

Example Usage:
Here is a simple mode example of how llama-github can be integrated into llama-coder:

# Install first: pip install llama-github
from llama_github import GithubRAG

# Initialize GithubRAG with your credentials
github_rag = GithubRAG(
    github_access_token="your_github_access_token"
)

# Retrieve context for a coding question
query = "How to create a NumPy array in Python?"
context = github_rag.retrieve_context(query, simple_mode=True)
print(context)

Additional Information:
You can find more details and documentation about llama-github here. I would be more than happy to assist with the integration process if you find this proposal valuable.

Thank you for considering this request!

Best regards,
Jet Xu

Unknown model undefined

Always shows "Unknown model undefined" when I try to use it. Ollama is running and different models have been tried.
Reinstalled several times with no help.

More flexibility with remote hosts

Hi,

I use VS Code's Remote SSH/Remote Containers plugin for most of my development.

Testing out llama-coder initially I ran ollama on another machine on my local network, separate from my development laptop. This worked great!

However when I use VS Code to connect to a remote development server, that server doesn't have access to the local machine I use for inference. VS Code only allows llama-coder to run in the remote host. Unfortunately VS Code can't use reverse SSH tunnels, which was the first solution I tried.

I suspect that this could be resolved by adding "extensionKind": ["workspace", "ui"] to the extension manifest, as described here. This would leave the default behavior of the extension running on the remote host, but allow it to be switched to running locally if desired.

Let me know what you think, the above comes from reading over the linked pages but I don't have any experience writing VS Code extensions so I could be wrong about something here.
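
For reference, the suggested manifest change would be roughly the following (package.json excerpt only; the rest of the manifest is omitted):

{
    "extensionKind": ["workspace", "ui"]
}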

Documentation on how to use?

I've downloaded Ollama and pulled the model locally, but I'm not sure what I should expect to happen. There is no guidance on what is supposed to happen or how to use it.

Is it supposed to run on save? Is there a way to run a command or trigger generation?

Unable to switch models

  • Mac Mini M1 (Silicon)
  • macOS 13.6 Ventura
  • Cursor 0.26.2 (Cursor is a VS Code clone and seems to be compatible with Llama Coder)
  • Llama Coder

I started using Llama Coder with the default stable-code:3b-code-q4_0 model.

I then inadvertently downloaded the codellama:7b-code-q4_K_M model and set Llama coder to use that. This gave me the following error...

[screenshot]

Seeing that the q4_K_M model was not recommended for Macs, I deleted the model from Ollama, and I tried to revert to the stable-code:3b-code-q4_0 model again, in the settings...

[screenshot]

The problem is that Llama Coder still seems to think that I am using the codellama:7b-code-q4_K_M model, and nothing I can do seems to get rid of that reference...

[screenshot]

I've even tried closing Cursor and restarting it, but it still continues to reference the wrong model and give the aforementioned error message.

Unsupported document: vscode-vfs

I'm working on a GitHub repository in VS Code and I get this error in the Llama Coder output:

[info] Unsupported document: vscode-vfs://github/docker/dockercraft/docker-compose.yaml ignored.

[screenshot]

Prevent checking if model exists on every autocompletion

Hi!

Thanks for coding this little wonder of an extension. Kudos! I've been using it for a bit, and I have noticed that every autocompletion generates an extra request to the /api/tags endpoint in Ollama:

[screenshot]

I suspect it comes from the call to ollamaCheckModel() in provideInlineCompletionItems():

let modelExists = await ollamaCheckModel(inferenceConfig.endpoint, inferenceConfig.modelName);

In my view it should not be necessary to send a request to the /api/tags endpoint every time. I am aware the latency it introduces is orders of magnitude lower than the /api/generate call, but still ... it's extra work that (in my view) the extension does not need to do.

I'd suggest going for a different strategy 🤔 Perhaps do the check once and save the list of available models to check locally, then check again whenever the configuration changes, or every now and then (a sketch follows below).

Thanks!
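
A minimal sketch of that caching idea (illustrative only; ollamaListModels and the surrounding names are hypothetical, not the extension's actual code):

// Cache the /api/tags result instead of fetching it before every completion.
// `ollamaListModels` is a hypothetical helper that wraps GET /api/tags.
let cachedModels = null;
let cachedAt = 0;
const CACHE_TTL_MS = 60 * 1000;

async function modelExists(endpoint, modelName) {
    const now = Date.now();
    if (!cachedModels || now - cachedAt > CACHE_TTL_MS) {
        cachedModels = await ollamaListModels(endpoint); // hypothetical
        cachedAt = now;
    }
    return cachedModels.includes(modelName);
}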

How to run it in VS Code?

Hi.

I installed it locally on my M1 and it works in CLI.
When I click on Llama Coder in the top right corner (status bar) of VS Code it does nothing.
Sorry for the question; maybe it's too obvious.

Inconsistent Tab Behavior Due to Suggestions Ending with Newline

[screenshot]

I've noticed that code suggestions from the extension sometimes include a trailing newline, as shown in the screenshot, leading to an issue where pressing the Tab key inserts a tab character instead of accepting the suggestion. This disrupts the coding flow. Maybe this could be fixed by trimming the newline from the end of suggestions before they're returned, as shown below:

if (res.endsWith('\n')) {
    res = res.slice(0, -1);
}
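
A slightly more general variant of the same idea (sketch only) that strips any run of trailing newlines instead of just one:

res = res.replace(/\n+$/, '');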

Add setting to reduce autocomplete suggestions

I noticed that on every little change in the file, the extension gives me an autocompletion suggestion. Often the suggestions are not necessary and feel "random", since they lack context.

Additionally this causes a lot of overhead and makes Ollama crash easily on my PC. So a feature that regulates the number of requests the extension makes to the model would be nice. Something like an additional setting that can be adjusted by the user (e.g. only request a completion 0.5 seconds after the user stops typing in the editor).
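
A minimal debounce sketch of that idea in plain JavaScript (illustrative only, not the extension's actual implementation):

// Only fire the completion request once the user has been idle for
// `delayMs`; any earlier pending request is cancelled.
let pending = null;
function scheduleCompletion(run, delayMs = 500) {
    if (pending) clearTimeout(pending);
    pending = setTimeout(() => {
        pending = null;
        run();
    }, delayMs);
}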

Vscode: add help text to model selection dropdown

There are a lot of models to choose from.
I can't tell from their names (e.g. deepseek-coder:6.7b-base-q) what the difference between all these models is.
It would be good to at least link to a help page in that section of the settings that explains the differences and use cases of each model.

As a user I would like to be able to disable autocomplete when running on battery

When my computer is running on battery I would like to be able to disable autocomplete so that I can save power. This behaviour should be configurable so that users can decide whether they want to optimize for power efficiency or productivity.

  • In order to preserve the first-run experience the extension should default to performing autocomplete regardless of power source.
  • If autocomplete is disabled due to this behaviour it should be indicated in the status bar item.
  • This behaviour should take precedence over whether a user has decided to pause autocomplete (see #13).
  • If autocomplete is disabled due to this setting and the user manually pauses autocomplete, autocomplete should remain paused (for the duration of the session) after either disabling this setting or attaching the computer to a power source.

Need clarification: Ollama and codellama-70b running. Will Llama Coder use this?

As title says
If I've already pulled the new (as of 2024-01-30) codellama-70b from meta (or python variant)
Will Llama Coder use this?
Or does it download the 34b and run that?

Does it just run whatever I'm running in Ollama and the list of models you provide are more "recommendations"?
Instructions seem to contradict each other or are not clear.
On one hand it simply says:

Local Installation
Install Ollama on local machine and then launch the extension in VSCode, everything should work as it is.

But then below that, the list of models doesn't go up to 70b and probably doesn't include the new Meta ones: 70b, 70b python and 70b instruct?

Since my machine is capable of running it, I would prefer to.
Successfully running (and quite fast!) ollama run codellama:70b from here: https://ollama.ai/library/codellama:70b

Only reason I haven't simply installed and launched is because I don't want to end up with a 34b download in an unspecified location despite already running 70b :)

Thanks

Ignoring download reset

It seems that ignoring the download of a model will disable any download, even when the model is changed. There should be a way to reset this.
