scaleapi / llm-engine

Scale LLM Engine public repository

Home Page: https://llm-engine.scale.com

License: Apache License 2.0

Languages: Dockerfile 0.30%, Makefile 0.01%, Smarty 0.91%, Python 98.73%, Shell 0.05%
Topics: fine-tune, llm, python, scaleai

llm-engine's Introduction

LLM Engine


🚀 The open source engine for fine-tuning and serving large language models. 🚀

Scale's LLM Engine is the easiest way to customize and serve LLMs. In LLM Engine, models can be accessed via Scale's hosted version or by using the Helm charts in this repository to run model inference and fine-tuning in your own infrastructure.

💻 Quick Install

pip install scale-llm-engine

🤔 About

Foundation models are emerging as the building blocks of AI. However, deploying these models to the cloud and fine-tuning them are expensive operations that require infrastructure and ML expertise. It is also difficult to maintain over time as new models are released and new techniques for both inference and fine-tuning are made available.

LLM Engine is a Python library, CLI, and Helm chart that provides everything you need to serve and fine-tune foundation models, whether you use Scale's hosted infrastructure or do it in your own cloud infrastructure using Kubernetes.

Key Features

๐ŸŽ Ready-to-use APIs for your favorite models: Deploy and serve open-source foundation models โ€” including LLaMA, MPT and Falcon. Use Scale-hosted models or deploy to your own infrastructure.

๐Ÿ”ง Fine-tune foundation models: Fine-tune open-source foundation models on your own data for optimized performance.

๐ŸŽ™๏ธ Optimized Inference: LLM Engine provides inference APIs for streaming responses and dynamically batching inputs for higher throughput and lower latency.

๐Ÿค— Open-Source Integrations: Deploy any Hugging Face model with a single command.

Features Coming Soon

๐Ÿณ K8s Installation Documentation: We are working hard to document installation and maintenance of inference and fine-tuning functionality on your own infrastructure. For now, our documentation covers using our client libraries to access Scale's hosted infrastructure.

โ„ Fast Cold-Start Times: To prevent GPUs from idling, LLM Engine automatically scales your model to zero when it's not in use and scales up within seconds, even for large foundation models.

๐Ÿ’ธ Cost Optimization: Deploy AI models cheaper than commercial ones, including cold-start and warm-down times.

๐Ÿš€ Quick Start

Navigate to Scale Spellbook to first create an account, and then grab your API key on the Settings page. Set this API key as the SCALE_API_KEY environment variable by adding the following line to your .zshrc or .bash_profile:

export SCALE_API_KEY="[Your API key]"

If you run into an "Invalid API Key" error, you may need to run the . ~/.zshrc command to re-read your updated .zshrc.

With your API key set, you can now send LLM Engine requests using the Python client. Try out this starter code:

from llmengine import Completion

response = Completion.create(
    model="falcon-7b-instruct",
    prompt="I'm opening a pancake restaurant that specializes in unique pancake shapes, colors, and flavors. List 3 quirky names I could name my restaurant.",
    max_new_tokens=100,
    temperature=0.2,
)

print(response.output.text)

You should see a successful completion of your given prompt!
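Token streaming is also supported for lower perceived latency. A minimal sketch using the stream=True flag from the Completion docs (the exact response shape may vary across client versions):

from llmengine import Completion

stream = Completion.create(
    model="falcon-7b-instruct",
    prompt="Give me three more quirky pancake restaurant names.",
    max_new_tokens=100,
    temperature=0.2,
    stream=True,
)

for response in stream:
    if response.output:
        print(response.output.text, end="")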

What's next? Visit the LLM Engine documentation pages for more on the Completion and FineTune APIs and how to use them. Check out this blog post for an end-to-end example.

llm-engine's People

Contributors

acmatscale, anant-marur, dependabot[bot], dmchoiboi, edgan8, edwardpark97, eltociear, francesy-scale, gargutsav, ian-scale, jaisanliang, jihan-yin, mfagundo-scale, phil-scale, rkaplan, ruizehung, ruizehung-scale, saiatmakuri, sam-scale, seanshi-scale, song-william, squeakymouse, tiffzhao5, yixu34, yunfeng-scale


llm-engine's Issues

FineTune.create - NotFoundError (API endpoint seems to throw 404)

When calling FineTune.create with an arbitrary model and CSV, a "NotFoundError" is thrown. On investigation, the Scale AI Spellbook API endpoint at https://api.spellbook.scale.com/v1/llm/fine-tune returns a 404.

Used version: 0.0.0b5

Code:

from llmengine import FineTune
import os
import yaml

if __name__ == "__main__":
    with open("environments/spellbook_api_key.yaml", "r") as stream:
        os.environ.update(yaml.safe_load(stream))

    response = FineTune.create(
        model="llama-2-7b",
        training_file="storage.googleapis.com/llama2-finetune-test/finetune_dataset.csv",
    )

    print(response.json())

Error: (screenshot of the NotFoundError omitted)

Self-host on RunPod

Hi, can you please add documentation for self-hosting on RunPod? It offers cheaper GPU prices than AWS.

Allow users to set API key without using env variables

Currently, you need the SCALE_API_KEY env var set. This is generally fine, but it is annoying for, say, notebook users.

We should allow setting a module-level API key that, if present, supersedes the env var value, e.g.:

import llmengine

llmengine.api_key = "abc"

Model name parameter after fine-tuning

In the inference script for a fine-tuned model, there's a model parameter:

from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Do you offer in-flight Wi-fi?",
    max_new_tokens=100,
    temperature=0.2,
)
print(response.json())

Inserting the name of the fine-tune from the get() method, as described in the documentation, throws the error "NotFoundError: The specified endpoint could not be found."

What should the model name be?
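For reference, the client's GetFineTuneResponse documents a fine_tuned_model field; a minimal sketch, assuming the job has succeeded (the ID below is a placeholder):

from llmengine import FineTune

ft = FineTune.get("ft-abc123")  # placeholder; use the ID returned by FineTune.create
print(ft.fine_tuned_model)      # once the job succeeds, pass this value as model=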

Import Completion error

from llmengine import Completion
ImportError: cannot import name 'Completion' from partially initialized module 'llmengine' (most likely due to a circular import)

pip list:

scale-llm-engine 0.0.0b7

Bug: API calls don't work on Windows due to os.path.join

Hey everyone. This stems from an older, closed issue: #170.

The post_sync method in api_engine.py builds the URL for the API call by concatenating its parts with Python's os.path.join method.

See:

    @classmethod
    def post_sync(cls, resource_name: str, data: Dict[str, Any], timeout: int) -> Dict[str, Any]:
        api_key = get_api_key()
        response = requests.post(
            os.path.join(LLM_ENGINE_BASE_PATH, resource_name),
            json=data,
            timeout=timeout,
            headers={"x-api-key": api_key},
            auth=(api_key, ""),
        )
        if response.status_code != 200:
            raise parse_error(response.status_code, response.content)
        payload = response.json()
        return payload

However, that method is designed to join file-system paths using the conventions of whatever OS Python is running on. On Windows, this produces a URL in which one separator is a backslash, since that is the divider used by the Windows file system.

Long story short: this bug breaks all API calls when the package is used on Windows, because os.path.join puts a backslash instead of a forward slash into the endpoint URL.
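A minimal sketch of one possible fix, reusing the names from the post_sync snippet above: join with posixpath (or urllib.parse.urljoin), which always uses forward slashes regardless of the host OS.

import posixpath

# Always joins with "/", even on Windows; LLM_ENGINE_BASE_PATH and
# resource_name are as in the post_sync snippet above.
url = posixpath.join(LLM_ENGINE_BASE_PATH, resource_name)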

โš ๏ธ LLM Engine fine-tuning maintenance โš ๏ธ

We are doing some maintenance for LLM Engine fine-tuning events. See email text below:

Begin email text

Dear Spellbook users,

We are performing some maintenance of LLM Engine this week. If you are not an LLM Engine user, you can disregard this email. Otherwise, please read on.

How Will You Be Affected?
During the maintenance window, any fine-tuned models that you've created will be unavailable to serve traffic. Requests sent to those models will return an HTTP 503 error. You will be able to create fine-tunes throughout the maintenance window, but the resulting models will be subject to the maintenance schedule downtime.

Note that the Completions API for base models is unaffected. Also, any files uploaded through the Files API will be unaffected. There is no loss of any of your metadata or data as a result of this maintenance.

Maintenance Schedule
LLM Engine fine-tuned models will be unavailable between Monday August 28th 22:00 PDT and Thursday August 31st 09:00 PDT. After 09:00 PDT on August 31st, fine-tuned models will be available again.

Why is this Maintenance Necessary?
This maintenance is necessary to roll out performance enhancements that will increase throughput and reduce cost/token for fine-tuned endpoints. As we prepare to transition LLM Engine out of evaluation mode and into production, these enhancements are necessary to scale to your production workloads.

What Do You Need to Do?
We kindly request that you take the following steps:

  1. Communicate with your team: Please inform your team members about this maintenance so that they are aware of the downtimes.
  2. Stay informed: We will provide updates via this Github issue. Please follow the issue and comment there for any additional assistance you might need.

Thank you for using LLM Engine. We look forward to delivering an even better experience after this maintenance.

Thanks,
-Yi

End email text

Max token length for fine-tune and completion endpoints on Llama-2?

Great job with this repo. I was able to fine-tune Llama-2 and it certainly seems to have an effect.

Unfortunately, fine-tuning silently accepts all inputs, and the documentation states only that inputs are truncated to the max length. But it's not specified anywhere what Llama-2's max length is. Meta originally released it with a bug that capped the max length at 2048, while the native max length seems to be 4096. So which is it?

Also, I tested my fine-tuned model's completion code with inputs as large as 12,000 tokens and it still produces a completion. So I assume you truncate there as well? Keeping only the tail of the prompt, presumably?

tldr:

  1. What is Llama-2's max token length?
  2. Is there anything we can do to affect this, or to get better visibility into how the input was tokenized? (one client-side approach is sketched below)
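Until this is documented, one way to get client-side visibility is to tokenize locally with the model's own tokenizer. A sketch, assuming you have access to the gated meta-llama/Llama-2-7b-hf repo on Hugging Face:

from transformers import AutoTokenizer

# Llama-2 tokenizer; requires accepting Meta's license on Hugging Face
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Do you offer in-flight Wi-fi?"
token_ids = tokenizer.encode(prompt)
print(len(token_ids))  # compare against the 2048 vs. 4096 limits discussed above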

Fine-Tuning LLM using Local GPU and Infra

A big thanks for the LLM Engine. I don't know if there's a provision for fine-tuning LLM models using local GPUs/infrastructure; kindly help me with that.
--------------------------------------------------------------------------
UnauthorizedError                         Traceback (most recent call last)
/data/MBBS_pharma/fineTuningLLAMAScienceQA.ipynb Cell 15, line 1
----> 1 response = FineTune.create(
      2     model="llama-2-7b",
      3     training_file=r"/data/MBBS_pharma/trainScience.csv",
      4     validation_file=r"/data/MBBS_pharma/valScience.csv",
      5     hyperparameters={
      6         'lr': 2e-4,
      7     },
      8     suffix='science-qa-llama'
      9 )
     10 run_id = response.id

File /data/MBBS_pharma/mbbsPharma/lib64/python3.9/site-packages/llmengine/fine_tuning.py:151, in FineTune.create(cls, model, training_file, validation_file, hyperparameters, wandb_config, suffix)
(...)
--> 107 raise parse_error(response.status_code, response.content)
    108 payload = response.json()
    109 return payload

UnauthorizedError: Invalid API Key.

Parsing Error Raised when Attempting to Run Example Code

I installed scale-llm-engine using pip on Windows and am running it with Python. To test the engine, I ran the example code shown on the engine's website, but it raised a parsing error.

(screenshots of the error and the code that raised it omitted)

Any help would be appreciated.

Add Completions support for Llama 3

Feature Request

What is the problem you're currently running into?

Llama 3 is now available, let's add it to the Model Zoo!

Why do you want this feature?

Internal user demand.


Error: Internal Server Error: <class 'AttributeError'>: 'CreateFineTuneResponse' object has no attribute 'artifact_id'

Hi Team,

Can you provide a quick solution to the internal server error?
I was trying to fine-tune and got this error using my own code as well as the code provided in the LLM Engine documents.

Code:

create_fine_tune_response = FineTune.create(
    model="llama-2-7b",
    training_file="https://scale-demo-datasets.s3.us-west-2.amazonaws.com/sports/sports_training_dataset.csv",
    validation_file=None,
    hyperparameters={"epochs": "1", "lr": "0.0002"},
    suffix="my-first-fine-tune",
)

fine_tune_id = create_fine_tune_response.fine_tune_id

Error:

UnknownError                              Traceback (most recent call last)
<ipython-input> in <cell line: 1>()
----> 1 create_fine_tune_response = FineTune.create(
      2     model="llama-2-7b",
      3     training_file="https://scale-demo-datasets.s3.us-west-2.amazonaws.com/sports/sports_training_dataset.csv",
      4     validation_file=None,
      5     hyperparameters={"epochs": "1", "lr": "0.0002"},

1 frames
/usr/local/lib/python3.10/dist-packages/llmengine/api_engine.py in post_sync(cls, resource_name, data, timeout)
    105 )
    106 if response.status_code != 200:
--> 107     raise parse_error(response.status_code, response.content)
    108 payload = response.json()
    109 return payload

UnknownError: Internal Server Error: <class 'AttributeError'>: 'CreateFineTuneResponse' object has no attribute 'artifact_id'

GKE Helm deployment

Is there any thought/desire to set up a Helm chart for GKE deployment as well?

Cannot pass in PEFT configs when creating a fine-tuning job

Describe the bug
I'm unable to pass in PEFT configs when doing FineTune.create(). The documentation states that it should be a "dict of parameters" but when I pass in a dictionary I get the following error:

llmengine.errors.UnknownError: Internal Server Error: <class 'launch.api_client.exceptions.ApiValueError'>: Invalid inputs given to generate an instance of <class 'launch.api_client.model.create_fine_tune_request.CreateFineTuneRequest.MetaOapg.properties.hyperparameters.MetaOapg.additional_properties'>. None of the anyOf schemas matched the input data.

Looking through the code, it seems like only strings, ints, or floats are accepted, so I tried converting the dictionary to a string, but that results in this error later in the pipeline, when the fine-tuning actually runs:

'{"events": [{"timestamp": 1697248409.2646654, "message": "\'str\' object has no attribute \'get\'", "level": "error"}]}'

I also tried passing in a LoraConfig object from Hugging Face, but that wasn't accepted either. Am I missing something obvious? How do I pass in my PEFT configs?

LLM Engine Version

  • LLM Engine Version: 0.0.0b19

System Version

  • Python Version: 3.11.5
  • Operating System: macOS

Minimal Reproducible Example
Running the code snippet as is should do it. I made some dummy test files for this example.

from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file="file-963KxGaPu8WET25",
    validation_file="file-qU5_mPTaRlmIuLK",
    hyperparameters={
        "lr": 2e-3,
        "epochs": 5,
        "peft_config": {
            "r": 8,
            "lora_alpha": 8,
            "lora_dropout": 0.0,
        },
    },
    suffix="foofoobarbar",
)

Please advise, thank you!

Can we fine-tune with data stored in a local CSV file?

The docs state that the training_file parameter of the FineTune.create() method should be a "publicly accessible URL to a CSV file for training".

Does that mean we can only fine-tune a model on data hosted publicly, and not locally? That somewhat defeats the purpose of experimenting with data we may not want to make public just yet.

If local files aren't supported, it would be great if fine-tuning jobs failed earlier: right now I am testing with a local CSV file, it simply runs for a long time, and it is unclear whether the failure is caused by bad input.
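One possible workaround, assuming the Files API mentioned elsewhere in this repo supports local uploads (a sketch based on the client's File.upload; check the docs for the exact return shape):

from llmengine import File, FineTune

# Upload a local CSV instead of hosting it at a public URL
upload = File.upload(open("training_dataset.csv", "r"))

response = FineTune.create(
    model="llama-2-7b",
    training_file=upload.id,  # file ID returned by the upload
)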

[Feature Request] Add additional options for text generation

Currently, Completion.create only supports adjusting sampling via temperature, and doesn't support greedy decoding.

This is a feature request to support the following options, pulled from the TGI text_generation client.generate docstring.

"""
    do_sample (`bool`):
        Activate logits sampling
    max_new_tokens (`int`):
        Maximum number of generated tokens
    best_of (`int`):
        Generate best_of sequences and return the one with the highest token logprobs
    repetition_penalty (`float`):
        The parameter for repetition penalty. 1.0 means no penalty. See [this
        paper](https://arxiv.org/pdf/1909.05858.pdf) for more details.
    return_full_text (`bool`):
        Whether to prepend the prompt to the generated text
    seed (`int`):
        Random sampling seed
    stop_sequences (`List[str]`):
        Stop generating tokens if a member of `stop_sequences` is generated
    temperature (`float`):
        The value used to modulate the logits distribution.
    top_k (`int`):
        The number of highest probability vocabulary tokens to keep for top-k-filtering.
    top_p (`float`):
        If set to < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or
        higher are kept for generation.
"""

Preparing dataset for LLaMa-13b-chat

For LLaMA-13B (the non-chat version), the documentation states that data must be formatted as a CSV file with two columns: prompt and response.

How do we prepare data for the chat version?
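Pending official guidance, one plausible interim approach is to flatten each conversation into the documented two-column prompt/response format. A sketch; the role-prefix formatting is an assumption, not a documented convention:

import csv

# Hypothetical flattening: everything up to the last assistant turn
# becomes the prompt; the final assistant turn becomes the response.
conversation = [
    ("user", "Do you offer in-flight Wi-fi?"),
    ("assistant", "Yes, on most routes."),
]

prompt = "\n".join(f"{role}: {text}" for role, text in conversation[:-1])
response = conversation[-1][1]

with open("chat_finetune.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "response"])
    writer.writerow([prompt, response])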

[Datasets] S3 presigned URL fails?

Are we downloading the data from S3 immediately? I am creating presigned URLs that should last long enough, but my model run failed after a while...

2023-07-24 21:34:49,770 INFO [train_llm_engine] [train_llm_engine.py:113] - run_id: ft-civestsdoujg03t1akcg, job_status: RUNNING
...
2023-07-24 22:39:14,368 INFO [train_llm_engine] [train_llm_engine.py:113] - run_id: ft-civestsdoujg03t1akcg, job_status: RUNNING
2023-07-24 22:40:14,750 INFO [train_llm_engine] [train_llm_engine.py:113] - run_id: ft-civestsdoujg03t1akcg, job_status: FAILURE

Get events:

(Pdb) FineTune.get_events('ft-civestsdoujg03t1akcg')
GetFineTuneEventsResponse(events=[LLMFineTuneEvent(timestamp=1690238353.925053, message='Unable to read dataset from s3.', level='error')])

If we're continuously reading from S3, it would be better to download the data up front.
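For reference, a presigned URL's lifetime is fixed when it is generated, so if the trainer re-reads from S3 throughout a multi-hour run, the URL must outlive the entire job. A minimal boto3 sketch (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# ExpiresIn is in seconds; make it longer than the whole fine-tune run
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "finetune_dataset.csv"},
    ExpiresIn=24 * 60 * 60,  # 24 hours
)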

Add support for Mistral-7B in the Completions API

Feature Request

What is the problem you're currently running into?

Add Mistral-7B to LLM Engine.

Why do you want this feature?

https://mistral.ai/news/announcing-mistral-7b/

Benchmarks seem impressive. (benchmark screenshot omitted)

Describe the solution you'd like

Mistral-7B should appear in the model zoo.

Describe alternatives you've considered

N/A


Add support for Mistral in the FineTune API

Feature Request

What is the problem you're currently running into?

We don't yet have Mistral models in the FineTune API.

Why do you want this feature?

Mistral outperforms llama-2-7b, so it should be at least equally popular for fine-tuning.

Describe the solution you'd like

Add Mistral models (mistral-7b and mistral-7b-instruct) to the FineTune API so that users can call:

FineTune.create(model="mistral-7b", ...)

