Tuning API in Katib for LLMs

andreyvelich commented on June 11, 2024

Tuning API in Katib for LLMs

from katib.

Comments (4)

tariq-hasan commented on June 11, 2024

Having worked through the Python SDK and examples for the training operator and Katib, I have further ideas on an appropriate implementation of the tuning API in Katib for LLMs.

It appears that the current implementation of the tune API in the Katib Python SDK relies on a mandatory objective function to define the trial specification as a batch Job. The proposed higher-level interface on top of the existing Katib API, by contrast, is meant to fine-tune pre-trained models on custom datasets.

The following are some important points to note:

  • This implementation is in the context of LLMs.
  • The objective function has already been defined in the trainer module (along with a prerequisite storage initializer module).
  • The trainer and storage initializer modules are used to define containers for the master and workers in the PyTorchJob specification, as part of the implementation of the train API for the training operator.

Following the example implementation of a Katib experiment using PyTorchJob, we would therefore need to modify the tune API to take in either a combination of objective and parameters, or a combination of model_provider_parameters, dataset_provider_parameters and train_parameters.

In the former case, the code would default to defining a Katib experiment with a batch Job in the trial specification. In the latter case, the code would define a Katib experiment with a PyTorchJob in the trial specification. This PyTorchJob would define an init container and an app container for the master, and reuse the same app container for the workers.
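The PyTorchJob trial template described above can be sketched as a plain Python dict builder. This is an illustrative sketch only: the container and image names are hypothetical placeholders, not the actual Katib or training-operator implementation.

```python
# Hypothetical sketch of the PyTorchJob trial template: the storage
# initializer runs as an init container on the master, and the same trainer
# (app) container is used for both the master and the workers.
# Container/image names below are illustrative assumptions.

def pytorchjob_trial_spec(num_workers: int) -> dict:
    """Build a minimal PyTorchJob-style spec for a Katib trial."""
    app_container = {
        "name": "pytorch",
        "image": "docker.io/example/trainer:latest",  # hypothetical image
    }
    init_container = {
        "name": "storage-initializer",
        "image": "docker.io/example/storage-initializer:latest",  # hypothetical
    }
    master = {
        "replicas": 1,
        "template": {"spec": {
            "initContainers": [init_container],  # downloads model/dataset
            "containers": [app_container],       # runs fine-tuning
        }},
    }
    worker = {
        "replicas": num_workers,
        "template": {"spec": {"containers": [app_container]}},  # same app container
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "spec": {"pytorchReplicaSpecs": {"Master": master, "Worker": worker}},
    }
```

The key design point is that the trial only varies in the hyperparameters substituted into the trainer container's arguments; the init container and replica layout stay fixed across trials.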


tariq-hasan commented on June 11, 2024

I presume that the initiative here is motivated by the recent trend in the ML space to fine-tune pre-trained models (LLMs or otherwise) using custom datasets instead of training bare models from scratch.

This requires enriching the interface provided to users for training and hyperparameter tuning.

Training (training-operator):

The train function takes the following arguments and is essentially an abstraction over the create_job function that enables model fine-tuning.

  • System parameters: Number of workers and number of resources (GPUs) per worker
  • Model parameters: Model provider and repository details
  • Dataset parameters: Dataset provider and dataset details
  • Training parameters: Training-specific parameters, e.g. learning rate
trainingClient.train(
    num_workers=1,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": "2", "cpu": 8, "memory": "16Gi"},
    model_provider_parameters=HuggingFaceModelParams(model="hf://openchat/openchat_3.5", access_token="hf_..."),
    dataset_provider_parameters=S3DatasetParams(dataset="s3://doc-example-bucket1/train_dataset", eval_dataset="s3://doc-example-bucket1/eval_dataset", access_token="s3 access token", region="us-west-2"),
    train_parameters=HuggingFaceTrainParams(learning_rate=0.1, transformerClass="Trainer", peft_config={}),
)

Hyperparameter tuning (Katib):

Taking inspiration from the design in training-operator, I would think that the higher-level interface in Katib would be an abstraction over the tune function: it would still allow users to specify parameters such as hyperparameters, algorithm name, and evaluation metric, but the objective function would be replaced by a model provider and a dataset provider.

katib_client.tune(
    name=exp_name,
    objective=train_mnist_model, # Objective function.
    parameters=parameters, # HyperParameters to tune.
    algorithm_name="cmaes", # Algorithm to use.
    objective_metric_name="accuracy", # Katib is going to optimize "accuracy".
    additional_metric_names=["loss"], # Katib is going to collect these metrics in addition to the objective metric.
    max_trial_count=12, # Trial Threshold.
    parallel_trial_count=2,
)

I presume then that the only difference from this example implementation would be that train_mnist_model is replaced with a model provider and a dataset provider, which together form the basis for the hyperparameter tuning.
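The argument handling this implies can be sketched as a small dispatch helper: the tune API accepts either an objective function plus parameters, or the three provider-parameter objects, and picks the trial kind accordingly. The function and argument names below mirror this discussion but are assumptions, not the actual Katib SDK signature.

```python
# Hypothetical sketch of how a modified tune API could decide which trial
# specification to generate. Argument names follow the proposal in this
# thread (objective/parameters vs. model/dataset/train provider parameters)
# and are not the real Katib SDK signature.

def resolve_trial_kind(objective=None, parameters=None,
                       model_provider_parameters=None,
                       dataset_provider_parameters=None,
                       train_parameters=None):
    """Return the trial spec kind a tune() call would produce."""
    has_objective = objective is not None and parameters is not None
    has_providers = all(p is not None for p in (model_provider_parameters,
                                                dataset_provider_parameters,
                                                train_parameters))
    # Exactly one of the two argument groups must be supplied.
    if has_objective == has_providers:
        raise ValueError("Provide either objective+parameters or "
                         "model/dataset/train provider parameters, not both.")
    # Objective-function path -> batch Job trial; provider path -> PyTorchJob trial.
    return "Job" if has_objective else "PyTorchJob"
```

With this split, the existing objective-function workflow keeps working unchanged, while the LLM fine-tuning workflow routes to the PyTorchJob-based trial template.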


andreyvelich commented on June 11, 2024

/assign @helenxie-bit


google-oss-prow commented on June 11, 2024

@andreyvelich: GitHub didn't allow me to assign the following users: helenxie-bit.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @helenxie-bit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

