
🌌 Pluto: Generate Synthetic Data for LLM Fine-Tuning 🌌



๐ŸŒ Website ย ย โ€ขย ย  ๐Ÿ’ฌ Discord

Welcome 💜

Welcome! We're the team behind Haven, a platform for fine-tuning LLMs. We realized that many of our users lack datasets for fine-tuning LLMs, which is why we built Pluto, a library for synthetic data generation with LLMs. Here's what you can do with it:

  • Overcome repetitiveness and make your data highly diverse using topic trees
  • Run multiple sampling requests in parallel to speed up data generation
  • Use any model provider to generate data

Quickstart 🚀

To get started, let's use GPT-4 to generate a dataset of coding questions about numpy. First install the pluto library:

pip install pluto-data

Make sure that you've set your OpenAI API Key as an environment variable:

export OPENAI_API_KEY=<your-key>

Then run the following code:

from pluto import EngineArguments, DataEngine, Dataset, TopicTree, TopicTreeArguments

system_prompt = "You are a helpful AI coding assistant. You do not just give high level coding advice, but instead, you respond to coding questions with specific code examples."

tree = TopicTree(
    args=TopicTreeArguments(
        root_prompt="Functionalities of numpy",
        model_system_prompt=system_prompt,
        tree_degree=10,
        tree_depth=2
    )
)

tree.build_tree(model_name="gpt-3.5-turbo-1106")
tree.save("numpy_topictree.jsonl")

engine = DataEngine(
    args=EngineArguments(
        instructions="Please specifically provide training examples with questions about numpy. A training sample should consist of just one question and a response, and not a chat with multiple messages.",
        system_prompt=system_prompt,
        # example_data=Dataset.from_jsonl("example_data.jsonl")  # OPTIONAL: uncomment this argument to provide examples for the model generating training data
    )
)

dataset = engine.create_data(
    model_name="gpt-4-1106-preview",
    num_steps=20,
    batch_size=5,
    topic_tree=tree
)

dataset.save("output_with_topictree.jsonl")

What happened in this example? 🤔

In the example above, we did the following things:

Generate Topic Tree: We first used GPT-3.5 to generate a "topic tree" with the root "Functionalities of numpy". A topic tree is a tree in which every child node is a subtopic of its parent; it lets us enumerate the aspects that our training dataset should cover. This is what root-to-leaf paths within a topic tree look like (you can also find a full file here; a quick way to print these paths from the saved file is sketched right after the examples):

Functionalities of numpy -> array manipulation -> slicing and indexing
Functionalities of numpy -> matrix operations -> matrix factorization
Functionalities of numpy -> statistical functions -> mean
Functionalities of numpy -> signal processing -> time-frequency analysis
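If you want to inspect the generated tree yourself, you can read the saved file directly. The snippet below is a minimal sketch that assumes each line of numpy_topictree.jsonl is a JSON object with a "path" field holding the root-to-leaf topic list; check the saved file for the exact schema.

import json

# Minimal sketch for printing root-to-leaf paths from the saved topic tree.
# Assumption: each JSONL line is an object with a "path" list; verify against your file.
with open("numpy_topictree.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(" -> ".join(record["path"]))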

Generate Data from Topic Tree: After generating our topic tree, we feed it into the create_data function of the DataEngine to ensure that our dataset touches upon a broad range of subjects and is not repetitive. Concretely, this function iterates over all root-to-leaf paths in our topic tree and tells GPT-4 Turbo, which we use to generate our training data, to take the corresponding (sub)topic into account in its generated training sample. The parameter batch_size=5 controls how many OpenAI requests we send simultaneously.
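Pluto handles this batching internally; the snippet below is only an illustrative sketch of the general pattern (sending batch_size requests concurrently with asyncio), not Pluto's actual implementation.

import asyncio

# Illustrative sketch of batched, concurrent sampling (not Pluto's internal code).
async def generate_sample(topic_path):
    # Placeholder: in practice this would call the model provider's API
    # with the (sub)topic injected into the prompt.
    await asyncio.sleep(0)
    return f"training sample about: {' -> '.join(topic_path)}"

async def generate_in_batches(topic_paths, batch_size=5):
    results = []
    for i in range(0, len(topic_paths), batch_size):
        batch = topic_paths[i : i + batch_size]
        # Fire off batch_size requests at once and wait for all of them.
        results.extend(await asyncio.gather(*(generate_sample(p) for p in batch)))
    return results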

You can also give the DataEngine examples of what your dataset should look like. To do this, simply add example_data=Dataset.from_jsonl('your_data.jsonl') as an argument to DataEngine. Three or four samples are sufficient and help a lot, as shown in the sketch below.
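For example, reusing the engine from the quickstart (example_data.jsonl is a hypothetical file in the same messages format shown below):

# Optional: seed the DataEngine with a few hand-written examples.
engine = DataEngine(
    args=EngineArguments(
        instructions="Please specifically provide training examples with questions about numpy.",
        system_prompt=system_prompt,
        example_data=Dataset.from_jsonl("example_data.jsonl")
    )
)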


Fine-Tune LLMs with your generated Datasets ⚙️

Datasets generated with pluto are saved in JSONL format:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

You can directly use these dataset files to fine-tune models with Haven (docs) or OpenAI (docs). As an open source alternative, we recommend taking a look at the training code provided by fastchat.
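If you go the OpenAI route, a rough sketch with the openai Python SDK looks like this (base model names and fine-tuning availability change over time, so check the OpenAI docs for current values):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the generated dataset and start a fine-tuning job.
# "gpt-3.5-turbo" is just an example base model that supports fine-tuning.
training_file = client.files.create(
    file=open("output_with_topictree.jsonl", "rb"),
    purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo"
)
print(job.id)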


Telemetry

We use Posthog to collect anonymous data about how people use Pluto. Concretely, we log whenever a data or topic tree creation job starts and ends. We do not collect any contents of your datasets.

You can simply disable telemetry by setting the environment variable ANONYMIZED_TELEMETRY to False:

export ANONYMIZED_TELEMETRY=False
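If you prefer to do this from Python, setting the variable before importing pluto should have the same effect (this assumes the flag is read at import time):

import os

# Assumption: the telemetry flag is read when pluto is imported,
# so set it before the import.
os.environ["ANONYMIZED_TELEMETRY"] = "False"

from pluto import DataEngine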
