
kevinscaria / targen


Targeted Data Generation with Large Language Models

Home Page: https://arxiv.org/abs/2310.17876

License: MIT License

Languages: Python 8.14%, Shell 0.15%, Jupyter Notebook 91.71%
Topics: alignment, chatgpt, datagen, datageneration, llm, model-alignment, nlp, nlp-machine-learning, synthetic-dataset-generation, large-language-models

targen's Introduction

💥 What's New?

  • Added self-correction support to BaseExperiment.
  • Modularized the repository as a package for quick replication. This is currently available for SyntheticCopa; experiment objects for the other Synthetic SuperGLUE tasks will be added shortly.

TarGEN: Targeted Data Generation with Large Language Models

This is the official repository of the paper: TarGEN: Targeted Data Generation with Large Language Models

How To?

- Step 1: Import packages and add your API keys to a .env file, which should be created in <ROOT_DIRECTORY>:

The main.py script now also lets you control the model object directly, without touching internal classes. Any LangChain-supported API can be used for experimentation.

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

from TarGEN import Generate
from experiments.copa import SyntheticCopa, SyntheticCb

# Point load_dotenv at the .env file; replace <ROOT_DIRECTORY> with your project root
load_dotenv("<ROOT_DIRECTORY>/.env")
API_KEY = os.getenv("OPEN_AI_KEY")
TARGET_DATA_STYLE = "COPA"

# Load Model
openai_llm = ChatOpenAI(openai_api_key=API_KEY)
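
Because os.getenv returns None when a variable is missing, a typo in the .env file tends to surface only later as an authentication error. A minimal sketch of a fail-fast check follows; require_env is a hypothetical helper for illustration and is not part of the TarGEN package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    Hypothetical helper for illustration; it is not part of the TarGEN package.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to the .env file in the project root."
        )
    return value
```

With this in place, the key lookup in Step 1 could be written as API_KEY = require_env("OPEN_AI_KEY") after the load_dotenv call.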

- Step 2: In the experiments directory, add the configs for all the steps:

Important

Check out the sample SyntheticCopa class, which extends the BaseExperiment class as a systematic way to enforce the required methods.
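
One common way to enforce required methods in Python is an abstract base class. The sketch below is not the repository's BaseExperiment: the class names ending in Sketch and the shape of the returned config are assumptions made for illustration. Only the method names get_config and generator_function come from this README.

```python
from abc import ABC, abstractmethod


class BaseExperimentSketch(ABC):
    """Illustrative stand-in for a base experiment class (not TarGEN's own)."""

    def __init__(self, model):
        self.model = model

    @abstractmethod
    def get_config(self) -> dict:
        """Subclasses must describe their generation steps here."""

    def generator_function(self, prompt: str) -> str:
        """Default generation logic; subclasses may override it."""
        return self.model(prompt)


class CopaSketch(BaseExperimentSketch):
    def get_config(self) -> dict:
        # Assumed config shape, for illustration only.
        return {"task": "COPA", "steps": ["premise", "choices", "label"]}


class IncompleteSketch(BaseExperimentSketch):
    """Forgets to implement get_config, so it cannot be instantiated."""


# A plain callable stands in for the LLM so the sketch runs offline.
experiment = CopaSketch(model=lambda prompt: prompt.upper())
print(experiment.get_config()["task"])
print(experiment.generator_function("hello"))

try:
    IncompleteSketch(model=None)
except TypeError as err:
    print(f"Rejected: {err}")
```

The benefit of this pattern is that a missing method is reported when the experiment object is created, rather than partway through a generation run.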

  • For added flexibility, this file can also override generator_function if the generation logic needs to change.
  • This file must define the pydantic class for a sample instance, with every field and its description stated explicitly. A few global variables, such as DOMAIN, N, and SENTENCE, are available to the generator at runtime. These global variables are currently frozen, so changing them requires editing the BaseExperiment class. In the next update, they will be exposed as local runtime variables that the prompt engineer can configure as needed.
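
A pydantic class for a COPA-style sample could look like the sketch below. The class name CopaSampleSketch and the field names are assumptions modeled on the public COPA format (a premise, two alternatives, a question type, and a label); the schema actually used by SyntheticCopa may differ.

```python
from pydantic import BaseModel, Field, ValidationError


class CopaSampleSketch(BaseModel):
    """Hypothetical schema for one COPA-style instance."""

    premise: str = Field(description="The situation the question is about")
    choice1: str = Field(description="First candidate cause or effect")
    choice2: str = Field(description="Second candidate cause or effect")
    question: str = Field(description="Either 'cause' or 'effect'")
    label: int = Field(description="0 if choice1 is correct, 1 if choice2 is", ge=0, le=1)


sample = CopaSampleSketch(
    premise="The ground was wet in the morning.",
    choice1="It rained overnight.",
    choice2="The sun was shining.",
    question="cause",
    label=0,
)
print(sample.premise, sample.label)

# Out-of-range labels are rejected at construction time.
try:
    CopaSampleSketch(premise="p", choice1="a", choice2="b", question="cause", label=2)
except ValidationError:
    print("invalid label rejected")
```

Writing a description for every field matters most when the schema is handed to an output parser, since the descriptions are then typically included in the format instructions shown to the model.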

- Step 3: Load experiment object in main.py

Once the experiment object has been designed as described in Step 2, it must be loaded at runtime.

# Load orchestrator
if TARGET_DATA_STYLE == "COPA":
    experiment_object = SyntheticCopa(model=openai_llm)
else:
    experiment_object = BaseExperiment(llm=openai_llm)
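
As more experiment classes are added, the if/else chain can be replaced by a lookup table. The sketch below uses hypothetical stand-in classes so that it runs without the TarGEN package. The constructor keyword model mirrors the SyntheticCopa call above and is an assumption for any other class.

```python
class CopaStandIn:
    """Placeholder for SyntheticCopa so the sketch is self-contained."""

    def __init__(self, model):
        self.model = model


class DefaultStandIn:
    """Placeholder for a fallback experiment class."""

    def __init__(self, model):
        self.model = model


# Map each supported data style to the class that handles it.
EXPERIMENT_REGISTRY = {"COPA": CopaStandIn}


def build_experiment(style: str, model, default=DefaultStandIn):
    """Instantiate the experiment registered for `style`, or the default."""
    experiment_cls = EXPERIMENT_REGISTRY.get(style.upper(), default)
    return experiment_cls(model=model)


experiment_object = build_experiment("copa", model="fake-llm")
print(type(experiment_object).__name__)
```

Supporting a new data style then means adding one entry to the dictionary instead of another branch.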

- Step 4: The generator only requires the experiment object. The create_synthetic_data() method will orchestrate the generation of samples based on the get_config() method defined in the experiment-specific class:

targen = Generate(experiment_object=experiment_object)
targen.create_synthetic_data(output_path="outputs/copa_sample.json",
                             n_samples=8,
                             overwrite=True,
                             num_instance_seeds=1
                             )
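
The division of labor described in Step 4 can be pictured with the sketch below. It is not TarGEN's implementation: GenerateSketch, FakeExperiment, the config keys, and the meaning given to overwrite are all assumptions, and num_instance_seeds is omitted because its semantics are not documented here. A fake experiment stands in for the LLM so the example runs offline.

```python
import json
import os
import tempfile


class FakeExperiment:
    """Stand-in experiment that needs no LLM (illustration only)."""

    def get_config(self) -> dict:
        return {"task": "COPA"}

    def generator_function(self, index: int) -> dict:
        return {"id": index, "premise": f"Synthetic premise {index}"}


class GenerateSketch:
    """Toy orchestrator: reads the config, generates samples, writes JSON."""

    def __init__(self, experiment_object):
        self.experiment = experiment_object

    def create_synthetic_data(self, output_path: str, n_samples: int, overwrite: bool = False):
        # Assumed semantics: refuse to clobber an existing file unless asked to.
        if os.path.exists(output_path) and not overwrite:
            raise FileExistsError(f"{output_path} exists; pass overwrite=True")
        config = self.experiment.get_config()
        samples = [self.experiment.generator_function(i) for i in range(n_samples)]
        os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
        with open(output_path, "w", encoding="utf-8") as handle:
            json.dump({"config": config, "samples": samples}, handle, indent=2)
        return samples


output_path = os.path.join(tempfile.mkdtemp(), "outputs", "copa_sample.json")
samples = GenerateSketch(FakeExperiment()).create_synthetic_data(output_path, n_samples=8)
print(len(samples))
```

Keeping the orchestrator unaware of task details is what allows a new experiment class to be plugged in without changing the generation loop.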

If you find our work useful, please cite the paper:

@article{gupta2023targen,
  title={TarGEN: Targeted Data Generation with Large Language Models},
  author={Gupta, Himanshu and Scaria, Kevin and Anantheswaran, Ujjwala and Verma, Shreyas and Parmar, Mihir and Sawant, Saurabh Arjun and Mishra, Swaroop and Baral, Chitta},
  journal={arXiv preprint arXiv:2310.17876},
  year={2023}
}

targen's People

Contributors

him1411, kevinscaria


targen's Issues

Missing LICENCE

Hi Kevin, since the repo does not include a license, I wondered if the code is meant to be free to use.
Could you add a LICENCE to the repo? Would be much appreciated. @kevinscaria
