Giter Site home page Giter Site logo

txt_to_dataset's Introduction

text to instruction dataset

A collection of scripts used to generate the dataset for the WW-Storytelling-70B-LoRA.

Supports:

  • Breaking .txt files into chunks based on context length.
  • Using a local instance of Oobabooga (or anything that supports an OpenAI-style API) to generate prompts and other metadata.
  • Outputting a final .json file for training with Oobabooga.
  • Various tools for analyzing the dataset (count common phrases, randomize names, batch generate responses from the final model).

Setup

  1. Clone this repo.
  2. Install requirements with pip install -r requirements.txt.
  3. Edit settings.toml (or make a copy named user.toml).
    • The defaults should work out of the box, unless:
      • You're not using Oobabooga to generate prompts or it isn't being hosted on the same machine.
      • You're not working on a storytelling dataset.
        • This should still be doable, but you'll have to live with the keys being named "story", "context", "prompt".
        • See prompt_gen.prompt_file_path in the settings.toml for how to modify the prompt generation request.
  4. (If generating prompts) Launch Oobabooga's Text-Generation-Webui with the API extension enabled and load a model.

Usage

  1. Build a dataset in the form of a folder of .txt files.
  2. Turn your .txt files into "chunks" that you want to train the model on.
python -m extractors.book_to_chunks --input_folder IN_FOLDER 
    --output_folder OUT_FOLDER --max_tokens CHUNK_SIZE
# The CHUNK_SIZE should be the CUTOFF_LENGTH that you specify during training, minus your
# expected 'instruction' size (~300 tokens if using this repo's defaults).
  1. Use a local model to generate prompts (see Setup for details).
python process_prompts.py --input_folder IN_FOLDER 
    --output_folder OUT_FOLDER --mode generate_prompts
  1. Generate the .json file to use with your training scripts.
python finalize_dataset.py --input_folder IN_FOLDER1 IN_FOLDER2
    --output_folder OUT_FOLDER
# You can specify multiple input folders after the --input_folder flag
  1. Use the .json dataset files in the output folder to train in Oobabooga. A compatible format file is provided in the project root directory.

Tools

  • The COUNT_PHRASES mode of process_prompts.py counts repeated phrases across your prompt json folders. Comparing this across two kinds of input (e.g. literature vs fanfiction) can be useful for finding phrases biases in the dataset.
  • The RANDOMIZE_NAMES mode of process_prompts.py randomizes names in prompt json folders. Avoids name biases.
  • python -m tools.inject_hardcoded_keys can be used to batch edit prompt jsons (e.g. add the genre, year of publication, etc.)
  • python -m tools.prompt_tester can be used to generate sample outputs. Uses a template + list of values to cycle through.
  • python -m tools.merge_prompts can be used to combine prompts from two different folders. This is mostly for doing partial reverts on the prompt json folders.

Credits

This repo contains lists of male and female names sourced from:

txt_to_dataset's People

Contributors

alac avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.