mutate's Introduction

🦠 Mutate

A library to synthesize text datasets using Large Language Models (LLMs). Mutate reads through the examples in a dataset and generates similar examples using automatically generated few-shot prompts.

1. Installation

pip install mutate-nlp

or

pip install git+https://github.com/infinitylogesh/mutate

2. Usage

2.1 Synthesize text data from local csv files

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)  # GPU index (0 is the first GPU); -1 runs on CPU

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"


# Returns a Python generator
text_synth_gen = pipe("csv",
                    data_files=["local/path/sentiment_classification.csv"],
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    text_column_alias="Comment",
                    label_column_alias="sentiment",
                    shot_count=5,
                    class_names=["pos","neg"])

# Loop through the generator to synthesize examples by class
for synthesized_examples in text_synth_gen:
    print(synthesized_examples)
Output:
{
    "text": ["The story was very dull and was a waste of my time. This was not a film I would ever watch. The acting was bad. I was bored. There were no surprises. They showed one dinosaur,",
    "I did not like this film. It was a slow and boring film, it didn't seem to have any plot, there was nothing to it. The only good part was the ending, I just felt that the film should have ended more abruptly."],
    "label":["neg","neg"]
}

{
    "text":["The Bell witch is one of the most interesting, yet disturbing films of recent years. It’s an odd and unique look at a very real, but very dark issue. With its mixture of horror, fantasy and fantasy adventure, this film is as much a horror film as a fantasy film. And it‘s worth your time. While the movie has its flaws, it is worth watching and if you are a fan of a good fantasy or horror story, you will not be disappointed."],
    "label":["pos"]
}

# and so on .....
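
To make the idea of automatically generated few-shot prompts more concrete, here is a minimal sketch of how such a prompt could be assembled from arguments like `task_desc`, `text_column_alias`, `label_column_alias`, and `shot_count`. The template, the `build_few_shot_prompt` function, and the sample reviews are illustrative assumptions; Mutate's actual prompt format is not shown in this README and may differ.

```python
import random


def build_few_shot_prompt(task_desc, examples, text_alias, label_alias,
                          target_label, shot_count=5, seed=0):
    """Assemble a few-shot prompt for one class (hypothetical template).

    `examples` is a list of (text, label) pairs. Up to `shot_count` examples
    of `target_label` are sampled, and the prompt ends with an open text slot
    that a causal LLM would complete with a new example.
    """
    pool = [text for text, label in examples if label == target_label]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    shots = rng.sample(pool, min(shot_count, len(pool)))

    lines = [task_desc, ""]
    for text in shots:
        lines.append(f"{label_alias}: {target_label}")
        lines.append(f"{text_alias}: {text}")
        lines.append("")
    # Leave the final text slot empty for the model to fill in.
    lines.append(f"{label_alias}: {target_label}")
    lines.append(f"{text_alias}:")
    return "\n".join(lines)


# Made-up sample data, only for illustration.
examples = [
    ("A dull, predictable plot.", "neg"),
    ("Wonderful acting and a moving story.", "pos"),
    ("I was bored from start to finish.", "neg"),
]
prompt = build_few_shot_prompt(
    task_desc="Each item in the following contains movie reviews and corresponding sentiments.",
    examples=examples,
    text_alias="Comment",
    label_alias="sentiment",
    target_label="neg",
    shot_count=2,
)
print(prompt)
```

Conditioning on the label first and leaving the text slot open is one simple way to steer a GPT-like model toward a new example of a chosen class.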

2.2 Synthesize text data from 🤗 datasets

Under the hood, Mutate uses the wonderful 🤗 datasets library for dataset processing, so it supports 🤗 datasets out of the box.

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)  # GPU index (0 is the first GPU); -1 runs on CPU

task_desc = "Each item in the following contains customer service queries expressing the mentioned intent"

synthesizerGen = pipe("banking77",
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    # Alias to use if the `text_column` name isn't meaningful
                    text_column_alias="Queries",
                    # Alias to use if the `label_column` name isn't meaningful
                    label_column_alias="Intent",
                    shot_count=5,
                    dataset_args=["en"])


for exp in synthesizerGen:
    print(exp)
Output:
{"text":["How can i know if my account has been activated? (This is the one that I am confused about)",
         "Thanks! My card activated"],
"label":["activate_my_card",
         "activate_my_card"]
}

{
"text": ["How do i activate this new one? Is it possible?",
         "what is the activation process for this card?"],
"label":["activate_my_card",
         "activate_my_card"]
}

# and so on .....
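
Since the generator yields batches rather than single rows, it can help to flatten them before saving. The sketch below assumes each yielded item is a dict shaped like the output above (`"text"` and `"label"` lists of equal length); `batches_to_rows` and `write_rows_csv` are hypothetical helpers, not part of Mutate.

```python
import csv
import io


def batches_to_rows(batches):
    """Flatten batches shaped like {"text": [...], "label": [...]} into (text, label) rows."""
    for batch in batches:
        # zip pairs each generated text with the label at the same position.
        yield from zip(batch["text"], batch["label"])


def write_rows_csv(rows, handle, text_header="text", label_header="label"):
    """Write (text, label) rows to an open text handle and return the row count."""
    writer = csv.writer(handle)
    writer.writerow([text_header, label_header])
    count = 0
    for text, label in rows:
        writer.writerow([text, label])
        count += 1
    return count


# Stand-in for the generator returned by `pipe(...)`, reusing the output above.
sample_batches = [
    {"text": ["How can i know if my account has been activated? (This is the one that I am confused about)",
              "Thanks! My card activated"],
     "label": ["activate_my_card", "activate_my_card"]},
    {"text": ["How do i activate this new one? Is it possible?",
              "what is the activation process for this card?"],
     "label": ["activate_my_card", "activate_my_card"]},
]

# An in-memory buffer keeps the sketch side-effect free; a real run could
# pass open("synthetic.csv", "w", newline="") instead.
buffer = io.StringIO()
written = write_rows_csv(batches_to_rows(sample_batches), buffer)
print(written)
```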

2.3 I am feeling lucky: Loop through the dataset infinitely to generate examples indefinitely

Caution: Looping through the dataset infinitely increases the chance of generating duplicate examples.

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)  # GPU index (0 is the first GPU); -1 runs on CPU

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"


# Returns a Python generator
text_synth_gen = pipe("csv",
                    data_files=["local/path/sentiment_classification.csv"],
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    text_column_alias="Comment",
                    label_column_alias="sentiment",
                    class_names=["pos","neg"],
                    # Flag to generate examples indefinitely
                    infinite_loop=True)

# Infinite loop
for exp in text_synth_gen:
    print(exp)
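
Given the caution about duplicates, one option is to cap the loop and drop repeated examples on the consumer side. The following is a minimal sketch that assumes each yielded item is a dict with `"text"` and `"label"` lists, as in the outputs shown earlier; `unique_examples` and its `max_batches` guard are hypothetical and not part of Mutate.

```python
from itertools import cycle, islice


def unique_examples(batches, limit, max_batches=1000):
    """Return an iterator over at most `limit` distinct (text, label) pairs.

    At most `max_batches` batches are read, so the loop ends even if the
    stream is infinite and only produces duplicates.
    """
    def pairs():
        seen = set()
        for batch in islice(batches, max_batches):
            for text, label in zip(batch["text"], batch["label"]):
                # Normalize whitespace and case before comparing.
                key = (" ".join(text.split()).lower(), label)
                if key not in seen:
                    seen.add(key)
                    yield text, label

    # islice stops pulling from the stream once `limit` items were produced.
    return islice(pairs(), limit)


# Stand-in for an infinite generator; the data is made up.
fake_stream = cycle([
    {"text": ["Great film", "great  film"], "label": ["pos", "pos"]},
    {"text": ["Too slow"], "label": ["neg"]},
])
result = list(unique_examples(fake_stream, limit=5, max_batches=10))
print(result)
```

Exact matching after normalization only catches trivial repeats; near-duplicates would need a fuzzier comparison.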

3. Support

3.1 Currently supports

  • Text classification dataset synthesis: few-shot text data synthesis for text classification datasets using causal LLMs (GPT-like)

3.2 Roadmap:

  • Other types of text dataset synthesis: NER, sentence pairs, etc.
  • Finetuning support for better quality generation
  • Pseudo labelling

4. Credit

5. References

The idea of generating examples from large language models is inspired by the works below:

mutate's People

Contributors

davidberenstein1957, infinitylogesh

mutate's Issues

required setup?

hi i'm very interested in trying this out but i'm running into errors in the environment and wondering if the setup is the issue.

running on paperspace. here is output of nvidia-smi

Thu Jan 19 19:35:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:00:05.0 Off |                  Off |
| 33%   27C    P8    14W / 230W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

and below is the error:


RuntimeError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pipe = pipeline("text-classification-synthesis",
2 model="EleutherAI/gpt-neo-2.7B",
3 device=1)

File /usr/local/lib/python3.9/dist-packages/mutate/__init__.py:107, in pipeline(task, model, device, generation_kwargs, **kwargs)
101 else:
102 raise ValueError(
103 f"Task - {task} is not supported. Supported tasks -"
104 f" {SUPPORTED_TASKS.keys()}"
105 )
--> 107 return pipeline_class(model, device, **generation_kwargs)

File /usr/local/lib/python3.9/dist-packages/mutate/pipelines/text_classification.py:69, in TextClassificationSynthesize.__init__(self, model, device, **generate_kwargs)
24 def __init__(
25 self,
26 model: Union[str, PreTrainedModel],
27 device: Optional[int] = -1,
28 **generate_kwargs
29 ):
30 """
31 Pipeline to synthesize Text classification examples from a given dataset.
32
(...)
67
68 """
---> 69 self.infer = TextGeneration(model_name=model, device=device)
70 self._collate_fn = partial(
71 TextClassSynthesizePromptDataset._collate_fn,
72 self.infer.tokenizer,
73 self.infer.device,
74 )
75 self.generate_kwargs = (
76 self.generate_kwargs if not generate_kwargs else generate_kwargs
77 )

File /usr/local/lib/python3.9/dist-packages/mutate/infer.py:41, in TextGeneration.__init__(self, model_name, device)
39 self.tokenizer.pad_token = self.tokenizer.eos_token
40 self.device = torch.device(f"cuda:{device}" if device>=0 else "cpu")
---> 41 self.model.to(self.device)
42 self.model.eval()

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:927, in Module.to(self, *args, **kwargs)
923 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
924 non_blocking, memory_format=convert_to_format)
925 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
--> 927 return self._apply(convert)

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:579, in Module._apply(self, fn)
577 def _apply(self, fn):
578 for module in self.children():
--> 579 module._apply(fn)
581 def compute_should_use_set_data(tensor, tensor_applied):
582 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
583 # If the new tensor has compatible tensor type as the existing tensor,
584 # the current behavior is to change the tensor in-place using .data =,
(...)
589 # global flag to let the user control whether they want the future
590 # behavior of overwriting the existing tensor or not.

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:579, in Module._apply(self, fn)
577 def _apply(self, fn):
578 for module in self.children():
--> 579 module._apply(fn)
581 def compute_should_use_set_data(tensor, tensor_applied):
582 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
583 # If the new tensor has compatible tensor type as the existing tensor,
584 # the current behavior is to change the tensor in-place using .data =,
(...)
589 # global flag to let the user control whether they want the future
590 # behavior of overwriting the existing tensor or not.

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:602, in Module._apply(self, fn)
598 # Tensors stored in modules are graph leaves, and we don't want to
599 # track autograd history of param_applied, so we have to use
600 # with torch.no_grad():
601 with torch.no_grad():
--> 602 param_applied = fn(param)
603 should_use_set_data = compute_should_use_set_data(param, param_applied)
604 if should_use_set_data:

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:925, in Module.to.<locals>.convert(t)
922 if convert_to_format is not None and t.dim() in (4, 5):
923 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
924 non_blocking, memory_format=convert_to_format)
--> 925 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
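
Reading the two outputs together suggests a likely cause: `nvidia-smi` lists a single GPU with index 0, while the traceback shows `device=1` being turned into `cuda:1`, which does not exist on this machine. Passing `device=0` (or `-1` for CPU, the default in the signature above) should avoid the error. Below is a small sketch of a guard; `resolve_device_index` is a hypothetical helper, not part of Mutate, and in practice the GPU count would come from `torch.cuda.device_count()`.

```python
def resolve_device_index(requested, gpu_count):
    """Return a device index that is valid for `gpu_count` visible GPUs.

    Follows the convention visible in the traceback: an index >= 0 selects
    `cuda:<index>` and a negative index selects the CPU.
    """
    if requested < 0 or gpu_count == 0:
        return -1  # run on CPU
    if requested < gpu_count:
        return requested
    return 0  # requested GPU does not exist; fall back to the first one


# In practice: gpu_count = torch.cuda.device_count()
device = resolve_device_index(1, gpu_count=1)
print(device)
```

The result could then be passed as `device=device` to `pipeline(...)`.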
