
madaan / self-refine


LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.

Home Page: https://selfrefine.info

License: Apache License 2.0

Python 74.96% Jupyter Notebook 25.04%
few-shot-learning language-generation large-language-models llms prompting reasoning chatgpt gpt-35 gpt-4 prompts

self-refine's People

Contributors

madaan, majumderb, prakharguptaz, shallinan1


self-refine's Issues

GSM8K performance difference issue

In the appendix, the original PAL with ChatGPT scores around 74%.

But the initial accuracy reported for Self-Refine is only 71%; I was expecting the initial accuracy to be the same. Why the difference?

Questions about initial generation

I want to use Self-Refine for reasoning tasks, such as open-book QA.
Regarding the few-shot examples for the initial generation: do the examples have to be bad examples?
If I have good examples, could I use them for the initial stage and hope that, through iterations, the output gets even better?
However, if I use already-good examples, wouldn't it be tough to come up with even better ones for the few-shot examples in the refine stage?
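
To make my question concrete, here is a minimal, hypothetical sketch (not taken from this repo) of what I mean by the two exemplar formats for an open-book QA setup: the initial-stage exemplars can show good answers directly, while the refine-stage exemplars pair an already-reasonable answer with feedback and a slightly improved answer, i.e. "good to better" rather than "bad to good".

# Hypothetical exemplar formats for open-book QA; the repo's actual prompt
# files define their own format, this is only an illustration.

INIT_EXEMPLAR = """Context: ... iron (Fe) has atomic number 26 ...
Question: Which element has the atomic number 26?
Answer: Iron (Fe)."""

REFINE_EXEMPLAR = """Question: Which element has the atomic number 26?
Initial answer: Iron.
Feedback: Correct, but cite the supporting passage and give the chemical symbol.
Refined answer: Iron (Fe); the context states that iron has atomic number 26."""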

Codex discontinued

Hello. For code-related tasks, do you plan to update your code to replace Codex with another model? Do you have a suggestion for an alternative model, and do you plan to push the updated code?

Licence Request

Could this repo get an MIT or Apache license to fully free up anyone to take and adapt work found here and innovate further?

Missing File in Sentiment Reversal

In the Yelp benchmark, the file for the task_measure is missing, i.e. the class SentimentTransferMeasurement can't be found. Could you upload this file? Thanks!

issues with hallucinations in tasks

Hello authors,

Similar to the previous post #6, when I run the tasks I also face severe hallucination issues when using GPT-3.5-turbo-0125.

The error message looks like this:

An error occurred: list index out of range. Traceback (most recent call last):
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\utils.py", line 39, in wrapper
return func(*args, **kwargs)
File "C:\Users\ITSupp\Downloads\codes\self-refine\src\gsm\run.py", line 45, in iterative_gsm
fb_and_maybe_soln = task_feedback(solution=solution)
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\gsm\feedback.py", line 52, in call
improved_soln = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
n_attempts===> 0

Take the GSM task for example. In the script self-refine\src\gsm\run.py,
I see that to avoid this type of error you add @retry_parse_fail_prone_cmd, a retry mechanism that allows multiple attempts at getting an answer in the expected format.
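
As an aside, here is a minimal sketch of what I imagine such a retry decorator looks like (an illustration only, assuming it simply re-invokes the wrapped call and gives up after a fixed number of failures; it is not the repo's exact implementation):

import traceback
from functools import wraps

def retry_parse_fail_prone_cmd(func, max_retries: int = 3):
    # Illustrative stand-in for the repo's decorator: re-run the wrapped call
    # when parsing the model output raises, and give up after max_retries.
    @wraps(func)
    def wrapper(*args, **kwargs):
        retries = max_retries
        while retries > 0:
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print(f"An error occurred: {e}. {traceback.format_exc()}. Left retries: {retries - 1}.")
                retries -= 1
        return None  # the caller must handle the case where every retry failed

    return wrapper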

OK, let's first execute run.py to get the output file for further evaluation; I use the default max_attempts=4.

  • With retry_parse_fail_prone_cmd added, only around 50% of the test samples (slow_programs_df) are collected into results, which is then written to outfile on disk.

  • Without retry_parse_fail_prone_cmd, the retention rate is even lower, around 30%.

However, the released output file data/tasks/gsm/gsm_outputs.jsonl has 1253 samples. Out of the 1319 input test cases (data/tasks/gsm/gsm.jsonl), that is a retention rate of 95%. May I ask why the retention rate is so high? Do you execute self-refine\src\gsm\run.py multiple times and merge the results?

Then, evaluating the outfile via src/gsm/gsm_selfref_eval.py, the accuracy is between 70% and 80% (I ran it several times).
However, most of the time the accuracy is the same across all attempts:

Accuracy at attempt 0 = 74.07%
Accuracy at attempt 1 = 74.07%
Accuracy at attempt 2 = 74.07%
Accuracy at attempt 3 = 74.07%
Accuracy at attempt 4 = 74.07%

Was this actually common in your original experiments as well?

Thanks.

IndexError in src/gsm/feedback.py

Hello! I'm running python -u src/gsm/run.py with "gpt-3.5-turbo" and get the following error. This happens because "def solution():" is not in entire_output in feedback.py.

  • Question 1: Did this error also happen with Codex? I'm wondering if it occurs because ChatGPT does not always follow the input exemplars' format perfectly.
  • Question 2: Even with these errors, src/gsm/run.py keeps running. Should I just ignore them? I'm hoping to obtain results that are close to or better than the GSM results in your paper.
1%|▊                                                                           | 8/1319 [03:04<8:42:28, 23.91s/it]

An error occurred: list index out of range. Traceback (most recent call last):
  File "/home/ubuntu/code/hideodeo/self-refine/src/utils.py", line 39, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/run.py", line 40, in iterative_gsm
    fb_and_maybe_soln = task_feedback(solution=solution)
  File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/feedback.py", line 42, in __call__
    solution = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
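
For reference, a defensive way to do that extraction, so that a malformed completion is skipped instead of raising IndexError, might look like the sketch below (an illustration only, not how feedback.py currently handles it):

def extract_solution(entire_output: str, marker: str = "def solution():"):
    # Return the code starting at `marker`, or None when the model output does
    # not follow the expected format (instead of raising IndexError).
    if marker not in entire_output:
        return None  # the caller can retry, log, or skip this sample
    return marker + entire_output.split(marker, 1)[1]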

CommonGen-Hard dataset

Will you release the CommonGen-Hard dataset soon? It looks like it has 20-30 concepts for each sentence. I'm very curious about this dataset.

Code optimization

Hello, I am really interested in your fine work and am trying to reproduce the results!

Could you share the prompts and examples for code optimization? I am having a hard time reproducing the code optimization results.
I also have a question regarding code optimization in your paper: specifically, how did you calculate the percentage of programs that were optimized? When reproducing the results, I noticed that some programs performed worse, while others actually showed improvement.
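
For context, here is one plausible (purely hypothetical) way to compute such a percentage: count a program as optimized only when the refined version is still correct and beats the input program's runtime by some speedup threshold. The threshold and field names below are my assumptions, not the paper's definition.

def percent_optimized(results, min_speedup: float = 1.1) -> float:
    # `results` is a list of dicts with keys 'correct' (bool), 't_input' and
    # 't_refined' (runtimes in seconds); the field names are hypothetical.
    optimized = sum(
        1 for r in results
        if r["correct"] and r["t_input"] / r["t_refined"] >= min_speedup
    )
    return 100.0 * optimized / len(results)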

Instructions on PIE Evaluation

Hi,

Thank you for your fantastic work!

It seems the instructions for conducting the PIE evaluation are missing. Could you provide instructions on how to use the pie_eval.py script? I'm particularly uncertain about how to obtain the .report file. Thanks!

reproduction for the paper

Hello.

May I ask how to run the full test sets to get the figures reported in the paper?
For example, for the acronym task I only see the unit tests, and I cannot find any code that runs https://github.com/madaan/self-refine/blob/main/src/acronym/run_mcts.py.

Also, where is the titles_file referenced in https://github.com/madaan/self-refine/blob/main/src/acronym/run.py#L102?

For the CommonGen task, it seems that running python -u src/commongen/run.py cmd stair bubble team dryer puppy aliens cat does not download any data.

PIE run.py

Hi All,

Thank you for your lovely work.

There is no run.py in the PIE folder.

Thank you

Trying to understand GSM code

Hello! First of all, this is a super nice paper.

I am trying to wrap my head around the concept of the paper. What I don't understand is this:
no matter what the LM outputs, the LM is prompted again with the same question and the generated code/text (code + comments) until the LM itself says "it is correct", with a maximum of max_attempts per question?
The paper reports improvements over 5 iterations, so if the model outputs "it is correct", is the same output used for the next iteration? I just want to make sure I understood this correctly.
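
For what it's worth, my reading of the loop is roughly the sketch below. This is a rough paraphrase under my own assumptions, not the exact code in src/gsm/run.py; in particular, the step that pulls the revised program out of the feedback is simplified.

def iterative_gsm_sketch(question: str, task_init, task_feedback, max_attempts: int = 4):
    # Stop refining once the feedback declares the solution correct; otherwise
    # the solution proposed in the feedback is carried into the next attempt.
    solution = task_init(question)
    history = [solution]
    for _ in range(max_attempts):
        fb_and_maybe_soln = task_feedback(solution=solution)
        if "it is correct" in fb_and_maybe_soln.lower():
            break  # the last solution is simply kept for the remaining attempts
        # Simplified: assume the feedback text embeds the revised program after the marker.
        solution = "def solution():" + fb_and_maybe_soln.split("def solution():", 1)[1]
        history.append(solution)
    return history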

Releasing the Yelp Dataset

Amazing work! We are interested in building on this work. Are there any plans to release the long-form sentiment reversal Yelp dataset used in this work? Thanks!

Codebase Completeness

Great work! We are currently trying to reproduce your results so that we can build on top of your insights. I see that you are still working on this codebase. Are there benchmarks that are already fully implemented in this repository and that you do not intend to develop further in the near future? When I run the different benchmarks I sometimes hit errors, and I am not sure whether the error is on my side or whether some functionality is simply missing.

For example, when I run the CommonGen benchmark on a reduced test set, I observe a lot of errors in the output file. The main source of error seems to be that the feedback from GPT does not have the intended structure, so exceptions occur in the code. I did not change any of the training/instruction prompts you provide in this file, and I use "gpt-3.5-turbo". Did you also observe this behaviour?

I also noticed two different things, which I thought I would notify you about:

  • Compared to the other benchmarks, you do not include the refinement history in the GSM-8K benchmark; you only use the training prompt, the instruction prompt, and the current iteration. Is this intended?
  • The run.py file is missing for the sentiment_reversal benchmark, and I can't find the training data for the Code Readability Improvement task.

Thanks a lot!

Pseudocode from https://selfrefine.info

On https://selfrefine.info/ I found the following pseudocode:


def self_refine(prompt: str) -> str:
    def is_refinement_sufficient(prompt, feedback, initial, refined) -> bool:
        # Define stopping criteria here
        pass

    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if is_refinement_sufficient(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined



It seemed a bit weird to me, so I had the following conversation with ChatGPT:

====================================================

User
There is one function enclosing it all, but I think it doesn't have all the required parameters. What do you think?

====================================================

ChatGPT
You're correct. The self_refine function is the outermost enclosing function, but it seems to be missing some necessary parameters for the nested functions (is_refinement_sufficient, feedback_prompt, and refiner_prompt) to work properly.

In its current state, the self_refine function only accepts a single parameter (prompt). However, the nested functions and the loop inside self_refine rely on additional parameters such as feedback_prompt and refiner_prompt, which are not provided as arguments to the self_refine function.

To fix this issue, the self_refine function should be modified to accept these additional parameters or they should be defined globally outside the function. Alternatively, the nested functions could be modified to access these parameters from the global scope.

====================================================

User
improve the code

====================================================

ChatGPT
To improve the code, we need to ensure that all necessary parameters are provided to the functions and that the stopping criteria function (is_refinement_sufficient) is properly implemented. Here's an improved version of the code:



def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str) -> str:
    def is_refinement_sufficient(prompt: str, feedback: str, initial: str, refined: str) -> bool:
        # Define stopping criteria here
        # For example, you could compare the similarity between refined and initial responses
        if len(refined) > len(initial):
            return True
        else:
            return False

    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if is_refinement_sufficient(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined

In this improved version:

  • The self_refine function now accepts two additional parameters, feedback_prompt and refiner_prompt, which are required for generating feedback and refining the answer, respectively.
  • The is_refinement_sufficient function takes four parameters (prompt, feedback, initial, and refined). I've also added a basic stopping criterion example that checks whether the length of the refined response is greater than the length of the initial response. You can adjust this criterion based on your specific requirements.
  • The function now returns the refined response.

This improved version should work provided that the ChatGPT function is properly implemented and accessible within the scope of this code.

====================================================

User
should the is_refinement_sufficient be defined outside to make the self_refine fully adaptable to multiple cases?

====================================================

ChatGPT
Defining the is_refinement_sufficient function outside of the self_refine function could indeed enhance adaptability and reusability, especially if you anticipate using the refinement logic in multiple contexts. By defining it outside, you can easily reuse it across different refinement scenarios without duplicating code.

Here's how you can refactor the code to define is_refinement_sufficient outside of self_refine:


def is_refinement_sufficient(prompt: str, feedback: str, initial: str, refined: str) -> bool:
    # Define stopping criteria here
    # For example, you could compare the similarity between refined and initial responses
    if len(refined) > len(initial):
        return True
    else:
        return False

def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str) -> str:
    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if is_refinement_sufficient(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined

With this refactoring:

  • The is_refinement_sufficient function is defined separately, making it reusable across different contexts.
  • The self_refine function now focuses only on the iterative refinement process, without the need to define the stopping criteria, making it more modular and adaptable.

This approach enhances code organization and facilitates maintenance and reuse. Now, if you need to adjust the stopping criteria or use it elsewhere, you can do so without modifying the self_refine function.

====================================================

User
But if we add a parameter to self_refine for passing in a function to act as the is_refinement_sufficient function, we would not be locked into the "name" of the equivalent is_refinement_sufficient function.

====================================================

ChatGPT
Absolutely, passing the refinement criterion function as a parameter to self_refine adds even more flexibility and decouples the refinement process from any specific stopping criteria function. Here's how you can modify the self_refine function to accept a custom refinement criterion function:


def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str, refinement_criteria_func) -> str:
    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if refinement_criteria_func(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined

Now, when calling self_refine, you can pass any function that implements your refinement criteria. For example:


def is_refinement_sufficient(prompt: str, feedback: str, initial: str, refined: str) -> bool:
    # Define stopping criteria here
    # For example, you could compare the similarity between refined and initial responses
    if len(refined) > len(initial):
        return True
    else:
        return False

# Example call to self_refine with custom refinement criteria function
refined_response = self_refine(prompt, feedback_prompt, refiner_prompt, is_refinement_sufficient)

This way, the self_refine function becomes more adaptable and can be customized to different refinement scenarios by simply passing the appropriate refinement criteria function.

====================================================

User
An infinite loop seems too much. Change it to have a maximum number of loops, defined by a constant with a predefined value.

====================================================

ChatGPT
Certainly! Let's incorporate a maximum loop count into the self_refine function to prevent infinite loops. We'll define a constant with a predefined value for the maximum number of iterations. Here's the updated code:


MAX_ITERATIONS = 10

def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str, refinement_criteria_func) -> str:
    answer = ChatGPT(prompt)

    for _ in range(MAX_ITERATIONS):
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if refinement_criteria_func(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined

In this updated version:

  • We introduced a constant MAX_ITERATIONS with a predefined value, indicating the maximum number of iterations allowed.
  • Instead of a while True loop, we now use a for loop with a range of MAX_ITERATIONS iterations.
  • The loop breaks if the refinement criteria function returns True, or stops once the maximum number of iterations is reached.

This modification ensures that the refinement process doesn't run indefinitely and provides a safety mechanism in case the refinement criteria are not met within a reasonable number of iterations. Adjust the value of MAX_ITERATIONS as needed based on your specific requirements and computational constraints.

====================================================
