madaan / self-refine
LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
Home Page: https://selfrefine.info
License: Apache License 2.0
How can I get an ANTHROPIC_API_KEY?
There is never any feedback
I want to use Self-Refine for a reasoning task, such as open-book QA.
For the few-shot examples used in the initial generation: do the examples have to be bad examples?
If I have good examples, could I use them for the initial stage and hope that, through iterations, the output gets even better?
However, if I were to start from already-good examples, it might be tough to come up with even better ones for the few-shot examples in the refine stage.
Hello. For code-related tasks, do you plan to update your code to replace Codex with another model? Do you have a suggestion for an alternative model, and do you plan to push the updated code?
Could this repo get an MIT or Apache license to fully free up anyone to take and adapt work found here and innovate further?
In the Yelp benchmark, the file for the task_measure is missing, i.e., the class SentimentTransferMeasurement can't be found. Can you upload this file? Thanks!
Hello authors,
Similar to the previous post (#6), when I run the tasks I also face severe hallucination issues when using GPT-3.5-turbo-0125.
The error message looks like this:
An error occurred: list index out of range. Traceback (most recent call last):
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\utils.py", line 39, in wrapper
return func(*args, **kwargs)
File "C:\Users\ITSupp\Downloads\codes\self-refine\src\gsm\run.py", line 45, in iterative_gsm
fb_and_maybe_soln = task_feedback(solution=solution)
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\gsm\feedback.py", line 52, in call
improved_soln = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
n_attempts===> 0
Take the GSM task for example. In the script self-refine\src\gsm\run.py, I see that to avoid this type of error you add @retry_parse_fail_prone_cmd, a retry mechanism that allows multiple attempts to get a decently formatted answer.
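For context, a minimal sketch of what such a retry decorator could look like (hypothetical; the actual retry_parse_fail_prone_cmd in src/utils.py may differ in signature and behavior):

import functools
import traceback

def retry_parse_fail_prone_cmd(func, max_retries: int = 4):
    # Sketch: re-run a parse-prone call until it stops raising.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for retries_left in range(max_retries - 1, -1, -1):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print(f"An error occurred: {e}. {traceback.format_exc()}"
                      f" Left retries: {retries_left}.")
        return None  # all attempts failed; the caller must handle None
    return wrapper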
OK, let's first execute run.py to get the output file for further evaluation, using the default max_attempts=4. With retry_parse_fail_prone_cmd added, only around 50% of the test samples (slow_programs_df) are collected into results, which is then written to outfile on disk. Without retry_parse_fail_prone_cmd, the retention rate is smaller, around 30%.
However, the provided output file data/tasks/gsm/gsm_outputs.jsonl has 1253 samples. Out of the 1319 input test cases (data/tasks/gsm/gsm.jsonl), that is a retention rate of about 95% (1253 / 1319 ≈ 0.95). May I know why the retention rate is so high? Did you execute self-refine\src\gsm\run.py multiple times and merge the results?
Then, evaluating the outfile via src/gsm/gsm_selfref_eval.py, the accuracy is between 70% and 80% (I ran it several times).
However, most of the time the accuracy is identical across attempts:
Accuracy at attempt 0 = 74.07%
Accuracy at attempt 1 = 74.07%
Accuracy at attempt 2 = 74.07%
Accuracy at attempt 3 = 74.07%
Accuracy at attempt 4 = 74.07%
I am not sure whether this was actually common in your original experiments?
Thanks.
Hello! I'm running python -u src/gsm/run.py with "gpt-3.5-turbo" and get the following error. This happens because "def solution():" is not in entire_output in feedback.py.
1%|▏ | 8/1319 [03:04<8:42:28, 23.91s/it]
An error occurred: list index out of range. Traceback (most recent call last):
File "/home/ubuntu/code/hideodeo/self-refine/src/utils.py", line 39, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/run.py", line 40, in iterative_gsm
fb_and_maybe_soln = task_feedback(solution=solution)
File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/feedback.py", line 42, in __call__
solution = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
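A defensive workaround (just a sketch, not the repo's actual fix; extract_solution is a hypothetical helper) is to check for the marker before splitting, so a malformed completion returns None for the retry logic instead of raising:

def extract_solution(entire_output: str):
    # Sketch: avoid the IndexError when the model omits "def solution():".
    marker = "def solution():"
    if marker not in entire_output:
        return None  # signal the retry wrapper to try again
    return marker + entire_output.split(marker, 1)[1]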
Will you release the CommonGen-Hard dataset soon? It looks like it has 20-30 concepts for each sentence. Very curious about this dataset.
Hello, I am really interested in your fine work and am trying to reproduce the results!
Can you share the prompts and examples for the code optimization task?
I am having a hard time reproducing the code optimization results.
I have an additional question regarding code optimization in your paper. Specifically, I'm interested in how you calculated the percentage of programs that were optimized. When reproducing the results, I noticed that some programs performed worse, while others actually showed improvement.
Hi,
Thank you for your fantastic work!
It seems that the instructions for running the PIE evaluation are missing. Would you be able to provide instructions on how to use the pie_eval.py script? I'm particularly uncertain about how to obtain the .report file. Thanks!
Hello.
May I ask how to run the full test cases to get the figures reported in the paper?
For example, in the acronym task I only see the unit testing, and I cannot find any code to run https://github.com/madaan/self-refine/blob/main/src/acronym/run_mcts.py.
Also, where is the titles_file in https://github.com/madaan/self-refine/blob/main/src/acronym/run.py#L102?
For the CommonGen task, it seems that running python -u src/commongen/run.py cmd stair bubble team dryer puppy aliens cat
does not download any data.
Hi All,
Thank you for your lovely work.
There is no run.py in the PIE folder.
Thank you
Hello! First of all this is a super nice paper.
I am trying to wrap my head around the concept of the paper. What I don't understand is this:
No matter what the output from the LM is, the LM is prompted again with the same question plus the generated code/text (code + comments), until the LM itself says "it is correct", with a maximum of max_attempts per question?
The paper reports improvements over 5 iterations, so if the model outputs "it is correct", is the same output used for the next iteration? Just want to make sure I understood this correctly.
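As a sketch of that reading (generate, critique, and refine are hypothetical stand-ins for the LM calls, and the stopping check is a naive string match that the actual code may handle differently):

def generate(question):                  # illustrative stub for the initial LM call
    return "draft answer"

def critique(question, answer):          # illustrative stub for the LM feedback call
    return "it is correct"

def refine(question, answer, feedback):  # illustrative stub for the LM refine call
    return answer + " (refined)"

def refine_until_stop(question, max_attempts=5):
    answer = generate(question)
    outputs = [answer]                   # attempt 0
    for _ in range(max_attempts):
        feedback = critique(question, answer)
        if "it is correct" in feedback.lower():
            break                        # stop early; later "iterations" reuse this output
        answer = refine(question, answer, feedback)
        outputs.append(answer)
    return outputs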
Amazing work! We are interested in building on this work. Are there any plans for releasing the long-form sentiment reversal Yelp dataset used in this work? Thanks!
Would you please provide the instructions for evaluating model responses in the Dialogue Response Generation task?
Great work! We are currently trying to reproduce your results so that we can build on top of your insights. I see that you are still working on this codebase. Are there already some benchmarks that are fully implemented in this repository that you do not intend to develop further in the near future? When I run the different benchmarks I sometimes run into errors, and I am not sure whether the error is on my side or because some functionality is missing.
For example, when I run the CommonGen benchmark on a reduced test set, I observe a lot of errors in the output file. The main source of error seems to be that the feedback from GPT does not have the intended structure, so exceptions occur in the code. I did not change any of the training/instruction prompts you provide in this file, and I use "gpt-3.5-turbo". Did you also observe this behaviour?
I also noticed two different things, which I thought I would notify you about:
Thanks a lot!
Was there an attempt to test this library with the LLaMA 2 model?
In https://selfrefine.info/ I found the following pseudocode:
def self_refine(prompt: str) -> str:
    def is_refinement_sufficient(prompt, feedback, initial, refined) -> bool:
        # Define stopping criteria here
        pass

    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if is_refinement_sufficient(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined
It seemed a bit weird to me, so I had this conversation
====================================================
User
there is one function enclosing it all, but I think it doesn't have all the required parameters. What do you think?
====================================================
ChatGPT
You're correct. The self_refine function is the outermost enclosing function, but it seems to be missing some necessary parameters for the nested functions (is_refinement_sufficient, feedback_prompt, and refiner_prompt) to work properly.
In its current state, the self_refine function only accepts a single parameter (prompt). However, the nested functions and the loop inside self_refine rely on additional parameters such as feedback_prompt and refiner_prompt, which are not provided as arguments to the self_refine function.
To fix this issue, the self_refine function should be modified to accept these additional parameters or they should be defined globally outside the function. Alternatively, the nested functions could be modified to access these parameters from the global scope.
====================================================
User
improve the code
====================================================
ChatGPT
To improve the code, we need to ensure that all necessary parameters are provided to the functions and that the stopping criteria function (is_refinement_sufficient) is properly implemented. Here's an improved version of the code:
def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str) -> str:
    def is_refinement_sufficient(prompt: str, feedback: str, initial: str, refined: str) -> bool:
        # Define stopping criteria here
        # For example, you could compare the similarity between refined and initial responses
        if len(refined) > len(initial):
            return True
        else:
            return False

    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if is_refinement_sufficient(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined
In this improved version:
The self_refine function now accepts two additional parameters: feedback_prompt and refiner_prompt, which are required for generating feedback and refining the answer, respectively.
The is_refinement_sufficient function now only takes four parameters (prompt, feedback, initial, and refined). I've also added a basic stopping criterion example that checks whether the length of the refined response is greater than the length of the initial response. You can adjust this criterion based on your specific requirements.
The function now returns the refined response.
This improved version should work provided that the ChatGPT function is properly implemented and accessible within the scope of this code.
====================================================
User
should the is_refinement_sufficient be defined outside to make the self_refine fully adaptable to multiple cases?
====================================================
ChatGPT
Defining the is_refinement_sufficient function outside of the self_refine function could indeed enhance adaptability and reusability, especially if you anticipate using the refinement logic in multiple contexts. By defining it outside, you can easily reuse it across different refinement scenarios without duplicating code.
Here's how you can refactor the code to define is_refinement_sufficient outside of self_refine:
def is_refinement_sufficient(prompt: str, feedback: str, initial: str, refined: str) -> bool:
    # Define stopping criteria here
    # For example, you could compare the similarity between refined and initial responses
    if len(refined) > len(initial):
        return True
    else:
        return False

def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str) -> str:
    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if is_refinement_sufficient(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined
With this refactoring:
The is_refinement_sufficient function is defined separately, making it reusable across different contexts.
The self_refine function now only focuses on the iterative refinement process without the need to define the stopping criteria, thus making it more modular and adaptable.
This approach enhances code organization and facilitates maintenance and reuse. Now, if you need to adjust the stopping criteria or use it elsewhere, you can do so without modifying the self_refine function.
====================================================
User
But if we add a parameter to self_refine for passing the stopping-criteria function, we would not be locked to the name of the specific is_refinement_sufficient function.
====================================================
ChatGPT
Absolutely, passing the refinement criterion function as a parameter to self_refine adds even more flexibility and decouples the refinement process from any specific stopping criteria function. Here's how you can modify the self_refine function to accept a custom refinement criterion function:
def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str, refinement_criteria_func) -> str:
    answer = ChatGPT(prompt)

    while True:
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if refinement_criteria_func(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined
Now, when calling self_refine, you can pass any function that meets the criteria for refinement. For example:
def is_refinement_sufficient(prompt: str, feedback: str, initial: str, refined: str) -> bool:
    # Define stopping criteria here
    # For example, you could compare the similarity between refined and initial responses
    if len(refined) > len(initial):
        return True
    else:
        return False

# Example call to self_refine with custom refinement criteria function
refined_response = self_refine(prompt, feedback_prompt, refiner_prompt, is_refinement_sufficient)
This way, the self_refine function becomes more adaptable and can be customized to different refinement scenarios by simply passing the appropriate refinement criteria function.
====================================================
User
An infinite loop seems too much. Change it to have a maximum amount of loops defined by a constant with predefined value
====================================================
ChatGPT
Certainly! Let's incorporate a maximum loop count into the self_refine function to prevent infinite loops. We'll define a constant with a predefined value for the maximum number of iterations. Here's the updated code:
MAX_ITERATIONS = 10

def self_refine(prompt: str, feedback_prompt: str, refiner_prompt: str, refinement_criteria_func) -> str:
    answer = ChatGPT(prompt)

    for _ in range(MAX_ITERATIONS):
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)

        if refinement_criteria_func(prompt, feedback, answer, refined):
            break

        answer = refined

    return refined
In this updated version:
We introduced a constant MAX_ITERATIONS with a predefined value, indicating the maximum number of iterations allowed.
Instead of a while True loop, we now use a for loop with a range of MAX_ITERATIONS iterations.
The loop breaks if the refinement criteria function returns True or if the maximum number of iterations is reached.
This modification ensures that the refinement process doesn't run indefinitely and provides a safety mechanism in case the refinement criteria are not met within a reasonable number of iterations. Adjust the value of MAX_ITERATIONS as needed based on your specific requirements and computational constraints.
====================================================
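For completeness, here is a self-contained version of the final loop with ChatGPT stubbed out so it actually runs; the stub and the toy stopping criterion are purely illustrative:

MAX_ITERATIONS = 10

def ChatGPT(*parts: str) -> str:
    # Illustrative stub: a real implementation would call a chat-completion API.
    return " ".join(parts) + " [model output]"

def is_refinement_sufficient(prompt, feedback, initial, refined) -> bool:
    # Toy criterion: stop once the refined answer is longer than the previous one.
    return len(refined) > len(initial)

def self_refine(prompt, feedback_prompt, refiner_prompt, refinement_criteria_func) -> str:
    answer = ChatGPT(prompt)
    for _ in range(MAX_ITERATIONS):
        feedback = ChatGPT(feedback_prompt, answer)
        refined = ChatGPT(refiner_prompt, feedback, answer)
        if refinement_criteria_func(prompt, feedback, answer, refined):
            break
        answer = refined
    return refined

print(self_refine("Write a haiku about rain.",
                  "Give feedback on this answer:",
                  "Rewrite the answer using this feedback:",
                  is_refinement_sufficient))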