Comments (15)

aflah02 commented on June 21, 2024

I also have an unrelated question. Why is WEIGHT_INIT_SAMPLES=0 in the all_exp3.sh file? The other files have it set to 1000

from flad.

aflah02 commented on June 21, 2024

I hit another error when running multirun_create_weight_inits.py. It fails on this line, saying that weight_init_only is not a valid attribute. Since it is indeed not an attribute of the class, I commented the line out, but then I get an error here: AttributeError: 'DataParallel' object has no attribute 'name_or_path'. To fix that, I changed the line to model_name = model.module.name_or_path.replace("/","-")
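For anyone hitting the same AttributeError, here is a minimal sketch of a more defensive version of that one-line fix. It relies on the fact that DataParallel exposes the wrapped network as `.module`. The helper names `unwrap_model` and `model_name_for_path` are hypothetical, not part of the FLAD code, and the two stand-in classes plus the checkpoint name exist only so the sketch runs without torch installed.

```python
def unwrap_model(model):
    """Return the wrapped network if `model` has a `.module`, else `model` itself."""
    # torch.nn.DataParallel stores the original network in `.module`;
    # a bare model has no such attribute and is returned unchanged.
    return getattr(model, "module", model)


def model_name_for_path(model):
    """Build a filesystem-friendly name from the model's `name_or_path`."""
    return unwrap_model(model).name_or_path.replace("/", "-")


# Hypothetical stand-ins for a Hugging Face model and its DataParallel wrapper.
class FakeModel:
    name_or_path = "google/t5-xl-lm-adapt"  # illustrative checkpoint name


class FakeDataParallel:
    def __init__(self, module):
        self.module = module


bare = FakeModel()
wrapped = FakeDataParallel(bare)
print(model_name_for_path(bare))
print(model_name_for_path(wrapped))
```

Both calls give the same name, so the same code path works whether or not the model ends up wrapped.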

After all these changes, I've been stuck at this point for quite a while:

[image]

The progress bar doesn't seem to be advancing, and the estimated end time is not populated.

Do let me know if you have any suggestions for fixing these in a different way.

EDIT: I checked in after a few hours and it seems to be running (The ETA is 50ish hours for T0Mix alongside the 3B Model)

alon-albalak commented on June 21, 2024

> I also have an unrelated question. Why is WEIGHT_INIT_SAMPLES=0 in the all_exp3.sh file? The other files have it set to 1000

WEIGHT_INIT_SAMPLES is only used for the UCB1 algorithm which has an explicit reward initialization phase. See algorithms 1 and 2 from the paper (https://arxiv.org/pdf/2302.00674.pdf)

In reality, you could also initialize the rewards for EXP3, but we didn't for our experiments.
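To make the difference concrete, here is a minimal sketch of textbook UCB1 with an explicit initialization phase. This is the generic bandit algorithm, not the FLAD trainer: the function name `ucb1`, the `init_samples` parameter, and the toy rewards are assumptions for illustration. Textbook EXP3, by contrast, starts from uniform weights and needs no such phase.

```python
import math


def ucb1(reward_fn, num_arms, num_steps, init_samples=1):
    """Textbook UCB1: sample every arm first, then pick by upper confidence bound."""
    if init_samples < 1:
        raise ValueError("UCB1 needs at least one initial sample per arm")
    counts = [0] * num_arms
    means = [0.0] * num_arms

    def update(arm, reward):
        # Incremental update of the running mean reward for this arm.
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    # Explicit reward initialization phase: every arm gets sampled up front.
    for arm in range(num_arms):
        for _ in range(init_samples):
            update(arm, reward_fn(arm))

    # Main loop: choose the arm with the highest mean plus exploration bonus.
    for t in range(sum(counts) + 1, num_steps + 1):
        scores = [
            means[a] + math.sqrt(2 * math.log(t) / counts[a])
            for a in range(num_arms)
        ]
        arm = max(range(num_arms), key=scores.__getitem__)
        update(arm, reward_fn(arm))
    return counts, means


true_rewards = [0.2, 0.8, 0.5]  # hypothetical, deterministic for the demo
counts, means = ucb1(lambda arm: true_rewards[arm], num_arms=3, num_steps=500)
print(counts)
```

Without the initialization phase the confidence bonus would divide by a zero count, which is why UCB1 needs those initial samples while EXP3 can run with none.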

alon-albalak commented on June 21, 2024

> Another error I encountered is when I try to run multirun_create_weight_inits.py. I get an error on this line which says that weight_init_only is not a valid attribute. To fix this I just commented it out as I noticed that it is indeed not an attribute of the class but then I get an error here which says AttributeError: 'DataParallel' object has no attribute 'name_or_path'. To fix this error I changed the line to model_name = model.module.name_or_path.replace("/","-")
>
> After all these changes I've been stuck here for quite a while -
>
> [image] The progress bar doesn't seem to be running and the end time estimate is not populated
>
> Do let me know if you have any suggestions to fix these in a different way?
>
> EDIT: I checked in after a few hours and it seems to be running (The ETA is 50ish hours for T0Mix alongside the 3B Model)

Regarding the weight initialization: the script is currently set to run over 2 different numbers of weight-initialization samples (this line) and 5 different seeds (this line).

You can reduce those to a single sample count and a single seed, which will significantly speed up the initialization.
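As a rough illustration of why trimming the grid helps, the sketch below counts the (sample count, seed) combinations the script would loop over. The concrete values and the helper name `init_runs` are hypothetical; only the 2 x 5 shape comes from the script.

```python
from itertools import product


def init_runs(sample_counts, seeds):
    """List every (sample_count, seed) pair the initialization would run."""
    return list(product(sample_counts, seeds))


# Hypothetical values: two sample counts and five seeds, as in the default grid.
full = init_runs([1000, 5000], [1, 2, 3, 4, 5])
# Reduced grid: one sample count and one seed.
reduced = init_runs([1000], [42])

print(len(full), len(reduced))
```

The number of runs drops from ten to one; the actual wall-clock saving depends on how expensive each sample count is.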

Thanks for pointing this out, I think I know what the problem is. At one point, I moved the weight initialization into the trainer class (here), but didn't make it compatible with the multirun_train_mixed script.

For now, the solution is to compute the gradients before running the Exploit-only method, as you're currently doing, and I will add that information to the instructions. Thank you for finding this bug!

Let me know if it still doesn't work for some reason after pre-computing the gradients.

aflah02 commented on June 21, 2024

Thanks for the reference, I'll check it out!
For the alignment computation, I did limit it to only one seed and one weight initialization sample value, but it's still taking around 40 hours on an A100 (80 GB). Also, does the program write any intermediate outputs? It did create a directory, but it's still empty (around 30 hours have passed). Just wanted to confirm that this is expected.

aflah02 commented on June 21, 2024

Update: It did not work. I did not save the logs to a text file (in hindsight I should have), so this is the only log still visible in the terminal.
I'm not quite sure what's wrong, though.

I ran this command - python3 src/multirun_create_weight_inits.py --target_dataset $TARGET_DATASET --auxiliary_dataset $AUXILIARY_DATASET

[image]

Also there is this folder which was created but is empty - FLAD/outputs/weight_inits/T5_LM_3B/T0Mixture/copa/42/1000

alon-albalak commented on June 21, 2024

Okay, I've made a few fixes and been able to run the all_exploit.sh script and the multirun_create_weight_inits.py script.

Try pulling the newest version of the code base and let me know if you can run all_exploit.sh and multirun_create_weight_inits.py

aflah02 commented on June 21, 2024

Thank you for the quick response! Just to confirm: should I still run the gradient alignment computations first, or can they be run in parallel now?

alon-albalak commented on June 21, 2024

Also, I did catch this error once: AttributeError: 'DataParallel' object has no attribute 'name_or_path'
It has to do with how the model was initialized by Hugging Face. After my bug fixes it has disappeared for me, but if it's still there for you, let me know and I'll make changes for that as well.

The solution that I found for that is to change lines 1639-1640 from:

        # Initialize weights if needed
        if self.args.weight_initialization_samples > 0:
            self._initialize_weights(train_dataloader, target_dataloader, model)

to

        # Initialize weights if needed
        if self.args.weight_initialization_samples > 0:
            if hasattr(model, "name_or_path"):
                self._initialize_weights(train_dataloader, target_dataloader, model)
            else:
                self._initialize_weights(train_dataloader, target_dataloader, self.model)

I'm hesitant to make that change unless it's required though, because that will affect both the EXP3 and UCB1 trainers as well.

alon-albalak commented on June 21, 2024

> Thank you for the quick response! Just to confirm should I still be running the gradient alignment computations first? or can they be run in parallel now?

The all_exploit.sh script should take care of the alignment computations

aflah02 commented on June 21, 2024

Got it!
I'll rerun it and update you with the outcome 🙌

alon-albalak commented on June 21, 2024

> Got it! I'll rerun it and update you with the outcome 🙌

Sorry for so much back and forth. When I was debugging the all_exploit.sh script, it was running, so I assumed it was also calculating the alignment, but it actually doesn't. I then went back to compute the alignments with multirun_create_weight_inits.py and realized why it was taking such a long time: the weight_init_only flag actually is important, and I must have removed its use at some point, so I've added it back in.

Now I've successfully been able to precompute the alignments, and then train the exploit-only model.

To replicate our experiments you do need to run multirun_create_weight_inits.py first, then you can run the all_exploit.sh script. Pull the newest update and let me know if that fixes it for you
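The order can be sketched as a small driver script. This is only an illustration under assumptions: the helper name `build_pipeline` is hypothetical, the location of all_exploit.sh and whether it takes arguments are guesses, and the dataset names are examples. The commands are only printed here; the actual execution is left commented out.

```python
import shlex


def build_pipeline(target_dataset, auxiliary_dataset, exploit_script="all_exploit.sh"):
    """Return the two commands in the order they should run."""
    # Step 1: precompute the alignments / weight initializations.
    precompute = [
        "python3", "src/multirun_create_weight_inits.py",
        "--target_dataset", target_dataset,
        "--auxiliary_dataset", auxiliary_dataset,
    ]
    # Step 2: train the exploit-only model (script path is an assumption).
    train = ["bash", exploit_script]
    return [precompute, train]


pipeline = build_pipeline("copa", "T0Mixture")  # example dataset names
for cmd in pipeline:
    print(shlex.join(cmd))
    # To actually run each step and stop on the first failure:
    # import subprocess; subprocess.run(cmd, check=True)
```

Running with `check=True` would stop before training if the precompute step fails, which avoids training against missing initializations.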

aflah02 commented on June 21, 2024

Thanks for all the fixes!
I've started the run. The ETA is just 20 minutes. However, I did have to change the lines you mentioned above to fix the DataParallel error.

alon-albalak commented on June 21, 2024

Any update on this? Have you succeeded with the exploit-only baseline?

Have you been able to run either the EXP3 or UCB1 methods?

Just checking to see if you've found any other bugs

aflah02 commented on June 21, 2024

Hey,
I did manage to run the exploit-only baseline after precomputing the gradients. I could already run EXP3 without that, so I haven't retested it. I did not get time to rerun the UCB1 baseline; I'll update you in case I hit any issues there.
In terms of issues, the only one was the DataParallel error, and the changes you suggested above fixed it.
So I'm closing this issue, as everything seems to be fine now.
Thanks for all the help!
