
model-soups's People

Contributors

mitchellnw


model-soups's Issues

Hyperparameter config for the CIFAR models

Hello,
Thank you for the great work.

I understand that you provide the finetuned weights for ImageNet as well as the hyperparameter config. Furthermore, you have also provided the models for CIFAR-10.

  1. I would kindly request you to also provide the hyperparameter config for the different models finetuned on the CIFAR-10 dataset.

  2. It would also be really great if you could provide the script you used to train the models on ImageNet or CIFAR; the provided finetune.py exposes only minimal hyperparameters, but I believe you explored extensive augmentations.

It would be greatly appreciated if you could provide the above (at least the first item).

Thank you!

Some questions about hyperparameter sweep

I am trying to reproduce the results of the hyperparameter sweep on CLIP ViT-B/32, using this script.

Could you help with the following:
In Appendix J.2.1, a standard grid, an extreme grid, and random search are used for the CLIP ViT-B/32 sweep. Which search method worked best and was ultimately used for CLIP ViT-B/32?

Possible to provide minimal code for Figure 2?

Thanks for the awesome paper!

Is it possible to provide minimal code for reproducing some parts of Figure 2? I believe that would greatly benefit the community of researchers working on similar things.

How to fine-tune on custom dataset?

Hey guys, I have a classification dataset of currency denominations, but I'm not sure how to use model soups for fine-tuning. My dataset is structured with a separate folder for each denomination, and each folder has thousands of images; overall I have train, valid, and test folders. Can someone help me with this, please?
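
To make the question concrete, here is roughly what I have in mind for producing soup ingredients. This is only a sketch, not the repo's finetune.py recipe: it uses a torchvision ResNet-50 purely for illustration, and all paths and hyperparameters are placeholders. My understanding is that every ingredient has to start from the same pretrained checkpoint and share the same architecture, with only the fine-tuning hyperparameters and seed changing between runs.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Placeholder paths: one sub-folder per denomination inside train/ and valid/.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("currency/train", transform=tfm)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

# One soup ingredient: fine-tune the same pretrained backbone with one
# hyperparameter setting; repeat with different lr/epochs/augmentation/seed.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
model = model.cuda()

opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

torch.save(model.state_dict(), "ingredient_1.pt")  # average these later for the soup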

Question: Data shuffling

I have found this paper very interesting and have applied the method. I've noticed the following:

  • Greedy soup increases performance when souping checkpoints from the same run, e.g. averaging epochs 59 and 60 gives a 0.3% increase.
  • Souping models from different runs with different augmentations, I sometimes get far worse results, down by as much as 60%.

In the paper you mention Frankle et al. (2020) and their observation that data order matters. Did you find a corresponding effect for greedy soup, e.g. that different data shuffling reduces results?

I have to do more testing on my side; I am currently limited by the time it takes to generate new weights. Also, I have trained all models so far from scratch instead of fine-tuning, but I plan to test with fine-tuned models later on.
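
If it helps to make the test concrete: one way to pin (or vary) only the shuffling order between PyTorch runs is to hand the DataLoader an explicitly seeded generator. This is just a sketch; train_set is a placeholder dataset.

import torch
from torch.utils.data import DataLoader

# Same seed -> identical shuffle order across runs; change only this seed to
# isolate the effect of data order from other sources of randomness.
g = torch.Generator()
g.manual_seed(0)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, generator=g)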

Add license information

Currently, this repository is provided without a license.
I recommend adding one to help clarify what users can (and cannot) do with the code.

Are model_4 and model_5 the same model?

I appreciate your generosity in providing the finetuned models' weights for future research.
I have a small inquiry.

I noticed that model_4 and model_5 have identical state_dict values.
Could this be a mistake?
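
For reference, the duplication can be checked roughly like this (a sketch; it assumes each checkpoint file is a plain state_dict, and the paths are placeholders):

import torch

sd4 = torch.load("models/model_4.pt", map_location="cpu")
sd5 = torch.load("models/model_5.pt", map_location="cpu")

# The two checkpoints are duplicates only if every tensor matches exactly.
identical = sd4.keys() == sd5.keys() and all(
    torch.equal(sd4[k], sd5[k]) for k in sd4
)
print("model_4 and model_5 identical:", identical)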

Thanks.

Greedy Soup selects only best individual model

Dear M. Wortsman,

I am experimenting with Model Soups for four-class brain tumor classification. I use ViT-B32 with AdamW and CategoricalCrossentropy (with label_smoothing). I randomly created 12 model configurations from the hyperparameter grid below. From my 12 models, the best and worst models have a validation accuracy of 91.964% and 84.226%, respectively. The Uniform Soup has a validation accuracy of 88.393%. My Greedy Soup, however, only includes the best individual model (i.e. no combination of weights yields accuracy > 91.964%). What can I do to have my Greedy Soup outperform the best individual model, besides creating a bigger model pool?

Many thanks in advance.

learning_rate = [3e-5, 1e-5, 5e-4]
weight_decay = [1e-6, 1e-7, 1e-8]
epochs = [12, 16, 20]
img_aug = [img_aug_low, img_aug_medium, img_aug_high]
label_smoothing = [0.1, 0.2, 0.3]

Where the different data augmentation intensities are defined as:

def img_aug_low(image, label):
    image = tf.image.random_flip_left_right(image)
    return image, label 

def img_aug_medium(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, 0.1)
    return image, label 

def img_aug_high(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, 0.1)
    image = tf.image.random_saturation(image, 0.7, 1.3)
    return image, label  
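
For anyone who lands here with the same question, this is the greedy-soup selection loop as I understand it from the paper, as a PyTorch-style sketch (sort ingredients by validation accuracy, then keep an ingredient only if adding it does not reduce the soup's held-out accuracy). build_model and evaluate are placeholders, and the logic carries over to Keras.

import torch

def greedy_soup(state_dicts, val_accs, build_model, evaluate):
    # Visit ingredients in order of decreasing individual validation accuracy.
    order = sorted(range(len(state_dicts)), key=lambda i: val_accs[i], reverse=True)
    soup = [state_dicts[order[0]]]
    best_acc = val_accs[order[0]]

    for i in order[1:]:
        candidate = soup + [state_dicts[i]]
        averaged = {k: torch.mean(torch.stack([sd[k].float() for sd in candidate]), dim=0)
                    for k in candidate[0]}
        model = build_model()           # fresh model with the right architecture
        model.load_state_dict(averaged)
        acc = evaluate(model)           # held-out validation accuracy
        if acc >= best_acc:             # keep the ingredient only if it does not hurt
            soup, best_acc = candidate, acc
    return soup, best_acc

Note that, by construction, the greedy soup never does worse than the best individual model on the held-out set used for selection; it just may not do better, which seems to be what I am seeing.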

Discussion on souping for text classification models fine-tuned from DistilBERT

Hi there.

I followed Cade's Colab Notebook and thought of expanding it to text classification models. So, to do that, I first fine-tuned five models (with different hyperparameters) on the text classification task with DistilBERT [1].

I am observing interesting results. This is how the individual models perform (which I believe is not different from what's shown in Cade's notebook):

[attached image: individual model accuracies]

These are the raw scores:

{'distilbert-base-uncased-finetuned-emotion-lr-0.0003-wd-003': 0.9335,
 'distilbert-base-uncased-finetuned-emotion-lr-3e-05-wd-001': 0.919,
 'distilbert-base-uncased-finetuned-emotion-lr-2e-05-wd-0001': 0.896,
 'distilbert-base-uncased-finetuned-emotion-lr-0.0006-wd-0003': 0.8875,
 'distilbert-base-uncased-finetuned-emotion-lr-1e-05-wd-0002': 0.7495}

With uniform souping, I get:

[attached image: uniform soup accuracy]

The performance drop is quite drastic. I understand there can be a myriad of reasons for this, but I wanted to know if you have observed something similar in your experiments.

This is the utility I am using for souping:

import torch
from transformers import AutoModelForSequenceClassification

def get_souped_model(state_dicts):
    # Uniform soup: average every parameter across the ingredient state dicts.
    new_state_dict = {}
    state_dict_keys = list(state_dicts[0].keys())

    for k in state_dict_keys:
        temp_weights = [state_dict[k].cpu() for state_dict in state_dicts]

        stacked_weights = torch.stack(temp_weights)
        averaged_weights = torch.mean(stacked_weights, dim=0)

        new_state_dict[k] = averaged_weights

    # model_ckpt and device are defined earlier in the notebook
    # (the base checkpoint name and the target torch device).
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=6)
    model.load_state_dict(new_state_dict)
    return model.to(device)
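
For completeness, this is roughly how the utility gets called (a sketch; the checkpoint names are the fine-tuned models scored above and are assumed to load with from_pretrained):

from transformers import AutoModelForSequenceClassification

ingredient_ckpts = [
    "distilbert-base-uncased-finetuned-emotion-lr-0.0003-wd-003",
    "distilbert-base-uncased-finetuned-emotion-lr-3e-05-wd-001",
    "distilbert-base-uncased-finetuned-emotion-lr-2e-05-wd-0001",
]
state_dicts = [
    AutoModelForSequenceClassification.from_pretrained(c, num_labels=6).state_dict()
    for c in ingredient_ckpts
]
souped_model = get_souped_model(state_dicts)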

With greedy souping, only the best model survives.

I have open-sourced my code here: https://github.com/sayakpaul/model-soups-text-classification.

Looking forward to your thoughts.

Hyperparameter Information for each model

Hello,

Your work is truly impressive, and I'm grateful for your contributions.

I'm interested in the hyperparameter settings for each model.
Having the hyperparameter information would be incredibly helpful for future research.

Best regards,

Inquiry Regarding timm Version and Rand-Aug Settings in the Provided Code

Hello,

Thank you for sharing your code - it's been very helpful.
I have a question regarding the timm version used in your research, specifically about the RandAug settings.

I noticed that in the code here, there is a constraint that limits the magnitude of RandAug to 10.
Following the environment setup as outlined in environment.md, I installed timm==0.6.13, which also appears to restrict the maximum magnitude of augmentation to 10.

Could this setting potentially affect the value of m in the RandAug configuration? I'm curious whether this constraint might influence the results or the model's performance in some way.

I would appreciate it if you could take a moment to look into this. Thank you for your time and assistance.
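
For concreteness, here is a minimal sketch of constructing a RandAugment transform with this timm version (the config string, fill colour, and image path below are just illustrative), where m is the magnitude that appears to be capped at 10:

from PIL import Image
from timm.data.auto_augment import rand_augment_transform

# "rand-m10-n2-mstd0.5": m = magnitude, n = ops per image, mstd = magnitude noise std.
tfm = rand_augment_transform(
    config_str="rand-m10-n2-mstd0.5",
    hparams=dict(img_mean=(124, 116, 104)),  # fill colour for geometric ops
)
augmented = tfm(Image.open("example.jpg"))   # placeholder image path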

Inquiry about Coefficients for Learned Soup

Hello,

I found the comparison between the performance of Greedy Soup and Learned Soup in your recent paper particularly intriguing. I'm reaching out to ask if you could share a log or list of coefficients for the Learned Soup that you reported.

Having access to this data would be extremely helpful for identifying potential trends among these coefficients (both by model and by layer). This information could be invaluable for replication studies, comparative analyses, or further research in the field.

I would greatly appreciate any help you can provide with this request.
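
In case it helps anyone else replicating in the meantime, here is a rough sketch of learning per-model coefficients on held-out data. This is my own simplified take, not the implementation from the paper (which also considers per-layer coefficients), and torch.func.functional_call requires PyTorch 2.x.

import torch
import torch.nn.functional as F

def learn_soup_coefficients(model, state_dicts, val_loader, steps=100, lr=0.05, device="cuda"):
    # One logit per ingredient; softmax keeps the mixing weights on the simplex.
    logits = torch.zeros(len(state_dicts), requires_grad=True, device=device)
    opt = torch.optim.AdamW([logits], lr=lr)
    sds = [{k: v.to(device) for k, v in sd.items()} for sd in state_dicts]
    model = model.to(device)

    batches = iter(val_loader)
    for _ in range(steps):
        try:
            x, y = next(batches)
        except StopIteration:
            batches = iter(val_loader)
            x, y = next(batches)
        alphas = torch.softmax(logits, dim=0)
        # Mix the ingredient weights and run the model functionally so the
        # gradient flows back into the coefficients.
        mixed = {k: sum(a * sd[k] for a, sd in zip(alphas, sds)) for k in sds[0]}
        out = torch.func.functional_call(model, mixed, (x.to(device),))
        loss = F.cross_entropy(out, y.to(device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach().cpu()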

Thanks for considering my inquiry.

checkpoint for ViT-g/14

Where can I get the checkpoints for the ViT-G/14 models?
In the code I only saw support for the ViT-B/32 base model. Could you show me how to load ViT-G/14?
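
As far as I understand, the ViT-G/14 in the paper was pre-trained on JFT-3B, and those weights are not publicly released. If an open-source ViT-g/14 is an acceptable substitute, OpenCLIP provides one (a sketch; check open_clip.list_pretrained() for the exact tags available):

import open_clip

# LAION-trained ViT-g-14 from OpenCLIP, *not* the paper's JFT-pretrained ViT-G/14.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-g-14", pretrained="laion2b_s12b_b42k"
)
tokenizer = open_clip.get_tokenizer("ViT-g-14")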

Souping on regression model leading to a drastic drop in accuracy

Hello,

I have a regression model that I composed by taking a MobileNet classifier (pre-trained with ImageNet weights), removing its classification head, and adding a flatten+dense layer that outputs a scalar. I define an accuracy metric based on whether the absolute error is below a threshold.

I take the above model and train it first using LP (linear probing) for 15 iterations, then using FT (full fine-tuning) for 2 iterations. This is my starter model; it was trained using RMSprop.

I then take this starter model and train it (using LP) with a variable number of iterations, variable learning rates, variable optimizer types (RMSprop, Adam, AdamW), and variable seeds to get my soup ingredient models.

I get approximately 91% accuracy on a held-out test set using the starter model, and 93% and 94% using two of my ingredient models.

Issue: I take a random pair of well-performing models (>90% accuracy) from among my starter and ingredient models and average their weights. However, the souped models almost always have an accuracy of around 2% on the test set.

Illustrative code I use to average the weights:

import numpy as np
import tensorflow as tf

def uniform_soup(model_list):
    soups = []

    tf.keras.backend.clear_session()
    # Any model from my starter or ingredients, just for its architecture.
    model_init = create_skeleton_model()

    for model_individual in model_list:
        soup = [np.array(weights) for weights in model_individual.weights]
        soups.append(soup)

    # Average each weight tensor across models (element-wise uniform soup).
    mean_soup = [np.mean(weight_group, axis=0) for weight_group in zip(*soups)]

    # Replace the skeleton model's weights with the uniform soup weights.
    for w1, w2 in zip(model_init.weights, mean_soup):
        tf.keras.backend.set_value(w1, w2)

    return model_init

  • Is there anything wrong in my design or anything that stands out to you?
  • Is it okay to use a regression model? Does anything in the loss landscape change owing to it being a regression model?

I did read through #10 and followed your advice on that thread when designing my souping.

Thanks in advance.
