
Fine-Tuning · vision_transformer (closed, 3 comments)

IMvision12 commented on September 7, 2024
Fine-Tuning


Comments (3)

andsteing commented on September 7, 2024

When you want to fine-tune a ViT model at a different image size than it was pre-trained on, you'll need to adjust the position embeddings accordingly. Section 3.2 of the ViT paper proposes to perform 2D interpolation.

This is supported in this codebase when loading a checkpoint:

if 'posembed_input' in restored_params.get('Transformer', {}):
  # Rescale the grid of position embeddings. Param shape is (1,N,1024)
  posemb = restored_params['Transformer']['posembed_input']['pos_embedding']
  posemb_new = init_params['Transformer']['posembed_input']['pos_embedding']
  if posemb.shape != posemb_new.shape:
    logging.info('load_pretrained: resized variant: %s to %s', posemb.shape,
                 posemb_new.shape)
    posemb = interpolate_posembed(
        posemb, posemb_new.shape[1], model_config.classifier == 'token')
    restored_params['Transformer']['posembed_input']['pos_embedding'] = posemb

This is done automatically when you call checkpoint.load_pretrained() and provide both init_params that expect a certain image size (e.g. 128 in your example) and a checkpoint whose weights were trained at a different size (e.g. 224).
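For intuition, here's a minimal NumPy sketch of what such 2D interpolation does, assuming a (1, N, D) embedding made of an optional class token followed by a square gs×gs grid of patch tokens. The function names here are illustrative, not the repo's actual API (the real helper is interpolate_posembed in the codebase):

```python
import numpy as np

def _resize_grid_bilinear(grid, gs_new):
    """Bilinearly resample a (gs, gs, D) grid to (gs_new, gs_new, D)."""
    gs_old = grid.shape[0]
    xs = np.linspace(0, gs_old - 1, gs_new) if gs_new > 1 else np.zeros(1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, gs_old - 1)
    frac = xs - x0
    # Interpolate along rows, then along columns.
    rows = grid[x0] * (1 - frac)[:, None, None] + grid[x1] * frac[:, None, None]
    return rows[:, x0] * (1 - frac)[None, :, None] + rows[:, x1] * frac[None, :, None]

def interpolate_posembed_sketch(posemb, num_tokens_new, has_class_token):
    """Resize position embeddings of shape (1, N, D) to a new token count."""
    posemb = posemb[0]
    cls_tok = posemb[:1] if has_class_token else posemb[:0]
    grid_tok = posemb[len(cls_tok):]
    gs_old = int(np.sqrt(len(grid_tok)))
    gs_new = int(np.sqrt(num_tokens_new - len(cls_tok)))
    grid = grid_tok.reshape(gs_old, gs_old, -1)
    grid = _resize_grid_bilinear(grid, gs_new)
    # The class-token embedding (if any) is carried over unchanged.
    return np.concatenate([cls_tok, grid.reshape(gs_new * gs_new, -1)])[None]

# E.g. going from a 4x4 patch grid (+ class token) to an 8x8 grid:
posemb = np.random.rand(1, 17, 8)
resized = interpolate_posembed_sketch(posemb, 65, True)
print(resized.shape)  # (1, 65, 8)
```

Only the grid part is interpolated spatially; the class token is kept as-is, which matches the `model_config.classifier == 'token'` flag in the snippet above.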

See the code in the main Colab that fine-tunes on cifar10 (size 32), specifically this cell:

https://colab.research.google.com/github/google-research/vision_transformer/blob/main/vit_jax.ipynb#scrollTo=zIXjOEDkvAWM


IMvision12 commented on September 7, 2024

Thanks, is there such a thing for mlp-mixer models? @andsteing


andsteing commented on September 7, 2024

It's not as straightforward with mlp-mixer, because the token-mixing MLP blocks are trained for a specific number of patches (as opposed to ViT, where the attention operation works on sequences of variable length and only the position embeddings need to be modified).
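To make the shape mismatch concrete, here's a toy illustration (the patch size and hidden width below are assumed values, not actual Mixer hyperparameters): the first dense layer of a token-mixing MLP acts across the patch axis, so its weight matrix has num_patches as a dimension and is therefore tied to the input resolution.

```python
# Illustrative shapes only, not actual Mixer code or hyperparameters.
patch_size = 16
tokens_mlp_dim = 384  # assumed hidden width of the token-mixing MLP

def token_mixing_weight_shape(image_size):
    # The token-mixing MLP's first dense layer maps num_patches -> hidden,
    # so num_patches appears directly in the weight shape.
    num_patches = (image_size // patch_size) ** 2
    return (num_patches, tokens_mlp_dim)

# Pre-trained at 224px -> 196 patches; fine-tuning at 128px -> 64 patches.
print(token_mixing_weight_shape(224))  # (196, 384)
print(token_mixing_weight_shape(128))  # (64, 384)
```

Since the two weight shapes differ, the pre-trained token-mixing weights can't simply be reused (or bilinearly interpolated the way 2D position embeddings can).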

See Section C of the Mixer paper for details on how to do this.

(not implemented in this repo)

