
Fine-Tuning · vision_transformer (closed, 3 comments)

IMvision12 commented on September 7, 2024
Fine-Tuning


Comments (3)

andsteing commented on September 7, 2024

When you want to fine-tune a ViT model at a different image size than it was pre-trained on, you'll need to adjust the position embeddings accordingly. Section 3.2 of the ViT paper proposes to perform 2D interpolation.

This is supported in this codebase when loading a checkpoint:

if 'posembed_input' in restored_params.get('Transformer', {}):
  # Rescale the grid of position embeddings. Param shape is (1,N,1024)
  posemb = restored_params['Transformer']['posembed_input']['pos_embedding']
  posemb_new = init_params['Transformer']['posembed_input']['pos_embedding']
  if posemb.shape != posemb_new.shape:
    logging.info('load_pretrained: resized variant: %s to %s', posemb.shape,
                 posemb_new.shape)
    posemb = interpolate_posembed(
        posemb, posemb_new.shape[1], model_config.classifier == 'token')
    restored_params['Transformer']['posembed_input']['pos_embedding'] = posemb

This is done automatically when you call checkpoint.load_pretrained() and provide both init_params that expect a certain image size (e.g. 128 in your example) and a checkpoint whose weights were trained at a different size (e.g. 224).
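For intuition, here's a minimal NumPy sketch of what such 2D interpolation does, assuming a (1, N, D) embedding made of an optional class token followed by a square gs×gs grid of patch tokens. The function names here are illustrative, not the repo's actual API (the real helper is interpolate_posembed in the codebase):

```python
import numpy as np

def _resize_grid_bilinear(grid, gs_new):
    """Bilinearly resample a (gs, gs, D) grid to (gs_new, gs_new, D)."""
    gs_old = grid.shape[0]
    xs = np.linspace(0, gs_old - 1, gs_new) if gs_new > 1 else np.zeros(1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, gs_old - 1)
    frac = xs - x0
    # Interpolate along rows, then along columns.
    rows = grid[x0] * (1 - frac)[:, None, None] + grid[x1] * frac[:, None, None]
    return rows[:, x0] * (1 - frac)[None, :, None] + rows[:, x1] * frac[None, :, None]

def interpolate_posembed_sketch(posemb, num_tokens_new, has_class_token):
    """Resize position embeddings of shape (1, N, D) to a new token count."""
    posemb = posemb[0]
    cls_tok = posemb[:1] if has_class_token else posemb[:0]
    grid_tok = posemb[len(cls_tok):]
    gs_old = int(np.sqrt(len(grid_tok)))
    gs_new = int(np.sqrt(num_tokens_new - len(cls_tok)))
    grid = grid_tok.reshape(gs_old, gs_old, -1)
    grid = _resize_grid_bilinear(grid, gs_new)
    # The class-token embedding (if any) is carried over unchanged.
    return np.concatenate([cls_tok, grid.reshape(gs_new * gs_new, -1)])[None]

# E.g. going from a 4x4 patch grid (+ class token) to an 8x8 grid:
posemb = np.random.rand(1, 17, 8)
resized = interpolate_posembed_sketch(posemb, 65, True)
print(resized.shape)  # (1, 65, 8)
```

Only the grid part is interpolated spatially; the class token is kept as-is, which matches the `model_config.classifier == 'token'` flag in the snippet above.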

See the code in the main Colab that fine-tunes on cifar10 (size 32), specifically this cell:

https://colab.research.google.com/github/google-research/vision_transformer/blob/main/vit_jax.ipynb#scrollTo=zIXjOEDkvAWM


IMvision12 commented on September 7, 2024

Thanks, is there such a thing for mlp-mixer models? @andsteing


andsteing commented on September 7, 2024

It's not as straightforward with mlp-mixer, because the token-mixing MLP blocks are trained for a specific number of patches (as opposed to ViT, where the attention operation works on sequences of variable length and only the position embeddings need to be modified).
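To make the shape mismatch concrete, here's a toy illustration (the patch size and hidden width below are assumed values, not actual Mixer hyperparameters): the first dense layer of a token-mixing MLP acts across the patch axis, so its weight matrix has num_patches as a dimension and is therefore tied to the input resolution.

```python
# Illustrative shapes only, not actual Mixer code or hyperparameters.
patch_size = 16
tokens_mlp_dim = 384  # assumed hidden width of the token-mixing MLP

def token_mixing_weight_shape(image_size):
    # The token-mixing MLP's first dense layer maps num_patches -> hidden,
    # so num_patches appears directly in the weight shape.
    num_patches = (image_size // patch_size) ** 2
    return (num_patches, tokens_mlp_dim)

# Pre-trained at 224px -> 196 patches; fine-tuning at 128px -> 64 patches.
print(token_mixing_weight_shape(224))  # (196, 384)
print(token_mixing_weight_shape(128))  # (64, 384)
```

Since the two weight shapes differ, the pre-trained token-mixing weights can't simply be reused (or bilinearly interpolated the way 2D position embeddings can).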

See Section C of the Mixer paper for details on how to do this.

(not implemented in this repo)

