Decoder parameters become huge when scaling up the problem (mae-scalable-vision-learners, CLOSED)

danifranco commented on July 16, 2024

Decoder parameters become huge when scaling up the problem

from mae-scalable-vision-learners.

Comments (4)

ariG23498 commented on July 16, 2024

Hey @danifranco

Thank you for taking interest in this work!

What I remember from our tutorial is that we tweaked the implementation so that the decoder outputs the entire image at once (hence the (image_size, image_size, 3) shape that you have pointed out). This makes it easier for us to reuse the patchification layer and gather the patches from the output image.

def calculate_loss(self, images, test=False):
    # Augment the input images.
    if test:
        augmented_images = self.test_augmentation_model(images)
    else:
        augmented_images = self.train_augmentation_model(images)

    # Patch the augmented images.
    patches = self.patch_layer(augmented_images)

    # Encode the patches.
    (
        unmasked_embeddings,
        masked_embeddings,
        unmasked_positions,
        mask_indices,
        unmask_indices,
    ) = self.patch_encoder(patches)

    # Pass the unmasked patches to the encoder.
    encoder_outputs = self.encoder(unmasked_embeddings)

    # Create the decoder inputs.
    encoder_outputs = encoder_outputs + unmasked_positions
    decoder_inputs = tf.concat([encoder_outputs, masked_embeddings], axis=1)

    # Decode the inputs.
    decoder_outputs = self.decoder(decoder_inputs) # <----- outputs an image
    decoder_patches = self.patch_layer(decoder_outputs) # <----- patchifies the image

    loss_patch = tf.gather(patches, mask_indices, axis=1, batch_dims=1) # <----- gathers the patches for loss (this can be optimized a lot!)
    loss_output = tf.gather(decoder_patches, mask_indices, axis=1, batch_dims=1)

    # Compute the total loss.
    total_loss = self.compiled_loss(loss_patch, loss_output)

    return total_loss, loss_patch, loss_output

You are right here. We cannot scale this model up, and the parameter count does take a hit. Please feel free to raise a PR if you want to solve this issue with better gathering and slicing techniques inside the training loop.
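For a rough sense of scale, here is a back-of-the-envelope sketch of how many parameters the final Dense layer needs when it sits on top of Flatten() and emits the whole image, compared with a Dense layer applied to each token that emits a single patch. The sizes used below (48 px images with 6 px patches, 224 px images with 16 px patches, a decoder width of 64) are illustrative assumptions rather than values quoted in this thread, and the helper names are made up for the example.

```python
def dense_params(in_features, units):
    """Weights plus biases of one fully connected layer."""
    return in_features * units + units


def flatten_head_params(image_size, patch_size, decoder_dim, channels=3):
    """Dense on top of Flatten(): every token feature connects to every output pixel."""
    num_patches = (image_size // patch_size) ** 2
    in_features = num_patches * decoder_dim
    units = image_size * image_size * channels
    return dense_params(in_features, units)


def per_patch_head_params(patch_size, decoder_dim, channels=3):
    """Dense applied per token: one token predicts the pixels of one patch."""
    return dense_params(decoder_dim, patch_size * patch_size * channels)


# Hypothetical settings, chosen only to show the trend.
small_flatten = flatten_head_params(image_size=48, patch_size=6, decoder_dim=64)
large_flatten = flatten_head_params(image_size=224, patch_size=16, decoder_dim=64)
small_per_patch = per_patch_head_params(patch_size=6, decoder_dim=64)
large_per_patch = per_patch_head_params(patch_size=16, decoder_dim=64)

print(f"flatten head,   48 px: {small_flatten:,}")
print(f"flatten head,  224 px: {large_flatten:,}")
print(f"per-patch head, 6 px patches:  {small_per_patch:,}")
print(f"per-patch head, 16 px patches: {large_per_patch:,}")
```

Under these assumptions the flattened head grows with the product of the token count and the pixel count, while the per-patch head depends only on the decoder width and the patch size.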

Please feel free to reach out if there are any more concerns.

wolo-wolo commented on July 16, 2024

Thank you @ariG23498 for your great work. I have encountered the same problem. Is the following code correct?

    # -----------------ori-----
    # x = layers.Flatten()(x)
    # pre_final = layers.Dense(units=image_size * image_size * 3, activation="sigmoid")(x)
    # outputs = layers.Reshape((image_size, image_size, 3))(pre_final)
    # -----------------modified-----
    pre_final = layers.Dense(units=patch_size * patch_size * 3, activation="sigmoid")(x)
    outputs = layers.Reshape((image_size, image_size, 3))(pre_final)

ariG23498 commented on July 16, 2024

It is correct, but as far as I remember (I might be wrong here, as I trained it quite some time ago), removing the Flatten() gave me poor results. I guess the problem was with the dataset size and the number of epochs.
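One shape-level detail that may be worth checking before trying this: without Flatten(), the Dense layer runs once per token, and a plain Reshape then lays those per-token outputs into the image in row-major order. The sketch below is pure index arithmetic, under the assumption that the patch layer cuts the image into a regular grid of patches numbered row by row and that Reshape fills the image row-major; the function names are hypothetical. It lists which tokens end up contributing pixels to a given patch. Note also that the decoder sequence in calculate_loss is ordered as unmasked tokens followed by masked ones, which this sketch does not model.

```python
def token_owning_pixel(row, col, image_size, patch_size):
    """Token whose Dense output lands on pixel (row, col) after a row-major Reshape.

    Each token emits patch_size * patch_size pixels (all channels), so it fills
    that many consecutive pixels in raster order.
    """
    pixels_per_token = patch_size * patch_size
    return (row * image_size + col) // pixels_per_token


def tokens_feeding_patch(index, image_size, patch_size):
    """Sorted tokens that contribute at least one pixel to grid patch `index`."""
    patches_per_row = image_size // patch_size
    top = (index // patches_per_row) * patch_size
    left = (index % patches_per_row) * patch_size
    owners = {
        token_owning_pixel(r, c, image_size, patch_size)
        for r in range(top, top + patch_size)
        for c in range(left, left + patch_size)
    }
    return sorted(owners)


# Illustrative sizes (assumed, not quoted in the thread): 48 px image, 6 px patches.
print(tokens_feeding_patch(0, image_size=48, patch_size=6))  # [0, 1, 2, 4, 5, 6]
```

With these sizes the first patch is assembled from six different tokens rather than from token 0 alone. Whether this has anything to do with the poor results mentioned above is not something this sketch can show; it only illustrates how the reshape distributes the outputs.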

You could probably give this a try and report the results in this issue. That would be a great contribution!
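For anyone who wants to try it, here is a minimal sketch of one way to compute the loss directly on patches, without reshaping to an image first. It assumes a decoder that returns the token sequence itself (without the Flatten/Dense/Reshape tail), that the tokens are ordered as unmasked tokens followed by masked ones (the order of the tf.concat in calculate_loss), and that the masked tokens follow the order of mask_indices. The function name is hypothetical, and plain mean squared error stands in for self.compiled_loss.

```python
import tensorflow as tf


def per_patch_masked_loss(decoder_tokens, patches, mask_indices, head):
    """Mean squared error over the masked patches only.

    decoder_tokens: (batch, num_patches, decoder_dim), ordered as
        [unmasked tokens, masked tokens] (assumption, see above).
    patches:        (batch, num_patches, patch_area) from the patch layer.
    mask_indices:   (batch, num_mask) integer indices of the masked patches.
    head:           layer mapping decoder_dim -> patch_area, applied per token.
    """
    num_mask = mask_indices.shape[1]
    # Keep only the tokens that stand for masked patches.
    masked_tokens = decoder_tokens[:, -num_mask:, :]
    # A Dense layer on a 3D input acts on the last axis, i.e. once per token.
    predicted = head(masked_tokens)
    # Ground-truth pixels of the same masked patches.
    target = tf.gather(patches, mask_indices, axis=1, batch_dims=1)
    return tf.reduce_mean(tf.square(target - predicted))


# Toy usage with random data; all sizes are illustrative assumptions.
batch, num_patches, num_mask, decoder_dim = 2, 64, 48, 64
patch_area = 6 * 6 * 3
head = tf.keras.layers.Dense(patch_area, activation="sigmoid")
tokens = tf.random.normal((batch, num_patches, decoder_dim))
patches = tf.random.uniform((batch, num_patches, patch_area))
mask_indices = tf.argsort(tf.random.uniform((batch, num_patches)), axis=-1)[:, :num_mask]
loss = per_patch_masked_loss(tokens, patches, mask_indices, head)
```

In this form the head has decoder_dim * patch_area + patch_area parameters regardless of the image size, which is the scaling behaviour the issue asks for. It has not been trained or compared against the Flatten() version here.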

ariG23498 commented on July 16, 2024

Closing this due to inactivity. Please feel free to reopen it if the query is not resolved.
