
Comments (2)

nzmora commented on May 22, 2024

Hi,

I'll start with a long explanation :-), and then I'll take your questions.

Regularization can be a means to achieve sparsity - but there is an important distinction between sparsity and pruning which relates to the rest of my answer. Sparsity is a measure of the absolute zeros in a tensor. Pruning algorithms are one approach to achieve sparsity. But the distinction is even deeper.
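To make the terminology concrete, here is a tiny PyTorch snippet for measuring element-wise sparsity (the helper name is mine, for illustration only; Distiller has its own utilities for this):

```python
import torch

def sparsity(t: torch.Tensor) -> float:
    """Fraction of elements in `t` that are exactly zero."""
    return float((t == 0).sum()) / t.numel()

w = torch.tensor([[0.0, 0.3], [0.0, -1.2]])
print(sparsity(w))  # 0.5: two of the four elements are exactly zero
```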

Consider what happens when we prune connections: we remove those connections entirely from the network, which means no information flows through them: neither forward data nor backward gradients. In practice, we mask the weights during the forward pass and the gradients during the backward pass. But you know this 😉
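To illustrate what "masking both ways" looks like in plain PyTorch (a simplified sketch, not Distiller's actual implementation; the 0.1 threshold is arbitrary):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4, bias=False)
mask = (layer.weight.abs() > 0.1).float()   # hypothetical pruning criterion

# Forward: zero the pruned weights, so no data flows through those connections.
layer.weight.data.mul_(mask)

# Backward: zero the matching gradients, so the pruned connections receive no
# updates and stay removed.
layer.weight.register_hook(lambda grad: grad * mask)

loss = layer(torch.randn(2, 4)).sum()
loss.backward()
assert torch.all(layer.weight.grad[mask == 0] == 0)
```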

What happens when we regularize? At first glance, there is no relation between pruning and regularization, because in regularization we just use an added loss term to put “downward pressure” on the weights (individually, or in grouped structures); we don't remove connections. So no masking should be involved, right?
Well, not quite: we use a “soft-thresholding operator” (i.e. thresholding + masking) to prevent the weights from oscillating around zero (I tried to show this in this notebook using L1 regularization on a toy example).
OK, so when we regularize, we also mask the weights – but what about the gradients? We leave the gradients alone, because we don’t want to completely remove the regularized connections from the network: we want them to continue passing information in the backward direction. Another way to look at the difference between pruning and regularization: pruned connections are removed forever, whereas regularized connections that are masked out (because they fall below some threshold) can sometimes grow back.
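In code, that difference might look something like this (again a simplified sketch with illustrative values for lambda and the threshold; the point is that only the weights are thresholded and masked, while the gradients flow untouched):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(4, 4, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
lambda_l1, threshold = 1e-3, 1e-2                # illustrative values

x, target = torch.randn(8, 4), torch.randn(8, 4)

data_loss = F.mse_loss(layer(x), target)
reg_loss = lambda_l1 * layer.weight.abs().sum()  # "downward pressure" on the weights
(data_loss + reg_loss).backward()                # gradients are NOT masked
opt.step()

# Thresholding + masking: zero the weights that were pushed close to zero so
# they stop oscillating around it.  They can still grow back later, because
# their gradients keep flowing.
with torch.no_grad():
    layer.weight[layer.weight.abs() < threshold] = 0.0
```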

  1. " I think it's natural that this happens on the end of one epoch or end of whole training when the regularization terms have been decreased enough for pruning."
    This is an interesting idea. If we implemented it, we wouldn't be able to easily see in the logs the sparsity of the weights during part of the training (because we most likely wouldn't have absolute zeros). But this is not the reason I chose to threshold regularization at the end of each mini-batch. You see, pruning is iterative and therefore not "continuous": we prune, then we fine-tune for a "long" time, then we prune some more, fine-tune some more, and so on. Regularization is "continuous" by definition: every time we compute the data loss, we also compute the regularization loss. And as far as I understand, the “soft-thresholding operator” is part of every regularization calculation (on_minibatch_end); see the sketch after this list.
    BTW, you can also configure the regularizer not to threshold. BTW 2: today we can only schedule pruning at the beginning of epochs, but in the future I want to allow scheduling of pruning at mini-batch granularity.

  2. "The regularization and pruning both use the same zeros_mask_dict, it may brings some messes." This is a good comment and it tells me that I didn't document the interaction between pruning and regularization. I think that when you choose to mix these two, you want to smoothly push the solution towards sparsity (using the regularization loss term), but prune using a more "clumsy" pruning schedule. Now, the only reason to use a pruner when you're already using a regularizer, is if the pruner is more aggressive than the regularizer (otherwise, the pruner does nothing - it's mask is below the regularization mask). To sum up: if you're both pruning and regularizing, don't enable the regularizer's mask.

  3. "What is the purpose of keeping the regulatization mask of the last epoch. I guess it may be used by some remover in thinning.py, right?" - Correct: we keep the mask to get sparsity, which we can exploit to remove structures (thinning.py).

Thanks for the interesting comments,
Neta


nzmora commented on May 22, 2024

@hunterkun I'm closing this because it has been idle for 19 days. If you have remaining questions we can reopen it, or you can open another issue.
Cheers,
Neta

