
Comments (2)

nzmora commented on May 22, 2024

Hi,

I'll start with a long explanation :-), and then I'll take your questions.

Regularization can be a means to achieve sparsity - but there is an important distinction between sparsity and pruning which relates to the rest of my answer. Sparsity is a measure of the absolute zeros in a tensor. Pruning algorithms are one approach to achieve sparsity. But the distinction is even deeper.
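To make the terminology concrete, here is a tiny PyTorch snippet for measuring element-wise sparsity (the helper name is mine, for illustration only; Distiller has its own utilities for this):

```python
import torch

def sparsity(t: torch.Tensor) -> float:
    """Fraction of elements in `t` that are exactly zero."""
    return float((t == 0).sum()) / t.numel()

w = torch.tensor([[0.0, 0.3], [0.0, -1.2]])
print(sparsity(w))  # 0.5: two of the four elements are exactly zero
```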

Consider what happens when we prune connections: we remove those connections entirely from the network, which means no information flows through them: neither forward data nor backward gradients. In practice, we mask the weights during the forward pass and the gradients during the backward pass. But you know this 😉
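To illustrate what "masking both ways" looks like in plain PyTorch (a simplified sketch, not Distiller's actual implementation; the 0.1 threshold is arbitrary):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4, bias=False)
mask = (layer.weight.abs() > 0.1).float()   # hypothetical pruning criterion

# Forward: zero the pruned weights, so no data flows through those connections.
layer.weight.data.mul_(mask)

# Backward: zero the matching gradients, so the pruned connections receive no
# updates and stay removed.
layer.weight.register_hook(lambda grad: grad * mask)

loss = layer(torch.randn(2, 4)).sum()
loss.backward()
assert torch.all(layer.weight.grad[mask == 0] == 0)
```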

What happens when we regularize? At first glance, there is no relation between pruning and regularization, because in regularization we just use an added loss term to put “downward pressure” on the weights (individually, or in grouped structures); we don't remove connections. So no masking should be involved, right?
Well, not quite: we use a “soft-thresholding operator” (i.e. thresholding + masking) to prevent the weights from oscillating around zero (I tried to show this in this notebook using L1 regularization on a toy example).
OK, so when we regularize, we also mask the weights – but what about the gradients? We leave the gradients alone, because we don’t want to completely remove the regularized connections from the network: we want them to continue passing information in the backward direction. Another way to look at the difference between pruning and regularization: pruned connections are removed forever, whereas regularized connections that are masked out (because they fall below some threshold) can sometimes grow back.
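In code, that difference might look something like this (again a simplified sketch with illustrative values for lambda and the threshold; the point is that only the weights are thresholded and masked, while the gradients flow untouched):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(4, 4, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
lambda_l1, threshold = 1e-3, 1e-2                # illustrative values

x, target = torch.randn(8, 4), torch.randn(8, 4)

data_loss = F.mse_loss(layer(x), target)
reg_loss = lambda_l1 * layer.weight.abs().sum()  # "downward pressure" on the weights
(data_loss + reg_loss).backward()                # gradients are NOT masked
opt.step()

# Thresholding + masking: zero the weights that were pushed close to zero so
# they stop oscillating around it.  They can still grow back later, because
# their gradients keep flowing.
with torch.no_grad():
    layer.weight[layer.weight.abs() < threshold] = 0.0
```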

  1. " I think it's natural that this happens on the end of one epoch or end of whole training when the regularization terms have been decreased enough for pruning."
    This is an interesting idea. If we implemented it, we wouldn't be able to easily see in the logs the sparsity of the weights during part of the training (because we most likely wouldn't have absolute zeros). But this is not the reason I chose to threshold regularization at the end of each mini-batch. You see, pruning is iterative and therefore not "continuous": we prune, then we fine-tune for a "long" time, then we prune some more, fine-tune some more, and so on. Regularization is "continuous" by definition: every time we compute the data loss, we also compute the regularization loss. And as far as I understand, the “soft-thresholding operator” is part of every regularization calculation (on_minibatch_end); see the sketch after this list.
    BTW, you can also configure the regularizer not to threshold. BTW 2: today we can only schedule pruning at the beginning of epochs, but in the future I want to allow scheduling of pruning at mini-batch granularity.

  2. "The regularization and pruning both use the same zeros_mask_dict, it may brings some messes." This is a good comment and it tells me that I didn't document the interaction between pruning and regularization. I think that when you choose to mix these two, you want to smoothly push the solution towards sparsity (using the regularization loss term), but prune using a more "clumsy" pruning schedule. Now, the only reason to use a pruner when you're already using a regularizer, is if the pruner is more aggressive than the regularizer (otherwise, the pruner does nothing - it's mask is below the regularization mask). To sum up: if you're both pruning and regularizing, don't enable the regularizer's mask.

  3. "What is the purpose of keeping the regulatization mask of the last epoch. I guess it may be used by some remover in thinning.py, right?" - Correct: we keep the mask to get sparsity, which we can exploit to remove structures (thinning.py).

Thanks for the interesting comments,
Neta


nzmora commented on May 22, 2024

@hunterkun I'm closing this because it has been idle for 19 days. If you have remaining questions we can reopen it, or you can open another issue.
Cheers,
Neta

