
yuanli2333 / teacher-free-knowledge-distillation


Knowledge Distillation: CVPR2020 Oral, Revisiting Knowledge Distillation via Label Smoothing Regularization

License: MIT License

Python 100.00%
knowledge-distillation pytorch teacher-free paper-implementations label-smoothing

teacher-free-knowledge-distillation's People

Contributors

dependabot[bot], yuanli2333

teacher-free-knowledge-distillation's Issues

Tf-KD self-training parameters in the paper?

Thanks for sharing the results of the paper.

I have a question about the Teacher-Free Self-Training loss function written in my_loss_function.py.
I see a parameter after the KLDiv with the T*T coefficient, and I was wondering where it comes from. I haven't seen it mentioned in the paper. Is that normal?
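
For reference, the T*T factor itself matches the convention from Hinton et al.'s original KD paper: the gradients of the temperature-softened KL term scale roughly as 1/T^2, so multiplying by T^2 keeps it on the same scale as the hard-label term. A minimal sketch of that convention (illustrative names, not the repo's exact code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def kd_soft_loss(student_logits, teacher_logits, T=20.0, multiplier=1.0):
        # KL divergence between temperature-softened teacher and student outputs
        kl = nn.KLDivLoss()(F.log_softmax(student_logits / T, dim=1),
                            F.softmax(teacher_logits / T, dim=1))
        # T*T rescales the softened term so its gradient magnitude stays
        # comparable to the hard-label cross-entropy term (Hinton et al., 2015)
        return kl * T * T * multiplier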

Have you ever tried deeper networks?

In the paper, the reported models are fairly shallow in general (ResNet18, GoogLeNet, etc.), and I find it is easier to get some improvement with shallower models. Have you ever tried deeper networks such as PreResNet-164 or ResNeXt-164? And does your method also give an improvement on CIFAR-10?

Wondering if it works on a weak and small student network

Distillation by a strong teacher can improve the accuracy of a much weaker and smaller student network. But in your paper, the student network structures are all fairly strong. I wonder whether teacher-free KD can also work well on a weak and small student network?

Question about KD Regularization in code

According to the KD Regularization code, the $D_{KL}$ term in the total loss is written as $D_{KL}(p^t_{\tau}, p)$.

While in the paper, according to Eqn (9) and Eqn (5), the $D_{KL}$ term in loss is written as $D_{KL}(p^t_{\tau}, p_{\tau})$.

Since $\tau$ is not always 1, the two expressions above are not the same.

Sort of confused about this.
Thanks.
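
For concreteness, here is a sketch of the two variants being compared, in the usual PyTorch formulation (variable names and shapes are illustrative, not the repo's code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    T = 20.0
    outputs = torch.randn(8, 100)          # student logits: batch of 8, 100 classes
    teacher_outputs = torch.randn(8, 100)  # teacher logits

    # Paper, Eqs. (5) and (9): both sides are softened, i.e. D_KL(p^t_tau || p_tau)
    loss_paper = nn.KLDivLoss()(F.log_softmax(outputs / T, dim=1),
                                F.softmax(teacher_outputs / T, dim=1))

    # As described above, the code softens only the teacher, i.e. D_KL(p^t_tau || p)
    loss_code = nn.KLDivLoss()(F.log_softmax(outputs, dim=1),
                               F.softmax(teacher_outputs / T, dim=1))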

Pretrained model for student network

Thanks for the great work.

I found the pre-trained models for the teacher networks.
Will you also release pre-trained models for the student networks?

Thanks!

Mismatch between Eq.9 in the paper and the code

Hello, thanks for your great work! I have a question about a possible mismatch between Eq. 9 in the paper and the actual implementation in the code.

Here are the loss equations of LS and of your proposed regularization:
[image: the loss equations, referred to below as Eq. 9 and Eq. 10]

As seen, the temperature $\tau$ is missing for the $p$ in Eq. 10 compared with Eq. 9. This might be problematic: in your paper (Sec. 4) and in many other places (such as this issue and the 2020 ICLR OpenReview), the existence of the temperature is an essential factor when you differentiate your method from Label Smoothing (LS), yet in practice it is not used.

This looks like a big mismatch in terms of methodology. For Eq. 10 above, I can set $\tau$ to a very large number so that $p^d_{\tau}$ becomes the uniform distribution $u$ (in fact, the values you picked, 20 or 40 in Tab. 6 of your supplementary material, are large enough for this to happen; you can print the value of F.softmax(teacher_soft/T, dim=1) in your code to verify it), and then set the $\alpha$ in Eq. 10 to the $\alpha$ in Eq. 3. Then Eq. 10 becomes exactly the same as Eq. 3. This suggests your implementation is really an over-parameterized version of LS, contradicting your claim in the paper and in many other places. Do you have any comments on this potential problem?

I may have misunderstood something; if so, please point it out. Thanks again!
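
A quick standalone check of the large-temperature claim (illustrative tensors, not the repo's code): as T grows, the softened distribution flattens towards the uniform distribution 1/K.

    import torch
    import torch.nn.functional as F

    teacher_logits = torch.randn(1, 100) * 5.0   # arbitrary logits over K = 100 classes
    for T in (1, 20, 40):
        p = F.softmax(teacher_logits / T, dim=1)
        # for large T, the max and min probabilities both approach 1/K = 0.01
        print(T, p.max().item(), p.min().item())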

Working with larger image sizes

I see that the default settings of the repo suit 32x32 input images. How can I make this work for larger images (e.g., 512x512)?
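
As a starting point, the input pipeline can be changed along these lines (a hedged sketch only; for 512x512 inputs the CIFAR-style models in the repo would likely also need their first convolution/downsampling layers and final pooling adjusted):

    import torchvision.transforms as transforms

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(512),      # crop/resize to the larger input size
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        # ImageNet statistics as a placeholder; recompute them for your own data
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])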

The baseline of ResNet18 on CIFAR-100 is relatively low

Hi, I first want to thank you for your work interpreting the relationship between KD and LSR. However, the ResNet18 baseline on CIFAR-100 is much lower than the implementation in pytorch-cifar100, which may be caused by the modified ResNet. In fact, based on pytorch-cifar100 and without any extra augmentation, the top-1 accuracy reached up to 78.05% in my previous experiments. So I have some doubt about the reported gain from self-distillation. I have also run an experiment with the distillation, which improves the baseline from 77.96% to 78.45%. It does improve the performance, but not as conspicuously as the paper claims.

KD loss is zero

My loss after distillation is 0, which seems very strange. I want to ask whether there is a problem with the distillation method or with the calculation of the distillation loss in the code. I'm a little confused; I hope the author or someone who knows can explain. Thanks.

Questions about the two Tf-KD methods

The first Tf-KD method is self-training, which is quite similar to the "Deep Mutual Learning" paper.

The second Tf-KD method is actually equal to LSR: let a = (1-alpha) + alpha/K and u = 1/K; then your manually designed distribution equals the LSR target (1-alpha)*p(k) + alpha*u, where p(k) is the hard-label distribution.
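
Spelling out the second point (my reading of the paper's manually designed teacher, with K classes and correct class c; notation may differ slightly from the paper):

$$
p^d(k) = \begin{cases} a, & k = c \\ \dfrac{1-a}{K-1}, & k \neq c \end{cases}
\qquad
(1-\alpha)\,p(k) + \frac{\alpha}{K} = \begin{cases} 1-\alpha+\dfrac{\alpha}{K}, & k = c \\ \dfrac{\alpha}{K}, & k \neq c \end{cases}
$$

so choosing $a = 1-\alpha+\alpha/K$ makes the manually designed distribution identical to the label-smoothed target.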

Question about the loss function of Tf-reg KD

Hi, thank you for sharing such an awesome project.
For the Tf-reg KD, in line 47 of my_loss_function.py, should we also divide the outputs by the temperature T, like:
loss_soft_regu = nn.KLDivLoss()(F.log_softmax(outputs / T, dim=1), F.softmax(teacher_soft/T, dim=1))*params.multiplier

As in Eq (9) of your paper, the loss function is $$D_{KL}(p^d_\tau, p_\tau)$$.

I would really appreciate it if you could help me. Look forward to your reply, thanks!

where's the paper?

I cannot find "Revisiting Knowledge Distillation via Label Smoothing Regularization" on arXiv, only "Revisit Knowledge Distillation: a Teacher-free Framework". Is there some difference between these two papers?

A question about MobileNetV2

The parameter for MobileNetV2's stage 6 is (3, 96, 160, 1, 6); however, it seems it should be (3, 96, 160, 2, 6).
Is this a bug or something else?
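
For reference, the inverted-residual configuration from the original MobileNetV2 paper (Table 2 of Sandler et al., 2018), written as (expansion t, output channels c, repeats n, stride s), does use stride 2 for the 160-channel stage; whether the repo's stride-1 choice is intentional (e.g. for small 32x32 inputs) is for the author to confirm:

    # (t, c, n, s) per the original MobileNetV2 paper; illustrative, not the repo's tuple format
    MOBILENETV2_CFG = [
        (1,  16, 1, 1),
        (6,  24, 2, 2),
        (6,  32, 3, 2),
        (6,  64, 4, 2),
        (6,  96, 3, 1),
        (6, 160, 3, 2),   # the stage in question: stride 2 in the paper
        (6, 320, 1, 1),
    ]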

Questions about KD loss

Hello!
I noticed that you didn't use the 'batchmean' reduction when calculating the KL loss on PyTorch 1.2.0.
Could you please tell me why you didn't use the 'batchmean' option?

Thanks~
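
For context, a minimal comparison of the two reduction modes of nn.KLDivLoss ('batchmean' and the default 'mean' are the real PyTorch options; the tensors below are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    log_p = F.log_softmax(torch.randn(8, 100), dim=1)   # student log-probabilities
    q = F.softmax(torch.randn(8, 100), dim=1)           # teacher probabilities

    # default 'mean' averages over all elements (batch * classes), so the value is
    # smaller than the per-sample KL divergence by a factor of the class count
    loss_mean = nn.KLDivLoss(reduction='mean')(log_p, q)

    # 'batchmean' sums over classes and averages over the batch, matching the
    # mathematical definition of KL divergence per sample
    loss_batchmean = nn.KLDivLoss(reduction='batchmean')(log_p, q)
    print(loss_mean.item(), loss_batchmean.item())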

Difference between L_REG and LSR

Sorry to disturb you; I wonder what the difference is between L_REG and LSR.
In my opinion, both LSR and L_REG are combinations of H(p, q) and H(u, q) with certain weights.
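
As I read the paper's formulation (a hedged summary, not a verbatim quote of its equations), the two losses can be written as

$$
L_{LS} = (1-\alpha)\,H(q, p) + \alpha\, D_{KL}(u \,\|\, p), \qquad
L_{REG} = (1-\alpha)\,H(q, p) + \alpha\, D_{KL}\!\left(p^d_{\tau} \,\|\, p_{\tau}\right),
$$

so the structural difference is the second term: LSR compares the untempered prediction $p$ with the uniform distribution $u$ (and $D_{KL}(u\,\|\,p)$ differs from $H(u,p)$ only by the constant $H(u)$), while $L_{REG}$ compares a temperature-softened prediction $p_{\tau}$ with a softened, manually designed teacher $p^d_{\tau}$ whose peak on the correct class is much higher than $1/K$.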

Torch Vision Version

Is the torchvision version in requirements.txt correct?

The version is written as
torchvision==0.4.0a0+9232c4a

so it gives an error. One more important thing: Python 3.6 is required to run this project; it gives errors with Python 3.7.

Many Bugs

Hi,

I would like to do some work based on your code, but found quite a few bugs here. Could you kindly check its validity? So far, I found that an "__init__.py" file needs to be added to the "model" folder, otherwise the models there cannot be imported. Second, the commenting style in "densenet.py" is problematic. Thirdly, I got this:

File "/Users/cherrysun/Documents/robot/torch/Teacher-free-Knowledge-Distillation/model/mobilenetv2.py", line 15, in __init__
    super().__init__()
TypeError: super() takes at least 1 argument (0 given)

I am afraid more bugs will appear before it really runs.
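
On the third point: the zero-argument super() call is Python 3 syntax, and the TypeError above is exactly what Python 2 raises for it. A sketch of the portable spelling (the class name is illustrative, chosen to mirror the file in the traceback):

    import torch.nn as nn

    class MobileNetV2(nn.Module):
        def __init__(self):
            # super().__init__() works only on Python 3 and raises the TypeError
            # above on Python 2; the explicit form below works on both.
            super(MobileNetV2, self).__init__()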

Data augmentation for Tiny-ImageNet

Hello,

How have you decided on the data augmentation transformations that you applied to Tiny-ImageNet? Have you used the settings from some other paper? Thank you in advance.
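
For comparison, a commonly used augmentation recipe for 64x64 Tiny-ImageNet images (a sketch only; this is the CIFAR-style recipe scaled up, with ImageNet normalization statistics as a placeholder, and not necessarily what the repo uses):

    import torchvision.transforms as transforms

    tiny_imagenet_train = transforms.Compose([
        transforms.RandomCrop(64, padding=4),   # analogous to the usual 32x32 CIFAR crop
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])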
