
yuanli2333 / teacher-free-knowledge-distillation


Knowledge Distillation: CVPR2020 Oral, Revisiting Knowledge Distillation via Label Smoothing Regularization

License: MIT License

Python 100.00%
knowledge-distillation pytorch teacher-free paper-implementations label-smoothing

teacher-free-knowledge-distillation's People

Contributors

dependabot[bot], yuanli2333

teacher-free-knowledge-distillation's Issues

Tf-KD self-training parameters in the paper?

Thanks for sharing the results of the paper.

I have a question about the Teacher-Free Self-Training loss function written in my_loss_function.py.
I see a parameter after the KLDiv with the T*T coefficient, and I was wondering where it comes from. I haven't seen it mentioned in the paper. Is that normal?
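
For reference, the T*T factor itself matches the convention from Hinton et al.'s original KD paper: the gradients of the temperature-softened KL term scale roughly as 1/T^2, so multiplying by T^2 keeps it on the same scale as the hard-label term. A minimal sketch of that convention (illustrative names, not the repo's exact code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def kd_soft_loss(student_logits, teacher_logits, T=20.0, multiplier=1.0):
        # KL divergence between temperature-softened teacher and student outputs
        kl = nn.KLDivLoss()(F.log_softmax(student_logits / T, dim=1),
                            F.softmax(teacher_logits / T, dim=1))
        # T*T rescales the softened term so its gradient magnitude stays
        # comparable to the hard-label cross-entropy term (Hinton et al., 2015)
        return kl * T * T * multiplier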

Have you ever tried deeper networks?

In the paper, the reported models are fairly shallow in general (ResNet18, GoogLeNet, etc.), and I find it is easier to get some improvement with shallower models. Have you ever tried deeper networks such as PreResNet-164 or ResNeXt-164? And does your method also give an improvement on CIFAR-10?

Wondering if it works on a weak and small student network

Distillation by a strong teacher can improve the accuracy of a much weaker and smaller student network. But in your paper, the student network structures are all fairly strong. I wonder whether teacher-free KD can also work well on a weak and small student network?

Question about KD Regularization in code

According to the KD Regularization code, the $D_{KL}$ term in the total loss is written as $D_{KL}(p^t_{\tau}, p)$.

While in the paper, according to Eqn (9) and Eqn (5), the $D_{KL}$ term in loss is written as $D_{KL}(p^t_{\tau}, p_{\tau})$.

Since $\tau$ is not always 1, the two expressions above are not the same.

Sort of confused about this.
Thanks.
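
For concreteness, here is a sketch of the two variants being compared, in the usual PyTorch formulation (variable names and shapes are illustrative, not the repo's code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    T = 20.0
    outputs = torch.randn(8, 100)          # student logits: batch of 8, 100 classes
    teacher_outputs = torch.randn(8, 100)  # teacher logits

    # Paper, Eqs. (5) and (9): both sides are softened, i.e. D_KL(p^t_tau || p_tau)
    loss_paper = nn.KLDivLoss()(F.log_softmax(outputs / T, dim=1),
                                F.softmax(teacher_outputs / T, dim=1))

    # As described above, the code softens only the teacher, i.e. D_KL(p^t_tau || p)
    loss_code = nn.KLDivLoss()(F.log_softmax(outputs, dim=1),
                               F.softmax(teacher_outputs / T, dim=1))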

Pretrained model for student network

Thanks for the great work.

I found the pre-trained models for the teacher networks.
Will you also release pre-trained models for the student networks?

Thanks!

Mismatch between Eq.9 in the paper and the code

Hello, thanks for your great work! I have a question about a possible mismatch between Eq. 9 in the paper and the actual implementation in the code.

Here are the loss equations of LS and of your proposed regularization:
[image: the loss equations, referred to below as Eq. 9 and Eq. 10]

As seen, the temperature $\tau$ is missing for the $p$ in Eq. 10 compared with Eq. 9. This might be problematic: in your paper (Sec. 4) and in many other places (such as this issue and the 2020 ICLR OpenReview), the existence of the temperature is an essential factor when you differentiate your method from Label Smoothing (LS), yet in practice it is not used.

This looks like a big mismatch in terms of methodology. For Eq. 10 above, I can set $\tau$ to a very large number so that $p^d_{\tau}$ becomes the uniform distribution $u$ (in fact, the values you picked, 20 or 40 in Tab. 6 of your supplementary material, are large enough for this to happen; you can print the value of F.softmax(teacher_soft/T, dim=1) in your code to verify it), and then set the $\alpha$ in Eq. 10 to the $\alpha$ in Eq. 3. Then Eq. 10 becomes exactly the same as Eq. 3. This suggests your implementation is really an over-parameterized version of LS, contradicting your claim in the paper and in many other places. Do you have any comments on this potential problem?

I may have misunderstood something; if so, please point it out. Thanks again!
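
A quick standalone check of the large-temperature claim (illustrative tensors, not the repo's code): as T grows, the softened distribution flattens towards the uniform distribution 1/K.

    import torch
    import torch.nn.functional as F

    teacher_logits = torch.randn(1, 100) * 5.0   # arbitrary logits over K = 100 classes
    for T in (1, 20, 40):
        p = F.softmax(teacher_logits / T, dim=1)
        # for large T, the max and min probabilities both approach 1/K = 0.01
        print(T, p.max().item(), p.min().item())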

Working with larger image sizes

I see that the default settings of the repo suit 32x32 input images. How can I make this work for larger images (e.g., 512x512)?
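
As a starting point, the input pipeline can be changed along these lines (a hedged sketch only; for 512x512 inputs the CIFAR-style models in the repo would likely also need their first convolution/downsampling layers and final pooling adjusted):

    import torchvision.transforms as transforms

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(512),      # crop/resize to the larger input size
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        # ImageNet statistics as a placeholder; recompute them for your own data
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])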

The baseline of ResNet18 on CIFAR-100 is relatively low

Hi, I first want to thank you for your work interpreting the relationship between KD and LSR. However, the ResNet18 baseline on CIFAR-100 is much lower than the implementation in pytorch-cifar100, which may be caused by the modified ResNet. In fact, based on pytorch-cifar100 and without any extra augmentation, the top-1 accuracy reached up to 78.05% in my previous experiments. So I have some doubt about the reported gain from self-distillation. I have also run an experiment with the distillation, which improves the baseline from 77.96% to 78.45%. It does improve the performance, but not as conspicuously as the paper claims.

KD loss is zero

My loss after distillation is 0, which seems very strange. I want to ask whether there is a problem with the distillation method or with the calculation of the distillation loss in the code. I'm a little confused; I hope the author or someone who knows can explain. Thanks.

Questions about the two Tf-KD methods

The first Tf-KD method is self-training, which is quite similar to the "Deep Mutual Learning" paper.

The second Tf-KD method is actually equal to LSR: let a = (1-alpha) + alpha/K and u = 1/K; then your manually designed distribution equals the LSR target (1-alpha)*p(k) + alpha*u, where p(k) is the hard-label distribution.
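
Spelling out the second point (my reading of the paper's manually designed teacher, with K classes and correct class c; notation may differ slightly from the paper):

$$
p^d(k) = \begin{cases} a, & k = c \\ \dfrac{1-a}{K-1}, & k \neq c \end{cases}
\qquad
(1-\alpha)\,p(k) + \frac{\alpha}{K} = \begin{cases} 1-\alpha+\dfrac{\alpha}{K}, & k = c \\ \dfrac{\alpha}{K}, & k \neq c \end{cases}
$$

so choosing $a = 1-\alpha+\alpha/K$ makes the manually designed distribution identical to the label-smoothed target.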

Question about the loss function of Tf-reg KD

Hi, thank you for sharing such an awesome project.
For the Tf-reg KD, in line 47 of my_loss_function.py, should we also divide the outputs by the temperature T, like:
loss_soft_regu = nn.KLDivLoss()(F.log_softmax(outputs / T, dim=1), F.softmax(teacher_soft/T, dim=1))*params.multiplier

As in Eq (9) of your paper, the loss function is $$D_{KL}(p^d_\tau, p_\tau)$$.

I would really appreciate it if you could help me. Look forward to your reply, thanks!

where's the paper?

I cannot find "Revisiting Knowledge Distillation via Label Smoothing Regularization" on arXiv, only "Revisit Knowledge Distillation: a Teacher-free Framework". Is there some difference between these two papers?

A question about MobileNetV2

The parameter for MobileNetV2's stage 6 is (3, 96, 160, 1, 6); however, it seems it should be (3, 96, 160, 2, 6).
Is this a bug or something else?
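
For reference, the inverted-residual configuration from the original MobileNetV2 paper (Table 2 of Sandler et al., 2018), written as (expansion t, output channels c, repeats n, stride s), does use stride 2 for the 160-channel stage; whether the repo's stride-1 choice is intentional (e.g. for small 32x32 inputs) is for the author to confirm:

    # (t, c, n, s) per the original MobileNetV2 paper; illustrative, not the repo's tuple format
    MOBILENETV2_CFG = [
        (1,  16, 1, 1),
        (6,  24, 2, 2),
        (6,  32, 3, 2),
        (6,  64, 4, 2),
        (6,  96, 3, 1),
        (6, 160, 3, 2),   # the stage in question: stride 2 in the paper
        (6, 320, 1, 1),
    ]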

Questions about KD loss

Hello!
I noticed that you didn't use the 'batchmean' reduction when calculating the KL loss on PyTorch 1.2.0.
Could you please tell me why you didn't use the 'batchmean' option?

Thanks~
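
For context, a minimal comparison of the two reduction modes of nn.KLDivLoss ('batchmean' and the default 'mean' are the real PyTorch options; the tensors below are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    log_p = F.log_softmax(torch.randn(8, 100), dim=1)   # student log-probabilities
    q = F.softmax(torch.randn(8, 100), dim=1)           # teacher probabilities

    # default 'mean' averages over all elements (batch * classes), so the value is
    # smaller than the per-sample KL divergence by a factor of the class count
    loss_mean = nn.KLDivLoss(reduction='mean')(log_p, q)

    # 'batchmean' sums over classes and averages over the batch, matching the
    # mathematical definition of KL divergence per sample
    loss_batchmean = nn.KLDivLoss(reduction='batchmean')(log_p, q)
    print(loss_mean.item(), loss_batchmean.item())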

Difference between L_REG and LSR

Sorry to disturb you; I wonder what the difference is between L_REG and LSR.
In my opinion, both LSR and L_REG are combinations of H(p, q) and H(u, q) with certain weights.
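
As I read the paper's formulation (a hedged summary, not a verbatim quote of its equations), the two losses can be written as

$$
L_{LS} = (1-\alpha)\,H(q, p) + \alpha\, D_{KL}(u \,\|\, p), \qquad
L_{REG} = (1-\alpha)\,H(q, p) + \alpha\, D_{KL}\!\left(p^d_{\tau} \,\|\, p_{\tau}\right),
$$

so the structural difference is the second term: LSR compares the untempered prediction $p$ with the uniform distribution $u$ (and $D_{KL}(u\,\|\,p)$ differs from $H(u,p)$ only by the constant $H(u)$), while $L_{REG}$ compares a temperature-softened prediction $p_{\tau}$ with a softened, manually designed teacher $p^d_{\tau}$ whose peak on the correct class is much higher than $1/K$.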

Torch Vision Version

Is the torchvision version in requirements.txt correct?

The version is written as
torchvision==0.4.0a0+9232c4a

so it gives an error. One more important thing: Python 3.6 is required to run this project; it gives errors with Python 3.7.

Many Bugs

Hi,

I would like to do some work based on your code, but found quite a few bugs here. Could you kindly check its validity? So far, I found that an "__init__.py" file needs to be added to the "model" folder, otherwise the models there cannot be imported. Second, the commenting style in "densenet.py" is problematic. Thirdly, I got this:

File "/Users/cherrysun/Documents/robot/torch/Teacher-free-Knowledge-Distillation/model/mobilenetv2.py", line 15, in __init__
    super().__init__()
TypeError: super() takes at least 1 argument (0 given)

I am afraid more bugs will appear before it really runs.
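
On the third point: the zero-argument super() call is Python 3 syntax, and the TypeError above is exactly what Python 2 raises for it. A sketch of the portable spelling (the class name is illustrative, chosen to mirror the file in the traceback):

    import torch.nn as nn

    class MobileNetV2(nn.Module):
        def __init__(self):
            # super().__init__() works only on Python 3 and raises the TypeError
            # above on Python 2; the explicit form below works on both.
            super(MobileNetV2, self).__init__()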

Data augmentation for Tiny-ImageNet

Hello,

How have you decided on the data augmentation transformations that you applied to Tiny-ImageNet? Have you used the settings from some other paper? Thank you in advance.
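
For comparison, a commonly used augmentation recipe for 64x64 Tiny-ImageNet images (a sketch only; this is the CIFAR-style recipe scaled up, with ImageNet normalization statistics as a placeholder, and not necessarily what the repo uses):

    import torchvision.transforms as transforms

    tiny_imagenet_train = transforms.Compose([
        transforms.RandomCrop(64, padding=4),   # analogous to the usual 32x32 CIFAR crop
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])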
