yuanli2333 / Teacher-free-Knowledge-Distillation
Knowledge Distillation: CVPR2020 Oral, Revisiting Knowledge Distillation via Label Smoothing Regularization
License: MIT License
Thanks for sharing the results of the paper.
I have a question about the Teacher-Free Self-Training loss function written in my_loss_function.py.
I see a parameter after the KLDiv with the T*T coefficient. I was wondering where it comes from? I haven't seen it mentioned in the paper. Is it normal?
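For context, here is a minimal sketch of the generic Hinton-style KD objective in which this factor usually appears (for self-training, the teacher outputs would come from a pretrained copy of the student); the function and variable names are illustrative, not the repository's. The gradients of the soft term scale as 1/T^2, so multiplying by T*T keeps the soft and hard terms on a comparable scale when T is tuned.

```python
import torch.nn as nn
import torch.nn.functional as F

def kd_self_training_loss(outputs, teacher_outputs, labels, alpha=0.9, T=20.0):
    """Hinton-style KD loss: hard CE + temperature-scaled KL, rescaled by T*T."""
    hard_loss = F.cross_entropy(outputs, labels)
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(outputs / T, dim=1),
        F.softmax(teacher_outputs / T, dim=1),
    ) * (T * T)  # compensates for the 1/T^2 gradient scaling of the soft term
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```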
In the paper, the reported models are relatively shallow in general (ResNet18, GoogLeNet, etc.). I found it is easier to get some improvement with shallower models. Have you ever tried deeper networks like PreResNet-164 or ResNeXt-164? And has your method achieved any improvement on CIFAR-10?
Distillation by a strong teacher can improve the accuracy of a much weaker and smaller student network. But in your paper, the student network structures are relatively strong. I wonder whether teacher-free KD can work well on a weak and small student network?
According to the KD Regularization code, the […] While in the paper, according to Eqn (9) and Eqn (5), the […] Since the […] Sort of confused about this. Thanks.
Thanks for the great work.
I found the pre-trained model for the teacher network.
Will you release the pre-trained model for the student network?
Thanks!
Hello, thanks for your great work! I have a question about a possible mismatch between Eq. 9 in the paper and the actual implementation in the code.
Here are the loss equations of LS, and your proposed regularization:
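(The equation images do not come through here; to my reading, the two losses being compared take the standard forms below, where $H$ is cross-entropy, $q$ the one-hot label distribution, $u = 1/K$ the uniform distribution, and $p^d_\tau$, $p_\tau$ the temperature-softened hand-designed teacher and student distributions. This is a reconstruction, not the author's exact typesetting.)

$$L_{LS} = (1-\alpha)\,H(q, p) + \alpha\, D_{KL}(u \,\|\, p)$$

$$L_{reg} = (1-\alpha)\,H(q, p) + \alpha\, D_{KL}(p^d_\tau \,\|\, p_\tau)$$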
As seen, the temperature only appears in F.softmax(teacher_soft/T, dim=1) (you can check your code to verify this), then set the …
I may be misunderstanding something; if so, please point it out. Thanks again!
Do you have an email address? I have some trouble with your code.
Hello, author.
I read the paper; the parameters (temperature and alpha) are obtained by grid search.
Can you release that code? I want to learn from it. Thank you.
I see that the default repo and settings suit 32x32 input images. How can I make this work for larger images (e.g., 512x512)?
:)
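A minimal sketch of one common way to adapt CIFAR-style models to larger inputs: resize/crop in the data transforms and swap a fixed-size average pool for an adaptive one. The attribute name `avgpool` and the crop size are assumptions for illustration, not this repository's exact code.

```python
import torch.nn as nn
from torchvision import transforms

# Data pipeline for larger inputs (values are illustrative).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(512),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# If the model ends with a fixed-size pooling layer (e.g. nn.AvgPool2d(8) for
# 32x32 inputs), adaptive pooling makes the classifier head independent of the
# spatial size of the final feature map.
def adapt_pooling(model):
    model.avgpool = nn.AdaptiveAvgPool2d(1)  # assumes an `avgpool` attribute
    return model
```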
Hi, this is very good and interesting work! What do you think about the effectiveness of this method on detection tasks?
I'm training the ImageNet model. The loss_function.py file doesn't contain the above two functions. Where can I find them?
Hi, I would first like to thank you for your work interpreting the relationship between KD and LSR. However, the baseline of ResNet18 on CIFAR-100 is much lower than the implementation in pytorch-cifar100, which may be caused by the modified ResNet. In fact, based on pytorch-cifar100, without any extra augmentations, the top-1 accuracy can reach up to 78.05% in my previous experiments. So I would cast doubt on the performance gain of the self-distillation. I have also conducted an experiment using the distillation, which improves the baseline from 77.96% to 78.45%. It does improve the performance, yet not as conspicuously as the paper claimed.
My loss after distillation is 0, which feels very strange. I want to ask whether there is a problem with the distillation method or with the calculation of the distillation loss function in the code. I am a little confused; I hope the author or someone who knows can tell me. Thanks.
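One common cause of a zero (or near-zero) KL term worth checking, sketched below under the assumption that nn.KLDivLoss is used as in standard PyTorch: it expects log-probabilities as the input and probabilities as the target, and it is exactly zero when the two distributions are identical (e.g. self-distillation against the same frozen outputs).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

kl = nn.KLDivLoss(reduction='batchmean')

student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
T = 20.0

# Correct usage: log-probabilities for the input, probabilities for the target.
loss = kl(F.log_softmax(student_logits / T, dim=1),
          F.softmax(teacher_logits / T, dim=1))

# Sanity check: identical distributions give KL = 0 by definition.
zero = kl(F.log_softmax(teacher_logits / T, dim=1),
          F.softmax(teacher_logits / T, dim=1))
print(loss.item(), zero.item())  # zero.item() is ~0.0
```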
The first Tf-KD method is self-training, which is quite similar to the "Deep Mutual Learning" paper.
The second Tf-KD method is actually equal to LSR: let a = (1-alpha) + alpha/K and u = 1/K; your manually designed distribution then equals the LSR target (1-alpha)*p(k) + alpha*u, where p(k) is the hard label distribution (see the worked equation below).
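A short worked version of that identification, written out under the standard LSR definition (my reading of the equivalence, not the author's exact derivation). With a one-hot label $q(k)$ and correct class $c$, the LSR target with smoothing $\alpha$ is

$$q'(k) = (1-\alpha)\,q(k) + \alpha u, \qquad u = \tfrac{1}{K},$$

so $q'(c) = (1-\alpha) + \tfrac{\alpha}{K}$ and $q'(k \neq c) = \tfrac{\alpha}{K}$. The hand-designed teacher assigns probability $a$ to the correct class and $\tfrac{1-a}{K-1}$ to every other class. Choosing $a = (1-\alpha) + \tfrac{\alpha}{K}$ gives

$$\frac{1-a}{K-1} = \frac{\alpha - \tfrac{\alpha}{K}}{K-1} = \frac{\alpha}{K},$$

so the two distributions coincide.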
Hi, thank you for sharing such an awesome project.
For the TF-reg KD, in line 47 of my_loss_function.py, should we also divide the outputs variable by the temperature T, like:
loss_soft_regu = nn.KLDivLoss()(F.log_softmax(outputs / T, dim=1), F.softmax(teacher_soft/T, dim=1))*params.multiplier
As in Eq (9) of your paper, the loss function is $$D_{KL}(p^d_\tau, p_\tau)$$.
I would really appreciate it if you could help me. Looking forward to your reply, thanks!
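For reference, a hedged side-by-side of the two forms being discussed in this and the earlier issue: the first reflects what the issues describe as the current implementation (temperature only on the hand-designed teacher term), the second is the Eq. (9)-style form with both distributions softened. The function and argument names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

def tfkd_reg_variants(outputs, teacher_soft, T, multiplier):
    kl = nn.KLDivLoss(reduction='batchmean')
    # As described in the issues above: temperature only on the teacher term.
    loss_as_implemented = kl(F.log_softmax(outputs, dim=1),
                             F.softmax(teacher_soft / T, dim=1)) * multiplier
    # Eq. (9)-style form: both p^d_tau and p_tau use the same temperature.
    loss_eq9 = kl(F.log_softmax(outputs / T, dim=1),
                  F.softmax(teacher_soft / T, dim=1)) * multiplier
    return loss_as_implemented, loss_eq9
```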
I cannot find "Revisiting Knowledge Distillation via Label Smoothing Regularization" on arXiv, only "Revisit Knowledge Distillation: a Teacher-free Framework". Is there some difference between those two papers?
As the title says, I do not understand why the KD loss is multiplied by this coefficient, or how to set the params.
Your parameter for MobileNetV2's stage6 is (3, 96, 160, 1, 6); however, it seems it should be (3, 96, 160, 2, 6).
Is this a bug or something else?
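For comparison, these are the inverted-residual settings from the original MobileNetV2 paper, where the 160-channel stage does start with stride 2; the tuple ordering in this repository may differ and CIFAR-style variants often change strides, so treat this only as a cross-check:

```python
# (expansion factor t, output channels c, number of blocks n, stride s)
MOBILENETV2_SETTINGS = [
    (1,  16, 1, 1),
    (6,  24, 2, 2),
    (6,  32, 3, 2),
    (6,  64, 4, 2),
    (6,  96, 3, 1),
    (6, 160, 3, 2),  # 160-channel stage: first block uses stride 2
    (6, 320, 1, 1),
]
```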
Hi,
Could you kindly check the source of the pre-trained models? I can't download them from the website you provided. Many thanks in advance.
Hello!
I noticed that you didn't use reduction='batchmean' when calculating the KL loss on PyTorch 1.2.0. Could you please tell me why you didn't use the batchmean option?
Thanks~
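For reference, a minimal sketch of the two reductions in standard PyTorch (>= 1.0): the default 'mean' averages over every element, while 'batchmean' divides the sum by the batch size, which matches the mathematical definition of per-sample KL divergence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student_logits = torch.randn(8, 100)
teacher_probs = F.softmax(torch.randn(8, 100), dim=1)
log_p = F.log_softmax(student_logits, dim=1)

# Default: averages over all elements (batch * classes), so the value is
# smaller than the true KL divergence by a factor of num_classes.
kl_mean = nn.KLDivLoss(reduction='mean')(log_p, teacher_probs)

# 'batchmean': sums over classes, averages over the batch.
kl_batchmean = nn.KLDivLoss(reduction='batchmean')(log_p, teacher_probs)
print(kl_mean.item(), kl_batchmean.item())
```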
Is the torchvision version in requirements.txt correct? The version is written as
torchvision==0.4.0a0+9232c4a
so it gives an error. One more important thing: Python 3.6 is required to run this project; it gives errors with Python 3.7.
The datasets used in the experiments section have many classes. Does this (teacher-free distillation) work for a dataset with only two classes?
Hi,
I would like to do some work based on your code, but I found quite a few bugs. Could you kindly check the validity for me? Up to now, I found that you need to add an "__init__.py" file to the "model" folder, otherwise the models there cannot be imported. Second, the commenting style is problematic in "densenet.py". Thirdly, I got this:
File "/Users/cherrysun/Documents/robot/torch/Teacher-free-Knowledge-Distillation/model/mobilenetv2.py", line 15, in __init__
    super().__init__()
TypeError: super() takes at least 1 argument (0 given)
I am afraid more bugs will appear before it actually runs.
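That particular TypeError is what Python 2 raises for a zero-argument super() call; the code is Python 3 only. If you need a version-agnostic form, the explicit call works in both; a minimal sketch (class name and arguments illustrative):

```python
import torch.nn as nn

class MobileNetV2(nn.Module):
    def __init__(self, num_classes=100):
        # Python 3 only:
        #   super().__init__()
        # Works in both Python 2 and 3:
        super(MobileNetV2, self).__init__()
        self.num_classes = num_classes
```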
Hello,
How did you decide on the data augmentation transformations that you applied to Tiny-ImageNet? Did you use the settings from some other paper? Thank you in advance.
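For readers wondering what a typical Tiny-ImageNet (64x64) pipeline looks like, a commonly used recipe is sketched below; the crop size, padding, and normalization statistics are conventional choices, not necessarily the ones used in this repository.

```python
from torchvision import transforms

# Commonly quoted Tiny-ImageNet statistics (approximate; recompute for your copy).
MEAN = [0.4802, 0.4481, 0.3975]
STD  = [0.2770, 0.2691, 0.2821]

train_transform = transforms.Compose([
    transforms.RandomCrop(64, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])
```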