alecwangcq / GraSP

Code for "Picking Winning Tickets Before Training by Preserving Gradient Flow" https://openreview.net/pdf?id=SkgsACVKPH

License: MIT License
Hi Chaoqi,
I was wondering whether dividing by the gradient norm is necessary. It seems like it doesn't affect the ordering, and therefore doesn't affect the results. I might be missing something.
Line 146 in f17d87a
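To illustrate the point being raised: dividing every score by the same positive scalar rescales the scores but cannot change their ranking, so a top-k pruning mask built from the scores is unchanged. A minimal sketch with hypothetical score values:

```python
import numpy as np

# Hypothetical per-weight importance scores and a positive gradient norm.
scores = np.array([0.7, -1.2, 0.05, 3.4, -0.3])
grad_norm = 2.5  # any positive scalar

# Dividing by a positive constant preserves the ordering of the scores,
# so the indices selected for pruning are identical either way.
order_raw = np.argsort(scores)
order_scaled = np.argsort(scores / grad_norm)
print(np.array_equal(order_raw, order_scaled))  # True
```

This only holds if the divisor is strictly positive and shared across all scores, which is the case for a single global gradient norm.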
Hi, quick question: why are the inputs and targets split into different batches here?
https://github.com/alecwangcq/GraSP/blob/master/pruner/GraSP.py#L82
Did you not have enough memory to compute the gradients in a single batch?
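For context on why splitting need not change the result: the gradient of a mean loss over a full batch equals the size-weighted average of the gradients over its micro-batches, so accumulation is a standard way to trade memory for extra passes. A minimal sketch with a toy mean-squared-error loss (all names here are illustrative, not from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad_mse(Xb, yb, w):
    # Gradient of 0.5 * mean((Xb @ w - yb)^2) with respect to w.
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one pass.
g_full = grad_mse(X, y, w)

# Same gradient accumulated over micro-batches of 2,
# weighting each chunk by its share of the full batch.
g_acc = np.zeros_like(w)
for i in range(0, len(y), 2):
    Xb, yb = X[i:i + 2], y[i:i + 2]
    g_acc += grad_mse(Xb, yb, w) * len(yb) / len(y)

print(np.allclose(g_full, g_acc))  # True
```

The two results agree to floating-point precision, which is why micro-batching is usually a pure memory optimization.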
Hello, thank you very much for your code. I am a little bit confused about some details in your paper. Could you please help me with them?
In your Equation 8, you multiply Hg by -theta. I do not understand why you have to multiply Hg by -theta, since Hg is already a measure of importance.
In your code, you use 1000 examples for CIFAR-100 and divide them into 4 batches of 250. While computing the Hessian-vector product, z += (grad_w[count].data * grad_f[count]).sum(), you use different numbers of examples for grad_w and grad_f: grad_w is the sum of gradients over all 4 batches of 250, but grad_f is only the gradient of the current batch of 250 examples. Why do you do it this way?
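For readers following this thread: the quantity being discussed is a Hessian-gradient product Hg, which can be computed without ever materializing H. A minimal numeric sketch on a toy quadratic loss, where the exact answer is known (the finite-difference step here stands in for the double-backprop pass the code performs with autograd):

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w, so grad(w) = A w and Hessian = A.
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T  # symmetric PSD Hessian
w = rng.normal(size=4)

def grad(w):
    return A @ w

g = grad(w)

# Hessian-gradient product Hg computed two ways:
hg_exact = A @ g
# Directional-derivative view: H v ~= (grad(w + eps*v) - grad(w)) / eps.
eps = 1e-6
hg_fd = (grad(w + eps * g) - grad(w)) / eps

print(np.allclose(hg_exact, hg_fd, atol=1e-4))  # True
```

Because the toy loss is quadratic, the finite-difference estimate matches the exact product up to floating-point error.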
If possible, could you please share a detailed proof of Eq. 7?
Thank you very much for your work!
Hi Chaoqi,
Thanks for sharing the awesome paper and code with us. I have a small question about the Hessian-gradient product.
What is the stop_grad function in the third line of Algorithm 2 of the original paper?
I checked GraSP_ImageNet.py and found that grad_w and grad_f seem to be the same: both are the gradient of the CE loss w.r.t. the weights. What is the difference between them, and is it possible to replace grad_w[count] * grad_f[count] (L102 of GraSP_ImageNet.py) with grad_w[count] * grad_w[count]?
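As background on the stop_grad question: in PyTorch, stopping gradient flow through a tensor corresponds to `.detach()`. A minimal sketch, assuming a toy quadratic loss (this is an illustration of the stop-gradient trick in general, not a claim about the authors' exact implementation):

```python
import torch

# Toy loss L(w) = sum(w^2), so g = 2w and the Hessian is H = 2I.
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
g, = torch.autograd.grad(loss, w, create_graph=True)  # g = 2w, kept in graph

# z = stop_grad(g) . g : only the second factor is differentiated,
# so dz/dw = stop_grad(g) * dg/dw = H g.
z = (g.detach() * g).sum()
hg, = torch.autograd.grad(z, w)
print(hg)  # tensor([4., 8.]), i.e. H g = 2I @ [2, 4]
```

Without the detach, both factors of g would be differentiated and the result would come out as 2*Hg instead of Hg, which is why the two tensors are not interchangeable even though their values coincide.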
I'm also curious about the input split in GraSP_ImageNet.py. If you don't have enough memory, why not just decrease the input batch_size? I guess there may be some special reason not to decrease batch_size?
Thanks,
Ziqi
Hello, when I train tiny-imagenet with this repo using GraSP, the test accuracy never reaches even 1%, while the training accuracy reaches 99%. This behavior occurs out of the box. Any idea what might be going on?
Hi Chaoqi,
Thanks for sharing the awesome paper and code with us. I have a small question about the code.
In configs/imagenet/resnet50/GraSP_80.json, "learning_rate" and "weight_decay" are both set to 0. Could you tell me the values of these two parameters in your experiment?
Thanks,
Pong