
dmcp's Issues

Error when sampling model at the last epoch

@Zx55
ubuntu:~/work_code/dmcp$ CUDA_VISIBLE_DEVICES=3 python main.py --mode train --data /data2/ImageNet --config config/mbv2/dmcp.yaml --flops 87
/home/wangzhaoming/work_code/dmcp/utils/tools.py:61: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[2020-05-20 09:47:21,712][ main.py][line: 60][ INFO] {'training': {'epoch': 40, 'sandwich': {'sample_type': 'offset', 'max_width': 1.5, 'min_width': 0.1, 'width_offset': 0.1, 'num_sample': 4}, 'label_smooth': 0.1, 'distillation': {'enable': True, 'temperature': 1, 'loss_weight': 1, 'hard_label': False}}, 'arch': {'target_flops': '87', 'train_freq': 1, 'sample_type': ['max', 'min', 'scheduled_random', 'scheduled_random'], 'floss_type': 'log_l1', 'flop_loss_weight': 0.1, 'num_flops_stats_sample': 3000, 'num_model_sample': 5, 'start_train': 400380}, 'validation': {'width': [1.5], 'calibration': {'enable': True, 'num_batch': 5}}, 'evaluation': {'width': [1.5], 'calibration': {'enable': True, 'num_batch': 5}}, 'model': {'type': 'DMCPMobileNetV2', 'kwargs': {'num_classes': 1000, 'input_size': 224, 'width': [0.1, 1.5, 0.1], 'prob_type': 'sigmoid'}, 'runner': {'type': 'DMCPRunner'}}, 'recover': {'enable': True, 'checkpoint': '/home/wangzhaoming/work_code/dmcp/results/DMCPMobileNetV2_87_051610/checkpoints/0520_0925.pth'}, 'distributed': {'enable': False}, 'optimizer': {'momentum': 0.9, 'weight_decay': 4e-05, 'nesterov': True, 'no_wd': True}, 'lr_scheduler': {'base_lr': 0.2, 'warmup_lr': 0.5, 'warmup_steps': 1000, 'min_lr': 0.08, 'max_iter': 800760}, 'arch_lr_scheduler': {'base_lr': 0.5, 'warmup_lr': 0.5, 'min_lr': 0.1, 'max_iter': 800760, 'warmup_steps': 400380}, 'dataset': {'type': 'ImageNet', 'augmentation': {'test_resize': 256, 'color_jitter': [0.2, 0.2, 0.2, 0.1]}, 'workers': 4, 'batch_size': 64, 'num_classes': 1000, 'input_size': 224, 'path': '/data2/ImageNet'}, 'logging': {'print_freq': 50}, 'random_seed': 0, 'save_path': './results/DMCPMobileNetV2_87_052009'}
[2020-05-20 09:47:21,748][normal_runner.py][line: 157][ INFO] using label_smooth: 0.1
[2020-05-20 09:47:21,748][normal_runner.py][line: 157][ INFO] sampling model...
Traceback (most recent call last):
  File "main.py", line 80, in <module>
    main()
  File "main.py", line 63, in main
    train(config, runner, loaders, checkpoint, tb_logger)
  File "main.py", line 39, in train
    runner.train(train_loader, val_loader, optimizer, lr_scheduler, tb_logger)
  File "/home/wangzhaoming/work_code/dmcp/runner/dmcp_runner.py", line 63, in train
    dmcp_utils.sample_model(self.config, self.model)
  File "/home/wangzhaoming/work_code/dmcp/models/dmcp/utils.py", line 157, in sample_model
    dist.barrier()
  File "/home/wangzhaoming/work_code/dmcp/utils/distributed.py", line 117, in barrier
    dist.barrier()
  File "/home/wangzhaoming/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1488, in barrier
    _check_default_pg()
  File "/home/wangzhaoming/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
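The failing call is the unconditional dist.barrier() reached via models/dmcp/utils.py: with distributed.enable: False, no default process group is ever created, so the barrier asserts. A minimal sketch of a guard that lets single-process runs skip the synchronization (a workaround under that assumption, not the repository's official fix):

import torch.distributed as dist

def barrier():
    # Only synchronize when a process group actually exists; in a
    # single-process run dist.is_initialized() is False and the call
    # becomes a no-op instead of an assertion failure.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()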

Retrain with Distillation

Hi,

First of all, thank you so much for releasing this repository.

As stated in Table 6 of the original paper, the DMCP results marked with a "*" superscript are obtained by retraining the pruned models with the slimmable method. Section 4.3 also explains that AutoSlim uses in-place distillation during retraining, so the comparison between AutoSlim and DMCP (without the superscript) is not fair.

In that case, could you please provide some details about the "retraining pruned models with the slimmable method" process? It would be even better if the relevant code and configurations could be released.

Thanks!
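For reference, the "slimmable with in-place distillation" retraining scheme the question refers to typically looks like the sketch below (following the universally slimmable training recipe in general, not dmcp's released code; model.set_width() is a hypothetical width switch): the full-width network trains on hard labels, and the narrower widths train against its detached soft predictions.

import torch.nn.functional as F

def train_step(model, x, y, widths, optimizer):
    optimizer.zero_grad()
    model.set_width(max(widths))                # hypothetical width switch
    max_out = model(x)
    F.cross_entropy(max_out, y).backward()      # hard labels at full width
    soft = max_out.detach().softmax(dim=-1)     # in-place distillation target
    for w in sorted(widths)[:-1]:               # narrower sub-networks
        model.set_width(w)
        out = model(x)
        F.kl_div(out.log_softmax(dim=-1), soft,
                 reduction='batchmean').backward()
    optimizer.step()                            # one step over accumulated grads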

os.environ['RANK'] raises KeyError: 'RANK'

I run:
python main.py --mode train --data data1/ImageNetOrigin --config config/mbv2/retrain.yaml
--flops 43 --chcfg ./results/DMCPMobileNetV2_43_MMDDHH/model_sample/expected_ch
Traceback (most recent call last):
  File "main.py", line 75, in <module>
    main()
  File "main.py", line 42, in main
    tools.init(config)
  File "/data1/task/tools/dmcp/utils/tools.py", line 28, in init
    dist.init_dist(config.distributed.enable)
  File "/data1/task/tools/dmcp/utils/distributed.py", line 29, in init_dist
    rank = int(os.environ['RANK'])
  File "/usr/local/miniconda3/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'
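utils/distributed.py reads os.environ['RANK'], which only exists when the process is started through torch.distributed.launch (or the variable is exported manually). One hedged workaround, assuming a single-process run is intended, is to fall back to rank 0:

import os

# RANK/WORLD_SIZE are injected by torch.distributed.launch; a plain
# `python main.py` run has neither, so default to one process, rank 0.
rank = int(os.environ.get('RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))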

Issue with training on multiple GPUs

When I set distributed.enable=False and train on multiple GPUs, the error below happens; training on a single GPU works.

Traceback (most recent call last):
  File "main.py", line 71, in <module>
    main()
  File "main.py", line 54, in main
    train(config, runner, loaders, checkpoint, tb_logger)
  File "main.py", line 30, in train
    runner.train(train_loader, val_loader, optimizer, lr_scheduler, tb_logger)
  File "/home/Brin1/dmcp-master/runner/dmcp_runner.py", line 46, in train
    self._train_one_batch(x, y, optimizer, lr_scheduler, meters, criterions, end)
  File "/home/Brin1/dmcp-master/runner/dmcp_runner.py", line 145, in _train_one_batch
    criterions, end)
  File "/home/Brin1/dmcp-master/runner/us_runner.py", line 201, in _train_one_batch
    out = self.model(x)
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 142, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 147, in replicate
    return replicate(module, device_ids)
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 53, in replicate
    param_idx = param_indices[param]
KeyError: Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0', requires_grad=True)

Thanks!
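A minimal repro of the likely failure mode (an assumption about the cause, not a statement about dmcp's internals): nn.DataParallel.replicate() builds an index over everything returned by module.parameters(), so a trainable parameter that a model filters out of parameters() — a common pattern for architecture parameters in NAS code — triggers exactly this KeyError when two or more GPUs are visible.

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)
        self.alpha = nn.Parameter(torch.zeros(14))   # arch parameter

    def parameters(self, recurse=True):
        # Filter alpha out, e.g. so the weight optimizer never sees it;
        # replicate() then cannot find it in its parameter index.
        return (p for n, p in self.named_parameters(recurse=recurse)
                if n != 'alpha')

    def forward(self, x):
        return self.fc(x) * self.alpha.sum()

model = nn.DataParallel(Net().cuda())      # needs >= 2 visible GPUs
model(torch.randn(8, 4).cuda())            # KeyError: Parameter containing: ...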

Why do the pruned MobileNet-V2 models DMCP 300M and DMCP 211M have higher parameter counts?

Hello,
I understand that DMCP prunes channels layer by layer, and hence the MAC (or FLOP) count decreases. However, I found that DMCP 211M and DMCP 300M, pruned from MobileNet-V2, have higher parameter counts than MobileNet-V2. Why does the parameter count increase?

Model          FLOPs (G)   Params (M)   Top-1 Acc   Top-5 Acc
MobileNet-V2   0.858       3.48         71.87       90.294
DMCP 300M      0.600       5.3          73.48       91.10
DMCP 211M      0.420       4.2          71.60       89.95

Best Regards,
Atul
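A back-of-the-envelope illustration of why this can happen (not DMCP's own FLOPs counter): for a convolution, the parameter count scales with channel counts only, while FLOPs also scale with the output resolution. Pruning early, high-resolution layers removes many FLOPs per parameter, so a FLOPs-constrained search can afford to keep, or even widen up to the 1.5x max_width, the late low-resolution layers that hold most of the parameters.

def conv_cost(c_in, c_out, k, out_hw):
    # params depend only on channel counts and kernel size; FLOPs
    # (multiply-adds) also depend on the output feature-map size.
    params = c_in * c_out * k * k
    flops = params * out_hw * out_hw
    return params, flops

print(conv_cost(32, 64, 3, 112))   # early layer: (18432, 231211008) -> ~12544 FLOPs/param
print(conv_cost(320, 1280, 1, 7))  # late layer:  (409600, 20070400) -> ~49 FLOPs/param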

KeyError: Parameter containing: during training

First of all, thank you for sharing your source code.
I get an error when I try to train MobileNet V2 with a single GPU (i.e., distributed.enable = False in config/mbv2/dmcp.yaml).
My config modifications are listed at the end (I used CIFAR10 and modified the loss class in the source code). The command:
$ python main.py --mode train --data ./dataset/ --config config/mbv2/dmcp.yaml --flops 43
Error message:

(base) root@452fa72bec2d:/workspace/hdd/06_model_compression/dmcp# python main.py --mode train --data ./dataset/ --config config/mbv2/dmcp.yaml --flops 43
/workspace/hdd/06_model_compression/dmcp/utils/tools.py:61: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[2020-07-30 02:19:50,612][ main.py][line: 51][ INFO] {'training': {'epoch': 40, 'sandwich': {'sample_type': 'offset', 'max_width': 1.5, 'min_width': 0.1, 'width_offset': 0.1, 'num_sample': 4}, 'label_smooth': 0.1, 'distillation': {'enable': True, 'temperature': 1, 'loss_weight': 1, 'hard_label': False}}, 'arch': {'target_flops': '43', 'train_freq': 1, 'sample_type': ['max', 'min', 'scheduled_random', 'scheduled_random'], 'floss_type': 'log_l1', 'flop_loss_weight': 0.1, 'num_flops_stats_sample': 3000, 'num_model_sample': 5, 'start_train': 15640}, 'validation': {'width': [1.5], 'calibration': {'enable': True, 'num_batch': 5}}, 'evaluation': {'width': [1.5], 'calibration': {'enable': True, 'num_batch': 5}}, 'model': {'type': 'DMCPMobileNetV2', 'kwargs': {'num_classes': 10, 'input_size': 32, 'width': [0.1, 1.5, 0.1], 'prob_type': 'sigmoid'}, 'runner': {'type': 'DMCPRunner'}}, 'recover': {'enable': False, 'checkpoint': 'None'}, 'distributed': {'enable': False}, 'optimizer': {'momentum': 0.9, 'weight_decay': 4e-05, 'nesterov': True, 'no_wd': True}, 'lr_scheduler': {'base_lr': 0.2, 'warmup_lr': 0.5, 'warmup_steps': 1000, 'min_lr': 0.08, 'max_iter': 31280}, 'arch_lr_scheduler': {'base_lr': 0.5, 'warmup_lr': 0.5, 'min_lr': 0.1, 'max_iter': 31280, 'warmup_steps': 15640}, 'dataset': {'type': 'CIFAR10', 'augmentation': {'test_resize': 32, 'color_jitter': [0.2, 0.2, 0.2, 0.1]}, 'workers': 4, 'batch_size': 64, 'num_classes': 10, 'input_size': 32, 'path': './dataset/'}, 'logging': {'print_freq': 50}, 'random_seed': 0, 'save_path': './results/DMCPMobileNetV2_43_073002'}
[2020-07-30 02:19:50,613][normal_runner.py][line: 159][ INFO] using label_smooth: 0.1
Traceback (most recent call last):
  File "main.py", line 75, in <module>
    main()
  File "main.py", line 54, in main
    train(config, runner, loaders, checkpoint, tb_logger)
  File "main.py", line 30, in train
    runner.train(train_loader, val_loader, optimizer, lr_scheduler, tb_logger)
  File "/workspace/hdd/06_model_compression/dmcp/runner/dmcp_runner.py", line 46, in train
    self._train_one_batch(x, y, optimizer, lr_scheduler, meters, criterions, end)
  File "/workspace/hdd/06_model_compression/dmcp/runner/dmcp_runner.py", line 145, in _train_one_batch
    criterions, end)
  File "/workspace/hdd/06_model_compression/dmcp/runner/us_runner.py", line 201, in _train_one_batch
    out = self.model(x)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 162, in replicate
    param_idx = param_indices[param]
KeyError: Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0', requires_grad=True)

dmcp.yaml - mbv2
training:
  epoch: 40
  sandwich:
    sample_type: offset
    max_width: &max_width 1.5
    min_width: &min_width 0.1
    width_offset: &width_offset 0.1
    num_sample: 4
  label_smooth: 0.1
  distillation:
    enable: true
    temperature: 1
    loss_weight: 1
    hard_label: False

arch:
  target_flops: None
  train_freq: 1
  sample_type: [max, min, scheduled_random, scheduled_random]
  floss_type: log_l1
  flop_loss_weight: 0.1
  num_flops_stats_sample: 3000
  num_model_sample: 5

validation:
  width: [*max_width]
  calibration:
    enable: True
    num_batch: 5

evaluation:
  width: [*max_width]
  calibration:
    enable: True
    num_batch: 5

model:
  type: DMCPMobileNetV2
  kwargs:
    num_classes: &num_classes 10
    input_size: &input_size 32
    width: [*min_width, *max_width, *width_offset]
    prob_type: sigmoid
  runner:
    type: DMCPRunner

recover:
  enable: False
  checkpoint: None

distributed:
  enable: False

optimizer:
  momentum: 0.9
  weight_decay: 0.00004
  nesterov: True
  no_wd: True

lr_scheduler:
  base_lr: 0.2
  warmup_lr: 0.5
  warmup_steps: 1000
  min_lr: 0.08

arch_lr_scheduler:
  base_lr: 0.5
  warmup_lr: 0.5
  min_lr: 0.1

dataset:
  type: CIFAR10
  augmentation:
    test_resize: 32
    color_jitter: [0.2, 0.2, 0.2, 0.1]
  workers: 4
  batch_size: 64
  num_classes: *num_classes
  input_size: *input_size

logging:
  print_freq: 50

random_seed: 0
save_path: ./results
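Assuming the crash again comes from nn.DataParallel replicating the model across every visible GPU (the data_parallel.py frames in the traceback suggest so), pinning the process to a single device before CUDA is initialized avoids the replicate() path entirely, since DataParallel with one device calls the wrapped module directly:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # must be set before torch touches CUDA

import torch
assert torch.cuda.device_count() == 1      # DataParallel now skips replicate()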

Unexpected loss values during training

Why is the loss at max_width so much greater than at min_width and the random widths during training? For example, the max_width loss is 59, while the other two are only about 0.001.
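A plausible explanation, offered as an assumption rather than a diagnosis of this run: under the sandwich rule with in-place distillation, min_width and the random widths are trained against the max-width model's soft outputs, and since those sub-networks share weights with their teacher, their KL loss is tiny, while the max-width loss is cross-entropy against hard labels (and, during the search phase, may also carry the FLOPs regularization term). The two scales are therefore not comparable:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 1000)
teacher = logits + 0.01 * torch.randn(4, 1000)   # student nearly matches teacher
labels = torch.randint(0, 1000, (4,))

kl = F.kl_div(F.log_softmax(logits, dim=-1), F.softmax(teacher, dim=-1),
              reduction='batchmean')
ce = F.cross_entropy(logits, labels)
print(kl.item(), ce.item())   # kl is near zero; ce is around log(1000) ~= 6.9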

KeyError: 'RANK' when running main.py

(torch1.4) wangzhaoming@ubuntu:~/work_code/dmcp$ python main.py --mode train --data /data2/ImageNet --config config/mbv2/dmcp.yaml --flops 87
/home/wangzhaoming/work_code/dmcp/utils/tools.py:62: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
Traceback (most recent call last):
  File "main.py", line 71, in <module>
    main()
  File "main.py", line 42, in main
    tools.init(config)
  File "/home/wangzhaoming/work_code/dmcp/utils/tools.py", line 27, in init
    dist.init_dist(config.distributed.enable)
  File "/home/wangzhaoming/work_code/dmcp/utils/distributed.py", line 30, in init_dist
    rank = int(os.environ['RANK'])
  File "/home/wangzhaoming/anaconda3/envs/torch1.4/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'
@Zx55
