canyonwind / single-path-one-shot-nas-mxnet

Single Path One-Shot NAS MXNet implementation with a full training and searching pipeline. Supports both Block and Channel Selection. Searched models that outperform the ones in the original paper are provided.

Python 98.14% Shell 1.86%
neural-architecture-search neural-network shufflenet mxnet gluon single-path-one-shot

single-path-one-shot-nas-mxnet's People

Contributors: canyonwind, celia-xy, yaooxii, zhennanqin

single-path-one-shot-nas-mxnet's Issues

Bug in random_channel_mask

If I set epoch_start_cs=30, then at epoch=0, epoch_after_cs will be -30.

epoch_delay_early = {0: 0,  # 8
                     1: 1, 2: 1,  # 7
                     3: 2, 4: 2, 5: 2,  # 6
                     6: 3, 7: 3, 8: 3, 9: 3,  # 5
                     10: 4, 11: 4, 12: 4, 13: 4, 14: 4,
                     15: 5, 16: 5, 17: 5, 18: 5, 19: 5, 20: 5,
                     21: 6, 22: 6, 23: 6, 24: 6, 25: 6, 27: 6, 28: 6,
                     29: 6, 30: 6, 31: 6, 32: 6, 33: 6, 34: 6, 35: 6, 36: 7,
                     }
epoch_delay_late = {0: 0,
                    1: 1,
                    2: 2,
                    3: 3,
                    4: 4, 5: 4,  # warm up epoch: 2 [1.0, 1.2, ... 1.8, 2.0]
                    6: 5, 7: 5, 8: 5,  # warm up epoch: 3 ...
                    9: 6, 10: 6, 11: 6, 12: 6,  # warm up epoch: 4 ...
                    13: 7, 14: 7, 15: 7, 16: 7, 17: 7,  # warm up epoch: 5 [0.4, 0.6, ... 1.8, 2.0]
                    18: 8, 19: 8, 20: 8, 21: 8, 22: 8, 23: 8}  # warm up epoch: 6, after 17, use all scales

if 0 <= epoch_after_cs <= 23 and self.stage_out_channels[0] >= 64:
    delayed_epoch_after_cs = epoch_delay_late[epoch_after_cs]
elif 0 <= epoch_after_cs <= 36 and self.stage_out_channels[0] < 64:
    delayed_epoch_after_cs = epoch_delay_early[epoch_after_cs]
else:
    delayed_epoch_after_cs = epoch_after_cs  # here delayed_epoch_after_cs = -30

# later in random_channel_mask:
channel_scale_start = max(2, 10 - (-30) - 2)  # channel_scale_start = 38
channel_choice = random.randint(channel_scale_start, len(self.candidate_scales) - 1)
# random.randint(38, 9) raises ValueError: empty range for randrange() (38, 10, -28)
                        

And why doesn't epoch_delay_early have the key 26?
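A minimal guard that would avoid the crash is sketched below (my suggestion, not the repo's fix): never index the delay tables with a negative epoch, and clamp channel_scale_start so that randint's range stays non-empty. Whether channel selection should instead be skipped entirely before epoch_start_cs is a design decision for the authors.

# Sketch of a defensive fix inside random_channel_mask. The error message shows
# len(self.candidate_scales) == 10, so valid indices are 0..9.
epoch_after_cs = max(0, epoch - epoch_start_cs)  # clamp: never negative
# ... delay-table lookup as above, then:
channel_scale_start = max(2, 10 - delayed_epoch_after_cs - 2)
channel_scale_start = min(channel_scale_start, len(self.candidate_scales) - 1)  # keep randint's range non-empty
channel_choice = random.randint(channel_scale_start, len(self.candidate_scales) - 1)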

Context switching causes multi-GPU idling

x = F.broadcast_mul(x, block_channel_mask.as_in_context(x.context))

running_mean = F.add(F.multiply(self.running_mean.data(), self.momentum.as_in_context(x.context)),
                     F.multiply(mean, self.momentum_rest.as_in_context(x.context)))
running_var = F.add(F.multiply(self.running_var.data(), self.momentum.as_in_context(x.context)),
                    F.multiply(var, self.momentum_rest.as_in_context(x.context)))
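Each of these as_in_context calls copies a small constant to the device inside the forward pass, and such cross-device copies can serialize the GPUs. One possible mitigation, as a sketch (my suggestion, assuming momentum and momentum_rest are plain NDArrays held on a single context), is to cache one replica per device outside the hot path:

# Hypothetical helper: keep one copy of a small, read-only NDArray per device,
# so the forward pass reuses it instead of issuing a cross-device copy each call.
class PerContextCache:
    def __init__(self, arr):
        self._src = arr
        self._copies = {arr.context: arr}

    def on(self, ctx):
        if ctx not in self._copies:
            self._copies[ctx] = self._src.as_in_context(ctx)
        return self._copies[ctx]

# e.g. momentum = PerContextCache(self.momentum); then in the forward pass:
# running_mean = F.add(F.multiply(self.running_mean.data(), momentum.on(x.context)), ...)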

Supernet training is too slow

Thanks for your MXNet implementation of SPOS ^_^ But I found supernet training to be very slow when training my own network. I profiled the training procedure and found the following problems.

First, imperative mode is much slower than hybrid mode. Then I tried to use more GPUs to train, but got no acceleration; instead, GPU utilization decreased dramatically as the number of GPUs increased. I suspect the computation on different GPUs runs serially rather than in parallel in imperative mode. Have you encountered these problems?

Furthermore, is there anything that can be improved to accelerate training? Could we use imperative mode when sampling a subnet, then switch to hybrid mode when training the subnet? (A toy sketch of this idea follows below.)

Waiting for your reply!
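For what it's worth, here is a toy sketch of the hybrid-mode idea from the last paragraph (a generic Gluon block, not the repo's supernet; whether the SPOS supernet can actually be hybridized per sampled subnet is exactly the open question):

import mxnet as mx
from mxnet.gluon import nn

# Build the network imperatively, then compile it to a static graph before
# the training loop. The graph is compiled lazily on the first forward pass.
net = nn.HybridSequential()
net.add(nn.Conv2D(16, kernel_size=3, padding=1), nn.Activation('relu'), nn.Dense(10))
net.initialize()
net.hybridize(static_alloc=True, static_shape=True)
out = net(mx.nd.random.uniform(shape=(1, 3, 32, 32)))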

Supernet model weights

Thanks a lot for the work. I am preparing to reproduce your experimental results. Can you release the model weights of supernet+ and supernet+S? Thank you very much.

Supernet Training with Constraints

Thanks for your excellent work!
When I train the supernet with constraints using the following script, I hit an error during validation.

export MXNET_SAFE_ACCUMULATION=1

python train_imagenet.py \
    --rec-train /data3/wangzhaoming/mxnet_imagenet/rec/train.rec --rec-train-idx /data3/wangzhaoming/mxnet_imagenet/rec/train.idx \
    --rec-val /data3/wangzhaoming/mxnet_imagenet/rec/val.rec --rec-val-idx /data3/wangzhaoming/mxnet_imagenet/rec/val.idx \
    --mode imperative --lr 1.3 --wd 0.00004 --lr-mode cosine --dtype float16 \
    --num-epochs 120 --batch-size 128 --num-gpus 8 -j 48 \
    --label-smoothing --no-wd --warmup-epochs 5 --use-rec \
    --model ShuffleNas \
    --epoch-start-cs 60 --cs-warm-up --channels-layout OneShot \
    --save-dir params_shufflenas_supernet --logging-file ./logs/shufflenas_supernet.log \
    --train-upper-constraints flops-160-params-2.5 --train-bottom-constraints flops-90-params-1.4 \
    --train-constraint-method evolution

Epoch[0] Batch [49] Speed: 322.095226 samples/sec accuracy=0.000605 lr=0.010393
Epoch[0] Batch [99] Speed: 492.513575 samples/sec accuracy=0.000791 lr=0.020787
Epoch[0] Batch [149] Speed: 457.981573 samples/sec accuracy=0.000937 lr=0.031180
Epoch[0] Batch [199] Speed: 688.650089 samples/sec accuracy=0.000903 lr=0.041573
Epoch[0] Batch [249] Speed: 465.918790 samples/sec accuracy=0.000957 lr=0.051967
Epoch[0] Batch [299] Speed: 490.846376 samples/sec accuracy=0.000957 lr=0.062360
Epoch[0] Batch [349] Speed: 606.910845 samples/sec accuracy=0.000977 lr=0.072753
Epoch[0] Batch [399] Speed: 567.445527 samples/sec accuracy=0.000986 lr=0.083147
Epoch[0] Batch [449] Speed: 618.184875 samples/sec accuracy=0.000990 lr=0.093540
Epoch[0] Batch [499] Speed: 593.677446 samples/sec accuracy=0.000982 lr=0.103933
Epoch[0] Batch [549] Speed: 631.991306 samples/sec accuracy=0.000978 lr=0.114327
Epoch[0] Batch [599] Speed: 614.757373 samples/sec accuracy=0.000985 lr=0.124720
Epoch[0] Batch [649] Speed: 568.749700 samples/sec accuracy=0.000975 lr=0.135114
Epoch[0] Batch [699] Speed: 610.768222 samples/sec accuracy=0.000961 lr=0.145507
Epoch[0] Batch [749] Speed: 659.102106 samples/sec accuracy=0.000961 lr=0.155900
Epoch[0] Batch [799] Speed: 563.044769 samples/sec accuracy=0.000964 lr=0.166294
Epoch[0] Batch [849] Speed: 572.482835 samples/sec accuracy=0.000959 lr=0.176687
Epoch[0] Batch [899] Speed: 611.510812 samples/sec accuracy=0.000969 lr=0.187080
Epoch[0] Batch [949] Speed: 585.310555 samples/sec accuracy=0.000970 lr=0.197474
Epoch[0] Batch [999] Speed: 586.269362 samples/sec accuracy=0.000970 lr=0.207867
Epoch[0] Batch [1049] Speed: 584.871140 samples/sec accuracy=0.000973 lr=0.218260
Epoch[0] Batch [1099] Speed: 580.345403 samples/sec accuracy=0.000976 lr=0.228654
Epoch[0] Batch [1149] Speed: 604.746532 samples/sec accuracy=0.000979 lr=0.239047
Epoch[0] Batch [1199] Speed: 425.625182 samples/sec accuracy=0.000976 lr=0.249440
Epoch[0] Batch [1249] Speed: 673.577257 samples/sec accuracy=0.000977 lr=0.259834
Traceback (most recent call last):
  File "train_imagenet.py", line 738, in <module>
    main()
  File "train_imagenet.py", line 734, in main
    train(context)
  File "train_imagenet.py", line 710, in train
    err_top1_val, err_top5_val = test(ctx, val_data, epoch)
  File "train_imagenet.py", line 439, in test
    ignore_first_two_cs=opt.ignore_first_two_cs)
  File "/data3/wangzhaoming/Single-Path-One-Shot-NAS-MXNet/oneshot_nas_network.py", line 248, in random_channel_mask
    channel_choice = random.randint(channel_scale_start, len(self.candidate_scales) - 1)
  File "/data3/wangzhaoming/anconda3/lib/python3.7/random.py", line 222, in randint
    return self.randrange(a, b+1)
  File "/data3/wangzhaoming/anconda3/lib/python3.7/random.py", line 200, in randrange
    raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (68,10, -58)
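This looks like the same failure mode as the random_channel_mask bug reported above. With --epoch-start-cs 60, validation at the end of epoch 0 works out to:

# Same arithmetic as in the bug report above, with epoch_start_cs=60 at epoch 0:
epoch_after_cs = 0 - 60                             # -60, falls into the else branch
channel_scale_start = max(2, 10 - (-60) - 2)        # 68
random.randint(68, len(self.candidate_scales) - 1)  # randint(68, 9) -> the ValueError above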

Official Review

Hello! Thanks so much for your submission!

When we try to run your model checkpoints we're getting a segmentation fault. We're using the mxnet/python:1.5.0_gpu_cu101_mkl_py3 docker image and running on a V100. We downloaded the dataset from your link. Could you confirm what environment you're evaluating in?

Thanks!
Trevor

Export ONNX model

I am trying to export the MXNet model to ONNX by adding the following at the end of the script Single-Path-One-Shot-NAS-MXNet/oneshot_nas_network.py:

import numpy as np
from mxnet.contrib import onnx as onnx_mxnet

sym_file_name = "./symbols/ShuffleNas_fixArch-symbol.json"
param_file_name = './symbols/ShuffleNas_fixArch-0000.params'
onnx_file_name = "./supernet_random.onnx"
input_shape = (1, 3, 224, 224)
converted_model_path = onnx_mxnet.export_model(sym_file_name, param_file_name, [input_shape],
                                               np.float32, onnx_file_name, verbose=True)


I am getting the following error:

AttributeError: ('Reshape: Shape value not supported in ONNX', -4)

The output_shape_list is [0, -4, 2, -1, -2], and the MXNet ONNX exporter does not support the following shape values:

not_supported_shape = [-2, -3, -4]

What am I doing wrong? Or is ONNX export not supported for this model?
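The special values come from the channel-shuffle reshape. If batch size and channel count are fixed at export time, one workaround (a hypothetical sketch; verify it matches the network's shuffle semantics) is to rewrite the reshape with explicit dimensions, which ONNX can represent:

# MXNet's reshape spec [0, -4, 2, -1, -2] on input (N, C, H, W) means (N, 2, C // 2, H, W).
# The same channel shuffle written with explicit, ONNX-friendly shapes:
def channel_shuffle_onnx_friendly(F, x, batch_size, channels, height, width, groups=2):
    x = F.reshape(x, shape=(batch_size, groups, channels // groups, height, width))
    x = F.swapaxes(x, 1, 2)
    return F.reshape(x, shape=(batch_size, channels, height, width))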

The loss of the supernet can't converge

Hi, thanks for your excellent work!
I am trying to reproduce your work, but when training the supernet the loss doesn't converge and the val top-1 error doesn't decline. My training script is:

python train_imagenet.py \
    --rec-train ~/facedata.mxnet.hot/rec2/train.rec --rec-train-idx ~/facedata.mxnet.hot/rec2/train.idx \
    --rec-val ~/facedata.mxnet.hot/rec2/val.rec --rec-val-idx ~/facedata.mxnet.hot/rec2/val.idx \
    --mode imperative --lr 0.65 --wd 0.00004 --lr-mode cosine --dtype float16 \
    --num-epochs 120 --batch-size 64 --num-gpus 1 -j 16 \
    --label-smoothing --no-wd --warmup-epochs 5 --use-rec \
    --model ShuffleNas \
    --epoch-start-cs 60 --cs-warm-up --use-se --last-conv-after-pooling --channels-layout OneShot \
    --save-dir params_shufflenas_supernet+ --logging-file ./logs/shufflenas_supernet+.log \
    --train-upper-constraints flops-330-params-5.0 --train-bottom-constraints flops-190-params-2.8 \
    --train-constraint-method evolution

Also, when running the test it reports an error; I changed to select_all_channels=True in lines 435 and 440 of train_imagenet.py.

supernet training

Training an updated version of the supernet results in the following error:

File "train_imagenet.py", line 512, in <module>
    main()
  File "train_imagenet.py", line 508, in main
    train(context)
  File "train_imagenet.py", line 393, in train
    trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params)
  File "/usr/local/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/gluon/trainer.py", line 100, in __init__
    self._contexts = self._check_contexts()
  File "/usr/local/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/gluon/trainer.py", line 113, in _check_contexts
    ctx = param.list_ctx()
  File "/usr/local/lib/python3.6/site-packages/mxnet-1.5.0-py3.6.egg/mxnet/gluon/parameter.py", line 539, in list_ctx
    raise RuntimeError("Parameter '%s' has not been initialized"%self.name)
RuntimeError: Parameter 'shufflenasoneshot0_features_fc_weight' has not been initialized
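One common cause of this error (an assumption on my part, not confirmed against the repo) is that gluon.Trainer inspects parameter contexts before deferred-shape parameters, such as the final FC weight here, have been materialized. Initializing the network and running one dummy forward pass before constructing the Trainer avoids that:

import mxnet as mx

# Sketch only: net / context / optimizer / optimizer_params are the variables
# already defined in train_imagenet.py's train().
net.initialize(mx.init.MSRAPrelu(), ctx=context)
dummy = mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=context[0])
_ = net(dummy)  # note: the supernet forward may also require sampled block/channel choices
trainer = mx.gluon.Trainer(net.collect_params(), optimizer, optimizer_params)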

evolution search

Hi!
When running evolution search, I encountered the following problems:
1. The parameters are not passed into the Evolver, so the later search runs on the CPU; it should be evolver = Evolver(net, train_data, val_data, batch_fn, param_dict, num_gpus=num_gpus) (see the sketch after the traceback below). In addition, some of the code there seems redundant.
2. To take a quick look at the effect of the evolutionary algorithm, I set the count to 20, but the following error occurred:

Traceback (most recent call last):
  File "search_supernet.py", line 527, in <module>
    main(num_gpus=1, batch_size=128, search_mode='genetic', dtype='float16', comparison_model='SinglePathOneShot')
  File "search_supernet.py", line 521, in main
    genetic_search(net, num_gpus=len(context), batch_size=batch_size, logger=logger, ctx=context)
  File "search_supernet.py", line 467, in genetic_search
    population, local_topk = evolver.evolve(population)
  File "search_supernet.py", line 300, in evolve
    selected2 = [(self.fitness_2nd_stage(person['block'], person['channel']), person) for person in parents]
  File "search_supernet.py", line 300, in <listcomp>
    selected2 = [(self.fitness_2nd_stage(person['block'], person['channel']), person) for person in parents]
  File "search_supernet.py", line 225, in fitness_2nd_stage
    batch_size=self.batch_size, update_bn_images=self.update_bn_images)
  File "search_supernet.py", line 113, in update_bn
    data, _ = batch_fn(batch, ctx)
TypeError: 'list' object is not callable
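The TypeError is consistent with the argument mismatch in point 1: if batch_fn is not passed through, another object (here a list) ends up bound to the batch_fn slot and is then called at search_supernet.py line 113. The construction suggested in point 1, for reference:

# Pass batch_fn and param_dict explicitly so each argument lands in its slot:
evolver = Evolver(net, train_data, val_data, batch_fn, param_dict, num_gpus=num_gpus)
population, local_topk = evolver.evolve(population)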

supernet training with resource constraints

Thanks for your excellent work!

The supernet trained with this strolling-evolution method could be more trustworthy when it is used to estimate a subnet's performance.

Have you compared the searched architectures obtained with strolling evolution against those obtained with random sampling?

The loss of the supernet doesn't seem to converge

Thanks for your amazing work!
I ran sh ./train_supernet.sh, but the loss doesn't seem to converge.

Namespace(batch_norm=False, batch_size=128, crop_ratio=0.875, data_dir='~/.mxnet/datasets/imagenet', dtype='float16', hard_weight=0.5, input_size=224, label_smoothing=True, last_gamma=False, log_interval=50, logging_file='shufflenas_supernet.log', lr=0.5, lr_decay=0.1, lr_decay_epoch='40,60', lr_decay_period=0, lr_mode='cosine', mixup=False, mixup_alpha=0.2, mixup_off_epoch=0, mode='imperative', model='ShuffleNas', momentum=0.9, no_wd=True, num_epochs=120, num_gpus=4, num_workers=8, rec_train='/ImageNet/ILSVRC2012_img_train_rec/_train.rec', rec_train_idx='/ImageNet/ILSVRC2012_img_train_rec/_train.idx', rec_val='/ImageNet/ILSVRC2012_img_val_rec/_val.rec', rec_val_idx='/ImageNet/ILSVRC2012_img_val_rec/_val.idx', resume_epoch=0, resume_params='', resume_states='', save_dir='params_shufflenas_supernet', save_frequency=10, teacher=None, temperature=20, use_gn=False, use_pretrained=False, use_rec=True, use_se=False, warmup_epochs=10, warmup_lr=0.0, wd=4e-05)

[13:25:44] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /ImageNet/ILSVRC2012_img_train_rec/_train.rec, use 8 threads for decoding..
[13:25:51] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: /ImageNet/ILSVRC2012_img_val_rec/_val.rec, use 8 threads for decoding..
Epoch[0] Batch [49] Speed: 535.322791 samples/sec accuracy=0.001016 lr=0.000999
Epoch[0] Batch [99] Speed: 1081.946811 samples/sec accuracy=0.001133 lr=0.001998
...
Epoch[0] Batch [2499] Speed: 843.473559 samples/sec accuracy=0.001005 lr=0.049962
[Epoch 0] training: accuracy=0.001005
[Epoch 0] speed: 863 samples/sec time cost: 1511.254638
[Epoch 0] validation: err-top1=0.999123 err-top5=0.995117
Epoch[1] Batch [49] Speed: 887.937809 samples/sec accuracy=0.000937 lr=0.051021
...
Epoch[1] Batch [2499] Speed: 871.302722 samples/sec accuracy=0.000991 lr=0.099984
[Epoch 1] training: accuracy=0.000991
[Epoch 1] speed: 876 samples/sec time cost: 1490.117548
[Epoch 1] validation: err-top1=0.998864 err-top5=0.994858
Epoch[2] Batch [49] Speed: 875.446324 samples/sec accuracy=0.001289 lr=0.101023
...
Epoch[2] Batch [2499] Speed: 916.964349 samples/sec accuracy=0.000999 lr=0.149986
[Epoch 2] training: accuracy=0.000998
[Epoch 2] speed: 872 samples/sec time cost: 1495.993527
[Epoch 2] validation: err-top1=0.998953 err-top5=0.994644
Epoch[3] Batch [49] Speed: 929.914738 samples/sec accuracy=0.000937 lr=0.151025
...
Epoch[3] Batch [2149] Speed: 878.990076 samples/sec accuracy=0.000966 lr=0.192993

Channel selection is disabled after resuming

https://github.com/CanyonWind/MXNet-Single-Path-One-Shot-NAS/blob/e1928e5bbf071ce76ddf7d9774ca11d07d8ab269/train_imagenet.py#L533-L534

It should be:

if epoch >= opt.epoch_start_cs:
    opt.use_all_channels = False

Because of this, this supernet, which was trained from epoch 0 to 70 and then resumed from 70 to the end, was actually trained mainly with Block Selection only. Only the epochs between 60 and 70 were trained with Block Selection + Channel Selection, and during that period the validation accuracy was found to drop to 1/1000. The same phenomenon was reported in #4 (comment).

On the contrary, this Block-Selection-only supernet works well with the random/genetic search, which explores Blocks as well as Channels (even though the channels were not randomly sampled during training as the original paper prescribes). This raises the possibility that randomly sampling (decoupling) channels during training may not be strongly related to obtaining a representative supernet.

Further experiments are required to make a more concrete conclusion.
