Neural architecture search (NAS)
I measured the speed on my own device and retrained with the new speed.txt file, but an error appeared:
Traceback (most recent call last):
File "/workspace/fbnet-pytorch/train_cifar10.py", line 105, in
speed_f=config.speed_f)
File "/workspace/fbnet-pytorch/model.py", line 84, in __init__
self._speed = torch.tensor(self._speed, requires_grad=False)
ValueError: expected sequence of length 8 at dim 1 (got 9)
What should I do?
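The error says the tensor constructor found a row of length 9 where it expected 8, so one row of the new speed.txt most likely has an extra field. A minimal sketch to locate the bad row, assuming speed.txt is a whitespace-separated table with one row per layer (the helper name is mine):

```python
def check_speed_rows(lines, expected_len=8):
    """Return (line_number, actual_length) for every row whose
    field count differs from expected_len. Blank lines are skipped."""
    bad = []
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        if fields and len(fields) != expected_len:
            bad.append((lineno, len(fields)))
    return bad
```

Running it as `check_speed_rows(open("speed.txt"))` should point directly at the offending line; either remove the extra entry or adjust the number of candidate blocks to match.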
Thank you for sharing this great code!
I wonder if you have tested the released architectures of FBNet-A, B, and C? I calculated the parameters of FBNet-B, but they don't match the original paper.
in_channels | out_channels | kernel | expansion | groups | stride | input | params |
---|---|---|---|---|---|---|---|
3 | 16 | 3 | 1 | 1 | 2 | 224 | 432 |
16 | 16 | 3 | 1 | 1 | 1 | 112 | 656 |
16 | 24 | 3 | 6 | 1 | 2 | 112 | 4704 |
24 | 24 | 5 | 1 | 1 | 1 | 56 | 1752 |
24 | 24 | 3 | 1 | 1 | 1 | 56 | 1368 |
24 | 24 | 3 | 1 | 1 | 1 | 56 | 1368 |
24 | 32 | 5 | 6 | 1 | 2 | 56 | 11664 |
32 | 32 | 5 | 3 | 1 | 1 | 28 | 8544 |
32 | 32 | 3 | 6 | 1 | 1 | 28 | 14016 |
32 | 32 | 5 | 6 | 1 | 1 | 28 | 17088 |
32 | 64 | 5 | 6 | 1 | 2 | 28 | 23232 |
64 | 64 | 5 | 1 | 1 | 1 | 14 | 9792 |
64 | 64 | 0 | 0 | 0 | 0 | 14 | 0 |
64 | 64 | 5 | 3 | 1 | 1 | 14 | 29376 |
64 | 112 | 5 | 6 | 1 | 1 | 14 | 77184 |
112 | 112 | 3 | 1 | 1 | 1 | 14 | 26096 |
112 | 112 | 5 | 1 | 1 | 1 | 14 | 27888 |
112 | 112 | 5 | 3 | 1 | 1 | 14 | 83664 |
112 | 184 | 5 | 6 | 1 | 2 | 14 | 215712 |
184 | 184 | 5 | 1 | 1 | 1 | 7 | 72312 |
184 | 184 | 5 | 6 | 1 | 1 | 7 | 433872 |
184 | 184 | 5 | 6 | 1 | 1 | 7 | 433872 |
184 | 352 | 3 | 6 | 1 | 1 | 7 | 601680 |
352 | 1504 | 1 | 1 | 1 | 1 | 7 | 529408 |
1504 | 1000 | 1 | 1 | 1 | 1 | 1 | 1504000 |
Total | | | | | | | 4129680 |
The calculated parameter count is 4.1M, which is lower than the 4.5M reported in the paper. The FLOPs are not consistent either. Have you calculated the number of parameters or FLOPs?
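For reference, the per-row numbers in the table above appear consistent with counting only convolution weights (no biases, no BatchNorm parameters) of an inverted-bottleneck block: a 1x1 expansion, a kxk depthwise convolution, and a 1x1 projection. A sketch of that count (the function name is mine, not from the repo):

```python
def fbnet_block_params(cin, cout, k, expansion, groups=1):
    """Weight count of an inverted-bottleneck block:
    1x1 expand -> kxk depthwise -> 1x1 project (no bias, no BN)."""
    hidden = cin * expansion
    expand = (cin * hidden) // groups    # 1x1 pointwise expansion
    depthwise = hidden * k * k           # kxk conv grouped per channel
    project = (hidden * cout) // groups  # 1x1 pointwise projection
    return expand + depthwise + project
```

For example, the 16 -> 16, k=3, e=1 row gives 256 + 144 + 256 = 656, matching the table. BatchNorm parameters and biases are omitted by this count and would account for part of the gap to 4.5M, though probably not all of it; the remainder may come from a differing block configuration.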
Besides:
Did you test a BN layer in the implementation? From the paper and your implementation, I didn't find BN in the middle of the block; however, it is used in ShuffleNet v2. Have you tested its effect? Will performance be better if you remove the BN layer in the middle?
Have you ever searched architectures with latency measured on GPU? I first evaluated latency on my Titan XP GPU and searched for FBNet architectures, and found the search tends to select the highest-parameter module in every layer, with almost no variance. However, in the original paper the authors say they sampled 6 different architectures after training. I wonder whether your architectures searched with GPU/CPU latency show that the learned distribution can generate both large models like FBNet-C and lightweight models like FBNet-A.
Why do you use 1984 as the FC layer width? Tab. 1 in the paper has both 1504 and 1984, which confuses me.
There are three branches in your project; which one is the latest?
The size of CIFAR-10 images is (32, 32), and ImageNet images are (224, 224). Why is input_shape (1, 3, 108, 108)?
def measure(blocks, input_shape = (1, 3, 108, 108), result_path='speed_custom.txt'):
@JunrQ
Put lat in the loss list, i.e. loss = [..., stop_grad(lat)],
and print it with print(outputs[-1].asnumpy()).
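In PyTorch terms, the same idea can be sketched as follows: keep the latency term in the training loss, and carry along a detached copy purely for logging (alpha and the tensor values here are illustrative assumptions, not the repo's settings):

```python
import torch

def combined_loss(ce, lat, alpha=0.2):
    """Return the training loss plus a detached latency for logging.

    detach() plays the role of stop_grad: the returned latency can be
    printed each step but contributes no gradient of its own.
    """
    return ce + alpha * lat, lat.detach()

ce = torch.tensor(2.3, requires_grad=True)
lat = torch.tensor(280.0, requires_grad=True)
loss, lat_log = combined_loss(ce, lat)
loss.backward()
print(lat_log.item())  # logged latency; no gradient flows through it
```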
Thanks for your code! I think the 'and' in line 224 should be changed to 'or'.
Line 224 in d430c32
What is the model size (FLOPs and number of parameters) for the CIFAR-10 trained model? How should we constrain the FLOPs of the final searched model?
Test issues
Parameters should be initialized with the same array on different GPUs. @JunrQ
It seems that 1e-9 is not suitable for your new code. In my experiment the output cost is about 6e-04, and the accompanying loss is about 2.414636e+00 during the early steps.
NAS/snas/snas/train_cifar10.py
Line 31 in f5b0f25
According to your pytorch-cifar10.alpha.0.01.init_lat.438.log, the final accuracy is 0.84164 and the latency is 283.62738 ms. Unfortunately, I couldn't reproduce it: strangely, my accuracy was only 0.67671 and my latency was 553.25839 ms. Additionally, the lat_loss kept increasing when I watched it on TensorBoard.
Look forward to your reply.
Hello,
I trained a supernet on CIFAR to get a suitable theta, then tried to retrain a sampled net, only to find the output of every layer becomes NaN after just 3 batches. Have you met this situation before? What should I do?
Thanks.
You didn't use kernel size in class FBnetblock.
What's wrong with BN?
Hi JunrQ,
Thanks for your work, it helps a lot. When I trained FBNet on CIFAR-10, the accuracy began to drop quickly at epoch 66, and by epoch 71 the accuracy was about 0.1 with both loss and ce being NaN. Is something wrong, or is this normal?
By the way, the lowest loss was about 8.5 at epoch 27, and it got higher afterwards.
Thanks!
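Divergence to NaN mid-training like this often traces back to a learning-rate or numerical issue; one way to catch it at the first bad step rather than epochs later is to guard the optimizer step with a finiteness check. A minimal sketch, with illustrative names and a clipping threshold that is an assumption, not a repo setting:

```python
import torch

def guarded_step(loss, optimizer, max_grad_norm=5.0):
    """Backprop with a finiteness check and gradient clipping.

    Raises as soon as the loss turns NaN/Inf, so the exact step
    where training diverges is visible in the traceback.
    """
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        (p for group in optimizer.param_groups for p in group["params"]),
        max_grad_norm,
    )
    optimizer.step()
```

If the guard fires early, lowering the learning rate or the Gumbel temperature schedule is the usual first thing to try.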
In the original paper of SNAS, after the best architecture is found, the model is retrained on the training set. Could you please upload the code for retraining? Thank you so much!
Line 42 in 5c23276
Hi, JunrQ:
Thanks for your work, it is really quite helpful!
I have a question: I found that in your FBNet source code you generate batch_size models for the batch_size samples in each batch, yet the total loss is summed and a single loss.backward() is called. How does this backward() apply: to a single model, or to all batch_size models? Besides, I wonder why you use this method for FBNet, while in the SNAS code a single model is generated, loss.backward() is called, and then two .step() functions are applied.
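For what it's worth, when the batch_size sampled models all share one set of weights and architecture parameters, summing the per-sample losses and calling backward() once simply accumulates every sample's gradient into those shared tensors. A toy sketch of that mechanism (not the repo's code; softmax over noised logits stands in for Gumbel-softmax here):

```python
import torch

# One shared architecture parameter used by several "sampled" paths.
theta = torch.zeros(3, requires_grad=True)

total_loss = 0.0
for _ in range(4):  # 4 samples, each drawing its own soft architecture
    weights = torch.nn.functional.softmax(theta + torch.randn(3), dim=-1)
    total_loss = total_loss + (weights * torch.arange(3.0)).sum()

total_loss.backward()  # one backward accumulates all 4 samples' grads
assert theta.grad is not None and theta.grad.shape == (3,)
```

So the single backward() is effectively applied to all sampled models at once, with each sample contributing its share to the shared parameters' gradients.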
Have you reproduced the results in the original paper?
In my understanding, MAC means "multiply-add cost", but the following code seems to calculate the memory access cost of DilConv.
```python
MAC1 = (self.width * self.height) * (op.op[1].in_channels + op.op[1].in_channels)
MAC1 += (op.op[1].kernel_size[0] ** 2 * op.op[1].in_channels * op.op[1].out_channels) / op.op[1].groups
MAC2 = (self.width * self.height) * (op.op[2].in_channels + op.op[2].in_channels)
MAC2 += (op.op[2].kernel_size[0] ** 2 * op.op[2].in_channels * op.op[2].out_channels) / op.op[2].groups
MAC = MAC1 + MAC2
```
Line 55 in 5c23276
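As a point of comparison, the memory access cost of a grouped convolution in the ShuffleNet v2 sense counts feature-map reads/writes plus weight reads, which is the shape of the quoted expression. A standalone version (variable names are mine):

```python
def conv_mac(h, w, cin, cout, k, groups=1):
    """Memory access cost of a k x k grouped conv on an h x w map:
    input + output feature maps, plus the weight tensor."""
    feature_maps = h * w * (cin + cout)   # activations read and written
    weights = k * k * cin * cout // groups  # weight tensor size
    return feature_maps + weights
```

Note that the quoted code adds in_channels twice in the feature-map term, where this formulation would use in_channels + out_channels; that may be intentional for DilConv's structure, but it seems worth checking.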
Hi, thanks for your code. But when I trained on the ImageNet dataset, I found some bugs; for example, the code cannot find self.samples in data.py. Did you check your ImageNet code before uploading it?
Sometimes nn.functional.gumbel_softmax returns NaN when computed on GPU; this does not happen when computed on CPU.
test code:
```python
import torch
import torch.nn as nn
import math

if __name__ == "__main__":
    batch_size = 128
    temperature = 5.0
    theta = torch.FloatTensor([1.753356814384460449, 1.898535370826721191, 0.6992630958557128906,
                               0.2227068245410919189, 0.6384450793266296387, 1.431323885917663574,
                               -0.05012089386582374573, -0.06672633439302444458])
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    t_gpu = theta.repeat(batch_size, 1).to(device)
    max_num = 1000000
    nan_num = 0
    for i in range(max_num):
        weight = nn.functional.gumbel_softmax(t_gpu, temperature)
        if math.isnan(torch.sum(weight)):
            nan_num += 1
    print("GPU: nan {:.3f}% probability happen, tot {}".format(100.0 * nan_num / max_num, nan_num))

    nan_num = 0
    t_cpu = theta.repeat(batch_size, 1)
    for i in range(max_num):
        weight = nn.functional.gumbel_softmax(t_cpu, temperature)
        if math.isnan(torch.sum(weight)):
            nan_num += 1
    print("CPU: nan {:.3f}% probability happen, tot {}".format(100.0 * nan_num / max_num, nan_num))
```
got results:
GPU: nan 0.004% probability happen, tot 38
CPU: nan 0.000% probability happen, tot 0
I'm not sure whether this is a bug in PyTorch, a bug in gumbel_softmax, or some restriction on the values of theta.
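One pragmatic workaround, if this matters in practice, is to resample whenever a NaN slips through. The helper below is my own sketch and does not fix the underlying cause; it only keeps a rare bad draw from poisoning the training step:

```python
import torch
import torch.nn.functional as F

def safe_gumbel_softmax(logits, tau, max_tries=5):
    """Retry gumbel_softmax on NaN; fall back to a plain softmax."""
    for _ in range(max_tries):
        weight = F.gumbel_softmax(logits, tau)
        if torch.isfinite(weight).all():
            return weight
    # Extremely unlikely fallback: deterministic softmax at the same tau.
    return F.softmax(logits / tau, dim=-1)
```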
Line 50 in 5c23276