bayeswatch / nas-without-training

Code for Neural Architecture Search without Training (ICML 2021)

Python 99.51% Shell 0.49%


nas-without-training's Issues

About target

Hi, I noticed that you do not backpropagate a loss; instead you call backward on the forward result directly.

However, the target interface and the criterion are still present in your code. Why is that? Did you try using the target to compute a loss and backpropagating that, and find that it performed badly?

Related code:

nas-without-training/search.py

Line 58 in a98f872

def eval_score(jacob, labels=None):

nas-without-training/search.py

Line 50 in a98f872

_, y = net(x)

Hi! I have a question about the score calculation. In your code, you sum over all modules that have a ReLU to calculate network.K, i.e. K_{H}. You then calculate logdet(network.K) without normalization, where normalization would set the diagonal entries to 1. However, the total number of modules with a ReLU differs between networks, so without normalizing network.K the score seems to prefer networks that have more ReLU modules. Is normalizing network.K necessary? In my opinion, normalization may be the right thing to do. I hope you can give me an answer. Thanks!
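The scale effect raised in this question can be isolated with a small NumPy sketch. It assumes, as the question describes, that K counts for each pair of inputs the ReLU units whose on/off state agrees; the helper names (binary_kernel, normalise) and the random binary codes are hypothetical and are not taken from the repository.

```python
import numpy as np

def binary_kernel(codes):
    # K[i, j] counts the units whose on/off state agrees between samples i and j
    codes = codes.astype(float)
    return codes @ codes.T + (1.0 - codes) @ (1.0 - codes).T

def logdet(K):
    _, value = np.linalg.slogdet(K)
    return value

def normalise(K):
    # Rescale so that every diagonal entry is 1
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)
codes_small = rng.integers(0, 2, size=(8, 64))  # 8 samples, 64 ReLU units
codes_large = np.tile(codes_small, (1, 4))      # same patterns over 4x as many units

K_small = binary_kernel(codes_small)
K_large = binary_kernel(codes_large)            # equals 4 * K_small

# Difference in log-determinant caused purely by the number of units
gap_raw = logdet(K_large) - logdet(K_small)
gap_normalised = logdet(normalise(K_large)) - logdet(normalise(K_small))
```

Here the unnormalised gap is 8 * log(4), one log(4) per sample, while the normalised gap vanishes. Real architectures with more ReLU modules also produce different activation patterns, so this only shows the contribution of scale and does not settle whether normalization improves the ranking.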

Wrong correlation matrix dimension?

In both 'search.py' and 'plot_histograms.py', the shape of Jacob is [batch_size, CHW] and the shape of the correlation matrix is [batch_size, batch_size]. However, the correct shape of the correlation matrix should be [CHW, CHW], where C=3, H=32, W=32. Therefore, the shape of Jacob should be [CHW, batch_size]. This causes wrong correlation matrix histograms and an incorrect architecture score.
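The two shapes discussed here follow from how np.corrcoef interprets its input: by default each row is treated as a variable. A minimal sketch, using a small random stand-in for the Jacobian (the sizes and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, chw = 4, 3 * 8 * 8           # small stand-in for 3 * 32 * 32
jacob = rng.standard_normal((batch_size, chw))

# Default rowvar=True: each row (one input sample) is a variable
corr_between_samples = np.corrcoef(jacob)

# rowvar=False: each column (one input dimension) is a variable
corr_between_pixels = np.corrcoef(jacob, rowvar=False)
```

The first call gives a [batch_size, batch_size] matrix that compares samples with each other, and the second gives a [CHW, CHW] matrix that compares input dimensions. Which of the two the method intends is the question being asked; the sketch only shows how each shape arises.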

IDE

Hi
Which IDE are you using?
Best regards,
PeterPham

Score computed on eval or train mode network?

I'm having a hard time sifting through the code to find the answer.

Should one compute the NASWOT score on a network in train or eval mode?

I get that we are trying to measure how much the network creates these linear separations relative to a dataset. To me this suggests that something like dropout should not be considered when computing the score, which speaks for using eval mode. Batchnorm, on the other hand, seems like it should be included since we are normalizing the data, which speaks for using train mode. But if the batch norm is unfit, then eval is possibly better.

Input would be appreciated.

Question about the histograms of the correlation matrix for different archs

Firstly, thank you for sharing this project with us. However, I have some questions about the histograms of the correlation matrix in your paper.

I also tried to plot the histograms, and my implementation steps are as follows:

1. Collect the CIFAR-10 validation accuracy of each arch in NAS-Bench-201.
2. Randomly sample 10 archs in each accuracy interval (>=90%, [80%,90%], [70%,80%], [60%,70%], <60%) and, for each arch, randomly sample a mini-batch (256) from the CIFAR-10 training dataset to calculate the correlation matrix.

The histograms I plot are as follows:

Compared to the histograms reported in your paper, the above histograms have two main differences:

1. The histogram is not symmetric about 0, although I think that symmetry about 0 is not meaningful.
2. The difference between the correlation matrices of archs in different accuracy intervals is not obvious, especially in [70%,80%], [60%,70%], <60%.

So is there anything wrong with my implementation? Looking forward to your reply.
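The histogram step described in this question can be sketched as follows. This is a hypothetical helper, not the repository's plot_histograms.py: it assumes a Jacobian array with one row per sample, and the random input stands in for a real Jacobian.

```python
import numpy as np

def correlation_histogram(jacob, bins=20):
    # Correlation between the per-sample rows of the Jacobian
    corr = np.corrcoef(jacob)
    # Drop the diagonal, which is always 1 and would dominate the histogram
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    counts, edges = np.histogram(off_diag, bins=bins, range=(-1.0, 1.0))
    return counts, edges

rng = np.random.default_rng(0)
jacob = rng.standard_normal((16, 48))    # stand-in: 16 samples, 48 input dimensions
counts, edges = correlation_histogram(jacob)
```

Whether the diagonal is excluded, and whether the bins cover [-1, 1] symmetrically, both affect how symmetric the plot looks, so these are worth comparing against the original plotting code.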

Is it possible to convert one of the best archs to a TensorFlow model?

I want to measure almost everything related to the "best" arch found, for example the model size in MB, how many bytes are used during inference, etc.
I only know how to do this with the TensorFlow API, so is there any way to import the final model, or any architecture, and convert it to a TF model?

Question about development of score when network is trained

Just out of interest, did you also measure the scores when actually training the networks after each epoch? How do they develop?

I was wondering if the scores are even more correlated to the final model's accuracy when they are measured not directly after initialization but after 1 or 2 epochs.

implementation of KL divergence

I understand that the eigenvalues of an uncorrelated matrix (i.e. the identity matrix) would all be ones.
But I don't understand how the KL divergence comes to be expressed as np.log(eigval) + 1/eigval.

My understanding is that KL divergence = -sum[ f(x) log( g(x) / f(x) ) ].
Maybe I'm just not understanding how f(x) and g(x) are expressed in terms of eigval. I hope for some enlightenment. Thanks.
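One way to arrive at this form, assuming f is a zero-mean Gaussian with identity covariance and g is a zero-mean Gaussian whose covariance is the N x N correlation matrix \Sigma with eigenvalues \lambda_i (this choice of f and g is an assumption, not something stated in the issue), is the closed-form KL divergence between two Gaussians:

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\mathcal{N}(0, I)\,\|\,\mathcal{N}(0, \Sigma)\big)
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma^{-1}\big) - N + \ln\det\Sigma\Big] \\
  &= \tfrac{1}{2}\sum_{i=1}^{N}\Big[\ln\lambda_i + \frac{1}{\lambda_i} - 1\Big]
\end{aligned}
```

The second line uses \operatorname{tr}(\Sigma^{-1}) = \sum_i 1/\lambda_i and \ln\det\Sigma = \sum_i \ln\lambda_i. Up to the factor 1/2 and the constant -N, this is the sum of np.log(eigval) + 1/eigval, and each term is smallest at \lambda_i = 1, i.e. when the matrix is the identity.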

NAS-Bench-101: stem layer size

Hello!
First, thanks a lot for your paper and your code.

As for the issue, I found an inconsistency in your implementation of NAS-Bench-101.
The number of convolution filters reported in the NAS-Bench-101 paper is 128.
However, as far as I can judge from your code, 16 filters are implemented instead:

parser.add_argument('--stem_out_channels', default=16, type=int, help='output channels of stem convolution (nasbench101)')

This change made a big difference for my trainless metric.

While I suppose this modification originates from the memory limitation problem,
I believe it should be explicitly mentioned both in the code and in your paper.

Cheers,
Ekaterina

[Question] Initialisation induced score variation

Hello,

I'm posting this question here because I don't know where else to ask it, but I can take it down if this is not the proper place for it.

Looking at the paper, it seemed to me that, given the relatively large variation in score due to different initialisations, averaging the scores obtained from n different initialisations could give a much more consistent score.

However, after implementing a quick version of that idea, the results showed no significant improvement.
Can someone explain to me why I was wrong?

Thanks in advance

About the calculation of the Jacobian.

First, thanks for your nice work!
But I have a question: when you calculate the Jacobian, you multiply the output by an all-ones matrix to get a scalar.
For example, when batch_size=1, the input shape is 32x32x1 (flattened to 1024) and the output shape is 10. By definition, we should compute the gradient of each output with respect to the input, which gives a Jacobian matrix of shape 10x1024. However, when you multiply the output by an all-ones matrix to get a scalar, you can only get a Jacobian matrix of shape 1x1024.
Hence, what does this scalar mean? And are these two matrices (10x1024 vs 1x1024) equivalent?
Best wishes, and I look forward to your reply.
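The relationship between the two shapes can be checked on a tiny ReLU network in NumPy. This is a sketch with made-up weights and sizes; it uses the analytic Jacobian of the toy network rather than the repository's autograd call.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 1024, 32, 10
W1 = rng.standard_normal((n_hidden, n_in))
W2 = rng.standard_normal((n_out, n_hidden))
x = rng.standard_normal(n_in)

# Toy network y = W2 @ relu(W1 @ x), which is piecewise linear in x
active = (W1 @ x > 0).astype(float)           # ReLU on/off pattern at x
full_jacobian = W2 @ (active[:, None] * W1)   # dy/dx, shape (10, 1024)

# Backpropagating a vector of ones is a vector-Jacobian product
ones = np.ones(n_out)
vjp = ones @ full_jacobian                    # shape (1024,)
```

The 1x1024 result is the gradient of the sum of the ten outputs, which equals the sum of the rows of the 10x1024 Jacobian. It is a projection of the full Jacobian, so the two matrices are not equivalent in general.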

Question about using det or norm in equation 2.

Hi, thanks for your interesting work. I notice that the determinant is used in this code rather than the norm. Is it better than the latter?

derive cell structure from info

I would like to find commonalities between well performing architectures by parsing info of InferCell. I'm struggling to decipher the description of the cell structure, though.

example:

info :: nodes=4, inC=64, outC=64, 
[1<-(I0-L0) | 2<-(I0-L1,I1-L2) | 3<-(I0-L3,I1-L4,I2-L5)],
|nor_conv_3x3~0|+|nor_conv_3x3~0|none~1|+|nor_conv_1x1~0|skip_connect~1|skip_connect~2|

Could you please break down how to interpret this? E.g. how do I figure out which layers the skip connection links? Or could you point me to documentation for TinyNetwork?
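A minimal parsing sketch for the architecture string in the example above. It assumes the NAS-Bench-201 convention that groups separated by + belong to nodes 1, 2, 3 in order, and that op~j means the operation is applied to the output of node j; parse_cell is a hypothetical helper, not part of the repository.

```python
def parse_cell(arch_str):
    """Return a list of (op, source_node, target_node) edges."""
    edges = []
    # Each '+'-separated group lists the incoming edges of one node
    for target, group in enumerate(arch_str.split('+'), start=1):
        for token in group.strip('|').split('|'):
            op, source = token.split('~')
            edges.append((op, int(source), target))
    return edges

arch = '|nor_conv_3x3~0|+|nor_conv_3x3~0|none~1|+|nor_conv_1x1~0|skip_connect~1|skip_connect~2|'
edges = parse_cell(arch)

# Pairs of (source, target) nodes linked by a skip connection
skip_edges = [(src, dst) for op, src, dst in edges if op == 'skip_connect']
```

Under that reading, the two skip connections in this cell link node 1 to node 3 and node 2 to node 3, which agrees with the 3<-(I0-L3,I1-L4,I2-L5) part of the info string.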

CIFAR-10 with standard classifiers available in torchvision.models

@jack-willturner, I was trying this methodology with standard classifiers. I am not getting anything useful: the scores come out as roughly a multiple of the batch size.

Models chosen -
"alexnet", "resnet18", "resnet34", "resnet50", "vgg11", "vgg13", "densenet121", "squeezenet1_0", "squeezenet1_1"

Scores of 3 runs with Batch Size 512 -
[-515.8197304827107, -518.2625130150999, -516.5254794393328, -516.4521991392903, -513.0445545389762, -513.0351409136783, -531.7337148363374, -526.7972858922, -523.973769767945]

[-515.9147741416524, -517.8983410664996, -515.5328530338215, -515.9739907753669, -513.055884116389, -513.0808931741735, -533.8211941483576, -518.7453811789933, -518.6898948738232]

[-515.8902849753072, -518.636564681304, -516.60644909449, -515.057577734371, -513.0407785346342, -513.0632672108508, -527.9074191760042, -518.9160167714085, -521.2741464876093]

Scores of 3 runs with Batch Size 128 -

[-128.26608670187613, -128.9030387211241, -128.50543259689053, -128.45200275464504, -128.06485022283965, -128.06361606092634, -131.5393815597839, -130.2799460305505, -129.34274766217487]

[-128.27500716523485, -128.8883337356769, -128.3406425853629, -128.5692902076405, -128.06446535511876, -128.06785661838, -132.06972535717804, -128.77940184757026, -128.60743606369488]

[-128.28348337334037, -129.04774963648924, -128.53128977454088, -128.25740881809102, -128.0661722689121, -128.06880782087086, -130.62699720044384, -128.84707498551833, -128.83386057804955]

Scores of 3 runs with Batch Size 16 -
[-16.00536177981646, -16.023735408663864, -16.00541814906714, -16.01275948633266, -16.001070293337335, -16.000959487618033, -16.088079533081366, -16.00829194903181, -16.008265848684317]

[-16.005564368012045, -16.037812752585545, -16.01270083749682, -16.007351551415354, -16.000882548413536, -16.000910242701053, -16.246709654891255, -16.06660858188009, -16.014516625366127]

[-16.005648553602278, -16.020132359932354, -16.011393278637634, -16.006370556147218, -16.001099750108267, -16.000646158922933, -16.20028038506129, -16.014573365687735, -16.004068435407078]

Any idea about this?
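One possible reading of these numbers, assuming the score is the negative sum of np.log(eigval) + 1/eigval over the eigenvalues of the correlation matrix (the expression quoted in the KL divergence issue; the small constant k is an assumption added for numerical stability): if the correlation matrix is close to the identity, every eigenvalue is close to 1, every term is close to 1, and the score is close to minus the batch size.

```python
import numpy as np

def kl_style_score(corrs, k=1e-5):
    # Eigenvalues of a symmetric correlation matrix
    v = np.linalg.eigvalsh(corrs)
    return -np.sum(np.log(v + k) + 1.0 / (v + k))

# Score of a perfectly uncorrelated batch for each batch size used above
identity_scores = {b: kl_style_score(np.eye(b)) for b in (16, 128, 512)}
```

Under this assumption, scores that sit just below -16, -128 and -512 would suggest that the per-sample Jacobians of these models are almost uncorrelated on the batch, so the score has little room to separate the architectures. This is a hypothesis to check against the actual correlation matrices, not a confirmed diagnosis.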

Analysis on imagenet-1k

Hi, I am unsure what happens if the input size becomes larger, e.g. 224. In that case, the dimension of the Jacobian matrix becomes 3x224x224 (after flattening).

Does your method fail when the Jacobian matrix dimension becomes very large?

About the implementation of NASWOT based on the released method in ICLR 2021

Hi, I observe that you have modified the method from the ICLR 2021 version. There are two main differences:

1. The inputs changed from different images to the same image repeated 256 times with cutout;
2. The score calculation changed from the KL-divergence-based method to a sum of indicator functions.

I also tried to reproduce the results based on your repo, but I failed. I mainly modified two parts of the code, as follows:

Add torchvision.transforms.RandomErasing augmentation:

class RandomErasing:
  def __init__(self, batch_size=256):
    self.batch_size = batch_size
    self.random_erasing = torchvision.transforms.RandomErasing(p=0.9, scale=(0.02,0.04))

  def __call__(self, image):
    # Apply an independent random erasing to batch_size copies of the same image
    images = []
    for i in range(self.batch_size):
      image_erase = self.random_erasing(image)
      images.append(image_erase)

    images_out = torch.stack(images)

    return images_out


def get_datasets(name, root, cutout, random_erasing=True):

  if name == 'cifar10':
    mean = [x / 255 for x in [125.3, 123.0, 113.9]]
    std  = [x / 255 for x in [63.0, 62.1, 66.7]]
  elif name == 'cifar100':
    mean = [x / 255 for x in [129.3, 124.1, 112.4]]
    std  = [x / 255 for x in [68.2, 65.4, 70.4]]
  elif name.startswith('imagenet-1k'):
    mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
  elif name.startswith('ImageNet16'):
    mean = [x / 255 for x in [122.68, 116.66, 104.01]]
    std  = [x / 255 for x in [63.22,  61.26 , 65.09]]
  else:
    raise TypeError("Unknow dataset : {:}".format(name))

  # Data Augmentation
  if name == 'cifar10' or name == 'cifar100':
    lists = [transforms.RandomHorizontalFlip(), transforms.RandomCrop(32, padding=4), transforms.ToTensor(), transforms.Normalize(mean, std)]
    if cutout > 0 : lists += [CUTOUT(cutout)]
    if random_erasing: lists += [RandomErasing()]
    train_transform = transforms.Compose(lists)
    test_transform  = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean, std)])
    xshape = (1, 3, 32, 32)
  elif name.startswith('ImageNet16'):
    lists = [transforms.RandomHorizontalFlip(), transforms.RandomCrop(16, padding=2), transforms.ToTensor(), transforms.Normalize(mean, std)]
    if cutout > 0 : lists += [CUTOUT(cutout)]
    if random_erasing: lists += [RandomErasing()]
    train_transform = transforms.Compose(lists)
    test_transform  = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean, std)])
    xshape = (1, 3, 16, 16)
  elif name == 'tiered':
    lists = [transforms.RandomHorizontalFlip(), transforms.RandomCrop(80, padding=4), transforms.ToTensor(), transforms.Normalize(mean, std)]
    if cutout > 0 : lists += [CUTOUT(cutout)]
    if random_erasing: lists += [RandomErasing()]
    train_transform = transforms.Compose(lists)
    test_transform  = transforms.Compose([transforms.CenterCrop(80), transforms.ToTensor(), transforms.Normalize(mean, std)])
    xshape = (1, 3, 32, 32)
  elif name.startswith('imagenet-1k'):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    if name == 'imagenet-1k':
      xlists    = [transforms.RandomResizedCrop(224)]
      xlists.append(
        transforms.ColorJitter(
        brightness=0.4,
        contrast=0.4,
        saturation=0.4,
        hue=0.2))
      xlists.append( Lighting(0.1))
    elif name == 'imagenet-1k-s':
      xlists    = [transforms.RandomResizedCrop(224, scale=(0.2, 1.0))]
    else: raise ValueError('invalid name : {:}'.format(name))
    xlists.append( transforms.RandomHorizontalFlip(p=0.5) )
    xlists.append( transforms.ToTensor() )
    xlists.append( normalize )
    train_transform = transforms.Compose(xlists)
    test_transform  = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(), normalize])
    xshape = (1, 3, 224, 224)
  else:
    raise TypeError("Unknow dataset : {:}".format(name))

  if name == 'cifar10':
    train_data = dset.CIFAR10 (root, train=True , transform=train_transform, download=True)
    test_data  = dset.CIFAR10 (root, train=False, transform=test_transform , download=True)
    assert len(train_data) == 50000 and len(test_data) == 10000
  elif name == 'cifar100':
    train_data = dset.CIFAR100(root, train=True , transform=train_transform, download=True)
    test_data  = dset.CIFAR100(root, train=False, transform=test_transform , download=True)
    assert len(train_data) == 50000 and len(test_data) == 10000
  elif name.startswith('imagenet-1k'):
    train_data = dset.ImageFolder(osp.join(root, 'train'), train_transform)
    test_data  = dset.ImageFolder(osp.join(root, 'val'),   test_transform)
    assert len(train_data) == 1281167 and len(test_data) == 50000, 'invalid number of images : {:} & {:} vs {:} & {:}'.format(len(train_data), len(test_data), 1281167, 50000)
  elif name == 'ImageNet16':
    train_data = ImageNet16(root, True , train_transform)
    test_data  = ImageNet16(root, False, test_transform)
    assert len(train_data) == 1281167 and len(test_data) == 50000
  elif name == 'ImageNet16-120':
    train_data = ImageNet16(root, True , train_transform, 120)
    test_data  = ImageNet16(root, False, test_transform , 120)
    assert len(train_data) == 151700 and len(test_data) == 6000
  elif name == 'ImageNet16-150':
    train_data = ImageNet16(root, True , train_transform, 150)
    test_data  = ImageNet16(root, False, test_transform , 150)
    assert len(train_data) == 190272 and len(test_data) == 7500
  elif name == 'ImageNet16-200':
    train_data = ImageNet16(root, True , train_transform, 200)
    test_data  = ImageNet16(root, False, test_transform , 200)
    assert len(train_data) == 254775 and len(test_data) == 10000
  else: raise TypeError("Unknow dataset : {:}".format(name))
  
  class_num = Dataset2Class[name]
  return train_data, test_data, xshape, class_num

change the method of score calculation:

def eval_score(jacob, labels=None):
    corrs = np.corrcoef(jacob)
    corrs = np.where((corrs>0.0)&(corrs<0.25), 1.0, 0.0)
    score = np.sum(corrs)

    return score

The rest of the code is unchanged. But I obtain 85.75% CIFAR-10 test accuracy with N=10, runs=500, seed=1. Could you please give me some tips for reproducing the results in your latest paper?

Random indices in search.py

Hi @jack-willturner, I came across the code in search.py and noticed that the sampled architecture indices are generated with np.random.randint, which can produce duplicate indices.

This should not be a big issue when the sample size is relatively small. However, I think it is better to replace it with np.random.choice to avoid evaluating the same network twice.

Thanks,
Bill
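The difference between the two sampling calls can be sketched as follows. The search space size and sample count are made up for illustration. Note that np.random.choice samples with replacement by default, so replace=False has to be passed explicitly to rule out duplicates.

```python
import numpy as np

rng = np.random.RandomState(1)
n_archs, n_samples = 100, 50                  # illustrative sizes only

# randint draws each index independently, so repeats are possible
with_replacement = rng.randint(0, n_archs, size=n_samples)

# choice with replace=False guarantees distinct indices
without_replacement = rng.choice(n_archs, size=n_samples, replace=False)

n_unique_randint = len(set(with_replacement.tolist()))
n_unique_choice = len(set(without_replacement.tolist()))
```

With replace=False the number of unique indices always equals n_samples, which requires n_samples to be no larger than the number of architectures.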

ResNeXt-A doesn't work

I provide my solution here:
In nasspace.py, insert the following after line 280:

if len(config['bot_muls']) > 0:
    if config['bot_muls'][0] == 0:
        config['bot_muls'][0] = 1

Full imagenet performance

Thanks for the interesting work.
Did you happen to experiment on the full Imagenet? Are results consistent for that dataset as well?

ImageNet16-120 dataset download

Can anyone tell me how to download the ImageNet16-120 dataset? Or is it created from the original ImageNet dataset? If it is, can anyone tell me whether there is a script that I can use to create the ImageNet16-120 dataset?

Thank you

mobilenetv2-based search space

Is the method in the paper suitable for the mobilenetv2-based search space?

Is it possible to search a network for a custom dataset?

cifar-split.txt

I needed 'cifar-split.txt' to test 'sh scorehook.sh'.
I couldn't find the file in this project.
After googling, I got 'cifar-split.txt' from the PRDARTS project (https://github.com/salesforce/PR-DARTS/tree/main/PRDARTS_search/configs).

Posting this for anyone who needs help with the same problem.

Changes: "How are bid, ask, and last size quotes displayed for stocks?"

Did anyone else get this message today? I'm not entirely 100% sure what has changed.

Does this mean that all reqHistoricalData, data feed subscriptions and order entry commands were previously in lots of 100, and that now they must all either be given, or will be returned, in actual shares? So previously shares might have been shown as 100, and now it will be shown as 10,000?

"Effective in TWS version 985 and later, for US stocks the bid, ask, and last size quotes are shown in shares (not in lots). API users have the option to configure the TWS API to work in compatibility mode for older programs, but we recommend migrating to "quotes in shares" at your earliest convenience. To use compatibility mode, from the Global Configuration menu select API followed by the Settings page. Once there, check "Bypass US Stocks market data in shares warning for API orders.""