
imagenet18_old's People

Contributors

bearpelican, yaroslavvb

imagenet18_old's Issues

default VPC failure on >5-year-old accounts

If your AWS account was created before 2013-12-04, it may be in EC2-Classic mode. In that case train.py fails with the error below:

botocore.exceptions.ClientError: An error occurred (OperationNotPermitted) when calling the CreateDefaultVpc operation: Accounts on the EC2-Classic platform cannot create a default VPC.

The workaround is to contact AWS support and ask them to transition the account to EC2-VPC mode, providing a list of regions where this should happen.

Existing infrastructure must be compatible with EC2-VPC; the transition is documented here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/vpc-migrate.html
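
A quick way to check which platform an account is on is to query the supported-platforms attribute with boto3 (a minimal sketch, independent of train.py):

import boto3

# Accounts that list 'EC2' under supported-platforms are on EC2-Classic and
# cannot call CreateDefaultVpc; accounts that list only 'VPC' are unaffected.
ec2 = boto3.client('ec2')
attrs = ec2.describe_account_attributes(AttributeNames=['supported-platforms'])
platforms = [v['AttributeValue'] for v in attrs['AccountAttributes'][0]['AttributeValues']]
print(platforms)  # e.g. ['VPC'] or ['EC2', 'VPC']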

Is the model overfitting?

Hi,
Will the image size of 288 and min_scale = 0.5 cause overfitting?

When using 64 V100s (8 machines), top-1 and top-5 can reach about 79% and 95% during training, but less than 74% and 92% (not using rect_val) during eval.

I see that you trained for 7 epochs with image_size=288 and min_scale=0.5 on 8 machines, but the other configurations use only 2 or 3 epochs.

Synchronization issue when changing batch size

I appear to be hitting a synchronization problem when running on a single machine (4 GPUs). I got an exception in the phase at epoch 25 that changes the batch size:

Epoch: [25][2500/2503]  Time 0.149 (0.180)      Loss 1.1147 (1.0585)    Acc@1 70.703 (73.791)   Acc@5 90.430 (91.173)   Data 0.116 (0.143)   BW 0.000 0.000
Changing LR from 0.001750299640431482 to 0.0017499999999999998
Epoch: [25][2503/2503]  Time 2.004 (0.181)      Loss 1.1104 (1.0584)    Acc@1 74.306 (73.792)   Acc@5 89.583 (91.175)   Data 0.115 (0.143)   BW 0.000 0.000
  File "training/train_imagenet_nv.py", line 439, in <module>
    main()
  File "training/train_imagenet_nv.py", line 145, in main
    top1, top5 = validate(dm.val_dl, model, criterion, epoch, start_time)
  File "training/train_imagenet_nv.py", line 244, in validate
    top1acc, top5acc, loss, batch_total = distributed_predict(input, target, model, criterion)
  File "training/train_imagenet_nv.py", line 279, in distributed_predict
    loss = criterion(output, target).data
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1788, in nll_loss
    .format(input.size(0), target.size(0)))
Expected input batch_size (256) to match target batch_size (128).
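
For reference, the error itself just means the logits and the labels passed to the loss disagree on batch size. A minimal repro of the message in plain PyTorch, unrelated to the distributed code:

import torch
import torch.nn.functional as F

logits = torch.randn(256, 1000)           # model output for 256 samples
targets = torch.randint(0, 1000, (128,))  # labels for only 128 samples
# Raises ValueError: Expected input batch_size (256) to match target batch_size (128).
F.cross_entropy(logits, targets)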

I cannot reproduce "imagenet 18" with fp32

Hi @yaroslavvb

I tried to reproduce "imagenet 18" on my host. It works well with fp16 (top-1 accuracy reaches 75.776% at the 27th epoch), but only reaches 51.018% top-1 accuracy at the 27th epoch with fp32.

[screenshot of training results]

The entry point is as follows; I turned down the batch_size to avoid OOM with fp32, and otherwise used the same arguments as in my fp16 experiment:

PYTHONPATH=/imagenet18 \
NCCL_DEBUG=VERSION \
stdbuf -oL nohup python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
--master_addr=127.0.0.1 \
--master_port=6010 \
training/train_imagenet_nv.py /data/imagenet \
--logdir /imagenet18/log_small_bs \
--distributed \
--init-bn0 \
--no-bn-wd \
--phases '[{"ep": 0, "sz": 128, "bs": 224, "trndir": "/sz/160"}, {"ep": (0, 7), "lr": (1.0, 2.0)}, {"ep": (7, 13), "lr": (2.0, 0.25)}, {"ep": 13, "sz": 224, "bs": 96, "trndir": "/sz/352", "min_scale": 0.087}, {"ep": (13, 22), "lr": (0.42857142857142855, 0.04285714285714286)}, {"ep": (22, 25), "lr": (0.04285714285714286, 0.004285714285714286)}, {"ep": 25, "sz": 288, "bs": 50, "min_scale": 0.5, "rect_val": True}, {"ep": (25, 28), "lr": (0.0022321428571428575, 0.00022321428571428573)}]' 2>&1 > local_train.log &

Have you reproduced the conclusion with fp32?

Huge spike in memory usage during initialization

While trying to train resnext_101_64x4d (taken from the fastai/fastai/models repository) with FP32, there is a huge spike in memory usage. I observed that if I reduce the batch size from 512 to something like 64 or 32 (depending on the network being trained), the training goes through, but it is a lot slower than it would be with a batch size of 512.

At the start of training, probably after the first batch, the memory usage goes down drastically. For example, here is a capture of memory usage for one of the GPUs in the machine, with the batch size set to 32 and everything else the same, for resnext_101_64x4d.

Memory usage in MiB: 73, 73, 81, 122, 508, 680, 1084, 1094, 1094, 1094, 1094, 3048, 14024, 13976, 4776, 9356, 4696, 1706, 3110, 5170, 5596, 5596

Note that the same pattern repeats for all 8 GPUs, except for small variations in the values.
As can be seen above, with a batch size of 32 the memory occupancy is only 5596 MiB during the rest of the training (up to epoch 14; then it changes due to the image size). The rest of the memory is unused.

If it were possible to reduce this initial spike in memory usage, or if there were some way to switch to a bigger batch size once training reaches a stable state, it would make the training a lot faster. I tried setting a bigger batch size from epoch 2 by adding an additional phase with sz: 128 and bs: 512, but it doesn't seem to work for some reason.
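
To separate what the model actually holds from what the CUDA caching allocator merely keeps reserved (nvidia-smi reports the latter, cache included), here is a minimal sketch using PyTorch's built-in counters; this is not something train.py currently prints:

import torch

def log_gpu_memory(tag, device=0):
    # Memory currently held by live tensors and the historical peak;
    # nvidia-smi additionally counts the allocator's cached blocks.
    alloc = torch.cuda.memory_allocated(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    print(f'[{tag}] allocated={alloc:.0f} MiB, peak={peak:.0f} MiB')

# e.g. call log_gpu_memory('after first batch') around the suspected spike;
# torch.cuda.empty_cache() releases cached blocks back to the driver.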

Thanks for this amazing work.

What is the pre-warming for?

I am doing research and development for distributed deep learning, so your code is very interesting to me.

I noticed that the code needs to run train.py twice, where the first run is pre-warming.

What is pre-warming for? How is the pre-warming done, and how long does it take?

Is it the case that the inter-node all-reduce could break down with a large number of instances without the pre-warming?

Thanks a lot!

Spot Instance or On-Demand instances

Recently I was asked by my supervisor to try running the code, but I need to know more about the actual cost of running it.

I saw that the code utilizes ncluster and awscli for cloud management.

Does the code use spot instances or on-demand p3.16xlarge instances on AWS?

In other words:
does a single run cost around $40 with spot instances (ref: https://www.fast.ai/2018/08/10/fastai-diu-imagenet/),
or
does it cost $118.07? (ref: https://dawn.cs.stanford.edu/benchmark/ImageNet/train.html)
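
For a rough sanity check of those two figures, here is a back-of-the-envelope calculation; the instance count, run length, and hourly rates below are my assumptions, not numbers from this repository:

# Rough cost estimate: 16 x p3.16xlarge for ~18 minutes (assumed numbers).
machines = 16
hours = 18 / 60
on_demand_rate = 24.48  # assumed us-east-1 on-demand $/hr for p3.16xlarge
spot_rate = 8.0         # assumed typical spot $/hr; spot prices fluctuate

print(f'on-demand: ${machines * hours * on_demand_rate:.2f}')  # ~$117.50
print(f'spot:      ${machines * hours * spot_rate:.2f}')       # ~$38.40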

Does the code terminate the p3.16xlarge instances automatically upon some termination condition, or do I need to terminate them by hand after around 18 minutes?

Thanks a lot!

Instructions for setting up imagenet dataset on AWS storage

Hi,

I am trying to use train.py with one p3.16xlarge instance and I am getting Access Denied errors (likely because I do not have access to EFS):

Creating security group ncluster
Creating keypair ncluster-ubuntu
efs_client.describe_file_systems failed with <class 'botocore.exceptions.ClientError'>(An error occurred (AccessDeniedException) when calling the DescribeFileSystems operation: User: arn:aws:iam::xxxxxxxxxxxx:user/xxxxxxxx is not authorized to perform: elasticfilesystem:DescribeFileSystems on the specified resource), retrying

Is this error occurring because I do not have access to EFS on AWS? Can you please provide instructions in the README on what specifically to set up for storage? Do we set up an EFS file system and download imagenet into it at the path ~/data/imagenet, or will the code do all that on its own?
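
One way to confirm whether this is purely an IAM permission problem is to call the same API directly with boto3 (a minimal sketch, independent of ncluster):

import boto3
from botocore.exceptions import ClientError

efs = boto3.client('efs')
try:
    efs.describe_file_systems()
    print('elasticfilesystem:DescribeFileSystems is allowed for these credentials')
except ClientError as e:
    if e.response['Error']['Code'] == 'AccessDeniedException':
        print('IAM user/role lacks elasticfilesystem:DescribeFileSystems')
    else:
        raise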

Also, I see that we need access to an io2 disk on AWS, as given here: https://github.com/diux-dev/imagenet18/blob/59a8f25171fb8cede51db9187a32fc8f802384a0/train.py#L161.

However, I could not find an io2 volume type on AWS (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html), only io1. The aws_backend.py code (in ncluster) also talks about setting up io1 (as given here: https://github.com/diux-dev/ncluster/blob/839514a1536b6157d59d043a1ff4dac45c2a4ffd/ncluster/aws_backend.py#L770-L776). Am I missing something? Can you please update the instructions for everything we need to do to prepare the imagenet dataset for train.py?

Thanks a lot.
PS: initially, we set up a general-purpose EBS volume, downloaded imagenet onto it, and mounted it on this p3.16xlarge instance. Is that not needed?

About Configuration of Parameters

Hi! Thanks for your sharing.
I tried to run this project on one machine, and I ran the code successfully by removing ncluster.
But I have some confusion about the configuration of parameters.
I cannot understand 'trndir': '-sz/160' and 'trndir': '-sz/352'. I downloaded the ImageNet dataset via PyTorch, and there are two folders: train and val. May I know the differences between 'trndir': '-sz/160', 'trndir': '-sz/352', and the original ImageNet dataset? Can I assign the same path to both 'trndir': '-sz/160' and 'trndir': '-sz/352', namely the ImageNet dataset I downloaded? (In your code, I would just need to remove 'trndir'.)
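
As far as I understand, the -sz/160 and -sz/352 directories are pre-resized copies of the training set (shorter side 160 px and 352 px respectively) used for the progressive-resizing schedule, and the original dataset should work if you drop 'trndir', just with slower I/O. A minimal sketch of how such a resized copy could be generated (paths are hypothetical):

import os
from PIL import Image

def resize_copy(src_root, dst_root, target=160):
    # Copy an ImageFolder-style tree, resizing each image so its shorter side == target.
    for dirpath, _, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
                continue
            img = Image.open(os.path.join(dirpath, name)).convert('RGB')
            w, h = img.size
            scale = target / min(w, h)
            img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
            img.save(os.path.join(dst_root, rel, name))

# e.g. resize_copy('/data/imagenet/train', '/data/imagenet-sz/160/train', 160)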

Test Results Confusion

Thanks for your code, amazing job!
But I'm a little bit confused about the results you obtained; please correct me if I'm wrong.
To my understanding, the standard ImageNet classification task should be tested on images whose shorter edge is resized to 256 and then center-cropped to 224. However, in the last phase of your code you simply test at size 288. If you evaluate your weights at 224, I guess you should expect some numbers below 93. Have you ever tried it?
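
For reference, here is a minimal sketch of the conventional validation preprocessing the question refers to, written with torchvision (this is not the repository's rect_val / size-288 evaluation):

import torchvision.transforms as T

# Standard ImageNet validation preprocessing: shorter side to 256, center crop to 224.
standard_val_tfm = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])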

Can you provide a tsv file for the training?

Hi,

Thanks for the fantastic work.

I wonder if you could upload a tsv file with validation/train accuracy/loss or other stats.

It would be very helpful for us to know the convergence behavior without running the experiment.

Best regards,

The loss becomes NaN in the fifth epoch when training on 1 machine

Hi all,

I am trying to reproduce the result using one p3.16xlarge instance with the default setting.

After doing the AWS configuration, I ran:

python train.py --machines=1

However, the loss becomes NaN in the fifth epoch:

[screenshot of training log]

The full log file can be found here.

Does this mean that the default learning rate does not work? Or should I try the learning rate specified in the single-machine log here?
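
A small diagnostic that could help narrow this down (not part of train.py): stop as soon as the loss stops being finite, so the offending batch and the learning rate at that step can be inspected.

import torch

def check_finite(loss, step):
    # Abort early instead of silently propagating NaN/Inf through later epochs.
    if torch.isnan(loss).any() or torch.isinf(loss).any():
        raise RuntimeError(f'Loss became NaN/Inf at step {step}: {loss.item()}')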

Thanks a lot.

Add documentation for training

How do you split/sort the data for training? It says that you train on the smaller images first; are they sorted by increasing file size, or are they grouped?

'async' is a reserved word in Python 3.7

async is a reserved word in Python 3.7, so it is a SyntaxError to use it as the name of a function parameter. Some projects, like cuda, use non_blocking instead of async.

flake8 testing of https://github.com/diux-dev/imagenet18 on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./train.py:192:23: E999 SyntaxError: invalid syntax
    task.run(cmd, async=True)
                      ^
./tools/launch_tensorboard.py:14:56: E999 SyntaxError: invalid syntax
task.run(f'tensorboard --logdir={task.logdir}/..', async=True)
                                                       ^
2     E999 SyntaxError: invalid syntax
2
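
A minimal illustration of the problem and the usual fix; the run() function below is a stand-in, not the actual ncluster API, so check the library for the real parameter name:

# Python 3.7: 'async' is a keyword, so this line is a SyntaxError even if never executed:
#     task.run(cmd, async=True)
# The usual fix is to rename the keyword argument, e.g. to non_blocking:

def run(cmd, non_blocking=False):  # stand-in for the library call
    print(f'running {cmd!r} (non_blocking={non_blocking})')

run('tensorboard --logdir=/tmp/logs', non_blocking=True)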
