
imagenet18_old's People

Contributors

bearpelican, yaroslavvb

imagenet18_old's Issues

default VPC failure on >5-year-old accounts

If your AWS account was created before 2013-12-04, it may be in EC2-Classic mode. In that case train.py fails with the error below:

botocore.exceptions.ClientError: An error occurred (OperationNotPermitted) when calling the CreateDefaultVpc operation: Accounts on the EC2-Classic platform cannot create a default VPC.

The workaround is to contact AWS support and ask them to transition the account to EC2-VPC mode, providing a list of regions where this should happen.

Existing infrastructure must be compatible with EC2-VPC; the transition is documented here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/vpc-migrate.html
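
A quick way to check which platform an account is on is to query the supported-platforms attribute with boto3 (a minimal sketch, independent of train.py):

import boto3

# Accounts that list 'EC2' under supported-platforms are on EC2-Classic and
# cannot call CreateDefaultVpc; accounts that list only 'VPC' are unaffected.
ec2 = boto3.client('ec2')
attrs = ec2.describe_account_attributes(AttributeNames=['supported-platforms'])
platforms = [v['AttributeValue'] for v in attrs['AccountAttributes'][0]['AttributeValues']]
print(platforms)  # e.g. ['VPC'] or ['EC2', 'VPC']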

Is the model overfitting?

Hi,
Will the image size of 288 and min_scale = 0.5 cause overfitting?

When using 64 V100s (8 machines), top-1 and top-5 can reach about 79% and 95% during training, but less than 74% and 92% (not using rect_val) during eval.

I see that you trained for 7 epochs with image_size=288 and min_scale=0.5 on 8 machines, but the other configurations use only 2 or 3 epochs.

Synchronization issue when changing batch size

I appear to be hitting a synchronization problem when running on a single machine (4 GPUs). I got an exception in the phase at epoch 25 that changes the batch size:

Epoch: [25][2500/2503]  Time 0.149 (0.180)      Loss 1.1147 (1.0585)    Acc@1 70.703 (73.791)   Acc@5 90.430 (91.173)   Data 0.116 (0.143)   BW 0.000 0.000
Changing LR from 0.001750299640431482 to 0.0017499999999999998
Epoch: [25][2503/2503]  Time 2.004 (0.181)      Loss 1.1104 (1.0584)    Acc@1 74.306 (73.792)   Acc@5 89.583 (91.175)   Data 0.115 (0.143)   BW 0.000 0.000
  File "training/train_imagenet_nv.py", line 439, in <module>
    main()
  File "training/train_imagenet_nv.py", line 145, in main
    top1, top5 = validate(dm.val_dl, model, criterion, epoch, start_time)
  File "training/train_imagenet_nv.py", line 244, in validate
    top1acc, top5acc, loss, batch_total = distributed_predict(input, target, model, criterion)
  File "training/train_imagenet_nv.py", line 279, in distributed_predict
    loss = criterion(output, target).data
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/rakelkar/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1788, in nll_loss
    .format(input.size(0), target.size(0)))
Expected input batch_size (256) to match target batch_size (128).
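
For reference, the error itself just means the logits and the labels passed to the loss disagree on batch size. A minimal repro of the message in plain PyTorch, unrelated to the distributed code:

import torch
import torch.nn.functional as F

logits = torch.randn(256, 1000)           # model output for 256 samples
targets = torch.randint(0, 1000, (128,))  # labels for only 128 samples
# Raises ValueError: Expected input batch_size (256) to match target batch_size (128).
F.cross_entropy(logits, targets)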

I cannot reproduce "imagenet 18" with fp32

Hi @yaroslavvb

I tried to reproduce "imagenet 18" on my host. It works well with fp16 (top-1 accuracy reaches 75.776% at the 27th epoch), but only reaches 51.018% top-1 accuracy at the 27th epoch with fp32.

[screenshot of training results]

The entry point is as follows; I turned down the batch_size to avoid OOM with fp32, and otherwise used the same arguments as in my fp16 experiment:

PYTHONPATH=/imagenet18 \
NCCL_DEBUG=VERSION \
stdbuf -oL nohup python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
--master_addr=127.0.0.1 \
--master_port=6010 \
training/train_imagenet_nv.py /data/imagenet \
--logdir /imagenet18/log_small_bs \
--distributed \
--init-bn0 \
--no-bn-wd \
--phases '[{"ep": 0, "sz": 128, "bs": 224, "trndir": "/sz/160"}, {"ep": (0, 7), "lr": (1.0, 2.0)}, {"ep": (7, 13), "lr": (2.0, 0.25)}, {"ep": 13, "sz": 224, "bs": 96, "trndir": "/sz/352", "min_scale": 0.087}, {"ep": (13, 22), "lr": (0.42857142857142855, 0.04285714285714286)}, {"ep": (22, 25), "lr": (0.04285714285714286, 0.004285714285714286)}, {"ep": 25, "sz": 288, "bs": 50, "min_scale": 0.5, "rect_val": True}, {"ep": (25, 28), "lr": (0.0022321428571428575, 0.00022321428571428573)}]' 2>&1 > local_train.log &

Have you reproduced the conclusion with fp32?

Huge spike in memory usage during initialization

While trying to train resnext_101_64x4d (taken from the fastai/fastai/models repository) with FP32, there is a huge spike in memory usage. I observed that if I reduce the batch size from 512 to something like 64 or 32 (depending on the network being trained), the training goes through, but it is a lot slower than it would be with a batch size of 512.

At the start of training, probably after the first batch, the memory usage goes down drastically. For example, here is a capture of memory usage for one of the GPUs in the machine, with the batch size set to 32 and everything else the same, for resnext_101_64x4d.

Memory usage in MiB: 73, 73, 81, 122, 508, 680, 1084, 1094, 1094, 1094, 1094, 3048, 14024, 13976, 4776, 9356, 4696, 1706, 3110, 5170, 5596, 5596

Note that the same pattern repeats for all 8 GPUs, except for small variations in the values.
As can be seen above, with a batch size of 32 the memory occupancy is only 5596 MiB during the rest of the training (up to epoch 14; then it changes due to the image size). The rest of the memory is unused.

If it were possible to reduce this initial spike in memory usage, or if there were some way to switch to a bigger batch size once training reaches a stable state, it would make the training a lot faster. I tried setting a bigger batch size from epoch 2 by adding an additional phase with sz: 128 and bs: 512, but it doesn't seem to work for some reason.
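
To separate what the model actually holds from what the CUDA caching allocator merely keeps reserved (nvidia-smi reports the latter, cache included), here is a minimal sketch using PyTorch's built-in counters; this is not something train.py currently prints:

import torch

def log_gpu_memory(tag, device=0):
    # Memory currently held by live tensors and the historical peak;
    # nvidia-smi additionally counts the allocator's cached blocks.
    alloc = torch.cuda.memory_allocated(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    print(f'[{tag}] allocated={alloc:.0f} MiB, peak={peak:.0f} MiB')

# e.g. call log_gpu_memory('after first batch') around the suspected spike;
# torch.cuda.empty_cache() releases cached blocks back to the driver.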

Thanks for this amazing work.

What is the pre-warming for?

I am doing research and development for distributed deep learning, so your code is very interesting to me.

I noticed that the code needs to run train.py twice, where the first run is pre-warming.

What is pre-warming for? How is the pre-warming done, and how long does it take?

Is it the case that the inter-node all-reduce could break down with a large number of instances without the pre-warming?

Thanks a lot!

Spot Instance or On-Demand instances

Recently I was asked by my supervisor to try running the code, but I need to know more about the actual cost of running it.

I saw that the code utilizes ncluster and awscli for cloud management.

Does the code use spot instances or on-demand p3.16xlarge instances on AWS?

In other words:
does a single run cost around $40 with spot instances (ref: https://www.fast.ai/2018/08/10/fastai-diu-imagenet/),
or
does it cost $118.07? (ref: https://dawn.cs.stanford.edu/benchmark/ImageNet/train.html)
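
For a rough sanity check of those two figures, here is a back-of-the-envelope calculation; the instance count, run length, and hourly rates below are my assumptions, not numbers from this repository:

# Rough cost estimate: 16 x p3.16xlarge for ~18 minutes (assumed numbers).
machines = 16
hours = 18 / 60
on_demand_rate = 24.48  # assumed us-east-1 on-demand $/hr for p3.16xlarge
spot_rate = 8.0         # assumed typical spot $/hr; spot prices fluctuate

print(f'on-demand: ${machines * hours * on_demand_rate:.2f}')  # ~$117.50
print(f'spot:      ${machines * hours * spot_rate:.2f}')       # ~$38.40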

Does the code terminate the p3.16xlarge instances automatically upon some termination condition, or do I need to terminate them by hand after around 18 minutes?

Thanks a lot!

Instructions for setting up imagenet dataset on AWS storage

Hi,

I am trying to use train.py with one p3.16xlarge instance and I am getting Access Denied errors (likely because I do not have access to EFS):

Creating security group ncluster
Creating keypair ncluster-ubuntu
efs_client.describe_file_systems failed with <class 'botocore.exceptions.ClientError'>(An error occurred (AccessDeniedException) when calling the DescribeFileSystems operation: User: arn:aws:iam::xxxxxxxxxxxx:user/xxxxxxxx is not authorized to perform: elasticfilesystem:DescribeFileSystems on the specified resource), retrying

Is this error occurring because I do not have access to EFS on AWS? Can you please provide instructions in the README on what specifically to set up for storage? Do we set up an EFS file system and download imagenet into it at the path ~/data/imagenet, or will the code do all that on its own?
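
One way to confirm whether this is purely an IAM permission problem is to call the same API directly with boto3 (a minimal sketch, independent of ncluster):

import boto3
from botocore.exceptions import ClientError

efs = boto3.client('efs')
try:
    efs.describe_file_systems()
    print('elasticfilesystem:DescribeFileSystems is allowed for these credentials')
except ClientError as e:
    if e.response['Error']['Code'] == 'AccessDeniedException':
        print('IAM user/role lacks elasticfilesystem:DescribeFileSystems')
    else:
        raise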

Also, I see that we need access to an io2 disk on AWS, as given here: https://github.com/diux-dev/imagenet18/blob/59a8f25171fb8cede51db9187a32fc8f802384a0/train.py#L161.

However, I could not find an io2 volume type on AWS (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html), only io1. The aws_backend.py code (in ncluster) also talks about setting up io1 (as given here: https://github.com/diux-dev/ncluster/blob/839514a1536b6157d59d043a1ff4dac45c2a4ffd/ncluster/aws_backend.py#L770-L776). Am I missing something? Can you please update the instructions for everything we need to do to prepare the imagenet dataset for train.py?

Thanks a lot.
PS: initially, we set up a general-purpose EBS volume, downloaded imagenet onto it, and mounted it on this p3.16xlarge instance. Is that not needed?

About Configuration of Parameters

Hi! Thanks for your sharing.
I tried to run this project on one machine, and I ran the code successfully by removing ncluster.
But I have some confusion about the configuration of parameters.
I cannot understand 'trndir': '-sz/160' and 'trndir': '-sz/352'. I downloaded the ImageNet dataset via PyTorch, and there are two folders: train and val. May I know the differences between 'trndir': '-sz/160', 'trndir': '-sz/352', and the original ImageNet dataset? Can I assign the same path to both 'trndir': '-sz/160' and 'trndir': '-sz/352', namely the ImageNet dataset I downloaded? (In your code, I would just need to remove 'trndir'.)
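
As far as I understand, the -sz/160 and -sz/352 directories are pre-resized copies of the training set (shorter side 160 px and 352 px respectively) used for the progressive-resizing schedule, and the original dataset should work if you drop 'trndir', just with slower I/O. A minimal sketch of how such a resized copy could be generated (paths are hypothetical):

import os
from PIL import Image

def resize_copy(src_root, dst_root, target=160):
    # Copy an ImageFolder-style tree, resizing each image so its shorter side == target.
    for dirpath, _, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
                continue
            img = Image.open(os.path.join(dirpath, name)).convert('RGB')
            w, h = img.size
            scale = target / min(w, h)
            img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
            img.save(os.path.join(dst_root, rel, name))

# e.g. resize_copy('/data/imagenet/train', '/data/imagenet-sz/160/train', 160)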

Test Results Confusion

Thanks for your code, amazing job!
But I'm a little bit confused about the results you obtained; please correct me if I'm wrong.
To my understanding, the standard ImageNet classification task should be tested on images whose shorter edge is resized to 256 and then center-cropped to 224. However, in the last phase of your code you simply test at size 288. If you evaluate your weights at 224, I guess you should expect some numbers below 93. Have you ever tried it?
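
For reference, here is a minimal sketch of the conventional validation preprocessing the question refers to, written with torchvision (this is not the repository's rect_val / size-288 evaluation):

import torchvision.transforms as T

# Standard ImageNet validation preprocessing: shorter side to 256, center crop to 224.
standard_val_tfm = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])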

Can you provide a tsv file for the training?

Hi,

Thanks for the fantastic work.

I wonder if you could upload a tsv file with validation/train accuracy/loss or other stats.

It would be very helpful for us to know the convergence behavior without running the experiment.

Best regards,

The loss becomes NaN in the fifth epoch when training on 1 machine

Hi all,

I am trying to reproduce the result using one p3.16xlarge instance with the default setting.

After doing the AWS configuration, I ran:

python train.py --machines=1

However, the loss becomes NaN in the fifth epoch:

[screenshot of training log]

The full log file can be found here.

Does this mean that the default learning rate does not work? Or should I try the learning rate specified in the single-machine log here?
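
A small diagnostic that could help narrow this down (not part of train.py): stop as soon as the loss stops being finite, so the offending batch and the learning rate at that step can be inspected.

import torch

def check_finite(loss, step):
    # Abort early instead of silently propagating NaN/Inf through later epochs.
    if torch.isnan(loss).any() or torch.isinf(loss).any():
        raise RuntimeError(f'Loss became NaN/Inf at step {step}: {loss.item()}')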

Thanks a lot.

Add documentation for training

How do you split/sort the data for training? It says that you train on the smaller images first; are they sorted by increasing file size, or are they grouped?

'async' is a reserved word in Python 3.7

async is a reserved word in Python 3.7, so it is a SyntaxError to use it as the name of a function parameter. Some projects, like cuda, use non_blocking instead of async.

flake8 testing of https://github.com/diux-dev/imagenet18 on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./train.py:192:23: E999 SyntaxError: invalid syntax
    task.run(cmd, async=True)
                      ^
./tools/launch_tensorboard.py:14:56: E999 SyntaxError: invalid syntax
task.run(f'tensorboard --logdir={task.logdir}/..', async=True)
                                                       ^
2     E999 SyntaxError: invalid syntax
2
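
A minimal illustration of the problem and the usual fix; the run() function below is a stand-in, not the actual ncluster API, so check the library for the real parameter name:

# Python 3.7: 'async' is a keyword, so this line is a SyntaxError even if never executed:
#     task.run(cmd, async=True)
# The usual fix is to rename the keyword argument, e.g. to non_blocking:

def run(cmd, non_blocking=False):  # stand-in for the library call
    print(f'running {cmd!r} (non_blocking={non_blocking})')

run('tensorboard --logdir=/tmp/logs', non_blocking=True)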
