Training epoch about edge-connect HOT 17 CLOSED

knazeri avatar knazeri commented on August 15, 2024
Training epoch

from edge-connect.

Comments (17)

Yaqiongchai avatar Yaqiongchai commented on August 15, 2024

@Yaqiongchai What is your MAX_ITERS value? Can you post your configuration details here? How many iterations does it take for the training to end?

It was 900, just in case I ran into the 1000-iteration problem again. I checked last night; here are the results for model=2:
Training epoch: 50
136/144 [=================>..] - ETA: 0s - epoch: 50 - iter: 899 - psnr: 24.5955 - mae: 0.0778
End training....
code done

So now I am setting MAX_ITERS: 9e3 to see if it still works. In the meantime, I am testing model=3.

I'm not sure what you mean by "saving intermediate model"! Are you talking about weights? We are saving weights every SAVE_INTERVAL iterations!
I have now found the weights and am using them to train the edge and inpaint (stage 3) model. Hope it works.
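
As a rough check of the numbers in the log above, the sketch below relates an iteration cap to the epoch in which training stops. It takes the 144 in the progress bar to be the number of training images and assumes a batch size of 8; both are assumptions, and `parse_iters` and `epochs_until_stop` are hypothetical helpers, not functions from the repository. The cap is passed through `int(float(...))` because, depending on the YAML loader, a value written as 9e3 may arrive as a float or as a string.

```python
import math


def parse_iters(value):
    """Accept 900, 9000.0 or "9e3" and return an integer iteration cap."""
    return int(float(value))


def epochs_until_stop(max_iters, n_images, batch_size):
    """Epoch in which the iteration cap is reached (hypothetical helper)."""
    iters_per_epoch = math.ceil(n_images / batch_size)
    return math.ceil(parse_iters(max_iters) / iters_per_epoch)


# Assumed values: 144 images, batch size 8, so 18 iterations per epoch.
print(epochs_until_stop(900, 144, 8))    # 50
print(epochs_until_stop("9e3", 144, 8))  # 500
```

Under these assumptions a cap of 900 ends in epoch 50, which matches the log, so that run appears to have stopped because it reached MAX_ITERS.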

Yaqiongchai avatar Yaqiongchai commented on August 15, 2024

Another case: after I changed MAX_ITERS to 2e4, instead of the original 2e6, the training only runs to the 1000th epoch and never continues. I am very sure it is not due to computational power (two V100 GPUs). Is 1000 a tricky stage to get through? I remember that on Places2 it converges fast. I have also attached my screenshot here.
image

knazeri avatar knazeri commented on August 15, 2024

@Yaqiongchai
If training starts and then ends right away, I guess it is because the training file path is invalid. It is also possible that there are no images in the training path. Please note that the images in the example folder are for inference, not training. To run the model in training mode you need to create a training file list as explained here.
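
One way to rule out an invalid or empty training path is to inspect the file list before starting a run. The sketch below assumes the file list is a plain text file with one image path per line; that format, the extension list, and the helper name `check_flist` are assumptions for illustration, not part of the repository.

```python
import os

# Assumed set of image extensions; adjust to the dataset at hand.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def check_flist(flist_path):
    """Return (existing, missing) image paths listed in a file list."""
    if not os.path.isfile(flist_path):
        raise FileNotFoundError(f"file list not found: {flist_path}")
    with open(flist_path) as f:
        # Skip blank lines so a trailing newline is not counted as a path.
        paths = [line.strip() for line in f if line.strip()]
    existing = [
        p for p in paths
        if p.lower().endswith(IMAGE_EXTENSIONS) and os.path.isfile(p)
    ]
    found = set(existing)
    missing = [p for p in paths if p not in found]
    return existing, missing
```

If `existing` comes back empty, the data loader has nothing to iterate over, which would be consistent with training ending right after it starts.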

For your second question, we only set an upper bound on the number of iterations (not epochs), but I'm surprised that the model stops at the 1000th epoch! In my experiments, the Python runtime sometimes halts, which I think is because of deadlocks happening in the OpenCV and/or scikit-image libraries. Basically, we run the data loader in parallel threads to maximize performance, but sometimes this causes internal deadlocks in the third-party libraries that we use! You can test this hypothesis by setting the number of workers to 1 here:

num_workers=4,
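
To tell a genuine hang apart from a normal exit, the standard-library `faulthandler` module can dump every thread's stack when a single step takes too long. The following is a minimal sketch, not code from the repository: `run_with_watchdog` is a hypothetical wrapper, and the callables in `steps` stand in for real training iterations.

```python
import faulthandler


def run_with_watchdog(steps, log_path, timeout_s=600.0):
    """Run each callable in `steps`; dump all thread stacks if one stalls."""
    with open(log_path, "w") as log_file:
        try:
            for step in steps:
                # Re-arming replaces the previous timer, so a dump is written
                # only when a single step runs longer than timeout_s.
                faulthandler.dump_traceback_later(timeout_s, file=log_file)
                step()
        finally:
            # Always cancel the timer before the log file is closed.
            faulthandler.cancel_dump_traceback_later()
```

If the log stays empty and the process exits, the run ended normally; a dump pointing into an image-loading call would support the deadlock hypothesis.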

Yaqiongchai avatar Yaqiongchai commented on August 15, 2024

@knazeri Thanks a lot for your advice! I've downloaded img_align_celeba.zip, which is much larger than expected. I plan to pick a small subset to train on, like 1000 faces. How do I generate the mask images, though? Should I use the Mask Dataset instead?

For the second question, I changed to num_workers=1 and it still got stuck at epoch=1000. That was with the example pictures; I hope this will resolve itself when I train on the "actual" dataset.

knazeri avatar knazeri commented on August 15, 2024

@Yaqiongchai
It's best to use the Mask Dataset provided by Liu et al. and make sure to download their Testing Irregular Mask Dataset. The dataset contains 12,000 masks and you can easily augment it to 98,000 by rotation and reflection.
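
The rotation-and-reflection augmentation mentioned above can be sketched as follows. This is a dependency-free illustration on nested lists with hypothetical helper names, not the code used by the authors; on NumPy arrays, `np.rot90` and `np.fliplr` would do the same job. For a square mask, the four 90-degree rotations and their mirror images give up to eight variants, and symmetric masks yield fewer distinct ones.

```python
def rotate90(mask):
    """Rotate a 2D mask (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*mask[::-1])]


def mirror(mask):
    """Flip a 2D mask left to right."""
    return [row[::-1] for row in mask]


def dihedral_variants(mask):
    """Return the 4 rotations of a mask, each with its mirror image."""
    variants = []
    current = [list(row) for row in mask]
    for _ in range(4):
        variants.append(current)
        variants.append(mirror(current))
        current = rotate90(current)
    return variants


# A small asymmetric toy mask (1 = hole), used only for illustration.
toy_mask = [[1, 1, 0],
            [0, 0, 0],
            [0, 0, 0]]
print(len(dihedral_variants(toy_mask)))  # 8
```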

I still don't know why it stops after the 1,000th epoch! Does it hang, or does it end with the message "End training...."?

Yaqiongchai avatar Yaqiongchai commented on August 15, 2024

@Yaqiongchai
It's best to use the Mask Dataset provided by Liu et al. and make sure to download their Testing Irregular Mask Dataset. The dataset contains 12,000 masks and you can easily augment it to 98,000 by rotation and reflection.

I have downloaded the full dataset and used the irregular mask dataset. I tried to train on a subset (300 images) of the "celeba" dataset, but the number of epochs is still one and then it says "end of training". I made sure that all the data are correct and generated from flist.py.
Any insights?

Also, I saw that you can save the "hallucinated edge" from the first step of the training, "edge connection". I'd like to save the intermediate model and results too. Which part of the code should I be looking into?

Thanks a lot for your help!

2018hello avatar 2018hello commented on August 15, 2024

Hi, I ran into the first problem because I set BATCH_SIZE=64. It's too high. With BATCH_SIZE=8, it runs.
Thanks!

2018hello avatar 2018hello commented on August 15, 2024

Hi, I still ran into the first problem, because I used someone else's model and it had already reached MAX_ITERS.
Thanks!
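
The situation described here, loading weights whose iteration counter is already at or past the cap, can be illustrated with a short sketch. It assumes that the checkpoint stores an iteration counter and that training stops once the counter reaches MAX_ITERS; `remaining_iterations` and the numbers below are hypothetical.

```python
def remaining_iterations(checkpoint_iteration, max_iters):
    """Iterations left before the cap is hit, never negative."""
    # int(float(...)) accepts 20000, 20000.0 and the string "2e4" alike.
    cap = int(float(max_iters))
    return max(0, cap - checkpoint_iteration)


# Hypothetical numbers: weights saved at 2,000,000 iterations, cap of 2e4.
print(remaining_iterations(2_000_000, "2e4"))  # 0
print(remaining_iterations(0, "2e4"))          # 20000
```

With zero iterations remaining, a run would end as soon as it starts, which matches the symptom reported in this thread.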

knazeri avatar knazeri commented on August 15, 2024

@Yaqiongchai What is your MAX_ITERS value? Can you post your configuration details here? How many iterations does it take for the training to end?

I'm not sure what you mean by "saving intermediate model"! Are you talking about weights? We are saving weights every SAVE_INTERVAL iterations!
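
As an illustration of how SAVE_INTERVAL interacts with MAX_ITERS, the sketch below lists the iterations at which weights would be written, assuming a save happens only when the iteration counter is a multiple of SAVE_INTERVAL. That rule, the interval values, and the helper name `checkpoint_iterations` are assumptions; the repository may also save at other points.

```python
def checkpoint_iterations(max_iters, save_interval):
    """Iterations at which weights would be saved under the assumed rule."""
    cap = int(float(max_iters))
    return list(range(save_interval, cap + 1, save_interval))


# Assumed interval of 1000: a run capped at 900 never reaches a save point.
print(checkpoint_iterations(900, 1000))    # []
print(checkpoint_iterations("9e3", 3000))  # [3000, 6000, 9000]
```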

knazeri avatar knazeri commented on August 15, 2024

@2018hello Are you saying that with BATCH_SIZE=64 the model stops training?

2018hello avatar 2018hello commented on August 15, 2024

Yes.
I set the wrong BATCH_SIZE and MAX_ITERS in config.yml.
And the training ended, just like in the first image.

knazeri avatar knazeri commented on August 15, 2024

I don't believe BATCH_SIZE=64 is wrong unless you set a very small MAX_ITERS! Can you run the model on batches of 64?

Yaqiongchai avatar Yaqiongchai commented on August 15, 2024

Hi, I ran into the first problem because I set BATCH_SIZE=64. It's too high. With BATCH_SIZE=8, it runs.
Thanks!

Thanks a lot! It works now!

Yaqiongchai avatar Yaqiongchai commented on August 15, 2024

I don't believe BATCH_SIZE=64 is wrong unless you set a very small MAX_ITERS! Can you run the model on batches of 64?

Hi Kamyar,
Because it's now running instead of ending like in the first image, I haven't tested a batch size of 64. My images are 256x256 and there are only 72 in total, so a batch size of 8 makes more sense, I think.
I will test with 64 to see what the problem was when I get a chance.
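
The effect of batch size on a small dataset can be worked out with a short sketch. Whether incomplete batches are dropped depends on how the data loader is configured (in PyTorch this is the `drop_last` option of `DataLoader`); which setting applies here is an assumption, and `batches_per_epoch` is a hypothetical helper.

```python
def batches_per_epoch(n_images, batch_size, drop_last):
    """Number of batches one pass over the dataset yields."""
    full, remainder = divmod(n_images, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


# 72 images, as in the comment above.
print(batches_per_epoch(72, 8, drop_last=True))    # 9
print(batches_per_epoch(72, 64, drop_last=True))   # 1
print(batches_per_epoch(72, 64, drop_last=False))  # 2
# A dataset smaller than the batch size yields no batch when drop_last is set.
print(batches_per_epoch(32, 64, drop_last=True))   # 0
```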

JinMengKang avatar JinMengKang commented on August 15, 2024

Hi @Yaqiongchai, I have encountered the same problem: training does not continue past 1000 iterations. Have you solved the problem?

Yaqiongchai avatar Yaqiongchai commented on August 15, 2024

Hi @Yaqiongchai, I have encountered the same problem: training does not continue past 1000 iterations. Have you solved the problem?
@JinMengKang
I have not solved the problem to this day. I simply set the maximum number of iterations to 999 (so that it does not stop unexpectedly), but it turns out that the model is totally under-trained.

JenicChen avatar JenicChen commented on August 15, 2024

Hi, @Yaqiongchai
I have the same first problem. I use CelebA for training, which has 202,599 pictures, and my mask files were augmented to 202,599 too (from 12,000). I'm sure that I made the flist correctly, but I only get one epoch, and training ends right after.
