Hi Knazeri, I am trying out your code, it's just that no matter for place2 or cele

Training epoch about edge-connect HOT 17 CLOSED

knazeri commented on August 15, 2024

Training epoch

from edge-connect.

Comments (17)

Yaqiongchai commented on August 15, 2024

> @Yaqiongchai What is your MAX_ITERS value? Can you post your configuration details here? How many iterations does it take for the training to end?

It was 900, just in case I ran into the 1000-iteration problem again. I checked last night; here are the results for model=2:
Training epoch: 50
136/144 [=================>..] - ETA: 0s - epoch: 50 - iter: 899 - psnr: 24.5955 - mae: 0.0778
End training....
code done

So now I am setting MAX_ITERS: 9e3 to see if it still works. In the meantime, I am testing model=3.
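For a rough sense of how MAX_ITERS maps to epochs, the arithmetic can be sketched as below. This is a minimal sketch, not code from the repository: `epochs_until_stop` is a hypothetical helper, and it assumes that the 144 in the progress bar counts images, that the batch size is 8 (the log does not state it), and that a trailing partial batch is dropped.

```python
def epochs_until_stop(max_iters, n_images, batch_size):
    """Return the epoch during which training reaches max_iters.

    Hypothetical helper: assumes one iteration per full batch and that
    a trailing partial batch is dropped.
    """
    iters_per_epoch = n_images // batch_size
    if iters_per_epoch == 0:
        raise ValueError("batch_size is larger than the dataset")
    # Ceiling division: the epoch in which the last iteration falls.
    return -(-int(max_iters) // iters_per_epoch)


# Assumed values: 144 images, batch size 8, so 18 iterations per epoch.
print(epochs_until_stop(900, 144, 8))   # 50
print(epochs_until_stop(9e3, 144, 8))   # 500
```

Under those assumptions, 900 iterations corresponds to the 50 epochs shown in the log above.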

> I'm not sure what you mean by "saving intermediate model"! Are you talking about weights? We are saving weights every SAVE_INTERVAL iterations!

I have now found the weights and am using them to train the edge and inpaint (stage 3) model. I hope it works.

Yaqiongchai commented on August 15, 2024

Another case: after I changed MAX_ITERS to 2e4, instead of the original 2e6, the training only runs to the 1000th epoch and never continues. I am very sure it is not due to computational power (two V100 GPUs). Is 1000 a tricky stage to get through? I remember that it converges fast for Places2. I also attached my screenshot here.

knazeri commented on August 15, 2024

@Yaqiongchai
The case where training starts and ends right away is, I guess, because the training file path is invalid. It is also possible that there are no images in the training path. Please note that the images in the example folder are for inference, not training. To run the model in training mode you need to create a training file list as explained here.
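One quick way to rule out an invalid or empty file list is to count its entries before starting training. This is a minimal sketch, not part of the repository: `count_flist_images` is a hypothetical helper, and it assumes the file list is a plain-text file with one image path per line.

```python
import os

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def count_flist_images(flist_path):
    """Count file-list entries that look like images and that exist on disk.

    Returns (listed, existing). Assumes one path per line, which is an
    assumption about the file-list format rather than a confirmed fact.
    """
    if not os.path.isfile(flist_path):
        return 0, 0
    with open(flist_path, encoding="utf-8") as f:
        paths = [line.strip() for line in f if line.strip()]
    # Keep only entries with a known image extension.
    images = [p for p in paths if os.path.splitext(p)[1].lower() in IMAGE_EXTS]
    existing = [p for p in images if os.path.isfile(p)]
    return len(images), len(existing)
```

If either number comes back as zero for the training list, the data loader has nothing to iterate over, which would match the symptom of training ending immediately.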

For your second question, we only set an upper bound on the number of iterations (not epochs), but I'm surprised that the model stops at the 1000th epoch! In my experiments, the Python runtime sometimes halts, which I think is because of deadlocks happening in the OpenCV and/or scikit-image libraries. Basically, we run the data loader in parallel threads to maximize performance, but sometimes it causes internal deadlocks in the third-party libraries that we use! You can test this hypothesis by setting the number of workers to 1 here:

edge-connect/src/edge_connect.py

Line 78 in 826f2b8

num_workers=4,

Yaqiongchai commented on August 15, 2024

@knazeri Thanks a lot for your advice! I've downloaded img_align_celeba.zip, which is much larger than expected. I plan to pick a small subset to train on, like 1000 faces. How do I generate the mask images, though? Should I use the Mask Dataset instead?

For the second question, I changed to num_workers=1 and it still got stuck at epoch=1000. That was with the example pictures; I hope this will resolve itself when I train on the "actual" dataset.

knazeri commented on August 15, 2024

@Yaqiongchai
It's best to use the Mask Dataset provided by Liu et al. and make sure to download their Testing Irregular Mask Dataset. The dataset contains 12,000 masks and you can easily augment it to 98,000 by rotation and reflection.
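Augmenting a mask by rotation and reflection can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, masks are represented as nested lists rather than image files, and only 90-degree rotations and a left-right mirror are used. That particular scheme yields at most 8 variants per mask, so other rotation angles would be needed to go beyond 8 times the original count.

```python
def rotate90(mask):
    """Rotate a 2-D mask (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*mask[::-1])]


def mirror(mask):
    """Flip a 2-D mask left to right."""
    return [row[::-1] for row in mask]


def augment(mask):
    """Return 8 variants: 4 rotations, each with and without a mirror."""
    variants = []
    current = [list(row) for row in mask]
    for _ in range(4):
        variants.append(current)
        variants.append(mirror(current))
        current = rotate90(current)
    return variants


# Toy 3x3 mask with no symmetry, so all 8 variants differ.
mask = [[1, 0, 0],
        [0, 0, 0],
        [0, 1, 0]]
print(len(augment(mask)))  # 8
```

For masks that are themselves symmetric, some of the 8 variants coincide, so the number of distinct masks can be lower.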

I still don't know why it stops after the 1,000th epoch! Does it hang, or does it end with the message "End training...."?

Yaqiongchai commented on August 15, 2024

> @Yaqiongchai
> It's best to use the Mask Dataset provided by Liu et al. and make sure to download their Testing Irregular Mask Dataset. The dataset contains 12,000 masks and you can easily augment it to 98,000 by rotation and reflection.

I have downloaded the full dataset and used the irregular mask dataset. I tried to train on a subset (300 images) of the CelebA dataset, but the number of epochs is still one and it says "end of training". I made sure that all the data are correct and that the file lists were generated with flist.py.
Any insights?

Also, I saw that you can save the "hallucinated edge" from the first step of training, "edge connection". I'd like to save the intermediate model and results too. Which part of the code should I be looking into?

Thanks a lot for your help!

2018hello commented on August 15, 2024

Hi, I ran into the first problem because I set BATCH_SIZE=64. It's too high. With BATCH_SIZE=8, it runs.
Thanks!

2018hello commented on August 15, 2024

Hi, I still ran into the first problem because I used someone else's model, and it reached MAX_ITERS.
Thanks!
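The situation described here, where a loaded model already carries an iteration count at or beyond MAX_ITERS, can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the repository's training code: `run_training` and its parameters are hypothetical, and it assumes the iteration counter is restored from the saved weights and checked after every batch.

```python
def run_training(start_iter, max_iters, batches_per_epoch):
    """Simulate a loop that stops once the iteration counter reaches max_iters.

    Returns (epochs_started, final_iteration). Purely illustrative.
    """
    iteration = start_iter
    epoch = 0
    keep_training = True
    while keep_training:
        epoch += 1
        if batches_per_epoch == 0:
            # No full batch available: nothing to train on.
            break
        for _ in range(batches_per_epoch):
            iteration += 1  # one optimization step would happen here
            if iteration >= max_iters:
                keep_training = False
                break
    return epoch, iteration


# Fresh run with 9 batches per epoch (illustrative value).
print(run_training(0, 900, 9))             # (100, 900)
# Resumed at iteration 2,000,000 with a limit of 20,000 (illustrative values):
# the loop ends during the first epoch, after a single step.
print(run_training(2_000_000, 20_000, 9))  # (1, 2000001)
```

In a loop of this shape, the options would be to raise the iteration limit above the restored count or to start from weights whose counter is lower.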

knazeri commented on August 15, 2024

@Yaqiongchai What is your MAX_ITERS value? Can you post your configuration details here? How many iterations does it take for the training to end?

I'm not sure what you mean by "saving intermediate model"! Are you talking about weights? We are saving weights every SAVE_INTERVAL iterations!

knazeri commented on August 15, 2024

@2018hello Are you saying that with BATCH_SIZE=64 the model stops training?

2018hello commented on August 15, 2024

Yes.
I set the wrong BATCH_SIZE and MAX_ITERS in config.yml, and the training ended, just like in the first image.

knazeri commented on August 15, 2024

I don't believe BATCH_SIZE=64 is wrong unless you set a very small MAX_ITERS! Can you run the model on batches of 64?

Yaqiongchai commented on August 15, 2024

> Hi, I ran into the first problem because I set BATCH_SIZE=64. It's too high. With BATCH_SIZE=8, it runs.
> Thanks!

Thanks a lot! It works now!

Yaqiongchai commented on August 15, 2024

> I don't believe BATCH_SIZE=64 is wrong unless you set a very small MAX_ITERS! Can you run the model on batches of 64?

Hi Kamyar,
Because it is now running rather than ending as in the first image, I haven't tested a batch size of 64. My images are 256*256 and there are only 72 in total, so a batch size of 8 makes more sense, I think.
I will test with 64 to see what the problem was when I get a chance.

JinMengKang commented on August 15, 2024

Hi @Yaqiongchai, I have encountered the same problem: training will not continue past 1000 iterations. Have you solved the problem?

Yaqiongchai commented on August 15, 2024

> Hi @Yaqiongchai, I have encountered the same problem: training will not continue past 1000 iterations. Have you solved the problem?

@JinMengKang
I have not solved the problem to this day. I simply set the max iteration to 999 (so that it does not stop unexpectedly), but it turns out that the model is totally under-trained.

JenicChen commented on August 15, 2024

Hi, @Yaqiongchai
I have the same first problem. I use CelebA for training, which has 202,599 pictures, and my mask files were augmented to 202,599 too (from 12,000). I'm sure that I made the correct flist, but the result is that I just got one epoch, which ends right after
