Comments (17)
@Yaqiongchai What is your MAX_ITERS value? Can you post your configuration details here? How many iterations does it take for the training to end?
It was 900, just in case I ran into the 1000-iteration problem again. I checked last night; here are the results for model=2:
Training epoch: 50
136/144 [=================>..] - ETA: 0s - epoch: 50 - iter: 899 - psnr: 24.5955 - mae: 0.0778
End training....
code done
So now I am setting MAX_ITERS: 9e3 to see if it still works. In the meantime, I am testing model=3.
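The log above is consistent with MAX_ITERS capping iterations rather than epochs. A minimal sketch of the arithmetic, assuming one iteration consumes one batch, that the 144 in the progress bar is the number of training images, and that the batch size is 8 (the batch size is not stated at this point in the thread; the function names are made up for illustration):

```python
import math


def iterations_per_epoch(num_images, batch_size, drop_last=False):
    """Number of batches (iterations) that one pass over the dataset yields."""
    if drop_last:
        # Incomplete final batch is discarded.
        return num_images // batch_size
    return math.ceil(num_images / batch_size)


def epoch_reaching(max_iters, num_images, batch_size, drop_last=False):
    """Epoch during which the iteration counter reaches max_iters."""
    per_epoch = iterations_per_epoch(num_images, batch_size, drop_last)
    if per_epoch == 0:
        raise ValueError("no full batch available: batch_size exceeds dataset size")
    return math.ceil(max_iters / per_epoch)


# 144 images (from the progress bar), batch size 8 (assumed).
print(iterations_per_epoch(144, 8))  # 18
print(epoch_reaching(900, 144, 8))   # 50
print(epoch_reaching(9000, 144, 8))  # 500
```

Under these assumptions, MAX_ITERS: 900 ends training in epoch 50, which matches the log, and MAX_ITERS: 9e3 would run for roughly 500 epochs on the same data.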
I'm not sure what you mean by "saving intermediate model"! Are you talking about weights? We are saving weights every SAVE_INTERVAL iterations!
I have now found the weights and am using them to train the edge and inpaint (stage 3) model. Hope it works.
from edge-connect.
Another case: after I changed MAX_ITERS to 2e4 (originally 2e6), the training only runs to the 1000th epoch and never continues. I am very sure it is not due to computational power (two V100 GPUs). Is 1000 a tricky stage to get through? I remember that for Places2 it converges fast. I have also attached my screenshot here.
from edge-connect.
@Yaqiongchai
When training starts and ends right away, I guess it is because the training file path is invalid. It is also possible that there are no images in the training path. Please note that the images in the example folder are for inference, not training. To run the model in training mode you need to create a training file-list as explained here.
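One way to rule out an empty or broken file-list before starting a run is a quick check like the one below. This is only a sketch: it assumes the file-list is plain text with one image path per line, and both the extension set and the `check_flist` helper are hypothetical, not part of the repository.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the dataset in use.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def check_flist(flist_path):
    """Split the entries of a file-list into (valid, missing) paths.

    Assumes one image path per line; blank lines are ignored.
    """
    flist = Path(flist_path)
    if not flist.is_file():
        raise FileNotFoundError(f"file-list not found: {flist}")

    valid, missing = [], []
    for line in flist.read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue
        path = Path(entry)
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            valid.append(entry)
        else:
            missing.append(entry)
    return valid, missing
```

If `valid` comes back empty, the data loader has nothing to iterate over, which would fit the "starts and ends right after" symptom.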
For your second question, we only set an upper bound on the number of iterations (not epochs), but I'm surprised that the model stops at the 1000th epoch! In my experiments, the Python runtime sometimes halts, which I think is caused by deadlocks in the OpenCV and/or scikit-image libraries. Basically, we run the data loader with parallel workers to maximize performance, but this sometimes causes internal deadlocks in the third-party libraries that we use! You can test this hypothesis by setting the number of workers to 1 here:
edge-connect/src/edge_connect.py
Line 78 in 826f2b8
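If the run still hangs with a single worker, a stack dump can show where the main process is stuck. The following is a rough sketch using the standard-library `faulthandler` module; `run_with_watchdog` and `step_fn` are hypothetical names standing in for the training loop, and the dump covers only the threads of the process that armed it, not separate worker processes.

```python
import faulthandler
import sys


def run_with_watchdog(step_fn, num_steps, timeout_s=600):
    """Call step_fn(i) for each step, arming a watchdog around every call.

    If a single step takes longer than timeout_s seconds, the tracebacks
    of all threads in this process are written to stderr. The process is
    not killed, so the hang can still be inspected afterwards.
    """
    for i in range(num_steps):
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=sys.stderr)
        try:
            step_fn(i)
        finally:
            # Disarm the watchdog once the step returns or raises.
            faulthandler.cancel_dump_traceback_later()
```

A traceback that ends inside an image-reading or resizing call would support the deadlock hypothesis; one that ends elsewhere would point to a different cause.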
from edge-connect.
@knazeri Thanks a lot for your advice! I've downloaded img_align_celeba.zip, which is much larger than expected. I plan to pick a small subset to train on, like 1000 faces. How do I generate the mask images, though? Should I use the Mask Dataset instead?
For the second question, I changed num_workers=1 and it still got stuck at epoch=1000. That was with the example pictures; I hope this will resolve itself when I train on the "actual" dataset.
from edge-connect.
@Yaqiongchai
It's best to use the Mask Dataset provided by Liu et al. and make sure to download their Testing Irregular Mask Dataset. The dataset contains 12,000 masks and you can easily augment it to 98,000 by rotation and reflection.
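Augmenting by rotation and reflection can be sketched as follows. Plain nested lists stand in for mask images here to keep the example dependency-free; with NumPy arrays, `np.rot90` and `np.fliplr` would do the same job. The helper names are made up for illustration. Four rotations, each with and without a horizontal flip, give at most 8 variants per mask; the thread does not say exactly which transforms produce the quoted total.

```python
def rotate90(mask):
    """Rotate a 2-D mask (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*mask[::-1])]


def flip_horizontal(mask):
    """Mirror a 2-D mask left to right."""
    return [row[::-1] for row in mask]


def augment(mask):
    """Return 8 variants: the 4 rotations, each with and without a flip."""
    variants = []
    current = [list(row) for row in mask]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


# Tiny 2x3 "mask" with an asymmetric hole, so all 8 variants differ.
for variant in augment([[1, 1, 0], [0, 0, 0]]):
    print(variant)
```

Symmetric masks produce fewer distinct variants, so the effective dataset size after augmentation depends on the masks themselves.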
I still don't know why it stops after the 1,000th epoch! Does it halt, or does it end with the message "End training...."?
from edge-connect.
I have downloaded the full dataset and used the irregular mask dataset. I tried to train on a subset (300 images) of the "celeba" dataset, but the number of epochs is still one and it says "end of training". I made sure that all the data are correct and that the file-lists were generated with flist.py.
Any insights?
Also, I saw that you can save the "hallucinated edge" from the first step of the training ("edge connection"). I'd like to save the intermediate model and results too. Which part of the code should I look into?
Thanks a lot for your help!
from edge-connect.
Hi, I ran into the first problem because I set BATCH_SIZE=64, which is too high. With BATCH_SIZE=8, it runs.
Thanks!
from edge-connect.
Hi, I still ran into the first problem because I used someone else's model, and it had already reached MAX_ITERS.
Thanks!
from edge-connect.
@Yaqiongchai What is your MAX_ITERS value? Can you post your configuration details here? How many iterations does it take for the training to end?
I'm not sure what you mean by "saving intermediate model"! Are you talking about weights? We are saving weights every SAVE_INTERVAL iterations!
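To see how SAVE_INTERVAL and MAX_ITERS interact, here is a small sketch that lists the iterations at which a purely periodic save would fire. It assumes a save happens whenever the iteration counter is a multiple of the interval; whether the code also saves at other points is not stated in this thread. The function name and the example values are hypothetical.

```python
def periodic_save_points(max_iters, save_interval):
    """Iterations in 1..max_iters that are multiples of save_interval."""
    if save_interval <= 0:
        raise ValueError("save_interval must be positive")
    return list(range(save_interval, max_iters + 1, save_interval))


# Hypothetical values, for illustration only.
print(periodic_save_points(900, 300))   # [300, 600, 900]
print(periodic_save_points(900, 1000))  # []
```

Under this assumption, an interval larger than the iteration cap never triggers a periodic save, so the two settings are worth checking together when a run is deliberately kept short.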
from edge-connect.
@2018hello Are you saying that with BATCH_SIZE=64 the model stops training?
from edge-connect.
Yes.
I set the wrong BATCH_SIZE and MAX_ITERS in config.yml,
and the training ended, just like in the first image.
from edge-connect.
I don't believe BATCH_SIZE=64 is wrong unless you set a very small MAX_ITERS! Can you run the model on batches of 64?
from edge-connect.
Thanks a lot! It works now!
from edge-connect.
Hi Kamyar,
Because it is now running instead of ending as in the first image, I haven't tested a batch size of 64. My images are 256*256 and there are only 72 of them in total, so a batch size of 8 makes more sense, I think.
I will test with 64 when I get a chance, to see what the problem was.
from edge-connect.
Hi @Yaqiongchai, I have encountered the same problem: training does not continue past 1000 iterations. Have you solved the problem?
from edge-connect.
@JinMengKang
I have not solved the problem to this day. I simply set the maximum iteration count to 999 (so that it does not stop unexpectedly), but it turns out that the model is badly under-trained.
from edge-connect.
Hi, @Yaqiongchai
I have the same first problem. I use CelebA for training, which has 202,599 pictures, and my mask files were augmented to 202,599 too (from 12,000). I'm sure that I made the flist correctly, but I only get one epoch, and training ends right after.
from edge-connect.