junyanz / pytorch-cyclegan-and-pix2pix

Image-to-Image Translation in PyTorch

License: Other

Python 87.63% Shell 2.41% TeX 1.02% MATLAB 1.07% Jupyter Notebook 7.86%

computer-graphics computer-vision cyclegan deep-learning gan gans generative-adversarial-network image-generation image-manipulation pix2pix pytorch

pytorch-cyclegan-and-pix2pix's Issues

Which loss should we monitor

Hi, I used CycleGAN to train on new data, but I don't know which loss or metric I should monitor.
I wanted to use loss G_A, but it does not seem to decrease. I also tried the horse2zebra data, but loss G_A still didn't decrease much over 200 epochs.

can not download anyone , could you tell me what can i do for that ?

wget was not found; I know that now. Sorry about that.

How to predict single image

This network seems to require a pair of images containing A and B. What if I simply want to feed an A image into the net and get a generated image of the B kind? How exactly do I do this once I have trained the network and saved checkpoints? Any snippet?

pix2pix which_direction seems to always be BtoA

Following the directions in the README gave me good results on the facades.

python train.py \
    --display_id 0 \
    --dataroot ./datasets/facades \
    --name facades_pix2pix \
    --model pix2pix \
    --which_model_netG unet_256 \
    --which_direction BtoA \
    --lambda_A 100 \
    --align_data \
    --use_dropout \
    --no_lsgan

However, running it backwards in AtoB mode does not seem to change the behavior:

python train.py \
    --display_id 0 \
    --dataroot ./datasets/facades \
    --name facades_pix2pix_rev \
    --model pix2pix \
    --which_model_netG unet_256 \
    --which_direction AtoB \
    --lambda_A 100 \
    --align_data \
    --use_dropout \
    --no_lsgan

For reference, here is an image from the dataroot:

Not a huge deal in my case as I can update my dataset appropriately to compensate, but wanted to note this issue.

Question: batch size

The paper indicates that training was done with batch size = 1

Is there a reason not to use a slightly larger batch size to more fully occupy the GPUs? For example, are the results better with batch size = 1 than with batch sizes larger than 1?

RuntimeError: cuda runtime error (2) : out of memory at .. THCStorage.cu:66

I am trying to train the pix2pix model on the facades dataset, just like in the tutorial, but I get the out-of-memory error below.

My computer has a GTX Titan with 6 GB. Is that enough?

THCudaCheck FAIL file=/py/conda-bld/pytorch_1493676237139/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory Traceback (most recent call last): File "train.py", line 14, in <module> model = create_model(opt) File "/users/taitien.doan/GANs/pix2pix/pytorch-CycleGAN-and-pix2pix/models/models.py", line 18, in create_model model.initialize(opt) File "/users/taitien.doan/GANs/pix2pix/pytorch-CycleGAN-and-pix2pix/models/pix2pix_model.py", line 26, in initialize opt.which_model_netG, opt.norm, opt.use_dropout, self.gpu_ids) File "/users/taitien.doan/GANs/pix2pix/pytorch-CycleGAN-and-pix2pix/models/networks.py", line 46, in define_G netG.cuda(device_id=gpu_ids[0]) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 147, in cuda return self._apply(lambda t: t.cuda(device_id)) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", 
line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 118, in _apply module._apply(fn) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 124, in _apply param.data = fn(param.data) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/nn/modules/module.py", line 147, in <lambda> return self._apply(lambda t: t.cuda(device_id)) File "/users/taitien.doan/anaconda3/envs/pix2pixlua/lib/python2.7/site-packages/torch/_utils.py", line 65, in _cuda return new_type(self.size()).copy_(self, async) RuntimeError: cuda runtime error (2) : out of memory at /py/conda-bld/pytorch_1493676237139/work/torch/lib/THC/generic/THCStorage.cu:66

Do not support python2

from builtins import object

ImportError: No module named builtins

potential bug

In the current release there is a potential bug when gpu_ids is not 0, because some tensors are initialized with "torch.cuda.Tensor(*shape)" (self.tensor(*shape)), as in the implementation of GANLoss. I guess such tensors are created on GPU 0 by default, so when the model is not on GPU 0 the following error occurs:
Some of weight/gradient/input tensors are located on different GPUs

One simple fix might be to add "torch.cuda.device(self.opt.gpu_ids[0])" in the parse function of the base_options class.

CycleGAN question

Hello! I rewrote the CycleGAN code based on your code. My code runs, but training does not produce the expected results.
Could you give me your email address so that I can send you my code? I hope you can help me find the cause of the problem.

Aspect ratio flag in test.py

I am trying to use the --aspect_ratio flag in test.py

python test.py --dataroot ./datasets/text/testA/ --name text --model test --which_model_netG unet_256 --which_direction AtoB --dataset_mode single --aspect_ratio 2.0

In the results folder though, the images are still 1:1.

Question: PatchGAN Discriminator

Hi there.
I was studying your CycleGAN paper and code, and it looks like the discriminator you've implemented is just a conv net, not the PatchGAN mentioned in the paper.
Maybe I've missed something. Could you point me to where the processing of 70x70 patches happens?
Thanks in advance!
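
A note that may help with this question: a PatchGAN does not have to cut out patches explicitly. In a plain stack of convolutions, each output unit only sees a limited window of the input, so every value of the output map scores one patch. The sketch below computes that window size. The layer list is an assumption for illustration (kernel size 4, three stride-2 convolutions followed by two stride-1 convolutions, as one would expect for a discriminator with n_layers_D=3); it is not copied from the repository.

```python
def receptive_field(layers):
    """Return the input window (in pixels) seen by one output unit.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1
    # Walk from the output back to the input, growing the window at each layer.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed layer layout (hypothetical, for illustration only).
ASSUMED_PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(receptive_field(ASSUMED_PATCHGAN_LAYERS))  # 70
```

Under that assumed layout, each output value depends on a 70x70 input window, which would match the patch size mentioned in the paper.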

ntrain is ignored

Couldn't find ntrain anywhere else in the code, so I guess it's not taken into account.

Forgot to add norm layer in innermost of UnetSkipConnectionBlock?

Hi,
It seems that in UnetSkipConnectionBlock, at line 281, an extra norm layer is needed after the convolution.
Is that right?

Unable to get as good result as in torch

Hi,
I ran your code on horse2zebra, with loadSize, fineSize and which_model_netG changed to match the Torch version. However, I couldn't get good results.

Your PyTorch model here does have some differences from the Torch version, such as different padding and a different training strategy.

I'm wondering if you see similar problems.

ConnectionError[Errno 111], during run 'train.py' of maps

When I run:
$ python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan

I get this error:
ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa2c1b19290>: Failed to establish a new connection: [Errno 111] Connection refused',))

How can I solve this?

Full error output:

Traceback (most recent call last):
File "/home/khryang/.local/lib/python2.7/site-packages/visdom/__init__.py", line 228, in _send
data=json.dumps(msg),
File "/home/khryang/.local/lib/python2.7/site-packages/requests/api.py", line 112, in post
return request('post', url, data=data, json=json, **kwargs)
File "/home/khryang/.local/lib/python2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/home/khryang/.local/lib/python2.7/site-packages/requests/sessions.py", line 513, in request
resp = self.send(prep, **send_kwargs)
File "/home/khryang/.local/lib/python2.7/site-packages/requests/sessions.py", line 623, in send
r = adapter.send(request, **kwargs)
File "/home/khryang/.local/lib/python2.7/site-packages/requests/adapters.py", line 504, in send
raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa2c1b19290>: Failed to establish a new connection: [Errno 111] Connection refused',))
(epoch: 1, iters: 700, time: 0.456) D_A: 0.215 G_A: 0.561 Cyc_A: 2.316 D_B: 0.163 G_B: 0.450 Cyc_B: 0.794
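
The refused connection is to localhost:8097 and the traceback passes through the visdom client, which suggests that no display server is listening on that port; the loss line printed after the error shows that training itself keeps running. A quick way to confirm this is to probe the port. The helper below is a minimal sketch (the function name is invented here); it only checks whether anything accepts TCP connections on the given host and port.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" as well as timeouts.
        return False


if __name__ == "__main__":
    # 8097 is the port shown in the error message above.
    if is_port_open("localhost", 8097):
        print("Something is listening on localhost:8097")
    else:
        print("Nothing is listening on localhost:8097")
```

If nothing is listening, starting the server in a separate terminal with `python -m visdom.server` before launching train.py is the usual remedy.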

Scale transform

Just want to let people know that, for me, the scale transform was making the program crash. I worked around it by replacing the list with a single value. I didn't submit a pull request, as I might be the only one who experienced this problem.

https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/blob/master/data/unaligned_dataset.py

Line 26
Before: transform_list.append(transforms.Scale(osize, Image.BICUBIC))
After: transform_list.append(transforms.Scale(opt.loadSize, Image.BICUBIC))

socket.error: [Errno 111] Connection refused

When I run pix2pix with
"python train.py --dataroot ./datasets/facades --name facades_pix2pix --model pix2pix --which_model_netG unet_256 --which_direction BtoA --lambda_A 100 --dataset_mode aligned --use_dropout --no_lsgan"

the following problem occurs:

model [Pix2PixModel] was created
create web directory ./checkpoints/facades_pix2pix/web...
(epoch: 1, iters: 100, time: 5.015) G_GAN: 2.485 G_L1: 36.558 D_real: 0.151 D_fake: 0.257
(epoch: 1, iters: 200, time: 4.833) G_GAN: 3.015 G_L1: 43.858 D_real: 0.045 D_fake: 0.552
(epoch: 1, iters: 300, time: 4.797) G_GAN: 2.519 G_L1: 39.296 D_real: 0.039 D_fake: 0.149
(epoch: 1, iters: 400, time: 6.720) G_GAN: 2.393 G_L1: 25.259 D_real: 0.200 D_fake: 0.504
End of epoch 1 / 200 Time Taken: 5975 sec
Traceback (most recent call last):
File "train.py", line 20, in <module>
for i, data in enumerate(dataset):
File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 206, in next
idx, batch = self.data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
return recv()
File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 22, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1382, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
fd = multiprocessing.reduction.rebuild_handle(df)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

How can I solve this problem?

pix2pix without B side image,how to test

Hi,
I am training the pix2pix model on my own dataset. I know how to create the training set: provide an A image and a B image, then combine the two into an AB image, as in the facades dataset. The model learns to translate A to B and generates a fake image.
But I don't know how to test when I only have an A-side image. Should I combine the A image with an empty image or something? Help wanted, thank you!

a small mistake?

In data/unaligned_dataset.py, line 34, I find a small mistake:

transform_list + = [transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]

I think "+ = " should be changed to "="?
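
A note on the operator, as a minimal sketch with placeholder strings instead of real transform objects: `+ =` with a space in between is not valid Python, `+=` extends the existing list, and `=` replaces it and drops whatever was appended earlier. Whether the earlier entries matter depends on the surrounding code in unaligned_dataset.py, which is not shown here.

```python
# Placeholder names stand in for the real transform objects (hypothetical).
transform_list = ["Scale", "RandomCrop"]

# `+=` extends the list in place, keeping the entries added earlier.
transform_list += ["ToTensor", "Normalize"]
extended = list(transform_list)

# `=` rebinds the name, so the earlier entries are lost.
transform_list = ["ToTensor", "Normalize"]
replaced = list(transform_list)

print(extended)  # ['Scale', 'RandomCrop', 'ToTensor', 'Normalize']
print(replaced)  # ['ToTensor', 'Normalize']
```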

Is it possible only work with cpu

My iMac doesn't have an NVIDIA GPU. I'm wondering whether there is a way to train with the CPU only. Can anyone help? Thanks.

Running remotely causes errors due to Visdom

Hi, after merging #3, if I try to use a remote server through ssh, it shows these errors.

How can I run it without it trying to use Visdom?
Thanks!

CUDNN_STATUS_BAD_PARAM with output_nc=1

Hi,

I'm testing the Cycle GAN code using 3-channel input data (A) and 1-channel output data (B). I always get the following error message:

RuntimeError: CUDNN_STATUS_BAD_PARAM

However this works fine when I set output_nc to 3.

I can't find any place in the code where output_nc is hard-coded to 3, so I'm guessing this must be a CUDNN issue? Is there any reason you can think of why a 1-channel output should not work with the current architecture in Cycle GAN?

Thanks!

The full trace is below:
Traceback (most recent call last):
  File "train.py", line 26, in <module>
    model.optimize_parameters()
  File "/home/nbserver/urbanization-patterns/models/pytorch-CycleGAN-and-pix2pix/models/cycle_gan_model.py", line 159, in optimize_parameters
    self.backward_G()
  File "/home/nbserver/urbanization-patterns/models/pytorch-CycleGAN-and-pix2pix/models/cycle_gan_model.py", line 141, in backward_G
    self.fake_A = self.netG_B.forward(self.real_B)
  File "/home/nbserver/urbanization-patterns/models/pytorch-CycleGAN-and-pix2pix/models/networks.py", line 170, in forward
    return nn.parallel.data_parallel(self.model, input, self.gpu_ids)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 105, in data_parallel
    outputs = parallel_apply(replicas, inputs, module_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 46, in parallel_apply
    raise output
RuntimeError: CUDNN_STATUS_BAD_PARAM

Got Connection Error

Here it goes:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/util/connection.py", line 83, in create_connection
    raise err
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 356, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.5/http/client.py", line 1106, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
    self.send(msg)
  File "/usr/lib/python3.5/http/client.py", line 877, in send
    self.connect()
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7fc52b9bd3c8>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 423, in send
    timeout=timeout
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fc52b9bd3c8>: Failed to establish a new connection: [Errno 111] Connection refused',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/visdom/__init__.py", line 228, in _send
    data=json.dumps(msg),
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 110, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fc52b9bd3c8>: Failed to establish a new connection: [Errno 111] Connection refused',))

I have no idea what causes this.

Multi GPUs not working

I'm trying to run the pix2pix network on 2 GPUs with the option
--gpu_ids 0,1
However, only one GPU is being used.
I have 2 GPUs; both are Titans with 6 GB.

is it possible to make it work on rectangular (i.e. height<>width) images?

I need to work on rectangular (height != width) images. What would it take to adapt this software to my needs? Any hints that would save me from going through the whole code would be welcome.
Thank you.
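
One constraint that usually matters for rectangular inputs: a generator that halves the resolution k times with stride-2 layers and then doubles it k times only returns the input size when height and width are both multiples of 2^k. The helpers below round a size down to such a multiple. The function names and the choice of k are assumptions for illustration: k=2 would correspond to a generator with two stride-2 convolutions, k=8 to a U-Net with eight downsampling steps.

```python
def round_down_to_multiple(size, num_downsamplings):
    """Largest value <= size that is divisible by 2 ** num_downsamplings."""
    factor = 2 ** num_downsamplings
    rounded = (size // factor) * factor
    if rounded == 0:
        raise ValueError("size %d is smaller than the required multiple %d" % (size, factor))
    return rounded


def usable_size(width, height, num_downsamplings):
    """Round both dimensions down so the network can reproduce the input size."""
    return (round_down_to_multiple(width, num_downsamplings),
            round_down_to_multiple(height, num_downsamplings))


print(usable_size(640, 480, 2))  # (640, 480)
print(usable_size(640, 480, 8))  # (512, 256)
```

Cropping or resizing to the rounded size before feeding the image keeps the aspect ratio close to the original without requiring square inputs.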

B to A training?

I'm really enjoying the library. Maybe consider having a flag for training from B->A so the dataset doesn't have to be redone.

blurred images with my dataset

Hi,
First of all, you did a very good job!
I am trying to get realistic airplane images from images of CAD models.
When I tried it with 2000 images of CAD models and 2000 images of real airplanes, these were the results.
CAD model:

real image:

results:

Do you think I need more data or more iterations? Or are these the best results I should expect regarding the blur?

undefined symbol: PySlice_Unpack

Cool that you got it all to PyTorch!

When running python train.py --dataroot ./datasets/facades --name facades_cyclegan --model cycle_gan
I get:
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    from options.train_options import TrainOptions
  File "/home/ubuntu/pytorch-CycleGAN-and-pix2pix/options/train_options.py", line 1, in <module>
    from .base_options import BaseOptions
  File "/home/ubuntu/pytorch-CycleGAN-and-pix2pix/options/base_options.py", line 3, in <module>
    from util import util
  File "/home/ubuntu/pytorch-CycleGAN-and-pix2pix/util/util.py", line 4, in <module>
    from PIL import Image
  File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/PIL/Image.py", line 56, in <module>
    from . import _imaging as core
ImportError: /home/ubuntu/miniconda3/lib/python3.6/site-packages/PIL/_imaging.cpython-36m-x86_64-linux-gnu.so: undefined symbol: PySlice_Unpack

What versions are you using?

NaNs during CycleGAN training

I have run approximately a dozen test runs (using train.py) on 2 datasets (maps and my own custom dataset, trying to convert Synthia to Cityscapes). Every run so far has produced NaNs, sometimes after more than 70 epochs, sometimes after only a handful of epochs. Before the NaNs appear, actual learning does seem to happen, as evidenced e.g. by looking at the transformed images over the epochs. I have also played with various learning rates, but even at a pretty low lr NaNs seem to occur eventually.

My question: Is this something others have also observed? Second: in case this is "normal" and e.g. due to the difficulties of training GANs (min-max), what would be critical params to vary to eventually avoid training to break down?
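
To pin down when a run breaks, it can help to scan the printed loss lines for the first NaN. The sketch below assumes the log format quoted in other issues on this page, "(epoch: E, iters: I, time: T) NAME: VALUE ..."; the function names and the NaN line in the example are invented for illustration.

```python
import math
import re

HEADER = re.compile(r"\(epoch: (\d+), iters: (\d+), time: [\d.]+\)")
LOSS = re.compile(r"(\w+): (\S+)")


def parse_loss_line(line):
    """Parse one printed loss line; return None for lines in another format."""
    text = line.strip()
    header = HEADER.match(text)
    if header is None:
        return None
    # Everything after the header is a sequence of "name: value" pairs.
    losses = {name: float(value) for name, value in LOSS.findall(text[header.end():])}
    return {"epoch": int(header.group(1)), "iters": int(header.group(2)), "losses": losses}


def first_nan(lines):
    """Return (epoch, iters, loss_name) for the first NaN loss, or None."""
    for line in lines:
        record = parse_loss_line(line)
        if record is None:
            continue
        for name, value in record["losses"].items():
            if math.isnan(value):
                return record["epoch"], record["iters"], name
    return None


log = [
    "(epoch: 1, iters: 700, time: 0.456) D_A: 0.215 G_A: 0.561 Cyc_A: 2.316 D_B: 0.163 G_B: 0.450 Cyc_B: 0.794",
    "(epoch: 7, iters: 300, time: 0.460) D_A: 0.250 G_A: nan Cyc_A: nan D_B: 0.250 G_B: nan Cyc_B: nan",  # made-up line
]
print(first_nan(log))  # (7, 300, 'G_A')
```

Knowing which loss goes to NaN first, and at which iteration, narrows down which network or term to look at.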

multiple forward passes but one backward call for updating G?

I saw the following code in the cycle_gan_model backward_G method

        # GAN loss
        # D_A(G_A(A))
        self.fake_B = self.netG_A.forward(self.real_A)
        pred_fake = self.netD_A.forward(self.fake_B)
        self.loss_G_A = self.criterionGAN(pred_fake, True)
        # D_B(G_B(B))
        self.fake_A = self.netG_B.forward(self.real_B)
        pred_fake = self.netD_B.forward(self.fake_A)
        self.loss_G_B = self.criterionGAN(pred_fake, True)
        # Forward cycle loss
        self.rec_A = self.netG_B.forward(self.fake_B)
        self.loss_cycle_A = self.criterionCycle(self.rec_A, self.real_A) * lambda_A
        # Backward cycle loss
        self.rec_B = self.netG_A.forward(self.fake_A)
        self.loss_cycle_B = self.criterionCycle(self.rec_B, self.real_B) * lambda_B
        # combined loss
        self.loss_G = self.loss_G_A + self.loss_G_B + self.loss_cycle_A + self.loss_cycle_B + self.loss_idt_A + self.loss_idt_B
        self.loss_G.backward()

The way I see it, G_A and G_B each have three forward passes, twice accepting the real data and twice the fake data.
In TensorFlow (I think) the backward pass is always computed w.r.t. the last input data. If that were the case here, the backpropagation of loss_G would be wrong, and one should instead do the backward pass three times, each immediately after the corresponding forward pass.
I assume this is somehow taken care of in PyTorch. But how does the model know w.r.t. which input data it should compute the gradients?
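
For context: PyTorch's autograd records a graph for every forward call rather than remembering only the last input, so one backward() on the summed loss sends gradients through all recorded passes and adds them up in each shared parameter. The toy below is not PyTorch; it is a minimal pure-Python reverse-mode sketch (the Value class is invented for illustration) that shows this accumulation for one weight used in two forward passes.

```python
class Value:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs of the operation that produced this value
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the recorded graph so each node is handled after everything that uses it.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node._parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # gradients from every use are summed


w = Value(3.0)        # one shared "generator weight"
x_real = Value(2.0)   # input of the first forward pass
x_fake = Value(5.0)   # input of the second forward pass

loss = w * x_real + w * x_fake   # two forward passes, one combined loss
loss.backward()                  # a single backward call

print(w.grad)  # 7.0, i.e. x_real + x_fake
```

Each product keeps its own reference to the input it was computed with, so the single backward call does not need to be told which input to use.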

Pretrained models

Do you have any CycleGAN pretrained models available that use pytorch?

Images of current results are rotated

When I view the current results, I find that the images are rotated. Is this expected behavior?

Model parallelism

Hi,
I am trying to generate HD images, but results are not so good and I guess it is because I am training the models using lower resolution images (256x256).
Unfortunately, if I try to train these networks using higher resolution images I run out of memory. So, I was wondering if it is possible to split the networks over multiple GPUs and run the model over multiple devices.

Thanks

Out Of Memory Error on Facades dataset

I'm trying to run pix2pix on the facades dataset, but I keep getting an out-of-memory error. I've tried various --loadSize and --fineSize params, along with installing different versions of PyTorch. I have only a GeForce GTX 650 Ti, so only ~1GB of GPU RAM. I tried installing the no-CUDA version of PyTorch, but I still get an out-of-memory error, even though my computer has 8GB of RAM.

I'm new to Torch and deep learning in general, so I'm likely just using it wrong. Any help is appreciated.

using the trained model(fineSize=512) to produce the test image (640*480)result , i got an error

Using the trained model (fineSize=512) to produce results for 640*480 test images, I got the following error:
runtimeerror: sizes do not match at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorCopy.c:48

When I set fineSize for the test images, I also receive the error. Do you know why? Thanks~ @junyanz

Out of memory?

I trained CycleGAN on an Nvidia Tesla K80 GPU, Ubuntu, batchSize=1.
But I got an "out of memory" error.
Is there anything I have missed? How much memory does this model use?

Edited: I tested the same thing on another machine with an Nvidia Titan X, Ubuntu, batchSize=1, and got the same error.
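
As a rough sanity check on the memory question, the parameter count can be reproduced from the layer shapes in the ResnetGenerator printout that follows. The helpers below are a sketch with invented names; they assume every convolution has a bias, every InstanceNorm2d has a scale and a shift (affine=True, as printed), and 4 bytes per float32 parameter.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a (transposed) 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


def norm_params(channels):
    """Affine instance norm: one scale and one shift per channel."""
    return 2 * channels


def resnet_generator_params(input_nc=3, output_nc=3, ngf=64, n_blocks=9):
    # First 7x7 convolution and its norm.
    total = conv_params(input_nc, ngf, 7) + norm_params(ngf)
    # Two stride-2 downsampling convolutions: 64 -> 128 -> 256 channels.
    total += conv_params(ngf, ngf * 2, 3) + norm_params(ngf * 2)
    total += conv_params(ngf * 2, ngf * 4, 3) + norm_params(ngf * 4)
    # Each residual block holds two 3x3 convolutions, each followed by a norm.
    block = 2 * (conv_params(ngf * 4, ngf * 4, 3) + norm_params(ngf * 4))
    total += n_blocks * block
    # Two transposed convolutions: 256 -> 128 -> 64 channels.
    total += conv_params(ngf * 4, ngf * 2, 3) + norm_params(ngf * 2)
    total += conv_params(ngf * 2, ngf, 3) + norm_params(ngf)
    # Final 7x7 convolution to the output channels (no norm afterwards).
    total += conv_params(ngf, output_nc, 7)
    return total


params = resnet_generator_params()
print(params)                # 11388675
print(params * 4 / 2 ** 20)  # size of the float32 weights in MiB
```

The count matches the "Total number of parameters" line in the log, and the weights of one generator come to roughly 43 MiB. That suggests most of the GPU memory goes to activations, gradients and optimizer state, plus the second generator and the two discriminators, rather than to the parameters themselves.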

I ran:
python train.py --dataroot ./datasets/horse2zebra --name horse2zebra_cyclegan --model cycle_gan

The messages I got:

---------------Options------------------
batchSize: 1
beta1: 0.5
checkpoints_dir: ./checkpoints
continue_train: False
dataroot: ./datasets/horse2zebra
dataset_mode: unaligned
display_freq: 100
display_id: 1
display_winsize: 256
fineSize: 256
gpu_ids: [0]
identity: 0.0
input_nc: 3
isTrain: True
lambda_A: 10.0
lambda_B: 10.0
loadSize: 286
lr: 0.0002
max_dataset_size: inf
model: cycle_gan
nThreads: 1
n_layers_D: 3
name: horse2zebra_cyclegan
ndf: 64
ngf: 64
niter: 100
niter_decay: 100
no_flip: False
no_html: False
no_lsgan: False
norm: instance
output_nc: 3
phase: train
pool_size: 50
print_freq: 100
resize_or_crop: resize_and_crop
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
use_dropout: False
which_direction: AtoB
which_epoch: latest
which_model_netD: basic
which_model_netG: resnet_9blocks
-------------- End ----------------
CustomDatasetDataLoader
dataset [UnalignedDataset] was created
#training images = 1067
cycle_gan
---------- Networks initialized -------------
ResnetGenerator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (1): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (2): ReLU (inplace)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (4): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (5): ReLU (inplace)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (8): ReLU (inplace)
    (9): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (10): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (11): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (12): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (13): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (14): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (15): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (16): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (17): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (18): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (19): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (20): ReLU (inplace)
    (21): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (22): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (23): ReLU (inplace)
    (24): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (25): Tanh ()
  )
)
Total number of parameters: 11388675
ResnetGenerator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (1): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (2): ReLU (inplace)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (4): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (5): ReLU (inplace)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (8): ReLU (inplace)
    (9): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (10): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (11): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (12): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (13): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (14): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (15): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (16): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (17): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (18): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (19): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (20): ReLU (inplace)
    (21): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (22): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (23): ReLU (inplace)
    (24): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (25): Tanh ()
  )
)
Total number of parameters: 11388675
NLayerDiscriminator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): LeakyReLU (0.2, inplace)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (4): LeakyReLU (0.2, inplace)
    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (7): LeakyReLU (0.2, inplace)
    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (10): LeakyReLU (0.2, inplace)
    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
  )
)
Total number of parameters: 2766529
NLayerDiscriminator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): LeakyReLU (0.2, inplace)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (4): LeakyReLU (0.2, inplace)
    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (7): LeakyReLU (0.2, inplace)
    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (10): LeakyReLU (0.2, inplace)
    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
  )
)
Total number of parameters: 2766529
-----------------------------------------------
model [CycleGANModel] was created
create web directory ./checkpoints/horse2zebra_cyclegan/web...
THCudaCheck FAIL file=/home/liyh/pytorch/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 25, in <module>
    model.optimize_parameters()
  File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/cycle_gan_model.py", line 158, in optimize_parameters
    self.backward_G()
  File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/cycle_gan_model.py", line 144, in backward_G
    self.rec_A = self.netG_B.forward(self.fake_B)
  File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/networks.py", line 170, in forward
    return nn.parallel.data_parallel(self.model, input, self.gpu_ids)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 103, in data_parallel
    return module(*inputs[0], **module_kwargs[0])
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/container.py", line 64, in forward
    input = module(input)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 237, in forward
    self.padding, self.dilation, self.groups)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/functional.py", line 41, in conv2d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /home/liyh/pytorch/torch/lib/THC/generic/THCStorage.cu:66
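
The parameter totals printed in the log above can be checked by hand. The following is a minimal sketch in plain Python (no PyTorch needed) that recomputes them from the layer shapes shown in the printout. The helper names `conv_params`, `norm_params`, `generator_params` and `discriminator_params` are made up for this illustration. It assumes every convolution has a bias and every InstanceNorm2d with affine=True has one scale and one shift per channel.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k Conv2d or ConvTranspose2d layer."""
    return c_in * c_out * k * k + c_out


def norm_params(channels):
    """InstanceNorm2d with affine=True: one scale and one shift per channel."""
    return 2 * channels


def generator_params(n_blocks=9):
    """ResnetGenerator as printed above: layers (0)-(8), 9 blocks, (18)-(25)."""
    total = conv_params(3, 64, 7) + norm_params(64)
    total += conv_params(64, 128, 3) + norm_params(128)
    total += conv_params(128, 256, 3) + norm_params(256)
    # Each ResnetBlock holds two 3x3 convolutions and two norm layers.
    block = 2 * (conv_params(256, 256, 3) + norm_params(256))
    total += n_blocks * block
    total += conv_params(256, 128, 3) + norm_params(128)
    total += conv_params(128, 64, 3) + norm_params(64)
    total += conv_params(64, 3, 7)
    return total


def discriminator_params():
    """NLayerDiscriminator as printed above: five 4x4 convolutions."""
    total = conv_params(3, 64, 4)
    total += conv_params(64, 128, 4) + norm_params(128)
    total += conv_params(128, 256, 4) + norm_params(256)
    total += conv_params(256, 512, 4) + norm_params(512)
    total += conv_params(512, 1, 4)
    return total


print(generator_params())      # 11388675
print(discriminator_params())  # 2766529
```

Two generators and two discriminators add up to 28310408 parameters, roughly 113 MB in float32. That suggests the out-of-memory error is more likely driven by activations and gradients than by the weights themselves.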

Is it possible to change the U-Net input image size to something like 480 x 640?

FCN score code

Junyan, great work! I cannot find the code to compute the FCN score. Did I miss anything?

Does `continue_train` option work?

I trained on the edges2shoes dataset with the following command:

python train.py --dataroot ./datasets/edges2shoes --name edges2shoes_pix2pix --model pix2pix --which_model_netG unet_256 --which_direction AtoB --lambda_A 100 --align_data --use_dropout --no_lsgan --batchSize 12 --niter 15 --niter_decay 15

After 1 epoch (by which point it had already saved a checkpoint), I interrupted training with Ctrl+C. The next day, I tried to continue training with the following command:

python train.py --dataroot ./datasets/edges2shoes --name edges2shoes_pix2pix --model pix2pix --which_model_netG unet_256 --which_direction AtoB --lambda_A 100 --align_data --use_dropout --no_lsgan --batchSize 12 --niter 15 --niter_decay 15 --continue_train

Note that I pass in the --continue_train option (as I read in options/train_options.py).

I notice that the generated fake image is kept, but the epoch is reset to 1 and the loss plot starts again from scratch.

Is this actually continuing the training or not? If not, how can I keep training my model after interrupting it?
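
One way to make a resumed run pick up its epoch counter is to store the counter next to the checkpoint. The sketch below is not the repository's code: the file name `train_state.json` and the helpers `save_epoch` and `load_start_epoch` are hypothetical. It only illustrates the bookkeeping for the epoch number and does not load any network weights.

```python
import json
import os

STATE_FILE = "train_state.json"  # hypothetical file name


def save_epoch(checkpoint_dir, epoch):
    """Record the last completed epoch alongside the saved weights."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, STATE_FILE), "w") as f:
        json.dump({"epoch": epoch}, f)


def load_start_epoch(checkpoint_dir, continue_train):
    """Return the epoch to start from: 1 for a fresh run, last + 1 on resume."""
    path = os.path.join(checkpoint_dir, STATE_FILE)
    if continue_train and os.path.exists(path):
        with open(path) as f:
            return json.load(f)["epoch"] + 1
    return 1
```

A training loop could call `save_epoch` at the end of every epoch and start its loop at `load_start_epoch(...)`, so that an interrupted run resumes at the next epoch instead of at 1.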

Viewing loss plot of previous result

Sorry if these questions are very basic; I am a student and new to this.

Is there a way to retrieve the loss plot or final training accuracy of previous runs?

Is Ctrl+C the correct way to stop training?

Additionally, we tried to continue a run right from where we left off, using --continue_train, but it did not seem to work. Is there something else we're missing?

CPU only

I went into the options folder and changed base_options to 1, but it is still looking for an NVIDIA chip that I do not have. Is it possible to run on CPU only?

AssertionError: Torch not compiled with CUDA enabled

When I trained a model, I got the error below. I'm using a MacBook Pro. Thanks.

  File "/usr/local/lib/python2.7/site-packages/torch/cuda/__init__.py", line 277, in __new__
    _lazy_init()
  File "/usr/local/lib/python2.7/site-packages/torch/cuda/__init__.py", line 89, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python2.7/site-packages/torch/cuda/__init__.py", line 56, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
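
The assertion above is raised when CUDA is initialized on a build without CUDA support, so a script that should also run on CPU needs to decide up front whether to touch the GPU at all. A minimal sketch, assuming GPU ids are given as a comma-separated string (as in the `--gpu_ids 0,1` command elsewhere on this page) and assuming a convention where a negative id means "CPU only". `parse_gpu_ids` and `use_cuda` are hypothetical helpers, not functions from the repository.

```python
def parse_gpu_ids(spec):
    """Turn a string such as "0,1" into [0, 1]; negative ids are dropped."""
    ids = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        gpu_id = int(token)
        if gpu_id >= 0:
            ids.append(gpu_id)
    return ids


def use_cuda(gpu_ids, cuda_available):
    """Only use CUDA when ids were requested and the build supports it."""
    return bool(gpu_ids) and cuda_available


print(parse_gpu_ids("0,1"))   # [0, 1]
print(parse_gpu_ids("-1"))    # []
print(use_cuda([], True))     # False
```

In a real script, `cuda_available` would come from `torch.cuda.is_available()`, and tensors or modules would only be moved to the GPU when `use_cuda(...)` returns True.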

Some problems with loadSize and fineSize

It seems that when only loadSize and fineSize are used, the input image is always square, so I tried to define loadSize_H and loadSize_W separately in order to give the input image a different H:W ratio.
But I got an error like this: ValueError: unknown resampling filter

Is it possible to give loadSize's height and width different values with the current network? Can you give some advice? Thanks~ @junyanz

A question regarding dropout and latent vector x

Dear authors of pytorch-CycleGAN-and-pix2pix,

I have a question regarding latent vector x for CycleGAN.

A while ago, I read a paper related to pix2pix. According to the paper, the random noise is applied to GANs by using dropout.

I can also see that there's a dropout option for CycleGAN.
However, it seems that dropout is off by default (if it is not specified).

So I have two questions.
Is the dropout option the way to provide randomness for CycleGAN?
And if the option is off, does the generator produce the same output every time?

Thanks for your great work.
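
For context, the pix2pix paper reports providing noise only in the form of dropout, applied at both training and test time. A network with no noise input and no active dropout is a deterministic function of its input. The toy sketch below, in plain Python rather than PyTorch, illustrates the difference. The `dropout` function here is a hand-written stand-in for illustration, not `torch.nn.Dropout`.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


features = [1.0] * 64

# Dropout off: the output does not depend on the random generator at all.
same_a = dropout(features, 0.0, random.Random(0))
same_b = dropout(features, 0.0, random.Random(1))
print(same_a == same_b)

# Dropout on: different random states give different masks.
noisy_a = dropout(features, 0.5, random.Random(0))
noisy_b = dropout(features, 0.5, random.Random(1))
print(noisy_a == noisy_b)
```

With dropout disabled, repeated calls on the same input return identical values, which corresponds to a generator producing the same output every time for a given input image.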

Load on laptop CPUs

I have an HP Notebook 15-ac026tx with an Intel i5 (5th gen), 4 GB RAM, and a 2 GB AMD Radeon graphics card.
I want to run your code on my laptop. Could it harm my laptop's battery or hardware?
I read somewhere on the Internet that training computationally heavy models can severely harm laptops.
Is it okay to train it on my laptop?

Question: monet2photo training loss

I'm trying to train the monet2photo. My command line was:

python train.py --dataroot ./datasets/monet2photo --name monet2photo --model cycle_gan --gpu_ids 0,1 --batchSize 8 --identity 0.5

The paper discussed using a batch size of 1, but I increased it to 8 to more fully occupy the GPUs. I think this is the only difference between what was described in the paper and my settings, but I may be wrong.

------------ Options -------------
align_data: False
batchSize: 8
beta1: 0.5
checkpoints_dir: ./checkpoints
continue_train: False
dataroot: ./datasets/monet2photo
display_freq: 100
display_id: 1
display_winsize: 256
fineSize: 256
gpu_ids: [0, 1]
identity: 0.5
input_nc: 3
isTrain: True
lambda_A: 10.0
lambda_B: 10.0
loadSize: 286
lr: 0.0002
max_dataset_size: inf
model: cycle_gan
nThreads: 2
n_layers_D: 3
name: monet2photo
ndf: 64
ngf: 64
niter: 100
niter_decay: 100
no_flip: False
no_html: False
no_lsgan: False
norm: instance
output_nc: 3
phase: train
pool_size: 50
print_freq: 100
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
use_dropout: False
which_direction: AtoB
which_epoch: latest
which_model_netD: basic
which_model_netG: resnet_9blocks
-------------- End ----------------
UnalignedDataLoader
#training images = 6287
cycle_gan

I'm training on two GTX-1070s

I'm about 80 epochs in (~40 hours on my setup) and it seems like I'm oscillating between generated 'photos' that look okay-ish and 'photos' that look pretty 'meh', more like the original painting.

My loss declined pretty rapidly for the first 20 or so epochs, but now seems to be relatively stable with occasional crazy spikes:

I think it's improving slightly with each epoch based on the images and there seems to be a slight downward trend on the loss, but I also might just be kidding myself because I've been staring at it for a while. In other words, I'm not certain that what it's generating at epoch 80 is really that much better than at epoch 30. Here's the most recent detailed loss curve.

Question: Is this expected behavior (more or less), or should I be concerned that I've plateaued and/or used the wrong settings? At 100 epochs the learning rate is set to start decreasing based on the default settings. Given that it's taking about 30 minutes per epoch and thus about 61 more hours to complete 200 epochs, I'm wondering if I should "keep on going" or "abort" and fix some settings.
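
The options dump above lists lr 0.0002, niter 100 and niter_decay 100. The CycleGAN paper describes keeping the learning rate fixed for the first 100 epochs and then decaying it linearly to zero over the next 100. A minimal sketch of that schedule, assuming epochs are counted from 1. The exact per-epoch bookkeeping in the repository may differ, and `lr_at_epoch` is a hypothetical helper.

```python
def lr_at_epoch(epoch, base_lr=0.0002, niter=100, niter_decay=100):
    """Constant learning rate for niter epochs, then linear decay to zero."""
    if epoch <= niter:
        return base_lr
    fraction_left = 1.0 - (epoch - niter) / float(niter_decay)
    return base_lr * max(0.0, fraction_left)


for epoch in (1, 80, 100, 150, 200):
    print(epoch, lr_at_epoch(epoch))
```

Under these assumptions, a run at epoch 80 is still training at the full learning rate, so the decay phase has not yet had a chance to damp the oscillations.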

Question: Regarding the unsupported operand type error

Hi.
I've been using the Torch-based CycleGAN and pix2pix code for a while and I think it's great work.
The paper is also astonishing.

Now I want to move to the PyTorch based CycleGan/Pix2pix code.
But I have an error in my environment.

My environment is
-Ubuntu 14.04
-Python 2.7
-CUDA 8

and the error message is attached below.

As far as I remember the PyTorch based code worked OK in my environment.
I guess this problem might be caused by my environment settings and it might be fixed easily.
But I don't have any clue how to fix this.

Could you give any tips for this issue?

< Error message>

python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan

...
...
...
model [CycleGANModel] was created
create web directory ./checkpoints/maps_cyclegan/web...
Traceback (most recent call last):
  File "train.py", line 20, in <module>
    for i, data in enumerate(dataset):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 212, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 239, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 41, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/media/illusion/ML_DATA_SSD_M550/pytorch-CycleGAN-and-pix2pix/data/unaligned_dataset.py", line 46, in __getitem__
    A_img = self.transform(A_img)
  File "/usr/local/lib/python2.7/dist-packages/torchvision/transforms.py", line 29, in __call__
    img = t(img)
  File "/usr/local/lib/python2.7/dist-packages/torchvision/transforms.py", line 139, in __call__
    ow = int(self.size * w / h)
TypeError: unsupported operand type(s) for /: 'list' and 'int'
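
The last frame of the traceback computes `int(self.size * w / h)`, which only works when `size` is a single integer. Here it is a list, so `list * int` repeats the list and the division then fails. A minimal sketch of a size computation that accepts either form follows. `scaled_size` is a hypothetical helper written for illustration, not torchvision's implementation, and treating a two-element sequence as (width, height) is an assumption.

```python
def scaled_size(size, w, h):
    """Return the (width, height) to resize a w x h image to.

    An int means: match the shorter edge to `size`, keep the aspect ratio.
    A two-element sequence is taken as an explicit (width, height).
    """
    if isinstance(size, int):
        if w <= h:
            return size, int(size * h / w)
        return int(size * w / h), size
    width, height = size
    return int(width), int(height)


def reproduce_error():
    """Show the failure mode from the traceback using a list-valued size."""
    try:
        [286, 286] * 600 / 300
    except TypeError as exc:
        return str(exc)
    return None


print(scaled_size(286, 600, 300))         # (572, 286)
print(scaled_size([286, 286], 600, 300))  # (286, 286)
print(reproduce_error())
```

If the transform in use only supports integer sizes, passing a single int (for example the `loadSize` value itself rather than a list built from it) avoids the error.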

multiprocessing issue with nThreads>1

Hi,

Poking around in the pix2pix code, I noticed that training sometimes stops with an error that seems to be related to multiprocessing, probably the worker processes that load images in parallel. I've set nThreads=1 and that seems to have made the error go away. Have you seen this in your experiments?

Full trace below:

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    for i, data in enumerate(dataset):
  File "/usr/local/lib/python2.7/dist-packages/future/types/newobject.py", line 71, in next
    return type(self).__next__(self)
  File "/home/nbserver/urbanization-patterns/models/pytorch-CycleGAN-and-pix2pix/data/aligned_data_loader_csv.py", line 30, in __next__
    AB, labels, AB_paths = next(self.data_loader_iter)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
    message = connection.recv_bytes(256) # reject large message
IOError: [Errno 104] Connection reset by peer

Out of memory when training my own datasets

I want to train on my own dataset of ~4800 training images, each 512×512. No matter whether I set --loadSize (and --fineSize) to 512, 256, or 128, the program runs out of memory on an NVIDIA GTX 1080 (8 GB GPU memory).

I'm new to PyTorch. I wonder whether this is caused by PyTorch or whether my GPU memory is simply insufficient for your code.

Incorrect instruction (memory dump)

CustomDatasetDataLoader
dataset [AlignedDataset] was created
#training images = 400
pix2pix
Illegal instruction (core dumped)

