Latent layers importance · stylegan-encoder (open, 19 comments)

puzer commented on August 18, 2024
Latent layers importance

Comments (19)

pender commented on August 18, 2024

@jcpeterson

> @pbaylies Yes, I see how it works now. I just don't like the fine-grained texture all of the encodings get. At full size, none of them have the clear quality of the actual samples. I attribute this to over-optimization using the style layers. If I can get hold of @pender's code, I can search for the subset of the first N dlatents (higher-level, less style-focused dimensions) that suffice.

Here you go, I uploaded it to a forked repo today. Ultimately I had some success. Interestingly, I was able to use perceptual loss from the discriminator net instead of a separately trained image-classifier net. I'm not sure whether that has been done before.

pender commented on August 18, 2024

@jcpeterson The encoder in this repo produces an 18x512 dlatent vector in which the layers all differ, unlike the mapping network, which, as noted above, produces a 1x512 dlatent vector that is tiled up to 18x512. I haven't found an easy way to get a 1x512 dlatent vector from this repo's encoder that can be tiled to reproduce the encoded image.

I've tried:

  • tiling each of the resulting 18 layers individually up to 18x512 (each of which produced very distorted faces seen from a variety of sharp angles),
  • averaging the 18 layers (which produced slight variations on a single female face in every attempt I tried), and
  • adapting the encoder script to optimize a single 1x512 vector that is tiled back to 18x512 before synthesizing (which didn't do a very good job of reproducing the input image; see the sketch below).

I am going to try that last method again with the Adam optimizer (an implementation is discussed by others in another thread in this repo) (UPDATE: results still terrible), and if that fails, I'm going to start from scratch with a new encoder using the method described in this ICLR paper, which dispenses with perceptual loss altogether (I found another repo that uses that method on ProGAN here).

I also need to run a control to see if the existing encoder can reconstruct a face produced by the GAN in the first place, in case the face images I've provided just aren't well represented in the dlatent space.
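
A minimal NumPy sketch of the three strategies listed above; the random `dlatent` is just a stand-in for the encoder's 18x512 output, and the synthesis call is omitted:

    import numpy as np

    # Stand-in for the encoder's 18x512 output; in the repo this comes from
    # the optimization loop, not from random noise.
    dlatent = np.random.randn(18, 512).astype(np.float32)

    # 1) Tile each of the 18 layers individually to a full (18, 512) dlatent.
    per_layer = [np.tile(dlatent[i:i + 1], (18, 1)) for i in range(18)]

    # 2) Average the 18 layers, then tile the mean back up.
    averaged = np.tile(dlatent.mean(axis=0, keepdims=True), (18, 1))

    # 3) Optimize a single (1, 512) vector v and tile it before synthesis,
    #    i.e. feed np.tile(v, (18, 1)) to the generator inside the loss.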

pender commented on August 18, 2024

@jcpeterson

He states:

> dlatent = mapping_network(qlatent) = shape=(18, 512)

That's technically correct, but I'm not sure he realized at that time that the (18, 512) tensor that the mapping network outputs is in fact a single (512,) tensor that is tiled to (18, 512). But it is!
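
A quick way to see this, sketched in NumPy; the tiled `dlatent` below simulates what the mapping network returns for one qlatent:

    import numpy as np

    # Simulate the mapping network's output: one (512,) vector tiled to (18, 512).
    w = np.random.randn(512).astype(np.float32)
    dlatent = np.tile(w[None, :], (18, 1))

    # Every row is the same vector, so the (18, 512) shape carries no extra
    # information for samples drawn through the mapping network.
    assert all(np.array_equal(dlatent[0], dlatent[i]) for i in range(18))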

I don't have a public fork, I've just been noodling with a copy of the source files on my PC.

pender commented on August 18, 2024

@jcpeterson Yes, the dlatents obtained by the encoding script are clearly out-of-distribution -- that is easily demonstrated by tiling a single layer of the dlatent output up to a new 18x512 dlatent and observing the fleshy horrors that emerge. (On the other hand, they achieve their intended purpose, since they do reconstruct the image, and image transformations learned on native images seem to work on encoded dlatents.)

I don't think trying to learn qlatents directly is likely to be fruitful... that would be the same challenge we are already working with, PLUS trying to reverse the complex transformation of the 8-layer fully-connected mapping network. Qlatents are useful for randomly generating images, but for everything else I think we're better off working on dlatents directly (whether pre- or post-tiling).

pbaylies commented on August 18, 2024

@jcpeterson My guess as to why the encoder works better as-is, is that it has the full latent space to search through, so while the mapping network is designed to find specific points on (or near) the manifold, the encoder can find the spaces in between that look the most like the target image, even if the mapping network wouldn't be likely to find that same region given its training.

jcpeterson commented on August 18, 2024

@pbaylies Yes, I see how it works now. I just don't like the fine-grained texture all of the encodings get. At full size, none of them have the clear quality of the actual samples. I attribute this to over-optimization using the style layers. If I can get hold of @pender's code, I can search for the subset of the first N dlatents (higher-level, less style-focused dimensions) that suffice.

tals commented on August 18, 2024

> @jcpeterson My guess as to why the encoder works better as-is, is that it has the full latent space to search through, so while the mapping network is designed to find specific points on (or near) the manifold, the encoder can find the spaces in between that look the most like the target image, even if the mapping network wouldn't be likely to find that same region given its training.

When experimenting with the net, I've noticed StyleGAN behaves much better when it comes to interpolation and mixing if you "play by the rules", e.g. use a single 1x512 dlatent vector to represent your target image.
With 18x512, we're kind of cheating. In fact, Image2StyleGAN shows that you can encode images this way on a completely randomly initialized net! (Although interpolation is pretty meaningless in that instance.)

A test I've tried was to apply this to a subset of W. I tried W{3}: the first two layers captured the pose and some color, while the third one looked similar to my target face, but at a different angle and with some broken elements.

Note that doing W{1} is possible, but it's harder for an encoder to accurately hit that goal. Some tricks, like masking out the background when calculating the loss, seem to help a lot.

Additionally, the distribution of the optimized latents appears different from that of the latents generated by the mapping network.
To counter this, you can "push" the latent towards the mean (via something like an L2 penalty). This helps StyleGAN work better, but it's a hack, since the natural distribution of latents in W doesn't quite look that way.
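
A hedged sketch of that L2 pull, written as a per-step loss over a single (512,) dlatent `w`; `synth_feats` is a hypothetical stand-in for "synthesize, then extract perceptual features", and `dlatent_avg` would come from the generator's stored average (e.g. Gs.get_var('dlatent_avg') in the official StyleGAN code):

    import numpy as np

    def regularized_loss(w, target_feats, dlatent_avg, synth_feats, l2_weight=1e-3):
        # Perceptual reconstruction term: match features of the synthesized image.
        perceptual = np.sum((synth_feats(w) - target_feats) ** 2)
        # Prior term: push the optimized latent back towards the W mean.
        prior = l2_weight * np.sum((w - dlatent_avg) ** 2)
        return perceptual + prior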

tals commented on August 18, 2024

To illustrate the last point, here's a simple ablation test (this is all done in W{1}):

target_image:
[image]

mask:
[image]

Target = target_image * mask (so only the white pixels are optimized for).

Without any regularization, it looks kind of broken (I affectionately call this "blobama"):
[image]

By pulling it back to the W latent mean with L2, it looks a lot better (but still not quite correct; notice the strange asymmetry and the scar-like artifact):
[image]
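
A minimal sketch of the masked objective described above, assuming binary masks and HxWx3 float images (all names are placeholders):

    import numpy as np

    def masked_l2(generated, target, mask):
        # mask is 1 on pixels that count (the white region) and 0 elsewhere,
        # so the optimizer is free to put anything in the background.
        return np.sum(((generated - target) * mask) ** 2) / np.sum(mask)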

pbaylies commented on August 18, 2024

@tals I've also been playing around with encoders and training a ResNet to encode; I had some similar ideas about using the mapping network to generate more dlatent values as training examples. I haven't compared the distributions, though; that'd be good to know. I've been using L1 regularization to pull back towards the mean while encoding.

tals commented on August 18, 2024

> @tals I've also been playing around with encoders and training a ResNet to encode; I had some similar ideas about using the mapping network to generate more dlatent values as training examples. I haven't compared the distributions, though; that'd be good to know. I've been using L1 regularization to pull back towards the mean while encoding.

Yeah, the distribution doesn't look quite right, even with L1/L2 (why did you choose L1, by the way?).
This is easily visible if you take the histogram of an optimized latent versus a "natural" one obtained from the mapping network.

The trick of finding a better starting position with the ResNet encoder is interesting.
It would probably work better if it got to see the gradient from StyleGAN, instead of just working on <rgb, latent> pairs.
Maybe something like: loss = alpha * distance(predicted_latent, target_latent) + beta * perceptual_distance(synthesis_net(predicted_latent), source_rgb)

P.S.: I think you have a subtle bug: you use L1 to pull the latent back to zero instead of to dlatent_avg.
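
A hedged sketch of that combined objective; `synth_feats` again stands in for "run synthesis_net, then extract perceptual features", and every name here is a placeholder rather than this repo's API:

    import numpy as np

    def encoder_loss(predicted_latent, target_latent, source_feats, synth_feats,
                     alpha=1.0, beta=1.0):
        # Latent term: stay close to a known-good dlatent from the mapping network.
        latent_term = alpha * np.sum((predicted_latent - target_latent) ** 2)
        # Image term: the StyleGAN gradient corrects the encoder where the
        # latent target alone is ambiguous.
        image_term = beta * np.sum((synth_feats(predicted_latent) - source_feats) ** 2)
        return latent_term + image_term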

pender commented on August 18, 2024

@pbaylies thanks for your repo, I'm particularly enjoying playing with the EfficientNet inverse network. Would you be willing to turn on issues for your repo?

Summarizing the issue I'm looking to report to you: since the effnet is trained to match dlatents in which all 18 layers have the same values (the training targets are outputs of the StyleGAN mapping network, which tiles a single [1, 512] vector up to [18, 512]), I think the effnet's output layer should be a single [1, 512] vector instead of an [18, 512] vector, even if you then prefer to tile the [1, 512] vector up to 18 layers. Right now all 18 layers of its output differ, unlike the outputs of the mapping network.

pbaylies commented on August 18, 2024

@pender Thanks, I didn't realize that issues were off in my repo! They are on now. I'd be amenable to having a flag to enable using a single [1, 512] vector (or similar options, maybe a [3, 512] vector to tile up coarse, medium, and fine attributes); I'm sure it'd train faster, but I think having it wider in general ultimately yields better results, as it covers more of the latent space. I already have a flag on the encoder to support the tiling-up behavior, so you can compare results on that end.
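
A tiny sketch of that middle-ground option, assuming (as a convention, not this repo's API) that rows 0-5 / 6-11 / 12-17 of the 18x512 dlatent correspond to coarse / medium / fine styles:

    import numpy as np

    w3 = np.random.randn(3, 512).astype(np.float32)  # stand-in encoder output
    dlatent = np.repeat(w3, 6, axis=0)               # (18, 512): 6 copies per row
    assert dlatent.shape == (18, 512)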

jcpeterson commented on August 18, 2024

@tals does this mean the encoder code also optimizes the repeated tiled vectors incorrectly?

jcpeterson commented on August 18, 2024

@pender Thanks for the info. That seems at odds with this repo author's post here: https://www.reddit.com/r/MachineLearning/comments/aq6jxf/p_stylegan_encoder_from_real_images_to_latent/egg4rkl

He states:

> qlatent = normally distributed noise which have shape=(512)
> dlatent = mapping_network(qlatent) = shape=(18, 512)

I tried Adam, and while it's much faster and reaches lower loss, I get more artifacts.

The paper you linked just looks like a pixel-loss version of the current repo; or perhaps you meant the clipping, which I think is a great idea. Please ping me if you implement that. Do you have a public fork?

jcpeterson commented on August 18, 2024

@pender That means the encoder is simply set up incorrectly. If the network is not trained to use 18 different vectors, the dlatents currently learned are out-of-distribution. It's extremely odd, then, that your third strategy above didn't work. Have you tried learning qlatent instead?

jcpeterson commented on August 18, 2024

@pender I don't see the difference. The encoder already backprops through the generator to the dlatents. Why not just backprop through the fixed mapping network too, and find a qlatent such that, when mapped and then rendered, a good match is attained via the VGG loss?
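
A hedged TF1-style sketch of that idea, in the spirit of this repo's graph code; mapping_net, synthesis_net, vgg_features, and target_image are hypothetical handles, not this repo's actual API:

    import tensorflow as tf

    # Only qlatent is trainable; both StyleGAN networks and VGG stay frozen,
    # so gradients flow through them back to the qlatent variable.
    qlatent = tf.get_variable('qlatent', shape=[1, 512],
                              initializer=tf.random_normal_initializer())
    dlatent = mapping_net(qlatent)        # frozen 8-layer mapping network
    generated = synthesis_net(dlatent)    # frozen synthesis network
    loss = tf.reduce_sum(
        (vgg_features(generated) - vgg_features(target_image)) ** 2)
    train_op = tf.train.AdamOptimizer(1e-2).minimize(loss, var_list=[qlatent])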

jcpeterson commented on August 18, 2024

@pender Can you please share your code for: "adapting the encoder script to optimize a single 1x512 vector that is tiled back to 18x512 before synthesizing (which didn't do a very good job of reproducing the input image)."

pbaylies commented on August 18, 2024

@tals L1 was just the first thing I tried as a regularization penalty, but it cut down on artifacts a lot; before, I'd get more blurry results from my loss function. And yes, a good starting prediction helps.

P.S.: Thanks for the tip, I think you're right!

pbaylies commented on August 18, 2024

Thanks @pender, very nice results! I merged your changes for learning-rate decay, stochastic gradient clipping, and tiled dlatents into my fork as well.
