Latent layers importance · stylegan-encoder (open, 19 comments)

puzer commented on August 18, 2024
Latent layers importance

Comments (19)

pender commented on August 18, 2024

@jcpeterson

> @pbaylies Yes, I see how it works now. I just don't like the fine-grained texture all of the encodings get. At full size, none of them have the clear quality of the actual samples. I attribute this to over-optimization using the style layers. If I can get hold of @pender's code, I can search for the subset of the first N dlatents (higher-level, less style-focused dimensions) that suffice.

Here you go, I uploaded it to a forked repo today. Ultimately I had some success. Interestingly, I was able to use perceptual loss from the discriminator net instead of a separately trained image-classifier net. I'm not sure whether that has been done before.

pender commented on August 18, 2024

@jcpeterson The encoder in this repo produces an 18x512 dlatent vector in which the layers all differ, unlike the mapping network, which, as noted above, produces a 1x512 dlatent vector that is tiled up to 18x512. I haven't found an easy way to get a 1x512 dlatent vector from this repo's encoder that can be tiled to reproduce the encoded image.

I've tried:

  • tiling each of the resulting 18 layers individually up to 18x512 (each of which produced very distorted faces seen from a variety of sharp angles),
  • averaging the 18 layers (which produced slight variations on a single female face in every attempt I tried), and
  • adapting the encoder script to optimize a single 1x512 vector that is tiled back to 18x512 before synthesizing (which didn't do a very good job of reproducing the input image; see the sketch below).

I am going to try that last method again with the Adam optimizer (an implementation is discussed by others in another thread in this repo) (UPDATE: results still terrible), and if that fails, I'm going to start from scratch with a new encoder using the method described in this ICLR paper, which dispenses with perceptual loss altogether (I found another repo that uses that method on ProGAN here).

I also need to run a control to see if the existing encoder can reconstruct a face produced by the GAN in the first place, in case the face images I've provided just aren't well represented in the dlatent space.
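
A minimal NumPy sketch of the three strategies listed above; the random `dlatent` is just a stand-in for the encoder's 18x512 output, and the synthesis call is omitted:

    import numpy as np

    # Stand-in for the encoder's 18x512 output; in the repo this comes from
    # the optimization loop, not from random noise.
    dlatent = np.random.randn(18, 512).astype(np.float32)

    # 1) Tile each of the 18 layers individually to a full (18, 512) dlatent.
    per_layer = [np.tile(dlatent[i:i + 1], (18, 1)) for i in range(18)]

    # 2) Average the 18 layers, then tile the mean back up.
    averaged = np.tile(dlatent.mean(axis=0, keepdims=True), (18, 1))

    # 3) Optimize a single (1, 512) vector v and tile it before synthesis,
    #    i.e. feed np.tile(v, (18, 1)) to the generator inside the loss.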

pender commented on August 18, 2024

@jcpeterson

He states:

> dlatent = mapping_network(qlatent) = shape=(18, 512)

That's technically correct, but I'm not sure he realized at that time that the (18, 512) tensor that the mapping network outputs is in fact a single (512,) tensor that is tiled to (18, 512). But it is!
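
A quick way to see this, sketched in NumPy; the tiled `dlatent` below simulates what the mapping network returns for one qlatent:

    import numpy as np

    # Simulate the mapping network's output: one (512,) vector tiled to (18, 512).
    w = np.random.randn(512).astype(np.float32)
    dlatent = np.tile(w[None, :], (18, 1))

    # Every row is the same vector, so the (18, 512) shape carries no extra
    # information for samples drawn through the mapping network.
    assert all(np.array_equal(dlatent[0], dlatent[i]) for i in range(18))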

I don't have a public fork, I've just been noodling with a copy of the source files on my PC.

pender commented on August 18, 2024

@jcpeterson Yes, the dlatents obtained by the encoding script are clearly out-of-distribution -- that is easily demonstrated by tiling a single layer of the dlatent output up to a new 18x512 dlatent and observing the fleshy horrors that emerge. (On the other hand, they achieve their intended purpose, since they do reconstruct the image, and image transformations learned on native images seem to work on encoded dlatents.)

I don't think trying to learn qlatents directly is likely to be fruitful... that would be the same challenge we are already working with, PLUS trying to reverse the complex transformation of the 8-layer fully-connected mapping network. Qlatents are useful for randomly generating images, but for everything else I think we're better off working on dlatents directly (whether pre- or post-tiling).

pbaylies commented on August 18, 2024

@jcpeterson My guess as to why the encoder works better as-is, is that it has the full latent space to search through, so while the mapping network is designed to find specific points on (or near) the manifold, the encoder can find the spaces in between that look the most like the target image, even if the mapping network wouldn't be likely to find that same region given its training.

jcpeterson commented on August 18, 2024

@pbaylies Yes, I see how it works now. I just don't like the fine-grained texture all of the encodings get. At full size, none of them have the clear quality of the actual samples. I attribute this to over-optimization using the style layers. If I can get hold of @pender's code, I can search for the subset of the first N dlatents (higher-level, less style-focused dimensions) that suffice.

tals commented on August 18, 2024

> @jcpeterson My guess as to why the encoder works better as-is, is that it has the full latent space to search through, so while the mapping network is designed to find specific points on (or near) the manifold, the encoder can find the spaces in between that look the most like the target image, even if the mapping network wouldn't be likely to find that same region given its training.

When experimenting with the net, I've noticed StyleGAN behaves much better when it comes to interpolation and mixing if you "play by the rules", e.g. use a single 1x512 dlatent vector to represent your target image.
With 18x512, we're kind of cheating. In fact, Image2StyleGAN shows that you can encode images this way on a completely randomly initialized net! (Although interpolation is pretty meaningless in that instance.)

A test I've tried was to apply this to a subset of W. I tried W{3}: the first two layers captured the pose and some color, while the third one looked similar to my target face, but at a different angle and with some broken elements.

Note that doing W{1} is possible, but it's harder for an encoder to accurately hit that goal. Some tricks, like masking out the background when calculating the loss, seem to help a lot.

Additionally, the distribution of the optimized latents appears different from that of the latents generated by the mapping network.
To counter this, you can "push" the latent towards the mean (via something like an L2 penalty). This helps StyleGAN work better, but it's a hack, since the natural distribution of latents in W doesn't quite look that way.
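
A hedged sketch of that L2 pull, written as a per-step loss over a single (512,) dlatent `w`; `synth_feats` is a hypothetical stand-in for "synthesize, then extract perceptual features", and `dlatent_avg` would come from the generator's stored average (e.g. Gs.get_var('dlatent_avg') in the official StyleGAN code):

    import numpy as np

    def regularized_loss(w, target_feats, dlatent_avg, synth_feats, l2_weight=1e-3):
        # Perceptual reconstruction term: match features of the synthesized image.
        perceptual = np.sum((synth_feats(w) - target_feats) ** 2)
        # Prior term: push the optimized latent back towards the W mean.
        prior = l2_weight * np.sum((w - dlatent_avg) ** 2)
        return perceptual + prior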

tals commented on August 18, 2024

To illustrate the last point, here's a simple ablation test (this is all done in W{1}):

target_image:
[image]

mask:
[image]

Target = target_image * mask (so only the white pixels are optimized for).

Without any regularization, it looks kind of broken (I affectionately call this "blobama"):
[image]

By pulling it back to the W latent mean with L2, it looks a lot better (but still not quite correct; notice the strange asymmetry and the scar-like artifact):
[image]
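
A minimal sketch of the masked objective described above, assuming binary masks and HxWx3 float images (all names are placeholders):

    import numpy as np

    def masked_l2(generated, target, mask):
        # mask is 1 on pixels that count (the white region) and 0 elsewhere,
        # so the optimizer is free to put anything in the background.
        return np.sum(((generated - target) * mask) ** 2) / np.sum(mask)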

pbaylies commented on August 18, 2024

@tals I've also been playing around with encoders and training a ResNet to encode; I had some similar ideas about using the mapping network to generate more dlatent values as training examples. I haven't compared the distributions, though; that'd be good to know. I've been using L1 regularization to pull back towards the mean while encoding.

tals commented on August 18, 2024

> @tals I've also been playing around with encoders and training a ResNet to encode; I had some similar ideas about using the mapping network to generate more dlatent values as training examples. I haven't compared the distributions, though; that'd be good to know. I've been using L1 regularization to pull back towards the mean while encoding.

Yeah, the distribution doesn't look quite right, even with L1/L2 (why did you choose L1, by the way?).
This is easily visible if you take the histogram of an optimized latent versus a "natural" one obtained from the mapping network.

The trick of finding a better starting position with the ResNet encoder is interesting.
It would probably work better if it got to see the gradient from StyleGAN, instead of just working on <rgb, latent> pairs.
Maybe something like: loss = alpha * distance(predicted_latent, target_latent) + beta * perceptual_distance(synthesis_net(predicted_latent), source_rgb)

P.S.: I think you have a subtle bug: you use L1 to pull the latent back to zero instead of to dlatent_avg.
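
A hedged sketch of that combined objective; `synth_feats` again stands in for "run synthesis_net, then extract perceptual features", and every name here is a placeholder rather than this repo's API:

    import numpy as np

    def encoder_loss(predicted_latent, target_latent, source_feats, synth_feats,
                     alpha=1.0, beta=1.0):
        # Latent term: stay close to a known-good dlatent from the mapping network.
        latent_term = alpha * np.sum((predicted_latent - target_latent) ** 2)
        # Image term: the StyleGAN gradient corrects the encoder where the
        # latent target alone is ambiguous.
        image_term = beta * np.sum((synth_feats(predicted_latent) - source_feats) ** 2)
        return latent_term + image_term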

pender commented on August 18, 2024

@pbaylies thanks for your repo, I'm particularly enjoying playing with the EfficientNet inverse network. Would you be willing to turn on issues for your repo?

Summarizing the issue I'm looking to report to you: since the effnet is trained to match dlatents in which all 18 layers have the same values (the training targets are outputs of the StyleGAN mapping network, which tiles a single [1, 512] vector up to [18, 512]), I think the effnet's output layer should be a single [1, 512] vector instead of an [18, 512] vector, even if you then prefer to tile the [1, 512] vector up to 18 layers. Right now all 18 layers of its output differ, unlike the outputs of the mapping network.

pbaylies commented on August 18, 2024

@pender Thanks, I didn't realize that issues were off in my repo! They are on now. I'd be amenable to having a flag to enable using a single [1, 512] vector (or similar options, maybe a [3, 512] vector to tile up coarse, medium, and fine attributes); I'm sure it'd train faster, but I think having it wider in general ultimately yields better results, as it covers more of the latent space. I already have a flag on the encoder to support the tiling-up behavior, so you can compare results on that end.
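
A tiny sketch of that middle-ground option, assuming (as a convention, not this repo's API) that rows 0-5 / 6-11 / 12-17 of the 18x512 dlatent correspond to coarse / medium / fine styles:

    import numpy as np

    w3 = np.random.randn(3, 512).astype(np.float32)  # stand-in encoder output
    dlatent = np.repeat(w3, 6, axis=0)               # (18, 512): 6 copies per row
    assert dlatent.shape == (18, 512)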

jcpeterson commented on August 18, 2024

@tals does this mean the encoder code also optimizes the repeated tiled vectors incorrectly?

jcpeterson commented on August 18, 2024

@pender Thanks for the info. That seems at odds with this repo author's post here: https://www.reddit.com/r/MachineLearning/comments/aq6jxf/p_stylegan_encoder_from_real_images_to_latent/egg4rkl

He states:

> qlatent = normally distributed noise which have shape=(512)
> dlatent = mapping_network(qlatent) = shape=(18, 512)

I tried Adam, and while it's much faster and reaches lower loss, I get more artifacts.

The paper you linked just looks like a pixel-loss version of the current repo; or perhaps you meant the clipping, which I think is a great idea. Please ping me if you implement that. Do you have a public fork?

jcpeterson commented on August 18, 2024

@pender That means the encoder is simply set up incorrectly. If the network is not trained to use 18 different vectors, the dlatents currently learned are out-of-distribution. It's extremely odd, then, that your third strategy above didn't work. Have you tried learning qlatent instead?

jcpeterson commented on August 18, 2024

@pender I don't see the difference. The encoder already backprops through the generator to the dlatents. Why not just backprop through the fixed mapping network too, and find a qlatent such that, when mapped and then rendered, a good match is attained via the VGG loss?
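
A hedged TF1-style sketch of that idea, in the spirit of this repo's graph code; mapping_net, synthesis_net, vgg_features, and target_image are hypothetical handles, not this repo's actual API:

    import tensorflow as tf

    # Only qlatent is trainable; both StyleGAN networks and VGG stay frozen,
    # so gradients flow through them back to the qlatent variable.
    qlatent = tf.get_variable('qlatent', shape=[1, 512],
                              initializer=tf.random_normal_initializer())
    dlatent = mapping_net(qlatent)        # frozen 8-layer mapping network
    generated = synthesis_net(dlatent)    # frozen synthesis network
    loss = tf.reduce_sum(
        (vgg_features(generated) - vgg_features(target_image)) ** 2)
    train_op = tf.train.AdamOptimizer(1e-2).minimize(loss, var_list=[qlatent])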

jcpeterson commented on August 18, 2024

@pender Can you please share your code for: "adapting the encoder script to optimize a single 1x512 vector that is tiled back to 18x512 before synthesizing (which didn't do a very good job of reproducing the input image)."

pbaylies commented on August 18, 2024

@tals L1 was just the first thing I tried as a regularization penalty, but it cut down on artifacts a lot; before, I'd get more blurry results from my loss function. And yes, a good starting prediction helps.

P.S.: Thanks for the tip, I think you're right!

pbaylies commented on August 18, 2024

Thanks @pender, very nice results! I merged your changes for learning-rate decay, stochastic gradient clipping, and tiled dlatents into my fork as well.
