
Comments (7)

fkcptlst commented on May 29, 2024

The part in the paper:

[image: excerpt from the paper]

from infedit.

fkcptlst commented on May 29, 2024

regarding cross attention control

Additionally, the CrossEdit operation described in Algorithm 3 of the paper is also inconsistent with the implementation.

[image: Algorithm 3 from the paper]

Algorithm 3 uses $M^{lay}$ to edit $M^{tgt}$. The implementation is as follows:

InfEdit/app_infedit.py

Lines 255 to 268 in d9f6c1b

```python
def forward(self, attn, is_cross: bool, place_in_unet: str):
    if is_cross:
        h = attn.shape[0] // self.batch_size
        attn = attn.reshape(self.batch_size, h, *attn.shape[1:])
        attn_base, attn_repalce, attn_masa = attn[0], attn[1], attn[2]
        attn_replace_new = self.replace_cross_attention(attn_masa, attn_repalce)
        attn_base_store = self.replace_cross_attention(attn_base, attn_repalce)
        if (self.cross_replace_steps >= ((self.cur_step + self.start_steps + 1) * 1.0 / self.num_steps)):
            attn[1] = attn_base_store
        attn_store = torch.cat([attn_base_store, attn_replace_new])
        attn = attn.reshape(self.batch_size * h, *attn.shape[2:])
        attn_store = attn_store.reshape(2 * h, *attn_store.shape[2:])
        super(AttentionControlEdit, self).forward(attn_store, is_cross, place_in_unet)
    return attn
```

To my understanding, attn_base, attn_repalce, attn_masa correspond to $M^{src}, M^{tgt}, M^{lay}$, respectively.

As line 263 shows, the code applies CrossEdit with $M^{src}$ instead of $M^{lay}$ as described in the paper.
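For concreteness, here is a minimal sketch of the discrepancy, with a hypothetical stub in place of the real replace_cross_attention (the stub and the toy values are mine, purely for illustration):

```python
import numpy as np

# Hypothetical stand-in for replace_cross_attention, for illustration only:
# it blends the "editing" map into the map being edited.
def replace_cross_attention(attn_from, attn_to):
    return 0.5 * (attn_from + attn_to)

# Toy attention maps standing in for M_src, M_tgt, M_lay.
M_src, M_tgt, M_lay = np.array([0.1]), np.array([0.5]), np.array([0.9])

as_in_paper = replace_cross_attention(M_lay, M_tgt)  # Algorithm 3: M_lay edits M_tgt
as_in_code = replace_cross_attention(M_src, M_tgt)   # the code passes attn_base (M_src)

print(as_in_paper, as_in_code)  # the two disagree whenever M_src != M_lay
```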

I'm trying to dive into this work, but I'm puzzled by these inconsistencies. I wonder which one I should stick to: the code or the paper?

Looking forward to your reply.


fkcptlst commented on May 29, 2024

regarding local attention blends

The local attention blends are implemented differently as well.

[image: the local attention blend equations from the paper]

I'm assuming the $(m^{tgt} - m^{src})$ and $(1 - m^{tgt} + m^{src})$ terms perform arithmetic operations rather than logical operations.

Below are the actual implementations.

InfEdit/app_infedit.py

Lines 65 to 93 in d9f6c1b

```python
def __call__(self, i, x_s, x_t, x_m, attention_store, alpha_prod, temperature=0.15, use_xm=False):
    maps = attention_store["down_cross"][2:4] + attention_store["up_cross"][:3]
    h, w = x_t.shape[2], x_t.shape[3]
    h, w = ((h + 1) // 2 + 1) // 2, ((w + 1) // 2 + 1) // 2
    maps = [item.reshape(2, -1, 1, h // int((h * w / item.shape[-2]) ** 0.5), w // int((h * w / item.shape[-2]) ** 0.5), MAX_NUM_WORDS) for item in maps]
    maps = torch.cat(maps, dim=1)
    maps_s = maps[0, :]
    maps_m = maps[1, :]
    thresh_e = temperature / alpha_prod ** (0.5)
    if thresh_e < self.thresh_e:
        thresh_e = self.thresh_e
    thresh_m = self.thresh_m
    mask_e = self.get_mask(x_t, maps_m, self.alpha_e, thresh_e, i)
    mask_m = self.get_mask(x_t, maps_s, (self.alpha_m - self.alpha_me), thresh_m, i)
    mask_me = self.get_mask(x_t, maps_m, self.alpha_me, self.thresh_e, i)
    if self.save_inter:
        self.save_image(mask_e, i, "mask_e")
        self.save_image(mask_m, i, "mask_m")
        self.save_image(mask_me, i, "mask_me")
    if self.alpha_e.sum() == 0:
        x_t_out = x_t
    else:
        x_t_out = torch.where(mask_e, x_t, x_m)
    x_t_out = torch.where(mask_m, x_s, x_t_out)
    if use_xm:
        x_t_out = torch.where(mask_me, x_m, x_t_out)
    return x_m, x_t_out
```

  1. arithmetic vs logical: Note that torch.where is used to perform the masking. This amounts to a logical operation rather than the arithmetic operations described in the paper. Since there is no guarantee that $m^{tgt}$ and $m^{src}$ do not overlap, I believe the implementation and the description are inconsistent.
  2. masks and blending: Please correct me if I'm wrong. In my understanding, alpha_e highlights tokens in the target prompt that also appear in the target blend prompt, while alpha_m highlights tokens in the target prompt that are also present in the source prompt. I'm confused because the implementation blends the source x_s, the mutual x_m, and the target x_t, whereas the paper only mentions blending the source x_s and the target x_t.
  3. implementation of the mapper function: The mapper implementation differs drastically from the one in Prompt-to-Prompt. As I mentioned earlier in issue #21, I don't understand why the cosine-similarity search is necessary, nor why that search should alter the value of alpha_e (alpha_e[max_t] = 1). I could not find relevant information in the paper. Could you elaborate a little more on the logic of the mapper implementation?
  4. temperature: there is a temperature in the local blend that controls thresh_e; could you please explain the design of that as well?
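To illustrate point 1, here is a toy example (numpy stand-ins for the torch tensors; all values are made up) of where the paper's arithmetic blend and a torch.where-style logical blend disagree, namely in cells covered by $m^{src}$ but not $m^{tgt}$:

```python
import numpy as np

# Toy latents and blend masks (numpy stand-ins for the torch tensors).
x_s = np.full((2, 2), 1.0)    # "source" latent
x_t = np.full((2, 2), 10.0)   # "target" latent
m_src = np.array([[1.0, 0.0], [0.0, 0.0]])
m_tgt = np.array([[0.0, 1.0], [0.0, 0.0]])

# Arithmetic blend, as written in the paper:
arith = x_t * (m_tgt - m_src) + x_s * (1 - m_tgt + m_src)

# Logical blend, as torch.where effectively computes it:
logic = np.where((m_tgt - m_src) > 0, x_t, x_s)

print(arith)  # [[-8. 10.], [1. 1.]] -- weight -1 where m_src=1, m_tgt=0
print(logic)  # [[ 1. 10.], [1. 1.]] -- always picks one of x_s, x_t
```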


SihanXU commented on May 29, 2024

Hi fkcptlst, thanks for your issue and for carefully reviewing our code.
For the first and second questions: I think these are small bugs in our implementation. Replacing both q_tgt and k_tgt leads to a K/V mismatch, as shown in Fig. 5 of https://arxiv.org/pdf/2403.02332.pdf, and results in worse performance. And we should indeed use M_lay as the attention map.
For the attention blending, we did make some updates, and we will reflect them in a future version of the paper.
As for the mapper function, it only exists to let users run the gradio demo without manually choosing between the replace and refine modes described in prompt2prompt. It has no relation to the method in our paper.
As for the temperature, I believe it is a leftover from the development process, possibly from a previous project that was not cleaned up. Sorry for the confusion.


fkcptlst commented on May 29, 2024

> Hi fkcptlst, thanks for your issue and for carefully reviewing our code. For the first and second questions: I think these are small bugs in our implementation. Replacing both q_tgt and k_tgt leads to a K/V mismatch, as shown in Fig. 5 of https://arxiv.org/pdf/2403.02332.pdf, and results in worse performance. And we should indeed use M_lay as the attention map. For the attention blending, we did make some updates, and we will reflect them in a future version of the paper. As for the mapper function, it only exists to let users run the gradio demo without manually choosing between the replace and refine modes described in prompt2prompt. It has no relation to the method in our paper. As for the temperature, I believe it is a leftover from the development process, possibly from a previous project that was not cleaned up. Sorry for the confusion.

Hi, thanks for your reply! Your work is really impressive by the way.

I'm working on a project that's based on what you've done, so I need to make sure I get some things right in my implementation.

  1. So, for the first question, I should replace k_tgt and v_tgt (as the paper says) in my refactored code, right?
  2. Regarding the second question, I should use $M^{lay}$ (also as in the paper) in my refactored code, correct?
  3. About the temperature: would you recommend just removing it?
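On the first point, my reading of the K/V mismatch argument, sketched with a toy single-head attention (shapes, names, and random values here are my assumptions, not the InfEdit code):

```python
import numpy as np

# Toy single-head scaled dot-product attention.
def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q_tgt, k_tgt, v_tgt = (rng.standard_normal((4, 8)) for _ in range(3))
k_src, v_src = (rng.standard_normal((4, 8)) for _ in range(2))

# Paper's variant: swap K and V together, so each source key still
# gates its own source value.
paired = attention(q_tgt, k_src, v_src)

# Mismatched variant: source keys select among target values, so the
# attention weights no longer correspond to the values they weight.
mismatched = attention(q_tgt, k_src, v_tgt)
```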

I'm still confused about the return values of the mapper. Here's my understanding and how it's actually implemented.

How I understand they should be:

  1. mapper: mapper[i] = j means tgt[i] = src[j]
  2. alphas: alphas[i] = 1 means tgt[i] has a matching token in src
  3. m: unused
  4. alpha_e: alpha_e[i] = 1 iff there exists j s.t. tgt_blend[j] = tgt[i] (tokens in both tgt and tgt_blend)
  5. alpha_m: alpha_m[i] = 1 iff there exists j s.t. src[mapper[i]] = tgt[i] = src_blend[j] (tokens in src, tgt, and src_blend)

How it's actually implemented:

  1. mapper: mapper[i] = j means tgt[i] = src[j], **or a search based on embedding cosine similarity**
  2. alphas: alphas[i] = 1 means tgt[i] has a matching token in src
  3. m: a clone of mapper without the cosine-similarity search, unused in later code
  4. alpha_e: alpha_e[i] = 1 means: there exists j s.t. tgt_blend[j] = tgt[i], **or src[mapper[i]] = tgt_blend[j]**
  5. alpha_m: alpha_m[i] = 1 means: there exists j s.t. src[mapper[i]] = tgt[i] = local_blend[j]

The parts that confuse me are highlighted in bold. I don't understand why the embedding cosine-similarity search is necessary.


h6kplus commented on May 29, 2024

Regarding the difference between mapper and m: they are generally unused in most cases, since the embedding cosine similarity was introduced for when the source and target prompts differ word-by-word but are semantically similar.
For alpha_e, it should be: alpha_e[i] = 1 means there exists j s.t. tgt_blend[j] = tgt[i]; the latter part is just in case some words were mapped via embedding cosine similarity.
If you use a source/target pair like "a photo of a dog" and "a photo of a cat", it makes no difference, but if you use the prompt "a picture of a cat", the embedding cosine similarity may automatically match "photo" and "picture" in the mapper.
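A small sketch of that matching behavior, with made-up toy embeddings (the real code uses the text encoder's embeddings; everything below, including the vectors, is illustrative only):

```python
import numpy as np

# Toy embeddings (hypothetical), chosen so "photo" and "picture" are close.
emb = {
    "a": [1.0, 0.0, 0.0],
    "photo": [0.0, 1.0, 0.1],
    "picture": [0.0, 0.9, 0.2],
    "of": [0.0, 0.0, 1.0],
    "cat": [0.5, 0.5, 0.0],
}

def cos(u, v):
    u, v = np.asarray(u), np.asarray(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_mapper(src, tgt):
    mapper = []
    for t in tgt:
        if t in src:                      # exact token match first
            mapper.append(src.index(t))
        else:                             # fallback: most similar source token
            sims = [cos(emb[t], emb[s]) for s in src]
            mapper.append(int(np.argmax(sims)))
    return mapper

src = ["a", "photo", "of", "a", "cat"]
tgt = ["a", "picture", "of", "a", "cat"]
m = build_mapper(src, tgt)
print(m)  # → [0, 1, 2, 0, 4]: "picture" is matched to "photo" by similarity
```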


fkcptlst commented on May 29, 2024

> Regarding the difference between mapper and m: they are generally unused in most cases, since the embedding cosine similarity was introduced for when the source and target prompts differ word-by-word but are semantically similar. For alpha_e, it should be: alpha_e[i] = 1 means there exists j s.t. tgt_blend[j] = tgt[i]; the latter part is just in case some words were mapped via embedding cosine similarity. If you use a source/target pair like "a photo of a dog" and "a photo of a cat", it makes no difference, but if you use the prompt "a picture of a cat", the embedding cosine similarity may automatically match "photo" and "picture" in the mapper.

I get it now. Thanks for the example!

