
Comments (7)

fkcptlst commented on May 29, 2024

The part in the paper:

[image: excerpt from the paper]

from infedit.

fkcptlst commented on May 29, 2024

regarding cross attention control

Additionally, the CrossEdit operation described in Algorithm 3 of the paper is also inconsistent with the implementation.

[image: Algorithm 3 from the paper]

Algorithm 3 uses $M^{lay}$ to edit $M^{tgt}$. The implementation is as follows:

InfEdit/app_infedit.py

Lines 255 to 268 in d9f6c1b

```python
def forward(self, attn, is_cross: bool, place_in_unet: str):
    if is_cross:
        h = attn.shape[0] // self.batch_size
        attn = attn.reshape(self.batch_size, h, *attn.shape[1:])
        attn_base, attn_repalce, attn_masa = attn[0], attn[1], attn[2]
        attn_replace_new = self.replace_cross_attention(attn_masa, attn_repalce)
        attn_base_store = self.replace_cross_attention(attn_base, attn_repalce)
        if (self.cross_replace_steps >= ((self.cur_step + self.start_steps + 1) * 1.0 / self.num_steps)):
            attn[1] = attn_base_store
        attn_store = torch.cat([attn_base_store, attn_replace_new])
        attn = attn.reshape(self.batch_size * h, *attn.shape[2:])
        attn_store = attn_store.reshape(2 * h, *attn_store.shape[2:])
        super(AttentionControlEdit, self).forward(attn_store, is_cross, place_in_unet)
    return attn
```

To my understanding, attn_base, attn_repalce, attn_masa correspond to $M^{src}, M^{tgt}, M^{lay}$, respectively.

As line 263 shows, the code applies CrossEdit with $M^{src}$ instead of $M^{lay}$ as described in the paper.
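For concreteness, here is a minimal sketch of the discrepancy, with a hypothetical stub in place of the real replace_cross_attention (the stub and the toy values are mine, purely for illustration):

```python
import numpy as np

# Hypothetical stand-in for replace_cross_attention, for illustration only:
# it blends the "editing" map into the map being edited.
def replace_cross_attention(attn_from, attn_to):
    return 0.5 * (attn_from + attn_to)

# Toy attention maps standing in for M_src, M_tgt, M_lay.
M_src, M_tgt, M_lay = np.array([0.1]), np.array([0.5]), np.array([0.9])

as_in_paper = replace_cross_attention(M_lay, M_tgt)  # Algorithm 3: M_lay edits M_tgt
as_in_code = replace_cross_attention(M_src, M_tgt)   # the code passes attn_base (M_src)

print(as_in_paper, as_in_code)  # the two disagree whenever M_src != M_lay
```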

I'm trying to dive into this work, but I'm puzzled by these inconsistencies. I wonder which one I should stick to: the code or the paper?

Looking forward to your reply.


fkcptlst commented on May 29, 2024

regarding local attention blends

The local attention blends are implemented differently as well.

[image: the local attention blend equations from the paper]

I'm assuming the $(m^{tgt} - m^{src})$ and $(1 - m^{tgt} + m^{src})$ terms perform arithmetic operations rather than logical operations.

Below are the actual implementations.

InfEdit/app_infedit.py

Lines 65 to 93 in d9f6c1b

```python
def __call__(self, i, x_s, x_t, x_m, attention_store, alpha_prod, temperature=0.15, use_xm=False):
    maps = attention_store["down_cross"][2:4] + attention_store["up_cross"][:3]
    h, w = x_t.shape[2], x_t.shape[3]
    h, w = ((h + 1) // 2 + 1) // 2, ((w + 1) // 2 + 1) // 2
    maps = [item.reshape(2, -1, 1, h // int((h * w / item.shape[-2]) ** 0.5), w // int((h * w / item.shape[-2]) ** 0.5), MAX_NUM_WORDS) for item in maps]
    maps = torch.cat(maps, dim=1)
    maps_s = maps[0, :]
    maps_m = maps[1, :]
    thresh_e = temperature / alpha_prod ** (0.5)
    if thresh_e < self.thresh_e:
        thresh_e = self.thresh_e
    thresh_m = self.thresh_m
    mask_e = self.get_mask(x_t, maps_m, self.alpha_e, thresh_e, i)
    mask_m = self.get_mask(x_t, maps_s, (self.alpha_m - self.alpha_me), thresh_m, i)
    mask_me = self.get_mask(x_t, maps_m, self.alpha_me, self.thresh_e, i)
    if self.save_inter:
        self.save_image(mask_e, i, "mask_e")
        self.save_image(mask_m, i, "mask_m")
        self.save_image(mask_me, i, "mask_me")
    if self.alpha_e.sum() == 0:
        x_t_out = x_t
    else:
        x_t_out = torch.where(mask_e, x_t, x_m)
    x_t_out = torch.where(mask_m, x_s, x_t_out)
    if use_xm:
        x_t_out = torch.where(mask_me, x_m, x_t_out)
    return x_m, x_t_out
```

  1. arithmetic vs logical: Note that torch.where is used to perform the masking. This amounts to a logical operation rather than the arithmetic operations described in the paper. Since there is no guarantee that $m^{tgt}$ and $m^{src}$ do not overlap, I believe the implementation and the description are inconsistent.
  2. masks and blending: Please correct me if I'm wrong. In my understanding, alpha_e highlights tokens in the target prompt that also appear in the target blend prompt, while alpha_m highlights tokens in the target prompt that are also present in the source prompt. I'm confused because the implementation blends the source x_s, the mutual x_m, and the target x_t, whereas the paper only mentions blending the source x_s and the target x_t.
  3. implementation of the mapper function: The mapper implementation differs drastically from the one in Prompt-to-Prompt. As I mentioned earlier in issue #21, I don't understand why the cosine-similarity search is necessary, nor why that search should alter the value of alpha_e (alpha_e[max_t] = 1). I could not find relevant information in the paper. Could you elaborate a little more on the logic of the mapper implementation?
  4. temperature: there is a temperature in the local blend that controls thresh_e; could you please explain the design of that as well?
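To illustrate point 1, here is a toy example (numpy stand-ins for the torch tensors; all values are made up) of where the paper's arithmetic blend and a torch.where-style logical blend disagree, namely in cells covered by $m^{src}$ but not $m^{tgt}$:

```python
import numpy as np

# Toy latents and blend masks (numpy stand-ins for the torch tensors).
x_s = np.full((2, 2), 1.0)    # "source" latent
x_t = np.full((2, 2), 10.0)   # "target" latent
m_src = np.array([[1.0, 0.0], [0.0, 0.0]])
m_tgt = np.array([[0.0, 1.0], [0.0, 0.0]])

# Arithmetic blend, as written in the paper:
arith = x_t * (m_tgt - m_src) + x_s * (1 - m_tgt + m_src)

# Logical blend, as torch.where effectively computes it:
logic = np.where((m_tgt - m_src) > 0, x_t, x_s)

print(arith)  # [[-8. 10.], [1. 1.]] -- weight -1 where m_src=1, m_tgt=0
print(logic)  # [[ 1. 10.], [1. 1.]] -- always picks one of x_s, x_t
```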


SihanXU commented on May 29, 2024

Hi fkcptlst, thanks for your issue and for carefully reviewing our code.
For the first and second questions: I think these are small bugs in our implementation. Replacing both q_tgt and k_tgt leads to a K/V mismatch, as shown in Fig. 5 of https://arxiv.org/pdf/2403.02332.pdf, and results in worse performance. And we should indeed use M_lay as the attention map.
For the attention blending, we did make some updates, and we will reflect them in a future version of the paper.
As for the mapper function, it only exists to let users run the gradio demo without manually choosing between the replace and refine modes described in prompt2prompt. It has no relation to the method in our paper.
As for the temperature, I believe it is a leftover from the development process, possibly from a previous project that was not cleaned up. Sorry for the confusion.


fkcptlst commented on May 29, 2024

> Hi fkcptlst, thanks for your issue and for carefully reviewing our code. For the first and second questions: I think these are small bugs in our implementation. Replacing both q_tgt and k_tgt leads to a K/V mismatch, as shown in Fig. 5 of https://arxiv.org/pdf/2403.02332.pdf, and results in worse performance. And we should indeed use M_lay as the attention map. For the attention blending, we did make some updates, and we will reflect them in a future version of the paper. As for the mapper function, it only exists to let users run the gradio demo without manually choosing between the replace and refine modes described in prompt2prompt. It has no relation to the method in our paper. As for the temperature, I believe it is a leftover from the development process, possibly from a previous project that was not cleaned up. Sorry for the confusion.

Hi, thanks for your reply! Your work is really impressive by the way.

I'm working on a project that's based on what you've done, so I need to make sure I get some things right in my implementation.

  1. So, for the first question, I should replace k_tgt and v_tgt (as the paper says) in my refactored code, right?
  2. Regarding the second question, I should use $M^{lay}$ (also as in the paper) in my refactored code, correct?
  3. About the temperature: would you recommend just removing it?
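On the first point, my reading of the K/V mismatch argument, sketched with a toy single-head attention (shapes, names, and random values here are my assumptions, not the InfEdit code):

```python
import numpy as np

# Toy single-head scaled dot-product attention.
def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q_tgt, k_tgt, v_tgt = (rng.standard_normal((4, 8)) for _ in range(3))
k_src, v_src = (rng.standard_normal((4, 8)) for _ in range(2))

# Paper's variant: swap K and V together, so each source key still
# gates its own source value.
paired = attention(q_tgt, k_src, v_src)

# Mismatched variant: source keys select among target values, so the
# attention weights no longer correspond to the values they weight.
mismatched = attention(q_tgt, k_src, v_tgt)
```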

I'm still confused about the return values of the mapper. Here's my understanding and how it's actually implemented.

How I understand they should be:

  1. mapper: mapper[i] = j means tgt[i] = src[j]
  2. alphas: alphas[i] = 1 means tgt[i] has a matching token in src
  3. m: unused
  4. alpha_e: alpha_e[i] = 1 iff there exists j s.t. tgt_blend[j] = tgt[i] (tokens in both tgt and tgt_blend)
  5. alpha_m: alpha_m[i] = 1 iff there exists j s.t. src[mapper[i]] = tgt[i] = src_blend[j] (tokens in src, tgt, and src_blend)

How it's actually implemented:

  1. mapper: mapper[i] = j means tgt[i] = src[j], **or a search based on embedding cosine similarity**
  2. alphas: alphas[i] = 1 means tgt[i] has a matching token in src
  3. m: a clone of mapper without the cosine-similarity search, unused in later code
  4. alpha_e: alpha_e[i] = 1 means: there exists j s.t. tgt_blend[j] = tgt[i], **or src[mapper[i]] = tgt_blend[j]**
  5. alpha_m: alpha_m[i] = 1 means: there exists j s.t. src[mapper[i]] = tgt[i] = local_blend[j]

The parts that confuse me are highlighted in bold. I don't understand why the embedding cosine-similarity search is necessary.


h6kplus commented on May 29, 2024

Regarding the difference between mapper and m: they are generally unused in most cases, since the embedding cosine similarity was introduced for when the source and target prompts differ word-by-word but are semantically similar.
For alpha_e, it should be: alpha_e[i] = 1 means there exists j s.t. tgt_blend[j] = tgt[i]; the latter part is just in case some words were mapped via embedding cosine similarity.
If you use a source/target pair like "a photo of a dog" and "a photo of a cat", it makes no difference, but if you use the prompt "a picture of a cat", the embedding cosine similarity may automatically match "photo" and "picture" in the mapper.
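A small sketch of that matching behavior, with made-up toy embeddings (the real code uses the text encoder's embeddings; everything below, including the vectors, is illustrative only):

```python
import numpy as np

# Toy embeddings (hypothetical), chosen so "photo" and "picture" are close.
emb = {
    "a": [1.0, 0.0, 0.0],
    "photo": [0.0, 1.0, 0.1],
    "picture": [0.0, 0.9, 0.2],
    "of": [0.0, 0.0, 1.0],
    "cat": [0.5, 0.5, 0.0],
}

def cos(u, v):
    u, v = np.asarray(u), np.asarray(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_mapper(src, tgt):
    mapper = []
    for t in tgt:
        if t in src:                      # exact token match first
            mapper.append(src.index(t))
        else:                             # fallback: most similar source token
            sims = [cos(emb[t], emb[s]) for s in src]
            mapper.append(int(np.argmax(sims)))
    return mapper

src = ["a", "photo", "of", "a", "cat"]
tgt = ["a", "picture", "of", "a", "cat"]
m = build_mapper(src, tgt)
print(m)  # → [0, 1, 2, 0, 4]: "picture" is matched to "photo" by similarity
```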


fkcptlst commented on May 29, 2024

> Regarding the difference between mapper and m: they are generally unused in most cases, since the embedding cosine similarity was introduced for when the source and target prompts differ word-by-word but are semantically similar. For alpha_e, it should be: alpha_e[i] = 1 means there exists j s.t. tgt_blend[j] = tgt[i]; the latter part is just in case some words were mapped via embedding cosine similarity. If you use a source/target pair like "a photo of a dog" and "a photo of a cat", it makes no difference, but if you use the prompt "a picture of a cat", the embedding cosine similarity may automatically match "photo" and "picture" in the mapper.

I get it now. Thanks for the example!

