Comments (7)
The part in the paper: [screenshot from the paper]
regarding cross attention control
Additionally, the CrossEdit operation described in Algorithm 3 of the paper is inconsistent with the implementation.
Algorithm 3 uses: [equation from the paper]
[code snippet: lines 255 to 268 in d9f6c1b]
To my understanding, `attn_base`, `attn_repalce`, and `attn_masa` correspond to: [symbols from the paper]
As line 263 shows, the code applies CrossEdit with: [expression from the code]
I'm trying to dive into this work, but I'm puzzled by these inconsistencies. Which one should I stick to, the code or the paper?
Looking forward to your early reply.
from infedit.
regarding local attention blends
The local attention blends are implemented differently as well.
I'm assuming the [equation from the paper].
Below are the actual implementations:
[code snippet: lines 65 to 93 in d9f6c1b]
- **arithmetic vs logical**: Note that `torch.where` is used here to perform masking. This is equivalent to logical operations rather than the arithmetic operations described in the paper. Since there is no guarantee that $m^{tgt}$ and $m^{src}$ will not overlap each other, I believe the implementation and the description are not consistent.
- **masks and blending**: Please correct me if I'm wrong. In my understanding, `alpha_e` highlights tokens in the target prompt that also appear in the target blend prompt, while `alpha_m` highlights tokens in the target prompt that are also present in the source prompt. I'm quite confused by the implementation, since it blends the source `x_s`, mutual `x_m`, and target `x_t`, but only the blending of the source `x_s` and target `x_t` is mentioned in the paper.
- **implementation of the mapper function**: The implementation of the mapper function is drastically different from that in Prompt-to-Prompt. As I mentioned earlier in issue #21, I don't get why the cosine-similarity search is necessary, and consequently why the similarity search should alter the value of `alpha_e` (`alpha_e[max_t] = 1`). I could not find relevant information in the paper. Could you also elaborate a little more on the logic of the mapper implementation?
- **temperature**: there is a `temperature` in the local blend that controls `thresh_e`. Could you please also explain a little about the design of that?
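To make the arithmetic-vs-logical point concrete, here is a toy sketch (using `np.where` as a stand-in for `torch.where`; the arrays and masks below are made up, not the repository's code). The two readings agree everywhere except where the masks overlap:

```python
import numpy as np

# Toy per-position "latents" for the three branches (values are made up).
x_s = np.array([10.0, 10.0, 10.0, 10.0])  # source branch
x_m = np.array([30.0, 30.0, 30.0, 30.0])  # mutual branch
x_t = np.array([20.0, 20.0, 20.0, 20.0])  # target branch

# Binary blend masks; position 2 is covered by BOTH masks (they overlap).
m_src = np.array([1, 0, 1, 0])
m_tgt = np.array([0, 0, 1, 1])

# Logical masking, as where() does: nested selection, so an overlapping
# position simply takes the outermost matching branch.
logical = np.where(m_tgt == 1, x_t, np.where(m_src == 1, x_s, x_m))

# Arithmetic masking, as a paper-style formula reads: a weighted sum. At the
# overlap the mask weights sum to 2, so contributions are double-counted.
arith = m_tgt * x_t + m_src * x_s + (1 - m_tgt - m_src) * x_m

print(logical)  # [10. 30. 20. 20.]
print(arith)    # [10. 30.  0. 20.]  <- differs at the overlapping position
```

The two forms only coincide when the masks are guaranteed disjoint, which is exactly the guarantee the question says is missing.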
Hi fkcptlst, thanks for your issue and for carefully reviewing our code.
For the first and second questions, I think these are some tiny bugs in our implementation. Replacing both `q_tgt` and `k_tgt` will lead to a KV mismatch, as mentioned in Fig. 5 of https://arxiv.org/pdf/2403.02332.pdf, and to worse performance. And we should use $M^{lay}$ as our attention map as well.
For the attention blending, we did make some updates, and we will update this in a future version of our paper.
As for the mapper function, it is just there to make it easier for users to use the gradio demo without manually selecting whether to replace or refine, as in Prompt-to-Prompt. It has no relation to the methods in our paper.
As for the temperature, I think it is a feature left over from the development process, possibly due to the previous project not being cleaned up. Sorry for the confusion caused.
Hi, thanks for your reply! Your work is really impressive by the way.
I'm working on a project that's based on what you've done, so I need to make sure I get some things right in my implementation.
- So, for the first question, I should replace k_tgt and v_tgt (like the paper says) in my refactored code, right?
- Regarding the second question, I should use $M^{lay}$ (also from the paper) in my refactored code, correct?
- About the temperature, would it be recommended to just remove it?
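For the first question, a minimal numpy sketch of what the KV-mismatch argument means (generic scaled dot-product attention with made-up shapes, not the repository's actual code): swapping in `k_tgt` together with `v_tgt` keeps keys and values paired, while mixing `k_tgt` with `v_src` makes the attention weights index one prompt's tokens but gather another prompt's values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 8
q_src = rng.normal(size=(4, d))   # queries kept from the source branch
k_src, v_src = rng.normal(size=(5, d)), rng.normal(size=(5, d))
k_tgt, v_tgt = rng.normal(size=(5, d)), rng.normal(size=(5, d))

# Paper-style edit: replace BOTH k and v, so keys and values stay paired.
out_paired = attention(q_src, k_tgt, v_tgt)

# "KV mismatch": keys from the target prompt, values from the source prompt --
# the weights and the gathered values no longer describe the same tokens.
out_mismatched = attention(q_src, k_tgt, v_src)
```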
I'm still confused about the return values of the mapper. Here's my understanding and how it's actually implemented.
How I understand they should be:

- `mapper`: mapper[i] = j means tgt[i] = src[j]
- `alphas`: alphas[i] = 1 means tgt[i] has a matching token from src
- `m`: unused
- `alpha_e`: alpha_e[i] = 1 iff there exists j s.t. tgt_blend[j] = tgt[i] (tokens in both tgt and tgt_blend)
- `alpha_m`: alpha_m[i] = 1 iff there exists j s.t. src[mapper[i]] = tgt[i] = src_blend[j] (tokens in src, tgt, and src_blend)
How it's actually implemented:

- `mapper`: mapper[i] = j means tgt[i] = src[j], **or a search based on embedding cosine similarity**
- `alphas`: alphas[i] = 1 means tgt[i] has a matching token from src
- `m`: a clone of `mapper` **without the embedding-cosine-similarity search**, unused in later code
- `alpha_e`: alpha_e[i] = 1 means there exists j s.t. tgt_blend[j] = tgt[i], **or src[mapper[i]] = tgt_blend[j]**
- `alpha_m`: alpha_m[i] = 1 means there exists j s.t. src[mapper[i]] = tgt[i] = local_blend[j]
The parts that confuse me are highlighted in bold. I don't understand why the embedding-cosine-similarity search is necessary.
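The gap between the two `alpha_e` readings can be shown on toy word lists (made-up prompts; plain words stand in for token ids, and the identity mapper assumes a word-aligned pair). The readings differ only when the mapped source token, rather than the target token itself, appears among the blend words:

```python
# Toy tokenised prompts; plain words stand in for token ids.
src = ["a", "photo", "of", "a", "dog"]
tgt = ["a", "photo", "of", "a", "cat"]
tgt_blend = ["dog"]  # blend word taken from the source side, to expose the gap

# Identity mapper for this word-aligned pair: tgt[i] corresponds to src[i].
mapper = list(range(len(tgt)))

# Reading 1 (as understood): alpha_e[i] = 1 iff tgt[i] appears in tgt_blend.
alpha_e_v1 = [1 if w in tgt_blend else 0 for w in tgt]

# Reading 2 (as implemented): additionally 1 if the mapped source token
# appears in tgt_blend.
alpha_e_v2 = [1 if (tgt[i] in tgt_blend or src[mapper[i]] in tgt_blend) else 0
              for i in range(len(tgt))]

print(alpha_e_v1)  # [0, 0, 0, 0, 0]
print(alpha_e_v2)  # [0, 0, 0, 0, 1]  <- the extra clause fires at "cat"/"dog"
```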
For the difference between mapper and m: they are generally unused in most cases, since the embedding cosine similarity was introduced for when the source prompt and the target prompt differ word-by-word but are semantically similar.
For alpha_e, it should be: alpha_e[i] = 1 means there exists j s.t. tgt_blend[j] = tgt[i]; the latter part is just in case some words were mapped using embedding cosine similarity.
If you are using a source prompt and target prompt like "a photo of a dog" and "a photo of a cat", it will make no difference, but if you are using the prompt "a picture of a cat", the embedding cosine similarity may automatically match "photo" and "picture" in the mapper.
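A toy sketch of the behavior described above (the embeddings are made-up three-dimensional vectors, and `sim_thresh` is an invented threshold, not a parameter from the repository): exact word matches win, and otherwise the most cosine-similar source token is used, which is how "picture" can land on "photo".

```python
import numpy as np

# Made-up word embeddings: "photo" and "picture" point almost the same way.
emb = {
    "a":       np.array([1.0, 0.0, 0.0]),
    "photo":   np.array([0.0, 1.0, 0.1]),
    "picture": np.array([0.0, 0.9, 0.2]),
    "of":      np.array([0.0, 0.0, 1.0]),
    "cat":     np.array([0.5, 0.5, 0.5]),
    "dog":     np.array([0.5, 0.4, 0.6]),
}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_mapper(src, tgt, sim_thresh=0.9):
    """Exact word match first; otherwise fall back to the most
    cosine-similar source token if it clears sim_thresh."""
    mapper, alphas = [], []
    for i, w in enumerate(tgt):
        if w in src:                       # exact match
            mapper.append(src.index(w))
            alphas.append(1)
        else:                              # embedding-similarity fallback
            sims = [cos(emb[w], emb[s]) for s in src]
            j = int(np.argmax(sims))
            if sims[j] >= sim_thresh:
                mapper.append(j)
                alphas.append(1)
            else:
                mapper.append(i)           # no usable match
                alphas.append(0)
    return mapper, alphas

src = ["a", "photo", "of", "a", "dog"]
tgt = ["a", "picture", "of", "a", "cat"]
mapper, alphas = build_mapper(src, tgt)
print(mapper)  # [0, 1, 2, 0, 4]: "picture" lands on "photo", "cat" on "dog"
```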
I get it now. Thanks for the example!