tencent-ailab / ip-adapter
The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
License: Apache License 2.0
Hi, the paper is really interesting and your results kick ass. Not sure how to use this in Automatic1111 though; I tried putting the models in the ControlNet model folder but they weren't showing up. Any chance you can update the readme with the process? Thanks.
I notice that you provide an image encoder in your own space; is it different from the models released by OpenAI?
Hi, Dear Authors.
After reading the code, I found that you feed the image features after projection (concept features) into the adapter layers.
However, some related works, e.g. InstantBooth (https://arxiv.org/abs/2304.03411) and Subject Diffusion (https://arxiv.org/abs/2307.11410), inject the image token features (patch features) into the adapter layers of the SD UNet (they use self-attention).
It seems patch features may contain more detailed information, so the model can better preserve the characteristics of the input images.
Concept features seem to carry more high-level semantic info. Maybe this choice makes IP-Adapter more flexible but, to some extent, loses some subject/identity-driven generation ability.
This is just my personal opinion; discussion is welcome.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from ip_adapter import IPAdapter

base_model_path = "yiffymix16_32"
image_encoder_path = "models/image_encoder/"
ip_ckpt = "models/ip-adapter_sd15.bin"
device = "cuda"

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(base_model_path, safety_checker=None, controlnet=controlnet, torch_dtype=torch.float16)
# the below line is causing issues
ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device)
pipe = pipe.to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
I have the above code. Now when I run pipe() independently (without IPAdapter, for example image = pipe(...)), it produces hazy, unnatural images like this:
But if I comment out the line ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device), pipe() gives proper results. Any idea why?
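A possible explanation, offered as an assumption rather than a confirmed answer: constructing IPAdapter replaces every cross-attention processor on pipe.unet with an IPAttnProcessor that expects image tokens appended to the text embeddings, so a plain pipe(...) call afterwards has the tail of its text embeddings misread as image tokens. Restoring the stock processors before standalone use should undo this:

from diffusers.models.attention_processor import AttnProcessor
# swap every attention processor back to the default text-only implementation
pipe.unet.set_attn_processor(AttnProcessor())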
https://colab.research.google.com/drive/1_Vtos4PRqZWAg69sC9XuSBBt6Mw_aDCz?usp=sharing
ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)
UnpicklingError Traceback (most recent call last)
in <cell line: 1>()
----> 1 ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)
3 frames
/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
1031 "functionality.")
1032
-> 1033 magic_number = pickle_module.load(f, **pickle_load_args)
1034 if magic_number != MAGIC_NUMBER:
1035 raise RuntimeError("Invalid magic number; corrupt file?")
UnpicklingError: invalid load key, '<'.
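An invalid load key of '<' usually means the file on disk begins with an HTML tag, i.e. a web page or error page was saved in place of the checkpoint (a common outcome of downloading the Hub's HTML link instead of the raw file). A sketch of re-fetching it through huggingface_hub, assuming the h94/IP-Adapter mirror hosts the file under this name:

from huggingface_hub import hf_hub_download
ip_ckpt = hf_hub_download(repo_id="h94/IP-Adapter", filename="models/ip-adapter-plus_sd15.bin")
ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)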
Hi! Thank you for the amazing work! It works like a charm!
I wonder which dataset you used during training? Can you share more info about it? You specified in the paper that it is a subset of the LAION & COYO datasets. Maybe you have the parameters you used for filtering the data: aesthetic score threshold, p_unsafe / p_watermark, image size? And the proportion of LAION to COYO in your data?
Do you think results would differ when trained on a smaller dataset, say 1M samples? Do you think results would improve using full resolution with variable aspect-ratio bucketing instead of center crop?
Greetings,
First of all thank you for this achievement, the potential of this tool is astounding.
I have a problem with the sample code:
Following the code in the Colab demo precisely, I notice some differences between the output I get and the one shown in your results.
I can't understand why. Does anyone know?
Note: I left SD 1.5 set, as per the code.
Hello, the plugin you made is very useful. I used it the day before and it was fine, but the next day it started erroring and many others couldn't use it either. SD 1.5 is okay, but SDXL errors. My error prompt is:
Error occurred when executing IPAdapter:
Input type (torch.FloatTensor) and weight type (torch.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
File "D:\AI\StableSwarmUI\dlbackend\comfy\ComfyUI\execution.py", line 151, in recursive_execute
output_data, output_ui = get_output_data(obj, input_data_all)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\StableSwarmUI\dlbackend\comfy\ComfyUI\execution.py", line 81, in get_output_data
return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\StableSwarmUI\dlbackend\comfy\ComfyUI\execution.py", line 74, in map_node_over_list
results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\StableSwarmUI\dlbackend\comfy\ComfyUI\custom_nodes\IPAda
Can we try to do ipadapter face and ipadapter together?
Maybe like this:
hidden_states = hidden_states + self.scale1 * ip_hidden_states + self.scale2 * ip_face_hidden_states
I ran tutorial_train.py and saved the related params of the unet and 'ip-adapter_sd15.bin'. But when I load the unet params with StableDiffusionPipeline, I get the warning:
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight']
I think the reason may be these lines in the training code:
else:
    layer_name = name.split(".processor")[0]
    weights = {
        "to_k_ip.weight": unet_sd[layer_name + ".to_k.weight"],
        "to_v_ip.weight": unet_sd[layer_name + ".to_v.weight"],
    }
    attn_procs[name] = IPAttnProcessor(hidden_size=hidden_size, cross_attention_dim=cross_attention_dim)
    attn_procs[name].load_state_dict(weights)
So, I want to know why it is set up like this, and how I should modify my inference code so that it does not produce such warnings?
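If the goal is a UNet checkpoint that loads cleanly into a vanilla StableDiffusionPipeline, one option is to drop the decoupled-attention projections before saving, since they live in the ip-adapter checkpoint anyway. A sketch under the assumption that the unused keys are exactly the to_k_ip/to_v_ip weights listed in the warning:

import torch
unet_sd = unet.state_dict()
# keep everything except the image-branch projections added by IPAttnProcessor
base_sd = {k: v for k, v in unet_sd.items() if "to_k_ip" not in k and "to_v_ip" not in k}
torch.save(base_sd, "unet_base.bin")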
File "/home/code/third_party/IP-Adapter/ip_adapter/attention_processor.py", line 192, in init
raise ImportError("AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")
ImportError: AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.
Maybe I should modify the code below:

if is_torch2_available:
    from .attention_processor import IPAttnProcessor2_0 as IPAttnProcessor, AttnProcessor2_0 as AttnProcessor, CNAttnProcessor2_0 as CNAttnProcessor
else:
    from .attention_processor import IPAttnProcessor, AttnProcessor, CNAttnProcessor

to

if is_torch2_available():

since a bare function reference is always truthy, the PyTorch 2.0 branch is taken even on older torch.
First, thanks for your great job. There is a question about the differences between the prompt image and a real-value image. Could you offer some examples?
Hi, thanks for your great work! I'm very interested in the IP-Adapter with fine-grained features; do you have a plan to release this version?
I changed the controlnet demo from IPAdapter to IPAdapterPlus, using "models/ip-adapter-plus_sd15.bin" as the adapter model checkpoint, but failed to load the ip-adapter.
ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device)
error log:
Cell In[8], line 2
1 # load ip-adapter
----> 2 ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device)
File /mnt/data/aigc/IP-Adapter/ip_adapter/ip_adapter.py:52, in IPAdapter.__init__(self, sd_pipe, image_encoder_path, ip_ckpt, device, num_tokens)
49 # image proj model
50 self.image_proj_model = self.init_proj()
---> 52 self.load_ip_adapter()
File /mnt/data/aigc/IP-Adapter/ip_adapter/ip_adapter.py:84, in IPAdapter.load_ip_adapter(self)
82 def load_ip_adapter(self):
83 state_dict = torch.load(self.ip_ckpt, map_location="cpu")
---> 84 self.image_proj_model.load_state_dict(state_dict["image_proj"])
85 ip_layers = torch.nn.ModuleList(self.pipe.unet.attn_processors.values())
86 ip_layers.load_state_dict(state_dict["ip_adapter"])
File ~/anaconda3/envs/IP-Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:2041, in Module.load_state_dict(self, state_dict, strict)
2036 error_msgs.insert(
2037 0, 'Missing key(s) in state_dict: {}. '.format(
2038 ', '.join('"{}"'.format(k) for k in missing_keys)))
2040 if len(error_msgs) > 0:
-> 2041 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
2042 self.__class__.__name__, "\n\t".join(error_msgs)))
2043 return _IncompatibleKeys(missing_keys, unexpected_keys)
RuntimeError: Error(s) in loading state_dict for Resampler:
size mismatch for latents: copying a param with shape torch.Size([1, 16, 768]) from checkpoint, the shape in current model is torch.Size([1, 4, 768]).
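The shape gap (16 vs. 4 latents) matches the Resampler's token count: the Plus checkpoints use 16 image tokens while IPAdapter defaults to 4. Passing num_tokens explicitly, exactly as the repo's Plus demo does, should fix the load:

ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)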
Hi,
This project is just awesome, thanks for it !
Will you release the training/finetuning code ?
Thanks
Really awesome work, thank you guys for that.
do you have a plan for supporting SDXL?
First off, congratulations on this project and thank you so much for your work!
I'm testing using your SDXL demo and am generally getting good results. However, my use case is really human "personalization", like DreamBooth. I've tried using your multimodal prompt with some of my images, and the likeness of the face is not quite what I'd like, meaning that the face shows variation where I wish it would be more true to the original input.
Do you have any suggestions on settings or values that I could tweak to try and improve this? Or other ideas?
Again, thanks for everything!
Thanks for your brilliant work.
Do you consider releasing a model for the SD 2.1 version?
Hi, I am trying to run the ip_adapter_controlnet_demo_new.ipynb notebook; however, I keep getting the error below from the following line. I also tried to download the model manually but couldn't find it on huggingface. I definitely don't have a local directory with the same name.
Thank you so much for your help.
Line:
# load ip-adapter
ip_model = IPAdapter(pipe, image_encoder_path, ip_ckpt, device)
Error:
---------------------------------------------------------------------------
HFValidationError Traceback (most recent call last)
File [c:\Users\avika\anaconda3\envs\interactive_dance_thesis\lib\site-packages\transformers\configuration_utils.py:675](file:///C:/Users/avika/anaconda3/envs/interactive_dance_thesis/lib/site-packages/transformers/configuration_utils.py:675), in PretrainedConfig._get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
673 try:
674 # Load from local folder or from cache or download from model Hub and cache
--> 675 resolved_config_file = cached_file(
676 pretrained_model_name_or_path,
677 configuration_file,
678 cache_dir=cache_dir,
679 force_download=force_download,
680 proxies=proxies,
681 resume_download=resume_download,
682 local_files_only=local_files_only,
683 token=token,
684 user_agent=user_agent,
685 revision=revision,
686 subfolder=subfolder,
687 _commit_hash=commit_hash,
688 )
689 commit_hash = extract_commit_hash(resolved_config_file, commit_hash)
File [c:\Users\avika\anaconda3\envs\interactive_dance_thesis\lib\site-packages\transformers\utils\hub.py:428](file:///C:/Users/avika/anaconda3/envs/interactive_dance_thesis/lib/site-packages/transformers/utils/hub.py:428), in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs)
426 try:
427 # Load from URL or cache if already cached
--> 428 resolved_file = hf_hub_download(
...
703 try:
704 # Load config dict
705 config_dict = cls._dict_from_json_file(resolved_config_file)
OSError: Can't load the configuration of 'models/image_encoder/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'models/image_encoder/' is the correct path to a directory containing a config.json file
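The OSError suggests models/image_encoder/ simply doesn't exist relative to the working directory, so transformers falls back to treating the string as a Hub repo id and fails validation. One hedged way to materialize the folder, assuming the layout of the h94/IP-Adapter mirror:

from huggingface_hub import snapshot_download
# pulls only the image encoder files into ./models/image_encoder/
snapshot_download(repo_id="h94/IP-Adapter", allow_patterns=["models/image_encoder/*"], local_dir=".")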
I'm just trying to understand how IP-Adapter does these powerful manipulations.
So we are encoding the image into embeddings, combining them with the prompt's embeddings, and then creating an image based on that. I see we are changing the unet, but I don't understand how.
I also ran into issues using embeddings directly instead of passing the prompt in as a string. Here are the two doubts in more detail:
What does this do:
input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids
I saw this near the set_ip_adapter function. Is this what's changing the vae decoder? What does it do, and am I incorrect in assessing this?
I changed some function inputs in order to use prompt embeds directly, for example doing this:

input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids
for i in range(0, input_ids.shape[-1], max_length):
    concat_embeds.append(pipe.text_encoder(input_ids[:, i: i + max_length])[0])
prompt_embeds = torch.cat(concat_embeds, dim=1)

I tried inputting this and commenting out the following in IP-Adapter's generate(...) function:
prompt_embeds = self.pipe._encode_prompt(...)
This however gave incorrect results in the output image. What am I doing wrong?
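A likely cause, inferred from how generate() assembles its inputs rather than confirmed: the IP attention processors expect the image tokens to be appended to the text embeddings, for the negative branch as well, so externally computed prompt embeds need the same concatenation. A minimal sketch with assumed variable names:

# image_prompt_embeds / uncond_image_prompt_embeds come from ip_model.get_image_embeds(pil_image)
prompt_embeds = torch.cat([prompt_embeds, image_prompt_embeds], dim=1)
negative_prompt_embeds = torch.cat([negative_prompt_embeds, uncond_image_prompt_embeds], dim=1)
images = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds,
              num_inference_steps=50).images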
I know this is out of scope, but is there any way of also loading a LoRA model's weights in the model's pipeline? If I load a LoRA model using pipe.load_lora_weights(...)
before calling ip_adapter, will it retain those weights?
I know these are many questions. Thanks a lot in advance!
Forgot to mention earlier, but this is some amazing work you guys did, I love the idea.
Hello,
thank you for this wonderful model!
I am trying to run the img2img pipeline using IP-Adapter Plus, following the example in the original notebook:
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    feature_extractor=None,
    safety_checker=None,
)
ip_model = IPAdapterPlus(pipe, image_encoder_path, ip_ckpt, device, num_tokens=16)
images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=seed, image=g_image, strength=scale)
but I am getting the following error
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[7], line 6
4 for scale in [0.7, 0.75, 0.8, 0.95]:
5 print(scale)
----> 6 images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=seed, image=g_image, strength=scale)
7 grid = image_grid(images, 1, 4)
8 display(grid)
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/IP_Adapter/ip_adapter/ip_adapter.py:132, in IPAdapter.generate(self, pil_image, prompt, negative_prompt, scale, num_samples, seed, guidance_scale, num_inference_steps, **kwargs)
129 if not isinstance(negative_prompt, List):
130 negative_prompt = [negative_prompt] * num_prompts
--> 132 image_prompt_embeds, uncond_image_prompt_embeds = self.get_image_embeds(pil_image)
133 bs_embed, seq_len, _ = image_prompt_embeds.shape
134 image_prompt_embeds = image_prompt_embeds.repeat(1, num_samples, 1)
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/IP_Adapter/ip_adapter/ip_adapter.py:239, in IPAdapterPlus.get_image_embeds(self, pil_image)
237 clip_image = self.clip_image_processor(images=pil_image, return_tensors="pt").pixel_values
238 clip_image = clip_image.to(self.device, dtype=torch.float16)
--> 239 clip_image_embeds = self.image_encoder(clip_image, output_hidden_states=True).hidden_states[-2]
240 image_prompt_embeds = self.image_proj_model(clip_image_embeds)
241 uncond_clip_image_embeds = self.image_encoder(torch.zeros_like(clip_image), output_hidden_states=True).hidden_states[-2]
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:1311, in CLIPVisionModelWithProjection.forward(self, pixel_values, output_attentions, output_hidden_states, return_dict)
1288 r"""
1289 Returns:
1290
(...)
1307 >>> image_embeds = outputs.image_embeds
1308 ```"""
1309 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-> 1311 vision_outputs = self.vision_model(
1312 pixel_values=pixel_values,
1313 output_attentions=output_attentions,
1314 output_hidden_states=output_hidden_states,
1315 return_dict=return_dict,
1316 )
1318 pooled_output = vision_outputs[1] # pooled_output
1320 image_embeds = self.visual_projection(pooled_output)
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:866, in CLIPVisionTransformer.forward(self, pixel_values, output_attentions, output_hidden_states, return_dict)
863 if pixel_values is None:
864 raise ValueError("You have to specify pixel_values")
--> 866 hidden_states = self.embeddings(pixel_values)
867 hidden_states = self.pre_layrnorm(hidden_states)
869 encoder_outputs = self.encoder(
870 inputs_embeds=hidden_states,
871 output_attentions=output_attentions,
872 output_hidden_states=output_hidden_states,
873 return_dict=return_dict,
874 )
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:195, in CLIPVisionEmbeddings.forward(self, pixel_values)
193 def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
194 batch_size = pixel_values.shape[0]
--> 195 patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid]
196 patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
198 class_embeds = self.class_embedding.expand(batch_size, 1, -1)
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
462 def forward(self, input: Tensor) -> Tensor:
--> 463 return self._conv_forward(input, self.weight, self.bias)
File /mnt/a0b764eb-cdc5-4f46-9a2e-e2f11deba631/PYTHON_CACHE/IP_Adapter/lib/python3.10/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
455 if self.padding_mode != 'zeros':
456 return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
457 weight, bias, self.stride,
458 _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
460 self.padding, self.dilation, self.groups)
RuntimeError: GET was unable to find an engine to execute this computation
Am I doing something wrong, or is this feature not implemented in Adapter Plus?
Hello author, I have a stupid question. I read the code you provided and found that ip_tokens and encoder_hidden_states are concatenated. Is this different from the addition described in the paper?
This project is awesome!!!
I have two small questions.
Hello, is there an SDXL version of the "ip-adapter-plus-face" model, or is there a way to use it with SDXL?
Thank you :)
Hi, it takes almost 15 minutes to create an image with an RTX 4090. Is this normal?
7/30 [02:38<09:06, 23.76s/it]
Awesome work! Thanks for your contribution
I am using ip_adapter_multimodal_prompts (generation with multimodal prompts) in Python, but I want to save the generated images to a folder whenever I run this code. How do I do that after the line below?
images = ip_model.generate(pil_image=image, num_samples=1, num_inference_steps=50, seed=42,
                           prompt="wearing a hat on the beach", scale=0.6)
Have you experimented with adding a little noise to the zeroed tensors?
I made a few tests with the Plus model and the results are... interesting. Basically, instead of a zeroed embed I'm passing random +/- 0.5 noise (or even higher).
This is a quick example; sometimes the result is quite a bit better, you just need to keep it low, otherwise it starts "burning" the image.
Wondering if this makes any sense.
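For reference, a sketch of the experiment described above, with variable names assumed from get_image_embeds: replace the zeroed unconditional embedding with small uniform noise.

# uniform noise in [-0.5, 0.5) instead of torch.zeros_like(clip_image_embeds)
uncond_clip_image_embeds = torch.rand_like(clip_image_embeds) - 0.5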
Hello, I have a question. During inference, the foreground objects need to be picked out using instance segmentation. Is this necessary during training? Also, what is the specific segmentation method used in the paper? Looking forward to your reply, thank you.
Hi! Thank you for your amazing work!
When I compile the unet with torch.compile(unet), the decoupled text/image attention seems to stop working. Do you have a fix for that?
Hi authors,
I wonder how you crop faces when training the face ip-adapter?
I have tried running it but always run into memory issues, and it terminates at this part: IPAdapterXL(pipe, image_encoder_path, ip_ckpt, device).
Even though loading the model with StableDiffusionXLPipeline.from_pretrained works fine. I also tried using accelerate but still face issues. Does this mean it's not possible to run it on Colab's free tier?
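Two standard diffusers memory savers may be worth a try, though how they interact with the IPAdapterXL wrapper (which moves the pipe to the device itself) is untested here:

pipe.enable_model_cpu_offload()  # stream submodules to the GPU on demand
pipe.enable_vae_slicing()        # decode latents one slice at a time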
How to install IP-Adapter? Please help.
Is it possible to train an adapter that grabs ONLY the perspective/scene/location condition from an image?
SD1 often struggles with good perspectives and angles. It would fix a big problem.
The paper says that when the scale param is set to 0, the model is equivalent to the base text2img model.
But when I set the scale to 0 and compare with the original text2img base model, I find there are still some differences in the generated images (all other parameters kept the same).
Is it because of def set_ip_adapter() and def load_ip_adapter() in the inference code?
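For what it's worth, the repo exposes a runtime toggle for this, and with scale = 0 the decoupled branch is multiplied by zero, so any remaining difference plausibly comes from the swapped attention-processor implementation rather than from the adapter weights (an assumption, not verified here):

ip_model.set_scale(0.0)  # zeroes out the image-conditioned attention branch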
First of all, you need to change your runtime to T4, because it is CPU by default.
Code.txt
And that's it!!!
Hi, I placed the models ip-adapter_sd15.bin, ip-adapter-plus_sd15.bin and ip-adapter-plus-face_sd15.bin into ../stable-diffusion-webui > extensions > sd-webui-controlnet > models, but when I restart A1111 they are not showing up in the model field of ControlNet (1.1.4).
Thanks
Hi, I recently saw a work very similar to yours, also from Tencent, called StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation. Do you know this paper?
Thanks for your work
I observe that ipadapter can't generate images similar to the image prompt when the prompt is an anime-character style. But when I use an anime-style DreamBooth model or the corresponding character LoRA, the ipadapter performs better. I'd like to ask whether the ipadapter only works when the foundation model can produce results similar to the image prompt.
Getting the issue: AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'.
I have the latest torch, ControlNet 1.1.409, renamed to .pth;
everything looks good, it just errors out.
Seeing if anyone else has this issue.
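scaled_dot_product_attention only exists in torch >= 2.0, so a quick check of the environment the webui actually runs in (its venv may differ from the system install) can rule out a stale torch:

import torch
import torch.nn.functional as F
print(torch.__version__, hasattr(F, "scaled_dot_product_attention"))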
Using the SD1.5 IP-Adapter models in a v1.6.0 Automatic1111 environment (python: 3.10.13, torch: 2.0.1) with the latest ControlNet on Apple ARM architecture generates a random image and produces the console runtime error below. COMMANDLINE_ARGS="--skip-torch-cuda-test --upcast-sampling --no-half-vae --use-cpu interrogate".
2023-09-10 21:04:57,087 - ControlNet - STATUS - preprocessor resolution = 512
*** Error running process: /Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py
Traceback (most recent call last):
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/modules/scripts.py", line 619, in process
script.process(p, *script_args)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py", line 977, in process
self.controlnet_hack(p)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py", line 966, in controlnet_hack
self.controlnet_main_entry(p)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/controlnet.py", line 808, in controlnet_main_entry
detected_map, is_image = preprocessor(
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/utils.py", line 75, in decorated_func
return cached_func(*args, **kwargs)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/utils.py", line 63, in cached_func
return func(*args, **kwargs)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/global_state.py", line 35, in unified_preprocessor
return preprocessor_modules[preprocessor_name](*args, **kwargs)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/processor.py", line 350, in clip
from annotator.clipvision import ClipVisionDetector
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/extensions/sd-webui-controlnet/annotator/clipvision/__init__.py", line 81, in <module>
clip_vision_h_uc = torch.load(clip_vision_h_uc)['uc']
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/modules/safe.py", line 108, in load
return load_with_extra(filename, *args, extra_handler=global_extra_handler, **kwargs)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/modules/safe.py", line 156, in load_with_extra
return unsafe_torch_load(filename, *args, **kwargs)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
result = unpickler.load()
File "/opt/homebrew/Cellar/[email protected]/3.10.13/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pickle.py", line 1213, in load
dispatch[key[0]](self)
File "/opt/homebrew/Cellar/[email protected]/3.10.13/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pickle.py", line 1254, in load_binpersid
self.append(self.persistent_load(pid))
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 1142, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 1116, in load_tensor
wrap_storage=restore_location(storage, location),
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 217, in default_restore_location
result = fn(storage, location)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 182, in _cuda_deserialize
device = validate_cuda_device(location)
File "/Users/guestuser/Documents/Projects/StableDiffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/serialization.py", line 166, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.```
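For what it's worth, the workaround the error message itself suggests, applied at the line the traceback points to (annotator/clipvision/__init__.py in the ControlNet extension):

# map CUDA-saved tensors onto the CPU on machines without CUDA
clip_vision_h_uc = torch.load(clip_vision_h_uc, map_location=torch.device("cpu"))['uc']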
I use self.ip_ckpt = "./IP-Adapter/sdxl_models/ip-adapter_sdxl.bin",
but when I load it for my SDXL inpainting model:
RuntimeError: Error(s) in loading state_dict for ImageProjModel:
size mismatch for proj.weight: copying a param with shape torch.Size([8192, 1280]) from checkpoint, the shape in current model is torch.Size([32768, 1280]).
size mismatch for proj.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([32768]).
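The two shapes differ by exactly a factor of 4, which matches the number of extra context tokens: ip-adapter_sdxl.bin projects to 4 image tokens, so the ImageProjModel must be built with the same setting. A sketch under that assumption (inferred from the shapes, not documented):

# num_tokens must match the checkpoint; 4 is the plain (non-Plus) SDXL setting
ip_model = IPAdapterXL(pipe, image_encoder_path, ip_ckpt, device, num_tokens=4)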
First of all, this work is truly amazing!
Could there also be support for multiple ControlNets with IP-Adapters? I was trying canny and inpaint ControlNets together and faced errors.
Hi! First of all, thanks for really great work.
I see that you've added a face-conditioned model for SD1.5 recently; do you have any plans on releasing a similar model for SDXL? Also, could you give any estimate of how long it takes to train the IP-Adapter model? You mentioned 1M steps in your paper; could you elaborate how many hours/days that is?
Also I have a few improvement ideas based on my experience, e.g. using hidden_states[-2] for text conditioning.
Maybe you guys have seen this error before:
Traceback (most recent call last):
File "/pkg/modal/_container_entrypoint.py", line 351, in handle_input_exception
yield
File "/pkg/modal/_container_entrypoint.py", line 437, in run_inputs
res = imp_fun.fun(*args, **kwargs)
File "/root/modal_testing/adapter.py", line 188, in run
ip_model = IPAdapterPlus(
File "/content/IP-Adapter/ip_adapter/ip_adapter.py", line 52, in __init__
self.load_ip_adapter()
File "/content/IP-Adapter/ip_adapter/ip_adapter.py", line 84, in load_ip_adapter
self.image_proj_model.load_state_dict(state_dict["image_proj"])
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Resampler:
Missing key(s) in state_dict: "latents", "proj_in.weight", "proj_in.bias", "proj_out.weight", "proj_out.bias", "norm_out.weight", "norm_out.bias", "layers.0.0.norm1.weight", "layers.0.0.norm1.bias", "layers.0.0.norm2.weight", "layers.0.0.norm2.bias", "layers.0.0.to_q.weight", "layers.0.0.to_kv.weight", "layers.0.0.to_out.weight", "layers.0.1.0.weight", "layers.0.1.0.bias", "layers.0.1.1.weight", "layers.0.1.3.weight", "layers.1.0.norm1.weight", "layers.1.0.norm1.bias", "layers.1.0.norm2.weight", "layers.1.0.norm2.bias", "layers.1.0.to_q.weight", "layers.1.0.to_kv.weight", "layers.1.0.to_out.weight", "layers.1.1.0.weight", "layers.1.1.0.bias", "layers.1.1.1.weight", "layers.1.1.3.weight", "layers.2.0.norm1.weight", "layers.2.0.norm1.bias", "layers.2.0.norm2.weight", "layers.2.0.norm2.bias", "layers.2.0.to_q.weight", "layers.2.0.to_kv.weight", "layers.2.0.to_out.weight", "layers.2.1.0.weight", "layers.2.1.0.bias", "layers.2.1.1.weight", "layers.2.1.3.weight", "layers.3.0.norm1.weight", "layers.3.0.norm1.bias", "layers.3.0.norm2.weight", "layers.3.0.norm2.bias", "layers.3.0.to_q.weight", "layers.3.0.to_kv.weight", "layers.3.0.to_out.weight", "layers.3.1.0.weight", "layers.3.1.0.bias", "layers.3.1.1.weight", "layers.3.1.3.weight".
Unexpected key(s) in state_dict: "proj.weight", "proj.bias", "norm.weight", "norm.bias".
@xiaohu2015 any ideas??
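The unexpected keys (proj.*, norm.*) are exactly the parameters of the plain ImageProjModel, which suggests a non-Plus checkpoint was handed to IPAdapterPlus (which builds a Resampler instead). Pairing the class with a matching file should resolve it; the file names below are assumed from the release:

# either use the plain wrapper with the plain checkpoint...
ip_model = IPAdapter(pipe, image_encoder_path, "models/ip-adapter_sd15.bin", device)
# ...or keep IPAdapterPlus and point ip_ckpt at a -plus checkpoint such as
# models/ip-adapter-plus_sd15.bin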
Congrats again on the great work.
Could you please help clarify how the data subset was selected from the LAION?
Thanks.
I trained with DeepSpeed ZeRO Stage 2. After I run zero_to_fp32.py, the generated "pytorch_model.bin" is ~1.7 GB, but the pretrained model "ip-adapter_sd15.bin" is only ~43 MB.
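The ~1.7 GB file presumably holds the whole training module (the UNet included), while the released .bin contains only the adapter. A sketch of extracting just those weights, assuming the key layout of tutorial_train.py's IPAdapter module (image_proj_model.* and adapter_modules.*):

import torch
sd = torch.load("pytorch_model.bin", map_location="cpu")
ip_sd = {"image_proj": {}, "ip_adapter": {}}
for k, v in sd.items():
    if k.startswith("image_proj_model."):
        ip_sd["image_proj"][k[len("image_proj_model."):]] = v
    elif k.startswith("adapter_modules."):
        ip_sd["ip_adapter"][k[len("adapter_modules."):]] = v
torch.save(ip_sd, "ip-adapter_custom.bin")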
Can this be installed locally into Stable Diffusion Automatic1111?