
cfld's Introduction

CFLD arXiv

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, Jian-Huang Lai
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), June 17-21, 2024, Seattle, USA

[Figure: qualitative comparison results]

TL;DR

If you want to cite and compare with our method, please download the generated images from Google Drive here (including 256x176 and 512x352 on DeepFashion, and 128x64 on Market-1501).

[Figure: overall pipeline]

News 🔥🔥🔥

  • 2024/02/27  Our paper titled "Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis" is accepted by CVPR 2024.
  • 2024/02/28  We release the code and upload the arXiv preprint.
  • 2024/03/09  The checkpoints on the DeepFashion dataset are released on Google Drive.
  • 2024/03/09  We note that the file naming used by different open-source codebases can be extremely confusing. To facilitate future work, we have organized the generated images of several methods that we used for qualitative comparisons in the paper. They were uniformly resized to 256x176 or 512x352, stored as PNG files, and given the same naming format. Enjoy! 🤗
  • 2024/03/20  We upload the Jupyter notebook for inference. You can modify it as you want, e.g. replacing the conditional image with your own and randomly sampling a target pose from the test dataset.
  • 2024/04/05  Our paper is accepted as a CVPR 2024 Highlight!
  • 2024/04/10  The camera-ready version is now available on arXiv. The supplementary material with more discussions and results has been added.

Preparation

Install Environment

conda env create -f environment.yaml
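
Then activate the environment before running any of the commands below (the environment name CFLD is inferred from the env paths that appear in the issue tracebacks further down; double-check the name defined in environment.yaml):

conda activate CFLD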

Download DeepFashion Dataset

  • Download Img/img_highres.zip from the In-shop Clothes Retrieval Benchmark of DeepFashion and unzip it under the ./fashion directory. (A password is required; please contact the authors of DeepFashion (not us!) for permission.)
  • Download the train/test pairs and keypoints from DPTN and put them under the ./fashion directory.
  • Make sure the tree of the ./fashion directory is as follows:
    fashion
    ├── fashion-resize-annotation-test.csv
    ├── fashion-resize-annotation-train.csv
    ├── fashion-resize-pairs-test.csv
    ├── fashion-resize-pairs-train.csv
    ├── MEN
    ├── test.lst
    ├── train.lst
    └── WOMEN
    
  • Run generate_fashion_datasets.py with Python, for example:
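    python generate_fashion_datasets.py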

Download Pre-trained Models

Training

For multi-GPU training, run the following command by default.

bash scripts/multi_gpu/pose_transfer_train.sh 0,1,2,3,4,5,6,7

For single-GPU training, run the following command by default.

bash scripts/single_gpu/pose_transfer_train.sh 0

For ablation studies, specify the config file as in the following example.

bash scripts/multi_gpu/pose_transfer_train.sh 0,1,2,3,4,5,6,7 --config_file configs/ablation_study/no_app.yaml

Inference

For multi-GPU inference, specify the checkpoint path as in the following example.

bash scripts/multi_gpu/pose_transfer_test.sh 0,1,2,3,4,5,6,7 MODEL.PRETRAINED_PATH checkpoints

For single-GPU inference, specify the checkpoint path as in the following example.

bash scripts/single_gpu/pose_transfer_test.sh 0 MODEL.PRETRAINED_PATH checkpoints

Citation

@inproceedings{lu2024coarse,
  title={Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis},
  author={Lu, Yanzuo and Zhang, Manlin and Ma, Andy J and Xie, Xiaohua and Lai, Jian-Huang},
  booktitle={CVPR},
  year={2024}
}


cfld's Issues

About the checkpoint

May I ask if you could provide a trained model that I can use for inference?

Pose2Video

Hi,
Thanks for releasing this amazing program!
Do you have any plans to extend the UNet from 2D to 3D and enable pose-to-video generation, like Animate Anyone?

Questions about metric calculation

Thanks for your great efforts in this work. Could you release a simple metric-calculation script that only takes a folder of generated images and a folder of ground-truth images as input? Thanks a lot.
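
A minimal sketch of such a standalone script (not the authors' evaluation code; it assumes the torchmetrics package and placeholder folder names gen_imgs / gt_imgs):

    import os
    import torch
    from PIL import Image
    from torchvision import transforms
    from torchmetrics.image.fid import FrechetInceptionDistance
    from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

    def load_folder(path, size=(256, 176)):
        # Load every PNG in a folder as a float tensor in [0, 1], resized to a common size.
        to_tensor = transforms.Compose([transforms.Resize(size), transforms.ToTensor()])
        names = sorted(f for f in os.listdir(path) if f.lower().endswith(".png"))
        return torch.stack([to_tensor(Image.open(os.path.join(path, f)).convert("RGB")) for f in names])

    gen = load_folder("gen_imgs")  # generated images (placeholder folder name)
    gt = load_folder("gt_imgs")    # ground-truth images (placeholder folder name)

    fid = FrechetInceptionDistance(normalize=True)  # normalize=True expects float inputs in [0, 1]
    fid.update(gt, real=True)
    fid.update(gen, real=False)

    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
    print("FID:", fid.compute().item())
    print("LPIPS:", lpips(gen, gt).item())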

Why can the Perception-Refined Decoder extract specific information like gender, hairstyle, and so on?

Hi, I see you wrote in your paper: "By revisiting how people perceive a person image, we find several common characteristics, i.e., human body parts, age, gender, hairstyle, clothing, and so on, as demonstrated in Fig. 1(a)." But how do you make sure that these transformer blocks extract exactly the information you want and not something else? The shape of the hidden_states that the decoder outputs is [batchsize, 8, 768]. I want to know how these eight kinds of information are decoupled from the irrelevant information.

Thanks very much!

Code question about decoder.

In your paper, the Perception-Refined Decoder uses the source image encoder. So I thought the appearance encoder should be used, but in your code you use 'down_block_additional_residuals', which comes from the pose encoder. Why is that?
    def forward(self, batched_inputs):
        mask = batched_inputs["mask"] if "mask" in batched_inputs else None
        x, features = self.backbone(batched_inputs["img_cond"], mask=mask)
        up_block_additional_residuals = self.appearance_encoder(features)

        bsz = x.shape[0]
        if self.training:
            bsz = bsz * 2
            down_block_additional_residuals = self.pose_encoder(torch.cat([batched_inputs["pose_img_src"], batched_inputs["pose_img_tgt"]]))
            up_block_additional_residuals = {k: torch.cat([v, v]) for k, v in up_block_additional_residuals.items()}
            # why does self.decoder use the pose_encoder output?
            c = self.decoder(x, features, down_block_additional_residuals)

Issue with inference: AttributeError on 'FrozenDict'

Hello!

I am a college student interested in your work and am currently attempting to use your published model for my course project. However, while trying to run playground.ipynb (I also tried from the command line and got the same error), I encountered the error below.

Please check it and help me solve this problem. Thank you very much!


AttributeError Traceback (most recent call last)
Cell In[10], line 18
16 inputs = torch.cat([noisy_latents, noisy_latents, noisy_latents], dim=0)
17 inputs = noise_scheduler.scale_model_input(inputs, timestep=t)
---> 18 noise_pred = unet(sample=inputs, timestep=t, encoder_hidden_states=c_new,
19 down_block_additional_residuals=copy.deepcopy(down_block_additional_residuals),
20 up_block_additional_residuals=copy.deepcopy(up_block_additional_residuals))
22 noise_pred_uc, noise_pred_down, noise_pred_full = noise_pred.chunk(3)
23 noise_pred = noise_pred_uc +
24 cfg.TEST.DOWN_BLOCK_GUIDANCE_SCALE * (noise_pred_down - noise_pred_uc) +
25 cfg.TEST.FULL_GUIDANCE_SCALE * (noise_pred_full - noise_pred_down)

File ~/anaconda3/envs/CFLD/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/CFLD/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None

File /home/dlpj/CFLD/models/unet.py:1946, in UNet.forward(self, sample, timestep, **kwargs)
1945 def forward(self, sample, timestep, **kwargs):
-> 1946 return self.model(sample, timestep, **kwargs).sample

File ~/anaconda3/envs/CFLD/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/CFLD/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None

File /home/dlpj/CFLD/models/unet.py:1684, in ResidualUNet2DConditionModel.forward(self, sample, timestep, encoder_hidden_states, class_labels, timestep_cond, attention_mask, cross_attention_kwargs, added_cond_kwargs, down_block_additional_residuals, mid_block_additional_residual, up_block_additional_residuals, encoder_attention_mask, return_dict)
1681 encoder_attention_mask = encoder_attention_mask.unsqueeze(1)
1683 # 0. center input if necessary
-> 1684 if self.config.center_input_sample:
1685 sample = 2 * sample - 1.0
1687 # 1. time

AttributeError: 'FrozenDict' object has no attribute 'center_input_sample'
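
This error pattern is consistent with (an assumption, not a confirmed diagnosis) the installed diffusers version differing from the one the repository was developed against, so that the loaded config (a FrozenDict) simply lacks the center_input_sample key. A quick check is to compare the installed version against whatever is pinned in environment.yaml:

pip show diffusers | grep Version
grep -i diffusers environment.yaml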

Classifier-free Guidance for training

Hi authors. Thanks again for your great work! I have noticed that your methodology incorporates the use of classifier-free guidance both during the training and testing phases. The configuration parameter 'MODEL.U_COND_DOWN_BLOCK_GUIDANCE' has been set to False, which seemingly implies that the model has not undergone training in an unconditional pose setting. I would greatly appreciate clarification on your approach in this regard.


Reproduce question on test split

Hi.
When I use the provided checkpoint and run pose_transfer_test.sh with the default hyperparameters, I cannot reproduce the pictures shown in the README, and the metrics are also worse than those computed on the images you released.
So I want to know whether any hyperparameters were changed from the defaults when you generated the released result images.

The code of the loss function

Thanks for your great work and released code!

I have two questions about the code in pose_transfer_train.py:

  1. The losses described in the paper are a reconstruction loss and an MSE loss, but there is only one loss line in the implementation.

  2. Why are pose_img_src and pose_img_tgt concatenated for the pose encoder? And why are img_src and img_tgt concatenated for the input?

About inference

How can I run inference with my own local images? It seems the script tests the whole test dataset.

Inference image resolution 256x176

Hi @YanzuoLu, thanks for sharing this great work!

I want to run inference at the 256x176 image resolution. Could you share the instructions and the config file for this setting? In playground.ipynb, the latent output size is 64x64 and the output of the VAE decoder is 512x512 at the moment.

Thank you.

About the build_pose_img function's output

In your code:

    def build_pose_img(self, img_path):
        string = self.annotation_file.loc[os.path.basename(img_path)]
        array = load_pose_cords_from_strings(string['keypoints_y'], string['keypoints_x'])
        pose_map = torch.tensor(cords_to_map(array, tuple(self.pose_img_size), (256, 176)).transpose(2, 0, 1), dtype=torch.float32)
        pose_img = torch.tensor(draw_pose_from_cords(array, tuple(self.pose_img_size), (256, 176)).transpose(2, 0, 1) / 255., dtype=torch.float32)
        pose_img = torch.cat([pose_img, pose_map], dim=0)
        return pose_img

I am curious about the design choice in the build_pose_img function where it concatenates pose_img and pose_map, resulting in a tensor with 21 channels. My initial expectation was that the function would directly return the pose_img with 3 channels. I am interested in understanding the rationale behind using 21 channels instead.

What is the purpose of concatenating pose_img with pose_map, and how does it benefit the overall model or application?
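
For reference, the 21 channels are consistent with a 3-channel skeleton drawing plus one heatmap per keypoint, assuming the standard 18-keypoint OpenPose layout of the DPTN annotations (an assumption, not confirmed by the repo). A shape-only sketch:

    import torch
    # Illustrative sizes only: draw_pose_from_cords yields an RGB skeleton image (3 channels),
    # cords_to_map yields one heatmap per keypoint (18 channels under the assumed layout).
    pose_img = torch.zeros(3, 256, 256)
    pose_map = torch.zeros(18, 256, 256)
    print(torch.cat([pose_img, pose_map], dim=0).shape)  # torch.Size([21, 256, 256]) -> 3 + 18 = 21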

Another question: what is the difference between these two images (img_src and img_cond)? Which one is used for training?

    return_dict = {
        "img_src": img_src,
        "img_tgt": img_tgt,
        "img_cond": img_cond,
        "pose_img_src": pose_img_src,
        "pose_img_tgt": pose_img_tgt
    }

About the pose_encoder

I want to replace the OpenPose keypoints with DWPose. Can I delete the pose_encoder weights like this and retrain?

    del state_dict['pose_encoder.conv_in.weight']
    del state_dict['pose_encoder.conv_in.bias']
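
A minimal sketch of that idea, under the assumption (not confirmed by the authors) that only the pose_encoder input convolution depends on the keypoint format; the checkpoint filename below is hypothetical:

    import torch

    # `model` is assumed to be built beforehand, e.g. via build_model(cfg) as in playground.ipynb.
    state_dict = torch.load("checkpoints/pytorch_model.bin", map_location="cpu")  # hypothetical path
    for k in ("pose_encoder.conv_in.weight", "pose_encoder.conv_in.bias"):
        state_dict.pop(k, None)  # drop the input conv so it is re-initialized for the new keypoint format

    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("re-initialized parameters:", missing)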

Issues running playground.ipynb

There is an error at the line "unet = UNet(cfg).eval().requires_grad_(False).cuda()" when running playground.ipynb. The error message is below.


ValueError Traceback (most recent call last)
Cell In[4], line 4
2 vae = VariationalAutoencoder(pretrained_path="pretrained_models/vae").eval().requires_grad_(False).cuda()
3 model = build_model(cfg).eval().requires_grad_(False).cuda()
----> 4 unet = UNet(cfg).eval().requires_grad_(False).cuda()

File /data/CFLD/models/unet.py:1913, in UNet.init(self, cfg)
1910 def init(self, cfg):
1911 super().init()
-> 1913 self.model = ResidualUNet2DConditionModel.from_pretrained(
1914 cfg.MODEL.UNET_CONFIG.PRETRAINED_PATH, use_safetensors = False)
1915 self.model.requires_grad_(False)
1916 self.model.enable_xformers_memory_efficient_attention()

File /usr/local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py:114, in validate_hf_hub_args.._inner_fn(*args, **kwargs)
111 if check_use_auth_token:
112 kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.name, has_token=has_token, kwargs=kwargs)
--> 114 return fn(*args, **kwargs)

File /usr/local/lib/python3.9/site-packages/diffusers/models/modeling_utils.py:650, in ModelMixin.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
647 if low_cpu_mem_usage:
648 # Instantiate model with empty weights
649 with accelerate.init_empty_weights():
--> 650 model = cls.from_config(config, **unused_kwargs)
652 # if device_map is None, load the state dict and move the params from meta device to the cpu
653 if device_map is None:

File /usr/local/lib/python3.9/site-packages/diffusers/configuration_utils.py:260, in ConfigMixin.from_config(cls, config, return_unused_kwargs, **kwargs)
258 # Return model and optionally state and/or unused_kwargs
259 print("init_dict: ", init_dict, flush=True)
--> 260 model = cls(**init_dict)
262 # make sure to also save config parameters that might be used for compatible classes
263 # update _class_name
264 if "_class_name" in hidden_dict:

File /usr/local/lib/python3.9/site-packages/diffusers/configuration_utils.py:654, in register_to_config..inner_init(self, *args, **kwargs)
652 new_kwargs = {**config_init_kwargs, **new_kwargs}
653 getattr(self, "register_to_config")(**new_kwargs)
--> 654 init(self, *args, **init_kwargs)

File /data/CFLD/models/unet.py:1564, in ResidualUNet2DConditionModel.init(self, sample_size, in_channels, out_channels, center_input_sample, flip_sin_to_cos, freq_shift, down_block_types, mid_block_type, up_block_types, only_cross_attention, block_out_channels, layers_per_block, downsample_padding, mid_block_scale_factor, act_fn, norm_num_groups, norm_eps, cross_attention_dim, transformer_layers_per_block, encoder_hid_dim, encoder_hid_dim_type, attention_head_dim, num_attention_heads, dual_cross_attention, use_linear_projection, class_embed_type, addition_embed_type, addition_time_embed_dim, num_class_embeds, upcast_attention, resnet_time_scale_shift, resnet_skip_time_act, resnet_out_scale_factor, time_embedding_type, time_embedding_dim, time_embedding_act_fn, timestep_post_act, time_cond_proj_dim, conv_in_kernel, conv_out_kernel, projection_class_embeddings_input_dim, class_embeddings_concat, mid_block_only_cross_attention, cross_attention_norm, addition_embed_type_num_heads)
1561 else:
1562 add_upsample = False
-> 1564 up_block = get_residual_up_block(
1565 up_block_type,
1566 num_layers=reversed_layers_per_block[i] + 1,
1567 transformer_layers_per_block=reversed_transformer_layers_per_block[i],
1568 in_channels=input_channel,
1569 out_channels=output_channel,
1570 prev_output_channel=prev_output_channel,
1571 temb_channels=blocks_time_embed_dim,
1572 add_upsample=add_upsample,
1573 resnet_eps=norm_eps,
1574 resnet_act_fn=act_fn,
1575 resnet_groups=norm_num_groups,
1576 cross_attention_dim=reversed_cross_attention_dim[i],
1577 num_attention_heads=reversed_num_attention_heads[i],
1578 dual_cross_attention=dual_cross_attention,
1579 use_linear_projection=use_linear_projection,
1580 only_cross_attention=only_cross_attention[i],
1581 upcast_attention=upcast_attention,
1582 resnet_time_scale_shift=resnet_time_scale_shift,
1583 resnet_skip_time_act=resnet_skip_time_act,
1584 resnet_out_scale_factor=resnet_out_scale_factor,
1585 cross_attention_norm=cross_attention_norm,
1586 attention_head_dim=attention_head_dim[i] if attention_head_dim[i] is not None else output_channel,
1587 )
1588 self.up_blocks.append(up_block)
1589 prev_output_channel = output_channel

File /data/CFLD/models/unet.py:1061, in get_residual_up_block(up_block_type, num_layers, in_channels, out_channels, prev_output_channel, temb_channels, add_upsample, resnet_eps, resnet_act_fn, transformer_layers_per_block, num_attention_heads, resnet_groups, cross_attention_dim, dual_cross_attention, use_linear_projection, only_cross_attention, upcast_attention, resnet_time_scale_shift, resnet_skip_time_act, resnet_out_scale_factor, cross_attention_norm, attention_head_dim, upsample_type)
1059 if cross_attention_dim is None:
1060 raise ValueError("cross_attention_dim must be specified for CrossAttnUpBlock2D")
-> 1061 return ResidualCrossAttnUpBlock2D(
1062 num_layers=num_layers,
1063 transformer_layers_per_block=transformer_layers_per_block,
1064 in_channels=in_channels,
1065 out_channels=out_channels,
1066 prev_output_channel=prev_output_channel,
1067 temb_channels=temb_channels,
1068 add_upsample=add_upsample,
1069 resnet_eps=resnet_eps,
1070 resnet_act_fn=resnet_act_fn,
1071 resnet_groups=resnet_groups,
1072 cross_attention_dim=cross_attention_dim,
1073 num_attention_heads=num_attention_heads,
1074 dual_cross_attention=dual_cross_attention,
1075 use_linear_projection=use_linear_projection,
1076 only_cross_attention=only_cross_attention,
1077 upcast_attention=upcast_attention,
1078 resnet_time_scale_shift=resnet_time_scale_shift,
1079 )
1080 elif up_block_type == "SimpleCrossAttnUpBlock2D":
1081 if cross_attention_dim is None:

File /data/CFLD/models/unet.py:890, in ResidualCrossAttnUpBlock2D.init(self, in_channels, out_channels, prev_output_channel, temb_channels, dropout, num_layers, transformer_layers_per_block, resnet_eps, resnet_time_scale_shift, resnet_act_fn, resnet_groups, resnet_pre_norm, num_attention_heads, cross_attention_dim, output_scale_factor, add_upsample, dual_cross_attention, use_linear_projection, only_cross_attention, upcast_attention)
874 resnets.append(
875 ResidualResnetBlock2D(
876 in_channels=resnet_in_channels + res_skip_channels,
(...)
886 )
887 )
888 if not dual_cross_attention:
889 attentions.append(
--> 890 ResidualTransformer2DModel(
891 num_attention_heads,
892 out_channels // num_attention_heads,
893 in_channels=out_channels,
894 num_layers=transformer_layers_per_block,
895 cross_attention_dim=cross_attention_dim,
896 norm_num_groups=resnet_groups,
897 use_linear_projection=use_linear_projection,
898 only_cross_attention=only_cross_attention,
899 upcast_attention=upcast_attention,
900 )
901 )
902 else:
903 attentions.append(
904 DualTransformer2DModel(
905 num_attention_heads,
(...)
911 )
912 )

File /usr/local/lib/python3.9/site-packages/diffusers/configuration_utils.py:654, in register_to_config..inner_init(self, *args, **kwargs)
652 new_kwargs = {**config_init_kwargs, **new_kwargs}
653 getattr(self, "register_to_config")(**new_kwargs)
--> 654 init(self, *args, **init_kwargs)

File /data/CFLD/models/unet.py:502, in ResidualTransformer2DModel.init(self, num_attention_heads, attention_head_dim, in_channels, out_channels, num_layers, dropout, norm_num_groups, cross_attention_dim, attention_bias, sample_size, num_vector_embeds, patch_size, activation_fn, num_embeds_ada_norm, use_linear_projection, only_cross_attention, upcast_attention, norm_type, norm_elementwise_affine)
479 @register_to_config
480 def init(
481 self,
(...)
500 norm_elementwise_affine: bool = True,
501 ):
--> 502 super(Transformer2DModel, self).init()
503 self.use_linear_projection = use_linear_projection
504 self.num_attention_heads = num_attention_heads

File /usr/local/lib/python3.9/site-packages/diffusers/configuration_utils.py:654, in register_to_config..inner_init(self, *args, **kwargs)
652 new_kwargs = {**config_init_kwargs, **new_kwargs}
653 getattr(self, "register_to_config")(**new_kwargs)
--> 654 init(self, *args, **init_kwargs)

File /usr/local/lib/python3.9/site-packages/diffusers/models/transformers/transformer_2d.py:151, in Transformer2DModel.init(self, num_attention_heads, attention_head_dim, in_channels, out_channels, num_layers, dropout, norm_num_groups, cross_attention_dim, attention_bias, sample_size, num_vector_embeds, patch_size, activation_fn, num_embeds_ada_norm, use_linear_projection, only_cross_attention, double_self_attention, upcast_attention, norm_type, norm_elementwise_affine, norm_eps, attention_type, caption_channels, interpolation_scale)
146 raise ValueError(
147 f"Cannot define both num_vector_embeds: {num_vector_embeds} and patch_size: {patch_size}. Make"
148 " sure that either num_vector_embeds or num_patches is None."
149 )
150 elif not self.is_input_continuous and not self.is_input_vectorized and not self.is_input_patches:
--> 151 raise ValueError(
152 f"Has to define in_channels: {in_channels}, num_vector_embeds: {num_vector_embeds}, or patch_size:"
153 f" {patch_size}. Make sure that in_channels, num_vector_embeds or num_patches is not None."
154 )
156 # 2. Define input layers
157 if self.is_input_continuous:

ValueError: Has to define in_channels: None, num_vector_embeds: None, or patch_size: None. Make sure that in_channels, num_vector_embeds or num_patches is not None.

Training GPU

Hello, we are very interested in your project and we would like to try training with your code. What type of GPU did you use?

Thank you.

Is the pose map required? I want to run inference on poses from my own dataset

Is the pose map (keypoint coordinates) strictly required? I want to run inference on poses from my own dataset.

    def build_pose_img(annotation_file, img_path):
        string = annotation_file.loc[os.path.basename(img_path)]
        array = load_pose_cords_from_strings(string['keypoints_y'], string['keypoints_x'])
        pose_map = torch.tensor(cords_to_map(array, (256, 256), (256, 176)).transpose(2, 0, 1), dtype=torch.float32)
        pose_img = torch.tensor(draw_pose_from_cords(array, (256, 256), (256, 176)).transpose(2, 0, 1) / 255., dtype=torch.float32)
        pose_img = torch.cat([pose_img, pose_map], dim=0)
        return pose_img
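
For what it's worth, these helpers only need a keypoint array, so the pose input can in principle be built without the annotation CSV. A rough sketch under that assumption (the import path and the 18-keypoint (y, x) layout with -1 for missing joints are guesses, not confirmed by the repo):

    import numpy as np
    import torch
    # Same helpers as in build_pose_img above; their module path here is assumed, not taken from the repo.
    from pose_utils import cords_to_map, draw_pose_from_cords

    # 18 OpenPose-style joints as (y, x) in the original 256x176 frame; -1 marks a missing joint.
    array = np.full((18, 2), -1, dtype=int)
    array[0] = (30, 88)   # e.g. nose
    array[1] = (60, 88)   # e.g. neck

    pose_map = torch.tensor(cords_to_map(array, (256, 256), (256, 176)).transpose(2, 0, 1), dtype=torch.float32)
    pose_img = torch.tensor(draw_pose_from_cords(array, (256, 256), (256, 176)).transpose(2, 0, 1) / 255., dtype=torch.float32)
    pose_input = torch.cat([pose_img, pose_map], dim=0)  # same 21-channel layout as build_pose_img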


Issue with Loading Pre-trained Weights for Fine-tuning

Hello,

I hope this message finds you well. I am a senior student deeply interested in your work and currently attempting to leverage your published model for my academic project.

While trying to load the pre-trained weights for fine-tuning on my dataset, I encountered an error, which I am struggling to resolve. I have attached a screenshot to illustrate the issue more clearly.

The process I followed is based on the instructions in your documentation: load the pre-trained weights, then fine-tune the model on my data. However, upon execution, I encountered an error (shown in the attached screenshots).

I would greatly appreciate it if you could take a moment to look into this matter and provide any guidance or suggestions that might help me resolve this issue.

Thank you very much for your time and assistance. Your work is highly inspiring, and I am eager to apply it to my project successfully.
