
[ECCV2024] This is an official implementation for "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model"

License: Apache License 2.0


psalm's Introduction

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

Zheng Zhang*, Yeyao Ma*, Enming Zhang*, Xiang Bai

* Equal Contribution

arXiv Paper

Features

  • A powerful extension of the Large Multi-modal Model for generic (panoptic, instance, semantic) segmentation, referring segmentation, and interactive segmentation.
  • Supports joint training across multiple segmentation tasks and visual-language tasks.
  • Demonstrates zero-shot capabilities on unseen tasks, such as open-vocabulary segmentation, generalized referring segmentation, and video object segmentation.

[teaser figure]

Updates

  • Release evaluation code
  • Release training code

Installation

See Installation instructions.

Getting Started

See Preparing Datasets for PSALM.

See Getting Started with PSALM.

Model Zoo

  • Download PSALM here.

Citation

If you find this work useful for your research, please cite it using the following BibTeX entry.

@misc{zhang2024psalm,
      title={PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model}, 
      author={Zheng Zhang and Yeyao Ma and Enming Zhang and Xiang Bai},
      year={2024},
      eprint={2403.14598},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

Thanks to these awesome works: Mask2Former, Mask2Former-Simplify, and LLaVA. Our code is based on them.


psalm's Issues

Number of Epochs during Training

Hi, I wonder how many epochs you train PSALM for during the finetuning phase.

In your training script num_train_epochs is set to 10, but in your paper you state that 56k iterations with batch_size 64 are used during training. With num_train_epochs=10, the learning rate barely decreases and the loss does not go down for a long time. Should we decrease num_train_epochs to a smaller number, for example 3 or 4?
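
As a rough sanity check, the iteration budget from the paper can be converted into an epoch count once the size of the joint training set is known. A minimal sketch (the dataset size below is only a placeholder assumption):

total_iterations = 56_000      # iteration budget quoted from the paper
global_batch_size = 64         # batch size quoted from the paper
dataset_size = 1_200_000       # assumption: replace with your joint dataset length
samples_seen = total_iterations * global_batch_size
num_train_epochs = samples_seen / dataset_size
print(f"~{num_train_epochs:.1f} epochs")   # with these placeholder numbers, roughly 3 epochs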

Thanks!

ValueError: matrix contains invalid numeric entries

When training the model, I encountered a ValueError: matrix contains invalid numeric entries, and I'm not sure what the reason is. I wanted to ask the authors whether they have encountered similar situations and how to avoid them.

Error when Evaluation in Command Line

Hello, I want to evaluate COCO on the instance segmentation task. I run the command python psalm/eval/instance_segmentation.py --image_folder /path/to/coco/val2017/ --model_path /path/to/PSALM --json_path /path/to/coco/instance_instruction_segmentation_val.json, but I couldn't find instance_instruction_segmentation_val.json, so I used the instance_val_psalm.json generated by datasets/build_COCO_instance.py. Is that JSON file correct? If it is, I still get an error during evaluation, which I traced to https://github.com/zamling/PSALM/blob/c58268e07dd01a44e5848dada2df31c558978d37/psalm/model/datasets_mapper/coco_instance_mapper.py#L233C12-L233C55
[screenshot of the error]

Training with multiple nodes

Hi, in your paper you mentioned that your model is trained with 16 V100 GPUs. I assume this means you trained your model on two 8-GPU nodes, right?

Could you share some insights about how to train the model using DeepSpeed with multiple nodes? Do you need to write a hostfile for it? I used a Slurm cluster to allocate two nodes, but it seems that the two nodes are not communicating with each other.

Thanks!

Visualization scripts for evaluation

Thanks a lot for your great work! I am able to evaluate the model using the provided scripts.

However, I'm wondering what the correct way is to visualize the model's inference results. For instance, for the refCOCO dataset, is there a way to visualize the output segmentation masks (to create demonstration figures like Figure 1 in the paper)?
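
As a starting point (a minimal sketch, not the authors' tooling), a predicted binary mask can simply be blended onto the source image; image_path and the HxW 0/1 numpy array mask are assumed to come from your own evaluation loop:

import numpy as np
from PIL import Image

image = Image.open(image_path).convert("RGB")
overlay = np.array(image, dtype=np.float32)
color = np.array([255, 0, 0], dtype=np.float32)   # paint the mask red
alpha = 0.5
overlay[mask > 0] = (1 - alpha) * overlay[mask > 0] + alpha * color
Image.fromarray(overlay.astype(np.uint8)).save("refcoco_vis.png")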

Load Model error?

I found the following warning/error when trying the demo, but I'm not sure why the model loads with an error.

Can someone give some suggestions?

Many thanks!

DETAIL LOG:

loading segmentation model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
current model is psalm
Mask Decoder has been trained, init directly
current seg concat mode: False, seg_norm: False, seg_proj: True, seg_fuse_score: False
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.30it/s]Some weights of PSALM were not initialized from the model checkpoint at models and are newly initialized: ['model.layers.0.self_attn.k_proj.bias', 'model.layers.0.self_attn.k_proj.weight',layers.9.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.bias', 'model.layers.9.self_attn.v_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Some questions about paper and code

Great job! Regarding the referring task, the paper says that the sentence condition is extracted from the special [REF] token, while in the code the sentence-condition embedding is obtained by average-pooling the entire sentence. Should I follow the code?
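
For reference, average-pooling a sentence is usually done over the valid tokens only. This is a hedged sketch of that operation (hidden_states and attention_mask are assumed to come from the language model over the referring expression), not the exact repository code:

import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (B, L, D); attention_mask: (B, L) with 1 for real tokens
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)   # (B, L, 1)
    summed = (hidden_states * mask).sum(dim=1)                    # (B, D)
    counts = mask.sum(dim=1).clamp(min=1e-6)                      # (B, 1)
    return summed / counts                                        # sentence-condition embedding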

About inference code

Thank you for open-sourcing this work. When will the inference code that can support arbitrary inputs be released?

Question about multi-source data joint training

Hi, I do have a question regarding the multi-dataset joint training.

During joint training, some GPUs only have chat data while other GPUs have segmentation data, so within the same batch the mask decoder is not used on some GPUs. DeepSpeed would probably hang in this situation, as in the referenced issue.

How do you solve this problem? Thank you!
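
One common workaround for this class of problem (a sketch of the general trick, not necessarily how PSALM handles it): when a rank's batch contains no segmentation samples, add a zero-weighted dummy term over the unused module's parameters so every rank still contributes gradients for them and the all-reduce does not hang. The names below are hypothetical:

if not batch_has_seg_samples:                                      # assumption: tracked per batch
    dummy = sum(p.sum() for p in model.mask_decoder.parameters())  # hypothetical module name
    loss = loss + 0.0 * dummy                                      # keeps gradients defined, adds nothing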

Fail to finetune on my own dataset for anomaly detection task

Issue: Finetuning PSALM Model for Anomaly Detection

I'm working on finetuning the PSALM model after stage 2 for an anomaly detection task on my own dataset, which consists of two categories: 1. Defective 2. Good

Issue Description

During the evaluation stage, I find that the model always outputs all zeros for panoptic_seg, with no segments_info.

I have constructed three test demos:

① The training set contains only one category (category 2 -> good).

panoptic_train2017.json:

[screenshot of the JSON; all the segments_info entries are the same]

images in train2017(input):

[screenshot of a training image]

images in panoptic_train2017 and panoptic_semseg_train2017:

[screenshot of the mask: totally black -> only category 2]

panoptic_val2017.json:

[screenshots of panoptic_val2017.json]

PS:

① If an image is "defective", its ground-truth mask has 2 categories (black and white);

if an image is "good", its ground-truth mask has only 1 category (totally black).

② "id": 16777215 = 255 + 255*256 + 255*256^2 (rgb2id in panopticapi)

images in val2017:

[screenshot of a val image]

(it's a defective image)

images in panoptic_val2017

[screenshot of the panoptic ground-truth mask]

(black means it's good, white means it's defective)

finetune.sh (based on train.sh):

export DISABLE_ADDMM_CUDA_LT=1
deepspeed psalm/train/finetune.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path "/liujinxin/code/models/PSALM/models/PSALM" \
    --version "llava_phi" \
    --panoptic_json_path "/liujinxin/code/models/PSALM/datasets/coco_black" \
    --image_folder "/liujinxin/code/models/PSALM/datasets/coco_black/train2017" \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --fp16 True \
    --output_dir ./checkpoint/PSALM_black \
    --num_train_epochs 10 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 15000 \
    --save_total_limit 1 \
    --learning_rate 6e-6 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none \
    --seg_task 'panoptic'

finetune.py (based on train.py):

def train():
    global local_rank

    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    local_rank = training_args.local_rank
    compute_dtype = (torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32))

    mask_cfg = get_mask_config(config=model_args.mask_config)
    mask_cfg.MODEL.MASK_FORMER.SEG_TASK = model_args.seg_task
    bnb_model_from_pretrained_args = {}

    print('using model PSALM')

    # model_name = get_model_name_from_path(model_path)

    # model_args.model_map_name = 
    model_args.model_map_name = 'psalm'
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_args.model_name_or_path, None,'psalm',mask_config=model_args.mask_config,model_args=model_args)
    data_args.image_processor = image_processor
    data_args.is_multimodal = True
    # if not training_args.bf16:
    #Load PSALM model
    # model = PSALM.from_pretrained(
    #     model_args.model_name_or_path,
    #     mask_decoder_cfg=mask_cfg,
    #     add_cross_attn=True,
    #     cache_dir=training_args.cache_dir,
    #     **bnb_model_from_pretrained_args
    #             )
    # if not model.is_train_mask_decode:
    #     mask2former_ckpt = model_args.vision_tower if model_args.load_mask2former else None
    #     model.initial_mask_module(mask2former_ckpt)

    model.config.use_cache = False

    #Decide whether to freeze the backbone
    if model_args.freeze_backbone:
        model.model.requires_grad_(False)

    # Freeze the projector
    for param in model.get_model().mm_projector.parameters():
        param.requires_grad = False

    # Freeze vision_tower parameters
    for param in model.get_model().vision_tower.parameters():
        param.requires_grad = False     

    # Decide whether to use gradient_checkpointing
    if training_args.gradient_checkpointing:
        if hasattr(model, "enable_input_require_grads"):
            model.enable_input_require_grads()
        else:
            def make_inputs_require_grad(module, input, output):
                output.requires_grad_(True)

            model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=False,
    )

    if tokenizer.pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token="[PAD]"),
            tokenizer=tokenizer,
            model=model,
        )
    if model_args.version in conversation_lib.conv_templates:
        conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
    else:
        conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]

    # Delete the code that loads vision_tower and mm_projector

    tokenizer.add_tokens("[SEG]")
    model.resize_token_embeddings(len(tokenizer))
    model.get_special_token(SEG=tokenizer("[SEG]", return_tensors='pt', add_special_tokens=False)['input_ids'], EOS=tokenizer.eos_token_id)
    data_module = make_unify_datamodule(tokenizer=tokenizer, data_args=data_args, training_args=training_args)
    training_args.dataloader_drop_last = True
    trainer = LLaVATrainer(model=model,
                           tokenizer=tokenizer,
                           args=training_args,
                           **data_module)
    if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()
    trainer.save_state()

    model.config.use_cache = True

    # Delete the code that check whether using lora

    safe_save_model_for_hf_trainer(trainer=trainer,output_dir=training_args.output_dir)

if __name__ == "__main__":
    train()

load data:

def make_unify_datamodule(tokenizer, data_args, training_args):
    panoptic_coco_dataset = COCO_panoptic_dataset_random(json_path=data_args.panoptic_json_path, tokenizer=tokenizer,
                                                             data_args=data_args)
    datasets = [panoptic_coco_dataset]

    # you can change 16 to your own frequency setting; it controls how many samples are drawn before switching tasks
    train_dataset = UnifyDatasetSingleDatasetForBatch(datasets,16,fix_dataset_len=data_args.fix_dataset_len)
    print(f'total unify dataset number is {len(train_dataset)}')
    data_collator = DataCollatorForCOCODatasetV2(tokenizer=tokenizer)
    return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)

PS: only panoptic_coco_dataset is used, i.e. task 1 in the paper.

training process info:

[2024-06-07 09:47:31,609] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-07 09:47:33,356] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-07 09:47:33,357] [INFO] [runner.py:555:main] cmd = /root/miniforge3/envs/psalm/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None psalm/train/finetune.py --deepspeed ./scripts/zero2.json --model_name_or_path /liujinxin/code/models/PSALM/models/PSALM --version llava_phi --panoptic_json_path /liujinxin/code/models/PSALM/datasets/coco_white_gray --image_folder /liujinxin/code/models/PSALM/datasets/coco_white_gray/train2017 --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --fp16 True --output_dir ./checkpoint/PSALM_white_gray2 --num_train_epochs 10 --per_device_train_batch_size 4 --per_device_eval_batch_size 2 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 15000 --save_total_limit 1 --learning_rate 6e-6 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 False --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to none --seg_task panoptic
[2024-06-07 09:47:34,975] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-07 09:47:37,134] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-06-07 09:47:37,134] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-06-07 09:47:37,134] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-06-07 09:47:37,134] [INFO] [launch.py:163:main] dist_world_size=1
[2024-06-07 09:47:37,134] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-06-07 09:47:39,886] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-07 09:47:41,309] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-07 09:47:41,309] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-06-07 09:47:41,309] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
using model PSALM
loading segmentation model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
current model is psalm
Mask Decoder has been trained, init directly
current seg concat mode: False, seg_norm: False, seg_proj: True, seg_fuse_score: False
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:21<00:00, 10.98s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
coco_id_to_cont_id: {1: 0, 2: 1}
coco_class_name: ['defective', 'good']
total unify dataset number is 132
Rank: 0 partition count [1, 1] and sizes[(1436624896, False), (623040, False)] 
  0%|                                                                               | 0/330 [00:00<?, ?it/s]panoptic_coco
{'loss_mask': 49.85839080810547, 'loss_dice': 24.046592712402344, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.289311408996582, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0}
{'loss': 81.1943, 'learning_rate': 0.0, 'epoch': 0.03}                                                      
  0%|| 1/330 [00:04<26:22,  4.81s/it]panoptic_coco
{'loss_mask': 52.50733184814453, 'loss_dice': 23.399078369140625, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.860533237457275, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.03}
{'loss': 82.7669, 'learning_rate': 0.0, 'epoch': 0.06}                                                      
  1%|| 2/330 [00:05<12:59,  2.38s/it]panoptic_coco
{'loss_mask': 49.13460159301758, 'loss_dice': 23.756378173828125, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.406155586242676, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.06}
{'loss': 80.2971, 'learning_rate': 0.0, 'epoch': 0.09}                                                      
  1%|| 3/330 [00:06<08:44,  1.60s/it]panoptic_coco
{'loss_mask': 48.565338134765625, 'loss_dice': 24.029932022094727, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.697427272796631, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.09}
{'loss': 79.2927, 'learning_rate': 0.0, 'epoch': 0.12}                                                      
  1%|| 4/330 [00:06<06:47,  1.25s/it]panoptic_coco
{'loss_mask': 49.155311584472656, 'loss_dice': 24.500837326049805, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.502076148986816, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.12}
{'loss': 80.1582, 'learning_rate': 0.0, 'epoch': 0.15}                                                      
  2%|| 5/330 [00:07<05:38,  1.04s/it]panoptic_coco
{'loss_mask': 48.19862365722656, 'loss_dice': 23.695858001708984, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.095686912536621, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.15}
{'loss': 78.9902, 'learning_rate': 0.0, 'epoch': 0.18}                                                      
  2%|█▎                                                                     | 6/330 [00:08<04:57,  1.09it/s]panoptic_coco
{'loss_mask': 51.16276550292969, 'loss_dice': 23.66861343383789, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.111388206481934, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.18}
{'loss': 81.9428, 'learning_rate': 0.0, 'epoch': 0.21}                                                      
  2%|█▌                                                                     | 7/330 [00:08<04:35,  1.17it/s]panoptic_coco
{'loss_mask': 49.68932342529297, 'loss_dice': 23.382747650146484, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.24033784866333, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.21}
{'loss': 80.3124, 'learning_rate': 0.0, 'epoch': 0.24}                                                      
  2%|█▋                                                                     | 8/330 [00:09<04:23,  1.22it/s]panoptic_coco
{'loss_mask': 50.03607940673828, 'loss_dice': 23.745264053344727, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.563680648803711, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.24}
{'loss': 80.345, 'learning_rate': 6.000000000000001e-07, 'epoch': 0.27}                                     
  3%|█▉                                                                     | 9/330 [00:10<04:25,  1.21it/s]panoptic_coco
{'loss_mask': 51.757225036621094, 'loss_dice': 23.377241134643555, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.948200702667236, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.27}
{'loss': 82.0827, 'learning_rate': 1.2000000000000002e-06, 'epoch': 0.3}                                    
  3%|██                                                                    | 10/330 [00:11<04:30,  1.18it/s]panoptic_coco
{'loss_mask': 53.21855163574219, 'loss_dice': 23.796293258666992, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.505247116088867, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.3}
{'loss': 84.5201, 'learning_rate': 1.8e-06, 'epoch': 0.33}                                                  
  3%|██▎                                                                   | 11/330 [00:12<04:35,  1.16it/s]panoptic_coco
{'loss_mask': 49.33003616333008, 'loss_dice': 23.544994354248047, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.133486747741699, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.33}
{'loss': 80.0085, 'learning_rate': 2.4000000000000003e-06, 'epoch': 0.36}                                   
  4%|██▌                                                                   | 12/330 [00:13<04:34,  1.16it/s]panoptic_coco
{'loss_mask': 50.854331970214844, 'loss_dice': 23.77133560180664, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.8431196212768555, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.36}
{'loss': 81.4688, 'learning_rate': 3e-06, 'epoch': 0.39}                                                    
  4%|██▊                                                                   | 13/330 [00:14<04:32,  1.16it/s]panoptic_coco
{'loss_mask': 53.42914962768555, 'loss_dice': 23.217002868652344, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.458742141723633, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.39}
{'loss': 83.1049, 'learning_rate': 3e-06, 'epoch': 0.42}                                                    
  4%|██▉                                                                   | 14/330 [00:14<04:18,  1.22it/s]panoptic_coco
{'loss_mask': 53.33277893066406, 'loss_dice': 23.60284423828125, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.294776916503906, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.42}
{'loss': 83.2304, 'learning_rate': 3.6e-06, 'epoch': 0.45}                                                  
  5%|███▏                                                                  | 15/330 [00:15<04:25,  1.19it/s]panoptic_coco
{'loss_mask': 52.171478271484375, 'loss_dice': 23.656320571899414, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.994977951049805, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.45}
{'loss': 81.8228, 'learning_rate': 4.2e-06, 'epoch': 0.48}                                                  
  5%|███▍                                                                  | 16/330 [00:16<04:26,  1.18it/s]panoptic_coco
{'loss_mask': 48.39109420776367, 'loss_dice': 23.990428924560547, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.271108627319336, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.48}
{'loss': 78.6526, 'learning_rate': 4.800000000000001e-06, 'epoch': 0.52}                                    
  5%|███▌                                                                  | 17/330 [00:17<04:26,  1.18it/s]panoptic_coco
{'loss_mask': 49.93635177612305, 'loss_dice': 23.715030670166016, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.401008605957031, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.52}
{'loss': 80.0524, 'learning_rate': 5.4e-06, 'epoch': 0.55}                                                  
  5%|███▊                                                                  | 18/330 [00:18<04:24,  1.18it/s]panoptic_coco
{'loss_mask': 49.057777404785156, 'loss_dice': 24.252079010009766, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.318177700042725, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.55}
{'loss': 80.628, 'learning_rate': 6e-06, 'epoch': 0.58}                                                     
  6%|████                                                                  | 19/330 [00:19<04:34,  1.13it/s]panoptic_coco
{'loss_mask': 47.64111328125, 'loss_dice': 23.820335388183594, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.369119644165039, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.58}
{'loss': 78.8306, 'learning_rate': 5.999855426877984e-06, 'epoch': 0.61}                                    
  6%|████▏                                                                 | 20/330 [00:20<04:49,  1.07it/s]panoptic_coco
{'loss_mask': 45.717586517333984, 'loss_dice': 23.78108024597168, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.449423789978027, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.61}
{'loss': 75.9481, 'learning_rate': 5.999421721446195e-06, 'epoch': 0.64}                                    
  6%|████▍                                                                 | 21/330 [00:21<04:57,  1.04it/s]panoptic_coco
{'loss_mask': 44.83399200439453, 'loss_dice': 24.038347244262695, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.863902568817139, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.64}
{'loss': 74.7362, 'learning_rate': 5.998698925506064e-06, 'epoch': 0.67}                                    
  7%|████▋                                                                 | 22/330 [00:22<05:05,  1.01it/s]panoptic_coco
{'loss_mask': 45.72657012939453, 'loss_dice': 24.009532928466797, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.071250915527344, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.67}
{'loss': 75.8074, 'learning_rate': 5.997687108722169e-06, 'epoch': 0.7}                                     
  7%|████▉                                                                 | 23/330 [00:23<04:53,  1.05it/s]panoptic_coco
{'loss_mask': 45.27854919433594, 'loss_dice': 24.46664047241211, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.600783348083496, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.7}
{'loss': 75.346, 'learning_rate': 5.996386368615517e-06, 'epoch': 0.73}                                     
  7%|█████                                                                 | 24/330 [00:24<04:53,  1.04it/s]panoptic_coco
{'loss_mask': 42.8742790222168, 'loss_dice': 22.759302139282227, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.884395122528076, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.73}
  8%|███▏                                      | 25/330 [00:25<04:44,  1.07it/s]{'loss': 71.518, 'learning_rate': 5.994796830554148e-06, 'epoch': 0.76}         
  8%|███▏                                      | 25/330 [00:25<04:44,  1.07it/s]panoptic_coco
{'loss_mask': 42.58303451538086, 'loss_dice': 24.098745346069336, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.326217174530029, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.76}
{'loss': 73.008, 'learning_rate': 5.992918647741047e-06, 'epoch': 0.79}                                     
  8%|█████▌                                                                | 26/330 [00:25<04:39,  1.09it/s]panoptic_coco
{'loss_mask': 43.64096450805664, 'loss_dice': 23.229507446289062, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.295878887176514, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.79}
{'loss': 73.1663, 'learning_rate': 5.990752001199384e-06, 'epoch': 0.82}                                    
  8%|█████▋                                                                | 27/330 [00:26<04:39,  1.08it/s]panoptic_coco
{'loss_mask': 43.896018981933594, 'loss_dice': 22.761449813842773, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.161799430847168, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.82}
{'loss': 72.8193, 'learning_rate': 5.988297099755062e-06, 'epoch': 0.85}                                    
  8%|█████▉                                                                | 28/330 [00:27<04:35,  1.10it/s]panoptic_coco
{'loss_mask': 41.98052978515625, 'loss_dice': 23.672922134399414, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.856027603149414, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.85}
{'loss': 72.5095, 'learning_rate': 5.985554180016591e-06, 'epoch': 0.88}                                    
  9%|██████▏                                                               | 29/330 [00:28<04:31,  1.11it/s]panoptic_coco
{'loss_mask': 40.907073974609375, 'loss_dice': 23.011926651000977, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.6860032081604, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.88}
{'loss': 70.605, 'learning_rate': 5.982523506352285e-06, 'epoch': 0.91}                                     
  9%|██████▎                                                               | 30/330 [00:29<04:28,  1.12it/s]panoptic_coco
{'loss_mask': 40.939064025878906, 'loss_dice': 22.917320251464844, 'loss_SEG_class': 0.0, 'loss_class_name_class': 7.153843879699707, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.91}
{'loss': 71.0102, 'learning_rate': 5.979205370864779e-06, 'epoch': 0.94}                                    
  9%|██████▌                                                               | 31/330 [00:30<04:23,  1.13it/s]panoptic_coco
{'loss_mask': 40.88875961303711, 'loss_dice': 23.06298828125, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.79106330871582, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.94}
{'loss': 69.7428, 'learning_rate': 5.9756000933628785e-06, 'epoch': 0.97}                                   
 10%|██████▊                                                               | 32/330 [00:31<04:27,  1.11it/s]panoptic_coco
{'loss_mask': 40.81318283081055, 'loss_dice': 21.933622360229492, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.372218132019043, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 0.97}
{'loss': 69.119, 'learning_rate': 5.971708021330732e-06, 'epoch': 1.0}                                      
 10%|███████                                                               | 33/330 [00:32<04:35,  1.08it/s]panoptic_coco
{'loss_mask': 39.913272857666016, 'loss_dice': 20.571090698242188, 'loss_SEG_class': 0.0, 'loss_class_name_class': 6.420867919921875, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 1.0}
{'loss': 66.9052, 'learning_rate': 5.967529529894344e-06, 'epoch': 1.03}                                    
 10%|███████▏                                                              | 34/330 [00:34<06:03,  1.23s/it]panoptic_coco
{'loss_mask': 35.99129104614258, 'loss_dice': 18.98656463623047, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.745621204376221, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 1.03}
{'loss': 60.7235, 'learning_rate': 5.963065021785414e-06, 'epoch': 1.06}                                    
 11%|███████▍                                                              | 35/330 [00:35<05:29,  1.12s/it]panoptic_coco
{'loss_mask': 34.689720153808594, 'loss_dice': 20.849706649780273, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.4378204345703125, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 1.06}
{'loss': 60.9772, 'learning_rate': 5.958314927302526e-06, 'epoch': 1.09}                                    
 11%|███████▋                                                              | 36/330 [00:35<05:05,  1.04s/it]panoptic_coco
{'loss_mask': 31.195430755615234, 'loss_dice': 17.665014266967773, 'loss_SEG_class': 0.0, 'loss_class_name_class': 5.153102874755859, 'loss_region_class': 0.0, 'loss_llm': 0.0, 'epoch': 1.09}
{'loss': 54.0135, 'learning_rate': 5.953279704269675e-06, 'epoch': 1.12}                                    
 11%|███████▊                                                              | 37/330 [00:36<04:48,  1.02it/s]
 

evaluation command:

python psalm/eval/panoptic_segmentation.py --image_folder /liujinxin/code/models/PSALM/datasets/coco_black/val2017 --model_path /liujinxin/code/models/PSALM/checkpoint/PSALM_black --json_path /liujinxin/code/models/PSALM/datasets/coco_black

panoptic_evaluation.py:

just add "print" to output the info

[screenshot of the modified code]

I modified the code inside process() as follows:

def process(self, inputs, outputs):
        import io
        import os
        import numpy as np
        from PIL import Image
        from panopticapi.utils import id2rgb

        for input, output in zip(inputs, outputs):
            panoptic_img, segments_info = output["panoptic_seg"]
            panoptic_img = panoptic_img.cpu().numpy()
            if segments_info is None:
                # If "segments_info" is None, we assume "panoptic_img" is a
                # H*W int32 image storing the panoptic_id in the format of
                # category_id * label_divisor + instance_id. We reserve -1 for
                # VOID label, and add 1 to panoptic_img since the official
                # evaluation script uses 0 for VOID label.
                label_divisor = 1000
                segments_info = []
                for panoptic_label in np.unique(panoptic_img):
                    if panoptic_label == -1:
                        # VOID region.
                        continue
                    pred_class = panoptic_label // label_divisor
                    isthing = self.is_thing_list[pred_class]
                    segments_info.append(
                        {
                            "id": int(panoptic_label) + 1,
                            "category_id": int(pred_class),
                            "isthing": bool(isthing),
                        }
                    )
                # Official evaluation script uses 0 for VOID label.
                panoptic_img += 1

            file_name = os.path.basename(input["file_name"])
            file_name_png = os.path.splitext(file_name)[0] + ".png"


            def extract_last_parts(path):
                parts = path.split('/')
                last_three_parts = '/'.join(parts[-3:])
                last_folder_paths = '/'.join(parts[-3:-1])
                return last_folder_paths,last_three_parts
            output_dir = '/liujinxin/code/models/PSALM/outputImg'
            output_folder_path,output_file_name = extract_last_parts(input["file_name"])
            
            os.makedirs(os.path.join(output_dir,output_folder_path), exist_ok=True)

            # print("output: ",output)
            # print("input: ",input)
            print("panoptic_img",panoptic_img)
            print("segments_info: ",segments_info)

            with io.BytesIO() as out:
                Image.fromarray(id2rgb(panoptic_img)).save(out, format="PNG")
                # Convert the panoptic image to RGB and save it directly to the specified folder
                output_path = os.path.join(output_dir, output_file_name)
                Image.fromarray(id2rgb(panoptic_img)).save(output_path, format="PNG")

                segments_info = [self._convert_category_id(x) for x in segments_info]
                self._predictions.append(
                    {
                        "image_id": input["image_id"],
                        "file_name": file_name_png,
                        "png_string": out.getvalue(),
                        "segments_info": segments_info,
                    }
                )

PS: ① print panoptic_img and segments_info;

② save the panoptic_img converted by id2rgb to a specific folder.

evaluation process:

[2024-06-07 09:56:04,436] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
loading segmentation model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
current model is psalm
Mask Decoder has been trained, init directly
current seg concat mode: False, seg_norm: False, seg_proj: True, seg_fuse_score: False
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:35<00:00, 18.00s/it]
coco_id_to_cont_id: {1: 0, 2: 1}
coco_class_name: ['defective', 'good']
SemSegEvaluator(ignore_label) is deprecated! It should be obtained from metadata.
  0%|                                                                                | 0/35 [00:00<?, ?it/s]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  3%|██                                                                      | 1/35 [00:02<01:35,  2.82s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  6%|████                                                                    | 2/35 [00:03<00:48,  1.46s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  9%|██████▏                                                                 | 3/35 [00:03<00:33,  1.04s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []

(without segments_info)

PS: ① If the training set's masks all contain only category 1, it works like this:

[2024-06-07 09:58:07,221] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
loading segmentation model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
current model is psalm
Mask Decoder has been trained, init directly
current seg concat mode: False, seg_norm: False, seg_proj: True, seg_fuse_score: False
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:35<00:00, 17.71s/it]
coco_id_to_cont_id: {1: 0, 2: 1}
coco_class_name: ['defective', 'good']
SemSegEvaluator(ignore_label) is deprecated! It should be obtained from metadata.
  0%|                                                                                | 0/35 [00:00<?, ?it/s]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  3%|██                                                                      | 1/35 [00:02<01:20,  2.36s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  6%|████                                                                    | 2/35 [00:02<00:43,  1.31s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  9%|██████▏                                                                 | 3/35 [00:03<00:29,  1.07it/s]panoptic_img [[1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]]
segments_info:  [{'id': 1, 'isthing': True, 'category_id': 0}]
 11%|████████▏                                                               | 4/35 [00:03<00:23,  1.31it/s]panoptic_img [[1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]]
segments_info:  [{'id': 1, 'isthing': True, 'category_id': 0}]

(sometimes it has segments_info and sometimes it doesn't)

② If the masks all contain only category 3, it doesn't work:

panoptic_train2017.json: [screenshots]

[2024-06-07 09:56:04,436] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
loading segmentation model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
current model is psalm
Mask Decoder has been trained, init directly
current seg concat mode: False, seg_norm: False, seg_proj: True, seg_fuse_score: False
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:35<00:00, 18.00s/it]
coco_id_to_cont_id: {1: 0, 2: 1,3: 1}
coco_class_name: ['defective', 'good', 'fantastic']
SemSegEvaluator(ignore_label) is deprecated! It should be obtained from metadata.
  0%|                                                                                | 0/35 [00:00<?, ?it/s]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  3%|██                                                                      | 1/35 [00:02<01:35,  2.82s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  6%|████                                                                    | 2/35 [00:03<00:48,  1.46s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []
  9%|██████▏                                                                 | 3/35 [00:03<00:33,  1.04s/it]panoptic_img [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
segments_info:  []

② The training set contains 2 categories, with the left half of the mask gray and the right half white.

......

③ The training set contains only one category (category 1).

......

dataset_coco_class_name under multi-dataset training

Hello @zamling,

Thank you for sharing the code from your fascinating work. I am currently training the PSALM model using my own datasets, which include multiple datasets with various object classes. While reviewing the code, I found the settings for dataset_coco_id_to_cont_id and dataset_coco_class_name to be aligned with the dataset that has the maximum class number. Could you please explain the rationale behind these specific lines of code?

for _dataset in self.datasets:

Thank you for your help!

Question about data overlap

Hello, thanks for the great work.

I noticed that the RefCOCO val/test sets use images from the COCO training set. When doing joint training, I think this could cause a data leak: the test images and masks for RefCOCO are seen during training on COCO-Panoptic. Is this true, or have you handled it somewhere?
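
One quick way to check this overlap yourself (a hedged sketch: the annotation paths are placeholders, and the refs(unc).p layout follows the original refer toolkit, with image_id and split fields per referring expression):

import json
import pickle

coco_train_ids = {img["id"] for img in json.load(open("annotations/instances_train2017.json"))["images"]}
refs = pickle.load(open("refcoco/refs(unc).p", "rb"))
refcoco_eval_ids = {r["image_id"] for r in refs if r["split"] in ("val", "testA", "testB")}
print("overlapping images:", len(coco_train_ids & refcoco_eval_ids))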

Is mask decoder trained in the second stage?

Hello @zamling, it's mentioned in the paper that the mask decoder is trained. However, I noticed that in the code is_train_mask_decode is set to False by default, which means the mask decoder is not trained. I'm wondering which setting I should use for my own dataset. Thanks for your attention.

File missing

The file datasets/prepare_coco_semantic_annos_from_panoptic_annos.py is missing.

How to fine-tune on my own dataset

Hello,
I would like to add some of my own datasets and fine-tune PSALM on them.
Is it better to fine-tune all parameters or only some of them? What specifically should be done?
Thank you.
