fudan-zvg / semantic-segment-anything

Automated dense category annotation engine that serves as the initial semantic labeling for the Segment Anything dataset (SA-1B).

License: Apache License 2.0

Python 100.00%

semantic-segment-anything's Introduction

SSA Icon

Semantic Segment Anything
Jiaqi Chen, Zeyu Yang, and Li Zhang
Zhang Vision Group, Fudan University

SAM is a powerful model for arbitrary object segmentation, and SA-1B is the largest segmentation dataset to date. However, SAM lacks the ability to predict a semantic category for each mask. (I) To address this limitation, we propose a pipeline built on top of SAM that predicts a semantic category for each mask, called Semantic Segment Anything (SSA). (II) Moreover, SSA can serve as an automated dense open-vocabulary annotation engine, called the Semantic Segment Anything labeling engine (SSA-engine), providing rich semantic category annotations for SA-1B or any other dataset. This engine significantly reduces the need for manual annotation and the associated costs.

Web demo and API

  • Try the Web Demo and API here: Replicate

🤔 Why do we need the SSA project?

  • SAM is a highly generalizable object segmentation algorithm that can provide precise masks. SA-1B is the largest image segmentation dataset to date, providing fine mask segmentation annotations. However, neither SAM nor SA-1B provides category predictions or annotations for each mask. This makes it difficult for researchers to use the powerful SAM algorithm to directly solve semantic segmentation tasks or to utilize SA-1B to train their own models.
  • Advanced close-set segmenters like Segformer, Oneformer, open-set segmenters like CLIPSeg, and image caption methods like BLIP can provide rich semantic annotations. However, their mask segmentation predictions may not be as comprehensive and accurate as those generated by SAM, which has highly precise and detailed boundaries.
  • Therefore, by combining the fine image segmentation masks from SAM and SA-1B with the rich semantic annotations provided by these advanced models, we can generate semantic segmentation models with stronger generalization ability, as well as a large-scale densely categorized image segmentation dataset.

๐Ÿ‘ What SSA project can do?

  • SSA: This is the first open framework that utilizes SAM for the semantic segmentation task. It lets users seamlessly integrate their existing semantic segmenters with SAM, without retraining or fine-tuning SAM's weights, to achieve better generalization and more precise mask boundaries.
  • SSA-engine: SSA-engine provides dense open-vocabulary category annotations for the large-scale SA-1B dataset. After manual review and refinement, these annotations can be used to train segmentation models or fine-grained CLIP models.

โœˆ๏ธ SSA: Semantic segment anything

Before the introduction of SAM, most semantic segmentation application scenarios already had their own models. These models can provide rough category classifications for regions, but their predictions are blurry and imprecise at the edges, lacking accurate masks. To address this issue, we propose an open framework called SSA that leverages SAM to enhance the performance of existing models. Specifically, the original semantic segmentation model provides category predictions while the powerful SAM provides masks.

If you have already trained a semantic segmentation model on your dataset, you don't need to retrain a new SAM-based model for more accurate segmentation. Instead, you can continue to use the existing model as the Semantic branch. SAM's strong generalization and image segmentation abilities can improve the performance of the original model. It is worth noting that SSA is suited to scenarios where the mask boundaries predicted by the original segmentor are not highly accurate. If the original model's segmentation is already very accurate, SSA may not provide a significant improvement.

SSA consists of two branches, the Mask branch and the Semantic branch, as well as a voting module that determines the category for each mask.

  • (I) Mask branch (blue). SAM serves as the Mask branch and provides a set of masks with clear boundaries.

  • (II) Semantic branch (purple). This branch provides the category for each pixel. It is implemented by a semantic segmentor that users can customize in terms of the segmentor's architecture and the categories of interest. The segmentor does not need to produce highly detailed boundaries, but it should classify each region as accurately as possible.

  • (III) Semantic Voting module (red). This module crops out the corresponding pixel categories based on the mask's position. The top-1 category among these pixels is taken as the classification result for that mask.
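
For reference, here is a minimal sketch of the voting step, assuming the Semantic branch has already produced a per-pixel class-ID map and the Mask branch has produced boolean masks. The array names and the vote_mask_category helper are illustrative, not the repository's actual API.

import numpy as np

def vote_mask_category(class_id_map, mask):
    """Top-1 (most frequent) class ID among the pixels covered by `mask`.

    class_id_map: (H, W) integer array from the Semantic branch.
    mask:         (H, W) boolean array from the Mask branch (SAM).
    """
    pixel_classes = class_id_map[mask]        # crop the per-pixel categories inside the mask
    counts = np.bincount(pixel_classes)       # histogram over class IDs
    return int(np.argmax(counts))             # the majority class becomes the mask's category

# Usage: one voted category per SAM mask
# mask_categories = [vote_mask_category(class_id_map, m) for m in sam_masks]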

🚄 SSA-engine: Semantic segment anything labeling engine

SSA-engine is an automated annotation engine that provides the initial semantic labeling for the SA-1B dataset, although human review and refinement may be required for more accurate labels. Thanks to the combined architecture of close-set and open-vocabulary segmentation, SSA-engine produces satisfactory labels for most samples and can provide more detailed annotations using an image caption method.

This tool fills the gap in SA-1B's limited fine-grained semantic labeling, while also significantly reducing the need for manual annotation and associated costs. It has the potential to serve as a foundation for training large-scale visual perception models and more fine-grained CLIP models.

The SSA-engine consists of three components:

  • (I) Close-set semantic segmentor (green). Two close-set semantic segmentation models trained on COCO and ADE20K datasets respectively are used to segment the image and obtain rough category information. The predicted categories only include simple and basic categories to ensure that each mask receives a relevant label.
  • (II) Open-vocabulary classifier (blue). An image captioning model is utilized to describe the cropped image patch corresponding to each mask. Nouns or phrases are then extracted as candidate open-vocabulary categories. This process provides more diverse category labels.
  • (III) Final decision module (orange). The SSA-engine uses a Class proposal filter (i.e., CLIP) to select the top-k most reasonable predictions from the mixed class list. Finally, the Open-vocabulary Segmentor predicts the most suitable category within the mask region based on the top-k classes and the image patch.
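
As a rough sketch of components (II) and (III), the snippet below captions a cropped patch with BLIP, extracts noun phrases with spaCy as open-vocabulary candidates, and lets CLIP keep the top-k proposals from the mixed candidate list. The Hugging Face checkpoint names are public releases, and the function only illustrates the idea; it is not the repository's exact implementation (which additionally runs an open-vocabulary segmentor such as CLIPSeg over the top-k classes).

import spacy
import torch
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

nlp = spacy.load("en_core_web_sm")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def top_k_class_proposals(patch, close_set_classes, k=3):
    """patch: a PIL image cropped around one mask; close_set_classes: labels from the close-set segmentors."""
    # (II) caption the cropped patch and extract noun phrases as open-vocabulary candidates
    caption_ids = blip.generate(**blip_proc(images=patch, return_tensors="pt"))
    caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)
    candidates = sorted({c.text.lower() for c in nlp(caption).noun_chunks} | set(close_set_classes))

    # (III) class proposal filter: CLIP scores every candidate against the patch, keep the top-k
    inputs = clip_proc(text=candidates, images=patch, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image[0]
    return [candidates[i] for i in scores.topk(min(k, len(candidates))).indices.tolist()]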

📖 News

🔥 2023/04/14: SSA benchmarks semantic segmentation on ADE20K and Cityscapes.
🔥 2023/04/10: Semantic Segment Anything (SSA and SSA-engine) is released.
🔥 2023/04/05: SAM and SA-1B are released.

Results

All results were tested on a single NVIDIA A6000 GPU.

1. Inference time

| Dataset | Model | Inference time per image (s) | Inference time per mask (s) |
| --- | --- | --- | --- |
| SA-1B | SSA (Close set) | 1.149 | 0.012 |
| SA-1B | SSA-engine (Open-vocabulary) | 33.333 | 0.334 |

2. Memory usage

SSA (with SAM)

| Dataset | Model | GPU memory (MB) |
| --- | --- | --- |
| ADE20K | SSA | 8798 |
| Cityscapes | SSA | 19012 |

SSA-engine

| Dataset | Model | GPU memory without SAM (MB) | GPU memory with SAM (MB) |
| --- | --- | --- | --- |
| SA-1B | SSA-engine-small | 11914 | 28024 |
| SA-1B | SSA-engine-base | 14466 | 30576 |

3. Close-set semantic segmentation on ADE20K and Cityscapes dataset

For convenience, we used different versions of Segformer from Hugging Face, with varying parameter counts and accuracy levels (B0, B2, and B5), to simulate Semantic branches with less accurate masks. The results show that when the accuracy of the original Semantic branch is NOT very high, SSA can improve mIoU.

ADE20K

| Model | Semantic branch | mIoU of Semantic branch | mIoU of SSA |
| --- | --- | --- | --- |
| SSA | Segformer-B0 | 31.78 | 33.60 |
| SSA | Segformer-B2 | 41.38 | 42.92 |
| SSA | Segformer-B5 | 45.92 | 47.14 |

Cityscapes

| Model | Semantic branch | mIoU of Semantic branch | mIoU of SSA |
| --- | --- | --- | --- |
| SSA | Segformer-B0 | 52.52 | 55.14 |
| SSA | Segformer-B2 | 59.76 | 62.25 |
| SSA | Segformer-B5 | 71.67 | 72.99 |

Note that all Segformer checkpoints and the data pipeline are sourced from the Hugging Face releases by NVIDIA, which show lower mIoU than the checkpoints in the official repository.
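
For reference, below is a minimal sketch of using one of these Hugging Face Segformer checkpoints as the Semantic branch, i.e. producing the per-pixel class-ID map that the voting module consumes. The checkpoint name is NVIDIA's public ADE20K B0 release; the wrapper function is illustrative rather than the repository's code.

import torch
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512").eval()

def semantic_branch(image):
    """Return an (H, W) class-ID map predicted by Segformer for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                      # (1, num_classes, H/4, W/4)
    logits = torch.nn.functional.interpolate(                # upsample back to the input resolution
        logits, size=image.size[::-1], mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)[0].cpu().numpy()             # per-pixel category IDs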

4. Cross-domain segmentation on Foggy Driving

We also evaluate the performance of SSA on the Foggy Driving dataset, with OneFormer as the Semantic branch. The weights and data pipeline of OneFormer are sourced from Hugging Face.

| Model | Training dataset | Validation dataset | mIoU |
| --- | --- | --- | --- |
| SSA | Cityscapes | Foggy Driving | 55.61 |

Examples

Open-vocabulary prediction on SA-1B

  • Additional examples of open-vocabulary annotations

Close-set semantic segmentation on Cityscapes

Close-set semantic segmentation on ADE20K

Cross-domain segmentation on Foggy Driving

💻 Requirements

  • Python 3.7+
  • CUDA 11.1+

๐Ÿ› ๏ธ Installation

git clone git@github.com:fudan-zvg/Semantic-Segment-Anything.git
cd Semantic-Segment-Anything
conda env create -f environment.yaml
conda activate ssa
python -m spacy download en_core_web_sm
# install segment-anything
cd ..
git clone git@github.com:facebookresearch/segment-anything.git
cd segment-anything; pip install -e .; cd ../Semantic-Segment-Anything

🚀 Quick Start

1. SSA

1.1 Preparation

Download the ADE20K or Cityscapes dataset and unzip it to the data folder.

Folder structure:

├── Semantic-Segment-Anything
├── data
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   │   │   ├── ADE_val_00002000.jpg
│   │   │   │   │   ├── ...
│   │   │   │   ├── test
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   │   │   ├── ADE_val_00002000.png
│   │   │   │   │   ├── ...
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   │   │   ├── frankfurt
│   │   │   │   ├── lindau
│   │   │   │   ├── munster
│   │   │   │   │   ├── munster_000173_000019_leftImg8bit.png
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   │   │   │   ├── frankfurt
│   │   │   │   ├── lindau
│   │   │   │   ├── munster
│   │   │   │   │   ├── munster_000173_000019_gtFine_labelTrainIds.png
│   │   ├── ...

Download the SAM checkpoint and put it in the ckp folder.

mkdir ckp && cd ckp
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
cd ..

1.2 SSA inference

Run our SSA on ADE20K with 8 GPUs:

python scripts/main_ssa.py --ckpt_path ./ckp/sam_vit_h_4b8939.pth --save_img --world_size 8 --dataset ade20k --data_dir data/ade20k/ADEChallengeData2016/images/validation/ --gt_path data/ade20k/ADEChallengeData2016/annotations/validation/ --out_dir output_ade20k

Run our SSA on Cityscapes with 8 GPUs:

python scripts/main_ssa.py --ckpt_path ./ckp/sam_vit_h_4b8939.pth --save_img --world_size 8 --dataset cityscapes --data_dir data/cityscapes/leftImg8bit/val/ --gt_path data/cityscapes/gtFine/val/ --out_dir output_cityscapes

Run our SSA on Foggy Driving with 8 GPUs:

python scripts/main_ssa.py --data_dir data/Foggy_Driving/leftImg8bit/test/ --ckpt_path ckp/sam_vit_h_4b8939.pth --out_dir output_foggy_driving --save_img --world_size 8 --dataset foggy_driving --eval --gt_path data/Foggy_Driving/gtFine/test/ --model oneformer

1.3 SSA evaluation (after inference)

Get the evaluation results for ADE20K:

python scripts/evaluation.py --gt_path data/ade20k/ADEChallengeData2016/annotations/validation --result_path output_ade20k/ --dataset ade20k

Get the evaluation results for Cityscapes:

python scripts/evaluation.py --gt_path data/cityscapes/gtFine/val/ --result_path output_cityscapes/ --dataset cityscapes

Get the evaluation results for Foggy Driving:

# if you haven't downloaded the Foggy Driving dataset, you can run the following command to download it.
wget -P data https://data.vision.ee.ethz.ch/csakarid/shared/SFSU_synthetic/Downloads/Foggy_Driving.zip && unzip data/Foggy_Driving.zip -d data/

python scripts/evaluation.py --gt_path data/Foggy_Driving/gtFine/test/ --result_path output_foggy_driving/ --dataset foggy_driving

2. SSA-engine

Automatic annotation for your own dataset

Organize your dataset as follows:

├── Semantic-Segment-Anything
├── data
│   ├── <The name of your dataset>
│   │   ├── img_name_1.jpg
│   │   ├── img_name_2.jpg
│   │   ├── ...

Run our SSA-engine-base with 8 GPUs (the GPU memory needed depends on the size of the input images):

python scripts/main_ssa_engine.py --data_dir=data/<The name of your dataset> --out_dir=output --world_size=8 --save_img --sam --ckpt_path=ckp/sam_vit_h_4b8939.pth

If you want to run the SSA-engine-small, you can use the following command (add the --light_mode flag):

python scripts/main_ssa_engine.py --data_dir=data/<The name of your dataset> --out_dir=output --world_size=8 --save_img --sam --ckpt_path=ckp/sam_vit_h_4b8939.pth --light_mode

Automatic annotation for SA-1B

Download the SA-1B dataset and unzip it to the data/sa_1b folder.
Alternatively, use your own dataset.

Folder structure:

├── Semantic-Segment-Anything
├── data
│   ├── sa_1b
│   │   ├── sa_223775.jpg
│   │   ├── sa_223775.json
│   │   ├── ...

Run our SSA-engine-base with 8 GPUs:

python scripts/main_ssa_engine.py --data_dir=data/sa_1b --out_dir=output --world_size=8 --save_img

Run the SSA-engine-small with 8 GPUs (add the --light_mode flag):

python scripts/main_ssa_engine.py --data_dir=data/sa_1b --out_dir=output --world_size=8 --save_img --light_mode

For each mask, we add two new fields (e.g. 'class_name': 'face' and 'class_proposals': ['face', 'person', 'sun glasses']). The class name is the most likely category for the mask, and the class proposals are the top-k most likely categories from the Class proposal filter. k is set to 3 by default.

{
    'bbox': [81, 21, 434, 666],
    'area': 128047,
    'segmentation': {
        'size': [1500, 2250],
        'counts': 'kYg38l[18oeN8mY14aeN5\\Z1>'
    }, 
    'predicted_iou': 0.9704002737998962,
    'point_coords': [[474.71875, 597.3125]],
    'crop_box': [0, 0, 1381, 1006],
    'id': 1229599471,
    'stability_score': 0.9598413705825806,
    'class_name': 'face',
    'class_proposals': ['face', 'person', 'sun glasses']
}
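
For reference, a small sketch of consuming these output files. It assumes one JSON file per image holding a list of such mask records (the file name and the 'building' filter below are only examples), and it uses pycocotools to decode the compressed RLE in the 'segmentation' field.

import json
from pycocotools import mask as mask_utils

with open("output/sa_223775_semantic.json") as f:          # hypothetical output file name
    records = json.load(f)
if isinstance(records, dict):                              # some outputs wrap the list in an 'annotations' key
    records = records.get("annotations", records)

building_masks = []
for ann in records:
    if ann.get("class_name") == "building":
        rle = dict(ann["segmentation"])
        if isinstance(rle["counts"], str):
            rle["counts"] = rle["counts"].encode("utf-8")  # pycocotools expects bytes for compressed RLE
        building_masks.append(mask_utils.decode(rle).astype(bool))  # (H, W) boolean mask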

📈 Future work

We hope that researchers in the community will come up with new improvements and ideas building on SSA. Some of our ideas are as follows:

  • (I) The masks in SA-1B often come at three levels: whole, part, and subpart, and SSA-engine often cannot provide accurate descriptions for very small part or subpart regions; instead, it falls back to broad categories. For example, SSA-engine may predict "person" for body parts like the neck or a hand. Therefore, an architecture for more detailed semantic prediction is needed.
  • (II) SSA and SSA-engine are ensembles of multiple models, which makes their inference slower than that of end-to-end models. We look forward to more efficient designs in the future.
  • (III) For semantic segmentation models with poor boundary segmentation, SSA can utilize SAM and the semantic voting mechanism to provide more accurate masks. However, for models that already have excellent segmentation performance, SSA cannot bring about a significant improvement. On the other hand, if the original segmentation model is too poor and misses many semantic categories, SSA cannot help it recall those categories either. Exploring better ways to utilize SAM is worth further investigation.

😄 Acknowledgement

📜 Citation

If you find this work useful for your research, please cite our GitHub repo:

@misc{chen2023semantic,
    title = {Semantic Segment Anything},
    author = {Chen, Jiaqi and Yang, Zeyu and Zhang, Li},
    howpublished = {\url{https://github.com/fudan-zvg/Semantic-Segment-Anything}},
    year = {2023}
}

semantic-segment-anything's People

Contributors

avivsham, chenxwh, jiaqi-chen-00, lzrobots


semantic-segment-anything's Issues

ๆ˜พๅญ˜ๅ ็”จ

May I ask how much GPU memory your card had during inference? I ran out of memory when running inference on a 4090 with 24 GB. At minimum, how many GPUs, and with how much memory each, are needed?

About the visualization.

Thanks for this great work and the open-sourced repo. I want to know how to visualize the results after inference, like the examples shown at the end of the repo.

Can't find model 'en_core_web_sm'

python scripts/main.py --data=image_20230408/ --out_dir output --world_size=1 --save_img
Traceback (most recent call last):
File "scripts/main.py", line 4, in
from pipeline import semantic_annotation_pipeline
File "/data2/queenie_2023/Semantic-Segment-Anything/scripts/pipeline.py", line 13, in
from blip import open_vocabulary_classification_blip
File "/data2/queenie_2023/Semantic-Segment-Anything/scripts/blip.py", line 3, in
from utils import get_noun_phrases
File "/data2/queenie_2023/Semantic-Segment-Anything/scripts/utils.py", line 2, in
nlp = spacy.load('en_core_web_sm')
File "/data2/queenie/anaconda3/envs/mmTrans/lib/python3.7/site-packages/spacy/init.py", line 60, in load
config=config,
File "/data2/queenie/anaconda3/envs/mmTrans/lib/python3.7/site-packages/spacy/util.py", line 449, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

Get each segment from the output of semantic segmentation

Thank you for sharing a wonderful code!

I want to get the segments of a specific class from the result.

Segmentation results have multiple classes, but I only want to see the segmentation results for a specific class. However, the resulting data (json) does not provide information on the coordinates of the segments or the index for each class, making it difficult to access the desired information.

I tried extracting the unique colors of the segments, but the number of unique colors and the number of segments are different, so it failed.

For example, I just want the masks whose pixels have class_name 'building'.

What can be done?
Thanks

Mask/Label quality

It seems there are too many masks and labels for some simple images; is it possible to use dense CRF to improve the mask/label quality?

ๆŽจ็†่€—ๆ—ถ้žๅธธๆ…ข

I use main_ssa.py for single-image inference on a single A100 GPU with 40 GB of memory, and I find the Semantic Voting module takes a long time, almost 1 s per mask, which differs from the numbers in the repo. Is this normal?

่ฟ่กŒไธ€ไผšๅŽๆŠฅ้”™

  File "/media/admin1/envs/anaconda3/envs/leng_lip/lib/python3.7/site-packages/transformers/models/clipseg/modeling_clipseg.py", line 1458, in forward
    conditional_pixel_values=conditional_pixel_values,
  File "/media/admin1/envs/anaconda3/envs/leng_lip/lib/python3.7/site-packages/transformers/models/clipseg/modeling_clipseg.py", line 1360, in get_conditional_embeddings
    raise ValueError("Make sure to pass as many prompt texts as there are query images")
ValueError: Make sure to pass as many prompt texts as there are query images

่ƒฝๆˆๅŠŸ100ๅคšๅผ ๅ›พ็‰‡๏ผŒ็„ถๅŽๅฐฑไผšๅ‡บ็Žฐ่ฟ™ๆ ท็š„ๆŠฅ้”™ๅœๆญขใ€‚
ไฝฟ็”จๅ‘ฝไปคๅฆ‚ไธ‹

python scripts/main_ssa_engine.py --data_dir=data/UCM_Captions --out_dir=output --world_size=4 --save_img --sam --ckpt_path=../../mydata/sam_vit_h_4b8939.pth --light_mode

pycocotools ImportError undefined symbol: __intel_sse2_strchr

Thank you very much for your contribution.

I have installed all the requirements and tried to run it on my own data using the following command in miniconda3 environment on a Linux HPC:
python scripts/main_ssa_engine.py --data_dir=data/ --out_dir=output --world_size=8 --save_img --sam --ckpt_path=ckp/sam_vit_h_4b8939.pth

However I keep getting the following error at import pycocotools.mask as maskUtils
import pycocotools._mask as _mask
ImportError: ssa/lib/python3.8/site-packages/pycocotools/_mask.cpython-38-x86_64-linux-gnu.so: undefined symbol: __intel_sse2_strchr

Can I ask if this is a dependency issue and is there any way to solve it?

Thank you in advance.

messy category generation data

For the messy generated category names, such as "the word '50' in white letters", "three blue plastic rabbits", "three blue plastic snowflakes", "some very pretty blue and black items", and "1 ultra blue", are there any other methods to further clean them?

fine tune the model

Hello,
Is fine-tuning on a custom dataset with custom labels possible? Thank you

ModuleNotFoundError: No module named 'cog'

Hi, thanks for your work.
I just built the environment as you suggested, as follows:

git clone git@github.com:fudan-zvg/Semantic-Segment-Anything.git
cd Semantic-Segment-Anything
conda env create -f environment.yaml
conda activate ssa
python -m spacy download en_core_web_sm
# install segment-anything
cd ..
git clone git@github.com:facebookresearch/segment-anything.git
cd segment-anything; pip install -e .; cd ../Semantic-Segment-Anything

Then I try to run python predict.py and it shows the following error:

Traceback (most recent call last):
  File "predict.py", line 19, in <module>
    from cog import BasePredictor, Input, Path, BaseModel
ModuleNotFoundError: No module named 'cog'

Can you give me some suggestions on how to solve this?

Simple Python code available?

Hello!
I love this project and the impressive results! But I would like to use it simply, with a few lines of Python code. Is there anything available? I tried to filter out the relevant parts of your project scripts but failed.
Simply giving the path to a single image and in return receiving an array or a list with labels, coordinates, etc. Is that possible?

Best regards
Marc

Apply SSA on a single Image

Hi, I was wondering whether it is possible to take a single image as input, apply the Segment Anything Model from Meta, and then use this tool to get the actual labels for the predicted masks.

Don't get any output after inference

Hi All,
Thank you for your amazing work and repo!
I'm trying to run inference with the open-vocabulary model on some random images.
I followed the installation instructions and completed them without any errors, then I tried to run inference as explained in the README file (see the attached photo). The inference completed without errors, just warnings, but the output path provided when calling main.py was empty.
What am I doing wrong?
image

Cheers,

Can I use my own dataset

Hello, thank you for your work !

I want to use SSA on custom categories and datasets. I saw you mentioned that users can customize the segmentor's architecture and the categories of interest. But can I use my own datasets?
I've made some attempts, but I don't understand what "semantic_branch_processor" is. I tried to use one of them directly, but an error is reported: ValueError: You have to specify the task_input. Found None. I guess it has something to do with the fact that I didn't set "semantic_branch_processor" correctly.
I want to know how to set up "semantic_branch_processor" if I want to use my own dataset.

Thank you very much for any reply.

Appendix
command:
python scripts/main_ssa.py --ckpt_path ./ckp/sam_vit_h_4b8939.pth --save_img --world_size 1 --dataset VOC2012 --data_dir /media/guo/DATA/chen/lraspp/data/VOCdevkit/VOC2012/JPEGImages --gt_path /media/guo/DATA/chen/lraspp/data/VOCdevkit/VOC2012/Annotations --out_dir output_VOC2012
Complete error report:
Traceback (most recent call last):
File "scripts/main_ssa_try.py", line 269, in
main(0, args)
File "scripts/main_ssa_try.py", line 248, in main
semantic_segment_anything_inference(file_name, args.out_dir, rank, img=img, save_img=args.save_img,
File "/media/guo/DATA/chen/SSA/Semantic-Segment-Anything/scripts/pipeline.py", line 168, in semantic_segment_anything_inference
class_ids = segformer_func(img, semantic_branch_processor, semantic_branch_model, rank)
File "/media/guo/DATA/chen/SSA/Semantic-Segment-Anything/scripts/segformer.py", line 5, in segformer_segmentation
inputs = processor(images=image, return_tensors="pt").to(rank)
File "/home/guo/anaconda3/envs/ssa/lib/python3.8/site-packages/transformers/models/oneformer/processing_oneformer.py", line 112, in call
raise ValueError("You have to specify the task_input. Found None.")
ValueError: You have to specify the task_input. Found None.

torch and transformers version mismatch error

Hi, when I tried to run SSA inference, I met the following error:

Traceback (most recent call last):
File "scripts/main_ssa.py", line 122, in
main(0, args)
File "scripts/main_ssa.py", line 66, in main
from transformers import SegformerFeatureExtractor, SegformerForSemanticSegmentation
File "", line 1039, in _handle_fromlist
File "/home/rxu37/.local/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1117, in getattr
value = getattr(module, name)
File "/home/rxu37/.local/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1116, in getattr
module = self._get_module(self._class_to_module[name])
File "/home/rxu37/.local/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1128, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.segformer.modeling_segformer because of the following error (look up to see its traceback):
No module named 'torch.distributed.algorithms.join'

I installed my environment using the exact provided environment.yaml, so my torch version is 1.9.1+cu111 and my transformers version is 4.27.1. I looked up the PyTorch docs and found that only PyTorch 2.0 has the module torch.distributed.algorithms.join. Thus I upgraded PyTorch to 2.0.1+cu118, but now I get this error:

Traceback (most recent call last):
File "scripts/main_ssa.py", line 6, in
from pipeline import semantic_segment_anything_inference, eval_pipeline, img_load
File "/mnt/d/GitHub/Semantic-Segment-Anything/scripts/pipeline.py", line 8, in
from mmdet.core.visualization.image import imshow_det_bboxes
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/site-packages/mmdet/core/init.py", line 3, in
from .bbox import * # noqa: F401, F403
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/site-packages/mmdet/core/bbox/init.py", line 8, in
from .samplers import (BaseSampler, CombinedSampler,
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/site-packages/mmdet/core/bbox/samplers/init.py", line 12, in
from .score_hlr_sampler import ScoreHLRSampler
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/site-packages/mmdet/core/bbox/samplers/score_hlr_sampler.py", line 3, in
from mmcv.ops import nms_match
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/site-packages/mmcv/ops/init.py", line 2, in
from .assign_score_withk import assign_score_withk
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/site-packages/mmcv/ops/assign_score_withk.py", line 5, in
ext_module = ext_loader.load_ext(
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/site-packages/mmcv/utils/ext_loader.py", line 13, in load_ext
ext = importlib.import_module('mmcv.' + name)
File "/home/rxu37/anaconda3/envs/ssa/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

I have tried to rebuild the mmcv module after I upgrade torch, but it does not seem to work. Would greatly appreciate any pointer/help on this issue. Thank you!

Release model weights

Hi

Really appreciate this work. Do you plan on releasing model weights/checkpoints for SSA? Would really appreciate it.

Thanks

Is it possible to define our own labels?

Thanks for sharing this great work!

Can we define our own labels or apply transfer learning to this project? I did not figure out how to run this project if our dataset is not the SA-1B dataset.
E.g. I want to apply it to an indoor scene where the labels are mainly furniture. Would simply adding config files under configs work?

Appreciate your help!

Inference demo

Great work. I'm wondering why not just provide a simple inference demo for a single image?

missing stable_two_stage_multi_segmenter_clip_seg.py file in scripts

I cloned the repo and ran this command:
python scripts/stable_two_stage_multi_segmenter_clip_seg.py --data_dir=data/examples --out_dir=output --world_size=8 --save_img
I got:
python3: can't open file '/content/Semantic-Segment-Anything/scripts/stable_two_stage_multi_segmenter_clip_seg.py': [Errno 2] No such file or directory

cannot import print_log from mmcv.utils on Google Colab

Traceback (most recent call last):
File "/content/Semantic-Segment-Anything/scripts/main_ssa_engine.py", line 5, in
from pipeline import semantic_annotation_pipeline
File "/content/Semantic-Segment-Anything/scripts/pipeline.py", line 8, in
from mmcv.utils import print_log
ImportError: cannot import name 'print_log' from 'mmcv.utils' (/usr/local/lib/python3.10/dist-packages/mmcv/utils/init.py)

Is it possible to be used in Google Colab?

Hi! I am new to CV and want to use this model. I am wondering if it is possible to run this in Google Colab. Specifically, assume that I have a bunch of images in a folder "images"; how could I import the library and use it? Thank you and I look forward to your reply :)

Suggestion - Integrate MobileSAM into the pipeline for lightweight and faster inference

Reference: https://github.com/ChaoningZhang/MobileSAM

Our project performs on par with the original SAM and keeps exactly the same pipeline as the original SAM, except for a change to the image encoder; therefore, it is easy to integrate into any project.

MobileSAM is around 60 times smaller and around 50 times faster than the original SAM, and it is around 7 times smaller and around 5 times faster than the concurrent FastSAM. The comparison of the whole pipeline is summarized as follows:

image

image

Repository license needed

This work looks quite valuable! I would like to explore SSA, but I noticed there is no license.

Would you please add a license that helps clarify the terms under which this work can or cannot be used, edited, extended, and/or redistributed?

About env

Great job!!!
How about creating an environment without Conda, for example using Python from virtualenv? What libraries are needed?

Apple Silicon Support

The Requirements section lists CUDA 11.1+ as a requirement. Is it possible to run Semantic-Segment-Anything on Apple Silicon hardware (M1, M2)?

Thank you

I must say bravo and thank you for doing exactly what I would like to start doing now.

Change Semantic Branch

Can I switch the Semantic branch from Segformer to other semantic segmentation models?

Human class

Can we get a mask showing only the human body, not all the other elements in the image/video?

Bug in `scripts/pipeline.py`

In scripts/pipeline.py, there is some code like this:

patch_small = mmcv.imcrop(img, np.array(
    [ann['bbox'][0], ann['bbox'][1], ann['bbox'][0] + ann['bbox'][2], ann['bbox'][1] + ann['bbox'][3]]),
    scale=scale_small)
patch_large = mmcv.imcrop(img, np.array(
    [ann['bbox'][0], ann['bbox'][1], ann['bbox'][0] + ann['bbox'][2], ann['bbox'][1] + ann['bbox'][3]]),
    scale=scale_large)
patch_huge = mmcv.imcrop(img, np.array(
    [ann['bbox'][0], ann['bbox'][1], ann['bbox'][0] + ann['bbox'][2], ann['bbox'][1] + ann['bbox'][3]]),
    scale=scale_large)

Shouldn't patch_huge use scale=scale_huge?

TypeError: list indices must be integers or slices, not str

Thank you very much for creating this project & publishing it so soon after the segment-anything release!

I got this error when running SSA inference on a directory of images:

python scripts/main_ssa_engine.py --data_dir=data/examples --out_dir=data/output --save_img --world_size 1

Before running this command, I first ran segment-anything on my directory of images, in "coco_rle" output mode to generate the JSONs:

python scripts/amg.py --checkpoint models/sam_vit_h_4b8939.pth --model-type default --input ../Semantic-Segment-Anything/data/examples --output ../Semantic-Segment-Anything/data/examples/ --convert-to-rle

main_ssa_engine.py threw this error:

torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/laurens/anaconda3/envs/ssa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/laurens/git/OSS/Semantic-Segment-Anything/scripts/main_ssa_engine.py", line 46, in main
    semantic_annotation_pipeline(file_name, args.data_dir, args.out_dir, rank, save_img=args.save_img,
  File "/home/laurens/git/OSS/Semantic-Segment-Anything/scripts/pipeline.py", line 69, in semantic_annotation_pipeline
    for ann in anns['annotations']:
TypeError: list indices must be integers or slices, not str

I figured this error is because of this part: https://github.com/fudan-zvg/Semantic-Segment-Anything/blob/main/scripts/pipeline.py#L61-L69

If the pipeline has a mask_generator, the segmentations are put in a dict with an "annotations" key - not sure why.

In my case, I'm not using the mask_generator since the JSONs are already generated, so when I remove the "annotations" key lookup in the for loop on line 69, it works.

ๅฏไปฅ่‡ช่กŒ่ฐƒๆ•ดๆ ‡็ญพไนˆ

ๅ›พ็‰‡ๅค„็†ๅŽๅ‘็Žฐๅพˆๅคšmaskๅ‡บ็Žฐ้‡ๅ ็Žฐ่ฑก๏ผŒ้ข็งฏๅŠ ๅ’Œ่ถ…่ถŠไบ†ๅŽŸๆœฌๅ›พ็‰‡็š„ๅฐบๅฏธๅคงๅฐ๏ผŒ่€Œไธ”ๆˆ‘ไนŸไธ้œ€่ฆๅˆ†็ฑปๆ ‡็ญพ่ฟ™ไนˆไธฐๅฏŒ๏ผŒๆ‰€ไปฅ่ฏท้—ฎๅฏไปฅๆŒ‰็…ง่‡ชๅทฑ็š„ๆ„ๆ„ฟๆƒณๆณ•ไฟฎๆ”น่ƒฝ่ขซๆ ‡ๆณจ็š„ๆ ‡็ญพ็ง็ฑปๅ—

Run without GPU

Hi,
thanks for the amazing code! Is there any chance to get it running without a CUDA-capable GPU? Maybe on the CPU (as for SAM)?
Thanks!

out of memory๏ผ๏ผ๏ผ

cuda:11.1 torch:1.10 single A6000
Command: python scripts/main_ssa_engine.py --data_dir=/mnt/usb/gxy/dataset_path/images --out_dir=output --world_size=1 --save_img --sam --checkpoint-path=checkpoint-path/sam_vit_h_4b8939.pth
Error: RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 47.54 GiB total capacity; 537.07 MiB already allocated; 12.06 MiB free; 580.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
help!!!help!!!

OneFormerProcessor

When loading the downloaded model, OneFormer errors out with: huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name':

oneformer_ade20k_processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_large")
oneformer_ade20k_model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_large").to(rank)

evaluation results available?

It looks like, with SSA, it becomes possible to compare the performance of SAM against SOTA on popular benchmark datasets. Would you report the validation results of SSA for semantic segmentation or instance segmentation on ADE20K or COCO?
