
Comments (10)

zhanggang001 commented on May 26, 2024

@lkeab Thank you for your reply.

About the performance comparison and the training-setting differences between HTC and your transfiner: the 39.7 AP of HTC that you cite was trained with single-scale input for 20 epochs (see Sec. 4.2 and Table 1 of the HTC paper). Indeed, if you train HTC for more epochs under this single-scale setting, the performance drops. As for the training-setting difference you mentioned ("the multi-scale training setting used in HTC"), that setting was used for their winning entry in the COCO 2018 Challenge Object Detection Task (see Sec. 4.5 of the HTC paper), not for the result you cite. In addition, if you train HTC with the same training setting as your transfiner (also the default setting) for more than 20 epochs, the performance will be better than 39.7. However, you obviously did not do it this way.

About the performance of PointRend: with R101-FPN as the backbone and trained with the default detectron2 setting (3x schedule with multi-scale jittering), the AP and AP* of PointRend are 39.8 and 43.5 (see Table 5 of the PointRend paper), not the 39.6 and 41.5 you cite. The same applies to the R50-FPN-3x results.

About the performance of RefineMask: the results you cite for RefineMask with an R101-FPN backbone are 39.4 AP and 42.3 AP*. That model was trained with the default mmdetection setting, i.e. single-scale input and the 2x schedule (see Sec. 3.4, Table 1, and Table 8 of the RefineMask paper). If trained with the detectron2 setting (also the setting of your transfiner), RefineMask with R101-FPN reaches 41.2 AP and 44.6 AP*, which is higher than transfiner (see Table 8 of RefineMask). The authors state these stronger results in the caption of Table 8; I do not think you should ignore them.

I hope you can properly address the problems mentioned above. I cannot control how things develop, as @Emiria96 and I may not be the only two people with these questions about your paper.


Emiria96 commented on May 26, 2024

Obviously I am also not convinced by your response. I will give my response below.

You are still avoiding the question of why you use a Mask R-CNN pretrained model to initialize transfiner; I do not see a single word about this in your response. The "Implementation Details" part and the appendix do not mention this strange initialization at all. If "slower convergence compared to standard CNN models" is your answer, then an experiment you must do is to show that CNN models cannot get a further performance boost when they are also initialized from Mask R-CNN. I have run these experiments for you; the results are at the end of my response, together with the config file, training logs, and model weights so you can verify them.
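
For clarity, this is a minimal sketch of what that initialization swap looks like in detectron2; the config path is the stock PointRend R50-3x config and the weights entry is the standard model-zoo Mask R-CNN checkpoint, used here only as stand-ins for my actual files:

```python
# Sketch: initialize PointRend from COCO Mask R-CNN weights instead of ImageNet weights.
# Assumes the detectron2 PointRend project is importable (detectron2.projects.point_rend).
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.projects.point_rend import add_pointrend_config

cfg = get_cfg()
add_pointrend_config(cfg)
cfg.merge_from_file(
    "projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco.yaml"
)
# Fair (default) setting: the base config points MODEL.WEIGHTS at ImageNet-pretrained
# backbone weights. To mirror transfiner's setting instead, start from a COCO-pretrained
# Mask R-CNN checkpoint from the model zoo:
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
```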

In your response you also avoid the fact that, under the same training scheme, transfiner actually underperforms RefineMask. I cannot even see the word "refinemask" appear in your reply! Table 8 of RefineMask does state their training scheme, which is the same as yours, and there is no reason to ignore the caption and quote only the results trained with fewer epochs. Even though transfiner is trained with Mask R-CNN initialization and RefineMask with ImageNet initialization, RefineMask still performs better:

  • ResNet50:
    • transfiner: AP: 39.4, AP*: 42.3
    • refinemask: AP: 40.2 (+0.8), AP*: 43.1 (+0.8)
  • ResNet101:
    • transfiner: AP: 40.7, AP*: 43.6
    • refinemask: AP: 41.2 (+0.5), AP*: 44.6 (+1.0)

All entries are taken from the papers. Clearly transfiner underperforms RefineMask, a paper published at CVPR 2021. Sadly, you avoid discussing this issue.

Also, I do not like your response "Simply comparing the ms-scale performance without considering these differences across platforms is unfair." Anyone experienced with object detection would agree that with the same training scheme, the same model will get similar results in detectron2 and mmdetection, so your response is not convincing. Or, by your logic, if you trained your model with detectron2, then all methods trained in mmdetection are not comparable and can be omitted from the leaderboard comparison? That is ridiculous.

About the speed. I'm glad the inference speed is improved; now I can get similar results in my environment. But I still need to mention that the inference speed you provide now is still slower even with a lighter backbone (<5 FPS with ResNet-101 vs. the 6.1 FPS with ResNet-101-DCN reported in your paper). Please take this seriously, as one claim of your paper is that your inference speed does not decrease much compared to other methods. Also, the boost in inference speed comes at the cost of final performance; this is clearly shown in your modification, where you sample fewer points and skip the dilation operation. The AP, AP*, and AP_boundary of ResNet-50 actually drop from 38.7, 42.3, 26.0 to 38.6, 41.9, 25.8 with these modifications. If you want to report your performance and inference speed simultaneously, please keep the model consistent when you do the evaluation.
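
For reference, this is roughly how I measure per-image inference time. It is not your benchmark script; it is shown with a stock Mask R-CNN config as a stand-in, and you would swap in the released transfiner or PointRend config and weights (registering any method-specific config keys first):

```python
import glob
import time

import cv2
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Stand-in config/weights: the stock detectron2 Mask R-CNN R101-3x model.
# Replace these with the released transfiner (or PointRend) config and checkpoint
# to reproduce the comparison; method-specific add_*_config calls may be needed.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)

# Time inference over a fixed set of COCO val images, excluding a short warm-up.
paths = sorted(glob.glob("datasets/coco/val2017/*.jpg"))[:200]
images = [cv2.imread(p) for p in paths]
for img in images[:10]:
    predictor(img)
torch.cuda.synchronize()

start = time.perf_counter()
for img in images:
    predictor(img)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{elapsed / len(images):.3f} s/img ({len(images) / elapsed:.2f} FPS)")
```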

About PointRend. You said you obtained 39.6 for your reproduced PointRend model. I agree fluctuation does appear in experiments, but that is not a reason to report wrong AP* and AP_boundary numbers for this 39.6 PointRend model. In no way can a 39.6 PointRend model achieve only 41.4 AP* and 25.3 AP_boundary. The same happens in your ResNet-50 comparison, where your PointRend model achieves only 39.7 AP* and 23.5 AP_boundary, which is also not possible. Considering PointRend is a paper published at CVPR 2020 with more than 200 citations, I do not think this is an acceptable attitude when writing this paper.

About Cityscapes. I did not mention it before, but this is a problem that must be taken seriously. In Table 10, you also initialize your transfiner with COCO-pretrained Mask R-CNN weights, while Mask R-CNN, PointRend, and RefineMask in your table are all initialized from ImageNet weights. Unfair comparison happens in both the COCO evaluation and the Cityscapes evaluation. Honestly, if PointRend is initialized from COCO-pretrained Mask R-CNN weights, it performs better than transfiner. I give the result table below and attach the config file, training log, and pretrained model weights so you can verify my result.

Results comparison

Below are my experimental results for PointRend initialized from Mask R-CNN pretrained weights. The COCO results of transfiner are obtained by running your released model; transfiner-before means results obtained before the code modification, and transfiner-after means after it. The Cityscapes result is taken from your paper.

COCO

| R50-3x-M | AP_val | AP* | AP_B | AP_Box |
| --- | --- | --- | --- | --- |
| pointrend | 38.5 | 41.9 | 25.7 | 41.6 |
| transfiner-before | 38.7 | 42.3 | 26.0 | 41.7 |
| transfiner-after | 38.6 | 41.9 | 25.8 | 41.7 |

| R101-3x-M | AP_val | AP* | AP_B | AP_Box |
| --- | --- | --- | --- | --- |
| pointrend | 40.1 | 43.6 | 27.2 | 43.8 |
| transfiner-before | 40.2 | 43.6 | 27.3 | 43.7 |
| transfiner-after | 40.1 | 43.4 | 27.0 | 43.7 |

AP_B means boundary AP.

Cityscapes:

| R50-1x-M | AP | AP_50 | AP^B | AP^B_50 |
| --- | --- | --- | --- | --- |
| pointrend | 38.6 | 65.3 | 18.1 | 51.1 |
| transfiner | 37.9 | 64.1 | 18.0 | 49.8 |

The config file, training log, and pretrained model weights are provided on Google Drive. You can use the config to train the official PointRend in detectron2 (just download the official detectron2 repo and drop my config into it), and you should get results similar to mine; the training logs and metrics show the training process. You can also directly test the results using my pretrained model.
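
For example, a rough sketch of how the shared config and checkpoint can be used; the file names below are placeholders for the files on the drive, and PointRend's own projects/PointRend/train_net.py does essentially the same thing with a few extras:

```python
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import build_detection_test_loader
from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.projects.point_rend import add_pointrend_config

cfg = get_cfg()
add_pointrend_config(cfg)
cfg.merge_from_file("pointrend_rcnn_r50_3x_maskrcnn_init.yaml")  # placeholder for the shared config

# Train the official PointRend with the shared config
# (wrap in detectron2.engine.launch for multi-GPU training):
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

# Or evaluate the shared checkpoint directly on COCO val:
model = DefaultTrainer.build_model(cfg)
DetectionCheckpointer(model).load("pointrend_rcnn_r50_3x_maskrcnn_init.pth")  # placeholder
evaluator = COCOEvaluator("coco_2017_val", output_dir="./eval_out")
loader = build_detection_test_loader(cfg, "coco_2017_val")
print(inference_on_dataset(model, loader, evaluator))
```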

Conclusion:

Under the same initialization, transfiner only gets results similar to PointRend (CVPR 2020). Even when transfiner is initialized from Mask R-CNN and RefineMask from ImageNet, transfiner still underperforms RefineMask (CVPR 2021).

I don't want our efforts to be in vain; we are restricted to discussing all these problems in a closed issue. If you do not want the situation to get worse, I sincerely suggest the authors seriously consider the right way to treat the transfiner paper and the right way to do research. As @zhanggang001 said, we may not be the only two with these questions about your paper, and I cannot control how things develop.

Good luck with your research career!


zhanggang001 commented on May 26, 2024

> The HTC numerical result 39.7 is from Table 1 of the official HTC paper and is also used by RefineMask: the RefineMask paper also uses 39.7 as the HTC-R101 result in its comparison (Table 8). This is a little contradictory to your claim that "HTC-R101 is even higher than RefineMask"; i.e., why does the RefineMask paper not report that "even higher" HTC result of more than 41.2 in its Table 8? By default, when evaluating results on COCO test-dev, most papers follow the default training & evaluation setting in detectron2. I will further check the training schemes of these different methods, as there is too much to keep track of, and update the arXiv content to make this clearer in the future. It is hard to check/train all these methods one by one, and there are various reported numbers even for the same method across different papers, so sometimes I can only refer to the numbers reported in the original official paper.

  1. In Table 8 of RefineMask, the authors compare both HTC and RefineMask under the same training setting (single-scale training under the 2x schedule: 38.6 vs. 38.2 for R50-FPN, 39.7 vs. 39.4 for R101-FPN, 41.3 vs. 41.0 for X101-FPN). The authors also report the results of RefineMask trained with multi-scale jittering under the 3x schedule, and they state this in the caption of Table 8.
  2. As for your response, "It is hard to check/train all these methods one by one", I don't think this is a good answer to the "unfair comparison" and the numerical errors in Table 9 of your paper.
  3. Before you give a convincing response about the comparisons in Table 9 of your paper, I do not think it is appropriate to change the title of this issue.

One thing to add: the 2x training schedule of HTC usually means 20 epochs, as HTC overfits when trained for 24 epochs with single-scale input. This is consistent with my statement when I first opened this issue.


Emiria96 commented on May 26, 2024

I need to emphasize that the novelty of your paper has never been our focus. What we care about is the unfair comparisons and misreported numbers in your paper. Please do not try to divert the discussion.

Moreover, you are the one who emphasizes AP numbers the most. In your abstract, you said "significantly improving both two-stage and query-based frameworks by a large margin of +3.0 mask AP on COCO and BDD100K, and +6.6 boundary AP on Cityscapes." In your Zhihu post, you said "with R101-FPN and Faster R-CNN, the performance surpasses RefineMask and BCNet by 1.8 Boundary AP and 2.5 Boundary AP" (comparing a model trained with DCN against models trained without DCN? It seems unfair comparison comes naturally to you).

Emphasize again, our focus has always been, and always will be the unfair comparison and misreport shown in your paper.

About the initial-weight problem. So you admit that when you first submitted this paper to CVPR, you did use Mask R-CNN pretrained weights to initialize your model and gained an unfair advantage at that time. Moreover, when I checked your modification, I found you added an additional loss term to boost the performance.
[image]

This contradicts your claim that "I believe the novel idea of a paper is more important than how many percent of improvement the method can achieve". You rely on tricks, not on your novel idea, to improve performance, so you yourself do not trust your novel idea. After removing this additional loss term, I could only get 36.6 mask AP with R50-1x, a 0.4 AP drop.

I also need to emphasize that all numbers reported in the PointRend paper are evaluated on the val set, not the test-dev set. Please read the paper carefully and make a fair comparison. On the val set, your number is 40.3 (vs. 39.8 in the paper and 40.1 from the official repo for PointRend).

About RefineMask. @zhanggang001 has told you what's wrong with your config. I'm curious: if the number from the re-run experiment is consistent with the paper, what new excuse will you find? I can run these experiments for you if you do not want to run them.

Also, you haven't addressed all the problems: the speed issue and the unfair comparison on Cityscapes. I already know what kind of researcher you are from the discussion in this issue, and I have no expectations for your response.

I'm tired; you can never wake a person who pretends to be asleep. The discussion here is fruitless. Good luck with your research career.

Update

It seems you do not want to re-run the RefineMask experiments, so I have done it for you.
Config, model, log, results on the val split, and results submitted to the evaluation server (test-dev).

| Backbone | AP_val | AP_test-dev | AP* |
| --- | --- | --- | --- |
| refinemask-r101-3x | 41.0 | 41.6 | 44.8 |
| numbers from paper | - | 41.2 | 44.6 |

The result I obtained is even higher than that in the official RefineMask paper, so their results are solid and consistent with their paper, unlike yours.


Emiria96 commented on May 26, 2024

Besides the problems pointed out by @zhanggang001, there are some other problems with your transfiner.

  1. You are initializing your model with Mask R-CNN pretrained weights. For a fair comparison, all methods initialize their models from ImageNet. I have never seen any method except yours use this cheating initialization. As you said, "most papers follow the default training & evaluation setting in detectron2", so are you following the default training scheme?
  2. "Sometimes I can only refer to the numbers reported in the original official paper." Then what about the PointRend numbers? In the official PointRend paper, AP and AP* are 38.2 and 41.5 for ResNet-50, and 39.8 and 43.5 for ResNet-101; those are not the numbers shown in your paper. Also, the AP* of ResNet-101 in the official PointRend repo is actually higher than yours (43.8 vs. 43.6).

To be honest, if PointRend and transfiner have the same initialization, transfiner does not perform better than PointRend, a paper published two years ago.

  1. The speed. As pointed out in my previous issue #10, the inference speed of your transfiner is about 1 s/img, which is much slower than the 6.1 FPS reported in your paper. I don't think it is good practice to close an issue when the problem is not solved, and I do not see you taking any steps toward solving it. Until you do, I can only keep suspecting that the number you reported is fake.

Given all the points above, I do not think this is a qualified CVPR paper. I hope you can properly address all our concerns.


lkeab commented on May 26, 2024

Thank you for your questions. I hope that the explanations below address your concerns. If you have any further questions, please let me know.

Regarding the performance comparison: I used the officially reported result of 39.7 from Table 1 of the HTC paper, where the training schedule is 20 epochs; this is the optimal training schedule adopted by HTC. I further trained the HTC model for another 16 epochs using the mmdet config and found that the performance drops slightly to 39.6. This supports the 'saturation phenomenon' for HTC that you mentioned: training more epochs does not always bring better performance, since it also depends on the specific model design and training strategy. It is well known that transformer (multi-head attention) models often need more training iterations due to slower convergence compared to standard CNN models; for example, the DETR paper uses over 300 training epochs. We clearly describe the adopted training schedules and multi-scale setting for our method's COCO leaderboard submission in the last paragraph, "Implementation Details", of Section 3.

Regarding training-setting differences: the multi-scale training settings used by HTC and by us are very different. HTC uses mmdet, where the multi-scale short-edge range is 400 to 1400 with a long edge of 1600, while ours, based on detectron2, only uses the default short-edge range of 640 to 800 with a long edge of 1333. The [400, 1400] scale and long edge of 1600 are also described in Section 4.5 of the HTC paper. The scale-range differences between HTC and ours are obvious; simply comparing multi-scale performance without considering these differences across platforms is unfair. The differences in multi-scale testing, SyncBN, etc. between detectron2 and mmdet also need to be considered.
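
For concreteness, the two settings being contrasted look roughly like this; this is a sketch of only the resize-related fields, and exact values should be checked against the respective configs:

```python
# HTC's multi-scale training resize in mmdetection (short edge sampled in [400, 1400],
# long edge capped at 1600), as used for the HTC multi-scale entries:
htc_ms_resize = dict(
    type='Resize',
    img_scale=[(1600, 400), (1600, 1400)],
    multiscale_mode='range',
    keep_ratio=True,
)

# detectron2's default multi-scale training used by our models
# (short edge sampled from 640-800, long edge capped at 1333):
d2_ms_input = dict(
    MIN_SIZE_TRAIN=(640, 672, 704, 736, 768, 800),
    MAX_SIZE_TRAIN=1333,
)
```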

Regarding the speed comparison: the attached screenshot shows the running speed of our currently released R101 model and code on an RTX GPU: 0.22 s/iter in the screenshot vs. the 0.96 s/iter you obtained. The speed difference is very large even accounting for your different local environment. Also, as mentioned in the issue response, we have not yet published the code for the final optimized version, which will be released soon. I have also attached the evaluation log with the speed recorded during testing. Could you try again with our latest GitHub code? Please also let me know your other environment details (PyTorch version, CPU, etc.), so that I can help you achieve the speed we obtain.

[screenshot: fig_speed]

Regarding PointRend performance: we obtained 39.6 for our reproduced PointRend model, which is negligibly different from the 39.8 reported in the paper; this can be due to randomness in training. Thank you for bringing this issue to our attention; we will update the paper with the latter PointRend result. Note that our approach achieves 40.7, thus outperforming both of these results.


lkeab commented on May 26, 2024

Overall, I believe the novel idea of a paper is more important than how many percentage points of improvement the method achieves. We are the first to explore how to employ a complex transformer structure to output high-quality, high-resolution instance masks. We designed our algorithm based on the experimental observation that incoherent regions are sparse but critical to final performance, and we approach spatial instance segmentation from a sequential pixel-modeling perspective. When measuring the quality of a paper, we should also consider the ideas and insights behind it, not just the AP numbers.

Some further clarifications on the comparisons:

  • Regarding initial weights usage: using our latest released code, transfiner can achieve performance similar to that reported in the paper when trained from scratch with ImageNet weights; there is no need to adopt pretrained Mask R-CNN weights. The table below shows the comparison on COCO test-dev. PointRend results are from their camera-ready paper (Table 1 and Table 5).

| Backbone | Method | mAP (mask) |
| --- | --- | --- |
| Res50-FPN | PointRend (1x, from scratch using ImageNet weights) | 36.3 |
| Res50-FPN | Transfiner (1x, from scratch using ImageNet weights) | 37.0 |
| Res101-FPN | PointRend (3x, from scratch using ImageNet weights) | 39.8 |
| Res101-FPN | Transfiner (3x, from scratch using ImageNet weights) | 40.5 |

  For training the R50 model from ImageNet weights, our config file is here and the trained model is here.

  For training the R101 model from ImageNet weights, our config file is here and the trained model is here.

  • For RefineMask, the official GitHub repo does not provide the pretrained model for the 3x multi-scale training with R101 reported in the paper. Thus, I trained the 3x (36 epochs, multi-scale) model according to the instructions in the GitHub repo: I modified the single-scale 2x schedule config to the multi-scale range (640 to 800), changed the training schedule from 24 epochs to 36 epochs, and followed the GitHub training instructions below, replacing the R50 config file with our R101-3x-ms config. I was unable to reproduce the performance mentioned in the paper, instead obtaining between 38.0 and 39.2 mask AP on COCO val across multiple training runs. Could you please check it? Here are my training log, config, and trained model.

./scripts/dist_train.sh ./configs/refinemask/coco/r50-refinemask-1x.py 8 work_dirs/r50-refinemask-1x

  • For the AP* of PointRend, we recomputed and checked AP* again. The AP number is correct, but we mistakenly used the result file produced by the single-scale-trained PointRend model for AP* when making the separate computation. Thanks for pointing this out; we will correct it following Table 5 of the PointRend paper.


lkeab commented on May 26, 2024

Hi, thanks for following the work. I will look into it and reply further. The compared numbers are from published papers.


lkeab commented on May 26, 2024

The HTC numerical result 39.7 is from Table 1 of the official HTC paper and is also used by RefineMask: the RefineMask paper also uses 39.7 as the HTC-R101 result in its comparison (Table 8). This is a little contradictory to your claim that "HTC-R101 is even higher than RefineMask"; i.e., why does the RefineMask paper not report that "even higher" HTC result of more than 41.2 in its Table 8? By default, when evaluating results on COCO test-dev, most papers follow the default training & evaluation setting in detectron2. I will further check the training schemes of these different methods, as there is too much to keep track of, and update the arXiv content to make this clearer in the future. It is hard to check/train all these methods one by one, and there are various reported numbers even for the same method across different papers, so sometimes I can only refer to the numbers reported in the original official paper.


zhanggang001 commented on May 26, 2024

As for the config you used to reproduce RefineMask, there are at least two problems: 1. The total training batch size should be 16 (i.e. 8x2 or 16x1). If you train models with a batch size of 8x1 in MMDetection, the performance drops considerably, and this is not specific to RefineMask; we have stated this in the repo (the lines with a strikethrough). 2. If you train models with multi-scale jittering for 36 epochs, you should decay the learning rate at the 28th and 34th epochs, following the default setting in MMDetection (instead of the 24th and 33rd epochs in your config). See the config sketch at the end of this comment.

When I first released the code, only a single image per GPU was supported, and one needed 16 GPUs with Slurm to reproduce the results; we stated this in the repo (the lines with a strikethrough). The original config './configs/refinemask/coco/r50-refinemask-1x.py' was set for training with a batch size of 16x1. Some days later, I updated the code to support multiple images per GPU, but I forgot to update those configs to 2 images per GPU, which may have misled you; I'm sorry for that. I have now updated those configs.
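
For reference, here is a sketch of the schedule-related fields implied by the two corrections above, written in mmdetection config style; the exact keys may differ slightly depending on the MMDetection version the repo is based on:

```python
# Total batch size 16: 8 GPUs x 2 images per GPU (or 16 GPUs x 1 image).
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
)

# 3x (36-epoch) multi-scale schedule: decay the learning rate at epochs 28 and 34.
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[28, 34],
)
total_epochs = 36  # newer MMDetection: runner = dict(type='EpochBasedRunner', max_epochs=36)
```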

