
Comments (10)

CM-BF commented on June 15, 2024

Hi Zhimeng,

Thank you for your question! There are three reasons as follows.

  1. The OOD generalization problem has not been theoretically solved by DIR, i.e., the lack of guarantees leads to relatively random results with large variances.
  2. The GOOD-Motif dataset is designed as a sanity check that exaggerates the OOD problem under structural shifts.
  3. Leaderboard 1.1.0 is built on the latest datasets with larger hyperparameter spaces and more runs for hyperparameter sweeping, which leads to new and more statistically significant results. However, it cannot guarantee better results; e.g., you can notice that DIR's performances on the basis-covariate split differ (39.99 on the leaderboard vs. 61.50 in the paper), which also reflects my first point.

Best,
Shurui Gui


CM-BF commented on June 15, 2024

Hi,

> Hello, thank you for creating GOOD, which has been incredibly helpful. I have a similar question. Is the strong performance of DIR on the leaderboard attributed to your tuning it across a broader range of hyperparameters?

The tuning process is automatic, without my interference. The broader range is part of the reason, but it is not the most important factor. The key problem is that the DIR strategy cannot guarantee successful subgraph discovery, making its results on this sanity check unspecified, i.e., it has high hyperparameter sensitivity in this scenario. If one runs the hyperparameter sweep, one may notice that the gap between its best and second-best results can be huge.
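
As an illustration, here is a minimal sketch (not GOOD's actual sweep code) of how that sensitivity can be quantified: score each configuration over several seeds and compare the best configuration against the runner-up. `train_and_evaluate` and the `causal_ratio` grid are hypothetical stand-ins.

```python
# Minimal sketch: a large best-vs-second-best gap signals high
# hyperparameter sensitivity. All names and values are hypothetical.
import random
import statistics

def train_and_evaluate(cfg, seed):
    """Hypothetical stand-in for a full training run; returns test accuracy."""
    rng = random.Random(hash((seed, cfg["causal_ratio"])))
    # High variance mimics an unstable method such as DIR on GOOD-Motif.
    return rng.gauss(50.0 + 10.0 * cfg["causal_ratio"], 15.0)

configs = [{"causal_ratio": r} for r in (0.25, 0.5, 0.75)]  # hypothetical grid
seeds = range(3)

ranked = sorted(
    ((statistics.mean(train_and_evaluate(c, s) for s in seeds), c) for c in configs),
    key=lambda t: t[0],
    reverse=True,
)
(best, _), (second, _) = ranked[0], ranked[1]
print(f"best-vs-second-best gap: {best - second:.2f} accuracy points")
```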

Best,
Shurui Gui


CM-BF commented on June 15, 2024

Hi Zhimeng,

> The discrepancy in performance, specifically for DIR, is primarily due to its unstable performance across runs.

Yes, partially. It is not just across runs, but also across different hyperparameters (high sensitivity).

> You have run DIR many times. The results presented in Table 13 are based on an earlier version, while the leaderboard displays the most recent outcomes.

Yes. The leaderboard results are the latest results. We haven't updated the paper to reflect them.

> There have been no modifications or updates to the datasets between these two sets of results.
> Could you please confirm my understanding of these points?

Yes. Both GOOD-Motif datasets are the same.

Best,
Shurui


CM-BF commented on June 15, 2024

Hi,

> Hi, thank you! Do you have any insights into why DIR is not stable compared to other methods on the leaderboard?

Thank you for your question! Since you are interested in this insight, I'd like to redirect you to our work LECI; specifically, you may find Figure 4 and Table 8 useful. In brief, training a subgraph discovery network adds one more degree of freedom (structure disentanglement), so without guarantees, the generalization results are unspecified.
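
To make that extra degree of freedom concrete, here is a minimal sketch assuming a DIR-style selector (illustrative only, not DIR's actual code): an edge scorer picks a "causal" subgraph before the classifier ever sees the graph, and nothing in the selection step itself guarantees the kept edges are the truly causal ones.

```python
# Minimal sketch of a learned subgraph selector: the top-k edge choice is
# the added degree of freedom (structure disentanglement) discussed above.
import torch
import torch.nn as nn

class EdgeSelector(nn.Module):
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 1)  # scores an edge from its endpoint features
        self.keep_ratio = keep_ratio

    def forward(self, x, edge_index):
        src, dst = edge_index                      # edge_index: (2, num_edges)
        scores = self.scorer(torch.cat([x[src], x[dst]], dim=-1)).squeeze(-1)
        k = max(1, int(self.keep_ratio * scores.numel()))
        keep = scores.topk(k).indices              # unsupervised: nothing ties this to the causal part
        return edge_index[:, keep]                 # the "discovered" subgraph

x = torch.randn(6, 8)                       # 6 nodes, 8-dim features
edge_index = torch.randint(0, 6, (2, 10))   # 10 random directed edges
print(EdgeSelector(dim=8)(x, edge_index))
```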

In addition, it is critical to note that these synthetic datasets are sanity checks that exaggerate the OOD problems. You can test your initial theory and implementation on them: if your theory is right, you should obtain much higher results. The easiest way to validate this is test-domain validation, as shown in Table 10 of LECI. Generally, without appropriate theoretical guarantees, a method cannot pass the sanity check even with test-domain validation, as we have observed.
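
For clarity, a minimal sketch of the two model-selection protocols, assuming per-epoch metrics are logged as dicts (the numbers are made up): standard OOD validation selects the checkpoint by validation-domain accuracy, while oracle test-domain validation selects by test-domain accuracy.

```python
# Minimal sketch: checkpoint selection under the two protocols.
history = [  # hypothetical training log
    {"epoch": 0, "val": 0.61, "test": 0.40},
    {"epoch": 1, "val": 0.70, "test": 0.38},
    {"epoch": 2, "val": 0.66, "test": 0.55},
]

ood_choice = max(history, key=lambda h: h["val"])      # standard protocol
oracle_choice = max(history, key=lambda h: h["test"])  # upper-bound protocol
print(ood_choice["test"], oracle_choice["test"])       # 0.38 vs. 0.55
```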

Best,
Shurui


CM-BF commented on June 15, 2024

Hi Zhimeng,

You are most welcome!

> What was the rationale behind setting different spurious ratios and then combining the three sets? Why not employ a single spurious ratio for the entire training set?

The original purpose is to simulate a real-world scenario in which one can collect data from several environments. Although these environments contain data distributions with similar biases, the degrees of the biases differ. This information contributes to judging whether a strong correlation is spurious or not, under the assumption that the data-collection noise from different environments has the same intensity.

> I noticed that val_spurious_ratio is set to 0.3, as opposed to 0. Was this choice made to emulate a more realistic scenario?

This design also simulates real-world scenarios, in which it is more practical to collect data similar to the test domain than data whose distribution is identical to the test domain. The validation set is a bridge between the training and test sets. Inspired by DomainBed, where oracle-domain validation can produce better results, we modify this principle by making the validation set more practical to obtain.
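
A minimal sketch of this environment design follows. The training ratios below are illustrative placeholders, not the values used by get_basis_concept_shift_list; only val_spurious_ratio = 0.3 comes from the discussion above. Each training environment couples the motif label with a base-graph type at a different strength, and the validation environment weakens the coupling to bridge toward the test distribution.

```python
# Minimal sketch of GOOD-Motif-style environment construction.
# Training ratios are hypothetical; only the 0.3 is from the config above.
import random

def sample_environment(spurious_ratio, n, rng, num_classes=3):
    """Pair each motif label with a base-graph type; with probability
    spurious_ratio the base type matches the label (the spurious cue)."""
    data = []
    for _ in range(n):
        label = rng.randrange(num_classes)
        base = label if rng.random() < spurious_ratio else rng.randrange(num_classes)
        data.append((label, base))
    return data

rng = random.Random(0)
# Same bias direction at different strengths: comparing environments lets a
# learner infer that the label-base correlation is spurious.
train_envs = [sample_environment(r, 300, rng) for r in (0.9, 0.7, 0.5)]
val_set = sample_environment(0.3, 300, rng)  # val_spurious_ratio = 0.3
```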

Please let me know if you have any questions. 😄

Best,
Shurui


AGTSAAA commented on June 15, 2024

Hello, thank you for creating GOOD, which has been incredibly helpful. I have a similar question. Is the strong performance of DIR on the leaderboard attributed to your tuning it across a broader range of hyperparameters?


AGTSAAA commented on June 15, 2024

Hi, thank you! Do you have any insights into why DIR is not stable compared to other methods on the leaderboard?


TimeLovercc commented on June 15, 2024

> Leaderboard 1.1.0 is built on the latest datasets with larger hyperparameter spaces and more runs for hyperparameter sweeping, which leads to new and more statistically significant results. However, it cannot guarantee better results; e.g., you can notice that DIR's performances on the basis-covariate split differ (39.99 on the leaderboard vs. 61.50 in the paper), which also reflects my first point.

Hi Shurui,

Thank you for shedding light on the differences in the leaderboard results and the paper. My current understanding is:

  1. The discrepancy in performance, specifically for DIR, is primarily due to its unstable performance across runs.
  2. You have run DIR many times. The results presented in Table 13 are based on an earlier version, while the leaderboard displays the most recent outcomes.
  3. There have been no modifications or updates to the datasets between these two sets of results.

Could you please confirm my understanding of these points?

Thank you!
Zhimeng


TimeLovercc commented on June 15, 2024

Hi Shurui,

I really appreciate your timely reply.

Thank you for providing clarity on my previous queries. I have a few more questions, particularly related to the design choices of the GOOD-Motif dataset. In the get_basis_concept_shift_list function:

  1. What was the rationale behind setting different spurious ratios and then combining the three sets? Why not employ a single spurious ratio for the entire training set?
  2. I noticed that val_spurious_ratio is set to 0.3, as opposed to 0. Was this choice made to emulate a more realistic scenario?

Best,
Zhimeng


TimeLovercc commented on June 15, 2024

Thank you for your timely and patient response. It's quite helpful!

Best,
Zhimeng

