
Comments (10)

CM-BF commented on June 15, 2024

Hi Zhimeng,

Thank you for your question! There are three reasons as follows.

  1. The OOD generalization problem has not been theoretically solved by DIR, i.e., the lack of guarantees leads to relatively random results with large variances.
  2. The GOOD-Motif dataset is designed as a sanity check that exaggerates the OOD problem under structural shifts.
  3. Leaderboard 1.1.0 is built on the latest datasets with larger hyperparameter spaces and more runs for hyperparameter sweeping, which leads to new and more statistically significant results. However, it cannot guarantee better results; e.g., you can notice that DIR's performances on the basis-covariate split differ (39.99 on the leaderboard vs. 61.50 in the paper), which also reflects my first point.

Best,
Shurui Gui


CM-BF commented on June 15, 2024

Hi,

> Hello, thank you for creating GOOD, which has been incredibly helpful. I have a similar question. Is the strong performance of DIR on the leaderboard attributed to your tuning it across a broader range of hyperparameters?

The tuning process is automatic, without my interference. The broader range is part of the reason, but it is not the most important factor. The key problem is that the DIR strategy cannot guarantee successful subgraph discovery, making its results on this sanity check unspecified, i.e., it has high hyperparameter sensitivity in this scenario. If one runs the hyperparameter sweep, one may notice that the gap between its best and second-best results can be huge.
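
As an illustration, here is a minimal sketch (not GOOD's actual sweep code) of how that sensitivity can be quantified: score each configuration over several seeds and compare the best configuration against the runner-up. `train_and_evaluate` and the `causal_ratio` grid are hypothetical stand-ins.

```python
# Minimal sketch: a large best-vs-second-best gap signals high
# hyperparameter sensitivity. All names and values are hypothetical.
import random
import statistics

def train_and_evaluate(cfg, seed):
    """Hypothetical stand-in for a full training run; returns test accuracy."""
    rng = random.Random(hash((seed, cfg["causal_ratio"])))
    # High variance mimics an unstable method such as DIR on GOOD-Motif.
    return rng.gauss(50.0 + 10.0 * cfg["causal_ratio"], 15.0)

configs = [{"causal_ratio": r} for r in (0.25, 0.5, 0.75)]  # hypothetical grid
seeds = range(3)

ranked = sorted(
    ((statistics.mean(train_and_evaluate(c, s) for s in seeds), c) for c in configs),
    key=lambda t: t[0],
    reverse=True,
)
(best, _), (second, _) = ranked[0], ranked[1]
print(f"best-vs-second-best gap: {best - second:.2f} accuracy points")
```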

Best,
Shurui Gui


CM-BF commented on June 15, 2024

Hi Zhimeng,

> The discrepancy in performance, specifically for DIR, is primarily due to its unstable performance across runs.

Yes, partially. It is not just across runs, but also across different hyperparameters (high sensitivity).

> You have run DIR many times. The results presented in Table 13 are based on an earlier version, while the leaderboard displays the most recent outcomes.

Yes. The leaderboard results are the latest results. We haven't updated the paper to reflect them.

> There have been no modifications or updates to the datasets between these two sets of results.
> Could you please confirm my understanding of these points?

Yes. Both GOOD-Motif datasets are the same.

Best,
Shurui


CM-BF commented on June 15, 2024

Hi,

> Hi, thank you! Do you have any insights into why DIR is not stable compared to other methods on the leaderboard?

Thank you for your question! Since you are interested in this insight, I'd like to redirect you to our work LECI; specifically, you may find Figure 4 and Table 8 useful. In brief, training a subgraph discovery network adds one more degree of freedom (structure disentanglement), so without guarantees, the generalization results are unspecified.
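
To make that extra degree of freedom concrete, here is a minimal sketch assuming a DIR-style selector (illustrative only, not DIR's actual code): an edge scorer picks a "causal" subgraph before the classifier ever sees the graph, and nothing in the selection step itself guarantees the kept edges are the truly causal ones.

```python
# Minimal sketch of a learned subgraph selector: the top-k edge choice is
# the added degree of freedom (structure disentanglement) discussed above.
import torch
import torch.nn as nn

class EdgeSelector(nn.Module):
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 1)  # scores an edge from its endpoint features
        self.keep_ratio = keep_ratio

    def forward(self, x, edge_index):
        src, dst = edge_index                      # edge_index: (2, num_edges)
        scores = self.scorer(torch.cat([x[src], x[dst]], dim=-1)).squeeze(-1)
        k = max(1, int(self.keep_ratio * scores.numel()))
        keep = scores.topk(k).indices              # unsupervised: nothing ties this to the causal part
        return edge_index[:, keep]                 # the "discovered" subgraph

x = torch.randn(6, 8)                       # 6 nodes, 8-dim features
edge_index = torch.randint(0, 6, (2, 10))   # 10 random directed edges
print(EdgeSelector(dim=8)(x, edge_index))
```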

In addition, it is critical to note that these synthetic datasets are sanity checks that exaggerate the OOD problems. You can test your initial theory and implementation on them: if your theory is right, you should obtain much higher results. The easiest way to validate this is test-domain validation, as shown in Table 10 of LECI. Generally, without appropriate theoretical guarantees, a method cannot pass the sanity check even with test-domain validation, as we have observed.
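
For clarity, a minimal sketch of the two model-selection protocols, assuming per-epoch metrics are logged as dicts (the numbers are made up): standard OOD validation selects the checkpoint by validation-domain accuracy, while oracle test-domain validation selects by test-domain accuracy.

```python
# Minimal sketch: checkpoint selection under the two protocols.
history = [  # hypothetical training log
    {"epoch": 0, "val": 0.61, "test": 0.40},
    {"epoch": 1, "val": 0.70, "test": 0.38},
    {"epoch": 2, "val": 0.66, "test": 0.55},
]

ood_choice = max(history, key=lambda h: h["val"])      # standard protocol
oracle_choice = max(history, key=lambda h: h["test"])  # upper-bound protocol
print(ood_choice["test"], oracle_choice["test"])       # 0.38 vs. 0.55
```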

Best,
Shurui


CM-BF commented on June 15, 2024

Hi Zhimeng,

You are most welcome!

> What was the rationale behind setting different spurious ratios and then combining the three sets? Why not employ a single spurious ratio for the entire training set?

The original purpose is to simulate a real-world scenario in which one can collect data from several environments. Although these environments contain data distributions with similar biases, the degrees of the biases differ. This information contributes to judging whether a strong correlation is spurious or not, under the assumption that the data-collection noise from different environments has the same intensity.

> I noticed that val_spurious_ratio is set to 0.3, as opposed to 0. Was this choice made to emulate a more realistic scenario?

This design also simulates real-world scenarios, in which it is more practical to collect data similar to the test domain than data whose distribution is identical to the test domain. The validation set is a bridge between the training and test sets. Inspired by DomainBed, where oracle-domain validation can produce better results, we modify this principle by making the validation set more practical to obtain.
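
A minimal sketch of this environment design follows. The training ratios below are illustrative placeholders, not the values used by get_basis_concept_shift_list; only val_spurious_ratio = 0.3 comes from the discussion above. Each training environment couples the motif label with a base-graph type at a different strength, and the validation environment weakens the coupling to bridge toward the test distribution.

```python
# Minimal sketch of GOOD-Motif-style environment construction.
# Training ratios are hypothetical; only the 0.3 is from the config above.
import random

def sample_environment(spurious_ratio, n, rng, num_classes=3):
    """Pair each motif label with a base-graph type; with probability
    spurious_ratio the base type matches the label (the spurious cue)."""
    data = []
    for _ in range(n):
        label = rng.randrange(num_classes)
        base = label if rng.random() < spurious_ratio else rng.randrange(num_classes)
        data.append((label, base))
    return data

rng = random.Random(0)
# Same bias direction at different strengths: comparing environments lets a
# learner infer that the label-base correlation is spurious.
train_envs = [sample_environment(r, 300, rng) for r in (0.9, 0.7, 0.5)]
val_set = sample_environment(0.3, 300, rng)  # val_spurious_ratio = 0.3
```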

Please let me know if you have any questions. 😄

Best,
Shurui


AGTSAAA commented on June 15, 2024

Hello, thank you for creating GOOD, which has been incredibly helpful. I have a similar question. Is the strong performance of DIR on the leaderboard attributed to your tuning it across a broader range of hyperparameters?


AGTSAAA commented on June 15, 2024

Hi, thank you! Do you have any insights into why DIR is not stable compared to other methods on the leaderboard?


TimeLovercc commented on June 15, 2024

> Leaderboard 1.1.0 is built on the latest datasets with larger hyperparameter spaces and more runs for hyperparameter sweeping, which leads to new and more statistically significant results. However, it cannot guarantee better results; e.g., you can notice that DIR's performances on the basis-covariate split differ (39.99 on the leaderboard vs. 61.50 in the paper), which also reflects my first point.

Hi Shurui,

Thank you for shedding light on the differences in the leaderboard results and the paper. My current understanding is:

  1. The discrepancy in performance, specifically for DIR, is primarily due to its unstable performance across runs.
  2. You have run DIR many times. The results presented in Table 13 are based on an earlier version, while the leaderboard displays the most recent outcomes.
  3. There have been no modifications or updates to the datasets between these two sets of results.

Could you please confirm my understanding of these points?

Thank you!
Zhimeng


TimeLovercc commented on June 15, 2024

Hi Shurui,

I really appreciate your timely reply.

Thank you for providing clarity on my previous queries. I have a few more questions, particularly related to the design choices of the GOOD-Motif dataset. In the get_basis_concept_shift_list function:

  1. What was the rationale behind setting different spurious ratios and then combining the three sets? Why not employ a single spurious ratio for the entire training set?
  2. I noticed that val_spurious_ratio is set to 0.3, as opposed to 0. Was this choice made to emulate a more realistic scenario?

Best,
Zhimeng


TimeLovercc commented on June 15, 2024

Thank you for your timely and patient response. It's quite helpful!

Best,
Zhimeng

