
Comments (9)

albertfgu commented on May 26, 2024

One question in particular is why there are 2 provided pathfinder128/ data folders. Did the authors already find this bug, and release an updated version of the dataset in the higher level folder?

However, the top-level pathfinder128/ folder is organized differently from the inner 4. In particular it has far fewer images, so it doesn't seem correct either.

alexmathfb commented on May 26, 2024

Thanks for sharing, currently looking into pathfinder, will keep you updated if we figure out why this happens.

albertfgu commented on May 26, 2024

If it helps, some more observations about performance:

  • The ResNet trains to 80%+ train accuracy after half an epoch or less on Pathfinder-64 and Pathfinder-256. On these datasets, validation performance tracks train very closely.
  • The ResNet does not make much progress on train accuracy for several epochs on Pathfinder-128. It eventually learns something, but validation performance is always random guessing.
  • I tried a simple variant where I took the Pathfinder-256 dataset and did mean pooling on 2x2 patches, reducing it to the same resolution as Pathfinder-128. The ResNet is able to recover its behavior on this version (i.e., it trains fast and validation tracks train).

These observations seem to indicate that Pathfinder-128 is processed differently in a way that slows learning and prevents generalization entirely. One guess I had was that the labels were random; however, I manually looked at several images/labels in the data files and they seemed correct. I also can't see any difference in the image files between this dataset and the others.

Splend1d commented on May 26, 2024

@albertfgu
Thank you for your experiments using ResNet on PathFinder-128; I learned a lot from them. Although I might be late to the party, I think the main reason for this might just be that PathFinder-128 is significantly harder. Judging from the pictures, PathFinder-256 has sparse space between lines, and its padding is also generous. I found a critical argument in the PathFinder generator, args.num_distractor_snakes in data/pathfinder.py, that highlights the difference. This argument has a different value for each of the map sizes:
PathFinder32 -- 20/(14/3) = 4.28
PathFinder64 -- 22/(14/3) = 4.71
PathFinder128 -- 35/(14/3) = 7.50
PathFinder256 -- 30/(14/3) = 6.43
Therefore PathFinder128 has more distractor snakes than PathFinder256 while having less space (the size of the lines is not scaled).
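
(For reference, a tiny sketch reproducing the ratios quoted above; the per-resolution values and the 14/3 normalizer are taken from this comment, not read out of the generator itself.)

```python
# Reproduce the num_distractor_snakes ratios quoted above.
num_distractor_snakes = {32: 20, 64: 22, 128: 35, 256: 30}
for res, n in num_distractor_snakes.items():
    print(f"PathFinder{res}: {n}/(14/3) = {n / (14 / 3):.2f}")
```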

albertfgu commented on May 26, 2024

@Splend1d thanks for pointing this out. I went through and looked at pictures of examples from Pathfinder64, Pathfinder128, and Pathfinder256 and I agree that Pathfinder128 is much harder than 256 (visually - ignoring challenges of scale).

However, I am still not sure if the data is correct given the ResNet gap of 98% train to 50% test. The Pathfinder128 data is harder but not completely different from the other versions, and I don't know how to explain this lack of generalization.

With that said, assuming the data is correct, the dataset can still be argued to be "buggy" for several reasons.

  • First, the args.num_distractor_snakes argument seems to be misconfigured relative to what was intended. For resolutions 32 / 64 / 128 / 256, the argument is 20 / 22 / 35 / 30. I'm guessing it was meant to be 20 / 22 / 25 / 30, which makes more sense.
  • Additionally (more seriously), the actual argument used to generate the Pathfinder128 data does not match even this number. I counted the number of distractor snakes in Pathfinder64 and Pathfinder256 and got roughly 5 per image and 7 per image respectively, which matches your calculation above. However, for Pathfinder128, I got 14-16 snakes per image. This seems to imply that the distractor snake argument was actually 70, not 35 or 25, for Pathfinder128 (a rough counting sketch follows this list).
  • Problems with data generation aside, I would ultimately argue that a sequential image classification dataset that a 2D ResNet cannot solve does not seem like a reasonable dataset.
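
For anyone who wants to reproduce the count automatically, a minimal sketch is below. It assumes the images are grayscale PNGs with bright strokes on a dark background; the function name, threshold, dilation radius, and example path are my own illustrative choices, and the raw count includes the marker circles and target path, not just distractors.

```python
import numpy as np
from PIL import Image
from scipy import ndimage

def estimate_num_paths(png_path, thresh=32, dilate_iters=3):
    """Rough heuristic: binarize, dilate so the dashes of each snake merge
    into a single blob, then count connected components."""
    img = np.array(Image.open(png_path).convert("L"))
    strokes = img > thresh                                   # bright strokes on a dark background
    merged = ndimage.binary_dilation(strokes, iterations=dilate_iters)
    _, num_blobs = ndimage.label(merged)
    return num_blobs                                         # includes marker circles / target path

# e.g. estimate_num_paths("pathfinder128/.../sample_0.png")  # path is illustrative
```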

albertfgu commented on May 26, 2024

Aside from the issues with the args.num_distractor_snakes argument for Pathfinder128, there seems to be another data generation oversight for Pathfinder256, as @Splend1d pointed out: the margins are bigger than in the other variants. I'm guessing this stems from the args.padding=1 flag, which is passed to Pathfinder32/64/128 but not to Pathfinder256.

albertfgu commented on May 26, 2024

To get around these issues, in my experiments I took the Pathfinder256 data and applied mean pooling over 2x2 squares to reduce it to resolution 128. I originally thought this was more or less equivalent to Pathfinder128. More importantly, I felt it was still in the spirit of the task, and I checked that the original Transformer variants do not make progress on it.
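
For concreteness, a minimal sketch of that pooling step, assuming the Pathfinder256 images are already loaded into an (N, 256, 256) float array (the function and variable names are mine):

```python
import numpy as np

def mean_pool_2x2(images):
    """Average each non-overlapping 2x2 patch: (N, 256, 256) -> (N, 128, 128)."""
    n, h, w = images.shape
    return images.reshape(n, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

# e.g. downsampled = mean_pool_2x2(pathfinder256_images.astype(np.float32))
```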

In light of the above data generation issues found in Pathfinder128 and Pathfinder256, I feel less comfortable with this argument. I think this issue seriously needs the authors @vanzytay @MostafaDehghani to step in. We're at a point where people are using the LRA benchmark more extensively and where models are beginning to be able to handle Path-X, so a discussion about this dataset is necessary.

MostafaDehghani commented on May 26, 2024

Thanks for opening the issue and the discussion.
First of all, we are aware of the difficulty of PathFinder-128 and we know that it's much more challenging than PathFinder-256. This is the reason that we use it as Path-X (instead of using PathFinder-256) in LRA.

We generated many variants of PathFinder and ran a lot of experiments with different model classes besides Transformers (including ResNet), and decided to include two of the setups we had for PathFinder in LRA: one that is not hard to make progress on, and one on which almost all of our models struggle to generalize. We had a lot of internal discussion and decided to add Path-X as an official LRA task to motivate a jump beyond the usual paradigms we were seeing in ideas for making transformers more efficient. I would also like to say that I totally disagree with @albertfgu on:

[...] ultimately, I would argue that a sequential image classification dataset that a 2D ResNet cannot solve does not seem like a reasonable dataset.

As a matter of fact, PathFinder only becomes interesting when a 2D CNN-based model fails on it, simply because CNNs struggle to model transitivity and lack a direct global receptive field, which are probably key abilities for solving PathFinder. We wanted to see new models with inductive biases that help them pick up a solution in such a setup. So the config for generating PathFinder128 was designed in a way that a ResNet fails.

In the end, I want to add that, although Path-X is extremely difficult, it seems there are new methods that are able to find a generalizable solution for it. Such a development is really exciting for us to see, and given that we know how hard this task is, we are impressed by any progress on it.

albertfgu commented on May 26, 2024

Thanks for the response! The clarification around some of the design decisions is very helpful. This still leaves me with several questions:

  1. The fact that the number of snakes is 20/22/35/30 instead of 20/22/25/30 still seems odd. Also, the 35 number still doesn't match the actual number of snakes in Pathfinder128. Could you confirm that the actual number of snakes (which seems to be 70) was intentional?
  2. Overall, it seems that you purposely made Pathfinder128 much harder than Pathfinder256. Could you clarify why you made this design choice instead of the more straightforward one of having Pathfinder32/64/128/256 increase in difficulty and choosing Pathfinder256 as Path-X?
  3. Above you said that the reason for including this particular dataset is to test generalization, implying that the baseline models do achieve above-random train accuracy but random test accuracy.

I would like to clarify whether this is the case: did any of your xformer variants achieve above random train accuracy? I thought that the answer was "no" based on my own experiments, as well as indicators in the paper:

  • Table 1 says "all models do not learn anything on Path-X... this shows that increasing the sequence length can cause serious difficulties for model training".
  • The paper discusses the train-test gap on CIFAR-10 at length, but does not mention this phenomenon at all for Path-X.

Together, I assumed this indicated that all xformer variants were unable to learn anything during training; is this incorrect?

If it is the case that xformer baselines do learn on the train split but not on test, then I feel that adding this discussion about generalization to the LRA paper would substantially clarify the design choice for future researchers and resolve the confusion raised in this thread.

If it is the case that xformer baselines do not learn on the train split, then I admit I don't quite understand why the benchmark needs such a large jump in difficulty. Towards understanding long-range dependencies, in my opinion the first question should be whether or not methods can model anything at all on sequences of length 16k (or 64k), and then the follow-up question is the one of generalization and inductive bias.

Towards this goal, a reasonable first step would be including a simpler Path-128 task of the same length, where all xformer baselines still fail to learn during training, but ResNets do solve it. Then a harder version can be included where ResNets train but do not generalize.

Ultimately, if it is true that the current version of Path-X was chosen to be so hard that even 2D ResNets cannot solve it, I think that's worth highlighting in the paper. The current language simply poses it as a longer sequence task:

This is an interesting litmus test to see if the same algorithmic challenges bear a different extent of difficulty when sequence lengths are much longer.

Given this language, it is reasonable to expect that this is the exact same task as the other PathFinders, just with longer sequences; not that it is a drastically harder version that even ResNets can't solve, which conflates generalization challenges with the stated algorithmic challenges.

Thanks again for continuing the discussion.

