
Directions forward (behavioralthor) · 6 comments · open

ardila commented on September 9, 2024
Directions forward


Comments (6)

ardila commented on September 9, 2024

MCC confusion matrices

HMO confusion matrix
[screenshot: HMO confusion matrix, 2013-09-30]

Pixel confusion matrix
[screenshot: pixel confusion matrix, 2013-09-30]


yamins81 commented on September 9, 2024

Comments:

  1. For both of the options above, we could also replace the
    "average-L3-hard" score with the "HMO-0" score, correct? By HMO-0, I mean
    the current HMO model as extracted so far. I haven't yet completely
    thought through what I think is best here. Or we could do V1-hard or
    HMAX-hard, right? I am kind of leaning toward HMO0-hard at the moment.
    What are your thoughts?

  2. The method you described might generally be called "worst margin", i.e.
    you pick images as the ones with the worst margin on a classifier. I
    think this should be amended in two ways:
    a) First, we should make sure that any margins are averaged over a set
    of splits, so that the "bad images" are truly those that have stably
    bad margins, regardless of the specific distractors (a minimal sketch
    of this is given just after point 3 below).
    b) We should include a set of additional distractors that are randomly
    chosen with respect to margins. The reason is that I have often had the
    impression that "hard images" (or hard objects) are hard because of the
    "easier" distractors that are also in a given set. In other words, the
    presence of those "easier distractors" is what exposes the difficulty of
    a given "hard" image or object. If we remove all the easy ones, then it
    might suddenly look easy to solve the hard images, because they get
    "moved into place" on top of where the easy ones used to be. Then, once
    we try to combine the solution back in complementarily, it won't work.
    So we'll want to keep at least some easy distractors around that are
    uniformly distributed in image space with regard to margins on the test
    algorithms (and classes).

  3. I assume you think we should draw the images from which to choose
    this set from the pixel-hard synsets, as opposed to 250K random images.
    Is that why you're saying we'll start extracting the "PixelHardSynsets"
    set tomorrow? How many hard synsets are you thinking of? Or will that
    be set by N1, to fix the size of the total set?
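
To make 2(a) concrete, here is a minimal sketch of a split-averaged margin
computation, assuming precomputed features X and binary labels y for a single
2-way task; the function name and the use of scikit-learn's LinearSVC are
illustrative choices, not anything fixed in the repo:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedShuffleSplit

def split_averaged_margins(X, y, n_splits=10, seed=0):
    """Average each image's signed margin over several random train/test
    splits, so "hard" images are those with stably bad margins rather
    than artifacts of one particular split."""
    margins = np.zeros(len(y))
    counts = np.zeros(len(y))
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.5,
                                      random_state=seed)
    for train_idx, test_idx in splitter.split(X, y):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        # signed distance to the decision boundary: positive = correct side
        d = clf.decision_function(X[test_idx])
        sign = np.where(y[test_idx] == clf.classes_[1], 1.0, -1.0)
        margins[test_idx] += d * sign
        counts[test_idx] += 1
    # only average over the splits in which each image landed in the test set
    return margins / np.maximum(counts, 1)
```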

--> On a separate note, what we're doing here is basically stacking a
hierarchical series of increasingly stringent tests to winnow down the
set: starting with pixels, and using that to cut the set down a lot;
then cutting it down further with HMO0- or HMAX- or whatever we decide
on in point 1) above. We then run THAT through either the HMO procedure
directly, or THEN through humans to re-weight it.

  1. From the plot you made for the HMO0 model, I don't agree that we're
    seeing saturation in the performance as a function of training
    examples. In fact, it looks to me like a slow (approximately
    logarithmic) increase, much like in the case of HvM (see the sketch
    after this list). I expect that performance will keep increasing slowly
    with the number of examples. But I think N1 = 400 is fine, probably,
    since we don't need to push out to saturation; we just need a
    representative sample.

  2. Does this plan relate clearly to the psychophysics plan you came up with
    a couple of months ago? Can you spell that out a little more explicitly
    now, again?
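
On point 1, one quick way to adjudicate "saturation vs. slow logarithmic
growth" is to fit both functional forms to the training curve and compare
residuals. A sketch, with made-up placeholder numbers standing in for points
read off the actual plot:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder (n_train, accuracy) points; read the real ones off the plot.
n = np.array([25.0, 50, 100, 200, 300, 400])
acc = np.array([0.42, 0.48, 0.55, 0.60, 0.63, 0.65])

def log_curve(n, a, b):   # slow logarithmic growth, no saturation
    return a + b * np.log(n)

def sat_curve(n, c, k):   # saturating exponential
    return c * (1.0 - np.exp(-n / k))

p_log, _ = curve_fit(log_curve, n, acc)
p_sat, _ = curve_fit(sat_curve, n, acc, p0=[0.7, 100.0])
rss = lambda f, p: float(np.sum((acc - f(n, *p)) ** 2))
print("log RSS:", rss(log_curve, p_log), " sat RSS:", rss(sat_curve, p_sat))
```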

On Mon, Sep 30, 2013 at 5:47 PM, Diego Ardila [email protected]:

@yamins81 https://github.com/yamins81
There are 2 main goals

  1. A screening set that is representative of the difficulty in the
    1000-way categorization task, for creating a challenge submission

  2. A screening set that is representative of the difficulty that humans
    are good at, across all of imagenet, for getting better neural fits

re 1)
We should use random L3 models (5 sets of features, one from each random
model) and find a set of images that is hard to separate, on average, for
the model class. This would mean extracting #N1 images from each synset,
then getting margins for all 2-ways for each image. Then we could just
take the mean of each image's negative margins as its score, and take the
#N2 lowest-scoring images (a sketch of this scoring rule follows).
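
A sketch of that scoring rule, assuming the pairwise margins (already
averaged over the 5 random L3 feature sets and over splits) are collected in
a single array; the names and array layout here are my assumptions:

```python
import numpy as np

def hardness_scores(margins):
    """margins: float array of shape (n_images, n_pairwise_tasks), each
    image's margin in every 2-way task it appears in. The score is the
    mean of the image's negative margins; images with no negative
    margins score 0 (easy)."""
    neg_sum = np.minimum(margins, 0.0).sum(axis=1)   # sum of negative margins
    n_neg = np.count_nonzero(margins < 0, axis=1)
    return np.where(n_neg > 0, neg_sum / np.maximum(n_neg, 1), 0.0)

def pick_hardest(margins, n2):
    """Indices of the #N2 lowest-scoring (hardest) images."""
    return np.argsort(hardness_scores(margins))[:n2]
```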

re 2)
We should find the largest negative margins as above, but then, for each
of these margins, test it in humans. This means that we will have a list
of tuples ranked by margin (most negative first):
(image, distractor_synset, margin)

And we will search down this ordered list using psychophysics to find the
first #N2 tuples that have a human performance above some threshold (a
sketch follows).
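
A sketch of that search, where human_performance stands in for actually
running the psychophysics experiment on a given (image, distractor_synset)
2-way task; all names here are hypothetical:

```python
def screen_with_psychophysics(ranked_tuples, human_performance, threshold, n2):
    """Walk down the margin-ranked list (most negative margin first) and
    keep the first #N2 tuples on which humans perform above threshold."""
    kept = []
    for image, distractor_synset, margin in ranked_tuples:
        if human_performance(image, distractor_synset) > threshold:
            kept.append((image, distractor_synset, margin))
            if len(kept) == n2:
                break
    return kept
```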

Here are some training curve results for MCC2 classification.
The results for LinearSVC are still being calculated (it takes about 210
minutes to generate one of these curves).

[screenshot: MCC2 training curve](https://f.cloud.github.com/assets/2701347/1241139/cdc9188e-2a17-11e3-99a5-c8acd6783cb1.png)

Immediate points of action:

  1. Deciding how many images per synset to extract (#N1), then extracting
    them.
  2. Deciding the size of the screening set (#N2)

#N1 seems to be around 400 given the training curve (saturation around
300-350 training examples, plus 50-100 test examples).

If you agree with this decision for #N1, then I will create a new dataset
called PixelHardSynsets, which you should then extract:

import imagenet
dataset = imagenet.dldataset.PixelHardSynsets




ardila commented on September 9, 2024

Some vocabulary:
challenge subset -> dataset for goal one
imagenet subset -> dataset for goal two

1)

The various options are:
- Just pixels
- V1 (probably will require some engineering effort/setup time from me)
- HMax (probably will require some engineering effort/setup time from me)
- V1+HMax (probably will require some engineering effort/setup time from me)
- Random L3
- HMO
The problem with HMO-hard is that if we believe HMO is capturing key axes of difficulty, then screening on it will remove those axes from the dataset. This is OK if we have some principled way of combining the model we screen on the challenge subset with our existing model, but even if we do, at some point we should think about regularization (how many times is it fair to screen on a new dataset and add more components to the model?).
If we are not combining models, then we want to remove only the axes of difficulty that will automatically be captured by almost any member of the model class, which is why I suggested random L3s

2)

a) agreed
b) Once we have a set of tuples with high deltas:
(image, distractor_synset, delta = model margin minus human performance expressed as a margin (via logistic regression))
we can construct the imagenet subset in several ways; here is one suggestion.
If we think of the deltas as weights, then every distractor synset will have some amount of weight summed over all the tuples. We should take a random sample of images from each synset whose size is proportional to the synset's weight (a sketch follows).
There is then one free parameter, the ratio of hard images to images from distractor synsets,
which can be set empirically to ensure that the screening set is actually difficult for HMO-0.
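
A sketch of that weighted sampling, under the assumption that the tuples
carry nonnegative deltas and that images_by_synset maps each distractor
synset to its candidate images; all names here are hypothetical:

```python
import numpy as np
from collections import defaultdict

def sample_distractor_synsets(tuples, images_by_synset, n_total, seed=0):
    """tuples: [(image, distractor_synset, delta), ...]. Treating the
    deltas as weights, give each distractor synset a share of the
    n_total sampled images proportional to its summed delta."""
    rng = np.random.default_rng(seed)
    weight = defaultdict(float)
    for _, synset, delta in tuples:
        weight[synset] += delta
    total = sum(weight.values())
    sample = []
    for synset, w in weight.items():
        k = int(round(n_total * w / total))
        pool = list(images_by_synset[synset])
        take = min(k, len(pool))      # never oversample a small synset
        sample.extend(rng.choice(pool, size=take, replace=False))
    return sample
```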

3)

It depends on N1. Since we've agreed N1 = 400 is OK, the number of synsets depends on the budget for extraction, which you said was 250,000 images (833 synsets). If that is correct, then you should begin extraction of PixelHardSynsets ASAP (it should be ready about 15 minutes after I post this).


ardila commented on September 9, 2024

PixelHardSynsets is now available: e93d9e99547c2fe05e48d264bf9219589ca9bc54

Here are SVM results (not much different from the MCC results):
[screenshot: SVM results, 2013-10-01]

I am also running the following classifiers using compute_metric_base: 5-NN and SGDClassifier.


ardila commented on September 9, 2024

[screenshot: HMO confusion matrix (conf_mat3)]


ardila commented on September 9, 2024

@yamins81
In talking with Jim about priorities, I think we came to the conclusion that we need to take advantage of the work I've done so far in some way, instead of dropping it all to move to a new problem. Looking through what I have, I was wondering whether you still think that "finding the hard parts of imagenet" is a useful goal.

I'm pretty convinced that I've done this: I have run all 2-ways with the best model I can run, and found the densest part of the space. I have measured human and model performance at just a few points in this space, and it looks like there is a significant gap with humans (just not in 2-ways, because there humans and models are both near ceiling). If you are not convinced of this gap, what would it take to convince you?

Is it possible to run the HMO procedure again on a combination of however much of this dense space would be appropriate + the synthetic set from before?

At the very least, I want to run some sort of apples-to-apples comparison on Imagenet with HMO and the others, especially since I've found that

  1. The gap on HvM is still significant, and here HMO is the most consistent with humans
  2. The consistency between humans and the convnet models is generally low on imagenet, especially in the dense subspace.
