common-sense-prediction's People

Contributors

arianhosseini, kudkudak, mnoukhov, rizar

common-sense-prediction's Issues

Try to figure out why their model works well on extrinsic eval

We talked with Dima:

  • Any evaluation that relies on sampled negatives ultimately sucks, so we need extrinsic evaluation
  • The paper also figured this out: it includes an extrinsic evaluation, and the model performs well on it. Weird!

The task is to examine hypotheses for why this happens.

vocab should be a property of the dataset, not the embedding.

Handle embeddings and vocab like in https://github.com/tombosc/dict_based_learning

Some clarification: by vocab we mean a mapping from word to index, together with a chosen set of known words. All words outside the vocab are UNK'ed.
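
A minimal sketch of what such a vocab object could look like (the class name and the UNK convention here are illustrative, not the final API):

```python
class Vocab:
    """Word-to-index mapping over a fixed set of known words; everything else maps to UNK."""

    UNK = "<unk>"

    def __init__(self, known_words):
        # Reserve index 0 for UNK, then assign consecutive indices to known words.
        self.word_to_index = {self.UNK: 0}
        for word in known_words:
            self.word_to_index.setdefault(word, len(self.word_to_index))

    def encode(self, words):
        # Any word outside the vocab is mapped to the UNK index.
        unk = self.word_to_index[self.UNK]
        return [self.word_to_index.get(w, unk) for w in words]
```

The dataset would then own a single Vocab instance, and embeddings would simply be looked up by the indices it produces.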

Introduce the changes in all files that use src/data.py.

It will also require some changes in scripts/evaluate, but this is on @kudkudak. The basic idea is that this issue introduces a breaking change, so any models trained before it won't be usable in evaluation etc.

Try to merge ArgSim and MaxSim

While for now we abandon showing trivialization via a model, in favour of the clearer path of showing trivialization via extrinsic evaluation, it would still be nice to have a clearly trivial model.

One idea would be to merge ArgSim and MaxSim. This requires #36

Factorized model

I find it a bit hard to combine two jobs because of some health issues, so I thought I would at least describe what I think should be done.

So, we want to show that their model doesn't really do any serious common sense inference. How do we do it? We can try to build a simpler model, for which it would be obvious that nothing fancy can take place, but which we still expect to achieve high performance. This is the same path that we tried to take with MaxSim++, but it didn't work. So let's change the angle of attack as follows.

In one of his papers (http://www.aclweb.org/anthology/N15-1098) Omer Levy shows that one way to cheat when training a supervised relation classifier is to remember what the prototypical heads and prototypical tails are. For example, "animal" is a typical tail of the "Is-A" relation. This can be generalized to our case of modelling many relations by checking the compatibility of the head and the tail, the head and the relation, and the relation and the tail. For example, we can consider a score function like S(H, R, T) = S1(H, R) + S2(R, T) + S3(H, T) or S(H, R, T) = max(S1(H, R), S2(R, T), S3(H, T)). It would be great to show that something like this can achieve a similar level of performance. S1, S2, S3 can be built to mimic our best-performing model but with a fixed trainable head, relation or tail.
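
A minimal sketch of such a factorized scorer, assuming H, R, T are embedding vectors and the pairwise scorers S1, S2, S3 are arbitrary callables (everything here is illustrative, not the actual model):

```python
import numpy as np


def factorized_score(h, r, t, s_hr, s_rt, s_ht, combine="sum"):
    """Score a triplet using only pairwise compatibilities.

    h, r, t: embedding vectors for the head, relation and tail.
    s_hr, s_rt, s_ht: callables returning scalar compatibility scores,
    e.g. copies of the best-performing model with one slot fixed.
    """
    scores = [s_hr(h, r), s_rt(r, t), s_ht(h, t)]
    return sum(scores) if combine == "sum" else max(scores)


# Toy pairwise scorer used only for illustration: cosine similarity.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


# Example: score = factorized_score(h, r, t, cosine, cosine, cosine, combine="max")
```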

If such a model performs well, we change the sampling of negative examples to avoid the examples this factorized model can detect, because such examples are worthless for us.

Dev re-make

Simple re-make of dev. I am not sure if it will solve much, but we should propose and re-evaluate on something like this anyway.

  • Random dev/dev2/test instead of top scores
  • Bucketing of dev/dev2/test, or some AUC measure

See how ordering of factorized/prototypical/DNN is affected.
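
A small sketch of the random split, assuming the positive triplets live in one list and the split fractions are placeholders:

```python
import random


def random_split(triplets, dev_frac=0.05, dev2_frac=0.05, test_frac=0.05, seed=777):
    """Shuffle the triplets and carve out random dev/dev2/test sets instead of taking top scores."""
    triplets = list(triplets)
    random.Random(seed).shuffle(triplets)
    n = len(triplets)
    n_dev, n_dev2, n_test = int(n * dev_frac), int(n * dev2_frac), int(n * test_frac)
    dev = triplets[:n_dev]
    dev2 = triplets[n_dev:n_dev + n_dev2]
    test = triplets[n_dev + n_dev2:n_dev + n_dev2 + n_test]
    train = triplets[n_dev + n_dev2 + n_test:]
    return train, dev, dev2, test
```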

Tasks:

  • Script performing splitting

  • Script adding negative samples

  • Metrics on buckets

Human novelty evaluation

Categorize triplets into various triviality categories, or state that no triviality is seen, based on the 5 closest neighbours in OMCS embedding distance. Then check:

  • how many arguably trivial triplets each dataset has
  • whether there is a statistically significant difference between the top 50% and bottom 50%

Datasets:

  • test from CN
  • wiki 100 out of 10k (so it is not extremely biased)
  • random 100 from train

Code robustness - Add asserts to data streams, fix for Theano.

Right now, when something is wrong with the data folder, it just fails silently and returns None. It would be nice to add an assert that we do not return None.

Also, the code currently fails for Theano, which is fine for now but would be nice to fix.

Bonus: figure out why Keras sometimes complains about dict input and sometimes doesn't. Right now I just always pass a list :P

Code script to add distances, run for wiki and conceptnet, analyze

We have two test sets: the Wiki test set has 1.7M triplets, the ConceptNet one has 3k. To each triplet we need to add the distance (based on a given embedding) to the closest example from the train set (100k examples).

This is a feasible computation if we use properties of the metric. Worst case, we can subsample the wiki corpus; it shouldn't be an issue. Start with the script https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/augment_with_closest.py; the new script should have similar arguments.

Distance function: +INF if the relation is different, otherwise max(||tail_a - tail_b||^2, ||head_a - head_b||^2). Note that this can be sped up significantly by the following tricks (a minimal sketch follows after the list):

  • precomputing the first and second term of the max for each unique head and tail. There are 25k unique heads and 50k unique tails, so this should speed up the computation by 3x.
  • dividing the training set by relation (because the distance is +INF if the relations differ)
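
A minimal sketch of this distance, assuming head and tail embeddings are precomputed vectors (e.g. averaged word embeddings) and the train set is already grouped by relation as in the second trick; names are illustrative:

```python
import numpy as np


def triplet_distance(head_a, tail_a, head_b, tail_b):
    """max of squared Euclidean distances between heads and between tails.
    Assumes both triplets share the same relation (otherwise the distance is +inf)."""
    d_head = np.sum((head_a - head_b) ** 2)
    d_tail = np.sum((tail_a - tail_b) ** 2)
    return max(d_head, d_tail)


def distance_to_train(test_triplet, train_by_relation):
    """Distance from a test triplet to the closest train triplet.

    train_by_relation: dict mapping relation -> list of (head_vec, tail_vec),
    i.e. the train set already split by relation.
    """
    rel, head, tail = test_triplet
    candidates = train_by_relation.get(rel, [])
    if not candidates:
        return np.inf
    return min(triplet_distance(head, tail, h, t) for h, t in candidates)
```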

Suggestions:

  • Parallelize only if needed (beware of memory problems when parallelizing over GloVe). Probably the easiest parallelization is to use GNU parallel and add some sort of range parameters to your script indicating which rows to compute

After computing the distances, for random examples from the wiki corpus, please fetch the 5 closest train examples and their scores, to get a feel for how well this conforms to intuition about what is novel and what is trivial.

This should conclude the test set evaluation.

Subtasks:

  • Code script (and commit to master)
  • Run it
  • Create gdoc with some examples

Improve human evaluation script

  • Support any number of input models
  • Remove the stupid naming convention and just pass file names instead
  • Include different types of mixing (e.g. top K + random L)

Refactor and summarize current effort on baselines

  1. Prepare a small note, based on the notes in Evernote, of what has been done
  2. Refactor into scripts:
  • maxsim.py
  • maxsim_3_5.py, maxsim_3_6.py etc.
  • maxargsim.py
  3. Extract the raw resources that I use: predictions from their model and embeddings.txt
  4. Notebook for reproducing the last bit of analysis (as an example of loading the raw resources)

Evaluate on wiki

Script for evaluating on the wiki extrinsic evaluation, plus data preprocessing.

Close remaining 0.5-1%

  1. Check again how good their code is (I think I got 91.3 while they report 91.8?)
  2. Match the 91.3; it should be doable by tuning l2, which I haven't really done. If tuning l2 doesn't work, try tuning the lr.
  3. If by now we are still below their reported 91.8%, try pretraining the relation embeddings (they host code for that)

Decouple embeddings from Data

I do not think it is a good design choice, and it is surely not a common one. Just move the load_embeddings code to the train script and that's all.

vegab improve meta.json

Add the current git commit hash (and possibly the branch) to meta.json, to make it easier to find the version used in that run.

We could potentially also remove the train.py, since this would allow you to go back to the right commit.
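
A minimal sketch of how the git info could be added, assuming meta.json is a flat JSON dict that vegab already writes (the function names and path are illustrative):

```python
import json
import subprocess


def get_git_info():
    """Return the current commit hash and branch name, assuming we run inside a git repo."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    branch = subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()
    return {"git_commit": commit, "git_branch": branch}


def augment_meta(meta_path="meta.json"):
    """Add git info to an existing meta.json (the path is a placeholder)."""
    with open(meta_path) as f:
        meta = json.load(f)
    meta.update(get_git_info())
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2)
```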

Fast computing distance script

#46 computes distances, but it is slow. We need a script that works on 1M triplets (like wiki). We could alternatively short-list, but that's unprofessional :)

Goal: get the 10k computation close to 10 minutes, rather than a few hours. Should be good enough, in engineering terms :)

Start from https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/compute_distances.py (it is faster than #46, probably because the suggested head/tail splitting trick makes little sense without parallelizing the algebra; it could be modified by stacking heads and tails, but for now it is simpler to just do the brute-force computation).
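
A rough sketch of the vectorized brute-force computation for a single relation, assuming the test and train head/tail embeddings are stacked into float matrices (shapes and names are illustrative):

```python
import numpy as np


def min_distances_for_relation(test_heads, test_tails, train_heads, train_tails):
    """For each test triplet (rows of test_heads/test_tails), return the distance
    to the closest train triplet of the same relation, computed in one shot.

    All arguments are float arrays: test_* of shape (n_test, dim), train_* of shape (n_train, dim).
    """
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    def pairwise_sq_dists(a, b):
        return (
            np.sum(a ** 2, axis=1)[:, None]
            - 2.0 * a @ b.T
            + np.sum(b ** 2, axis=1)[None, :]
        )

    d_head = pairwise_sq_dists(test_heads, train_heads)  # (n_test, n_train)
    d_tail = pairwise_sq_dists(test_tails, train_tails)  # (n_test, n_train)
    return np.maximum(d_head, d_tail).min(axis=1)
```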

Negative samples outside of batch

For some reason we generate negative samples within the batch. That's arguably weird; I'm not sure what effect it has.

Try generating them outside of the batch, and see how it impacts the models, for both the negative-ArgSim and the normal split.
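
A minimal sketch of out-of-batch negative sampling, assuming we corrupt the head or tail of each positive triplet with an entity drawn from the whole training set rather than from the current minibatch (names and the corruption scheme are illustrative):

```python
import random


def sample_negatives(batch, all_heads, all_tails, rng=random):
    """For each positive (head, rel, tail) in the batch, produce one corrupted triplet
    whose replacement head or tail comes from the full training set, not the batch."""
    negatives = []
    for head, rel, tail in batch:
        if rng.random() < 0.5:
            negatives.append((rng.choice(all_heads), rel, tail))  # corrupt the head
        else:
            negatives.append((head, rel, rng.choice(all_tails)))  # corrupt the tail
    return negatives
```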

Bilinear model

Based on the analysis, it seems most likely that the model mostly captures prototypical relations. If a bilinear model does really well with the representation it uses, I think it is a very strong message that their model does completion, not prediction.

Let's see how well a bilinear model does.
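
For reference, a minimal sketch of a bilinear scorer, assuming one matrix per relation acting on head and tail embedding vectors (a sketch of the idea, not the agreed architecture or training code):

```python
import numpy as np


class BilinearScorer:
    """Scores a triplet as head^T W_r tail, with one matrix per relation."""

    def __init__(self, n_relations, dim, rng=np.random):
        # One (dim x dim) interaction matrix per relation; these would be trained in practice.
        self.W = rng.normal(scale=0.1, size=(n_relations, dim, dim))

    def score(self, head_vec, rel_id, tail_vec):
        return float(head_vec @ self.W[rel_id] @ tail_vec)
```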

Assigning Dima for now.

Create a golden set of "trivial/novel" pairs

To make our considerations concrete, and to see how much our MaxSim is lying, let's do the following:

Create a set of K pairs of triplets, each labelled with whether, in your opinion, the fact is novel or not.

e.g.:

(frog, isan, animal), (cat, isan, animal) -> trivial/novel
