kudkudak / common-sense-prediction
Common sense prediction using DL.
We talked with Dima:
The task is to examine hypotheses around why it happens.
It is not entirely clear what the exact performances of DNN_CE and Factorized are. The ballpark should be 92% for both, but I am not certain, especially since previous score estimates used something closer to max test accuracy, which is not great ;)
Handle embeddings and vocab like in https://github.com/tombosc/dict_based_learning
Some clarification: by vocab we mean an assignment from word to index, as well as a picked vocabulary of known words. All words outside the vocab are UNK'ed.
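To make this concrete, here is a minimal sketch of the intended vocab handling (all names are hypothetical; the dict_based_learning repo linked above does this in full):

```python
# A hypothetical sketch of the vocab handling described above.
from collections import Counter

UNK = "<unk>"

def build_vocab(words, max_size):
    """Keep the max_size - 1 most frequent words; index 0 is reserved for UNK."""
    counts = Counter(words)
    word_to_id = {UNK: 0}
    for w, _ in counts.most_common(max_size - 1):
        word_to_id[w] = len(word_to_id)
    return word_to_id

def encode(words, word_to_id):
    """Map each word to its index; words outside the vocab are UNK'ed."""
    return [word_to_id.get(w, word_to_id[UNK]) for w in words]
```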
Introduce changes to all files using src/data.py
It will also require some changes in scripts/evaluate, but this is on @kudkudak. The basic idea is that this issue introduces a breaking change, so any models trained prior to it won't be usable in evaluation etc.
See if the ACL model gets good scores if and only if it "copies" (retrieves a memorized triplet from ConceptNet).
I am going to a boring conference; this should be doable within 3h. We will then be able to iterate much faster.
While for now we abandon showing trivialization via a model, in favour of showing trivialization more clearly via extrinsic evaluation, it would still be nice to try to have a clearly trivial model.
One idea would be to merge ArgSim and MaxSim. This requires #36
I find it a bit hard to combine two jobs because of some health issues, so I thought I would at least describe what I think should be done.
So, we want to show that their model doesn't really do any serious common sense inference. How do we do it? We can hope to build a simpler model for which it is obvious that nothing fancy can take place, but which still obtains high performance. This is the same path we tried to take with MaxSim++, but it didn't work. So let's change the angle of attack as follows.
In one of his papers (http://www.aclweb.org/anthology/N15-1098) Omer Levy shows that one way to cheat when training a supervised relation classifier is to remember prototypical heads and prototypical tails. For example, "animal" is a typical tail of the Is-A relation. This can be generalized to our case of modelling many relations by checking the compatibility of the head and the tail, the head and the relation, and the relation and the tail. For example, we can consider a score function like S(H, R, T) = S1(H, R) + S2(R, T) + S3(H, T) or S(H, R, T) = max(S1(H, R), S2(R, T), S3(H, T)). It would be great to show that something like this achieves a similar level of performance. S1, S2, S3 can be built to mimic our best-performing model but with a fixed trainable head, relation or tail.
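To illustrate, a minimal sketch of both variants (the embeddings and the dot-product form of S1, S2, S3 are assumptions, not an actual implementation):

```python
import numpy as np

# Hypothetical pairwise compatibility scorers; in practice S1, S2, S3 could
# mimic the best-performing model with one of the three inputs held fixed.
def S1(head, rel):
    return float(head @ rel)

def S2(rel, tail):
    return float(rel @ tail)

def S3(head, tail):
    return float(head @ tail)

def score_additive(head, rel, tail):
    # S(H, R, T) = S1(H, R) + S2(R, T) + S3(H, T)
    return S1(head, rel) + S2(rel, tail) + S3(head, tail)

def score_max(head, rel, tail):
    # S(H, R, T) = max(S1(H, R), S2(R, T), S3(H, T))
    return max(S1(head, rel), S2(rel, tail), S3(head, tail))

h, r, t = (np.random.randn(300) for _ in range(3))
print(score_additive(h, r, t), score_max(h, r, t))
```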
If such a model performs well, we change the sampling of negative examples to avoid the examples this factorized model can detect, because such examples are worthless for us.
A simple re-make of dev. I am not sure if it will solve much, but we should propose and re-evaluate on something like this anyway.
See how ordering of factorized/prototypical/DNN is affected.
Tasks:
Script performing splitting
Script adding negative samples
Metrics on buckets
Categorize triplets into various triviality categories, or state that no triviality is seen, based on the 5 closest neighbours in OMCS embedding distance. Then see if:
Datasets:
Closing for now; wrote the main pitch and an outline of the structure (most of it is actually still in the gdoc https://docs.google.com/document/d/1tOepCPXQB-kUzAV38j1SqG0bHcyWHSpCkiN8smiDQJQ/edit, but some is already in Overleaf).
Evaluate 1000 random ConceptNet triplets to create negative examples. Should take approximately 1h in total. We need at least two scores to sound reasonable.
Protocol will be decided during meeting 22.11.
Gdoc link (for now empty): https://docs.google.com/spreadsheets/d/17_OLZ29lm2evv07jNdKlNKcTNrdHkRZRNK9R5hdh7a0/edit#gid=0
Don't be too concerned about code quality right now, but if you want you can copy https://github.com/gmum/toolkit/tree/master/example_dl_project and build from it.
Currently, when something is wrong with the data folder, it just fails silently and returns None. It would be nice to add an assert that we do not return None.
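Something along these lines, as a sketch (the path layout and loader are assumptions about src/data.py, not its actual API):

```python
import os

def load_data(data_dir):
    path = os.path.join(data_dir, "train.txt")  # hypothetical layout
    assert os.path.exists(path), "Broken data folder: missing %s" % path
    with open(path) as f:
        data = [line.strip().split("\t") for line in f]
    assert data, "Data folder loaded as empty: %s" % path
    return data
```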
Also, the code currently fails for Theano, which is fine but would be nice to fix.
Bonus: figure out why Keras sometimes complains about dict input and sometimes doesn't. Right now I just always pass a list :P
We have two test sets: the wiki test set has 1.7M triplets, the ConceptNet one has 3k. We need to add to each triplet its distance (based on a given embedding) to the closest example from the train set (100k examples).
This is a feasible computation if we use properties of the metric. Worst case, we can subsample the wiki corpora; that shouldn't be an issue. Start with the script https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/augment_with_closest.py. The new script should have similar arguments.
Distance function: +INF if the relation is different, otherwise max(||tail_a - tail_b||^2, ||head_a - head_b||^2). Note that this can be sped up significantly by the following tricks (see also the brute-force sketch after the suggestions).
Suggestions:
After computing the distances, for random examples from the wiki corpora, please fetch the 5 closest triplets and their scores, to get a feel for how this conforms to the intuition of what is novel and what is trivial.
This should conclude the test set evaluation.
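For reference, a minimal brute-force sketch of the distance function defined above (array names and shapes are assumptions; the tricks should beat this baseline):

```python
import numpy as np

def min_distance_to_train(q_head, q_tail, q_rel,
                          train_heads, train_tails, train_rels):
    """Distance from one test triplet to the closest train triplet:
    +inf if no train triplet shares the relation, otherwise the min over
    same-relation triplets of max(||head diff||^2, ||tail diff||^2)."""
    mask = train_rels == q_rel
    if not mask.any():
        return np.inf
    # Squared Euclidean distances, vectorized over same-relation triplets.
    head_d = ((train_heads[mask] - q_head) ** 2).sum(axis=1)
    tail_d = ((train_tails[mask] - q_tail) ** 2).sum(axis=1)
    return float(np.maximum(head_d, tail_d).min())
```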
Subtasks:
This has been done in evaluation of NELL.
Once #46 is done, check whether it is true that for novel triplets all models behave the same.
Following what I did for Wiki, do the same for the ConceptNet "extrinsic" evaluation proposed in the ACL paper.
Script evaluating on wiki extrinsic evaluation, data preprocessing
I do not think it is a good design choice, and it is surely not a common one. Just move the load_embeddings code to the train script and that's all.
Add current git commit hash (and possibly branch) to meta.json, to make it easier to find the version used in that run
We could then potentially also stop saving train.py, since the hash would allow you to go back to the right commit.
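A minimal sketch of how this could look (assuming meta.json is built from a plain dict; not necessarily how our train script structures it):

```python
import json
import subprocess

def git_info():
    # Current commit hash and branch of the repo we are running from.
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip()
    branch = subprocess.check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()
    return {"git_commit": commit, "git_branch": branch}

meta = {}  # whatever meta the run already collects
meta.update(git_info())
with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```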
Please do that if you can; we can talk tomorrow about how to assign scores, more or less.
Practically speaking: just copy all the sub-spreadsheets (1-2_Stan) and create e.g. 1-2_Arian in https://docs.google.com/spreadsheets/d/1yblbveDmG6RjmOf8nsVnOGg1ExHgZGJwjxqtjQk-17o/edit#gid=1151004832, delete the column with my scores, and score yourself. This is 400 tuples; it shouldn't take more than 1h to score.
#46 computes distances, but it is slow. We need a script that works on 1M triplets (like wiki). We could alternatively short-list, but that's unprofessional :)
Goal: get the 10k computation close to 10 minutes, rather than a few hours. Should be good enough, in engineering terms :)
Start from https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/compute_distances.py (it is faster than #46, probably because the suggested trick with head/tail splitting makes little sense without parallelizing the algebra; it could be modified by stacking heads and tails, but for now it is simpler to just do the brute-force computation).
For some reason we generate negative samples within-batch. That's arguably weird; I am not sure what effect it has.
Try generating negatives outside of the batch, and see how it impacts the models for both the negative-argsim and the normal split.
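For clarity, a sketch of what "outside of batch" generation could mean (a hypothetical corruption scheme drawing from the whole vocab; all names are made up):

```python
import numpy as np

rng = np.random.RandomState(0)

def corrupt_outside_batch(heads, rels, tails, vocab_size):
    """Corrupt each positive triplet by replacing its head or tail with a
    random word id from the whole vocabulary, instead of from the same batch."""
    neg_heads, neg_tails = heads.copy(), tails.copy()
    # For each positive, flip a coin: corrupt the head or the tail.
    corrupt_head = rng.rand(len(heads)) < 0.5
    neg_heads[corrupt_head] = rng.randint(vocab_size, size=corrupt_head.sum())
    neg_tails[~corrupt_head] = rng.randint(vocab_size, size=(~corrupt_head).sum())
    return neg_heads, rels.copy(), neg_tails
```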
I had to create LiACLDatasetFromFile for evaluation purposes, but it could and should be merged with LiACLSplitDataset
We want each point to have a label like "paraphrase", "synonym", etc., so that we can have a table in the paper stating how good the model is at each category. This drives home the message that the model is about completion, not prediction.
Using #56 and the distances on the original ConceptNet test set, look at the importance and predictive value of the distance metric used.
Based on the analysis, it seems most likely that the model is mostly about prototypical relations. If a bilinear model does really well with the representation it uses, I think it is a very strong message that their model does completion, not prediction.
Let's see how well a bilinear model does.
Assigning Dima for now.
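For reference, a minimal sketch of what is meant by a bilinear model here (the dimension and initialization are assumptions):

```python
import numpy as np

def bilinear_score(head_vec, rel_matrix, tail_vec):
    # Score(h, r, t) = h^T W_r t, with one trainable matrix W_r per relation.
    return float(head_vec @ rel_matrix @ tail_vec)

d = 300  # embedding dimension, assumed
W = {"IsA": 0.01 * np.random.randn(d, d)}  # one matrix per relation
h, t = np.random.randn(d), np.random.randn(d)
print(bilinear_score(h, W["IsA"], t))
```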
To make our considerations concrete, and to see how much our MaxSim is lying, let's please do the following:
Create a set of K pairs of triplets, each labelled with whether, in your opinion, the fact is novel or not.
e.g.:
(frog, IsA, animal), (cat, IsA, animal) -> trivial/novel
Check how the negative samples look for various thresholds. Report together with the Factorized score.
We need this for: