kudkudak / common-sense-prediction
Common sense prediction using DL.
We talked with Dima:
The task is to examine hypotheses around why it happens.
It is not entirely clear what the exact performances of DNN_CE and Factorized are. The ballpark should be 92% for both, but I am not certain, especially since previous score estimates used something closer to max test accuracy, which is not great ;)
Handle embeddings and vocab like in https://github.com/tombosc/dict_based_learning
Some clarification: by vocab we mean an assignment from word to index, as well as a picked vocabulary of known words. All words outside the vocab are UNK'ed.
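To make this concrete, here is a minimal sketch of the intended vocab handling (all names are hypothetical; the dict_based_learning repo linked above does this in full):

```python
# A hypothetical sketch of the vocab handling described above.
from collections import Counter

UNK = "<unk>"

def build_vocab(words, max_size):
    """Keep the max_size - 1 most frequent words; index 0 is reserved for UNK."""
    counts = Counter(words)
    word_to_id = {UNK: 0}
    for w, _ in counts.most_common(max_size - 1):
        word_to_id[w] = len(word_to_id)
    return word_to_id

def encode(words, word_to_id):
    """Map each word to its index; words outside the vocab are UNK'ed."""
    return [word_to_id.get(w, word_to_id[UNK]) for w in words]
```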
Introduce changes to all files using src/data.py
It will also require some changes in scripts/evaluate, but this is on @kudkudak. The basic idea is that this issue introduces a breaking change, so any models trained prior to it won't be usable in evaluation etc.
See if the ACL model gets good scores if and only if it "copies" (retrieves a memorized triplet from ConceptNet).
I am going to a boring conference; this should be doable within 3h. We will then be able to iterate much faster.
While for now we abandon showing trivialization via a model, in favour of showing trivialization more clearly via extrinsic evaluation, it would still be nice to try to have a clearly trivial model.
One idea would be to merge ArgSim and MaxSim. This requires #36
I find it a bit hard to combine two jobs because of some health issues, so I thought I would at least describe what I think should be done.
So, we want to show that their model doesn't really do any serious common sense inference. How do we do it? We can hope to build a simpler model for which it is obvious that nothing fancy can take place, but which still obtains high performance. This is the same path we tried to take with MaxSim++, but it didn't work. So let's change the angle of attack as follows.
In one of his papers (http://www.aclweb.org/anthology/N15-1098) Omer Levy shows that one way to cheat when training a supervised relation classifier is to remember prototypical heads and prototypical tails. For example, "animal" is a typical tail of the Is-A relation. This can be generalized to our case of modelling many relations by checking the compatibility of the head and the tail, the head and the relation, and the relation and the tail. For example, we can consider a score function like S(H, R, T) = S1(H, R) + S2(R, T) + S3(H, T) or S(H, R, T) = max(S1(H, R), S2(R, T), S3(H, T)). It would be great to show that something like this achieves a similar level of performance. S1, S2, S3 can be built to mimic our best-performing model but with a fixed trainable head, relation or tail.
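To illustrate, a minimal sketch of both variants (the embeddings and the dot-product form of S1, S2, S3 are assumptions, not an actual implementation):

```python
import numpy as np

# Hypothetical pairwise compatibility scorers; in practice S1, S2, S3 could
# mimic the best-performing model with one of the three inputs held fixed.
def S1(head, rel):
    return float(head @ rel)

def S2(rel, tail):
    return float(rel @ tail)

def S3(head, tail):
    return float(head @ tail)

def score_additive(head, rel, tail):
    # S(H, R, T) = S1(H, R) + S2(R, T) + S3(H, T)
    return S1(head, rel) + S2(rel, tail) + S3(head, tail)

def score_max(head, rel, tail):
    # S(H, R, T) = max(S1(H, R), S2(R, T), S3(H, T))
    return max(S1(head, rel), S2(rel, tail), S3(head, tail))

h, r, t = (np.random.randn(300) for _ in range(3))
print(score_additive(h, r, t), score_max(h, r, t))
```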
If such a model performs well, we change the sampling of negative examples to avoid the examples this factorized model can detect, because such examples are worthless for us.
A simple re-make of dev. I am not sure if it will solve much, but we should propose and re-evaluate on something like this anyway.
See how ordering of factorized/prototypical/DNN is affected.
Tasks:
Script performing splitting
Script adding negative samples
Metrics on buckets
Categorize triplets into various triviality categories, or state that no triviality is seen, based on the 5 closest neighbours in OMCS embedding distance. Then see if:
Datasets:
Closing for now; wrote the main pitch and an outline of the structure (most of it is actually still in the gdoc https://docs.google.com/document/d/1tOepCPXQB-kUzAV38j1SqG0bHcyWHSpCkiN8smiDQJQ/edit, but some is already in Overleaf).
Evaluate 1000 random ConceptNet triplets to create negative examples. Should take approximately 1h in total. We need at least two scores to sound reasonable.
Protocol will be decided during meeting 22.11.
Gdoc link (for now empty): https://docs.google.com/spreadsheets/d/17_OLZ29lm2evv07jNdKlNKcTNrdHkRZRNK9R5hdh7a0/edit#gid=0
Don't be too concerned about code quality right now, but if you want you can copy https://github.com/gmum/toolkit/tree/master/example_dl_project and build from it.
Currently, when something is wrong with the data folder, it just fails silently and returns None. It would be nice to add an assert that we do not return None.
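Something along these lines, as a sketch (the path layout and loader are assumptions about src/data.py, not its actual API):

```python
import os

def load_data(data_dir):
    path = os.path.join(data_dir, "train.txt")  # hypothetical layout
    assert os.path.exists(path), "Broken data folder: missing %s" % path
    with open(path) as f:
        data = [line.strip().split("\t") for line in f]
    assert data, "Data folder loaded as empty: %s" % path
    return data
```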
Also, the code currently fails for Theano, which is fine but would be nice to fix.
Bonus: figure out why Keras sometimes complains about dict input and sometimes doesn't. Right now I just always pass a list :P
We have two test sets: the wiki test set has 1.7M triplets, the ConceptNet one has 3k. We need to add to each triplet its distance (based on a given embedding) to the closest example from the train set (100k examples).
This is a feasible computation if we use properties of the metric. Worst case, we can subsample the wiki corpora; that shouldn't be an issue. Start with the script https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/augment_with_closest.py. The new script should have similar arguments.
Distance function: +INF if the relation is different, otherwise max(||tail_a - tail_b||^2, ||head_a - head_b||^2). Note that this can be sped up significantly by the following tricks (see also the brute-force sketch after the suggestions).
Suggestions:
After computing the distances, for random examples from the wiki corpora, please fetch the 5 closest triplets and their scores, to get a feel for how this conforms to the intuition of what is novel and what is trivial.
This should conclude the test set evaluation.
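For reference, a minimal brute-force sketch of the distance function defined above (array names and shapes are assumptions; the tricks should beat this baseline):

```python
import numpy as np

def min_distance_to_train(q_head, q_tail, q_rel,
                          train_heads, train_tails, train_rels):
    """Distance from one test triplet to the closest train triplet:
    +inf if no train triplet shares the relation, otherwise the min over
    same-relation triplets of max(||head diff||^2, ||tail diff||^2)."""
    mask = train_rels == q_rel
    if not mask.any():
        return np.inf
    # Squared Euclidean distances, vectorized over same-relation triplets.
    head_d = ((train_heads[mask] - q_head) ** 2).sum(axis=1)
    tail_d = ((train_tails[mask] - q_tail) ** 2).sum(axis=1)
    return float(np.maximum(head_d, tail_d).min())
```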
Subtasks:
This has been done in evaluation of NELL.
Once #46 is done, check whether it is true that for novel triplets all models behave the same.
Following what I did for Wiki, do the same for the ConceptNet "extrinsic" evaluation proposed in the ACL paper.
Script evaluating on wiki extrinsic evaluation, data preprocessing
I do not think it is a good design choice, and it is surely not a common one. Just move the load_embeddings code to the train script and that's all.
Add current git commit hash (and possibly branch) to meta.json, to make it easier to find the version used in that run
We could then potentially also stop saving train.py, since the hash would allow you to go back to the right commit.
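A minimal sketch of how this could look (assuming meta.json is built from a plain dict; not necessarily how our train script structures it):

```python
import json
import subprocess

def git_info():
    # Current commit hash and branch of the repo we are running from.
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip()
    branch = subprocess.check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()
    return {"git_commit": commit, "git_branch": branch}

meta = {}  # whatever meta the run already collects
meta.update(git_info())
with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```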
Please do that if you can; we can talk tomorrow about how to assign scores, more or less.
Practically speaking: just copy all the sub-spreadsheets (1-2_Stan) and create e.g. 1-2_Arian in https://docs.google.com/spreadsheets/d/1yblbveDmG6RjmOf8nsVnOGg1ExHgZGJwjxqtjQk-17o/edit#gid=1151004832, delete the column with my scores, and score yourself. This is 400 tuples; it shouldn't take more than 1h to score.
#46 computes distances, but it is slow. We need a script that works on 1M triplets (like wiki). We could alternatively short-list, but that's unprofessional :)
Goal: get the 10k computation close to 10 minutes, rather than a few hours. Should be good enough, in engineering terms :)
Start from https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/compute_distances.py (it is faster than #46, probably because the suggested trick with head/tail splitting makes little sense without parallelizing the algebra; it could be modified by stacking heads and tails, but for now it is simpler to just do the brute-force computation).
For some reason we generate negative samples within-batch. That's arguably weird; I am not sure what effect it has.
Try generating negatives outside of the batch, and see how it impacts the models for both the negative-argsim and the normal split.
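For clarity, a sketch of what "outside of batch" generation could mean (a hypothetical corruption scheme drawing from the whole vocab; all names are made up):

```python
import numpy as np

rng = np.random.RandomState(0)

def corrupt_outside_batch(heads, rels, tails, vocab_size):
    """Corrupt each positive triplet by replacing its head or tail with a
    random word id from the whole vocabulary, instead of from the same batch."""
    neg_heads, neg_tails = heads.copy(), tails.copy()
    # For each positive, flip a coin: corrupt the head or the tail.
    corrupt_head = rng.rand(len(heads)) < 0.5
    neg_heads[corrupt_head] = rng.randint(vocab_size, size=corrupt_head.sum())
    neg_tails[~corrupt_head] = rng.randint(vocab_size, size=(~corrupt_head).sum())
    return neg_heads, rels.copy(), neg_tails
```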
I had to create LiACLDatasetFromFile for evaluation purposes, but it could and should be merged with LiACLSplitDataset
We want each point to have a label like "paraphrase", "synonym", etc., so that we can have a table in the paper stating how good the model is at each category. This drives home the message that the model is about completion, not prediction.
Using #56 and the distances on the original ConceptNet test set, look at the importance and predictive value of the distance metric used.
Based on the analysis, it seems most likely that the model is mostly about prototypical relations. If a bilinear model does really well with the representation it uses, I think it is a very strong message that their model does completion, not prediction.
Let's see how well a bilinear model does.
Assigning Dima for now.
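For reference, a minimal sketch of what is meant by a bilinear model here (the dimension and initialization are assumptions):

```python
import numpy as np

def bilinear_score(head_vec, rel_matrix, tail_vec):
    # Score(h, r, t) = h^T W_r t, with one trainable matrix W_r per relation.
    return float(head_vec @ rel_matrix @ tail_vec)

d = 300  # embedding dimension, assumed
W = {"IsA": 0.01 * np.random.randn(d, d)}  # one matrix per relation
h, t = np.random.randn(d), np.random.randn(d)
print(bilinear_score(h, W["IsA"], t))
```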
To make our considerations concrete, and to see how much our MaxSim is lying, let's please do the following:
Create a set of K pairs of triplets, each labelled with whether, in your opinion, the fact is novel or not.
e.g.:
(frog, IsA, animal), (cat, IsA, animal) -> trivial/novel
Check how the negative samples look for various thresholds. Report together with the Factorized score.
We need this for: