martinthenext / eth_ml
Projects of a Machine Learning ETH team trying to use Mechanical Turk and active learning to solve a word-sense disambiguation task

machine-learning disambiguation computational-linguistics nlp

eth_ml's Introduction

Pool-based active learning for crowdsourcing word-sense disambiguation tasks

Word-sense disambiguation is the task of resolving ambiguity: finding out which of its possible meanings a phrase has in a particular context. An example of a disambiguation task:

Its use should be postponed in patients with Sardinella siccus affecting the stomach or gut.

Does Sardinella siccus in this text mean a type of disorder or a living being?

There are 190 000 cases of ambiguous terms produced by an automated text annotation tool. The goal is to resolve all of them. Training a classifier to perform such tasks requires labeled data. A project is being conducted at the Computational Linguistics Lab of UZH to use crowdsourcing: Amazon Mechanical Turk workers are asked to solve tasks like this:

mantracrowd survey

As of now, tasks are picked randomly from a pool of 190 000 ambiguous cases, and each is solved by at least 3 different workers. The goal of the project is to implement active learning:

  1. Have a classifier to predict phrase meaning from context (solve disambiguation tasks)
  2. Request MTurk workers to solve tasks which are the most informative for training the classifier
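The two steps above form the classic pool-based active learning loop with uncertainty sampling. A minimal sketch, in which the toy data, the logistic regression model, and the oracle labels are all illustrative stand-ins for the real classifier and the MTurk workers:

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# The data, model, and oracle here are toy placeholders, not the project's
# actual classifier or MTurk pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_pool = rng.randn(200, 5)                  # unlabeled pool (toy features)
y_pool = (X_pool[:, 0] > 0).astype(int)     # oracle labels (stand-in for workers)

labeled = list(range(10))                   # small seed set
unlabeled = list(range(10, 200))

clf = LogisticRegression()
for _ in range(5):
    clf.fit(X_pool[labeled], y_pool[labeled])
    # Pick the pool instance the classifier is least certain about.
    probs = clf.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - probs.max(axis=1)
    pick = unlabeled[int(np.argmax(uncertainty))]
    # "Ask a worker": reveal the oracle label and move it to the labeled set.
    labeled.append(pick)
    unlabeled.remove(pick)

print(len(labeled))  # 15 labels after 5 queries
```

Each iteration queries the label whose predicted class probabilities are closest to uniform, which is the "most informative for training the classifier" criterion in its simplest form.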

Data

Unlabeled data: ~195 000 disambiguation tasks

Labeled data:

  • 821 answers to 255 tasks (taken out of these 195 000) by MTurk workers. More answers can be easily retrieved if needed.
  • Up to 16 million non-ambiguous annotations, which can be viewed as tasks with known answers to train the initial classifier

Resources

  1. Applying active learning to supervised word sense disambiguation in MEDLINE. Chen et al., 2012
  2. Active Learning with Amazon Mechanical Turk. Laws et al., EMNLP 2011
  3. Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization. Golovin and Krause, 2011
  4. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization. Chen and Krause, 2013

Results

See final report.

Log of the results can be viewed here.

eth_ml's People

Contributors

aerial1543, go1dshtein, hanveiga, martinthenext


eth_ml's Issues

Further investigate Bag of Words

For the sake of dimensionality reduction, the following variants of ContextRestrictedBagOfWordsLeftRight should be implemented:

  1. One that omits words with low counts. For example, if a word occurs less than 3 times in the whole data set, exclude it from the bag-of-words features. Call it ContextRestrictedBagOfWordsLeftRightCutoff and make the cut-off frequency a parameter of its constructor, just like the window size. Hint: the parameter min_df can be set to 3 in the CountVectorizer options
  2. One that uses English stop words: ContextRestrictedBagOfWordsLeftRightStopWords. Hint: stop_words='english'.

To compare the performance of new vectorizers, create a script prototypes/compare_vectorizers.py that does the following:

  1. Outputs the agreement of the OptionAwareNaiveBayesLeftRight on the given data, just like mturk_classifier_agreement.py
  2. Substitutes the vectorizer in this classifier with the variants described above (Cutoff and StopWords), trains the new classifiers, and outputs the resulting agreement for comparison. Ideally it should output a table with vectorizer names, parameters (like min_df), and the corresponding agreements.

The resulting script should have the same command-line arguments as mturk_classifier_agreement.py.

Implement separate one-vs-all classifier for semantic groups

Fit 10 separate classifiers, one for every semantic group. To classify an annotation instance:

  1. Observe which semantic group options are presented for the ambiguous term: typically 2 or 3
  2. Run the corresponding group-specific classifiers and retrieve the probabilities of the conflicting groups
  3. Assign the group with the highest probability

The simplest classifier that outputs probabilities is logistic regression.

The resulting collection of classifiers should be wrapped into a classifier class.
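The scheme can be sketched as a small wrapper class; the group names, toy features, and class name below are illustrative, not the repo's actual code:

```python
# Sketch of the one-vs-all scheme: one binary logistic regression per
# semantic group, with prediction restricted to the options actually
# presented for the ambiguous term. Group names and data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

class OneVsAllGroupClassifier:
    def __init__(self, groups):
        self.classifiers = {g: LogisticRegression() for g in groups}

    def fit(self, X, y):
        for group, clf in self.classifiers.items():
            clf.fit(X, (y == group).astype(int))  # this group vs. the rest

    def predict(self, x, options):
        # Only score the 2-3 groups offered for this ambiguous term.
        scores = {g: self.classifiers[g].predict_proba(x.reshape(1, -1))[0, 1]
                  for g in options}
        return max(scores, key=scores.get)

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = np.array(['DISO', 'LIVB', 'CHEM'])[rng.randint(0, 3, 60)]
model = OneVsAllGroupClassifier(['DISO', 'LIVB', 'CHEM'])
model.fit(X, y)
print(model.predict(X[0], options=['DISO', 'LIVB']))
```

Restricting the argmax to the presented options is what distinguishes this from plain one-vs-rest prediction over all 10 groups.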

Measure agreement between the best classifier and Expert

To evaluate the accuracy of the best classifier (OptionAwareNaiveBayesFullContextLeftRightCutoff trained on Medline) on the "Gold standard" we need to measure its agreement with expert annotations.

Expert annotations are stored in this file.

  1. Modify the load_ambiguous_annotations_labeled method from data.py so that it can also load data from this tsv file.
  2. Create a file expert_classifer_agreement.py where you use the function get_mturk_pickled_classifier_agreement from mturk_classifier_agreement.py to get the agreement between the pickled classifier (loaded with joblib.load) and the expert.

expert_classifer_agreement.py should take a classifier pickle and an expert annotation tsv file as parameters and output two numbers:

  1. Agreement with strict answer comparison
  2. Agreement when only useful answers are counted (that is, if an expert says IDK or NONE, exclude this annotation from consideration).
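The two numbers can be sketched as one small function; the label strings and the sample answers are illustrative:

```python
# Sketch of the two agreement numbers: strict agreement over all expert
# answers, and agreement over "useful" answers only (IDK/NONE excluded).
# The label values and sample answers are illustrative.
def agreement(classifier_answers, expert_answers, skip=('IDK', 'NONE')):
    strict_hits = sum(c == e for c, e in zip(classifier_answers, expert_answers))
    useful = [(c, e) for c, e in zip(classifier_answers, expert_answers)
              if e not in skip]
    useful_hits = sum(c == e for c, e in useful)
    return (strict_hits / len(expert_answers),
            useful_hits / len(useful) if useful else float('nan'))

classifier = ['DISO', 'LIVB', 'DISO', 'CHEM']
expert = ['DISO', 'IDK', 'DISO', 'NONE']
print(agreement(classifier, expert))  # (0.5, 1.0)
```

In this toy run the strict agreement is pulled down by the two IDK/NONE expert answers, which is exactly why the second number is reported separately.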

Validate classifiers against the labeled ambiguous data

Implement a procedure to measure agreement of a classifier with the labeled ambiguous data. For that purpose:

  1. The Annotation class should be modified to store ambiguous data as well
  2. A function should be implemented in data.py to deserialize labeled ambiguous data (from MTurk or the expert) into a list of Annotation objects.
  3. A module should be implemented to test classifiers against this data, analogous to cv.py

Implement a window bag of words

Currently, the bag-of-words feature takes the entire context of an ambiguous term as an argument. We need to implement a new feature that only accounts for the k words around the ambiguous term during vectorization.

Technical details:

  1. Refactor code so that feature selection and classification routines are pluggable into class definitions. Probably use mixins.
  2. Subclass CountVectorizer to implement a bag of words window.
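The windowing itself can be sketched independently of the subclassing; the helper below is illustrative, not the repo's ContextRestrictedBagOfWords implementation:

```python
# Sketch of restricting the bag of words to a window of k words around
# the ambiguous term before vectorizing. window_context is an illustrative
# helper, not the repo's actual code; AMBIG marks the ambiguous term.
from sklearn.feature_extraction.text import CountVectorizer

def window_context(text, term, k):
    words = text.split()
    i = words.index(term)
    # k words to the left and k words to the right, term itself excluded.
    return ' '.join(words[max(0, i - k):i] + words[i + 1:i + 1 + k])

sentence = "use should be postponed in patients with AMBIG affecting the stomach"
print(window_context(sentence, "AMBIG", 2))  # 'patients with affecting the'

vec = CountVectorizer()
X = vec.fit_transform([window_context(sentence, "AMBIG", 2)])
print(X.shape[1])  # 4 distinct window words
```

A CountVectorizer subclass would fold this extraction into its preprocessing step so the window size becomes a constructor parameter, as the refactoring in point 1 suggests.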

Make plotting learning curves possible and very easy

In this ticket you have to implement an easy-to-use function for plotting learning curves. The image should be written to the specified file location. The user should be able to call the function without knowing how it works. A good example of a call would be:

plot_curves('output.jpg', passive_learner = [0.2, 0.21, 0.22],
  active_learner = [0.2, 0.23, 0.55])

For every keyword argument (see info about kwargs) this would plot a line with the list index (starting at 1) on the X axis and the list values on the Y axis. For the supplied example it would plot the points (1, 0.2), (2, 0.21), (3, 0.22) in red and (1, 0.2), (2, 0.23), (3, 0.55) in blue, with a legend indicating that red means passive_learner and blue means active_learner. Optional control over graphical parameters could also be useful. Please describe how to use the function in a docstring.

If a plotting library motivates some other argument structure, that's fine; the main thing is that it should be very straightforward and easy to use.

It would be nice to use matplotlib as it is installed on the working server.
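A minimal matplotlib sketch of the requested signature; the axis labels, colors, and the PNG output path are free choices not fixed by the spec:

```python
# Sketch of the requested plot_curves signature using matplotlib. The
# kwargs handling follows the spec above; styling details are free choices.
import matplotlib
matplotlib.use('Agg')  # render off-screen, e.g. on the working server
import matplotlib.pyplot as plt

def plot_curves(path, **curves):
    """Plot one line per keyword argument and write the image to `path`.

    Each keyword value is a list of scores; X is the 1-based list index,
    Y is the score, and the keyword name becomes the legend label.
    """
    plt.figure()
    for name, values in curves.items():
        plt.plot(range(1, len(values) + 1), values, label=name)
    plt.xlabel('iteration')
    plt.ylabel('score')
    plt.legend()
    plt.savefig(path)
    plt.close()

plot_curves('output.png', passive_learner=[0.2, 0.21, 0.22],
            active_learner=[0.2, 0.23, 0.55])
```

Because all curves arrive through **kwargs, adding a third learner to the plot is just one more keyword argument; no signature change is needed.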

Training classifiers on data fraction doesn't work

It looks like train_and_serialize.py produces similar classifiers for any value of the dataset_fraction parameter.

Evidence 1. Pickled classifier files for different fractions have the same size.

Evidence 2. The passive vs. active plots are exactly the same for the active learner and slightly different for the passive one, which indicates that active learning is acting on the same data.

Full dataset:

(plot: weightedpartialfitpassivetransferclassifier2_emea_weight1000)

What's expected to be the 5% fraction:

(plot: weightedpartialfitpassivetransferclassifier2_emea_fraction0 05_weight1000)
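For reference, a minimal sketch of applying a dataset fraction up front, before training, which is the behavior the parameter is expected to have; take_fraction, the string coercion, and the seed are all illustrative, not the repo's code:

```python
# Illustrative sketch: subsample a fraction of the data before training.
# If the fraction were silently ignored (e.g. never applied, or applied
# after fitting), every run would see the full data set. take_fraction
# and its parameters are hypothetical names, not the repo's actual code.
import random

def take_fraction(annotations, dataset_fraction, seed=0):
    # Coerce so a command-line string like '0.05' also works.
    n = int(len(annotations) * float(dataset_fraction))
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    return rng.sample(annotations, n)

data = list(range(1000))
print(len(take_fraction(data, '0.05')))  # 50
```

Checking that the pickled classifier sizes differ across fractions, as Evidence 1 suggests, is a cheap way to verify the fix.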

New feature graph

In the dimensionality reduction section of the result summary there is a graph where each point is a feature set with certain parameters, and its coordinates are accuracy values on EMEA and Medline.

Under that graph, in the re-evaluation section, you can find similar data for the new dataset. The task is to produce a new graph from this data. It should look like the old one: the Pareto front should be highlighted and color-coded according to features.

Fix learning curve labels

The plot_curves function, when called consecutively with different arguments (see active_vs_passive.py), does not clear the Y axis labels. For example:

(plot: plot_medline_39)

Labels should be cleared on every call of the function.
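The usual cause is drawing onto the same implicit figure across calls; clearing the current figure at the start of each call fixes it. A minimal sketch of the fix, with a stripped-down plot_curves (the real function has more arguments):

```python
# Sketch of the fix: clear matplotlib's current figure at the start of
# each call so labels and legends from the previous call don't persist.
# This plot_curves is stripped down for illustration.
import matplotlib
matplotlib.use('Agg')  # off-screen rendering
import matplotlib.pyplot as plt

def plot_curves(path, **curves):
    plt.clf()  # drop axes, labels, and legend left over from the last call
    for name, values in curves.items():
        plt.plot(range(1, len(values) + 1), values, label=name)
    plt.legend()
    plt.savefig(path)

plot_curves('a.png', passive_learner=[0.2, 0.3])
plot_curves('b.png', active_learner=[0.2, 0.5])  # no stale labels from a.png
```

An alternative with the same effect is opening a fresh figure via plt.figure() and closing it with plt.close() after saving.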

Differentiate between left and right contexts in vectorizer

This task is similar to the bigram one in the sense that one also needs to modify ContextRestrictedBagOfWords. Two feature vectors should be created for each annotation instead of one: a bag of words over the left context and a bag of words over the right context. The two vectors should then be joined into one.

To realize this idea, one should work with the feature matrices directly, joining the outputs of two CountVectorizers.
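Joining the two matrices can be sketched with scipy.sparse.hstack; the left/right context strings below are illustrative:

```python
# Sketch of joining left-context and right-context feature matrices with
# scipy.sparse.hstack, as the ticket describes. The context strings are
# illustrative; in the repo they would come from the annotations.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

lefts = ['patients with', 'cases of']            # left contexts, one per annotation
rights = ['affecting the stomach', 'in the gut'] # right contexts, same order

left_vec, right_vec = CountVectorizer(), CountVectorizer()
X_left = left_vec.fit_transform(lefts)
X_right = right_vec.fit_transform(rights)

# One row per annotation; columns are left vocabulary then right vocabulary.
X = hstack([X_left, X_right])
print(X.shape)  # (2, 9): 4 left-vocab columns + 5 right-vocab columns
```

Keeping two separate vectorizers means the same word gets distinct columns depending on which side of the ambiguous term it appears on, which is the point of the task.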

Implement a bigram annotation vectorizer

models.py currently contains a class called ContextRestrictedBagOfWords. It implements two functions, fit_transform and transform, to vectorize annotations.

The task is to make a new version of this class called ContextRestrictedBagOfBigrams that would use word bigrams instead of just words.

Please refer to the sklearn docs on CountVectorizer, specifically the ngram_range parameter.

To test the new class you can just plug it into an existing classifier instead of ContextRestrictedBagOfWords, like this:

class NaiveBayesContextRestricted(AnnotationClassifier):
  def __init__(self, **kwargs):
    # MultinomialNB comes from sklearn.naive_bayes
    self.classifier = MultinomialNB()
    window_size = kwargs.get('window_size', 3)
    self.vectorizer = ContextRestrictedBagOfBigrams(window_size)
