martinthenext / eth_ml
Projects of a Machine Learning ETH team trying to use Mechanical Turk and active learning to solve a word-sense disambiguation task

machine-learning disambiguation computational-linguistics nlp

eth_ml's Introduction

Pool-based active learning for crowdsourcing word-sense disambiguation tasks

Word-sense disambiguation is the task of resolving ambiguity: finding out which of its possible meanings a phrase has in a particular context. An example of a disambiguation task:

Its use should be postponed in patients with Sardinella siccus affecting the stomach or gut.

Does Sardinella siccus in this text mean a type of disorder or a living being?

There are 190 000 cases of ambiguous terms produced by an automated text annotation tool. The goal is to resolve all of them. Training a classifier to perform such tasks requires labeled data. A project is being conducted at the Computational Linguistics Lab of UZH to use crowdsourcing: Amazon Mechanical Turk workers are asked to solve tasks like this:

mantracrowd survey

As of now, tasks are picked randomly from a pool of 190 000 ambiguous cases, and each is solved by at least 3 different workers. The goal of the project is to implement active learning:

  1. Have a classifier to predict phrase meaning from context (solve disambiguation tasks)
  2. Request MTurk workers to solve tasks which are the most informative for training the classifier
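The two steps above form the classic pool-based active learning loop with uncertainty sampling. A minimal sketch, in which the toy data, the logistic regression model, and the oracle labels are all illustrative stand-ins for the real classifier and the MTurk workers:

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# The data, model, and oracle here are toy placeholders, not the project's
# actual classifier or MTurk pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_pool = rng.randn(200, 5)                  # unlabeled pool (toy features)
y_pool = (X_pool[:, 0] > 0).astype(int)     # oracle labels (stand-in for workers)

labeled = list(range(10))                   # small seed set
unlabeled = list(range(10, 200))

clf = LogisticRegression()
for _ in range(5):
    clf.fit(X_pool[labeled], y_pool[labeled])
    # Pick the pool instance the classifier is least certain about.
    probs = clf.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - probs.max(axis=1)
    pick = unlabeled[int(np.argmax(uncertainty))]
    # "Ask a worker": reveal the oracle label and move it to the labeled set.
    labeled.append(pick)
    unlabeled.remove(pick)

print(len(labeled))  # 15 labels after 5 queries
```

Each iteration queries the label whose predicted class probabilities are closest to uniform, which is the "most informative for training the classifier" criterion in its simplest form.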

Data

Unlabeled data: ~195 000 disambiguation tasks

Labeled data:

  • 821 answers to 255 tasks (taken out of these 195 000) by MTurk workers. More answers can be easily retrieved if needed.
  • Up to 16 million non-ambiguous annotations, which can be viewed as tasks with known answers to train the initial classifier

Resources

  1. Applying active learning to supervised word sense disambiguation in MEDLINE. Chen et al., 2012
  2. Active Learning with Amazon Mechanical Turk. Laws et al., EMNLP 2011
  3. Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization. Golovin and Krause, 2011
  4. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization. Chen and Krause, 2013

Results

See final report.

Log of the results can be viewed here.

eth_ml's People

Contributors

aerial1543, go1dshtein, hanveiga, martinthenext


eth_ml's Issues

Further investigate Bag of Words

For the sake of dimensionality reduction, the following variants of ContextRestrictedBagOfWordsLeftRight should be implemented:

  1. One that omits words with low counts. For example, if a word occurs less than 3 times in the whole data set, exclude it from the bag-of-words features. Call it ContextRestrictedBagOfWordsLeftRightCutoff and make the cut-off frequency a parameter of its constructor, just like the window size. Hint: the parameter min_df can be set to 3 in the CountVectorizer options
  2. One that uses English stop words: ContextRestrictedBagOfWordsLeftRightStopWords. Hint: stop_words='english'.

To compare the performance of new vectorizers, create a script prototypes/compare_vectorizers.py that does the following:

  1. Outputs the agreement of the OptionAwareNaiveBayesLeftRight on the given data, just like mturk_classifier_agreement.py
  2. Substitutes the vectorizer in this classifier with the variants described above (Cutoff and StopWords), trains the new classifiers, and outputs the resulting agreement for comparison. Ideally it should output a table with vectorizer names, parameters (like min_df), and the corresponding agreements.

The resulting script should have the same command-line arguments as mturk_classifier_agreement.py.

Implement separate one-vs-all classifier for semantic groups

Fit 10 separate classifiers, one for every semantic group. To classify an annotation instance:

  1. Observe which semantic group options are presented for the ambiguous term: typically 2 or 3
  2. Run the corresponding group-specific classifiers and retrieve the probabilities of the conflicting groups
  3. Assign the group with the highest probability

The simplest classifier that outputs probabilities is logistic regression.

The resulting collection of classifiers should be wrapped into a classifier class.
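The scheme can be sketched as a small wrapper class; the group names, toy features, and class name below are illustrative, not the repo's actual code:

```python
# Sketch of the one-vs-all scheme: one binary logistic regression per
# semantic group, with prediction restricted to the options actually
# presented for the ambiguous term. Group names and data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

class OneVsAllGroupClassifier:
    def __init__(self, groups):
        self.classifiers = {g: LogisticRegression() for g in groups}

    def fit(self, X, y):
        for group, clf in self.classifiers.items():
            clf.fit(X, (y == group).astype(int))  # this group vs. the rest

    def predict(self, x, options):
        # Only score the 2-3 groups offered for this ambiguous term.
        scores = {g: self.classifiers[g].predict_proba(x.reshape(1, -1))[0, 1]
                  for g in options}
        return max(scores, key=scores.get)

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = np.array(['DISO', 'LIVB', 'CHEM'])[rng.randint(0, 3, 60)]
model = OneVsAllGroupClassifier(['DISO', 'LIVB', 'CHEM'])
model.fit(X, y)
print(model.predict(X[0], options=['DISO', 'LIVB']))
```

Restricting the argmax to the presented options is what distinguishes this from plain one-vs-rest prediction over all 10 groups.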

Measure agreement between the best classifier and Expert

To evaluate the accuracy of the best classifier (OptionAwareNaiveBayesFullContextLeftRightCutoff trained on Medline) on the "Gold standard" we need to measure its agreement with expert annotations.

Expert annotations are stored in this file.

  1. Modify the load_ambiguous_annotations_labeled method from data.py so that it can also load data from this tsv file.
  2. Create a file expert_classifer_agreement.py where you use the function get_mturk_pickled_classifier_agreement from mturk_classifier_agreement.py to get the agreement between the pickled classifier (loaded with joblib.load) and the expert.

expert_classifer_agreement.py should take a classifier pickle and an expert annotation tsv file as parameters and output two numbers:

  1. Agreement with strict answer comparison
  2. Agreement when only useful answers are counted (that is, if an expert says IDK or NONE, exclude this annotation from consideration).
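The two numbers can be sketched as one small function; the label strings and the sample answers are illustrative:

```python
# Sketch of the two agreement numbers: strict agreement over all expert
# answers, and agreement over "useful" answers only (IDK/NONE excluded).
# The label values and sample answers are illustrative.
def agreement(classifier_answers, expert_answers, skip=('IDK', 'NONE')):
    strict_hits = sum(c == e for c, e in zip(classifier_answers, expert_answers))
    useful = [(c, e) for c, e in zip(classifier_answers, expert_answers)
              if e not in skip]
    useful_hits = sum(c == e for c, e in useful)
    return (strict_hits / len(expert_answers),
            useful_hits / len(useful) if useful else float('nan'))

classifier = ['DISO', 'LIVB', 'DISO', 'CHEM']
expert = ['DISO', 'IDK', 'DISO', 'NONE']
print(agreement(classifier, expert))  # (0.5, 1.0)
```

In this toy run the strict agreement is pulled down by the two IDK/NONE expert answers, which is exactly why the second number is reported separately.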

Validate classifiers against the labeled ambiguous data

Implement a procedure to measure agreement of a classifier with the labeled ambiguous data. For that purpose:

  1. The Annotation class should be modified to store ambiguous data as well
  2. A function should be implemented in data.py to deserialize labeled ambiguous data (from MTurk or the expert) into a list of Annotation objects.
  3. A module should be implemented to test classifiers against this data, analogous to cv.py

Implement a window bag of words

Currently, the bag-of-words feature takes the entire context of an ambiguous term as an argument. We need to implement a new feature that only accounts for the k words around the ambiguous term during vectorization.

Technical details:

  1. Refactor code so that feature selection and classification routines are pluggable into class definitions. Probably use mixins.
  2. Subclass CountVectorizer to implement a bag of words window.
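The windowing itself can be sketched independently of the subclassing; the helper below is illustrative, not the repo's ContextRestrictedBagOfWords implementation:

```python
# Sketch of restricting the bag of words to a window of k words around
# the ambiguous term before vectorizing. window_context is an illustrative
# helper, not the repo's actual code; AMBIG marks the ambiguous term.
from sklearn.feature_extraction.text import CountVectorizer

def window_context(text, term, k):
    words = text.split()
    i = words.index(term)
    # k words to the left and k words to the right, term itself excluded.
    return ' '.join(words[max(0, i - k):i] + words[i + 1:i + 1 + k])

sentence = "use should be postponed in patients with AMBIG affecting the stomach"
print(window_context(sentence, "AMBIG", 2))  # 'patients with affecting the'

vec = CountVectorizer()
X = vec.fit_transform([window_context(sentence, "AMBIG", 2)])
print(X.shape[1])  # 4 distinct window words
```

A CountVectorizer subclass would fold this extraction into its preprocessing step so the window size becomes a constructor parameter, as the refactoring in point 1 suggests.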

Make plotting learning curves possible and very easy

In this ticket you have to implement an easy-to-use function for plotting learning curves. The image should be written to the specified file location. The user should be able to call the function without knowing how it works. A good example of a call would be:

plot_curves('output.jpg', passive_learner = [0.2, 0.21, 0.22],
  active_learner = [0.2, 0.23, 0.55])

For every keyword argument (see info about kwargs) this would plot a line with the list index (starting at 1) on the X axis and the list values on the Y axis. For the supplied example it would plot the points (1, 0.2), (2, 0.21), (3, 0.22) in red and (1, 0.2), (2, 0.23), (3, 0.55) in blue, with a legend indicating that red means passive_learner and blue means active_learner. Optional control over graphical parameters could also be useful. Please describe how to use the function in a docstring.

If a plotting library motivates some other argument structure, that's fine; the main thing is that it should be very straightforward and easy to use.

It would be nice to use matplotlib as it is installed on the working server.
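A minimal matplotlib sketch of the requested signature; the axis labels, colors, and the PNG output path are free choices not fixed by the spec:

```python
# Sketch of the requested plot_curves signature using matplotlib. The
# kwargs handling follows the spec above; styling details are free choices.
import matplotlib
matplotlib.use('Agg')  # render off-screen, e.g. on the working server
import matplotlib.pyplot as plt

def plot_curves(path, **curves):
    """Plot one line per keyword argument and write the image to `path`.

    Each keyword value is a list of scores; X is the 1-based list index,
    Y is the score, and the keyword name becomes the legend label.
    """
    plt.figure()
    for name, values in curves.items():
        plt.plot(range(1, len(values) + 1), values, label=name)
    plt.xlabel('iteration')
    plt.ylabel('score')
    plt.legend()
    plt.savefig(path)
    plt.close()

plot_curves('output.png', passive_learner=[0.2, 0.21, 0.22],
            active_learner=[0.2, 0.23, 0.55])
```

Because all curves arrive through **kwargs, adding a third learner to the plot is just one more keyword argument; no signature change is needed.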

Training classifiers on data fraction doesn't work

It looks like train_and_serialize.py produces similar classifiers for any value of the dataset_fraction parameter.

Evidence 1. Pickled classifier files for different fractions have the same size.

Evidence 2. The passive vs. active plots are exactly the same for the active learner and slightly different for the passive one, which indicates that active learning is acting on the same data.

Full dataset:

(plot: weightedpartialfitpassivetransferclassifier2_emea_weight1000)

What's expected to be the 5% fraction:

(plot: weightedpartialfitpassivetransferclassifier2_emea_fraction0 05_weight1000)
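For reference, a minimal sketch of applying a dataset fraction up front, before training, which is the behavior the parameter is expected to have; take_fraction, the string coercion, and the seed are all illustrative, not the repo's code:

```python
# Illustrative sketch: subsample a fraction of the data before training.
# If the fraction were silently ignored (e.g. never applied, or applied
# after fitting), every run would see the full data set. take_fraction
# and its parameters are hypothetical names, not the repo's actual code.
import random

def take_fraction(annotations, dataset_fraction, seed=0):
    # Coerce so a command-line string like '0.05' also works.
    n = int(len(annotations) * float(dataset_fraction))
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    return rng.sample(annotations, n)

data = list(range(1000))
print(len(take_fraction(data, '0.05')))  # 50
```

Checking that the pickled classifier sizes differ across fractions, as Evidence 1 suggests, is a cheap way to verify the fix.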

New feature graph

In the dimensionality reduction section of the result summary there is a graph where each point is a feature set with certain parameters, and its coordinates are accuracy values on EMEA and Medline.

Under that graph, in the re-evaluation section, you can find similar data for the new dataset. The task is to produce a new graph from this data. It should look like the old one: the Pareto front should be highlighted and color-coded according to features.

Fix learning curve labels

The plot_curves function, when called consecutively with different arguments (see active_vs_passive.py), does not clear the Y axis labels. For example:

(plot: plot_medline_39)

Labels should be cleared on every call of the function.
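The usual cause is drawing onto the same implicit figure across calls; clearing the current figure at the start of each call fixes it. A minimal sketch of the fix, with a stripped-down plot_curves (the real function has more arguments):

```python
# Sketch of the fix: clear matplotlib's current figure at the start of
# each call so labels and legends from the previous call don't persist.
# This plot_curves is stripped down for illustration.
import matplotlib
matplotlib.use('Agg')  # off-screen rendering
import matplotlib.pyplot as plt

def plot_curves(path, **curves):
    plt.clf()  # drop axes, labels, and legend left over from the last call
    for name, values in curves.items():
        plt.plot(range(1, len(values) + 1), values, label=name)
    plt.legend()
    plt.savefig(path)

plot_curves('a.png', passive_learner=[0.2, 0.3])
plot_curves('b.png', active_learner=[0.2, 0.5])  # no stale labels from a.png
```

An alternative with the same effect is opening a fresh figure via plt.figure() and closing it with plt.close() after saving.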

Differentiate between left and right contexts in vectorizer

This task is similar to the bigram one in the sense that one also needs to modify ContextRestrictedBagOfWords. Two feature vectors should be created for each annotation instead of one: a bag of words over the left context and a bag of words over the right context. The two vectors should then be joined into one.

To realize this idea, one should work with the feature matrices directly, joining the outputs of two CountVectorizers.
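Joining the two matrices can be sketched with scipy.sparse.hstack; the left/right context strings below are illustrative:

```python
# Sketch of joining left-context and right-context feature matrices with
# scipy.sparse.hstack, as the ticket describes. The context strings are
# illustrative; in the repo they would come from the annotations.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

lefts = ['patients with', 'cases of']            # left contexts, one per annotation
rights = ['affecting the stomach', 'in the gut'] # right contexts, same order

left_vec, right_vec = CountVectorizer(), CountVectorizer()
X_left = left_vec.fit_transform(lefts)
X_right = right_vec.fit_transform(rights)

# One row per annotation; columns are left vocabulary then right vocabulary.
X = hstack([X_left, X_right])
print(X.shape)  # (2, 9): 4 left-vocab columns + 5 right-vocab columns
```

Keeping two separate vectorizers means the same word gets distinct columns depending on which side of the ambiguous term it appears on, which is the point of the task.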

Implement a bigram annotation vectorizer

models.py currently contains a class called ContextRestrictedBagOfWords. It implements two functions, fit_transform and transform, to vectorize annotations.

The task is to make a new version of this class called ContextRestrictedBagOfBigrams that would use word bigrams instead of just words.

Please refer to the sklearn docs on CountVectorizer, specifically the ngram_range parameter.

To test the new class you can just plug it into an existing classifier instead of ContextRestrictedBagOfWords, like this:

class NaiveBayesContextRestricted(AnnotationClassifier):
  def __init__(self, **kwargs):
    # MultinomialNB comes from sklearn.naive_bayes
    self.classifier = MultinomialNB()
    window_size = kwargs.get('window_size', 3)
    self.vectorizer = ContextRestrictedBagOfBigrams(window_size)
