
Headless Horseman Page Classifier

It gets pages and their labels from Sitehound (previously The Headless Horseman, or THH) via a Kafka queue, trains a model, and sends back both the model and a quality report. The user of THH may then label more pages, allowing the classifier to reach higher accuracy.

Incoming message example ("relevant": null marks an unlabeled page):

{
  "id": "some id that will be returned in the answer message",
  "pages": [
    {
      "url": "http://example.com",
      "html": "<h1>hi</h1>",
      "relevant": true
    },
    {
      "url": "http://example.com/1",
      "html": "<h1>hi 1</h1>",
      "relevant": false
    },
    {
      "url": "http://example.com/2",
      "html": "<h1>hi 2</h1>",
      "relevant": null
    }
  ]
}
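
For illustration, here is a minimal sketch of submitting such a request with kafka-python. The input topic name ("dd-modeler-input") is an assumption; check the service configuration for the real one:

import json
from uuid import uuid4

from kafka import KafkaProducer  # kafka-python

# Hypothetical input topic name -- not specified in this README.
TOPIC = 'dd-modeler-input'

producer = KafkaProducer(
    bootstrap_servers='hh-kafka:9092',
    # Messages are JSON-encoded, as in the example above.
    value_serializer=lambda v: json.dumps(v).encode('utf8'),
)
request = {
    'id': str(uuid4()),  # echoed back in the answer message
    'pages': [
        {'url': 'http://example.com', 'html': '<h1>hi</h1>', 'relevant': True},
        {'url': 'http://example.com/1', 'html': '<h1>hi 1</h1>', 'relevant': False},
        {'url': 'http://example.com/2', 'html': '<h1>hi 2</h1>', 'relevant': None},
    ],
}
producer.send(TOPIC, request)
producer.flush()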

Outgoing message with trained model:

{
  "id": "the same id",
  "quality": "{ ... }",
  "model": "b64-encoded page classifier model"
}
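
A sketch of handling such a reply; it assumes the b64-encoded payload is a pickled model object, which may not match the serialization the service actually uses:

import base64
import json
import pickle

def handle_reply(message: dict) -> None:
    # "quality" is a JSON-encoded string (see the example below).
    quality = json.loads(message['quality'])
    for item in quality['description']:
        print(item['heading'], '-', item['text'])
    # Assumption: the model is a pickled object under the b64 encoding.
    model = pickle.loads(base64.b64decode(message['model']))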

The quality field is a JSON-encoded string. Here is an example:

{
 "advice": [
  {
   "kind": "Warning",
   "text": "The quality of the classifier is not very good, ROC AUC is just 0.67. Consider labeling more pages, or re-labeling them using different criteria."
  }
 ],
 "description": [
  {"heading": "Dataset", "text": "183 documents, 183 with labels (100%) across 129 domains."},
  {"heading": "Class balance", "text": "40% relevant, 60% not relevant."},
  {"heading": "Metrics", "text": ""},
  {"heading": "Accuracy", "text": "0.628 ± 0.087"},
  {"heading": "F1", "text": "0.435 ± 0.140"},
  {"heading": "ROC AUC", "text": "0.666 ± 0.127"}
 ],
 "tooltips": {
  "Accuracy": "Accuracy is the ratio of pages classified correctly as relevant or not relevant. This metric is easy to interpret but not very good for unbalanced datasets.",
  "F1": "F1 score is a combination of recall and precision for detecting relevant pages. It shows how good is a classifier at detecting relevant pages at default threshold.Worst value is 0.0 and perfect value is 1.0.",
  "ROC AUC": "Area under ROC (receiver operating characteristic) curve shows how good is the classifier at telling relevant pages from non-relevant at different thresholds. Random classifier has ROC&nbsp;AUC&nbsp;=&nbsp;0.5, and a perfect classifier has ROC&nbsp;AUC&nbsp;=&nbsp;1.0."
 },
 "weights": {
  "neg": [
   {
    "feature": "<BIAS>",
    "hsl_color": "hsl(0, 100.00%, 88.77%)",
    "weight": -1.5918805437501728
   }
  ],
  "neg_remaining": 4006,
  "pos": [
   {
    "feature": "2015",
    "hsl_color": "hsl(120, 100.00%, 80.00%)",
    "weight": 3.630274967418529
   }
  ],
  "pos_remaining": 4513
 }
}

Outgoing message with progress (dd-modeler-progress queue):

{
  "id": "some id",
  "percentage_done": 98.123
}
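
A minimal sketch of watching this topic with kafka-python (the library choice is an assumption; the topic name comes from above):

import json

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    'dd-modeler-progress',
    bootstrap_servers='hh-kafka:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf8')),
)
for message in consumer:
    progress = message.value
    print('{}: {:.1f}% done'.format(progress['id'], progress['percentage_done']))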

Usage

Run the service, passing the THH Kafka host (add hh-kafka to /etc/hosts if running on a different network):

hh-page-clf-service --kafka-host hh-kafka

An LDA model trained on 500k dmoz pages (with bigrams and 100k features) and a set of random pages (a 1k sample of the Alexa top 1M) are available at s3://darpa-memex/thh/ as lda.pkl and random-pages.jl.gz.

Pass the path to the random pages via the --random-pages argument, and the path to the LDA model via the --lda argument. Note that the LDA model is optional and is disabled by default. It can give a very slight improvement in accuracy and produce some sensible-looking topics, but it also slows down training and prediction quite a bit and requires more memory.

The random pages are a small sample (about 1k) of random pages in the same format as the input pages (with "url" and "html" fields); they are used as negative examples during training.
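
The .jl.gz extension here is gzipped JSON lines. A sketch of producing such a file yourself; the source of pages is hypothetical:

import gzip
import json

def dump_random_pages(pages, path='random-pages.jl.gz'):
    # pages: an iterable of (url, html) pairs from any source you have.
    # One JSON object per line, with "url" and "html" fields.
    with gzip.open(path, 'wt', encoding='utf8') as f:
        for url, html in pages:
            f.write(json.dumps({'url': url, 'html': html}) + '\n')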

An LDA model was trained on a large number of random pages. Its features are used in addition to the text features from the page. You may build an LDA model yourself (see the command line options; good results can be obtained with 300 topics and bigrams):

train-lda text-items.jl.gz lda.joblib

For faster loading, it is recommended to re-dump the model with pickle (joblib can load pickled data as well).
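
For example (a minimal sketch; the file names match the examples above):

import pickle

import joblib

# Load the joblib-saved model and re-dump it with pickle.
model = joblib.load('lda.joblib')
with open('lda.pkl', 'wb') as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)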

Building docker image

Building does not require anything special, just check out the project and run:

docker build -t hh-page-clf .

Accuracy testing

If you have some datasets in JSON format (they may be gzipped), you can check accuracy, eli5 explanations and serialization by running:

hh-page-clf-train my-dataset.json.gz --lda lda.pkl

or even run it on several datasets and see an aggregate accuracy report:

hh-page-clf-train datasets/*.json.gz --lda lda.pkl

Testing

Install pytest and pytest-cov.

Start Kafka with ZooKeeper:

docker run --rm -p 2181:2181 -p 9092:9092 \
    --env ADVERTISED_HOST=127.0.0.1 \
    --env ADVERTISED_PORT=9092 \
    spotify/kafka

Run tests:

py.test --doctest-modules \
    --cov=hh_page_clf --cov-report=term --cov-report=html \
    --ignore=hh_page_clf/pretraining \
    tests hh_page_clf

Cleaning the Kafka queues at the start of tests/test_service.py can sometimes hang; just try again.

