
govuk-taxonomy-supervised-machine-learning

Automatically tag content to taxons using machine learning algorithms.

Requirements

  • Python 3.4.6
  • See base-requirements.txt for Python dependencies.
  • The Amazon Web Services (AWS) command line interface (CLI); see below.

Setting environment variables

A number of environment variables need to be set before running the cleaning scripts on your system:

| ENV VAR | Description | Nominal value |
| --- | --- | --- |
| DATADIR | Path to the directory storing the data | ./data (relative to the root of the repository -- you may need to set an absolute path) |
| LOGGING_CONFIG | Path to the logging configuration file | ./python/logging.conf (relative to the root of the repository -- you may need to set an absolute path) |
| S3BUCKET | Path of the S3 bucket in which the data are stored | s3://buod-govuk-taxonomy-supervised-learning |
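
For example, using the nominal values from the table above (adjust the paths to your checkout, or use absolute paths):

export DATADIR=./data
export LOGGING_CONFIG=./python/logging.conf
export S3BUCKET=s3://buod-govuk-taxonomy-supervised-learning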

Preparing your python environment

The Makefile assumes that the python3 command points to the correct distribution of Python, which was 3.4.6 in development. To install the package dependencies, run make pip_install from the project root.
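
For example, from the project root (the version check is only a sanity check that python3 points at the expected interpreter):

python3 --version   # 3.4.6 was used in development
make pip_install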

Getting the data

The taxonomy pipeline script runs on the GOV.UK Deploy Jenkins machine: https://deploy.publishing.service.gov.uk/job/govuk_taxonomy_supervised_learning/

It runs every weekday starting at 2 AM and usually takes a long time to finish.

The content.json.gz and taxon.json.gz files are the raw data files downloaded from the live site; they can be copied locally using scp:

scp deploy.publishing.service.gov.uk:/var/lib/jenkins/workspace/govuk_taxonomy_supervised_learning/data/* .

These files then need to be moved to DATADIR.
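
Assuming DATADIR is set as above, you can copy the files straight into it instead:

mkdir -p $DATADIR
scp deploy.publishing.service.gov.uk:/var/lib/jenkins/workspace/govuk_taxonomy_supervised_learning/data/* $DATADIR/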

Running the cleaning scripts

After setting the environment variables and saving the raw data files in DATADIR, running make will download the data and launch the cleaning scripts in order (see the example after the table below). The following files are created by the various cleaning scripts:

| source filename (data/) | output filename (data/) | produced by (python/) |
| --- | --- | --- |
| taxons.json.gz | clean_taxons.csv.gz | clean_taxons.py |
| content.json.gz | clean_content.csv | clean_content.py |
| clean_taxons.csv.gz; clean_content.csv; content_to_taxon_map.csv | untagged.csv.gz | create_labelled.py |
| clean_taxons.csv.gz; clean_content.csv; content_to_taxon_map.csv | empty_taxons.csv.gz | create_labelled.py |
| clean_taxons.csv.gz; clean_content.csv; content_to_taxon_map.csv | labelled.csv.gz | create_labelled.py |
| clean_taxons.csv.gz; clean_content.csv; content_to_taxon_map.csv | labelled_level1.csv.gz | create_labelled.py |
| clean_taxons.csv.gz; clean_content.csv; content_to_taxon_map.csv | labelled_level2.csv.gz | create_labelled.py |
| labelled*.csv.gz | *arrays.npz | dataprep.py |
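
For example, with the environment variables set and the raw files in DATADIR, a full run and a quick check of the outputs looks like:

make
ls $DATADIR   # should now include clean_taxons.csv.gz, clean_content.csv and the labelled*.csv.gz files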

The following schematic describes the movement of data through the pipeline, and the role of each of the scripts.

[Schematic: flow of data through the cleaning pipeline]

The cleaned files are used by the python notebooks contained in python/notebooks.

Jupyter notebooks

Setting up a Jupyter kernel

You should use your virtualenv when running Jupyter notebooks. Follow these steps:

Install the ipython kernel module into your virtualenv

workon my-virtualenv-name  # activate your virtualenv, if you haven't already
pip install ipykernel

Now run the kernel "self-install" script:

python -m ipykernel install --user --name=my-virtualenv-name

Replace the --name parameter as appropriate.

You should now be able to see your kernel in the IPython notebook menu: Kernel -> Change kernel, and be able to switch to it (you may need to refresh the page before it appears in the list). IPython will remember which kernel to use for that notebook from then on.

Notebooks

| Name | Activity | Data inputs | Data outputs |
| --- | --- | --- | --- |
| EDA-count-data | Read in and count data files | untagged_content.csv, clean_taxons.csv, clean_content.csv.gz, labelled.csv, filtered.csv, empty_taxons.csv, old_tags.csv | None |
| EDA-taxons | Descriptive analysis of taxon content overall, and according to level | labelled, filtered, taxons | level2taxons_concordant.csv, taggedtomorethan10taxons.csv |
| EDA-document-type | Descriptive analysis of content according to document type, over time | untagged, labelled, filtered, labelled_level1, labelled_level2 | document_type_group_lookup.json |
| EDA-other-metadata | Descriptive analysis of content according to metadata types, over time | untagged, labelled, filtered, labelled_level1, labelled_level2 | None |

Machine learning notebooks (ML_notebooks)

| Name | Activity | Data inputs |
| --- | --- | --- |
| CNN-allgovuk.ipynb | Convolutional neural network of tagged content using the Keras framework and pre-trained word embeddings | clean_content.csv.gz, clean_taxons.csv |
| SVM_allgovuk.ipynb | Support vector machine of tagged content | |
| TPOT_allgovuk.ipynb | Genetic algorithm to select the optimal algorithm and hyperparameters | |

Archived notebooks

| Name | Activity | Data inputs | Data outputs |
| --- | --- | --- | --- |
| EDA | Exploratory data analysis | untagged_content.csv, clean_taxons.csv, clean_content.csv.gz | None |
| clean_content.ipynb | Development of the steps to process raw content data into formats for use in EDA and modelling. These are now used in clean_content.py, which is called by the Makefile | | |
| explore_content_dupes.ipynb | Understand duplicates in GOV.UK content items | raw_content.json, clean_content.csv | None |

Logging

The default logging configuration used by the data transformation pipeline (set in ./python/logging.conf) will do the following things:

  • Write a simple log to stdout (console) at INFO level
  • Write a more detailed log to a file at DEBUG level (by default /tmp/govuk-taxonomy-supervised-learning.log; see below for how to follow it).
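
To follow the detailed log while the pipeline is running:

tail -f /tmp/govuk-taxonomy-supervised-learning.log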

Setting up Tensorflow/Keras on GPU backed instances on AWS

Setting up GPU-backed instances on AWS is greatly facilitated by using databox. Currently the features required to create deep learning instances are in Pull Request 31. Once these are merged into master you can operate databox from master; for now you will need to git checkout feature/playbook_argument. Once you have databox and all its dependencies installed, the following command will launch an instance prepared for deep learning on AWS:

./databox.sh -a ami-1812bb61 -r eu-west-1 -i p2.xlarge -s snap-04eb15f2e4faee97a -p playbooks/govuk-taxonomy-supervised-learning.yml up

The arguments are explained in the table below:

| Argument | Value | Description |
| --- | --- | --- |
| -a | ami-1812bb61 | The Conda-based Amazon Machine Image. Other options are explained by Amazon. |
| -i | p2.xlarge | The smallest of the deep learning instance types. More information is available here. Note that the deep learning AMIs may not work with the newer p3 GPU instances. |
| -s | snap-04eb15f2e4faee97a | The id of the snapshot containing the taxonomy data. This can be checked at the AWS console. |
| -p | playbooks/govuk-taxonomy-supervised-learning.yml | The Ansible playbook describing the deployment tasks required to set up the instance. |
| -r | eu-west-1 | The region in which the instance will be deployed. At present this must be set to eu-west-1 (Ireland), as some deep learning instance types are not available in the eu-west-2 (London) region, and the snapshot is currently in eu-west-1 (although it could be copied elsewhere). |

Once the instance has launched, you will need to run the following commands:

  • SSH tunnel into the instance with ssh -L localhost:8888:localhost:8888 ubuntu@$(terraform output ec2_ip)
  • Open tmux to ensure that any operations do not fail if you disconnect
  • Activate the tensorflow_p36 environment and run jupyter notebook on the instance:
tmux
source activate tensorflow_p36
jupyter notebook

This will start a notebook server; a link to it will be printed in the console.

  • Log in to the notebook server from your local machine by copying the link generated on the server into a browser. This gives you access to the Jupyter notebooks running on the instance.

Tensorboard

  • To run TensorBoard, ensure that the TensorBoard callback has been enabled in the model, then log into the instance again in a new terminal, creating a new tunnel with ssh -L localhost:6006:localhost:6006 ubuntu@$(terraform output ec2_ip) (the full sequence is collected in the sketch after this list).
  • Open tmux to ensure the task continues running even if you disconnect.
  • Activate the tensorflow_p36 environment with source activate tensorflow_p36.
  • Run tensorboard --logdir=<path to logging>.
  • Open a browser on your local machine and navigate to http://localhost:6006 to access the TensorBoard server.
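
Collected together, the TensorBoard steps look roughly like this (the logging path is whatever you pointed the TensorBoard callback at):

# on your local machine: open a second tunnel for port 6006
ssh -L localhost:6006:localhost:6006 ubuntu@$(terraform output ec2_ip)

# on the instance
tmux
source activate tensorflow_p36
tensorboard --logdir=<path to logging>

Then browse to http://localhost:6006 on your local machine.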

Check that the GPU is doing the work

  • Ensure that your model is running on the instance GPU by running nvidia-smi in a new terminal on the instance (you can run this repeatedly with watch -n 10 nvidia-smi to update every 10 seconds).

Contributors

1pretz1, cbaines, ff-l, ivyleavedtoadflax, koetsier, oscarwyatt, surminus, thomasleese, tijmenb


govuk-taxonomy-supervised-learning's Issues

./databox.sh fails on ubuntu 16.04

When running ./databox.sh -a ami-1812bb61 -r eu-west-1 -i p2.xlarge up on Ubuntu 16.04 and zsh, I get the following error:

./databox.sh: 3: ./databox.sh: [[7: not found
DataBox - create and destroy AWS instances for Data Science
./databox.sh up - Create a DataBox
 -- options -- 
  -r|--region - AWS region
  -u|--username - Username used to name the AWS objects
  -i|--instance - AWS instance type
  -v|--volume_size - EBS volume size
  -a|--ami_id - AMI id
./databox.sh down - Destroy the DataBox

create_unlabelled_predictions_meta as script and task to makefile

Currently create_unlabelled_predictions_meta requires a lot of memory to run in a notebook. It produces a dataframe where each row is a content_item/taxon pair: the probability of the content item being in that taxon, plus all of the data about that content item.

There is no requirement for this to be in a notebook; it can be run as a script. This task could be added to the Makefile so that these outputs are available for evaluating the model as part of the pipeline.

The output of this is used in the predictions evaluation scripts.

Algorithm V2.0.0 needs to be run as script

Currently, after running make and then the dataprep scripts, the CNN_v2.0.0_save_model_predict.ipynb notebook is run to train the model, predict on the train and dev sets, and then load new_content and labelled_level1_content to predict on those as well.

Multiple outputs are saved from this notebook; the model and its weights are not yet among them, but should be added.

Since algorithm tuning is paused, this notebook should be converted to a script.

dataprep.py and new_dataprep.py need to be added to Makefile

Currently, these need to be run after running make in order to run the v2.0.0_save_model_predict.ipynb:

python python/dataprep.py
python python/new_dataprep.py --untagged_filename 'new_content.csv.gz' --outarrays_filename 'new'

python python/new_dataprep.py --untagged_filename 'labelled_level1.csv.gz' --outarrays_filename 'level1'

These tasks should be contained within the Makefile and completed with make all.

Steps for deep learning on AWS

Some thoughts about running deep learning with tensorflow/keras on AWS

Workflow:

  • Instantiate databox with conda ami (./databox.sh -a ami-1812bb61 -r eu-west-1 -i p2.xlarge up)
  • DataBoxVolume will fail to attach as some development tasks fail on this ami (see issue). Log into the AWS console and delete this volume (or leave it to be deleted when you run ./databox -r eu-west-1 down).
  • Load the govuk-taxonomy-supervised-learning snapshot into a volume (through AWS console) and attach to the DataBox.
  • Mount the data volume at /data (see the sketch after this list).
  • SSH tunnel into the instance with ssh -L localhost:8888:localhost:8888 ubuntu@<databox IP>
  • Make a copy of the conda env tensorflow_p36 before making any changes, with conda create --name <new env name> --clone tensorflow_p36. NOTE: this takes a really long time!
  • Set env vars (DATADIR) in new environment (consider installing direnv to make life simpler!)
  • Install requirements into the new environment. Do this manually: don't run pip install -r requirements.txt, which will break the installation. Run the notebook, see what is missing, and then install that using pip install. Then create a new requirements file that is up to date. NOTE: it should only need sklearn.
  • Run jupyter notebook on instance
  • Log in to notebook server on your local machine by copying and pasting link generated on the server.
  • To run TensorBoard, ensure that the TensorBoard callback has been enabled in the model, then log into the instance again in a new terminal, creating a new tunnel with ssh -L localhost:6006:localhost:6006 ubuntu@<databox IP>. Activate the environment you created earlier (with conda create ... --clone), then run tensorboard --logdir=<path to logging>.
  • Ensure that your model is running on the instance GPU by running nvidia-smi in a new terminal on the instance (you can run this repeatedly with watch -n 10 nvidia-smi to update every 10 seconds).
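
A minimal sketch of the volume-mount step above, assuming the snapshot volume appears as /dev/xvdf (the device name depends on how the volume was attached, so check lsblk first):

lsblk                        # identify the attached data volume
sudo mkdir -p /data
sudo mount /dev/xvdf /data   # /dev/xvdf is an assumption; use the device lsblk reports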

Thoughts going forward:

  • Move all generic pipeline tasks out of the deep learning notebooks, and into the data prep pipeline. The only thing that the deep learning instances should be doing is deep learning, no other data preparation. Preprocessed data can instead just be saved on S3 or on a volume snapshot.
  • Split the data prep pipeline and the deep learning notebooks into separate repos. These are likely to have different requirements, and we will not need to run deep learning on GOV.UK infrastructure.

Questions:

  • Where should the data be stored?
    • An S3 Bucket?
    • A volume/snapshot
    • Both? (present situation)
  • Should the pipeline be executed on a remote instance, or run locally then uploaded?
  • Do we need to output so many data cuts from the pipeline? Is compressed .csv.gz the right format (we could also use JSON or pickles)? Do we lose any information (e.g. dtypes) by flattening to CSV and then re-importing, and would it be better to keep the data in a more informative format?

Notes on a first run:

  • Training an epoch is very fast (<2 minutes) for this notebook - there is a version fixed for AWS added in #39.
  • The end-of-epoch evaluation is incredibly slow, at least ten minutes after each epoch. It runs on the instance CPU (and so may be faster on a bigger instance), but we could also reduce the frequency of this task.

Discuss: GO

Cache tokens during model runs

Tokenising texts during model runs is slow, and could be sped up by caching the tokens and only re-running tokenisation when the data is updated. This could be included in the data preparation pipeline, in which case the Makefile could be used to determine whether or not the step should be repeated.
