This project is a fork of nasa/concept-tagging-training.

Contains code for training NLP models that take in text and predict concepts and keywords from a list of standardized NASA keywords. Code for the API that uses models trained by this repo is in the `concept-tagging-api` repository.

License: MIT License

Concept Tagging Training

This software enables the creation of concept classifiers, to be utilized by an accompanying service. If you don't have your own data to train on, you can use the pretrained models described here. This project was written about here for the Federal Data Strategy Incubator Project.

What is Concept Tagging

By concept tagging, we mean that you can supply text, for example:

"Volcanic activity, or volcanism, has played a significant role in the geologic evolution of Mars. Scientists have known since the Mariner 9 mission in 1972 that volcanic features cover large portions of the Martian surface."

and get back predicted keywords, such as volcanology, mars surface, and structural properties, as well as topics such as space sciences and geosciences, drawn from a standardized list of several thousand NASA concepts, with a probability score for each prediction.

Requirements

You can see a list of options for this project by navigating to the root of the project and executing make or make help.

This project requires make, plus either Docker or Python 3.7, depending on which installation option you choose below.

Index:

  1. installation
  2. how to run
  3. managing experiments
  4. advanced usage

installation

You have several options for installing and using the pipeline.

  1. pull existing docker image
  2. build docker image from source
  3. install in python virtual environment

pull existing docker image

You can just pull a stable docker image which has already been made:

docker pull storage.analytics.nasa.gov/abuonomo/concept_trainer:stable

In order to do this, you must be on the NASA network and able to connect to the https://storage.analytics.nasa.gov docker registry.

* There are several versions of the images. You can see them here. If you don't use the "stable" tag, some or all of this guide may not work properly.
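
To confirm the pull succeeded, you can list the image locally (plain docker; nothing project-specific assumed):

docker image ls storage.analytics.nasa.gov/abuonomo/concept_trainer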

build docker image from source

To build from source, first clone this repository and go to its root.

Then build the docker image using:

docker build -t concept_trainer:example .

Replace concept_trainer:example with whatever name you would like. Keep this image name in mind; it will be used elsewhere.

* If you are actively developing this project, look at the make build command in the Makefile. It automatically tags the image with the current commit URL and the most recent git tag, and it requires that setuptools-scm is installed.
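
If you want to approximate that tagging by hand, a minimal sketch (assuming a git describe-style tag; the Makefile's exact format may differ):

docker build -t concept_trainer:$(git describe --tags --always) .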

install in python virtual environment

* Tested with Python 3.7.

First, clone this repository. Then create and activate a virtual environment. For example, using venv:

python -m venv my_env
source my_env/bin/activate

Next, while in the root of this project, run make requirements.
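
As a quick sanity check that the environment is active and on the interpreter noted above:

python --version   # expect Python 3.7.x
pip list | head    # packages installed by make requirements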

how to run

The pipeline takes input document metadata structured like this and a config file like this. The pipeline produces interim data, models, and reports.
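
To peek at the structure of one input record, you can pretty-print the first line of the sample corpus (using head and Python's built-in json.tool, nothing project-specific):

head -n 1 data/raw/STI_public_metadata_records_sample100.jsonl | python -m json.tool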

  1. using docker -- if you pulled or built the image
  2. using python in virtual environment -- if you are running in a local virtual environment

using docker

First, make sure the config, data, data/raw, data/interim, models, and reports directories exist. If they do not, create them, as shown below. These directories will be used as docker mounted volumes. If you don't make them beforehand, docker will create them later on, but their permissions will be unnecessarily restrictive.
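
A single command that creates all of them (-p creates parents and skips directories that already exist):

mkdir -p config data/raw data/interim models reports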

Next, make sure you have your input data in the data/raw/ directory. Here is an example file with the proper structure. You also need to make sure the subj_mapping.json file here is in the data/interim/ directory.

Now, make sure you have a config file in the config directory. Here is an example config which will work with the above example file.

With these files in place, you can run the full pipeline with this command:

docker run -it \
    -v $(pwd)/data:/home/data \
    -v $(pwd)/models:/home/models \
    -v $(pwd)/config:/home/config \
    -v $(pwd)/reports:/home/reports \
    concept_trainer:example pipeline \
        EXPERIMENT_NAME=my_test_experiment \
        IN_CORPUS=data/raw/STI_public_metadata_records_sample100.jsonl \
        IN_CONFIG=config/test_config.yml

Replace concept_trainer:example with the name of your docker image. You can set EXPERIMENT_NAME to whatever you prefer. IN_CORPUS and IN_CONFIG should be set to the paths of the corpus and the configuration file, respectively.

* Developers can also use the container command in the Makefile. Note that this command requires setuptools-scm and will use the image defined by the IMAGE_NAME variable, with a version number equal to the most recent git tag.

using python in virtual environment

Assuming you have cloned this repository, files for testing the pipeline should be in place. In particular, data/raw/STI_public_metadata_records_sample100.jsonl and config/test_config.yml should both exist. Additionally, you should add the src directory to your PYTHONPATH:

export PYTHONPATH=$PYTHONPATH:$(pwd)/src/

Then, you can run a test of the pipeline with:

make pipeline \
    EXPERIMENT_NAME=test \
    IN_CORPUS=data/raw/STI_public_metadata_records_sample100.jsonl \
    IN_CONFIG=config/test_config.yml

If you are not using the default values, simply substitute the proper paths for IN_CORPUS and IN_CONFIG. Choose whatever name you prefer for EXPERIMENT_NAME.
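
After the run completes, the outputs should land under the directories named in the overview above (interim data, models, and reports); a quick look:

ls data/interim models reports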

managing experiments

If you have access to the hq-ocio-ci-bigdata moderate s3 bucket, you can sync local experiments with those stored in the bucket.

For example, if you created a local experiment with EXPERIMENT_NAME=my_cool_experiment, you can upload your local results to the appropriate place on the s3 bucket with:

make sync_experiment_to_s3 EXPERIMENT_NAME=my_cool_experiment PROFILE=my_aws_profile

where my_aws_profile is the name of your awscli profile which has access to the given bucket.
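
If you have not yet set up a named profile, the standard awscli way to create one is:

aws configure --profile my_aws_profile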

Afterwards, you can download the experiment interim files and results with:

make sync_experiment_from_s3 EXPERIMENT_NAME=my_cool_experiment PROFILE=my_aws_profile

use full STI metadata records

If you have access to the moderate bucket and you want to work with the full STI metadata records, you can download them to the data/raw folder with:

make sync_raw_data_from_s3 PROFILE=my_aws_profile

When using these data, you will want to use a config file which is different from the test config file. You can browse previous experiments at s3://hq-ocio-ci-bigdata/home/DataSquad/classifier_scripts/ to see example config files. You might try:

weights:  # assign weights for term types specified in process section
  NOUN: 1
  PROPN: 1
  NOUN_CHUNK: 1
  ENT: 1
  ACRONYM: 1
min_feature_occurrence: 100
max_feature_occurrence: 0.6
min_concept_occurrence: 500

See config/test_config.yml for details on these parameters.

advanced usage

For more advanced usage of the project, look at the Makefile commands and their associated scripts. You can learn more about these python scripts by running them with the help flag. For example, run python src/make_cat_models.py -h.
