
effect-workflows

DIG workflow processing for the EFFECT project.

Installation

  1. Download and install conda - https://www.continuum.io/downloads
  2. Install conda env - conda install -c conda conda-env
  3. Create the environment: conda env create. This will create a virtual environment named effect-env (the name is defined in environment.yml)
  4. Switch to the environment using source activate effect-env

NOTE: Build the environment on the same hardware/OS on which you are going to run the job.
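
Taken together, steps 2-4 look like this in a shell session, run from the repository root (where environment.yml lives):

    conda install -c conda conda-env   # one-time install of the env tooling
    conda env create                   # reads environment.yml and creates effect-env
    source activate effect-env         # switch into the environment
    conda env list                     # optional check: effect-env should be listed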

Running the script to convert PostgreSQL data to CDR

  1. Follow the instructions above to create the conda environment (steps 1-3)
  2. Switch to the effect-env: source activate effect-env
  3. Execute:
python postgresToCDR.py --host <postgreSQL hostname> --user <db username> --password <db password> \
                        --database <databasename> --table <tablename> \
                        --output <output filename> --team <Name of team providing data>
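
For example, a hypothetical invocation (the host, credentials, database, table, and team names below are purely illustrative):

python postgresToCDR.py --host localhost --user effect --password secret \
                        --database effectdb --table documents \
                        --output cdr.jl --team isi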

Running the script to convert CSV, JSON, XML, or CDR data into the format used for Karma Modeling

  1. Follow the instructions above to create the conda environment (steps 1-3)
  2. Switch to the effect-env: source activate effect-env
  3. Execute:
python generateDataForKarmaModeling.py --input <input filename> --output <output filename> \
      --format <input format-csv/json/xml/cdr> --source <a name for the source> \
      --separator <column separator for CSV files>

Example Invocations:

python generateDataForKarmaModeling.py --input ~/github/effect/effect-data/nvd/sample/nvdcve-2.0-2003.xml \
          --output nvd.jl --format xml --source nvd


python generateDataForKarmaModeling.py --input ~/github/effect/effect-data/hackmageddon/sample/hackmageddon_20160730.csv \
          --output hackmageddon.jl --format csv --source hackmageddon


python generateDataForKarmaModeling.py --input ~/github/effect/effect-data/hackmageddon/sample/hackmageddon_20160730.jl \
          --output hackmageddon.jl --format json --source hackmageddon

Loading data in HIVE

  1. Log in to AWS and create a tunnel - ssh -L 8888:localhost:8888 [email protected]
  2. Access Hue on http://localhost:8888
  3. See hiveQueries.sql for examples
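
The actual queries live in hiveQueries.sql; as a purely illustrative example of the kind of statement you can run from a cluster shell against the CDR table used by the workflow below:

    hive -e 'SELECT COUNT(*) FROM cdr;'   # illustrative only; see hiveQueries.sql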

Running the workflow

To build the Python libraries required by the workflows:

  1. Edit make.sh and update the path to dig-workflows

  2. Run ./make.sh. This will create effect-env.zip, which can be attached to the Spark workflow with the --archives option

  3. Copy the effect-env.zip file to AWS - scp effect-env.zip [email protected]:/home/hadoop/effect-workflows/lib

  4. Zip your karma home folder into karma.zip and copy it to AWS - scp karma.zip [email protected]:/home/hadoop/effect-workflows/

  5. Build a shaded karma-spark jar -

    cd karma-spark
    mvn clean install -P shaded -Denv=hive
    scp lib/karma-spark-0.0.1-SNAPSHOT-shaded.jar [email protected]:/home/hadoop/effect-workflows/lib
    
  6. Log in to AWS and run the workflow using the script run_karma_workflow.sh. This will load data from the HIVE table CDR, apply karma models to it, and save the output to HDFS.
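
Once the workflow completes, the framed output can be verified on HDFS; the path below is the one consumed by the ES loading step that follows:

    hadoop fs -ls hdfs://ip-172-31-19-102/user/effect/data/cdr-framed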

To load the data into ES:

  1. Create an index, say effect-2, with mappings from the file https://raw.githubusercontent.com/usc-isi-i2/effect-alignment/master/es/es-mappings.json

  2. Run spark workflow to load data from hdfs to this effect-2 index

    spark-submit --deploy-mode client  \
        --executor-memory 5g \
        --driver-memory 5g \
        --jars "/home/hadoop/effect-workflows/jars/elasticsearch-hadoop-2.4.0.jar" \
        --py-files /home/hadoop/effect-workflows/lib/python-lib.zip \
        /home/hadoop/effect-workflows/effectWorkflow-es.py \
        --host 172.31.19.102 \
        --port 9200 \
        --index effect-2 \
        --doctype attack \
        --input hdfs://ip-172-31-19-102/user/effect/data/cdr-framed/attack
    

    This shows how to load the attack frame. The job needs to be executed for each of the available frames, e.g. with a loop like the one below.
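
    A hypothetical wrapper loop (only the attack frame is confirmed here; vulnerability is a placeholder for the other frame names under cdr-framed):

    # replace the placeholder names with the actual frames present on HDFS
    for doctype in attack vulnerability; do
        spark-submit --deploy-mode client \
            --executor-memory 5g \
            --driver-memory 5g \
            --jars "/home/hadoop/effect-workflows/jars/elasticsearch-hadoop-2.4.0.jar" \
            --py-files /home/hadoop/effect-workflows/lib/python-lib.zip \
            /home/hadoop/effect-workflows/effectWorkflow-es.py \
            --host 172.31.19.102 --port 9200 --index effect-2 \
            --doctype "$doctype" \
            --input "hdfs://ip-172-31-19-102/user/effect/data/cdr-framed/$doctype"
    done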

  3. Change the alias 'effect' in ES to point to this new index - effect-2

    POST _aliases
    {
      "actions": [
        {
          "add": {
            "index": "effect-2",
            "alias": "effect"
          },
          "remove": {
            "index": "effect-1",
            "alias": "effect"
          }
        }
      ]
    }
    
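    The request above is in Kibana/Sense style; from a shell, the same alias swap can be issued with curl against the ES host used earlier:

    curl -XPOST 'http://172.31.19.102:9200/_aliases' -d '{
      "actions": [
        { "add":    { "index": "effect-2", "alias": "effect" } },
        { "remove": { "index": "effect-1", "alias": "effect" } }
      ]
    }'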

Running the Extractor Workflow

  1. Follow the Installation instructions to install conda and conda-env if you don't have them installed
  2. Create the effect environment: conda env create
  3. Switch to the environment using source activate effect-env
  4. Run ./make-extractor.sh. This bundles up the entire environment, including the Python interpreter used to run the workflow
  5. If Spark is not installed in the default /usr/lib/spark/, change the paths in run-extractor.sh (see the example after this list)
  6. Run run-extractor.sh
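
For example, if Spark lives under /opt/spark (a hypothetical location), a one-line substitution updates the script before running it:

    sed -i 's|/usr/lib/spark|/opt/spark|g' run-extractor.sh   # hypothetical replacement path
    ./run-extractor.sh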

Extras

  • To remove the environment run conda env remove -n effect-env
  • To see all environments run conda env list

  • Run the OOZIE workflow from the command line - takes in job.properties and workflow.xml, as shown below
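
    A typical command-line submission looks like this (the Oozie server URL is a placeholder):

    # -config supplies job.properties; workflow.xml is referenced via
    # oozie.wf.application.path inside job.properties
    oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run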
