
msarmi9 / sparkle

Promoting medication adherence with ML ✨

Home Page: https://www.sparklemed.com

Python 38.83% HTML 32.13% Swift 23.75% Ruby 0.43% Dockerfile 0.28% JavaScript 4.58%
medication-adherence machine-learning

sparkle's People

Contributors

actions-user, ajcheon, collinprather, dependabot[bot], kbengtsonwong, msarmi9, steph-jung


Forkers

ajcheon

sparkle's Issues

Data Trimming

Issue: Data needs to be trimmed at the start and end of each recording to account for starting and stopping the recording feature on the Apple Watch.
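A minimal sketch of the trimming step, assuming pandas and a `timestamp` column in seconds (the column name and trim duration are assumptions, not the repo's final values):

```python
import pandas as pd

def trim_recording(df: pd.DataFrame, seconds: float = 2.0,
                   time_col: str = "timestamp") -> pd.DataFrame:
    """Drop the first and last `seconds` of a sensor recording.

    Removes the lag between tapping start/stop on the watch and the
    actual pill-taking motion. Adjust `time_col` and `seconds` to
    match the real csv header and observed lag.
    """
    start = df[time_col].min() + seconds
    end = df[time_col].max() - seconds
    return df[df[time_col].between(start, end)]
```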

Script for Flattening Directories

@steph-jung Would love to have your script for flattening the data/pills/original directories into a single directory to place in the s3 bucket. Whenever you're done tweaking it, put in a PR :D
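In case it helps, a flattening script might look like the sketch below (the directory layout and the joined-filename convention are assumptions):

```python
import shutil
from pathlib import Path

def flatten(src: str, dst: str) -> int:
    """Copy every csv under the nested `src` tree into the single `dst`
    directory, joining the relative path parts into the new filename
    (e.g. 01-pid/05-pills/trial-03.csv -> 01-pid-05-pills-trial-03.csv).
    Returns the number of files copied.
    """
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for f in sorted(src_dir.rglob("*.csv")):
        flat_name = "-".join(f.relative_to(src_dir).parts)
        shutil.copy2(f, dst_dir / flat_name)
        count += 1
    return count
```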

Update README

Task: Update README to give at least a rough overview/outline of our project. Doesn’t need to be super detailed, as we can continue to flesh it out as we go. Writing this after the EMBC paper (Issue #37) is probably a good idea.

All members of group run `dc-project-AC-timed.ipynb` on EMR and record run time.

Per Diane's instructions, each of us must run Andy's dc-project-AC.ipynb notebook on a unique EMR cluster configuration, and then we'll compare the run times. She recommends putting all of our Spark data-processing code into a single cell so we can easily capture the full running time. I've made that edit to Andy's notebook and renamed it dc-project-AC-timed.ipynb; it will be posted in the sparkle Slack channel.

It seems fairly open-ended in terms of how our EMR cluster configurations should differ, so I propose we all run the notebook with the default m5.xlarge instance type but each use a different number of instances. For simplicity, let's use the following counts:

    Collin: 5 `m5.xlarge` instances
    Kevin: 4 `m5.xlarge` instances
    Andy: 3 `m5.xlarge` instances
    Stephanie: 2 `m5.xlarge` instances
    Matt: 1 `m5.xlarge` instance

So, when launching, my configuration looked like this:

[Screenshot: EMR cluster launch configuration]

For the project, it would probably be good for us all to get a screenshot of the cluster config while running, which you can get in the AWS console if you go to EMR -> Clusters -> (toggle your cluster) -> View details -> hardware. Mine looks like this:

[Screenshot: cluster hardware details in the AWS console]

Finally, once everything is set up, upload dc-project-AC-timed.ipynb, run the entire notebook, then expand the Spark Job Progress under the big cell containing all the pyspark data processing code to get your configuration's runtime. For example, with 5 m5.xlarge instances, my code took 12.11 seconds to run.

[Screenshot: Spark Job Progress showing a 12.11 s runtime]

And once we all do that, we should be good to go! We can put all of our different configurations and run times into a nice table in our presentation. 👍

iOS -- Home View

Task: A clean, minimal home dashboard displaying the current week's intake schedule, organized by day and by user-customized intake times that conform to the doctor-prescribed schedule.

EMR Bootstrap Scripts

Task: Create a bootstrap script that installs boto3, h2o, and local modules when launching an AWS EMR cluster.

Re-name Pill Data Files & Store on S3

Task: Name csv files for newly collected pill data in a smart format and put them on s3. Maybe <activity>-<PID>-<substring>.csv, e.g. pills-07-25.csv for subject 07 taking n_pills=25? But this also looks like month/date...
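One possible way to resolve the month/date ambiguity is to zero-pad the subject id and prefix the pill count with `n` — purely a suggestion, not the format the repo settled on:

```python
def pill_filename(pid: int, n_pills: int, activity: str = "pills") -> str:
    """Format a data filename as <activity>-<PID>-n<n_pills>.csv.

    e.g. pills-07-n25.csv for subject 07 taking 25 pills -- the 'n'
    prefix keeps the name from being misread as a month/day.
    """
    return f"{activity}-{pid:02d}-n{n_pills}.csv"
```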

Run `dc-project-AC-timed.ipynb` on EMR with scaled out cluster

Task: Run the same notebook (dc-project-AC-timed.ipynb) on EMR with a scaled-out cluster (more EC2 instances, less computing power per instance).

Curious to see how distributing the same total amount of memory across more worker nodes affects runtime performance (e.g. 8 workers with 8 GB RAM each vs. 4 workers with 16 GB each).

Builds upon #14


Each config will be run with 1 or 3 m5.xlarge master nodes and x m4.large worker nodes (8 GB RAM each):

Config 1: 10 workers 
Config 2: 8 workers
Config 3: 6 workers
Config 4: 4 workers
Config 5: 2 workers

Group Project 1 - Team/Data Selection

  • Each team member reviews relevant research + datasets and writes a short review in this doc.
  • We must collect all pill-taking data so that our dataset meets Diane's criteria.
  • Get all data into s3 by Sunday (tomorrow) night at 10pm.

Add GUIDELINES.md

Task: Write a brief doc covering

  • branch naming and PR conventions
  • suggested format for commit messages
  • naming convention for notebooks

Sync github repo with EMR Notebook

Issue: The security groups for EMR Notebooks are tricky. They need to be configured to sync with the repo so that when a notebook is launched the project’s source files are already loaded. Pushes to the repo should also be enabled.

Cross-validate models

Run k-fold cross-validation on all candidate models to assess variance. Code should go in the notebook that contains model training and evaluation.
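The pattern might look like the sketch below, written with scikit-learn for concreteness (the repo's candidate models may live in Spark or h2o, and the scoring metric is an assumption):

```python
from sklearn.model_selection import KFold, cross_val_score

def cv_report(model, X, y, k: int = 5, seed: int = 9):
    """Return (mean, std) of k-fold CV scores for one candidate model.

    A large std across folds signals high variance. Run this for each
    candidate and compare -- swap in the repo's real models and metric.
    """
    cv = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), scores.std()
```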

Join datasets to single csv file

Task: Join all csv files in s3://msds-sparkle/data/commute/ into a single csv file for easier visualization and processing of the entire dataset.
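A minimal sketch with pandas (pandas can read `s3://` URIs directly when s3fs is installed, so the same code works for the commute files and for local copies):

```python
import pandas as pd

def join_csvs(paths, out_path=None) -> pd.DataFrame:
    """Concatenate a list of csv files (local paths or s3:// URIs)
    into a single DataFrame, optionally writing it back out as one csv.
    """
    df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    if out_path is not None:
        df.to_csv(out_path, index=False)
    return df
```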

Re-structure S3 Bucket

Task: Re-organize the data in the project s3://msds-sparkle bucket as follows:

/
├── data
│   ├── original
│   │   └── 01-pid
│   │       └── 05-pills
│   │           └── trial-03.csv
│   └── processed
│       └── 01-pid
│           └── 05-pills
│               └── trial-03.csv
├── models
│   ├── input
│   │   └── pills-05-300-101.parquet
│   └── output
│       └── XGBoost.pickle
└── README.md
  • /data/original will contain all of the original, unmodified data recordings (preserving the original filename with timestamp).
  • /data/processed will contain files that have been re-named or modified in any way (e.g. by having added pid and n_pills cols).
  • /models/input will contain the .parquet or .csv files for training ML models, named as follows pills-<window_size>-<n_obsv>-<n_feats>.parquet.
  • /models/output will contain trained ML models

Script for Loading Files from s3

Task: Write a script that uses an already configured awscli profile called sparkle to load in files from s3. This avoids having to hardcode individual access keys. The script should load in files as a dictionary of spark DataFrames with (key, value) = (n_pills, [DataFrames]).
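A sketch of the loader, assuming the bucket layout from the restructure issue (`data/pills/<PID>/<n_pills>/trial-<num>.csv`) and a Spark session passed in by the caller — the key-parsing regex and paths are assumptions:

```python
import re

def n_pills_from_key(key: str):
    """Parse the pill count from a key like data/pills/07/25/trial-03.csv.
    Returns None for keys that don't match the assumed layout.
    """
    m = re.search(r"/(\d+)/trial", key)
    return int(m.group(1)) if m else None

def load_pill_frames(spark, bucket="msds-sparkle", prefix="data/pills/"):
    """Build {n_pills: [spark DataFrame, ...]} from csvs on s3, using
    the pre-configured `sparkle` awscli profile -- no hardcoded keys.
    """
    import boto3  # imported lazily so the parser above stays dependency-free
    s3 = boto3.Session(profile_name="sparkle").client("s3")
    frames = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            n = n_pills_from_key(obj["Key"])
            if n is None:
                continue
            df = spark.read.csv(f"s3a://{bucket}/{obj['Key']}",
                                header=True, inferSchema=True)
            frames.setdefault(n, []).append(df)
    return frames
```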

Train Models on Various Cluster Configurations

Task: Train different models on various emr clusters. Record cluster_configuration and model_runtime in /docs/cluster_performance.csv, which has the following header:

model;model_params;n_nodes;instance_type;n_records;n_features;train_time

Note: The field delimiter is ; NOT ,

Ex:

  • model = GBRegressor
  • model_params = {maxIter: 150, nfolds: 10} (leave empty {} if set to defaults)
  • n_nodes = 3 (1 master and 2 workers)
  • instance_type = m5.xlarge
  • n_records = 240
  • n_features = 102
  • train_time = 35s

Sphinx Project Documentation

Task: Format final python modules/scripts so that Sphinx can automagically create documentation for the entire project. Function docstrings, for example, can be formatted using Google's style guide:

def test(num, string):
    """Sample docstring in a friendly Sphinx format.

    Args:
        num (int): A number.
        string (str): A string.

    Returns:
        str: The string "foo".
    """
    return "foo"

Not an essential task, but would be nice since this is a public repo 🙂

Web -- Home Dashboard

Homepage dashboard of clickable cards for patients with poor medication adherence rates; clicking a card brings up a comprehensive view of the patient's information, including:

  • Illnesses & current medication, intake history, and amount of medication remaining.
  • Summary of KPIs for medication adherence (current streak; weekly, monthly and lifetime adherence rates).

Scripts For Re-naming Files & Adding Columns

Tasks:

  1. Generate scripts for re-naming data files in the format trial<num>.csv. The full path of each file on s3 looks like s3://msds-sparkle/data/pills/<PID>/<n_pills>/trial<num>.csv
  2. Generate scripts for adding PID and n_pills columns to each input csv file.
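The two scripts might be sketched as follows, assuming pandas for the column step (the sorted-order numbering and column names are assumptions):

```python
import pandas as pd
from pathlib import Path

def rename_trials(directory) -> list:
    """Rename every csv in `directory` to trial<num>.csv, numbering
    files in sorted-filename order. Returns the new paths.
    """
    renamed = []
    for i, f in enumerate(sorted(Path(directory).glob("*.csv")), start=1):
        new = f.with_name(f"trial{i}.csv")
        f.rename(new)
        renamed.append(new)
    return renamed

def add_id_columns(csv_path, pid: int, n_pills: int) -> None:
    """Stamp pid and n_pills onto every row of a trial csv so the
    values survive when files are later concatenated.
    """
    df = pd.read_csv(csv_path)
    df["pid"] = pid
    df["n_pills"] = n_pills
    df.to_csv(csv_path, index=False)
```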

WatchOS -- Get audio data

Right now, CoreMotion provides us with all the movement-related sensor data that we need (accelerometer, gyroscope, etc.). As far as I can tell, it will not give us any audio data.

For the time being we can proceed without it, but ideally we'll find another way to capture audio and feed it into the existing pipeline, directly into s3.

First Unit Test

Task: Write a (any) unit test and configure pytest for repo. This will enable tests to run automatically on every push and pull_request, thanks to our handy GitHub Actions ;)
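A first test could be as small as the sketch below (the function under test is hypothetical — replace it with a real import from the sparkle package once one exists):

```python
# tests/test_smoke.py -- minimal first test so pytest (and the GitHub
# Actions workflow) have something to run.

def window_count(n_obs: int, window_size: int) -> int:
    """Number of full sliding windows of `window_size` over `n_obs` rows."""
    return max(n_obs - window_size + 1, 0)

def test_window_count():
    assert window_count(10, 4) == 7
    assert window_count(3, 4) == 0
```

Running `pytest` from the repo root will then discover and run `test_window_count` automatically.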

Add Status Badges to README

Task: Configure dynamic badges for displaying key information about the project's codebase, e.g. % of the codebase tested, whether the latest build is passing, and % of the code that is documented, as in the screenshot below:

[Screenshot: example README status badges]

GitHub CI Actions for Automatic Testing

Task: Enable automatic testing of code submitted in new PRs. For example, we'll set up an action that runs pylint on all source code. PRs will not be accepted unless they pass all tests (and are approved by a reviewer).

Purpose: To enhance the overall health of the project's codebase and improve the engineering skills of its contributors.
