msarmi9 / sparkle
Promoting medication adherence with ML ✨
Home Page: https://www.sparklemed.com
Task: Clean & minimal home dashboard displaying the current week's intake schedule, organized by day and by user-customized intake times that conform to a doctor-prescribed schedule.
This looks relatively straightforward. We basically just need to be able to read our data into a jupyter notebook, like Diane shows; she has a video up on canvas explaining the process. Looks like we'll need to configure the permissions policy of our s3 bucket.
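Once the bucket policy is in place, reading a file from the bucket inside an EMR notebook should be a one-liner with PySpark. A minimal sketch (the exact key under data/commute/ is hypothetical):

```python
# Minimal sketch: read one of our csv files from s3 into a Spark DataFrame.
# On an EMR notebook a SparkSession usually already exists as `spark`;
# getOrCreate() just reuses it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(
    "s3://msds-sparkle/data/commute/trial01.csv",  # hypothetical key
    header=True,
    inferSchema=True,
)
df.show(5)
```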
Per Diane's instructions, each of us must run Andy's dc-project-AC.ipynb notebook on a unique EMR cluster configuration, and then we will compare the run times. She recommends putting all our spark data processing code into a single cell so we can easily capture the full running time. I have made that edit to Andy's notebook and renamed it to dc-project-AC-timed.ipynb; this will be posted in the sparkle slack channel.
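As a fallback for timing that single cell, plain wall-clock timing works too. A sketch (the cell contents are whatever is in the notebook):

```python
# Wall-clock timing for the one big data-processing cell.
# Note: Spark is lazy, so make sure the cell ends with an action
# (e.g. .count() or a write) so the work actually runs inside the timed region.
import time

start = time.perf_counter()
# ... all the pyspark data processing code from the notebook goes here ...
print(f"Total running time: {time.perf_counter() - start:.2f} seconds")
```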
It seems fairly open-ended in terms of how our EMR cluster configurations should differ, so I propose we all run the notebook with the default m5.xlarge instance type but with a different number of instances each. For simplicity, let's use the following counts:
Collin: 5 `m5.xlarge` instances
Kevin: 4 `m5.xlarge` instances
Andy: 3 `m5.xlarge` instances
Stephanie: 2 `m5.xlarge` instances
Matt: 1 `m5.xlarge` instance
So, when launching, my configuration looked like this: [screenshot of launch configuration]
For the project, it would probably be good for us all to get a screenshot of the cluster config while it's running, which you can get in the AWS console if you go to EMR -> Clusters -> (toggle your cluster) -> View details -> Hardware. Mine looks like this: [screenshot of cluster hardware]
Finally, after getting things all set up, upload dc-project-AC-timed.ipynb, run the entire notebook, then go down and toggle the Spark Job Progress under the big cell which contains all the pyspark data processing code to get your configuration's runtime. For example, with 5 m5.xlarge instances, my code took 12.11 seconds to run.
And once we all do that, we should be good to go! We can put all of our different configurations and run times into a nice table in our presentation. 👍
Task: Update README to give at least a rough overview/outline of our project. Doesn't need to be super detailed, as we can continue to flesh it out as we go. Writing this after the EMBC paper (Issue #37) is probably a good idea.
Task: Train different models on various emr clusters. Record cluster_configuration and model_runtime in /docs/cluster_performance.csv, which has the following header:
model;model_params;n_nodes;instance_type;n_records;n_features;train_time
Note: The field delimiter is ; NOT ,
Ex:
- model = GBRegressor
- model_params = {maxIter: 150, nfolds: 10} (leave empty {} if set to defaults)
- n_nodes = 3 (1 master and 2 workers)
- instance_type = m5.xlarge
- n_records = 240
- n_features = 102
- train_time = 35s
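Put together, that example would appear in /docs/cluster_performance.csv as a single semicolon-delimited row under the header:

```
model;model_params;n_nodes;instance_type;n_records;n_features;train_time
GBRegressor;{maxIter: 150, nfolds: 10};3;m5.xlarge;240;102;35s
```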
Task: Set up awslabs/git-secrets to automatically scan for & prevent uploading of access keys in any future commits 😎
Task: Train models (XGBoost, h2o automl, etc.) on the log-transformed data (lite_300.csv in the s3 bucket).
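A minimal h2o automl sketch for this, assuming h2o is installed on the cluster; the exact s3 key for lite_300.csv and the target column name ("target") are placeholders:

```python
# Sketch: run h2o automl on the log-transformed data.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("s3://msds-sparkle/lite_300.csv")  # exact key may differ

aml = H2OAutoML(max_models=10, seed=42)
aml.train(y="target", training_frame=frame)  # "target" is a placeholder label
print(aml.leaderboard)
```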
Task: Run the same notebook (dc-project-AC-timed.ipynb) on EMR with scaled-out instances (more ec2 instances, less computing power per instance). Curious to see how the same total amount of memory distributed to more worker nodes affects runtime performance (e.g. 8 workers at 8 GB RAM each vs 4 workers at 16 GB RAM each).
Builds upon #14. Each config group will be run with 1 or 3 m5.xlarge master nodes and x m4.large worker nodes (8 GB RAM each):
Config 1: 10 workers
Config 2: 8 workers
Config 3: 6 workers
Config 4: 4 workers
Config 5: 2 workers
Homepage dashboard of clickable cards for patients with poor medication adherence rates, which, when clicked, bring up a comprehensive view of patient information, including:
Task: Join all csv files in s3://msds-sparkle/data/commute/ into a single csv file for ease of visualizing and processing the entire dataset.
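One way to do this with PySpark, assuming all the files share the same schema (the output prefix is hypothetical):

```python
# Read every csv under the prefix at once, then write out a single file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
combined = spark.read.csv(
    "s3://msds-sparkle/data/commute/", header=True, inferSchema=True
)

# coalesce(1) forces a single output part-file; fine at this data size,
# though it funnels the write through one executor.
combined.coalesce(1).write.mode("overwrite").csv(
    "s3://msds-sparkle/data/commute-combined/",  # hypothetical output prefix
    header=True,
)
```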
Tasks:
- Name each file trial<num>.csv. The full path of each file on s3 looks like s3://msds-sparkle/data/pills/<PID>/<n_pills>/trial<num>.csv
- Add PID and n_pills columns to each input csv file (see the PySpark sketch below).
- Identify the optimal way to get data from Phone/watch sensors into s3.
Task: Create a bootstrap script for installing boto3, h2o, and local modules when launching an aws emr cluster.
Task: Choose a license and add a LICENSE.md to the repo.
Task: Configure watchOS app to log data to S3 bucket
Task: Write a (any) unit test and configure pytest for the repo. This will enable tests to run automatically on every push and pull_request, thanks to our handy GitHub Actions ;)
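Something as small as this would do for a first test, e.g. in tests/test_naming.py. The parse_n_pills helper is hypothetical; any real function from our modules works just as well:

```python
# tests/test_naming.py -- a minimal first test to get pytest running in CI.
def parse_n_pills(filename: str) -> int:
    """Extract n_pills from a name like 'pills-07-25.csv'."""
    stem = filename.rsplit(".", 1)[0]  # "pills-07-25"
    return int(stem.split("-")[-1])

def test_parse_n_pills():
    assert parse_n_pills("pills-07-25.csv") == 25
```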
Right now, CoreMotion provides us with all the movement-related sensor data that we need (accelerometer, gyroscope, etc). As far as I can tell, it will not give us any audio data.
For the time being, we can proceed without it, but ideally we'll be able to find another way to get to it, and feed it into the existing pipeline, directly into s3.
Task: Re-organize the data in the project s3://msds-sparkle bucket as follows:
/
├── data
│   ├── original
│   │   └── 01-pid
│   │       └── 05-pills
│   │           └── trial-03.csv
│   └── processed
│       └── 01-pid
│           └── 05-pills
│               └── trial-03.csv
├── models
│   ├── input
│   │   └── pills-05-300-101.parquet
│   └── output
│       └── XGBoost.pickle
└── README.md
- /data/original will contain all of the original, unmodified data recordings (preserving the original filename with timestamp).
- /data/processed will contain files that have been re-named or modified in any way (e.g. by having added pid and n_pills cols).
- /models/input will contain the .parquet or .csv files for training ML models, named as follows: pills-<window_size>-<n_obsv>-<n_feats>.parquet
- /models/output will contain trained ML models
@steph-jung Would love to have your script for flattening the data/pills/original directories into a single directory to place in the s3 bucket. Whenever you're done tweaking it, put in a PR :D
Task: Translate the PySpark script that processes sensor data into the final training frame for the DL model into plain Python.
Task: Write a script that uses an already configured awscli profile called sparkle to load in files from s3. This avoids having to hardcode individual access keys. The script should load in files as a dictionary of spark DataFrames with (key, value) = (n_pills, [DataFrames]).
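A sketch of what that script could look like, assuming the data/pills/<PID>/<n_pills>/ layout (boto3 lists keys with the sparkle profile; Spark itself reads via the cluster's own credentials):

```python
# Build {n_pills: [DataFrame, ...]} from the bucket using the sparkle profile.
from collections import defaultdict

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
s3 = boto3.Session(profile_name="sparkle").client("s3")

frames = defaultdict(list)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="msds-sparkle", Prefix="data/pills/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]  # e.g. data/pills/01/05/trial03.csv
        if not key.endswith(".csv"):
            continue
        n_pills = int(key.split("/")[3])  # the <n_pills> path segment
        frames[n_pills].append(
            spark.read.csv(f"s3://msds-sparkle/{key}", header=True, inferSchema=True)
        )
```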
Run k-fold cross-validation on all candidate models to assess variance. Code should go in the notebook that contains model training and evaluation.
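A k-fold sketch with PySpark's CrossValidator, assuming a training frame train_df with "features" and "label" columns (both names are placeholders) and using GBTRegressor as a stand-in for the GBRegressor mentioned above:

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

gbt = GBTRegressor(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(gbt.maxIter, [50, 150]).build()
cv = CrossValidator(
    estimator=gbt,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="label", metricName="rmse"),
    numFolds=5,
)
cv_model = cv.fit(train_df)
# avgMetrics is the mean RMSE per param combination across the 5 folds;
# for per-fold variance, split the folds manually and evaluate each yourself.
print(cv_model.avgMetrics)
```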
Task: Enable automatic testing of code submitted in new PRs. For example, we'll set up an action for running pylint on all source code. PRs will not be accepted unless they pass all tests (and are approved by a reviewer).
Purpose: To enhance the overall health of the project's codebase and improve the engineering skills of its contributors.
Need to implement watch connectivity for two-way communication between watch and phone.
Task: Start/stop button for logging sensor data (gyroscope, accelerometer, audio) to iPhone (which the iPhone then sends to s3).
Task: Write a brief doc covering
Task: Name csv files for newly collected pill data in a smart format and put them on s3. Maybe <activity>-<PID>-<substring>.csv, e.g. pills-07-25.csv for subject 07 taking n_pills=25? But this also looks like month/date...
Goal: gather enough data for our project by recording our morning commute on Wed 11/20.
Issue: Data needs to be trimmed at the start and end to account for starting and stopping the recording feature on the apple watch.
Task: Format final python modules/scripts so that Sphinx can automagically create documentation for the entire project. Function docstrings, for example, can be formatted using Google's style guide:
def test(num, string):
    """Sample docstring in a friendly Sphinx format.

    Args:
        num (int): A number
        string (str): A string

    Returns:
        foo (str): String "foo"
    """
    return "foo"
Not an essential task, but would be nice since this is a public repo 🙂
Issue: The security groups for EMR Notebooks are tricky. They need to be configured to sync with the repo so that when a notebook is launched the project's source files are already loaded. Pushes to the repo should also be enabled.