
msarmi9 / sparkle

Promoting medication adherence with ML ✨

Home Page: https://www.sparklemed.com

Python 38.83% HTML 32.13% Swift 23.75% Ruby 0.43% Dockerfile 0.28% JavaScript 4.58%
medication-adherence machine-learning

sparkle's People

Contributors

actions-user, ajcheon, collinprather, dependabot[bot], kbengtsonwong, msarmi9, steph-jung


Forkers

ajcheon

sparkle's Issues

Data Trimming

Issue: Data needs to be trimmed at the start and end of each recording to account for starting and stopping the recording feature on the Apple Watch.
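A minimal sketch of the trimming step, assuming pandas and a `timestamp` column in seconds (the column name and trim duration are assumptions, not the repo's final values):

```python
import pandas as pd

def trim_recording(df: pd.DataFrame, seconds: float = 2.0,
                   time_col: str = "timestamp") -> pd.DataFrame:
    """Drop the first and last `seconds` of a sensor recording.

    Removes the lag between tapping start/stop on the watch and the
    actual pill-taking motion. Adjust `time_col` and `seconds` to
    match the real csv header and observed lag.
    """
    start = df[time_col].min() + seconds
    end = df[time_col].max() - seconds
    return df[df[time_col].between(start, end)]
```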

Script for Flattening Directories

@steph-jung Would love to have your script for flattening the data/pills/original directories into a single directory to place in the s3 bucket. Whenever you're done tweaking it, put in a PR :D
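In case it helps, a flattening script might look like the sketch below (the directory layout and the joined-filename convention are assumptions):

```python
import shutil
from pathlib import Path

def flatten(src: str, dst: str) -> int:
    """Copy every csv under the nested `src` tree into the single `dst`
    directory, joining the relative path parts into the new filename
    (e.g. 01-pid/05-pills/trial-03.csv -> 01-pid-05-pills-trial-03.csv).
    Returns the number of files copied.
    """
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for f in sorted(src_dir.rglob("*.csv")):
        flat_name = "-".join(f.relative_to(src_dir).parts)
        shutil.copy2(f, dst_dir / flat_name)
        count += 1
    return count
```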

Update README

Task: Update README to give at least a rough overview/outline of our project. Doesn’t need to be super detailed, as we can continue to flesh it out as we go. Writing this after the EMBC paper (Issue #37) is probably a good idea.

All members of group run `dc-project-AC-timed.ipynb` on EMR and record run time.

Per Diane's instructions, each of us must run Andy's dc-project-AC.ipynb notebook on a unique EMR cluster configuration, and then we'll compare the run times. She recommends putting all of our Spark data-processing code into a single cell so we can easily capture the full running time. I've made that edit to Andy's notebook and renamed it dc-project-AC-timed.ipynb; it will be posted in the sparkle Slack channel.

It seems fairly open-ended in terms of how our EMR cluster configurations should differ, so I propose we all run the notebook with the default m5.xlarge instance type but each use a different number of instances. For simplicity, let's use the following counts:

    Collin: 5 `m5.xlarge` instances
    Kevin: 4 `m5.xlarge` instances
    Andy: 3 `m5.xlarge` instances
    Stephanie: 2 `m5.xlarge` instances
    Matt: 1 `m5.xlarge` instance

So, when launching, my configuration looked like this:

[Screenshot: EMR cluster launch configuration]

For the project, it would probably be good for us all to get a screenshot of the cluster config while running, which you can get in the AWS console if you go to EMR -> Clusters -> (toggle your cluster) -> View details -> hardware. Mine looks like this:

[Screenshot: cluster hardware details in the AWS console]

Finally, once everything is set up, upload dc-project-AC-timed.ipynb, run the entire notebook, then expand the Spark Job Progress under the big cell containing all the pyspark data processing code to get your configuration's runtime. For example, with 5 m5.xlarge instances, my code took 12.11 seconds to run.

[Screenshot: Spark Job Progress showing a 12.11 s runtime]

And once we all do that, we should be good to go! We can put all of our different configurations and run times into a nice table in our presentation. 👍

iOS -- Home View

Task: A clean, minimal home dashboard displaying the current week's intake schedule, organized by day and by user-customized intake times that conform to the doctor-prescribed schedule.

EMR Bootstrap Scripts

Task: Create a bootstrap script that installs boto3, h2o, and local modules when launching an AWS EMR cluster.

Re-name Pill Data Files & Store on S3

Task: Name csv files for newly collected pill data in a smart format and put them on s3. Maybe <activity>-<PID>-<substring>.csv, e.g. pills-07-25.csv for subject 07 taking n_pills=25? But this also looks like month/date...
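One possible way to resolve the month/date ambiguity is to zero-pad the subject id and prefix the pill count with `n` — purely a suggestion, not the format the repo settled on:

```python
def pill_filename(pid: int, n_pills: int, activity: str = "pills") -> str:
    """Format a data filename as <activity>-<PID>-n<n_pills>.csv.

    e.g. pills-07-n25.csv for subject 07 taking 25 pills -- the 'n'
    prefix keeps the name from being misread as a month/day.
    """
    return f"{activity}-{pid:02d}-n{n_pills}.csv"
```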

Run `dc-project-AC-timed.ipynb` on EMR with scaled out cluster

Task: Run the same notebook (dc-project-AC-timed.ipynb) on EMR with a scaled-out cluster (more EC2 instances, less computing power per instance).

Curious to see how distributing the same total amount of memory across more worker nodes affects runtime performance (e.g. 8 workers with 8 GB RAM each vs. 4 workers with 16 GB each).

Builds upon #14


Each config will be run with 1 or 3 m5.xlarge master nodes and x m4.large worker nodes (8 GB RAM each):

Config 1: 10 workers 
Config 2: 8 workers
Config 3: 6 workers
Config 4: 4 workers
Config 5: 2 workers

Group Project 1 - Team/Data Selection

  • Each team member reviews relevant research + datasets and writes a short review in this doc.
  • We must collect all pill-taking data so that our dataset meets Diane's criteria.
  • Get all data into s3 by Sunday (tomorrow) night at 10pm.

Add GUIDELINES.md

Task: Write a brief doc covering

  • branch naming and PR conventions
  • suggested format for commit messages
  • naming convention for notebooks

Sync github repo with EMR Notebook

Issue: The security groups for EMR Notebooks are tricky. They need to be configured to sync with the repo so that when a notebook is launched the project’s source files are already loaded. Pushes to the repo should also be enabled.

Cross-validate models

Run k-fold cross-validation on all candidate models to assess variance. Code should go in the notebook that contains model training and evaluation.
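The pattern might look like the sketch below, written with scikit-learn for concreteness (the repo's candidate models may live in Spark or h2o, and the scoring metric is an assumption):

```python
from sklearn.model_selection import KFold, cross_val_score

def cv_report(model, X, y, k: int = 5, seed: int = 9):
    """Return (mean, std) of k-fold CV scores for one candidate model.

    A large std across folds signals high variance. Run this for each
    candidate and compare -- swap in the repo's real models and metric.
    """
    cv = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), scores.std()
```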

Join datasets to single csv file

Task: Join all csv files in s3://msds-sparkle/data/commute/ into a single csv file for easier visualization and processing of the entire dataset.
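A minimal sketch with pandas (pandas can read `s3://` URIs directly when s3fs is installed, so the same code works for the commute files and for local copies):

```python
import pandas as pd

def join_csvs(paths, out_path=None) -> pd.DataFrame:
    """Concatenate a list of csv files (local paths or s3:// URIs)
    into a single DataFrame, optionally writing it back out as one csv.
    """
    df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    if out_path is not None:
        df.to_csv(out_path, index=False)
    return df
```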

Re-structure S3 Bucket

Task: Re-organize the data in the project s3://msds-sparkle bucket as follows:

/
├── data
│   ├── original
│   │   └── 01-pid
│   │       └── 05-pills
│   │           └── trial-03.csv
│   └── processed
│       └── 01-pid
│           └── 05-pills
│               └── trial-03.csv
├── models
│   ├── input
│   │   └── pills-05-300-101.parquet
│   └── output
│       └── XGBoost.pickle
└── README.md
  • /data/original will contain all of the original, unmodified data recordings (preserving the original filename with timestamp).
  • /data/processed will contain files that have been re-named or modified in any way (e.g. by having added pid and n_pills cols).
  • /models/input will contain the .parquet or .csv files for training ML models, named as follows pills-<window_size>-<n_obsv>-<n_feats>.parquet.
  • /models/output will contain trained ML models

Script for Loading Files from s3

Task: Write a script that uses an already configured awscli profile called sparkle to load in files from s3. This avoids having to hardcode individual access keys. The script should load in files as a dictionary of spark DataFrames with (key, value) = (n_pills, [DataFrames]).
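A sketch of the loader, assuming the bucket layout from the restructure issue (`data/pills/<PID>/<n_pills>/trial-<num>.csv`) and a Spark session passed in by the caller — the key-parsing regex and paths are assumptions:

```python
import re

def n_pills_from_key(key: str):
    """Parse the pill count from a key like data/pills/07/25/trial-03.csv.
    Returns None for keys that don't match the assumed layout.
    """
    m = re.search(r"/(\d+)/trial", key)
    return int(m.group(1)) if m else None

def load_pill_frames(spark, bucket="msds-sparkle", prefix="data/pills/"):
    """Build {n_pills: [spark DataFrame, ...]} from csvs on s3, using
    the pre-configured `sparkle` awscli profile -- no hardcoded keys.
    """
    import boto3  # imported lazily so the parser above stays dependency-free
    s3 = boto3.Session(profile_name="sparkle").client("s3")
    frames = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            n = n_pills_from_key(obj["Key"])
            if n is None:
                continue
            df = spark.read.csv(f"s3a://{bucket}/{obj['Key']}",
                                header=True, inferSchema=True)
            frames.setdefault(n, []).append(df)
    return frames
```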

Train Models on Various Cluster Configurations

Task: Train different models on various emr clusters. Record cluster_configuration and model_runtime in /docs/cluster_performance.csv, which has the following header:

model;model_params;n_nodes;instance_type;n_records;n_features;train_time

Note: The field delimiter is ; NOT ,

Ex:

  • model = GBRegressor
  • model_params = {maxIter: 150, nfolds: 10} (leave empty {} if set to defaults)
  • n_nodes = 3 (1 master and 2 workers)
  • instance_type = m5.xlarge
  • n_records = 240
  • n_features = 102
  • train_time = 35s

Sphinx Project Documentation

Task: Format final python modules/scripts so that Sphinx can automagically create documentation for the entire project. Function docstrings, for example, can be formatted using Google's style guide:

def test(num, string):
    """Sample docstring in a friendly Sphinx format.

    Args:
        num (int): A number.
        string (str): A string.

    Returns:
        str: The string "foo".
    """
    return "foo"

Not an essential task, but would be nice since this is a public repo 🙂

Web -- Home Dashboard

Homepage dashboard of clickable cards for patients with poor medication adherence rates; clicking a card brings up a comprehensive view of the patient's information, including:

  • Illnesses & current medication, intake history, and amount of medication remaining.
  • Summary of KPIs for medication adherence (current streak; weekly, monthly and lifetime adherence rates).

Scripts For Re-naming Files & Adding Columns

Tasks:

  1. Generate scripts for re-naming data files in the format trial<num>.csv. The full path of each file on s3 looks like s3://msds-sparkle/data/pills/<PID>/<n_pills>/trial<num>.csv
  2. Generate scripts for adding PID and n_pills columns to each input csv file.
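The two scripts might be sketched as follows, assuming pandas for the column step (the sorted-order numbering and column names are assumptions):

```python
import pandas as pd
from pathlib import Path

def rename_trials(directory) -> list:
    """Rename every csv in `directory` to trial<num>.csv, numbering
    files in sorted-filename order. Returns the new paths.
    """
    renamed = []
    for i, f in enumerate(sorted(Path(directory).glob("*.csv")), start=1):
        new = f.with_name(f"trial{i}.csv")
        f.rename(new)
        renamed.append(new)
    return renamed

def add_id_columns(csv_path, pid: int, n_pills: int) -> None:
    """Stamp pid and n_pills onto every row of a trial csv so the
    values survive when files are later concatenated.
    """
    df = pd.read_csv(csv_path)
    df["pid"] = pid
    df["n_pills"] = n_pills
    df.to_csv(csv_path, index=False)
```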

WatchOS -- Get audio data

Right now, CoreMotion provides us with all the movement-related sensor data that we need (accelerometer, gyroscope, etc.). As far as I can tell, it will not give us any audio data.

For the time being we can proceed without it, but ideally we'll find another way to capture audio and feed it into the existing pipeline, directly into s3.

First Unit Test

Task: Write a (any) unit test and configure pytest for repo. This will enable tests to run automatically on every push and pull_request, thanks to our handy GitHub Actions ;)
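A first test could be as small as the sketch below (the function under test is hypothetical — replace it with a real import from the sparkle package once one exists):

```python
# tests/test_smoke.py -- minimal first test so pytest (and the GitHub
# Actions workflow) have something to run.

def window_count(n_obs: int, window_size: int) -> int:
    """Number of full sliding windows of `window_size` over `n_obs` rows."""
    return max(n_obs - window_size + 1, 0)

def test_window_count():
    assert window_count(10, 4) == 7
    assert window_count(3, 4) == 0
```

Running `pytest` from the repo root will then discover and run `test_window_count` automatically.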

Add Status Badges to README

Task: Configure dynamic badges for displaying key information about the project's codebase, e.g. % of the codebase tested, whether the latest build is passing, and % of the code that is documented, as in the screenshot below:

[Screenshot: example README status badges]

GitHub CI Actions for Automatic Testing

Task: Enable automatic testing of code submitted in new PRs. For example, we'll set up an action that runs pylint on all source code. PRs will not be accepted unless they pass all tests (and are approved by a reviewer).

Purpose: To enhance the overall health of the project's codebase and improve the engineering skills of its contributors.
