msarmi9 / sparkle
Promoting medication adherence with ML ✨
Home Page: https://www.sparklemed.com
Task: Clean & minimal home dashboard displaying the current week's intake schedule, organized by day and by user-customized intake times that conform to a doctor-prescribed schedule.
This looks relatively straightforward. We basically just need to be able to read our data into a jupyter notebook, like Diane shows; she has a video up on canvas explaining the process. Looks like we'll need to configure the permissions policy of our s3 bucket.
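Once the bucket policy is in place, reading a file from the bucket inside an EMR notebook should be a one-liner with PySpark. A minimal sketch (the exact key under data/commute/ is hypothetical):

```python
# Minimal sketch: read one of our csv files from s3 into a Spark DataFrame.
# On an EMR notebook a SparkSession usually already exists as `spark`;
# getOrCreate() just reuses it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(
    "s3://msds-sparkle/data/commute/trial01.csv",  # hypothetical key
    header=True,
    inferSchema=True,
)
df.show(5)
```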
Per Diane's instructions, each of us must run Andy's dc-project-AC.ipynb notebook on a unique EMR cluster configuration, and then we will compare the run times. She recommends putting all our spark data processing code into a single cell so we can easily capture the full running time. I have made that edit to Andy's notebook and renamed it to dc-project-AC-timed.ipynb; this will be posted in the sparkle slack channel.
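As a fallback for timing that single cell, plain wall-clock timing works too. A sketch (the cell contents are whatever is in the notebook):

```python
# Wall-clock timing for the one big data-processing cell.
# Note: Spark is lazy, so make sure the cell ends with an action
# (e.g. .count() or a write) so the work actually runs inside the timed region.
import time

start = time.perf_counter()
# ... all the pyspark data processing code from the notebook goes here ...
print(f"Total running time: {time.perf_counter() - start:.2f} seconds")
```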
It seems fairly open-ended in terms of how our EMR cluster configurations should differ, so I propose we all run the notebook with the default m5.xlarge instance type but with a different number of instances each. For simplicity, let's use the following counts:
Collin: 5 `m5.xlarge` instances
Kevin: 4 `m5.xlarge` instances
Andy: 3 `m5.xlarge` instances
Stephanie: 2 `m5.xlarge` instances
Matt: 1 `m5.xlarge` instance
So, when launching, my configuration looked like this: [screenshot of launch configuration]
For the project, it would probably be good for us all to get a screenshot of the cluster config while it's running, which you can get in the AWS console if you go to EMR -> Clusters -> (toggle your cluster) -> View details -> Hardware. Mine looks like this: [screenshot of cluster hardware]
Finally, after getting things all set up, upload dc-project-AC-timed.ipynb, run the entire notebook, then go down and toggle the Spark Job Progress under the big cell which contains all the pyspark data processing code to get your configuration's runtime. For example, with 5 m5.xlarge instances, my code took 12.11 seconds to run.
And once we all do that, we should be good to go! We can put all of our different configurations and run times into a nice table in our presentation. 👍
Task: Update README to give at least a rough overview/outline of our project. Doesn't need to be super detailed, as we can continue to flesh it out as we go. Writing this after the EMBC paper (Issue #37) is probably a good idea.
Task: Train different models on various emr clusters. Record cluster_configuration and model_runtime in /docs/cluster_performance.csv, which has the following header:
model;model_params;n_nodes;instance_type;n_records;n_features;train_time
Note: The field delimiter is ; NOT ,
Ex:
- model = GBRegressor
- model_params = {maxIter: 150, nfolds: 10} (leave empty {} if set to defaults)
- n_nodes = 3 (1 master and 2 workers)
- instance_type = m5.xlarge
- n_records = 240
- n_features = 102
- train_time = 35s
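Put together, that example would appear in /docs/cluster_performance.csv as a single semicolon-delimited row under the header:

```
model;model_params;n_nodes;instance_type;n_records;n_features;train_time
GBRegressor;{maxIter: 150, nfolds: 10};3;m5.xlarge;240;102;35s
```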
Task: Set up awslabs/git-secrets to automatically scan for & prevent uploading of access keys in any future commits 😎
Task: Train models (XGBoost, h2o automl, etc.) on the log-transformed data (lite_300.csv in the s3 bucket).
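A minimal h2o automl sketch for this, assuming h2o is installed on the cluster; the exact s3 key for lite_300.csv and the target column name ("target") are placeholders:

```python
# Sketch: run h2o automl on the log-transformed data.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("s3://msds-sparkle/lite_300.csv")  # exact key may differ

aml = H2OAutoML(max_models=10, seed=42)
aml.train(y="target", training_frame=frame)  # "target" is a placeholder label
print(aml.leaderboard)
```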
Task: Run the same notebook (dc-project-AC-timed.ipynb) on EMR with scaled-out instances (more ec2 instances, less computing power per instance). Curious to see how the same total amount of memory distributed to more worker nodes affects runtime performance (e.g. 8 workers at 8 GB RAM each vs 4 workers at 16 GB RAM each).
Builds upon #14. Each config group will be run with 1 or 3 m5.xlarge master nodes and x m4.large worker nodes (8 GB RAM each):
Config 1: 10 workers
Config 2: 8 workers
Config 3: 6 workers
Config 4: 4 workers
Config 5: 2 workers
Homepage dashboard of clickable cards for patients with poor medication adherence rates, which, when clicked, bring up a comprehensive view of patient information, including:
Task: Join all csv files in s3://msds-sparkle/data/commute/ into a single csv file for ease of visualizing and processing the entire dataset.
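One way to do this with PySpark, assuming all the files share the same schema (the output prefix is hypothetical):

```python
# Read every csv under the prefix at once, then write out a single file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
combined = spark.read.csv(
    "s3://msds-sparkle/data/commute/", header=True, inferSchema=True
)

# coalesce(1) forces a single output part-file; fine at this data size,
# though it funnels the write through one executor.
combined.coalesce(1).write.mode("overwrite").csv(
    "s3://msds-sparkle/data/commute-combined/",  # hypothetical output prefix
    header=True,
)
```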
Tasks:
- Name each file trial<num>.csv. The full path of each file on s3 looks like s3://msds-sparkle/data/pills/<PID>/<n_pills>/trial<num>.csv
- Add PID and n_pills columns to each input csv file (see the PySpark sketch below).
- Identify the optimal way to get data from Phone/watch sensors into s3.
Task: Create a bootstrap script for installing boto3, h2o, and local modules when launching an aws emr cluster.
Task: Choose a license and add a LICENSE.md to the repo.
Task: Configure watchOS app to log data to S3 bucket
Task: Write a (any) unit test and configure pytest for the repo. This will enable tests to run automatically on every push and pull_request, thanks to our handy GitHub Actions ;)
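Something as small as this would do for a first test, e.g. in tests/test_naming.py. The parse_n_pills helper is hypothetical; any real function from our modules works just as well:

```python
# tests/test_naming.py -- a minimal first test to get pytest running in CI.
def parse_n_pills(filename: str) -> int:
    """Extract n_pills from a name like 'pills-07-25.csv'."""
    stem = filename.rsplit(".", 1)[0]  # "pills-07-25"
    return int(stem.split("-")[-1])

def test_parse_n_pills():
    assert parse_n_pills("pills-07-25.csv") == 25
```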
Right now, CoreMotion provides us with all the movement-related sensor data that we need (accelerometer, gyroscope, etc). As far as I can tell, it will not give us any audio data.
For the time being, we can proceed without it, but ideally we'll be able to find another way to get to it, and feed it into the existing pipeline, directly into s3.
Task: Re-organize the data in the project s3://msds-sparkle bucket as follows:
/
├── data
│   ├── original
│   │   └── 01-pid
│   │       └── 05-pills
│   │           └── trial-03.csv
│   └── processed
│       └── 01-pid
│           └── 05-pills
│               └── trial-03.csv
├── models
│   ├── input
│   │   └── pills-05-300-101.parquet
│   └── output
│       └── XGBoost.pickle
└── README.md
- /data/original will contain all of the original, unmodified data recordings (preserving the original filename with timestamp).
- /data/processed will contain files that have been re-named or modified in any way (e.g. by having added pid and n_pills cols).
- /models/input will contain the .parquet or .csv files for training ML models, named as follows: pills-<window_size>-<n_obsv>-<n_feats>.parquet
- /models/output will contain trained ML models
@steph-jung Would love to have your script for flattening the data/pills/original directories into a single directory to place in the s3 bucket. Whenever you're done tweaking it, put in a PR :D
Task: Translate the PySpark script that processes sensor data into the final training frame for the DL model into plain Python.
Task: Write a script that uses an already configured awscli profile called sparkle to load in files from s3. This avoids having to hardcode individual access keys. The script should load in files as a dictionary of spark DataFrames with (key, value) = (n_pills, [DataFrames]).
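A sketch of what that script could look like, assuming the data/pills/<PID>/<n_pills>/ layout (boto3 lists keys with the sparkle profile; Spark itself reads via the cluster's own credentials):

```python
# Build {n_pills: [DataFrame, ...]} from the bucket using the sparkle profile.
from collections import defaultdict

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
s3 = boto3.Session(profile_name="sparkle").client("s3")

frames = defaultdict(list)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="msds-sparkle", Prefix="data/pills/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]  # e.g. data/pills/01/05/trial03.csv
        if not key.endswith(".csv"):
            continue
        n_pills = int(key.split("/")[3])  # the <n_pills> path segment
        frames[n_pills].append(
            spark.read.csv(f"s3://msds-sparkle/{key}", header=True, inferSchema=True)
        )
```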
Run k-fold cross-validation on all candidate models to assess variance. Code should go in the notebook that contains model training and evaluation.
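A k-fold sketch with PySpark's CrossValidator, assuming a training frame train_df with "features" and "label" columns (both names are placeholders) and using GBTRegressor as a stand-in for the GBRegressor mentioned above:

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

gbt = GBTRegressor(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(gbt.maxIter, [50, 150]).build()
cv = CrossValidator(
    estimator=gbt,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="label", metricName="rmse"),
    numFolds=5,
)
cv_model = cv.fit(train_df)
# avgMetrics is the mean RMSE per param combination across the 5 folds;
# for per-fold variance, split the folds manually and evaluate each yourself.
print(cv_model.avgMetrics)
```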
Task: Enable automatic testing of code submitted in new PRs. For example, we'll set up an action for running pylint on all source code. PRs will not be accepted unless they pass all tests (and are approved by a reviewer).
Purpose: To enhance the overall health of the project's codebase and improve the engineering skills of its contributors.
Need to implement watch connectivity for two-way communication between watch and phone.
Task: Start/stop button for logging sensor data (gyroscope, accelerometer, audio) to iPhone (which the iPhone then sends to s3).
Task: Write a brief doc covering
Task: Name csv files for newly collected pill data in a smart format and put them on s3. Maybe <activity>-<PID>-<substring>.csv, e.g. pills-07-25.csv for subject 07 taking n_pills=25? But this also looks like month/date...
Goal: gather enough data for our project by recording our morning commute on Wed 11/20.
Issue: Data needs to be trimmed at the start and end to account for starting and stopping the recording feature on the apple watch.
Task: Format final python modules/scripts so that Sphinx can automagically create documentation for the entire project. Function docstrings, for example, can be formatted using Google's style guide:
def test(num, string):
    """Sample docstring in a friendly Sphinx format.

    Args:
        num (int): A number
        string (str): A string

    Returns:
        foo (str): String "foo"
    """
    return "foo"
Not an essential task, but would be nice since this is a public repo 🙂
Issue: The security groups for EMR Notebooks are tricky. They need to be configured to sync with the repo so that when a notebook is launched the project's source files are already loaded. Pushes to the repo should also be enabled.