README for Predicament: Predicting Engagement from Biosensed Data

Overview

This codebase is for importing and processing biosensed data, including data from DREEM EEG devices and E4 wristbands, and for conducting predictive studies on that data.

File structure

The general file structure of the repository is as follows (all paths below are relative to the repository root):

├── data
│   ├── CARE_HOME_DATA
│   ├── featured
│   ├── results
:   :
│   └── windowed
├── notebooks
├── predicament
├── prepare_evaluation_data.py
├── README.md
:
└── requirements.txt

Key files and folders are as follows:

  • predicament is the core library of functionality for all the tasks described in this document.
  • data contains all data directories and is ignored by git. During setup you will need to link, or otherwise make available, the raw care home data in the folder data/CARE_HOME_DATA. Please do not manually add these files to the git repository.
  • data/CARE_HOME_DATA contains the raw data collected from the studies; see Section "Care Home Data" for a description of this. Hereafter we will refer to this folder as <RAW_DATA_DIR>.
  • data/windowed is an autogenerated folder whose subfolders are generated each time a call is made to "window" data. This converts the raw data into windows of uniform length. See Section "Windowing Data" for more.
  • data/featured is an autogenerated folder whose subfolders are generated after windowed data has been featured. See Section "Featuring Data" for more.
  • prepare_evaluation_data.py is a Python script which can be used for both windowing and featuring data.
  • notebooks is a folder containing Jupyter notebooks, each focused on a particular part of the pipeline in development. These notebooks are subject to regular change.

Use Case: Hold-One-Group-Out Cross-Validation of a Random Forest with Featured Data

This use case focuses on using conventional machine learning methods, e.g. Random Forest, MLP, Gradient Boosting, to predict the subject's condition from featured data. After completing the steps below you should be able to run the notebook notebooks/featured_prediction_random_forest. The steps are as follows:

  • Link, or otherwise make available, the raw data in the folder <RAW_DATA_DIR>.
  • Window the data, see Section "Windowing Data".
  • Feature the data, see Section "Featuring Data"
  • Run the notebook.

The notebook will do the following:

  • Load the featured data and separate the features (input data) from the metadata. Each row contains features extracted from a small time-window of the sensed data. The metadata includes the condition, used as the predictive label, and the group (participant), used for the hold-out splits.
  • Balance the data: ensure, as far as possible, that the data is equally balanced over the target classes.
  • Create a hold-one-group-out cross-validation object, so that each fold uses exactly one subject's data as validation (see the sketch after this list).
  • Perform either Random Search or Bayesian Optimisation to determine empirically the best-performing hyperparameter choices.
  • Save and output results.
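
For orientation, the following is a minimal sketch of this kind of hold-one-group-out evaluation using scikit-learn. The file path and the metadata column names ("condition", "participant") are assumptions for illustration only; the actual layout is recorded in the config files described in the sections below.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Hypothetical path and column names; adjust to match your featured data and its config.
df = pd.read_csv("data/featured/dreem_4secs/featured.csv")
meta_cols = ["condition", "participant"]
X = df.drop(columns=meta_cols).values   # feature columns only
y = df["condition"].values              # numerical condition labels
groups = df["participant"].values       # one group per subject

logo = LeaveOneGroupOut()               # each fold holds out one subject as validation
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=logo, groups=groups)
print(scores.mean(), scores.std())

The notebook additionally balances the classes and searches over hyperparameters, which this sketch omits.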

Windowing Data

First we need to load the data into windowed files of a chosen length. For the dreem data (EDF format) we use:

python3 prepare_evaluation_data.py --mode windowed -f dreem

For the E4 data (csv files) we use:

python3 prepare_evaluation_data.py --mode windowed -f E4

The resulting execution will store a dataframe and a config file in ./data/windowed/<subdir>/. By default <subdir> is an autogenerated time-stamp, but you can override this with the --subdir flag. The config file, details.cfg, will include a [LOAD] section recording how the data was loaded and a [WINDOWED] section with details of the overlapping windows. In particular, the output csv file will have a condition column with numerical values for the condition. The item [LOAD][label_mapping] indicates the mapping from index to condition label; note that this may not correspond to the order given in [LOAD][conditions].
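
As a quick illustration, the label mapping can be inspected with Python's configparser (the subdirectory name below is just an example; use whichever <subdir> you generated):

import configparser

config = configparser.ConfigParser()
config.read("data/windowed/dreem_4secs/details.cfg")  # substitute your own <subdir>
print(config["LOAD"]["conditions"])     # conditions requested at load time
print(config["LOAD"]["label_mapping"])  # index -> condition label used in the csv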

In our experiments we have focused on windows of approximately 4 and approximately 10 seconds, though the exact lengths differ slightly depending on the sample rates of the devices (a quick sanity check follows the list below). We also set the conditions to be the 5 active conditions: exper_video, wildlife_video, familiar_music, tchaikovsky, and family_inter. For convenience we save these to predefined subdirectory names too. More explicit calls for the 4 different five-class datasets we used are as follows:

  • for dreem data, 4-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir dreem_4secs --window-size 1024 -f dreem --conditions exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter
  • for dreem data, 10-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir dreem_10secs --window-size 2560 -f dreem --conditions exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter
  • for E4 data, 4-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir E4_4secs --window-size 256 -f E4 --conditions exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter
  • for E4 data, 10-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir E4_10secs --window-size 640 -f E4 --conditions exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter
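
For reference, the window sizes above are sample counts, i.e. roughly the sample rate times the window duration. As a rough check, assuming (these rates are an assumption for illustration, not taken from this document) the dreem EEG channels are sampled at about 250 Hz and the E4 channel used here at 64 Hz:

# Rough check that the window sizes correspond to ~4 s and ~10 s windows.
# The sample rates are assumptions for illustration only.
for device, rate, sizes in [("dreem", 250, [1024, 2560]), ("E4", 64, [256, 640])]:
    for size in sizes:
        print(f"{device}: {size} samples / {rate} Hz = {size / rate:.2f} s")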

Finally, we want to consider grouped conditions, such as inactive covering the baseline and break conditions, and active covering the exper_video, wildlife_video, familiar_music, tchaikovsky, and family_inter conditions. The syntax for specifying grouped conditions is a semi-colon (;) separated list of colon (:) separated key-value (group:conditions) pairs; a small parsing sketch follows the example string below. The above grouping would be specified by the string:

inactive:baseline,break;active:exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter
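
For illustration only (the preparation script does its own parsing), a string in this format decomposes into a dictionary of group names to condition lists as follows:

# Sketch: parse "<group>:<cond>,<cond>;<group>:<cond>,..." into a dict.
spec = "inactive:baseline,break;active:exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter"
groups = {
    name: conditions.split(",")
    for name, conditions in (pair.split(":") for pair in spec.split(";"))
}
print(groups)  # {'inactive': ['baseline', 'break'], 'active': ['exper_video', ...]}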

Our choice of condition grouping corresponds to the following example calls:

  • for dreem data, 4-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir binary_dreem_4secs --window-size 1024 -f dreem --conditions baseline,break,exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter --condition-groups "inactive:baseline,break;active:exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter" 
  • for dreem data, 10-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir binary_dreem_10secs --window-size 2560 -f dreem --conditions baseline,break,exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter --condition-groups "inactive:baseline,break;active:exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter"
  • for E4 data, 4-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir binary_E4_4secs --window-size 256 -f E4 --conditions baseline,break,exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter --condition-groups "inactive:baseline,break;active:exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter"
  • for E4 data, 10-second windows:
python3 prepare_evaluation_data.py --mode windowed --subdir binary_E4_10secs --window-size 640 -f E4 --conditions baseline,break,exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter --condition-groups "inactive:baseline,break;active:exper_video,wildlife_video,familiar_music,tchaikovsky,family_inter"

Notice that we append binary_ to the subdirectory name to distinguish these datasets from the earlier ones. The condition groups will be recorded in the item [LOAD][condition_groups], and the item [LOAD][label_mapping] will now map the index to the condition-group name.

Featuring Data

For featured experiments we have to convert the time-series windows into feature vectors so they can be processed by conventional machine learning models, such as Random Forests. Completing one of the calls from Section "Windowing Data" produces a data file and a config file in a subdirectory of data/windowed; by default this subdirectory has a time-stamped name. You can rename this folder to something more meaningful if desired.

The subdirectory name is then used as a key for the subsequent experiments, and a matching subfolder will be created in the data/featured directory (and other directories), so do not change the windowed subdirectory name after featuring the data. For instance, on 6th Dec 2023 at 19:35, windowing created the folder data/windowed/20231206193533 (relative to the repository root), which corresponds to a subdir of 20231206193533. We can then create features from this by using --subdir 20231206193533, which results in a folder data/featured/20231206193533 containing a data file and a config file. The recommended approach is to generate all supported features in one go with the following command:

python3 prepare_evaluation_data.py --mode featured --subdir 20231206193533

Featuring data incrementally

If you want to construct your features in a series of calls, e.g. to break up the computation time or to debug issues, then you can do so by simply running the featuring call multiple times, each time with a different --feature-group flag (for predefined sets of features) or --feature-set flag (to list individual feature types). If a featured dataset has previously been created for this subdirectory, then this will update the dataset with the new features. For instance, you could start with the call:

python3 prepare_evaluation_data.py --mode featured --subdir 20231206193533 --feature-group stats

This only produces a subset of the features, referred to as feature-group stats.

If you call the script with --mode featured again, it will augment the pre-existing features with any new features, overwriting the pre-existing features with a new copy of identical data (effectively leaving them unchanged). For example, I then ran:

python3 prepare_evaluation_data.py --mode featured --subdir 20231206193533 --feature-set arCoeff,Hurst,LyapunovExponent

and

python3 prepare_evaluation_data.py --mode featured --subdir 20231206193533 --feature-set MaxFreqInd,MeanFreq,FreqSkewness,FreqKurtosis

This extends the pre-existing featured data to include channel-wise features for each of arCoeff, Hurst, LyapunovExponent, MaxFreqInd, MeanFreq, FreqSkewness, and FreqKurtosis. At the time of writing, the second call above is equivalent to:

python3 prepare_evaluation_data.py --mode featured --subdir 20231206193533 --feature-group freq

Some features, e.g. SampleEntropy, can take a long time to compute.
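
When featuring incrementally, it can be useful to check which feature columns are already present before adding more. A small sketch, assuming the featured data is a csv file (the file name below is a guess; use whatever data file appears in your data/featured/<subdir>/ folder):

import pandas as pd

# Hypothetical file name; check your data/featured/<subdir>/ folder for the actual one.
df = pd.read_csv("data/featured/20231206193533/featured.csv", nrows=5)
print(sorted(df.columns))  # feature columns present so far, plus metadata columns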

Older or unchecked material

Partitioning data (deprecated)

Make sure that you have the data in an appropriate data folder. I use ./data, where . is the repository root. The care home data should be in a subfolder called CARE_HOME_DATA (i.e. in ./data/CARE_HOME_DATA).

The first command to run is:

python3 prepare_evaluation_data.py --between -g <channel-group> -w <window-size>

The defaults are a window size of 1024 and the channel group dreem-minimal, but a possibly better option is dreem-eeg. So a good run would be:

python3 prepare_evaluation_data.py --between -g dreem-eeg -w 1024

Good values to try for -w are 256, 512, (768), 1024, (1536), 2048. At the moment the default window step is <window-size>/8. You can specify a different window step with the -s flag. It might make sense to use a step size of 128 for all window sizes.
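
For intuition, the number of windows extracted from a recording of N samples with window size W and step S is roughly (N - W) // S + 1, so smaller steps give more (overlapping) windows:

def n_windows(n_samples, window_size, step):
    # number of complete overlapping windows in a recording of n_samples samples
    return max(0, (n_samples - window_size) // step + 1)

# made-up recording length of 60,000 samples, for illustration only
print(n_windows(60_000, 1024, 1024 // 8))  # default step of window_size/8 = 128
print(n_windows(60_000, 2048, 2048 // 8))  # default step = 256
print(n_windows(60_000, 2048, 128))        # a fixed step of 128 yields more windows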

If you want to specify the channels to use directly, you can do so with the -c flag, giving the channels as a comma-separated string of channel names.

This merges the observational data from the study with the EEG data in the EDF files and produces a set of folds (hold one group out, where each participant represents a group). It places these in a subfolder of the data folder called evaluation/<DATETIME> (e.g. in ./data/evaluation/20230713194411/). Subfolders of this folder are named fold<N>, one for each fold.

Running on Luke's eeg-implementation.

You also need to clone the repository arl-eegmodels (https://github.com/vlawhern/arl-eegmodels) and place it on your PYTHONPATH. For example, I put this repository in the folder ~/git/external/arl-eegmodels/ and then run the command:

export PYTHONPATH=~/git/external/arl-eegmodels/:${PYTHONPATH}

Now you can test this by running python3 and trying import EEGModels (you will need all the dependencies for arl-eegmodels too).
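
A quick one-off check (assuming the arl-eegmodels dependencies are installed and PYTHONPATH is set as above):

# If this prints a path inside arl-eegmodels, the import is working.
import EEGModels
print(EEGModels.__file__)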

Now you can run the grid-search code (as it currently stands) with the command

python3 eegnet_evaluation.py data/evaluation/20230713194411/fold0

This will run the grid search on just the 0th fold.

My aim, instead, is to input something like:

python3 eegnet_evaluation.py data/evaluation/20230713194411 data/results/performance.csv

and for it to:

  • load all previous performance results from data/results/performance.csv associated with all hyperparameter choices of interest,
  • sample a new set of hyperparameter choices from a Bayesian optimisation module (for the hyperparameters we are searching over),
  • run a set number of epochs of training on the model with those hyperparameter choices on every fold in data/evaluation/20230713194411,
  • take the average performance over all folds as the performance for those hyperparameter choices (watch out for NaNs), and
  • save the hyperparameter choices and average performance to data/results/performance.csv.
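
One way the intended loop could be structured is sketched below. This is not existing code: sample_hyperparameters stands in for the Bayesian optimisation step and train_and_evaluate is a hypothetical stub for the per-fold EEGNet training.

import glob
import os
import random

import numpy as np
import pandas as pd

def sample_hyperparameters():
    # placeholder for a Bayesian optimisation module
    return {"learning_rate": 10 ** random.uniform(-4, -2), "dropout": random.uniform(0.1, 0.5)}

def train_and_evaluate(fold_dir, params, epochs=50):
    # hypothetical stub: train EEGNet on this fold for `epochs` epochs and return a validation score
    return float("nan")

evaluation_dir = "data/evaluation/20230713194411"
results_path = "data/results/performance.csv"

params = sample_hyperparameters()
fold_dirs = sorted(glob.glob(os.path.join(evaluation_dir, "fold*")))
scores = [train_and_evaluate(d, params) for d in fold_dirs]
row = {**params, "avg_performance": np.nanmean(scores)}  # NaN-aware average over folds

new_results = pd.DataFrame([row])
if os.path.exists(results_path):
    new_results = pd.concat([pd.read_csv(results_path), new_results], ignore_index=True)
os.makedirs(os.path.dirname(results_path), exist_ok=True)
new_results.to_csv(results_path, index=False)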

