Source code and scripts to process the datasets used in the paper: ADataViewer: Exploring Semantically Harmonized Alzheimer’s Disease Cohort Datasets.
- The summary statistics were computed here (quantiles) and here (mean).
- There are 4 distinct scripts that are used to compute the summary statistics.
- Each script selects a subset of participants based on a diagnostic group, except one that takes all diagnoses into account. For instance, "AD_quantiles_all_datasets-v100.ipynb" computes summary statistics of the selected features for participants who were diagnosed with AD. Note: we restrict the datasets to the baseline visit only here.
- The computed summary statistics are then saved as a CSV file here.
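
The steps above can be sketched as follows. This is a minimal illustration rather than the code in the notebooks: the record layout, the field names (`visit`, `diagnosis`, `age`), the `CU` label, and the `summarize` helper are all assumptions, and it uses only the standard library so that it runs without dependencies.

```python
import statistics

# Hypothetical long-format records; field names and values are illustrative only.
records = [
    {"participant": "p1", "visit": "baseline", "diagnosis": "AD", "age": 71.0},
    {"participant": "p1", "visit": "m12", "diagnosis": "AD", "age": 72.0},
    {"participant": "p2", "visit": "baseline", "diagnosis": "AD", "age": 75.0},
    {"participant": "p3", "visit": "baseline", "diagnosis": "CU", "age": 68.0},
    {"participant": "p4", "visit": "baseline", "diagnosis": "AD", "age": 79.0},
]


def summarize(records, feature, diagnosis=None):
    """Mean and quartiles of `feature` at baseline, optionally for one diagnosis."""
    values = [
        r[feature]
        for r in records
        if r["visit"] == "baseline"  # restrict to the baseline visit
        and (diagnosis is None or r["diagnosis"] == diagnosis)
        and r[feature] is not None  # skip missing measurements
    ]
    # statistics.quantiles needs at least two values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


ad_summary = summarize(records, "age", diagnosis="AD")  # one diagnostic group
all_summary = summarize(records, "age")  # all diagnoses
```

Passing `diagnosis=None` mirrors the script that takes all diagnoses into account; the resulting dictionaries could then be written out with `csv.DictWriter`.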
Using the CSV tables, we created the following table.
- Using this script, we create a table that is used to generate pie charts of the ethnicity distribution of any cohort (shown in the figure below).
- The table made by this script is saved here
- To check for the available modality and plot the ranks, we can use this script
- To visualize the biomarker distribution, we generated tables that contain measurements of participants for each feature in all datasets
- The categorical features are Sex, APOE4 and CRD
- All harmonized numerical features are included
- There are dedicated scripts for each diagnostic group as well as one script for all diagnoses
- Each script saves the tables where the columns are the "feature + name of cohort" and rows contain the measurements of participants
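
A table with "feature + name of cohort" columns could be assembled roughly as below. This is a sketch under assumptions: the cohort names, the feature values, the underscore separator, and the `build_wide_table` helper are hypothetical and not taken from the scripts.

```python
from itertools import zip_longest

# Hypothetical baseline measurements per cohort; names and values are illustrative.
measurements = {
    "CohortA": {"Age": [71.0, 75.0, 79.0], "MMSE": [24.0, 21.0]},
    "CohortB": {"Age": [68.0, 70.0], "MMSE": [29.0, 27.0, 28.0]},
}


def build_wide_table(measurements, sep="_"):
    """Return (header, rows) with one column per feature and cohort pair."""
    columns = {}
    for cohort, features in measurements.items():
        for feature, values in features.items():
            # The separator is an assumption; the source only says "feature + name of cohort".
            columns[f"{feature}{sep}{cohort}"] = values
    header = list(columns)
    # Cohorts differ in size, so shorter columns are padded with None.
    rows = [list(row) for row in zip_longest(*columns.values(), fillvalue=None)]
    return header, rows


header, rows = build_wide_table(measurements)
```

Each column then holds the measurements for one box in a boxplot, and the padding values can be dropped before plotting.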
The tables are then used to generate boxplots for numerical features and stacked barplots for categorical features. The figure below is an example of the boxplots.
- To investigate the number of visits for each study cohort as well as the number of patients in each diagnosis stage, use this script
- We can compute the transition from one diagnostic state to another using this
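
One way to count transitions between diagnostic states is sketched here. The visit tuples, the diagnosis labels, and the choice to count each participant at most once per transition type are assumptions made for illustration; the linked script may count differently.

```python
from collections import Counter

# Hypothetical visits as (participant, month, diagnosis); values are illustrative.
visits = [
    ("p1", 0, "CU"), ("p1", 12, "MCI"), ("p1", 24, "AD"),
    ("p2", 0, "MCI"), ("p2", 12, "MCI"), ("p2", 24, "AD"),
    ("p3", 0, "CU"), ("p3", 12, "CU"),
]


def count_transitions(visits):
    """Count participants per (previous, current) diagnosis transition."""
    by_participant = {}
    # Sort by participant and month so that diagnoses are in chronological order.
    for participant, _month, diagnosis in sorted(visits, key=lambda v: (v[0], v[1])):
        by_participant.setdefault(participant, []).append(diagnosis)

    transitions = Counter()
    for diagnoses in by_participant.values():
        seen = set()
        for previous, current in zip(diagnoses, diagnoses[1:]):
            if previous != current:  # staying in the same state is not a transition
                seen.add((previous, current))
        # Assumption: each participant counts once per transition type.
        transitions.update(seen)
    return transitions


transitions = count_transitions(visits)
```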
The following table illustrates the number of patients for each transition
To provide a user-friendly tool for assessing which cohort studies are compatible enough to serve as training and validation datasets, we investigated the feature overlap across our datasets. Additionally, we computed the number of available measurements for each follow-up visit to enable visualization of the collected measurements over the length of the study (script).
Inputs:
- Harmonized feature scape across the cohorts
- Merged dataset of each cohort
Outputs:
- One table per modality, where rows are feature names and columns are the cohorts. A cell contains 1 where the feature was available and 0 where it was not. Note: 0 indicates that the feature was reported in the study but no measurements were collected for any participant of that cohort. The tables are stored here.
- Similar to the point above, one table that contains all the non-existing features in every cohort dataset (we combined the tables into one). This table, called "nonexistence_features.tsv", can be found here.
- To investigate whether each feature of a cohort dataset was collected for all participants of that cohort or only a subset, we generated one table per cohort. For instance, ADNI.tsv contains a random index for each participant as rows and all the investigated features as columns. For each participant, we look into the original dataset (merged table) and check whether a certain feature was recorded at any visit during the study; if so, we store "1" for that participant in the output table.
- The total number of harmonized cohorts is saved as a distinct TSV file per modality and stored here.
- Lastly, for each harmonized feature, we count the number of patients with collected measurements at each visit. The results are stored in separate tables for each investigated modality. For instance, the apoe.tsv table contains information about the features related to APOE status. In this table, the cohort names are the rows and the harmonized features are the columns. Note: for longitudinal cohorts, there are multiple rows with the same cohort name as the index, and the time point of each row is stored under the "Months" column. In other words, the "Months" column indicates whether, at a certain time point of the study, the measurements were collected or simply skipped.
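
The outputs listed above can be illustrated with a small sketch. The data layout, the cohort and feature names, and the three helper functions are hypothetical; they show the general idea of the availability and per-visit count tables, not the implementation in the script.

```python
# Hypothetical merged data: cohort -> list of visit rows; None marks a missing value.
cohorts = {
    "CohortA": [
        {"participant": "a1", "months": 0, "APOE4": 1, "MMSE": 28},
        {"participant": "a1", "months": 12, "APOE4": None, "MMSE": 27},
        {"participant": "a2", "months": 0, "APOE4": 0, "MMSE": None},
    ],
    "CohortB": [
        {"participant": "b1", "months": 0, "APOE4": None, "MMSE": 25},
    ],
}
features = ["APOE4", "MMSE"]


def cohort_availability(cohorts, features):
    """feature -> cohort -> 1 if any participant has a measurement, else 0."""
    return {
        f: {c: int(any(r.get(f) is not None for r in rows)) for c, rows in cohorts.items()}
        for f in features
    }


def participant_availability(rows, features):
    """participant -> feature -> 1 if recorded at any visit, else 0."""
    table = {}
    for row in rows:
        flags = table.setdefault(row["participant"], {f: 0 for f in features})
        for f in features:
            if row.get(f) is not None:  # a measured value of 0 still counts
                flags[f] = 1
    return table


def counts_per_visit(rows, features):
    """months -> feature -> number of participants measured at that visit."""
    measured = {}
    for row in rows:
        visit = measured.setdefault(row["months"], {f: set() for f in features})
        for f in features:
            if row.get(f) is not None:
                visit[f].add(row["participant"])
    return {m: {f: len(p) for f, p in by_f.items()} for m, by_f in sorted(measured.items())}


availability = cohort_availability(cohorts, features)
per_participant = participant_availability(cohorts["CohortA"], features)
per_visit = counts_per_visit(cohorts["CohortA"], features)
```

Testing with `is not None` rather than truthiness matters here, because a categorical value such as an APOE4 count of 0 is a valid measurement.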
All the described outputs are then used for the "StudyPicker" as well as "Longitudinal" (i.e. Biomarker-specific Follow-up) tools on the website.
The figure below is an example of "StudyPicker".
The figure below is an example of a longitudinal plot for a set of features in 4 distinct cohorts.
- Yasamin Salimi: [email protected]
- Colin Birkenbihl: [email protected]
Salimi, Y., Domingo‐Fernández, D., Bobis-Álvarez, C., Hofmann‐Apitius, M., for the Alzheimer's Disease Neuroimaging Initiative, the Japanese Alzheimer’s Disease Neuroimaging Initiative, for the Aging Brain: Vasculature, Ischemia, and Behavior Study, the Alzheimer's Disease Repository Without Borders Investigators, for the European Prevention of Alzheimer’s Disease (EPAD) Consortium, Birkenbihl, C. ADataViewer: Exploring Semantically Harmonized Alzheimer’s Disease Cohort Datasets (2021), medRxiv, 2021.09.01.21262607
The code in this package is licensed under the MIT License.