
NSIDES project

Analysis notebooks and database interaction scripts for the NSIDES project. The goal of the NSIDES project is to extend adverse event discovery (specifically using FDA AERS data as in [1]) to higher-order drug combinations, such as triplets, quadruplets, etc.

[1] Tatonetti, Nicholas P., P. Ye Patrick, Roxana Daneshjou, and Russ B. Altman. "Data-driven prediction of drug effects and interactions." Science translational medicine 4, no. 125 (2012): 125ra31-125ra31. doi:10.1126/scitranslmed.3003377

Overview - Steps to run everything

Preprocessing

  1. Download all outcomes (MedDRA concepts) from the MySQL database (nb/1.server/1.get_outcomes_meddra.ipynb)
  2. Create reformatted matrices of outcomes, also saving the id-to-index keys as vectors (nb/2.preprocessing/1.format_outcomes_data.ipynb)
  3. Create reformatted matrices of exposures, also saving the id-to-index keys as vectors (nb/2.preprocessing/2.format_drug_exposures_data.ipynb)
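Steps 2 and 3 above reformat raw ID pairs into sparse matrices while saving id-to-index key vectors. A minimal sketch of that pattern (the `pairs` data and file names here are hypothetical stand-ins; the real notebooks read from the database):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical (report_id, outcome_id) pairs standing in for rows pulled
# from the database.
pairs = np.array([[101, 7], [101, 9], [102, 7], [103, 11]])

# id-to-index key vectors, saved alongside the matrix so that rows and
# columns can later be mapped back to report and concept IDs.
report_ids = np.unique(pairs[:, 0])
outcome_ids = np.unique(pairs[:, 1])
row = np.searchsorted(report_ids, pairs[:, 0])
col = np.searchsorted(outcome_ids, pairs[:, 1])

# Sparse (reports x outcomes) indicator matrix
outcomes = csc_matrix((np.ones(len(pairs)), (row, col)),
                      shape=(len(report_ids), len(outcome_ids)))

np.save('report_id_vector.npy', report_ids)    # index -> report ID
np.save('outcome_id_vector.npy', outcome_ids)  # index -> outcome ID
```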

Computation

  1. Create file maps for OFFSIDES and TWOSIDES (scripts/1.compute_file_maps.py)
  2. Compute all propensity scores, averaging across the 20 bootstrap iterations and using only those iterations where AUC > 0.5 (scripts/2.compute_propensity_scores.py)
  3. Compute all disproportionality statistics for OFFSIDES and TWOSIDES (PRR, PRR_error, A, B, C, D, and mean (reporting frequency)) (scripts/3.compute_prr.py)
  4. Combine all disproportionality data into single files, one for each n (i.e. offsides_prr.csv.xz, twosides.csv.xz); the PRR files were originally split to allow parallelization (scripts/4.combine_prr_clean.py)

Table formatting

These notebooks, located in nb/3.format_tables/, reformat computed data into the tables that will be inserted into the database.

Table inserts

These notebooks are run on the local server and they insert created tables and check various facts about the tables, once inserted. For more information, see the notebooks at nb/4.insert_tables/.

Method notes

PRR

A contingency table can be drawn using exposed and unexposed cohorts produced by propensity score matching.

                     Had outcome    Didn't have outcome
  Drug exposed            A                  B
  Not drug exposed        C                  D

Using these definitions,

    PRR = (A / (A + B)) / (C / (C + D))

and the error is

    PRR_s = sqrt(1/A + 1/C - 1/(A + B) - 1/(C + D))
Several consequences of these definitions should be taken into account when inspecting the data.

  • PRR is NaN when both A and C are zero.
  • PRR is Inf when C is zero but A is greater than zero.
  • PRR is zero when A is zero and C is not zero.
  • PRR_s is Inf when A or C or both is zero.
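The edge cases above can be sketched in Python with numpy, which returns NaN and Inf rather than raising on division by zero (the counts below are made-up examples):

```python
import numpy as np

def prr_and_error(A, B, C, D):
    """PRR and its error from the 2x2 contingency table counts."""
    with np.errstate(divide='ignore', invalid='ignore'):
        prr = (A / (A + B)) / (C / (C + D))
        prr_s = np.sqrt(1 / A + 1 / C - 1 / (A + B) - 1 / (C + D))
    return prr, prr_s

# Four illustrative cases: finite, A=C=0, C=0 with A>0, A=0 with C>0
A = np.array([10., 0., 5., 0.])
C = np.array([ 2., 0., 0., 4.])
B = np.full(4, 90.)
D = np.full(4, 95.)

prr, prr_s = prr_and_error(A, B, C, D)
# prr is, in order: finite, NaN, Inf, 0
```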

Setup

The notebooks and scripts in this repository expect that certain source files are properly located. The data/ layout I employed is the following:

.
+-- data
|   +-- aeolus
|   |   +-- AEOLUS_all_reports_IN_0.npy
|   |   +-- ... (all-inclusive)
|   |   +-- AEOLUS_all_reports_IN_54.npy
|   +-- archives
|   |   +-- 1
|   |   |   +-- scores_1.tgz
|   |   |   +-- ... (all-inclusive)
|   |   |   +-- scores_220.tgz
|   |   +-- 2
|   |   |   +-- scores_1.tgz
|   |   |   +-- ... (all-inclusive)
|   |   |   +-- scores_?.tgz
|   +-- scores
|   |   +-- 1
|   |   |   +-- 0.npz
|   |   |   +-- ... (not all-inclusive)
|   |   |   +-- 4391.npz
|   +-- prr
|   |   +-- 1
|   |   |   +-- 0.csv.xz
|   |   |   +-- ... (not all-inclusive)
|   |   |   +-- 4391.csv.xz
|   |   +-- 2
|   |   |   +-- ....csv.xz
|   +-- meta_formatted
|   +-- tables
|   +-- output_archives

aeolus

aeolus files are (reports x drug exposures). The split into 55 files, apparently done to keep individual files small, predates my joining the project. In nb/2.reformat_exposures_outcomes.ipynb I load and combine all these files, in order, and save the resulting array as data/meta/all_drug_exposures.npz, a scipy.sparse.csc_matrix of dimension (4694086 x 4396). I had to truncate the combined array to have the correct number of reports, because AEOLUS_all_reports_IN_54.npy had excess all-zero rows at the end, giving the combination of all AEOLUS_all_reports_IN_**.npy files more rows than there are reports. The number of possible exposures in this array, 4396, is incorrect, because it contains duplicate columns (3453 is the correct number). However, to maintain consistency with the work previously done I did not correct this error; I simply used the first column corresponding to each ingredient as the true values for that ingredient.

It is possible, of course, to use the code here when files are located elsewhere, but care must be taken. When possible, I have attempted to make these path assignments in obvious locations, such as the first cell in a notebook or one of the first lines in main for scripts, though some paths may still be irregularly relative.

archives

archives contains .tgz (.tar.gz) archives of propensity scores for each drug. The first-level subdirectories indicate whether the archives belong to OFFSIDES, TWOSIDES, ... Computing these scores involved 20 bootstrap iterations per drug, meaning that each archive holds 20 propensity score files per drug, which should be averaged to find the final propensity scores used for later computation. Each archive contains these bootstrap-iteration propensity scores (scores_lrc_<drug>__<bootstrap>.npy), as well as performance metrics for each of these files (log_lrc_<drug>__<bootstrap>.npy). In computing the average, I used only those bootstrap iterations with AUROC > 0.5.
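The filter-then-average step can be sketched as below (the scores and AUROC values are made-up; the real ones come from the scores_lrc_* and log_lrc_* files inside each archive):

```python
import numpy as np

# Hypothetical bootstrap propensity scores (iterations x reports) and
# the AUROC of each iteration for one drug.
bootstrap_scores = np.array([
    [0.2, 0.8, 0.5],
    [0.3, 0.7, 0.4],
    [0.9, 0.1, 0.2],  # AUROC <= 0.5, so this iteration is dropped
])
aurocs = np.array([0.71, 0.66, 0.43])

# Average only the iterations whose AUROC exceeds 0.5
keep = aurocs > 0.5
final_scores = bootstrap_scores[keep].mean(axis=0)
```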

scores

This directory is initially empty but comes to be filled with one file per drug. The first-level subdirectories indicate whether the files belong to OFFSIDES, TWOSIDES, ... Because I averaged only those bootstrap iterations with AUC > 0.5, some drugs do not have corresponding propensity score files. Of the 3453 unique drugs, ultimately only 2757 have propensity score files. The files in this directory are compressed numpy.ndarrays stored in the .npz format. Because this format is intended to store multiple arrays, the scores can be accessed by loading the .npz file with numpy.load() and then extracting the scores by key (i.e. loaded['scores']).
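The round trip looks like this (the array values and the file name `0.npz` are illustrative):

```python
import numpy as np

# Averaged propensity scores for one drug (made-up values)
scores = np.array([0.25, 0.75, 0.45])
np.savez_compressed('0.npz', scores=scores)  # one .npz file per drug

# .npz files hold multiple named arrays; access them dict-style by key
loaded = np.load('0.npz')
recovered = loaded['scores']
```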

These files are each between 50 KB and 11 MB.

prr

prr is also initially empty, and it also comes to be filled with one file per drug (the same 2757 as in scores). The first-level subdirectories indicate whether the files belong to OFFSIDES, TWOSIDES, ... These files contain disproportionality statistics for a given drug and all (MedDRA) outcomes. Each file is a .csv file compressed using the LZMA algorithm (yielding an .xz file). The columns of these files are the following: drug_id (RxNorm ID), outcome_id (MedDRA ID), A, B, C, D, PRR, and PRR_error.

These values correspond to the following:

  • A is the number of reports with exposure to the drug who had the outcome
  • B is the number of reports with exposure to the drug who did not have the outcome
  • C is the number of reports without exposure to the drug who had the outcome
  • D is the number of reports without exposure to the drug who did not have the outcome
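A per-drug file with these columns can be written and read with pandas, which infers xz compression from the file suffix (the IDs and counts below are hypothetical):

```python
import pandas as pd

# A tiny stand-in for one per-drug PRR file; columns match the text above
df = pd.DataFrame({
    'drug_id': [19097016, 19097016],       # hypothetical RxNorm ID
    'outcome_id': [10000001, 10000002],    # hypothetical MedDRA IDs
    'A': [10, 0], 'B': [90, 100],
    'C': [2, 4], 'D': [95, 93],
    'PRR': [4.85, 0.0], 'PRR_error': [0.74, float('inf')],
})
df.to_csv('0.csv.xz', index=False)  # pandas infers xz from the .xz suffix

back = pd.read_csv('0.csv.xz')
```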

For C and D, the number of unexposed reports was determined by binned propensity score matching. That is, the propensity scores were binned, and the exposed reports in each bin were matched to 10x as many unexposed reports from the same bin, sampled with replacement. When every bin containing an exposed report also had at least one unexposed report, the combined unexposed group contains 10x the total number of exposed reports. However, some bins contained only exposed reports; in these cases, no unexposed reports were added for the bin.
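A minimal sketch of this binned matching, using synthetic propensity scores and assuming 10 equal-width bins (the bin count and cohort sizes are illustrative, not the project's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic propensity scores; the real ones come from data/scores/
exposed_scores = np.linspace(0.05, 0.95, 50)
unexposed_scores = rng.uniform(0, 1, size=5000)

bins = np.linspace(0, 1, 11)  # 10 equal-width propensity bins
matched = []
for lo, hi in zip(bins[:-1], bins[1:]):
    n_exposed = np.sum((exposed_scores >= lo) & (exposed_scores < hi))
    pool = np.where((unexposed_scores >= lo) & (unexposed_scores < hi))[0]
    if n_exposed == 0 or len(pool) == 0:
        continue  # a bin with only exposed reports contributes no controls
    # 10 unexposed reports per exposed report, sampled with replacement
    matched.append(rng.choice(pool, size=10 * n_exposed, replace=True))
matched = np.concatenate(matched)
```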

These files are each between 8 and 160 KB. The combined data/full_prr.csv.xz file is 52 MB, though it excludes rows with PRR = NaN.

meta_formatted

This directory is for a number of files, including the following:

  • drug_exposure_matrix.npz
    • Matrix of reports by drugs.
  • outcome_matrix.npz
    • Matrix of reports by outcomes.
  • report_id_vector.npy
    • Vector giving the report ID at each index in the matrix
  • drug_id_vector.npy
    • Vector giving the drug ID at each index in the matrix.
  • file_map_offsides.csv
    • This file shows where score and log files are located.
  • outcomes_table.csv.xz
    • Equivalent to the standard_case_outcome table from effect_aeolus on the local server.
  • outcome_id_vector.npy
    • Vector giving the outcome ID at each index in the matrix.

tables

This directory is for locally saving tables that will later be inserted into the effect_nsides MySQL database.


nsides-release's Issues

PRR

Dear maintainers,

My name is Jordi, and I'm contacting you for the following reason.
I want to compare OFFSIDES with other side effect sources, and I want to filter by some confidence level. I think that PRR is the quality measure, but I'm not sure. If so, could you tell me whether there is a quality threshold number for filtering the interactions reported by OFFSIDES? For example, MEDEFFECT and FAERS use the term "suspect" to indicate high quality; the antagonist is "concomitant".

Thanks for your help and time,

Jordi

scores *.tgz files

First and foremost, thank you for sharing your work :^)
I have a question about the scores_*.tgz files.
I need the code that generates data/archives/1/scores_*.tgz and data/archives/2/scores_*.tgz in order to run scripts/1.compute_file_maps.py, but I can't seem to find it. I'd like to know how I can obtain that code.

about external_maps

Hi, first of all, thank you for integrating the data. This has helped me a lot. Secondly, I am trying to perform ID conversion. I would like to ask where I can download '../../data/external_maps/RxNorm.csv'.
