greenelab / tybalt Goto Github PK

Training and evaluating a variational autoencoder for pan-cancer gene expression data

License: BSD 3-Clause "New" or "Revised" License

Shell 0.02% Jupyter Notebook 41.16% Python 3.95% R 0.68% HTML 54.18%

cancer-genomics deep-learning unsupervised-learning gene-expression variational-autoencoder script tool analysis autoencoder variational-autoencoders cancer

tybalt's Issues

Establish naming convention for latent features

This has caused issues in the past (see #63) and was again brought up in greenelab/pancancer#59

I will need to revisit how features are named. This could be as simple as node_algorithm format. For example, ADAGE feature 17 would be 17_adage but could ease frustration and prevent breakage in the future.

Save recon_loss and KL_loss separately to visualize tradeoff

Visualizing how each loss function contributes to the overall loss and how each is learned through each training epoch will be important. It will be important especially if warmup is implemented in #12

High Weigh Genes Outliers

In #95 I removed outliers from visualizations to focus on areas of highest density. I defined outliers by > 3 z score for skew or kurtosis. The features appear to be identifiying similar biological processes across algorithms (e.g. patient sex). I need to decide if focusing on biological assignment for these features should be handled separately than other features.

Assess per Cancer-Type reconstruction cost

this will inform any customized batching requirements to enforce balanced learning

Latent Space Evaluations: Arithmetic

I will need to establish a specific hypothesis and test latent space arithmetic in this context. This analysis may be an extension of the high grade serous ovarian cancer analyses presented in latent_space_comparison.ipynb

t-SNT visualization

hi,
Which part of the code is to achieve T-SNE visualization of all tumor VAE coding features? I didn't find it in the project.

Compressing git LFS files

As @dhimmel brought up in greenelab/pancancer#56 some of the files used are large and should be compressed to save space and speed up access. Once the git LFS data are compressed and better organized, this issue will be resolved.

Brainstorm: Capabilities

I'll use a couple issues to discuss capabilities/challenges/evaluation tactics

Encodes gene expression representation into less dimensions (i.e. dimensionality reduction)
- Nonlinear feature generation
  - What do the feature represent? What about intermediate layers?
- The representation is a reflection of the data generating manifold
  - Possibility of exploring/observing samples along data manifold
    - Where are various covariates?
    - Where are expression sigatures? (TP53, Ras/NF1?)
    - Provide metric for sample similarity along expression signatures (HGSC subtypes?)
Provides ability to generate samples
- Can this help with adding more data to other learning tasks?
- Some sort of in silico drug response to hypothetical sample analysis?

Add reconstruction based analysis

I need to perform a dimesionality reduction algorithm analysis of reconstruction costs. I will slowly increase the number of latent space features and observe how this relates to reconstruction cost for several algorithms.

This is much easier to do for PCA, ICA, and NMF. It will require a substantial refactoring of tybalt (see #74) and adage in order to facilitate these keras based models

Define conda environment to work in GPU cluster

This will be in anticipation of a large parameter sweep to be run on GPU cluster. I will need to determine supported versions

Refactoring Tybalt VAE Code

Code is getting unwieldy with the same functions and classes being defined in multiple places. There is also necessarily a lot of customization required to run each model. This issue will include information on how a refactoring should be performed.

Points to consider:

VAE custom classes require specific input but must also communicate to a new tybalt class
tybalt class should have various methods for training, transforming, and evaluating performance metrics
b. training should allow multiple architecture and hyper parameter setting and getting
a. performance metrics include: avg sample reconstruction, per cancer-type reconstruction, training history visualization, etc.

Top n - High Weight Selection Method

High weight gene selection performed naively by selecting the top n weight genes.

as @jaclyn-taroni pointed out in #95 - this method should be implemented in module tybalt.genes.high_weight_genes and be evaluated in upstream pathway analyses.

Features that represent biological signals

Hi,
I’m interested in the part that you have investigated the biological relevance of extracted latent features. However, I have not been bale to figure where you find the dimensions that are say most important in distinguishing patients' sex. It appears that this is feature 82. Is this feature 82 (out of 100) in the latent space of VAEs with one hidden layer, or two hidden layer? I’m guessing it’s the latter, which is why I was looking into tybalt_twohidden.ipynb. However, here only weights for all samples are extracted based on multiplying the weights of the two hidden layers. So the question I was hoping you can help me with is in the vae with two hidden layer, how the latent feature(s) that distinguishes patients's sex is identified as 82?

Error when setting up environment

Hi Dr. Greg,

I tried to implement your code but was trapped when creating environment using environment.yml you provided. There are some conflicts for the dependency. I am not sure whether it is rasied by the reason the version might be too old. Will you update the dependencies?

Thanks
yurui

Implement Warmup

As discussed in Sonderby et al. 2016 a warm up period from deterministic to stochastic autoencoder model as training epochs proceed may enhance parameter usage.

This stackoverflow post may be useful to review.

Parameter Sweep

I will use this issue to discuss which parameter combos to sweep over

Simulation Experiments

Introduced in #103 - a simulation experiment is worth exploring. Currently, the simulation experiment samples the following signals (copy and pasted from #103):

Groups (based on different normal distributions along given across axes)
Nonlinear functions (any transformation to a randomly sampled vector)
Cell-types (based on a different proportions of normal distributions across given axes)
Presence/absence (random determination if feature is present)
Random noise features

As mentioned, some of the evals envisioned include:

Ability to distinguish groups
Identify nonlinear but continuous patterns
Robust to noise injection
Latent space arithmetic (e.g. Group 1: Yes Feature - Group 1: No Feature + Group 2: No Feature = Group 2: Yes Feature?)
Generative ability of models

As suggested by @jaclyn-taroni, we will use this issue to track important citations and discuss rationale for the simulations and evals.

Replace data in encoded_adage_features.tsv

The file (encoded_adage_features.tsv) as stored on git is only three lines. tsne_tybalt_features.ipynb expects a true tsv file and throws a value error. I ran adage_training.ipynb on my machine and it made a working copy of the file.

Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.

Dear Dr. Greg,

I was trying to run your process_data.py script in nbconverted scripts. This resulted in an error, which I believe was because of the missing column 'hugoSymbol' in tsg_df and also because that indexing with list with missing labels is deprecated. Please find the exact output below. The documentation says:

Changed in version 1.0.0.

Using .loc or [] with a list with one or more missing labels will no longer reindex, in favor of .reindex.

Could I simply use .reindex() instead of .loc() here in this case? I would be very grateful for your help with this issue. Thank you.

Kind regards,
Yagmur

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  status_gain = copy_gain_df.loc[:, oncogenes_df['hugoSymbol']]
/mnt/lsf-nas-1/os-shared/anaconda3/envs/tybalt/lib/python3.5/site-packages/pandas/core/indexing.py:1367: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
tybalt/scripts/nbconverted/process_data.py:255: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  status_loss = copy_loss_df.loc[:, tsg_df['hugoSymbol']]
tybalt/scripts/nbconverted/process_data.py:270: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  copy_status = copy_status.loc[:, mutation_status.columns].fillna(0).astype(int)```

Zero'd out training

Hello,

I've essentially copied your tybalt_vae.ipynb file line for line in a new file to train on some of my own data, but after training on my dataset, it appears the training isn't occurring at all:

The plot of the loss over the iterations is blank
The encoded batchnorm file is zero'd out

However, the code seems to function as intended otherwise (e..g no errors). I've noted that your training data is scaled [0,1], while mine is scaled to be centered on 0, with unit variance (e.g. not [0,1])--could this be the issue?

Reorganize repository

Distinct analyses are becoming confusing to follow. There are many parameter sweeps and simulation evaluations spanning compression models and latent space dimensionality.

The tybalt folder will become a standalone package, and the analyses that live in this repository must be easier to understand, especially as the project expands.

Brainstorm: Improving Model

I'll use a couple issues to discuss capabilities/challenges/evaluation tactics

Read more VAE papers and code
- Archive and discuss in new issues?
Define appropriate parameter sweep strategy
Determine how to test a "better" model
- Visualization performance scripts during training
- Fake vs. Real data classifier
- Reconstruction against adversarial examples
- Identify easy to identify signatures (e.g. sex, tumor/normal)

Add R packages to environment.yml

As pointed out by @ajlee21 in #114 - I need to specify the proper R environment. I will need to add this to environment.ymlasap.

ADAGE Implementation Issues

Initial ADAGE models allowed encoder and decoder weights to vary independently. This resulted in relatively poor performance in comparison to other models in simulated and real data reconstruction tasks.

Recent implementations (#123 , #126) have added an option to tie ADAGE weights together (decoder weights are transposed encoder weights). This improved performance. However, the optimal noise parameter observed in a parameter sweep for the pancanatlas data is noise = 0.

Based on previous observations, this should not be the case! We need to look into this in more detail. Currently, my thoughts are either:

the implementation is incorrect and doesn't match https://github.com/greenelab/adage
we do not sweep over enough noise parameters (i.e. the optimal noise is somewhere between 0 and 0.1)

Separate Diagnostic Notebook

tybalt_vae.ipynb currently trains a VAE on pancan data and outputs the latent layer along with compiled Keras models.

the notebook also does a few sanity checks to ensure the model is actually learning something. This includes ensuring there are no zeroed out features and a quick check on reconstruction fidelity.

A more thorough diagnostic notebook would help ensure robust training. I will use this issue to discuss what should be included in the future nb.

Cancer-type specific reconstruction - possibly point to needing to update batches in training

Switch to Git LFS

Setup Git LFS for future imports of large data files and models

MAD: mean or median?

Hi Greg,

Just a quick comment on something I spotted: in the manuscript you mentioned you've used median absolute deviation (MAD), but in the process_data.ipynb notebook you seem to use the pd.DataFrame.mad() method which, according to the docs, calculates the mean absolute deviation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mad.html

Hope this helps (and sorry if I got that wrong!),
Enrico

logging folder

add logging solution for std out and std err output for pmacs

Latent Space Evaluations: Interpolation

This involves data simulation and interpolating between cancer-states. I need to define and test a hypothesis for this evaluation

write scripts/vae_pancancer.py

this will be the workhorse of the repository - will need to take in command arguments and run through model training and preliminary evaluation.

By preliminary evaluation, I mean it will need to determine quickly which parameter combination is best, and output some sort of metric. I am thinking the metrics should be a combination of lowest loss and biological interpretability.

Reproducing separation

Hi Greg,

Thanks for all the work in putting this together. I'm trying to reproduce some of the results, and can't quite seem to get the separation that is shown in the paper.

At the simplest, I'm loading the existing results saved in encoded_rnaseq_onehidden_warmup_batchnorm.tsv, and then plotting:

plt.figure(figsize=(6, 6))
plt.scatter(encoded_rnaseq_df.iloc[:, 1], encoded_rnaseq_df.iloc[:, 2])
plt.xlabel('Latent Feature 1')
plt.ylabel('Latent Feature 2');

(assuming line 4 above should be plt.ylabel) which produces:

Which seems to have less separation than the sanity check in the IPython notebook. Similarly when I plot feature 53 against 66:

You can kind of see the cluster, but there is less separation than what is produced in Fig. 3B in the paper. When I train the model from scratch to produce the embeddings, I get similar separation to the saved results, and not the clear separation that you see in the papers.

Am I missing something?

ADAGE gene weights

the ADAGE gene weights (saved in cell 17 of adage_training.ipynb) are not saved in the same format as the Tybalt weights. This needs to be addressed.

Extract High Weight Genes for All Features

I will need to determine a consistent file format to extract all high weight genes from each feature encoding.

This will help to facilitate some sort of pathway coverage analysis aiding in an alternative dimensionality reduction algorithm eval.

ERROR: VAE Model reconstruct the gene expression data

Hi Gregory,

I am very happy to see how you have detailed the steps for gene expression VAE based model. I am doing some analysis on gene expression prediction model to classify dementia. I started learning teh applications of VAE model/ machine learning models in omic prediction models. Also I am new to Tensorflow/Keras.

I successfully implemented your code using your gene expression data after disabling the eager execution to make your program work in Tensorflow 2. But when we use our gene expression data (which is log 2 cpm normalized data), I am getting all the reconstructed values as 1.0 so my gene_mean and gene_summary remains same. I evaluated your data distribution ( it looks gene expression data in the range of 0-1).

Could you let me know if you can think of any issue with my input data shape.

Input:

Dimension: (3045, 10956)
data:

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
12.988314	12.83923	12.576047	12.136305	11.978625	12.494600	12.583452	12.415211	10.807796	11.77091	...	3.774915	3.350120	3.774915	2.310416	1.684627	2.745442	3.350120	2.745442	2.310416	3.350120
12.641275	12.56024	12.252506	12.576744	12.883745	12.327777	11.751295	11.776363	12.345054	11.64166	...	2.240370	2.953526	3.618031

input_rnaseq_reconstruct.head(2):

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0

gene_summary:

gene mean	gene abs(sum)
11.867037	11.867037
11.774692	11.774692
11.622958	11.622958
11.489614	11.489614
11.474622	11.474622

Sampling distriubtions

Sorry for all the Issues, but this should be the last one for a while:

Is there an efficient manner to sample the latent space? I'm currently just randomly sampling the range of min/max for each feature, and having the decoder give the proper representation for that latent feature set, but not sure if thats the best route.

EDIT: To clarify, I have a classifier that can distinguish two populations based on rnaseq data. I would like to iteratively go through each gene in my model and see what happens when it is changed/lowered to a specific range. To do this, I'd like to sample the target distribution (for that given gene; likely just a 1sigma interval of 'normal' expression) and then run my classifier on the backend. Given the non-linear dependency of all the other genes on the target gene, the idea is to just randomly sample the latent space(and the learned distributions for each gene), and just select those vectors whose values meet my 1sigma criterion (on a per-gene basis, and hope that the representations of the other genes "respond" to the change in expression on my target gene).

Sampling space for specific genes

Hey, great work on this project! I'm try to re-purpose this for something a little different. After training the VAE, I would like to focus on particular genes, and sample around a specific range, and see how that affects other gene distributions. Is that at all possible?

Pathway Coverage Evaluation

A good indicator of "how much biology did the model learn" is based on a pathway coverage test. I need to see how this was done in the eADAGE paper and then perform the pathway coverage on Pan-Cancer data for dimensionality reduction algorithms: PCA, ICA, NMF, ADAGE, and Tybalt.

Is there a better pathway database for cancer? Perhaps something like KEGG cancer?

Output Gene Weight Matrix

I will want to save the tybalt weights as a separate file on the Master branch

Matching pancancer expression to metadata

Hi @gwaygenomics @cgreene
I would like to map samples (index values) in pancan_scaled_rnaseq.tsv.gz to the metadata tcga-clinical_data.tsv.

Currently the index values for the rnaseq are not unique and I am unable to match them to the metadata.

Could you please advise on how, for example, I could subset the pancancer data to just a single cancer subtype (tying it to the metadata).

Something wrong in extracting weights?

In extract_tybalt_weights.ipynb notebook, you defined a function output_high_weight_genes.

At a certain point you have these 2 lines:
hw_pos_df = pd.DataFrame(encoding_df[encoding_df > encoding_df.std() * thresh])
hw_neg_df = pd.DataFrame(encoding_df[encoding_df < -encoding_df.std() * thresh])

It seems like you are trying to get the most extreme values away from the mean, but you don't really have the mean here. Shouldn't it be something like this?
hw_pos_df = pd.DataFrame(encoding_df[encoding_df > encoding_df.mean() + encoding_df.std() * thresh])
hw_neg_df = pd.DataFrame(encoding_df[encoding_df < encoding_df.mean() - encoding_df.std() * thresh])

Modify Tybalt to handle missing values for incomplete data

Dear Dr. Greg,

Firstly, thank you very much for this well-documented research. This is not an issue but a feature that I need for my thesis. I would like to modify this model so that it can handle missing values in the test and training data. Because the data set I want to use in the future contains missing values in every sample. Imputing missing values before the training e.g. with zeros is not an option as this would introduce bias and the value zero has a meaning in our data set. Therefore I started with replacing the reconstruction error function with my custom binary cross-entropy function that can handle missing values. I tested this function in another notebook and it seemed to work. However, I observe NaN loss values in the training, also even after only using the reconstruction error. If you have any tips for me regarding how to handle missing values in the model, I would be very grateful for your help.

Kind regards,
Yagmur Onay

Keras versioning error

Error using Keras version 2.1.3 and tensorflow version 1.5.0

https://github.com/gwaygenomics/tybalt/blob/84c6079f98ac4aa9269eead0afbd424369a94d20/scripts/vae_pancancer.py#L163

Bug in Training VAE

The notebook I uploaded doesn't have the full training results... I need to rerun. Also should rerun with magic time

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0

greenelab / tybalt Goto Github PK

tybalt's Issues

Recommend Projects

Recommend Topics

Recommend Org

XXbac-BPG248L24.12	TTN	RP11-290D2.6	JSRP1	RP11-115D19.1	HCG4P5	AC114271.2	RP3-394A18.1	ABALON	KB-1208A12.3	...	LNPK	NBPF15	ATP8B4	AC005522.7	CHID1	ARFRP1	NAPB	CTB-133G6.2	SPATA24	POU2F2
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0