
embedding's Issues

Can the model really learn??

I'm seeing a disturbing pattern: performance is not improving from epoch to epoch.

Different macrobatches show better or worse performance, but the loss is very consistent with which files make up each macrobatch.

Below is a trace of the output from a sample embedding run, trained on the positive probes that pass Han's QC stages. It's clear that, across epochs, the macrobatches covering the same files in the data yield an average loss that does not decrease in an epoch-dependent manner:

zamparol$ less fasta_seqs.txt | grep -e average -e macrobatch -e epoch

starting epoch 0
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090376
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090356
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090346
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090362
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090349
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090341
receiving macrobatch from child process
running macrobatch 6
average loss: 5.876007
receiving macrobatch from child process
running macrobatch 7
average loss: 5.559565
receiving macrobatch from child process
running macrobatch 8
average loss: 5.803356

starting epoch 1
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090372
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090356
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090366
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090344
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090327
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090322
receiving macrobatch from child process
running macrobatch 6
average loss: 5.513479
receiving macrobatch from child process
running macrobatch 7
average loss: 5.577124
receiving macrobatch from child process
running macrobatch 8
average loss: 5.952247

starting epoch 2
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090325
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090312
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090318
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090284
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090295
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090289
receiving macrobatch from child process
running macrobatch 6
average loss: 5.597625
receiving macrobatch from child process
running macrobatch 7
average loss: 5.679941
receiving macrobatch from child process
running macrobatch 8
average loss: 5.735664

starting epoch 3
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090294
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090319
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090273
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090270
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090262
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090235
receiving macrobatch from child process
running macrobatch 6
average loss: 5.614810
receiving macrobatch from child process
running macrobatch 7
average loss: 5.614810
receiving macrobatch from child process
running macrobatch 8
average loss: 5.829066

starting epoch 4
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090243
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090234
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090187
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090198
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090191
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090154
receiving macrobatch from child process
running macrobatch 6
average loss: 5.606859
receiving macrobatch from child process
running macrobatch 7
average loss: 5.774870
receiving macrobatch from child process
running macrobatch 8
average loss: 5.686827

starting epoch 5
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090167
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090161
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090095
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090075
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090047
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090009
receiving macrobatch from child process
running macrobatch 6
average loss: 5.545681
receiving macrobatch from child process
running macrobatch 7
average loss: 5.530536
receiving macrobatch from child process
running macrobatch 8
average loss: 5.955928

starting epoch 6
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090011
receiving macrobatch from child process
running macrobatch 1
average loss: 11.089976
receiving macrobatch from child process
running macrobatch 2
average loss: 11.089961
receiving macrobatch from child process
running macrobatch 3
average loss: 11.089908
receiving macrobatch from child process
running macrobatch 4
average loss: 11.089851
receiving macrobatch from child process
running macrobatch 5
average loss: 11.089799
receiving macrobatch from child process
running macrobatch 6
average loss: 5.628832
receiving macrobatch from child process
running macrobatch 7
average loss: 5.542895
receiving macrobatch from child process
running macrobatch 8
average loss: 5.874512

starting epoch 7
receiving macrobatch from child process
running macrobatch 0
average loss: 11.089784
receiving macrobatch from child process
running macrobatch 1
average loss: 11.089684
receiving macrobatch from child process
running macrobatch 2
average loss: 11.089632
receiving macrobatch from child process
running macrobatch 3
average loss: 11.089536
receiving macrobatch from child process
running macrobatch 4
average loss: 11.089473
receiving macrobatch from child process
running macrobatch 5
average loss: 11.089358
receiving macrobatch from child process
running macrobatch 6
average loss: 5.663543
receiving macrobatch from child process
running macrobatch 7
average loss: 5.709703
receiving macrobatch from child process
running macrobatch 8
average loss: 5.704039

starting epoch 8
receiving macrobatch from child process
running macrobatch 0
average loss: 11.089377
receiving macrobatch from child process
running macrobatch 1
average loss: 11.089262
receiving macrobatch from child process
running macrobatch 2
average loss: 11.089121
receiving macrobatch from child process
running macrobatch 3
average loss: 11.088958
receiving macrobatch from child process
running macrobatch 4
average loss: 11.088927
receiving macrobatch from child process
running macrobatch 5
average loss: 11.088632
receiving macrobatch from child process
running macrobatch 6
average loss: 5.636472
receiving macrobatch from child process
running macrobatch 7
average loss: 5.469601
receiving macrobatch from child process
running macrobatch 8
average loss: 5.911286

starting epoch 9
receiving macrobatch from child process
running macrobatch 0
average loss: 11.088615
receiving macrobatch from child process
running macrobatch 1
average loss: 11.088458
receiving macrobatch from child process
running macrobatch 2
average loss: 11.088245
receiving macrobatch from child process
running macrobatch 3
average loss: 11.088001
receiving macrobatch from child process
running macrobatch 4
average loss: 11.087771
receiving macrobatch from child process
running macrobatch 5
average loss: 11.087394
receiving macrobatch from child process
running macrobatch 6
average loss: 5.789453
receiving macrobatch from child process
running macrobatch 7
average loss: 5.532505
receiving macrobatch from child process
running macrobatch 8
average loss: 5.703700

This could be due to many things, but most likely it's a failure of the DatasetReader to parse the data into the format that the model (either CBOW or skip-gram) actually expects. This is a problem. I'll try testing with a gensim implementation.
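
As a sanity check, here is a minimal sketch of the gensim comparison I have in mind, assuming gensim >= 4 and probes already reduced to one k-mer sentence per line (the file name and hyperparameters are placeholders):

    # Train gensim's skip-gram word2vec on the same k-mer sentences and check
    # whether the loss and nearest neighbours behave sensibly.
    from gensim.models import Word2Vec

    with open("kmer_sentences.txt") as fh:          # placeholder input
        sentences = [line.split() for line in fh]

    model = Word2Vec(
        sentences,
        vector_size=128,   # embedding dimension (placeholder)
        window=5,
        min_count=1,
        sg=1,              # skip-gram; sg=0 for CBOW
        negative=5,
        epochs=10,
        compute_loss=True,
    )
    print("final training loss:", model.get_latest_training_loss())
    print(model.wv.most_similar(sentences[0][0], topn=5))

If gensim's loss drops epoch over epoch on the same sentences, that points the finger squarely at the DatasetReader rather than the objective.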

Atlas QC work

Still need to do a bunch of work on the atlas; some peaks seem spurious.

  • Convert bedgraph files to bigwig track files, for easier validation in IGV (see the sketch after this list)
  • Compute the distribution of support underlying peak-rich genes:
    • peak height re-scaled by library size
    • total corrected coverage divided by peak length
  • Make bigwig tracks for the combined replicates of each cell type, and plot them over the atlas to see which peaks are driven by signal in which cell types (CTs)
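
A minimal sketch of the conversion step, assuming the UCSC bedGraphToBigWig binary is on the PATH, the bedGraphs are already sorted, and a chrom.sizes file is available (paths are placeholders):

    # Convert each bedGraph to a bigWig track for inspection in IGV.
    import glob
    import subprocess

    CHROM_SIZES = "hg19.chrom.sizes"   # placeholder genome build

    for bg in glob.glob("tracks/*.bedgraph"):
        bw = bg.rsplit(".", 1)[0] + ".bw"
        # UCSC usage: bedGraphToBigWig in.bedGraph chrom.sizes out.bw
        subprocess.run(["bedGraphToBigWig", bg, CHROM_SIZES, bw], check=True)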

Still cannot train properly on the full SELEX data

Tried running the model on the device with different values of k and stride. No output was produced, and eventually the jobs timed out.

I need to try running the same jobs on the CPU and with a longer duration, to see whether this is a problem induced by the transfer to the device, a memory-bound problem that isn't being reported properly, or something else.
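
For the CPU runs, a minimal sketch of forcing Theano onto the CPU and timing the job (the training entry point is a placeholder):

    # run_cpu.py -- force Theano onto the CPU, then time one training run.
    import os
    import time

    # THEANO_FLAGS must be set before theano is imported anywhere
    os.environ["THEANO_FLAGS"] = "device=cpu,floatX=float32"

    import train_embedding   # placeholder: the existing training script/module

    start = time.time()
    train_embedding.main()    # placeholder entry point
    print("wall time (s):", time.time() - start)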

Model visualization enhancements needed

To interpret the learned codes it would help to have the following visualizations:
1. Distance-matrix clustering for all TFs. Do probes from like families cluster together?
2. 3D visualization of probes from three specific factors (do we see separation?), ideally from distinct families.
3. Look at the nearest k-mers to each factor's centre of mass, i.e. the k-mers that lie within a very small radius of it.
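
A minimal sketch of the first two visualizations, assuming a matrix of probe codes and a factor label per probe (the file names and chosen factors are placeholders):

    # (1) factor-by-factor distance matrix with hierarchical clustering,
    # (2) 3D PCA of probes from three chosen factors.
    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import pdist, squareform
    from sklearn.decomposition import PCA

    codes = np.load("probe_codes.npy")       # placeholder: (n_probes, dim)
    factors = np.load("probe_factors.npy")   # placeholder: (n_probes,) factor names

    # 1. cosine distances between per-factor centres of mass, clustered
    names = np.unique(factors)
    centres = np.vstack([codes[factors == f].mean(axis=0) for f in names])
    dists = pdist(centres, metric="cosine")
    dendrogram(linkage(dists, method="average"), labels=list(names))
    plt.figure()
    plt.imshow(squareform(dists))
    plt.colorbar()

    # 2. 3D PCA of probes from three factors of interest (placeholder names)
    chosen = ["FOXA1", "GATA3", "CTCF"]
    mask = np.isin(factors, chosen)
    xyz = PCA(n_components=3).fit_transform(codes[mask])
    ax = plt.figure().add_subplot(projection="3d")
    for f in chosen:
        sub = xyz[factors[mask] == f]
        ax.scatter(sub[:, 0], sub[:, 1], sub[:, 2], label=f, s=5)
    ax.legend()
    plt.show()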

Maybe something like exemplar-based clustering could work in the embedding space?

Further down the line, for a given probe, can we decode along the probe to find important k-mers (which might resemble motifs)?

Re-focus to start embedding ATAC-seq sub-peaks

Based on a discussion with C today, I'm going to shift away from embedding SELEX-seq probes and towards embedding shorter windows of ATAC-seq peaks.

To that end, I need to do several things for the data set to be prepared for embedding:

  1. The embedding code needs to take input sequences and turn them into sentences of k-mers. Currently I have an atlas of regions, but not the sequences that underlie them, so I need to turn my atlas peaks into sequences to be parsed.
  2. I need to write code to extract 50 bp sub-windows from within a given peak, and to compute the corresponding average coverage score (see the sketch after this list).
  3. I need to hack the data prep code to withhold an entire chromosome, or an entire cell type, for testing.
  4. I need to integrate the GC-content bias correction (gcapc, in R; apparently also in a Basenji Python script).
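
A minimal sketch of item 2, assuming per-base coverage for a peak is already in hand as a numpy array (the names, 50 bp width, and 25 bp stride are placeholders):

    # Slide fixed-width sub-windows across a peak and record mean coverage per window.
    import numpy as np

    def subwindows(peak_start, peak_end, coverage, width=50, stride=25):
        """coverage: per-base coverage over [peak_start, peak_end), as a 1D array."""
        out = []
        for off in range(0, (peak_end - peak_start) - width + 1, stride):
            win_start = peak_start + off
            mean_cov = float(np.mean(coverage[off:off + width]))
            out.append((win_start, win_start + width, mean_cov))
        return out

    # toy example with fake coverage
    cov = np.random.poisson(5, size=300)
    for start, end, score in subwindows(1000, 1300, cov)[:3]:
        print(start, end, round(score, 2))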

Fix DatasetReader to truly act as a generator for SELEX data

Whatever embedding model is used, I need to change the DatasetReader so that it can scale to larger data sets:

  • build the unigram dictionary as expected
  • at the beginning of each epoch, keep a randomly ordered queue of factor .gz files
  • have several workers that dequeue files and process them into macrobatches, ready to yield a macrobatch whenever the DatasetReader requires one (see the sketch after this list)
  • when generating macrobatches, don't hog all the memory! In particular, I should not waste time by reading and processing the whole dataset each epoch.
  • time how long it takes to process an entire epoch's worth of data, and compare against the present state of the DatasetReader
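
A minimal sketch of the worker layout I have in mind, using multiprocessing with a bounded queue so memory stays flat (the per-file processing function is a placeholder):

    # Workers dequeue factor files, build macrobatches, and push them onto a
    # bounded queue; the DatasetReader yields them as they arrive.
    import multiprocessing as mp
    import random

    def process_file(path):
        # placeholder: parse one factor .gz file into a macrobatch
        return ("macrobatch_for", path)

    def worker(file_q, batch_q):
        while True:
            path = file_q.get()
            if path is None:          # poison pill: no more files
                break
            batch_q.put(process_file(path))

    def macrobatches(files, n_workers=4, max_queued=2):
        file_q = mp.Queue()
        batch_q = mp.Queue(maxsize=max_queued)    # bounded => no memory hogging
        random.shuffle(files)
        for f in files:
            file_q.put(f)
        for _ in range(n_workers):
            file_q.put(None)
        procs = [mp.Process(target=worker, args=(file_q, batch_q)) for _ in range(n_workers)]
        for p in procs:
            p.start()
        for _ in range(len(files)):
            yield batch_q.get()
        for p in procs:
            p.join()

    if __name__ == "__main__":
        for mb in macrobatches(["tf1.gz", "tf2.gz", "tf3.gz"], n_workers=2):
            print(mb)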

Focus the argument for improvement on interpretability

I've described the project to a few people now, and one question that comes up is as follows:

  • There is little dispute that higher-order models for predicting TF-DNA binding are more capable than simple PWM models; there is no shortage of models that provide greater modeling capacity (e.g. DeepBind, DeepSEA, DeeperBind, Basset, DanQ, ...).
    • Yet there was little immediate interest in another such model whose main advance was better interpretation.
  • Existing models (AFAIK) are interested in different predictions. DeepBind was interested in predicting bound-vs-unbound sequences, and also in determining whether SNPs would change this status (especially with respect to splicing). Basset also tried an in-silico mutagenesis experiment to identify substitutions that would induce changes in predicted binding effects. DeepSEA did a similar experiment. DeepLIFT does not, but they do try to compute per-base importance scores.
  • So why do we need another model that takes a variable-length sequence input and tries to predict binding information?

My most plausible answer at the moment:

  • Predicting bound versus unbound is relatively well-solved, as each of the comparators I mentioned will attest.
  • But it is usually not the most important problem for people with sequencing data in hand.
  • Usually, they want to know what their sequencing data means. What is binding? What is changing at a given site between experimental conditions? What parts of a given sequence are most important in determining binding versus not binding? And can we interpret what those more important elements tell us about gene regulation?

This suggests I will want to incorporate a figure showing that my method provides a rich and interpretable view of sequence decoding, simpler and more precise than FIMO or other comparators.

  • One simple experiment would be to interpret ChIP-seq data: sample peaks and flanks, and show that I can reliably rank the factor of interest as the most likely to be bound in peaks, ahead of other factors, and not in flanks. This can be compared to FIMO.

Regularize loss function with k-mer based inverse distance prior

So, while I'm not sure of the form this could take, I think it definitely makes sense to influence the codes for k-mers that are close in terms of syntactic similarity. I've got a couple of papers to look at for defining this:

The real problem, I think, is that I'm not sure exactly how to incorporate a regularization term into the stochastic proxy loss:

  • Each mini-batch of training data will be (but is not yet) sampled randomly. The random sample might or might not grab probes with very similar k-mer profiles. In that case, how do you change the codes for the k-mers in this mini-batch so that they get updated accordingly?
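
One possible form, as a minimal numpy sketch: within each mini-batch, penalize squared distances between the codes of syntactically similar k-mers, weighted by inverse Hamming distance (the names and the weighting scheme are assumptions, not a settled design):

    # Mini-batch regularizer: pull together codes of k-mers that are close in
    # Hamming distance, with a weight that decays as the distance grows.
    import numpy as np

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def kmer_prior_penalty(kmers, codes, lam=0.1):
        """kmers: equal-length k-mer strings in this mini-batch.
        codes: (len(kmers), dim) array of their current embeddings."""
        penalty = 0.0
        for i in range(len(kmers)):
            for j in range(i + 1, len(kmers)):
                w = 1.0 / (1.0 + hamming(kmers[i], kmers[j]))   # inverse-distance weight
                penalty += w * np.sum((codes[i] - codes[j]) ** 2)
        return lam * penalty

    # the mini-batch objective would then be: proxy_loss + kmer_prior_penalty(...)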

Implement doc2vec style factor augmentation

Implement a code vector for each document (i.e. each factor, in this case) that is learned and concatenated with the codes for each word to learn the embeddings.
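
gensim's Doc2Vec already has this shape if each probe sentence is tagged with its factor; a minimal sketch, assuming gensim >= 4 and placeholder data:

    # Each probe becomes a TaggedDocument whose tag is its factor, so a code
    # vector is learned per factor alongside the k-mer word vectors.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # placeholder data: (factor, k-mer sentence) pairs
    probes = [
        ("FOXA1", ["ACGTA", "CGTAC", "GTACG"]),
        ("GATA3", ["TTGAC", "TGACA", "GACAT"]),
    ]
    docs = [TaggedDocument(words=kmers, tags=[factor]) for factor, kmers in probes]

    # dm=1, dm_concat=1 is the PV-DM variant that concatenates the document
    # (factor) vector with the context word vectors
    model = Doc2Vec(docs, vector_size=64, window=2, min_count=1,
                    dm=1, dm_concat=1, epochs=20)
    print(model.dv["FOXA1"])                       # learned factor code
    print(model.wv.most_similar("ACGTA", topn=3))  # k-mer neighbours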

Incredibly, the original paper doesn't give an equation describing how to do this, but there are other implementations: this one seems readable, and this page actually has derivations, which will help me augment the model I'm currently working with.

Without some probabilistic interpretation which would allow for the decoding of a window without an associated document, this extension seems unlikely to be useful. But it should be informative as to how much separation I can get just by including factor information in the generation of the code words.

I might still be able to use the code-words learned in this way as some empirical Bayes-style prior in a more principled model.

Re-write data processing code to use `dask.distributed` for serving data to models

This is a bit pie-in-the-sky, but eventually I'd like to not have the burden of micro-managing data processing, instead off-loading scheduling and load-balancing to dask.

This would involve a totally new dataset_reader (or maybe just parallel functionality within?), which would function something like this:

  • the dask-scheduler from dask.distributed distributes the loading of sets of factor files for training
  • dask workers unpack the files, turning them into macrobatches (a rough sketch follows this list)
  • I still need to understand how dask and Theano will work together; maybe working through the SciPy Dask tutorial will help.
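
A minimal sketch of the shape this could take with dask.distributed (the per-file function is a placeholder; how the resulting macrobatches get onto the Theano device is the open question):

    # Off-load per-file macrobatch construction to dask workers and consume the
    # results as they complete.
    from dask.distributed import Client, as_completed

    def file_to_macrobatch(path):
        # placeholder: unpack one factor file into a macrobatch
        return ("macrobatch_for", path)

    client = Client()   # local cluster by default; point it at a dask-scheduler otherwise
    futures = client.map(file_to_macrobatch, ["tf1.gz", "tf2.gz", "tf3.gz"])

    for fut in as_completed(futures):
        macrobatch = fut.result()
        # hand the macrobatch to the training loop here
        print(macrobatch)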

Turn samplers for contexts into factor-dependent parameters

Had a few thoughts about how to incorporate factor-specific information into the context samplers:

  1. Turn the sampling kernel into model parameters. A first step is to prefer far-away k-mers over nearby ones, so that the model does not simply learn to put adjacent words (which have substantial k-mer overlap) together at the expense of learning longer-range spatial dependencies within probes.
  2. Adapt the samplers by conditioning on the k-mers involved and on the overall statistics for the factor. This would work by making one sampler per factor, with enrichment weights for each word in the subset of the unigram dictionary that appears in that factor. Then, for each probe reduced to a sentence, form the sampling probabilities from two components (see the sketch after this list):
    • the k-mer enrichment for this factor
    • the positional preference for further-away context k-mers
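
A minimal sketch of combining the two components into context-sampling probabilities (the enrichment weights and the positional kernel are placeholders):

    # For one probe (a sentence of k-mers) and one query position, combine a
    # factor-specific k-mer enrichment weight with a preference for distant
    # positions, then normalize into context-sampling probabilities.
    import numpy as np

    def context_probs(sentence, query_idx, enrichment, alpha=0.5):
        """enrichment: dict mapping k-mer -> factor-specific weight (placeholder)."""
        idx = np.array([i for i in range(len(sentence)) if i != query_idx])
        dist = np.abs(idx - query_idx).astype(float)
        positional = dist ** alpha                      # prefer far-away positions
        kmer_w = np.array([enrichment.get(sentence[i], 1.0) for i in idx])
        p = positional * kmer_w
        return idx, p / p.sum()

    sentence = ["ACGTA", "CGTAC", "GTACG", "TACGT", "ACGTT"]
    enrichment = {"ACGTT": 3.0}                         # placeholder per-factor weights
    idx, p = context_probs(sentence, query_idx=0, enrichment=enrichment)
    print(list(zip(idx, p.round(3))))
    print(np.random.choice(idx, size=2, p=p))           # sampled context positions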

Considerations from ENCODE meeting

Presented the ChIP-seq peaks-vs-flanks comparison on the Friday ENCODE call today. Jeff and Jacob suggest that HashingVectorizer is not a good comparator for my method, since I want something that takes proximity into account, such as a more locality-sensitive hashing or a random projection.

Jeff had a bunch of suggestions:

  1. experiment with increasing K
  2. experiment with increasing K while also changing the stride to something other than 1; he says this scheme is equivalent to a Kth-order Markov model (see the sketch after this list)
  3. experiment with adding a supervised component to the objective. One candidate for supervision would be to minimize the difference between the distances calculated by a gappy k-mer kernel and the Euclidean distances between embedded k-mers.
  4. do a ton of experiments and see which actually lead to more useful representations; at this stage I should do much more exploration and less exploitation.
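
For suggestions 1 and 2, a minimal sketch of the tokenizer with both knobs exposed (the parameter values are placeholders):

    # Turn a sequence into a sentence of K-mers; a stride > 1 reduces overlap
    # between adjacent words, which is what changes the effective model order.
    def kmer_sentence(seq, k=8, stride=1):
        return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

    print(kmer_sentence("ACGTACGTACGT", k=6, stride=1))  # maximal overlap
    print(kmer_sentence("ACGTACGTACGT", k=6, stride=3))  # coarser sentence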

This suggests another possible measure of validation:

  • Can I predict open vs. closed chromatin on the within-cell-type task, with a chromosome held out?
