lzamparo / embedding
Learning semantic embeddings for TF binding preferences directly from sequence
License: Other
I'm seeing a disturbing pattern: performance does not improve from epoch to epoch.
Different macrobatches show better or worse performance, but the results are very consistent with which files make up each macrobatch.
Below is a trace of the output from a sample embedding run, trained on the positive probes that pass Han's QC stages. It is clear that, across epochs, the macrobatches covering the same files in the data yield an average loss that does not decrease in an epoch-dependent manner:
zamparol$ less fasta_seqs.txt | grep -e average -e macrobatch -e epoch
starting epoch 0
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090376
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090356
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090346
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090362
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090349
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090341
receiving macrobatch from child process
running macrobatch 6
average loss: 5.876007
receiving macrobatch from child process
running macrobatch 7
average loss: 5.559565
receiving macrobatch from child process
running macrobatch 8
average loss: 5.803356
starting epoch 1
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090372
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090356
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090366
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090344
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090327
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090322
receiving macrobatch from child process
running macrobatch 6
average loss: 5.513479
receiving macrobatch from child process
running macrobatch 7
average loss: 5.577124
receiving macrobatch from child process
running macrobatch 8
average loss: 5.952247
starting epoch 2
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090325
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090312
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090318
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090284
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090295
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090289
receiving macrobatch from child process
running macrobatch 6
average loss: 5.597625
receiving macrobatch from child process
running macrobatch 7
average loss: 5.679941
receiving macrobatch from child process
running macrobatch 8
average loss: 5.735664
starting epoch 3
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090294
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090319
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090273
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090270
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090262
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090235
receiving macrobatch from child process
running macrobatch 6
average loss: 5.614810
receiving macrobatch from child process
running macrobatch 7
average loss: 5.614810
receiving macrobatch from child process
running macrobatch 8
average loss: 5.829066
starting epoch 4
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090243
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090234
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090187
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090198
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090191
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090154
receiving macrobatch from child process
running macrobatch 6
average loss: 5.606859
receiving macrobatch from child process
running macrobatch 7
average loss: 5.774870
receiving macrobatch from child process
running macrobatch 8
average loss: 5.686827
starting epoch 5
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090167
receiving macrobatch from child process
running macrobatch 1
average loss: 11.090161
receiving macrobatch from child process
running macrobatch 2
average loss: 11.090095
receiving macrobatch from child process
running macrobatch 3
average loss: 11.090075
receiving macrobatch from child process
running macrobatch 4
average loss: 11.090047
receiving macrobatch from child process
running macrobatch 5
average loss: 11.090009
receiving macrobatch from child process
running macrobatch 6
average loss: 5.545681
receiving macrobatch from child process
running macrobatch 7
average loss: 5.530536
receiving macrobatch from child process
running macrobatch 8
average loss: 5.955928
starting epoch 6
receiving macrobatch from child process
running macrobatch 0
average loss: 11.090011
receiving macrobatch from child process
running macrobatch 1
average loss: 11.089976
receiving macrobatch from child process
running macrobatch 2
average loss: 11.089961
receiving macrobatch from child process
running macrobatch 3
average loss: 11.089908
receiving macrobatch from child process
running macrobatch 4
average loss: 11.089851
receiving macrobatch from child process
running macrobatch 5
average loss: 11.089799
receiving macrobatch from child process
running macrobatch 6
average loss: 5.628832
receiving macrobatch from child process
running macrobatch 7
average loss: 5.542895
receiving macrobatch from child process
running macrobatch 8
average loss: 5.874512
starting epoch 7
receiving macrobatch from child process
running macrobatch 0
average loss: 11.089784
receiving macrobatch from child process
running macrobatch 1
average loss: 11.089684
receiving macrobatch from child process
running macrobatch 2
average loss: 11.089632
receiving macrobatch from child process
running macrobatch 3
average loss: 11.089536
receiving macrobatch from child process
running macrobatch 4
average loss: 11.089473
receiving macrobatch from child process
running macrobatch 5
average loss: 11.089358
receiving macrobatch from child process
running macrobatch 6
average loss: 5.663543
receiving macrobatch from child process
running macrobatch 7
average loss: 5.709703
receiving macrobatch from child process
running macrobatch 8
average loss: 5.704039
starting epoch 8
receiving macrobatch from child process
running macrobatch 0
average loss: 11.089377
receiving macrobatch from child process
running macrobatch 1
average loss: 11.089262
receiving macrobatch from child process
running macrobatch 2
average loss: 11.089121
receiving macrobatch from child process
running macrobatch 3
average loss: 11.088958
receiving macrobatch from child process
running macrobatch 4
average loss: 11.088927
receiving macrobatch from child process
running macrobatch 5
average loss: 11.088632
receiving macrobatch from child process
running macrobatch 6
average loss: 5.636472
receiving macrobatch from child process
running macrobatch 7
average loss: 5.469601
receiving macrobatch from child process
running macrobatch 8
average loss: 5.911286
starting epoch 9
receiving macrobatch from child process
running macrobatch 0
average loss: 11.088615
receiving macrobatch from child process
running macrobatch 1
average loss: 11.088458
receiving macrobatch from child process
running macrobatch 2
average loss: 11.088245
receiving macrobatch from child process
running macrobatch 3
average loss: 11.088001
receiving macrobatch from child process
running macrobatch 4
average loss: 11.087771
receiving macrobatch from child process
running macrobatch 5
average loss: 11.087394
receiving macrobatch from child process
running macrobatch 6
average loss: 5.789453
receiving macrobatch from child process
running macrobatch 7
average loss: 5.532505
receiving macrobatch from child process
running macrobatch 8
average loss: 5.703700
This could be due to many things, but it is most likely a failure of the DatasetReader to parse the data into the format that the model (either CBOW or skip-gram) actually expects. This is a problem. I'll try testing with a gensim implementation.
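As a first check that the reader emits what the model expects, it helps to pin down the target format in isolation. Below is a minimal sketch (k, stride, and window are arbitrary placeholders, not the values used in training) of k-merizing a probe and forming the (center, context) pairs a skip-gram model should consume:

```python
def kmerize(seq, k=4, stride=1):
    """Split a probe sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def skipgram_pairs(words, window=2):
    """The (center, context) pairs a skip-gram model trains on."""
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        pairs.extend((center, words[j]) for j in range(lo, hi) if j != i)
    return pairs

words = kmerize("ACGTACGT", k=4)
print(words)                           # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT']
print(skipgram_pairs(words, window=1))  # 8 (center, context) pairs
```

Comparing the DatasetReader's output against this (and against what gensim consumes) should localize the discrepancy.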
Still need to do a bunch of work on the atlas; some peaks seem spurious.
Tried running the model on device with different values of k and stride. No output was produced, and the jobs eventually timed out.
I need to try running the same jobs on CPU and with a longer duration, to see whether this is a problem induced by the transfer to device, a memory-bound problem that isn't being reported properly, or something else entirely.
Maybe this last item is superfluous, but I'd like to see whether Adam or Adagrad (with reset) will make learning go faster and better than simply using Nesterov momentum.
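For reference, the Adam update itself is compact enough to sketch in numpy (the defaults below are the usual β1/β2/ε values, not tuned for this model):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; m and v are running first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias-correct the moment estimates
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy quadratic 0.5 * ||theta||^2, so the gradient is theta itself.
theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 4):
    theta, m, v = adam_step(theta, theta.copy(), m, v, t, lr=0.1)
print(theta)  # each component has moved from 1.0 toward the minimum at 0
```

Adagrad with reset would instead keep a plain running sum of squared gradients for the denominator and zero it periodically.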
Currently all .ipynb scripts use the old-style parsing functions, which are now refactored under a SequenceParser object. Need to fix this in each notebook.
To interpret the learned codes it would help to have the following visualizations:
1. Distance-matrix clustering for all TFs. Do probes from like families cluster together?
2. Visualization in 3D for probes from 3 specific factors (do we see separation?), maybe from distinct families.
3. Look at the nearest k-mers to the centre of mass for each factor: k-mers that lie within a very small radius of each factor's centre of mass.
Maybe something like exemplar-based clustering could work in the embedding space?
Further down the line: for a given probe, can we decode along the probe to find important k-mers (that might resemble motifs)?
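Item 3 above is essentially a nearest-neighbour query around each factor's centroid. A sketch with made-up codes, labels, and factor names (everything here is placeholder data):

```python
import numpy as np

def nearest_to_centroid(embeddings, labels, kmers, factor, radius=0.5):
    """K-mers whose codes fall within `radius` of a factor's centre of mass."""
    mask = labels == factor
    centroid = embeddings[mask].mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    order = np.argsort(dists)
    return [(kmers[i], float(dists[i])) for i in order if dists[i] <= radius]

# Toy data: 4 k-mer codes in 2-D, two factors.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array(["FoxA1", "FoxA1", "Gata1", "Gata1"])
kmers = ["ACGTACGT", "CGTACGTA", "GGATAAGG", "GATAAGGA"]
print(nearest_to_centroid(emb, labels, kmers, "FoxA1", radius=0.2))
```

The same distance matrix would feed the clustering in item 1.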
Not sure, but I suspect that when training in parallel, a race condition occurs in generate_dataset_parallel when generating the macrobatches. I can train a model with generate_dataset_serial, which trains alright (I eventually get a loss of NaN, but that's an unrelated problem).
Source might be problem here: lanjelot/patator#18 (comment)
Fix might be here: https://gist.github.com/mangecoeur/9540178
Not sure of the easiest way to represent this, but probably a very small class with primary and reverse-complement (RC) representations, which is used to key the unigram dict and also to index the embedding and decoding matrices.
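A minimal version of that class might canonicalize each k-mer with its reverse complement, so both strands key the same unigram entry and index the same embedding row. A sketch, assuming the canonical form is just the lexicographic min of the two strands:

```python
class Kmer:
    """A k-mer identified with its reverse complement for hashing/indexing."""
    _COMP = str.maketrans("ACGT", "TGCA")

    def __init__(self, seq):
        self.primary = seq
        self.rc = seq.translate(self._COMP)[::-1]
        self.canonical = min(self.primary, self.rc)  # strand-independent key

    def __hash__(self):
        return hash(self.canonical)

    def __eq__(self, other):
        return self.canonical == other.canonical

counts = {}
for s in ("AACG", "CGTT"):  # a k-mer and its reverse complement
    counts[Kmer(s)] = counts.get(Kmer(s), 0) + 1
print(counts[Kmer("AACG")])  # 2: both strands hit the same entry
```

An integer index into the embedding and decoding matrices could then be assigned per canonical form.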
based on discussion with C today, I'm going to shift away from embedding SELEX-seq probes and towards embedding shorter windows of ATAC-seq peaks.
To that end, I need to do several things for the data set to be prepared for embedding:
Whatever embedding model is used, I need to change the DatasetReader so it can scale to larger data sets.
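One route to scaling (a sketch, not the current DatasetReader API) is to stream macrobatches lazily instead of materializing the whole corpus:

```python
def iter_macrobatches(files, batch_size, read_examples):
    """Yield fixed-size macrobatches lazily, never holding the whole corpus."""
    batch = []
    for f in files:
        for example in read_examples(f):
            batch.append(example)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial macrobatch
        yield batch

# Toy check with an in-memory stand-in for per-file parsing.
fake = {"a.fa": [1, 2, 3], "b.fa": [4, 5]}
batches = list(iter_macrobatches(fake, 2, lambda f: fake[f]))
print(batches)  # [[1, 2], [3, 4], [5]]
```

Memory use then depends only on the macrobatch size, not on how many ATAC-seq windows the data set contains.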
I've described the project to a few people now, and one question that comes up is as follows:
My currently most plausible answer:
This suggests I will want to incorporate a figure showing that my method provides a rich and interpretable view of sequence decoding, simpler and more precise than FIMO or other comparators.
So, while I'm not sure of the form this could take, I think it definitely makes sense to influence the codes for k-mers that are close in terms of syntactic similarity. I've got a couple of papers to look at for defining this:
The real problem here, I think, is that I'm not sure exactly how to incorporate a regularization term into the stochastic proxy loss:
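For illustration only (this is not the actual proxy loss), such a term could take the generic shape of a pairwise penalty added to the objective:

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{proxy}}(\theta)
  \;+\; \lambda \sum_{(w,\, w') \in S} \lVert v_w - v_{w'} \rVert_2^2
```

where S is a set of syntactically similar k-mer pairs (e.g. within a small edit distance), v_w is the code for k-mer w, and λ sets the trade-off; under SGD the sum would itself be subsampled per minibatch alongside the proxy-loss samples.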
Implement a code vector for each document (i.e. factor, in this case) to be learned and concatenated with the codes for each word to learn the embeddings.
Incredibly, there isn't an equation in the original paper describing how to do this (which is shocking), but there are other implementations. This one seems readable, and this page actually has derivations, which will help augment the model I'm currently working with.
Without some probabilistic interpretation which would allow for the decoding of a window without an associated document, this extension seems unlikely to be useful. But it should be informative as to how much separation I can get just by including factor information in the generation of the code words.
I might still be able to use the code-words learned in this way as some empirical Bayes-style prior in a more principled model.
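The concatenation idea can be sketched as a forward pass in numpy; every dimension and name below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, F, d = 6, 2, 4                # vocab size, num factors, code dimension
W = rng.normal(size=(V, d))      # k-mer codes
D = rng.normal(size=(F, d))      # per-factor ("document") codes
U = rng.normal(size=(V, 2 * d))  # output weights over [factor ; mean context]

def predict_center(context_ids, factor_id):
    """P(center k-mer | context k-mers, factor) via concatenated codes."""
    h = np.concatenate([D[factor_id], W[context_ids].mean(axis=0)])
    logits = U @ h
    e = np.exp(logits - logits.max())  # stable softmax
    return e / e.sum()

p = predict_center([1, 2, 4], factor_id=0)
print(p.shape)  # (6,): a distribution over the k-mer vocabulary
```

Decoding a window without an associated document then amounts to deciding what to put in place of D[factor_id], which is exactly the missing probabilistic interpretation noted above.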
This is a bit pie-in-the-sky, but eventually I'd like to not have the burden of micro-managing data processing, instead off-loading scheduling and load-balancing to dask.
This would involve a totally new dataset_reader
(or maybe just parallel functionality within?), which would function something like this:
dask.distributed
does the work of distributing and loading sets of factor files for training individually.

Had a few thoughts about how to incorporate factor-specific information into the context samplers.
Presented the ChIP-seq peaks vs. flanks comparison on the Friday ENCODE call today. Jeff and Jacob suggest that HashingVectorizer is not a good comparator for my method, since I want something that takes proximity into account, like a more locality-sensitive hashing or a random projection.
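A Gaussian random-projection comparator is easy to sketch directly in numpy (sizes are arbitrary and the k-mer counts are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 256, 32  # probes, k-mer count dimension, projected dimension

X = rng.poisson(1.0, size=(n, d)).astype(float)      # stand-in k-mer counts
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))   # Gaussian projection
Y = X @ R

# Unlike feature hashing, nearby count vectors stay nearby after projection.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(Y.shape, proj / orig)  # distance ratio concentrates near 1 as k grows
```

That distance-preservation property is exactly what makes it a fairer proximity-aware baseline than HashingVectorizer.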
Jeff had a bunch of suggestions:
This suggests another possible measure of validation: