kundajelab / tfmodisco Goto Github PK

TF MOtif Discovery from Importance SCOres

License: MIT License

Python 20.43% Jupyter Notebook 79.53% Shell 0.04%

tfmodisco's Introduction

TF-MoDISco: Transcription-Factor Motif Discovery from Importance Scores

This repository contains the code developed for the associated manuscript, Distilling consolidated DNA sequence motifs and cooperative motif syntax from neural-network models of in vivo transcription-factor binding profiles. The analysis scripts and notebooks used to reproduce the results in this manuscript can be found at this repository.

General users should visit the TF-MoDISco-lite repository for a more efficient, actively maintained, and easier-to-use version of the same algorithm.

Structure of TF-MoDISco

The TF-MoDISco algorithm starts with a set of importance scores on genomic sequences, and can perform the following tasks:

Identify high-importance windows of the sequences, termed "seqlets"
Cluster recurring similar seqlets into motifs
Scan through importance scores across the genome to call motif instances (AKA "hit scoring")

Installing TF-MoDISco

pip install modisco

Alternatively, for a specific tagged version or commit, install from source code by cloning this repository, checking out the desired version, and running pip install -e /path/to/cloned/repo.

Required inputs to run the algorithm

In order to run the TF-MoDISco algorithm, the following data is required as an input:

An N x L x 4 NumPy array of one-hot encoded genomic sequences, where N is the number of sequences and L is the sequence length (the 4 bases are in A, C, G, T order); this denotes the identity of the sequence
A parallel N x L x 4 NumPy array of contribution scores; each position contains the importance of the base specified in the corresponding one-hot encoded sequence (i.e. each base position should have at most one nonzero entry out of the 4, which measures importance at the base in the sequence)
An optional parallel N x L x 4 NumPy array of hypothetical contribution scores, which measures the hypothetical contribution of every base (not just the one that is present in the sequence); equivalently, the element-wise product of this array with the one-hot encoded genomic sequences should be identical to the array of contribution scores

Other resources

A technical note describing version 0.5.6.5 is available at https://arxiv.org/abs/1811.00416.

Video of talk at NeurIPS MLCB 2017

Example notebooks for running the algorithm:

TF MoDISco TAL GATA: a self-contained example notebook that uses pre-computed importance scores (generated by a neural network) as input. Scores were generated using deeplift as illustated in this notebook. If deeplift doesn't work with your architecture, you could alternatively generate scores using DeepSHAP (DeepSHAP is an extension of DeepLIFT that can work with more diverse architectures) as illustrated in this notebook (heads-up: that notebook uses a custom branch of the DeepSHAP repository).
TF MoDISco Nanog: a self-contained example notebook that uses pre-computed importance scores and an empirically-generated null distribution (generated by a gkm-SVM) as input. Scores were generated using gkmexplain as illustated in this notebook. This notebook also illustrates how to use a MEME-based initialization to potentially boost the performance of TF-MoDISco.

tfmodisco's People

Contributors

Stargazers

Watchers

tfmodisco's Issues

Adding padding during sequel detection results in invalid coordinates

Hi,

I have a CNN trained on cluster specific sequences from ATAC-seq.
All sequences in my data are 401 bp long. I tried transposing them, but axis-order does not seem to be the issue. Neither does normalising the importance scores help. It happens on all tasks (each with a different set of sequences).
I did notice there are a lot of negative importance scores in my data, could that be causing this error?

MEMORY 5.619265536
On task task0
Computing windowed sums on original
Generating null dist
peak(mu)= 0.017071453228592874
Computing threshold
Subsampling!
For increasing = True , the minimum IR precision was 0.10307860600830022 occurring at 2.0489096641540527e-06 implying a frac_neg of 0.11492490501263941
To be conservative, adjusted frac neg is 0.95
For increasing = False , the minimum IR precision was 0.4830350286412047 occurring at -3.2782554626464844e-07 implying a frac_neg of 0.9343670372319252
To be conservative, adjusted frac neg is 0.95
Thresholds from null dist were -3.1445798873901367  and  2.9043326415121555 with frac passing 0.067297
Final raw thresholds are -3.1445798873901367  and  2.9043326415121555
Final transformed thresholds are -0.9395319808910858  and  0.9310386693796606

Got 24745 coords
Applying left/right pad of 0 and 238 for (0, 217, 242) with total sequence length 4
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In [82], line 4
      1 i = 1
      3 null_per_pos_scores = modisco.coordproducers.LaplaceNullDist(num_to_samp=5000)
----> 4 tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
      5                     #Slight modifications from the default settings
      6                     sliding_window_size=15, ## 15, 21
      7                     flank_size=5, ## 5, 10
      8                     target_seqlet_fdr=0.15,
      9                     seqlets_to_patterns_factory=
     10                      modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory(
     11                         #Note: as of version 0.5.6.0, it's possible to use the results of a motif discovery
     12                         # software like MEME to improve the TF-MoDISco clustering. To use the meme-based
     13                         # initialization, you would specify the initclusterer_factory as shown in the
     14                         # commented-out code below:
     15                         #initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(    
     16                         #    meme_command="meme", base_outdir="meme_out",            
     17                         #    max_num_seqlets_to_use=10000, nmotifs=10, n_jobs=1),
     18                         trim_to_window_size=15, ## 15, 30
     19                         initial_flank_to_add=5, ## 5, 10
     20                         final_flank_to_add=5, ## 5, 0
     21                         final_min_cluster_size=30, ## 60
     22                         use_pynnd=True, #use_pynnd=True can be used for better
     23                         n_cores=10)
     24                 )(
     25                  task_names=["task0"],#, "task1", "task2"],
     26                  contrib_scores=task_to_scores[i],
     27                  hypothetical_contribs=task_to_hyp_scores[i],
     28                  one_hot=onehot_data[i],
     29                  null_per_pos_scores=null_per_pos_scores)

File ~/git_repos/tfmodisco/modisco/tfmodisco_workflow/workflow.py:267, in TfModiscoWorkflow.__call__(self, task_names, contrib_scores, hypothetical_contribs, one_hot, null_per_pos_scores, per_position_contrib_scores, revcomp, other_tracks, just_return_seqlets, plot_save_dir)
    254     per_position_contrib_scores = OrderedDict([
    255         (x, [np.sum(s,axis=1) for s in contrib_scores[x]])
    256         for x in task_names])
    258 track_set = prep_track_set(
    259                 task_names=task_names,
    260                 contrib_scores=contrib_scores,
   (...)
    264                 custom_perpos_contribs=custom_perpos_contribs,
    265                 other_tracks=other_tracks)
--> 267 multitask_seqlet_creation_results = core.MultiTaskSeqletCreator(
    268     coord_producer=self.coord_producer,
    269     overlap_resolver=self.overlap_resolver)(
    270         task_name_to_score_track=per_position_contrib_scores,
    271         null_tracks=null_per_pos_scores,
    272         track_set=track_set)
    274 #find the weakest transformed threshold used across all tasks
    275 weakest_transformed_thresh = (min(
    276     [min(x.tnt_results.transformed_pos_threshold,
    277          abs(x.tnt_results.transformed_neg_threshold))
    278          for x in (multitask_seqlet_creation_results.
    279                    task_name_to_coord_producer_results.values())]) -
    280     0.0001) #subtract 1e-4 to avoid weird numerical issues

File ~/git_repos/tfmodisco/modisco/core.py:371, in MultiTaskSeqletCreator.__call__(self, task_name_to_score_track, null_tracks, track_set, task_name_to_tnt_results)
    363         coord_producer_results =\
    364             self.coord_producer(
    365              score_track=score_track,
    366              null_track = null_track,
    367              tnt_results=
    368               task_name_to_tnt_results[task_name])
    369     task_name_to_coord_producer_results[task_name] =\
    370         coord_producer_results
--> 371     seqlets = track_set.create_seqlets(
    372                 coords=coord_producer_results.coords) 
    373     task_name_to_seqlets[task_name] = seqlets
    374 final_seqlets = self.overlap_resolver(
    375     itertools.chain(*task_name_to_seqlets.values()))

File ~/git_repos/tfmodisco/modisco/core.py:156, in TrackSet.create_seqlets(self, coords, track_names, attribute_names)
    154 seqlets = []
    155 for coor in coords:
--> 156     seqlets.append(self.create_seqlet(coor=coor,
    157                                       track_names=track_names,
    158                                       attribute_names=attribute_names))
    159 return seqlets

File ~/git_repos/tfmodisco/modisco/core.py:167, in TrackSet.create_seqlet(self, coor, track_names, attribute_names)
    165     attribute_names=self.attribute_name_to_attribute_provider.keys()
    166 seqlet = Seqlet(coor=coor)
--> 167 self.augment_seqlet(seqlet=seqlet, track_names=track_names,
    168                     attribute_names=attribute_names) 
    169 return seqlet

File ~/git_repos/tfmodisco/modisco/core.py:173, in TrackSet.augment_seqlet(self, seqlet, track_names, attribute_names)
    171 def augment_seqlet(self, seqlet, track_names, attribute_names):
    172     for track_name in track_names:
--> 173         seqlet.add_snippet_from_data_track(
    174             data_track=self.track_name_to_data_track[track_name])
    175     for attribute_name in attribute_names:
    176         seqlet.set_attribute(
    177             attribute_provider=\
    178              self.attribute_name_to_attribute_provider[attribute_name])

File ~/git_repos/tfmodisco/modisco/core.py:466, in Seqlet.add_snippet_from_data_track(self, data_track)
    464 def add_snippet_from_data_track(self, data_track): 
    465     snippet = data_track.get_snippet(coor=self.coor)
--> 466     return self.add_snippet(data_track_name=data_track.name,
    467                             snippet=snippet)

File ~/git_repos/tfmodisco/modisco/core.py:471, in Seqlet.add_snippet(self, data_track_name, snippet)
    469 def add_snippet(self, data_track_name, snippet):
    470     if (snippet.has_pos_axis):
--> 471         assert len(snippet)==len(self),\
    472                ("tried to add snippet with pos axis of len "
    473                 +str(len(snippet))+" but snippet coords have "
    474                 +"len "+str(self.coor))
    475     self.track_name_to_snippet[data_track_name] = snippet 
    476     return self

AssertionError: tried to add snippet with pos axis of len 238 but snippet coords have len example:0,start:217,end:242,rc:False

Confusion with negative motifs

I have a binary classifier trained and computed deepLIFT scores for just the genomic sequences that belong to class 1. My expectation was to get only positive motifs but I also get negative motifs. How does one interpret this? I consider positive motifs as those with upward facing web logos and negative motifs as those with downward facing web logos. As additional information if needed, my model ends with an output layer with a single neuron and sigmoid activation.
Any explanation to this will be appreciated.
Thanks

AssertionError: 50 1 in `assert len(fwd)==len(rev),str(len(fwd))+" "+str(len(rev))`

Hey, Installed modisco 5.9.0 (update it actually), and I get an error while it running:

TF-MoDISco is using the TensorFlow backend.
WARNING:tensorflow:From /users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.neighbors.kde module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
  warnings.warn(message, FutureWarning)
MEMORY 1.009188864
On task task0
Computing windowed sums on original
Generating null dist
peak(mu)= -3.022349007487297
Computing threshold
Thresholds from null dist were -0.017756938934326172  and  0.00016736984252929688
Passing windows frac was 0.9985476992143659 , which is above  0.2 ; adjusting
Final raw thresholds are -4.143244409561158  and  4.143244409561158
Final transformed thresholds are -0.8  and  0.8
saving plot to figures/scoredist_0.png
Got 10934 coords
After resolving overlaps, got 10934 seqlets
Across all tasks, the weakest transformed threshold used was: 0.7999
MEMORY 1.039724544
10934 identified in total
min_metacluster_size_frac * len(seqlets) = 109 is more than min_metacluster_size=100.
Using it as a new min_metacluster_size
Reducing weak_threshold_for_counting_sign to match weakest_transformed_thresh, from 0.8 to 0.7999
2 activity patterns with support >= 109 out of 2 possible patterns
Metacluster sizes:  [10203, 731]
Idx to activities:  {0: '-1', 1: '1'}
MEMORY 1.040121856
On metacluster 1
Metacluster size 731
Relevant tasks:  ('task0',)
Relevant signs:  (1,)
TfModiscoSeqletsToPatternsFactory: seed=1234
(Round 1) num seqlets: 731
(Round 1) Computing coarse affmat
MEMORY 1.069858816
Beginning embedding computation
Computing embeddings
Using TensorFlow backend.
MAKING A SESSION
2020-12-06 15:26:34.713266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-06 15:26:34.714747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      
Finished embedding computation in 7.69 s
Starting affinity matrix computations
Normalization computed in 0.68 s
Cosine similarity mat computed in 0.79 s
Normalization computed in 0.77 s
Cosine similarity mat computed in 0.85 s
Finished affinity matrix computations in 1.64 s
(Round 1) Compute nearest neighbors from coarse affmat
MEMORY 1.347534848
Computed nearest neighbors in 0.09 s
MEMORY 1.356439552
(Round 1) Computing affinity matrix on nearest neighbors
MEMORY 1.356439552
Launching nearest neighbors affmat calculation job
MEMORY 1.356570624
Parallel runs completed
MEMORY 1.316503552
Job completed in: 12.76 s
MEMORY 1.316507648
Launching nearest neighbors affmat calculation job
MEMORY 1.316466688
Parallel runs completed
MEMORY 1.328021504
Job completed in: 12.96 s
MEMORY 1.332158464
(Round 1) Computed affinity matrix on nearest neighbors in 25.91 s
MEMORY 1.336352768
Filtered down to 633 of 731
(Round 1) Retained 633 rows out of 731 after filtering
MEMORY 1.336487936
(Round 1) Computing density adapted affmat
MEMORY 1.336487936
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 633 samples in 0.001s...
[t-SNE] Computed neighbors for 633 samples in 0.007s...
[t-SNE] Computed conditional probabilities for sample 633 / 633
[t-SNE] Mean sigma: 0.225459
(Round 1) Computing clustering
MEMORY 1.336500224
Beginning preprocessing + Leiden
  0%|                                                                                                                                                                                                                                                           | 0/50 [00:00<?, ?it/s]Quality: 0.5709345452688861
  2%|████▊                                                                                                                                                                                                                                              | 1/50 [00:00<00:06,  8.05it/s]Quality: 0.5713024674728743
  4%|█████████▋                                                                                                                                                                                                                                         | 2/50 [00:00<00:07,  6.67it/s]Quality: 0.571337497320167
  6%|██████████████▌                                                                                                                                                                                                                                    | 3/50 [00:00<00:07,  6.19it/s]Quality: 0.5725037634010955
 16%|██████████████████████████████████████▉                                                                                                                                                                                                            | 8/50 [00:01<00:06,  6.55it/s]Quality: 0.5726370747530787
 18%|███████████████████████████████████████████▋                                                                                                                                                                                                       | 9/50 [00:01<00:06,  5.87it/s]Quality: 0.5731774961658465
 26%|██████████████████████████████████████████████████████████████▉                                                                                                                                                                                   | 13/50 [00:02<00:06,  5.84it/s]Quality: 0.5755771740992032
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:08<00:00,  5.60it/s]
Got 11 clusters after round 1
Counts:
{1: 123, 2: 97, 5: 44, 6: 37, 7: 34, 3: 84, 4: 67, 8: 16, 0: 127, 10: 2, 9: 2}
MEMORY 1.337581568
(Round 1) Aggregating seqlets in each cluster
MEMORY 1.337581568
Aggregating for cluster 0 with 127 seqlets
MEMORY 1.337581568
Trimming eliminated 0 seqlets out of 127
Skipped 57 seqlets
Aggregating for cluster 1 with 123 seqlets
MEMORY 1.337581568
Trimming eliminated 0 seqlets out of 123
Skipped 59 seqlets
Aggregating for cluster 2 with 97 seqlets
MEMORY 1.337667584
Trimming eliminated 0 seqlets out of 97
Skipped 40 seqlets
Aggregating for cluster 3 with 84 seqlets
MEMORY 1.337790464
Trimming eliminated 0 seqlets out of 84
Skipped 38 seqlets
Aggregating for cluster 4 with 67 seqlets
MEMORY 1.337864192
Trimming eliminated 0 seqlets out of 67
Skipped 28 seqlets
Aggregating for cluster 5 with 44 seqlets
MEMORY 1.33789696
Trimming eliminated 0 seqlets out of 44
Skipped 23 seqlets
Aggregating for cluster 6 with 37 seqlets
MEMORY 1.337933824
Trimming eliminated 0 seqlets out of 37
Skipped 19 seqlets
Aggregating for cluster 7 with 34 seqlets
MEMORY 1.337933824
Trimming eliminated 0 seqlets out of 34
Skipped 19 seqlets
Aggregating for cluster 8 with 16 seqlets
MEMORY 1.337933824
Trimming eliminated 0 seqlets out of 16
Skipped 4 seqlets
Aggregating for cluster 9 with 2 seqlets
MEMORY 1.337950208
Trimming eliminated 0 seqlets out of 2
Skipped 1 seqlets
Aggregating for cluster 10 with 2 seqlets
MEMORY 1.337950208
Trimming eliminated 0 seqlets out of 2
Skipped 1 seqlets
(Round 2) num seqlets: 344
(Round 2) Computing coarse affmat
MEMORY 1.337950208
Beginning embedding computation
Computing embeddings
Finished embedding computation in 2.76 s
Starting affinity matrix computations
Normalization computed in 0.3 s
Cosine similarity mat computed in 0.36 s
Normalization computed in 0.34 s
Cosine similarity mat computed in 0.39 s
Finished affinity matrix computations in 0.75 s
(Round 2) Compute nearest neighbors from coarse affmat
MEMORY 1.329205248
Computed nearest neighbors in 0.05 s
MEMORY 1.324064768
(Round 2) Computing affinity matrix on nearest neighbors
MEMORY 1.324064768
Launching nearest neighbors affmat calculation job
MEMORY 1.324101632
Parallel runs completed
MEMORY 1.326985216
Job completed in: 6.71 s
MEMORY 1.326723072
Launching nearest neighbors affmat calculation job
MEMORY 1.326460928
Parallel runs completed
MEMORY 1.32646912
Job completed in: 6.71 s
MEMORY 1.32646912
(Round 2) Computed affinity matrix on nearest neighbors in 13.5 s
MEMORY 1.32646912
Not applying filtering for rounds above first round
MEMORY 1.32646912
(Round 2) Computing density adapted affmat
MEMORY 1.32646912
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 344 samples in 0.000s...
[t-SNE] Computed neighbors for 344 samples in 0.003s...
[t-SNE] Computed conditional probabilities for sample 344 / 344
[t-SNE] Mean sigma: 0.232394
(Round 2) Computing clustering
MEMORY 1.32646912
Beginning preprocessing + Leiden
  0%|                                                                                                                                                                                                                                                           | 0/50 [00:00<?, ?it/s]Quality: 0.5417845542582778
Quality: 0.5466505032751225
  8%|███████████████████▍                                                                                                                                                                                                                               | 4/50 [00:00<00:02, 17.08it/s]Quality: 0.5473188417171921
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:03<00:00, 16.05it/s]
Got 10 clusters after round 2
Counts:
{1: 64, 7: 19, 5: 23, 6: 20, 2: 49, 0: 71, 3: 49, 8: 19, 4: 28, 9: 2}
MEMORY 1.326731264
(Round 2) Aggregating seqlets in each cluster
MEMORY 1.326731264
Aggregating for cluster 0 with 71 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 71
Skipped 11 seqlets
Aggregating for cluster 1 with 64 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 64
Skipped 7 seqlets
Aggregating for cluster 2 with 49 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 49
Skipped 7 seqlets
Aggregating for cluster 3 with 49 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 49
Skipped 8 seqlets
Removed 1 duplicate seqlets
Aggregating for cluster 4 with 28 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 28
Skipped 4 seqlets
Aggregating for cluster 5 with 23 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 23
Skipped 3 seqlets
Aggregating for cluster 6 with 20 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 20
Skipped 6 seqlets
Aggregating for cluster 7 with 19 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 19
Skipped 7 seqlets
Aggregating for cluster 8 with 19 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 19
Skipped 10 seqlets
Aggregating for cluster 9 with 2 seqlets
MEMORY 1.326731264
Trimming eliminated 0 seqlets out of 2
Skipped 1 seqlets
Dropping cluster 9 with 1 seqlets due to sign disagreement
Got 9 clusters
Splitting into subclusters...
MEMORY 1.326731264
Inspecting for spurious merging
Wrote graph to binary file in 0.010907888412475586 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0044247
Louvain completed 21 runs in 1.5680828094482422 seconds
Similarity is 0.91470367; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.010420799255371094 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.004879
After 3 runs, maximum modularity is Q = 0.00499837
After 4 runs, maximum modularity is Q = 0.00521547
After 16 runs, maximum modularity is Q = 0.00521548
Louvain completed 36 runs in 2.6976420879364014 seconds
Similarity is 0.8170022; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.005519390106201172 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00386571
After 3 runs, maximum modularity is Q = 0.00456507
Louvain completed 23 runs in 1.697765588760376 seconds
Similarity is 0.8003647; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.005071401596069336 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00160869
After 2 runs, maximum modularity is Q = 0.00210899
After 4 runs, maximum modularity is Q = 0.00252618
After 5 runs, maximum modularity is Q = 0.00263157
After 6 runs, maximum modularity is Q = 0.00266519
After 8 runs, maximum modularity is Q = 0.0026652
Louvain completed 28 runs in 2.269061803817749 seconds
Similarity is 0.9035732; is_dissimilar is False
Merging on 9 clusters
MEMORY 1.32550656
On merging iteration 1
Numbers for each pattern pre-subsample: [60, 57, 42, 40, 24, 20, 14, 12, 9]
Numbers after subsampling: [60, 57, 42, 40, 24, 20, 14, 12, 9]
TF-MoDISco is using the TensorFlow backend.
TF-MoDISco is using the TensorFlow backend.
TF-MoDISco is using the TensorFlow backend.
TF-MoDISco is using the TensorFlow backend.
2020-12-06 15:27:51.339844: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-06 15:27:51.340008: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-06 15:27:51.340128: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-06 15:27:51.340421: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.neighbors.kde module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
  warnings.warn(message, FutureWarning)
/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.neighbors.kde module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
  warnings.warn(message, FutureWarning)
/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.neighbors.kde module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
  warnings.warn(message, FutureWarning)
/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.neighbors.kde module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
  warnings.warn(message, FutureWarning)
Applying left/right pad of 0 and 1 for (7479, 61, 111) with total sequence length 110
Traceback (most recent call last):
  File "/data/yaishof/deg_project/deg_project/general/run.py", line 47, in <module>
    GPU=False
  File "/data/yaishof/deg_project/deg_project/NN/NN_train_test_models_utilies.py", line 302, in evaluate_model_type
    model_id,  model_type, data_type, preforme_TF_modisco)
  File "/data/yaishof/deg_project/deg_project/NN/NN_train_test_models_utilies.py", line 346, in evaluate_TF_modisco
    TF_modisco.run_modisco(hyp_impscores, impscores, onehot_data)
  File "/data/yaishof/deg_project/deg_project/NN/TF_modisco.py", line 59, in run_modisco
    one_hot=onehot_data)#,
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/workflow.py", line 377, in __call__
    metacluster_seqlets)
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 913, in __call__
    patterns=split_patterns) 
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/aggregator.py", line 845, in __call__
    self.pattern_comparison_settings.track_names) 
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/core.py", line 149, in create_seqlets
    attribute_names=attribute_names))
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/core.py", line 159, in create_seqlet
    attribute_names=attribute_names) 
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/core.py", line 165, in augment_seqlet
    data_track=self.track_name_to_data_track[track_name])
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/core.py", line 456, in add_snippet_from_data_track
    snippet = data_track.get_snippet(coor=self.coor)
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/core.py", line 111, in get_snippet
    has_pos_axis=self.has_pos_axis)
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/core.py", line 18, in __init__
    assert len(fwd)==len(rev),str(len(fwd))+" "+str(len(rev))
AssertionError: 50 1

It's probably something with the update that is not working any more.

Another problem that I saw in this version is that you preformed tf.compat.v1.disable_v2_behavior(), and therefore if the modisco is part of a certain pipeline that work make other preation in TensorFlow 2, then there is a problem: for example, it seems that it disable eager execution. when I tried to perform tf.compat.v1.enable_v2_behavior after the use with modisco it still was not working, and therefore other solution is needed.

Thank you for you work

relating patternidx from hit scoring to identified motifs

Hi,

I trained a CNN to predict if sequences are accessible in different cell types and used tfmodisco to explain what drives the prediction. Now I'm trying to do hit scoring in the sequences used for training but I got a bit confused on how the patternidx relates to the patterns identified in the main pipeline that are split into meta clusters since here they are not labelled as meta cluster X pattern Y etc. Does pattern 9 correspond to meta cluster 1 pattern 0 if meta cluster 0 has 8 patterns?

Camiel

TfModiscoWorkflow breaks at coo_mat

Hello,

Any idea what the source of this error might be?
I can provide examples for reproducing this.

In [13]:     tfmodisco_results = tfm.workflow.TfModiscoWorkflow(
    ...:         sliding_window_size=15,
    ...:         flank_size=5,
    ...:         target_seqlet_fdr=0.15,
    ...:         seqlets_to_patterns_factory=seqletsObj)(
    ...:             task_names=["task0"],
    ...:             contrib_scores=dl_dict,
    ...:             hypothetical_contribs=dl_dict_shuffles,
    ...:             one_hot=onehotar,
    ...:             null_per_pos_scores=null_per_pos_scores)
MEMORY 3.078750208
On task task0
Computing windowed sums on original
Generating null dist
peak(mu)= 0.10214246054238174
Computing threshold
Subsampling!
For increasing = True , the minimum IR precision was 0.33432691830786476 occurring at 3.557652235031128e-07 implying a frac_neg of 0.5022389030032709
To be conservative, adjusted frac neg is 0.95
For increasing = False , the minimum IR precision was 0.20688510991289918 occurring at -9.96515154838562e-08 implying a frac_neg of 0.26085137537914443
To be conservative, adjusted frac neg is 0.95
Thresholds from null dist were -1.3173690717667341  and  2.090032006846741 with frac passing 0.000412
Passing windows frac was 0.000412 , which is below  0.03 ; adjusting
Final raw thresholds are -0.84734781161882  and  0.84734781161882
Final transformed thresholds are -0.9698610639191586  and  0.9698610639191586
saving plot to figures/scoredist_0.png
Got 6240 coords
After resolving overlaps, got 6240 seqlets
Across all tasks, the weakest transformed threshold used was: 0.9697610639191586
MEMORY 3.300876288
6240 identified in total
2 activity patterns with support >= 100 out of 2 possible patterns
Metacluster sizes:  [5511, 729]
Idx to activities:  {0: '1', 1: '-1'}
MEMORY 3.30160128
On metacluster 1
Metacluster size 729
Relevant tasks:  ('task0',)
Relevant signs:  (-1,)
TfModiscoSeqletsToPatternsFactory: seed=1234
(Round 1) num seqlets: 729
(Round 1) Computing coarse affmat
MEMORY 3.30160128
Beginning embedding computation
MEMORY 3.30160128
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   11.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   34.3s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done 729 out of 729 | elapsed:  1.9min finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    6.8s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   29.9s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done 729 out of 729 | elapsed:  1.9min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 729 out of 729 | elapsed:    2.6s finished
Constructing csr matrix...
csr matrix made in 0.20343279838562012 s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 729 out of 729 | elapsed:    2.6s finished
Constructing csr matrix...
csr matrix made in 0.20130276679992676 s
Finished embedding computation in 234.76 s
MEMORY 3.373395968
Starting affinity matrix computations
MEMORY 3.373395968
Batching in slices of size 729
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.61s/it]
Finished affinity matrix computations in 1.62 s
MEMORY 3.373662208
(Round 1) Computed coarse affmat
MEMORY 3.372089344
(Round 1) Computing affinity matrix on nearest neighbors
MEMORY 3.372089344
Launching nearest neighbors affmat calculation job
MEMORY 3.372089344
Parallel runs completed
MEMORY 3.39525632
Job completed in: 22.73 s
MEMORY 3.39525632
Launching nearest neighbors affmat calculation job
MEMORY 3.39525632
Parallel runs completed
MEMORY 3.410219008
Job completed in: 22.74 s
MEMORY 3.410219008
(Round 1) Computed affinity matrix on nearest neighbors in 45.8 s
MEMORY 3.410219008
Filtered down to 0 of 729
(Round 1) Retained 0 rows out of 729 after filtering
MEMORY 3.410505728
(Round 1) Computing density adapted affmat
MEMORY 3.410505728
Symmetrizing nearest neighbors
Computing betas for density adaptation
Computing normalizing denominators
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-9ab94d7c289f> in <module>
      8         hypothetical_contribs=dl_dict_shuffles,
      9         one_hot=onehotar,
---> 10         null_per_pos_scores=null_per_pos_scores)

~/.local/lib/python3.6/site-packages/modisco/tfmodisco_workflow/workflow.py in __call__(self, task_names, contrib_scores, hypothetical_contribs, one_hot, null_per_pos_scores, per_position_contrib_scores, revcomp, other_tracks, just_return_seqlets, plot_save_dir)
    390                   other_comparison_track_names=[])
    391                 seqlets_to_patterns_result = seqlets_to_patterns(
--> 392                                               metacluster_seqlets)
    393             else:
    394                 seqlets_to_patterns_result = None

~/.local/lib/python3.6/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py in __call__(self, seqlets)
   1063                     entries=sym_densadapted_affmat_nn,
   1064                     neighbors=sym_seqlet_neighbors,
-> 1065                     ncols=len(sym_densadapted_affmat_nn)).tocsr()
   1066
   1067             if (self.verbose):

~/.local/lib/python3.6/site-packages/modisco/util.py in coo_matrix_from_neighborsformat(entries, neighbors, ncols)
     48 def coo_matrix_from_neighborsformat(entries, neighbors, ncols):
     49     coo_mat = scipy.sparse.coo_matrix(
---> 50             (np.concatenate(entries, axis=0),
     51              (np.array([i for i in range(len(neighbors))
     52                            for j in neighbors[i]]).astype("int"),

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate

My environment's main packages:

matplotlib                         3.3.3
numpy                              1.18.1
numpydoc                           0.9.1
pandas                             1.1.4
scipy                              1.5.4
seaborn                            0.9.0

ValueError: cannot reshape array of size 0 into shape (0,newaxis)

Hello,

I tried to print the snippet, and it's an empty array. I appreciate for any suggestions.

Thank you,
Yao

question about hypothetical contribs + "cannot do a non-empty take from an empty axes" error

Hi Avanti, When I modify part of the example code and then have an error to be reported here. I don't know why. Could you help me out?

tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
                    #Slight modifications from the default settings
                    sliding_window_size=5,
                    flank_size=5,
                    target_seqlet_fdr=0.15,
                    seqlets_to_patterns_factory=
                     modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory(
                        #Note: as of version 0.5.6.0, it's possible to use the results of a motif discovery
                        # software like MEME to improve the TF-MoDISco clustering. To use the meme-based
                        # initialization, you would specify the initclusterer_factory as shown in the
                        # commented-out code below:
                        # initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(
                        #    meme_command="meme", base_outdir="meme_out",
                        #    max_num_seqlets_to_use=10000, nmotifs=10, n_jobs=1),
                        trim_to_window_size=5,
                        initial_flank_to_add=5,
                        final_flank_to_add=5,
                        final_min_cluster_size=6,
                        n_cores=10)
                )(
                 task_names=["0"],#, "task1", "task2"],
                 contrib_scores=task_to_scores,
                 hypothetical_contribs=task_to_hyp_scores,
                 one_hot=one_hot_seqs,
                 null_per_pos_scores=null_per_pos_scores)

IndexError: cannot do a non-empty take from an empty axes.

Which settings of modisco should I change

Hello,
I have a model that predicts TF binding for a mixture of several possible TFs (the data is similar to having peaks from ATAC-seq). I have computed actual and hypothetical importance scores for only the True positive predictions. I wish to know which settings would be recommended that I change for modisco.
-sliding_window_size: I currently leave at 15 because I can't know exactly what the core motif sizes are. (BPNet paper used 21 but I think the core motif sizes probably had some known estimated sizes)

-target_seqlet_fdr: I am not sure how to change this. Any suggestions? (BPnet uses 0.01 but the example notebook uses 0.15, so how does one make a decision on which value to use?)

And are there any other settings are recommended I change since I have several possible binding motifs?

cheers
Fritz

Doesn't work for the the sequences in variable lengths.

Hi @AvantiShri,

Thanks for the amazing work!

In the following notebook, you mentioned the pipeline also works for the sequences in different lengths. However, when I was testing the notebook with input data with shape of [100, length, 4], the length ranges from 500-1000 bp, it raised the ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (100,) + inhomogeneous part.
Then if I crop the sequences and contribution scores to the same length, it works again. I am wondering is there any version available for the sequences with different length?
https://github.com/kundajelab/tfmodisco/blob/master/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA.ipynb

Thank you again!

Best wishes,
Ruizhi

Using without tasks

I'm an undergrad working with sequence analysis. Currently, I'm attempting to implement TF MoDISco to work with my dataset. So far, I've generated the importance scores via DeepLIFT, and saved to an h5 file similar to "Generate Importance Scores.ipynb" now I'm working on importing like "TF MoDISco TAL GATA.ipynb" does.

However, when I try to call 'modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow' it requires a list of tasks. This I do not have.

With my data I'm exploring a sequence of binding lncRNA and cannot assume that any sequence contains motifs that are similar to another sequence. If I understand the video correctly, the 3 tasks presented do have such a relationship.

Will TF MoDISco fit such a need (isolated 'task-less') motif discovery? Am I using the correct methods?

export like Generate Importance Scores.ipynb
`#save the scores and hypotheticals

if (os.path.isfile("scores.h5")):
    !rm scores.h5
f = h5py.File("scores.h5")
g = f.create_group("contrib_scores")
g.create_dataset("task", data=scores)
g = f.create_group("hyp_contrib_scores")
g.create_dataset("task", data=hyp_contrib_scores)
f.close()

import like TF MoDISco TAL GATA.ipynb

f = h5py.File("scores.h5","r")
scores = [np.array(x) for x in f['contrib_scores']["task"][:]]
hyp_scores = [np.array(x) for x in f['hyp_contrib_scores']["task"][:]]

tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
                    #Slight modifications from the default settings
                    sliding_window_size=21,
                    flank_size=10,
                    target_seqlet_fdr=0.01,
                    seqlets_to_patterns_factory=
                     modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory(
                        trim_to_window_size=30,
                        initial_flank_to_add=10,
                        kmer_len=7, num_gaps=3,
                        num_mismatches=2,
                        final_min_cluster_size=60)
                )(
                task_names=["task"],
                contrib_scores=scores,
                hypothetical_contribs=hyp_scores,
                one_hot=pos_train_onehot_data)

NaN or High value for dtype float32

I've repeatedly encountered motif discovery failing in Round 2 with a NaN or value too high exception.

To isolate, I've tried several models with different layers and layer structures fitted and scored with DeepLIFT. I've also tried adjusting from the default parameters (as noted in the notebook) and the parameters used by the notebook. I have yet to find a pattern in the failures.

Additionally, I've tried different subsets of the same data, and the complete data set. I've also tried alternate sequence data sets. The only thing I've noticed is the smaller subset dataset tends to produce the error less, but this is inconsistent with the alternate datasets that are inherently small.

For comparisons:
I have a HOTAIR dataset "GSE31332_hotair_oe_peaks" of 832 sequences we'll call this the full set
Longest sequence 2551
Shortest sequence 756 padded with 0s
I have subsetted it to 155 sequences we'll call this one the 'small' subset

I'm not sure what information would be helpful to identify what I can improve in preprocessing or parameters passed to avoid the exception.

If it will help I can also bundle a notebook and dataset that includes the whole workflow from classification to motif discovery. I could also provide the raw output from the motif discovery calls.

Thanks

Installation error

tfmodisco/modisco/util.py

Line 468 in 81f2aa4

except Exception, _:

I got a syntax error with python 3.5.

    except Exception, _:
                    ^
SyntaxError: invalid syntax

same for

tfmodisco/modisco/util.py

Line 502 in 81f2aa4

except KeyboardInterrupt, OSError:

need to change to

except (Exception, _):
except (KeyboardInterrupt, OSError):

thanks,
Jun

AssertionError: Probabilities don't sum to 1 along axis 1

There is an error message on utils/compute_per_position_ic(ppm, background, pseudocount) function. It says that the probabilities of ppm don't sum to 1 along axis 1. I think it has something to do with the warning I obtained earlier:

RuntimeWarning: invalid value encountered in true_divide
  vecs1/np.linalg.norm(vecs1, axis=1)[:,None]

I'm using grad X input as importance scores and grad alone as hypothetical scores. Is this error message implying that the importance scores are problematic?

Details attached below:

MEMORY 2.821525504
On task task
Computing windowed sums on original
Generating null dist
peak(mu)= -2.2886458784347058e-05
Computing threshold
Thresholds from null dist were -2.4279579957947135 and 1.8365790802054107
Final raw thresholds are -2.4279579957947135 and 1.8365790802054107
Final transformed thresholds are -0.9624493367346939 and 0.9441734693877551

Got 66719 coords
After resolving overlaps, got 66719 seqlets
Across all tasks, the weakest transformed threshold used was: 0.9440734693877552
MEMORY 3.77667584
66719 identified in total
min_metacluster_size_frac * len(seqlets) = 667 is more than min_metacluster_size=100.
Using it as a new min_metacluster_size
2 activity patterns with support >= 667 out of 2 possible patterns
Metacluster sizes: [45352, 21367]
Idx to activities: {0: '1', 1: '-1'}
MEMORY 3.777286144
On metacluster 1
Metacluster size 21367 limited to 20000
Relevant tasks: ('task',)
Relevant signs: (-1,)
TfModiscoSeqletsToPatternsFactory: seed=1234
(Round 1) num seqlets: 20000
(Round 1) Computing coarse affmat
MEMORY 3.777286144
Beginning embedding computation
Computing embeddings
Using TensorFlow backend.
Finished embedding computation in 27.73 s
Starting affinity matrix computations
/home/ubuntu/anaconda3/envs/kipoi-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/modisco/affinitymat/core.py:184: RuntimeWarning: invalid value encountered in true_divide
vecs1/np.linalg.norm(vecs1, axis=1)[:,None],
/home/ubuntu/anaconda3/envs/kipoi-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/modisco/affinitymat/core.py:187: RuntimeWarning: invalid value encountered in true_divide
vecs2/np.linalg.norm(vecs2, axis=1)[:,None],
Normalization computed in 2.45 s
Cosine similarity mat computed in 13.23 s
Normalization computed in 2.54 s
Cosine similarity mat computed in 13.88 s
Finished affinity matrix computations in 35.55 s
(Round 1) Compute nearest neighbors from coarse affmat
MEMORY 7.015927808
Computed nearest neighbors in 24.5 s
MEMORY 7.261278208
(Round 1) Computing affinity matrix on nearest neighbors
MEMORY 7.261278208
Launching nearest neighbors affmat calculation job
MEMORY 7.39745792
Parallel runs completed
MEMORY 7.595122688
Job completed in: 396.55 s
MEMORY 10.60612096
Launching nearest neighbors affmat calculation job
MEMORY 10.60352
Parallel runs completed
MEMORY 10.724483072
Job completed in: 395.72 s
MEMORY 13.735211008
(Round 1) Computed affinity matrix on nearest neighbors in 800.06 s
MEMORY 10.774994944
/home/ubuntu/anaconda3/envs/kipoi-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/scipy/stats/stats.py:4196: SpearmanRConstantInputWarning: An input array is constant; the correlation coefficent is not defined.
warnings.warn(SpearmanRConstantInputWarning())
Filtered down to 422 of 20000
(Round 1) Retained 422 rows out of 20000 after filtering
MEMORY 10.775801856
(Round 1) Computing density adapted affmat
MEMORY 5.975793664
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 422 samples in 0.001s...
[t-SNE] Computed neighbors for 422 samples in 0.002s...
[t-SNE] Computed conditional probabilities for sample 422 / 422
[t-SNE] Mean sigma: 0.883924
(Round 1) Computing clustering
MEMORY 5.975793664
Beginning preprocessing + Leiden
0%| | 0/50 [00:00<?, ?it/s]
Quality: 0.45266841163373195
Quality: 0.45729671160348756
100%|██████████| 50/50 [00:02<00:00, 22.66it/s]
Got 10 clusters after round 1
Counts:
{4: 34, 3: 50, 8: 16, 1: 81, 2: 67, 5: 30, 0: 84, 9: 15, 6: 27, 7: 18}
MEMORY 5.977145344
(Round 1) Aggregating seqlets in each cluster
MEMORY 5.977145344
Aggregating for cluster 0 with 84 seqlets
MEMORY 5.977145344

Trimming eliminated 0 seqlets out of 84

AssertionError: Probabilities don't sum to 1 along axis 1 in [[0.28571429 0.16666667 0.22619048 0.20238095

self.seqlet_aggregator return empty list

Hi @AvantiShri

Computed threshold 2.1903742329729723
378 coords remaining after thresholding
After resolving overlaps, got 378 seqlets
1 activity patterns with support >= 200 out of 3 possible patterns
Metacluster sizes:  [378]
Idx to activities:  {0: '-1'}
On metacluster 0
Metacluster size 378
Relevant tasks:  ('task0',)
Relevant signs:  (-1,)
(Round 1) num seqlets: 378
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 17.45 s
Starting affinity matrix computations
Normalization computed in 0.34 s
Cosine similarity mat computed in 0.56 s
Normalization computed in 0.34 s
Cosine similarity mat computed in 0.55 s
Finished affinity matrix computations in 1.33 s
(Round 1) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 0.01 s
(Round 1) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 4.04 s
Launching nearest neighbors affmat calculation job
Job completed in: 4.01 s
(Round 1) Computed affinity matrix on nearest neighbors in 8.64 s
Filtered down to 309 of 378
(Round 1) Retained 309 rows out of 378 after filtering
(Round 1) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 309 samples in 0.000s...
[t-SNE] Computed neighbors for 309 samples in 0.001s...
[t-SNE] Computed conditional probabilities for sample 309 / 309
[t-SNE] Mean sigma: 0.000000
(Round 1) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 0.027967453002929688 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.2s
Louvain completed 200 runs in 2.220554828643799 seconds
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:    1.5s finished
Wrote graph to binary file in 0.02801656723022461 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.887512
Louvain completed 51 runs in 2.2180330753326416 seconds
Preproc + Louvain took 4.560196876525879 s
Got 14 clusters after round 1
Counts:
{0: 39, 1: 35, 2: 33, 3: 29, 4: 25, 5: 24, 6: 23, 7: 22, 8: 19, 9: 18, 10: 16, 11: 10, 12: 9, 13: 7}
(Round 1) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 39 seqlets
Trimmed 0 out of 39
Skipped 39 seqlets
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-107-9e507e90d86f> in <module>()
     42     contrib_scores={'task0': scores},
     43     hypothetical_contribs={'task0': hyp_scores},
---> 44     one_hot=exon)

/opt/modules/i12g/anaconda/3-5.0.1/envs/splicing/lib/python3.5/site-packages/modisco/tfmodisco_workflow/workflow.py in __call__(self, task_names, contrib_scores, hypothetical_contribs, one_hot)
    216                 other_comparison_track_names=[])
    217 
--> 218             seqlets_to_patterns_result = seqlets_to_patterns(metacluster_seqlets)
    219             metacluster_idx_to_submetacluster_results[metacluster_idx] =\
    220                 SubMetaclusterResults(

/opt/modules/i12g/anaconda/3-5.0.1/envs/splicing/lib/python3.5/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py in __call__(self, seqlets)
    598                     sys.stdout.flush()
    599                 motifs = self.seqlet_aggregator(cluster_to_seqlets[i])
--> 600                 assert len(motifs)==1
    601                 motif = motifs[0]
    602                 if (self.sign_consistency_func(motif)):

AssertionError:

When debuging, len(motifs) is 0. cluster_to_seqlets looks normal but seqlet_aggregator gives a empty list.

memory usage bottleneck

Hi Avanti,

I've been trying to figure out the memory bottleneck when using tfmodisco. It turns out the initially dense matrix created by seqlets2patterns doesn't take that much memory with 40k seqlets per metacluster (~40gb). I then narrow it down to the graph2binary() in modisco/cluster/phenograph/core.py. graph2binary() creates a really large list before writing it out to a binary file:

181 f.writelines([e for t in zip(ij, s) for e in t])

For a 4gb sparse matrix, the list can be ~60gb. avoid creating this list, I can run tfmodisco with 40k seqlets per metacluster with 200G memory.

I've submitted a PR to make a revision on this. I'm not too familiar with the codebase yet. So let me know if I miss anything.

viz_sequence plot

Hi Avanti,

I have noticed that when we plot the output of sequences, the location of nucleotide starts at index zero (which it needs to be one), and the nucleotide is between two locations(rather than being in the middle).

Also, how can we replace the letter 'T' with 'U' when we want to plot for RNA sequences?

ps: I tried few things but couldn't fix all together correctly. Also great works both deeplift and modisco.

Bug in checking weights sum to 1

Hi Avanti, I've reopened this issue to fix a small bug. You had a < instead of > in your fix. I left a one-line review on the PR.

Originally posted by @BeyondTheProof in #79 (comment)

libary import leidenalg and tqdm is missing in the dependencies

after running the newest version I get the error:

  File "tfmodisco/modisco/cluster/core.py", line 7, in <module>
    import leidenalg
ModuleNotFoundError: No module named 'leidenalg'

Same with tqdm

after installling leidenalg and tqdm (here I used
conda install -c conda-forge leidenalg tqdm) the error is gone:

I think you should add them to the dependencies in your setup file

"UnboundLocalError: local variable 'motif' referenced before assignment" in some runs

Hi, I think that there is a bug in the code. I ran some examples and Modisco ran ok, but on some run it return error:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
 in 
     18     [hyp_impscores, impscores, onehot_data] = pickle.load(open('modisco_input_'+model_id_list[i]+'_'+data_type_list[i]+'_'+str('all_outputs' if target_range_list[i] is None else target_range_list[i])+'.sav', 'rb'))
     19 
---> 20     run_modisco(hyp_impscores, impscores, onehot_data)
     21 
     22     #zip outputfolder

 in run_modisco(hyp_impscores, impscores, onehot_data)
     50                     contrib_scores={'task0': impscores},
     51                     hypothetical_contribs={'task0': hyp_impscores},
---> 52                     one_hot=onehot_data)#,
     53                     #null_per_pos_scores={'task0': nulldist_perposimp})
     54 

~/.conda/envs/tf1.3env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/workflow.py in __call__(self, task_names, contrib_scores, hypothetical_contribs, one_hot, null_per_pos_scores, per_position_contrib_scores, revcomp, other_tracks, just_return_seqlets, plot_save_dir)
    392                   other_comparison_track_names=[])
    393                 seqlets_to_patterns_result = seqlets_to_patterns(
--> 394                                               metacluster_seqlets)
    395             else:
    396                 seqlets_to_patterns_result = None

~/.conda/envs/tf1.3env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py in __call__(self, seqlets)
    840                     cluster_indices=cluster_results.cluster_indices,
    841                     sign_consistency_check=True,
--> 842                     min_seqlets_in_motif=0)
    843 
    844             #obtain unique seqlets from adjusted motifs

~/.conda/envs/tf1.3env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py in get_cluster_to_aggregate_motif(self, seqlets, cluster_indices, sign_consistency_check, min_seqlets_in_motif)
    651                                   +" seqlets due to sign disagreement")
    652                         cluster_to_eliminated_motif[i] = motif
--> 653                 cluster_to_motif[i] = motif
    654         return cluster_to_motif, cluster_to_eliminated_motif
    655 

UnboundLocalError: local variable 'motif' referenced before assignment

This is very strange error and Is don't know why it's happening. The input type, size of all the examples are the same which makes is even more strange.

I looked at the source code, and I think that I see the problem:

def get_cluster_to_aggregate_motif(self, seqlets, cluster_indices,
                                       sign_consistency_check,
                                       min_seqlets_in_motif):
        num_clusters = max(cluster_indices+1)
        cluster_to_seqlets = defaultdict(list) 
        assert len(seqlets)==len(cluster_indices)
        for seqlet,idx in zip(seqlets, cluster_indices):
            cluster_to_seqlets[idx].append(seqlet)

        cluster_to_motif = OrderedDict()
        cluster_to_eliminated_motif = OrderedDict()
        for i in range(num_clusters):
            if (len(cluster_to_seqlets[i]) >= min_seqlets_in_motif):
                if (self.verbose):
                    print("Aggregating for cluster "+str(i)+" with "
                          +str(len(cluster_to_seqlets[i]))+" seqlets")
                    print_memory_use()
                    sys.stdout.flush()
                motifs = self.seqlet_aggregator(cluster_to_seqlets[i])
                assert len(motifs)<=1
                **if (len(motifs) > 0):
                    motif = motifs[0]**
                    if (sign_consistency_check==False or
                        self.sign_consistency_func(motif)):
                        cluster_to_motif[i] = motif
                    else:
                        if (self.verbose):
                            print("Dropping cluster "+str(i)+
                                  " with "+str(motif.num_seqlets)
                                  +" seqlets due to sign disagreement")
                        cluster_to_eliminated_motif[i] = motif
                cluster_to_motif[i] = motif
        return cluster_to_motif, cluster_to_eliminated_motif

Probably if (len(motifs) > 0) is false, then motif doesn't get assigned, and therefore we get the error

the full output of the run before raising the error

MEMORY 0.725680128
On task task0
Computing windowed sums on original
Generating null dist
peak(mu)= -2.3275281772017475
Computing threshold
Thresholds from null dist were -20.881900787353516  and  12.298645496368408
Passing windows frac was 1.604938271604938e-05 , which is below  0.03 ; adjusting
Final raw thresholds are -6.489653731584548  and  6.489653731584548
Final transformed thresholds are -0.97  and  0.97
Got 21719 coords
After resolving overlaps, got 21719 seqlets
Across all tasks, the weakest transformed threshold used was: 0.9699
MEMORY 1.04849408
21719 identified in total
min_metacluster_size_frac * len(seqlets) = 217 is more than min_metacluster_size=100.
Using it as a new min_metacluster_size
2 activity patterns with support >= 217 out of 2 possible patterns
Metacluster sizes:  [19485, 2234]
Idx to activities:  {0: '-1', 1: '1'}
MEMORY 1.0488832
On metacluster 1
Metacluster size 2234
Relevant tasks:  ('task0',)
Relevant signs:  (1,)
TfModiscoSeqletsToPatternsFactory: seed=1234
(Round 1) num seqlets: 2234
(Round 1) Computing coarse affmat
MEMORY 1.0492928
Beginning embedding computation
Computing embeddings
Using TensorFlow backend.
Finished embedding computation in 23.11 s
Starting affinity matrix computations
Normalization computed in 0.97 s
Cosine similarity mat computed in 1.87 s
Normalization computed in 1.0 s
Cosine similarity mat computed in 1.89 s
Finished affinity matrix computations in 3.8 s
(Round 1) Compute nearest neighbors from coarse affmat
MEMORY 1.617543168
Computed nearest neighbors in 0.43 s
MEMORY 1.644843008
(Round 1) Computing affinity matrix on nearest neighbors
MEMORY 1.644843008
Launching nearest neighbors affmat calculation job
MEMORY 1.645068288
Parallel runs completed
MEMORY 1.679843328
Job completed in: 19.56 s
MEMORY 1.679843328
Launching nearest neighbors affmat calculation job
MEMORY 1.679843328
Parallel runs completed
MEMORY 1.693884416
Job completed in: 19.6 s
MEMORY 1.693884416
(Round 1) Computed affinity matrix on nearest neighbors in 39.7 s
MEMORY 1.693884416
Filtered down to 1428 of 2234
(Round 1) Retained 1428 rows out of 2234 after filtering
MEMORY 1.693884416
(Round 1) Computing density adapted affmat
MEMORY 1.693884416
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 1428 samples in 0.002s...
[t-SNE] Computed neighbors for 1428 samples in 0.019s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1428
[t-SNE] Computed conditional probabilities for sample 1428 / 1428
[t-SNE] Mean sigma: 0.252760
(Round 1) Computing clustering
MEMORY 1.694502912
Beginning preprocessing + Leiden
  0%|          | 0/50 [00:00<?, ?it/s]Quality: 0.6641347090604255
  4%|▍         | 2/50 [00:00<00:11,  4.12it/s]Quality: 0.6646815153647481
 30%|███       | 15/50 [00:05<00:11,  3.12it/s]Quality: 0.6647057219061275
 44%|████▍     | 22/50 [00:07<00:10,  2.72it/s]Quality: 0.664788009501556
 46%|████▌     | 23/50 [00:08<00:10,  2.66it/s]Quality: 0.6648110658355727
 54%|█████▍    | 27/50 [00:09<00:07,  2.92it/s]Quality: 0.6648135041051698
 80%|████████  | 40/50 [00:14<00:03,  2.88it/s]Quality: 0.6649336606412637
100%|██████████| 50/50 [00:17<00:00,  2.88it/s]Got 19 clusters after round 1
Counts:
{10: 28, 3: 168, 12: 11, 6: 98, 1: 186, 4: 160, 2: 181, 8: 75, 9: 53, 0: 231, 5: 108, 18: 3, 7: 84, 11: 19, 14: 5, 16: 4, 15: 4, 13: 7, 17: 3}
MEMORY 1.69637888
(Round 1) Aggregating seqlets in each cluster
MEMORY 1.69637888
Aggregating for cluster 0 with 231 seqlets
MEMORY 1.69637888

Trimmed 6 out of 231
Skipped 131 seqlets
Aggregating for cluster 1 with 186 seqlets
MEMORY 1.69637888
Trimmed 14 out of 186
Skipped 85 seqlets
Aggregating for cluster 2 with 181 seqlets
MEMORY 1.69637888
Trimmed 8 out of 181
Skipped 83 seqlets
Aggregating for cluster 3 with 168 seqlets
MEMORY 1.69637888
Trimmed 8 out of 168
Skipped 79 seqlets
Aggregating for cluster 4 with 160 seqlets
MEMORY 1.696632832
Trimmed 21 out of 160
Skipped 64 seqlets
Aggregating for cluster 5 with 108 seqlets
MEMORY 1.696632832
Trimmed 14 out of 108
Skipped 35 seqlets
Aggregating for cluster 6 with 98 seqlets
MEMORY 1.696903168
Trimmed 11 out of 98
Skipped 41 seqlets
Aggregating for cluster 7 with 84 seqlets
MEMORY 1.696903168
Trimmed 5 out of 84
Skipped 38 seqlets
Aggregating for cluster 8 with 75 seqlets
MEMORY 1.696903168
Trimmed 16 out of 75
Skipped 20 seqlets
Aggregating for cluster 9 with 53 seqlets
MEMORY 1.697173504
Trimmed 1 out of 53
Skipped 24 seqlets
Aggregating for cluster 10 with 28 seqlets
MEMORY 1.697173504
Trimmed 4 out of 28
Skipped 5 seqlets
Dropping cluster 10 with 19 seqlets due to sign disagreement
Aggregating for cluster 11 with 19 seqlets
MEMORY 1.697173504
Trimmed 0 out of 19
Skipped 13 seqlets
Aggregating for cluster 12 with 11 seqlets
MEMORY 1.697173504
Trimmed 1 out of 11
Skipped 6 seqlets
Skipped 1 seqlets
Dropping cluster 12 with 3 seqlets due to sign disagreement
Aggregating for cluster 13 with 7 seqlets
MEMORY 1.697173504
Trimmed 0 out of 7
Skipped 4 seqlets
Aggregating for cluster 14 with 5 seqlets
MEMORY 1.697173504
Trimmed 0 out of 5
Skipped 4 seqlets
Aggregating for cluster 15 with 4 seqlets
MEMORY 1.697173504
Trimmed 0 out of 4
Skipped 2 seqlets
Aggregating for cluster 16 with 4 seqlets
MEMORY 1.697173504
Trimmed 0 out of 4
Skipped 3 seqlets
Aggregating for cluster 17 with 3 seqlets
MEMORY 1.697173504
Trimmed 0 out of 3
Skipped 2 seqlets
Aggregating for cluster 18 with 3 seqlets
MEMORY 1.697173504
Trimmed 0 out of 3
Skipped 1 seqlets
(Round 2) num seqlets: 678
(Round 2) Computing coarse affmat
MEMORY 1.697173504
Beginning embedding computation
Computing embeddings
Finished embedding computation in 8.51 s
Starting affinity matrix computations
Normalization computed in 0.31 s
Cosine similarity mat computed in 0.44 s
Normalization computed in 0.34 s
Cosine similarity mat computed in 0.46 s
Finished affinity matrix computations in 0.9 s
(Round 2) Compute nearest neighbors from coarse affmat
MEMORY 1.32960256
Computed nearest neighbors in 0.18 s
MEMORY 1.310908416
(Round 2) Computing affinity matrix on nearest neighbors
MEMORY 1.310908416
Launching nearest neighbors affmat calculation job
MEMORY 1.310908416
Parallel runs completed
MEMORY 1.313882112
Job completed in: 9.12 s
MEMORY 1.313882112
Launching nearest neighbors affmat calculation job
MEMORY 1.313882112
Parallel runs completed
MEMORY 1.316995072
Job completed in: 9.12 s
MEMORY 1.31678208
(Round 2) Computed affinity matrix on nearest neighbors in 18.41 s
MEMORY 1.31678208
Not applying filtering for rounds above first round
MEMORY 1.31678208
(Round 2) Computing density adapted affmat
MEMORY 1.31678208
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 678 samples in 0.000s...
[t-SNE] Computed neighbors for 678 samples in 0.005s...
[t-SNE] Computed conditional probabilities for sample 678 / 678
[t-SNE] Mean sigma: 0.239224
(Round 2) Computing clustering
MEMORY 1.31678208
Beginning preprocessing + Leiden
  0%|          | 0/50 [00:00<?, ?it/s]Quality: 0.5880064841766943
Quality: 0.5885924241506224
  4%|▍         | 2/50 [00:00<00:05,  9.59it/s]Quality: 0.5889366171256917
  6%|▌         | 3/50 [00:00<00:05,  8.94it/s]Quality: 0.5895947801572112
 48%|████▊     | 24/50 [00:02<00:02,  9.23it/s]Quality: 0.589600400957549
 58%|█████▊    | 29/50 [00:03<00:02,  8.15it/s]Quality: 0.5900198475942349
100%|██████████| 50/50 [00:05<00:00,  8.84it/s]Got 13 clusters after round 2
Counts:
{10: 27, 0: 148, 9: 32, 8: 35, 2: 77, 5: 51, 6: 41, 1: 99, 4: 56, 11: 4, 3: 67, 7: 38, 12: 3}
MEMORY 1.317494784
(Round 2) Aggregating seqlets in each cluster
MEMORY 1.317494784
Aggregating for cluster 0 with 148 seqlets
MEMORY 1.317494784

Trimmed 14 out of 148
Skipped 36 seqlets
Aggregating for cluster 1 with 99 seqlets
MEMORY 1.317494784
Trimmed 8 out of 99
Skipped 30 seqlets
Aggregating for cluster 2 with 77 seqlets
MEMORY 1.317494784
Trimmed 7 out of 77
Skipped 30 seqlets
Aggregating for cluster 3 with 67 seqlets
MEMORY 1.317494784
Trimmed 12 out of 67
Skipped 15 seqlets
Aggregating for cluster 4 with 56 seqlets
MEMORY 1.317494784
Trimmed 20 out of 56
Skipped 4 seqlets
Aggregating for cluster 5 with 51 seqlets
MEMORY 1.317494784
Trimmed 4 out of 51
Skipped 22 seqlets
Aggregating for cluster 6 with 41 seqlets
MEMORY 1.317494784
Trimmed 1 out of 41
Skipped 12 seqlets
Aggregating for cluster 7 with 38 seqlets
MEMORY 1.317494784
Trimmed 5 out of 38
Skipped 11 seqlets
Aggregating for cluster 8 with 35 seqlets
MEMORY 1.317494784
Trimmed 23 out of 35
Skipped 2 seqlets
Dropping cluster 8 with 10 seqlets due to sign disagreement
Aggregating for cluster 9 with 32 seqlets
MEMORY 1.317494784
Trimmed 1 out of 32
Skipped 11 seqlets
Aggregating for cluster 10 with 27 seqlets
MEMORY 1.317494784
Trimmed 4 out of 27
Skipped 8 seqlets
Aggregating for cluster 11 with 4 seqlets
MEMORY 1.317494784
Trimmed 0 out of 4
Aggregating for cluster 12 with 3 seqlets
MEMORY 1.317494784
Trimmed 0 out of 3
Got 13 clusters
Splitting into subclusters...
MEMORY 1.317494784
Inspecting for spurious merging
Wrote graph to binary file in 0.0877983570098877 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00366525
After 2 runs, maximum modularity is Q = 0.00459325
After 6 runs, maximum modularity is Q = 0.00459326
After 19 runs, maximum modularity is Q = 0.00459755
Louvain completed 39 runs in 2.8975212574005127 seconds
Similarity is 0.87047726; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.040494441986083984 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00184102
After 3 runs, maximum modularity is Q = 0.00196909
After 4 runs, maximum modularity is Q = 0.00199081
After 9 runs, maximum modularity is Q = 0.00201487
Louvain completed 29 runs in 1.6093993186950684 seconds
Similarity is 0.94781816; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.03618764877319336 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.005053
Louvain completed 21 runs in 0.8584208488464355 seconds
Similarity is 0.77573985; is_dissimilar is True
Got 2 subclusters
Inspecting for spurious merging
Wrote graph to binary file in 0.03833961486816406 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00128103
After 3 runs, maximum modularity is Q = 0.00162226
After 23 runs, maximum modularity is Q = 0.00169374
After 41 runs, maximum modularity is Q = 0.00169375
Louvain completed 61 runs in 2.6738028526306152 seconds
Similarity is 0.93972546; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.028965473175048828 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0028314
After 16 runs, maximum modularity is Q = 0.00283141
Louvain completed 36 runs in 1.5165183544158936 seconds
Similarity is 0.8591566; is_dissimilar is False
Merging on 14 clusters
MEMORY 1.320300544
On merging iteration 1
Computing pattern to seqlet distances
Computing pattern to pattern distances
Collapsing 1 & 4 with prob 2.67478323550952e-05 and sim 0.9289754111263918
Collapsing 1 & 7 with prob 1.0721644058876706e-05 and sim 0.8732028830389162
Trimmed 0 out of 101
Skipped 9 seqlets
Trimmed 1 out of 120
Skipped 9 seqlets
On merging iteration 2
Computing pattern to seqlet distances
Computing pattern to pattern distances
Got 12 patterns after merging
MEMORY 1.32102144
Performing seqlet reassignment
MEMORY 1.32102144
Cross contin jaccard time taken: 5.29 s
Cross contin jaccard time taken: 0.03 s
Discarded 26 seqlets
Skipped 25 seqlets
Skipped 11 seqlets
Skipped 58 seqlets
Got 3 patterns after reassignment
MEMORY 1.322172416
Total time taken is 149.47s
MEMORY 1.322205184
On metacluster 0
Metacluster size 19485
Relevant tasks:  ('task0',)
Relevant signs:  (-1,)
TfModiscoSeqletsToPatternsFactory: seed=1234
(Round 1) num seqlets: 19485
(Round 1) Computing coarse affmat
MEMORY 1.314852864
Beginning embedding computation
Computing embeddings
Finished embedding computation in 204.75 s
Starting affinity matrix computations
Normalization computed in 8.19 s
Cosine similarity mat computed in 78.17 s
Normalization computed in 8.39 s
Cosine similarity mat computed in 80.0 s
Finished affinity matrix computations in 161.0 s
(Round 1) Compute nearest neighbors from coarse affmat
MEMORY 7.240065024
Computed nearest neighbors in 23.25 s
MEMORY 7.478202368
(Round 1) Computing affinity matrix on nearest neighbors
MEMORY 7.478202368
Launching nearest neighbors affmat calculation job
MEMORY 7.478747136
Parallel runs completed
MEMORY 7.625007104
Job completed in: 176.31 s
MEMORY 7.625007104
Launching nearest neighbors affmat calculation job
MEMORY 7.6225536
Parallel runs completed
MEMORY 7.699324928
Job completed in: 176.71 s
MEMORY 10.58056192
(Round 1) Computed affinity matrix on nearest neighbors in 358.36 s
MEMORY 10.733633536
Filtered down to 11479 of 19485
(Round 1) Retained 11479 rows out of 19485 after filtering
MEMORY 10.73389568
(Round 1) Computing density adapted affmat
MEMORY 6.177910784
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 11479 samples in 0.103s...
[t-SNE] Computed neighbors for 11479 samples in 0.605s...
[t-SNE] Computed conditional probabilities for sample 1000 / 11479
[t-SNE] Computed conditional probabilities for sample 2000 / 11479
[t-SNE] Computed conditional probabilities for sample 3000 / 11479
[t-SNE] Computed conditional probabilities for sample 4000 / 11479
[t-SNE] Computed conditional probabilities for sample 5000 / 11479
[t-SNE] Computed conditional probabilities for sample 6000 / 11479
[t-SNE] Computed conditional probabilities for sample 7000 / 11479
[t-SNE] Computed conditional probabilities for sample 8000 / 11479
[t-SNE] Computed conditional probabilities for sample 9000 / 11479
[t-SNE] Computed conditional probabilities for sample 10000 / 11479
[t-SNE] Computed conditional probabilities for sample 11000 / 11479
[t-SNE] Computed conditional probabilities for sample 11479 / 11479
[t-SNE] Mean sigma: 0.258674
(Round 1) Computing clustering
MEMORY 6.177910784
Beginning preprocessing + Leiden
  0%|          | 0/50 [00:00<?, ?it/s]Quality: 0.7514436404431755
  2%|▏         | 1/50 [00:05<04:38,  5.68s/it]Quality: 0.7542646129977288
 14%|█▍        | 7/50 [00:35<03:37,  5.06s/it]Quality: 0.7549363507610374
 40%|████      | 20/50 [02:01<03:00,  6.03s/it]Quality: 0.7549657141623181
100%|██████████| 50/50 [05:06<00:00,  6.13s/it]Got 24 clusters after round 1
Counts:
{2: 1522, 1: 1875, 3: 1115, 4: 1017, 8: 524, 12: 60, 6: 883, 11: 157, 7: 822, 20: 3, 9: 208, 10: 170, 0: 2090, 17: 5, 5: 950, 13: 26, 19: 4, 21: 2, 18: 5, 14: 24, 15: 8, 22: 2, 16: 5, 23: 2}
MEMORY 6.178697216
(Round 1) Aggregating seqlets in each cluster
MEMORY 6.178697216
Aggregating for cluster 0 with 2090 seqlets
MEMORY 6.178697216

Trimmed 192 out of 2090
Skipped 1284 seqlets
Aggregating for cluster 1 with 1875 seqlets
MEMORY 6.181855232
Trimmed 40 out of 1875
Skipped 1090 seqlets
Aggregating for cluster 2 with 1522 seqlets
MEMORY 6.183706624
Trimmed 49 out of 1522
Skipped 844 seqlets
Aggregating for cluster 3 with 1115 seqlets
MEMORY 6.18489856
Trimmed 56 out of 1115
Skipped 588 seqlets
Aggregating for cluster 4 with 1017 seqlets
MEMORY 6.185091072
Trimmed 48 out of 1017
Skipped 545 seqlets
Aggregating for cluster 5 with 950 seqlets
MEMORY 6.187085824
Trimmed 58 out of 950
Skipped 507 seqlets
Aggregating for cluster 6 with 883 seqlets
MEMORY 6.187180032
Trimmed 38 out of 883
Skipped 471 seqlets
Aggregating for cluster 7 with 822 seqlets
MEMORY 6.187864064
Trimmed 35 out of 822
Skipped 431 seqlets
Aggregating for cluster 8 with 524 seqlets
MEMORY 6.18852352
Trimmed 30 out of 524
Skipped 280 seqlets
Aggregating for cluster 9 with 208 seqlets
MEMORY 6.1896704
Trimmed 8 out of 208
Skipped 109 seqlets
Aggregating for cluster 10 with 170 seqlets
MEMORY 6.1896704
Trimmed 6 out of 170
Skipped 84 seqlets
Aggregating for cluster 11 with 157 seqlets
MEMORY 6.1896704
Trimmed 7 out of 157
Skipped 65 seqlets
Aggregating for cluster 12 with 60 seqlets
MEMORY 6.1896704
Trimmed 17 out of 60
Skipped 6 seqlets
Skipped 1 seqlets
Aggregating for cluster 13 with 26 seqlets
MEMORY 6.189678592
Trimmed 0 out of 26
Skipped 17 seqlets
Aggregating for cluster 14 with 24 seqlets
MEMORY 6.189678592
Trimmed 5 out of 24
Skipped 9 seqlets
Aggregating for cluster 15 with 8 seqlets
MEMORY 6.189678592
Trimmed 0 out of 8
Skipped 4 seqlets
Aggregating for cluster 16 with 5 seqlets
MEMORY 6.189678592
Trimmed 0 out of 5
Skipped 2 seqlets
Aggregating for cluster 17 with 5 seqlets
MEMORY 6.189678592
Trimmed 0 out of 5
Skipped 2 seqlets
Aggregating for cluster 18 with 5 seqlets
MEMORY 6.189678592
Trimmed 0 out of 5
Skipped 2 seqlets
Aggregating for cluster 19 with 4 seqlets
MEMORY 6.189678592
Trimmed 0 out of 4
Skipped 2 seqlets
Aggregating for cluster 20 with 3 seqlets
MEMORY 6.189678592
Trimmed 0 out of 3
Skipped 1 seqlets
Aggregating for cluster 21 with 2 seqlets
MEMORY 6.189678592
Trimmed 0 out of 2
Aggregating for cluster 22 with 2 seqlets
MEMORY 6.189678592
Trimmed 0 out of 2
Skipped 1 seqlets
Aggregating for cluster 23 with 2 seqlets
MEMORY 6.189678592
Trimmed 0 out of 2
Skipped 1 seqlets
(Round 2) num seqlets: 4544
(Round 2) Computing coarse affmat
MEMORY 6.189678592
Beginning embedding computation
Computing embeddings
Finished embedding computation in 57.22 s
Starting affinity matrix computations
Normalization computed in 1.91 s
Cosine similarity mat computed in 5.64 s
Normalization computed in 1.96 s
Cosine similarity mat computed in 5.52 s
Finished affinity matrix computations in 11.3 s
(Round 2) Compute nearest neighbors from coarse affmat
MEMORY 6.189678592
Computed nearest neighbors in 1.9 s
MEMORY 6.007017472
(Round 2) Computing affinity matrix on nearest neighbors
MEMORY 6.007017472
Launching nearest neighbors affmat calculation job
MEMORY 6.007017472
Parallel runs completed
MEMORY 6.010351616
Job completed in: 62.09 s
MEMORY 6.010351616
Launching nearest neighbors affmat calculation job
MEMORY 6.010351616
Parallel runs completed
MEMORY 6.01032704
Job completed in: 62.31 s
MEMORY 6.0102656
(Round 2) Computed affinity matrix on nearest neighbors in 125.49 s
MEMORY 6.0102656
Not applying filtering for rounds above first round
MEMORY 6.0102656
(Round 2) Computing density adapted affmat
MEMORY 6.0102656
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 4544 samples in 0.016s...
[t-SNE] Computed neighbors for 4544 samples in 0.132s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4544
[t-SNE] Computed conditional probabilities for sample 2000 / 4544
[t-SNE] Computed conditional probabilities for sample 3000 / 4544
[t-SNE] Computed conditional probabilities for sample 4000 / 4544
[t-SNE] Computed conditional probabilities for sample 4544 / 4544
[t-SNE] Mean sigma: 0.235279
(Round 2) Computing clustering
MEMORY 6.0102656
Beginning preprocessing + Leiden
  0%|          | 0/50 [00:00<?, ?it/s]Quality: 0.7117903305258136
 70%|███████   | 35/50 [00:45<00:18,  1.24s/it]Quality: 0.7119394059910836
 92%|█████████▏| 46/50 [00:59<00:04,  1.17s/it]Quality: 0.7121246762215024
100%|██████████| 50/50 [01:05<00:00,  1.31s/it]Got 12 clusters after round 2
Counts:
{0: 1096, 1: 730, 2: 675, 5: 395, 9: 65, 10: 52, 6: 306, 11: 11, 3: 582, 4: 467, 7: 88, 8: 77}
MEMORY 6.011097088
(Round 2) Aggregating seqlets in each cluster
MEMORY 6.011097088
Aggregating for cluster 0 with 1096 seqlets
MEMORY 6.011097088

Trimmed 388 out of 1096
Skipped 708 seqlets
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
 in 
     18     [hyp_impscores, impscores, onehot_data] = pickle.load(open('modisco_input_'+model_id_list[i]+'_'+data_type_list[i]+'_'+str('all_outputs' if target_range_list[i] is None else target_range_list[i])+'.sav', 'rb'))
     19 
---> 20     run_modisco(hyp_impscores, impscores, onehot_data)
     21 
     22     #zip outputfolder

 in run_modisco(hyp_impscores, impscores, onehot_data)
     50                     contrib_scores={'task0': impscores},
     51                     hypothetical_contribs={'task0': hyp_impscores},
---> 52                     one_hot=onehot_data)#,
     53                     #null_per_pos_scores={'task0': nulldist_perposimp})
     54 

~/.conda/envs/tf1.3env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/workflow.py in __call__(self, task_names, contrib_scores, hypothetical_contribs, one_hot, null_per_pos_scores, per_position_contrib_scores, revcomp, other_tracks, just_return_seqlets, plot_save_dir)
    392                   other_comparison_track_names=[])
    393                 seqlets_to_patterns_result = seqlets_to_patterns(
--> 394                                               metacluster_seqlets)
    395             else:
    396                 seqlets_to_patterns_result = None

~/.conda/envs/tf1.3env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py in __call__(self, seqlets)
    840                     cluster_indices=cluster_results.cluster_indices,
    841                     sign_consistency_check=True,
--> 842                     min_seqlets_in_motif=0)
    843 
    844             #obtain unique seqlets from adjusted motifs

~/.conda/envs/tf1.3env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py in get_cluster_to_aggregate_motif(self, seqlets, cluster_indices, sign_consistency_check, min_seqlets_in_motif)
    651                                   +" seqlets due to sign disagreement")
    652                         cluster_to_eliminated_motif[i] = motif
--> 653                 cluster_to_motif[i] = motif
    654         return cluster_to_motif, cluster_to_eliminated_motif
    655 

UnboundLocalError: local variable 'motif' referenced before assignment

ValueError: perplexity must be less than n_samples

This might be related to issue 111, not sure. I first ran into that issue, and now, after upgrading to v0.5.16.3, I get an error as below, also at the t-SNE step.

Essentially I am following the example notebook TF_MoDISco_TAL_GATA.ipynb , with the difference that I have contribution scores for more sequences, computed by ISM.

The complete output is attached. Thanks for looking into this! tfmodisco_output.txt

hypothetical importance scores

Can you very briefly explain the hypothetical importance scores that accompanies importance scores in the TF MoDISco TAL GATA example? Are these the sequences with randomized importance scores?

Comparing motifs from TFMoDisco and other methods?

What are recommended ways to compare motifs from TFMoDisco and other methods in terms of similarity and significance?

Support for win32

Looks like the dependency modisco/cluster/phenograph/core.py uses an old codebase:

#copied from https://github.com/jacoblevine/PhenoGraph/blob/master/phenograph/core.py

The most recent update from that repo includes support for win32 systems.

I did try to modify a local copy of:
modisco/cluster/core.py -- edit, not sure this file was modified, not near system right now
modisco/cluster/init.py
modisco/affinitymat/transformers.py

to import a copy of jacoblevine/phenograph installed from pip instead (for the up to date copy) but the sample TF MoDISco TAL GATA breaks due to other functions missing from the phenograph .py's

Could we update modisco to require jacoblevine/phenograph as a dependency and add hooks or overrides instead of extending the .pys directly through copy/paste?

Alternatively is there a known list of all functions that were added to pheonograph's files that I could use to build this?

AssertionError during spurious merging detection

I ran the workflow described in the TAL-GATA notebook on a dataset of 10,000 sequences and had it crashing during the following output

On task task0
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Done 0
Done 5000
Got 100000 coords
Computing thresholds
Bandwidth calculated: 0.006317182109341957

Computed threshold 0.09235431982160662
16694 coords remaining after thresholding
After resolving overlaps, got 16694 seqlets
2 activity patterns with support >= 200 out of 3 possible patterns
Metacluster sizes:  [8563, 8131]
Idx to activities:  {0: '1', 1: '-1'}
On metacluster 1
Metacluster size 8131
Relevant tasks:  ('task0',)
Relevant signs:  (-1,)
(Round 1) num seqlets: 8131
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 23.47 s
Starting affinity matrix computations
Normalization computed in 0.4 s
Cosine similarity mat computed in 8.9 s
Normalization computed in 0.42 s
Cosine similarity mat computed in 9.09 s
Finished affinity matrix computations in 18.04 s
(Round 1) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 4.12 s
(Round 1) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 412.93 s
Launching nearest neighbors affmat calculation job
Job completed in: 396.49 s
(Round 1) Computed affinity matrix on nearest neighbors in 836.2 s
Filtered down to 3729 of 8131
(Round 1) Retained 3729 rows out of 8131 after filtering
(Round 1) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 3729 samples in 0.028s...
[t-SNE] Computed neighbors for 3729 samples in 0.304s...
[t-SNE] Computed conditional probabilities for sample 1000 g 3729
[t-SNE] Computed conditional probabilities for sample 2000 g 3729
[t-SNE] Computed conditional probabilities for sample 3000 g 3729
[t-SNE] Computed conditional probabilities for sample 3729 g 3729
[t-SNE] Mean sigma: 0.242688
(Round 1) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 2.0018186569213867 seconds
Running Louvain modularity optimization

[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.9s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    7.5s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:    9.0s finished

Louvain completed 200 runs in 44.16686940193176 seconds
Wrote graph to binary file in 29.004573583602905 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.770261
Louvain completed 51 runs in 32.641292333602905 seconds
Preproc + Louvain took 110.05334711074829 s
Got 10 clusters after round 1
Counts:
{0: 665, 1: 661, 2: 453, 3: 433, 4: 293, 5: 292, 6: 255, 7: 250, 8: 242, 9: 185}
(Round 1) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 665 seqlets
Trimmed 56 out of 665
Aggregating for cluster 1 with 661 seqlets
Trimmed 25 out of 661
Aggregating for cluster 2 with 453 seqlets
Trimmed 13 out of 453
Aggregating for cluster 3 with 433 seqlets
Trimmed 9 out of 433
Aggregating for cluster 4 with 293 seqlets
Trimmed 37 out of 293
Aggregating for cluster 5 with 292 seqlets
Trimmed 19 out of 292
Aggregating for cluster 6 with 255 seqlets
Trimmed 23 out of 255
Aggregating for cluster 7 with 250 seqlets
Trimmed 11 out of 250
Aggregating for cluster 8 with 242 seqlets
Trimmed 3 out of 242
Aggregating for cluster 9 with 185 seqlets
Trimmed 80 out of 185
(Round 2) num seqlets: 3384
(Round 2) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 14.28 s
Starting affinity matrix computations
Normalization computed in 0.18 s
Cosine similarity mat computed in 2.37 s
Normalization computed in 0.15 s
Cosine similarity mat computed in 2.15 s
Finished affinity matrix computations in 4.53 s
(Round 2) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 0.79 s
(Round 2) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 169.95 s
Launching nearest neighbors affmat calculation job
Job completed in: 170.58 s
(Round 2) Computed affinity matrix on nearest neighbors in 350.47 s
Not applying filtering for rounds above first round
(Round 2) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 3384 samples in 0.023s...
[t-SNE] Computed neighbors for 3384 samples in 0.201s...
[t-SNE] Computed conditional probabilities for sample 1000 g 3384
[t-SNE] Computed conditional probabilities for sample 2000 g 3384
[t-SNE] Computed conditional probabilities for sample 3000 g 3384
[t-SNE] Computed conditional probabilities for sample 3384 g 3384
[t-SNE] Mean sigma: 0.232723
(Round 2) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 2.150753974914551 seconds
Running Louvain modularity optimization

[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.9s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    7.0s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:    8.3s finished

Louvain completed 200 runs in 37.864238262176514 seconds
Wrote graph to binary file in 26.53944683074951 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.735008
After 3 runs, maximum modularity is Q = 0.735012
Louvain completed 53 runs in 32.00108766555786 seconds
Preproc + Louvain took 102.56415510177612 s
Got 10 clusters after round 2
Counts:
{0: 684, 1: 643, 2: 447, 3: 333, 4: 270, 5: 268, 6: 244, 7: 213, 8: 187, 9: 95}
(Round 2) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 684 seqlets
Trimmed 36 out of 684
Aggregating for cluster 1 with 643 seqlets
Trimmed 196 out of 643
Aggregating for cluster 2 with 447 seqlets
Trimmed 11 out of 447
Aggregating for cluster 3 with 333 seqlets
Trimmed 35 out of 333
Aggregating for cluster 4 with 270 seqlets
Trimmed 53 out of 270
Aggregating for cluster 5 with 268 seqlets
Trimmed 51 out of 268
Aggregating for cluster 6 with 244 seqlets
Trimmed 71 out of 244
Aggregating for cluster 7 with 213 seqlets
Trimmed 1 out of 213
Aggregating for cluster 8 with 187 seqlets
Trimmed 38 out of 187
Aggregating for cluster 9 with 95 seqlets
Trimmed 17 out of 95
Got 10 clusters
Splitting into subclusters...
Inspecting for spurious merging
Wrote graph to binary file in 2.5786495208740234 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00666441
After 5 runs, maximum modularity is Q = 0.00666491
After 7 runs, maximum modularity is Q = 0.00666508
Louvain completed 27 runs in 11.415183544158936 seconds
Similarity is 0.9657364747344646; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 2.4306464195251465 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0115702
Louvain completed 21 runs in 8.048066139221191 seconds
Similarity is 0.4938100661821699; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.40985536575317383 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00615868
Louvain completed 21 runs in 7.569864511489868 seconds
Similarity is 0.8205082009587877; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.8328981399536133 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00869918
After 2 runs, maximum modularity is Q = 0.00870365
Louvain completed 22 runs in 7.8633904457092285 seconds
Similarity is 0.5888639898895076; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.10254526138305664 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0055831
After 4 runs, maximum modularity is Q = 0.00563051
After 5 runs, maximum modularity is Q = 0.0058986
Louvain completed 25 runs in 9.386006593704224 seconds
Similarity is 0.7556675777195498; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.04614901542663574 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00538144
After 4 runs, maximum modularity is Q = 0.00546483
After 14 runs, maximum modularity is Q = 0.0054789
Louvain completed 34 runs in 12.289081811904907 seconds
Similarity is 0.7917860410164314; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.039235830307006836 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00343145
After 2 runs, maximum modularity is Q = 0.00372749
After 4 runs, maximum modularity is Q = 0.00410102
Louvain completed 24 runs in 8.939475536346436 seconds
Similarity is 0.7403454717479024; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.0208895206451416 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0037994
Louvain completed 21 runs in 7.1052069664001465 seconds
Similarity is 0.7817881029658589; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.016419172286987305 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = -0.00200093
After 2 runs, maximum modularity is Q = -1.17814e-09
After 14 runs, maximum modularity is Q = 4.71256e-09
Louvain completed 34 runs in 11.735629796981812 seconds

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-888b58a4d9a6> in <module>()
     44                 contrib_scores=task_to_scores,
     45                 hypothetical_contribs=task_to_hyp_scores,
---> 46                 one_hot=onehot_data)

tfmodisco/modisco/tfmodisco_workflow/workflow.py in __call__(self, task_names, contrib_scores, hypothetical_contribs, one_hot)
    216                 other_comparison_track_names=[])
    217 
--> 218             seqlets_to_patterns_result = seqlets_to_patterns(metacluster_seqlets)
    219             metacluster_idx_to_submetacluster_results[metacluster_idx] =\
    220                 SubMetaclusterResults(

tfmodisco/modisco/tfmodisco_workflow/seqlets_to_patterns.py in __call__(self, seqlets)
    620 
    621         split_patterns = self.spurious_merge_detector(
--> 622                             cluster_to_motif.values())
    623 
    624         #Now start merging patterns

tfmodisco/modisco/aggregator.py in __call__(self, aggregated_seqlets)
    217             subcluster_indices = self.cluster_fwd_seqlet_data(
    218                                     fwd_seqlet_data=fwd_seqlet_data,
--> 219                                     affmat=affmat)  
    220             num_subclusters = np.max(subcluster_indices)+1
    221 

tfmodisco/modisco/aggregator.py in cluster_fwd_seqlet_data(self, fwd_seqlet_data, affmat)
    195                         fwd_seqlet_data=fwd_seqlet_data[mask_for_this_cluster],
    196                         affmat=(affmat[mask_for_this_cluster]
--> 197                                      [:,mask_for_this_cluster]))
    198                 subcluster_indices = np.array([
    199                     i if x==0 else x+1 for x in subcluster_indices])

tfmodisco/modisco/aggregator.py in cluster_fwd_seqlet_data(self, fwd_seqlet_data, affmat)
    195                         fwd_seqlet_data=fwd_seqlet_data[mask_for_this_cluster],
    196                         affmat=(affmat[mask_for_this_cluster]
--> 197                                      [:,mask_for_this_cluster]))
    198                 subcluster_indices = np.array([
    199                     i if x==0 else x+1 for x in subcluster_indices])

tfmodisco/modisco/aggregator.py in cluster_fwd_seqlet_data(self, fwd_seqlet_data, affmat)
    195                         fwd_seqlet_data=fwd_seqlet_data[mask_for_this_cluster],
    196                         affmat=(affmat[mask_for_this_cluster]
--> 197                                      [:,mask_for_this_cluster]))
    198                 subcluster_indices = np.array([
    199                     i if x==0 else x+1 for x in subcluster_indices])

tfmodisco/modisco/aggregator.py in cluster_fwd_seqlet_data(self, fwd_seqlet_data, affmat)
    195                         fwd_seqlet_data=fwd_seqlet_data[mask_for_this_cluster],
    196                         affmat=(affmat[mask_for_this_cluster]
--> 197                                      [:,mask_for_this_cluster]))
    198                 subcluster_indices = np.array([
    199                     i if x==0 else x+1 for x in subcluster_indices])

tfmodisco/modisco/aggregator.py in cluster_fwd_seqlet_data(self, fwd_seqlet_data, affmat)
    195                         fwd_seqlet_data=fwd_seqlet_data[mask_for_this_cluster],
    196                         affmat=(affmat[mask_for_this_cluster]
--> 197                                      [:,mask_for_this_cluster]))
    198                 subcluster_indices = np.array([
    199                     i if x==0 else x+1 for x in subcluster_indices])

tfmodisco/modisco/aggregator.py in cluster_fwd_seqlet_data(self, fwd_seqlet_data, affmat)
    195                         fwd_seqlet_data=fwd_seqlet_data[mask_for_this_cluster],
    196                         affmat=(affmat[mask_for_this_cluster]
--> 197                                      [:,mask_for_this_cluster]))
    198                 subcluster_indices = np.array([
    199                     i if x==0 else x+1 for x in subcluster_indices])

tfmodisco/modisco/aggregator.py in cluster_fwd_seqlet_data(self, fwd_seqlet_data, affmat)
    176         dicluster_results = self.diclusterer(affmat)
    177         dicluster_indices = dicluster_results.cluster_indices
--> 178         assert np.max(dicluster_indices)==1
    179 
    180         #check that the subclusters are more

AssertionError:

How to extract seqlets

Hi,

could anyone help me understand how I could extract the seqlets from TFmodisco. I want to have these seqlets and compare them with known motifs in a database. Thanks

Mean-normalize hypothetical contributions

I'm currently using TF-MoDISco and am considering mean-normalization since it's written in one of the examples in the Github repository that it is suspected it might improve the results. I was just wondering in what way mean-normalization would improve the results and why?

AssertionError: Probabilities don't sum to 1

Hi Avanti, thanks so much for making TF-MoDISco available to the community. We're finding some interesting things already!

I've run it multiple times on different datasets, and never had a problem until now. Do you know how to address this error?

(Round 1) Computed affinity matrix on nearest neighbors in 76.26 s
MEMORY 6.714896384
Filtered down to 4911 of 5381
(Round 1) Retained 4911 rows out of 5381 after filtering
MEMORY 6.715236352
(Round 1) Computing density adapted affmat
MEMORY 6.560714752
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 4911 samples in 0.026s...
[t-SNE] Computed neighbors for 4911 samples in 0.156s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4911
[t-SNE] Computed conditional probabilities for sample 2000 / 4911
[t-SNE] Computed conditional probabilities for sample 3000 / 4911
[t-SNE] Computed conditional probabilities for sample 4000 / 4911
[t-SNE] Computed conditional probabilities for sample 4911 / 4911
[t-SNE] Mean sigma: 0.137986
(Round 1) Computing clustering
MEMORY 6.560534528
Beginning preprocessing + Leiden
  0%|                                                                                                                                               | 0/50 [00:00<?, ?it/s]
Quality: 0.9017222780398779
  2%|██▋                                                                                                                                    | 1/50 [00:00<00:11,  4.38it/s]
Quality: 0.9017980768302472
 14%|██████████████████▉                                                                                                                    | 7/50 [00:01<00:10,  3.96it/s]
Quality: 0.9018023087167797
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00,  3.48it/s]
Got 23 clusters after round 1
Counts:
{17: 38, 1: 612, 6: 337, 14: 80, 19: 35, 0: 702, 7: 313, 2: 540, 4: 419, 3: 471, 18: 37, 11: 103, 5: 392, 10: 125, 12: 99, 16: 60, 8: 182, 13: 89, 22: 19, 21: 21, 9: 129,
15: 73, 20: 35}
MEMORY 6.343176192
(Round 1) Aggregating seqlets in each cluster
MEMORY 6.343176192
Aggregating for cluster 0 with 702 seqlets
MEMORY 6.343176192
Trimming eliminated 0 seqlets out of 702
Skipped 15 seqlets
Aggregating for cluster 1 with 612 seqlets
MEMORY 6.347374592
Trimming eliminated 0 seqlets out of 612
Skipped 16 seqlets
Aggregating for cluster 2 with 540 seqlets
MEMORY 6.3483904
Trimming eliminated 0 seqlets out of 540
Skipped 25 seqlets
Aggregating for cluster 3 with 471 seqlets
MEMORY 6.348947456
Trimming eliminated 0 seqlets out of 471
Skipped 11 seqlets
Aggregating for cluster 4 with 419 seqlets
MEMORY 6.349836288
Trimming eliminated 0 seqlets out of 419
Skipped 26 seqlets
Skipped 3 seqlets
Aggregating for cluster 5 with 392 seqlets
MEMORY 6.35052032
Trimming eliminated 0 seqlets out of 392
Skipped 13 seqlets
Aggregating for cluster 6 with 337 seqlets
MEMORY 6.351568896
Trimming eliminated 0 seqlets out of 337
Skipped 21 seqlets
Skipped 1 seqlets
Aggregating for cluster 7 with 313 seqlets
MEMORY 6.351806464
Trimming eliminated 0 seqlets out of 313
Skipped 22 seqlets
Aggregating for cluster 8 with 182 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 182
Skipped 6 seqlets
Aggregating for cluster 9 with 129 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 129
Skipped 3 seqlets
Aggregating for cluster 10 with 125 seqlets
MEMORY 6.352617472
Trimming eliminated 0 seqlets out of 125
Skipped 3 seqlets
Aggregating for cluster 11 with 103 seqlets
MEMORY 6.352662528
Trimming eliminated 0 seqlets out of 103
Skipped 4 seqlets
Skipped 4 seqlets
Aggregating for cluster 12 with 99 seqlets
MEMORY 6.352662528
Trimming eliminated 0 seqlets out of 99
Skipped 1 seqlets
Aggregating for cluster 13 with 89 seqlets
MEMORY 6.352879616
Trimming eliminated 0 seqlets out of 89
Skipped 8 seqlets
Skipped 1 seqlets
Aggregating for cluster 14 with 80 seqlets
MEMORY 6.353133568
Trimming eliminated 0 seqlets out of 80
Skipped 2 seqlets
Aggregating for cluster 15 with 73 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 73
Skipped 4 seqlets
Aggregating for cluster 16 with 60 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 60
Skipped 2 seqlets
Aggregating for cluster 17 with 38 seqlets
MEMORY 6.353403904
Trimming eliminated 0 seqlets out of 38
Skipped 1 seqlets
Traceback (most recent call last):
  File "/home/artur/miniconda3/envs/basepairmodels/bin/modisco", line 8, in <module>
    sys.exit(modisco_main())
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/basepairmodels/cli/run_modisco.py", line 115, in modisco_main
    one_hot=onehot_data)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/workflow.py", line 392, in __call__
    metacluster_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 884, in __call__
    min_seqlets_in_motif=0)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 680, in get_cluster_to_aggregate_mo
tif
    motifs = self.seqlet_aggregator(cluster_to_seqlets[i])
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 516, in __call__
    to_return = self.postprocessor(to_return)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 30, in __call__
    return self.func(aggregated_seqlets)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 21, in <lambda>
    func=lambda x: postprocessor(self(x)))
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 59, in __call__
    arr=self.score_positions(aggregated_seqlet),
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/aggregator.py", line 79, in score_positions
    ppm=ppm, background=self.bg_freq, pseudocount=0.001)
  File "/home/artur/miniconda3/envs/basepairmodels/lib/python3.7/site-packages/modisco/util.py", line 479, in compute_per_position_ic
    +str(ppm)+"\n"+str(np.sum(ppm, axis=1)))
AssertionError: Probabilities don't sum to 1 along axis 1 in [[0.02702703 0.         0.94594595 0.02702703]
[0.97297297 0.02702703 0.         0.        ]
 [0.83783784 0.13513514 0.02702703 0.        ]
 [0.05405405 0.08108108 0.         0.86486486]
 [0.08108108 0.13513514 0.72972973 0.05405405]
 [0.02702703 0.         0.91891892 0.05405405]
 [0.97297297 0.         0.02702703 0.        ]
 [0.86486486 0.13513514 0.         0.        ]
 [0.08108108 0.         0.         0.91891892]
 [0.         0.13513514 0.86486486 0.        ]
 [0.         0.05405405 0.94594595 0.        ]
 [0.91891892 0.         0.05405405 0.02702703]
 [0.81081081 0.16216216 0.         0.02702703]
 [0.         0.02702703 0.         0.97297297]
 [0.         0.21621622 0.75675676 0.02702703]
 [0.02702703 0.         0.97297297 0.        ]
 [1.         0.         0.         0.        ]
 [0.83783784 0.08108108 0.05405405 0.02702703]
 [0.02702703 0.         0.         0.97297297]
 [0.02702703 0.16216216 0.78378378 0.02702703]
 [0.         0.05405405 0.94594595 0.        ]
 [0.97297297 0.         0.02702703 0.        ]
 [0.89189189 0.08108108 0.         0.02702703]
 [0.         0.02702703 0.         0.97297297]
 [0.         0.10810811 0.89189189 0.        ]
 [0.         0.         0.97297297 0.        ]
 [0.97297297 0.         0.         0.        ]
 [0.81081081 0.13513514 0.02702703 0.        ]
 [0.         0.         0.         0.97297297]
 [0.         0.10810811 0.83783784 0.02702703]
 [0.         0.02702703 0.94594595 0.        ]
 [0.97297297 0.         0.         0.        ]
 [0.83783784 0.10810811 0.02702703 0.        ]
 [0.         0.02702703 0.         0.94594595]
 [0.         0.16216216 0.81081081 0.        ]
 [0.08108108 0.         0.86486486 0.02702703]
 [0.91891892 0.         0.05405405 0.        ]
 [0.89189189 0.08108108 0.         0.        ]
 [0.05405405 0.02702703 0.         0.89189189]
 [0.         0.16216216 0.78378378 0.02702703]
 [0.         0.05405405 0.89189189 0.02702703]
 [0.97297297 0.         0.         0.        ]
 [0.67567568 0.2972973  0.         0.        ]
 [0.05405405 0.05405405 0.         0.86486486]
 [0.         0.2972973  0.62162162 0.05405405]]
[1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297 0.97297297
 0.97297297 0.97297297 0.97297297]

I tried updating to the latest version, but the error persists. If need be, I can share the importances file with you directly. Thanks so much!

Artur

AssertionError : assert len(new_pattern)==1 in modisco/aggregator.py

I tried to run TF-modisco on different subsets of data and got the following error for one of the subsets (other works fine):

Traceback (most recent call last):
  File "TF_modisco.py", line 207, in <module>
    main([model], data, "modisco/")
  File "TF_modisco.py", line 173, in main
    run_modisco(hyp_impscores=hyp_ex_list, impscores=ex_list, onehot_data=features, modisco_path=modisco_path)
  File "TF_modisco.py", line 67, in run_modisco
    **nulldist_args)
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/workflow.py", line 377, in __call__
    metacluster_seqlets)
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 913, in __call__
    patterns=split_patterns) 
  File "/users/agnon/year2016/yaishof/.conda/envs/ofir_env/lib/python3.6/site-packages/modisco/aggregator.py", line 1007, in __call__
    assert len(new_pattern)==1
AssertionError

What can be the problem?

the full output:
output.txt

TypeError with PCA Initialization in TensorFlow-Modisco: Seeking Advice for Sparse Matrix Handling

Hello TensorFlow-Modisco Community,

I recently encountered an issue while running a TensorFlow-Modisco workflow which involves TSNE computations. The workflow fails with a TypeError related to PCA initialization, specifically when handling sparse input matrices. The error message is as follows:

# install TF-modisco
!pip install modisco==0.5.16.0

import sklearn
print(sklearn.__version__)

import modisco

# check parameters
import inspect
def get_default_args(func):
    signature = inspect.signature(func)
    return {
        k: v.default
        for k, v in signature.parameters.items()
        if v.default is not inspect.Parameter.empty
    }

get_default_args(modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow)

from importlib import reload
# reload(modisco.util)
# reload(modisco.pattern_filterer)
# reload(modisco.aggregator)
# reload(modisco.core)
# reload(modisco.seqlet_embedding.advanced_gapped_kmer)
# reload(modisco.affinitymat.transformers)
# reload(modisco.affinitymat.core)
# reload(modisco.affinitymat)
# reload(modisco.cluster.core)
# reload(modisco.cluster)
# reload(modisco.tfmodisco_workflow.seqlets_to_patterns)
# reload(modisco.tfmodisco_workflow)
# reload(modisco)

null_per_pos_scores = modisco.coordproducers.LaplaceNullDist(num_to_samp=5000)

# prepare TF-modisco function to run for dev or hk
def my_tfmodisco(task):
    ...
    [Rest of your code here]
    ...
    return tfmodisco_results

# function to visualize motifs
from collections import Counter
import numpy as np

from modisco.visualization import viz_sequence
reload(viz_sequence)
from matplotlib import pyplot as plt

import modisco.affinitymat.core
reload(modisco.affinitymat.core)
import modisco.cluster.phenograph.core
reload(modisco.cluster.phenograph.core)
import modisco.cluster.phenograph.cluster
reload(modisco.cluster.phenograph.cluster)
import modisco.cluster.core
reload(modisco.cluster.core)
import modisco.aggregator
reload(modisco.aggregator)

def modisco_motif_plots(task):
    ...
    [Rest of your code here]
    ...
    hdf5_results.close()

# Run TF-Modisco - takes around 10mins
dev_tfmodisco_results = my_tfmodisco('Dev_contrib_scores')

# save results
import modisco.util
reload(modisco.util)
grp = h5py.File("/content/drive/MyDrive/DeepSTARR_tutorial/Dev_modisco_results.hdf5", "w")
dev_tfmodisco_results.save_hdf5(grp)
grp.close()

TypeError Traceback (most recent call last)
[<ipython-input-121-086db4f84169>] in <cell line: 2>()
    1 # Run TF-Modisco - takes around 10mins
----> 2 dev_tfmodisco_results = my_tfmodisco('Dev_contrib_scores')
    3 
    4 
    5 # save results
5 frames
[/usr/local/lib/python3.10/dist-packages/sklearn/manifold/_t_sne.py] in _fit(self, X, skip_num_points)
    833 
    834         if isinstance(self.init, str) and self.init == "pca" and issparse(X):
--> 835             raise TypeError(
    836                 "PCA initialization is currently not supported "
    837                 "with the sparse input matrix. Use "
TypeError: PCA initialization is currently not supported with the sparse input matrix. Use init="random" instead.


This formatted code can be copied directly into a Markdown editor. The triple backticks (` ``` `) are used to start and end the code block, and `python` after the first set of backticks indicates that the block contains Python code, which helps in syntax highlighting. 😊👨‍💻

Any considerations for RNA motif discovery because of Coarse-Grained Affinity Matrix Computation?

Hi @AvantiShri,
Thanks again for the great work. I'm wondering if we should change anything in modisco workflow for RNAs? should we just set revcomp FAlSE? I'm asking this because of the concern issued in the paper in the 4.1.1 Coarse-Grained Affinity Matrix Computation section.

Thanks

ModuleNotFoundError: No module named 'modisco.visualization'; 'modisco' is not a package

Non relevant any more. Thank you

Unexpected keyword argument 'kmer_len' in constructor call of modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory

Hello,

Does new versions support kmer_len, num_gaps, and num_mismatches in modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory?

I get the error that the constructor does know these keyword. The program runs, but why this error occurs? Does in new versions we changing these make any change?

tomtom query edge case in `fetch_tomtom_matches`

@suragnair not all the query get at least $n number of motif matches. You can include the condition that if the next element in dat variable is empty to breaks the loop.

    r = []
    for t in dat[1:1+n]:
        if(len(t)<4):
            break
        mtf = {}
        mtf['Target ID'] = t[tget_idx]
        mtf['p-value'] = float(t[pval_idx])
        mtf['E-value'] = float(t[eval_idx])
        mtf['q-value'] = float(t[qval_idx])
        r.append(mtf)

Non-zeros in pattern["task0_contrib_scores"]["fwd"]

Hello,

Thanks for developing this excellent tool! I am very new to tf-modisco, and I am trying to understand tf-modisco output. One thing I am confused about is why all entries in pattern["task0_contrib_scores"]["fwd"] are non-zero? I thought pattern["task0_contrib_scores"] should be fetched from contrib_scores we input into modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow, which should only have non-zeros for the nucelotide that present one in one-hot-encoding sequence. I probably miss something here, so I will appreciate if anyone can explain it a bit

Thank you,
Yao

Error: zero-size array to reduction operation maximum which has no identity

I tried running modisco on a larger set of sequences (100k of 1kb sequences with 4 output tasks) and it crashed after coming down to metacluster 5 containing 37k seqlets. Seems that it has encountered an empty set neighbors_of_things_to_scan. Any idea how to prevent this from happening?

On metacluster 5
Metacluster size 37059
Relevant tasks:  ('Klf4/weighted',)
Relevant signs:  (-1,)
(Round 1) num seqlets: 37059
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 1520.43 s
Starting affinity matrix computations

....
(Round 2) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 0.018040180206298828 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:   41.9s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:  3.1min
[Parallel(n_jobs=10)]: Done 200 out of 200 | elapsed:  3.4min finished
Louvain completed 200 runs in 253.59952878952026 seconds
Wrote graph to binary file in 0.0037131309509277344 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.635078
Louvain completed 51 runs in 227.86760807037354 seconds
Preproc + Louvain took 481.4961562156677 s
Got 8 clusters after round 2
Counts:
{1: 8, 0: 13, 4: 4, 7: 2, 5: 3, 2: 6, 3: 4, 6: 2}
(Round 2) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 13 seqlets
Trimmed 0 out of 13
Dropping cluster 0 with 13 seqlets due to sign disagreement
Aggregating for cluster 1 with 8 seqlets
Trimmed 0 out of 8
Dropping cluster 1 with 8 seqlets due to sign disagreement
Aggregating for cluster 2 with 6 seqlets
Trimmed 0 out of 6
Dropping cluster 2 with 6 seqlets due to sign disagreement
Aggregating for cluster 3 with 4 seqlets
Trimmed 0 out of 4
Dropping cluster 3 with 4 seqlets due to sign disagreement
Aggregating for cluster 4 with 4 seqlets
Trimmed 0 out of 4
Dropping cluster 4 with 4 seqlets due to sign disagreement
Aggregating for cluster 5 with 3 seqlets
Trimmed 0 out of 3
Dropping cluster 5 with 3 seqlets due to sign disagreement
Aggregating for cluster 6 with 2 seqlets
Trimmed 0 out of 2
Dropping cluster 6 with 2 seqlets due to sign disagreement
Aggregating for cluster 7 with 2 seqlets
Trimmed 0 out of 2
Dropping cluster 7 with 2 seqlets due to sign disagreement
Got 0 clusters
Splitting into subclusters...
Merging on 0 clusters
On merging iteration 1
Computing pattern to seqlet distances
Traceback (most recent call last):
  File "/users/avsec/bin/anaconda3/envs/chipnexus/bin/basepair", line 11, in <module>
    load_entry_point('basepair', 'console_scripts', 'basepair')()
  File "/users/avsec/workspace/basepair/basepair/__main__.py", line 162, in main
    argh.dispatch(parser)
  File "/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/argh/dispatching.py", line 174, in dispatch
    for line in lines:
  File "/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/argh/dispatching.py", line 277, in _execute_command
    for line in result:
  File "/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/argh/dispatching.py", line 260, in _call
    result = function(*positional, **keywords)
  File "/users/avsec/workspace/basepair/basepair/cli/modisco.py", line 252, in modisco_run
    one_hot=thr_one_hot)
  File "/users/avsec/workspace/kl/tfmodisco/modisco/tfmodisco_workflow/workflow.py", line 309, in __call__
    seqlets_to_patterns_result = seqlets_to_patterns(metacluster_seqlets)
  File "/users/avsec/workspace/kl/tfmodisco/modisco/tfmodisco_workflow/seqlets_to_patterns.py", line 670, in __call__
    patterns=split_patterns, seqlets=seqlets) 
  File "/users/avsec/workspace/kl/tfmodisco/modisco/aggregator.py", line 939, in __call__
    filter_seqlets=patterns))
  File "/users/avsec/workspace/kl/tfmodisco/modisco/affinitymat/core.py", line 439, in __call__
    min_overlap=self.pattern_comparison_settings.min_overlap) 
  File "/users/avsec/workspace/kl/tfmodisco/modisco/affinitymat/core.py", line 474, in __call__
    assert np.max(neighbors_of_things_to_scan) < filters.shape[0]
  File "/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2320, in amax
    out=out, **kwargs)
  File "/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

TypeError: '<=' not supported between instances of 'list' and 'float'

Getting the above error when running MoDisco.

Minimal example to reproduce the error:

import modisco
import numpy as np
import tensorflow as tf

one_hot = tf.one_hot(np.random.randint(low=0, high=3, size=100*50).reshape(50, 100), depth=4).numpy()
print(one_hot.shape)

impscores = tf.random.uniform(shape=one_hot.shape).numpy()
print(impscores.shape)

null_per_pos_scores = modisco.coordproducers.LaplaceNullDist(num_to_samp=500)

tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
            target_seqlet_fdr=0.25,
            seqlets_to_patterns_factory=
                modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory(
                    initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(    
                        meme_command="meme", base_outdir="meme_out",   
                        #max_num_seqlets_to_use specifies the maximum number of seqlets to use
                        # with MEME (this is to speed up MEME in the cases where the number of seqlets is
                        #  very large)
                        max_num_seqlets_to_use=10000,
                        nmotifs=10,
                        n_jobs=4),
                    use_louvain=False,
                    #Adjust trim_to_window_size and initial_flank_to_add
                    # according to how big you expect
                    # the core motif to be; default value is 10
                    trim_to_window_size=8,
                    initial_flank_to_add=2,
                    final_flank_to_add=5,
                    final_min_cluster_size=20
            ),
       )(
    #There is only one task, so we just call this 'task0'
    task_names=["task0"],
    contrib_scores={'task0': impscores},                
    hypothetical_contribs={'task0': impscores},
    one_hot=one_hot,
    null_per_pos_scores=null_per_pos_scores)

Any help welcome.

kundajelab / tfmodisco Goto Github PK

tfmodisco's Introduction

TF-MoDISco: Transcription-Factor Motif Discovery from Importance Scores

Structure of TF-MoDISco

Installing TF-MoDISco

Required inputs to run the algorithm

Other resources

tfmodisco's People

Contributors

Stargazers

Watchers

Forkers

tfmodisco's Issues

Recommend Projects

Recommend Topics

Recommend Org