marbl / metagenomescope
Visualization tool for (meta)genome assembly graphs
Home Page: https://marbl.github.io/MetagenomeScope/
License: GNU General Public License v3.0
From @fedarko on August 8, 2016 17:3
Not really sure what format/etc would be used for this, but this would be useful for more complex assemblies (esp. for metagenomes).
Other issues definitely take priority for the time being, but keep this in mind.
Copied from original issue: fedarko/MetagenomeScope#48
From @fedarko on February 25, 2017 2:25
We want to be able to distinguish assemblies with lots of tiny components (e.g. biofilm 2) vs. asms with a few large components (e.g. shakya old, biofilm 1). Basically, ways to represent how "noisy" certain components are.
Copied from original issue: fedarko/MetagenomeScope#157
From @fedarko on May 22, 2017 17:52
It's important that this is taken into account in at least some fashion.
GML
A negative mean attribute of an edge indicates overlap.
GFA
We should be able to use CIGAR strings for links in order to determine overlap.
Perhaps we could (at least temporarily) indicate overlap edges as dotted or dashed edges. Ideally we'd actually overlay nodes visually, but I'm not sure how to reconcile that with GraphViz' layout parameters.
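A rough sketch of the GFA side of this, in Python: it sums the operations of a link's CIGAR string to get the overlap length on each segment. The function name is hypothetical, and treating the first segment as the CIGAR "reference" and the second as the "query" (SAM-style) is an assumption that should be checked against the GFA spec before relying on it.

```python
import re

# One CIGAR operation: a run length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# SAM-style conventions (assumed here): which operations consume which sequence.
_CONSUMES_FIRST = set("MDN=X")   # treated as the "reference" (first segment)
_CONSUMES_SECOND = set("MIS=X")  # treated as the "query" (second segment)


def cigar_overlap_lengths(cigar):
    """Return (overlap on first segment, overlap on second segment) in bases.

    Returns None when the overlap is unspecified ("*").
    """
    if cigar == "*":
        return None
    ops = _CIGAR_OP.findall(cigar)
    # Reject empty strings and anything the pattern did not fully cover.
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    first = sum(int(n) for n, op in ops if op in _CONSUMES_FIRST)
    second = sum(int(n) for n, op in ops if op in _CONSUMES_SECOND)
    return first, second
```

A link with a nonzero result here could then be flagged as an overlap edge (and drawn dotted or dashed, as suggested above).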
Copied from original issue: fedarko/MetagenomeScope#190
From @fedarko on January 5, 2017 4:21
Not entirely sure how we would do this (I don't have any experience with this), or how we would account for certain systems not having the requisite hardware (or if that is even an issue).
However, if done well this could result in significant performance gains.
Copied from original issue: fedarko/MetagenomeScope#123
From @fedarko on August 8, 2016 21:15
From Todd:
As the user zooms in (+/- buttons, or with mouse wheel, or gesture), auto-uncollapse
I like this idea a lot! However, I'm not so sure about its feasibility. (Un)collapsing takes a decent amount of time for large graphs, which are the primary types of graphs for which this feature would be useful. Imposing this action upon zooming in would delay zooming very noticeably, which—at least from my perspective as a user—seems like it'd be annoying more than anything else.
If we manage to get the graph to be stored more efficiently on-the-fly, then this could be a viable option. Until then, though, I think there are probably other features to pursue that would offer a better cost/benefit analysis.
Copied from original issue: fedarko/MetagenomeScope#53
From @fedarko on August 12, 2016 20:49
And, presumably, exporting the result to a FASTA file (or modified .db file, maybe?).
Copied from original issue: fedarko/MetagenomeScope#72
From @fedarko on March 26, 2017 1:8
We'd have to implement some custom functionality that is applied upon ending a pan gesture, as described here. However, I'm not sure if this would be really inefficient for large graphs.
I think for now there are probably other things that take priority, but this would be nice to do eventually.
Copied from original issue: fedarko/MetagenomeScope#178
From @fedarko on March 18, 2017 3:11
Think about "navigation via zoom."
Copied from original issue: fedarko/MetagenomeScope#170
From @fedarko on May 22, 2017 17:46
e.g. put a dotted line with no arrows between two nodes with similarity above a certain threshold.
Copied from original issue: fedarko/MetagenomeScope#189
From @fedarko on February 24, 2017 22:43
Store in .db file (in the assembly and components tables) and display in the viewer.
Copied from original issue: fedarko/MetagenomeScope#154
From @fedarko on March 18, 2017 3:38
(#71)
Like is done in Bambus 3 (see layout.py).
This will ensure that the bubbles we get from this code are actually bubbles.
Copied from original issue: fedarko/MetagenomeScope#172
From @fedarko on April 18, 2017 9:59
(as opposed to "DNA pointers"/node IDs)
We could process huge files using a chunk-by-chunk approach similar to how we process AGP files. We can also notify the user if the resulting DNA file generated is too large to fit in memory, I guess?
I'm not sure what the max download file size is -- I know data URLs aren't necessarily the best idea, so maybe something else? Or perhaps generate a .fasta file on the server side that the user can download?
Copied from original issue: fedarko/MetagenomeScope#182
From @fedarko on January 25, 2017 17:52
This should be done towards the release of the tool, when we've more clearly defined the types of graphs that will be generated:
There's some redundancy in columns in the .db file. The main example I can think of is the use of both a node width and node height column, even though nodes are currently scaled by area (so width should equal height for all nodes). Update: we're looking into scaling differently right now (see #164), so this might be a moot point.
Once we've settled on a final specification for graph types, eliminate redundant stuff like this from the .db file. This will reduce .db size, which will in turn reduce processing time in the viewer.
Copied from original issue: fedarko/MetagenomeScope#147
From @fedarko on January 5, 2017 4:8
This would be nice, but now that three filetypes (LastGraph, GFA, GML) are already supported I'll be working on adding functionality instead.
Copied from original issue: fedarko/MetagenomeScope#122
From @fedarko on September 27, 2016 15:40
This would let us load the graph quickly again.
Not sure re: how difficult this would be to do, or how much of a gain in performance we'd stand to achieve.
Copied from original issue: fedarko/MetagenomeScope#94
From @fedarko on April 16, 2017 9:29
I guess the input to the script would be:
.fa files or something.
The output would be a FASTA file containing all the DNA information in order, I guess? Ask to see what the ideal such format would be.
Copied from original issue: fedarko/MetagenomeScope#181
From @fedarko on January 6, 2017 23:6
Dependent upon #127.
Copied from original issue: fedarko/MetagenomeScope#128
From @fedarko on August 8, 2016 16:49
Todd suggested I include:
So I should definitely add these extensions and see how they work for graphs, adjusting accordingly.
Copied from original issue: fedarko/MetagenomeScope#42
From @fedarko on March 18, 2017 3:14
Would help the user easily navigate between structural patterns without having to manually deal with looking at the graph's complexity (?).
We could add UI buttons for this overlaid on the graph (would work best for mobile support), but it's worth considering using the left-right arrow keys or something for this also.
Copied from original issue: fedarko/MetagenomeScope#171
From @fedarko on September 27, 2016 15:44
e.g. on length, coverage/depth, multiplicity, ...
Copied from original issue: fedarko/MetagenomeScope#96
From @fedarko on April 16, 2017 9:25
So as to avoid interfering with the finishing process (#72).
At present, scaffolds just aren't viewable during the finishing process -- which isn't, like, horrible, but it'd be nice to have the ability to view them during finishing, for reference.
Copied from original issue: fedarko/MetagenomeScope#180
From @fedarko on February 24, 2017 22:54
Could tie in with #154. As we add more data to the components table in the .db files, we could display that here.
Copied from original issue: fedarko/MetagenomeScope#156
From @fedarko on August 2, 2017 3:15
The repeated costs of running sfdp (in SPQR mode) and dot (in standard mode) on cc's of single nodes aren't that bad when the number of those components is less than around 1000, but this quickly gets out of hand when there are literally tens of thousands of those components (for the one graph I'm working with now there's over 100k "single-node" components).
Solution: for a component containing a single node of width w inches and height h inches, with no edges and with no node groups, just set the bounding box of the resulting connected component (in SPQR and standard mode) to (w + some padding, h + some padding). Look at some current ways Graphviz lays this stuff out for guidance.
This has the potential to save a lot of time (by preventing lots of repeated calls to pygraphviz).
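A minimal sketch of that shortcut, assuming the padding is applied equally on every side (so the total added to each dimension is twice the padding). The function name and the default padding value are made up for illustration; the real value should come from looking at what Graphviz actually produces.

```python
def single_node_layout(width_in, height_in, padding_in=0.1):
    """Lay out a component made of one node, with no edges or node groups.

    All values are in inches. Returns (bb_width, bb_height, node_x, node_y),
    where the node is centered in the bounding box. The default padding is
    a placeholder, not a value taken from Graphviz.
    """
    if width_in <= 0 or height_in <= 0:
        raise ValueError("node dimensions must be positive")
    if padding_in < 0:
        raise ValueError("padding cannot be negative")
    bb_width = width_in + 2 * padding_in
    bb_height = height_in + 2 * padding_in
    # No layout program is invoked: the node simply sits in the middle.
    return bb_width, bb_height, bb_width / 2, bb_height / 2
```

Since this never touches pygraphviz, the cost per single-node component is a handful of arithmetic operations.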
Copied from original issue: fedarko/MetagenomeScope#252
From @fedarko on August 9, 2016 2:17
One of the stipulations of using sql.js is that we have to close the current database file before we're done with it; otherwise not-so-good things happen, presumably. I have this implemented now, in that closeDB() (our wrapper to the .close() function that has a check for null) is called:
beforeunload event)
These cases should cover things, but while testing in Chromium, I noticed weird behavior. A user could load a .db file, draw a component, etc. And then when the user hit the 'back' button, followed by the 'forward' button (to bring the user back to the AsmViz Viewer), the .db file would still be loaded: the graph-view-specific buttons (search, fit, collapse) were all still disabled (expected behavior), but the .db-specific items (component spinner, draw button, asm info button) were all still enabled. What's more, the .db file was still clearly loaded into memory: using the draw button worked fine.
I mean, on the surface this is actually a feature, but I'm still wary of it. I think it would be best to ensure the .db file is actually closed upon pressing 'back' and to force the user to reload it, so we know what's going on (and aren't subject to the whims of whatever behavior a browser decides is cool). Granted, I really don't know how to go about fixing this right now and I've got way more important stuff to focus on, but this is still something worth looking into, IMO.
Copied from original issue: fedarko/MetagenomeScope#55
From @fedarko on August 4, 2016 20:53
Explaining the workflow for collate.py and for the viewer interface. Also—if we're still doing it—should mention that negative nodes are prefixed with 'c', for complement (e.g. +3, -3 --> 3, c3) due to GraphViz not liking '-' in node names.
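For the documentation, the renaming rule could be illustrated with something like the following. This is only a sketch of the rule as described above (+3, -3 --> 3, c3); the function name is hypothetical and is not taken from collate.py.

```python
def graphviz_safe_id(node_id):
    """Rewrite an oriented node ID so it contains no '-' character.

    "+3" -> "3"   (forward orientation: the sign is dropped)
    "-3" -> "c3"  (reverse complement: '-' becomes a 'c' prefix)
    IDs without a sign are returned unchanged.
    """
    if node_id.startswith("-"):
        return "c" + node_id[1:]
    if node_id.startswith("+"):
        return node_id[1:]
    return node_id
```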
Copied from original issue: fedarko/MetagenomeScope#30
From @fedarko on June 4, 2017 23:30
Similarly to how -nodna allows the user to not pass sequence data to the .db file generated by collate.py, we should support features where the user can pick and choose which graph types to generate layouts for. Right now there's just two choices (double graph with variants detected vs. single graph with SPQR tree integration), but I imagine if/when we add the functionality of just normal single graphs without SPQR tree integration (a la #10) we should probably let the user choose which types to include in the .db file.
Also worth noting that GMLs don't really have a distinct single graph. Not sure if we should still lay out the undirected version of the graph, or just block that feature for all files with already-oriented contigs. Something to think about.
Copied from original issue: fedarko/MetagenomeScope#198
From @fedarko on April 25, 2017 20:3
After taking care of #183, this should be a nice feature to add.
If we do #180 before this -- perhaps we could colorize different AGP files' scaffolds differently? I guess the question arises of how to handle the same node being in two scaffolds from the different AGP files -- in that case, we could either color the node a neutral color or color it as a pie chart.
Or we could just enforce that only one scaffold can be displayed at once (from either of the loaded AGP files), and just have the "multiple AGP files" thing be independent of the way scaffolds are highlighted. We could even programmatically create multiple scaffoldCycler <p> things for every AGP file loaded, and add an "x" button or something that'd remove a given scaffoldCycler and unload its corresponding AGP file. That'd... be really cool, actually.
Copied from original issue: fedarko/MetagenomeScope#185
From @fedarko on September 27, 2016 15:29
This should make loading large files a lot faster.
Web workers are client-side, also, so that works well for the viewer application.
Talk to Jayaram about this—he has experience with web workers.
Copied from original issue: fedarko/MetagenomeScope#90
From @fedarko on July 21, 2016 3:29
See http://js.cytoscape.org/demos/310dca83ba6970812dd0/ for reference as to what this could look like. This could be really cool.
Copied from original issue: fedarko/MetagenomeScope#4
From @fedarko on July 22, 2016 18:7
Using cy.json(), node.js, and node-sqlite3 would allow assembly graphs to be viewed in Cytoscape.js or (with less support for certain Cytoscape.js-specific features -- see this) the desktop version of Cytoscape.
The upside of this is that we get the best of both worlds: the ability to view the graph in a browser (and, theoretically, host it on a website) and the ability to view the graph in desktop Cytoscape (which would make viewing huge graphs faster).
Note that I have pretty much no experience with desktop Cytoscape, and we'll probably have to tweak things to get this to work. (I think compound nodes, in particular, are only somewhat supported in desktop Cytoscape -- so I'll have to look into that.)
Copied from original issue: fedarko/MetagenomeScope#12
From @fedarko on September 27, 2016 15:48
This could be cool. As was mentioned in the meeting, we should use delays/etc. here so that the user doesn't accidentally show a ton of info while moving their mouse through the graph.
Copied from original issue: fedarko/MetagenomeScope#98
From @fedarko on December 28, 2016 23:22
There are a number of these things sprinkled throughout the xdot2cy.js code -- some have, e.g. O(2n) loops that could be brought down to O(n) fairly simply.
I doubt that this is causing a significant amount of bottleneck, but it'd be nice to sweep through the code and take care of this stuff at some point.
Most of these areas can be found by just searching for "TODO" (sans quotes).
Copied from original issue: fedarko/MetagenomeScope#115
From @fedarko on September 12, 2016 7:47
e.g. a "-pp" argument (for "pattern precedence" -- or a better acronym, I'm all ears) where users could pass in strings like BRCY or YCBR (where the capital letters match with their AsmViz abbreviation -- B is bubble, Y is cycle, C is chain, R is frayed rope).
This is fairly low-priority (esp. when compared to some of the other features/etc here), but it would still be a nice thing to add if users want to emphasize certain patterns (e.g. the user cares a lot about finding chains for some reason, so they want to find those before caring about finding cyclic chains, bubbles, ropes, etc).
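Something like this could handle the argument parsing, using argparse. It's a sketch only: the long option name, the dest name, and the BRCY default are all assumptions, since the issue doesn't say what the current detection order is.

```python
import argparse

# Letter -> pattern name, following the AsmViz abbreviations above.
PATTERN_NAMES = {"B": "bubble", "R": "frayed rope", "C": "chain", "Y": "cycle"}


def pattern_precedence(value):
    """Turn a string like "YCBR" into an ordered list of pattern names.

    Every letter must appear exactly once, in any order.
    """
    letters = value.upper()
    if sorted(letters) != sorted(PATTERN_NAMES):
        raise argparse.ArgumentTypeError(
            "expected each of B, R, C, Y exactly once, got %r" % value
        )
    return [PATTERN_NAMES[letter] for letter in letters]


def build_parser():
    parser = argparse.ArgumentParser(description="pattern precedence sketch")
    parser.add_argument(
        "-pp", "--patternprecedence",
        dest="pattern_precedence",
        type=pattern_precedence,
        default=pattern_precedence("BRCY"),  # hypothetical default order
        help="order in which structural patterns are searched for",
    )
    return parser
```

With this, `build_parser().parse_args(["-pp", "YCBR"]).pattern_precedence` gives the pattern names in the order the user asked for, and a string such as BBCY is rejected with a usage error.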
Copied from original issue: fedarko/MetagenomeScope#87
From @fedarko on February 17, 2017 19:29
Maybe something like --
A -> B -> C
_____/
where B could be some sort of gene
Copied from original issue: fedarko/MetagenomeScope#152
From @fedarko on September 27, 2016 15:54
This could be cool.
If this is going to be just one chart for the entire assembly, regardless of whatever the user does in the viewer, then I figure we could just generate the chart in collate.py and store the resulting image in the .db file. However, if we want the user to be able to actually modify the chart dynamically (e.g. to only view information for selected nodes' lengths) or if we want to generate many possible charts (e.g. a histogram of edge multiplicities), then it would make sense to generate the chart in the browser using something like d3.js.
I don't want to use too many dependencies in the project, but d3 shouldn't be too bad to use here.
Update: Yeah, we should extend this idea to support arbitrary node / edge fields (see #243 -- so not just lengths but also coverages, GC contents, other user-specified things...) It will be a bit challenging to pick good histogram bounds "on the fly", but there are probably libraries that can do this well.
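As a starting point for picking bounds "on the fly", here is a sketch that uses Sturges' rule (number of bins = ceil(log2(n)) + 1) with equal-width bins. Sturges' rule is just one common choice swapped in for illustration, and the function name is made up. It's written in Python to show the logic; the viewer would need the equivalent in JavaScript.

```python
import math


def sturges_histogram(values):
    """Return (bin_edges, counts) for a list of numbers.

    The bin count comes from Sturges' rule; bins are equal-width and
    span [min, max], with the maximum value placed in the last bin.
    """
    if not values:
        raise ValueError("need at least one value")
    lo, hi = min(values), max(values)
    if lo == hi:
        # Every value is identical: a single bin holds everything.
        return [lo, hi], [len(values)]
    n_bins = math.ceil(math.log2(len(values))) + 1
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    counts = [0] * n_bins
    for value in values:
        # Clamp so that value == hi lands in the final bin.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return edges, counts
```

Node lengths in assemblies tend to be heavily skewed, so a log-scaled axis or a different rule may well work better than this in practice.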
Copied from original issue: fedarko/MetagenomeScope#99
From @fedarko on September 27, 2016 15:31
I'm not really sure how this is done (or what, exactly, I can do in parallel—aside from batch stuff, which I already make use of), but talk to Jayaram about it. This would make interacting with the graph more efficient.
Copied from original issue: fedarko/MetagenomeScope#91
From @fedarko on March 18, 2017 3:6
We can uncollapse them as the user zooms in on selected clusters (see #53).
Copied from original issue: fedarko/MetagenomeScope#167
From @fedarko on July 21, 2016 3:58
Via a scrollbar at the bottom of the screen. There are some good JQuery options for this; I don't particularly want to add JQuery to this project just for one feature, but if I can't find any suitable alternatives it should be alright.
We might want to consider having a vertical scrollbar, too, for certain connected components (e.g. RF_oriented_1.xdot, at least right now) that can't be visualized as a linear or even semi-linear path through the graph. Also, we should probably take graph orientation into account for this, since we'll obviously have more of a need for vertical scrolling for down-up/up-down rotated graphs. And I'm not sure if having both a vertical and horizontal scrollbar is the best idea—how much overhead would it have when added on top of Cytoscape.js? Arguably the major bottleneck in interacting with large graphs in Cytoscape.js is the delay inherent to interacting with nodes/edges/node groups: collapsing an individual cluster in a large graph might take 5 seconds, 3-4 of which are just Cytoscape.js waiting to register a "cxttap" event with that cluster's element.
This seems like a fairly minor feature (that will probably be nontrivial to implement), so I'll probably work on getting some more critical things done first.
Copied from original issue: fedarko/MetagenomeScope#5
From @fedarko on February 24, 2017 22:50
I don't think this should be too difficult to implement. We'd just select all nodes of type node.[R/B/C/Y] and collapse accordingly -- and if no such node groups are in the currently drawn component then we can just disable the corresponding button, I guess. Handling the button stuff might take some time.
Copied from original issue: fedarko/MetagenomeScope#155
From @fedarko on July 22, 2016 17:15
This would make understanding really large graphs a bit easier, but for close analysis using a double graph is probably better. However, having this option available (I guess we'd figure it out in the Python script) would make things easier.
Maybe we could have the Python script automatically lay out a "single graph," and then store both the RC and non-RC information in the hypothetical database file generated? This would allow the Javascript UI to overlay RC nodes on top of non-RC nodes, if the user requests a double graph. I think this is similar to what Bandage does.
Note that this is only really feasible for contig assembly inputs (e.g. LastGraph files), not for scaffold inputs (e.g. GML files from BAMBUS).
Copied from original issue: fedarko/MetagenomeScope#10
From @fedarko on December 31, 2016 8:15
Figure out how to do it and do it (by just adding the depth info to the .db file).
Copied from original issue: fedarko/MetagenomeScope#118
From @fedarko on September 27, 2016 15:39
I guess this would be done in the viewer. We could do this by taking the bounding boxes for each component and "concatenating" them, with some margins in between. (For some reason, laying out components individually seems to be faster than laying out the entire graph at once—so we'd still generate layouts on a component-by-component basis, but just concatenate the resulting layouts to produce something basically equivalent.)
This probably shouldn't be the default behavior (will wreck viewer performance for huge graphs that have sizable non-largest components) but having it as an option would be useful in some cases (e.g. graphs composed of a lot of tiny component fragments).
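The "concatenation" itself is simple. A sketch, assuming components are placed left to right in a single row (the issue doesn't specify an arrangement, so that choice, the function name, and the default margin are all assumptions). It's shown in Python for clarity even though this would probably live in the viewer.

```python
def concatenate_layouts(bounding_boxes, margin=1.0):
    """Place component bounding boxes side by side in one row.

    bounding_boxes: list of (width, height) tuples, one per component.
    Returns (offsets, total_box), where offsets[i] is the (x, y) shift to
    apply to every coordinate in component i, and total_box is the
    (width, height) of the combined layout.
    """
    offsets = []
    x = 0.0
    tallest = 0.0
    for width, height in bounding_boxes:
        offsets.append((x, 0.0))
        x += width + margin
        tallest = max(tallest, height)
    # Drop the margin that was added after the last component.
    total_width = x - margin if bounding_boxes else 0.0
    return offsets, (total_width, tallest)
```

Each component keeps its own precomputed layout; only a constant offset is added to its coordinates when drawing.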
Copied from original issue: fedarko/MetagenomeScope#93
From @fedarko on June 7, 2017 1:49
assembly table)
Copied from original issue: fedarko/MetagenomeScope#207
From @fedarko on August 19, 2016 19:40
e.g. a list of nodes that are in various types of pattern groups, along with the color of that group.
Right now we only support four types of patterns, but it should be possible to take a file like this as input and reconcile it with the .db file in collate.py.
Copied from original issue: fedarko/MetagenomeScope#78
From @fedarko on January 18, 2017 10:22
This'll help make it clearer to the user what these controls do, reducing the need somewhat for inline documentation (although that could still be helpful).
Copied from original issue: fedarko/MetagenomeScope#140
From @fedarko on June 4, 2017 4:56
It works fine for me on Chromium, but people have experienced issues with the page opening up and then quickly disappearing. Doesn't seem to be caused by popup blockers.
From testing, we know that this JSFiddle (https://jsfiddle.net/bL6vg8vv/) exhibits this problem for at least one user -- therefore, the issue seems to be localized to that one invocation of window.open(). I need to determine how to offset this (consider using a different target than _blank? Maybe _tab or something?).
Copied from original issue: fedarko/MetagenomeScope#196
From @fedarko on January 6, 2017 23:7
Dependent upon #128.
Copied from original issue: fedarko/MetagenomeScope#129
From @fedarko on January 10, 2017 2:22
I'm not sure here. Per Emden Gansner, there isn't any set upper limit on graph size, although lots of nodes/edges can cause "an explosion in the problem size" due to how dot handles edges crossing multiple levels.
Gansner recommends a general upper limit of 1000 nodes per drawing (see the second link in the previous paragraph), which is not really practical for our purposes at all -- however, we know from experience with the Shakya dataset that laying out larger components isn't that bad (the first component of Shakya has around 6k nodes and 9.8k edges, and takes around five minutes to lay out using dot). Furthermore, since the conceit of AsmViz involves precomputing the layouts, I don't think worrying about efficiency at this part of the program is a super significant issue.
For huge, huge connected components, we have a few options (and almost certainly more that I'm not aware of):
Worth noting that one of the datasets Jay sent (it's in my testgraphs/ directory as 20170226_oriented.gml) has a largest connected component of around 19,000 nodes. Trying to lay out the .gv file generated for this component (which, due to backfilling, has 15,569 nodes and 31,322 edges) just crashes dot with a segfault -- running the layout operation from within PyGraphviz also results in a crash, naturally. So the upper limit in connected component size is maybe somewhere around 15k nodes as a most optimistic guess? (The massive amount of edges probably also factors into this.) At this size of connected component we can focus on SPQR decompositions/desktop Cytoscape/etc.
Assuming we go the "no DNA" approach, we can assume that most large graphs will generally approach having their main "size"-causing factors be node count and edge count (this relies upon the assumption that the amount of connected components and clusters, as related to the number of nodes/edges, will be somewhat uniform).
I've read that, at least in one environment, the upper limit of .db files in sql.js is around 120M.
Since the size of the shakya.db file is about 4.6M on my system (and would be, likely, somewhere around 4.8M if it included edge multiplicity and node G/C content data), we can do some (very approximate) math using this + the 120M figure as a basis to estimate the maximum total number of nodes and edges:
(my phone's camera is not that great, sorry)
So we can say that, only considering the .db side of things, the maximum amount of nodes is somewhere around 500,000.
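The handwritten math isn't reproduced here, but the general shape of this kind of estimate is a linear scaling: 120M / 4.8M is a factor of 25, so a .db file could hold roughly 25 times as many nodes as the reference file. A sketch, assuming file size grows linearly with node count. The 20,000-node input in the usage note is a placeholder back-calculated from the ~500,000 figure, not a measured count for shakya.db.

```python
def estimate_max_nodes(reference_nodes, reference_db_mb, db_limit_mb=120.0):
    """Scale a reference node count up to the .db size limit.

    Assumes .db size grows linearly with the number of nodes, which is
    only a rough approximation.
    """
    if reference_nodes < 0 or reference_db_mb <= 0 or db_limit_mb <= 0:
        raise ValueError("inputs must be positive")
    scale = db_limit_mb / reference_db_mb
    return round(reference_nodes * scale)
```

For example, `estimate_max_nodes(20000, 4.8)` gives 500000 under these assumptions.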
I know it's somewhere on the order of tens of thousands of nodes, but I'd have to check. (TODO: add here)
Copied from original issue: fedarko/MetagenomeScope#134
From @fedarko on October 13, 2016 23:51
Could be useful for decreasing clutter. Now that we only have three filetypes this isn't a big deal, but as we add support for more it would be nice to have more general stuff.
Copied from original issue: fedarko/MetagenomeScope#108
From @fedarko on March 4, 2017 5:21
List of issues:
An edge is removed, and then its cluster is collapsed and then uncollapsed.
Effect: the cluster tries to restore the edge even though it's been
removed.
Already fixed in my code (via modifications to uncollapseCluster()).
A cluster is collapsed, and an "exterior" edge incident on that cluster is
restored via institution of a lower threshold than before.
Effect: the edge is created using an invalid source and/or target,
resulting in an error.
Kind of almost fixed in my code (via modifications to the restoration code
in cullEdges()). Need to take care of rerouting edges carefully -- it might
be simplest to just collapse all edges (even the "removed" ones) in
collapseCluster(), although I'm not sure how to go about that (perhaps by
analyzing REMOVED_EDGES w/r/t any edges that are incident upon node(s)
within the cluster).
A cluster is collapsed, and edges are hidden below a certain threshold.
Effect: If the cluster contains "interior" edges that have weights beneath
the threshold, they will not be removed, even upon uncollapsing the cluster.
We could rectify this either in uncollapseCluster() (check if a threshold
has been imposed) or in cullEdges() (search through collapsed nodes for
edges to cull). The latter approach would probably be easiest.
A lot of edges are hidden at once (potentially thousands).
Effect: takes a lot of time, can cause a "not responsive" browser popup
Solution: link the process of hiding edges to progress bar and occasionally yield to that
Some of our demo assemblies don't have bundle size info included
Solution: don't hide edges that have a null/undefined bundle size
Also solution: ensure bsize is in all demo assemblies
Copied from original issue: fedarko/MetagenomeScope#161