marbl / metagenomescope

Visualization tool for (meta)genome assembly graphs

Home Page: https://marbl.github.io/MetagenomeScope/

License: GNU General Public License v3.0

Python 40.99% CSS 0.94% HTML 15.34% JavaScript 42.42% Makefile 0.32%
bioinformatics genome-assembly metagenomics visualization

metagenomescope's People

Contributors

fedarko

metagenomescope's Issues

Visualize overlap attributes of edges (resolve ambiguities in GFA/LastGraph files)

From @fedarko on May 22, 2017 17:52

It's important that this is taken into account in at least some fashion.

GML
A negative mean attribute of an edge indicates overlap.

GFA
We should be able to use CIGAR strings for links in order to determine overlap.

Perhaps we could (at least temporarily) indicate overlap edges as dotted or dashed edges. Ideally we'd actually overlay nodes visually, but I'm not sure how to reconcile that with GraphViz' layout parameters.
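
A minimal sketch of how the overlap length could be read off a GFA link's CIGAR string. The function name `overlap_lengths` is hypothetical, and it assumes the usual SAM convention for which operations consume which sequence, with the "from" segment treated as the reference and the "to" segment as the query. A `*` overlap is taken to mean "unspecified".

```python
import re

# Matches one CIGAR operation: a run length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Assumption: SAM-style semantics, "from" segment = reference, "to" = query.
_CONSUMES_FROM = set("MDN=X")
_CONSUMES_TO = set("MIS=X")


def overlap_lengths(cigar):
    """Return (from_len, to_len) for a link overlap, or None if unspecified."""
    if cigar == "*":
        return None
    ops = _CIGAR_OP.findall(cigar)
    # Reject strings containing characters the pattern did not consume.
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    from_len = sum(int(n) for n, op in ops if op in _CONSUMES_FROM)
    to_len = sum(int(n) for n, op in ops if op in _CONSUMES_TO)
    return from_len, to_len
```

An edge could then be flagged as an overlap edge whenever the result is not None and either length is greater than zero.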

Copied from original issue: fedarko/MetagenomeScope#190

Auto-uncollapse clusters as user zooms in on graph in Cytoscape.js

From @fedarko on August 8, 2016 21:15

From Todd:

As the user zooms in (+/- buttons, or with mouse wheel, or gesture), auto-uncollapse

I like this idea a lot! However, I'm not so sure about its feasibility. (Un)collapsing takes a decent amount of time for large graphs, which are the primary types of graphs for which this feature would be useful. Imposing this action upon zooming in would delay zooming very noticeably, which—at least from my perspective as a user—seems like it'd be annoying more than anything else.

If we manage to get the graph to be stored more efficiently on-the-fly, then this could be a viable option. Until then, though, I think there are probably other features to pursue that would offer a better cost/benefit analysis.

Copied from original issue: fedarko/MetagenomeScope#53

Implement panning limits

From @fedarko on March 26, 2017 1:08

We'd have to implement some custom functionality that is applied upon ending a pan gesture, as described here. However, I'm not sure if this would be really inefficient for large graphs.

I think for now there are probably other things that take priority, but this would be nice to do eventually.

Copied from original issue: fedarko/MetagenomeScope#178

Get DNA sequences from server-side .fasta file

From @fedarko on April 18, 2017 9:59

(as opposed to "DNA pointers"/node IDs)

We could process huge files using a chunk-by-chunk approach similar to how we process AGP files. We can also notify the user if the resulting DNA file generated is too large to fit in memory, I guess?

I'm not sure what the max download file size is -- I know data URLs aren't necessarily the best idea, so maybe something else? Or perhaps generate a .fasta file on the server side that the user can download?

Copied from original issue: fedarko/MetagenomeScope#182

Remove redundant statistics from .db file

From @fedarko on January 25, 2017 17:52

This should be done towards the release of the tool, when we've more clearly defined the types of graphs that will be generated:

There's some redundancy in columns in the .db file. The main example I can think of is the use of both a node width and node height column, even though nodes are currently scaled by area (so width should equal height for all nodes). Update: we're looking into scaling differently right now (see #164), so this might be a moot point.

Once we've settled on a final specification for graph types, eliminate redundant stuff like this from the .db file. This will reduce .db size, which will in turn reduce processing time in the viewer.

Copied from original issue: fedarko/MetagenomeScope#147

Write python script to convert node ID list / AGP file to FASTA file

From @fedarko on April 16, 2017 9:29

I guess the input to the script would be:

  1. the list of node IDs, created by the viewer application
  2. some source of DNA for nodes in the assembly graph in question. for GFA/LastGraph files I guess this would just be the original GFA/LastGraph file in question; for Bambus 3 GML files I guess this information would be contained in some .fa files or something.

The output would be a FASTA file containing all the DNA information in order, I guess? Ask to see what the ideal such format would be.
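
A rough sketch of what such a script could look like, assuming the DNA source is a plain FASTA file whose record IDs match the node IDs (GFA/LastGraph input would need its own parser). The names `read_fasta` and `extract_nodes` are hypothetical. Records are read one at a time, and only the requested sequences are kept in memory, so the output can follow the order of the node ID list.

```python
def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA handle, one at a time."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after ">".
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def extract_nodes(node_ids, fasta_in, fasta_out, width=70):
    """Write the requested nodes to fasta_out in list order; return missing IDs."""
    wanted = set(node_ids)
    found = {rid: seq for rid, seq in read_fasta(fasta_in) if rid in wanted}
    for node_id in node_ids:
        seq = found.get(node_id)
        if seq is None:
            continue  # reported to the caller via the return value
        fasta_out.write(f">{node_id}\n")
        for i in range(0, len(seq), width):
            fasta_out.write(seq[i:i + width] + "\n")
    return [n for n in node_ids if n not in found]
```

Returning the missing IDs lets the caller decide whether an unknown node is an error or just a warning.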

Copied from original issue: fedarko/MetagenomeScope#181

Add more "concrete" navigation

From @fedarko on August 8, 2016 16:49

Todd suggested I include:

  • A navigation box (like in the bottom corner of Google Maps) to help users avoid getting lost in really large graphs. An extension for this exists already here.
  • More clearly defined panning/zooming functionality. An extension for this exists already here.

So I should definitely add these extensions and see how they work for graphs, adjusting accordingly.

Copied from original issue: fedarko/MetagenomeScope#42

Don't perform layout on connected components containing just one node

From @fedarko on August 2, 2017 3:15

The repeated costs of running sfdp (in SPQR mode) and dot (in standard mode) on cc's of single nodes aren't that bad when the number of those components is less than around 1000, but this quickly gets out of hand when there are literally tens of thousands of those components (for the one graph I'm working with now there's over 100k "single-node" components).

Solution: for a component containing a single node of width w inches and height h inches, with no edges and no node groups, just set the bounding box of the resulting connected component (in SPQR and standard mode) to (w + some padding, h + some padding). Look at some current ways Graphviz lays this stuff out for guidance.
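
Sketched in Python, the shortcut could look something like this. The function names and the default padding value are placeholders, not anything Graphviz itself uses.

```python
def is_trivial_component(num_nodes, num_edges, num_groups):
    """True if a component is a lone node with no edges and no node groups."""
    return num_nodes == 1 and num_edges == 0 and num_groups == 0


def trivial_bounding_box(width, height, padding=0.5):
    """Bounding box (in inches) for a lone node, without calling a layout tool.

    The padding default is an arbitrary placeholder value.
    """
    if width <= 0 or height <= 0:
        raise ValueError("node dimensions must be positive")
    return (width + padding, height + padding)
```

The layout code would check `is_trivial_component()` first and only fall back to dot or sfdp when it returns False.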

This has the potential to save a lot of time (by preventing lots of repeated calls to pygraphviz).

Copied from original issue: fedarko/MetagenomeScope#252

Prevent page state from being saved/cached/etc

From @fedarko on August 9, 2016 2:17

One of the stipulations of using sql.js is that we have to close the current database file once we're done with it; otherwise, not-so-good things happen, presumably. I have this implemented now, in that closeDB() (our wrapper to the .close() function that has a check for null) is called:

  • when the user exits the page (via jQuery's beforeunload event)
  • before the user loads a new .db file

These cases should cover things, but while testing in Chromium, I noticed weird behavior. A user could load a .db file, draw a component, etc. And then when the user hit the 'back' button, followed by the 'forward' button (to bring the user back to the AsmViz Viewer), the .db file would still be loaded: the graph-view-specific buttons (search, fit, collapse) were all still disabled (expected behavior), but the .db-specific items (component spinner, draw button, asm info button) were all still enabled. What's more, the .db file was still clearly loaded into memory: using the draw button worked fine.

I mean, on the surface this is actually a feature, but I'm still wary of it. I think it would be best to ensure the .db file is actually closed upon pressing 'back' and forcing the user to reload it, so we know what's going on (and aren't subject to the whims of whatever behavior a browser decides is cool). Granted, I really don't know how to go about fixing this right now and I've got way more important stuff to focus on, but this is still something worth looking into, IMO.

Copied from original issue: fedarko/MetagenomeScope#55

Support conditional generation of different graph view types

From @fedarko on June 4, 2017 23:30

Similarly to how -nodna allows the user to not pass sequence data to the .db file generated by collate.py, we should support features where the user can pick and choose which graph types to generate layouts for. Right now there are just two choices (double graph with variants detected vs. single graph with SPQR tree integration), but I imagine if/when we add the functionality of just normal single graphs without SPQR tree integration (a la #10) we should probably let the user choose which types to include in the .db file.

Also worth noting that GMLs don't really have a distinct single graph. Not sure if we should still lay out the undirected version of the graph, or just block that feature for all files with already-oriented contigs. Something to think about.

Copied from original issue: fedarko/MetagenomeScope#198

Support display of multiple AGP files simultaneously

From @fedarko on April 25, 2017 20:03

After taking care of #183, this should be a nice feature to add.

If we do #180 before this -- perhaps we could colorize different AGP files' scaffolds differently? I guess the question arises of how to handle the same node being in two scaffolds from the different AGP files -- in that case, we could either color the node a neutral color or color it as a pie chart.

Or we could just enforce that only one scaffold can be displayed at once (from either of the loaded AGP files), and just have the "multiple AGP files" thing be independent of the way scaffolds are highlighted. We could even programmatically create multiple scaffoldCycler <p> things for every AGP file loaded, and add an "x" button or something that'd remove a given scaffoldCycler and unload its corresponding AGP file. That'd... be really cool, actually.

Copied from original issue: fedarko/MetagenomeScope#185

Make the app available as a desktop application

From @fedarko on July 22, 2016 18:07

Using cy.json(), node.js, and node-sqlite3 would allow assembly graphs to be viewed in Cytoscape.js or (with less support for certain Cytoscape.js-specific features -- see this) the desktop version of Cytoscape.

The upside of this is that we get the best of both worlds: the ability to view the graph in a browser (and, theoretically, host it on a website) and the ability to view the graph in desktop Cytoscape (which would make viewing huge graphs faster).

Note that I have pretty much no experience with desktop Cytoscape, and we'll probably have to tweak things to get this to work. (I think compound nodes, in particular, are only somewhat supported in desktop Cytoscape -- so I'll have to look into that.)

Copied from original issue: fedarko/MetagenomeScope#12

Misc. viewer code optimizations

From @fedarko on December 28, 2016 23:22

There are a number of these things sprinkled throughout the xdot2cy.js code -- some have, e.g., loops doing roughly 2n iterations that could be brought down to n fairly simply.

I doubt that this is causing a significant bottleneck, but it'd be nice to sweep through the code and take care of this stuff at some point.

Most of these areas can be found by just searching for "TODO" (sans quotes).

Copied from original issue: fedarko/MetagenomeScope#115

Make node group precedence configurable from command line

From @fedarko on September 12, 2016 7:47

e.g. a "-pp" argument (for "pattern precedence" -- or a better acronym, I'm all ears) where users could pass in strings like BRCY or YCBR (where the capital letters match with their AsmViz abbreviation -- B is bubble, Y is cycle, C is chain, R is frayed rope).

This is fairly low-priority (esp. when compared to some of the other features/etc here), but it would still be a nice thing to add if users want to emphasize certain patterns (e.g. the user cares a lot about finding chains for some reason, so they want to find those before caring about finding cyclic chains, bubbles, ropes, etc).
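
A small sketch of how such an argument could be validated, assuming the string must use each of the four letters exactly once. `parse_precedence` is a hypothetical helper name.

```python
# Abbreviations as described above: B bubble, Y cycle, C chain, R frayed rope.
PATTERN_NAMES = {"B": "bubble", "Y": "cycle", "C": "chain", "R": "frayed rope"}


def parse_precedence(arg):
    """Turn a string like "BRCY" into an ordered list of pattern names."""
    # Assumption: every pattern letter must appear exactly once.
    if sorted(arg) != sorted(PATTERN_NAMES):
        raise ValueError(
            f"expected each of {''.join(sorted(PATTERN_NAMES))} exactly once, "
            f"got {arg!r}"
        )
    return [PATTERN_NAMES[letter] for letter in arg]
```

The pattern-detection code would then try patterns in the returned order, so "CBRY" would look for chains first.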

Copied from original issue: fedarko/MetagenomeScope#87

Support drawing a histogram for node / edge metadata

From @fedarko on September 27, 2016 15:54

This could be cool.

If this is going to be just one chart for the entire assembly, regardless of whatever the user does in the viewer, then I figure we could just generate the chart in collate.py and store the resulting image in the .db file. However, if we want the user to be able to actually modify the chart dynamically (e.g. to only view information for selected nodes' lengths) or if we want to generate many possible charts (e.g. a histogram of edge multiplicities), then it would make sense to generate the chart in the browser using something like d3.js.

I don't want to use too many dependencies in the project, but d3 shouldn't be too bad to use here.

Update: Yeah, we should extend this idea to support arbitrary node / edge fields (see #243 -- so not just lengths but also coverages, GC contents, other user-specified things...) It will be a bit challenging to pick good histogram bounds "on the fly", but there are probably libraries that can do this well.
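
One simple way to pick bounds on the fly is Sturges' rule (bin count = ceil(log2(n)) + 1) with equal-width bins between the minimum and maximum. The sketch below uses only the standard library. It is one possible choice, not necessarily the best one for skewed data like contig lengths, and the name `histogram` is just illustrative.

```python
import math


def histogram(values, num_bins=None):
    """Return a list of (bin_start, bin_end, count) tuples for numeric values."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if num_bins is None:
        # Sturges' rule for the number of bins.
        num_bins = math.ceil(math.log2(len(values))) + 1
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single degenerate bin.
        return [(lo, hi, len(values))]
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]
```

The same function works for any numeric field (lengths, coverages, GC contents), which fits the "arbitrary node / edge fields" idea.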

Copied from original issue: fedarko/MetagenomeScope#99

Add horizontal scrolling through the graph

From @fedarko on July 21, 2016 3:58

Via a scrollbar at the bottom of the screen. There are some good jQuery options for this; I don't particularly want to add jQuery to this project just for one feature, but if I can't find any suitable alternatives it should be alright.

We might want to consider having a vertical scrollbar, too, for certain connected components (e.g. RF_oriented_1.xdot, at least right now) that can't be visualized as a linear or even semi-linear path through the graph. Also, we should probably take graph orientation into account for this, since we'll obviously have more of a need for vertical scrolling for down-up/up-down rotated graphs. And I'm not sure if having both a vertical and horizontal scrollbar is the best idea—how much overhead would it have when added on top of Cytoscape.js? Arguably the major bottleneck in interacting with large graphs in Cytoscape.js is the delay inherent to interacting with nodes/edges/node groups: collapsing an individual cluster in a large graph might take 5 seconds, 3-4 of which are just Cytoscape.js waiting to register a "cxttap" event with that cluster's element.

This seems like a fairly minor feature (that will probably be nontrivial to implement), so I'll probably work on getting some more critical things done first.

Copied from original issue: fedarko/MetagenomeScope#5

Add specified collapsing (all bubbles, all chains, etc.)

From @fedarko on February 24, 2017 22:50

I don't think this should be too difficult to implement. We'd just select all nodes of type node.[R/B/C/Y] and collapse accordingly -- and if no such node groups are in the currently drawn component then we can just disable the corresponding button, I guess. Handling the button stuff might take some time.

Copied from original issue: fedarko/MetagenomeScope#155

Support discarding redundant connected components for pre-oriented GFA/LastGraph/etc. files

From @fedarko on July 22, 2016 17:15

This would make understanding really large graphs a bit easier, but for close analysis using a double graph is probably better. However, having this option available (I guess we'd figure it out in the Python script) would make things easier.

Maybe we could have the Python script automatically lay out a "single graph," and then store both the RC and non-RC information in the hypothetical database file generated? This would allow the JavaScript UI to overlay RC nodes on top of non-RC nodes, if the user requests a double graph. I think this is similar to what Bandage does.

Note that this is only really feasible for contig assembly inputs (e.g. LastGraph files), not for scaffold inputs (e.g. GML files from BAMBUS).

Copied from original issue: fedarko/MetagenomeScope#10

Add option to render multiple connected components at once

From @fedarko on September 27, 2016 15:39

I guess this would be done in the viewer. We could do this by taking the bounding boxes for each component and "concatenating" them, with some margins in between. (For some reason, laying out components individually seems to be faster than laying out the entire graph at once—so we'd still generate layouts on a component-by-component basis, but just concatenate the resulting layouts to produce something basically equivalent.)
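
As a sketch of the "concatenation" idea, assuming each component's layout is summarized by a (width, height) bounding box and components are simply placed left to right. The function name and the margin default are placeholders.

```python
def concatenate_layouts(bounding_boxes, margin=10.0):
    """Place component bounding boxes left to right with a margin in between.

    Returns (x_offsets, (total_width, total_height)). Each component's node
    coordinates would be shifted right by its x offset.
    """
    offsets = []
    x = 0.0
    max_height = 0.0
    for width, height in bounding_boxes:
        offsets.append(x)
        x += width + margin
        max_height = max(max_height, height)
    # Drop the trailing margin added after the last component.
    total_width = x - margin if bounding_boxes else 0.0
    return offsets, (total_width, max_height)
```

A real implementation might wrap components into rows instead of one long strip, but the offset bookkeeping would be the same.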

This probably shouldn't be the default behavior (will wreck viewer performance for huge graphs that have sizable non-largest components) but having it as an option would be useful in some cases (e.g. graphs composed of a lot of tiny component fragments).

Copied from original issue: fedarko/MetagenomeScope#93

Make GC colorization more extreme

From @fedarko on June 7, 2017 1:49

  • Consider using ranges of ~5%?
  • Consider coloring nodes by how many standard deviations their GC content is from the mean? (if we do this, we should calculate the standard deviation of GC content in collate.py and store it in the database's assembly table)
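
A sketch of the standard-deviation idea using Python's statistics module. It assumes an unweighted population standard deviation over per-node GC fractions (weighting by node length would be another option), and the function names are illustrative.

```python
import statistics


def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_z_scores(gc_by_node):
    """Map each node ID to how many standard deviations its GC is from the mean."""
    values = list(gc_by_node.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # assumption: population std. deviation
    if stdev == 0:
        # Every node has the same GC content, so nothing deviates.
        return {node: 0.0 for node in gc_by_node}
    return {node: (gc - mean) / stdev for node, gc in gc_by_node.items()}
```

The viewer could then map, say, z-scores in [-2, 2] onto the color gradient and clamp anything outside that range.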

Copied from original issue: fedarko/MetagenomeScope#207

Support user-defined pattern definitions

From @fedarko on August 19, 2016 19:40

e.g. a list of nodes that are in various types of pattern groups, along with the color of that group.

Right now we only support four types of patterns, but it should be possible to take a file like this as input and reconcile it with the .db file in collate.py.

Copied from original issue: fedarko/MetagenomeScope#78

exportPath() from finishing not opening in some cases?

From @fedarko on June 4, 2017 4:56

It works fine for me on Chromium, but people have experienced issues with the page opening up and then quickly disappearing. Doesn't seem to be caused by popup blockers.

From testing, we know that this JSFiddle (https://jsfiddle.net/bL6vg8vv/) exhibits this problem for at least one user -- therefore, the issue seems to be localized to that one invocation of window.open(). I need to determine how to offset this (consider using a different target than _blank? Maybe _tab or something?).

Copied from original issue: fedarko/MetagenomeScope#196

Notes on graph/connected component/file size limits

From @fedarko on January 10, 2017 2:22

For GraphViz (dot)

I'm not sure here. Per Emden Gansner, there isn't any set upper limit on graph size, although lots of nodes/edges can cause "an explosion in the problem size" due to how dot handles edges crossing multiple levels.

Gansner recommends a general upper limit of 1000 nodes per drawing (see the second link in the previous paragraph), which is not really practical for our purposes at all -- however, we know from experience with the Shakya dataset that laying out larger components isn't that bad (the first component of Shakya has around 6k nodes and 9.8k edges, and takes around five minutes to lay out using dot). Furthermore, since the conceit of AsmViz involves precomputing the layouts, I don't think worrying about efficiency at this part of the program is a super significant issue.

For huge, huge connected components, we have a few options (and almost certainly more that I'm not aware of):

  1. Try using desktop Cytoscape's hierarchical graph layout option -- not sure how fast it is in comparison to dot, but it's worth a shot. If it's much faster than dot, then we can just refer users to desktop Cytoscape from collate (???).
  2. Use the nslimit/nslimit1 parameters to bound the upper limit of iterations in dot at the cost of worse graph drawing. I really like this option -- we could even have this be up to the user of collate, so that they could personally make the tradeoff of drawing quality vs. time/memory used.
  3. Force the user to split up the graph at certain places within components (???)

Worth noting that one of the datasets Jay sent (it's in my testgraphs/ directory as 20170226_oriented.gml) has a largest connected component of around 19,000 nodes. Trying to lay out the .gv file generated for this component (which, due to backfilling, has 15,569 nodes and 31,322 edges) just crashes dot with a segfault -- running the layout operation from within PyGraphviz also results in a crash, naturally. So the upper limit in connected component size is maybe somewhere around 15k nodes as a most optimistic guess? (The massive number of edges probably also factors into this.) At this size of connected component we can focus on SPQR decompositions/desktop Cytoscape/etc.

For sqlite3 .db files

Assuming we go the "no DNA" approach, we can assume that most large graphs will generally approach having their main "size"-causing factors be node count and edge count (this relies upon the assumption that the number of connected components and clusters, as related to the number of nodes/edges, will be somewhat uniform).

I've read that, at least in one environment, the upper limit of .db files in sql.js is around 120M.

Since the size of the shakya.db file is about 4.6M on my system (and would be, likely, somewhere around 4.8M if it included edge multiplicity and node G/C content data), we can do some (very approximate) math using this + the 120M figure as a basis to estimate the maximum total number of nodes and edges:

[Image: imag4418, a photo of the calculation]
(my phone's camera is not that great, sorry)

So we can say that, only considering the .db side of things, the maximum number of nodes is somewhere around 500,000.
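
For reference, this kind of estimate can be sketched as a simple linear scaling, assuming .db size grows in proportion to the number of nodes and edges. This is an assumption about the method, not a transcription of the calculation in the image. With the figures above, the scale factor is 120 / 4.8 = 25. The function name is hypothetical and the element count in the usage line is a placeholder, not the real Shakya count.

```python
def estimate_max_elements(known_elements, known_db_size_mb, max_db_size_mb=120.0):
    """Linearly scale a known graph's element count up to the .db size limit.

    Assumes .db size is proportional to the number of nodes and edges.
    """
    if known_elements < 0 or known_db_size_mb <= 0:
        raise ValueError("need a non-negative count and a positive size")
    return round(known_elements * max_db_size_mb / known_db_size_mb)


# Placeholder input: a graph with 1,000 elements stored in a 4.8M .db file.
print(estimate_max_elements(1000, 4.8))
```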

For Cytoscape.js' renderer

I know it's somewhere on the order of tens of thousands of nodes, but I'd have to check. (TODO: add here)

Copied from original issue: fedarko/MetagenomeScope#134

Fix edge hiding issues with collapsing

From @fedarko on March 4, 2017 5:21

List of issues:

  1. An edge is removed, and then its cluster is collapsed and then uncollapsed.
    Effect: the cluster tries to restore the edge even though it's been
    removed.
    Already fixed in my code (via modifications to uncollapseCluster()).

  2. A cluster is collapsed, and an "exterior" edge incident on that cluster is
    restored via institution of a lower threshold than before.
    Effect: the edge is created using an invalid source and/or target,
    resulting in an error.
    Kind of almost fixed in my code (via modifications to the restoration code
    in cullEdges()). Need to take care of rerouting edges carefully -- it might
    be simplest to just collapse all edges (even the "removed" ones) in
    collapseCluster(), although I'm not sure how to go about that (perhaps by
    analyzing REMOVED_EDGES w/r/t any edges that are incident upon node(s)
    within the cluster).

  3. A cluster is collapsed, and edges are hidden below a certain threshold.
    Effect: If the cluster contains "interior" edges that have weights beneath
    the threshold, they will not be removed, even upon uncollapsing the cluster.
    We could rectify this either in uncollapseCluster() (check if a threshold
    has been imposed) or in cullEdges() (search through collapsed nodes for
    edges to cull). The latter approach would probably be easiest.

  4. A lot of edges are hidden at once (potentially thousands).
    Effect: takes a lot of time, can cause a "not responsive" browser popup
    Solution: link the process of hiding edges to progress bar and occasionally yield to that

  5. Some of our demo assemblies don't have bundle size info included
    Solution: don't hide edges that have a null/undefined bundle size
    Also solution: ensure bsize is in all demo assemblies

Copied from original issue: fedarko/MetagenomeScope#161
