cucapra / pollen Goto Github PK

generating hardware accelerators for pangenomic graph queries

License: MIT License

Python 53.51% Makefile 1.47% Dockerfile 0.52% Rust 44.50%

pollen's Introduction

Accelerated Pangenome Graph Queries

Pollen is a nascent project to accelerate queries on pangenomic graphs. We are designing a graph-manipulating DSL that exposes functionality that pangenomicists care about. Our DSL will support graph queries in the vein of the odgi project. We will compile programs written in this DSL into fast query code. Eventually, we aim to generate custom hardware accelerators for these queries via the Calyx compiler.

There are several things in this repository:

mygfa, a simple Python library for parsing, processing, and emitting GFA files. See its documentation.
slow_odgi, a reference implementation of several GFA queries from the odgi tool using mygfa.
FlatGFA, an experimental fast binary format for representing and analyzing GFA files. There are also Python bindings for this library; check out their documentation.
A proof-of-concept Calyx-based hardware accelerator generator for a single GFA query (odgi depth) and a data generator for this hardware.

`mygfa` and `slow_odgi`

The mygfa library is an extremely simple Python library for representing (and parsing and emitting) GFA files. It emphasizes clarify over efficiency. Use pip install mygfa to get started, and read the API documentation for details.

Similarly, slow_odgi is a set of GFA analyses based on mygfa; it's meant to act as a reference implementation of the much faster functionality in odgi. Check out the slow_odgi README for more details.

To set up both of them from this repository, try using uv:

$ uv venv
$ uv pip install -r requirements.txt
$ source .venv/bin/activate

Now type slow_odgi --help to see if everything's working.

FlatGFA

FlatGFA is an efficient representation for GFA files. It is implemented in Rust and available with Python bindings. The latter is on PyPI, so you can get started with:

$ pip install flatgfa

Then read the API documentation to see what's available. Or see the included example for a synopsis.

Credits

This is a project of the Capra lab at Cornell. The license is MIT.

pollen's People

Contributors

Stargazers

Watchers

Forkers

mfkiwl

pollen's Issues

`slow-odgi 0.n`

This is a parking lot for slow-odgi things that have gotten some triage but are not going to get done in this current push.

Possibly verbose behavior in our .data files

In our .data JSON files, we create a path_ids{i} entry and a paths_to_consider{i} entry for every node in the graph. This makes sense to me.* However, we also create these for nodes not in the graph. For instance, the graph note5.gfa has four nodes, but we create path_ids{1} through to path_ids{16} because 16 is a magic-number constant. Just wondering if this is a slip-up.

@susan-garry please let me know, and I'll make the fixes if appropriate!

*sidebar: my understanding is that the paths_to_consider entry will be identical for all the nodes, but we want this duplication for performance reasons.

slow-odgi: Extremely basic benchmarking

Make it so that all the odgi commands build against .og files and all slow-odgi commands against .gfa files. Mostly true already, just patch.
Time the total time for all of odgi's stuff to run, versus the total time for all of slow-odgi's stuff to run. Of course a better treatment would be to go higher-res, breaking it down by command and by test file. I'll kick that can down the road until milestone 0.2.
Pipe the output to /dev/null; this gives speedup and sooorta cuts out the artificial slowdown created by printing. A better treatment of this is coming in milestone 0.2.

Reorganize json-gen and slow-odgi into two packages

The json-generation code should stand alone in its own package.
It should just borrow mygfa (and maybe preprocess) from slow-odgi, or, better yet, mygfa (+preprocess) should be lifted above these two separate packages.
Separate the testing infrastructure
Separate the argparse-ers

Originally posted by @anshumanmohan in #25 (comment)

Benefit of `Emit` Statements

As I understand it, the main theoretical benefits of emit statements is that they should make it easier to parallelize a program by ensuring that each element in a set is independent of the others (or at least making it easier to run an analysis to determine the independence of these elements). But using these special sets, for all their guarantees, is not sufficient to say that two elements in the set are independent in that the order of computation does not matter. Here is a very simple example in which elements are not technically independent of each other, and where their computation cannot be parallelized:

i: int = 0;
s: ParSet<int> = new ParSet();
while i < 10 {
    emit i to s;
    i = i + 2;
}

Here, because each element is computed from the value of the previous element, they are not independent. Even though emit statements guarantee that we cannot access value already in s, we can still keep track of these values using state variables such that the value of one depends on the values of one or more of the previously computed element(s).

Here is an example of a silly but syntactically valid pangenomic graph transformation:

s: ParSet<Segment> = new ParSet();
prev_seq : Strand = new Strand();
for node in nodes {
    seq' : Strand = node.sequence + prev_seq;
    prev_seq = node.sequence;
    emit { node with sequence = seq' }
}

By using prev_seq as a state variable that gives us information about the other nodes in the graph, the value of each node in s depends on the value of the previous node added to s. We may not want this to be a valid pollen program, but it is syntactically valid, and it's not clear to me how we would differentiate this as an invalid program. Perhaps we can check if a body that contains an emit statement is modifying any variables defined outside of the body to ensure that there aren't any variables being used to keep track of state. This way, analyses that use subset-paths can still be parallelized, and perhaps we can implement a pass that tells a user if their computation can or can't be parallelized and why or why not. But this does limit our ability to automatically parallelize programs (for example, it's not obvious to me how we could straightforwardly parallelize while loops). Has anyone already thought about this snag, and is there any existing literature on parallelizing programs that are written in a sequential manner?

Most foolproof way to install odgi?

When doing up #126 I was reminded that the installations we provide for installing odgi are wanting. See this. We sorta point them to too many places: bioconda, installing from source (which has a sorta vague pointer to path-editing), and we finally walk through a hacky middle-ground option.

What do people actually use? I use a built-from-source copy, and I've done the path-modification thing to get Python bindings to work:

export PYTHONPATH=/scratch/anshuman/odgi-py/lib/python3.9/site-packages/
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

It's not that bad. Does any of us do differently? Thoughts? These instructions will now live in the bottom of the file and will be intended for the most serious users, those who want not just to use Pollen but to develop and extend it. I think we can give them one straight-line route, even if it's a little involved.

`calyx_depth.py`: divest from odgi's Python bindings

The algorithm calls parse_data.py, which in turn calls odgi's Python bindings, to learn the input graph's dimensions. Instead, use mygfa.preprocess:

pollen/mygfa/mygfa/preprocess.py

Lines 64 to 69 in de4f8db

    
           def get_maxes(graph: mygfa.Graph) -> Tuple[int, int, int]: 
        
               """Return the maximum number of nodes, steps, and paths in the graph.""" 
        
               max_nodes = len(graph.segments) 
        
               max_steps = max([len(steps) for steps in node_steps(graph).values()]) 
        
               max_paths = len(graph.paths) 
        
               return max_nodes, max_steps, max_paths

This will free us from odgi's Python bindings.

This work has been broken down as such:

For context, see tracking issue #105

`Emit` Statement Ambiguity

Currently, the emit statement as we've been using it is ambiguous. emit e adds e to a special set whose interface is restricted such that we can safely compute its elements independently (in theory - but that's a separate issue). However, it's not clear what set we are adding e to. If we want to simply output one set of values then this syntax is unambiguous because there is only one possible set that we could emit e to, but if we want to emit multiple sets or output entirely new graphs, things get a little complicated. Each graph consists of multiple sets to begin with, so we want a way to alter, e.g., node1.steps and node2.steps, we need a way to emit to both of these sets. We also want a way to distinguish between emitting, say, a node to an output set vs. emitting a node to a new graph.

I think the easiest way to deal with this ambiguity is to require the user to specify which set they are emitting a value to, such as with emit-to statements: emit e to s. This means users must also be able to initialize special sets like s. Because these aren't ordinary sets and we may want to support ordinary sets further down the line, I think we should probably give these sets a unique name and type, such as ParSet<> or PolSet<> .

There is still the issue of differentiating between when a set is being output directly to the user or modifying an existing graph. I suspect that the easiest way to do this will be to define a specific nomenclature for ParSets that modify an existing graph, e.g. node_output is the set of nodes in our new graph and if a user writes emit e to node_output, we know that we will be emitting a new graph. This nomenclature could be extended to allow a user to output multiple modified graphs by simply indexing these names, so node_output1 and node_output2 would each modify the original graph and produce a new one.

I'm admittedly not a huge fan of defining a special nomenclature since these variable names aren't necessarily intuitive to everyone and it would be better if we could allow users to define their own variable names in as many instances as possible, but the main point of this issue is that emit statements are ambiguous and too restrictive to support everything that we want Pollen to. We could support emit statements in addition to emit-to statements, but I think this just creates further complications and it makes the most sense to me to start with emit-to statements and add emit statements later on if we decide it would be beneficial.

Issues in README

When trying to go with the python3 -m slow_odgi route (i.e. without installing an executable) there is an issue because we quietly need mygfa.

Some `slow-odgi` tests failing

Probably minor, but something bad has slipped through. It should not be too hard to dive in and check where things went awry. It happened in the last couple days for sure.

Originally posted by @anshumanmohan in #118 (comment)

Update installation instructions

Because of changes in the Calyx docs, our own installation instructions are becoming harder and harder to follow. For instance, we point users to the "first and third" steps of the Calyx installation instructions, but the first step now itself has three options (docker, crate, source).

I'm not quite sure what a permanent fix would be, but let's at least patch it some. This can be fixed in the same spirit, if not the same PR, as the fix for #98.

First: Crate good?
@susan-garry, do you know if the crate version of Calyx is good enough for our purposes? That would give us a certain barrier of stability I suppose, since we'd be asking people to go through the crate, not having them pull the bleeding-edge main branch every time. I don't know if that's actually a fair assessment; maybe the Calyx crate is deployed automatically and often.

Second: This much madness is too much sorrow.

I wonder if we should just inline our own installation instructions for Calyx. I don't love that we currently have users jumping to the Calyx docs, running/skipping stuff there, coming back to us, running more fud config stuff, and then getting going. If we inline Calyx installation instructions that we know will work for Pollen's purposes, that will introduce a maintenance task.
Regardless of what we do with the inlined/foreign instructions, I'd love to finishw with a fud check invocation that passes 100%, with no warnings for the purposes of Pollen. We have recently learned the format for doing that.

Third: Docker all the way?
Perhaps somewhat at odds with all of the above, but this is all making me wonder if we shouldn't just make Docker the default way we do business. Our Dockerfile currently starts with Calyx's Docker image, installs odgi from source, and installs Pollen.

We could catch it up to install mygfa, slow-odgi, and pollen-data-gen, and then legitimately do all our development in Docker.
If we wanted, we could use Adrian's Depot trick to keep the Docker current. Perhaps overkill for a small project like ours.
When it comes to coding day to day, VS Code has support for "attaching" to a Docker container, which is the same as SSH-ing into a remote machine like Havarti.
Sorry if I'm forgetting something that we've already resolved, but does this not work for some reason? Is a Docker too underpowered? Does the Calyx Docker lack an important simulation tool that Havarti has? Paging @rachitnigam on this one.
Suppose the above is impossible. Is it possible that Docker fails for development but satisfies a large chunk of users? In that case we can have grungy installation instructions internally but the official instructions can be two lines: docker pull, docker run.

`calyx_depth.py`: use the eDSL, stop using odgi's Python bindings

Currently, calyx_depth.py generates Calyx code

Without the use of the Calyx eDSL (aka the builder library).
With the use of the odgi Python bindings (it's not obvious to see, but it uses parse_data which in turn uses the Python bindings).

I propose to change both of these.

Using the eDSL will lighten the code significantly.
The odgi Python bindings have proved hard to use. Besides, the current code uses the bindings only to infer the dimensions of the graph, and we have in-house ways of doing that now.

These relatively minor gains aside, the true win is that, after this, we will truly be using odgi only as a source of inspiration and oracle-testing, not as part of our workflow.

We currently do something like:

graph.gfa --odgi build--> graph.og --parse_data.py--> graph.data
calyx_depth.py (with one call to parse_data.py to learn the graph's dimensions) --> calyx_depth.futil
calyx_depth.futil tangoes with graph.data in the usual Calyx way.

My proposal is something more like:

graph.gfa --pollen_data_gen--> graph.data
calyx_depth.py (with one call to mygfa to learn the graph's dimensions, and with calls to the Calyx eDSL for lightening) --> calyx_depth.futil
calyx_depth.futil tangoes with graph.data in the usual Calyx way.

The first piece of this is already there: I already generate the same graph.data using pollen_data_gen as one would have using the usual route. The third piece should work if the first two work.

The second piece decomposes into two unrelated parts:

#107
#106

`slow-odgi` documentation

We have learned a lot about odgi via slow-odgi, and it would be good to collect all the minimal examples, including weird (but legitimate) gotchas, in something like slow_odgi/README.md

json-gen: Adjustable widths

The exine json-generator adjusts the widths of fields as needed; I just stick to 4. Must fix.
This will allow me to pass in larger parameters for these three flags and therefore compute JSONs for the four larger graphs. At present we don't do anything reasonable with them: exine, our oracle, complains that the parameters need to be bigger, and we ignore this, so the expect files are empty.
Infer these parameters automatically and tightly, so I don't have to run the smaller graphs with parameters that the bigger graphs need.

Originally posted by @anshumanmohan in #25 (comment)

README is stale

The top-level Pollen README does not work as written. I think the issues are small, FWIW!

Is it literally just that flit install -s --user needs to be run not from pollen/ but from pollen/pollen-py/?

Originally posted by @anshumanmohan in #83 (comment)

Divesting from odgi's Python bindings: testing

The PR #116 does some work towards getting rid of odgi's Python bindings, but leaves testing for future work. This is an issue to track that.

Closing this issue will partially address #107.

Beyond `.og`: a zoo of data representations

Back at MemPan23, we heard the following refrain a few times:

Pangenomics is one of those fields that have been "solved" by the discovery of the right data structure. That data structure is the GFA.

For our work thus far, we have relied heavily on ODGI's .og representation. As we know, this representation is an opinionated departure from the GFA with an eye to parallelizability.

Zoo of Data Representations

This issue's job is to bookmark the fact that the .og representation is just one point in a large search space. We should explore this space with additional quick prototypes that, like ODGI, start with GFA, produce some binary representation, and can be round-tripped losslessly to GFA.

Consider, for instance, a representation that analyzes a GFA and preprocesses it into a "path-centric" view. I am leaving the details of this representation vague, but the point is that this new representation will throw no information away, but will somehow be especially amenable to users who wish to run path-centric operations (reads, such as the sequence traversed by a path; writes, such as the insertion of new paths) on the given graph. We'd produce a path-centric binary .paths, analogous to the .og binary that ODGI produces.

The hope is that, for example, inserting a path into graph.paths will be:

Faster/cleaner than inserting a path into graph.gfa (because we have done some preprocessing).
Faster/cleaner than inserting a path into graph.og (because we have not done too much preprocessing and are not bogged down by too much cruft).

The .paths representation may totally lose to .og when it comes to, say, a node-centric operation, and that's okay. In fact, the .paths representation may disallow a node-centric operation entirely. Some other representation, which is node-specific, would probably do a fine job against .og. We can explore a range of such data representations and see how we fare versus .og.

Using Our Zoo in a Principled Way

A follow-up challenge, then, is the principled creation of commands that run on these different representations.

An .og file has all the preprocessing in one place, so this task is relatively easy: just write one suite of commands that run on .og-represented graphs.

If we, for our purposes, preprocess a graph into a bunch of different representations, then we will need representation-specific version for each high-level command. For instance, running an "insert path" command will look dramatically different on a graph that has been preprocessed to be path-centric versus one that has been preprocessed to be node-centric.

The hope is that our DSL and compiler will save the day. It will take from the user:

The high-level intent.
The data representation that they want to use.

And then it will:

Automatically create the representation-specific command required.
Run that command.
Report the answer.

Examples of Pollen

It would amazing to have (pseudo-)Pollen implementations of the commands that we currently support in slow_odgi. I know Priya is on it; I just wanted to make an issue so we can track it and get excited!!

`pollen_data_gen`: Lift precompute

The precomputation done inside depth was written before mygfa (or slow_odgi) was a thing, but I suspect there is some duplication going on. Even if not, certainly some of that computation can be lifted into mygfa/precompute so that other commands can use it. Case in point, this lifting should happen before starting work on #96 , so that the expanded simple can use the computation that currently lives in depth.

See if there is duplication between pollen_data_gen depth and mygfa/preprocess. Cut down on any.
Lift computation from pollen_data_gen depth to mygfa/preprocess.

Currently unparseable parts of Pollen programs

This issue serves as a way to keep track of parts of Pollen programs (see #134) that are not currently parsed correctly. To remedy this, either the Pollen language or the Pollen parser should be changed.

Declaring a graph with graph my_graph
Declaring an output set for a graph with parset ...
For loops: for x in y
Initializing variables without a type annotation (i.e., my_bool: bool = false; parses, but my_bool = false; fails)
Conditional statements: both if and if ... else ...
Methods for data structures (i.e., push for Strands and length for List<...>s, etc.
Instantiating an empty Strand with Strand(), i.e., x: Strand = Strand();
Declaring an output set for a graph analysis with outset ... (or whatever word we end up using)

This list may not be complete, but it should be enough for us to get the ball rolling and we can update as we go!

Tracker: FlatGFA Python bindings

Here's just a snapshot of some next steps for the Python bindings.

Project infrastructure:

CI. #178 added some tests; they should run in GitHub actions. Maturin has a generate-ci command that should help with this. #180
Build docs. Just like we have auto-published Sphinx docs for mygfa, let's do the same for the FlatGFA module. #181
- Expand/improve docs: at least divide into a few top-level sections. #182
Publish on PyPI. I believe Maturin/Actions can help with this too.

API "surface area":

Mutation. Let's make it possible to build up a graph from scratch.
Expose the ops. Things that are currently CLI subcommands (in cmds.rs) should be put into a library of useful "ops" on the GFA data structure. This way, you can string together multiple operations in a Python invocation. In the limit, maybe we can reproduce odgi pipelines from our collaborators as Python scripts.

Python niceties:

The various objects should probably be made equatable/hashable. #183
Use str(obj) on paths, links, steps, etc. to produce GFA string fragments. #183
Names (of paths) should probably by str and not bytes? Seems fine to assume these are UTF-8. Nucleotide sequences can stay bytes. #183
Add typing hints. These would be especially helpful for the docs. #182

slow-odgi: tidying

In addition to #33, some other high-level things that should be done after merging #29 and #31:

Since beginning to write slow-odgi, I have discovered how to pipe odgi commands. Do that more; cut down on the number of temp files. Maybe this will let me avoid the temp storm (temp-est 🥁 🥁 ✨) I sometimes have, where temp files created when testing one command then become targets for the next command.
Test against more of the graphs that are already in the odgi repo. I presume the present selection that we fetch with make fetch have been chosen because they are of a reasonable size? @susan-garry?
The graphs that we fetch frequently fail to actually test odgi commands rigorously. I have in the back of my mind a bit of a zoo of small graphs that could be added, either to the odgi repo or elsewhere, that, when tested against, would run the commands through their paces.
Refactor! There is some known duplication across branches, and sometimes I have done the same thing in different ways in different commands.

New Pollen Type for DNA Bases

Currently, we represent sequences - strands of DNA base pairs - as a String (equivalently, a List of characters). However, since there are only 5 possible base pairs in odgi (ATCGN), we can represent each base using only 3 bits (as opposed to the 7/8 required to represent characters). Additionally, representing a sequence as a String allows users to insert arbitrary values into a sequence, potentially introducing subtle, difficult-to-detect bugs or corrupting the data. Instead of representing a DNA base as a character, we could represent it using a unique type. Then we can enforce the invariant that the only base pairs are ATCGN, improve our memory footprint, and possible speed up data transfer rates in certain computations (?).

Considering what this might look like, I did some back-of-the-napkin calculations and found that the chr8.pan.gfa graph stores 259525394 base pairs, which takes up 216 MB in memory if we represent characters using 7 bits. This might not sound like a lot, but I understand that the typical FPGA can only store a few megabytes at a time, so reducing the amount of memory used to 92 MB by using 3 bits to represent each base pair could reduce the time needed to compute crush for the entire graph by a factor of x2.3 (in a perfect world where the complexity of crush scales directly with the size of a sequence).

If we introduced a special Base type, the only difference to our model is that node sequences would no longer be strings:

type Segment = {
      sequence: List<Base>; //ACTNNNGAC, etc.
      edges: Set(Edge); //edges encode their direction + orientation
      steps: Set(Step); //steps that go through segment
    }

crush would become

for segment in Segments:
        seq = List<Base>(); // seq = String::new();
        in_n = false;
        for c in segment.sequence:
          if (c == 'N') {
            if (!in_n) {
              in_n = true;
              seq.append(c);  // seq.push(c);
            }
          } else {
              in_n = false;
              seq.append(c);  // seq.push(c);
          }
        emit { segment with sequence = seq};

We could even take this a step further and give List<Base> a special name, e.g., Strand. The way I conceptual this, Strand and List are are not interchangeable (i.e., not subtypes of each other). Because Base is effectively a subtype of Char, I imagine Strand as behaving like String (except the values of its characters are constrained). For example, we would still use seq.push(c) instead of seq.append(c), using the syntax for Strings rather than Lists. In this case, the node type would be as follows:

type Segment = {
      sequence: Strand; //ACTNNNGAC, etc.
      edges: Set(Edge); //edges encode their direction + orientation
      steps: Set(Step); //steps that go through segment
    }

crush:

for segment in Segments:
        seq = Strand(); // seq = String::new();
        in_n = false;
        for c in segment.sequence:
          if (c == 'N') {
            if (!in_n) {
              in_n = true;
              seq.push(c);
            }
          } else {
              in_n = false;
              seq.push(c);
          }
        emit { segment with sequence = seq};

Thoughts on the introduction of these new types? Do we prefer Base but not Strand, or Strand but different from how I described it? Neither?

Error when generating a node depth accelerator using `exine depth`

Hi, I'm trying to run some Pollen before our meeting. I basically followed the installation guide for Pollen, Calyx, ODGI, and got an error when generating a node accelerator here. I think the error is still something related to ODGI-python binding.

Do you have any idea on how to fix this?

Error message

(py39) ➜  ~/Other_Repos/pollen/test/depth git:(main) exine depth -a -r t.og
warning [libhandlegraph]: Serialized object does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
Traceback (most recent call last):
  File "/home/jl4257/.local/bin/exine", line 8, in <module>
    sys.exit(main())
  File "/home/jl4257/.local/lib/python3.9/site-packages/pollen/main.py", line 26, in main
    depth.run(args)
  File "/home/jl4257/.local/lib/python3.9/site-packages/pollen/depth/main.py", line 180, in run
    run_accel(args, tmp_dir_name)
  File "/home/jl4257/.local/lib/python3.9/site-packages/pollen/depth/main.py", line 104, in run_accel
    parse_data.run(args)
  File "/home/jl4257/.local/lib/python3.9/site-packages/pollen/depth/parse_data.py", line 281, in run
    max_nodes, max_steps, max_paths = get_dimensions(args)
  File "/home/jl4257/.local/lib/python3.9/site-packages/pollen/depth/parse_data.py", line 190, in get_dimensions
    max_nodes, max_steps, max_paths = get_maxes(filename)
  File "/home/jl4257/.local/lib/python3.9/site-packages/pollen/depth/parse_data.py", line 167, in get_maxes
    graph.load(filename)
RuntimeError: Error rewinding to load non-magic-prefixed SerializableHandleGraph

My current environment

python 3.9.16
import odgi is successfull after I followed the detailed installation here from bioconda.
Calyx installed. fud check shows the following messages.

global:
 ✔ /home/jl4257/Other_Repos/calyx exists.

stages.calyx.exec:
 ✔ ./target/debug/calyx installed.
   ./target/debug/calyx is a relative path and will not work from every directory.

stages.interpreter.exec:
 ✖ /home/jl4257/Other_Repos/calyx/target/debug/interp not installed.

stages.dahlia.exec:
 ✖ dahlia not installed.

stages.verilog.exec:
 ✖ verilator not installed.

stages.vcd.exec:
 ✖ vcdump not installed.

stages.jq.exec:
 ✖ jq not installed.

stages.synth-verilog.exec:
 ✖ vivado not installed.

stages.vivado-hls.exec:
 ✖ vivado_hls not installed.

stages.futil.exec:
 ✔ /home/jl4257/Other_Repos/calyx/target/debug/futil installed.

interpreter, dahlia, verilog, vcd, jq, synth-verilog, vivado-hls were not installed correctly.
Configuration instructions: https://docs.calyxir.org/fud/#configuration

`calyx_depth.py`: lighten notation by using Calyx eDSL

See tracking issue #105

`slow-odgi`: better benchmarking

Carve out benchmarking from testing, so that they can coexist.
Refactor code so that printing happens separately from the computing, then time just the computation.
Increase resolution: per-algorithm, per graph-per algorithm.
Work with Turnt folks to see if they want to make benchmarking easier.

Divesting from odgi's Python bindings: documentation

The PR #116 does some work towards getting rid of odgi's Python bindings, but leaves documentation for future work. This is an issue to track that.

Closing this issue will partially address #107.

slow-odgi: points of divergence and confusion

Match odgi flip behavior.
This could come from changes on our end, or by triggering further changes in odgi's code or spec.
Basically I think there is still a bug in odgi flip. Andrea made a fix for me, and this helped in one spot, but introduced an issue elsewhere. See here if curious.
Figure out odgi inject.
- Understand it.
  I can't recreate the stated behavior of odgi inject even for trivial requests. I have asked on Matrix; see here.
- Support it in slow-odgi.
Check with Andrea et al if odgi overlap is coded up correctly.
I can recreate the behavior of odgi overlap, but I think they mean to calculate something more subtle and in fact have a bug. I have asked on Matrix; see here.
Figure out what's going on in odgi groom. See this comment if you're curious.
- Make a minimal example.
- Ask about it on Matrix.

The need for `Set`

I think #57 (comment) raises an interesting question about how to generate steps for an output graph. We can force users to simply add steps to an ordinary steps': Set<Step>, and then just emit { node with steps = steps'} for each node in the graph. I don't know of any odgi functions which compute each element of steps' in parallel, so perhaps this solution is sufficient for most graph analyses. I think the alternative would be to allow users emit steps to step' and then also use step' as a first-class value, but as discussed today this is more of a stretch goal and I think we should stick with forcing steps' to be an ordinary Set of steps.

Of course, to fully support this, we probably want to extend pollen with support for creating and building Sets.

Originally posted by @susan-garry in #57 (comment)

Uniform test strategy: drop symlinks, use a long turnt file

See full discussion here: #79

`mygfa` doesn't parse optional CIGAR strings

There are a number of gfa features that mygfa doesn't account for yet, so I'm not sure how high of a priority this fix should be, but this issue is preventing mygfa from parsing odgi's generated gfa files.

Note that this is the gfa specification that I'm using as a reference: http://gfa-spec.github.io/GFA-spec/GFA1.html

Essentially, mygfa doesn't have the functionality to parse certain CIGAR strings, specifying the alignment of two segments (?). This particular issue shows up wherever an "alignment" string appears, for example in links:

L 1 + 2 + 0M

and paths:

P path1 1+,2+,2+ 0M, 0M

The last column of these lines represents a CIGAR string (or list of CIGAR strings). My understanding is that in either case, this string can be replaced with *:

L       1       +       2       +       *
P       path1   1+,2+,2+         *

Which indicates that the overlap is unspecified. According to the docs, if unspecified, "the CIGAR strings are determined by fetching the CIGAR string from the corresponding link records, or by performing a pairwise overlap alignment of the two sequences." I'm not yet sure what the latter is or how difficult it would be to accomplish, but this suggests that in order to support this, we may want to pre-process gfa files and sort the lines by type so that we parse Path lines after Link lines.

@anshumanmohan , based on your knowledge of overlap, does this sound doable? How much of a priority should this be?

Dangling file `process.py`?

At the root of Pollen we have process.py and I'm not sure if it's meant to be there. @susan-garry would you mind taking a quick look?

slow-odgi: Make a little top-level CLI

This is purely a quality-of-life thing, but it would be cool to make a top-level CLI entry point for the slow-odgi commands. In particular, what is currently python3 slow_odgi/flip.py could become slow-odgi flip, i.e., we'd make an installable executable called slow-odgi with several subcommands.

The concrete steps for this would be:

Add an __init__.py to the slow_odgi directory to make it into a Python package.
Adjust the imports in slow_odgi/*.py to use package-relative imports. For example, use from . import mygfa instead of import mygfa.
Somewhere (in __init__.py or a new __main__.py), make a new main function that dispatches to all of the various algorithms. argparse supports subcommands; we could also try Click instead.
Add metadata to make the slow_odgi package installable (with an executable). As you probably know, my favorite tool for this is Flit, but there are other options available.

Probably best to ignore this until after #29 and #31 are merged.

`slow-odgi`: CIGAR string overlaps

Overlaps receive shoddy treatment right now; I just replace them with *. A more thoughtful treatment would actually print it out during emit. This will likely expose various instances where odgi algorithms do something reasonable with overlaps and we just punt.

`odgi` source confusion

Or, whose odgi is it anyway

slow-odgi does not strictly need odgi if you just want to run/benchmark it. It does need odgi if you want to run the tests, and in that case you need the built-from-source odgi because the source code contains fixes that were made on our request and those fixes are not yet available on Bioconda.
exine needs odgi, and in particular it needs odgi's Python bindings. The most reliable way we know to do this is via Bioconda, which is why we suggest this route in the Pollen readme.

This works, but is a little gross. I just wanted to put this down somewhere so it's not just me who knows this dirty secret 😅

I'll poke around some to see if I can reliably solve this via one single installation of odgi.

Make test strategy uniform

I have learned a bunch about testing by stumbling through slow-odgi, and I'd love to pipe that around to other parts of the testing setup. See also #31 (comment)

Just making it an issue for whenever @susan-garry has a moment to chat about it!

Proofs about our commands

What if we could prove that Pollen DSL programs preserve the semantics of variation graphs? That is, some but not all odgi commands are "logical no-ops," in the sense that they may split up segments and rename things and stuff, but the set of nucleotide sequences they represent (i.e., odgi flatten would still produce the same flat sequences). Could we somehow prove that these programs are no-ops, so they're free of "bugs"?

Originally posted by @sampsyo in #84 (review)

I love this idea!

I'd love to be able to state and prove, for example:

forall g, 
valid g -> 
exists g', flip g g' /\ valid g' /\ meaning g == meaning g'

where

flip a b is a relation, in this case a function, that says that b is the flipped version of a. i.e, all the good stuff stated here.
meaning x is some beautiful encapsulation of the semantics of a graph x, rising above all the kludge of segment-splits and names.

slow_odgi: think about `depth.uniq` and other such variants that require `subset_paths`

Just a bookmark to think about this shortly. For now I have an incorrect treatment in place. I am doing something reasonable in terms of uniqueness, but not yet giving users the option of specifying a subset of paths to consider.

`pollen_data_gen`: go all the way to binary?

One possible agenda item for future synchronous discussions: shall we take this conversion one step farther and produce binary data (rather than just JSON) for fast software processing?

Originally posted by @sampsyo in #102 (review)

Depth instructions don't work as written

I suspect this is something minor, and I'd fix it myself, but I've gotten a little turned around by the way pollen-py is engineered (see also, #90).

I get the following error when I run the one-liner test:

$ exine depth -a -r test/k.og
[fud] ERROR: `.futil' corresponds to multiple stages: ['futil', 'calyx']. Please provide an explicit stage using --to or --from.

And I get the same error when I try the more verbose option:

make fetch
make test/k.og
exine depth -o depth.futil
exine depth -d test/k.og -o depth.data
exine depth -r depth.data --accelerator depth.futil

I am using the latest copy of Calyx and I have configured it as instructed in our README.

`odgi validate` crashing on large graphs

Running

make clean
make og
turnt --env validate_oracle test/*.og

gives me exit code 134 for the larger files, and, on running in verbose mode, a complaint about the utf-8 encoding.

Seems strange, since it works okay when I run

make clean
make fetch
turnt --env validate_oracle test/*.gfa

That is, letting odgi do an on-the-fly conversion from .gfa to .og works out okay.

Use separate outputs in Turnt test environments

I also mentioned this in #27 (review), but we can simplify our Turnt testing lives a little bit by making the different operators output different files. Here's what things currently look like:

pollen/test/turnt.toml

Lines 25 to 31 in 80bee84

    
           [envs.depth_oracle] 
        
           binary = true 
        
           command = "odgi depth -d --input={filename}" 
        
           [envs.depth_test] 
        
           binary = true 
        
           command = "python3 ../slow_odgi/node_depth.py < {filename}"

While we want both depth_oracle and depth_test to work with the same expected-output file, we would like them to use a different shared expected-output file from flip_oracle and flip_test (for example). Turnt's output configuration lets you tell it what extension to use for the saved stdout. For instance:

output.depth = "-"

…will make it use whatever.depth instead of whatever.out. This will be nice because it means we don't have to be careful to run all the different operators separately to avoid clobbering. (It may also be possible to make it use whatever.depth.out or similar if you prefer that aesthetically.)

This one should probably also wait until after #29 and #31 land.

`slow-odgi`: investigate why we disagree on `inject`

slow-odgi inject disagrees with odgi inject on graphs DRB1.gfa and chr6.gfa.

Redundancies in `pollen-py`?

Running the Black formatter on all our Python files caused me to take an unplanned tour of pollen_py. I'm just wondering if there has maybe been a mistake in committing files after a planned rename/restructure. The structure looks like

pollen
|
+- pollen/                   // here be all the new Rust stuff, great!!
+- pollen_py/
     |
     +- depth/
     +- pollen/
          |
          +- depth/

and I'm a bit confused. The two depth/ directories look the same, or at least very similar. Sorry if I'm missing something!

`slow-odgi`: `extract`

Just making a note that odgi extract is wicked cool and I'd like to implement it in slow-odgi sometime.

https://hackmd.io/@AndreaGuarracino/SyhbiKuE2#Lipoprotein-A-graph

odgi extract -i LPA.og -n 23 -c 1 -d 0 -o LPA.21_23_G_GT.og
The instruction extracts:

the node with ID 23 (-n 23),

the nodes reachable from this node following a single edge (-c 1) in the graph topology,

the edges connecting all the extracted nodes, and

the paths traversing all the extracted nodes.

Set up CI

It would be really great to someday set up CI using GitHub Actions. Ideally, this would let us run our entire test suite on every PR to make sure we don't break anything.

The one tricky thing about this will be getting odgi set up. We could use our existing Dockerfile for that—perhaps more reasonably—we could just run odgi from the official Docker image. That is, it may be as simple as using docker run pangenome/odgi in place of odgi, or whatever the Actions-endorsed equivalent is for bringing in a tool from a published Docker image.

`pollen_data_gen`: support for `subset_paths`

@susan-garry points out that pollen_data_gen does not currently support subset_paths, which gets in the way of it being a drop-in replacement for the existing ODGI-dependent code. See also, #129.

Susan explored the space, saying that we have two options:

Feed it a subset_paths files along with the GFA graph.
- Pro: Simpler
- Con: seems to violate the principle that pollen_data_gen should be precisely a representation of information in a GFA file.
Leave it as it is, i.e., make it just just a GFA. Then add an extra step to splice in the information about subset_paths.
- Pro: Keeps the layers separate.
- Con: More complex.

We have decided to go with the first option.

`pollen_data_gen`: expand `simple`

Presently:

pollen_data_gen depth graph.gfa matches exine depth -d graph.gfa -a graph.gfa with some care. The file generated is graph.data, which is what Pollen's depth-accelerator needs. graph.data contains precomputed stuff beyond the basic graph, but it also lacks some things fundamental to the graph. As a result, this is a lossy conversion: you cannot travel the roundtrip graph.gfa <-> graph.data.
pollen_data_gen simple graph.gfa is just something I dreamed up to put the graph into JSON. It has no connection with any accelerator's needs, nor with graph.data described above. However, I think it can be shown to be lossless.

Goals:

Leave pollen_data_gen depth as it is.
Change pollen_data_gen simple, expanding its precompute and reformatting its output, so that it becomes a superset of pollen_data_gen depth.
Show it to be lossless, i.e., capable of a round trip.
Show that it can be filtered in some way to arrive at pollen_data_gen depth.

`depth`: use `pollen_data_gen` in place of `parse_data`

When it comes to producing JSON files from input graphs, the pollen_data_gen library matches parse_data in most of its functionality. We should swap the former in for the latter, as this will further disentangle our workflow from odgi. See #105 for more about this general philosophy.

This is not just a drop-in replacement yet, however: the pollen_data_gen library does not have all the flags that parse_data does. The reason is not deep, but merely engineering. After that is done, we can do a drop-in replacement.

Originally posted by @sampsyo in #116 (comment)

slow-odgi: more thoughtful generation of segment names

It doesn't seem like segment names are guaranteed to be numbers in the GFA format. They are in all the examples we're familiar with, but it seems to me like the format defines them to be arbitrary strings (which is why we store them as strs and not ints). Therefore:

int(seg.name) is not guaranteed to succeed
str(int(seg.name) + 1) is not guaranteed to be unique

Longer term, we probably want to generate a "fresh" name by looking at all the existing names and generating one that is actually guaranteed to be unused. Or perhaps we want to use some kind if internal numeric ID space, which we then map back to strings. Anyway, plenty of options here!

Originally posted by @sampsyo in #51 (comment)

	def get_maxes(graph: mygfa.Graph) -> Tuple[int, int, int]:
	"""Return the maximum number of nodes, steps, and paths in the graph."""
	max_nodes = len(graph.segments)
	max_steps = max([len(steps) for steps in node_steps(graph).values()])
	max_paths = len(graph.paths)
	return max_nodes, max_steps, max_paths

	[envs.depth_oracle]
	binary = true
	command = "odgi depth -d --input={filename}"

	[envs.depth_test]
	binary = true
	command = "python3 ../slow_odgi/node_depth.py < {filename}"