Giter Site home page Giter Site logo

eth-sri / astarix Goto Github PK

View Code? Open in Web Editor NEW
72.0 11.0 7.0 13.43 MB

AStarix: Fast and Optimal Sequence-to-Graph Aligner

License: Mozilla Public License 2.0

Dockerfile 0.04% Makefile 0.86% C++ 60.96% CMake 0.44% Objective-C++ 0.03% C 0.41% Python 3.03% Jupyter Notebook 34.24%
graph-alignment edit-distance shortest-paths sequencing

astarix's Introduction

AStarix
SRILAB


License: MPL 2.0

AStarix is a sequence-to-graph aligner based on the A* shortest path algorithm. It finds alignments that are optimal according to edit-distance with non-negative weights. The motivation of AStarix is to speed up optimal alignment in the average case. A trie index allows to scale with reference size [1], while seed heuristic allows to scale with query length [2].

Publications

[1] Ivanov, Bichsel, Mustafa, Kahles, Rätsch, Vechev (RECOMB 2020) AStarix: Fast and Optimal Sequence-to-Graph Alignment (2020) -- Introduces AStarix tool, the trie index which allows scaling to bigger references, and Prefix heuristic which outperforms Dijkstra. Evaluations here.

[2] Ivanov, Bichsel, Vechev (RECOMB 2022) Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds -- Introduces Seed heuristic which outperforms other aproaches on Illumina reads and allows to scale to long HiFi reads.

Installation

Prerequisites of AStarix are:

  • argp – argument parsing library
  • pandas (optional) – dataframe library used for testing and evalutaions

To compile AStarix and run a test:

make
make test

Third-party libraries are located in the /ext directory and their own licenses apply. Tested on Ubuntu 20.04.

Example run

This is an example run of AStarix on a toy data: aligning 100 simulated Illumina reads to a linear graph from the first 10000bp of Escherichia coli. You can expect a similar summary printed to standard output:

$ make test

release/astarix align-optimal -a astar-seeds -t 1 -v 0 -g data/ecoli_head10000_linear/graph.gfa -q data/ecoli_head10000_linear/illumina.fq -o tmp/ecoli_head10000_linear/astar-seeds --fixed_trie_depth 1 --seeds_len 25

----
Assert() checks:       OFF
        verbose:        0
----
Loading reference graph... Added reverse complement... done in 0.001078s.
Loading queries... done in 0.000216s.
Contructing trie... done in 0.007017s.
Initializing A* heuristic... done in 1.6e-05s.

 == General parameters and optimizations ==
             Alignment algo: astar-seeds
                 Edit costs: 0, 1, 5, 5 (match, subst, ins, del)
              Greedy match?: true
                    Threads: 1

 == A* parameters ==
          seed length: 25 bp
     skip near crumbs: 1

 == Data ==
         Original reference: 9826 nodes, 9824 edges
                       Trie: 5186 nodes, 24822 edges, depth=7
  Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0

                  Trie: 5186 nodes, 24822 edges, depth=7
  Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0
                      Reads: 100 x 99bp, coverage: 1.00774x
            Avg phred value: 0.13640%

 == Aligning statistics ==
        Explored rate (avg): 1.70707 states/read_bp
         States with crumbs: 0.02992%
            Explored states: 0.01737%
             Skipped states: 99.95271%
     Pushed rate (avg, max): 0.77130, 0.01400    [states/bp] (states normalized by query length)
     Popped rate (avg, max): 0.09310, 0.00140
             Average popped: 7.16000 from trie (76.90655%) vs 2.15000 from ref (per read)
Total cost of aligned reads: 15, 0.15000 per read, 0.15152% per letter
                 Alignments: 100 unique (100.00000%), 0 ambiguous (0.00000%) and 0 overcost (0.00000%)

 == Heuristic stats (astar-seeds) ==
        For all reads:
                            Seeds: 400 (4.00000 per read)
                     Seed matches: 385 (3.85000 per read, 0.96250 per seed)
               States with crumbs: 29105 [+0.00000% repeated], (291.05000 per read)
                  Heuristic (avg): 0.10000 of potential 4.00000

 == Performance ==
    Memory:                    measured | estimated
                   total: 0.00510gb, 100% | -
               reference: 0.00267gb, 52.39521% | 5.74505%
                   reads: 0.00000gb, 0.00000% | 0.18091%
                    trie: 0.00100gb, 19.53593% | 5.82224%
     equiv. classes opt.: 0.00000gb, 0.00000%
          A*-memoization: 0.00000gb, 0.00000

   Total wall runtime:    0.04086s
       reference loading: 0.00108s
         queries loading: 0.00022s
          construct trie: 0.00702s
              precompute: 0.00002s
       align (wall time): 0.03040s = 3289.04470 reads/s = 325.61543 Kbp/s

    Total align cpu time: 0.02998s = 3335.11206 reads/s = 330.17609 Kbp/s
     |          Preprocessing: 31.61686%
     |               A* query: 18.47652%
     |           greedy_match: 11.42609%
     |                  other: 38.48052%
 DONE

The directory tmp/ecoli_head10000_linear/astar-seeds/ will contain a file with execution logs and a file with alignment statistics. Short aggregated statistics are print to standard output (to redirect, you can add >summary.txt).

Usage

AStarix only finds optimal alignments (specified by argument align-optimal). Currently supported formats are .gfa without overlapping nodes (for a graph reference) and .fa/.fasta (for a linear reference). The queries should be in .fq/.fastq format (the phred values are ignored).

$ astarix --help
----  
Usage: astarix [OPTION...] align-optimal -g GRAPH.gfa -q READS.fq -o OUT_DIR/
Optimal sequence-to-graph aligner based on A* shortest path.

  -a, --algorithm={dijkstra, astar-prefix, astar-seeds}
                             Shortest path algorithm
  -c, --prefix_cost_cap=A*_COST_CAP
                             The maximum prefix cost for the A* heuristic
  -d, --prefix_len_cap=A*_PREFIX_CAP
                             The upcoming sequence length cap for the A*
                             heuristic
  -D, --tree_depth=TREE_DEPTH   Suffix tree depth
  -e, --prefix_equivalence_classes=A*_EQ_CLASSES
                             Whether to partition all nodes to equivalence
                             classes in order not to reuse the heuristic
      --fixed_trie_depth=FIXED_TRIE_DEPTH
                             Some leafs depth can be less than tree_depth
                             (variable=0, fixed=1)
  -f, --greedy_match=GREEDY_MATCH
                             Proceed greedily forward if there is a unique
                             matching outgoing edge
  -g, --graph=GRAPH          Input graph (.gfa)
  -G, --gap=GAP_COST         Gap (Insertion or Deletion) penalty [5]
  -k, --k_best_alignments=TOP_K   Output at most k optimal alignments per read
                             [1]
  -M, --match=MATCH_COST     Match penalty [0]
  -o, --outdir=OUTDIR        Output directory
  -q, --query=QUERY          Input queries/reads (.fq, .fastq)
      --seeds_len=A*_SEED_LEN   The length of the A* seeds.
      --seeds_skip_near_crumbs={0,1}
  -S, --subst=SUBST_COST     Substitution penalty [1]
  -t, --threads=THREADS      Number of threads [1]
  -v, --verbose=THREADS      Verbosity (silent=0, info=1, debug=2), [0]
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

astarix's People

Contributors

pesho-ivanov avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

astarix's Issues

Docker install fails `make test` because of python3 and pandas

Problem

The docker build docker build -t astarix . fails the make test because of the lack of python3 and pandas.

make test

Python 3

Cause

python3 $(TESTSDIR)/compare_profilings.py $(TMPDIR)/ecoli_head10000_linear/astar-default/alignments.tsv $(TMPDIR)/ecoli_head10000_linear/dijkstra-default/alignments.tsv

python3 $(TESTSDIR)/compare_profilings.py $(TMPDIR)/ecoli_head1000000_linear/astar-default/alignments.tsv $(TMPDIR)/ecoli_head1000000_linear/dijkstra-default/alignments.tsv

Error

== Aligning statistics ==
   Explored rate (avg, max): 1.62, 36.01    [states/bp] (states normalized by query length)
     Expand rate (avg, max): 0.24, 5.64
      Memoization miss rate: -nan%
             Average popped: 0.00 from trie vs 0.00 from ref (influenced by greedy match flag)
 Non-unique best alignments: 0 out of 100
 == Performance ==
    Memory:                    measured | estimated
                   total: 0.00gb, 100% | -
               reference: 0.00gb, 45.58% | 6.11%
                   reads: 0.00gb, 0.00% | 0.19%
                    trie: 0.00gb, 26.81% | 9.10%
     equiv. classes opt.: 0.00gb, 0.00% | 12.75%
          A*-memoization: 0.00gb, 0.00% | 179140796633671.28%-179140796633671.28% (0 entries)
    Wall runtime: 0.18s
       reference loading: 0.00s
         queries loading: 0.00s
          construct trie: 0.01s
              precompute: 0.00s
                   align: 0.17s (wall time) => 724.60 reads/s <=> 70.76 Kbp/s
                          0.17s (cpu time) (A*: 0.00%, que: 11.21%, dicts: 51.69%, greedy_match: 21.89%)
 DONE
python3 tests/compare_profilings.py tmp/ecoli_head10000_linear/astar-default/alignments.tsv tmp/ecoli_head10000_linear/dijkstra-default/alignments.tsv
make: python3: Command not found
make: *** [test] Error 127

Pandas

Cause

import pandas as pd

Error

 Non-unique best alignments: 0 out of 100
 == Performance ==
    Memory:                    measured | estimated
                   total: 0.00gb, 100% | -
               reference: 0.00gb, 46.42% | 5.91%
                   reads: 0.00gb, 0.00% | 0.19%
                    trie: 0.00gb, 27.17% | 8.81%
     equiv. classes opt.: 0.00gb, 0.00% | 12.33%
          A*-memoization: 0.00gb, 0.00% | 173348715449210.78%-173348715449210.78% (0 entries)
    Wall runtime: 0.18s
       reference loading: 0.00s
         queries loading: 0.00s
          construct trie: 0.01s
              precompute: 0.00s
                   align: 0.17s (wall time) => 737.79 reads/s <=> 72.05 Kbp/s
                          0.17s (cpu time) (A*: 0.00%, que: 10.59%, dicts: 51.95%, greedy_match: 22.12%)
 DONE
python3 tests/compare_profilings.py tmp/ecoli_head10000_linear/astar-default/alignments.tsv tmp/ecoli_head10000_linear/dijkstra-default/alignments.tsv
Traceback (most recent call last):
  File "tests/compare_profilings.py", line 2, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
make: *** [test] Error 1
Makefile:35: recipe for target 'test' failed
The command '/bin/sh -c make &&         make test' returned a non-zero code: 2

Possible solution

1. Removing make test from the Dockerfile

If its only use is testing & evaluations you can just remove the make test

2. Adding python3 and pandas to the docker image

Assuming there's some utility in the test.
A fix that worked for me (which may not be the most efficient) was adding:

python3 and pip

python3 \
python3-pip \

and installing pandas

RUN pip3 install pandas

GAF output

Could you provide GAF output over the nodes of the graph? Would it be possible to do? I could also submit a patch if you can describe how you'd go about setting this up.

It's defined in the minigraph documentation, and GraphAligner and vg both produce it.

It's like PAF but the target is expressed as a walk through nodes in the graph (your GFA input).

The walk is expressed like >1>2>4>6>7>9>10>11>12>13>14>16>17>19

If you have <19<17<16<14<13<12<11<10<9<7<6<4<2<1< then it might represent the reverse complement of the above walk.

You'll need to create a CIGAR or md tag to express the base-level alignment, for use downstream in vg call for instance.

RNAseq

Hi,

First of all, thank you for such an interesting tool. I am really interesed in using it in the near future. One question though, how suitable would this tool be for mapping RNA-seq reads to a diploid reference graph? If this has been adressed before, please let me know where to look for correct parameters, if this is an interesting case scenario, that would require some development, I would be more than happy to help.

Best regards,

Juan D. Montenegro

Segmentation fault

Hello,

I wish to test astarix on small test data available in minigraph repository. The graph is in GFA format with zero overlaps.

For this, I ran the following command:

./astarix align-optimal -g MT.gfa -q MT-human.fa -o $PWD

I get the following output:

$ ./astarix --verbose=2 align-optimal -g MT.gfa -q MT-human.fa -o $PWD
----
Assert() checks:       ON
        verbose:        2
----
Loading reference graph... astarix: ext/GraphAligner/GfaGraph.cpp:172: static GfaGraph GfaGraph::LoadFromFile(std::string): Assertion `overlap >= 0' failed.
Aborted

Can you please check and suggest what may be going wrong? I've pulled the latest code from master branch for this. When I run the code without assert checks, it returns segmentation fault.

Thanks in advance.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.