medvedevgroup / twopaco

A fast constructor of the compressed de Bruijn graph from many genomes

License: Other

CMake 0.39% C++ 94.93% Makefile 3.85% Python 0.83%

bioinformatics comparative-genomics de-bruijn-graphs genomics graph-algorithms

twopaco's People

adderan ekg aksinghal5590 gaspareg nsmehta fataltes wangdi2014 sapoudel sivasan dpryan79

twopaco's Issues

Handling N characters in sequences

Hi,

when using input sequences that contain N characters (which almost all eukaryote reference sequences do) I get this error message:

Round 0, 0:1048576
Pass Filling Filtering
1 error: Found an invalid character 'N'

Do I need to re-split my input into contigs to use it with TwoPaCo, or is this some kind of bug?

Thanks,
Chris
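
One way to re-split the input, as the question suggests, is to cut each record at runs of N before passing it to TwoPaCo. The sketch below is only an illustration and is not part of TwoPaCo: `split_on_n`, its `min_len` parameter, and the `name_offset` naming scheme are hypothetical, and the thread does not confirm that this is the intended workaround.

```python
import re


def split_on_n(name, seq, min_len=1):
    """Split one FASTA record into N-free pieces.

    Returns (piece_name, piece_sequence) pairs. Each piece name carries the
    0-based start offset of the piece so coordinates can be mapped back to
    the original record. `min_len` (hypothetical) drops very short pieces.
    """
    pieces = []
    for match in re.finditer(r"[^Nn]+", seq):
        if match.end() - match.start() >= min_len:
            pieces.append((f"{name}_{match.start()}", match.group()))
    return pieces


pieces = split_on_n("chr1", "ACGTNNNNGGCCNA")
```

Setting `min_len` to the chosen k would discard pieces too short to contain a single k-mer; whether that is desirable depends on the downstream analysis.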

Fix error handling

Add error checking after the first pass
Add try/catch around constructing the parser (to check whether the file exists)

Cause of input corrupted error?

Hey @iminkin!

I'm trying to run this on another dataset, but I keep getting the "the input is corrupted" error. I took a look at the source code but can't fully work out what causes it. I checked my FASTA file and it seems fine, but I might be missing something.

`twopaco` and `graphdump` functions not found

Hello,

I am interested in using twopaco on a set of bacterial reference genomes, though I am running into issues.

I cloned the repo and went through the build steps mentioned in the README, though after the build, I can't seem to run the twopaco command.

bash: twopaco: command not found
bash: graphdump: command not found

I can't locate in which directory the command exists or needs to be run from. I am somewhat new to building tools written in C, so pardon my ignorance on the topic.

I am working on an HPC system that supports C++11 or later and has the required TBB libraries and version of cmake. The tool built without issue.

Thanks,
Domenick

Unable to create temp file

Hi,

In all my runs, regardless of the location I provide, TwoPaCo is unable to create a temp file there.

twopaco --tmpdir /home/TwoPaCo/tmp/ --test -t 1 -k 11 -f 20 -o /home/TwoPaCo/test.dbg /home/TwoPaCo/examples/example.fa

What am I doing wrong here?

Thanks,
Chris

Refactor parallel section

Problem building TwoPaCo in BioLinux

Hi @IlyaMinkin,

I am trying to install TwoPaCo in BioLinux. I first installed tbb using sudo apt-get install libtbb-dev, then did git clone https://github.com/medvedevgroup/TwoPaCo.git, cd TwoPaCo, mkdir build, cd build, cmake ../src, and finally make. Here is the output I get:
$ cmake ../src
-- The C compiler identification is GNU 4.8.4
-- The CXX compiler identification is GNU 4.8.4
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/manager/Programmes/TwoPaCo/build

$ make
Scanning dependencies of target graphdump
[ 8%] Building CXX object graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o
[ 16%] Building CXX object graphdump/CMakeFiles/graphdump.dir/__/common/dnachar.cpp.o
[ 25%] Building CXX object graphdump/CMakeFiles/graphdump.dir/__/common/streamfastaparser.cpp.o
In file included from /home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.cpp:4:0:
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h:179:3: error: ‘auto_ptr’ in namespace ‘std’ does not name a type
std::auto_ptr<TwoPaCo::StreamFastaParser> parser_;
^
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h: In constructor ‘TwoPaCo::ChrReader::ChrReader(const std::vector<std::basic_string<char> >&)’:
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h:145:5: error: ‘parser_’ was not declared in this scope
parser_.reset(new TwoPaCo::StreamFastaParser(fileName[0]));
^
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h: In member function ‘bool TwoPaCo::ChrReader::NextChr(std::string&)’:
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h:154:9: error: ‘parser_’ was not declared in this scope
if (parser_->ReadRecord())
^
make[2]: *** [graphdump/CMakeFiles/graphdump.dir/__/common/streamfastaparser.cpp.o] Error 1
make[1]: *** [graphdump/CMakeFiles/graphdump.dir/all] Error 2
make: *** [all] Error 2

How can I solve this problem?
Thanks a lot in advance for your help.

Large k value

Hello!
I would be very interested in using TwoPaCo with large k-mers.
It works with k = 281 but not with k = 291 on the E. coli reference genome.
./twopaco -f 30 -k 291 ../../../../data/ecoli.fa
gives a segfault.

Would it be possible for TwoPaCo to work with arbitrary values of k?

undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'

Hitting a general issue with install, despite following the directions.

As a friendly reminder, if developers expect (moderately) computationally literate biologists to use their software, they need to provide explicit lines of code in the README, not general instructions. Sometimes that's the difference between hundreds of citations or almost none.

user@computer:~$ git clone https://github.com/medvedevgroup/TwoPaCo
Cloning into 'TwoPaCo'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 3548 (delta 19), reused 28 (delta 13), pack-reused 3510
Receiving objects: 100% (3548/3548), 11.90 MiB | 5.98 MiB/s, done.
Resolving deltas: 100% (2325/2325), done.
user@computer:~$ cd TwoPaCo
user@computer:~/TwoPaCo$ mkdir build
user@computer:~/TwoPaCo$ cd build
user@computer:~/TwoPaCo/build$ cmake ../src
-- The C compiler identification is GNU 5.5.0
-- The CXX compiler identification is GNU 5.5.0
-- Check for working C compiler: /home/linuxbrew/.linuxbrew/bin/cc
-- Check for working C compiler: /home/linuxbrew/.linuxbrew/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /home/linuxbrew/.linuxbrew/bin/c++
-- Check for working CXX compiler: /home/linuxbrew/.linuxbrew/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/TwoPaCo/build
user@computer:~/TwoPaCo/build$  sudo apt-get install libtbb-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  linux-headers-4.15.0-48 linux-headers-4.15.0-48-generic linux-image-4.15.0-48-generic
  linux-modules-4.15.0-48-generic linux-modules-extra-4.15.0-48-generic python3-dateutil
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  libtbb2
Suggested packages:
  tbb-examples libtbb-doc
The following NEW packages will be installed:
  libtbb-dev libtbb2
0 to upgrade, 2 to newly install, 0 to remove and 77 not to upgrade.
Need to get 342 kB of archives.
After this operation, 2,033 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://au.archive.ubuntu.com/ubuntu bionic/universe amd64 libtbb2 amd64 2017~U7-8 [110 kB]
Get:2 http://au.archive.ubuntu.com/ubuntu bionic/universe amd64 libtbb-dev amd64 2017~U7-8 [231 kB]
Fetched 342 kB in 0s (2,559 kB/s)   
Selecting previously unselected package libtbb2:amd64.
(Reading database ... 255576 files and directories currently installed.)
Preparing to unpack .../libtbb2_2017~U7-8_amd64.deb ...
Unpacking libtbb2:amd64 (2017~U7-8) ...
Selecting previously unselected package libtbb-dev:amd64.
Preparing to unpack .../libtbb-dev_2017~U7-8_amd64.deb ...
Unpacking libtbb-dev:amd64 (2017~U7-8) ...
Setting up libtbb2:amd64 (2017~U7-8) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Setting up libtbb-dev:amd64 (2017~U7-8) ...
user@computer:~/TwoPaCo/build$ make
[  7%] Building CXX object graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o
[ 14%] Building CXX object graphdump/CMakeFiles/graphdump.dir/__/common/dnachar.cpp.o
[ 21%] Building CXX object graphdump/CMakeFiles/graphdump.dir/__/common/streamfastaparser.cpp.o
[ 28%] Linking CXX executable graphdump
/usr/lib/x86_64-linux-gnu/libtbb.so: undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'
/usr/lib/x86_64-linux-gnu/libtbb.so: undefined reference to `std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11'
collect2: error: ld returned 1 exit status
graphdump/CMakeFiles/graphdump.dir/build.make:146: recipe for target 'graphdump/graphdump' failed
make[2]: *** [graphdump/graphdump] Error 1
CMakeFiles/Makefile2:85: recipe for target 'graphdump/CMakeFiles/graphdump.dir/all' failed
make[1]: *** [graphdump/CMakeFiles/graphdump.dir/all] Error 2
Makefile:105: recipe for target 'all' failed
make: *** [all] Error 2

Confused about + and - stranded nodes in GFA

Let's use this very simple FASTA:

>seq1
ATATGTCGCTGATCGACTGAAATAGCATCGACTAGCTATCGAT
>seq2
ATATGTCGCTGATCGACTGAATAGTGAAATAGCATCGACTAGC
>seq3
ATATGTCGCTGATCGACTTTTTTTTGAAATAGCATCGACTAGC

Then we construct the graph: ./twopaco -k 15 -f 16 test.fa -o graph and convert it to GFA: graphdump -k 15 -f gfa2 -s test.fa graph > graph.gfa:

H       VN:Z:2.0
S       36      18      ATATGTCGCTGATCGACT
F       36      seq1+   0       18$     0       18      15M
S       24      18      TTCAGTCGATCAGCGACA
F       24      seq1-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       14      26      GTCGATGCTATTTCAGTCGATCAGCG
F       14      seq1-   0       26$     6       32      15M
E       24-     14-     0       15      11      26$     15M
S       11      19      TGAAATAGCATCGACTAGC
F       11      seq1+   0       19$     17      36      15M
E       14-     11+     0       15      0       15      15M
S       19      22      ATAGCATCGACTAGCTATCGAT
F       19      seq1+   0       22$     21      43$     15M
E       11+     19+     4       19$     0       15      15M
O       seq1p   36+ 24- 14- 11+ 19+
F       36      seq2+   0       18$     0       18      15M
F       24      seq2-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       13      33      GTCGATGCTATTTCACTATTCAGTCGATCAGCG
F       13      seq2-   0       33$     6       39      15M
E       24-     13-     0       15      18      33$     15M
F       11      seq2+   0       19$     24      43$     15M
E       13-     11+     0       15      0       15      15M
O       seq2p   36+ 24- 13- 11+
F       36      seq3+   0       18$     0       18      15M
S       12      36      GTCGATGCTATTTCAAAAAAAAGTCGATCAGCGACA
F       12      seq3-   0       36$     3       39      15M
E       36+     12-     3       18$     21      36$     15M
F       11      seq3+   0       19$     24      43$     15M
E       12-     11+     0       15      0       15      15M
O       seq3p   36+ 12- 11+

When we look at the paths we have:


seq1p   36+ 24- 14- 11+ 19+
seq2p   36+ 24- 13- 11+
seq3p   36+ 12- 11+

We can only reconstruct the sequence from the GFA by taking the reverse complement of - nodes. When we look at the paths all nodes are on the same strand (i.e. all - or all +), for example, all 24 nodes are -. So why weren't these just all recorded as +?
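
To make the reconstruction step concrete, here is a minimal sketch that spells seq1 from its path using the segments listed above. It assumes, as the `15M` overlaps in the dump suggest, that consecutive segments share k = 15 characters; `spell_path` is a hypothetical helper and not part of graphdump. It only shows how the signs are used and does not explain why a given segment is stored on the - strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def spell_path(path, segments, k):
    """Spell the sequence of a path such as '36+ 24- 14- 11+ 19+'."""
    result = ""
    for step in path.split():
        seg_id, strand = step[:-1], step[-1]
        seq = segments[seg_id]
        if strand == "-":
            seq = reverse_complement(seq)
        # Consecutive segments overlap by k characters, so after the first
        # segment only the part beyond the overlap is appended.
        result += seq if not result else seq[k:]
    return result


# Segment sequences copied from the S lines of the GFA dump above.
segments = {
    "36": "ATATGTCGCTGATCGACT",
    "24": "TTCAGTCGATCAGCGACA",
    "14": "GTCGATGCTATTTCAGTCGATCAGCG",
    "11": "TGAAATAGCATCGACTAGC",
    "19": "ATAGCATCGACTAGCTATCGAT",
}
seq1 = spell_path("36+ 24- 14- 11+ 19+", segments, k=15)
```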

GFA version update

Is it planned to update the GFA output to the new GFA v2, or to add an option to choose v2 as output?
I would like to use TwoPaCo with another tool that needs GFA v2 as input.

typo?

Hello, I am writing because it seems like there is a typo in the main README under the graphdump / GFA section. Seems like graphdummp should be graphdump

Best,
Domenick

Redundant k-mer in contigs

Hi @IlyaMinkin,

We've run into another minor issue that we think is a bug. I wanted to report the behavior here to get your feedback on it. Basically, what we're seeing is that, for a small number of contigs that TwoPaCo is returning, the contigs contain both a k-mer and its reverse complement. Thus, in the compacted dBG, the k-mer itself is repeated --- which we believe shouldn't happen. I realize that this is possible in the GFA output when the k-mer occurs at the end of a contig, since the GFA file is written such that the overlaps themselves are of length k and hence these k-mers will occur at least twice. However, these repeated k-mers are internal (and seem to happen, in fact, when the entire contig is its own reverse complement).

This issue was discovered by my student @fataltes, who did the legwork to provide the following example. We're working with this reference sequence. We ran TwoPaCo with -k set to 31, and then used graphdump to obtain a GFA1 file. Most of the contigs / segments in this file are OK, but a few of them contain the same k-mer (once in the forward and once in the reverse complement orientation) more than once. Here is the list of such contigs / segments in our output:

2232549 ATGTGTGTGTGTGTATATATATATATATATACACACACACACAT
196044 TGTGTATATATATACACATATATACGTATATATGTGTATATATATACACA
557083 TTTCATGTTTATATATATATATATATGTATATATATATACATATATATATATATATAAACATGAAA
659373 GTGTGTGTGTATATATATATATATATATATACACACACAC
2222892 ATTATATATATATAATATATATATATTATATATATATAAT
2307911 ATATATATATATCATATATATGATATATATATAT
2309111 ATATACATATATATATATATATATATATGTATAT
2861563 AAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTT
2237088 TGTGTGTGTATGTATATTATATAATATACATACACACACA
555324 TATATATATATATACCATATATATGGTATATATATATATA
659376 TGTGTGTGTGTGTGTATATATACACACACACACACA
554875 TATATATATATATATATAATATATATATATATATA
555396 TATATATATATAAATATATATATATTTATATATATATA
162527 TTATATATATATTATATATATAATATATATATAA
2307775 ATATATATGTGTGTATATATATACACACATATATAT
554899 ATATATATATATATATGCATATATATATATATAT
214284 TGTATGTGTGTATATATGTGTGTATATATATATACACACATATATACACACATACA

As you can see, these segments contain quite a few cases where both a k=31-mer and its reverse complement (and even larger k-mers) are present in the same contig. As we are indexing k-mers in the TwoPaCo representation, and expecting each k-mer to occur at most once, this is causing some issues for us. Interestingly, all of these cases seem to be occurring as substrings of segments which are their own reverse complements. So, I presume that this is either (1) expected behavior and we are possibly interpreting the compacted dBG differently from TwoPaCo or (2) some minor corner-case in the contig generation code.

Please let me know if you have any questions about this case or any difficulty re-generating this example. Thanks again!

--Rob
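
The check described in this report can be sketched as follows. This is a minimal illustration rather than the code used by the reporters: `canonical`, `repeated_canonical_kmers`, and `is_self_reverse_complement` are hypothetical helper names, and "canonical" here means the lexicographically smaller of a k-mer and its reverse complement.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def repeated_canonical_kmers(contig, k):
    """List canonical k-mers that occur more than once within one contig."""
    counts = Counter(
        canonical(contig[i:i + k]) for i in range(len(contig) - k + 1)
    )
    return sorted(kmer for kmer, n in counts.items() if n > 1)


def is_self_reverse_complement(contig):
    """True if the contig reads the same on both strands."""
    return contig == reverse_complement(contig)
```

Calling `repeated_canonical_kmers(segment, 31)` on each segment of the GFA file would flag contigs like the ones listed above, and `is_self_reverse_complement` tests the observation that the affected segments are their own reverse complements.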

Creation of many temporary files (of considerable size) when there are many references.

Hi,

I have noticed some behavior I was not expecting when using TwoPaCo to generate compacted dBGs for input fasta files with many distinct references. Specifically, we are making use of TwoPaCo internally in pufferfish indexing, and one of the common use cases now is to index a transcriptome for subsequent salmon quantification. Here, the total size of the sequence is small ~300M for the human transcriptome, but the number of individual fasta entries is very large (~200,000).

The behavior I noticed is that TwoPaCo creates, during processing, a temporary file in its temp directory for every input sequence in the fasta file. So, we get a temp folder with ~200,000 distinct files! This seems to be a particular problem for some users who are doing indexing on cluster machines (with NFS-mounted drives).

In addition to the large number of distinct files being created, the total size of the temporary directory grows quite large. For example, for the human transcriptome (again, ~300M of input sequence), the TwoPaCo temp directory grows to ~14G before files start being deleted.

I have two main questions. First, is this large intermediate disk-space usage expected, and if so is there some way that it can be controlled? Second, is there some way to avoid or alter the behavior of creating one temp file per input sequence? This still works (as long as we're not on an NFS) for transcriptomic sequences, but some large metagenomic sequences have literally created more files in a directory than the file system is willing to handle. Ideally, there may be some way to "block together" temporary files for distinct references so that, rather than 1 temp file per-reference there was a temp file for different buckets of references or some such.

Thanks again for the great tool, and for any insight or suggestions you have on the above!

--Rob

target not created

twopaco target not created in the root folder

Fail to install TwoPaCo

Hi,

Thank you for developing TwoPaCo!

I got the following error when compiling. I believe it is related to the TBB library, which I installed by following the instructions at https://github.com/oneapi-src/oneTBB/blob/master/INSTALL.md. However, the error persists. Could you please clarify whether we need to specify the TBB library when installing TwoPaCo? I appreciate your help!

/home/usr/Tools/TwoPaCo/src/graphdump/graphdump.cpp:15:10: fatal error: oneapi/tbb/parallel_sort.h: No such file or directory
 #include "oneapi/tbb/parallel_sort.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
graphdump/CMakeFiles/graphdump.dir/build.make:75: recipe for target 'graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o' failed
make[2]: *** [graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o] Error 1
CMakeFiles/Makefile2:115: recipe for target 'graphdump/CMakeFiles/graphdump.dir/all' failed
make[1]: *** [graphdump/CMakeFiles/graphdump.dir/all] Error 2
Makefile:155: recipe for target 'all' failed
make: *** [all] Error 2

Implement two pass aggregation strategy

First pass: enumerate all candidates, put them in a sorted array of atomic ints
Second pass: binary search over the array, update the preceding/succeeding characters and set a flag if it is a bifurcating kmer

All in parallel; the first pass uses a hash set with a spinlock

Or, alternatively, use a lock-free hash table http://preshing.com/20130605/the-worlds-simplest-lock-free-hash-table/
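
The two passes can be sketched sequentially as below. This is a simplified illustration under stated assumptions, not the TwoPaCo implementation: it is single-threaded, keeps k-mers as strings instead of atomic integers, ignores reverse complements and sequence ends, and `find_bifurcating_kmers` is a hypothetical name.

```python
from bisect import bisect_left


def find_bifurcating_kmers(sequences, k):
    """Return k-mers with more than one distinct predecessor or successor."""
    # Pass 1: enumerate every candidate k-mer and store them in a sorted array.
    candidates = sorted(
        {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    )
    preceding = [set() for _ in candidates]
    succeeding = [set() for _ in candidates]

    # Pass 2: locate each k-mer by binary search and record the characters
    # that precede and follow it in the input.
    for s in sequences:
        for i in range(len(s) - k + 1):
            idx = bisect_left(candidates, s[i:i + k])
            if i > 0:
                preceding[idx].add(s[i - 1])
            if i + k < len(s):
                succeeding[idx].add(s[i + k])

    # A k-mer is flagged as bifurcating if either neighbour set has more
    # than one character.
    return [
        kmer
        for kmer, before, after in zip(candidates, preceding, succeeding)
        if len(before) > 1 or len(after) > 1
    ]


junctions = find_bifurcating_kmers(["ACGTA", "ACGTC"], k=3)
```

In a parallel version, the neighbour sets would presumably become small bit masks updated atomically, which is why the note above mentions an array of atomic ints.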

Question about GFA format

Hi @IlyaMinkin,

It's me again :). TwoPaCo has been working great, but I've run into a small issue regarding the GFA file. I was wondering if you could clear up my confusion. I build a cdBG using TwoPaCo with k=31. As the document states that k is the node size, I'm expecting the cdBG to contain a list of segments (i.e., contigs) that overlap by k-1. However, in the resulting GFA file, all of the contigs seem to instead overlap by k (i.e., they show a 31M overlap). This is causing some issues downstream, as we expect the invariant that a k-mer (or its reverse complement) appears at most once in the cdBG. However, when the overlap is of size k, we get that a given k-mer may appear as many times as it participates in an overlap.

Have I misunderstood something about the expected format of this graph? Is there an easy way to obtain the cdBG GFA file such that the overlaps are retained as k-1 bases instead of k?

Thanks!
Rob

error: Inconsistent read size

Hi,

When running TwoPaCo with any dataset that consists of at least two sequences of different size (in one, or more files) I almost immediately get this error message:

Round 0, 0:1048576
Pass Filling Filtering
1 error: Inconsistent read size

Is it intended behavior that TwoPaCo can only handle input sequences of the same size?

Thanks,
Chris

Linking twopaco to TBB fails

Hi @IlyaMinkin,

I am trying to build TwoPaCo on RedHat-7-x86_64. Compilation and linking of graphdump proceeds without error but during linking of twopaco the following error occurs:

CMakeFiles/twopaco.dir/vertexenumerator.cpp.o: In function TwoPaCo::VertexEnumeratorImpl<1ul>::DistributeTasks(std::vector<std::string, std::allocator<std::string> > const&, unsigned long, std::vector<std::unique_ptr<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> >, std::default_delete<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> > > >, std::allocator<std::unique_ptr<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> >, std::default_delete<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> > > > > >&, std::unique_ptr<std::runtime_error, std::default_delete<std::runtime_error> >&, tbb::mutex&, std::ostream&) [clone .constprop.2154]: vertexenumerator.cpp:(.text+0x160d): undefined reference to tbb::internal::concurrent_queue_base_v8::internal_push_move(void const*) vertexenumerator.cpp:(.text+0x1aa8): undefined reference to tbb::internal::concurrent_queue_base_v8::internal_push_move_if_not_full(void const*)'

I have tried three different TBB releases: 43_20150209, tbb2017_20170412 and tbb2018_20170919. They all produce the error posted above. The 2018 release produces an additional error:

CMakeFiles/twopaco.dir/constructor.cpp.o: In function tbb::flow::interface10::graph::~graph()': constructor.cpp:(.text._ZN3tbb4flow11interface105graphD2Ev[_ZN3tbb4flow11interface105graphD5Ev]+0x4c): undefined reference to tbb::interface7::internal::task_arena_base::internal_execute(tbb::interface7::internal::delegate_base&) const constructor.cpp:(.text._ZN3tbb4flow11interface105graphD2Ev[_ZN3tbb4flow11interface105graphD5Ev]+0x104): undefined reference to tbb::interface7::internal::task_arena_base::internal_initialize() constructor.cpp:(.text._ZN3tbb4flow11interface105graphD2Ev[_ZN3tbb4flow11interface105graphD5Ev]+0x11c): undefined reference to tbb::interface7::internal::task_arena_base::internal_terminate()

The ldd command shows that graphdump is linked to the correct TBB library. Am I using the wrong TBB version or am I missing any additional libraries?

Thank you for your help!

gfa1 Format issue (path orientation)

Hi,

The paths in the GFA1 output of graphdump seem to be formatted incorrectly. For some segments in a path the orientation is missing, and for every segment that has one, the orientation is given before the segment number instead of after it (although I'm not sure whether this is an actual error or is allowed by the format specification).
Output:
P seq1_repeat_seq2_seq3 -23,17,-31,27,-8 31M,31M,31M,31M
gfa spec example:
P 14 11+,12-,13+ 4M,5M

Is this the correct format or is there something wrong?
Thanks,
Chris
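
As a stopgap, the signed IDs can be rewritten into the suffix form shown in the spec example. The sketch below assumes that an ID without a sign means forward orientation; that reading is an assumption and is not confirmed in this thread. `signed_to_gfa1_path` is a hypothetical helper, not part of graphdump.

```python
def signed_to_gfa1_path(signed_ids):
    """Convert '-23,17,-31' style steps to GFA1 '23-,17+,31-' style steps.

    Assumption: an ID without a leading '-' is in forward orientation.
    """
    steps = []
    for token in signed_ids.split(","):
        token = token.strip()
        if token.startswith("-"):
            steps.append(token[1:] + "-")
        else:
            steps.append(token.lstrip("+") + "+")
    return ",".join(steps)


fixed = signed_to_gfa1_path("-23,17,-31,27,-8")
```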

Fails to build with TBB 2021.5.0

I am building TwoPaCo 0.9.4 with TBB 2021.5.0 on Debian experimental. The build fails with the following:

In file included from /home/merkys/twopaco/src/graphdump/graphdump.cpp:17:
/home/merkys/twopaco/src/graphdump/../common/streamfastaparser.h:8:10: fatal error: tbb/mutex.h: No such file or directory
    8 | #include <tbb/mutex.h>
      |          ^~~~~~~~~~~~~

Does this mean that TwoPaCo does not support TBB 2021.5.0? If so, are there plans to support it?

Replace deprecated auto_ptr

Corrupt input when using graphdump on TwoPaCo output

Hi guys,

First of all; awesome work on TwoPaCo --- the method and the software are both fantastically useful, and I'm really excited to start using it for some downstream applications we're working on. I'm running into the following issue. I used TwoPaCo to build the compacted dBG for the human transcriptome (the following command):

twopaco  -k 31 -t 8 -f 32 gencode.v25.pc_transcripts.fa

As the name suggests, the reference is protein coding human transcripts from gencode v25. This seems to work fine, and I get the following output from TwoPaCo:

Threads = 8
Vertex length = 31
Hash functions = 5
Filter size = 4294967296
Capacity = 2
Files:
/mnt/scratch6/avi/data/txptome/gencode.v25.pc_transcripts.fa
--------------------------------------------------------------------------------
Round 0, 0:4294967296
Pass    Filling Filtering
1       9       16
2       3       1
True junctions count = 358144
False junctions count = 56685
Hash table size = 414829
Candidate marks count = 2551662
--------------------------------------------------------------------------------
Reallocating bifurcations time: 0
True marks count: 2540661
Edges construction time: 6
--------------------------------------------------------------------------------
Distinct junctions = 358144

Now, I want to convert this output to a GFA format (I tried both GFA1 and 2 and get the same error in each case). I used the following command:

graphdump -k 31 -s gencode.v25.pc_transcripts.fa -f gfa1 de_bruijn.bin > gencode.twopaco.gfa1

This results in the following error message:

error: The input is corrupted

At this point, some output has been generated, but I presume it's not complete because, despite the fact that there are ~96k input transcripts, I only get 35,451 output paths (i.e., P) entries in the resulting GFA file. Any idea what might be causing this issue or how to fix it?

Thanks!
Rob
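
A quick way to quantify how much of the output is missing is to compare the number of FASTA records with the number of path lines in the GFA file. The sketch below is a minimal illustration with hypothetical helper names; it assumes tab-separated GFA lines, with paths on `P` lines in GFA1 and on `O` lines in GFA2.

```python
def count_fasta_records(lines):
    """Count FASTA header lines in an iterable of text lines."""
    return sum(1 for line in lines if line.startswith(">"))


def count_gfa_paths(lines):
    """Count path records: 'P' lines (GFA1) or 'O' lines (GFA2)."""
    return sum(1 for line in lines if line.split("\t", 1)[0] in ("P", "O"))


# Tiny in-memory example; real use would pass open file objects instead.
fasta_lines = [">tx1\n", "ACGT\n", ">tx2\n", "GGCC\n"]
gfa_lines = ["H\tVN:Z:1.0\n", "S\t1\tACGT\n", "P\ttx1\t1+\t*\n"]
missing = count_fasta_records(fasta_lines) - count_gfa_paths(gfa_lines)
```

Both functions accept any iterable of lines, so `count_gfa_paths(open("gencode.twopaco.gfa1"))` works on the file produced by the command above.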

meaning of software name

Does TwoPaCo stand for "Two Path Compaction"?

This is for a paper where we're trying to provide full names for acronyms and compressed program names to make the jargon less intense.

Undefined symbol

/sc1/apps/pets/SibeliaZ/bin/twopaco: symbol lookup error: /sc1/apps/pets/SibeliaZ/bin/twopaco: undefined symbol: _ZN3tbb8internal24concurrent_queue_base_v818internal_push_moveEPKv

Present API for pufferfish

TwoPaCo is used by Pufferfish, which currently embeds a patched copy of the TwoPaCo code. The patches appear to transform the main() function so that it can be called from C/C++ code, bypassing the command-line interface. Would it be possible to merge Pufferfish's patches into TwoPaCo? That way Pufferfish could be linked against a static or shared TwoPaCo library. In addition, I believe an API that bypasses the command line could be useful for other TwoPaCo users, for example those who prefer static type checking.

macOS compilation failed

Compilation fails on macOS High Sierra 10.13.1 using gcc 4.85, with the following error:

In file included from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/Arg.h:54:0,
                 from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/SwitchArg.h:30,
                 from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/CmdLine.h:27,
                 from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/constructor.cpp:16:
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ArgTraits.h: In instantiation of ‘struct TCLAP::ArgTraits<long long unsigned int>’:
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ValueArg.h:403:66:   required from ‘void TCLAP::ValueArg<T>::_extractValue(const string&) [with T = long long unsigned int; std::string = std::basic_string<char>]’
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ValueArg.h:363:29:   required from ‘bool TCLAP::ValueArg<T>::processArg(int*, std::vector<std::basic_string<char> >&) [with T = long long unsigned int]’
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/constructor.cpp:163:1:   required from here
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ArgTraits.h:80:39: error: ‘long long unsigned int’ is not a class, struct, or union type
     typedef typename T::ValueCategory ValueCategory;
                                       ^
make[2]: *** [graphconstructor/CMakeFiles/twopaco.dir/constructor.cpp.o] Error 1
make[1]: *** [graphconstructor/CMakeFiles/twopaco.dir/all] Error 2
make: *** [all] Error 2

Is it stuck when there is really low CPU usage?

We just ran twopaco on thousands of bacterial genomes and now it's at:

Round 0, 0:4398046511104
Pass    Filling Filtering

However, when we look at top, we see that while it's loaded in memory, CPU usage is only 0.3%:
514.3g 512.2g 4368 D 0.3 33.9 209:34.18 twopaco
Is this normal or does this mean something is going wrong?
