splatlab / mantis

Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index

License: BSD 3-Clause "New" or "Revised" License

C++ 93.05% C 6.53% CMake 0.19% Shell 0.20% Dockerfile 0.03%

mantis's People

Contributors

fataltes, gmarcais, phelimb, prashantpandey, rob-p, rtjohnso

mantis's Issues

[mergeMSTs] Problems with mst and query

Hey there,

While using the mergeMSTs branch, I ran into some trouble with mst and query.

mst

mantis mst doesn't seem to work.

It wants to load eqclass_rr.cls files:

mantis/src/mst.cc

Lines 33 to 34 in 7406e8f

eqclass_files =
    mantis::fs::GetFilesExt(prefix.c_str(), mantis::EQCLASS_FILE);

This will later lead to a segmentation fault because the files do not exist.

mantis build will always delete eqclass_rr.cls files at the end:

mantis/src/mst.cc

Lines 729 to 737 in 7406e8f

if (opt.remove_colorClasses && !opt.keep_colorclasses) {
    for (auto &f : mantis::fs::GetFilesExt(opt.prefix.c_str(), mantis::EQCLASS_FILE)) {
        std::cerr << f.c_str() << "\n";
        if (std::remove(f.c_str()) != 0) {
            std::cerr << "Unable to delete file " << f << "\n";
            std::exit(1);
        }
    }
}

mantis build doesn't have an option to toggle this behavior.
Changing qopt.remove_colorClasses = true; to qopt.remove_colorClasses = false; in the following code fixes the issue:

qopt.prefix = bopt.out;
qopt.numThreads = bopt.numthreads;
qopt.remove_colorClasses = true;

query

The default non-bulk query only works if the eqclass_rr.cls files are present and -1 is used:

mantis query -1 -k 20 -p index/ reads.fasta

To have eqclass_rr.cls files, the above fix is needed, and mst must have been run with -k.

Alternatively, bulk-mode (-b) works without the eqclass_rr.cls files. So, mst can also be run with -d.

mantis query -b -k 20 -p index/ reads.fasta

The problem in non-bulk query seems to be that findSamples is called for every query sequence:

mantis/src/mstQuery.cc

Lines 492 to 498 in 7406e8f

while (ipfile >> read) {
    mstQuery.reset();
    mstQuery.parseKmers(numOfQueries, read, indexK);
    mstQuery.findSamples(cdbg, cache_lru, &rs, queryStats, 1);
    output_results(mstQuery, opfile, sampleNames, queryStats, 1);
    numOfQueries++;
}

The function then accesses cdbg.get_current_cqf()->keybits():

uint64_t ksize{cdbg.get_current_cqf()->keybits()}, numBlocks{cdbg.get_numBlocks()};

This works fine for the first query, but for the second one there is no CQF to access because it has been replaced with an invalid one:

cdbg.replaceCQFInMemory(invalid);

I tried loading block 0 at the beginning of findSamples and passing the keybits as an extra parameter.
But then there is an out-of-bounds access at

allQueries[q][numSamples]++;
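For illustration only, the first failure described above (reading keybits from a CQF that is no longer resident) can be sketched with stand-in types. Index, Block, load_block, drop_block and cached_keybits are hypothetical names, not mantis classes or functions; the idea is simply to read the value once while a block is loaded and reuse it afterwards. This does not address the separate out-of-bounds access at allQueries[q][numSamples]++.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a CQF block; only keybits() matters here.
struct Block {
    std::uint64_t bits;
    std::uint64_t keybits() const { return bits; }
};

// Hypothetical stand-in for the index: at most one block is resident.
class Index {
public:
    explicit Index(std::uint64_t keybits_on_disk) : keybits_on_disk_(keybits_on_disk) {}
    void load_block() { current_ = std::make_unique<Block>(Block{keybits_on_disk_}); }
    void drop_block() { current_.reset(); }            // mimics replacing the CQF with an invalid one
    const Block* current() const { return current_.get(); }
private:
    std::uint64_t keybits_on_disk_;
    std::unique_ptr<Block> current_;
};

// Read keybits once, loading a block first if none is resident.
// The caller keeps the returned value instead of dereferencing current()
// again for every query.
inline std::uint64_t cached_keybits(Index& idx) {
    if (idx.current() == nullptr) {
        idx.load_block();
    }
    return idx.current()->keybits();
}
```

With this shape, the per-read loop would take the cached value as a parameter, so a dropped block between queries no longer leads to a null dereference.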

Mantis hangs on large input

Hi,
I'm trying to run Mantis on quite a big experiment (~3000 samples). I've built CQFs with Squeakr-exact (k-mer size 28), and the total Squeakr output takes ~8 TB. I use cutoff=2 for each sample. It's larger than the example in the manuscript, but I think it should still work, right? I changed the constant MAX_NUM_SAMPLES to account for the larger number of samples.
Currently the mantis build process for this input has been running for 10 days and has hung with the following message:

Kmers merged: 10000000 Num eq classes: 1934823 Total time: 233712
Home slot: 2602766 Insertion slot: 30477349 Difference: 27874583 Fraction done: 0.000152

Do you think any other constants or anything else should be changed to account for bigger input size?
Thank you, Pavel

segfault

tim@tim-ThinkPad-T470:~/Dropbox/mantis$ make NH=1 coloreddbg

g++ -std=c++11 -Wall   -Ofast  -m64 -I. -Isdsl/include -Wno-unused-result -Wno-strict-aliasing -Wno-unused-function -Wno-sign-compare  coloreddbg.cc -c -o coloreddbg.o
gcc -std=gnu11 -Wall   -Ofast  -m64 -I. -Wno-unused-result -Wno-strict-aliasing -Wno-unused-function -Wno-sign-compare -Wno-implicit-function-declaration  cqf/gqf.c -c -o cqf/gqf.o
g++ -std=c++11 -Wall   -Ofast  -m64 -I. -Isdsl/include -Wno-unused-result -Wno-strict-aliasing -Wno-unused-function -Wno-sign-compare  hashutil.cc -c -o hashutil.o
g++ -std=c++11 coloreddbg.o cqf/gqf.o hashutil.o   -Ofast -lsdsl -lpthread -lboost_system -lboost_thread -lm -lz -lrt -o coloreddbg

tim@tim-ThinkPad-T470:~/Dropbox/mantis$ ./coloreddbg raw/incqfs.lst raw/experiment_cutoffs.lst raw/

Reading CQF 0 Seed 2038074761
Sample id SRR191411 cut off 1
Reading CQF 1 Seed 2038074761
Sample id SRR191403 cut off 1
Sampling eq classes based on 67108864 kmers.
Segmentation fault (core dumped)

This doesn't seem like the desired behavior... ?

"Can't allocate qf blocks" because key_bits is (apparently) wrong.

I'm trying to run mantis on a small test set, to work out bugs in my understanding before I try it on the usual ≈2600 experiment data set. I'm trying to follow the example described in the mantis paper's methods, and what is shown in this repo's readme. Full details of the steps I ran are below.

My example fails when I attempt to do mantis build. The specific error message is generated by cqf/gqf.c:
Can't allocate qf blocks: Cannot allocate memory
but some debugging/snooping reveals that the problem results from qf_init() being passed nslots=2^34 and key_bits being smaller than 34. I haven't been able to figure out why that has happened or where I have gone wrong.

Because key_bits < log2(nslots), the code ends up trying to allocate something close to 2^64 bytes.
While gqf.c contains an assert that would detect this error condition
assert(key_remainder_bits > 0)
asserts are hardwired off in that module.
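A minimal sketch of the arithmetic behind this, assuming the remainder width is key_bits minus log2(nslots) held in an unsigned 64-bit integer. The function names are hypothetical and this is not the code in gqf.c; it only shows how an unsigned subtraction wraps to a huge value when key_bits is too small, and what an explicit check in place of the disabled assert could look like.

```cpp
#include <cstdint>
#include <stdexcept>

// Bits needed to address nslots slots (nslots assumed to be a power of two).
inline std::uint64_t log2_slots(std::uint64_t nslots) {
    std::uint64_t bits = 0;
    while (nslots > 1) {
        nslots >>= 1;
        ++bits;
    }
    return bits;
}

// Unchecked version: unsigned subtraction wraps around when
// key_bits < log2(nslots), giving a value close to 2^64.
inline std::uint64_t remainder_bits_unchecked(std::uint64_t key_bits, std::uint64_t nslots) {
    return key_bits - log2_slots(nslots);
}

// Checked version: rejects the bad combination up front instead of
// relying on an assert that may be compiled out.
inline std::uint64_t remainder_bits_checked(std::uint64_t key_bits, std::uint64_t nslots) {
    const std::uint64_t quotient_bits = log2_slots(nslots);
    if (key_bits <= quotient_bits) {
        throw std::invalid_argument("key_bits must be larger than log2(nslots)");
    }
    return key_bits - quotient_bits;
}
```

For example, key_bits=25 (8 plus 17 slot bits) with nslots=2^34 wraps in the unchecked version, whereas key_bits=40 with nslots=2^20 yields 20 remainder bits.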

I note that if I do mantis build with the two example CQFs in the repo, it works. In that case, it seems qf_init is given key_bits=40. This suggests (to me) that my problem came from how I built my CQFs, but it's not at all clear to me what I did wrong.

Apparently the call to qf_init() results from this line in coloreddbg.cc
ColoredDbg<SampleObject<CQF*>, KeyObject> cdbg( ...
and that appears to be using key_bits that it derived from my CQF file. One guess is that it relates to the number of slots in that file, and it does seem that when I try my CQFs, key_bits=8 plus the number of slots in the CQF. But the example CQF files in the repo have file sizes that suggest 20 slots were used, yet somehow key_bits=40 for them.

FWIW I'm using squeakr and mantis source code pulled from the repos today.

What I ran:

My example consists of five small fastq files which should be publicly available at
https://github.com/medvedevgroup/SbtPlayData

Following the description in the mantis paper's methods, I used ntcard to determine the number of slots for each CQF. This table shows the results from ntcard and the calculation of slots.

#experiment  F0     f1     f2    s      log2Slots  slots   slots/s
EXPERIMENT1  32384  18880  1535  57857  17         131072  2.265
EXPERIMENT2  46400  27328  1087  83457  17         131072  1.571
EXPERIMENT3  34048  18816  1343  63169  17         131072  2.075
EXPERIMENT4  43712  24768  1279  80321  17         131072  1.632
EXPERIMENT5  21120  11584  1023  39169  16         65536   1.673

So I created CQFs like this:

squeakr_count -k 20 -s 17 -t 1 -g EXPERIMENT1.fastq.gz
squeakr_count -k 20 -s 17 -t 1 -g EXPERIMENT2.fastq.gz
squeakr_count -k 20 -s 17 -t 1 -g EXPERIMENT3.fastq.gz
squeakr_count -k 20 -s 17 -t 1 -g EXPERIMENT4.fastq.gz
squeakr_count -k 20 -s 16 -t 1 -g EXPERIMENT5.fastq.gz

And created file lists for mantis like this

echo EXPERIMENT1.fastq.gz.ser > incqfs.lst
echo EXPERIMENT2.fastq.gz.ser >> incqfs.lst
echo EXPERIMENT3.fastq.gz.ser >> incqfs.lst
echo EXPERIMENT4.fastq.gz.ser >> incqfs.lst
echo EXPERIMENT5.fastq.gz.ser >> incqfs.lst

echo "EXPERIMENT1 1" > experiment_cutoffs.lst
echo "EXPERIMENT2 1" >> experiment_cutoffs.lst
echo "EXPERIMENT3 1" >> experiment_cutoffs.lst
echo "EXPERIMENT4 1" >> experiment_cutoffs.lst
echo "EXPERIMENT5 1" >> experiment_cutoffs.lst

Then tried to run mantis like this (and got the error described above)

mkdir cdbg
mantis build -i incqfs.lst -c experiment_cutoffs.lst -o cdbg/

What am I doing wrong?

Thanks for any help,
Bob H

mergeMSTs: Build fails to find tbb/parallel_sort.h

Hi, I'm trying to build the code in the mergeMSTs branch. I am on Ubuntu 18.04. The build does not seem to find tbb/parallel_sort.h. The log is below.

niklas@phoenix:~/code/mantis2/build$ git branch
* (HEAD detached at origin/mergeMSTs)

niklas@phoenix:~/code/mantis2/build$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ -DSDSL_INSTALL_PATH=/home/niklas/code/sdsl-lite ..
SAN_LIB =  
Top-level source directory variable not set externally; setting it to /home/niklas/code/mantis2
-- Could NOT find TBB (missing: TBB_INCLUDE_DIRS tbbmalloc_proxy) (Required is at least version "2018.0")
Build system will fetch and build Intel Threading Building Blocks
==================================================================
TBB_INCLUDE_DIRS = /home/niklas/code/mantis2/external/install/include
TBB_LIBRARY_DIRS = /home/niklas/code/mantis2/external/install/lib
TBB_INSTALL_DIR = /home/niklas/code/mantis2/external/install
TBB_LIBRARIES = /home/niklas/code/mantis2/external/install/lib/libtbbmalloc.so;/home/niklas/code/mantis2/external/install/lib/libtbb.so;/home/niklas/code/mantis2/external/install/lib/libtbbmalloc_proxy.so;/home/niklas/code/mantis2/external/install/lib/libtbbmalloc.so;/home/niklas/code/mantis2/external/install/lib/libtbb.so
Adding /home/niklas/code/sdsl-lite/include to the include path
Adding /home/niklas/code/sdsl-lite/build/lib to the build path
-- Configuring done
-- Generating done
-- Build files have been written to: /home/niklas/code/mantis2/build

niklas@phoenix:~/code/mantis2/build$ make install
[ 25%] Built target libtbb
[ 28%] Building CXX object src/CMakeFiles/mantis_core.dir/kmer.cc.o
[ 31%] Building CXX object src/CMakeFiles/mantis_core.dir/query.cc.o
In file included from /home/niklas/code/mantis2/src/query.cc:42:0:
/home/niklas/code/mantis2/include/coloreddbg.h:38:10: fatal error: tbb/parallel_sort.h: No such file or directory
 #include "tbb/parallel_sort.h"
          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
src/CMakeFiles/mantis_core.dir/build.make:86: recipe for target 'src/CMakeFiles/mantis_core.dir/query.cc.o' failed
make[2]: *** [src/CMakeFiles/mantis_core.dir/query.cc.o] Error 1
CMakeFiles/Makefile2:162: recipe for target 'src/CMakeFiles/mantis_core.dir/all' failed
make[1]: *** [src/CMakeFiles/mantis_core.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

mantis build fails after reading only 1019 CQFs

I'm trying to do a mantis build with 2585 CQFs. After apparently reading 1018 CQFs it says it can't open the next one. See tail of the log below.

The file it fails to open exists, and doesn't seem to be corrupted.

If I shuffle my CQF file list (the file for the -i option), the build fails at file 1019 again (a different file this time).

My guess is this is caused by a limit in the number of files I'm allowed to open. Before I pester a sysadmin to change that, do you think that's the right conclusion? I'm not fluent in this sort of linux stuff. Googling suggests that "ulimit -Sn" tells me how many files I can open, and for me that says 1024.

Tail of the log:
...
Reading CQF 1017 Seed 2038074761
Sample id SRR1313181 cut off 20
Reading CQF 1018 Seed 2038074761
Sample id SRR1024127 cut off 10
Couldn't open file: CQFs/SRR1027178.exact.ser
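If the per-process descriptor limit is indeed the cause, the soft limit can be inspected and raised from inside a program with the POSIX getrlimit/setrlimit calls. The sketch below is a generic illustration with hypothetical helper names, not something mantis is known to do, and it cannot exceed the hard limit set by the administrator.

```cpp
#include <sys/resource.h>
#include <cstdint>

// Current soft limit on open file descriptors, or 0 if the query fails.
inline std::uint64_t soft_nofile_limit() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return 0;
    }
    return static_cast<std::uint64_t>(rl.rlim_cur);
}

// Try to raise the soft limit towards `wanted`, capped at the hard limit.
// Returns true if the soft limit is already high enough or was raised
// (possibly only up to the hard limit).
inline bool raise_nofile_limit(std::uint64_t wanted) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return false;
    }
    rlim_t target = static_cast<rlim_t>(wanted);
    if (rl.rlim_max != RLIM_INFINITY && target > rl.rlim_max) {
        target = rl.rlim_max;   // an unprivileged process cannot go above the hard limit
    }
    if (target <= rl.rlim_cur) {
        return true;            // nothing to do
    }
    rl.rlim_cur = target;
    return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}
```

With a soft limit of 1024, a few descriptors are already taken by stdin, stdout, stderr and any other open files, which would be consistent with a failure slightly before the 1024th CQF. From the shell, "ulimit -Hn" shows the hard limit that caps any such increase.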

humanRNA_40k bigger than expected

Hi,

I am interested in using the 39,400 experiments you posted here (https://github.com/splatlab/mantis/blob/mergeMSTs/experiments/humanRNA_40k.accessions). While not all of the data has been downloaded yet (around 500 experiments are still missing), the gzipped fastq files already take up 56 TB of space, which is far more than the 2.33 TB you reported in "An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using LSM Trees".

Do you have an idea where this difference in size could be coming from?

Best Wishes,
Mitra

Tag a release

I would like to try and make a bioconda package for mantis. Would you be able to tag a release to help with this?
Thanks!

Segfault

Hi, I'm currently getting a segfault when trying to run mantis on my own data. I can run on the sample data provided just fine. I definitely have enough memory allocated.

I get a segfault after it prints that CQF 0 is being read, but before it prints that the cutoff file was read, so I'm not sure whether I'm creating the cutoff file wrong. I'm not using samples with SRA accessions, so my cutoff file just has the paths to the fastqs used to generate the .ser files, plus the cutoffs. (It has the path to one of the fastqs, anyhow: I input two fastqs per squeakr file, as it's paired-end data. Squeakr named the .ser file after the first of the two, which is the path I then use in the cutoff file.) What exactly should the cutoff file look like when it's not SRA data?

Reading CQF 0 Seed 2038074761
Segmentation fault

Cannot allocate memory error in "mantis query"

Hi, I am having trouble using "mantis query" after successfully building the colored de Bruijn graph.

I used the following commands to generate the colored de Bruijn graph. No errors happened in this step.

mantis build -s 30 -i raw/batch1.txt -o raw/
mantis mst -p raw/ -t 8 -k

The raw folder contains the following files.
0_eqclass_rrr.cls
2_eqclass_rrr.cls
4_eqclass_rrr.cls
6_eqclass_rrr.cls
batch1.txt
dbg_cqf.ser
meta_info.json
sampleid.lst
1_eqclass_rrr.cls
3_eqclass_rrr.cls
5_eqclass_rrr.cls
7_eqclass_rrr.cls
boundaries.bv
deltas.bv
parents.bv

But when I query one fasta file, it showed an error message.
mantis query -p raw/ -o query.res raw/input.fa

[2021-09-20 17:23:14.541] [mantis_console] [info] Number of experiments: 50
[2021-09-20 17:23:14.542] [mantis_console] [info] Loading cqf...
Couldn't allocate memory for blocks.: Cannot allocate memory

Would you help me to figure out what is wrong with it?
Thanks a lot!

References to tbb prevent building

Hi, I am again having trouble building the software. I am failing to build with a different setup because of references to tbb. Below is a grep of all the places the string "tbb" appears in the source directory.

CMakeLists.txt: add_dependencies(mantis libtbb)
cqfMerger.cc:#include "tbb/parallel_sort.h"
cqfMerger.cc:// tbb::parallel_sort(tmpList.begin(), tmpList.end(),
cqfMerger.cc: tbb::parallel_sort(tmp_kmers.begin(), tmp_kmers.end(), [](auto &kv1, auto &kv2) {
cqfMerger.cc: tbb::parallel_sort(tmp_kmers.begin(), tmp_kmers.end(), [](auto &kv1, auto &kv2) {
grep: gqf: Is a directory
hierarchicalMantisConstructor.cc:#include "tbb/parallel_sort.h"
hierarchicalMantisConstructor.cc: tbb::parallel_sort(cmds.begin(), cmds.end(), [](auto &c1, auto &c2){
mantis.cc:#include "tbb/global_control.h"
mantis.cc: tbb::global_control c(tbb::global_control::max_allowed_parallelism, numThreads);
mst.cc://#include "tbb/parallel_sort.h"
mst.cc: tbb::parallel_sort(edgeList.begin(), edgeList.end(),
mst.cc: tbb::parallel_sort(bucket.begin(), bucket.end(),
mstMerger.cc: tbb::parallel_sort(cqfBlocks.begin(), cqfBlocks.end(), [](std::string &s1, std::string &s2) {
mstQuery.cc: // tbb::parallel_sort(query_colors.begin(), query_colors.end());
mstQuery.cc: // tbb::parallel_sort(v.begin(), v.end());
stat.cc:#include "tbb/parallel_sort.h"
stat.cc: tbb::parallel_sort(mccs.begin(), mccs.end(), [](const uint64_t &v1, const uint64_t &v2){
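One possible workaround, sketched here under the assumption that losing parallel sorting is acceptable: route the calls through a small wrapper with the same call shape as tbb::parallel_sort(first, last, comp) and back it with std::sort. sort_fallback is a hypothetical name, not part of mantis or TBB, and the tbb::global_control use in mantis.cc would still need separate handling.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>

// Serial replacement with the same (first, last, comp) shape as
// tbb::parallel_sort, so call sites only need the function name changed.
template <typename RandomIt, typename Compare>
void sort_fallback(RandomIt first, RandomIt last, Compare comp) {
    std::sort(first, last, comp);
}

// Overload without a comparator, matching calls that pass only a range.
template <typename RandomIt>
void sort_fallback(RandomIt first, RandomIt last) {
    using value_type = typename std::iterator_traits<RandomIt>::value_type;
    std::sort(first, last, std::less<value_type>());
}
```

The results are the same as with a parallel sort for a given comparator, up to the order of equal elements; only the running time on large inputs differs.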

Seg fault

Whenever I try to build with mantis, I end up with a segmentation fault. This happens regardless of whether I use NH=1 when I make mantis. This is what it outputs whenever I use a command like this:

./mantis build -i HGGs/testing/incqfs.lst -c HGGs/testing/experiment_cutoffs.lst -o HGGs/mantis_output/
Reading CQF 0 Seed 2038074761
Segmentation fault

My input file looks like this:

HGGs/exact_squeakr_output/SRR7050121_1.fastq.gz_exact.ser
HGGs/exact_squeakr_output/SRR7050122_1.fastq.gz_exact.ser

And my experimental cutoff file looks like this:

SRR7050121_1.fastq.gz 20
SRR7050122_1.fastq.gz 20
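One thing worth checking with inputs like these is whether the sample id derived from each CQF path actually appears as a key in the cutoff file. The sketch below assumes, based only on log lines such as "Sample id SRR191411", that the id is the file's base name truncated at the first '.' and then at the first '_'. That rule is an assumption and not confirmed from the mantis source; sample_id_from_path and has_cutoff are hypothetical helpers.

```cpp
#include <string>
#include <unordered_map>

// ASSUMED rule: strip the directory, then cut at the first '.',
// then cut at the first '_'.
inline std::string sample_id_from_path(const std::string& path) {
    const std::string::size_type slash = path.find_last_of('/');
    std::string base = (slash == std::string::npos) ? path : path.substr(slash + 1);
    base = base.substr(0, base.find('.'));   // find() == npos keeps the whole string
    base = base.substr(0, base.find('_'));
    return base;
}

// True if the cutoff table has an entry for the id derived from cqf_path.
inline bool has_cutoff(const std::unordered_map<std::string, unsigned>& cutoffs,
                       const std::string& cqf_path) {
    return cutoffs.find(sample_id_from_path(cqf_path)) != cutoffs.end();
}
```

Under that assumed rule, HGGs/exact_squeakr_output/SRR7050121_1.fastq.gz_exact.ser maps to SRR7050121, which would not match a cutoff line keyed by SRR7050121_1.fastq.gz.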

potential mismatch between sdsl lib and include

(I considered adding this to issue #2, but wasn't sure it would be seen since that issue is closed)

I'm trying to build mantis, but I'm concerned that the build process uses its own copy of the sdsl includes yet expects to link to libsdsl that I've built from different sources. It would seem that any interface difference between those is a potential segfault. Or am I worrying about nothing?

If we assume that I've cloned and built sdsl (which I have, assuming this means sdsl-lite), why wouldn't the build want to use the includes that go with the library it's going to link with?

The discussion at issue 2 shows how I can easily point the build at my libsdsl. But digging into the makefile it looks like it really wants to use those local sdsl includes.

Segfault after "Fraction done: 0.858421"

Hi, now I'm getting a segfault after it runs for about 15 minutes. It's after it prints:

Home slot: 14747564668 Insertion slot: 14747564668 Difference: 0 Fraction done: 0.858421
Kmers merged: 290000000 Num eq classes: 7
...
Kmers merged: 330000000 Num eq classes: 7

I ran gdb and the backtrace is:

Program received signal SIGSEGV, Segmentation fault.
0x000000000043e8ed in insert1 ()
Missing separate debuginfos, use: debuginfo-install zlib-1.2.3-29.el6.x86_64
(gdb) backtrace
#0  0x000000000043e8ed in insert1 ()
#1  0x0000000000451335 in ColoredDbg<SampleObject<CQF<KeyObject>*>, KeyObject>::add_kmer(KeyObject&, BitVector&) ()
#2  0x0000000000451ea3 in ColoredDbg<SampleObject<CQF<KeyObject>*>, KeyObject>::construct(SampleObject<CQF<KeyObject>*>*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >&, spp::sparse_hash_map<BitVector, std::pair<unsigned long, unsigned long>, sdslhash<BitVector>, std::equal_to<BitVector>, spp::libc_allocator<std::pair<BitVector const, std::pair<unsigned long, unsigned long> > > >&, unsigned long) ()
#3  0x000000000044a7a9 in build_main(BuildOpts&) ()
#4  0x000000000040954e in main ()

Any ideas as to what the problem might be?

Segmentation faults on tiny dummy inputs (possible race condition?)

When I was testing mantis on some dummy files, I noticed that it segfaults most of the time when I run it (at different times during execution), but it occasionally runs to completion.

The script below reproduces the issue (I've symlinked squeakr-count built from the exact branch and mantis to my data folder, but this issue also occurs when I call them in their build folders)

#!/bin/bash

bzcat SRR403012.fastq.bz2 | head -n 8 > test.fq
./squeakr-count -f -k 12 -s 20 -t 1 test.fq
echo "test.fq_exact.ser 1" > inputs.lst
./mantis build -s 20 -i inputs.lst -o .

This results in the input fastq file

@SRR403012.1 HWUSI-EAS108E_0007:2:1:9625:985 length=49
GAAGGGTTAACTTAAGCGAGATCCNAGCAGAGAAGCAGGTCGAAGNNNN
+SRR403012.1 HWUSI-EAS108E_0007:2:1:9625:985 length=49
,,*).....@C@C@@C@C@##############################
@SRR403012.2 HWUSI-EAS108E_0007:2:1:10006:990 length=49
GCTTGTTTGGGAAGTGGCATTCATNGTGCTCCAGGGGCGGGTGGGNNNN
+SRR403012.2 HWUSI-EAS108E_0007:2:1:10006:990 length=49
#################################################

I have attached the log from 7 runs of the above script: runlog.txt

When I run the mantis command through valgrind, I notice several invalid reads and writes, but it runs to completion. I have attached the log: valgrind.txt
Let me know if I can provide any other information or logs to help debug this.

Thanks!

Compilation Error Due to Missing Files

When trying to compile mantis, I'm getting an error when trying to make the coloreddbg object (obj/coloreddbg.o). This is due to the fact that there are two includes in coloreddbg.cc for which the files seem to be missing: MantisFS.h and sparsepp/spp.h.

Should there be a MantisFS file in this repo somewhere, and is sparsepp a separate library which needs to be included (from here I assume via a quick Google search: https://github.com/greg7mdp/sparsepp)?
