medvedevgroup / sibeliaz

A fast whole-genome aligner based on de Bruijn graphs

License: Other

CMake 0.47% C++ 89.67% Shell 1.33% Python 3.54% Makefile 4.99%

locally-collinear-blocks whole-genome-alignment de-bruijn-graphs genomics graph-algorithms alignment comparative-genomics bioinformatics


sibeliaz's Issues

Feature request: xmfa format

Hello,

Great software! A lot of tools (ClonalFrameML, for example) rely on the XMFA format. If it's not too much work, maybe you could add it to the possible outputs?

Best,

twopaco: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory

Hi there,
I used conda to install SibeliaZ:
conda install -c bioconda sibeliaz
However, when I tried to run the program, I encountered the following error:
(base) -bash-4.2$ sibeliaz assembly_mini.fasta
Constructing the graph...
twopaco: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory
TwoPaco: 0.00 seconds elapsed, 216 KB memory used
Loading the graph...
error: Can't read the input file
SibeliaZ-LCB: 0.05 seconds elapsed, 2048 KB memory used
rm: cannot remove './sibeliaz_out/de_bruijn_graph.dbg': No such file or directory
bash: global_alignment: command not found
Alignment: 0.05 seconds elapsed, 1724 memory used

Is this issue related to the installation of required libraries? Do you have any idea how to solve the problem?
Thank you

make install step : ";"

When I run the make install step of the installation procedure, I see the following error at 58%:

[ 58%] Building CXX object TwoPaCo/src/graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o
cd ${BUILDDIR}/TwoPaCo/src/graphdump && ${GCC_INSTALL_DIR}/bin/g++   -I${BUILDDIR}/TwoPaCo/src/graphdump/../common  -O2 -ftree-vectorize -march=native -fno-math-errno;-std=c++11 -lstdc++ -O3 -DNDEBUG   -o CMakeFiles/graphdump.dir/graphdump.cpp.o -c ${BUILDDIR}/TwoPaCo/src/graphdump/graphdump.cp

I do not know where this ";" comes from between -fno-math-errno and -std=c++11, but it should be replaced by a space.

question about LCB determination

I was interested in how the LCBs called by SibeliaZ change as I include more genomes, so to start, I ran it with two ~4.5Mb genomes that differed only by a single ~10kb inversion that I introduced. I expected to see something like 3 LCBs since the sequences are otherwise identical, but I got around 2100. I can't find anything in the algorithm description in your paper to explain this. There should only be bubbles in the vicinity of the inversion that I introduced, but nowhere else. Is the large number of LCBs for these near-identical genomes expected behavior, and if so, why?

Question about block_coords.gff output

Hi,

Thanks for your work on SibeliaZ, it looks quite interesting as a potential component for an alignment pipeline I am building. I do have a question though; I am interested in extracting the blocks in the output gff file and converting it to another format to match that currently used by my pipeline, but it seems like I cannot easily map the contig names back to the source file(s). Is there a way you could add the source file for the contig ID of each block as an attribute per the GFF3 spec? I'll try to think of another solution myself in the meantime.
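
Until such an attribute exists, one workaround is to build the mapping from the input files directly. The sketch below assumes that the contig names in the GFF match the FASTA header text up to the first whitespace; that assumption, and the function name `contig_sources`, are illustrative and not part of SibeliaZ.

```python
def contig_sources(fasta_paths):
    """Map each FASTA record ID to the list of files that define it.

    Assumption: the record ID is the header text up to the first
    whitespace, and that is what appears in the GFF seqid column.
    """
    sources = {}
    for path in fasta_paths:
        with open(path) as handle:
            for line in handle:
                if not line.startswith(">"):
                    continue
                fields = line[1:].split()
                if not fields:
                    continue  # header with no ID, nothing to record
                # Keep every file per ID so duplicate names stay visible.
                sources.setdefault(fields[0], []).append(str(path))
    return sources
```

An ID that maps to more than one file signals a name clash between inputs, which would make the back-mapping ambiguous.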

Unexpected output

Hi,

Thanks a lot for this huge tool. I had a very good experience with it a couple of years ago (version 1). I now have other runs to do, so I updated to the latest version. I was a bit surprised by the output: instead of getting one LCB per genome for a unique block ID, I get several. To try to understand what was wrong, I reran my previous analysis with the same settings, but instead of getting on average one block coordinate per genome in blocks_coords.gff, I now get many.

Is this change expected between V1 and the current version? Is it linked to the -a option? What would you advise?

Thanks a lot,

Johann

Validation in Bacteria

I feel like this type of method would be a fantastic competitor to bacterial whole-genome alignment software such as Mauve/progressiveMauve. Have you tried validating this software on bacterial datasets at all? I'm looking to use a program like this in my own research analyzing bacterial genomes.

support gzip'd input fasta

Running SibeliaZ v1.2.4 on gzip'ed input fasta files results in the following error:

Error: The FASTA header should start with a '>', started with ''
Loading the graph...
error: Can't read the input file

Given that users will almost certainly have their genome FASTA files gzip'ed if they have hundreds or thousands (or more), it would be great if SibeliaZ supported gzip'ed (and maybe bzip2) compressed input.
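
As a stopgap while compressed input is unsupported, the files can be decompressed before the run. This is a minimal sketch using only the Python standard library; the function name `decompress_fasta` is hypothetical, and the sketch does nothing SibeliaZ-specific.

```python
import gzip
import shutil


def decompress_fasta(gz_path, out_path):
    """Write a plain-text copy of a gzip-compressed FASTA file."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Stream in chunks so large genomes are never held in memory at once.
        shutil.copyfileobj(src, dst)
    return out_path
```

The decompressed copies can then be passed to sibeliaz as ordinary FASTA files and deleted afterwards.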

Genome size

Hi,
I need to align multiple mammalian-sized genomes (4 genomes of 2.7Gb each, 11Gb in one fasta). My question is about the limit on chromosome length (4294967296 bp). Does it apply to the sum of all chromosomes in the fasta file, or to the maximum length of a single chromosome?

Thank you in advance
Andrea
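
The quoted figure, 4294967296, equals 2**32. Four genomes of 2.7 Gb total about 10.8 Gb, which exceeds that figure, while any single sequence within a 2.7 Gb genome is necessarily below it, so the two readings give different answers for this dataset. The sketch below only measures both quantities for a plain-text FASTA file; it does not settle which reading SibeliaZ applies, and the function names are hypothetical.

```python
LIMIT = 4294967296  # the figure quoted in the question, equal to 2**32


def fasta_lengths(path):
    """Return {record ID: sequence length} for a plain-text FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


def summarize(lengths, limit=LIMIT):
    """Compare the longest record and the overall total against the limit."""
    longest = max(lengths.values(), default=0)
    total = sum(lengths.values())
    return {
        "longest": longest,
        "total": total,
        "longest_under_limit": longest < limit,
        "total_under_limit": total < limit,
    }
```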

Add multi Fasta files parsing

Subk

Getting alignments after running with "-n"

Thanks for your nice tool.

Once I've run SibeliaZ once using the -n switch to compute just the blocks, is there a way to then compute the alignments using the blocks that have already been computed?

This feature would save time for me, since I'm using large genomes.

Alternatively, it would provide another use case, where I could specify a new, or modify the existing, blocks_coords.maf file, using the next call to SibeliaZ in an alignment step.

bash: line 11, 12 and 13: No such file or directory

Hello,
I recently installed SibeliaZ using conda on the cluster where I'm working, as shown in the installation section. But when I try to align 2 genomes, such as the 2 example genomes from the GitHub repository, I get the following output:
`sibeliaz genome1.fa genome2.fa
Constructing the graph...
Threads = 8
Vertex length = 25
Hash functions = 5
Filter size = 17179869184
Capacity = 1
Files:
genome1.fa
genome2.fa

Round 0, 0:17179869184
Pass Filling Filtering
1 25 3
2 0 0
True junctions count = 79040
False junctions count = 0
Hash table size = 79040
Candidate marks count = 500858

Reallocating bifurcations time: 0
True marks count: 500858
Edges construction time: 1

Distinct junctions = 79040

Loading the graph...
Analyzing the graph...
[...................................................]
Generating the output...
Blocks found: 1350
Coverage: 1.00
Performing global alignment..
bash: line 11: : No such file or directory
bash: line 12: : No such file or directory
bash: line 11: : No such file or directory
bash: line 12: : No such file or directory
bash: line 13: : No such file or directory
bash: line 13: : No such file or directory
bash: line 11: : No such file or directory
bash: line 12: : No such file or directory`
And more of the same error.
I tried reinstalling SibeliaZ, as I thought it might not have been installed correctly, but the same error keeps appearing. I'm not sure what I may be doing wrong or why this problem appears, because it happens regardless of which genomes I use.
Thanks for the help.

Overlapping MAF blocks

I have the following MAF blocks:

s SP_St1-2_v1.CTG1 972562 116 - 974421     CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St9-1_v1.CTG4 2007581 116 + 3546038   CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St21-1_v1.CTG2 2011797 116 + 3562354  CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St14-3_v1.CTG53 117896 116 + 301556   CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St3-3_v1.CTG1 2007174 116 - 3540787   CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St1-2_v1.CTG272 3796 116 - 329270     CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St22-2_v1.CTG59 1546482 116 - 2763569 CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA

s SP_St9-1_v1.CTG4 1538259 104 - 3546038   ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St3-3_v1.CTG1 1533415 104 + 3540787   ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St22-2_v1.CTG59 1216889 104 + 2763569 ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St1-2_v1.CTG272 325276 104 + 329270   ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St21-1_v1.CTG2 1550359 104 - 3562354  ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St1-2_v1.CTG1 1658 107 + 974421       GCTTACCGCCACTATCTAGAGCGCTTTTAGAT-CCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG

Looking at SP_St1-2_v1.CTG1, it seems that the two subsequences from the blocks overlap:

CCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
                                                      GCCTGATTCTCGTTTAAGATGCATGGGTACGAAGAGACTGCTTATGCATACGTAATAGACTTGTCCTCTAGTAAGGACGTATACCTTAGCAGAGGATGGATGGACCACTAGGACG

The subsequence from block 1 is in RC.

Is that an intended behaviour? This is also what is causing #1
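
The overlap can also be checked from the coordinates alone. In MAF, the start of a row on the "-" strand is counted from the start of the reverse-complemented source sequence, so it has to be converted before comparing it with a "+" row. The sketch below applies that conversion to the two SP_St1-2_v1.CTG1 rows quoted above; the helper names are made up for illustration.

```python
def forward_interval(start, size, strand, src_size):
    """Convert MAF row coordinates to a half-open forward-strand interval.

    For a '-' row, MAF counts start from the beginning of the
    reverse-complemented source, so it is flipped using src_size.
    """
    if strand == "-":
        start = src_size - start - size
    return start, start + size


def overlap_length(a, b):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


# SP_St1-2_v1.CTG1 rows from the two blocks quoted above.
first = forward_interval(972562, 116, "-", 974421)   # (1743, 1859)
second = forward_interval(1658, 107, "+", 974421)    # (1658, 1765)
print(overlap_length(first, second))                 # 22
```

By these coordinates the two rows share forward-strand positions 1743 to 1764, which agrees with the overlapping sequence shown above.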

# running issue

Hi, SibeliaZ is a super useful piece of software! But recently I have not been able to use it smoothly. I just ran: sibeliaz -k 15 -n *.fasta, but the de_bruijn.bin file was not being processed, and when I used top to look at the process, it was not running. Then I found the sibeliaz script file and followed it step by step: first I ran twopaco and then sibeliaz_lcb, and it worked. So I want to ask whether it is right to split the process like that, and why I cannot use sibeliaz directly. Thanks a lot!

Command terminated by signal 11 (Segmentation Fault)

I have been trying to align 798 fasta files, but it shows this error:
"Round 0, 0:17179869184
Pass Filling Filtering
Command terminated by signal 11
TwoPaco: 44.78 seconds elapsed, 2280796 KB memory used
Loading the graph...
error: Can't read the input file
Command exited with non-zero status 1"

It works fine if I use fewer fasta files (for example, 353).
Is there any solution for this?
Thanks in advance.

How can I create a UCSC bigMaf track from a SibeliaZ .maf?

Following the steps from UCSC (http://200.144.254.2:8001/goldenpath/help/bigMaf.html), I cannot create bigMaf.txt:

mafToBigMaf genome1 examples/sibeliaz_out/alignment.maf stdout | sort -k1,1 -k2,2n > genome1bigMaf.txt
reference databases (genome1) must be first component of every block on line 20

I would appreciate any suggestion
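
Taking the error message literally, each block needs its reference row first. The sketch below reorders the rows of one MAF block accordingly and returns None for blocks that contain no reference row. It assumes the reference rows can be recognised by a name prefix; whether mafToBigMaf additionally expects a particular naming scheme for source names is not established here, and `reference_first` is a hypothetical helper.

```python
def reference_first(block_lines, reference_prefix):
    """Reorder one MAF block so that reference rows come first.

    block_lines: the lines of a single block ('a' line plus 's' rows).
    reference_prefix: assumed prefix of the reference source names.
    Returns None when no row in the block belongs to the reference.
    """
    def is_reference(row):
        return row.split()[1].startswith(reference_prefix)

    other_lines = [l for l in block_lines if not l.startswith("s ")]
    rows = [l for l in block_lines if l.startswith("s ")]
    if not any(is_reference(r) for r in rows):
        return None
    # sorted() is stable, so rows keep their relative order within each group.
    rows = sorted(rows, key=lambda r: not is_reference(r))
    return other_lines + rows
```

Blocks for which the function returns None would have to be dropped or handled separately before conversion.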

SibeliaZ/spoa/vendor/bioparser/include/bioparser/parser.hpp:124:19: error: 'gzFile_s' was not declared in this scope

This is the error that pops up during "make install"; I'm not sure what is going on.

sibeliaZ outputs

Hi,

On the README page, I get the impression that the output will contain two files: a GFF file "blocks_coords.gff" containing the coordinates of the blocks, and a file "alignment.maf" with the actual alignment.

However, I have run sibeliaz on my data twice, and both times I only got "blocks_coords.gff", with "alignment.maf" missing. No error message is given. Have I done anything wrong here? Does any parameter need to be turned on or off?

Also, I want to generate a pan-genome graph in GFA format. I suppose this can be achieved by generating "alignment.maf"?

Thanks a lot!

Best regards,

Runxuan

Conda installation doesn't include maf_to_gfa1.py

I recently installed SibeliaZ in a separate environment using conda and ran the program on two strains of a species. I got the MAF and GFF output files. However, when I try to convert the MAF file for visualization using the Python script maf_to_gfa1.py, I get a "command not found" error.

Conda and Program Version:

conda 4.11.0
sibeliaz-lcb  version: 1.2.4

~~Solutions Tried (without much luck)~~
~~1. Manually adding the file (maf_to_gfa1.py) from the github repo to bin/ directory of conda environment.~~
~~I got a Python not found error.~~
~~2. Added the file to a different environment with python installed.~~
~~I got a Syntax Error likely because of Python3 instead of Python2?~~

Solution / Workaround:

Install Python2 using conda. I installed python=2.7.14.
Install numpy and biopython to the same environment which has python2.
Make the script accessible from command line. I added maf_to_gfa1.py to ~/bin. The script is available here.
Activate the python2 environment and try python maf_to_gfa1.py -h

Inaccurate coordinates in MAF file

After running SibeliaZ on two fasta files, I found that the coordinates differ from those in the original fasta files, and the sequences do not match back to the reference or query file. Both input files contain hard-masked regions. Does the output remove the length of the hard-masked regions from the original positions? How can I get the same coordinates as in the original files? Otherwise, postprocessing will be difficult when the reference is an established genome with annotations.

Large genome files

Hello,

I have two large genome files and would like to use SibeliaZ. I recently read this on a biostars thread "Yes, by SibeliaZ you can get .gfa file out of a fasta file. They recently updated there old version and now large genome file can also be used for the same."
But I cannot find any confirmation that SibeliaZ supports large genome files in your version history. Can you confirm that?

Thanks a lot!
Elissa

Empty MAF file in output

I'm getting just one line in the MAF output file,
"maf version=1", and fasta files in the 'blocks' folder.
I have 4GB of RAM; is that not enough?
I even tried a very small input file and still got no information in the MAF file.
Getting the MAF output would be very helpful. Thanks in advance.

(support oneTBB) fatal error: tbb/mutex.h: No such file or directory

I'm using Ubuntu 22.04 and cannot compile Sibeliaz:

In file included from /.../software/SibeliaZ/TwoPaCo/src/graphdump/graphdump.cpp:17:
/.../software/SibeliaZ/TwoPaCo/src/graphdump/../common/streamfastaparser.h:8:10: fatal error: tbb/mutex.h: No such file or directory
    8 | #include <tbb/mutex.h>
      |          ^~~~~~~~~~~~~

The problem is with the TwoPaCo submodule and is related to this bug report:
medvedevgroup/TwoPaCo#28

If I'm reading that report correctly, the only way to use SibeliaZ would be to somehow apply a patch from the pufferfish repo. Is that correct?

Is there a script to make a Circos plot?

After running the command sibeliaz -t 10 genome1.fa, the output is as follows:
-rw-rw-r-- 1 shu shu 1.1G Jun 8 18:59 alignment.maf
drwxr-xr-x 2 shu shu 38M Jun 8 18:59 blocks
-rw-rw-r-- 1 shu shu 392M Jun 8 17:32 blocks_coords.gff
and the blocks/ directory is empty. We want to get Circos input data like that produced by Sibelia 3.0.7, but for input larger than 1 GB. Thank you very much for your help.

Is it suitable for a very big genome?

Hi,
I have a big genome with 46,139,523,234 bases and 20131 contigs.
Here is the summary of the longest contigs.

ptg000004l      169819904
ptg000441l      158822330
ptg000279l      109104046
ptg000035l      107045360
ptg000669l      100503328
ptg000533l      90735505
ptg000066l      87495606
ptg000800l      85918877
ptg000855l      82319672
ptg000667l      80863498

I want to do pairwise genome alignment for it and am curious about SibeliaZ's ability to handle big genomes.

Reduce number of files created?

Hi,

I've noticed that Sibeliaz creates a very large number of small files in the "alignment" folder while running. I saw about 1.7 million files generated for 3 genomes (sizes 400, 500 and 1000 Mb). Is there any way to reduce this footprint, even if it reduces speed significantly? I'm working on a system with an inode quota, so for larger alignments I'm likely to be unable to run Sibeliaz, even though the files only need to be around temporarily.

Feature Request - Unaligned Sequences

Dear Medvedev-Group,

SibeliaZ also works fantastically on large, repetitive genomes. I would like to request one feature, though. Would it be possible to add unaligned sequences to the output MAF at the end? Some tools require a MAF file from which the input genomes used for the alignment can be fully recovered.

Thanks,
Felix

Input file format

Hi,

Can you confirm how the input genomes should be formatted? Does each genome need to be in the same fasta file? If all genomes are in the same file, how are genomes with multiple contigs dealt with? Will the program identify alignments within each input genome as well as between genomes?

Thanks,

Tom.

please include ##sequence-region pragmas in blocks_coords output

In the GFF specification, there's an optional-but-highly-recommended header pragma for specifying the boundaries of every sequence described in the file:

##sequence-region seqid start end
The sequence segment referred to by this file, in the format "seqid start end". This element is optional, but strongly encouraged because it allows parsers to perform bounds checking on features. There may be multiple ##sequence-region directives, each corresponding to one of the reference sequences referred to in the body of the file, however only one ##sequence-region directive may be given for any given seqid.

It would be extremely useful to have this incorporated in SibeliaZ's blocks_coords.gff output. I came across this need while trying to analyze coverage of the raw LCBs, but had to resort to sourcing that information from the original inputs. I'm now not actually sure how maf2synteny computes coverage for the synteny blocks it determines if it doesn't have a way of knowing the full genome sizes of all the samples.
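
In the meantime, the directives can be generated from the input sequence lengths and prepended to the GFF by hand. A minimal sketch, assuming a dictionary of sequence lengths has already been collected from the original FASTA inputs; the function name and the example IDs are hypothetical.

```python
def sequence_region_lines(lengths):
    """Build GFF3 ##sequence-region directives from {seqid: length}.

    GFF3 coordinates are 1-based and inclusive, so a sequence of
    length n spans 1..n.
    """
    return [f"##sequence-region {seqid} 1 {n}" for seqid, n in lengths.items()]


# Hypothetical sequence IDs and lengths, for illustration only.
for line in sequence_region_lines({"contig_1": 1000, "contig_2": 250}):
    print(line)
```

Each seqid appears once in the dictionary, which matches the rule quoted above that only one directive may be given per seqid.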

Problem with the output

I am using the package named SibeliaZ-1.2.0-Linux.

I ran the following command: ./sibeliaz -t 6 /home/lychee/dissertation/ZKV_99cutoff.fasta

However, I ran into some problems:
Constructing the graph...
Threads = 6
Vertex length = 25
Hash functions = 5
Filter size = 17179869184
Capacity = 1
Files:
/home/lychee/dissertation/ZKV_99cutoff.fasta

Command terminated by signal 9
TwoPaco: 5.64 seconds elapsed, 1727848 KB memory used
Loading the graph...
error: Can't read the input file
Command exited with non-zero status 1
SibeliaZ-LCB: 0.00 seconds elapsed, 3884 KB memory used
rm: cannot remove './sibeliaz_out/de_bruijn_graph.dbg': No such file or directory
Performing global alignment..
find: ‘./sibeliaz_out/blocks’: No such file or directory
find: ‘./sibeliaz_out/blocks’: No such file or directory
Alignment: 0.02 seconds elapsed, 3868 memory used

I'm not sure what the problem is.

Installation issues

Hi,

First of all, thanks for taking the time to answer these questions and for providing this software.

I'm having several issues trying to install SibeliaZ on the cluster. Using the installation method in the README, the make files can't seem to find libraries necessary for building the programs. I fixed most of these by adding full paths, but I can't fix this error that I get:

file INSTALL cannot copy file
"/storage/home/users/x/FISH_WGA/SibeliaZ/build/spoa/lib/libspoa.a" to
"/usr/local/lib64/libspoa.a".

I don't have admin access to be able to copy files to the /usr/ directory.

When I tried binary executables that you provided access to in another comment, the executables seem to be looking for libraries elsewhere and I'm not sure how to alter this:

./graphdump: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./graphdump)
./graphdump: /usr/local/Modules/modulefiles/tools/gcc/4.9.3/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./graphdump)

Is there any way to easily download and install the program without having to provide full paths or adjust library paths in binary files? I feel like I may be doing something wrong.

Thanks,
Leeban

Q on coverage, multi-genome alignment, and maf2synteny

Hi, thanks for this fast and easy-to-run aligner :) I have two questions.

How is the "coverage" in the report calculated? I got a coverage value of 0.95 from a human GRCh38 to human T2T alignment run. Is the coverage = (total non-overlapping length covered by the alignment) * 2 / (length of genome 1 + length of genome 2)?
The instructions say sibeliaz genome1.fa genome2.fa, but what if I want to align more than two genomes? Can I just keep adding genome3.fa genome4.fa ...? And in that case, how would the coverage be calculated?

Thanks a lot!
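
For reference, here is one way the quantity in the first question could be computed and extended to any number of genomes: merge the aligned intervals within each genome, sum the covered bases over all genomes, and divide by the total genome length. This is a sketch of that hypothesis only. It is not confirmed to be how SibeliaZ computes its reported coverage, and the function names and numbers are made up.

```python
def merged_length(intervals):
    """Total length covered by half-open intervals, counting overlaps once."""
    total = 0
    end_so_far = None
    for start, end in sorted(intervals):
        if end_so_far is None or start > end_so_far:
            total += end - start          # disjoint from everything seen so far
            end_so_far = end
        elif end > end_so_far:
            total += end - end_so_far     # only the part extending past the merge
            end_so_far = end
    return total


def coverage(intervals_per_genome, genome_lengths):
    """Covered bases summed over genomes, divided by total genome length."""
    covered = sum(merged_length(iv) for iv in intervals_per_genome.values())
    return covered / sum(genome_lengths.values())


# Made-up example: genome g1 fully covered, g2 covered over 90 of 100 bases.
example = {"g1": [(0, 50), (40, 100)], "g2": [(0, 90)]}
print(coverage(example, {"g1": 100, "g2": 100}))
```

With two genomes that are covered equally, this reduces to the formula proposed in the question.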

PackagesNotFoundError: The following packages are not available from current channels: - sibeliaz

Hello,
I am trying to install SibeliaZ in a conda environment using the following commands:
conda install -c bioconda sibeliaz and conda install sibeliaz, but I keep getting the error:

`Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

sibeliaz

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.`
Any advice?

command terminated by signal 9

I was trying to perform global alignment on 887 fasta files, but it gives me this error:

Loading the graph...
Analyzing the graph...
[.Command terminated by signal 9
SibeliaZ-LCB: 3450.12 seconds elapsed, 243401532 KB memory used
Performing global alignment..

Is this a memory issue or something else?

IndexError: string index out of range

Hi,
When I run the maf_to_gfa1.py script to convert alignment.maf to GFA format, the following error occurs:

Traceback (most recent call last):
  File "/home/cuixb/tools/biosoft/SibeliaZ-1.2.1/SibeliaZ-LCB/maf_to_gfa1.py", line 177, in <module>
    blocks, sequence = split_maf_blocks(args.maf)
  File "/home/cuixb/tools/biosoft/SibeliaZ-1.2.1/SibeliaZ-LCB/maf_to_gfa1.py", line 102, in split_maf_blocks
    next_profile = profile(maf, next_column)
  File "/home/cuixb/tools/biosoft/SibeliaZ-1.2.1/SibeliaZ-LCB/maf_to_gfa1.py", line 46, in profile
    return [group[i].body[column] == '-' for i in xrange(len(group))]
IndexError: string index out of range

And part of the alignment.maf file:

##maf version=1
# sibeliaz v.1.2.1
# cmd=-f 64 -t 28 -o westar_kale_chrA01 data/westar.fa.split/westar.id_chrA01.fa data/kale.fa.split/kale.id_kale_chrA01.fa

a
s kale_chrA01 19067038 227 + 40689054 GTTTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCCTTTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 18728550 226 + 40689054 >1_1
s kale_chrA01 21852872 224 - 40689054 GTTTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 21847912 224 - 40689054 >1_2
s kale_chrA01 18894209 224 + 40689054 --TTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 18905069 226 + 40689054 >1_3
s kale_chrA01 18937683 224 + 40689054 --TTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 18942636 226 + 40689054 >1_4
s kale_chrA01 21656164 226 - 40689054 --TTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 19062092 225 + 40689054 >1_5
s kale_chrA01 18723593 225 + 40689054 GTTTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 21620759 224 - 40689054 >1_6
s kale_chrA01 21380478 224 - 40689054 --TTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACATGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 21368346 224 - 40689054 >1_7
s kale_chrA01 21317989 224 - 40689054 GTTTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCGGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 21307298 224 - 40689054 >1_8
s kale_chrA01 19477728 226 + 40689054 GTTTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGGTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCCCTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 19488373 226 + 40689054 >1_9
s kale_chrA01 19756575 226 + 40689054 --TTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCCTTTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TT
GCACTAGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s kale_chrA01 20912571 212 - 40689054 >1_10
s chrA01 28065206 226 + 46056803 -TTTACAAGTATTAATAGAGAGAGCACCAAGGAAATTCGAAATGGTTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TTGCACT
AGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCACTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-
s chrA01 27205390 226 + 46056803 >1_11
s chrA01 27210347 200 + 46056803 --TTACAAGTATTAATAGAGAGAGCAACAAGGAAATTCGAAATGGGTAAGCATGTGTAGTCAAAGGACAGGCTGGAACTCC-TTTTGAATCACTTGGCTGTGCTTTCTCACATGC-TTGCACT
AGTAT--AAAGGTAACTTCTCCTTTCCAGCATCATACAGGCTGTC-AAAGTGATCCCTTATCCTTCCTTAACCTCCCTTATCCTCTTTGGTCGAGTTTCCTCTCTTCT-

So, how can I solve this error?
Thank you in advance!
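
The traceback fails while indexing a column of a row's sequence, which would happen if the rows of a block do not all have the same aligned length. In the excerpt above, some sequences continue on a second line and some "s" rows carry text such as ">1_1" where a sequence is expected. The sketch below flags such rows before conversion. It is a hypothetical pre-check written for Python 3, not part of maf_to_gfa1.py, and it reports any line type other than "a", "s" and comments as a problem.

```python
def check_maf_rows(lines):
    """Report lines that do not look like well-formed MAF 's' rows.

    A row is expected to have seven whitespace-separated fields, a
    sequence made only of letters and '-', and the same aligned length
    as the first valid row of its block.
    """
    problems = []
    expected = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line == "a" or line.startswith("a "):
            expected = None  # a new alignment block begins
            continue
        if not line.strip() or line.startswith("#"):
            continue
        if not line.startswith("s "):
            problems.append((number, "not an 's' row (wrapped sequence?)"))
            continue
        fields = line.split()
        if len(fields) != 7:
            problems.append((number, "expected 7 fields"))
            continue
        text = fields[6]
        if not all(c.isalpha() or c == "-" for c in text):
            problems.append((number, "sequence field is not a sequence"))
            continue
        if expected is None:
            expected = len(text)
        elif len(text) != expected:
            problems.append((number, "aligned length differs from first row"))
    return problems
```

Running it over the lines of alignment.maf gives a list of (line number, reason) pairs; an empty list means every row passed these checks.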

more detailed description for output

Hello, could you please share a more detailed description for output files?:)

time should be conda dependency

Hi,

I installed sibeliaz from conda on a Linux CentOS system. I attempted to run it with the test dataset and received the following output:

Constructing the graph...
line 131: /usr/bin/time: No such file or directory
line 132: /usr/bin/time: No such file or directory
rm: cannot remove './sibeliaz_test/de_bruijn_graph.dbg': No such file or directory
line 137: /usr/bin/time: No such file or directory

I am running on a shared cluster and therefore do not have permission to install time in /usr/bin. time is installed on the system elsewhere and I updated these lines in the sibeliaz script to reflect the path to installed time.

Then when I run I get
-f: command not found

My guess is that the version of time that is installed is too old and does not have the -f flag.

Then I installed time from conda (https://anaconda.org/conda-forge/time) and changed the path to time in the sibeliaz file to the time installed by conda and the script works.

Would it be possible to add time as conda dependency and update the sibeliaz script to rely on the version of time installed by conda?

sibeliaz :: spoa call code duplication

Hello,

It seems to me that the sibeliaz script contains duplicated code in the align function.

find $outdir/blocks -name "*.fa" -printf "$PWD/%p\n" | xargs -I @ -P "$threads" bash -c "align @ '$outfile' '$DIR'"
find $outdir/blocks -name "*.fa" -printf "$PWD/%p\n" | xargs -I @ -P 1 bash -c "align @ '$outfile' '$DIR'"

It looks like it performs the spoa alignment twice:
first with xargs running $threads processes, then with just one process.

I don't see the point of removing the input files of successful spoa jobs and then performing a new run with the remaining input files.

regards

Eric

Visualization of output and maf_to_gfa1.py TabError: inconsistent use of tabs and spaces in indentation

Dear iminkin,
I was trying to make a Circos plot using your extremely user-friendly tool SibeliaZ. I have my input ready in a fasta file and got output in MAF and GFF format. I would like to visualize the result, so I ran python maf_to_gfa1.py, but ended up with this error at line number 96:
prev_profile = profile(maf, 0)
TabError: inconsistent use of tabs and spaces in indentation.

I also tried python maf_to_xmfa.py <, but it has been running for the past day.

How can I visualize the plot? Am I missing something?

free -w error

Solved: the installed version of free was too old.

twopaco: symbol lookup error

Hello,

I installed SibeliaZ with conda on our cluster where I do not have root permissions.
Here the output :

sibeliaz $genome1 $genome2
Constructing the graph...
twopaco: symbol lookup error: twopaco: undefined symbol: _ZN3tbb8internal24concurrent_queue_base_v818internal_push_moveEPKv
TwoPaco: 0.00 seconds elapsed, 1332 KB memory used
Loading the graph...
error: Can't read the input file
SibeliaZ-LCB: 0.00 seconds elapsed, 2036 KB memory used
rm: cannot remove './sibeliaz_out/de_bruijn_graph.dbg': No such file or directory
Performing global alignment..
find: './sibeliaz_out/blocks': No such file or directory
find: './sibeliaz_out/blocks': No such file or directory
Alignment: 0.02 seconds elapsed, 1568 memory used

Thank you for your help.

Thomas

Error when using sbatch

Hi,

I can get the sibeliaz command to run fine directly on a node, but when I submit the job using sbatch or snakemake --profile, I get errors.
With sbatch, the error is:

.conda/envs/sibeliaz/bin/sibeliaz: line 97: total_size + : syntax error: operand expected (error token is "+ ")
.conda/envs/sibeliaz/bin/sibeliaz: line 125: /usr/bin/time: No such file or directory
.conda/envs/sibeliaz/bin/sibeliaz: line 126: /usr/bin/time: No such file or directory

With snakemake --profile, the error is:

.snakemake_envs/3f9a8e3309e4f8576c8c3c7e6cd22727/bin/sibeliaz: line 131: /usr/bin/time: No such file or directory
.snakemake_envs/3f9a8e3309e4f8576c8c3c7e6cd22727/bin/sibeliaz: line 132: /usr/bin/time: No such file or directory

Maybe something hardcoded in the script causes this error? How can I make SibeliaZ work on clusters?

Apply the tool to virus

Hello author, I am sorry to disturb you. I want to apply SibeliaZ to viruses (for example, HIV) to obtain collinear blocks. After that, I will extract SNVs from those collinear blocks. However, I found that I can't extract enough SNVs. Do you have any ideas for dealing with this problem? Thanks a lot!

How would it deal with phased genomes?

Hello,

Thought experiment here:

I was wondering how SibeliaZ would handle phased genomes as input? A priori I don't see a major reason it wouldn't work, if not for the output that would be very complicated to parse I guess.

Thank you

2 problems of sibeliaz in my experiment (Bug or not ?)

@iminkin
Hi, iminkin , thanks a lot for your amazing tool, it plays a very important role in my research !

However, I ran into 2 problems when I used sibeliaz to find the collinear blocks of a collection of highly similar viral genomes.

Problem1:
Sibeliaz fails to run on 5 highly similar viral genomes (Cluster_17_revise.fasta).
These 5 viral genomes were clustered by cdhit with an identity cutoff of 99%, and I also checked their similarity with online blast (>=99% coverage and >=99% percent identity). When I run sibeliaz to find the blocks, it fails, and the running log is:

Similar problems occur in other datasets too. Maybe the high similarity or the small number of genomes makes sibeliaz fail to return the collinear blocks? I doubt that.

Problem2:
Similarly, I used sibeliaz to find the collinear blocks of another collection of highly similar viral genomes (again from cdhit clustering with identity=99%). There are 402 genomes in this dataset (Cluster_0_Raw.fasta), and sibeliaz finishes the job without error. However, some genomes get 0 collinear blocks. I checked the alignment between these genomes and the longest genome in the dataset (ZKV_184) with online blast, and it is obvious that there are many collinear blocks between them. The genomes with 0 collinear blocks are:

ZKV_26, ZKV_58, ZKV_68, ZKV_73, ZKV_74, ZKV_78, ZKV_109, ZKV_145, ZKV_176, ZKV_195, ZKV_388, ZKV_394, ZKV_501, ZKV_765, ZKV_767, ZKV_774, ZKV_778, ZKV_780, ZKV_791

Btw, the command I used for these 2 experiments is "sibeliaz -t 6 -k 15 [input_file]". (I also tried the default parameters, but the problems persist.)

The 2 datasets for Problem 1 (Cluster_17_revise.fasta) and Problem 2 (Cluster_0_Raw.fasta) are uploaded; you can use them for testing. I have been confused by these 2 problems for a long time.
Test_Data.zip
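For Problem 2, the list of genomes with no blocks can be reproduced mechanically, which makes it easier to compare runs with different -k values. The sketch below assumes that the output directory contains a tab-separated GFF file of block coordinates whose first column is the sequence ID, and that the ID is the first whitespace-delimited token of each FASTA header. Both are assumptions to verify against the actual output, and the function names are hypothetical.

```python
def fasta_ids(fasta_path):
    """Collect sequence IDs (first token after '>') in file order."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.append(fields[0])
    return ids


def gff_seq_ids(gff_path):
    """Collect the distinct values of the first GFF column, skipping comments."""
    seen = set()
    with open(gff_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            seen.add(line.split("\t", 1)[0].strip())
    return seen


def genomes_without_blocks(fasta_path, gff_path):
    """Return FASTA IDs that never appear in the block coordinates file."""
    covered = gff_seq_ids(gff_path)
    return [seq_id for seq_id in fasta_ids(fasta_path) if seq_id not in covered]
```

Calling `genomes_without_blocks("Cluster_0_Raw.fasta", <path to the block coordinates file>)` should then return the same kind of list as the one given above.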

SibeliaZ causes my Ubuntu session to close

Hello,

Title. When I launch sibeliaz, it runs for one or two minutes, and then my Ubuntu session closes.

Here is the command

sibeliaz -t24 AA.fasta B.fasta

Do you have any idea why?

EDIT: this is with the conda install. I have now compiled it myself and so far it doesn't seem to cause disconnection issues. I will update this when/if the run finishes.

EDIT2: no issue, can confirm the bug appears only with the conda version.

TabError: inconsistent use of tabs and spaces in indentation

Hi,

It seems that maf_to_gfa1.py has a formatting problem. Could you please confirm?

  File "../SibeliaZ/SibeliaZ-LCB/maf_to_gfa1.py", line 97
    pos = [record.start for record in maf]
                                         ^
TabError: inconsistent use of tabs and spaces in indentation

Thanks a lot!
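Python 3 raises TabError when a file's indentation mixes tabs and spaces in a way that makes the nesting ambiguous. The standard library already ships a checker (`python -m tabnanny maf_to_gfa1.py`). For a quick look at which lines use which style, a small sketch like the following may help; the function name is hypothetical, and it inspects leading whitespace only, so continuation lines inside brackets or multi-line strings are counted as well.

```python
def classify_indentation(source_text):
    """Group 1-based line numbers by the characters used in their indentation."""
    report = {"tabs": [], "spaces": [], "mixed": []}
    for number, line in enumerate(source_text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped:
            continue  # blank or whitespace-only line
        indent = line[: len(line) - len(stripped)]
        if not indent:
            continue  # top-level line, nothing to classify
        has_tab = "\t" in indent
        has_space = " " in indent
        if has_tab and has_space:
            report["mixed"].append(number)
        elif has_tab:
            report["tabs"].append(number)
        else:
            report["spaces"].append(number)
    return report
```

If both the "tabs" and "spaces" lists are non-empty, or "mixed" is non-empty, converting the file to a single style (for example, spaces only) is the usual fix.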

Installation issues

Hi,

Just downloaded release 1.2.3 and tried to compile it, in order to eventually support SibeliaZ in the EasyBuild framework. Yet:

`CMake Error at CMakeLists.txt:13 (add_subdirectory):
The source directory

/dev/shm/SibeliaZ/1.2.3/GCCcore-10.2.0/SibeliaZ-1.2.3/spoa

does not contain a CMakeLists.txt file.

CMake Error at CMakeLists.txt:14 (add_subdirectory):
add_subdirectory given source "TwoPaCo/src" which is not an existing
directory.

CMake Error at /cluster/easybuild/broadwell/software/CMake/3.15.3/share/cmake-3.15/Modules/ExternalProject.cmake:2611 (message):
No download info given for 'maf2synteny' and its source directory:

/dev/shm/SibeliaZ/1.2.3/GCCcore-10.2.0/SibeliaZ-1.2.3/maf2synteny

is not an existing non-empty directory. Please specify one of:

SOURCE_DIR with an existing non-empty directory
DOWNLOAD_COMMAND
URL
GIT_REPOSITORY
SVN_REPOSITORY
HG_REPOSITORY
CVS_REPOSITORY and CVS_MODULE
Call Stack (most recent call first):
/cluster/easybuild/broadwell/software/CMake/3.15.3/share/cmake-3.15/Modules/ExternalProject.cmake:3204 (_ep_add_download_command)
CMakeLists.txt:15 (ExternalProject_Add)

-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Configuring incomplete, errors occurred!
`

Within the download the mentioned directories appear to be empty. Could you please fix that and provide a working command for the maf2synteny project?

Thank you.

Best regards,
Christian Meesters
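Automatically generated GitHub source archives do not include the contents of git submodules. That would explain the empty directories if spoa, TwoPaCo and maf2synteny are submodules of this repository, which is an assumption here and not confirmed in this thread. In that case a recursive git clone may be the simplest way to obtain them. Before running CMake, a build recipe could verify the tree with a sketch like this one, where the path list is taken from the CMake errors above and the function name is hypothetical:

```python
import os

# Paths taken from the CMake errors above, relative to the unpacked source root.
REQUIRED_SOURCES = ("spoa/CMakeLists.txt", "TwoPaCo/src", "maf2synteny")


def missing_sources(source_root, required=REQUIRED_SOURCES):
    """Return the required paths that are absent or are empty directories."""
    missing = []
    for relative in required:
        path = os.path.join(source_root, *relative.split("/"))
        if not os.path.exists(path):
            missing.append(relative)
        elif os.path.isdir(path) and not os.listdir(path):
            missing.append(relative)
    return missing
```

An empty return value means all three components are present; anything else names what still has to be fetched.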

Redundancy and contamination analysis on assembly

Hi,

I just assembled an allotetraploid plant genome, but I suspect it has a couple of issues, as the assembly size is ~3 times larger than the estimated size. I think some scaffolds are actually redundant. Also, I have some reasons to think there may be some scaffolds from contamination (e.g. human).

So, I wonder if I could use SibeliaZ to:

Align my assembly to itself (or to its closest relative, which is not that close: it is in the same subfamily) to identify where the redundancy is.
Align my assembly with human and identify scaffolds that are human contamination.

I would really appreciate your thoughts on this matter.

Round 0, 0:17179869184
Pass Filling Filtering
1 25 3
2 0 0
True junctions count = 79040
False junctions count = 0
Hash table size = 79040
Candidate marks count = 500858

Reallocating bifurcations time: 0
True marks count: 500858
Edges construction time: 1

However, I ran into some problems:
Constructing the graph...
Threads = 6
Vertex length = 25
Hash functions = 5
Filter size = 17179869184
Capacity = 1
Files:
/home/lychee/dissertation/ZKV_99cutoff.fasta