Giter Site home page Giter Site logo

daligner's Introduction

Daligner: The Dazzler "Overlap" Module

Author: Gene Myers

First: April 10, 2016

Current: April 19, 2019

For typeset documentation, examples of use, and design philosophy please go to my blog.

Version Numbers

v1.0 has been released, but if you need to refer to a later revision from the stable master branch, please use v1.0.yyyymmdd where yyyy-mm-dd is the date of the commit used. This is important for method details in scientific papers, and for software packaging (e.g. Conda, HomeBrew, or Linux distribution packages).

The commands below permit one to find all significant local alignments between reads encoded in Dazzler database. The assumption is that the reads are from a PACBIO RS II long read sequencer. That is the reads are long and noisy, up to 15% on average.

Recall that a database has a current partition that divides it into blocks of a size that can conveniently be handled by calling the daligner overlapper on all the pairs of blocks producing a collection of .las local alignment files that can then be sorted and merged into an ordered sequence of sorted files containing all alignments between reads in the data set. The alignment records are parsimonious in that they do not record an alignment but simply a set of trace points, typically every 100bp or so, that allow the efficient reconstruction of alignments on demand.

All programs add suffixes (e.g. .db, .las) as needed. For the commands that take multiple .db or .las file blocks as arguments, i.e. daligner, LAsort, LAmerge, LAcat, and LAcheck, one can place a @-sign in the name, which is then interpreted as the sequence of files obtained by replacing the @-sign by 1, 2, 3, ... in sequence until a number is reached for which no file matches. One can also place a @-sign followed by an integer, say, i, in which case the sequence starts at i. Lastly, one can also use @i-j where i and j are integers, in which case the sequence is from i to j, inclusive.

The formal UNIX command line descriptions and options for the DALIGNER module commands are as follows:

1. daligner [-vaAI]
       [-k<int(16)>] [-%<int(28)>] [-h<int(50)>] [-w<int(6)>] [-t<int>] [-M<int>]
       [-e<double(.75)] [-l<int(1500)] [-s<int(100)>] [-H<int>]
       [-T<int(4)>] [-P<dir(/tmp)>] [-m<track>]+
       <subject:db|dam> <target:db|dam> ...

Compare sequences in the trimmed <subject> block against those in the list of <target> blocks searching for local alignments involving at least -l base pairs (default 1000) or more, that have an average correlation rate of -e (default 70%). The local alignments found will be output in a sparse encoding where a trace point on the alignment is recorded every -s base pairs of the a-read (default 100bp). Reads are compared in both orientations and local alignments meeting the criteria are output to one of several created files described below. The -v option turns on a verbose reporting mode that gives statistics on each major step of the computation. The program runs with 4 threads by default, but this may be set to any positive value with the -T option.

The options -k, -%, -h, and -w control the initial filtration search for possible matches between reads. Specifically, our search code looks for a pair of diagonal bands of width 2w (default 26 = 64) that contain a collection of matching k-mers (default 16) in the lowest %-percentifle between the two reads, such that the total number of bases covered by the k-mer hits is h (default 50). k cannot be larger than 32 in the current implementation. These parameters will shortly be superceded with a more intuitive interface.

If there are one or more interval tracks specified with the -m option, then the reads of the DB or DB's to which the mask applies are soft masked with the union of the intervals of all the interval tracks that apply, that is any k-mers that contain any bases in any of the masked intervals are ignored for the purposes of seeding a match. An interval track is a track, such as the "dust" track created by DBdust, that encodes a set of intervals over either the untrimmed or trimmed DB.

Invariably, some k-mers are significantly over-represented (e.g. homopolymer runs). These k-mers create an excessive number of matching k-mer pairs and left unaddressed would cause daligner to overflow the available physical memory. One way to deal with this is to explicitly set the -t parameter which suppresses the use of any k-mer that occurs more than t times in either the subject or target block. However, a better way to handle the situation is to let the program automatically select a value of t that meets a given memory usage limit specified (in Gb) by the -M parameter. By default daligner will use the amount of physical memory as the choice for -M. If you want to use less, say only 8Gb on a 24Gb HPC cluster node because you want to run 3 daligner jobs on the node, then specify -M8. Specifying -M0 basically indicates that you do not want daligner to self adjust k-mer suppression to fit within a given amount of memory.

Each found alignment is recorded as -- a[ab,ae] x bo[bb,be] -- where a and b are the indices (in the trimmed DB) of the reads that overlap, o indicates whether the b-read is from the same or opposite strand, and [ab,ae] and [bb,be] are the intervals of a and bo, respectively, that align. For each subject, target pair of blocks, say X and Y, the program reports alignments where the a-read is in X and the b-read is in Y, or vice versa. However, if the -A option is set ("A" for "asymmetric") then just overlaps where the a-read is in X and the b-read is in Y are reported, and if X = Y then it further reports only those overlaps where the a-read index is less than the b-read index. In either case, if the -I option is set ("I" for "identity") then when X = Y, overlaps between different portions of the same read will also be found and reported. In summary, the command daligner -A X Y produces a single file X.Y.las and daligner X Y produces 2 files X.Y.las and Y.X.las (unless X=Y in which case only a single file, X.X.las, is produced). The overlap records in one of these files are sorted as described for LAsort. The -a option to daligner is passed directly through to LAsort which is actually called as a sub-process to produce the sorted file. In order to produce the aforementioned .las file, several temporary .las files, two for each thread, are produce in the sub-directory /tmp by default. You can overide this location by specifying the directory you would like this activity to take place in with the -P option.

By default daligner compares all overlaps between reads in the database that are greater than the minimum cutoff set when the DB or DBs were split, typically 1 or 2 Kbp. However, the HGAP assembly pipeline only wants to correct large reads, say 8Kbp or over, and so needs only the overlaps where the a-read is one of the large reads. By setting the -H parameter to say N, one alters daligner so that it only reports overlaps where the a-read is over N base-pairs long.

While the default parameter settings are good for raw Pacbio data, daligner can be used for efficiently finding alignments in corrected reads or other less noisy reads. For example, for mapping applications against .dams we run daligner -k20 -h60 -e.85 and on corrected reads, we typically run daligner -k25 -w5 -h60 -e.95 -s500 and at these settings it is very fast.

2. LAsort [-va] <align:las> ...

Sort each .las alignment file specified on the command line. For each file it reads in all the overlaps in the file and sorts them in lexicographical order of (a,b,o,ab) assuming each alignment is recorded as a[ab,ae] x bo[bb,be]. It then writes them all to a file named <align>.S.las (assuming that the input file was <align>.las). With the -v option set then the program reports the number of records read and written. If the -a option is set then it sorts LAs in lexicographical order of (a,ab) alone, which is desired when sorting a mapping of reads to a reference.

If the .las file was produced by damapper the local alignments are organized into chains where the LA segments of a chain are consecutive and ordered in the file. LAsort can detects that it has been passed such a file and if so treats the chains as a unit and sorts them on the basis of the first LA in the chain.

3. LAmerge [-va] [-P<dir(/tmp)>] <merge:las> <parts:las> ...

Merge the .las files <parts> into a singled sorted file <merge>, where it is assumed that the input <parts> files are sorted. There are no limits to how many files can be merged, but if there are more than 252, a typical UNIX OS limit on the number of simultaneously open files, then the program recursively spawns sub-processes and creates temporary files in the directory specified by the -P option, /tmp by default. With the -v option set the program reports the number of records read and written. The -a option indicates the sort is as describe for LAsort above.

If the .las file was produced by damapper the local alignments are organized into chains where the LA segments of a chain are consecutive and ordered in the file. When merging such files, LAmerge treats the chains as a unit and orders them on the basis of the first LA in the chain.

Used correctly, LAmerge and LAsort together allow one to perform an "external" sort that produces a collection of sorted files containing in aggregate all the local alignments found by the daligner, such that their concatenation is sorted in order of (a,b,o,ab) (or (a,ab) if the -a option is set). In particular, this means that all the alignments for a given a-read will be found consecutively in one of the files. So computations that need to look at all the alignments for a given read can operate in simple sequential scans of these sorted files.

4. LAshow [-caroUF] [-i<int(4)>] [-w<int(100)>] [-b<int(10)>]
                    <src1:db|dam> [ <src2:db|dam> ]
                    <align:las> [ <reads:FILE> | <reads:range> ... ]

LAshow produces a printed listing of the local alignments contained in the specified .las file, where the a- and b-reads come from src1 or from src1 and scr2, respectively. If a file or list of read ranges is given then only the overlaps for which the a-read is in the set specified by the file or list are displayed. See DBshow for an explanation of how the file and list of read ranges are interpreted. If the -F option is set then the roles of the a- and b- reads are reversed in the display.

If the -c option is given then a cartoon rendering is displayed, and if -a or -r option is set then an alignment of the local alignment is displayed. The -a option puts exactly -w columns per segment of the display, whereas the -r option puts exactly -w a-read symbols in each segment of the display. The -r display mode is useful when one wants to visually compare two alignments involving the same a-read. If a combination of the -c, -a, and -r flags is set, then the cartoon comes first, then the -a alignment, and lastly the -r alignment. The -i option sets the indent for the cartoon and/or alignment displays, if they are requested. The -b option sets the number of symbols on either side of the aligned segments in an alignment display, and -U specifies that uppercase should be used for DNA sequence instead of the default lowercase. If the -o option is set then only alignments that are proper overlaps (a sequence end occurs at the each end of the alignment) are displayed. If the -F option is given then the roles of the A- and B-reads are flipped.

When examining LAshow output it is important to keep in mind that the coordinates describing an interval of a read are referring conceptually to positions between bases starting at 0 for the position to the left of the first base. That is, a coordinate c refers to the position between the c-1'st and c'th base, and the interval [b,e] captures the e-b bases from the b'th to the e-1'st, inclusive. We give an example with a cartoon and (part of an) alignment for which we will explain several additional important points:

 1  1,865 c   [18,479..20,216] x [ 1,707..0>  (24,451 x 7,283 bps, 19 trace pts)

      18479              4235
  A ========+----------+======>  dif/(len1+len2) = 478/(1737+1707) = 27.76%
  B  <======+-----------
       5576

   18469 agccgcctag[tgcctcgcaaacgc-t-cggggcggcgt-gaaagcgg--
         ::::::::::[||||||||||||||*|*|||*|||*|||*||||||||**
    1717 ctcttcttta[tgcctcgcaaacgccttcggcgcg-cgttgaaagcggtt  17.9%

   18513 -ccggtgggtc--agtggcgagttctggcagtgcgctggg-ctgcgaaat
         *||||||*|||**|||||*||||*|*|*|||**|||||||*||*||||||
   1669 gccggtgcgtcgcagtgg-gagt-c-gtcag--cgctggggcttcgaaat  24.0%

        . . .

The display of an LA always begins with a line giving the A-read, then the B-read, then an indication of orientation (i.e. 'n' for same strand, and 'c' for the opposite strand) followed by the A-interval and B-interval that are aligned and in parentheses the lengths of the two reads and the number of tracepoints in the alignment between them. In particular, note carefully that when the B-read is in the complement orientation (c), then the B-interval gives the higher coordinate first, the idea being that one will align from the highest base down to the lowest base in the descending direction on B, complement the characters as you go. Further note that in the alignment display the coordinates at the start of each line follow this orientation convention and give the coordinate of the "tick mark" just left of the first character in each line. It is useful to know if an interval reaches the end of read, and to signal this we use an angle-bracket <> instead of a square bracket [], e.g. in the example the B-segment starts at the beginning of the read. Finally, observe that in the cartoon the numbers are not coordinates but rather indicate the lengths of the unaligned bits left and right of the two aligned intervals. Finally, observe that in the cartoon the numbers are not coordinates but rather indicate the lengths of the unaligned bits left and right of the two aligned intervals.

With the introduction of damapper, .las files can now contain chains. If LAshow detects that it has been passed a file with chain information then it displays marks at the left that reveal the chain structure, e.g.:

   >     117   37,630 c   [   253.. 7,980] x [   331,430..   324,027]  ~  10.5%
   +     117   37,628 n   [   253.. 7,983] x [21,493,673..21,501,079]  ~  10.6%
   +     117       57 c   [   253.. 1,086] x [ 2,008,164.. 2,007,369]  ~   9.8%
    -    117       57 c   [ 1,300.. 7,982] x [ 2,007,351.. 2,000,945]  ~  10.7%
   >     117       15 c   [ 7,992.. 8,716] x [   242,529..   241,822]  ~   7.8%
    -    117       15 c   [ 8,752..14,299] x [   241,824..   236,425]  ~  10.7%
    -    117       15 c   [14,133..14,832] x [   236,630..   235,953]  ~  12.1%
   +     117   37,628 n   [ 7,992.. 8,716] x [19,202,357..19,203,064]  ~   7.7%
    -    117   37,628 n   [ 8,752..14,832] x [19,203,062..19,208,974]  ~  10.9%

A chain begins with either a > or + character, where > indicates this is the highest scoring chain and + indicates an alternate near optimal chain (controlled by the -n parameter to damapper). Each additional LA of a chain is marked with a - character.

5a. LA2ONE [-cto] <src1:db|dam> [ <src2:db|dam> ]
                   <align:las> [ <reads:FILE> | <reads:range> ... ]  > (.dal file)

5b. ONE2LA <align.dal> > (.las file)

LA2ONE produces a .dal 1-code data file of all or a portion of the contents of a .las file. 1-code is a powerful self-describing, simple to use, data system with built in compression. Typically a .dal file will be 45% the size of its corresponding .dal file. By default only the pairwise overlapped pairs, their orientation, and chaining (if any) are output. The -c option requests that one further output the coordinates of the LA segments, the lengths of the reads and the # of differences in each LA. The -t option requests that the trace point alignment details are output for each LA. If -t is set then -c must be set and if both are set then all of the information about each LA is effectively output. The -o option requests that only LAs that are proper overlaps be output. Only the overlaps for particular A-reads may be specified as per the same command line arguments as documented for LAshow above.

ONE2LA converts a 1-code .dal file back into a .las file. It requires that the .dal file contains all the information about each LA therein, that is, it must contain all the information that is output by LA2ONE with the options -ct set.

The .dal format is quite simple where the primary object is considered to be a pile, i.e., the set of all LAs for a given a-read. The encoding of all the LAs for a pile is given by several lines in the 1-code format, where each line type is designated by the first character in the line. The P-line designated the start of a pile object and consists of giving the a-read and then a list of all the b-reads in the pile. We term the length of this list the size of the pile. All other line type lists for this given pile should have the same size or a multiple thereof. By default the P-line is followed by O- and C-lines that give the orientation and chaining directives (if any) for each b-read in the pile list as strings, one character per b-read.

    P <a-read: int> <pile b-reads: int_list>
    O <orientations: string over [nc]>        // n is normal, c is complement
    C <chain directives: string over [>+-.]>  // '>' start of best chain,
                                              // '+' start of alternate chain,
                                              // '-' continuation of chain, or  
                                              // '.' no chains (in the file).

If the -c flag is set, then LA2ONE also outputs 1-code lines describing the intervals (ab,ae) of the a-read that align with intervals (bb,be) of the b-reads and the diffs in the aligned segments. The lengths of the a-read and all the b-reads is also output for convenience. The A-line outputs a list of the pairs (ab,ae) in the same order as the b-reads are listed in the object P-line. The B-line outputs a list of the pairs (bb,be) and the D-line outputs a list of the corresponding difference in each aligned pair of segments. Note carefully that the length of the A-line and B-line lists are twice that of the pile size. Lastly, the L-line gives the length of the a-read and then the lengths of the b-reads in the same order as the P-line.

    A <a-read intervals: int_list>
    B <b-read intervals: int_list>
    D <alignment diffs: int_list>
    L <a-read length: int> <b-read lengths: int_list>

If the -t flag is set, then LA2ONE outputs a pair of T- and Q-lines, one for each alignment in the pile in the order listed in the controlling P-line. The relative order of the T- and Q-lines does not matter as long as the i'th occurrence of a T- or Q-line is for the i'th alignment in the pile. The T-line gives the b-segment lengths for a trace-point encoding of the alignment and the Q-line gives the difference in each segment. In addition the a-read spacing (which is the same for all alignments) is given in an X-line at the start of the data file.

    T <b-displacements: int_list>
    Q <segment diffs: int_list>
    
    X <trace spacing:int>  //  1st data line before any objects
6. LAcat [-v] <source:las> ... > <target>.las

The sequence of <source> files (that can contain @-sign block ranges) are concatenated in order into a single .las file and pipe the result to the standard output. The -v option reports the files concatenated and the number of la's within them to standard error (as the standard output receives the concatenated file).

7. LAsplit [-v] <target:las> (<parts:int> | <path:db|dam>) < <source>.las

If the second argument is an integer n, then divide the alignment file <source>, piped in through the standard input, as evenly as possible into n alignment files with the names specified by template <target>, subject to the restriction that all alignment records for a given a-read are in the same file. The name of the n files is the string <target> where the single @-sign that occurs somewhere in it is replaced by i for i in [1,n] and a .las extension is added if necessary.

If the second argument refers to a database <path>.db that has been partitioned, then divide the input alignment file into block .las files where all records whose a-read is in <path>.i.db are in the i'th file generated from the template <target>. The -v option reports the files produced and the number of la's within them to standard error.

8. LAcheck [-vaS] <src1:db|dam> [ <src2:db|dam> ] <align:las> ...

LAcheck checks each .las file for structural integrity, where the a- and b-sequences come from src1 or from src1 and scr2, respectively. That is, it makes sure each file makes sense as a plausible .las file, e.g. values are not out of bound, the number of records is correct, the number of trace points for a record is correct, and so on. If the -S option is set then it further checks that the alignments are in sorted order, by default pile order, but if -a is also set, then map order. If the -v option is set then a line is output for each .las file saying either the file is OK or reporting the first error. If the -v option is not set then the program runs silently. The exit status is 0 if every file is deemed good, and 1 if at least one of the files looks corrupted.

With the introduction of damapper, LAcheck checks to see if a file has chain information, and if it does, then it checks the validity of chains and checks the sorting order of chains as a unit according to the -a option.

9. HPC.daligner [-vad] [-t<int>] [-w<int(6)>] [-l<int(1500)] [-s<int(100)] [-M<int>]
                    [-P<dir(/tmp)>] [-B<int(4)>] [-T<int(4)>] [-f<name>]
                  ( [-k<int(16)>] [-h<int(50)>] [-e<double(.75)] [-H<int>]
                    [-k<int(20)>] [-h<int(50)>] [-e<double(.85)]  <ref:db|dam>  )
                    [-m<track>]+ <reads:db|dam> [<first:int>[-<last:int>]]

HPC.daligner writes a UNIX shell script to the standard output or to a series of files beginning with the prefix <name> if the -f option is set, that either performs an "overlap" computation on all the blocks in a single database, or a "comparison" computation on all pairs of blocks between two databases, depending on whether it is given one or two DB's as arguments (<ref> and <reads>). We describe the overlap script first and its effect first and then later the comparison script.

An Overlap Script: consists of a sequence of commands that effectively run daligner on all pairs of blocks of a split database and then externally sorts and merges them using LAsort and LAmerge into a collection of alignment files with names <path>.#.las where # ranges from 1 to the number of blocks the data base is split into. These sorted files if concatenated by say LAcat would contain all the alignments in sorted order (of a-read, then b-read, ...). Moreover, all overlaps for a given a-read are guaranteed to not be split across files, so one can run artifact analyzers or error correction on each sorted file in parallel.

The data base must have been previously split by DBsplit and all the parameters, except -a, -d, -f, -B, and -D, are passed through to the calls to daligner. The defaults for these parameters are as for daligner. The -v and -a flags are passed to all calls to LAsort and LAmerge. All other options are described later. For a database divided into N sub-blocks, the calls to daligner will produce in total N2 .las files, on per block pair. These are then merged so that there is 1 file per row of the N x N block matrix. So at the end one has N sorted .las files, one per block of A-reads, that when concatenated would give a single large sorted overlap file.

The -B option (default 4) gives the desired number of block comparisons per call to daligner. Some must contain B-1 comparisons, and the first B-2 block comparisons even less, but the HPCdaligner "planner" does the best it can to give an average load of -B block comparisons per command.

If the integers <first> and <last> are missing then the script produced is for every block in the database. If <first> is present then HPCdaligner produces an incremental script that compares blocks <first> through <last> (<last> = <first> if not present) against each other and all previous blocks 1 through <first>-1, and then incrementally updates the .las files for blocks 1 through <first>-1, and creates the .las files for blocks <first> through <last>.

A Comparison Script: consists of a sequence of commands that effectively maps every read in the DB <reads> against a reference set of sequences in the DB <ref>, recording all the found local alignments in the sequence of files <reads>.1.<ref>.las, <reads>.2.<ref>.las, ... where <reads>.<ref>.k.las contains the alignments between all of <ref> and the k'th block of <reads>. The parameters are exactly the same as for the overlap script save that the -k, -h, and -e defaults are set more stringently for mapping, and the -A, -I , and -H options make no sense as <ref> and <reads> are expected to be distinct data sets. If the integers <first> and <last> are missing then the script produced is for every block in the database <reads>. If <first> is present then HPC.daligner produces a script that compares blocks <first> through <last> (<last> = <first> if not present) of <reads> against DAM <ref>.

The command scripts output by HPC.daligner and other HPC.<x> programs consists of command blocks each of which begins with a comment line (begins with #) followed by a potentially long list of lines each containing a shell command. Command blocks whose comment mentions "jobs" and gives the number of said in parenthesis, we call parallel blocks because each command line in the block can be sent to a node in a cluster for independent execution, i.e. none of the commands in a block depend on another in the block. The remaining command blocks we call house-keeping blocks because they can be executed by the shell on the launch/server node and the commands are either checking the integrity of .las files with LAcheck, or removing intermediate files with rm. Each block should be performed in the order given and should complete before the next block is performed.

If the -f option is set, then each command block is written to a file with a name of the form <name>.#.<description> where <name> is specified by the user in the -f option argument, # gives the order in which the command block in the given file is to be performed in relation to other command block files, and <description> is a (very) short symbolic reminder of what the block is doing. For example, "HPC.daligner -fJOBS DB" would produce the files:

     JOBS.01.OVL
     JOBS.02.CHECK.OPT
     JOBS.03.MERGE
     JOBS.04.RM.OPT

There are always 4 command blocks. The files with the suffix .OPT are optional and need not be executed albeit we highly recommend that one run the CHECK block. One should not run the RM block if one wants to later use DASrealign after scrubbing.

A new -d option requests scripts that organize files into a collection of sub-directories so as not to overwhelm the underlying OS for large genomes. Recall that for a DB divided into N blocks, the daligner will produce N2 .las-files. With the -d option set, N sub-directories (with respect to the directory HPC.daligner is called in) of the form "work<i>" for i from 1 to N are created in an initial command block, and then all work files are placed in those sub-directories, with a maximum of 2N files appearing in any sub-directory at any given point in the process.

Example:

//  Recall G.db from the example in DAZZ_DB/README

> cat G.db
files =         1
       1862 G Sim
blocks =         2
size =        11 cutoff =         0 all = 0
         0         0
      1024      1024
      1862      1862
> HPCdaligner -mdust -t5 G | csh -v   // Run the HPCdaligner script

# Dazzler jobs (2)
dazzler -d -t5 -mdust G.1 G.1
dazzler -d -t5 -mdust G.2 G.1 G.2
# Initial sort jobs (4)
LAsort G.1.G.1.*.las && LAmerge G.L1.1.1 G.1.G.1.*.S.las && rm G.1.G.1.*.S.las
LAsort G.1.G.2.*.las && LAmerge G.L1.1.2 G.1.G.2.*.S.las && rm G.1.G.2.*.S.las
LAsort G.2.G.1.*.las && LAmerge G.L1.2.1 G.2.G.1.*.S.las && rm G.2.G.1.*.S.las
LAsort G.2.G.2.*.las && LAmerge G.L1.2.2 G.2.G.2.*.S.las && rm G.2.G.2.*.S.las
# Level 1 jobs (2)
LAmerge G.1 G.L1.1.1 G.L1.1.2 && rm G.L1.1.1.las G.L1.1.2.las
LAmerge G.2 G.L1.2.1 G.L1.2.2 && rm G.L1.2.1.las G.L1.2.2.las

> LAshow -c -a:G -w50 G.1 | more  // Take a look at the result !

G.1: 34,510 records

         1          9 c   [     0.. 1,876] x [ 9,017..10,825]  ( 18 trace pts)

                      12645
    A      ---------+====>   dif/(len1+len2) = 398/(1876+1808) = 21.61%
    B <====+---------
       9017

         1 ..........gtg-cggt--caggggtgcctgc-t-t-atcgcaatgtta
                     |||*||||**||||||||*||||*|*|*||**|*|*||||
      9008 gagaggccaagtggcggtggcaggggtg-ctgcgtcttatatccaggtta  27.5%

        35 ta-ctgggtggttaaacttagccaggaaacctgttgaaataa-acggtgg
           ||*|||||||||||||*|**|*||*|*||||||*|**|||||*|*|||||
      9057 tagctgggtggttaaa-tctg-ca-g-aacctg-t--aataacatggtgg  24.0%

        83 -ctagtggcttgccgtttacccaacagaagcataatgaaa-tttgaaagt
           *||||||||*||||||||*||**||||*|||**|||||||*||||*||||
      9100 gctagtggc-tgccgttt-ccgcacag-agc--aatgaaaatttg-aagt  20.0%

       131 ggtaggttcctgctgtct-acatacagaacgacggagcgaaaaggtaccg
           ||*|||||||||||||*|*||||*|*|*||||||||||*||||||||||*
      9144 gg-aggttcctgctgt-tcacat-c-ggacgacggagc-aaaaggtacc-  16.0%

...

> LAcat G >G.las       //  Combine G.1.las & G.2.las into a single .las file
> LAshow G G | more    //   Take another look, now at G.las

G: 62,654 records
   1    9 c   [     0.. 1,876] x [ 9,017..10,825] :   <    398 diffs  ( 18 trace pts)
   1   38 c   [     0.. 7,107] x [ 5,381..12,330] :   <  1,614 diffs  ( 71 trace pts)
   1   49 n   [ 5,493..14,521] x [     0.. 9,065] :   <  2,028 diffs  ( 91 trace pts)
   1   68 n   [12,809..14,521] x [     0.. 1,758] :   <    373 diffs  ( 17 trace pts)
   1  147 c   [     0..13,352] x [   854..14,069] :   <  2,993 diffs  (133 trace pts)
   1  231 n   [10,892..14,521] x [     0.. 3,735] :   <    816 diffs  ( 37 trace pts)
   1  292 c   [ 3,835..14,521] x [     0..10,702] :   <  2,353 diffs  (107 trace pts)
   1  335 n   [ 7,569..14,521] x [     0.. 7,033] :   <  1,544 diffs  ( 70 trace pts)
   1  377 c   [ 9,602..14,521] x [     0.. 5,009] :   <  1,104 diffs  ( 49 trace pts)
   1  414 c   [ 6,804..14,521] x [     0.. 7,812] :   <  1,745 diffs  ( 77 trace pts)
   1  415 c   [     0.. 3,613] x [ 7,685..11,224] :   <    840 diffs  ( 36 trace pts)
   1  445 c   [ 9,828..14,521] x [     0.. 4,789] :   <  1,036 diffs  ( 47 trace pts)
   1  464 n   [     0.. 1,942] x [12,416..14,281] :   <    411 diffs  ( 19 trace pts)

...

daligner's People

Contributors

thegenemyers avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

daligner's Issues

Kmer setting in daligner

Dear experts:

    I am do de novo assemble recently of PacBio data using Falcon program, it seems that kmer settings is an important parameter for assemble, but I cannot find any parameters for setting kmer in daligner. But   -t is a parameter to cap the kmer amount, it is confusing for me. Looking forward to your reply. Thank you. 

Add option for read length output to 'LAshow' and 'LAdump'

Hi,

would it be possible to add a command line parameter to 'LAshow' and 'LAdump' so that it also outputs the length of the reads involved in a given overlap?

Currently I can get the overlaps from the LAS files with 'LAdump' and the read length from the DB using 'DBdump' but then I have to match them up myself. As both 'LAshow' and 'LAdump' use the DB as input anyway it would be neat to have the possibility to also output the read length. This would allow to bypass some of the scripts of the FALCON-intergrate tool set and generate the *.ovlp files needed for assembly directly using only DAZZLER tools.

Thanks,

Dirk

How to get more sensitive result?

Hi Gene,
When I use daligner align Pacbio to the DeBruijn Graph from Illumina Reads, where need more sensitive alignments cover graph path, But I found some graph edge is missing, my script is "PCmapper -k15 -l100 -w7 -h20 K203 head2M >K203.sh". How can I set parameters get more sensitive alignment.
Tks

Desheng

LAmerge example

Hi,
Is LAmerge test.las *.las a valid syntax?

Thank you in advance.

Michal

map reads to reference

Hello,
can daligner be used to map PacBio reads to a fasta reference without using the DB step for the reference? Fitting the reference-fasta into the DB requires the sequence to be split into 65 Kbp bits which likely will reduce mappablity of reads crossing the split-sites.
Thank you,
Michel

daligner Segmentation fault (core dumped)

Hi,

I'm working on a vegetal genome (~600Mb)
My commands :
DAZZ_DB/fasta2DB vriparia vriparia_reads.fasta
DAZZ_DB/DBsplit vriparia -x500 -s200
DALIGNER/HPCdaligner vriparia | bash -v

I get the following message :
[...]
Daligner jobs (10878)
daligner vriparia.1 vriparia.1
bash: line 2: 20645 Segmentation fault (core dumped) daligner vriparia.1 vriparia.1
[...]

But the script continue and the same error message is given for every vriparia.N

Thanks for your answer

Regards

Alexis

daligner.c:233:20: error: โ€˜INT32_MAXโ€™ undeclared

Ubuntu 14.10 install error

$ make
gcc -O3 -Wall -Wextra -fno-strict-aliasing -o daligner daligner.c filter.c align.c DB.c QV.c -lpthread -lm
daligner.c: In function โ€˜Merge_Sizeโ€™:
daligner.c:233:20: error: โ€˜INT32_MAXโ€™ undeclared (first use in this function)
     ev[mtop].idx = INT32_MAX;
                    ^
daligner.c:233:20: note: each undeclared identifier is reported only once for each function it appears in
daligner.c: In function โ€˜Merge_Tracksโ€™:
daligner.c:318:20: error: โ€˜INT32_MAXโ€™ undeclared (first use in this function)
     ev[mtop].idx = INT32_MAX;
                    ^
Makefile:8: recipe for target 'daligner' failed
make: *** [daligner] Error 1

Can be fixed by adding the following line to daligner.c

#include <stdint.h>

Segfaul in tuple_thread

I have hit a seg fault while using daligner on a set of simulated D.mel reads:

Command line:

daligner -T32 -M56 dmel_30.19 dmel_30

The crash occured on a certain block (19 in this case), thus the block id on the command line. The original crash was using an HPC.daligner generated script on the whole dataset though.

Debugger output:

(gdb) run -T32 -M56 dmel_30.19 dmel_30
...
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffec4ce1700 (LWP 7025)]
0x00000000004059f6 in tuple_thread (arg=0x7fffffffd500) at filter.c:464
464 list[n].read = i;

Identify (-I) flag in HPC.daligner

daligner has the -I flag for computing alignments between a read and itself. This flag is missing in HPC.daligner. I think it would be useful to have it in HPC.daligner as well.

Add warning to daligner if -d specified and dust file not found

For example consider adding here:

dtrack = Load_Track(&block,"dust");
:

diff --git a/daligner.c b/daligner.c
index 399218b..82b24b7 100644
--- a/daligner.c
+++ b/daligner.c
@@ -185,9 +185,11 @@ static HITS_DB *read_DB(char *name, int dust, int *isdam)
   if (status < 0)
     exit (1);

-  if (dust)
+  if (dust){
     dtrack = Load_Track(&block,"dust");
-  else
+    if(dtrack == NULL)
+        fprintf(stderr,"dust file not found\n");
+ } else
     dtrack = NULL;

   Trim_DB(&block);

LAcheck: Too many alignment records

$ LAcheck -v raw_reads raw_reads.195.raw_reads.337.C1.las
  raw_reads.195.raw_reads.337.C1: Too many alignment records

The subreads DB is 45GB. If you want it, I could try to make it available for download. Any guesses what could cause this? We are not yet using repeat-masking. Would that be your best recommendation?

extract readname from .las

Dear Gene,

I would like to extract reads (based on their name) which mapped to a reference genome.
Is it possible to extract the read names instead of their index numbers (second column in the .las file?) with LAshow?
Or could i just parse LAshow output [2nd column] and pipe it to DBshow with the -n option?
This works but i am not sure if i get the correct results.

Another thing which was bugging me were the columns 1 and 3 from the .las output.
Could you tell me what the integer at col_1 and 'n' or 'c' at col_3 stand for?

paxi_mt.50smrtex: 420,488 records
col: [1]   [2  ][3] [                 4                 ]        [   5       ]  [     6       ]
  1         128 c   [ 21,982.. 23,180] x [ 7,448.. 8,710] :   <    125 diffs  ( 13 trace pts)
  1         291 n   [ 22,018.. 23,180] x [ 3,095.. 4,380] :   <    185 diffs  ( 12 trace pts)
  1         386 c   [ 21,969.. 23,167] x [ 4,940.. 6,179] :   <    138 diffs  ( 13 trace pts)
  1         463 n   [ 43,537.. 45,191] x [   323.. 2,249] :   <    327 diffs  ( 17 trace pts)
  1         711 n   [ 21,980.. 23,169] x [ 4,799.. 6,069] :   <    173 diffs  ( 13 trace pts)
  1         775 c   [ 21,976.. 23,180] x [11,411..12,648] :   <    103 diffs  ( 13 trace pts)
  1         785 c   [ 21,968.. 23,163] x [ 1,012.. 2,214] :   <    107 diffs  ( 13 trace pts)

Thank you,
Michel

quiva2DB unsucessfull

Helloo
I was playing the last days with your database system and aligner.
Today, I decided to add on top the quiva files to my database with your latest build - which was unsuccessful.
Either it dumps the core or it complains about absent headers. I extracted the files with your dextractor (dextract -o $i ). Here an example - I created each time a new DB

E.g:

fasta2DB TEST m130905_003016_42182_c100546852550000001823086111241312_s1_p0.fasta
quiva2DB -v TEST.db m130905_003016_42182_c100546852550000001823086111241312_s1_p0.quiva
Analyzing 'm130905_003016_42182_c100546852550000001823086111241312_s1_p0' ...
Compressing 'm130905_003016_42182_c100546852550000001823086111241312_s1_p0' ...
Segmentation fault (core dumped)

DBrm TEST.db
fasta2DB -v TEST m131206_154549_42182_c100602072550000001823109806171457_s1_p0.fasta
Adding 'm131206_154549_42182_c100602072550000001823109806171457_s1_p0' ...
quiva2DB -v TEST m131206_154549_42182_c100602072550000001823109806171457_s1_p0.quiva
Analyzing 'm131206_154549_42182_c100602072550000001823109806171457_s1_p0' ...
Line 235969: Header in quiv file is missing

HPC.daligner -d option cleanup

Using the -d option of HPC.daligner the program outputs a section containing sequences like

cd dir
rm files
cd ..

which is not consistent with the semantics that every line of a section can be executed in parallel to get the same result as when running the lines sequentially.

I would suggest to replace this by

cd dir ; rm files ; cd ..

HPC.daligner keep printing LAcheck

Dear,
I want to use DALIGNER to map raw reads of a dataset to the contigs produced by some assembler.
Firstly, I convert the raw data into DB by fasta2DB and contigs into DAM. The statistics of this dataset:
1
The statistics of the contigs:
2
But when I run :
./HPC.daligner -k15 -h50 -e.7 data.db contig.dam >r2c.sh
The program would not stop and kept printing โ€œLAcheck -vS contig dataโ€such like that (C : contig.dam, E: data.db):
2016-07-21 11 15 53
I donโ€™t know why about that. And I try to run:
/HPC.daligner -k15 -h50 -e.7 data.db >r2r.sh
The program run successfully and produced corresponding file (a 418M file).
Hence, I thought there was some problem in the contig.dam, and I replaced it with the contig dam file of other dataset, or data db file of other dataset. But this problem still existed (keep printing โ€œLAcheck -vS contig dataโ€).
Then, I don't know why it happens (looks like that there is a dead loop).

LAshow -o dissapeared

Hello
I was just curious why the option -o "optimal" for LAshow disappeared . I liked the idea behind it , used it in a few scripts and am now curious whether the concept did not hold up or whether it was a technical decisions ?
Cheers

memory error

Every once in a blue moon, I am getting the following error. The machine has plenty of RAM (512GB) and so the error is possibly related to dalign itself.

Do you need any more information?


*** glibc detected *** ../../DALIGN-code/Dalign/daligner: malloc(): memory corruption: 0x000000000074e4e0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fac475f3a16]
/lib/x86_64-linux-gnu/libc.so.6(+0x79a83)[0x7fac475f6a83]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7fac475f88a0]
../../DALIGN-code/Dalign/daligner[0x411606]
../../DALIGN-code/Dalign/daligner[0x405181]
../../DALIGN-code/Dalign/daligner[0x401958]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7fac4759bead]
../../DALIGN-code/Dalign/daligner[0x401b6d]
======= Memory map: ========
00400000-00417000 r-xp 00000000 08:04 127706273 /home/genemyers/Genome-assemblies/Elec-eel/DALIGN-code/Dalign/daligner
00616000-00617000 rw-p 00016000 08:04 127706273 /home/genemyers/Genome-assemblies/Elec-eel/DALIGN-code/Dalign/daligner
0074e000-007a1000 rw-p 00000000 00:00 0 [heap]
7fac40000000-7fac40021000 rw-p 00000000 00:00 0
7fac40021000-7fac44000000 ---p 00000000 00:00 0
7fac45363000-7fac45378000 r-xp 00000000 08:01 1967226 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fac45378000-7fac45578000 ---p 00015000 08:01 1967226 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fac45578000-7fac45579000 rw-p 00015000 08:01 1967226 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fac45579000-7fac4557a000 ---p 00000000 00:00 0
7fac4557a000-7fac45d7a000 rw-p 00000000 00:00 0
7fac45d7a000-7fac45d7b000 ---p 00000000 00:00 0
7fac45d7b000-7fac4657b000 rw-p 00000000 00:00 0
7fac4657b000-7fac4657c000 ---p 00000000 00:00 0
7fac4657c000-7fac46d7c000 rw-p 00000000 00:00 0
7fac46d7c000-7fac46d7d000 ---p 00000000 00:00 0
7fac46d7d000-7fac4757d000 rw-p 00000000 00:00 0
7fac4757d000-7fac476ff000 r-xp 00000000 08:01 1977025 /lib/x86_64-linux-gnu/libc-2.13.so
7fac476ff000-7fac478ff000 ---p 00182000 08:01 1977025 /lib/x86_64-linux-gnu/libc-2.13.so
7fac478ff000-7fac47903000 r--p 00182000 08:01 1977025 /lib/x86_64-linux-gnu/libc-2.13.so
7fac47903000-7fac47904000 rw-p 00186000 08:01 1977025 /lib/x86_64-linux-gnu/libc-2.13.so
7fac47904000-7fac47909000 rw-p 00000000 00:00 0
7fac47909000-7fac4798a000 r-xp 00000000 08:01 1977031 /lib/x86_64-linux-gnu/libm-2.13.so
7fac4798a000-7fac47b89000 ---p 00081000 08:01 1977031 /lib/x86_64-linux-gnu/libm-2.13.so
7fac47b89000-7fac47b8a000 r--p 00080000 08:01 1977031 /lib/x86_64-linux-gnu/libm-2.13.so
7fac47b8a000-7fac47b8b000 rw-p 00081000 08:01 1977031 /lib/x86_64-linux-gnu/libm-2.13.so
7fac47b8b000-7fac47ba2000 r-xp 00000000 08:01 1968929 /lib/x86_64-linux-gnu/libpthread-2.13.so
7fac47ba2000-7fac47da1000 ---p 00017000 08:01 1968929 /lib/x86_64-linux-gnu/libpthread-2.13.so
7fac47da1000-7fac47da2000 r--p 00016000 08:01 1968929 /lib/x86_64-linux-gnu/libpthread-2.13.so
7fac47da2000-7fac47da3000 rw-p 00017000 08:01 1968929 /lib/x86_64-linux-gnu/libpthread-2.13.so
7fac47da3000-7fac47da7000 rw-p 00000000 00:00 0
7fac47da7000-7fac47dc7000 r-xp 00000000 08:01 1968930 /lib/x86_64-linux-gnu/ld-2.13.so
7fac47f8b000-7fac47faf000 rw-p 00000000 00:00 0
7fac47fc4000-7fac47fc6000 rw-p 00000000 00:00 0
7fac47fc6000-7fac47fc7000 r--p 0001f000 08:01 1968930 /lib/x86_64-linux-gnu/ld-2.13.so
7fac47fc7000-7fac47fc8000 rw-p 00020000 08:01 1968930 /lib/x86_64-linux-gnu/ld-2.13.so
7fac47fc8000-7fac47fc9000 rw-p 00000000 00:00 0
7fff353c8000-7fff353e9000 rw-p 00000000 00:00 0 [stack]
7fff353ff000-7fff35400000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

Still Getting Duplicated Regions Error

I used daligner in FALCON pipeline, and I had updated to the newest version. But in the LACheck steps, I still melt this error. I tried to solve it by masking the LACheck steps. But, I did not know if there might be several unknown mistakes. And, how could I avoid this.

Thanks.

munmap_chunk(): invalid pointer: 0x00007fff77201010

We are trying to align two smaller genomes (160Mb) to each other:

*** glibc detected *** /ebio/abt6/fbemm/software/dazzler/daligner/daligner: munmap_chunk(): invalid pointer: 0x00007fff77201010 ***
[Thread 0x7fff03ae2700 (LWP 22826) exited]
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7de16)[0x7ffff757fe16]
/ebio/abt6/fbemm/software/dazzler/daligner/daligner[0x40510c]
/ebio/abt6/fbemm/software/dazzler/daligner/daligner[0x40529f]
/ebio/abt6/fbemm/software/dazzler/daligner/daligner[0x4028ec]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7ffff752376d]
/ebio/abt6/fbemm/software/dazzler/daligner/daligner[0x401349]
======= Memory map: ========
00400000-00422000 r-xp 00000000 00:15 1275101 /ebio/abt6/fbemm/software/dazzler/daligner/daligner
00622000-00623000 rw-p 00022000 00:15 1275101 /ebio/abt6/fbemm/software/dazzler/daligner/daligner
00623000-00647000 rw-p 00000000 00:00 0 [heap]
7fff030cb000-7fff030e1000 r-xp 00000000 00:15 9477371 /ebio/abt6/fbemm/software/gcc-4.9.1_sub/lib64/libgcc_s.so.1
7fff030e1000-7fff032e0000 ---p 00016000 00:15 9477371 /ebio/abt6/fbemm/software/gcc-4.9.1_sub/lib64/libgcc_s.so.1
7fff032e0000-7fff032e1000 rw-p 00015000 00:15 9477371 /ebio/abt6/fbemm/software/gcc-4.9.1_sub/lib64/libgcc_s.so.1
7fff032e1000-7fff032e2000 rw-p 00000000 00:00 0
7fff032e2000-7fff032e3000 ---p 00000000 00:00 0
7fff032e3000-7fff03ae3000 rw-p 00000000 00:00 0
7fff03ae3000-7fff03ae4000 ---p 00000000 00:00 0
7fff03ae4000-7fff042e4000 rw-p 00000000 00:00 0
7fff042e4000-7fff042e5000 ---p 00000000 00:00 0
7fff042e5000-7fff04ae5000 rw-p 00000000 00:00 0
7fff04ae5000-7fff04ae6000 ---p 00000000 00:00 0
7fff04ae6000-7ffff7502000 rw-p 00000000 00:00 0
7ffff7502000-7ffff76b6000 r-xp 00000000 fc:00 628450 /lib/x86_64-linux-gnu/libc-2.15.so
7ffff76b6000-7ffff78b6000 ---p 001b4000 fc:00 628450 /lib/x86_64-linux-gnu/libc-2.15.so
7ffff78b6000-7ffff78ba000 r--p 001b4000 fc:00 628450 /lib/x86_64-linux-gnu/libc-2.15.so
7ffff78ba000-7ffff78bc000 rw-p 001b8000 fc:00 628450 /lib/x86_64-linux-gnu/libc-2.15.so
7ffff78bc000-7ffff78c1000 rw-p 00000000 00:00 0
7ffff78c1000-7ffff79bc000 r-xp 00000000 fc:00 628460 /lib/x86_64-linux-gnu/libm-2.15.so
7ffff79bc000-7ffff7bbb000 ---p 000fb000 fc:00 628460 /lib/x86_64-linux-gnu/libm-2.15.so
7ffff7bbb000-7ffff7bbc000 r--p 000fa000 fc:00 628460 /lib/x86_64-linux-gnu/libm-2.15.so
7ffff7bbc000-7ffff7bbd000 rw-p 000fb000 fc:00 628460 /lib/x86_64-linux-gnu/libm-2.15.so
7ffff7bbd000-7ffff7bd5000 r-xp 00000000 fc:00 628451 /lib/x86_64-linux-gnu/libpthread-2.15.so
7ffff7bd5000-7ffff7dd4000 ---p 00018000 fc:00 628451 /lib/x86_64-linux-gnu/libpthread-2.15.so
7ffff7dd4000-7ffff7dd5000 r--p 00017000 fc:00 628451 /lib/x86_64-linux-gnu/libpthread-2.15.so
7ffff7dd5000-7ffff7dd6000 rw-p 00018000 fc:00 628451 /lib/x86_64-linux-gnu/libpthread-2.15.so
7ffff7dd6000-7ffff7dda000 rw-p 00000000 00:00 0
7ffff7dda000-7ffff7dfc000 r-xp 00000000 fc:00 628459 /lib/x86_64-linux-gnu/ld-2.15.so
7ffff7fc1000-7ffff7fe6000 rw-p 00000000 00:00 0
7ffff7ff9000-7ffff7ffb000 rw-p 00000000 00:00 0
7ffff7ffb000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00022000 fc:00 628459 /lib/x86_64-linux-gnu/ld-2.15.so
7ffff7ffd000-7ffff7fff000 rw-p 00023000 fc:00 628459 /lib/x86_64-linux-gnu/ld-2.15.so
7ffffffdd000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

gdb stack trace:

Program received signal SIGABRT, Aborted.
0x00007ffff75380d5 in __GI_raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.

Happens also if target/subject only contain a single large sequence (1Mb).

Core dump from daligner (filter.c)

Hi Gene,

Thanks for producing this tool.

I'm running it as part of Jarred Simpson's nanocorrect pipeline, but I'm getting a core dump. I've re-compiled with -g and have pulled out the stack trace from gdb (see below).

There seems to be a pointer issue... is this something you're aware of, or am I doing something unexpected?

Thanks,
Richard


[leggettr@TGAC-HPC nanocorrect]$ gdb ../DALIGNER/daligner core.26355
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /tgac/workarea/group-si/leggettr/assembly/DALIGNER/daligner...done.
[New LWP 26904]
[New LWP 26906]
[New LWP 26905]
[New LWP 26907]
[New LWP 26355]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `daligner -t5 -mdust ogrady_pass.2 ogrady_pass.1 ogrady_pass.2'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000404c68 in tuple_thread (arg=0x7ffc23ea4ad0) at filter.c:456
456                     c = (c << 2) | s[p++];
(gdb) info stack
#0  0x0000000000404c68 in tuple_thread (arg=0x7ffc23ea4ad0) at filter.c:456
#1  0x0000003738807a51 in start_thread () from /lib64/libpthread.so.0
#2  0x00000037384e893d in clone () from /lib64/libc.so.6

Problem with -M

The -M flags works great ... until it is set too low. Then, it seems to be completely ignored, and memory use rises rapidly. That makes -M less useful as a way to limit memory use. We end up relying on -t, which is harder to estimate.

Is this a bug, or intended behavior?

malloc error in daligner

Operating System: OSX

Dataset: PacBio example dataset - http://datasets.pacb.com.s3.amazonaws.com/2014/c_elegans/40X/raw_data/2590970/0001/Analysis_Results/m140928_184123_42139_c100719602550000001823155305141590_s1_p0.1.bax.h5

Process:
dextract -v -omatt/data m140928_184123_42139_c100719602550000001823155305141590_s1_p0.1.bax.h5
fasta2DB matt/database matt/data.fasta
quiva2DB matt/database matt/data.quiva
DBsplit matt/database
HPCdaligner -d -t2 matt/database | csh -v

Error:
6899 vs 6901 : 3
6899 vs 6970 : 3
6901 vs 6902 : 3
6901 vs 6912 : 3
6901 vs 6930 : 3
6908 vs 6933 : 3
6911 vs 6976 : 3
6913 vs 6918 : 3
6915 vs 6951 : 3
6917 vs 6955 : 5
6918 vs 6962 : 3
6919 vs 6971 : 3
6920 vs 6980 : 3
6921 vs 6970 : 3
6946 vs 6951 : 3
6951 vs 6970 : 3
6953 vs 6975 : 3
daligner(10634,0x7fff732db300) malloc: *** error for object 0x8859: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort

LAsort database.1.database.1.C0 database.1.database.1.N0 database.1.database.1.C1 database.1.database.1.N1 database.1.database.1.C2 database.1.database.1.N2 database.1.database.1.C3 database.1.database.1.N3 && LAmerge database.1 database.1.database.1.C0.S database.1.database.1.N0.S database.1.database.1.C1.S database.1.database.1.N1.S database.1.database.1.C2.S database.1.database.1.N2.S database.1.database.1.C3.S database.1.database.1.N3.S && rm database.1.database.1.C0.S.las database.1.database.1.N0.S.las database.1.database.1.C1.S.las database.1.database.1.N1.S.las database.1.database.1.C2.S.las database.1.database.1.N2.S.las database.1.database.1.C3.S.las database.1.database.1.N3.S.las

Other Question:
Once this step completes how do I perform the de novo assembly piece? I couldn't find any dazzler executable.

regular tagged releases

I see that you've tagged the first release, but it looks (based on commit messages and twitter posts) like you've made one other major release and another bug-fix release since then. It would be helpful if you could continue to tag the relevant commits so that it is clear which snapshot we should be using and which are just intermediate states.

Many thanks and regards
Afif

align SMRT cells against a reference

Hi,
How Is it possible to align RSII and Sequel SMRT cells against a reference genome and received mapped and unmapped reads?

Thank you in advance.

Michal

Warnings

Would you please fix these here? Then we could drop many of our changes.

filter.c: In function โ€˜lex_threadโ€™:
filter.c:222:7: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 4 has type โ€˜int64โ€™ [-Wformat=]
       printf("\n shift=%d, LEX_last=%d, n=%d", shift, LEX_last, n);
       ^
filter.c:286:13: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 3 has type โ€˜long long unsigned intโ€™ [-Wformat=]
             { printf("\n @=%p+%d i=%6d,c&=%3d,x=%3d,c=%d ", (void*)trg, (sizeof(Double)*x), i, (c&BMASK), x, c);
             ^
filter.c:286:13: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 4 has type โ€˜int64โ€™ [-Wformat=]
filter.c:286:13: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 5 has type โ€˜long long unsigned intโ€™ [-Wformat=]
filter.c:286:13: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 6 has type โ€˜int64โ€™ [-Wformat=]
filter.c:286:13: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 7 has type โ€˜uint64โ€™ [-Wformat=]
filter.c:302:5: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 3 has type โ€˜int64โ€™ [-Wformat=]
     { printf("\n Finished @%p n=%d", (void*)trg, n);
     ^
filter.c: In function โ€˜Sort_Kmersโ€™:
filter.c:762:3: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 3 has type โ€˜long unsigned intโ€™ [-Wformat=]
   if (VERBOSE) printf("\n Allocated %d of %d (%d bytes) at %p", (kmers+1), sizeof(KmerPos), (sizeof(KmerPos)*(kmers+1)), (void*)trg);
   ^
filter.c:762:3: warning: format โ€˜%dโ€™ expects argument of type โ€˜intโ€™, but argument 4 has type โ€˜long unsigned intโ€™ [-Wformat=]
filter.c: In function โ€˜Entwineโ€™:
filter.c:1213:21: warning: variable โ€˜oflareโ€™ set but not used [-Wunused-but-set-variable]
   int   strt, iflare, oflare;
                     ^
filter.c:1213:13: warning: variable โ€˜iflareโ€™ set but not used [-Wunused-but-set-variable]
   int   strt, iflare, oflare;
             ^
filter.c: In function โ€˜Match_Filterโ€™:
filter.c:2234:13: warning: suggest parentheses around operand of โ€˜!โ€™ or change โ€˜&โ€™ to โ€˜&&โ€™ or โ€˜!โ€™ to โ€˜~โ€™ [-Wparentheses]
         if (! MG_self & SYMMETRIC)
             ^
filter.c: In function โ€˜report_threadโ€™:
filter.c:1757:28: warning: โ€˜doBโ€™ may be used uninitialized in this function [-Wmaybe-uninitialized]
                         if (doB)
                            ^
filter.c:1621:18: note: โ€˜doBโ€™ was declared here
         int   doA, doB;
                  ^
filter.c:1621:13: note: โ€˜doAโ€™ was declared here
         int   doA, doB;
             ^
filter.c:1736:28: warning: โ€˜doAโ€™ may be used uninitialized in this function [-Wmaybe-uninitialized]
                       { if (doA)
                            ^
align.c: In function โ€˜forward_extendโ€™:
align.c:1917:20: warning: variable โ€˜boffโ€™ set but not used [-Wunused-but-set-variable]
   int     avail, cmax, boff;
                    ^
align.c: In function โ€˜reverse_extendโ€™:
align.c:2436:20: warning: variable โ€˜boffโ€™ set but not used [-Wunused-but-set-variable]
   int     avail, cmax, boff;
                    ^
daligner.c: In function โ€˜mainโ€™:
daligner.c:741:25: warning: โ€˜aindexโ€™ may be used uninitialized in this function [-Wmaybe-uninitialized]
             Match_Filter(aroot,ablock,aroot,bblock,aindex,alen,bindex,blen,1,asettings);
                         ^
LAshow.c: In function โ€˜mainโ€™:
LAshow.c:566:38: warning: โ€˜bseqโ€™ may be used uninitialized in this function [-Wmaybe-uninitialized]
                     aln->bseq = bseq - (aln->blen - bmax);
                                      ^

daligner segmentation fault

Hello,

I try to start running the dazzler pipeline with C2-P4 PacBio reads.
I created the DB and split it into the default 400 Mb chunks (gave me 77 chunks from 62 SMRTcells).

When i try to execute the first alignment by daligner on a 256 Gb RAM node, a segmentation fault occurs:

/opt/sge/default/spool/binfservas10/job_scripts/38383: line 3: 
41148 Segmentation fault      /data/users/mmoser/DALIGNER-master/daligner -v P_ax.1 P_ax.1

The same happens when i use a 512 Gb RAM node:

/opt/sge/default/spool/binfservas01/job_scripts/38389: line 3: 
93693 Segmentation fault      /data/users/mmoser/DALIGNER-master/daligner -v P_ax.1 P_ax.1

Stdout reports me for both jobs:

Kmer count = 338,807,354
Using 5.05Gb of space
Index occupies 2.52Gb

Kmer count = 338,807,354
Using 5.05Gb of space
Index occupies 2.52Gb

Hit count = 1,756,849,691
Highwater of 31.23Gb space

Do you know how this can be resolved? Should i try to use different block size?

Thank you, Michel

LAdump: not all trace-points

It seems that LAdump -t shows the trace-points for the first alignment only.

In fact, LAdump -cd shows all alignements, but LAdump -cdt shows only the first.

E.g.

+ P 3
% P 1
+ T 564
% T 258
@ T 258
P 92772 92773 c .
C 5464 18218 0 12802
T 129
...

Even though it says P3, there are no other sections. And we see only 129 trace-points in the dump, not 564.

Can you verify that?

Crash in free(), but lost test-case

Unfortunately, the old test-case got deleted. (I should have copied it.) But I can post what I learned about it:

This crashes repeatably, in only ~3 minutes:
  daligner -v -k18 -w8 -h480 -t16 -H15000 -e0.7 -l4800 -M32 raw_reads.202 raw_reads.112

I'll try to get a stack-trace ...

align.c:159
 155 void Free_Work_Data(Work_Data *ework)
 156 { _Work_Data *work = (_Work_Data *) ework;
 157   if (work->vector != NULL)
 158     free(work->vector);
 159   if (work->cells != NULL)
 160     free(work->cells);
 161   if (work->trace != NULL)
 162     free(work->trace);
 163   if (work->points != NULL)
 164     free(work->points);
 165   free(work);
 166 }

filter.c:1834
1500 void Match_Filter(char *aname, HITS_DB *ablock, char *bname, HITS_DB *bblock,
1501                   int self, int comp, Align_Spec *aspec)
...
1833     for (i = 0; i < NTHREADS; i++)
1834       Free_Work_Data(parmr[i].work);

daligner.c:695
519 int main(int argc, char *argv[])
...
695             Match_Filter(aroot,&cblock,broot,&bblock,0,1,asettings);
696             Close_DB(&bblock);
697             free(broot);


Memory is too large for valgrind (24GB). I've added some memory diagnostics, but nothing looks bad. The problem occurs near the end of Match_Filter, between New_Work_Data() and Free_Work_Data(). When I have time, I will try to link with google-tcmalloc, which has a nice memory-checker.
Hi, Chris:

For some sequences comparison, the function Local_Alignment() in align.c does not reserve enough memory.  (It is hard to predict the right amount of memory for โ€œarbitrary sequencesโ€.)  If you chance ling 1769, `wsize = VectorEl*10000;` to `wsize = VectorEl*20000;` . The task will finish without segfault. (see where wsize is used for allocating memoryโ€ฆ etc.) I hope this helps.

Cheers,
Jason
Yes, that "fix" works for this example, but it doesn't explain the crash. Memory is being freed twice somewhere. It's probably caused by another buffer over-run somehow. We need to know where, and we need to detect the problem at runtime to avoid these annoying, uninformative crashes.

That's all I have right now.

warnings

LAcheck.c:269:43: warning: โ€˜has_chainsโ€™ may be used uninitialized in this function [-Wmaybe-uninitialized]
               { if (CHAIN_NEXT(ovl.flags) || !has_chains)

Removing repeats before assembly

Dear Dr. Myers,
Where could I find your software to identify "Repeats before Assembly" which you talked about in "Is Perfect Assembly Perfect?"?

Thank you in advance.

Michal

Question regarding dust-masking

Hello
I was playing around with different ways to mask certain regions (e.g. repeats) prior aligning them with daligner and changed the DBdust code in order to replace dust-masking with repeat-masking.
The problem I stumbled now upon is that I have a functional repeat-masking information but daligner always crashes with a segmentation fault if I add the "-d" option.
I guess since my masked regions a rather long it might pose a problem for the k-mer selection?
And thoughts or input on that ?
Thanks, cheers
Emanuel

daligner crash

This is the daligner that is part of v0.3.3 Falcon.

With a particular pair of files it gives (see below).

Let me know what other information I can provide you...

.
.
.
Building index for raw_reads.193

Kmer count = 399,655,782
Using 11.91Gb of space
Revised kmer count = 334,780,204
Index occupies 4.99Gb

Comparing c(raw_reads.545) to raw_reads.193

Capping mutual k-mer matches over 10000 (effectively -t100)
Hit count = 997,361,227
Highwater of 39.67Gb space
*** glibc detected *** daligner: free(): invalid pointer: 0x00002bb67c0008c0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x37b6c75e66]
daligner[0x4097a2]
daligner[0x406f1d]
daligner[0x402817]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x37b6c1ed5d]
daligner[0x401339]
======= Memory map: ========
00400000-00421000 r-xp 00000000 00:36 3895273357 /net/gs/vol1/home/dg
ordon/falcon150714/FALCON-integrate/DALIGNER/daligner
00620000-00621000 rw-p 00020000 00:36 3895273357 /net/gs/vol1/home/dg
ordon/falcon150714/FALCON-integrate/DALIGNER/daligner
00621000-00624000 rw-p 00000000 00:00 0
01d80000-01dd9000 rw-p 00000000 00:00 0 [heap]
37b6800000-37b6820000 r-xp 00000000 fd:00 41418757 /lib64/ld-2.12.so
37b6a1f000-37b6a20000 r--p 0001f000 fd:00 41418757 /lib64/ld-2.12.so
37b6a20000-37b6a21000 rw-p 00020000 fd:00 41418757 /lib64/ld-2.12.so
37b6a21000-37b6a22000 rw-p 00000000 00:00 0
.
.
.

"confirmed" vs. "seed" hits

Dr. Myers, I'm having trouble answering this question, so I hope you have time to explain this.

How can "confirmed hits" be greater than "seed hits"?

Still getting Duplicate overlap

[ from https://github.com/PacificBiosciences/FALCON/issues/515]
We are using Falcon-integrate 1.8.6 and for our large sequences we are still getting the occasional Duplicate Overlap that is killing the whole process.

  raw_reads.55.raw_reads.213.N3: 324,318 all OK
  raw_reads.56.raw_reads.213.C0: 332,440 all OK
  raw_reads.56.raw_reads.213.C1: Duplicate overlap (1573849 vs 6403244)
  raw_reads.56.raw_reads.213.C2: 325,068 all OK
  raw_reads.56.raw_reads.213.C3: 336,228 all OK

Based on our fc_run.cfg config are there any suggestions? We have tried reducing -s 100 and increasing the cutoff but still get the problem. We actually need to reduce the cutoff.

[General]
## config file for FALCON v1.8.6
#job_type = local
job_type = pbs
input_fofn = input.fofn
input_type = raw
length_cutoff = 5000
length_cutoff_pr = 5000
job_queue = lyra

sge_option = -l nodes=1:ppn=2,walltime=128:00:00,mem=20gb -W umask=0007
sge_option_da = -l nodes=1:ppn=4,walltime=96:00:00,mem=25gb -W umask=0007
sge_option_la = -l nodes=1:ppn=2,walltime=96:00:00,mem=24gb -W umask=0007
sge_option_pda = -l nodes=1:ppn=8,walltime=96:00:00,mem=25gb -W umask=0007
sge_option_pla = -l nodes=1:ppn=2,walltime=96:00:00,mem=26gb -W umask=0007
sge_option_fc = -l nodes=1:ppn=1,walltime=96:00:00,mem=25gb -W umask=0007
sge_option_cns = -l nodes=1:ppn=8,walltime=96:00:00,mem=24gb -W umask=0007

pa_concurrent_jobs = 8
cns_concurrent_jobs = 8
ovlp_concurrent_jobs = 8

pa_HPCdaligner_option =  -v -B128 -e.70 -l1000 -s1000 -M24
ovlp_HPCdaligner_option = -v -B128 -h60 -e.96 -l500 -s1000 -M24

pa_DBsplit_option = -x500 -s300
ovlp_DBsplit_option = -x500 -s300

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 0
overlap_filtering_setting = --max_diff 80 --max_cov 80 --min_cov 2 --bestn 10 --n_core 8

pc-dunn says "If this is on the latest DALIGNER, it is a serious problem. I suggest providing a test-case to thegenemyers, the owner of DALIGNER." What information can I provide to assist resolve this, keeping in mind that our dataset is over 100Gb.

No inter-block alignments in .las files

Hi Dazz Team,

In trying to assemble a ~900MB diploid plant genome using a ~120x dataset of PacBio reads, we are finding that well under 1% of our reads are passing from HPCdaligner to Falcon as corrected preads. Scanning the .las files from the pre-assembly, we observe that there are local alignment hits only when a block from the database is aligned with itself. For example, if the raw read DB was split into five different blocks, creating these 6 daligner jobs:

# Daligner jobs (6)
daligner -v -k18 -h50 -t18 -H18600 -e0.75 -l2500 -M4 1SMRT.1 1SMRT.1
daligner -v -k18 -h50 -t18 -H18600 -e0.75 -l2500 -M4 1SMRT.2 1SMRT.1 1SMRT.2
daligner -v -k18 -h50 -t18 -H18600 -e0.75 -l2500 -M4 1SMRT.3 1SMRT.1 1SMRT.2 1SMRT.3
daligner -v -k18 -h50 -t18 -H18600 -e0.75 -l2500 -M4 1SMRT.4 1SMRT.1 1SMRT.2 1SMRT.3 1SMRT.4
daligner -v -k18 -h50 -t18 -H18600 -e0.75 -l2500 -M4 1SMRT.5 1SMRT.1 1SMRT.2
daligner -v -k18 -h50 -t18 -H18600 -e0.75 -l2500 -M4 1SMRT.5 1SMRT.3 1SMRT.4 1SMRT.5

...what we find is that hits occur only within blocks, as indicated below in the list of produced files (note 12K = empty file):

 1.5M   1SMRT.1.1SMRT.1.C0.las
 1.5M   1SMRT.1.1SMRT.1.C1.las
 1.2M   1SMRT.1.1SMRT.1.C2.las
 1.4M   1SMRT.1.1SMRT.1.C3.las
 2.0M   1SMRT.1.1SMRT.1.N0.las
 1.8M   1SMRT.1.1SMRT.1.N1.las
 1.6M   1SMRT.1.1SMRT.1.N2.las
 1.7M   1SMRT.1.1SMRT.1.N3.las
   12k  1SMRT.1.1SMRT.2.C0.las
   12k  1SMRT.1.1SMRT.2.C1.las
   12k  1SMRT.1.1SMRT.2.C2.las
   12k  1SMRT.1.1SMRT.2.C3.las
   12k  1SMRT.1.1SMRT.2.N0.las
   12k  1SMRT.1.1SMRT.2.N1.las
   12k  1SMRT.1.1SMRT.2.N2.las
   12k  1SMRT.1.1SMRT.2.N3.las
   12k  1SMRT.1.1SMRT.3.C0.las
   12k  1SMRT.1.1SMRT.3.C1.las
   12k  1SMRT.1.1SMRT.3.C2.las
   12k  1SMRT.1.1SMRT.3.C3.las
   12k  1SMRT.1.1SMRT.3.N0.las
   12k  1SMRT.1.1SMRT.3.N1.las
   12k  1SMRT.1.1SMRT.3.N2.las
...
 1.5M   1SMRT.2.1SMRT.2.C0.las
 1.5M   1SMRT.2.1SMRT.2.C1.las
 1.4M   1SMRT.2.1SMRT.2.C2.las
 1.3M   1SMRT.2.1SMRT.2.C3.las
 2.0M   1SMRT.2.1SMRT.2.N0.las
 1.9M   1SMRT.2.1SMRT.2.N1.las
 1.7M   1SMRT.2.1SMRT.2.N2.las
 1.8M   1SMRT.2.1SMRT.2.N3.las
   12k  1SMRT.2.1SMRT.3.C0.las
   12k  1SMRT.2.1SMRT.3.C1.las
   12k  1SMRT.2.1SMRT.3.C2.las
   12k  1SMRT.2.1SMRT.3.C3.las
   12k  1SMRT.2.1SMRT.3.N0.las
   12k  1SMRT.2.1SMRT.3.N1.las
   12k  1SMRT.2.1SMRT.3.N2.las
   12k  1SMRT.2.1SMRT.3.N3.las
   12k  1SMRT.2.1SMRT.4.C0.las
   12k  1SMRT.2.1SMRT.4.C1.las
   12k  1SMRT.2.1SMRT.4.C2.las
   12k  1SMRT.2.1SMRT.4.C3.las
   12k  1SMRT.2.1SMRT.4.N0.las
   12k  1SMRT.2.1SMRT.4.N1.las
   12k  1SMRT.2.1SMRT.4.N2.las
   12k  1SMRT.2.1SMRT.4.N3.las
...
 1.7M   1SMRT.3.1SMRT.3.C0.las
 1.7M   1SMRT.3.1SMRT.3.C1.las
 1.6M   1SMRT.3.1SMRT.3.C2.las
 1.5M   1SMRT.3.1SMRT.3.C3.las
 2.2M   1SMRT.3.1SMRT.3.N0.las
 2.0M   1SMRT.3.1SMRT.3.N1.las
 2.0M   1SMRT.3.1SMRT.3.N2.las
 2.0M   1SMRT.3.1SMRT.3.N3.las
   12k  1SMRT.3.1SMRT.4.C0.las
   12k  1SMRT.3.1SMRT.4.C1.las
   12k  1SMRT.3.1SMRT.4.C2.las
   12k  1SMRT.3.1SMRT.4.C3.las
   12k  1SMRT.3.1SMRT.4.N0.las
   12k  1SMRT.3.1SMRT.4.N1.las
   12k  1SMRT.3.1SMRT.4.N2.las
   12k  1SMRT.3.1SMRT.4.N3.las
   12k  1SMRT.3.1SMRT.5.C0.las
   12k  1SMRT.3.1SMRT.5.C1.las
   12k  1SMRT.3.1SMRT.5.C2.las
   12k  1SMRT.3.1SMRT.5.C3.las
   12k  1SMRT.3.1SMRT.5.N0.las
   12k  1SMRT.3.1SMRT.5.N1.las
   12k  1SMRT.3.1SMRT.5.N2.las
   12k  1SMRT.3.1SMRT.5.N3.las
...

We have systematically adjusted HPCdaligner parameters like -l, -k, -t, -h and -s; but the pattern is robust.

Our questions:

  1. Is this exclusive alignment of reads within the same block expected?
  2. What parameter adjustment would you suggest to increase alignments across blocks?
  3. If we could obtain cross-block alignments, do you think it would result in more reads being available for the OLC assembly?

Thank you so much for your time,
Arthur Melo

For your reference:

Starting with this base HPCdaligner parameterization:

HPCdaligner -v -dal4 -e0.75 -M4 -l2500 -k18 -t18 -h50 -s500 -H18600

We have adjusted the following flags through the ranges specified, with no success in generating inter-block alignments:
-k (14-32)
-t (18-56)
-h (50-1250)
-s (100-500)

seg fault daligner

Working with Nanopore dataset.

 daligner -M10 -d lambda.1 lambda.1

database seems fine:

>more lambda.db
files =         1
   1131 lambda.pp m000000_000000_00000_c1898213712391273
blocks =         1
size =        50 cutoff =         0 all = 0
     0         0
  1131      1131

gdb stack trace

...
 1127 vs  1130 : 906   1082.. x   187..  -895 (2550)  [  900, 6362] x [    0, 5432] = 1518
                   6607.. x  5725..  -882 (2550)  [ 6528, 9031] x [ 5645, 8132] =  679
                   9380.. x  8489..  -891 (2550)  [ 9353,11909] x [ 8462,10953] =  690
                  12134.. x 11183..  -951 (2550)  No alignment 66

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff71f29bc in __GI___libc_free (mem=0x7fffffffd880) at malloc.c:2945
2945    malloc.c: No such file or directory.

dmesg

daligner[21243]: segfault at 7fffa0000000 ip 00007fb3bea5b9bc sp 00007fffa24bfbf8 error 4 in libc-2.19.so[7fb3be9d8000+1bb000

What else can I send you to help debug?

little suggestion regarding installation

Hello
I am just in the middle of installing your software on a cluster and packaging RPMs.
In the middle of it I realized that I encountered an error as your makefile specifies for the installation
always ~/bin. Obviously this is easy to overcome by changing the makefile. But could you please support in a next release common installation variables such as DESTDIR ?
Cheers

Alignments vary with block size

Hello,

I have been running FALCON and noticed that the quality of my assembly varied with my chosen block size (-s). Using block sizes of 200, 250, and 300 I started my investigation with the databases after partitioning and trimming with DBsplit.

Input data:

s200

$ DBsplit -x500 -s200 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         5
size = 200000000 cutoff =       500 all = 0
         0         0
     21809     18759
     43137     37294
     65145     56188
     87360     75255
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         18,759 reads        out of          21,809  ( 86.0%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         18,535 reads        out of          21,328  ( 86.9%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         18,894 reads        out of          22,008  ( 85.9%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

         19,067 reads        out of          22,215  ( 85.8%)
$ DBstats raw_reads.5
Statistics for all reads of length 500 bases or more

         15,492 reads        out of          18,091  ( 85.6%)

s250

$ DBsplit -x500 -s250 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size = 250000000 cutoff =       500 all = 0
         0         0
     27121     23399
     54026     46673
     81766     70463
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         23,399 reads        out of          27,121  ( 86.3%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         23,274 reads        out of          26,905  ( 86.5%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         23,790 reads        out of          27,740  ( 85.8%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

         20,284 reads        out of          23,685  ( 85.6%)

s300

$ DBsplit -x500 -s300 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size = 300000000 cutoff =       500 all = 0
         0         0
     32534     28089
     65144     56187
     98318     84642
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         28,089 reads        out of          32,534  ( 86.3%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         28,098 reads        out of          32,610  ( 86.2%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         28,455 reads        out of          33,174  ( 85.8%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

          6,105 reads        out of           7,133  ( 85.6%)

DBsplit Summary

-s # Blocks Total Reads
200 5 90,747
250 4 90,747
300 4 90,747

Since all the total number of reads was consistent between split databases, I moved on to look at the alignments produced by HPCdaligner.

HPCdaligner -M20 -dal128 -t16 -e0.7 -l1000 -s500 -h35 -H12000 raw_reads

s200

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las  raw_reads.5.las
$ LAcat raw_reads > s200.las
$ LAshow ../raw_reads s200 | head -n 2 | tail -n 1
s200: 24,540,600 records

s250

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las
$ LAcat raw_reads > s250.las
$ LAshow ../raw_reads s250.las | head -n 2 | tail -n 1
s250: 23,945,795 records

s300

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las
$ LAcat raw_reads > s300.las
$ LAshow ../raw_reads s300.las | head -n 2 | tail -n 1
s300: 22,533,518 records

Do you know why changing the block size would lead to such differences in the total number of yielded alignments?

Thanks,
Greg Zynda

Missing alignment?

Can you explain why I am missing a particular alignment? This is from a self-alignment of E.coli., I will dump the full output of LAshow later, but first here is the odd part:

 1   1 n   [   75,907..   77,167] x [1,680,259..1,681,519] :   <         0 diffs  (    13 trace pts)
 1   1 n   [   75,907..   77,162] x [1,931,810..1,933,065] :   <         0 diffs  (    13 trace pts)
 1   1 n   [1,680,259..1,681,519] x [   75,907..   77,167] :   <         0 diffs  (    14 trace pts)
 1   1 n   [1,680,259..1,681,514] x [1,931,810..1,933,065] :   <         0 diffs  (    14 trace pts)
 1   1 n   [1,931,810..1,933,065] x [   75,907..   77,162] :   <         0 diffs  (    13 trace pts)
 1   1 n   [1,931,810..1,933,065] x [1,680,259..1,681,514] :   <         0 diffs  (    13 trace pts)

That part is fine. Completely consistent. What's odd is in a match for the complement:

 1   1 c   [   75,907..   77,167] x [2,785,937..2,787,197] :   <         0 diffs  (    13 trace pts)
 1   1 c   [   75,907..   77,166] x [3,488,473..3,489,732] :   <         0 diffs  (    13 trace pts)

The problem is that the segments [2,785,937..2,787,196] and [3,488,473..3,489,732] must align perfectly to each other, since they each align perfectly to the complement of [ 75,907.. 77,166], but that alignment does not appear anywhere. I think this is a bug in how duplicates are removed, but maybe my settings are bad.

I have only 1 line of fasta (the E.coli reference) and 1 block. My daligner settings for self-alignment are:

HPCdaligner -I -v -k15 -h35 -w7 -e.98 -l500 -s100 -M48

You can try this yourself on an E.coli reference.

No alignments reported when using DALIGNER as a mapper

I am not getting any alignments when using DALIGNER as a mapper. It works great when I align PacBio reads among themselves, but when I try to align them to a DAM built from the human reference sequence (hg38.fasta), I get the output below. The output of DBstats on both (dusted) databases, looks reasonable in both the number of contigs and their length distribution. I also tried mapping the reads to a smaller smaller portion of hg38 (chr1), and also tried to map chr1.dam to itself but had the same (lack of) results.

############# DALIGNER OUTPUT ##############

$ daligner -vbA -M60 $READS.dam hg38.dam

Building index for 030915_HK_run_42246_9

Kmer count = 704,510,721
Using 21.00Gb of space
Index occupies 10.41Gb

Building index for c(030915_HK_run_42246_9)

Kmer count = 704,510,721
Using 21.00Gb of space
Index occupies 10.39Gb

Building index for hg38

Comparing c(030915_HK_run_42246_9) to hg38

0 14-mers (0.000000e+00 of matrix)
0 seed hits (0.000000e+00 of matrix)
0 confirmed hits (0.000000e+00 of matrix)

Building index for hg38

Comparing 030915_HK_run_42246_9 to hg38

0 14-mers (0.000000e+00 of matrix)
0 seed hits (0.000000e+00 of matrix)
0 confirmed hits (0.000000e+00 of matrix)

How to find out where the read loss in the db building phase comes from?

Hi,

When I run a DBstats on the raw_reads.db, the results shows quiet a lot of lost reads

DBstats raw_reads.db
Statistics for all wells in the data set

1,655,303 reads out of 4,099,447 59.6% loss
13,561,952,097 base pairs out of 24,496,694,744 44.6% loss

How can I get the reasons and the corresponding reads?

LAcheck LAS-file error: "Trace point sum != aligned interval"

Hi,

I have encountered an error that I can't explain in a few of my 'daligner' runs.

My DB has 128 blocks of 200Mb that I want to overlap as preparation for error-correction using FALCON.

All 8256 overlap runs together produce 16384 intermediate LAS files (prior to merging) of which 4 show the error mentioned in the header.

Surprisingly tho this error doesn't seem to be symmetric: the LAS file "[DB].X.[DB].Y" is OK while the file "[DB].Y.[DB].X" produced at the same time by the same 'daligner' run is not. Two of the four files show this behavior. The last two however are from the same run and both defect. In all cases X=128 (i.e. the last block in the DB).

My 'daligner' command is:

"daligner -M12 -mdust -mtan -mrep1 -mrep4 -mrep16 -b -e0.70 [DB].Y [DB].X"

and I am using commit "a9458dcb34eca61513c994f4217fb7f75915e25c"

Best,

Dirk

change align.c Base bias back to warning

Hello,

The latest version of align.c exits if it encounters a base bias worse than 80/20%. Previous versions used a warning. To make Falcon assembly with data exhibiting such biases possible it may be more beneficial to warn rather than exit.
Thanks,
Kurt

current version exits with message:

  if (match < .2)
{ EPRINTF(EPLACE,"Base bias worse than 80/20%% ! (New_Align_Spec)\n");
  free(spec);
  EXIT(NULL);
}

previous version would just warn with message:

if (match < .2)
{ fprintf(stderr,"Warning: Base bias worse than 80/20%% !\n");
bias = 3;
}

segfault mapping reads to reference

Hi,
I would like to map some P4C6 to a reference and run into segmentation faults.
Could i get some help or tutorial how to do propper mapping with daligner/HPCmapper?

Thank you!
Michel

DAM :

              1 contigs      out of               1  (100.0%)
        158,794 base pairs   out of         158,794  (100.0%)

        158,794 average contig length

DB:

      1,594,871 reads        out of       1,594,871  (100.0%)
 14,562,600,633 base pairs   out of  14,562,600,633  (100.0%)

          9,130 average read length
          5,510 standard deviation

Command used:

/home/mmoser/DAZZLER/DALIGNER/daligner -A -v -b -k20 -h50 -e.85 -M8 Pax.chloro ../15smrtex

gdb run shows:

Building index for Pax.chloro

   Kmer count = 158,775
   Using 0.00Gb of space
[New Thread 0x2aae13bf1700 (LWP 96553)]
[Thread 0x2aae13bf1700 (LWP 96553) exited]
[New Thread 0x2aae13df2700 (LWP 96554)]
....
....
[Thread 0x2aae141f4700 (LWP 96573) exited]
[Thread 0x2aae13ff3700 (LWP 96574) exited]
[Thread 0x2aae13bf1700 (LWP 96576) exited]
[Thread 0x2aae13df2700 (LWP 96575) exited]
   Index occupies 0.00Gb

Building index for c(Pax.chloro)

   Kmer count = 158,775
   Using 0.00Gb of space
[New Thread 0x2aae13bf1700 (LWP 96577)]
[Thread 0x2aae13bf1700 (LWP 96577) exited]
[New Thread 0x2aae13df2700 (LWP 96578)]
[Thread 0x2aae13df2700 (LWP 96578) exited]
[New Thread 0x2aae13ff3700 (LWP 96579)]
[Thread 0x2aae13ff3700 (LWP 96579) exited]
....
....
[Thread 0x2aae13ff3700 (LWP 96598) exited]
[Thread 0x2aae13bf1700 (LWP 96600) exited]
   Index occupies 0.00Gb

Building index for 15smrtex
[Thread 0x2aae13df2700 (LWP 96599) exited]

   Kmer count = 1,647,396,196
   Using 49.10Gb of space
[New Thread 0x2aae13bf1700 (LWP 96601)]
[New Thread 0x2aae13df2700 (LWP 96602)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aae13df2700 (LWP 96602)]
0x00000000004054c2 in biased_tuple_thread (arg=0x7ffffff85698) at filter.c:577
577             while ((x = s[p]) != 4)

gdb backtrace:

#0  0x00000000004054c2 in biased_tuple_thread (arg=0x7ffffff85698) at filter.c:577
#1  0x00002aaaaacd85aa in start_thread () from /software/lib64/libpthread.so.0
#2  0x00002aaaab2fb87d in clone () from /software/lib64/libc.so.6

Error: "Block ... contains reads < 20bp long! Run DBsplit."

Hi,

I have been trying to map some simulate pacbio reads (from PBSIM) to the human reference genome hg19. I tried to run commands as it is advised on the website and readme files.

For the reference genome:

$ fasta2DAM ref hg19.fasta
$ DBsplit ref
$ DBdust ref.1
...
$ DBdust ref.14

and for the read:

$ fasta2DB read read.fasta
$ DBsplit read
$ DBdust read.1

Then, I ran HPCmapper to get the commands:

$ HPCmapper ref.dam read.db

Among the daligner commands generated by HPCmapper, the second command gives error (the first one works without any problem):

$ daligner -A -k20 -h50 -e.85 ref.2 read.1
daligner: Block ref.2 contains reads < 20bp long !  Run DBsplit.

But I have already used DBsplit both for the reference genome and the reads. Could you give more information about this error and how I can resolve it?

Thanks!

Trace point sum != aligned interval

Dear Gene,

I am running daligner and LAcheck on P5-C3 data for two projects and repeatedly get the following error:

Trace point sum != aligned interval

Log output does not show any problems:

Building index for raw_reads.11

Kshift=28
BSHIFT=8
TooFrequent=16
(Kshift-1)/BSHIFT + (TooFrequent < INT32_MAX)=4
sizeof(KmerPos)=16
nreads=70565
Kmer=14
block->reads[nreads].boff=400076603
kmers=399088693
sizeof(KmerPos)*(kmers+1)=6385419104
Allocated 399088694 of 16 (6385419104 bytes) at 0x2acd77474010
Kmer count = 399,088,693
Using 11.89Gb of space
Revised kmer count = 341,657,466
Index occupies 5.09Gb

Comparing raw_reads.47 to raw_reads.11

Capping mutual k-mer matches over 10000 (effectively -t100)
Hit count = 1,035,740,743
Highwater of 36.00Gb space

 1,035,740,743 14-mers (6.473235e-09 of matrix)
     1,891,766 seed hits (1.182327e-11 of matrix)
     1,365,922 confirmed hits (8.536821e-12 of matrix)

Building index for c(raw_reads.11)

Kshift=28
BSHIFT=8
TooFrequent=16
(Kshift-1)/BSHIFT + (TooFrequent < INT32_MAX)=4
sizeof(KmerPos)=16
nreads=70565
Kmer=14
block->reads[nreads].boff=400076603
kmers=399088693
sizeof(KmerPos)*(kmers+1)=6385419104
Allocated 399088694 of 16 (6385419104 bytes) at 0x2acd77474010
Kmer count = 399,088,693
Using 11.89Gb of space
Revised kmer count = 341,657,466
Index occupies 5.09Gb

Comparing raw_reads.47 to c(raw_reads.11)

Capping mutual k-mer matches over 10000 (effectively -t100)
Hit count = 983,583,036
Highwater of 34.45Gb space

 983,583,036 14-mers (6.147256e-09 of matrix)
   1,608,346 seed hits (1.005194e-11 of matrix)
   1,154,751 confirmed hits (7.217032e-12 of matrix)

Is there fast forward way the see which read alignment is causing the issues?

Bests,
Felix

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.