Comments (1)
Hello @efr3m Sorry for the late reply. I was off the topic the for a while. I can't tell what is the exact reason, but if you use DIAMOND 0.8.x or 0.7.x, the problem may be resolved.
During my development of newer HGTector, I have noticed the changes in the DIAMOND command-line interface. Below the the DIAMOND search function in the development version of HGTector. They should work with DIAMOND 0.9.x. If necessary you can modify the HGTector version 0.2.2 code accordingly.
Reglular search:
sub diamond_search {
=pod
Run DIAMOND to search a sequence against a database.
Parameters
----------
seqs : array of array
query sequences ( id, sequence )
args : hash
search arguments, including:
- db : blastp database
- evalue : evalue cutoff
- identity : % identity cutoff
- coverage : % query coverage cutoff
- maxseqs : maximum target sequences
- threads (optional) : number of threads
- extrargs (optional) : extra command-line arguments for diamond
- diamond (optional) : diamond executable
tmpdir : str (optional)
temporary directory
Returns
-------
hash of array
hit table per query sequence
Raises
------
. If DIAMOND run fails.
Notes
-----
. The DIAMOND database should ideally be prepared as:
diamond makedb --in seqs.faa --db db --taxonmap prot.accession2taxid
=cut
my @seqs = @{ shift @_ };
my %args = %{ shift @_ };
my $tmpdir = @_ ? shift : tempdir( CLEANUP => 1 );
my $tmpin = catfile( $tmpdir, 'tmp.in' );
write_fasta( \@seqs, $tmpin );
my $diamond = $args{'diamond'} ? $args{'diamond'} : 'diamond';
my $cmd = "$diamond blastp --query $tmpin --db $args{'db'}";
$cmd .= " --evalue $args{'evalue'}" if $args{'evalue'};
$cmd .= " --id $args{'identity'}" if $args{'identity'};
$cmd .= " --query-cover $args{'coverage'}" if $args{'coverage'};
$cmd .= " --max-target-seqs $args{'maxseqs'}" if $args{'maxseqs'};
$cmd .= " --threads $args{'threads'}" if $args{'threads'};
$cmd .= " $args{'extrargs'}" if $args{'extrargs'};
$cmd .= " --outfmt 6 qseqid sseqid pident evalue bitscore qcovhsp staxids";
$cmd .= " --tmpdir $tmpdir";
my @out = `$cmd 2>&1`;
die "DIAMOND search unsuccessful. Error code: $?.\n" if $?;
unlink $tmpin;
return parse_def_table( \@out );
}
Self-alignment:
sub diamond_selfaln {
=pod
Run DIAMOND to align sequences to themselves.
Parameters
----------
seqs : array of array
query sequences ( id, sequence )
args : hash (optional)
self-alignment arguments, including:
- threads (optional) : number of threads
- extrargs (optional) : extra command-line arguments for diamond
- diamond (optional) : diamond executable
tmpdir : str (optional)
temporary directory
Returns
-------
array of array
( id, bitscore, evalue )
=cut
my @seqs = @{ shift @_ };
my %args = @_ ? %{ shift @_ } : ();
my $tmpdir = @_ ? shift : tempdir( CLEANUP => 1 );
# generate temporary query file
my $tmpin = catfile( $tmpdir, 'tmp.in' );
write_fasta( \@seqs, $tmpin );
my $diamond = $args{'diamond'} ? $args{'diamond'} : 'diamond';
my $threads = $args{'threads'} ? $args{'threads'} : '1';
# generate temporary database
my $tmpdb = catfile( $tmpdir, 'tmp.dmnd' );
my $cmd = $diamond
. " makedb"
. " --in $tmpin"
. " --db $tmpdb"
. " --threads $threads"
. " --tmpdir $tmpdir";
`$cmd 2>&1`;
die "DIAMOND self-alignment unsuccessful. Error code: $?.\n" if $?;
# perform search
$cmd = $diamond
. " blastp"
. " --query $tmpin"
. " --db $tmpdb"
. " --threads $threads"
. " --tmpdir $tmpdir";
$cmd .= " $args{'extrargs'}" if $args{'extrargs'};
my @out = `$cmd 2>&1`;
die "DIAMOND self-alignment unsuccessful. Error code: $?.\n" if $?;
unlink $tmpin, $tmpdb;
return parse_self_m8( \@out );
}
And in case self-alignment doesn't work. Here is the homebrew code:
sub fast_selfaln {
=pod
Calculate self-alignment statistics using built-in Perl code.
Parameters
----------
seq : str
query sequence
Returns
-------
array of ( float, float )
bitscore and evalue
Notes
-----
. The statistical metrics are calculated following the official BLAST
documentation:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
. The default BLASTp parameters are assumed (matrix = BLOSUM62, gapopen
= 11, gapextend = 1), except for that the composition based statistics
is switched off (comp-based-stats = 0).
. The result should be identical to that by DIAMOND, but will be slightly
different from that by BLAST.
=cut
my $seq = shift;
# BLOSUM62 is the default aa substitution matrix for BLAST / DIAMOND
my %blosum62 = (
'A' => 4, 'R' => 5, 'N' => 6, 'D' => 6, 'C' => 9,
'Q' => 5, 'E' => 5, 'G' => 6, 'H' => 8, 'I' => 4,
'L' => 4, 'K' => 5, 'M' => 5, 'F' => 6, 'P' => 7,
'S' => 4, 'T' => 5, 'W' => 11, 'Y' => 7, 'V' => 4
);
# calculate raw score (S)
my $raw = 0;
foreach my $c ( split //, uc($seq) ) {
# just in case there are non-basic amino acids...
$raw += $blosum62{$c} if exists $blosum62{$c};
}
# BLAST's empirical values when gapopen = 11, gapextend = 1. See:
# ncbi-blast-2.7.1+-src/c++/src/algo/blast/core/blast_stat.c, line #268
my $lambda = 0.267;
my $K = 0.041;
# calculate bit score (S')
my $bit = ( $lambda * $raw - log($K) ) / log(2);
# calculate e-value (E)
my $e = length($seq) ** 2 * 2 ** -$bit;
return ( sprintf( "%.1f", $bit ), sprintf( "%.3g", $e ) );
}
from hgtector.
Related Issues (20)
- HGT detetction in fish HOT 1
- Error while building database with custom GenBank IDs list HOT 1
- plant genomes making database error HOT 9
- Why not add the parameter for threads used for GridSearchCV? HOT 1
- database --default parameters
- ImportError: dlopen: cannot load any more object with static TLS
- ValueError: Invalid bandwidth:0.3.
- HGT events in plants HOT 3
- Combining databases to predict is much less than separately
- TaxID 426428 is not found in taxonomy database HOT 1
- about database build! HOT 7
- Database issue HOT 16
- Use precompied blast databases HOT 1
- ValueError: Invalid BLAST database: hgtdb/blast/db. HOT 7
- Problempre-compiled database from dropbox HOT 2
- Database creation died with pandas error HOT 5
- Database error HOT 5
- DIAMOND failed with error code 139 HOT 9
- Cannot download data from NCBI refseq HOT 3
- Build Reference Database Issue: EOFError HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hgtector.