Giter Site home page Giter Site logo

Comments (1)

qiyunzhu avatar qiyunzhu commented on June 15, 2024

Hello @efr3m Sorry for the late reply. I was off the topic the for a while. I can't tell what is the exact reason, but if you use DIAMOND 0.8.x or 0.7.x, the problem may be resolved.

During my development of newer HGTector, I have noticed the changes in the DIAMOND command-line interface. Below the the DIAMOND search function in the development version of HGTector. They should work with DIAMOND 0.9.x. If necessary you can modify the HGTector version 0.2.2 code accordingly.

Reglular search:

sub diamond_search {
=pod
    Run DIAMOND to search a sequence against a database.

    Parameters
    ----------
    seqs : array of array
        query sequences ( id, sequence )
    args : hash
        search arguments, including:
         - db : blastp database
         - evalue : evalue cutoff
         - identity : % identity cutoff
         - coverage : % query coverage cutoff
         - maxseqs : maximum target sequences
         - threads (optional) : number of threads
         - extrargs (optional) : extra command-line arguments for diamond
         - diamond (optional) : diamond executable
    tmpdir : str (optional)
        temporary directory

    Returns
    -------
    hash of array
        hit table per query sequence

    Raises
    ------
    . If DIAMOND run fails.

    Notes
    -----
    . The DIAMOND database should ideally be prepared as:
        diamond makedb --in seqs.faa --db db --taxonmap prot.accession2taxid
=cut
    my @seqs   = @{ shift @_ };
    my %args   = %{ shift @_ };
    my $tmpdir = @_ ? shift : tempdir( CLEANUP => 1 );

    my $tmpin = catfile( $tmpdir, 'tmp.in' );
    write_fasta( \@seqs, $tmpin );

    my $diamond = $args{'diamond'} ? $args{'diamond'} : 'diamond';

    my $cmd = "$diamond blastp --query $tmpin --db $args{'db'}";
    $cmd .= " --evalue $args{'evalue'}"           if $args{'evalue'};
    $cmd .= " --id $args{'identity'}"             if $args{'identity'};
    $cmd .= " --query-cover $args{'coverage'}"    if $args{'coverage'};
    $cmd .= " --max-target-seqs $args{'maxseqs'}" if $args{'maxseqs'};
    $cmd .= " --threads $args{'threads'}"         if $args{'threads'};
    $cmd .= " $args{'extrargs'}"                  if $args{'extrargs'};
    $cmd .= " --outfmt 6 qseqid sseqid pident evalue bitscore qcovhsp staxids";
    $cmd .= " --tmpdir $tmpdir";

    my @out = `$cmd 2>&1`;
    die "DIAMOND search unsuccessful. Error code: $?.\n" if $?;
    unlink $tmpin;
    return parse_def_table( \@out );
}

Self-alignment:

sub diamond_selfaln {
=pod
    Run DIAMOND to align sequences to themselves.

    Parameters
    ----------
    seqs : array of array
        query sequences ( id, sequence )
    args : hash (optional)
        self-alignment arguments, including:
         - threads (optional) : number of threads
         - extrargs (optional) : extra command-line arguments for diamond
         - diamond (optional) : diamond executable
    tmpdir : str (optional)
        temporary directory

    Returns
    -------
    array of array
        ( id, bitscore, evalue )
=cut
    my @seqs   =      @{ shift @_ };
    my %args   = @_ ? %{ shift @_ } : ();
    my $tmpdir = @_ ?    shift      : tempdir( CLEANUP => 1 );

    # generate temporary query file
    my $tmpin = catfile( $tmpdir, 'tmp.in' );
    write_fasta( \@seqs, $tmpin );

    my $diamond = $args{'diamond'} ? $args{'diamond'} : 'diamond';
    my $threads = $args{'threads'} ? $args{'threads'} : '1';

    # generate temporary database
    my $tmpdb = catfile( $tmpdir, 'tmp.dmnd' );
    my $cmd = $diamond
            . " makedb"
            . " --in $tmpin"
            . " --db $tmpdb"
            . " --threads $threads"
            . " --tmpdir $tmpdir";
    `$cmd 2>&1`;
    die "DIAMOND self-alignment unsuccessful. Error code: $?.\n" if $?;

    # perform search
    $cmd = $diamond
         . " blastp"
         . " --query $tmpin"
         . " --db $tmpdb"
         . " --threads $threads"
         . " --tmpdir $tmpdir";
    $cmd .= " $args{'extrargs'}" if $args{'extrargs'};
    my @out = `$cmd 2>&1`;
    die "DIAMOND self-alignment unsuccessful. Error code: $?.\n" if $?;

    unlink $tmpin, $tmpdb;
    return parse_self_m8( \@out );
}

And in case self-alignment doesn't work. Here is the homebrew code:

sub fast_selfaln {
=pod
    Calculate self-alignment statistics using built-in Perl code.

    Parameters
    ----------
    seq : str
        query sequence

    Returns
    -------
    array of ( float, float )
        bitscore and evalue

    Notes
    -----
    . The statistical metrics are calculated following the official BLAST
      documentation:
      https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
    . The default BLASTp parameters are assumed (matrix = BLOSUM62, gapopen
      = 11, gapextend = 1), except for that the composition based statistics
      is switched off (comp-based-stats = 0).
    . The result should be identical to that by DIAMOND, but will be slightly
      different from that by BLAST.
=cut
    my $seq = shift;

    # BLOSUM62 is the default aa substitution matrix for BLAST / DIAMOND
    my %blosum62 = (
        'A' => 4, 'R' => 5, 'N' => 6,  'D' => 6, 'C' => 9,
        'Q' => 5, 'E' => 5, 'G' => 6,  'H' => 8, 'I' => 4,
        'L' => 4, 'K' => 5, 'M' => 5,  'F' => 6, 'P' => 7,
        'S' => 4, 'T' => 5, 'W' => 11, 'Y' => 7, 'V' => 4
    );

    # calculate raw score (S)
    my $raw = 0;
    foreach my $c ( split //, uc($seq) ) {
        # just in case there are non-basic amino acids...
        $raw += $blosum62{$c} if exists $blosum62{$c};
    }

    # BLAST's empirical values when gapopen = 11, gapextend = 1. See:
    # ncbi-blast-2.7.1+-src/c++/src/algo/blast/core/blast_stat.c, line #268
    my $lambda = 0.267;
    my $K      = 0.041;

    # calculate bit score (S')
    my $bit = ( $lambda * $raw - log($K) ) / log(2);

    # calculate e-value (E)
    my $e = length($seq) ** 2 * 2 ** -$bit;

    return ( sprintf( "%.1f", $bit ), sprintf( "%.3g", $e ) );
}

from hgtector.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.