Giter Site home page Giter Site logo

q2-pinocchio's Introduction

q2-pinocchio: PaIrwise alignment of long-read NucleOtide sequence data for Classification and quality Control in HIgh-thrOughput

QIIME 2 Plugin for quality control and taxonomic classification of long sequences

Installation

Step 1: Create q2-pinocchio environment

mamba create -n q2-pinocchio -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.10/metagenome/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools gzip chopper nanoplot

Step 2: Activate q2-pinocchio environment

conda activate q2-pinocchio

Step 3: Installing python package

pip install .

Provided Actions

  1. build-index

    Build a Minimap2 index database from reference sequences.

  2. minimap2-search

    Search for top hits in a reference database using alignment between the query sequences and reference database sequences using Minimap2. Returns a report of the top M hits for each query (where M=maxaccepts).

  3. filter-reads

    This method aligns long-read sequencing data (from a FASTQ file) to a set of reference sequences, identifying sequences that match or do not match the reference within a specified identity percentage. The alignment is performed using Minimap2, and the results are processed using Samtools.

  4. extract-reads

    This method aligns long-read sequencing data (from a FASTA file) to a set of reference sequences, identifying sequences that match or do not match the reference within a specified identity percentage. The alignment is performed using Minimap2, and the results are processed using Samtools.

  5. classify-consensus-minimap2

    Assign taxonomy to query sequences using Minimap2. Performs alignment between query and reference reads, then assigns consensus taxonomy to each query sequence.

  6. trim

    Trim long demultiplexed sequences using Chopper tool.

  7. stats

    Quality control statistics of long-read sequencing data using NanoPlot.


Examples

Download the input datasets

  • build-index
    • Build Minimap2 index database
    qiime pinocchio build-index --i-reference reference.qza --o-index index.qza --verbose

  • minimap2-search

    • Generate both hits and no hits for each query. Keep a maximum of one hit per query (primary).
    qiime pinocchio minimap2-search --i-query fasta_reads.qza --i-index index.qza --o-search-results paf.qza --verbose
    • Generate only hits for each query. Keep a maximum of one hit per query (primary mappings).
    qiime pinocchio minimap2-search --i-query fasta_reads.qza --i-index index.qza --o-search-results paf_only_hits.qza --p-output-no-hits false --verbose
    • Generate only hits for each query, limiting the number of hits to a maximum of 3 per query. Ensure that each hit has a minimum similarity percentage of 90% to be considered valid.
    qiime pinocchio minimap2-search --i-query fasta_reads.qza --i-index index.qza --o-search-results paf_only_hits_ma3.qza --p-maxaccepts 3 --p-output-no-hits false --verbose

  • filter-reads

    • Keep mapped (single-end reads)
    qiime pinocchio filter-reads --i-query single-end-reads.qza --i-index index.qza --o-filtered-query mapped_se.qza --verbose
    • Keep unmapped (single-end reads)
    qiime pinocchio filter-reads --i-query single-end-reads.qza --i-index index.qza --p-keep unmapped --o-filtered-query unmapped_se.qza --verbose
    • Keep mapped (paired-end reads)
    qiime pinocchio filter-reads --i-query paired-end-reads.qza --i-index index.qza --o-filtered-query mapped_pe.qza --verbose
    • Keep mapped reads with mapping percentage >= 98% (paired-end reads)
    qiime pinocchio filter-reads --i-query paired-end-reads.qza --i-index index.qza --p-min-per-identity 0.98  --o-filtered-query mapped_pe_over_98p_id.qza --verbose

  • extract-reads
    • Extract mapped
    qiime pinocchio extract-reads --i-sequences fasta_reads.qza --i-index index.qza --o-extracted-reads mapped_fasta.qza --verbose
    • Extract unmapped
    qiime pinocchio extract-reads --i-sequences fasta_reads.qza --i-index index.qza --p-extract unmapped --o-extracted-reads unmapped_fasta.qza --verbose
    • Extract mapped reads with mapping percentage >= 87%
    qiime pinocchio extract-reads --i-sequences fasta_reads.qza --i-index index.qza --p-min-per-identity 0.87 --o-extracted-reads mapped_fasta_ido_ver_87.qza --verbose

  • classify-consensus-minimap2
    • Assign taxonomy to query sequences using Minimap2
    qiime pinocchio classify-consensus-minimap2 --i-query n1K_initial_reads_SILVA132.fna.qza --i-index ccm_index.qza --i-reference-taxonomy raw_taxonomy.qza --p-n-threads 8 --output-dir classification_output --verbose

  • trim
    • Filter based on the quality (min)
    qiime pinocchio trim --i-query single-end-reads.qza --p-min-quality 7 --o-filtered-query filt_qual_min.qza --verbose
    • Filter based on the quality (max)
    qiime pinocchio trim --i-query single-end-reads.qza --p-max-quality 7 --o-filtered-query filt_qual_max.qza --verbose
    • Headcrop of all sequences ()
    qiime pinocchio trim --i-query single-end-reads.qza --p-headcrop 10 --o-filtered-query headcrop.qza --verbose
    • Filter based on the length of the sequences (min)
    qiime pinocchio trim --i-query single-end-reads.qza --p-min-length 3000 --o-filtered-query filt_len_min.qza --verbose

  • stats
    • Generate a visualization to display statistics about the sequences
    qiime pinocchio stats --i-sequences single-end-reads.qza --o-visualization stats.qzv
  • To open:
qiime tools view stats.qzv

q2-pinocchio's People

Contributors

christosmatzoros avatar

Stargazers

 avatar

Watchers

Nicholas Bokulich avatar  avatar Michal Ziemski avatar Santiago Castro Dau avatar

q2-pinocchio's Issues

ENH: expose `sr` option

minimap2 has a short read mode (sr) that makes it more efficient for processing short read data.

Some of the actions in this plugin (e.g., for search, classification, filtering) should have this option exposed in case users input short read data (e.g., Illumina reads).

ENH: Enhancing/reviewing extract-seqs action

The code and execution of the action extract-seqs needs additional review to detect potential enhancements or bugs. This method aligns feature sequences to a set of reference sequences to identify sequences that hit/miss the reference within a specified percentage of identity. This method could be used to define a positive filter or a negative filter.

How to review

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

Create q2-minimap2 environment

mamba create -n q2-minimap2 -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.2/shotgun/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools

Activate q2-minimap2 environment

conda activate q2-minimap2

Installing python package

make dev
qiime dev refresh-cache

Download the data here

To execute the action:

In this example, we have 1200 sequences in total. After the alignment 1125 of them are mapped and 75 are unmapped. The user has the option to either extract the mapped or the unmapped sequences from the original input.

  • Extract mapped
qiime minimap2 extract-seqs --i-sequences fasta_reads.qza --i-index-database database.qza --p-extract "mapped" --o-extracted-seqs extracted_mapped.qza --verbose
  • Extract unmapped
qiime minimap2 extract-seqs --i-sequences fasta_reads.qza --i-index-database database.qza --p-extract "unmapped" --o-extracted-seqs extracted_unmapped.qza --verbose

The following code needs to be reviewed:

ENH+REVIEW: Enhancement for taxonomic classification action

The code and execution of the classify_consensus_minimap2 action need additional review to detect potential enhancements or bugs. This action assigns taxonomy to query sequences using Minimap2. To assign a consensus taxonomy, the algorithm first aligns each query sequence with the reference sequences. It then considers only the best alignments, up to a maximum specified by 'maxaccepts'. A consensus taxonomy is assigned to each query sequence based on agreement among at least 'min_consensus' of these top hits regarding their taxonomic classification.

How to review

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

Create q2-minimap2 environment

mamba create -n q2-minimap2 -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.2/shotgun/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools

Activate q2-minimap2 environment

conda activate q2-minimap2

Installing python package

make dev
qiime dev refresh-cache

Download the data here

To execute the action:

  • Assign taxonomy to query sequences using Minimap2
qiime minimap2 classify-consensus-minimap2 --i-index-database classification_input/index.qza --i-query classification_input/n1K_initial_reads_SILVA132.fna.qza --i-reference-taxonomy classification_input/raw_taxonomy.qza --p-num-threads 6 --output-dir outDir --verbose

The following code needs to be reviewed:

ENH+REVIEW: Enhancement for filter reads actions (for single-end and paired-end reads)

The code and execution of the actions filter_paired_end_reads and filter_single_end_reads needs additional review to detect potential enhancements or bugs.

The filter_paired_end_reads action aligns paired-end reads to a reference genome or index using Minimap2, processes the alignment to filter out reads based on the specified percentage identity and either keeps the mapped or the unmapped sequences based on what the user specifies. It modifies flags for paired and unmapped reads appropriately in order to be recognized by the samtools fastq command which converts the filtered alignments into separate FASTQ files for each read in the pair. This allows users to handle paired-end sequencing data specifically, ensuring the reads are appropriately paired and filtered.

The filter_single_end_reads function performs a similar process but for single-end reads. It aligns these reads to a reference genome, filters the alignments based on identity and either keeps the mapped or the unmapped sequences based on what the user specifies. Then it converts the filtered alignments into initial FASTQ format. This function is tailored for single-end reads, focusing on filtering individual reads without considering pair relationships.

Review: Minimap2 build-index action

The code and execution of the action build-index needs additional review to detect potential enhancements or bugs.

Indexing serves several critical functions that enhance the performance and efficiency of sequence alignment processes. Primarily, it creates a specialized data structure that significantly speeds up the location of potential matching regions within a reference database. This optimizes the alignment of DNA or mRNA sequences against large reference genomes or libraries, making the process much faster and more efficient. Additionally, indexing reduces the computational load by preprocessing the reference genome or sequences. This means less repetitive computation is necessary when aligning multiple query sequences, leading to faster processing times and reduced resource usage.

In terms of usage, indexing in Minimap2 is straightforward and typically executed via the command line. To create an index, the standard command format is minimap2 -d ref.mmi ref.fa. Here, ref.mmi is the name of the output index file, and ref.fa is the reference sequence file. This command preprocesses the reference sequence and generates an index file that can be reused in subsequent alignments, saving time and computational resources. Moreover, the indexing step is often integrated into larger bioinformatics workflows, enhancing the efficiency of large-scale genomic analyses. This integration is vital in research environments where vast amounts of data need to be processed quickly and accurately.

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

To install: follow Steps 1-3 in README
Download the data here

To execute the action:
qiime minimap2 build-index --i-sequences reference.qza --o-database database.qza --verbose

The following code needs to be reviewed:

Review: Minimap2 minimap2_search action

The code and execution of the action minimap2-search needs additional review to detect potential enhancements or bugs.

What Minimap2 offers

Based on the documentation of Minimap2 we can do alignment between a query and a reference/index. The default output after the alignment procedure is a Pairwise mApping Format (PAF) file. It is a TAB-delimited text format with each line consisting of at least 12 fields as are described in the following table:

Col Type Description
1 string Query sequence name
2 int Query sequence length
3 int Query start coordinate (0-based)
4 int Query end coordinate (0-based)
5 char ‘+’ if query/target on the same strand; ‘-’ if opposite
6 string Target sequence name
7 int Target sequence length
8 int Target start coordinate on the original strand
9 int Target end coordinate on the original strand
10 int Number of matching bases in the mapping
11 int Number bases, including gaps, in the mapping
12 int TMapping quality (0-255 with 255 for missing)

Column 11 gives the total number of sequence matches, mismatches and gaps in the alignment; column 10 divided by column 11 gives the BLAST-like alignment identity. We can use this score to filter the final matchings of the queries in the PAF file, based on the percentage of identity that we want to allow.

Each query can be mapped to multiple regions in the reference, thus resulting in multiple entries in the PAF file. We want to be able to filter the number of entries for each query.

minimap2-search action

Expected behaviour

We do alignment between a query sequences and a reference/index. The result is a QIIME 2 artifact that represents the PAF file that is created from the alignment.
We have the following options to filter the results of the alignment:

  • Based on maxaccepts parameter which sets the maximum number of hits to keep for each query. Minimap2 will choose the first N hits in the reference database.
  • Based on perc_identity parameter we optionally reject a match for a query. We reclassify it as unmapped in the PAF file if its percent identity is lower than the perc_identity parameter.

How to review

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

Create q2-minimap2 environment

mamba create -n q2-minimap2 -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.2/shotgun/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools

Activate q2-minimap2 environment

conda activate q2-minimap2

Installing python package

make dev
qiime dev refresh-cache

Download the data here

To execute the action:

  • Generate both hits and no hits for each query. Keep a maximum of one hit per query (primary).
    qiime minimap2 minimap2-search --i-query-reads fasta_reads.qza --i-index-database database.qza --o-search-results paf.qza --verbose
    • Generate only hits for each query. Keep a maximum of one hit per query (primary mappings).
    qiime minimap2 minimap2-search --i-query-reads fasta_reads.qza --i-index-database database.qza --o-search-results paf_only_hits.qza --verbose
    • Generate only hits for each query. Keep a maximum of 10 hits per query.
    qiime minimap2 minimap2-search --i-query-reads fasta_reads.qza --i-index-database database.qza --p-maxaccepts 10 --p-output-no-hits False --o-search-results paf_only_hits_up_to_10_per_query.qza --verbose

The following code needs to be reviewed:

ENH: Adding preset option for indexing

Allow the option of providing the preset for indexing the reference.

From documentation:
minimap2 [-x preset] -d target.mmi target.fa

Also from this source:

Minimap2 will give you a warning if parameters used in a pre-built index doesn't match parameters on the command line. Please always make sure you are using an intended pre-built index.

So it is recommended to use the same preset both in the indexing and the alignment phase

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.