bokulich-lab / q2-pinocchio Goto Github PK

View Code? Open in Web Editor NEW

1.0 4.0 2.0 5.53 MB

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.14% Python 98.61% TeX 0.60% HTML 0.65%

q2-pinocchio's Introduction

q2-pinocchio: PaIrwise alignment of long-read NucleOtide sequence data for Classification and quality Control in HIgh-thrOughput

QIIME 2 Plugin for quality control and taxonomic classification of long sequences

Installation

Step 1: Create q2-pinocchio environment

mamba create -n q2-pinocchio -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.10/metagenome/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools gzip chopper nanoplot

Step 2: Activate q2-pinocchio environment

conda activate q2-pinocchio

Step 3: Installing python package

pip install .

Provided Actions

build-index

Build a Minimap2 index database from reference sequences.
minimap2-search

Search for top hits in a reference database using alignment between the query sequences and reference database sequences using Minimap2. Returns a report of the top M hits for each query (where M=maxaccepts).
filter-reads

This method aligns long-read sequencing data (from a FASTQ file) to a set of reference sequences, identifying sequences that match or do not match the reference within a specified identity percentage. The alignment is performed using Minimap2, and the results are processed using Samtools.
extract-reads

This method aligns long-read sequencing data (from a FASTA file) to a set of reference sequences, identifying sequences that match or do not match the reference within a specified identity percentage. The alignment is performed using Minimap2, and the results are processed using Samtools.
classify-consensus-minimap2

Assign taxonomy to query sequences using Minimap2. Performs alignment between query and reference reads, then assigns consensus taxonomy to each query sequence.
trim

Trim long demultiplexed sequences using Chopper tool.
stats

Quality control statistics of long-read sequencing data using NanoPlot.

Examples

Download the input datasets

build-index

Build Minimap2 index database

qiime pinocchio build-index --i-reference reference.qza --o-index index.qza --verbose

minimap2-search

Generate both hits and no hits for each query. Keep a maximum of one hit per query (primary).

qiime pinocchio minimap2-search --i-query fasta_reads.qza --i-index index.qza --o-search-results paf.qza --verbose

Generate only hits for each query. Keep a maximum of one hit per query (primary mappings).

qiime pinocchio minimap2-search --i-query fasta_reads.qza --i-index index.qza --o-search-results paf_only_hits.qza --p-output-no-hits false --verbose

Generate only hits for each query, limiting the number of hits to a maximum of 3 per query. Ensure that each hit has a minimum similarity percentage of 90% to be considered valid.

qiime pinocchio minimap2-search --i-query fasta_reads.qza --i-index index.qza --o-search-results paf_only_hits_ma3.qza --p-maxaccepts 3 --p-output-no-hits false --verbose

filter-reads

Keep mapped (single-end reads)

qiime pinocchio filter-reads --i-query single-end-reads.qza --i-index index.qza --o-filtered-query mapped_se.qza --verbose

Keep unmapped (single-end reads)

qiime pinocchio filter-reads --i-query single-end-reads.qza --i-index index.qza --p-keep unmapped --o-filtered-query unmapped_se.qza --verbose

Keep mapped (paired-end reads)

qiime pinocchio filter-reads --i-query paired-end-reads.qza --i-index index.qza --o-filtered-query mapped_pe.qza --verbose

Keep mapped reads with mapping percentage >= 98% (paired-end reads)

qiime pinocchio filter-reads --i-query paired-end-reads.qza --i-index index.qza --p-min-per-identity 0.98  --o-filtered-query mapped_pe_over_98p_id.qza --verbose

extract-reads

Extract mapped

qiime pinocchio extract-reads --i-sequences fasta_reads.qza --i-index index.qza --o-extracted-reads mapped_fasta.qza --verbose

Extract unmapped

qiime pinocchio extract-reads --i-sequences fasta_reads.qza --i-index index.qza --p-extract unmapped --o-extracted-reads unmapped_fasta.qza --verbose

Extract mapped reads with mapping percentage >= 87%

qiime pinocchio extract-reads --i-sequences fasta_reads.qza --i-index index.qza --p-min-per-identity 0.87 --o-extracted-reads mapped_fasta_ido_ver_87.qza --verbose

classify-consensus-minimap2

Assign taxonomy to query sequences using Minimap2

qiime pinocchio classify-consensus-minimap2 --i-query n1K_initial_reads_SILVA132.fna.qza --i-index ccm_index.qza --i-reference-taxonomy raw_taxonomy.qza --p-n-threads 8 --output-dir classification_output --verbose

trim

Filter based on the quality (min)

qiime pinocchio trim --i-query single-end-reads.qza --p-min-quality 7 --o-filtered-query filt_qual_min.qza --verbose

Filter based on the quality (max)

qiime pinocchio trim --i-query single-end-reads.qza --p-max-quality 7 --o-filtered-query filt_qual_max.qza --verbose

Headcrop of all sequences ()

qiime pinocchio trim --i-query single-end-reads.qza --p-headcrop 10 --o-filtered-query headcrop.qza --verbose

Filter based on the length of the sequences (min)

qiime pinocchio trim --i-query single-end-reads.qza --p-min-length 3000 --o-filtered-query filt_len_min.qza --verbose

stats

Generate a visualization to display statistics about the sequences

qiime pinocchio stats --i-sequences single-end-reads.qza --o-visualization stats.qzv

To open:

qiime tools view stats.qzv

q2-pinocchio's People

Contributors

Stargazers

Watchers

Forkers

vinzentrisch christosmatzoros

q2-pinocchio's Issues

ENH: expose `sr` option

minimap2 has a short read mode (sr) that makes it more efficient for processing short read data.

Some of the actions in this plugin (e.g., for search, classification, filtering) should have this option exposed in case users input short read data (e.g., Illumina reads).

Corrections in CI and GitHub Workflows for Proper Execution

DEV: Adding the stats action

This action creates a visualization of statistics of the input sequences using NanoPlot

ENH: Enhancing/reviewing extract-seqs action

The code and execution of the action extract-seqs needs additional review to detect potential enhancements or bugs. This method aligns feature sequences to a set of reference sequences to identify sequences that hit/miss the reference within a specified percentage of identity. This method could be used to define a positive filter or a negative filter.

How to review

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

Create q2-minimap2 environment

mamba create -n q2-minimap2 -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.2/shotgun/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools

Activate q2-minimap2 environment

conda activate q2-minimap2

Installing python package

make dev
qiime dev refresh-cache

Download the data here

To execute the action:

In this example, we have 1200 sequences in total. After the alignment 1125 of them are mapped and 75 are unmapped. The user has the option to either extract the mapped or the unmapped sequences from the original input.

Extract mapped

qiime minimap2 extract-seqs --i-sequences fasta_reads.qza --i-index-database database.qza --p-extract "mapped" --o-extracted-seqs extracted_mapped.qza --verbose

Extract unmapped

qiime minimap2 extract-seqs --i-sequences fasta_reads.qza --i-index-database database.qza --p-extract "unmapped" --o-extracted-seqs extracted_unmapped.qza --verbose

The following code needs to be reviewed:

From q2_minimap2/plugin_setup.py:
L117 - L128
From q2_minimap2/_action_params.py:
L127 - L174
The entire q2_minimap2/extract_seqs.py file
From q2_minimap2/_filtering_utils.py:
L79 - L129
L203 - L218
L187 - L199
L133 - L149

ENH+REVIEW: Enhancement for taxonomic classification action

The code and execution of the classify_consensus_minimap2 action need additional review to detect potential enhancements or bugs. This action assigns taxonomy to query sequences using Minimap2. To assign a consensus taxonomy, the algorithm first aligns each query sequence with the reference sequences. It then considers only the best alignments, up to a maximum specified by 'maxaccepts'. A consensus taxonomy is assigned to each query sequence based on agreement among at least 'min_consensus' of these top hits regarding their taxonomic classification.

How to review

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

Create q2-minimap2 environment

mamba create -n q2-minimap2 -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.2/shotgun/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools

Activate q2-minimap2 environment

conda activate q2-minimap2

Installing python package

make dev
qiime dev refresh-cache

Download the data here

To execute the action:

Assign taxonomy to query sequences using Minimap2

qiime minimap2 classify-consensus-minimap2 --i-index-database classification_input/index.qza --i-query classification_input/n1K_initial_reads_SILVA132.fna.qza --i-reference-taxonomy classification_input/raw_taxonomy.qza --p-num-threads 6 --output-dir outDir --verbose

The following code needs to be reviewed:

From plugin_setup.py:
L184- L207
From q2_minimap2/_action_params.py:
L230 - L311
The entire q2_minimap2/classify_consensus.py file

ENH+REVIEW: Enhancement for filter reads actions (for single-end and paired-end reads)

The code and execution of the actions filter_paired_end_reads and filter_single_end_reads needs additional review to detect potential enhancements or bugs.

The filter_paired_end_reads action aligns paired-end reads to a reference genome or index using Minimap2, processes the alignment to filter out reads based on the specified percentage identity and either keeps the mapped or the unmapped sequences based on what the user specifies. It modifies flags for paired and unmapped reads appropriately in order to be recognized by the samtools fastq command which converts the filtered alignments into separate FASTQ files for each read in the pair. This allows users to handle paired-end sequencing data specifically, ensuring the reads are appropriately paired and filtered.

The filter_single_end_reads function performs a similar process but for single-end reads. It aligns these reads to a reference genome, filters the alignments based on identity and either keeps the mapped or the unmapped sequences based on what the user specifies. Then it converts the filtered alignments into initial FASTQ format. This function is tailored for single-end reads, focusing on filtering individual reads without considering pair relationships.

DEV: Adding the trimming action

Review: Minimap2 build-index action

The code and execution of the action build-index needs additional review to detect potential enhancements or bugs.

Indexing serves several critical functions that enhance the performance and efficiency of sequence alignment processes. Primarily, it creates a specialized data structure that significantly speeds up the location of potential matching regions within a reference database. This optimizes the alignment of DNA or mRNA sequences against large reference genomes or libraries, making the process much faster and more efficient. Additionally, indexing reduces the computational load by preprocessing the reference genome or sequences. This means less repetitive computation is necessary when aligning multiple query sequences, leading to faster processing times and reduced resource usage.

In terms of usage, indexing in Minimap2 is straightforward and typically executed via the command line. To create an index, the standard command format is minimap2 -d ref.mmi ref.fa. Here, ref.mmi is the name of the output index file, and ref.fa is the reference sequence file. This command preprocesses the reference sequence and generates an index file that can be reused in subsequent alignments, saving time and computational resources. Moreover, the indexing step is often integrated into larger bioinformatics workflows, enhancing the efficiency of large-scale genomic analyses. This integration is vital in research environments where vast amounts of data need to be processed quickly and accurately.

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

To install: follow Steps 1-3 in README
Download the data here

To execute the action:
qiime minimap2 build-index --i-sequences reference.qza --o-database database.qza --verbose

The following code needs to be reviewed:

From plugin_setup.py:
L158 - L169
From q2_minimap2/_action_params.py:
L176 - L185
The entire q2_minimap2/build_index.py file
The entire q2_minimap2/tests/test_minimap2_build.py file

Review: Minimap2 minimap2_search action

The code and execution of the action minimap2-search needs additional review to detect potential enhancements or bugs.

What Minimap2 offers

Based on the documentation of Minimap2 we can do alignment between a query and a reference/index. The default output after the alignment procedure is a Pairwise mApping Format (PAF) file. It is a TAB-delimited text format with each line consisting of at least 12 fields as are described in the following table:

Col	Type	Description
1	string	Query sequence name
2	int	Query sequence length
3	int	Query start coordinate (0-based)
4	int	Query end coordinate (0-based)
5	char	‘+’ if query/target on the same strand; ‘-’ if opposite
6	string	Target sequence name
7	int	Target sequence length
8	int	Target start coordinate on the original strand
9	int	Target end coordinate on the original strand
10	int	Number of matching bases in the mapping
11	int	Number bases, including gaps, in the mapping
12	int	TMapping quality (0-255 with 255 for missing)

Column 11 gives the total number of sequence matches, mismatches and gaps in the alignment; column 10 divided by column 11 gives the BLAST-like alignment identity. We can use this score to filter the final matchings of the queries in the PAF file, based on the percentage of identity that we want to allow.

Each query can be mapped to multiple regions in the reference, thus resulting in multiple entries in the PAF file. We want to be able to filter the number of entries for each query.

minimap2-search action

Expected behaviour

We do alignment between a query sequences and a reference/index. The result is a QIIME 2 artifact that represents the PAF file that is created from the alignment.
We have the following options to filter the results of the alignment:

Based on maxaccepts parameter which sets the maximum number of hits to keep for each query. Minimap2 will choose the first N hits in the reference database.
Based on perc_identity parameter we optionally reject a match for a query. We reclassify it as unmapped in the PAF file if its percent identity is lower than the perc_identity parameter.

How to review

Clone the repo:

git clone https://github.com/bokulich-lab/q2-minimap2.git
cd q2-minimap2

Create q2-minimap2 environment

mamba create -n q2-minimap2 -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2024.2/shotgun/passed/ -c defaults q2cli q2-types q2-feature-classifier minimap2 bs4 samtools

Activate q2-minimap2 environment

conda activate q2-minimap2

Installing python package

make dev
qiime dev refresh-cache

Download the data here

To execute the action:

Generate both hits and no hits for each query. Keep a maximum of one hit per query (primary).

qiime minimap2 minimap2-search --i-query-reads fasta_reads.qza --i-index-database database.qza --o-search-results paf.qza --verbose

Generate only hits for each query. Keep a maximum of one hit per query (primary mappings).

qiime minimap2 minimap2-search --i-query-reads fasta_reads.qza --i-index-database database.qza --o-search-results paf_only_hits.qza --verbose

Generate only hits for each query. Keep a maximum of 10 hits per query.

qiime minimap2 minimap2-search --i-query-reads fasta_reads.qza --i-index-database database.qza --p-maxaccepts 10 --p-output-no-hits False --o-search-results paf_only_hits_up_to_10_per_query.qza --verbose

The following code needs to be reviewed:

From plugin_setup.py:
L171 - L182
From q2_minimap2/_action_params.py:
L188 - L228
The entire q2_minimap2/minimap2_search.py file
The entire q2_minimap2/tests/test_minimap2_search.py file

MAINT: Update descriptions of actions and their parameters

ENH: Adding preset option for indexing

Allow the option of providing the preset for indexing the reference.

From documentation:
minimap2 [-x preset] -d target.mmi target.fa

Also from this source:

Minimap2 will give you a warning if parameters used in a pre-built index doesn't match parameters on the command line. Please always make sure you are using an intended pre-built index.

So it is recommended to use the same preset both in the indexing and the alignment phase

bokulich-lab / q2-pinocchio Goto Github PK

q2-pinocchio's Introduction

q2-pinocchio: PaIrwise alignment of long-read NucleOtide sequence data for Classification and quality Control in HIgh-thrOughput

QIIME 2 Plugin for quality control and taxonomic classification of long sequences

Installation

Step 1: Create q2-pinocchio environment

Step 2: Activate q2-pinocchio environment

Step 3: Installing python package

Provided Actions

Examples

Download the input datasets

q2-pinocchio's People

Contributors

Stargazers

Watchers

Forkers

q2-pinocchio's Issues

How to review

To execute the action:

How to review

To execute the action:

What Minimap2 offers

minimap2-search action

Expected behaviour

How to review

Recommend Projects

Recommend Topics

Recommend Org