Assembly pipeline used in the Seattle Flu Study to build consensus genomes for sequenced samples.
Adapted from Louise Moncla's Illumina pipline for influenza snp calling.
Install dependencies via Conda by running:
conda env create -f envs/seattle-flu-environment.yaml
Once installed, the environment needs to be activated to use assembly by running:
conda activate seattle-flu
The pipeline expects 8 total FASTQ files for each individual sample, with Read 1 (R1) and Read 2 (R2) from 4 different lanes.
Expected filename format: 318375_S12_L001_R1_001.fastq.gz
where 318375
is the NWGC sample ID.
If the filename does not contain the NWGC sample ID, rename the FASTQ files by using:
python scripts/rename_fastq.py
usage: rename_fastq.py [-h]
[Sample-ID-Map] [Barcode-column] [ID-column]
[FASTQ-directory]
Rename fastq files to contain NWGC sample ID
positional arguments:
Sample-ID-Map File path to csv file that matches random barcode to NWGC
ID (default: None)
Barcode-column Column name of column containing random barcodes (default:
None)
ID-column Column name of column containing NWGC sample IDs (default:
None)
FASTQ-directory File path to directory of FASTQ files that need to be
renamed (default: None)
optional arguments:
-h, --help show this help message and exit
By default, the pipeline will generate combinations of all samples and references and try to create consensus genomes for all combinations.
Avoid this by specifying specific sample-reference pairs and ignored samples in the config file by using:
python scripts/setup_config_file.py
usage: setup_config_file.py [-h] [--sample [Sample column]]
[--target [Target column]]
[--max_difference [Max percent difference]]
[FASTQ directory] [Config file]
[Sample/target map]
Edits the Snakemake config file by adding sample reference pairs and ignored
samples.
positional arguments:
FASTQ directory File path to directory containing fastq files
(default: None)
Config file File path to config file to be used with Snakemake
(default: None)
Sample/target map File path to Excel file containing samples and present
targets (default: None)
optional arguments:
-h, --help show this help message and exit
--sample [Sample column]
Column name of column containg NWGC sample IDs
(default: sample)
--target [Target column]
Column name of column containing target identifiers
(default: target)
--max_difference [Max percent difference]
The maximum difference acceptable between R1 and R2
reads (default: 20)
Note: This currently only works for flu-positive samples. To add more targets/references, edit references/target_reference_map.json
.
Embedded in this script is a check to ensure that the reads in R1 and R2 of each sample match. This is to avoid the demultiplexing error described in this paper. If the reads differ by over 20% then the sample will be added to the ignored samples. Redirect the output to a file to keep track of read differences and ignored samples.
Consensus genome FASTA headers are generated with the NWGC sample ID which then needs to be converted to the SFS UUID in order to be paired with stored metadata.
This is done with the seqkit replace method, which requires a tab-delimited key-value file for replacing the key with the value. The key-value file can be generated using the following:
python scripts/id_barcode_key_value.py
usage: id_barcode_key_value.py [-h]
[NWGC-SFS-Match] [NWGC-column] [SFS-column]
[Output]
Create key/value pairs of NWGC ID and SFS ID
positional arguments:
NWGC-SFS-Match File path to Excel file that matches NWGC ID to SFS ID
(default: None)
NWGC-column Column name of column containing NWGC sample IDs (default:
None)
SFS-column Column name of column containing SFS UUIDs (default: None)
Output File path for output key/value file (default: None)
optional arguments:
-h, --help show this help message and exit
Note: Remember to add the path of the output file to the config file after running this script.
Currently, config/config-flu-only.json
has the most updated version of the configuration needed to run assembly.
fastq_directory
: the path to the directory containing the fastq filesignored_samples
: an object containing keys that specify samples to be ignoredsample_reference_pairs
: an object containing specific sample/reference pairs, with the keys refering to the samples and an array of references as the values.barcode_match
: the path to the file containing tab-delimited key-value pairs used to replace NWGC sample IDs with SFS UUIDsmin_align_rate
: the minimum overall align rate needed to generate a consensus genome (currently arbitrarily set to1.00
so thatvcf_to_consensus
does not error out)reference_virusus
: an object containing keys that specify references to be usedparams
: parameters for the various tools used in the pipeline (currently set by Louise's recommendations)
Running (this will be updated):
snakemake --configfile config/config-flu-only.json -k
OR (if on Rhino)
snakemake -w 60 --configfile config/config-flu-only.json --cluster-config config/cluster.json --cluster "sbatch --nodes=1 --tasks=1 --mem={cluster.memory} --cpus-per-task={cluster.cores} --tmp={cluster.disk} --time={cluster.time} -o all_output.out" -j 20 -k
You can use --use-conda
if there are rule-specific environment files.
Using bowtie2-build to build a Bowtie index for each reference genome. These will later be used in the mapping step.
Concatenates 8 FASTQ files for each sample into 2 files (R1 and R2).
Trim raw FASTQ files with Trimmomatic, which cuts out adapter/illumina-specific sequences, trims the reads based on quality scores, and removes short reads.
Generates summary statistics using FastQC. Allows for some quality control checks on raw sequence data coming from high throughput sequencing pipelines.
Map trimmed reads to reference genome using bowtie2. Outputs a BAM file that represents aligned sequences and a log file that contains the alignment summary.
Sort the BAM file into "genome order" using samtools sort.
Generates statistics for coverage using BAMStats.
Checkpoints allow for data-dependent conditional execution, forcing Snakemake to re-evaluate the DAG at this point.
Determines if all segments within the reference genome has coverage summary generated from BAMStats.
Then calculates the minimum number of reads required to meet the minimum coverage depth set for varscan mpileup2snp
and checks the number of mapped reads from the bowtie2 alignment summary. The minimum number of reads required is calculated as:
min_reads = (reference_genome_length * min_cov) / (raw_read_length)
If not all segments are represented or if mapped_reads
is less than min_reads
, then stop the process for this sample/reference pair.
This rule is necessary for the checkpoint above to work, because it must send the data down one of two paths. This is the "dead end" where a consensus genome is not generated.
Also a placeholder for how to handle sample/reference pairs that do not meet the min_align_rate
threshold. Currently it just outputs the overall align rate of the sample/reference pair so that we can check that it did not meet the minimum.
Generates Pileup for BAM file using samtools mpileup.
Important Flags:
-a
to print out all positions, including zero depth (this is necessary for generating the low coverage Bedfile later)-A
to not discard anomalous read pairs-Q
to set the minimum base quality to consider a read (set to match the minimum invarscan mpileup2snp
so that their coverage depths match)
Creates a BED file for positions that need to be masked in the consensus genome. Positions need to be masked if they are below the minimum coverage depth.
Calls consensus (SNPs/indels) from the Pileup based on parameters set in the config file using varscan mpileup2cns
Compress VCF using bgzip, which allows indexes to be built against the file and allows the file be used without decompressing it.
Creates index for compressed VCF using bcftools index. This index is necessary for creating the consensus genome using the compressed VCF.
Create consensus genome by applying VCF variants to the reference genome using bcftools consensus.
This does not account for coverage, so we pass it the BEDfile of low coverage sites with the --mask
option to mask these sites with N
.
Edits the FASTA headers to fit the pattern needed for downstream analysis.
Example FASTA header: >SFS-UUID|SFS-UUID-PB2|H1N1pdm|PB2
- Replace reference sequence name with the NWGC sample ID using Perl to perform "lookaround" regex matches
- Uses seqkit replace to replace the NWGC sample ID with the SFS UUID.
- Create the UUID-gene combination using AWK
This is the last rule of the pipeline that prints out the final result of each sample/reference pair. If a consensus genome is generated, then this will also add it to the final combined FASTA.
The input of this rule differs based on the result of the checkpoint, so this rule dictates the final outcome of each sample/reference pair.
Deletes all of the intermediate directories and files generated from running the pipeline. The only one that it does not delete is the consensus_genome
directory.
Consensus genomes can be uploaded to id3c with the correct id3c permissions stored as environment variables. Note: id3c uses basic authentication. More information can be found in the id3c documentation.
An assembly can utilize the ID3C Bot Slack application to send notifications when an id3c upload fails. The Slack webhook for the #id3c-alerts Slack channel should be stored as an environment variable. Slack webhook URLs are found at the Slack Apps API site. Make sure you're logged into the Seattle Flu Study workspace. If you cannot view the link above, contact another member of the id3c-team to add you as a collaborator to the ID3C Bot app.