Long-read sequencing technologies such as PacBio and Oxford Nanopore are becoming more popular in genomics. The long reads they generate are useful for assembling the genomes of complex organisms. In this project, we will use PacBio HiFi reads to assemble the genome of the yeast Saccharomyces cerevisiae.
"Software installation is like a box of chocolates, you never know what you're gonna get. No matter what tool you use, conda, mamba, sudo, pip or whatever, it always finds a way to throw an error and refuse to install."
---- A frustrated bioinformatician
Data
I used the filtered and clipped version of the fastq file for the analysis (link).
For simplicity, I put all the inputs in the ../data/ folder to make git push
easier (we could use .gitignore too). Don't worry: every time we need an input, I will mention the URL.
1 - Quality Control of the reads
software: FastQC v0.11.9
The first step is to check the quality of the reads. We can use FastQC to do this. The command is given below:
$ fastqc <input> -o <output>
$ fastqc ../data/SRR13577846.fastq.gz -o 1-QC/
$ firefox 1-QC/SRR13577846_fastqc.html # open the report in firefox
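FastQC expects the directory passed to -o to exist already, and hifiasm (used in the next step) writes its files under the prefix path it is given. A minimal sketch, assuming the directory names used in this walkthrough, is to create both up front:

```shell
# Create the output directories for the FastQC and hifiasm steps.
# -p makes this safe to re-run: directories that already exist are left alone.
mkdir -p 1-QC 2_assembly
```

The QUAST, BUSCO and MultiQC output directories are left out here on the assumption that those tools create them on their own.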
There are some alerts in the FastQC report. We can ignore them for now, as investigating them is a time-intensive exercise. Maybe we will come back to this later.
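To see which modules raised an alert without opening the HTML report, the summary.txt file inside the FastQC zip archive can be filtered. The sketch below is only an illustration: the function name fastqc_alerts is made up here, and it assumes the usual three-column, tab-separated layout of summary.txt (status, module name, input file).

```shell
# Print the status and name of every FastQC module that did not PASS.
# $1: path to a FastQC summary.txt (tab-separated: status, module, file name).
fastqc_alerts() {
    awk -F'\t' '$1 != "PASS" { print $1 "\t" $2 }' "$1"
}
```

For this run, the summary could be extracted with `unzip -p 1-QC/SRR13577846_fastqc.zip SRR13577846_fastqc/summary.txt > summary.txt` and then listed with `fastqc_alerts summary.txt`.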
2 - Perform de novo assembly using Hifiasm
software: Hifiasm 0.18.8-r525 (Link)
Hifiasm is a fast and accurate assembler for PacBio HiFi reads. The command is given below:
$ hifiasm -o <output> -t <threads> <input>
$ hifiasm -o 2_assembly/SRR13577846 -t 5 ../data/SRR13577846.fastq.gz
# Real time: 1602.548 sec; CPU: 7809.897 sec; Peak RSS: 13.164 GB
hifiasm outputs a set of files; you can find the details in the documentation. We use the <output>.bp.p_ctg.gfa
file, which contains the assembled primary contigs.
Quast and BUSCO need contigs in fasta format. We can use awk
to convert the gfa file to a fasta file.
$ awk '/^S/{print ">"$2;print$3}' 2_assembly/SRR13577846.bp.p_ctg.gfa > 2_assembly/SRR13577846.fa
The fasta file "SRR13577846.fa" is the input for Quast and BUSCO.
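To see what the awk one-liner does, it helps to run it on a tiny hand-made GFA file. In the sketch below the file names toy.gfa and toy.fa and the sequences are invented for illustration; the awk program is the same one used above. Only segment lines (those starting with S) carry sequence, so the link line is skipped.

```shell
# Toy GFA: two segment (S) lines and one link (L) line; sequences are made up.
printf 'S\tptg000001l\tACGTACGT\nS\tptg000002l\tGGCC\nL\tptg000001l\t+\tptg000002l\t+\t0M\n' > toy.gfa

# Keep S lines only: column 2 becomes the FASTA header, column 3 the sequence.
awk '/^S/{print ">"$2; print $3}' toy.gfa > toy.fa

# Show the result: two records, one per segment.
cat toy.fa
```

Each record comes out as a header line followed by a single sequence line, so every contig occupies exactly two lines of the fasta file.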
3 - Perform quality assessment using Quast
software: Quast v5.0.2 (link)
Quast is a tool for quality assessment of genome assemblies.
The Saccharomyces cerevisiae reference genome is available at this page. You can download it and put it in the ../data/
folder. I renamed it to ref.fna.
The command is given below:
$ quast -r <reference> <input> -o <output>
$ quast -r ../data/ref.fna 2_assembly/SRR13577846.fa -o 3_QUAST/
$ firefox 3_QUAST/report.html
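As a quick cross-check of the headline numbers in the QUAST report, contig count, total length and N50 can be computed directly from the fasta file. This is a rough sketch rather than a replacement for QUAST: the function name assembly_stats is made up here, and the values can differ from the report because QUAST by default only counts contigs of at least 500 bp.

```shell
# Print contig count, total length and N50 for a FASTA file.
# $1: path to the FASTA file (single-line or multi-line records both work).
assembly_stats() {
    # Step 1: emit one length per record.
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: N50 is the length at which the running sum reaches half the total.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { n50 = lens[i]; break }
             }
             printf "contigs=%d total=%d N50=%d\n", NR, total, n50
         }'
}
```

For the assembly from step 2, this would be called as `assembly_stats 2_assembly/SRR13577846.fa`.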
4 - Perform quality assessment using BUSCO
software: BUSCO v4.1.4 (link)
BUSCO is a tool for quality assessment of genome assemblies which is based on the presence of orthologous genes.
$ busco --list-datasets # list the available lineages; --auto-lineage can be used instead of choosing one
The command is given below:
$ busco -i <input> -o <output> -l <lineage> -m <mode> -c <threads> [-f] [-q] # -f: force overwrite of existing results; -q: quiet, only report errors
$ busco -m genome -i 2_assembly/SRR13577846.fa -o 4_BUSCO -f -q -l saccharomycetes_odb10
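BUSCO writes a short summary text file into its output directory, containing a one-line score of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:.. (complete, single-copy, duplicated, fragmented, missing, number of genes searched). The sketch below pulls out the completeness percentage from such a file. The function name busco_complete is made up here, and because the exact summary file name depends on the BUSCO version and lineage, the path is passed as an argument instead of being hard-coded.

```shell
# Print the "complete" percentage (the C: value) from a BUSCO short summary.
# $1: path to the short summary text file written by BUSCO.
busco_complete() {
    # Match the score line and keep only the number between "C:" and "%[".
    sed -n 's/.*C:\([0-9.]*\)%\[.*/\1/p' "$1" | head -n 1
}
```

Calling `busco_complete` on the short summary file found under 4_BUSCO/ prints a single number that can be compared between assemblies.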
Finally, let's look at the results with MultiQC:
$ multiqc 1-QC/ 3_QUAST/ 4_BUSCO/ -o 5_multiQC/
$ firefox 5_multiQC/multiqc_report.html