Scripts for creating "marker gene" reference files for use as a database in sequence similarity searches
The generate_annotation.sh script takes a FASTA file with a common header format, as usually found in UniProt, NCBI, etc., and performs the following steps:
- Dereplicate sequences at 97 percent similarity with cd-hit
- Convert the FASTA file to one-line format and remove the header description (everything after the first whitespace) with awk
- Convert headers to MD5 hashes with the Python script convert_fasta_headers_md5.py
- Merge files to create a complete annotation file with the Python script merge_annotation_files.py
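
The one-line conversion and header hashing steps above can be sketched in Python as follows. This is a minimal illustration, not the code of convert_fasta_headers_md5.py: the function names read_fasta and md5_headers are hypothetical, the example records are made up, and it assumes the MD5 hash is computed from the identifier (the header text before the first whitespace), which the steps above do not state explicitly.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs with each sequence joined onto one line."""
    identifier = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Keep only the identifier: drop ">" and the description after whitespace.
            parts = line[1:].split()
            identifier = parts[0] if parts else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def md5_headers(
    records: Iterable[Tuple[str, str]]
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Replace each identifier with its MD5 hex digest (assumption: hash of the ID)."""
    renamed: List[Tuple[str, str]] = []
    mapping: Dict[str, str] = {}
    for identifier, sequence in records:
        digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()
        mapping[digest] = identifier  # remember the original ID for later annotation
        renamed.append((digest, sequence))
    return renamed, mapping


# Made-up multi-line FASTA input with header descriptions.
example = """>seqA example description one
MKT
AYI
>seqB example description two
MSD
""".splitlines()

records = list(read_fasta(example))
renamed, mapping = md5_headers(records)
for digest, sequence in renamed:
    print(f">{digest}\n{sequence}")
```

Keeping the digest-to-identifier mapping is a design choice of this sketch; it makes it possible to link the hashed headers back to the original annotation later.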
Dependencies:
- cd-hit
- Pandas (Python)
Input files:
- FASTA file (.fa extension)
- Tab-separated annotation file (same name as the FASTA file, with .tab extension)
Input files can be downloaded from UniProt, for example:
https://www.uniprot.org/uniprot/?query=terminase+family%3Aprotein+taxonomy%3A%22Viruses+%5B10239%5D%22&sort=score
Output files:
- derep.fa
- derep.md5.fa
- derep.fa.clstr
- annotation.tsv
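
As a rough idea of how a file like annotation.tsv might be assembled, the sketch below joins the identifiers kept after dereplication with rows of a tab-separated annotation table. It is not the code of merge_annotation_files.py: it uses the standard-library csv module instead of Pandas so that it runs without extra packages, the function name merge_annotation is hypothetical, and it assumes that the first column of the .tab file holds the same identifier as the FASTA header and that the MD5 hash is computed from that identifier. The table contents are made up.

```python
import csv
import hashlib
import io
from typing import Iterable, List


def merge_annotation(kept_ids: Iterable[str], tab_text: str) -> List[List[str]]:
    """Return annotation rows for kept IDs, each prefixed with the MD5 of its ID.

    Assumption: the first column of the tab-separated table is the sequence ID.
    """
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    header = next(reader)
    rows = [["md5"] + header]
    wanted = set(kept_ids)
    for row in reader:
        if row and row[0] in wanted:
            digest = hashlib.md5(row[0].encode("utf-8")).hexdigest()
            rows.append([digest] + row)
    return rows


# Made-up annotation table; seqC stands for a sequence removed by dereplication.
tab_text = (
    "Entry\tProtein names\n"
    "seqA\texample protein one\n"
    "seqB\texample protein two\n"
    "seqC\texample protein three\n"
)

rows = merge_annotation(["seqA", "seqB"], tab_text)

# Write the merged table as tab-separated text.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

With Pandas, the same join would typically be expressed as a merge of two data frames on the identifier column.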