ncbi-hackathons / master_gff3_parser Goto Github PK

Convert sequence IDs between ucsc/refseq/genbank

License: MIT License

Python 99.31% Shell 0.69%

gff gff3 gff3-files seqids

master_gff3_parser's Introduction

Squidstream (an amazing tool that does wonderful stuff)—Documentation

Generic Feature Format version 3 (GFF3) is a file type that is commonly used in bioinformatic applications. Different institutions have varying naming conventions for the genomic identifier column in the GFF3 format. Therefore, there can be GFF3 files that use different seqids for the same genomic feature. In addition, there are other file formats that also have sequence identifiers, such as GTF, BED, SAM, and BAM files. Squidstream is an easy-to-use command line tool that can convert the genomic feature reference name for chromosomes, scaffolds, and contigs in different file formats to the corresponding seqid from NCBI’s RefSeq database. GFF3 files are a common input into many different types of bioinformatics tools and pipelines, and Squidstream provides naming consistency in these input files by converting the sequence feature IDs in the entire file to the desired ID format using a single command.

Figure 1. Examples of NCBI, UCSC, and RefSeq GFF3 files.

Sequence Identifier Conversion Examples:

Annotation with RefSeq ID to UCSC ID for use in UCSC Genome Browser tracks
Convert to NCBI ID to search KEGG GENES Database
RefSeq to Genbank ID

Squidstream was built in Python and runs from the command line. Users provide the input file, the specific reference genome, and the desired name of the output file.

A summary of the seqconv commands is provided below.

Command	Description
convert	Converts sequence IDs

Links to file format descriptions: GFF3, SAM, BED, GFF/GTF

Installation

Linux:

python setup.py install

OSX:

python setup.py install

Figure 2. Squidstream workflow.

master_gff3_parser's People

Contributors

Stargazers

Watchers

Forkers

dasmoocher arthurtsang hsiaoyi0504 wangdi2016 mrmuchow

master_gff3_parser's Issues

Additional thoughts on format guessing

Terence also had some thoughts on format guessing:

WRT format guessing, another option is to not even guess, and support changing seq-ids in column 1 of any tab-delimited file, maybe with an option to convert seq-ids in a different column, like in the feature_table.txt.gz file such as this one:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/715/135/GCF_000715135.1_Ntab-TN90/GCF_000715135.1_Ntab-TN90_feature_table.txt.gz

And then if it happens to be 9 columns and has a Target=<seq_id> string in column 9, or a ##sequence-region pragma, then go ahead and change that, too.

For guessing assemblies, we’re apparently good about including the assembly in the header in both of our main FTP areas with GFF3:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/ref_GRCh38.p7_top_level.gff3.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_genomic.gff.gz

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh38.p10
#!genome-build-accession NCBI_Assembly:GCF_000001405.36

We use the “#!” for non-official pragmas. All of those were supposed to get added into the GFF3 spec, but those efforts died. The “NCBI_Assembly” dbxref is an SO-defined dbxref, so it’s reasonably official.

Ensembl GTF features with no version may fail to convert

We found that some scaffolds in Ensembl gtf do not follow the accession.version format, and only have the accession. In this case the assembly report entry will not match the input ID and the conversion will fail for these IDs.

Example project and gtf

ftp://ftp.ensembl.org/pub/release-79/gtf/anolis_carolinensis/
ftp://ftp.ensembl.org/pub/release-79/gtf/anolis_carolinensis/Anolis_carolinensis.AnoCar2.0.79.gtf.gz

alt mouse failing

problem testing a UCSC file

$ time  seqconv convert --ref felCat4 --out rs cat_felCat4_UCSC_2008 
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/003/115/GCA_000003115.1_catChrV17e/GCA_000003115.1_catChrV17e_assembly_report.txt
Converting from None to rsStarting Conversion
Cannot convert id: exon
Traceback (most recent call last):
  File "/usr/local/bin/seqconv", line 11, in <module>
    load_entry_point('seqconv==0.0.1', 'console_scripts', 'seqconv')()
  File "/usr/local/lib/python2.7/site-packages/seqconv-0.0.1-py2.7.egg/cli/command.py", line 98, in main
    comm()
  File "/usr/local/lib/python2.7/site-packages/seqconv-0.0.1-py2.7.egg/cli/command.py", line 52, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python2.7/site-packages/seqconv-0.0.1-py2.7.egg/cli/command.py", line 94, in convert
    id_to=args.out)
  File "/usr/local/lib/python2.7/site-packages/seqconv-0.0.1-py2.7.egg/cli/assembly.py", line 294, in converter
    except (KeyboardInterrupt, SystemExit, BrokenPipeError):
NameError: global name 'BrokenPipeError' is not defined

real	0m43.185s
user	0m1.913s
sys	0m0.183s

test_cli.py

Both test_cases passing the convert and converter function are not working

seqconv adds assembly_report.txt line to the gtf file

When converting a gtf file with RefSeq accessions to UCSC names, an additional line with the ftp location of reference genome is included. This prevents the use of resulting gtf file directly in the UCSC browser.

$ seqconv convert --ref GRCh38 --in rs --out uc NC_000001.gtf > chr1.gtf
Converting from rs to ucStarting Conversion

$ head -5 chr1.gtf 
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_assembly_report.txt
#gtf-version 2.2
chr1    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; db_xref "GeneID:100287102%2CHGNC:HGNC:37102"; pseudo "";
chr1    BestRefSeq      exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; exon_number "1";
chr1    BestRefSeq      exon    12613   12721   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; exon_number "2";

It would be useful to either not include this line in the output file or prepend the line with a '#'.

Add IDs (not just convert)

I expect a very useful feature to be adding identifiers to the GFF3 that match the identifiers found in the column with a new key. Instead of changing values in place, one would append another identifier in place that matches the discovered identifier.

Great stuff!

the assembly_report URL is printed to STDOUT

When running a conversion, the first line of the output is a link to the assembly_report file.

$ time seqconv convert --ref ICSASG_v2 --out gb  ref_ICSASG_v2_top_level.gff3.gz> test_salmon.gff3

$ head -n 20 test_salmon.gff3
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_assembly_report.txt
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ICSASG_v2
#!genome-build-accession NCBI_Assembly:GCF_000233375.1
#!annotation-date 22 September 2015
#!annotation-source NCBI Salmo salar Annotation Release 100
##sequence-region CM003279.1 1 159038749
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8030
CM003279.1	RefSeq	region	1	159038749	.	+	.	ID=id0;Dbxref=taxon:8030;Name=ssa01;breed=double haploid;chromosome=ssa01;dev-stage=adult;gbkey=Src;genome=chromosome;isolate=Sally;mol_type=genomic DNA;sex=female;tissue-type=muscle
CM003279.1	Gnomon	gene	5501	62139	.	-	.	ID=gene0;Dbxref=GeneID:106560212;Name=LOC106560212;gbkey=Gene;gene=LOC106560212;gene_biotype=protein_coding
CM003279.1	Gnomon	mRNA	5501	62139	.	-	.	ID=rna0;Parent=gene0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;Name=XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1	Gnomon	exon	61647	62139	.	-	.	ID=id1;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1	Gnomon	exon	43486	43714	.	-	.	ID=id2;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1	Gnomon	exon	23978	24241	.	-	.	ID=id3;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1	Gnomon	exon	16966	17019	.	-	.	ID=id4;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1	Gnomon	exon	5501	5691	.	-	.	ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1	Gnomon	CDS	43486	43633	.	-	0	ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
CM003279.1	Gnomon	CDS	23978	24241	.	-	2	ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1

setting UC for output format gives gb formatted ids

command:
seqconv convert --ref Myoluc2.0 --out uc ref_Myoluc2.0_top_level.gff3.gz

GL430047	Gnomon	exon	1591810	1591932	.	+	.	ID=id351777;Parent=rna30996;Dbxref=GeneID:102441866,Genbank:XM_014465163.1;gbkey=mRNA;gene=MYO9B;product=myosin IXB%2C transcript variant X4;transcript_id=XM_014465163.1
GL430047	Gnomon	exon	1592894	1593043	.	+	.	ID=id351778;Parent=rna30996;Dbxref=GeneID:102441866,Genbank:XM_014465163.1;gbkey=mRNA;gene=MYO9B;product=myosin IXB%2C transcript variant X4;transcript_id=XM_014465163.1

Guessing the ID source in Target fields

@guilhemfaure, from our discussion

Alignment gffs can have a Target field containing an ID. If we support updating the ID, we need to rerun the guessing workflow to try and determine what the ID type is and what reformatting options are available.

There is no expectation that the Target IDs are going to be the same assembly as the main IDs we are updating. That would only be the case if the sequence was aligned to itself, which is a valid use-case, but less common than mapping the assembly to some other assembly.

Should this be a separate option specified when the program is run? (update target IDs vs update source IDs)
Should this be done in parallel as the source IDs are updated?

Semi Successful test for USCS cat 4 to refseq ids

$ time  seqconv convert --ref felCat4 --out rs cat_felCat4_UCSC_2008.gtf >test_cat_4.rs.gtf
Converting from None to rsStarting Conversion
Cannot convert id: chrM
No corresponding id for chrX from None
FORMAT detected: uc
real	0m24.379s
user	0m1.858s
sys	0m0.188s

Additional examples

Terence had some additional examples for us to test with:

For more testing, here are two assemblies with lots of sequences, so the mapping table is big:
https://www.ncbi.nlm.nih.gov/assembly/GCF_000715135.1
https://www.ncbi.nlm.nih.gov/assembly/GCF_000233375.1

The program should gracefully fail given an assembly report like this one:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/180/655/GCA_000180655.1_ASM18065v1/GCA_000180655.1_ASM18065v1_assembly_report.txt
As I mentioned, we’re planning to switch to always populating the file so cases like that will go away. It’s also never the case for RefSeq assemblies.

Testing the main mouse assembly rs to gb

$ time seqconv convert --out gb --ref GRCm38.p4 ref_GRCm38.p4_top_level.gff3.gz >test_mouse_main_assem.gff3
Converting from None to gbStarting Conversion
FORMAT detected: rs
real	0m24.696s
user	0m17.226s
sys	0m2.190s

$ wc -l test_mouse_main_assem.gff3
 2364399 test_mouse_main_assem.gff3

head from the output gff3:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.24_GRCm38.p4/GCF_000001635.24_GRCm38.p4_assembly_report.txt
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCm38.p4
#!genome-build-accession NCBI_Assembly:GCF_000001635.24
#!annotation-date 22 June 2016
#!annotation-source NCBI Mus musculus Annotation Release 106
##sequence-region NC_000067.6 1 195471971
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10090
CM000994.2      RefSeq  region  1       195471971       .       +       .       ID=id0;Dbxref=taxon:10090;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=C57BL/6J
CM000994.2      BestRefSeq%2CGnomon     gene    3199733 3672278 .       -       .       ID=gene0;Dbxref=GeneID:497097,MGI:MGI:3528744;Name=Xkr4;description=X-linked Kx blood group related 4;gbkey=Gene;gene=Xkr4;gene_biotype=protein_coding;gene_synonym=AY534250,Gm210,mKIAA1889,XRG4
CM000994.2      Gnomon  mRNA    3199733 3672278 .       -       .       ID=rna0;Parent=gene0;Dbxref=GeneID:497097,Genbank:XM_006495550.3,MGI:MGI:3528744;Name=XM_006495550.3;gbkey=mRNA;gene=Xkr4;model_evidence=Supporting evidence includes similarity to: 3 mRNAs%2C 32 ESTs%2C 4 Proteins%2C 110 long SRA reads%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 7 samples with support for all annotated introns;product=X-linked Kx blood group related 4%2C transcript variant X1;transcript_id=XM_006495550.3
CM000994.2      Gnomon  exon    3670552 3672278 .       -       .       ID=id1;Parent=rna0;Dbxref=GeneID:497097,Genbank:XM_006495550.3,MGI:MGI:3528744;gbkey=mRNA;gene=Xkr4;product=X-linked Kx blood group related 4%2C transcript variant X1;transcript_id=XM_006495550.3
CM000994.2      Gnomon  exon    3421702 3421901 .       -       .       ID=id2;Parent=rna0;Dbxref=GeneID:497097,Genbank:XM_006495550.3,MGI:MGI:3528744;gbkey=mRNA;gene=Xkr4;product=X-linked Kx blood group related 4%2C transcript variant X1;transcript_id=XM_006495550.3
CM000994.2      Gnomon  exon    3213439 3216968 .       -       .       ID=id3;Parent=rna0;Dbxref=GeneID:497097,Genbank:XM_006495550.3,MGI:MGI:3528744;gbkey=mRNA;gene=Xkr4;product=X-linked Kx blood group related 4%2C transcript variant X1;transcript_id=XM_006495550.3
CM000994.2      Gnomon  exon    3199733 3207317 .       -       .       ID=id4;Parent=rna0;Dbxref=GeneID:497097,Genbank:XM_006495550.3,MGI:MGI:3528744;gbkey=mRNA;gene=Xkr4;product=X-linked Kx blood group related 4%2C transcript variant X1;transcript_id=XM_006495550.3

Test successful NCBI Cat 8.0 to UCSC

 time  seqconv convert --ref Felis_catus_8.0 --out uc ref_Felis_catus_8.0_top_level.gff3.gz >test_cat8.gff3
Converting from None to ucStarting Conversion
FORMAT detected: refseq
real	1m7.052s
user	0m21.348s
sys	0m1.953s

File stats:

$ wc -l ref_Felis_catus_8.0_top_level.gff3
 2351692 ref_Felis_catus_8.0_top_level.gff3
$ wc -l test_cat8.gff3 
 2351693 test_cat8.gff3

seqconv_debugging

seqconv convert --ref AnoCar2 --out uc test2_Anolis_carolinensis.AnoCa22.0.79.gtf

debugging_issue.txt

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/090/745/GCF_000090745.1_AnoCar2.0/GCF_000090745.1_AnoCar2.0_assembly_report.txt
Converting from None to ucStarting Conversion
Traceback (most recent call last):
File "/usr/local/bin/seqconv", line 11, in
load_entry_point('seqconv==0.0.1', 'console_scripts', 'seqconv')()
File "/usr/local/lib/python3.5/dist-packages/seqconv-0.0.1-py3.5.egg/cli/command.py", line 98, in main
comm()
File "/usr/local/lib/python3.5/dist-packages/seqconv-0.0.1-py3.5.egg/cli/command.py", line 52, in init
getattr(self, args.command)()
File "/usr/local/lib/python3.5/dist-packages/seqconv-0.0.1-py3.5.egg/cli/command.py", line 94, in convert
id_to=args.out)
File "/usr/local/lib/python3.5/dist-packages/seqconv-0.0.1-py3.5.egg/cli/assembly.py", line 288, in converter
guess=True)
File "/usr/local/lib/python3.5/dist-packages/seqconv-0.0.1-py3.5.egg/cli/assembly.py", line 174, in convert
line = line.decode("utf-8")
AttributeError: 'str' object has no attribute 'decode'