nal-i5k / gff3toolkit

Python programs for processing GFF3 files

License: Other

Python 91.99% Perl 7.64% Shell 0.37%

gff3 gff3-format gff bioinformatics

gff3toolkit's Issues

Fix document on readthedocs

Related to sphinx-doc/sphinx#5490

ValueError: max() arg is an empty sequence

Hi GFF3toolkit developers. I'd like to ask for your help with this issue. The QC step went well, but running the fix step raises a ValueError exception. For reference, I'm attaching the gff file and error.txt here.

gff3_files.zip

Thank you.

Something goes wrong with Anaconda when running the example script

I used conda to deploy the gff_fix tools and tried to learn the workflow with the example project, but something went wrong:

Question about gff_merge

How can I use this command to merge three or more GFF files in different formats (such as gff3/gff/gbff or gff3.gz/gff3)?

inquiry about "gff3_to_fasta -st user_defined -u mRNA CDS"

Thank you for your development of the great gff3 tools.

When I use "gff3_to_fasta -st user_defined -u mRNA CDS" to get a FASTA file, it seems to return the child (here, CDS) sequences.
I'm wondering how to get the parent (here, mRNA) sequences in this situation.

Any reply will be welcome.

Switch back to gff3-py

Since gff3-py will continue to be maintained, we should switch back to it (https://github.com/hotdogee/gff3-py/releases/tag/1.0.0) rather than including another copy inside the repo.

Problem with statistics output file in gff3_QC

I've run into a problem with the statistics output file and am having trouble figuring out the cause. @tony006469 can you help? I'll send you the input files and command line in an email.

Here's the relevant stderr:

INFO     Print QC report at diaall_apollo_annotations_1-28-2019-QC.txt
INFO     Print QC statistic report at diaall_apollo_annotations_1-28-2019_stats.txt
Traceback (most recent call last):
  File "/home/mpoelchau/.local/bin/gff3_QC", line 9, in <module>
    load_entry_point('gff3tool==1.4.4', 'console_scripts', 'gff3_QC')()
  File "/home/mpoelchau/.local/lib/python2.7/site-packages/gff3tool/bin/gff3_QC.py", line 134, in script_main
    error_counts[s['eCode']]= {'count':0,'etag':ERROR_INFO[s['eCode']]}
KeyError: 'Esf0012'
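
The traceback suggests that the checker reported a code that is absent from the `ERROR_INFO` lookup table. A minimal sketch of a more tolerant tally, assuming `ERROR_INFO` is a plain dict from error code to tag (the contents below are an illustrative subset, and `count_errors` is a hypothetical helper, not the toolkit's function):

```python
# Illustrative subset only; not the toolkit's actual ERROR_INFO table.
ERROR_INFO = {
    "Esf0014": '"##gff-version" missing from the first line',
}

def count_errors(records, error_info=ERROR_INFO):
    """Tally records per error code, tolerating codes missing from the table."""
    error_counts = {}
    for rec in records:
        code = rec["eCode"]
        if code not in error_counts:
            # .get() avoids the KeyError for codes that were never registered.
            tag = error_info.get(code, "Unknown error code")
            error_counts[code] = {"count": 0, "etag": tag}
        error_counts[code]["count"] += 1
    return error_counts
```

With this shape, an unregistered code such as `Esf0012` is still counted and is simply reported with a placeholder tag, which makes the missing table entry visible without aborting the report.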

gff3_to_fasta.py: using CDS features if exon features are not present is confusing

For gff3_to_fasta.py: If no exons are present in the input gff3 file, and 'trans', 'pep', or 'cds' are specified as sequence types, the program will throw a warning (WARNING There is no exon feature for rna88 in the input gff. CDS features are used for splicing instead.), and will use CDS features for splicing. It would be better for the program to throw the same warning, and then not produce any output for the offending gene models.
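
The proposed behaviour (warn, then skip the model) could look roughly like the sketch below. The function name and the feature dictionaries are hypothetical and only illustrate the control flow; they are not taken from gff3_to_fasta.py.

```python
import logging

logger = logging.getLogger("gff3_to_fasta_sketch")  # hypothetical logger name

def splicing_features(transcript_id, children):
    """Return exon features sorted by start, or None if the model has no exons."""
    exons = [c for c in children if c["type"] == "exon"]
    if not exons:
        # Warn, then signal the caller to skip this model instead of
        # silently falling back to CDS features.
        logger.warning(
            "There is no exon feature for %s in the input gff. "
            "No sequence is produced for this model.",
            transcript_id,
        )
        return None
    return sorted(exons, key=lambda c: c["start"])
```

A caller would then write output only when the return value is not `None`.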

Convert to python3

- Review branch python3.6
- Review PyPI setup
- Make a new python2.7 release based on current master
- Merge python3.6 into master
- Make a new python3 release

gff3_fix - problem with split function

I got the following error with gff3_fix:

Traceback (most recent call last):
  File "/app/data/mpoelchau/python3_venv/bin/gff3_fix", line 9, in <module>
    load_entry_point('gff3tool==2.0.1', 'console_scripts', 'gff3_fix')()
  File "/app/data/mpoelchau/GFF3toolkit/gff3tool/bin/gff3_fix.py", line 95, in script_main
    gff3_fix.fix.main(gff3=gff3, output_gff=args.output_gff, error_dict=error_dict, line_num_dict=line_num_dict, logger=logger_null)
  File "/app/data/mpoelchau/GFF3toolkit/gff3tool/lib/gff3_fix/fix.py", line 689, in main
    split(gff3=gff3, error_list=error_dict[error_code], logger=logger)
  File "/app/data/mpoelchau/GFF3toolkit/gff3tool/lib/gff3_fix/fix.py", line 180, in split
    childgroup = connected_compoents(childrenlist, hitpair)
  File "/app/data/mpoelchau/GFF3toolkit/gff3tool/lib/gff3_fix/fix.py", line 275, in connected_compoents
    for v in nodelist.itervalues():
AttributeError: 'dict' object has no attribute 'itervalues'

It looks like the problem is with the connected_compoents function, which is used by the split function. This function is not currently tested with our test files.
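
`dict.itervalues()` exists only in Python 2; in Python 3 the equivalent is `dict.values()`. Since this code path is not covered by the tests, other Python 2-only dict calls may remain elsewhere. A small sketch of a scanner for them (the helper name is hypothetical, and a plain regex will also match occurrences inside strings or comments):

```python
import re

# Python 2-only dict methods that were removed in Python 3.
PY2_DICT_METHODS = re.compile(r"\.(iteritems|itervalues|iterkeys)\(")

def find_py2_dict_calls(source):
    """Return (line_number, method_name) for each Python 2-only dict call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in PY2_DICT_METHODS.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits
```

Running it over the text of fix.py would list every line that still needs `.items()`, `.values()`, or `.keys()`.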

gff3_merge iteritems error

When I run gff3_merge I get the following error

Traceback (most recent call last):
  File "/tools/python/3.6.3/bin/gff3_merge", line 8, in <module>
    sys.exit(script_main())
  File "/tools/python/3.6.3/lib/python3.6/site-packages/gff3tool/bin/gff3_merge.py", line 229, in script_main
    main(args.gff_file1, args.gff_file2, args.fasta, report_fh, args.output_gff, args.all, args.auto_assignment, args.user_defined_file1, args.user_defined_file2, logger=logger_stderr)
  File "/tools/python/3.6.3/lib/python3.6/site-packages/gff3tool/bin/gff3_merge.py", line 85, in main
    gff3_merge.merge.main(autoReviseGff, gff_file2, output_gff, report, user_defined1, user_defined2, logger)
  File "/tools/python/3.6.3/lib/python3.6/site-packages/gff3tool/lib/gff3_merge/merge.py", line 22, in main
    gff3_sort.main(gff_file1, output='WA_sorted.gff', logger=logger)
  File "/tools/python/3.6.3/lib/python3.6/site-packages/gff3tool/bin/gff3_sort.py", line 279, in main
    report.write(TwoParent(child['attributes']['ID'],exon))
  File "/tools/python/3.6.3/lib/python3.6/site-packages/gff3tool/bin/gff3_sort.py", line 138, in TwoParent
    attributes_line = ";".join("=".join((str(k),str(v))) for k,v in attributes.iteritems())
AttributeError: 'dict' object has no attribute 'iteritems'

It looks like the gff3_sort.py script still uses a Python 2-only dict method (iteritems).
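
For reference, the Python 3 form of the failing line replaces `iteritems()` with `items()`. A minimal sketch (the function name `format_attributes` is hypothetical; only the join expression mirrors the traceback):

```python
def format_attributes(attributes):
    """Join a GFF3 attribute dict into 'key=value;key=value' form."""
    # dict.items() works on both Python 2.7 and Python 3;
    # dict.iteritems() exists only on Python 2.
    return ";".join("=".join((str(k), str(v))) for k, v in attributes.items())

print(format_attributes({"ID": "exon1", "Parent": "mRNA1"}))
```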

AttributeError: 'NoneType' object has no attribute 'groups'

Hi,

I installed gff3toolkit with pip, using Python 3.5.
My command is:

/work/LAS/mash-lab/jing/bin/Anaconda3/envs/mypy3.5/bin/gff3_merge -g1 transdecoder_final.gff3 -g2 Zm_B73.gff3 -f Zm-B73-REFERENCE-NAM-5.0.fa -og merged_final+anno.gff3 -r merged_fina_anno_report.txt

It failed while identifying the types of replacement based on the replace tag:

INFO     Extract sequences from transdecoder_final.gff3...
INFO            Extract CDS sequences...
INFO            Extract premature transcript sequences...
INFO     Extract sequences from Zm_B73.gff3...
INFO            Extract CDS sequences...
INFO            Extract premature transcript sequences...
INFO     Catenate transdecoder_final.gff3 and Zm_B73.gff3...
INFO     Make blastDB for CDS sequences from auto_replace_tag/tmp/gff2_cds.fa...
INFO     Sequence alignment for cds fasta files between transdecoder_final.gff3 and Zm_B73.gff3...
INFO     Find CDS matched pairs between transdecoder_final.gff3 and Zm_B73.gff3...
INFO     Make blastDB for premature transcript sequences from auto_replace_tag/tmp/gff2_pre_trans.fa...
INFO     Sequence alignment for premature transcript fasta files between transdecoder_final.gff3 and Zm_B73.gff3...
INFO     Find premature transcript matched pairs between transdecoder_final.gff3 and Zm_B73.gff3...
INFO     Generate auto_replace_tag/check1.txt for Check Point 1 internal reviewing...
INFO     Reading revision file... (auto_replace_tag/check1.txt)
INFO     Reading gff3 file... (transdecoder_final.gff3)
INFO     Writing summary report (auto_replace_tag/replace_tag_report.txt)...
INFO     Writing revised gff: (auto_replace_tag/Revised_transdecoder_final.gff3)...
INFO     ========== Check whether there are missing replace tags ==========
INFO     - All models have replace tags.
INFO     ========== Merge the two gff files ==========
INFO     Sorting the WA gff by following the order of Scaffold number and coordinates...
INFO     Sorting and printing out...
INFO     Sorting the other gff by following the order of Scaffold number and coordinates...
INFO     Sorting and printing out...
INFO     Reading WA gff3 file...
INFO     Reading the other gff3 file...
INFO     Identifying types of replacement based on replace tag...
Traceback (most recent call last):
  File "/work/LAS/mash-lab/jing/bin/Anaconda3/envs/mypy3.5/bin/gff3_merge", line 8, in <module>
    sys.exit(script_main())
  File "/work/LAS/mash-lab/jing/bin/Anaconda3/envs/mypy3.5/lib/python3.5/site-packages/gff3tool/bin/gff3_merge.py", line 229, in script_main
    main(args.gff_file1, args.gff_file2, args.fasta, report_fh, args.output_gff, args.all, args.auto_assignment, args.user_defined_file1, args.user_defined_file2, logger=logger_stderr)
  File "/work/LAS/mash-lab/jing/bin/Anaconda3/envs/mypy3.5/lib/python3.5/site-packages/gff3tool/bin/gff3_merge.py", line 85, in main
    gff3_merge.merge.main(autoReviseGff, gff_file2, output_gff, report, user_defined1, user_defined2, logger)
  File "/work/LAS/mash-lab/jing/bin/Anaconda3/envs/mypy3.5/lib/python3.5/site-packages/gff3tool/lib/gff3_merge/merge.py", line 34, in main
    ReplaceGroups = replace_OGS.Groups(WAgff=gff3, Pgff=gff3M, outsideNum=1, user_defined1=user_defined1, user_defined2=user_defined2, logger=logger_null)
  File "/work/LAS/mash-lab/jing/bin/Anaconda3/envs/mypy3.5/lib/python3.5/site-packages/gff3tool/lib/replace_OGS.py", line 253, in __init__
    self.name2id(Pgff, user_defined2)
  File "/work/LAS/mash-lab/jing/bin/Anaconda3/envs/mypy3.5/lib/python3.5/site-packages/gff3tool/lib/replace_OGS.py", line 483, in name2id
    idprefix = tmp.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'

Please help me solve the problem.

Thanks,
Jing
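
The final frame calls `.groups()` on `None`, which is what the `re` matching functions return when a pattern does not match. A minimal sketch of a guarded version, assuming the code extracts a non-digit ID prefix (the pattern and the function name `id_prefix` are illustrative assumptions, not the expression used in replace_OGS.py):

```python
import re

def id_prefix(feature_id, pattern=r"^(\D+)\d+"):
    """Return the non-digit prefix of an ID, or None if the ID does not match."""
    match = re.match(pattern, feature_id)
    if match is None:
        # Let the caller decide how to handle IDs with an unexpected shape
        # instead of failing with AttributeError on None.
        return None
    return match.groups()[0]
```

Reporting the offending ID at this point would also tell the user which feature in the second GFF3 file has an unexpected ID format.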

Extract protein seq

Thanks for developing this tool. May I know how gff3_to_fasta handles IUPAC ambiguity codes in the genome FASTA during translation?

gff3_merge user-defined file option does not work in python 3

When using gff3_merge, I had to use the user-defined file option. The sequence extraction and blast seemed to run fine, but the replacement still didn't work. I tried the same command with a python 2 installation and the expected models were replaced.

The dataset I was working with is private, so I can't post it - but I will keep it around until we have capacity to fix this (probably in February).

minor documentation issue: formatting problems in some sections of pdf

Hi-
just looking over https://buildmedia.readthedocs.org/media/pdf/gff3toolkit/latest/gff3toolkit.pdf
I noticed that sections like:
2.1.3 Single feature (Esf)
4.2 gff3_fix
and some others appear to be missing tabular formatting, which makes them a little difficult for a prospective user to read. It looks like a case of:
https://stackoverflow.com/questions/44461762/sphinx-is-not-recognising-my-markdown-tables
The Markdown from which I'm guessing it was generated renders quite nicely in
https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/gff3_fix.py-documentation.md
so I'm just going to refer to those pages, but I thought I'd mention it anyway since I stumbled across the PDF first. It may not be worth fixing, but https://pypi.org/project/sphinx-markdown-tables/, as mentioned in the SO post, may be worth a try.

thanks for a nicely documented set of tools!
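
If the package is tried, enabling it should only need an entry in the Sphinx configuration. A sketch of the fragment, assuming the configuration lives in `docs/conf.py` and that the Markdown sources are parsed with recommonmark (both are assumptions about this repository):

```python
# docs/conf.py (fragment); requires `pip install sphinx-markdown-tables`
extensions = [
    "recommonmark",            # assumption: the Markdown parser already in use
    "sphinx_markdown_tables",  # renders Markdown pipe tables
]
```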

SeqID does not end with a number.

Hello,
I ran gff3_sort using the command below and got the error that follows
gff3_sort --gff_file mysample_results20220802/annot.gff --output_gff mysample_sort.gff3

ERROR [SeqID] SeqID does not end with a number.

Line 6: 1 Local region 1 3396752 . + . ID=1:1..3396752;Dbxref=taxon:1386;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=replaceme
Adding argument -r like " gff3_sort -g example_file/example.gff3 -og example-sorted.gff3 -r " can handle this situation.

I went ahead and added the flag -r
gff3_sort --gff_file mysample_results20220802/annot.gff --output_gff mysample_sort.gff3 -r
But I got this

Traceback (most recent call last):
  File "/apps/gff3toolkit/2.0.3/bin/gff3_sort", line 8, in <module>
    sys.exit(script_main())
  File "/apps/gff3toolkit/2.0.3/lib/python3.9/site-packages/gff3tool/bin/gff3_sort.py", line 437, in script_main
    main(args.gff_file, output=args.output_gff, isoform_sort=args.isoform_sort, sorting_order=sorting_order, logger=logger_stderr, reference=args.reference)
  File "/apps/gff3toolkit/2.0.3/lib/python3.9/site-packages/gff3tool/bin/gff3_sort.py", line 223, in main
    sequence_regions[sequence_region['seqid']] = (sequence_region['start'], sequence_region['end'])
KeyError: 'end'

It seems to me that the above "Line 6" must be skipped in the file annot.gff

Any thoughts on that?
Thanks,
TJ
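
One possible cause (not confirmed here) is a `##sequence-region` directive that does not carry all three of seqid, start, and end, so the parsed record has no `end` key. A sketch of a defensive parser under that assumption; the function name and the record layout are hypothetical:

```python
def parse_sequence_region(line):
    """Parse '##sequence-region seqid start end'; return None if malformed."""
    fields = line.strip().split()
    if len(fields) != 4 or fields[0] != "##sequence-region":
        return None
    seqid, start, end = fields[1:]
    if not (start.isdigit() and end.isdigit()):
        return None
    return {"seqid": seqid, "start": int(start), "end": int(end)}

sequence_regions = {}
for line in ["##sequence-region 1 1 3396752", "##sequence-region 1"]:
    region = parse_sequence_region(line)
    if region is not None:  # skip malformed directives instead of raising KeyError
        sequence_regions[region["seqid"]] = (region["start"], region["end"])
```

Checking the `##sequence-region` lines at the top of annot.gff against this shape would confirm or rule out the guess.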

Update error handling in gff3_QC

Create new error (definitely violating the specification)/warning (probably violating the specification)/info (might be worth checking) classes

research the best, or most standard, way to handle these

Modify the gff3toolkit to change the following error messages for gff3_QC - see Flag type column:

Intra-model: Multiple features within a model (Ema)


The error category 'Intra-model' collects formatting errors that can be
found by jointly considering multiple features within a gene model, such
as gene, mRNA, exon, and CDS features. Errors in this category are given
an 'Error\_Code' starting with 'Ema'.

+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Error\_Code   | Error\_Tag                                                                              | Flag type                  |
+===============+=========================================================================================+============================+
| Ema0001       | Parent feature start and end coordinates exceed those of child features                 | Warning                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0002       | Protein sequence contains internal stop codons                                          | Warning                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0003       | This feature is not contained within the parent feature coordinates                     | Warning                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0004       | Incomplete gene feature that should contain at least one mRNA, exon, and CDS            | Info                       |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0005       | Pseudogene has invalid child feature type                                               | Info (we need to replace   |
|               |                                                                                         | this function in the       |
|               |                                                                                         | future)                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0006       | Wrong phase                                                                             | Info (we need to replace   |
|               |                                                                                         | this function in the       |
|               |                                                                                         | future)                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0007       | CDS and parent feature on different strands                                             | Warning                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0008       | Warning for distinct isoforms that do not share any regions                             | Warning                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+
| Ema0009       | Incorrectly merged gene parent? Isoforms that do not share coding sequences are found   | Warning                    |
+---------------+-----------------------------------------------------------------------------------------+----------------------------+

Inter-model: Multiple features across models (Emr)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The error category 'Inter-model' collects formatting errors that can be
found by comparing multiple gene models. Errors in this category are
given an 'Error\_Code' starting with 'Emr'.

+---------------+----------------------------------+----------------------------+
| Error\_Code   | Error\_Tag                       | Flag type                  |
+===============+==================================+============================+
| Emr0001       | Duplicate transcript found       | Warning                    |
+---------------+----------------------------------+----------------------------+
| Emr0002       | Incorrectly split gene parent?   | Warning                    |
+---------------+----------------------------------+----------------------------+
| Emr0003       | Duplicate ID                     | Error                      |
+---------------+----------------------------------+----------------------------+

Single feature (Esf)
~~~~~~~~~~~~~~~~~~~

The error category 'Single Feature' collects formatting errors that can
be found by searching the GFF3 file line by line. Errors in this
category are given an 'Error\_Code' starting with 'Esf'.

+---------------+--------------------------------------------------------------------------+----------------------------+
| Error\_Code   | Error\_Tag                                                               | Flag type                  |
+===============+==========================================================================+============================+
| Esf0001       | Feature type may need to be changed to pseudogene                        | Info                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0002       | Start/Stop is not a valid 1-based integer coordinate                     | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0003       | strand information missing                                               | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0004       | Seqid not found in any ##sequence-region                                 | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0005       | Start is less than the ##sequence-region start                           | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0006       | End is greater than the ##sequence-region end                            | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0007       | Seqid not found in the embedded ##FASTA                                  | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0008       | End is greater than the embedded ##FASTA sequence length                 | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0009       | Found Ns in a feature using the embedded ##FASTA                         | Info                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0010       | Seqid not found in the external FASTA file                               | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0011       | End is greater than the external FASTA sequence length                   | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0012       | Found Ns in a feature using the external FASTA                           | Info                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0013       | White chars not allowed at the start of a line                           | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0014       | "##gff-version" missing from the first line                              | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0015       | Expecting certain fields in the feature                                  | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0016       | ##sequence-region seqid may only appear once                             | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0017       | Start/End is not a valid integer                                         | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0018       | Start is not less than or equal to end                                   | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0019       | Version is not "3"                                                       | Info (we'll need to look   |
|               |                                                                          | into this later)           |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0020       | Version is not a valid integer                                           | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0021       | Unknown directive                                                        | Info                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0022       | Features should contain 9 fields                                         | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0023       | escape certain characters                                                | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0024       | Score is not a valid floating point number                               | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0025       | Strand has illegal characters                                            | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0026       | Phase is not 0, 1, or 2, or not a valid integer                          | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0027       | Phase is required for all CDS features                                   | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0028       | Attributes must escape the percent (%) sign and any control characters   | Info                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0029       | Attributes must contain one and only one equal (=) sign                  | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0030       | Empty attribute tag                                                      | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0031       | Empty attribute value                                                    | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0032       | Found multiple attribute tags                                            | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0033       | Found ", " in a attribute, possible unescaped                            | Info                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0034       | attribute has identical values (count, value)                            | Info                       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0035       | attribute has unresolved forward reference                               | Info (for now, need to     |
|               |                                                                          | look into this more)       |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0036       | Value of a attribute contains unescaped ","                              | Info (for now, need to     |
|               |                                                                          | check whether multiple     |
|               |                                                                          | Target values are          |
|               |                                                                          | possible)                  |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0037       | Target attribute should have 3 or 4 values                               | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0038       | Start/End value of Target attribute is not a valid integer coordinate    | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0039       | Strand value of Target attribute has illegal characters                  | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0040       | Value of Is\_circular attribute is not "true"                            | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
| Esf0041       | Unknown reserved (uppercase) attribute                                   | Error                      |
+---------------+--------------------------------------------------------------------------+----------------------------+
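
One way to carry the three flag types through the code is a small enumeration plus a lookup table keyed by error code. This is only a sketch of the idea: the names `Severity`, `FLAG_TYPE`, and `severity_of` are hypothetical, the table holds an illustrative subset of the codes listed in this issue, and defaulting unknown codes to Info is an arbitrary choice.

```python
from enum import Enum

class Severity(Enum):
    ERROR = "Error"      # definitely violates the specification
    WARNING = "Warning"  # probably violates the specification
    INFO = "Info"        # might be worth checking

# Illustrative subset of the codes in the tables of this issue.
FLAG_TYPE = {
    "Ema0001": Severity.WARNING,
    "Emr0003": Severity.ERROR,
    "Esf0012": Severity.INFO,
}

def severity_of(code):
    """Look up a code's flag type; unknown codes default to INFO (a design choice)."""
    return FLAG_TYPE.get(code, Severity.INFO)
```

The three names line up with the ERROR, WARNING, and INFO levels of Python's standard `logging` module, which may be the most conventional way to report them.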

error message says -st needed, even when supplied

I was trying to extract the sequences from a gff, and the program stopped with an error even though I supplied the requested flag:

$ python -V
Python 2.7.13

$ python ../../bin/gff3_to_fasta.py -f Ofas.scaffolds.fa -st all -g oncfas_OGSv1.2_original.gff -o test 
INFO     Checking gff file (oncfas_OGSv1.2_original.gff)...
INFO     Checking genome fasta (Ofas.scaffolds.fa)...
INFO     Specifying sequence type: (all)...
usage: gff3_to_fasta.py [-h] [-g GFF] [-f FASTA] [-st SEQUENCE_TYPE]
                        [-d DEFLINE] [-o OUTPUT_PREFIX] [-noQC] [-v]

Extract sequences from specific regions of genome based on gff file.
Testing enviroment:
1. Python 2.7

Required inputs:
1. GFF3: specify the file name with the -g argument
2. Fasta file: specify the file name with the -f argument
3. Output prefix: specify with the -o argument

Outputs:
1. Fasta formatted sequence file based on the gff3 file.

Example command: 
python2.7 bin/gff3_to_fasta.py -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences

optional arguments:
  -h, --help            show this help message and exit
  -g GFF, --gff GFF     Genome annotation file in GFF3 format
  -f FASTA, --fasta FASTA
                        Genome sequences in FASTA format
  -st SEQUENCE_TYPE, --sequence_type SEQUENCE_TYPE
                        Type of sequences you would like to extract: 
                        	"all" - FASTA files for all types of sequences listed below;
                        	"gene" - gene sequence for each record;
                        	"exon" - exon sequence for each record;
                        	"pre_trans" - genomic region of a transcript model (premature transcript);
                        	"trans" - spliced transcripts (only exons included);
                        	"cds" - coding sequences;
                        	"pep" - peptide sequences.
  -d DEFLINE, --defline DEFLINE
                        Defline format in the output FASTA file:
                        	"simple" - only ID would be shown in the defline;
                        	"complete" - complete information of the feature would be shown in the defline.
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix of output file name
  -noQC, --quality_control
                        Specify this option if you do not want to excute quality control for gff file. (default: QC is excuted)
  -v, --version         show program's version number and exit
ERROR    Required field -st missing...

install problem in conda env

I cannot install gff3tool in a freshly created conda env (including python=2.7 and perl):

(gff3tool_env) [user@XXX]$ pip install gff3tool
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting gff3tool
Using cached https://files.pythonhosted.org/packages/cf/33/8cd85764a601bf3a5e0116f8b457c308294ec9c25139aacade6e21860335/gff3tool-1.4.4.tar.gz
Building wheels for collected packages: gff3tool
Building wheel for gff3tool (setup.py) ... error
ERROR: Complete output from command /mnt/scratch_dir/user/conda/envs/gff3tool_env/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-m6SnWy/gff3tool/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-Ev8vXP --python-tag cp27:
ERROR: running bdist_wheel
running build
error: [Errno url error] invalid proxy for https: 'xx-xxxx:8080'

ERROR: Failed building wheel for gff3tool
Running setup.py clean for gff3tool
Failed to build gff3tool
Installing collected packages: gff3tool
Running setup.py install for gff3tool ... error
ERROR: Complete output from command /mnt/scratch_dir/userconda/envs/gff3tool_env/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-m6SnWy/gff3tool/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-R72i2q/install-record.txt --single-version-externally-managed --compile:
ERROR: running install
running build
error: [Errno 17] File exists: '/tmp/pip-install-m6SnWy/gff3tool/gff3tool/lib/ncbi-blast+'
----------------------------------------
ERROR: Command "/mnt/scratch_dir/user/conda/envs/gff3tool_env/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-m6SnWy/gff3tool/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-R72i2q/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-m6SnWy/gff3tool/

removing models from a list

Hello,

I would like to use the GFF3toolkit to remove some gene models (all with one isoform, from an external list) from a gff3 file. I first ran
gff3_QC -g assembly_MAKER1.gff -f assembly.fa -o QC_report1 -s QC_stats1
and got this report:

==> QC_report <==
Line_num        Error_code      Error_level     Error_tag
['Line 1']      Esf0014 Error   ["##gff-version" missing from the first line]
['Line 15079']  Esf0012 Info    [Found 5 Ns in CDS feature of length 296 using the external FASTA, consists of 1 segment (start, length): (210940, 5)]

==> QC_stats <==
Error_code      Number_of_problematic_models    Error_level     Error_tag
Esf0014 1       Error   ##gff-version" missing from the first line
Esf0012 1       Info    Found Ns in a feature using the external FASTA

(I can fix the header myself)
I wonder how I can use gff3_fix to remove ~1500 genes (gene, mRNA, exon, and CDS lines): is it possible to create a 4-column file to submit to -qc_r? Can I use any of the error codes that have a "delete_model" function? Is there a way to specify the gene ID instead of the line number?

Also, is there a feature to remove gene models whose protein sequence does not start with M?
Thanks,
Dario
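
For removing models by gene ID rather than by line number, a small standalone filter may be enough. The following is a minimal sketch, not part of GFF3toolkit: the function name `remove_models` is hypothetical, and it assumes that parent features appear before their children in the file (for example after sorting) and that column 9 uses plain `key=value` pairs.

```python
def remove_models(gff_lines, gene_ids):
    """Drop every feature whose ID is in gene_ids, plus all of its descendants.

    Assumes parents are listed before their children in the file.
    """
    removed = set(gene_ids)
    kept = []
    for line in gff_lines:
        # Keep directives, comments and blank lines untouched.
        if line.startswith('#') or not line.strip():
            kept.append(line)
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) < 9:
            kept.append(line)
            continue
        attrs = dict(
            field.split('=', 1) for field in cols[8].split(';') if '=' in field
        )
        parents = attrs.get('Parent', '').split(',')
        if attrs.get('ID') in removed or any(p in removed for p in parents):
            # Remember this ID so that its own children are dropped too.
            if 'ID' in attrs:
                removed.add(attrs['ID'])
            continue
        kept.append(line)
    return kept
```

Reading the ~1500 gene IDs into a set and passing the lines of the gff3 file through this function would drop the gene, mRNA, exon, and CDS lines of each listed model.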

gff3_fix error

Thank you for your wonderful toolkit!
When I ran gff3_fix following your instructions, I encountered an error:

This is the error report generated by gff3_QC:

This is my gff file:

Finalv2.rep.genename.v2.nored.zip

Can you help me fix this problem?

Problem with "gff3_ID_generator.py"

Hello,
So I have a gff3 file in which some lines are missing an ID (column 9). Those lines mainly concern CDSs and stop codons. I tried to use "gff3_ID_generator.py" with the command:
"python3 gff3_ID_generator.py -g ../../../UK0001.gff3 -og UK0001Mo.gff3" .
With UK0001.gff3 being my gff file with missing IDs and UK0001Mo.gff3 being my desired output.

I get this error as the output:

INFO Reading input gff3 file: (../../../UK0001.gff3)
INFO Generate new ID for features in (../../../UK0001.gff3)
Traceback (most recent call last):
File "/home/mestiri/GFF3toolkit/gff3tool/lib/gff3_ID_generator.py", line 333, in
main(in_gff=args.gff, merge_report=args.merge_report, out_merge_report=args.out_merge_report, out_gff=args.output_gff, uuid_on=args.universally_unique_identifier, prefix=args.idprefix, digitlen=args.digitlen, report=args.report, alias=args.alias)
File "/home/mestiri/GFF3toolkit/gff3tool/lib/gff3_ID_generator.py", line 260, in main
if descend['attributes']['ID'] not in ID_dict:
KeyError: 'ID'
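
The `KeyError: 'ID'` above is raised when a feature's attribute dictionary has no `ID` key. To see which lines are affected before running the script, a standalone check along these lines might help. This is a sketch only; `lines_missing_id` is a hypothetical helper, not a GFF3toolkit function, and it assumes tab-separated, nine-column feature lines.

```python
def lines_missing_id(gff_lines):
    """Return (line_number, feature_type) for feature lines without an ID attribute."""
    missing = []
    for line_num, line in enumerate(gff_lines, start=1):
        # Skip directives, comments and blank lines.
        if line.startswith('#') or not line.strip():
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) != 9:
            continue
        # Attribute keys are the text before the first '=' in each field.
        keys = {field.split('=', 1)[0] for field in cols[8].split(';') if field}
        if 'ID' not in keys:
            missing.append((line_num, cols[2]))
    return missing
```

Passing an open file handle (for example `lines_missing_id(open('UK0001.gff3'))`) lists the line numbers and feature types that lack an ID.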

I don't understand what it means, to be honest. Is there something that I misunderstood about this Python script?

Thanks for your help ! Have a nice day !

Add Support for GZIP/BGZIP Compressed Files

Hey All,

Really like the tool and am preferring its runtime over things like genome tools, which terminates after the first incongruity is found.

One thing, many fasta and gff3 files are conveniently compressed and indexed using bgzip to be served in various browsers using block compression; as well as conserve disk space. It would be awesome if the tool could read these if provided. The tool wouldn't really need to know if they were block compressed as gzip is compatible for the decompression.

Below I modified the necessary bits to run gff3_QC on compressed, uncompressed or combinations.

--- a/gff3tool/lib/check_gene_parent/find_wrongly_split_gene_parent.pl
+++ b/gff3tool/lib/check_gene_parent/find_wrongly_split_gene_parent.pl
@@ -19,7 +19,11 @@ my %id2owner = ();
 my $line = 0;
 my $typeflag = 0;
 print "Reading the gff file: $gff...\n";
-open FI, "$gff" or die "[Error] Cannot open $gff.";
+if ( $gff =~ /\.gz$/ ){  # gzip support
+    open FI, "<:gzip", $gff or die "[Error] Cannot open $gff.";
+} else {
+    open FI, "$gff" or die "[Error] Cannot open $gff.";
+}

--- a/gff3tool/lib/gff3/gff3.py
+++ b/gff3tool/lib/gff3/gff3.py
@@ -16,6 +16,7 @@ try:
 except ImportError:
     from urllib.parse import quote, unquote
 import re
+import gzip
 import string
 import logging
 import gff3tool.lib.ERROR as ERROR
@@ -69,7 +70,10 @@ def fasta_file_to_dict(fasta_file, id=True, header=False, seq=False):
     """
     fasta_file_f = fasta_file
     if isinstance(fasta_file, str):
-        fasta_file_f = open(fasta_file, 'r')
+        if fasta_file.endswith('.gz'):
+            fasta_file_f = gzip.open(fasta_file, 'rt')  # gzip support
+        else:
+            fasta_file_f = open(fasta_file, 'r')

     fasta_dict = OrderedDict()
     keys = ['id', 'header', 'seq']
@@ -528,7 +532,10 @@ class Gff3(object):

         gff_fp = gff_file
         if isinstance(gff_file, str):
-            gff_fp = open(gff_file, 'r')
+            if gff_file.endswith('.gz'):
+                gff_fp = gzip.open(gff_file, 'rt')  # gzip support
+            else:
+                gff_fp = open(gff_file, 'r')
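
The same check appears twice in the Python part of the patch, so it could be factored into one helper. A minimal sketch, assuming detection by the `.gz` file extension as in the patch; the name `open_text` is hypothetical and not part of gff3.py:

```python
import gzip


def open_text(path):
    """Open a plain or gzip/bgzip-compressed file for reading text."""
    if path.endswith('.gz'):
        # 'rt' makes gzip.open yield str lines, like the built-in open().
        return gzip.open(path, 'rt')
    return open(path, 'r')
```

Both `fasta_file_to_dict` and the `Gff3` parser could then call `open_text(...)` in place of `open(..., 'r')`.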

Anyway, really like the tool and the general idea of the various classes of incongruity with the sequence ontology. Looking forward to further development.

Thank you

How can I fix Ema0003?

Hi,
When I run the command "python bin/gff-QC.py -g test.gff3 -f ~/data/rice_genome/test.fasta -o test.txt",
I get the following error:
INFO Checking errors in the gff files: (test.gff3)...
Traceback (most recent call last):
File "bin/gff-QC.py", line 90, in
if not gff3.check_parent_boundary():
File "bin/../lib/gff3_modified/gff3_modified.py", line 243, in check_parent_boundary
self.add_line_error(line, {'message': '{2:s}: {0:s}: {1:s}'.format(parent_feature[0]['attributes']['ID'], ','.join(['({0:s}, {1:d}, {2:d})'.format(line['seqid'], line['start'], line['end']) for line in parent_feature]), ERROR_INFO['Ema0003']), 'error_type': 'BOUNDS', 'location': 'parent_boundary', 'eCode': 'Ema0003'})
IndexError: list index out of range

Is there a way to find out which lines in the GFF file cause the error, so that I can fix them?

Thanks.

gff3_fix error

I ran gff3_QC with no problems.
When I run gff3_fix, I get the following error:

INFO     Checking QC report file (error.txt)...
INFO     Checking GFF3 file (DUL_02_latest_Melon_V4_liftoff.gff)...
INFO     Reading QC report file: (error.txt)...

INFO     Reading GFF3 file: (DUL_02_latest_Melon_V4_liftoff.gff)...

Traceback (most recent call last):
  File "/home/eoren/miniconda3/bin/gff3_fix", line 8, in <module>
    sys.exit(script_main())
  File "/home/eoren/miniconda3/lib/python3.9/site-packages/gff3tool/bin/gff3_fix.py", line 95, in script_main
    gff3_fix.fix.main(gff3=gff3, output_gff=args.output_gff, error_dict=error_dict, line_num_dict=line_num_dict, logger=logger_null)
  File "/home/eoren/miniconda3/lib/python3.9/site-packages/gff3tool/lib/gff3_fix/fix.py", line 683, in main
    fix_phase(gff3=gff3, error_list=error_dict[error_code], line_num_dict=line_num_dict, logger=logger)
  File "/home/eoren/miniconda3/lib/python3.9/site-packages/gff3tool/lib/gff3_fix/fix.py", line 437, in fix_phase
    phase = (3 - ((CDS['end'] - CDS['start'] + 1 - phase) % 3)) % 3
TypeError: unsupported operand type(s) for -: 'int' and 'str'
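
The TypeError indicates that one operand of the subtraction is a string, most likely a coordinate or the phase read as text. A minimal sketch of the same formula with explicit integer conversion follows; `next_phase` is a hypothetical name, and whether the phase is the offending value in fix.py is an assumption.

```python
def next_phase(start, end, phase):
    """Phase of the CDS segment that follows a segment spanning start..end.

    Accepts ints or numeric strings; everything is converted to int first.
    """
    start, end, phase = int(start), int(end), int(phase)
    # Same arithmetic as the line shown in the traceback.
    return (3 - ((end - start + 1 - phase) % 3)) % 3
```

For example, a CDS from 724 to 1083 with phase 0 is 360 bases long, so the next segment starts at phase 0; one from 1181 to 1625 with phase 0 is 445 bases long, so the next segment starts at phase 2.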

Error in gff3_fix

Hi,
After successfully using gff3_QC, gff3_fix is giving me the following error:

(genometools) [safiand@login001 grass]$ gff3_fix -qc_r test.txt -g turneri_annotation.gff3 -og new_corrected.gff3
INFO     Checking QC report file (test.txt)...
INFO     Checking GFF3 file (turneri_annotation.gff3)...
INFO     Reading QC report file: (test.txt)...
INFO     Reading GFF3 file: (turneri_annotation.gff3)...
Traceback (most recent call last):
  File "/camp/home/safiand/home/users/safiand/.conda/envs/genometools/bin/gff3_fix", line 8, in <module>
    sys.exit(script_main())
  File "/camp/home/safiand/home/users/safiand/.conda/envs/genometools/lib/python3.10/site-packages/gff3tool/bin/gff3_fix.py", line 95, in script_main
    gff3_fix.fix.main(gff3=gff3, output_gff=args.output_gff, error_dict=error_dict, line_num_dict=line_num_dict, logger=logger_stderr)
  File "/camp/home/safiand/home/users/safiand/.conda/envs/genometools/lib/python3.10/site-packages/gff3tool/lib/gff3_fix/fix.py", line 692, in main
    split(gff3=gff3, error_list=error_dict[error_code], logger=logger)
  File "/camp/home/safiand/home/users/safiand/.conda/envs/genometools/lib/python3.10/site-packages/gff3tool/lib/gff3_fix/fix.py", line 165, in split
    childrenlist.append(c1['attributes']['ID'])
KeyError: 'ID'

So I tried gff3_ID_generator.py, but it also gives me a similar message:

(genometools) [safiand@login001 grass]$ python gff3_ID_generator.py -g turneri_annotation.gff3 -og new.gff3
INFO     Reading input gff3 file: (turneri_annotation.gff3)
INFO     Generate new ID for features in (turneri_annotation.gff3)
Traceback (most recent call last):
  File "/camp/lab/cardoso-moreiam/home/users/safiand/genome_annotation/turneri/busco/turneri_rna_prot_multiples_species/grass/gff3_ID_generator.py", line 333, in <module>
    main(in_gff=args.gff, merge_report=args.merge_report, out_merge_report=args.out_merge_report, out_gff=args.output_gff, uuid_on=args.universally_unique_identifier, prefix=args.idprefix, digitlen=args.digitlen, report=args.report, alias=args.alias)
  File "/camp/lab/cardoso-moreiam/home/users/safiand/genome_annotation/turneri/busco/turneri_rna_prot_multiples_species/grass/gff3_ID_generator.py", line 238, in main
    ID_dict[child['attributes']['ID']] = [newcID]
KeyError: 'ID'

What can I do to solve this problem? Am I doing something wrong?

My gff3 file looks like this:

(genometools) [safiand@login001 grass]$ head turneri_annotation.gff3 -n 20
# gffread augustus.hints.gtf -o turnerifiltered.gff3 --merge -L -g GCA_922788865.1_HVK001PTURNERI_genomic.shortID.fna
# gffread v0.11.6
##gff-version 3
CAKLNU010000942.1       gffcl   locus   724     2835    .       +       .       ID=RLOC_00000001;transcripts=jg1.t1
CAKLNU010000942.1       AUGUSTUS        transcript      724     2835    .       +       .       ID=jg1.t1;geneID=jg1;locus=RLOC_00000001
CAKLNU010000942.1       AUGUSTUS        CDS     724     1083    .       +       0       Parent=jg1.t1
CAKLNU010000942.1       AUGUSTUS        CDS     1181    1625    0.34    +       0       Parent=jg1.t1
CAKLNU010000942.1       AUGUSTUS        CDS     2270    2835    0.42    +       2       Parent=jg1.t1
CAKLNU010000422.1       gffcl   locus   1528    9153    .       +       .       ID=RLOC_00000002;transcripts=jg2.t1
CAKLNU010000422.1       AUGUSTUS        transcript      1528    9153    .       +       .       ID=jg2.t1;geneID=jg2;locus=RLOC_00000002
CAKLNU010000422.1       AUGUSTUS        CDS     1528    1574    0.69    +       1       Parent=jg2.t1
CAKLNU010000422.1       AUGUSTUS        CDS     1718    1788    0.68    +       2       Parent=jg2.t1
CAKLNU010000422.1       AUGUSTUS        CDS     9010    9153    0.6     +       0       Parent=jg2.t1
CAKLNU010000746.1       gffcl   locus   834     3644    .       -       .       ID=RLOC_00000003;transcripts=jg3.t1
CAKLNU010000746.1       AUGUSTUS        transcript      834     3644    .       -       .       ID=jg3.t1;geneID=jg3;locus=RLOC_00000003
CAKLNU010000746.1       AUGUSTUS        CDS     834     878     0.96    -       2       Parent=jg3.t1
CAKLNU010000746.1       AUGUSTUS        CDS     988     1011    1       -       2       Parent=jg3.t1
CAKLNU010000746.1       AUGUSTUS        CDS     1310    1336    1       -       2       Parent=jg3.t1
CAKLNU010000746.1       AUGUSTUS        CDS     2483    2518    1       -       2       Parent=jg3.t1
CAKLNU010000746.1       AUGUSTUS        CDS     2597    2695    1       -       2       Parent=jg3.t1

Thanks!

gff3_fix error

Hi,

I tried to run gff3_fix and ran into the following error.

Beforehand, I ran QC:

gff3_QC -g in.gff3 -f in.fasta -i -o QC-Check.out -s QC_Check.stats

Afterwards I extracted only the wrong phases into wrongPhase.out

grep "Wrong phase"  QC-Check.out > wrongPhase.out

It looks like this:

head wrongPhase.out
Line_num        Error_code      Error_level     Error_tag
['Line 1030']   Ema0006 Info    [Wrong phase 1, should be 0]
['Line 1102']   Ema0006 Info    [Wrong phase 2, should be 0]
['Line 2797']   Ema0006 Info    [Wrong phase 1, should be 0]
['Line 3384']   Ema0006 Info    [Wrong phase 0, should be 1]
['Line 3408']   Ema0006 Info    [Wrong phase 2, should be 0]
['Line 3414']   Ema0006 Info    [Wrong phase 0, should be 1]
['Line 3504']   Ema0006 Info    [Wrong phase 1, should be 2]
['Line 3528']   Ema0006 Info    [Wrong phase 1, should be 0]
['Line 3530']   Ema0006 Info    [Wrong phase 0, should be 1]
['Line 3552']   Ema0006 Info    [Wrong phase 2, should be 0]

Then running gff3_fix:

gff3_fix -qc_r wrongPhase.out -g in.gff3 -og out.gff3
INFO     Checking QC report file (wrongPhase.out)...
INFO     Checking GFF3 file (in.gff3)...
INFO     Reading QC report file: (wrongPhase.out)...

INFO     Reading GFF3 file: (in.gff3)...

Traceback (most recent call last):
  File "/home/user/.local/bin/gff3_fix", line 8, in <module>
    sys.exit(script_main())
  File "/home/user/.local/lib/python3.10/site-packages/gff3tool/bin/gff3_fix.py", line 95, in script_main
    gff3_fix.fix.main(gff3=gff3, output_gff=args.output_gff, error_dict=error_dict, line_num_dict=line_num_dict, logger=logger_stderr)
  File "/home/user/.local/lib/python3.10/site-packages/gff3tool/lib/gff3_fix/fix.py", line 686, in main
    fix_phase(gff3=gff3, error_list=error_dict[error_code], line_num_dict=line_num_dict, logger=logger)
  File "/home/user/.local/lib/python3.10/site-packages/gff3tool/lib/gff3_fix/fix.py", line 424, in fix_phase
    phase = list(map(int,re.findall(r'\d',line_num_dict[sorted_CDS_list[0]['line_index']+1]['Ema0006']))[1])
TypeError: 'map' object is not subscriptable
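
In Python 3, `map()` returns an iterator, so the `[1]` in the line above is applied to a map object before `list()` is called. Moving the index outside the `list(...)` call avoids the error. A minimal sketch on one Error_tag string from the report; `expected_phase` is a hypothetical helper, and the exact string stored in `line_num_dict` is an assumption.

```python
import re


def expected_phase(error_tag):
    """Extract the second digit from a tag like '[Wrong phase 1, should be 0]'."""
    # Build the full list first, then index it.
    digits = list(map(int, re.findall(r'\d', error_tag)))
    return digits[1]
```

With the tag `[Wrong phase 1, should be 0]`, the digits found are 1 and 0, and the function returns 0.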

The error stays the same when the complete report file is input.

I'm running this in a conda environment:

python --version
Python 3.12.0

Also tried it in a non-conda environment:

python --version
Python 2.7.18

gff3_fix --version
gff3_fix 2.1.0

gff3_QC --version
gff3_QC 2.1.0

What could be the problem here? Thank you a lot in advance.
Best,
Nadine

Error when running gff3_ID_generator.py

Hello, I'm currently using the gff3_ID_generator.py but it reports errors like this:

Could you please help how to fix it?

KeyError: 'replace'

Dear developers,
I get the following error when using gff3_merge on two gff3 files.

INFO     ========== Merge the two gff files ==========
INFO     Sorting the WA gff by following the order of Scaffold number and coordinates...
INFO     Sorting and printing out...
INFO     Sorting the other gff by following the order of Scaffold number and coordinates...
INFO     Sorting and printing out...
INFO     Reading WA gff3 file...
INFO     Reading the other gff3 file...
INFO     Identifying types of replacement based on replace tag...
INFO     Replacing...
Traceback (most recent call last):
  File "/public/home/zpxu/miniconda2/bin/gff3_merge", line 11, in <module>
    load_entry_point('gff3tool==1.3.0', 'console_scripts', 'gff3_merge')()
  File "/public/home/zpxu/miniconda2/lib/python2.7/site-packages/gff3tool/bin/gff3_merge.py", line 230, in script_main
    main(args.gff_file1, args.gff_file2, args.fasta, report_fh, args.output_gff, args.all, args.auto_assignment, args.user_defined_file1, args.user_defined_file2, logger=logger_stderr)
  File "/public/home/zpxu/miniconda2/lib/python2.7/site-packages/gff3tool/bin/gff3_merge.py", line 85, in main
    gff3_merge.merge.main(autoReviseGff, gff_file2, output_gff, report, user_defined1, user_defined2, logger)
  File "/public/home/zpxu/miniconda2/lib/python2.7/site-packages/gff3tool/lib/gff3_merge/merge.py", line 145, in main
    ans = ReplaceGroups.replacer_multi(root, ReplaceGroups, gff3M, u1_types, u2_types)
  File "/public/home/zpxu/miniconda2/lib/python2.7/site-packages/gff3tool/lib/gff3_merge/../replace_OGS.py", line 874, in replacer_multi
    self.info.append('{0:s}\t{1:s}\t{2:s}\t{3:s}'.format(originalID, newtarget['attributes']['ID'], newtarget['attributes']['replace'], newtarget['attributes']['modified_track']))
KeyError: 'replace'

error with gff3_sort: missing SeqID

Hi,

I am experiencing the following error with gff3_sort:

ERROR [Missing SeqID] Missing SeqID. - Line 2246892: chrX CAT gene 44080 557481 . - . ID=OLDFIELD_G0043730;Name=Aff2;source_gene_common_name=Aff2;source_gene=ENSMUSG00000031189.12;transcript_modes=transMap;gene_biotype=protein_coding

As the line is not missing anything as far as I can tell, it seems that the error is caused by the naming of the seqID ("chrX").

refactor to make unit testing easier

Should we refactor this to make unit testing easier?
Was looking at how to implement test coverage and add unit tests.
We have 40 repositories in the NAL_i5k org.

Genomics Workspace https://github.com/NAL-i5K/genomics-workspace
GFF3Toolkit https://github.com/NAL-i5K/GFF3toolkit
NAL_CSS https://github.com/NAL-i5K/NAL_CSS
wiggle-tools https://github.com/NAL-i5K/wiggle-tools
github_stat https://github.com/NAL-i5K/github_stat
retrieve_assembly_metadata https://github.com/NAL-i5K/retrieve_assembly_metadata

Program gff3_sort encountered an error.

When I run “gff3_sort -g Cl.rename.FINAL.gff3 -og sorted.gff3”

INFO Checking GFF3 file (Cl.rename.FINAL.gff3)...
INFO Reading gff3 file...
INFO Sorting and printing out...
Traceback (most recent call last):
File "/Users/wjx/miniconda3/bin/gff3_sort", line 8, in
sys.exit(script_main())
File "/Users/wjx/miniconda3/lib/python3.7/site-packages/gff3tool/bin/gff3_sort.py", line 437, in script_main
main(args.gff_file, output=args.output_gff, isoform_sort=args.isoform_sort, sorting_order=sorting_order, logger=logger_stderr, reference=args.reference)
File "/Users/wjx/miniconda3/lib/python3.7/site-packages/gff3tool/bin/gff3_sort.py", line 257, in main
otherlines.extend(gff3.collect_descendants(grandchild))
File "/Users/wjx/miniconda3/lib/python3.7/site-packages/gff3tool/lib/gff3/gff3.py", line 172, in collect_descendants
collected_list.extend(self.collect_descendants(child))
File "/Users/wjx/miniconda3/lib/python3.7/site-packages/gff3tool/lib/gff3/gff3.py", line 172, in collect_descendants
collected_list.extend(self.collect_descendants(child))
File "/Users/wjx/miniconda3/lib/python3.7/site-packages/gff3tool/lib/gff3/gff3.py", line 172, in collect_descendants
collected_list.extend(self.collect_descendants(child))
[Previous line repeated 993 more times]
File "/Users/wjx/miniconda3/lib/python3.7/site-packages/gff3tool/lib/gff3/gff3.py", line 171, in collect_descendants
collected_list.append(child)
RecursionError: maximum recursion depth exceeded while calling a Python object
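
A RecursionError in `collect_descendants` can come from a very deep hierarchy or from a parent/child cycle in the file (for example a feature that is, directly or indirectly, its own Parent). An iterative traversal with a visited set handles both. This is a sketch under assumptions: it treats each feature as a dict with a `children` list, which may not match the real data structure in gff3.py.

```python
def collect_descendants(feature):
    """Collect all descendants of a feature without recursion.

    Each feature is assumed to be a dict with an optional 'children' list.
    """
    collected = []
    seen = {id(feature)}
    # Reverse so that children are visited in their original order.
    stack = list(reversed(feature.get('children', [])))
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # already visited: a cycle or a shared child
        seen.add(id(node))
        collected.append(node)
        stack.extend(reversed(node.get('children', [])))
    return collected
```

If a cycle exists, each feature is still returned only once and the loop terminates, which also makes it possible to report the offending features.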

Pip install

Hi there,

Thank you very much for helping!

I am trying to install the GFF3toolkit on macOS using pip install, but it fails. What could be the reason?

The following are my errors:

apple@pc-206-171 ~ % python3 -m pip install git+https://github.com/NAL-i5K/GFF3toolkit.git
Collecting git+https://github.com/NAL-i5K/GFF3toolkit.git
Cloning https://github.com/NAL-i5K/GFF3toolkit.git to /private/var/folders/c_/qk6ctf6513q1ygygbvsw15880000gn/T/pip-req-build-wynu3ymp
Running command git clone -q https://github.com/NAL-i5K/GFF3toolkit.git /private/var/folders/c_/qk6ctf6513q1ygygbvsw15880000gn/T/pip-req-build-wynu3ymp
ERROR: Command errored out with exit status 1:
command: /usr/local/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/c_/qk6ctf6513q1ygygbvsw15880000gn/T/pip-req-build-wynu3ymp/setup.py'"'"'; file='"'"'/private/var/folders/c_/qk6ctf6513q1ygygbvsw15880000gn/T/pip-req-build-wynu3ymp/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/c_/qk6ctf6513q1ygygbvsw15880000gn/T/pip-pip-egg-info-sbqce5fp
cwd: /private/var/folders/c_/qk6ctf6513q1ygygbvsw15880000gn/T/pip-req-build-wynu3ymp/
Complete output (5 lines):
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/c_/qk6ctf6513q1ygygbvsw15880000gn/T/pip-req-build-wynu3ymp/setup.py", line 22, in
from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
ModuleNotFoundError: No module named 'wheel'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

can't install

I ran:
pip install gff3tool
It stopped at:
Building wheel for gff3tool (setup.py) ... -

Add argument to generate non-unique CDS IDs for a given mRNA parent feature

The gff3 specification states that discontinuous features, such as CDS, need not have unique IDs. Instead they can share an ID to indicate that they are all part of a discontinuous feature. Whether or not you'll want unique or the same IDs for individual CDS lines of a given CDS feature usually depends on what you'll do with the gff downstream - for example, for Tripal ingest, CDS lines corresponding to a single feature should share an ID. So, it would be great if gff3_ID_generator.py had an option to not generate unique IDs for features that share a parent feature. For the user, I'd envision this as something like '-n'. Then, the program would only generate 1 ID for all CDS features that share a parent feature.

Example result for 1 gene with 2 isoforms using the proposed flag '-n CDS':

KZ848496.1      .       gene    715     17058   .       +       .       ID=LSTR000001;
KZ848496.1      .       mRNA    715     7345   .       +       .      Parent=LSTR000001;ID=LSTR000001-RA;
KZ848496.1      .       exon    715     899     .       +       .       ID=LSTR000001-RA-exon001;Parent=LSTR000001-RA
KZ848496.1      .       CDS     1418    1584    .       +       0       ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1      .       exon    7255    7345    .       +       .       ID=LSTR000001-RA-exon002;Parent=LSTR000001-RA
KZ848496.1      .       CDS     7255    7345    .       +       1       ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1      .       mRNA    13242     17058   .       +       .      Parent=LSTR000001;ID=LSTR000001-RB;
KZ848496.1      .       exon    13242   13331   .       +       .       ID=LSTR000001-RB-exon001;Parent=LSTR000001-RB;
KZ848496.1      .       CDS     13242   13331   .       +       1       ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
KZ848496.1      .       exon    15348   17058   .       +       .       ID=LSTR000001-RB-exon002;Parent=LSTR000001-RB;
KZ848496.1      .       CDS     15348   15540   .       +       1       ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
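
The naming scheme in the example can be sketched as follows. This is only an illustration of the proposed behaviour, not existing gff3_ID_generator.py code; `assign_ids` and its `shared_types` parameter are hypothetical stand-ins for the proposed '-n' option.

```python
from collections import defaultdict


def assign_ids(features, shared_types=('CDS',)):
    """Return one ID per (feature_type, parent_id) tuple, in input order.

    Types listed in shared_types get a single ID per parent; all other
    types get a running number per parent.
    """
    counters = defaultdict(int)
    ids = []
    for ftype, parent in features:
        if ftype in shared_types:
            # Every segment of the discontinuous feature shares one ID.
            ids.append('{0}-{1}001'.format(parent, ftype))
        else:
            counters[(parent, ftype)] += 1
            ids.append('{0}-{1}{2:03d}'.format(parent, ftype, counters[(parent, ftype)]))
    return ids
```

For the first isoform above, the exon lines would receive `LSTR000001-RA-exon001` and `LSTR000001-RA-exon002`, while both CDS lines would receive `LSTR000001-RA-CDS001`.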

KeyError: 'replace'

I ran the gff3_merge program and got this error:

Traceback (most recent call last):
  File "/home/meiyang/anaconda3/bin/gff3_merge", line 33, in <module>
    sys.exit(load_entry_point('gff3tool==2.1.0', 'console_scripts', 'gff3_merge')())
  File "/home/meiyang/anaconda3/lib/python3.9/site-packages/gff3tool-2.1.0-py3.9.egg/gff3tool/bin/gff3_merge.py", line 229, in script_main
    main(args.gff_file1, args.gff_file2, args.fasta, report_fh, args.output_gff, args.all, args.auto_assignment, args.user_defined_file1, args.user_defined_file2, logger=logger_stderr)
  File "/home/meiyang/anaconda3/lib/python3.9/site-packages/gff3tool-2.1.0-py3.9.egg/gff3tool/bin/gff3_merge.py", line 70, in main
    gff3_merge.revision.main(gff_file=gff_file1, revision_file=autoFILE, output_gff=autoReviseGff, report_file=autoReviseReport, user_defined1=user_defined1, auto=auto, logger=logger)
  File "/home/meiyang/anaconda3/lib/python3.9/site-packages/gff3tool-2.1.0-py3.9.egg/gff3tool/lib/gff3_merge/revision.py", line 227, in main
    tag = ','.join(child['attributes']['replace']).replace(' ','')
KeyError: 'replace'

My gff3 files look like this:
one:

##gff-version 3
##sequence-region BMSK_chr10_RagTag 64593 17730062
BMSK_chr10_RagTag       Liftoff gene    64593   65277   .       +       .       ID=gene5668;name=BMSK0005247;coverage=1.0;sequence_ID=1.0;valid_ORFs=1;extra_copy_number=0;copy_num_ID=BMSK0005247_0
BMSK_chr10_RagTag       Liftoff mRNA    64593   65277   .       +       .       ID=mRNA5668;Parent=gene5668;name=BMSK0005247.1;matches_ref_protein=True;valid_ORF=True;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff exon    64593   64628   .       +       .       Parent=mRNA5668;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff five_prime_UTR  64593   64628   .       +       .       Parent=mRNA5668;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff five_prime_UTR  64752   64818   .       +       .       Parent=mRNA5668;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff exon    64752   65277   .       +       .       Parent=mRNA5668;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff CDS     64819   65028   .       +       0       Parent=mRNA5668;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff three_prime_UTR 65029   65277   .       +       .       Parent=mRNA5668;extra_copy_number=0

two:

##gff-version 3
##sequence-region BMSK_chr10_RagTag 60014 17730543
BMSK_chr10_RagTag       Liftoff gene    60014   62435   .       -       .       ID=gene6047;Name=KWMTBOMO05391;coverage=0.997;sequence_ID=0.996;valid_ORFs=0;extra_copy_number=0;copy_num_ID=KWMTBOMO05391_0
BMSK_chr10_RagTag       Liftoff mRNA    60014   62435   .       -       .       ID=mRNA6047;Name=KWMTBOMO05391;Parent=gene6047;matches_ref_protein=False;valid_ORF=False;missing_stop_codon=True;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff transcription_end_site  60014   60014   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff exon    60014   60869   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff stop_codon      60169   60171   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff CDS     60169   60869   .       -       1       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff terminal        60169   60869   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff CDS     62184   62227   .       -       0       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff initial 62184   62227   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff exon    62184   62338   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff start_codon     62225   62227   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff exon    62427   62435   .       -       .       Parent=mRNA6047;extra_copy_number=0
BMSK_chr10_RagTag       Liftoff transcription_start_site        62435   62435   .       -       .       Parent=mRNA6047;extra_copy_number=0

Can anyone help?
