
kuanhao-chao / lifton


🚀 LiftOn: Accurate annotation mapping for GFF/GTF across assemblies

Home Page: http://ccb.jhu.edu/lifton

License: GNU General Public License v3.0

Shell 0.09% Python 90.76% Jupyter Notebook 9.15%
genome-annotation liftover homolgy protein-maximization-algorithm t2tchm13-annotation lifton

lifton's Introduction


LiftOn is a homology-based lift-over tool that uses both DNA-DNA alignments (from Liftoff, credits to Dr. Alaina Shumate) and protein-DNA alignments (from miniprot, credits to Dr. Heng Li) to accurately map annotations between genome assemblies of the same or different species. LiftOn employs a two-step protein maximization algorithm to improve the annotation of protein-coding genes in the T2T-CHM13 JHU RefSeqv110 + Liftoff v5.1 annotation. The latest T2T-CHM13 annotation generated by LiftOn is available as JHU_LiftOn_v1.0_chm13v2.0.gff3 (ftp://ftp.ccb.jhu.edu/pub/data/LiftOn/JHU_LiftOn_v1.0_chm13v2.0.gff3).

Installation#

Install through pip#

LiftOn is now on PyPI. Check out all the releases here. Pip automatically resolves and installs any Python dependencies required by LiftOn.

$ pip install lifton

Install from source#

You can also install LiftOn from source. Check out the latest version!

$ git clone https://github.com/Kuanhao-Chao/LiftOn

$ cd LiftOn

$ python setup.py install


Why LiftOn❓#

  1. Burgeoning number of genome assemblies: As of December 2023, there are 30,530 eukaryotic genomes, 567,228 prokaryotic genomes, and 66,429 viruses listed on NCBI (NCBI genome browser). However, genome annotation is lagging behind. As more high-quality assemblies are generated, we need an accurate lift-over tool to annotate them.

  2. Improved protein-coding gene mapping: The popular Liftoff tool maps genes based on DNA alignments alone, while miniprot maps genes based on protein alignments but without gene structure information. Neither may be as accurate on its own (see the FAQ entry "Common mistakes of Liftoff and miniprot"). LiftOn combines both DNA-to-genome and protein-to-genome alignments and produces better gene mapping results. LiftOn improves upon the currently released T2T-CHM13 annotation (JHU RefSeqv110 + Liftoff v5.1).

  3. Improved distantly related species lift-over: A key limitation of DNA-based lift-over tools is that they do not perform well when the reference and target genomes have significantly diverged. With the help of protein alignments and the protein maximization algorithm, LiftOn improves the lift-over process between distantly related species. See "Mouse to Rat" and "Drosophila melanogaster to Drosophila erecta" result sections.

LiftOn is free, it's open source, it's easy to install, and it's in Python!


Who is it for❓#

LiftOn is designed for researchers and bioinformaticians who are interested in genome annotation. It is an easy-to-install and easy-to-run command-line tool. Specifically, it is beneficial in the following scenarios:

  1. If you have sequenced and assembled a new genome and require annotation, LiftOn provides an efficient solution for generating annotations for your genome.

  2. LiftOn is an excellent tool for those looking to perform comparative genomics analysis. It facilitates the lifting over and comparison of gene contents between different genomes, aiding in understanding evolutionary relationships and functional genomics.

  3. For researchers interested in using T2T-CHM13 annotations, try LiftOn! We have pre-generated the JHU_LiftOn_v1.0_chm13v2.0.gff3 (ftp://ftp.ccb.jhu.edu/pub/data/LiftOn/JHU_LiftOn_v1.0_chm13v2.0.gff3) file for your convenience.


What does LiftOn do❓#

Let's first define the problem. Given a reference Genome R, its Annotation RA, and a target Genome T, the lift-over problem is the process of converting the coordinates of Annotation RA from Genome R to Genome T and generating a new annotation file, Annotation TA. A simple illustration of the lift-over problem is shown in Figure 1.

Figure 1: Illustration of the lift-over problem (graphics/liftover_illustration.gif).
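To make the problem statement concrete, here is a toy sketch in Python. It is not LiftOn's algorithm: it assumes a hypothetical list of gap-free alignment blocks between Genome R and Genome T and simply shifts a feature's coordinates when the feature falls entirely inside one block. The names `AlignedBlock` and `lift_interval` are invented for this illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class AlignedBlock(NamedTuple):
    """Hypothetical gap-free alignment block (1-based, inclusive coordinates)."""
    ref_start: int   # first aligned base on Genome R
    ref_end: int     # last aligned base on Genome R
    tgt_start: int   # position on Genome T that ref_start aligns to


def lift_interval(start: int, end: int,
                  blocks: List[AlignedBlock]) -> Optional[Tuple[int, int]]:
    """Convert a feature from Genome R coordinates to Genome T coordinates.

    Returns None when no single block fully contains the feature, which
    corresponds to an "unmapped feature" in this toy model.
    """
    for block in blocks:
        if block.ref_start <= start and end <= block.ref_end:
            offset = block.tgt_start - block.ref_start
            return start + offset, end + offset
    return None


# Two made-up blocks: R[1..1000] -> T[501..1500], R[2001..3000] -> T[1801..2800]
blocks = [AlignedBlock(1, 1000, 501), AlignedBlock(2001, 3000, 1801)]
print(lift_interval(100, 250, blocks))    # feature inside the first block
print(lift_interval(1500, 1600, blocks))  # feature in an unaligned region
```

Real lift-over has to deal with gaps, inversions, and features that span several alignments, which is where alignment-based tools such as Liftoff and miniprot come in.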

LiftOn is the best tool to help you solve this problem! LiftOn employs a two-step protein maximization algorithm (PM algorithm).

  1. The first module is the chaining algorithm. It starts by extracting protein sequences annotated by Liftoff and miniprot. LiftOn then aligns these sequences to full-length reference proteins. For each gene locus, LiftOn compares each section of the protein alignments from Liftoff and miniprot, chaining together the best combinations.

  2. The second module is the open-reading frame search (ORF search) algorithm. In the case of truncated protein-coding transcripts, this algorithm examines alternative frames to identify the ORF that produces the longest match with the reference protein.
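The ORF search idea in step 2 can be sketched in a few lines of Python. This is a simplified illustration, not LiftOn's implementation: it only scans the three forward frames, uses the standard genetic code, and scores each candidate ORF by the number of residues that `difflib.SequenceMatcher` matches against the reference protein. The function names (`translate`, `orfs_in_frame`, `best_orf`) are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def orfs_in_frame(dna, frame):
    """Yield candidate proteins (from the first M up to a stop) in one frame."""
    for segment in translate(dna[frame:]).split("*"):
        start = segment.find("M")
        if start != -1:
            yield segment[start:]


def matched_residues(a, b):
    """Number of residues placed in matching blocks between two proteins."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def best_orf(dna, ref_protein):
    """Return (protein, frame, score) for the ORF that best matches the reference."""
    best = ("", 0, 0)
    for frame in range(3):
        for orf in orfs_in_frame(dna.upper(), frame):
            score = matched_residues(orf, ref_protein)
            if score > best[2]:
                best = (orf, frame, score)
    return best


# A one-base shift hides the protein in frame 0; frame 1 recovers it.
print(best_orf("GATGAAAACCGCTTATTAA", "MKTAY"))
```

A production version would also need to handle the reverse strand, splice structure, and proper protein alignment scoring, none of which are shown here.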


Inputs & outputs#

  • Input:
    1. target Genome T in FASTA format.

    2. reference Genome R in FASTA format.

    3. reference Annotation RA in GFF3 format.

  • Output:
    1. LiftOn annotation file, Annotation TA, in GFF3 format.

    2. Protein sequence identities & mutation types

    3. Features with extra copies

    4. Unmapped features
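As a rough guide to how these inputs map onto a command line, the sketch below assembles a `lifton` invocation in Python. It only uses flags that appear in commands quoted in the issues on this page (`-g` for the reference annotation, `-o` for the output file, `-P` for a protein FASTA), and it assumes, as in those commands, that the target genome comes before the reference genome. `build_lifton_command` is a hypothetical helper, and the file names are placeholders; check `lifton --help` for the authoritative option list.

```python
import shlex


def build_lifton_command(target_fasta, reference_fasta, reference_gff,
                         output_gff="lifton.gff3", proteins_fasta=None):
    """Assemble (but do not run) a lifton command as an argument list."""
    cmd = ["lifton", "-g", reference_gff, "-o", output_gff]
    if proteins_fasta is not None:
        # Optional reference protein FASTA, as in the "-P proteins.fa" example.
        cmd += ["-P", proteins_fasta]
    # Positional arguments: target genome first, then reference genome.
    cmd += [target_fasta, reference_fasta]
    return cmd


cmd = build_lifton_command("target.fa", "reference.fa", "reference.gff3")
print(shlex.join(cmd))
```

Building the command as a list makes it straightforward to pass to `subprocess.run(cmd, check=True)` without shell quoting problems.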


User support#

Please go through the documentation below first. If you have questions about using the package, a bug report, or a feature request, please use the GitHub issue tracker here:

https://github.com/Kuanhao-Chao/LiftOn/issues


Key contributors#

LiftOn was designed and developed by Kuan-Hao Chao. This documentation was written by Kuan-Hao Chao and Alan Mao. The LiftOn logo was designed by Alan Mao.




LiftOn's limitations#

LiftOn's chaining algorithm currently uses only miniprot alignment results to fix the Liftoff annotation. However, it can be extended to chain together multiple DNA- and protein-based annotation files or assembled RNA-Seq transcripts.
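To give a feel for what chaining across more than two sources could look like, here is a deliberately simplified Python sketch. It assumes each source has already been scored per protein section against the reference (for example by identity), and it picks the best-scoring source for every section. The function name, the source labels, and the scores are all made up for illustration; LiftOn's actual chaining logic is more involved.

```python
def chain_best_sections(section_scores):
    """Pick the best-scoring source for each section.

    section_scores maps a source name to a list of per-section scores;
    all lists must have the same length. Ties go to the source listed first.
    """
    sources = list(section_scores)
    if not sources:
        return []
    lengths = {len(scores) for scores in section_scores.values()}
    if len(lengths) != 1:
        raise ValueError("every source must score the same number of sections")
    n_sections = lengths.pop()
    return [max(sources, key=lambda s: section_scores[s][i])
            for i in range(n_sections)]


# Hypothetical per-section identities for one transcript.
scores = {
    "liftoff":  [1.00, 0.40, 0.90],
    "miniprot": [0.80, 0.95, 0.90],
    "rnaseq":   [0.70, 0.97, 0.85],
}
print(chain_best_sections(scores))
```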

DNA- and protein-based methods still have some limitations. We are developing a module to merge the LiftOn annotation with the released curated annotations to generate better annotations.

The LiftOn chaining algorithm does not yet support multi-threading. This is the next feature we plan to add.


Cite us#

Kuan-Hao Chao, Jakob M. Heinz, Celine Hoh, Alan Mao, Alaina Shumate, Mihaela Pertea, and Steven L. Salzberg. "Combining DNA and protein alignments to improve genome annotation with LiftOn." bioRxiv, doi: https://doi.org/10.1101/2024.05.16.593026, 2024.

Alaina Shumate, and Steven L. Salzberg. "Liftoff: accurate mapping of gene annotations." Bioinformatics 37.12 (2021): 1639-1643.






lifton's Issues

What does '-E'/evaluation mode do?

Hi, thank you for this tool and for your thorough and helpful documentation and examples!

I'm looking through the options and see the '-E' flag labelled as 'Run LiftOn in evaluation mode'.

When I try and run this with the examples (eg the HoneyBee one) it produces an error? eg:

lifton -g HAv3.1_genomic.gff -o lifton.gff3 -copies ASM1932182v1_genomic.fna HAv3.1_genomic.fna -E

proceeds to start running normally and then errors at:

>> Creating target database :  lifton.gff3
gffutils database build failed with lifton.gff3 cannot be found and does not appear to be a URL

Is 'Evaluation mode' more of a testing feature? Thanks!

FeatureNotFoundError with gffutils

Hello,

I am trying to use this tool to annotate a genome assembly of a plant species, Centaurea corymbosa, using the data of a closely related species, Centaurea solstitialis. I have launched the tool like so:
lifton -g annotation.gff -P proteins.fa -t 30 centaurea_genome.fasta genome.fa
with annotation.gff, proteins.fa and genome.fa being Centaurea solstitialis data downloaded directly from NCBI.

During the miniprot annotation step, I get this error:

>> Creating miniprot annotation database : miniprot.gff3
2024-05-22 17:24:32,880 - INFO - Populating features
2024-05-22 17:25:24,586 - INFO - Populating features table and first-order relations: 350342 features
2024-05-22 17:25:24,586 - INFO - Updating relations
2024-05-22 17:25:30,261 - INFO - Creating relations(parent) index
2024-05-22 17:25:30,442 - INFO - Creating relations(child) index
2024-05-22 17:25:30,662 - INFO - Creating features(featuretype) index
2024-05-22 17:25:31,093 - INFO - Creating features (seqid, start, end) index
2024-05-22 17:25:31,809 - INFO - Creating features (seqid, start, end, strand) index
2024-05-22 17:25:32,583 - INFO - Running ANALYZE features
/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/Bio/Seq.py:2880: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  warnings.warn(
aligning features
lifting features
>> LiftOn processed: 20 features.Traceback (most recent call last):
  File "/opt/biotools/conda/envs/LiftOn_env/bin/lifton", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/lifton.py", line 300, in run_all_lifton_steps
    lifton_gene = run_liftoff.process_liftoff(None, locus, ref_db.db_connection, l_feature_db, ref_id_2_m_id_trans_dict, m_feature_db, tree_dict, tgt_fai, ref_proteins, ref_trans, ref_features_dict, fw_score, fw_chain, args, ENTRY_FEATURE=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/run_liftoff.py", line 162, in process_liftoff
    lifton_gene = process_liftoff(parent_feature, feature, ref_db, l_feature_db, ref_id_2_m_id_trans_dict, m_feature_db, tree_dict, tgt_fai, ref_proteins, ref_trans, ref_features_dict, fw_score, fw_chain, args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/run_liftoff.py", line 170, in process_liftoff
    lifton_trans, cds_num = lifton_add_trans_exon_cds(lifton_gene, locus, ref_db, l_feature_db, ref_trans_id)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/run_liftoff.py", line 69, in lifton_add_trans_exon_cds
    lifton_trans = lifton_gene.add_transcript(ref_trans_id, copy.deepcopy(locus), copy.deepcopy(ref_db[ref_trans_id].attributes))
                                                                                                ~~~~~~^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/gffutils/interface.py", line 296, in __getitem__
    raise FeatureNotFoundError(key)
gffutils.exceptions.FeatureNotFoundError: rna-OSB04

I have checked that the 'rna-OSB04' feature is present in the gff file, but I don't really know what is wrong with it. I thought it might be because of the file format so I tried to launch the tool with the gtf file instead, but it finds 0 feature in the file and thinks it's an empty file.

I have also tried the tool with another, less-closely related species (Cynara cardunculus) and it worked just fine with the gff file (the command was lifton -g annotation.gff -P proteins.fa -T rna.fa -t 30 centaurea_genome.fasta genome.fa).

Do you have an idea of what might be wrong in the case of C. solstitialis ?

Thank you for your help

coordinates are off in the output

Hi - thanks for this tool. Great idea. We have usually done this with SPALN, so it is intriguing to see how miniprot works.

I am facing one issue, which is probably an output and coordinate-management issue. Running the same lift-over with Liftoff and LiftOn, I get errors with LiftOn: thousands of genes have features where end < start.

LiftOFF output:

PRKT01001132.1 Liftoff CDS 100681 100708 . - . ID=cds-XP_061166086.1;Parent=rna-XM_061310102.1;Dbxref=GeneID:133175006,GenBank:XP_061166086.1;Name=XP_061166086.1;gbkey=CDS;gene=LOC133175006;product=F-BAR domain only protein 2-like isoform X2;protein_id=XP_061166086.1;extra_copy_number=0

LiftON output

PRKT01001132.1 LiftOn CDS 100681 100247 . - 0 ID=cds-XP_061166086.1;Parent=rna-XM_061310102.1;Dbxref=GeneID:133175006,GenBank:XP_061166086.1;Name=XP_061166086.1;gbkey=CDS;gene=LOC133175006;product=F-BAR domain only protein 2-like isoform X2;protein_id=XP_061166086.1

This causes errors like the following, e.g. from gffread:
Error: invalid feature coordinates (end<start!)
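A quick way to find every affected record is to scan the GFF3 output for features whose end coordinate is smaller than the start. The Python sketch below does this; `find_inverted_features` is a hypothetical helper name, and the sample lines are shortened versions of the two records quoted above.

```python
def find_inverted_features(lines):
    """Return (line_no, seqid, type, start, end) for features with end < start."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed 9-column GFF3 record
        try:
            start, end = int(cols[3]), int(cols[4])
        except ValueError:
            continue  # non-numeric coordinates; ignore in this sketch
        if end < start:
            bad.append((line_no, cols[0], cols[2], start, end))
    return bad


sample = [
    "##gff-version 3\n",
    "PRKT01001132.1\tLiftoff\tCDS\t100681\t100708\t.\t-\t.\tID=cds-XP_061166086.1\n",
    "PRKT01001132.1\tLiftOn\tCDS\t100681\t100247\t.\t-\t0\tID=cds-XP_061166086.1\n",
]
for hit in find_inverted_features(sample):
    print(hit)
```

To check a real file, pass an open file handle instead of the sample list, for example `find_inverted_features(open("lifton.gff3"))`.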

Installation with pip fails, requiring networkx>=3.3

Hi there!

I'm trying to install LiftOn with pip. I see LiftOn requires networkx>=2.4, but running pip install LiftOn seems to be looking for a more recent version of networkx that pip can't find (ERROR: Could not find a version that satisfies the requirement networkx>=3.3 (from LiftOn)). Do you know why this is happening?

Thanks,
Zoe

accept gtf format

Hi,

It seems that LiftOn only accepts the GFF3 format, but GTF is also a popular gene structure annotation format. Would you please modify it to accept GTF as annotation input?

Best,
Kun

AttributeError: module 'numpy' has no attribute 'int'.

A user reported this error when running LiftOn. This is a dependency error that needs to be resolved.

Traceback (most recent call last):
  File "/hlilab/yingzhou/Softwareplayground/lifton/install/bin/./lifton", line 33, in <module>
    sys.exit(load_entry_point('lifton==1.0.0', 'console_scripts', 'lifton')())
  File "/hlilab/yingzhou/Softwareplayground/lifton/install/bin/./lifton", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/local/lib/python3.9/importlib/metadata.py", line 86, in load
    module = import_module(match.group('module'))
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/lifton.py", line 1, in <module>
    from lifton import mapping, intervals, lifton_utils, annotation, extract_sequence, stats, logger, run_liftoff, run_miniprot, run_evaluation, __version__
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/mapping.py", line 1, in <module>
    from lifton import lifton_utils, logger
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/lifton_utils.py", line 3, in <module>
    from lifton import align, lifton_class, run_liftoff, run_miniprot, logger
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/run_liftoff.py", line 4, in <module>
    from lifton.liftoff import liftoff_main
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/liftoff/liftoff_main.py", line 1, in <module>
    from lifton.liftoff import write_new_gff, liftover_types, polish, align_features, lift_features
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/liftoff/liftover_types.py", line 1, in <module>
    from lifton.liftoff  import fix_overlapping_features, lift_features, liftoff_utils, align_features, extract_features
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/liftoff/fix_overlapping_features.py", line 1, in <module>
    from lifton.liftoff  import lift_features, liftoff_utils
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/liftoff/lift_features.py", line 1, in <module>
    from lifton.liftoff  import find_best_mapping, liftoff_utils, merge_lifted_features
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/lifton/liftoff/find_best_mapping.py", line 1, in <module>
    import networkx as nx
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/networkx/__init__.py", line 115, in <module>
    import networkx.readwrite
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/networkx/readwrite/__init__.py", line 15, in <module>
    from networkx.readwrite.graphml import *
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/networkx/readwrite/graphml.py", line 314, in <module>
    class GraphML(object):
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/networkx/readwrite/graphml.py", line 346, in GraphML
    (np.int, "int"), (np.int8, "int"),
  File "/homes8/yingzhou/.local/lib/python3.9/site-packages/numpy/__init__.py", line 324, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
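The traceback shows the installed networkx still referencing `np.int`, an alias that NumPy removed in version 1.24, so the two installed packages are out of step with each other. A small Python sketch like the one below can report the installed versions before deciding which package to upgrade or pin. The helper names are hypothetical, and the version check is a coarse major/minor comparison only.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "networkx")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def numpy_lacks_int_alias(version):
    """True if this NumPy version no longer provides the np.int alias (>= 1.24)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 24)


versions = installed_versions()
print(versions)
if versions["numpy"] and numpy_lacks_int_alias(versions["numpy"]):
    print("This NumPy has no np.int; a networkx release that still uses it will fail.")
```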

lifton installation error [pip]

Summary

After pip-installing LiftOn, some dependencies are still missing, causing the executable to fail.

Steps to reproduce

$ pip install lifton # the most recent PyPI release == 1.0.4

What is the current bug behavior?

I've encountered two dependency errors so far:

miniprot is not installed. Please install miniprot before running LiftOn.
$ FileNotFoundError: [Errno 2] No such file or directory: 'minimap2'

System tested on:

Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz / CentOS Linux 7 x86_64

Did you manage to work around this bug? If so, how?

I ran conda install minimap2 -c bioconda separately.
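Both error messages point to external executables (miniprot and minimap2) that are not Python packages and so are not pulled in by pip. A minimal pre-flight check in Python, using the standard library's `shutil.which`, can confirm that they are on `PATH` before a long run starts. `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("minimap2", "miniprot")):
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("minimap2 and miniprot are both available.")
```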

Provided CHM13 file has incorrect exons

The exons in the provided "upgraded" CHM13 file are incorrectly named. See gene "NOC2L", which has 19 exons, but exons 2-18 are erroneously termed exon 19 as well.

Cannot create DNA/protein dictionaries from the reference annotation, and error in miniprot

Hi,

Thank you for answering the question quickly and clearly. I am excited to apply LiftON to my project.
I want to map the de novo gene annotations to the closest species nanopore fly assembly.
I encountered a new issue: the log showed no transcripts and no proteins. Additionally, miniprot displayed an error, but running Liftoff did not produce any errors.

Can you help me to understand why and resolve this issue? Thanks a lot!

Here is the log message

Creating reference annotation database:

Creating transcript DNA dictionary from the reference annotation ...
Creating transcript protein dictionary from the reference annotation ...

  • number of transcripts: 0
  • number of proteins: 0
    • number of truncated proteins: 0

miniprot analysis part:

Creating miniprot annotation database : ./lifton_output/miniprot/miniprot.gff3
2024-07-26 17:39:34,010 - INFO - Populating features
gffutils database build failed with No lines parsed -- was an empty file provided?
2024-07-26 17:39:34,355 - INFO - Populating features
gffutils database build failed with No lines parsed -- was an empty file provided?

Here is my code, and my gene annotation file format is GFF3.

ref="path/a_inornatus_100Kb_HiC_assembly_MAY_2021.fasta"
nano30="path/assembly.fasta"

lifton -g Ino.gff -o nano30.gff3 -copies -sc 0.95 $nano30 $ref

Here is the format of the header of the Ino.gff file:

##gff-version 3.1.26
Chr_1 AUGUSTUS transcript 1861 8031 . + . ID=Ino_00001.t1;gene_id=Ino_00001;
Chr_1 AUGUSTUS gene 1861 8031 . + . ID=Ino_00001;
Chr_1 AUGUSTUS exon 1861 1959 . + . ID=exon_1;Parent=Ino_00001.t1
Chr_1 AUGUSTUS exon 3661 3722 . + . ID=exon_2;Parent=Ino_00001.t1
Chr_1 AUGUSTUS exon 7780 7889 . + . ID=exon_3;Parent=Ino_00001.t1
Chr_1 AUGUSTUS exon 7980 8031 . + . ID=exon_4;Parent=Ino_00001.t1
Chr_1 AUGUSTUS transcript 53681 56137 . + . ID=Ino_00002.t1;gene_id=Ino_00002;
Chr_1 AUGUSTUS gene 53681 56137 . + . ID=Ino_00002;
Chr_1 AUGUSTUS exon 53681 53782 . + . ID=exon_5;Parent=Ino_00002.t1
Chr_1 AUGUSTUS exon 54661 54747 . + . ID=exon_6;Parent=Ino_00002.t1
Chr_1 AUGUSTUS exon 56130 56137 . + . ID=exon_7;Parent=Ino_00002.t1
Chr_1 AUGUSTUS transcript 60449 61874 . + . ID=Ino_00003.t1;gene_id=Ino_00003;eggnog_id="SH2_domain-containing_protein_5"
Chr_1 AUGUSTUS gene 60449 61874 . + . ID=Ino_00003;eggnog_id="SH2_domain-containing_protein_5"
Chr_1 AUGUSTUS exon 60449 60517 . + . ID=exon_8;Parent=Ino_00003.t1
Chr_1 AUGUSTUS exon 61663 61743 . + . ID=exon_9;Parent=Ino_00003.t1
Chr_1 AUGUSTUS exon 61824 61874 . + . ID=exon_10;Parent=Ino_00003.t1
Chr_1 AUGUSTUS transcript 63451 68293 . + . ID=Ino_00004.t1;gene_id=Ino_00004;blast_id="sp|Q6ZV89|SH2D5_HUMAN";interproscan_id="IPR036860";eggnog_id="SH2_domain_containing_5"
Chr_1 AUGUSTUS gene 63451 68293 . + . ID=Ino_00004;blast_id="sp|Q6ZV89|SH2D5_HUMAN";interproscan_id="IPR036860";eggnog_id="SH2_domain_containing_5"
Chr_1 AUGUSTUS exon 63451 63471 . + . ID=exon_11;Parent=Ino_00004.t1
Chr_1 AUGUSTUS exon 63696 63827 . + . ID=exon_12;Parent=Ino_00004.t1
Chr_1 AUGUSTUS exon 64493 64741 . + . ID=exon_13;Parent=Ino_00004.t1
Chr_1 AUGUSTUS exon 65417 65591 . + . ID=exon_14;Parent=Ino_00004.t1
Chr_1 AUGUSTUS exon 65876 65981 . + . ID=exon_15;Parent=Ino_00004.t1
Chr_1 AUGUSTUS exon 66781 66940 . + . ID=exon_16;Parent=Ino_00004.t1
Chr_1 AUGUSTUS exon 68072 68293 . + . ID=exon_17;Parent=Ino_00004.t1
Chr_1 AUGUSTUS transcript 70368 75638 . + . ID=Ino_00005.t1;gene_id=Ino_00005;blast_id="sp|P39656|OST48_HUMAN";interproscan_id="IPR005013";eggnog_id="protein_N-linked_glycosylation_via_asparagine"
Chr_1 AUGUSTUS gene 70368 75638 . + . ID=Ino_00005;blast_id="sp|P39656|OST48_HUMAN";interproscan_id="IPR005013";eggnog_id="protein_N-linked_glycosylation_via_asparagine"
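The excerpt above contains gene, transcript, and exon records but no CDS records, and the transcript lines carry `gene_id` rather than a `Parent` attribute. Whether either point explains the empty dictionaries is not confirmed on this page, but a quick structural summary of the file is an easy first diagnostic. The Python sketch below counts feature types and the features of each type that lack a `Parent` attribute; `summarize_gff3` is a hypothetical helper, and the sample lines are tab-separated copies of the first records shown.

```python
from collections import Counter


def summarize_gff3(lines):
    """Return (feature type counts, counts of features lacking Parent)."""
    type_counts = Counter()
    no_parent = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        feature_type = cols[2]
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(field.split("=", 1)
                     for field in cols[8].split(";") if "=" in field)
        type_counts[feature_type] += 1
        if "Parent" not in attrs:
            no_parent[feature_type] += 1
    return type_counts, no_parent


sample = [
    "##gff-version 3\n",
    "Chr_1\tAUGUSTUS\ttranscript\t1861\t8031\t.\t+\t.\tID=Ino_00001.t1;gene_id=Ino_00001;\n",
    "Chr_1\tAUGUSTUS\tgene\t1861\t8031\t.\t+\t.\tID=Ino_00001;\n",
    "Chr_1\tAUGUSTUS\texon\t1861\t1959\t.\t+\t.\tID=exon_1;Parent=Ino_00001.t1\n",
]
types, orphans = summarize_gff3(sample)
print("feature types:", dict(types))
print("without Parent:", dict(orphans))
```

Top-level features such as genes are expected to have no `Parent`; transcripts without one, or a file with zero CDS features, are the things worth a closer look.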

gffutils database build failed with UNIQUE constraint failed: features.id

Useful and exciting tool! But when I ran lifton with the command:

lifton MF2_mat.v1.0.fa ~/rawdata/GRCh38/ref/GRCh38.p14.new_name.fa -sc 0.95 -copies -g ~/rawdata/GRCh38/ref/GCF_000001405.40_GRCh38.p14_genomic.gff -polish -o CN1v1.0_mat.lifton.gff -c -cds -ad RefSeq -f type.list -exclude_partial -t 10 -D

I got this error:

**********************
** Running miniprot **
**********************
gffutils database build failed with UNIQUE constraint failed: features.id

while there are so many warnings and a ValueError:

$tail -50 CN1v1.0/Mat/06.lifton/GRCh38/lifton.sh.sbatch.e
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,481 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,481 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:45,000 - INFO - Populating features table and first-order relations: 3903620 features
2024-07-01 15:57:45,001 - INFO - Updating relations
2024-07-01 15:58:19,940 - INFO - Creating relations(parent) index
2024-07-01 15:58:23,229 - INFO - Creating relations(child) index
2024-07-01 15:58:27,449 - INFO - Creating features(featuretype) index
2024-07-01 15:58:30,206 - INFO - Creating features (seqid, start, end) index
2024-07-01 15:58:33,525 - INFO - Creating features (seqid, start, end, strand) index
2024-07-01 15:58:37,309 - INFO - Running ANALYZE features
>> Creating miniprot annotation database : ./lifton_output/miniprot/miniprot.gff3
2024-07-01 15:58:39,206 - INFO - Populating features
2024-07-01 16:00:27,613 - INFO - Populating features table and first-order relations: 1912405 features
2024-07-01 16:00:27,613 - INFO - Updating relations
2024-07-01 16:00:37,349 - INFO - Creating relations(parent) index
2024-07-01 16:00:37,940 - INFO - Creating relations(child) index
2024-07-01 16:00:38,686 - INFO - Creating features(featuretype) index
2024-07-01 16:00:39,703 - INFO - Creating features (seqid, start, end) index
2024-07-01 16:00:41,177 - INFO - Creating features (seqid, start, end, strand) index
2024-07-01 16:00:42,823 - INFO - Running ANALYZE features
Traceback (most recent call last):
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/bin/lifton", line 8, in <module>
    sys.exit(main())
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/lifton.py", line 290, in run_all_lifton_steps
    tree_dict = intervals.initialize_interval_tree(l_feature_db, features)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/intervals.py", line 12, in initialize_interval_tree
    tree_dict[chromosome].add(gene_interval)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/intervaltree/intervaltree.py", line 324, in add
    raise ValueError(
ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(45020029, 45020029, 'CDS_51812')

When I look at the gff file I provided, which was downloaded from NCBI (GRCh38 refseq), I found there are a few identical ids which may cause the error in miniprot (while liftoff created unique ids):

$rg -v '^#' ~/rawdata/GRCh38/ref/GCF_000001405.40_GRCh38.p14_genomic.gff | cut -f 9 | awk -F '[=|;]' '{print $2}' | sort | uniq -c | sort -nr | head
362 cds-NP_001254479.2
358 cds-XP_016860308.1
335 cds-XP_016860310.1
335 cds-XP_016860309.1
316 cds-XP_047301616.1
312 cds-XP_047301617.1
312 cds-NP_001243779.1
311 cds-NP_596869.4
309 cds-XP_024308863.1
299 cds-XP_047301619.1

so I think you'd better edit the performance of miniprot..
Best wishes!
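The shell pipeline above can also be written in Python, which makes it easier to break the counts down by feature type. This matters because GFF3 allows a discontinuous feature, such as a CDS split across several exons, to be written as multiple lines sharing one ID, so repeated CDS IDs in a RefSeq file are not necessarily an error in the file itself. `count_repeated_ids` is a hypothetical helper name, and the sample records are invented.

```python
from collections import Counter


def count_repeated_ids(lines):
    """Return {(feature_type, ID): count} for IDs that occur more than once."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                counts[(cols[2], field[3:])] += 1
                break
    return {key: n for key, n in counts.items() if n > 1}


# Made-up records: one CDS written as two segments, one mRNA written once.
sample = [
    "chr1\tRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna-1;Parent=gene-1\n",
    "chr1\tRefSeq\tCDS\t100\t300\t.\t+\t0\tID=cds-1;Parent=rna-1\n",
    "chr1\tRefSeq\tCDS\t600\t900\t.\t+\t0\tID=cds-1;Parent=rna-1\n",
]
print(count_repeated_ids(sample))
```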

Python errors when running lifton

Hello,

I'm having an issue running the LiftOn software. I'm running it in an HPC environment using 100 GB of memory on a compute node that has 2000 cores. The bash script below contains the command I'm using to run lifton. The target genome is the rheMac10 FASTA (rheMac10.fa), and I've also input the human genome hg38 FASTA (hg38.fa) and the human genome annotation in GTF format from NCBI (hg38.ncbiRefSeq.gtf). I want to output a LiftOn rheMac10 annotation (hg38_lifton_rhemac10.gff3).

#!/bin/bash

SECONDS=0

cd ~/macaque_snRNAseq

#make sure that conda env is sourced
. /home/genevieve.baddoo1-umw/miniconda3/etc/profile.d/conda.sh

#activate lifton_pip conda env 
conda activate lifton_pip

lifton -g liftoff/hg38.ncbiRefSeq.gtf -o lifton/hg38_lifton_rhemac10.gff3 -copies -infer-genes liftoff/rheMac10.fa liftoff/hg38.fa

duration=$SECONDS
echo "$(($duration / 3600)) hours and $((($duration % 3600) / 60)) minutes and $(($duration % 60)) seconds elapsed."

Here is the bsub command I'm using to run lifton

bsub -q long -R rusage[mem=25G] -R span[hosts=1] -W 96:00 -n 4 -o ~/macaque_snRNAseq/lifton/my_out.%J -e ~/macaque_snRNAseq/lifton/my_err.%J ~/macaque_snRNAseq/scripts/lifton.sh

I've installed lifton in a conda environment using pip. It does run for over an hour but then I get the following python errors

252893 of 414578 (61%)
257039 of 414578 (62%)
261185 of 414578 (63%)
265330 of 414578 (64%)
2024-07-31 17:48:31,795 - INFO - Committing changes
2024-07-31 17:48:32,102 - INFO - Creating relations(parent) index
2024-07-31 17:48:36,819 - INFO - Creating relations(child) index
2024-07-31 17:48:41,709 - INFO - Creating features(featuretype) index
2024-07-31 17:48:44,560 - INFO - Creating features (seqid, start, end) index
2024-07-31 17:48:48,643 - INFO - Creating features (seqid, start, end, strand) index
2024-07-31 17:48:53,080 - INFO - Running ANALYZE features
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/liftoff/align_features.py", line 61, in align_single_chroms
    minimap2_index = build_minimap2_index(target_file, args, threads_arg, minimap2_path)
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/liftoff/align_features.py", line 109, in build_minimap2_index
    subprocess.run(
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/subprocess.py", line 1720, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'minimap2'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/bin/lifton", line 8, in <module>
    sys.exit(main())
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/lifton.py", line 267, in run_all_lifton_steps
    liftoff_annotation = lifton_utils.exec_liftoff(lifton_outdir, args)
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/lifton_utils.py", line 113, in exec_liftoff
    liftoff_annotation = run_liftoff.run_liftoff(outdir, args)
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/run_liftoff.py", line 25, in run_liftoff
    liftoff_main.run_all_liftoff_steps(liftoff_args)
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/liftoff/liftoff_main.py", line 19, in run_all_liftoff_steps
    feature_db, feature_hierarchy, ref_parent_order = liftover_types.lift_original_annotation(ref_chroms, target_chroms,
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/liftoff/liftover_types.py", line 15, in lift_original_annotation
    align_and_lift_features(ref_chroms, target_chroms, args, feature_hierarchy, liftover_type, unmapped_features,
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/liftoff/liftover_types.py", line 23, in align_and_lift_features
    aligned_segments= align_features.align_features_to_target(ref_chroms, target_chroms, args,
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/site-packages/lifton/liftoff/align_features.py", line 24, in align_features_to_target
    for result in pool.imap_unordered(func, np.arange(0, len(target_chroms))):
  File "/home/genevieve.baddoo1-umw/miniconda3/envs/lifton_pip/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'minimap2'

Am I using enough memory and cores to run the software? Does the input genome annotation have to be in GFF3 format instead of GTF? Also, can lifton output an annotation in GTF format or does it only output in GFF3 format?
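
The traceback above ends in `FileNotFoundError: ... 'minimap2'`, which means the `minimap2` executable could not be found on `PATH` when it was launched through `subprocess`. A minimal pre-flight sketch, assuming the external tools are `minimap2` and `miniprot` (names taken from the traceback and the README; the list is an assumption, not LiftOn's own check):

```python
import shutil

# Tool names are assumptions based on the traceback and the README.
REQUIRED_TOOLS = ("minimap2", "miniprot")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```

Running this inside the same conda environment and batch job as lifton shows whether the tools are visible before the hour-long run starts.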

Develop a version suitable for bacteria?

Hey, I work on the analysis of bacterial genomes. Since bacteria do not have introns, bacterial genomes are simpler than eukaryotic ones. Can the software therefore be applied to bacteria? Or could a version suitable for bacteria be developed?

Weird formatting of column 9 in gff3 lift-over

Hi @Kuanhao-Chao ,
I was able to properly run Lifton using one plant reference genome and a new one to annotate from the same species.
The command was:

lifton -g ref.gff3 -o liftover.gff3 -P ref.pep.fasta -copies -sc 0.95 newgenome.fasta refgenome.fasta

The resulting lift-over file looks quite good for many gene models, but for some of them the exon IDs are duplicated. See below the example of a gene with 8 exons in the reference genome.

newgenomeLG00 LiftOn gene 279378 285818 . + . ID=newgenomeLG00g00180;Name=newgenomeLG00g00180;source=Liftoff
newgenomeLG00 LiftOn mRNA 279378 285818 . + . ID=newgenomeLG00g00180.1;Parent=newgenomeLG00g00180;Name=newgenomeLG00g00180.1;mutation=frameshift;protein_identity=0.999;dna_identity=1.000;status=LiftOn_chaining_anewgenomeLGorithm
newgenomeLG00 LiftOn exon 279378 280607 . + . ID=_newgenomeLG00_1g00140.1:exon:001;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn exon 281921 282006 . + . ID=_newgenomeLG00_1g00140.1:exon:001;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn exon 282142 282299 . + . ID=_newgenomeLG00_1g00140.1:exon:001;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn exon 282401 283715 . + . ID=_newgenomeLG00_1g00140.1:exon:001;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn exon 283912 284186 . + . ID=_newgenomeLG00_1g00140.1:exon:001;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn exon 284379 284585 . + . ID=_newgenomeLG00_1g00140.1:exon:001;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn exon 284668 284824 . + . ID=_newgenomeLG00_1g00140.1:exon:001;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn exon 285132 285818 . + . ID=_newgenomeLG00_1g00140.1:exon:008;Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 280585 280607 . + 0 Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 281921 282006 . + 1 Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 282142 282299 . + 2 Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 282401 283715 . + 0 Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 283912 284186 . + 2 Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 284379 284585 . + 0 Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 284668 284824 . + 0 Parent=newgenomeLG00g00180.1
newgenomeLG00 LiftOn CDS 285132 285430 . + 2 Parent=newgenomeLG00g00180.1

In addition, would it be possible to automatically add a unique ID to each CDS? This can be mandatory for downstream applications.
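
Until this is handled by the tool, one possible post-processing step is to renumber the child features. The sketch below assumes a tab-separated GFF3 and a hypothetical naming scheme `<Parent>.<type><n>`; neither the function name nor the scheme comes from LiftOn.

```python
from collections import defaultdict


def renumber_children(gff_lines, feature_types=("exon", "CDS")):
    """Give each exon/CDS row a unique ID of the form <Parent>.<type><n>."""
    counters = defaultdict(int)
    out = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        # Pass comments, malformed rows and other feature types through unchanged.
        if line.startswith("#") or len(cols) != 9 or cols[2] not in feature_types:
            out.append(line.rstrip("\n"))
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        parent = attrs.get("Parent", "unknown")
        counters[(parent, cols[2])] += 1
        new_id = f"{parent}.{cols[2]}{counters[(parent, cols[2])]}"
        # Emit the new ID first, then the other attributes in their original order.
        rest = [f"{key}={value}" for key, value in attrs.items() if key != "ID"]
        cols[8] = ";".join([f"ID={new_id}"] + rest)
        out.append("\t".join(cols))
    return out
```

Note that GFF3 allows the segments of one discontinuous feature to share an ID, so whether per-segment IDs are appropriate depends on the downstream tool.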

Thanks!

LiftOn silently failing due to ID feature

Hi @Kuanhao-Chao ,

While running LiftOn for some genomes we noticed an ID feature in some gff3 files which causes LiftOn to silently fail. It occurs when the ID field of the mRNA ends with an underscore and integer (e.g. ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1). When corrected to an underscore, a string and an integer LiftOn runs successfully (e.g. ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_X1). I've put a full example below:

Uncorrected:

##gff-version   3
GCA_013396205.1-JAAOAN010000001.1       Genbank gene    653     1126    .       -       .                     ID=GCA_013396205.1-rna-gnl-WGS:JAAOAN-mrna.FMUND_1.gene
GCA_013396205.1-JAAOAN010000001.1       Genbank mRNA    653     1126    .       -       .                     ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1;Parent=GCA_013396205.1-rna-gnl-WGS:JAAOAN-mrna.FMUND_1.gene;Name=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1;ori_geneid=gene-FMUND_1
GCA_013396205.1-JAAOAN010000001.1       Genbank CDS     653     1126    .       -       0                     ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1.CDS1;Parent=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1
GCA_013396205.1-JAAOAN010000001.1       Genbank exon    653     1126    .       -       .                     ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1.exon1;Parent=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1

Corrected:

##gff-version   3
GCA_013396205.1-JAAOAN010000001.1       Genbank gene    653     1126    .       -       .                     ID=GCA_013396205.1-rna-gnl-WGS:JAAOAN-mrna.FMUND_1.gene
GCA_013396205.1-JAAOAN010000001.1       Genbank mRNA    653     1126    .       -       .                     ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_X1;Parent=GCA_013396205.1-rna-gnl-WGS:JAAOAN-mrna.FMUND_1.gene;Name=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1;ori_geneid=gene-FMUND_1
GCA_013396205.1-JAAOAN010000001.1       Genbank CDS     653     1126    .       -       0                     ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1.CDS1;Parent=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_X1
GCA_013396205.1-JAAOAN010000001.1       Genbank exon    653     1126    .       -       .                     ID=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_1.exon1;Parent=GCA_013396205.1-transcript_rna-gnl-WGS:JAAOAN-mrna.FMUND_X1
###

When run uncorrected, LiftOn appears to complete, but the resulting gff3 file contains no "source=Liftoff" features, only miniprot ones. When corrected, it contains both:

Uncorrected

grep -c "source=Liftoff"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3
0
grep -c "source=miniprot"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3
199
grep -c "status=miniprot"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3
199
grep -c "status=Liftoff"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3
0

Corrected

grep -c "source=Liftoff"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3
13790
grep -c "source=miniprot"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3 
199
grep -c "status=miniprot"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3
199
grep -c "status=Liftoff"  GCA_002894225.1_GCA_013396205.1_genomic_lifton.gff3
742

I was able to trace the issue to step 7 of LiftOn.py but I wasn't able to isolate the specific place it fails. My guess is it's something to do with how gffutils processes features during the chaining stage but I could be mistaken. The log files for both are similar but the run that fails terminates early (I've attached them).

out_LiftOn.Uncorrected.log

out_LiftOn.Corrected.log

I'm more than happy to share the data and commands we used. Probably best for me to ping over a dropbox link, let me know if that would be useful for you.
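
To check other annotations for the same pattern before running LiftOn, a small scan along these lines may help. It is a sketch that assumes a tab-separated GFF3 and uses the pattern described above (an mRNA ID ending in an underscore followed by an integer); the function name is hypothetical.

```python
import re

# Pattern taken from the report: an ID that ends in "_" followed by digits.
TRAILING_INT = re.compile(r"_\d+$")


def suspicious_transcript_ids(gff_lines, feature_type="mRNA"):
    """Return IDs of `feature_type` rows that end in an underscore and integer."""
    hits = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        for field in cols[8].split(";"):
            if field.startswith("ID=") and TRAILING_INT.search(field[3:]):
                hits.append(field[3:])
    return hits
```

With the two examples above, the uncorrected ID (`...FMUND_1`) is reported and the corrected one (`...FMUND_X1`) is not.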

can not find the installed miniprot

Hi all,

I am a beginner in bioinformatics and I have run into a problem with the LiftOn command. It cannot find miniprot because I installed miniprot in a different folder, outside the conda environment (conda didn't work for installing miniprot in my case). How can I tell LiftOn where to find miniprot? The path to my miniprot is "/media/data_01/yucku/SOFTWARE/miniprot", and the miniprot command works when I give the whole path. All other required tools are in the conda environment. The error shows: "miniprot is not installed. Please install miniprot before running LiftOn."

Thank you!

Yu-Chia
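
A generic way to make an executable in a non-standard location visible is to put its directory on `PATH` before launching the tool. The sketch below assumes the quoted path is the directory that contains the `miniprot` executable; whether LiftOn also offers a dedicated option for this is not stated here.

```python
import os
import shutil


def prepend_to_path(directory, env=None):
    """Return a copy of the environment with `directory` first on PATH."""
    env = dict(os.environ if env is None else env)
    current = env.get("PATH", "")
    env["PATH"] = directory + (os.pathsep + current if current else "")
    return env


# Hypothetical usage: the directory below is the one quoted in the question and
# is assumed to contain the miniprot executable.
env = prepend_to_path("/media/data_01/yucku/SOFTWARE/miniprot")
print(shutil.which("miniprot", path=env["PATH"]))
```

The returned mapping can be passed as `env=` to `subprocess.run`; in a shell, exporting `PATH` with the same directory prepended has the same effect.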

race conditions

Hello all,

There seem to be quite a few race conditions if one runs lifton in parallel. A project I'm working on requires running lifton from several dozen source annotations to several hundred references, and so I use snakemake to parallelise runs across a cluster. However (at least) the following race conditions appear:

  • If the output files are something like output/$SOURCE/$TARGET_NAME.gff, there's a race condition as lifton writes to output/$SOURCE/lifton_output regardless of which genome is being annotated, which corrupts the intermediate files.
  • It seems that at certain stages the gffutils SQLite database is written to even if it already existed before the run (e.g. by ANALYZE). This causes race conditions and crashes, as normally only one process can write to an SQLite database at a time.

With liftoff, one could work around these same issues because liftoff accepted a temp/intermediate directory name (so you could use e.g. output/$SOURCE/$TARGET_NAME/ instead of output/$SOURCE/lifton_output, making each job's directory unique). Liftoff also did not modify the gff database if it already existed, so if you pre-computed all needed gff_dbs before running any liftoff, then you were guaranteed not to have race conditions on the sqlite db.

I'd encourage you to adopt these workarounds in lifton.

best,
Kevin
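
For the first race condition, a workaround on the caller's side is to give every job its own output directory. The sketch below assumes, as described above, that lifton creates `lifton_output` next to the output file; the function name and directory layout are hypothetical.

```python
from pathlib import Path


def job_output_path(root, source, target_name):
    """Place each (source, target) job's output in its own directory.

    A unique parent directory keeps the intermediate files of parallel
    jobs apart, assuming they are written next to the output file.
    """
    job_dir = Path(root) / source / target_name
    job_dir.mkdir(parents=True, exist_ok=True)
    return job_dir / f"{target_name}.gff"
```

The returned path can be used as the `-o` argument of each job. This does not address concurrent writes to a shared gffutils database.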

gffutils error at stage "Populating features"

Hi, thank you for the tool, I am very excited to try it and compare the results from liftoff.

However, I am getting a gffutils error at the stage after miniprot:

>> Creating liftoff annotation database : /path_to_dir/lifton_output/liftoff/liftoff.gff3_polished
2024-05-30 12:16:44,480 - INFO - Populating features
gffutils database build failed with UNIQUE constraint failed: features.id

My command was:

lifton \
-g $ref_gff \
-o $out_dir/$asm_name."$gff_name"_lifton.gff3 \
-u $out_dir/$asm_name."$gff_name"_lifton_unmapped.txt \
-chroms $in_dir/$ref_name.chroms.txt \
-copies -polish -cds -sc 0.96 -flank 0.1 \
-t $threads \
$asm \
$ref

The error suggests that some gff features don't have unique IDs. I checked the input gff and it does not contain any duplicated IDs. I also ran it through AGAT, which did not find any errors. Liftoff runs well on this gff. I suspect the problem is with the output features.
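
To test that suspicion, the intermediate file named in the log (`liftoff.gff3_polished`) can be scanned for repeated IDs. This is a minimal sketch assuming a tab-separated GFF3; the function name is hypothetical.

```python
from collections import Counter


def duplicated_ids(gff_lines):
    """Return {ID: count} for every ID attribute that occurs more than once."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                counts[field[3:]] += 1
    return {feature_id: n for feature_id, n in counts.items() if n > 1}
```

Any IDs it returns are candidates for the `UNIQUE constraint failed: features.id` error, although repeated IDs can also be legitimate for multi-segment features.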

KeyError: 'gene_biotype'

Hello,

Thanks for the tools. I am trying to do the alignment between two Fragaria species, and I got this error from the output:

>> Reading target genome ...
>> Reading reference genome ...

>> Creating reference annotation database :  Fragaria_vesca_v4.0.a2.genes.gff3
2024-05-07 18:15:24,810 - INFO - Populating features
2024-05-07 18:16:11,172 - INFO - Populating features table and first-order relations: 1051757 features
2024-05-07 18:16:11,172 - INFO - Updating relations
2024-05-07 18:16:20,583 - INFO - Creating relations(parent) index
2024-05-07 18:16:21,354 - INFO - Creating relations(child) index
2024-05-07 18:16:22,240 - INFO - Creating features(featuretype) index
2024-05-07 18:16:22,763 - INFO - Creating features (seqid, start, end) index
2024-05-07 18:16:23,468 - INFO - Creating features (seqid, start, end, strand) index
2024-05-07 18:16:24,193 - INFO - Running ANALYZE features
Traceback (most recent call last):
  File "/home/wlin2345/miniconda3/bin/lifton", line 33, in <module>
    sys.exit(load_entry_point('lifton==1.0.1', 'console_scripts', 'lifton')())
  File "/home/wlin2345/miniconda3/lib/python3.10/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/home/wlin2345/miniconda3/lib/python3.10/site-packages/lifton/lifton.py", line 212, in run_all_lifton_steps
    ref_features_dict, ref_features_len_dict, ref_features_reverse_dict, ref_trans_exon_num_dict = lifton_utils.get_ref_liffover_features(features, ref_db, intermediate_dir, args)
  File "/home/wlin2345/miniconda3/lib/python3.10/site-packages/lifton/lifton_utils.py", line 336, in get_ref_liffover_features
    if locus.attributes[gene_type_key][0] == "protein_coding" and len(CDS_children) > 0:
  File "/home/wlin2345/miniconda3/lib/python3.10/site-packages/gffutils/attributes.py", line 62, in __getitem__
    v = self._d[k]
KeyError: 'gene_biotype'

So I checked my gff3 file and compared it with the gff3 file provided in the test folder and found that it doesn't have the gene_biotype attribute. Now I'm not very sure if I can still use your tools to do my job. It would be really helpful to get your feedback on this!

Thanks,
Wanru
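
The traceback shows a direct lookup, `locus.attributes[gene_type_key][0]`, which raises `KeyError` when the annotation has no `gene_biotype` attribute. A defensive variant could look like the sketch below. It uses a plain dict in place of gffutils attributes, and the fallback (treat a locus with CDS children as protein-coding) is an assumption of this sketch, not LiftOn's documented behaviour.

```python
def is_protein_coding(attributes, gene_type_key="gene_biotype", has_cds=False):
    """Decide whether a locus is protein-coding without raising KeyError.

    `attributes` maps attribute names to lists of values, mirroring the
    `locus.attributes[...][0]` access in the traceback.
    """
    values = attributes.get(gene_type_key)
    if values:
        return values[0] == "protein_coding" and has_cds
    # Key absent: fall back to the presence of CDS children (assumption).
    return has_cds
```

An alternative on the user's side is to add a `gene_biotype` attribute to the gene rows of the input gff3 before running.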

parasail_memalign: posix_memalign failed: Cannot allocate memory

Hi,

I'm trying to liftover the CHESS_GRch38 annotation to CHM13 using this command

lifton -D -g /ccb/salz3/kh.chao/ref_genome/homo_sapiens/chess/chess3.0.1.gff_db -o /ccb/salz2/kh.chao/LiftOn/results/human_chess/lifton.gff3 --liftoff /ccb/salz2/kh.chao/LiftOn/results/human_chess/lifton_output/liftoff/liftoff.gff3_polished_db --miniprot /ccb/salz2/kh.chao/LiftOn/results/human_chess/lifton_output/miniprot/miniprot.gff3_db --proteins /ccb/salz2/kh.chao/LiftOn/results/human_chess/lifton_output/intermediate_files/proteins.fa --transcripts /ccb/salz2/kh.chao/LiftOn/results/human_chess/lifton_output/intermediate_files/transcripts.fa -polish -copies -sc 0.95 /ccb/salz3/kh.chao/ref_genome/homo_sapiens/T2T-CHM13/chm13v2.0.fa /ccb/salz3/kh.chao/ref_genome/homo_sapiens/chess/hg38_p12_ucsc.no_alts.no_fixs.fa -chroms /ccb/salz2/kh.chao/LiftOn/scripts/human_chroms_mapping.csv -f /ccb/salz2/kh.chao/LiftOn/scripts/features.txt

LiftOn was able to run for a bit and populate features, but I soon encountered this error:

 parasail_memalign: posix_memalign failed: Cannot allocate memory
parasail_result_is_saturated: missing result
parasail_result_is_trace: missing result
Traceback (most recent call last):
  File "/home/choh1/anaconda3/envs/intron/bin/lifton", line 33, in <module>
    sys.exit(load_entry_point('lifton==1.0.3', 'console_scripts', 'lifton')())
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/lifton/lifton.py", line 300, in run_all_lifton_steps
    lifton_gene = run_liftoff.process_liftoff(None, locus, ref_db.db_connection, l_feature_db, ref_id_2_m_id_trans_dict, m_feature_db, tree_dict, tgt_fai, ref_proteins, ref_trans, ref_features_dict, fw_score, fw_chain, args, ENTRY_FEATURE=True)
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/lifton/run_liftoff.py", line 176, in process_liftoff
    lifton_trans_aln, lifton_aa_aln = lifton_gene.orf_search_protein(lifton_trans.entry.id, ref_trans_id, tgt_fai, ref_proteins, ref_trans, lifton_status)
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/lifton/lifton_class.py", line 145, in orf_search_protein
    lifton_aln, good_trans = self.transcripts[trans_id].orf_search_protein(fai, ref_protein_seq, ref_trans_seq, lifton_status, is_non_coding=self.is_non_coding, eval_only=eval_only)
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/lifton/lifton_class.py", line 531, in orf_search_protein
    lifton_tran_aln = self.align_trans_seq(trans_seq, ref_trans_seq, lifton_status)
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/lifton/lifton_class.py", line 522, in align_trans_seq
    lifton_tran_aln = align.trans_align(trans_seq, ref_trans_seq)
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/lifton/align.py", line 122, in trans_align
    alignment_query = extracted_parasail_res.traceback.query
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/parasail/bindings_v2.py", line 422, in traceback
    return self.get_traceback()
  File "/home/choh1/anaconda3/envs/intron/lib/python3.9/site-packages/parasail/bindings_v2.py", line 406, in get_traceback
    raise AttributeError("'Result' object has no traceback")
AttributeError: 'Result' object has no traceback
parasail_result_free: attempted free of NULL result pointer

Any suggestions would be greatly appreciated, thank you!
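
One possible cause, not confirmed here, is a very long transcript pair: a pairwise alignment that keeps a full traceback needs a table that grows with the product of the two sequence lengths. The sketch below estimates that size so unusually large pairs can be spotted; `bytes_per_cell` is an assumption and the real value depends on the aligner.

```python
def traceback_table_bytes(query_len, ref_len, bytes_per_cell=1):
    """Rough size of a full traceback table for one pairwise alignment."""
    return query_len * ref_len * bytes_per_cell


def oversized_pairs(lengths, limit_bytes, bytes_per_cell=1):
    """Return IDs whose estimated table exceeds `limit_bytes`.

    `lengths` maps a transcript ID to (lifted_length, reference_length).
    """
    return [
        transcript_id
        for transcript_id, (query_len, ref_len) in lengths.items()
        if traceback_table_bytes(query_len, ref_len, bytes_per_cell) > limit_bytes
    ]
```

For example, two 100,000-base sequences already imply 10^10 cells, so a handful of long transcripts could exhaust memory even when most alignments are small.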

Inquiry on Resolving Long Chromosome Sequences in Analysis

Dear Kuanhao,

I hope this message finds you well. I am currently working on a genetic analysis project and have encountered a challenge with two of my genomes, each possessing five chromosome sequences that exceed 2Gb in length. This has presented some difficulties in terms of data processing and analysis.

I would greatly appreciate any advice or suggestions you may have on effectively addressing this issue. Specifically, I am interested in understanding the best practices or methodologies that could facilitate the handling and analysis of such large genomic sequences.

Thank you for your time and consideration. I look forward to your guidance.

lifton v1.0.4
command:lifton -t 16 -g A.gff -o test.gff3 -copies test.fa A.fa

error message:[WARNING] failed to parse the first FASTA/FASTQ record. Continue anyway.
minimap2: bseq.c:95: mm_bseq_read3: Assertion `ks->seq.l <= (2147483647)' failed.
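
The assertion above compares a sequence length against 2147483647, so any single record longer than that triggers it. A short sketch for listing such records in a FASTA file, assuming plain (uncompressed) text input; the function names are hypothetical.

```python
INT32_MAX = 2_147_483_647  # limit named in the minimap2 assertion above


def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            header = line[1:].split()
            name, length = (header[0] if header else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def too_long(lines, limit=INT32_MAX):
    """Return names of records longer than `limit` bases."""
    return [name for name, length in fasta_lengths(lines) if length > limit]
```

Called as `too_long(open("test.fa"))`, it reads the file line by line, so memory use stays small even for multi-gigabase chromosomes.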
