agshumate / liftoff


An accurate GFF3/GTF lift over pipeline

License: GNU General Public License v3.0

Python 100.00%

liftoff's Issues

How fast is this algorithm

Thank you for making this tool!
I'm lifting annotations of a fish genome assembly from RefSeq to a new assembly I recently generated for the same species. The program finished the "aligning features" step in about half an hour. It has now been almost 12 hours, and it is still working on the "lifting features" step. Is this normal, or has the program run into an infinite loop? How fast is this algorithm?
Regards,
Guangtu

Support for GenBank annotation file format

Thanks for the neat tool!

Currently, you support GFF and GTF. Would it also be possible to support the GenBank file format?

installation problem

Hi,
I am trying to install Liftoff, but 'pip install Liftoff' fails with the error report below. Could you kindly point me to how to cope with this problem?
Many thanks,
Wengang
.....................
ERROR: multiqc 1.8.dev0 has requirement matplotlib<3.1.0,>=2.1.1, but you'll have matplotlib 3.1.2 which is incompatible.
Installing collected packages: numpy, biopython, pyfaidx, argh, argcomplete, gffutils, networkx, pysam, interlap, Liftoff
Running setup.py install for numpy ... error
ERROR: Command errored out with exit status 1:
command: /exports/applications/apps/SL7/anaconda/5.0.1/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-t6vivbcn/numpy/setup.py'"'"'; file='"'"'/tmp/pip-install-t6vivbcn/numpy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-dvqzpuz1/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/s1874451/.local/include/python3.6m/numpy --prefix=/home/s1874451/schoenebeck_group/WENGANG/python_lib
cwd: /tmp/pip-install-t6vivbcn/numpy/
Complete output (34 lines):
Running from numpy source directory.

Note: if you need reliable uninstall behavior, then install
with pip instead of using `setup.py install`:

  - `pip install .`       (from a git repo or downloaded source
                           release)
  - `pip install numpy`   (last NumPy release on PyPi)

..................

Are the ORFs conserved?

It is not clear from the bioRxiv paper whether ORFs are conserved. Could you tell me? That is, when lifting protein-coding genes, does the tool check that the lifted CDS still forms an intact ORF?
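For anyone who wants to run this check on their own output, here is a minimal sketch of what "intact ORF" could mean in practice. `is_intact_orf` is a hypothetical helper, not part of Liftoff. It assumes the spliced CDS has already been extracted in transcript orientation and that the standard genetic code applies.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def is_intact_orf(cds: str) -> bool:
    """Return True if a spliced CDS still looks like a complete ORF.

    Hypothetical helper for illustration; not Liftoff's own check.
    """
    seq = cds.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False  # too short, or length is not a multiple of three
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG":
        return False  # missing start codon
    if codons[-1] not in STOP_CODONS:
        return False  # missing terminal stop codon
    # No premature stop codon inside the reading frame.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


print(is_intact_orf("ATGGCCTAA"))     # start, one codon, stop
print(is_intact_orf("ATGTAAGCCTAA"))  # stop codon in the middle
```

Genes with alternative start codons or a non-standard code would need a looser version of this test.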

precompute database?

Hi,

Is it possible to precompute the Liftoff database from a GFF without running the whole pipeline? The use case is that I have one reference and GFF, and I want to lift those annotations onto many different assemblies.

Thanks!
Mitchell

Duplicate entry "chr1D" in sam header

My command:
liftoff -g /data2/Fshare/FastaAndIndex/iwgsc_refseqv1.1_genes_2017July06/IWGSC_v1.1_HCandLC_20170706.gff3 -o IWGSC_v1.1_HCandLC_jagger.gff3 -exclude_partial -s 0.9 -flank 0.1 -d 3 -p 8 -chroms chrom.txt -copies -sc 0.9 ../Jagger.genome /data2/Fshare/FastaAndIndex/IWGSC_v1.0_bwa/161010_Chinese_Spring_v1.0_pseudomolecules.fasta

The error message

[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/chr7D_to_chr7D.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr7D_to_chr7D_split -t 1 intermediate_files/chr7D.fa intermediate_files/chr7D_genes.fa
[M::main] Real time: 111.286 sec; CPU: 111.360 sec; Peak RSS: 5.022 GB
[M::worker_pipeline::131.468*1.00] mapped 13386 sequences
[M::worker_pipeline::132.023*1.00] mapped 13386 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/chr7B_to_chr7B.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr7B_to_chr7B_split -t 1 intermediate_files/chr7B.fa intermediate_files/chr7B_genes.fa
[M::main] Real time: 132.035 sec; CPU: 132.116 sec; Peak RSS: 5.743 GB
[M::worker_pipeline::196.167*1.00] mapped 12996 sequences
[M::worker_pipeline::196.850*1.00] mapped 12996 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/chr6B_to_chr6B.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr6B_to_chr6B_split -t 1 intermediate_files/chr6B.fa intermediate_files/chr6B_genes.fa
[M::main] Real time: 196.868 sec; CPU: 196.949 sec; Peak RSS: 5.548 GB
[W::sam_hdr_create] Duplicated sequence 'chr1D'
[E::sam_hrecs_update_hashes] Duplicate entry "chr1D" in sam header
[E::sam_parse1] failed to parse header
[W::sam_read1] Parse error at line 4
Traceback (most recent call last):
  File "/usr/bin/liftoff", line 11, in <module>
    load_entry_point('Liftoff==1.5.0', 'console_scripts', 'liftoff')()
  File "/usr/lib/python3.6/site-packages/Liftoff-1.5.0-py3.6.egg/liftoff/run_liftoff.py", line 19, in main
  File "/usr/lib/python3.6/site-packages/Liftoff-1.5.0-py3.6.egg/liftoff/liftover_types.py", line 17, in lift_original_annotation
  File "/usr/lib/python3.6/site-packages/Liftoff-1.5.0-py3.6.egg/liftoff/liftover_types.py", line 26, in align_and_lift_features
  File "/usr/lib/python3.6/site-packages/Liftoff-1.5.0-py3.6.egg/liftoff/align_features.py", line 25, in align_features_to_target
  File "/usr/lib/python3.6/site-packages/Liftoff-1.5.0-py3.6.egg/liftoff/align_features.py", line 115, in parse_all_sam_files
  File "/usr/lib/python3.6/site-packages/Liftoff-1.5.0-py3.6.egg/liftoff/align_features.py", line 127, in parse_alignment
  File "pysam/libcalignmentfile.pyx", line 2187, in pysam.libcalignmentfile.IteratorRowAll.__next__
OSError: truncated file
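The `Duplicated sequence 'chr1D'` warning suggests that one of the intermediate SAM files lists chr1D twice in its header. Below is a small sketch for finding which file is affected without going through pysam, which fails on such a header. `duplicated_sq_names` is a hypothetical helper. It only reads `@SQ` header lines and makes no claim about why the duplicate was written.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_sq_names(sam_lines: Iterable[str]) -> List[str]:
    """Return sequence names that appear in more than one @SQ header line."""
    counts = Counter()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                counts[field[3:]] += 1
    return [name for name, n in counts.items() if n > 1]


example = [
    "@HD\tVN:1.6\n",
    "@SQ\tSN:chr1D\tLN:1000\n",
    "@SQ\tSN:chr1D\tLN:1000\n",
    "@PG\tID:minimap2\n",
]
print(duplicated_sq_names(example))
```

To check a real file, pass an open file handle, for example `duplicated_sq_names(open("intermediate_files/chr1D_to_chr1D.sam"))`.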

However, when I run this command alone, it works.

[masw@genomics2 raw]$ minimap2 -o intermediate_files/chr1D_to_chr1D.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr6B_to_chr6B_split -t 1 intermediate_files/chr1D.fa intermediate_files/chr1D_genes.fa
[M::mm_idx_gen::16.541*1.00] collected minimizers
[M::mm_idx_gen::26.426*1.00] sorted minimizers
[M::main::26.426*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::26.948*1.00] mid_occ = 947
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::27.181*1.00] distinct minimizers: 24238624 (69.11% are singletons); average occurrences: 3.743; average spacing: 5.439
[M::worker_pipeline::81.849*1.00] mapped 10465 sequences
[M::worker_pipeline::82.343*1.00] mapped 10465 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/chr1D_to_chr1D.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr6B_to_chr6B_split -t 1 intermediate_files/chr1D.fa intermediate_files/chr1D_genes.fa
[M::main] Real time: 82.352 sec; CPU: 82.321 sec; Peak RSS: 3.892 GB

I have tested it on other wheat genomes, but the same error was found.
liftoff -g Jagger.release.gff -o IWGSC_v1.1_HCandLC_jagger.gff3 -exclude_partial -s 0.9 -flank 0.1 -d 3 -p 8 -chroms chrom.txt -copies -sc 0.9 ../SY_Mattis.genome ../Jagger.genome

Actually, I found that chr1D had already finished earlier.

[M::mm_idx_gen::29.057*1.00] sorted minimizers
[M::main::29.057*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_idx_gen::29.258*1.00] collected minimizers
[M::mm_mapopt_update::29.548*1.00] mid_occ = 942
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::29.849*1.00] distinct minimizers: 24057549 (69.16% are singletons); average occurrences: 3.755; average spacing: 5.383
[M::mm_idx_gen::32.154*1.00] collected minimizers
[M::mm_idx_gen::32.825*1.00] sorted minimizers
[M::main::32.825*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::33.827*1.00] mid_occ = 1114
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::34.383*1.00] distinct minimizers: 27324975 (66.54% are singletons); average occurrences: 4.091; average spacing: 5.373
[M::mm_idx_gen::36.071*1.00] sorted minimizers
[M::main::36.071*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::37.054*1.00] mid_occ = 1009
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::37.666*1.00] distinct minimizers: 31975961 (64.91% are singletons); average occurrences: 3.981; average spacing: 5.382
[M::mm_idx_gen::38.344*1.00] sorted minimizers
[M::main::38.344*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::39.337*1.00] mid_occ = 1049
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::39.785*1.00] distinct minimizers: 30322496 (66.56% are singletons); average occurrences: 4.015; average spacing: 5.383
[M::mm_idx_gen::41.116*1.00] sorted minimizers
[M::main::41.116*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_idx_gen::41.554*1.00] sorted minimizers
[M::main::41.554*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::42.061*1.00] mid_occ = 1049
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_mapopt_update::42.581*1.00] mid_occ = 1255
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::42.599*1.00] distinct minimizers: 36447326 (63.76% are singletons); average occurrences: 4.080; average spacing: 5.379
[M::mm_idx_stat::43.116*1.00] distinct minimizers: 33718823 (64.03% are singletons); average occurrences: 4.383; average spacing: 5.374
[M::mm_idx_gen::43.810*1.00] sorted minimizers
[M::main::43.811*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::44.841*1.00] mid_occ = 1227
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::45.328*1.00] distinct minimizers: 32140942 (64.74% are singletons); average occurrences: 4.342; average spacing: 5.372
[M::mm_idx_gen::47.607*1.00] sorted minimizers
[M::main::47.607*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::48.735*1.00] mid_occ = 1068
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::49.552*1.00] distinct minimizers: 37315242 (63.01% are singletons); average occurrences: 4.169; average spacing: 5.387
[M::worker_pipeline::54.399*1.00] mapped 4816 sequences
[M::worker_pipeline::54.671*1.00] mapped 4816 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/chr1D_to_chr1D.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr1D_to_chr1D_split -t 1 intermediate_files/chr1D.fa intermediate_files/chr1D_genes.fa
[M::main] Real time: 54.673 sec; CPU: 54.695 sec; Peak RSS: 3.877 GB
[M::worker_pipeline::63.552*1.00] mapped 4646 sequences
[M::worker_pipeline::63.914*1.00] mapped 4646 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/chr1A_to_chr1A.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr1A_to_chr1A_split -t 1 intermediate_files/chr1A.fa intermediate_files/chr1A_genes.fa
[M::main] Real time: 63.916 sec; CPU: 63.943 sec; Peak RSS: 4.810 GB
[M::worker_pipeline::65.541*1.00] mapped 5058 sequences
[M::worker_pipeline::65.890*1.00] mapped 5058 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/chr1B_to_chr1B.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 --split-prefix intermediate_files/chr1B_to_chr1B_split -t 1 intermediate_files/chr1B.fa intermediate_files/chr1B_genes.fa
[M::main] Real time: 65.893 sec; CPU: 65.915 sec; Peak RSS: 5.343 GB
[M::worker_pipeline::68.012*1.00] mapped 6385 sequences
[M::worker_pipeline::68.420*1.00] mapped 6385 sequences

conda installation

Hi,

I am trying to install via conda but am encountering the following error:

$ conda create -n liftoff -c bioconda liftoff
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: | 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                          

UnsatisfiableError: The following specifications were found to be incompatible with each other:



Package pyfaidx conflicts for:
liftoff -> pyfaidx[version='>=0.5.8']
Package python conflicts for:
liftoff -> python[version='>=3']
Package ujson conflicts for:
liftoff -> ujson
Package biopython conflicts for:
liftoff -> biopython[version='>=1.76']
Package pysam conflicts for:
liftoff -> pysam[version='>=0.16.0.1']
Package gffutils conflicts for:
liftoff -> gffutils[version='>=0.10.1']
Package interlap conflicts for:
liftoff -> interlap[version='>=0.2.6']
Package networkx conflicts for:
liftoff -> networkx[version='>=2.4']
Package minimap2 conflicts for:
liftoff -> minimap2
Package numpy conflicts for:
liftoff -> numpy[version='>=1.19.0']

The error is the same on CentOS and macOS.

This is the same error reported in #41 - I will try other installation options, but the conda recipe still appears to be broken.

Thanks!

logo

missing genes

Hi Alaina,
There are some genes in the reference GFF that I cannot find in the final GFF file or in the unmapped_features.txt file. What is the reason for this? If you still have the GFF file I sent you before, one such case is gene3.
Thanks,
Guangtu

[Errno 2] could not open alignment file `reference_all_to_target_all.sam`: No such file or directory

Hi, I just tried using genelift to lift over annotations between assemblies for two closely related species. However, the pipeline crashes at a fairly early stage with the following error:

2020-06-27 13:41:27,516 - INFO - Populating features
2020-06-27 13:45:44,661 - INFO - Populating features table and first-order relations: 1327486 features
2020-06-27 13:45:44,661 - INFO - Updating relations
2020-06-27 13:46:04,737 - INFO - Creating relations(parent) index
2020-06-27 13:46:06,371 - INFO - Creating relations(child) index
2020-06-27 13:46:08,453 - INFO - Creating features(featuretype) index
2020-06-27 13:46:09,563 - INFO - Creating features (seqid, start, end) index
2020-06-27 13:46:11,139 - INFO - Creating features (seqid, start, end, strand) index
2020-06-27 13:46:12,867 - INFO - Running ANALYZE features
[E::hts_open_format] Failed to open file "genelift_test1_intermediates/reference_all_to_target_all.sam" : No such file or directory
Traceback (most recent call last):
File "liftoff/liftoff.py", line 121, in <module>
main()
File "liftoff/liftoff.py", line 74, in main
unmapped_features, infer_transcripts, infer_genes, cov_threshold, seq_threshold, minimap2_path, inter_files)
File "/path/to/working/directory/liftoff/liftoff/liftover_types.py", line 17, in lift_original_annotation
unmapped_features, reference_fasta, minimap2_path, inter_files, True)
File "/path/to/working/directory/liftoff/liftoff/align_features.py", line 29, in align_features_to_target
aligned_segments = parse_alignment(file, parent_dict, children_dict, unmapped_features, search_type)
File "/path/to/working/directory/liftoff/liftoff/align_features.py", line 91, in parse_alignment
sam_file = pysam.AlignmentFile(file,'r',check_sq=False, check_header=False)
File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 941, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file genelift_test1_intermediates/reference_all_to_target_all.sam: No such file or directory

minimap2 and pysam are both installed and on my path in the conda environment I created. reference_all_genes.fa was produced and is a correctly formatted multi-FASTA of genes; however, it appears that the SAM file was never created. Do you have any idea what may be wrong here? Thanks!

--version switch?

% liftoff.py --version
liftoff 1.1.0

It should print this to stdout and return exit code 0.
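A minimal sketch of how such a switch is usually wired up with Python's standard `argparse` module. The `build_parser` function and the `-V` short form are assumptions for illustration; this is not Liftoff's actual argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only a version switch (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="liftoff")
    # The built-in "version" action prints the version string to stdout
    # and then exits with status 0.
    parser.add_argument(
        "-V", "--version", action="version", version="%(prog)s 1.1.0"
    )
    return parser
```

Calling `build_parser().parse_args(["--version"])` prints the program name and version, then raises `SystemExit` with code 0, which is the behaviour requested above.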

troubleshooting an empty output

Good afternoon,
I'd like to be able to lift over the annotations from an older assembly (of a mammalian genome) to my new assembly, and I was hoping the software you've developed would do the trick. While I managed to run the program without error, the resulting .gff file for the target is empty, and the unmapped.txt file contains every gene found in the original assembly. Given that these assemblies came from the same species, I wasn't sure what the first steps in troubleshooting might be.
Thanks for any advice you can offer

How to relax filter for maximum distance between two nodes

Thanks for creating this very useful tool!

I'm working with a group of species whose genomes vary considerably in size. I am trying to use Liftoff, and the reference genome from which I'm lifting the gene annotations is the smallest one. I've noticed that for many genes some CDS annotations are missing, but when I run minimap2 for these genes all CDS features map. The reason they're not included in the lifted annotation is that the distance between two CDS annotations is far greater than in the reference (e.g. 5 kb vs. 150 bp).

I was thus wondering whether there is a way to relax the fourth filter applied when connecting two nodes: "4) The distance from the start of u to the end of v in the target genome is no greater than 2 times that in the reference genome". I can't find any option to do so.

Thank you!
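To make the quoted rule concrete, here is a minimal sketch of the check with the factor of 2 pulled out as a parameter. The `Block` type, `within_distance` and `max_ratio` are hypothetical names for illustration, not Liftoff's internals. A command line elsewhere on this page passes `-d 3`, which may be the corresponding option in later releases, but that is an assumption to confirm with `liftoff -h`.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One aligned block with coordinates in both genomes (hypothetical type)."""
    ref_start: int
    ref_end: int
    target_start: int
    target_end: int


def within_distance(u: Block, v: Block, max_ratio: float = 2.0) -> bool:
    """Check that the target span from start of u to end of v is at most
    max_ratio times the same span in the reference."""
    ref_span = v.ref_end - u.ref_start
    target_span = v.target_end - u.target_start
    return target_span <= max_ratio * ref_span


# Two CDS blocks 150 bp apart in the reference but about 5 kb apart in the target.
u = Block(ref_start=100, ref_end=200, target_start=100, target_end=200)
v = Block(ref_start=350, ref_end=450, target_start=5200, target_end=5300)

print(within_distance(u, v))                # rejected with the default factor
print(within_distance(u, v, max_ratio=20))  # accepted with a relaxed factor
```

With these numbers the reference span is 350 bp and the target span is 5200 bp, so the pair only connects once the factor is raised well above 2.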

What happens if there are multiple copies of a gene in reference gff

Hi,

Thanks for making this quick tool, it's been working well for me.
I have lifted over the annotation of one plant species to another closely related one, and I've found that about 60% of the multi-copy genes in the new annotation have 15-30 additional copies.
This is quite unexpected, as you would expect a decrease in the number of genes with higher copy number, and I wanted to make sure I didn't do something wrong. The GFF annotations I ran Liftoff with may contain multiple copies of the same gene with unique IDs. How will Liftoff behave in such a case? Should I remove multiple copies of the same gene from the annotation file to get an accurate estimate of additional copies in the target genome?

Thanks :)

Running time

Hi,

I am using Liftoff for a genome with ~50Mb in size and ~12,000 annotated genes. The GFF file includes 132471 features for the following: CDS, RNase_MRP_RNA, RNase_P_RNA, SRP_RNA, exon, five_prime_UTR, gene, mRNA, ncRNA, ncRNA_gene, pseudogene, pseudogenic_transcript, rRNA, snRNA, snoRNA, tRNA and three_prime_UTR.

It just finished running with these parameters: -p 20 -copies -a 0.4 -s 0.4 -sc 0.5. The intermediate sam file is 183Mb in size. The whole process seemed to take about 15 hours. Is this running time something you would expect? How long did it take for you to run this on the chimpanzee genome and the wheat genome?

pip install Liftoff

Can you please add to pypi to make it possible to build conda and brew packages?

use >= rather than == in setup.py file

Unless a dependency changes significantly between versions, it is better to use >= than ==; otherwise all dependencies will be reinstalled.
install_requires=['numpy==1.19.0', 'biopython==1.76', 'gffutils==0.10.1', 'networkx==2.4', 'pysam==0.16.0.1','pyfaidx==0.5.8','interlap==0.2.6', "ujson==3.2.0"]
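As a sketch of the suggestion, each exact pin in the list above can be turned into a lower bound. Treating the pinned versions as the minimum supported versions is an assumption here; the maintainers may know of incompatibilities that require tighter bounds.

```python
pinned = [
    "numpy==1.19.0", "biopython==1.76", "gffutils==0.10.1", "networkx==2.4",
    "pysam==0.16.0.1", "pyfaidx==0.5.8", "interlap==0.2.6", "ujson==3.2.0",
]

# Turn each exact pin into a lower bound so newer installed versions are accepted.
install_requires = [req.replace("==", ">=") for req in pinned]
print(install_requires)
```

The resulting list can be passed as `install_requires` to `setup()` in place of the pinned one.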

-f flag broken in v1.5.0?

I just upgraded to v1.5.0 but when I run the following command:

liftoff -g hg38.refGene.short.gtf.gz -f feature_types.txt chm13.draft_v1.0.fasta.bgz hg38.fa.bgz

I get the error "GFF does not contain any gene features. Use -f to provide a list of other feature types to lift over."

When I downgrade back to v1.4.2, it starts running as usual (albeit with the running time issue). Did anything change related to that flag?

interlap error when using -copies option

I get an error when using the -copies option (when I leave that option off, it seems to run fine).
Here is my command line log:

$ liftoff -g ref.gff -o new.gff -copies new.fasta ref.fasta 
extracting features
2020-12-26 12:05:32,345 - INFO - Populating features
2020-12-26 12:05:35,173 - INFO - Populating features table and first-order relations: 109592 features
2020-12-26 12:05:35,173 - INFO - Updating relations
2020-12-26 12:05:35,536 - INFO - Creating relations(parent) index
2020-12-26 12:05:35,537 - INFO - Creating relations(child) index
2020-12-26 12:05:35,537 - INFO - Creating features(featuretype) index
2020-12-26 12:05:35,572 - INFO - Creating features (seqid, start, end) index
2020-12-26 12:05:35,606 - INFO - Creating features (seqid, start, end, strand) index
2020-12-26 12:05:35,643 - INFO - Running ANALYZE features
aligning features
[M::main::0.387*1.00] loaded/built the index for 13 target sequence(s)
[M::mm_mapopt_update::0.497*1.00] mid_occ = 9
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 13
[M::mm_idx_stat::0.577*1.00] distinct minimizers: 7514366 (94.48% are singletons); average occurrences: 1.068; average spacing: 5.351
[M::worker_pipeline::4.959*1.00] mapped 11224 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/reference_all_to_target_all.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 -t 1 new.fasta.mmi intermediate_files/reference_all_genes.fa
[M::main] Real time: 4.975 sec; CPU: 4.968 sec; Peak RSS: 0.313 GB
lifting features
mapping gene copies
aligning features
[M::main::0.380*1.00] loaded/built the index for 13 target sequence(s)
[M::mm_mapopt_update::0.491*1.00] mid_occ = 9
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 13
[M::mm_idx_stat::0.573*1.00] distinct minimizers: 7514366 (94.48% are singletons); average occurrences: 1.068; average spacing: 5.351
[M::worker_pipeline::4.856*1.00] mapped 11224 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -o intermediate_files/reference_all_to_target_all.sam -a --end-bonus 5 --eqx -N 50 -p 0.5 -t 1 new.fasta.mmi intermediate_files/reference_all_genes.fa
[M::main] Real time: 4.870 sec; CPU: 4.869 sec; Peak RSS: 0.313 GB
lifting features
Traceback (most recent call last):
  File "/opt/ccc/packages/miniconda3/envs/liftoff/bin/liftoff", line 10, in <module>
    sys.exit(main())
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/liftoff/run_liftoff.py", line 25, in main
    map_extra_copies(args, lifted_feature_list, feature_hierarchy, feature_db, ref_parent_order)
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/liftoff/run_liftoff.py", line 202, in map_extra_copies
    liftover_types.map_extra_copies(ref_chroms, target_chroms, lifted_feature_list, feature_hierarchy, feature_db,                                                                                  
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/liftoff/liftover_types.py", line 95, in map_extra_copies                                                              
    align_and_lift_features(ref_chroms, target_chroms, args, feature_hierarchy, liftover_type, unmapped_features,                                                                                   
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/liftoff/liftover_types.py", line 33, in align_and_lift_features                                                       
    fix_overlapping_features.fix_incorrectly_overlapping_features(lifted_features_list, lifted_features_list,                                                                                       
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/liftoff/fix_overlapping_features.py", line 11, in fix_incorrectly_overlapping_features                                
    features_to_remap, feature_locations = check_homologues(all_lifted_features, features_to_check,                                                                                                 
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/liftoff/fix_overlapping_features.py", line 24, in check_homologues                                                    
    feature_locations = build_interval_list(all_feature_list)                                                                                                                                       
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/liftoff/fix_overlapping_features.py", line 42, in build_interval_list                                                 
    inter.update(feature_coords)
  File "/opt/ccc/packages/miniconda3/envs/liftoff/lib/python3.8/site-packages/interlap.py", line 142, in add
    iset.sort()
TypeError: '<' not supported between instances of 'new_feature' and 'new_feature'
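The last frame shows a plain list sort failing inside interlap. One plausible trigger, stated here as an assumption rather than a diagnosis, is two intervals with identical start and end coordinates: Python then compares the next element of each entry, which is a feature object with no ordering defined. The sketch below reproduces that class of error with a stand-in `NewFeature` class and shows a sort that never compares the objects.

```python
class NewFeature:
    """Stand-in for the 'new_feature' class in the traceback (hypothetical)."""

    def __init__(self, name: str):
        self.name = name


a, b = NewFeature("gene1"), NewFeature("gene1_copy")

# Two intervals with identical coordinates: tuple comparison falls through
# to the third element, and plain objects do not define '<'.
intervals = [(100, 200, a), (100, 200, b)]
try:
    intervals.sort()
    failed = False
except TypeError as err:
    failed = True
    print("sort failed:", err)

# Workaround sketch: sort on coordinates only, so objects are never compared.
intervals.sort(key=lambda iv: (iv[0], iv[1]))
print([iv[2].name for iv in intervals])
```

Because the failing sort lives inside the library, a real fix would have to be made in Liftoff or interlap, for example by giving the feature class an ordering or by storing a comparable identifier instead of the object.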

coverage calculation

Hello,

In the reference genome, this gene spans 423 nucleotides, as shown below, and contains a single exon.

1 Broad gene 2805841 2806263 . + . ID=gene:MGG_16169;biotype=protein_coding;description=Putative uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:G4MM22];gene_id=MGG_16169;logic_name=broad_genes

This gene was transferred into the target genome, and this is the resulting gff output:

JH793657.1 Liftoff gene 13566 13568 . + . ID=gene:MGG_16169;biotype=protein_coding;description=Putative uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:G4MM22];gene_id=MGG_16169;logic_name=broad_genes;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene:MGG_16169_0

I noticed that this gene spans only 3 nucleotides (13566-13568) yet is annotated with 100% coverage and sequence identity. Is this output correct?
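The lengths can be checked directly from the two records. GFF3 start and end coordinates are 1-based and inclusive, so a feature's length is end - start + 1. The ratio printed below is a naive notion of coverage (lifted length over reference length), used as an assumption for illustration; it is not necessarily how Liftoff computes its `coverage` attribute.

```python
def gff_length(line: str) -> int:
    """Length of a GFF3 feature; start and end are 1-based and inclusive."""
    fields = line.split()
    start, end = int(fields[3]), int(fields[4])
    return end - start + 1


# First eight columns of the two records quoted above.
ref = "1 Broad gene 2805841 2806263 . + ."
lifted = "JH793657.1 Liftoff gene 13566 13568 . + ."

ref_len, lifted_len = gff_length(ref), gff_length(lifted)
print(ref_len, lifted_len, round(lifted_len / ref_len, 3))
```

Under this naive definition the lifted gene would cover well under 1% of the reference gene, which is why the reported `coverage=1.0` looks suspicious.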

Fail to load features

Hi @agshumate ,

I've been trying to lift over some contigs (which contain miRNA), and I realized that the feature type 'miRNA' is not recognized by the loader, so the run crashes:

lifting features
Traceback (most recent call last):
File "/usr/local/bin/liftoff", line 11, in <module> load_entry_point('Liftoff==1.4.2', 'console_scripts', 'liftoff')()
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/run_liftoff.py", line 18, in main
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/liftover_types.py", line 17, in lift_original_annotation
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/liftover_types.py", line 30, in align_and_lift_features
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/lift_features.py", line 20, in lift_all_features
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/lift_features.py", line 96, in lift_single_feature
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/merge_lifted_features.py", line 20, in merge_lifted_features
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/merge_lifted_features.py", line 36, in create_parents
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/merge_lifted_features.py", line 49, in make_new_parent
File "/usr/local/lib/python3.6/dist-packages/Liftoff-1.4.2-py3.6.egg/liftoff/merge_lifted_features.py", line 61, in get_ref_parent
KeyError: 'rna-HHV5wtgp041'

After deleting the lines containing miRNA entries, everything runs fine. It might be worth supporting other feature types in future versions (which I am eagerly waiting for!).

Best!

POSTEDIT: It might also be worth transferring repeat tags and the like, as they are currently not transferred. I don't know whether that is intentional.

Gene features broken in v1.5.0?

I just upgraded to v1.5.0 but when I run the following command:

liftoff -g hg38.refGene.short.gtf.gz -f feature_types.txt chm13.draft_v1.0.fasta.bgz hg38.fa.bgz

I get the following error:

2020-10-06 00:37:30,849 - INFO - Committing changes: 0 features
2020-10-06 00:37:30,854 - INFO - Populating features table and first-order relations: 999 features
2020-10-06 00:37:30,854 - INFO - Creating relations(parent) index
2020-10-06 00:37:30,855 - INFO - Creating relations(child) index
2020-10-06 00:37:30,856 - INFO - Creating features(featuretype) index
2020-10-06 00:37:30,856 - INFO - Creating features (seqid, start, end) index
2020-10-06 00:37:30,857 - INFO - Creating features (seqid, start, end, strand) index
2020-10-06 00:37:30,858 - INFO - Running ANALYZE features
GFF does not contain any gene features. Use -f to provide a list of other feature types to lift over.

When I downgrade back to v1.4.2, it starts running as usual (albeit with the running time issue). Did anything change related to the -f flag?

Can this program lift over repeat features such as transposons between two genomes of the same species?

Hi authors,
Thanks for developing this very useful tool. I am trying to use it to lift over repeat features from one version of a genome assembly to another for the same species, but I have run into a problem. The run seems abnormal: it has taken over five days so far, and I do not know when it will end. Does this tool not handle repeat features such as transposons?

Thanks for reading this. I look forward to your valuable advice.

liftoff.py missing

Hi,
could it be that the liftoff.py file is missing in the folder liftoff from which run_liftoff.py is importing write_new_gff and liftover_types?

Cheers
Daniel

Performance on fast-evolving sequences

Hi,

When testing Liftoff, did you also look at its performance on gene families in which duplication events are frequent and sequences evolve quickly?

logo 2

liftoff_logo.pdf

Installation issue

Hi developers,

I am facing installation problems with Liftoff, although I was able to install the older version successfully.
This is the error -

Traceback (most recent call last):
  File "./bin/liftoff", line 33, in <module>
    sys.exit(load_entry_point('Liftoff==1.2.2', 'console_scripts', 'liftoff')())
  File "./bin/liftoff", line 25, in importlib_load_entry_point
    return next(matches).load()
StopIteration

I tried both a local directory installation and a conda environment installation, but neither seems to work.
Could you please help me resolve it?

Thanks,
Rohit

liftoff maps exons from different isoforms

I am seeing some odd behavior. I'm not sure it's a bug, but it's causing me some confusion.
I tried mapping the D. melanogaster gtf (dmel-all-r6.35.gtf) to a closely related Drosophila species. I would expect most exons to map and to test this I am taking the gtf for the other species and trying to map back to D. melanogaster. However, in the reverse mapping, only the gene annotations map. The exons, mRNA, etc. annotations, which are present in the "forward" mapping do not map back.

Here are some examples.
From forward mapping:
Scf_2L Liftoff gene 10263 12636 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;coverage=0.114;sequence_ID=0.097;extra_copy_number=0;copy_num_ID=FBgn0002121_0;partial_mapping=True;low_identity=True
Scf_2L Liftoff exon 11172 11331 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0078167;transcript_symbol=l(2)gl-RD;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11326 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0078168;transcript_symbol=l(2)gl-RE;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11331 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0078169;transcript_symbol=l(2)gl-RF;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11326 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0306589;transcript_symbol=l(2)gl-RG;Parent=FBgn0002121;extra_copy_number=0
Scf_2L Liftoff exon 11172 11331 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;transcript_id=FBtr0306591;transcript_symbol=l(2)gl-RI;Parent=FBgn0002121;extra_copy_number=0

etc. for many lines, including CDS, 5UTR, mRNA, etc.

From the reverse mapping, I get only the following line for this gene.

2L Liftoff gene 19041 21376 . - . gene_id=FBgn0002121;gene_symbol=l(2)gl;coverage=0.932;sequence_ID=0.799;extra_copy_number=0;copy_num_ID=gene_1_0

Any thoughts?

Many thanks!

David
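
One way to quantify the difference between the forward and reverse outputs is to tally feature types in each file. The following is a minimal sketch, assuming tab-separated GFF/GTF lines with the feature type in column 3; the function name and the file names in the usage comment are hypothetical and not part of Liftoff.

```python
from collections import Counter


def count_feature_types(lines):
    """Tally column 3 (feature type) of GFF/GTF lines, skipping comments."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Hypothetical usage: compare the two output files side by side.
# with open("forward.gff") as fwd, open("reverse.gff") as rev:
#     print(count_feature_types(fwd))
#     print(count_feature_types(rev))
```

If the reverse file shows only gene features while the forward file also lists exon, CDS, and mRNA features, that confirms the pattern described above without inspecting individual genes.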

Installation problem

Hi, there:

I'd installed liftoff a few months ago using

conda install -c bioconda minimap2

git clone https://github.com/agshumate/Liftoff liftoff

cd liftoff
python setup.py install --user

from the old instructions, and it worked perfectly, except I had to run python liftoff.py instead of just liftoff to get the program to run.

For various reasons, I had to delete and reinstall liftoff the other day in a new conda environment, but when I tried to re-install with the previous commands, I got various reports that

Requirement already satisfied: liftoff in /path/to/home/directory/.local/lib/python3.6/site-packages/Liftoff-1.4.2-py3.6.egg

etc, which I think is ok, but I didn't get a liftoff.py file, and just running liftoff -h in the liftoff directory to test the code produced the error liftoff: command not found. The same thing happened when I installed with pip install liftoff. When I tried to install liftoff with conda in a new conda environment with the updated command from the liftoff github

conda install -c bioconda liftoff

I got the error

UnsatisfiableError: The following specifications were found to be incompatible with each other

Package python conflicts for:
liftoff -> python[version='>=3']
Package biopython conflicts for:
liftoff -> biopython[version='>=1.76']
Package gffutils conflicts for:
liftoff -> gffutils[version='>=0.10.1']
Package numpy conflicts for:
liftoff -> numpy[version='>=1.19.0']
Package minimap2 conflicts for:
liftoff -> minimap2
Package networkx conflicts for:
liftoff -> networkx[version='>=2.4']
Package pysam conflicts for:
liftoff -> pysam[version='>=0.16.0.1']
Package pyfaidx conflicts for:
liftoff -> pyfaidx[version='>=0.5.8']
Package interlap conflicts for:
liftoff -> interlap[version='>=0.2.6']

even though my conda environment is python 3.6. The bottom line is that no installation is working, and I was having no problems installing and running it a few months ago. Not sure what to do. I'm on an HPC compute cluster if that helps.

Liftoff is an excellent tool, by the way; it's become essential to my work, so I'm quite anxious to get it running again! Thanks.

Duplicate contig names in SAM header cause crash

Thanks for writing this useful tool. I encountered an issue with the SAM file of mapped gene sequences that crashed the program. I was able to replicate the issue using public data of two mammalian genomes.

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/940/915/GCF_002940915.1_ASM294091v2/GCF_002940915.1_ASM294091v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/940/915/GCF_002940915.1_ASM294091v2/GCF_002940915.1_ASM294091v2_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/466/805/GCF_001466805.2_Raegyp2.0/GCF_001466805.2_Raegyp2.0_genomic.fna.gz
gzip -d *gz

python liftoff.py -dir testdir -p 1 -r GCF_002940915.1_ASM294091v2_genomic.fna -g GCF_002940915.1_ASM294091v2_genomic.gff -t GCA_001466805.2_Raegyp2.0_genomic.fna -o test.gff -m /path/to/minimap2

After the feature database is constructed, the following error occurs. I have omitted some of the 'Duplicated sequence' warnings, but they occur once for every contig in the target FASTA.

extracting features
aligning features
[W::sam_hdr_create] Duplicated sequence 'LOCP02000998.1'
[W::sam_hdr_create] Duplicated sequence 'LOCP02000999.1'
[E::sam_hrecs_update_hashes] Duplicate entry "LOCP02000001.1" in sam header
[E::sam_parse1] failed to parse header
[W::sam_read1] Parse error at line 4980
Traceback (most recent call last):
  File "liftoff.py", line 121, in <module>
    main()
  File "liftoff.py", line 74, in main
    unmapped_features, infer_transcripts, infer_genes, cov_threshold, seq_threshold, minimap2_path, inter_files)
  File "../liftover_types.py", line 17, in lift_original_annotation
    unmapped_features, reference_fasta, minimap2_path, inter_files, True)
  File "../align_features.py", line 29, in align_features_to_target
    aligned_segments = parse_alignment(file, parent_dict, children_dict, unmapped_features, search_type)
  File "../align_features.py", line 84, in parse_alignment
    for ref_seq in sam_file_iter:
  File "pysam/libcalignmentfile.pyx", line 2187, in pysam.libcalignmentfile.IteratorRowAll.__next__
OSError: truncated file
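
The warnings above complain about duplicated sequence names in the SAM header. Whether the target FASTA itself contains repeated names is not established in this report, but it is cheap to check. A minimal sketch, assuming the sequence ID is the first whitespace-delimited word of each '>' header line; the function name is hypothetical.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return sequence IDs (first word of each '>' header) seen more than once."""
    ids = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(name for name, n in ids.items() if n > 1)


# Hypothetical usage on the target assembly:
# with open("target.fna") as handle:
#     print(duplicate_fasta_ids(handle))
```

An empty result would suggest the duplication is introduced later, for example when the header is assembled, rather than being present in the input file.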

Liftoff with copies leads to duplicated ID + Copy number

I'm finding that when I run Liftoff with the -copies flag, it will find copies of genes in different locations, but it won't give them a unique ID or extra_copy_number, or anything to identify them. I'm assuming this is a bug?

For instance:

NTIC01022223.1  Liftoff transcript      138600727       138602100       .       -       .       transcript_id "ENST00000474381.1-0"; gene_id "Rhesus_G0040478"; gene_name "HLA-B"; copy_id "ENST00000474381.1-0_30"; coverage "1.0"; sequence_ID "1.0"; extra_copy_number "6";

NTIC01022223.1  Liftoff transcript      138801942       138803360       .       -       .       transcript_id "ENST00000474381.1-0"; gene_id "Rhesus_G0040478"; gene_name "HLA-B"; copy_id "ENST00000474381.1-0_30"; coverage "1.0"; sequence_ID "1.0"; extra_copy_number "6";

Thanks for the hard work! Your code is awesome!
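
To see how widespread the shared identifiers are, the output can be scanned for copy_id values that occur at more than one locus. This is a minimal sketch, assuming tab-separated GTF lines with attributes written as key "value"; pairs as in the example above; the function name is hypothetical.

```python
import re
from collections import defaultdict

COPY_ID = re.compile(r'copy_id "([^"]+)"')


def shared_copy_ids(lines, feature_type="transcript"):
    """Map each copy_id to its distinct (seqid, start, end) loci,
    keeping only copy_ids that appear at more than one locus."""
    loci = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        match = COPY_ID.search(fields[8])
        if match:
            loci[match.group(1)].add((fields[0], int(fields[3]), int(fields[4])))
    return {cid: sorted(places) for cid, places in loci.items() if len(places) > 1}
```

Applied to the two transcript lines quoted above, this would report the copy_id "ENST00000474381.1-0_30" at both locations.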

conda install problem

Hello,

I tried installing liftoff with conda. Here is my attempt below. Some dependencies are creating conflicts, which causes the installation to fail.

Here is my conda version, just in case. Sorry, it's not the most recent.

Thank you very much for the help,

Best,
T.

conda 4.7.12

$ conda create --name liftoff
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.9.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /groups/lackgrp/ll_members/tunc/anaconda3/envs/liftoff



Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate liftoff
#
# To deactivate an active environment, use
#
#     $ conda deactivate

$ conda activate liftoff
(liftoff) $ conda install -c bioconda liftoff
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:



Package numpy conflicts for:
liftoff -> numpy[version='>=1.19.0']
Package pysam conflicts for:
liftoff -> pysam[version='>=0.16.0.1']
Package gffutils conflicts for:
liftoff -> gffutils[version='>=0.10.1']
Package pyfaidx conflicts for:
liftoff -> pyfaidx[version='>=0.5.8']
Package biopython conflicts for:
liftoff -> biopython[version='>=1.76']
Package python conflicts for:
liftoff -> python[version='>=3']
Package ujson conflicts for:
liftoff -> ujson
Package interlap conflicts for:
liftoff -> interlap[version='>=0.2.6']
Package networkx conflicts for:
liftoff -> networkx[version='>=2.4']
Package minimap2 conflicts for:
liftoff -> minimap2

Most annotations that liftover fails to convert can be successfully converted by liftoff

Dear authors,

I used liftoff to see how many of the annotations that liftover fails to convert can be converted. The results showed that 99% of them can be mapped to the target genome by liftoff. None of them has an extra copy.

I am surprised by the large difference between the results produced by liftover and liftoff. Do you have any idea how this happens?

Sincerely yours
Christina

Use liftover with SNPs

Dear Alaina,

Very interesting tool and approach.
If the VCF file is previously converted to GTF, do you think Liftoff could be used for variants such as SNP/INDEL or even larger deletion/insertion?

Best regards,

Luca

Permit use of Winnowmap (alternative to minimap2)

Winnowmap is built on top of minimap2 to ensure proper mapping of sequences in repetitive regions. Some species have a lot of repetitive regions, and Winnowmap outperforms minimap2 in these regions while providing similar quality in other regions.

Winnowmap code: https://github.com/marbl/Winnowmap
Winnowmap paper: https://www.biorxiv.org/content/10.1101/2020.11.01.363887v1.full

Winnowmap is not a drop-in replacement for minimap2, however, so the command cannot be simply substituted.

Inconsistent running behavior

Hi again,

I am using a python wrapper to run the software on multiple similar genomes. There were some inconsistent behaviors I observed.

Below is a log file for 4 consecutive runs. Each run seems to typically take 2 - 3 minutes. But for the last one, liftoff is stuck for more than an hour without proceeding further.

2020-07-01 17:01:07,261 - INFO - Populating features
2020-07-01 17:01:18,098 - INFO - Populating features table and first-order relations: 95875 features
2020-07-01 17:01:18,099 - INFO - Updating relations
2020-07-01 17:01:19,689 - INFO - Creating relations(parent) index
2020-07-01 17:01:19,779 - INFO - Creating relations(child) index
2020-07-01 17:01:19,859 - INFO - Creating features(featuretype) index
2020-07-01 17:01:19,907 - INFO - Creating features (seqid, start, end) index
2020-07-01 17:01:19,955 - INFO - Creating features (seqid, start, end, strand) index
2020-07-01 17:01:20,008 - INFO - Running ANALYZE features
extracting features
aligning features
[M::mm_idx_gen::1.752*0.95] collected minimizers
[M::mm_idx_gen::2.002*1.70] sorted minimizers
[M::main::2.006*1.70] loaded/built the index for 2473 target sequence(s)
[M::mm_mapopt_update::2.174*1.64] mid_occ = 20
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2473
[M::mm_idx_stat::2.301*1.61] distinct minimizers: 7175492 (93.76% are singletons); average occurrences: 1.105; average spacing: 5.374; total length: 42620654
[M::worker_pipeline::3.324*5.46] mapped 12593 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: /global/scratch/skyungyong/Software/minimap2/minimap2 -o intermediate_files/reference_all_to_target_all.sam -a --eqx -N 50 -p 0.5 -t 20 ../GCA_002924885.1_ASM292488v1_genomic.fna intermediate_files/reference_all_genes.fa
[M::main] Real time: 3.370 sec; CPU: 18.209 sec; Peak RSS: 0.545 GB
lifting features
mapping gene copies
aligning features
lifting features

2020-07-01 17:04:13,851 - INFO - Populating features
2020-07-01 17:04:24,853 - INFO - Populating features table and first-order relations: 95875 features
2020-07-01 17:04:24,853 - INFO - Updating relations
2020-07-01 17:04:26,466 - INFO - Creating relations(parent) index
2020-07-01 17:04:26,549 - INFO - Creating relations(child) index
2020-07-01 17:04:26,631 - INFO - Creating features(featuretype) index
2020-07-01 17:04:26,678 - INFO - Creating features (seqid, start, end) index
2020-07-01 17:04:26,728 - INFO - Creating features (seqid, start, end, strand) index
2020-07-01 17:04:26,781 - INFO - Running ANALYZE features
extracting features
aligning features
[M::mm_idx_gen::1.648*0.95] collected minimizers
[M::mm_idx_gen::1.873*1.65] sorted minimizers
[M::main::1.874*1.65] loaded/built the index for 1697 target sequence(s)
[M::mm_mapopt_update::1.997*1.61] mid_occ = 12
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1697
[M::mm_idx_stat::2.091*1.59] distinct minimizers: 6753799 (94.95% are singletons); average occurrences: 1.068; average spacing: 5.365; total length: 38689525
[M::worker_pipeline::2.974*5.45] mapped 12593 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: /global/scratch/skyungyong/Software/minimap2/minimap2 -o intermediate_files/reference_all_to_target_all.sam -a --eqx -N 50 -p 0.5 -t 20 ../GCA_003015705.1_ASM301570v1_genomic.fna intermediate_files/reference_all_genes.fa
[M::main] Real time: 3.014 sec; CPU: 16.237 sec; Peak RSS: 0.493 GB
lifting features
mapping gene copies
aligning features
lifting features

2020-07-01 17:06:06,682 - INFO - Populating features
2020-07-01 17:06:17,259 - INFO - Populating features table and first-order relations: 95875 features
2020-07-01 17:06:17,259 - INFO - Updating relations
2020-07-01 17:06:18,937 - INFO - Creating relations(parent) index
2020-07-01 17:06:19,021 - INFO - Creating relations(child) index
2020-07-01 17:06:19,102 - INFO - Creating features(featuretype) index
2020-07-01 17:06:19,149 - INFO - Creating features (seqid, start, end) index
2020-07-01 17:06:19,198 - INFO - Creating features (seqid, start, end, strand) index
2020-07-01 17:06:19,250 - INFO - Running ANALYZE features
extracting features
aligning features
[M::mm_idx_gen::1.644*0.96] collected minimizers
[M::mm_idx_gen::1.885*1.69] sorted minimizers
[M::main::1.888*1.69] loaded/built the index for 1805 target sequence(s)
[M::mm_mapopt_update::2.059*1.63] mid_occ = 13
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1805
[M::mm_idx_stat::2.181*1.60] distinct minimizers: 6778375 (94.62% are singletons); average occurrences: 1.074; average spacing: 5.363; total length: 39032477
[M::worker_pipeline::3.058*5.48] mapped 12593 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: /global/scratch/skyungyong/Software/minimap2/minimap2 -o intermediate_files/reference_all_to_target_all.sam -a --eqx -N 50 -p 0.5 -t 20 ../GCA_003016895.1_ASM301689v1_genomic.fna intermediate_files/reference_all_genes.fa
[M::main] Real time: 3.098 sec; CPU: 16.784 sec; Peak RSS: 0.512 GB
lifting features
mapping gene copies
aligning features
lifting features

2020-07-01 17:14:21,196 - INFO - Populating features
2020-07-01 17:14:32,071 - INFO - Populating features table and first-order relations: 95875 features
2020-07-01 17:14:32,072 - INFO - Updating relations
2020-07-01 17:14:33,728 - INFO - Creating relations(parent) index
2020-07-01 17:14:33,816 - INFO - Creating relations(child) index
2020-07-01 17:14:33,896 - INFO - Creating features(featuretype) index
2020-07-01 17:14:33,943 - INFO - Creating features (seqid, start, end) index
2020-07-01 17:14:33,992 - INFO - Creating features (seqid, start, end, strand) index
2020-07-01 17:14:34,044 - INFO - Running ANALYZE features
extracting features
aligning features
[M::mm_idx_gen::1.611*0.96] collected minimizers
[M::mm_idx_gen::1.833*1.68] sorted minimizers
[M::main::1.834*1.68] loaded/built the index for 1749 target sequence(s)
[M::mm_mapopt_update::2.003*1.62] mid_occ = 13
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1749
[M::mm_idx_stat::2.127*1.59] distinct minimizers: 6787918 (94.64% are singletons); average occurrences: 1.073; average spacing: 5.363; total length: 39062630
[M::worker_pipeline::3.056*5.58] mapped 12593 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: /global/scratch/skyungyong/Software/minimap2/minimap2 -o intermediate_files/reference_all_to_target_all.sam -a --eqx -N 50 -p 0.5 -t 20 ../GCA_003015935.1_ASM301593v1_genomic.fna intermediate_files/reference_all_genes.fa
[M::main] Real time: 3.098 sec; CPU: 17.098 sec; Peak RSS: 0.506 GB
lifting features
mapping gene copies
aligning features

So I ran the last genome separately and directly in the shell, and the behavior persisted. But the input files are very similar. The genome for the unfinished job was 38 Mb in size with 1749 contigs. Another genome whose run completed in 3 minutes was 38 Mb with 1805 contigs. Do you have any guess as to how this is happening?

Thank you for your help!!
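
For a wrapper that loops over many genomes, one defensive option is to give each run a time limit so a single stalled job does not block the rest. The sketch below uses only the standard library; the function name, the limit, and the file names in the usage comment are illustrative assumptions, and it does not explain why the fourth run stalls.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return None


# Hypothetical usage: allow 15 minutes per genome, then move on and log the failure.
# code = run_with_timeout(
#     ["liftoff", "-t", "target.fna", "-r", "ref.fna", "-g", "ref.gff", "-o", "out.gff"],
#     timeout_s=15 * 60,
# )
# if code is None:
#     print("timed out", file=sys.stderr)
```

Runs that return None can then be collected and rerun individually for closer inspection.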

Photo

liftoff_logo.pdf

Logo

Memory usage

Thanks for the great tool, it will be really helpful to genome annotation efforts out there! I tested a mouse-like genome assembly against the mouse GFF3 annotations. Everything was going fine, but at some point in the 'lifting features' portion, memory usage spiked: I'm on a 1 TB shared server, and the python process was using ~400 GB of memory. Any thoughts on what would have caused this? I'm happy to try things out. Thanks again!

installation and run issue with Liftoff

Hello,

I installed liftoff as per the manual. However I am having issues running the tool - I get the error message given below-

Python version - Python 3.7.6
minimap2 version - 2.17-r941 (installed via bioconda)

liftoff command used - (liftoff -t ordered_ragtag.scaffold.fasta -r Zea_mays.B73_RefGen_v4.dna.toplevel.fa -g Zea_mays.B73_RefGen_v4.47.gff3 -p 8 -dir int_files_1 -o ordered_ragtag.scaffold.gff) >& log_liftoff_1.txt &

error message -

Traceback (most recent call last):
  File "/home/ssz74/miniconda3/envs/python3.7/bin/liftoff", line 33, in <module>
    sys.exit(load_entry_point('Liftoff==1.2.2', 'console_scripts', 'liftoff')())
  File "/home/ssz74/miniconda3/envs/python3.7/bin/liftoff", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/site-packages/importlib_metadata-1.7.0-py3.7.egg/importlib_metadata/__init__.py", line 105, in load
    module = import_module(match.group('module'))
  File "/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/site-packages/Liftoff-1.2.2-py3.7.egg/liftoff/run_liftoff.py", line 1, in <module>
  File "/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/site-packages/Liftoff-1.2.2-py3.7.egg/liftoff/liftover_types.py", line 1, in <module>
  File "/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/site-packages/Liftoff-1.2.2-py3.7.egg/liftoff/fix_overlapping_features.py", line 1, in <module>
  File "/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/site-packages/Liftoff-1.2.2-py3.7.egg/liftoff/lift_features.py", line 1, in <module>
  File "/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/site-packages/Liftoff-1.2.2-py3.7.egg/liftoff/find_best_mapping.py", line 2, in <module>
ImportError: cannot import name 'new_feature' from 'liftoff' (/home/ssz74/miniconda3/envs/python3.7/lib/python3.7/site-packages/Liftoff-1.2.2-py3.7.egg/liftoff/__init__.py)

any help in troubleshooting would be appreciated.

photo 2

minimap2 parameters

Hi, I just had a quick question. In the usage message for liftoff, mismatch and gap open penalties of 2 are given for exons. Is this penalty used exclusively for exons or for all feature types (i.e. if I am lifting over a non-coding element that does not have an exon feature, is mismatch penalty of 2 used or the minimap2 default of 4)? Thanks!

Liftoff from haploid to diploid

Thanks for this nice tool agshumate.

I am using Liftoff for lift-over of annotations from a haploid reference to a de novo diploid assembly produced with CanuTrio for another individual of the same species.
Would you recommend running Liftoff for each of the two de novo haplo-phases independently or for the diploid assembly together?
In the chroms.txt file, is a one-to-one relation needed or can one use the same reference chr in several entries?
E.g.:
chr01,chr01_maternal
chr01,chr01_paternal
chr02,chr02_maternal
chr02,chr02_paternal
...

Thank you!
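
If the two haplotypes were run independently, the combined list above could be split into one chroms file per haplotype. This is a minimal sketch, assuming the "reference,target" line format shown in the example and the _maternal/_paternal suffixes used there; the function name is hypothetical, and it does not answer whether Liftoff accepts one-to-many entries in a single file.

```python
def split_chroms_by_haplotype(pairs, suffixes=("_maternal", "_paternal")):
    """Group 'reference,target' lines by the suffix of the target name."""
    groups = {suffix: [] for suffix in suffixes}
    for line in pairs:
        line = line.strip()
        if not line:
            continue
        ref, target = line.split(",", 1)
        for suffix in suffixes:
            if target.endswith(suffix):
                groups[suffix].append(f"{ref},{target}")
                break
    return groups


# Hypothetical usage: write one file per haplotype.
# with open("chroms.txt") as handle:
#     for suffix, lines in split_chroms_by_haplotype(handle).items():
#         with open(f"chroms{suffix}.txt", "w") as out:
#             out.write("\n".join(lines) + "\n")
```

Each resulting file then has a one-to-one relation between reference and target chromosomes.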

sequence_ID tag

Hi,

I am wondering how the identity reported in the sequence_ID tag is calculated. For example does a single indel event count as one difference, or is the size of the indel taken into account?

Thanks!
Mitchell
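
The two conventions in the question give different numbers, which the sketch below makes concrete. It computes identity from an extended CIGAR string (the "=" and "X" operations that minimap2 emits with --eqx, the flag visible in the logged minimap2 command elsewhere on this page). It is an illustration of the two options only; which one Liftoff uses for sequence_ID is exactly what is being asked, and the function name is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def identity_from_cigar(cigar, indel_as_single_event=False):
    """Matches divided by matches + mismatches + indel cost.

    With indel_as_single_event=False every inserted or deleted base counts
    as one difference; with True each indel counts once, whatever its length.
    """
    matches = mismatches = indel_cost = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op == "X":
            mismatches += n
        elif op in ("I", "D"):
            indel_cost += 1 if indel_as_single_event else n
    total = matches + mismatches + indel_cost
    return matches / total if total else 0.0
```

For an alignment of 90 matches, 2 mismatches, and one 8-base deletion ("90=2X8D"), the per-base convention gives 90/100 and the per-event convention gives 90/93.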

Results of different modes are inconsistent

Hello,

I found an inconsistency between different modes of Liftoff. I first ran Liftoff with the -chroms parameter and then ran it without -chroms. In the first case, some genes such as 'evm.TU.15.540' (the name comes from EVidenceModeler and means gene no. 540 on chromosome 15) were not lifted successfully from the reference genome, but in the second case this gene was lifted over successfully, with both coverage and sequence_ID equal to 1.0. I have no idea why this happened. Could you help me?

Missing complete target exon with high identity to reference

Hi,

I've noticed that after running liftoff, a single exon from a candidate gene of interest is missing from the output liftover annotation. This exon is present in my reference species, has ~93% identity with my target species (with the few SNPs spread fairly evenly over the exon) and is fully represented in the target genome (e.g. a blast search returns a single, full-length match at the correct genomic coordinates). This exon should pass both the minimum alignment coverage (a) and minimum sequence identity for child features (s). Do you have a sense of why this exon may not have been lifted over? My concern is that if an exon is missing in my candidate others may be missing from other genes.

edit: Also, I'm using version v1.3.0

Thanks!

PermissionError: [Errno 13] Permission denied: 'minimap2'

Hi,

Trying to use Liftoff but I'm getting these errors. How can I solve them? Thank you.

The command I ran was: liftoff -t new.fasta -r ref.fa -g ref.gff3 -o test.gff

Traceback (most recent call last):
File "/anaconda2/envs/myenv/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/anaconda2/envs/myenv/lib/python3.7/site-packages/liftoff/align_features.py", line 59, in align_single_chroms
minimap2_index = build_minimap2_index(target_file, args, threads_arg, minimap2_path)
File "/anaconda2/envs/myenv/lib/python3.7/site-packages/liftoff/align_features.py", line 110, in build_minimap2_index
threads])
File "/anaconda2/envs/myenv/lib/python3.7/subprocess.py", line 488, in run
with Popen(*popenargs, **kwargs) as process:
File "/anaconda2/envs/myenv/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/anaconda2/envs/myenv/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'minimap2'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/anaconda2/envs/myenv/bin/liftoff", line 8, in <module>
sys.exit(main())
File "/anaconda2/envs/myenv/lib/python3.7/site-packages/liftoff/run_liftoff.py", line 18, in main
parent_features_to_lift)
File "/anaconda2/envs/myenv/lib/python3.7/site-packages/liftoff/liftover_types.py", line 15, in lift_original_annotation
feature_hierarchy.parents, lifted_features_list, ref_parent_order, min_cov, min_seqid)
File "/anaconda2/envs/myenv/lib/python3.7/site-packages/liftoff/liftover_types.py", line 23, in align_and_lift_features
liftover_type, unmapped_features)
File "/anaconda2/envs/myenv/lib/python3.7/site-packages/liftoff/align_features.py", line 21, in align_features_to_target
for result in pool.imap_unordered(func, np.arange(0, len(target_chroms))):
File "/anaconda2/envs/myenv/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
PermissionError: [Errno 13] Permission denied: 'minimap2'
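
The traceback shows the operating system refusing to execute something named minimap2. One possible cause (an assumption, not confirmed in this report) is that an entry named minimap2 on PATH is a directory or a file without execute permission. The sketch below distinguishes that case from a missing program using only the standard library; the function name is hypothetical.

```python
import os
import shutil


def diagnose_executable(name, path=None):
    """Classify how `name` resolves on a PATH-style search string.

    Returns ("ok", full_path), ("not-executable", full_path) when an entry
    with that name exists but cannot be run, or ("missing", None).
    """
    path = os.environ.get("PATH", "") if path is None else path
    found = shutil.which(name, path=path)
    if found:
        return "ok", found
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory or ".", name)
        if os.path.exists(candidate):
            return "not-executable", candidate
    return "missing", None


# Hypothetical usage:
# print(diagnose_executable("minimap2"))
```

If the result is "not-executable", fixing the permissions or passing the full path to a working binary with the -m option (used in an earlier report on this page) are the obvious things to try.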

How does -sc parameter impact the output

Hello,

Running Liftoff with -p 20 -a 0.4 -s 0.4 -sc 0.4 -copies produced 12108 genes with coverage=1.0 and 13196 total genes. On the other hand, the same process with a change in -sc from 0.4 to 1.0 generated 11880 genes with coverage=1.0 and 12544 total genes.

The additional genes with coverage=1.0 captured with -sc 0.4 include cases like this: "coverage=1.0;sequence_ID=0.99824". I thought -sc was used to determine whether a gene can be considered a copy of another gene. How does this parameter impact the output, and why can additional highly similar copies be detected with a smaller -sc value?

Coverage and identity information for mRNAs

Hi @agshumate

I appreciate your and your team's effort in developing Liftoff.

Having used liftoff versions (v1.1.3, 1.3.0 and 1.4.2), I could see in the output.gff file generated by liftoff that the gene type contains coverage (-a) and identity (-s) information provided as coverage=1.0;sequence_ID=1.0 in the attribute field. I believe this is a cumulative coverage across all the mRNAs of a gene.

Is it possible to have the same information added to the mRNA type as well, i.e. to have coverage=?;sequence_ID=? for each mRNA that was transferred across? This would enable users to run Liftoff with the default coverage and identity cutoff of 0.5 and then easily work out which mRNAs within a gene were transferred at 100% and which were not.

I am yet to test the latest release v1.5.1. Apologies if this feature is already added.

Thanks,
Gemy
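
Until per-mRNA values are available, the gene-level tags can at least be filtered after the run. This is a minimal sketch, assuming tab-separated GFF3 output with key=value attributes separated by semicolons, as in the gene lines quoted elsewhere on this page; the function names are hypothetical. The feature_type argument would also work for mRNA lines if they carried the same tags.

```python
def parse_gff3_attributes(column9):
    """Turn 'key=value;key=value' into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def below_threshold(lines, feature_type="gene", min_coverage=1.0, min_identity=1.0):
    """List (identifier, coverage, sequence_ID) for features under either threshold."""
    flagged = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        attrs = parse_gff3_attributes(fields[8])
        try:
            coverage = float(attrs["coverage"])
            identity = float(attrs["sequence_ID"])
        except (KeyError, ValueError):
            continue  # feature carries no usable coverage/identity tags
        if coverage < min_coverage or identity < min_identity:
            name = attrs.get("ID") or attrs.get("gene_id")
            flagged.append((name, coverage, identity))
    return flagged
```

Features without the tags are skipped rather than reported, so exon and CDS lines in the same file do not produce false hits.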

De novo annotation case

Hi there! wonderful tool! I was wondering about an alternative use for Liftoff and wanted to ask your opinion.

I am currently working on the de novo genome assembly of a non-model plant. Assembly is finished (quite fragmented to be honest) so we moved to annotation. I was wondering if I could use Liftoff to help with the annotation. In my mind, I would use Liftoff to map the annotations of reference genomes of other species to the assembly of my species and maybe provide the results as evidence to more robust pipelines like Maker.

The catch here, though, is that the references are not that close. We are talking about species in the same family/subfamily, something I imagine was not in your scope when you designed Liftoff.

So what do you think? Is this repurposing feasible? Or, due to the high evolutionary distances, might something go wrong and create more noise than helpful signal for downstream analysis?

Your thoughts are very valuable to me.
Thank you in advance.
