Comments (5)
Your config looks OK at first glance. To get a feel for what went wrong, I would start by comparing the reference annotation with the Companion annotation of the reference sequence in Artemis. Is there a pattern to the gene mismatches? E.g. are real genes miscalled as pseudogenes? Are they missing completely? Are the predictions too short compared to the real genes, or missing exons?
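If eyeballing the two annotations in Artemis gets tedious, the same questions can be tallied with a small script. This is only a rough sketch, not part of Companion: it assumes both annotations are plain GFF3 with `gene` and `pseudogene` features on matching sequence IDs, and the function names and the 90% length threshold are made up for illustration.

```python
from collections import Counter

def parse_gff3_features(text, wanted=("gene", "pseudogene")):
    """Return (seqid, start, end, strand, type) tuples for the wanted feature types."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] not in wanted:
            continue
        features.append((cols[0], int(cols[3]), int(cols[4]), cols[6], cols[2]))
    return features

def classify(reference, predicted, min_fraction=0.9):
    """Tally how each reference gene is represented in the predicted set."""
    counts = Counter()
    for seqid, start, end, strand, _ in reference:
        # Predicted features on the same sequence and strand that overlap this gene.
        hits = [p for p in predicted
                if p[0] == seqid and p[3] == strand and p[1] <= end and p[2] >= start]
        if not hits:
            counts["missing"] += 1
            continue
        # Use the prediction with the largest overlap as the counterpart.
        best = max(hits, key=lambda p: min(p[2], end) - max(p[1], start))
        if best[4] == "pseudogene":
            counts["called_pseudogene"] += 1
        elif (best[2] - best[1] + 1) < min_fraction * (end - start + 1):
            counts["shorter"] += 1  # hypothetical threshold: under 90% of reference length
        else:
            counts["ok"] += 1
    return counts
```

Reading both files, passing them through `parse_gff3_features`, and printing `classify(reference, predicted)` should give a quick idea of whether the mismatches are dominated by one failure mode.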
It might be useful to look at the intermediate GFFs created by RATT, AUGUSTUS, LAST etc. and check which of the results is picked for a locus, e.g. by viewing them in Artemis. For a legitimate gene, there are usually multiple sources of evidence at the same locus:
- RATT gene model
- AUGUSTUS prediction on scaffold level
- AUGUSTUS prediction on pseudochromosome level
- LAST based spliced alignments
- ...
The integration step tries to pick the 'best' explanation for a locus from all of these by comparing length, reading frame consistency, source etc. (unfortunately, not all of these criteria are exposed in the config file). It looks like, for a substantial number of genes, the wrong source is picked or a source is left out. I would probably also open all of these evidence tracks in Artemis to try to figure out what happened there; maybe one of the sources is wrong but favoured too much. It might also help to disable RATT and/or Exonerate in an attempt to isolate the culprit.
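To see which evidence is present at each locus without clicking through every track, the gene-level intervals from the intermediate GFFs can be grouped by overlap. The following is a minimal sketch under assumptions: the source labels (`RATT`, `AUGUSTUS`, ...) are placeholders for whatever the intermediate files actually contain, strand is ignored, and this is not how Companion's integration step itself works.

```python
def group_loci(features):
    """Group (seqid, start, end, source) tuples into loci of chained overlapping intervals.

    Coordinates are 1-based and inclusive; strand is ignored in this sketch.
    """
    loci = []
    for seqid, start, end, source in sorted(features):
        last = loci[-1] if loci else None
        if last and last["seqid"] == seqid and start <= last["end"]:
            # Overlaps the current locus: extend it and record the extra source.
            last["end"] = max(last["end"], end)
            last["sources"].add(source)
        else:
            loci.append({"seqid": seqid, "start": start, "end": end, "sources": {source}})
    return loci

def single_source_loci(loci):
    """Return the loci supported by only one kind of evidence."""
    return [locus for locus in loci if len(locus["sources"]) == 1]
```

Loci reported by `single_source_loci` are a reasonable place to start looking, and rerunning the grouping with one source dropped mimics disabling that tool.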
There are also quite a few heuristics in Companion that flag genes with slightly "weird" intron/exon structures (missing splice site motifs, ...) as pseudogenes late in the pipeline. This might also be the reason for so many pseudogenes.
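As an illustration of the kind of check meant here, the sketch below flags introns that lack the canonical GT...AG motif. It is an assumption-laden example, not Companion's actual heuristic: it handles forward-strand genes only, takes exon coordinates as 1-based inclusive pairs, and the function name is hypothetical.

```python
def noncanonical_introns(sequence, exons):
    """Return introns (1-based, inclusive) that do not start with GT and end with AG.

    Forward strand only; `exons` is a list of (start, end) pairs for one gene.
    """
    exons = sorted(exons)
    flagged = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron_start, intron_end = left_end + 1, right_start - 1
        intron = sequence[intron_start - 1:intron_end].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            flagged.append((intron_start, intron_end))
    return flagged
```

If many of the genes called as pseudogenes come back with flagged introns, that points towards the gene models rather than the pseudogene heuristics.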
Sorry to say this but this is probably difficult to debug without looking at intermediate data a lot...
from companion.
That's great, thanks for the prompt response. I will take the steps you recommend and get back with any progress I make after the Vietnamese Tet holiday.
from companion.
So, I think it is a problem with the version of the annotation being used by Companion.
For example, in the annotation.gff3 in ss34/EuPathDB_references/FungiDB.org/Cryptococcus_neoformans_var__grubii_H99
(and the Cryptococcus_neoformans_var__grubii_H99.gff3), the first exon of the second gene, CNAG_07303, starts at the same position as the gene itself (5928). When visualised in Artemis, the CDS is obviously misplaced: it is in the wrong reading frame and contains lots of stop codons.
I downloaded the same reference genome from NCBI, and there the CNAG_07303 gene still starts at 5928, but the first CDS starts at 6209, which gives a much more sensible looking CDS.
I'm not sure what happened. I guess there must have been an erroneous version of this annotation in RefSeq, but I'm sure it isn't helping Companion! Is there an easy way to swap in the current, correct annotation?
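For anyone wanting to check other genes for the same problem, counting in-frame stop codons is a quick test. This is a rough sketch with assumptions: it treats the CDS as one contiguous forward-strand interval (a spliced CDS would need its segments concatenated first), coordinates are 1-based inclusive, and the function name is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def internal_stops(sequence, cds_start, cds_end):
    """Count in-frame stop codons in a forward-strand CDS, excluding its final codon."""
    cds = sequence[cds_start - 1:cds_end].upper()
    # Split into complete codons; any trailing partial codon is dropped.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return sum(1 for codon in codons[:-1] if codon in STOP_CODONS)
```

A CDS read in the right frame should give zero, whereas a misplaced start like the one described above would be expected to give several.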
from companion.
Any idea why, while running /home/xin/.nextflow/assets/sanger-pathogens/companion/bin/update_references.lua,
I keep getting the following error message: "tool './bin/update_references.lua' not found; option -help lists possible tools"?
from companion.
@xinliu005 Please open a separate issue here: https://github.com/sanger-pathogens/companion/issues/new
from companion.