Comments (4)
Hi,
There was a bug where, if you ran Roary repeatedly in the same directory, the pan_genome_reference.fa file would be appended to on each run. This is now fixed, and on your test data the number of sequences matches the summary statistics file.
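For anyone who wants to check this on their own output, here is a minimal sketch that counts the records in a FASTA file and reports which headers share an identical sequence. It is not part of Roary; the function names `read_fasta` and `duplicate_report` are made up for illustration, and it assumes a plain multi-line FASTA file such as pan_genome_reference.fa.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def duplicate_report(lines: Iterable[str]) -> Tuple[int, int, List[List[str]]]:
    """Return (total records, distinct sequences, groups of headers sharing a sequence)."""
    by_sequence: Dict[str, List[str]] = defaultdict(list)
    total = 0
    for header, sequence in read_fasta(lines):
        total += 1
        # Compare case-insensitively so 'atg' and 'ATG' count as the same sequence.
        by_sequence[sequence.upper()].append(header)
    shared = [headers for headers in by_sequence.values() if len(headers) > 1]
    return total, len(by_sequence), shared
```

Passing an open file handle, for example `duplicate_report(open("pan_genome_reference.fa"))`, gives the total record count to compare against the summary statistics file, plus the list of entries that are exact duplicates of each other.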
As for duplicate sequences, we split clusters which contain paralogs based on a conserved gene neighbourhood and Roary is working exactly as intended. Using your data as an example, 4682 clusters are identified, and 4680 are unique. Looking at one of the split duplications (transposase for IS431mec), one sample has 2 copies of a gene (12S05158-1_02635 and 12S05158-1_01700). Roary correctly keeps 12S05158-1_01700 as part of the big cluster since there is supporting evidence from the genes around it, and splits 12S05158-1_02635 into its own cluster. 12S05158-1_02635 is a single gene on a single contig and is flagged as 'Investigate' in the QC column of the gene_presence_absence.csv spreadsheet.
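A small sketch for pulling out the flagged clusters, in case it is useful. It only assumes that gene_presence_absence.csv is a regular comma-separated file with a header row containing a QC column as mentioned above; the `Gene` column name for the cluster identifier and the helper name `flagged_clusters` are assumptions, so adjust them to match your file.

```python
import csv
from typing import Iterable, List, Tuple


def flagged_clusters(
    lines: Iterable[str],
    qc_column: str = "QC",      # column named in the comment above
    name_column: str = "Gene",  # assumed name of the cluster-identifier column
) -> List[Tuple[str, str]]:
    """Return (cluster name, QC note) for every row with a non-empty QC entry."""
    flagged = []
    for row in csv.DictReader(lines):
        note = (row.get(qc_column) or "").strip()
        if note:
            flagged.append((row.get(name_column) or "", note))
    return flagged
```

With a real file this would be called as `flagged_clusters(open("gene_presence_absence.csv", newline=""))`, which lists each cluster that carries a QC note such as 'Investigate'.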
Any pull requests would be welcomed if you wish to expand on the existing 756 tests.
Regards,
Andrew
from roary.
I was unaware that such an emphasis is placed on genetic context.
So you're saying that even if there is indication of an orthologous relationship (identical sequence), but not enough evidence of the same genetic context (maybe because of an IS element insertion and the resulting contig break in the assembly of Illumina reads), genes still get assigned to a new OG?
In the case of my coli example, 1500 (!) genes are identical to genes in another OG. I've already seen many examples where a contig break next to a gene causes this separation into two different OGs (see http://klif.vet.uu.nl/example.xlsx for an example). Is there a way to either switch off this genetic context feature or add a flag in the QC that says "potentially belongs to OG such and such"? I fear that if I included genes with a couple of mismatches, far more than 1500 genes would turn out to have been incorrectly assigned to a new OG, and I'd really like to prevent this somewhat unexpected behavior of your tool.
from roary.
There's an undocumented parameter to turn off the splitting: just pass in
the '-s' flag. I'll add it to the documentation since it sounds like it's
of use.
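As a rough illustration of scripting this, the sketch below builds a Roary command line with or without the '-s' flag. Only the '-s' flag comes from this thread; the helper name `roary_command`, the `split_paralogs` parameter, and the assumption that `roary` is on the PATH and takes GFF files as positional arguments are mine.

```python
from typing import List, Sequence


def roary_command(gff_files: Sequence[str], split_paralogs: bool = True) -> List[str]:
    """Build an argument list for Roary; add '-s' to turn paralog splitting off."""
    command = ["roary"]
    if not split_paralogs:
        command.append("-s")  # flag described above: do not split paralogs
    command.extend(gff_files)
    return command
```

The resulting list can be handed to `subprocess.run(roary_command(files, split_paralogs=False), check=True)` to run the analysis with splitting disabled.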
By default, if a cluster contains a paralog, we try to split it. We go back
to the original contig and look at 5 genes on either side of each sequence
to create a fingerprint. Then we use this to iteratively separate them
out. The same approach is used in other algorithms like PanOCT.
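To make the idea concrete, here is a simplified sketch of neighbourhood-based splitting. It is not Roary's actual implementation: the data structures, the function names, the greedy grouping, and the `min_shared` threshold are all assumptions chosen for illustration. Only the window of 5 genes on either side comes from the description above.

```python
from typing import Dict, FrozenSet, List


def fingerprints(
    contigs: List[List[str]], cluster_of: Dict[str, str], flank: int = 5
) -> Dict[str, FrozenSet[str]]:
    """Map each gene to the set of clusters seen within `flank` genes on either side."""
    result = {}
    for contig in contigs:
        for i, gene in enumerate(contig):
            neighbours = contig[max(0, i - flank):i] + contig[i + 1:i + 1 + flank]
            result[gene] = frozenset(cluster_of[g] for g in neighbours if g in cluster_of)
    return result


def split_cluster(
    members: List[str], prints: Dict[str, FrozenSet[str]], min_shared: int = 1
) -> List[List[str]]:
    """Greedily group cluster members whose neighbourhoods overlap (illustrative only)."""
    groups = []
    for gene in members:
        fingerprint = prints.get(gene, frozenset())
        for group in groups:
            if len(fingerprint & group["print"]) >= min_shared:
                group["genes"].append(gene)
                group["print"] |= fingerprint
                break
        else:
            # No group shares enough context, so this gene starts a new one.
            groups.append({"genes": [gene], "print": set(fingerprint)})
    return [group["genes"] for group in groups]
```

In this sketch a gene that sits alone on a contig has an empty fingerprint and can never join another group, which mirrors the situation discussed earlier in the thread where a single gene on a single contig ends up in its own cluster.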
(Sent in reply to aldertzomer's comment above, 8 October 2015 at 08:31.)
from roary.
Thank you very much, this will help a lot, especially with some of my datasets that have many contig breaks due to IS elements.
from roary.