Comments (4)

andrewjpage commented on September 15, 2024

Hi,
There was a bug (recently fixed) where if you ran roary repeatedly in the same directory, the pan_genome_reference.fa file would be appended to. This is now fixed and on your test data the number of sequences matches the summary statistics file.

As for duplicate sequences, we split clusters which contain paralogs based on a conserved gene neighbourhood and Roary is working exactly as intended. Using your data as an example, 4682 clusters are identified, and 4680 are unique. Looking at one of the split duplications (transposase for IS431mec), one sample has 2 copies of a gene (12S05158-1_02635 and 12S05158-1_01700). Roary correctly keeps 12S05158-1_01700 as part of the big cluster since there is supporting evidence from the genes around it, and splits 12S05158-1_02635 into its own cluster. 12S05158-1_02635 is a single gene on a single contig and is flagged as 'Investigate' in the QC column of the gene_presence_absence.csv spreadsheet.
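For anyone wanting to list the clusters flagged this way, here is a minimal sketch. It assumes the spreadsheet has a header row with a cluster-name column called `Gene` and a column called `QC`; check these names against your own file. The sample rows and the `group_1`/`group_2` names are invented for illustration.

```python
import csv
import io


def flagged_clusters(handle, qc_column="QC", name_column="Gene", flag="Investigate"):
    """Return the names of clusters whose QC cell mentions `flag`.

    The column names are assumptions about the spreadsheet layout,
    not something confirmed in this thread.
    """
    reader = csv.DictReader(handle)
    return [
        row[name_column]
        for row in reader
        # A missing or empty QC cell is treated as "not flagged".
        if flag.lower() in (row.get(qc_column) or "").lower()
    ]


# Invented sample data standing in for gene_presence_absence.csv.
sample = io.StringIO(
    "Gene,Annotation,QC\n"
    "group_1,transposase for IS431mec,\n"
    "group_2,transposase for IS431mec,Investigate\n"
)
print(flagged_clusters(sample))
```

With a real run you would pass an open file handle instead of the in-memory sample.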

Any pull requests would be welcomed if you wish to expand on the existing 756 tests.
Regards,
Andrew

from roary.

aldertzomer commented on September 15, 2024

I was unaware that such an emphasis is placed on genetic context.

So you're saying that even if there is indication of an orthologous relationship (identical sequence), but not enough evidence of the same genetic context (maybe because of an IS element insertion and the resulting contig break in the assembly of Illumina reads), genes still get assigned to a new OG?

In the case of my E. coli example, 1500 (!) genes are identical to genes in another OG. I've already seen many examples where a contig break next to a gene causes this separation into two different OGs (see http://klif.vet.uu.nl/example.xlsx for an example). Is there a way to switch off this genetic context feature, or to add a flag in the QC column that says "potentially belongs to OG such and such"? I fear that if I included genes with a couple of mismatches, far more than 1500 genes would turn out to have been incorrectly assigned to a new OG, and I'd really like to prevent this somewhat unexpected behavior of your tool.


andrewjpage commented on September 15, 2024

There's an undocumented parameter to turn off the splitting: just pass in
the '-s' flag. I'll add it to the documentation, since it sounds like it's
of use.

By default, if a cluster contains a paralog, we try to split it. We go back
to the original contig and look at 5 genes on either side of each sequence
to create a fingerprint. Then we use this to iteratively separate the
paralogs. The same idea is used in other algorithms, such as PanOCT.
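To make the fingerprint idea concrete, here is a rough sketch in Python. It is not Roary's actual code (Roary is written in Perl): the function names, the use of Jaccard similarity as the comparison, and the toy contigs are all assumptions made for illustration.

```python
from typing import List, Set


def neighbourhood(contig: List[str], index: int, flank: int = 5) -> Set[str]:
    """Cluster labels of up to `flank` genes on each side of contig[index]."""
    start = max(0, index - flank)
    left = contig[start:index]
    right = contig[index + 1:index + 1 + flank]
    return set(left) | set(right)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared fraction of two fingerprints; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical data: each contig is an ordered list of cluster labels.
contig_a = ["c1", "c2", "c3", "c4", "c5", "tnp", "c6", "c7", "c8", "c9", "c10"]
contig_b = ["tnp"]  # a single gene on a single contig

# Neighbourhood seen around the same gene in the other samples (assumed).
reference = {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"}

fp_a = neighbourhood(contig_a, contig_a.index("tnp"))
fp_b = neighbourhood(contig_b, 0)

print(jaccard(fp_a, reference))  # copy with supporting neighbours
print(jaccard(fp_b, reference))  # copy with no neighbours at all
```

In this sketch, the copy on the one-gene contig has an empty fingerprint and so shares nothing with the reference neighbourhood. That is the situation in which a contig break can push an otherwise identical gene into its own cluster.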


aldertzomer commented on September 15, 2024

Thank you very much, this will help a lot, especially with some of my datasets that have many contig breaks due to IS elements.
