Comments (7)

carrere commented on September 8, 2024

89/7928 groups without sequence

grep -c '^>' pan_genome_reference.fa 
7839

wc -l gene_presence_absence_roary.csv 
7929 gene_presence_absence_roary.csv

wc -l gene_presence_absence.csv 
7929 gene_presence_absence.csv
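
To see which groups are the ones without a sequence, the two files can be compared directly. This is only a minimal sketch: it assumes the FASTA headers in pan_genome_reference.fa are the group names and that the table has a `Gene` column holding the same names. Both are assumptions about the output layout, and the function names are hypothetical.

```python
import csv


def fasta_ids(lines):
    """Return the set of record IDs (first word after '>') found in FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def groups_without_sequence(csv_lines, fasta_lines, column="Gene"):
    """List group names from the table that have no FASTA record, in table order.

    Assumes the table has a header row containing `column`.
    """
    have_sequence = fasta_ids(fasta_lines)
    return [row[column] for row in csv.DictReader(csv_lines)
            if row[column] not in have_sequence]


def groups_without_sequence_from_files(csv_path, fasta_path):
    """Convenience wrapper that opens both files by path."""
    with open(csv_path, newline="") as table, open(fasta_path) as fasta:
        return groups_without_sequence(table, fasta)
```

Calling `groups_without_sequence_from_files("gene_presence_absence.csv", "pan_genome_reference.fa")` would return the names of the groups that have no record in the FASTA file, provided the assumptions above hold for your run.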

from panaroo.

gtonkinhill commented on September 8, 2024

Hi,

This is poorly documented but intentional behaviour. We only return a single reference sequence for each centroid/paralog cluster. This was done to reduce duplications in the generated reference that can cause issues for read alignment.

I am hoping to improve the documentation soon. If there is a strong need for including duplicated sequences we could look at adding an option to include them.

carrere commented on September 8, 2024

Hi,

Sorry, but I still do not understand why those groups have no sequence in the resulting pangenome FASTA file. I understand that you take only one sequence per group (the centroid), so in the case of singleton groups (a gene present in only one copy in only one genome) you should still add this gene to the resulting pangenome. Am I right?

Or does that mean these groups are paralogs of other groups? In that case, how can I get this information?

Thanks for your help.

Sebastien

carrere commented on September 8, 2024

OK, I think I found this information in final_graph.gml (attributes paralog = 1 and centroidID).

gtonkinhill commented on September 8, 2024

Yes, that's it. I will update the documentation to include this information.
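
For anyone who wants to pull this out of final_graph.gml without opening it in a graph viewer, here is a rough sketch. It assumes the third-party networkx package is installed and can parse the file with `read_gml`, and that each node carries `paralog`, `centroid` and `name` attributes. It also assumes that multiple centroid entries appear either as a list or as a ';'-separated string, which you should check against your own file. `load_node_attributes`, `as_list` and `paralog_nodes` are hypothetical helper names.

```python
def load_node_attributes(gml_path):
    """Read the graph and return one attribute dict per node.

    Assumes networkx is installed; it is imported here so that the
    filtering helpers below can be used without it.
    """
    import networkx as nx

    graph = nx.read_gml(gml_path)
    return [dict(attrs) for _, attrs in graph.nodes(data=True)]


def as_list(value):
    """Normalise an attribute that may be missing, a list, or a ';'-joined string."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [part for part in str(value).split(";") if part]


def paralog_nodes(nodes):
    """Return (name, centroid IDs) pairs for nodes flagged with paralog = 1."""
    flagged = []
    for attrs in nodes:
        if str(attrs.get("paralog", 0)) == "1":
            flagged.append((attrs.get("name"), as_list(attrs.get("centroid"))))
    return flagged
```

With a real run, `paralog_nodes(load_node_attributes("final_graph.gml"))` would list each flagged group together with its centroid IDs, if the attribute names match.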

kneubehl commented on September 8, 2024

As a follow-up to this question regarding paralogs from the .gml file, I just want to make sure I am understanding the table output from Cytoscape. What I want to be able to do is determine which group is paralogous to groups in the pangenome reference file. It looks like I can use the longCentroidID to tie a paralogous group to another group's centroid, correct?

Shortened the header a little so it would be easier to see on here:

centroid  description            geneIDs  label  longCentroidID  name        paralog  seqIDs  shared name
1_1_14    DUF792 family protein  1_4_15   5442   1_1_14          group_2121  1        1_4_15  group_2121

If I am reading this right, this row is for geneID 1_4_15, which is in group_2121, which is paralogous to centroid 1_1_14 in another group. What confuses me is that 'centroid' can have multiple geneIDs in it but longCentroidID only has one. What exactly is longCentroidID? Also, what is the shared name header? At first glance I would expect it to be the name of the group that this gene is paralogous to and that is actually in the pangenome reference file, but that is just me.

gtonkinhill commented on September 8, 2024

Not quite. I should really remove the longCentroidID from the final output as it is mainly used to help speed things up internally. The centroid field should allow you to match up paralogous genes. The reason it can have multiple entries is due to the family collapsing stage of the algorithm.

The shared name field should also be ignored for the moment and will probably be removed in a later release. I'm hoping to improve the documentation for these fields soon.
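
As an illustration of matching on the centroid field only, here is a small sketch that works on a node table exported from Cytoscape as CSV. It assumes the export keeps the column names shown above (`centroid`, `name`) and that multiple centroid entries in one cell are separated by ';'. Both are assumptions, and `groups_by_centroid` and `read_node_table` are hypothetical names. longCentroidID and shared name are deliberately ignored.

```python
import csv
from collections import defaultdict


def read_node_table(csv_path):
    """Load the exported node table as a list of row dicts."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def groups_by_centroid(rows, sep=";"):
    """Map each centroid ID to the groups that list it.

    Only centroid IDs shared by more than one group are returned, since
    those are the candidates for paralogous groups.
    """
    index = defaultdict(set)
    for row in rows:
        for centroid_id in (row.get("centroid") or "").split(sep):
            centroid_id = centroid_id.strip()
            if centroid_id:
                index[centroid_id].add(row["name"])
    return {cid: sorted(names) for cid, names in index.items() if len(names) > 1}
```

Each key in the result is a centroid ID and each value lists the groups that share it. Which of those groups is the one present in pan_genome_reference.fa still has to be checked against the FASTA file itself.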
