Dear PANAROO team, I started to use your tool (v1.1.2 installed thro

Groups without sequence in pan_genome_reference.fa about panaroo HOT 7 CLOSED

gtonkinhill commented on September 8, 2024

from panaroo.

Comments (7)

carrere commented on September 8, 2024

89/7928 groups without sequence

grep -c '^>' pan_genome_reference.fa 
7839

wc -l gene_presence_absence_roary.csv 
7929 gene_presence_absence_roary.csv

wc -l gene_presence_absence.csv 
7929 gene_presence_absence.csv
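
The counts above give 7929 - 1 header line = 7928 groups against 7839 FASTA records, hence 89 groups without a sequence. To see which groups those are, a minimal Python sketch follows. It assumes that the first column of gene_presence_absence.csv holds the group name and that each FASTA header in pan_genome_reference.fa starts with that same group name; both are assumptions about the output layout, not something confirmed in this thread. The function names are hypothetical.

```python
import csv


def fasta_ids(fasta_path):
    """Return the set of record IDs (first word after '>') found in a FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def groups_without_sequence(csv_path, fasta_path):
    """List group names (first CSV column, assumed) that have no FASTA record."""
    present = fasta_ids(fasta_path)
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return [row[0] for row in reader if row and row[0] not in present]


# Example call (file names taken from the commands above):
# missing = groups_without_sequence("gene_presence_absence.csv",
#                                   "pan_genome_reference.fa")
# print(len(missing), "groups without sequence")
```

If the assumptions hold, the length of the returned list should match the difference between the two counts.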

gtonkinhill commented on September 8, 2024

Hi,

This is poorly documented but intentional behaviour. We only return a single reference sequence for each centroid/paralog cluster. This was done to reduce duplications in the generated reference that can cause issues for read alignment.

I am hoping to improve the documentation soon. If there is a strong need for including duplicated sequences we could look at adding an option to include them.

carrere commented on September 8, 2024

Hi,

Sorry, but I still do not understand why those groups have no sequence in the resulting pangenome FASTA file. I understand that you take only one sequence per group (the centroid), so for singleton groups (a gene present in only one copy in only one genome) shouldn't that gene still be added to the resulting pangenome? Am I right?

Or does that mean these groups are paralogs of other groups? In that case, how can I get this information?

Thanks for your help.

Sebastien

carrere commented on September 8, 2024

OK, I think I found this information in final_graph.gml (attribute paralog = 1, centroidID).

gtonkinhill commented on September 8, 2024

Yes, that's it. I will update the documentation to include this information.

kneubehl commented on September 8, 2024

As a follow-up to this question regarding paralogs from the .gml file, I just want to make sure I understand the table output from Cytoscape. What I want to do is determine which group is paralogous to groups in the pangenome reference file. It looks like I can use the longCentroidID to tie a paralogous group to another group's centroid, correct?

Shortened the header a little so it would be easier to see on here:

centroid	description	geneIDs	label	longCentroidID	name	paralog	seqIDs	shared name
1_1_14	DUF792 family protein	1_4_15	5442	1_1_14	group_2121	1	1_4_15	group_2121

If I am reading this right, this row is for geneID 1_4_15, which is in group_2121, which is paralogous to centroid 1_1_14 in another group. What confuses me is that 'centroid' can have multiple geneIDs in it but longCentroidID only has one. What exactly is longCentroidID? Also, what is the 'shared name' header? At first glance I would expect that to be the name of the group that this gene is paralogous to and that is actually in the pangenome reference file, but that is just me.

gtonkinhill commented on September 8, 2024

Not quite. I should really remove the longCentroidID from the final output as it is mainly used to help speed things up internally. The centroid field should allow you to match up paralogous genes. The reason it can have multiple entries is due to the family collapsing stage of the algorithm.

The shared name field should also be ignored for the moment and will probably be removed in a later release. I'm hoping to improve the documentation for these fields soon.
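
For readers who want to apply the advice above, here is a minimal Python sketch that groups rows of an exported Cytoscape node table by the centroid field, so that groups sharing a centroid ID can be matched up. It assumes a tab-separated export with the column names shown earlier in the thread (name, centroid) and that multiple centroid IDs in one field are separated by ";". The separator and the function names are assumptions for illustration, not confirmed panaroo behaviour.

```python
import csv
from collections import defaultdict


def groups_by_centroid(table_path, delimiter="\t", centroid_sep=";"):
    """Map each centroid ID to the group names whose 'centroid' field contains it."""
    by_centroid = defaultdict(list)
    with open(table_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            # The centroid field may hold several IDs (separator is assumed).
            for cid in row["centroid"].split(centroid_sep):
                cid = cid.strip()
                if cid and row["name"] not in by_centroid[cid]:
                    by_centroid[cid].append(row["name"])
    return dict(by_centroid)


def paralogous_groups(table_path, **kwargs):
    """Keep only the centroid IDs that are shared by two or more groups."""
    return {
        cid: names
        for cid, names in groups_by_centroid(table_path, **kwargs).items()
        if len(names) > 1
    }


# Example call with a hypothetical export file name:
# for cid, names in paralogous_groups("node_table.tsv").items():
#     print(cid, names)
```

The longCentroidID and shared name columns are deliberately not used, following the maintainer's comment that they should be ignored.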
