Comments (7)
89/7928 groups without sequence
grep -c '^>' pan_genome_reference.fa
7839
wc -l gene_presence_absence_roary.csv
7929 gene_presence_absence_roary.csv
wc -l gene_presence_absence.csv
7929 gene_presence_absence.csv
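The counts above show 7929 CSV lines minus one header line, so 7928 groups against 7839 FASTA records, leaving 89 groups without a sequence. A minimal sketch for listing those groups by name follows. It assumes that each FASTA header starts with the same name used in the `Gene` column of gene_presence_absence.csv; that has not been confirmed in this thread, so check a few headers first. The function names are illustrative, not part of panaroo.

```python
import csv


def fasta_ids(path):
    """Return the set of record IDs (first word after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def csv_groups(path, column="Gene"):
    """Return group names from the presence/absence CSV.

    Assumption: the group name lives in a column called 'Gene'.
    """
    with open(path, newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def groups_without_sequence(csv_path, fasta_path):
    """List groups named in the CSV that have no record in the FASTA file."""
    present = fasta_ids(fasta_path)
    return [name for name in csv_groups(csv_path) if name not in present]
```

Calling `groups_without_sequence("gene_presence_absence.csv", "pan_genome_reference.fa")` should then return the groups that are missing from the reference, if the naming assumption holds.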
from panaroo.
Hi,
This is poorly documented but intentional behaviour. We only return a single reference sequence for each centroid/paralog cluster. This was done to reduce duplications in the generated reference that can cause issues for read alignment.
I am hoping to improve the documentation soon. If there is a strong need for including duplicated sequences we could look at adding an option to include them.
from panaroo.
Hi,
Sorry, but I still do not understand why those groups have no sequence in the resulting pangenome FASTA file. I understand that you take only one sequence per group (the centroid), so for singleton groups (a gene present in only one copy in only one genome) you should still add this gene to the resulting pangenome. Am I right?
Or does that mean these groups are paralogs of other groups? In that case, how can I get this information?
Thanks for your help.
Sebastien
from panaroo.
Ok I think I found this information in the final_graph.gml (attribute paralog = 1, centroidID)
from panaroo.
Yes, that's it. I will update the documentation to include this information.
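A rough sketch of pulling the paralog-flagged groups out of final_graph.gml is shown below. It assumes the node attributes are called `paralog`, `centroid` and `name`, as in the Cytoscape table later in this thread, and that a multi-valued `centroid` arrives either as a list or as a `;`-separated string; the second point is a guess and should be checked against your own file. `load_nodes` uses the third-party `networkx` package, and the helper names are hypothetical.

```python
def as_list(value):
    """Normalise an attribute to a list of strings.

    Assumption: multiple values are stored as a list or joined with ';'.
    """
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [v for v in str(value).split(";") if v]


def paralog_nodes(nodes):
    """Return (group name, centroid IDs) for every node with paralog == 1.

    `nodes` is an iterable of (node_id, attribute dict) pairs.
    """
    flagged = []
    for node_id, attrs in nodes:
        if int(attrs.get("paralog", 0)) == 1:
            name = attrs.get("name", str(node_id))
            flagged.append((name, as_list(attrs.get("centroid", ""))))
    return flagged


def load_nodes(path):
    """Read a GML file and return its (node_id, attribute dict) pairs."""
    import networkx as nx  # third-party; imported lazily so the rest runs without it

    return list(nx.read_gml(path).nodes(data=True))
```

With networkx installed, `paralog_nodes(load_nodes("final_graph.gml"))` would list the flagged groups together with their centroid IDs.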
from panaroo.
As a follow-up to this question regarding paralogs in the .gml file, I just want to make sure I understand the table output from Cytoscape. What I want to do is determine which group is paralogous to groups in the pangenome reference file. It looks like I can use the longCentroidID to tie a paralogous group to another group's centroid, correct?
Shortened the header a little so it would be easier to see on here:
centroid | description | geneIDs | label | longCentroidID | name | paralog | seqIDs | shared name |
---|---|---|---|---|---|---|---|---|
1_1_14 | DUF792 family protein | 1_4_15 | 5442 | 1_1_14 | group_2121 | 1 | 1_4_15 | group_2121 |
If I am reading this right, this row is for geneID 1_4_15, which belongs to group_2121 and is paralogous to centroid 1_1_14 in another group. What confuses me is that 'centroid' can contain multiple gene IDs but longCentroidID only has one. What exactly is longCentroidID? Also, what is the 'shared name' header? At first glance I would expect it to be the name of the group this gene is paralogous to, the one that is actually in the pangenome reference file, but that is just me.
from panaroo.
Not quite. I should really remove the `longCentroidID` from the final output as it is mainly used to help speed things up internally. The `centroid` field should allow you to match up paralogous genes. The reason it can have multiple entries is due to the family collapsing stage of the algorithm.
The `shared name` field should also be ignored for the moment and will probably be removed in a later release. I'm hoping to improve the documentation for these fields soon.
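One way to read the advice about the `centroid` field is to treat two groups as paralog partners when they share at least one centroid ID. The sketch below does that for a plain mapping of group name to centroid IDs. It is an interpretation of the reply above, not panaroo's own logic, and the function name is made up for illustration.

```python
from collections import defaultdict


def partners_by_centroid(groups):
    """Map each group to the other groups that share a centroid ID with it.

    `groups` maps a group name to a list of centroid IDs. A group can carry
    several IDs (for example after family collapsing).
    """
    # Index: centroid ID -> every group that lists it.
    index = defaultdict(set)
    for name, centroids in groups.items():
        for centroid in centroids:
            index[centroid].add(name)

    partners = {}
    for name, centroids in groups.items():
        shared = set()
        for centroid in centroids:
            shared |= index[centroid]
        shared.discard(name)  # a group is not its own partner
        partners[name] = sorted(shared)
    return partners
```

For the row shown earlier, `group_2121` with centroid `1_1_14` would be paired with whichever other group also lists `1_1_14`. That partner group would then be the one to look for in pan_genome_reference.fa.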
from panaroo.