Hi Sam,
I've been exploring the SEED subsystems results and I've noticed that in the Subsystems_pie_charts.R
script, the first file is read in as data_table
, then all subsequent files are put into temp_table
where only the counts and the level 4 name is kept before merging with data_table
(lines 67-68):
temp_table <- temp_table[,c(2,3)]
data_table <- merge(temp_table, data_table, by = "Level4")
It would be no problem to use the level 3, 2 and 1 hierarchy names from the first file if each level 4 category always grouped into the same level 3, 2 and 1 hierarchies - however strangely, this isn't the case.
As an example, when I look in my results for the level 4 name DNA topoisomerase I (EC 5.99.1.2)
, some samples have a line like this in the .hierarchy.reduced
file:
0.159525979945 28.0 DNA topoisomerase I (EC 5.99.1.2) Conserved_gene_cluster_associated_with_Met-tRNA_formyltransferase Clustering-based subsystems
and others have a line like this:
0.132864735128 31.0 DNA topoisomerase I (EC 5.99.1.2) DNA_topoisomerases,_Type_I,_ATP-independent DNA Metabolism DNA replication
Indeed when I look for DNA topoisomerase I (EC 5.99.1.2)
in the subsys_db.fa
database provided with SAMSA2, I can see that there's several entries and they do differ in their level 3, 2 and 1 hierarchy names despite an identical level 4 name.
I am uncertain how large an effect this may have on the downstream analysis (when comparing between two of my samples, 24 of the 655 level 4 terms in common between them had variable hierarchies). At the moment I can't think of a nice way to deal with this but wanted to bring it to your attention.
(Also, should line 68 include all = TRUE
? Otherwise we are keeping only the level 4 functions that are present in every single sample)
Cheers,
Rachael