Comments (3)

DerrickWood commented on August 16, 2024

Hi Bastien,

It appears that particular sequence does not have a taxonomy ID associated with its GI number in the taxonomy data that Kraken downloads. As things stand, the missing taxonomy ID means that even exact matches to that sequence are left unclassified by Kraken. Now that I think about it, I probably should have Kraken associate these types of sequences with taxid 1 (root), not taxid 0 (unclassified). I will probably do that for the next release.

This particular sequence does seem to have a taxonomy ID (645657) when looking at the actual .gbk file, but kraken-build doesn't download these files, in order to save download time and storage space. I'm not sure why NCBI hasn't associated this sequence's GI number with that taxonomy ID in the gi_taxid_nucl.dmp file that Kraken uses. I'll have to look into that a bit more to see if there's a better way to associate sequences with taxonomy information.
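
To make the failure mode concrete, here is a minimal Python sketch of a GI-to-taxid lookup. It is not Kraken's actual code: it assumes gi_taxid_nucl.dmp is read as two tab-separated integer columns (GI, taxid) and that FASTA headers use the old `>gi|NUMBER|...` style. The function names and the GI numbers in the sample data are hypothetical; only taxid 645657 comes from this thread.

```python
import io
import re

# Matches the GI number in an old-style NCBI FASTA header, e.g. ">gi|12345|ref|...".
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def load_gi_to_taxid(handle):
    """Read two whitespace-separated integer columns (GI, taxid) into a dict."""
    mapping = {}
    for line in handle:
        fields = line.split()
        if len(fields) == 2:
            mapping[int(fields[0])] = int(fields[1])
    return mapping


def taxid_for_header(header, gi_to_taxid, missing=0):
    """Return the taxid for a FASTA header, or `missing` if the GI is unmapped."""
    match = GI_PATTERN.match(header)
    if match is None:
        return missing
    return gi_to_taxid.get(int(match.group(1)), missing)


# Made-up GI numbers, for illustration only.
sample_dmp = io.StringIO("111\t645657\n222\t562\n")
gi_to_taxid = load_gi_to_taxid(sample_dmp)

mapped = taxid_for_header(">gi|111|ref|EXAMPLE|", gi_to_taxid)    # GI present in the map
unmapped = taxid_for_header(">gi|999|ref|EXAMPLE|", gi_to_taxid)  # GI absent, falls back to 0
```

Passing `missing=1` instead would model the "associate with root" alternative mentioned above.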

Derrick

from kraken.

bachev commented on August 16, 2024

Ah, things get clearer. Indeed, the GI in that FASTA file has no taxonomy node ID associated with it. Funny thing, the other .fna file in the same directory does have a GI with an association. Probably a bug in the NCBI db or FASTA building process. Would you report it (you may have a bit more weight than I do)?

Regarding Kraken association: I think the classes important for a user could be
a) sequence associated to a classified organism
b) sequence associated to an unclassified organism
c) sequence not associated

In a sense the output is a bit misleading as Kraken currently has only categories a) and c). Having this b) category would help to distinguish cases like the above "your sequence is known, but I have no way to tell you from which organism." Keeping that in taxid 0 would be the right way to do it, because having them in taxid 1 normally means "sheesh, present in all kingdoms of life".

I suppose I'd keep track of sequences which have absolutely no hit with taxid -1 (or however you want to implement this) and output that as "sequence not associated".
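
As a rough illustration of the three categories, here is a small Python sketch. It assumes the convention suggested above: taxid -1 for "no hit at all", taxid 0 for "sequence known but organism unknown", and any positive taxid for a classified organism. The names `Category`, `NO_HIT` and `categorize` are hypothetical and not part of Kraken.

```python
from enum import Enum


class Category(Enum):
    CLASSIFIED_ORGANISM = "a"    # sequence associated to a classified organism
    UNCLASSIFIED_ORGANISM = "b"  # sequence is known, but its organism is not
    NOT_ASSOCIATED = "c"         # sequence has no hit at all


NO_HIT = -1        # hypothetical sentinel for "absolutely no hit"
UNCLASSIFIED = 0   # taxid 0, kept for sequences that lack a taxonomy association


def categorize(taxid):
    """Map a taxid (or sentinel) to one of the three user-facing categories."""
    if taxid == NO_HIT:
        return Category.NOT_ASSOCIATED
    if taxid == UNCLASSIFIED:
        return Category.UNCLASSIFIED_ORGANISM
    if taxid > 0:
        return Category.CLASSIFIED_ORGANISM
    raise ValueError(f"unexpected taxid: {taxid}")
```

With this convention, taxid 1 (root) still lands in category a), which keeps its usual meaning of "present across all kingdoms" separate from case b).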

B.

DerrickWood commented on August 16, 2024

I've notified NCBI about this particular issue - we'll see how things go there.

As for the proper classification of these kinds of sequences, I'll have to give it a bit more thought if the taxid 1 solution isn't appropriate. Perhaps a different classification code in the first column of the output (e.g., adding O in addition to C/U) while maintaining the taxid 0 behavior would be best. There's also a taxid 12908 ("unclassified sequences"), but I don't know that I feel comfortable hard-coding a non-0/1 ID into the main Kraken code, especially when that particular ID might be changed by a future NCBI decision.
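
A sketch of how the extra code could look, in Python. It is an illustration only, not Kraken's implementation: it assumes the classifier already knows whether a read matched any database sequence (`has_hit`, a hypothetical flag), and it writes only the first three tab-separated columns (code, sequence ID, taxid), leaving out the remaining columns of the real output.

```python
def status_code(has_hit, taxid):
    """Return 'C', 'U' or the proposed 'O' code for the first output column."""
    if not has_hit:
        return "U"  # nothing in the database matched
    if taxid == 0:
        return "O"  # matched a known sequence that has no taxonomy ID
    return "C"      # matched and carries a real taxonomy ID


def format_result(seq_id, has_hit, taxid):
    """Build a simplified, tab-separated output line: code, sequence ID, taxid."""
    return "\t".join([status_code(has_hit, taxid), seq_id, str(taxid)])


lines = [
    format_result("read1", True, 645657),
    format_result("read2", True, 0),
    format_result("read3", False, 0),
]
```

Both `O` and `U` rows keep taxid 0 here, so anything that filters on the taxid column behaves as before, and only the first column tells the two cases apart.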
