paxdb-data-pipeline's Issues

check uniprot mapping

reported by a user:

I used to work with PaxDB datasets from v3 and recently switched to v4.
Since I mostly use the Reference Proteomes from Uniprot, I had to convert the PaxDB-STRING identifiers to Uniprot accessions.
I had some issues with the provided mapping files that I would like to share with you.

Inconsistencies between uniprot mapping files for PaxDB v3 and v4
I noticed a significant change between the two versions when I tried to use the paxdb-uniprot mapping files for at least 2 species (cerevisiae, pombe):
pombe
4936 lines in 4896-uniprot-paxdb.v3.map
58 lines in 4896-uniprot-paxdb.v4.map
cerevisiae
6483 lines in 4932-uniprot-paxdb.v3.map
1208 lines in 4932-uniprot-paxdb.v4.map

Mapping PaxDB v4 IDs to official uniprot mapping file
So I searched for all PaxDB-STRING IDs of those species in the official Uniprot mapping files (from the Uniprot FTP, see below for links).
I mostly used perl for this:

  1. Loading the STRING IDs into a hash (after trimming the prefix, which corresponds to the Taxon ID)
  2. Checking, for each line of the mapping file, whether any field contained a value already present in the hash.

Surprisingly, I found many more matches with Uniprot than what seems to be mapped by version 4 of PaxDB:
26304 lines in PaxDB.sc_idmapping_010717.dat.scan.matched (cerevisiae)
4643 lines in PaxDB.sp_idmapping_010717.dat.scan.matched (pombe)
Obviously, multiple records from the Uniprot mapping files can match a single STRING ID, but after removing the redundant STRING-Uniprot ID pairs, I got:
for cerevisiae => 6440 PaxDB IDs corresponding to 6538 Uniprot ACs
for pombe => 4579 PaxDB IDs corresponding to 4571 Uniprot ACs
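
The scan described in this report can be sketched in Python roughly as follows. This is only an illustration, not the reporter's perl script: the function names are hypothetical, and it assumes that each line of the Uniprot idmapping file holds three tab-separated fields (Uniprot AC, ID type, ID), which is consistent with the preview shown in the report.

```python
def strip_taxon_prefix(string_id):
    """Drop a leading numeric taxon prefix: '4896.SPAC1002.01.1' -> 'SPAC1002.01.1'."""
    prefix, sep, rest = string_id.partition(".")
    return rest if sep and prefix.isdigit() else string_id


def scan_idmapping(string_ids, idmapping_lines):
    """Yield (string_id, uniprot_ac, id_type, value) for every mapping record
    in which any field equals a known STRING ID (taxon prefix removed)."""
    wanted = {strip_taxon_prefix(s) for s in string_ids if s}
    for line in idmapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:  # assumption: well-formed records have 3 columns
            continue
        for field in fields:
            if field in wanted:
                yield (field, *fields)
                break  # one output row per mapping record


def unique_pairs(matches):
    """Collapse matched records to distinct (STRING ID, Uniprot AC) pairs."""
    return {(m[0], m[1]) for m in matches}


# Demo with records taken from the preview in the report.
demo_ids = ["4932.Q0045"]
demo_lines = [
    "P00401\tGene_OrderedLocusName\tQ0045\n",
    "P00401\tEnsemblGenome\tQ0045\n",
    "P03875\tGene_OrderedLocusName\tQ0050\n",
]
demo_matches = list(scan_idmapping(demo_ids, demo_lines))
print(len(demo_matches), unique_pairs(demo_matches))  # 2 {('Q0045', 'P00401')}
```

Counting the distinct first and second elements of the pairs returned by `unique_pairs` would give figures comparable to the "PaxDB IDs corresponding to Uniprot ACs" numbers above.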

Checking whether STRING has same problems of mapping
As I understand it, PaxDB relies on STRING, which essentially performed a BLAST against the full Uniprot to generate a mapping file.
I've checked the mapping done by STRING, and it seems fine for those species (https://string-db.org/mapping_files/uniprot_mappings/):
5339 lines in 4896_reviewed_uniprot_2_string.04_2015.tsv
9818 lines in 4932_reviewed_uniprot_2_string.04_2015.tsv

Additionally, I noticed that STRING had a similar problem when mapping to Uniprot for at least one species (drosophila), whereas PaxDB v4 had no issues:
3486 lines in 7227_reviewed_uniprot_2_string.04_2015.tsv
36390 lines in 7227-paxdb_uniprot.txt

I am also using eggNOG, and it seems that this Uniprot mapping problem has propagated there as well (at least for cerevisiae and pombe).

I still very much appreciate the great deal of work you've put into all those projects (PaxDB, STRING, eggNOG).
I hope this helps to correct bugs and perhaps makes the data accessible to a broader community.
Thanks.

PS:
Here is a preview of the records matched between the PaxDB IDs from the v4 datasets and the Uniprot mapping file (the first column is the STRING ID that was used as the hash key in perl):
==> PaxDB.sc.sc_idmapping_060717.dat.scan.matched <==
Q0045 P00401 Gene_OrderedLocusName Q0045
Q0045 P00401 EnsemblGenome Q0045
Q0045 P00401 EnsemblGenome_TRS Q0045
Q0045 P00401 EnsemblGenome_PRO Q0045
Q0050 P03875 Gene_OrderedLocusName Q0050
Q0050 P03875 EnsemblGenome Q0050
Q0050 P03875 EnsemblGenome_TRS Q0050
Q0050 P03875 EnsemblGenome_PRO Q0050
Q0055 P03876 Gene_OrderedLocusName Q0055
Q0055 P03876 EnsemblGenome Q0055

==> PaxDB.sp.sp_idmapping_060717.dat.scan.matched <==
SPAC1002.01.1 Q9US57 EnsemblGenome_TRS SPAC1002.01.1
SPAC1002.02.1 Q9US56 EnsemblGenome_TRS SPAC1002.02.1
SPAC1002.03c.1 Q9US55 EnsemblGenome_TRS SPAC1002.03c.1
SPAC1002.04c.1 Q9US54 EnsemblGenome_TRS SPAC1002.04c.1
SPAC1002.05c.1 Q9US53 EnsemblGenome_TRS SPAC1002.05c.1
SPAC1002.06c.1 Q9US52 EnsemblGenome_TRS SPAC1002.06c.1
SPAC1002.07c.1 P79081 EnsemblGenome_TRS SPAC1002.07c.1
SPAC1002.08c.1 Q9US51 EnsemblGenome_TRS SPAC1002.08c.1
SPAC1002.09c.1 O00087 EnsemblGenome_TRS SPAC1002.09c.1
SPAC1002.10c.1 Q9US49 EnsemblGenome_TRS SPAC1002.10c.1

Mapping files from the Uniprot FTP:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/SCHPO_284812_idmapping.dat.gz
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/YEAST_559292_idmapping.dat.gz

names are not unique

abundances $ grep -h '#name' * | wc -l
493
abundances $ grep -h '#name' * | sort | uniq | wc -l
488
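
The two counts differ by 5, so at least one `#name` line occurs more than once. A minimal Python sketch for listing which ones are repeated is shown below. The function names are hypothetical, and the sketch assumes, like the `grep` commands above, that any line containing `#name` counts and that all regular non-hidden files in the directory should be read.

```python
from collections import Counter
from pathlib import Path


def duplicated_name_lines(lines):
    """Return {line: count} for lines containing '#name' that occur more than once."""
    counts = Counter(line.rstrip("\n") for line in lines if "#name" in line)
    return {line: n for line, n in counts.items() if n > 1}


def lines_in_dir(directory="."):
    """Yield every line of every regular, non-hidden file in `directory`."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and not path.name.startswith("."):
            with path.open(encoding="utf-8", errors="replace") as handle:
                yield from handle
```

Calling `duplicated_name_lines(lines_in_dir("abundances"))` would then return the repeated `#name` lines together with how often each occurs, assuming the files live in a directory called `abundances` as the shell prompt suggests.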
