The paxdb-data-pipeline from meringlab

check uniprot mapping

reported by a user:

I used to work with PaxDB datasets from v3 and recently I switched to v4.
Since I am using mostly Reference Proteome from Uniprot, I had to convert the PaxDB-STRING identifier to Uniprot.
I had some issues with the mapping files provided that I would like to share with you.

Inconsistencies between uniprot mapping files for PaxDB v3 and v4
When I tried the paxdb-uniprot mapping files for at least two species (cerevisiae, pombe), I noticed that the number of mapped entries dropped sharply between the two versions:
pombe
4936 lines in 4896-uniprot-paxdb.v3.map
58 lines in 4896-uniprot-paxdb.v4.map
cerevisiae
6483 lines in 4932-uniprot-paxdb.v3.map
1208 lines in 4932-uniprot-paxdb.v4.map

Mapping PaxDB v4 IDs to official uniprot mapping file
So I looked up all PaxDB-STRING IDs for those species in the official UniProt mapping files (from the UniProt FTP; links below).
I mostly used Perl for this:

1. Load the STRING IDs into a hash, after trimming the prefix that corresponds to the taxon ID.
2. For each line of the mapping file, check whether any field contains a value already present in the hash.

Surprisingly, I found far more correspondences with UniProt than version 4 of PaxDB seems to map:
26304 lines in PaxDB.sc_idmapping_010717.dat.scan.matched (cerevisiae)
4643 lines in PaxDB.sp_idmapping_010717.dat.scan.matched (pombe)
Obviously, multiple records from the UniProt mapping file can match a single STRING ID, but after removing the redundant STRING-UniProt ID pairs I got:
for cerevisiae => 6440 PaxDB IDs corresponding to 6538 UniProt ACs
for pombe => 4579 PaxDB IDs corresponding to 4571 UniProt ACs
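
The scan and de-duplication described above can be sketched in Python. This is only an illustration, not the reporter's Perl script: the function names are hypothetical, and it assumes the UniProt idmapping `.dat` layout of three tab-separated columns (accession, ID type, external ID) and STRING IDs of the form `<taxon>.<identifier>`.

```python
def strip_taxon_prefix(string_id):
    """Drop the leading taxon ID, e.g. '4932.Q0045' -> 'Q0045'."""
    # Split on the first dot only: IDs such as 'SPAC1002.01.1' contain dots.
    _prefix, sep, rest = string_id.partition(".")
    return rest if sep else string_id


def scan_idmapping(string_ids, mapping_lines):
    """Return the distinct (STRING ID, UniProt AC) pairs found in the mapping.

    Assumes each mapping line is 'accession<TAB>id_type<TAB>external_id'.
    """
    wanted = {strip_taxon_prefix(s) for s in string_ids}
    pairs = set()  # a set removes redundant STRING-UniProt pairs
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        accession = fields[0]
        # As in the report: a line matches if any field is a known STRING ID.
        for field in fields:
            if field in wanted:
                pairs.add((field, accession))
    return pairs


def summarize(pairs):
    """Count distinct STRING IDs and distinct UniProt ACs among the pairs."""
    return len({s for s, _ in pairs}), len({a for _, a in pairs})
```

For the real files, `mapping_lines` could be a text handle such as `gzip.open(path, "rt")`, so the mapping is streamed line by line rather than loaded into memory.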

Checking whether STRING has same problems of mapping
I understood that PaxDB relies on STRING, which essentially ran BLAST against the full UniProt to generate a mapping file.
I checked the mapping done by STRING, and it seems fine for those species (https://string-db.org/mapping_files/uniprot_mappings/):
5339 lines in 4896_reviewed_uniprot_2_string.04_2015.tsv
9818 lines in 4932_reviewed_uniprot_2_string.04_2015.tsv

Additionally, I noticed that STRING had a similar problem mapping to UniProt for at least one species (drosophila), whereas PaxDB v4 had no issues:
3486 lines in 7227_reviewed_uniprot_2_string.04_2015.tsv
36390 lines in 7227-paxdb_uniprot.txt

I am also using eggNOG, and this UniProt mapping problem seems to have propagated there as well (at least for cerevisiae and pombe).

I still very much appreciate the great deal of work you've put into all those projects (PaxDB, STRING, eggNOG).
I hope this helps to fix bugs and perhaps makes the data accessible to a broader community.
Thanks.

PS:
Here is a preview of the records matched between PaxDB IDs from the v4 datasets and the UniProt mapping file (the first column is the STRING ID that was used as the hash key in Perl):
==> PaxDB.sc.sc_idmapping_060717.dat.scan.matched <==
Q0045 P00401 Gene_OrderedLocusName Q0045
Q0045 P00401 EnsemblGenome Q0045
Q0045 P00401 EnsemblGenome_TRS Q0045
Q0045 P00401 EnsemblGenome_PRO Q0045
Q0050 P03875 Gene_OrderedLocusName Q0050
Q0050 P03875 EnsemblGenome Q0050
Q0050 P03875 EnsemblGenome_TRS Q0050
Q0050 P03875 EnsemblGenome_PRO Q0050
Q0055 P03876 Gene_OrderedLocusName Q0055
Q0055 P03876 EnsemblGenome Q0055

==> PaxDB.sp.sp_idmapping_060717.dat.scan.matched <==
SPAC1002.01.1 Q9US57 EnsemblGenome_TRS SPAC1002.01.1
SPAC1002.02.1 Q9US56 EnsemblGenome_TRS SPAC1002.02.1
SPAC1002.03c.1 Q9US55 EnsemblGenome_TRS SPAC1002.03c.1
SPAC1002.04c.1 Q9US54 EnsemblGenome_TRS SPAC1002.04c.1
SPAC1002.05c.1 Q9US53 EnsemblGenome_TRS SPAC1002.05c.1
SPAC1002.06c.1 Q9US52 EnsemblGenome_TRS SPAC1002.06c.1
SPAC1002.07c.1 P79081 EnsemblGenome_TRS SPAC1002.07c.1
SPAC1002.08c.1 Q9US51 EnsemblGenome_TRS SPAC1002.08c.1
SPAC1002.09c.1 O00087 EnsemblGenome_TRS SPAC1002.09c.1
SPAC1002.10c.1 Q9US49 EnsemblGenome_TRS SPAC1002.10c.1

Mapping files from the UniProt FTP:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/SCHPO_284812_idmapping.dat.gz
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/YEAST_559292_idmapping.dat.gz

names are not unique

abundances $ grep -h '#name' * | wc -l
493
abundances $ grep -h '#name' * | sort | uniq | wc -l
488
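
The counts above show that five of the 493 `#name` lines repeat a name already seen. To find which names collide and in which files, something like the following Python sketch could be used. The helper names are hypothetical, and, like the `grep` call, it treats any line containing `#name` as a name line.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_names(files):
    """Map each repeated '#name' line to the files it occurs in.

    `files` maps a file name to an iterable of its lines.
    """
    seen = defaultdict(list)
    for filename, lines in files.items():
        for line in lines:
            if "#name" in line:
                seen[line.strip()].append(filename)
    # Keep only name lines that were seen more than once.
    return {name: where for name, where in seen.items() if len(where) > 1}


def load_directory(directory):
    """Read every regular file in `directory` into a {name: lines} dict."""
    return {
        path.name: path.read_text().splitlines()
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
```

Calling `find_duplicate_names(load_directory("abundances"))` would then list each colliding name together with the files that use it, assuming the datasets live in a directory called `abundances` as the shell prompt suggests.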

loading google doc: switch from password to OAuth2

Email and password authentication was deprecated by Google on April 20th. You need to use OAuth2 to access spreadsheets.

paxdb-data-pipeline's Issues

rewrite non-python scripts to python

check uniprot mapping

names are not unique

loading google doc: switch from password to OAuth2
