paxdb-data-pipeline's Issues
rewrite non-python scripts to python
check uniprot mapping
reported by a user:
I used to work with PaxDB datasets from v3 and recently I switched to v4.
Since I am using mostly Reference Proteome from Uniprot, I had to convert the PaxDB-STRING identifier to Uniprot.
I had some issues with the mapping files provided that I would like to share with you.
Inconsistencies between uniprot mapping files for PaxDB v3 and v4
However, when I tried to use the paxdb-uniprot mapping files, I noticed a significant change between the two versions for at least 2 species (cerevisiae, pombe).
pombe
4936 lines in 4896-uniprot-paxdb.v3.map
58 lines in 4896-uniprot-paxdb.v4.map
cerevisiae
6483 lines in 4932-uniprot-paxdb.v3.map
1208 lines in 4932-uniprot-paxdb.v4.map
Mapping PaxDB v4 IDs to official uniprot mapping file
So, I tried to scan all PaxDB-STRING IDs for those species in the official Uniprot mapping file (from the Uniprot FTP, see below for links).
I mostly used Perl for this:
- Loading into a hash the STRING IDs (after trimming the prefix which corresponds to the Taxon ID)
- Checking at each line of the mapping file, if any field contained a value already present in the hash.
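The two steps above can be sketched in Python. This is only an illustration under assumptions, not the script that was actually run: the names `strip_taxon_prefix` and `scan_idmapping` are hypothetical, and the mapping file is assumed to use the tab-separated three-column layout (accession, ID type, ID) seen in the preview at the end of this report.

```python
def strip_taxon_prefix(string_id):
    """Drop a leading numeric taxon ID, e.g. '4932.Q0045' -> 'Q0045'."""
    prefix, sep, rest = string_id.partition(".")
    # Only strip when the part before the first dot looks like a taxon ID.
    return rest if sep and prefix.isdigit() else string_id


def scan_idmapping(string_ids, mapping_lines):
    """Yield (matched_id, fields) for every mapping line in which
    any tab-separated field equals one of the trimmed STRING IDs."""
    wanted = {strip_taxon_prefix(s) for s in string_ids}
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        for value in fields:
            if value in wanted:
                yield value, fields
                break  # report each mapping line at most once


if __name__ == "__main__":
    # Sample rows taken from the preview in this report.
    ids = ["4932.Q0045", "4932.Q0050"]
    lines = [
        "P00401\tGene_OrderedLocusName\tQ0045\n",
        "P03875\tEnsemblGenome\tQ0050\n",
        "P03876\tGene_OrderedLocusName\tQ0055\n",
    ]
    for key, fields in scan_idmapping(ids, lines):
        print(key, *fields)
```

Using a set for the trimmed IDs keeps each field lookup constant-time, which matters because the by-organism mapping files contain many rows per protein.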
Surprisingly, I found many more correspondences with Uniprot than what seems to be mapped by version 4 of PaxDB:
26304 lines in PaxDB.sc_idmapping_010717.dat.scan.matched (cerevisiae)
4643 lines in PaxDB.sp_idmapping_010717.dat.scan.matched (pombe)
Obviously, multiple records from the Uniprot mapping files can match a single STRING ID, but after removing the redundant STRING-Uniprot ID pairs, I got:
for cerevisiae => 6440 PaxDB IDs corresponding to 6538 Uniprot ACs
for pombe => 4579 PaxDB IDs corresponding to 4571 Uniprot ACs
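The deduplication behind these counts can be reproduced roughly as follows. This is a minimal sketch, assuming the matched records have already been reduced to (STRING ID, Uniprot AC) pairs; `summarize_pairs` is a hypothetical helper name, not part of any existing script.

```python
def summarize_pairs(rows):
    """Collapse repeated (string_id, uniprot_ac) rows and count what is left.

    Returns (unique pairs, distinct STRING IDs, distinct Uniprot ACs).
    """
    pairs = set(rows)  # drops redundant STRING-Uniprot pairs
    string_ids = {string_id for string_id, _ in pairs}
    accessions = {accession for _, accession in pairs}
    return len(pairs), len(string_ids), len(accessions)


if __name__ == "__main__":
    # Pairs taken from the cerevisiae preview in this report; each pair
    # appears once per ID type in the scan output, hence the repeats.
    rows = (
        [("Q0045", "P00401")] * 4
        + [("Q0050", "P03875")] * 4
        + [("Q0055", "P03876")] * 2
    )
    print(summarize_pairs(rows))
```

The two distinct counts can differ, as they do in the figures above, whenever one STRING ID matches several accessions or several STRING IDs share one accession.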
Checking whether STRING has the same mapping problems
I understood that PaxDB relies on STRING, which essentially performed a blast against full Uniprot to generate a mapping file.
I've checked the mapping done by STRING but it seems just fine for those species. (https://string-db.org/mapping_files/uniprot_mappings/)
5339 lines in 4896_reviewed_uniprot_2_string.04_2015.tsv
9818 lines in 4932_reviewed_uniprot_2_string.04_2015.tsv
Additionally, I noticed that STRING also had a similar problem when mapping to Uniprot for at least one species (drosophila), whereas PaxDB v4 had no issues:
3486 lines in 7227_reviewed_uniprot_2_string.04_2015.tsv
36390 lines in 7227-paxdb_uniprot.txt
I am also using eggNOG and it seems that this Uniprot mapping problem has propagated there as well (at least for cerevisiae and pombe).
I still appreciate very much the great deal of work you've put in all those projects (PaxDB,STRING,eggNOG)
I hope this could contribute to correct bugs and perhaps make it accessible to a broader community.
Thanks.
PS:
Here is a preview of the matched records between PaxDB ID from v4 datasets and Uniprot Mapping file (the first column is the STRING ID that was used as a key to generate a hash with all records in perl)
==> PaxDB.sc.sc_idmapping_060717.dat.scan.matched <==
Q0045 P00401 Gene_OrderedLocusName Q0045
Q0045 P00401 EnsemblGenome Q0045
Q0045 P00401 EnsemblGenome_TRS Q0045
Q0045 P00401 EnsemblGenome_PRO Q0045
Q0050 P03875 Gene_OrderedLocusName Q0050
Q0050 P03875 EnsemblGenome Q0050
Q0050 P03875 EnsemblGenome_TRS Q0050
Q0050 P03875 EnsemblGenome_PRO Q0050
Q0055 P03876 Gene_OrderedLocusName Q0055
Q0055 P03876 EnsemblGenome Q0055
==> PaxDB.sp.sp_idmapping_060717.dat.scan.matched <==
SPAC1002.01.1 Q9US57 EnsemblGenome_TRS SPAC1002.01.1
SPAC1002.02.1 Q9US56 EnsemblGenome_TRS SPAC1002.02.1
SPAC1002.03c.1 Q9US55 EnsemblGenome_TRS SPAC1002.03c.1
SPAC1002.04c.1 Q9US54 EnsemblGenome_TRS SPAC1002.04c.1
SPAC1002.05c.1 Q9US53 EnsemblGenome_TRS SPAC1002.05c.1
SPAC1002.06c.1 Q9US52 EnsemblGenome_TRS SPAC1002.06c.1
SPAC1002.07c.1 P79081 EnsemblGenome_TRS SPAC1002.07c.1
SPAC1002.08c.1 Q9US51 EnsemblGenome_TRS SPAC1002.08c.1
SPAC1002.09c.1 O00087 EnsemblGenome_TRS SPAC1002.09c.1
SPAC1002.10c.1 Q9US49 EnsemblGenome_TRS SPAC1002.10c.1
Mapping from Uniprot FTP:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/SCHPO_284812_idmapping.dat.gz
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/YEAST_559292_idmapping.dat.gz
names are not unique
abundances $ grep -h '#name' * | wc -l
493
abundances $ grep -h '#name' * | sort | uniq | wc -l
488
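The two counts show that 5 of the 493 `#name` lines are repeats, but not which ones. A minimal Python sketch for listing them, assuming it runs against the same `abundances` directory and that a plain substring match on `#name` is enough (as with the `grep` above); `name_lines` and `duplicated` are hypothetical helper names:

```python
from collections import Counter
from pathlib import Path


def name_lines(directory):
    """Yield every line containing '#name' from the files in a directory."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if "#name" in line:
                    yield line.rstrip("\n")


def duplicated(values):
    """Return {value: count} for every value seen more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Run from inside the abundances directory.
    for name, n in sorted(duplicated(name_lines(".")).items()):
        print(n, name)
```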
loading google doc: switch from password to OAuth2
Email and password authentication was deprecated by Google on April 20th. OAuth2 is now required to access spreadsheets.