
pyuniprot's Introduction

License: Apache 2.0

PyUniProt is a Python package to access and query UniProt data provided by the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).

Data are installed in a (local or remote) RDBMS, which gives bioinformatic algorithms very fast response times to sophisticated queries and high flexibility through the SQLAlchemy database layer. PyUniProt is developed by the Department of Bioinformatics at the Fraunhofer Institute for Algorithms and Scientific Computing SCAI. For more information about PyUniProt, go to the documentation.

Entity relationship model

This development is supported by the following IMI projects:

Project logos: IMI, AETIONOMY, PHAGO, SCAI

Supported databases

PyUniProt uses SQLAlchemy to cover a wide spectrum of RDBMSs (relational database management systems). For best performance, MySQL or MariaDB is recommended. If you cannot install software on your system, SQLite, which needs no further installation, also works. The following RDBMSs are supported (by SQLAlchemy):

  1. Firebird
  2. Microsoft SQL Server
  3. MySQL / MariaDB
  4. Oracle
  5. PostgreSQL
  6. SQLite
  7. Sybase

Getting Started

This is a quick start tutorial for the impatient.

Installation

PyUniProt can be installed with pip.

pip install pyuniprot

If the installation fails because you lack the rights to install, run it as superuser (put sudo before the command on Linux) or ...

pip install --user pyuniprot

If you want to make sure you are installing this under Python 3, use ...

python3 -m pip install pyuniprot

SQLite

Note

If you want to use SQLite as your database system, because you ...

  • cannot use an RDBMS like MySQL/MariaDB
  • just want to test PyUniProt without spending time on setting up a database

skip the next section, MySQL/MariaDB setup. In general, however, we strongly recommend MySQL or MariaDB as your relational database management system.

If you don't know what all that means, skip the section MySQL/MariaDB setup.

Don't worry! You can always change the configuration later. For more information about changing the database system later, go to the section Changing database configuration in the documentation on readthedocs.

MySQL/MariaDB setup

Log in to MySQL as the root user, create a new database, create a user, assign the rights and flush privileges.

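CREATE DATABASE pyuniprot CHARACTER SET utf8 COLLATE utf8_general_ci;
-- Create the user first: MySQL 8.0 no longer accepts GRANT ... IDENTIFIED BY.
CREATE USER 'pyuniprot_user'@'%' IDENTIFIED BY 'pyuniprot_passwd';
GRANT ALL PRIVILEGES ON pyuniprot.* TO 'pyuniprot_user'@'%';
FLUSH PRIVILEGES;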

There are two options to set the MySQL/MariaDB connection.

  1. The simplest is to start the command line tool
pyuniprot mysql

You will be guided by input prompts. Accept the default value in square brackets with RETURN. You will see something like this:

server name/ IP address database is hosted [localhost]:
MySQL/MariaDB user [pyuniprot_user]:
MySQL/MariaDB password [pyuniprot_passwd]:
database name [pyuniprot]:
character set [utf8]:

The connection will be tested and, if it succeeds, the tool prints Connection was successful. Otherwise you will see the following hint:

Test was NOT successful

Please use one of the following connection schemas
MySQL/MariaDB (strongly recommended):
        mysql+pymysql://user:passwd@localhost/database?charset=utf8

PostgreSQL:
        postgresql://user:passwd@localhost/database

MsSQL (pyodbc needed):
        mssql+pyodbc://user:passwd@database

SQLite (always works):

- Linux:
        sqlite:////absolute/path/to/database.db

- Windows:
        sqlite:///C:\absolute\path\to\database.db

Oracle:
        oracle://user:passwd@localhost:1521/database

  2. The second option is to start a Python shell and set the MySQL configuration. If you have not changed anything in the SQL statements above ...

import pyuniprot
pyuniprot.set_mysql_connection()

If you have used your own settings, please adapt the following command to your requirements.

import pyuniprot
pyuniprot.set_mysql_connection(host='localhost', user='pyuniprot_user', passwd='pyuniprot_passwd', db='pyuniprot')
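
The MySQL/MariaDB connection schema shown in the hint above can also be assembled programmatically. The following is a minimal sketch for illustration only: build_mysql_url is a hypothetical helper, not part of PyUniProt, and it percent-encodes the credentials so that characters such as @ or / do not break the URL.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, passwd, host="localhost", db="pyuniprot", charset="utf8"):
    """Assemble a connection string in the mysql+pymysql schema shown above.

    Hypothetical helper for illustration; not part of the PyUniProt API.
    """
    # Percent-encode the credentials so special characters stay in their field.
    return "mysql+pymysql://{}:{}@{}/{}?charset={}".format(
        quote_plus(user), quote_plus(passwd), host, db, charset
    )


url = build_mysql_url("pyuniprot_user", "pyuniprot_passwd")
print(url)
```

With the default values from the SQL statements above, this yields the same string as the schema in the hint, with the placeholders filled in.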

Updating

The updating process will download the uniprot_sprot.xml.gz file provided by the UniProt team on their FTP server download page.

Warning

Please note that the UniProt download file needs ~700 MB of disk space and that the update takes ~2 h for human, mouse and rat alone (depending on your computer).

It is strongly recommended to restrict the entries to the organisms you are interested in by passing a list of NCBI Taxonomy IDs to the parameter taxids. To identify the correct NCBI Taxonomy IDs, please go to the NCBI Taxonomy web form. In the following example we use 9606 as the identifier for Homo sapiens, 10090 for Mus musculus and 10116 for Rattus norvegicus.

There are two options to import the data:

  1. Command line import

    pyuniprot update --taxids 9606,10090,10116
  2. Python

    import pyuniprot
    pyuniprot.update(taxids=[9606, 10090, 10116])
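
The command line takes the taxonomy IDs as one comma-separated string, whereas the Python function takes a list of integers. Below is a minimal sketch of converting one form into the other. parse_taxids is a hypothetical helper, not a PyUniProt function, and how the command line tool parses its argument internally is not shown here.

```python
def parse_taxids(text):
    """Turn a comma-separated string such as '9606,10090,10116' into a list of ints.

    Hypothetical helper for illustration; empty fields are skipped.
    """
    return [int(part) for part in text.split(",") if part.strip()]


taxids = parse_taxids("9606,10090,10116")
print(taxids)
```

The resulting list has the same shape as the taxids argument used in the Python example above.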

We recommend importing the whole UniProt dataset only if you don't want to restrict your search. An import with no restrictions will take several hours and a lot of disk space.

If you want to load all UniProt entries in the database:

import pyuniprot
pyuniprot.update() # not recommended, please read the notes above

The update uses the downloaded file if it still exists on your system (~/.pyuniprot/data/uniprot_sprot.xml.gz). If you use the parameter force_download, the current file from UniProt will be downloaded.

import pyuniprot
pyuniprot.update(force_download=True, taxids=[9606, 10090, 10116])
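
The decision described above (reuse the cached file unless force_download is set) can be sketched as follows. This is an illustration under assumptions, not the package's implementation: needs_download is a hypothetical name, and only the file path is taken from the text above.

```python
from pathlib import Path

# Location of the cached download as stated in the text above.
DEFAULT_DOWNLOAD = Path.home() / ".pyuniprot" / "data" / "uniprot_sprot.xml.gz"


def needs_download(path=DEFAULT_DOWNLOAD, force_download=False):
    """Return True if the file must be fetched (hypothetical helper).

    A download is needed when it is forced or when no cached copy exists.
    """
    return force_download or not Path(path).exists()


print(needs_download())
```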

Quick start with query functions

Initialize the query object

query = pyuniprot.query()

Get all entries

all_entries = query.entry()

Use parameters like gene_name to find specific entries

>>> entry = query.entry(gene_name='YWHAE', taxid=9606, recommended_short_name='14-3-3E', name='1433E_HUMAN')[0]
>>> entry
14-3-3 protein epsilon
Entry is the root element in the database. From here you can reach all other data.
>>> entry.accessions
[P62258, B3KY71, D3DTH5, P29360, P42655, Q4VJB6, Q53XZ5, Q63631, Q7M4R4]
>>> entry.functions
["Adapter protein implicated in the regulation of a large spectrum of both ..."]
If a parameter name ends with an s, you can search with a single value or a tuple of values
>>> alcohol_dehydrogenases = query.entry(ec_numbers='1.1.1.1')
>>> [x.name for x in query.entry(ec_numbers='1.1.1.1')]
['ADHX_RAT', 'ADH1_RAT', 'ADHX_HUMAN', 'ADHX_MOUSE']
>>> query.entry(ec_numbers=('1.1.1.1', '1.1.1.2'))
['Adh5', 'Adh1', 'ADH5', 'Adh5', 'Adh6', 'ADH7', 'Adh7', 'Adh7', 'Adh1']

As a dataframe with a limit of 3 and accession numbers starting with Q9 (% is used as a wildcard)

>>> query.accession(as_df=True, limit=3, accession='Q9%')
   id accession  entry_id
0   1    Q9CQV8         1
1  32    Q9GIK8         6
2  33    Q9TQB4         6
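
The % wildcard in the query above behaves like the % in an SQL LIKE pattern. The sketch below shows the equivalent plain SQL on an in-memory SQLite database. The table layout mirrors the three columns of the output above, but the table name and the rows other than the three shown above are assumptions made for illustration (Q9ZZZ9 is a made-up placeholder).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with the three columns seen in the dataframe above.
conn.execute(
    "CREATE TABLE accession (id INTEGER PRIMARY KEY, accession TEXT, entry_id INTEGER)"
)
conn.executemany(
    "INSERT INTO accession (id, accession, entry_id) VALUES (?, ?, ?)",
    [
        (1, "Q9CQV8", 1),
        (2, "P62258", 2),
        (32, "Q9GIK8", 6),
        (33, "Q9TQB4", 6),
        (34, "Q9ZZZ9", 7),  # made-up placeholder row
    ],
)

# 'Q9%' matches every accession that starts with Q9; LIMIT caps the result size.
rows = conn.execute(
    "SELECT id, accession, entry_id FROM accession "
    "WHERE accession LIKE ? ORDER BY id LIMIT ?",
    ("Q9%", 3),
).fetchall()
print(rows)
```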

You will find the full documentation of the query functions in the documentation.

More information

See the installation documentation for more advanced instructions. Also, check the change log at CHANGELOG.rst.

UniProt tools and licence (use of data)

UniProt also provides many online query interfaces on its website.

Please be aware of the UniProt licence.

Links

Universal Protein Resource (UniProt)

PyUniProt

pyuniprot's People

Contributors

cebel, christianebeling, cthoyt

pyuniprot's Issues

pip installs broken version of code

On Windows 10 with Python 3.8.5 I've set up a MySQL database and can successfully connect to it from command prompt

> mysql --user=gszep  --password=station --host=localhost pyuniprot
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 78
Server version: 8.0.25 MySQL Community Server - GPL

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

successfully connected it to the python library

>pyuniprot mysql
server name/ IP address database is hosted [localhost]:
MySQL/MariaDB user [pyuniprot_user]: gszep
MySQL/MariaDB password [pyuniprot_passwd]: station
database name [pyuniprot]:
character set [utf8]:
Connection was successful

However, when I attempt to update the database with a small virus

>pyuniprot update --taxids 133704
WARNING: Update is very time consuming and can take several
hours depending which organisms you are importing!

bin was anderes
346934it [00:11, 30587.45it/s]
Traceback (most recent call last):
  File "c:\programdata\miniconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\programdata\miniconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\Scripts\pyuniprot.exe\__main__.py", line 7, in <module>
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\click\core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\click\core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\click\core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\click\core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\pyuniprot\cli.py", line 72, in update
    database.update(taxids=taxids, connection=conn, force_download=force_download, silent=silent)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\pyuniprot\manager\database.py", line 836, in update
    db.db_import_xml(urls, force_download, taxids, silent)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\pyuniprot\manager\database.py", line 160, in db_import_xml
    self.import_xml(xml_gzipped_file_path, taxids, silent)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\pyuniprot\manager\database.py", line 233, in import_xml
    self.insert_entries(entry_xml, taxids)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\pyuniprot\manager\database.py", line 258, in insert_entries
    self.insert_entry(entry, taxids)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\pyuniprot\manager\database.py", line 280, in insert_entry
    taxid = self.get_taxid(entry)
  File "C:\Users\gszep\AppData\Roaming\Python\Python38\site-packages\pyuniprot\manager\database.py", line 628, in get_taxid
    return int(entry.find(query).get('id'))
AttributeError: 'NoneType' object has no attribute 'get'

On Ubuntu 18.04 with Python 3.6 we get the same error

Import 173479167 lines:   0%|                         | 346934/173479167
[00:01<12:09, 237194.02it/s]

Traceback (most recent call last):
  File "/home/gszep/.local/bin/pyuniprot", line 8, in <module>
    sys.exit(main())
  File "/home/gszep/.local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/gszep/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/gszep/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gszep/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gszep/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/gszep/.local/lib/python3.6/site-packages/pyuniprot/cli.py", line 72, in update
    database.update(taxids=taxids, connection=conn, force_download=force_download, silent=silent)
  File "/home/gszep/.local/lib/python3.6/site-packages/pyuniprot/manager/database.py", line 836, in update
    db.db_import_xml(urls, force_download, taxids, silent)
  File "/home/gszep/.local/lib/python3.6/site-packages/pyuniprot/manager/database.py", line 160, in db_import_xml
    self.import_xml(xml_gzipped_file_path, taxids, silent)
  File "/home/gszep/.local/lib/python3.6/site-packages/pyuniprot/manager/database.py", line 233, in import_xml
    self.insert_entries(entry_xml, taxids)
  File "/home/gszep/.local/lib/python3.6/site-packages/pyuniprot/manager/database.py", line 258, in insert_entries
    self.insert_entry(entry, taxids)
  File "/home/gszep/.local/lib/python3.6/site-packages/pyuniprot/manager/database.py", line 280, in insert_entry
    taxid = self.get_taxid(entry)
  File "/home/gszep/.local/lib/python3.6/site-packages/pyuniprot/manager/database.py", line 628, in get_taxid
    return int(entry.find(query).get('id'))
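
The traceback ends in entry.find(query).get('id'): find returns None when no element matches, and calling .get on None raises the AttributeError shown in the Windows run. Below is a minimal sketch of a guarded lookup. The XML layout (an organism element holding a dbReference of type NCBI Taxonomy) and the function body are assumptions for illustration, not the package's actual code.

```python
import xml.etree.ElementTree as ET


def get_taxid(entry):
    """Return the NCBI taxonomy ID of an entry element, or None if it is missing."""
    # Assumed layout; real UniProt XML may additionally use a namespace.
    node = entry.find("organism/dbReference[@type='NCBI Taxonomy']")
    if node is None or node.get("id") is None:
        return None
    return int(node.get("id"))


with_taxid = ET.fromstring(
    "<entry><organism><dbReference type='NCBI Taxonomy' id='9606'/></organism></entry>"
)
without_taxid = ET.fromstring("<entry><organism/></entry>")
print(get_taxid(with_taxid), get_taxid(without_taxid))
```

A caller could then skip entries for which the lookup returns None instead of aborting the whole import.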

AttributeError: 'DbManager' object has no attribute 'session'

Hi,

I'm super new to programming so I believe I'm missing something VERY obvious, so I apologize in advance for this question. In case these are relevant, I'm using a Mac and the PyCharm IDE. I downloaded pyUniProt via the project interpreter package install. I don't want to bother downloading MySQL/MariaDB, so I plan to use SQLite. So I've put in import pyuniprot and hoped to just proceed to update (although I have also tried with pyuniprot.set_mysql_connection and gotten the same error).
When I put in pyuniprot.update(force_download=True, taxids=[9606, 10090, 10116]) or just pyuniprot.update() I get the below error. Sorry to post with such an elementary question, but there must be something I'm missing. Thanks!!
Screen Shot 2019-06-06 at 6 04 13 PM

AttributeError during package import

My environment:

Python 3.11.5
pyuniprot 0.0.10
numpy 1.26.1

The issue:

When importing the package, I encounter the following error:

>>> import pyuniprot
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mypath/miniconda3/envs/treva/lib/python3.11/site-packages/pyuniprot/__init__.py", line 15, in <module>
    from . import manager
  File "/mypath/miniconda3/envs/treva/lib/python3.11/site-packages/pyuniprot/manager/__init__.py", line 7, in <module>
    from . import database
  File "/mypath/miniconda3/envs/treva/lib/python3.11/site-packages/pyuniprot/manager/database.py", line 41, in <module>
    sqltypes.Text: np.unicode,
                   ^^^^^^^^^^
  File "/mypath/miniconda3/envs/treva/lib/python3.11/site-packages/numpy/__init__.py", line 333, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'unicode'. Did you mean: 'unicode_'?
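
np.unicode was an alias of the built-in str that was deprecated in NumPy 1.20 and removed in NumPy 1.24, which is why the import fails with NumPy 1.26.1. The sketch below shows one way a type lookup could fall back to str. It is an illustration of the idea, not a patch taken from the package.

```python
try:
    import numpy as np
except ImportError:  # NumPy is optional for this sketch
    np = None

# Use the old alias only if this NumPy still provides it; otherwise use str,
# which is what the alias pointed to.
unicode_type = getattr(np, "unicode", str)
print(unicode_type)
```

In the mapping from the traceback, replacing np.unicode with str directly would have the same effect.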

Keep track of version of data

Inside the UniProt data directory, there's a file called ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt that stores the release information. It would be good to download this and store it in a "meta" table in the PyUniProt database so it knows if there's been an update. This could be as simple as hashing the file and checking if the current hash is the same as the hash from the last update
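
The hash comparison suggested above could be sketched like this. The function names are hypothetical, and where the digest of the last update is stored (for example in a "meta" table) is left open.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a (small) file such as reldate.txt."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def release_changed(reldate_path, stored_digest):
    """True if the release file differs from the one seen at the last update."""
    return file_digest(reldate_path) != stored_digest
```

After a successful update, the new digest would be saved so that the next run can compare against it.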

update database with only specific accession ids

In synthetic biology one often focuses on a subset of proteins across different taxids. Ideally we would have something like

pyuniprot update --accessionids Q14738, A4FU37, ...

and in python

pyuniprot.update(accessionids=['Q14738', 'A4FU37', ...])

Error on update

This happened after about 4 minutes:

In [1]: import pyuniprot

In [2]: pyuniprot.update(taxids=[9606])
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-89507300518a> in <module>()
----> 1 pyuniprot.update(taxids=[9606])

~/dev/pyuniprot/src/pyuniprot/manager/database.py in update(connection, urls, force_download, taxids)
    479 
    480     db = DbManager(connection)
--> 481     db.db_import_xml(urls, force_download, taxids)
    482     db.session.close()
    483 

~/dev/pyuniprot/src/pyuniprot/manager/database.py in db_import_xml(self, url, force_download, taxids)
    175         xml_gzipped_file_path = DbManager.download(url, force_download)
    176         self._create_tables()
--> 177         self.import_xml(xml_gzipped_file_path, taxids)
    178         self.session.close()
    179 

~/dev/pyuniprot/src/pyuniprot/manager/database.py in import_xml(self, xml_gzipped_file_path, taxids)
    184         interval = 100
    185         start = False
--> 186         with open(xml_gzipped_file_path[:-3], 'r') as fd:
    187             for line in fd:
    188                 end_of_file = line.startswith("</uniprot>")

FileNotFoundError: [Errno 2] No such file or directory: '/Users/cthoyt/.pyuniprot/data/uniprot_sprot.xml'

Memory problem with update

Machines with low memory (<=8GB) already have problems with

import pyuniprot
pyuniprot.update(taxids=(9606,10090,10116))

Perhaps caching should be avoided and/or lxml (which causes a memory leak) should be replaced by a Python library.

Add EC Hierarchy Table

This table should contain information about the hierarchy underlying the ECNumber table.

If this information isn't available in UniProt, then we might have to write a parser (perhaps also as a separate package) for ftp://ftp.expasy.org/databases/enzyme/enzclass.txt and match them
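
Independent of any external file, the parent classes of a fully specified EC number follow from the numbering scheme itself, in which unspecified levels are written as a dash. Below is a minimal sketch, assuming a four-level EC number as input. ec_ancestors is a hypothetical helper, and class names and descriptions would still have to come from a source such as enzclass.txt.

```python
def ec_ancestors(ec_number):
    """Return the parent classes of an EC number, from most general to most specific.

    Assumes a fully specified number such as '1.1.1.1'.
    """
    parts = ec_number.split(".")
    return [
        ".".join(parts[:level] + ["-"] * (len(parts) - level))
        for level in range(1, len(parts))
    ]


print(ec_ancestors("1.1.1.1"))
```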

Fix documentation syntax

:param str,tuple name: UniProt entry name(s)
:param str,tuple recommended_full_name: recommended full protein name(s)
:param str,tuple recommended_short_name: recommended short protein name(s)
:param str,tuple tissue_in_reference: tissue mentioned in reference
:param str,tuple subcellular_location: subcellular location(s)
:param str,tuple keyword: keyword
:param str,tuple pmid: PubMed identifier
:param str,tuple tissue_specificity: tissue specificities
:param str,tuple disease_comment: disease_comments
:param str,tuple alternative_name:
:param str,tuple db_reference: cross reference identifier
:param str,tuple ec_number: enzyme classification number, e.g. 1.1.1.1
:param str,tuple function_: description of protein functions
:param str,tuple feature_type: feature types
:param str,tuple organism_host: organism hosts
:param str,tuple accession: UniProt accession number
:param str,tuple disease_name: disease name
:param str,tuple gene_name: gene name
:param str,tuple taxid: NCBI taxonomy identifier
:param int,tuple limit: maximum number of results
:param str,tuple sequence: Amino acid sequence

Each of these parameters that can take multiple values should be written as

:param str or tuple[str] name: UniProt entry name(s) , etc.

If we ever get awesome and write Python 3.6+ only code, we can use real type annotations too :)
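
As a sketch of what such annotations could look like, the alias below describes "a string or a tuple of strings", and a small function normalizes either form to a tuple. The names StrOrTuple and as_tuple are hypothetical and not taken from the package.

```python
from typing import Optional, Tuple, Union

# "str or tuple[str]" from the docstring convention above, as a type alias.
StrOrTuple = Union[str, Tuple[str, ...]]


def as_tuple(value: Optional[StrOrTuple]) -> Tuple[str, ...]:
    """Normalize None, a single string, or a tuple of strings to a tuple."""
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(value)


print(as_tuple("1433E_HUMAN"), as_tuple(("ADH1_RAT", "ADHX_RAT")))
```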

Use flask-security for logins

I saw that you completely rolled your own login system - it would be much easier and more consistent to use Flask-Security, which takes care of everything (session management, data models, and templates) for you.

Let's pair program this?

Remove .idea folder from project

I can't open the project in my PyCharm because there are different settings. It would be nice if you could remove this folder and add it to the .gitignore

Progress bar for update

Since the update function takes so long, it would be nice to give some kind of feedback like on every 5% of the file that is completed
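
The kind of feedback asked for here can be sketched with a small generator that wraps the line iteration and reports each completed 5% step. with_progress and its parameters are hypothetical names. The sketch assumes the total number of lines is known in advance.

```python
def with_progress(items, total, step_percent=5, report=print):
    """Yield items unchanged and report every completed step_percent of total."""
    next_mark = step_percent
    for count, item in enumerate(items, start=1):
        yield item
        # Integer comparison of count/total against next_mark/100.
        while total and count * 100 >= next_mark * total:
            report("{}% done".format(next_mark))
            next_mark += step_percent


messages = []
for _line in with_progress(range(40), total=40, report=messages.append):
    pass  # process the line here
print(messages[0], messages[-1])
```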

Database population fails

After running the update function from a fresh installation in a virtual environment, the following error occurs:

(test_venv) [511] [11:16] [cthoyt@wlan-185:~/dev/pyuniprot]
$ pyuniprot update --taxids 9606,10090,10116
WARNING: Update is very time consuming and can take several hours depending which organisms you are importing!
Traceback (most recent call last):
  File "/Users/cthoyt/.virtualenvs/test_venv/bin/pyuniprot", line 11, in <module>
    load_entry_point('PyUniProt', 'console_scripts', 'pyuniprot')()
  File "/Users/cthoyt/.virtualenvs/test_venv/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/cthoyt/.virtualenvs/test_venv/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/cthoyt/.virtualenvs/test_venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/cthoyt/.virtualenvs/test_venv/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/cthoyt/.virtualenvs/test_venv/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/cthoyt/dev/pyuniprot/src/pyuniprot/cli.py", line 72, in update
    database.update(taxids=taxids, connection=conn, force_download=force_download, silent=silent)
  File "/Users/cthoyt/dev/pyuniprot/src/pyuniprot/manager/database.py", line 826, in update
    db.db_import_xml(urls, force_download, taxids, silent)
  File "/Users/cthoyt/dev/pyuniprot/src/pyuniprot/manager/database.py", line 155, in db_import_xml
    self.import_xml(xml_gzipped_file_path, taxids, silent)
  File "/Users/cthoyt/dev/pyuniprot/src/pyuniprot/manager/database.py", line 197, in import_xml
    number_of_lines = int(getoutput("{} {} | wc -l".format(zcat_command, xml_gzipped_file_path)))
ValueError: invalid literal for int() with base 10: 'gzcat: /Users/cthoyt/.pyuniprot/data/uniprot_sprot.xml.gz: unexpected end of file\ngzcat: /Users/cthoyt/.pyuniprot/data/uniprot_sprot.xml.gz: uncompress failed\n 125198601'
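
The ValueError arises because the shell output contains the gzcat error messages ("unexpected end of file") in front of the line count, which points to an incomplete download. One way to detect this before counting lines is to read the archive to its end in Python. The sketch below is an illustration, and gzip_is_complete is a hypothetical helper.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1024 * 1024):
    """Return True if the gzip file can be decompressed to its end.

    A truncated or corrupt archive (or an unreadable path) returns False.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

If the check fails, the file could be downloaded again, for example with force_download, before the import starts.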

Here's the pip freeze

(test_venv) [512] [11:16] [cthoyt@wlan-185:~/dev/pyuniprot]
$ pip freeze
certifi==2017.11.5
chardet==3.0.4
click==6.7
configparser==3.5.0
flasgger==0.8.0
Flask==0.12.2
idna==2.6
itsdangerous==0.24
Jinja2==2.10
jsonschema==2.6.0
MarkupSafe==1.0
mistune==0.8.1
numpy==1.13.3
pandas==0.21.0
passlib==1.7.1
PyMySQL==0.7.11
python-dateutil==2.6.1
pytz==2017.3
-e git+https://github.com/cebel/pyuniprot.git@9462a6042c7c9295415a5eb589b77b27cb7c142b#egg=PyUniProt
PyYAML==3.12
requests==2.18.4
six==1.11.0
SQLAlchemy==1.1.15
tqdm==4.19.4
urllib3==1.22
Werkzeug==0.12.2
WTForms==2.1

Remove sudo from the installation docs

It is generally not recommended to use sudo for package installation.

I wonder if there are really use cases where the "System wide installation" as noted in the documentation is useful/desired. On the other hand, using sudo with pip on a system Python installation can change packages that are used in system tooling which may be detrimental and hard to recover from.

If sudo really needs to remain in the docs, it would be great to ship along when to use it and a warning message cautioning users not to use it unless they know the dangers that come with using sudo.

Error on Update (Redux)

I got farther this time (since the resolution to #1) but I am running into a new problem:

In [1]: import pyuniprot

In [2]: pyuniprot.update(taxids=[9606])
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-89507300518a> in <module>()
----> 1 pyuniprot.update(taxids=[9606])

~/dev/pyuniprot/src/pyuniprot/manager/database.py in update(connection, urls, force_download, taxids)
    567 
    568     db = DbManager(connection)
--> 569     db.db_import_xml(urls, force_download, taxids)
    570     db.session.close()
    571 

~/dev/pyuniprot/src/pyuniprot/manager/database.py in db_import_xml(self, url, force_download, taxids)
    204         xml_gzipped_file_path = DbManager.download(url, force_download)
    205         self._create_tables()
--> 206         self.import_xml(xml_gzipped_file_path, taxids)
    207         self.session.close()
    208 

~/dev/pyuniprot/src/pyuniprot/manager/database.py in import_xml(self, xml_gzipped_file_path, taxids)
    214         start = False
    215 
--> 216         xml_file = gunzip_file(xml_gzipped_file_path)
    217 
    218         with xml_file.open('r') as fd:

~/dev/pyuniprot/src/pyuniprot/manager/database.py in gunzip_file(path)
     80         gzipped_file = gzip.open(str(gzipped_path), 'rb')
     81         extracted_file = extracted_path.open('wb')
---> 82         extracted_file.write(gzipped_file.read())
     83         gzipped_file.close()
     84         extracted_file.close()

OSError: [Errno 22] Invalid argument

Is there a way to do this with context managers? I would love to know if you could write the following code when you're using paths

with gzip.open(str(gzipped_path), 'rb') as gzipped_file, extracted_path.open('wb') as extracted_file:
   extracted_file.write(gzipped_file.read())

Btw, there might also be something already built into shutil for copying a file from one place to another.
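
Both gzip.open and Path.open return context managers, so the two can share one with statement, and shutil.copyfileobj copies in chunks instead of holding the whole decompressed file in memory. The sketch below is a possible rewrite of gunzip_file under these assumptions, not the version that ended up in the package.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gzipped_path):
    """Extract file.xml.gz next to itself as file.xml and return the new path."""
    gzipped_path = Path(gzipped_path)
    extracted_path = gzipped_path.with_suffix("")  # drops the trailing .gz
    with gzip.open(str(gzipped_path), "rb") as source, \
            extracted_path.open("wb") as target:
        # Chunked copy; both files are closed even if an error occurs.
        shutil.copyfileobj(source, target)
    return extracted_path
```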

AttributeError: 'NoneType' object has no attribute 'get'

Hello,

I am running into an error while trying to retrieve human protein_IDs. I installed the package from the master branch as suggested in Issue #26. Here is the read out from my code:

`pyuniprot.set_connection("sqlite:///" + db_name)
pyuniprot.update(force_download = True, taxids=[9606])

Database uniprot.db formed.
Import 174181022 lines: 0%| | 345071/174181022 [00:02<24:58, 116040.97it/s]


AttributeError Traceback (most recent call last)
/var/folders/hw/0w064hss2mv96tfz78bybyrc0000gn/T/ipykernel_37297/3370700064.py in
15 # Pull Uniprot data into created database and filter for gene names
16 pyuniprot.set_connection("sqlite:///" + db_name)
---> 17 pyuniprot.update(force_download = True, taxids=[9606])

~/opt/anaconda3/lib/python3.8/site-packages/pyuniprot/manager/database.py in update(connection, urls, force_download, taxids, silent)
834 config.write(config_file)
835 log.info('create configuration file {}'.format(cfp))
--> 836 else:
837 config.read(cfp)
838 config.set('database', 'sqlalchemy_connection_string', connection)

~/opt/anaconda3/lib/python3.8/site-packages/pyuniprot/manager/database.py in db_import_xml(self, url, force_download, taxids, silent)
158 self._drop_tables()
159 xml_file_path, version_file_path = self.download_and_extract(url, force_download)
--> 160 self._create_tables()
161 self.import_version(version_file_path)
162 self.import_xml(xml_file_path, taxids, silent)

~/opt/anaconda3/lib/python3.8/site-packages/pyuniprot/manager/database.py in import_xml(self, xml_gzipped_file_path, taxids, silent)
231
232 # @Profile
--> 233 def update_entry_dict(self, entry, entry_dict, taxid):
234 """
235

~/opt/anaconda3/lib/python3.8/site-packages/pyuniprot/manager/database.py in insert_entries(failed resolving arguments)
256 ec_numbers = self.get_ec_numbers(entry)
257 alternative_full_names = self.get_alternative_full_names(entry)
--> 258 alternative_short_names = self.get_alternative_short_names(entry)
259 disease_comments = self.get_disease_comments(entry)
260 tissue_specificities = self.get_tissue_specificities(entry)

~/opt/anaconda3/lib/python3.8/site-packages/pyuniprot/manager/database.py in insert_entry(self, entry, taxids)
278 keywords=keywords,
279 ec_numbers=ec_numbers,
--> 280 alternative_full_names=alternative_full_names,
281 alternative_short_names=alternative_short_names,
282 disease_comments=disease_comments,

~/opt/anaconda3/lib/python3.8/site-packages/pyuniprot/manager/database.py in get_taxid(cls, entry)
626 pmid_dict = dict(citation.attrib)
627 if not re.search('^\d+$', pmid_dict['volume']):
--> 628 pmid_dict['volume'] = -1
629
630 del pmid_dict['type'] # not needed because already filtered for PubMed

AttributeError: 'NoneType' object has no attribute 'get'`

I noticed that there is no .xml.gz anywhere in the package site if that might be an issue.

Side note - I am only interested in getting ~150 protein IDs so is there a way to update the database with only gene names?

zcat fails on mac

On macOS, zcat needs to be run as
zcat < filename.gz
instead of just zcat filename.gz
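
A platform-independent alternative is to count the lines with Python's gzip module, so that no zcat or gzcat call is needed. This is a sketch of that idea, with count_lines as a hypothetical helper. It is likely slower than the shell pipeline on very large files.

```python
import gzip


def count_lines(gzipped_path):
    """Count the lines of a gzip-compressed file without calling zcat."""
    with gzip.open(gzipped_path, "rb") as handle:
        return sum(1 for _ in handle)
```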

Memory leak in update function

import pyuniprot
pyuniprot.update(taxids=[9606,10090,10116]) # human, mouse, rat

After ~42k entries in the database, the update process consumes ~12 GB of memory. I assume the problem is in lxml; I found several articles describing similar problems with big XML files. BUT here I avoid loading the whole document with iterparse (I tested this: it starts directly at 5 GB of memory consumption and then increases constantly). If the problem is not solved, it does not seem feasible to load the whole of UniProt.
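
A common pattern for keeping memory bounded with large XML files is to stream elements with iterparse and to clear the root after each processed entry, so that finished entries are not kept in the tree. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml. Whether it resolves the leak described here is untested, and the tag name is an assumption (real UniProt XML uses a namespace, so the tag would need its qualified form).

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield each completed element with the given tag, then release it."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # drop processed children so the tree does not grow


sample = io.BytesIO(b"<uniprot><entry name='A'/><entry name='B'/><other/></uniprot>")
names = [entry.get("name") for entry in iter_entries(sample)]
print(names)
```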

build database failed

$ /usr/bin/pyuniprot update --taxids 9606
WARNING: Update is very time consuming and can take several
hours depending which organisms you are importing!

Import 160939723 lines: 0%| | 0/160939723 [00:00<?, ?it/s]Exception KeyError: KeyError(<weakref at 0x7eff46857b50; to 'tqdm' at 0x7eff468e3c90>,) in <bound method tqdm.del of Import 160939723 lines: 0%| | 0/160939723 [00:00<?, ?it/s]> ignored
Traceback (most recent call last):
File "/usr/bin/pyuniprot", line 11, in
sys.exit(main())
File "/usr/lib/python2.7/site-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/pyuniprot/cli.py", line 72, in update
database.update(taxids=taxids, connection=conn, force_download=force_download, silent=silent)
File "/usr/lib64/python2.7/site-packages/pyuniprot/manager/database.py", line 836, in update
db.db_import_xml(urls, force_download, taxids, silent)
File "/usr/lib64/python2.7/site-packages/pyuniprot/manager/database.py", line 160, in db_import_xml
self.import_xml(xml_gzipped_file_path, taxids, silent)
File "/usr/lib64/python2.7/site-packages/pyuniprot/manager/database.py", line 233, in import_xml
self.insert_entries(entry_xml, taxids)
File "/usr/lib64/python2.7/site-packages/pyuniprot/manager/database.py", line 254, in insert_entries
entries = etree.fromstring(entries_xml)
File "", line 124, in XML
cElementTree.ParseError: mismatched tag: line 127, column 2

PYUNIPROT_DIR and PYUNIPROT_DATA_DIR should be more easily configurable

These are hard-coded to use the users home directory, but there are cases where the quota on the home directory is limited.

PYUNIPROT_DIR and PYUNIPROT_DATA_DIR should be configurable as settings and/or as an environment variable.

eg pyuniprot.set_data_dir("/scratch/data/")

or

export PYUNIPROT_DATA_DIR=/scratch/data

captured in Python (eg in constants.py) by:

PYUNIPROT_DATA_DIR = os.environ.get('PYUNIPROT_DATA_DIR', os.path.join(PYUNIPROT_DIR, 'data'))


It's worth noting that you can override these variables by directly setting pyuniprot.manager.database.PYUNIPROT_DIR = 'some/path' and pyuniprot.manager.database.PYUNIPROT_DATA_DIR = 'some/path/data' but this feels like a workaround.
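
Extending the one-liner above to both variables could look like the sketch below. resolve_dirs is a hypothetical function. The default base directory ~/.pyuniprot is taken from the file path mentioned in the README, and everything else is an assumption.

```python
import os
from pathlib import Path


def resolve_dirs(environ=os.environ):
    """Return (base_dir, data_dir), honouring the two environment variables."""
    base = Path(environ.get("PYUNIPROT_DIR", str(Path.home() / ".pyuniprot")))
    # The data directory defaults to a subfolder of the (possibly overridden) base.
    data = Path(environ.get("PYUNIPROT_DATA_DIR", str(base / "data")))
    return base, data


print(resolve_dirs({"PYUNIPROT_DATA_DIR": "/scratch/data"}))
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.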

sqlite3.ProgrammingError in query functions

example

q.get_entry()

ProgrammingError: (sqlite3.ProgrammingError) SQLite objects created in a thread can only be used in that same thread.The object was created in thread id 140377400882944 and this is thread id 140377636747008 [SQL: 'SELECT pyuniprot_entry.id AS pyuniprot_entry_id, pyuniprot_entry.dataset AS pyuniprot_entry_dataset, pyuniprot_entry.created AS pyuniprot_entry_created, pyuniprot_entry.modified AS pyuniprot_entry_modified, pyuniprot_entry.version AS pyuniprot_entry_version, pyuniprot_entry.name AS pyuniprot_entry_name, pyuniprot_entry.recommended_full_name AS pyuniprot_entry_recommended_full_name, pyuniprot_entry.recommended_short_name AS pyuniprot_entry_recommended_short_name, pyuniprot_entry.taxid AS pyuniprot_entry_taxid, pyuniprot_entry.gene_name AS pyuniprot_entry_gene_name \nFROM pyuniprot_entry'] [parameters: [{}]]
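
The error comes from SQLite's default check that a connection is only used in the thread that created it. The sketch below reproduces the situation with the standard library and disables the check with check_same_thread=False, guarding access with a lock. With SQLAlchemy, the same flag can be passed through create_engine(..., connect_args={"check_same_thread": False}). Whether that is the right fix for PyUniProt is an assumption.

```python
import sqlite3
import threading

# Allow the connection to be used from threads other than the creating one.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE entry (name TEXT)")
conn.execute("INSERT INTO entry VALUES ('1433E_HUMAN')")
conn.commit()

results = []
lock = threading.Lock()  # serialize access, since the check is now disabled


def worker():
    with lock:
        results.extend(conn.execute("SELECT name FROM entry").fetchall())


thread = threading.Thread(target=worker)
thread.start()
thread.join()
print(results)
```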
