
biosql's Introduction

====================================
BioSQL
====================================

This is the BioSQL distribution. BioSQL is a generic unifying schema
for storing sequences from different sources, for instance Genbank or
Swissprot.

BioSQL is meant to be a common data storage layer supported by all the
different Bio* projects, Bioperl, Biojava, Biopython, and Bioruby.
Entries stored through an application written in, say, Bioperl could
be retrieved by another written in Biojava.

There are currently four different RDBMSs supported: MySQL,
PostgreSQL, Oracle, and most recently SQLite. The MySQL schema
DDL is in sql/biosqldb-mysql.sql, the PostgreSQL schema is in
sql/biosqldb-pg.sql, the Oracle schema is in multiple files in
the directory sql/biosql-ora, and the SQLite schema is in
sql/biosqldb-sqlite.sql. To instantiate the schema, feed the
respective file or files to your SQL shell (e.g. mysql for MySQL,
psql for PostgreSQL, or sqlite3 for SQLite).
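
For example, for MySQL (the database name bioseqdb is only
illustrative, and SOURCE is a command of the mysql client shell):

    CREATE DATABASE bioseqdb;
    USE bioseqdb;
    SOURCE sql/biosqldb-mysql.sql;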

Note the SQLite schema is new, and is not yet supported by all the
Bio* projects (at the time of writing, just Biopython supports this).

======================================================================
INSTALLATION NOTES
======================================================================

Generally, the support for each RDBMS is maintained individually,
as there are too many differences between the RDBMSs for automatic
generation from a single DDL. Please see the INSTALL file for details.

======================================================================
POST v1.0 TODO
======================================================================

The following changes are scheduled for the next release of
BioSQL. These changes are incompatible with the 1.0 schema and
will require a database migration.

  - Rename the comment table to anncomment, and rename
    term_synonym.synonym to term_synonym.name, in the MySQL and
    PostgreSQL schemas. This will make the names consistent with the
    Oracle schema, avoiding Oracle reserved words.
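
A hedged sketch of what that migration might look like on MySQL (the
column definition shown is an assumption and should be checked against
the actual schema before running anything):

    ALTER TABLE comment RENAME TO anncomment;
    ALTER TABLE term_synonym
        CHANGE synonym name VARCHAR(255) BINARY NOT NULL;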

biosql's People

Contributors

amackey, ctskennerton, hlapp, nimne, peterjc, saketkc


biosql's Issues

load_ncbi_taxonomy.pl (Can't call method "execute" on an undefined value)

I tried loading the taxonomy into an SQLite database.

perl load_ncbi_taxonomy.pl --dbname database --driver SQLite --download --directory db_work

Downloading NCBI taxon database to db_work
Loading NCBI taxon database in db_work:
... retrieving all taxon nodes in the database
... reading in taxon nodes from nodes.dmp
... insert / update / delete taxon nodes
Can't call method "execute" on an undefined value at load_ncbi_taxonomy.pl line 465.

Can you check it?

Adding a prefix index to seqfeature_qualifier_value.value

Migrated from https://redmine.open-bio.org/issues/2902 filed by @kellrott on 2009-08-06,

Performance of sequence feature searching can be improved by adding an index on the 'value' column of the 'seqfeature_qualifier_value' table.
Because the 'value' field is TEXT it cannot be fully indexed, but a prefix index (for example, on the first 5 characters) can still substantially speed up many searches.

For example, the search

SELECT * FROM seqfeature_qualifier_value WHERE value='NP_808915.1';

When queried against a table with ~28,000,000 records, it takes about a minute without the index and less than a second with it.

This can be implemented by adding the command:

CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value(5));

Add explicit compound primary keys to tables like taxon_name

Migrated from https://redmine.open-bio.org/issues/2765 which I filed on 2009-02-18,

This issue was discussed on the mailing list, and Hilmar said he planned to deal with this in BioSQL v1.0.2 (which I have defined and set as the milestone for this bug).

Some tables do not have an explicit primary key, but instead just have a uniqueness constraint. Making these into explicit compound primary keys would be helpful for some ORM software (e.g. the Python package SQLAlchemy).

This affects tables taxon_name, bioentry_path, term_relationship, bioentry_qualifier_value, seqfeature_path (but I think that is all).

e.g. in the MySQL schema, taxon_name has a unique constraint but does NOT have a primary key:

CREATE TABLE taxon_name (
    taxon_id INT(10) UNSIGNED NOT NULL,
    name VARCHAR(255) BINARY NOT NULL,
    name_class VARCHAR(32) BINARY NOT NULL,
    UNIQUE (taxon_id,name,name_class)
) TYPE=INNODB;

This would become:

CREATE TABLE taxon_name (
    taxon_id INT(10) UNSIGNED NOT NULL,
    name VARCHAR(255) BINARY NOT NULL,
    name_class VARCHAR(32) BINARY NOT NULL,
    PRIMARY KEY (taxon_id,name,name_class)
) TYPE=INNODB;
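
In practice the change could be rolled out as an ALTER statement; a
hedged sketch for MySQL (the name of the existing unique index is an
assumption; check SHOW CREATE TABLE taxon_name first):

    ALTER TABLE taxon_name
        DROP INDEX taxon_id,
        ADD PRIMARY KEY (taxon_id, name, name_class);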

Rank is a reserved word in MySQL 8.0.2 onwards

According to https://dev.mysql.com/doc/refman/8.0/en/keywords.html MySQL v8.0.2 made RANK a reserved word.

BioSQL has multiple columns named rank, including:

  • term_dbxref.rank
  • bioentry_relationship.rank
  • dbxref_qualifier_value.rank
  • bioentry_dbxref.rank
  • bioentry_reference.rank
  • comment.rank
  • bioentry_qualifier_value.rank
  • seqfeature.rank
  • seqfeature_relationship.rank
  • seqfeature_qualifier_value.rank
  • seqfeature_dbxref.rank
  • location.rank

This means previously valid SQL commands may no longer work.

This was logged against Biopython by @lhovo after this MySQL change broke loading sequences into the database:

biopython/biopython#1882

It appears we just need to wrap the SQL column name in back-ticks as a workaround, so this should not need a schema change in BioSQL itself.

However, it could also affect the BioPerl, BioJava and BioRuby bindings (though perhaps not, if they use a more abstracted ORM rather than raw SQL).
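
For reference, a minimal sketch of the back-tick workaround on MySQL
8.0.2+ (the inserted values are purely illustrative):

    -- `rank` must be quoted because RANK is now a reserved word
    INSERT INTO seqfeature_qualifier_value (seqfeature_id, term_id, `rank`, value)
    VALUES (1, 1, 0, 'example value');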

Document how to store value-less qualifiers like /pseudo

How should feature qualifiers in EMBL/GenBank files with no value like /pseudo be recorded in BioSQL?

Notice the schema blocks the use of NULL, e.g. https://github.com/biosql/biosql/blob/master/sql/biosqldb-mysql.sql#L488

-- tag/value associations - or ontology annotations
CREATE TABLE seqfeature_qualifier_value (
    seqfeature_id       INT(10) UNSIGNED NOT NULL,
    term_id         INT(10) UNSIGNED NOT NULL,
    rank            SMALLINT NOT NULL DEFAULT 0,
    value           TEXT NOT NULL,
    PRIMARY KEY (seqfeature_id,term_id,rank)
) ENGINE=INNODB;
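
One possible convention (an assumption on my part, not an agreed answer
to this issue) is to store value-less qualifiers such as /pseudo with an
empty string, since the value column is NOT NULL; the ids below are
illustrative:

    INSERT INTO seqfeature_qualifier_value (seqfeature_id, term_id, `rank`, value)
    VALUES (1, 2, 0, '');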

Wait time for SQLite with NCBI taxonomy.

Does anyone have any experience using SQLite as a driver for loading NCBI taxonomy data into a BioSQL database?

I'm running the Perl script for this and it's taking forever. I've been running it for more than 96 hours and the database file is only 169MB. It's definitely still loading, it's just incredibly slow.

In the past I thought I did this much faster, but maybe I'm wrong...

Here are the commands that I used:

$ sqlite3 NCBI_BIOSQLITE.db -echo < biosqldb-sqlite.sql
$ ./load_ncbi_taxonomy.pl --db_name NCBI_BioSQLite.db --driver SQLite --download true

Unable to remove database using server.remove_database

Just tried to remove a sub-database, but it gives an error.

from BioSQL import BioSeqDatabase
from Bio import Entrez
from Bio import SeqIO


server = BioSeqDatabase.open_database(
    driver="mysql.connector",
    user="biosqlUser",
    passwd="b10@$ql12",
    host="localhost",
    db="biosql",
)

db = server["orchids"]

server.remove_database("orchids") 
server.commit()

Traceback (most recent call last):
  File "test.py", line 23, in <module>
    server.remove_database("orchids")
    ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DBServer' object has no attribute 'remove_database'. Did you mean: 'new_database'?
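
If all you need right now is to drop the sub-database, and the client
API does not expose a call for it, a hedged SQL sketch (whether the
dependent bioentry rows disappear automatically depends on whether your
schema's foreign keys were created with ON DELETE CASCADE, so check
before running this):

    DELETE FROM biodatabase WHERE name = 'orchids';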

Root nodes in the taxon table should not self reference

Migrated from RedMine https://redmine.open-bio.org/issues/2664 which I filed on 2008-11-16,

The BioSQL script load_ncbi_taxonomy.pl currently (BioSQL v1.0.1) will create a root node whose parent_taxon_id is self-referencing (rather than NULL).

This is bad for at least two reasons,

  1. Anyone extracting a lineage with recursion or a loop from the parent_taxon_id references risks an infinite recursion/loop.

  2. Anyone wanting to get a list of child nodes will find the root node included in its own children.
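
For illustration, a query that would expose such a self-referencing root
(column names as in the stock taxon table):

    SELECT taxon_id, ncbi_taxon_id
      FROM taxon
     WHERE parent_taxon_id = taxon_id;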

See:

Error loading the NCBI Taxonomy DB: "Can't call method "execute" on an undefined value"

From a freshly cloned repo:

 ./load_ncbi_taxonomy.pl --dbname NCBI_BioSQLite.db --driver SQLite --download; echo $?
Downloading NCBI taxon database to taxdata
Loading NCBI taxon database in taxdata:
	... retrieving all taxon nodes in the database
	... reading in taxon nodes from nodes.dmp
	... insert / update / delete taxon nodes
Can't call method "execute" on an undefined value at ./load_ncbi_taxonomy.pl line 465.
255

Redmine migration: load_seqdatabase.pl crashes on varchar(n)

Added by Erikjan over 10 years ago. Updated over 7 years ago.

(bioperl-live, bioperl-db, biosql/schema fresh from CVS)

Using load_seqdatabase.pl to load uniprot_sprot.dat into
postgres (8.2.5), biosql gave me:

ERROR: value too long for type character varying(40)

(all fresh from cvs)

Changing 'VARCHAR' to 'TEXT' fixed those errors.

Generally, VARCHAR(n) (i.e. character varying) is probably better
replaced with TEXT, as in PostgreSQL VARCHAR is implemented as TEXT
with a check on length.

Attached is a patch for the table definitions in
biosqldb-pg.sql; I haven't changed the other scripts, as I
don't know if you will agree with this improvement.

(The other table-creating scripts are probably also better
off with a change from VARCHAR (=character varying) to the
text datatype.)
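
As an illustration of the kind of change the patch makes (the column
named here is only a guess at the one behind the 40-character error):

    -- PostgreSQL: widen a VARCHAR column to TEXT
    ALTER TABLE bioentry ALTER COLUMN name TYPE TEXT;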

failed to update node MySQL server has gone away at load_ncbi_taxonomy.pl

Could you check this issue?

$ perl load_ncbi_taxonomy.pl --dbname biosql  --driver mysql --dbuser dbuser --dbpass dbpass --directory /Taxonomy/20170131 --verbose=2
Loading NCBI taxon database in /Taxonomy/20170131:
        ... retrieving all taxon nodes in the database
        ... reading in taxon nodes from nodes.dmp
        ... insert / update / delete taxon nodes
failed to update node (770;997;1937963;species;11;0): MySQL server has gone away at load_ncbi_taxonomy.pl line 677.
rollback ineffective with AutoCommit enabled at load_ncbi_taxonomy.pl line 919.
Use of uninitialized value in concatenation (.) or string at load_ncbi_taxonomy.pl line 921.
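
For context (this is not part of the original report), "MySQL server
has gone away" during a long load is commonly a server-side timeout or
packet-size limit; a sketch of the first settings worth checking, with
an illustrative new value:

    SHOW VARIABLES LIKE 'wait_timeout';
    SHOW VARIABLES LIKE 'max_allowed_packet';
    SET GLOBAL max_allowed_packet = 67108864;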

Using python to create BioSQL databases.

Would there be any support for creating a Python module that creates template BioSQL databases?

Is this technology too old? Or is there a better way to do this already?

If people are interested, we could...

  1. Convert the Perl scripts to Python and make them part of the module.
  2. Update the PhyloDB scripts and schema.
  3. Give users better documentation.

How to store fuzzy locations

I'm trying to add partial location support to the python version of BioSQL (see biopython/biopython#1639) and would like some feedback on how to do it.

There is a comment in the BioSQL schema about using the location_qualifier_value table to store fuzzy locations. My reading of the comments in the location table suggests that whenever a NULL value is encountered, that location should be looked up in the location_qualifier_value table to retrieve additional information about the fuzzy location. There is also a very old commit in the BioPerl version which does just this. However, I don't know if this is still how it works (I'm unable to get the Perl version to install).

Are there an ontology name and term names that I should use for the Biopython API? If not, I would propose creating a "fuzzy_qualifiers" ontology and adding to it the terms "start_before_position", "start_after_position", "end_before_position" and "end_after_position".
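
For concreteness, a sketch of how that proposal could look in SQL (the
ids and the term are hypothetical; the column layout is that of the
stock location_qualifier_value table):

    -- assume term 42 is 'start_before_position' in a 'fuzzy_qualifiers' ontology
    INSERT INTO location_qualifier_value (location_id, term_id, value, int_value)
    VALUES (100, 42, '<10', 10);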

Let me know what you think.

Redmine migration: BioPerl / BioSQL - InterPro Optional IDs not parsed

Added by johnraekwon about 9 years ago. Updated over 7 years ago.

This bug was first posted on the BioSQL-l, and it was requested that I post it here.

The problem is that the InterPro IDs (a.k.a. the optional ID) are not entered into the BioSQL database when using the script load_seqdatabase.pl.

This information should probably go into the dbxref_qualifier_value table. In fact, this table is not utilized at all.
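
For concreteness, a hedged sketch of what such a row might look like
(the ids and the term are hypothetical, and this is not how any of the
current bindings behave):

    -- assume term 77 is an 'optional_id' term in some annotation ontology
    INSERT INTO dbxref_qualifier_value (dbxref_id, term_id, `rank`, value)
    VALUES (500, 77, 0, 'Pkinase');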

The reply to my original post on BioSQL-l is shown here:

The problem here is that Bioperl-db (the persistence mapper between BioSQL and BioPerl) loses the optional_id property of Bio::Annotation::DBLink objects.

Moreover, the dbxref table in BioSQL doesn't actually provide for the opportunity to store two identifiers (or accessions) for one db_xref, so storing this bit of information is actually not as straightforward as one might wish b/c it would need to go into the dbxref_qualifier_value table, and I would not be surprised if the other Bio* projects with a mapping to BioSQL don't store or retrieve this either (though it'd be good to hear if anyone does).

Here are a couple of ideas for how this issue might be addressed.

  • Write a Bio::Seq::BaseSeqProcessor-derived object that for every incoming sequence massages all Interpro links to either substitute the primary_id with the optional_id, or to add a second DBLink annotation with the optional_id of the original one as its primary_id. (pros: relatively easy, entirely under your control; cons: you either lose the primary_id now, or have two dbxref annotations for each of the original ones.)

  • Add a column to the dbxref table, and code to Bioperl-db, that store, de/serialize the extra ID. (pros: not losing or duplicating any data; cons: change is significant in terms of schema stability, requires new release, depends on implementation in Bioperl-db, necessitates update of all other Bio* language bindings)

  • De/serialize the optional_id as an entry in the dbxref_qualifier_value table. (pros: technically it's the Right Way as that's what the table was intended for; cons: implementing in Bioperl-db is more involved as we now need to transform an object property to a child object and back)

So I'd say this is a bug in Bioperl-db in that the dbxref_qualifier_value table isn't utilized here. Would you mind filing it? In the meantime, if you just need something that works, you could try the first of the above ideas.

-hilmar

On Jul 3, 2009, at 7:17 PM, John LaCava wrote:

Hi all,

Tried this on the BioPerl-l but seemed to make sense to try here as well.

I am trying to use the BioPerl-db script:

"load_seqdatabase.pl" to parse a SwissProt ".dat" file (Yeast.dat, this is the yeast proteome with annotations etc.).

The particular entry I am interested in is the InterPro optional ID, which is the domain name.

I have put a short stub up which displays the 4 pieces of info I want to parse into my database.
That can be found here:

http://github.com/johnraekwon/BioPerl---BioSQL---InterPro-Optional-IDs/tree/master

You can see that near the bottom, we get the optional ID:
$protein_ids->{interpro_domain} = $dblink->{optional_id};

I do not think the bioperl script load_seqdatabase.pl retrieves this information. At least, I cannot find it in the db built from parsing a test .dat file.
I would like some help figuring out:

  1. WHY doesn't it retrieve this information, since it seems to be parsing "all" annotations...
  2. HOW might I edit the script to include this particular annotation of interest in the info it passes to my db (biosql)

I am a bit out of my depth on this, and so, any help is appreciated.

Cheers,
John

Redmine migration: provide SQL dumps of biosql releases with real data

Added by Martin Mokrejs about 10 years ago. Updated over 7 years ago.

Hi,
I believe you do test your scripts/*.pl on real data, like the NCBI taxonomy database. Would you please provide SQL dumps from the PostgreSQL and MySQL databases, at least of the relevant tables? Even better if also from Oracle. This would allow for easier import of the data and also provide a good reference dataset. Please also provide the original data which were imported; that can aid in testing changes to the scripts/*.pl files. For MySQL I would prefer the 'mysqldump biosqldb --extended-insert' format.

question: how to retrieve the full lineage of a species

Hi,

I'm using the NCBI E-utilities for this (via Bio::DB::Taxonomy). How can I get something like this, but only using an SQL query:

$ perl ~/tmp/test-get-lineage.pl 'Zea mays' |grep -v 'no rank'|sed -e 's/   / /g'
superkingdom	Eukaryota
  kingdom	Viridiplantae
  phylum	Streptophyta
        class	Liliopsida
        subclass	commelinids
          order	Poales
          family	Poaceae
            subfamily	Panicoideae
            tribe	Andropogoneae
              subtribe	Tripsacinae
              genus	Zea
              species	Zea mays
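
One pure-SQL possibility (a sketch, not from the original thread) uses
the nested-set left_value/right_value columns that load_ncbi_taxonomy.pl
maintains; the species name is the one from the example above:

    SELECT ancestor.node_rank, ancestor_name.name
      FROM taxon node
      JOIN taxon_name node_name
        ON node_name.taxon_id = node.taxon_id
       AND node_name.name_class = 'scientific name'
       AND node_name.name = 'Zea mays'
      JOIN taxon ancestor
        ON ancestor.left_value <= node.left_value
       AND ancestor.right_value >= node.right_value
      JOIN taxon_name ancestor_name
        ON ancestor_name.taxon_id = ancestor.taxon_id
       AND ancestor_name.name_class = 'scientific name'
     ORDER BY ancestor.left_value;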

Thank you

Move left/right calculation into separate script

Migrated from https://redmine.open-bio.org/issues/2493 which I filed on 2008-04-23,

This was raised on the mailing list:
http://lists.open-bio.org/pipermail/biosql-l/2008-April/001248.html

In addition to loading the NCBI taxonomy, the load_ncbi_taxonomy.pl script also (re)calculates the left/right values.

It would be useful to have a separate BioSQL script which ONLY calculates or recalculates the left/right values.

Possible use-cases include people using a non-NCBI taxonomy.

For anyone wondering what the left/right values are, Hilmar wrote:

they hold the nested set values. Nested sets are an enumeration
algorithm described in Joe Celko's SQL for Smarties books, and Aaron
Mackey gives a good introduction here:

http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

(This is in the schema DDL file, though obviously should be
documented better. Good candidate for an FAQ, I suppose.)

Quoted from http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html
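
To make the nested-set idea concrete, a small sketch (the NCBI taxon id
is illustrative): every descendant of a node falls strictly inside that
node's left_value/right_value interval.

    SELECT child.taxon_id, child.node_rank
      FROM taxon parent
      JOIN taxon child
        ON child.left_value > parent.left_value
       AND child.right_value < parent.right_value
     WHERE parent.ncbi_taxon_id = 4577;  -- e.g. Zea mays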

P.S. I would suggest that the left/right calculations from load_ncbi_taxonomy.pl be moved into the new script, and then have load_ncbi_taxonomy.pl call the new script to avoid code duplication.

And from the discussion that followed,

Note that there are already two different BioSQL scripts for working with the taxon tables, load_ncbi_taxonomy.pl and load_itis_taxonomy.pl, which ideally should both be able to update the left/right values. Currently load_itis_taxonomy.pl doesn't.

My suggestion that the left/right code be moved into a new script allows this to be invoked from both load_ncbi_taxonomy.pl and load_itis_taxonomy.pl without duplicating code.

Document how to store circular/linear attribute

The GenBank LOCUS and EMBL ID lines can record whether the sequence is circular or linear.

BioSQL should document how this is stored in the schema (even if this is just writing down how BioPerl does this as the de facto reference implementation).
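
One conceivable place for it (an assumption, not the documented
convention) is the generic bioentry_qualifier_value table, keyed on a
suitable topology term; the ids and the term below are hypothetical:

    INSERT INTO bioentry_qualifier_value (bioentry_id, term_id, value, `rank`)
    VALUES (1, 99, 'circular', 0);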

Bugzilla/RedMine issue 2839 - Remove RULES in BioSQL PostgreSQL schema

Copying the old issue from https://redmine.open-bio.org/issues/2839 which I filed 22 May 2009, about seven years ago:

This issue was originally reported as Biopython Bug 2833, a data corruption when (re)importing a record which matched/clashed with an existing entry in the database.

On MySQL (and presumably Oracle etc.), this triggered an error (good), but on PostgreSQL the database was silently corrupted. This was traced to some RULES in the BioSQL PostgreSQL schema added for bioperl-db, as explained here (and in the PostgreSQL schema),

http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006088.html

Hilmar wrote:

Indeed, I'd almost forgotten. The rules are there mostly as a remnant from
earlier versions of PostgreSQL to support transactional loading the way
bioperl-db (the object-relational mapping for BioPerl) is optimized. You
probably don't need them anywhere else.

-hilmar

Bioperl-db is optimized such that entities that very likely don't exist yet in the database are attempted for insert right away. If the insert fails due to a unique key violation, the record is looked up (and then expected to be found). In Oracle and MySQL you can do this and the transaction remains healthy; i.e., you can commit the transaction later and all statements except those that failed will be committed. In PostgreSQL any failed statement dooms the entire transaction, and the only way out is a rollback. In this case, if you want the loading of one sequence record as one transaction, failing to insert a single feature record will doom the entire sequence load and you would need to start over with the sequence. To fix this, I wrote the rules, which in essence do the lookups for PostgreSQL that the bioperl-db code would otherwise avoid, and on insert do nothing if the record is found, which results in zero rows affected when you would expect one (which is what bioperl-db cues off of and then triggers a lookup). The right way to do this meanwhile is to use nested transactions, which PostgreSQL has supported since v8.0.x, but I haven't gotten around to implementing support for that in Bioperl-db.

As discussed on the BioSQL (and Biopython) mailing lists, this bug is firstly to add a big warning about this issue to the INSTALL notes for PostgreSQL (i.e. recommend removing the RULES section if not using bioperl-db):

http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL

Secondly, once bioperl-db has been updated to use nested transactions (which will require PostgreSQL v8.0.x or later), then we can remove the rules themselves:
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-pg.sql

And my comment from the same year:

Just to note, for Biopython (Bug 2833) we now have a workaround in place in our repository. This comes at the cost of a run-time overhead (an extra query per import) and some code complexity, and I regard it as a short-term hack.

On PostgreSQL, Biopython will check if the rules are present and, if they are, issue a warning. Then, before each record import, an extra query checks whether the uniqueness constraint would be violated, and if so we raise an error.

Once the BioSQL schema is updated to remove the rules, we can drop this hack, because then the database itself will raise an error.

The Biopython issue referred to was https://redmine.open-bio.org/issues/2833 and the fix this refers to is biopython/biopython@21b0ba9

Biopython looks for either rule_bioentry_i1 or rule_bioentry_i2 (rather than the full set of these potentially problematic rules) and will issue a warning if present.
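
Once that happens, dropping the rules is straightforward; a sketch for
the two named above (assuming, as their names suggest, that they are
defined on the bioentry table; the full list is in biosqldb-pg.sql):

    DROP RULE rule_bioentry_i1 ON bioentry;
    DROP RULE rule_bioentry_i2 ON bioentry;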

Unmatched terms between `bioentry` and `bioentry.accession`?

Hi,

I'm having a bit of a rough time trying to understand the reason for the following behaviour:

$ sqlite3 nt.db "SELECT * FROM bioentry LIMIT 1;"
1|1|7845|XM_019431108|XM_019431108|XM_019431108.1|MAM|PREDICTED: Panthera pardus proline rich 11 (PRR11), transcript variant X5, mRNA|1
$ sqlite3 nt.db 'SELECT bioentry.accession from bioentry LIMIT 10'
AB091111

In the second query, I'd expect to see "XM_019431108", the accession for PREDICTED: Panthera pardus proline rich 11 (PRR11), transcript variant X5, mRNA, but instead I'm seeing "AB091111", which is the accession for Cyprinus carpio IgL2 mRNA for Immunoglobulin light chain, partial cds, clone:Cph2.

I'm not understanding why the two queries give distinct results and am worried that something is not right in my database.

Some background:

  • I'm using sqlite3, so for some of the tasks below I had to adapt some commands to this implementation (for example, in contrast with mysql or postgres, for which the examples were usually made)
  • loaded the taxonomy as per BioSQL instructions using the load_ncbi_taxonomy.pl script
  • I'm just using a subset of the NCBI nt database, currently with a random set of 1000 identifiers
  • the BioSQL database is filled with these identifiers using BioPython as per the instructions detailed here (basically with Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
  • my random subset of the NCBI nt database, includes both the identifiers "XM_019431108" and "AB091111"

Any insights that could shed some light on the behaviour reported?
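
For what it's worth, neither query has an ORDER BY, so SQLite is free to
return rows in whatever order is cheapest for each query plan; pinning
the order down makes the two results comparable (a sketch):

    SELECT bioentry_id, accession
      FROM bioentry
     ORDER BY bioentry_id
     LIMIT 10;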

Many thanks,
Pedro
