simonw / sqlite-utils
Python CLI utility and library for manipulating SQLite databases
Home Page: https://sqlite-utils.datasette.io
License: Apache License 2.0
This thread here is illuminating: https://stackoverflow.com/questions/3634984/insert-if-not-exists-else-update
The term UPSERT in SQLite has a specific meaning as of 3.24.0 (2018-06-04): https://www.sqlite.org/lang_UPSERT.html
It means "behave as an UPDATE or a no-op if the INSERT would violate a uniqueness constraint". The syntax in 3.24.0+ looks like this (confusingly, it does not use the term "upsert"):
INSERT INTO phonebook(name,phonenumber) VALUES('Alice','704-555-1212')
ON CONFLICT(name) DO UPDATE SET phonenumber=excluded.phonenumber
Here's the problem: the sqlite-utils .upsert() and .upsert_all() methods don't do this. They use the following SQL:
INSERT OR REPLACE INTO [{table}] ({columns}) VALUES {rows};
If the record already exists, it will be entirely replaced by a new record - as opposed to updating any specified fields while leaving the rest as they are (the behaviour of "upsert" in SQLite itself).
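The difference can be demonstrated directly with Python's built-in sqlite3 module (this is a standalone illustration, not sqlite-utils code, and the ON CONFLICT syntax needs SQLite 3.24.0+):

```python
import sqlite3

# INSERT OR REPLACE deletes the old row entirely, so columns not named in
# the statement are reset to NULL. ON CONFLICT ... DO UPDATE only touches
# the columns it names.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE phonebook (name TEXT PRIMARY KEY, phonenumber TEXT, notes TEXT)")
db.execute("INSERT INTO phonebook VALUES ('Alice', '704-555-1212', 'friend from work')")

# INSERT OR REPLACE: the unspecified 'notes' column is lost
db.execute("INSERT OR REPLACE INTO phonebook (name, phonenumber) VALUES ('Alice', '704-555-9999')")
replaced = db.execute("SELECT phonenumber, notes FROM phonebook WHERE name = 'Alice'").fetchone()

# Reset, then a true SQLite upsert: 'notes' survives
db.execute("DELETE FROM phonebook")
db.execute("INSERT INTO phonebook VALUES ('Alice', '704-555-1212', 'friend from work')")
db.execute(
    "INSERT INTO phonebook (name, phonenumber) VALUES ('Alice', '704-555-9999') "
    "ON CONFLICT(name) DO UPDATE SET phonenumber = excluded.phonenumber"
)
upserted = db.execute("SELECT phonenumber, notes FROM phonebook WHERE name = 'Alice'").fetchone()
```

After the first statement `notes` is NULL; after the second it is untouched.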
Spun off from #24 - it would be useful if CLI users could create new tables (with explicit column types, not-null rules and defaults) without having to insert an example record. Options could include:
--pk
--not-null
--default
--fk colname othertable othercol
--replace and --ignore
Utility method for fetching a record by its primary key.
Accepts a single value (for primary key / rowid tables) or a list/tuple of values (for compound primary keys, refs #36). Raises a NotFoundError if the record cannot be found.
I spotted a table which was created once and then had columns added to it and the formatted SQL looks like this:
CREATE TABLE [records] (
[type] TEXT,
[sourceName] TEXT,
[sourceVersion] TEXT,
[unit] TEXT,
[creationDate] TEXT,
[startDate] TEXT,
[endDate] TEXT,
[value] TEXT,
[metadata_Health Mate App Version] TEXT,
[metadata_Withings User Identifier] TEXT,
[metadata_Modified Date] TEXT,
[metadata_Withings Link] TEXT,
[metadata_HKWasUserEntered] TEXT
, [device] TEXT, [metadata_HKMetadataKeyHeartRateMotionContext] TEXT, [metadata_HKDeviceManufacturerName] TEXT, [metadata_HKMetadataKeySyncVersion] TEXT, [metadata_HKMetadataKeySyncIdentifier] TEXT, [metadata_HKSwimmingStrokeStyle] TEXT, [metadata_HKVO2MaxTestType] TEXT, [metadata_HKTimeZone] TEXT, [metadata_Average HR] TEXT, [metadata_Recharge] TEXT, [metadata_Lights] TEXT, [metadata_Asleep] TEXT, [metadata_Rating] TEXT, [metadata_Energy Threshold] TEXT, [metadata_Deep Sleep] TEXT, [metadata_Nap] TEXT, [metadata_Edit Slots] TEXT, [metadata_Tags] TEXT, [metadata_Daytime HR] TEXT)
It would be nice if the columns that were added later matched the indentation of the initial columns.
Needed by db-to-sqlite. It currently works by collecting all of the foreign key relationships it can find and then applying them at the end of the process.
The problem is, the add_foreign_key() method looks like this:
sqlite-utils/sqlite_utils/db.py
Lines 498 to 516 in 86bd2bb
That means it's doing a full VACUUM for every single relationship it sets up - and if you have hundreds of foreign key relationships in your database this can take hours.
I think the right solution is to have a .add_foreign_keys(list_of_args) method which does the bulk operation and then a single VACUUM. .add_foreign_key(...) can then call the bulk action with a single-item list.
This should work:
table = db["dog_breeds"].insert({
"dog_id": 1,
"breed_id": 2
}, pk=("dog_id", "breed_id"))
Needed for m2m work in #23
See also simonw/db-to-sqlite#18
Would this make it 100% safe to run reads against a database file that is being written to by another process?
Now that we're creating triggers (thanks to @amjith in #57) it would be neat if we could introspect them too.
I'm thinking:
db.triggers - lists all triggers for the database
db["tablename"].triggers - lists triggers for that table
The underlying query for this is select * from sqlite_master where type = 'trigger'
I'll return the trigger information in a new namedtuple, similar to how Indexes and ForeignKeys work.
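The underlying query can be sketched with plain sqlite3 (a standalone illustration of the introspection, not the proposed sqlite-utils API - here it returns plain tuples rather than the namedtuple mentioned above):

```python
import sqlite3

# Trigger metadata lives in sqlite_master rows with type = 'trigger';
# tbl_name links each trigger back to the table it fires on.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
db.execute(
    "CREATE TRIGGER docs_updated AFTER UPDATE ON docs "
    "BEGIN SELECT 1; END"
)

def triggers(db, table=None):
    # Returns (name, table, sql) tuples, optionally filtered to one table
    sql = "SELECT name, tbl_name, sql FROM sqlite_master WHERE type = 'trigger'"
    params = []
    if table is not None:
        sql += " AND tbl_name = ?"
        params.append(table)
    return db.execute(sql, params).fetchall()
```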
Since it's designed to be used in Jupyter or for rapid prototyping in an IDE (and it's still pretty small), sqlite-utils feels like a great candidate for me to finally try out Python type hints.
https://veekaybee.github.io/2019/07/08/python-type-hints/ is good.
It suggests the mypy docs for getting started: https://mypy.readthedocs.io/en/latest/existing_code.html plus this tutorial: https://pymbook.readthedocs.io/en/latest/typehinting.html
The GitHub JSON APIs have a really interesting convention with respect to related objects.
Consider https://api.github.com/repos/simonw/sqlite-utils/issues - here's a truncated subset:
{
"id": 449818897,
"node_id": "MDU6SXNzdWU0NDk4MTg4OTc=",
"number": 24,
"title": "Additional Column Constraints?",
"user": {
"login": "IgnoredAmbience",
"id": 98555,
"node_id": "MDQ6VXNlcjk4NTU1",
"avatar_url": "https://avatars0.githubusercontent.com/u/98555?v=4",
"gravatar_id": ""
},
"labels": [
{
"id": 993377884,
"node_id": "MDU6TGFiZWw5OTMzNzc4ODQ=",
"url": "https://api.github.com/repos/simonw/sqlite-utils/labels/enhancement",
"name": "enhancement",
"color": "a2eeef",
"default": true
}
],
"state": "open"
}
The user column lists a complete user. The labels column has a list of labels.
Since both user and label have a populated id field, this is actually enough information for us to create records for them AND set up the corresponding foreign key (for user) and m2m relationships (for labels).
It would be really neat if sqlite-utils had some kind of mechanism for correctly processing these kinds of patterns.
Thanks to jq there's not much need for extra customization of the shape here - if we support a narrowly defined structure, users can use jq to reshape arbitrary JSON to match.
Related to #2
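The core transformation could be sketched like this (every name here is hypothetical - flatten_issue is not a real sqlite-utils function - and the m2m half would additionally need a join table between issues and labels):

```python
# Hypothetical sketch: split nested objects that carry an "id" field into
# their own record lists, leaving a *_id foreign key column behind.
def flatten_issue(issue):
    row = dict(issue)
    users, labels = [], []
    user = row.pop("user", None)
    if user and "id" in user:
        users.append(user)          # becomes a row in a users table
        row["user_id"] = user["id"]  # foreign key left in place
    for label in row.pop("labels", []):
        if "id" in label:
            labels.append(label)     # becomes rows in a labels table
    return row, users, labels

issue = {
    "id": 449818897,
    "title": "Additional Column Constraints?",
    "user": {"login": "IgnoredAmbience", "id": 98555},
    "labels": [{"id": 993377884, "name": "enhancement"}],
    "state": "open",
}
row, users, labels = flatten_issue(issue)
```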
Both in the Python API and the CLI tool. For the CLI tool this should work:
$ sqlite-utils create-index mydb.db mytable col1 col2
This will create a compound index across col1 and col2. The name of the index will be automatically chosen unless you use the --name=... option.
Support a --unique option too.
The foreign_keys= argument currently takes a list of triples:
db["usages"].insert_all(
usages_to_insert,
foreign_keys=(
("line_id", "lines", "id"),
("definition_id", "definitions", "id"),
),
)
As of #16 we have a mechanism for detecting the primary key column (the third item in this triple) - we should use that here too, so foreign keys can be optionally defined as a list of pairs.
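The detection step can be sketched with PRAGMA table_info (a standalone illustration, not the sqlite-utils implementation - the helper name detect_pk is made up):

```python
import sqlite3

# Expand a (column, other_table) pair into a full triple by asking SQLite
# for the other table's primary key column.
def detect_pk(db, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    for cid, name, col_type, notnull, dflt, pk in db.execute(
        f"PRAGMA table_info([{table}])"
    ):
        if pk:
            return name
    return "rowid"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lines (id INTEGER PRIMARY KEY, text TEXT)")
pair = ("line_id", "lines")
triple = pair + (detect_pk(db, pair[1]),)
```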
One of my favourite features of csvs-to-sqlite is that it can "extract" columns into a separate lookup table - for example:
csvs-to-sqlite big_csv_file.csv -c country output.db
This will turn the country column in the resulting table into an integer foreign key against a new country table. You can see an example of what that looks like here: https://san-francisco.datasettes.com/registered-business-locations-3d50679/Business+Corridor was extracted from https://san-francisco.datasettes.com/registered-business-locations-3d50679/Registered_Business_Locations_-_San_Francisco?Business%20Corridor=1
I'd like to have the same capability in sqlite-utils - but with the ability to run it against an existing SQLite table rather than just against a CSV.
FTS5 can have porter stemming enabled.
The docs for eg inserting data should live on the same page, rather than being split across the API and CLI pages.
Sometimes when you are importing data the underlying source provides records without IDs that can be uniquely identified by their contents.
A utility mechanism for calculating a sha1 hash of the contents and using that as a unique ID would be useful.
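One possible shape for that mechanism (a sketch, not an existing sqlite-utils API): serialize the record deterministically, then hash it.

```python
import hashlib
import json

# Content-based ID: hash the record's JSON form with sorted keys and fixed
# separators, so the same contents always produce the same hash regardless
# of key order.
def hash_id(record):
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(encoded.encode("utf-8")).hexdigest()

record = {"name": "Cleo", "breed": "mutt"}
same_contents = {"breed": "mutt", "name": "Cleo"}
```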
If the input to sqlite-utils insert includes a column that is a JSON array or object, sqlite-utils query will introduce an extra level of quoting on output:
# echo '[{"key": ["one", "two", "three"]}]' | sqlite-utils insert t.db t -
# sqlite-utils t.db 'select * from t'
[{"key": "[\"one\", \"two\", \"three\"]"}]
# sqlite3 t.db 'select * from t'
["one", "two", "three"]
This might require an imperfect solution, since SQLite doesn't have a JSON type. Perhaps fields that start with [" or {" and end with "] or "} could be detected, with a flag to turn off that behavior for weird text fields (or vice versa).
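A simplified sketch of that heuristic (using plain bracket/brace prefixes rather than the exact `["` / `{"` test suggested above, and falling back to the raw string when parsing fails):

```python
import json

# Only attempt json.loads on values that look like serialized JSON
# arrays or objects; anything else passes through unchanged.
def maybe_json(value):
    if isinstance(value, str) and (
        (value.startswith("[") and value.endswith("]"))
        or (value.startswith("{") and value.endswith("}"))
    ):
        try:
            return json.loads(value)
        except ValueError:
            pass  # looked like JSON but wasn't - keep the raw text
    return value
```

The false-positive case ("weird text fields" that merely look like JSON) is exactly why a flag to disable the behavior would be needed.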
Sometimes it's good to have indices on all columns that are foreign keys, to allow for efficient reverse lookups.
This would be a useful utility:
$ sqlite-utils index-foreign-keys database.db
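The command could be sketched like this with stdlib sqlite3 (an illustration of the idea, not the sqlite-utils implementation; the idx_{table}_{column} naming scheme is an assumption):

```python
import sqlite3

# Walk every table, read its foreign keys with PRAGMA foreign_key_list,
# and create an index on each referencing column if one doesn't exist.
def index_foreign_keys(db):
    tables = [r[0] for r in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )]
    for table in tables:
        for fk in db.execute(f"PRAGMA foreign_key_list([{table}])"):
            column = fk[3]  # the "from" column in this table
            db.execute(
                f"CREATE INDEX IF NOT EXISTS [idx_{table}_{column}] "
                f"ON [{table}]([{column}])"
            )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE breeds (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, breed_id INTEGER REFERENCES breeds(id))")
index_foreign_keys(db)
indexes = [r[1] for r in db.execute("PRAGMA index_list([dogs])")]
```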
Hi !
Thanks for the lib !
Obviously, not every possible SQL query can have a dedicated method.
But I was thinking: a method to delete a row (I'm terrible with names - maybe delete_where() or something) would be useful.
I have a database with a primary key. For the moment, I use:
db.conn.execute(f"DELETE FROM table WHERE key = {key_id}")
db.conn.commit()
to delete a row I no longer need, given its primary key.
Works like a charm.
Just an idea:
table.delete_where_pkey({'key': key_id})
or something (I know, I'm terrible at naming methods...).
Pros: well, no need to write the SQL query.
Cons: WHERE normally allows many more things (operators =, <>, >, <, BETWEEN, not to mention AND, OR, etc.), so the method is maybe too specific, and/or a pain to make more flexible.
Again, just a thought. Writing your own SQL works too, so...
Thanks again.
See yah.
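One possible shape for such a method (a hypothetical sketch against plain sqlite3, not an existing sqlite-utils API) - note it binds the value as a parameter rather than interpolating key_id into an f-string, which also avoids SQL injection:

```python
import sqlite3

# Hypothetical delete_where() helper: a WHERE fragment plus bound
# parameters, so callers never build SQL values by string formatting.
def delete_where(db, table, where=None, args=None):
    sql = f"DELETE FROM [{table}]"
    if where:
        sql += f" WHERE {where}"
    db.execute(sql, args or [])
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (key INTEGER PRIMARY KEY, value TEXT)")
db.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
delete_where(db, "t", "key = ?", [2])
remaining = [r[0] for r in db.execute("SELECT key FROM t ORDER BY key")]
```

Passing a raw WHERE fragment keeps the flexibility the "Cons" paragraph worries about (BETWEEN, AND, OR all still work).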
Inspired by #8
Irrespective of whether using column names containing a space or - character is good practice, SQLite does allow it, but sqlite-utils throws an error in the following cases:
import sqlite3
from sqlite_utils import Database
dbname = 'test.db'
DB = Database(sqlite3.connect(dbname))
import pandas as pd
df = pd.DataFrame({'col1':range(3), 'col2':range(3)})
#Convert pandas dataframe to appropriate list/dict format
DB['test1'].insert_all( df.to_dict(orient='records') )
#Works fine
However:
df = pd.DataFrame({'col 1':range(3), 'col2':range(3)})
DB['test1'].insert_all(df.to_dict(orient='records'))
throws:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-27-070b758f4f92> in <module>()
1 import pandas as pd
2 df = pd.DataFrame({'col 1':range(3), 'col2':range(3)})
----> 3 DB['test1'].insert_all(df.to_dict(orient='records'))
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in insert_all(self, records, pk, foreign_keys, upsert, batch_size, column_order)
327 jsonify_if_needed(record.get(key, None)) for key in all_columns
328 )
--> 329 result = self.db.conn.execute(sql, values)
330 self.db.conn.commit()
331 self.last_id = result.lastrowid
OperationalError: near "1": syntax error
and:
df = pd.DataFrame({'col-1':range(3), 'col2':range(3)})
DB['test1'].upsert_all(df.to_dict(orient='records'))
results in:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-28-654523549d20> in <module>()
1 import pandas as pd
2 df = pd.DataFrame({'col-1':range(3), 'col2':range(3)})
----> 3 DB['test1'].insert_all(df.to_dict(orient='records'))
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in insert_all(self, records, pk, foreign_keys, upsert, batch_size, column_order)
327 jsonify_if_needed(record.get(key, None)) for key in all_columns
328 )
--> 329 result = self.db.conn.execute(sql, values)
330 self.db.conn.commit()
331 self.last_id = result.lastrowid
OperationalError: near "-": syntax error
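The fix amounts to quoting column names when building the SQL. A standalone sketch with stdlib sqlite3 (insert_row is a made-up helper, not the sqlite-utils code path):

```python
import sqlite3

# Wrap every column name in [square brackets] when building the INSERT,
# so names containing spaces or dashes are legal identifiers.
def insert_row(db, table, record):
    columns = list(record.keys())
    quoted = ", ".join(f"[{c}]" for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    db.execute(
        f"INSERT INTO [{table}] ({quoted}) VALUES ({placeholders})",
        [record[c] for c in columns],
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE [test1] ([col 1] INTEGER, [col-2] INTEGER)")
insert_row(db, "test1", {"col 1": 1, "col-2": 2})
row = db.execute("SELECT [col 1], [col-2] FROM test1").fetchone()
```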
Reported here: dogsheep/healthkit-to-sqlite#9
It looks like there's a default limit of 999 variables - we need to be smart about that, maybe dynamically lower the batch size based on the number of columns.
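The arithmetic for "smart" batching is straightforward (a sketch; 999 is SQLite's default SQLITE_MAX_VARIABLE_NUMBER, and the batch_size default here is an assumption):

```python
# Each row in a multi-row INSERT consumes one bound variable per column,
# so the rows-per-statement cap is 999 // number-of-columns.
SQLITE_MAX_VARS = 999

def effective_batch_size(num_columns, batch_size=100):
    return min(batch_size, SQLITE_MAX_VARS // num_columns)
```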
We can already export newline delimited JSON. We should learn to import it as well.
The neat thing about importing it is that you can import GBs of data without having to read the whole lot into memory in order to decode the wrapping JSON array.
Datasette can export it now: simonw/datasette#405
Demo: https://latest.datasette.io/fixtures/facetable.json?_shape=array&_nl=on
It should be possible to do this:
$ curl "https://latest.datasette.io/fixtures/facetable.json?_shape=array&_nl=on" \
| sqlite-utils insert data.db facetable - --nl
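The memory win comes from decoding one line at a time. A sketch of the reading side (records_from_nl_json is a made-up helper; io.StringIO stands in for a real file):

```python
import io
import json

# A generator that yields one decoded record per line, so gigabytes of
# newline-delimited JSON never need to be held in memory at once.
def records_from_nl_json(fp):
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

source = io.StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
records = records_from_nl_json(source)
first = next(records)  # decodes lazily, one record at a time
```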
Right now we only support ingesting CSV, but sometimes interesting data is released as TSV.
https://www.washingtonpost.com/national/2019/07/18/how-download-use-dea-pain-pills-database/ for example.
Hi! I did not find any mention on the library about ON DELETE and ON UPDATE actions for foreign keys. Are those expected to be implemented? If not, it would be a nice thing to include!
Makes sense to do this before foreign keys in #2
Python:
db["table"].add_column("new_column", int)
CLI:
$ sqlite-utils add-column table new_column INTEGER
CSV can be imported into SQLite via the CLI, but I don't see documentation for doing it when using sqlite-utils as a library.
db["table"].ensure_index(["col1", "col2"])
This will do the following:
I want this for tools like twitter-to-sqlite search, where the search_runs table may or may not have been created yet but, if it IS created, I want to put an index on the hash column.
The docs show me how to create a view via db.create_view(), but I can't seem to get back to that view post-creation; if I query it as a table it returns None, and it doesn't appear in the table listing, even though querying the view works fine from inside the sqlite3 command line.
It'd be great to have the view as a pseudo-table, or, if the python/sqlite3 module makes that hard to pull off (I couldn't figure it out), to have that edge case documented next to the db.create_view() docs.
This is a larger change, but it would be incredibly useful: if you attempt to insert or update a document with a field that does not currently exist in the underlying table, sqlite-utils should add the appropriate column for you.
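A sketch of that flow with stdlib sqlite3 (insert_with_alter is a made-up name, not the sqlite-utils code path): diff the record's keys against the table's current columns, then ALTER TABLE for anything new.

```python
import sqlite3

# Before inserting, compare the record's keys against PRAGMA table_info
# and add any missing column (untyped, which SQLite allows).
def insert_with_alter(db, table, record):
    existing = {row[1] for row in db.execute(f"PRAGMA table_info([{table}])")}
    for key in record:
        if key not in existing:
            db.execute(f"ALTER TABLE [{table}] ADD COLUMN [{key}]")
    columns = ", ".join(f"[{k}]" for k in record)
    placeholders = ", ".join("?" for _ in record)
    db.execute(
        f"INSERT INTO [{table}] ({columns}) VALUES ({placeholders})",
        list(record.values()),
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY)")
insert_with_alter(db, "docs", {"id": 1, "title": "New field"})
columns = [row[1] for row in db.execute("PRAGMA table_info([docs])")]
```

A real version would presumably also guess a column type from the first value it sees, as sqlite-utils does at table-creation time.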
The foreign_keys= argument to table.insert_all() and friends can be used to specify foreign key relationships that should be created.
It is not yet documented. It also requires you to specify the SQLite type of each column, even though this could be detected by introspecting the referenced table:
cols = [c for c in self.db[other_table].columns if c.name == other_column]
cols[0].type
Relates to #2
The following works fine:
connX = sqlite3.connect('DELME.db', timeout=10)
dfX=pd.DataFrame({'col1':range(3),'col2':range(3)})
DBX = Database(connX)
DBX['test'].upsert_all(dfX.to_dict(orient='records'))
But if a column is named order:
connX = sqlite3.connect('DELME.db', timeout=10)
dfX=pd.DataFrame({'order':range(3),'col2':range(3)})
DBX = Database(connX)
DBX['test'].upsert_all(dfX.to_dict(orient='records'))
it throws an error:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-130-7dba33cd806c> in <module>
3 dfX=pd.DataFrame({'order':range(3),'col2':range(3)})
4 DBX = Database(connX)
----> 5 DBX['test'].upsert_all(dfX.to_dict(orient='records'))
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in upsert_all(self, records, pk, foreign_keys, column_order)
347 foreign_keys=foreign_keys,
348 upsert=True,
--> 349 column_order=column_order,
350 )
351
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in insert_all(self, records, pk, foreign_keys, upsert, batch_size, column_order)
327 jsonify_if_needed(record.get(key, None)) for key in all_columns
328 )
--> 329 result = self.db.conn.execute(sql, values)
330 self.db.conn.commit()
331 self.last_id = result.lastrowid
OperationalError: near "order": syntax error
While building https://github.com/dogsheep/healthkit-to-sqlite I found a need for a neat mechanism for easily building lookup tables - tables where each unique value in a column is replaced by a foreign key to a separate table.
csvs-to-sqlite currently creates those with its "extract" mechanism - but that's written as custom code against Pandas. I'd like to eventually replace Pandas with sqlite-utils there.
See also #42
Relates to #42 and #44. I want the ability to extract values out into lookup tables during bulk insert/upsert operations.
db.insert_all(rows, extracts=["species"])
db.insert_all(rows, extracts={"species": "Species"})
The dictionary form lets you specify a custom name for the lookup table - here Species.
SQLite does not have ALTER TABLE support for adding new foreign keys... but it turns out it's possible to make these changes without having to duplicate the entire table by carefully running UPDATE sqlite_master SET sql=... WHERE type='table' AND name='X';
Here's how Django does it: https://github.com/django/django/blob/d3449faaa915a08c275b35de01e66a7ef6bdb2dc/django/db/backends/sqlite3/schema.py#L103-L125
And here's the official documentation about this: https://sqlite.org/lang_altertable.html#otheralter (scroll to the very bottom of the page)
Lookup tables can be constructed on compound columns, but the extracts= option doesn't currently support that.
Right now extracts can be defined in two ways:
# Extract these columns into tables with the same name:
dogs = db.table("dogs", extracts=["breed", "most_recent_trophy"])
# Same as above but with custom table names:
dogs = db.table("dogs", extracts={"breed": "Breeds", "most_recent_trophy": "Trophies"})
Need some kind of syntax for much more complicated extractions, like when two columns (say "source" and "source_version") are extracted into a single table.
I clearly copied-and-pasted this from the tables command without updating it:
sqlite-utils/sqlite_utils/cli.py
Lines 216 to 222 in 0b1af42
You can corrupt your database by running:
$ sqlite-utils add-foreign-key my.db table non_existent_column other_table other_column
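The missing guard is simple column validation before the schema is touched. A sketch with stdlib sqlite3 (column_exists is a made-up helper name):

```python
import sqlite3

# Check that a column actually exists, via PRAGMA table_info, before
# attempting any schema rewrite - rather than corrupting the database.
def column_exists(db, table, column):
    return any(
        row[1] == column
        for row in db.execute(f"PRAGMA table_info([{table}])")
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE other_table (other_column INTEGER)")
db.execute("CREATE TABLE t (existing INTEGER)")
```

The CLI could run this check for both the referencing and the referenced column and exit with an error if either is missing.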
I'm looking to import data from XML with a pre-defined schema that maps fairly closely to a relational database.
In particular, it has explicit annotations for when fields are required, optional, or when a default value should be inferred.
Would there be value in adding the ability to define NOT NULL
and DEFAULT
column constraints to sqlite-utils?
I've just noticed that SQLite lets you IGNORE inserts that collide with a pre-existing key. This can be quite handy if you have a dataset that keeps changing in part, and you don't want to upsert and replace pre-existing PK rows but you do want to ignore collisions to existing PK rows.
Does sqlite_utils support such (cavalier!) behaviour?
Originally posted by @psychemedia in #18 (comment)
Spun off from #23 - this method will allow a user to update a specific row.
Currently the only way to do that is to call .upsert({full record}) with the primary key field matching an existing record - but this does not support partial updates.
db["events"].update(3, {"name": "Renamed"})
This method only works on an existing table, so there's no need for a pk="id" specifier - it can detect the primary key by looking at the table.
If the primary key is compound the first argument can be a tuple:
db["events_venues"].update((3, 2), {"custom_label": "Label"})
The method can be called without the second dictionary argument. Doing this selects the row specified by the primary key (throwing an error if it does not exist) and remembers it so that chained operations can be carried out - see proposal in #23 (comment)
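The pk-detection and partial-update parts could be sketched like this with stdlib sqlite3 (a hypothetical implementation shape, not the eventual sqlite-utils one; the chained-operations idea from #23 is omitted):

```python
import sqlite3

# Detect primary key columns from PRAGMA table_info (the pk field in
# column 5 is non-zero for pk members), then UPDATE only the supplied
# fields for the row matching those keys.
def update(db, table, pk_values, updates):
    pks = [row[1] for row in db.execute(f"PRAGMA table_info([{table}])") if row[5]]
    if not isinstance(pk_values, (list, tuple)):
        pk_values = [pk_values]
    sets = ", ".join(f"[{k}] = ?" for k in updates)
    where = " AND ".join(f"[{p}] = ?" for p in pks)
    cursor = db.execute(
        f"UPDATE [{table}] SET {sets} WHERE {where}",
        list(updates.values()) + list(pk_values),
    )
    if cursor.rowcount == 0:
        raise KeyError(pk_values)  # no such row

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT, venue TEXT)")
db.execute("INSERT INTO events VALUES (3, 'Original', 'Hall A')")
update(db, "events", 3, {"name": "Renamed"})
row = db.execute("SELECT name, venue FROM events WHERE id = 3").fetchone()
```

Compound keys fall out naturally: passing a tuple matches the pk columns in table order.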
Python library only. What would be a syntactically pleasant way of creating a m2m record?
Could greatly reduce size of resulting database for large datasets: http://cocoamine.net/blog/2015/09/07/contentless-fts4-for-large-immutable-documents/
Right now enable_fts() is only implemented for Table(). Technically SQLite supports enabling FTS on views, but it requires deeper thought since views don't have a rowid and the current implementation of enable_fts() relies on the presence of a rowid column.
It is possible to provide an alternative rowid using the content_rowid option of FTS5.
Ref: https://sqlite.org/fts5.html#fts5_table_creation_and_initialization
The "content_rowid" option, used to set the rowid field of an external content table.
This will further complicate the enable_fts() function by adding an extra argument. I'm wondering if that is outside the scope of this tool, or should I work on that feature and send a PR?
Right now you have to load every record into memory before passing the list to .insert_all() and friends.
If you want to process millions of rows, this is inefficient. Python has generators - we should use them!
The only catch here is that part of the magic of sqlite-utils is that it guesses the column types and creates the table for you. This code will need to be updated to notice if the table needs creating and, if it does, create it using the first X records (where X=1,000 but can be customized).
If a record outside of those first 1,000 has a rogue column, we can crash with an error.
This will free us up to make the --nl option added in #6 much more efficient.
Hi, it would be great to add a parameter that enables the load of a sqlite extension you need.
Something like "-ext modspatialite".
In this way your great tool would be even more comfortable and powerful.
Thank you very much