simonw / sqlite-utils
Python CLI utility and library for manipulating SQLite databases
Home Page: https://sqlite-utils.datasette.io
License: Apache License 2.0
This thread here is illuminating: https://stackoverflow.com/questions/3634984/insert-if-not-exists-else-update
The term UPSERT in SQLite has a specific meaning as of 3.24.0 (2018-06-04): https://www.sqlite.org/lang_UPSERT.html
It means "behave as an UPDATE or a no-op if the INSERT would violate a uniqueness constraint". The syntax in 3.24.0+ looks like this (confusingly, it does not use the term "upsert"):
INSERT INTO phonebook(name,phonenumber) VALUES('Alice','704-555-1212')
ON CONFLICT(name) DO UPDATE SET phonenumber=excluded.phonenumber
Here's the problem: the sqlite-utils .upsert() and .upsert_all() methods don't do this. They use the following SQL:
INSERT OR REPLACE INTO [{table}] ({columns}) VALUES {rows};
If the record already exists, it will be entirely replaced by a new record - as opposed to updating any specified fields while leaving the rest as they are (the behaviour of "upsert" in SQLite itself).
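The difference can be demonstrated directly with Python's built-in sqlite3 module (this is a standalone illustration, not sqlite-utils code, and the ON CONFLICT syntax needs SQLite 3.24.0+):

```python
import sqlite3

# INSERT OR REPLACE deletes the old row entirely, so columns not named in
# the statement are reset to NULL. ON CONFLICT ... DO UPDATE only touches
# the columns it names.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE phonebook (name TEXT PRIMARY KEY, phonenumber TEXT, notes TEXT)")
db.execute("INSERT INTO phonebook VALUES ('Alice', '704-555-1212', 'friend from work')")

# INSERT OR REPLACE: the unspecified 'notes' column is lost
db.execute("INSERT OR REPLACE INTO phonebook (name, phonenumber) VALUES ('Alice', '704-555-9999')")
replaced = db.execute("SELECT phonenumber, notes FROM phonebook WHERE name = 'Alice'").fetchone()

# Reset, then a true SQLite upsert: 'notes' survives
db.execute("DELETE FROM phonebook")
db.execute("INSERT INTO phonebook VALUES ('Alice', '704-555-1212', 'friend from work')")
db.execute(
    "INSERT INTO phonebook (name, phonenumber) VALUES ('Alice', '704-555-9999') "
    "ON CONFLICT(name) DO UPDATE SET phonenumber = excluded.phonenumber"
)
upserted = db.execute("SELECT phonenumber, notes FROM phonebook WHERE name = 'Alice'").fetchone()
```

After the first statement `notes` is NULL; after the second it is untouched.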
Spun off from #24 - it would be useful if CLI users could create new tables (with explicit column types, not-null rules and defaults) without having to insert an example record. Options could include:
--pk
--not-null
--default
--fk colname othertable othercol
--replace and --ignore
Utility method for fetching a record by its primary key.
Accepts a single value (for primary key / rowid tables) or a list/tuple of values (for compound primary keys, refs #36). Raises a NotFoundError if the record cannot be found.
I spotted a table which was created once and then had columns added to it and the formatted SQL looks like this:
CREATE TABLE [records] (
[type] TEXT,
[sourceName] TEXT,
[sourceVersion] TEXT,
[unit] TEXT,
[creationDate] TEXT,
[startDate] TEXT,
[endDate] TEXT,
[value] TEXT,
[metadata_Health Mate App Version] TEXT,
[metadata_Withings User Identifier] TEXT,
[metadata_Modified Date] TEXT,
[metadata_Withings Link] TEXT,
[metadata_HKWasUserEntered] TEXT
, [device] TEXT, [metadata_HKMetadataKeyHeartRateMotionContext] TEXT, [metadata_HKDeviceManufacturerName] TEXT, [metadata_HKMetadataKeySyncVersion] TEXT, [metadata_HKMetadataKeySyncIdentifier] TEXT, [metadata_HKSwimmingStrokeStyle] TEXT, [metadata_HKVO2MaxTestType] TEXT, [metadata_HKTimeZone] TEXT, [metadata_Average HR] TEXT, [metadata_Recharge] TEXT, [metadata_Lights] TEXT, [metadata_Asleep] TEXT, [metadata_Rating] TEXT, [metadata_Energy Threshold] TEXT, [metadata_Deep Sleep] TEXT, [metadata_Nap] TEXT, [metadata_Edit Slots] TEXT, [metadata_Tags] TEXT, [metadata_Daytime HR] TEXT)
It would be nice if the columns that were added later matched the indentation of the initial columns.
Needed by db-to-sqlite. It currently works by collecting all of the foreign key relationships it can find and then applying them at the end of the process.
The problem is, the add_foreign_key() method looks like this:
sqlite-utils/sqlite_utils/db.py
Lines 498 to 516 in 86bd2bb
That means it's doing a full VACUUM for every single relationship it sets up - and if you have hundreds of foreign key relationships in your database this can take hours.
I think the right solution is to have a .add_foreign_keys(list_of_args) method which does the bulk operation and then a single VACUUM. .add_foreign_key(...) can then call the bulk action with a single-item list.
This should work:
table = db["dog_breeds"].insert({
"dog_id": 1,
"breed_id": 2
}, pk=("dog_id", "breed_id"))
Needed for m2m work in #23
See also simonw/db-to-sqlite#18
Would this make it 100% safe to run reads against a database file that is being written to by another process?
Now that we're creating triggers (thanks to @amjith in #57) it would be neat if we could introspect them too.
I'm thinking:
db.triggers - lists all triggers for the database
db["tablename"].triggers - lists triggers for that table
The underlying query for this is select * from sqlite_master where type = 'trigger'
I'll return the trigger information in a new namedtuple, similar to how Indexes and ForeignKeys work.
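The underlying query can be sketched with plain sqlite3 (a standalone illustration of the introspection, not the proposed sqlite-utils API - here it returns plain tuples rather than the namedtuple mentioned above):

```python
import sqlite3

# Trigger metadata lives in sqlite_master rows with type = 'trigger';
# tbl_name links each trigger back to the table it fires on.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
db.execute(
    "CREATE TRIGGER docs_updated AFTER UPDATE ON docs "
    "BEGIN SELECT 1; END"
)

def triggers(db, table=None):
    # Returns (name, table, sql) tuples, optionally filtered to one table
    sql = "SELECT name, tbl_name, sql FROM sqlite_master WHERE type = 'trigger'"
    params = []
    if table is not None:
        sql += " AND tbl_name = ?"
        params.append(table)
    return db.execute(sql, params).fetchall()
```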
Since it's designed to be used in Jupyter or for rapid prototyping in an IDE (and it's still pretty small), sqlite-utils feels like a great candidate for me to finally try out Python type hints.
https://veekaybee.github.io/2019/07/08/python-type-hints/ is good.
It suggests the mypy docs for getting started: https://mypy.readthedocs.io/en/latest/existing_code.html plus this tutorial: https://pymbook.readthedocs.io/en/latest/typehinting.html
The GitHub JSON APIs have a really interesting convention with respect to related objects.
Consider https://api.github.com/repos/simonw/sqlite-utils/issues - here's a truncated subset:
{
"id": 449818897,
"node_id": "MDU6SXNzdWU0NDk4MTg4OTc=",
"number": 24,
"title": "Additional Column Constraints?",
"user": {
"login": "IgnoredAmbience",
"id": 98555,
"node_id": "MDQ6VXNlcjk4NTU1",
"avatar_url": "https://avatars0.githubusercontent.com/u/98555?v=4",
"gravatar_id": ""
},
"labels": [
{
"id": 993377884,
"node_id": "MDU6TGFiZWw5OTMzNzc4ODQ=",
"url": "https://api.github.com/repos/simonw/sqlite-utils/labels/enhancement",
"name": "enhancement",
"color": "a2eeef",
"default": true
}
],
"state": "open"
}
The user column lists a complete user. The labels column has a list of labels.
Since both user and label have a populated id field, this is actually enough information for us to create records for them AND set up the corresponding foreign key (for user) and m2m relationships (for labels).
It would be really neat if sqlite-utils had some kind of mechanism for correctly processing these kinds of patterns.
Thanks to jq there's not much need for extra customization of the shape here - if we support a narrowly defined structure, users can use jq to reshape arbitrary JSON to match.
Related to #2
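The core transformation could be sketched like this (every name here is hypothetical - flatten_issue is not a real sqlite-utils function - and the m2m half would additionally need a join table between issues and labels):

```python
# Hypothetical sketch: split nested objects that carry an "id" field into
# their own record lists, leaving a *_id foreign key column behind.
def flatten_issue(issue):
    row = dict(issue)
    users, labels = [], []
    user = row.pop("user", None)
    if user and "id" in user:
        users.append(user)          # becomes a row in a users table
        row["user_id"] = user["id"]  # foreign key left in place
    for label in row.pop("labels", []):
        if "id" in label:
            labels.append(label)     # becomes rows in a labels table
    return row, users, labels

issue = {
    "id": 449818897,
    "title": "Additional Column Constraints?",
    "user": {"login": "IgnoredAmbience", "id": 98555},
    "labels": [{"id": 993377884, "name": "enhancement"}],
    "state": "open",
}
row, users, labels = flatten_issue(issue)
```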
Both in the Python API and the CLI tool. For the CLI tool this should work:
$ sqlite-utils create-index mydb.db mytable col1 col2
This will create a compound index across col1 and col2. The name of the index will be automatically chosen unless you use the --name=... option.
Support a --unique option too.
The foreign_keys= argument currently takes a list of triples:
db["usages"].insert_all(
usages_to_insert,
foreign_keys=(
("line_id", "lines", "id"),
("definition_id", "definitions", "id"),
),
)
As of #16 we have a mechanism for detecting the primary key column (the third item in this triple) - we should use that here too, so foreign keys can be optionally defined as a list of pairs.
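The detection step can be sketched with PRAGMA table_info (a standalone illustration, not the sqlite-utils implementation - the helper name detect_pk is made up):

```python
import sqlite3

# Expand a (column, other_table) pair into a full triple by asking SQLite
# for the other table's primary key column.
def detect_pk(db, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    for cid, name, col_type, notnull, dflt, pk in db.execute(
        f"PRAGMA table_info([{table}])"
    ):
        if pk:
            return name
    return "rowid"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lines (id INTEGER PRIMARY KEY, text TEXT)")
pair = ("line_id", "lines")
triple = pair + (detect_pk(db, pair[1]),)
```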
One of my favourite features of csvs-to-sqlite is that it can "extract" columns into a separate lookup table - for example:
csvs-to-sqlite big_csv_file.csv -c country output.db
This will turn the country column in the resulting table into an integer foreign key against a new country table. You can see an example of what that looks like here: https://san-francisco.datasettes.com/registered-business-locations-3d50679/Business+Corridor was extracted from https://san-francisco.datasettes.com/registered-business-locations-3d50679/Registered_Business_Locations_-_San_Francisco?Business%20Corridor=1
I'd like to have the same capability in sqlite-utils - but with the ability to run it against an existing SQLite table rather than just against a CSV.
FTS5 can have porter stemming enabled.
The docs for eg inserting data should live on the same page, rather than being split across the API and CLI pages.
Sometimes when you are importing data the underlying source provides records without IDs that can be uniquely identified by their contents.
A utility mechanism for calculating a sha1 hash of the contents and using that as a unique ID would be useful.
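One possible shape for that mechanism (a sketch, not an existing sqlite-utils API): serialize the record deterministically, then hash it.

```python
import hashlib
import json

# Content-based ID: hash the record's JSON form with sorted keys and fixed
# separators, so the same contents always produce the same hash regardless
# of key order.
def hash_id(record):
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(encoded.encode("utf-8")).hexdigest()

record = {"name": "Cleo", "breed": "mutt"}
same_contents = {"breed": "mutt", "name": "Cleo"}
```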
If the input to sqlite-utils insert includes a column that is a JSON array or object, sqlite-utils query will introduce an extra level of quoting on output:
# echo '[{"key": ["one", "two", "three"]}]' | sqlite-utils insert t.db t -
# sqlite-utils t.db 'select * from t'
[{"key": "[\"one\", \"two\", \"three\"]"}]
# sqlite3 t.db 'select * from t'
["one", "two", "three"]
This might require an imperfect solution, since SQLite doesn't have a JSON type. Perhaps fields that start with [" or {" and end with "] or "} could be detected, with a flag to turn off that behavior for weird text fields (or vice versa).
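A simplified sketch of that heuristic (using plain bracket/brace prefixes rather than the exact `["` / `{"` test suggested above, and falling back to the raw string when parsing fails):

```python
import json

# Only attempt json.loads on values that look like serialized JSON
# arrays or objects; anything else passes through unchanged.
def maybe_json(value):
    if isinstance(value, str) and (
        (value.startswith("[") and value.endswith("]"))
        or (value.startswith("{") and value.endswith("}"))
    ):
        try:
            return json.loads(value)
        except ValueError:
            pass  # looked like JSON but wasn't - keep the raw text
    return value
```

The false-positive case ("weird text fields" that merely look like JSON) is exactly why a flag to disable the behavior would be needed.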
Sometimes it's good to have indices on all columns that are foreign keys, to allow for efficient reverse lookups.
This would be a useful utility:
$ sqlite-utils index-foreign-keys database.db
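The command could be sketched like this with stdlib sqlite3 (an illustration of the idea, not the sqlite-utils implementation; the idx_{table}_{column} naming scheme is an assumption):

```python
import sqlite3

# Walk every table, read its foreign keys with PRAGMA foreign_key_list,
# and create an index on each referencing column if one doesn't exist.
def index_foreign_keys(db):
    tables = [r[0] for r in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )]
    for table in tables:
        for fk in db.execute(f"PRAGMA foreign_key_list([{table}])"):
            column = fk[3]  # the "from" column in this table
            db.execute(
                f"CREATE INDEX IF NOT EXISTS [idx_{table}_{column}] "
                f"ON [{table}]([{column}])"
            )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE breeds (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, breed_id INTEGER REFERENCES breeds(id))")
index_foreign_keys(db)
indexes = [r[1] for r in db.execute("PRAGMA index_list([dogs])")]
```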
Hi !
Thanks for the lib !
Obviously, not every possible SQL query can have a dedicated method.
But I was thinking: a method to delete a row (I'm terrible with names - maybe delete_where() or something) would be useful.
I have a database with a primary key. For the moment, I use:
db.conn.execute(f"DELETE FROM table WHERE key = {key_id}")
db.conn.commit()
to delete a row I no longer need, given its primary key.
Works like a charm.
Just an idea:
table.delete_where_pkey({'key': key_id})
or something (I know, I'm terrible at naming methods...).
Pros: well, no need to write the SQL query.
Cons: WHERE normally allows many more things (operators =, <>, >, <, BETWEEN, not to mention AND, OR, etc.), so the method is maybe too specific, and/or a pain to make more flexible.
Again, just a thought. Writing your own SQL works too, so...
Thanks again.
See yah.
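One possible shape for such a method (a hypothetical sketch against plain sqlite3, not an existing sqlite-utils API) - note it binds the value as a parameter rather than interpolating key_id into an f-string, which also avoids SQL injection:

```python
import sqlite3

# Hypothetical delete_where() helper: a WHERE fragment plus bound
# parameters, so callers never build SQL values by string formatting.
def delete_where(db, table, where=None, args=None):
    sql = f"DELETE FROM [{table}]"
    if where:
        sql += f" WHERE {where}"
    db.execute(sql, args or [])
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (key INTEGER PRIMARY KEY, value TEXT)")
db.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
delete_where(db, "t", "key = ?", [2])
remaining = [r[0] for r in db.execute("SELECT key FROM t ORDER BY key")]
```

Passing a raw WHERE fragment keeps the flexibility the "Cons" paragraph worries about (BETWEEN, AND, OR all still work).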
Inspired by #8
Irrespective of whether using column names containing a space or - character is good practice, SQLite does allow it, but sqlite-utils throws an error in the following cases:
import sqlite3
from sqlite_utils import Database
dbname = 'test.db'
DB = Database(sqlite3.connect(dbname))
import pandas as pd
df = pd.DataFrame({'col1':range(3), 'col2':range(3)})
#Convert pandas dataframe to appropriate list/dict format
DB['test1'].insert_all( df.to_dict(orient='records') )
#Works fine
However:
df = pd.DataFrame({'col 1':range(3), 'col2':range(3)})
DB['test1'].insert_all(df.to_dict(orient='records'))
throws:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-27-070b758f4f92> in <module>()
1 import pandas as pd
2 df = pd.DataFrame({'col 1':range(3), 'col2':range(3)})
----> 3 DB['test1'].insert_all(df.to_dict(orient='records'))
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in insert_all(self, records, pk, foreign_keys, upsert, batch_size, column_order)
327 jsonify_if_needed(record.get(key, None)) for key in all_columns
328 )
--> 329 result = self.db.conn.execute(sql, values)
330 self.db.conn.commit()
331 self.last_id = result.lastrowid
OperationalError: near "1": syntax error
and:
df = pd.DataFrame({'col-1':range(3), 'col2':range(3)})
DB['test1'].upsert_all(df.to_dict(orient='records'))
results in:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-28-654523549d20> in <module>()
1 import pandas as pd
2 df = pd.DataFrame({'col-1':range(3), 'col2':range(3)})
----> 3 DB['test1'].insert_all(df.to_dict(orient='records'))
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in insert_all(self, records, pk, foreign_keys, upsert, batch_size, column_order)
327 jsonify_if_needed(record.get(key, None)) for key in all_columns
328 )
--> 329 result = self.db.conn.execute(sql, values)
330 self.db.conn.commit()
331 self.last_id = result.lastrowid
OperationalError: near "-": syntax error
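The fix amounts to quoting column names when building the SQL. A standalone sketch with stdlib sqlite3 (insert_row is a made-up helper, not the sqlite-utils code path):

```python
import sqlite3

# Wrap every column name in [square brackets] when building the INSERT,
# so names containing spaces or dashes are legal identifiers.
def insert_row(db, table, record):
    columns = list(record.keys())
    quoted = ", ".join(f"[{c}]" for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    db.execute(
        f"INSERT INTO [{table}] ({quoted}) VALUES ({placeholders})",
        [record[c] for c in columns],
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE [test1] ([col 1] INTEGER, [col-2] INTEGER)")
insert_row(db, "test1", {"col 1": 1, "col-2": 2})
row = db.execute("SELECT [col 1], [col-2] FROM test1").fetchone()
```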
Reported here: dogsheep/healthkit-to-sqlite#9
It looks like there's a default limit of 999 variables - we need to be smart about that, maybe dynamically lower the batch size based on the number of columns.
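The arithmetic for "smart" batching is straightforward (a sketch; 999 is SQLite's default SQLITE_MAX_VARIABLE_NUMBER, and the batch_size default here is an assumption):

```python
# Each row in a multi-row INSERT consumes one bound variable per column,
# so the rows-per-statement cap is 999 // number-of-columns.
SQLITE_MAX_VARS = 999

def effective_batch_size(num_columns, batch_size=100):
    return min(batch_size, SQLITE_MAX_VARS // num_columns)
```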
We can already export newline delimited JSON. We should learn to import it as well.
The neat thing about importing it is that you can import GBs of data without having to read the whole lot into memory in order to decode the wrapping JSON array.
Datasette can export it now: simonw/datasette#405
Demo: https://latest.datasette.io/fixtures/facetable.json?_shape=array&_nl=on
It should be possible to do this:
$ curl "https://latest.datasette.io/fixtures/facetable.json?_shape=array&_nl=on" \
| sqlite-utils insert data.db facetable - --nl
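The memory win comes from decoding one line at a time. A sketch of the reading side (records_from_nl_json is a made-up helper; io.StringIO stands in for a real file):

```python
import io
import json

# A generator that yields one decoded record per line, so gigabytes of
# newline-delimited JSON never need to be held in memory at once.
def records_from_nl_json(fp):
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

source = io.StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
records = records_from_nl_json(source)
first = next(records)  # decodes lazily, one record at a time
```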
Right now we only support ingesting CSV, but sometimes interesting data is released as TSV.
https://www.washingtonpost.com/national/2019/07/18/how-download-use-dea-pain-pills-database/ for example.
Hi! I did not find any mention on the library about ON DELETE and ON UPDATE actions for foreign keys. Are those expected to be implemented? If not, it would be a nice thing to include!
Makes sense to do this before foreign keys in #2
Python:
db["table"].add_column("new_column", int)
CLI:
$ sqlite-utils add-column table new_column INTEGER
CSV can be imported into SQLite via the CLI, but I don't see documentation for doing it when using sqlite-utils as a library.
db["table"].ensure_index(["col1", "col2"])
This will do the following:
I want this for tools like twitter-to-sqlite search, where the search_runs table may or may not have been created yet but, if it IS created, I want to put an index on the hash column.
The docs show me how to create a view via db.create_view(), but I can't seem to get back to that view post-creation; if I query it as a table it returns None, and it doesn't appear in the table listing, even though querying the view works fine from inside the sqlite3 command line.
It'd be great to have the view as a pseudo-table, or, if the python/sqlite3 module makes that hard to pull off (I couldn't figure it out), to have that edge case documented next to the db.create_view() docs.
This is a larger change, but it would be incredibly useful: if you attempt to insert or update a document with a field that does not currently exist in the underlying table, sqlite-utils should add the appropriate column for you.
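A sketch of that flow with stdlib sqlite3 (insert_with_alter is a made-up name, not the sqlite-utils code path): diff the record's keys against the table's current columns, then ALTER TABLE for anything new.

```python
import sqlite3

# Before inserting, compare the record's keys against PRAGMA table_info
# and add any missing column (untyped, which SQLite allows).
def insert_with_alter(db, table, record):
    existing = {row[1] for row in db.execute(f"PRAGMA table_info([{table}])")}
    for key in record:
        if key not in existing:
            db.execute(f"ALTER TABLE [{table}] ADD COLUMN [{key}]")
    columns = ", ".join(f"[{k}]" for k in record)
    placeholders = ", ".join("?" for _ in record)
    db.execute(
        f"INSERT INTO [{table}] ({columns}) VALUES ({placeholders})",
        list(record.values()),
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY)")
insert_with_alter(db, "docs", {"id": 1, "title": "New field"})
columns = [row[1] for row in db.execute("PRAGMA table_info([docs])")]
```

A real version would presumably also guess a column type from the first value it sees, as sqlite-utils does at table-creation time.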
The foreign_keys= argument to table.insert_all() and friends can be used to specify foreign key relationships that should be created.
It is not yet documented. It also requires you to specify the SQLite type of each column, even though this could be detected by introspecting the referenced table:
cols = [c for c in self.db[other_table].columns if c.name == other_column]
cols[0].type
Relates to #2
The following works fine:
connX = sqlite3.connect('DELME.db', timeout=10)
dfX=pd.DataFrame({'col1':range(3),'col2':range(3)})
DBX = Database(connX)
DBX['test'].upsert_all(dfX.to_dict(orient='records'))
But if a column is named order:
connX = sqlite3.connect('DELME.db', timeout=10)
dfX=pd.DataFrame({'order':range(3),'col2':range(3)})
DBX = Database(connX)
DBX['test'].upsert_all(dfX.to_dict(orient='records'))
it throws an error:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-130-7dba33cd806c> in <module>
3 dfX=pd.DataFrame({'order':range(3),'col2':range(3)})
4 DBX = Database(connX)
----> 5 DBX['test'].upsert_all(dfX.to_dict(orient='records'))
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in upsert_all(self, records, pk, foreign_keys, column_order)
347 foreign_keys=foreign_keys,
348 upsert=True,
--> 349 column_order=column_order,
350 )
351
/usr/local/lib/python3.7/site-packages/sqlite_utils/db.py in insert_all(self, records, pk, foreign_keys, upsert, batch_size, column_order)
327 jsonify_if_needed(record.get(key, None)) for key in all_columns
328 )
--> 329 result = self.db.conn.execute(sql, values)
330 self.db.conn.commit()
331 self.last_id = result.lastrowid
OperationalError: near "order": syntax error
While building https://github.com/dogsheep/healthkit-to-sqlite I found a need for a neat mechanism for easily building lookup tables - tables where each unique value in a column is replaced by a foreign key to a separate table.
csvs-to-sqlite currently creates those with its "extract" mechanism - but that's written as custom code against Pandas. I'd like to eventually replace Pandas with sqlite-utils there.
See also #42
Relates to #42 and #44. I want the ability to extract values out into lookup tables during bulk insert/upsert operations.
db.insert_all(rows, extracts=["species"])
db.insert_all(rows, extracts={"species": "Species"})
The dictionary form lets you specify a custom name for the lookup table - here Species.
SQLite does not have ALTER TABLE support for adding new foreign keys... but it turns out it's possible to make these changes without having to duplicate the entire table by carefully running UPDATE sqlite_master SET sql=... WHERE type='table' AND name='X';
Here's how Django does it: https://github.com/django/django/blob/d3449faaa915a08c275b35de01e66a7ef6bdb2dc/django/db/backends/sqlite3/schema.py#L103-L125
And here's the official documentation about this: https://sqlite.org/lang_altertable.html#otheralter (scroll to the very bottom of the page)
Lookup tables can be constructed on compound columns, but the extracts= option doesn't currently support that.
Right now extracts can be defined in two ways:
# Extract these columns into tables with the same name:
dogs = db.table("dogs", extracts=["breed", "most_recent_trophy"])
# Same as above but with custom table names:
dogs = db.table("dogs", extracts={"breed": "Breeds", "most_recent_trophy": "Trophies"})
Need some kind of syntax for much more complicated extractions, like when two columns (say "source" and "source_version") are extracted into a single table.
I clearly copied-and-pasted this from the tables command without updating it:
sqlite-utils/sqlite_utils/cli.py
Lines 216 to 222 in 0b1af42
You can corrupt your database by running:
$ sqlite-utils add-foreign-key my.db table non_existent_column other_table other_column
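The missing guard is simple column validation before the schema is touched. A sketch with stdlib sqlite3 (column_exists is a made-up helper name):

```python
import sqlite3

# Check that a column actually exists, via PRAGMA table_info, before
# attempting any schema rewrite - rather than corrupting the database.
def column_exists(db, table, column):
    return any(
        row[1] == column
        for row in db.execute(f"PRAGMA table_info([{table}])")
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE other_table (other_column INTEGER)")
db.execute("CREATE TABLE t (existing INTEGER)")
```

The CLI could run this check for both the referencing and the referenced column and exit with an error if either is missing.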
I'm looking to import data from XML with a pre-defined schema that maps fairly closely to a relational database.
In particular, it has explicit annotations for when fields are required, optional, or when a default value should be inferred.
Would there be value in adding the ability to define NOT NULL
and DEFAULT
column constraints to sqlite-utils?
I've just noticed that SQLite lets you IGNORE inserts that collide with a pre-existing key. This can be quite handy if you have a dataset that keeps changing in part, and you don't want to upsert and replace pre-existing PK rows but you do want to ignore collisions to existing PK rows.
Does sqlite_utils support such (cavalier!) behaviour?
Originally posted by @psychemedia in #18 (comment)
Spun off from #23 - this method will allow a user to update a specific row.
Currently the only way to do that is to call .upsert({full record}) with the primary key field matching an existing record - but this does not support partial updates.
db["events"].update(3, {"name": "Renamed"})
This method only works on an existing table, so there's no need for a pk="id" specifier - it can detect the primary key by looking at the table.
If the primary key is compound the first argument can be a tuple:
db["events_venues"].update((3, 2), {"custom_label": "Label"})
The method can be called without the second dictionary argument. Doing this selects the row specified by the primary key (throwing an error if it does not exist) and remembers it so that chained operations can be carried out - see proposal in #23 (comment)
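The pk-detection and partial-update parts could be sketched like this with stdlib sqlite3 (a hypothetical implementation shape, not the eventual sqlite-utils one; the chained-operations idea from #23 is omitted):

```python
import sqlite3

# Detect primary key columns from PRAGMA table_info (the pk field in
# column 5 is non-zero for pk members), then UPDATE only the supplied
# fields for the row matching those keys.
def update(db, table, pk_values, updates):
    pks = [row[1] for row in db.execute(f"PRAGMA table_info([{table}])") if row[5]]
    if not isinstance(pk_values, (list, tuple)):
        pk_values = [pk_values]
    sets = ", ".join(f"[{k}] = ?" for k in updates)
    where = " AND ".join(f"[{p}] = ?" for p in pks)
    cursor = db.execute(
        f"UPDATE [{table}] SET {sets} WHERE {where}",
        list(updates.values()) + list(pk_values),
    )
    if cursor.rowcount == 0:
        raise KeyError(pk_values)  # no such row

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT, venue TEXT)")
db.execute("INSERT INTO events VALUES (3, 'Original', 'Hall A')")
update(db, "events", 3, {"name": "Renamed"})
row = db.execute("SELECT name, venue FROM events WHERE id = 3").fetchone()
```

Compound keys fall out naturally: passing a tuple matches the pk columns in table order.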
Python library only. What would be a syntactically pleasant way of creating a m2m record?
Could greatly reduce size of resulting database for large datasets: http://cocoamine.net/blog/2015/09/07/contentless-fts4-for-large-immutable-documents/
Right now enable_fts() is only implemented for Table(). Technically SQLite supports enabling FTS on views, but it requires deeper thought since views don't have a rowid and the current implementation of enable_fts() relies on the presence of a rowid column.
It is possible to provide an alternative rowid using the content_rowid option of FTS5.
Ref: https://sqlite.org/fts5.html#fts5_table_creation_and_initialization
The "content_rowid" option, used to set the rowid field of an external content table.
This will further complicate the enable_fts() function by adding an extra argument. I'm wondering if that is outside the scope of this tool, or should I work on that feature and send a PR?
Right now you have to load every record into memory before passing the list to .insert_all() and friends.
If you want to process millions of rows, this is inefficient. Python has generators - we should use them!
The only catch here is that part of the magic of sqlite-utils is that it guesses the column types and creates the table for you. This code will need to be updated to notice if the table needs creating and, if it does, create it using the first X records (where X=1,000 but can be customized).
If a record outside of those first 1,000 has a rogue column, we can crash with an error.
This will free us up to make the --nl option added in #6 much more efficient.
Hi, it would be great to add a parameter that enables the load of a sqlite extension you need.
Something like "-ext modspatialite".
In this way your great tool would be even more comfortable and powerful.
Thank you very much