frictionlessdata / tableschema-sql-py
Generate SQL tables, load and extract data, based on JSON Table Schema descriptors.
License: MIT License
I was wondering if anyone has looked at toolchains that support the creation of ad hoc working environments around a particular dataset.
For example, I have a recipe for creating an analysis environment that links and launches RStudio connected to MySQL seeded with a particular dataset, using Docker Compose and a Dockerfile that creates the seeded database:
# Build this container based on the pre-existing official MySQL Docker image
FROM mysql

# The official image listens for requests on localhost, IP address 127.0.0.1.
# We're going to want it to be a bit more promiscuous
# and listen out for requests coming from *any* IP address (0.0.0.0).
# sed is a Linux command-line text editor - it updates the IP address in the MySQL config file.
RUN sed -i -e "s/^bind-address\s*=\s*127.0.0.1/bind-address=0.0.0.0/" /etc/mysql/my.cnf

# Install a command-line utility called wget
RUN apt-get update && apt-get install -y wget && apt-get clean

# Use wget to download a copy of the MySQL database dump from ergast.com
RUN wget http://ergast.com/downloads/f1db.sql.gz -P /docker-entrypoint-initdb.d

# Unzip the download.
# The original MySQL container is configured to install items
# in the /docker-entrypoint-initdb.d/ directory when it starts up.
RUN gunzip /docker-entrypoint-initdb.d/f1db.sql.gz
The seeded database is linked to RStudio:
ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"

f1dj_rstudio:
  image: rocker/tidyverse
  ports:
    - "8787:8787"
  links:
    - ergastdb:ergastdb
  volumes:
    - ./rstudio:/home/rstudio
So in the extreme case, I imagine running a command something like the following as an equivalent to the above:
reallyfrictionless -app rocker/tidyverse -app jupyter/datascience-notebook -dbms mysql -dbalias ergastdb -db ergastdb -dbpwd f1 -datasrc http://ergast.com/downloads/f1db.sql.gz -datatyp sql
and consequently being presented with an environment (Jupyter notebooks and RStudio) opened in my browser that lets me link to the populated DB and access the data from it directly.
(Actually it'd have to be a tad more complicated to support e.g. mounted directories/volumes? Hmm... Maybe easier if we just allow a single -app, then -guestdir and -hostdir switches, or a single -appvolume switch?)
-app: name of application container(s) to install
-dbms: DBMS container
-dbalias: alias by which the DBMS is accessed from app containers
-dbpwd: DBMS root password
-db: database created and used inside the DBMS
-datasrc: data file
-datatyp: data file type

For a datapackage, use -datatyp datapackage and the URL to the data package, and configure the database from that?
(To simplify things on the DBMS side, it may be sensible to define images reallyfrictionless/postgres, reallyfrictionless/mysql etc. that have standardised config variables?)
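For illustration, here is a minimal sketch of how such switches might expand into a Compose-style definition. Everything here is an assumption: build_compose and the mapping of switches to keys are hypothetical, not an existing tool, and PyYAML is assumed to be available.

import yaml

# Hypothetical: expand the proposed CLI switches into a Compose-style
# service definition (assumes a MySQL-like DBMS image for the env vars).
def build_compose(app, dbms, dbalias, db, dbpwd, datasrc):
    return {
        'services': {
            dbalias: {
                'image': dbms,
                'environment': {
                    'MYSQL_ROOT_PASSWORD': dbpwd,
                    'MYSQL_DATABASE': db,
                    'DATA_SOURCE': datasrc,  # consumed by a seeding init script
                },
                'expose': ['3306'],
            },
            'app': {
                'image': app,
                'links': ['%s:%s' % (dbalias, dbalias)],
            },
        }
    }

print(yaml.safe_dump(build_compose(
    'rocker/tidyverse', 'mysql', 'ergastdb', 'ergastdb', 'f1',
    'http://ergast.com/downloads/f1db.sql.gz')))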
In the interim, how about some simple Dockerfile recipes that allow a datapackage to be identified and a runnable DBMS docker image created from it?
Without any data parsing code etc., just a schema validator and an object to reflect it.
Looks like the last commit is from around 2021 and the latest release from around 2019 on GitHub. What's the status of this package?
Thanks
For now only PostgreSQL is tested (and therefore supported).
From @OriHoch
From @vshih
import sqlalchemy as sa
engine = sa.create_engine('mysql://...')
from jsontableschema import Table
t = Table(SOURCE, SCHEMA)
t.save('test_package', backend='sql', engine=engine)
# _mysql_exceptions.ProgrammingError: (1146, "Table 'mydatabase.users' doesn't exist")
related #64
The current implementation does not handle the case when existing data needs to be updated. Here is a test case that illustrates this:
import sqlalchemy as sa
from jsontableschema_sql import Storage

def test_update():
    schema = {
        'primaryKey': 'id',
        'fields': [
            {'name': 'id', 'type': 'integer'},
            {'name': 'comment', 'type': 'string'},
        ]
    }
    storage = Storage(engine=sa.create_engine('sqlite://'))
    storage.create('data', schema)

    # Write data for the first time
    storage.write('data', [(1, 'a'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'a'), (2, 'b')]

    # Write changed data with same primary keys on top of existing ones
    storage.write('data', [(1, 'x'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'x'), (2, 'b')]
This test raises the following error on the second write:
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError)
UNIQUE constraint failed: data.id
[SQL: 'INSERT INTO data (id, comment) VALUES (?, ?)']
[parameters: ((1, 'x'), (2, 'b'))]
It would be nice if existing data could be updated.
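Until the library supports this natively, one possible approach is an upsert. Here is a minimal sketch assuming SQLAlchemy 1.4+ and its SQLite dialect; upsert_rows is a hypothetical helper, not part of jsontableschema_sql:

import sqlalchemy as sa
from sqlalchemy.dialects.sqlite import insert

# Hypothetical helper: insert rows, updating non-key columns on
# primary-key conflict (SQLite "ON CONFLICT DO UPDATE" syntax).
def upsert_rows(engine, table, rows):
    stmt = insert(table).values(rows)
    pk_names = {c.name for c in table.primary_key.columns}
    stmt = stmt.on_conflict_do_update(
        index_elements=sorted(pk_names),
        set_={name: stmt.excluded[name]
              for name in table.columns.keys() if name not in pk_names},
    )
    with engine.begin() as conn:
        conn.execute(stmt)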
- Storage.write() or write by chunks (#29)
- Storage.read() (possible SQLAlchemy bug)
- Current version is missing from API.

Convert/restore (schema, naming) operations could be moved to a Mapper class or mapper module with good testing capabilities.
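For what such a Mapper extraction could look like, here is a minimal sketch; the class name and methods are assumptions, not the library's actual design:

# Hypothetical Mapper: isolates logical-name <-> physical-name and
# schema conversion concerns so they can be unit-tested in isolation.
class Mapper(object):

    def __init__(self, prefix=''):
        self.__prefix = prefix

    def convert_bucket(self, bucket):
        # Logical bucket name -> physical table name
        return self.__prefix + bucket

    def restore_bucket(self, table_name):
        # Physical table name -> logical bucket name (None if not managed)
        if table_name.startswith(self.__prefix):
            return table_name[len(self.__prefix):]
        return None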
make test fails with:
make: virtualenv-3.4: No such file or directory
make: *** [pyenv/bin/python] Error 1
BTW, is this make stuff a replacement for the "old-fashioned" nosetests stuff? I was sort of expecting to do:
pip install -e .
nosetests tests
When generating a Data Package, using:
# https://github.com/frictionlessdata/datapackage-py
from datapackage import pull_datapackage
from sqlalchemy import create_engine

engine = create_engine('mysql://root@localhost/enliten')

# Pull
pull_datapackage(
    descriptor='enliten-datapackage/datapackage.json',
    name='enliten-datapackage',
    backend='sql',
    engine=engine)
none of the resources have names (now required). This results in the foreignKeys generated in one of the resources in the datapackage.json referring to null resources.
"schema": {
"foreignKeys": [
{
"fields": "user_id",
"reference": {
"resource": null,
"fields": "id"
}
},
{
"fields": "group_id",
"reference": {
"resource": null,
"fields": "id"
}
}
]
(Found via this frictionlessdata/pilot-dm4t#19)
{"fields": [
{"name": "id", "type": "integer"},
{"name": "date", "type": "string"}
],
"primaryKey": ["id"]
If a primaryKey is set in the JTS, use it.
Maybe we want a flag where we ignore it and create our own auto-increment key, but set the primaryKey field(s) to unique(?)
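A minimal sketch of what that flag could look like when building the table; build_table and ignore_pk are hypothetical names, and the type mapping is trimmed to two types for brevity:

import sqlalchemy as sa

# Hypothetical: build a table honouring the JTS primaryKey, or - with
# ignore_pk=True - use a surrogate auto-increment key and keep the
# declared primaryKey fields unique instead.
def build_table(name, schema, metadata, ignore_pk=False):
    pk = schema.get('primaryKey', [])
    pk = [pk] if isinstance(pk, str) else pk
    columns = []
    for field in schema['fields']:
        col_type = sa.Integer if field['type'] == 'integer' else sa.Text
        columns.append(sa.Column(
            field['name'], col_type,
            primary_key=(not ignore_pk and field['name'] in pk)))
    if ignore_pk and pk:
        columns.insert(0, sa.Column('_id', sa.Integer, primary_key=True))
        columns.append(sa.UniqueConstraint(*pk))
    return sa.Table(name, metadata, *columns)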
I have a data package with timestamps that have timezone offsets, e.g.:
2020-09-09T07:10:20+02:00
The spec for this field is:
{
  "name": "timestamp",
  "type": "datetime",
  "format": "%Y-%m-%dT%H:%M:%S%z"
}
When saving this data package to sqlite, timezone offsets get ignored and are saved in sqlite as:
2020-09-09 07:10:20.000000
Retaining timezone offsets is not possible in sqlite's datetime strings, but it would be good if timezones were converted to UTC before uploading to sqlite:
2020-09-09 05:10:20.000000
Is that possible?
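A minimal sketch of the proposed normalisation, assuming the parsed values are timezone-aware datetime objects; to_utc_naive is a hypothetical helper, not part of the library:

from datetime import datetime, timezone

# Hypothetical: convert an aware datetime to UTC and drop the offset so
# it fits sqlite's naive datetime strings.
def to_utc_naive(value):
    if value.tzinfo is not None:
        value = value.astimezone(timezone.utc).replace(tzinfo=None)
    return value

# datetime.fromisoformat understands "+02:00" offsets (Python 3.7+)
print(to_utc_naive(datetime.fromisoformat('2020-09-09T07:10:20+02:00')))
# -> 2020-09-09 05:10:20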
Please preserve this line to notify @roll (lead of this repository)
Requires #27 and frictionlessdata/tableschema-py#57
The readme is expected to be in RST. See a solution here: https://github.com/frictionlessdata/jsontableschema-pandas-py
Required by frictionlessdata/tableschema-py#49
Check that storage.tables and storage.check() work (could be unsynced?).

We want to use conda-forge as a primary distribution channel for the PUDL package (since it depends on a bunch of PyData stuff that can be a pain for folks to build on their own), but this means that our dependencies also need to be available via conda, and right now tableschema-sql is the only thing that will have to be installed via pip. Would it be possible to set up a recipe/feedstock on conda-forge for tableschema-sql?
See catalyst-cooperative/pudl#430
Please preserve this line to notify @roll (maintainer of this repository)
It's again not formatted - https://pypi.python.org/pypi?:action=display&name=jsontableschema-sql&version=0.4.0 - not sure why for now. Probably after the md -> rst conversion there are some problems with the rst file, in PyPI's opinion.
It should:
Requires #27 and frictionlessdata/tableschema-py#38
In version 1.3.0, constraint checks were added. The pattern constraint, though, was implemented using a LIKE SQL operator, which is not POSIX-RE compatible:
11b41b2#diff-06244424627478397c4e73620b8d452fR72
From the PostgreSQL documentation, the correct operator should probably have been ~.
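To illustrate the difference, a sketch assuming PostgreSQL; the table and column names here are made up for the example:

import sqlalchemy as sa

# LIKE treats the pattern as SQL wildcards (% and _), so a regex such
# as '^[A-Z]{3}$' will not match as intended:
like_check = sa.CheckConstraint("code LIKE '^[A-Z]{3}$'")

# PostgreSQL's ~ operator applies a POSIX regular expression:
regex_check = sa.CheckConstraint("code ~ '^[A-Z]{3}$'")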
Please preserve this line to notify @roll (lead of this repository)
The driver should be updated for the jsontableschema-v0.7 changes.
Hi, I am looking for a PHP implementation, so perhaps we can use tableschema-py as a reference algorithm for tableschema-php issue 35... The core seems to be the mapping, and it can be expressed as JSON for reuse in other contexts.
mapping = {
    'any': sa.Text,
    'array': None,
    'boolean': sa.Boolean,
    'date': sa.Date,
    'datetime': sa.DateTime,
    'duration': None,
    'geojson': None,
    'geopoint': None,
    'integer': sa.Integer,
    'number': sa.Float,
    'object': None,
    'string': sa.Text,
    'time': sa.Time,
    'year': sa.Integer,
    'yearmonth': None,
}
The suggestion: express it in an external file, or generate the code from a CSV with all datatype map descriptions... As "RosettaStone-tabular-datatypes" ;-)
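For instance, a minimal sketch of dumping the mapping above to JSON; the serialisation format here is just an assumption:

import json

# Map each JTS type to the SQLAlchemy type name (or null if unmapped),
# producing a language-neutral description other ports could consume.
portable = {jts: (sql.__name__ if sql else None)
            for jts, sql in mapping.items()}
print(json.dumps(portable, indent=2, sort_keys=True))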
PS: for a more general translation of primitive data types, perhaps we can reuse a CSV at Data Packaged Core Datasets with all datatypes, along the lines of:

Canonic | Description | JSON | SQL-92 | Python | Java | ...
---|---|---|---|---|---|---
JSON | UTF8 string | string | text | str | String | ...
SQL-92 | integer 4 bytes | number | int4 | ... | ... | ...
... | ... | ... | ... | ... | ... | ...
For example, I followed the import_datapackage.py at https://gist.github.com/vitorbaptista/19d476d99595584e9ad5 to import https://github.com/datasets/nasdaq-listings into a SQLite3 database. After that, I tried to export it back to a datapackage by using the export_datapackage.py file at the same gist. I then got the following traceback:
Traceback (most recent call last):
  File "sql-example.py", line 14, in <module>
    datapackage_storage.export_package(storage, 'nasdaq-listings/datapackage.json', 'nasdaq-listings')
  File "/home/vitor/datasets/venv/src/datapackage-storage/datapackage_storage/package.py", line 86, in export_package
    schema = storage.describe(table)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/storage.py", line 180, in describe
    self.__prefix, table, dbtable.columns, dbtable.constraints)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/helpers.py", line 113, in restore_schema
    message = 'Type %s is not supported' % column.type
TypeError: Type TEXT is not supported
The problem is that SQLAlchemy has both sqlalchemy.sql.sqltypes.Text and sqlalchemy.sql.sqltypes.TEXT, and in our mapping at .restore_schema we only use the first one:
# Mapping
mapping = {
    Text: 'string',
    Integer: 'integer',
    Float: 'number',
    Boolean: 'boolean',
}
When we try to get mapping[column.type.__class__] afterwards, it isn't found and the exception is thrown.
I don't know why SQLAlchemy has both .Text and .TEXT (and others), or why one is used and not the other, but we should support both. Luckily for us, .TEXT inherits from .Text, so we can use that.
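A minimal sketch of a subclass-aware lookup based on that observation; type_to_jts is a hypothetical helper, not the library's actual code:

from sqlalchemy import Boolean, Float, Integer, Text

mapping = {Text: 'string', Integer: 'integer',
           Float: 'number', Boolean: 'boolean'}

# Walk the class hierarchy so subclasses like TEXT resolve to the
# mapped base class (Text).
def type_to_jts(column_type):
    for cls in type(column_type).__mro__:
        if cls in mapping:
            return mapping[cls]
    raise TypeError('Type %s is not supported' % column_type)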
When the jsontableschema.Storage interface becomes available.
With a full Tabular Data Package we would have all the resource names plus data file paths and could do a complete load, including setting table names etc. We would also handle foreign keys.
Should this be in this library? If not, maybe we want tdp-to-sql as well.
As per frictionlessdata/frictionlessdata.io#695 (comment)
It would be super if @akariv and I could be marked as leads and have maintainer access here, and also on the PyPI package.
From @sirex on May 10, 2016 11:58
When integrating external data sources into your project, the external data is usually imported into an internal database.
Since external data can be updated frequently, you usually want to update your internal replica as soon as possible.
Like git pull for code, I want to have datapackage pull for data.
There is the jsontableschema-sql-py library, which can import all data into a database, but it can't update existing data.
Here are two use cases:
1. Updating only rows that have been changed since the last update
Let's say your external data provider has a service with SLEEP or CouchDB replication protocol support. That means that after an initial import that took two days, I can now call this external service and get only the data that has changed instead of downloading everything again.
2. Detecting what has been changed since the last update
Let's say another provider keeps data in a git repository, which allows me to update the data quickly using git pull, but there is no way of knowing which rows have changed. Luckily, datapackage pull could detect changes in a really fast way and update my internal database, touching only records that have changed since the last update.
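A minimal sketch of use case 2, assuming rows are compared by a stable fingerprint; row_digest and changed_rows are hypothetical, not an existing datapackage API:

import hashlib

# Stable fingerprint of a row, used to skip unchanged records.
def row_digest(row):
    return hashlib.sha256(repr(tuple(row)).encode('utf-8')).hexdigest()

# Yield only the rows whose key is new or whose contents changed.
def changed_rows(old_rows, new_rows, key=lambda row: row[0]):
    old = {key(row): row_digest(row) for row in old_rows}
    for row in new_rows:
        if old.get(key(row)) != row_digest(row):
            yield row

# Example: only (1, 'x') needs writing, (2, 'b') is untouched.
print(list(changed_rows([(1, 'a'), (2, 'b')], [(1, 'x'), (2, 'b')])))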
Copied from original issue: frictionlessdata/frictionlessdata.io#253
I think it is better to use the type casting of JTSKit rather than the new typecast lib.
Why?
Requires #27
Requires frictionlessdata/tableschema-py#59
On previous iterations I've run into some problems using a 100% pure reflection system. It's now obvious to me that for first-class JSON Table Schema feature support we need to store schemas in the storage (in a meta table _schemas, for example). It's not against the current design and could be implemented relatively quickly.
For tables without a _schemas entry, the current reflection system will still work, with a subset of features supported. So the user will still be able to describe and read a plain SQL table (the same as now).
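A minimal sketch of what such a _schemas meta table could look like; the column names and store_schema helper are assumptions, not the decided design:

import json
import sqlalchemy as sa

metadata = sa.MetaData()

# Hypothetical meta table: one row per managed table, storing the
# original JSON Table Schema descriptor verbatim.
schemas_table = sa.Table(
    '_schemas', metadata,
    sa.Column('table_name', sa.Text, primary_key=True),
    sa.Column('descriptor', sa.Text),
)

def store_schema(conn, table_name, descriptor):
    conn.execute(schemas_table.insert().values(
        table_name=table_name, descriptor=json.dumps(descriptor)))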
Really related - frictionlessdata/datapackage-py#69
Requires #27
Initial implementation restrictions:
- fields[].constraints.unique is not supported
- fields[].constraints.minLength is not supported
- fields[].constraints.maxLength is not supported
- fields[].constraints.minimum is not supported
- fields[].constraints.maximum is not supported
- fields[].constraints.pattern is not supported
- fields[].constraints.enum is not supported

Reference:
Hey @pudo
We are going to update this lib using jsontableschema, etc., and also implement https://github.com/okfn/datapackage-sql-py for relational table support, etc. It will be a bumpy ride, but we are thinking that most of the API will be modeled on the great work @roll has recently done for BigQuery integration: https://github.com/okfn?utf8=✓&query=bigquery
All work will be done in a new feature branch. If there is serious conflict with your original design here for jtssql, LMK and we can discuss, and/or break it out to a new home.