frictionlessdata / tableschema-sql-py
Generate SQL tables, load and extract data, based on JSON Table Schema descriptors.
License: MIT License
I was wondering if anyone has looked at toolchains that support the creation of ad hoc working environments around a particular dataset.
For example, I have a recipe for creating an analysis environment that links and launches RStudio connected to MySQL seeded with a particular dataset, using Docker Compose and a Dockerfile that creates the seeded database:
# Build this container based on the pre-existing official MySQL Docker image
FROM mysql

# The official image listens for requests on localhost, IP address 127.0.0.1.
# We're going to want it to be a bit more promiscuous
# and listen out for requests coming from *any* IP address (0.0.0.0).
# sed is a Linux command-line text editor - it updates the IP address in the MySQL config file.
RUN sed -i -e "s/^bind-address\s*=\s*127.0.0.1/bind-address=0.0.0.0/" /etc/mysql/my.cnf

# Install a command-line utility called wget
RUN apt-get update && apt-get install -y wget && apt-get clean

# Use wget to download a copy of the MySQL database dump from ergast.com
RUN wget http://ergast.com/downloads/f1db.sql.gz -P /docker-entrypoint-initdb.d

# Unzip the download.
# The original MySQL container is configured to install items
# in the /docker-entrypoint-initdb.d/ directory when it starts up.
RUN gunzip /docker-entrypoint-initdb.d/f1db.sql.gz
The seeded database is linked to RStudio:
ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"

f1dj_rstudio:
  image: rocker/tidyverse
  ports:
    - "8787:8787"
  links:
    - ergastdb:ergastdb
  volumes:
    - ./rstudio:/home/rstudio
So in the extreme case, I imagine running a command something like the following as an equivalent to the above:
reallyfrictionless -app rocker/tidyverse -app jupyter/datascience-notebook -dbms mysql -dbalias ergastdb -db ergastdb -dbpwd f1 -datasrc http://ergast.com/downloads/f1db.sql.gz -datatyp sql
and consequently being presented with an environment (Jupyter notebooks and RStudio) opened in my browser that lets me link to the populated DB and access the data from it directly.
(Actually it'd have to be a tad more complicated to support e.g. mounted directories/volumes? Hmm... Maybe easier if we just allow a single -app, then -guestdir and -hostdir switches, or a single -appvolume switch?)
-app: name of application container(s) to install
-dbms: DBMS container
-dbalias: alias by which the DBMS is accessed from app containers
-dbpwd: DBMS root password
-db: database created and used inside the DBMS
-datasrc: data file
-datatyp: data file type

For a datapackage, use -datatyp datapackage and the URL to the data package, and configure the database from that?
(To simplify things on the DBMS side, it may be sensible to define images reallyfrictionless/postgres, reallyfrictionless/mysql etc. that have standardised config variables?)
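For illustration, here is a minimal sketch of how such switches might expand into a Compose-style definition. Everything here is an assumption: build_compose and the mapping of switches to keys are hypothetical, not an existing tool, and PyYAML is assumed to be available.

import yaml

# Hypothetical: expand the proposed CLI switches into a Compose-style
# service definition (assumes a MySQL-like DBMS image for the env vars).
def build_compose(app, dbms, dbalias, db, dbpwd, datasrc):
    return {
        'services': {
            dbalias: {
                'image': dbms,
                'environment': {
                    'MYSQL_ROOT_PASSWORD': dbpwd,
                    'MYSQL_DATABASE': db,
                    'DATA_SOURCE': datasrc,  # consumed by a seeding init script
                },
                'expose': ['3306'],
            },
            'app': {
                'image': app,
                'links': ['%s:%s' % (dbalias, dbalias)],
            },
        }
    }

print(yaml.safe_dump(build_compose(
    'rocker/tidyverse', 'mysql', 'ergastdb', 'ergastdb', 'f1',
    'http://ergast.com/downloads/f1db.sql.gz')))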
In the interim, how about some simple Dockerfile recipes that allow a datapackage to be identified and a runnable DBMS docker image created from it?
Without any data parsing code etc., just a schema validator and an object to reflect it.
Looks like the last commit is from around 2021 and the latest release from around 2019 on GitHub. What's the status of this package?
Thanks
For now only PostgreSQL is tested (and therefore supported).
From @OriHoch
From @vshih
import sqlalchemy as sa
engine = sa.create_engine('mysql://...')
from jsontableschema import Table
t = Table(SOURCE, SCHEMA)
t.save('test_package', backend='sql', engine=engine)
# _mysql_exceptions.ProgrammingError: (1146, "Table 'mydatabase.users' doesn't exist")
related #64
The current implementation does not handle the case when existing data needs to be updated. Here is a test case that illustrates this:
import sqlalchemy as sa
from jsontableschema_sql import Storage

def test_update():
    schema = {
        'primaryKey': 'id',
        'fields': [
            {'name': 'id', 'type': 'integer'},
            {'name': 'comment', 'type': 'string'},
        ]
    }
    storage = Storage(engine=sa.create_engine('sqlite://'))
    storage.create('data', schema)

    # Write data for the first time
    storage.write('data', [(1, 'a'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'a'), (2, 'b')]

    # Write changed data with same primary keys on top of existing ones
    storage.write('data', [(1, 'x'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'x'), (2, 'b')]
This test raises the following error on the second write:
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError)
UNIQUE constraint failed: data.id
[SQL: 'INSERT INTO data (id, comment) VALUES (?, ?)']
[parameters: ((1, 'x'), (2, 'b'))]
It would be nice if existing data could be updated.
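Until the library supports this natively, one possible approach is an upsert. Here is a minimal sketch assuming SQLAlchemy 1.4+ and its SQLite dialect; upsert_rows is a hypothetical helper, not part of jsontableschema_sql:

import sqlalchemy as sa
from sqlalchemy.dialects.sqlite import insert

# Hypothetical helper: insert rows, updating non-key columns on
# primary-key conflict (SQLite "ON CONFLICT DO UPDATE" syntax).
def upsert_rows(engine, table, rows):
    stmt = insert(table).values(rows)
    pk_names = {c.name for c in table.primary_key.columns}
    stmt = stmt.on_conflict_do_update(
        index_elements=sorted(pk_names),
        set_={name: stmt.excluded[name]
              for name in table.columns.keys() if name not in pk_names},
    )
    with engine.begin() as conn:
        conn.execute(stmt)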
- Storage.write() or write by chunks (#29)
- Storage.read() (possible SQLAlchemy bug)
- Current version is missing from API.

Convert/restore (schema, naming) operations could be moved to a Mapper class or mapper module with good testing capabilities.
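For what such a Mapper extraction could look like, here is a minimal sketch; the class name and methods are assumptions, not the library's actual design:

# Hypothetical Mapper: isolates logical-name <-> physical-name and
# schema conversion concerns so they can be unit-tested in isolation.
class Mapper(object):

    def __init__(self, prefix=''):
        self.__prefix = prefix

    def convert_bucket(self, bucket):
        # Logical bucket name -> physical table name
        return self.__prefix + bucket

    def restore_bucket(self, table_name):
        # Physical table name -> logical bucket name (None if not managed)
        if table_name.startswith(self.__prefix):
            return table_name[len(self.__prefix):]
        return None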
make test fails with:
make: virtualenv-3.4: No such file or directory
make: *** [pyenv/bin/python] Error 1
BTW, is this make stuff a replacement for the "old-fashioned" nosetests stuff? I was sort of expecting to do:
pip install -e .
nosetests tests
When generating a Data Package, using:
# https://github.com/frictionlessdata/datapackage-py
from datapackage import pull_datapackage
from sqlalchemy import create_engine

engine = create_engine('mysql://root@localhost/enliten')

# Pull
pull_datapackage(
    descriptor='enliten-datapackage/datapackage.json',
    name='enliten-datapackage',
    backend='sql',
    engine=engine)
none of the resources have names (now required). This results in the foreignKeys generated in one of the resources in the datapackage.json referring to null resources.
"schema": {
"foreignKeys": [
{
"fields": "user_id",
"reference": {
"resource": null,
"fields": "id"
}
},
{
"fields": "group_id",
"reference": {
"resource": null,
"fields": "id"
}
}
]
(Found via this frictionlessdata/pilot-dm4t#19)
{"fields": [
{"name": "id", "type": "integer"},
{"name": "date", "type": "string"}
],
"primaryKey": ["id"]
If a primaryKey is set in the JTS, use it.
Maybe we want a flag where we ignore it and create our own auto-increment key, but set the primaryKey field(s) to unique(?)
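A minimal sketch of what that flag could look like when building the table; build_table and ignore_pk are hypothetical names, and the type mapping is trimmed to two types for brevity:

import sqlalchemy as sa

# Hypothetical: build a table honouring the JTS primaryKey, or - with
# ignore_pk=True - use a surrogate auto-increment key and keep the
# declared primaryKey fields unique instead.
def build_table(name, schema, metadata, ignore_pk=False):
    pk = schema.get('primaryKey', [])
    pk = [pk] if isinstance(pk, str) else pk
    columns = []
    for field in schema['fields']:
        col_type = sa.Integer if field['type'] == 'integer' else sa.Text
        columns.append(sa.Column(
            field['name'], col_type,
            primary_key=(not ignore_pk and field['name'] in pk)))
    if ignore_pk and pk:
        columns.insert(0, sa.Column('_id', sa.Integer, primary_key=True))
        columns.append(sa.UniqueConstraint(*pk))
    return sa.Table(name, metadata, *columns)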
I have a data package with timestamps that have timezone offsets, e.g.:
2020-09-09T07:10:20+02:00
The spec for this field is:
{
  "name": "timestamp",
  "type": "datetime",
  "format": "%Y-%m-%dT%H:%M:%S%z"
}
When saving this data package to sqlite, timezone offsets get ignored and are saved in sqlite as:
2020-09-09 07:10:20.000000
Retaining timezone offsets is not possible in sqlite's datetime strings, but it would be good if timezones were converted to UTC before uploading to sqlite:
2020-09-09 05:10:20.000000
Is that possible?
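A minimal sketch of the proposed normalisation, assuming the parsed values are timezone-aware datetime objects; to_utc_naive is a hypothetical helper, not part of the library:

from datetime import datetime, timezone

# Hypothetical: convert an aware datetime to UTC and drop the offset so
# it fits sqlite's naive datetime strings.
def to_utc_naive(value):
    if value.tzinfo is not None:
        value = value.astimezone(timezone.utc).replace(tzinfo=None)
    return value

# datetime.fromisoformat understands "+02:00" offsets (Python 3.7+)
print(to_utc_naive(datetime.fromisoformat('2020-09-09T07:10:20+02:00')))
# -> 2020-09-09 05:10:20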
Please preserve this line to notify @roll (lead of this repository)
Requires #27 and frictionlessdata/tableschema-py#57
The readme is expected to be in RST. See a solution here: https://github.com/frictionlessdata/jsontableschema-pandas-py
Required by frictionlessdata/tableschema-py#49
Check that storage.tables and storage.check() work (could be unsynced?).

We want to use conda-forge as a primary distribution channel for the PUDL package (since it depends on a bunch of PyData stuff that can be a pain for folks to build on their own), but this means that our dependencies also need to be available via conda, and right now tableschema-sql is the only thing that will have to be installed via pip. Would it be possible to set up a recipe/feedstock on conda-forge for tableschema-sql?
See catalyst-cooperative/pudl#430
Please preserve this line to notify @roll (maintainer of this repository)
It's again not formatted - https://pypi.python.org/pypi?:action=display&name=jsontableschema-sql&version=0.4.0 - not sure why for now. Probably after the md -> rst conversion there are some problems with the rst file, in PyPI's opinion.
It should:
Requires #27 and frictionlessdata/tableschema-py#38
In version 1.3.0, constraint checks were added. The pattern constraint, though, was implemented using a LIKE SQL operator, which is not POSIX-RE compatible:
11b41b2#diff-06244424627478397c4e73620b8d452fR72
From the PostgreSQL documentation, the correct operator should probably have been ~.
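To illustrate the difference, a sketch assuming PostgreSQL; the table and column names here are made up for the example:

import sqlalchemy as sa

# LIKE treats the pattern as SQL wildcards (% and _), so a regex such
# as '^[A-Z]{3}$' will not match as intended:
like_check = sa.CheckConstraint("code LIKE '^[A-Z]{3}$'")

# PostgreSQL's ~ operator applies a POSIX regular expression:
regex_check = sa.CheckConstraint("code ~ '^[A-Z]{3}$'")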
Please preserve this line to notify @roll (lead of this repository)
The driver should be updated for the jsontableschema-v0.7 changes.
Hi, I am looking for a PHP implementation, so perhaps we can use tableschema-py as a reference algorithm for tableschema-php issue 35... The core seems to be the mapping, and it can be expressed as JSON for reuse in other contexts.
mapping = {
    'any': sa.Text,
    'array': None,
    'boolean': sa.Boolean,
    'date': sa.Date,
    'datetime': sa.DateTime,
    'duration': None,
    'geojson': None,
    'geopoint': None,
    'integer': sa.Integer,
    'number': sa.Float,
    'object': None,
    'string': sa.Text,
    'time': sa.Time,
    'year': sa.Integer,
    'yearmonth': None,
}
The suggestion: express it in an external file, or generate the code from a CSV with all datatype map descriptions... As "RosettaStone-tabular-datatypes" ;-)
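For instance, a minimal sketch of dumping the mapping above to JSON; the serialisation format here is just an assumption:

import json

# Map each JTS type to the SQLAlchemy type name (or null if unmapped),
# producing a language-neutral description other ports could consume.
portable = {jts: (sql.__name__ if sql else None)
            for jts, sql in mapping.items()}
print(json.dumps(portable, indent=2, sort_keys=True))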
PS: for a more general translation of primitive data types, perhaps we can reuse a CSV at Data Packaged Core Datasets with all datatypes, along the lines of:

Canonic | Description | JSON | SQL-92 | Python | Java | ...
---|---|---|---|---|---|---
JSON | UTF8 string | string | text | str | String | ...
SQL-92 | integer 4 bytes | number | int4 | ... | ... | ...
... | ... | ... | ... | ... | ... | ...
For example, I followed the import_datapackage.py at https://gist.github.com/vitorbaptista/19d476d99595584e9ad5 to import https://github.com/datasets/nasdaq-listings into a SQLite3 database. After that, I tried to export it back to a datapackage by using the export_datapackage.py file at the same gist. I then got the following traceback:
Traceback (most recent call last):
  File "sql-example.py", line 14, in <module>
    datapackage_storage.export_package(storage, 'nasdaq-listings/datapackage.json', 'nasdaq-listings')
  File "/home/vitor/datasets/venv/src/datapackage-storage/datapackage_storage/package.py", line 86, in export_package
    schema = storage.describe(table)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/storage.py", line 180, in describe
    self.__prefix, table, dbtable.columns, dbtable.constraints)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/helpers.py", line 113, in restore_schema
    message = 'Type %s is not supported' % column.type
TypeError: Type TEXT is not supported
The problem is that SQLAlchemy has both sqlalchemy.sql.sqltypes.Text and sqlalchemy.sql.sqltypes.TEXT, and in our mapping at .restore_schema we only use the first one:
# Mapping
mapping = {
    Text: 'string',
    Integer: 'integer',
    Float: 'number',
    Boolean: 'boolean',
}
When we try to get mapping[column.type.__class__] afterwards, it isn't found and the exception is thrown.
I don't know why SQLAlchemy has both .Text and .TEXT (and others), or why one is used and not the other, but we should support both. Luckily for us, .TEXT inherits from .Text, so we can use that.
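A minimal sketch of a subclass-aware lookup based on that observation; type_to_jts is a hypothetical helper, not the library's actual code:

from sqlalchemy import Boolean, Float, Integer, Text

mapping = {Text: 'string', Integer: 'integer',
           Float: 'number', Boolean: 'boolean'}

# Walk the class hierarchy so subclasses like TEXT resolve to the
# mapped base class (Text).
def type_to_jts(column_type):
    for cls in type(column_type).__mro__:
        if cls in mapping:
            return mapping[cls]
    raise TypeError('Type %s is not supported' % column_type)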
When the jsontableschema.Storage interface becomes available.
With a full Tabular Data Package we would have all the resource names plus data file paths and could do a complete load, including setting table names etc. We would also handle foreign keys.
Should this be in this library? If not, maybe we want tdp-to-sql as well.
As per frictionlessdata/frictionlessdata.io#695 (comment)
It would be super if @akariv and I could be marked as leads and have maintainer access here, and also on the PyPI package.
From @sirex on May 10, 2016 11:58
When integrating external data sources into your project, the external data is usually imported into an internal database.
Since external data can be updated frequently, you usually want to update your internal replica as soon as possible.
Like git pull for code, I want to have datapackage pull for data.
There is the jsontableschema-sql-py library, which can import all data into a database, but it can't update existing data.
Here are two use cases:
1. Updating only rows that have been changed since the last update
Let's say your external data provider has a service with SLEEP or CouchDB replication protocol support. That means that after an initial import that took two days, I can now call this external service and get only the data that has changed instead of downloading everything again.
2. Detecting what has been changed since the last update
Let's say another provider keeps data in a git repository, which allows me to update the data quickly using git pull, but there is no way of knowing which rows have changed. Luckily, datapackage pull could detect changes in a really fast way and update my internal database, touching only records that have changed since the last update.
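A minimal sketch of use case 2, assuming rows are compared by a stable fingerprint; row_digest and changed_rows are hypothetical, not an existing datapackage API:

import hashlib

# Stable fingerprint of a row, used to skip unchanged records.
def row_digest(row):
    return hashlib.sha256(repr(tuple(row)).encode('utf-8')).hexdigest()

# Yield only the rows whose key is new or whose contents changed.
def changed_rows(old_rows, new_rows, key=lambda row: row[0]):
    old = {key(row): row_digest(row) for row in old_rows}
    for row in new_rows:
        if old.get(key(row)) != row_digest(row):
            yield row

# Example: only (1, 'x') needs writing, (2, 'b') is untouched.
print(list(changed_rows([(1, 'a'), (2, 'b')], [(1, 'x'), (2, 'b')])))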
Copied from original issue: frictionlessdata/frictionlessdata.io#253
I think it is better to use the type casting of JTSKit rather than the new typecast lib.
Why?
Requires #27
Requires frictionlessdata/tableschema-py#59
On previous iterations I've run into some problems using a 100% pure reflection system. It's now obvious to me that for first-class JSON Table Schema feature support we need to store schemas in the storage (in a meta table _schemas, for example). It's not against the current design and could be implemented relatively quickly.
For tables without a _schemas entry, the current reflection system will still work, with a subset of features supported. So the user will still be able to describe and read a plain SQL table (the same as now).
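A minimal sketch of what such a _schemas meta table could look like; the column names and store_schema helper are assumptions, not the decided design:

import json
import sqlalchemy as sa

metadata = sa.MetaData()

# Hypothetical meta table: one row per managed table, storing the
# original JSON Table Schema descriptor verbatim.
schemas_table = sa.Table(
    '_schemas', metadata,
    sa.Column('table_name', sa.Text, primary_key=True),
    sa.Column('descriptor', sa.Text),
)

def store_schema(conn, table_name, descriptor):
    conn.execute(schemas_table.insert().values(
        table_name=table_name, descriptor=json.dumps(descriptor)))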
Really related - frictionlessdata/datapackage-py#69
Requires #27
Initial implementation restrictions:
- fields[].constraints.unique is not supported
- fields[].constraints.minLength is not supported
- fields[].constraints.maxLength is not supported
- fields[].constraints.minimum is not supported
- fields[].constraints.maximum is not supported
- fields[].constraints.pattern is not supported
- fields[].constraints.enum is not supported

Reference:
Hey @pudo
We are going to update this lib using jsontableschema, etc., and also implement https://github.com/okfn/datapackage-sql-py for relational table support, etc. It will be a bumpy ride, but we are thinking that most of the API will be modeled on the great work @roll has recently done for BigQuery integration: https://github.com/okfn?utf8=✓&query=bigquery
All work will be done in a new feature branch. If there is serious conflict with your original design here for jtssql, LMK and we can discuss, and/or break it out to a new home.