
tableschema-sql-py's Introduction

tableschema-sql-py


Generate and load SQL tables based on Table Schema descriptors.

Features

  • implements tableschema.Storage interface
  • provides additional features like indexes and updating


Getting Started

Installation

The package uses semantic versioning, which means major versions could include breaking changes. It's highly recommended to specify a package version range in your setup/requirements file, e.g. package>=1.0,<2.0.

pip install tableschema-sql

Documentation

from datapackage import Package 
from tableschema import Table
from sqlalchemy import create_engine

# Create sqlalchemy engine
engine = create_engine('sqlite://')

# Save package to SQL
package = Package('datapackage.json')
package.save(storage='sql', engine=engine)

# Load package from SQL
package = Package(storage='sql', engine=engine)
package.resources

API Reference

Storage

Storage(self, engine, dbschema=None, prefix='', reflect_only=None, autoincrement=None)

SQL storage

This package implements the Tabular Storage interface (see the full documentation via the link):

Storage

Only the additional API is documented here.

Arguments

  • engine (object): sqlalchemy engine
  • dbschema (str): name of database schema
  • prefix (str): prefix for all buckets
  • reflect_only (callable): a boolean predicate to filter the list of table names when reflecting
  • autoincrement (str/dict): add an autoincrement column at the beginning; if a string, it is the name of the autoincrement column; if a dict, it is a mapping of autoincrement column names indexed by bucket names, for example {'bucket1': 'id', 'bucket2': 'other_id'} (see the sketch below)
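For illustration, a minimal sketch of constructing a storage with a per-bucket autoincrement mapping (the engine URL, prefix, and bucket names are made up for the example):

from sqlalchemy import create_engine
from tableschema_sql import Storage

engine = create_engine('sqlite://')

# Prefix all tables with 'myapp_' and add an autoincrementing
# 'id' column to 'articles' and 'other_id' to 'comments'
storage = Storage(
    engine=engine,
    prefix='myapp_',
    autoincrement={'articles': 'id', 'comments': 'other_id'},
)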

storage.create

storage.create(self, bucket, descriptor, force=False, indexes_fields=None)

Create bucket

Arguments

  • indexes_fields (str[]): list of tuples containing field names, or list of such lists
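A sketch of creating a bucket with indexes, continuing with the storage instance from the sketch above (the descriptor and field names are hypothetical):

descriptor = {
    'fields': [
        {'name': 'id', 'type': 'integer'},
        {'name': 'name', 'type': 'string'},
        {'name': 'email', 'type': 'string'},
    ]
}

# One single-field index on 'name' and one composite index on ('name', 'email')
storage.create('people', descriptor, indexes_fields=[['name'], ['name', 'email']])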

storage.write

storage.write(self, bucket, rows, keyed=False, as_generator=False, update_keys=None, buffer_size=1000, use_bloom_filter=True)

Write to bucket

Arguments

  • keyed (bool): accept keyed rows
  • as_generator (bool): returns generator to provide writing control to the client
  • update_keys (str[]): update instead of inserting if key values match existent rows
  • buffer_size (int=1000): maximum number of rows to try and write to the db in one batch
  • use_bloom_filter (bool=True): should we use a bloom filter to optimize DB update performance (in exchange for some setup time)
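A sketch of writing keyed rows with update-on-match behaviour via update_keys, again using the storage instance from above (bucket name and rows are hypothetical):

rows = [
    {'id': 1, 'name': 'english'},
    {'id': 2, 'name': 'chinese'},
]

# Insert new rows; rows whose 'id' already exists are updated instead
storage.write('languages', rows, keyed=True, update_keys=['id'], buffer_size=500)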

Contributing

The project follows the Open Knowledge International coding standards.

The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:

$ make install

To run tests with linting and coverage:

$ make test

Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.

v1.3

  • Implemented constraints loading to a database

v1.2

  • Add option to configure buffer size, bloom filter use (#77)

v1.1

  • Added support for the autoincrement parameter to be a mapping
  • Fixed autoincrement support for SQLite and MySQL

v1.0

  • Initial driver implementation.

tableschema-sql-py's Issues

Should support database engines other than PostgreSQL

Overview

For now only PostgreSQL is tested (and therefore supported).

Plan

  • implement support for other database engines like MySQL or SQLite

From @OriHoch

reproduction steps

  • be a new developer using jsontableschema-sql-py
  • look at the README, see the example which uses sqlite
  • be happy that I can use sqlite and possibly other engines because it uses sqlalchemy
  • start using it and developing based on it
  • use an 'object' or 'array' type field in a schema

expected

  • should work

actual

  • error
  • looking further, the code relies specifically on postgresql for object / array type field

notes

  • I think it can be solved by dumping the values as JSON into a string field
  • until it's fixed, state clearly in the README and in the examples that PostgreSQL is the only fully supported engine
  • I guess the main issue is the lack of SQLite support, since SQLite is great for letting new developers start working quickly. If at least SQLite were supported (for development/testing only), that would be great.
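A minimal sketch of the JSON-string workaround suggested above; the helper and variable names are made up, and a Storage instance storage, a descriptor, and source_rows are assumed to exist:

import json

def jsonify_row(descriptor, row):
    # Serialize object/array values to JSON strings so engines
    # without native JSON/array column types can store them
    result = []
    for field, value in zip(descriptor['fields'], row):
        if field['type'] in ('object', 'array'):
            result.append(json.dumps(value))
        else:
            result.append(value)
    return result

rows = (jsonify_row(descriptor, row) for row in source_rows)
storage.write('data', rows)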

From @vshih

import sqlalchemy as sa
engine = sa.create_engine('mysql://...')
from jsontableschema import Table
t = Table(SOURCE, SCHEMA)
t.save('test_package', backend='sql', engine=engine)
# _mysql_exceptions.ProgrammingError: (1146, "Table 'mydatabase.users' doesn't exist")

related #64

Mismatch in primary key types should not cause a failure on SQL writing

reproduction

  • table schema:
{"fields": [
    {"name": "id", "type": "integer"},
    {"name": "date", "type": "string"}
],
"primaryKey": ["id"]
}
  • write some rows to a PostgreSQL DB - but in the rows, have the id as type string (but numeric)
  • data in DB is inserted correctly, and DB converts it to integer
  • now, write the same rows again

expected

  • provide an error message, warning about type mismatch
  • or match rows in DB (as opposed to bloom)

actual

  • writer.__prepare_bloom - collects all data as integers
  • writer.__check_existing - compares with a string id - which doesn't match
  • this causes insert
  • when inserting, the DB matches the primary key (even though types don't match)
  • error about duplicate key
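A possible client-side workaround is to cast rows to the schema's declared types before writing, for example with tableschema's Schema.cast_row (a sketch; the rows, bucket name, and storage instance are made up):

from tableschema import Schema

schema = Schema({
    'fields': [
        {'name': 'id', 'type': 'integer'},
        {'name': 'date', 'type': 'string'},
    ],
    'primaryKey': ['id'],
})

# Cast string ids such as '42' to integers so they match what the
# database (and the bloom filter) already hold
rows = [schema.cast_row(row) for row in [('42', '2020-09-01'), ('43', '2020-09-02')]]
storage.write('data', rows)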

Data synchronisation?

From @sirex on May 10, 2016 11:58

When integrating external data sources into your project, external data is usually imported into an internal database.

Since external data can be updated frequently, you usually want to update your internal replica as soon as possible.

Like git pull for code, I want to have datapackage pull for data.

There is a jsontableschema-sql-py library, that can import all data to a database, but it can't update existing data.

Here are two use cases:

1. Updating only rows that have changed since the last update

Let's say your external data provider has a service with SLEEP or CouchDB replication protocol support. This means that after the initial import, which took two days, I can now call this external service and get only the data that has changed instead of downloading everything again.

2. Detecting what has changed since the last update

Let's say another provider keeps data in a git repository, which allows me to update the data quickly using git pull, but there is no way of knowing which rows have changed. Luckily, datapackage pull could detect changes very quickly and update my internal database, touching only records that have changed since the last update.

Copied from original issue: frictionlessdata/frictionlessdata.io#253

Is the package still maintained?

It looks like the last commit was around 2021 and the latest release around 2019 on GitHub. What's the status of this package?

Thanks

Updating existing data?

The current implementation does not handle the case when existing data needs to be updated. Here is a test case that illustrates this:

import sqlalchemy as sa

from jsontableschema_sql import Storage


def test_update():
    schema = {
        'primaryKey': 'id',
        'fields': [
            {'name': 'id', 'type': 'integer'},
            {'name': 'comment', 'type': 'string'},
        ]
    }

    storage = Storage(engine=sa.create_engine('sqlite://'))
    storage.create('data', schema)

    # Write data for the first time
    storage.write('data', [(1, 'a'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'a'), (2, 'b')]

    # Write changed data with same primary keys on top of existing ones
    storage.write('data', [(1, 'x'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'x'), (2, 'b')]

This test raises the following error on the second write:

sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError)
UNIQUE constraint failed: data.id
[SQL: 'INSERT INTO data (id, comment) VALUES (?, ?)']
[parameters: ((1, 'x'), (2, 'b'))]

It would be nice if existing data could be updated.
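For reference, the write method documented above accepts an update_keys argument; assuming a version that supports it, the second write in the test could be expressed as:

# Update rows whose primary key already exists instead of inserting duplicates
storage.write('data', [(1, 'x'), (2, 'b')], update_keys=['id'])
assert list(storage.read('data')) == [(1, 'x'), (2, 'b')]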

Should be able to save tables to MySQL

@vshih wrote:

import sqlalchemy as sa
engine = sa.create_engine('mysql://...')
from jsontableschema import Table
t = Table(SOURCE, SCHEMA)
t.save('test_package', backend='sql', engine=engine)
# _mysql_exceptions.ProgrammingError: (1146, "Table 'mydatabase.users' doesn't exist")

related #64

Express mapping in an external JSON to reuse as standard reference

Hi, I am looking for a PHP implementation, so perhaps we can use tableschema-py as the reference algorithm for tableschema-php issue 35... The core seems to be the mapping, which could be expressed as JSON for reuse in other contexts.

        mapping = {
            'any': sa.Text,
            'array': None,
            'boolean': sa.Boolean,
            'date': sa.Date,
            'datetime': sa.DateTime,
            'duration': None,
            'geojson': None,
            'geopoint': None,
            'integer': sa.Integer,
            'number': sa.Float,
            'object': None,
            'string': sa.Text,
            'time': sa.Time,
            'year': sa.Integer,
            'yearmonth': None,
        }

The suggestion: express it in an external file, or generate code from a CSV with all datatype mapping descriptions... as a "RosettaStone-tabular-datatypes" ;-)
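A sketch of how such an external mapping could be loaded in Python, resolving SQLAlchemy type names by attribute lookup (the file name and JSON format are hypothetical):

import json
import sqlalchemy as sa

# Hypothetical file, e.g. {"boolean": "Boolean", "integer": "Integer",
# "string": "Text", "array": null, "object": null, ...}
with open('tableschema-sql-mapping.json') as source:
    names = json.load(source)

mapping = {field_type: getattr(sa, type_name) if type_name else None
           for field_type, type_name in names.items()}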


PS: for more general translation of primitive data types, a CSV at Data Packaged Core Datasets with all datatypes, perhaps we can reuse from a

Canonic Description    | JSON   | SQL-92 | Python | Java   | ...
JSON UTF8 string       | string | text   | str    | String | ...
SQL-92 integer 4 bytes | number | int4   | ...    | ....   | ....
....                   | ....   | ...    | ....   | ....   | ...

Consider whether we support full Tabular Data Package loading

With full Tabular Data Package we would have all the resource names plus data file paths and could do a complete load including setting table names etc.

Also we would handle foreign keys.

Should this be in this library? If not, maybe we want tdp-to-sql as well.

Resources have no "name"s

When generating a Data Package using:

# https://github.com/frictionlessdata/datapackage-py

from datapackage import pull_datapackage
from sqlalchemy import create_engine

engine = create_engine('mysql://root@localhost/enliten')

# Pull
pull_datapackage(
    descriptor='enliten-datapackage/datapackage.json',
    name="enliten-datapackage",
    backend='sql',
    engine=engine)

None of the resources have names (now required). This results in the foreignKeys generated in one of the resources in the datapackage.json referring to null resources.

"schema": {
  "foreignKeys": [
    {
      "fields": "user_id",
      "reference": {
        "resource": null,
        "fields": "id"
      }
    },
    {
      "fields": "group_id",
      "reference": {
        "resource": null,
        "fields": "id"
      }
    }
  ]
}

(Found via this frictionlessdata/pilot-dm4t#19)

Improve memory usage

  • use stream in Storage.write() or write by chunks (#29)
  • investigate memory usage in Storage.read() (possible SQLAlchemy bug)

Store schemas in storage

Requires frictionlessdata/tableschema-py#59

In previous iterations I've run into some problems using a 100% pure reflection system. It's now obvious to me that, for first-class JSON Table Schema feature support, we need to store schemas in the storage (for example in a _schemas meta table). It's not against the current design and could be implemented relatively quickly.

For tables without an entry in _schemas, the current reflection system would still work with a subset of supported features, so the user would still be able to describe and read a plain SQL table (the same as now).

Really related - frictionlessdata/datapackage-py#69

Update implementation

Hey @pudo

We are going to update this lib using jsontableschema, etc., and also implement https://github.com/okfn/datapackage-sql-py for relational table support, etc. It will be a bumpy ride, but we are thinking that most of the API will be modeled on the great work @roll has recently done for BigQuery integration: https://github.com/okfn?utf8=✓&query=bigquery

All work will be done in a new feature branch. If there is serious conflict with your original design here for jtssql, LMK and we can discuss, and/or break it out to a new home.

Better type support

Improve type support.

  • string
  • number
  • integer
  • boolean
  • object
  • array
  • datetime
  • date
  • time
  • duration (moved #23)
  • geopoint (moved #26)
  • geojson
  • any (moved #24)

Update to jsontableschema-v0.7

Overview

The driver should be updated to jsontableschema-v0.7 changes.

Tasks

  • fix breaking storage changes
  • add new storage features
  • remove deprecated code
  • fix readme on PyPi

"Runnable" data packages

I was wondering if anyone has looked at toolchains that support the creation of ad hoc working environments around a particular dataset.

For example, I have a recipe for creating an analysis environment that links and launches RStudio connected to MySQL seeded with a particular dataset, using Docker Compose and a Dockerfile that creates the seeded database:

#Build this container based on the pre-existing official MySQL Docker image
FROM mysql

#The official image listens to requests on localhost, IP address 127.0.0.1
#We're going to want it to be a bit more promiscuous
#and listen out for requests coming from *any* IP address (0.0.0.0)
#sed is a Linux commandline text editor - it updates the IP address in the MySQL config file
RUN sed -i -e"s/^bind-address\s*=\s*127.0.0.1/bind-address=0.0.0.0/" /etc/mysql/my.cnf

#Install a command line utility called wget
RUN apt-get update && apt-get install -y wget && apt-get clean

#Use wget to download a copy of the MySQL database dump from ergast.com
RUN wget http://ergast.com/downloads/f1db.sql.gz -P /docker-entrypoint-initdb.d

#Unzip the download
#The original MySQL container is configured to install items
#in the /docker-entrypoint-initdb.d/ directory when it starts up
RUN gunzip /docker-entrypoint-initdb.d/f1db.sql.gz

The seeded database is linked to RStudio:

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"
    
f1dj_rstudio:
  image: rocker/tidyverse 
  ports:
    - "8787:8787"
  links:
    - ergastdb:ergastdb
  volumes:
    - ./rstudio:/home/rstudio

So in the extreme case, I imagine running a command something like the following as an equivalent to the above:

reallyfrictionless -app rocker/tidyverse -app jupyter/datascience-notebook -dbms mysql -dbalias ergastdb -db ergastdb -dbpwd f1 -datasrc http://ergast.com/downloads/f1db.sql.gz -datatyp sql

and consequently being presented with an environment (Jupyter notebooks and RStudio) opened in my browser that let me link to the populated DB and access the data from it directly.

(Actually it'd have to be a tad more complicated to support eg mounted directories/volumes? Hmm... Maybe easier if we just allow a single -app then -guestdir and -hostdir switches, or a single -appvolume switch?)

-app: name of application container(s) to install
-dbms: DBMS container
-dbalias: alias by which DBMS is accessed from app containers
-dbpwd: DBMS root password
-db: database created and used inside the DBMS
-datasrc: data file
-datatyp: datafile type

For a datapackage, use -datatyp datapackage and the URL to the data package, and configure the database from that?

(To simplify things on the DBMS side, may be sensible to define images reallyfrictionless/postgres, reallyfrictionless/mysql etc that have standardised config variables?)

In the interim, how about some simple Dockerfile recipes that allow a datapackage to be identified and a runnable DBMS docker image created from it?

Timezone offset ignored when loading into sqlite

Overview

I have a data package with timestamps that have timezone offsets, e.g.:

2020-09-09T07:10:20+02:00

The spec for this field is:

{
  "name": "timestamp",
  "type": "datetime",
  "format": "%Y-%m-%dT%H:%M:%S%z"
}

When saving this data package to sqlite, timezone offsets get ignored and are saved in sqlite as:

2020-09-09 07:10:20.000000

Retaining timezone offsets is not possible in SQLite's datetime strings, but it would be good if timezones were converted to UTC before uploading to SQLite:

2020-09-09 05:10:20.000000

Is that possible?
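A minimal sketch of the requested normalisation as a client-side step, converting offset-aware datetimes to naive UTC before the rows reach SQLite (source_rows is assumed; this is not current library behaviour):

from datetime import datetime, timezone

def to_naive_utc(value):
    # Convert offset-aware datetimes to naive UTC; pass other values through
    if isinstance(value, datetime) and value.tzinfo is not None:
        return value.astimezone(timezone.utc).replace(tzinfo=None)
    return value

rows = [[to_naive_utc(value) for value in row] for row in source_rows]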


Please preserve this line to notify @roll (lead of this repository)

Support all Table Schema types

Overview

It should:

  • use native type
  • fallback to string

Tasks

  • analyze specs-v1 to get a list of unsupported types
  • support these types using the algorithm described above (see the sketch below)
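A sketch of the native-type-with-string-fallback approach described above (an illustration of the idea, not the library's code):

import sqlalchemy as sa

NATIVE_TYPES = {
    'boolean': sa.Boolean,
    'date': sa.Date,
    'datetime': sa.DateTime,
    'integer': sa.Integer,
    'number': sa.Float,
    'string': sa.Text,
    'time': sa.Time,
}

def convert_type(field_type):
    # Use a native column type where one exists, otherwise fall back to TEXT
    return NATIVE_TYPES.get(field_type, sa.Text)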

There're multiple sqlalchemy.sql.sqltypes.* for TEXT and others

For example, I followed the import_datapackage.py at https://gist.github.com/vitorbaptista/19d476d99595584e9ad5 to import https://github.com/datasets/nasdaq-listings into a SQLite3 database. After that, I tried to export it back to a datapackage by using the export_datapackage.py file at the same gist. I then got the following traceback:

Traceback (most recent call last):
  File "sql-example.py", line 14, in <module>
    datapackage_storage.export_package(storage, 'nasdaq-listings/datapackage.json', 'nasdaq-listings')
  File "/home/vitor/datasets/venv/src/datapackage-storage/datapackage_storage/package.py", line 86, in export_package
    schema = storage.describe(table)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/storage.py", line 180, in describe
    self.__prefix, table, dbtable.columns, dbtable.constraints)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/helpers.py", line 113, in restore_schema
    message = 'Type %s is not supported' % column.type
TypeError: Type TEXT is not supported

The problem is that SQLAlchemy has both sqlalchemy.sql.sqltypes.Text and sqlalchemy.sql.sqltypes.TEXT, and in our mapping at .restore_schema we only use the first one:

    # Mapping
    mapping = {
        Text: 'string',
        Integer: 'integer',
        Float: 'number',
        Boolean: 'boolean',
    }

When we try to get mapping[column.type.__class__] afterwards, it isn't found and the exception is thrown.

I don't know why SQLAlchemy has both .Text and .TEXT (and others), or why one is used and not the other, but we should support both. Lucky for us, .TEXT inherits from .Text, so we can use that.
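A sketch of the suggested fix, resolving the Table Schema type with isinstance so subclasses such as sqlalchemy.sql.sqltypes.TEXT hit the Text entry (an illustration, not the actual patch):

from sqlalchemy import Boolean, Float, Integer, Text

mapping = {
    Text: 'string',
    Integer: 'integer',
    Float: 'number',
    Boolean: 'boolean',
}

def restore_type(column_type):
    # TEXT is a subclass of Text, so isinstance resolves it to 'string'
    for sa_class, ts_type in mapping.items():
        if isinstance(column_type, sa_class):
            return ts_type
    raise TypeError('Type %s is not supported' % column_type)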

Support primaryKey

If a primaryKey is set in the JTS use it.

Maybe we want a flag where we ignore it and create our own autoincrement, but set the primaryKey field to unique(?)

Running tests ...

make test fails with

make: virtualenv-3.4: No such file or directory
make: *** [pyenv/bin/python] Error 1

BTW, is this make stuff a replacement for the "old-fashioned" nosetests workflow? I was sort of expecting to do:

pip install -e .
nosetests tests

Support all Table Schema constraints

Requires #27

Initial implementation restrictions:

  • fields[].constraints.unique is not supported
  • fields[].constraints.minLength is not supported
  • fields[].constraints.maxLength is not supported
  • fields[].constraints.minimum is not supported
  • fields[].constraints.maximum is not supported
  • fields[].constraints.pattern is not supported
  • fields[].constraints.enum is not supported
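For reference, a sketch of how a few of these constraints could map onto SQLAlchemy column arguments once implemented (an illustration only, not the library's implementation):

import sqlalchemy as sa

metadata = sa.MetaData()

# constraints.unique -> unique=True; minimum/maximum and maxLength -> CHECK constraints
table = sa.Table(
    'people', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('age', sa.Integer, sa.CheckConstraint('age >= 0 AND age <= 150'), unique=True),
    sa.Column('name', sa.Text, sa.CheckConstraint('length(name) <= 40')),
)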

