tableschema-sql-py's Issues

Better type support

Improve type support.

  • string
  • number
  • integer
  • boolean
  • object
  • array
  • datetime
  • date
  • time
  • duration (moved #23)
  • geopoint (moved #26)
  • geojson
  • any (moved #24)

"Runnable" data packages

I was wondering if anyone has looked at toolchains that support the creation of ad hoc working environments around a particular dataset.

For example, I have a recipe for creating an analysis environment that links and launches RStudio connected to MySQL seeded with a particular dataset, using Docker Compose and a Dockerfile that creates the seeded database:

#Build this container based on the pre-existing official MySQL Docker image
FROM mysql

#The official image listens for requests on localhost, IP address 127.0.0.1
#We're going to want it to be a bit more promiscuous
#and listen out for requests coming from *any* IP address (0.0.0.0)
#sed is a Linux command-line text editor - it updates the IP address in the MySQL config file
RUN sed -i -e"s/^bind-address\s*=\s*127.0.0.1/bind-address=0.0.0.0/" /etc/mysql/my.cnf

#Install a command line utility called wget
RUN apt-get update && apt-get install -y wget && apt-get clean

#Use wget to download a copy of the MySQL database dump from ergast.com
RUN wget http://ergast.com/downloads/f1db.sql.gz -P /docker-entrypoint-initdb.d

#Unzip the download
#The original MySQL container is configured to install items
#in the /docker-entrypoint-initdb.d/ directory when it starts up
RUN gunzip /docker-entrypoint-initdb.d/f1db.sql.gz

The seeded database is linked to RStudio:

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"
    
f1dj_rstudio:
  image: rocker/tidyverse 
  ports:
    - "8787:8787"
  links:
    - ergastdb:ergastdb
  volumes:
    - ./rstudio:/home/rstudio

So in the extreme case, I imagine running a command something like the following as an equivalent to the above:

reallyfrictionless -app rocker/tidyverse -app jupyter/datascience-notebook -dbms mysql -dbalias ergastdb -db ergastdb -dbpwd f1 -datasrc http://ergast.com/downloads/f1db.sql.gz -datatyp sql

and consequently being presented with an environment (Jupyter notebooks and RStudio) opened in my browser that lets me link to the populated DB and access the data from it directly.

(Actually it'd have to be a tad more complicated to support e.g. mounted directories/volumes? Hmm... Maybe easier if we just allow a single -app, then -guestdir and -hostdir switches, or a single -appvolume switch?)

-app: name of application container(s) to install
-dbms: DBMS container
-dbalias: alias by which the DBMS is accessed from app containers
-dbpwd: DBMS root password
-db: database created and used inside the DBMS
-datasrc: data file
-datatyp: data file type

For a datapackage, use -datatyp datapackage and the URL to the data package, and configure the database from that?

(To simplify things on the DBMS side, may be sensible to define images reallyfrictionless/postgres, reallyfrictionless/mysql etc that have standardised config variables?)

In the interim, how about some simple Dockerfile recipes that allow a datapackage to be identified and a runnable DBMS docker image created from it?
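As a sketch of what the seeding step in such a recipe could run, the loading itself can be done in a few lines of Python. This assumes the current datapackage and tableschema_sql APIs; the URL and connection string below are placeholders, not part of this issue:

# Hedged sketch: push every tabular resource of a Data Package into a database.
from datapackage import Package
from sqlalchemy import create_engine
from tableschema_sql import Storage

engine = create_engine('mysql://root:f1@ergastdb/ergastdb')   # placeholder DSN
storage = Storage(engine=engine)

package = Package('http://example.com/datapackage.json')      # placeholder URL
for resource in package.resources:
    storage.create(resource.name, resource.schema.descriptor, force=True)
    storage.write(resource.name, resource.read())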

Is the package still maintained?

Looks like the last commit is around 2021 and the latest release is around 2019 on GitHub. What's the status of this package?

Thanks

Should support database engines other than PostgreSQL

Overview

For now, only PostgreSQL is tested (and therefore supported).

Plan

  • implement support for other database engines like MySQL or SQLite

From @OriHoch

reproduction steps

  • be a new developer using jsontableschema-sql-py
  • look at the README, see the example which uses SQLite
  • be happy that I can use SQLite and possibly other engines because it uses SQLAlchemy
  • start using it and developing based on it
  • use an 'object' or 'array' type field in a schema

expected

  • should work

actual

  • error
  • looking further, the code relies specifically on postgresql for object / array type field

notes

  • I think it can be solved by dumping the values as JSON into a string field (see the sketch after this list)
  • until it's fixed - state clearly in the README and in the examples that PostgreSQL is the only fully supported engine
  • I guess the main issue is the lack of SQLite support - SQLite is great for letting new developers start working quickly. If at least SQLite were supported as well (even for development / testing only), that would be great.
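A minimal sketch of that JSON-dump fallback (the helper names here are hypothetical, not part of the library):

import json
import sqlalchemy as sa

# On engines without a native JSON/array column, store 'object' and 'array'
# fields in a plain Text column and (de)serialize at the boundary.
def to_sql_value(field_type, value):
    if field_type in ('object', 'array') and value is not None:
        return json.dumps(value)      # dict/list -> JSON string
    return value

def from_sql_value(field_type, value):
    if field_type in ('object', 'array') and value is not None:
        return json.loads(value)      # JSON string -> dict/list
    return value

# The column itself is then just text on SQLite/MySQL:
payload_column = sa.Column('payload', sa.Text)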

From @vshih

import sqlalchemy as sa
engine = sa.create_engine('mysql://...')
from jsontableschema import Table
t = Table(SOURCE, SCHEMA)
t.save('test_package', backend='sql', engine=engine)
# _mysql_exceptions.ProgrammingError: (1146, "Table 'mydatabase.users' doesn't exist")

related #64

Updating existing data?

The current implementation does not handle the case when existing data needs to be updated. Here is a test case that illustrates this:

import sqlalchemy as sa

from jsontableschema_sql import Storage


def test_update():
    schema = {
        'primaryKey': 'id',
        'fields': [
            {'name': 'id', 'type': 'integer'},
            {'name': 'comment', 'type': 'string'},
        ]
    }

    storage = Storage(engine=sa.create_engine('sqlite://'))
    storage.create('data', schema)

    # Write data for the first time
    storage.write('data', [(1, 'a'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'a'), (2, 'b')]

    # Write changed data with same primary keys on top of existing ones
    storage.write('data', [(1, 'x'), (2, 'b')])
    assert list(storage.read('data')) == [(1, 'x'), (2, 'b')]

This test raises the following error on the second write:

sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError)
UNIQUE constraint failed: data.id
[SQL: 'INSERT INTO data (id, comment) VALUES (?, ?)']
[parameters: ((1, 'x'), (2, 'b'))]

It would be nice if existing data could be updated.
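A hedged sketch of what an update path could look like underneath, using SQLAlchemy's SQLite upsert support (requires SQLAlchemy 1.4+ and SQLite 3.24+); this is not the Storage API, just the mechanism it could use:

import sqlalchemy as sa
from sqlalchemy.dialects.sqlite import insert

engine = sa.create_engine('sqlite://')
metadata = sa.MetaData()
data = sa.Table(
    'data', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('comment', sa.Text),
)
metadata.create_all(engine)

# Upsert: insert new rows, update the non-key columns of existing ones.
with engine.begin() as conn:
    for row in [{'id': 1, 'comment': 'x'}, {'id': 2, 'comment': 'b'}]:
        stmt = insert(data).values(**row)
        stmt = stmt.on_conflict_do_update(
            index_elements=['id'],
            set_={'comment': stmt.excluded.comment},
        )
        conn.execute(stmt)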

Improve memory usage

  • use a stream in Storage.write() or write by chunks (#29) - see the sketch after this list
  • investigate memory usage in Storage.read() (possible SQLAlchemy bug)
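A small sketch of what chunked writing could look like (the helper name is hypothetical; the library's actual implementation may differ):

import itertools

def iter_chunks(rows, size=1000):
    # Yield lists of at most `size` rows from any iterator, so the whole
    # dataset never has to be materialized in memory at once.
    iterator = iter(rows)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk

# e.g. one INSERT per chunk inside a write path:
# for chunk in iter_chunks(rows):
#     connection.execute(table.insert(), [dict(zip(column_names, r)) for r in chunk])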

Running tests ...

make test fails with

make: virtualenv-3.4: No such file or directory
make: *** [pyenv/bin/python] Error 1

BTW, is this make stuff a replacement for the "old-fashioned" nosetests workflow? I was sort of expecting to do:

pip install -e .
nosetests tests

Resources have no "name"s

When generating a Data Package using:

# https://github.com/frictionlessdata/datapackage-py

from datapackage import pull_datapackage
from sqlalchemy import create_engine

engine = create_engine('mysql://root@localhost/enliten')

# Pull
pull_datapackage(
    descriptor='enliten-datapackage/datapackage.json',
    name="enliten-datapackage",
    backend='sql',
    engine=engine)

None of the resources have names (now required). This results in the foreignKeys generated in one of the resources in the datapackage.json referring to null resources.

"schema": {
  "foreignKeys": [
    {
      "fields": "user_id",
      "reference": {
        "resource": null,
        "fields": "id"
      }
    },
    {
      "fields": "group_id",
      "reference": {
        "resource": null,
        "fields": "id"
      }
    }
  ]
}

(Found via this frictionlessdata/pilot-dm4t#19)

Mismatch in primary key types should not cause a failure on SQL writing

reproduction

  • table schema:
{"fields": [
    {"name": "id", "type": "integer"},
    {"name": "date", "type": "string"}
],
"primaryKey": ["id"]
  • write some rows to a PostgreSQL DB - but in the rows, have the id as type string (but numeric)
  • data in DB is inserted correctly, and DB converts it to integer
  • now, write the same rows again

expected

  • provide an error message warning about the type mismatch
  • or match the rows already in the DB despite the type difference (rather than relying only on the bloom filter) - see the sketch below

actual

  • writer.__prepare_bloom - collects all data as integers
  • writer.__check_existing - compares with a string id, which doesn't match
  • this causes an insert
  • when inserting, the DB matches the primary key (even though the types don't match)
  • error about a duplicate key
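One caller-side workaround, sketched here with tableschema-py's Schema.cast_row (an assumption about how the fix could be approached, not the library's current behaviour), is to cast rows through the schema before writing so primary-key values already have the declared type:

from tableschema import Schema

descriptor = {
    'fields': [
        {'name': 'id', 'type': 'integer'},
        {'name': 'date', 'type': 'string'},
    ],
    'primaryKey': ['id'],
}
schema = Schema(descriptor)

# Casting turns the numeric-looking string id into an int, so a later
# existence check (bloom filter or query) compares like with like.
raw_rows = [['1', '2017-01-01'], ['2', '2017-01-02']]
cast_rows = [schema.cast_row(row) for row in raw_rows]   # [[1, '2017-01-01'], ...]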

Support primaryKey

If a primaryKey is set in the JTS, use it.

Maybe we want a flag where we ignore it and create our own auto-increment key, but set the primaryKey field to unique (?)
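A sketch of both options in SQLAlchemy terms (illustrative only; the field names and the flag's behaviour are assumptions):

import sqlalchemy as sa

primary_key = ['id']   # from the JTS descriptor

# Option 1: honour the JTS primaryKey directly.
columns = [
    sa.Column('id', sa.Integer, primary_key='id' in primary_key),
    sa.Column('comment', sa.Text),
]

# Option 2 (behind the proposed flag): surrogate auto-increment key,
# with the declared primaryKey field only marked unique.
surrogate_columns = [
    sa.Column('_id', sa.Integer, primary_key=True, autoincrement=True),
    sa.Column('id', sa.Integer, unique=True),
    sa.Column('comment', sa.Text),
]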

Timezone offset ignored when loading into sqlite

Overview

I have a data package with timestamps that have timezone offsets, e.g.:

2020-09-09T07:10:20+02:00

The spec for this field is:

{
  "name": "timestamp",
  "type": "datetime",
  "format": "%Y-%m-%dT%H:%M:%S%z"
}

When saving this data package to SQLite, timezone offsets get ignored and the values are saved as:

2020-09-09 07:10:20.000000

Retaining timezone offsets is not possible in SQLite's datetime strings, but it would be good if timestamps were converted to UTC before uploading to SQLite:

2020-09-09 05:10:20.000000

Is that possible?
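A small sketch of the requested conversion, applied before a value reaches SQLite (the helper name is hypothetical, not part of the library):

from datetime import datetime, timezone

def to_utc_naive(value):
    # Convert an offset-aware datetime to naive UTC so nothing is lost
    # when SQLite drops the offset.
    if isinstance(value, datetime) and value.tzinfo is not None:
        return value.astimezone(timezone.utc).replace(tzinfo=None)
    return value

ts = datetime.fromisoformat('2020-09-09T07:10:20+02:00')
print(to_utc_naive(ts))   # 2020-09-09 05:10:20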


Please preserve this line to notify @roll (lead of this repository)

Support all Table Schema types

Overview

It should:

  • use the native type where available
  • fall back to string otherwise

Tasks

  • analyze specs-v1 to get a list of unsupported types
  • support those types using the algorithm described above

Update to jsontableschema-v0.7

Overview

The driver should be updated for the jsontableschema-v0.7 changes.

Tasks

  • fix breaking storage changes
  • add new storage features
  • remove deprecated code
  • fix readme on PyPi

Express mapping in an external JSON to reuse as standard reference

Hi, I am looking for a PHP implementation, so perhaps we can use tableschema-py as the reference algorithm for tableschema-php issue 35... The core seems to be the mapping, which can be expressed as JSON for reuse in other contexts.

        mapping = {
            'any': sa.Text,
            'array': None,
            'boolean': sa.Boolean,
            'date': sa.Date,
            'datetime': sa.DateTime,
            'duration': None,
            'geojson': None,
            'geopoint': None,
            'integer': sa.Integer,
            'number': sa.Float,
            'object': None,
            'string': sa.Text,
            'time': sa.Time,
            'year': sa.Integer,
            'yearmonth': None,
        }

The suggestion: express the mapping in an external file, or generate code from a CSV with all the datatype map descriptions... As a "RosettaStone-tabular-datatypes" ;-)
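A sketch of how such an external JSON mapping could be loaded back into SQLAlchemy types on the Python side (the file name and layout are made up for illustration):

import json
import sqlalchemy as sa

# types.json (hypothetical): {"string": "Text", "integer": "Integer",
#                             "number": "Float", "array": null, ...}
with open('types.json') as f:
    raw_mapping = json.load(f)

# Resolve SQLAlchemy type names to classes; null stays None (unsupported).
mapping = {
    ts_type: getattr(sa, sa_name) if sa_name else None
    for ts_type, sa_name in raw_mapping.items()
}
assert mapping['integer'] is sa.Integer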


PS: for a more general translation of primitive data types, perhaps we can reuse a CSV at Data Packaged Core Datasets with all the datatypes, something like:

Canonic Description    | JSON   | SQL-92 | Python | Java   | ...
JSON UTF8 string       | string | text   | str    | String | ...
SQL-92 integer 4 bytes | number | int4   | ...    | ...    | ...
...                    | ...    | ...    | ...    | ...    | ...

There're multiple sqlalchemy.sql.sqltypes.* for TEXT and others

For example, I followed the import_datapackage.py at https://gist.github.com/vitorbaptista/19d476d99595584e9ad5 to import https://github.com/datasets/nasdaq-listings into a SQLite3 database. After that, I tried to export it back to a datapackage by using the export_datapackage.py file at the same gist. I then got the following traceback:

Traceback (most recent call last):
  File "sql-example.py", line 14, in <module>
    datapackage_storage.export_package(storage, 'nasdaq-listings/datapackage.json', 'nasdaq-listings')
  File "/home/vitor/datasets/venv/src/datapackage-storage/datapackage_storage/package.py", line 86, in export_package
    schema = storage.describe(table)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/storage.py", line 180, in describe
    self.__prefix, table, dbtable.columns, dbtable.constraints)
  File "/home/vitor/datasets/venv/src/jtssql/jtssql/helpers.py", line 113, in restore_schema
    message = 'Type %s is not supported' % column.type
TypeError: Type TEXT is not supported

The problem is that SQLAlchemy has both sqlalchemy.sql.sqltypes.Text and sqlalchemy.sql.sqltypes.TEXT, and in our mapping at .restore_schema we only use the first one:

    # Mapping
    mapping = {
        Text: 'string',
        Integer: 'integer',
        Float: 'number',
        Boolean: 'boolean',
    }

When we try to get mapping[column.type.__class__] afterwards, it isn't found and the exception is thrown.

I don't know why SQLAlchemy has both .Text and .TEXT (and others), or why one is used and not the other, but we should support both. Luckily for us, .TEXT inherits from .Text, so we can use that.
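A sketch of that fix: resolve the column type by inheritance instead of an exact class lookup, so TEXT, VARCHAR and friends fall back to their base classes (this is the suggested approach, not necessarily the code as merged):

from sqlalchemy import Boolean, Float, Integer, Text
from sqlalchemy.sql.sqltypes import TEXT

mapping = {
    Text: 'string',
    Integer: 'integer',
    Float: 'number',
    Boolean: 'boolean',
}

def restore_type(column_type):
    # Walk the mapping with isinstance so subclasses like TEXT match Text.
    for sa_type, ts_type in mapping.items():
        if isinstance(column_type, sa_type):
            return ts_type
    raise TypeError('Type %s is not supported' % column_type)

assert restore_type(TEXT()) == 'string'   # TEXT inherits from Text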

Consider whether we support full Tabular Data Package loading

With a full Tabular Data Package we would have all the resource names plus data file paths, and could do a complete load including setting table names etc.

We would also handle foreign keys.

Should this be in this library? If not, maybe we want a tdp-to-sql library as well.

Should be able to save tables to MySQL

@vshih wrote:

import sqlalchemy as sa
engine = sa.create_engine('mysql://...')
from jsontableschema import Table
t = Table(SOURCE, SCHEMA)
t.save('test_package', backend='sql', engine=engine)
# _mysql_exceptions.ProgrammingError: (1146, "Table 'mydatabase.users' doesn't exist")

related #64

Data synchronisation?

From @sirex on May 10, 2016 11:58

When integrating external data sources into your project, usually the external data is imported into an internal database.

Since external data can be updated frequently, you usually want to update your internal replica as soon as possible.

Like git pull for code, I want to have datapackage pull for data.

There is the jsontableschema-sql-py library, which can import all the data into a database, but it can't update existing data.

Here are two use cases:

1. Updating only rows that have changed since the last update

Let's say your external data provider has a service with SLEEP or CouchDB replication protocol support. It means that after the initial import, which took two days, I can now call this external service and get only the data that has changed instead of downloading everything again.

2. Detecting what has changed since the last update

Let's say another provider keeps data in a git repository, which allows me to update the data quickly using git pull, but there is no way of knowing which rows have changed. Luckily, datapackage pull could detect changes really quickly and update my internal database, touching only the records that have changed since the last update.

Copied from original issue: frictionlessdata/frictionlessdata.io#253

Store schemas in storage

Requires frictionlessdata/tableschema-py#59

On previous iterations I've run into some problems using a 100% pure reflection system. It's now obvious to me that for first-class JSON Table Schema feature support we need to store schemas in the storage (in a meta table _schemas, for example). It's not against the current design and could be implemented relatively quickly.

For tables without an entry in _schemas, the current reflection system would still work with a subset of features supported, so the user will still be able to describe and read a plain SQL table (the same as now).
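A minimal sketch of what the proposed meta table could look like (table name, columns and usage are assumptions; nothing here is implemented yet):

import json
import sqlalchemy as sa

metadata = sa.MetaData()
schemas = sa.Table(
    '_schemas', metadata,
    sa.Column('table_name', sa.Text, primary_key=True),
    sa.Column('descriptor', sa.Text),   # the full JSON Table Schema as JSON text
)

engine = sa.create_engine('sqlite://')
metadata.create_all(engine)

descriptor = {'fields': [{'name': 'id', 'type': 'integer'}], 'primaryKey': ['id']}
with engine.begin() as conn:
    conn.execute(schemas.insert().values(
        table_name='data',
        descriptor=json.dumps(descriptor),
    ))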

Really related - frictionlessdata/datapackage-py#69

Support all Table Schema constraints

Requires #27

Initial implementation restrictions (a sketch of possible SQL mappings follows the list):

  • fields[].constraints.unique is not supported
  • fields[].constraints.minLength is not supported
  • fields[].constraints.maxLength is not supported
  • fields[].constraints.minimum is not supported
  • fields[].constraints.maximum is not supported
  • fields[].constraints.pattern is not supported
  • fields[].constraints.enum is not supported
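A sketch of how these constraints could map onto SQL features (illustrative assumptions only; none of this is implemented by the library yet):

import sqlalchemy as sa

field = {
    'name': 'age',
    'type': 'integer',
    'constraints': {'unique': True, 'minimum': 0, 'maximum': 150},
}
c = field['constraints']

table = sa.Table(
    'people', sa.MetaData(),
    sa.Column(field['name'], sa.Integer, unique=c.get('unique', False)),
    sa.CheckConstraint('age >= %d' % c['minimum']),   # constraints.minimum
    sa.CheckConstraint('age <= %d' % c['maximum']),   # constraints.maximum
)
# minLength/maxLength could become sa.String(length=...) plus a CHECK,
# enum could become sa.Enum(...), and pattern a dialect-specific CHECK
# (e.g. ~ on PostgreSQL, REGEXP on MySQL).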

Reference:

Update implementation

Hey @pudo

We are going to update this lib using jsontableschema, etc., and also implement https://github.com/okfn/datapackage-sql-py for relational table support, etc. It will be a bumpy ride, but we are thinking that most of the API will be modeled on the great work @roll has recently done for BigQuery integration: https://github.com/okfn?utf8=✓&query=bigquery

All work will be done in a new feature branch. If there is serious conflict with your original design here for jtssql, LMK and we can discuss, and/or break it out to a new home.
