Comments (5)
From @sirex on May 10, 2016 12:34
Possible implementation
jsontableschema-sql-py could support a special writer that can be used for data synchronisation.
This special writer could create a slightly modified table with the following changes:
- The primary key from the original table should be changed to a plain index, without a unique constraint.
- An additional field, jts_seq_id, should be added as the table's primary key.
- An additional field, jts_deleted, should be added for tracking deleted rows.
For example, if we have the following schema:
{
    "primaryKey": "id",
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "comment", "type": "string"}
    ]
}
The following table would be created:
import sqlalchemy as sa

metadata = sa.MetaData()

table = sa.Table(
    'data', metadata,
    sa.Column('jts_seq_id', sa.BigInteger, primary_key=True),
    sa.Column('jts_deleted', sa.Boolean, default=False),
    sa.Column('id', sa.Integer, index=True, nullable=True),
    sa.Column('comment', sa.Unicode(255)),
)
This would allow us to track all changes made to all records and provide something similar to the SLEEP protocol or CouchDB replication.
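As a sketch of how such a writer might behave (the code below is illustrative and not part of tableschema-sql-py), a "delete" would only flip the jts_deleted flag, so the row survives and clients polling for changes can learn about the deletion:

```python
import sqlalchemy as sa

metadata = sa.MetaData()
table = sa.Table(
    'data', metadata,
    # Integer (not BigInteger) here only so SQLite auto-increments it in this demo.
    sa.Column('jts_seq_id', sa.Integer, primary_key=True),
    sa.Column('jts_deleted', sa.Boolean, default=False, nullable=False),
    sa.Column('id', sa.Integer, index=True, nullable=True),
    sa.Column('comment', sa.Unicode(255)),
)

engine = sa.create_engine('sqlite://')
metadata.create_all(engine)

with engine.begin() as conn:
    # Initial import: each row receives a sequential jts_seq_id.
    conn.execute(table.insert(), [
        {'id': 1, 'comment': 'first'},
        {'id': 2, 'comment': 'second'},
    ])
    # A "delete" only sets the flag instead of removing the row.
    conn.execute(
        table.update().where(table.c.id == 2).values(jts_deleted=True)
    )

with engine.connect() as conn:
    rows = conn.execute(sa.select(table)).fetchall()
```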
from tableschema-sql-py.
From @sirex on May 12, 2016 11:39
A reference implementation of SLEEP protocol: https://www.npmjs.com/package/sleep-ref
From @rgrp on May 13, 2016 8:29
@sirex really like this idea - and I've also thought about datapackage pull ideas.
I think there are two parts here:
- The basic ability to update (rather than just insert)
- Full on versioning / replication
I think the second is interesting but probably beyond immediate scope - though something to keep exploring. I say this having personally written things like this in the past and it has always turned out to be more complex than you had hoped or thought!
One thing that would be really useful as we discuss these use cases would be a sense of the size of the datasets you are thinking about and how often you are doing the update (hourly, daily, weekly etc).
From @sirex on May 13, 2016 15:02
I'm working on a project where many small and large external data packages are used, and I want to have most of them synchronised into an internal PostgreSQL database in order to get easy access to that data.
Some large datasets, like texts of legal acts (~500,000 rows, ~5 GB CSV file size), have to be updated daily. Usually, once a legal act document is released it does not change, but there are some rare exceptions where existing documents can be updated.
But most of the datasets are quite small and need to be updated at most daily, but usually weekly or even less frequently.
I would like a universal solution where I can sync a data package of any size at any frequency, without putting in too much effort when I need to add another 10 external data sources, and without worrying that it will kill my server.
I did some research on the first part, the „basic ability to update“, to see how well it scales. You can find the results here:
https://gist.github.com/sirex/5e1ceac90bf473f42f3904889992b8fd
What I found is that for a basic update, when you only have access to static CSV files and don't know what exactly has changed in the file, the best way is to wipe all rows from your internal database and import everything again. I think this works perfectly well for small datasets: updating ~1000 rows this way takes just ~1-2 seconds.
But with larger datasets, even a basic update can take 10 minutes or more, and when you need to update from many external sources the whole process can take up to an hour.
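The wipe-and-reload approach amounts to a delete plus a bulk insert in one transaction. A minimal sketch (table and column names here are illustrative, not taken from the gist):

```python
import csv
import io

import sqlalchemy as sa

metadata = sa.MetaData()
comments = sa.Table(
    'comments', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('comment', sa.Unicode(255)),
)

engine = sa.create_engine('sqlite://')
metadata.create_all(engine)

def reload_from_csv(engine, table, csv_file):
    """Wipe the table and bulk-insert every row from the CSV snapshot."""
    rows = [
        {'id': int(r['id']), 'comment': r['comment']}
        for r in csv.DictReader(csv_file)
    ]
    # One transaction, so concurrent readers never observe an empty table.
    with engine.begin() as conn:
        conn.execute(table.delete())          # drop every old row
        if rows:
            conn.execute(table.insert(), rows)  # bulk re-insert the snapshot

reload_from_csv(engine, comments, io.StringIO("id,comment\n1,hello\n2,world\n"))
```

Running the function again with a new snapshot simply repeats the wipe and insert, which is why the cost grows linearly with dataset size.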
It would be really great if data providers had the option to provide data not only as static files, but also as a service implementing the SLEEP protocol.
With a SLEEP service, clients would be able to request only the new changes, and updates from external sources would be really fast.
I think the client side is quite easy to implement: you just ask the SLEEP service for new changes and then perform the updates, deletes and inserts. Trivial.
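A client-side sync loop might look roughly like this. The change-feed format below is a simplified stand-in for a real SLEEP feed, and the store is just a dict rather than a database:

```python
def apply_changes(store, changes):
    """Apply a SLEEP-style change feed to a local key-value store.

    Each change is a dict: {'seq': int, 'id': key, 'deleted': bool, 'data': row}.
    Returns the highest sequence id seen, so the next request can
    ask the service only for changes after that point.
    """
    last_seq = 0
    for change in changes:
        if change['deleted']:
            store.pop(change['id'], None)         # delete locally
        else:
            store[change['id']] = change['data']  # insert or update
        last_seq = max(last_seq, change['seq'])
    return last_seq

store = {}
feed = [
    {'seq': 1, 'id': 1, 'deleted': False, 'data': {'comment': 'first'}},
    {'seq': 2, 'id': 2, 'deleted': False, 'data': {'comment': 'second'}},
    {'seq': 3, 'id': 2, 'deleted': True, 'data': None},
]
last_seq = apply_changes(store, feed)
```

After this run the store holds only id 1, and the client would resume from last_seq on the next poll.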
Implementing a SLEEP server is a bit more tricky, but I guess it is quite similar to the basic update. I will experiment with that part later. Basically, what I have in mind is that the SLEEP server has to store all data in an internal database, which can be updated from static CSV files or SQL tables. Before updating an internal table from an external one, the new data should first be bulk-inserted into a new table using the same sequence generator; then rows missing from the new data should be carried over from the old table with the deleted flag set; and finally the old table should be replaced with the new, updated table. All clients can then request new changes starting from a specified sequence id.
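The server-side refresh described above might be sketched as follows. This is a simplified in-memory model of the algorithm; a real implementation would use SQL tables, a shared database sequence, and an atomic table swap:

```python
import itertools

seq = itertools.count(1)  # stands in for a shared database sequence

def refresh(old_table, new_rows):
    """Build a replacement table from a fresh full snapshot of the source.

    old_table: {id: (seq_id, deleted, data)} -- current internal table
    new_rows:  {id: data}                    -- latest snapshot of the source
    Changed or added rows get a new seq id; rows that disappeared from
    the source are kept with the deleted flag set, so clients can see
    the deletion when they ask for changes after their last seq id.
    """
    new_table = {}
    for id_, data in new_rows.items():
        old = old_table.get(id_)
        if old is not None and old[2] == data and not old[1]:
            new_table[id_] = old                       # unchanged: keep old seq id
        else:
            new_table[id_] = (next(seq), False, data)  # new or updated row
    for id_, (seq_id, deleted, data) in old_table.items():
        if id_ not in new_rows and not deleted:
            new_table[id_] = (next(seq), True, data)   # vanished: mark deleted
        elif id_ not in new_rows:
            new_table[id_] = (seq_id, deleted, data)   # already marked deleted
    return new_table

table = refresh({}, {1: 'a', 2: 'b'})     # initial import
table = refresh(table, {1: 'a', 2: 'c'})  # row 2 changed -> new seq id
table = refresh(table, {1: 'a'})          # row 2 removed -> deleted flag set
```

Replacing the old table with new_table in one step is what makes the swap safe for concurrent readers.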
MOVED to https://github.com/frictionlessdata/implementations/issues/17