Giter Site home page Giter Site logo

Comments (5)

danfowler avatar danfowler commented on June 11, 2024

From @sirex on May 10, 2016 12:34

Possible implementation

jsontableschema-sql-py could support a special writer that can be used for data synchronisation.

This special writer could create a slightly modified table with following changes:

  • Primary key from original table should be changed to index, without unique constraint.
  • Additional field jts_seq_id, a primary key has to be added to the table.
  • Additional field jts_deleted has to be added, for tracking deleted rows.

For example, if we have following schema:

{
    'primaryKey': 'id',
    'fields': [
        {'name': 'id', 'type': 'integer'},
        {'name': 'comment', 'type': 'string'},
    ]
}

Following table would be created:

sa.Table(
    'data', metadata,
    sa.Column('jts_seq_id', sa.BigInteger, primary_key=True),
    sa.Column('jts_deleted', sa.Boolean, default=False),
    sa.Column('id', sa.Integer, index=True, nullable=True),
    sa.Column('comment', sa.Unicode(255)),
)

This will allow us to track all changes made to all records and provide something similar to SLEEP or CouchDB replication.

from tableschema-sql-py.

danfowler avatar danfowler commented on June 11, 2024

From @sirex on May 12, 2016 11:39

A reference implementation of SLEEP protocol: https://www.npmjs.com/package/sleep-ref

from tableschema-sql-py.

danfowler avatar danfowler commented on June 11, 2024

From @rgrp on May 13, 2016 8:29

@sirex really like this idea - and i've also thought about datapackage pull ideas.

I think there are two parts here:

  • The basic ability to update (rather than just insert)
  • Full on versioning / replication

I think the second is interesting but probably beyond immediate scope - though something to keep exploring. I say this having personally written things like this in the past and it has always turned out to be more complex than you had hoped or thought!

One thing that would be really useful as we discuss these use cases would be a sense of the size of the datasets you are thinking about and how often you are doing the update (hourly, daily, weekly etc).

from tableschema-sql-py.

danfowler avatar danfowler commented on June 11, 2024

From @sirex on May 13, 2016 15:2

I'm working on a project, where many small and large external data packages are used. And I want to have most of them synchronised into internal PostgreSQL database in order to get easy access to that data.

Some large datasets like texts of legal acts (~500,000 rows and ~5Gb csv file size) have to be updated daily. Usually once legal act document is released it does not change, but there are some rare exceptions when existing documents can be updated.

But most of the datasets are quite small and usually need to be updated at best daily but usually weekly or even less frequently.

I would like to have a universal solution, where I can sync a data package of any size at any frequency without putting too much effort when I need to add another 10 external data sources without worrying that it will kill my server.

I did some research on the first part „basic ability to update“ to see how well it scales. You can find results here:

https://gist.github.com/sirex/5e1ceac90bf473f42f3904889992b8fd

What I found, that for basic update, when you only have access to static csv files and don't know what exactly has been changed in that csv file, then best way is to wipe out all rows from your internal database and import everything again. I think this work perfectly well for small datasets. Updating ~1000 rows with this method takes just ~1-2 seconds.

But with larger datasets doing just basic update can take 10 minutes and more. When you need to update from many external sources whole process can take up to an hour.

It would be really great if data provider could have an option to provide data not only as static files, but as a service implementing SLEEP protocol.

In case of SLEEP service clients would be able to request on new changes and updates from external sources would be really fast.

I think it is quite easy to implement client side, you just need to ask SLEEP service for new changes and then perform updates, deletes and inserts. Trivial.

Implementing SLEEP server is a bit more tricky, but I guess it is quite similar to basic update. I will experiment with that part later. But basically what I have in mind, that SLEEP server have to store all data in a internal database. Internal database can be updated from static csv files or SQL tables. Before updating internal table from external one, first new data should be bulk inserted in to a new table using same sequence generator, then missing tables should be added with deleted flag from old table, tan finally old table should be replaced with new updated table. All clients can request new changes starting from specified sequence id.

from tableschema-sql-py.

roll avatar roll commented on June 11, 2024

MOVED
https://github.com/frictionlessdata/implementations/issues/17

from tableschema-sql-py.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.