Comments (5)
From @sirex on May 10, 2016 12:34
Possible implementation
jsontableschema-sql-py could support a special writer that can be used for data synchronisation.
This special writer could create a slightly modified table with the following changes:
- The primary key from the original table should be changed to a plain index, without a unique constraint.
- An additional field, jts_seq_id, should be added as the table's primary key.
- An additional field, jts_deleted, should be added for tracking deleted rows.
For example, if we have the following schema:
{
    "primaryKey": "id",
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "comment", "type": "string"}
    ]
}
The following table would be created:
import sqlalchemy as sa

metadata = sa.MetaData()

table = sa.Table(
    'data', metadata,
    sa.Column('jts_seq_id', sa.BigInteger, primary_key=True),
    sa.Column('jts_deleted', sa.Boolean, default=False),
    sa.Column('id', sa.Integer, index=True, nullable=True),
    sa.Column('comment', sa.Unicode(255)),
)
This would allow us to track all changes made to all records and provide something similar to the SLEEP protocol or CouchDB replication.
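As a sketch of how such a writer might behave (the code below is illustrative and not part of tableschema-sql-py), a "delete" would only flip the jts_deleted flag, so the row survives and clients polling for changes can learn about the deletion:

```python
import sqlalchemy as sa

metadata = sa.MetaData()
table = sa.Table(
    'data', metadata,
    # Integer (not BigInteger) here only so SQLite auto-increments it in this demo.
    sa.Column('jts_seq_id', sa.Integer, primary_key=True),
    sa.Column('jts_deleted', sa.Boolean, default=False, nullable=False),
    sa.Column('id', sa.Integer, index=True, nullable=True),
    sa.Column('comment', sa.Unicode(255)),
)

engine = sa.create_engine('sqlite://')
metadata.create_all(engine)

with engine.begin() as conn:
    # Initial import: each row receives a sequential jts_seq_id.
    conn.execute(table.insert(), [
        {'id': 1, 'comment': 'first'},
        {'id': 2, 'comment': 'second'},
    ])
    # A "delete" only sets the flag instead of removing the row.
    conn.execute(
        table.update().where(table.c.id == 2).values(jts_deleted=True)
    )

with engine.connect() as conn:
    rows = conn.execute(sa.select(table)).fetchall()
```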
from tableschema-sql-py.
From @sirex on May 12, 2016 11:39
A reference implementation of SLEEP protocol: https://www.npmjs.com/package/sleep-ref
From @rgrp on May 13, 2016 8:29
@sirex really like this idea - and I've also thought about datapackage pull ideas.
I think there are two parts here:
- The basic ability to update (rather than just insert)
- Full on versioning / replication
I think the second is interesting but probably beyond immediate scope - though something to keep exploring. I say this having personally written things like this in the past and it has always turned out to be more complex than you had hoped or thought!
One thing that would be really useful as we discuss these use cases would be a sense of the size of the datasets you are thinking about and how often you are doing the update (hourly, daily, weekly etc).
From @sirex on May 13, 2016 15:02
I'm working on a project where many small and large external data packages are used, and I want to have most of them synchronised into an internal PostgreSQL database in order to get easy access to that data.
Some large datasets, like texts of legal acts (~500,000 rows, ~5 GB CSV file size), have to be updated daily. Usually, once a legal act document is released it does not change, but there are some rare exceptions where existing documents can be updated.
But most of the datasets are quite small and need to be updated at most daily, but usually weekly or even less frequently.
I would like a universal solution where I can sync a data package of any size at any frequency, without putting in too much effort when I need to add another 10 external data sources, and without worrying that it will kill my server.
I did some research on the first part, the „basic ability to update“, to see how well it scales. You can find the results here:
https://gist.github.com/sirex/5e1ceac90bf473f42f3904889992b8fd
What I found is that for a basic update, when you only have access to static CSV files and don't know what exactly has changed in the file, the best way is to wipe all rows from your internal database and import everything again. I think this works perfectly well for small datasets: updating ~1000 rows this way takes just ~1-2 seconds.
But with larger datasets, even a basic update can take 10 minutes or more, and when you need to update from many external sources the whole process can take up to an hour.
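The wipe-and-reload approach amounts to a delete plus a bulk insert in one transaction. A minimal sketch (table and column names here are illustrative, not taken from the gist):

```python
import csv
import io

import sqlalchemy as sa

metadata = sa.MetaData()
comments = sa.Table(
    'comments', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('comment', sa.Unicode(255)),
)

engine = sa.create_engine('sqlite://')
metadata.create_all(engine)

def reload_from_csv(engine, table, csv_file):
    """Wipe the table and bulk-insert every row from the CSV snapshot."""
    rows = [
        {'id': int(r['id']), 'comment': r['comment']}
        for r in csv.DictReader(csv_file)
    ]
    # One transaction, so concurrent readers never observe an empty table.
    with engine.begin() as conn:
        conn.execute(table.delete())          # drop every old row
        if rows:
            conn.execute(table.insert(), rows)  # bulk re-insert the snapshot

reload_from_csv(engine, comments, io.StringIO("id,comment\n1,hello\n2,world\n"))
```

Running the function again with a new snapshot simply repeats the wipe and insert, which is why the cost grows linearly with dataset size.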
It would be really great if data providers had the option to provide data not only as static files, but also as a service implementing the SLEEP protocol.
With a SLEEP service, clients would be able to request only the new changes, and updates from external sources would be really fast.
I think the client side is quite easy to implement: you just ask the SLEEP service for new changes and then perform the updates, deletes and inserts. Trivial.
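A client-side sync loop might look roughly like this. The change-feed format below is a simplified stand-in for a real SLEEP feed, and the store is just a dict rather than a database:

```python
def apply_changes(store, changes):
    """Apply a SLEEP-style change feed to a local key-value store.

    Each change is a dict: {'seq': int, 'id': key, 'deleted': bool, 'data': row}.
    Returns the highest sequence id seen, so the next request can
    ask the service only for changes after that point.
    """
    last_seq = 0
    for change in changes:
        if change['deleted']:
            store.pop(change['id'], None)         # delete locally
        else:
            store[change['id']] = change['data']  # insert or update
        last_seq = max(last_seq, change['seq'])
    return last_seq

store = {}
feed = [
    {'seq': 1, 'id': 1, 'deleted': False, 'data': {'comment': 'first'}},
    {'seq': 2, 'id': 2, 'deleted': False, 'data': {'comment': 'second'}},
    {'seq': 3, 'id': 2, 'deleted': True, 'data': None},
]
last_seq = apply_changes(store, feed)
```

After this run the store holds only id 1, and the client would resume from last_seq on the next poll.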
Implementing a SLEEP server is a bit more tricky, but I guess it is quite similar to the basic update. I will experiment with that part later. Basically, what I have in mind is that the SLEEP server has to store all data in an internal database, which can be updated from static CSV files or SQL tables. Before updating an internal table from an external one, the new data should first be bulk-inserted into a new table using the same sequence generator; then rows missing from the new data should be carried over from the old table with the deleted flag set; and finally the old table should be replaced with the new, updated table. All clients can then request new changes starting from a specified sequence id.
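The server-side refresh described above might be sketched as follows. This is a simplified in-memory model of the algorithm; a real implementation would use SQL tables, a shared database sequence, and an atomic table swap:

```python
import itertools

seq = itertools.count(1)  # stands in for a shared database sequence

def refresh(old_table, new_rows):
    """Build a replacement table from a fresh full snapshot of the source.

    old_table: {id: (seq_id, deleted, data)} -- current internal table
    new_rows:  {id: data}                    -- latest snapshot of the source
    Changed or added rows get a new seq id; rows that disappeared from
    the source are kept with the deleted flag set, so clients can see
    the deletion when they ask for changes after their last seq id.
    """
    new_table = {}
    for id_, data in new_rows.items():
        old = old_table.get(id_)
        if old is not None and old[2] == data and not old[1]:
            new_table[id_] = old                       # unchanged: keep old seq id
        else:
            new_table[id_] = (next(seq), False, data)  # new or updated row
    for id_, (seq_id, deleted, data) in old_table.items():
        if id_ not in new_rows and not deleted:
            new_table[id_] = (next(seq), True, data)   # vanished: mark deleted
        elif id_ not in new_rows:
            new_table[id_] = (seq_id, deleted, data)   # already marked deleted
    return new_table

table = refresh({}, {1: 'a', 2: 'b'})     # initial import
table = refresh(table, {1: 'a', 2: 'c'})  # row 2 changed -> new seq id
table = refresh(table, {1: 'a'})          # row 2 removed -> deleted flag set
```

Replacing the old table with new_table in one step is what makes the swap safe for concurrent readers.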
MOVED to https://github.com/frictionlessdata/implementations/issues/17