jpuck / etl
PHP library for extracting, transforming, and loading XML/JSON into relational DB
License: GNU General Public License v3.0
A performance boost could be gained by grouping insert values when possible into a single SQL statement to execute.
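A rough sketch of what grouping could look like, assuming positional placeholders and identifiers that are already trusted and quoted. `buildBatchInsert` is a hypothetical helper, not something the library currently provides:

```php
<?php
declare(strict_types=1);

/**
 * Build one multi-row INSERT statement with positional placeholders.
 * Hypothetical helper, not part of the library; identifiers are assumed
 * to be trusted and already quoted for the target DBMS.
 *
 * @param string   $table   target table name
 * @param string[] $columns column names, in insert order
 * @param array[]  $rows    list of rows, each a list of values matching $columns
 * @return array{0: string, 1: array} the SQL text and the flattened parameters
 */
function buildBatchInsert(string $table, array $columns, array $rows): array
{
    if ($columns === [] || $rows === []) {
        throw new InvalidArgumentException('Columns and rows must not be empty.');
    }

    $width = count($columns);
    // One "(?, ?, ...)" group per row.
    $group = '(' . implode(', ', array_fill(0, $width, '?')) . ')';

    $params = [];
    foreach ($rows as $row) {
        if (count($row) !== $width) {
            throw new InvalidArgumentException('Row width does not match column count.');
        }
        foreach (array_values($row) as $value) {
            $params[] = $value;
        }
    }

    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $group))
    );

    return [$sql, $params];
}

// Example: two rows collapse into a single statement.
[$sql, $params] = buildBatchInsert('person', ['id', 'name'], [[1, 'Ann'], [2, 'Bob']]);
```

The resulting `$sql` and `$params` could then be handed to `PDO::prepare()` and `PDOStatement::execute()`. Most DBMSs cap the number of bound parameters per statement, so large batches would likely need chunking, for example with `array_chunk`.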
It would be useful to be able to parse and schematize JSON data in the same fashion as the `XML` class.
When child tables are created in the `DDL` class, the names are glued together without delimiters to minimize length, but it would be nice to have the option for more readable names.
Need a manual way to denormalize relationships up or down the chain.
Something from Sami would be nice.
There needs to be a standardized convention for `Datum` classes to implement the `parse` method.
`strtotime` is too liberal: it accepts timezone strings as valid datetimes, which is not acceptable for the purposes of the `Schematizer` on this line.
For example, this would be parsed as a valid datetime:
    "datetime": {
        "max": {
            "value": "Turkey"
        },
        "min": {
            "value": "GB"
        }
    },
Additionally, this is problematic because the Schema `Merger` won't invalidate conflicting datatypes, including this sample (#29).
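One possible stricter check, sketched here with `DateTimeImmutable::createFromFormat` and a round-trip comparison in place of `strtotime`. The function name `isStrictDatetime` and the list of accepted formats are assumptions for illustration only:

```php
<?php
declare(strict_types=1);

/**
 * Return true only if $value matches one of the explicitly allowed formats.
 * Hypothetical helper; the format list is an assumption, not the library's.
 *
 * @param string[] $formats formats accepted by DateTimeImmutable::createFromFormat
 */
function isStrictDatetime(
    string $value,
    array $formats = ['Y-m-d H:i:s', 'Y-m-d\TH:i:sP', 'Y-m-d']
): bool {
    foreach ($formats as $format) {
        // The leading "!" resets unparsed fields so no current-time values leak in.
        $parsed = DateTimeImmutable::createFromFormat('!' . $format, $value);
        // Round-tripping rejects overflowed dates such as 2016-02-30.
        if ($parsed !== false && $parsed->format($format) === $value) {
            return true;
        }
    }
    return false;
}
```

With an explicit format list, bare timezone names such as "Turkey" or "GB" no longer qualify, at the cost of having to enumerate every format the Schematizer should recognize.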
Currently if one Schema doesn't have a minimum, and the other one does, then the Merger will discard all minimums.
If both Schemata lack a minimum because min == max respectively, then a new min must be set to the value of the smaller max.
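The intended merge rule might be sketched as follows. This assumes a simplified shape in which each side is an array with a `max` and an optional `min`, and a missing `min` means the minimum equals the maximum; `mergeRange` is a hypothetical name, and the real Schema structure is more nested than this:

```php
<?php
declare(strict_types=1);

/**
 * Merge two range summaries shaped like ['max' => x] or ['min' => x, 'max' => y].
 * A missing 'min' is read as "min equals max" (a single observed value).
 * Hypothetical shape and helper, simplified from the Schema's structure.
 */
function mergeRange(array $a, array $b): array
{
    // Fall back to max when a side recorded no separate minimum.
    $minA = $a['min'] ?? $a['max'];
    $minB = $b['min'] ?? $b['max'];

    $merged = [
        'min' => min($minA, $minB),
        'max' => max($a['max'], $b['max']),
    ];

    // Keep the "no min when min == max" convention on the merged result.
    if ($merged['min'] === $merged['max']) {
        unset($merged['min']);
    }
    return $merged;
}
```

Under this reading, merging `['max' => 5]` with `['max' => 9]` yields a minimum of 5, the smaller of the two maximums, rather than discarding the minimum.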
This is repeated throughout the library all too much.
Because repetitive use has finally pushed its way into a priority bin (06ae305), it might as well go all in as a console component.
There's something aloof about the relationship between these classes. Despite the perils of multi-level inheritance, it still might be best to merge `DDL` with `DB` into a single abstract class extending the abstract class `Source`. Then concrete classes such as `MicrosoftSQLServer` make more sense.
Regarding the behavior of the default serializer in sabre-xml, a method should be developed to handle text mixed in nodes with other elements.
Currently the `Schematizer` identifies unique values as candidate keys, but the application of this information in the `DDL` class is not yet implemented. It would be nice to have the option to use these instead of the default surrogate `jpetl_id` keys.
This would be the first step in creating a truly integrated entity relationship between documents.
This should lead to a huge performance boost if you're only interested in specific fields.
So far it's been "good enough" for simplicity's sake to ignore namespaces and just strip them out of the parsed data. However, this is clearly not a long-term solution.
See also: sabre-io/xml#17
When using the surrogate key generator and inserting batches, if a query fails in the batch, then errors are silently ignored.
Need to be able to apply a prefix to table names for cohabitation of same structures. This is a precursor for a multi-document replace operation whereby temp tables are used as a staging area for partial data transfers - in the event of a runtime interruption we don't want upsets to the existing production data.
This would be more of an actual streamer than the current one, which is really more of an aggregator.
This would have the unique values disabled #15 because there's no efficient way to keep track of all those disparate values when merging discrete Schemas.
Because there are no transactional race conditions with distributed systems in this ETL, the IDs can be generated server-side prior to insert instead of relying on the DB to generate them. This would make batch insertions (#11) reasonable.
This should also dictate the method for last-inserted-ID in the `DB` class.
for API keys
With the explosively exponential creation of records, maxing out the integer size may come quicker than expected. Instead of using TSQL BIGINT, which will break DBMS portability and waste disk space, the IDs should start over for each table the same way an identity field would work.
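A minimal sketch of server-side IDs that restart for each table, assuming a single process performs all inserts. `TableIdGenerator` and its methods are hypothetical names, not existing library classes:

```php
<?php
declare(strict_types=1);

/**
 * Hand out sequential surrogate IDs per table, starting at 1 for each table.
 * Hypothetical class; it assumes a single process performs all inserts.
 */
final class TableIdGenerator
{
    /** @var array<string, int> last ID issued for each table */
    private array $last = [];

    /** Optionally seed a table with its current highest ID (e.g. from SELECT MAX). */
    public function seed(string $table, int $lastId): void
    {
        $this->last[$table] = $lastId;
    }

    /** Issue the next ID for the given table. */
    public function next(string $table): int
    {
        $this->last[$table] = ($this->last[$table] ?? 0) + 1;
        return $this->last[$table];
    }

    /** The most recently issued ID for a table, or null if none was issued. */
    public function lastInserted(string $table): ?int
    {
        return $this->last[$table] ?? null;
    }
}

// Example: each table keeps its own independent counter.
$ids = new TableIdGenerator();
$first  = $ids->next('person');
$second = $ids->next('person');
$other  = $ids->next('address');
```

Because every table counts independently, a generator along these lines could also stand in for the database's last-inserted-ID lookup without any DBMS-specific call.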
There needs to be an intelligent way to check what type of database management system is connected, in order to handle variations in returning the last inserted ID, quote types, and other assorted differences.
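As one illustration of handling quote types, the sketch below branches on the PDO driver name. `quoteIdentifier` is a hypothetical helper and the driver list is not exhaustive. With a live connection, the driver name can be read via `PDO::getAttribute(PDO::ATTR_DRIVER_NAME)`:

```php
<?php
declare(strict_types=1);

/**
 * Quote an identifier according to the PDO driver name.
 * Hypothetical helper; the driver list is illustrative, not exhaustive.
 */
function quoteIdentifier(string $driver, string $name): string
{
    switch ($driver) {
        case 'mysql':
            // MySQL uses backticks; a literal backtick is doubled.
            return '`' . str_replace('`', '``', $name) . '`';
        case 'sqlsrv':
        case 'dblib':
            // SQL Server uses brackets; a literal closing bracket is doubled.
            return '[' . str_replace(']', ']]', $name) . ']';
        default:
            // ANSI SQL double quotes, used by e.g. pgsql and sqlite.
            return '"' . str_replace('"', '""', $name) . '"';
    }
}

// With a live connection the driver name would come from:
// $driver = $pdo->getAttribute(PDO::ATTR_DRIVER_NAME);
```

The same driver switch could plausibly select the strategy for retrieving the last inserted ID, keeping all DBMS-specific differences in one place.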
The `Schema` class needs a `toJSON` method as well as a way to import.