etl,jpuck

group insert values

Grouping insert values into a single SQL statement, where possible, could give a performance boost.
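A minimal sketch of the idea, written in Python with sqlite3 purely for illustration. The function name `build_grouped_insert` and the `person` table are hypothetical and not part of this library.

```python
import sqlite3


def build_grouped_insert(table, columns, rows):
    """Return (sql, params) for one multi-row INSERT covering all rows.

    Table and column names are interpolated as-is, so they must come from
    trusted input (for example, a generated schema). Values are bound.
    """
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join(row_placeholder for _ in rows),
    )
    # Flatten the rows so the values line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT)")

rows = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
sql, params = build_grouped_insert("person", ["id", "name"], rows)
conn.execute(sql, params)  # one statement instead of three

row_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```

Most databases cap the number of bound parameters per statement, so a real implementation would probably need to split large row sets into several grouped statements.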

JSON Datum

It would be useful to be able to parse and schematize JSON data in the same fashion as the XML class.

Table name delimiter

When child tables are created in the DDL class, the names are glued together without delimiters to minimize length, but it would be nice to have the option for more readable names.
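One possible shape for the option, sketched in Python for illustration only. The function `child_table_name` and its `delimiter` parameter are hypothetical; the default empty delimiter mirrors the current glued-together behavior.

```python
def child_table_name(parts, delimiter=""):
    """Join ancestor and child names into one table name.

    An empty delimiter keeps names as short as possible; a delimiter
    such as "_" trades length for readability.
    """
    return delimiter.join(parts)


compact = child_table_name(["order", "item", "discount"])
readable = child_table_name(["order", "item", "discount"], delimiter="_")
```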

handle datetime with timezone offset for SQL Server

https://blogs.msdn.microsoft.com/sqlprogrammability/2008/03/18/using-time-zone-data-in-sql-server-2008/

add option SET IDENTITY_INSERT ON in DDL

flattener

need a manual way to denormalize relationships up or down the chain

git ignore phpunit.xml and rename -> phpunit.xml.dist

Invalidate Schemas with conflicting datatypes

Need Parser interface and ParseValidator

There needs to be a standardized convention for Datum classes to implement the parse method.

convert or restrict datetime formats

strtotime is too liberal: it accepts bare timezone strings as valid datetimes, which is not acceptable for the purposes of the Schematizer on this line.

For example, this would be parsed as a valid datetime:

"datetime": {
    "max": {
        "value": "Turkey"
    },
    "min": {
        "value": "GB"
    }
},

Additionally, this is problematic because the Schema Merger won't invalidate conflicting datatypes, including this sample (#29).
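One way to restrict the accepted formats is to check values against an explicit list of patterns instead of a liberal parser. The sketch below is Python for illustration only; `ALLOWED_FORMATS` and `parse_strict_datetime` are hypothetical names, and the listed formats are an assumption about what the Schematizer would want to accept.

```python
from datetime import datetime

# Assumed set of acceptable formats; anything else is not a datetime.
ALLOWED_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
)


def parse_strict_datetime(value):
    """Return a datetime if value matches an allowed format, else None."""
    for fmt in ALLOWED_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
    return None


rejected = [parse_strict_datetime(v) for v in ("Turkey", "GB")]
accepted = parse_strict_datetime("2016-10-02 13:45:00")
```

With this approach, bare timezone names such as the two in the sample fall through every pattern and are treated as plain strings.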

Schema Merger needs to set minimums

Currently if one Schema doesn't have a minimum, and the other one does, then the Merger will discard all minimums.

If both Schemata lack a minimum because min == max respectively, then a new min must be set to the value of the smaller max.
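The intended rule could be sketched as follows, in Python for illustration only. The function `merge_min_max` is hypothetical, and the summaries are simplified to plain values rather than the nested "value" objects the Schematizer emits. The sketch assumes a missing minimum always means min == max.

```python
def merge_min_max(a, b):
    """Merge two {'max': x, 'min': y} summaries.

    A summary without 'min' is treated as having min == max, and the
    merged result omits 'min' again when it equals the merged max.
    """
    new_max = max(a["max"], b["max"])
    # Fall back to each side's max when its min is absent.
    new_min = min(a.get("min", a["max"]), b.get("min", b["max"]))
    merged = {"max": new_max}
    if new_min != new_max:
        merged["min"] = new_min
    return merged


one_missing = merge_min_max({"max": 10}, {"max": 4, "min": 2})
both_missing = merge_min_max({"max": 10}, {"max": 4})
identical = merge_min_max({"max": 5}, {"max": 5})
```

In the second call neither side has a minimum, so the smaller max (4) becomes the new min, as described above.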

refactor options trait

this is repeated throughout the library all too much

schematizer needs to recognize integers could be unix epoch timestamps
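A simple heuristic would be to flag integers that fall inside a plausible timestamp window. The sketch below is Python for illustration only; `could_be_epoch` is a hypothetical name, and the 2000 to 2100 window is an arbitrary assumption that a real option would need to make configurable.

```python
from datetime import datetime, timezone

# Assumed plausibility window, expressed as seconds since the Unix epoch.
EARLIEST = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())
LATEST = int(datetime(2100, 1, 1, tzinfo=timezone.utc).timestamp())


def could_be_epoch(value, earliest=EARLIEST, latest=LATEST):
    """Return True if value is an integer inside the plausible window."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return earliest <= value <= latest


checks = [could_be_epoch(v) for v in (1475415900, 42, "1475415900")]
```

This can only ever be a hint: an ordinary integer column may happen to fall inside the window, so the Schematizer would likely need every value in a field to pass before treating the field as a timestamp.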

console app

Because repetitive use has finally pushed its way into a priority bin 06ae305, it might as well go all in as a console component.

refactor DB, DDL class architecture

There's something aloof about the relationship between these classes. Despite the perils of multi-level inheritance, it still might be best to merge DDL and DB into a single abstract class that extends the abstract class Source. Then concrete classes such as MicrosoftSQLServer make more sense.

Schema Merger needs to invalidate conflicting datatypes

Text mixed in with XML elements is ignored

Regarding the behavior of the default serializer in sabre-xml, a method should be developed to handle text mixed in nodes with other elements.
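To illustrate what such a method would need to collect, here is a sketch using Python's standard xml.etree.ElementTree rather than sabre-xml. It shows only the general idea; the function `own_text` is hypothetical, and how this maps onto sabre-xml's deserializers is left open.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Collect the text that sits directly inside element.

    ElementTree stores mixed text in two places: element.text (before the
    first child) and each child's tail (after that child's closing tag).
    The content of the child elements themselves is skipped.
    """
    pieces = [element.text or ""]
    for child in element:
        pieces.append(child.tail or "")
    # Collapse runs of whitespace left behind where children were removed.
    return " ".join("".join(pieces).split())


node = ET.fromstring("<note>Hello <b>bold</b> world<i>x</i>!</note>")
mixed = own_text(node)
children = [child.tag for child in node]
```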

Use natural keys for DDL relationships

Currently the Schematizer identifies unique values as candidate keys, but this information is not yet applied in the DDL class. It would be nice to have the option to use these instead of the default surrogate jpetl_id keys.

This would be the first step in creating a truly integrated entity relationship between documents.

ignore fields in data not in schema on insert

This should lead to a huge performance boost if you're only interested in specific fields.
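The filtering step could look something like this, sketched in Python for illustration only. The function `keep_schema_fields` and the sample field names are hypothetical.

```python
def keep_schema_fields(record, schema_fields):
    """Return a copy of record containing only fields named in the schema."""
    return {key: value for key, value in record.items() if key in schema_fields}


schema_fields = {"id", "name"}
record = {"id": 7, "name": "Ann", "notes": "long free text", "tags": ["a", "b"]}
filtered = keep_schema_fields(record, schema_fields)
```

Dropping unwanted fields before building the insert means no values are bound or transferred for columns the schema does not define.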

preserve XML namespace prefixes

So far it's been "good enough" for simplicity's sake to ignore namespaces and just strip them out of the parsed data; however, this is clearly not a long-term solution.

check for errors with multiple multi-query batches

When using the surrogate key generator and inserting batches, if a query fails in the batch, then errors are silently ignored.

DB, DDL table prefix

Need to be able to apply a prefix to table names for cohabitation of same structures. This is a precursor for a multi-document replace operation whereby temp tables are used as a staging area for partial data transfers - in the event of a runtime interruption we don't want upsets to the existing production data.

JSON streamer for chunks

This would be more of an actual streamer than the current one, which is really more of an aggregator.

This would have the unique values disabled #15 because there's no efficient way to keep track of all those disparate values when merging discrete Schemas.

surrogate key generator

Because there are no transactional race conditions with distributed systems in this ETL, instead of relying on the db to generate the IDs, just generate them server-side prior to insert. This would make batch insertions #11 reasonable.

MySQL datatyper

This should also dictate the method for last-inserted-ID in the DB class

add custom headers to REST class

for API keys

generate surrogate IDs distinct to only one table (not universally unique)

With the explosively exponential creation of records, the integer range may be exhausted sooner than expected. Instead of using T-SQL BIGINT, which would break DBMS portability and waste disk space, the IDs should start over for each table, the same way an identity field would work.
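A per-table counter could be sketched like this, in Python for illustration only. The class `TableKeyGenerator` and its method names are hypothetical and do not reflect the library's current generator.

```python
from collections import defaultdict
from itertools import count


class TableKeyGenerator:
    """Hand out sequential IDs that restart at `start` for every table."""

    def __init__(self, start=1):
        # Each table name lazily gets its own independent counter.
        self._counters = defaultdict(lambda: count(start))

    def next_id(self, table):
        return next(self._counters[table])


keys = TableKeyGenerator()
order_ids = [keys.next_id("order") for _ in range(3)]
item_ids = [keys.next_id("item") for _ in range(2)]
```

Because IDs are only unique within a table, foreign keys would need to be interpreted together with the table they point at.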
