
Comments (5)

adranwit commented on September 26, 2024

BqTail loads data files from Google Storage using the BigQuery load API. To keep costs down, it never reads the files into memory; instead it can batch them and pass their URIs to load jobs, followed by a sequence of optional post-processing steps.
BqTail can optionally use one schema for the transient table and a different schema for the destination table. See https://github.com/viant/bqtail/tree/master/e2e/regression/cases/019_transient_schema
BqTail retries ingestion in only one case: if a batch holds, say, 100 files and some of them are corrupted or have an incompatible schema, those files are excluded (moved to a specified location) and the load is rerun with the remaining files.
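
The exclude-and-reload behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not BqTail's actual code: `load_with_exclusion`, `submit_load`, and `fake_submit` are hypothetical names, and the stand-in loader simply pretends that any file ending in `bad.json` is corrupted.

```python
from typing import Callable, Iterable, List, Set, Tuple


def load_with_exclusion(
    uris: Iterable[str],
    submit_load: Callable[[List[str]], Set[str]],
    max_attempts: int = 5,
) -> Tuple[List[str], List[str]]:
    """Reload a batch until it succeeds, dropping the URIs the load rejects.

    submit_load is a hypothetical callback: it submits one load job for the
    given URIs and returns the subset the job reported as bad
    (an empty set means the job succeeded).
    """
    pending = list(uris)
    excluded: List[str] = []
    attempts = 0
    while pending:
        if attempts == max_attempts:
            raise RuntimeError("batch still failing after max_attempts")
        attempts += 1
        bad = submit_load(pending)
        if not bad:
            break
        # Set the rejected files aside and retry with the rest.
        excluded.extend(u for u in pending if u in bad)
        pending = [u for u in pending if u not in bad]
    return pending, excluded


def fake_submit(batch: List[str]) -> Set[str]:
    # Stand-in for a real load job: treat *bad.json files as corrupted.
    return {u for u in batch if u.endswith("bad.json")}


loaded, excluded = load_with_exclusion(
    ["gs://bucket/a.json", "gs://bucket/bad.json", "gs://bucket/b.json"],
    fake_submit,
)
```

In a real pipeline the excluded URIs would then be moved to the configured location instead of just being returned.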

Since BqTail only uses the BigQuery Jobs API, autodetect would not work, as you have stated.

I believe what you need for your case is a generic cloud function that parses a JSON file to generate a table schema.

In our organization, we use https://github.com/viant/smirror for schema data validation, transcoding, splitting, or partitioning. Smirror could be extended to autodetect the schema: for example, while data is transferred from source to destination, it could also generate a table schema for the destination table.
If you are interested in this feature, please open an issue at https://github.com/viant/smirror asking to create a BigQuery schema file while mirroring a file.

With the schema alongside the data file, loading with BqTail could be easy, though it would still need a small extension.

from bqtail.

ktopcuoglu commented on September 26, 2024

Thank you for the detailed explanation.
I looked through all of your test cases and love the various usages, really appreciate it!

Similar to your approach, we currently load ~1b rows/~3m files daily with bq load jobs only.
A minor difference: before rejecting/excluding a file from a load job, we parse the error message and check for the missing column XXX term. We then add this column with type string and restart the load job with the same blobs. With this method I never have to process the actual data files for schema extraction. I know it is not the best solution, but it works well because most new fields are strings anyway.
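
A minimal sketch of that error-driven schema patch is below. The message pattern `No such field: <name>` is an assumption about what the load job returns and should be adjusted to the text you actually see; `missing_fields` and `add_string_columns` are hypothetical helper names, and the schema is represented as a list of dicts in the BigQuery JSON schema shape (`name`, `type`, `mode`).

```python
import re
from typing import Dict, List

# Assumed message shape; adjust the pattern to the text your load jobs return.
MISSING_FIELD = re.compile(r"No such field: ([A-Za-z_][A-Za-z0-9_]*)")


def missing_fields(error_messages: List[str]) -> List[str]:
    """Return the distinct field names reported as missing, in first-seen order."""
    seen: List[str] = []
    for message in error_messages:
        for name in MISSING_FIELD.findall(message):
            if name not in seen:
                seen.append(name)
    return seen


def add_string_columns(
    schema: List[Dict[str, str]], names: List[str]
) -> List[Dict[str, str]]:
    """Return a new schema with each unknown name appended as a NULLABLE STRING."""
    known = {field["name"] for field in schema}
    extra = [
        {"name": name, "type": "STRING", "mode": "NULLABLE"}
        for name in names
        if name not in known
    ]
    return schema + extra


# Illustrative error text and starting schema (both made up for the example).
errors = ["JSON parsing error in row starting at position 0: No such field: utm_source."]
schema = [{"name": "id", "type": "INTEGER", "mode": "REQUIRED"}]
patched = add_string_columns(schema, missing_fields(errors))
```

After patching the table schema, the same blobs can be resubmitted without ever opening the data files.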

If we used viant/smirror for schema extraction, we would have to process all files, as you said, and with clickstream log files this would be costly :(

adranwit commented on September 26, 2024

Very interesting way of loading data. How many times do you have to restart your jobs to get the full schema in the worst-case scenario?
What's the largest number of columns in your stream?
Do you build a super schema template over time that has all possible fields?

ktopcuoglu commented on September 26, 2024

This is the cheapest way of handling evolving schemas that I could find.
In the worst case, in theory, we have to restart as many times as the number of fields, but:

  • currently we have flat JSON files only; they are well formatted, so we don't have to handle nested JSON.
  • the bq load job sometimes returns 2 missing fields at a time, which reduces the retry count :)
  • we have some default fields. For the widest table we have, 60 columns were added after the fixed schema, 40 of them this month, with a max of 5 new fields in a day. (We add a timestamp as the description when adding a column automatically, so we can track them easily.)
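
To make the retry bound concrete, here is a small simulation of the loop described in these points, under stated assumptions: `load_until_schema_complete` and `try_load` are hypothetical names, the stand-in loader reports at most two missing fields per failed job, and each auto-added column gets a timestamp in its `description`.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List


def load_until_schema_complete(
    schema: List[Dict[str, str]],
    try_load: Callable[[List[str]], List[str]],
    max_retries: int = 200,
) -> int:
    """Retry a load, adding reported columns each time; return the retry count.

    try_load is a hypothetical callback that takes the current column names,
    runs one load job, and returns the names it reported as missing
    (an empty list means the load succeeded). The schema list is extended
    in place.
    """
    retries = 0
    while True:
        missing = try_load([field["name"] for field in schema])
        if not missing:
            return retries
        if retries == max_retries:
            raise RuntimeError("schema still incomplete after max_retries")
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        for name in missing:
            schema.append({
                "name": name,
                "type": "STRING",
                "mode": "NULLABLE",
                # Timestamp in the description makes auto-added columns traceable.
                "description": f"auto-added {stamp}",
            })
        retries += 1


# Stand-in loader: the data carries five fields the table lacks, and each
# failed job reports at most two of them.
data_fields = ["id", "a", "b", "c", "d", "e"]


def fake_try_load(columns: List[str]) -> List[str]:
    return [name for name in data_fields if name not in columns][:2]


schema = [{"name": "id", "type": "INTEGER", "mode": "REQUIRED"}]
retries = load_until_schema_complete(schema, fake_try_load)
```

With five new fields and up to two reported per failure, the simulated load needs three retries rather than five, which matches the observation that multi-field errors cut the retry count.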

The largest schema has 157 columns for now, including default/known columns.

Both yes and no: at any point in time I have a super schema, but they can add new fields at any time.

adranwit commented on September 26, 2024

Addressed this use case in BqTail 2.1.0 with the AllowFieldAddition option at the Dest level.
Added an e2e case: https://github.com/viant/bqtail/tree/master/e2e/regression/cases/038_cli_json_field_addition
