
Comments (3)

laurentS commented on July 20, 2024

I'm a bit light on the metadata part of the singer spec, so I'll chime in with my "user's" perspective.
My use case is tap-github | target-postgres (and other similar API taps), and I use the datamill variant of the target.

What I'm seeing:

  • the SCHEMA messages seem to follow the schema definition in the stream class, which in my use case has an impact on the shape of the db downstream. RECORD data seems to then be filtered on the downstream side, so if user.repos_url appears in the record but was not in the schema, it does not end up in my db.
  • in terms of consistency, if removing a top-level field from the schema removes it from the output, but removing a subfield has no impact, I find it a bit confusing and would want at least a big fat warning ⚠️ in the docs about it
  • since the undeclared subfields don't appear in the catalog/SCHEMA messages, I guess I have no way as the user of the tap to deselect them after running the discovery. In a case where this has expensive side effects on the target side, it might cause problems.
  • in terms of performance, if you send all these "useless" subfields, you're processing them in the tap and then letting the target process them as well (target-postgres at least validates the record against the schema it received), even though the schema has made it clear that such info should not be coming. Thinking of all the serialization/deserialization that happens on both sides, I suspect this might have a non-trivial impact on performance. With the example record above, the extra data that comes through "off schema" takes the record weight from 811 bytes to 1572, almost double. Thinking of a PR I opened around this, cutting the target's input by half would not be negligible.
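To put a number on the last point, here is a rough sketch of how the per-record overhead could be measured. The two records are made-up stand-ins (not the example record mentioned above), and `serialized_size` is a hypothetical helper, not part of any tap or target.

```python
import json


def serialized_size(record: dict) -> int:
    """Size in bytes of the record as compact UTF-8 JSON."""
    return len(json.dumps(record, separators=(",", ":")).encode("utf-8"))


# Hypothetical record containing only fields the schema declares.
declared_only = {"id": 1, "user": {"login": "octocat"}}

# Same record with an undeclared subfield riding along.
with_extras = {
    "id": 1,
    "user": {
        "login": "octocat",
        "repos_url": "https://example.invalid/users/octocat/repos",
    },
}

overhead = serialized_size(with_extras) - serialized_size(declared_only)
print(f"off-schema overhead: {overhead} bytes per record")
```

This only counts bytes on the wire; validation and deserialization cost on the target side would come on top of it.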

I'm not sure this addresses your questions exactly, but my feeling from thinking through it is that if a field is not declared in the schema, it should probably not appear in records 🙂

from tap-github.

aaronsteers commented on July 20, 2024

Indeed, this is something @edgarrmondragon and I have been discussing as of late. We probably should be excluding undeclared subproperties, but as of today I think they get included or excluded based on the selected metadata of the parent.

@edgarrmondragon, fyi, as related to recent discussions over on the SDK. I was previously thinking selected-by-default of the parent might be a path to decide the selected status of undeclared subnodes, but on further review of the spec docs, I couldn't find any guidance that actually supports that direction. I think the safest route is to just completely ignore the selected metadata of parents if a property or subproperty is undeclared in the schema. This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by the catalog, aka the tap developer.

Note: all of the above is with regard to properties and subproperties in the stream's catalog schema, and not necessarily to the metadata selection. Meaning, omitting a child node's selection metadata would still cause the node to default to the parent value. The implicit removal only applies if a node is completely unknown/undeclared by the catalog.
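As a rough illustration of that "second mask", here is a minimal sketch that drops any record node the schema does not declare. It is not the SDK's implementation: `prune_undeclared` is a hypothetical name, the schema and record are made up, and the sketch assumes that an object whose schema lists no `properties` is free-form and should pass through untouched.

```python
from typing import Any


def prune_undeclared(value: Any, schema: dict) -> Any:
    """Return a copy of `value` keeping only nodes declared in `schema`."""
    props = schema.get("properties")
    if isinstance(value, dict) and isinstance(props, dict):
        # Object with declared properties: drop every key the schema does not name.
        return {
            key: prune_undeclared(child, props[key])
            for key, child in value.items()
            if key in props
        }
    items = schema.get("items")
    if isinstance(value, list) and isinstance(items, dict):
        # Array: apply the item schema to each element.
        return [prune_undeclared(child, items) for child in value]
    # Scalars, nulls, and objects without declared properties pass through.
    return value


# Hypothetical stream schema and record, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "user": {
            "type": ["object", "null"],
            "properties": {"login": {"type": "string"}},
        },
    },
}
record = {
    "id": 1,
    "node_id": "abc",
    "user": {"login": "octocat", "repos_url": "https://example.invalid/repos"},
}

pruned = prune_undeclared(record, schema)
```

Here both the undeclared top-level `node_id` and the undeclared subfield `user.repos_url` are removed. Selection metadata is not consulted at all in this sketch; it would be applied as a separate step.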

@laurentS - does this sound like it meets your expectations as well? Meaning, as tap developer, you'd have confidence that nothing undeclared in the schema will slip downstream to the target?

Thanks, both.

from tap-github.

edgarrmondragon commented on July 20, 2024

> This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by catalog, aka the tap developer.

That could work. If we're gonna walk the entire JSON schema tree to figure out which props are declared, it might also make sense to update our MetadataMapping.get_standard_metadata to do just that. The dumped catalog would get a bit fatter, though.
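For reference, walking the schema tree to list declared props could look roughly like the sketch below. This is not what `MetadataMapping.get_standard_metadata` does today: `declared_breadcrumbs` is a hypothetical helper, and the breadcrumb shape (alternating `"properties"` and the field name) is assumed to follow the usual Singer convention. It only descends into object properties, not array items.

```python
from typing import Iterator


def declared_breadcrumbs(schema: dict, prefix: tuple = ()) -> Iterator[tuple]:
    """Yield a breadcrumb for every property declared in `schema`, depth first."""
    for name, subschema in schema.get("properties", {}).items():
        crumb = prefix + ("properties", name)
        yield crumb
        # Recurse so nested object properties get their own breadcrumbs.
        yield from declared_breadcrumbs(subschema, crumb)


# Hypothetical stream schema, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "user": {
            "type": ["object", "null"],
            "properties": {"login": {"type": "string"}},
        },
    },
}

crumbs = list(declared_breadcrumbs(schema))
for crumb in crumbs:
    print(crumb)
```

Every nested property adds one entry, which is where the fatter catalog would come from.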

from tap-github.
