tap-airbyte-wrapper's Issues

Discovery Fails - JSONDecodeError: Extra data

@z3z1ma I'm getting an error parsing the catalog output as JSON after a successful discover, at line 178 of the tap:

    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 89)

I ended up figuring out that the source-s3 container STDOUT looks like this:

{"type": "LOG", "log": {"level": "INFO", "message": "initialised stream with format: {'filetype': 'jsonl'}"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Iterating S3 bucket 'dbt-redshift-testing' with prefix: 'spreadsheet_test/airbyte_tap' "}}
{"type": "LOG", "log": {"level": "DEBUG", "message": "try to open Key: spreadsheet_test/airbyte_tap/test.json, LastModified: 2022-12-12T17:13:24+00:00, Size: 0.0001Mb"}}
{"type": "LOG", "log": {"level": "INFO", "message": "determined master schema: {'my_data': 'string', 'something_else': 'string', 'blah': 'string'}"}}
{"type": "CATALOG", "catalog": {"streams": [{"name": "sample_s3_jsonl", "json_schema": {"type": "object", "properties": {"my_data": {"type": ["null", "string"]}, "something_else": {"type": ["null", "string"]}, "blah": {"type": ["null", "string"]}, "_ab_additional_properties": {"type": "object"}, "_ab_source_file_last_modified": {"type": "string", "format": "date-time"}, "_ab_source_file_url": {"type": "string"}}}, "supported_sync_modes": ["full_refresh", "incremental"], "source_defined_cursor": true, "default_cursor_field": ["_ab_source_file_last_modified"]}]}}

So when the tap tries to parse that whole blob it fails, but if I split the lines and take the last one (`json.loads(output.splitlines()[-1])`) it succeeds. I'm not sure it's a safe assumption that the catalog will always be the last line.
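Rather than assuming the catalog is always the last line, a more robust workaround might be to scan every STDOUT line and keep the `CATALOG` message, since the connector interleaves `LOG` messages on the same stream. A minimal sketch (the helper name is mine, not the tap's):

```python
import json

# Airbyte connectors emit one JSON protocol message per line on STDOUT.
# Scan all lines and pick out the CATALOG message instead of assuming
# it is the last line (a trailing LOG line would break that assumption).
def extract_catalog(stdout: str) -> dict:
    catalog = None
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "CATALOG":
            catalog = message["catalog"]
    if catalog is None:
        raise ValueError("No CATALOG message found in connector output")
    return catalog
```

This also tolerates `LOG` messages appearing after the catalog, which the last-line trick would not.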

Once I got over that I was able to do a meltano run tap-airbyte target-csv with success!! 🚀

  extractors:
  - name: tap-airbyte
    namespace: tap_airbyte
    pip_url: git+https://github.com/z3z1ma/tap-airbyte.git
    executable: tap-airbyte
    capabilities:
    - catalog
    - discover
    - state
    settings:
    - name: airbyte_spec
      kind: object
    - name: connector_config
      kind: object
    config:
      airbyte_spec:
        image: airbyte/source-s3
        tag: latest
      connector_config:
        dataset: sample_s3_jsonl
        path_pattern: '**'
        format:
          filetype: jsonl
        provider:
          bucket: <MY_BUCKET_NAME>
          path_prefix: spreadsheet_test/airbyte_tap
          aws_access_key_id: ${AWS_ACCESS_KEY_ID}
          aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}

feat: Ability to mount volumes

For sources like source-file, where someone wants to read a local CSV, we won't be able to access the file from inside the Docker container. Maybe we should add the ability to configure a volume; it could be hidden or default to the Meltano project root for most sources.
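As a sketch of what a configurable mount could look like, a hypothetical `docker_mounts` setting could be translated into `docker run -v` flags. All names here are assumptions for illustration, not the tap's actual API:

```python
from typing import Dict, List

# Hypothetical sketch: translate a list of mount configs into
# `docker run -v host_path:container_path` arguments.
def build_docker_args(image: str, mounts: List[Dict[str, str]]) -> List[str]:
    args = ["docker", "run", "--rm", "-i"]
    for mount in mounts:
        # Each mount maps a host path to a path inside the container.
        args += ["-v", f"{mount['source']}:{mount['target']}"]
    args.append(image)
    return args
```

Defaulting `source` to the Meltano project root would let source-file read project-local CSVs without any extra configuration.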

Documentation: detail docker-in-docker requirements

I'd recommend documenting the requirements for running this wrapper when Meltano itself is already running in Docker. This is possible, but requires the /tmp directory to be mounted from the host into the Meltano container.

This ensures that /tmp/config.json can be accessed: the Python script uses mktemp to create a directory in /tmp such as /tmp/tmp.BNlf296WXX, which is then mapped down to the Airbyte container. Because of the way docker-in-docker works, volume mounts refer to paths on the host, not paths in the Meltano container, so the directory needs to be bind-mounted for it to show up in the Airbyte image.

bug: when docker daemon not running the error is "Could not discover catalog"

I noticed, after restarting my laptop before the Docker daemon had started, that the tap gave me a red-herring error: it likely couldn't start the container during catalog discovery, and the underlying exception was swallowed.

    raise AirbyteException("Could not discover catalog")
tap_airbyte.tap.AirbyteException: Could not discover catalog

It would be awesome if this could instead say "Docker daemon not found: Docker must be running to use this tap".
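One way to produce that message would be an up-front check before discovery. This is an assumed helper, not code that exists in the tap today:

```python
import shutil
import subprocess

# Sketch: fail fast with a clear message when Docker is unavailable,
# instead of letting discovery fail with "Could not discover catalog".
# The `run` parameter exists so the check can be exercised without Docker.
def assert_docker_running(run=subprocess.run) -> None:
    if shutil.which("docker") is None:
        raise RuntimeError("Docker executable not found on PATH")
    # `docker info` exits non-zero when the daemon is not reachable.
    result = run(
        ["docker", "info"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        raise RuntimeError(
            "Docker daemon not found: Docker must be running to use this tap"
        )
```

Calling this at the top of discovery would surface the real cause instead of the generic `AirbyteException`.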

Error running in Python 3.7: @lru_cache

I believe that in Python 3.7 you need parentheses on any use of @lru_cache, so @lru_cache() instead of @lru_cache. I realize 3.7 is ancient, but I'm stuck on a machine that can't upgrade easily, and I think Meltano still supports 3.7+.

If it matters, I'm hitting this error using the Airbyte variant of tap-outreach.

  File "/srv/analytics/meltano/.meltano/extractors/tap-outreach/venv/bin/tap-airbyte", line 5, in <module>
    from tap_airbyte.tap import TapAirbyte
  File "/srv/analytics/meltano/.meltano/extractors/tap-outreach/venv/lib/python3.7/site-packages/tap_airbyte/tap.py", line 105, in <module>
    class TapAirbyte(Tap):
  File "/srv/analytics/meltano/.meltano/extractors/tap-outreach/venv/lib/python3.7/site-packages/tap_airbyte/tap.py", line 547, in TapAirbyte
    @lru_cache
  File "/usr/lib64/python3.7/functools.py", line 490, in lru_cache
    raise TypeError('Expected maxsize to be an integer or None')
TypeError: Expected maxsize to be an integer or None
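For reference, the 3.7-compatible form just adds the parentheses. The decorated function below is a stand-in for illustration, not the tap's real method:

```python
from functools import lru_cache

# On Python 3.7 the decorator must be *called*: bare `@lru_cache`
# (supported since 3.8) passes the function itself as `maxsize`,
# raising "Expected maxsize to be an integer or None" at import time.
@lru_cache()  # the parentheses make this work on 3.7 and later
def image_spec(image: str) -> str:
    # Stand-in body; the real tap shells out to Docker here.
    return f"spec for {image}"
```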

Feature: print `--spec` output

It looks like you already started to do this in https://github.com/z3z1ma/tap-airbyte/blob/1a9d58c4ebfd4ee04b761d721ed19dc6d968d499/tap_airbyte/tap.py#L77, but it would be awesome if the tap could print the Airbyte config spec once an image is defined.

The sources have a --spec flag, but I don't know how to access it right now. Since the tap is source-agnostic, it would be cool to let someone configure their source image and then run some sort of spec command to list all config options for that source.

Another alternative (if it's even possible) is to override the --about output with this info, so that given an image, the --about output is specific to that image rather than the generic top-level config options (i.e. airbyte_spec/connector_config).
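Since connectors emit their spec as an Airbyte protocol `SPEC` message, the parsing half of this feature could look like the sketch below (the function name is mine; obtaining `stdout` would be something like `docker run --rm airbyte/source-s3 spec`):

```python
import json

# Sketch: parse the protocol messages a connector prints for its
# `spec` command and return the SPEC payload, skipping LOG lines.
def parse_spec(stdout: str) -> dict:
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "SPEC":
            return message["spec"]
    raise ValueError("No SPEC message found in connector output")
```

The returned spec's `connectionSpecification` is a JSON Schema, so it could be pretty-printed directly or folded into the --about output per image.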
