meltanolabs / tap-airbyte-wrapper
A Singer tap that wraps Airbyte sources, allowing them to be consumed by Singer targets
License: MIT License
@z3z1ma I'm getting an error parsing the catalog JSON after a successful discover, at tap line 178.
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 89)
I ended up figuring out that the source-s3 container STDOUT looks like this:
{"type": "LOG", "log": {"level": "INFO", "message": "initialised stream with format: {'filetype': 'jsonl'}"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Iterating S3 bucket 'dbt-redshift-testing' with prefix: 'spreadsheet_test/airbyte_tap' "}}
{"type": "LOG", "log": {"level": "DEBUG", "message": "try to open Key: spreadsheet_test/airbyte_tap/test.json, LastModified: 2022-12-12T17:13:24+00:00, Size: 0.0001Mb"}}
{"type": "LOG", "log": {"level": "INFO", "message": "determined master schema: {'my_data': 'string', 'something_else': 'string', 'blah': 'string'}"}}
{"type": "CATALOG", "catalog": {"streams": [{"name": "sample_s3_jsonl", "json_schema": {"type": "object", "properties": {"my_data": {"type": ["null", "string"]}, "something_else": {"type": ["null", "string"]}, "blah": {"type": ["null", "string"]}, "_ab_additional_properties": {"type": "object"}, "_ab_source_file_last_modified": {"type": "string", "format": "date-time"}, "_ab_source_file_url": {"type": "string"}}}, "supported_sync_modes": ["full_refresh", "incremental"], "source_defined_cursor": true, "default_cursor_field": ["_ab_source_file_last_modified"]}]}}
So when the tap tries to parse that whole blob it fails, but if I split the output into lines and take the last one, json.loads(output.splitlines()[-1]), it succeeds. I'm not sure it's safe to assume the catalog will always be the last line, though.
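A more robust approach than taking the last line might be to filter stdout for the CATALOG message type, since connectors are free to interleave LOG lines before or after it. A sketch (function name is mine, not the tap's):

```python
import json


def extract_catalog(output: str) -> dict:
    """Scan connector stdout for the Airbyte CATALOG message.

    Filtering on the "type" field is safer than assuming the catalog
    is always the last line, since connectors may emit LOG messages
    anywhere in the stream.
    """
    for line in output.splitlines():
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON noise on stdout
        if message.get("type") == "CATALOG":
            return message["catalog"]
    raise ValueError("No CATALOG message found in connector output")


output = "\n".join([
    '{"type": "LOG", "log": {"level": "INFO", "message": "hi"}}',
    '{"type": "CATALOG", "catalog": {"streams": []}}',
])
catalog = extract_catalog(output)
# catalog == {"streams": []}
```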
Once I got past that, I was able to run meltano run tap-airbyte target-csv successfully!
extractors:
- name: tap-airbyte
  namespace: tap_airbyte
  pip_url: git+https://github.com/z3z1ma/tap-airbyte.git
  executable: tap-airbyte
  capabilities:
  - catalog
  - discover
  - state
  settings:
  - name: airbyte_spec
    kind: object
  - name: connector_config
    kind: object
  config:
    airbyte_spec:
      image: airbyte/source-s3
      tag: latest
    connector_config:
      dataset: sample_s3_jsonl
      path_pattern: '**'
      format:
        filetype: jsonl
      provider:
        bucket: <MY_BUCKET_NAME>
        path_prefix: spreadsheet_test/airbyte_tap
        aws_access_key_id: ${AWS_ACCESS_KEY_ID}
        aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
All extractions are full-table syncs, regardless of what Airbyte discovery returns.
For sources like source-file, where someone wants to access a local CSV, we won't be able to access the file from inside the Docker container. Maybe we should add the ability to configure a volume; it could be hidden, or defaulted to the Meltano project root for most sources.
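One hypothetical shape for this in meltano.yml (the docker_mounts setting is invented purely for illustration and doesn't exist in the tap today; the source-file connector_config fields are likewise only indicative):

```yaml
config:
  airbyte_spec:
    image: airbyte/source-file
    tag: latest
  # hypothetical setting, not implemented: bind-mount host paths into
  # the connector container so local files become reachable
  docker_mounts:
  - source: /path/on/host/data   # e.g. the Meltano project root
    target: /local               # path seen inside the Airbyte container
  connector_config:
    dataset_name: my_csv
    url: /local/data.csv
    provider:
      storage: local
```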
I'd recommend documenting the requirements for running this wrapper if you're already running Meltano in Docker. It is possible, but requires the /tmp directory to be mounted from the host into the Meltano container. This ensures that /tmp/config.json can be accessed: the Python script uses mktemp to create a directory in /tmp, such as /tmp/tmp.BNlf296WXX, which can then be mapped down to the Airbyte container. Due to the way docker-in-docker works, volume mounts refer to paths on the host, not in the Meltano container, so /tmp needs to be bind-mounted for it to show up in the Airbyte image.
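Concretely, the host-side invocation might look something like this (the image name and the exact Meltano command are illustrative, not prescribed by the tap):

```shell
# Mount the Docker socket so the tap can start sibling containers, and
# bind-mount /tmp so the mktemp directories created inside the Meltano
# container also exist on the host, where the Airbyte container mounts
# them from.
docker run \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /tmp:/tmp \
  meltano/meltano run tap-airbyte target-csv
```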
I noticed that when I restarted my laptop and the Docker daemon wasn't running yet, the tap gave me a red-herring error, likely because it couldn't start the container during catalog discovery and the exception was caught.
raise AirbyteException("Could not discover catalog")
tap_airbyte.tap.AirbyteException: Could not discover catalog
It would be awesome if this could instead say "Docker daemon not found: Docker must be running to use this tap".
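A pre-flight check along these lines could produce that message. This is only a sketch of the suggestion, not the tap's actual code (the injectable run parameter is there just to make the check easy to test):

```python
import subprocess


def ensure_docker_running(run=subprocess.run) -> None:
    """Fail fast with a clear message instead of a generic discovery error.

    `docker info` exits non-zero when the daemon is unreachable, so we
    can surface a helpful error before attempting catalog discovery.
    """
    try:
        probe = run(["docker", "info"], capture_output=True)
    except FileNotFoundError:
        raise RuntimeError("Docker executable not found on PATH")
    if probe.returncode != 0:
        raise RuntimeError(
            "Docker daemon not found: Docker must be running to use this tap"
        )
```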
I believe that in Python 3.7 you need parentheses on any use of @lru_cache, i.e. @lru_cache() instead of @lru_cache. I realize 3.7 is ancient, but I'm stuck on a machine that can't upgrade easily, and I think Meltano still supports 3.7+.
If it matters, I'm hitting this error using the Airbyte variant of tap-outreach.
File "/srv/analytics/meltano/.meltano/extractors/tap-outreach/venv/bin/tap-airbyte", line 5, in <module>
from tap_airbyte.tap import TapAirbyte
File "/srv/analytics/meltano/.meltano/extractors/tap-outreach/venv/lib/python3.7/site-packages/tap_airbyte/tap.py", line 105, in <module>
class TapAirbyte(Tap):
File "/srv/analytics/meltano/.meltano/extractors/tap-outreach/venv/lib/python3.7/site-packages/tap_airbyte/tap.py", line 547, in TapAirbyte
@lru_cache
File "/usr/lib64/python3.7/functools.py", line 490, in lru_cache
raise TypeError('Expected maxsize to be an integer or None')
TypeError: Expected maxsize to be an integer or None
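For reference, a minimal illustration of the difference (the function body here is invented). On 3.7 the bare form fails because the decorated function itself is passed as maxsize; using @lru_cache without parentheses was only allowed starting in Python 3.8:

```python
from functools import lru_cache


# Works on 3.7+: lru_cache is called, returning the real decorator.
@lru_cache()
def airbyte_spec():
    return {"image": "airbyte/source-s3"}


# On 3.7 the bare form below raises at class/module definition time:
#   TypeError: Expected maxsize to be an integer or None
# @lru_cache
# def airbyte_spec(): ...
```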
It looks like you already started on this in https://github.com/z3z1ma/tap-airbyte/blob/1a9d58c4ebfd4ee04b761d721ed19dc6d968d499/tap_airbyte/tap.py#L77, but it would be awesome if the tap could print out the Airbyte config spec once an image is defined.
The sources have a --spec flag, but I don't know how to access it right now. Since the tap is source-agnostic, it could be cool to let someone configure their source image and then run some sort of spec command to list all of that source's config options. Another alternative (if it's even possible) is to override the --about output with this info somehow, so that given an image, the --about output is specific to that image rather than the generic top-level config options, i.e. airbyte_spec/connector_config.
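A sketch of what such a helper could look like, assuming the standard Airbyte protocol where running a connector with the spec command emits a {"type": "SPEC", ...} message on stdout (both function names here are invented for illustration):

```python
import json
import subprocess


def get_connector_spec(image: str, tag: str = "latest") -> dict:
    """Run the connector's `spec` command and return its config spec."""
    output = subprocess.run(
        ["docker", "run", "--rm", f"{image}:{tag}", "spec"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_spec(output)


def parse_spec(output: str) -> dict:
    """Pick the SPEC message out of stdout, ignoring LOG lines."""
    for line in output.splitlines():
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue
        if message.get("type") == "SPEC":
            return message["spec"]
    raise ValueError("No SPEC message found in connector output")
```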