
tap-bigquery's Issues

Running tap without datetime_key?

Is it possible to use this tap without the datetime_key? I'm trying to sync a table where the date column is sometimes null but I still want to grab all of the rows.

Incorrect auth instructions

The instructions for generating the client secrets https://github.com/anelendata/tap-bigquery#step-1-activate-the-google-bigquery-api are very detailed (thanks!) but unfortunately it looks like OAuth is the wrong method to use here. When I tried to use it, first I was hit by gh-22, then I got an error:

expected one of ('authorized_user', 'service_account', 'external_account', 'impersonated_service_account', 'gdch_service_account')

Creating a service account, as explained in target-bigquery https://github.com/adswerve/target-bigquery#step-2-authenticate-with-a-service-account, worked 👍🏽
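
For anyone hitting the same error, a minimal sanity check (a sketch, not part of the tap) that a service-account key file authenticates where the OAuth client secrets do not:

from google.cloud import bigquery
from google.oauth2 import service_account

# Assumes the service-account key was downloaded as client_secrets.json.
creds = service_account.Credentials.from_service_account_file("client_secrets.json")
client = bigquery.Client(credentials=creds, project=creds.project_id)
print(list(client.query("SELECT 1 AS ok").result()))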

Pin numpy to "<2"

Description

Using meltano==3.4.0 and installing the tap as defined in the MeltanoLabs page here results in an error (verbose output truncated):

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
ImportError: numpy.core.multiarray failed to import

How to reproduce

Using:

  • poetry 1.8.3
  • meltano==3.4.0
  • meltanolabs-tap-bigquery==0.0.1

Steps:

  1. Initialize a pyproject (I do not think using pyproject is really relevant here, any dependency will do) and install meltano==3.4.0
  2. Add the bq tap as described in the meltano labs page meltano add extractor tap-bigquery
  3. Create a valid config and run the command meltano config tap-bigquery test

Current work around

This workaround requires manual intervention and is a bit hacky. Go into the virtual environment where Meltano installed the tap and downgrade numpy:

  1. pip uninstall numpy
  2. pip install "numpy<2"
  3. Run meltano config tap-bigquery test again and it should work.

Proposed solution

Pin numpy to < 2 on the main branch.

Remove or find a workaround to authenticate using SAs

Right now you are forced to have the GOOGLE_APPLICATION_CREDENTIALS variable set, as it is also set by default to $MELTANO_PROJECT_ROOT/client_secrets.json, as specified here.

The issue with this is that if you want to authenticate without using keys (whether an actual JSON file or a string representation of one), you can't. The bigquery.Client call checks whether this variable is set and then tries to authenticate using the file the variable points to, as specified here.

You might want to do this when running your application in Kubernetes and using service accounts to authenticate, which is somewhat safer because the machine running the code does not need to hold any kind of credentials (this is the particular case blocking us right now). A more detailed explanation by Google can be found here:

Many Google Cloud services let you attach a service account that can be used to provide credentials for accessing Google Cloud APIs. If ADC does not find credentials it can use in either the GOOGLE_APPLICATION_CREDENTIALS environment variable or the well-known location for local ADC credentials, it uses the metadata server to get credentials for the service where the code is running.

I am not sure what the best approach is here; removing the default value might be a breaking change, so maybe add an additional flag like AUTHENTICATE_WITH_SA that simply removes the variable using os.environ.pop, as sketched below.
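
A minimal sketch of what such a flag could look like, assuming a hypothetical authenticate_with_sa config key (nothing here is in the tap today):

import os

from google.cloud import bigquery

def get_client(config):
    # Hypothetical flag: when set, drop the key-file variable so Application
    # Default Credentials fall back to the attached service account via the
    # metadata server instead of looking for a local key file.
    if config.get("authenticate_with_sa"):
        os.environ.pop("GOOGLE_APPLICATION_CREDENTIALS", None)
    return bigquery.Client()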

Looking for feedback from the owner and maintainers, and also checking whether this repo is active.

when available use state.json bookmark instead of --start_datetime argument

--start_datetime is a required argument.

Currently when it is provided, it overrides the last_update property value from state.json

{"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
INFO Running query:
    SELECT hash_id,processor,processor_cores,string_created_at,created_at,primary_resolution,memory,os,email,browser,netspeed,record_source FROM `dxdiag-aa08f.unioned.vw_system_checker` WHERE 1=1 AND datetime '2020-01-01 00:00:00.000000' <= CAST(created_at as datetime) AND CAST(created_at as datetime) < datetime '2020-09-09 07:07:52.550010' ORDER BY created_at
time=2020-09-09 07:07:54 name=target_snowflake level=INFO message=Table 'systemchecker_singer."SYSTEM_CHECKER"' does not exist. Creating...
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 9147, "tags": {"endpoint": "system_checker"}}
INFO Syncing stream:system_checker
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Uploading 9147 rows to external snowflake stage on S3
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Target S3 bucket: cfsnowflakestage, local file: /tmp/records_b51fywzm.csv, S3 key: prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Loading 9147 rows into 'systemchecker_singer."SYSTEM_CHECKER"'
time=2020-09-09 07:08:03 name=target_snowflake level=INFO message=Loading into systemchecker_singer."SYSTEM_CHECKER": {"inserts": 0, "updates": 9147, "size_bytes": 500703}
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Deleting prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv from external snowflake stage on S3
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Emitting state {"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}

If --start_datetime continues to be required, a value in state.json should override the --start_datetime argument value; a sketch of the desired resolution order follows.
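
A minimal sketch of that behaviour, using hypothetical helper and argument names (resolve_start_datetime is not part of the tap):

def resolve_start_datetime(args, state, stream_name):
    # Prefer the bookmark persisted in state.json; fall back to --start_datetime
    # only when no bookmark exists yet (e.g. on the very first run).
    bookmark = (
        state.get("bookmarks", {})
             .get(stream_name, {})
             .get("last_update")
    )
    return bookmark or args.start_datetime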

Path to `client_secrets.json` cannot be specified via `config.json`

By convention:

The config file contains whatever parameters the Tap needs in order to pull data from the source. Typically this will include the credentials for the API or data source.

Reliance on a required GOOGLE_APPLICATION_CREDENTIALS env var makes this tap an exception, complicating integration with singer-runner tooling.
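
A minimal sketch of the convention-friendly alternative, assuming a credentials_path key in config.json (hypothetical here, not a documented setting of this tap):

from google.cloud import bigquery
from google.oauth2 import service_account

def client_from_config(config):
    # Read the key file path from config.json instead of requiring the
    # GOOGLE_APPLICATION_CREDENTIALS environment variable to be set.
    creds = service_account.Credentials.from_service_account_file(
        config["credentials_path"]
    )
    return bigquery.Client(credentials=creds, project=creds.project_id)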

`Cannot access field key on a value with type ARRAY` when trying to select (record, repeated) fields

Trying to extract a table with (record, repeated) fields, coming from Google Analytics 4:

(Screenshot: the GA4 export table schema, showing event_params as a repeated RECORD field.)

using this config:

{
  "streams": [
    {
      "name": "analytics_257428027",
      "table": "orchest-ga4-data.analytics_257428027.events_20220810",
      "columns": [
        "event_timestamp",
        "event_name",
        "event_params.key",
        "event_params.value.string_value",
        "event_params.value.int_value",
        "traffic_source.name",
        "traffic_source.medium",
        "traffic_source.source"
      ],
      "datetime_key": "event_date"
    }
  ],
  "credentials_path": "/project-dir/meltano/client_secrets.json",
  "start_datetime": "2022-08-10T00:00:00Z",
  "start_always_inclusive": true
}

I'm getting this error:

 INFO Running query:
    SELECT event_timestamp,event_name,event_params.key,event_params.value.string_value,event_params.value.int_value,traffic_source.name,traffic_source.medium,traffic_source.source,event_date FROM orchest-ga4-data.analytics_257428027.events_20220810 WHERE 1=1 AND datetime '2022-08-10 00:00:00.000000' <= CAST(event_date as datetime) AND CAST(event_date as datetime) < datetime '2022-08-11 10:00:20.205737' ORDER BY event_date LIMIT 100
CRITICAL 400 Cannot access field key on a value with type ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>> at [1:48]
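
As a workaround run outside the tap, BigQuery requires repeated RECORD fields such as event_params to be UNNESTed before their sub-fields can be selected; a hedged sketch of the kind of query the tap would need to generate:

from google.cloud import bigquery

# UNNEST the repeated RECORD so its sub-fields become selectable columns.
query = """
SELECT
  event_timestamp,
  event_name,
  ep.key AS event_params_key,
  ep.value.string_value AS event_params_string_value,
  ep.value.int_value AS event_params_int_value,
  traffic_source.name AS traffic_source_name,
  traffic_source.medium AS traffic_source_medium,
  traffic_source.source AS traffic_source_source
FROM `orchest-ga4-data.analytics_257428027.events_20220810`,
     UNNEST(event_params) AS ep
LIMIT 100
"""
for row in bigquery.Client().query(query).result():
    print(dict(row))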

Schema row does not contain properties

This breaks many targets, including target-snowflake and Anelen's implementation of tap_bigquery.

Currently:

{"type": "SCHEMA", "stream": "pardot_export", "schema": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "key_properties": []}

And it should be:

{"type": "SCHEMA", "stream": "pardot_export", "schema": {"properties": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "selected": true, "type": "object"}, "key_properties": []}

start_datetime argument required

--start_datetime is a required argument.

Running this

~/.venvs/tap-bigquery/bin/tap-bigquery \
    -c ~/singer.io/tap_bq_systemchecker/tap_config.json \
    --catalog ~/singer.io/tap_bq_systemchecker/catalog.json \
    | ~/.venvs/pipelinewise-target-snowflake/bin/target-snowflake \
    -c ~/singer.io/pipelinewise_target_snowflake/systemchecker_config.json \
    >> ~/singer.io/tap_bq_systemchecker/state.json

returns this

(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
CRITICAL start_datetime not specified

It should be an optional argument.

datetime_key only allows datetime incremental watermarks

Problem
In tables that use any form of epoch time, a view must be applied to convert the epoch time into a datetime data type. Other methods of watermarking a table also use integers. For example, if a Postgres xmin field were carried into a BigQuery table, it would be an effective watermark for table activity.

In the case of epoch time, the only current solution I can think of is to create a view on the BigQuery table (a sketch follows).
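
A hedged sketch of that workaround, assuming a hypothetical table with an epoch-seconds column named updated_at_epoch:

from google.cloud import bigquery

# Create a view that exposes the epoch column as a DATETIME, which can then
# be used as the stream's datetime_key.
ddl = """
CREATE OR REPLACE VIEW `my-project.my_dataset.my_table_dt` AS
SELECT
  t.*,
  DATETIME(TIMESTAMP_SECONDS(t.updated_at_epoch)) AS updated_at
FROM `my-project.my_dataset.my_table` AS t
"""
bigquery.Client().query(ddl).result()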

Solution
Abstract datetime_key to allow integer watermarks as well.

New release

Hi, please could you create a new release for the functionality added in #18? Thank you 🙇‍♂️

Support `credentials_json` as option for full-text of `client_secrets.json`

In some runtimes, it is not feasible to load a physical client_secrets.json file - for instance in managed or ephemeral environments.

Could we add support for something like credentials_json as a JSON string of the contents of client_secrets.json?

Relevant code snippets:

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/target.py#L318-L322

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/core.py#L213-L222

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/core.py#L578-L584
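
A minimal sketch of what the requested option could look like in this tap (client_from_credentials_json and the credentials_json key are hypothetical names, not taken from the snippets above):

import json

from google.cloud import bigquery
from google.oauth2 import service_account

def client_from_credentials_json(config):
    # Parse the key file contents from a string in config rather than reading
    # a physical file from disk.
    info = json.loads(config["credentials_json"])
    creds = service_account.Credentials.from_service_account_info(info)
    return bigquery.Client(credentials=creds, project=creds.project_id)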
