
tap-bigquery's Issues

Running tap without datetime_key?

Is it possible to use this tap without the datetime_key? I'm trying to sync a table where the date column is sometimes null but I still want to grab all of the rows.

Incorrect auth instructions

The instructions for generating the client secrets https://github.com/anelendata/tap-bigquery#step-1-activate-the-google-bigquery-api are very detailed (thanks!) but unfortunately it looks like OAuth is the wrong method to use here. When I tried to use it, first I was hit by gh-22, then I got an error:

expected one of ('authorized_user', 'service_account', 'external_account', 'impersonated_service_account', 'gdch_service_account')

Creating a service account, as explained in target-bigquery https://github.com/adswerve/target-bigquery#step-2-authenticate-with-a-service-account, worked 👍🏽
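
For anyone hitting the same error, a minimal sanity check (a sketch, not part of the tap) that a service-account key file authenticates where the OAuth client secrets do not:

from google.cloud import bigquery
from google.oauth2 import service_account

# Assumes the service-account key was downloaded as client_secrets.json.
creds = service_account.Credentials.from_service_account_file("client_secrets.json")
client = bigquery.Client(credentials=creds, project=creds.project_id)
print(list(client.query("SELECT 1 AS ok").result()))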

Pin numpy to "<2"

Description

Using meltano==3.4.0 and installing the tap as defined in the MeltanoLabs page here results in an error (verbose output truncated):

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
ImportError: numpy.core.multiarray failed to import

How to reproduce

Using:

  • poetry 1.8.3
  • meltano==3.4.0
  • meltanolabs-tap-bigquery==0.0.1

Steps:

  1. Initialize a pyproject (I do not think using pyproject is really relevant here, any dependency will do) and install meltano==3.4.0
  2. Add the bq tap as described in the meltano labs page meltano add extractor tap-bigquery
  3. Create a valid config and run the command meltano config tap-bigquery test

Current work around

This workaround requires manual intervention and is a bit hacky. Go into the virtual environment where Meltano installed the tap and downgrade numpy:

  1. pip uninstall numpy
  2. pip install "numpy<2"
  3. Run meltano config tap-bigquery test again and it should work.

Proposed solution

Pin numpy to < 2 on the main branch.

Remove or find a workaround to authenticate using SAs

Right now you are forced to have the GOOGLE_APPLICATION_CREDENTIALS variable set, as it is also set by default to $MELTANO_PROJECT_ROOT/client_secrets.json, as specified here.

The issue with this is that if you want to authenticate without using keys (whether an actual JSON file or a string representation of one), you can't. The bigquery.Client call checks whether this variable is set and then tries to authenticate using the file the variable points to, as specified here.

You might want to do this when running your application in Kubernetes and using service accounts to authenticate, which is somewhat safer because the machine running the code does not need to hold any kind of credentials (this is the particular case blocking us right now). A more detailed explanation by Google can be found here:

Many Google Cloud services let you attach a service account that can be used to provide credentials for accessing Google Cloud APIs. If ADC does not find credentials it can use in either the GOOGLE_APPLICATION_CREDENTIALS environment variable or the well-known location for local ADC credentials, it uses the metadata server to get credentials for the service where the code is running.

I am not sure what the best approach is here; removing the default value might be a breaking change, so maybe add an additional flag like AUTHENTICATE_WITH_SA that simply removes the variable using os.environ.pop, as sketched below.
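
A minimal sketch of what such a flag could look like, assuming a hypothetical authenticate_with_sa config key (nothing here is in the tap today):

import os

from google.cloud import bigquery

def get_client(config):
    # Hypothetical flag: when set, drop the key-file variable so Application
    # Default Credentials fall back to the attached service account via the
    # metadata server instead of looking for a local key file.
    if config.get("authenticate_with_sa"):
        os.environ.pop("GOOGLE_APPLICATION_CREDENTIALS", None)
    return bigquery.Client()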

Looking for feedback from the owner and maintainers, and also checking whether this repo is active.

when available use state.json bookmark instead of --start_datetime argument

--start_datetime is a required argument.

Currently when it is provided, it overrides the last_update property value from state.json

{"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}
(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
INFO Running query:
    SELECT hash_id,processor,processor_cores,string_created_at,created_at,primary_resolution,memory,os,email,browser,netspeed,record_source FROM `dxdiag-aa08f.unioned.vw_system_checker` WHERE 1=1 AND datetime '2020-01-01 00:00:00.000000' <= CAST(created_at as datetime) AND CAST(created_at as datetime) < datetime '2020-09-09 07:07:52.550010' ORDER BY created_at
time=2020-09-09 07:07:54 name=target_snowflake level=INFO message=Table 'systemchecker_singer."SYSTEM_CHECKER"' does not exist. Creating...
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 9147, "tags": {"endpoint": "system_checker"}}
INFO Syncing stream:system_checker
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Uploading 9147 rows to external snowflake stage on S3
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Target S3 bucket: cfsnowflakestage, local file: /tmp/records_b51fywzm.csv, S3 key: prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv
time=2020-09-09 07:08:00 name=target_snowflake level=INFO message=Loading 9147 rows into 'systemchecker_singer."SYSTEM_CHECKER"'
time=2020-09-09 07:08:03 name=target_snowflake level=INFO message=Loading into systemchecker_singer."SYSTEM_CHECKER": {"inserts": 0, "updates": 9147, "size_bytes": 500703}
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Deleting prod/singer/pipelinewise_system_checker_20200909-070800-290493.csv from external snowflake stage on S3
time=2020-09-09 07:08:04 name=target_snowflake level=INFO message=Emitting state {"currently_syncing": "system_checker", "bookmarks": {"system_checker": {"last_update": "2020-09-01T13:33:44.712431+00:00"}}}

If --start_datetime continues to be required, a value in state.json should override the --start_datetime argument value; a sketch of the desired resolution order follows.
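
A minimal sketch of that behaviour, using hypothetical helper and argument names (resolve_start_datetime is not part of the tap):

def resolve_start_datetime(args, state, stream_name):
    # Prefer the bookmark persisted in state.json; fall back to --start_datetime
    # only when no bookmark exists yet (e.g. on the very first run).
    bookmark = (
        state.get("bookmarks", {})
             .get(stream_name, {})
             .get("last_update")
    )
    return bookmark or args.start_datetime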

Path to `client_secrets.json` cannot be specified via `config.json`

By convention:

The config file contains whatever parameters the Tap needs in order to pull data from the source. Typically this will include the credentials for the API or data source.

Reliance on a required GOOGLE_APPLICATION_CREDENTIALS env var makes this tap an exception, complicating integration with singer-runner tooling.
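
A minimal sketch of the convention-friendly alternative, assuming a credentials_path key in config.json (hypothetical here, not a documented setting of this tap):

from google.cloud import bigquery
from google.oauth2 import service_account

def client_from_config(config):
    # Read the key file path from config.json instead of requiring the
    # GOOGLE_APPLICATION_CREDENTIALS environment variable to be set.
    creds = service_account.Credentials.from_service_account_file(
        config["credentials_path"]
    )
    return bigquery.Client(credentials=creds, project=creds.project_id)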

`Cannot access field key on a value with type ARRAY` when trying to select (record, repeated) fields

Trying to extract a table with (record, repeated) fields, coming from Google Analytics 4:

(Screenshot: the GA4 export table schema, showing event_params as a repeated RECORD field.)

using this config:

{
  "streams": [
    {
      "name": "analytics_257428027",
      "table": "orchest-ga4-data.analytics_257428027.events_20220810",
      "columns": [
        "event_timestamp",
        "event_name",
        "event_params.key",
        "event_params.value.string_value",
        "event_params.value.int_value",
        "traffic_source.name",
        "traffic_source.medium",
        "traffic_source.source"
      ],
      "datetime_key": "event_date"
    }
  ],
  "credentials_path": "/project-dir/meltano/client_secrets.json",
  "start_datetime": "2022-08-10T00:00:00Z",
  "start_always_inclusive": true
}

I'm getting this error:

 INFO Running query:
    SELECT event_timestamp,event_name,event_params.key,event_params.value.string_value,event_params.value.int_value,traffic_source.name,traffic_source.medium,traffic_source.source,event_date FROM orchest-ga4-data.analytics_257428027.events_20220810 WHERE 1=1 AND datetime '2022-08-10 00:00:00.000000' <= CAST(event_date as datetime) AND CAST(event_date as datetime) < datetime '2022-08-11 10:00:20.205737' ORDER BY event_date LIMIT 100
CRITICAL 400 Cannot access field key on a value with type ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>> at [1:48]
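
As a workaround run outside the tap, BigQuery requires repeated RECORD fields such as event_params to be UNNESTed before their sub-fields can be selected; a hedged sketch of the kind of query the tap would need to generate:

from google.cloud import bigquery

# UNNEST the repeated RECORD so its sub-fields become selectable columns.
query = """
SELECT
  event_timestamp,
  event_name,
  ep.key AS event_params_key,
  ep.value.string_value AS event_params_string_value,
  ep.value.int_value AS event_params_int_value,
  traffic_source.name AS traffic_source_name,
  traffic_source.medium AS traffic_source_medium,
  traffic_source.source AS traffic_source_source
FROM `orchest-ga4-data.analytics_257428027.events_20220810`,
     UNNEST(event_params) AS ep
LIMIT 100
"""
for row in bigquery.Client().query(query).result():
    print(dict(row))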

Schema row does not contain properties

This breaks many targets, including target-snowflake and Anelen's implementation of tap_bigquery.

Currently:

{"type": "SCHEMA", "stream": "pardot_export", "schema": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "key_properties": []}

And it should be:

{"type": "SCHEMA", "stream": "pardot_export", "schema": {"properties": {"email": {"type": ["null", "string"]}, "init": {"type": ["null", "integer"]}, "premium_plugin": {"type": ["null", "boolean"]}, "premium_plugins": {"type": ["null", "string"]}, "first_name": {"type": ["null", "string"]}, "last_name": {"type": ["null", "string"]}, "lead_source": {"type": ["null", "string"]}, "stats_updated_on": {"type": ["null", "string"]}, "_sdc_extracted_at": {"format": "date-time", "type": ["null", "string"]}, "_sdc_batched_at": {"format": "date-time", "type": ["null", "string"]}, "_etl_tstamp": {"inclusion": "automatic", "type": ["null", "number"]}}, "selected": true, "type": "object"}, "key_properties": []}

start_datetime argument required

--start_datetime is a required argument.

Running this

~/.venvs/tap-bigquery/bin/tap-bigquery \
    -c ~/singer.io/tap_bq_systemchecker/tap_config.json \
    --catalog ~/singer.io/tap_bq_systemchecker/catalog.json \
    | ~/.venvs/pipelinewise-target-snowflake/bin/target-snowflake \
    -c ~/singer.io/pipelinewise_target_snowflake/systemchecker_config.json \
    >> ~/singer.io/tap_bq_systemchecker/state.json

returns this

(tap-bigquery) ubuntu@ip-172-23-84-190:~$ ./scripts/systemchecker_bq.sh
CRITICAL start_datetime not specified

It should be an optional argument.

datetime_key only allows datetime incremental watermarks

Problem
In tables that use any form of epoch time, a view must be applied to convert the epoch time into a datetime data type. Other methods of watermarking a table also use integers. For example, if a Postgres xmin field were carried into a BigQuery table, it would be an effective watermark for table activity.

In the case of epoch time, the only current solution I can think of is to create a view on the BigQuery table (a sketch follows).
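
A hedged sketch of that workaround, assuming a hypothetical table with an epoch-seconds column named updated_at_epoch:

from google.cloud import bigquery

# Create a view that exposes the epoch column as a DATETIME, which can then
# be used as the stream's datetime_key.
ddl = """
CREATE OR REPLACE VIEW `my-project.my_dataset.my_table_dt` AS
SELECT
  t.*,
  DATETIME(TIMESTAMP_SECONDS(t.updated_at_epoch)) AS updated_at
FROM `my-project.my_dataset.my_table` AS t
"""
bigquery.Client().query(ddl).result()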

Solution
Abstract datetime_key to allow integer watermarks as well.

New release

Hi, please could you create a new release for the functionality added in #18? Thank you 🙇‍♂️

Support `credentials_json` as option for full-text of `client_secrets.json`

In some runtimes, it is not feasible to load a physical client_secrets.json file - for instance in managed or ephemeral environments.

Could we add support for something like credentials_json as a JSON string of the contents of client_secrets.json?

Relevant code snippets:

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/target.py#L318-L322

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/core.py#L213-L222

https://github.com/z3z1ma/target-bigquery/blob/2b771a26ae04ed34e7411b2bb3aa84c80ee9b141/target_bigquery/core.py#L578-L584
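
A minimal sketch of what the requested option could look like in this tap (client_from_credentials_json and the credentials_json key are hypothetical names, not taken from the snippets above):

import json

from google.cloud import bigquery
from google.oauth2 import service_account

def client_from_credentials_json(config):
    # Parse the key file contents from a string in config rather than reading
    # a physical file from disk.
    info = json.loads(config["credentials_json"])
    creds = service_account.Credentials.from_service_account_info(info)
    return bigquery.Client(credentials=creds, project=creds.project_id)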
