datashare-toolkit's Introduction

Datashare Toolkit


DIY commercial datasets on Google Cloud Platform

This is not an officially supported Google product.

The Datashare Toolkit is a solution for data publishers to easily manage datasets residing within BigQuery. The toolkit includes functionality to ingest and entitle data, relieving consumers from much of the toil involved in onboarding datasets from a variety of providers. Publishers upload data files to a storage bucket and allocate permissioned datasets for their consumers to use with BigQuery authorized views.

While these tools are used for data management and entitlement, they follow a bring-your-own-license (BYOL) model for entitling publisher data. Hence, publishers should already have licensing arrangements in place for consumers wishing to access their data within GCP, and those consumers can furnish the GCP account IDs corresponding to their entitled user principals. These account IDs are required for the creation of the authorized views.

The toolkit is open-source. Some supporting infrastructure, such as storage buckets, serverless functions, and BigQuery datasets, must be maintained within GCP by publishers as a prerequisite. Once a consumer's GCP accounts are added to the publisher's entitlements, the published data can be queried directly within BigQuery, ready to integrate into an analytics workflow, machine learning model, or runtime application. Publishers are responsible for managing the limited supporting infrastructure. While consumers are billed for BigQuery compute and networking, publishers incur costs only for the storage of their data in BigQuery and Cloud Storage.
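For illustration only (this is not the toolkit's own code), a minimal sketch of how a publisher might grant an authorized view to a consumer with the @google-cloud/bigquery Node.js client; the project, dataset, view, and consumer email below are placeholders:

    // Sketch: share a view in a publisher dataset with a consumer's Google account.
    // Project, dataset, view, and email values are placeholders.
    const { BigQuery } = require('@google-cloud/bigquery');
    const bigquery = new BigQuery();

    async function authorizeViewForConsumer() {
      const sourceDataset = bigquery.dataset('source_data');  // holds the raw tables
      const sharedDataset = bigquery.dataset('shared_views'); // holds the authorized views

      // 1. Grant the consumer READER access on the dataset containing the view.
      const [sharedMeta] = await sharedDataset.getMetadata();
      sharedMeta.access.push({ role: 'READER', userByEmail: 'consumer@example.com' });
      await sharedDataset.setMetadata(sharedMeta);

      // 2. Authorize the view against the source dataset so it can read the tables.
      const [sourceMeta] = await sourceDataset.getMetadata();
      sourceMeta.access.push({
        view: { projectId: 'my-project', datasetId: 'shared_views', tableId: 'my_view' },
      });
      await sourceDataset.setMetadata(sourceMeta);
    }

    authorizeViewForConsumer().catch(console.error);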

Key Features

Getting started with Datashare

If you plan to use GCP Marketplace integration, the production project that you install and manage Datashare from must follow the required naming convention (punctuation and spaces not allowed): [yourcompanyname]-public.

  1. Install Datashare
  2. Initialize Schema

To get started, see the User Guide for usage information.

Requirements

Publishers

  • A GCP account with billing enabled
  • A Google Cloud Storage bucket to store staged data

Consumers

  • A valid Google Account or Google Group email address (which includes G Suite and Gmail email addresses).
    Note: Consumers can create a Google account with an existing email address here
  • Entitlements granted by the publisher to your specific licensed datasets

Architecture


Disclaimers

This is not an officially supported Google product.

Datashare is under active development. Interfaces and functionality may change at any time.

License

This repository is licensed under the Apache 2 license (see LICENSE).

Contributions are welcome. See CONTRIBUTING for more information.

datashare-toolkit's People

Contributors

dependabot[bot], heroichitesh, mservidio, phriscage, ramfordt, salsferrazza, swilliams11


datashare-toolkit's Issues

Destination tables not respecting nullable/required modes specified in schema.json

While interim tables reflect the column modes specified in a supplied schema.json, these modes do not make their way to the final table definition.

To Reproduce
Steps to reproduce the behavior:

  1. deploy function to your project
  2. copy last_sale config files from examples to <bucket>/bqds
  3. copy last_sale.csv to bucket as marketdata.last_sale.csv
  4. despite last_sale.schema.json having columns specified as REQUIRED, the ultimate destination table (marketdata.last_sale) shows all columns as being NULLABLE

Desired behavior is for all columns mapped directly to the destination table to inherit the mode specified in their schema definition.
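For reference, a hedged sketch (not the toolkit's ingestion code) of a BigQuery load job whose schema field modes carry through to the table it writes; bucket, dataset, table, and field names are placeholders:

    // Sketch: pass schema field modes (e.g. REQUIRED) through to a BigQuery load job.
    const { BigQuery } = require('@google-cloud/bigquery');
    const { Storage } = require('@google-cloud/storage');
    const bigquery = new BigQuery();
    const storage = new Storage();

    async function loadWithModes() {
      const [job] = await bigquery
        .dataset('marketdata')
        .table('last_sale')
        .load(storage.bucket('my-bucket').file('marketdata.last_sale.csv'), {
          sourceFormat: 'CSV',
          skipLeadingRows: 1,
          writeDisposition: 'WRITE_APPEND',
          schema: {
            fields: [
              { name: 'symbol', type: 'STRING', mode: 'REQUIRED' },
              { name: 'price', type: 'NUMERIC', mode: 'NULLABLE' },
            ],
          },
        });
      console.log(`Load job ${job.id} complete`);
    }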

Thanks for the report @michaelwsherman

Add destination datasetID creation in README

Describe the bug
Add destination datasetID creation instructions to the README. The Spot Fulfillment API requires a destination datasetID to execute BQ jobs; this is missing from the README. The default points to bqds_spot_fulfillment.

To Reproduce
Steps to reproduce the behavior:

  1. Go to api
  2. Run through install and launch
  3. Try to POST /fulfillmentRequests with wait: true
  4. See error

Expected behavior
no errors

Additional context
A configuration option will need to be added here and anywhere FULFILLMENT_CONFIG_DESTINATION_DATASET_ID should be referenced.

Add an initialization step for Datashare schema in setup documentation

Describe the bug
Add an initialization step for the Datashare schema in the setup documentation. This configuration step is currently missing from the README documentation.

Version
v0.3.0

Expected behavior
Add an initialization step for Datashare schema in setup documentation.

Additional context
Send a POST request to /projects/{projectId}/admin:initSchema after API is configured during setup.
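As an illustrative sketch only (assuming Node 18+ global fetch and bearer-token auth, which may differ from the API's actual auth scheme), the setup call could look like the following; the host, project ID, and token handling are placeholders:

    // Sketch: initialize the Datashare schema once the API is deployed.
    // API_HOST, PROJECT_ID, and the token handling are placeholders.
    const API_HOST = 'https://datashare-api.example.com';
    const PROJECT_ID = 'my-project';

    async function initSchema(idToken) {
      const res = await fetch(`${API_HOST}/projects/${PROJECT_ID}/admin:initSchema`, {
        method: 'POST',
        headers: { Authorization: `Bearer ${idToken}` },
      });
      if (!res.ok) {
        throw new Error(`initSchema failed: ${res.status} ${await res.text()}`);
      }
      console.log('Datashare schema initialized');
    }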

Add ability to instrument header rows from schema.json files

Right now the ingestion function hardcodes skipLeadingRows to 1 for BigQuery ingest jobs. This should be end-user configurable so that different data files can be handled accordingly.

Currently, data files submitted that do not have a header row may be ingested without the very first record.

Ideally, a configuration property within a table's .schema.json will propagate through to the BigQuery load job.
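A minimal sketch of that idea, assuming a hypothetical skipLeadingRows property in the table's .schema.json (the property name is illustrative, not an existing toolkit option):

    // Sketch: read an optional header-row count from the table's .schema.json and
    // apply it to the load job instead of hardcoding skipLeadingRows to 1.
    const fs = require('fs');

    function buildLoadMetadata(schemaPath) {
      const config = JSON.parse(fs.readFileSync(schemaPath, 'utf8'));
      return {
        sourceFormat: 'CSV',
        schema: { fields: config.fields },
        // Default to one header row when the config does not say otherwise.
        skipLeadingRows: config.skipLeadingRows !== undefined ? config.skipLeadingRows : 1,
      };
    }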

Review and polish Shared markdown documentation

Is your feature request related to a problem? Please describe.
The existing shared markdown documentation has references to BQDS that need to be removed, and the documentation needs to be validated.

Describe the solution you'd like
Review and polish Shared markdown documentation

spot fulfillment build failing with bqds-shared library

Describe the bug
The spot fulfillment Docker and Cloud Build builds are failing because of the bqds-shared library. Spot fulfillment declares it as a local dependency, and it is not available in the current build context.

To Reproduce
Steps to reproduce the behavior:

  1. Go to api
  2. Run through the Cloud Build instructions
  3. See error
RUN npm install --only=production:
npm ERR! code ENOLOCAL
npm ERR! Could not install from "../shared" as it does not contain a package.json file.

Expected behavior
Builds successfully and no errors

Screenshots
n/a

Desktop (please complete the following information):

  • OS: Darwin 18.7.0

Additional context
Reported by @ramfordt

API error when Datashare schema is not provisioned

Describe the bug
API error when the Datashare schema is not provisioned. When calling the /accounts resource, the API times out with a 503 error that is not propagated to the client. These are the server logs:

2020-05-29 15:26:23.526 EDT(node:1) UnhandledPromiseRejectionWarning: Error: Not found: Dataset agp-dj-001:datashare was not found in location US
2020-05-29 15:26:23.526 EDT at new ApiError (/shared/node_modules/@google-cloud/common/build/src/util.js:58:15)
2020-05-29 15:26:23.526 EDT at /shared/node_modules/@google-cloud/bigquery/build/src/bigquery.js:1066:23
2020-05-29 15:26:23.526 EDT at /shared/node_modules/@google-cloud/common/build/src/util.js:367:25
2020-05-29 15:26:23.526 EDT at Util.handleResp (/shared/node_modules/@google-cloud/common/build/src/util.js:144:9)
2020-05-29 15:26:23.526 EDT at /shared/node_modules/@google-cloud/common/build/src/util.js:432:22
2020-05-29 15:26:23.526 EDT at onResponse (/shared/node_modules/retry-request/index.js:206:7)
2020-05-29 15:26:23.526 EDT at /shared/node_modules/teeny-request/build/src/index.js:233:13
2020-05-29 15:26:23.526 EDT at processTicksAndRejections (internal/process/task_queues.js:85:5)
2020-05-29 15:26:23.526 EDT(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Deploy API
  2. Call /accounts endpoint without provisioning Datashare schema
  3. See error

Expected behavior
Propagate a descriptive error (schema does not exist) up to the client and handle it gracefully.
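A hedged sketch of what that handling could look like in an Express route; the route, dataset, and error message are illustrative, not the API's actual code:

    // Sketch: return a descriptive error instead of an unhandled rejection when the
    // Datashare dataset has not been provisioned yet.
    const express = require('express');
    const { BigQuery } = require('@google-cloud/bigquery');

    const app = express();
    const bigquery = new BigQuery();

    app.get('/accounts', async (req, res) => {
      try {
        const [rows] = await bigquery.query('SELECT * FROM `datashare.account`');
        res.json(rows);
      } catch (err) {
        if (err.code === 404 || /Not found: Dataset/.test(err.message)) {
          return res.status(400).json({
            error: 'Datashare schema is not provisioned. Run admin:initSchema first.',
          });
        }
        res.status(500).json({ error: err.message });
      }
    });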

Console errors after non-admin user authenticates

Describe the bug
Console errors after non-admin user authenticates. If the end-user's email address is not in the whitelist, the user is redirected back to the initial landing page with console errors below.

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Sign in with user not in whitelist
  2. Click on browser console debug
  3. See error

Expected behavior
If the end-user's email address is not in the whitelist, the user should be redirected back to the initial landing page without errors.

Desktop (please complete the following information):

  • Browser - Chrome incognito


Additional context
This might be fixed with new RBAC controls, but the UX should be validated regardless

Initial load of UI returns a blank page

Describe the bug
The initial load of the UI returns a blank page. After refreshing it loads successfully.

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
The UI loads successfully on 1st render.

Screenshots

    at Nt (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1259:478)
    at At (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1259:251)
    at t.instanceFactory (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1275:148)
    at t.getOrInitializeService (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:16005)
    at t.getImmediate (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:14525)
    at t.instanceFactory (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1911:2633)
    at t.getOrInitializeService (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:16005)
    at t.getImmediate (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:14525)
    at t._getService (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1588:176118)
    at t.<computed> [as analytics] (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1604:1190)

Desktop (please complete the following information):

  • Browser - Chrome Incognito



Update the Authorized Views endpoint logic with the CDS API design

Is your feature request related to a problem? Please describe.
Update the Authorized Views endpoint logic with the CDS API design

  • convert the existing auth view logic from the UI into the new CDS API format

Describe the solution you'd like
see above

Describe alternatives you've considered
n/a

Additional context
n/a

Exception message not printed out if it's not json in runTransform

In runTransform, the exception object is wrapped with JSON.stringify.

console.error("Exception encountered running transform: " + JSON.stringify(exception));

However, in cases where the exception isn't a JSON object but a string, this swallows the message and the proper message is not printed out. In my error case, switching to the following printed it out:

console.error(`Exception encountered running transform: ${exception}`);

Solution is probably to check the object type first and then format it.
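A sketch of that check (illustrative, not the merged fix):

    // Sketch: format the exception according to its type so string errors are not swallowed.
    function formatException(exception) {
      if (exception instanceof Error) {
        return exception.stack || exception.message;
      }
      if (typeof exception === 'object' && exception !== null) {
        return JSON.stringify(exception);
      }
      return String(exception);
    }

    // console.error('Exception encountered running transform: ' + formatException(exception));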

Add unit tests for ingestion module

Is your feature request related to a problem? Please describe.
Add unit tests for the ingestion module.

Describe the solution you'd like
Unit tests for ingestion and replace existing shell script integration tests.

Describe alternatives you've considered
N/A

Additional context
N/A
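For illustration only, a minimal mocha/chai-style skeleton of the kind of unit test this requests; the module path and the getDestinationTableName helper are hypothetical:

    // Sketch: a unit-test skeleton for an ingestion helper (hypothetical module and helper).
    const { expect } = require('chai');
    const { getDestinationTableName } = require('../ingestion/nameHelper');

    describe('ingestion name parsing', () => {
      it('derives dataset and table from the uploaded file name', () => {
        const result = getDestinationTableName('marketdata.last_sale.csv');
        expect(result).to.deep.equal({ dataset: 'marketdata', table: 'last_sale' });
      });
    });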

Add unit tests for the shared code library modules

Is your feature request related to a problem? Please describe.
Add unit tests for the shared code library modules.

Describe the solution you'd like
This is a continuation of the existing unit tests that were created in #68

Describe alternatives you've considered
N/A

Additional context
N/A

Add README.md with instructions for running /tests/bin/run.sh

Is your feature request related to a problem? Please describe.
This is motivated by:
https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/83afbb93847eb750ac788bee1bcf3a68f8e0b941/tests/bin/run.sh#L108

It's using the json binary, which seems to be: https://www.npmjs.com/package/json

After installing it, I was able to run the test bash.

I also noticed that:
https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/83afbb93847eb750ac788bee1bcf3a68f8e0b941/tests/bin/run.sh#L142

This seems to have been affected by the function folder rename.

Describe the solution you'd like
Create a README.md file at the /tests folder with instructions for setting up the environment.

Additional context
If it's open to contributions I would gladly submit a PR with the instructions.

Configure Entitlements Docker manifest for multi-stage builds

Is your feature request related to a problem? Please describe.
Configure Entitlements Docker manifest for multi-stage builds
The current entitlements-engine container is > 1.3GB. Update the Dockerfile to build from scratch and remove unnecessary libraries. This will reduce entitlements engine build and deploy time substantially.

Describe the solution you'd like
Use multi-stage Dockerfile builds; see https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/42--spot-fulfillment-api/api/v1alpha/Dockerfile as an example.

Describe alternatives you've considered
N/A

Additional context
N/A

The createdBy data model attribute should be in request header for API OAS

Describe the bug
The createdBy data model attribute should be in the request header for the API OAS. It is currently in the payload body. This applies to all incoming resource request payloads, but the data model response should still include the read-only attribute.

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'v1alpha/docs'
  2. Click on POST /policies and try to submit a new request
  3. See error

Expected behavior
Add the createdBy attribute to the x-gcp-account header in the API OAS definitions
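As a hedged sketch of the intended behavior (not the API's actual implementation), an Express handler could take createdBy from the header like this; the route and persistence step are placeholders:

    // Sketch: read createdBy from the x-gcp-account request header rather than the
    // request body, and echo it back as a read-only attribute.
    const express = require('express');
    const app = express();
    app.use(express.json());

    app.post('/policies', (req, res) => {
      const createdBy = req.header('x-gcp-account');
      if (!createdBy) {
        return res.status(400).json({ error: 'x-gcp-account header is required' });
      }
      const policy = { ...req.body, createdBy }; // body no longer carries createdBy
      // ... persist the policy ...
      res.status(201).json(policy);
    });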


Separate out multiple errors when errors occur during a request.

Is your feature request related to a problem? Please describe.
When attempting to ingest data, the actual error gets hidden behind the error message, "Error: Multiple errors occurred during the request. Please see the errors array for complete details.". When ingesting large data files, it is common to have multiple errors to address. But instead of seeing the errors in the error console, you see this message and have to dig into each one.

Also, once you fix the error, another error can emerge that's very different but ends up under the same message.

There's also a deeper issue here: BQ is likely to throw multiple errors during an ingestion job. If there are a few errors, BQ may "give up" on the import and throw that as an additional error, which isn't really useful since it's just an error caused by too many errors. This also guarantees that any kind of recurring error in an ingested table will produce the "multiple errors" error even when the true error is very different.

Describe the solution you'd like
I'm not entirely sure and I think there's a design question here. On one hand, reporting every error in the array separately could be useful, but that could also lead to a lot of "useless" errors. On the other hand, concatenating the entire error array could make problems easier to identify but could also flood the error console with a bunch of duplicate errors.

Describe alternatives you've considered
See above. It's also still manageable to look at the individual errors.

Additional context

Here's an example of a full set of errors I got that came under the "errors array" error:

Error: Multiple errors occurred during the request. Please see the errors array for complete details. 1. Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. 2. Error while reading data, error message: Could not parse '1.0' as int for field XXX(position 8) starting at location 28931951844 3. Error while reading data, error message: Could not parse '1.0' as int for field XXX (position 8) starting at location 25717290653 4. Error while reading data, error message: Could not parse '1.0' as int for field XXX (position 8) starting at location 32950278573 5. Error while reading data, error message: Could not parse '1.0' as int for field XXX (position 8) starting at location 36164939912 6. Error while reading data, error message: Could not parse '3.0' as int for field XXX (position 8) starting at location 27056732771

Here's another:

Error: Multiple errors occurred during the request. Please see the errors array for complete details. 1. Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1093; errors: 1. Please look into the errors[] collection for more details. 2. Error while reading data, error message: Required column value for column index: 11 is missing in row starting at position: 51485620192
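One possible direction, sketched below (not a committed design), is to log each entry of the load job's errors array individually:

    // Sketch: report each entry of a failed BigQuery load job's errors array separately,
    // instead of collapsing them into a single "multiple errors" message.
    function reportLoadErrors(job) {
      const errors = (job.status && job.status.errors) || [];
      errors.forEach((e, i) => {
        const location = e.location ? ` (location: ${e.location})` : '';
        console.error(`Load error ${i + 1}/${errors.length}: ${e.message}${location}`);
      });
    }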

Duplicate Search field placeholder text in multiple views

Describe the bug
There is a duplicate Search field placeholder text in multiple views of the UI. This impacts the following views:

  • datasets
  • views
  • accounts
  • policies

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Go to UI
  2. Click on 'Accounts'
  3. See error in Search Bar

Expected behavior
Only display one Search placeholder text


Desktop (please complete the following information):

  • Browser - Chrome incognito


Create spot fulfillment API service

Is your feature request related to a problem? Please describe.
Data consumers currently have the ability to query datasets in the BigQuery user interface or via the BigQuery APIs. There is no ability to service those requests with a filtered subset of data for spot fulfillment consumer requests.

Describe the solution you'd like
Provide an API service for data producers to expose the ability to spot fulfill data requests from their consumers.

  • Dataset attribute definitions
  • Query validation and filtering
  • Temporary bucket storage with signed urls
  • API consumable documentation
  • Packaged deployment

Describe alternatives you've considered
N/A

Additional context
The API service will interface with GCP services via a GCP IAM service account. The following technologies will be leveraged:

  • NodeJS and Express
  • GCP BigQuery, Storage, IAM
  • GCP GKE/Cloud Run
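To illustrate the "temporary bucket storage with signed urls" item above, a hedged sketch using @google-cloud/storage; the bucket and object names are placeholders, not the service's actual layout:

    // Sketch: hand a consumer a time-limited signed URL for a fulfillment extract
    // staged in a temporary bucket. Bucket and object names are placeholders.
    const { Storage } = require('@google-cloud/storage');
    const storage = new Storage();

    async function getExtractUrl() {
      const [url] = await storage
        .bucket('my-fulfillment-staging')
        .file('extracts/request-1234.csv')
        .getSignedUrl({
          version: 'v4',
          action: 'read',
          expires: Date.now() + 60 * 60 * 1000, // valid for one hour
        });
      return url;
    }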

Modify spot fulfillment api code to leverage shared library

Is your feature request related to a problem? Please describe.
Modify spot fulfillment api code to leverage shared library.

Describe the solution you'd like
This is a clean-up for the new shared library created in #68 and #67.

Describe alternatives you've considered
N/A

Additional context
N/A

Use dynamic random function name for integration test script

Is your feature request related to a problem? Please describe.
Only one integration test can run at a time now.

Describe the solution you'd like
Use a dynamic function name based on a randomly generated string.

Describe alternatives you've considered
Using separate projects per build.

Move credentials to environment variables

Is your feature request related to a problem? Please describe.
Move Firebase credentials to environment variables instead of settings.json. Even though the API key is public for an untrusted app, these dynamic variables could be deemed sensitive by customers and should be moved out; see #125 (comment).

Describe the solution you'd like
Move Firebase credentials to environment variables. These can be pulled in during development or via the UI settings page.
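A minimal sketch of that approach; the variable names are illustrative, not the project's actual configuration keys:

    // Sketch: assemble the Firebase web config from environment variables instead of
    // a checked-in settings.json. Variable names are illustrative.
    const firebaseConfig = {
      apiKey: process.env.FIREBASE_API_KEY,
      authDomain: process.env.FIREBASE_AUTH_DOMAIN,
      projectId: process.env.FIREBASE_PROJECT_ID,
      appId: process.env.FIREBASE_APP_ID,
    };

    module.exports = firebaseConfig;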

Specify node version in Entitlements Docker manifest

Is your feature request related to a problem? Please describe.
Specify the Node version in the Entitlements Docker manifest to reduce compatibility issues and dependency sprawl. Currently it is set to node:latest: https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/master/entitlements/bin/Dockerfile#L1

Describe the solution you'd like
Update Dockerfile to node:12.6 or node:12.6-alpine

Describe alternatives you've considered
N/A

Additional context
N/A

Optionally forestall duplicate ingestion

If the write disposition for an ingestion run is set to WRITE_APPEND, BigQuery will ingest the same file an arbitrary number of times, yielding duplicate records in the corresponding table which differ in their values for bqds_batch_id (which is shared by the entire ingestion iteration).

One possible way to avoid this duplication is to warn or fail when the same file is about to be uploaded twice, by searching existing batch records for the same file name as the one incoming from GCS.

Matching on the file name could end up being a relatively unreliable approach depending on a publisher's specific naming conventions. An alternative to reduce false positives would be to use an MD5 hash or similar on the entire inbound file, and then use that for inbound file validation. While not foolproof, this should give both publisher namespace flexibility with static evaluation and rejection of precise duplicate inbound files.
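A hedged sketch of the MD5 idea, comparing the inbound GCS object's md5Hash metadata against previously ingested files; the datashare.ingestion_history table and md5_hash column are hypothetical:

    // Sketch: treat a file as a duplicate when its GCS-computed MD5 matches one that
    // was already ingested. The datashare.ingestion_history table is hypothetical.
    const { Storage } = require('@google-cloud/storage');
    const { BigQuery } = require('@google-cloud/bigquery');

    const storage = new Storage();
    const bigquery = new BigQuery();

    async function isDuplicate(bucketName, fileName) {
      const [metadata] = await storage.bucket(bucketName).file(fileName).getMetadata();
      const md5 = metadata.md5Hash; // base64 MD5 computed by GCS on upload

      const [rows] = await bigquery.query({
        query: 'SELECT 1 FROM `datashare.ingestion_history` WHERE md5_hash = @md5 LIMIT 1',
        params: { md5 },
      });
      return rows.length > 0;
    }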

bigquery.datasets.get is required in the custom role definition

Describe the bug
bigquery.datasets.get is required in the custom role definition

To Reproduce
Steps to reproduce the behavior:

  1. Go to api
  2. Run through the Service Account setup, installation, and service launch
  3. See error in API POST fulfillmentRequests

Expected behavior
no error

Consumer definition is not accurate in main README

Describe the bug
The Consumer definition in Datashare's main README, https://github.com/GoogleCloudPlatform/datashare-toolkit#consumers, is not accurate.


Expected behavior
It should just reference https://cloud.google.com/bigquery/docs/updating-datasets, i.e.:

  • Google Account e-mail: grants an individual Google Account access to the dataset
  • Google Group: grants all members of a Google group access to the dataset

A Google Account is defined here (which includes G Suite accounts). You can create a Google account with an existing email address here.
