datashare-toolkit's Introduction

Datashare Toolkit


DIY commercial datasets on Google Cloud Platform

This is not an officially supported Google product.

The Datashare Toolkit is a solution for data publishers to easily manage datasets residing within BigQuery. The toolkit includes functionality to ingest and entitle data, relieving consumers from much of the toil involved in onboarding datasets from a variety of providers. Publishers upload data files to a storage bucket and allocate permissioned datasets for their consumers to use with BigQuery authorized views.

While these tools are used for data management and entitlement, they follow a bring-your-own-license (BYOL) model for entitling publisher data. Hence, publishers should already have licensing arrangements in place for consumers wishing to access their data within GCP, and those consumers can furnish the GCP account IDs corresponding to their entitled user principals. These account IDs are required for the creation of the authorized views.

The toolkit is open-source. Some supporting infrastructure, such as storage buckets, serverless functions, and BigQuery datasets, must be maintained within GCP by publishers as a prerequisite. Once a consumer's GCP accounts are added to the publisher's entitlements, the published data can be queried directly within BigQuery, ready to integrate into an analytics workflow, machine learning model, or runtime application. Publishers are responsible for managing the limited supporting infrastructure. While consumers are billed for BigQuery compute and networking, publishers incur costs only for the storage of their data in BigQuery and Cloud Storage.
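For illustration only (this is not the toolkit's own code), a minimal sketch of how a publisher might grant an authorized view to a consumer with the @google-cloud/bigquery Node.js client; the project, dataset, view, and consumer email below are placeholders:

    // Sketch: share a view in a publisher dataset with a consumer's Google account.
    // Project, dataset, view, and email values are placeholders.
    const { BigQuery } = require('@google-cloud/bigquery');
    const bigquery = new BigQuery();

    async function authorizeViewForConsumer() {
      const sourceDataset = bigquery.dataset('source_data');  // holds the raw tables
      const sharedDataset = bigquery.dataset('shared_views'); // holds the authorized views

      // 1. Grant the consumer READER access on the dataset containing the view.
      const [sharedMeta] = await sharedDataset.getMetadata();
      sharedMeta.access.push({ role: 'READER', userByEmail: 'consumer@example.com' });
      await sharedDataset.setMetadata(sharedMeta);

      // 2. Authorize the view against the source dataset so it can read the tables.
      const [sourceMeta] = await sourceDataset.getMetadata();
      sourceMeta.access.push({
        view: { projectId: 'my-project', datasetId: 'shared_views', tableId: 'my_view' },
      });
      await sourceDataset.setMetadata(sourceMeta);
    }

    authorizeViewForConsumer().catch(console.error);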

Key Features

Getting started with Datashare

If you plan to use GCP Marketplace integration, the production project that you install and manage Datashare from must follow the required naming convention (punctuation and spaces not allowed): [yourcompanyname]-public.

  1. Install Datashare
  2. Initialize Schema

To get started, see the User Guide for usage information.

Requirements

Publishers

  • A GCP account with billing enabled
  • A Google Cloud Storage bucket to store staged data

Consumers

  • A valid Google Account or Google Group email address (which includes G Suite and Gmail email addresses).
    Note: Consumers can create a Google account with an existing email address here
  • Entitlements granted by the publisher to your specific licensed datasets

Architecture


Disclaimers

This is not an officially supported Google product.

Datashare is under active development. Interfaces and functionality may change at any time.

License

This repository is licensed under the Apache 2 license (see LICENSE).

Contributions are welcome. See CONTRIBUTING for more information.

datashare-toolkit's People

Contributors

dependabot[bot], heroichitesh, mservidio, phriscage, ramfordt, salsferrazza, swilliams11


datashare-toolkit's Issues

Destination tables not respecting nullable/required modes specified in schema.json

While interim tables reflect the column modes specified in a supplied schema.json, these modes do not make their way to the final table definition.

To Reproduce
Steps to reproduce the behavior:

  1. deploy function to your project
  2. copy last_sale config files from examples to <bucket>/bqds
  3. copy last_sale.csv to bucket as marketdata.last_sale.csv
  4. despite last_sale.schema.json having columns specified as REQUIRED, the ultimate destination table (marketdata.last_sale) shows all columns as being NULLABLE

Desired behavior is for all columns mapped directly to the destination table to inherit the mode specified in their schema definition.
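For reference, a hedged sketch (not the toolkit's ingestion code) of a BigQuery load job whose schema field modes carry through to the table it writes; bucket, dataset, table, and field names are placeholders:

    // Sketch: pass schema field modes (e.g. REQUIRED) through to a BigQuery load job.
    const { BigQuery } = require('@google-cloud/bigquery');
    const { Storage } = require('@google-cloud/storage');
    const bigquery = new BigQuery();
    const storage = new Storage();

    async function loadWithModes() {
      const [job] = await bigquery
        .dataset('marketdata')
        .table('last_sale')
        .load(storage.bucket('my-bucket').file('marketdata.last_sale.csv'), {
          sourceFormat: 'CSV',
          skipLeadingRows: 1,
          writeDisposition: 'WRITE_APPEND',
          schema: {
            fields: [
              { name: 'symbol', type: 'STRING', mode: 'REQUIRED' },
              { name: 'price', type: 'NUMERIC', mode: 'NULLABLE' },
            ],
          },
        });
      console.log(`Load job ${job.id} complete`);
    }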

Thanks for the report @michaelwsherman

Add destination datasetID creation in README

Describe the bug
Add destination datasetID creation instructions to the README. The Spot Fulfillment API requires a destination datasetID to execute BQ jobs; this is missing from the README. The default points to bqds_spot_fulfillment.

To Reproduce
Steps to reproduce the behavior:

  1. Go to api
  2. Run through install and launch
  3. Try to POST /fulfillmentRequests with wait: true
  4. See error

Expected behavior
no errors

Additional context
A configuration option will need to be added here and anywhere FULFILLMENT_CONFIG_DESTINATION_DATASET_ID should be referenced.

Add an initialization step for Datashare schema in setup documentation

Describe the bug
Add an initialization step for the Datashare schema in the setup documentation. This configuration step is currently missing from the README documentation.

Version
v0.3.0

Expected behavior
Add an initialization step for Datashare schema in setup documentation.

Additional context
Send a POST request to /projects/{projectId}/admin:initSchema after API is configured during setup.
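As an illustrative sketch only (assuming Node 18+ global fetch and bearer-token auth, which may differ from the API's actual auth scheme), the setup call could look like the following; the host, project ID, and token handling are placeholders:

    // Sketch: initialize the Datashare schema once the API is deployed.
    // API_HOST, PROJECT_ID, and the token handling are placeholders.
    const API_HOST = 'https://datashare-api.example.com';
    const PROJECT_ID = 'my-project';

    async function initSchema(idToken) {
      const res = await fetch(`${API_HOST}/projects/${PROJECT_ID}/admin:initSchema`, {
        method: 'POST',
        headers: { Authorization: `Bearer ${idToken}` },
      });
      if (!res.ok) {
        throw new Error(`initSchema failed: ${res.status} ${await res.text()}`);
      }
      console.log('Datashare schema initialized');
    }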

Add ability to instrument header rows from schema.json files

Right now the ingestion function hardcodes skipLeadingRows to 1 for BigQuery ingest jobs. This should be end-user configurable so that different data files can be handled accordingly.

Currently, data files submitted that do not have a header row may be ingested without the very first record.

Ideally, a configuration property within a table's .schema.json will propagate through to the BigQuery load job.
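A minimal sketch of that idea, assuming a hypothetical skipLeadingRows property in the table's .schema.json (the property name is illustrative, not an existing toolkit option):

    // Sketch: read an optional header-row count from the table's .schema.json and
    // apply it to the load job instead of hardcoding skipLeadingRows to 1.
    const fs = require('fs');

    function buildLoadMetadata(schemaPath) {
      const config = JSON.parse(fs.readFileSync(schemaPath, 'utf8'));
      return {
        sourceFormat: 'CSV',
        schema: { fields: config.fields },
        // Default to one header row when the config does not say otherwise.
        skipLeadingRows: config.skipLeadingRows !== undefined ? config.skipLeadingRows : 1,
      };
    }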

Review and polish Shared markdown documentation

Is your feature request related to a problem? Please describe.
The existing shared markdown documentation has references to BQDS that need to be removed, and the documentation needs to be validated.

Describe the solution you'd like
Review and polish Shared markdown documentation

spot fulfillment build failing with bqds-shared library

Describe the bug
The spot fulfillment Docker and Cloud Build builds are failing because of the bqds-shared library. Spot fulfillment declares it as a local dependency, and it is not available in the current build context.

To Reproduce
Steps to reproduce the behavior:

  1. Go to api
  2. Run through the Cloud Build instructions
  3. See error
RUN npm install --only=production:
npm ERR! code ENOLOCAL
npm ERR! Could not install from "../shared" as it does not contain a package.json file.

Expected behavior
Builds successfully and no errors

Screenshots
n/a

Desktop (please complete the following information):

  • OS: Darwin 18.7.0

Additional context
Reported by @ramfordt

API error when Datashare schema is not provisioned

Describe the bug
API error when the Datashare schema is not provisioned. When calling the /accounts resource, the API times out with a 503 error that is not propagated to the client. These are the server logs:

2020-05-29 15:26:23.526 EDT(node:1) UnhandledPromiseRejectionWarning: Error: Not found: Dataset agp-dj-001:datashare was not found in location US
2020-05-29 15:26:23.526 EDT at new ApiError (/shared/node_modules/@google-cloud/common/build/src/util.js:58:15)
2020-05-29 15:26:23.526 EDT at /shared/node_modules/@google-cloud/bigquery/build/src/bigquery.js:1066:23
2020-05-29 15:26:23.526 EDT at /shared/node_modules/@google-cloud/common/build/src/util.js:367:25
2020-05-29 15:26:23.526 EDT at Util.handleResp (/shared/node_modules/@google-cloud/common/build/src/util.js:144:9)
2020-05-29 15:26:23.526 EDT at /shared/node_modules/@google-cloud/common/build/src/util.js:432:22
2020-05-29 15:26:23.526 EDT at onResponse (/shared/node_modules/retry-request/index.js:206:7)
2020-05-29 15:26:23.526 EDT at /shared/node_modules/teeny-request/build/src/index.js:233:13
2020-05-29 15:26:23.526 EDT at processTicksAndRejections (internal/process/task_queues.js:85:5)
2020-05-29 15:26:23.526 EDT(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Deploy API
  2. Call /accounts endpoint without provisioning Datashare schema
  3. See error

Expected behavior
Propagate a descriptive error (schema does not exist) up to the client and handle it gracefully.
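A hedged sketch of what that handling could look like in an Express route; the route, dataset, and error message are illustrative, not the API's actual code:

    // Sketch: return a descriptive error instead of an unhandled rejection when the
    // Datashare dataset has not been provisioned yet.
    const express = require('express');
    const { BigQuery } = require('@google-cloud/bigquery');

    const app = express();
    const bigquery = new BigQuery();

    app.get('/accounts', async (req, res) => {
      try {
        const [rows] = await bigquery.query('SELECT * FROM `datashare.account`');
        res.json(rows);
      } catch (err) {
        if (err.code === 404 || /Not found: Dataset/.test(err.message)) {
          return res.status(400).json({
            error: 'Datashare schema is not provisioned. Run admin:initSchema first.',
          });
        }
        res.status(500).json({ error: err.message });
      }
    });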

Console errors after non-admin user authenticates

Describe the bug
Console errors after non-admin user authenticates. If the end-user's email address is not in the whitelist, the user is redirected back to the initial landing page with console errors below.

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Sign in with user not in whitelist
  2. Click on browser console debug
  3. See error

Expected behavior
If the end-user's email address is not in the whitelist, the user should be redirected back to the initial landing page without errors.

Desktop (please complete the following information):

  • Browser - Chrome incognito


Additional context
This might be fixed with new RBAC controls, but the UX should be validated regardless

Initial load of UI returns a blank page

Describe the bug
The initial load of the UI returns a blank page. After refreshing it loads successfully.

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
The UI loads successfully on 1st render.

Screenshots

    at Nt (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1259:478)
    at At (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1259:251)
    at t.instanceFactory (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1275:148)
    at t.getOrInitializeService (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:16005)
    at t.getImmediate (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:14525)
    at t.instanceFactory (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1911:2633)
    at t.getOrInitializeService (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:16005)
    at t.getImmediate (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:746:14525)
    at t._getService (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1588:176118)
    at t.<computed> [as analytics] (https://cds-frontend-ui-74myzwomra-uc.a.run.app/js/chunk-vendors.680ed2bc.js:1604:1190)

Desktop (please complete the following information):

  • Browser - Chrome Incognito



Update the Authorized Views endpoint logic with the CDS API design

Is your feature request related to a problem? Please describe.
Update the Authorized Views endpoint logic with the CDS API design

  • convert the existing auth view logic from the UI into the new CDS API format

Describe the solution you'd like
see above

Describe alternatives you've considered
n/a

Additional context
n/a

Exception message not printed out if it's not json in runTransform

In runTransform, the exception object is wrapped with JSON.stringify.

console.error("Exception encountered running transform: " + JSON.stringify(exception));

However, in cases where the exception isn't a JSON object but a string, this swallows the message and the proper message is not printed out. In my error case, switching to the following printed it out:

console.error(`Exception encountered running transform: ${exception}`);

Solution is probably to check the object type first and then format it.
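A sketch of that check (illustrative, not the merged fix):

    // Sketch: format the exception according to its type so string errors are not swallowed.
    function formatException(exception) {
      if (exception instanceof Error) {
        return exception.stack || exception.message;
      }
      if (typeof exception === 'object' && exception !== null) {
        return JSON.stringify(exception);
      }
      return String(exception);
    }

    // console.error('Exception encountered running transform: ' + formatException(exception));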

Add unit tests for ingestion module

Is your feature request related to a problem? Please describe.
Add unit tests for the ingestion module.

Describe the solution you'd like
Unit tests for ingestion and replace existing shell script integration tests.

Describe alternatives you've considered
N/A

Additional context
N/A
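For illustration only, a minimal mocha/chai-style skeleton of the kind of unit test this requests; the module path and the getDestinationTableName helper are hypothetical:

    // Sketch: a unit-test skeleton for an ingestion helper (hypothetical module and helper).
    const { expect } = require('chai');
    const { getDestinationTableName } = require('../ingestion/nameHelper');

    describe('ingestion name parsing', () => {
      it('derives dataset and table from the uploaded file name', () => {
        const result = getDestinationTableName('marketdata.last_sale.csv');
        expect(result).to.deep.equal({ dataset: 'marketdata', table: 'last_sale' });
      });
    });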

Add unit tests for the shared code library modules

Is your feature request related to a problem? Please describe.
Add unit tests for the shared code library modules.

Describe the solution you'd like
This is a continuation of the existing unit tests that were created in #68

Describe alternatives you've considered
N/A

Additional context
N/A

Add README.md with instructions for running /tests/bin/run.sh

Is your feature request related to a problem? Please describe.
This is motivated by:
https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/83afbb93847eb750ac788bee1bcf3a68f8e0b941/tests/bin/run.sh#L108

It's using the json binary, which seems to be: https://www.npmjs.com/package/json

After installing it, I was able to run the test bash.

I also noticed that:
https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/83afbb93847eb750ac788bee1bcf3a68f8e0b941/tests/bin/run.sh#L142

This seems to have been affected by the function folder rename.

Describe the solution you'd like
Create a README.md file at the /tests folder with instructions for setting up the environment.

Additional context
If it's open to contributions I would gladly submit a PR with the instructions.

Configure Entitlements Docker manifest for multi-stage builds

Is your feature request related to a problem? Please describe.
Configure Entitlements Docker manifest for multi-stage builds
The current entitlements-engine container is > 1.3GB. Update the Dockerfile to build from scratch and remove unnecessary libraries. This will reduce entitlements engine build and deploy time substantially.

Describe the solution you'd like
Use multi-stage Dockerfile builds; see https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/42--spot-fulfillment-api/api/v1alpha/Dockerfile as an example.

Describe alternatives you've considered
N/A

Additional context
N/A

The createdBy data model attribute should be in request header for API OAS

Describe the bug
The createdBy data model attribute should be in the request header for the API OAS. It is currently in the payload body. This applies to all incoming resource request payloads, but the data model response should still include the read-only attribute.

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'v1alpha/docs'
  2. Click on POST /policies and try to submit a new request
  3. See error

Expected behavior
Add the createdBy attribute to the x-gcp-account header in the API OAS definitions
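As a hedged sketch of the intended behavior (not the API's actual implementation), an Express handler could take createdBy from the header like this; the route and persistence step are placeholders:

    // Sketch: read createdBy from the x-gcp-account request header rather than the
    // request body, and echo it back as a read-only attribute.
    const express = require('express');
    const app = express();
    app.use(express.json());

    app.post('/policies', (req, res) => {
      const createdBy = req.header('x-gcp-account');
      if (!createdBy) {
        return res.status(400).json({ error: 'x-gcp-account header is required' });
      }
      const policy = { ...req.body, createdBy }; // body no longer carries createdBy
      // ... persist the policy ...
      res.status(201).json(policy);
    });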


Separate out multiple errors when errors occur during a request.

Is your feature request related to a problem? Please describe.
When attempting to ingest data, the actual error gets hidden behind the error message, "Error: Multiple errors occurred during the request. Please see the errors array for complete details.". When ingesting large data files, it is common to have multiple errors to address. But instead of seeing the errors in the error console, you see this message and have to dig into each one.

Also, once you fix the error, another error can emerge that's very different but ends up under the same message.

There's also a deeper issue here: BQ is likely to throw multiple errors during an ingestion job. If there are a few errors, BQ may "give up" on the import and throw that as an additional error, which isn't really useful since it's just an error caused by too many errors. This also guarantees that any kind of recurring error in an ingested table will produce the "multiple errors" error even when the true error is very different.

Describe the solution you'd like
I'm not entirely sure and I think there's a design question here. On one hand, reporting every error in the array separately could be useful, but that could also lead to a lot of "useless" errors. On the other hand, concatenating the entire error array could make problems easier to identify but could also flood the error console with a bunch of duplicate errors.

Describe alternatives you've considered
See above. It's also still manageable to look at the individual errors.

Additional context

Here's an example of a full set of errors I got that came under the "errors array" error:

Error: Multiple errors occurred during the request. Please see the errors array for complete details. 1. Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. 2. Error while reading data, error message: Could not parse '1.0' as int for field XXX(position 8) starting at location 28931951844 3. Error while reading data, error message: Could not parse '1.0' as int for field XXX (position 8) starting at location 25717290653 4. Error while reading data, error message: Could not parse '1.0' as int for field XXX (position 8) starting at location 32950278573 5. Error while reading data, error message: Could not parse '1.0' as int for field XXX (position 8) starting at location 36164939912 6. Error while reading data, error message: Could not parse '3.0' as int for field XXX (position 8) starting at location 27056732771

Here's another:

Error: Multiple errors occurred during the request. Please see the errors array for complete details. 1. Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1093; errors: 1. Please look into the errors[] collection for more details. 2. Error while reading data, error message: Required column value for column index: 11 is missing in row starting at position: 51485620192
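One possible direction, sketched below (not a committed design), is to log each entry of the load job's errors array individually:

    // Sketch: report each entry of a failed BigQuery load job's errors array separately,
    // instead of collapsing them into a single "multiple errors" message.
    function reportLoadErrors(job) {
      const errors = (job.status && job.status.errors) || [];
      errors.forEach((e, i) => {
        const location = e.location ? ` (location: ${e.location})` : '';
        console.error(`Load error ${i + 1}/${errors.length}: ${e.message}${location}`);
      });
    }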

Duplicate Search field placeholder text in multiple views

Describe the bug
There is a duplicate Search field placeholder text in multiple views of the UI. This impacts the following views:

  • datasets
  • views
  • accounts
  • policies

Version
v0.3.0

To Reproduce
Steps to reproduce the behavior:

  1. Go to UI
  2. Click on 'Accounts'
  3. See error in Search Bar

Expected behavior
Only display one Search placeholder text


Desktop (please complete the following information):

  • Browser - Chrome incognito


Create spot fulfillment API service

Is your feature request related to a problem? Please describe.
Data consumers currently have the ability to query datasets in the BigQuery user interface or via the BigQuery APIs. There is no ability to service those requests with a filtered subset of data for spot fulfillment consumer requests.

Describe the solution you'd like
Provide an API service for data producers to expose the ability to spot fulfill data requests from their consumers.

  • Dataset attribute definitions
  • Query validation and filtering
  • Temporary bucket storage with signed urls
  • API consumable documentation
  • Packaged deployment

Describe alternatives you've considered
N/A

Additional context
The API service will interface with GCP services via a GCP IAM service account. The following technologies will be leveraged:

  • NodeJS and Express
  • GCP BigQuery, Storage, IAM
  • GCP GKE/Cloud Run
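To illustrate the "temporary bucket storage with signed urls" item above, a hedged sketch using @google-cloud/storage; the bucket and object names are placeholders, not the service's actual layout:

    // Sketch: hand a consumer a time-limited signed URL for a fulfillment extract
    // staged in a temporary bucket. Bucket and object names are placeholders.
    const { Storage } = require('@google-cloud/storage');
    const storage = new Storage();

    async function getExtractUrl() {
      const [url] = await storage
        .bucket('my-fulfillment-staging')
        .file('extracts/request-1234.csv')
        .getSignedUrl({
          version: 'v4',
          action: 'read',
          expires: Date.now() + 60 * 60 * 1000, // valid for one hour
        });
      return url;
    }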

Modify spot fulfillment api code to leverage shared library

Is your feature request related to a problem? Please describe.
Modify spot fulfillment api code to leverage shared library.

Describe the solution you'd like
This is a clean-up for the new shared library created in #68 and #67.

Describe alternatives you've considered
N/A

Additional context
N/A

Use dynamic random function name for integration test script

Is your feature request related to a problem? Please describe.
Only one integration test can run at a time now.

Describe the solution you'd like
Use a dynamic function name based on a randomly generated string.

Describe alternatives you've considered
Using separate projects per build.

Move credentials to environment variables

Is your feature request related to a problem? Please describe.
Move Firebase credentials to environment variables instead of settings.json. Even though the API key is public for an untrusted app, these dynamic variables could be deemed sensitive by customers and should be moved out; see #125 (comment).

Describe the solution you'd like
Move Firebase credentials to environment variables. These can be pulled in during development or via the UI settings page.
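A minimal sketch of that approach; the variable names are illustrative, not the project's actual configuration keys:

    // Sketch: assemble the Firebase web config from environment variables instead of
    // a checked-in settings.json. Variable names are illustrative.
    const firebaseConfig = {
      apiKey: process.env.FIREBASE_API_KEY,
      authDomain: process.env.FIREBASE_AUTH_DOMAIN,
      projectId: process.env.FIREBASE_PROJECT_ID,
      appId: process.env.FIREBASE_APP_ID,
    };

    module.exports = firebaseConfig;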

Specify node version in Entitlements Docker manifest

Is your feature request related to a problem? Please describe.
Specify the Node version in the Entitlements Docker manifest to reduce compatibility issues and dependency sprawl. Currently it is set to node:latest: https://github.com/GoogleCloudPlatform/bq-datashare-toolkit/blob/master/entitlements/bin/Dockerfile#L1

Describe the solution you'd like
Update Dockerfile to node:12.6 or node:12.6-alpine

Describe alternatives you've considered
N/A

Additional context
N/A

Optionally forestall duplicate ingestion

If the write disposition for an ingestion run is set to WRITE_APPEND, BigQuery will ingest the same file an arbitrary number of times, yielding duplicate records in the corresponding table which differ in their values for bqds_batch_id (which is shared by the entire ingestion iteration).

One possible way to avoid this duplication is to warn or fail when the same file is about to be uploaded twice, by searching existing batch records for the same file name as the one incoming from GCS.

Matching on the file name could end up being a relatively unreliable approach depending on a publisher's specific naming conventions. An alternative to reduce false positives would be to use an MD5 hash or similar on the entire inbound file, and then use that for inbound file validation. While not foolproof, this should give both publisher namespace flexibility with static evaluation and rejection of precise duplicate inbound files.
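A hedged sketch of the MD5 idea, comparing the inbound GCS object's md5Hash metadata against previously ingested files; the datashare.ingestion_history table and md5_hash column are hypothetical:

    // Sketch: treat a file as a duplicate when its GCS-computed MD5 matches one that
    // was already ingested. The datashare.ingestion_history table is hypothetical.
    const { Storage } = require('@google-cloud/storage');
    const { BigQuery } = require('@google-cloud/bigquery');

    const storage = new Storage();
    const bigquery = new BigQuery();

    async function isDuplicate(bucketName, fileName) {
      const [metadata] = await storage.bucket(bucketName).file(fileName).getMetadata();
      const md5 = metadata.md5Hash; // base64 MD5 computed by GCS on upload

      const [rows] = await bigquery.query({
        query: 'SELECT 1 FROM `datashare.ingestion_history` WHERE md5_hash = @md5 LIMIT 1',
        params: { md5 },
      });
      return rows.length > 0;
    }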

bigquery.datasets.get is required in the custom role definition

Describe the bug
bigquery.datasets.get is required in the custom role definition

To Reproduce
Steps to reproduce the behavior:

  1. Go to api
  2. Run through the Service Account setup, installation, and service launch
  3. See error in API POST fulfillmentRequests

Expected behavior
no error

Consumer definition is not accurate in main README

Describe the bug
The Consumer definition in Datashare's main README, https://github.com/GoogleCloudPlatform/datashare-toolkit#consumers, is not accurate.


Expected behavior
It should just reference https://cloud.google.com/bigquery/docs/updating-datasets, i.e.:

  • Google Account e-mail: grants an individual Google Account access to the dataset
  • Google Group: grants all members of a Google group access to the dataset

A Google Account is defined here (which includes G Suite accounts). You can create a Google account with an existing email address here.
