Giter Site home page Giter Site logo

smartsyncgs's Introduction

smartSync - Consuming Shared Data in a Virtual Private Snowflake Account

Overview

Snowflake has an amazing feature called Secure Data Sharing. With Snowflake Secure Data Sharing, two account in the same cloud region with the same CSP (Cloud Service Provider) can share live data in an instant and secure way. Data Sharing is possible because of Snowflakes unique architecture, that separates storage and compute. Because of this architecture, a data provider can configure access to it's data by creating a share. Think of it as a collection of all the necessary metadata, for instance, names of shared objects, location of the data files, how to decrypt the files, and so on. However, data can only be shared between two Snowflake accounts the exist in the same Region and the same CSP. Sharing Data between accounts in different regions with the same CSP, or account with different CSPs (even in the same geographical region), require data to be replicated. Please review the documentation for more details.

By design, a Virtual Private Snowflake a VPS is considered its own region. For that reason, sharing data into a VPS Account requires the data from the provider side to be replicated into the VPS account. Then we can share the local copy of the dataset inside VPS by creating a local share.

Sync Step

As mentioned in the documentation above, a database created from a share can not be used as a source for replication. Only a database that is "local" to the current account can be replicated. Therefore, if we want to consume shared data in a VPS account, we first have to create a local copy of the shared dataset and then we can replicate that local copy into the VPS account. On the surface, creating a local copy seems to be very straight forward.

  • create a local table via a CTAS (CREATE TABLE AS) statement into a new schema in a new database
  • replicate that local database into the VPS account
  • share the now replicated database to as many account inside the VPS as you desire
  • run the whole process on a regular schedule

Though this will work, there are several challenges

  • how do we handle the process when there are hundreds or thousands of objects (tables/views)?
  • how do we handle bigger tables with 10th of GB of data and hundreds of millions or rows?
  • how do we handle consistency since it takes time to copy the share object by object?
  • how do we limit replication to the bare minimum since cross region / cross cloud data replication is costly?

SmartSync is a set of Snowflake Procedures that handle all of the above challenges.

  • Automate the copy process from source schema (share) to target schema (local)
  • Collect metadata information like (list of objects copied and their state (data as well as structure))
  • Analyze metadata information from previous step to limit data changes to a minimum and create execution plan
  • Execute execution plan from previous step
  • Collect metadata information again and compare metadata sets for differences (potential consistency problems)
  • Record metadata information (tables, performed actions, row counts, fingerprints) for auditibility

SmartCopy stores the data for the local copy in a schema with the name of the source schema appended by a version number. The version number consists of two 6 digit number that indicate the structural version (first 6 digits) and the data snapshot version (second 6 digits).

Replication Step

With the ability to create a local copy of a shared dataset, we can replicate the local copy into the VPS deployment via standard Snowflake replication. Details on how to setup replication can be found here.

Sharing Step

The local copy of the shared dataset can now be shared to consumer account inside the VPS. For that we have to create a set of secure views pointing to the new local copies of the shared dataset.

Implementation Interface

The whole process of

  1. creating a local copy
  2. sharing the replicated copy inside VPS

is supported via stored procedures Snowflake stored procedure

SP_SYNC_GS

This procedure creates a local copy (target database & schema) of all tables/views inside a shared database (source database and schema).

create or replace procedure SP_SYNC_GS(
   I_SRC_DB VARCHAR  -- Name of the source (shared) database
   ,I_TGT_DB VARCHAR -- Name of the target (local) database
   ,I_SCHEMA VARCHAR -- Name of the schema
)

SP_REFRESH_GS

This procedure creates a secure views based re-direction layer to the latest (or a specific) version of the replciated tables.

create or replace procedure SP_REFRESH_GS(
   I_TGT_DB VARCHAR               -- Name of the replicated (secondary) database
   ,I_SVW_DB VARCHAR              -- Name of the new shared database
   ,I_SCHEMA VARCHAR              -- Name of schema in the replicated (secondary) database
   ,I_SHARE VARCHAR               -- Name of the Share to be created/used
)

SP_COMPACT_GS

This procedure removes all previous version of the copied data leaving a maximum number of structural version and a maximum number of data snapshot versions.

create or replace procedure SP_COMPACT_GS(
   I_TGT_DB VARCHAR                 -- Name of the target (local) database
   ,I_TGT_SCHEMA VARCHAR            -- Name of the schema in the target (local) database
   ,I_MAX_SCHEMA_VERSIONS FLOAT     -- Maximum number of versions with different object structures
)

Setup

  1. Clone the SmartCopy repo (use the command below or any other way to clone the repo)
    git clone https://github.com/RobertFehrmann/smartSyncGS.git
    
  2. Create database and role to host stored procedures. Both steps require the AccountAdmin role (unless your current role has the necessary permissions.
    use role AccountAdmin;
    drop role if exists smart_sync_rl;
    drop database if exists smart_sync_db;
    drop warehouse if exists smart_sync_vwh;
    create role smart_sync_rl;
    grant create share on account to role smart_sync_rl;
    create database smart_sync_db;
    grant usage on database smart_sync_db to role smart_sync_rl;
    create warehouse smart_sync_vwh with 
       WAREHOUSE_SIZE = XSMALL 
       MAX_CLUSTER_COUNT = 1
       AUTO_SUSPEND = 1 
       AUTO_RESUME = TRUE;
    grant usage,operate,monitor on warehouse smart_sync_vwh to role smart_sync_rl;
    
  3. Grant smart_sync_role to the appropriate user (login). Replace <user> with the user you want to use for smart_copy. Generally speaking, this should be the user you are connected with right now. Note that you also could use the AccountAdmin role for all subsequent steps. That could be appropriate on a test or eval system but not for a production setup.
    use role AccountAdmin;
    grant role smart_sync_rl to user <user>;
    create schema smart_sync_db.metadata;
    grant usage on schema smart_sync_db.metadata to role smart_sync_rl;
    use database smart_sync_db;
    use schema smart_sync_db.metadata;
    
  4. Create all procedures from the metadata directory inside the cloned repo by loading each file into a worksheet and then clicking Run. Note: if you are getting an error message (SQL compilation error: parse ...), move the cursor to the end of the file, click into the window, and then click Run again). Then grant usage permissions on the created stored procs.
    use role AccountAdmin;
    grant usage on procedure smart_sync_db.metadata.sp_sync_gs(varchar,varchar,varchar) to role smart_sync_rl;
    grant usage on procedure smart_sync_db.metadata.sp_refresh_gs(varchar,varchar,varchar,varchar) to role smart_sync_rl;
    grant usage on procedure smart_sync_db.metadata.sp_compact_gs(varchar,varchar,float) to role smart_sync_rl;
    

Operations

The following steps need to be executed for every database

  1. Create the target (local) database, grant the necessary permission the role smart_sync_rl
    use role AccountAdmin;
    drop database if exists <local db>;
    create database <local database>;
    grant all on database <local db> to role smart_sync_rl with grant option;
    
  2. Create the source database from the share and grant the necessary permission the role smart_sync_rl
    use role AccountAdmin;
    drop database if exists <source db>;
    create database <source db> from share <provider account>.<source db>;
    grant imported privileges on database <source db> to role smart_sync_rl;
    
  3. Set Up Notifications View to notification table.
    use role smart_sync_rl;
    create schema <local db>.INTERNAL_<schema_name>_NOTIFICATIONS;
    create view <local db>.INTERNAL_<schema_name>_NOTIFICATIONS."--CRUX_NOTIFICATIONS--" 
       as select * from <fully qualitied crux notification table>;
    
  4. Run the sync command
    use role smart_sync_rl;
    call smart_sync_db.metadata.sp_sync_gs(<shared db>,<local db>,<schema>);
    
  5. Run the refresh command
    use role smart_sync_rl;
    call smart_sync_db.metadata.sp_refresh_gs(<local db>,<new shared db>,<schema>,<new share>);
    

smartsyncgs's People

Contributors

robertfehrmann avatar

Stargazers

 avatar

Watchers

James Cloos avatar  avatar

Forkers

kgtdbx

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.