
terraform-module-azure-datalake's Introduction

Terraform module Azure Data Lake

This is a module for Terraform that deploys a complete and opinionated data lake network on Microsoft Azure.


Components

  • Azure Data Factory for data ingestion from various sources
  • Azure Data Lake Storage gen2 containers to store data for the data lake layers
  • Azure Databricks to clean and transform the data
  • Azure Synapse Analytics to store presentation data
  • Azure CosmosDB to store metadata
  • Credentials and access management configured ready to go

This design is based on one of Microsoft's architecture patterns for an advanced analytics solution.

Microsoft Advanced Analytics pattern

It includes some additional changes that dataroots recommends:

  • Multiple storage containers to store every version of the data
  • Cosmos DB stores the metadata of the data, acting as a data catalog
  • Azure Analysis Services is not used for now, as some services might be replaced once Azure Synapse Analytics Workspace becomes generally available (GA)

Usage

module "azuredatalake" {
  source  = "datarootsio/azure-datalake/module"
  version = "~> 0.1" 

  data_lake_name = "example name"
  region         = "eastus2"

  storage_replication          = "ZRS"
  service_principal_end_date   = "2030-01-01T00:00:00Z"
  cosmosdb_consistency_level   = "Session"
}

Requirements

Supported environments

This module works on macOS and Linux. Windows is not supported as the module uses some Bash scripts to get around Terraform limitations.

Azure provider configuration

The azurerm provider (and any other providers the module relies on) have to be configured for authentication.

You can either log in through the Azure CLI, or set the authentication environment variables documented for each provider.
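
As an illustration, a minimal azurerm provider block could look like the sketch below; it assumes credentials come from az login or from the standard ARM_* environment variables (ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_SUBSCRIPTION_ID, ARM_TENANT_ID).

provider "azurerm" {
  # The empty features block is required by azurerm 2.x; authentication is
  # picked up from the Azure CLI session or the ARM_* environment variables.
  features {}
}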

Azure CLI

The module uses some workarounds for features that are not yet available in the Azure providers. Therefore, you need to be logged in to the Azure CLI as well. You can use either a user account or service principal authentication.

jq

The module uses jq to extract Databricks parameters during the deployment. Therefore, you need to have jq installed.

Contributing

Contributions to this repository are very welcome! Found a bug or do you have a suggestion? Please open an issue. Do you know how to fix it? Pull requests are welcome as well! To get you started faster, a Makefile is provided.

Make sure to install Terraform, Azure CLI, Go (for automated testing) and Make (optional, if you want to use the Makefile) on your computer. Install tflint to be able to run the linting.

  • Setup tools & dependencies: make tools
  • Format your code: make fmt
  • Linting: make lint
  • Run tests: make test (or go test -timeout 2h ./... without Make)

To run the automated tests, the environment variable ARM_SUBSCRIPTION_ID has to be set to your Azure subscription ID.

License

MIT license. Please see LICENSE for details.

terraform-module-azure-datalake's People

Contributors

dariustehrani, dependabot-preview[bot], sam-dumont, sdebruyn


terraform-module-azure-datalake's Issues

Databricks provider, Error thrown on terraform plan - Error: expected format to be one of [SOURCE], got DBC

When running terraform plan, terraform throws this error:

Error: expected format to be one of [SOURCE], got DBC

  on sample_data.tf line 31, in resource "databricks_notebook" "presentation":
  31: resource "databricks_notebook" "presentation" {

A temporary (but probably not correct) fix: in the resource mentioned in the error message above, change the line that reads:

format = "DBC"

to:

format = "SOURCE"

N.B.: Possibly related to issue #165
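
For context, the sketch below shows what the affected resource might look like with the workaround applied; the notebook content path and workspace path are placeholders, not the module's actual values.

resource "databricks_notebook" "presentation" {
  content  = filebase64("${path.module}/files/presentation.py") # placeholder file
  path     = "/Shared/presentation"                             # placeholder workspace path
  language = "PYTHON"
  format   = "SOURCE"                                           # instead of "DBC"
}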

Versioning

The module needs versioning before we can submit it to the Terraform Registry.

Obtaining Databricks credentials?

Question: I'm getting an error when running terraform plan, and I'm not sure where one obtains the Databricks credentials/tokens (in Azure?). Here's the error I am seeing:

Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

data.http.current_ip[0]: Refreshing state...
data.azurerm_client_config.current: Refreshing state...


Error: failed to get credentials from config file; error msg: Authentication is not configured for provider. Please configure it
through one of the following options:

  1. DATABRICKS_HOST + DATABRICKS_TOKEN environment variables.
  2. host + token provider arguments.
  3. Run databricks configure --token that will create /root/.databrickscfg file.

Please check https://docs.databricks.com/dev-tools/cli/index.html#set-up-authentication for details

on databricks.tf line 41, in provider "databricks":
41: provider "databricks" {
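
In case it helps, the sketch below shows the host + token option from that list; both values are placeholders that you would supply yourself, e.g. a personal access token generated in the Databricks workspace.

provider "databricks" {
  host  = "https://adb-1234567890123456.7.azuredatabricks.net" # placeholder workspace URL
  token = var.databricks_token                                 # hypothetical variable holding a PAT
}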

Installation requires jq

Heads up: installation using Terraform on Ubuntu 18.04 requires jq to be installed; otherwise it will fail mid-install and be difficult to roll back.

Destroying all resources fails because of the ARM template deployments

When running terraform destroy we get an error because some of the sample data is deployed with an ARM script. When destroying that resource, only the reference to the deployment is destroyed and not the resources themselves. As one of the deployed resources has a reference to a resource that is not in the ARM script, destroying currently fails.

It's probably a good idea to work on adding these resources to the official Terraform azurerm provider so that we don't have to use the ARM script anymore.

Databricks tags are replaced on each call to apply

Repro

  1. Run terraform apply for the complete module
  2. Run terraform plan

You'll notice that Terraform wants to update the Databricks tags.

At first I thought it was related to the azurerm provider: hashicorp/terraform-provider-azurerm#7109

Apparently it isn't, because I can't repro it with the code below:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "rg" {
  name     = "databricksrepro"
  location = "eastus2"
  tags = {
    Tag1 = "Value1"
    Tag2 = "Value2"
  }
}

resource "azurerm_databricks_workspace" "dbks" {
  location            = azurerm_resource_group.rg.location
  name                = "databricksrepro"
  resource_group_name = azurerm_resource_group.rg.name
  sku                 = "standard"
  tags                = azurerm_resource_group.rg.tags
}

Data Lake filesystems cannot be created (HTTP 403)

When running with high parallelism, it sometimes happens that the data lake filesystems cannot be created due to permission errors.

An example log is attached where I marked the stages that the deployment goes through and removed some extra headers to make it clearer.
tflogs.log

This is how TF provisions the environment in the attached log:

  1. Resource group is created (redacted)
  2. Starts creating storage account
  3. Storage account is created
  4. Giving the current user permission (Blob Data Owner)
  5. [PARALLEL] Giving the service principal permission (Blob Data Owner) - we don't care about this, the filesystems will be created by the current user, not the sp we created
  6. Success result on the Blob Data Owner role for the current user
  7. Starts creating the first filesystem
  8. First filesystem creation has failed with HTTP 403
  9. [PARALLEL] Starts creating the second filesystem
  10. [PARALLEL] Second filesystem creation has failed with HTTP 403
2020/05/14 19:20:30 [DEBUG] azurerm_storage_data_lake_gen2_filesystem.dlfs[1]: apply errored, but we're indicating that via the Error pointer rather than returning it: Error creating File System "fscleansedtfadl" in Storage Account "satfadl": datalakestore.Client#Create: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationPermissionMismatch" Message="This request is not authorized to perform this operation using this permission."

The issue never occurs with lower parallelism (1 or 2).

It looks like the role isn't really granted until after a few seconds.
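
One possible mitigation, assuming the failures are caused by role-assignment propagation delays, is to insert an explicit wait between the role assignment and the filesystem creation, e.g. with the hashicorp/time provider. The resource names below are illustrative, not the module's actual ones.

resource "time_sleep" "wait_for_role_propagation" {
  # Wait for the Blob Data Owner role assignment to propagate before
  # the filesystems are created.
  depends_on      = [azurerm_role_assignment.current_user_blob_owner] # illustrative name
  create_duration = "60s"
}

resource "azurerm_storage_data_lake_gen2_filesystem" "dlfs" {
  name               = "fscleansedtfadl"                     # illustrative
  storage_account_id = azurerm_storage_account.adls.id       # illustrative name
  depends_on         = [time_sleep.wait_for_role_propagation]
}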

Deploy the module in a separate virtual network

A lot of customers would probably prefer to deploy the module in a dedicated VNet. There is already some work that has been done on the feature/vnet branch. The current obstacle is the SQL script that has to be executed on the Synapse Analytics instance. The machine running Terraform would probably have to connect to the VNet.

Migration to Azure Synapse Analytics Workspace

Azure Synapse Analytics Workspace is currently in private preview and will bundle storage, SQL pools and a Spark cluster in one product. This will probably allow us to simplify our pipeline a lot while still offering the same flexibility.

We'll have to look into migrating as soon as the service becomes GA and available in the Terraform azurerm provider.

Password storage

Passwords are currently generated and used when needed, but they are not stored anywhere. Suggestions are welcome.
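
One possible direction, as a sketch rather than the module's current behaviour, would be to persist the generated passwords in an Azure Key Vault secret:

resource "random_password" "synapse_sql" {
  length  = 32
  special = true
}

resource "azurerm_key_vault_secret" "synapse_sql_password" {
  name         = "synapse-sql-admin-password" # illustrative name
  value        = random_password.synapse_sql.result
  key_vault_id = azurerm_key_vault.example.id # assumes a key vault managed elsewhere
}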

Error: rpc error: code = Unavailable desc = transport is closing

Strange: I keep getting this error:

null_resource.sql_init[0]: Still creating... [10s elapsed]
null_resource.sql_init[0] (local-exec): -1
azurerm_cosmosdb_sql_container.metadata: Still creating... [40s elapsed]
null_resource.sql_init[0]: Still creating... [20s elapsed]
null_resource.sql_init[0] (local-exec): -1
null_resource.sql_init[0]: Creation complete after 23s [id=7882090812782899888]
azurerm_cosmosdb_sql_container.metadata: Still creating... [50s elapsed]
azurerm_cosmosdb_sql_container.metadata: Still creating... [1m0s elapsed]
azurerm_cosmosdb_sql_container.metadata: Creation complete after 1m3s [id=/subscriptions/b970f71b-bc04-47c8-89cd-7e491b4656b2/resourceGroups/rgcdedlplay02/providers/Microsoft.DocumentDB/databaseAccounts/cmdbcdedlplay02/sqlDatabases/metadatadb/containers/metadata]

Error: rpc error: code = Unavailable desc = transport is closing

(the same error is repeated three more times)

Terraform 0.13 upgrade & Databricks compatibility

For now, this module is only compatible with Terraform 0.12.x and Databricks 0.2.4 (or a custom build of their latest master). Once Databricks 0.2.4 is out, upgrade both the Terraform and Databricks dependencies.
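
As a rough sketch of what the 0.13 upgrade would involve (source addresses and version constraints below are illustrative, not final):

terraform {
  required_version = ">= 0.13"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 2.0"
    }
    databricks = {
      source  = "databrickslabs/databricks" # illustrative source address
      version = ">= 0.2.4"
    }
  }
}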
