azure / data-product-batch

Template to deploy a Data Product for batch data processing into a Data Landing Zone of the Data Management & Analytics Scenario (formerly Enterprise-Scale Analytics). The Data Product template can be used by cross-functional teams to ingest, provide, and create new data assets within the platform.

License: MIT License

Languages: Bicep 67.77%, Shell 23.00%, PowerShell 8.58%, Dockerfile 0.65%
Topics: arm, azure, architecture, data-platform, enterprise-scale, policy-driven, bicep, data-mesh, data-fabric, enterprise-scale-analytics

data-product-batch's Introduction

Cloud-scale Analytics Scenario - Data Product Batch

Objective

The Cloud-scale Analytics Scenario provides a prescriptive data platform design coupled with Azure best practices and design principles. These principles serve as a compass for subsequent design decisions across critical technical domains. The architecture will continue to evolve alongside the Azure platform and is ultimately driven by the various design decisions that organizations must make to define their Azure data journey.

The Cloud-scale Analytics Scenario architecture consists of two core building blocks:

  1. Data Management Landing Zone, which provides all data management and data governance capabilities for the data platform of an organization.
  2. Data Landing Zone, which is a logical construct and a unit of scale in the Cloud-scale Analytics architecture, enabling data retention and the execution of data workloads that generate insights and value from data.

The architecture is modular by design and allows organizations to start small with a single Data Management Landing Zone and Data Landing Zone, and then scale to a multi-subscription data platform environment by adding more Data Landing Zones. The reference design thereby supports different modern data platform patterns, such as data mesh and data fabric, as well as traditional data lake architectures. The Cloud-scale Analytics Scenario aligns closely with the data mesh approach and is well suited to helping organizations build data products and share them across the business units of an organization. If the core recommendations are followed, the resulting target architecture will put the customer on a path to sustainable scale.

Cloud-scale Analytics


The Cloud-scale Analytics Scenario architecture represents the strategic design path and target technical state for your Azure data platform.


This repository describes a Data Product template for batch data processing that can also be used for integrating batch data into the Azure data platform. Data Products are another unit of scale inside a Data Landing Zone, realized as Resource Groups. Resource Groups inside the Data Landing Zone subscription are created and handed over to cross-functional teams, giving them an environment in which they can work on their own data use-cases. Ownership of the resource group and operation of the services within it is handed over to the Data Product teams, and to enable self-service the owning teams are free to deploy their own services within the guardrails set by Azure Policy. Repository templates allow these teams to scale more quickly within an organization and to roll out common data analysis patterns not just once but multiple times across various use-cases. Ownership of the templates is also handed over, which gives these teams a starting point while allowing them to enhance the template based on their specific requirements. This Data Product template deploys a set of services that can be used for batch data processing and integration, including Azure Synapse, a SQL Server, and Data Factory. The Data Product teams can then leverage these tools to generate insights and value from data.

Note: Before getting started with the deployment, please make sure you are familiar with the complementary documentation in the Cloud Adoption Framework. Also, before deploying your first Data Product, please make sure that you have deployed a Data Management Landing Zone and at least one Data Landing Zone. The minimal recommended setup consists of a single Data Management Landing Zone and a single Data Landing Zone.

Deploy Cloud-scale Analytics Scenario

The Cloud-scale Analytics architecture is modular by design and allows customers to start with a small footprint and grow over time. To avoid a migration project later, customers should decide upfront how they want to organize data domains across Data Landing Zones. All Cloud-scale Analytics architecture building blocks can be deployed through the Azure Portal as well as through GitHub Actions workflows and Azure DevOps pipelines. The template repositories contain sample YAML pipelines to help you get started with setting up the environments.
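For teams deploying from the command line instead, a minimal sketch using the Azure CLI is shown below; the template and parameter file names are placeholders, not the repository's actual entry points:

  # Hedged sketch: deploy a Bicep template into a resource group of a
  # Data Landing Zone subscription. File names below are placeholders.
  az account set --subscription "<data-landing-zone-subscription-id>"
  az deployment group create \
    --resource-group "<data-product-resource-group>" \
    --template-file main.bicep \
    --parameters @params.dev.json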

Reference implementation | Description | Links
Cloud-scale Analytics Scenario | Deploys a Data Management Landing Zone and one or multiple Data Landing Zones all at once. Provides fewer options than the individual Data Management Landing Zone and Data Landing Zone deployments, but helps you get started quickly and familiarize yourself with the reference design. For more advanced scenarios, please deploy the artifacts individually. | Deploy To Azure
Data Management Landing Zone | Deploys a single Data Management Landing Zone to a subscription. | Deploy To Azure, Repository
Data Landing Zone | Deploys a single Data Landing Zone to a subscription. Please deploy a Data Management Landing Zone first. | Deploy To Azure, Repository
Data Product Batch | Deploys a Data Workload template for Data Batch Analysis to a resource group inside a Data Landing Zone. Please deploy a Data Management Landing Zone and Data Landing Zone first. | Deploy To Azure, Repository
Data Product Streaming | Deploys a Data Workload template for Data Streaming Analysis to a resource group inside a Data Landing Zone. Please deploy a Data Management Landing Zone and Data Landing Zone first. | Deploy To Azure, Repository
Data Product Analytics | Deploys a Data Workload template for Data Analytics and Data Science to a resource group inside a Data Landing Zone. Please deploy a Data Management Landing Zone and Data Landing Zone first. | Deploy To Azure, Repository

Deploy Data Product

To deploy the Data Product into your Data Landing Zone, please follow the step-by-step instructions:

  1. Prerequisites
  2. Create repository
  3. Setting up Service Principal (a minimal sketch follows this list)
  4. Template Deployment
    1. GitHub Action Deployment
    2. Azure DevOps Deployment
  5. Known Issues
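For the service principal step, a minimal sketch with the Azure CLI is shown below; the name and scope are placeholders, and your organization may prefer a narrower role assignment:

  # Hedged sketch: create a service principal for the deployment pipelines,
  # scoped to the Data Product resource group. Names are placeholders.
  az ad sp create-for-rbac \
    --name "<data-product-deployment-sp>" \
    --role "Contributor" \
    --scopes "/subscriptions/<subscription-id>/resourceGroups/<data-product-rg>"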

Contributing

Please review the Contributor's Guide for more information on how to contribute to this project via Issue Reports and Pull Requests.

data-product-batch's People

Contributors

abdale, amanjeetsingh, analyticjeremy, esbran, hallihan, marvinbuss, mboswell, microsoftopensource, rocavalc, sudivate, xigyenge


data-product-batch's Issues

Clarity on SYNAPSE_STORAGE_ACCOUNT_NAME

I'm looking at my resource groups in the Data Landing Zone and am unsure which of them contains the Synapse storage account being referred to here (screenshot omitted).

It would be great if the documentation pointed out where this storage account lives.
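Until the docs clarify this, one hedged way to locate candidate accounts across the subscription with the Azure CLI (the name filter is only a guess at the naming convention):

  # Hedged sketch: list storage accounts whose names look Synapse-related,
  # together with the resource group that contains them.
  az storage account list \
    --query "[?contains(name, 'synapse')].{name:name, resourceGroup:resourceGroup}" \
    --output table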

Name Change

Please change all instances of "Enterprise-scale analytics and ai" to "Data Management and Analytics Scenario" to align with the marketing ask. This should happen for the one-click deployments and all documentation.

Add infobox in Portal

Describe the solution you'd like

  • Add an infobox in the Portal explaining that this template is used for integration and as a data product.

Clarity on subnet config

Hi,

For the subnet_id in the parameter update process, the docs say:

The subnet should be configured with privateEndpointNetworkPolicies and privateLinkServiceNetworkPolicies, as mentioned in the Prerequisites

In the prerequisites it says:

Access to a subnet with privateEndpointNetworkPolicies and privateLinkServiceNetworkPolicies set to disabled. The Data Landing Zone deployment already creates a few subnets with this configuration.

Would it make sense to have some instructions on how to go about doing this? Or is this step required at all? I think a little clarity around this would help avoid confusion. (A hedged sketch follows below.)

Thanks,
Hamood
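For reference, a minimal sketch of disabling both network policies on an existing subnet with the Azure CLI; note that flag names vary across CLI versions (newer releases use --private-endpoint-network-policies Disabled instead):

  # Hedged sketch: disable private endpoint and private link service network
  # policies on a subnet so private endpoints can be created in it.
  az network vnet subnet update \
    --resource-group "<network-resource-group>" \
    --vnet-name "<vnet-name>" \
    --name "<subnet-name>" \
    --disable-private-endpoint-network-policies true \
    --disable-private-link-service-network-policies true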

SQL Server Deployment Fails

Describe the bug
When deploying the data domain with the ADO pipeline, the "Deploy SQL Server 001" step fails with the following error:

##[error]PECsNotExistingToDenyPublicNetworkAccess: Unable to set Deny Public Network Access to Yes since there is no private endpoint enabled to access the server. Please set up private endpoints and retry the operation (https://docs.microsoft.com/azure/sql-database/sql-database-private-endpoint-overview#how-to-set-up-private-link-for-azure-sql-database).
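The error suggests an ordering problem: public network access is denied before any private endpoint exists for the server. As a hedged illustration, a private endpoint for the SQL server could be created first (all names below are placeholders):

  # Hedged sketch: create a private endpoint for the SQL server before
  # setting Deny Public Network Access. All names are placeholders.
  SQL_ID=$(az sql server show \
    --resource-group "<data-product-resource-group>" \
    --name "<sql-server-name>" --query id --output tsv)
  az network private-endpoint create \
    --resource-group "<data-product-resource-group>" \
    --name "<sqlserver-private-endpoint>" \
    --vnet-name "<vnet-name>" \
    --subnet "<privatelink-subnet>" \
    --private-connection-resource-id "$SQL_ID" \
    --group-id sqlServer \
    --connection-name "<connection-name>"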

Ability to limit which services are deployed

Describe the solution you'd like
When deploying a data product, I want to strip away services that don't apply to my needs.
For example, I may only need a Synapse workspace with its respective auxiliary services (e.g. Key Vault, storage container, etc.).

Under databases, allow me to select none.

Key Vault Is Deployed to the Wrong Subnet

Describe the bug
When the ADO deployment pipeline is executed, it fails at the "Deploy Key Vault 001" step with the following error:

##[error]PrivateEndpointCannotBeCreatedInSubnetThatHasNetworkPoliciesEnabled: Private endpoint /subscriptions/17588eb2-2943-461a-ab3f-00a3ceac3112/resourceGroups/jpddb-dd001/providers/Microsoft.Network/privateEndpoints/jpddb-keyvault001-private-endpoint cannot be created in a subnet /subscriptions/17588eb2-2943-461a-ab3f-00a3ceac3112/resourceGroups/jpdlz-network/providers/Microsoft.Network/virtualNetworks/jpdlz-vnet/subnets/jpdlz-dd001-subnet since it has private endpoint network policies enabled.

The "update parameters" process modifies the infra/KeyVault/params.keyVault001.json file so that the Key Vault will be deployed to the specified subnet. However, since Key Vault is using a private endpoint, it should probably go to the designated private endpoint subnet instead.

Options for fixes
These are some ideas that could be used to resolve this problem:

  • Add an additional input to the "update parameters" process that requires the user to provide the "privatelink" subnet ID in addition to the regular subnet ID already provided.
  • Modify the configs/UpdateParameters.ps1 script so that the standard "privatelink" subnet ID is computed based on the regular subnet's ID. This would be easier for the user (one less piece of input to provide) but it will be built on the assumption that all of their subnets follow our standard naming convention.

In either case, once we know the "privatelink" subnet ID to use, we will modify the configs/config.json file so that the "privatelink" subnet ID is applied where appropriate.
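A hedged sketch of the second option, deriving the "privatelink" subnet ID from the regular subnet ID; the suffix rewrite below is purely illustrative and not the actual convention used by configs/UpdateParameters.ps1:

  # Hedged sketch: derive a privatelink subnet ID from the regular subnet ID,
  # assuming sibling subnets in the same virtual network follow a fixed
  # naming convention. Purely illustrative.
  SUBNET_ID="/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<prefix>-dd001-subnet"
  VNET_ID="${SUBNET_ID%/subnets/*}"
  PRIVATELINK_SUBNET_ID="${VNET_ID}/subnets/<prefix>-privatelink-subnet"
  echo "${PRIVATELINK_SUBNET_ID}"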

Guide user when selecting storage container

Discussed in #75

Originally posted by renepajta August 2, 2021
Hi all,

I am deploying the Batch data product. At the Synapse configuration step, I am struggling to decide which container of the Enriched/Curated data lake to choose. Should I create a new container per data product, or point to the main data lake container? (screenshot omitted)

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your data-domain repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your Azure GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/Azure/repos/data-domain/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @marvinbuss, @daltondhcp, @esbran

Cosmos DB optional when deploying Data Product Batch

Cosmos DB deployment fails due to resource constraints in Azure regions such as East US and Central US.

Error: Sorry, we are currently experiencing high demand in this region, and cannot fulfill your request at this time. We work continuously to bring more and more capacity online, and encourage you to try again shortly. Please do not hesitate to contact us via Azure support at any time or for any reason using this link http://aka.ms/azuresupport.

Feature request is to make Cosmos DB deployment optional for Data Product Batch.

Improvement: Optional resources cause errors in the deployment

Feature or Idea - What?

There are optional resources, such as Cosmos DB or whichever of ADF/Synapse was not chosen, on whose module outputs the new logging elements depend. I believe that is what causes errors when deploying the Batch product straight out of the box.

It would be a nice addition to make those logging parts treat the optional resources as optional.

Feature or Idea - Why?

To clean up the code and avoid raising unnecessary errors.


Data domain batch documentation

The Data Domain Batch documentation talks about stream processing, which is a bit confusing; see below (likely a copy-paste error from the Data Domain Streaming template):

This Data Domain template deploys a set of services, which can be used for data stream processing. The template includes a set of different services for processing data streams, which allows the teams to choose their tools based on their requirements and preferences.

Make some services optional during the deployment

Discussed in #76

Originally posted by renepajta August 2, 2021
At the moment, the data product (batch) deploys various components (Synapse, Cosmos DB, a SQL flavor) which may not necessarily be needed by the Data Product team.

Could we make the deployment of individual services optional?

Register RP for CosmosDB

Describe the bug
When deploying the Cosmos DB resource, the following error is received:

  ERROR: Deployment failed. Correlation ID: 05a9eba4-5d3f-4d06-9d01-5b64beeb1851. {
    "error": {
      "code": "MissingSubscriptionRegistration",
      "message": "The subscription is not registered to use namespace 'Microsoft.DocumentDB'. See https://aka.ms/rps-not-found for how to register subscriptions.",
      "details": [
        {
          "code": "MissingSubscriptionRegistration",
          "target": "Microsoft.DocumentDB",
          "message": "The subscription is not registered to use namespace 'Microsoft.DocumentDB'. See https://aka.ms/rps-not-found for how to register subscriptions."
        }
      ]
    }
  }
  Error: The process '/usr/bin/az' failed because one or more lines were written to the STDERR stream

The data domain subscription needs to have the Microsoft.DocumentDB resource provider registered.
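Registering the resource provider resolves this; a minimal sketch with the Azure CLI:

  # Register the Microsoft.DocumentDB resource provider in the current
  # subscription, then check the registration state.
  az provider register --namespace Microsoft.DocumentDB
  az provider show --namespace Microsoft.DocumentDB \
    --query registrationState --output table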
