mmodarre / azuredatafactoryhol

Azure Data Factory Hands On Lab - Step by Step - A Comprehensive Azure Data Factory and Mapping Data Flow step by step tutorial

License: MIT License

azure-data-factory azure adf-pipeline filter-activity lookup-activity foreach-activity metadata-activity mapping-dataflows azure-key-vault etl-pipeline hands-on-lab hands-on-azure-data-factory azure-data-factory-tutorial azure-modern-data-warehous web-activity foreach-loop-activity

azuredatafactoryhol's Introduction

Note: This is a work in progress and any feedback and collaboration are really appreciated. New exercises will be added soon.

ELT with Azure Data Factory and Mapping Data Flows

Hands-on lab step-by-step

Feb 2020

Information in this document, including URL and other Internet Web site references, is subject to change without notice. Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place or event is intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

The names of manufacturers, products, or URLs are provided for informational purposes only and Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a manufacturer or product does not imply endorsement of Microsoft of the manufacturer or product. Links may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not responsible for the contents of any linked site or any link contained in a linked site, or any changes or updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission received from any linked site. Microsoft is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement of Microsoft of the site or the products contained therein.

© 2018 Microsoft Corporation. All rights reserved.

Microsoft and the trademarks listed at https://www.microsoft.com/en-us/legal/intellectualproperty/Trademarks/Usage/General.aspx are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.

Azure Data Factory hands-on lab

Abstract and learning objectives

In this workshop, you will deploy an end-to-end Azure ELT solution. The workshop uses Azure Data Factory (and Mapping Data Flows) to perform Extract, Load, Transform (ELT) with Azure Blob Storage and Azure SQL DB, Azure DevOps repositories for source control over ADF pipelines, and Azure DevOps pipelines to deploy across multiple environments, including Dev, Test and Production.

By attending this workshop, you will be better able to build a complete Azure Data Factory ELT pipeline. In addition, you will learn to:

  • Deploy Azure Data Factory including an Integration Runtime.

  • Build Mapping Dataflows in ADF.

  • Create Blob Storage and Azure SQLDB Linked Services.

  • Create Azure Key Vault and Linked Services in ADF.

  • Create parameterized ADF pipelines (a minimal sketch follows this list).

  • Install Azure Data Factory self-hosted integration runtime to ingest from on-premises data systems.

  • (In progress) Perform code-free Spark ELT using Azure Data Factory Mapping Dataflows.

  • (To do) Source control ADF pipelines.

  • (To do) CI/CD ADF pipelines and your ELT code.
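
To make the "parameterized pipeline" objective concrete, here is a minimal sketch of a pipeline definition that passes a pipeline parameter down to a dataset parameter inside a Copy activity. This is illustrative only; the pipeline, dataset, activity and parameter names are invented and are not the ones used in the lab.

{
  "name": "SmartFoodsOrderLinePipeline",
  "properties": {
    "parameters": {
      "runDate": { "type": "String", "defaultValue": "2020-02-01" }
    },
    "activities": [
      {
        "name": "CopyOrderLinesToBlob",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "SmartFoodsOrderLineCsv",
            "type": "DatasetReference",
            "parameters": {
              "fileName": "@concat('orderline_', pipeline().parameters.runDate, '.csv')"
            }
          }
        ],
        "outputs": [
          { "referenceName": "DataLakeRawCsv", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}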

This hands-on lab is designed to provide exposure to many of Microsoft’s transformative line of business applications built using Microsoft data and advanced analytics. The goal is to show an end-to-end solution, leveraging many of these technologies, but not necessarily doing work in every component possible. The lab architecture is below and includes:

  • Azure Data Factory (ADF)

  • Azure Storage

  • Azure Data Factory Mapping Dataflows

  • Azure SQL Database

  • Azure Key Vault

  • (optional) Azure DevOps

Overview

WideWorldImporters (WWI) imports a wide range of products, which it then resells to retailers and directly to the public. In an increasingly crowded market, they are always looking for ways to differentiate themselves and provide added value to their customers.

They are looking to pilot a data warehouse to provide additional information useful to their internal sales and marketing agents. They want to enable their agents to perform AS-IS and AS-WAS analysis in order to price the items more accurately and predict the product demand at different times during the year.

In addition, to extend its physical presence, WWI recently acquired a medium-sized supermarket business called SmartFoods, whose differentiating factor is its emphasis on providing very comprehensive information on food nutrients to customers so they can make health-wise decisions. SmartFoods runs its own loyalty program, through which customers can accumulate points on their purchases. The WWI CIO is hoping to use the loyalty program information and the SmartFoods food nutrients database to provide customers with a HealthSmart portal. The portal will show aggregated information on customers' important food nutrients (carbs, saturated fats, etc.) to promote healthy and smart food shopping.

In this hands-on lab, attendees will build an end-to-end solution that loads a data warehouse using a data lake methodology.

Solution architecture

Below is a diagram of the solution architecture you will build in this lab. Please study it carefully so you understand the solution as a whole before building the various components.

Data sources:

  1. SmartFoods REST API
     Type: REST API
     Authentication: OAuth2
     Data endpoints:
       1. Order line Transactions (CSV)
       2. Customers (JSON)
       3. Auth Token (JSON)
     Frequency: Daily
     Documentation: https://github.com/Mmodarre/retailDataGeneratorAzureFunction

  2. SmartFoods Items
     Type: On-premises local file system
     Authentication: N/A
     Data endpoints:
       1. Food (CSV)
       2. Food-Nutrition (CSV)
       3. Nutrition (CSV)
     Frequency: N/A (one-off)
     Documentation: N/A

  3. WWI OLTP
     Type: SFTP
     Authentication: Username/password
     Data endpoints:
       1. Orderline Transactions (Parquet)
       2. Orders Transactions (Parquet)
       3. Customers (Parquet)
     Frequency: Daily
     Documentation: N/A
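
For the WWI OLTP source above, the lab connects to SFTP with username/password credentials kept in Azure Key Vault (see the learning objectives). As a rough sketch only, an SFTP linked service that pulls its password from a Key Vault secret could look like the JSON below; the user name, Key Vault linked service name and secret name are placeholders, while the host name is the one referenced in the issues further down.

{
  "name": "WwiSftpLinkedService",
  "properties": {
    "type": "Sftp",
    "typeProperties": {
      "host": "adflabsftp.westus2.cloudapp.azure.com",
      "port": 22,
      "skipHostKeyValidation": true,
      "authenticationType": "Basic",
      "userName": "adflabuser",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "AdfLabKeyVault", "type": "LinkedServiceReference" },
        "secretName": "sftp-password"
      }
    }
  }
}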

Requirements

  1. Microsoft Azure subscription Free Trial or pay-as-you-go (Credit Card) or MSDN subscription.

  2. MS Windows development environment (only required for the Azure self-hosted IR; if you are using a Linux or macOS workstation you can achieve the same by running a Windows VM locally or in Azure)

  3. Azure Storage Explorer

Getting Started

Hands-on lab documents are located under the Lab-guide directory. Here is the list of labs available:

Azure Data Factory:

  1. Before_the_hands-on_lab_(Prepare_the_environment)

  2. Linked_Services_Datasets_and_Integration_Runtimes

  3. Copy_Activity_Parameters_Debug_and_Publishing

  4. Lookup_activity_ForEach_loop_and_Execute_Pipeline_activity

  5. Get_Metadata_activity_filter_activity_and_complex_expressions

  6. Self-hosted_Integration_Runtime__decompress_files_and_Delete_activity

Azure Data Factory Mapping Data Flows:

  1. SmartFoodsCustomerELT

  2. ELT_with_Mapping_Dataflows–Practice_excercises
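
Labs 4 and 5 in the Azure Data Factory list above combine the Lookup, ForEach, Get Metadata and Filter activities with pipeline expressions. As a rough illustration only (the activity and dataset names here are invented, not the ones used in the lab), a Get Metadata activity that lists the files in a folder, followed by a Filter activity that keeps only the CSV files, could look like this:

{
  "name": "GetFileList",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": { "referenceName": "SmartFoodsLandingFolder", "type": "DatasetReference" },
    "fieldList": [ "childItems" ]
  }
},
{
  "name": "FilterCsvFiles",
  "type": "Filter",
  "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
  "typeProperties": {
    "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
    "condition": { "value": "@endswith(item().name, '.csv')", "type": "Expression" }
  }
}

The filtered array can then be handed to a ForEach activity via @activity('FilterCsvFiles').output.value.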

azuredatafactoryhol's People

Contributors

mmodarre, rajesh-ca

azuredatafactoryhol's Issues

ADLSGen2 or Blob storage?

The High Level Architecture visual indicates Storage Account v2 (ADLS Gen2); however, the environment preparation steps describe a storage account without the Hierarchical namespace option selected (which is what would make it ADLS Gen2).

Reviewing the steps, it seems that only blob storage is being used and not ADLSGen2.

Timeout occurred while using integration runtime

I found during integration runtime setup that the region was selected as "Australia East".
When I tried using it later it timed out. After changing the region to "West US 2" it works fine.
This is just an observation; if you think it might be due to other reasons, you can skip it.


Unable to connect to adflabsftp.westus2.cloudapp.azure.com

In Lab 1, under create SFTP Linked Service, I'm unable to connect to adflabsftp.westus2.cloudapp.azure.com although I followed the steps as mentioned in the lab. After creating the SFTP password in Key Vault and the linked service, when I tested the connection, I get the error - Connection Failed.

Misspell in the prep steps

In the "Prepare the environment" steps, "Create Blob Storage Containers for the Data Warehouse Output and staging" - container name is misspelt - wwidatawarhouse in text vs. wwidatawarehouse in screenshot

Error on Copy activity for API JSON to Blob

https://github.com/Mmodarre/AzureDataFactoryHOL/blob/master/Lab-guide/03-Copy_Activity_Parameters_Debug_and_Publishing/README.md

I was able to follow all the way through until the last copy activity, where it fails
with this error:
{
"errorCode": "2200",
"message": "ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: One or more errors occurred.,Source=Microsoft.DataTransfer.ClientLibrary,'",
"failureType": "UserError",
"target": "SmartFoodsCustomersToBlob",
"details": []
}


I should mention that accessing the source data via the API is not a problem; I verified that with a Postman call.
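
This error usually points at the schema mapping between the hierarchical JSON response and the tabular sink of the Copy activity. As a hedged illustration only (the collection path and property names below are invented and are not the lab's actual mapping), an explicit TabularTranslator with a collectionReference sits inside the Copy activity's typeProperties like this:

"translator": {
  "type": "TabularTranslator",
  "collectionReference": "$['customers']",
  "mappings": [
    { "source": { "path": "['customerId']" }, "sink": { "name": "customer_id" } },
    { "source": { "path": "['firstName']" }, "sink": { "name": "first_name" } },
    { "source": { "path": "['lastName']" }, "sink": { "name": "last_name" } }
  ]
}

If collectionReference does not point at the array in the API response, the hierarchical-to-tabular stage can fail with the message above.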

Missing Auto-fill parameter steps in Execute Pipeline step

In Step 4, Lookup_activity_ForEach_loop_and_Execute_Pipeline_activity, I couldn't find the auto-fill parameters option. Also, in the LookupDates settings, for File path type I needed to set it to "Wildcard file path" instead of "File path in dataset".
In case you feel something is wrong on my side, please ignore.
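
For reference, if the parameter auto-fill does not appear in the authoring UI, the invoked pipeline's parameters can also be set explicitly in the Execute Pipeline activity JSON. A minimal sketch, with invented pipeline and parameter names (the value here assumes the activity sits inside a ForEach loop, as in this lab step):

{
  "name": "ExecuteCopyWwiData",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": { "referenceName": "CopyWwiSftpData", "type": "PipelineReference" },
    "waitOnCompletion": true,
    "parameters": {
      "runDate": "@item().runDate"
    }
  }
}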

Operation timed out when previewing data in dataset setup

Hi,

When creating the WWISftpParquet dataset using the dynamic parameters at the stage linked below, and then trying to Preview data, the process fails and has been returning a 'the operation timed out' message for the last few days. However, the SFTP linked service behind the connection passes its test connection OK.

https://github.com/Mmodarre/AzureDataFactoryHOL/tree/master/Lab-guide/02-Linked_Services_Datasets_and_Integration_Runtimes#create-datasets-for-wwi-sftp-data

Thanks
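
For context, the dataset this issue refers to is a Parquet dataset over the SFTP linked service, parameterized so that the folder and file names are supplied at run time. A rough sketch under those assumptions (the parameter and linked service names are illustrative, not necessarily what the lab uses):

{
  "name": "WWISftpParquet",
  "properties": {
    "linkedServiceName": { "referenceName": "WwiSftpLinkedService", "type": "LinkedServiceReference" },
    "parameters": {
      "folderName": { "type": "String" },
      "fileName": { "type": "String" }
    },
    "type": "Parquet",
    "typeProperties": {
      "location": {
        "type": "SftpLocation",
        "folderPath": { "value": "@dataset().folderName", "type": "Expression" },
        "fileName": { "value": "@dataset().fileName", "type": "Expression" }
      },
      "compressionCodec": "snappy"
    }
  }
}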

Use managed Identity to access storage & SQL?

In the "02-Linked_Services_Datasets_and_Integration_Runtimes":

  • instead of creating a Key Vault secret with the storage account key and accessing that secret via the ADF Managed Identity, why not grant the ADF Managed Identity the Storage Blob Data Contributor role instead?

  • similarly, for the Azure SQL instance as well, ADF managed identity could be used

Wouldn't this simplify the approach and also follow best practices - least privilege?
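
For what it is worth, here is a hedged sketch of the managed-identity variant suggested in this issue (the storage account name is a placeholder; this is not what the lab currently does). For Azure Blob Storage, specifying serviceEndpoint instead of a connection string makes ADF authenticate with its managed identity, provided that identity has been granted a role such as Storage Blob Data Contributor on the account:

{
  "name": "AzureBlobStorageMsiLS",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "serviceEndpoint": "https://<storage-account-name>.blob.core.windows.net"
    }
  }
}

For Azure SQL Database, the equivalent is a connection string without user credentials plus a contained database user created for the data factory's managed identity.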

Timed Out Error in Linked Service of SFTP

I am facing this error while testing the SFTP dataset connection. I tried changing the region, but the same error comes again and again.

Meet network issue when connect to Sftp server 'adflabsftp.westus2.cloudapp.azure.com', SocketErrorCode: 'TimedOut'.
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Incorrect SQL Dataset being imported in SmartFoodFoodNutritionELT SCD 1 Instructions

Thank you for this wonderful lab.

I noticed that in the instructions for loading food_nut.csv and nutsDim.csv in 08-ELT_with_Mapping_Dataflows–Practice_excercises, you reference bringing in [SmartFoodsDw][foodDim] throughout the tutorial.


I do understand that the item_id in food_nut.csv corresponds to the sku in foodDim; in this case, perhaps it would be beneficial to specify that the initial foodDim import is for matching, whereas the sink should be foodNutDim?

There is, however, no mention of importing [SmartFoodsDw][foodNutDim].

The content of those two CSVs merges together to form foodNutDim.

Thanks!

Unable to execute pipeline in lab 3

I got the following error when I tried to execute the pipeline in debug mode in Lab 3. I followed the instructions in the lab as written.

{
"errorCode": "2200",
"message": "ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: One or more errors occurred.,Source=Microsoft.DataTransfer.ClientLibrary,'",
"failureType": "UserError",
"target": "SmartFoodsCustomersToBlob",
"details": []
}

SFTP Dataset

In 02-Linked_Services_Datasets_and_Integration_Runtimes, Step 12, the text indicates use of the SFTP linked service, but the screenshots all refer to the blob storage linked service.

Also, the preview data doesn't appear to work when referring to the SFTP linked service. The same error occurs regardless of whether parameters or fixed names are used, and regardless of the compression type.
