Follow the instructions in each notebook below.
- Storage Settings
- Basics of PySpark and Spark Machine Learning
- Spark Machine Learning Pipeline
- Hyper-parameter Tuning
- MLeap (requires ML runtime)
- Horovod Runner on Databricks Runtime for ML (requires ML runtime)
- Structured Streaming (Basic)
- Structured Streaming with Azure Event Hubs or Kafka
- Delta Lake
- Work with MLflow (requires ML runtime)
- Orchestration with Azure Data Services
- Create an Azure Databricks resource in Microsoft Azure and launch the workspace. For details, ask your instructor or see the Quickstart.
- Create a compute cluster in the Databricks workspace. (Select "Compute" in the workspace UI.)
  Databricks Runtime 10.2 ML or above is recommended for this tutorial.
- Download HandsOn.dbc and import it into your workspace.
  - Select "Workspace" in the workspace UI.
  - Go to your user folder.
  - Click your e-mail address (the arrow on the right side) and select the "Import" command to import HandsOn.dbc.
- Open the imported notebook and attach your cluster to it. (Select the cluster at the top of the notebook.)
Note: You cannot use an Azure free trial subscription because of its limited vCPU quota. If you are on a trial subscription, upgrade to Pay-As-You-Go. (Your credit is retained even after you switch to Pay-As-You-Go.)
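If you prefer to script the cluster step rather than click through the UI, the following is a minimal sketch of how a cluster definition could be assembled and checked against the runtime recommendation above. It only builds a dictionary and makes no network call. The field names (`cluster_name`, `spark_version`, `node_type_id`, `num_workers`, `autotermination_minutes`) are the ones used by the Databricks Clusters API create request. The runtime key format (`10.2.x-cpu-ml-scala2.12`), the node type `Standard_DS3_v2`, the cluster name, and the helper names are assumptions chosen for illustration, so check the values available in your own workspace.

```python
import re

# Minimum runtime recommended by this tutorial (10.2 ML).
RECOMMENDED_VERSION = (10, 2)


def parse_runtime_key(spark_version):
    """Split a runtime key into ((major, minor), is_ml).

    Assumes keys shaped like '10.2.x-cpu-ml-scala2.12' (illustrative format).
    """
    match = re.match(r"(\d+)\.(\d+)\.", spark_version)
    if match is None:
        raise ValueError(f"Unrecognized runtime key: {spark_version!r}")
    version = (int(match.group(1)), int(match.group(2)))
    is_ml = "-ml-" in spark_version
    return version, is_ml


def meets_recommendation(spark_version):
    """Return True when the key names an ML runtime at 10.2 or above."""
    version, is_ml = parse_runtime_key(spark_version)
    return is_ml and version >= RECOMMENDED_VERSION


def build_cluster_spec(name, spark_version, node_type_id, num_workers=2):
    """Build a request body for creating a cluster (hypothetical helper)."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Shut the cluster down after an hour of inactivity to limit cost.
        "autotermination_minutes": 60,
    }


if __name__ == "__main__":
    # All values below are placeholders, not values taken from this tutorial.
    spec = build_cluster_spec(
        "handson-cluster", "10.2.x-cpu-ml-scala2.12", "Standard_DS3_v2"
    )
    if not meets_recommendation(spec["spark_version"]):
        print("Warning: runtime is below the recommended 10.2 ML.")
    print(spec)
```

The check is kept separate from the builder because 10.2 ML is a recommendation, not a hard requirement; notebooks that do not need the ML runtime may still run on other versions.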
- Azure has extensive documentation online.
- Databricks has even more training materials listed in this Guide. The Azure material is especially relevant. Although it is described as free customer training, a free registration is all it takes to become a "customer".
- Books (most of which are available on O'Reilly online):
  - Azure Databricks Cookbook. This one is better than many Manning Cookbooks and is pretty helpful.
  - Advanced Analytics with PySpark. This is an update of one of my favorite books on Spark, made accessible to Python. The previous version focused on Scala.
  - Azure Databricks. This playlist has extensive references.
  - Delta Lake: The Definitive Guide. I find this one to be a bit much, but it sure is definitive.
- Videos:
  - Debugging Apache Spark. This one is a bit older for Spark, but the author, Holden Karau, is a great longtime advocate of PySpark and an open source big data contributor. She is a great speaker and livestreams her coding on YouTube and Twitch. Look out for her dog Timbit, who makes occasional appearances.
  - Building Your First ETL Pipeline Using Azure Databricks. For a review of some of the material covered last week.
  - Predictive Analytics Using Apache Spark MLlib on Databricks. Janani Ravi is one of my favorite instructors on Pluralsight discussing Spark and ML. This course is part of this learning path: https://app.pluralsight.com/paths/skill/apache-spark-on-databricks
Modified by Ed Fine @Afinepoint
Links to code provided to keep up to date.
Original code by Tsuyoshi Matsuzaki @ Microsoft