Building a Data Analytics platform on AWS with Tableau

This lab using IMDb movie dataset

Scenario

Various enterprises nowadays face big data problems. Most of them put the effort in data mining, data cleansing, data analytics, or even about machine learning.

AWS plays an important role in big data solution. We provide an end-to-end data solution that leverage different AWS services that including data lake, data catalog, data analytics, and presentation. There are many architectures for the customer to customize their solution. This lab using several services to fulfill data analysis and BI process that includes S3, Glue, Athena.

Use Case in this Lab

Dataset: IMDb dataset
https://www.imdb.com/interfaces/
The data set of 10000+ most popular movies on IMDB in many years. The data points included are: Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascrore This use case using IMDb data to analyze some interesting insights of movies or TV episode.

Lab Architecture

As illustrated in the preceding diagram, this is a big data processing in this model:

1.   Upload the IMDb data into the S3 bucket.

2.   Setup Glue data catalog to create Glue table.

3.   Athena runs the query to create target table with Glue data catalog.

4.   Tableau will be configured, connect to Athena, to retrieve data, and show the analytic figure on Tableau dashboard.

Amazon S3 introduction

What is Amazon S3? Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize your data and configure finely-tuned access controls to meet your specific business, organizational, and compliance requirements. Amazon S3 is designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications for companies all around the world.

AWS Glue introduction

What is AWS Glue? AWS Glue is a fully managed data catalog and ETL (extract, transform, and load) service that simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more. It is significantly reducing the time and effort that it takes to derive business insights quickly from an Amazon S3 data lake by discovering the structure and form of your data. Also automatically crawls your Amazon S3 data, identifies data formats, and then suggests schemas for use with other AWS analytic services.

Amazon Athena introduction

What is Amazon Athena? Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully-managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance.

The workshop’s region will be in ‘Singapore’

Step 0 - Prerequisites

Sign-in a AWS account, and make sure you have select Singapore region
Make sure your account have permission to create IAM role for following services: S3, Glue, Athena
Make sure you have created the Access key and Secret access key that have Athena fully permission to connect to Tableau
Download this repository and unzip, ensure that data folder including two files:
title.basic.tsv, title.rating.tsv
Download Tableau Desktop on your laptop.
Click below link to download
https://www.tableau.com/support/releases
Note that download the latest version (2018.3.2 for this example)
Make sure that you have license to use Tableau
https://www.tableau.com/pricing
Setup AWS Athena Driver for Tableau Desktop
If Java is not already installed on your Mac, download and install the latest Java version from https://www.java.com/en/download.
Download the JDBC driver (.jar file) from the Amazon Athena User Guide on Amazon's website.
https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
For Mac, Copy the downloaded .jar file to the /Library/JDBC directory. You might have to create the JDBC directory if it doesn't already exist.
For Windows, Move the downloaded .jar file to C:\Program Files\Tableau\Drivers.

Step 1 - AWS environment setup

First of all, login to AWS console
https://console.aws.amazon.com/console/home

Create Access key and Secret access key on AWS

To create a new secret access key for your root account, use the security credentials page. Expand the Access Keys section, and then click Create New Root Key.
To create a new secret access key for an IAM user, open the IAM console. Click Users in the Details pane, click the appropriate IAM user, and then click Create Access Key on the Security Credentials tab.
Download the newly created credentials (csv file), when prompted to do so in the key creation wizard

Setp 2 - Create IAM roles for Glue service

On the service menu, click IAM.
In the navigation pane, choose Roles.
Click Create role.
For role type, choose AWS Service, find and choose Glue, and choose Next: Permissions.
On the Attach permissions policy page, search and choose AmazonS3FullAccess, AWSGlueServiceRole, and choose Next: Tags then click Next: Review.
On the Review page, enter the following detail:
Role name: AWSGlueServiceRoleDefault
Click Create role.
Choose Roles page, select the role AWSGlueServiceDefault you just created.
Now confirm you have policies as below figure.

Figure1: IAM role policies

You successfully create the role that allow AWS Glue get access to S3.

Setp 3 - Create S3 bucket for data lake and staging

In this step we create below two S3 buckets
- The bucket stores the data that contain IMDb data and allow Glue crawler to crawl the data
- The bucket stores the data from Athena query and provide the path for creating table in Athena
On the service menu, click S3.
Click Create bucket.
Enter the Bucket name “[your name]-imdb-dataset” (e.g., samuel-imdb-dataset) and ensure that the bucket name is unique so that you can create.
Click Create.
Click “[your name]-imdb-dataset” bucket
Click Create folder and enter the name basics then click save
In basics folder Click Upload and choose Add files.
Select file title.basics.tsv then click Upload.
Click Create folder to create another folder named ratings as the same way in 1.6
In ratings folder Click Upload and choose Add files.
Select file title.ratings.tsv then click Upload.
Now “[your name]-imdb-dataset” bucket the folder will show as below
For another bucket, click Create bucket again and enter the bucket name “[your name]-athena-table” (e.g., samuel-athena-table) and ensure that the bucket name is unique so that you can create.
Click Create.
Make sure that your S3 buckets contain those two buckets

Step 4 - Setup AWS Glue data catalog

Create database, tables, crawlers, in Glue Data Catalog

On the Services menu, click AWS Glue.
In the console, choose Add database. In the Database name, type imdb-data, and choose Create.
Choose Crawlers in the navigation pane, choose Add crawler. Enter the Crawler name basics-crawler, and choose Next.
On the Add a data store page, choose S3 as data store.
Select Specified path in my account.
Select basics folder in the bucket that you have created [your name]-imdb-dataset, and choose Next.
On Add another data store page, choose No, and choose Next.
Select Choose an existing IAM role, and choose the role AWSGlueServiceRoleDefault you just created in the drop-down list, and choose Next.
For Frequency, choose Run on demand, and choose Next.
For Database, choose imdb-data, and choose Next.
Review the steps, and choose Finish.
The crawler is ready to run. Choose Run it now.
Now the basics-crawler is crawling the data in basics folder in S3 bucket.
When the crawler has finished, two table has been added. Choose Tables in the left navigation pane, and then choose basics to confirmed.

You can get the table information such as S3 location

You can also get the table schema

Now you need to add another ratings table so let's create another crawler

In the navigation pane, choose Add crawler. Add type Crawler name “ratings-crawler” and choose Next.
On the Add a data store page, choose S3 as data store.
Select Specified path in my account.
Select ratings folder in the bucket that you have created [your name]-imdb-dataset, and choose Next.
On Add another data store page, choose No, and choose Next.
Select Choose an existing IAM role, and choose the role AWSGlueServiceRoleDefault you just created in the drop-down list, and choose Next.
For Frequency, choose Run on demand, and choose Next.
For Database, choose imdb-data, and choose Next.
Review the steps, and choose Finish.
The crawler is ready to run. Choose Run it now.
After the crawler has finished, there is a new table ratings in the imdb-data database:

Now you successfully to setup AWS Glue data catalog and create Glue table with IMDb data

Step 5 - Ad Hoc query in with AWS Athena

Athena can query the data in an easy way with Glue Data Catalog

On the Services menu, click Athena.
On the Query Editor tab, choose the database imdb-data.

Query the data, paste below standard SQL in the blank:
remember to replace external_location with your S3 bucket name

create table rating_with_info with(
format='PARQUET',
external_location='s3://[your name]-athena-table/result/')
as (
select basics.*, averagerating, numvotes from basics
left join ratings
on basics.tconst = ratings.tconst where ratings.averagerating is not null and basics.startyear is not null)

Click Run Query and Athena will query data as the below screen

The query will take about 10 seconds to run
When the query finished you will find that a new table name "rating_with_info" in the Tables list.
rating_with_info table will be used in Tableau to create different views
You can also preview rating_with_info table to explore the data

Step 6 - Setup Tableau desktop connection to Athena

The following steps will show you how to use Tableau to create the views with Athena table.

First you need to download Tableau Desktop on your laptop.
In this step assume you have installed Tableau Desktop.
Open Tableau Desktop you will see this screen
To connect to Athena, click Amazon Athena in navigation pane left side
Enter "athena.ap-southeast-1.amazonaws.com" in Server
Enter port for 443
Enter Staging Directory for your Athena query result S3 bucket

Go to Athena console and click Settings to get the staging directory path
Enter Access Key ID and Secret Access Key which you have created on AWS (you can view these two items in credential csv) then click sign in
Select AwsDataCatalog in Catalog and select imdb-data in database
Drag the table you want to use

In this lab we use rating_with_info

Step 7 - Visualize data with Tableau

Please refer to this video for the configuration on Tableau https://drive.google.com/open?id=1HK3bxNRMLOuTkNHqKkmfTRTmYWdKKlIr

Clean Resources

Delete all the resources you have created in the lab including:
AWS Glue (Glue tables, Glue crawlers, Glue database), Amazon S3 (S3 buckets), Your Access Key

Conclusion

You have learned:

How to set up the Glue Data catalog integrates with the S3 data lake
How to analyze the data in Glue table with Athena
How to use Tableau Desktop to visualize the data in Athena

nguyenminhtuanfit / how-to-build-a-data-analytics-platform-on-aws-with-tableau Goto Github PK

how-to-build-a-data-analytics-platform-on-aws-with-tableau's Introduction