This lab using IMDb movie dataset
Various enterprises nowadays face big data problems. Most of them put the effort in data mining, data cleansing, data analytics, or even about machine learning.
AWS plays an important role in big data solution. We provide an end-to-end data solution that leverage different AWS services that including data lake, data catalog, data analytics, and presentation. There are many architectures for the customer to customize their solution. This lab using several services to fulfill data analysis and BI process that includes S3, Glue, Athena.
- Dataset: IMDb dataset
https://www.imdb.com/interfaces/
The data set of 10000+ most popular movies on IMDB in many years. The data points included are: Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascrore This use case using IMDb data to analyze some interesting insights of movies or TV episode.
As illustrated in the preceding diagram, this is a big data processing in this model:
1. Upload the IMDb data into the S3 bucket.
2. Setup Glue data catalog to create Glue table.
3. Athena runs the query to create target table with Glue data catalog.
4. Tableau will be configured, connect to Athena, to retrieve data, and show the analytic figure on Tableau dashboard.
What is Amazon S3? Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize your data and configure finely-tuned access controls to meet your specific business, organizational, and compliance requirements. Amazon S3 is designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications for companies all around the world.
What is AWS Glue? AWS Glue is a fully managed data catalog and ETL (extract, transform, and load) service that simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more. It is significantly reducing the time and effort that it takes to derive business insights quickly from an Amazon S3 data lake by discovering the structure and form of your data. Also automatically crawls your Amazon S3 data, identifies data formats, and then suggests schemas for use with other AWS analytic services.
What is Amazon Athena? Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully-managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance.
- Sign-in a AWS account, and make sure you have select Singapore region
- Make sure your account have permission to create IAM role for following services: S3, Glue, Athena
- Make sure you have created the Access key and Secret access key that have Athena fully permission to connect to Tableau
- Download this repository and unzip, ensure that data folder including two files:
title.basic.tsv, title.rating.tsv - Download Tableau Desktop on your laptop.
Click below link to download
https://www.tableau.com/support/releases
Note that download the latest version (2018.3.2 for this example)
Make sure that you have license to use Tableau
https://www.tableau.com/pricing - Setup AWS Athena Driver for Tableau Desktop
If Java is not already installed on your Mac, download and install the latest Java version from https://www.java.com/en/download.
Download the JDBC driver (.jar file) from the Amazon Athena User Guide on Amazon's website.
https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
For Mac, Copy the downloaded .jar file to the /Library/JDBC directory. You might have to create the JDBC directory if it doesn't already exist.
For Windows, Move the downloaded .jar file to C:\Program Files\Tableau\Drivers.
First of all, login to AWS console
https://console.aws.amazon.com/console/home
- To create a new secret access key for your root account, use the security credentials page. Expand the Access Keys section, and then click Create New Root Key.
- To create a new secret access key for an IAM user, open the IAM console. Click Users in the Details pane, click the appropriate IAM user, and then click Create Access Key on the Security Credentials tab.
- Download the newly created credentials (csv file), when prompted to do so in the key creation wizard
-
On the service menu, click IAM.
-
In the navigation pane, choose Roles.
-
Click Create role.
-
For role type, choose AWS Service, find and choose Glue, and choose Next: Permissions.
-
On the Attach permissions policy page, search and choose AmazonS3FullAccess, AWSGlueServiceRole, and choose Next: Tags then click Next: Review.
-
On the Review page, enter the following detail:
Role name: AWSGlueServiceRoleDefault -
Click Create role.
-
Choose Roles page, select the role AWSGlueServiceDefault you just created.
-
Now confirm you have policies as below figure.
Figure1: IAM role policies
You successfully create the role that allow AWS Glue get access to S3.
-
In this step we create below two S3 buckets
- The bucket stores the data that contain IMDb data and allow Glue crawler to crawl the data
- The bucket stores the data from Athena query and provide the path for creating table in Athena
- The bucket stores the data that contain IMDb data and allow Glue crawler to crawl the data
-
On the service menu, click S3.
-
Click Create bucket.
-
Enter the Bucket name “[your name]-imdb-dataset” (e.g., samuel-imdb-dataset) and ensure that the bucket name is unique so that you can create.
-
Click Create.
-
Click “[your name]-imdb-dataset” bucket
-
Click Create folder and enter the name basics then click save
-
In basics folder Click Upload and choose Add files.
-
Click Create folder to create another folder named ratings as the same way in 1.6
-
In ratings folder Click Upload and choose Add files.
-
Now “[your name]-imdb-dataset” bucket the folder will show as below
-
For another bucket, click Create bucket again and enter the bucket name “[your name]-athena-table” (e.g., samuel-athena-table) and ensure that the bucket name is unique so that you can create.
-
Click Create.
Create database, tables, crawlers, in Glue Data Catalog
-
On the Services menu, click AWS Glue.
-
In the console, choose Add database. In the Database name, type imdb-data, and choose Create.
-
Choose Crawlers in the navigation pane, choose Add crawler. Enter the Crawler name basics-crawler, and choose Next.
-
On the Add a data store page, choose S3 as data store.
-
Select Specified path in my account.
-
Select basics folder in the bucket that you have created [your name]-imdb-dataset, and choose Next.
-
On Add another data store page, choose No, and choose Next.
-
Select Choose an existing IAM role, and choose the role AWSGlueServiceRoleDefault you just created in the drop-down list, and choose Next.
-
For Frequency, choose Run on demand, and choose Next.
-
For Database, choose imdb-data, and choose Next.
-
Review the steps, and choose Finish.
-
The crawler is ready to run. Choose Run it now.
Now the basics-crawler is crawling the data in basics folder in S3 bucket. -
When the crawler has finished, two table has been added. Choose Tables in the left navigation pane, and then choose basics to confirmed.
You can get the table information such as S3 location
You can also get the table schema
Now you need to add another ratings table so let's create another crawler
- In the navigation pane, choose Add crawler. Add type Crawler name “ratings-crawler” and choose Next.
- On the Add a data store page, choose S3 as data store.
- Select Specified path in my account.
- Select ratings folder in the bucket that you have created [your name]-imdb-dataset, and choose Next.
- On Add another data store page, choose No, and choose Next.
- Select Choose an existing IAM role, and choose the role AWSGlueServiceRoleDefault you just created in the drop-down list, and choose Next.
- For Frequency, choose Run on demand, and choose Next.
- For Database, choose imdb-data, and choose Next.
- Review the steps, and choose Finish.
- The crawler is ready to run. Choose Run it now.
- After the crawler has finished, there is a new table ratings in the imdb-data database:
Now you successfully to setup AWS Glue data catalog and create Glue table with IMDb data
Athena can query the data in an easy way with Glue Data Catalog
-
On the Services menu, click Athena.
-
Query the data, paste below standard SQL in the blank:
remember to replace external_location with your S3 bucket namecreate table rating_with_info with( format='PARQUET', external_location='s3://[your name]-athena-table/result/') as ( select basics.*, averagerating, numvotes from basics left join ratings on basics.tconst = ratings.tconst where ratings.averagerating is not null and basics.startyear is not null)
-
Click Run Query and Athena will query data as the below screen
The query will take about 10 seconds to run -
When the query finished you will find that a new table name "rating_with_info" in the Tables list.
-
rating_with_info table will be used in Tableau to create different views
-
You can also preview rating_with_info table to explore the data
The following steps will show you how to use Tableau to create the views with Athena table.
- First you need to download Tableau Desktop on your laptop.
- In this step assume you have installed Tableau Desktop.
- Open Tableau Desktop you will see this screen
- To connect to Athena, click Amazon Athena in navigation pane left side
- Enter "athena.ap-southeast-1.amazonaws.com" in Server
- Enter port for 443
- Enter Staging Directory for your Athena query result S3 bucket
Go to Athena console and click Settings to get the staging directory path
- Enter Access Key ID and Secret Access Key which you have created on AWS (you can view these two items in credential csv) then click sign in
- Select AwsDataCatalog in Catalog and select imdb-data in database
- Drag the table you want to use
In this lab we use rating_with_info
Please refer to this video for the configuration on Tableau https://drive.google.com/open?id=1HK3bxNRMLOuTkNHqKkmfTRTmYWdKKlIr
- Delete all the resources you have created in the lab including:
- AWS Glue (Glue tables, Glue crawlers, Glue database), Amazon S3 (S3 buckets), Your Access Key
You have learned:
- How to set up the Glue Data catalog integrates with the S3 data lake
- How to analyze the data in Glue table with Athena
- How to use Tableau Desktop to visualize the data in Athena