Giter Site home page Giter Site logo

how-to-build-a-data-analytics-platform-on-aws-with-tableau's Introduction

Building a Data Analytics platform on AWS with Tableau

dataset.jpg
This lab using IMDb movie dataset

Scenario

Various enterprises nowadays face big data problems. Most of them put the effort in data mining, data cleansing, data analytics, or even about machine learning.

AWS plays an important role in big data solution. We provide an end-to-end data solution that leverage different AWS services that including data lake, data catalog, data analytics, and presentation. There are many architectures for the customer to customize their solution. This lab using several services to fulfill data analysis and BI process that includes S3, Glue, Athena.

Use Case in this Lab

  • Dataset: IMDb dataset
  • imdb_logo.jpg
    https://www.imdb.com/interfaces/
    The data set of 10000+ most popular movies on IMDB in many years. The data points included are: Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascrore This use case using IMDb data to analyze some interesting insights of movies or TV episode.

Lab Architecture

architecture.png

As illustrated in the preceding diagram, this is a big data processing in this model:

1.   Upload the IMDb data into the S3 bucket.

2.   Setup Glue data catalog to create Glue table.

3.   Athena runs the query to create target table with Glue data catalog.

4.   Tableau will be configured, connect to Athena, to retrieve data, and show the analytic figure on Tableau dashboard.

Amazon S3 introduction

What is Amazon S3? Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize your data and configure finely-tuned access controls to meet your specific business, organizational, and compliance requirements. Amazon S3 is designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications for companies all around the world.

AWS Glue introduction

What is AWS Glue? AWS Glue is a fully managed data catalog and ETL (extract, transform, and load) service that simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more. It is significantly reducing the time and effort that it takes to derive business insights quickly from an Amazon S3 data lake by discovering the structure and form of your data. Also automatically crawls your Amazon S3 data, identifies data formats, and then suggests schemas for use with other AWS analytic services.

Amazon Athena introduction

What is Amazon Athena? Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully-managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance.

The workshop’s region will be in ‘Singapore’

Step 0 - Prerequisites

  1. Sign-in a AWS account, and make sure you have select Singapore region
  2. Make sure your account have permission to create IAM role for following services: S3, Glue, Athena
  3. Make sure you have created the Access key and Secret access key that have Athena fully permission to connect to Tableau
  4. Download this repository and unzip, ensure that data folder including two files:
    title.basic.tsv, title.rating.tsv
  5. Download Tableau Desktop on your laptop.
    Click below link to download
    https://www.tableau.com/support/releases
    Note that download the latest version (2018.3.2 for this example)
    Make sure that you have license to use Tableau
    https://www.tableau.com/pricing
  6. Setup AWS Athena Driver for Tableau Desktop
    If Java is not already installed on your Mac, download and install the latest Java version from https://www.java.com/en/download.
    Download the JDBC driver (.jar file) from the Amazon Athena User Guide on Amazon's website.
    https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
    For Mac, Copy the downloaded .jar file to the /Library/JDBC directory. You might have to create the JDBC directory if it doesn't already exist.
    For Windows, Move the downloaded .jar file to C:\Program Files\Tableau\Drivers.

Step 1 - AWS environment setup

First of all, login to AWS console
https://console.aws.amazon.com/console/home

Create Access key and Secret access key on AWS

  • To create a new secret access key for your root account, use the security credentials page. Expand the Access Keys section, and then click Create New Root Key.
  • To create a new secret access key for an IAM user, open the IAM console. Click Users in the Details pane, click the appropriate IAM user, and then click Create Access Key on the Security Credentials tab.
  • Download the newly created credentials (csv file), when prompted to do so in the key creation wizard

Setp 2 - Create IAM roles for Glue service

  • On the service menu, click IAM.

  • In the navigation pane, choose Roles.

  • Click Create role.

  • For role type, choose AWS Service, find and choose Glue, and choose Next: Permissions.

  • On the Attach permissions policy page, search and choose AmazonS3FullAccess, AWSGlueServiceRole, and choose Next: Tags then click Next: Review.

  • On the Review page, enter the following detail:
    Role name: AWSGlueServiceRoleDefault

  • Click Create role.

  • Choose Roles page, select the role AWSGlueServiceDefault you just created.

  • Now confirm you have policies as below figure.
    iam1.png
    Figure1: IAM role policies


    You successfully create the role that allow AWS Glue get access to S3.

Setp 3 - Create S3 bucket for data lake and staging

  • In this step we create below two S3 buckets

    • The bucket stores the data that contain IMDb data and allow Glue crawler to crawl the data
    • The bucket stores the data from Athena query and provide the path for creating table in Athena
  • On the service menu, click S3.

  • Click Create bucket.

  • Enter the Bucket name “[your name]-imdb-dataset” (e.g., samuel-imdb-dataset) and ensure that the bucket name is unique so that you can create.

    s3-1.png

  • Click Create.

  • Click “[your name]-imdb-dataset” bucket

  • Click Create folder and enter the name basics then click save

    s3-2.png

  • In basics folder Click Upload and choose Add files.

  • Select file title.basics.tsv then click Upload.

    s3-3.png

  • Click Create folder to create another folder named ratings as the same way in 1.6

    s3-4.png

  • In ratings folder Click Upload and choose Add files.

  • Select file title.ratings.tsv then click Upload.

    s3-5.png

  • Now “[your name]-imdb-dataset” bucket the folder will show as below
    s3-6.png

  • For another bucket, click Create bucket again and enter the bucket name “[your name]-athena-table” (e.g., samuel-athena-table) and ensure that the bucket name is unique so that you can create.
    s3-7.png

  • Click Create.

  • Make sure that your S3 buckets contain those two buckets

    s3-8.png

Step 4 - Setup AWS Glue data catalog

Create database, tables, crawlers, in Glue Data Catalog

  • On the Services menu, click AWS Glue.

  • In the console, choose Add database. In the Database name, type imdb-data, and choose Create.

    glue-1.png

  • Choose Crawlers in the navigation pane, choose Add crawler. Enter the Crawler name basics-crawler, and choose Next.

    glue-2.png

  • On the Add a data store page, choose S3 as data store.

  • Select Specified path in my account.

  • Select basics folder in the bucket that you have created [your name]-imdb-dataset, and choose Next.
    glue-3.png

  • On Add another data store page, choose No, and choose Next.

  • Select Choose an existing IAM role, and choose the role AWSGlueServiceRoleDefault you just created in the drop-down list, and choose Next.

  • For Frequency, choose Run on demand, and choose Next.

  • For Database, choose imdb-data, and choose Next.

  • Review the steps, and choose Finish.

  • The crawler is ready to run. Choose Run it now.
    Now the basics-crawler is crawling the data in basics folder in S3 bucket.

  • When the crawler has finished, two table has been added. Choose Tables in the left navigation pane, and then choose basics to confirmed.

    glue-4.png
    You can get the table information such as S3 location
    glue-5.png
    You can also get the table schema
    glue-6.png

Now you need to add another ratings table so let's create another crawler

  • In the navigation pane, choose Add crawler. Add type Crawler name “ratings-crawler” and choose Next.

    glue-7.png
  • On the Add a data store page, choose S3 as data store.

  • Select Specified path in my account.

  • Select ratings folder in the bucket that you have created [your name]-imdb-dataset, and choose Next.
    glue-8.png
  • On Add another data store page, choose No, and choose Next.

  • Select Choose an existing IAM role, and choose the role AWSGlueServiceRoleDefault you just created in the drop-down list, and choose Next.

  • For Frequency, choose Run on demand, and choose Next.

  • For Database, choose imdb-data, and choose Next.

  • Review the steps, and choose Finish.

  • The crawler is ready to run. Choose Run it now.
  • After the crawler has finished, there is a new table ratings in the imdb-data database:

    glue-9.png
    glue-10.png
    glue-11.png

Now you successfully to setup AWS Glue data catalog and create Glue table with IMDb data

Step 5 - Ad Hoc query in with AWS Athena

Athena can query the data in an easy way with Glue Data Catalog

  • On the Services menu, click Athena.

  • On the Query Editor tab, choose the database imdb-data.

    athena-1.png

  • Query the data, paste below standard SQL in the blank:
    remember to replace external_location with your S3 bucket name

    create table rating_with_info with(
    format='PARQUET',
    external_location='s3://[your name]-athena-table/result/')
    as (
    select basics.*, averagerating, numvotes from basics
    left join ratings
    on basics.tconst = ratings.tconst where ratings.averagerating is not null and basics.startyear is not null)
    
  • Click Run Query and Athena will query data as the below screen
    athena-2.png
    The query will take about 10 seconds to run

  • When the query finished you will find that a new table name "rating_with_info" in the Tables list.
    athena-3.png

  • rating_with_info table will be used in Tableau to create different views

  • You can also preview rating_with_info table to explore the data
    athena-6.png

Step 6 - Setup Tableau desktop connection to Athena

The following steps will show you how to use Tableau to create the views with Athena table.

  • First you need to download Tableau Desktop on your laptop.
  • In this step assume you have installed Tableau Desktop.
  • Open Tableau Desktop you will see this screen
    tableau-1.png
  • To connect to Athena, click Amazon Athena in navigation pane left side

    tableau-2.png
  • Enter "athena.ap-southeast-1.amazonaws.com" in Server

  • Enter port for 443

  • Enter Staging Directory for your Athena query result S3 bucket

    Go to Athena console and click Settings to get the staging directory path

    athena-4.png
    athena-5.png

  • Enter Access Key ID and Secret Access Key which you have created on AWS (you can view these two items in credential csv) then click sign in

    tableau-3.png
  • Select AwsDataCatalog in Catalog and select imdb-data in database

    tableau-4.png

  • Drag the table you want to use

    tableau-5.png

    In this lab we use rating_with_info

Step 7 - Visualize data with Tableau

Please refer to this video for the configuration on Tableau https://drive.google.com/open?id=1HK3bxNRMLOuTkNHqKkmfTRTmYWdKKlIr

Clean Resources

  • Delete all the resources you have created in the lab including:
  • AWS Glue (Glue tables, Glue crawlers, Glue database), Amazon S3 (S3 buckets), Your Access Key

Conclusion

You have learned:

  • How to set up the Glue Data catalog integrates with the S3 data lake
  • How to analyze the data in Glue table with Athena
  • How to use Tableau Desktop to visualize the data in Athena

Appendix

how-to-build-a-data-analytics-platform-on-aws-with-tableau's People

Contributors

yuanmasa avatar sammeowww avatar

Watchers

James Cloos avatar Tuan avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.