akemisetti / aws-mimic-iiitoomop

This project forked from amazon-archives/aws-mimic-iiitoomop

Demonstrates creating a healthcare data warehouse using the MIMIC-III dataset on Redshift, Spark on EMR and Lambda

License: Apache License 2.0

Java 100.00%

aws-mimic-iiitoomop's Introduction

Build a Healthcare Data Warehouse using Spark and the OHDSI Common Data Model

This code demonstrates the architecture featured in an AWS Big Data Blog (https://aws.amazon.com/blogs/big-data/) post, published on May 12, 2017, on creating a healthcare data warehouse using Redshift, Spark on EMR, and Lambda. It takes an openly available research dataset called MIMIC-III and converts it into OMOP, a standard open source healthcare data model.

Prerequisites

Steps

  1. Create a bucket in S3
  2. Within the bucket created in step 1, copy over all of the directories underneath the "copyToS3" directory located in this GitHub repo.
    • Make sure you do not copy the actual "copyToS3" directory itself
  3. Upload the MIMIC-III raw data in CSV format into the subdirectories of the mimic3 folder that you copied to S3 in step 2
    • e.g., caregivers.csv would be "mimic3/caregivers/caregivers.csv"
  4. Copy the "cloudformation", "redshiftSQL", "transformationSQL", and "config" folders from GitHub into the bucket created in step 1
  5. Open the config.json file in the "config" directory that was just copied and replace all references of "" with the name of the bucket you created in step 1
    • Also, ensure that your source file names from the mimic3 data match what is listed in this file. If they do not match, either rename the files or change the references in the config file.
  6. Create a folder called "jars" and upload both the "RedshiftCopier-1.0.0-jar-with-dependencies.jar" and "SparkBatchProcessor-1.0.0-jar-with-dependencies.jar" files into that folder
    • These jars are produced when you build the project.
  7. Create a new Stack in CloudFormation called "Redshift" and reference the redshift.json template in the "cloudformation" folder within S3.
    • You will need to pass the following parameters to this template:
      1. "Security Group" - this should be the default security group in your VPC
      2. "Subnet Id" - the public subnet of your default VPC
      3. "IP" - the IP range that you would like to be able to connect to the Redshift instance
      4. "Username" - the Redshift Username
      5. "Password" - the Redshift password
  8. Once step 7 has completed successfully, create a new Stack in CloudFormation and reference the mimic3-ohdsi.json template in the "cloudformation" folder within S3.
    • You will need to pass the following parameters to this template:
      1. "Security Group" - this should be the default security group in your VPC
      2. "Subnet Id" - the public subnet of your default VPC
      3. "RedshiftRoleArn" - this is the ARN for the IAM role created in the Redshift CloudFormation template
      4. "Username" - the Redshift Username
      5. "Password" - the Redshift password
      6. "RedshiftEndpoint" - the endpoint of the Redshift cluster created in step 7
      7. "Bucket" - the bucket created in step 1
  9. Sit back and enjoy
    • The CloudFormation template in step 8 creates an EMR cluster with a Spark job that executes automatically. Once all of the data has been loaded into Redshift, the infrastructure created in step 8 is torn down.
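
The S3 layout from step 3 and the parameter lists from steps 7 and 8 can be checked locally before anything is uploaded or launched. The following is a minimal sketch in Python, not part of this repo: `mimic3_key` and `missing_params` are hypothetical helper names, the rule that each subdirectory is named after its CSV file is inferred from the single caregivers.csv example, and the parameter names are the labels as written in this README, which may be spelled differently inside the templates.

```python
from pathlib import PurePosixPath

# Parameter labels as listed in steps 7 and 8 of this README. The actual
# parameter keys inside the CloudFormation templates may differ.
REDSHIFT_PARAMS = ("Security Group", "Subnet Id", "IP", "Username", "Password")
MIMIC3_OHDSI_PARAMS = (
    "Security Group", "Subnet Id", "RedshiftRoleArn", "Username",
    "Password", "RedshiftEndpoint", "Bucket",
)


def mimic3_key(csv_name, prefix="mimic3"):
    """Build the S3 key for one raw CSV file (hypothetical helper).

    Assumes the subdirectory is named after the file without its extension,
    as in the caregivers.csv example from step 3.
    """
    path = PurePosixPath(csv_name)
    if path.suffix.lower() != ".csv":
        raise ValueError(f"expected a .csv file, got {csv_name!r}")
    return f"{prefix}/{path.stem}/{path.name}"


def missing_params(required, supplied):
    """Return the required labels that are absent or empty in `supplied`."""
    return [name for name in required if not supplied.get(name)]


if __name__ == "__main__":
    # Where caregivers.csv should land inside the bucket.
    print(mimic3_key("caregivers.csv"))
    # Which Redshift stack parameters still need a value.
    print(missing_params(REDSHIFT_PARAMS, {"Username": "admin"}))
```

Running the checks first surfaces a misplaced file or an empty parameter before a stack launch fails partway through.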

Notes

  • You will incur a small charge for running this code on AWS.
  • You will need to tear down the CloudFormation stack created in step 7 manually.
  • The Redshift CloudFormation stack spins up an EC2 instance to add an IAM role to the cluster, since CloudFormation does not support this directly.

Contributors

hyandell, ryanmhood