RServer-for-HDInsight-example-CriteoDataSet

This repo contains a walkthrough of how to use RServer for HDInsight with large data sets like Criteo.

Running Instructions

It took about 10 hours to run the analysis on my cluster using the Criteo data for day 14 through day 23 (420 GB). To test your cluster and the program first, use a subset of the data, e.g., the data for day 14 only (46 GB).

Deploy an HDInsight cluster

More information about how to deploy R Server for HDInsight can be found at the documentation site. It is also recommended that you install RStudio on the cluster by following the instructions there. Here is the configuration of the cluster I deployed:

Type           Cores   RAM (GB)   Nodes   Pricing Tier
Head Nodes        32        224       2   D14
Worker Nodes     960      6,720      60   D14
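
As a quick sanity check on the figures above, the per-node resources can be derived from the totals. This is only an illustrative sketch in base R; the data frame and its column names are my own and are not part of the repo.

```r
# Totals copied from the cluster table; column names are illustrative.
cluster <- data.frame(
  type  = c("Head Nodes", "Worker Nodes"),
  cores = c(32, 960),
  ramGB = c(224, 6720),
  nodes = c(2, 60),
  stringsAsFactors = FALSE
)

# Divide the totals by the node count to get per-node resources.
cluster$coresPerNode <- cluster$cores / cluster$nodes
cluster$ramGBPerNode <- cluster$ramGB / cluster$nodes

print(cluster[, c("type", "coresPerNode", "ramGBPerNode")])
```

Both node types come out at 16 cores and 112 GB of RAM per node, which is consistent with every node using the same D14 tier.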

Get the Criteo data

Information on the data can be found at "Now Available on Azure ML – Criteo's 1TB Click Prediction Dataset". After downloading and extracting the data for day 14 through day 23, upload the files to a folder on your HDInsight cluster using a tool like AzCopy.

Get the summary data

The summary data can be downloaded from an Azure blob. The summary covers the full 1 TB data set and includes frequency counts for the categorical variables and means for the integer variables. After downloading and extracting the summary data, upload the files to your HDInsight cluster using a tool like AzCopy.

Update the programs

SetComputeContext.R

  • Enter the nodename of your cluster and update the WASB address.
  • Replace the value of dataDir with the path where the data is saved. For example, I saved all data for my project in the folder "/lixun/CriteoAzure", so I assigned this path to dataDir.
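
The kind of values to update might look roughly like the sketch below. Only dataDir and the example path come from this README; the other variable names (nameNode, container, storageAccount, wasbAddress) and their placeholder values are assumptions for illustration, so check SetComputeContext.R for the names it actually uses.

```r
# Hypothetical placeholders: replace with your own cluster details.
nameNode       <- "my-cluster-nodename"   # assumed variable name
container      <- "mycontainer"           # assumed placeholder
storageAccount <- "mystorageaccount"      # assumed placeholder

# WASB addresses follow the form wasb://<container>@<account>.blob.core.windows.net
wasbAddress <- sprintf("wasb://%s@%s.blob.core.windows.net",
                       container, storageAccount)

# Path on the cluster where the data is saved (example from this README).
dataDir <- "/lixun/CriteoAzure"
```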

CriteoMain.R

  • Update the paths to the raw Criteo data as well as summaries of categorical and integer variables.
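
One way to keep the raw-data paths easy to switch between a one-day test and the full run is a small helper like the one below. It is a sketch only: the file naming "day_<n>" is an assumption about how you stored the extracted files, and rawDayPaths is a hypothetical helper that is not part of the repo.

```r
# Example data folder from this README; replace with your own.
dataDir <- "/lixun/CriteoAzure"

# Hypothetical helper: build one path per day, assuming files named day_<n>.
rawDayPaths <- function(days, dir = dataDir) {
  file.path(dir, paste0("day_", days))
}

testRun <- rawDayPaths(14)      # single-day subset for a quick test
fullRun <- rawDayPaths(14:23)   # the ten days used for the full analysis
```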

CriteoMainCall.R

  • Change the working directory to the folder where the programs are saved.

Run CriteoMainCall.R

For example, you can run the program from RStudio installed on the HDInsight cluster.
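
Before starting a run that can take hours, it may help to confirm that all three programs are in the expected folder. The sketch below uses base R only; findMissingScripts and runCriteo are hypothetical helpers, and the folder path in the usage comment is a placeholder.

```r
# Hypothetical helper: return the names of scripts not found in programDir.
findMissingScripts <- function(programDir,
                               scripts = c("SetComputeContext.R",
                                           "CriteoMain.R",
                                           "CriteoMainCall.R")) {
  scripts[!file.exists(file.path(programDir, scripts))]
}

# Hypothetical wrapper: check the folder, then source CriteoMainCall.R from it.
runCriteo <- function(programDir) {
  missing <- findMissingScripts(programDir)
  if (length(missing) > 0) {
    stop("Missing scripts in ", programDir, ": ",
         paste(missing, collapse = ", "))
  }
  oldDir <- setwd(programDir)
  on.exit(setwd(oldDir))   # restore the previous working directory afterwards
  source("CriteoMainCall.R")
}

# Usage (placeholder path):
# runCriteo("/path/to/your/programs")
```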


This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
