
microsoft / ghinsights


GHInsights is a data processing pipeline using Azure Data Factory and Azure Data Lake. It processes GitHub data from the ghtorrent project. The resulting processed data is available in Azure Data Lake for users to query, generate reports, and analyze GitHub projects.

License: Other

C# 88.93% PowerShell 11.07%


GHInsights

GHInsights is a dataset and processing pipeline for GitHub event and entity data. It enables you to create your own insights on all or a portion of the activity and content on GitHub. Fundamentally, GHInsights is based on GHTorrent, an open, collaborative project for gathering and exposing GitHub interactions. GHInsights takes that data and makes it available in Azure Data Lake. This gives you an easily accessible dataset and scalable compute resources, so you can create the insights you need without having to gather and manage the many terabytes of data involved.

GHInsights and the enriched datasets it exposes will evolve over time. Currently, the available data is essentially a straight copy of what is available in GHTorrent, and the supplied queries are minimal. We encourage the community to contribute generally useful and interesting queries and enrichments. Those can be shared and/or incorporated directly into the GHInsights dataset and made available to everyone.

Getting Started

Azure Data Lake is split into Storage and Analytics. The idea with GHInsights is that we provide data in Storage and enable you to access it from your Analytics account. This way you get full control of your analysis, but on a readily available and rich dataset. Setting up to use GHInsights has some overhead and cost. If you just want to poke around the dataset, we recommend going to GHTorrent and using their online dataset and query mechanism.

Note: The setup here is a work in progress. The team is actively working to enhance and simplify it. In the future you will "mount" the database into your Data Lake account. This is both much simpler and much faster, and you won't have to pay the storage or query costs associated with copying the data.

Getting started with Hadoop

GHInsights makes the data available as WebHDFS files. As such, you can set up a Hadoop cluster and process the data.

Instructions coming

Getting started with Spark

GHInsights makes the data available as WebHDFS files. As such, you can set up a Spark cluster and process the data.

Instructions coming
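Until fuller instructions are available, here is a minimal sketch of how a WebHDFS request URL is formed, which applies to both the Hadoop and Spark routes. It only builds URLs following the standard WebHDFS REST layout (`/webhdfs/v1/<path>?op=<OPERATION>`). The host name and the `/ghinsights/projects` path are hypothetical placeholders, not the actual GHInsights endpoint or layout, and authentication is omitted entirely.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, path: str, op: str, scheme: str = "https", **params: str) -> str:
    """Build a WebHDFS REST URL for the given file system path and operation.

    Follows the standard WebHDFS layout: <scheme>://<host>/webhdfs/v1/<path>?op=<OP>
    """
    # Normalise the path so it always starts with exactly one slash.
    clean_path = "/" + path.strip("/")
    # The operation goes first, followed by any extra query parameters.
    query = urlencode({"op": op, **params})
    return f"{scheme}://{host}/webhdfs/v1{quote(clean_path)}?{query}"


# Hypothetical host and path, for illustration only.
EXAMPLE_HOST = "example-account.example.net"

# URL that would list the contents of a directory.
list_url = webhdfs_url(EXAMPLE_HOST, "/ghinsights/projects", "LISTSTATUS")
print(list_url)
```

`LISTSTATUS` and `OPEN` are standard WebHDFS operations. A real request would also need the credentials granted for the dataset.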

Getting started with U-SQL

U-SQL is a new, SQL-like big data query language from Microsoft.

  1. To get started you need to set up an [Azure Subscription](https://azure.microsoft.com/en-us/free).

  2. Request access to the dataset by contacting @jeffmcaffer and @kelewis. We will work with you to get your Azure account enabled for Azure Data Lake Analytics (still in early preview) and to set up the proper permissions for that account to read the GitHub data.

  3. Import the dataset to your account. Right now you have to copy the data into your account. This is a one-time setup step that will go away as soon as Data Lake table sharing is enabled. To import the data, submit the [import.usql](https://github.com/Microsoft/ghinsights/tree/master/DataExport/import.usql) script in your Azure Data Lake Analytics account. This will take a while (a couple of hours). Once it is done, you will have a copy of the GHInsights U-SQL Database in your account.

  4. Run U-SQL jobs to query your data. See the U-SQL intro for examples and more details.

Note: In this process, the data will be copied over to your Data Lake storage. Keep in mind you are paying for the costs of storing and querying it. Importing the core set of tables takes roughly 50 compute hours. Pricing can vary by region and currency but is currently about US$1/hour. By default importing skips the CommitFile information as it is very large and can take considerably longer (300+ compute hours). If you want the CommitFile info, edit the script and uncomment the lines that fetch those files. For more Azure pricing info, see the Azure Data Lake pricing site.
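The figures in the note above can be turned into a rough budget. The sketch below multiplies compute hours by an hourly rate, using the approximate numbers quoted here (about 50 compute hours for the core tables, 300+ for CommitFile, roughly US$1/hour). The function and constant names are illustrative, the 300-hour figure is treated as a lower bound, and ongoing storage and query costs are not included.

```python
def estimate_import_cost(compute_hours: float, usd_per_hour: float = 1.0) -> float:
    """Return the estimated import cost in US dollars (compute only)."""
    if compute_hours < 0 or usd_per_hour < 0:
        raise ValueError("compute_hours and usd_per_hour must be non-negative")
    return compute_hours * usd_per_hour


# Approximate figures taken from the note above; actual pricing varies by region.
CORE_TABLES_HOURS = 50
COMMIT_FILE_HOURS_MIN = 300  # "300+" in the text, so treat this as a lower bound

core_cost = estimate_import_cost(CORE_TABLES_HOURS)
commit_file_cost = estimate_import_cost(COMMIT_FILE_HOURS_MIN)

print(f"Core tables: about ${core_cost:.0f}")
print(f"CommitFile:  at least ${commit_file_cost:.0f}")
```

Pass a different `usd_per_hour` to reflect the current rate for your region.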

License

GHInsights is licensed under the MIT license.
