Big Data Analytics on Apache Spark

Introduction

Big data analytics is an area of growing interest in both business and academia. There is a lot to learn about the characteristics of big data and about how Apache Spark simplifies the analysis of very large datasets through a simple programming paradigm. In this section, we will work through and implement a simple problem using MapReduce in PySpark. Real-world problems are much more complicated, but you should be able to scale the takeaways from the simple word count example we will complete up to much bigger problems. This lesson aims to give you a broader understanding of MapReduce and big data computation in the Apache Spark environment.
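To make the word count idea concrete before any Spark is involved, here is a minimal sketch of the three MapReduce phases in plain Python. It is only an illustration on a single machine, not PySpark code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are made up for this example. In PySpark, roughly the same steps would typically be expressed on an RDD with `flatMap`, `map` and `reduceByKey`, with Spark handling the shuffle across the cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each word into a single total."""
    return {
        word: reduce(lambda a, b: a + b, counts)
        for word, counts in grouped.items()
    }


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, invented for this illustration.
sample_lines = ["spark makes big data simple", "big data needs big tools"]
counts = word_count(sample_lines)
print(counts)
```

The point of splitting the work this way is that the map and reduce steps only ever look at one line or one key at a time, which is what lets a framework such as Spark distribute them over many machines.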

Objectives

You will be able to:

  • Describe the role of Apache Spark in big data analytics
  • List some of Spark's functionalities
  • Describe the role of RDDs in Spark

For this lesson, you are required to read the following review article:

https://link.springer.com/article/10.1007/s41060-016-0027-9

International Journal of Data Science and Analytics, November 2016, Volume 1, Issue 3–4, pp 145–164. Authors: Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, Joshua Zhexue Huang

"In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics." (from the abstract)

The paper also includes a figure giving a general overview of how the Spark ecosystem functions.

You are expected to spend around 90 to 120 minutes reading this article. It is an excellent article that clearly summarizes all the key aspects of the Spark computational environment.

Summary

In this lesson, you read the scientific article "Big Data Analytics on Apache Spark", which covers the key aspects of Spark's computational environment. You'll now move on to working with Spark through Python.
