mahmoudparsian / data-algorithms-with-spark

O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian

Python 53.93% Shell 9.25% Scala 36.82%
spark pyspark data algorithms transformations partitioning-algorithms machine-learning design-patterns data-algorithms data-abstractions

data-algorithms-with-spark's Introduction

Data Algorithms with Spark by Mahmoud Parsian

"... This book will be a great resource for
both readers looking to implement existing
algorithms in a scalable fashion and readers
who are developing new, custom algorithms
using Spark. ..."

Dr. Matei Zaharia
Original Creator of Apache Spark

FOREWORD by Dr. Matei Zaharia

Software:

All programs are tested with the following software:

Spark               Python          Scala        Java
Apache Spark 3.4.0  Python 3.10.5   Scala 2.13   Java 11

Table of Contents

Chapter     Title
Glossary    Glossary of Big Data, MapReduce, Spark
Chapter 1   Introduction to Data Algorithms
Chapter 2   Transformations in Action
Chapter 3   Mapper Transformations
Chapter 4   Reductions in Spark
Chapter 5   Partitioning Data
Chapter 6   Graph Algorithms
Chapter 7   Interacting with External Data Sources
Chapter 8   Ranking Algorithms
Chapter 9   Fundamental Data Design Patterns
Chapter 10  Common Data Design Patterns
Chapter 11  Join Design Patterns
Chapter 12  Feature Engineering in PySpark
Chapter 12 Feature Engineering in PySpark

Bonus Chapters

Bonus Chapter               Title / Description
Glossary                    Glossary of Big Data, MapReduce, Spark
Word Count                  Solutions for Word Count using RDDs and DataFrames
Anagrams                    Find words that are anagrams
Lambda Expressions          Using Lambda Expressions in PySpark programs
TF-IDF                      Term Frequency - Inverse Document Frequency
K-mers                      K-mers for DNA Sequences
Correlation                 All vs. All Correlation
Mapping Partitions          mapPartitions() Complete Example
UDF                         User-Defined Function Examples
DataFrames Transformations  Examples on Creation and Transformation of DataFrames
DataFrames Tutorials        DataFrames Tutorials: from collections and CSV text files
Join Operations             Examples on join of RDDs and DataFrames
PySpark Tutorial 101        Examples on using PySpark RDDs and DataFrames
Physical Data Partitioning  Tutorial of Physical Data Partitioning
Monoids and Combiners       Monoid as a Design Principle

data-algorithms-with-spark's People

Contributors

bimanmandal, deepakmca05, mahmoudparsian, pyspark-in-action, sbheemas

data-algorithms-with-spark's Issues

Missing Files On Github

Hi,

The files for the PageRank chapter are missing on GitHub: the data files and also pagerank_2.py.

Thanks
