camilobetanieto / bigdatacomputing

Big data computing tasks conducted with PySpark. The problems involve MapReduce and Streaming algorithms.

Python 100.00%
big-data mapreduce pyspark sparkstreaming

Big Data Computing

Homework assignments completed for the Big Data Computing course I took during my master's degree.

There were three assignments designed to build familiarity with PySpark in a big data context: the first two dealt with MapReduce algorithms, and the third focused on the Spark Streaming API. The homework statements and the initial code templates were provided by the course professors during the first semester of 2023.

Homework 1:

The task was to implement two MapReduce algorithms that approximate the number of distinct triangles in an undirected graph: one relies on the partitions created by Spark, and the other on a partition determined by a hash function. Triangle counting is a useful primitive in several settings, including social network analysis and web spam detection. For this first assignment, we were instructed to run the algorithms locally.
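To make the hash-based idea concrete, here is a minimal sketch in plain Python (no Spark), under stated assumptions. It assumes a node-coloring scheme: every vertex gets one of `num_colors` colors from a linear hash, only edges whose endpoints share a color are kept, triangles are counted inside each color class, and the sum is scaled by `num_colors ** 2`. The function names, the parameters `a` and `b`, and the default `prime=8191` are illustrative choices, not taken from the course templates or from this repository's code.

```python
from collections import defaultdict


def count_triangles(edges):
    """Exactly count the triangles in a small undirected edge list."""
    # Normalise so each undirected edge appears once, without self-loops.
    unique_edges = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    neighbors = defaultdict(set)
    for u, v in unique_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Common neighbors of an edge's endpoints close a triangle on that edge;
    # every triangle is seen once from each of its three edges.
    closed = sum(len(neighbors[u] & neighbors[v]) for u, v in unique_edges)
    return closed // 3


def approx_triangles(edges, num_colors, a, b, prime=8191):
    """Estimate the triangle count via a hash-based vertex coloring (assumed scheme)."""

    def color(vertex):
        # Hypothetical linear hash; vertex ids are assumed to be integers.
        return ((a * vertex + b) % prime) % num_colors

    # "Map" step: keep only monochromatic edges, grouped by their color.
    buckets = defaultdict(list)
    for u, v in edges:
        if color(u) == color(v):
            buckets[color(u)].append((u, v))

    # "Reduce" step: count triangles per color class, then rescale.
    partial_counts = [count_triangles(group) for group in buckets.values()]
    return num_colors ** 2 * sum(partial_counts)
```

The scaling factor comes from the fact that, under an ideal random coloring, a triangle survives only if its second and third vertices match the color of the first, which happens with probability `1 / num_colors ** 2`. With a single color nothing is discarded, so the estimate equals the exact count.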

Homework 2:

This assignment built on homework 1: it required a performance comparison between one of the approximate algorithms and a different MapReduce algorithm that computes the exact count. We also had to run the code on larger datasets than before (the Orkut social network data, with 117 million edges) using the CloudVeneto cluster, a cloud infrastructure provided by the university.
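The text above does not say which exact algorithm was used. As an illustration only, the following plain-Python sketch shows one well-known way to obtain an exact count in a MapReduce style: each edge is replicated to one key per possible "third color", and each group counts only the triangles whose sorted color triple equals its key. The function name, the hash parameters `a` and `b`, and `prime=8191` are assumptions made for this example.

```python
from collections import defaultdict


def exact_triangles(edges, num_colors, a, b, prime=8191):
    """Exact triangle count via color-triple keys (illustrative scheme)."""

    def color(vertex):
        # Hypothetical linear hash; vertex ids are assumed to be integers.
        return ((a * vertex + b) % prime) % num_colors

    # Map phase: send each edge to one key per candidate color of the
    # missing third vertex. Keys are sorted color triples.
    groups = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        edge = (min(u, v), max(u, v))
        for third in range(num_colors):
            key = tuple(sorted((color(u), color(v), third)))
            groups[key].add(edge)

    # Reduce phase: inside each group, count the triangles whose own sorted
    # color triple matches the group key, so each triangle is counted once.
    total = 0
    for key, group_edges in groups.items():
        neighbors = defaultdict(set)
        for u, v in group_edges:
            neighbors[u].add(v)
            neighbors[v].add(u)
        for u, v in group_edges:  # u < v by construction
            for w in neighbors[u] & neighbors[v]:
                # Requiring u < v < w visits each triangle from one edge only.
                if w > v and tuple(sorted((color(u), color(v), color(w)))) == key:
                    total += 1
    return total
```

Unlike a sampling-based estimate, this scheme discards no edges: every edge is copied `num_colors` times, so the result is exact at the cost of extra data movement. That trade-off is the kind of thing a performance comparison between the two approaches would measure.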

Homework 3:

The final assignment used the Spark Streaming API to process a continuous stream of integer items. The core of the solution was a count-sketch, a space-efficient data structure that approximates the individual item frequencies and the second moment of the stream. This technique is useful for gathering statistics from sources that generate high-volume data streams, such as sensor networks, the Internet of Things, or online auctions.
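A count-sketch can be sketched in a few lines of plain Python, independent of Spark. The version below is a minimal illustration, not the repository's implementation: the class name, the `depth` and `width` parameters, the seeded random hash coefficients, and `prime=8191` are all assumptions. Each of the `depth` rows has a bucket hash and a sign hash; an item adds its sign to one counter per row. A frequency is estimated as the median over rows of sign times counter, and the second moment as the median over rows of the sum of squared counters.

```python
import random
from statistics import median


class CountSketch:
    """Minimal count-sketch for integer items (illustrative parameters)."""

    def __init__(self, depth, width, seed=0, prime=8191):
        rng = random.Random(seed)
        self.depth = depth
        self.width = width
        self.prime = prime
        # One (a, b) pair per row for the bucket hash and for the sign hash.
        self.bucket_params = [
            (rng.randrange(1, prime), rng.randrange(prime)) for _ in range(depth)
        ]
        self.sign_params = [
            (rng.randrange(1, prime), rng.randrange(prime)) for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.bucket_params[row]
        return ((a * item + b) % self.prime) % self.width

    def _sign(self, row, item):
        a, b = self.sign_params[row]
        return 1 if ((a * item + b) % self.prime) % 2 == 0 else -1

    def add(self, item, count=1):
        """Process `count` occurrences of `item` from the stream."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += self._sign(row, item) * count

    def estimate_frequency(self, item):
        """Median over rows of sign * counter for the item's bucket."""
        return median(
            self._sign(row, item) * self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )

    def estimate_second_moment(self):
        """Median over rows of the sum of squared counters."""
        return median(sum(c * c for c in row) for row in self.table)
```

The sketch uses `depth * width` counters regardless of how many distinct items appear, which is what makes it space-efficient. In a streaming job, `add` would be called for every item of every incoming batch; when the stream contains a single distinct item, both estimates are exact because no collisions can occur.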
