mariemzayn18 / xl-data

This project forked from aymanreda56/xl-data

This is a Big Data project that uses MapReduce (MR) techniques. It aims to draw conclusions and drive a managerial decision by analyzing a large number of applications published on the Google Play Store.

Python 0.75% Jupyter Notebook 99.25%

XL-Data

Analysis of Google Play Store applications using PySpark and PySpark ML

This is our Big Data course project, built with MapReduce (MR) techniques. It aims to draw conclusions and drive a managerial decision by analyzing a large number of applications published on the Google Play Store. Our framework of choice was Spark, through its Python API, PySpark. The data went through a very long pipeline: cleaning, preprocessing, transformations, EDA, many Map-Reduce jobs, heavy clustering, AI modeling, and finally decision making. The dataset didn't live up to our expectations in various ways, but we worked our way around those obstacles.

What are we doing? 👻

We are helping a company develop a new, profitable app.

We want to choose the price that maximizes the company's profit, the most suitable developers, the app's stance on ads, and the exact category the app should target.

Problem Definition 🤔

If a company wants to develop a new app, what is the best way to build it so that it stays highly profitable and highly rated? We also predict the best price for the app (if it is paid) and its number of installs based on its features. Lastly, if the company wants to hire new mobile app developers, we can help it identify those whose apps have the highest ratings and install counts.

Dataset Source 👓

This dataset was scraped by a Python script running in the cloud (we didn't scrape it ourselves; we downloaded it from here).

Pipeline 📈

  1. Data preprocessing and cleansing.
  2. Data exploration: involves visualization to extract knowledge from the data.
  3. Descriptive analysis: using Map-Reduce.
  4. Diagnostic analysis: using Pearson's and Spearman's correlation.
  5. Clustering to gain insights about the data: using K-means, K-Medoids or ISODATA.
  6. Model training and validation for prediction and classification: using SVM, LR or Decision Trees, plus K-Fold.
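To make step 4 concrete, here is a minimal plain-Python sketch of the two correlation measures. It is not the project's code: the column names and values are hypothetical toy data, and the project itself ran on PySpark (where `pyspark.ml.stat.Correlation.corr` offers both methods; whether the project used it is not stated here). Spearman's coefficient is computed as Pearson's coefficient on the ranks, with tied values sharing their average rank.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy columns, for illustration only.
installs = [100, 1000, 10000, 100000, 1000000]
ratings = [3.0, 3.5, 3.4, 4.2, 4.6]

print("pearson :", pearson(installs, ratings))
print("spearman:", spearman(installs, ratings))
```

Pearson measures linear association, so it is sensitive to the heavy skew typical of install counts; Spearman only looks at the ordering, which is why the two are often reported side by side.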

Let's skip to the final results 😅

Results 👀

As a manager, you should:

Choose a Category from this list:

  • Art & Design
  • Games
  • Role-Playing
  • Photography
  • Comics

It is better to launch the app as free, then make it paid after a year.

Hire a Development group from this list:

  • PT. Teknologi Usaha Sukses Bersama
  • Petar Marković
  • Rmapps
  • GameWriterStudio
  • 인디사이드게임즈

Also:

  • Ads are optional, but we prefer not to support ads…
  • If the app is paid, it is better to keep the price under $4

This predicts:

  • Average number of installs = 27k
  • Average rating = 3.4 (assuming more than 2,000 reviewers)
  • App's price = $3.2 (if it was free at launch)

With a confidence level of 99.999999%, you will be a millionaire in just 3 hours

Now for the boring details :trollface:

You can also find them in our document and our presentation.

In this project, we:

  • Collected the dataset
  • Installed PySpark and all its dependencies
  • Preprocessed and cleaned the dataset
  • Performed EDA using PySpark's low-level Map-Reduce functions
  • Used RDDs whenever possible
  • Performed diagnostic analysis based on the EDA
  • Answered some predictive questions
  • Clustered the data
  • Trained ML models
  • Derived business intelligence
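The low-level Map-Reduce style of EDA mentioned in the list above can be illustrated with a small plain-Python sketch. It is an assumption-laden illustration, not the project's notebook code: the rows and the `(category, rating)` layout are hypothetical, and the three phases are written out by hand so the data flow is visible. On a PySpark RDD the same pattern would typically be expressed as `rdd.map(...).reduceByKey(...).mapValues(...)`.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical rows: (category, rating). Not taken from the real dataset.
apps = [
    ("Games", 4.0),
    ("Comics", 3.0),
    ("Games", 3.0),
    ("Photography", 5.0),
    ("Comics", 4.0),
]


def map_phase(records):
    """Emit one (key, (rating_sum, count)) pair per record."""
    return [(category, (rating, 1)) for category, rating in records]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def combine(a, b):
    """Associative, commutative merge of two (sum, count) partial results."""
    return (a[0] + b[0], a[1] + b[1])


def reduce_phase(grouped):
    """Fold each key's partial results into a single (sum, count) pair."""
    return {key: reduce(combine, values) for key, values in grouped.items()}


def average_rating_by_category(records):
    totals = reduce_phase(shuffle_phase(map_phase(records)))
    return {key: total / count for key, (total, count) in totals.items()}


print(average_rating_by_category(apps))
```

Carrying `(sum, count)` pairs instead of averaging early keeps the merge step associative and commutative, which is what lets a distributed engine combine partial results from different partitions in any order.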

Contributors

  • mariemzayn18
  • marim1611
  • aymanreda56
  • abeerhbadr