This project was forked from frontanalyticsinc/automatic-broccoli.


Automated :boom: data analysis insights from a tabular dataset


Automatic Broccoli

Because Broccoli is good for you, and so is this repo!

Why?

You need to write SQL to analyze a database. Often, the findings of an ad-hoc analysis end up as hard-coded relationships in dashboards, excluding relationships that were never found or that didn't exist at the time of the analysis. We wanted to create a data product that supplements hard-coded dashboard logic by testing, in the background, the things we didn't hard-code: a job that runs asynchronously alongside other jobs and queries, asking questions and surfacing connections and useful insights.

Prior Art

Pandas Profiling has done a lot of the heavy lifting for initial exploratory data analysis (EDA) of a dataset. It does a great job of generating profile reports in HTML format for a dataset, saving an analyst a lot of time going through that process on their own. This project stands on the shoulders of Pandas Profiling and goes a few steps further.

How is Automatic Broccoli different?

Current tools do a great job of identifying the datatype of a column (numeric, date, string, etc.). Using that knowledge, descriptive stats can be produced...and that's about where it stops.

Automatic Broccoli goes further, attempting to identify not only the type of a column but also its analytical possibilities, especially in relation to the other columns present. Currently, it can detect whether a column is binary, categorical, or continuous and then prepares a dictionary of the unique combinations of analyses that can be tested.
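
This kind of classification can be derived from simple dtype and cardinality heuristics. Below is a minimal sketch of the idea, not AutoBroccoli's actual implementation; the `classify_columns` helper and its thresholds are hypothetical:

```python
import pandas as pd

def classify_columns(df, cat_threshold=10, card_ratio=0.5):
    """Illustrative heuristic: bucket each column by analytical type.

    Hypothetical sketch, not the package's real logic. Thresholds:
    numeric columns with more than `cat_threshold` distinct values are
    treated as continuous; non-numeric columns whose distinct-value
    ratio exceeds `card_ratio` are treated as high-cardinality IDs.
    """
    types = {'binary': [], 'categorical': [], 'continuous': [],
             'date': [], 'high_cardinality': []}
    for col in df.columns:
        s = df[col].dropna()
        n_unique = s.nunique()
        numeric = pd.api.types.is_numeric_dtype(s)
        if pd.api.types.is_datetime64_any_dtype(s):
            types['date'].append(col)
        elif n_unique == 2:
            types['binary'].append(col)
        elif numeric and n_unique > cat_threshold:
            types['continuous'].append(col)
        elif not numeric and n_unique / max(len(s), 1) > card_ratio:
            types['high_cardinality'].append(col)
        else:
            types['categorical'].append(col)
    return types
```

The resulting dictionary has the same shape as the `type_dict` shown in the example below.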

Example

>>> import auto_broccoli as broc
>>> import pandas_profiling as pp

>>> ab = broc.AutoBroccoli()  # if df=None, it autogenerates its own data

>>> # Step 1: run the awesome pandas profiling
>>> pobject = pp.ProfileReport(ab.df)
>>> des = pobject.get_description()

>>> # Step 2: identify analytical types of variables
>>> analytics_df = des['variables']
>>> type_dict = ab.classify_column_types(analytics_df)
>>> type_dict
{'binary': ['active', 'nice_person'],
 'categorical': ['buyer_type', 'content_type'],
 'continuous': ['duration_percent', 'impressions', 'visits'],
 'date': ['date'],
 'high_cardinality': ['user_id']}

Notice how it classified the columns for their analytical potential. Then you can create the analytical buckets like this:

>>> # Step three: set them into analytical buckets
>>> analytics_dict = ab.create_analytical_buckets(type_dict)
>>> analytics_dict
{'bin X bin': [('active', 'nice_person')],
 'bin X cat': [['active', 'buyer_type'],
  ['active', 'content_type'],
  ['nice_person', 'buyer_type'],
  ['nice_person', 'content_type']],
 'bin X cont': [['active', 'duration_percent'],
  ['active', 'some_category'],
  ['active', 'visits'],
  ['nice_person', 'duration_percent'],
  ['nice_person', 'some_category'],
  ['nice_person', 'visits']],
 'cat X cont': [['buyer_type', 'duration_percent'],
  ['buyer_type', 'some_category'],
  ['buyer_type', 'visits'],
  ['content_type', 'duration_percent'],
  ['content_type', 'some_category'],
  ['content_type', 'visits']],
 'cont X cont': [('duration_percent', 'some_category'),
  ('duration_percent', 'visits'),
  ('some_category', 'visits')]}

An analytics dataframe can then be generated, yielding the columns that were used and the insight(s) found, if any. Other columns exist but weren't included in this example due to space constraints.

>>> # Step four: run the analytics!
>>> resultsdf = ab.auto_analysis(d_=analytics_dict)
>>> resultsdf

Or you can run all of the above in a single call:

>>> ab = broc.AutoBroccoli()
>>> resultsdf = ab.main()
>>> resultsdf
analysis analysis_type col_1 col_2 dataset date insight_text p_val
Chi-square bin X cat active buyer_type random 04-04-18 Both non-active and active have a value of 127 for Other. Spouse is the top category for both non-active at 130 and active at 137 for "buyer_type". Active and Non-Active seem to have similar frequencies. 0.931
Chi-square bin X cat active content_type random 04-04-18 Active and Non-Active are the farthest apart on "Doubleclick" in "content_type". Active maximum on Doubleclick is 105 and non-active minimum is on Doubleclick at 90. 0.6876
Chi-square bin X cat nice_person buyer_type random 04-04-18 Non-Nice_Person and Nice_Person are the farthest apart on "Me" in "buyer_type". Non-Nice_Person maximum on Me is 126 and nice_person minimum is on Me at 105. 0.1658
Chi-square bin X cat nice_person content_type random 04-04-18 Significant difference in 'non-nice_person' and 'nice_person' between 'content_type' groups. Non-Nice_Person and Nice_Person are the farthest apart on "Newspaper" in "content_type". Non-Nice_Person maximum on Newspaper is 111 and nice_person minimum is on Newspaper at 92. Nice_Person and Non-Nice_Person are the farthest apart on "Social Media" in "content_type". Nice_Person maximum on Social Media is 126 and non-nice_person minimum is on Social Media at 88. 0.0421
T-test bin X cont active duration_percent random 04-04-18 Both Active and Non-Active have a medium level of variability on "duration_percent". Active average is 0.3 and Non-Active average is 0.3. 0.8108
T-test bin X cont active impressions random 04-04-18 Both Active and Non-Active have a medium level of variability on "impressions". Active average is 3.0 and Non-Active average is 3.0. 0.9334
T-test bin X cont active visits random 04-04-18 Non-Active and Active have basically the same spread around their means of 289.37 and 284.37 on "visits". 0.2707
T-test bin X cont nice_person duration_percent random 04-04-18 Both Nice_Person and Non-Nice_Person have a medium level of variability on "duration_percent". Nice_Person average is 0.3 and Non-Nice_Person average is 0.3. 0.9254
T-test bin X cont nice_person impressions random 04-04-18 Both Nice_Person and Non-Nice_Person have a medium level of variability on "impressions". Nice_Person average is 3.0 and Non-Nice_Person average is 2.9. 0.3874
T-test bin X cont nice_person visits random 04-04-18 Non-Nice_Person and Nice_Person have basically the same spread around their means of 283.23 and 290.87 on "visits". 0.3585
Anova cat X cont buyer_type duration_percent random 04-04-18 Coming soon 0.9556
Anova cat X cont buyer_type impressions random 04-04-18 Coming soon 0.8558
Anova cat X cont buyer_type visits random 04-04-18 Coming soon 0.144
Anova cat X cont content_type duration_percent random 04-04-18 Coming soon 0.4053
Anova cat X cont content_type impressions random 04-04-18 Coming soon 0.4027
Anova cat X cont content_type visits random 04-04-18 Coming soon 0.9779
Pearson corr cont X cont duration_percent impressions random 04-04-18 Not likely a linear relationship between duration_percent and impressions, with coef of -0.02 0.4723
Pearson corr cont X cont duration_percent visits random 04-04-18 Not likely a linear relationship between duration_percent and visits, with coef of -0.01 0.8102
Pearson corr cont X cont impressions visits random 04-04-18 Not likely a linear relationship between impressions and visits, with coef of 0.05 0.0853
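
Each bucket type in the table maps onto one statistical test: chi-square for bin X cat, a two-sample t-test for bin X cont, one-way ANOVA for cat X cont, and Pearson correlation for cont X cont. A sketch of that dispatch using scipy (`run_pair_test` is a hypothetical helper, not the package's API):

```python
import pandas as pd
from scipy import stats

def run_pair_test(df, bucket, col_1, col_2):
    """Illustrative dispatch: pick the test the results table shows for
    each bucket type and return its p-value. Hypothetical helper."""
    if bucket == 'bin X cat':
        # chi-square on the contingency table of the two columns
        table = pd.crosstab(df[col_1], df[col_2])
        _, p, _, _ = stats.chi2_contingency(table)
    elif bucket == 'bin X cont':
        # independent two-sample t-test between the two binary groups
        g0, g1 = (df.loc[df[col_1] == v, col_2]
                  for v in df[col_1].unique()[:2])
        _, p = stats.ttest_ind(g0, g1)
    elif bucket == 'cat X cont':
        # one-way ANOVA across the category levels
        groups = [g[col_2].to_numpy() for _, g in df.groupby(col_1)]
        _, p = stats.f_oneway(*groups)
    else:  # 'cont X cont'
        # Pearson correlation between the two continuous columns
        _, p = stats.pearsonr(df[col_1], df[col_2])
    return p
```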

TODO:

  • Testing!
  • Better date handling
  • Add connections across (2+) tables instead of just one table
  • Set up meta database of
  • Set up input tags for dataset types such as "marketing", "social media", etc.

License

Copyright (c) 2017 Front Analytics Inc. Licensed under the MIT License.
