
data⎰describe: Pythonic EDA Accelerator for Data Science

License: Other

Dockerfile 0.40% Shell 0.08% Python 99.52%
exploratory-data-analysis pypi eda analysis data-science

data-describe's Introduction


data ⎰ describe

data-describe is a Python toolkit for Exploratory Data Analysis (EDA). It aims to accelerate data exploration and analysis by providing automated and polished analysis widgets.

For more examples of data-describe in action, see the Quick Start Tutorial.

Main Features

data-describe implements the following basic features:

| Feature | Description |
| ------- | ----------- |
| Data Summary | Curated data summary |
| Data Heatmap | Data variation and missingness heatmap |
| Correlation Matrix | Correlation heatmaps with categorical support |
| Distribution Plots | Generate histograms, violin plots, bar charts |
| Scatterplots | Generate scatterplots and evaluate with scatterplot diagnostics |
| Cluster Analysis | Automated clustering and plotting |
| Feature Ranking | Evaluate feature importance using tree models |

Extended Features

data-describe is always looking to elevate the standard for Exploratory Data Analysis. Here are just a few of the extended features already implemented:

  • Dimensionality Reduction Methods
  • Sensitive Data (PII) Redaction
  • Text Pre-processing / Topic Modeling
  • Big Data Support

Installation

data-describe can be installed using pip:

pip install data-describe

Getting Started

import data_describe as dd
help(dd)

See the User Guide for more information.
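To get a feel for what the Data Summary widget automates, here is a hand-rolled, pandas-only sketch of the kind of curated summary it produces (the column names and the exact output shape are invented for this example; run help(dd) to see the real widget names):

```python
import pandas as pd

# Toy data; column names are invented for this example.
df = pd.DataFrame({
    "age": [23, 45, 31, None],
    "income": [40000, 85000, 62000, 91000],
})

# A minimal version of the kind of curated summary the Data Summary
# widget automates (the actual widget's output will differ).
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "mean": df.mean(numeric_only=True),
})
print(summary)
```

The real widget adds polish (formatting, more statistics, plots) on top of exactly this sort of per-column bookkeeping.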

Project Status

data-describe is currently in beta status.

Contributing

data-describe welcomes contributions from the community.

data-describe's People

Contributors

actions-user, bobbyjacob, brianray, dependabot[bot], dvdjlaw, sheth108, terrytangyuan, truongc2, zack-soenen


data-describe's Issues

Publication readiness

To serve academic users, we should consider whether to invest in making the visualizations publication-ready and easy to use. A couple of examples of similar efforts:

A couple of open questions that @brianray mentioned earlier:

  • What is the level of effort?
  • How would that work with interactive plots?

Geospatial analysis kde plot

The clip parameter is hardcoded to "state":

ax = geoplot.kdeplot(
    df=data.geometry.centroid,
    figsize=(context.fig_width, context.fig_height),
    clip=data.dissolve("state").geometry,  # hardcoded value
    shade_lowest=False,
    cmap="viridis",
    shade=True,
    **kde_kwargs,
)

Move variables to reusable constant class

Variables/strings that appear in multiple places should be moved to a constants class so they are easier to reuse, e.g.:

  • the list of available metrics in core/scatter_plot.py and in metrics/bivariate.py. The docstrings can point to this constants class directly.
  • strings that represent file extensions in utilities/load_data.py.
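A sketch of what such a constants module could look like (class names and the constant values are illustrative placeholders, not data-describe's actual lists):

```python
# constants.py (sketch): single source of truth for repeated strings.


class ScatterMetric:
    """Metric names shared by the scatterplot and bivariate modules."""
    # Placeholder names, not the package's real metric list.
    PEARSON = "pearson"
    SPEARMAN = "spearman"

    ALL = (PEARSON, SPEARMAN)


class FileExtension:
    """Extensions recognized by the data-loading utilities."""
    CSV = ".csv"
    JSON = ".json"

    ALL = (CSV, JSON)
```

Docstrings elsewhere can then reference ScatterMetric.ALL instead of repeating the list.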

Divide dependencies into categories

Dependencies are heavy. We need to divide them into categories (required vs. optional, or by application such as geospatial vs. text) and handle exceptions properly in each module's import statements.

Auto Data Type Notebook Warnings

  • Running guess_dtypes and select_dtypes fails with "Unknown string format" and emits many warnings (string conversions and invalid literals).
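One way to make such a dtype probe quiet and robust (a sketch; guess_dtypes itself is data-describe's, this only shows the coercion pattern that avoids "Unknown string format" errors):

```python
import warnings

import pandas as pd


def looks_like_datetime(series, threshold=0.8):
    """Treat a string column as datetime only if most values parse.

    errors="coerce" turns invalid literals into NaT instead of raising,
    and the warning filter silences noisy per-value conversion warnings.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        parsed = pd.to_datetime(series, errors="coerce")
    return parsed.notna().mean() >= threshold


print(looks_like_datetime(pd.Series(["2020-01-01", "2020-02-03", "not a date"])))
```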

error in notebook "Cluster Analysis"

cluster(df, target='Target')
Mime type rendering requires nbformat>=4.2.0 but it is not installed
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    916             method = get_real_method(obj, self.print_method)
    917             if method is not None:
--> 918                 method()
    919                 return True
    920 

/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/basedatatypes.py in _ipython_display_(self)
    458 
    459         if pio.renderers.render_on_display and pio.renderers.default:
--> 460             pio.show(self)
    461         else:
    462             print(repr(self))

/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/io/_renderers.py in show(fig, renderer, validate, **kwargs)
    384         if not nbformat or LooseVersion(nbformat.__version__) < LooseVersion("4.2.0"):
    385             raise ValueError(
--> 386                 "Mime type rendering requires nbformat>=4.2.0 but it is not installed"
    387             )
    388 

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
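The usual fix is to install a recent nbformat (pip install "nbformat>=4.2.0") and restart the kernel. As a pre-flight check, the same version test plotly performs can be sketched like this:

```python
MIN_NBFORMAT = (4, 2, 0)


def nbformat_ok():
    """Return True if nbformat is importable and at least 4.2.0."""
    try:
        import nbformat
    except ImportError:
        return False
    parts = []
    for piece in nbformat.__version__.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts) >= MIN_NBFORMAT


if not nbformat_ok():
    print('Run: pip install "nbformat>=4.2.0" and restart the kernel')
```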

Support interactive visualizations

Currently, data-describe is only suitable for running in notebooks via the SDK. Should we support interactive plots?

The trade-offs and maintenance effort need to be considered. For R packages this is relatively easy, since plotly's R package converts static ggplot2 plots to interactive plots.

A question from @brianray on the Google Doc: can this be within the jupyter lab domain only?

Sensitive Data Detection

One idea to really make data-describe stand out against similar packages is the added capability of identifying sensitive data like PII and PHI. I looked into several packages, and Presidio seems to be the most mature, with a lot of support from the open-source community. Most packages rely on some combination of regex, a rules-based approach, and NER (spaCy). Let me know what you think.

  • Presidio

    • Open-sourced by Microsoft
    • Leverages Docker and Kubernetes
    • Customizable to the domain problem
    • Uses NER, patterns, formats, and checksums
    • Can be used as a standalone python package
  • PIIdetect

    • Still in early development. Uses word2vec to identify PII and can create fake text
  • scrubadub

    • Scrubs PII using regex and textblob. Slowly adding more ML approaches
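To make the comparison concrete, here is a minimal sketch of the regex/rules-based layer these packages share (the patterns are deliberately simplistic; Presidio adds NER, context words, and checksum validation on top):

```python
import re

# Toy recognizers in the style of regex-based PII detectors.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii(text):
    """Return {label: [matched strings]} for every pattern that fires."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits


print(detect_pii("Contact jane.doe@example.com, SSN 123-45-6789"))
```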

Once the sensitive data is identified, we can anonymize the data:

  • Faker

    • Can be used to generate synthetic data while maintaining distributions
    • Tutorial here
  • Trumania

    • Scenario-based random dataset generator
    • Statistical distributions from numpy and Faker are provided; can be extended with new ones.
    • In-depth tutorial here
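Even without Faker, a deterministic pseudonymizer shows the shape of the anonymization step once PII spans are identified (a sketch; Faker or Trumania would substitute realistic synthetic values instead of opaque tokens):

```python
import hashlib


def pseudonymize(value, salt="change-me"):
    """Replace a detected PII value with a stable, opaque token.

    Deterministic (same input -> same token) so joins across tables
    still work, but not reversible without the salt.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"<redacted:{digest[:8]}>"


print(pseudonymize("jane.doe@example.com"))
```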

Add a document specifying scope of the package

Identifying the scope of any package is important. We need to decide: what types of data should this package accept?

  • Signal data
  • Alarm data
  • Work order data
  • Geospatial data
  • Computer vision data
  • LIDAR data
  • Any others?

Once the data types are identified, we can define acceptable schemas and functionality for each data type.
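Once the accepted data types are written down, a small registry can enforce that scope in code (a sketch; the type names and handler signatures here are assumptions):

```python
# Map each in-scope data type to its handler; anything else fails fast.
HANDLERS = {}


def register(data_type):
    """Decorator: declare a handler for an in-scope data type."""
    def wrap(func):
        HANDLERS[data_type] = func
        return func
    return wrap


@register("geospatial")
def summarize_geospatial(data):
    return f"geospatial summary of {len(data)} records"


def summarize(data, data_type):
    if data_type not in HANDLERS:
        raise NotImplementedError(f"{data_type!r} is outside package scope")
    return HANDLERS[data_type](data)


print(summarize([1, 2, 3], "geospatial"))
```

Out-of-scope types then produce an explicit NotImplementedError rather than undefined behavior.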

branding / gh-pages / external marketing

We need to:

  • copy changes on the main page; change "Download" to "Install"
  • populate the User Guide
  • populate the release history
  • link API documentation to Read the Docs
  • add a GitHub banner
  • generate Read the Docs
  • format the main README.md for the project
