
data⎰describe: Pythonic EDA Accelerator for Data Science

License: Other

Dockerfile 0.40% Shell 0.08% Python 99.52%
exploratory-data-analysis pypi eda analysis data-science

data-describe's Introduction


data ⎰ describe

data-describe is a Python toolkit for Exploratory Data Analysis (EDA). It aims to accelerate data exploration and analysis by providing automated and polished analysis widgets.

For more examples of data-describe in action, see the Quick Start Tutorial.

Main Features

data-describe implements the following basic features:

| Feature | Description |
| ------- | ----------- |
| Data Summary | Curated data summary |
| Data Heatmap | Data variation and missingness heatmap |
| Correlation Matrix | Correlation heatmaps with categorical support |
| Distribution Plots | Generate histograms, violin plots, bar charts |
| Scatterplots | Generate scatterplots and evaluate with scatterplot diagnostics |
| Cluster Analysis | Automated clustering and plotting |
| Feature Ranking | Evaluate feature importance using tree models |

Extended Features

data-describe is always looking to elevate the standard for Exploratory Data Analysis. Here are just a few of the extended features already implemented:

  • Dimensionality Reduction Methods
  • Sensitive Data (PII) Redaction
  • Text Pre-processing / Topic Modeling
  • Big Data Support

Installation

data-describe can be installed using pip:

pip install data-describe

Getting Started

import data_describe as dd
help(dd)

See the User Guide for more information.
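To get a feel for what the Data Summary widget automates, here is a hand-rolled, pandas-only sketch of the kind of curated summary it produces (the column names and the exact output shape are invented for this example; run help(dd) to see the real widget names):

```python
import pandas as pd

# Toy data; column names are invented for this example.
df = pd.DataFrame({
    "age": [23, 45, 31, None],
    "income": [40000, 85000, 62000, 91000],
})

# A minimal version of the kind of curated summary the Data Summary
# widget automates (the actual widget's output will differ).
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "mean": df.mean(numeric_only=True),
})
print(summary)
```

The real widget adds polish (formatting, more statistics, plots) on top of exactly this sort of per-column bookkeeping.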

Project Status

data-describe is currently in beta status.

Contributing

data-describe welcomes contributions from the community.

data-describe's People

Contributors

actions-user, bobbyjacob, brianray, dependabot[bot], dvdjlaw, sheth108, terrytangyuan, truongc2, zack-soenen


data-describe's Issues

Publication readiness

To serve academic users, we should consider whether to invest in making the visualizations publication-ready and easy to use. A couple of examples of similar efforts:

A couple of open questions that @brianray mentioned earlier:

  • What is the level of effort?
  • How would that work with interactive plots?

Geospatial analysis kde plot

The clip parameter is hardcoded to "state":

ax = geoplot.kdeplot(
    df=data.geometry.centroid,
    figsize=(context.fig_width, context.fig_height),
    clip=data.dissolve("state").geometry,  # hardcoded value
    shade_lowest=False,
    cmap="viridis",
    shade=True,
    **kde_kwargs,
)

Move variables to reusable constant class

Variables/strings that appear in multiple places should be moved to a constants class so they are easier to reuse, e.g.:

  • the list of available metrics in core/scatter_plot.py and in metrics/bivariate.py. The docstrings can point to this constants class directly.
  • strings that represent file extensions in utilities/load_data.py.
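A sketch of what such a constants module could look like (class names and the constant values are illustrative placeholders, not data-describe's actual lists):

```python
# constants.py (sketch): single source of truth for repeated strings.


class ScatterMetric:
    """Metric names shared by the scatterplot and bivariate modules."""
    # Placeholder names, not the package's real metric list.
    PEARSON = "pearson"
    SPEARMAN = "spearman"

    ALL = (PEARSON, SPEARMAN)


class FileExtension:
    """Extensions recognized by the data-loading utilities."""
    CSV = ".csv"
    JSON = ".json"

    ALL = (CSV, JSON)
```

Docstrings elsewhere can then reference ScatterMetric.ALL instead of repeating the list.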

Divide dependencies into categories

Dependencies are heavy. We need to divide them into categories (required vs. optional, or by application such as geospatial vs. text) and handle exceptions properly in each module's import statements.

Auto Data Type Notebook Warnings

  • Running guess_dtypes and select_dtypes fails with "Unknown string format" and emits many warnings (string conversions and invalid literals).
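One way to make such a dtype probe quiet and robust (a sketch; guess_dtypes itself is data-describe's, this only shows the coercion pattern that avoids "Unknown string format" errors):

```python
import warnings

import pandas as pd


def looks_like_datetime(series, threshold=0.8):
    """Treat a string column as datetime only if most values parse.

    errors="coerce" turns invalid literals into NaT instead of raising,
    and the warning filter silences noisy per-value conversion warnings.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        parsed = pd.to_datetime(series, errors="coerce")
    return parsed.notna().mean() >= threshold


print(looks_like_datetime(pd.Series(["2020-01-01", "2020-02-03", "not a date"])))
```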

error in notebook "Cluster Analysis"

cluster(df, target='Target')
Mime type rendering requires nbformat>=4.2.0 but it is not installed
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    916             method = get_real_method(obj, self.print_method)
    917             if method is not None:
--> 918                 method()
    919                 return True
    920 

/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/basedatatypes.py in _ipython_display_(self)
    458 
    459         if pio.renderers.render_on_display and pio.renderers.default:
--> 460             pio.show(self)
    461         else:
    462             print(repr(self))

/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/io/_renderers.py in show(fig, renderer, validate, **kwargs)
    384         if not nbformat or LooseVersion(nbformat.__version__) < LooseVersion("4.2.0"):
    385             raise ValueError(
--> 386                 "Mime type rendering requires nbformat>=4.2.0 but it is not installed"
    387             )
    388 

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
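The usual fix is to install a recent nbformat (pip install "nbformat>=4.2.0") and restart the kernel. As a pre-flight check, the same version test plotly performs can be sketched like this:

```python
MIN_NBFORMAT = (4, 2, 0)


def nbformat_ok():
    """Return True if nbformat is importable and at least 4.2.0."""
    try:
        import nbformat
    except ImportError:
        return False
    parts = []
    for piece in nbformat.__version__.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts) >= MIN_NBFORMAT


if not nbformat_ok():
    print('Run: pip install "nbformat>=4.2.0" and restart the kernel')
```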

Support interactive visualizations

Currently, data-describe is only suitable for running in notebooks via the SDK. Should we support interactive plots?

The trade-offs and maintenance effort need to be considered. For R packages this is relatively easy, since plotly's R package converts static ggplot2 plots to interactive plots.

A question from @brianray on the Google Doc: can this be within the jupyter lab domain only?

Sensitive Data Detection

One idea to really make data-describe stand out against similar packages is the added capability of identifying sensitive data like PII and PHI. I looked into several packages, and Presidio seems to be the most mature, with a lot of support from the open-source community. Most packages rely on some combination of regex, a rules-based approach, and NER (spaCy). Let me know what you think.

  • Presidio

    • Open-sourced by Microsoft
    • Leverages Docker and Kubernetes
    • Customizable to the domain problem
    • Uses NER, patterns, formats, and checksums
    • Can be used as a standalone python package
  • PIIdetect

    • Still in early development. Uses word2vec to identify PII and can create fake text
  • scrubadub

    • Scrubs PII using regex and textblob. Slowly adding more ML approaches
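To make the comparison concrete, here is a minimal sketch of the regex/rules-based layer these packages share (the patterns are deliberately simplistic; Presidio adds NER, context words, and checksum validation on top):

```python
import re

# Toy recognizers in the style of regex-based PII detectors.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii(text):
    """Return {label: [matched strings]} for every pattern that fires."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits


print(detect_pii("Contact jane.doe@example.com, SSN 123-45-6789"))
```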

Once the sensitive data is identified, we can anonymize the data:

  • Faker

    • Can be used to generate synthetic data while maintaining distributions
    • Tutorial here
  • Trumania

    • Scenario-based random dataset generator
    • Statistical distributions from numpy and Faker are provided; can be extended with new ones.
    • In-depth tutorial here
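Even without Faker, a deterministic pseudonymizer shows the shape of the anonymization step once PII spans are identified (a sketch; Faker or Trumania would substitute realistic synthetic values instead of opaque tokens):

```python
import hashlib


def pseudonymize(value, salt="change-me"):
    """Replace a detected PII value with a stable, opaque token.

    Deterministic (same input -> same token) so joins across tables
    still work, but not reversible without the salt.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"<redacted:{digest[:8]}>"


print(pseudonymize("jane.doe@example.com"))
```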

Add a document specifying scope of the package

Identifying the scope of any package is important. We need to decide: what types of data should this package accept?

  • Signal data
  • Alarm data
  • Work order data
  • Geospatial data
  • Computer vision data
  • LIDAR data
  • Any others?

Once the data types are identified, we can define acceptable schemas and functionality for each data type.
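Once the accepted data types are written down, a small registry can enforce that scope in code (a sketch; the type names and handler signatures here are assumptions):

```python
# Map each in-scope data type to its handler; anything else fails fast.
HANDLERS = {}


def register(data_type):
    """Decorator: declare a handler for an in-scope data type."""
    def wrap(func):
        HANDLERS[data_type] = func
        return func
    return wrap


@register("geospatial")
def summarize_geospatial(data):
    return f"geospatial summary of {len(data)} records"


def summarize(data, data_type):
    if data_type not in HANDLERS:
        raise NotImplementedError(f"{data_type!r} is outside package scope")
    return HANDLERS[data_type](data)


print(summarize([1, 2, 3], "geospatial"))
```

Out-of-scope types then produce an explicit NotImplementedError rather than undefined behavior.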

branding / gh-pages / external marketing

We need to:

  • copy changes on the main page; change "Download" to "Install"
  • populate the User Guide
  • populate the release history
  • link API documentation to Read the Docs
  • add a GitHub banner
  • generate Read the Docs
  • format the main README.md for the project
