facultyai / lens

Summarise and explore Pandas DataFrames

Home Page: https://lens.readthedocs.io

License: Apache License 2.0

Python 100.00%

data-science pandas dask data-exploration data-visualisation dataframe

lens's People

zblz asmith26 sfrias anhmike mm5631 anuragsinghchaudhary actuarial-tools hercules261188 jeffamaxey

lens's Issues

Consider specifying a target variable when computing a summary

Currently all of the metrics computed are independent of a target variable or column, but if lens.summarise took the name of a column as the target variable, the output of some metrics could be more interpretable even if the target variable is not used in any kind of predictive modelling.

A good example of this could be PCA (see #14), which could plot the different categories of the target variable in different colours in 2D plots of the data transformed into the principal components. This would give a good idea of whether the target variable can be easily inferred from the available data.

lens.summarise(df) hangs

As discussed with Victor here - https://twitter.com/zblz/status/938842567160066049
Running on MacBook Pro, Sierra 10.12.6
See attached PDF export from Jupyter notebook that reproduces the issue for me.
problem_demo.pdf

Explorer.correlation_plot() uses deprecated Plotly method

Calling lens.explorer.Explorer.correlation_plot() shows the following deprecation warning:

/opt/anaconda/envs/Python3/lib/python3.6/site-packages/plotly/tools.py:1422: UserWarning:

plotly.tools.FigureFactory.create_annotated_heatmap is deprecated. Use plotly.figure_factory.create_annotated_heatmap

Consider partial and resumable computation of summaries

For large datasets where computing the summary may be expensive, it would be useful to compute only part of it, be able to explore it, and then compute other parts of it without recomputing the initial report.

The selection of which parts to compute could be by:

columns in the dataset,
metrics, or
row ranges.
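
One way to picture this is a summary that records which (column, metric) pairs it has already computed and fills in only the missing ones on later calls. The sketch below is plain Python and is not part of lens: PartialSummary, compute, and the METRICS table are hypothetical names, and a dict of lists stands in for a DataFrame. Selection by row ranges is left out to keep it short.

```python
import statistics

# Hypothetical metric registry: name -> function of a list of values.
METRICS = {
    "mean": statistics.mean,
    "min": min,
    "max": max,
}


class PartialSummary:
    """Caches results per (column, metric) so later calls only fill gaps."""

    def __init__(self, data):
        self.data = data          # dict: column name -> list of values
        self.results = {}         # (column, metric) -> value
        self.computed_now = []    # keys computed by the most recent call

    def compute(self, columns=None, metrics=None):
        columns = list(self.data) if columns is None else columns
        metrics = list(METRICS) if metrics is None else metrics
        self.computed_now = []
        for col in columns:
            for name in metrics:
                key = (col, name)
                if key in self.results:
                    continue  # computed by an earlier call, so skip it
                self.results[key] = METRICS[name](self.data[col])
                self.computed_now.append(key)
        return self.results


data = {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0]}
summary = PartialSummary(data)

summary.compute(columns=["a"], metrics=["mean"])  # first, cheap pass
first_pass = list(summary.computed_now)

summary.compute()                                 # resume: only the rest
second_pass = list(summary.computed_now)
```

The second call computes five of the six (column, metric) pairs and reuses the one from the first pass, which is the behaviour the issue asks for.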

can't install lens despite all solutions i googled

it gives me the error

error: Microsoft Visual C++ 14.0 is required. Get it with
"Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
----------------------------------------
ERROR: Command errored out with exit
status 1: 'c:\python3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] =
'"'"'C:\Users\erjan222\AppData\Local\Temp\pip-install-_qkcxb7t\accumulation-tree\setup.py'"'"';

file='"'"'C:\Users\erjan222\AppData\Local\Temp\pip-install-_qkcxb7t\accumulation-tree\setup.py'"'"';f=getattr(tokenize,
'"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\erjan222\AppData\Local\Temp\pip-record-874bit1_\install-record.txt' 
--single-version-externally-managed --compile Check the logs for full command output.

pip install -U setuptools
pip install -U wheel
!pip install lens --egg

but pip still does not know --egg option.
pip install --upgrade setuptools
but setuptools is updated

I did install the Visual Studio tools installer, but did not select any packages to install, because they all seem different from what I need.

how can i install this library lens?

How to prevent jupyter notebook truncating `lens.interactive_explore(ls)`

How can I prevent jupyter notebook from truncating the output to lens.interactive_explore(ls)?

(Sorry if this is more of a jupyter notebook question - thanks for any help!)

Use matplotlib instead of plotly in Explorer

Plotly has the advantage of producing interactive plots in a Jupyter notebook, but it does not produce easily portable plots. We should consider ways of making the exploration portable, so that a set of plots could be, e.g., batch-generated from a summary.

The obvious choice is to fall back on matplotlib, which is already used in the first place to generate many of the plots that then get converted into plotly plots.

Compatibility issue with new version of plotly?

Hi guys, getting the following error with the explorer module:

AttributeError                            Traceback (most recent call last)
<ipython-input-9-38c8dacf5299> in <module>
----> 1 explorer.correlation_plot()

~/machine_learning/.env/lib/python3.7/site-packages/lens/explorer.py in correlation_plot(self, include, exclude)
    311         """
    312         fig = plot_correlation(self.summary, include, exclude)
--> 313         self.plot_renderer(fig)
    314 
    315     def correlation(self, include=None, exclude=None):

~/machine_learning/.env/lib/python3.7/site-packages/lens/explorer.py in _render(fig, showlegend)
     44         raise ValueError(message)
     45     else:
---> 46         if not py.offline.__PLOTLY_OFFLINE_INITIALIZED:
     47             py.init_notebook_mode()
     48         return py.iplot(fig, **PLOTLY_KWS)

AttributeError: module 'plotly.offline.offline' has no attribute '__PLOTLY_OFFLINE_INITIALIZED'

I'm working in a Python virtualenv with the following setup

CPython 3.7.3

numpy 1.17.2
pandas 0.25.1
matplotlib 3.1.1
plotly 4.1.1
lens 0.4.5

system     : Linux
release    : 5.0.0-31-generic
machine    : x86_64
interpreter: 64bit

I believe the issue comes from recent updates to the plotly library, with the integration of the plotly_express JS initialisation into the Python package.

A quick fix that worked was simply removing the if statement on line 46 of explorer.py

45    else:
46        if not py.offline.__PLOTLY_OFFLINE_INITIALIZED: # remove
47            py.init_notebook_mode() # unindent
48        return py.iplot(fig, **PLOTLY_KWS)

Might be worth a pull request?
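
An alternative to deleting the check is to read the flag defensively, so the code works whether or not the attribute exists. The sketch below shows the pattern with stand-in objects rather than plotly itself: ensure_notebook_mode and the two SimpleNamespace stubs are hypothetical, and only the attribute name is taken from the traceback above.

```python
from types import SimpleNamespace

FLAG = "__PLOTLY_OFFLINE_INITIALIZED"


def ensure_notebook_mode(offline, init):
    """Call init() unless `offline` reports it is already initialised.

    getattr with a default avoids the AttributeError when the flag is absent.
    """
    if not getattr(offline, FLAG, False):
        init()


calls = []

# Stand-in for a plotly.offline module that no longer exposes the flag.
new_style = SimpleNamespace()
ensure_notebook_mode(new_style, lambda: calls.append("new"))

# Stand-in for a module that still has the flag and is already initialised.
old_style = SimpleNamespace(**{FLAG: True})
ensure_notebook_mode(old_style, lambda: calls.append("old"))
```

With this pattern the initialiser runs when the flag is missing and is skipped when the flag is present and true, so the same code path covers both library versions.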

Consider removal of t-digest computation

Right now the t-digest computation (done using a python t-digest implementation) takes most of the time in generating a summary. The initial motivation to include it was for it to contain an approximation of the histogram information, but we are also computing a fixed-bin-width histogram so it is of limited value. The t-digest information is used in the explorer for:

arbitrary bin width histograms.
building percentile functions in lens.Summary that get used to plot a CDF in lens.Explorer.

We have to consider whether these two features are important enough and whether we can use other approaches to substitute this information.

Having an adaptively binned histogram (through, e.g., Bayesian blocks) would go a long way towards replacing the t-digest for our exploration needs.

A significant advantage of a t-digest is that it can be updated in a chunked manner, but we are not currently using that.
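
To illustrate the kind of substitute being considered, approximate percentiles (and hence a CDF) can be read off a fixed-bin-width histogram by linear interpolation within bins. This is a sketch, not lens code: percentile_from_histogram is a hypothetical helper, and it assumes values are spread uniformly inside each bin, which is the approximation a t-digest is meant to improve on.

```python
from itertools import accumulate


def percentile_from_histogram(edges, counts, q):
    """Approximate the q-th percentile (0-100) from histogram bins.

    `edges` has len(counts) + 1 entries; values are assumed to be
    uniformly distributed within each bin.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = total * q / 100.0
    previous = 0
    for i, running in enumerate(accumulate(counts)):
        if target <= running and counts[i] > 0:
            # Interpolate linearly inside the bin that contains the target.
            fraction = (target - previous) / counts[i]
            return edges[i] + fraction * (edges[i + 1] - edges[i])
        previous = running
    return edges[-1]


edges = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [10, 20, 20, 50]

median = percentile_from_histogram(edges, counts, 50)
p90 = percentile_from_histogram(edges, counts, 90)
```

The accuracy depends entirely on the bin width, so this only replaces the t-digest where coarse percentiles are acceptable.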

Add versioning to summary schema

When we converge on a stable schema for the summary we should start versioning it. In the meantime, it would be useful to check that the lens version used to load a summary is the same one that generated it. This reduces the portability of the summary but avoids conflicts before a stable schema is reached.
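
A minimal sketch of such a check, assuming the summary is serialised as JSON with the generating version stored under a "_lens_version" key. The key name, the dump_summary and load_summary helpers, and the exception class are all hypothetical, not the existing lens schema or API.

```python
import json

CURRENT_VERSION = "0.4.5"  # placeholder for the installed lens version


class SummaryVersionError(ValueError):
    """Raised when a summary was written by a different version."""


def dump_summary(report, version=CURRENT_VERSION):
    """Serialise a report dict, stamping it with the generating version."""
    return json.dumps({"_lens_version": version, "report": report})


def load_summary(text, version=CURRENT_VERSION):
    """Load a report, refusing payloads written by a different version."""
    payload = json.loads(text)
    found = payload.get("_lens_version")
    if found != version:
        raise SummaryVersionError(
            "summary was generated with version {!r}, "
            "but version {!r} is loading it".format(found, version)
        )
    return payload["report"]


text = dump_summary({"columns": ["a", "b"]})
report = load_summary(text)

try:
    load_summary(dump_summary({}, version="0.4.4"))
    mismatch_detected = False
except SummaryVersionError:
    mismatch_detected = True
```

Once the schema stabilises, the strict equality test could be relaxed to compare a separate schema version instead of the package version.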

Fix interactive explore widget

Since the release of ipywidgets 6.0, injecting arbitrary JavaScript into an HTML widget is not allowed. This has broken lens.interactive_explore since it works by injecting the HTML output from an offline plotly plot generation into an HTML widget. We can consider fixing this by using plotly's GraphWidget, which adds complexity in the modifications of the plot, or completely bypassing plotly and rendering the plots with matplotlib as proposed in #8.

pandas==1.0.1 and lens.summarize(df) throws Error

The DataFrame method "get_values" doesn't exist any more.

I downgraded pandas to '0.25.0' to make it work.

The current setup.py requires pandas but doesn't specify a version.
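
Besides pinning the dependency, a small compatibility helper could bridge both pandas versions. get_values was deprecated in pandas 0.25 and removed in 1.0; to_numpy() (pandas 0.24 and later) and the .values attribute are the usual replacements. The sketch below uses duck typing so that it runs without pandas installed; as_array and the two stand-in classes are hypothetical and not part of lens.

```python
def as_array(obj):
    """Return the underlying values of a pandas-like object.

    Prefers to_numpy() (pandas >= 0.24), then .values, and only then
    the legacy get_values() that pandas 1.0 removed.
    """
    if hasattr(obj, "to_numpy"):
        return obj.to_numpy()
    if hasattr(obj, "values"):
        return obj.values
    return obj.get_values()


class ModernFrame:
    """Stand-in for a pandas >= 1.0 object (no get_values)."""

    def to_numpy(self):
        return [1, 2, 3]


class LegacyFrame:
    """Stand-in for an old object that only exposes get_values."""

    def get_values(self):
        return [4, 5, 6]


modern = as_array(ModernFrame())
legacy = as_array(LegacyFrame())
```

The simpler stopgap is a version constraint such as pandas<1.0 in the install requirements, at the cost of blocking users on newer pandas.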

Support dask distributed scheduler

The dask distributed scheduler is generally an improvement over the multiprocessing scheduler even on a single multicore machine because of its improved awareness of data locality, so we should consider adding it as an option to lens.summarise. If we find its performance better in common use cases of lens, we should set it as the default.

Widget that computes and plots parts of the Summary on-the-fly

Right now, creating and exploring a lens report is always a two-step process: lens.summarise and lens.explore or lens.interactive_explore. It would be useful to have a single function that directly creates a widget. However, in that case we want to avoid front-loading the computation, as it would be a bad user experience. A way around that would be to generate the dask graph but only compute the nodes that are needed for each pane of the widget when entering that pane. Using opportunistic caching we would avoid recomputing the whole graph up to that node, as many of the previously computed nodes would be kept in memory.
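
The compute-on-entry idea can be sketched without dask: each pane registers a function, and the result is computed the first time the pane is opened and cached afterwards. LazyPanes and its open method are hypothetical names; in lens the builders would correspond to nodes of the dask graph, and the plain dict used here would be replaced by dask's caching.

```python
class LazyPanes:
    """Computes the data behind each pane only when it is first opened."""

    def __init__(self, builders):
        self._builders = builders   # pane name -> zero-argument function
        self._cache = {}            # pane name -> computed result
        self.compute_count = 0      # how many panes were actually computed

    def open(self, name):
        if name not in self._cache:
            self.compute_count += 1
            self._cache[name] = self._builders[name]()
        return self._cache[name]


values = [3, 1, 4, 1, 5, 9, 2, 6]

panes = LazyPanes({
    "range": lambda: (min(values), max(values)),
    "total": lambda: sum(values),
})

nothing_computed_yet = panes.compute_count == 0
first = panes.open("range")   # computed on entry
again = panes.open("range")   # served from the cache
```

Creating the widget costs nothing up front, and the "total" pane is never computed unless the user opens it.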

Support dask dataframes as input to lens.summarise

Currently, lens requires a pandas dataframe as input to the lens.summarise method. This places an upper limit on the size of the dataset analysed, which must be smaller than the available memory in the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, the initial requirement prevents lens from scaling.

Ideally, lens.summarise should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions in lens.metrics, given that all of them currently take either pd.Series or pd.Dataframe as arguments. In most cases we should be able to use the dask dataframe API, but for other metrics it will be necessary to access the individual chunks and reduce the result appropriately.

Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.
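
For the metrics that need access to individual chunks, the general shape is a map step that produces a small partial result per chunk and a reduce step that combines the partials. The sketch below computes a mean and population variance this way in plain Python, with lists standing in for dataframe partitions. It does not use the dask API, and partial_stats, combine, and finalise are hypothetical names.

```python
def partial_stats(chunk):
    """Map step: per-chunk count, sum, and sum of squares."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: partial statistics add component-wise."""
    return tuple(x + y for x, y in zip(a, b))


def finalise(stats):
    """Turn the combined partials into mean and population variance."""
    n, total, total_sq = stats
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


# Three chunks of different sizes, as partitions of one column might be.
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

combined = (0, 0.0, 0.0)
for chunk in chunks:
    combined = combine(combined, partial_stats(chunk))

mean, variance = finalise(combined)
```

Only the three-number partials travel between chunks, so no chunk needs to be held in memory alongside another. The sum-of-squares formula is numerically fragile for large values; a production version would use a more stable merging formula.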

Citation

Hi,

I would like to reference your library in a paper, how should I cite it?

Improve README.rst

In addition to the current information, the README file should include:

A more detailed motivation for lens.
A link to the docs.
Information about contribution guidelines.
The Python versions supported.

Compute PCA as part of the summary

Performing dimensionality analysis with PCA as part of the summary computation would be a useful tool to better understand the data.

Improve documentation

Currently the docs consist of a tutorial rendered from a notebook, and an API reference built from the Python docstrings. We should expand the docs to include:

An introduction to lens.
A motivation of the choice of precomputing the summary vs computing it on-the-fly. Advantages of using dask?
Tutorial for computing a summary and using the Summary and Explorer classes.
Contribution guidelines.
A more clearly structured API reference.