facultyai / lens
Summarise and explore Pandas DataFrames
Home Page: https://lens.readthedocs.io
License: Apache License 2.0
Currently all of the metrics computed are independent of a target variable or column, but if lens.summarise took the name of a column as the target variable, the output of some metrics could be more interpretable even if the target variable is not used in any kind of predictive modelling.
A good example of this could be PCA (see #14), which could plot the different categories of the target variables in different colours for 2D plots of the data transformed into the principal components. This would give a good idea of whether the target variable can be easily inferred from the available data.
As discussed with Victor here - https://twitter.com/zblz/status/938842567160066049
Running on MacBook Pro, Sierra 10.12.6
See attached PDF export from Jupyter notebook that reproduces the issue for me.
problem_demo.pdf
Calling lens.explorer.Explorer.correlation_plot() shows the following deprecation warning:
/opt/anaconda/envs/Python3/lib/python3.6/site-packages/plotly/tools.py:1422: UserWarning:
plotly.tools.FigureFactory.create_annotated_heatmap is deprecated. Use plotly.figure_factory.create_annotated_heatmap
For large datasets where computing the summary may be expensive, it would be useful to compute only part of it, be able to explore it, and then compute other parts of it without recomputing the initial report.
The selection of which parts to compute could be by:
it gives me the error
error: Microsoft Visual C++ 14.0 is required. Get it with
"Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
----------------------------------------
ERROR: Command errored out with exit
status 1: 'c:\python3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] =
'"'"'C:\Users\erjan222\AppData\Local\Temp\pip-install-_qkcxb7t\accumulation-tree\setup.py'"'"';file='"'"'C:\Users\erjan222\AppData\Local\Temp\pip-install-_qkcxb7t\accumulation-tree\setup.py'"'"';f=getattr(tokenize,
'"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\erjan222\AppData\Local\Temp\pip-record-874bit1_\install-record.txt'
--single-version-externally-managed --compile Check the logs for full command output.
pip install -U setuptools
pip install -U wheel
!pip install lens --egg
but pip still does not know --egg option.
pip install --upgrade setuptools
but setuptools is updated
I did install the Visual Studio installer, but did not select any packages to install, because they all seem different from what I need.
How can I install this library, lens?
Plotly has the advantage of resulting in interactive plots in a Jupyter notebook, but it does not result in easily portable plots. We should consider ways of making the exploration portable, so that a set of plots could be, e.g., batch-generated from a summary.
The obvious choice is to fall back on matplotlib, which is already used in the first place to generate many of the plots that then get converted into plotly plots.
Hi guys, getting the following error with the explorer module:
AttributeError Traceback (most recent call last)
<ipython-input-9-38c8dacf5299> in <module>
----> 1 explorer.correlation_plot()
~/machine_learning/.env/lib/python3.7/site-packages/lens/explorer.py in correlation_plot(self, include, exclude)
311 """
312 fig = plot_correlation(self.summary, include, exclude)
--> 313 self.plot_renderer(fig)
314
315 def correlation(self, include=None, exclude=None):
~/machine_learning/.env/lib/python3.7/site-packages/lens/explorer.py in _render(fig, showlegend)
44 raise ValueError(message)
45 else:
---> 46 if not py.offline.__PLOTLY_OFFLINE_INITIALIZED:
47 py.init_notebook_mode()
48 return py.iplot(fig, **PLOTLY_KWS)
AttributeError: module 'plotly.offline.offline' has no attribute '__PLOTLY_OFFLINE_INITIALIZED'
I'm working in a Python virtualenv with the following setup:
CPython 3.7.3
numpy 1.17.2
pandas 0.25.1
matplotlib 3.1.1
plotly 4.1.1
lens 0.4.5
system : Linux
release : 5.0.0-31-generic
machine : x86_64
interpreter: 64bit
I believe the issue comes from recent updates to the plotly library, with the integration of the plotly_express JS initialization into the Python package.
A quick fix that worked was simply removing the if statement on line 46 of explorer.py:
45 else:
46 if not py.offline.__PLOTLY_OFFLINE_INITIALIZED: # remove
47 py.init_notebook_mode() # unindent
48 return py.iplot(fig, **PLOTLY_KWS)
Might be worth a pull request?
Right now the t-digest computation (done using a python t-digest implementation) takes most of the time in generating a summary. The initial motivation to include it was for it to contain an approximation of the histogram information, but we are also computing a fixed-bin-width histogram so it is of limited value. The t-digest information is used in the explorer for:
lens.Summary
that get used to plot a CDF in lens.Explorer.
We have to consider whether these two features are important enough and whether we can use other approaches to substitute this information.
Having an adaptively binned histogram (through, e.g., Bayesian blocks) would go a long way to replacing the t-digest for our exploration needs.
A significant advantage of a t-digest is that it can be updated in a chunked manner, but we are not currently using that.
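As one possible substitute for the t-digest when plotting a CDF, the cumulative distribution could be approximated from histogram counts that are already computed. The following is a minimal sketch of that idea in plain Python; cdf_from_histogram is a hypothetical helper, not a lens function, and the resolution of the result is limited by the bin width.

```python
from itertools import accumulate


def cdf_from_histogram(counts, bin_edges):
    """Approximate CDF values at each bin edge from histogram counts."""
    if len(bin_edges) != len(counts) + 1:
        raise ValueError("bin_edges must have one more entry than counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    # Running sum of counts gives the fraction of data at or below
    # the right edge of each bin.
    cumulative = [c / total for c in accumulate(counts)]
    # The CDF is zero at the left-most edge.
    return list(zip(bin_edges, [0.0] + cumulative))


points = cdf_from_histogram([1, 1, 2], [0, 1, 2, 3])
```

The same function works for adaptively binned histograms, since it only assumes that the edges are sorted.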
When we converge on a stable schema for the summary we should start versioning it. In the meantime, it would be useful to check that the lens version used to load a schema is the same one that generated it. This reduces portability of the summary but reduces conflicts before reaching a stable schema.
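The version check described above could be sketched as follows. This assumes a JSON-serialised report and a hypothetical "_lens_version" key; neither the key nor the dump_summary and load_summary helpers are part of lens, and the real on-disk format may differ.

```python
import json


def dump_summary(report, lens_version):
    """Serialise a report together with the version that produced it."""
    return json.dumps({"_lens_version": lens_version, "report": report})


def load_summary(payload, lens_version):
    """Load a report, refusing summaries written by another version."""
    data = json.loads(payload)
    saved_version = data.get("_lens_version")
    if saved_version != lens_version:
        raise ValueError(
            "Summary was generated with lens {!r} but is being loaded "
            "with lens {!r}".format(saved_version, lens_version)
        )
    return data["report"]


payload = dump_summary({"columns": ["a", "b"]}, "0.4.5")
report = load_summary(payload, "0.4.5")
```

Loading the same payload with any other version string raises a ValueError rather than silently returning a report with a possibly incompatible schema.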
Since the release of ipywidgets 6.0, injecting arbitrary JavaScript into an HTML widget is not allowed. This has broken lens.interactive_explore, since it works by injecting the HTML output from an offline plotly plot generation into an HTML widget. We can consider fixing this by using plotly's GraphWidget, which adds complexity in the modifications of the plot, or completely bypassing plotly and rendering the plots with matplotlib as proposed in #8.
The DataFrame method "get_values" doesn't exist any more.
I downgraded pandas to '0.25.0' to make it work.
The current setup.py requires pandas but doesn't specify a version.
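As far as I know, get_values was deprecated in pandas 0.25 and removed in 1.0, with to_numpy (available since pandas 0.24) as its replacement. A small compatibility shim along these lines might avoid pinning pandas; frame_values is a hypothetical helper and not part of lens.

```python
def frame_values(frame):
    """Return the underlying array of a DataFrame or Series.

    Prefers to_numpy() and falls back to get_values() on pandas
    releases that predate it.
    """
    if hasattr(frame, "to_numpy"):
        return frame.to_numpy()
    return frame.get_values()


# Intended usage, assuming pandas is installed:
#   frame_values(pd.DataFrame({"a": [1, 2]}))
```

The helper is duck-typed, so it does not import pandas itself and works with whichever release is installed.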
The dask distributed scheduler is generally an improvement over the multiprocessing scheduler even on individual multicore machines because of its improved awareness of data locality, so we should consider adding it as an option to lens.summarise. If we find its performance better in common use cases of lens, set it as default.
Right now, creating and exploring a lens report is always a two-step process: lens.summarise and lens.explore or lens.interactive_explore. It would be useful to have a single function that directly creates a widget. However, in that case we want to avoid the front-loading of computation as it would be a bad user experience. A way around that would be to generate the dask graph but only compute the nodes that are needed for each pane of the widget when entering that pane. Using opportunistic caching we would avoid recomputing all the graph up to that node as many of the previously computed nodes would be kept in memory.
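The compute-on-entry idea can be illustrated without dask. The sketch below is a plain-Python stand-in, not the proposed implementation: LazyPanes is a hypothetical class, each pane is represented by a zero-argument callable instead of a node in a dask graph, and a dictionary plays the role of the opportunistic cache.

```python
class LazyPanes:
    """Compute each pane's data on first access and cache the result."""

    def __init__(self, computations):
        # Mapping of pane name -> zero-argument callable producing its data.
        self._computations = dict(computations)
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._computations[name]()
        return self._cache[name]


calls = []


def expensive_correlation():
    # Records each invocation so repeated work would be visible.
    calls.append("correlation")
    return [[1.0, 0.2], [0.2, 1.0]]


panes = LazyPanes({"correlation": expensive_correlation})
first = panes.get("correlation")   # computed when the pane is entered
second = panes.get("correlation")  # served from the cache
```

With a real dask graph, shared upstream nodes would also need caching so that panes depending on the same intermediate results do not recompute them.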
Currently, lens requires a pandas dataframe as input to the lens.summarise method. This places an upper limit on the size of the dataset analysed, which must be smaller than the available memory in the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, the initial requirement prevents lens from scaling.
Ideally, lens.summarise should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions in lens.metrics, given that all of them currently take either pd.Series or pd.DataFrame as arguments. In most cases we should be able to use the dask dataframe API, but for other metrics it will be necessary to access the individual chunks and reduce the result appropriately.
Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.
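The "access the individual chunks and reduce" pattern can be sketched in plain Python. This is only an illustration of the map-then-combine structure, with lists standing in for dataframe partitions; chunk_stats, combine and summarise_chunks are hypothetical names and not functions in lens.metrics.

```python
from functools import reduce


def chunk_stats(chunk):
    """Partial statistics for one non-empty chunk of numeric values."""
    return {
        "count": len(chunk),
        "sum": sum(chunk),
        "min": min(chunk),
        "max": max(chunk),
    }


def combine(left, right):
    """Merge the partial statistics of two chunks."""
    return {
        "count": left["count"] + right["count"],
        "sum": left["sum"] + right["sum"],
        "min": min(left["min"], right["min"]),
        "max": max(left["max"], right["max"]),
    }


def summarise_chunks(chunks):
    """Map chunk_stats over the chunks, then reduce to a single result."""
    total = reduce(combine, map(chunk_stats, chunks))
    total["mean"] = total["sum"] / total["count"]
    return total


stats = summarise_chunks([[1, 2, 3], [4, 5], [6]])
```

Only metrics whose partial results can be merged this way (counts, sums, extrema) fit the pattern directly; quantile-like metrics need a mergeable sketch or a different approach.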
Hi,
I would like to reference your library in a paper, how should I cite it?
In addition to the current information, the README file should include:
lens.
Performing dimensionality analysis with PCA as part of the summary computation would be a useful tool to better understand the data.
Currently the docs consist of a tutorial rendered from a notebook, and an API reference built from the Python docstrings. We should expand the docs to include:
Summary and Explorer classes.