renumics / spotlight

Interactively explore unstructured datasets from your dataframe.

License: MIT License

spotlight's Introduction

Renumics Spotlight

Spotlight helps you understand unstructured datasets fast. You can quickly create interactive visualizations and leverage data enrichments (e.g. embeddings, predictions, uncertainties) to identify critical clusters in your data.

Spotlight supports most unstructured data types including images, audio, text, videos, time-series and geometric data. You can start from your existing dataframe and launch Spotlight with just a few lines of code:

from renumics import spotlight

spotlight.show(df, dtype={"image": spotlight.Image, "embedding": spotlight.Embedding})

🚀 Start with a use case

Machine learning and engineering teams use Spotlight to understand and communicate on complex unstructured data problems. Here are some examples on publicly available datasets along with code snippets (👨‍💻), interactive demos (🕹️) and blog articles (📝):

Modality	Task	Description	Link
🖼️ Image	[Classification]	Find Issues in Any Image Classification Dataset	👨‍💻 📝 🕹️
		Find data issues in the CIFAR-100 image dataset	🕹️
		Fine-tuning image classification models from Bing image search	👨‍💻📝
🔊 Audio	[Classification]	Find Issues in Any Audio Classification Dataset	👨‍💻 📝🕹️
		Debug pre-trained gender detection models on the emodb dataset	📝 🕹️
		Compare gender detection models on the emodb dataset	📝 🕹️
📝 Text	[Classification]	Find Issues in Any Text Classification Dataset	👨‍💻 📝
📈🖼️ Mixed	[EDA]	Explore results from the Formula1 Montreal 2023 GP	🕹️
📈🖼️ Mixed	[EDA]	Explore a crash simulation dataset	🕹️

⏱️ Quickstart

Get started by installing Spotlight and loading your first dataset.

What you'll need

Python version 3.8-3.11

Install Spotlight via pip

pip install renumics-spotlight

We recommend installing Spotlight and everything you need to work on your data in a separate virtual environment.

Load a dataset and start exploring

import pandas as pd
from renumics import spotlight

df = pd.read_csv("https://renumics.com/data/mnist/mnist-tiny.csv")
spotlight.show(df, dtype={"image": spotlight.Image})

pd.read_csv loads a sample CSV file as a pandas DataFrame.

spotlight.show opens Spotlight in the browser with the pandas DataFrame ready for you to explore. The dtype argument specifies custom column types for the browser viewer.

Load a Hugging Face audio dataset with embeddings and a pre-defined layout

import datasets
from renumics import spotlight

ds = datasets.load_dataset('renumics/emodb-enriched', split='all')
layout = spotlight.layouts.debug_classification(
    label='gender', prediction='m1_gender_prediction', embedding='m1_embedding', features=['age', 'emotion']
)
spotlight.show(ds, layout=layout)

Here, the data types are discovered automatically from the dataset and we use a pre-defined layout for model debugging. Custom layouts can be built programmatically or via the UI.

The datasets[audio] package can be installed via pip.

Usage Tracking

We have added crash report and performance collection. We do NOT collect user data other than an anonymized machine ID obtained by py-machineid, and we only log our own actions. We do NOT collect folder names, dataset names, or row data of any kind; we collect only aggregate performance statistics (such as the total time of a table_load), crash data, etc. Collecting Spotlight crashes will help us improve stability. To opt out of the crash report collection, define an environment variable called SPOTLIGHT_OPT_OUT and set it to true, e.g. export SPOTLIGHT_OPT_OUT=true
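
For scripts and notebooks, the same opt-out can be sketched in Python. This assumes that Spotlight reads SPOTLIGHT_OPT_OUT from the process environment when it is imported, which the description above suggests but does not state, so the variable is set before the import:

```python
import os

# Assumption: the flag has to be present in the environment before Spotlight
# is imported, so it is set first.
os.environ["SPOTLIGHT_OPT_OUT"] = "true"

# from renumics import spotlight  # import Spotlight only after this point
```

Setting the variable in the shell, as shown above, remains the documented way; this is only a convenience for environments where the shell is not under your control.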

We are very happy to hear your feedback

Open an issue on Github
Have a coffee talk with us
Join our channel on Discord

Learn more about unstructured data workflows

🤗 Huggingface example spaces and datasets
🏀 Playbook for data-centric AI workflows
🍰 Sliceguard library for automatic slice detection

Contribute

We are currently participating in Hacktoberfest 2023.

If you would like to contribute to Spotlight, the easiest way is to have a look at our Contribution Docs and the CONTRIBUTING.md.

We are equally happy about non-code contributions: whether it's reporting bugs, suggesting features, contributing design ideas, or offering feedback, every non-code contribution is highly valued and helps make our project better for everyone.

spotlight's Issues

Set Filters from Layout

Is your feature request related to a problem? Please describe.
When I predefine a layout, I have a specific use case in mind that might not need all the data in the dataset but only a specific subset.
At the moment I can define a specific layout, but every time I load a dataset with that layout I have to specify all the filters I might want before I can start exploring with it.

Describe the solution you'd like
It would be a great time saver for the described problem if I could also specify active/inactive filters in the layout.

I would like to be able to also specify active/inactive filters as it would allow me to save predefined subsets in the layout that can then be toggled off and on by the user.

Describe alternatives you've considered
An alternative would be to introduce extra columns into the dataset that allow for easier filtering, e.g.

Filter1 | Filter2
True | False
False | True

Those would allow easier filtering but I would have to alter the Dataset for it and it still involves creating filters manually.

Additional context
I think this is also interesting for our playbook where you have different plays operating on the same dataset that might want to look at different subsets of the data.

UMAP bug due to Numpy/Numba conflicts

Describe the bug
Spotlight does not start because of an internal error in the UMAP package

To Reproduce
Steps to reproduce the behavior:

Create new virtual env
Create new notebook and run mnist tiny example
UMAP does not load and an error is printed to the console

Expected behavior
Normal startup and a completed Similarity Map

Desktop (please complete the following information):

Ubuntu
Spotlight 1.0.0rc10

Screenshot

Additional context
Downgrading NumPy to e.g. version 1.23.5 solves the issue. Maybe we should pin the version?

Mnist link is broken

Hello. "https://spotlight.renumics.com/data/mnist/mnist.csv" is a broken link.

Some image URLs cannot be displayed

Some Wikipedia images cannot be displayed, although the image URLs work in the browser.
E.g. first few images of the following dataset.

import datasets
from datasets import load_dataset
from renumics import spotlight
train_dset = load_dataset("alexandrainst/da-wit", split='train[:100]', data_files=['data/train-00000-of-00017-a976177c0f381298.parquet'], verification_mode=datasets.VerificationMode.NO_CHECKS)
df = train_dset.to_pandas()
spotlight.show(df, dtype={'image_url': spotlight.Image, 'embedding': spotlight.Embedding})

Safari without sound

Describe the bug
When using Spotlight in Safari, audio files produce no sound when the play button is clicked.

To Reproduce
Steps to reproduce the behavior:

Open Spotlight with an audio dataset
Click on a sample
Add audio view
Press play

Expected behavior
Sound should be played

Desktop (please complete the following information):

OS: MacOS
Browser Safari
Browser Version general problem, independent of version
Spotlight Version newest

Drag'n'Drop views in InspectorView not working

Describe the bug
I can't drag'n'drop the views in the inspector in order to change their ordering.

To Reproduce
Steps to reproduce the behavior:

Add multiple views to the inspector
Try changing the order with drag'n'drop

Expected behavior
I would expect to be able to change the order of the views

Desktop (please complete the following information):

OS: macos
Browser: chrome
Browser Version: 114.0.5735.90 (Official Build) (arm64)
Spotlight Version 1.0.0

Additional context

react-beautiful-dnd prints the following debug error message:

Unable to find draggable with id: 0e6631f7-1128-4558-9966-8e9175969c9

When I inspect the elements, I can find the draggable with the searched id:

<div data-rbd-draggable-context-id="1" data-rbd-draggable-id="0e6631f7-1128-4558-9966-8e9175969c91" tabindex="0" role="button" aria-describedby="rbd-hidden-text-1-hidden-text-1" data-rbd-drag-handle-draggable-id="0e6631f7-1128-4558-9966-8e9175969c91" data-rbd-drag-handle-context-id="1" draggable="false" class="Row__RowItemWrapper-sc-v9a0x9-0 fCdhiJ" style="position: absolute; left: 0px; top: 44px; height: 64px; width: 100%;">
  <button class="Button__StyledHTMLButton-sc-1xm6gdl-0 jqfzFA">
    <svg xmlns="http://www.w3.org/2000/svg" fill="none" viewBox="0 0 24 24" stroke="currentColor" class="X__StyledSvg-sc-1dqpkgl-0 PyruS">
      <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M6 18L18 6M6 6l12 12">
      </path>
    </svg>
  </button>
  <div class="Row__ViewNameWrapper-sc-v9a0x9-1 dfmceQ">
    <div>
      <span class="Row__ViewName-sc-v9a0x9-2 ebvlsi">
        less_categories
      </span>
    </div>
  </div>
</div>

Histogram displays misleading message when too many distinct values in a column

If I have a string column with many different values (e.g. labels for problems with many classes) and I use the stacked histogram I get the following messages:

If I select the column as the "COLUMN" in the histogram, I get the message "No column selected" instead of "too many different values"

Example: BirdClef2023 dataset, column ('primary_label')

Backend crash (but recovers) when TABLE_FILE is directory

Describe the bug
The Spotlight backend crashes (but recovers) when TABLE_FILE points to a directory instead of a file.
For me this only happens when I launch it in dev mode.

To Reproduce
Steps to reproduce the behavior:

Run TABLE_FILE=data/tables/ make dev

Expected behavior
I would expect the backend not to fail ;)

Desktop (please complete the following information):

OS: [e.g. iOS]
Browser Chrome
Browser Version 113.0.5672.126 (Official Build) (arm64)
Spotlight Version 1.0.0

Additional context

The error that appears on the backend:

File ".../spotlight/renumics/spotlight_plugins/core/api/layout.py", line 42, in reset_layout
    dataset_uid = request.app.data_source.get_uid()
                  │       └ <property object at 0x165582930>
                  └ <starlette.requests.Request object at 0x1660284c0>

AttributeError: 'NoneType' object has no attribute 'get_uid'
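
One way to avoid reaching code that assumes a loaded data source is to validate the path up front. The sketch below is not Spotlight's implementation; `resolve_table_file` is a hypothetical helper that accepts the TABLE_FILE value only if it names an existing regular file:

```python
import os
from pathlib import Path
from typing import Optional


def resolve_table_file(env_value: Optional[str]) -> Optional[Path]:
    """Hypothetical helper: return the path only if it is an existing file."""
    if not env_value:
        return None
    path = Path(env_value)
    # Directories and missing paths are rejected instead of being passed on.
    return path if path.is_file() else None


table_file = resolve_table_file(os.environ.get("TABLE_FILE"))
if table_file is None:
    print("TABLE_FILE is unset or not a file; starting without a dataset.")
```

Callers can then treat `None` as "no dataset selected" rather than failing later on a missing data source.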

Spotlight displays Embedding column as value or text in Inspector

Describe the bug
When Embedding columns are defined, they are displayed in the inspector on startup. This is confusing, looks bad, and provides little to no value.

To Reproduce
Steps to reproduce the behavior:

Add an Embedding column to a dataframe
Specify the mapping {"embedding_column": Embedding} in show via dtype.
Look at the inspector

Expected behavior
Do not display embeddings in Inspector or choose a suitable view.

Desktop (please complete the following information):

Spotlight Version 1.3.0rc6

Clear cache of external data

Is your feature request related to a problem? Please describe.
I find it difficult to clear the cache of external data.

Describe the solution you'd like
I would like to request a new method in the spotlight API that allows for the clearing of the cache of external data.

Describe alternatives you've considered
The only alternative I have considered is the current method of using the shutil module to delete the cache directory. While this works, it can be cumbersome and potentially risky.
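
The shutil workaround can be made somewhat less risky by deleting only the contents of the directory and refusing to act on anything that is not a directory. This is a minimal sketch; `clear_cache_dir` is a hypothetical helper, and the location of Spotlight's cache is not stated here, so the caller has to supply the path:

```python
import shutil
from pathlib import Path


def clear_cache_dir(cache_dir: Path) -> int:
    """Delete everything inside cache_dir but keep the directory itself.

    Returns the number of top-level entries removed.
    """
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real subdirectory: remove recursively
        else:
            entry.unlink()  # regular file or symlink
        removed += 1
    return removed
```

A built-in API method, as requested above, would still be preferable because it would know the cache location and could invalidate entries selectively.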

Additional context
I am currently using the Spotlight API with a Flask server to render Plotly figures as images for the Spotlight detail view. Therefore I use values like "http://localhost:5000/plot?viz=fig2&param={x}" in an image column.
The ability to clear the cache of external data is important for ensuring that the latest data is always displayed in the spotlight view.

Html widget to display e.g. plotly figures

Is your feature request related to a problem? Please describe.
Yes, the problem is that there is currently no built-in capability to display plots that are highly customized for specific use cases. This makes it difficult to review content that cannot be rendered by the built-in 3D renderings and 2D graph plots.

Describe the solution you'd like
I would like the ability to render div HTML code within a widget, so that content generated by Plotly and other libraries can be easily displayed. This would involve adding a new widget type that can handle HTML code as input.

Describe alternatives you've considered
One alternative is to convert the HTML code to an image or a static plot, and then display that within the widget. However, this approach would not allow for interactive features such as zooming and panning, which are often present in Plotly figures.

Additional context
Here is an example of how the new widget could be used:

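import plotly.graph_objs as go
# One Plotly figure per row, rendered to an HTML string.
df["figs"] = [go.Figure(data=[go.Scatter(x=row["x"], y=row["y"])]).to_html() for _, row in df.iterrows()]
# spotlight.HTML_DIV is the dtype proposed in this request; it does not exist yet.
spotlight.show(df, dtype={"figs": spotlight.HTML_DIV})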

Datasets containing text seem to slow down Spotlight significantly

Describe the bug
When running spotlight on datasets containing longer text it feels like this significantly slows down the UI. Concretely, hovering on data points and waiting for the hover data to highlight seems very laggy.

To Reproduce
Run the following notebook in the sliceguard library: https://github.com/Renumics/sliceguard/blob/main/examples/quickstart_mixed_data.ipynb

Calling the report function will open up Spotlight and this is where I realized it is much slower than with other datatypes.

Expected behavior
Spotlight being as fast as with images or other unstructured data types.

Desktop (please complete the following information):

OS: Ubuntu
Browser Chrome
Spotlight Version [e.g. 1.3.0]

Spotlight cannot be started under Python 3.10.9 (Miniconda)

Describe the bug
When running Spotlight 1.3.0rc3 under Ubuntu/Miniconda/Python 3.10.9 it immediately fails with this message:

Traceback (most recent call last):
  File "/home/**/code/**-test/.venv/bin/spotlight", line 5, in <module>
    from renumics.spotlight.cli import main
  File "/home/**/code/**-test/.venv/lib/python3.10/site-packages/renumics/spotlight/__init__.py", line 18, in <module>
    from .viewer import Viewer, close, viewers, show
  File "/home/**/code/**-test/.venv/lib/python3.10/site-packages/renumics/spotlight/viewer.py", line 63, in <module>
    from renumics.spotlight.layout import _LayoutLike, parse
  File "/home/**/code/**-test/.venv/lib/python3.10/site-packages/renumics/spotlight/layout/__init__.py", line 26, in <module>
    from .nodes import (
  File "/home/**/code/**-test/.venv/lib/python3.10/site-packages/renumics/spotlight/layout/nodes.py", line 15, in <module>
    from .widgets import Widget
  File "/home/**/code/**-test/.venv/lib/python3.10/site-packages/renumics/spotlight/layout/widgets.py", line 11, in <module>
    class WidgetConfig(BaseModel, allow_population_by_field_name=True):
  File "/home/**/code/**-test/.venv/lib/python3.10/site-packages/pydantic/_internal/_model_construction.py", line 124, in __new__
    cls: type[BaseModel] = super().__new__(mcs, cls_name, bases, namespace, **kwargs)  # type: ignore
  File "/home/**/miniconda3/lib/python3.10/abc.py", line 106, in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
TypeError: WidgetConfig.__init_subclass__() takes no keyword arguments

To Reproduce
Steps to reproduce the behavior:

Setup new virtualenv with Python 3.10.9 (miniconda python version)
Activate venv and run pip install renumics-spotlight==1.3.0rc3
Run spotlight

Expected behavior
It should work flawlessly.

Desktop (please complete the following information):

OS: Ubuntu 20.04, Miniconda, Python 3.10.9
Spotlight Version 1.3.0rc3

missing categories in frontend

"Order by Relevance"-Feature broken for datasets with mixed data

Describe the bug
When a dataset contains not only structured but also unstructured data, the Order By Relevance Feature is not working in version 1.3.0.

To Reproduce
Run this Demo:

https://huggingface.co/spaces/renumics/sliceguard-mixed-data

Expected behavior
Columns whose data distribution differs significantly for the selected cluster (compared to the filtered data) should be marked as relevant.

Screenshots
see demo

Desktop (please complete the following information):

OS: Ubuntu
Browser Chrome
Spotlight Version 1.3.0

Naming and ordering of strings and categorical variables in histogram

Is your feature request related to a problem? Please describe.
When plotting a histogram for string data, the order appears to be random. This makes it hard to interpret the data distribution.

When plotting categorical data, the data seems to be ordered by category index which is good, but the category name is not displayed, and the index is used for display as well in the tooltip.

Describe the solution you'd like
There are a number of cases where an order for categorical data makes sense, e.g. when the bins represent buckets for different ages (thirties, forties, fifties, ...). Here, using the category order from pandas makes sense, but the real category name should be displayed in the frontend.

There are also other cases where categoricals are unordered, here I would propose ordering the histogram by frequency in order to get the best feel for the distribution. Pandas categorical dtypes offer an "ordered" boolean flag which could serve as an indicator to distinguish the two cases: Pandas Docs

For columns that are of type string a similar strategy e.g. ordering by frequency could be implemented.
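
The proposed rule can be sketched without any Spotlight or pandas dependency. `histogram_bars` is a hypothetical function; its `categories` and `ordered` parameters stand in for the category list and the "ordered" flag of a pandas categorical dtype:

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def histogram_bars(
    values: Sequence[Hashable],
    categories: Optional[Sequence[Hashable]] = None,
    ordered: bool = False,
) -> List[Tuple[Hashable, int]]:
    """Return (label, count) pairs in the order the bars should be drawn."""
    counts = Counter(values)
    if ordered and categories is not None:
        # Ordered categorical: keep the declared order, including empty bins.
        return [(category, counts.get(category, 0)) for category in categories]
    # Unordered categorical or plain strings: most frequent first,
    # ties broken by label so the result is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))


ages = ["forties", "thirties", "forties", "fifties"]
print(histogram_bars(ages, ["thirties", "forties", "fifties"], ordered=True))
print(histogram_bars(["b", "a", "b", "c", "c", "c"]))
```

In both cases the label itself, not its index, is what gets returned for display.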

Describe alternatives you've considered
Encoding the data as type that would appear as ordered in the histogram (e.g. int) would limit the readability of the plot.

Additional context

Image: unordered categorical data with the index displayed as the name.

Support for categoricals in Similarity Map (Dimensionality Reduction)

Spotlight seems to improperly handle categorical variables in dimensionality reduction. When reducing only one categorical column, the distances appear to be sorted in alphabetical "order", so most likely, the distance metric (euclidean) does not take into account the categorical type of the variable. Note that it is in some cases also desirable to place a mix of categorical and numerical values. A strategy to also make the SimMap work on a Mix of variables would thus be desirable.

⚠️ To reproduce run:

import pandas as pd
import numpy as np
from renumics import spotlight

df = pd.DataFrame(np.random.rand(10,3), columns=["feature_a", "feature_b", "feature_c"])
df["feature_cat"] = pd.Series(["b", "b", "b", "a", "a", "d", "d", "d", "c", "c"], dtype="category")
spotlight.show(df)

Now place feature_cat on the Similarity Map.

🔢 Spotlight Version is 1.0.0pre9
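
One common way to make a Euclidean metric respect categorical values is one-hot encoding, so that every pair of distinct categories is equally far apart. The sketch below illustrates that idea only; it is not what Spotlight does, and `one_hot` is a hypothetical helper:

```python
import math
from typing import List, Sequence


def one_hot(values: Sequence[str]) -> List[List[float]]:
    """Encode each value as a vector with a single 1.0 at its category index."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0
        rows.append(row)
    return rows


encoded = one_hot(["b", "a", "d", "c"])
# "b" vs "a" and "a" vs "d" are the same distance apart, regardless of
# alphabetical order.
print(math.dist(encoded[0], encoded[1]), math.dist(encoded[1], encoded[2]))
```

For a mix of categorical and numerical columns, the encoded vectors would still need to be scaled against the numerical features, which this sketch does not address.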

`dtype` does not persist when open CSV (even the same) in UI

Describe the bug
When starting Spotlight with a dataset and a dtype defined, the predefined types get lost after the first time any dataset is opened, even the same one.

This is not a problem when opening an H5 dataset where all the types are predefined. However, when I have some CSVs to explore, I can only open one exact CSV with the given dtype. After any other dataset is opened in the UI, even the same one, the predefined types get lost. The same problem occurs when opening Spotlight in a folder with CSVs but with a predefined dtype.

To Reproduce

1. Run the script:

import pandas as pd
from renumics import spotlight


IMAGE_URL = "https://images.unsplash.com/photo-1682687980961-78fa83781450?ixlib=rb-4.0.3&ixid=M3wxMjA3fDF8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=687&q=80"


if __name__ == "__main__":
    df = pd.DataFrame({"index": range(10)})
    df["image"] = IMAGE_URL

    df.to_csv("dataset.csv", index=False)
    df.to_csv("dataset2.csv", index=False)

    spotlight.show("dataset.csv", dtype={"image": spotlight.Image})

Spotlight should correctly interpret "image" column as Image.
2. Open any dataset through filebrowser in UI.

Expected behavior

I expect dtype to persist at least applied to the initially opened dataset.
Applying dtype to all CSVs opened in the current session would also be a usable option, IMO.

DataGrid does not prevent builtin context menu from becoming visible

Describe the bug
On right-click in the DataGrid (onContextMenu), the browser's built-in context menu overlays Spotlight's context menu.

To Reproduce
Steps to reproduce the behavior:

Open a dataset in a DataGrid
Right-click on any cell

Expected behavior
I would expect the browser menu not to show when Spotlight provides its own menu.

Desktop (please complete the following information):

OS: macOS 12.3.1
Browser: chrome
Version: 112.0.5615.137 (Official Build) (arm64)

Histogram bin count and size are sometimes chosen poorly. Can be improved.

Is your feature request related to a problem? Please describe.
In some cases the histogram bin count and size are not suitable for reasoning about the current data. This happens, for example, if the overall range of the feature plotted in the histogram is much larger than the range of the currently filtered data, possibly because the range is determined by one or a few outliers.

Describe the solution you'd like
I would hope that overall, such outliers will not influence the histogram too much and that clicking "hide unfiltered" would also cause the bins to be recomputed.
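
The effect of recomputing bins on the visible data can be sketched as follows. This is an illustration with equal-width bins and made-up values, not Spotlight's binning logic; `bin_edges` is a hypothetical helper:

```python
from typing import List, Sequence


def bin_edges(values: Sequence[float], n_bins: int = 10) -> List[float]:
    """Equal-width bin edges spanning exactly the given values."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        hi = lo + 1.0  # avoid zero-width bins for constant data
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# Illustrative data: one outlier stretches the overall range.
all_values = [0.1, 0.2, 0.25, 0.3, 0.35, 50.0]
filtered = [v for v in all_values if v < 1.0]

edges_all = bin_edges(all_values, 5)       # first bin swallows every non-outlier
edges_filtered = bin_edges(filtered, 5)    # bins follow the visible data
print(edges_all)
print(edges_filtered)
```

With the edges derived from all values, every filtered point falls into the first bin; deriving them from the filtered values spreads the same points over all five bins.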

Describe alternatives you've considered

Additional context

Unfiltered histogram with an outlier that is not even visible in the histogram but determines the bin range.

Filtered histogram with only two bins making it impossible to determine distribution of "wer"

Persist Filters

Is your feature request related to a problem? Please describe.
At the moment filters are not persisted on a reload

Describe the solution you'd like
I would like the filters I arranged to be persisted and still exist after reload.

Describe alternatives you've considered
I don't think there is a feasible alternative, and user feedback suggests that filters are not used as extensively as they could be because there is a chance of losing your work on a reload.

Additional context

Stacked histogram for many class classification

The stacked histogram is very useful to understand the data distribution in classification problems. However, it is not usable for problems with many classes (>100).

It would be great to have a mechanism that supports this, for example the possibility to select some classes or to zoom in on a subset of classes.

Layout precedence on refresh

Is your feature request related to a problem? Please describe.
When trying to solve different tasks on the same dataset I would like to be able to restart Spotlight with a prepared Layout for the specific task.
However, if I altered the first layout while working on the task, the layout for the second task is not applied by default on the first refresh of the newly launched Spotlight instance.

Describe the solution you'd like
I think a great solution would be for the layout system to be aware of the layout set at Spotlight startup and of the changes I made after starting from this layout.
Having a similar workflow like this when starting with a new dataset:

start spotlight with layout1.json => spotlight shows up with layout1.json
working with spotlight and applying changes to the layout => changed-layout-1
start spotlight with layout2.json => spotlight shows up with layout2.json
working with spotlight and applying changes to the layout => changed-layout-2
start spotlight with layout1.json => spotlight shows up with the changed-layout-1
reset layout => spotlight shows up with layout1.json
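
The workflow above can be modelled as a small store that keeps, for each layout name, the pristine layout and the user's latest edits. This is a sketch of the requested behaviour, not of Spotlight's layout system; `LayoutStore` and its method names are hypothetical:

```python
import copy
from typing import Any, Dict, Optional


class LayoutStore:
    """Keeps the base layout and the user's changes per layout name."""

    def __init__(self) -> None:
        self._base: Dict[str, Any] = {}
        self._changed: Dict[str, Any] = {}
        self._active: Optional[str] = None

    def start(self, name: str, layout: Any) -> Any:
        """Start with a named layout; earlier changes to it are restored."""
        self._base[name] = copy.deepcopy(layout)
        self._active = name
        return self.current()

    def current(self) -> Any:
        if self._active is None:
            raise RuntimeError("no layout has been started")
        name = self._active
        return copy.deepcopy(self._changed.get(name, self._base[name]))

    def apply_changes(self, layout: Any) -> None:
        """Remember the user's edits for the active layout only."""
        if self._active is None:
            raise RuntimeError("no layout has been started")
        self._changed[self._active] = copy.deepcopy(layout)

    def reset(self) -> Any:
        """Drop the edits and return to the layout as it was passed in."""
        if self._active is not None:
            self._changed.pop(self._active, None)
        return self.current()
```

Keying the changes by layout name is what lets layout2.json take effect immediately while the edits made on top of layout1.json survive for later.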

Describe alternatives you've considered
I can reset the layout when layout2.json is loaded but I have to know that I have to do this and the changes I did (changed-layout-1) for the first task will be lost.

Additional context
Especially for Examples with different layouts/tasks for the same dataset this would be useful for onboarding of a new user.

Error on Filter for malformed RegEx

Describe the bug
When I try to add a new filter and enter a malformed RegEx the frontend will crash.

To Reproduce
Steps to reproduce the behavior:

Add a new filter and insert e.g. 'test'
Check the console

Expected behavior
As the RegEx is malformed I would expect the app to tell me that the filter is invalid.
At the very least, I would expect the frontend not to crash ;)

Additional context

react-dom.development.js:12056 Uncaught SyntaxError: Invalid regular expression: /test/: Nothing to repeat (at filters.ts:59:12)
at new RegExp ()
at matchString (filters.ts:59:12)
at Object.compare (filters.ts:75:35)
at PredicateFilter.apply (filter.ts:40:31)
at FilterItem.tsx:73:30
at mountMemo (react-dom.development.js:17225:19)
at Object.useMemo (react-dom.development.js:17670:16)
at useMemo (react.development.js:1650:21)
at FilterItem (FilterItem.tsx:70:27)
at renderWithHooks (react-dom.development.js:16305:18)
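
The expected behaviour amounts to compiling the pattern defensively before applying it. The frontend code is TypeScript; the Python sketch below only illustrates the idea, and the function names as well as the example pattern `*test` are chosen here for illustration:

```python
import re
from typing import Optional


def compile_filter(pattern: str) -> Optional[re.Pattern]:
    """Return a compiled pattern, or None if the expression is malformed."""
    try:
        return re.compile(pattern)
    except re.error:
        return None


def matches(pattern: str, text: str) -> Optional[bool]:
    """True/False for a valid pattern, None to signal an invalid filter."""
    compiled = compile_filter(pattern)
    if compiled is None:
        return None  # the caller can show "invalid filter" instead of crashing
    return compiled.search(text) is not None


print(matches("*test", "a test"))  # malformed: nothing to repeat
print(matches("test", "a test"))
```

Returning a distinct value for the invalid case lets the UI mark the filter as invalid while leaving the rest of the application running.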

Highlighting of data points on hover broken (histogram)

Describe the bug
I filter out a subset of data points using Spotlight's filters. After that, I hover over a histogram bin that also contains only a filtered subset of the data. This leads to Spotlight highlighting more data in the SimilarityMap view than is currently filtered. (See Screenshots)

To Reproduce
Steps to reproduce the behavior:

Filter out some data
Create a histogram and set it to "hide unfiltered"
Hover over some data and observe the similarity map -> Suddenly data points that before were grayed out should reappear

Expected behavior

Describe what you expected to happen instead of the bug.

Screenshots

Originally filtered data

Highlighted data points when hovering over already filtered data in the histogram

Desktop (please complete the following information):

OS: Ubuntu 18.04
Browser Chrome
Browser Version 108.0.5359.124 (Official Build) (64-bit)
Spotlight Version 1.1.0

Additional context

Running Spotlight in directory that contains pyproject.toml results in KeyError

Describe the bug
I run spotlight from a directory containing a pyproject.toml file. Spotlight then crashes with:

Traceback (most recent call last):
  File "/home/daniel/code/****/.venv/bin/spotlight", line 5, in <module>
    from renumics.spotlight.cli import main
  File "/home/daniel/code/****/.venv/lib/python3.8/site-packages/renumics/spotlight/__init__.py", line 26, in <module>
    __plugins__ = load_plugins()
  File "/home/daniel/code/****/.venv/lib/python3.8/site-packages/renumics/spotlight/plugin_loader.py", line 64, in load_plugins
    project = get_project_info()
  File "/home/daniel/code/****/.venv/lib/python3.8/site-packages/renumics/spotlight/develop/project.py", line 53, in get_project_info
    project_name = pyproject_content["tool"]["poetry"]["name"]
KeyError: 'poetry'

To Reproduce
Just add a pyproject.toml to the runtime directory.

Expected behavior
Run normally, as Spotlight might be used as component of other python packages, also in development.
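
A tolerant lookup would avoid the KeyError when the pyproject.toml belongs to a project that does not use Poetry. This is a sketch on an already parsed dictionary, not the actual fix; `get_poetry_project_name` is a hypothetical helper:

```python
from typing import Any, Mapping, Optional


def get_poetry_project_name(pyproject_content: Mapping[str, Any]) -> Optional[str]:
    """Return tool.poetry.name if present, otherwise None."""
    tool = pyproject_content.get("tool")
    if not isinstance(tool, Mapping):
        return None
    poetry = tool.get("poetry")
    if not isinstance(poetry, Mapping):
        return None
    name = poetry.get("name")
    return name if isinstance(name, str) else None


print(get_poetry_project_name({"tool": {"black": {"line-length": 88}}}))
print(get_poetry_project_name({"tool": {"poetry": {"name": "my-package"}}}))
```

The caller can then skip the development-project handling whenever the result is `None`.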

Screenshots

Desktop (please complete the following information):

OS: Ubuntu 20.04
Browser -
Browser Version -
Spotlight Version 1.0.0.post55

Additional context

Spotlight layout cannot be loaded with API

Describe the bug
It seems as if the "layout" argument is ignored when loading Spotlight. It doesn't load up with the right layout, but I can select "load layout" in the GUI just fine.

To Reproduce
Steps to reproduce the behavior:

Download dataset:
dataset = datasets.load_dataset("renumics/cifar100-enriched", split="all")
df = dataset.to_pandas()
df.drop(columns=['embedding', 'probabilities'], inplace=True)
Load spotlight:
spotlight.show(df, dtype={'embedding_reduced': spotlight.Embedding, 'image':spotlight.Image})
Save two different layouts
Load the layouts:
spotlight.show(df, dtype={'embedding_reduced': spotlight.Embedding, 'image':spotlight.Image}, layout="mylayout.json")
The layout is not loaded, but can be loaded in the GUI

Expected behavior
Layout should be set to saved layout

Screenshots

Desktop (please complete the following information):

OS: Linux and Windows
Browser Firefox
Spotlight Version rc10

Additional context

truncated text in detail view

Histogram crashes on stacking by categorical columns with too many distinct values

If I have a string column with many different values (e.g. labels for problems with many classes) and I use the stacked histogram I get the following messages:

If I try to stack by column, I get "Error rendering component"

Example: BirdClef2023 dataset, column ('primary_label')

Spotlight Board Test Issue

Make column names copyable

🚀 Enhancement Request

Is your enhancement request related to a problem? Please describe.

When performing feature selection on a dataset with 100+ columns that may have fairly cryptic names, copying the feature names by hand is error-prone. Copying them via the context menu or marking them as interesting finds is not possible.

Describe the solution you'd like

E.g. making the feature names copyable.

Describe alternatives you've considered

Marking interesting features for use in python code. Could also be suitable but seems complicated and really bound to the feature selection use case.

Additional context

Candidate features are generated with libraries like catch22 and span multiple different variables/curves. There can be pre-selection using feature selection techniques but the long and cryptic names potentially don't go away from this.

Spectrogram decibel scale seems to be faulty

Describe the bug
When using the Decibel Scale in Spotlight's spectrogram I realized the color mapping/scaling seems to be incorrect. Basically a lot of the data seems to be colored with the maximum value.

To Reproduce
Check out this demo which is deployed on HF Spaces or run the example yourself:

https://huggingface.co/spaces/renumics/bengaliai-audio-issues

Expected behavior
Proper scaling, highlighting also less salient sounds.

Screenshots
see interactive demo

Desktop (please complete the following information):

OS: Ubuntu
Browser: Chrome
Spotlight Version: 1.3.0

Similarity Map crashes on ApplyFilters

Describe the bug
When the Similarity Map is displayed (and data points are already mapped and showing on the map) and a filter is added/applied, it crashes.

To Reproduce
Steps to reproduce the behavior:

Open a Similarity Map and wait until it has placed the datapoints
Color by any value
Disable show filtered
Add a filter (that actually filters out already visible points)

Expected behavior
Similarity Map should be recomputed without a crash

Screenshots

Desktop (please complete the following information):

OS: macOS Monterey (12.3.1) M1
Browser: Chrome
Browser Version: 112.0.5615.137 (Official Build) (arm64)
Spotlight Version: spotlight-pro - main - 03850e8

Additional context

SimilarityMap.tsx:283 Uncaught TypeError: Cannot read properties of undefined (reading 'hex')
    at SimilarityMap.tsx:283:64
    at Array.forEach (<anonymous>)
    at SimilarityMap.tsx:282:24
    at updateMemo (react-dom.development.js:17246:19)
    at Object.useMemo (react-dom.development.js:17886:16)
    at useMemo (react.development.js:1650:21)
    at SimilarityMap (SimilarityMap.tsx:280:20)
    at renderWithHooks (react-dom.development.js:16305:18)
    at updateFunctionComponent (react-dom.development.js:19588:20)
    at beginWork (react-dom.development.js:21601:16)
react-dom.development.js:18687 The above error occurred in the <SimilarityMap> component:

    at SimilarityMap (http://localhost:5721/src/components/widgets/SimilarityMap/SimilarityMap.tsx:102:26)
    at WidgetFactory (http://localhost:5721/src/components/widgets/WidgetFactory.tsx:46:3)
    at Component (http://localhost:5721/src/components/Workspace/ComponentFactory.tsx:23:3)
    at ErrorBoundary (http://localhost:5721/node_modules/.vite/deps/flexlayout-react.js?v=73ed8dc3:3211:5)
    at div
    at Tab (http://localhost:5721/node_modules/.vite/deps/flexlayout-react.js?v=73ed8dc3:3235:11)
    at div
    at Layout (http://localhost:5721/node_modules/.vite/deps/flexlayout-react.js?v=73ed8dc3:3870:5)
    at div
    at O2 (http://localhost:5721/node_modules/.vite/deps/styled-components.js?v=73ed8dc3:1326:6)
    at Workspace (http://localhost:5721/src/components/Workspace/Workspace.tsx:62:29)
    at div
    at O2 (http://localhost:5721/node_modules/.vite/deps/styled-components.js?v=73ed8dc3:1326:6)
    at div
    at O2 (http://localhost:5721/node_modules/.vite/deps/styled-components.js?v=73ed8dc3:1326:6)
    at ContextMenuProvider (http://localhost:5721/src/components/ui/ContextMenu.tsx:48:3)
    at div
    at O2 (http://localhost:5721/node_modules/.vite/deps/styled-components.js?v=73ed8dc3:1326:6)
    at App (http://localhost:5721/src/App.tsx:47:19)

React will try to recreate this component tree from scratch using the error boundary you provided, ErrorBoundary.

Spotlight cannot handle Pandas DataFrames with Column MultiIndex

When a pandas DataFrame with a column MultiIndex is passed to spotlight.show(), Spotlight does not load the columns.
💻 To reproduce run:

import numpy as np
import pandas as pd
from renumics import spotlight

# Build a 4x4 frame whose columns form a two-level MultiIndex
df = pd.DataFrame(np.random.random((4, 4)))
df.columns = pd.MultiIndex.from_product([[1, 2], ["A", "B"]])
spotlight.show(df)

⚠️ Spotlight will warn with:

2023-03-02 13:07:29.191 | WARNING  | renumics.spotlight.plugins.core.pandas_data_source:get_columns:111 - Column '(1, 'A')' not imported from `pandas.DataFrame` because of the following error:
'tuple' object has no attribute 'startswith'
2023-03-02 13:07:29.193 | WARNING  | renumics.spotlight.plugins.core.pandas_data_source:get_columns:111 - Column '(1, 'B')' not imported from `pandas.DataFrame` because of the following error:
'tuple' object has no attribute 'startswith'
2023-03-02 13:07:29.194 | WARNING  | renumics.spotlight.plugins.core.pandas_data_source:get_columns:111 - Column '(2, 'A')' not imported from `pandas.DataFrame` because of the following error:
'tuple' object has no attribute 'startswith'
2023-03-02 13:07:29.195 | WARNING  | renumics.spotlight.plugins.core.pandas_data_source:get_columns:111 - Column '(2, 'B')' not imported from `pandas.DataFrame` because of the following error:
'tuple' object has no attribute 'startswith'

🔢 Spotlight displays version:
spotlight, version 1.0.0-pre.9
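Until MultiIndex columns are supported, one possible workaround is to flatten the column labels to strings before passing the dataframe to Spotlight. This is a minimal sketch; `flatten_columns` and the `_` separator are hypothetical choices and not part of Spotlight.

```python
import numpy as np
import pandas as pd


def flatten_columns(df, sep="_"):
    """Return a copy of df whose column labels are plain strings."""
    flat = df.copy()
    if isinstance(flat.columns, pd.MultiIndex):
        # Join the levels of each tuple label, e.g. (1, "A") -> "1_A".
        flat.columns = [sep.join(str(level) for level in col) for col in flat.columns]
    else:
        flat.columns = [str(col) for col in flat.columns]
    return flat


df = pd.DataFrame(np.random.random((4, 4)))
df.columns = pd.MultiIndex.from_product([[1, 2], ["A", "B"]])
flat_df = flatten_columns(df)
```

The flattened dataframe has plain string column names, which avoids the `startswith` call on tuples shown in the warning above.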

Selection of data issues UX not perfectly intuitive

Is your feature request related to a problem? Please describe.
In the current release candidate, selecting data/model issues requires clicking the label that shows the "support" for the found issue. This is not necessarily intuitive.

Describe the solution you'd like
Select the data points when clicking the title, and expand the issue when clicking the "expand" icon.

Cannot load layout from file via layout=layout

Describe the bug
When saving a layout from the UI and trying to load it via the following code, loading fails with a JSONDecodeError (see the traceback below):

import json

import requests
from renumics import spotlight

df_show = df.drop(columns=['embedding', 'probabilities'])
layout_url = "http://my-layout.json"
response = requests.get(layout_url)
layout = spotlight.layout.nodes.Layout(**json.loads(response.text))
spotlight.show(df_show, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding}, layout=layout)

To Reproduce
Steps to reproduce the behavior:

Expected behavior
Layout is loaded.

Screenshots
```
{'orientation': 'vertical', 'children': [{'kind': 'split', 'weight': 60, 'orientation': 'horizontal', 'children': [{'kind': 'tab', 'weight': 60, 'children': [{'kind': 'widget', 'name': 'Table', 'type': 'table', 'config': {'tableView': 'full', 'visibleColumns': None, 'sorting': None, 'orderByRelevance': False}}]}, {'kind': 'tab', 'weight': 40, 'children': [{'kind': 'widget', 'name': 'Similarity Map', 'type': 'similaritymap', 'config': {'placeBy': ['score'], 'reductionMethod': None, 'colorBy': 'run_id', 'sizeBy': None, 'filter': False, 'umapNNeighbors': 20, 'umapMetric': None, 'umapMinDist': 0.15, 'pcaNormalization': None, 'umapMenuLocalGlobalBalance': None, 'umapMenuIsAdvanced': False}}, {'kind': 'widget', 'name': 'Scatter Plot', 'type': 'scatterplot', 'config': {'xAxisColumn': 'score', 'yAxisColumn': 'beam', 'colorBy': 'run_id', 'sizeBy': None, 'filter': False}}, {'kind': 'widget', 'name': 'Histogram', 'type': 'histogram', 'config': {'columnKey': 'score', 'stackByColumnKey': 'run_id', 'filter': False}}]}]}, {'kind': 'tab', 'weight': 40, 'children': [{'kind': 'widget', 'name': 'Inspector', 'type': 'inspector', 'config': {'views': [{'view': 'MarkdownLens', 'columns': ['text'], 'name': 'view', 'key': '6e48e9eb-926b-4a35-9249-0e02d23d6a2c'}, {'view': 'MarkdownLens', 'columns': ['scored_text'], 'name': 'view', 'key': 'ad9eb104-a8d6-47ad-9b00-746135d60551'}], 'visibleColumns': 1}}]}]}
Traceback (most recent call last):
  File "/myfolder/myfile.py", line 18, in <module>
    layout = spotlight.layout.nodes.Layout(**json.loads(response.text))
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```


Desktop (please complete the following information):

OS: Arch Linux
Browser: Firefox
Browser Version [e.g. 22]
Spotlight Version: renumics-spotlight 1.3.0rc3, renumics-spotlight 1.2.0

Additional context
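The message "Expecting value: line 1 column 1 (char 0)" in the traceback typically means that the text handed to json.loads is empty or does not start with a JSON value (for example an HTML error page). A small defensive parser can make that visible before the layout is constructed. This is only a sketch; `parse_layout_text` is a hypothetical helper and not part of Spotlight.

```python
import json


def parse_layout_text(text):
    """Parse a layout JSON string and fail with a descriptive message."""
    if not text.strip():
        raise ValueError("Layout response is empty; check the URL and the HTTP status.")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        # Show the first characters so an HTML error page is easy to spot.
        preview = text[:60].replace("\n", " ")
        raise ValueError(
            f"Layout response is not valid JSON ({err.msg}); it starts with {preview!r}"
        ) from err
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object for the layout, got {type(data).__name__}.")
    return data


layout_dict = parse_layout_text('{"orientation": "vertical", "children": []}')
```

In the snippet from the report, `response.text` would be passed to this helper instead of calling json.loads directly.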

server startup timeout when started offline

Describe the bug
When I start a new server without an internet connection, Spotlight won't start and shows a timeout error.

To Reproduce
Steps to reproduce the behavior:

disable network connection
start spotlight

Expected behavior
...

Screenshots
If possible, include screenshots or screen recordings to better illustrate the issue.

Desktop (please complete the following information):

OS: macOS
Browser: Chrome
Browser Version: 113.0.5672.126 (Official Build) (arm64)
Spotlight Version: 1.0.0

Additional context
I think the reporting tool takes too long to resolve the DNS address.
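If the reporter's guess is right, a common way to keep a slow network call from delaying startup is to run it in a background daemon thread. The sketch below shows only that general pattern. It does not reflect how Spotlight's reporting is implemented; `start_in_background` and `slow_report` are hypothetical names.

```python
import threading


def start_in_background(task, *args):
    """Run task(*args) in a daemon thread so the caller is not blocked."""
    thread = threading.Thread(target=task, args=args, daemon=True)
    thread.start()
    return thread


def slow_report(release, done):
    # Stand-in for a report that hangs (e.g. on DNS resolution) while offline.
    release.wait(timeout=5.0)
    done.set()


release, done = threading.Event(), threading.Event()
worker = start_in_background(slow_report, release, done)

# Startup continues immediately; the report has not finished yet.
still_pending = not done.is_set()

# Simulate the network call finally returning.
release.set()
worker.join(timeout=5.0)
```

Because the thread is a daemon, it also does not keep the process alive if the call never returns.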

Layout is not completely loaded through API

Describe the bug
When I load the layout through the API, image widgets in the inspector view are lost. It works when the layout is loaded through the GUI.

To Reproduce
Steps to reproduce the behavior:

Check the F1 space: https://huggingface.co/spaces/renumics/f1_montreal_gp
No images are loaded in the inspector view
Load the corresponding layout
Image views are loaded

Expected behavior
Layout should be loaded correctly through the API

Screenshots

Desktop (please complete the following information):

OS: Windows
Browser: Firefox
Browser Version: 114.02
Spotlight Version: 1.3.0rc1

Additional context

Create filter based on "column" buttons in the issue view

Is your feature request related to a problem? Please describe.
I typically use filters to understand issues better. Currently, I have to select them manually.

Describe the solution you'd like
There are already clickable "column" buttons in each issue. I would like to get a filter for the column (between min and max value).

Describe alternatives you've considered
Currently, I set the filter manually.

Additional context
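The requested filter (between the minimum and maximum value of a column within an issue) can be sketched in pandas as follows. The names `range_filter`, `issue_rows` and the example column are assumptions for illustration, not Spotlight API.

```python
import pandas as pd


def range_filter(df, issue_rows, column):
    """Keep all rows whose value lies within the issue's [min, max] for column."""
    lower = df.loc[issue_rows, column].min()
    upper = df.loc[issue_rows, column].max()
    # between() is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


df = pd.DataFrame({"score": [0.1, 0.5, 0.9, 0.4, 0.7]})
issue_rows = [1, 3]  # hypothetical row indices belonging to one issue
filtered = range_filter(df, issue_rows, "score")
```

Note that such a range filter can include rows outside the issue itself whenever their values fall between the issue's minimum and maximum.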

Filtering by certain values (potentially containing brackets) results in no matches

Describe the bug
When filtering by certain specific values, Spotlight does not seem to match any samples, even though the filter was created by right-clicking the value in the table.

To Reproduce
Steps to reproduce the behavior:

Right-click on a table cell
Click "Filter"
Check whether Spotlight matches any entry. This is not the case for the data in the screenshot.

Expected behavior
All entries that match the value should be shown in Filtered Tab.

Screenshots

Desktop (please complete the following information):

OS: Ubuntu
Spotlight Version: 1.3.0rc1
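One possible cause, not confirmed by the report, is that the filter value is interpreted as a regular expression, in which case brackets form a character class instead of matching literally. The following sketch uses Python's `re` module and made-up values to show the difference between a raw and an escaped pattern.

```python
import re

# Hypothetical cell values; the first one contains brackets.
values = ["speaker [1]", "speaker [2]", "speaker 1"]
target = "speaker [1]"

# Raw pattern: "[1]" is a character class that matches the single character "1".
naive_matches = [v for v in values if re.fullmatch(target, v)]

# Escaped pattern: brackets are matched literally.
literal_matches = [v for v in values if re.fullmatch(re.escape(target), v)]
```

Here the raw pattern matches only "speaker 1", while the escaped pattern matches the bracketed value that was actually clicked.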

Allow wait=inf

When using Spotlight, I'd like to be able to share a Spotlight instance via a link. However, wait=True, the most permissive mode in that regard, kills the instance when the "last browser" is closed, so I cannot do this reliably.

I propose introducing wait=inf to make it possible to deploy a long-running Spotlight instance.

Value of Text Widget in Inspector cannot be copied

Is your feature request related to a problem? Please describe.
There is no way to select and copy text from the Text widget in the inspector. Copying is often useful, e.g. for paths, prompts, or similar.

Describe the solution you'd like
I would love to be able to select text in the Text widget and copy it.

Load layout from URL

Is your feature request related to a problem? Please describe.
Sharing layouts as JSON files over HTTPS is very flexible and a great choice, especially for tutorials and examples.
Right now, the JSON has to be downloaded manually like so:

import json

import requests
from renumics import spotlight

layout_url = "https://raw.githubusercontent.com/Renumics/spotlight/playbook_initial_draft/playbook/rookie/duplicates_Annoy_layout.json"
response = requests.get(layout_url)
layout = spotlight.layout.nodes.Layout(**json.loads(response.text))
spotlight.show(df_show, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding}, layout=layout)

Describe the solution you'd like
It would be great to be able to load a layout directly from a URL:

layout_url = "https://raw.githubusercontent.com/Renumics/spotlight/playbook_initial_draft/playbook/rookie/duplicates_Annoy_layout.json"
spotlight.show(df_show, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding}, layout=layout_url)

Describe alternatives you've considered
See workaround above

Additional context
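A helper that accepts either a local path or a URL could look roughly like the sketch below. It uses only the standard library; `read_layout_json` is a hypothetical name, and how Spotlight would integrate such a helper is left open.

```python
import json
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def read_layout_json(source, timeout=10.0):
    """Return the layout as a dict, reading from a local path or a URL."""
    if urlparse(source).scheme in ("http", "https", "file"):
        # Remote (or file://) source: fetch the raw text first.
        with urlopen(source, timeout=timeout) as response:
            text = response.read().decode("utf-8")
    else:
        # Anything else is treated as a path on disk.
        text = Path(source).read_text(encoding="utf-8")
    return json.loads(text)
```

The resulting dict could then be passed on as in the workaround above, i.e. spotlight.layout.nodes.Layout(**read_layout_json(layout_url)).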

No Error message when using 2D array as Embedding Value

Describe the bug
When using a 2D array as values in a dtype=embedding column and calling spotlight.show(), an error message "UnboundLocalError: local variable 'values' referenced before assignment" appears in the browser.

To Reproduce

import pandas as pd
import numpy as np
from renumics import spotlight

embedding_values = [np.random.rand(5, 3) for i in range(5)]

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e'], 'embedding': [x for x in embedding_values]})
spotlight.show(df, dtype={'embedding': spotlight.Embedding})

Expected behavior
The spotlight.show() function should display an appropriate error message.

Screenshots

Desktop (please complete the following information):

OS: Ubuntu 18.04
Browser Version: Firefox 112.01
Spotlight Version:  1.0.0rc10
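The kind of check that would produce an appropriate error message can be sketched as follows. This is an illustration only; `validate_embedding` and the wording of the message are assumptions, not Spotlight code.

```python
import numpy as np


def validate_embedding(value, column="embedding"):
    """Return value as a 1-D array or raise a descriptive error."""
    array = np.asarray(value)
    if array.ndim != 1:
        raise ValueError(
            f"Column '{column}': embeddings must be 1-D vectors, "
            f"but a value with shape {array.shape} was given."
        )
    return array


vector = validate_embedding([0.1, 0.2, 0.3])

try:
    validate_embedding(np.random.rand(5, 3))
except ValueError as err:
    message = str(err)
```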

Similarity map computes UMAP for 2D-embeddings

If I use a 2D-embedding, it seems that a UMAP reduction is computed. This is not necessary and slows down the visualization on large datasets.

Example: BirdClef dataset from Kaggle (https://www.kaggle.com/code/sps444/birdclef-2023-interactive-eda-windowed-data)
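The shortcut suggested here amounts to skipping the reduction whenever the embedding already has two dimensions. A minimal sketch, in which `reduce_to_2d` is a hypothetical callable standing in for UMAP or any other reduction:

```python
import numpy as np


def place_points(embeddings, reduce_to_2d):
    """Return 2-D coordinates, reducing only if the data is not already 2-D."""
    data = np.asarray(embeddings, dtype=float)
    if data.ndim == 2 and data.shape[1] == 2:
        # Already two coordinates per point: use them directly.
        return data
    return reduce_to_2d(data)


calls = []


def fake_reduction(data):
    # Records each call and keeps the first two dimensions (not a real UMAP).
    calls.append(data.shape)
    return data[:, :2]


already_2d = place_points([[0.0, 1.0], [2.0, 3.0]], fake_reduction)
reduced = place_points(np.zeros((3, 5)), fake_reduction)
```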

Histogram crashes on applying filters if filter column used in plot

Describe the bug
Plotting a histogram in Spotlight, filtering by a column that is used in the histogram view, and then hitting "hide unfiltered" results in "Error rendering component".

To Reproduce
See above.

Expected behavior
The histogram should be filtered according to the currently active filter without crashing.

Screenshots

Desktop (please complete the following information):

OS: Ubuntu 20.04
Browser: Chrome
Browser Version: 114.0.5735.90 (Official Build) (64-bit)
Spotlight Version: 1.0.0

Additional context

Choose more suitable scaling for spectrogram y-axis as a default

Is your feature request related to a problem? Please describe.
The default y-axis scaling in the spectrogram widget is logarithmic. This doesn't make much sense for many real-world use cases, such as condition monitoring (the exceptions are mostly related to music analysis, i.e. converting frequency to pitch).

Describe the solution you'd like
Do not use logarithmic scaling on the y-axis by default. Something that could make sense, though it is not a high priority for me personally, is scaling the color coding of the signal energy logarithmically (converting to dB). It could also make sense to offer the mel scale for the y-axis, so that the visual representation matches human hearing perception. Choosing the y-axis range in a synchronized way (synced between samples) could also be a useful addition.

Describe alternatives you've considered

Additional context

The default spectrogram view when rendering a speech signal. It will almost always look like this.
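For the mel scale mentioned above, a commonly used (HTK-style) conversion is mel = 2595 * log10(1 + f / 700). The sketch below computes y-axis tick frequencies that are evenly spaced on that scale. The function names and the 0 to 8000 Hz range are assumptions for illustration; other mel variants exist.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mel (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_ticks(f_min, f_max, count):
    """Return count frequencies (Hz) evenly spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (count - 1)
    return [mel_to_hz(m_min + i * step) for i in range(count)]


ticks = mel_ticks(0.0, 8000.0, 5)
```

The resulting ticks are dense at low frequencies and sparse at high ones, which is the compression of the upper range that the mel scale is meant to provide.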

Histogram not showing data distributions on hover

Describe the bug
The histogram used to show the data distribution of the currently hovered data segment (e.g. when hovering over another histogram or a data issue). Now it only highlights bins as a whole and doesn't show the distribution in the other histograms.

To Reproduce
Check out this demo:

https://huggingface.co/spaces/renumics/sliceguard-structured-data

Open two histograms and hover over the bins respectively.

Also try hovering over the data issues.

Expected behavior
When hovering over a histogram bin, the distribution of the data in that bin should be shown in the other histograms.

Screenshots
see demo

Desktop (please complete the following information):

OS: Ubuntu
Browser: Chrome
Spotlight Version: 1.3.0
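The expected behavior can be described as computing, for the rows in the hovered bin of one column, their counts over the bins of another column. A small NumPy sketch with made-up data (the variable names are illustrative, not taken from Spotlight):

```python
import numpy as np

# Two hypothetical numeric columns of the same dataset.
column_a = np.array([1, 2, 3, 4, 5, 6])
column_b = np.array([10, 20, 30, 40, 50, 60])

# Bin edges of the second histogram, shared by both count vectors.
edges_b = np.histogram_bin_edges(column_b, bins=5)

# Rows that fall into the hovered bin [1, 3) of the first histogram.
hovered = (column_a >= 1) & (column_a < 3)

all_counts, _ = np.histogram(column_b, bins=edges_b)
hovered_counts, _ = np.histogram(column_b[hovered], bins=edges_b)
```

Drawing `hovered_counts` on top of `all_counts` gives the per-bin distribution of the hovered segment in the other histogram.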

Regex filtering string columns

Filtering of string columns by regex not working as expected

Expected Behavior

When I filter a string column with an = filter containing a regular expression (e.g. ".*matching.*"),
I expect to be presented only with rows that contain "matching".

Current Behavior

Filtering is not working as expected, and I cannot pinpoint exactly how it behaves.

Steps to Reproduce

Add a column to the Spotlight dataset
Add values to the column, e.g. ["test with a matching word", "test without a matcing word"]
Try to filter out "matching" (!= ".*matching.*")

Context (Environment)

I am using Spotlight v1.0.0rc9 in Chrome on an M1 Mac

Detailed Description

See the attached picture summarizing the observed behavior and some console tests.
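For comparison, this is how the two example values behave under Python's `re` module, assuming the intended pattern is `.*matching.*` (the asterisks appear to have been lost in the text above). It only illustrates the expected semantics and says nothing about how Spotlight evaluates filters internally.

```python
import re

values = ["test with a matching word", "test without a matcing word"]
pattern = re.compile(".*matching.*")

# "=" filter: keep values that match the pattern as a whole.
equal_filter = [v for v in values if pattern.fullmatch(v)]

# "!=" filter: keep values that do not match.
not_equal_filter = [v for v in values if not pattern.fullmatch(v)]
```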

Multiple axis setting in time series view is not saved in layout

Describe the bug
When I have two time series, the "multiple axis" setting is not saved in layout.

To Reproduce
Steps to reproduce the behavior:

Check the F1 space: https://huggingface.co/spaces/renumics/f1_montreal_gp
Change the setting in the brake, distance_driver widget.
Save layout
Change the setting again
Load layout

Expected behavior
Layout should include the multiple axis setting

Screenshots

Desktop (please complete the following information):

OS: Windows
Browser: Firefox
Browser Version: 114.02
Spotlight Version: 1.3.0rc1

Additional context
