
voxel51 / fiftyone

The open-source tool for building high-quality datasets and computer vision models

Home Page: https://fiftyone.ai

License: Apache License 2.0

Python 70.75% Shell 0.10% HTML 0.04% JavaScript 2.07% TypeScript 26.53% CSS 0.47% Dockerfile 0.04% Makefile 0.01%
machine-learning artificial-intelligence deep-learning computer-vision developer-tools data-science python active-learning data-centric-ai data-cleaning

fiftyone's Introduction

 

The open-source tool for building high-quality datasets and computer vision models


Website · Docs · Try it Now · Tutorials · Examples · Blog · Community


FiftyOne


Nothing hinders the success of machine learning systems more than poor quality data. And without the right tools, improving a model can be time-consuming and inefficient.

FiftyOne supercharges your machine learning workflows by enabling you to visualize datasets and interpret models faster and more effectively.

Use FiftyOne to get hands-on with your data, including visualizing complex labels, evaluating your models, exploring scenarios of interest, identifying failure modes, finding annotation mistakes, and much more!

You can get involved by joining our Slack community, reading our blog on Medium, and following us on social media:

Slack · Medium · Twitter · LinkedIn · Facebook

Installation

You can install the latest stable version of FiftyOne via pip:

pip install fiftyone

Consult the installation guide for troubleshooting and other information about getting up-and-running with FiftyOne.

Quickstart

Dive right into FiftyOne by opening a Python shell and running the snippet below, which downloads a small dataset and launches the FiftyOne App so you can explore it:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")
session = fo.launch_app(dataset)

Then check out this Colab notebook to see some common workflows on the quickstart dataset.

Note that if you are running the above code in a script, you must include session.wait() to block execution until you close the App. See this page for more information.
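
For example, a minimal script version of the same quickstart would look like this:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")
session = fo.launch_app(dataset)

# Blocks execution until you close the App
session.wait()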

Documentation

Full documentation for FiftyOne is available at fiftyone.ai.

Examples

Check out the fiftyone-examples repository for open source and community-contributed examples of using FiftyOne.

Contributing to FiftyOne

FiftyOne is open source and community contributions are welcome!

Check out the contribution guide to learn how to get involved.

Installing from source

The instructions below are for macOS and Linux systems. Windows users may need to make adjustments. If you are working in Google Colab, skip to here.

Prerequisites

You will need:

  • Python (3.7 or newer)
  • Node.js - on Linux, we recommend using nvm to install an up-to-date version.
  • Yarn - once Node.js is installed, you can install Yarn via npm install -g yarn
  • On Linux, you will need at least the openssl and libcurl packages. On Debian-based distributions, you will need to install libcurl4 or libcurl3 instead of libcurl, depending on the age of your distribution. For example:
# Ubuntu
sudo apt install libcurl4 openssl

# Fedora
sudo dnf install libcurl openssl

Installation

We strongly recommend that you install FiftyOne in a virtual environment to maintain a clean workspace. The install script is only supported on POSIX-based systems (e.g., macOS and Linux).
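
For example, one standard way to create and activate a virtual environment first (the fiftyone-env name is arbitrary):

python3 -m venv fiftyone-env
source fiftyone-env/bin/activate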

First, clone the repository:

git clone https://github.com/voxel51/fiftyone
cd fiftyone

Then run the install script:

bash install.bash

NOTE: If you run into issues importing FiftyOne, you may need to add the path to the cloned repository to your PYTHONPATH:

export PYTHONPATH=$PYTHONPATH:/path/to/fiftyone

NOTE: The install script adds to your nvm settings in your ~/.bashrc or ~/.bash_profile, which is needed for installing and building the App.

NOTE: When you pull in new changes to the App, you will need to rebuild it, which you can do either by rerunning the install script or just running yarn build in the ./app directory.

Upgrading your source installation

To upgrade an existing source installation to the bleeding edge, simply pull the latest develop branch and rerun the install script:

git checkout develop
git pull
bash install.bash

Developer installation

If you would like to contribute to FiftyOne, you should perform a developer installation using the -d flag of the install script:

bash install.bash -d

Source installs in Google Colab

You can install from source in Google Colab by running the following in a cell and then restarting the runtime:

%%shell

git clone --depth 1 https://github.com/voxel51/fiftyone.git
cd fiftyone
bash install.bash

Docker installs

Refer to these instructions to see how to build and run Docker images containing source or release builds of FiftyOne.

UI Development on Storybook

Voxel51 is currently implementing a Storybook that contains examples of its basic UI components. You can run the current Storybook instance via yarn storybook in app/packages/components. While the Storybook instance is running, any changes to a component will trigger a refresh in the Storybook app.

cd app/packages/components
yarn storybook

Generating documentation

See the docs guide for information on building and contributing to the documentation.

Uninstallation

You can uninstall FiftyOne as follows:

pip uninstall fiftyone fiftyone-brain fiftyone-db fiftyone-desktop

Contributors

Special thanks to these amazing people for contributing to FiftyOne! 🙌

Citation

If you use FiftyOne in your research, feel free to cite the project (but only if you love it 😊):

@article{moore2020fiftyone,
  title={FiftyOne},
  author={Moore, B. E. and Corso, J. J.},
  journal={GitHub. Note: https://github.com/voxel51/fiftyone},
  year={2020}
}

fiftyone's People

Contributors

adonaivera, aidavoxel51, allenleetc, benjaminpkane, bonlime, brimoor, dependabot[bot], ehofesmann, erfantagh, findtopher, imanjra, j053y, jacobmarks, jasoncorso, kaixi-wang, kevin-dimichel, lanzhenw, lethosor, manivoxel51, nebulae, neokish, ritch, rohis06, rusteam, sashankaryal, swheaton, timmermansjoy, twsl, tylerganter, visionjo


fiftyone's Issues

Add a Dataset context manager for DB syncing

I think the most critical design challenge we have remaining is how to ensure that it is crystal clear to the user when their dataset is synced with the DB.

Some thoughts on the issue:

  • can we make sample.save() efficient enough, or is this fundamentally an antipattern for large-scale datasets?

  • an alternative batch approach is as follows:

for sample in dataset:
    sample["file_hash"] = fof.compute_filehash(sample.filepath)

# batch updates all samples that have been modified
dataset.save()

where samples would report to their parent dataset that they have been modified, and then the dataset would handle batch saving all modified samples (the dataset would maintain a queue of in-memory samples to save).

A critical issue with the above code as written is that we'll have very unhappy users if they get an error on the 123456th sample they updated and then discover that none of the other modifications they made were saved.

As a result, I believe Dataset needs a context manager that manages batch syncing to the DB:

with dataset:
    for sample in dataset:
        sample["file_hash"] = fof.compute_filehash(sample.filepath)

where __exit__ ensures that any changes will always be synced to the DB. As an optimization, the dataset could sync changes in batches of n to avoid the need to store every sample in memory.

Originally posted by @brimoor in https://github.com/voxel51/fiftyone/diffs
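
For concreteness, here is a minimal sketch of how such a context manager might work; the internals (_modified_samples, _mark_modified) are hypothetical names for illustration, not actual FiftyOne APIs:

class Dataset(object):
    """Sketch of a batch-syncing Dataset (hypothetical internals)."""

    def __init__(self):
        self._modified_samples = []  # queue of modified in-memory samples

    def _mark_modified(self, sample):
        # samples report to their parent dataset when they are modified
        self._modified_samples.append(sample)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # always sync pending changes to the DB, even if an error occurred
        self.save()

    def save(self, batch_size=1000):
        # sync in batches of `batch_size` to limit memory usage
        while self._modified_samples:
            batch = self._modified_samples[:batch_size]
            del self._modified_samples[:batch_size]
            for sample in batch:
                sample.save()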

Launch FiftyOne app on same display as current terminal window

On a multidisplay machine, I've noticed that fo.launch_dashboard() causes a dashboard to launch on my default display. It would be nice if the dashboard would launch on the same display as the terminal window that I used to launch it. On multidisplay machines, one might not realize that the app has opened if they aren't looking at the right display.

Similarly, if possible, it would be nice if the window would open with focus (so it doesn't appear in the background, again with the goal of ensuring that users see that the dashboard has launched)

TensorFlow 1.15 PrefetchDataset Bug

Trying to run the following from the dataset creation walkthrough:

import fiftyone.zoo as foz

# List available datasets
print(foz.list_zoo_datasets())

# Load a zoo dataset
# The dataset will be downloaded from the web the first time you access it
dataset = foz.load_zoo_dataset("cifar10")

# Print a few samples from the dataset
print(dataset.view().head())

Without tensorflow installed, it prompts me to pip install tensorflow>=1.15.
However, with tensorflow==1.15, I get the following error:

Done writing /home/eric/fiftyone/cifar10/tmp-download/cifar10/3.0.2.incompleteFX5EZH/cifar10-train.tfrecord. Shard lengths: [50000]
Generating split test
Shuffling and writing examples to /home/eric/fiftyone/cifar10/tmp-download/cifar10/3.0.2.incompleteFX5EZH/cifar10-test.tfrecord
Done writing /home/eric/fiftyone/cifar10/tmp-download/cifar10/3.0.2.incompleteFX5EZH/cifar10-test.tfrecord. Shard lengths: [10000]
Skipping computing stats for mode ComputeStatsMode.AUTO.
Dataset cifar10 downloaded and prepared to /home/eric/fiftyone/cifar10/tmp-download/cifar10/3.0.2. Subsequent calls will reuse this data.
Constructing tf.data.Dataset for split test, from /home/eric/fiftyone/cifar10/tmp-download/cifar10/3.0.2
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-5a9ae139399a> in <module>
      6 # Load a zoo dataset
      7 # The dataset will be downloaded from the web the first time you access it
----> 8 dataset = foz.load_zoo_dataset("cifar10")
      9 
     10 # Print a few samples from the dataset

~/venvs/test/lib/python3.6/site-packages/fiftyone/zoo/__init__.py in load_zoo_dataset(name, splits, dataset_dir, download_if_necessary)
    142     if download_if_necessary:
    143         info, dataset_dir = download_zoo_dataset(
--> 144             name, splits=splits, dataset_dir=dataset_dir
    145         )
    146         zoo_dataset = info.zoo_dataset

~/venvs/test/lib/python3.6/site-packages/fiftyone/zoo/__init__.py in download_zoo_dataset(name, splits, dataset_dir)
    110     """
    111     zoo_dataset, dataset_dir = _parse_dataset_details(name, dataset_dir)
--> 112     info = zoo_dataset.download_and_prepare(dataset_dir, splits=splits)
    113     return info, dataset_dir
    114 

~/venvs/test/lib/python3.6/site-packages/fiftyone/zoo/__init__.py in download_and_prepare(self, dataset_dir, splits)
    550                 logger.info("Downloading split '%s' to '%s'", split, split_dir)
    551                 format, num_samples, classes = self._download_and_prepare(
--> 552                     split_dir, scratch_dir, split
    553                 )
    554 

~/venvs/test/lib/python3.6/site-packages/fiftyone/zoo/tf.py in _download_and_prepare(self, dataset_dir, scratch_dir, split)
    184             get_class_labels_fcn,
    185             get_num_samples_fcn,
--> 186             sample_parser,
    187         )
    188 

~/venvs/test/lib/python3.6/site-packages/fiftyone/zoo/tf.py in _download_and_prepare(dataset_dir, scratch_dir, download_fcn, get_class_labels_fcn, get_num_samples_fcn, sample_parser)
    762     # Write the formatted dataset to `dataset_dir`
    763     write_dataset_fcn(
--> 764         dataset.as_numpy_iterator(),
    765         dataset_dir,
    766         sample_parser=sample_parser,

AttributeError: 'PrefetchDataset' object has no attribute 'as_numpy_iterator'

When trying this code snippet with tensorflow==2.2.0, it worked without issue. Once run with v2.2.0, I then reinstalled v1.15 and it worked without issue. Only after deleting ~/fiftyone/cifar10 and rerunning with v1.15 did the error reoccur.

Overly verbose "Dataset Creation Examples"

I'd suggest removing "Working with..." from these sections to make it easier to visually identify the situation that applies to me. If there is disagreement, feel free to close this issue. I'm not going to fight it.

[Screenshot of the current section headings omitted]

Click bubble to toggle between value and `key: value` display mode

Currently all bubbles rendered on images in the dashboard are displayed in <value> mode. However, for certain fields (usually numeric fields), it can be strange to just see a number like 0.987235 on an image.

My proposal is that the user can click on any bubble on any image to toggle between showing the values in <value> and <key>: <value> mode. In the latter mode, you could see confidence: 0.987235, for example.

Fine tune opening dashboard view and remote session instructions

Here's what the no dataset page of the FiftyOne Dashboard currently looks like:

[Screenshot of the empty dashboard page omitted]

If the user has a dashboard open, then they've already figured out how to run fo.launch_dashboard(), right? So, perhaps we should extend the help a bit to show the full workflow:

import fiftyone as fo

# Load your FiftyOne dataset
dataset = ...

# Launch your dashboard locally
# (if you're reading this from your dashboard, you've already done this!)
session = fo.launch_dashboard()

# Load a dataset 
session.dataset = dataset

# Load a specific view into your dataset
session.view = view

Remote connections are a bit of a special case, so perhaps they should be hidden in a view that appears only after clicking on a Remote session? link or something similar. Then perhaps that help page should show bifurcated local/remote instructions like this:

On your remote machine

import fiftyone as fo

# Load your FiftyOne dataset
dataset = ...

# Launch the dashboard that you'll connect to from your local machine
session = fo.launch_dashboard(remote=True)

# Load a dataset 
session.dataset = dataset

# Load a specific view into your dataset
session.view = view

On your local machine

# Configure port forwarding to access the session on your remote machine
ssh -L 5151:127.0.0.1:5151 username@remote_machine_ip

# Launch the dashboard
# NOTE: I'm going to support this via the CLI as a shortcut for launching a
# dashboard that you intend to connect to remotely
fiftyone remote   

Add support for set operations on DatasetView

Wouldn't this be cool and useful!?

dataset = fo.Dataset(...)

view1 = (
    dataset.view()
    ....
)

view2 = (
    dataset.view()
    ....
)

view_intersection = view1 & view2

# equivalent long version:
# view_intersection = view1.intersection(view2)

view_union = view1 | view2

# equivalent long version:
# view_union = view1.union(view2)

...

We could support all the natural set operations:

[Screenshot of the proposed set operations omitted]
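
A minimal sketch of how the operator overloads could delegate to the long-form methods; _ids() and _select_ids() are assumed helpers for illustration, not existing APIs:

class DatasetView(object):
    """Sketch of set operations on views (assumed helpers)."""

    def intersection(self, other):
        # keep only samples whose IDs appear in both views
        return self._select_ids(set(self._ids()) & set(other._ids()))

    def union(self, other):
        # keep samples whose IDs appear in either view
        return self._select_ids(set(self._ids()) | set(other._ids()))

    # operator sugar for the long-form methods
    __and__ = intersection
    __or__ = union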

FiftyOne needs a lite ETA install, which can be performed via pip

(this issue belongs in ETA, but I'm adding it here first for visibility, as FiftyOne is the primary user of this needed install improvement)

FiftyOne currently performs a full ETA install (https://github.com/voxel51/eta/blob/develop/install.bash), which includes the following items:

  • base machine packages
    • should be moved out of the pip install component of the ETA install and instead moved to a startup script or some other appropriate place
  • basic python requirements
  • pip install eta
  • ffmpeg
    • not needed right now, but will be needed when we start to work with videos. however, perhaps FiftyOne should be in charge of installing this and never ETA
  • imagemagick
    • probably never needed, used only by a few methods in eta.core.image that we'd never be calling
  • tensorflow
    • not needed for fiftyone; user will install if they need TF for either fiftyone or eta package reasons
  • submodules, including tensorflow/models
    • not needed for fiftyone

ETA needs a lite install process that installs what FiftyOne needs and nothing else, which can be accomplished via pip install.

To preserve functionality for existing users, ETA also needs a full install that can be accomplished via pip install

Improve main fiftyone namespace doc

$ ipython
Python 3.6.10 (tags/debian/3.6.10-1+xenial1:7bb1d22, Jan 11 2020, 15:15:40) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

[ins] In [1]: import fiftyone as fo                                                                                                          

[ins] In [2]: fo?                                                                                                                            
Type:        module
String form: <module 'fiftyone' from '/home/jason/voxel51/j/fiftyone/fiftyone/__init__.py'>
File:        ~/voxel51/j/fiftyone/fiftyone/__init__.py
Docstring:  
FiftyOne package namespace.

| Copyright 2017-2020, Voxel51, Inc.
| `voxel51.com <https://voxel51.com/>`_
|

Let's be sure to improve this documentation, as this is a common usage pattern in IPython...

Examples out of date due to DatasetView changes

I've been running through the file_hashing example and have noticed some issues so far (and from a quick search, other examples appear to be affected as well):

  • DatasetView.filter() was replaced in 6665090 (#52)
  • Dataset.aggregate() was made private in 6401fc3 (is there a user-facing replacement? This change doesn't appear to have been discussed in a PR)

There could be others; I haven't gone through other examples yet. I don't know how to properly update all of the examples, so I would prefer to have someone with more knowledge of these changes address them.

Improve scrolling efficiency on dashboard

Infinite scrolling of images in the dashboard is awesome! So awesome that we made it a front-and-center feature that every user will definitely interact with. As a result, it's important that we optimize its performance as much as possible.

This will be a recurring issue, but here's an initial list of low-hanging fruit that we brainstormed:

Scrolling efficiency ideas

  • cache samples that have already been loaded so that scrolling upwards does not require reloading
  • prefetch the next page of results
  • optimize image annotation rendering (whatever low-hanging fruit may exist)
  • consider storing image thumbnails in the DB (this would be an internal-only field, not loaded when the user interacts with Samples) and show those in the samples view
  • asynchronously load content so that content appears as fast as possible; for example, show a gray/solid-colored placeholder image while the actual image loads so that scrolling doesn't appear to "hang" until enough images load that the next page can be shown. For inspiration here, try scrolling as fast as possible through Google Images

Dashboard caching doesn't realize when images change?

I can break the dashboard in a very unpleasant way by doing the following:

  1. Download MNIST test split with TF:
export FIFTYONE_DEFAULT_ML_BACKEND=tensorflow
fiftyone zoo download mnist --splits test
  2. View in dashboard:
import fiftyone as fo
import fiftyone.zoo as foz

test = foz.load_zoo_dataset("mnist", splits=["test"])

session = fo.launch_dashboard(dataset=test)

[Screenshot of the dashboard showing the TF-downloaded MNIST samples omitted]

  3. Close terminal session, get a coffee, etc.

  4. Now re-download the dataset with Torch:

mv ~/fiftyone/mnist ~/fiftyone/mnist-torch

export FIFTYONE_DEFAULT_ML_BACKEND=torch
fiftyone zoo download mnist --splits test
  5. View in dashboard:
import fiftyone as fo
import fiftyone.zoo as foz

test = foz.load_zoo_dataset("mnist", splits=["test"])

session = fo.launch_dashboard(dataset=test)

[Screenshot of the dashboard after the Torch re-download omitted]

As you can see, the images didn't change, but the labels did! My dashboard is broken!

The issue is that the TF and Torch-based datasets yield images on disk with the same filenames ~/fiftyone/mnist/test/data/%05d.jpg, but the order of the filenames is permuted, so ~/fiftyone/mnist/test/data/00001.jpg is a different image in each download of the dataset.

I think this is some kind of aggressive image caching on the frontend that persists between sessions and doesn't realize if the source image has actually changed on disk.

For example, this shows that the image I see in the dashboard does not necessarily match what is actually on disk at the time:

[Screenshot comparing the dashboard image to the image on disk omitted]

Make sidebar scrollable and sections collapsible

Making the Display and View sections of the dashboard sidebar collapsible (but expanded by default) will be important to support cases where the sample field schema and/or the current view is complex, and thus they cannot both fit on the screen at the same time.

Similarly, even with Display collapsed, for example, the current View may exceed the screen height, so the sidebar will need to be scrollable. (Maybe it already is? Not sure)

Walkthrough Issues

  1. Each walkthrough should specify which additional packages are needed in order to run it. For example, the uniqueness walkthrough can't call foz.load_zoo_dataset() without pip installing torch, torchvision, and tensorflow-datasets.

  2. The uniqueness walkthrough errors on the call fob.compute_uniqueness(): FIXED

In [8]: fob.compute_uniqueness(dataset)
Search path is empty
---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
<ipython-input-8-66cf7b6a3d6c> in <module>
----> 1 fob.compute_uniqueness(dataset)

~/venvs/test/lib/python3.6/site-packages/fiftyone/brain/uniqueness.py in compute_uniqueness(***failed resolving arguments***)

~/venvs/test/lib/python3.6/site-packages/eta/core/learning.py in load_default_deployment_model(model_name)
    130             specified model
    131     """
--> 132     model = etam.get_model(model_name)
    133     config = ModelConfig.from_dict(model.default_deployment_config_dict)
    134     return config.build()

~/venvs/test/lib/python3.6/site-packages/eta/core/models.py in get_model(name)
     97         ModelError: if the model could not be found
     98     """
---> 99     return _find_model(name)[0]
    100
    101

~/venvs/test/lib/python3.6/site-packages/eta/core/models.py in _find_model(name)
    611     if Model.has_version_str(name):
    612         return _find_exact_model(name)
--> 613     return _find_latest_model(name)
    614
    615

~/venvs/test/lib/python3.6/site-packages/eta/core/models.py in _find_latest_model(base_name)
    635
    636     if _model is None:
--> 637         raise ModelError("No models found with base name '%s'" % base_name)
    638     if _model.has_version:
    639         logger.debug(

ModelError: No models found with base name 'simple_resnet_cifar10'

Need custom FloatField that supports more numeric types

In this code (taken from https://github.com/voxel51/fiftyone/blob/develop/examples/model_inference/README.md), I had to add float(confidence); otherwise I got an error about confidence, which was a numpy float32 or something similar, not being a supported value for a mongoengine.fields.FloatField.

for imgs, sample_ids in data_loader:
    predictions, confidences = predict(model, imgs)

    # Add predictions to your FiftyOne dataset
    for sample_id, prediction, confidence in zip(
        sample_ids, predictions, confidences
    ):
        sample = dataset[sample_id]
        sample[model_name] = fo.Classification(label=labels_map[prediction])
        sample["confidence"] = float(confidence)  # float() is required here, but shouldn't need to be...
        sample.save()

Kind of hard to believe that MongoEngine doesn't handle casting a np.float32 into a float, but, alas, it seems like our wrapper around mongoengine.fields.FloatField will need to override the validate() function below to cast non-int types with float() as well...

https://github.com/MongoEngine/mongoengine/blob/4275c2d7b791f5910308a4815a1ba39324dee373/mongoengine/fields.py#L377-L411
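
A minimal sketch of such a wrapper, mirroring MongoEngine's validate() semantics but casting any numeric type (e.g. np.float32) with float() first:

import mongoengine

class FloatField(mongoengine.fields.FloatField):
    """FloatField that accepts any float-castable numeric type."""

    def validate(self, value):
        try:
            value = float(value)  # casts np.float32, np.float64, ints, ...
        except (TypeError, ValueError):
            self.error("%s could not be converted to a float" % value)

        if self.min_value is not None and value < self.min_value:
            self.error("Float value is too small")

        if self.max_value is not None and value > self.max_value:
            self.error("Float value is too large")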

Fullscreen image viewing

After an image is shown in the sidebar, add an option for viewing it fullscreen. This should still use player51 to show detections.

Enhanced sample selection in dashboard

Google Photos please!!!

  • Check boxes appear in the top left
  • selecting one puts you into "selection mode"
  • Shift select many samples
  • Clear selected (and exits "selection mode")

Bring back ODMSample base class

#117 went out a bit quick, but, upon closer inspection, I think we should not have deleted fiftyone.core.odm.ODMDatasetSample.

The current implementation has only fiftyone.core.odm.ODMSample and fiftyone.core.odm.NoDatasetSample, where the latter inherits directly from SerializableDocument:

class NoDatasetSample(SerializableDocument):

I think there's still value in having the following hierarchy:

fiftyone.core.odm.ODMSample
    fiftyone.core.odm.ODMDatasetSample
    fiftyone.core.odm.ODMNoDatasetSample

where the base class fiftyone.core.odm.ODMSample defines the interface that all samples support. This will make it more clear what fiftyone.core.sample.Sample is allowed to do with its backing documents, for example.

Although NoDatasetSample is now completely home brewed (no MongoEngine), it is still a "document" in the sense that it is a JSON serializable representation of a Sample. So I see no issue with using the ODMNoDatasetSample name. It is in the odm package, after all.
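
For concreteness, a minimal sketch of the proposed hierarchy:

class ODMSample(SerializableDocument):
    """Interface that all sample backing documents support."""

    @property
    def filepath(self):
        raise NotImplementedError("subclasses must implement filepath")

class ODMDatasetSample(ODMSample):
    """Backing document for samples that belong to a dataset."""
    pass

class ODMNoDatasetSample(ODMSample):
    """In-memory backing document for samples not in a dataset."""
    pass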

Turn off image smoothing in dashboard

The dashboard currently renders images with lots of image smoothing turned on. However, our technical audience of CV/ML scientists will want to see their raw images.

The difference is especially apparent at low resolutions. I would prefer to see the pixelated 32x32 image on the left below (opened on my machine with smoothing turned off), not the image on the right (in the FO dashboard):

[Screenshot comparing the unsmoothed and smoothed renderings omitted]

Convert tags to a SetField

It would be preferable for tags to be represented as sets in Python land. See below for discussion.

Also, this would be a convenient time to create our own wrappers around MongoEngine fields so that our user interface is decoupled from MongoEngine classes:

[Screenshot omitted]

[GS1] Confusing output when loading multiple splits of a dataset.

This code

train_dataset = foz.load_zoo_dataset("cifar10", split="train")
valid_dataset = foz.load_zoo_dataset("cifar10", split="test")

generates this output

Using default dataset directory '/home/jason/fiftyone/cifar10/train'
Dataset already downloaded
Parsing samples...
Creating dataset 'cifar10' containing 50000 samples
Using default dataset directory '/home/jason/fiftyone/cifar10/test'
Dataset already downloaded
Parsing samples...
Creating dataset 'cifar10' containing 10000 samples

It's confusing to say "Creating dataset 'cifar10'" twice.

Cannot use shell after exiting Python with dashboard open

From IPython, I run the following commands:

In [1]: import fiftyone as fo                                                                                                                                                        

In [2]: session = fo.launch_dashboard()                                                                                                                                              

In [3]: exit()

The last command exits my shell with the dashboard still open. The dashboard does successfully close, but my shell is now borked: anything I now type does not appear. Commands can be run and their outputs will print, but anything I type doesn't appear. My only recourse is to close the shell and open a new one.

Layout and style the FiftyOne docs

Here's a v1 design for our docs from our designer:

Personally, I like the colors/styling of the light theme the best:

[Design mockup of the light theme omitted]

The more important point is making sure the various components of the docs are laid out in a user-friendly way.

For reference, here's a tutorial from PyTorch:
[Screenshot of a PyTorch tutorial page omitted]

and here's a tutorial from TF:
[Screenshot of a TensorFlow tutorial page omitted]

Distilling all of this, here are the page elements that I like:

  • Top bar allows for selection between Get Started, Docs, Tutorials, GitHub, etc
  • When viewing a page (tutorial or API docs), left navbar shows a top-level-only list of other tutorials/modules that one can click on. The detailed table of contents for the current page appears on the right side bar
  • Search bar in upper right, with GitHub link to the right of it
  • Tutorials have View source on GitHub and Download notebook links

Discussion is welcome!

Cleaning up services in __del__ is nondeterministic/unsafe

__del__ isn't guaranteed to be called, and even if it is, it isn't guaranteed to be called before other objects are deleted. When testing #75, this has led to some intermittent errors like this:

Exception ignored in: <bound method Service.__del__ of <fiftyone.core.service.ServerService object at 0x7f70aa0a4b38>>
Traceback (most recent call last):
  File "/path/to/venv/lib/python3.5/site-packages/fiftyone/core/service.py", line 50, in __del__
  File "/path/to/venv/lib/python3.5/site-packages/fiftyone/core/service.py", line 88, in stop
AttributeError: 'NoneType' object has no attribute 'STOP_SERVER'
Exception ignored in: <bound method Service.__del__ of <fiftyone.core.service.DatabaseService object at 0x7f70b117ae80>>
Traceback (most recent call last):
  File "/path/to/venv/lib/python3.5/site-packages/fiftyone/core/service.py", line 50, in __del__
  File "/path/to/venv/lib/python3.5/site-packages/fiftyone/core/service.py", line 75, in stop
AttributeError: 'NoneType' object has no attribute 'STOP_DB'

In this case, the entire fiftyone.constants module was deleted before the Service instances!

I think avoiding __del__ entirely is the right approach here. My understanding is that #76 has already made some improvements; this issue is mainly for tracking purposes.

Minor Formatting Issue

In the Image Deduplication with FiftyOne walkthrough docs, in section 4, Compute File Hashes:

We have two ways to visualize this new information:

1. From your terminal:

sample = dataset.view().first()
print(sample)

1. By refreshing the dashboard:

session.dataset = dataset

Should be 1. and 2.

Support for pythonic sample matching in views

The discussion below was prompted by the following command:

view.match({"metadata.num_channels": 3, "metadata.size_bytes": {"$gt": 1200}})

Having the user write things like "metadata.size_bytes": {"$gt": 1200} is getting a little too close to exposing the user to MongoDB syntax for my taste.

In an ideal pythonic interface, one could write:

view.match(lambda sample: sample.metadata.num_channels == 3)

Now, of course this would have to be implemented as a pipeline stage that reads the samples into memory and applies the function, so it may not be as efficient as possible, but it would be very easy for the user to understand, and powerful.

Maybe the operation is reasonably fast even for datasets with 100K+ samples, and so we can just do it. Or maybe it's a bit slow so we expose the functionality and then suggest to the user that there are faster ways to implement certain operations.

This is analogous to https://www.tensorflow.org/api_docs/python/tf/py_function.

For any "optimized" operations we support, how about a more generic syntax like:

view.match("'metadata.size_bytes' > 1200")

where we would have a simple whitespace-based parser that would translate the string into "metadata.size_bytes": {"$gt": 1200}.

The syntax of the match string "'metadata.size_bytes' > 1200" vs "metadata.size_bytes": {"$gt": 1200} is a small point.

The real ask here is implementing view.match(<function>) rather than view.match(<match-string-using-either-syntax>). In the former case, the user has ultimate power to define a match operation that depends on 15 different fields in strange ways, which they can put in their custom, IDE-friendly function, at the cost of a potentially small performance overhead of reading the samples into memory during that stage.
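
A minimal sketch of the predicate-based stage; match_samples is a hypothetical helper for illustration, not an existing API:

def match_samples(view, predicate):
    """Yields the samples in `view` for which `predicate` returns True.

    Reads each sample into memory, trading some efficiency for the
    expressiveness discussed above.
    """
    for sample in view:
        if predicate(sample):
            yield sample

# Usage, mirroring the proposal above
matches = match_samples(view, lambda sample: sample.metadata.num_channels == 3)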

Custom __str__ on FiftyOne fields to improve dataset summaries

Currently dataset summaries look something like this:

>>> d = fo.Dataset("ASDf")
>>> d
Name:           ASDf
Num samples:    0
Tags:           []
Sample fields:
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(field=fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(document_type=fiftyone.core.metadata.Metadata)

We should consider implementing custom __str__ for our fiftyone.core.fields classes to make the sample field representations more concise, particularly list fields and embedded fields. The fact that metadata is an "embedded document" is definitely not relevant to the user, for example.

Example concise representation:

>>> d = fo.Dataset("ASDf")
>>> d
Name:           ASDf
Num samples:    0
Tags:           []
Sample fields:
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata: fiftyone.core.metadata.Metadata
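
A minimal sketch of how such concise representations could be implemented by overriding __str__ on wrapper field classes (using MongoEngine's existing field and document_type attributes; the wrapper classes themselves are the proposal above, not existing FiftyOne classes):

import mongoengine

def _class_name(obj):
    cls = obj.__class__
    return "%s.%s" % (cls.__module__, cls.__name__)

class ListField(mongoengine.fields.ListField):
    """Renders as ListField(<element-field-class>)."""

    def __str__(self):
        return "%s(%s)" % (_class_name(self), _class_name(self.field))

class EmbeddedDocumentField(mongoengine.fields.EmbeddedDocumentField):
    """Renders as the embedded document class itself."""

    def __str__(self):
        doc = self.document_type
        return "%s.%s" % (doc.__module__, doc.__name__)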

Make loading samples progress bar more informative

The "parsing samples" progress bar is a bit disingenuous because loading a dataset always seems to hang for some time afterwards.

I'll further note that this hang time increases with the dataset size.

This hang time is not communicated to the user as anything specific.

Show first label-type field(s) by default in the dashboard

When viewing a dataset in the dashboard, the default should be to have at least one sample field of type fiftyone.core.label.Label automatically selected (the user could deselect if desired, of course)

Here's a proposal:

  • if there are <= 2 label fields, automatically show them both. The user probably wants to compare them
  • if there are > 2 label fields, either automatically show the first 2, the first one, or none at all (up for discussion)
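
A minimal sketch of this default-selection logic, where _is_label_field is a hypothetical helper that checks whether a field holds fiftyone.core.label.Label values:

def default_label_fields(sample_fields):
    """Returns the label field names to show by default.

    Args:
        sample_fields: a dict mapping field names to field types
    """
    label_fields = [
        name
        for name, field_type in sample_fields.items()
        if _is_label_field(field_type)  # hypothetical type check
    ]

    # show both when there are <= 2 label fields; else just the first
    return label_fields if len(label_fields) <= 2 else label_fields[:1]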

My justification for this is:

  • we want to minimize the number of clicks that a user has to make to get the insights they want
  • we want to make sure, especially for new users, that they "see" the power of the dashboard; showing a bunch of raw images is not, in itself, that novel. We want a wow factor up front
  • most datasets will only contain a few sets of labels, so it's very likely that a reasonable default strategy will fulfill quite a few use cases

FiftyOne developer install issues on mac

The current FiftyOne install process requires Xcode developer tools, because the Electron app needs to build some things with gyp.

Also, when @michaelsare ran the install script with Xcode dev tools installed, he didn't see any obvious errors, but yarn was not properly installed. This had to be fixed via brew install yarn. The install script (tries?) to install yarn via npm install -g yarn.

We should find a way to ensure that a developer install from scratch on a fresh machine completes successfully. We'll have non-developer users of the tool internally that will need to be able to spin up a bleeding edge version of the tool to demo to folks. Or we'll want to install FiftyOne from scratch on a fresh VM

Add basic support for visualizing distributions of label and numeric fields in the dashboard

Proposed MVP implementation after our meeting on June 1st:

The tab headings are up for discussion; unsure how to best organize.

MVP field types to support

  • histograms of fiftyone.core.label.Classification labels
  • histograms of fiftyone.core.label.Detections labels
  • histograms of string list (e.g., tags) values
  • histograms of numeric (int, float) values

Remove `Dataset.serialize`?

Is Dataset.serialize() being used for anything?

def serialize(self):
    """Serializes the dataset.

    Returns:
        a JSON representation of the dataset
    """
    return {"name": self.name}

I find it confusing because it does not serialize the dataset, it just returns metadata (currently only the name).

This is becoming especially confusing given #121, which introduces a Dataset.to_dict() method that actually serializes the entire dataset.

View chaining example in 15to51 should be "real"

DatasetView chaining is a powerful operation, but the example in the 15to51 walkthrough is empty. I think it would be more instructive if the example produced a non-empty dataset and the contents of the output view were shown to verify that they meet the view's query criteria:

[Screenshot of the empty chaining example omitted]

`pymongo.errors.NotMasterError` occasionally raised when adding a sample

This has shown up in my work on the new interface. It shows up maybe once in ~5-10 calls. Need to investigate.

https://api.mongodb.com/python/current/api/pymongo/errors.html#pymongo.errors.NotMasterError

(fiftyone) tylerganter@tgmbp:~/source/fiftyone/tests$ python 51in15.py 
Uncaught exception
Traceback (most recent call last):
  File "51in15.py", line 25, in <module>
    sample_id = dataset.add_sample(filepath="/path/to/img.jpg", tags=["train"])
  File "/Users/tylerganter/source/fiftyone/fiftyone/core/dataset.py", line 205, in add_sample
    sample = self._Doc(*args, **kwargs)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/mongoengine/base/document.py", line 115, in __init__
    setattr(self, key, value)
  File "/Users/tylerganter/source/fiftyone/fiftyone/core/odm.py", line 214, in __setattr__
    self.save()
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/mongoengine/document.py", line 408, in save
    object_id = self._save_create(doc, force_insert, write_concern)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/mongoengine/document.py", line 473, in _save_create
    object_id = wc_collection.insert_one(doc).inserted_id
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/collection.py", line 698, in insert_one
    session=session),
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/collection.py", line 612, in _insert
    bypass_doc_val, session)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/collection.py", line 600, in _insert_one
    acknowledged, _insert_command, session)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1491, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/collection.py", line 595, in _insert_command
    retryable_write=retryable_write)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/pool.py", line 618, in command
    self._raise_connection_failure(error)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/pool.py", line 613, in command
    user_fields=user_fields)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/network.py", line 167, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/Users/tylerganter/envs/fiftyone/lib/python3.6/site-packages/pymongo/helpers.py", line 136, in _check_command_response
    raise NotMasterError(errmsg, response)
pymongo.errors.NotMasterError: interrupted at shutdown

Consider truncating print output for large collections

print(dataset_or_view) can get out of hand when the collection contains many samples. We should consider a threshold above which we only serialize X samples and then append a message like ... X of Y total samples or something
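
A minimal sketch of such truncation; the threshold and message format are placeholders for discussion:

MAX_SAMPLES_TO_PRINT = 10  # hypothetical threshold

def samples_repr(samples):
    """Serializes at most MAX_SAMPLES_TO_PRINT samples, then truncates."""
    total = len(samples)
    shown = samples[:MAX_SAMPLES_TO_PRINT]
    lines = [str(sample) for sample in shown]
    if total > len(shown):
        lines.append("... %d of %d total samples" % (len(shown), total))
    return "\n".join(lines)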
