lukashedegaard / datasetops

Fluent dataset operations, compatible with your favorite libraries

Home Page: https://datasetops.readthedocs.io

License: MIT License

Python 96.69% Makefile 3.31%
dataset-combinations data-science multiple-datasets pytorch tensorflow data-munging data-wrangling data-cleaning data-processing deep-learning dataset

datasetops's Introduction


Dataset Ops: Fluent dataset operations, compatible with your favorite libraries

Dataset Ops provides a fluent interface for loading, filtering, transforming, splitting, and combining datasets. Designed specifically with data science and machine learning applications in mind, it integrates seamlessly with Tensorflow and PyTorch.

Appetizer

import datasetops as do

# prepare your data
train, val, test = (
    do.from_folder_class_data('path/to/data/folder')
    .named("data", "label")
    .image_resize((240, 240))
    .one_hot("label")
    .shuffle(seed=42)
    .split([0.6, 0.2, 0.2])
)

# use with your favorite framework
train_tf = train.to_tensorflow() 
train_pt = train.to_pytorch() 

# or do your own thing
for img, label in train:
    ...

Installation

Binary installers are available from the Python Package Index (PyPI):

pip install datasetops

Why?

Collecting and preprocessing datasets is tiresome and often takes upwards of 50% of the effort spent in the data science and machine learning lifecycle. While Tensorflow and PyTorch have some useful dataset utilities available, they are designed specifically with the respective frameworks in mind. Unsurprisingly, this makes it hard to switch between them, and training-ready dataset definitions are bound to one or the other. Moreover, they do not aid you in standard scenarios where you want to:

  • Sample your dataset in non-random ways (e.g. with a fixed number of samples per class)
  • Center, standardize, or normalize your data
  • Combine multiple datasets, e.g. for parallel input to a multi-stream network
  • Create non-standard data splits

Dataset Ops aims to make these processing steps easier, faster, and more intuitive to perform, while retaining full compatibility to and from the leading libraries. This also means you can grab a dataset from torchvision datasets and use it directly with tensorflow:

import datasetops as do
import torchvision

torch_usps = torchvision.datasets.USPS('../dataset/path', download=True)
tensorflow_usps = do.from_pytorch(torch_usps).to_tensorflow()

Development Status

The library is still under heavy development and the API may be subject to change.

What follows here is a list of implemented and planned features.

Loaders

  • Loader (utility class used to define a dataset)
  • from_pytorch (load from a torch.utils.data.Dataset)
  • from_tensorflow (load from a tf.data.Dataset)
  • from_folder_data (load flat folder with data)
  • from_folder_class_data (load nested folder with a folder for each class)
  • from_folder_dataset_class_data (load nested folder with multiple datasets, each with a nested class folder structure)
  • from_mat (load contents of a .mat file as a single dataset)
  • from_mat_single_mult_data (load contents of a .mat file as multiple datasets)
  • load (load data from a path, automatically inferring type and structure)

Converters

  • to_tensorflow (convert Dataset into tensorflow.data.Dataset)
  • to_pytorch (convert Dataset into torch.utils.data.Dataset)

Dataset information

  • shape (get shape of a dataset item)
  • counts (compute the counts of each unique item in the dataset by key)
  • unique (get a list of unique items in the dataset by key)
  • named (supply names for the item elements)
  • names (get a list of names for the elements in an item)
  • stats (provide an overview of the dataset statistics)
  • origin (provide a description of how the dataset was made)

Sampling and splitting

  • shuffle (shuffle the items in a dataset randomly)
  • sample (sample data at random from a dataset)
  • filter (filter the dataset using a predicate)
  • split (split a dataset randomly based on fractions)
  • split_filter (split a dataset into two based on a predicate)
  • allow_unique (handy predicate used for balanced classwise filtering/sampling)
  • take (take the first items in dataset)
  • repeat (repeat the items in a dataset, either itemwise or as a whole)
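To illustrate what balanced classwise filtering can look like, here is a minimal sketch in plain Python. It is not the library's implementation of allow_unique: the helper name make_per_class_limit and the (data, label) item layout are assumptions made for this example.

```python
from collections import Counter
from typing import Any, Callable, Sequence


def make_per_class_limit(
    max_per_class: int, label_index: int = 1
) -> Callable[[Sequence[Any]], bool]:
    """Build a stateful predicate that accepts at most `max_per_class` items per label.

    Hypothetical helper for illustration only.
    """
    seen: Counter = Counter()

    def predicate(item: Sequence[Any]) -> bool:
        label = item[label_index]
        if seen[label] >= max_per_class:
            return False  # this class already has enough samples
        seen[label] += 1
        return True

    return predicate


items = [("a0", "cat"), ("a1", "cat"), ("a2", "cat"), ("b0", "dog"), ("b1", "dog")]

# Keep at most two items per class, in their original order.
balanced = list(filter(make_per_class_limit(2), items))
```

Because the predicate keeps a running count, a fresh one should be created for every pass over the data.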

Item manipulation

  • reorder (reorder the elements of the dataset items (e.g. flip label and data order))
  • transform (transform function which takes other functions and applies them to the dataset items.)
  • categorical (transforms an element into a categorical integer encoded label)
  • one_hot (transforms an element into a one-hot encoded label)
  • numpy (transforms an element into a numpy.ndarray)
  • reshape (reshapes numpy.ndarray elements)
  • image (transforms a numpy array or path string into a PIL.Image.Image)
  • image_resize (resizes PIL.Image.Image elements)
  • image_crop (crops PIL.Image.Image elements)
  • image_rotate (rotates PIL.Image.Image elements)
  • image_transform (transforms PIL.Image.Image elements)
  • image_brightness (modify brightness of PIL.Image.Image elements)
  • image_contrast (modify contrast of PIL.Image.Image elements)
  • image_filter (apply an image filter to PIL.Image.Image elements)
  • noise (adds noise to the data)
  • center (modify each item according to dataset statistics)
  • normalize (modify each item according to dataset statistics)
  • standardize (modify each item according to dataset statistics)
  • whiten (modify each item according to dataset statistics)
  • randomly (apply data transformations with some probability)

Dataset combinations

  • concat (concatenate two datasets, placing the items of one after the other)
  • zip (zip datasets itemwise, extending the size of each item)
  • cartesian_product (create a dataset whose items are all combinations of items (zipped) of the originating datasets)
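The three combinations listed above differ mainly in how many items they produce and how wide each item becomes. The sketch below shows this with plain Python lists of tuples standing in for datasets; it uses itertools rather than the Dataset Ops API, and the toy data is made up for the example.

```python
from itertools import chain, product

# Two toy "datasets": each item is a (data, label) tuple.
ds_a = [("img_a0", 0), ("img_a1", 1)]
ds_b = [("img_b0", 0), ("img_b1", 1)]

# concat: the items of ds_b follow the items of ds_a; item width is unchanged.
concatenated = list(chain(ds_a, ds_b))

# zip: items are paired index by index and merged into one wider item.
zipped = [a + b for a, b in zip(ds_a, ds_b)]

# cartesian product: every item of ds_a is merged with every item of ds_b.
cartesian = [a + b for a, b in product(ds_a, ds_b)]
```

With two items of two elements in each input, concat yields four items of two elements, zip yields two items of four elements, and the cartesian product yields four items of four elements.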

Citation

If you use this software, please cite it as below:

@software{Hedegaard_DatasetOps_2022,
  author = {Hedegaard, Lukas and Oleksiienko, Illia and Legaard, Christian Møldrup},
  doi = {10.5281/zenodo.7223644},
  month = {10},
  title = {{DatasetOps}},
  version = {0.0.7},
  year = {2022}
}

datasetops's People

Contributors

clegaard, iliiliiliili, lukashedegaard

Forkers

dsp6414

datasetops's Issues

Online dataset downloaders

Once in a while, people supply non-standard datasets online. This could be in the form of a link for downloading the dataset directly, or as a file in Google Drive, for instance.
It would be a nice feature if we could make a loader for these situations.

Naming suggestions:
  • from_online
  • from_url
  • from_link

Write examples that demonstrate the desired use of API.

To drive the design of the API it would be useful to exemplify how the API may be used to load, transform and split data.

For example loading and transforming to grayscale:

import dataset_loader as ds  # hypothetical module name used in this proposal
my_path = "path/to/data"
dataset = ds.load_dataset(my_path).to_grayscale()

Naming of dataset.py and declaration of transforms

Currently, most of the transforms are implemented in the dataset.py file.
There are also some defined in the compose.py file, related to taking the cartesian product.

It might be reasonable to agree on where transforms should be defined.
What are your thoughts?

Difference between Dataset and Loader

What is the difference between a Dataset and a Loader? From a conceptual standpoint, a loader behaves exactly like a dataset. In terms of the implementation, it seems like the Loader simply wraps the dataset?

MXNet compatibility

We should consider adding support for MXNet, as it is currently the third most popular framework for machine learning.

sample behaviour

Currently, if .sample is asked for more samples than are available in the dataset, some items are sampled multiple times. Should we raise an error instead?
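One way to settle the question is to make both behaviours explicit. The following is a minimal sketch on a plain sequence, not the current .sample implementation. The function signature and the allow_repeats flag are assumptions for illustration.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample(
    items: Sequence[T],
    num: int,
    seed: Optional[int] = None,
    allow_repeats: bool = False,
) -> List[T]:
    """Sample `num` items; repeats must be requested explicitly (hypothetical API)."""
    rng = random.Random(seed)
    if allow_repeats:
        # Sampling with replacement: an item may be drawn several times.
        return rng.choices(items, k=num)
    if num > len(items):
        raise ValueError(
            f"Requested {num} samples, but only {len(items)} are available"
        )
    # Sampling without replacement: every drawn item is distinct.
    return rng.sample(items, num)


picked = sample(list(range(5)), 3, seed=0)
```

With this design, oversampling never happens silently: the caller either opts in with allow_repeats=True or gets a ValueError.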

Naming scheme

We need to come up with a good name for the project.

Criteria

The name should translate well into package names and imports in Python:

Will be installed using pip or conda:

pip install mldatasets
conda install mldatasets

Imported in Python:

from mldatasets.loaders import load_dataset

Dataset shape property and debuggers

Currently, the shape property of a dataset is determined by loading a single sample from the dataset.
This has unintended effects when the dataset is inspected by a debugger such as the one in VS Code: the debugger evaluates the property, which may take several seconds if each sample is large.

@property
def shape(self) -> Sequence[Shape]:
    """Get the shape of a dataset item.

    Returns:
        Sequence[int] -- Item shapes
    """
    if len(self) == 0:
        return _DEFAULT_SHAPE
    item = self.__getitem__(0)
    if hasattr(item, "__getitem__"):
        item_shape = []
        for i in item:
            if hasattr(i, "shape"):  # numpy arrays
                item_shape.append(i.shape)
            elif hasattr(i, "size"):  # PIL.Image.Image
                item_shape.append(np.array(i).shape)
            else:
                item_shape.append(_DEFAULT_SHAPE)
        return tuple(item_shape)
    return _DEFAULT_SHAPE

This raises the question of whether properties should have side effects. In relation to the subsampling operator, this interferes with a caching mechanism. If logging were enabled, it could also cause unexpected log messages to be printed.

A solution could be caching the inferred shape, e.g. saving it to a private attribute _shape and having the property link to that value instead?
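The caching idea can be sketched as follows. This is a toy class rather than the library's Dataset: the class name CountingDataset, the loads counter, and the use of SimpleNamespace as a stand-in for an array with a shape attribute are all assumptions made for the example.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence, Tuple

Shape = Tuple[int, ...]


class CountingDataset:
    """Toy dataset that counts item loads and caches its inferred shape."""

    def __init__(self, items: Sequence[Tuple[Any, ...]]) -> None:
        self._items = items
        self._shape: Optional[Tuple[Shape, ...]] = None  # filled on first access
        self.loads = 0

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, idx: int) -> Tuple[Any, ...]:
        self.loads += 1  # stands in for an expensive read
        return self._items[idx]

    @property
    def shape(self) -> Tuple[Shape, ...]:
        # Only the first access loads an item; later accesses reuse the cache.
        if self._shape is None:
            self._shape = self._infer_shape()
        return self._shape

    def _infer_shape(self) -> Tuple[Shape, ...]:
        if len(self) == 0:
            return ()
        item = self[0]
        # Elements without a `shape` attribute are reported as scalars, ().
        return tuple(tuple(getattr(elem, "shape", ())) for elem in item)


image = SimpleNamespace(shape=(240, 240, 3))  # stand-in for a numpy array
ds = CountingDataset([(image, 1)])
first = ds.shape
second = ds.shape
```

The first access still triggers one load, so a debugger would pay that cost once instead of on every inspection. The cached value would have to be reset wherever the item layout can change.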

Add dataset.remove method

Currently, removing an element can be achieved using dataset.transform(lambda x: (x[0], x[2])) or dataset.reorder("name0", "name2"). We should have an explicit function for this.
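To make the proposal concrete, the sketch below shows the intended behaviour on a single item and its element names. The function name remove_elements and its signature are hypothetical. A real dataset.remove method would apply the same logic lazily to every item.

```python
from typing import Any, Sequence, Tuple


def remove_elements(
    item: Sequence[Any], names: Sequence[str], *to_remove: str
) -> Tuple[Tuple[Any, ...], Tuple[str, ...]]:
    """Drop the named elements from an item and from its list of names."""
    unknown = set(to_remove) - set(names)
    if unknown:
        raise KeyError(f"Unknown element names: {sorted(unknown)}")
    # Indices of the elements that survive, in their original order.
    keep = [i for i, name in enumerate(names) if name not in to_remove]
    return tuple(item[i] for i in keep), tuple(names[i] for i in keep)


new_item, new_names = remove_elements(
    ("img", "mask", 3), ("name0", "name1", "name2"), "name1"
)
```

Unlike the transform-based workaround, the names of the remaining elements are carried along, so no follow-up call to .named(...) would be needed.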

Add workflow for doctests

The docs support testing using doctest. We should add a GitHub workflow that automatically checks these.

Add split_element to convert one element into a few other elements

Currently, to create new elements from an existing one, we can use dataset.transform(lambda x: (x[0], make_P(x[1]), make_R(x[1]), x[2])). This approach breaks the dataset names, so we have to call .named(...) after it.
Implementing a split_element method that takes the name of the element, a function that returns a List of created elements, and a List[str] with the names of the new elements would allow us to process the data without touching the other elements, leaving fewer chances for the user to make an error.
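A possible shape for this operation, shown on a single item, is sketched below. The signature is only one reading of the proposal and the pose example is made up. A real split_element method would apply this per item inside the dataset.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_element(
    item: Sequence[Any],
    names: Sequence[str],
    name: str,
    fn: Callable[[Any], List[Any]],
    new_names: Sequence[str],
) -> Tuple[Tuple[Any, ...], Tuple[str, ...]]:
    """Replace the element called `name` with the elements returned by `fn`."""
    idx = list(names).index(name)  # raises ValueError for an unknown name
    created = list(fn(item[idx]))
    if len(created) != len(new_names):
        raise ValueError(
            f"fn returned {len(created)} elements, but {len(new_names)} names were given"
        )
    # Elements before and after `idx` are passed through untouched.
    new_item = tuple(item[:idx]) + tuple(created) + tuple(item[idx + 1:])
    all_names = tuple(names[:idx]) + tuple(new_names) + tuple(names[idx + 1:])
    return new_item, all_names


item = ("id7", (3, 4), "label")
names = ("id", "pose", "label")
new_item, new_names = split_element(
    item, names, "pose", lambda p: [p[0], p[1]], ["P", "R"]
)
```

The names of the untouched elements survive the split, which is the property the transform-based workaround lacks.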

Dataset.image should be callable by element names

Dataset.image can only be called either without parameters, which converts all convertible elements to images, or with positional flags such as (True, False, True). We should also be able to call it with names: ds.image("image_2")

standardize-transform performance

It seems the standardize function adds a significant amount of reads to the underlying dataset.
Calling the function on a dataset containing a single element seemingly causes 7 reads to be carried out.

import cProfile
import io
import pstats
from pstats import SortKey

import numpy as np

# `ds` is the dataset under test, loaded elsewhere
ds_one = ds.take(1)


def foo():
    # standardize through Dataset Ops
    ds_center = ds_one.standardize(0, axis=1)
    s = ds_center[0]


def bar():
    # reference: standardize the same item directly with scikit-learn
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    s = ds[0]
    scaler.fit(s.data)
    scaled = scaler.transform(s.data)
    mu = np.mean(scaled)
    std = np.std(scaled)


def do_profile(func):
    print(f"######### PROFILING {func.__name__} #########")
    pr = cProfile.Profile(subcalls=False)
    pr.enable()
    func()
    pr.disable()
    s = io.StringIO()
    sortby = SortKey.TIME
    ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
    ps.print_stats()
    # return s.getvalue()
    print(f"{s.getvalue()}\n")


do_profile(foo)
do_profile(bar)

######### PROFILING foo #########
         40119 function calls (39755 primitive calls) in 18.561 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7   17.962    2.566   18.073    2.582 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
    94/65    0.102    0.001    0.283    0.004 {built-in method numpy.core._multiarray_umath.implement_array_function}
        7    0.098    0.014    0.098    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:86(_pad_simple)
        7    0.097    0.014    0.098    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1828(_stack_arrays)
      217    0.085    0.000    0.085    0.000 {built-in method numpy.array}
       36    0.036    0.001    0.036    0.001 {method 'reduce' of 'numpy.ufunc' objects}
        7    0.022    0.003    0.023    0.003 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1864(__init__)
        7    0.018    0.003   18.256    2.608 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:416(_read)
        7    0.018    0.003   18.377    2.625 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:491(get_data)
        1    0.009    0.009    0.039    0.039 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:1421(nanvar)
        2    0.007    0.003    0.036    0.018 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:68(_replace_nan)
        1    0.006    0.006    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:780(transform)
        1    0.006    0.006    0.076    0.076 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:710(_incremental_mean_and_var)
     7847    0.005    0.000    0.011    0.000 {built-in method builtins.isinstance}
     4018    0.005    0.000    0.006    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\generic.py:10(_check)
        1    0.005    0.005    5.368    5.368 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:238(item_stats)
        1    0.005    0.005    8.000    8.000 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1385(make_fn)
        7    0.004    0.001    0.004    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
     1148    0.003    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1708(_is_dtype_type)
     15/7    0.003    0.000   18.405    2.629 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:235(__getitem__)
        1    0.003    0.003    2.729    2.729 c:\users\clega\desktop\datasetops\src\datasetops\scaler.py:62(fit)
        1    0.003    0.003   13.269   13.269 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:648(transform)
        1    0.003    0.003   10.639   10.639 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1088(wrapped)
        1    0.003    0.003   15.911   15.911 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:854(standardize)
        1    0.002    0.002    0.017    0.017 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:1194(fn)
      812    0.002    0.000    0.004    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1565(is_extension_array_dtype)
      665    0.002    0.000    0.009    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:542(is_categorical_dtype)
     1106    0.002    0.000    0.011    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\base.py:247(is_dtype)
      455    0.002    0.000    0.004    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:775(is_integer_dtype)
    49/21    0.002    0.000    0.012    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:276(__new__)
      819    0.002    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:75(find)
      273    0.001    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1401(is_float_dtype)
       56    0.001    0.000    0.001    0.000 {built-in method numpy.empty}
      539    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:216(<lambda>)
      609    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:208(<lambda>)
      609    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:206(classes)
      539    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:211(classes_and_not_datetimelike)
     6128    0.001    0.000    0.001    0.000 {built-in method builtins.getattr}
       63    0.001    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:2981(get_block_type)
    21/14    0.001    0.000    0.018    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:183(__init__)
       56    0.001    0.000    0.005    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:388(sanitize_array)
       91    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:321(_name_get)
  817/642    0.000    0.000    0.001    0.000 {built-in method builtins.len}
      234    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
      154    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:1124(is_dtype)
      182    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:506(is_interval_dtype)
      154    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:917(is_dtype)
      182    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:472(is_period_dtype)
        7    0.000    0.000    0.104    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1700(form_blocks)
      876    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      119    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:441(is_timedelta64_dtype)
      217    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1844(pandas_dtype)
      105    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:222(is_object_dtype)
        7    0.000    0.000    0.100    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:532(pad)
       14    0.000    0.000    0.001    0.000 {pandas._libs.lib.clean_index_list}
      161    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:403(is_datetime64tz_dtype)
      105    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:372(is_datetime64_dtype)
        7    0.000    0.000   18.359    2.623 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:448(read_single_csv)
     2827    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        7    0.000    0.000   18.074    2.582 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:2035(read)
    69/14    0.000    0.000    0.000    0.000 {built-in method _abc._abc_subclasscheck}
        7    0.000    0.000    0.102    0.015 c:\Users\clega\Desktop\vibration\sandbox.py:25(func_csv)
       98    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:987(is_datetime64_any_dtype)
       84    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5393(maybe_extract_name)
       49    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1088(maybe_castable)
       56    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1435(is_bool_dtype)
       42    0.000    0.000    0.096    0.002 <__array_function__ internals>:2(concatenate)
      157    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_asarray.py:14(asarray)
       56    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:506(_try_cast)
       63    0.000    0.000    0.013    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5293(ensure_index)
        7    0.000    0.000   18.209    2.601 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1131(read)
       91    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:307(_name_includes_bit_suffix)
       54    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:360(issubdtype)
       63    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:252(is_sparse)
    63/56    0.000    0.000    0.002    0.000 {built-in method builtins.all}
       84    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1672(_get_dtype)
       49    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\common.py:219(asarray_tuplesafe)
        7    0.000    0.000    0.134    0.019 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\construction.py:213(init_dict)
       21    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\urllib\parse.py:361(urlparse)
       35    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\generic.py:5276(__setattr__)
       21    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:3027(make_block)
        7    0.000    0.000    0.100    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1811(_multi_blockify)
        4    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\sre_parse.py:469(_parse)
      108    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:286(issubclass_)
       21    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:118(__init__)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\range.py:83(__new__)
        3    0.000    0.000    0.027    0.009 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:350(check_array)
      131    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\abc.py:137(__instancecheck__)
      119    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\inference.py:358(is_hashable)
        7    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:3857(_reduce)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:212(_rebuild_blknos_and_blklocs)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:457(_as_pairs)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\common.py:144(get_filepath_or_buffer)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:3911(__getitem__)
        7    0.000    0.000    0.023    0.003 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:792(__init__)
       84    0.000    0.000    0.001    0.000 {pandas._libs.lib.is_list_like}
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:939(_clean_options)
       63    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:339(is_categorical)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1831(_asarray_compat)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:251(mgr_locs)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\stride_tricks.py:114(_broadcast_to)
       21    0.000    0.000    0.000    0.000 {pandas._libs.lib.infer_dtype}
        7    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\construction.py:300(_homogenize)
        7    0.000    0.000    0.003    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\missing.py:225(_isna_ndarraylike)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:830(is_signed_integer_dtype)
        7    0.000    0.000   18.256    2.608 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:530(parser_f)
       14    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1209(maybe_cast_to_datetime)
        7    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\nanops.py:234(_get_values)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\common.py:40(is_url)
        7    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:122(__init__)
       42    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:887(is_unsigned_integer_dtype)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:885(_get_options_with_defaults)
       91    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:24(_kind_name)
       14    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:4046(equals)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\urllib\parse.py:412(urlsplit)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:376(_set_axis)
        7    0.000    0.000    0.000
######### PROFILING bar #########
         5890 function calls (5849 primitive calls) in 2.768 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.570    2.570    2.586    2.586 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
       41    0.042    0.001    0.042    0.001 {built-in method numpy.array}
       16    0.038    0.002    0.038    0.002 {method 'reduce' of 'numpy.ufunc' objects}
    27/16    0.021    0.001    0.138    0.009 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.017    0.017    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:176(_var)
        1    0.014    0.014    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:86(_pad_simple)
        1    0.014    0.014    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1828(_stack_arrays)
        1    0.009    0.009    0.038    0.038 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:1421(nanvar)
        2    0.007    0.003    0.036    0.018 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:68(_replace_nan)
        1    0.007    0.007    0.076    0.076 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:710(_incremental_mean_and_var)
        1    0.006    0.006    0.025    0.025 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:780(transform)
        1    0.004    0.004    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1864(__init__)
        1    0.003    0.003    2.613    2.613 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:416(_read)
        1    0.003    0.003    2.631    2.631 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:491(get_data)
        1    0.003    0.003    0.028    0.028 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:232(_std)
     1141    0.001    0.000    0.002    0.000 {built-in method builtins.isinstance}
      574    0.001    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\generic.py:10(_check)
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
      164    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1708(_is_dtype_type)
      116    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1565(is_extension_array_dtype)
      158    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\base.py:247(is_dtype)
       95    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:542(is_categorical_dtype)
       65    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:775(is_integer_dtype)
      117    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:75(find)
      7/3    0.000    0.000    0.002    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:276(__new__)
        1    0.000    0.000    2.768    2.768 c:\Users\clega\Desktop\vibration\sandbox.py:80(bar)
        8    0.000    0.000    0.000    0.000 {built-in method numpy.empty}
       39    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1401(is_float_dtype)
       87    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:208(<lambda>)
       77    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:216(<lambda>)
       87    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:206(classes)
        2    0.000    0.000    0.023    0.012 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:350(check_array)
       77    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:211(classes_and_not_datetimelike)
      876    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        9    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\blocks.py:2981(get_block_type)
        8    0.000    0.000    0.001    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:388(sanitize_array)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:321(_name_get)
       36    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
      144    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      3/2    0.000    0.000    0.002    0.001 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\series.py:183(__init__)
        4    0.000    0.000    0.075    0.019 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\extmath.py:681(_safe_accumulator_op)
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:1124(is_dtype)
       15    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:222(is_object_dtype)
   108/84    0.000    0.000    0.000    0.000 {built-in method builtins.len}
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\dtypes.py:917(is_dtype)
       26    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:506(is_interval_dtype)
        1    0.000    0.000    2.628    2.628 c:\users\clega\desktop\datasetops\src\datasetops\loaders.py:448(read_single_csv)
       17    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:441(is_timedelta64_dtype)
       26    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:472(is_period_dtype)
        2    0.000    0.000    0.009    0.005 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\utils\validation.py:37(_assert_all_finite)
        1    0.000    0.000    0.015    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1700(form_blocks)
       31    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1844(pandas_dtype)
        1    0.000    0.000    0.014    0.014 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:532(pad)
       15    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:372(is_datetime64_dtype)
       23    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:403(is_datetime64tz_dtype)
       11    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:360(issubdtype)
        8    0.000    0.000    0.027    0.003 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:70(_wrapreduction)
        6    0.000    0.000    0.014    0.002 <__array_function__ internals>:2(concatenate)
      420    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.015    0.015 c:\Users\clega\Desktop\vibration\sandbox.py:25(func_csv)
        1    0.000    0.000    2.631    2.631 c:\users\clega\desktop\datasetops\src\datasetops\dataset.py:235(__getitem__)
        1    0.000    0.000    0.081    0.081 C:\ProgramData\Miniconda3\lib\site-packages\sklearn\preprocessing\_data.py:671(partial_fit)
        2    0.000    0.000    0.000    0.000 {pandas._libs.lib.clean_index_list}
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_ufunc_config.py:32(seterr)
       22    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\numerictypes.py:286(issubclass_)
        7    0.000    0.000    0.027    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:2105(sum)
       24    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_asarray.py:14(asarray)
        2    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\nanfunctions.py:183(_divide_by_count)
        1    0.000    0.000    2.586    2.586 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:2035(read)
        7    0.000    0.000    0.027    0.004 <__array_function__ internals>:2(sum)
       14    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:987(is_datetime64_any_dtype)
        6    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1831(_asarray_compat)
       12    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5393(maybe_extract_name)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\cast.py:1088(maybe_castable)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1435(is_bool_dtype)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_dtype.py:307(_name_includes_bit_suffix)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_ufunc_config.py:132(geterr)
        9    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:252(is_sparse)
        9    0.000    0.000    0.002    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:5293(ensure_index)
        8    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\construction.py:506(_try_cast)
       21    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\abc.py:137(__instancecheck__)
        7    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\common.py:219(asarray_tuplesafe)
        1    0.000    0.000    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\_methods.py:143(_mean)
       12    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\common.py:1672(_get_dtype)
        1    0.000    0.000    0.015    0.015 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:1811(_multi_blockify)
        1    0.000    0.000    2.613    2.613 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:530(parser_f)
        2    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:199(_is_single_block)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\numpy\lib\arraypad.py:457(_as_pairs)
        5    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\generic.py:5276(__setattr__)
        1    0.000    0.000    0.004    0.004 <__array_function__ internals>:2(mean)
       17    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\dtypes\inference.py:358(is_hashable)
        1    0.000    0.000    2.605    2.605 C:\ProgramData\Miniconda3\lib\site-packages\pandas\io\parsers.py:1131(read)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:212(_rebuild_blknos_and_blklocs)
        1    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\internals\managers.py:798(as_array)
       21    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
        1    0.000    0.000    0.004    0.004 C:\ProgramData\Miniconda3\lib\site-packages\numpy\core\fromnumeric.py:3244(mean)
       13    0.000    0.000    0.000    0.000 C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py:615(__len__)
        3    0.000    0.000    0.000    0.000 C:\ProgramData\Min

Fix readthedocs autoapi issues

Currently, the documentation build fails because the autoapi extension is not available on readthedocs.

It seems it should be possible to make the required packages available by doing either of the following:

  1. pip install .
  2. pip install -r my_dev_requirements.txt

Remove `extend` from Loader

Currently, there may be issues if a Loader dataset has no elements. Methods that require the item shape (named, for instance) will not work.
There is currently a temptation to create datasets as follows:

ds = Loader(getitems_fn).named("image", "label")
ds.extend(ids)

This fails because named is only valid after extend has been called.

Removing extend from the Loader and instead passing ids via the constructor would avoid this scenario. It also conforms better to our otherwise functional API:

ds = Loader(getitems_fn, ids).named("image", "label")
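
To illustrate why passing ids through the constructor sidesteps the problem, here is a minimal sketch. The `Loader` class below is a simplified, hypothetical stand-in and not the library's actual implementation; its attribute names and error messages are assumptions made for the example.

```python
from typing import Any, Callable, Sequence, Tuple


class Loader:
    """Hypothetical, simplified stand-in for the library's Loader."""

    def __init__(self, getitem_fn: Callable[[Any], Tuple], ids: Sequence[Any]):
        # ids are fixed at construction time, so the dataset is never
        # observed in a half-initialised, empty state.
        self._getitem_fn = getitem_fn
        self._ids = list(ids)
        self._names: Tuple[str, ...] = ()

    def __len__(self) -> int:
        return len(self._ids)

    def __getitem__(self, index: int) -> Tuple:
        return self._getitem_fn(self._ids[index])

    def named(self, *names: str) -> "Loader":
        # The item shape can be inspected here because ids already exist.
        if len(self) == 0:
            raise ValueError("cannot name the items of an empty dataset")
        if len(names) != len(self[0]):
            raise ValueError("expected one name per item element")
        # Return a new object instead of mutating, in line with a functional API.
        renamed = Loader(self._getitem_fn, self._ids)
        renamed._names = names
        return renamed


ds = Loader(lambda i: (f"img_{i}.png", i % 2), ids=[0, 1, 2]).named("image", "label")
```

With no `extend` method there is no ordering between construction and naming that a caller can get wrong.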

Slow tests

Currently, a small number of tests take a disproportionately large share of the execution time.
Running pytest --durations=0 reveals:

33.56s call     tests/datasetops_tests/test_caching.py::test_cache
4.22s call     tests/datasetops_tests/test_loaders.py::test_tfds
3.74s call     tests/datasetops_tests/test_datasets.py::test_to_tensorflow
2.19s call     tests/datasetops_tests/test_examples.py::test_readme_example_2
1.13s call     tests/datasetops_tests/test_stream_dataset.py::test_read_from_file
0.72s call     tests/datasetops_tests/test_datasets.py::test_to_pytorch
0.63s call     tests/datasetops_tests/test_examples.py::test_domain_adaptation
0.53s call     tests/datasetops_tests/test_transformation_graph.py::test_serialization_not_same
0.29s call     tests/datasetops_tests/test_datasets.py::TestSubsample::test_subsample
0.26s call     tests/datasetops_tests/test_transformation_graph.py::test_operation_origins
0.22s call     tests/datasetops_tests/test_transformation_graph.py::test_serialization_same
0.13s call     tests/datasetops_tests/test_datasets.py::test_image_to_tensorflow
0.06s call     tests/datasetops_tests/test_caching.py::test_cacheable
0.05s call     tests/datasetops_tests/test_transformation_graph.py::test_common_nodes_equality
0.04s call     tests/datasetops_tests/test_transformation_graph.py::test_roots_kitti
0.03s call     tests/datasetops_tests/test_loaders.py::test_mat_single_with_multi_data
0.02s call     tests/datasetops_tests/test_loaders.py::test_pytorch
0.02s call     tests/datasetops_tests/test_examples.py::test_readme_example_1
0.02s call     tests/datasetops_tests/test_loaders.py::TestLoadCSV::test_names_missing
0.02s call     tests/datasetops_tests/test_transformation_graph.py::test_roots_tfds
0.01s call     tests/datasetops_tests/test_datasets.py::test_image_resize
0.01s call     tests/datasetops_tests/test_scaler.py::test_center
0.01s call     tests/datasetops_tests/test_datasets.py::TestSubsample::test_caching
0.01s call     tests/datasetops_tests/test_scaler.py::test_item_stats
0.01s call     tests/datasetops_tests/test_loaders.py::test_folder_dataset_class_data

I see two options.

  1. Make them faster
  2. Make it easy to run only the fast tests

If we go with option 2, we could do:

# content of conftest.py

import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "slow: mark test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        # --runslow given in cli: do not skip slow tests
        return
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
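
With a conftest.py like the one above in place, individual tests would be opted in with the `slow` marker. The sketch below uses two test names from the timing list as an illustration; the bodies are elided, and which tests to mark is an assumption rather than a decision taken in this issue.

```python
# content of a test module, e.g. test_caching.py

import pytest


@pytest.mark.slow  # skipped unless pytest is invoked with --runslow
def test_cache():
    ...


def test_cacheable():  # unmarked, so it always runs
    ...
```

Running `pytest` would then skip `test_cache`, while `pytest --runslow` runs everything.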

Make `custom` private

The custom transform was originally intended to wrap a user-defined lambda to be applied to each item. However, the transform function has since been implemented in such a way that lambdas can be passed to it directly. custom is thus unnecessary and should be removed.

Standard samplers

We should consider adding a few standard splitters with automatic shuffling:

  • split_train_test(ratio=[0.8,0.2])
  • split_train_val_test(ratio=[0.65,0.15,0.2])
  • split_k_fold(k=5)
  • split_balanced(key, num_per_class, comparison_fn)

And another sampler:

  • sample_balanced(key, num_per_class, comparison_fn)
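
As a rough sketch of what the ratio-based and k-fold splitters could do internally, the functions below operate on plain index lists rather than on the library's dataset type. The helper name `split_indices`, the `seed` parameter, and the rounding strategy are assumptions for illustration, not a proposed final API.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(n: int, ratios: Sequence[float], seed: int = 42) -> List[List[int]]:
    """Shuffle range(n) and cut it into consecutive parts sized by ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last part takes the remainder so that no index is lost to rounding.
        end = n if i == len(ratios) - 1 else start + round(n * ratio)
        parts.append(indices[start:end])
        start = end
    return parts


def split_k_fold(n: int, k: int = 5, seed: int = 42) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) index pairs; each index is validated on once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold], folds[i])
        for i in range(k)
    ]


train, val, test = split_indices(100, [0.65, 0.15, 0.2])
folds = split_k_fold(10, k=5)
```

split_train_test and split_train_val_test would then be thin wrappers around the ratio-based helper with different default ratios.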

Update documentation

Many of the exposed functions have not been documented. If the library is supposed to be truly user-friendly, this should be done.

The README should also be updated with a usage example.
