neurodata / df-dn-paper

Conceptual & empirical comparisons between decision forests & deep networks

Home Page: https://dfdn.neurodata.io

License: MIT License

Python 3.08% Jupyter Notebook 93.12% TeX 3.77% Makefile 0.02% Batchfile 0.02%
decision-trees random-forests deep-neural-networks classification deep-learning machine-learning

df-dn-paper's People

Contributors

adwaykanhere avatar michael-ainsworth avatar mkusman1 avatar nogamudrik avatar pssf23 avatar ypeng22 avatar


df-dn-paper's Issues

Update Tabular code to be compatible with new classifiers

Currently, the tabular data code is written specifically to work with DL and RF, in that order, so any change to the number of models or to their order requires edits throughout the code. Saving each model's results in a dictionary keyed by model name, rather than in ordered lists, may solve this problem.
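The dictionary-based approach could look like the following minimal sketch (the `evaluate_all` name and the model interface are illustrative, not taken from the repo code):

```python
# Hypothetical sketch: collect per-model results in a dict keyed by model
# name, so adding, removing, or reordering classifiers does not require
# changes to downstream code that previously relied on list positions.
def evaluate_all(models, X_train, y_train, X_test, y_test):
    """Fit each model and collect its test accuracy under its own name."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = model.score(X_test, y_test)
    return results
```

Downstream plotting or table code can then look models up by name instead of by index, so the order in which classifiers are registered no longer matters.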

Consider hyperparameter tuning with the general Tabular domain

Specifically, use a held-out set of tabular datasets to tune the hyperparameters. Possibly implement 5-fold cross-validation (tune on 4 folds, test on 1) to evaluate classifier performance. Within each dataset, performance is likewise computed with 5-fold cross-validation.
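As a per-dataset illustration only, the nested scheme might be sketched with scikit-learn as follows (`tune_and_evaluate` and `param_grid` are hypothetical names; the dataset-level held-out tuning proposed above is not reproduced here):

```python
# Sketch of nested cross-validation: an outer 5-fold split evaluates the
# classifier, while an inner search tunes hyperparameters on the 4 training
# folds of each outer split. Names and the grid are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

def tune_and_evaluate(X, y, param_grid, n_splits=5, seed=0):
    """Tune on 4 folds via an inner grid search, test on the held-out fold."""
    outer = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X):
        search = GridSearchCV(
            RandomForestClassifier(random_state=seed),
            param_grid,
            cv=4,  # inner tuning uses only the 4 outer training folds
        )
        search.fit(X[train_idx], y[train_idx])
        scores.append(search.score(X[test_idx], y[test_idx]))
    return np.mean(scores)
```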

BUG Error in combining train & valid loader for Vision

When creating train+val loaders for tuning CNNs on vision data, the following output and traceback are produced:

[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.FLOAT for parameter lr. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.FLOAT for parameter momentum. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter epoch. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.STRING for parameter optimizer. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Created search space: SearchSpace(parameters=[RangeParameter(name='lr', parameter_type=FLOAT, range=[1e-06, 0.4], log_scale=True), RangeParameter(name='momentum', parameter_type=FLOAT, range=[0.0, 1.0]), RangeParameter(name='epoch', parameter_type=INT, range=[15, 40]), ChoiceParameter(name='optimizer', parameter_type=STRING, values=['SGD', 'Adam'], is_ordered=False, sort_values=False)], parameter_constraints=[]).
[INFO 05-12 21:06:37] ax.modelbridge.dispatch_utils: Using Bayesian optimization since there are more ordered parameters than there are categories for the unordered categorical parameters.
[INFO 05-12 21:06:37] ax.modelbridge.dispatch_utils: Using Bayesian Optimization generation strategy: GenerationStrategy(name='Sobol+GPEI', steps=[Sobol for 8 trials, GPEI for subsequent trials]). Iterations after 8 will take longer to generate due to  model-fitting.
[INFO 05-12 21:06:37] ax.service.managed_loop: Started full optimization with 20 steps.
[INFO 05-12 21:06:37] ax.service.managed_loop: Running optimization trial 1...
[INFO 05-12 21:06:39] ax.service.managed_loop: Running optimization trial 2...
[INFO 05-12 21:06:40] ax.service.managed_loop: Running optimization trial 3...
[INFO 05-12 21:06:41] ax.service.managed_loop: Running optimization trial 4...
[INFO 05-12 21:06:43] ax.service.managed_loop: Running optimization trial 5...
[INFO 05-12 21:06:44] ax.service.managed_loop: Running optimization trial 6...
[INFO 05-12 21:06:45] ax.service.managed_loop: Running optimization trial 7...
[INFO 05-12 21:06:47] ax.service.managed_loop: Running optimization trial 8...
[INFO 05-12 21:06:48] ax.service.managed_loop: Running optimization trial 9...
[INFO 05-12 21:06:51] ax.service.managed_loop: Running optimization trial 10...
[INFO 05-12 21:06:52] ax.service.managed_loop: Running optimization trial 11...
[INFO 05-12 21:06:54] ax.service.managed_loop: Running optimization trial 12...
[INFO 05-12 21:06:56] ax.service.managed_loop: Running optimization trial 13...
[INFO 05-12 21:06:58] ax.service.managed_loop: Running optimization trial 14...
[INFO 05-12 21:07:00] ax.service.managed_loop: Running optimization trial 15...
[INFO 05-12 21:07:02] ax.service.managed_loop: Running optimization trial 16...
[INFO 05-12 21:07:04] ax.service.managed_loop: Running optimization trial 17...
[INFO 05-12 21:07:06] ax.service.managed_loop: Running optimization trial 18...
[INFO 05-12 21:07:08] ax.service.managed_loop: Running optimization trial 19...
[INFO 05-12 21:07:10] ax.service.managed_loop: Running optimization trial 20...
[INFO 05-12 21:07:10] ax.modelbridge.base: Untransformed parameter 0.40000000000000013 greater than upper bound 0.4, clamping
[WARNING 05-12 21:07:12] ax.modelbridge.cross_validation: Metric accuracy was unable to be reliably fit.
[WARNING 05-12 21:07:12] ax.service.utils.best_point: Model fit is poor; falling back on raw data for best point.
[WARNING 05-12 21:07:12] ax.service.utils.best_point: Model fit is poor and data on objective metric accuracy is noisy; interpret best points results carefully.
(array([], dtype=int64),)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
[<ipython-input-15-8f0b5b112073>](https://localhost:8080/#) in <module>()
     44 )
     45 
---> 46 run_cnn32()

[<ipython-input-14-5a5512f0afa6>](https://localhost:8080/#) in run_cnn32()
    369             for i in train_valid_indices:
    370               print(np.where(classes_new == dataset_copy.targets[i]))
--> 371               dataset_copy.targets[i] = np.where(classes_new == dataset_copy.targets[i])[0][0]
    372 
    373             train_valid_sampler = torch.utils.data.sampler.SubsetRandomSampler(train_valid_indices)

IndexError: index 0 is out of bounds for axis 0 with size 0
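The empty `np.where` result printed just before the crash suggests that `dataset_copy.targets[i]` never matches any entry of `classes_new`, e.g. because of a tensor/int type mismatch in the comparison. One possible guard (a hypothetical helper, not the repo's code) is to normalize labels to plain ints and fail loudly when a label is missing:

```python
# Hypothetical guard for the IndexError above: build an explicit lookup from
# label value to its position in classes_new, normalizing tensor or numpy
# scalar labels to ints so a dtype mismatch cannot silently yield no match.
import numpy as np

def remap_targets(targets, classes_new):
    """Map each original label to its index in classes_new, failing loudly."""
    lookup = {int(c): i for i, c in enumerate(classes_new)}
    remapped = []
    for t in targets:
        t = int(t)  # normalize torch tensors / numpy scalars to plain ints
        if t not in lookup:
            raise ValueError(f"label {t} not found in classes_new {classes_new}")
        remapped.append(lookup[t])
    return remapped
```

A `ValueError` naming the offending label would make the mismatch obvious, instead of the opaque `IndexError` raised by indexing an empty `np.where` result.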

Stratify Tabular sample sets across class labels

Code that needs to be changed:

from random import sample

def random_sample_new(data, training_sample_sizes):
    """
    Given X_data and a list of training sample sizes, randomly sample indices to be used.
    Larger sample sizes include all indices from smaller sample sizes.
    """
    temp_inds = []
    ordered = [i for i in range(len(data))]
    minus = 0
    for ss in range(len(training_sample_sizes)):
        x = sorted(sample(ordered, training_sample_sizes[ss] - minus))
        minus += len(x)
        temp_inds.append(x)
        ordered = list(set(ordered) - set(x))

    final_inds = []
    temp = []
    for i in range(len(temp_inds)):
        cur = temp_inds[i]
        final_inds.append(sorted(cur + temp))
        temp = sorted(cur + temp)

    return final_inds

Code that can be used as a reference:

partitions = np.array_split(np.array(range(samples)), num_classes)
# Obtain only train images and labels for selected classes
image_ls = []
label_ls = []
i = 0
for cls in classes:
    class_idx = np.argwhere(train_labels == cls).flatten()
    np.random.shuffle(class_idx)
    class_img = train_images[class_idx[: len(partitions[i])]]
    image_ls.append(class_img)
    label_ls.append(np.repeat(cls, len(partitions[i])))
    i += 1
train_images = np.concatenate(image_ls)
train_labels = np.concatenate(label_ls)
# Obtain only test images and labels for selected classes
image_ls = []
label_ls = []
for cls in classes:
    image_ls.append(test_images[test_labels == cls])
    label_ls.append(np.repeat(cls, np.sum(test_labels == cls)))
test_images = np.concatenate(image_ls)
test_labels = np.concatenate(label_ls)

@NogaMudrik
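One possible stratified replacement, following the per-class pattern of the reference code above (a sketch assuming both per-class balance and nesting of sample sets are wanted; `stratified_sample_new` is a hypothetical name):

```python
# Hypothetical stratified variant of random_sample_new: sample indices per
# class so every sample size keeps the class balance, while larger sample
# sizes still contain all indices from the smaller ones.
import numpy as np

def stratified_sample_new(labels, training_sample_sizes, seed=0):
    """Return nested index sets, each balanced across class labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One shuffled index pool per class; nested sets take growing prefixes
    # of the same permutation, so smaller sets are subsets of larger ones.
    pools = {c: rng.permutation(np.flatnonzero(labels == c)) for c in classes}
    final_inds = []
    for size in training_sample_sizes:
        per_class = size // len(classes)
        inds = np.concatenate([pools[c][:per_class] for c in classes])
        final_inds.append(sorted(inds.tolist()))
    return final_inds
```

This assumes sample sizes divisible by the number of classes; a remainder-distribution step would be needed otherwise.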

Switching audio classification to use FSDKaggle2018 Dataset and recording metrics for all classifiers in fsdd.py

A switch from the FSDD dataset is proposed for the following reasons:

  • Increased size: FSDD comprises 3,000 audio samples over 10 classes, while FSDKaggle2018 comprises 11,073 audio samples over 41 classes. When creating a balanced dataset, we subset FSDKaggle2018 to the classes with 300 samples per class, giving 5,400 samples over 18 classes.
  • Increased complexity: FSDD contains only spoken-digit sounds, with little inherent complexity in the audio beyond speaker variation. FSDKaggle2018 contains audio files from a variety of sources, making it a more diverse and general-purpose audio dataset for benchmarking.

Consider saving raw predictions in benchmarks

Saving the raw predictions from classification tasks makes it easier to change, add, or remove evaluation metrics later. Because the test sets are randomly generated, the test labels need to be saved at the same time.
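A minimal sketch of the proposal (function and file names are illustrative, not from the repo):

```python
# Hypothetical helpers: persist raw predictions together with the matching
# test labels, so any metric can be recomputed later without rerunning the
# benchmark or regenerating the random test split.
import numpy as np

def save_run(path_prefix, y_pred, y_test):
    """Save predictions and the labels of the randomly generated test set."""
    np.save(f"{path_prefix}_preds.npy", np.asarray(y_pred))
    np.save(f"{path_prefix}_labels.npy", np.asarray(y_test))

def load_run(path_prefix):
    """Reload a saved run for re-scoring with new or changed metrics."""
    preds = np.load(f"{path_prefix}_preds.npy")
    labels = np.load(f"{path_prefix}_labels.npy")
    return preds, labels
```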
