neurodata / df-dn-paper

Conceptual & empirical comparisons between decision forests & deep networks

Home Page: https://dfdn.neurodata.io

License: MIT License

Python 3.08% Jupyter Notebook 93.12% TeX 3.77% Makefile 0.02% Batchfile 0.02%
decision-trees random-forests deep-neural-networks classification deep-learning machine-learning

df-dn-paper's People

Contributors

adwaykanhere avatar michael-ainsworth avatar mkusman1 avatar nogamudrik avatar pssf23 avatar ypeng22 avatar


df-dn-paper's Issues

Update Tabular code to be compatible with new classifiers

Currently, the tabular data code is written specifically to work with DL and RF, in that order, so any change to the number of models or to their order requires edits throughout the code. Saving each model's results in a dictionary keyed by model name, rather than in ordered lists, may solve this problem.
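The dictionary-based approach could look like the following minimal sketch (the `evaluate_all` name and the model interface are illustrative, not taken from the repo code):

```python
# Hypothetical sketch: collect per-model results in a dict keyed by model
# name, so adding, removing, or reordering classifiers does not require
# changes to downstream code that previously relied on list positions.
def evaluate_all(models, X_train, y_train, X_test, y_test):
    """Fit each model and collect its test accuracy under its own name."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = model.score(X_test, y_test)
    return results
```

Downstream plotting or table code can then look models up by name instead of by index, so the order in which classifiers are registered no longer matters.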

Consider hyperparameter tuning with the general Tabular domain

Specifically, use a held-out set of tabular datasets to tune the hyperparameters. Possibly implement 5-fold cross-validation (tune on 4 folds, test on 1) to evaluate classifier performance. Within each dataset, performance is likewise computed with 5-fold cross-validation.
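As a per-dataset illustration only, the nested scheme might be sketched with scikit-learn as follows (`tune_and_evaluate` and `param_grid` are hypothetical names; the dataset-level held-out tuning proposed above is not reproduced here):

```python
# Sketch of nested cross-validation: an outer 5-fold split evaluates the
# classifier, while an inner search tunes hyperparameters on the 4 training
# folds of each outer split. Names and the grid are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

def tune_and_evaluate(X, y, param_grid, n_splits=5, seed=0):
    """Tune on 4 folds via an inner grid search, test on the held-out fold."""
    outer = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X):
        search = GridSearchCV(
            RandomForestClassifier(random_state=seed),
            param_grid,
            cv=4,  # inner tuning uses only the 4 outer training folds
        )
        search.fit(X[train_idx], y[train_idx])
        scores.append(search.score(X[test_idx], y[test_idx]))
    return np.mean(scores)
```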

BUG Error in combining train & valid loader for Vision

When creating train+val loaders for tuning CNNs on vision data, the following output and traceback are produced:

[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.FLOAT for parameter lr. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.FLOAT for parameter momentum. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.INT for parameter epoch. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Inferred value type of ParameterType.STRING for parameter optimizer. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 05-12 21:06:37] ax.service.utils.instantiation: Created search space: SearchSpace(parameters=[RangeParameter(name='lr', parameter_type=FLOAT, range=[1e-06, 0.4], log_scale=True), RangeParameter(name='momentum', parameter_type=FLOAT, range=[0.0, 1.0]), RangeParameter(name='epoch', parameter_type=INT, range=[15, 40]), ChoiceParameter(name='optimizer', parameter_type=STRING, values=['SGD', 'Adam'], is_ordered=False, sort_values=False)], parameter_constraints=[]).
[INFO 05-12 21:06:37] ax.modelbridge.dispatch_utils: Using Bayesian optimization since there are more ordered parameters than there are categories for the unordered categorical parameters.
[INFO 05-12 21:06:37] ax.modelbridge.dispatch_utils: Using Bayesian Optimization generation strategy: GenerationStrategy(name='Sobol+GPEI', steps=[Sobol for 8 trials, GPEI for subsequent trials]). Iterations after 8 will take longer to generate due to  model-fitting.
[INFO 05-12 21:06:37] ax.service.managed_loop: Started full optimization with 20 steps.
[INFO 05-12 21:06:37] ax.service.managed_loop: Running optimization trial 1...
[INFO 05-12 21:06:39] ax.service.managed_loop: Running optimization trial 2...
[INFO 05-12 21:06:40] ax.service.managed_loop: Running optimization trial 3...
[INFO 05-12 21:06:41] ax.service.managed_loop: Running optimization trial 4...
[INFO 05-12 21:06:43] ax.service.managed_loop: Running optimization trial 5...
[INFO 05-12 21:06:44] ax.service.managed_loop: Running optimization trial 6...
[INFO 05-12 21:06:45] ax.service.managed_loop: Running optimization trial 7...
[INFO 05-12 21:06:47] ax.service.managed_loop: Running optimization trial 8...
[INFO 05-12 21:06:48] ax.service.managed_loop: Running optimization trial 9...
[INFO 05-12 21:06:51] ax.service.managed_loop: Running optimization trial 10...
[INFO 05-12 21:06:52] ax.service.managed_loop: Running optimization trial 11...
[INFO 05-12 21:06:54] ax.service.managed_loop: Running optimization trial 12...
[INFO 05-12 21:06:56] ax.service.managed_loop: Running optimization trial 13...
[INFO 05-12 21:06:58] ax.service.managed_loop: Running optimization trial 14...
[INFO 05-12 21:07:00] ax.service.managed_loop: Running optimization trial 15...
[INFO 05-12 21:07:02] ax.service.managed_loop: Running optimization trial 16...
[INFO 05-12 21:07:04] ax.service.managed_loop: Running optimization trial 17...
[INFO 05-12 21:07:06] ax.service.managed_loop: Running optimization trial 18...
[INFO 05-12 21:07:08] ax.service.managed_loop: Running optimization trial 19...
[INFO 05-12 21:07:10] ax.service.managed_loop: Running optimization trial 20...
[INFO 05-12 21:07:10] ax.modelbridge.base: Untransformed parameter 0.40000000000000013 greater than upper bound 0.4, clamping
[WARNING 05-12 21:07:12] ax.modelbridge.cross_validation: Metric accuracy was unable to be reliably fit.
[WARNING 05-12 21:07:12] ax.service.utils.best_point: Model fit is poor; falling back on raw data for best point.
[WARNING 05-12 21:07:12] ax.service.utils.best_point: Model fit is poor and data on objective metric accuracy is noisy; interpret best points results carefully.
(array([], dtype=int64),)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
[<ipython-input-15-8f0b5b112073>](https://localhost:8080/#) in <module>()
     44 )
     45 
---> 46 run_cnn32()

[<ipython-input-14-5a5512f0afa6>](https://localhost:8080/#) in run_cnn32()
    369             for i in train_valid_indices:
    370               print(np.where(classes_new == dataset_copy.targets[i]))
--> 371               dataset_copy.targets[i] = np.where(classes_new == dataset_copy.targets[i])[0][0]
    372 
    373             train_valid_sampler = torch.utils.data.sampler.SubsetRandomSampler(train_valid_indices)

IndexError: index 0 is out of bounds for axis 0 with size 0
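The empty `np.where` result printed just before the crash suggests that `dataset_copy.targets[i]` never matches any entry of `classes_new`, e.g. because of a tensor/int type mismatch in the comparison. One possible guard (a hypothetical helper, not the repo's code) is to normalize labels to plain ints and fail loudly when a label is missing:

```python
# Hypothetical guard for the IndexError above: build an explicit lookup from
# label value to its position in classes_new, normalizing tensor or numpy
# scalar labels to ints so a dtype mismatch cannot silently yield no match.
import numpy as np

def remap_targets(targets, classes_new):
    """Map each original label to its index in classes_new, failing loudly."""
    lookup = {int(c): i for i, c in enumerate(classes_new)}
    remapped = []
    for t in targets:
        t = int(t)  # normalize torch tensors / numpy scalars to plain ints
        if t not in lookup:
            raise ValueError(f"label {t} not found in classes_new {classes_new}")
        remapped.append(lookup[t])
    return remapped
```

A `ValueError` naming the offending label would make the mismatch obvious, instead of the opaque `IndexError` raised by indexing an empty `np.where` result.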

Stratify Tabular sample sets across class labels

Code that needs to be changed:

from random import sample

def random_sample_new(data, training_sample_sizes):
    """
    Given X_data and a list of training sample sizes, randomly sample indices to be used.
    Larger sample sizes include all indices from smaller sample sizes.
    """
    temp_inds = []
    ordered = [i for i in range(len(data))]
    minus = 0
    for ss in range(len(training_sample_sizes)):
        x = sorted(sample(ordered, training_sample_sizes[ss] - minus))
        minus += len(x)
        temp_inds.append(x)
        ordered = list(set(ordered) - set(x))

    final_inds = []
    temp = []
    for i in range(len(temp_inds)):
        cur = temp_inds[i]
        final_inds.append(sorted(cur + temp))
        temp = sorted(cur + temp)

    return final_inds

Code that can be used as a reference:

partitions = np.array_split(np.array(range(samples)), num_classes)
# Obtain only train images and labels for selected classes
image_ls = []
label_ls = []
i = 0
for cls in classes:
    class_idx = np.argwhere(train_labels == cls).flatten()
    np.random.shuffle(class_idx)
    class_img = train_images[class_idx[: len(partitions[i])]]
    image_ls.append(class_img)
    label_ls.append(np.repeat(cls, len(partitions[i])))
    i += 1
train_images = np.concatenate(image_ls)
train_labels = np.concatenate(label_ls)
# Obtain only test images and labels for selected classes
image_ls = []
label_ls = []
for cls in classes:
    image_ls.append(test_images[test_labels == cls])
    label_ls.append(np.repeat(cls, np.sum(test_labels == cls)))
test_images = np.concatenate(image_ls)
test_labels = np.concatenate(label_ls)

@NogaMudrik
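One possible stratified replacement, following the per-class pattern of the reference code above (a sketch assuming both per-class balance and nesting of sample sets are wanted; `stratified_sample_new` is a hypothetical name):

```python
# Hypothetical stratified variant of random_sample_new: sample indices per
# class so every sample size keeps the class balance, while larger sample
# sizes still contain all indices from the smaller ones.
import numpy as np

def stratified_sample_new(labels, training_sample_sizes, seed=0):
    """Return nested index sets, each balanced across class labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One shuffled index pool per class; nested sets take growing prefixes
    # of the same permutation, so smaller sets are subsets of larger ones.
    pools = {c: rng.permutation(np.flatnonzero(labels == c)) for c in classes}
    final_inds = []
    for size in training_sample_sizes:
        per_class = size // len(classes)
        inds = np.concatenate([pools[c][:per_class] for c in classes])
        final_inds.append(sorted(inds.tolist()))
    return final_inds
```

This assumes sample sizes divisible by the number of classes; a remainder-distribution step would be needed otherwise.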

Switching audio classification to use FSDKaggle2018 Dataset and recording metrics for all classifiers in fsdd.py

A switch from the FSDD dataset is proposed for the following reasons:

  • Increased size: FSDD comprises 3,000 audio samples over 10 classes, while FSDKaggle2018 comprises 11,073 audio samples over 41 classes. When creating a balanced dataset, we subset FSDKaggle2018 to the classes with 300 samples per class, giving 5,400 samples over 18 classes.
  • Increased complexity: FSDD contains only spoken-digit sounds, with little inherent complexity in the audio beyond speaker variation. FSDKaggle2018 contains audio files from a variety of sources, making it a more diverse and general-purpose audio dataset for benchmarking.

Consider saving raw predictions in benchmarks

Saving the raw predictions from classification tasks makes it easier to change, add, or remove evaluation metrics later. Because the test sets are randomly generated, the test labels need to be saved at the same time.
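A minimal sketch of the proposal (function and file names are illustrative, not from the repo):

```python
# Hypothetical helpers: persist raw predictions together with the matching
# test labels, so any metric can be recomputed later without rerunning the
# benchmark or regenerating the random test split.
import numpy as np

def save_run(path_prefix, y_pred, y_test):
    """Save predictions and the labels of the randomly generated test set."""
    np.save(f"{path_prefix}_preds.npy", np.asarray(y_pred))
    np.save(f"{path_prefix}_labels.npy", np.asarray(y_test))

def load_run(path_prefix):
    """Reload a saved run for re-scoring with new or changed metrics."""
    preds = np.load(f"{path_prefix}_preds.npy")
    labels = np.load(f"{path_prefix}_labels.npy")
    return preds, labels
```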
