
High-quality implementations of standard and SOTA methods on a variety of tasks.

License: Apache License 2.0

Python 94.50% Jupyter Notebook 5.50%
bayesian-methods deep-learning machine-learning data-science tensorflow neural-networks statistics probabilistic-programming


Uncertainty Baselines


The goal of Uncertainty Baselines is to provide a template for researchers to build on. The baselines serve as a starting point for new ideas and applications, and as a common reference when communicating with other uncertainty and robustness researchers. This is done in three ways:

  1. Provide high-quality implementations of standard and state-of-the-art methods on standard tasks.
  2. Have minimal dependencies on other files in the codebase. Baselines should be easily forkable without relying on other baselines and generic modules.
  3. Prescribe best practices for uncertainty and robustness benchmarking.

Motivation. There are many uncertainty and robustness implementations across GitHub. However, they are typically one-off experiments for a specific paper (many papers don't even have code). There are no clear examples that uncertainty researchers can build on to quickly prototype their work. Everyone must implement their own baseline. In fact, even on standard tasks, every project differs slightly in their experiment setup, whether it be architectures, hyperparameters, or data preprocessing. This makes it difficult to compare properly against baselines.

Installation

To install the latest development version, run

pip install "git+https://github.com/google/uncertainty-baselines.git#egg=uncertainty_baselines"

There is not yet a stable version (nor an official release of this library). All APIs are subject to change. Installing uncertainty_baselines does not automatically install any backend. For TensorFlow, you will need to install TensorFlow (tensorflow or tf-nightly), TensorFlow Addons (tensorflow-addons or tfa-nightly), and TensorBoard (tensorboard or tb-nightly). See the extra dependencies one can install in setup.py.
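
As a quick post-install sanity check (optional, and not part of the library itself), you can verify that the backend packages mentioned above import cleanly:

# Confirms the TensorFlow backend pieces are importable; assumes you installed
# tensorflow, tensorflow-addons, and tensorboard (or their nightly variants).
import tensorboard
import tensorflow as tf
import tensorflow_addons as tfa

import uncertainty_baselines as ub  # Should import without raising.

print(tf.__version__, tfa.__version__, tensorboard.__version__)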

Usage

Baselines

The baselines/ directory includes all the baselines, organized by their training dataset. For example, baselines/cifar/deterministic.py is a Wide ResNet 28-10 obtaining 96.0% test accuracy on CIFAR-10.

Launching with TPUs. You often need TPUs to reproduce baselines. There are three options:

  1. Colab. Colab offers free TPUs. This is the most convenient and budget-friendly option. You can experiment with a baseline by copying its script and running it from scratch. This works well for simple experimentation. However, be careful relying on Colab long-term: TPU access isn't guaranteed, and Colab can only go so far for managing multiple long experiments.

  2. Google Cloud. This is the most flexible option. First, you'll need to create a virtual machine instance (details here).

    Here's an example to launch the BatchEnsemble baseline on CIFAR-10. We assume a few environment variables which are set up with the cloud TPU (details here).

    export BUCKET=gs://bucket-name
    export TPU_NAME=ub-cifar-batchensemble
    export DATA_DIR=$BUCKET/tensorflow_datasets
    export OUTPUT_DIR=$BUCKET/model
    
    python baselines/cifar/batchensemble.py \
        --tpu=$TPU_NAME \
        --data_dir=$DATA_DIR \
        --output_dir=$OUTPUT_DIR

    Note the TPU's accelerator type must align with the number of cores for the baseline (num_cores flag). In this example, BatchEnsemble uses a default of num_cores=8. So the TPU must be set up with accelerator_type=v3-8.

  3. Change the flags. For example, go from 8 TPU cores to 8 GPUs, or reduce the number of cores to train the baseline.

    python baselines/cifar/batchensemble.py \
        --data_dir=/tmp/tensorflow_datasets \
        --output_dir=/tmp/model \
        --use_gpu=True \
        --num_cores=8

    Results may be similar, but ultimately all bets are off. GPU vs. TPU often makes little difference in practice, especially if you use the same numerical precision. However, changing the number of cores matters a lot: the total batch size during each training step is typically determined by num_cores, so be careful (see the sketch below)!
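
    To make the batch-size point concrete, here is a small illustrative sketch. It is not code from the baselines; the flag names per_core_batch_size and num_cores follow the baseline scripts, but the exact wiring may differ per baseline.

    # Illustration only: how the effective batch size typically follows from the flags.
    per_core_batch_size = 64
    num_cores = 8
    total_batch_size = per_core_batch_size * num_cores  # 512 examples per training step.

    # Halving num_cores halves the total batch size, which usually also calls for
    # rescaling the learning rate and/or the number of training steps.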

Datasets

The ub.datasets module consists of datasets following the TensorFlow Datasets API. They add minimal logic such as default data preprocessing. Note: in an IPython/Colab notebook, one may need to enable TensorFlow eager execution via tf.compat.v1.enable_eager_execution().

import uncertainty_baselines as ub

# Load CIFAR-10, holding out 10% for validation.
dataset_builder = ub.datasets.Cifar10Dataset(split='train',
                                             validation_percent=0.1)
# FLAGS.batch_size comes from absl flags in the baseline scripts; a literal int also works.
train_dataset = dataset_builder.load(batch_size=FLAGS.batch_size)
for batch in train_dataset:
  ...  # Apply code over each batch of the data.

You can also use ub.datasets.get to instantiate datasets from strings (e.g., command-line flags).

dataset_builder = ub.datasets.get(dataset_name, split=split, **dataset_kwargs)

To use the datasets in JAX or PyTorch, convert each batch to NumPy:

import tensorflow_datasets as tfds

# ds is a tf.data.Dataset, e.g. the train_dataset loaded above.
for batch in tfds.as_numpy(ds):
  train_step(batch)

Note that tfds.as_numpy calls tensor.numpy(), which makes an unnecessary copy compared to tensor._numpy(). To avoid the copy:

import jax

for batch in iter(ds):
  train_step(jax.tree_map(lambda y: y._numpy(), batch))

Models

The ub.models module consists of models following the tf.keras.Model API.

import uncertainty_baselines as ub

model = ub.models.wide_resnet(input_shape=(32, 32, 3),
                              depth=28,
                              width_multiplier=10,
                              num_classes=10,
                              l2=1e-4)
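
As a quick smoke test (an illustration, not part of the library's documented usage), you can run a forward pass on random inputs; the expected output shape below assumes the model returns one logit per class:

import numpy as np

# A random batch of 8 CIFAR-sized images.
images = np.random.rand(8, 32, 32, 3).astype('float32')
logits = model(images, training=False)
print(logits.shape)  # Expected: (8, 10), one score per class.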

Metrics

We define the metrics used across datasets below. All results are reported to roughly 3 significant figures and averaged over 10 runs. (A short NumPy sketch of the accuracy, calibration error, and NLL definitions follows the list.)

  1. # Parameters. Number of parameters in the model to make predictions after training.

  2. Test Accuracy. Accuracy over the test set. For a dataset of N input-output pairs (xn, yn) where the label yn takes on 1 of K values, the accuracy is

    1/N \sum_{n=1}^N 1[ \argmax_y p(y | xn) = yn ],

    where 1 is the indicator function that is 1 when the model's predicted class is equal to the label and 0 otherwise.

  3. Test Cal. Error. Expected calibration error (ECE) over the test set (Naeini et al., 2015). ECE discretizes the probability interval [0, 1] into equally spaced bins and assigns each predicted probability to the bin that encompasses it. The calibration error of a bin is the absolute difference between the fraction of predictions in the bin that are correct (accuracy) and the mean of the probabilities in the bin (confidence). The expected calibration error is the weighted average of these differences across bins.

    For a dataset of N input-output pairs (xn, yn) where the label yn takes on 1 of K values, ECE computes a weighted average

    \sum_{b=1}^B n_b / N | acc(b) - conf(b) |,

    where B is the number of bins, n_b is the number of predictions in bin b, and acc(b) and conf(b) are the accuracy and confidence of bin b, respectively.

  4. Test NLL. Negative log-likelihood over the test set (measured in nats). For a dataset of N input-output pairs (xn, yn), the negative log-likelihood is

    -1/N \sum_{n=1}^N \log p(yn | xn).

    It is equivalent, up to a constant, to the KL divergence from the true data distribution to the model, therefore capturing the overall goodness of fit to the true distribution (Murphy, 2012). It can also be interpreted as the number of nats needed to explain the data (Grunwald, 2004).

  5. Train/Test Runtime. Training runtime is the total wall-clock time to train the model, including any intermediate test set evaluations. Test Runtime refers to the time it takes to run a forward pass on the GPU/TPU, i.e., the duration for which the device is not idle. Note that Test Runtime does not include time on the coordinator: this is more precise in comparing baselines because including the coordinator adds overhead in GPU/TPU scheduling and data fetching---producing high variance results.
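
The following is a minimal NumPy sketch of the accuracy, ECE, and NLL definitions above, assuming probs is an [N, K] array of predicted class probabilities and labels is an [N] array of integer labels. It is for illustration only; the baselines compute these metrics with their own utilities.

import numpy as np

def accuracy(probs, labels):
  """Fraction of examples whose argmax prediction matches the label."""
  return np.mean(probs.argmax(axis=-1) == labels)

def negative_log_likelihood(probs, labels):
  """Mean NLL in nats: -1/N sum_n log p(yn | xn)."""
  return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def expected_calibration_error(probs, labels, num_bins=15):
  """ECE with equally spaced confidence bins."""
  confidences = probs.max(axis=-1)
  correct = (probs.argmax(axis=-1) == labels).astype(np.float64)
  bin_edges = np.linspace(0., 1., num_bins + 1)
  ece = 0.
  for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    in_bin = (confidences > lo) & (confidences <= hi)
    if in_bin.any():
      # Contribution of bin b: n_b / N * | acc(b) - conf(b) |.
      ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
  return ece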

Viewing metrics. Uncertainty Baselines writes TensorFlow summaries to the model_dir, which can be consumed by TensorBoard. This includes the TensorBoard hyperparameters plugin, which can be used to analyze hyperparameter tuning sweeps.

If you wish to upload to the PUBLICLY READABLE tensorboard.dev, use:

tensorboard dev upload --logdir MODEL_DIR --plugins "scalars,graphs,hparams" --name "My experiment" --description "My experiment details"

References

If you'd like to cite Uncertainty Baselines, use the following BibTeX entry.

Z. Nado, N. Band, M. Collier, J. Djolonga, M. Dusenberry, S. Farquhar, A. Filos, M. Havasi, R. Jenatton, G. Jerfel, J. Liu, Z. Mariet, J. Nixon, S. Padhy, J. Ren, T. Rudner, Y. Wen, F. Wenzel, K. Murphy, D. Sculley, B. Lakshminarayanan, J. Snoek, Y. Gal, and D. Tran. Uncertainty Baselines: Benchmarks for uncertainty & robustness in deep learning, arXiv preprint arXiv:2106.04015, 2021.

@article{nado2021uncertainty,
  author = {Zachary Nado and Neil Band and Mark Collier and Josip Djolonga and Michael Dusenberry and Sebastian Farquhar and Angelos Filos and Marton Havasi and Rodolphe Jenatton and Ghassen Jerfel and Jeremiah Liu and Zelda Mariet and Jeremy Nixon and Shreyas Padhy and Jie Ren and Tim Rudner and Yeming Wen and Florian Wenzel and Kevin Murphy and D. Sculley and Balaji Lakshminarayanan and Jasper Snoek and Yarin Gal and Dustin Tran},
  title = {{Uncertainty Baselines}:  Benchmarks for Uncertainty \& Robustness in Deep Learning},
  journal = {arXiv preprint arXiv:2106.04015},
  year = {2021},
}

Papers using Uncertainty Baselines

The following papers have used code from Uncertainty Baselines:

  1. A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection
  2. BatchEnsemble: An Alternative Approach to Efficient Ensembles and Lifelong Learning
  3. DEUP: Direct Epistemic Uncertainty Prediction
  4. Distilling Ensembles Improves Uncertainty Estimates
  5. Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors
  6. Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit
  7. Hyperparameter Ensembles for Robustness and Uncertainty Quantification
  8. Measuring Calibration in Deep Learning
  9. Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation
  10. Neural networks with late-phase weights
  11. On the Practicality of Deterministic Epistemic Uncertainty
  12. Prediction-Time Batch Normalization for Robustness under Covariate Shift
  13. Refining the variational posterior through iterative optimization
  14. Revisiting One-vs-All Classifiers for Predictive Uncertainty and Out-of-Distribution Detection in Neural Networks
  15. Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness
  16. Training independent subnetworks for robust prediction
  17. Plex: Towards Reliability Using Pretrained Large Model Extensions, available here

Contributing

Formatting Code

Before committing code, make sure the file is formatted with yapf using the yapf style:

yapf -i --style yapf [source file]

Adding a Baseline

  1. Write a script that loads the fixed training dataset and model. Typically, this is forked from other baselines.
  2. After tuning, set the default flag values to the best hyperparameters.
  3. Add the baseline's performance to the table of results in the corresponding README.md.

Adding a Dataset

  1. Add the bibtex reference to references.md.
  2. Add the dataset definition to the datasets/ dir. Every file should have a subclass of datasets.base.BaseDataset, which at a minimum requires implementing a constructor, a tfds.core.DatasetBuilder, and _create_process_example_fn.
  3. Add a test that at a minimum constructs the dataset and checks the shapes of elements.
  4. Add the dataset to datasets/datasets.py for easy access.
  5. Add the dataset class to datasets/__init__.py.

For an example of adding a dataset, see this pull request.

Adding a Model

  1. Add the bibtex reference to references.md.

  2. Add the model definition to the models/ dir. Every file should have a create_model function with the following signature (a hypothetical minimal example follows this list):

    def create_model(
        batch_size: int,
        ...
        **unused_kwargs: Dict[str, Any]) -> tf.keras.models.Model:
  3. Add a test that at a minimum constructs the model and does a forward pass.

  4. Add the model to models/models.py for easy access.

  5. Add the create_model function to models/__init__.py.
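
Below is a hypothetical minimal example following this convention. The layer stack and the num_classes argument are placeholders for illustration, not an existing baseline model:

from typing import Any, Dict

import tensorflow as tf

def create_model(batch_size: int,
                 num_classes: int = 10,
                 **unused_kwargs: Dict[str, Any]) -> tf.keras.models.Model:
  """Builds a small placeholder MLP classifier."""
  inputs = tf.keras.layers.Input(shape=(32, 32, 3), batch_size=batch_size)
  x = tf.keras.layers.Flatten()(inputs)
  x = tf.keras.layers.Dense(128, activation='relu')(x)
  logits = tf.keras.layers.Dense(num_classes)(x)
  return tf.keras.models.Model(inputs=inputs, outputs=logits, name='mlp')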

uncertainty-baselines's Issues

Potentially swap to SWA's learning rate schedule for CIFAR baselines

@dustinvtran
From preliminary experiments, the LR schedule from the SWA papers (https://github.com/timgaripov/swa/blob/master/train.py#L94) seems to improve the baseline results (at least for deterministic and dropout). Upgrading to that schedule may close the gap: our deterministic baseline reproduces the original paper's 96.0% (and we get 0.154 NLL), while their papers' baseline reports 96.4% and 0.12 NLL. (Same for CIFAR-100.)

google/edward2#233

BatchEnsemble and mini-batches

Hi guys,

When reading the BatchEnsemble paper, I get the impression that each model of the ensemble is trained with a different part of the mini-batch:

Section 3.1: "To match the input and the ensemble weight, we can divide the input mini-batch into M sub-batches and each sub-batch receives ensemble weight"

Appendix B: "Also note that the scheme that each ensemble member is trained with different sub-batch of input can encourage diversity as well"

However, in the current implementation, each model of the ensemble is trained on the same mini-batch, because the mini-batch is replicated before it is sent to the BatchEnsemble model:

images = tf.tile(images, [FLAGS.ensemble_size, 1, 1, 1])
labels = tf.tile(labels, [FLAGS.ensemble_size])

Could you please clarify? Have you tried both approaches?

Thanks,

Reproducing OOD scores for CIFAR-10 vs SVHN

Hi there,

I'm currently trying to reproduce your results from the SNGP paper (https://arxiv.org/abs/2006.10108) and get a much higher AUCPR for the baseline deterministic WideResNet separating CIFAR-10 from SVHN as OOD set. I don't really see how the result for CIFAR-10 can also be worse than for CIFAR-100 since this relationship is reversed for all other tested methods.

Would be cool if you could have a look and advise if this is a typo in the paper or share how you arrive at the OOD scores for this setup. My results for the OOD task with a vanilla WRN are AUPR = 0.899 and AUROC = 0.931.

Thanks a lot!

Should we remove the default platform value?

All code snippets for launching jobs always include platform=jf or platform=gpu as well as the tpu_topology (if tpu) and gpu_type (if gpu). We may want to remove these flags' default values because it's not obvious what a generally good default is.

RuntimeError when running baselines/imagenet/sngp.py

Dear uncertainty-baseline authors,

I am trying to run the SNGP training on ImageNet using uncertainty-baselines/baselines/imagenet/sngp.py.

It errors during the execution of the first training step with the following message:

 RuntimeError: `merge_call` called while defining a new graph or a tf.function.
This can often happen if the function `fn` passed to `strategy.run()` 
contains a nested `@tf.function`, and the nested `@tf.function` contains 
a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients),
or if the function `fn` uses a control flow statement which contains a synchronization 
point in the body. Such behaviors are not yet supported. Instead, please avoid 
nested `tf.function`s or control flow statements that may potentially cross a
synchronization boundary, for example, wrap the `fn` passed to `strategy.run` 
or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`

This is the stack trace:

RuntimeError: in user code:

    .../lib/uncertainty-baselines/baselines/imagenet/sngp_tmp.py:290 step_fn  *
        model.layers[-1].reset_covariance_matrix()
    ../edward2/edward2/tensorflow/layers/random_feature.py:219 reset_covariance_matrix  *
        self._gp_cov_layer.reset_precision_matrix()
    ../edward2/edward2/tensorflow/layers/random_feature.py:363 reset_precision_matrix  *
        precision_matrix_reset_op = self.precision_matrix.assign(
    .../venv/lib/python3.8/site-packages/tensorflow/python/distribute/values.py:685 assign  **
        return values_util.on_write_assign(self, value, use_locking=use_locking,
    .../venv/lib/python3.8/site-packages/tensorflow/python/distribute/values_util.py:33 on_write_assign
        return var._update(  # pylint: disable=protected-access
    .../venv/lib/python3.8/site-packages/tensorflow/python/distribute/values.py:827 _update
        return self._update_replica(update_fn, value, **kwargs)
    .../venv/lib/python3.8/site-packages/tensorflow/python/distribute/values.py:897 _update_replica
        return _on_write_update_replica(self, update_fn, value, **kwargs)
    .../venv/lib/python3.8/site-packages/tensorflow/python/distribute/values.py:71 _on_write_update_replica
        return ds_context.get_replica_context().merge_call(
    .../venv/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2715 merge_call
        return self._merge_call(merge_fn, args, kwargs)
    .../venv/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py:432 _merge_call
        raise RuntimeError(

It seems like the self.precision_matrix.assign call in edward2/edward2/tensorflow/layers/random_feature.py causes this error, because it is executed inside the strategy.run call of a tf.function.

What can I do to fix this?

Reproducibility of Rank-1 results after the OSS release: investigate a potential gap and/or update the README.

The current table in the README presents results for Rank-1 BNNs that date back to April/May, which:

  • used a different evaluation log-likelihood
  • possibly had a slightly different CIFAR data loader
  • mostly were 3 to 5-seed averages instead of our current standard of 10-seed averages

The latest results with Gaussian Rank-1 BNNs slightly underperform (by on the order of 0.05%) on CIFAR-10 while outperforming on CIFAR-100 (by on the order of 0.1-0.4%); this requires investigating, or simply updating the README.

Move Bibtex entries for models and datasets

Instead of in the global README.md, it's better to move them into each module. This keeps it local, and makes it easier to maintain given the many references we use on a per-method/layer/etc. basis. For example, in Rank-1 BNNs.

Add ResNet MCMC baselines from SGMCMC paper

@jereliu @flwenzel. One potential issue is that @flwenzel's SGMCMC paper was on the older Edward2 ResNet-20 baseline script (which itself was loosely based on the UQ benchmark's). This is before we upgraded to WRN-28-10.

@znado Not sure what we want to do with the resnet-20 baselines. We could have a separate directory in the CIFAR baselines for it if we'd like.

Set up leaderplot plots with links to actual runs

@dustinvtran
@dusenberrymw has a setup for plotting several metrics compared across methods, where each method is averaged over multiple seeds. The plotting data is obtained directly from the experiments.

This improves our existing leaderboard (a table) by 1. automatically averaging over multiple seeds instead of manually; and 2. programmatically going from experiment runs -> results for visualization purposes.

It would be great to set this up for all datasets. Later, we should also look into how to enable this publicly once Tensorboard.dev supports making available all the different experiment runs we'd like.

google/edward2#279

Couldn't reproduce MIMO accuracy on CIFAR-100

Hi @dustinvtran and others, thank you for the repository! @VoronkovaDasha and I are trying to reproduce the results of MIMO on WideResNet28x10 + CIFAR-100 to compare its performance with other methods. However, so far we have not been able to; the accuracy values we get are a notch lower than they should be. We use 1 GPU.

For CIFAR-100 the paper reports accuracy 82.0, NLL 0.690 for an ensemble of size 3.

Here is what we get:

python3 /dashavoronkova8/ens_project/mimo/cifar.py --output_dir '/dashavoronkova8/ens_project/mimo/cifar' --seed 0 --use_gpu --dataset cifar100 --per_core_batch_size 512 --num_cores 1 --batch_repetitions 4 --corruptions_interval -1 --ensemble_size 3 --width_multiplier 10 --base_learning_rate 0.1 --train_epochs 250 --lr_decay_ratio 0.1 --lr_warmup_epochs 0 --num_bins 15 --input_repetition_probability 0. --l2 3e-4 --checkpoint_interval 50

Train Loss: 1.3752, Accuracy: 99.94%
Test NLL: 0.7143, Accuracy: 80.85%
Member 0 Test Loss: 0.9081, Accuracy: 77.92%
Member 1 Test Loss: 0.9205, Accuracy: 77.65%
Member 2 Test Loss: 0.9248, Accuracy: 77.64%

The same experiment with another seed:

Train Loss: 1.3718, Accuracy: 99.95%
Test NLL: 0.7147, Accuracy: 80.73%
Member 0 Test Loss: 0.9152, Accuracy: 77.83%
Member 1 Test Loss: 0.9257, Accuracy: 77.55%
Member 2 Test Loss: 0.9209, Accuracy: 77.52%

Now with lr_warmup_epochs=1.

python3 /dashavoronkova8/ens_project/mimo/cifar.py --output_dir '/dashavoronkova8/ens_project/mimo/cifar' --seed 0 --use_gpu --dataset cifar100 --per_core_batch_size 512 --num_cores 1 --batch_repetitions 4 --corruptions_interval -1 --ensemble_size 3 --width_multiplier 10 --base_learning_rate 0.1 --train_epochs 250 --lr_decay_ratio 0.1 --lr_warmup_epochs 1 --num_bins 15 --input_repetition_probability 0. --l2 3e-4 --checkpoint_interval 50

Train Loss: 1.3739, Accuracy: 99.95%
Test NLL: 0.7198, Accuracy: 80.76%
Member 0 Test Loss: 0.9486, Accuracy: 77.09%
Member 1 Test Loss: 0.9144, Accuracy: 77.73%
Member 2 Test Loss: 0.9117, Accuracy: 77.74%

I wonder what is the culprit here? Are the script parameters OK?

Compute dropout ImageNet #s with 10 seeds

For quick prototyping, I only got it working and reported results with the 2 seeds that I had. The numbers are slightly better than you would get with more seeds: the 2 seeds' mean is 76.6%, which is 0.2% higher than the original result of 76.4%.

Add TPU/multi-GPU support for ensembles

@dustinvtran
@GhassenJ @znado @JasperSnoek

To do this, we need to replace storing logits/labels in memory with an eval loop.

The simplest method is that at each iteration, each worker loops over checkpoints to grab logit predictions, then aggregates to make the ensemble prediction, and then runs eval against the current batch's labels. However, loading checkpoints at each iteration may be prohibitively expensive.

An alternative method is to 1. write logits in parallel to a file using the workers; 2. in the coordinator, aggregate the logits and eval against the dataset's labels which is batched once more. This second step should be inexpensive compared to the first step; the second step does not run the model and does not require image data in memory.

Here's pseudocode.

# Step 1: each worker runs the model and writes logits for its shard of the data.
for checkpoint, logits_file in zip(checkpoints, logits_files):
  model.load(checkpoint)
  for batch in dataset:  # the dataset is split in parallel across workers
    predict_fn(model, batch, logits_file)  # predict_fn calls model and writes logits to logits_file

# Step 2: perform below in coordinator (CPU); aggregate member probabilities and evaluate.
for batch in dataset:
  member_probs = [tf.nn.softmax(subset(logits_file, batch)) for logits_file in logits_files]
  probs = tf.reduce_mean(member_probs, axis=0)  # average over ensemble members, not classes
  eval(batch['labels'], probs)

A final method involves having each worker load different checkpoints. However, this requires model parallelism in addition to data parallelism, which is not supported well.

Partially resolved in google/edward2#258. The next step is to figure out exactly how to parallelize writing logits to files.

In general, there are M models, N datasets, and M*N files of logits.

  • It's easy to parallelize writing logits files across models or datasets, but only via separate jobs rather than a single job. Separate jobs can be fine, but this makes the pipeline more complex as you need to manually run the second step of eval that reads from all these files.
  • If we restrict to a single job, we must resort to data parallelism, somehow asynchronously writing logits to files while still preserving the dataset ordering. (This will also not be storable as a NumPy array unless the workers all communicate their predictions to the coordinator, which is very expensive. So it may have to be something like plaintext?)

google/edward2#249

Importing uncertainty_baselines errors out: NameError: name 'ed' is not defined

TensorFlow version: tf-nightly (2.4.0-dev20201007)
uncertainty_baselines version: installed via pip install "git+https://github.com/google/uncertainty-baselines.git#egg=uncertainty_baselines"

----> 1 import uncertainty_baselines as ub

~/venv_tf_nightly/lib/python3.8/site-packages/uncertainty_baselines/__init__.py in <module>
     38 for module_name in _IMPORTS:
     39   try:
---> 40     _lazy_import(module_name)
     41   except ModuleNotFoundError as e:
     42     logging.warning(e)

~/venv_tf_nightly/lib/python3.8/site-packages/uncertainty_baselines/__init__.py in _lazy_import(name)
     31 def _lazy_import(name):
     32   module = importlib.import_module(__name__)
---> 33   imported = importlib.import_module('.' + name, 'uncertainty_baselines')
     34   setattr(module, name, imported)
     35   return imported

/usr/lib/python3.8/importlib/__init__.py in import_module(name, package)
    125                 break
    126             level += 1
--> 127     return _bootstrap._gcd_import(name[level:], package, level)
    128 
    129 

~/venv_tf_nightly/lib/python3.8/site-packages/uncertainty_baselines/models/__init__.py in <module>
     23 from uncertainty_baselines.models.resnet20 import create_model as ResNet20Builder
     24 from uncertainty_baselines.models.resnet50 import create_model as ResNet50Builder
---> 25 from uncertainty_baselines.models.resnet50_batchensemble import resnet101_batchensemble
     26 from uncertainty_baselines.models.resnet50_batchensemble import resnet50_batchensemble
     27 from uncertainty_baselines.models.resnet50_batchensemble import resnet_batchensemble

~/venv_tf_nightly/lib/python3.8/site-packages/uncertainty_baselines/models/resnet50_batchensemble.py in <module>
     31 
     32 EnsembleBatchNormalization = functools.partial(  # pylint: disable=invalid-name
---> 33     ed.layers.EnsembleSyncBatchNorm,
     34     epsilon=BATCH_NORM_EPSILON,
     35     momentum=BATCH_NORM_DECAY)

NameError: name 'ed' is not defined

Add robustness CIFAR test sets

@dustinvtran
For CIFAR-10.1, add NLL / accuracy / CE. For CIFAR-C, add mean NLL, mCE, and mean CE. Similarly for CIFAR-P.

It looks like adding extra validation sets with the `model.fit` API is fairly involved (https://stackoverflow.com/questions/47731935/using-multiple-validation-sets-with-keras). We're planning to move to custom training loops anyway, so I'll get started on that first.

This issue should be moved to Robustness Metrics' issue tracker once open-sourced.

google/edward2#96

Refactor `ensemble.py` implementations using SavedModel

@jereliu
Currently, the ensemble.py implementations in edward2/baselines compile the model graph by loading model hyperparameters from flags, and then load model weights from checkpoints. This creates challenges both for model reproducibility (if there's a mismatch between the default and the actual model hyperparameters) and for code reuse.

A better approach is to use SavedModel, so the ensemble script doesn't have to depend at all on re-defining the TF model code, and we can create a general-purpose ensemble script that aggregates black-box model predictions.

google/edward2#366

Question about batch size and test-set evaluation

Hi there!

I noticed something a little odd while evaluating an ensemble using baselines/cifar/ensemble.py: it seems that evaluation is only performed on the test set rounded down to a multiple of the batch size, rather than the full set. I noticed this as the numpy arrays which store the predictions have shape (9984, 10) (that script has an eff. batch size of 64, which divides 9984).

I believe that this might be the case in the other training/eval scripts as well; as I read it, the test iterator is only called for the first TEST_IMAGES // BATCH_SIZE batches, leaving a partial batch if the batch size doesn't evenly divide.

Please let me know if I'm mistaken about this. If you find this is accurate, do the reported results need to be reevaluated? If they were run with the current default effective batch size of 512, I believe 272 test examples out of 10000 were missed.
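
For reference, the arithmetic behind the 272 figure above (a quick sketch assuming the default effective batch size of 512):

test_images = 10000
batch_size = 512
evaluated = (test_images // batch_size) * batch_size  # 19 * 512 = 9728
missed = test_images - evaluated                      # 272 examples never evaluated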

open-source explicit reproducibility instructions

@dustinvtran
We have the exact Bash commands to launch jobs on Google's cloud cluster in order to reproduce all results. There's nothing particularly notable about them, as all the best hyperparameters are default flag values or footnotes in the README.md's. However, it would be nice to open-source a version of these instructions in order to be explicit, bordering on pedantic. More clarity helps!

@znado mentioned Tensorboard.dev. That could be a great way to open-source the exact runs we use for our tuning jobs and the runs we use to report results in the tables.

google/edward2#222

Remove boilerplate outside model, moving into Keras model for various baselines

In BatchEnsemble, we have to add code like tf.tile before passing inputs into the model, and splitting and average probs before passing into metrics.

This is fine in individual scripts, but once you start applying SavedModels for post-training pipelines, these lines have to be copied over (e.g., temperature_scaling.py takes an arbitrary SavedModel as input; and Robustness Metrics). These scripts aim to be general-purpose, but specific baselines need further changes to their inputs/outputs. This makes the post-training pipelines not as general as they could be.

Questions about prediction in SNGP

Hi @jereliu ,

I have a few questions about the inference stage of SNGP:

  1. According to Eq. (9) and Algorithm 1 in the paper, shouldn't there be K precision matrices, one for each dimension of the output, where K is the number of classes? Each one would have dimension [batch_size, batch_size], so the total would be [K, batch_size, batch_size]; am I misunderstanding something? In the code, I can only find a single covariance matrix of size [batch_size, batch_size].
  2. After searching the code for a while, I couldn't find the sampling step, i.e., the 5th step in Algorithm 2. Without this sampling step, the prediction is similar to a MAP prediction, except for the difference during training. This way of making predictions should be essential to the method, right?

I would appreciate it if you could explain this in more detail.

Best,
Jianxiang

Some appreciation and a question on evaluation

Yesterday I attended the NeurIPS workshop on Practical Uncertainty Estimation and Out-of-Distribution Robustness in Deep Learning; thanks for a great summary of recent advances, much appreciated!
I have had a look around the repository referenced there and have some comments:

(i) Love the repository!
(ii) How can I contribute?
I am looking into benchmarking uncertainty methods for Named Entity Recognition; is this something you would like to include, or would it be out of scope?
(iii) Small question: you report ECE and Brier on the toxicity challenge; why not negative log-likelihood (log loss)? In general, proper scoring rules can be decomposed into calibration and refinement loss. I am wondering whether Brier is closer to ECE than log loss is, with log loss being closer to accuracy. Is that the reason why you selected Brier over log loss?

Thank you very much!

Jordy Van Landeghem
