Thank you for sharing the pretrained model.
I tried running the code in the tutorial after adding the path of the ImageNet validation dataset and the checkpoint of vit-l/16 (downloaded from the Hugging Face page), but evaluation fails with a NotFoundError when the checkpoint is restored.
Could you please help me with this issue?
Thank you in advance.
Here is the config I used.
hydra:
  run:
    dir: ./outputs/checkpoint
defaults:
  - trainer: vit_b16_i1k
runtime:
  strategy: 'gpu' # one of ['cpu', 'tpu', 'gpu', 'gpu_multinode', 'gpu_multinode_async']
  use_mixed_precision: true
experiment:
  mode: eval # 'train', 'train_eval', 'eval'
  debug: false
  save_dir: ${hydra:run.dir}
  comment: ???
Here is the shell script (test.sh) I ran.
python3 -m trainer trainer=vit_l16_i1k_downstream \
    experiment.debug=false \
    experiment.mode='eval'
And here is the full output, including the error message.
~:$ source test.sh
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).
For more information see: https://github.com/tensorflow/addons/issues/2807
warnings.warn(
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.13.0 and strictly below 2.16.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.10.1 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
warnings.warn(
/home/masaru-sasaki/work_space/coyo-vit/trainer.py:323: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="configs", config_name="trainer")
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'trainer': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
warnings.warn(msg, UserWarning)
/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[2023-12-19 22:23:12,639][__main__][INFO] - Training with the following config:
trainer:
dataset:
train:
cache: true
supervised_key: label
builder:
- tfds_name: imagenet2012:5.0.0
tfds_data_dir:
your dir: null
tfds_split: train
dtype: bfloat16
image_size: 384
mixup_alpha: 0.0
cutmix_alpha: 0.0
preprocess:
- type: InceptionCrop
params:
size: 384
- type: random_hflip
- type: normalize
params:
mean: 127.5
std: 127.5
validation:
cache: true
supervised_key: label
builder:
- tfds_name: imagenet2012:5.0.0
tfds_data_dir: /mnt/disk202208/common-data/ImageNet/ILSVRC2012_img_val/
tfds_split: validation
dtype: bfloat16
image_size: 384
mixup_alpha: 0.0
cutmix_alpha: 0.0
preprocess:
- type: resize
params:
size:
- 384
- 384
- type: normalize
params:
mean: 127.5
std: 127.5
backbone:
backbone_name: vit-l/16
backbone_params:
image_size: 384
representation_size: 0
attention_dropout_rate: 0.0
dropout_rate: 0.0
channels: 3
dropout_rate: 0.0
cls_kernel_init:
type: zeros
cls_bias_init:
type: zeros
pretrained: null
loss:
class_name: CategoricalCrossentropy
config:
from_logits: true
label_smoothing: 0.0
l2_weight_decay: 0.0
learning_rate:
schedule_name: vit/cosine
init_lr: 0.0
base_lr: 0.06
end_learning_rate: 0
warmup_steps: 500
optimizer:
class_name: SGD
config:
momentum: 0.9
global_clipnorm: 1.0
moving_average_decay: 0.0
metrics:
metrics_list:
- class_name: TopKCategoricalAccuracy
config:
k: 1
name: top1_acc
- class_name: TopKCategoricalAccuracy
config:
k: 5
name: top5_acc
- class_name: CategoricalAccuracy
global_batch_size: 512
local_batch_size: null
epochs: 8
runtime:
strategy: gpu
use_mixed_precision: true
experiment:
mode: eval
debug: false
save_dir: ${hydra:run.dir}
comment: ???
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
[2023-12-19 22:23:15,087][tensorflow][INFO] - Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[2023-12-19 22:23:15,090][tensorflow][INFO] - Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[2023-12-19 22:23:15,091][__main__][INFO] - strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7efe2a178310>
[2023-12-19 22:23:15,092][__main__][INFO] - num_workers: 4
[2023-12-19 22:23:15,092][__main__][INFO] - local_batch_size: 128, global_batch_size: 512
[2023-12-19 22:23:15,092][root][INFO] - evaluate checkpoint: ./outputs/checkpoint
[2023-12-19 22:23:15,093][__main__][INFO] - Build dataset (is_training=False)
[2023-12-19 22:23:15,093][__main__][INFO] - [{'tfds_name': 'imagenet2012:5.0.0', 'tfds_data_dir': '/mnt/disk202208/common-data/ImageNet/ILSVRC2012_img_val/', 'tfds_split': 'validation'}]
[2023-12-19 22:23:15,093][root][INFO] - use TFDS: imagenet2012:5.0.0[validation]
[2023-12-19 22:23:15,636][absl][INFO] - Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.0.0
[2023-12-19 22:23:16,232][absl][INFO] - Load dataset info from /tmp/tmp8_aju2t8tfds
[2023-12-19 22:23:16,237][absl][INFO] - Field info.description from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.release_notes from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.citation from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.splits from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,238][absl][INFO] - Field info.module_name from disk and from code do not match. Keeping the one from code.
[2023-12-19 22:23:16,239][root][INFO] - stacking dataset imagenet2012:5.0.0[validation] -> updated info: {'num_examples': 50000, 'num_shards': 64, 'num_classes': 1000}
[2023-12-19 22:23:16,575][__main__][INFO] - Build backbone (name=vit-l/16)
Model: "vision_transformer"
_________________________________________________________________
 Layer (type)                        Output Shape  Param #
=================================================================
 pos_drop (Dropout)                  multiple      0
 embedding (Conv2D)                  multiple      787456
 encoderblock_0 (TransformerBlock)   multiple      12596224
 encoderblock_1 (TransformerBlock)   multiple      12596224
 encoderblock_2 (TransformerBlock)   multiple      12596224
 encoderblock_3 (TransformerBlock)   multiple      12596224
 encoderblock_4 (TransformerBlock)   multiple      12596224
 encoderblock_5 (TransformerBlock)   multiple      12596224
 encoderblock_6 (TransformerBlock)   multiple      12596224
 encoderblock_7 (TransformerBlock)   multiple      12596224
 encoderblock_8 (TransformerBlock)   multiple      12596224
 encoderblock_9 (TransformerBlock)   multiple      12596224
 encoderblock_10 (TransformerBlock)  multiple      12596224
 encoderblock_11 (TransformerBlock)  multiple      12596224
 encoderblock_12 (TransformerBlock)  multiple      12596224
 encoderblock_13 (TransformerBlock)  multiple      12596224
 encoderblock_14 (TransformerBlock)  multiple      12596224
 encoderblock_15 (TransformerBlock)  multiple      12596224
 encoderblock_16 (TransformerBlock)  multiple      12596224
 encoderblock_17 (TransformerBlock)  multiple      12596224
 encoderblock_18 (TransformerBlock)  multiple      12596224
 encoderblock_19 (TransformerBlock)  multiple      12596224
 encoderblock_20 (TransformerBlock)  multiple      12596224
 encoderblock_21 (TransformerBlock)  multiple      12596224
 encoderblock_22 (TransformerBlock)  multiple      12596224
 encoderblock_23 (TransformerBlock)  multiple      12596224
 encoder_nrom (LayerNormalization)   multiple      2048
 extract_token (Lambda)              multiple      0
 pre_logits (Identity)               multiple      0
=================================================================
Total params: 303,690,752
Trainable params: 303,690,752
Non-trainable params: 0
_________________________________________________________________
[2023-12-19 22:23:23,693][__main__][INFO] - Compile the model...
[2023-12-19 22:23:23,694][__main__][INFO] - optimizer: <class 'keras.optimizers.optimizer_v2.gradient_descent.SGD'>
[2023-12-19 22:23:23,694][__main__][INFO] - name: SGD
[2023-12-19 22:23:23,694][__main__][INFO] - global_clipnorm: 1.0
[2023-12-19 22:23:23,694][__main__][INFO] - learning_rate: 0.01
[2023-12-19 22:23:23,694][__main__][INFO] - decay: 0.0
[2023-12-19 22:23:23,694][__main__][INFO] - momentum: 0.9
[2023-12-19 22:23:23,694][__main__][INFO] - nesterov: False
[2023-12-19 22:23:23,694][__main__][INFO] - Build loss: <class 'keras.losses.CategoricalCrossentropy'>
[2023-12-19 22:23:23,694][__main__][INFO] - Build metrics...
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,700][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,705][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,709][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,710][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,715][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,716][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,720][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,721][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,725][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,726][tensorflow][INFO] - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[2023-12-19 22:23:23,736][__main__][INFO] - Build callbacks...
Error executing job with overrides: ['trainer=vit_l16_i1k_downstream', 'experiment.debug=false', 'experiment.mode=eval']
Traceback (most recent call last):
File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 92, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2563, in restore
status = self.read(save_path, options=options)
File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2441, in read
result = self._saver.restore(save_path=save_path, options=options)
File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 1448, in restore
reader = py_checkpoint_reader.NewCheckpointReader(save_path)
File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 96, in NewCheckpointReader
error_translator(e)
File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/masaru-sasaki/work_space/coyo-vit/trainer.py", line 340, in train_main
trainer.eval(config.experiment.save_dir)
File "/home/masaru-sasaki/work_space/coyo-vit/trainer.py", line 311, in eval
checkpoint.restore(ckpt)
File "/home/masaru-sasaki/.pyenv/versions/mambaforge-22.9.0-3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 2567, in restore
raise errors_impl.NotFoundError(
tensorflow.python.framework.errors_impl.NotFoundError: Error when restoring from checkpoint or SavedModel at ./outputs/checkpoint: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./outputs/checkpoint
Please double-check that the path is correct. You may be missing the checkpoint suffix (e.g. the '-1' in 'path/to/ckpt-1').
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
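For reference, here is a minimal sketch of how the contents of the evaluated directory could be checked before calling restore. It assumes the standard TensorFlow checkpoint layout, where each checkpoint is stored as a `<prefix>.index` file plus one or more `<prefix>.data-*` shards. The helper name `find_checkpoint_prefixes` is my own and is not part of coyo-vit or TensorFlow.

```python
from pathlib import Path


def find_checkpoint_prefixes(ckpt_dir):
    """Return checkpoint prefixes (e.g. 'dir/ckpt-1') found in ckpt_dir.

    Hypothetical helper: assumes the usual TensorFlow layout, where a
    checkpoint is '<prefix>.index' plus '<prefix>.data-*' shard files.
    """
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return []
    names = [p.name for p in directory.iterdir()]
    prefixes = []
    for name in sorted(names):
        if not name.endswith(".index"):
            continue
        prefix = name[: -len(".index")]
        # Keep the prefix only if at least one matching data shard exists.
        if any(n.startswith(prefix + ".data-") for n in names):
            prefixes.append(str(directory / prefix))
    return prefixes


if __name__ == "__main__":
    # Same path that the log reports as "evaluate checkpoint".
    found = find_checkpoint_prefixes("./outputs/checkpoint")
    if found:
        print("checkpoint prefixes:", found)
    else:
        print("no checkpoint files found in ./outputs/checkpoint")
```

If this reports no prefixes, the directory passed as experiment.save_dir does not contain a checkpoint in that layout, which would be consistent with the "Failed to find any matching files" error above. I am not sure whether the trainer expects the downloaded vit-l/16 files in that directory or a path given through another setting, so that part is a guess on my side.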