In this work, we further improve SA-Net in the following aspects:
- We replace the ResNet blocks with hybrid ViT-based transformer blocks in the AiF branch of the encoder.
- We exploit inter-slice features in both the encoding and decoding processes. Specifically, we discard the three 3D convolution layers (shown in gray in Figure 1 of SA-Net) in the FS branch of the encoder, and replace the original receptive field blocks (RFBs) with 3D RFBs. To fuse the high-level 3D FS features with the 2D AiF features, we further design a multi-head synergistic attention module (see the code for details).
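The fusion step above can be sketched as cross-attention in which the 2D AiF features query the 3D focal-stack features. This is a minimal illustrative sketch, not the repository's actual implementation: the module name, channel sizes, head count, and the single-cross-attention design are all assumptions.

```python
import torch
import torch.nn as nn


class SynergisticAttentionFusion(nn.Module):
    """Hypothetical sketch of a multi-head synergistic attention module.

    2D AiF features (B, C, H, W) attend to 3D focal-stack features
    (B, C, S, H, W), where S is the number of focal slices. All shapes
    and names here are assumptions for illustration only.
    """

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, aif_feat: torch.Tensor, fs_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = aif_feat.shape
        s = fs_feat.shape[2]
        # Flatten spatial positions of the AiF map into query tokens: (B, H*W, C)
        q = aif_feat.flatten(2).transpose(1, 2)
        # Flatten slice and spatial dims of the focal stack into key/value tokens: (B, S*H*W, C)
        kv = fs_feat.permute(0, 2, 3, 4, 1).reshape(b, s * h * w, c)
        # Multi-head cross-attention: AiF queries, focal-stack keys/values
        fused, _ = self.attn(q, kv, kv)
        # Residual connection and layer norm, then restore the 2D layout
        fused = self.norm(fused + q)
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    module = SynergisticAttentionFusion(channels=32, num_heads=4)
    aif = torch.randn(2, 32, 8, 8)         # 2D AiF features
    fs = torch.randn(2, 32, 5, 8, 8)       # 3D FS features, 5 focal slices
    out = module(aif, fs)
    print(out.shape)                       # same spatial layout as the AiF input
```

The output keeps the 2D AiF layout, so it can feed directly into a 2D decoder stage; other fusion designs (e.g. bidirectional attention) would also fit the description in the text.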
As a result, our SA-Net-v2 outperforms the original version of SA-Net by a large margin.
Figure 1: Quantitative results for different models on three benchmark datasets. The best scores are in boldface. We train and test our SA-Net-v2 with settings consistent with SA-Net, the current state-of-the-art model. ⋆ indicates traditional methods. - denotes that no result is available. ↑ means higher is better; ↓ means lower is better.
Download the saliency prediction maps from Google Drive.
Download the pretrained model from Google Drive.