I find it very interesting how the model is able to pick up tiny gaps in salient objec


Architecture Insight about u-2-net HOT 3 OPEN

xuebinqin commented on August 15, 2024 3


from u-2-net.

Comments (3)

xuebinqin commented on August 15, 2024

Thanks for your efforts in exploring it. There are mainly two factors contributing to the segmentation of the tiny 'gaps' or fine 'structures': 1) the relatively "high" resolution of the feature maps, and 2) the global and local feature extraction capabilities of the network modules.

The first factor, relatively "high" resolution, is easy to understand. As we know, the higher the resolution, the more details we can perceive. Although 320x320 is not that high, the tiny structures are recognizable if you view them zoomed in.

But when we focus on the tiny 'gaps' or fine 'structures', we usually overlook an important fact: we are actually inferring the tiny gaps from large-scale/global context. For example, we can recognize the hairs because we know that they are the hairs of a girl; if the face and body of the girl were covered by other stuff, it would be difficult for us to recognize the hairs (it might not be a good example). So high resolution alone is not enough; large-scale or global information is also important in segmenting these tiny 'gaps'.

Most other networks have only one receptive field in each stage. They can also provide high-resolution feature maps just before the prediction, but the global context information is missing because of the small convolution filters applied to relatively high-resolution feature maps. The RSU blocks in each stage of the encoder and decoder are able to capture both global and local context information, which enables the segmentation of tiny 'gaps'.

Another advantage of the RSU blocks over PSP or inception-like blocks is that they downsample the feature maps to capture larger-scale information and upsample to recover the resolution. Both the downsampling and upsampling operations are conducted gradually, which avoids the feature degradation caused by drastic downsampling and upsampling.
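To make the receptive-field argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the U-2-Net code: the helper names (`receptive_field`, `plain_stack`, `rsu_like_path`) and the layer layout (a 3x3 input convolution, five conv + 2x2 pooling steps, one more 3x3 convolution, and a final 3x3 convolution with dilation 2) are assumptions chosen to resemble the downsampling half of a deep RSU block. It uses only the standard receptive-field recurrence and ignores the upsampling half.

```python
# Each layer is described as (kernel_size, stride, dilation).
CONV3 = (3, 1, 1)          # plain 3x3 convolution
CONV3_DILATED = (3, 1, 2)  # 3x3 convolution with dilation 2
POOL2 = (2, 2, 1)          # 2x2 pooling with stride 2


def receptive_field(layers):
    """Receptive field, in input pixels, of one unit after `layers`.

    Standard recurrence: every layer widens the field by
    (kernel - 1) * dilation * jump, where `jump` is the spacing between
    neighbouring units measured in input pixels.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


def plain_stack(n_convs):
    """Hypothetical stage made only of 3x3 convolutions at full resolution."""
    return [CONV3] * n_convs


def rsu_like_path(n_pools):
    """Assumed downsampling half of an RSU-like block (illustrative only)."""
    layers = [CONV3]                      # input convolution
    layers += [CONV3, POOL2] * n_pools    # gradual conv + pool steps
    layers += [CONV3, CONV3_DILATED]      # two convolutions at the bottom
    return layers


# Both variants contain eight 3x3 convolutions.
plain_rf = receptive_field(plain_stack(8))
rsu_rf = receptive_field(rsu_like_path(5))

print("plain stack of 8 convs:", plain_rf)   # 17
print("RSU-like path, 5 pools:", rsu_rf)     # 288
```

Under these assumptions, eight plain 3x3 convolutions see only a 17-pixel window, while the gradually pooled path sees 288 pixels, which is most of a 320x320 input, within the same stage.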


bluesky314 commented on August 15, 2024

High-level information is needed for low-level processing to extract finer features. You're right, this does happen in a regular U-Net, but the operations are very far apart. This reminds me of the cross-scale connections used in EfficientDet (BiFPN) and Path Aggregation Networks, where the idea is very similar but implemented differently: there, the same global features are propagated, whereas here each block has its own local/global features. How do you think this would differ from that? And how do you contrast this with HRNet, which has concurrent multi-resolution pathways so all levels can talk to each other?


bluesky314 commented on August 15, 2024

@Nathanua Would appreciate your thoughts

