Main Remark Tabnet architecture is using sparsemax function in ord

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Research : Binary Mask vs Sparse Mask? about tabnet HOT 3 OPEN

dreamquark-ai commented on July 4, 2024

Research : Binary Mask vs Sparse Mask?

from tabnet.

Comments (3)

pangjac commented on July 4, 2024

Hi @Optimox , could you please help elaborate the example

if someone is 50 year old, and the attention layer think that age is half of the solution then attention for age would be 0.5, and the next layer would see age=25. But how can the next layers differentiate from 75 / 3, 50 /2 and 25?

I am a bit confused on how, once the attention layer thinks age is half of the solution, then "the next layer" would see age=25.( how 25 comes out?) Thank you!

from tabnet.

Optimox commented on July 4, 2024

Well, once the attention mask is applied if you multiply age by 0.5 then it gives you a totally different age. In practice it still works but I wonder if it would work better with completely binary masks. That’s the point.

I once tried to add a very sharp layer that’s 0 in 0 but goes up to 1 very quickly, but I remember that it did not change much (and gradients exploded). It would be nice to perform an exhaustive comparison on multiple benchmark datasets.

from tabnet.

haoliangjiang commented on July 4, 2024

The binary mask will help in that we are confident that the attention mask allows the entire value to join the computation of the next block. However, I believe one of the implicit goals is that network can learn how to pass information so that it can classify or regress correctly. So I do not think binary masks will significantly boost the performance, for both masks are more about introducing more visualizability of the network. But I am not sure about the optimization level where it might do something different to gradients.

Given a well-trained model, I can think of two situations that the network outputs 25 instead of 50 after the attention. One is that the input data is noisy data. It corrects the data by changing the age to a reasonable range. The other is that the model knows 25 is 50 somehow, as the attention mask is mainly based on the input. Both of these situations help to predict.

Not a pro. Just sharing my thoughts.

from tabnet.

Recommend Projects

Research : Binary Mask vs Sparse Mask? about tabnet HOT 3 OPEN

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent