Comments (4)
Hi Yagmur,
Do you have details on how you are implementing this? In prior work with a different architecture, we used:
In the event of missing data, the cost calculation was modified to exclude missing data from contributing to the reconstruction cost. A missingness vector m was created for each input vector, with a value of 1 where the data is present and 0 where the data is missing. Both the input sample x and the reconstruction z were multiplied by m, and the cross-entropy error was divided by the sum of m, the number of non-missing features, to get the average cost per feature present (Formula 4). This allowed the DA to learn the structure of the data from present features rather than imputation.
https://www.biorxiv.org/content/10.1101/039800v1.full
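That masked cost can be sketched in a few lines of NumPy (the function name and the eps constant are mine, for illustration, not from the paper):

```python
import numpy as np

def masked_reconstruction_cost(x, z, m, eps=1e-7):
    """Cross-entropy averaged over present features only.

    x: input sample, z: reconstruction, m: missingness vector
    (1 where data is present, 0 where missing).
    """
    x, z = x * m, z * m  # zero out the missing positions
    ce = -(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
    return (ce * m).sum() / m.sum()  # average cost per present feature
```

Features with m = 0 contribute nothing to the sum, so the reconstruction at those positions is free to take any value.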
I think we would need more details to provide any guidance.
from tybalt.
Dear Dr. Greene,
Thank you for your reply and the paper. I see that in the paper the corrupted values have been masked with zeros. In my case, the original data set may initially contain zeros, and missing values are represented separately with numpy.nan. Therefore I cannot simply overwrite the missing values with zeros during preprocessing. I believe replacing the missing values (numpy.nan) with any value would affect the binary cross-entropy loss even if I multiply the input vector and the reconstruction by the "missingness vector" m at the end. Please correct me if I am wrong. Instead, I need to omit them when calculating the loss.
What I need instead is a loss function that creates a mask for the missing values in the original data and applies this mask to the original and predicted values before the calculations. To sum up, the pipeline we have in mind is as follows:
1- Get the original data, which may initially have missing values (numpy.nan), and preprocess it, omitting the missing values
2- Introduce further missing values to the data at random (e.g. 10% in total)
3- Modify Tybalt's loss function, defined in CustomVariationalLayer, so that vae_loss() proceeds as follows:
a) Create a boolean mask marking where the original values are missing*
b) Mask the original and the predicted data with this mask
c) Calculate the cost with the masked input vector and reconstruction vector
(*We could extend the mask so that the loss is calculated only on the corrupted values, i.e. those missing in the training data but not in the original data, to focus on missing value imputation. This should be easy once the loss function is ready.)
4- Train the model with the corrupted data using the modified loss function
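Steps 1-2 amount to adding extra NaNs on top of the ones already present; a minimal NumPy sketch (the function name corrupt and the 10% default are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(data, frac=0.10):
    """Step 2 sketch: set a random fraction of the *observed* entries
    to np.nan, leaving originally-missing entries untouched."""
    corrupted = data.copy()
    observed = np.argwhere(~np.isnan(data))  # indices of present entries
    n_corrupt = int(frac * len(observed))
    picks = rng.choice(len(observed), size=n_corrupt, replace=False)
    rows, cols = observed[picks].T
    corrupted[rows, cols] = np.nan
    return corrupted
```

Because only observed entries are sampled, the original NaNs are preserved, so the extended mask from step 3* (corrupted-but-not-originally-missing) can be recovered by comparing the two arrays.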
As far as I can tell, I only need to modify the reconstruction error, keras.metrics.binary_crossentropy(), and not the KL term, to achieve this. Therefore I have been working on a custom binary cross-entropy function that masks the original and predicted values where the original data is missing:
```python
import tensorflow as tf
import numpy as np
from keras import backend as K

def custom_binary_crossentropy(y_true, y_pred):
    # Mask marking where the original data is present (not NaN)
    y_true_not_nan_mask = tf.logical_not(tf.is_nan(y_true))
    # Apply the mask to the original data
    y_true_masked = tf.boolean_mask(y_true, mask=y_true_not_nan_mask)
    # Apply the mask to the predicted values
    y_pred_masked = tf.boolean_mask(y_pred, mask=y_true_not_nan_mask)
    # Calculate the binary cross-entropy (bce) with the masked values
    term_0 = (1 - y_true_masked) * K.log(1 - y_pred_masked + K.epsilon())  # cancels out when target is 1
    term_1 = y_true_masked * K.log(y_pred_masked + K.epsilon())  # cancels out when target is 0
    cross_entropy_loss = -(term_0 + term_1)
    # Mean bce over the non-missing values only (boolean_mask already dropped the rest)
    masked_mean_bce_loss = tf.reduce_mean(cross_entropy_loss)
    return masked_mean_bce_loss
```
After testing this function with the variables in the code snippet below, the loss is 0.659456. However, when I use it instead of keras.metrics.binary_crossentropy(), the loss graph is empty and both axes' ticks show unexpected values (-0.04, -0.02, 0, 0.02, 0.04). Do I need to make other modifications to the training pipeline/model? I am also not sure whether I need to keep the mean calculation at the end of the custom loss function. Is vae_loss() calculated on each sample? Thank you very much for your time and suggestions!
```python
y_true = tf.constant([
    [0, 1, np.nan, 0],
    [0, 1, 1, 0],
    [np.nan, 1, np.nan, 0],
    [1, 1, 0, np.nan],
])
y_pred = tf.constant([
    [0.1, 0.7, 0.1, 0.3],
    [0.2, 0.6, 0.1, 0],
    [0.1, 0.9, 0.3, 0.2],
    [0.1, 0.4, 0.4, 0.2],
])
loss = custom_binary_crossentropy(y_true, y_pred)
print(loss.eval())  # run inside a TF1 session
```
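One caveat with the approach above: tf.boolean_mask flattens its input to 1-D, so the per-sample structure (and any axis=-1 reduction) is lost. An alternative that keeps the batch dimension is to zero the loss at the missing positions and divide each row by its own count of present features. A NumPy sketch of that arithmetic (masked_bce_per_sample is my name, not Tybalt's; the tf translation would use tf.where and tf.reduce_sum):

```python
import numpy as np

def masked_bce_per_sample(y_true, y_pred, eps=1e-7):
    """Per-sample masked binary cross-entropy.

    Missing targets (NaN) contribute zero loss; each row is averaged
    over its own count of present features, so the batch dimension is
    preserved (the shape Keras expects from a per-sample loss)."""
    mask = ~np.isnan(y_true)
    y_true = np.where(mask, y_true, 0.0)  # placeholder value; masked out below
    ce = -(y_true * np.log(y_pred + eps)
           + (1 - y_true) * np.log(1 - y_pred + eps))
    ce = np.where(mask, ce, 0.0)
    return ce.sum(axis=-1) / mask.sum(axis=-1)  # shape: (batch,)
```

On the y_true/y_pred example above this returns one value per row; averaging those values weighted by each row's present-feature count reproduces the 0.659456 figure.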
Nice explanation @yagmuronay - a couple quick things to consider:

- Have you tried adding axis=-1 to the tf.reduce_mean() call? See the "Creating custom losses" section of https://keras.io/api/losses/
- Are the weird values a result of replacing the binary_crossentropy in vae_loss() (tybalt/tybalt/utils/vae_utils.py, line 58 at 644f34a) with your custom masked loss?
- Try multiplying by the original dimensions of your input RNAseq data. IIRC, we needed this term to balance the KL divergence loss.
- Have you considered adding the mask to the KL divergence term as well? This reports on the distribution of the encoder output - if you're not learning how to handle missingness with the reconstruction term, then I might worry about missingness influencing the KL term disproportionately.

vae_loss() is called on each batch of input data.
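The suggestions above (per-sample reduction, rescaling the reconstruction term by the input dimensionality to balance the KL term) can be sketched in NumPy; vae_loss_sketch and its explicit z_mean/z_log_var arguments are hypothetical stand-ins for Tybalt's actual Keras tensors:

```python
import numpy as np

def vae_loss_sketch(y_true, y_pred, z_mean, z_log_var, original_dim, eps=1e-7):
    """Masked reconstruction term, rescaled by original_dim to balance
    the KL term, plus the standard Gaussian KL divergence."""
    mask = ~np.isnan(y_true)
    yt = np.where(mask, y_true, 0.0)  # placeholder; masked out below
    ce = -(yt * np.log(y_pred + eps) + (1 - yt) * np.log(1 - y_pred + eps))
    ce = np.where(mask, ce, 0.0)
    recon = original_dim * ce.sum(axis=-1) / mask.sum(axis=-1)  # per sample
    kl = -0.5 * np.sum(1 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=-1)
    return np.mean(recon + kl)
```

With z_mean = 0 and z_log_var = 0 the KL term vanishes, which makes it easy to check the reconstruction term in isolation before wiring the masked loss into the full model.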
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.