Comments (4)
Hi Yagmur,
Do you have details on how you are implementing this? In prior work with a different architecture, we used:
In the event of missing data, the cost calculation was modified to exclude missing data from contributing to the reconstruction cost. A missingness vector m was created for each input vector, with a value of 1 where the data is present and 0 where the data is missing. Both the input sample x and the reconstruction z were multiplied by m, and the cross-entropy error was divided by the sum of m, the number of non-missing features, to get the average cost per feature present (Formula 4). This allowed the DA to learn the structure of the data from present features rather than imputation.
https://www.biorxiv.org/content/10.1101/039800v1.full
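That masked cost can be sketched in a few lines of NumPy (the function name and the eps constant are mine, for illustration, not from the paper):

```python
import numpy as np

def masked_reconstruction_cost(x, z, m, eps=1e-7):
    """Cross-entropy averaged over present features only.

    x: input sample, z: reconstruction, m: missingness vector
    (1 where data is present, 0 where missing).
    """
    x, z = x * m, z * m  # zero out the missing positions
    ce = -(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
    return (ce * m).sum() / m.sum()  # average cost per present feature
```

Features with m = 0 contribute nothing to the sum, so the reconstruction at those positions is free to take any value.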
I think we would need more details to provide any guidance.
from tybalt.
Dear Dr. Greene,
Thank you for your reply and the paper. I see that in the paper the corrupted values have been masked with zeros. In my case, the original data set may initially contain zeros, and missing values are represented separately with numpy.nan. Therefore I cannot simply overwrite the missing values with zeros during preprocessing. I believe replacing the missing values (numpy.nan) with any value would affect the binary cross-entropy loss even if I multiply the input vector and the reconstruction by the "missingness vector" m at the end. Please correct me if I am wrong. Instead, I need to omit them when calculating the loss.
What I need instead is a loss function that creates a mask for the missing values in the original data and applies this mask to the original and predicted values before the calculations. To sum up, the pipeline we have in mind is as follows:
1- Get the original data, which may initially have missing values (numpy.nan), and preprocess it, omitting the missing values
2- Introduce further missing values to the data at random (e.g. 10% in total)
3- Modify Tybalt's loss function, defined in CustomVariationalLayer, so that vae_loss() proceeds as follows:
a) Create a boolean mask marking where the original values are missing*
b) Mask the original and the predicted data with this mask
c) Calculate the cost with the masked input vector and reconstruction vector
(*We could extend the mask so that the loss is calculated only on the corrupted values, i.e. those missing in the training data but not in the original data, to focus on missing value imputation. This should be easy once the loss function is ready.)
4- Train the model with the corrupted data using the modified loss function
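Steps 1-2 amount to adding extra NaNs on top of the ones already present; a minimal NumPy sketch (the function name corrupt and the 10% default are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(data, frac=0.10):
    """Step 2 sketch: set a random fraction of the *observed* entries
    to np.nan, leaving originally-missing entries untouched."""
    corrupted = data.copy()
    observed = np.argwhere(~np.isnan(data))  # indices of present entries
    n_corrupt = int(frac * len(observed))
    picks = rng.choice(len(observed), size=n_corrupt, replace=False)
    rows, cols = observed[picks].T
    corrupted[rows, cols] = np.nan
    return corrupted
```

Because only observed entries are sampled, the original NaNs are preserved, so the extended mask from step 3* (corrupted-but-not-originally-missing) can be recovered by comparing the two arrays.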
As far as I can tell, I only need to modify the reconstruction error, keras.metrics.binary_crossentropy(), and not the KL term, to achieve this. Therefore I have been working on a custom binary cross-entropy function that masks the original and predicted values where the original data is missing:
```python
import tensorflow as tf
import numpy as np
from keras import backend as K

def custom_binary_crossentropy(y_true, y_pred):
    # Mask marking where the original data is present (not NaN)
    y_true_not_nan_mask = tf.logical_not(tf.is_nan(y_true))
    # Apply the mask to the original data
    y_true_masked = tf.boolean_mask(y_true, mask=y_true_not_nan_mask)
    # Apply the mask to the predicted values
    y_pred_masked = tf.boolean_mask(y_pred, mask=y_true_not_nan_mask)
    # Calculate the binary cross-entropy (bce) with the masked values
    term_0 = (1 - y_true_masked) * K.log(1 - y_pred_masked + K.epsilon())  # cancels out when target is 1
    term_1 = y_true_masked * K.log(y_pred_masked + K.epsilon())  # cancels out when target is 0
    cross_entropy_loss = -(term_0 + term_1)
    # Mean bce over the non-missing values only (boolean_mask already dropped the rest)
    masked_mean_bce_loss = tf.reduce_mean(cross_entropy_loss)
    return masked_mean_bce_loss
```
After testing this function with the variables in the code snippet below, the loss is 0.659456. However, when I use it instead of keras.metrics.binary_crossentropy(), the loss graph is empty and both axes' ticks show unexpected values (-0.04, -0.02, 0, 0.02, 0.04). Do I need to make other modifications to the training pipeline/model? I am also not sure whether I need to keep the mean calculation at the end of the custom loss function. Is vae_loss() calculated on each sample? Thank you very much for your time and suggestions!
```python
y_true = tf.constant([
    [0, 1, np.nan, 0],
    [0, 1, 1, 0],
    [np.nan, 1, np.nan, 0],
    [1, 1, 0, np.nan],
])
y_pred = tf.constant([
    [0.1, 0.7, 0.1, 0.3],
    [0.2, 0.6, 0.1, 0],
    [0.1, 0.9, 0.3, 0.2],
    [0.1, 0.4, 0.4, 0.2],
])
loss = custom_binary_crossentropy(y_true, y_pred)
print(loss.eval())  # run inside a TF1 session
```
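One caveat with the approach above: tf.boolean_mask flattens its input to 1-D, so the per-sample structure (and any axis=-1 reduction) is lost. An alternative that keeps the batch dimension is to zero the loss at the missing positions and divide each row by its own count of present features. A NumPy sketch of that arithmetic (masked_bce_per_sample is my name, not Tybalt's; the tf translation would use tf.where and tf.reduce_sum):

```python
import numpy as np

def masked_bce_per_sample(y_true, y_pred, eps=1e-7):
    """Per-sample masked binary cross-entropy.

    Missing targets (NaN) contribute zero loss; each row is averaged
    over its own count of present features, so the batch dimension is
    preserved (the shape Keras expects from a per-sample loss)."""
    mask = ~np.isnan(y_true)
    y_true = np.where(mask, y_true, 0.0)  # placeholder value; masked out below
    ce = -(y_true * np.log(y_pred + eps)
           + (1 - y_true) * np.log(1 - y_pred + eps))
    ce = np.where(mask, ce, 0.0)
    return ce.sum(axis=-1) / mask.sum(axis=-1)  # shape: (batch,)
```

On the y_true/y_pred example above this returns one value per row; averaging those values weighted by each row's present-feature count reproduces the 0.659456 figure.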
Nice explanation @yagmuronay - a couple quick things to consider:

- Have you tried adding axis=-1 to the tf.reduce_mean() call? See the "Creating custom losses" section of https://keras.io/api/losses/
- Are the weird values a result of replacing the binary_crossentropy in vae_loss() (tybalt/tybalt/utils/vae_utils.py, line 58 at 644f34a) with your custom masked loss?
- Try multiplying by the original dimensions of your input RNAseq data. IIRC, we needed this term to balance the KL divergence loss.
- Have you considered adding the mask to the KL divergence term as well? This reports on the distribution of the encoder output - if you're not learning how to handle missingness with the reconstruction term, then I might worry about missingness influencing the KL term disproportionately.

vae_loss() is called on each batch of input data.
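The suggestions above (per-sample reduction, rescaling the reconstruction term by the input dimensionality to balance the KL term) can be sketched in NumPy; vae_loss_sketch and its explicit z_mean/z_log_var arguments are hypothetical stand-ins for Tybalt's actual Keras tensors:

```python
import numpy as np

def vae_loss_sketch(y_true, y_pred, z_mean, z_log_var, original_dim, eps=1e-7):
    """Masked reconstruction term, rescaled by original_dim to balance
    the KL term, plus the standard Gaussian KL divergence."""
    mask = ~np.isnan(y_true)
    yt = np.where(mask, y_true, 0.0)  # placeholder; masked out below
    ce = -(yt * np.log(y_pred + eps) + (1 - yt) * np.log(1 - y_pred + eps))
    ce = np.where(mask, ce, 0.0)
    recon = original_dim * ce.sum(axis=-1) / mask.sum(axis=-1)  # per sample
    kl = -0.5 * np.sum(1 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=-1)
    return np.mean(recon + kl)
```

With z_mean = 0 and z_log_var = 0 the KL term vanishes, which makes it easy to check the reconstruction term in isolation before wiring the masked loss into the full model.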
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.