
Deep Embedding Clustering (DEC)

Keras implementation for ICML-2016 paper:

  • Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. ICML 2016.

Usage

  1. Install Keras>=2.0.9 and scikit-learn:
pip install keras scikit-learn   
  2. Clone the code locally:
git clone https://github.com/XifengGuo/DEC-keras.git DEC
cd DEC
  3. Prepare datasets.

Download STL:

cd data/stl
bash get_data.sh
cd ../..

MNIST and Fashion-MNIST (FMNIST) can be downloaded automatically when you run the code.

Reuters and USPS: If you cannot find these datasets yourself, you can download them from:
https://pan.baidu.com/s/1hsMQ8Tm (password: 4ss4) for Reuters, and
https://pan.baidu.com/s/1skRg9Dr (password: sc58) for USPS

  4. Run the experiment on MNIST:
    python DEC.py --dataset mnist
    or, if pretrained autoencoder weights are available:
    python DEC.py --dataset mnist --ae_weights <path-to-ae-weights.h5>
    The DEC model will be saved to "results/DEC_model_final.h5".

  5. Other usages.

Use python DEC.py -h for help.
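To reuse the saved model later, here is a minimal sketch (assuming ClusteringLayer is importable from DEC.py and the default save path from step 4 above); custom layers must be registered via custom_objects when deserializing:

import numpy as np
from keras.models import load_model
from DEC import ClusteringLayer

model = load_model('results/DEC_model_final.h5',
                   custom_objects={'ClusteringLayer': ClusteringLayer})
q = model.predict(x)         # soft cluster assignments, shape (n_samples, n_clusters)
y_pred = q.argmax(axis=1)    # hard cluster label for each sample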

Results

python run_exp.py

Table 1. Mean performance over 10 trials. See results.csv for detailed results for each trial.

Dataset   Metric   kmeans   AE+kmeans   DEC   paper
mnist     acc      53       88          91    84
          nmi      50       81          87    --
fmnist    acc      47       61          62    --
          nmi      51       64          65    --
usps      acc      67       71          76    --
          nmi      63       68          79    --
stl       acc      70       79          86    --
          nmi      71       72          82    --
reuters   acc      52       76          78    72
          nmi      31       52          57    --

Autoencoder model

See autoencoders.png in the repository for the autoencoder model structure.

Other implementations

Original code (Caffe): https://github.com/piiswrong/dec
MXNet implementation: https://github.com/dmlc/mxnet/blob/master/example/dec/dec.py
Keras implementation without pretraining code: https://github.com/fferroni/DEC-Keras


Issues

How to calculate the clustering (KLD) loss for every instance

Great work!
For the problem I studied, the accuracy reaches 97%, which is very impressive.
How can I compute the DEC loss for every instance after training has completed? For the autoencoder it is straightforward, by defining a simple function:

import tensorflow as tf

def ae_loss(autoencoder, X):
    ae_rec = autoencoder.predict(X)        # reconstructions, same shape as X
    return tf.keras.losses.mse(ae_rec, X)  # per-sample MSE over the feature axis

Defining a similar function for the clustering loss does not work. Any idea how this can be implemented?
I would like to investigate further by studying the loss distribution.
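One way to sketch this, assuming the trained dec.model outputs the soft assignments q, and reusing the same target distribution p as this repo's DEC.target_distribution:

import numpy as np

def dec_loss_per_instance(dec, x, eps=1e-12):
    q = dec.model.predict(x)          # soft assignments, shape (n_samples, n_clusters)
    weight = q ** 2 / q.sum(0)        # target distribution p, as in DEC.target_distribution
    p = (weight.T / weight.sum(1)).T
    # per-sample KL(p || q); the reported training loss is the mean of these values
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=1)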

SAE part may be wrong.

The SAE (stacked autoencoder) should be trained layer-wise, meaning the next autoencoder only starts training after the previous one has been trained. From the original paper:

After training of one layer, we use its output h as the input to train the next layer.

However, judging from the model structure image (autoencoders.png), the encoders are connected to each other, followed by a number of decoders, and there is only one training phase over the whole "autoencoder".
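For comparison, here is a minimal sketch of greedy layer-wise pretraining as the paper describes it (my own illustration under stated assumptions, not this repo's code): each one-layer autoencoder is trained on its own, and its output h becomes the input for the next one.

import numpy as np
from keras.layers import Dense, Input
from keras.models import Model

def pretrain_layerwise(x, dims=(500, 500, 2000, 10), epochs=50):
    features = x
    encoders = []
    for units in dims:
        inp = Input(shape=(features.shape[1],))
        h = Dense(units, activation='relu')(inp)
        out = Dense(features.shape[1])(h)       # decode back to the input width
        ae = Model(inp, out)
        ae.compile(optimizer='sgd', loss='mse')
        ae.fit(features, features, batch_size=256, epochs=epochs, verbose=0)
        enc = Model(inp, h)                     # keep the trained encoder half
        encoders.append(enc)
        features = enc.predict(features)        # output h feeds the next autoencoder
    return encoders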

about accuracy

Running the code several times leads to different accuracy values. What could be the cause of this?

Loss stays at 0 during training (clustering)

During training with the clustering layer, the loss stays at 0 through all iterations while nmi, acc and ari look fine. Why does this happen? Does it indicate that the encoder layers are not being trained during this stage?

Conflicts to the Network Structure

Hi all,

Thanks for sharing a great model for deep clustering. I aim to leverage your excellent work to cluster images for my project.

Based on your paper, the network structure is based on a stacked autoencoder composed of 2 pairs of Dropout-Dense layers. However, the implementation does not follow this network structure. Is there a reason behind this?

Done training, but how to predict on new data?

Great work!
I am training this using a different approach, with the 20newsgroup dataset. Training is finished and I can visualize the clusters with z_2d and the cluster centroids from the pickle file.
But how do I now predict on new data and find out which cluster it is mapped to?
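A sketch of one way to do this, assuming x_new is preprocessed exactly like the training data (same vectorization and scaling):

q = dec.model.predict(x_new)   # soft assignment over clusters, shape (n, n_clusters)
labels = q.argmax(axis=1)      # index of the cluster each new sample is mapped to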

data/reuters/get_data.sh Outdated Links

Hello! The links for the Reuters data are outdated. The new base link is http://www.ai.mit.edu/projects/jmlr/ instead of http://jmlr.csail.mit.edu/.

This should be the new contents of get_data.sh:

#!/bin/sh
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt0.dat.gz
gunzip lyrl2004_tokens_test_pt0.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt1.dat.gz
gunzip lyrl2004_tokens_test_pt1.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt2.dat.gz
gunzip lyrl2004_tokens_test_pt2.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt3.dat.gz
gunzip lyrl2004_tokens_test_pt3.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_train.dat.gz
gunzip lyrl2004_tokens_train.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a08-topic-qrels/rcv1-v2.topics.qrels.gz
gunzip rcv1-v2.topics.qrels.gz

Licence

@XifengGuo, thanks for providing some interesting code. Can you add a licence?

about acc

I checked the closed issue about acc, where you replied that the pretraining strategy is different.

Can you explain the difference between this implementation and the paper? (I checked the authors' pretraining method in the paper, but I can't find your strategy in this repository.)

Also, the DEC paper uses dropout; why didn't you use a dropout layer?

Possible to get code working with plaidml backend?

$ python DEC.py --dataset mnist
Using plaidml.keras.backend backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', maxiter=20000.0, pretrain_epochs=None, save_dir='results', tol=0.001, update_interval=None)
('MNIST samples', (70000, 784))
INFO:plaidml:Opening device "metal_amd_radeon_hd_-_firepro_d500.1"
return array(obj, copy=False)
Traceback (most recent call last):
File "DEC.py", line 321, in <module>
dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, init=init)
File "DEC.py", line 138, in __init__
clustering_layer = ClusteringLayer(self.n_clusters, name='clustering')(self.encoder.output)
File "/Users/test/plaidml-venv/lib/python2.7/site-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "DEC.py", line 106, in call
q **= (self.alpha + 1.0) / 2.0
TypeError: unsupported operand type(s) for ** or pow(): 'Value' and 'float'
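A possible workaround (untested with plaidml) is to route the power through the Keras backend instead of Python's ** operator in ClusteringLayer.call:

import keras.backend as K

# replaces: q **= (self.alpha + 1.0) / 2.0
q = K.pow(q, (self.alpha + 1.0) / 2.0)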

Low acc on the CIFAR-10 dataset

Hello, and thanks for your reimplementation of DEC. I have a question: I saw in the code that there is a CIFAR-10 dataset with feature extraction, but my results on CIFAR-10 are very poor, with acc around 0.3. Which hyperparameters have the biggest impact? Looking forward to your reply.

Encoder weights are not updated during the clustering phase

Many thanks for the code.

I have implemented 'Unsupervised Deep Embedding for Clustering Analysis' using PyTorch, and I noticed that the PyTorch version converges much more slowly than your Keras version. Going through the details, I noticed that in your version the encoders' weights are not updated during the clustering stage.

I'm not sure what the reason is, but according to the paper, the encoders' weights must be updated.

Loss first rises, then falls

Hello, when I use your DEC, the loss first rises and then falls, but the accuracy keeps increasing. Do you know what might cause this?

TypeError: 'float' object cannot be interpreted as an integer

I get the following error:
run DEC.py mnist
Using TensorFlow backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', gamma=0.1, maxiter=20000.0, n_clusters=10, save_dir='results', tol=0.001, update_interval=140)
MNIST samples (70000, 784)
No pretrained ae_weights given, start pretraining...
Pretraining the 1th layer...
learning rate = 0.1
Traceback (most recent call last):

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\DEC.py", line 311, in
x=x)

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\DEC.py", line 170, in initialize_model
sae.fit(x, epochs=400)

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\SAE.py", line 133, in fit
self.pretrain_stacks(x, epochs=epochs/2)

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\SAE.py", line 102, in pretrain_stacks
self.stacks[i].fit(features, features, batch_size=self.batch_size, epochs=epochs/3)

File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\models.py", line 867, in fit
initial_epoch=initial_epoch)

File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py", line 1598, in fit
validation_steps=validation_steps)

File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py", line 1130, in _fit_loop
for epoch in range(initial_epoch, epochs):

TypeError: 'float' object cannot be interpreted as an integer
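The traceback suggests the epoch counts become floats because Python 3's / operator returns a float, while Keras passes epochs to range(). A likely fix (my guess from the traceback, not a confirmed patch) is to use integer division in SAE.py:

# SAE.py, fit(): keep the epoch count an integer
self.pretrain_stacks(x, epochs=epochs // 2)

# SAE.py, pretrain_stacks(): likewise
self.stacks[i].fit(features, features, batch_size=self.batch_size, epochs=epochs // 3)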

new dataset

Hello, this is great work. Thanks!

I just have a question: how can I use my own dataset with this? I have a folder of images that I would like clustered.

Thanks!
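One way to sketch this (assumptions: grayscale images resized to a fixed size, Pillow installed, and a hypothetical folder name my_images) is to flatten the images into the [0,1]-scaled matrix that DEC.py expects, analogous to MNIST's (70000, 784):

import os
import numpy as np
from PIL import Image

def load_image_folder(folder, size=(28, 28)):
    rows = []
    for name in sorted(os.listdir(folder)):
        img = Image.open(os.path.join(folder, name)).convert('L').resize(size)
        rows.append(np.asarray(img, dtype=np.float32).ravel() / 255.)
    return np.stack(rows)   # shape (n_images, size[0]*size[1])

x = load_image_folder('my_images')
# then e.g.: dec = DEC(dims=[x.shape[-1], 500, 500, 2000, n_clusters], n_clusters=n_clusters)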

train_y.bin

Thank you for your contribution!

python DEC.py --dataset mnist

This runs fine.

However,

python run_exp.py
yields this:

Reached tolerance threshold. Stopping training.
('saving model to:', './results/exp1/reuters10k/trial9/DEC_model_final.h5')
Traceback (most recent call last):
File "run_exp.py", line 26, in
x, y = load_data(db)
File "./datasets.py", line 324, in load_data
return load_stl()
File "./datasets.py", line 283, in load_stl
y1 = np.fromfile(data_path + '/train_y.bin', dtype=np.uint8) - 1
IOError: [Errno 2] No such file or directory: './data/stl/train_y.bin'

Any ideas on what the problem is?
Much appreciated.
Thanks.
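A likely cause (my reading of the traceback, not a confirmed answer): run_exp.py loops over all datasets, including STL, and ./data/stl/train_y.bin only exists after the STL download step from the README (cd data/stl && bash get_data.sh).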

Question about epoch

Thanks for your great implementation!
I have a question about experimenting with it. There are default epoch settings (e.g. MNIST: 300 epochs). Are they the same values as in your IDEC paper experiments?

I want to reproduce your experiment for study, but the accuracy I obtained with your DEC implementation does not match the accuracy reported in your IDEC paper.

What does it mean if clustering accuracy metric fluctuates a lot?

I am wondering what it means if the accuracy, nmi, and ari metrics fluctuate a lot. When training on MNIST, pretty much every update interval brings an improvement in accuracy and there is an upward trend.
However, when I train on my own dataset there are lots of fluctuations: the accuracy sometimes starts high at iteration 0, then goes lower, then high again, and ends up somewhere in between. Does this mean something is wrong with the data? Is this trend indicative of something else?

Suggested fix for deprecated 'from sklearn.utils.linear_assignment_ import linear_assignment'

Hello great work!

I think 'from sklearn.utils.linear_assignment_ import linear_assignment' is now deprecated, and I would recommend making the following changes to the accuracy module.

import numpy as np
from scipy.optimize import linear_sum_assignment as linear_assignment

def acc(y_true, y_pred):
    """
    Calculate clustering accuracy. Requires scipy installed.
    # Arguments
        y_true: true labels, numpy.array with shape (n_samples,)
        y_pred: predicted labels, numpy.array with shape (n_samples,)
    # Return
        accuracy, in [0,1]
    """
    y_true = y_true.astype(np.int64)
    assert y_pred.size == y_true.size
    D = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1   # co-occurrence counts of predicted vs. true labels
    # linear_sum_assignment returns (row_ind, col_ind); pair them into (i, j) tuples
    ind = np.transpose(np.asarray(linear_assignment(w.max() - w)))
    return sum(w[i, j] for i, j in ind) * 1.0 / y_pred.size

Thanks for all the great work!
Ali

TypeError: add_weight() got multiple values for argument 'name'

I got the following error when running python DEC.py

Using TensorFlow backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', maxiter=20000.0, pretrain_epochs=None, save_dir='results', tol=0.001, update_interval=None)
MNIST samples (70000, 784)
Traceback (most recent call last):
File "DEC.py", line 321, in <module>
dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, init=init)
File "DEC.py", line 138, in __init__
clustering_layer = ClusteringLayer(self.n_clusters, name='clustering')(self.encoder.output)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/base_layer.py", line 463, in __call__
self.build(unpack_singleton(input_shapes))
File "DEC.py", line 91, in build
self.clusters = self.add_weight((self.n_clusters, input_dim), initializer='glorot_uniform', name='clusters')
TypeError: add_weight() got multiple values for argument 'name'
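In newer Keras versions add_weight expects the shape as a keyword argument (the first positional parameter is name), so passing the shape tuple positionally collides with name=. A likely fix, untested here:

# DEC.py, ClusteringLayer.build(): pass shape= explicitly
self.clusters = self.add_weight(shape=(self.n_clusters, input_dim),
                                initializer='glorot_uniform',
                                name='clusters')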

How to feed TFRecord data (over 60GB) to the DEC-keras model?

Thanks for your great implementation!
I've tried to solve a classification problem whose input data have the shape 1000*221 with the DEC model.
I want to train on over 80 thousand samples (shape [8000000, 1000, 221], dtype=float32, about 60GB), so it is not possible to load the whole dataset into a Python array.
After googling, I found that tf.data.TFRecordDataset helps to get around this capacity problem.

I followed the tutorial on the official TensorFlow site to write the TFRecord file, and I can load the TFRecord into a conventional Keras model. However, I can't find how to feed it into the DEC model. The input (mnist) of the DEC model is one numpy array with the shape [70000, 784].

Like the following:

dataset = tf.data.TFRecordDataset(filenames=[filenames])
parsed_dataset = dataset.map(_parse_function, num_parallel_calls=8)
final_dataset = parsed_dataset.shuffle(buffer_size=number_of_sample).batch(10)
iterator = final_dataset.make_one_shot_iterator()   # iterate the parsed, batched dataset
parsed_record = iterator.get_next()
feature, label = parsed_record['feature'], parsed_record['label']

# keras
inputs = keras.Input(shape=(1000, 221), name='feature', tensor=feature)
model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy', 'categorical_crossentropy'],
              target_tensors=[label])
model.fit(epochs=30,
          steps_per_epoch=800000 // 256)

ValueError: No such layer: clustering.

This is what I encountered when running the script. Can anyone help me resolve this issue?


Layer (type)                           Output Shape      Param #    Connected to
==================================================================================================
input (InputLayer)                     [(None, 784)]     0          []
encoder_0 (Dense)                      (None, 500)       392500     ['input[0][0]']
encoder_1 (Dense)                      (None, 500)       250500     ['encoder_0[0][0]']
encoder_2 (Dense)                      (None, 2000)      1002000    ['encoder_1[0][0]']
encoder_3 (Dense)                      (None, 10)        20010      ['encoder_2[0][0]']
tf.expand_dims (TFOpLambda)            (None, 1, 10)     0          ['encoder_3[0][0]']
tf.math.subtract (TFOpLambda)          (None, 10, 10)    0          ['tf.expand_dims[0][0]']
tf.math.square (TFOpLambda)            (None, 10, 10)    0          ['tf.math.subtract[0][0]']
tf.math.reduce_sum (TFOpLambda)        (None, 10)        0          ['tf.math.square[0][0]']
tf.math.truediv (TFOpLambda)           (None, 10)        0          ['tf.math.reduce_sum[0][0]']
tf.__operators__.add (TFOpLambda)      (None, 10)        0          ['tf.math.truediv[0][0]']
tf.math.truediv_1 (TFOpLambda)         (None, 10)        0          ['tf.__operators__.add[0][0]']
tf.math.pow (TFOpLambda)               (None, 10)        0          ['tf.math.truediv_1[0][0]']
tf.compat.v1.transpose (TFOpLambda)    (10, None)        0          ['tf.math.pow[0][0]']
tf.math.reduce_sum_1 (TFOpLambda)      (None,)           0          ['tf.math.pow[0][0]']
tf.math.truediv_2 (TFOpLambda)         (10, None)        0          ['tf.compat.v1.transpose[0][0]',
                                                                     'tf.math.reduce_sum_1[0][0]']
tf.compat.v1.transpose_1 (TFOpLambda)  (None, 10)        0          ['tf.math.truediv_2[0][0]']
==================================================================================================
Total params: 1,665,010
Trainable params: 1,665,010
Non-trainable params: 0


Update interval 140
Save interval 1365
Initializing cluster centers with k-means.
2188/2188 [==============================] - 10s 4ms/step
Traceback (most recent call last):
File "DEC.py", line 335, in <module>
y_pred = dec.fit(x, y=y, tol=args.tol, maxiter=args.maxiter, batch_size=args.batch_size,
File "DEC.py", line 210, in fit
self.model.get_layer(name='clustering').set_weights([kmeans.cluster_centers_])
File "/research/DEC_Pytorch_tutorial/dec_venv/lib/python3.8/site-packages/keras/engine/training.py", line 3353, in get_layer
raise ValueError(
ValueError: No such layer: clustering. Existing layers are: ['input', 'encoder_0', 'encoder_1', 'encoder_2', 'encoder_3', 'tf.expand_dims', 'tf.math.subtract', 'tf.math.square', 'tf.math.reduce_sum', 'tf.math.truediv', 'tf.__operators__.add', 'tf.math.truediv_1', 'tf.math.pow', 'tf.compat.v1.transpose', 'tf.math.reduce_sum_1', 'tf.math.truediv_2', 'tf.compat.v1.transpose_1'].

Why is evaluation done on the training set

If I understand correctly, the model is evaluated on the same data that it's trained on. Doesn't this lead to a wrong evaluation?

Load data

x, y = load_data(args.dataset)

DEC-keras/datasets.py

Lines 94 to 103 in 2438070

def load_mnist():
    # the data, shuffled and split between train and test sets
    from keras.datasets import mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x = np.concatenate((x_train, x_test))
    y = np.concatenate((y_train, y_test))
    x = x.reshape((x.shape[0], -1))
    x = np.divide(x, 255.)
    print('MNIST samples', x.shape)
    return x, y

Evaluate

DEC-keras/DEC.py

Lines 333 to 335 in 2438070

y_pred = dec.fit(x, y=y, tol=args.tol, maxiter=args.maxiter, batch_size=args.batch_size,
                 update_interval=update_interval, save_dir=args.save_dir)
print('acc:', metrics.acc(y, y_pred))

Shouldn't x_train and y_train be used to pretrain and fit, and then x_test and y_test be used to evaluate the model?
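For reference, a sketch of a held-out evaluation (my illustration, reusing the dec object, the metrics module, and the default hyperparameters that appear elsewhere in this repo):

from sklearn.model_selection import train_test_split
import metrics   # this repo's metrics module

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# pretrain and cluster on the training split only
y_pred_train = dec.fit(x_train, y=y_train, tol=0.001, maxiter=20000,
                       batch_size=256, update_interval=140, save_dir='results')
# evaluate on data the model never saw during training
q_test = dec.model.predict(x_test)
print('held-out acc:', metrics.acc(y_test, q_test.argmax(axis=1)))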
