
Deep Embedding Clustering (DEC)

Keras implementation for ICML-2016 paper:

  • Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. ICML 2016.

Usage

  1. Install Keras>=2.0.9 and scikit-learn:
pip install keras scikit-learn   
  2. Clone the code locally:
git clone https://github.com/XifengGuo/DEC-keras.git DEC
cd DEC
  3. Prepare datasets.

Download STL:

cd data/stl
bash get_data.sh
cd ../..

MNIST and Fashion-MNIST (FMNIST) can be downloaded automatically when you run the code.

Reuters and USPS: If you cannot find these datasets yourself, you can download them from:
https://pan.baidu.com/s/1hsMQ8Tm (password: 4ss4) for Reuters, and
https://pan.baidu.com/s/1skRg9Dr (password: sc58) for USPS

  4. Run the experiment on MNIST:
    python DEC.py --dataset mnist
    or, if pretrained autoencoder weights are available:
    python DEC.py --dataset mnist --ae_weights <path-to-ae-weights.h5>
    The DEC model will be saved to "results/DEC_model_final.h5".

  5. Other usages.

Use python DEC.py -h for help.
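To reuse the saved model later, here is a minimal sketch (assuming ClusteringLayer is importable from DEC.py and the default save path from step 4 above); custom layers must be registered via custom_objects when deserializing:

import numpy as np
from keras.models import load_model
from DEC import ClusteringLayer

model = load_model('results/DEC_model_final.h5',
                   custom_objects={'ClusteringLayer': ClusteringLayer})
q = model.predict(x)         # soft cluster assignments, shape (n_samples, n_clusters)
y_pred = q.argmax(axis=1)    # hard cluster label for each sample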

Results

python run_exp.py

Table 1. Mean performance over 10 trials. See results.csv for detailed results for each trial.

Dataset   Metric   kmeans   AE+kmeans   DEC   paper
mnist     acc      53       88          91    84
          nmi      50       81          87    --
fmnist    acc      47       61          62    --
          nmi      51       64          65    --
usps      acc      67       71          76    --
          nmi      63       68          79    --
stl       acc      70       79          86    --
          nmi      71       72          82    --
reuters   acc      52       76          78    72
          nmi      31       52          57    --

Autoencoder model

See autoencoders.png in the repository for the autoencoder model structure.

Other implementations

Original code (Caffe): https://github.com/piiswrong/dec
MXNet implementation: https://github.com/dmlc/mxnet/blob/master/example/dec/dec.py
Keras implementation without pretraining code: https://github.com/fferroni/DEC-Keras


Issues

How to calculate the clustering (KLD) loss for every instance

Great work!
For the problem I studied, the accuracy reaches 97%, which is very impressive.
How can I compute the DEC loss for every instance after training has completed? For the autoencoder it is straightforward, by defining a simple function:

import tensorflow as tf

def ae_loss(autoencoder, X):
    ae_rec = autoencoder.predict(X)        # reconstructions, same shape as X
    return tf.keras.losses.mse(ae_rec, X)  # per-sample MSE over the feature axis

Defining a similar function for the clustering loss does not work. Any idea how this can be implemented?
I would like to investigate further by studying the loss distribution.
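One way to sketch this, assuming the trained dec.model outputs the soft assignments q, and reusing the same target distribution p as this repo's DEC.target_distribution:

import numpy as np

def dec_loss_per_instance(dec, x, eps=1e-12):
    q = dec.model.predict(x)          # soft assignments, shape (n_samples, n_clusters)
    weight = q ** 2 / q.sum(0)        # target distribution p, as in DEC.target_distribution
    p = (weight.T / weight.sum(1)).T
    # per-sample KL(p || q); the reported training loss is the mean of these values
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=1)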

SAE part may be wrong.

The SAE (stacked autoencoder) should be trained layer-wise, meaning the next autoencoder only starts training after the previous one has been trained. From the original paper:

After training of one layer, we use its output h as the input to train the next layer.

However, judging from the model structure image (autoencoders.png), the encoders are connected to each other, followed by a number of decoders, and there is only one training phase over the whole "autoencoder".
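For comparison, here is a minimal sketch of greedy layer-wise pretraining as the paper describes it (my own illustration under stated assumptions, not this repo's code): each one-layer autoencoder is trained on its own, and its output h becomes the input for the next one.

import numpy as np
from keras.layers import Dense, Input
from keras.models import Model

def pretrain_layerwise(x, dims=(500, 500, 2000, 10), epochs=50):
    features = x
    encoders = []
    for units in dims:
        inp = Input(shape=(features.shape[1],))
        h = Dense(units, activation='relu')(inp)
        out = Dense(features.shape[1])(h)       # decode back to the input width
        ae = Model(inp, out)
        ae.compile(optimizer='sgd', loss='mse')
        ae.fit(features, features, batch_size=256, epochs=epochs, verbose=0)
        enc = Model(inp, h)                     # keep the trained encoder half
        encoders.append(enc)
        features = enc.predict(features)        # output h feeds the next autoencoder
    return encoders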

about accuracy

Running the code several times leads to different accuracy values. What could be the cause of this?

Loss stays at 0 during training (clustering)

During training with the clustering layer, the loss stays at 0 through all iterations while nmi, acc and ari look fine. Why does this happen? Does it indicate that the encoder layers are not being trained during this stage?

Conflicts to the Network Structure

Hi all,

Thanks for sharing a great model for deep clustering. I aim to leverage your excellent work to cluster images for my project.

Based on your paper, the network structure is based on a stacked autoencoder composed of 2 pairs of Dropout-Dense layers. However, the implementation does not follow this network structure. Is there a reason behind this?

Done training, but how to predict on new data?

Great work!
I am training this using a different approach, with the 20newsgroup dataset. Training is finished and I can visualize the clusters with z_2d and the cluster centroids from the pickle file.
But how do I now predict on new data and find out which cluster it is mapped to?
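A sketch of one way to do this, assuming x_new is preprocessed exactly like the training data (same vectorization and scaling):

q = dec.model.predict(x_new)   # soft assignment over clusters, shape (n, n_clusters)
labels = q.argmax(axis=1)      # index of the cluster each new sample is mapped to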

data/reuters/get_data.sh Outdated Links

Hello! The links for the Reuters data are outdated. The new base link is http://www.ai.mit.edu/projects/jmlr/ instead of http://jmlr.csail.mit.edu/.

This should be the new contents of get_data.sh:

#!/bin/sh
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt0.dat.gz
gunzip lyrl2004_tokens_test_pt0.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt1.dat.gz
gunzip lyrl2004_tokens_test_pt1.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt2.dat.gz
gunzip lyrl2004_tokens_test_pt2.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt3.dat.gz
gunzip lyrl2004_tokens_test_pt3.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_train.dat.gz
gunzip lyrl2004_tokens_train.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a08-topic-qrels/rcv1-v2.topics.qrels.gz
gunzip rcv1-v2.topics.qrels.gz

Licence

@XifengGuo, thanks for providing some interesting code. Can you add a licence?

about acc

I checked the closed issue about acc, where you replied that the pretraining strategy is different.

Can you explain the difference between this implementation and the paper? (I checked the authors' pretraining method in the paper, but I can't find your strategy in this repository.)

Also, the DEC paper uses dropout; why didn't you use a dropout layer?

Possible to get code working with plaidml backend?

$ python DEC.py --dataset mnist
Using plaidml.keras.backend backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', maxiter=20000.0, pretrain_epochs=None, save_dir='results', tol=0.001, update_interval=None)
('MNIST samples', (70000, 784))
INFO:plaidml:Opening device "metal_amd_radeon_hd_-_firepro_d500.1"
return array(obj, copy=False)
Traceback (most recent call last):
File "DEC.py", line 321, in <module>
dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, init=init)
File "DEC.py", line 138, in __init__
clustering_layer = ClusteringLayer(self.n_clusters, name='clustering')(self.encoder.output)
File "/Users/test/plaidml-venv/lib/python2.7/site-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "DEC.py", line 106, in call
q **= (self.alpha + 1.0) / 2.0
TypeError: unsupported operand type(s) for ** or pow(): 'Value' and 'float'
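A possible workaround (untested with plaidml) is to route the power through the Keras backend instead of Python's ** operator in ClusteringLayer.call:

import keras.backend as K

# replaces: q **= (self.alpha + 1.0) / 2.0
q = K.pow(q, (self.alpha + 1.0) / 2.0)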

Low acc on the CIFAR-10 dataset

Hello, and thanks for your reimplementation of DEC. I have a question: I saw in the code that there is a CIFAR-10 dataset with feature extraction, but my results on CIFAR-10 are very poor, with acc around 0.3. Which hyperparameters have the biggest impact? Looking forward to your reply.

Encoder weights are not updated during the clustering phase

Many thanks for the code.

I have implemented 'Unsupervised Deep Embedding for Clustering Analysis' using PyTorch, and I noticed that the PyTorch version converges much more slowly than your Keras version. Going through the details, I noticed that in your version the encoders' weights are not updated during the clustering stage.

I'm not sure what the reason is, but according to the paper, the encoders' weights must be updated.

Loss first rises, then falls

Hello, when I use your DEC, the loss first rises and then falls, but the accuracy keeps increasing. Do you know what might cause this?

TypeError: 'float' object cannot be interpreted as an integer

I get the following error:
run DEC.py mnist
Using TensorFlow backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', gamma=0.1, maxiter=20000.0, n_clusters=10, save_dir='results', tol=0.001, update_interval=140)
MNIST samples (70000, 784)
No pretrained ae_weights given, start pretraining...
Pretraining the 1th layer...
learning rate = 0.1
Traceback (most recent call last):

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\DEC.py", line 311, in
x=x)

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\DEC.py", line 170, in initialize_model
sae.fit(x, epochs=400)

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\SAE.py", line 133, in fit
self.pretrain_stacks(x, epochs=epochs/2)

File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\SAE.py", line 102, in pretrain_stacks
self.stacks[i].fit(features, features, batch_size=self.batch_size, epochs=epochs/3)

File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\models.py", line 867, in fit
initial_epoch=initial_epoch)

File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py", line 1598, in fit
validation_steps=validation_steps)

File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py", line 1130, in _fit_loop
for epoch in range(initial_epoch, epochs):

TypeError: 'float' object cannot be interpreted as an integer
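The traceback suggests the epoch counts become floats because Python 3's / operator returns a float, while Keras passes epochs to range(). A likely fix (my guess from the traceback, not a confirmed patch) is to use integer division in SAE.py:

# SAE.py, fit(): keep the epoch count an integer
self.pretrain_stacks(x, epochs=epochs // 2)

# SAE.py, pretrain_stacks(): likewise
self.stacks[i].fit(features, features, batch_size=self.batch_size, epochs=epochs // 3)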

new dataset

Hello, this is great work. Thanks!

I just have a question: how can I use my own dataset with this? I have a folder of images that I would like clustered.

Thanks!
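One way to sketch this (assumptions: grayscale images resized to a fixed size, Pillow installed, and a hypothetical folder name my_images) is to flatten the images into the [0,1]-scaled matrix that DEC.py expects, analogous to MNIST's (70000, 784):

import os
import numpy as np
from PIL import Image

def load_image_folder(folder, size=(28, 28)):
    rows = []
    for name in sorted(os.listdir(folder)):
        img = Image.open(os.path.join(folder, name)).convert('L').resize(size)
        rows.append(np.asarray(img, dtype=np.float32).ravel() / 255.)
    return np.stack(rows)   # shape (n_images, size[0]*size[1])

x = load_image_folder('my_images')
# then e.g.: dec = DEC(dims=[x.shape[-1], 500, 500, 2000, n_clusters], n_clusters=n_clusters)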

train_y.bin

Thank you for your contribution!

python DEC.py --dataset mnist

This runs fine.

However,

python run_exp.py
yields this:

Reached tolerance threshold. Stopping training.
('saving model to:', './results/exp1/reuters10k/trial9/DEC_model_final.h5')
Traceback (most recent call last):
File "run_exp.py", line 26, in
x, y = load_data(db)
File "./datasets.py", line 324, in load_data
return load_stl()
File "./datasets.py", line 283, in load_stl
y1 = np.fromfile(data_path + '/train_y.bin', dtype=np.uint8) - 1
IOError: [Errno 2] No such file or directory: './data/stl/train_y.bin'

Any ideas on what the problem is?
Much appreciated.
Thanks.
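A likely cause (my reading of the traceback, not a confirmed answer): run_exp.py loops over all datasets, including STL, and ./data/stl/train_y.bin only exists after the STL download step from the README (cd data/stl && bash get_data.sh).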

Question about epoch

Thanks for your great implementation!
I have a question about experimenting with it. There are default epoch settings (e.g. MNIST: 300 epochs). Are they the same values as in your IDEC paper experiments?

I want to reproduce your experiment for study, but the accuracy I obtained with your DEC implementation does not match the accuracy reported in your IDEC paper.

What does it mean if clustering accuracy metric fluctuates a lot?

I am wondering what it means if the accuracy, nmi, and ari metrics fluctuate a lot. When training on MNIST, pretty much every update interval brings an improvement in accuracy and there is an upward trend.
However, when I train on my own dataset there are lots of fluctuations: the accuracy sometimes starts high at iteration 0, then goes lower, then high again, and ends up somewhere in between. Does this mean something is wrong with the data? Is this trend indicative of something else?

Suggested fix for deprecated 'from sklearn.utils.linear_assignment_ import linear_assignment'

Hello great work!

I think 'from sklearn.utils.linear_assignment_ import linear_assignment' is now deprecated, and I would recommend making the following changes to the accuracy module.

import numpy as np
from scipy.optimize import linear_sum_assignment as linear_assignment

def acc(y_true, y_pred):
    """
    Calculate clustering accuracy. Requires scipy installed.
    # Arguments
        y_true: true labels, numpy.array with shape (n_samples,)
        y_pred: predicted labels, numpy.array with shape (n_samples,)
    # Return
        accuracy, in [0,1]
    """
    y_true = y_true.astype(np.int64)
    assert y_pred.size == y_true.size
    D = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1   # co-occurrence counts of predicted vs. true labels
    # linear_sum_assignment returns (row_ind, col_ind); pair them into (i, j) tuples
    ind = np.transpose(np.asarray(linear_assignment(w.max() - w)))
    return sum(w[i, j] for i, j in ind) * 1.0 / y_pred.size

Thanks for all the great work!
Ali

TypeError: add_weight() got multiple values for argument 'name'

I got the following error when running python DEC.py

Using TensorFlow backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', maxiter=20000.0, pretrain_epochs=None, save_dir='results', tol=0.001, update_interval=None)
MNIST samples (70000, 784)
Traceback (most recent call last):
File "DEC.py", line 321, in <module>
dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, init=init)
File "DEC.py", line 138, in __init__
clustering_layer = ClusteringLayer(self.n_clusters, name='clustering')(self.encoder.output)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/base_layer.py", line 463, in __call__
self.build(unpack_singleton(input_shapes))
File "DEC.py", line 91, in build
self.clusters = self.add_weight((self.n_clusters, input_dim), initializer='glorot_uniform', name='clusters')
TypeError: add_weight() got multiple values for argument 'name'
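In newer Keras versions add_weight expects the shape as a keyword argument (the first positional parameter is name), so passing the shape tuple positionally collides with name=. A likely fix, untested here:

# DEC.py, ClusteringLayer.build(): pass shape= explicitly
self.clusters = self.add_weight(shape=(self.n_clusters, input_dim),
                                initializer='glorot_uniform',
                                name='clusters')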

How to feed TFRecord data (over 60GB) to the DEC-keras model?

Thanks for your great implementation!
I've tried to solve a classification problem whose input data have the shape 1000*221 with the DEC model.
I want to train on over 80 thousand samples (shape [8000000, 1000, 221], dtype=float32, about 60GB), so it is not possible to load the whole dataset into a Python array.
After googling, I found that tf.data.TFRecordDataset helps to get around this capacity problem.

I followed the tutorial on the official TensorFlow site to write the TFRecord file, and I can load the TFRecord into a conventional Keras model. However, I can't find how to feed it into the DEC model. The input (mnist) of the DEC model is one numpy array with the shape [70000, 784].

Like the following:

dataset = tf.data.TFRecordDataset(filenames=[filenames])
parsed_dataset = dataset.map(_parse_function, num_parallel_calls=8)
final_dataset = parsed_dataset.shuffle(buffer_size=number_of_sample).batch(10)
iterator = final_dataset.make_one_shot_iterator()   # iterate the parsed, batched dataset
parsed_record = iterator.get_next()
feature, label = parsed_record['feature'], parsed_record['label']

# keras
inputs = keras.Input(shape=(1000, 221), name='feature', tensor=feature)
model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy', 'categorical_crossentropy'],
              target_tensors=[label])
model.fit(epochs=30,
          steps_per_epoch=800000 // 256)

ValueError: No such layer: clustering.

This is what I encountered when running the script. Can anyone help me resolve this issue?


Layer (type)                           Output Shape      Param #    Connected to
==================================================================================================
input (InputLayer)                     [(None, 784)]     0          []
encoder_0 (Dense)                      (None, 500)       392500     ['input[0][0]']
encoder_1 (Dense)                      (None, 500)       250500     ['encoder_0[0][0]']
encoder_2 (Dense)                      (None, 2000)      1002000    ['encoder_1[0][0]']
encoder_3 (Dense)                      (None, 10)        20010      ['encoder_2[0][0]']
tf.expand_dims (TFOpLambda)            (None, 1, 10)     0          ['encoder_3[0][0]']
tf.math.subtract (TFOpLambda)          (None, 10, 10)    0          ['tf.expand_dims[0][0]']
tf.math.square (TFOpLambda)            (None, 10, 10)    0          ['tf.math.subtract[0][0]']
tf.math.reduce_sum (TFOpLambda)        (None, 10)        0          ['tf.math.square[0][0]']
tf.math.truediv (TFOpLambda)           (None, 10)        0          ['tf.math.reduce_sum[0][0]']
tf.__operators__.add (TFOpLambda)      (None, 10)        0          ['tf.math.truediv[0][0]']
tf.math.truediv_1 (TFOpLambda)         (None, 10)        0          ['tf.__operators__.add[0][0]']
tf.math.pow (TFOpLambda)               (None, 10)        0          ['tf.math.truediv_1[0][0]']
tf.compat.v1.transpose (TFOpLambda)    (10, None)        0          ['tf.math.pow[0][0]']
tf.math.reduce_sum_1 (TFOpLambda)      (None,)           0          ['tf.math.pow[0][0]']
tf.math.truediv_2 (TFOpLambda)         (10, None)        0          ['tf.compat.v1.transpose[0][0]',
                                                                     'tf.math.reduce_sum_1[0][0]']
tf.compat.v1.transpose_1 (TFOpLambda)  (None, 10)        0          ['tf.math.truediv_2[0][0]']
==================================================================================================
Total params: 1,665,010
Trainable params: 1,665,010
Non-trainable params: 0


Update interval 140
Save interval 1365
Initializing cluster centers with k-means.
2188/2188 [==============================] - 10s 4ms/step
Traceback (most recent call last):
File "DEC.py", line 335, in <module>
y_pred = dec.fit(x, y=y, tol=args.tol, maxiter=args.maxiter, batch_size=args.batch_size,
File "DEC.py", line 210, in fit
self.model.get_layer(name='clustering').set_weights([kmeans.cluster_centers_])
File "/research/DEC_Pytorch_tutorial/dec_venv/lib/python3.8/site-packages/keras/engine/training.py", line 3353, in get_layer
raise ValueError(
ValueError: No such layer: clustering. Existing layers are: ['input', 'encoder_0', 'encoder_1', 'encoder_2', 'encoder_3', 'tf.expand_dims', 'tf.math.subtract', 'tf.math.square', 'tf.math.reduce_sum', 'tf.math.truediv', 'tf.__operators__.add', 'tf.math.truediv_1', 'tf.math.pow', 'tf.compat.v1.transpose', 'tf.math.reduce_sum_1', 'tf.math.truediv_2', 'tf.compat.v1.transpose_1'].

Why is evaluation done on the training set

If I understand correctly, the model is evaluated on the same data that it's trained on. Doesn't this lead to a wrong evaluation?

Load data

x, y = load_data(args.dataset)

DEC-keras/datasets.py

Lines 94 to 103 in 2438070

def load_mnist():
    # the data, shuffled and split between train and test sets
    from keras.datasets import mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x = np.concatenate((x_train, x_test))
    y = np.concatenate((y_train, y_test))
    x = x.reshape((x.shape[0], -1))
    x = np.divide(x, 255.)
    print('MNIST samples', x.shape)
    return x, y

Evaluate

DEC-keras/DEC.py

Lines 333 to 335 in 2438070

y_pred = dec.fit(x, y=y, tol=args.tol, maxiter=args.maxiter, batch_size=args.batch_size,
                 update_interval=update_interval, save_dir=args.save_dir)
print('acc:', metrics.acc(y, y_pred))

Shouldn't x_train and y_train be used to pretrain and fit, and then x_test and y_test be used to evaluate the model?
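For reference, a sketch of a held-out evaluation (my illustration, reusing the dec object, the metrics module, and the default hyperparameters that appear elsewhere in this repo):

from sklearn.model_selection import train_test_split
import metrics   # this repo's metrics module

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# pretrain and cluster on the training split only
y_pred_train = dec.fit(x_train, y=y_train, tol=0.001, maxiter=20000,
                       batch_size=256, update_interval=140, save_dir='results')
# evaluate on data the model never saw during training
q_test = dec.model.predict(x_test)
print('held-out acc:', metrics.acc(y_test, q_test.argmax(axis=1)))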
