fcomitani / simpsom

Python library for Self-Organizing Maps

License: GNU General Public License v3.0

Python 100.00%

clustering dimensionality-reduction kohonen python self-organizing-map

simpsom's Introduction

SimpSOM (Simple Self-Organizing Maps)

Version 3.0.0

Simple Self-Organizing Maps (SimpSOM) is a lightweight Python 3 library for training self-organizing maps (SOMs). It offers an efficient way of training SOMs in Python while keeping its implementation simple and easy to read.

Version 3 is a rewrite focused on performance.

Installation

simpsom can be downloaded from PyPI with

pip install simpsom

To install the latest (unreleased) version you can download it from this repository by running

git clone https://github.com/fcomitani/simpsom
cd simpsom
python setup.py install

Dependencies

Core dependencies:

numpy
scikit-learn
matplotlib

If available, CuPy can be used to run simpsom on the GPU. CuML is also optional, but will allow you to run clustering on the GPU as well.

For a full list see requirements.txt

Example of Usage

Running simpsom is easy. After setting up a network by providing size and tiling style, train it with the train method.

import simpsom as sps

net = sps.SOMNet(20, 20, data, topology='hexagonal', 
                init='PCA', metric='cosine',
                neighborhood_fun='gaussian', PBC=True,
                random_seed=32, GPU=False, CUML=False,
                output_path="./")

net.train(train_algo='batch', start_learning_rate=0.01, epochs=-1, 
    batch_size=-1)

The trained map can be saved to disk.

net.save_map("./trained_som.npy")

The results can be inspected with a variety of plotting functions.

net.plot_map_by_difference(show=True, print_out=True)
net.plot_projected_points(projected_data, color_val=[n.difference for n in net.nodes_list],
        project=False, jitter=False, 
        show=True, print_out=False)

Detailed documentation, API references and tutorials can be found here.

Who is using SimpSOM

Here are some of the research works that use SimpSOM:

Postema, J. T. (2019). Explaining system behaviour in radar systems (Master's thesis, University of Twente).

Lorenzi, C., Barriere, S., Villemin, J. P., Dejardin Bretones, L., Mancheron, A., & Ritchie, W. (2020). iMOKA: k-mer based software to analyze large collections of sequencing data. Genome biology, 21(1), 1-19.

Saunders, J. K., McIlvin, M. R., Dupont, C. L., Kaul, D., Moran, D. M., Horner, T., ... & Saito, M. A. (2022). Microbial functional diversity across biogeochemical provinces in the central Pacific Ocean. Proceedings of the National Academy of Sciences, 119(37), e2200014119.

Contributions

Contributions are always welcome. If you would like to help us improve this library, please fork the main branch and make sure the pytest tests pass after your changes.

Citation

When using this library for your work, please cite the appropriate version from Zenodo

Federico Comitani. (2022). SimpSOM (v2.0.2). Zenodo. https://zenodo.org/record/7187332

simpsom's Issues

Node's coordinates in the SOM

Hi @fcomitani:

I am having trouble determining a node's coordinates in the two-dimensional map. If I take the trained net and want to see the centroid of a node, all I need to do is:

#SOM 4X5
net45 = sps.somNet(4, 5, data, PBC=True)
#Train the network for 10000 epochs and with initial learning rate of 0.01.
s45 = net45.train(0.01, 10000)

#Visualize the centroids of each features into the node 0
s45[0]

But if I want only the two coordinates of where this node is located in the map, I can't get them. Can you help me find where to get these two values? I reviewed your code and couldn't find them.

Thanks so much,

How to load a weight.npy file to predict?

I think that weight updates should not target all nodes.

Hello, first of all, thank you for sharing; it has helped me a lot.

def update_weights(self, inputVec, sigma, lrate, bmu):

	"""Update the node Weights.

	Args:
		inputVec (np.array): A weights vector whose distance drives the direction of the update.
		sigma (float): The updated gaussian sigma.
		lrate (float): The updated learning rate.
		bmu (somNode): The best matching unit.
	"""

	dist=self.get_nodeDistance(bmu)
	gauss=np.exp(-dist*dist/(2*sigma*sigma))  # I think gauss will always > 0
	if gauss>0:
		for i in range(len(self.weights)):
			self.weights[i] = self.weights[i] - gauss*lrate*(self.weights[i]-inputVec[i])

In somNode::update_weights(), the expression 'gauss > 0' will always be true,
so the weights of all nodes are changed throughout the training process.

According to the literature I have read, only the node weights near the BMU should be updated, and this neighborhood should gradually shrink until it contains only the BMU itself.
So I think we should change 'gauss > 0' to 'gauss > x' (0 < x < 1).

Thank you again for sharing and looking forward to your reply.
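The thresholding proposed in this issue can be sketched as follows. This is a minimal illustration, not SimpSOM's implementation: the function name, the `grid_pos` array of node positions and the `cutoff` parameter are all assumptions introduced here. With `cutoff=0.0` every node moves, as in the quoted code; with a positive cutoff only nodes close to the BMU are touched.

```python
import numpy as np

def neighborhood_update(weights, grid_pos, bmu_idx, input_vec, sigma, lrate, cutoff=0.0):
    """Move node weights toward input_vec, scaled by a Gaussian of grid distance to the BMU.

    Nodes whose Gaussian factor is <= cutoff are left untouched (hypothetical parameter).
    """
    dist = np.linalg.norm(grid_pos - grid_pos[bmu_idx], axis=1)
    gauss = np.exp(-dist**2 / (2.0 * sigma**2))
    mask = gauss > cutoff
    updated = weights.copy()
    updated[mask] -= (gauss[mask] * lrate)[:, None] * (weights[mask] - input_vec)
    return updated

# A line of five nodes with zero weights; the BMU is the middle node.
grid_pos = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]], dtype=float)
weights = np.zeros((5, 2))
x = np.ones(2)

all_nodes = neighborhood_update(weights, grid_pos, 2, x, sigma=1.0, lrate=0.5)
near_only = neighborhood_update(weights, grid_pos, 2, x, sigma=1.0, lrate=0.5, cutoff=0.5)
```

In this toy case the two outermost nodes have a Gaussian factor of exp(-2), which is below the 0.5 cutoff, so they keep their original weights in `near_only`.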

Unbound Local Error While training

Hi, I tried training a model and I got the following error.

Training SOM... 0%

UnboundLocalError Traceback (most recent call last)
in ()
----> 1 net.train(0.2, 1000)

~\Anaconda3\lib\site-packages\simpsom-1.3.3-py3.7.egg\SimpSOM\__init__.py in train(self, startLearnRate, epochs)
213 inputVec = self.data[np.random.randint(0, self.data.shape[0]), :].reshape(np.array([self.data.shape[1]]))
214
--> 215 bmu=self.find_bmu(inputVec)
216
217 for node in self.nodeList:

~\Anaconda3\lib\site-packages\simpsom-1.3.3-py3.7.egg\SimpSOM\__init__.py in find_bmu(self, vec)
173 minVal=dist
174 bmu=node
--> 175 return bmu
176
177

UnboundLocalError: local variable 'bmu' referenced before assignment
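One way an error like this can arise is when no distance ever compares as smaller than the running minimum, for example because the input contains NaN values. The sketch below is an assumption, since the issue does not include the data and the traceback shows only the last lines of the loop; the function names and the `if dist < min_val` test are illustrative stand-ins.

```python
import math

def find_bmu(distances):
    """Return the index of the smallest distance (same shape as the loop in the traceback)."""
    min_val = float("inf")
    for idx, dist in enumerate(distances):
        if dist < min_val:  # always False when dist is NaN
            min_val = dist
            bmu = idx
    return bmu  # unbound if the comparison above never succeeded

def find_bmu_checked(distances):
    """Same search, but fail early with a clearer message."""
    if len(distances) == 0 or any(math.isnan(d) for d in distances):
        raise ValueError("distances are empty or contain NaN; check the input data")
    return find_bmu(distances)

try:
    find_bmu([float("nan"), float("nan")])
except UnboundLocalError as err:
    print("reproduced:", type(err).__name__)
```

If this is the cause, checking the training data for NaN or infinite values before calling train would be a reasonable first step.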

net.project() function is slow

Hey @fcomitani team,

I mainly use your library for clustering, but one function takes a very long time. My code is below:

x_train=df_kmeans.drop(columns=['LTV','sqrt_LTV']) 
net = sps.somNet(30, 30, x_train.values, PBC=True)
net.train(0.1, 20000)
prj=np.array(net.project(x_train.values))

This line (prj=np.array(net.project(x_train.values))) takes around 6-7 hours for around 7 million rows. Can you help me speed it up? My current system is an AWS instance with 32 GB of RAM and a 4-core CPU.
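For a dataset of this size, a vectorized best-matching-unit search in chunks is one possible approach. The sketch below does not use SimpSOM's API: `nearest_nodes` is a hypothetical helper, and it assumes the trained node weights are available as a NumPy array of shape (n_nodes, n_features) and that Euclidean distance is wanted.

```python
import numpy as np

def nearest_nodes(data, weights, chunk_size=10_000):
    """Return, for each row of data, the index of the closest row of weights (Euclidean)."""
    w_sq = np.einsum("ij,ij->i", weights, weights)  # squared norm of each node
    out = np.empty(data.shape[0], dtype=np.intp)
    for start in range(0, data.shape[0], chunk_size):
        chunk = data[start:start + chunk_size]
        # ||x - w||^2 = ||x||^2 - 2 x.w + ||w||^2; ||x||^2 is constant per row,
        # so it can be dropped without changing the argmin.
        scores = w_sq - 2.0 * chunk @ weights.T
        out[start:start + chunk_size] = np.argmin(scores, axis=1)
    return out

rng = np.random.default_rng(0)
data = rng.random((1000, 8))
weights = rng.random((30, 8))
idx = nearest_nodes(data, weights, chunk_size=256)
```

Each chunk only allocates a (chunk_size, n_nodes) score matrix, so memory use stays bounded regardless of the number of rows.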

Bug in learning rate?

Should the _update_learning_rate() in network.py be self.learning_rate = self.start_learning_rate * self.xp.exp(-n_iter / self.epochs) instead? Thanks!
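The schedule proposed in this issue is a plain exponential decay. As a small sketch of what that formula yields (the function name is hypothetical, and this does not claim to match what the library currently does):

```python
import math

def exponential_decay(start_learning_rate, n_iter, epochs):
    """Learning rate after n_iter iterations under the formula proposed in the issue."""
    return start_learning_rate * math.exp(-n_iter / epochs)

# Rate at the start, halfway through and at the end of a 100-epoch run.
schedule = [exponential_decay(0.01, n, 100) for n in (0, 50, 100)]
```

Under this formula the rate starts at start_learning_rate and falls to start_learning_rate / e after `epochs` iterations.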

Cannot locate raw_data or any detailed API.

I've installed the latest SimpSOM using pip and tried following the code presented on GitHub and in the API documentation (readthedocs). Unfortunately, raw_data isn't present after import SimpSOM, and I can't find any API documentation (e.g., descriptions, return values, argument values with datatypes) for each method/function. If raw_data is only a placeholder, then what is it: a list, dictionary, array, etc.? I would really like to learn more about this package. Is there documentation elsewhere?

Parallelization of the code

Hi, first, thanks for the code; I'm using it at work for clustering some data.
I'm interested in parallelizing the code, and I see that you have a "TODO" comment above the line "Parallel(n_jobs=self.n_jobs)(delayed(my_func)(c, K, N) for c in inputs)" in the train function. I suppose you have tried this or have some ideas about it. I wanted to know which function "my_func" is a placeholder for, and what the parameters c, K, N are, in order to get a better general idea of how to proceed with the parallelization.

Again thanks for the repository!

A small error in densitypeak.py

Near line 284, p.dists.iteritems should be changed to p.dists.items. Thank you very much for your code; it has helped me a lot.

How to reference this package

Hi, this code is pretty useful for quickly building a SOM on a preliminary data set. I like it a lot, as the package can also be easily modified to change the graphics for my particular use. I wanted to ask how your code should be properly cited in a paper. Is there a publication of yours that I should cite and, if not, what should I add to the acknowledgements section of my paper? Thanks!

Labels

How can I find the value of each data element in a hexagon of the SOM? Is there no function that shows the value? How can I do this? I need the values.

Time complexity

I have a question about time complexity. I ran some experiments and got O(n log n) using DBSCAN for clustering and O(n^2) using the Quality Threshold algorithm. I tried the first method on Jain, Flame, Compound and t4.8k, and the second method on MNIST. Can you explain the O(n log n) complexity? I think the time complexity for SOM is O(n^2). Thanks

Problem due to cyclical error?

Hi there!

I am trying to reproduce a simple example, but I am having some issues initializing SOMNet.

To reproduce

import simpsom as sps
import numpy as np

data = np.random.rand(20, 20)
net = sps.SOMNet(20, 20, data, topology='hexagonal',
                init='PCA', metric='cosine',
                neighborhood_fun='gaussian', PBC=True,
                random_seed=32, GPU=False, CUML=False,
                output_path="./")

and here is the error I get

Traceback (most recent call last):
  File "/home/lucas/mba/python/simpsom.py", line 1, in <module>
    import simpsom as sps
  File "/home/lucas/mba/python/simpsom.py", line 5, in <module>
    net = sps.SOMNet(20, 20, data, topology='hexagonal',
AttributeError: partially initialized module 'simpsom' has no attribute 'SOMNet' (most likely due to a circular import)
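Both frames in the traceback point at /home/lucas/mba/python/simpsom.py, which suggests that the script itself is named simpsom.py. In that case `import simpsom` resolves to the script rather than to the installed package, which would produce exactly this "partially initialized module" message. A small check for this kind of name clash might look like the following; `shadows_module` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

def shadows_module(script_path, module_name):
    """True if a script's file name (minus .py) equals the module it wants to import."""
    return Path(script_path).stem == module_name

clash = shadows_module("/home/lucas/mba/python/simpsom.py", "simpsom")
print(clash)  # True
```

Renaming the script (for example to run_som.py) is the usual remedy when such a clash is the cause.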

Cloning of repository fails due to too long filenames.

I tried to clone the simpsom repository, but I got errors because of overly long filenames in the tests/ground_truth/ folder.
I'm working with an encrypted home folder, for which the maximum filename length is 143 characters.

As a workaround, I forked the simpsom repository, cloned it to an unencrypted folder, deleted the ground_truth files and pushed it back to my fork. Then I was able to clone the repository to an encrypted folder.

Here are the errors when cloning:

me@comp:~/tmp$ git clone https://github.com/fcomitani/simpsom.git
Cloning into 'simpsom'...
remote: Enumerating objects: 1414, done.
remote: Counting objects: 100% (217/217), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 1414 (delta 154), reused 165 (delta 130), pack-reused 1197
Receiving objects: 100% (1414/1414), 34.45 MiB | 15.38 MiB/s, done.
Resolving deltas: 100% (733/733), done.
error: unable to create file tests/ground_truth/som_clusters_29793102587501614832761962817496381948754187354139510906297685347436709277089543135748392401212717654282462446534086383947023366030253119052733995242201873785323240739807453510.npy: File name too long
error: unable to create file tests/ground_truth/som_clusters_342008044776077809952382379377725389002657731782690276593972840565627274789710108763094526877518194116991217137317838936203404537258310.npy: File name too long
error: unable to create file tests/ground_truth/som_clusters_342008044776173808293820246848172717185750701891546008523394046430801193894526052825594273532638348990588789776175654503162559421243718.npy: File name too long
error: unable to create file tests/ground_truth/som_clusters_6990909199294242919332708782990481955635203273555594435047935561554494072389185531750095764725676280630374101219996805343174917845436987668653086546229359531757587021810123956550.npy: File name too long
error: unable to create file tests/ground_truth/som_clusters_80251701864081417831371931036709008182515405432296624554309851309220089166275703141148945731290811019842586449396221097154566387051290950.npy: File name too long
error: unable to create file tests/ground_truth/som_clusters_80251701864081513829713368904179455510698498402405480286239272515085263085380519085211445477945931174716184022035078912721525541935276358.npy: File name too long
error: unable to create file tests/ground_truth/som_projected_29793102587501614832761962817496381948754187354139510906297685347436709277089543135748392401212717654282462446534086383947023366030253119052733995242201873785323240739807453510.npy: File name too long
error: unable to create file tests/ground_truth/som_projected_342008044776077809952382379377725389002657731782690276593972840565627274789710108763094526877518194116991217137317838936203404537258310.npy: File name too long
error: unable to create file tests/ground_truth/som_projected_342008044776173808293820246848172717185750701891546008523394046430801193894526052825594273532638348990588789776175654503162559421243718.npy: File name too long
error: unable to create file tests/ground_truth/som_projected_6990909199294242919332708782990481955635203273555594435047935561554494072389185531750095764725676280630374101219996805343174917845436987668653086546229359531757587021810123956550.npy: File name too long
error: unable to create file tests/ground_truth/som_projected_80251701864081417831371931036709008182515405432296624554309851309220089166275703141148945731290811019842586449396221097154566387051290950.npy: File name too long
error: unable to create file tests/ground_truth/som_projected_80251701864081513829713368904179455510698498402405480286239272515085263085380519085211445477945931174716184022035078912721525541935276358.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_1224543790650657313213549472991040535904219846983620675164315761268632417486417738025024652371253533972999132316224697404960185610836.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_1224543790650657313213549472991040535904316229697346149481683466457230054787165196947898727118335156616273920126621428309896238752070.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_1224543790650657313213549472991040535904316229697346149481683466457230054787165196947898727118335234824610301771193443025591640809798.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_1335968924906787723998553143006053779321110122674337069737184616588407112905780651649079739801479596597171496942181683254327081984326.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_1335968924906787723998553143006053779321110122674337069737184616588407112905780651649079739801479674805507878586753697970022484042054.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_1346400136541297592615788464317873357020832176068851985856552095957119374614286985032559191109006013108952827969596804999741666077473546408059206.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_18685055399332539569298545425278328489749448348749094774846126728342169455610559414236305619044411867761030363303200024846688582.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_18685055399332539569298545425278328489749448348749094774846126728342169456274684723282236516895166603137453243205118281426690374.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_18685055399332539569298545425278328489750919032247099059558511844534599738236183038460742443149308498859269490750211552227123526.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_20385268019207576354958391464325771779568238790711341781361650770101963224748689817883955002660751240165972540340344160735556180.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_20544435677204858285763373635056853135622468717937993960553728025241249173361883988934768232047196198044068854409077774513251429182268531014.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_22413839222448957511672109767691453563286597887893850675787809544380889229156669641337366995873220311703611193242375012091195018934785696070.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_22413839222448957511672147934829851706974254468168255641494287828126447755230133213742654659999254827175997758192823410038977598027450835270.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_29793102587501614832761962817496381948754187354139510906297685347436709277089543135748392401212717654282462446534086383947023366030253119052733995242201873785323240739807453510.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_313483210406568272182668665085706377191480280827806892842064834884769898876522940934406311007040904697087777872953522535669807482429766.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_313483210406568272182668665085706377191480280827806892842064834884769898876522940934406311007040904775296114254598094550385502884487494.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_342008044776077809952382379377725389002657731782690276593972840565627274789710108763094526877518194116991217137317838936203404537258310.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_342008044776173808293820246848172717185750701891546008523394046430801193894526052825594273532638348990588789776175654503162559421243718.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_4783374182229130129740427628871252093376235272255258396412826040848554901512364050577729402805996705532320000494614954335532315220.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_5218628612917139546869348214867397575473086416696629178660877408548465284788205670504217733599529674207701159930397200212215296596.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_5218628612917139546869348214867397575569469130422103496028582597146102585535664593378292480681152317482488970327128105148268437830.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_5218628612917139546869348214867397575569469130422103496028582597146102585535664593378292480681230525818870614899142820843670495558.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_5259375533364443721155423650574554402718938031188775047500175620866119703749582709011776511528302778938456220334561858942689831440974428791110.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_5259375533364443721155423688741692800862625687769049452465882099149865262275656172584181799192428813453928606899512307340637614020067093930310.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_5737942840946933122988069871316442036985409143826399469508816258754238097057918953726770108704046100504160029419017113508415153431397775728966.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_6990909199294242919332708782990481955635203273555594435047935561554494072389185531750095764725676280630374101219996805343174917845436987668653086546229359531757587021810123956550.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_80251701864081417831371931036709008182515405432296624554309851309220089166275703141148945731290811019842586449396221097154566387051290950.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_80251701864081513829713368904179455510698498402405480286239272515085263085380519085211445477945931174716184022035078912721525541935276358.npy: File name too long
error: unable to create file tests/ground_truth/trained_som_87554059462691240279969178780044740483205306606426785458463137265706929141469772643990323717393234658316474950798540409350505658676175174.npy: File name too long
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

me@comp:~/tmp$

Nodes difference MNIST

Hi,

First of all, thank you for sharing the code. Could you share any code for the nodes-difference plot on MNIST? I am interested in the representation by class, like the last image in your examples. Another question: could you compute centroids for the different regions and then calculate the distance between each data point and its centroid? Finally, is it possible to use SimpSOM for prediction? It would be very nice to see where a new point would land. Thanks for everything!

Module Not Found error during import

Hello! Thank you for your work. I tried importing the latest version of simpsom (v2.0.0) and this error occurred: ModuleNotFoundError: No module named 'simpsom.cluster'. There seems to be a missing __init__.py file in "./simpsom/cluster/". I tried adding it locally and it did the trick for me.

Is diff_graph() an implementation of the U-matrix?

Hi Federico,

I looked into the code and the only difference between the U-matrix and your implementation of diff_graph() seems to be that the former takes an average of the distances with its neighbours while diff_graph() just takes the sum. Is diff_graph() an implementation of the U-matrix?
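The distinction raised here, summing versus averaging the distances to neighbouring nodes, can be illustrated with a short sketch. It is not SimpSOM's diff_graph(): the function below is hypothetical and assumes a rectangular grid with 4-connected neighbours, no periodic boundaries, at least two nodes, and weights stored as an array of shape (rows, cols, features).

```python
import numpy as np

def neighbour_distance_maps(weights):
    """Return (sum_map, mean_map) of weight distances to the 4-connected neighbours."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))

    # Distances between vertically adjacent nodes, credited to both nodes of each pair.
    d = np.linalg.norm(weights[1:] - weights[:-1], axis=2)
    total[1:] += d
    total[:-1] += d
    count[1:] += 1
    count[:-1] += 1

    # Distances between horizontally adjacent nodes.
    d = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    total[:, 1:] += d
    total[:, :-1] += d
    count[:, 1:] += 1
    count[:, :-1] += 1

    return total, total / count

# 3x3 grid with one feature equal to the column index.
weights = np.tile(np.arange(3.0), (3, 1))[:, :, None]
sum_map, mean_map = neighbour_distance_maps(weights)
```

On a grid where every node has the same number of neighbours (for example with periodic boundaries), the two maps differ only by a constant factor; without periodic boundaries, edge and corner nodes have fewer neighbours, so the sum and the mean are no longer proportional.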

run_colorsExample

Figures 2 and 3 of .run_colorsExample() do not appear. I also tried the .project function with another dataset and the result was the same. I am using Python 3.6.5 (64-bit), numpy=1.14.3, matplotlib=2.2.2, sklearn=0.20.1, and no errors appear either. Does anyone know what to do?

PyPI not using latest changes

Hi @fcomitani
I just want to let you know that the PyPI package does not include the latest changes, such as the printout typo fix.

MemoryError

Hi! I just started experimenting with your package for the analysis of some big datasets and have run into problems with memory allocation. For example, the MNIST tutorial is interrupted by the following error:

MemoryError: Unable to allocate 876. GiB for an array with shape (60000, 2500, 784) and data type float64

Creating a small ad hoc dataset, such as 2000 arrays of length 200, also gives me a similar error. I am running on 8 GB of RAM with an Intel(R) Core i5-6300U CPU.

Thanks in advance!
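The size in the error message follows directly from the array shape. As a rough check, reading the shape as (samples, nodes, features) is an assumption based on MNIST having 60000 images of 784 pixels; `array_gib` is a hypothetical helper.

```python
def array_gib(shape, itemsize=8):
    """Size in GiB of a dense array with the given shape (float64 by default)."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * itemsize / 2**30

full = array_gib((60000, 2500, 784))  # the array from the error message
pairwise = array_gib((60000, 2500))   # one distance per (sample, node) pair
```

The three-dimensional array needs about 876 GiB, matching the message, while a matrix holding one distance per sample and node needs only a little over 1 GiB, so computing distances without materializing the per-feature differences is what makes the difference here.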

pip version doesn't have colnames parameter in nodes_graph

Hi!

I just realized that the version I installed from pip (pip install SimpSOM) doesn't have the colnames parameter included but it's included in the repo code. I made the modifications on my local file, but just so you know that happens!

Thanks for your amazing work!

Predicting winning cell for data?

Once the SOM has been trained, is it possible to get the cell to which a data sample belongs? Or for each cell to get a list of samples from a dataset that belong to that cell? I haven't found anything like this in the example.
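If the index of the winning node is already known for each sample, grouping samples by cell is straightforward. The sketch below is not SimpSOM's API: `group_by_cell` is a hypothetical helper, and it assumes that nodes are numbered row by row on a grid of known width.

```python
from collections import defaultdict

def group_by_cell(bmu_indices, width):
    """Map each (column, row) cell to the list of sample indices that landed on it."""
    cells = defaultdict(list)
    for sample_idx, node_idx in enumerate(bmu_indices):
        cell = (node_idx % width, node_idx // width)  # assumes row-major node numbering
        cells[cell].append(sample_idx)
    return dict(cells)

# Four samples on a grid that is 5 nodes wide.
cells = group_by_cell([0, 5, 5, 19], width=5)
```

The same index arithmetic gives the two map coordinates of any single node, provided the numbering assumption holds for the map in question.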