
stephantul / somber


Recursive Self-Organizing Map/Neural Gas.

License: MIT License

Python 100.00%
kohonen som recurrent-neural-networks unsupervised machine-learning recsom neural-gas ng plsom cython

somber's Introduction

Hi there 👋

I'm Stéphan Tulkens! I'm a computational linguistics/AI person. I currently work as a machine learning engineer/NLP scientist at Metamaze, where I use transformers and generative AI models to automate document processing.

I got my PhD at CLiPS at the University of Antwerp under the watchful eyes of Walter Daelemans (computational linguistics) and Dominiek Sandra (psycholinguistics). The topic of my PhD was how people process orthography during reading. You can find a copy here. Before that, I studied computational linguistics (MA), philosophy (BA), and software engineering (BA).

My goal is always to make things as fast and small as possible. I like it when simple models work well, and I love it when simple models get close in accuracy to big models. I do not believe absolute accuracy is a metric to be chased, and I think we should always be mindful of what a model computes or learns from the data.

I'm currently working on 🏃‍♂️:

  • reach: a library for loading and working with word embeddings.
  • piecelearn: a library that trains a subword tokenizer and embeddings on the same corpus, giving you open vocabulary embeddings.
  • unitoken: a library for easy pre-tokenization.
  • hashing_split: a library for hash-based data splits (stable splits!).

Other stuff I made (most of it from my PhD) 🍕:

  • wordkit: a library for working with orthography.
  • old20: calculate the Orthographic Levenshtein Distance 20 (OLD20) metric.
  • metameric: fast interactive activation networks in NumPy.
  • humumls: load the UMLS database into a MongoDB instance. Fast!
  • dutchembeddings: word embeddings for Dutch (back when this was a cool thing to do).

My research interests 🤖:

  • Tokenizers, specifically subword tokenizers.
  • Embeddings, specifically static embeddings (so old-fashioned! 💀), and how to combine these in meaningful ways.
  • String similarity, and how to compute it without using dynamic programming.


somber's People

Contributors

dependabot-preview[bot]


somber's Issues

Add PLSOM2

The PLSOM2 is an interesting candidate for addition to SOMBER for the following reasons:

  • We already have PLSOM.
  • It is supposed to improve on PLSOM by making it less sensitive to outliers.

Adding it would require changing the update routine and the parameter update routine in the PLSOM.

Batching for recurrent SOMs

All recurrent SOMs would benefit from some kind of batching scheme. I currently have batched code, but it doesn't work. My hunch is that batching the recurrent/recursive/merge SOMs is non-trivial because there is no way of making sure the concurrently processed batches converge on the same BMU.

Because we take the mean over each batch, updates will generally become noisier as the batch size increases.

Two starting ideas:

  1. A possible solution could be to initialize the context weights and weights of the SOM to some value which allows for convergence (e.g., the weights are skewed in such a way that the concurrently processed batches all move in the same direction), but this entails setting the parameters in advance, which kind of defeats the purpose of the SOM.

  2. A second solution would be to start with a small batch size, allow the SOM to converge to something stable first, and then slowly increase the batch size.
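The second idea could be sketched as a simple batch-size schedule. The geometric growth factor and the cap below are illustrative assumptions, not values from SOMBER:

```python
def batch_size_schedule(num_epochs, start=1, growth=2.0, max_batch=64):
    """Yield one batch size per epoch, growing geometrically up to max_batch.

    Starting at 1 means the first epochs are plain sequential training,
    giving the SOM a chance to stabilize before batching kicks in.
    """
    size = float(start)
    for _ in range(num_epochs):
        yield min(int(size), max_batch)
        size *= growth

print(list(batch_size_schedule(8)))  # [1, 2, 4, 8, 16, 32, 64, 64]
```

A linear or epoch-triggered schedule would work just as well; the point is only that early epochs see small batches.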

Add Gamma SOM

The Merge SOM is working, so we could perhaps implement the Gamma SOM, which uses a generalization of the Merge SOM context function (see here; behind a paywall, unfortunately).

The Merge SOM Best Matching Unit (BMU) calculation takes the BMU at the previous time step into account, together with its context weight, as follows:

 context = (1 - self.beta) * self.weights[prev_bmu] + self.beta * self.context_weights[prev_bmu]

As the context weights reflect the previous activation, you can use a Merge SOM to "step back in time".
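As a minimal sketch of how this context could enter the BMU search, the distance to each unit can mix input distance and context distance; the function and parameter names (alpha in particular) are assumptions for illustration, not SOMBER's actual API:

```python
import numpy as np

def merge_bmu(x, prev_bmu, weights, context_weights, alpha=0.5, beta=0.5):
    """Return the BMU for input x, given the BMU at the previous time step.

    The context is the beta-weighted mix from the snippet above; alpha then
    trades off input distance against context distance per unit.
    """
    context = (1 - beta) * weights[prev_bmu] + beta * context_weights[prev_bmu]
    dist = (alpha * np.linalg.norm(weights - x, axis=1)
            + (1 - alpha) * np.linalg.norm(context_weights - context, axis=1))
    return int(np.argmin(dist))
```

With all-zero context weights the context term is constant across units, so the function degenerates to a plain nearest-weight BMU search.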

Generalizing over this, the Gamma SOM defines a BMU function which explicitly takes the k previous BMUs into account (so the Merge SOM can be defined as a Gamma SOM with k = 1). This makes the Gamma SOM computationally more expensive than the Merge SOM, but also more powerful.
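A hedged sketch of that generalization: one context vector per previous time step, with a separate context weight matrix per step. The per-step matrices are an assumption based on the description above, not SOMBER's implementation; with k = 1 this reduces to the Merge SOM context:

```python
import numpy as np

def gamma_contexts(prev_bmus, weights, step_context_weights, beta=0.5):
    """Compute one context vector per previous BMU.

    prev_bmus[0] is the most recent BMU; step_context_weights[t] is the
    context weight matrix for the BMU t + 1 steps back.
    """
    return [
        (1 - beta) * weights[bmu] + beta * ctx_w[bmu]
        for bmu, ctx_w in zip(prev_bmus, step_context_weights)
    ]
```

A Gamma-style BMU search would then add one distance term per context vector, which is where the extra cost relative to the Merge SOM comes from.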

Adding the Gamma SOM will require relatively few changes, but I don't have any need for it currently.
