larryhastings / correlate
A clever brute-force correlator for kinda-messy data
License: Other
Hi Larry, first, thank you for this great library. I've started evaluating it as a replacement for a homegrown algorithm I have been using to match TV show recordings to TV metadata.
Since my tool was working on a file-by-file basis against a reference list of titles from the TV metadata, it would have been easier to set up a single Correlator instance with the reference data as e.g. .dataset_b and then pass in the single file name as .dataset_a.
At the moment, correlate does not appear to provide a way to clear the data sets. It would be handy to have a .clear() or .reset() method, so that you can reuse a single instance in the above way by calling .dataset_a.reset() after having processed the match of the single file.
(Since I had wanted to try out the package, I ended up refactoring my code to work on batches, so the above is no longer relevant for my use case, but it would have made things a little easier to adapt.)
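The reuse pattern being requested could look something like the sketch below. Note that `Dataset` and `Correlator` here are minimal stand-ins written just to illustrate the desired `.reset()` semantics; they are not correlate's actual classes or API.

```python
class Dataset:
    """Minimal stand-in for a correlate dataset."""
    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values.setdefault(key, []).append(value)

    def reset(self):
        # The requested method: drop all values so the
        # same instance can be reused for the next file.
        self.values.clear()


class Correlator:
    """Minimal stand-in for a correlate Correlator."""
    def __init__(self):
        self.dataset_a = Dataset()
        self.dataset_b = Dataset()


# Reference titles are loaded once into dataset_b...
c = Correlator()
for title in ["Breaking News", "Evening Movie"]:
    c.dataset_b.set(title.lower(), title)

# ...then each recording is matched one at a time,
# resetting dataset_a between files instead of
# constructing a fresh Correlator.
for filename in ["breaking_news_2021.ts", "evening_movie.ts"]:
    c.dataset_a.set(filename.replace("_", " ").lower(), filename)
    # ... run the match for this one file here ...
    c.dataset_a.reset()
```

The point of the sketch is only the lifecycle: the reference dataset is built once, while the per-file dataset is cleared and refilled on each iteration.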
IIRC, correlate is optimized for speed over memory use. Most of the time this is fine. But early in correlation there's an extremely wasteful bit of code.
When iterating over all the keys to find possible matches, correlate adds every nonzero matching pair (value A from dataset_a, value B from dataset_b)
to an array. It doesn't test whether the pair has already been matched, so the same pair will often be added multiple times. I tried it both ways, and it was faster to add the redundant pairs, then sort and uniq the array; using a set and testing for "have we seen this pair before?" was ever-so-slightly slower.
As data sets get bigger and bigger, this memory-wasteful approach will gobble more and more memory. It seems like correlate should use the ever-so-slightly slower, but much less memory-intensive, "have we added this pair already?" approach.
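The two strategies can be sketched generically; plain tuples stand in here for correlate's matched pairs, and this is not the library's actual internal code.

```python
import itertools

# The same (A, B) pair can be produced by several shared keys,
# so the raw stream of candidate pairs contains duplicates.
candidate_pairs = [("a", "A"), ("b", "B"), ("a", "A"), ("c", "C"), ("a", "A")]

# Strategy 1 (current): append everything, then sort and uniq.
# Faster per pair, but stores every duplicate until the dedup pass.
sorted_pairs = sorted(candidate_pairs)
unique_pairs = [pair for pair, _ in itertools.groupby(sorted_pairs)]

# Strategy 2 (proposed): test membership as you go.
# Slightly slower per pair, but duplicates are never stored,
# so memory stays proportional to the number of unique pairs.
seen = set()
deduped = []
for pair in candidate_pairs:
    if pair not in seen:
        seen.add(pair)
        deduped.append(pair)
```

Both produce the same unique pairs; the difference is that strategy 1's peak memory grows with the total number of matches (duplicates included), while strategy 2's grows only with the number of distinct pairs.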
It is often important to set a minimum_score in order to reduce the number of false positive matches. At the moment, this is hard to do reliably, since minimum_score works on the unnormalized score values, which are hard to guess without trying out a few match cases on your actual data.
It would be good to have a way to say "please work with normalized scores", so that minimum_score could be given as a value between 0-1, regardless of how the scores pan out. A value of e.g. 10% (=0.1) would then mean "drop matches with a confidence value of less than 10%".
I'm not sure whether this is easily possible with the logic used by the package, but it would make correlate more usable with yet-unknown data distributions.
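As a post-hoc workaround, normalization can be done on the caller's side. This sketch assumes a hypothetical `matches` list of `(value_a, value_b, raw_score)` tuples (not correlate's actual return shape) and scales each raw score against the best score in the batch:

```python
def filter_normalized(matches, minimum_score=0.1):
    """Keep matches whose score is at least `minimum_score` (0-1)
    of the highest raw score in this batch, and return them with
    normalized scores."""
    if not matches:
        return []
    top = max(score for _, _, score in matches)
    return [(a, b, score / top)
            for a, b, score in matches
            if score / top >= minimum_score]


# Hypothetical example data: raw scores on an arbitrary scale.
matches = [
    ("ep01.ts", "Episode 1", 42.0),
    ("ep02.ts", "Episode 2", 37.5),
    ("noise.ts", "Episode 9", 2.1),   # 2.1 / 42.0 = 5% -> dropped at 10%
]

kept = filter_normalized(matches, minimum_score=0.1)
```

This only normalizes within one batch of results, so it is not a full substitute for the feature request, but it shows the intended semantics: a 0.1 threshold means "drop matches scoring below 10% of the best match."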