larryhastings / correlate
A clever brute-force correlator for kinda-messy data
License: Other
Hi Larry, first, thank you for this great library. I've started evaluating it as a replacement for a homegrown algorithm I have been using to match TV show recordings to TV metadata.
Since my tool was working on a file-by-file basis against a reference list of titles from the TV metadata, it would have been easier to set up a single Correlator instance with the reference data as e.g. .dataset_b and then pass in the single file name as .dataset_a.
At the moment, correlate does not appear to provide a way to clear the data sets. It would be handy to have a .clear() or .reset() method, so that you can reuse a single instance in the above way by calling .dataset_a.reset() after having processed the match of the single file.
(Since I had wanted to try out the package, I ended up refactoring my code to work on batches, so the above is no longer relevant for my use case, but it would have made things a little easier to adapt.)
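The reuse pattern being requested could look something like the sketch below. Note that `Dataset` and `Correlator` here are minimal stand-ins written just to illustrate the desired `.reset()` semantics; they are not correlate's actual classes or API.

```python
class Dataset:
    """Minimal stand-in for a correlate dataset."""
    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values.setdefault(key, []).append(value)

    def reset(self):
        # The requested method: drop all values so the
        # same instance can be reused for the next file.
        self.values.clear()


class Correlator:
    """Minimal stand-in for a correlate Correlator."""
    def __init__(self):
        self.dataset_a = Dataset()
        self.dataset_b = Dataset()


# Reference titles are loaded once into dataset_b...
c = Correlator()
for title in ["Breaking News", "Evening Movie"]:
    c.dataset_b.set(title.lower(), title)

# ...then each recording is matched one at a time,
# resetting dataset_a between files instead of
# constructing a fresh Correlator.
for filename in ["breaking_news_2021.ts", "evening_movie.ts"]:
    c.dataset_a.set(filename.replace("_", " ").lower(), filename)
    # ... run the match for this one file here ...
    c.dataset_a.reset()
```

The point of the sketch is only the lifecycle: the reference dataset is built once, while the per-file dataset is cleared and refilled on each iteration.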
IIRC, correlate is optimized for speed over memory use. Most of the time this is fine. But early in correlation there's an extremely wasteful bit of code.
When iterating over all the keys to find possible matches, correlate adds every nonzero matching pair (value A from dataset_a, value B from dataset_b)
to an array. It doesn't test whether the pair has already been matched, so the same pair will often be added multiple times. I tried it both ways, and it was faster to add the redundant pairs, then sort and uniq the array; using a set and testing for "have we seen this pair before?" was ever-so-slightly slower.
As data sets get bigger and bigger, this memory-wasteful approach will gobble more and more memory. It seems like correlate should use the ever-so-slightly slower, but much less memory-intensive, "have we added this pair already?" approach.
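The two strategies can be sketched generically; plain tuples stand in here for correlate's matched pairs, and this is not the library's actual internal code.

```python
import itertools

# The same (A, B) pair can be produced by several shared keys,
# so the raw stream of candidate pairs contains duplicates.
candidate_pairs = [("a", "A"), ("b", "B"), ("a", "A"), ("c", "C"), ("a", "A")]

# Strategy 1 (current): append everything, then sort and uniq.
# Faster per pair, but stores every duplicate until the dedup pass.
sorted_pairs = sorted(candidate_pairs)
unique_pairs = [pair for pair, _ in itertools.groupby(sorted_pairs)]

# Strategy 2 (proposed): test membership as you go.
# Slightly slower per pair, but duplicates are never stored,
# so memory stays proportional to the number of unique pairs.
seen = set()
deduped = []
for pair in candidate_pairs:
    if pair not in seen:
        seen.add(pair)
        deduped.append(pair)
```

Both produce the same unique pairs; the difference is that strategy 1's peak memory grows with the total number of matches (duplicates included), while strategy 2's grows only with the number of distinct pairs.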
It is often important to set a minimum_score in order to reduce the number of false positive matches. At the moment, this is hard to do reliably, since minimum_score works on the unnormalized score values, which are hard to guess without trying out a few match cases on your actual data.
It would be good to have a way to say "please work with normalized scores", so that minimum_score could be given as a value between 0-1, regardless of how the scores pan out. A value of e.g. 10% (=0.1) would then mean "drop matches with a confidence value of less than 10%".
I'm not sure whether this is easily possible with the logic used by the package, but it would make correlate more usable with yet-unknown data distributions.
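As a post-hoc workaround, normalization can be done on the caller's side. This sketch assumes a hypothetical `matches` list of `(value_a, value_b, raw_score)` tuples (not correlate's actual return shape) and scales each raw score against the best score in the batch:

```python
def filter_normalized(matches, minimum_score=0.1):
    """Keep matches whose score is at least `minimum_score` (0-1)
    of the highest raw score in this batch, and return them with
    normalized scores."""
    if not matches:
        return []
    top = max(score for _, _, score in matches)
    return [(a, b, score / top)
            for a, b, score in matches
            if score / top >= minimum_score]


# Hypothetical example data: raw scores on an arbitrary scale.
matches = [
    ("ep01.ts", "Episode 1", 42.0),
    ("ep02.ts", "Episode 2", 37.5),
    ("noise.ts", "Episode 9", 2.1),   # 2.1 / 42.0 = 5% -> dropped at 10%
]

kept = filter_normalized(matches, minimum_score=0.1)
```

This only normalizes within one batch of results, so it is not a full substitute for the feature request, but it shows the intended semantics: a 0.1 threshold means "drop matches scoring below 10% of the best match."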