
correlate's People

Contributors

larryhastings avatar


Forkers

takluyver

correlate's Issues

Add a way to easily reuse Correlator instances

Hi Larry, first of all, thank you for this great library. I've started evaluating it as a replacement for a homegrown algorithm I have been using to match TV show recordings to TV metadata.

Since my tool was working on a file-by-file basis against a reference list of titles from the TV metadata, it would have been easier to set up a single Correlator instance with the reference data as, e.g., .dataset_b and then pass in the single file name as .dataset_a.

At the moment, correlate does not appear to provide a way to clear the datasets. It would be handy to have a .clear() or .reset() method, so that a single instance can be reused in the above way by calling .dataset_a.reset() after processing the match for each file.

(Since I had wanted to try out the package, I ended up refactoring my code to work on batches, so the above is no longer relevant for my use case, but it would have made things a little easier to adapt.)
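The proposed workflow could look something like the following. This is only an illustrative sketch using a toy stand-in class; the `Dataset` class and its `reset()` method here are hypothetical and not part of correlate's current API:

```python
class Dataset:
    """Toy stand-in for correlate's dataset object, used only to
    illustrate the proposed reset() semantics."""

    def __init__(self):
        # map each key to a list of (value, weight) pairs
        self._keys = {}

    def set(self, key, value, weight=1):
        self._keys.setdefault(key, []).append((value, weight))

    def reset(self):
        # proposed method: drop all keys and values so the same
        # instance can be refilled for the next match run
        self._keys.clear()

    def __len__(self):
        return len(self._keys)


# the reference data would stay in dataset_b; dataset_a is
# refilled and reset once per recording file
dataset_a = Dataset()
for filename in ["show.s01e01.mkv", "show.s01e02.mkv"]:
    dataset_a.set(filename, filename)
    # ... correlate against the persistent reference dataset here ...
    dataset_a.reset()  # reuse the same instance for the next file

assert len(dataset_a) == 0
```

The point of the sketch is that only the per-file dataset is cleared between runs, so the (potentially large) reference dataset never has to be rebuilt.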

Reduce correlate memory use during correlation

IIRC, correlate is optimized for speed over memory use. Most of the time this is fine, but early in correlation there's an extremely wasteful bit of code.

When iterating over all the keys to find possible matches, correlate adds every nonzero matching pair (value A from dataset_a, value B from dataset_b) to an array. It doesn't test to see whether the pair has already been matched, so the same pair will often be added multiple times. I tried it both ways, and it was faster to add the redundant pairs, then sort and uniq the array. Using a set and testing for "have we seen this pair before?" was ever-so-slightly slower.

As datasets get larger and larger, this memory-wasteful approach will gobble more and more memory. It seems like correlate should use the ever-so-slightly slower, but much less memory-intensive, "have we added this pair already?" approach.
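The tradeoff between the two strategies can be sketched in plain Python. This is an illustration of the idea, not correlate's actual code, and the key-to-values dict shape is an assumption:

```python
import itertools

def matches_sort_uniq(keys_a, keys_b):
    """Current strategy: append every matching pair, duplicates and
    all, then sort and deduplicate once at the end. Fast, but the
    intermediate list can grow far larger than the set of unique
    pairs when many keys map to the same values."""
    pairs = []
    for key in keys_a:
        if key in keys_b:
            for a, b in itertools.product(keys_a[key], keys_b[key]):
                pairs.append((a, b))  # redundant pairs are added freely
    pairs.sort()
    # keep one representative of each run of equal pairs
    return [pair for pair, _ in itertools.groupby(pairs)]

def matches_dedup_set(keys_a, keys_b):
    """Proposed strategy: test 'have we seen this pair already?' via
    a set. Slightly slower per pair, but peak memory tracks the
    number of unique pairs rather than the number of raw matches."""
    seen = set()
    for key in keys_a:
        if key in keys_b:
            for a, b in itertools.product(keys_a[key], keys_b[key]):
                seen.add((a, b))
    return sorted(seen)

# two shared keys map to the same value pair -> one redundant append
keys_a = {"the": ["A1"], "office": ["A1"]}
keys_b = {"the": ["B1"], "office": ["B1"]}
assert matches_sort_uniq(keys_a, keys_b) == matches_dedup_set(keys_a, keys_b)
```

Both functions return the same unique pairs; the difference is purely in how much transient memory the first one holds before the final sort-and-uniq pass.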

Make it possible to work with normalized score values right from the start

It is often important to set a minimum_score in order to reduce the number of false positive matches. At the moment this is difficult, since minimum_score operates on the unnormalized score values, which are hard to guess without first trying out a few match cases on your actual data.

It would be good to have a way to say "please work with normalized scores", so that minimum_score could be given as a value between 0 and 1, regardless of how the raw scores pan out. A value of e.g. 10% (= 0.1) would then mean "drop matches with a confidence value of less than 10%".

I'm not sure whether this is easily possible with the logic used by the package, but it would make correlate more usable with as-yet-unknown data distributions.
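One way the request could work is sketched below over a toy list of raw match scores. The normalization scheme here, dividing by the best raw score in the batch, is just one plausible choice, not anything correlate currently does:

```python
def filter_by_normalized_score(matches, minimum_normalized_score):
    """matches: list of (item, raw_score) pairs.

    Normalize raw scores to the 0-1 range against the best score in
    this batch, then drop anything below the caller's normalized
    cutoff. The cutoff is meaningful without knowing the raw score
    distribution in advance."""
    if not matches:
        return []
    top = max(score for _, score in matches)
    return [
        (item, score / top)
        for item, score in matches
        if score / top >= minimum_normalized_score
    ]

raw = [("s01e01", 42.0), ("s01e02", 21.0), ("noise", 2.1)]
# "drop matches with a confidence value of less than 10%"
kept = filter_by_normalized_score(raw, 0.1)
assert [item for item, _ in kept] == ["s01e01", "s01e02"]
```

A caveat of batch-relative normalization is that the best match always scores 1.0 even when it is a poor match in absolute terms, which may be why this is not trivial to fold into the package's existing scoring logic.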
