Comments (8)

RobinL commented on September 27, 2024

Yeah, agreed, it's definitely the right solution. The problem is that it's quite a big job, because all the visualisations and dashboards expect data in the current format.

from splink.

RobinL commented on September 27, 2024

Yeah, I agree. I think (but I'm not sure, I haven't thought about it that hard) that this is the same as the infinity protection, which would lead me to think the case statement would be the best option (just for symmetry).

I have wondered before whether there's a third option (which could also potentially deal with the infinity case statements): having this logic on the Python side rather than in the SQL.

e.g. could we clamp the value of the bf_ to between some values (e.g. between 1e-100 and 1e100 or something), and issue a warning?
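
A minimal sketch of that clamping idea on the Python side, assuming the bounds floated above. `clamp_bayes_factor` and `match_weight` are hypothetical helper names for illustration, not part of Splink's API:

```python
import math
import warnings

# Hypothetical bounds, taken from the "1e-100, 1e100 or something" suggestion.
BF_MIN = 1e-100
BF_MAX = 1e100


def clamp_bayes_factor(bf, lower=BF_MIN, upper=BF_MAX):
    """Clamp a Bayes factor into [lower, upper], warning if it was changed."""
    clamped = min(max(bf, lower), upper)
    if clamped != bf:
        warnings.warn(
            f"Bayes factor {bf!r} clamped to {clamped!r}", RuntimeWarning
        )
    return clamped


def match_weight(bf):
    """log2 of the clamped Bayes factor, so bf = 0 or bf = inf stays finite."""
    return math.log2(clamp_bayes_factor(bf))
```

With these bounds a Bayes factor of exactly 0 no longer hits log(0), and an infinite one no longer produces an infinite match weight. The warning makes it visible that a value was altered.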

Another thing I've wondered a bit about is whether we should move to the bf_ being match weights in the SQL, which would then be additive. That would at least (possibly?) avoid the floating point issue.
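
To illustrate why additive match weights might help, here is a small sketch in plain Python rather than SQL. The three Bayes factors are made-up, deliberately extreme values, not taken from a real model:

```python
import math

# Illustrative, deliberately extreme Bayes factors (not from a real model).
bayes_factors = [1e-200, 1e-200, 1e150]

# Multiplicative combination: the intermediate product 1e-400 is below the
# smallest representable double, so it underflows to 0.0 and stays there.
product = 1.0
for bf in bayes_factors:
    product *= bf

# Additive combination: sum the log2 match weights instead.
total_weight = sum(math.log2(bf) for bf in bayes_factors)

print(product)       # 0.0, so taking log2 of it would raise ValueError
print(total_weight)  # a finite negative number, log2 of the true 1e-250
```

The true product, 1e-250, is representable as a double, but the multiplicative route loses it because of the order in which the factors are combined. The additive route does not have that problem.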

vfrank66 commented on September 27, 2024

Hey, I'm attempting to test out this package and I am receiving this error. I can see the columns that make up the failing function call in the SQL, but what exactly would be the way to research and address this problem?

This is occurring on my first test dataset, so I do not know how to address it, or whether it is caused by incorrect column profiling.

JohnHenningsen commented on September 27, 2024

Hi there, I am also running into this issue, using the duckdb backend. Does anyone know a workaround to fix this issue, at least for the moment? I understand where the issue comes from based on the discussion above, but do not know where to start with the suggested solutions, e.g. adding a tiny delta to prevent log(0).

vfrank66 commented on September 27, 2024

A simple workaround is to duplicate a single row with a new unique id: just one row that is an exact match to another. For me that was enough to fix it.
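
A rough sketch of that workaround in plain Python, assuming the records are a list of dicts with an integer id column called `unique_id`. Both the helper name `add_exact_duplicate` and the column name are assumptions for illustration; adapt them to your own data:

```python
import copy


def add_exact_duplicate(records, id_column="unique_id"):
    """Return a new list with a copy of the first record appended under a fresh id.

    Assumes integer ids; the input list is left unmodified.
    """
    if not records:
        raise ValueError("need at least one record to duplicate")
    duplicate = copy.deepcopy(records[0])
    # New id is one larger than any existing id, so it stays unique.
    duplicate[id_column] = max(r[id_column] for r in records) + 1
    return records + [duplicate]


records = [
    {"unique_id": 1, "description": "blue cotton shirt"},
    {"unique_id": 2, "description": "red wool jumper"},
]
records_with_dupe = add_exact_duplicate(records)
```

The extra row guarantees at least one exact match in the data. Remember to drop it again from any final results.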

RobinL commented on September 27, 2024

It'd be great if someone could find a reprex for this issue. I've not actually encountered it myself. I'm not doubting it exists; I suspect it happens with data of a certain type that we don't usually encounter, possibly such as certain values having no dupes (as vfrank66 alludes to).

If not a reprex, @JohnHenningsen are you able to post a screenshot of the match weight charts? It's possible that would provide some insights...

In any case, we should hopefully be able to get round to fixing it fairly soon, once Splink 4 is released (which has been absorbing most of our time for some months now).

NickCrews commented on September 27, 2024

@RobinL I'm just re-reading your original response, and yes, I think we should switch to combining match weights additively, otherwise I'm pretty sure we will run into floating point errors. So that might make this whole thing moot?

JohnHenningsen commented on September 27, 2024

Thanks for the helpful suggestions everyone! Unfortunately our cluster is down at the moment but I will try the simple workaround and help reproduce this issue as soon as possible.

To give a bit of context, aside from a few columns of categorical data we are relying on a product description column to match records. That column contains a 3-10 word string, and we came up with some custom comparisons based on array_intersect. It is quite likely that there are no exact matches, as we have high variance in the data entry for this column.
