Comments (8)
Yeah, agree, it's definitely the right solution. The problem is that it's quite a big job, because all the visualisations and dashboards expect data in the current format.
from splink.
Yeah, I agree. I think (though I'm not sure, I haven't thought about it that hard) that this is the same as the infinity protection, which would lead me to think the case statement would be the best option (just for symmetry).
I have wondered before whether there's a third option (which could potentially also deal with the infinity case statements) of having this logic on the Python side rather than in the SQL.
e.g. could we clamp the value of the bf_ to between some values (e.g. between 1e-100 and 1e100 or something), and issue a warning?
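As a rough illustration of the clamping idea (a sketch only: the function name and the bounds are hypothetical, taken from the "1e-100, 1e100 or something" suggestion, and none of this is Splink's actual code):

```python
import warnings

# Hypothetical bounds for illustration; not values used by Splink.
BF_MIN = 1e-100
BF_MAX = 1e100


def clamp_bayes_factor(bf: float, lower: float = BF_MIN, upper: float = BF_MAX) -> float:
    """Return bf limited to the range [lower, upper], warning if it was changed."""
    clamped = min(max(bf, lower), upper)
    if clamped != bf:
        warnings.warn(
            f"Bayes factor {bf!r} is outside [{lower}, {upper}]; using {clamped!r}",
            RuntimeWarning,
            stacklevel=2,
        )
    return clamped
```

With this, a Bayes factor of 0 or infinity would be replaced by the nearest bound before any log is taken, and the warning tells the user that a level was adjusted.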
Another thing I've wondered a bit about is whether we should move to the bf_ being match weights in the SQL, which would then be additive. That would at least avoid (possibly?) the floating point issue.
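To make the floating point point concrete, here is a small standalone sketch. The Bayes factors are made up for illustration, and the match weight is taken to be log2 of the Bayes factor:

```python
import math

# Made-up Bayes factors for illustration: 40 comparisons, each with bf = 1e-10.
bayes_factors = [1e-10] * 40

# Multiplying underflows: the true product is 1e-400, which is smaller than
# the smallest positive double, so the result collapses to 0.0.
product = math.prod(bayes_factors)

# Summing match weights (log2 of each Bayes factor) stays comfortably in range.
total_match_weight = sum(math.log2(bf) for bf in bayes_factors)

print(product)
print(total_match_weight)
```

Note that this only helps with the product under- or overflowing. A single level whose Bayes factor is exactly 0 would still have an undefined log2, so that case would still need its own handling.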
from splink.
Hey, I'm attempting to test out this package and I am receiving this error. I can see the columns involved in the failing function call in the SQL, but how exactly should I go about researching and addressing this problem?
This is occurring on my first test dataset, so I do not know how to address it, or whether it is down to incorrect column profiling.
from splink.
Hi there, I am also running into this issue, using the duckdb backend. Does anyone know a workaround to fix this issue, at least for the moment? I understand where the issue comes from based on the discussion above, but do not know where to start with the suggested solutions, e.g. adding a tiny delta to prevent log(0).
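For anyone wondering what "adding a tiny delta to prevent log(0)" means in practice, here is a minimal standalone sketch in plain Python. The function name and the delta value are hypothetical, and this is not a patch for Splink itself:

```python
import math

# Hypothetical delta for illustration; not a value used by Splink.
DELTA = 1e-12


def log2_with_delta(p: float, delta: float = DELTA) -> float:
    """log2 of a probability, shifted by a small delta so that p == 0 is allowed."""
    return math.log2(p + delta)


try:
    math.log2(0.0)  # the unprotected call fails with a ValueError
except ValueError as err:
    print("unprotected:", err)

print("protected:", log2_with_delta(0.0))
```

The delta slightly biases every value, so it needs to be small relative to the probabilities that actually occur.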
from splink.
A simple workaround is to duplicate a single row with a new unique id. Just one row that is an exact match to another. For me that was enough to fix it.
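A minimal sketch of that workaround, assuming the records are held as a list of dicts with integer ids in a column called `unique_id` (the helper name, the column name and the sample data are all assumptions for illustration):

```python
def add_exact_duplicate(records, id_column="unique_id"):
    """Return a new list in which the first record is repeated under a fresh id.

    Assumes the ids are integers, so max(id) + 1 is guaranteed to be unused.
    """
    if not records:
        raise ValueError("need at least one record to duplicate")
    new_id = max(r[id_column] for r in records) + 1
    duplicate = dict(records[0])  # shallow copy, so the original row is untouched
    duplicate[id_column] = new_id
    return records + [duplicate]


# Made-up sample data.
rows = [
    {"unique_id": 1, "description": "blue cotton shirt"},
    {"unique_id": 2, "description": "red wool jumper"},
]
rows_with_dupe = add_exact_duplicate(rows)
```

The extra row is an artificial exact match, so it will presumably appear as a match in the output and should be filtered out afterwards.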
from splink.
It'd be great if someone could find a reprex for this issue. I've not actually encountered it myself. I'm not doubting it exists - I suspect it happens with data of a certain type that we don't usually encounter, possibly where certain values have no dupes (as vfrank66 alludes to).
If not a reprex, @JohnHenningsen are you able to post a screenshot of the match weight charts? It's possible that would provide some insights...
In any case, we should hopefully be able to get round to fixing it fairly soon, once Splink 4 is released (which has been absorbing most of our time for some months now).
from splink.
@RobinL I'm just re-reading your original response, and yes, I think we should switch to combining match weights additively, otherwise I'm pretty sure we will run into floating point errors. So that might make this whole thing moot?
from splink.
Thanks for the helpful suggestions everyone! Unfortunately our cluster is down at the moment but I will try the simple workaround and help reproduce this issue as soon as possible.
To give a bit of context, aside from a few columns of categorical data we are relying on a product description column to match records. That column contains a 3-10 word string, and we came up with some custom comparisons based on array_intersect. It is quite likely that there are no exact matches, as we have high variance in the data entry for this column.
from splink.
Related Issues (20)
- Can't create SQLiteAPI with `register_udfs=False`
- Zero trained m-values can lead to `math domain error`
- `NaN` trained values can break `predict()` HOT 1
- Add option for Input Table with Athena Linker connection
- Splink install failing due to `splink_datasets` `PermissionError` HOT 1
- Docs build failing HOT 1
- High memory usage of `linker.evaluation.prediction_errors_from_labels_table` HOT 2
- Possible bug in `estimate_u_using_random_sampling` for Spark backend HOT 1
- Issues in a readonly filesystem
- Document `DatabaseAPI`
- [FEAT] Allow generation of all visualizations as a dict
- [feat] allow max rows for em training
- Avoid casting `Infinity` to double for Spark backend HOT 4
- Splink datasets revamp
- Avoid casting `Infinity` to double for Spark backend HOT 1
- [FEAT] A percentage threshold for array intersection HOT 4
- [FEAT] Improve ExplodingBlockingRule performance HOT 2
- [Athena] calling `invalidate_cache()` results in "not a folder created by Splink" error even when folder was created by Splink HOT 2
- [FEAT] Interaction term between two correlated comparisons HOT 2
- `predict()` fails with threshold probability 0 HOT 3