Note I now think the below discussion is inaccurate and it has been supers


Add support for term frequency about splink HOT 6 CLOSED

RobinL commented on June 6, 2024

Add support for term frequency

from splink.

Comments (6)

RobinL commented on June 6, 2024

We will want a generic version of this, i.e. a version that allows us to make a term-frequency adjustment for n named columns, e.g. make the adjustment for surname AND first name.

One way to do this would be to use Bayes rule:
https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering#Combining_individual_probabilities

Note that the 'generic lambda' is inherent in each 'specific' lambda, so in the case where we want an adjustment for surname AND first name we cannot simply insert the 'specific lambdas' for surname and first name into the above formula.

Instead we need to 'remove' the original lambda.

e.g. if:
generic lambda = 0.7
surname specific lambda = 0.8
first name specific lambda = 0.9

we need p1 = 0.7, p2 = 0.8, p3 = 0.9, and then p4 = 0.3 and p5 = 0.3 (i.e. 1 - generic lambda)

where p4 and p5 'undo' the two 'additional' generic lambdas inherent in p2 and p3. Otherwise we have counted the 'generic lambda' three times rather than once.

Note this allows for a formulation which is general, in the sense that it can be computed for records where there's no match, where only surname or only first name matches, and where both surname and first name match.
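A minimal sketch of the combination described above, using the naive Bayes formula from the linked Wikipedia section. The function name `combine` and the variable names are assumptions made for illustration; they are not taken from the splink codebase.

```python
from math import prod


def combine(probabilities):
    """Combine probabilities with the naive Bayes formula:
    p1*...*pn / (p1*...*pn + (1-p1)*...*(1-pn)).

    Raises ZeroDivisionError if the inputs contain both 0 and 1.
    """
    p_all = prod(probabilities)
    q_all = prod(1 - p for p in probabilities)
    return p_all / (p_all + q_all)


generic = 0.7              # generic lambda
surname_specific = 0.8     # surname specific lambda
first_name_specific = 0.9  # first name specific lambda

# Naive version: the generic lambda is counted three times.
naive = combine([generic, surname_specific, first_name_specific])

# Corrected version: p4 and p5 (each 1 - generic) undo the two extra copies.
corrected = combine(
    [generic, surname_specific, first_name_specific, 1 - generic, 1 - generic]
)

print(f"naive={naive:.5f} corrected={corrected:.5f}")
```

With these example values the naive combination comes out at 84/85 and the corrected one at 108/115, so the two extra terms pull the result back down.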


RobinL commented on June 6, 2024

There's probably a way to simplify this so that p2 and p3 can be interpreted as adjustments to p1 - i.e. apply the 0.3 adjustment to each one individually:

0.8 x 0.3
---------------------------
(0.8 x 0.3) + (0.2 x 0.7)

= 0.63157895

This way you can just put 0.5 in for the case that the name does not match, which in terms of naive Bayes is just 'no information'.
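A small sketch of this simplification, assuming the two-term form of the same naive Bayes formula. The names `combine_two` and `adjustment` are hypothetical helpers, not splink functions.

```python
def combine_two(p, q):
    """Naive Bayes combination of two probabilities."""
    return (p * q) / (p * q + (1 - p) * (1 - q))


def adjustment(specific, generic):
    """Strip the generic lambda out of a specific lambda by combining
    the specific lambda with (1 - generic)."""
    return combine_two(specific, 1 - generic)


generic = 0.7
surname_adjustment = adjustment(0.8, generic)

# Applying the adjustment to the generic lambda gives back the
# surname specific lambda.
surname_match = combine_two(generic, surname_adjustment)

# 0.5 carries no information: the generic lambda is left unchanged.
surname_no_match = combine_two(generic, 0.5)

print(surname_adjustment, surname_match, surname_no_match)
```

The design choice here is that each column contributes exactly one term, either its adjustment or 0.5, so the same expression covers every agreement pattern.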


RobinL commented on June 6, 2024

Further explanation:

λ is a measure of the probability of a comparison drawn at random being a match.

The π distribution for a column, where e.g. 𝛾=1, is a measure of:

the proportion of records where the column comparison agrees GIVEN the record is a match
the proportion of records where the column comparison agrees GIVEN the record is a non-match

The term frequency adjustment says:

What is the probability of a comparison drawn at random being a match GIVEN surname is 'smith'?

Well, the standard lambda is computed just as the average of

So instead of using that we use the average of
amongst records where 'smith' matches
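To make the idea concrete, here is a rough sketch. It assumes that the quantity being averaged is the per-comparison match probability (the 'xis' mentioned elsewhere in this thread); that reading, the toy data and all variable names are assumptions for illustration only.

```python
from statistics import mean

# Each tuple is (surname_left, surname_right, xi), where xi is the assumed
# per-comparison match probability. The values are made up.
comparisons = [
    ("smith", "smith", 0.9),
    ("smith", "smith", 0.3),
    ("smith", "jones", 0.1),
    ("zeller", "zeller", 0.95),
]

# Generic lambda: average over every comparison.
generic_lambda = mean(xi for _, _, xi in comparisons)

# Term-specific lambda: average only over comparisons where both sides
# have the surname 'smith'.
smith_lambda = mean(
    xi for left, right, xi in comparisons if left == right == "smith"
)

print(generic_lambda, smith_lambda)
```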


RobinL commented on June 6, 2024

How to test for this?

Create a dataframe with the xis populated to predictable values e.g. 1 and 0.

Put into 'make adjustments for term frequencies'

Also check that missing values work

e.g.
surname and forename match
surname matches but not forename
neither surname nor forename match
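A sketch of what such a test could look like, in plain Python rather than a dataframe so that it is self-contained. The helper `term_lambdas`, the `_l`/`_r` column suffixes and the `xi` key are all hypothetical, and the per-term average of xi is an assumed stand-in for the real adjustment step.

```python
from collections import defaultdict


def agreed_term(row, column):
    """Return the shared value if both sides are present and equal, else None."""
    left, right = row[f"{column}_l"], row[f"{column}_r"]
    if left is None or right is None or left != right:
        return None
    return left


def term_lambdas(rows, column):
    """For each row, the mean xi over all rows agreeing on the same term,
    or None where the column disagrees or has a missing value."""
    totals = defaultdict(lambda: [0.0, 0])  # term -> [sum of xi, row count]
    for row in rows:
        term = agreed_term(row, column)
        if term is not None:
            totals[term][0] += row["xi"]
            totals[term][1] += 1
    result = []
    for row in rows:
        term = agreed_term(row, column)
        result.append(None if term is None else totals[term][0] / totals[term][1])
    return result


rows = [
    # surname and forename match
    {"surname_l": "smith", "surname_r": "smith",
     "forename_l": "john", "forename_r": "john", "xi": 1.0},
    # surname matches but not forename
    {"surname_l": "smith", "surname_r": "smith",
     "forename_l": "john", "forename_r": "mary", "xi": 0.0},
    # neither surname nor forename match
    {"surname_l": "smith", "surname_r": "jones",
     "forename_l": "john", "forename_r": "mary", "xi": 0.0},
    # missing surname, forename matches
    {"surname_l": None, "surname_r": "smith",
     "forename_l": "john", "forename_r": "john", "xi": 1.0},
]

surname_result = term_lambdas(rows, "surname")
forename_result = term_lambdas(rows, "forename")
print(surname_result, forename_result)
```

Because xi is only ever 1 or 0, the expected averages can be worked out by hand for each row.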


RobinL commented on June 6, 2024

We should check our implementation against the fastLink implementation:

https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/nameReweight.R#L95
https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/fastLink.R#L443

There's some indication that the 'proper' implementation should downweight common surnames. Ours always seems to increase the match probability, even for common surnames.
https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/nameReweight.R#L4


RobinL commented on June 6, 2024

Closed by #19

