
Add support for term frequency (splink issue, closed)

RobinL avatar RobinL commented on June 6, 2024
Add support for term frequency

from splink.

Comments (6)

RobinL avatar RobinL commented on June 6, 2024

We will want a generic version of this i.e. a version that allows us to make a term-frequency adjustment for n named columns - e.g. make the adjustment for surname AND firstname.

One way to do this would be to use Bayes rule:
https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering#Combining_individual_probabilities

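[image: the combining formula from the linked section, p = (p1 x p2 x ... x pN) / ((p1 x p2 x ... x pN) + (1 - p1) x (1 - p2) x ... x (1 - pN))]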

Note that the 'generic lambda' is inherent in each 'specific' lambda, so when we want an adjustment for surname AND first name we cannot simply insert the 'specific lambda' for surname and first name into the above formula.

Instead we need to 'remove' the original lambda.

e.g. if:
generic lambda = 0.7
surname specific lambda = 0.8
first name specific lambda = 0.9

we need p1 = 0.7, p2 = 0.8, p3 = 0.9, and then p4 = 0.3 and p5 = 0.3 (i.e. 1 - generic lambda)

where p4 and p5 'undo' the two 'additional' generic lambdas inherent in p2 and p3. Otherwise we would have counted the 'generic lambda' three times rather than once.

Note that this formulation is general: it can be computed for records where neither name matches, where only surname or only first name matches, and where both surname and first name match.
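
As a rough illustration of the arithmetic above, here is a minimal sketch of the naive Bayes combination applied to the example values. The function name `combine_probabilities` is made up for this example and is not part of splink.

```python
from math import prod


def combine_probabilities(probabilities):
    """Combine probabilities with the naive Bayes formula from the linked section."""
    agree = prod(probabilities)
    disagree = prod(1 - p for p in probabilities)
    return agree / (agree + disagree)


generic_lambda = 0.7
surname_lambda = 0.8
first_name_lambda = 0.9
undo = 1 - generic_lambda  # p4 and p5: cancel the generic lambda inside p2 and p3

combined = combine_probabilities(
    [generic_lambda, surname_lambda, first_name_lambda, undo, undo]
)
print(round(combined, 4))  # 0.9391
```

Without the two 'undo' terms the same call gives a higher value, because the generic lambda is then counted three times.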


RobinL avatar RobinL commented on June 6, 2024

There's probably a way to simplify this so that p2 and p3 can be interpreted as adjustments to p1, i.e. apply the 0.3 adjustment to each one individually:

        0.8 x 0.3
-------------------------
(0.8 x 0.3) + (0.2 x 0.7)

= 0.63157895

This way you can just put 0.5 in for the case where the name does not match, which in naive Bayes terms is just 'no information'.
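
A small sketch of this simplification, under the assumption that each specific lambda is combined with (1 - generic lambda) on its own and the results are then applied to the generic lambda. The helper names `remove_generic` and `apply_adjustments` are hypothetical, chosen only for this example.

```python
def remove_generic(specific_lambda, generic_lambda):
    """Combine a specific lambda with (1 - generic lambda) to strip out the generic part."""
    undo = 1 - generic_lambda
    agree = specific_lambda * undo
    disagree = (1 - specific_lambda) * (1 - undo)
    return agree / (agree + disagree)


def apply_adjustments(generic_lambda, adjustments):
    """Naive Bayes combination of the generic lambda with each adjustment."""
    agree = generic_lambda
    disagree = 1 - generic_lambda
    for adjustment in adjustments:
        agree *= adjustment
        disagree *= 1 - adjustment
    return agree / (agree + disagree)


surname_adj = remove_generic(0.8, 0.7)
first_name_adj = remove_generic(0.9, 0.7)
print(round(surname_adj, 8))  # 0.63157895

# 0.5 stands in for a name that does not match: it leaves the result unchanged.
both = apply_adjustments(0.7, [surname_adj, first_name_adj])
surname_only = apply_adjustments(0.7, [surname_adj, 0.5])
neither = apply_adjustments(0.7, [0.5, 0.5])
```

With these values, `surname_only` comes back to the surname specific lambda of 0.8 and `neither` to the generic lambda of 0.7, which is the behaviour the 0.5 placeholder is meant to give.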


RobinL avatar RobinL commented on June 6, 2024

Further explanation:

λ is a measure of the probability of a comparison drawn at random being a match.

The π distribution for a column, e.g. where 𝛾=1, is a measure of:

  • the proportion of records where the column comparison agrees GIVEN the record is a match
  • the proportion of records where the column comparison agrees GIVEN the record is a non-match

The term frequency adjustment says:

  • What is the probability of a comparison drawn at random being a match GIVEN surname is 'smith'

Well, the standard lambda is computed just as the average of
image

So instead of using that we use the average of
image amongst records where 'smith' matches
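
To make the idea concrete, here is a minimal sketch. It assumes the quantity being averaged is a per-comparison match probability, which the missing images do not confirm; the field name `match_probability`, the `_l`/`_r` suffixes and the sample values are all invented for illustration.

```python
from statistics import mean

# Hypothetical record comparisons with an already-estimated match probability.
comparisons = [
    {"surname_l": "smith", "surname_r": "smith", "match_probability": 0.9},
    {"surname_l": "smith", "surname_r": "smith", "match_probability": 0.3},
    {"surname_l": "smith", "surname_r": "jones", "match_probability": 0.1},
    {"surname_l": "zygmunt", "surname_r": "zygmunt", "match_probability": 0.95},
]


def generic_lambda(rows):
    """Average match probability over all comparisons."""
    return mean(row["match_probability"] for row in rows)


def term_specific_lambda(rows, column, term):
    """Average match probability over comparisons where both sides equal `term`."""
    agreeing = [
        row["match_probability"]
        for row in rows
        if row[f"{column}_l"] == term and row[f"{column}_r"] == term
    ]
    return mean(agreeing) if agreeing else None


overall = generic_lambda(comparisons)
smith = term_specific_lambda(comparisons, "surname", "smith")
zygmunt = term_specific_lambda(comparisons, "surname", "zygmunt")
```

In this toy data the 'smith' lambda is lower than the 'zygmunt' lambda, reflecting that agreement on a rarer surname is stronger evidence of a match.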


RobinL avatar RobinL commented on June 6, 2024

How to test for this?

Create a dataframe with the xis populated to predictable values e.g. 1 and 0.

Put into 'make adjustments for term frequencies'

Also check that missing values work

e.g.

  • surname and forename match
  • surname matches but not forename
  • neither surname nor forename match
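
One possible shape for such a test, using plain Python rows instead of a dataframe. The `term_lambda` function is a stand-in written for this sketch, not the actual 'make adjustments for term frequencies' code, and `xi` is assumed to hold the per-comparison match probability.

```python
def term_lambda(rows, column, term):
    """Mean xi among rows where both sides of `column` are present and equal `term`."""
    probs = [
        row["xi"]
        for row in rows
        if row[column + "_l"] is not None
        and row[column + "_r"] is not None
        and row[column + "_l"] == row[column + "_r"] == term
    ]
    return sum(probs) / len(probs) if probs else None


rows = [
    # surname and forename match
    {"surname_l": "smith", "surname_r": "smith",
     "forename_l": "john", "forename_r": "john", "xi": 1.0},
    # surname matches but not forename
    {"surname_l": "smith", "surname_r": "smith",
     "forename_l": "john", "forename_r": "mary", "xi": 0.0},
    # neither surname nor forename match
    {"surname_l": "smith", "surname_r": "jones",
     "forename_l": "john", "forename_r": "mary", "xi": 0.0},
    # missing surname on one side: must not count as a surname match
    {"surname_l": None, "surname_r": "smith",
     "forename_l": "john", "forename_r": "john", "xi": 1.0},
]

assert term_lambda(rows, "surname", "smith") == 0.5
assert term_lambda(rows, "forename", "john") == 1.0
assert term_lambda(rows, "surname", "jones") is None
```

Because every xi is exactly 1 or 0, the expected values can be worked out by hand, so any deviation points at the adjustment logic rather than at rounding.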


RobinL avatar RobinL commented on June 6, 2024

We should check our implementation against the fastlink implementation:

https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/nameReweight.R#L95
https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/fastLink.R#L443

There's some indication that the 'proper' implementation should downweight common surnames. Ours always seems to increase the match prob even for common surnames.
https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/nameReweight.R#L4


RobinL avatar RobinL commented on June 6, 2024

Closed by #19
