Comments (6)
We will want a generic version of this, i.e. a version that allows us to make a term-frequency adjustment for n named columns, e.g. make the adjustment for surname AND first name.
One way to do this would be to use Bayes rule:
https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering#Combining_individual_probabilities
Note that the 'generic lambda' is inherent in each 'specific' lambda, so in the case where we want an adjustment for surname AND first name we cannot simply insert the 'specific lambdas' for surname and first name into the above formula.
Instead we need to 'remove' the original lambda.
e.g. if:
generic lambda = 0.7
surname specific lambda = 0.8
first name specific lambda = 0.9
we need p1 = 0.7, p2 = 0.8, p3 = 0.9, and then p4 = 0.3 and p5 = 0.3 (i.e. 1 - generic lambda),
where p4 and p5 'undo' the two 'additional' generic lambdas inherent in p2 and p3. Otherwise we have counted the 'generic lambda' three times rather than once.
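A minimal sketch of the calculation above, assuming the combination formula from the linked Wikipedia section (product of the p's over the product of the p's plus the product of the (1 - p)'s). The function name `combine_probabilities` is made up for illustration and is not part of splink.

```python
from math import prod


def combine_probabilities(probs):
    """Combine probabilities with the naive Bayes formula:
    prod(p) / (prod(p) + prod(1 - p))."""
    p_all = prod(probs)
    q_all = prod(1 - p for p in probs)
    return p_all / (p_all + q_all)


# Example values from the comment above
generic_lambda = 0.7
surname_lambda = 0.8
first_name_lambda = 0.9

probs = [
    generic_lambda,       # p1
    surname_lambda,       # p2
    first_name_lambda,    # p3
    1 - generic_lambda,   # p4: undoes the generic lambda inside p2
    1 - generic_lambda,   # p5: undoes the generic lambda inside p3
]

combined = combine_probabilities(probs)
print(round(combined, 4))  # 0.9391
```

In odds terms this is (0.7/0.3) x (0.8/0.2) x (0.9/0.1) x (0.3/0.7) x (0.3/0.7) = 108/7, which gives 108/115.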
Note this allows for a formulation which is general in the sense it can be computed for records where there's no match, where only surname or only first name match, and where surname and first name match
from splink.
There's probably a way to simplify this so that p2 and p3 can be interpreted as adjustments to p1, i.e. apply the 0.3 adjustment to each one individually:
(0.8 x 0.3)
-------------------------
(0.8 x 0.3) + (0.2 x 0.7)
= 0.63157895
This way you can just put 0.5 in for the case that the name does not match, which, in terms of naive Bayes, is just 'no information'.
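A rough sketch of that simplification, using the same example numbers. Both helper names (`remove_generic`, `combine_two`) are hypothetical. It also checks that combining with 0.5 leaves a probability unchanged, and that applying the surname adjustment to the generic lambda recovers the surname-specific lambda.

```python
def combine_two(p, q):
    """Naive Bayes combination of two probabilities."""
    return p * q / (p * q + (1 - p) * (1 - q))


def remove_generic(specific_lambda, generic_lambda):
    """Turn a specific lambda into an adjustment by combining it
    with (1 - generic lambda), which strips out the generic lambda."""
    return combine_two(specific_lambda, 1 - generic_lambda)


generic_lambda = 0.7

surname_adjustment = remove_generic(0.8, generic_lambda)
print(round(surname_adjustment, 8))  # 0.63157895

# 0.5 carries no information: the generic lambda is unchanged
unchanged = combine_two(generic_lambda, 0.5)

# Applying the adjustment to the generic lambda gives back 0.8
adjusted = combine_two(generic_lambda, surname_adjustment)
```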
Further explanation:
λ is a measure of the probability of a comparison drawn at random being a match.
The π distribution for a column, e.g. at 𝛾=1, is a measure of
- the proportion of records where the column comparison agrees GIVEN the record is a match
- the proportion of records where the column comparison agrees GIVEN the record is a non-match
The term frequency adjustment says:
- What is the probability of a comparison drawn at random being a match GIVEN the surname is 'smith'?
Well, the standard lambda is computed just as the average of ξ (the match probability of each comparison) across all comparisons.
So instead of using that, we use the average of ξ amongst comparisons where 'smith' matches.
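As an illustration only, here is how the two averages might be computed on a handful of made-up comparisons. It assumes each comparison carries a match probability (called `xi` here) and the surname value when both sides agree (`None` when they do not). The data and field names are invented for the example and are not splink's.

```python
from statistics import mean

# Hypothetical comparisons: the agreed surname (None if the surnames
# differ) and the match probability xi of the comparison
comparisons = [
    {"surname": "smith", "xi": 0.6},
    {"surname": "smith", "xi": 0.2},
    {"surname": "linacre", "xi": 0.95},
    {"surname": None, "xi": 0.05},
]

# Generic lambda: average xi over all comparisons
generic_lambda = mean(c["xi"] for c in comparisons)

# Surname-specific lambda: average xi where the surname agrees on 'smith'
smith_lambda = mean(c["xi"] for c in comparisons if c["surname"] == "smith")

print(round(generic_lambda, 2), round(smith_lambda, 2))
```

In this toy data 'smith' is common, so its specific lambda comes out below the generic lambda, whereas the rare surname would push the match probability up.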
How to test for this?
Create a dataframe with the xis populated to predictable values e.g. 1 and 0.
Put into 'make adjustments for term frequencies'
Also check that missing values work
e.g. test the cases where:
- surname and forename match
- surname matches but forename does not
- neither surname nor forename match
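One possible shape for such a test, as a sketch under assumptions: `adjusted_probability` is a stand-in written for this example and is not the real 'make adjustments for term frequencies' function. A missing or non-matching column is passed as `None` and treated as 0.5. The adjustments 12/19 and 27/34 are what you get for specific lambdas of 0.8 and 0.9 with a generic lambda of 0.7.

```python
from math import isclose, prod


def adjusted_probability(xi, adjustments):
    """Combine xi with per-column adjustments via naive Bayes.
    None (missing value or no match) is treated as 0.5."""
    probs = [xi] + [0.5 if a is None else a for a in adjustments]
    p_all = prod(probs)
    q_all = prod(1 - p for p in probs)
    return p_all / (p_all + q_all)


surname_adj = 12 / 19    # from specific lambda 0.8, generic lambda 0.7
forename_adj = 27 / 34   # from specific lambda 0.9, generic lambda 0.7

# surname and forename match
both = adjusted_probability(0.7, [surname_adj, forename_adj])
# surname matches but forename does not (or is missing)
surname_only = adjusted_probability(0.7, [surname_adj, None])
# neither surname nor forename match
neither = adjusted_probability(0.7, [None, None])

assert isclose(both, 108 / 115)
assert isclose(surname_only, 0.8)
assert isclose(neither, 0.7)

# Predictable xis of 1 and 0 are left untouched by any adjustment
assert adjusted_probability(1.0, [surname_adj, None]) == 1.0
assert adjusted_probability(0.0, [surname_adj, None]) == 0.0
```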
We should check our implementation against the fastLink implementation:
https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/nameReweight.R#L95
https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/fastLink.R#L443
There's some indication that the 'proper' implementation should downweight common surnames. Ours always seems to increase the match probability, even for common surnames.
https://github.com/kosukeimai/fastLink/blob/5ae71dec098dd5689223dfeef1f574b3d16d769a/R/nameReweight.R#L4
Closed by #19