Note: I now think the discussion below is inaccurate, and it has been superseded by this
We need to add support for ex-post adjustments to membership probabilities.
These are described here, for the case of making adjustments for a single column:
https://static.cambridge.org/content/id/urn:cambridge.org:id:article:S0003055418000783/resource/name/S0003055418000783sup001.pdf
This is the supplementary material for this:
https://imai.fas.harvard.edu/research/files/linkage.pdf
The formula proposed is:
Which says:
To calculate the membership probabilities, compute a 'specific' lambda within each token group and use it in place of the 'generic' lambda.
So, for example, to make the adjustment for the surname Linacre, look at all record comparisons where the surnames match and the surname is Linacre.
Within these records, compute the expected proportion of matches by averaging the computed membership probabilities. Use this as the estimate for lambda rather than the 'generic' lambda.
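The 'specific' lambda calculation described above can be sketched roughly as follows. This is only an illustration: the record layout, the key names (`surname_l`, `surname_r`, `xi`) and the numbers are all hypothetical, and it assumes the membership probabilities have already been computed for each comparison.

```python
from collections import defaultdict

def specific_lambdas(comparisons):
    """Average membership probability within each surname token group.

    `comparisons` is a list of dicts with hypothetical keys:
      surname_l, surname_r: surnames on each side of the record pair
      xi: membership (match) probability already computed for the pair
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for c in comparisons:
        # Only pairs whose surnames agree belong to a token group
        if c["surname_l"] == c["surname_r"]:
            totals[c["surname_l"]] += c["xi"]
            counts[c["surname_l"]] += 1
    # Expected proportion of matches within each group
    return {token: totals[token] / counts[token] for token in totals}

# Made-up comparisons, purely for illustration
comparisons = [
    {"surname_l": "Linacre", "surname_r": "Linacre", "xi": 0.9},
    {"surname_l": "Linacre", "surname_r": "Linacre", "xi": 0.7},
    {"surname_l": "Smith", "surname_r": "Smith", "xi": 0.2},
    {"surname_l": "Smith", "surname_r": "Smith", "xi": 0.4},
    {"surname_l": "Smith", "surname_r": "Jones", "xi": 0.01},
]
lambdas = specific_lambdas(comparisons)
```

With these made-up numbers the unusual surname ends up with a much higher 'specific' lambda than the common one, and the Smith/Jones pair contributes to neither group because the surnames do not agree.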
We expect the 'specific' lambda generally to be 'better' than the 'generic' lambda because we're looking within records where the surname matches AND the surname is x.
This 'betterness' is coming from two sources:
- Within records where surname matches, we know the gamma value must indicate agreement
- Within records where the surname matches, it's more likely the other fields will match, especially if it's an unusual surname.
This actually suggests to me there may be a problem with the formula:
The combination of lambda and the surname pi in the standard (unadjusted) formula accounts for the probability that two random records match, plus the information contained in the gamma.
We're replacing this with an adjusted lambda that accounts for the probability that records match given gamma is 1 (i.e. the surnames match). This accounts for the information in lambda (because, like lambda, it's computed from the xis), and for the match on surname (because we only use xis within the surname group).
But then, in addition, we're applying the surname-match probability from pi. This feels like double counting.
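A toy numerical sketch of the double-counting worry, under strong simplifying assumptions: a single field (surname), made-up values for lambda and the pi probabilities, and a 'specific' lambda taken to be exactly the probability of a match given that the surnames agree. None of these numbers come from the paper.

```python
def posterior(prior, m, u):
    """Bayes update of a match prior given agreement on one field.

    m: P(field agrees | match), u: P(field agrees | non-match).
    """
    return prior * m / (prior * m + (1 - prior) * u)

generic_lambda = 0.01   # hypothetical P(match) for a random record pair
m_surname = 0.9         # hypothetical P(surname agrees | match)
u_surname = 0.05        # hypothetical P(surname agrees | non-match)

# Standard formula: generic lambda combined with the surname pi, once.
once = posterior(generic_lambda, m_surname, u_surname)

# Assumption: the 'specific' lambda is P(match | surname agrees), so it
# already equals `once`. Applying the surname pi to it again uses the
# surname agreement a second time.
specific_lambda = once
twice = posterior(specific_lambda, m_surname, u_surname)
```

Here `once` is about 0.15 while `twice` is about 0.77, so in this simplified setting the same surname agreement has been allowed to move the probability twice.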
Note: I believe there is an error in the above formula - a missing product term. Compare it to the formula for membership probabilities given in the main paper:
Specifically the missing term is
on top and bottom. This term says: multiply the probabilities we've looked up from the pi distributions together for each element of the comparison vector
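To make the role of the product term concrete, here is a minimal sketch of a membership probability calculation in which the looked-up pi probabilities are multiplied across every element of the comparison vector, in both the numerator and the denominator. The function name, the nested-list layout of the pi distributions and the example values are assumptions for illustration, not the paper's notation.

```python
from math import prod

def membership_probability(lam, pi_match, pi_nonmatch, gamma):
    """Posterior match probability for one comparison vector.

    lam: prior probability that a random record pair is a match
    pi_match[k][g]: P(gamma_k = g | match) for field k
    pi_nonmatch[k][g]: P(gamma_k = g | non-match) for field k
    gamma: observed comparison vector, one agreement level per field
    """
    # The product term: one looked-up pi probability per field
    match_term = lam * prod(pi_match[k][g] for k, g in enumerate(gamma))
    nonmatch_term = (1 - lam) * prod(
        pi_nonmatch[k][g] for k, g in enumerate(gamma)
    )
    return match_term / (match_term + nonmatch_term)

# Hypothetical two-field example; index 0 = disagree, 1 = agree
pi_match = [[0.1, 0.9], [0.2, 0.8]]
pi_nonmatch = [[0.95, 0.05], [0.9, 0.1]]
xi = membership_probability(0.01, pi_match, pi_nonmatch, gamma=(1, 1))
```

Dropping the two `prod(...)` factors would leave only lambda and (1 - lambda), which is the omission the note above is pointing at.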