moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

License: MIT License

record-linkage spark em-algorithm deduplication deduplicate-data entity-resolution data-matching fuzzy-matching data-science duckdb

splink's Introduction

Splink Logo

pypi Downloads Documentation

Important

Development has begun on Splink 4 on the splink4_dev branch. Splink 3 is in maintenance mode and we are no longer accepting new features. We welcome contributions to Splink 4. Read more on our latest blog.

Fast, accurate and scalable probabilistic data linkage

Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets that lack unique identifiers.

Key Features

⚡ Speed: Capable of linking a million records on a laptop in around a minute.
🎯 Accuracy: Support for term frequency adjustments and user-defined fuzzy matching logic.
🌐 Scalability: Execute linkage in Python (using DuckDB) or big-data backends like AWS Athena or Spark for 100+ million records.
🎓 Unsupervised Learning: No training data is required, because models can be estimated using an unsupervised approach.
📊 Interactive Outputs: A suite of interactive visualisations helps users understand their model and diagnose problems.

Splink's linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customisations to improve accuracy.

What does Splink do?

Consider the following records that lack a unique person identifier:

tables showing what splink does

Splink predicts which rows link together:

tables showing what splink does

and clusters these links to produce an estimated person ID:

tables showing what splink does

What data does Splink work best with?

Before using Splink, input data should be standardised, with consistent column names and formatting (e.g., lowercased, punctuation cleaned up, etc.).
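
As a rough illustration of the kind of standardisation meant here, a pandas sketch (not part of Splink itself; column names and cleaning rules are illustrative):

import pandas as pd

df = pd.DataFrame({"First Name": ["Robin ", "ROBIN"], "Surname": ["Linacre.", "linacre"]})

# consistent, lowercase column names
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# lowercase values, strip punctuation and surrounding whitespace
for col in ["first_name", "surname"]:
    df[col] = (
        df[col]
        .str.lower()
        .str.replace(r"[^\w\s]", "", regex=True)
        .str.strip()
    )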

Splink performs best with input data containing multiple columns that are not highly correlated. For instance, if the entity type is persons, you may have columns for full name, date of birth, and city. If the entity type is companies, you could have columns for name, turnover, sector, and telephone number.

High correlation occurs when the value of a column is highly constrained (predictable) from the value of another column. For example, a 'city' field is almost perfectly correlated with 'postcode'. Gender is highly correlated with 'first name'. Correlation is particularly problematic if all of your input columns are highly correlated.

Splink is not designed for linking a single column containing a 'bag of words': for example, a table with a single 'company name' column and no other details.

Documentation

The homepage for the Splink documentation can be found here. Interactive demos can be found here, or by clicking the following Binder link:

Binder

The specification of the Fellegi-Sunter statistical model behind Splink is similar to that used in the R fastLink package. Accompanying the fastLink package is an academic paper that describes this model. The Splink documentation site and a series of interactive articles also explore the theory behind Splink.

The Office for National Statistics have written a case study about using Splink to link 2021 Census data to itself.

Installation

Splink supports Python 3.8+. To obtain the latest released version of Splink, you can install from PyPI using pip:

pip install splink

or, if you prefer, you can instead install splink using conda:

conda install -c conda-forge splink

Installing Splink for Specific Backends

For projects requiring specific backends, Splink offers optional installations for Spark, Athena, and PostgreSQL. These can be installed by appending the backend name in brackets to the pip install command:

pip install 'splink[{backend}]'

Should you require a version of Splink without DuckDB, see our section on DuckDBLess Splink Installation.

Click here for backend-specific installation commands

Spark

pip install 'splink[spark]'

Athena

pip install 'splink[athena]'

PostgreSQL

pip install 'splink[postgres]'

Quickstart

The following code demonstrates how to estimate the parameters of a deduplication model, use it to identify duplicate records, and then use clustering to generate an estimated unique person ID.

For a more detailed tutorial, please see here.

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
}

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

blocking_rule_for_training = block_on(["first_name", "surname"])

linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training, estimate_without_term_frequencies=True)

blocking_rule_for_training = block_on("dob")
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training, estimate_without_term_frequencies=True)

pairwise_predictions = linker.predict()

clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)
clusters.as_pandas_dataframe(limit=5)

Videos

Charts Gallery

You can see all of the interactive charts provided in Splink by checking out the Charts Gallery.

Support

To find the best place to ask a question, report a bug or get general advice, please refer to our Contributing Guide.

Use Cases

To see how users are using Splink in the wild, check out the Use Cases section of the docs.

Awards

โ“ Future of Government Awards 2023: Open Source Creation - Shortlisted, result to be announced shortly

๐Ÿฅˆ Civil Service Awards 2023: Best Use of Data, Science, and Technology - Runner up

๐Ÿฅ‡ Analysis in Government Awards 2022: People's Choice Award - Winner

๐Ÿฅˆ Analysis in Government Awards 2022: Innovative Methods - Runner up

๐Ÿฅ‡ Analysis in Government Awards 2020: Innovative Methods - Winner

๐Ÿฅ‡ MoJ Data and Analytical Services Directorate (DASD) Awards 2020: Innovation and Impact - Winner

Citation

If you use Splink in your research, we'd be grateful for a citation as follows:

@article{Linacre_Lindsay_Manassis_Slade_Hepworth_2022,
	title        = {Splink: Free software for probabilistic record linkage at scale.},
	author       = {Linacre, Robin and Lindsay, Sam and Manassis, Theodore and Slade, Zoe and Hepworth, Tom and Kennedy, Ross and Bond, Andrew},
	year         = 2022,
	month        = {Aug.},
	journal      = {International Journal of Population Data Science},
	volume       = 7,
	number       = 3,
	doi          = {10.23889/ijpds.v7i3.1794},
	url          = {https://ijpds.org/article/view/1794},
}

Acknowledgements

We are very grateful to ADR UK (Administrative Data Research UK) for providing the initial funding for this work as part of the Data First project.

We are extremely grateful to professors Katie Harron, James Doidge and Peter Christen for their expert advice and guidance in the development of Splink. We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work. Any errors remain our own.

Related Repositories

While Splink is a standalone package, there are a number of repositories in the Splink ecosystem:

splink's People

Contributors

adbond, afua-moj, aliceoleary0, alicja-januszkiewicz, aymonwuolanne, boobay, cinnq346, danielox, dependabot[bot], hanslemm, harrietrs, jacuna88, james-osmond, jlb52, jonathanlaidler, mamonu, mastratton3, nerskin, nickcrews, rapidambakar, rjc89, robertwhiffin, robinl, rossken, sama-ds, samnlindsay, thomashepworth, tommyberry, zmbc, zslade


splink's Issues

get_largest_blocks() limited to simple blocking rules

get_largest_blocks(...) assumes blocking rules of the form:

l.column1 = r.column1 [AND l.column2 = r.column2 ...]

โœ… Working example

Blocking rule:

l.height = r.height AND l.dob = r.dob

SQL generated:

SELECT height, dob, count(*) as count
FROM df
WHERE height is not null and dob is not null
GROUP BY height, dob
ORDER BY count(*) desc
LIMIT 5

โŒ Breaking example

Blocking rule:

abs(l.height - r.height) < 5

SQL generated:

SELECT abs(height, count(*) as count
FROM df
WHERE abs(height is not null
GROUP BY abs(height
ORDER BY count(*) desc
LIMIT 5

Is it possible to work incrementally rather than having to dedupe the whole thing every time?

From a performance point of view, the EM steps are pretty fast.

What seems to take a lot of time is applying the string comparison functions, blocking rules, deduping, and creating the table of comparisons.

Imagine we start with records A, B, C which we need to dedupe. Call this df1.

We know later we're going to get records D and E, and we'll then also need to dedupe A, B, C, D, E together.

With ABC, we compute comparisons

AB
AC
BC

Using select * from df1 as a cross join df1 as b where a.id < b.id

We can save that out to a table called comparisons.

Now records D and E come in. Call this df2.

We can now do

select * from df1 cross join df2 which gives us:

AD
BD
CD
AE
BE
CE

and then

select * from df2 as a cross join df2 as b where a.id < b.id which gives us:
D,E
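
Pulling those three queries together, a minimal sketch of the incremental idea using DuckDB (the principle is the same on Spark; table and column names follow the example above):

import duckdb
import pandas as pd

# existing records, and the new records that arrive later
df1 = pd.DataFrame({"id": ["A", "B", "C"]})
df2 = pd.DataFrame({"id": ["D", "E"]})

con = duckdb.connect()
con.register("df1", df1)
con.register("df2", df2)

# comparisons for the existing data: AB, AC, BC - saved out once
con.execute("""
    create table comparisons as
    select a.id as id_l, b.id as id_r
    from df1 as a cross join df1 as b
    where a.id < b.id
""")

# when df2 arrives, only the new pairs need appending
con.execute("""
    insert into comparisons
    select a.id, b.id from df1 as a cross join df2 as b                    -- AD, BD, CD, AE, BE, CE
    union all
    select a.id, b.id from df2 as a cross join df2 as b where a.id < b.id  -- DE
""")

print(con.execute("select * from comparisons order by id_l, id_r").df())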

Revisit comparison evaluation

Would like some functions that, given labelled data, efficiently return:

  • real parameters
  • tp, fp, tn, fn

Here's an earlier attempt:

import re

def get_real_params(df_comparison, df_with_gamma, spark, est_params):

    # pick out the comparison vector columns (gamma_0, gamma_1, ..., including gamma_10 and above)
    gamma_cols = [c for c in df_with_gamma.columns if re.match(r"^gamma_\d+$", c)]
    df_with_gamma.createOrReplaceTempView('df_with_gamma')
    df_comparison.createOrReplaceTempView('df_comparison')

    # Want match probability, gammas, label
    gamma_select_expr = ", ".join([f"g.{c}" for c in gamma_cols])

    # This dataset looks like df_e, but instead of match probability it's got a 1,0 label

    sql = f"""
    select {gamma_select_expr},
    cast(c.group_l == c.group_r as int) as match_probability
    from df_with_gamma as g
    left join df_comparison as c
    on g.unique_id_l = c.unique_id_l
    and
    g.unique_id_r = c.unique_id_r

    """
    gamma_e_label = spark.sql(sql)

    # Params and run_maximisation_step are splink internals, assumed to be in scope
    p = Params(est_params.gamma_settings)
    run_maximisation_step(gamma_e_label, spark, p)
    return p.params

org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStageX grows beyond 64 KB

There seems to be an issue with some of the calculation stages during the
Maximisation step when running through the sparklink normal flow in AWS Glue.
It doesn't stop the calculations and the job finishes successfully, but perhaps this merits attention,
in the sense that we don't know what happens when "Whole-stage codegen disabled for plan X" occurs.

an example from the logs :

20/02/26 17:01:01 ERROR CodeGenerator: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage15" grows beyond 64 KB 20/02/26 17:01:01 **WARN WholeStageCodegenExec: Whole-stage codegen disabled for plan** (id=15):

part of Codegen follows :

*(15) Project [(((((((((((0.013465756550431252 * CASE WHEN (CASE WHEN (isnull(PERNAME1#69) || isnull(PERNAME1#1446)) THEN -1 WHEN (UDF:jaccard_sim(UDF:QgramTokeniser(PERNAME1#69), UDF:QgramTokeniser(PERNAME1#1446)) > 0.84) THEN 2 WHEN (UDF:jaccard_sim(UDF:QgramTokeniser(PERNAME1#69), UDF:QgramTokeniser(PERNAME1#1446)) > 0.7) THEN 1 ELSE 0 END = -1) THEN 1.0 WHEN (CASE WHEN (isnull(PERNAME1#69) || isnull(PERNAME1#1446)) THEN -1 WHEN (UDF:jaccard_sim(UDF:QgramTokeniser(PERNAME1#69), UDF:QgramTokeniser(PERNAME1#1446)) > 0.84) THEN 2 WHEN UDF:QgramTokeniser(PERNAME1#1446)) > 0.84) THEN 2 WHEN (UDF:jaccard_sim(UDF:QgramTokeniser(PERNAME1#69),

and it goes on forever like that ...
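
One way to investigate whether the fallback matters is to switch whole-stage codegen off explicitly and compare behaviour. A sketch of the relevant Spark config (a diagnostic experiment, not a fix for the underlying 64 KB JVM method limit):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# disable whole-stage codegen, so every stage uses the interpreted path
# rather than silently falling back only when the generated code is too large
spark.conf.set("spark.sql.codegen.wholeStage", "false")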

Deal with nulls

Need to implement a method for dealing with comparisons where one or both sides is null.
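
For illustration, the convention adopted elsewhere in these issues (see 'Include automatic null treatment' below) is to map null comparisons to gamma = -1, along these lines (the column name is illustrative):

# a sketch of a case expression using the gamma = -1 convention for nulls
case_expression = """
case
when surname_l is null or surname_r is null then -1
when surname_l = surname_r then 1
else 0
end
"""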

Stop everything being decimals

Spark treats literal values in queries as decimals

i.e. in the following a is a decimal:

spark.sql("select 0.2/0.3 as a")

See here

We need to refactor a bunch of statements to wrap them with cast:
spark.sql("select cast(0.2 as float)/cast(0.3 as float) as a")

Empty blocking rules

Blocking rules have to be included in settings but with no requirement on the contents other than being an array.

This means an empty array causes block_using_rules() to fail.

With that in mind I would relax the requirement to include blocking rules in settings, with the default being an empty array, and with block_using_rules() falling back on cartestian_block() instead. (This will also be an opportunity to fix the typo in "cartesian")

Optimise blocking by filtering out nulls

Each blocking rule has a unique_id_l < unique_id_r rule, which means that the standard Spark filter that gets rid of nulls before joining retains all rows (because unique_id is never null).

We could add an optimisation where we insert a filter condition that gets rid of records which are null on the blocking fields.

This may turn out to be unnecessary if we eventually decide to use LSH, or that we decide "OR" based blocking rules aren't very useful.
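
A sketch of what the inserted filter condition could look like for a blocking rule on surname and dob (column names are illustrative; the SQL is held in a Python string as elsewhere in splink):

blocking_sql = """
select l.unique_id as unique_id_l, r.unique_id as unique_id_r
from df as l
inner join df as r
  on l.surname = r.surname and l.dob = r.dob
where l.unique_id < r.unique_id
  -- proposed optimisation: discard rows that are null on the blocking fields
  and l.surname is not null and l.dob is not null
  and r.surname is not null and r.dob is not null
"""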

Another addition to settings dict

Need another key in column_comparisons for 'term frequency adjustment'

This would:

  • auto retain this information in df_gammas
  • make relevant tf adjustments automatically
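
A sketch of what the proposal might look like in the settings dictionary (the shape and key names here are illustrative only, not a real API; compare term_frequency_adjustments=True in the quickstart further up the page):

settings = {
    "link_type": "dedupe_only",
    "blocking_rules": ["l.dob = r.dob"],
    "column_comparisons": [
        {"col_name": "surname", "term_frequency_adjustments": True},
        {"col_name": "first_name"},
    ],
}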

Inconsistent treatment of `additional_columns_to_retain`

df -> df_comparison
Where non-linking columns are specified in the settings to be retained, they are appended with "_l"/"_r" in the blocking (see _get_columns_to_retain_blocking())

df_comparison -> df_gammas
When calculating the gammas the extra columns are selected exactly as specified (see _get_select_expression_gammas())

I assume the former is the safer option, even in the event that you might want to retain a column only found on one side of the link, but worth checking.

Add support for term frequency

Note: I now think the below discussion is inaccurate and it has been superseded by this

Need to add support for ex-post adjustment to membership probabilities.

These are described here for the case of making adjustments for a single column.
https://static.cambridge.org/content/id/urn:cambridge.org:id:article:S0003055418000783/resource/name/S0003055418000783sup001.pdf

This is the supplementary material for this:
https://imai.fas.harvard.edu/research/files/linkage.pdf

The formula proposed is:

[image: the proposed formula from the supplementary material, not reproduced here]

Which says:

To calculate the membership probabilities, rather than use the 'generic' lambda, instead compute a 'specific lambda' within token groups, and use that instead of lambda.

So - for example, to make the adjustment for the surname Linacre, look within all records where the surname matches, and the surname is Linacre.

Within these records, compute the expected proportion of matches by looking at the computed membership probabilities. Use this as an estimate for lambda rather than the 'generic' lambda

We expect the 'specific' lambda generally to be 'better' than the 'generic lambda' because we're looking within records where the surname matches AND the surname is x.

This 'betterness' is coming from two sources:

  • Within records where surname matches, we know the gamma value must indicate agreement
  • Within records where the surname matches, it's more likely the other fields will match, especially if it's an unusual surname.

This actually suggests to me there may be a problem with the formula:

The combination of lambda and the surname pi in the standard (not adjusted) formula accounts for prob random records match, plus the information contained in the gamma.

We're replacing this with an adjusted lambda that accounts for the prob records match given gamma is 1 (i.e. surnames match) . This accounts for information in lambda (because it's computed from the xis, like lambda is), and accounts for the match on surname (because we only use xis within surname)

But then, in addition, we're applying the pi probability for surname match from pi. This feels like double counting.

Note: I believe there is an error in the above formula - a missing product term. Compare it to the formula for membership probabilities given in the main paper:

[images: membership probability formulas from the main paper, not reproduced here]

Specifically, the missing term is [image not reproduced] on top and bottom. This term says: multiply the probabilities we've looked up from the pi distributions together for each element of the comparison vector.
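
For reference, a reconstruction of the standard (unadjusted) membership probability under discussion, in roughly the fastLink notation (the images are not reproduced here, so treat this as an approximation): lambda is the overall match probability, and pi_{k,m}, pi_{k,u} are the m and u distributions for comparison vector element gamma_k.

\xi = \frac{\lambda \prod_{k} \pi_{k,m}(\gamma_{k})}
           {\lambda \prod_{k} \pi_{k,m}(\gamma_{k}) + (1 - \lambda) \prod_{k} \pi_{k,u}(\gamma_{k})}

The 'specific lambda' adjustment discussed above replaces lambda with the proportion of matches estimated within the token group (e.g. records where both surnames are 'Linacre'), leaving the product terms unchanged.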

org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38

    select *,
    case 
WHEN gamma_0 = -1 THEN cast(1 as double)
when gamma_0 = 0 then cast(0.28223589062690735 as double)
when gamma_0 = 1 then cast(0.003788494737818837 as double)
when gamma_0 = 2 then cast(0.7139756083488464 as double) 
end 
as prob_gamma_0_non_match, 

case 
WHEN gamma_0 = -1 THEN cast(1 as double)
when gamma_0 = 0 then cast(0.23711225390434265 as double)
when gamma_0 = 1 then cast(0.03999733552336693 as double)
when gamma_0 = 2 then cast(0.7228904366493225 as double) 
end 
as prob_gamma_0_match, 

case 
WHEN gamma_1 = -1 THEN cast(1 as double)
when gamma_1 = 0 then cast(0.6836211681365967 as double)
when gamma_1 = 1 then cast(0.0019191077444702387 as double)
when gamma_1 = 2 then cast(0.31445974111557007 as double) 
end 
as prob_gamma_1_non_match, 

case 
WHEN gamma_1 = -1 THEN cast(1 as double)
when gamma_1 = 0 then cast(0.2554125189781189 as double)
when gamma_1 = 1 then cast(0.025807620957493782 as double)
when gamma_1 = 2 then cast(0.7187798619270325 as double) 
end 
as prob_gamma_1_match, 

case 
WHEN gamma_2 = -1 THEN cast(1 as double)
when gamma_2 = 0 then cast(0.05246942862868309 as double)
when gamma_2 = 1 then cast(0.21890297532081604 as double)
when gamma_2 = 2 then cast(0.7286275625228882 as double) 
end 
as prob_gamma_2_non_match, 

case 
WHEN gamma_2 = -1 THEN cast(1 as double)
when gamma_2 = 0 then cast(0.009963554330170155 as double)
when gamma_2 = 1 then cast(0.042951032519340515 as double)
when gamma_2 = 2 then cast(0.947085440158844 as double) 
end 
as prob_gamma_2_match, 

case 
WHEN gamma_3 = -1 THEN cast(1 as double)
when gamma_3 = 0 then cast(0.8954361081123352 as double)
when gamma_3 = 1 then cast(0.001044640433974564 as double)
when gamma_3 = 2 then cast(0.10351923853158951 as double) 
end 
as prob_gamma_3_non_match, 

case 
WHEN gamma_3 = -1 THEN cast(1 as double)
when gamma_3 = 0 then cast(0.05692289397120476 as double)
when gamma_3 = 1 then cast(0.001244868733920157 as double)
when gamma_3 = 2 then cast(0.9418322443962097 as double) 
end 
as prob_gamma_3_match, 

case 
WHEN gamma_4 = -1 THEN cast(1 as double)
when gamma_4 = 0 then cast(0.0573236308991909 as double)
when gamma_4 = 1 then cast(0.9426763653755188 as double)
when gamma_4 = 2 then cast(1.170965990837658e-09 as double) 
end 
as prob_gamma_4_non_match, 

case 
WHEN gamma_4 = -1 THEN cast(1 as double)
when gamma_4 = 0 then cast(5.9380419956766985e-25 as double)
when gamma_4 = 1 then cast(2.3712765084837883e-10 as double)
when gamma_4 = 2 then cast(1.0 as double) 
end 
as prob_gamma_4_match, 

case 
WHEN gamma_5 = -1 THEN cast(1 as double)
when gamma_5 = 0 then cast(0.0573236308991909 as double)
when gamma_5 = 1 then cast(0.9426763653755188 as double)
when gamma_5 = 2 then cast(1.170965990837658e-09 as double) 
end 
as prob_gamma_5_non_match, 

case 
WHEN gamma_5 = -1 THEN cast(1 as double)
when gamma_5 = 0 then cast(5.9380419956766985e-25 as double)
when gamma_5 = 1 then cast(2.3712765084837883e-10 as double)
when gamma_5 = 2 then cast(1.0 as double) 
end 
as prob_gamma_5_match, 

case 
WHEN gamma_6 = -1 THEN cast(1 as double)
when gamma_6 = 0 then cast(0.8737211227416992 as double)
when gamma_6 = 1 then cast(0.1262788623571396 as double) 
end 
as prob_gamma_6_non_match, 

case 
WHEN gamma_6 = -1 THEN cast(1 as double)
when gamma_6 = 0 then cast(0.24107448756694794 as double)
when gamma_6 = 1 then cast(0.7589254975318909 as double) 
end 
as prob_gamma_6_match, 

case 
WHEN gamma_7 = -1 THEN cast(1 as double)
when gamma_7 = 0 then cast(0.8228409886360168 as double)
when gamma_7 = 1 then cast(0.17715899646282196 as double) 
end 
as prob_gamma_7_non_match, 

case 
WHEN gamma_7 = -1 THEN cast(1 as double)
when gamma_7 = 0 then cast(0.6131282448768616 as double)
when gamma_7 = 1 then cast(0.38687172532081604 as double) 
end 
as prob_gamma_7_match, 

case 
WHEN gamma_8 = -1 THEN cast(1 as double)
when gamma_8 = 0 then cast(0.7181188464164734 as double)
when gamma_8 = 1 then cast(0.01536017470061779 as double)
when gamma_8 = 2 then cast(0.26652097702026367 as double) 
end 
as prob_gamma_8_non_match, 

case 
WHEN gamma_8 = -1 THEN cast(1 as double)
when gamma_8 = 0 then cast(0.6471342444419861 as double)
when gamma_8 = 1 then cast(0.009849164634943008 as double)
when gamma_8 = 2 then cast(0.3430165648460388 as double) 
end 
as prob_gamma_8_match, 

case 
WHEN gamma_9 = -1 THEN cast(1 as double)
when gamma_9 = 0 then cast(0.7700794339179993 as double)
when gamma_9 = 1 then cast(0.22992053627967834 as double) 
end 
as prob_gamma_9_non_match, 

case 
WHEN gamma_9 = -1 THEN cast(1 as double)
when gamma_9 = 0 then cast(0.5112894773483276 as double)
when gamma_9 = 1 then cast(0.48871055245399475 as double) 
end 
as prob_gamma_9_match, 

case 
WHEN gamma_10 = -1 THEN cast(1 as double)
when gamma_10 = 0 then cast(0.5872060060501099 as double)
when gamma_10 = 1 then cast(0.41279396414756775 as double) 
end 
as prob_gamma_10_non_match, 

case 
WHEN gamma_10 = -1 THEN cast(1 as double)
when gamma_10 = 0 then cast(0.3346138596534729 as double)
when gamma_10 = 1 then cast(0.6653861403465271 as double) 
end 
as prob_gamma_10_match, 

case 
WHEN gamma_11 = -1 THEN cast(1 as double)
when gamma_11 = 0 then cast(0.5770787596702576 as double)
when gamma_11 = 1 then cast(0.42292121052742004 as double) 
end 
as prob_gamma_11_non_match, 

case 
WHEN gamma_11 = -1 THEN cast(1 as double)
when gamma_11 = 0 then cast(0.21871845424175262 as double)
when gamma_11 = 1 then cast(0.7812815308570862 as double) 
end 
as prob_gamma_11_match, 

case 
WHEN gamma_12 = -1 THEN cast(1 as double)
when gamma_12 = 0 then cast(0.4755280315876007 as double)
when gamma_12 = 1 then cast(0.0035781676415354013 as double)
when gamma_12 = 2 then cast(0.5208938121795654 as double) 
end 
as prob_gamma_12_non_match, 

case 
WHEN gamma_12 = -1 THEN cast(1 as double)
when gamma_12 = 0 then cast(0.4633902311325073 as double)
when gamma_12 = 1 then cast(0.030555138364434242 as double)
when gamma_12 = 2 then cast(0.5060546398162842 as double) 
end 
as prob_gamma_12_match
    from df_with_gamma

Include automatic null treatment (with ability to opt out)

Following discussion in #90 :

  • Example case expressions (and defaults) all include a basic when col_name_l is null then -1 etc. to handle nulls, but the onus is on the user to ensure custom case expressions are written similarly.
  • These conditions can easily be inserted into all case expressions automatically (using col_name, or custom_column_names for custom_name comparisons) - see the sketch below.
  • If this null treatment is somehow not appropriate in the custom case, we can allow the user to opt out in order to include their own null treatment in case_expression.
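
A sketch of how the automatic insertion might work (the function name, argument names and opt-out flag below are made up for illustration):

def add_null_treatment(case_expression, col_name, user_handles_nulls=False):
    """Prepend a null-handling clause to a case expression, unless the user opts out."""
    if user_handles_nulls:
        return case_expression
    null_clause = f"when {col_name}_l is null or {col_name}_r is null then -1"
    # insert the clause straight after the leading 'case' keyword (lowercase assumed)
    return case_expression.replace("case", f"case {null_clause}", 1)

custom = "case when surname_l = surname_r then 1 else 0 end"
print(add_null_treatment(custom, "surname"))
# case when surname_l is null or surname_r is null then -1 when surname_l = surname_r then 1 else 0 end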

Save/Load params

Want the ability to save out current parameters, and load in pre-trained parameters.
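
A minimal sketch, assuming the parameters are available as a plain dict (as p.params is used earlier on this page); the helper names are made up:

import json

def save_params(params_dict, path):
    with open(path, "w") as f:
        json.dump(params_dict, f, indent=2)

def load_params(path):
    with open(path) as f:
        return json.load(f)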

Error in charts when converting nonetypes

  File "script_2020-03-31-15-24-33.py", line 254, in persist_params_settings
    params.all_charts_write_html_file(filename="sparklink_charts.html", overwrite=True)
  File "/mnt/yarn/usercache/root/appcache/application_1585667806720_0002/container_1585667806720_0002_01_000001/splink.zip/splink/params.py", line 431, in all_charts_write_html_file
    c2 = json.dumps(self.adjustment_factor_chart())
  File "/mnt/yarn/usercache/root/appcache/application_1585667806720_0002/container_1585667806720_0002_01_000001/splink.zip/splink/params.py", line 388, in adjustment_factor_chart
    data = self._convert_params_dict_to_normalised_adjustment_data()
  File "/mnt/yarn/usercache/root/appcache/application_1585667806720_0002/container_1585667806720_0002_01_000001/splink.zip/splink/params.py", line 187, in _convert_params_dict_to_normalised_adjustment_data
    row["adjustment"] = row["m"] / (row["m"] + row["u"])
TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'
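
A sketch of a possible guard for the failing line, assuming rows where m or u has not yet been estimated (i.e. is None) should be skipped rather than raise:

def normalised_adjustment(row):
    # row is a dict-like holding the m and u probabilities, as in params.py above
    m, u = row["m"], row["u"]
    if m is None or u is None:
        return None  # parameter not yet estimated; leave the adjustment undefined
    return m / (m + u)

print(normalised_adjustment({"m": 0.9, "u": None}))  # None rather than a TypeError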

Add intuition

Would be useful to add a function that outputs a 'story' about a row, giving the intuition for why it gave the answer it did.

Like:

Row 1:
Initial estimate of match probability = 20% (estimate of probability of match of a comparison drawn at random)
Field 1: Surname.
Comparison: Linacre vs Linacer.  Setting gamma =1
Prob of observing this given  match = blah
Prob of observing this given non-match = blah
Adjustment to probability = 0.7
Assessment of match probability so far:  Blah
Field 2: โ€ฆ blah
Final assessment of match probability:  Blah

'link_only' doesn't work

when trying to link with setting "link_type": "link_only"

this code

from sparklink import Sparklink

linker = Sparklink(settings, spark, df_l=persondf, df_r=censusdf, df=None)

gives

ValueError: For link_type = 'link_only', you must pass two Spark dataframes to Sparklink using the df_l and df_r argument. The df argument should be omitted or set to None. e.g. linker = Sparklink(settings, spark, df_l=my_first_df, df_r=df_to_link_to_first_one)

Term frequency adjustment calculation is flawed

Quality assurance by the Office for National Statistics has identified problems with the term frequency adjustment calculations.

I have verified this, finding that:

  • Where term frequency adjustments are applied to multiple columns, we double count the part of the calculation which 'undoes lambda'
  • I'm not convinced that the approach to calculating term frequency adjustments is correct.

The overall result is that, at present, term frequency adjustments are too extreme:

  • they have a tendency to (almost) always adjust the match probability up
  • they tend to adjust it up too much

Productionise

Probably need a 'productionise.py' which takes a params object and outputs SQL statements that can be run again

(1) Implement the same blocking rules that were used in training
(2) Use the weights in params to generate hard coded SQL statements to score matches.

Note that term frequency adjustments would need more thought
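
A sketch of step (2) - generating hard-coded scoring SQL from a params object. The params structure below is hypothetical, but the output mirrors the prob_gamma_*_match / prob_gamma_*_non_match case statements shown in the decimal-precision issue above:

def case_statements_from_params(params):
    """Turn m/u probability distributions into hard-coded CASE expressions."""
    statements = []
    for gamma_col, dists in params.items():
        for match_type in ("non_match", "match"):
            whens = "\n".join(
                f"when {gamma_col} = {level} then cast({p} as double)"
                for level, p in enumerate(dists[f"prob_dist_{match_type}"])
            )
            statements.append(
                f"case\nWHEN {gamma_col} = -1 THEN cast(1 as double)\n{whens}\nend as prob_{gamma_col}_{match_type}"
            )
    return ",\n".join(statements)

# hypothetical params structure, for illustration only
params = {"gamma_0": {"prob_dist_non_match": [0.28, 0.004, 0.71], "prob_dist_match": [0.24, 0.04, 0.72]}}
print(f"select *,\n{case_statements_from_params(params)}\nfrom df_with_gamma")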

Binderise

Would be good to have an interactive example that works in Binder

Helper functions to analyse skew in blocking rules

Something like this:

import re

def get_largest_blocks(blocking_rule, df, limit=5):
    # assumes an equi-join blocking rule like "l.col1 = r.col1 AND l.col2 = r.col2";
    # pull out the column names referenced on the left-hand side
    parts = re.split(" |=", blocking_rule)
    parts = [p for p in parts if "l." in p]
    parts = [p.replace("l.", "") for p in parts]

    col_expr = ", ".join(parts)
    
    filter_nulls_expr = " and ".join(f"{p} is not null" for p in parts)
    
    sql = f"""
    SELECT {col_expr}, count(*) 
    FROM df
    WHERE {filter_nulls_expr}
    GROUP BY {col_expr}
    ORDER BY count(*) desc
    LIMIT {limit}
    """
    df.registerTempTable("df")
    # note: spark (the SparkSession) is assumed to be available in the enclosing scope
    return spark.sql(sql)

for b in blocking_rules:
    df2 = get_largest_blocks(b, df)
    print(df2._jdf.showString(10, 20, False))

Optimise broadcast of term frequency adj

At the moment we have surname_r and surname_l in the broadcast table, effectively increasing its size by around a third. Since this table is broadcast, it's important to keep its size down.

Need to refactor the sql so that it only needs one column, surname

Inconsistent default no. of iterations

  • The settings json schema says max_iterations defaults to 100.
  • iterate() defaults to num_iterations=10
  • However, get_scored_comparisons() in the Sparklink class specifies num_iterations=20

Does not take into account the specified max_iterations in settings.

Linking on columns with different names

comparison_columns in settings assumes both columns have the same name in the two datasets. Fine for deduping, but not necessarily the case when linking (e.g. l.birth_date = r.dob).
