djvanderlaan / reclin2

Record Linkage Toolkit for R

License: GNU General Public License v3.0

R 94.62% Makefile 0.34% C++ 4.41% CSS 0.63%

reclin2's Issues

problink_em fails with extremely high and low comparator scores

When the comparison scores produced by compare_pairs reach extreme values (complete dissimilarity or complete similarity, i.e. 0 or 1), problink_em sometimes fails to estimate probabilities. Here is a reproducible example:

library(reclin2)

ex1 <- data.frame(
  a = "aaaaa",
  b = "zzzzz"
)

ex2 <- data.frame(
  a = "bbbbb",
  b = "zzzzz"
)

pairs <- pair(ex1, ex2)
pairs <- compare_pairs(pairs, on = c("a", "b"), default_comparator = cmp_jarowinkler())

m <- problink_em(~ a + b, data = pairs)

# Error in if (mprobs[[col]] > mprob_max) mprobs[[col]] <- mprob_max : 
#  missing value where TRUE/FALSE needed

I am not entirely sure when exactly this happens, but it seems to be linked to absolute 0's and absolute 1's being present in the same row. I dug a bit into the source code of problink_em and found that the error is caused by an NA value produced by a division by 0. In particular, the following line seems to produce this NA value (from problink_em.R#L77):

gm <- p*a / (p*a + (1-p)*b)
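The failure mode can be reproduced in isolation with base R. This is a minimal sketch, assuming that both `a` and `b` are exactly 0 for some comparison pattern; the values of `p`, `a` and `b` and the helper `safe_posterior` are illustrative assumptions and are not taken from the reclin2 source.

```r
# Illustrative values only: a prior match probability and two
# conditional probabilities that have both collapsed to zero.
p <- 0.05
a <- 0
b <- 0

gm <- p * a / (p * a + (1 - p) * b)  # 0 / 0
is.nan(gm)  # TRUE

# Comparing NaN with a number gives NA, which `if` cannot handle.
res <- tryCatch(
  if (gm > 0.999) "capped" else "ok",
  error = function(e) conditionMessage(e)
)
print(res)

# Hypothetical guard (an assumption, not the package's fix): fall back
# to the prior p whenever the denominator is zero.
safe_posterior <- function(p, a, b) {
  denom <- p * a + (1 - p) * b
  ifelse(denom > 0, p * a / denom, p)
}
safe_posterior(p, a, b)
```

Falling back to the prior is only one possible choice; bounding `a` and `b` away from 0 before the division would be another.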

This might seem like a fringe case (possibly reinforced by my absurd reprex), but in fact this came up in my day-to-day work with reclin2 on a regular basis. Sometimes, I'm able to work around this problem by just adding or subtracting a marginal number from the extreme scores, but this is not a reliable workaround, as the problem also seems to occur in case of scores very close to 0/1. See e.g. the following example in which a score of 0.999 leads to an error, but a score of 0.9 does not:

pairs <- pair(ex1, ex2)
pairs <- compare_pairs(pairs, on = c("a", "b"), default_comparator = cmp_jarowinkler())

pairs$a <- pairs$a + 0.001
pairs$b <- pairs$b - 0.001
m <- problink_em(~ a + b, data = pairs)

# Error in if (mprobs[[col]] > mprob_max) mprobs[[col]] <- mprob_max : 
#   missing value where TRUE/FALSE needed


pairs <- pair(ex1, ex2)
pairs <- compare_pairs(pairs, on = c("a", "b"), default_comparator = cmp_jarowinkler())

pairs$a <- pairs$a + 0.1
pairs$b <- pairs$b - 0.1
m <- problink_em(~ a + b, data = pairs)

# M- and u-probabilities estimated by the EM-algorithm:
#  Variable M-probability U-probability
#         a             0         1e-04
#         b             0         1e-04
#
# Matching probability: 0.0001370404.

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reclin2_0.3.4     data.table_1.14.8

loaded via a namespace (and not attached):
[1] compiler_4.3.1    parallel_4.3.1    tools_4.3.1       Rcpp_1.0.10       lpSolve_5.6.19    stringdist_0.9.10

deduplication vignette without true/official records

The deduplication vignette you provided seems to be for the case in which you have a set of true/official records. What if I just wanted to deduplicate based on some kind of fuzzy matching criteria because I don't have access to any true/official records? This seems to be more common for most of what I'm doing. Any suggestions or direction is appreciated.

After cluster_collect compare_vars seems to give incorrect results

Check if this is the case. Possibly the mapping between x and .x in the data.table is incorrect.

Add `score_simple` to vignettes

Introduction: Misspelled function name

Hi,

in your introduction https://htmlpreview.github.io/?https://github.com/djvanderlaan/reclin2/blob/master/inst/doc/introduction.html
you name the function "pairs_minsim", however it is called pair_minsim.

Easy fix!
Have a nice day.

After cluster_collect data.table complains that the pointer is not valid and creates a copy

Score simsum

Add function like score_simsum from reclin

Ellipsis is not passed on in compare_vars

I found out that the ellipsis is not passed on by reclin2::compare_vars, contrary to what the documentation claims.

Example:
df <- data.frame(c1=c(1))
pairs <- reclin2::pair(df, df)

# Comparator that prints every argument it receives
g <- function(...) {
  l <- list(...)
  print(l)
  return(l[[1]] == l[[2]])
}

reclin2::compare_vars(pairs, 'newvar', on_x = 'c1', on_y = 'c1', comparator = g, delta = 2)

Result:
[[1]]
[1] 1

[[2]]
[1] 1

First data set: 1 records
Second data set: 1 records
Total number of pairs: 1 pairs

   .x .y newvar
1:  1  1   TRUE

Expected result:
delta should also be visible as an argument to g.
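The expected behaviour can be sketched in base R. `apply_comparator` is a hypothetical stand-in for the comparison step, not reclin2's implementation; it only shows what forwarding `...` to the comparator looks like.

```r
# Hypothetical stand-in, not reclin2 code: forward any extra
# arguments to the comparator.
apply_comparator <- function(x, y, comparator, ...) {
  comparator(x, y, ...)
}

# Comparator that reports the names of the arguments it receives.
g <- function(...) {
  l <- list(...)
  print(names(l))
  l[[1]] == l[[2]]
}

res <- apply_comparator(1, 1, comparator = g, delta = 2)
```

With forwarding in place, `delta` shows up among the argument names seen by `g`.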

Export greedy

Export greedy: for some applications it is useful to have direct access to the currently internal greedy function. match_n_to_m is also exported.

greedy() : max n/m not always respected

The function reclin2::greedy, which greedily selects the 'best' matches within arity bounds, does not always respect those bounds.

#BUG: Should select 4, not 5!
reclin2:::greedy(1:5, c(1,1,2,2,2), rep(1, 5), n=2, m=2)

This was found using

df_11 <- data.frame(x=c(1,2))
df_23 <- data.frame(x=c(1,1,2,2,2))

and a join on x (no blocking).

Note that swapping the arguments does give the right solution.

reclin2:::greedy(c(1,1,2,2,2), 1:5, rep(1, 5), n=2, m=2)
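For reference, the intended behaviour can be written down in a few lines of base R. This is a sketch under the assumption that `n` bounds how often each value of `x` may be selected and `m` bounds how often each value of `y` may be selected; `greedy_sketch` is a hypothetical name and does not reproduce the package's internal code.

```r
# Hypothetical reference implementation of greedy selection with bounds.
greedy_sketch <- function(x, y, w, n = 1L, m = 1L) {
  stopifnot(length(x) == length(y), length(y) == length(w))
  kx <- as.character(x)
  ky <- as.character(y)
  # Number of times each record has been selected so far.
  count_x <- setNames(integer(length(unique(kx))), unique(kx))
  count_y <- setNames(integer(length(unique(ky))), unique(ky))
  selected <- logical(length(x))
  # Visit pairs from highest to lowest weight.
  for (i in order(w, decreasing = TRUE)) {
    if (count_x[[kx[i]]] < n && count_y[[ky[i]]] < m) {
      selected[i] <- TRUE
      count_x[[kx[i]]] <- count_x[[kx[i]]] + 1L
      count_y[[ky[i]]] <- count_y[[ky[i]]] + 1L
    }
  }
  selected
}

sum(greedy_sketch(1:5, c(1, 1, 2, 2, 2), rep(1, 5), n = 2, m = 2))
sum(greedy_sketch(c(1, 1, 2, 2, 2), 1:5, rep(1, 5), n = 2, m = 2))
```

Both calls select four pairs, because the value 2 occurs three times on one side and the bound on that side is 2.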

Consistency with inplace = TRUE for the predict- function

Hello!
In general, the functions in your package do not print anything to the console when inplace = TRUE. This makes sense, as "pairs" is modified in place.
However, predict seems to be the only function that still prints the result. I find this behaviour inconsistent, and I think it can easily be fixed.

library(reclin2)

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")

compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"), inplace = TRUE) #use inplace, to avoid copying of object to save memory
print(pairs)

m <- problink_em(~ lastname + firstname + address + sex, data = pairs) #pairs should again be the first argument!

predict(m, pairs = pairs, add = TRUE, inplace = TRUE) #Should not print when using inplace = TRUE, if to be consistent with compare_pairs()
print(pairs)

Also note the other comment I made in the code: for consistency, I feel that data should be the first argument of problink_em as well.
If you want, I can make a PR with the appropriate changes; I don't want to put it all on you. But you are probably faster, as this is your package. Your call!

Have a nice day!

Alternative to filter_pairs_for_deduplication()

Hi.

I'm a new R user. I have adapted an R script to remove duplicates from my database. This script was made using the reclin package.
When I tried to pass this script on to my colleagues, they were unable to install reclin, which is no longer available.
Now I'm trying to adapt this script to reclin2. But I couldn't find an equivalent function to filter_pairs_for_deduplication()

Is there an alternative to this function?

Thank you.
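The idea behind such a filter can be sketched in base R: when a data set is paired with itself, keep each unordered pair once and drop pairs of a record with itself. The sketch assumes pairs are identified by row indices `.x` and `.y`, as in the reclin2 output printed elsewhere on this page; the example data are made up. Other issues on this page call pair_blocking and pair_minsim with `deduplication = TRUE`, which may make a separate filter unnecessary.

```r
# Made-up example data.
x <- data.frame(name = c("anna", "ana", "bob"))

# All combinations of the data set with itself.
all_pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(x)))

# Keep each unordered pair once and drop self-pairs.
dedup_pairs <- all_pairs[all_pairs$.y > all_pairs$.x, ]
print(dedup_pairs)
```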

cluster_pair_blocking() returning pairs with different block value

Hi, I switched from reclin to reclin2 and really appreciate the performance boost. When working with cluster_pair_blocking() and link(), I encountered an issue: I received pairs that shouldn't be compared based on the blocking variable.

To illustrate, here's the workflow without a cluster:

library(reclin2)
library(dplyr)

# Create an example dataset for deduplication
x <- data.frame(
  block = rep(letters, each = 100),
  comp = sample(1:3, length(letters) * 100, replace = TRUE)
)

# Run through reclin2 workflow without a cluster
pairs_linked <- x %>%
  pair_blocking(
    deduplication = TRUE,
    on = "block"
  ) %>%
  compare_pairs(on = "comp") %>%
  link()

different_blocks <- pairs_linked %>%
  filter(block.x != block.y)

stopifnot(nrow(different_blocks) == 0)

That works as expected. No pairs with different blocking variable values.

Now to run it on a cluster:

library(parallel)
cluster <- makeCluster(2)

pairs_linked_cluster <- x %>%
  cluster_pair_blocking(
    deduplication = TRUE,
    on = "block",
    cluster = cluster
  ) %>%
  compare_pairs(on = "comp") %>%
  cluster_collect() %>% 
  link()

stopCluster(cluster)

different_blocks_cluster <- pairs_linked_cluster %>%
  filter(block.x != block.y)

stopifnot(nrow(different_blocks_cluster) == 0)

This time, there's an error because some pairs differ in their "block" variable.

Does link() have to operate on the worker nodes? Or could there be some other id confusion when running on a cluster?

Result of sessionInfo():

R version 4.1.0 (2021-05-18)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 12.5.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] dplyr_1.0.9       reclin2_0.1.1     data.table_1.14.0

loaded via a namespace (and not attached):
 [1] stringdist_0.9.7 Rcpp_1.0.7       fansi_0.5.0      crayon_1.4.1     assertthat_0.2.1 utf8_1.2.2       R6_2.5.1         DBI_1.1.1        lifecycle_1.0.1  magrittr_2.0.1  
[11] pillar_1.8.0     rlang_1.0.4      cli_3.3.0        renv_0.14.0      rstudioapi_0.13  vctrs_0.4.1      generics_0.1.3   tools_4.1.0      glue_1.6.2       purrr_0.3.4     
[21] compiler_4.1.0   pkgconfig_2.0.3  tidyselect_1.1.2 tibble_3.1.8     lpSolve_5.6.16

question on pair selection via greedy and m-to-n

Dear Jan,

I am using the R reclin2 package you published to solve a record matching research question.
After applying a similarity calculation to assign a matching score to each combination of records I would like to use functions “greedy” and “m-to-n” to select the pairs that are considered to belong to the same record in the base record set.
Here, I can observe that some records are matched more than once, which obviously can lead to incorrect results. I am working with version 0.3.5 and understand that an issue has been fixed with the new version 0.5.0 for the select_greedy function, but not for the m-to-n function. Is it possible that the same kind of issue you fixed in greedy is also currently observable in m-to-n?

For reproduction I attach an excel file containing an example:
record 148 is paired with record 150 (row 110) and at the same time record 150 is paired with record 151 (row 132).
Same for records 152/154 in row 152 and 154/156 (greedy, row 167) or 154/156 (m-to-n, row 168).
Label columns were generated using this code:

greedy approach

dt_pairs_application <- 
  reclin2::select_greedy(pairs = dt_pairs_application, 
                         variable = "LABEL_LM_AGG_SIM_GREEDY_APPROACH", 
                         score = "MATCH_PROB_LM_AGG_SIM",
                         threshold = 0.5 )

m to n approach

dt_pairs_application <- 
  reclin2::select_n_to_m(pairs = dt_pairs_application, 
                         variable = "LABEL_LM_AGG_SIM_N_TO_M_APPROACH", 
                         score = "MATCH_PROB_LM_AGG_SIM",
                         threshold = 0.5 )

dt_check.xlsx

select_preprocess does not handle id_x and id_y correctly

library(reclin2)

data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
print(pairs)

pairs[, score := runif(nrow(pairs))]

reclin2:::select_preprocess(pairs, score = "score", id_x = "id", id_y = "id")

Result:

> reclin2:::select_preprocess(pairs, score = "score", id_x = "id", id_y = "id")
    .x .y      score index
 1:  1  2 0.61505326     1
 2:  2  3 0.12873553     2
 3:  3  4 0.87160600     3
 4:  4  6 0.76758030     4
 5:  5  7 0.03668445     5
 6:  6 NA 0.76429553     6
 7: NA NA 0.03718838     7
 8: NA NA 0.32234890     8
 9: NA NA 0.63785324     9
10: NA NA 0.55063922    10
11: NA NA 0.96634700    11
12: NA NA 0.70562979    12
13: NA NA 0.18539430    13
14: NA NA 0.35664960    14
15: NA NA 0.68076638    15
16: NA NA 0.97718373    16
17: NA NA 0.73858780    17

Blocking using approximate nearest neighbours algorithms

I am writing to let you know that I have developed a small package called [blocking](https://github.com/ncn-foreigners/blocking) that blocks records using approximate nearest neighbours algorithms (RcppAnnoy, RcppHNSW and mlpack) and graphs (igraph). The package includes the function pair_ann, which is modelled on pair_blocking and pair_minsim to allow direct integration with your package.

Here is the code using the reclin2 sample data:

library(blocking)
library(reclin2)

data("linkexample1", "linkexample2", package = "reclin2")

linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)

# pairing records from linkexample2 to linkexample1 based on the txt column

pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.75) |>
  link(selection = "threshold")

Feel free to test and comment. I plan to submit this package to CRAN in December.

How to keep diagonal in pair_minsim?

Dear djvanderlaan,

First of all, thank you for your wonderful package, for saving us so much manual work, and for introducing me to it!
My data looks like the example below. I am clustering using stringdist and pair_minsim.

library(tidyverse)
library(reclin2)
library(stringdist)

df <- tibble(fruits=c("anana","ananna","peach"))


my_dist <- function(x, y, ...){
  1 - stringdist(x, y, method=c("lv"))
}

p <- pair_minsim(df, on = "fruits", deduplication = TRUE,
                 default_comparator = my_dist, minsim = -2) 

p %>% 
  as.data.frame()
#>   .x .y simsum
#> 1  1  2      0

Created on 2022-03-18 by the reprex package (v2.0.1)

I have three fruits: anana, ananna, and peach, and I want to cluster them based on similarity. anana and ananna are very close, but peach is very far away. However, I want to keep peach in the data for further analysis even if it does not cluster with anything but itself. I only want to keep peach itself, not its relationships with the others, so that the result looks like this:

   .x .y simsum
    1  2      0
    3  3      1

Do you have an idea how I can do this through pair_minsim?
Honestly, thank you for your time
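One way to get this shape is to post-process the pairs rather than change pair_minsim. The following base R sketch appends a self-pair for every record that does not occur in any pair. `add_singletons` and `self_score` are hypothetical names, and the result is a plain data.frame, so it is an assumption that later steps can work with it.

```r
# Hypothetical helper: add a (i, i) pair for each record without a partner.
add_singletons <- function(p, n_records, self_score = 1) {
  unpaired <- setdiff(seq_len(n_records), c(p$.x, p$.y))
  self_pairs <- data.frame(
    .x = unpaired,
    .y = unpaired,
    simsum = rep(self_score, length(unpaired))
  )
  rbind(p, self_pairs)
}

# Pairs as printed in the reprex: only records 1 and 2 are paired.
p <- data.frame(.x = 1L, .y = 2L, simsum = 0)
out <- add_singletons(p, n_records = 3)
print(out)
```

The self-pair score of 1 corresponds to `1 - stringdist(x, x)` for the comparator in the reprex, since the distance of a string to itself is 0.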

Blocking on multiple variables with OR condition

I'm trying to link dataframes dfA and dfB. I don't want to rely solely on one variable for blocking, so I'd like to use two blocking variables and consider the union of the two resulting sets of pairs as my potential pairs. Unfortunately, this case doesn't seem to be supported by the pair_blocking function.

I tried the following instruction to create a new pairs object:
pairs <- unique(rbindlist(list(pair_blocking(dfA, dfB, "postcode"), pair_blocking(dfA, dfB, "birthyear"))))

It seems like I'm losing information in the process, as I get the following error when calling compare_pairs on this object:
Error in UseMethod("compare_pairs", pairs): no applicable method for 'compare_pairs' applied to an object of class "c('data.table', 'data.frame')"
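The error message indicates that the combined object is a plain data.table and no longer has the class that compare_pairs dispatches on. Independently of reclin2, the union of two blocking passes can be sketched in base R on row indices. `block_pairs` is a hypothetical helper and the data are made up; the result is a plain data.frame of candidate index pairs, not a reclin2 pairs object.

```r
# Hypothetical helper: candidate pairs (row indices) that agree on one key.
block_pairs <- function(x, y, on) {
  x$.x <- seq_len(nrow(x))
  y$.y <- seq_len(nrow(y))
  merge(x[c(".x", on)], y[c(".y", on)], by = on)[c(".x", ".y")]
}

# Made-up example data.
dfA <- data.frame(postcode = c("A", "B"), birthyear = c(1990, 1991))
dfB <- data.frame(postcode = c("A", "C"), birthyear = c(1991, 1990))

# Union of the two blocking passes, with duplicate pairs removed.
candidates <- unique(rbind(
  block_pairs(dfA, dfB, "postcode"),
  block_pairs(dfA, dfB, "birthyear")
))
print(candidates)
```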

error after "predict"

Troublesome object masking when attaching the package

When attaching the package, the identical comparator function masks base::identical:

> library(reclin2)

Attaching package: ‘reclin2’

The following object is masked from ‘package:base’:

    identical

This leads to some functions using getOption("viewer") to just stop working:

> leaflet::leaflet()
Error in identical(height, "maximize") : 
  unused arguments (height, "maximize")

> shiny::paneViewer()("")
Error in identical(height, "maximize") : 
  unused arguments (height, "maximize")

> getOption("viewer")
function (url, height = NULL) 
{
    if (!is.character(url) || (length(url) != 1)) 
        stop("url must be a single element character vector.", 
            call. = FALSE)
    if (identical(height, "maximize")) 
        height <- -1
    if (!is.null(height) && (!is.numeric(height) || (length(height) != 
        1))) 
        stop("height must be a single element numeric vector or 'maximize'.", 
            call. = FALSE)
    invisible(.Call("rs_viewer", url, height, PACKAGE = "(embedding)"))
}

Currently, the only reliable workaround I'm aware of is to just load reclin2 instead of attaching it. Since base::identical is a pretty common function and this issue is quite hard to debug or even recognize, it might be a good idea to rename the comparator function.
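The mechanism can be imitated without any packages. In this sketch an environment stands in for the attached package and defines an `identical` that takes no arguments, which is an assumption based on the "unused arguments" error above; `viewer_check` is a made-up stand-in for the viewer function.

```r
# Stand-in for the attached package: `identical` is a zero-argument
# factory that returns a comparison function (assumed shape).
masking <- new.env(parent = baseenv())
assign("identical", function() function(x, y) x == y, envir = masking)

# Made-up function that resolves `identical` through the masking environment.
viewer_check <- function(height) identical(height, "maximize")
environment(viewer_check) <- masking

res <- tryCatch(viewer_check("maximize"),
                error = function(e) conditionMessage(e))
print(res)

# Qualifying the call with base:: sidesteps the mask.
viewer_check_fixed <- function(height) base::identical(height, "maximize")
environment(viewer_check_fixed) <- masking
viewer_check_fixed("maximize")
```

Since R 3.6.0, `library()` also has an `exclude` argument, so `library(reclin2, exclude = "identical")` may be another way to avoid attaching the conflicting name; this has not been tested here.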

Distance Calculation, Longitude Latitude

Hi there,

I'm working with a dataset that includes latitude and longitude, and I would love to use the distance between points as one of my comparison variables. I'm happy to implement this calculation myself in a comparator, but I wonder whether the package already has a pre-built function for it, or whether you have suggestions on how to go about it.

All the Best,
Frankie
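A great-circle distance can be computed in base R with the haversine formula. The sketch below assumes coordinates in decimal degrees and a spherical Earth with a radius of 6371 km; `haversine_km` is a hypothetical helper, and how to plug it into a reclin2 comparator is left open.

```r
# Haversine distance in kilometres between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# One degree of longitude along the equator.
haversine_km(0, 0, 0, 1)
```

The function is vectorised, so it can be applied to whole columns of paired coordinates at once. A distance can be turned into a similarity, for example by thresholding it or by a decreasing transformation.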

`identical` calls `identical` and not `cmp_identical`

Block pairs based on inequality condition?

Hello, thanks for the package and for the vignettes, they are extremely helpful.

One question I have is whether it is possible to do some kind of "inequality blocking" (I'm a beginner in the record linkage domain, maybe there's a specific term for this). In your first vignette, you block on the postal code, so it is quite easy to block pairs based on this condition. In my setting, I would like to add a limit on the year.

Basically, I have a list of people who moved from country A to country B at different years (with names and year of arrival) and I have a list of letters (with the sender names and the year in which they were written). I'd like to match letter senders to passengers lists. In this case, I would like to ignore pairs of writers-passengers where the year of arrival > year of letter (the letter has to be written after the person arrived in the country). Therefore, I would like to do something like pair_blocking(passengers, writers, "year_arrival < year_written"), but that doesn't seem possible yet.

Is this kind of blocking in the scope of this package? Otherwise, are there other packages / techniques that you could recommend?

Thanks again
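The condition itself is easy to express on row indices in base R. This sketch builds all candidate pairs that satisfy `year_arrival <= year_written`; the data are made up, and the result is a plain data.frame of indices rather than a reclin2 pairs object. It materialises a full comparison matrix, so it only suits moderately sized inputs.

```r
# Made-up example data.
passengers <- data.frame(name = c("ann", "bo"),
                         year_arrival = c(1900, 1910))
writers <- data.frame(name = c("ann", "bo", "cy"),
                      year_written = c(1905, 1895, 1910))

# TRUE where the letter was written in or after the year of arrival.
keep <- outer(passengers$year_arrival, writers$year_written, "<=")

# Row and column positions of the TRUE cells are the candidate pairs.
idx <- which(keep, arr.ind = TRUE)
candidates <- data.frame(.x = idx[, "row"], .y = idx[, "col"])
print(candidates)
```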

select k nearest neighbours

Currently, there are two ways to select matching records:

- 1:1 matching using select_greedy or select_n_to_m
- Selection of pairs above a certain threshold using select_threshold

However, in the latter case, the threshold must be specified directly, even though the distribution of the scores is not always known beforehand.

Alternatively, it should be possible to select the top k nearest neighbours, i.e. the k highest-scoring candidate matches.

Similarly to select_threshold, a select_top_k command could look like this:

pairs <- select_top_k(pairs, "top_weights", score = "weights", k = 8)

The above command can be implemented by hand in two lines of code using data.table::frank and data.table::fifelse, but I think it would be more convenient to implement it in the package directly.
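A by-hand version might look like the following sketch. It uses base R `ave` and `rank` instead of data.table::frank, and it assumes that "top k" means the k highest scores per record of x (column `.x`); `select_top_k_sketch` is a hypothetical name.

```r
# Hypothetical helper: flag the k highest-scoring pairs for each record in x.
select_top_k_sketch <- function(pairs, variable, score, k) {
  # Rank scores within each .x, highest score first; ties are broken by order.
  r <- ave(-pairs[[score]], pairs$.x,
           FUN = function(s) rank(s, ties.method = "first"))
  pairs[[variable]] <- r <= k
  pairs
}

# Made-up example pairs.
pairs <- data.frame(
  .x = c(1, 1, 1, 2, 2),
  .y = c(1, 2, 3, 1, 2),
  weights = c(0.9, 0.2, 0.5, 0.1, 0.8)
)
res <- select_top_k_sketch(pairs, "top_weights", score = "weights", k = 2)
print(res)
```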

Add `select_unique` to vignettes

pair_minsim drops pairs when there are missing values

As reported in issue #8.

library(reclin2)
data(linkexample1, linkexample2)
linkexample1$postcode[1] <- NA
pair_minsim(linkexample1, linkexample2, on = c("postcode", "lastname"), minsim = 0.5)

This should result in a pair between record 1 from x and record 1 from y.

R 3.6.0 dependency?

Hi, is there any reason the package is dependent on R 3.6.0?

I tried to build the package for 3.5.3 and received the following message:

* checking for file ‘reclin2-master/DESCRIPTION’ ... OK
* preparing ‘reclin2’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to process help pages
* saving partial Rd database
Error in loadVignetteBuilder(pkgdir, TRUE) : 
  vignette builder 'simplermarkdown' not found
Execution halted

I was able to install the package without a build, but just curious in case I run into any trouble.