djvanderlaan / reclin2
Record Linkage Toolkit for R
License: GNU General Public License v3.0
Sometimes, when the comparison scores produced by compare_pairs reach extreme values such as complete (dis)similarity (0 or 1), problink_em fails to estimate probabilities. Here is a reproducible example:
library(reclin2)
ex1 <- data.frame(
a = "aaaaa",
b = "zzzzz"
)
ex2 <- data.frame(
a = "bbbbb",
b = "zzzzz"
)
pairs <- pair(ex1, ex2)
pairs <- compare_pairs(pairs, on = c("a", "b"), default_comparator = cmp_jarowinkler())
m <- problink_em(~ a + b, data = pairs)
# Error in if (mprobs[[col]] > mprob_max) mprobs[[col]] <- mprob_max :
# missing value where TRUE/FALSE needed
I am not entirely sure when exactly this happens, but it seems to be linked to absolute 0's and absolute 1's being present in the same row. I dug a bit into the source code of problink_em and figured out that the error occurs due to an NA value produced by a division by 0. In particular, the following line seems to produce this NA value (from problink_em.R#L77):
gm <- p*a / (p*a + (1-p)*b)
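A minimal sketch of how this expression can yield a missing value, assuming p is a prior match probability and a and b are the likelihoods of a comparison pattern under the match and non-match models. The helper safe_posterior and its fallback to the prior are hypothetical and not part of reclin2.

```r
# Posterior match probability, written as in the quoted line.
posterior <- function(p, a, b) p * a / (p * a + (1 - p) * b)

# When both likelihoods are exactly 0 the denominator is 0 and R returns NaN.
res <- posterior(p = 0.1, a = 0, b = 0)
is.nan(res)

# A later test such as `if (res > 0.999)` then stops with
# "missing value where TRUE/FALSE needed".

# Hypothetical guard (not reclin2 code): fall back to the prior p
# whenever the denominator is numerically zero.
safe_posterior <- function(p, a, b, eps = .Machine$double.eps) {
  num <- p * a
  den <- num + (1 - p) * b
  ifelse(den < eps, p, num / den)
}
safe_posterior(p = 0.1, a = 0, b = 0)
```

Falling back to the prior is only one possible choice; clamping the likelihoods away from 0 would be another.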
This might seem like a fringe case (possibly reinforced by my absurd reprex), but in fact this came up in my day-to-day work with reclin2 on a regular basis. Sometimes, I'm able to work around this problem by just adding or subtracting a marginal number from the extreme scores, but this is not a reliable workaround, as the problem also seems to occur in case of scores very close to 0/1. See e.g. the following example in which a score of 0.999 leads to an error, but a score of 0.9 does not:
pairs <- pair(ex1, ex2)
pairs <- compare_pairs(pairs, on = c("a", "b"), default_comparator = cmp_jarowinkler())
pairs$a <- pairs$a + 0.001
pairs$b <- pairs$b - 0.001
m <- problink_em(~ a + b, data = pairs)
# Error in if (mprobs[[col]] > mprob_max) mprobs[[col]] <- mprob_max :
# missing value where TRUE/FALSE needed
pairs <- pair(ex1, ex2)
pairs <- compare_pairs(pairs, on = c("a", "b"), default_comparator = cmp_jarowinkler())
pairs$a <- pairs$a + 0.1
pairs$b <- pairs$b - 0.1
m <- problink_em(~ a + b, data = pairs)
# M- and u-probabilities estimated by the EM-algorithm:
# Variable M-probability U-probability
# a 0 1e-04
# b 0 1e-04
#
# Matching probability: 0.0001370404.
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reclin2_0.3.4 data.table_1.14.8
loaded via a namespace (and not attached):
[1] compiler_4.3.1 parallel_4.3.1 tools_4.3.1 Rcpp_1.0.10 lpSolve_5.6.19 stringdist_0.9.10
The deduplication vignette you provided seems to be for the case in which you have a set of true/official records. What if I just wanted to deduplicate based on some kind of fuzzy matching criteria because I don't have access to any true/official records? This seems to be more common for most of what I'm doing. Any suggestions or direction is appreciated.
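As a rough illustration of deduplication without any reference records, the sketch below uses base R only (not the reclin2 API): it scores all pairs with a normalised edit-distance similarity, treats pairs above an arbitrary threshold as duplicates, and groups records into connected components. The data, the threshold of 0.7 and all variable names are made up for illustration.

```r
# Toy records; invented for this example.
records <- c("jan smit", "jan smith", "piet de vries", "p. de vries", "anna bakker")

# Normalised Levenshtein similarity between all pairs of records.
d   <- utils::adist(records)
len <- outer(nchar(records), nchar(records), pmax)
sim <- 1 - d / len

# Pairs at or above the (arbitrary) threshold count as duplicates.
threshold <- 0.7
is_dup <- sim >= threshold

# Connected components by label propagation: every record starts in its
# own group and repeatedly takes the smallest label among its duplicates.
group <- seq_along(records)
repeat {
  new_group <- vapply(seq_along(records),
                      function(i) min(group[is_dup[i, ]]), numeric(1))
  if (all(new_group == group)) break
  group <- new_group
}
data.frame(records, group)
```

Comparing all pairs scales quadratically, so for larger data some form of blocking would still be needed before scoring.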
Check if this is the case. Possibly the mapping between x and .x in the data.table is incorrect.
Hi,
in your introduction https://htmlpreview.github.io/?https://github.com/djvanderlaan/reclin2/blob/master/inst/doc/introduction.html
you name the function "pairs_minsim", however it is called pair_minsim.
Easy fix!
Have a nice day.
Add a function like score_simsum from reclin.
I found out that the ellipsis (...) is not passed on by reclin2::compare_vars, contrary to what the documentation states.
Example:
df <- data.frame(c1=c(1))
pairs <- reclin2::pair(df, df)
g <- function(...){
l <- list(...)
print(l)
return(l[[1]]==l[[2]])
}
compare_vars(pairs, 'newvar', on_x='c1', on_y='c1', comparator=g, delta=2)
Result:
[[1]]
[1] 1
[[2]]
[1] 1
First data set: 1 records
Second data set: 1 records
Total number of pairs: 1 pairs
.x .y newvar
1: 1 1 TRUE
Expected result:
delta is also visible as an argument to g.
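For comparison, this is what forwarding of extra arguments looks like in plain R. The wrapper apply_comparator is a hypothetical stand-in, not the actual compare_vars implementation, and within_delta is an invented comparator.

```r
# Hypothetical stand-in for a function that applies a comparator to two
# vectors and forwards any extra arguments to it through `...`.
apply_comparator <- function(x, y, comparator, ...) {
  comparator(x, y, ...)
}

# Invented comparator that uses an extra argument `delta`.
within_delta <- function(x, y, delta = 0) {
  abs(x - y) <= delta
}

# `delta = 2` reaches within_delta via the dots.
apply_comparator(c(1, 5), c(2, 9), comparator = within_delta, delta = 2)
```

If the dots were dropped inside the wrapper, within_delta would silently fall back to its default delta = 0.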
Export greedy; for some applications it is useful to have direct access to the currently internal greedy; match_n_to_m is also exported.
The function reclin2::greedy, which greedily selects the 'best' matches within arity bounds, does not always respect those bounds.
#BUG: Should select 4, not 5!
reclin2:::greedy(1:5, c(1,1,2,2,2), rep(1, 5), n=2, m=2)
This was found using
df_11 <- data.frame(x=c(1,2))
df_23 <- data.frame(x=c(1,1,2,2,2))
and a join on x (no blocking).
Note that swapping the arguments does give the right solution.
reclin2:::greedy(c(1,1,2,2,2), 1:5, rep(1, 5), n=2, m=2)
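As a reference for the expected behaviour, here is a small pure-R sketch of greedy selection under arity bounds. It is not the package's implementation; it assumes that each element of x may be selected at most n times and each element of y at most m times, and the name greedy_sketch is hypothetical.

```r
# Walk through the pairs from highest to lowest weight and accept a pair
# only while both of its records are still below their bound.
greedy_sketch <- function(x, y, w, n = 1, m = 1) {
  xi <- match(x, unique(x))          # recode ids to 1..K
  yi <- match(y, unique(y))
  used_x <- integer(max(xi))         # how often each x has been selected
  used_y <- integer(max(yi))         # how often each y has been selected
  selected <- logical(length(x))
  for (i in order(w, decreasing = TRUE)) {
    if (used_x[xi[i]] < n && used_y[yi[i]] < m) {
      selected[i] <- TRUE
      used_x[xi[i]] <- used_x[xi[i]] + 1L
      used_y[yi[i]] <- used_y[yi[i]] + 1L
    }
  }
  selected
}

# Both argument orders select 4 of the 5 pairs.
sum(greedy_sketch(1:5, c(1, 1, 2, 2, 2), rep(1, 5), n = 2, m = 2))
sum(greedy_sketch(c(1, 1, 2, 2, 2), 1:5, rep(1, 5), n = 2, m = 2))
```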
Hello!
In general, the functions in your package do not print anything to the console when inplace = TRUE. This makes sense, as "pairs" is modified in place.
However, the predict function seems to be the only function which still prints out the result. I find this behaviour inconsistent and I think that you can easily fix it.
library(reclin2)
data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"), inplace = TRUE) #use inplace, to avoid copying of object to save memory
print(pairs)
m <- problink_em(~ lastname + firstname + address + sex, data = pairs) #pairs should again be the first argument!
predict(m, pairs = pairs, add = TRUE, inplace = TRUE) #Should not print when using inplace = TRUE, if to be consistent with compare_pairs()
print(pairs)
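To illustrate the consistency point in plain R: a function can return its result invisibly when it has modified its input in place. The function add_score below is hypothetical, and an environment stands in for a data.table; this is not the actual predict method.

```r
# Hypothetical function that modifies its argument by reference.
add_score <- function(obj, inplace = FALSE) {
  obj$score <- 1
  # Return invisibly for in-place use so nothing is auto-printed.
  if (inplace) invisible(obj) else obj
}

e <- new.env()
add_score(e, inplace = TRUE)                       # prints nothing at top level
withVisible(add_score(e, inplace = TRUE))$visible  # FALSE
```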
Also note another comment I made: In problink_em I feel like for consistency, data should again be the first argument in this function.
If you want I can maybe also make a pr with the appropriate changes, I don't want to put it all on you. But you are probably faster as this is your package. Your call!
Have a nice day!
Hi.
I'm a new R user. I have adapted an R script to remove duplicates from my database. This script was made using the reclin package.
When I tried to pass this script on to my colleagues, they were unable to install reclin, which is no longer available.
Now I'm trying to adapt this script to reclin2. But I couldn't find an equivalent function to filter_pairs_for_deduplication()
Is there an alternative to this function?
Thank you.
Hi, I switched from reclin to reclin2 and really appreciate the performance boost. When working with cluster_pair_blocking() and link(), I encountered an issue: I received pairs that shouldn't be compared based on the blocking variable.
To illustrate, here's the workflow without a cluster:
library(reclin2)
library(dplyr)
# Create an example dataset for deduplication
x <- data.frame(
block = rep(letters, each = 100),
comp = sample(1:3, length(letters) * 100, replace = TRUE)
)
# Run through reclin2 workflow without a cluster
pairs_linked <- x %>%
pair_blocking(
deduplication = TRUE,
on = "block"
) %>%
compare_pairs(on = "comp") %>%
link()
different_blocks <- pairs_linked %>%
filter(block.x != block.y)
stopifnot(nrow(different_blocks) == 0)
That works as expected. No pairs with different blocking variable values.
Now to run it on a cluster:
library(parallel)
cluster <- makeCluster(2)
pairs_linked_cluster <- x %>%
cluster_pair_blocking(
deduplication = TRUE,
on = "block",
cluster = cluster
) %>%
compare_pairs(on = "comp") %>%
cluster_collect() %>%
link()
stopCluster(cluster)
different_blocks_cluster <- pairs_linked_cluster %>%
filter(block.x != block.y)
stopifnot(nrow(different_blocks_cluster) == 0)
This time, there's an error because some pairs differ in their "block" variable.
Does link() have to operate on the worker nodes? Or could there be some other id confusion when running on a cluster?
Result of sessionInfo():
R version 4.1.0 (2021-05-18)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 12.5.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices datasets utils methods base
other attached packages:
[1] dplyr_1.0.9 reclin2_0.1.1 data.table_1.14.0
loaded via a namespace (and not attached):
[1] stringdist_0.9.7 Rcpp_1.0.7 fansi_0.5.0 crayon_1.4.1 assertthat_0.2.1 utf8_1.2.2 R6_2.5.1 DBI_1.1.1 lifecycle_1.0.1 magrittr_2.0.1
[11] pillar_1.8.0 rlang_1.0.4 cli_3.3.0 renv_0.14.0 rstudioapi_0.13 vctrs_0.4.1 generics_0.1.3 tools_4.1.0 glue_1.6.2 purrr_0.3.4
[21] compiler_4.1.0 pkgconfig_2.0.3 tidyselect_1.1.2 tibble_3.1.8 lpSolve_5.6.16
Dear Jan,
I am using the R reclin2 package you published to solve a record matching research question.
After applying a similarity calculation to assign a matching score to each combination of records I would like to use functions “greedy” and “m-to-n” to select the pairs that are considered to belong to the same record in the base record set.
Here, I can observe that some records are matched more than once, which obviously can lead to incorrect results. I am working with version 0.3.5 and understand that an issue has been fixed with the new version 0.5.0 for the select_greedy function, but not for the m-to-n function. Is it possible that the same kind of issue you fixed in greedy is also currently observable in m-to-n?
For reproduction I attach an excel file containing an example:
record 148 is paired with record 150 (row 110) and at the same time record 150 is paired with record 151 (row 132).
Same for records 152/154 in row 152 and 154/156 (greedy, row 167) or 154/156 (m-to-n, row 168).
Label columns were generated using this code:
dt_pairs_application <-
reclin2::select_greedy(pairs = dt_pairs_application,
variable = "LABEL_LM_AGG_SIM_GREEDY_APPROACH",
score = "MATCH_PROB_LM_AGG_SIM",
threshold = 0.5 )
dt_pairs_application <-
reclin2::select_n_to_m(pairs = dt_pairs_application,
variable = "LABEL_LM_AGG_SIM_N_TO_M_APPROACH",
score = "MATCH_PROB_LM_AGG_SIM",
threshold = 0.5 )
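The situation described above can be checked mechanically. A minimal sketch, assuming the selected pairs are available as a data frame with id columns named .x and .y (toy data mirroring the report, not the attached file):

```r
# Toy selected pairs: record 150 appears in two different pairs.
selected <- data.frame(.x = c(148, 150, 152), .y = c(150, 151, 154))

# Count how often each record id occurs across both sides of the pairs.
counts <- table(c(selected$.x, selected$.y))
repeated <- as.numeric(names(counts)[counts > 1])
repeated

# A check per side does not flag this case: 150 occurs once as .x and
# once as .y, so neither column contains a duplicate on its own.
anyDuplicated(selected$.x) == 0
anyDuplicated(selected$.y) == 0
```

This may be relevant when deduplicating, where both columns refer to the same data set.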
library(reclin2)
data("linkexample1", "linkexample2")
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
print(pairs)
pairs[, score := runif(nrow(pairs))]
reclin2:::select_preprocess(pairs, score = "score", id_x = "id", id_y = "id")
Result:
> reclin2:::select_preprocess(pairs, score = "score", id_x = "id", id_y = "id")
.x .y score index
1: 1 2 0.61505326 1
2: 2 3 0.12873553 2
3: 3 4 0.87160600 3
4: 4 6 0.76758030 4
5: 5 7 0.03668445 5
6: 6 NA 0.76429553 6
7: NA NA 0.03718838 7
8: NA NA 0.32234890 8
9: NA NA 0.63785324 9
10: NA NA 0.55063922 10
11: NA NA 0.96634700 11
12: NA NA 0.70562979 12
13: NA NA 0.18539430 13
14: NA NA 0.35664960 14
15: NA NA 0.68076638 15
16: NA NA 0.97718373 16
17: NA NA 0.73858780 17
I am writing to let you know that I have developed a small package called blocking (https://github.com/ncn-foreigners/blocking) that allows blocking of records based on approximate nearest neighbours algorithms (RcppAnnoy, RcppHNSW and mlpack) and graphs (igraph). The package includes the function pair_ann, which was developed on the basis of pair_blocking and pair_minsim to allow direct integration into your package.
Here is the code using the reclin2 sample data:
library(blocking)
library(reclin2)
data("linkexample1", "linkexample2", package = "reclin2")
linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)
# pairing records from linkexample2 to linkexample1 based on the txt column
pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
score_simple("score", on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.75) |>
link(selection = "threshold")
Feel free to test and comment. I plan to submit this package to CRAN in December.
Dear djvanderlaan,
First of all thank you for your wonderful package, for solving our hands and for introducing me to it!
My data looks like the example below. I am clustering using stringdist and pair_minsim.
library(tidyverse)
library(reclin2)
library(stringdist)
df <- tibble(fruits=c("anana","ananna","peach"))
my_dist <- function(x, y, ...){
1 - stringdist(x, y, method=c("lv"))
}
p <- pair_minsim(df, on = "fruits", deduplication = TRUE,
default_comparator = my_dist, minsim = -2)
p %>%
as.data.frame()
#> .x .y simsum
#> 1 1 2 0
Created on 2022-03-18 by the reprex package (v2.0.1)
I have three fruits (anana, ananna and peach) and I want to cluster them based on similarity. anana and ananna are very close, but peach is very far away. However, I want to keep peach in the data for further analysis, even if it does not cluster with anything but itself. I only want to keep peach itself, not its relationships with the others, so that the result looks like this:
.x .y simsum
1 2 0
3 3 1
Do you have an idea how I can do this through pair_minsim?
Honestly, thank you for your time
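One possible post-processing step, sketched in base R: after pairing, add a self-pair for every record that does not occur in any pair. The result is a plain data frame rather than a reclin2 pairs object, and the self-pair score of 1 is taken from the desired output above.

```r
# Pairs as returned in the example above, plus the number of records.
pairs <- data.frame(.x = 1, .y = 2, simsum = 0)
n_records <- 3

# Records that do not occur on either side of any pair.
unpaired <- setdiff(seq_len(n_records), c(pairs$.x, pairs$.y))

# Add one self-pair per unpaired record.
self_pairs <- data.frame(.x = unpaired, .y = unpaired,
                         simsum = rep(1, length(unpaired)))
result <- rbind(pairs, self_pairs)
result
```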
I'm trying to link dataframes dfA and dfB. I don't want to rely solely on one variable for blocking, so I'd like to use two blocking variables and consider the union of the two resulting sets of pairs as my potential pairs. Unfortunately, this case doesn't seem to be supported by the pair_blocking function.
I tried the following instruction to create a new pairs object:
pairs <- unique(rbindlist(list(pair_blocking(dfA, dfB, "postcode"), pair_blocking(dfA, dfB, "birthyear"))))
It seems like I'm losing information in the process, as I get the following error when calling compare_pairs on this object:
Error in UseMethod("compare_pairs", pairs): no applicable method for 'compare_pairs' applied to an object of class "c('data.table', 'data.frame')"
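The error message shows that the combined object only has the classes data.table and data.frame. As an illustration of the underlying idea (a union of two blockings), here is a base R sketch that builds the index pairs directly. It does not produce a reclin2 pairs object; the helper block_pairs and the toy data are hypothetical.

```r
dfA <- data.frame(id = 1:3, postcode = c("1000", "1000", "2000"),
                  birthyear = c(1980, 1990, 1985))
dfB <- data.frame(id = 1:3, postcode = c("1000", "3000", "2000"),
                  birthyear = c(1990, 1980, 1985))

# Row-index pairs of records that agree on the blocking variable `on`.
block_pairs <- function(x, y, on) {
  x$.x <- seq_len(nrow(x))
  y$.y <- seq_len(nrow(y))
  merge(x[, c(".x", on)], y[, c(".y", on)], by = on)[, c(".x", ".y")]
}

# Union of both blockings, with pairs found by both kept only once.
pairs_union <- unique(rbind(block_pairs(dfA, dfB, "postcode"),
                            block_pairs(dfA, dfB, "birthyear")))
pairs_union <- pairs_union[order(pairs_union$.x, pairs_union$.y), ]
pairs_union
```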
When attaching the package, the identical comparator function masks base::identical:
> library(reclin2)
Attaching package: ‘reclin2’
The following object is masked from ‘package:base’:
identical
This leads to some functions using getOption("viewer") to just stop working:
> leaflet::leaflet()
Error in identical(height, "maximize") :
unused arguments (height, "maximize")
> shiny::paneViewer()("")
Error in identical(height, "maximize") :
unused arguments (height, "maximize")
> getOption("viewer")
function (url, height = NULL)
{
if (!is.character(url) || (length(url) != 1))
stop("url must be a single element character vector.",
call. = FALSE)
if (identical(height, "maximize"))
height <- -1
if (!is.null(height) && (!is.numeric(height) || (length(height) !=
1)))
stop("height must be a single element numeric vector or 'maximize'.",
call. = FALSE)
invisible(.Call("rs_viewer", url, height, PACKAGE = "(embedding)"))
}
Currently, the only reliable workaround I'm aware of is to just load reclin2 instead of attaching it. Since base::identical is a pretty common function and this issue is quite hard to debug or even recognize, it might be a good idea to rename the comparator function.
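The masking mechanism can be reproduced without the package. In this sketch a zero-argument function named identical is defined in the global environment as a hypothetical stand-in for the comparator; unqualified calls then fail in the same way as above, while namespace-qualified calls keep working.

```r
# Simulate the mask: a zero-argument function that returns a comparator.
identical <- function() function(x, y) x == y

# Unqualified calls now hit the masking function and fail.
res <- tryCatch(identical(1, 1), error = function(e) conditionMessage(e))
res

# Namespace-qualified calls are unaffected.
base::identical(1, 1)

# Remove the mask again.
rm(identical)
```

Since R 3.6.0, library() also accepts an exclude argument, so library(reclin2, exclude = "identical") should attach the package without that one function; this has not been tested against reclin2 here.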
Hi there,
I'm working with a dataset that includes latitude and longitude, and I would love to calculate the distance between points as one of my factors. I'm happy to build this calculation myself in the comparison step, but I wonder whether something pre-built exists in the package, or whether you have suggestions on how to go about it.
All the Best,
Frankie
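For the distance itself, a common choice is the haversine great-circle formula. The sketch below is plain R and assumes coordinates in decimal degrees and a mean earth radius of 6371 km; how to plug it into a comparator for a pair of columns is left open here, and the function name is hypothetical.

```r
# Great-circle distance in kilometres between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Vectorised over its arguments; here a single pair of points.
haversine_km(52.3676, 4.9041, 52.0907, 5.1214)
```

A distance could then be turned into a similarity, for example by thresholding it at a maximum plausible distance.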
Hello, thanks for the package and for the vignettes, they are extremely helpful.
One question I have is whether it is possible to do some kind of "inequality blocking" (I'm a beginner in the record linkage domain, maybe there's a specific term for this). In your first vignette, you block on the postal code, so it is quite easy to block pairs based on this condition. In my setting, I would like to add a limit on the year.
Basically, I have a list of people who moved from country A to country B in different years (with names and year of arrival) and I have a list of letters (with the sender names and the year in which they were written). I'd like to match letter senders to passenger lists. In this case, I would like to ignore pairs of writers and passengers where the year of arrival > year of letter (the letter has to be written after the person arrived in the country). Therefore, I would like to do something like pair_blocking(passengers, writers, "year_arrival < year_written"), but that doesn't seem possible yet.
Is this kind of blocking in the scope of this package? Otherwise, are there other packages / techniques that you could recommend?
Thanks again
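A base R sketch of such an inequality condition, outside of reclin2: generate all pairs with a cross join and drop those where the arrival year is later than the year of the letter. The toy data are invented, and the result is a plain data frame of row indices, not a reclin2 pairs object.

```r
passengers <- data.frame(name = c("A. Jansen", "B. Peters"),
                         year_arrival = c(1900, 1910))
writers <- data.frame(name = c("A. Janssen", "B. Peeters"),
                      year_written = c(1905, 1908))

passengers$.x <- seq_len(nrow(passengers))
writers$.y <- seq_len(nrow(writers))

# merge() with by = NULL returns the Cartesian product of both tables.
all_pairs <- merge(passengers[, c(".x", "year_arrival")],
                   writers[, c(".y", "year_written")], by = NULL)

# Keep only pairs where the letter was not written before arrival.
pairs <- all_pairs[all_pairs$year_arrival <= all_pairs$year_written,
                   c(".x", ".y")]
pairs
```

The full cross join grows with the product of both table sizes, so for large data it would need to be combined with an ordinary blocking variable.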
Currently, matching records can be selected in two ways:
select_greedy or select_n_to_m
select_threshold
However, in the latter case, one needs to specify the threshold directly, even though the distribution of this value is not always known beforehand.
Alternatively, it should be possible to indicate the top k nearest neighbours or highest scoring candidate matches.
Similarly to select_threshold, a select_top_k command could look like this:
pairs <- select_top_k(pairs, "top_weights", score = "weights", k = 8)
The above command can be implemented by hand in two lines of code with data.table::frank and data.table::fifelse, but I think it would be more convenient to implement it in the package directly.
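A rough base R equivalent of such a hand-written selection, assuming "top k" means the k highest-scoring candidates per record of x (the proposal does not spell out the grouping). The column names follow the example call; the data are invented.

```r
pairs <- data.frame(.x = c(1, 1, 1, 2, 2),
                    .y = c(1, 2, 3, 1, 2),
                    weights = c(0.9, 0.4, 0.7, 0.2, 0.8))
k <- 2

# Rank the candidates of each .x from highest to lowest weight.
rank_in_x <- ave(-pairs$weights, pairs$.x,
                 FUN = function(w) rank(w, ties.method = "min"))

# Flag the k best candidates per .x.
pairs$top_weights <- rank_in_x <= k
pairs
```

With ties.method = "min", tied candidates share a rank, so more than k pairs can be flagged for one record when scores are equal.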
As reported in issue #8.
library(reclin2)
data(linkexample1, linkexample2)
linkexample1$postcode[1] <- NA
pair_minsim(linkexample1, linkexample2, on = c("postcode", "lastname"), minsim = 0.5)
Should result in a pair between records 1 from x and 1 from y.
Hi, is there any reason the package is dependent on R 3.6.0?
I tried to build the package for 3.5.3 and received the following message:
* checking for file ‘reclin2-master/DESCRIPTION’ ... OK
* preparing ‘reclin2’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to process help pages
* saving partial Rd database
Error in loadVignetteBuilder(pkgdir, TRUE) :
vignette builder 'simplermarkdown' not found
Execution halted
I was able to install the package without a build, but just curious in case I run into any trouble.