dgrtwo / fuzzyjoin

Join tables together on inexact matching
License: Other
geo_join does a full m*n comparison of points. This works for moderately sized datasets but becomes intractable in both time and memory.
The right geo_join() implementation would likely use k-d trees to build a spatial index of one table, then look up points from the other, while taking advantage of the limited max_dist argument. This would turn it (on average) from O(m*n) time to O(m*log(n)) time. (Similar approaches could be used for functions such as distance_join, and on a related note B-K trees could be used for Levenshtein distance.)
I may or may not have time to implement this but it would be a top-notch pull request.
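As a sketch of what that spatial-index approach could look like (not part of fuzzyjoin; the helper name and the use of the RANN package are my own assumptions, and a Euclidean radius search on lon/lat is only a planar approximation, so candidate pairs would still need an exact geodesic filter such as geosphere::distGeo):

```r
# Candidate-pair search with a k-d tree via RANN::nn2, limited by max_dist.
# This replaces the full m*n comparison with one radius query per row of x.
library(RANN)

kd_candidates <- function(x, y, max_dist, k = 10) {
  # build the tree on y, query with every point of x; indices outside the
  # radius come back as 0 and are dropped
  nn <- RANN::nn2(y, x, k = min(k, nrow(y)),
                  searchtype = "radius", radius = max_dist)
  hits <- which(nn$nn.idx > 0, arr.ind = TRUE)
  data.frame(x_row = hits[, "row"], y_row = nn$nn.idx[hits])
}

x <- matrix(c(0, 0, 10, 10), ncol = 2, byrow = TRUE)
y <- matrix(c(0.1, 0, 50, 50), ncol = 2, byrow = TRUE)
kd_candidates(x, y, max_dist = 1)  # only the (1, 1) pair survives
```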
First, thanks for putting this package together. I've been trying for quite some time to wrap my head around how to accomplish exactly this task.
Unfortunately, when I try running any of the examples, I get an error message:
Error in sapply(x_indices_l, length) * sapply(y_indices_l, length) : non-numeric argument to binary operator
I'm not familiar enough with base R to know how to fix the problem, but fortunately, it looks like someone else is: b0cfdd9. What's the proper way to request integrating this fix into your repository? Is it appropriate to create a pull request based on someone else's code?
In addition to providing examples of match_funs (#22), it looks like match_fun gets used as multi_match_fun if match_fun is singular and there are multiple columns in the by argument?
"If only one function is given it is used on all column pairs."
If so, then multi_by and multi_match_fun seem confusing and redundant to me.
I see the note "Note that as of now, you cannot give both match_fun and multi_match_fun- you can either compare each column individually or compare all of them."
Perhaps multi_by and multi_match_fun should be removed in the future?
Basically, the following definitions seem redundant, and I can't tell what the differences are:
by
Columns of each to join
match_fun
Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
multi_by
Columns to join, where all columns will be used to test matches together
multi_match_fun
Function to use for testing matches, performed on all columns in each data frame simultaneously
The error occurred in a situation like the one exemplified below:
> fuzzy_left_join(mtcars, mtcars,
+ by = c("gear" = "cyl", "carb" = "cyl"),
+ match_fun = list(`==`, `==`))
Error: Length of match_fun not equal to columns specified in 'by'.
First, the error message is misleading, because the lengths fail to match only because fuzzyjoin itself made them unequal when setting up the comparisons. Second, the error should not occur at all: this is not a bug but user behavior that should be perfectly valid (expecting each column to be used only once does not seem reasonable). Moreover, the behavior is inconsistent, as the following code does not produce errors:
fuzzy_left_join(mtcars, mtcars,
by = c("cyl" = "gear", "cyl" = "carb"),
match_fun = list(`==`, `==`))
The practical use case is when one wants to do left joins like x.start < y.date & y.date < x.end, equivalent to SQL's: SELECT * FROM table1 x LEFT JOIN table2 y ON y.value BETWEEN x.start AND x.end
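A minimal sketch of that BETWEEN-style join (the column names and data here are illustrative, not from the thread); note the repeated column is placed on the left-hand side of by, the direction reported above to work:

```r
library(fuzzyjoin)

events <- data.frame(date = as.Date(c("2020-01-15", "2020-03-05")))
spans  <- data.frame(id    = 1:2,
                     start = as.Date(c("2020-01-01", "2020-02-01")),
                     end   = as.Date(c("2020-01-31", "2020-02-28")))

# y.value BETWEEN x.start AND x.end, i.e. date >= start AND date <= end
fuzzy_left_join(events, spans,
                by = c("date" = "start", "date" = "end"),
                match_fun = list(`>=`, `<=`))
```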
I can't find an example for how match_fun works anywhere online
I saw this on SO: http://stackoverflow.com/questions/40632295/how-do-i-select-values-from-a-lookup-table-based-on-their-time-stamp, where the asker would like to match based on ranges... but it gave me an idea: your fuzzyjoin could also match based on the closest timestamp. Could be a good feature for this package, no?
difference_right_join (and probably others) do not create the designated distance_col when there are no overlapping rows. This makes use of these functions in a general pipeline problematic because the column you expect to exist does not. Example below
library(fuzzyjoin)
a <- data.frame(replicate(2,sample(20:30,10,rep=TRUE)))
b <- data.frame(replicate(2,sample(1:10,10,rep=TRUE)))
difference_right_join( a, b, by='X1', max_dist = 1, distance_col='distance')
#> X1.x X2.x X1.y X2.y
#> 1 NA NA 10 4
#> 2 NA NA 7 5
#> 3 NA NA 9 2
#> 4 NA NA 1 10
#> 5 NA NA 8 4
#> 6 NA NA 5 1
#> 7 NA NA 4 9
#> 8 NA NA 1 3
#> 9 NA NA 9 5
#> 10 NA NA 3 6
Created on 2019-08-05 by the reprex package (v0.3.0)
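A hedged workaround until that's fixed: add the column yourself whenever the join drops it (data and names below are just illustrative):

```r
library(fuzzyjoin)

set.seed(42)
a <- data.frame(X1 = sample(20:30, 10, replace = TRUE))
b <- data.frame(X1 = sample(1:10, 10, replace = TRUE))

res <- difference_right_join(a, b, by = "X1", max_dist = 1,
                             distance_col = "distance")

# guard against the missing column when there were no overlapping rows
if (!"distance" %in% names(res)) res$distance <- NA_real_
```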
In this example 100-110 is not joined with 111-112, but 10-11 is joined with 11.1-11.2.
x1 <- frame_data(~ xmin, ~xmax,
100, 110,
10, 11)
x2 <- frame_data(~ xmin, ~xmax,
111, 112,
11.1, 11.2)
interval_inner_join(x1, x2, by = c("xmin","xmax"), maxgap = 0)
xmin.x xmax.x xmin.y xmax.y
<dbl> <dbl> <dbl> <dbl>
1 10 11 11.1 11.2
This looks like it is due to IRanges storing ranges as integers:
r1 <- IRanges::IRanges(as.numeric(x1[2,1]), as.numeric(x1[2,2]))
r2 <- IRanges::IRanges(as.numeric(x2[2,1]), as.numeric(x2[2,2]))
IRanges::findOverlaps(r1, r2)
This probably isn't going to change on the IRanges side, so maybe just adding a disclaimer to interval_join would be appropriate. The function behaves fine if xmin and xmax are scaled by some large value.
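A sketch of that scaling workaround (the factor is arbitrary; note that maxgap is then interpreted on the scaled axis, so any nonzero maxgap you rely on would need scaling too):

```r
library(fuzzyjoin)

x1 <- data.frame(xmin = c(100, 10),   xmax = c(110, 11))
x2 <- data.frame(xmin = c(111, 11.1), xmax = c(112, 11.2))

# work around the integer coercion in IRanges by scaling the endpoints up
f <- 1e3
interval_inner_join(x1 * f, x2 * f, by = c("xmin", "xmax"), maxgap = 0)
# after scaling, the spurious 10-11 vs 11.1-11.2 match caused by the
# integer rounding no longer appears
```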
I'm a bit unclear on why stringdist_left_join doesn't yield a tibble with x * y rows if max_dist is left unspecified.
Any word can be modified into any other word by replacing every letter, so shouldn't a row be returned for every combination of x and y?
I realize that this would essentially produce the same result as stringdist::stringdistmatrix(), and it isn't actually the result I'm looking for, but it just seems as though stringdist_left_join has a hidden max_dist cutoff, and I'm curious, if that's the case, what it is or how it's determined.
Thanks!
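For what it's worth, the stringdist_join family does have an explicit cutoff: its max_dist argument defaults to 2 (per the documented signature), which is why distant combinations disappear rather than all x * y pairs being returned. A sketch with illustrative data:

```r
library(fuzzyjoin)

x <- data.frame(a = "kitten", stringsAsFactors = FALSE)
y <- data.frame(a = c("sitting", "kitten!", "mitten"),
                stringsAsFactors = FALSE)

# with the default max_dist = 2, "sitting" (edit distance 3) is dropped
stringdist_left_join(x, y, by = "a")

# raise the cutoff to get every combination back
stringdist_left_join(x, y, by = "a", max_dist = Inf)
```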
More examples: I've used this package in other powerful ways, but on proprietary data. I'm interested in ideas for use cases that can be provided as vignettes.
fuzzy_left_join(A, B,
  by = c(
    "key" = "key",
    "date" = "startDate",
    "date" = "endDate"
  ),
  match_fun = list(`==`, `>=`, `<=`)
)
When I installed the github version, I kept getting the following error when pressing tab, and when loading the package for the first time:
Error in fetch(key) :
lazy-load database '~/R/x86_64-pc-linux-gnu-library/3.4/fuzzyjoin/help/fuzzyjoin.rdb' is corrupt
I wonder if there is a small mistake in the documentation of geo_join:
By default geosphere::distGeo outputs distances in meters, and the other geosphere functions probably do too.
So if the unit given to geo_join is not miles, the distance is calculated in meters, I guess? But in the documentation the other possible value for unit is "km", so I'm confused.
I have followed the exact steps from the reference doc in the geo_join section.
data("state")
state.name
state.center
states <- data_frame(state = state.name,
longitude = state.center$x,
latitude = state.center$y)
s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)
pairs <- s1 %>%
geo_inner_join(s2, max_dist = 200) %>%
filter(state1 != state2)
library(ggplot2)
ggplot(pairs, aes(x = longitude.x, y = latitude.x,
xend = longitude.y, yend = latitude.y)) +
geom_segment(color = "red") +
borders("state") +
theme_void()
But the result seems odd. Instead of joining all the states within 200 miles of each other, the states in the result seem to be pretty far apart.
Source: local data frame [74 x 6]
state1 longitude.x latitude.x state2 longitude.y latitude.y
(chr) (dbl) (dbl) (chr) (dbl) (dbl)
1 Alaska -127.2500 49.2500 Louisiana -92.2724 30.6181
2 Alaska -127.2500 49.2500 Montana -109.3200 46.8230
3 Arizona -111.6250 34.2192 Connecticut -72.3573 41.5928
4 Arizona -111.6250 34.2192 Indiana -86.0808 40.0495
5 Arizona -111.6250 34.2192 Minnesota -94.6043 46.3943
6 Colorado -105.5130 38.6777 North Dakota -100.0990 47.2517
7 Connecticut -72.3573 41.5928 Arizona -111.6250 34.2192
8 Connecticut -72.3573 41.5928 Georgia -83.3736 32.3329
9 Connecticut -72.3573 41.5928 Minnesota -94.6043 46.3943
10 Connecticut -72.3573 41.5928 New Hampshire -71.3924 43.3934
.. ... ... ... ... ... ...
It could be that my understanding of this function is wrong, or that the way I'm using it is wrong. But because I'm also getting some strange results with the 'regex_join' functions, I just wanted to check if there might be some known issue with the latest versions on CRAN and GitHub.
See those examples:
library(dplyr)
library(ggplot2)
data(diamonds)
d <- data_frame(regex_name = c("^Idea", "mium", "Good"))
This works fine while diamonds is a tibble:
diamonds %>%
select(cut) %>%
regex_left_join(d, by = c(cut = "regex_name"))
# A tibble: 53,940 x 2
cut regex_name
<ord> <chr>
1 Ideal ^Idea
2 Premium mium
3 Good Good
4 Premium mium
This does not work:
diamonds %>%
select(cut) %>%
as.data.frame() %>%
regex_left_join(d, by = c(cut = "regex_name"))
Error: cannot convert object to a data frame
Fails to rename duplicate variable names in the resulting fuzzy-joined data table (note the two year columns below).
library(magrittr)
library(fuzzyjoin)
data_1 <- data.frame(town = c("Boston", "Philadelphia", "Salt Lake City", "Jacksonville"),
year = c(1900, 1901, 1902,1903))
data_2 <- data.frame(city = c("Bsoton", "Philedelphia", "Salt Lake City"),
state = c("Massachusetts", "Pennsylvania", "Utah"),
pop = c(10, 20, 30),
year = c(2008, 2009, 2010))
test_join <- data_1 %>%
stringdist_join(data_2, by = c(town = "city"), max_dist = 1, mode = "left")
Output:
town year city state pop year
1 Boston 1900 Bsoton Massachusetts 10 2008
2 Philadelphia 1901 Philedelphia Pennsylvania 20 2009
3 Salt Lake City 1902 Salt Lake City Utah 30 2010
4 Jacksonville 1903 <NA> <NA> NA NA
I was trying to regex join two vectors:
sentences <- c("qwertyuiop", "asdfgfh", "zxcvbn")
patterns <- c("wer", "asd")
regex_right_join(sentences, patterns)
# Error in UseMethod("groups") :
# no applicable method for 'groups' applied to an object of class "character"
Maybe if I previously convert them to a data.frame?
regex_right_join(data.frame(sentences),
data.frame(patterns),
by=c(sentences="patterns"))
# Error: cannot convert object to a data frame
It seems that it's dropping a dimension (from data frame to vector) somewhere...
So, finally, I tried:
regex_right_join(data.frame(sentences, 0),
data.frame(patterns, 0),
by=c(sentences="patterns")) %>%
select(sentences, patterns)
# sentences patterns
#1 qwertyuiop wer
#2 asdfgfh asd
Could you check where it is dropping a dimension (from data frame to vector)?
And would you consider adding the functionality to join vectors?
I'm going to try to add two-by-two comparison: right now we'd like to join tables based on geographical distance, for which you use two columns in each data frame.
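A sketch of such a two-column comparison using fuzzy_join's multi_by/multi_match_fun (the 5 km threshold and the data are illustrative assumptions on my part):

```r
library(fuzzyjoin)
library(geosphere)

x <- data.frame(lon = c(-71.06, -118.24), lat = c(42.36, 34.05))
y <- data.frame(lon = c(-71.05, 0.0),     lat = c(42.37, 51.50))

# multi_match_fun receives one matrix per side, holding the multi_by
# columns in order (here lon, lat, which is what distHaversine expects)
within_5km <- function(p1, p2) {
  geosphere::distHaversine(p1, p2) / 1000 < 5
}

fuzzy_join(x, y, multi_by = c("lon", "lat"),
           multi_match_fun = within_5km, mode = "inner")
```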
I got the error below when I used the fuzzy join package.
df and dat are data frames with columns of company-name text. I am trying to get the subset of names in df that don't match those in dat.
Any help will be appreciated!!
df_test <- stringdist_anti_join(df, dat)
Joining by: "ER_NAME2"
Error in outer(u_x, u_y, mf, ...) :
long vectors not supported yet: memory.c:1648
df has 287626 rows and dat has 11880 rows of data
sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] tcltk grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] fuzzyjoin_0.1 stringdist_0.9.4.1 Hmisc_3.16-0 ggplot2_1.0.1 Formula_1.2-1 survival_2.38-1 lattice_0.20-31
[8] stringr_1.0.0 sqldf_0.4-10 RSQLite_1.0.0 DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10 dplyr_0.4.3
[15] plyr_1.8.3 readr_0.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 RColorBrewer_1.1-2 tools_3.2.1 rpart_4.1-9 digest_0.6.8 gtable_0.1.2 parallel_3.2.1
[8] gridExtra_0.9.1 cluster_2.0.1 nnet_7.3-9 R6_2.1.0 foreign_0.8-63 latticeExtra_0.6-26 purrr_0.2.1
[15] tidyr_0.4.1 reshape2_1.4.1 magrittr_1.5 scales_0.2.5 splines_3.2.1 MASS_7.3-40 assertthat_0.1
[22] colorspace_1.2-6 stringi_0.5-5 acepack_1.3-3.3 lazyeval_0.1.10 munsell_0.4.2 chron_2.3-47
First of all, congratulations on this great package. I was struggling a bit to find out why I could not geo-match my data. I think it is a bit unfortunate that there is no exact longitude and latitude variable matching in the geo_join function.
I made a small amendment in the documentation of the function. However, I think that clearer matching would be more appropriate, given that Google and many other APIs use lng for longitudes by default.
states <- data_frame(state = state.name,
                     lng = state.center$x,
                     lat = state.center$y)
s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)
pairs <- s1 %>%
  geo_inner_join(s2, by = c("lng", "lat"),
                 max_dist = 200) %>%
  filter(state1 != state2)
Another issue that makes the whole code very difficult to read and amend is that only very few functions are directly imported. For example, in geo_join, dplyr::data_frame and dplyr::common_by are not referenced or imported. It is not clear to me why the above modification should not work. Given that in geo-matching lat.x must strictly match lat.y and lon.x must match lon.y, it would be preferable to pass these arguments explicitly, for example like by.lat = c("lat", "lat") and by.lon = c("lon", "lng").
Hello! Please excuse any obvious or dumb errors - I am new to R and tried looking up the answers beforehand, I promise!
I am trying to join two tables using the R "fuzzyjoin" package; specifically, to fuzzy-join two tables of company names. I have one data frame of 5000 company names and one data frame of 1600 company names. There are no other columns besides the company names.
Using the package, I have:
NewTable <- AccountsList1 %>%
stringdist_inner_join(AccountsList2, by = NULL)
However, I got two errors:
Joining by: "Accounts"
Error in dists[include] <- stringdist::stringdist(v1[include], v2[include], :
NAs are not allowed in subscripted assignments
and
50: In stri_length(string) :
invalid UTF-8 byte sequence detected. perhaps you should try calling stri_enc_toutf8()
So then I removed N/As via [!is.na(AccountsList1)]
and forced UTF-8 via
stri_enc_toutf8(AccountsList1, is_unknown_8bit = FALSE, validate = FALSE)
However, when I rerun I get the exact same errors... Does anyone have any advice? Thank you!
Is it possible to allow max_dist to be a vector, so that it can be different for each column passed to stringdist_join's by argument?
Something like this: stringdist_inner_join(df1, df2, by = c("col1", "col2"), max_dist = c(1, 2))
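Until or unless that's supported, a hedged workaround is to drop down to fuzzy_join's per-column match functions, each closing over its own threshold (helper name and data are illustrative):

```r
library(fuzzyjoin)
library(stringdist)

# factory: a vectorized match_fun with its own maximum distance
dist_at_most <- function(d) {
  function(x, y) stringdist::stringdist(x, y) <= d
}

df1 <- data.frame(col1 = "kitten", col2 = "sitting",
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = "mitten", col2 = "sittin",
                  stringsAsFactors = FALSE)

fuzzy_inner_join(df1, df2, by = c("col1", "col2"),
                 match_fun = list(dist_at_most(1), dist_at_most(2)))
```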
Error: vector memory exhausted (limit reached?)
I’m getting the above error when trying to stringdist_left_join two tables - the left table is 185K rows and the right table is 4.37M rows. The R session never appears to use more than 6GB of memory (according to Activity Monitor) while I’m on a machine with 32GB of memory with available memory in the range of 10GB when the vector memory exhausted error arises. I’ve followed various recommendations to increase R_MAX_VSIZE to a large number - 700GB as shown in the Sys.getenv() output shown below. All this to say it appears that stringdist_left_join does not pay attention to R_MAX_VSIZE. Is there some other setting I can change to use more of the available memory on my machine?
Sys.getenv()
Apple_PubSub_Socket_Render /private/tmp/com.apple.launchd.sSrL33I64Z/Render
COLUMNS 80
COMMAND_MODE unix2003
DISPLAY /private/tmp/com.apple.launchd.tTt2eLd6xQ/org.macosforge.xquartz:0
DYLD_FALLBACK_LIBRARY_PATH /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
DYLD_LIBRARY_PATH /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
EDITOR vi
HOME /Users/geoffreysnyder
LD_LIBRARY_PATH :@JAVA_LD@
LINES 24
LN_S ln -s
LOGNAME geoffreysnyder
MAKE make
PAGER /usr/bin/less
PATH /usr/local/bin:/usr/local/mysql/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:~/Library/Python/3.7/bin
PWD /Users/geoffreysnyder/repos/Data_Load/code
R_ARCH
R_BROWSER /usr/bin/open
R_BZIPCMD /usr/bin/bzip2
R_DOC_DIR /Library/Frameworks/R.framework/Resources/doc
R_GZIPCMD /usr/bin/gzip
R_HOME /Library/Frameworks/R.framework/Resources
R_INCLUDE_DIR /Library/Frameworks/R.framework/Resources/include
R_LIBS_SITE
R_LIBS_USER ~/Library/R/3.5/library
R_MAX_VSIZE 700GB
R_PAPERSIZE a4
R_PDFVIEWER /usr/bin/open
R_PLATFORM x86_64-apple-darwin15.6.0
R_PRINTCMD lpr
R_QPDF /Library/Frameworks/R.framework/Resources/bin/qpdf
R_RD4PDF times,inconsolata,hyper
R_SESSION_TMPDIR /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T//RtmpJdct7Y
R_SHARE_DIR /Library/Frameworks/R.framework/Resources/share
R_SYSTEM_ABI osx,gcc,gxx,gfortran,?
R_TEXI2DVICMD /usr/local/bin/texi2dvi
R_UNZIPCMD /usr/bin/unzip
R_ZIPCMD /usr/bin/zip
SECURITYSESSIONID 186a8
SED /usr/bin/sed
SHELL /bin/zsh
SHLVL 0
SSH_AUTH_SOCK /private/tmp/com.apple.launchd.UNOOV1wxev/Listeners
SUBLIMEREPL_AC_IP 127.0.0.1
SUBLIMEREPL_AC_PORT None
TAR /usr/bin/tar
TMPDIR /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T/
TZ America/Los_Angeles
USER geoffreysnyder
XPC_FLAGS 0x0
XPC_SERVICE_NAME 0
__CF_USER_TEXT_ENCODING 0x1F7:0x0:0x0
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 RJDBC_0.2-7.1 rJava_0.9-10 DBI_1.0.0 fuzzyjoin_0.1.4 readr_1.2.0 dplyr_0.7.8
[8] lubridate_1.7.4 stringr_1.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 tidyr_0.8.2 assertthat_0.2.0 R6_2.3.0 magrittr_1.5 pillar_1.2.3
[7] rlang_0.3.0.1 stringi_1.2.4 tools_3.5.1 glue_1.3.0 purrr_0.2.5 hms_0.4.2.9000
[13] compiler_3.5.1 pkgconfig_2.0.2 bindr_0.1.1 tidyselect_0.2.5 tibble_1.4.2
When trying to join two data frames, an error occurs if one of them has only one column:
stringdist_inner_join(df1, df2)
Error: cannot convert object to a data frame
Maybe it's dropping a dimension (from a one-column data frame to a vector)?
I have two tibbles with a column called pos on which I am trying to perform a distance_inner_join.
> df1
# A tibble: 1,398 x 3
gene_name strand pos
<chr> <fct> <dbl>
1 1500015O10Rik + 43730601
2 1700001G17Rik + 33669823
3 1700003I22Rik + 56018977
4 1700007P06Rik + 187125137
5 1700016C15Rik + 177729813
6 1700016L21Rik + 80445931
7 1700019A02Rik - 53158746
8 1700019D03Rik - 52925246
9 1700019P21Rik - 138876645
10 1700020N18Rik - 91405182
# ... with 1,388 more rows
> df2
# A tibble: 58 x 3
type areaStat pos
<chr> <dbl> <dbl>
1 imprinted -28.5 37977873
2 imprinted -23.4 55027489
3 imprinted -21.5 123876627
4 imprinted -19.2 95665495
5 imprinted 18.3 172148438
6 imprinted 18.4 97938015
7 imprinted 18.5 31347008
8 imprinted 20.7 96774378
9 imprinted 21.3 154282393
10 imprinted 23.0 118973902
# ... with 48 more rows
When I run distance_inner_join, I get the following error:
> distance_inner_join(
+ df1,
+ df2,
+ by="pos", max_dist=1e6)
Error in rowSums((v1 - v2)^2) :
'x' must be an array of at least two dimensions
and the traceback:
6. stop("'x' must be an array of at least two dimensions")
5. rowSums((v1 - v2)^2)
4. multi_match_fun(ux_input, uy_input)
3. fuzzy_join(x, y, multi_by = by, multi_match_fun = match_fun, mode = mode)
2. distance_join(x, y, by, max_dist = max_dist, method = method, mode = "inner", distance_col = distance_col)
1. distance_inner_join(gene_tss %>% filter(chr == .x) %>% select(-chr), dmr_midpoint %>% filter(chr == .x) %>% select(-chr), by = "pos", max_dist = 1e+06)
Installed the latest version from GitHub today 2018-07-23.
I like the idea of this package, but it does not work well in practice for my needs. I wrote a knitr document explaining the problem here.
In brief: often the task is to match up two n-length vectors of strings against each other 1-to-1. The present join algorithm does not enforce 1-to-1 joins, i.e. sometimes one string gets joined to two others, and sometimes one gets joined to none.
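For the 1-to-1 case specifically, one option outside fuzzyjoin (sketched here, not a fuzzyjoin feature) is to build the full distance matrix and solve the linear assignment problem, for example with clue::solve_LSAP:

```r
library(stringdist)
library(clue)

a <- c("apple", "banana", "cherry")
b <- c("chery", "aple", "bananna")

# full pairwise distance matrix, then a minimum-cost 1-to-1 assignment:
# every element of a is matched to exactly one element of b
m <- stringdist::stringdistmatrix(a, b, method = "osa")
assignment <- clue::solve_LSAP(m)
data.frame(a = a, b = b[as.integer(assignment)])
```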
I have the following setting:
DatasetA: approximately 20 thousand rows, coordinates are given as latA, lonA
DatasetB: approximately 2 million rows, coordinates are given as latB, lonB
Because the coordinates do not exactly match, I tried the following:
DatasetC <- DatasetA %>%
difference_left_join(DatasetB, by = c("latA" = "latB", "lonA" = "lonB"), max_dist = 2)
This works when I take a sample (e.g. 10%) from DatasetA but repeatedly crashes when using the entire dataset. Did you experience similar behaviour?
dplyr allows joining columns with differing names in the lhs and rhs via a named vector as the by argument. It would be great if genome_join/interval_join (and their variants) could support the same syntax.
Example:
require(dplyr)
require(tidyr)
require(fuzzyjoin)
arrayRes = structure(list(
phopho_position_start = c(862L, 426L, 518L, 556L, 519L, 127L),
phopho_position_end = c(863L, 427L, 519L, 557L, 520L, 128L),
ensembl_peptide_id = c("ENSP00000363708", "ENSP00000376679", "ENSP00000321606", "ENSP00000264229", "ENSP00000307093", "ENSP00000358142" ),
external_gene_name = c("BMPR2", "ABLIM1", "CRMP1", "KIAA1211", "MAP6", "SV2A")
), .Names = c("phopho_position_start", "phopho_position_end", "ensembl_peptide_id", "external_gene_name")) %>% tbl_df
hmmerSearchRes = structure(list(
target_name = c("ENSP00000385014", "ENSP00000385014", "ENSP00000385014", "ENSP00000233057", "ENSP00000233057", "ENSP00000233057" ),
ali_from = c(248L, 311L, 399L, 248L, 352L, 440L), ali_to = c(262L, 318L, 412L, 262L, 359L, 453L),
hmm_coverage = c(0.933333333333333, 0.466666666666667, 0.866666666666667, 0.933333333333333, 0.466666666666667, 0.866666666666667)),
.Names = c("target_name", "ali_from", "ali_to", "hmm_coverage")) %>% tbl_df
genome_join(arrayRes, hmmerSearchRes, by=c("ensembl_peptide_id"="target_name", "phopho_position_start"="ali_from", "phopho_position_end"="ali_to"))
It seems that if we use a list of match_funs, one for each matching variable, it is not possible to pass arguments into them?
https://stackoverflow.com/questions/44383510/r-passing-arguments-into-the-match-fun-function-in-fuzzyjoinfuzzy-join
Suggested feature:
Offer the stringdist, interval, etc. match_funs as helper functions that can easily be passed into fuzzy_join.
This would avoid problems like me trying to extract the stringdist function out of stringdist_join :)
Here's me trying to use the stringdist match_fun; can you help me figure out what I did wrong here?
Passing arguments into the match_fun function in fuzzyjoin::fuzzy_join
Related to #22 (documentation for match_funs, please!)
It would be great if the interval join methods in fuzzyjoin could also run with more than the two columns start and end, for example with an additional grouping column.
For the types of fuzzy joins that have some kind of a distance metric (e.g., a string distance for stringdist_join() and friends), would it be possible to have an option to return the distance measure as a new column in the resulting data frame?
Note: This is a cross-post of this Stack Overflow question.
Using fuzzyjoin 0.1.4 and R 3.5.3.
I'm adapting the example shown here, where I'd like to left-join some test results to existing session data, but I know that the tests may have been conducted three hours before or after the session was created, by the same user.
However, I am getting duplicate user_id columns as a result, despite specifying them in the by option, and the values themselves are not added to the data frame:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(fuzzyjoin)
sessions = tribble(
~user_id, ~session_id, ~session_created_at,
1, 1, "2019-01-01T15:30:00",
1, 2, "2019-01-02T00:00:30",
1, 3, "2019-01-04T12:30:00"
) %>%
mutate_at("session_created_at", as_datetime)
tests = tribble(
~user_id, ~test_id, ~test_created_at, ~test_value,
1, 7, "2019-01-01T15:28:00", "foo",
1, 8, "2019-01-01T23:59:30", "bar"
) %>%
mutate_at("test_created_at", as_datetime) %>%
mutate(
start = test_created_at + hours(3),
end = test_created_at - hours(3)
)
sessions %>%
fuzzy_left_join(
tests,
by = c(
"user_id" = "user_id",
"session_created_at" = "start",
"session_created_at" = "end"
),
match_fun = list(`==`, `>=`, `<=`)
)
#> # A tibble: 3 x 9
#> user_id.x session_id session_created_at user_id.y test_id
#> <dbl> <dbl> <dttm> <dbl> <dbl>
#> 1 1 1 2019-01-01 15:30:00 NA NA
#> 2 1 2 2019-01-02 00:00:30 NA NA
#> 3 1 3 2019-01-04 12:30:00 NA NA
#> # … with 4 more variables: test_created_at <dttm>, test_value <chr>,
#> # start <dttm>, end <dttm>
sessions %>%
left_join(tests, by = c("user_id" = "user_id"))
#> # A tibble: 6 x 8
#> user_id session_id session_created_at test_id test_created_at
#> <dbl> <dbl> <dttm> <dbl> <dttm>
#> 1 1 1 2019-01-01 15:30:00 7 2019-01-01 15:28:00
#> 2 1 1 2019-01-01 15:30:00 8 2019-01-01 23:59:30
#> 3 1 2 2019-01-02 00:00:30 7 2019-01-01 15:28:00
#> 4 1 2 2019-01-02 00:00:30 8 2019-01-01 23:59:30
#> 5 1 3 2019-01-04 12:30:00 7 2019-01-01 15:28:00
#> 6 1 3 2019-01-04 12:30:00 8 2019-01-01 23:59:30
#> # … with 3 more variables: test_value <chr>, start <dttm>, end <dttm>
Created on 2019-05-17 by the reprex package (v0.2.1)
Why is that, and what do I have to do to get it working?
I should add that when I run:
sessions %>%
left_join(tests, by = c("user_id" = "user_id"))
I get the proper columns, but of course not the correct test results joined. And when I run this, I get my expected results, but it seems inefficient:
sessions %>%
left_join(tests) %>%
filter(test_created_at - hours(3) <= session_created_at) %>%
filter(test_created_at + hours(3) >= session_created_at)
df1 <- data_frame(color = c("yellow", "blue", "green", "purple"),
grp = c(1, 2, 1, 2))
df2 <- data_frame(color = c("yallow", "bloooooo", "gren", "purple"))
df1_grp <- df1 %>% group_by(grp) %>% slice(1)
stringdist_inner_join(df1, df2) # works
stringdist_inner_join(df1_grp, df2) # fails
Error: corrupt 'grouped_df', contains 2 rows, and 4 rows in groups
stringdist_inner_join(df1_grp %>% ungroup(), df2) # works with ungrouping
I recognize the error message from dplyr bugs I have encountered, but the current CRAN version of dplyr 0.4.3 will do this join on grouped data_frames without error: inner_join(df1_grp, df2)
If there are duplicate column names (that aren't being matched on), they're currently suffixed with .x and .y. It'd be nice to be able to change that, like you can in dplyr's joins.
Thank you for this package! I used it for assigning people to generations (Boomers, Millennials, etc.) depending on the year they were born (it's also possible to base it on age, using that as a rough approximation of birth year, but that's more complicated code). Here's some simplified code of how I used fuzzyjoin.
library(fuzzyjoin)
#create generations table
name <- c("Generation ?", "Generation Z", "Millenials", "Xennials", "Generation X", "Baby Boomers","Silent")
startyr <- c(2017, 1997, 1984, 1977, 1963, 1944, 1900)
#there's a dummy year in here for Generation ?
endyr <- c(2025, 2016, 1996, 1983, 1976, 1962, 1943)
generations <- data.frame(name, startyr, endyr)
#create a data table of people with the year they were born
name <- c("person1", "person2", "person3", "person4", "person5", "person6", "person7", "person8", "person9")
yr_birth <- c(1971, 2018, 1999, 1917, 1965, 1949, 1979, 1983, 1989)
people <- data.frame(name, yr_birth)
#Update the people table to indicate what generation they belong to
#depending on whether their birth year falls on or after start yr and on or before end year of the generation
people <- fuzzy_left_join(people, generations,
                          by = c("yr_birth" = "startyr", "yr_birth" = "endyr"),
                          match_fun = list(`>=`, `<=`))
I just posted this question, which contains the following example:
x = matrix( 1:100, nrow = 100, ncol= 4 ,byrow = FALSE)
y = x + matrix( .001 * rnorm(n = 400), nrow = 100, ncol= 4)
z = rbind(x,y)
z = z[sample(nrow(z)),]
Can fuzzyjoin be used to reduce z to x (at least as an estimate)?
I'm trying the 'regex_join' functions with the sample data below.
library(fuzzyjoin)
library(dplyr)
library(readr)
# Create 'user' data frame
user <- read_csv(
"name, regex_name
kan, ^kan
john, john
hideaki, hideaki
")
# Create 'activity' data frame
activity <- read_csv(
"username, type
kanaugust, beta
hideaki_test, alpha
john_test, beta
betauser1, beta
kan_test, beta
chakakan, beta
")
# Run regex_left_join() function
activity %>%
regex_left_join(user, by = c("username" = "regex_name"))
The result doesn't look right.
> activity %>%
+ regex_left_join(user, by = c("username" = "regex_name"))
Source: local data frame [6 x 4]
username type name regex_name
(chr) (chr) (chr) (chr)
1 kanaugust beta NA NA
2 hideaki_test alpha hideaki hideaki
3 john_test beta NA NA
4 betauser1 beta kan ^kan
5 kan_test beta kan ^kan
6 chakakan beta john john
I was expecting it to be something like below.
username type name regex_name
(chr) (chr) (chr) (chr)
1 kanaugust beta kan ^kan
2 hideaki_test alpha hideaki hideaki
3 john_test beta john john
4 betauser1 beta NA NA
5 kan_test beta kan ^kan
6 chakakan beta NA NA
Similar to how bedtools intersect works, it would be great if the interval/genome join methods in fuzzyjoin would also calculate (maybe optionally) the number of bases in the intersection (i.e. the length of the overlap) and append it to the results table.
I wanted to give this package a try, but this error came up.
df1 <-
structure(
list(
lake = c(
"McCullough",
"McCullough",
"Partridge",
"Randall",
"Tomahawk",
"Ward",
"White Sand",
"Allequash",
"Island",
"Katinka"
)
),
.Names = "lake",
class = c("tbl_df", "data.frame"),
row.names = c(NA, -10L)
)
df2 <-
structure(
list(
lakename = c(
"Allequash Lake",
"Anvil Lake",
"Arrowhead Lake",
"Bass Lake",
"Big Lake",
"Birch Lake",
"Ballard Lake",
"Big Muskellunge Lake",
"Black Oak Lake",
"Big Portage Lake"
)
),
.Names = "lakename",
class = c("tbl_df",
"data.frame"),
row.names = c(NA, -10L)
)
names(df1)
## [1] "lake"
names(df2)
## [1] "lakename"
stringdist_left_join(df1, df2, by = c("lake" = "lakename"), max_dist = 1)
## Error: data_frames can only contain 1d atomic vectors and lists
Any idea?
Thank you
Is there a reason why this isn't supported yet? purrr is imported, so it is just a matter of applying purrr::as_mapper to the functions; I feel it's a low-hanging fruit that would make the package feel a lot more tidyverse-friendly.
The following piece of code fixes fuzzy_join and all functions built on it:
additional_code <-quote({
if(!is.null(match_fun)){
if(is.list(match_fun))
match_fun <- map(match_fun,as_mapper,...)
else
match_fun <- as_mapper(match_fun)
}
if(!is.null(multi_match_fun)){
multi_match_fun <- as_mapper(multi_match_fun)
}
if(!is.null(index_match_fun)){
index_match_fun <- as_mapper(index_match_fun)
}
})
Let's test it:
library(fuzzyjoin)
trace(fuzzyjoin::fuzzy_join,additional_code, at = 2)
sdata <- structure(
list(byear = c(1990, 1995, 2000, 2005),
eyear = c(1995, 2000, 2005, 2010),
val = c(3, 1, 5, 6)), row.names = c(NA, -4L
),
class = c("tbl_df", "tbl", "data.frame"))
fdate <- structure(
list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 5, 5, 5, 5, 8, 8, 8, 8, 13, 13, 13),
fyear = c(1998, 1999, 2000, 2001, 1998, 1999, 2000, 2001, 2002, 2003,
1998, 1999, 2000, 2001, 2002, 2003, 1998, 1999, 2000, 2001,
1998, 1999, 2000, 2001, 1998, 1999, 2000)),
row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))
identical(
fuzzy_inner_join(sdata, fdata, by=c("val"="id"), ~.x > .y),
fuzzy_inner_join(sdata, fdata, by=c("val"="id"), `>`))
# Tracing fuzzy_join(x, y, by, match_fun, mode = "inner", ...) step 2
# Tracing fuzzy_join(x, y, by, match_fun, mode = "inner", ...) step 2
# [1] TRUE
Would you like a PR?
Hello!
As shown in the examples, stringdist_left_join can return multiple matches from the y dataset for a single row of the x dataset. In my case, the result of the match includes:
label.x label.y
Merton Merton
Merton Melton
Is it possible to tell the function that if it finds an exact match, it should retain it and move on to the next line?
Thank you!
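There's no built-in option for this as far as I know, but one workaround is to request a distance_col and then keep only the closest match per row of x with dplyr. A minimal sketch, using hypothetical two-row data to reproduce the Merton/Melton situation:

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical data illustrating the situation above
x <- data.frame(label = "Merton", stringsAsFactors = FALSE)
y <- data.frame(label = c("Merton", "Melton"), stringsAsFactors = FALSE)

stringdist_left_join(x, y, by = "label", max_dist = 1, distance_col = "dist") %>%
  group_by(label.x) %>%
  filter(dist == min(dist)) %>%   # keep only the closest match per row of x
  ungroup()
```

An exact match has distance 0, so it always wins; note that ties (two matches at the same distance) would still produce multiple rows unless you break them explicitly.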
Hi,
first let me say, that I really like your package, especially the genome_join
functions, which save me a lot of conversions between data frames and GenomicRanges
.
But a few methods for dealing with genomic intervals were missing so I wrote a package (tidygenomics) to fill those gaps.
Because I am using your genome_join implementation and mimicking your nomenclature, I wanted to ask whether this is okay with you? And secondly: so far only the GitHub version of fuzzyjoin supports genome_join; do you have plans to release it to CRAN?
Best Regards,
Constantin
Is there any way to match on normalized string distance?
I.e., for osa, lv, and dl, dividing by the nchar of the longer word, or for lcs, by the combined lengths of both?
1 letter off of a 4 letter string is quite different than 1 letter off of a 20 character string.
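Normalized distance isn't built in as far as I know, but fuzzy_join accepts an arbitrary vectorized match_fun, so a normalized threshold can be expressed directly. A sketch, where norm_match_fun and max_norm_dist are hypothetical names (not part of the package):

```r
library(fuzzyjoin)
library(stringdist)

# Hypothetical matcher: osa distance divided by the longer string's length
norm_match_fun <- function(v1, v2, max_norm_dist = 0.25) {
  d <- stringdist::stringdist(v1, v2, method = "osa")
  d / pmax(nchar(v1), nchar(v2)) <= max_norm_dist
}

df1 <- data.frame(name = c("Mississippi", "Ohio"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Mississipi", "Ohi0"), stringsAsFactors = FALSE)

fuzzy_inner_join(df1, df2, by = "name", match_fun = norm_match_fun)
```

Here one edit in an 11-character string (ratio ~0.09) and one edit in a 4-character string (ratio 0.25) both pass the 0.25 threshold, but the threshold scales with word length as requested.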
The semi-join appears to work but doesn't add joined variables.
Working from an example I had saved from twitter a while back.
library(tidyverse)
library(fuzzyjoin)
regexes <- read_csv(
"regex, type
Windows, Windows-ish
Red Hat|CentOS|Fedora, Fedora-ish
Ubuntu|Debian,Debian-ish
CoreOS|Amazon,Amazon-ish")
os <- read_lines("https://rud.is/dl/os.txt", na = "NA")
os <- as.data.frame(os, na.rm = TRUE)
os <- os[complete.cases(os),]
os <- as.data.frame(os, na.rm = TRUE)
os_left <- os %>%
  replace_na(list(regex = "Unknown")) %>%
  regex_left_join(regexes, c(os = "regex"))
os_semi <- os %>%
  replace_na(list(regex = "Unknown")) %>%
  regex_semi_join(regexes, c(os = "regex"))
os_left returns a data frame with 3 variables
os_semi returns one variable (a factor)
Hey thanks for the wonderful package.
I find that there are situations where I need to merge by multiple variables, but only one of those variables is fuzzy. Let's say I have a list of school names and their zip codes, plus a list of individuals who state the name of their school and their zip code. I expect there to be some error in the school name, but the ZIP code should be correct, so I try something like this:
library(fuzzyjoin)
library(dplyr)
# correct names
official <- data.frame(stringsAsFactors = FALSE,
                       name = c("School One", "School Two", "School Three"),
                       ZIP = c("91427", "91428", "01427"))
# slightly wrong names
response <- data.frame(stringsAsFactors = FALSE,
                       name = c("School Oune", "School Two", "School Thee"),
                       ZIP = c("91427"))
# inner join them
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = 2)
# don't understand why there are duplicate rows
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = 2, distance_col = "dist")
# don't understand why distance_col changes the number of rows, but this is the number of rows I expect
# ideally I'd only like a match when the ZIPs are identical,
# but I can't specify different max distances for different columns:
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = c(2, 0), distance_col = "dist")
So a few things:
Why did the inner join generate duplicate rows?
Why does specifying a distance_col change the number of rows in the inner join?
How do I approach cases where I want a different max distance for different variables?
Would it be possible to make the max_dist
parameter accept a vector of distances for each by
variable?
Regards,
Carl
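Until max_dist accepts a vector, one workaround is to join exactly on the column that must match (ZIP) with an ordinary dplyr join, then filter the fuzzy column by string distance. A sketch reusing the data and the max distance of 2 from the example above:

```r
library(dplyr)
library(stringdist)

official <- data.frame(stringsAsFactors = FALSE,
                       name = c("School One", "School Two", "School Three"),
                       ZIP  = c("91427", "91428", "01427"))
response <- data.frame(stringsAsFactors = FALSE,
                       name = c("School Oune", "School Two", "School Thee"),
                       ZIP  = c("91427", "91427", "91427"))

# Exact join on ZIP, then fuzzy filter on name (max distance 2)
official %>%
  inner_join(response, by = "ZIP", suffix = c(".official", ".response")) %>%
  filter(stringdist(name.official, name.response, method = "osa") <= 2)
```

This is equivalent to max_dist = c(2, 0) for the name/ZIP pair, and avoids the cross-product of near-miss ZIPs entirely.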
Hi,
I cannot get the examples to work.
First I had to write data("state")
instead of data("state.center")
.
Then I get the error Joining by: c("longitude", "latitude") Error: No variables selected
when running
states <- data_frame(state = state.name,
                     longitude = state.center$x,
                     latitude = state.center$y)
s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)
pairs <- s1 %>%
  geo_inner_join(s2, max_dist = 200) %>%
  filter(state1 != state2)
Any plans to support dplyr style calls to external databases, even for some subset of functions?
I'm guessing that's out of scope for this package, but it would be nice to have that documented in perhaps a more obvious way(?) Anyway, really cool package here, thanks!
df1 <- data.frame(color = c("yellow", "blue", "green", "purple", NA),
blah = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(color = c("yallow", "bloooooo", "gren", "purple", "0range", NA),
test = 6:11, stringsAsFactors = FALSE)
inner_join(df1, df2) # works
stringdist_inner_join(df1, df2) # fails
I'm not sure how problematic this is, as in practice it is typically undesirable to have NAs join with NAs. But I'd expected it to behave like dplyr's joins. If this errors intentionally on NAs in the join variables, maybe clarify the error message?
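A possible workaround until NA handling matches dplyr's is to drop rows with NA keys before the fuzzy join (data frames copied from the example above):

```r
library(dplyr)
library(fuzzyjoin)

df1 <- data.frame(color = c("yellow", "blue", "green", "purple", NA),
                  blah = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(color = c("yallow", "bloooooo", "gren", "purple", "0range", NA),
                  test = 6:11, stringsAsFactors = FALSE)

# Drop NA join keys first so the fuzzy join never sees them
stringdist_inner_join(filter(df1, !is.na(color)),
                      filter(df2, !is.na(color)),
                      by = "color", max_dist = 2)
```

Note this deliberately never joins NA to NA, which differs from dplyr's default but is usually what's wanted in practice.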
Wondering: when you distance_join() or geo_join() two tables, one with M coordinates and the other with N (say M <= N), is your current implementation exhaustive, i.e., running in O(M*N) time? If so, consider building a k-d tree (of kNN fame) from the smaller table in O(M*log(M)) time and querying it N times for the point closest to each of P1, P2, ..., PN from the longer table, joining only when dist(P, closest(P)) < max_dist. This runs in O((M+N)*log(M)), which is at least M/(2*log(M)) times faster than the exhaustive method.
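A sketch of that idea, assuming the RANN package (which exposes a k-d tree through nn2); for each point in the larger table it finds the nearest neighbor in the smaller one within max_dist:

```r
# Sketch assuming the RANN package (k-d tree nearest-neighbor search)
library(RANN)

set.seed(1)
a <- matrix(runif(200), ncol = 2)   # M = 100 points (smaller table)
b <- matrix(runif(400), ncol = 2)   # N = 200 points (larger table)
max_dist <- 0.05

# Radius-limited 1-NN query: nn.idx is 0 where no neighbor lies within max_dist
nn <- RANN::nn2(a, b, k = 1, searchtype = "radius", radius = max_dist)
matched <- which(nn$nn.idx[, 1] > 0)
pairs <- data.frame(b_row = matched, a_row = nn$nn.idx[matched, 1])
```

Note this uses Euclidean distance; a real geo_join() would need great-circle distance, e.g. by querying in projected coordinates or by post-filtering candidates with the exact haversine formula.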
In reference to README: "I'm interested in ideas for use cases that can be provided as vignettes."
Hi,
I thought to share an interesting data example.
Here is a public dataset: Aboriginal and Torres Strait Islander Languages by www.aiatsis.gov.au
https://collection.aiatsis.gov.au/datasets/austlang/001
This is a CSV, comma separated, with quotes.
Languages were transmitted orally and were never written down, so there are various alternative spellings referring to the same language.
language_name and language_synonym are the two variables to look up; this would be a pretty cool example for fuzzyjoin.