dgrtwo / fuzzyjoin

Join tables together on inexact matching
License: Other
geo_join does a full m*n comparison of points. This works for moderately sized datasets but becomes intractable in both time and memory.
The right geo_join() implementation would likely use k-d trees to build a spatial index of one table, then look up points from the other, while taking advantage of the limited max_dist argument. This would turn it (on average) from O(m*n) time to O(m*log(n)) time. (Similar approaches could be used for functions such as distance_join, and on a related note B-K trees could be used for Levenshtein distance.)
I may or may not have time to implement this but it would be a top-notch pull request.
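As a sketch of what that spatial-index approach could look like (not part of fuzzyjoin; the helper name and the use of the RANN package are my own assumptions, and a Euclidean radius search on lon/lat is only a planar approximation, so candidate pairs would still need an exact geodesic filter such as geosphere::distGeo):

```r
# Candidate-pair search with a k-d tree via RANN::nn2, limited by max_dist.
# This replaces the full m*n comparison with one radius query per row of x.
library(RANN)

kd_candidates <- function(x, y, max_dist, k = 10) {
  # build the tree on y, query with every point of x; indices outside the
  # radius come back as 0 and are dropped
  nn <- RANN::nn2(y, x, k = min(k, nrow(y)),
                  searchtype = "radius", radius = max_dist)
  hits <- which(nn$nn.idx > 0, arr.ind = TRUE)
  data.frame(x_row = hits[, "row"], y_row = nn$nn.idx[hits])
}

x <- matrix(c(0, 0, 10, 10), ncol = 2, byrow = TRUE)
y <- matrix(c(0.1, 0, 50, 50), ncol = 2, byrow = TRUE)
kd_candidates(x, y, max_dist = 1)  # only the (1, 1) pair survives
```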
First, thanks for putting this package together. I've been trying for quite some time to wrap my head around how to accomplish exactly this task.
Unfortunately, when I try running any of the examples, I get an error message:
Error in sapply(x_indices_l, length) * sapply(y_indices_l, length) : non-numeric argument to binary operator
I'm not familiar enough with base R to know how to fix the problem, but fortunately, it looks like someone else is: b0cfdd9. What's the proper way to request integrating this fix into your repository? Is it appropriate to create a pull request based on someone else's code?
In addition to providing examples of match_funs (#22), it looks like match_fun gets used as multi_match_fun if match_fun is singular and there are multiple columns in the by argument?
"If only one function is given it is used on all column pairs."
If so, then multi_by and multi_match_fun seem confusing and redundant to me.
I see the note "Note that as of now, you cannot give both match_fun and multi_match_fun- you can either compare each column individually or compare all of them."
Perhaps multi_by and multi_match_fun should be removed in the future?
Basically, the following definitions seem redundant, and I can't tell what the differences are:
by
Columns of each to join
match_fun
Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
multi_by
Columns to join, where all columns will be used to test matches together
multi_match_fun
Function to use for testing matches, performed on all columns in each data frame simultaneously
The error occurred in a situation like the one exemplified below:
> fuzzy_left_join(mtcars, mtcars,
+ by = c("gear" = "cyl", "carb" = "cyl"),
+ match_fun = list(`==`, `==`))
Error: Length of match_fun not equal to columns specified in 'by'.
First, the error message is misleading, because the lengths fail to match only because fuzzyjoin itself made them unequal when setting up the comparisons. Second, the error should not occur at all: this is not a bug but user behavior that should be perfectly valid (expecting each column to be used only once does not seem reasonable). Moreover, the behavior is inconsistent, as the following code does not produce errors:
fuzzy_left_join(mtcars, mtcars,
by = c("cyl" = "gear", "cyl" = "carb"),
match_fun = list(`==`, `==`))
The practical use case is when one wants to do left joins like x.start < y.date & y.date < x.end, equivalent to SQL's: SELECT * FROM table1 x LEFT JOIN table2 y ON y.value BETWEEN x.start AND x.end
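A minimal sketch of that BETWEEN-style join (the column names and data here are illustrative, not from the thread); note the repeated column is placed on the left-hand side of by, the direction reported above to work:

```r
library(fuzzyjoin)

events <- data.frame(date = as.Date(c("2020-01-15", "2020-03-05")))
spans  <- data.frame(id    = 1:2,
                     start = as.Date(c("2020-01-01", "2020-02-01")),
                     end   = as.Date(c("2020-01-31", "2020-02-28")))

# y.value BETWEEN x.start AND x.end, i.e. date >= start AND date <= end
fuzzy_left_join(events, spans,
                by = c("date" = "start", "date" = "end"),
                match_fun = list(`>=`, `<=`))
```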
I can't find an example for how match_fun works anywhere online
I saw this on SO: http://stackoverflow.com/questions/40632295/how-do-i-select-values-from-a-lookup-table-based-on-their-time-stamp, where the asker would like to match based on ranges... but it gave me an idea: your fuzzyjoin could also match based on the closest timestamp. Could be a good feature for this package, no?
difference_right_join (and probably others) do not create the designated distance_col when there are no overlapping rows. This makes use of these functions in a general pipeline problematic because the column you expect to exist does not. Example below
library(fuzzyjoin)
a <- data.frame(replicate(2,sample(20:30,10,rep=TRUE)))
b <- data.frame(replicate(2,sample(1:10,10,rep=TRUE)))
difference_right_join( a, b, by='X1', max_dist = 1, distance_col='distance')
#> X1.x X2.x X1.y X2.y
#> 1 NA NA 10 4
#> 2 NA NA 7 5
#> 3 NA NA 9 2
#> 4 NA NA 1 10
#> 5 NA NA 8 4
#> 6 NA NA 5 1
#> 7 NA NA 4 9
#> 8 NA NA 1 3
#> 9 NA NA 9 5
#> 10 NA NA 3 6
Created on 2019-08-05 by the reprex package (v0.3.0)
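A hedged workaround until that's fixed: add the column yourself whenever the join drops it (data and names below are just illustrative):

```r
library(fuzzyjoin)

set.seed(42)
a <- data.frame(X1 = sample(20:30, 10, replace = TRUE))
b <- data.frame(X1 = sample(1:10, 10, replace = TRUE))

res <- difference_right_join(a, b, by = "X1", max_dist = 1,
                             distance_col = "distance")

# guard against the missing column when there were no overlapping rows
if (!"distance" %in% names(res)) res$distance <- NA_real_
```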
In this example 100-110 is not joined with 111-112, but 10-11 is joined with 11.1-11.2.
x1 <- frame_data(~ xmin, ~xmax,
100, 110,
10, 11)
x2 <- frame_data(~ xmin, ~xmax,
111, 112,
11.1, 11.2)
interval_inner_join(x1, x2, by = c("xmin","xmax"), maxgap = 0)
xmin.x xmax.x xmin.y xmax.y
<dbl> <dbl> <dbl> <dbl>
1 10 11 11.1 11.2
This looks like it is due to IRanges storing ranges as integers:
r1 <- IRanges::IRanges(as.numeric(x1[2,1]), as.numeric(x1[2,2]))
r2 <- IRanges::IRanges(as.numeric(x2[2,1]), as.numeric(x2[2,2]))
IRanges::findOverlaps(r1, r2)
This probably isn't going to change on the IRanges side, so maybe just adding a disclaimer to interval_join would be appropriate. The function behaves fine if xmin and xmax are scaled by some large value.
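A sketch of that scaling workaround (the factor is arbitrary; note that maxgap is then interpreted on the scaled axis, so any nonzero maxgap you rely on would need scaling too):

```r
library(fuzzyjoin)

x1 <- data.frame(xmin = c(100, 10),   xmax = c(110, 11))
x2 <- data.frame(xmin = c(111, 11.1), xmax = c(112, 11.2))

# work around the integer coercion in IRanges by scaling the endpoints up
f <- 1e3
interval_inner_join(x1 * f, x2 * f, by = c("xmin", "xmax"), maxgap = 0)
# after scaling, the spurious 10-11 vs 11.1-11.2 match caused by the
# integer rounding no longer appears
```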
I'm a bit unclear on why stringdist_left_join doesn't yield a tibble with x * y rows if max_dist is left unspecified.
Any word can be modified into any other word by replacing every letter, so shouldn't a row be returned for every combination of x and y?
I realize that this would essentially produce the same result as stringdist::stringdistmatrix(), and it isn't actually the result I'm looking for, but it just seems as though stringdist_left_join has a hidden max_dist cutoff, and I'm curious, if that's the case, what it is or how it's determined.
Thanks!
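For what it's worth, the stringdist_join family does have an explicit cutoff: its max_dist argument defaults to 2 (per the documented signature), which is why distant combinations disappear rather than all x * y pairs being returned. A sketch with illustrative data:

```r
library(fuzzyjoin)

x <- data.frame(a = "kitten", stringsAsFactors = FALSE)
y <- data.frame(a = c("sitting", "kitten!", "mitten"),
                stringsAsFactors = FALSE)

# with the default max_dist = 2, "sitting" (edit distance 3) is dropped
stringdist_left_join(x, y, by = "a")

# raise the cutoff to get every combination back
stringdist_left_join(x, y, by = "a", max_dist = Inf)
```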
More examples: I've used this package in other powerful ways, but on proprietary data. I'm interested in ideas for use cases that can be provided as vignettes.
fuzzy_left_join(A, B,
  by = c(
    "key" = "key",
    "date" = "startDate",
    "date" = "endDate"
  ),
  match_fun = list(`==`, `>=`, `<=`)
)
When I installed the github version, I kept getting the following error when pressing tab, and when loading the package for the first time:
Error in fetch(key) :
lazy-load database '~/R/x86_64-pc-linux-gnu-library/3.4/fuzzyjoin/help/fuzzyjoin.rdb' is corrupt
I wonder if there is a small mistake in the documentation of geo_join:
By default geosphere::distGeo outputs distances in meters, and the other geosphere functions probably do too.
So if the unit given to geo_join is not miles, the distance is calculated in meters, I guess? But in the documentation the other possible value for unit is "km", so I'm confused.
I have followed the exact steps from the reference doc in the geo_join section.
data("state")
state.name
state.center
states <- data_frame(state = state.name,
longitude = state.center$x,
latitude = state.center$y)
s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)
pairs <- s1 %>%
geo_inner_join(s2, max_dist = 200) %>%
filter(state1 != state2)
library(ggplot2)
ggplot(pairs, aes(x = longitude.x, y = latitude.x,
xend = longitude.y, yend = latitude.y)) +
geom_segment(color = "red") +
borders("state") +
theme_void()
But the result seems odd. Instead of joining all the states within 200 miles of each other, the states in the result seem to be pretty far apart.
Source: local data frame [74 x 6]
state1 longitude.x latitude.x state2 longitude.y latitude.y
(chr) (dbl) (dbl) (chr) (dbl) (dbl)
1 Alaska -127.2500 49.2500 Louisiana -92.2724 30.6181
2 Alaska -127.2500 49.2500 Montana -109.3200 46.8230
3 Arizona -111.6250 34.2192 Connecticut -72.3573 41.5928
4 Arizona -111.6250 34.2192 Indiana -86.0808 40.0495
5 Arizona -111.6250 34.2192 Minnesota -94.6043 46.3943
6 Colorado -105.5130 38.6777 North Dakota -100.0990 47.2517
7 Connecticut -72.3573 41.5928 Arizona -111.6250 34.2192
8 Connecticut -72.3573 41.5928 Georgia -83.3736 32.3329
9 Connecticut -72.3573 41.5928 Minnesota -94.6043 46.3943
10 Connecticut -72.3573 41.5928 New Hampshire -71.3924 43.3934
.. ... ... ... ... ... ...
It could be that my understanding of this function is wrong, or that the way I'm using it is wrong. But because I'm also getting some strange results with the 'regex_join' functions, I just wanted to check if there might be some known issue with the latest versions on CRAN and GitHub.
See those examples:
library(dplyr)
library(ggplot2)
data(diamonds)
d <- data_frame(regex_name = c("^Idea", "mium", "Good"))
This works fine while diamonds is a tibble:
diamonds %>%
select(cut) %>%
regex_left_join(d, by = c(cut = "regex_name"))
# A tibble: 53,940 x 2
cut regex_name
<ord> <chr>
1 Ideal ^Idea
2 Premium mium
3 Good Good
4 Premium mium
This does not work:
diamonds %>%
select(cut) %>%
as.data.frame() %>%
regex_left_join(d, by = c(cut = "regex_name"))
Error: cannot convert object to a data frame
Fails to rename duplicate variable names in the resulting fuzzy-joined data table (note the two year columns below).
library(magrittr)
library(fuzzyjoin)
data_1 <- data.frame(town = c("Boston", "Philadelphia", "Salt Lake City", "Jacksonville"),
year = c(1900, 1901, 1902,1903))
data_2 <- data.frame(city = c("Bsoton", "Philedelphia", "Salt Lake City"),
state = c("Massachusetts", "Pennsylvania", "Utah"),
pop = c(10, 20, 30),
year = c(2008, 2009, 2010))
test_join <- data_1 %>%
stringdist_join(data_2, by = c(town = "city"), max_dist = 1, mode = "left")
Output:
town year city state pop year
1 Boston 1900 Bsoton Massachusetts 10 2008
2 Philadelphia 1901 Philedelphia Pennsylvania 20 2009
3 Salt Lake City 1902 Salt Lake City Utah 30 2010
4 Jacksonville 1903 <NA> <NA> NA NA
I was trying to regex join two vectors:
sentences <- c("qwertyuiop", "asdfgfh", "zxcvbn")
patterns <- c("wer", "asd")
regex_right_join(sentences, patterns)
# Error in UseMethod("groups") :
# no applicable method for 'groups' applied to an object of class "character"
Maybe if I previously convert them to a data.frame?
regex_right_join(data.frame(sentences),
data.frame(patterns),
by=c(sentences="patterns"))
# Error: cannot convert object to a data frame
It seems that it's dropping a dimension (from data frame to vector) somewhere...
So, finally, I tried:
regex_right_join(data.frame(sentences, 0),
data.frame(patterns, 0),
by=c(sentences="patterns")) %>%
select(sentences, patterns)
# sentences patterns
#1 qwertyuiop wer
#2 asdfgfh asd
Could you check where it is dropping a dimension (from data frame to vector)?
And would you consider adding the functionality to join vectors?
I'm going to try to add two-by-two comparison: right now we'd like to join tables based on geographical distance, for which you use two columns in each data frame.
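A sketch of such a two-column comparison using fuzzy_join's multi_by/multi_match_fun (the 5 km threshold and the data are illustrative assumptions on my part):

```r
library(fuzzyjoin)
library(geosphere)

x <- data.frame(lon = c(-71.06, -118.24), lat = c(42.36, 34.05))
y <- data.frame(lon = c(-71.05, 0.0),     lat = c(42.37, 51.50))

# multi_match_fun receives one matrix per side, holding the multi_by
# columns in order (here lon, lat, which is what distHaversine expects)
within_5km <- function(p1, p2) {
  geosphere::distHaversine(p1, p2) / 1000 < 5
}

fuzzy_join(x, y, multi_by = c("lon", "lat"),
           multi_match_fun = within_5km, mode = "inner")
```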
I got the error below when I used the fuzzy join package.
df and dat are data frames with columns of company-name text. I am trying to get the subset of names in df that don't match those in dat.
Any help will be appreciated!!
df_test <- stringdist_anti_join(df, dat)
Joining by: "ER_NAME2"
Error in outer(u_x, u_y, mf, ...) :
long vectors not supported yet: memory.c:1648
df has 287626 rows and dat has 11880 rows of data
sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] tcltk grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] fuzzyjoin_0.1 stringdist_0.9.4.1 Hmisc_3.16-0 ggplot2_1.0.1 Formula_1.2-1 survival_2.38-1 lattice_0.20-31
[8] stringr_1.0.0 sqldf_0.4-10 RSQLite_1.0.0 DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10 dplyr_0.4.3
[15] plyr_1.8.3 readr_0.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 RColorBrewer_1.1-2 tools_3.2.1 rpart_4.1-9 digest_0.6.8 gtable_0.1.2 parallel_3.2.1
[8] gridExtra_0.9.1 cluster_2.0.1 nnet_7.3-9 R6_2.1.0 foreign_0.8-63 latticeExtra_0.6-26 purrr_0.2.1
[15] tidyr_0.4.1 reshape2_1.4.1 magrittr_1.5 scales_0.2.5 splines_3.2.1 MASS_7.3-40 assertthat_0.1
[22] colorspace_1.2-6 stringi_0.5-5 acepack_1.3-3.3 lazyeval_0.1.10 munsell_0.4.2 chron_2.3-47
First of all, congratulations on this great package. I was struggling a bit to find out why I could not geo-match my data. I think it is a bit unfortunate that there is no exact longitude and latitude variable matching in the geo_join function.
I made a small amendment in the documentation of the function. However, I think that clearer matching would be more appropriate, given that Google and many other APIs use lng for longitudes by default.
states <- data_frame(state = state.name,
                     lng = state.center$x,
                     lat = state.center$y)
s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)
pairs <- s1 %>%
  geo_inner_join(s2, by = c("lng", "lat"),
                 max_dist = 200) %>%
  filter(state1 != state2)
Another issue that makes the whole code very difficult to read and amend is that only very few functions are directly imported. For example, in geo_join, dplyr::data_frame and dplyr::common_by are not referenced or imported. It is not clear to me why the above modification should not work. Given that in geo-matching lat.x must strictly match lat.y and lon.x must match lon.y, it would be preferable to pass these arguments explicitly, for example like by.lat = c("lat", "lat") and by.lon = c("lon", "lng").
Hello! Please excuse any obvious or dumb errors - I am new to R and tried looking up the answers beforehand, I promise!
I am trying to join two tables using the R "fuzzyjoin" package; specifically, to fuzzy-join two tables of company names. I have one data frame of 5000 company names and one data frame of 1600 company names. There are no other columns besides the company names.
Using the package, I have:
NewTable <- AccountsList1 %>%
stringdist_inner_join(AccountsList2, by = NULL)
However, I got two errors:
Joining by: "Accounts"
Error in dists[include] <- stringdist::stringdist(v1[include], v2[include], :
NAs are not allowed in subscripted assignments
and
50: In stri_length(string) :
invalid UTF-8 byte sequence detected. perhaps you should try calling stri_enc_toutf8()
So then I removed N/As via [!is.na(AccountsList1)]
and forced UTF-8 via
stri_enc_toutf8(AccountsList1, is_unknown_8bit = FALSE, validate = FALSE)
However, when I rerun I get the exact same errors... Does anyone have any advice? Thank you!
Is it possible to allow max_dist to be a vector, so that it can be different for each column passed to stringdist_join's by argument?
Something like this: stringdist_inner_join(df1, df2, by = c("col1", "col2"), max_dist = c(1, 2))
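Until or unless that's supported, a hedged workaround is to drop down to fuzzy_join's per-column match functions, each closing over its own threshold (helper name and data are illustrative):

```r
library(fuzzyjoin)
library(stringdist)

# factory: a vectorized match_fun with its own maximum distance
dist_at_most <- function(d) {
  function(x, y) stringdist::stringdist(x, y) <= d
}

df1 <- data.frame(col1 = "kitten", col2 = "sitting",
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = "mitten", col2 = "sittin",
                  stringsAsFactors = FALSE)

fuzzy_inner_join(df1, df2, by = c("col1", "col2"),
                 match_fun = list(dist_at_most(1), dist_at_most(2)))
```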
Error: vector memory exhausted (limit reached?)
I’m getting the above error when trying to stringdist_left_join two tables - the left table is 185K rows and the right table is 4.37M rows. The R session never appears to use more than 6GB of memory (according to Activity Monitor) while I’m on a machine with 32GB of memory with available memory in the range of 10GB when the vector memory exhausted error arises. I’ve followed various recommendations to increase R_MAX_VSIZE to a large number - 700GB as shown in the Sys.getenv() output shown below. All this to say it appears that stringdist_left_join does not pay attention to R_MAX_VSIZE. Is there some other setting I can change to use more of the available memory on my machine?
Sys.getenv()
Apple_PubSub_Socket_Render /private/tmp/com.apple.launchd.sSrL33I64Z/Render
COLUMNS 80
COMMAND_MODE unix2003
DISPLAY /private/tmp/com.apple.launchd.tTt2eLd6xQ/org.macosforge.xquartz:0
DYLD_FALLBACK_LIBRARY_PATH /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
DYLD_LIBRARY_PATH /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
EDITOR vi
HOME /Users/geoffreysnyder
LD_LIBRARY_PATH :@JAVA_LD@
LINES 24
LN_S ln -s
LOGNAME geoffreysnyder
MAKE make
PAGER /usr/bin/less
PATH /usr/local/bin:/usr/local/mysql/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:~/Library/Python/3.7/bin
PWD /Users/geoffreysnyder/repos/Data_Load/code
R_ARCH
R_BROWSER /usr/bin/open
R_BZIPCMD /usr/bin/bzip2
R_DOC_DIR /Library/Frameworks/R.framework/Resources/doc
R_GZIPCMD /usr/bin/gzip
R_HOME /Library/Frameworks/R.framework/Resources
R_INCLUDE_DIR /Library/Frameworks/R.framework/Resources/include
R_LIBS_SITE
R_LIBS_USER ~/Library/R/3.5/library
R_MAX_VSIZE 700GB
R_PAPERSIZE a4
R_PDFVIEWER /usr/bin/open
R_PLATFORM x86_64-apple-darwin15.6.0
R_PRINTCMD lpr
R_QPDF /Library/Frameworks/R.framework/Resources/bin/qpdf
R_RD4PDF times,inconsolata,hyper
R_SESSION_TMPDIR /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T//RtmpJdct7Y
R_SHARE_DIR /Library/Frameworks/R.framework/Resources/share
R_SYSTEM_ABI osx,gcc,gxx,gfortran,?
R_TEXI2DVICMD /usr/local/bin/texi2dvi
R_UNZIPCMD /usr/bin/unzip
R_ZIPCMD /usr/bin/zip
SECURITYSESSIONID 186a8
SED /usr/bin/sed
SHELL /bin/zsh
SHLVL 0
SSH_AUTH_SOCK /private/tmp/com.apple.launchd.UNOOV1wxev/Listeners
SUBLIMEREPL_AC_IP 127.0.0.1
SUBLIMEREPL_AC_PORT None
TAR /usr/bin/tar
TMPDIR /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T/
TZ America/Los_Angeles
USER geoffreysnyder
XPC_FLAGS 0x0
XPC_SERVICE_NAME 0
__CF_USER_TEXT_ENCODING 0x1F7:0x0:0x0
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 RJDBC_0.2-7.1 rJava_0.9-10 DBI_1.0.0 fuzzyjoin_0.1.4 readr_1.2.0 dplyr_0.7.8
[8] lubridate_1.7.4 stringr_1.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 tidyr_0.8.2 assertthat_0.2.0 R6_2.3.0 magrittr_1.5 pillar_1.2.3
[7] rlang_0.3.0.1 stringi_1.2.4 tools_3.5.1 glue_1.3.0 purrr_0.2.5 hms_0.4.2.9000
[13] compiler_3.5.1 pkgconfig_2.0.2 bindr_0.1.1 tidyselect_0.2.5 tibble_1.4.2
When trying to join two data frames, an error occurs if one of them has only one column:
stringdist_inner_join(df1, df2)
Error: cannot convert object to a data frame
Maybe it's dropping a dimension (from a one-column data frame to a vector)?
I have two tibbles with a column called pos on which I am trying to perform a distance_inner_join.
> df1
# A tibble: 1,398 x 3
gene_name strand pos
<chr> <fct> <dbl>
1 1500015O10Rik + 43730601
2 1700001G17Rik + 33669823
3 1700003I22Rik + 56018977
4 1700007P06Rik + 187125137
5 1700016C15Rik + 177729813
6 1700016L21Rik + 80445931
7 1700019A02Rik - 53158746
8 1700019D03Rik - 52925246
9 1700019P21Rik - 138876645
10 1700020N18Rik - 91405182
# ... with 1,388 more rows
> df2
# A tibble: 58 x 3
type areaStat pos
<chr> <dbl> <dbl>
1 imprinted -28.5 37977873
2 imprinted -23.4 55027489
3 imprinted -21.5 123876627
4 imprinted -19.2 95665495
5 imprinted 18.3 172148438
6 imprinted 18.4 97938015
7 imprinted 18.5 31347008
8 imprinted 20.7 96774378
9 imprinted 21.3 154282393
10 imprinted 23.0 118973902
# ... with 48 more rows
When I run distance_inner_join, I get the following error:
> distance_inner_join(
+ df1,
+ df2,
+ by="pos", max_dist=1e6)
Error in rowSums((v1 - v2)^2) :
'x' must be an array of at least two dimensions
and the traceback:
6. stop("'x' must be an array of at least two dimensions")
5. rowSums((v1 - v2)^2)
4. multi_match_fun(ux_input, uy_input)
3. fuzzy_join(x, y, multi_by = by, multi_match_fun = match_fun, mode = mode)
2. distance_join(x, y, by, max_dist = max_dist, method = method, mode = "inner", distance_col = distance_col)
1. distance_inner_join(gene_tss %>% filter(chr == .x) %>% select(-chr), dmr_midpoint %>% filter(chr == .x) %>% select(-chr), by = "pos", max_dist = 1e+06)
Installed the latest version from GitHub today 2018-07-23.
I like the idea of this package, but it does not work well in practice for my needs. I wrote a knitr document explaining the problem here.
In brief: often the task is to match up two n-length vectors of strings against each other 1-to-1. The present join algorithm does not enforce 1-to-1 joins, i.e. sometimes one string gets joined to two others, and sometimes one gets joined to none.
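For the 1-to-1 case specifically, one option outside fuzzyjoin (sketched here, not a fuzzyjoin feature) is to build the full distance matrix and solve the linear assignment problem, for example with clue::solve_LSAP:

```r
library(stringdist)
library(clue)

a <- c("apple", "banana", "cherry")
b <- c("chery", "aple", "bananna")

# full pairwise distance matrix, then a minimum-cost 1-to-1 assignment:
# every element of a is matched to exactly one element of b
m <- stringdist::stringdistmatrix(a, b, method = "osa")
assignment <- clue::solve_LSAP(m)
data.frame(a = a, b = b[as.integer(assignment)])
```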
I have the following setting:
DatasetA: approximately 20 thousand rows, coordinates are given as latA, lonA
DatasetB: approximately 2 million rows, coordinates are given as latB, lonB
Because the coordinates do not exactly match, I tried the following:
DatasetC <- DatasetA %>%
difference_left_join(DatasetB, by = c("latA" = "latB", "lonA" = "lonB"), max_dist = 2)
This works when I take a sample (e.g. 10%) from DatasetA but repeatedly crashes when using the entire dataset. Did you experience similar behaviour?
dplyr allows joining columns with differing names in the lhs and rhs via a named vector as the by argument. It would be great if genome_join/interval_join (and their variants) could support the same syntax.
Example:
require(dplyr)
require(tidyr)
require(fuzzyjoin)
arrayRes = structure(list(
phopho_position_start = c(862L, 426L, 518L, 556L, 519L, 127L),
phopho_position_end = c(863L, 427L, 519L, 557L, 520L, 128L),
ensembl_peptide_id = c("ENSP00000363708", "ENSP00000376679", "ENSP00000321606", "ENSP00000264229", "ENSP00000307093", "ENSP00000358142" ),
external_gene_name = c("BMPR2", "ABLIM1", "CRMP1", "KIAA1211", "MAP6", "SV2A")
), .Names = c("phopho_position_start", "phopho_position_end", "ensembl_peptide_id", "external_gene_name")) %>% tbl_df
hmmerSearchRes = structure(list(
target_name = c("ENSP00000385014", "ENSP00000385014", "ENSP00000385014", "ENSP00000233057", "ENSP00000233057", "ENSP00000233057" ),
ali_from = c(248L, 311L, 399L, 248L, 352L, 440L), ali_to = c(262L, 318L, 412L, 262L, 359L, 453L),
hmm_coverage = c(0.933333333333333, 0.466666666666667, 0.866666666666667, 0.933333333333333, 0.466666666666667, 0.866666666666667)),
.Names = c("target_name", "ali_from", "ali_to", "hmm_coverage")) %>% tbl_df
genome_join(arrayRes, hmmerSearchRes, by=c("ensembl_peptide_id"="target_name", "phopho_position_start"="ali_from", "phopho_position_end"="ali_to"))
It seems that if we use a list of match_funs, one for each matching variable, it is not possible to pass arguments into them?
https://stackoverflow.com/questions/44383510/r-passing-arguments-into-the-match-fun-function-in-fuzzyjoinfuzzy-join
Suggested feature:
Offer the stringdist, interval, etc. match_funs as helper functions that can easily be passed into fuzzy_join.
This would avoid problems like me trying to extract the stringdist function out of stringdist_join :)
Here's me trying to use the stringdist match_fun; can you help me figure out what I did wrong here?
Passing arguments into the match_fun function in fuzzyjoin::fuzzy_join
Related to #22 (documentation for match_funs, please!)
It would be great if the interval join methods in fuzzyjoin could also run with more than the two columns start and end, for example with an additional grouping column.
For the types of fuzzy joins that have some kind of a distance metric (e.g., a string distance for stringdist_join() and friends), would it be possible to have an option to return the distance measure as a new column in the resulting data frame?
Note: This is a cross-post of this Stack Overflow question.
Using fuzzyjoin 0.1.4 and R 3.5.3.
I'm adapting the example shown here, where I'd like to left-join some test results to existing session data, but I know that the tests may have been conducted three hours before or after the session was created, by the same user.
However, I am getting duplicate user_id columns as a result, despite specifying them in the by option, and the values themselves are not added to the data frame:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(fuzzyjoin)
sessions = tribble(
~user_id, ~session_id, ~session_created_at,
1, 1, "2019-01-01T15:30:00",
1, 2, "2019-01-02T00:00:30",
1, 3, "2019-01-04T12:30:00"
) %>%
mutate_at("session_created_at", as_datetime)
tests = tribble(
~user_id, ~test_id, ~test_created_at, ~test_value,
1, 7, "2019-01-01T15:28:00", "foo",
1, 8, "2019-01-01T23:59:30", "bar"
) %>%
mutate_at("test_created_at", as_datetime) %>%
mutate(
start = test_created_at + hours(3),
end = test_created_at - hours(3)
)
sessions %>%
fuzzy_left_join(
tests,
by = c(
"user_id" = "user_id",
"session_created_at" = "start",
"session_created_at" = "end"
),
match_fun = list(`==`, `>=`, `<=`)
)
#> # A tibble: 3 x 9
#> user_id.x session_id session_created_at user_id.y test_id
#> <dbl> <dbl> <dttm> <dbl> <dbl>
#> 1 1 1 2019-01-01 15:30:00 NA NA
#> 2 1 2 2019-01-02 00:00:30 NA NA
#> 3 1 3 2019-01-04 12:30:00 NA NA
#> # … with 4 more variables: test_created_at <dttm>, test_value <chr>,
#> # start <dttm>, end <dttm>
sessions %>%
left_join(tests, by = c("user_id" = "user_id"))
#> # A tibble: 6 x 8
#> user_id session_id session_created_at test_id test_created_at
#> <dbl> <dbl> <dttm> <dbl> <dttm>
#> 1 1 1 2019-01-01 15:30:00 7 2019-01-01 15:28:00
#> 2 1 1 2019-01-01 15:30:00 8 2019-01-01 23:59:30
#> 3 1 2 2019-01-02 00:00:30 7 2019-01-01 15:28:00
#> 4 1 2 2019-01-02 00:00:30 8 2019-01-01 23:59:30
#> 5 1 3 2019-01-04 12:30:00 7 2019-01-01 15:28:00
#> 6 1 3 2019-01-04 12:30:00 8 2019-01-01 23:59:30
#> # … with 3 more variables: test_value <chr>, start <dttm>, end <dttm>
Created on 2019-05-17 by the reprex package (v0.2.1)
Why is that, and what do I have to do to get it working?
I should add that when I run:
sessions %>%
left_join(tests, by = c("user_id" = "user_id"))
I get the proper columns, but of course not the correct test results joined. And when I run this, I get my expected results, but it seems inefficient:
sessions %>%
left_join(tests) %>%
filter(test_created_at - hours(3) <= session_created_at) %>%
filter(test_created_at + hours(3) >= session_created_at)
df1 <- data_frame(color = c("yellow", "blue", "green", "purple"),
grp = c(1, 2, 1, 2))
df2 <- data_frame(color = c("yallow", "bloooooo", "gren", "purple"))
df1_grp <- df1 %>% group_by(grp) %>% slice(1)
stringdist_inner_join(df1, df2) # works
stringdist_inner_join(df1_grp, df2) # fails
Error: corrupt 'grouped_df', contains 2 rows, and 4 rows in groups
stringdist_inner_join(df1_grp %>% ungroup(), df2) # works with ungrouping
I recognize the error message from dplyr bugs I have encountered, but the current CRAN version of dplyr 0.4.3 will do this join on grouped data_frames without error: inner_join(df1_grp, df2)
If there are duplicate column names (that aren't being matched on), they're currently suffixed with .x and .y. It'd be nice to be able to change that, like you can in dplyr's joins.
Thank you for this package! I used it for assigning people to generations (Boomers, Millennials, etc.) depending on the year they were born (it's also possible to base it on age, using that as a rough approximation of birth year, but that's more complicated code). Here's some simplified code of how I used fuzzyjoin.
library(fuzzyjoin)
#create generations table
name <- c("Generation ?", "Generation Z", "Millenials", "Xennials", "Generation X", "Baby Boomers","Silent")
startyr <- c(2017, 1997, 1984, 1977, 1963, 1944, 1900)
#there's a dummy year in here for Generation ?
endyr <- c(2025, 2016, 1996, 1983, 1976, 1962, 1943)
generations <- data.frame(name, startyr, endyr)
#create a data table of people with the year they were born
name <- c("person1", "person2", "person3", "person4", "person5", "person6", "person7", "person8", "person9")
yr_birth <- c(1971, 2018, 1999, 1917, 1965, 1949, 1979, 1983, 1989)
people <- data.frame(name, yr_birth)
#Update the people table to indicate what generation they belong to
#depending on whether their birth year falls on or after start yr and on or before end year of the generation
people <- fuzzy_left_join(people, generations,
                          by = c("yr_birth" = "startyr", "yr_birth" = "endyr"),
                          match_fun = list(`>=`, `<=`))
I just posted this question, which contains the following example:
x = matrix( 1:100, nrow = 100, ncol= 4 ,byrow = FALSE)
y = x + matrix( .001 * rnorm(n = 400), nrow = 100, ncol= 4)
z = rbind(x,y)
z = z[sample(nrow(z)),]
Can fuzzyjoin be used to reduce z to x (at least as an estimate)?
I'm trying the 'regex_join' functions with the sample data below.
library(fuzzyjoin)
library(dplyr)
library(readr)
# Create 'user' data frame
user <- read_csv(
"name, regex_name
kan, ^kan
john, john
hideaki, hideaki
")
# Create 'activity' data frame
activity <- read_csv(
"username, type
kanaugust, beta
hideaki_test, alpha
john_test, beta
betauser1, beta
kan_test, beta
chakakan, beta
")
# Run regex_left_join() function
activity %>%
regex_left_join(user, by = c("username" = "regex_name"))
The result doesn't look right.
> activity %>%
+ regex_left_join(user, by = c("username" = "regex_name"))
Source: local data frame [6 x 4]
username type name regex_name
(chr) (chr) (chr) (chr)
1 kanaugust beta NA NA
2 hideaki_test alpha hideaki hideaki
3 john_test beta NA NA
4 betauser1 beta kan ^kan
5 kan_test beta kan ^kan
6 chakakan beta john john
I was expecting it to be something like below.
username type name regex_name
(chr) (chr) (chr) (chr)
1 kanaugust beta kan ^kan
2 hideaki_test alpha hideaki hideaki
3 john_test beta john john
4 betauser1 beta NA NA
5 kan_test beta kan ^kan
6 chakakan beta NA NA
Similar to how bedtools intersect works, it would be great if the interval/genome join methods in fuzzyjoin would also calculate (maybe optionally) the number of bases in the intersection (i.e. the length of the overlap) and append it to the results table.
I wanted to give this package a try, but this error came up.
df1 <-
structure(
list(
lake = c(
"McCullough",
"McCullough",
"Partridge",
"Randall",
"Tomahawk",
"Ward",
"White Sand",
"Allequash",
"Island",
"Katinka"
)
),
.Names = "lake",
class = c("tbl_df", "data.frame"),
row.names = c(NA, -10L)
)
df2 <-
structure(
list(
lakename = c(
"Allequash Lake",
"Anvil Lake",
"Arrowhead Lake",
"Bass Lake",
"Big Lake",
"Birch Lake",
"Ballard Lake",
"Big Muskellunge Lake",
"Black Oak Lake",
"Big Portage Lake"
)
),
.Names = "lakename",
class = c("tbl_df",
"data.frame"),
row.names = c(NA, -10L)
)
names(df1)
## [1] "lake"
names(df2)
## [1] "lakename"
stringdist_left_join(df1, df2, by = c("lake" = "lakename"), max_dist = 1)
## Error: data_frames can only contain 1d atomic vectors and lists
Any idea?
Thank you
Is there a reason why this isn't supported yet? purrr is imported, so it is just a matter of applying purrr::as_mapper to the functions; I feel it's a low-hanging fruit that would make the package feel a lot more tidyverse-friendly.
The following piece of code fixes fuzzy_join and all functions built on it:
additional_code <-quote({
if(!is.null(match_fun)){
if(is.list(match_fun))
match_fun <- map(match_fun,as_mapper,...)
else
match_fun <- as_mapper(match_fun)
}
if(!is.null(multi_match_fun)){
multi_match_fun <- as_mapper(multi_match_fun)
}
if(!is.null(index_match_fun)){
index_match_fun <- as_mapper(index_match_fun)
}
})
Let's test it:
library(fuzzyjoin)
trace(fuzzyjoin::fuzzy_join,additional_code, at = 2)
sdata <- structure(
list(byear = c(1990, 1995, 2000, 2005),
eyear = c(1995, 2000, 2005, 2010),
val = c(3, 1, 5, 6)), row.names = c(NA, -4L
),
class = c("tbl_df", "tbl", "data.frame"))
fdate <- structure(
list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 5, 5, 5, 5, 8, 8, 8, 8, 13, 13, 13),
fyear = c(1998, 1999, 2000, 2001, 1998, 1999, 2000, 2001, 2002, 2003,
1998, 1999, 2000, 2001, 2002, 2003, 1998, 1999, 2000, 2001,
1998, 1999, 2000, 2001, 1998, 1999, 2000)),
row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))
identical(
fuzzy_inner_join(sdata, fdata, by=c("val"="id"), ~.x > .y),
fuzzy_inner_join(sdata, fdata, by=c("val"="id"), `>`))
# Tracing fuzzy_join(x, y, by, match_fun, mode = "inner", ...) step 2
# Tracing fuzzy_join(x, y, by, match_fun, mode = "inner", ...) step 2
# [1] TRUE
Would you like a PR?
Hello!
As shown in the examples, stringdist_left_join can return multiple matches from the y dataset for a single row of the x dataset. In my case, the result of the match includes:
label.x label.y
Merton Merton
Merton Melton
Is it possible to tell the function that if it finds an exact match, it should retain it and move on to the next line?
Thank you!
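There's no built-in option for this as far as I know, but one workaround is to request a distance_col and then keep only the closest match per row of x with dplyr. A minimal sketch, using hypothetical two-row data to reproduce the Merton/Melton situation:

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical data illustrating the situation above
x <- data.frame(label = "Merton", stringsAsFactors = FALSE)
y <- data.frame(label = c("Merton", "Melton"), stringsAsFactors = FALSE)

stringdist_left_join(x, y, by = "label", max_dist = 1, distance_col = "dist") %>%
  group_by(label.x) %>%
  filter(dist == min(dist)) %>%   # keep only the closest match per row of x
  ungroup()
```

An exact match has distance 0, so it always wins; note that ties (two matches at the same distance) would still produce multiple rows unless you break them explicitly.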
Hi,
first let me say, that I really like your package, especially the genome_join
functions, which save me a lot of conversions between data frames and GenomicRanges
.
But a few methods for dealing with genomic intervals were missing so I wrote a package (tidygenomics) to fill those gaps.
Because I am using your genome_join implementation and mimicking your nomenclature, I wanted to ask whether this is okay with you? And secondly: so far only the GitHub version of fuzzyjoin supports genome_join; do you have plans to release it to CRAN?
Best Regards,
Constantin
Is there any way to match on normalized string distance?
I.e., for osa, lv, and dl, dividing by the nchar of the longer word, or for lcs, by the combined lengths of both?
1 letter off of a 4 letter string is quite different than 1 letter off of a 20 character string.
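Normalized distance isn't built in as far as I know, but fuzzy_join accepts an arbitrary vectorized match_fun, so a normalized threshold can be expressed directly. A sketch, where norm_match_fun and max_norm_dist are hypothetical names (not part of the package):

```r
library(fuzzyjoin)
library(stringdist)

# Hypothetical matcher: osa distance divided by the longer string's length
norm_match_fun <- function(v1, v2, max_norm_dist = 0.25) {
  d <- stringdist::stringdist(v1, v2, method = "osa")
  d / pmax(nchar(v1), nchar(v2)) <= max_norm_dist
}

df1 <- data.frame(name = c("Mississippi", "Ohio"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Mississipi", "Ohi0"), stringsAsFactors = FALSE)

fuzzy_inner_join(df1, df2, by = "name", match_fun = norm_match_fun)
```

Here one edit in an 11-character string (ratio ~0.09) and one edit in a 4-character string (ratio 0.25) both pass the 0.25 threshold, but the threshold scales with word length as requested.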
The semi-join appears to work but doesn't add joined variables.
Working from an example I had saved from twitter a while back.
library(tidyverse)
library(fuzzyjoin)
regexes <- read_csv(
"regex, type
Windows, Windows-ish
Red Hat|CentOS|Fedora, Fedora-ish
Ubuntu|Debian,Debian-ish
CoreOS|Amazon,Amazon-ish")
os <- read_lines("https://rud.is/dl/os.txt", na = "NA")
os <- as.data.frame(os, na.rm = TRUE)
os <- os[complete.cases(os),]
os <- as.data.frame(os, na.rm = TRUE)
os_left <- os %>%
  replace_na(list(regex = "Unknown")) %>%
  regex_left_join(regexes, c(os = "regex"))
os_semi <- os %>%
  replace_na(list(regex = "Unknown")) %>%
  regex_semi_join(regexes, c(os = "regex"))
os_left returns a data frame with 3 variables
os_semi returns one variable (a factor)
Hey thanks for the wonderful package.
I find that there are situations where I need to merge by multiple variables, but only one of those variables is fuzzy. Let's say I have a list of school names and their zip codes, plus a list of individuals who state the name of their school and their zip code. I expect there to be some error in the school name, but the ZIP code should be correct, so I try something like this:
library(fuzzyjoin)
library(dplyr)
# correct names
official <- data.frame(stringsAsFactors = FALSE,
                       name = c("School One", "School Two", "School Three"),
                       ZIP = c("91427", "91428", "01427"))
# slightly wrong names
response <- data.frame(stringsAsFactors = FALSE,
                       name = c("School Oune", "School Two", "School Thee"),
                       ZIP = c("91427"))
# inner join them
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = 2)
# don't understand why there are duplicate rows
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = 2, distance_col = "dist")
# don't understand why distance_col changes the number of rows, but this is the number of rows I expect
# ideally I'd only like a match when the ZIPs are identical,
# but I can't specify different max distances for different columns:
stringdist_inner_join(official, response, by = c("name", "ZIP"),
                      max_dist = c(2, 0), distance_col = "dist")
So a few things:
Why did the inner join generate duplicate rows?
Why does specifying a distance_col change the number of rows in the inner join?
How do I approach cases where I want a different max distance for different variables?
Would it be possible to make the max_dist
parameter accept a vector of distances for each by
variable?
Regards,
Carl
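Until max_dist accepts a vector, one workaround is to join exactly on the column that must match (ZIP) with an ordinary dplyr join, then filter the fuzzy column by string distance. A sketch reusing the data and the max distance of 2 from the example above:

```r
library(dplyr)
library(stringdist)

official <- data.frame(stringsAsFactors = FALSE,
                       name = c("School One", "School Two", "School Three"),
                       ZIP  = c("91427", "91428", "01427"))
response <- data.frame(stringsAsFactors = FALSE,
                       name = c("School Oune", "School Two", "School Thee"),
                       ZIP  = c("91427", "91427", "91427"))

# Exact join on ZIP, then fuzzy filter on name (max distance 2)
official %>%
  inner_join(response, by = "ZIP", suffix = c(".official", ".response")) %>%
  filter(stringdist(name.official, name.response, method = "osa") <= 2)
```

This is equivalent to max_dist = c(2, 0) for the name/ZIP pair, and avoids the cross-product of near-miss ZIPs entirely.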
Hi,
I cannot get the examples to work.
First I had to write data("state")
instead of data("state.center")
.
Then I get the error Joining by: c("longitude", "latitude") Error: No variables selected
when running
states <- data_frame(state = state.name,
                     longitude = state.center$x,
                     latitude = state.center$y)
s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)
pairs <- s1 %>%
  geo_inner_join(s2, max_dist = 200) %>%
  filter(state1 != state2)
Any plans to support dplyr style calls to external databases, even for some subset of functions?
I'm guessing that's out of scope for this package, but it would be nice to have that documented in perhaps a more obvious way(?) Anyway, really cool package here, thanks!
df1 <- data.frame(color = c("yellow", "blue", "green", "purple", NA),
blah = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(color = c("yallow", "bloooooo", "gren", "purple", "0range", NA),
test = 6:11, stringsAsFactors = FALSE)
inner_join(df1, df2) # works
stringdist_inner_join(df1, df2) # fails
I'm not sure how problematic this is, as in practice it is typically undesirable to have NAs join with NAs. But I'd expected it to behave like dplyr's joins. If this errors intentionally on NAs in the join variables, maybe clarify the error message?
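A possible workaround until NA handling matches dplyr's is to drop rows with NA keys before the fuzzy join (data frames copied from the example above):

```r
library(dplyr)
library(fuzzyjoin)

df1 <- data.frame(color = c("yellow", "blue", "green", "purple", NA),
                  blah = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(color = c("yallow", "bloooooo", "gren", "purple", "0range", NA),
                  test = 6:11, stringsAsFactors = FALSE)

# Drop NA join keys first so the fuzzy join never sees them
stringdist_inner_join(filter(df1, !is.na(color)),
                      filter(df2, !is.na(color)),
                      by = "color", max_dist = 2)
```

Note this deliberately never joins NA to NA, which differs from dplyr's default but is usually what's wanted in practice.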
Wondering: when you distance_join() or geo_join() two tables, one with M coordinates and the other with N (say M <= N), is your current implementation exhaustive, i.e., running in O(M*N) time? If so, consider building a k-d tree (of kNN fame) from the smaller table in O(M*log(M)) time and querying it N times for the point closest to each of P1, P2, ..., PN from the longer table, joining only when dist(P, closest(P)) < max_dist. This runs in O((M+N)*log(M)), which is at least M/(2*log(M)) times faster than the exhaustive method.
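A sketch of that idea, assuming the RANN package (which exposes a k-d tree through nn2); for each point in the larger table it finds the nearest neighbor in the smaller one within max_dist:

```r
# Sketch assuming the RANN package (k-d tree nearest-neighbor search)
library(RANN)

set.seed(1)
a <- matrix(runif(200), ncol = 2)   # M = 100 points (smaller table)
b <- matrix(runif(400), ncol = 2)   # N = 200 points (larger table)
max_dist <- 0.05

# Radius-limited 1-NN query: nn.idx is 0 where no neighbor lies within max_dist
nn <- RANN::nn2(a, b, k = 1, searchtype = "radius", radius = max_dist)
matched <- which(nn$nn.idx[, 1] > 0)
pairs <- data.frame(b_row = matched, a_row = nn$nn.idx[matched, 1])
```

Note this uses Euclidean distance; a real geo_join() would need great-circle distance, e.g. by querying in projected coordinates or by post-filtering candidates with the exact haversine formula.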
In reference to README: "I'm interested in ideas for use cases that can be provided as vignettes."
Hi,
I thought to share an interesting data example.
Here is a public dataset: Aboriginal and Torres Strait Islander Languages by www.aiatsis.gov.au
https://collection.aiatsis.gov.au/datasets/austlang/001
This is a CSV, comma separated, with quotes.
Languages were transmitted orally and were never written down, so there are various alternative spellings referring to the same language.
language_name and language_synonym are the two variables to look up; this would be a pretty cool example for fuzzyjoin.