<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

Reduce time to error when collecting results of SQL query with duplicated column names about dbplyr HOT 11 CLOSED

tidyverse commented on June 11, 2024

Reduce time to error when collecting results of SQL query with duplicated column names

from dbplyr.

Comments (11)

commented on June 11, 2024

@krlmlr commented on Dec 27, 2017, 8:41 AM UTC:

Thanks. I see your point, it seems that warning before executing the query will be difficult. We should be working on assigning unique column names, perhaps with ~~tibble::repair_names()~~ tibble::set_tidy_names().

from dbplyr.

commented on June 11, 2024

@bkkkk commented on Dec 28, 2017, 2:09 AM UTC:

I suspected it would be, anyways I think that having the names repaired probably fits better with other dplyr data ingestion / tibble creation workflows.

from dbplyr.

commented on June 11, 2024

@edgararuiz commented on Dec 28, 2017, 2:07 PM UTC:

Hi, if you use the native dbplyr join command, it will add a .x and .y suffix to your results automatically so you won't encounter that issue, here is an example:

library(dplyr)
library(DBI)
library(tibble)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")

mtcars %>%
  rowid_to_column() %>%
  dbWriteTable(con, "mtcars", .)

tbl(con, "mtcars") %>%
  inner_join(tbl(con, "mtcars"), by = "rowid" ) %>%
  collect() %>%
  View()

dbDisconnect(con)

I would encourage you to try something like this:

tbl(con, "consumers") %>%
  select(main.id,
         main.phone,
         others.phone) %>%
  inner_join(tbl(con, "consumers", by = c("others.id" = "main.id"))) %>%
  collect()

from dbplyr.

commented on June 11, 2024

@bkkkk commented on Jan 2, 2018, 11:26 PM UTC:

@edgararuiz thanks for the suggestion but we use SQL queries for interoperability and it lets non-R users review my code. It's also not limited to joins, my example uses a join because the code I was working on before I posted this issue had a join.

The issue will also arise with the following query, you'll get 2 columns named date_trunc, and 2 named sum:

SELECT
   date_trunc('month', SOME_DATE) :: DATE,
   date_trunc('day', SOME_OTHER_DATE) :: DATE,
   sum(SOME_AMOUNT),
   sum(SOME_OTHER_AMOUNT)
FROM
   SOME_TABLE
GROUP BY 1, 2

This issue is a dplyr-specific issue that makes SQL workflows just a little more annoying for no particularly good reason, especially given that other parts of the tidyverse, like your example above, cleans up column names to avoid duplicates.

from dbplyr.

commented on June 11, 2024

@edgararuiz commented on Jan 3, 2018, 12:17 AM UTC:

Ok, for that, I'd use the x and y suffix in the next command, and then use show_query() to display the resulting SQL so that others can check my work. I'm sure I'm missing something, maybe if you can provide an example of how it would work with a local data.frame would be great!

> library(dplyr)
> library(DBI)
> library(tibble)
> con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
> 
> mtcars %>%
+   rowid_to_column() %>%
+   dbWriteTable(con, "mtcars", .)
> 
> new <- tbl(con, "mtcars") %>%
+   inner_join(tbl(con, "mtcars"), by = "rowid" ) %>%
+   group_by(am.x, am.y) %>%
+   summarise(sum(wt.x),
+             sum(wt.y)) 
> head(new)
# Source: lazy query [?? x 4]
# Database: sqlite 3.19.3 []
# Groups: am.x
   am.x  am.y `sum(wt.x)` `sum(wt.y)`
  <dbl> <dbl>       <dbl>       <dbl>
1  0     0           71.6        71.6
2  1.00  1.00        31.3        31.3
> 
> show_query(new)
<SQL>
SELECT `am.x`, `am.y`, SUM(`wt.x`) AS `sum(wt.x)`, SUM(`wt.y`) AS `sum(wt.y)`
FROM (SELECT `TBL_LEFT`.`rowid` AS `rowid`, `TBL_LEFT`.`mpg` AS `mpg.x`, `TBL_LEFT`.`cyl` AS `cyl.x`, `TBL_LEFT`.`disp` AS `disp.x`, `TBL_LEFT`.`hp` AS `hp.x`, `TBL_LEFT`.`drat` AS `drat.x`, `TBL_LEFT`.`wt` AS `wt.x`, `TBL_LEFT`.`qsec` AS `qsec.x`, `TBL_LEFT`.`vs` AS `vs.x`, `TBL_LEFT`.`am` AS `am.x`, `TBL_LEFT`.`gear` AS `gear.x`, `TBL_LEFT`.`carb` AS `carb.x`, `TBL_RIGHT`.`mpg` AS `mpg.y`, `TBL_RIGHT`.`cyl` AS `cyl.y`, `TBL_RIGHT`.`disp` AS `disp.y`, `TBL_RIGHT`.`hp` AS `hp.y`, `TBL_RIGHT`.`drat` AS `drat.y`, `TBL_RIGHT`.`wt` AS `wt.y`, `TBL_RIGHT`.`qsec` AS `qsec.y`, `TBL_RIGHT`.`vs` AS `vs.y`, `TBL_RIGHT`.`am` AS `am.y`, `TBL_RIGHT`.`gear` AS `gear.y`, `TBL_RIGHT`.`carb` AS `carb.y`
  FROM `mtcars` AS `TBL_LEFT`
  INNER JOIN `mtcars` AS `TBL_RIGHT`
  ON (`TBL_LEFT`.`rowid` = `TBL_RIGHT`.`rowid`)
)
GROUP BY `am.x`, `am.y`

from dbplyr.

commented on June 11, 2024

@bkkkk commented on Jan 3, 2018, 5:00 AM UTC:

These queries form part of very large shiny applications and other automated reports which are all in github, where the code reviews happen. Using show_query is simply not practical. In addition, and no offense meant at all, but the result of show_query are not as easy to read as a hand-written queries.

This is about a small ease of use improvement when using tbl directly with SQL queries. I think @krlmlr has a sensible approach, instead of dumping the results of a potentially long running query, warn and repair the names.

You can reproduce this with a local data frame but the truth is that the situation is totally artificial since you'd be building the data frame by hand and therefore this is a mistake you've made:

data_frame(x = c(1, 2), x = c(1, 6))

or it's the result of an incorrect call to add_column like this:

data_frame(x = c(1, 2), y = c(1, 6)) %>%
   add_column(x = c(1, 2))

Which actually returns an error as it should and prevents you from overwriting data when you might not intend to.

When using PostgreSQL clients, it's perfectly fine to not name columns and have those auto-generated column names be duplicated. Of course it's bad data-hygiene so when using an SQL query within a tbl call it's correct to not allow this situation, but it's better (IMHO) to have the names tidied and build the data frame than dump the data and waste a lot of processing time (on the DB side) for a column naming issue.

Using dplyr verbs instead doesn't really change anything it just side-steps the issue.

from dbplyr.

commented on June 11, 2024

@krlmlr commented on Jan 17, 2018, 9:06 AM UTC:

@edgararuiz: Would you agree to just call tibble::set_tidy_names() in collect()?

from dbplyr.

commented on June 11, 2024

@edgararuiz commented on Jan 17, 2018, 1:43 PM UTC:

Sure, that sounds fine. Thanks!

from dbplyr.

commented on June 11, 2024

@hadley commented on May 20, 2018, 2:42 PM UTC:

Minimal reprex

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(
  x = c(1,2,2),
  y = c(3,5,NA),
  z = c(NA,'a','b')
)

mf1 <- mf %>% select(x, x = y)
mf1
#> Error: Column `x` must have a unique name

Fixing this requires a call to tibble::set_tidy_names() in collect.tbl_sql() , prior to grouped_df()

from dbplyr.

hadley commented on June 11, 2024

This is now resolved automatically by the new dplyr duplicated column name rules:

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(
  x = c(1,2,2),
  y = c(3,5,NA),
  z = c(NA,'a','b')
)

mf1 <- mf %>% select(x, x = y)
mf1
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.22.0 [:memory:]
#>       x  x..2
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2     5
#> 3     2    NA

^{Created on 2019-01-02 by the reprex package (v0.2.1)}

from dbplyr.

bkkkk commented on June 11, 2024

This is awesome thank you!

from dbplyr.

Reduce time to error when collecting results of SQL query with duplicated column names about dbplyr HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent