Comments (11)
@krlmlr commented on Dec 27, 2017, 8:41 AM UTC:
Thanks. I see your point, it seems that warning before executing the query will be difficult. We should be working on assigning unique column names, perhaps with tibble::repair_names()
tibble::set_tidy_names()
.
from dbplyr.
@bkkkk commented on Dec 28, 2017, 2:09 AM UTC:
I suspected it would be, anyways I think that having the names repaired probably fits better with other dplyr data ingestion / tibble creation workflows.
from dbplyr.
@edgararuiz commented on Dec 28, 2017, 2:07 PM UTC:
Hi, if you use the native dbplyr
join command, it will add a .x
and .y
suffix to your results automatically so you won't encounter that issue, here is an example:
library(dplyr)
library(DBI)
library(tibble)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
mtcars %>%
rowid_to_column() %>%
dbWriteTable(con, "mtcars", .)
tbl(con, "mtcars") %>%
inner_join(tbl(con, "mtcars"), by = "rowid" ) %>%
collect() %>%
View()
dbDisconnect(con)
I would encourage you to try something like this:
tbl(con, "consumers") %>%
select(main.id,
main.phone,
others.phone) %>%
inner_join(tbl(con, "consumers", by = c("others.id" = "main.id"))) %>%
collect()
from dbplyr.
@bkkkk commented on Jan 2, 2018, 11:26 PM UTC:
@edgararuiz thanks for the suggestion but we use SQL queries for interoperability and it lets non-R users review my code. It's also not limited to joins, my example uses a join because the code I was working on before I posted this issue had a join.
The issue will also arise with the following query, you'll get 2 columns named date_trunc
, and 2 named sum
:
SELECT
date_trunc('month', SOME_DATE) :: DATE,
date_trunc('day', SOME_OTHER_DATE) :: DATE,
sum(SOME_AMOUNT),
sum(SOME_OTHER_AMOUNT)
FROM
SOME_TABLE
GROUP BY 1, 2
This issue is a dplyr-specific issue that makes SQL workflows just a little more annoying for no particularly good reason, especially given that other parts of the tidyverse, like your example above, cleans up column names to avoid duplicates.
from dbplyr.
@edgararuiz commented on Jan 3, 2018, 12:17 AM UTC:
Ok, for that, I'd use the x and y suffix in the next command, and then use show_query()
to display the resulting SQL so that others can check my work. I'm sure I'm missing something, maybe if you can provide an example of how it would work with a local data.frame
would be great!
> library(dplyr)
> library(DBI)
> library(tibble)
> con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
>
> mtcars %>%
+ rowid_to_column() %>%
+ dbWriteTable(con, "mtcars", .)
>
> new <- tbl(con, "mtcars") %>%
+ inner_join(tbl(con, "mtcars"), by = "rowid" ) %>%
+ group_by(am.x, am.y) %>%
+ summarise(sum(wt.x),
+ sum(wt.y))
> head(new)
# Source: lazy query [?? x 4]
# Database: sqlite 3.19.3 []
# Groups: am.x
am.x am.y `sum(wt.x)` `sum(wt.y)`
<dbl> <dbl> <dbl> <dbl>
1 0 0 71.6 71.6
2 1.00 1.00 31.3 31.3
>
> show_query(new)
<SQL>
SELECT `am.x`, `am.y`, SUM(`wt.x`) AS `sum(wt.x)`, SUM(`wt.y`) AS `sum(wt.y)`
FROM (SELECT `TBL_LEFT`.`rowid` AS `rowid`, `TBL_LEFT`.`mpg` AS `mpg.x`, `TBL_LEFT`.`cyl` AS `cyl.x`, `TBL_LEFT`.`disp` AS `disp.x`, `TBL_LEFT`.`hp` AS `hp.x`, `TBL_LEFT`.`drat` AS `drat.x`, `TBL_LEFT`.`wt` AS `wt.x`, `TBL_LEFT`.`qsec` AS `qsec.x`, `TBL_LEFT`.`vs` AS `vs.x`, `TBL_LEFT`.`am` AS `am.x`, `TBL_LEFT`.`gear` AS `gear.x`, `TBL_LEFT`.`carb` AS `carb.x`, `TBL_RIGHT`.`mpg` AS `mpg.y`, `TBL_RIGHT`.`cyl` AS `cyl.y`, `TBL_RIGHT`.`disp` AS `disp.y`, `TBL_RIGHT`.`hp` AS `hp.y`, `TBL_RIGHT`.`drat` AS `drat.y`, `TBL_RIGHT`.`wt` AS `wt.y`, `TBL_RIGHT`.`qsec` AS `qsec.y`, `TBL_RIGHT`.`vs` AS `vs.y`, `TBL_RIGHT`.`am` AS `am.y`, `TBL_RIGHT`.`gear` AS `gear.y`, `TBL_RIGHT`.`carb` AS `carb.y`
FROM `mtcars` AS `TBL_LEFT`
INNER JOIN `mtcars` AS `TBL_RIGHT`
ON (`TBL_LEFT`.`rowid` = `TBL_RIGHT`.`rowid`)
)
GROUP BY `am.x`, `am.y`
from dbplyr.
@bkkkk commented on Jan 3, 2018, 5:00 AM UTC:
These queries form part of very large shiny applications and other automated reports which are all in github, where the code reviews happen. Using show_query
is simply not practical. In addition, and no offense meant at all, but the result of show_query
are not as easy to read as a hand-written queries.
This is about a small ease of use improvement when using tbl
directly with SQL queries. I think @krlmlr has a sensible approach, instead of dumping the results of a potentially long running query, warn and repair the names.
You can reproduce this with a local data frame but the truth is that the situation is totally artificial since you'd be building the data frame by hand and therefore this is a mistake you've made:
data_frame(x = c(1, 2), x = c(1, 6))
or it's the result of an incorrect call to add_column like this:
data_frame(x = c(1, 2), y = c(1, 6)) %>%
add_column(x = c(1, 2))
Which actually returns an error as it should and prevents you from overwriting data when you might not intend to.
When using PostgreSQL clients, it's perfectly fine to not name columns and have those auto-generated column names be duplicated. Of course it's bad data-hygiene so when using an SQL query within a tbl
call it's correct to not allow this situation, but it's better (IMHO) to have the names tidied and build the data frame than dump the data and waste a lot of processing time (on the DB side) for a column naming issue.
Using dplyr
verbs instead doesn't really change anything it just side-steps the issue.
from dbplyr.
@krlmlr commented on Jan 17, 2018, 9:06 AM UTC:
@edgararuiz: Would you agree to just call tibble::set_tidy_names()
in collect()
?
from dbplyr.
@edgararuiz commented on Jan 17, 2018, 1:43 PM UTC:
Sure, that sounds fine. Thanks!
from dbplyr.
@hadley commented on May 20, 2018, 2:42 PM UTC:
Minimal reprex
library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(
x = c(1,2,2),
y = c(3,5,NA),
z = c(NA,'a','b')
)
mf1 <- mf %>% select(x, x = y)
mf1
#> Error: Column `x` must have a unique name
Fixing this requires a call to tibble::set_tidy_names()
in collect.tbl_sql()
, prior to grouped_df()
from dbplyr.
This is now resolved automatically by the new dplyr duplicated column name rules:
library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(
x = c(1,2,2),
y = c(3,5,NA),
z = c(NA,'a','b')
)
mf1 <- mf %>% select(x, x = y)
mf1
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.22.0 [:memory:]
#> x x..2
#> <dbl> <dbl>
#> 1 1 3
#> 2 2 5
#> 3 2 NA
Created on 2019-01-02 by the reprex package (v0.2.1)
from dbplyr.
This is awesome thank you!
from dbplyr.
Related Issues (20)
- Databricks `copy_to` second stab HOT 2
- Support 'Range between' along with 'rows between' in `window_frame` HOT 1
- Oracle generates wrong SQL for `head()` HOT 7
- `select()` fails after specific sequence of `dplyr` commands HOT 1
- bigquery integration HOT 1
- explain for a SQL query is affected by options(digits.secs) HOT 1
- Use of `as.Date()` in `filter()` returns an error in dbplyr 2.4.0 for Oracle databases HOT 1
- `rows_patch.tbl_lazy` fails when patching more than one column
- dbplyr 2.4.0 - connecting to a table via a database link no longer works HOT 3
- Using a named vector with `dplyr::select(dplyr::all_of(...))` changes the table column names HOT 1
- The `unknown` column appears after I join the 2 tables HOT 4
- Progress bar for collect() HOT 1
- dbplyr 2.4.0 - date filter doesn't work anymore, possibly due to absence of single quotes in translation HOT 5
- Helpful warning if user forgets `I()`
- Update docs to prefer `I()` over `in_schema()` etc
- Check `table_path()` approach with backends HOT 4
- Fix no visible global function definition NOTE
- Automate global variables defintition HOT 1
- Consider recording quotes in simulate class definition
- Research stricter interpolation revdep failures HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbplyr.