Comments (9)
I think it might make sense in R due to differing semantics. It is simply not feasible to make dplyr and a database agree for every possible input.
from dbplyr.
@AjarKeen commented on Apr 4, 2018, 7:54 PM UTC:
I think the problem isn't related to the tidyeval
issue. It seems to just be that translation of the failing code into SQL results in a reference to the newly created column within the same statement, which doesn't work.
The code that fails is this:
flights_sdf %>%
group_by(carrier) %>%
summarize(count_num = n(),
mean_dep_delay = mean(dep_delay),
ratio = mean_dep_delay / count_num) %>%
collect()
Which is translated to:
SELECT `carrier`,
count(*) AS `count_num`,
AVG(`dep_delay`) AS `mean_dep_delay`,
`mean_dep_delay` / `count_num` AS `ratio`
FROM `flights`
GROUP BY `carrier`
We can test that query out on a toy table:
CREATE TABLE flights (
carrier VARCHAR(50),
dep_delay FLOAT
);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('UA', 0);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('AA', 5);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('DL', 10);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('UA', 5);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('AA', 10);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('DL', 15);
SELECT `carrier`,
count(*) AS `count_num`,
AVG(`dep_delay`) AS `mean_dep_delay`,
`mean_dep_delay` / `count_num` AS `ratio`
FROM `flights`
GROUP BY `carrier`
Which fails: Unknown column 'mean_dep_delay'
.
Instead, we need the R code to be translated into something like this:
SELECT `carrier`,
`mean_dep_delay` / `count_num` AS `ratio`
FROM (
SELECT `carrier`,
count(*) AS `count_num`,
AVG(`dep_delay`) AS `mean_dep_delay`
FROM `flights`
GROUP BY `carrier`
) AS flights_grouped
Which works on the toy example:
carrier | ratio
AA | 3.75
DL | 6.25
UA | 1.25
I'm not sure how to make such a change in dbplyr
, but it would definitely be a useful feature.
from dbplyr.
@hadley commented on May 20, 2018, 2:38 PM UTC:
Minimal reprex
library(dplyr, warn.conflicts = FALSE)
lf1 <- dbplyr::lazy_frame(x = 1:5, src = dbplyr::simulate_dbi())
lf1 %>%
summarise(
x1 = mean(x, na.rm = TRUE),
x2 = sum(x, na.rm = TRUE),
x3 = x1 / x2
) %>%
show_query()
#> <SQL> SELECT AVG("x") AS "x1", SUM("x") AS "x2", "x1" / "x2" AS "x3"
#> FROM "df"
Created on 2018-05-20 by the reprex package (v0.2.0).
from dbplyr.
@hadley commented on May 20, 2018, 2:39 PM UTC:
Compare with mutate()
which correctly creates the nested subquery:
library(dplyr, warn.conflicts = FALSE)
lf1 <- dbplyr::lazy_frame(x = 1:5, src = dbplyr::simulate_dbi())
lf1 %>%
mutate(
x1 = mean(x, na.rm = TRUE),
x2 = sum(x, na.rm = TRUE),
x3 = x1 / x2
) %>%
show_query()
#> <SQL> SELECT "x", "x1", "x2", "x1" / "x2" AS "x3"
#> FROM (SELECT "x", avg("x") OVER () AS "x1", sum("x") OVER () AS "x2"
#> FROM "df") "gdgrsnkldf"
from dbplyr.
This isn't quite as simple as applying the algorithm from mutate()
because while mutate()
automatically keeps previous variables, summarise()
does not. This means it's the flip side of #193
from dbplyr.
Also need to make this work for transmute()
from dbplyr.
Updated reprex:
library(dplyr, warn.conflicts = FALSE)
lf <- dbplyr::lazy_frame(x = 1:5)
lf %>%
mutate(
x1 = mean(x, na.rm = TRUE),
x2 = sum(x, na.rm = TRUE),
x3 = x1 / x2
)
#> <SQL>
#> SELECT `x`, `x1`, `x2`, `x1` / `x2` AS `x3`
#> FROM (SELECT `x`, AVG(`x`) OVER () AS `x1`, SUM(`x`) OVER () AS `x2`
#> FROM `df`) `dbplyr_ksmttkjckb`
lf %>%
summarise(
x1 = mean(x, na.rm = TRUE),
x2 = sum(x, na.rm = TRUE),
x3 = x1 / x2
)
#> <SQL>
#> SELECT AVG(`x`) AS `x1`, SUM(`x`) AS `x2`, `x1` / `x2` AS `x3`
#> FROM `df`
Created on 2019-03-14 by the reprex package (v0.2.1.9000)
from dbplyr.
Maybe it's enough to make this an error? It's not really clear what the above example means, and we really want to the user to rewrite to:
library(dplyr, warn.conflicts = FALSE)
lf <- dbplyr::lazy_frame(x = 1:5)
lf %>%
summarise(
x1 = mean(x, na.rm = TRUE),
x2 = sum(x, na.rm = TRUE)
) %>%
mutate(x3 = x1 / x2)
#> <SQL>
#> SELECT `x1`, `x2`, `x1` / `x2` AS `x3`
#> FROM (SELECT AVG(`x`) AS `x1`, SUM(`x`) AS `x2`
#> FROM `df`) `dbplyr_xejwocofsh`
Created on 2019-03-14 by the reprex package (v0.2.1.9000)
from dbplyr.
The only confusing thing about this fix is that the "wrong" version still works in a non-dbplyr
context, so the same code will work or not work if the data source changes. This makes it hard to prototype locally and then execute on a larger dataset through a backend.
Local reprex:
library(dplyr)
lf <- tibble(x = 1:5)
lf %>%
summarise(
x1 = mean(x, na.rm = TRUE),
x2 = sum(x, na.rm = TRUE),
x3 = x1 / x2
)
#> # A tibble: 1 x 3
#> x1 x2 x3
#> <dbl> <int> <dbl>
#> 1 3 15 0.2
I can file a dplyr
issue if the right thing to do is have the error thrown in the local context as well.
Edit: or does this fix cover both cases? That wasn't clear from looking at the commit.
from dbplyr.
Related Issues (20)
- Helper to get `last_sql()`
- str_detect() and str_replace() not treating period (.) detection same in snowflake translation HOT 3
- `paste0()` incorrectly uses a `sep` argument
- Snowflake translation error: dropped filter with `anti_join` HOT 3
- Filter by a column in another table HOT 5
- Release dbplyr 2.5.0
- rstudio autocomplete slows down from 2.3.4 -> 2.4.0
- Breaking changes in dbplyr 2.5.0 HOT 9
- `distinct()` in Databricks/SparkSQL causes "arrange()... __row_num_*" error
- `head()` modifies `SELECT *`
- Export `check_na_rm()`
- rows_upsert worked in 2.4.0, broke with 2.5.0 HOT 2
- `sql` in `in_catalog` breaks with `filter` HOT 1
- Regression in window_order variable argument parsing for 2.5.0
- Unable to use quantile() in mutate() in DuckDB HOT 4
- Possible inconsistency in translation of stringr::str_like() with respect to ignore_case argument HOT 2
- postition of head in pipeline no longer leading to different sql HOT 3
- rows_delete fails due to key constraints: Canβt Modify Database Table
- dbplyr `across()` behavior differs from dplyr in a grouped context HOT 1
- Adding translations for clock::date_count_between()
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbplyr.