Comments (6)
Does It mean that it will never be possible to aggregate using first()?
e.g.
#> <SQL>
#> SELECT `h`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
#> GROUP BY `h`
from dbplyr.
Closely related: tidyverse/dplyr#2290
from dbplyr.
Also reported in sparklyr
under sparklyr/sparklyr#1504. sparklyr
reprex:
library(sparklyr)
sc <- spark_connect(mater = "local")
batting_tbl <- copy_to(sc, Lahman::Batting)
batting_tbl %>% head() %>%
group_by(yearID) %>%
summarise(meansting = mean(stint), firstname = first(playerID)) %>%
dbplyr::sql_render()
<SQL> SELECT `yearID`, AVG(`stint`) AS `meansting`, '`playerID`' AS `firstname`
FROM (SELECT *
FROM `sparklyr_88f216c7aaf6`
LIMIT 6) `ypzthwdkmo`
GROUP BY `yearID`
from dbplyr.
I thinks this is because first
is considered a window function, activated by mutate()
, as oppossed to an grouped/aggregate function, activated by summarize
. I post two examples, one with first
and the other with max
:
library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)
t <- batting_tbl %>%
head(100) %>%
group_by(teamID) %>%
mutate(x = first(playerID)) %>%
select(teamID, x)
t
#> # Source: lazy query [?? x 2]
#> # Database: spark_connection
#> # Groups: teamID
#> teamID x
#> <chr> <chr>
#> 1 BS1 barnero01
#> 2 BS1 barnero01
#> 3 BS1 barnero01
#> 4 BS1 barnero01
#> 5 BS1 barnero01
#> 6 BS1 barnero01
#> 7 BS1 barnero01
#> 8 BS1 barnero01
#> 9 BS1 barnero01
#> 10 CH1 brannmi01
#> # ... with more rows
show_query(t)
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `playerID`, `yearID`, `stint`, `teamID`, `lgID`, `G`, `AB`, `R`, `H`, `X2B`, `X3B`, `HR`, `RBI`, `SB`, `CS`, `BB`, `SO`, `IBB`, `HBP`, `SH`, `SF`, `GIDP`, first_value(`playerID`) OVER (PARTITION BY `teamID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_35d8306e463f`
#> LIMIT 100) `jyepziefva`) `bhsbzffqdj`
spark_disconnect(sc)
If you try with max()
, it will return one row per group, which is what I usually do in MS SQL
library(sparklyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)
t <- batting_tbl %>%
head(100) %>%
group_by(teamID) %>%
summarise(x = max(playerID)) %>%
select(teamID, x)
t
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> # Source: lazy query [?? x 2]
#> # Database: spark_connection
#> teamID x
#> <chr> <chr>
#> 1 BS1 spaldal01
#> 2 CH1 simmojo01
#> 3 CL1 questjo01
#> 4 FW1 selmafr01
#> 5 NY2 startjo01
#> 6 PH1 senseco01
#> 7 RC1 sagerpo01
#> 8 TRO pikeli01
#> 9 WS3 nortofr01
show_query(t)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `teamID`, MAX(`playerID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_1d1870e96305`
#> LIMIT 100) `gpdizhnmrf`
#> GROUP BY `teamID`) `mgevjkueih`
spark_disconnect(sc)
Created on 2018-09-20 by the reprex package (v0.2.0).
from dbplyr.
Minimal reprex:
library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)
mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_djljnlxwsc`
mf %>% summarise(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, '`x`' AS `f`
#> FROM `dbplyr_djljnlxwsc`
#> GROUP BY `g`
The source of the bug appears to be that first
appears in the window functions, but not in the aggregate functions, so somehow dplyr::first()
is getting called locally. I can fix that.
However, this is only going to deliver a better error message because in SQL FIRST_VALUE()
is only window function, not also an aggregation function as it is R (because, I think, it depends on a notion of order that does not exist in SQL, except in windows).
from dbplyr.
library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)
mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
mf %>% summarise(f = first(x)) %>% show_query()
#> Error: `first()` is only available in a windowed (`mutate()`) context
from dbplyr.
Related Issues (20)
- Breaking changes in dbplyr 2.5.0 HOT 9
- `distinct()` in Databricks/SparkSQL causes "arrange()... __row_num_*" error
- `head()` modifies `SELECT *`
- Export `check_na_rm()`
- rows_upsert worked in 2.4.0, broke with 2.5.0 HOT 2
- `sql` in `in_catalog` breaks with `filter` HOT 1
- Regression in window_order variable argument parsing for 2.5.0
- Unable to use quantile() in mutate() in DuckDB HOT 4
- Possible inconsistency in translation of stringr::str_like() with respect to ignore_case argument HOT 2
- postition of head in pipeline no longer leading to different sql HOT 3
- rows_delete fails due to key constraints: Canβt Modify Database Table
- dbplyr `across()` behavior differs from dplyr in a grouped context HOT 1
- Adding translations for clock::date_count_between()
- `rand_expr` is undocumented
- overflow error when counting very large tables in sql server
- Unusual behaviour of dplyr::pull() HOT 2
- Make compute docs easier to understand
- Supporting persisted tables for Spark SQL backend
- MSSQL slice_sample() translation always returns the same rows
- na_matches="na" coerces inequality and overlap joins to equality joins
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbplyr.