The first function in dbplyr returns whatever express

Closely related: <a class="issue-link js-issue-link" data-error-text="Failed to load t

Also reported in sparklyr under <a class="issue-link

Minimal reprex: <div class="highlight highlight-source-r notranslate position-rela

<div class="highlight highlight-source-r notranslate position-relative overflow-auto" dir="auto" dat

first() doesn't work in summaries about dbplyr HOT 6 CLOSED

tidyverse commented on May 29, 2024 4

first() doesn't work in summaries

from dbplyr.

Comments (6)

Athospd commented on May 29, 2024 1

Does It mean that it will never be possible to aggregate using first()?
e.g.

#> <SQL>
#> SELECT `h`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
#> GROUP BY `h`

from dbplyr.

karldw commented on May 29, 2024

Closely related: tidyverse/dplyr#2290

from dbplyr.

javierluraschi commented on May 29, 2024

Also reported in sparklyr under sparklyr/sparklyr#1504. sparklyr reprex:

library(sparklyr)

sc <- spark_connect(mater = "local")
batting_tbl <- copy_to(sc, Lahman::Batting)

batting_tbl %>% head() %>%
  group_by(yearID) %>%
  summarise(meansting = mean(stint), firstname = first(playerID)) %>%
  dbplyr::sql_render()

<SQL> SELECT `yearID`, AVG(`stint`) AS `meansting`, '`playerID`' AS `firstname`
FROM (SELECT *
FROM `sparklyr_88f216c7aaf6`
LIMIT 6) `ypzthwdkmo`
GROUP BY `yearID`

from dbplyr.

edgararuiz-zz commented on May 29, 2024

I thinks this is because first is considered a window function, activated by mutate(), as oppossed to an grouped/aggregate function, activated by summarize. I post two examples, one with first and the other with max:

library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  mutate(x = first(playerID)) %>%
  select(teamID, x)
t
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#> # Groups:   teamID
#>    teamID x        
#>    <chr>  <chr>    
#>  1 BS1    barnero01
#>  2 BS1    barnero01
#>  3 BS1    barnero01
#>  4 BS1    barnero01
#>  5 BS1    barnero01
#>  6 BS1    barnero01
#>  7 BS1    barnero01
#>  8 BS1    barnero01
#>  9 BS1    barnero01
#> 10 CH1    brannmi01
#> # ... with more rows
show_query(t)
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `playerID`, `yearID`, `stint`, `teamID`, `lgID`, `G`, `AB`, `R`, `H`, `X2B`, `X3B`, `HR`, `RBI`, `SB`, `CS`, `BB`, `SO`, `IBB`, `HBP`, `SH`, `SF`, `GIDP`, first_value(`playerID`) OVER (PARTITION BY `teamID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_35d8306e463f`
#> LIMIT 100) `jyepziefva`) `bhsbzffqdj`
spark_disconnect(sc)

If you try with max(), it will return one row per group, which is what I usually do in MS SQL

library(sparklyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  summarise(x = max(playerID)) %>%
  select(teamID, x)
t
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#>   teamID x        
#>   <chr>  <chr>    
#> 1 BS1    spaldal01
#> 2 CH1    simmojo01
#> 3 CL1    questjo01
#> 4 FW1    selmafr01
#> 5 NY2    startjo01
#> 6 PH1    senseco01
#> 7 RC1    sagerpo01
#> 8 TRO    pikeli01 
#> 9 WS3    nortofr01
show_query(t)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `teamID`, MAX(`playerID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_1d1870e96305`
#> LIMIT 100) `gpdizhnmrf`
#> GROUP BY `teamID`) `mgevjkueih`
spark_disconnect(sc)

Created on 2018-09-20 by the reprex package (v0.2.0).

from dbplyr.

hadley commented on May 29, 2024

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_djljnlxwsc`
mf %>% summarise(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, '`x`' AS `f`
#> FROM `dbplyr_djljnlxwsc`
#> GROUP BY `g`

The source of the bug appears to be that first appears in the window functions, but not in the aggregate functions, so somehow dplyr::first() is getting called locally. I can fix that.

However, this is only going to deliver a better error message because in SQL FIRST_VALUE() is only window function, not also an aggregation function as it is R (because, I think, it depends on a notion of order that does not exist in SQL, except in windows).

from dbplyr.

hadley commented on May 29, 2024

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
mf %>% summarise(f = first(x)) %>% show_query()
#> Error: `first()` is only available in a windowed (`mutate()`) context

from dbplyr.

first() doesn't work in summaries about dbplyr HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent