Giter Site home page Giter Site logo

Comments (6)

Athospd avatar Athospd commented on May 29, 2024 1

Does It mean that it will never be possible to aggregate using first()?
e.g.

#> <SQL>
#> SELECT `h`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
#> GROUP BY `h`

from dbplyr.

karldw avatar karldw commented on May 29, 2024

Closely related: tidyverse/dplyr#2290

from dbplyr.

javierluraschi avatar javierluraschi commented on May 29, 2024

Also reported in sparklyr under sparklyr/sparklyr#1504. sparklyr reprex:

library(sparklyr)

sc <- spark_connect(mater = "local")
batting_tbl <- copy_to(sc, Lahman::Batting)

batting_tbl %>% head() %>%
  group_by(yearID) %>%
  summarise(meansting = mean(stint), firstname = first(playerID)) %>%
  dbplyr::sql_render()
<SQL> SELECT `yearID`, AVG(`stint`) AS `meansting`, '`playerID`' AS `firstname`
FROM (SELECT *
FROM `sparklyr_88f216c7aaf6`
LIMIT 6) `ypzthwdkmo`
GROUP BY `yearID`

from dbplyr.

edgararuiz-zz avatar edgararuiz-zz commented on May 29, 2024

I thinks this is because first is considered a window function, activated by mutate(), as oppossed to an grouped/aggregate function, activated by summarize. I post two examples, one with first and the other with max:

library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  mutate(x = first(playerID)) %>%
  select(teamID, x)
t
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#> # Groups:   teamID
#>    teamID x        
#>    <chr>  <chr>    
#>  1 BS1    barnero01
#>  2 BS1    barnero01
#>  3 BS1    barnero01
#>  4 BS1    barnero01
#>  5 BS1    barnero01
#>  6 BS1    barnero01
#>  7 BS1    barnero01
#>  8 BS1    barnero01
#>  9 BS1    barnero01
#> 10 CH1    brannmi01
#> # ... with more rows
show_query(t)
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `playerID`, `yearID`, `stint`, `teamID`, `lgID`, `G`, `AB`, `R`, `H`, `X2B`, `X3B`, `HR`, `RBI`, `SB`, `CS`, `BB`, `SO`, `IBB`, `HBP`, `SH`, `SF`, `GIDP`, first_value(`playerID`) OVER (PARTITION BY `teamID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_35d8306e463f`
#> LIMIT 100) `jyepziefva`) `bhsbzffqdj`
spark_disconnect(sc)

If you try with max(), it will return one row per group, which is what I usually do in MS SQL

library(sparklyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  summarise(x = max(playerID)) %>%
  select(teamID, x)
t
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#>   teamID x        
#>   <chr>  <chr>    
#> 1 BS1    spaldal01
#> 2 CH1    simmojo01
#> 3 CL1    questjo01
#> 4 FW1    selmafr01
#> 5 NY2    startjo01
#> 6 PH1    senseco01
#> 7 RC1    sagerpo01
#> 8 TRO    pikeli01 
#> 9 WS3    nortofr01
show_query(t)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `teamID`, MAX(`playerID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_1d1870e96305`
#> LIMIT 100) `gpdizhnmrf`
#> GROUP BY `teamID`) `mgevjkueih`
spark_disconnect(sc)

Created on 2018-09-20 by the reprex package (v0.2.0).

from dbplyr.

hadley avatar hadley commented on May 29, 2024

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_djljnlxwsc`
mf %>% summarise(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, '`x`' AS `f`
#> FROM `dbplyr_djljnlxwsc`
#> GROUP BY `g`

The source of the bug appears to be that first appears in the window functions, but not in the aggregate functions, so somehow dplyr::first() is getting called locally. I can fix that.

However, this is only going to deliver a better error message because in SQL FIRST_VALUE() is only window function, not also an aggregation function as it is R (because, I think, it depends on a notion of order that does not exist in SQL, except in windows).

from dbplyr.

hadley avatar hadley commented on May 29, 2024
library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
mf %>% summarise(f = first(x)) %>% show_query()
#> Error: `first()` is only available in a windowed (`mutate()`) context

from dbplyr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.