
dbplyr's Introduction

dbplyr

CRAN status R-CMD-check Codecov test coverage

Overview

dbplyr is the database backend for dplyr. It allows you to use remote database tables as if they are in-memory data frames by automatically converting dplyr code into SQL.

To learn more about why you might use dbplyr instead of writing SQL, see vignette("sql"). To learn more about the details of the SQL translation, see vignette("translation-verb") and vignette("translation-function").

Installation

# The easiest way to get dbplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dbplyr:
install.packages("dbplyr")

# Or the development version from GitHub:
# install.packages("pak")
pak::pak("tidyverse/dbplyr")

Usage

dbplyr is designed to work with database tables as if they were local data frames. To demonstrate this I’ll first create an in-memory SQLite database and copy over a dataset:

library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

Note that you don’t actually need to load dbplyr with library(dbplyr); dplyr automatically loads it for you when it sees you working with a database. Database connections are coordinated by the DBI package. Learn more at https://dbi.r-dbi.org/

Now you can retrieve a table using tbl() (see ?tbl_dbi for more details). Printing it just retrieves the first few rows:

mtcars2 <- tbl(con, "mtcars")
mtcars2
#> # Source:   table<`mtcars`> [?? x 11]
#> # Database: sqlite 3.45.0 [:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ more rows

All dplyr calls are evaluated lazily, generating SQL that is only sent to the database when you request the data.

# lazily generates query
summary <- mtcars2 %>% 
  group_by(cyl) %>% 
  summarise(mpg = mean(mpg, na.rm = TRUE)) %>% 
  arrange(desc(mpg))

# see query
summary %>% show_query()
#> <SQL>
#> SELECT `cyl`, AVG(`mpg`) AS `mpg`
#> FROM `mtcars`
#> GROUP BY `cyl`
#> ORDER BY `mpg` DESC

# execute query and retrieve results
summary %>% collect()
#> # A tibble: 3 × 2
#>     cyl   mpg
#>   <dbl> <dbl>
#> 1     4  26.7
#> 2     6  19.7
#> 3     8  15.1

Code of Conduct

Please note that the dbplyr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

dbplyr's People

Contributors

arunsrinivasan, austenhead, batpigandme, cderv, colearendt, cosinequanon, davidpatshuifong, dpprdan, edgararuiz, eibanez, ejneer, fh-afrachioni, fh-mthomson, hadley, hannes, ilarischeinin, imanuelcostigan, javierluraschi, jimhester, kevinushey, krlmlr, leondutoit, lindbrook, lionel-, mgirlich, pimentel, romainfrancois, sfirke, shosaco, zeehio


dbplyr's Issues

Inconsistency between ifelse and case_when using dbplyr

@simon-anasta commented on Aug 27, 2018, 2:14 AM UTC:

When writing queries using dbplyr for SQL Server, ifelse and case_when produce different SQL code. The SQL code from case_when results in an exception. See also this question.

Sorry the below is not a perfect reprex. Security restrictions prevent me from including business specific details in my example.

# libraries
library(DBI)
library(dplyr)
library(dbplyr)

# establish connection to database table
connection_string = "database.specific.string"
# mine looks something like "DRIVER=...; Trusted_Connection=...; DATABASE=...' SERVER=..."
db_connection = dbConnect(odbc::odbc(), .connection_string = connection_string)
my_table = tbl(db_connection, from = my_table_name)

# attempted query
tmp = my_table %>%
    mutate(new_col = case_when(col1 == col2 ~ "a",
                               TRUE ~ "b"))

# check SQL code for query
show_query(tmp)

# compared to
tmp = my_table %>%
    mutate(new_col = ifelse(col1 == col2, "a", "b"))

# check SQL code for query
show_query(tmp)

The resulting SQL query from the case_when command takes the form:

SELECT 
    col1, col2,
    CASE
       WHEN CONVERT(BIT, IIF(col1 = col2, 1.0, 0.0))) THEN ('a')
       WHEN (TRUE) THEN ('b')
    END AS new_col
FROM my_database.my_table_name

This produces an exception: "An expression of non-boolean type specified in a context where a condition is expected, near 'THEN'"

The resulting SQL query from the ifelse command takes the form:

SELECT 
    col1, col2,
    CASE
       WHEN (CONVERT(BIT, IIF(col1 = col2, 1.0, 0.0)))) = TRUE THEN ('a')
       WHEN (CONVERT(BIT, IIF(col1 = col2, 1.0, 0.0)))) = FALSE THEN ('b')
    END AS new_col
FROM my_database.my_table_name

This works as expected.

As best I can determine, this is a bug. I suspect the SQL syntax from case_when should be more similar to the SQL syntax from ifelse. In particular I suspect the additional brackets and the "= TRUE" in the ifelse version should also appear in the case_when version.
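
For reference, a connection-free sketch of the same comparison (assuming the installed dbplyr provides lazy_frame() and simulate_mssql(); the column names are made up) would be:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

# Simulated SQL Server translation, no live connection needed
lf <- lazy_frame(col1 = "x", col2 = "y", con = simulate_mssql())

# case_when() translation
lf %>%
  mutate(new_col = case_when(col1 == col2 ~ "a", TRUE ~ "b")) %>%
  show_query()

# ifelse() translation, for comparison
lf %>%
  mutate(new_col = ifelse(col1 == col2, "a", "b")) %>%
  show_query()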

This issue was moved by romainfrancois from tidyverse/dplyr#3774.

inconsistency between dplyr and dbplyr summarise_at when variable names reused

@tslumley commented on Apr 27, 2018, 1:46 AM UTC:

When summarise_at uses a variable in .vars that it also uses in .funs, the result is different in dplyr and dbplyr, and I think the dbplyr result is the right one -- though there's a case for just making it an error.

library(dplyr)
library(MonetDBLite)
library(DBI)
mydb <- dbConnect(MonetDBLite::MonetDBLite())
DBI::dbWriteTable(mydb, "mtcarsdb", mtcars)
mtcars.db = tbl(mydb, "mtcarsdb")

## wrong
mtcars %>% summarise_at(vars(cyl,disp,hp,drat), funs(sum(. *  cyl * mpg)))
## right
mtcars.db %>% summarise_at(vars(cyl,disp,hp,drat), funs(sum(. *  cyl * mpg)))
## also right
mtcars %>% mutate(cyl1=cyl) %>% summarise_at(vars(cyl,disp,hp,drat), funs(sum(. *  cyl1 * mpg)))

# intended result
with(mtcars, cbind(cyl * cyl * mpg,
                     disp * cyl * mpg,
                     hp * cyl * mpg,
                     drat * cyl * mpg) %>% colSums())

It looks as though in R the variable cyl is updated before it is used for the 2nd through 4th sums, but in SQL it isn't.

This issue was moved by krlmlr from tidyverse/dplyr/issues/3547.

summarize_all() throws error when using tilde notation

I'm not sure if this will be fixed with tidyverse/dplyr#3433. Also, I don't understand why the second example gives a warning. Thanks @Romanik!

library(tidyverse)
dbplyr::memdb_frame(a = 1, b = 2) %>% 
  summarize_all(~mean(.))
#> Error in mean(.): object 'a' not found
dbplyr::memdb_frame(a = 1, b = 2) %>% 
  summarize_all(mean)
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.22.0 [:memory:]
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
dbplyr::memdb_frame(a = 1, b = 2) %>% 
  summarize_if(funs(is.numeric), ~mean(.))
#> Applying predicate on the first 100 rows
#> Error in mean(.): object 'a' not found

Created on 2018-07-25 by the reprex package (v0.2.0).

Still existing link to dplyr for issue reports

I understand there has been a choice to now keep dbplyr issues in their own repo (change in c98ff1f).

There is still a link in the DESCRIPTION file telling users to report bugs in dplyr:

BugReports: https://github.com/tidyverse/dplyr/issues

Also, the pkgdown site is not up-to-date, but since the README has been modified this should be fixed on the next build. I do not know your timeframe for updating the website.

Cannot pull views (using dbplyr & RMariaDB)

@foundinblank commented on Apr 25, 2018, 12:28 PM UTC:

Hi,

I'm unable to pull views via my RMariaDB connection. Here's an example below where I'm trying to pull a view called video_product_lookup, but I can replicate the issue with all other views. I can pull regular tables just fine.

Thanks,
Adam

# Load packages
library(RMariaDB)
#> Warning: package 'RMariaDB' was built under R version 3.4.4
library(tidyverse)
library(DBI)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

# Connect to database
reporting <- dbConnect(RMariaDB::MariaDB(), group = "reporting")
dbListTables(reporting) # This does list the table (view) I want to connect to
#> ...        
#> [67] "video_product_lookup"
#> ...        

# Get the view
df <- tbl(reporting, "video_product_lookup")
#> Error in result_create(conn@ptr, statement, is_statement): Prepared statement needs to be re-prepared [1615]

# Get the view a different way
df <- dbReadTable(reporting, "video_product_lookup")
#> Error in result_create(conn@ptr, statement, is_statement): Prepared statement needs to be re-prepared [1615]

# Get a regular table - this works fine. 
df <- tbl(reporting, "biz_orders")

Created on 2018-04-25 by the reprex package (v0.2.0).

This issue was moved by batpigandme from tidyverse/dplyr/issues/3541.

DBI::dbExecute (Teradata implementation?) replaces the statement's line breaks (\n) with carriage returns (\r)

Hi,

When I use DBI::dbExecute(con, statement) to send a CREATE VIEW / CREATE TABLE statement to Teradata and then run a SHOW VIEW query to pull the statement back, the text consistently gets mangled: the \n characters are missing. I have tested sending an identical .sql file to the database from another IDE and the \n was preserved.

DBI        * 1.0.0      2018-05-02 cran (@1.0.0)                          
dbplyr       1.2.1      2018-02-19 cran (@1.2.1) 
 Teradata Version: 15.10.0711  15.10.07.11B

Can't Join on Variables of Different Name

I've been trying to join two tables which have slightly different variable names, for instance start_date and day_start_date.

I've been trying to write something like:
inner_join(calendar.dim, by = c("start_date" = "day_start_date"))

I have googled quite a bit and read through the documentation and can't seem to find where this exists.
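
For what it's worth, the named by syntax is supported for database tables too. Here is a minimal sketch using in-memory SQLite tables via dbplyr::memdb_frame() (the table and column contents below are made up for illustration):

library(dplyr, warn.conflicts = FALSE)

orders   <- dbplyr::memdb_frame(start_date = c(1, 2), revenue = c(10, 20))
calendar <- dbplyr::memdb_frame(day_start_date = c(1, 2), fiscal_week = c(1, 1))

orders %>%
  inner_join(calendar, by = c("start_date" = "day_start_date")) %>%
  show_query()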

Compute does not work with teradata

Calling compute() on a Teradata connection does not work and throws the following error:

Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: 42000: [Teradata][ODBC Teradata Driver][Teradata Database](-3707)Syntax error, expected something like a 'METHOD' keyword between the 'CREATE' keyword and the 'TEMPORARY' keyword. 

I think you need to use the keyword VOLATILE instead of TEMPORARY for Teradata, right?

copy_to with in_schema fails with DBI 1.0.0

@OssiLehtinen commented on Jun 7, 2018, 10:29 AM UTC:

Using copy_to() from dbplyr with an in_schema() name definition results in an error message when working with DBI 1.0.0:

copy_to(con, mtcars, name=in_schema("TEMP", "MTCARS"))
#> Error: Can't unquote TEMP.MTCARS

The code for method dbUnquoteIdentifier in DBI's quote.R seems to have changed, which I suspect is causing the problem. I think it doesn't like the "." produced by in_schema.

The previous version I was using (DBI 0.8.0) didn't have this issue.
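
Until this is resolved, one possible workaround (a sketch, assuming an odbc/DBI connection that accepts DBI::Id() table identifiers) is to write the table with DBI directly, spelling out the schema via DBI::Id(), and then point a tbl() at it:

# Hypothetical workaround sketch: write via DBI, then reference with in_schema()
DBI::dbWriteTable(con, DBI::Id(schema = "TEMP", table = "MTCARS"), mtcars)
mtcars_db <- dplyr::tbl(con, dbplyr::in_schema("TEMP", "MTCARS"))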

Some technical specs: I'm running under Redhat 7, and the database is IBM db2 via odbc.

Please let me know if some essential info is missing, and thanks for any help in advance!

Br, Ossi

This issue was moved by krlmlr from tidyverse/dplyr/issues/3646.

n_distinct() on sqlite with more columns returns an error

@edoardomichielon commented on Jun 20, 2018, 11:10 AM UTC:


When I use a connection to an SQLite database on disk, the verb n_distinct() returns an error if there are two or more columns. The same code works properly if I select just one column or collect the data before counting records.

# remove objects
rm(list =  ls())

# require packages
require(dplyr)
require(dbplyr)
require(RSQLite)

# create a sqlite db and connect to it
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy data into sqlite
copy_to(con, mtcars)

# point to table 
my_tbl <- tbl(con, "mtcars")

# n_distinct with one column (THIS WORKS)
my_tbl %>% group_by(gear) %>% summarise(n_distinct(mpg))

# if I use the local (or collected) data it is ok (THIS WORKS)
my_tbl %>% collect() %>% group_by(gear) %>% summarise(n_distinct(mpg, cyl))

# n_distinct with two columns (THIS DOES NOT WORK)
my_tbl %>% group_by(gear) %>% summarise(n_distinct(mpg, cyl))

## Error in result_create(conn@ptr, statement) : 
## wrong number of arguments to function COUNT()

The code is correctly translated into SQL:

# Show query
my_tbl %>% group_by(gear) %>% summarise(n_distinct(mpg, cyl)) %>% show_query()

## SELECT `gear`, COUNT(DISTINCT `mpg`, `cyl`) AS `n_distinct(mpg, cyl)`
## FROM `mtcars`
## GROUP BY `gear`
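
A possible workaround (a sketch, not a fix for the translation itself) is to count distinct combinations with distinct() followed by count():

# count distinct (mpg, cyl) pairs per gear without COUNT(DISTINCT x, y)
my_tbl %>%
  distinct(gear, mpg, cyl) %>%
  count(gear)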

This issue was moved by batpigandme from tidyverse/dplyr/issues/3687.

dbplyr: n_distinct translation does not handle na.rm

@colearendt commented on Jul 20, 2018, 10:33 AM UTC:

Definition of n_distinct:

function (..., na.rm = FALSE) 
{
    n_distinct_multi(list(...), na.rm)
}
<environment: namespace:dplyr>

And the translations:

  n_distinct = function(...) {
    vars <- sql_vector(list(...), parens = FALSE, collapse = ", ")
    build_sql("COUNT(DISTINCT ", vars, ")")
  }

  n_distinct = function(...) {
    vars <- sql_vector(list(...), parens = FALSE, collapse = ", ")
    win_over(build_sql("COUNT(DISTINCT ", vars, ")"), win_current_group())
  }

Which results in:

pd %>% summarise(n_distinct(fld_integer, na.rm = TRUE)) %>% show_query()
#<SQL>
#SELECT COUNT(DISTINCT "fld_integer", TRUE AS "na.rm") AS "n_distinct(fld_integer, na.rm = TRUE)"
#FROM "testdata"

na.rm should not be included in the splicing.

This issue was moved by krlmlr from tidyverse/dplyr#3720.

dplyr::case_when() combined with mutate() does not work when applied to tbl

@CarolineBarret commented on Aug 2, 2018, 1:14 PM UTC:

I am working with R 3.4.3 and dplyr 0.7.4.

I am trying to apply the case_when() function to a tibble object from a database. But when I combine the case_when() function with the mutate() function, I get an error:

dir.create("tmp")
# The datasets.sqlite is just a sqlite database 
# It contains a table called iris with the iris data frame in it
file.copy("data/datasets.sqlite", "tmp")

# Connect to the database
con <- DBI::dbConnect(RSQLite::SQLite(), "tmp/datasets.sqlite")

cake <- dplyr::mutate(dplyr::tbl(con, "iris"),
                        type = dplyr::case_when(
        Sepal.Length > 5.8 | Petal.Length > 3.7 ~ "long",
        Sepal.Width > 3  | Petal.Width > 1.2 ~ "wide",
        TRUE ~  "other"
      )
    )

This gives me the following error: Error in eval_bare(f[[2]], env) : object 'Sepal.Length' not found

I think this has to do with the fact that I use mutate(), because I do not get an error with the following:

  ir <- dplyr::tbl(con, "iris")
  cake <- dplyr::case_when(
                          ir$Sepal.Length > 5.8 | ir$Petal.Length > 3.7 ~ "long",
                          ir$Sepal.Width > 3  | ir$Petal.Width > 1.2 ~ "wide",
                          TRUE ~  "other"
                        )

It also has to do with the tbl output (and not with data frames), because this works fine too:

  cake <- dplyr::mutate(DBI::dbReadTable(con, "iris"),
                        type = dplyr::case_when(
                          Sepal.Length > 5.8 | Petal.Length > 3.7 ~ "long",
                          Sepal.Width > 3  | Petal.Width > 1.2 ~ "wide",
                          TRUE ~  "other"
                        )
  )

This issue was moved by krlmlr from tidyverse/dplyr#3736.

.data pronoun does not work

@sverchkov commented on Jul 20, 2018, 7:10 PM UTC:

Using the .data pronoun in filter, mutate, and transmute does not work with database tables

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

### Using mutate

# works
mutate(mtcars, foo = .data$cyl)

# also works
mutate(mtcars2, foo = cyl)

# doesn't
mutate(mtcars2, foo = .data$cyl)
# Error: Column `cyl` not found in `.data`

### Using transmute

# works
transmute(mtcars, blah = .data$cyl)

# also works
transmute(mtcars, blah = cyl)

# doesn't
transmute(mtcars2, blah = .data$cyl)
# Error: Column `cyl` not found in `.data`

### Using transmute to select

# works
transmute(mtcars, .data$cyl)

# also works
transmute(mtcars2, cyl)

# doesn't
transmute(mtcars2, .data$cyl)
# Error: Column `cyl` not found in `.data`

# select doesn't have this issue, this works
select( mtcars2, .data$cyl )

### Using filter

# works
filter( mtcars, .data$cyl > 4 )

# also works
filter( mtcars2, cyl > 4 )

# doesn't
filter( mtcars2, .data$cyl > 4 )
# Error: Column `cyl` not found in `.data`

(I actually ran into this when running code with a Postgres database, so it isn't DB-specific)

This is different from issue#3370, since everything works without the .data pronoun, but might be somehow related.

This issue was moved by krlmlr from tidyverse/dplyr#3722.

as.character(column_name) translates to CAST(column_name as TEXT) for BigQuery

Casting to a string in BigQuery uses CAST(column_name AS STRING). When dbplyr translates as.character(column_name) it writes CAST(column_name AS TEXT), which fails.

> library(dbplyr)
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dbplyr_1.2.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     dplyr_0.7.6      assertthat_0.2.0 R6_2.2.2        
 [5] DBI_1.0.0        magrittr_1.5     pillar_1.2.3     rlang_0.2.2     
 [9] bindrcpp_0.2.2   tools_3.4.4      glue_1.3.0       purrr_0.2.5     
[13] yaml_2.1.19      compiler_3.4.4   pkgconfig_2.0.2  bindr_0.1.1     
[17] tidyselect_0.2.4 tibble_1.4.2

Example:

project <- "project-name"
billing <- "billing-name"
con <- dbConnect(
  bigrquery::bigquery(),
  project = project,
  billing = billing
)
# a date dimensional table
date_dim <- tbl(con, "project.schema.table")
## illustrative of date_dim
#date_dim <- data.frame(date_key = c(20180101:20180131), fiscal_month_name = rep("JAN", 31))

date_dim %>%
  select(DATE_KEY, FISCAL_MONTH_NAME) %>%
  mutate(date_key_str = as.character(DATE_KEY)) %>%
  show_query()

<SQL>
SELECT `DATE_KEY`, `FISCAL_MONTH_NAME`, CAST(`DATE_KEY` AS TEXT) AS `date_key_str`
FROM (SELECT `DATE_KEY`, `FISCAL_MONTH_NAME`
FROM `project.schema.table`) `vnzsigtmhn`
date_dim %>%
  select(DATE_KEY, FISCAL_MONTH_NAME) %>%
  mutate(date_key_str = as.character(DATE_KEY)) %>%
  collect()

Attempting to collect then produces this error:

Error: Type not found: TEXT at [2:66] [invalidQuery]

Unfortunately hard-coding the CAST into the mutate does not work either.

> date_dim %>%
+   select(DATE_KEY, FISCAL_MONTH_NAME) %>%
+   mutate(date_key_str = CAST(DATE_KEY as STRING))
Error: unexpected symbol in:
"  select(DATE_KEY, FISCAL_MONTH_NAME) %>%
  mutate(date_key_str = CAST(DATE_KEY as"
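
A possible escape hatch (a sketch, relying on dbplyr passing sql() literals through untranslated) is to write the cast as raw SQL inside mutate():

date_dim %>%
  select(DATE_KEY, FISCAL_MONTH_NAME) %>%
  mutate(date_key_str = sql("CAST(DATE_KEY AS STRING)")) %>%
  show_query()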

EDIT: The original script was running on an old version of R; I updated and the issue still persists.

Like mutate, summarise needs to create subqueries when necessary

@AjarKeen commented on Jan 8, 2018, 9:44 PM UTC:

cc @javierluraschi since I've only tested this with sparklyr, not with other dbplyr backends.

I'm starting to use `sparklyr` to do grouped time series aggregations, and very quickly ran into this problem. Here's a modest reprex showing how it works locally, the `sparklyr` version that fails, and then a workaround:
library(dplyr)
library(sparklyr)
library(nycflights13)

# local version
flights %>%
  group_by(carrier) %>%
  summarize(count_num = n(), 
            mean_dep_delay = mean(dep_delay, na.rm = TRUE),
            ratio = mean_dep_delay / count_num) %>% 
  arrange(carrier)

#> # A tibble: 16 x 4
#>    carrier count_num mean_dep_delay    ratio
#>    <chr>       <int>          <dbl>    <dbl>
#>  1 9E          18460          16.7  0.000906
#>  2 AA          32729           8.59 0.000262
#>  3 AS            714           5.80 0.00813 
#>  4 B6          54635          13.0  0.000238
#>  5 DL          48110           9.26 0.000193
#>  6 EV          54173          20.0  0.000368
#>  7 F9            685          20.2  0.0295  
#>  8 FL           3260          18.7  0.00574 
#>  9 HA            342           4.90 0.0143  
#> 10 MQ          26397          10.6  0.000400
#> 11 OO             32          12.6  0.393   
#> 12 UA          58665          12.1  0.000206
#> 13 US          20536           3.78 0.000184
#> 14 VX           5162          12.9  0.00249 
#> 15 WN          12275          17.7  0.00144 
#> 16 YV            601          19.0  0.0316

# Spark version
sc <- spark_connect(master = "local")

flights_sdf <- copy_to(sc, flights, "flights")

flights_sdf %>%
  group_by(carrier) %>%
  summarize(count_num = n(), 
            mean_dep_delay = mean(dep_delay),
            ratio = mean_dep_delay / count_num) %>%
  collect()

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`mean_dep_delay`' given input columns: [dest, dep_delay, distance, dep_time, minute, carrier, origin, sched_arr_time, month, arr_time, day, flight, sched_dep_time, time_hour, arr_delay, air_time, hour, tailnum, year]; line 1 pos 81;
#> 'Aggregate [carrier#38], [carrier#38, count(1) AS count_num#447L, avg(dep_delay#34) AS mean_dep_delay#448, ('mean_dep_delay / 'count_num) AS ratio#449]
#> +- SubqueryAlias flights
#>    +- LogicalRDD [year#29, month#30, day#31, dep_time#32, sched_dep_time#33, dep_delay#34, arr_time#35, sched_arr_time#36, arr_delay#37, carrier#38, flight#39, tailnum#40, origin#41, dest#42, air_time#43, distance#44, hour#45, minute#46, time_hour#47]
#> 
#>  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$transformUp$1.apply(TreeNode.scala:289)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$transformUp$1.apply(TreeNode.scala:289)
#>  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$4.apply(TreeNode.scala:306)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$4.apply(TreeNode.scala:306)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1(QueryPlan.scala:289)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1$1.apply(QueryPlan.scala:293)
#>  at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
#>  at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
#>  at scala.collection.immutable.List.foreach(List.scala:381)
#>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
#>  at scala.collection.immutable.List.map(List.scala:285)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1(QueryPlan.scala:293)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$6.apply(QueryPlan.scala:298)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
#>  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
#>  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
#>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
#>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
#>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
#>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
#>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
#>  at java.lang.reflect.Method.invoke(Method.java:498)
#>  at sparklyr.Invoke$.invoke(invoke.scala:102)
#>  at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
#>  at sparklyr.StreamHandler$.read(stream.scala:62)
#>  at sparklyr.BackendHandler.channelRead0(handler.scala:52)
#>  at sparklyr.BackendHandler.channelRead0(handler.scala:14)
#>  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#>  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#>  at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
#>  at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#>  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
#>  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
#>  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
#>  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
#>  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
#>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
#>  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
#>  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
#>  at java.lang.Thread.run(Thread.java:748)

# Spark workaround
flights_sdf %>%
  group_by(carrier) %>%
  mutate(count_num = n(), 
         mean_dep_delay = mean(dep_delay),
         ratio = mean_dep_delay / count_num) %>%
  summarize(count_num = mean(count_num),
            mean_dep_delay = mean(mean_dep_delay),
            ratio = mean(ratio)) %>%
  arrange(carrier) %>%
  collect()

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `avg(x, na.rm = TRUE)` to silence this warning

#> # A tibble: 16 x 4
#>    carrier count_num mean_dep_delay    ratio
#>    <chr>       <dbl>          <dbl>    <dbl>
#>  1 9E        18460            16.7  0.000906
#>  2 AA        32729             8.59 0.000262
#>  3 AS          714             5.80 0.00813 
#>  4 B6        54635            13.0  0.000238
#>  5 DL        48110             9.26 0.000193
#>  6 EV        54173            20.0  0.000368
#>  7 F9          685            20.2  0.0295  
#>  8 FL         3260            18.7  0.00574 
#>  9 HA          342             4.90 0.0143  
#> 10 MQ        26397            10.6  0.000400
#> 11 OO           32.0          12.6  0.393   
#> 12 UA        58665            12.1  0.000206
#> 13 US        20536             3.78 0.000184
#> 14 VX         5162            12.9  0.00249 
#> 15 WN        12275            17.7  0.00144 
#> 16 YV          601            19.0  0.0316

My session info after I run the above:

> devtools::session_info()
Session info ---------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       <NA>                        
 date     2018-01-08                  

Packages -------------------------------------------------------------------------
 package      * version    date       source                           
 assertthat     0.2.0      2017-04-11 CRAN (R 3.4.1)                   
 backports      1.1.2      2017-12-13 cran (@1.1.2)                    
 base         * 3.4.1      2017-09-07 local                            
 base64enc      0.1-3      2015-07-28 CRAN (R 3.4.1)                   
 bindr          0.1        2016-11-13 CRAN (R 3.4.1)                   
 bindrcpp     * 0.2        2017-06-17 CRAN (R 3.4.1)                   
 broom          0.4.3      2017-11-20 cran (@0.4.3)                    
 callr          1.0.0      2016-06-18 CRAN (R 3.4.1)                   
 cli            1.0.0      2017-11-05 CRAN (R 3.4.1)                   
 clipr          0.4.0      2017-11-03 CRAN (R 3.4.1)                   
 compiler       3.4.1      2017-09-07 local                            
 config         0.2        2016-08-02 CRAN (R 3.4.1)                   
 crayon         1.3.4      2017-09-16 CRAN (R 3.4.1)                   
 datasets     * 3.4.1      2017-09-07 local                            
 DBI            0.7        2017-06-18 CRAN (R 3.4.1)                   
 dbplyr         1.2.0      2018-01-03 cran (@1.2.0)                    
 devtools       1.13.4     2017-11-09 CRAN (R 3.4.1)                   
 digest         0.6.13     2017-12-14 cran (@0.6.13)                   
 dplyr        * 0.7.4      2017-09-28 CRAN (R 3.4.1)                   
 evaluate       0.10.1     2017-06-24 CRAN (R 3.4.1)                   
 foreign        0.8-69     2017-06-22 CRAN (R 3.4.1)                   
 glue           1.2.0      2017-10-29 CRAN (R 3.4.1)                   
 graphics     * 3.4.1      2017-09-07 local                            
 grDevices    * 3.4.1      2017-09-07 local                            
 grid           3.4.1      2017-09-07 local                            
 htmltools      0.3.6      2017-04-28 CRAN (R 3.4.1)                   
 httpuv         1.3.5      2017-07-04 CRAN (R 3.4.1)                   
 httr           1.3.1      2017-08-20 CRAN (R 3.4.1)                   
 jsonlite       1.5        2017-06-01 CRAN (R 3.4.1)                   
 knitr          1.17       2017-08-10 CRAN (R 3.4.1)                   
 lattice        0.20-35    2017-03-25 CRAN (R 3.4.1)                   
 lazyeval       0.2.1      2017-10-29 CRAN (R 3.4.1)                   
 magrittr       1.5        2014-11-22 CRAN (R 3.4.1)                   
 memoise        1.1.0      2017-04-21 CRAN (R 3.4.1)                   
 methods      * 3.4.1      2017-09-07 local                            
 mime           0.5        2016-07-07 CRAN (R 3.4.1)                   
 mnormt         1.5-5      2016-10-15 CRAN (R 3.4.1)                   
 nlme           3.1-131    2017-02-06 CRAN (R 3.4.1)                   
 nycflights13 * 0.2.2      2017-01-27 CRAN (R 3.4.1)                   
 openssl        0.9.9      2017-11-10 cran (@0.9.9)                    
 parallel       3.4.1      2017-09-07 local                            
 pillar         1.0.1      2017-11-27 cran (@1.0.1)                    
 pkgconfig      2.0.1      2017-03-21 CRAN (R 3.4.1)                   
 plyr           1.8.4      2016-06-08 CRAN (R 3.4.1)                   
 psych          1.7.8      2017-09-09 CRAN (R 3.4.1)                   
 purrr          0.2.4      2017-10-18 CRAN (R 3.4.1)                   
 R6             2.2.2      2017-06-17 CRAN (R 3.4.1)                   
 Rcpp           0.12.14    2017-11-23 cran (@0.12.14)                  
 reprex       * 0.1.1      2017-01-13 CRAN (R 3.4.1)                   
 reshape2       1.4.3      2017-12-11 cran (@1.4.3)                    
 rlang          0.1.6      2017-12-21 cran (@0.1.6)                    
 rmarkdown      1.7        2017-11-10 CRAN (R 3.4.1)                   
 rprojroot      1.3-2      2018-01-03 cran (@1.3-2)                    
 rstudioapi     0.7        2017-09-07 CRAN (R 3.4.1)                   
 shiny          1.0.5      2017-08-23 CRAN (R 3.4.1)                   
 sparklyr     * 0.7.0-9106 2018-01-08 Github (rstudio/sparklyr@41d145a)
 stats        * 3.4.1      2017-09-07 local                            
 stringi        1.1.5      2017-04-07 CRAN (R 3.4.1)                   
 stringr        1.2.0      2017-02-18 CRAN (R 3.4.1)                   
 tibble         1.4.1      2017-12-25 cran (@1.4.1)                    
 tidyr          0.7.2      2017-10-16 CRAN (R 3.4.1)                   
 tools          3.4.1      2017-09-07 local                            
 utf8           1.1.3      2018-01-03 cran (@1.1.3)                    
 utils        * 3.4.1      2017-09-07 local                            
 whisker        0.3-2      2013-04-28 CRAN (R 3.4.1)                   
 withr          2.1.1      2017-12-19 cran (@2.1.1)                    
 xtable         1.8-2      2016-02-05 CRAN (R 3.4.1)                   
 yaml           2.1.14     2016-11-12 CRAN (R 3.4.1) 

I'm not sure if this is related to a similar-sounding issue with mutate / rename that was resolved in dbplyr 1.2.0. I can work around it for now, but it will get clunky fast as I get into more complex aggregation operations with sparklyr.
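
A possibly simpler workaround (a sketch) is to compute the derived column in a follow-up mutate(), so the summarised columns already exist when they are reused:

flights_sdf %>%
  group_by(carrier) %>%
  summarize(count_num = n(),
            mean_dep_delay = mean(dep_delay)) %>%
  mutate(ratio = mean_dep_delay / count_num) %>%
  arrange(carrier) %>%
  collect()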

This issue was moved by krlmlr from tidyverse/dplyr/issues/3295.

Add option for NOLOCK against MS SQL Server

@mkirzon commented on Apr 5, 2018, 11:38 PM UTC:

When using dplyr directly against databases, it'd be helpful to have a NOLOCK option when configuring table connections.

This is a common practice in MSSQL to ensure complex queries don't lock the database for other reads/writes. The tradeoff is that the read may be uncommitted, so this should be a toggleable option.

This issue was moved by krlmlr from tidyverse/dplyr/issues/3484.

Oracle `IN` clause limits

Oracle DBs limit the length of lists (such as those used with IN clauses) to 1000 items. See the discussion here.

Is this something that dbplyr would be open to addressing? If so, would one of the two methods suggested in the article above be preferable?

For example this throws an error:

lkp <- as.character(1:1001)
oracle_tbl %>% 
  filter(ID %in% lkp)
Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: HY000: ORA-01795: maximum number of expressions in a list is 1000

While this is ok:

lkp <- as.character(1:1000)
oracle_tbl %>% 
  filter(ID %in% lkp)
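
A possible workaround (a sketch; con stands for the DBI connection behind oracle_tbl, and "lkp_tmp" is a made-up temporary table name) is to upload the lookup values and filter with a semi join instead of a long IN list:

lkp_tbl <- copy_to(con, tibble::tibble(ID = as.character(1:1001)),
                   name = "lkp_tmp", temporary = TRUE)

oracle_tbl %>%
  semi_join(lkp_tbl, by = "ID")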

dbplyr on HIVE produces invalid SQL when using select_*() or select helpers on an in_schema() object

@sz-cgt commented on Nov 9, 2017, 11:01 PM UTC:

Simple example

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw")) %>% 
  select_all() %>% 
  show_query()

produces

<SQL>
SELECT `zzz10.mcidx` AS `zzz10.mcidx`, `zzz10.dcg_dmgrphc_risk_nbr` AS `zzz10.dcg_dmgrphc_risk_nbr`, `zzz10.dcg_unwgtd_retrospctv_risk_nbr` AS `zzz10.dcg_unwgtd_retrospctv_risk_nbr`, `zzz10.dcg_wgtd_retrospctv_risk_nbr` AS `zzz10.dcg_wgtd_retrospctv_risk_nbr`, `zzz10.dcg_unwgtd_prsptv_risk_nbr` AS `zzz10.dcg_unwgtd_prsptv_risk_nbr`, `zzz10.dcg_wgtd_prsptv_risk_nbr` AS `zzz10.dcg_wgtd_prsptv_risk_nbr`, `zzz10.dcg_incrd_prd_bgn_dt` AS `zzz10.dcg_incrd_prd_bgn_dt`, `zzz10.dcg_incrd_prd_end_dt` AS `zzz10.dcg_incrd_prd_end_dt`, `zzz10.file_name` AS `zzz10.file_name`, `zzz10.load_id` AS `zzz10.load_id`
FROM 1001_2.dxcg_raw

Note the missing table alias zzz10

Remove the schema reference and it works

db <- dbConnect(odbc::odbc(), "PRD")
DBI::dbSendStatement(db, "use 1001_2")
db %>% 
  tbl("dxcg_raw") %>% 
  select_all() %>% 
  show_query()

produces

<SQL>
SELECT `mcidx` AS `mcidx`, `dcg_dmgrphc_risk_nbr` AS `dcg_dmgrphc_risk_nbr`, `dcg_unwgtd_retrospctv_risk_nbr` AS `dcg_unwgtd_retrospctv_risk_nbr`, `dcg_wgtd_retrospctv_risk_nbr` AS `dcg_wgtd_retrospctv_risk_nbr`, `dcg_unwgtd_prsptv_risk_nbr` AS `dcg_unwgtd_prsptv_risk_nbr`, `dcg_wgtd_prsptv_risk_nbr` AS `dcg_wgtd_prsptv_risk_nbr`, `dcg_incrd_prd_bgn_dt` AS `dcg_incrd_prd_bgn_dt`, `dcg_incrd_prd_end_dt` AS `dcg_incrd_prd_end_dt`, `file_name` AS `file_name`, `load_id` AS `load_id`
FROM `dxcg_raw`

Swap to PostgreSQL and it works

con <- dbConnect(odbc::odbc(), "Validation")
con %>% 
  tbl(dbplyr::in_schema("c1022_1", "clmdx")) %>% 
  select_all() %>% 
  show_query()

produces

<SQL>
SELECT "claim_id" AS "claim_id", "sv_stat" AS "sv_stat", "serviceyearmonth" AS "serviceyearmonth", "claim_in_network" AS "claim_in_network", "dx" AS "dx"
FROM c1022_1.clmdx

Here's the connection information for the two systems (minus any personal details of course)

> odbc::dbGetInfo(db)
$dbname
[1] "HIVE"
$dbms.name
[1] "Hive"
$db.version
[1] "1.2.1.2.3.4.7-4"
$sourcename
[1] "PRD"
$servername
[1] "Hive"
$drivername
[1] "Hortonworks Hive ODBC Driver"
$odbc.version
[1] "03.80.0000"
$driver.version
[1] "2.1.10.1014"
$odbcdriver.version
[1] "03.80"
$supports.transactions
[1] FALSE
attr(,"class")
[1] "Hive"        "driver_info" "list"
> odbc::dbGetInfo(con)
$dbname
[1] "validation"
$dbms.name
[1] "PostgreSQL"
$db.version
[1] "9.6.3"
$sourcename
[1] "Validation"
$servername
[1] "lprdawswks0001.cedargatepartners.pvc"
$drivername
[1] "PSQLODBC35W.DLL"
$odbc.version
[1] "03.80.0000"
$driver.version
[1] "10.00.0000"
$odbcdriver.version
[1] "03.51"
$supports.transactions
[1] TRUE
attr(,"class")
[1] "PostgreSQL"  "driver_info" "list"       

This issue was moved by krlmlr from tidyverse/dplyr/issues/3200.

MSSQL connection. Errors in dplyr select() after arrange

@pssguy commented on Aug 29, 2017, 10:22 PM UTC:

I am attempting to use an MSSQL connection and am hitting this issue.

I first replicate the example from the dbplyr intro:

library(odbc)
library(DBI)

 library(tidyverse) 
 library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)
iris2 <- tbl(con, "iris")

iris2 %>% 
  arrange(Species) %>% 
  select(Sepal.Length)

This works fine

Now with an MSSQL connection

con2 <- dbConnect(odbc::odbc(), DSN = "premier")
DBI::dbWriteTable(con2, "iris", iris, overwrite=TRUE)
iris9 <- tbl(con2, "iris")

test1 <-iris9 %>% 
  arrange(Species) %>% 
  select(Sepal.Length)

test1

# Error: <SQL> 'SELECT TOP 1000 "Sepal.Length" AS "Sepal.Length" FROM (SELECT * FROM "iris" ORDER BY "Species") "odeiuzmtqh"' nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

So it appears as though the error relates to the use of TOP.

Reversing the select and arrange commands

test2 <-iris9 %>% 
  select(Sepal.Length) %>% 
  arrange(Species) 

test2 # No error

This works in a simple example, but I will sometimes need to arrange data prior to other processes in a pipe.

When I look at the generated query it does not exactly replicate the error, i.e. there is no mention of TOP 1000:

test1 %>% show_query()
# <SQL>
# SELECT "Sepal.Length" AS "Sepal.Length"
# FROM (SELECT *
# FROM "iris"
# ORDER BY "Species") "omlcgfsrjt"

Trying several alternatives in SQL


SELECT "Sepal.Length" AS "Sepal.Length"
 FROM (SELECT *
 FROM "iris"
 ORDER BY "Species") "omlcgfsrjt"

nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified. 
Failed to execute SQL chunk

Something that looks like the query from the error message:


SELECT TOP 1000 "Sepal.Length" AS "Sepal.Length"
 FROM (SELECT *
 FROM "iris"
 ORDER BY "Species") "omlcgfsrjt"

  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified. 
Failed to execute SQL chunk

Moving TOP 1000 into the sub-query produces the desired output:


SELECT  "Sepal.Length" AS "Sepal.Length"
 FROM (SELECT TOP 1000 *
 FROM "iris"
 ORDER BY "Species") "omlcgfsrjt"

Not sure if this is an error or just something that has not yet been addressed for MSSQL.

p.s. Why no issues option under dbplyr?

This issue was moved by hadley from tidyverse/dplyr/issues/3062.

Documentation clarification

In the "Introduction to dbplyr" the author recommends the following when deciding to use a database:

As well as working with local in-memory data stored in data frames, dplyr also works with remote on-disk data stored in databases. This is particularly useful in two scenarios:

Your data is already in a database.

You have so much data that it does not all fit into memory simultaneously and you need to use some external storage engine.

(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating).

This is not completely accurate as literally written.

I suggest this should be qualified with conditions such as the following:

  1. Dataset is relatively small (up to a million rows or less)
  2. It's only a "one-off" project where the overhead of creating the database takes longer than wait times
  3. It's a single user project, so data consistency and multi-user concurrency aren't issues

Or alternatively, make no mention of when a database should or should not be used, as this is a large subject that depends on too many variables to cover in an introduction.

`dplyr::arrange()` with the `.by_group=TRUE` parameter set produces SQL with an error

@CerebralMastication commented on Apr 15, 2018, 1:56 PM UTC:

This is a cross post from the RStats Community, where I initially posted it: https://community.rstudio.com/t/dplyr-arrange-by-group-true-fails-with-sql-backend/7232

I've either got a misunderstanding or a bug... I think it's a bug.

It seems that dplyr::arrange() with the .by_group=TRUE parameter set produces SQL with an error. Here's how to reprex it:

On the DB (Redshift in my case) set up a dummy table:

drop TABLE sandbox.testorder; 

CREATE TABLE sandbox.testorder (
    grp varchar(255),
    n  DOUBLE PRECISION
);

INSERT INTO sandbox.testorder (grp , n) VALUES ('a',3.3);
INSERT INTO sandbox.testorder (grp , n) VALUES ('a',1.1);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',2.2);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',4.4);

Then from R (presuming a connection to the DB called con and dbplyr already loaded):

testorder  <- tbl(con, "testorder")

testorder %>%
  group_by( grp ) %>%
  arrange( n, .by_group=TRUE) ->
out_test

show_query(out_test)

which generates the following SQL:

SELECT *
FROM "testorder"
ORDER BY "n", TRUE

which fails if I try to collect(out_test) with the following error:

Error in postgresqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  non-integer constant in ORDER BY

The rub seems to be the , TRUE at the end. If I remove it, I get the following runnable SQL:

SELECT *
FROM "testorder"
ORDER BY "n"

My guess is that the routine that generates the SQL has a glitch. Looks like it's just passing , TRUE instead of adding in the group by variables.
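
Until that is fixed, a possible workaround (a sketch) is to list the grouping column explicitly instead of relying on .by_group:

testorder %>%
  group_by(grp) %>%
  arrange(grp, n) %>%
  show_query()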

This issue was moved by krlmlr from tidyverse/dplyr/issues/3515.

Pass through DB functions does not seem to work in select()

Here is a reprex using SQLite's date command:

library(odbc)
library(DBI)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

dates_tbl <- copy_to(con, mtcars)

dates_tbl %>%
  mutate(b = date(mpg))
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.22.0 [:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb b    
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 -471~
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 -471~
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 -471~
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 -471~
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 -471~
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 -471~
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 -471~
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2 -471~
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 -471~
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 -471~
#> # ... with more rows

dates_tbl %>%
  select(b = date(mpg))
#> Error in date(mpg): unused argument (mpg)

dbDisconnect(con)

Created on 2018-08-20 by the reprex package (v0.2.0).
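
A possible workaround (a sketch) is to compute the column with mutate() or transmute(), which do translate (or pass through) function calls, and then keep only that column:

dates_tbl %>%
  transmute(b = date(mpg)) %>%
  show_query()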

Flexible solution for selecting translation for arbitrary connections

@hadley commented on Oct 23, 2017, 10:12 PM UTC:

Need to verify which connection methods actually get used (I think just escaping, and the rest is just S3 dispatch) and then provide some lightweight way for users to select their own translation layer.

Maybe something like:

con <- custom_translation(jdbcConnection, "Oracle")

with arguments to override default quoting for identifiers and strings?

All methods would need to redispatch to wrapped object.

This issue was moved by hadley from tidyverse/dplyr/issues/3160.

as.integer64 compatibility for SQL Server

@jnolis commented on Jan 16, 2018, 10:22 PM UTC:

Currently, when converting from R to a SQL Server query, d[b]plyr incorrectly handles as.integer64(). This is the function I would expect to use if I want to cast to a bigint within SQL Server. What ends up happening is that dplyr emits "AS.INTEGER64" as a SQL command.

  1. In SQL Server, create an example table:
CREATE TABLE Example (Number int)
  2. In R using dbplyr, try to query the table after casting Number as an integer64:
tbl(con,"Example") %>%
  mutate(Number64 = as.integer64(Number))
  3. This yields:
Error: <SQL> 'SELECT  TOP 10 "Number", AS.INTEGER64("Number") AS "Number64"
FROM "Example"'
  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Incorrect syntax near the keyword 'AS'.

This wouldn't happen if the generated SQL query were instead the correct:

SELECT TOP 10 Number, CAST(Number AS BIGINT) AS NUMBER64
FROM EXAMPLE
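
A possible interim workaround (a sketch, relying on sql() literals being passed through untranslated) is to write the cast as raw SQL:

tbl(con, "Example") %>%
  mutate(Number64 = sql("CAST(Number AS BIGINT)"))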

This issue was moved by krlmlr from tidyverse/dplyr/issues/3305.

pmin and pmax fail in mutate connected to SQL Server

@jnolis commented on Nov 28, 2017, 6:36 PM UTC:

Occasionally, I will need to use pmin or pmax to replace values above or below a threshold, for instance:

require(dplyr)
data_frame(X = 1:10) %>%
  mutate(Y = pmax(X,5))

However, when I use dbplyr to try and do this using a SQL Server table, I get the following error:

tbl(con,"OrderInfo") %>%
  mutate(Test = pmax(Revenue,0))
Error: <SQL> 'SELECT  TOP 10 "CustomerId", "OrderId", "ClientOrderId", "Channel", "OrderDate", "Revenue", "Cost", MAX("Revenue", 0.0) AS "Test"
FROM "OrderInfo"'
  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]The MAX function requires 1 argument(s). 

As you can see, the SQL code generator assumes that I want to do a pairwise MAX; however, SQL Server does not support that (it only takes the maximum over a single column).

A cursory Stack Overflow search suggests that a CASE statement could be used for this operation. I think the SQL code generator should either be updated to do this, or an error should be raised when mutate() is called.
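
In the meantime, a possible workaround (a sketch) is to express the pairwise maximum as a conditional, which should translate to a conditional expression on SQL Server:

tbl(con, "OrderInfo") %>%
  mutate(Test = if_else(Revenue > 0, Revenue, 0))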

This issue was moved by krlmlr from tidyverse/dplyr/issues/3227.

Minimise peak memory usage when doing copy_to from one database to another

@mungojam commented on Oct 24, 2017, 7:28 PM UTC:

copy_to() now supports a src_sql source as well as destination, but currently does this via a tibble. This makes it performant only for datasets that fit into available RAM.

It would be good if some sort of streaming or batching were implemented so that the peak RAM usage is much smaller. This is so that we can use dplyr to do an initial large extract from a large remote table into a local SQLite database (for example).

Perhaps this fits better as a DBI feature but it's nice to be able to do a custom select on the source table using dplyr.

library(dplyr, warn.conflicts = FALSE)

src1 <- src_sqlite("fileDb1.db", create = TRUE)
src2 <- src_sqlite("fileDb2.db", create = TRUE)

# Create from dataframe
iris_1 <- copy_to(src1, iris, "iris1")

# Create from a tbl in another database; currently goes via an in-memory tibble containing the full dataset
db2 <- copy_to(src2, iris_1, 'data2', temporary = FALSE)

This issue was moved by hadley from tidyverse/dplyr/issues/3163.

Explore poor query performance

@hadley commented on Sep 8, 2017, 9:19 PM UTC:

library(dplyr)
library(RPostgreSQL) 
myRedshift <- src_postgres(Sys.getenv("DBNAME"), 
                           host = Sys.getenv("JDBCURL"),
                           port = Sys.getenv("PORT"),
                           user = Sys.getenv("USER"), 
                           password = Sys.getenv("PW"))
experiment_id <- '455145790a164522d1f4f47f2df40c76728c84dd'  
date_start <- '2017-06-08'  
date_end <- '2017-06-15' 
experiment_data <- tbl(myRedshift, sql("SELECT *, DATE_TRUNC('day', ts)::date AS dt FROM p_experiments WHERE name LIKE '%variation::select%'")) %>%
  dplyr::filter(experimentid == experiment_id &
                  ts >= date_start &
                  ts < date_end &
                  !is.na(userid) &
                  variation != 'unknown')
cohorts <- experiment_data %>%
  group_by(userid, variation) %>%
  summarise(enrolment_date = min(dt))
reading_time <- tbl(myRedshift, "etl_reading") %>%
  dplyr::filter(createdate >= date_start &
                  createdate < date_end &
                  time_spent >= 0) %>%
  inner_join(cohorts, by = "userid") %>%
  dplyr::filter(createdate >= enrolment_date) %>%
  group_by(variation, userid) %>%
  summarise(rt = sum(time_spent)) %>%
  ungroup()
reading_time_all_allocated <- cohorts %>%
  left_join(reading_time, by = c("userid", "variation")) %>%
  collect(n = Inf)

produces this SQL (via explain()):

SELECT "TBL_LEFT"."userid" AS "userid", "TBL_LEFT"."variation" AS "variation", "TBL_LEFT"."enrolment_date" AS "enrolment_date", "TBL_RIGHT"."rt" AS "rt"
  FROM (SELECT "userid", "variation", MIN("dt") AS "enrolment_date"
FROM (SELECT *
FROM (SELECT *, DATE_TRUNC('day', ts)::date AS dt FROM p_experiments WHERE name LIKE '%variation::select%') "ihznadkpcf"
WHERE ("experimentid" = '455145790a164522d1f4f47f2df40c76728c84dd' AND "ts" >= '2017-06-08' AND "ts" < '2017-06-15' AND NOT((("userid") IS NULL)) AND "variation" != 'unknown')) "biptkfpowo"
GROUP BY "userid", "variation") "TBL_LEFT"
  LEFT JOIN (SELECT "variation", "userid", SUM("time_spent") AS "rt"
FROM (SELECT *
FROM (SELECT "TBL_LEFT"."platform" AS "platform", "TBL_LEFT"."uuid" AS "uuid", "TBL_LEFT"."userid" AS "userid", "TBL_LEFT"."storyid" AS "storyid", "TBL_LEFT"."partid" AS "partid", "TBL_LEFT"."createdate" AS "createdate", "TBL_LEFT"."read_percent" AS "read_percent", "TBL_LEFT"."time_spent" AS "time_spent", "TBL_RIGHT"."variation" AS "variation", "TBL_RIGHT"."enrolment_date" AS "enrolment_date"
  FROM (SELECT *
FROM "etl_reading"
WHERE ("createdate" >= '2017-06-08' AND "createdate" < '2017-06-15' AND "time_spent" >= 0.0)) "TBL_LEFT"
  INNER JOIN (SELECT "userid", "variation", MIN("dt") AS "enrolment_date"
FROM (SELECT *
FROM (SELECT *, DATE_TRUNC('day', ts)::date AS dt FROM p_experiments WHERE name LIKE '%variation::select%') "bvqcfsxath"
WHERE ("experimentid" = '455145790a164522d1f4f47f2df40c76728c84dd' AND "ts" >= '2017-06-08' AND "ts" < '2017-06-15' AND NOT((("userid") IS NULL)) AND "variation" != 'unknown')) "nnwriawvlp"
GROUP BY "userid", "variation") "TBL_RIGHT"
  ON ("TBL_LEFT"."userid" = "TBL_RIGHT"."userid")
) "xmtbxcquyw"
WHERE ("createdate" >= "enrolment_date")) "ostngvabdq"
GROUP BY "variation", "userid") "TBL_RIGHT"
  ON ("TBL_LEFT"."userid" = "TBL_RIGHT"."userid" AND "TBL_LEFT"."variation" = "TBL_RIGHT"."variation")

Comparing this to the actual SQL I'd write to get these results:

reading_time_all_allocated <- dbGetQuery(con, glue(
"WITH cohorts AS (
  SELECT 
    p.variation, 
    p.userid,
    DATE(MIN(p.ts)) AS enrolment_date
  FROM p_experiments p
  WHERE 
    p.experimentid = '{experiment_id}' 
    AND p.name LIKE '%variation::select%'
    AND p.userid IS NOT NULL 
    AND p.ts BETWEEN '{date_start}' AND '{date_end}'
    AND p.variation != 'unknown'
  GROUP BY p.variation, p.userid),
reading_time AS (
  SELECT 
    c.userid, 
    c.variation,
    COALESCE(SUM(er.time_spent), 0) AS rt
  FROM etl_reading er
  JOIN cohorts c ON er.userid = c.userid
  WHERE 
    er.createdate >= c.enrolment_date
    AND er.createdate BETWEEN '{date_start}' AND ('{date_end}' - INTEGER '1')
    AND er.time_spent >= 0
  GROUP BY c.userid, c.variation)
SELECT 
  c.userid, 
  c.variation, 
  COALESCE(r.rt, 0) AS rt 
FROM cohorts c
  LEFT OUTER JOIN reading_time r ON c.userid = r.userid AND c.variation = r.variation",
    experiment_id = experiment_id,
    date_start = date_start,
    date_end = date_end))

The poor performance is probably due to the extra subqueries inside the joins.
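
One way to avoid the repeated subquery (not part of the original report, just a sketch) is to materialise the intermediate result with compute(), which stores it in a temporary table so the generated SQL can reference that table instead of inlining the whole subquery into both joins:

# materialise cohorts once before it is used by both joins
cohorts <- experiment_data %>%
  group_by(userid, variation) %>%
  summarise(enrolment_date = min(dt)) %>%
  compute()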

This issue was moved by hadley from tidyverse/dplyr/issues/3091.

Reduce time to error when collecting results of SQL query with duplicated column names

@bkkkk commented on Dec 27, 2017, 8:07 AM UTC:

When collecting the results of an SQL query with duplicated column names, it would be nice to get the error as soon as possible, so you don't waste time rerunning long queries after fixing the column names.

I will often write and test queries in DataGrip or another SQL client and then move them into R. In the SQL client duplicate column names are fine, but for good reason dplyr doesn't allow them, erroring with:

Error: Columns [COLUMN NAMES] must have unique names

and then discarding the result.

sql("SELECT
   main.id,
   main.phone,
   others.phone
FROM consumers main
INNER JOIN consumers others ON others.id = main.id") %>%
  tbl(src = my_source) %>%
  collect(n = Inf)

It would be nice to have dplyr automatically disambiguate the column names (e.g. phone_v1, phone_v2) and collect the results into a tibble, or failing that, to error immediately when the command is executed, if possible.
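
As a stop-gap (nothing dbplyr does for you, just a sketch), you can check the result's column names before running the full collect(), e.g. with DBI, assuming con is the underlying DBI connection (placeholder name):

q <- "SELECT main.id, main.phone, others.phone
      FROM consumers main
      INNER JOIN consumers others ON others.id = main.id"
res <- DBI::dbSendQuery(con, q)
nms <- DBI::dbColumnInfo(res)$name  # column names without fetching any rows
DBI::dbClearResult(res)
nms[duplicated(nms)]                # non-empty means collect() would fail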

This issue was moved by krlmlr from tidyverse/dplyr/issues/3262.

Repeated window function returns incorrect result

@sirallen commented on May 25, 2018, 4:46 PM UTC:

A dplyr database query appears to return incorrect results when the same window function is called twice.

The following query is meant to find up to the first four unique destinations (by calendar date/time) for each carrier, in the nycflights13::flights dataset:

library(dbplyr)
library(dplyr)
library(nycflights13)

localdb <- src_postgres(dbname = 'dbname',
                        host = 'localhost',
                        port = 5432,
                        user = 'user',
                        password = 'password')

copy_to(localdb, flights, "flights",
        temporary = FALSE, 
        indexes = list(
          c("year", "month", "day"), 
          "carrier", 
          "tailnum",
          "dest"
        )
)

# Has row_number() twice
incorrect <- tbl(localdb, 'flights') %>%
  group_by(carrier, dest) %>%
  arrange(carrier, dest, time_hour) %>%
  filter(row_number() == 1L) %>%
  group_by(carrier) %>%
  arrange(carrier, time_hour) %>%
  filter(row_number() < 5L) %>%
  select(carrier, dest, time_hour) %>%
  collect()

head(incorrect, 10)
# A tibble: 10 x 3
# Groups:   carrier [4]
#    carrier  dest           time_hour
#      <chr> <chr>              <dttm>
# 1      9E   ATL 2013-01-03 22:00:00
# 2      9E   AUS 2013-02-01 08:00:00
# 3      9E   AVL 2013-04-14 02:00:00
# 4      9E   BGR 2013-10-17 14:00:00
# 5      AA   AUS 2013-01-01 09:00:00
# 6      AA   BOS 2013-01-01 00:00:00
# 7      AA   DFW 2012-12-31 22:00:00
# 8      AA   EGE 2013-01-01 09:00:00
# 9      AS   SEA 2012-12-31 23:00:00
#10      B6   ABQ 2013-04-22 09:00:00

# Has row_number() and min_rank() -- selects ties as well, so not exactly what is intended,
# but the results are ordered correctly compared to above
correct <- tbl(localdb, 'flights') %>%
  group_by(carrier, dest) %>%
  arrange(carrier, dest, time_hour) %>%
  filter(row_number() == 1L) %>%
  group_by(carrier) %>%
  arrange(carrier, time_hour) %>%
  filter(min_rank(time_hour) < 5L) %>%
  select(carrier, dest, time_hour) %>%
  collect()

head(correct, 10)
# A tibble: 10 x 3
# Groups:   carrier [2]
#   carrier  dest           time_hour
#     <chr> <chr>              <dttm>
# 1      9E   MSP 2013-01-01 00:00:00
# 2      9E   BUF 2013-01-01 06:00:00
# 3      9E   ROC 2013-01-01 07:00:00
# 4      9E   IAD 2013-01-01 07:00:00
# 5      9E   SYR 2013-01-01 07:00:00
# 6      9E   BWI 2013-01-01 07:00:00
# 7      9E   ORD 2013-01-01 07:00:00
# 8      9E   IND 2013-01-01 07:00:00
# 9      AA   MIA 2012-12-31 21:00:00
#10      AA   SJU 2012-12-31 22:00:00
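
A possible workaround (not from the original report, and untested here) is to force the first window-function filter to be materialised with compute() before the second one is applied:

workaround <- tbl(localdb, 'flights') %>%
  group_by(carrier, dest) %>%
  arrange(carrier, dest, time_hour) %>%
  filter(row_number() == 1L) %>%
  compute() %>%                      # materialise before the second row_number()
  group_by(carrier) %>%
  arrange(carrier, time_hour) %>%
  filter(row_number() < 5L) %>%
  select(carrier, dest, time_hour) %>%
  collect()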

This issue was moved by krlmlr from tidyverse/dplyr/issues/3602.

feature request: case_when() TRUE should translate to ELSE

@namarkus commented on Jun 20, 2018, 2:47 PM UTC:

The SQL translation of case_when() has no ELSE clause.

case_when() works just fine with data frames. Using it in a database environment, I'm missing the ability to define an ELSE clause:

translate_sql(case_when(
  height > 200 | mass > 200 ~ "large",
  species == "Droid"        ~ "robot",
  TRUE                      ~  "other"
))

translates to:
(...) WHEN (TRUE) THEN ('other') END
whereas it should translate to:
(...) ELSE ('other') END

This issue was moved by batpigandme from tidyverse/dplyr/issues/3688.

Oracle: copy_to with overwrite fails

@EdwardJRoss commented on Jan 17, 2018, 5:10 AM UTC:

I am trying to send a table to an Oracle database via ODBC, overwriting it if it already exists.
The copy_to method in Oracle via ODBC fails when overwrite=TRUE because "DROP TABLE IF EXISTS" isn't valid syntax in Oracle.

Example (without specific credentials), tested with Oracle 11.2:

con <- DBI::dbConnect(odbc::odbc(), ...)
dplyr::copy_to(con, iris, overwrite=TRUE)
#> Error: <SQL> 'DROP TABLE IF EXISTS iris'
#> nanodbc/nanodbc.cpp:1587: HY000: [Oracle][ODBC][Ora]ORA-00933: SQL command not properly ended
dplyr::copy_to(con, iris, overwrite=TRUE)
#> Error in connection_begin(conn@ptr) : Double begin

Expected result:
The table iris is loaded to the database, being overwritten if it already exists.

Actual Result:
There is an error, and the connection can no longer be used (it seems the transaction isn't aborted on failure).

I also tried using ROracle but copy_to seems to fail when attempting to start a transaction:

con <- DBI::dbConnect(DBI::dbDriver("Oracle"), ...)
dplyr::copy_to(con, iris)
#> Error in (function (classes, fdef, mtable)  : 
#>  unable to find an inherited method for function 'dbBegin' for signature '"OraConnection"'
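
A possible interim workaround (a sketch, not tested against Oracle here) is to skip overwrite = TRUE and drop the table yourself with DBI on a fresh connection:

# avoid dbplyr's DROP TABLE IF EXISTS by removing the table explicitly first
if (DBI::dbExistsTable(con, "iris")) {
  DBI::dbRemoveTable(con, "iris")
}
dplyr::copy_to(con, iris, "iris", temporary = FALSE)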

This issue was moved by krlmlr from tidyverse/dplyr/issues/3306.

"Type DOUBLE does not exist" in Vertica database

@ablack3 commented on Dec 1, 2017, 4:48 PM UTC:

I have an ODBC connection to a Vertica database and am using dplyr to query it.

Vertica has types DOUBLE PRECISION and NUMERIC but does not have type DOUBLE.

When I try to convert a variable of type CHARACTER to NUMERIC dplyr generates incorrect SQL code.

Unfortunately I cannot create a fully reproducible example because the issue involves a Vertica database connection.

con <- DBI::dbConnect(odbc::odbc(), dsn = "Vertica")
dbplyr::translate_sql(as.numeric(myVar), con = con)

CAST("myVar" AS DOUBLE)
This will not work since DOUBLE is not a valid type in Vertica.
DOUBLE PRECISION or NUMERIC should work.

I get the same issue when I use as.double()

dbplyr::translate_sql(as.double(myVar), con = con)

CAST("myVar" AS DOUBLE)

However, if I leave off the connection argument, the correct SQL is generated:

dbplyr::translate_sql(as.numeric(myVar))

CAST("myVar" AS NUMERIC)

How do I fix this issue and get started adding support for Vertica?
Thanks!

We’re working to add support for more databases over time, but adding support on your own is surprisingly easy. Submit an issue to dplyr and we’ll help you get started.
https://blog.rstudio.com/2017/06/27/dbplyr-1-1-0/
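
A rough sketch of what a Vertica variant could look like (the connection class name "Vertica" is an assumption, check class(con); the generic is sql_translation() in dbplyr >= 2.0 and sql_translate_env() in older versions):

sql_translation.Vertica <- function(con) {
  dbplyr::sql_variant(
    dbplyr::sql_translator(
      .parent    = dbplyr::base_odbc_scalar,
      as.numeric = dbplyr::sql_cast("NUMERIC"),          # CAST(x AS NUMERIC)
      as.double  = dbplyr::sql_cast("DOUBLE PRECISION")  # CAST(x AS DOUBLE PRECISION)
    ),
    dbplyr::base_odbc_agg,
    dbplyr::base_odbc_win
  )
}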

This issue was moved by batpigandme from tidyverse/dplyr/issues/3236.

copy_to causes PostgreSQL to be memory hungry with large dataframes

@chrnin commented on Feb 12, 2018, 3:51 PM UTC:

When I try to insert a fairly (but not hugely) large dataframe (that I got from a ~80k-row CSV file, around 140M of data), PostgreSQL exhausts memory (8G + 2G swap) because of one single large INSERT query containing all the data, causing my system to randomly kill processes with large memory consumption (rsession, postmaster, or anything else big).

The example below crashes on my computer with 8G of RAM; to reproduce on a bigger system, increase the range of a accordingly.

library(tibble)

tibble(
   a=1:400000,
   b="Just some boring data to make the dataset grow faster, ok.. That's pretty huge, but I have huge CSV files sometimes.",
   c="Actually it has to be somewhat massive, and I intend to copy that a huge amount of times, I'm sorry for that…",
   d="I wonder what I could say here in order to make this mildly interesting, so I'm gonna share my thoughts (as a r beginner)",
   e="I think PostgreSQL is struggling with query analysis, maybe it could be more efficient at that…",
   f="I also think that a solution would be to chunk the data frame into pieces in order to limit query size.",
   g="By passing multiple moderately sized queries (e.g. around 10mb size), the execution time loss would be moderate…",
   h="As a fair new user, I have no idea if this is the kind of things you intend to do with this function.",
   i="Maybe there would be something to do with the COPY statement, I think it is far less memory hungry than INSERT statement",
   j="But that would be something very PostgreSQL specific, and more somewhat DBI related, not a very good way to see that…",
   k="Feel free to reject this issue, I don't even know if it's dumb… I'm going to survive using something else for large files… Thanks for reading :)"
) %>% dplyr::copy_to(
   dest = my_beloved_postgresql_database,
   df=.,
   name='kamikaze_table',
   temporary=FALSE,
   overwrite=TRUE
)

edit: I'm using dplyr 0.7.4 on ubuntu 17.10.
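
One way to keep the INSERT from growing unboundedly (a sketch of a manual workaround, not something copy_to does for you) is to create the table from the first chunk and append the rest with DBI, assuming the tibble above is stored in df and con is the underlying DBI connection (placeholder names):

# write the data frame in ~10k-row chunks instead of one huge INSERT
chunks <- split(df, ceiling(seq_len(nrow(df)) / 10000))
DBI::dbWriteTable(con, "kamikaze_table", chunks[[1]], overwrite = TRUE)
for (chunk in chunks[-1]) {
  DBI::dbWriteTable(con, "kamikaze_table", chunk, append = TRUE)
}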

This issue was moved by krlmlr from tidyverse/dplyr/issues/3355.

Ignoring case with remote database table filter operations

@geotheory commented on Aug 28, 2018, 12:02 PM UTC:

I've detailed this question on this SO page.

I can't figure out how to apply a non-case-sensitive filter query to a remote PostgreSQL table using dplyr. To demonstrate:

require(dplyr)
require(stringr)
require(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="mydb", host="localhost", port=5432, user="username")

# create db table
copy_to(con, iris, "iris", temporary = FALSE)

# dplyr remote database table
iris_pg <- tbl(con, "iris")

iris_pg %>% filter(str_detect(Species, 'setosa')) %>% head(3) %>% collect()
# A tibble: 3 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
*        <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa

iris_pg %>% filter(str_detect(Species, 'Setosa')) %>% head(3) %>% collect()
# A tibble: 0 x 0

To ignore case, stringr::fixed('Setosa', ignore_case = TRUE) works when filtering a tibble, but with the Postgres table it has no effect:

iris_pg %>% filter(str_detect(Species, stringr::fixed('SETOSA', ignore_case=TRUE))) %>% head(3) %>% collect()
# A tibble: 0 x 0

Is there any way around this, or are there plans to support the ignore_case argument of fixed() or regex() for database table filtering?
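
One workaround (untested here) is to normalise case on the database side: tolower() translates to LOWER(), so matching against a lower-case pattern no longer depends on case:

iris_pg %>% filter(str_detect(tolower(Species), 'setosa')) %>% head(3) %>% collect()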

This issue was moved by romainfrancois from tidyverse/dplyr#3783.

Strange printing for postgres integer64

I noticed that I get weird results when I count a column with very different value counts. This is a super minor issue. Please don't feel compelled to address it!

The issue doesn't seem to happen with SQLite. I'm assuming it's something about printing the S3: Integer64 column.

Basic setup of a table to count:

con <- DBI::dbConnect(RPostgres::Postgres(),
  user = "my_user", dbname = "my_db")
df <- data.frame(x = c(rep(1, 1000), rep(2, 10)))
dplyr::copy_to(con, df, temporary = TRUE)
remote_df <- dplyr::tbl(con, "df")
dplyr::count(remote_df, x)

Result:

# Source:   lazy query [?? x 2]
# Database: postgres [[email protected]:5432/my_db]
      x n              
  <dbl> <S3: integer64>
1     2 "  10"         <- What's up with the extra spaces?
2     1 1000     

Expected result:

# Source:   lazy query [?? x 2]
# Database: postgres [[email protected]:5432/my_db]
      x n              
  <dbl> <S3: integer64>
1     2   10 
2     1 1000     

Add support for new src/tbl class in DB simulations

@edgararuiz commented on May 24, 2018, 4:00 PM UTC:

Adding support in the dbplyr::simulate_###() functions for the new src/tbl class that is being passed in the dev version of dbplyr will allow us to add tests for the customizations (ref: tidyverse/dbplyr@81419ca).

Using a PR's version of dbplyr, this should use MINUS instead of EXCEPT:

library(dplyr, warn.conflicts = FALSE)
packageVersion("dbplyr")
#> [1] '1.2.1.9001'
lf1 <- dbplyr::lazy_frame(x = 1:4, src = dbplyr::simulate_oracle())
lf2 <- dbplyr::lazy_frame(x = 1:4, src = dbplyr::simulate_oracle())
lf1 %>% setdiff(lf2) %>% show_query()
#> <SQL> (SELECT *
#> FROM `df`)
#> EXCEPT
#> (SELECT *
#> FROM `df`)

This issue was moved by krlmlr from tidyverse/dplyr/issues/3598.

MS SQL mutate(ifelse()) translation not working

I'm running into an issue when using mutate and ifelse (or if) on a SQL Server backend. The following sample query:

df <- tbl(con, in_schema("schema", "table_name")) %>%
   mutate(field = ifelse(field %in% c('test1', 'test2', 'test3'), 'string1', 'string2')) %>%
   collect()

... will give the following translation:

SELECT CASE WHEN (("field" IN ('test1', 'test2', 'test3')) = 'TRUE') THEN ('string1') 
            WHEN (("field" IN ('test1', 'test2', 'test3')) =  'FALSE') THEN ('string2') 
            END AS "field"

Which will return an error. The correct translation should be:

SELECT CASE WHEN ("field" IN ('test1', 'test2', 'test3')) THEN 'string1' 
            WHEN ("field" NOT IN ('test1', 'test2', 'test3')) THEN 'string2' 
            END AS "field"

Will correcting this in the mssql_if function create any issues in other queries, or is there a better way to do mutate(ifelse()) in mssql?

mssql_sql_if <- function(cond, if_true, if_false = NULL) {
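
Until the translation changes, one escape hatch (just a sketch, relying on sql() literals being passed through to the database untouched) is to hand over the CASE expression verbatim:

df <- tbl(con, in_schema("schema", "table_name")) %>%
  mutate(field = sql("CASE WHEN field IN ('test1', 'test2', 'test3')
                      THEN 'string1' ELSE 'string2' END")) %>%
  collect()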

Case_when translation for ORACLE does not work

@cderv commented on Apr 18, 2018, 5:20 PM UTC:

I work with an Oracle connection, so it's difficult to make a reprex, sorry. I'll just use the code as-is, as I think it is enough to see the issue in the SQL translation.

When using dplyr on an Oracle connection, I got an error with the following code:

tu_cc_pt %>%
  mutate(niv_u = case_when(
    PT_NIVEAU_TENSION == "90 KV" ~ "HT",
    PT_NIVEAU_TENSION == "63 KV" ~ "HT",
    TRUE ~ PT_NIVEAU_TENSION
  )) %>%
  collect()

The error message is

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000: ORA-00920: invalid relational operator 

A call to show_query instead of collect gives the following SQL

SELECT "ID_POSTE_TECHNIQUE", "PT_IDR", "PT_CUR", "PT_TYPE", "PT_STATUT_SEQ", "PT_STATUT_NON_SEQ", "PT_NIVEAU_TENSION", "PT_FILERIE_LIBELLE", "PT_TECHNOLOGIE", "PT_PLAN_PROTECTION", "PT_DT_PREM_MISE_SS_TENSION", "PT_CM", "PT_GMR", CASE
WHEN ("PT_NIVEAU_TENSION" = '90 KV') THEN ('HT')
WHEN ("PT_NIVEAU_TENSION" = '63 KV') THEN ('HT')
WHEN (TRUE) THEN ("PT_NIVEAU_TENSION")
END AS "niv_u"
FROM (DMS_SIDONI_DL_RO.DMBC_CC_PT)

This has been mentioned in #2928 but it doesn't seem to have been fixed.

I think the problem is the WHEN (TRUE), which is not valid in Oracle. It should be an ELSE:

SELECT "ID_POSTE_TECHNIQUE", "PT_IDR", "PT_CUR", "PT_TYPE", "PT_STATUT_SEQ", "PT_STATUT_NON_SEQ", "PT_NIVEAU_TENSION", "PT_FILERIE_LIBELLE", "PT_TECHNOLOGIE", "PT_PLAN_PROTECTION", "PT_DT_PREM_MISE_SS_TENSION", "PT_CM", "PT_GMR", CASE
WHEN ("PT_NIVEAU_TENSION" = '90 KV') THEN ('HT')
WHEN ("PT_NIVEAU_TENSION" = '63 KV') THEN ('HT')
ELSE ("PT_NIVEAU_TENSION")
END AS "niv_u"
FROM (DMS_SIDONI_DL_RO.DMBC_CC_PT)

What do you think?

If you need another, simpler example, please tell me and I will work on one.

This issue was moved by krlmlr from tidyverse/dplyr/issues/3521.

Using qualified functions inside filter or arrange results in 'object not found' error for SQL

@janusvm commented on Jun 25, 2018, 8:26 AM UTC:

For an SQL backend, using desc() or between() inside an arrange or filter, respectively, returns an error if the functions are qualified.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
lf <- dbplyr::lazy_frame(x = 1, y = 1, 
                         src = dbplyr::simulate_dbi())

# Works
lf %>% 
  arrange(desc(y)) %>% 
  show_query()
#> <SQL> SELECT *
#> FROM "df"
#> ORDER BY "y" DESC
lf %>% 
  filter(between(y, 0, 1)) %>% 
  show_query()
#> <SQL> SELECT *
#> FROM "df"
#> WHERE ("y" BETWEEN 0.0 AND 1.0)

# Doesn't work
lf %>% 
  dplyr::arrange(dplyr::desc(y)) %>% 
  dplyr::show_query()
#> Error in dplyr::desc(y): object 'y' not found
lf %>% 
  dplyr::filter(dplyr::between(y, 0, 1)) %>% 
  dplyr::show_query()
#> Error in dplyr::between(y, 0, 1): object 'y' not found

This issue was moved by batpigandme from tidyverse/dplyr/issues/3696.

dbplyr does not translate base R integer division %/%

@emilyriederer commented on Aug 27, 2017, 7:16 PM UTC:

Base R integer division (%/%) is translated to normal division (/) in SQL.

Apologies for the non-reprex; I couldn't think of a way around the fact that a user must establish a connection to get this to run. When run against an AWS Redshift DB, data$z contains the values 0, 1, and 2 while data_db$z contains floating-point values.

The cause is that %/% is translated as / without accounting for integer division. One correct SQL translation (at least for Redshift) would be CAST("x"/5 AS INTEGER) or FLOOR("x"/5).

library(dplyr)

data <- data.frame(x=as.integer(1:10))
data_db <- copy_to(con, data, "data_db", temporary = FALSE)

data <- mutate(data, z = x %/% 5)
data_db <- mutate(data_db, z = x %/% 5)

data
collect(data_db)

show_query(data_db)
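
Until %/% gets a dedicated translation, a workaround (a sketch) is to spell out the floor division, since floor() and as.integer() do translate:

# generates CAST(FLOOR("x" / 5.0) AS INTEGER) or equivalent
data_db <- mutate(data_db, z = as.integer(floor(x / 5)))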

This issue was moved by krlmlr from tidyverse/dplyr/issues/3057.

first() doesn't work in summaries

The first() function in dbplyr returns whatever expression it's given as a string, instead of returning the first value (which should respect a previous group_by() and arrange()).

It was discovered in this Stack Overflow question. That table (and the version I reproduced locally) is PostgreSQL, but the problem can be reproduced in SQLite:

library(dplyr)

batting <- tbl(dbplyr::lahman_sqlite(), "Batting")

batting %>%
  arrange(yearID) %>%
  group_by(playerID) %>%
  summarize(first = first(G))

Returns:

# Source:   lazy query [?? x 2]
# Database: sqlite 3.22.0 [/var/folders/8p/xzrrqphx2qb3d2s_fgqrk5xr0000gn/T//Rtmpe0JeOa/lahman.sqlite]
   playerID  first
   <chr>     <chr>
 1 aardsda01 `G`  
 2 aaronha01 `G`  
 3 aaronto01 `G`  
 4 aasedo01  `G`  
 5 abadan01  `G`  
 6 abadfe01  `G`  
 7 abadijo01 `G`  
 8 abbated01 `G`  
 9 abbeybe01 `G`  
10 abbeych01 `G`  
# ... with more rows

This matches with the result of show_query() on the above.

<SQL>
SELECT `playerID`, '`G`' AS `first`
FROM (SELECT *
FROM `Batting`
ORDER BY `yearID`)
GROUP BY `playerID`

It behaves similarly:

  • If it's given first(blabla), or another expression that's not a column
  • If it's used without an arrange() or a group_by() (it can thus be reproduced with just tbl(dbplyr::lahman_sqlite(), "Batting") %>% summarize(first(G)))
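
A workaround (a sketch, not from the original report) is to express "first value per group" with a window function instead of first():

tbl(dbplyr::lahman_sqlite(), "Batting") %>%
  group_by(playerID) %>%
  arrange(yearID) %>%                 # window ordering within each player
  filter(row_number() == 1L) %>%      # keep the earliest row per player
  select(playerID, first = G)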

