ianmcook / tidyquery Goto Github PK

View Code? Open in Web Editor NEW

167.0 5.0 12.0 491 KB

Query R data frames with SQL

License: Apache License 2.0

R 100.00%

r sql tidyverse dplyr query

tidyquery's Issues

Export version number for queryparser

To enable ianmcook/queryparser#32, export the version number of tidyquery

Support data.table objects through dtplyr

dtplyr is a data.table backend for dplyr. Add support for the dtplyr_step objects that dtplyr creates.

dplyr 1.0.8, rlang 1.0.0

We're in the process of releasing dplyr 1.0.8, rlang 1.0.0 and it appears that tidyquery fails against them. I believe it's more related to rlang than dplyr though.

── After ─────────────────────────────────────────────────────────────────────────────────────────────────────
> checking tests ... ERROR
  See below...

── Test failures ─────────────────────────────────────────────────────────────────────────────── testthat ────

> library(testthat)
> library(tidyquery)
> 
> test_check("tidyquery")
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
══ Skipped tests ═══════════════════════════════════════════════════════════════
• currently returns columns in wrong order (1)

══ Failed tests ════════════════════════════════════════════════════════════════
── Error (test-dtplyr.R:17:3): Full example #1 returns expected result on dtplyr_step ──
Error in `step_subset(parent, i = i)`: is.null(i) || is_expression(i) || is_step(i) is not TRUE
Backtrace:
     ▆
  1. ├─testthat::expect_equal(...) at test-dtplyr.R:17:2
  2. │ └─testthat::quasi_label(enquo(object), label, arg = "object")
  3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
  4. ├─... %>% as.data.frame()
  5. ├─base::as.data.frame(.)
  6. ├─tidyquery::query("SELECT origin, dest,\n          COUNT(flight) AS num_flts,\n          round(AVG(distance)) AS dist,\n          round(AVG(arr_delay)) AS avg_delay\n          FROM flights_dt\n        WHERE distance BETWEEN 200 AND 300\n          AND air_time IS NOT NULL\n        GROUP BY origin, dest\n        HAVING num_flts > 3000\n        ORDER BY num_flts DESC, avg_delay DESC\n        LIMIT 100;")
  7. │ └─tidyquery:::query_(data, sql, TRUE)
  8. │   └─out %>% verb(filter, !!(tree$having[[1]]))
  9. ├─tidyquery:::verb(., filter, !!(tree$having[[1]]))
 10. │ └─input$data %>% fun(...)
 11. ├─dplyr fun(., ...)
 12. └─dtplyr:::filter.dtplyr_step(., ...)
 13.   └─dtplyr:::step_subset_i(.data, i)
 14.     └─dtplyr:::step_subset(parent, i = i)
 15.       └─base::stopifnot(is.null(i) || is_expression(i) || is_step(i))
── Failure (test-errors.R:96:3): query() fails on two very long expressions with no aliases ──
`query("SELECT 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1, 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+2 FROM games")` did not throw the expected error.
Backtrace:
    ▆
 1. └─testthat::expect_error(...) at test-errors.R:96:2
 2.   └─testthat:::expect_condition_matching(...)

[ FAIL 2 | WARN 0 | SKIP 1 | PASS 216 ]
Error: Test failures
Execution halted

1 error x | 0 warnings ✓ | 0 notes ✓

query() fails when FROM clause refers to object in non-global calling environment

When the query() function is called inside a function, it fails when the FROM clause refers to a data frame that exists in that function's environment but not in the global environment. For example:

foo <- function(dat, sql) {
    query(sql)
}
foo(iris, "SELECT Species, COUNT(*) FROM dat GROUP BY Species")
# Error: No data frame exists with the name dat

Do not truncate function arguments in output of `show_dplyr()`

~~This is now enabled by r-lib/rlang#897~~

Add support for fields with spaces in their names when using FROM clause

I inherited a bunch of SQL with spaces in field names. Does tidyquery support field names that have spaces like brand and model in this reprex:

library(tidyverse)
library(tidyquery)
mtcars_tibble <- mtcars %>% rownames_to_column(var = 'brand and model') %>%
    as_tibble()
sql <- "select mt.`brand and model`, mt.mpg, mt.cyl from mtcars_tibble as mt"
tidyquery::query(sql)
#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls
#> Warning in readChar(rc, 1L, useBytes = TRUE): truncating string with embedded
#> nuls

#> Warning in readChar(rc, 1L, useBytes = TRUE): truncating string with embedded
#> nuls
#> Warning in readChar(rc_in, 1L): truncating string with embedded nuls
#> Error in str2lang(unqualified_column_name): <text>:1:7: unexpected symbol
#> 1: brand and
#>           ^

^{Created on 2020-08-11 by the reprex package (v0.3.0.9001)}

Simplify dplyr code where possible

The dplyr code generated by tidyquery is unnecessarily lengthy and complicated in some cases, like when the SQL query includes a GROUP BY clause and aggregate functions. This is because the generated dplyr code is optimized for maximum compatibility, not simplicity. To work in edge cases, tidyquery chains together sequences of dplyr verbs that are often longer than necessary.

Optimize the generated dplyr code by detecting cases where the code can be simpler.

Support Arrow Table, RecordBatch, and Dataset objects

Support Apache Arrow Table and RecordBatch objects (which inherit class ArrowTabular) and Dataset objects (which inherit class Dataset and ArrowObject)

Work around problems involving typed NAs

In SQL, NULLs are not typed. But in R, NAs are typed. By default, NA is logical, and there are other variants NA_real_, NA_character_, and NA_integer_. This can cause problems when SQL queries contain CASE expressions or calls to the coalesce() function, because the dplyr functions that these are translated to (case_when() and coalesce() respectively) require that all the possible returned values inside these function calls have the same data type. queryparser has no way of knowing what data types these returned values will have at parse time, so it cannot translate to the appropriate NA type, so it just uses logical NA. So in tidyquery, if the other returned values are numeric, character, or integer, but one or more is NA, then errors like this result:

#Error: must be a double vector, not a logical vector

Consider if there is any way to resolve this by examining the expressions in tidyquery and replacing logical NAs with NA_real_, NA_character_, ore NA_integer_ depending on the types of the other values.

In the meantime, the workaround for users is to cast NULLs to the expected data types in the THEN and ELSE values of CASE expressions and in the arguments to coalesce() in the SQL. For example:

CASE WHEN name = 'Bobby' THEN 'Bob' ELSE CAST(NULL AS STRING) END

coalesce(fname, lname, CAST(NULL AS STRING))

Use new dplyr `join_by()` syntax to represent join conditions

dplyr 1.1.0 introduces a new, more flexible syntax for specifying join conditions. Use this in tidyquery.

Add function show_dplyr()

Add a function show_dplyr() that prints the resulting dplyr code instead of running it.

Add substring_index() function

Add MySQL substring_index() function, https://www.w3schools.com/sql/func_mysql_substring_index.asp.

Here are tidyverse and base versions of the function:

# tidyverse version
substring_index_tidyverse <- function(string, delimiter, number) {
	string_parts <- str_split(string, coll(delimiter), simplify = FALSE) %>% unlist()
	n_parts <- length(string_parts)
	if(number > 0) {
		number <- min(c(number, n_parts))
		return(string_parts[1:number] %>%
			   	str_flatten(delimiter))
	}
	if(number < 0) {
		number <- max(c(number, -n_parts))
		return(string_parts[(n_parts + number + 1):n_parts] %>%
			str_flatten(delimiter))
	}
}

substring_index_tidyverse('www.w3schools.com', '.', 2)

# base version
substring_index_base <- function(string, delimiter, number) {
	string_parts <- strsplit(string, delimiter, fixed = TRUE) %>% unlist()
	n_parts <- length(string_parts)
	if(number > 0) {
		number <- min(c(number, n_parts))
		return(string_parts[1:number] %>%
			   	paste(., collapse = delimiter))
	}
	if(number < 0) {
		number <- max(c(number, -n_parts))
		return(string_parts[(n_parts + number + 1):n_parts] %>%
			   	paste(., collapse = delimiter))
	}
}

substring_index_base('www.w3schools.com', '.', 2)

I tried adding to queryparser's translations.R file, but I'm running into a "could not find function" error with tidyquery::query(). I'm not sure where changes need to be made in tidyquery and queryparser to make it work.

Support for Subqueries and OVER clauses

I was trying to use the function show_dplyr but I got the following error message:

Error: Subqueries are not supported
Error: OVER clauses are not supported
Error: Joins of three or more tables are unsupported

Is there a any plans to update it in the near future to add these support?

Thank you

Improve support for outer join queries

Use the new keep argument to full_join() in tidyverse/dplyr#4589 to preserve the join key columns from both the left and right tables, allowing qualified references to these columns in other clauses. I'm hoping this keep argument is added to left_join() and right_join() as well.

Migrate CI to GitHub Actions

Needed for #27

Allow `namespace::object` in `FROM` clause

This stopped working after #12

Cross join not working .

cross join does not work in query() functions
Syntax like this does not work. df1 and df2 are two simple dataframes .
query (' select * from df1 cross join df2 where df1.co1 = df2.col2 ')

Fix failures when dplyr shortens expression in column name with "..."

When dplyr processes a very long column expression in a transmute() or mutate() and there is no name given for the resulting column, then instead of using the full expression as the name of the column, it sometimes shortens the expression using .... For example:

games %>%
  transmute(stringr::str_c(name, ' is for players age ', as.character(min_age), ' or older')) %>%
  colnames()
#[1] "stringr::str_c(...)"

This causes errors in the current version of tidyquery, because internally it assumes that this name shortening does not occur. For example:

query("SELECT concat(name, ' is for players age ', cast(min_age AS STRING), ' or older') FROM games;")
# Error: object 'stringr::str_c(name, " is for players age ", as.character(min_age), ' not found

Fix this by using the column names that come out of the transmute() to change the names in tree$select if they don't match.

Failing test with next version of dtplyr

This test now fails:

  expect_equal(
    query(
      "SELECT origin, dest,
          COUNT(flight) AS num_flts,
          round(AVG(distance)) AS dist,
          round(AVG(arr_delay)) AS avg_delay
          FROM flights_dt
        WHERE distance BETWEEN 200 AND 300
          AND air_time IS NOT NULL
        GROUP BY origin, dest
        HAVING num_flts > 3000
        ORDER BY num_flts DESC, avg_delay DESC
        LIMIT 100;"
    ),
    flights_dt %>%
      filter(between(distance,200,300) & !is.na(air_time)) %>%
      group_by(origin, dest) %>%
      filter(sum(!is.na(flight)) > 3000) %>%
      summarise(
        num_flts = sum(!is.na(flight)),
        dist = round(mean(distance, na.rm = TRUE)),
        avg_delay = round(mean(arr_delay, na.rm = TRUE))
      ) %>%
      ungroup() %>%
      arrange(desc(num_flts), desc(avg_delay)) %>%
      head(100L)
  )

Unfortunately you don't get a particularly informative error (even with local_edition(3)) because the pipeline is rather deep. However, I think this is the key difference:

actual$parent$parent$parent$parent$parent$i vs expected$parent$parent$parent$parent$parent$i
- `\`_DT3\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`
+ `\`_DT4\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`

i.e. expected is generating one additional intermediate data table name than expected — this is probably due to the new grouped filter behaviour. Indeed, if I remove filter(sum(!is.na(flight)) > 3000) and HAVING num_flts > 3000 the test passes

Add support for cross joins

After the fix to tidyverse/dplyr#4206 is on CRAN, use it to enable cross joins by specifying by = character().

Use tidyselect functions to implement query() more cleanly

Look into using functions from the tidyselect package including poke_vars(), peek_vars(), scoped_vars(), with_vars(), and has_vars() to implement some of the logic around aliases in query() more cleanly. These functions could help avoid some of the calls to deparse() or as.character().

Support SQL-92-style join queries

queryparser now supports queries that use SQL-92-style (explicit) join syntax (ianmcook/queryparser#20). Add support for this to tidyquery.

Limit the initial implementation in tidyquery to two-table joins; implement joins of three or more tables later. With three or more tables, there are major challenges caused by differences in how SQL engines and dplyr perform joins: SQL engines process joins by looking at all the source tables at once, whereas dplyr can only process one join at a time, and in each successive join, the left table is all the tables previously joined. dplyr does not have any built-in mechanism for tracking which columns in a join result came from which tables, and in joins of three or more tables, tidyquery would need to keep track of that to ensure that column references in the join conditions and in other clauses point to the correct columns in the left table.

The dplyr join functions coalesce the join key columns from the left and right tables and can return only these coalesced values, not the separate join key columns from the left and right tables. For the initial implementation in tidyquery, return only the coalesced join key column(s) and disallow qualified references to the join key column(s) from the left and/or right table in outer joins, to prevent users from thinking they're running a query to check for unmatched rows. In a later version of tidyquery, before the join, use mutate() to add new columns to the left and right tables containing the join keys, then join by those columns and remove them after the join; this would preserve the separate join key columns from the left and right tables.

Fix bugs that occur with very long expressions

R expressions greater than 500 characters in length cause errors because 500 characters is the value used (and the largest value allowed by R) for the width.cutoff argument to the deparse() function. Figure out what part of this bug needs to be fixed in tidyquery versus in queryparser.

Support disk.frame objects

See DiskFrame/disk.frame#196

ianmcook / tidyquery Goto Github PK

tidyquery's Issues

Recommend Projects

Recommend Topics

Recommend Org