ianmcook / tidyquery Goto Github PK

View Code? Open in Web Editor NEW

167.0 5.0 12.0 491 KB

Query R data frames with SQL

License: Apache License 2.0

R 100.00%

r sql tidyverse dplyr query

tidyquery's Introduction

tidyquery

tidyquery runs SQL queries on R data frames.

It uses queryparser to translate SQL queries into R expressions, then it uses dplyr to evaluate these expressions and return results. tidyquery does not load data frames into a database; it queries them in place.

For an introduction to tidyquery and queryparser, watch the recording of the talk “Bridging the Gap between SQL and R” from rstudio::conf(2020).

Installation

Install the released version of tidyquery from CRAN with:

install.packages("tidyquery")

Or install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("ianmcook/tidyquery")

Usage

tidyquery exports two functions: query() and show_dplyr().

Using `query()`

To run a SQL query on an R data frame, call the function query(), passing a SELECT statement enclosed in quotes as the first argument. The table names in the FROM clause should match the names of data frames in your current R session:

library(tidyquery)
library(nycflights13)

query(
" SELECT origin, dest,
    COUNT(flight) AS num_flts,
    round(SUM(seats)) AS num_seats,
    round(AVG(arr_delay)) AS avg_delay
  FROM flights f LEFT OUTER JOIN planes p
    ON f.tailnum = p.tailnum
  WHERE distance BETWEEN 200 AND 300
    AND air_time IS NOT NULL
  GROUP BY origin, dest
  HAVING num_flts > 3000
  ORDER BY num_seats DESC, avg_delay ASC
  LIMIT 2;"
)
#> # A tibble: 2 × 5
#>   origin dest  num_flts num_seats avg_delay
#>   <chr>  <chr>    <int>     <dbl>     <dbl>
#> 1 LGA    DCA       4468    712643         6
#> 2 EWR    BOS       5247    611192         5

Alternatively, for single-table queries, you can pass a data frame as the first argument and a SELECT statement as the second argument, omitting the FROM clause. This allows query() to function like a dplyr verb:

library(dplyr)

airports %>%
  query("SELECT name, lat, lon ORDER BY lat DESC LIMIT 5")
#> # A tibble: 5 × 3
#>   name                                         lat    lon
#>   <chr>                                      <dbl>  <dbl>
#> 1 Dillant Hopkins Airport                     72.3   42.9
#> 2 Wiley Post Will Rogers Mem                  71.3 -157. 
#> 3 Wainwright Airport                          70.6 -160. 
#> 4 Wainwright As                               70.6 -160. 
#> 5 Atqasuk Edward Burnell Sr Memorial Airport  70.5 -157.

You can chain dplyr verbs before and after query():

planes %>%
  filter(engine == "Turbo-fan") %>%
  query("SELECT manufacturer AS maker, COUNT(*) AS num_planes GROUP BY maker") %>%
  arrange(desc(num_planes)) %>%
  head(5)
#> # A tibble: 5 × 2
#>   maker            num_planes
#>   <chr>                 <int>
#> 1 BOEING                 1276
#> 2 BOMBARDIER INC          368
#> 3 AIRBUS                  331
#> 4 EMBRAER                 298
#> 5 AIRBUS INDUSTRIE        270

In the SELECT statement, the names of data frames and columns are case-sensitive (like in R) but keywords and function names are case-insensitive (like in SQL).

In addition to R data frames and tibbles (tbl_df objects), query() can be used to query other data frame-like objects, including:

dtplyr_step objects created with dtplyr, a data.table backend for dplyr
Apache Arrow Table, RecordBatch, Dataset, and arrow_dplyr_query objects created with arrow
tbl_sql objects created with dbplyr or a dbplyr backend package, enabling you to write SQL which is translated to dplyr then translated back to SQL and run in a database (a fun party trick!)

Using `show_dplyr()`

tidyquery works by generating dplyr code. To print the dplyr code instead of running it, use show_dplyr():

show_dplyr(
" SELECT manufacturer, 
    COUNT(*) AS num_planes
  FROM planes
  WHERE engine = 'Turbo-fan'
  GROUP BY manufacturer
  ORDER BY num_planes DESC;"
)
#> planes %>%
#>   filter(engine == "Turbo-fan") %>%
#>   group_by(manufacturer) %>%
#>   summarise(num_planes = dplyr::n()) %>%
#>   ungroup() %>%
#>   arrange(dplyr::desc(num_planes))

Current Limitations

tidyquery is subject to the current limitations of the queryparser package. Please see the Current Limitations section of the queryparser README on CRAN or GitHub.

tidyquery also has the following additional limitations:

Joins involving three or more tables are not supported.
Because joins in dplyr currently work in a fundamentally different way than joins in SQL, some other types of join queries are not supported. Examples of unsupported join queries include non-equijoin queries and outer join queries with qualified references to the join column(s). Planned changes in dplyr will enable future versions of tidyquery to support more types of joins.
In the code printed by show_dplyr(), calls to functions with more than five arguments might be truncated, with arguments after the fifth replaced with ...

Related Work

The sqldf package (CRAN, GitHub) runs SQL queries on R data frames by transparently setting up a database, loading data from R data frames into the database, running SQL queries in the database, and returning results as R data frames.

The duckdb package (CRAN, GitHub) includes the function duckdb_register() which registers an R data frame as a virtual table in a DuckDB database, enabling you to run SQL queries on the data frame with DBI::dbGetQuery().

The dbplyr package (CRAN, GitHub) is like tidyquery in reverse: it converts dplyr code into SQL, allowing you to use dplyr to work with data in a database.

In Python, the dataframe_sql package (targeting pandas) and the sql_to_ibis package (targeting Ibis) are analogous to tidyquery.

tidyquery's People

Contributors

Stargazers

Watchers

Forkers

sridhar0605 r-forks-to-learn han-tun stevenhibble nanaakwasiabayieboateng harshalrepo eeyem rmasiniexpert romainfrancois jfontestad eitsupi lionel-

tidyquery's Issues

Allow `namespace::object` in `FROM` clause

This stopped working after #12

Improve support for outer join queries

Use the new keep argument to full_join() in tidyverse/dplyr#4589 to preserve the join key columns from both the left and right tables, allowing qualified references to these columns in other clauses. I'm hoping this keep argument is added to left_join() and right_join() as well.

Add substring_index() function

Add MySQL substring_index() function, https://www.w3schools.com/sql/func_mysql_substring_index.asp.

Here are tidyverse and base versions of the function:

# tidyverse version
substring_index_tidyverse <- function(string, delimiter, number) {
	string_parts <- str_split(string, coll(delimiter), simplify = FALSE) %>% unlist()
	n_parts <- length(string_parts)
	if(number > 0) {
		number <- min(c(number, n_parts))
		return(string_parts[1:number] %>%
			   	str_flatten(delimiter))
	}
	if(number < 0) {
		number <- max(c(number, -n_parts))
		return(string_parts[(n_parts + number + 1):n_parts] %>%
			str_flatten(delimiter))
	}
}

substring_index_tidyverse('www.w3schools.com', '.', 2)

# base version
substring_index_base <- function(string, delimiter, number) {
	string_parts <- strsplit(string, delimiter, fixed = TRUE) %>% unlist()
	n_parts <- length(string_parts)
	if(number > 0) {
		number <- min(c(number, n_parts))
		return(string_parts[1:number] %>%
			   	paste(., collapse = delimiter))
	}
	if(number < 0) {
		number <- max(c(number, -n_parts))
		return(string_parts[(n_parts + number + 1):n_parts] %>%
			   	paste(., collapse = delimiter))
	}
}

substring_index_base('www.w3schools.com', '.', 2)

I tried adding to queryparser's translations.R file, but I'm running into a "could not find function" error with tidyquery::query(). I'm not sure where changes need to be made in tidyquery and queryparser to make it work.

Do not truncate function arguments in output of `show_dplyr()`

~~This is now enabled by r-lib/rlang#897~~

Add support for cross joins

After the fix to tidyverse/dplyr#4206 is on CRAN, use it to enable cross joins by specifying by = character().

Simplify dplyr code where possible

The dplyr code generated by tidyquery is unnecessarily lengthy and complicated in some cases, like when the SQL query includes a GROUP BY clause and aggregate functions. This is because the generated dplyr code is optimized for maximum compatibility, not simplicity. To work in edge cases, tidyquery chains together sequences of dplyr verbs that are often longer than necessary.

Optimize the generated dplyr code by detecting cases where the code can be simpler.

Add function show_dplyr()

Add a function show_dplyr() that prints the resulting dplyr code instead of running it.

Migrate CI to GitHub Actions

Needed for #27

Support disk.frame objects

See DiskFrame/disk.frame#196

Fix failures when dplyr shortens expression in column name with "..."

When dplyr processes a very long column expression in a transmute() or mutate() and there is no name given for the resulting column, then instead of using the full expression as the name of the column, it sometimes shortens the expression using .... For example:

games %>%
  transmute(stringr::str_c(name, ' is for players age ', as.character(min_age), ' or older')) %>%
  colnames()
#[1] "stringr::str_c(...)"

This causes errors in the current version of tidyquery, because internally it assumes that this name shortening does not occur. For example:

query("SELECT concat(name, ' is for players age ', cast(min_age AS STRING), ' or older') FROM games;")
# Error: object 'stringr::str_c(name, " is for players age ", as.character(min_age), ' not found

Fix this by using the column names that come out of the transmute() to change the names in tree$select if they don't match.

Cross join not working .

cross join does not work in query() functions
Syntax like this does not work. df1 and df2 are two simple dataframes .
query (' select * from df1 cross join df2 where df1.co1 = df2.col2 ')

Work around problems involving typed NAs

In SQL, NULLs are not typed. But in R, NAs are typed. By default, NA is logical, and there are other variants NA_real_, NA_character_, and NA_integer_. This can cause problems when SQL queries contain CASE expressions or calls to the coalesce() function, because the dplyr functions that these are translated to (case_when() and coalesce() respectively) require that all the possible returned values inside these function calls have the same data type. queryparser has no way of knowing what data types these returned values will have at parse time, so it cannot translate to the appropriate NA type, so it just uses logical NA. So in tidyquery, if the other returned values are numeric, character, or integer, but one or more is NA, then errors like this result:

#Error: must be a double vector, not a logical vector

Consider if there is any way to resolve this by examining the expressions in tidyquery and replacing logical NAs with NA_real_, NA_character_, ore NA_integer_ depending on the types of the other values.

In the meantime, the workaround for users is to cast NULLs to the expected data types in the THEN and ELSE values of CASE expressions and in the arguments to coalesce() in the SQL. For example:

CASE WHEN name = 'Bobby' THEN 'Bob' ELSE CAST(NULL AS STRING) END

coalesce(fname, lname, CAST(NULL AS STRING))

Support for Subqueries and OVER clauses

I was trying to use the function show_dplyr but I got the following error message:

Error: Subqueries are not supported
Error: OVER clauses are not supported
Error: Joins of three or more tables are unsupported

Is there a any plans to update it in the near future to add these support?

Thank you

query() fails when FROM clause refers to object in non-global calling environment

When the query() function is called inside a function, it fails when the FROM clause refers to a data frame that exists in that function's environment but not in the global environment. For example:

foo <- function(dat, sql) {
    query(sql)
}
foo(iris, "SELECT Species, COUNT(*) FROM dat GROUP BY Species")
# Error: No data frame exists with the name dat

Use tidyselect functions to implement query() more cleanly

Look into using functions from the tidyselect package including poke_vars(), peek_vars(), scoped_vars(), with_vars(), and has_vars() to implement some of the logic around aliases in query() more cleanly. These functions could help avoid some of the calls to deparse() or as.character().

Use new dplyr `join_by()` syntax to represent join conditions

dplyr 1.1.0 introduces a new, more flexible syntax for specifying join conditions. Use this in tidyquery.

Export version number for queryparser

To enable ianmcook/queryparser#32, export the version number of tidyquery

Fix bugs that occur with very long expressions

R expressions greater than 500 characters in length cause errors because 500 characters is the value used (and the largest value allowed by R) for the width.cutoff argument to the deparse() function. Figure out what part of this bug needs to be fixed in tidyquery versus in queryparser.

Failing test with next version of dtplyr

This test now fails:

  expect_equal(
    query(
      "SELECT origin, dest,
          COUNT(flight) AS num_flts,
          round(AVG(distance)) AS dist,
          round(AVG(arr_delay)) AS avg_delay
          FROM flights_dt
        WHERE distance BETWEEN 200 AND 300
          AND air_time IS NOT NULL
        GROUP BY origin, dest
        HAVING num_flts > 3000
        ORDER BY num_flts DESC, avg_delay DESC
        LIMIT 100;"
    ),
    flights_dt %>%
      filter(between(distance,200,300) & !is.na(air_time)) %>%
      group_by(origin, dest) %>%
      filter(sum(!is.na(flight)) > 3000) %>%
      summarise(
        num_flts = sum(!is.na(flight)),
        dist = round(mean(distance, na.rm = TRUE)),
        avg_delay = round(mean(arr_delay, na.rm = TRUE))
      ) %>%
      ungroup() %>%
      arrange(desc(num_flts), desc(avg_delay)) %>%
      head(100L)
  )

Unfortunately you don't get a particularly informative error (even with local_edition(3)) because the pipeline is rather deep. However, I think this is the key difference:

actual$parent$parent$parent$parent$parent$i vs expected$parent$parent$parent$parent$parent$i
- `\`_DT3\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`
+ `\`_DT4\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`

i.e. expected is generating one additional intermediate data table name than expected — this is probably due to the new grouped filter behaviour. Indeed, if I remove filter(sum(!is.na(flight)) > 3000) and HAVING num_flts > 3000 the test passes

Support Arrow Table, RecordBatch, and Dataset objects

Support Apache Arrow Table and RecordBatch objects (which inherit class ArrowTabular) and Dataset objects (which inherit class Dataset and ArrowObject)

Support data.table objects through dtplyr

dtplyr is a data.table backend for dplyr. Add support for the dtplyr_step objects that dtplyr creates.

Add support for fields with spaces in their names when using FROM clause

I inherited a bunch of SQL with spaces in field names. Does tidyquery support field names that have spaces like brand and model in this reprex:

library(tidyverse)
library(tidyquery)
mtcars_tibble <- mtcars %>% rownames_to_column(var = 'brand and model') %>%
    as_tibble()
sql <- "select mt.`brand and model`, mt.mpg, mt.cyl from mtcars_tibble as mt"
tidyquery::query(sql)
#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls

#> Warning in readChar(rc, nchars): truncating string with embedded nuls
#> Warning in readChar(rc, 1L, useBytes = TRUE): truncating string with embedded
#> nuls

#> Warning in readChar(rc, 1L, useBytes = TRUE): truncating string with embedded
#> nuls
#> Warning in readChar(rc_in, 1L): truncating string with embedded nuls
#> Error in str2lang(unqualified_column_name): <text>:1:7: unexpected symbol
#> 1: brand and
#>           ^

^{Created on 2020-08-11 by the reprex package (v0.3.0.9001)}

dplyr 1.0.8, rlang 1.0.0

We're in the process of releasing dplyr 1.0.8, rlang 1.0.0 and it appears that tidyquery fails against them. I believe it's more related to rlang than dplyr though.

── After ─────────────────────────────────────────────────────────────────────────────────────────────────────
> checking tests ... ERROR
  See below...

── Test failures ─────────────────────────────────────────────────────────────────────────────── testthat ────

> library(testthat)
> library(tidyquery)
> 
> test_check("tidyquery")
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
Joining, by = "office_id"
══ Skipped tests ═══════════════════════════════════════════════════════════════
• currently returns columns in wrong order (1)

══ Failed tests ════════════════════════════════════════════════════════════════
── Error (test-dtplyr.R:17:3): Full example #1 returns expected result on dtplyr_step ──
Error in `step_subset(parent, i = i)`: is.null(i) || is_expression(i) || is_step(i) is not TRUE
Backtrace:
     ▆
  1. ├─testthat::expect_equal(...) at test-dtplyr.R:17:2
  2. │ └─testthat::quasi_label(enquo(object), label, arg = "object")
  3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
  4. ├─... %>% as.data.frame()
  5. ├─base::as.data.frame(.)
  6. ├─tidyquery::query("SELECT origin, dest,\n          COUNT(flight) AS num_flts,\n          round(AVG(distance)) AS dist,\n          round(AVG(arr_delay)) AS avg_delay\n          FROM flights_dt\n        WHERE distance BETWEEN 200 AND 300\n          AND air_time IS NOT NULL\n        GROUP BY origin, dest\n        HAVING num_flts > 3000\n        ORDER BY num_flts DESC, avg_delay DESC\n        LIMIT 100;")
  7. │ └─tidyquery:::query_(data, sql, TRUE)
  8. │   └─out %>% verb(filter, !!(tree$having[[1]]))
  9. ├─tidyquery:::verb(., filter, !!(tree$having[[1]]))
 10. │ └─input$data %>% fun(...)
 11. ├─dplyr fun(., ...)
 12. └─dtplyr:::filter.dtplyr_step(., ...)
 13.   └─dtplyr:::step_subset_i(.data, i)
 14.     └─dtplyr:::step_subset(parent, i = i)
 15.       └─base::stopifnot(is.null(i) || is_expression(i) || is_step(i))
── Failure (test-errors.R:96:3): query() fails on two very long expressions with no aliases ──
`query("SELECT 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1, 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+2 FROM games")` did not throw the expected error.
Backtrace:
    ▆
 1. └─testthat::expect_error(...) at test-errors.R:96:2
 2.   └─testthat:::expect_condition_matching(...)

[ FAIL 2 | WARN 0 | SKIP 1 | PASS 216 ]
Error: Test failures
Execution halted

1 error x | 0 warnings ✓ | 0 notes ✓

Support SQL-92-style join queries

queryparser now supports queries that use SQL-92-style (explicit) join syntax (ianmcook/queryparser#20). Add support for this to tidyquery.

Limit the initial implementation in tidyquery to two-table joins; implement joins of three or more tables later. With three or more tables, there are major challenges caused by differences in how SQL engines and dplyr perform joins: SQL engines process joins by looking at all the source tables at once, whereas dplyr can only process one join at a time, and in each successive join, the left table is all the tables previously joined. dplyr does not have any built-in mechanism for tracking which columns in a join result came from which tables, and in joins of three or more tables, tidyquery would need to keep track of that to ensure that column references in the join conditions and in other clauses point to the correct columns in the left table.

The dplyr join functions coalesce the join key columns from the left and right tables and can return only these coalesced values, not the separate join key columns from the left and right tables. For the initial implementation in tidyquery, return only the coalesced join key column(s) and disallow qualified references to the join key column(s) from the left and/or right table in outer joins, to prevent users from thinking they're running a query to check for unmatched rows. In a later version of tidyquery, before the join, use mutate() to add new columns to the left and right tables containing the join keys, then join by those columns and remove them after the join; this would preserve the separate join key columns from the left and right tables.