Giter Site home page Giter Site logo

queryparser's Introduction

queryparser

CRAN status Travis build status AppVeyor build status Codecov test coverage

queryparser translates SQL queries into lists of unevaluated R expressions.

⚠️ Most R users should not directly use queryparser. Instead, use it through tidyquery.

For an introduction to tidyquery and queryparser, watch the recording of the talk “Bridging the Gap between SQL and R” from rstudio::conf(2020).

Installation

Install the released version of queryparser from CRAN with:

install.packages("queryparser")

Or install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("ianmcook/queryparser")

Usage

Call the function parse_query(), passing a SELECT statement enclosed in quotes as the first argument:

library(queryparser)

parse_query("SELECT DISTINCT carrier FROM flights WHERE dest = 'HNL'")
#> $select
#> $select[[1]]
#> carrier
#> 
#> attr(,"distinct")
#> [1] TRUE
#> 
#> $from
#> $from[[1]]
#> flights
#> 
#> 
#> $where
#> $where[[1]]
#> dest == "HNL"

Queries can include the clauses SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT:

parse_query(
" SELECT origin, dest,
    COUNT(flight) AS num_flts,
    round(SUM(seats)) AS num_seats,
    round(AVG(arr_delay)) AS avg_delay
  FROM flights f LEFT OUTER JOIN planes p
    ON f.tailnum = p.tailnum
  WHERE distance BETWEEN 200 AND 300
    AND air_time IS NOT NULL
  GROUP BY origin, dest
  HAVING num_flts > 3000
  ORDER BY num_seats DESC, avg_delay ASC
  LIMIT 2;"
)
#> $select
#> $select[[1]]
#> origin
#> 
#> $select[[2]]
#> dest
#> 
#> $select$num_flts
#> sum(!is.na(flight), na.rm = TRUE)
#> 
#> $select$num_seats
#> round(sum(seats, na.rm = TRUE))
#> 
#> $select$avg_delay
#> round(mean(arr_delay, na.rm = TRUE))
#> 
#> attr(,"aggregate")
#>                      num_flts num_seats avg_delay 
#>     FALSE     FALSE      TRUE      TRUE      TRUE 
#> 
#> $from
#> $from$f
#> flights
#> 
#> $from$p
#> planes
#> 
#> attr(,"join_types")
#> [1] "left outer join"
#> attr(,"join_conditions")
#> attr(,"join_conditions")[[1]]
#> f.tailnum == p.tailnum
#> 
#> 
#> $where
#> $where[[1]]
#> (distance >= 200 & distance <= 300) & !is.na(air_time)
#> 
#> 
#> $group_by
#> $group_by[[1]]
#> origin
#> 
#> $group_by[[2]]
#> dest
#> 
#> 
#> $having
#> $having[[1]]
#> num_flts > 3000
#> 
#> 
#> $order_by
#> $order_by[[1]]
#> -xtfrm(num_seats)
#> 
#> $order_by[[2]]
#> avg_delay
#> 
#> attr(,"aggregate")
#> [1] FALSE FALSE
#> 
#> $limit
#> $limit[[1]]
#> [1] 2
#> 
#> 
#> attr(,"aggregate")
#> [1] TRUE

Set the argument tidyverse to TRUE to use functions from tidyverse packages including dplyr, stringr, and lubridate in the R expressions:

parse_query("SELECT COUNT(*) AS n FROM t WHERE x BETWEEN y AND z ORDER BY n DESC", tidyverse = TRUE)
#> $select
#> $select$n
#> dplyr::n()
#> 
#> attr(,"aggregate")
#>    n 
#> TRUE 
#> 
#> $from
#> $from[[1]]
#> t
#> 
#> 
#> $where
#> $where[[1]]
#> dplyr::between(x, y, z)
#> 
#> 
#> $order_by
#> $order_by[[1]]
#> dplyr::desc(n)
#> 
#> attr(,"aggregate")
#> [1] FALSE
#> 
#> attr(,"aggregate")
#> [1] TRUE

queryparser will translate only explicitly allowed functions and operators, preventing injection of malicious code:

parse_query("SELECT x FROM y WHERE system('rm -rf /')")
#> Error: Unrecognized function or operator: system

Current Limitations

queryparser does not currently support:

  • Subqueries
  • Unions
  • SQL-89-style (implicit) join notation
  • The WITH clause (common table expressions)
  • OVER expressions (window or analytic functions)
  • Some SQL functions and operators

queryparser currently has the following known limitations:

  • Some SQL expressions will translate only when tidyverse is set to TRUE. An example of this is COUNT(DISTINCT ...) expressions with multiple arguments.
  • When logical operators (such as IS NULL) have unparenthesized expressions as their operands, R will interpret the resulting code using a different order of operations than a SQL engine would. When using an expression as the operand to a logical operator, always enclose the expression in parentheses.
  • The error messages that occur when attempting to parse invalid or unrecognized SQL are often non-informative.

Non-Goals

queryparser is not intended to:

  • Translate other types of SQL statements (such as INSERT or UPDATE)
  • Customize translations for specific SQL dialects
  • Fully validate the syntax of the SELECT statements passed to it
  • Efficiently process large batches of queries
  • Facilitate the analysis of queries (for example, to identify patterns)

Related Work

The sqlparseR package (CRAN) provides a wrapper around the Python module sqlparse.

queryparser's People

Contributors

ianmcook avatar stevenhibble avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

queryparser's Issues

Silence "truncating string with embedded nuls" warnings

In some environments, readChar() emits a warning when it encounters an embedded nul. This causes queryparser to generate a large number of warnings. I believe (but am not certain) that this issue first began to occur in R 4.0.0. I am not sure whether it affects all OSs.

Security mechanism can be defeated by using %>%

The mechanism that stops queryparser from translating an expression if it includes functions that are not on the whitelist can be defeated by using a function name with no parentheses and using the forward pipe operator %>% to pass input to the function. For example:

parse_expression("'ls' %>% system")
#system("ls")

This is caused by the %>% operator being explicitly allowed in the query at the time when the check is performed

Column names or parts of column names may be incorrectly converted to lowercase

When a column name matches the name of a SQL function that queryparser translates, then that column name is converted to lowercase. This also occurs for parts of columns names that are separated from other parts by periods. For example:

parse_expression("Length > 2")
#length > 2

parse_expression("Petal.Length > 2")
#Petal.length > 2

As a workaround until this is resolved, enclose variable names in backticks:

parse_expression("`Length` > 2")
#Length > 2

parse_expression("`Petal.Length` > 2")
#Petal.Length > 2

Add support for table aliases

queryparser should allow:

  • column names prefixed with table names or table aliases followed by a dot
    • In single-table queries, it should simply remove these prefixes
  • table aliases in the FROM clause, with or without the AS keyword

Security mechanism can be defeated by using function name as column name

The mechanism that stops queryparser from translating an expression if it includes functions that are not on the whitelist can be defeated by including a column name in the expression that has the same name as the disallowed function. For example:

parse_expression("system + system('ls')")
#system + system("ls")

This is caused by the way the all_funs function is implemented.

Add tests for translations in translations.R

There are data types, operators, functions, and aggregates that need testing. The thought process was laid out by Ian in this comment. How should we handle testing "generic" translations (where tidy = FALSE and tidy = TRUE return the same results)? Here's my current idea:

test_that("string is translated as character", {
  expect_equal(parse_expression("cast(a as string)", tidy = FALSE), str2lang("as.character(a)"))
  expect_equal(parse_expression("cast(a as string)", tidy = TRUE),  str2lang("as.character(a)"))
})

Maybe extracting the expressions would be better:

test_that("string is translated as character", {
  SQL <- "cast(a as string)"
  R   <- "as.character(a)"
  expect_equal(parse_expression(SQL, tidy = FALSE), str2lang(R))
  expect_equal(parse_expression(SQL, tidy = TRUE),  str2lang(R))
})

Set an attribute indicating whether each expression in the SELECT list aggregates

Related to #8, queryparser also performs nontrivial work to determine whether or not each expression in the SELECT list aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the select sublist that's returned by parse_query() indicating whether each expression in the SELECT list aggregates.

Throw error if called from an incompatible version of tidyquery

tidyquery imports queryparser and can require a minimum version of queryparser through R's usual method of specifying forward import versions. But this leaves open the possibility that a user might upgrade to a new version of queryparser but still have an older, incompatible version of tidyquery.

When queryparser is called from tidyquery, check its version and throw an error if the version is incompatible.

Set an attribute indicating whether each expression in the ORDER BY list aggregates

Related to #8 and #9, queryparser also determines whether or not each expression in the ORDER BY list aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the order_by sublist that's returned by parse_query() indicating whether each expression in the ORDER BY list aggregates.

Do not use && and || in translated expressions

The operators && and || are not vectorized in R so should not be used in expressions that will be applied in a vectorized fashion to data frames. Translate AND and OR to & and | instead. Consider adding a note and example to the README describing how these operators do not short circuit like SQL's AND and OR sometimes do.

Fix problems with non-ASCII characters in queries

Some functions in queryparser work fine with input that includes Unicode characters, but other functions cause warnings or throw errors. Add a comprehensive set of tests for input that contains Unicode characters, and make all the changes necessary to get these functions to pass.

Errors in translation of BETWEEN expressions with quoted operands when tidyverse = FALSE

When tidyverse = FALSE, translations of BETWEEN expressions with quoted literal strings or quoted column names as their operands might be erroneous and might cause warnings. This happens because the replace_special_keywords() function outputs a string that has more instances of MASKING_CHARACTER in it than the input string did, so the unmasking code cannot correctly replace the masked quoted strings.

Allow parentheses enclosing table names and joins in FROM clause

It's common in SQL queries for individual table names in the FROM clause to be enclosed in parentheses. In joins, it's also common for everything after the FROM keyword in the FROM clause to be enclosed in parentheses. queryparser should allow parentheses in these cases.

Add support for SQL-92-style join queries

Add support for SQL-92-style (explicit) joins, with join conditions specified after ON or USING in the FROM clause. Do not support SQL-89-style (implicit) joins, in which join conditions are specified in the WHERE clause.

Add support for UNION

UNION [ALL] is probably the most popular set operation in SQL. It may be the easiest to implement as well.

This might affect the structure of parse_query's output. That will probably have effects on tidyquery.

Initial thoughts on options:

  1. Keep everything flat
  2. Add another level to the structure

parse_query("select x from df1 union select x from df2")

Option 1:

$`select`
$`select`[[1]]
x

$from
$from[[1]]
df1

$set_op
[1] union

$`select`
$`select`[[1]]
x

$from
$from[[1]]
df2

Option 2:

[[1]]
[[1]]$`select`
[[1]]$`select`[[1]]
x

[[1]]$from
[[1]]$from[[1]]
df1

$set_op
[1] union

[[3]]
[[3]]$`select`
[[3]]$`select`[[1]]
x

[[3]]$from
[[3]]$from[[1]]
df2

Second through last occurrence of a binary symbolic operator is not replaced

There is an error in the regular expression used in the function replace_operators_binary_symbolic that causes the second through last of a particular binary symbolic operator to fail to be replaced. For example:

parse_expression("carb = 4 and gear = 4")
# carb == 4 && gear = 4

The + that is pasted in after non_operator_regex in two places in the regular expression should be removed.

Translate NULLS FIRST and NULLS LAST in ORDER BY clause

Currently queryparser will throw an error if the NULLS FIRST or NULLS LAST keywords are used in the ORDER BY clause. A future version of queryparser should recognize and parse these keywords and modify the order_by sublist accordingly. For example, the clause ORDER BY x NULLS FIRST would be translated as follows:

$order_by
$order_by[[1]]
is.na(x)

$order_by[[2]]
x

Remove line comments and block comments from query

To allow parsing of commented SQL queries, queryparser should remove any line comments (beginning with -- and ending at the end of the line) and block comments (beginning with /* and ending with */) before parsing a query.

Add support for CASE expressions when tidyverse = FALSE

Support for CASE expressions was added for the tidyverse = TRUE case (#16). Add it for the tidyverse = FALSE case as well, using nested ifelse() calls. Use NA as the default value in the final else condition of no ELSE is specified.

Function translations replace column names

Currently the translate_direct replaces columns with names that match function names.

parse_expression("month = 12", T)
#lubridate::month == 12

To fix this, do not call substitute directly in translate_direct. Instead call a function that calls substitute only on functions (the parent element of a call).

Add support for CASE expressions when tidyverse = TRUE

Add support for translating SQL CASE expressions to R expressions that use dplyr::case_when() in the tidyverse = TRUE case. This should support both forms of the SQL CASE expression:

CASE a
  WHEN b THEN c
  WHEN d THEN e
  ELSE f
END
CASE
  WHEN a == b THEN c
  WHEN a == d THEN e
  ELSE f
END

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.