ianmcook / queryparser Goto Github PK

Translate SQL queries into R expressions

License: Apache License 2.0

R 100.00%

sql r tidyverse database

queryparser's Issues

Security mechanism can be defeated by using function name as column name

The mechanism that stops queryparser from translating an expression if it includes functions that are not on the whitelist can be defeated by including a column name in the expression that has the same name as the disallowed function. For example:

parse_expression("system + system('ls')")
#system + system("ls")

This is caused by the way the all_funs function is implemented.

Translate NULLS FIRST and NULLS LAST in ORDER BY clause

Currently queryparser will throw an error if the NULLS FIRST or NULLS LAST keywords are used in the ORDER BY clause. A future version of queryparser should recognize and parse these keywords and modify the order_by sublist accordingly. For example, the clause ORDER BY x NULLS FIRST would be translated as follows:

$order_by
$order_by[[1]]
is.na(x)

$order_by[[2]]
x

Add support for SQL-92-style join queries

Add support for SQL-92-style (explicit) joins, with join conditions specified after ON or USING in the FROM clause. Do not support SQL-89-style (implicit) joins, in which join conditions are specified in the WHERE clause.

Security mechanism can be defeated by using %>%

The mechanism that stops queryparser from translating an expression if it includes functions that are not on the whitelist can be defeated by using a function name with no parentheses and using the forward pipe operator %>% to pass input to the function. For example:

parse_expression("'ls' %>% system")
#system("ls")

This is caused by the %>% operator being explicitly allowed in the query at the time when the check is performed

Set an attribute indicating whether each expression in the ORDER BY list aggregates

Related to #8 and #9, queryparser also determines whether or not each expression in the ORDER BY list aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the order_by sublist that's returned by parse_query() indicating whether each expression in the ORDER BY list aggregates.

Allow syntactically invalid column names in backticks

As reported in ianmcook/tidyquery#14, queryparser throws an error when the SQL refers to a column in a data frame that has a syntactically invalid name.

Allow parentheses enclosing table names and joins in FROM clause

It's common in SQL queries for individual table names in the FROM clause to be enclosed in parentheses. In joins, it's also common for everything after the FROM keyword in the FROM clause to be enclosed in parentheses. queryparser should allow parentheses in these cases.

Second through last occurrence of a binary symbolic operator is not replaced

There is an error in the regular expression used in the function replace_operators_binary_symbolic that causes the second through last of a particular binary symbolic operator to fail to be replaced. For example:

parse_expression("carb = 4 and gear = 4")
# carb == 4 && gear = 4

The + that is pasted in after non_operator_regex in two places in the regular expression should be removed.

Add support for CASE expressions when tidyverse = TRUE

Add support for translating SQL CASE expressions to R expressions that use dplyr::case_when() in the tidyverse = TRUE case. This should support both forms of the SQL CASE expression:

CASE a
  WHEN b THEN c
  WHEN d THEN e
  ELSE f
END

CASE
  WHEN a == b THEN c
  WHEN a == d THEN e
  ELSE f
END

Column alias extractor returns bad output when expression ends with column name containing a period

When a SQL expression ends with a column name containing a period, the extract_alias function does not behave properly. For example:

extract_alias(" 1 + Petal.Length")
#      Length 
#"1 + Petal."

Function translations replace column names

Currently the translate_direct replaces columns with names that match function names.

parse_expression("month = 12", T)
#lubridate::month == 12

To fix this, do not call substitute directly in translate_direct. Instead call a function that calls substitute only on functions (the parent element of a call).

Do not use && and || in translated expressions

The operators && and || are not vectorized in R so should not be used in expressions that will be applied in a vectorized fashion to data frames. Translate AND and OR to & and | instead. Consider adding a note and example to the README describing how these operators do not short circuit like SQL's AND and OR sometimes do.

Translation of coalesce() fails when tidyverse = FALSE and arguments are expressions

SQL expressions containing calls to coalesce() fail when tidyverse = FALSE and when the arguments include expressions (not only simple column references).

Add support for UNION

UNION [ALL] is probably the most popular set operation in SQL. It may be the easiest to implement as well.

This might affect the structure of parse_query's output. That will probably have effects on tidyquery.

Initial thoughts on options:

Keep everything flat
Add another level to the structure

parse_query("select x from df1 union select x from df2")

Option 1:

$`select`
$`select`[[1]]
x

$from
$from[[1]]
df1

$set_op
[1] union

$`select`
$`select`[[1]]
x

$from
$from[[1]]
df2

Option 2:

[[1]]
[[1]]$`select`
[[1]]$`select`[[1]]
x

[[1]]$from
[[1]]$from[[1]]
df1

$set_op
[1] union

[[3]]
[[3]]$`select`
[[3]]$`select`[[1]]
x

[[3]]$from
[[3]]$from[[1]]
df2

Add support for CASE expressions when tidyverse = FALSE

Support for CASE expressions was added for the tidyverse = TRUE case (#16). Add it for the tidyverse = FALSE case as well, using nested ifelse() calls. Use NA as the default value in the final else condition of no ELSE is specified.

Add translation for substring_index()

Add a translation for the MySQL function substring_index() as requested by @jimmyg3g in ianmcook/tidyquery#13

Translation of multiple CAST and BETWEEN in one expression fails

Due to errors in the replace_special_keywords() function, expressions with multiple CAST and/or BETWEEN keywords will fail to translate correctly.

Translate coalesce() to a vectorized expression when tidyverse = FALSE

When tidyverse = FALSE, coalesce() expressions are translated to if() {} else if {} else {} statements, which are not vectorized in R. Switch to using nested ifelse() calls which are vectorized.

Fix problems with non-ASCII characters in queries

Some functions in queryparser work fine with input that includes Unicode characters, but other functions cause warnings or throw errors. Add a comprehensive set of tests for input that contains Unicode characters, and make all the changes necessary to get these functions to pass.

Column names or parts of column names may be incorrectly converted to lowercase

When a column name matches the name of a SQL function that queryparser translates, then that column name is converted to lowercase. This also occurs for parts of columns names that are separated from other parts by periods. For example:

parse_expression("Length > 2")
#length > 2

parse_expression("Petal.Length > 2")
#Petal.length > 2

As a workaround until this is resolved, enclose variable names in backticks:

parse_expression("`Length` > 2")
#Length > 2

parse_expression("`Petal.Length` > 2")
#Petal.Length > 2

Throw error if called from an incompatible version of tidyquery

tidyquery imports queryparser and can require a minimum version of queryparser through R's usual method of specifying forward import versions. But this leaves open the possibility that a user might upgrade to a new version of queryparser but still have an older, incompatible version of tidyquery.

When queryparser is called from tidyquery, check its version and throw an error if the version is incompatible.

Errors in translation of BETWEEN expressions with quoted operands when tidyverse = FALSE

When tidyverse = FALSE, translations of BETWEEN expressions with quoted literal strings or quoted column names as their operands might be erroneous and might cause warnings. This happens because the replace_special_keywords() function outputs a string that has more instances of MASKING_CHARACTER in it than the input string did, so the unmasking code cannot correctly replace the masked quoted strings.

Add support for table aliases

queryparser should allow:

column names prefixed with table names or table aliases followed by a dot
- In single-table queries, it should simply remove these prefixes
table aliases in the FROM clause, with or without the AS keyword

Parse queries with no whitespace characters separating keywords

queryparser currently fails to parse queries like these:

SELECT`foo`FROM`bar`;
SELECT*FROM table;
SELECT'literal'FROM table;

Remove line comments and block comments from query

To allow parsing of commented SQL queries, queryparser should remove any line comments (beginning with -- and ending at the end of the line) and block comments (beginning with /* and ending with */) before parsing a query.

Silence "truncating string with embedded nuls" warnings

In some environments, readChar() emits a warning when it encounters an embedded nul. This causes queryparser to generate a large number of warnings. I believe (but am not certain) that this issue first began to occur in R 4.0.0. I am not sure whether it affects all OSs.

Set an attribute indicating whether each expression in the SELECT list aggregates

Related to #8, queryparser also performs nontrivial work to determine whether or not each expression in the SELECT list aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the select sublist that's returned by parse_query() indicating whether each expression in the SELECT list aggregates.

Add tests for translations in translations.R

There are data types, operators, functions, and aggregates that need testing. The thought process was laid out by Ian in this comment. How should we handle testing "generic" translations (where tidy = FALSE and tidy = TRUE return the same results)? Here's my current idea:

test_that("string is translated as character", {
  expect_equal(parse_expression("cast(a as string)", tidy = FALSE), str2lang("as.character(a)"))
  expect_equal(parse_expression("cast(a as string)", tidy = TRUE),  str2lang("as.character(a)"))
})

Maybe extracting the expressions would be better:

test_that("string is translated as character", {
  SQL <- "cast(a as string)"
  R   <- "as.character(a)"
  expect_equal(parse_expression(SQL, tidy = FALSE), str2lang(R))
  expect_equal(parse_expression(SQL, tidy = TRUE),  str2lang(R))
})

Set an attribute on the returned list indicating whether the query aggregates

queryparser performs nontrivial work to determine whether or not a query aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the list returned by parse_query() indicating whether the query aggregates.

ianmcook / queryparser Goto Github PK

queryparser's Issues

Option 1:

Option 2:

Recommend Projects

Recommend Topics

Recommend Org