ianmcook / queryparser Goto Github PK
View Code? Open in Web Editor NEWTranslate SQL queries into R expressions
License: Apache License 2.0
Translate SQL queries into R expressions
License: Apache License 2.0
The mechanism that stops queryparser from translating an expression if it includes functions that are not on the whitelist can be defeated by including a column name in the expression that has the same name as the disallowed function. For example:
parse_expression("system + system('ls')")
#system + system("ls")
This is caused by the way the all_funs
function is implemented.
Currently queryparser will throw an error if the NULLS FIRST
or NULLS LAST
keywords are used in the ORDER BY
clause. A future version of queryparser should recognize and parse these keywords and modify the order_by
sublist accordingly. For example, the clause ORDER BY x NULLS FIRST
would be translated as follows:
$order_by
$order_by[[1]]
is.na(x)
$order_by[[2]]
x
Add support for SQL-92-style (explicit) joins, with join conditions specified after ON
or USING
in the FROM
clause. Do not support SQL-89-style (implicit) joins, in which join conditions are specified in the WHERE
clause.
The mechanism that stops queryparser from translating an expression if it includes functions that are not on the whitelist can be defeated by using a function name with no parentheses and using the forward pipe operator %>%
to pass input to the function. For example:
parse_expression("'ls' %>% system")
#system("ls")
This is caused by the %>%
operator being explicitly allowed in the query at the time when the check is performed
Related to #8 and #9, queryparser also determines whether or not each expression in the ORDER BY
list aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the order_by
sublist that's returned by parse_query()
indicating whether each expression in the ORDER BY
list aggregates.
As reported in ianmcook/tidyquery#14, queryparser throws an error when the SQL refers to a column in a data frame that has a syntactically invalid name.
It's common in SQL queries for individual table names in the FROM clause to be enclosed in parentheses. In joins, it's also common for everything after the FROM keyword in the FROM clause to be enclosed in parentheses. queryparser should allow parentheses in these cases.
There is an error in the regular expression used in the function replace_operators_binary_symbolic
that causes the second through last of a particular binary symbolic operator to fail to be replaced. For example:
parse_expression("carb = 4 and gear = 4")
# carb == 4 && gear = 4
The +
that is pasted in after non_operator_regex
in two places in the regular expression should be removed.
Add support for translating SQL CASE
expressions to R expressions that use dplyr::case_when()
in the tidyverse = TRUE
case. This should support both forms of the SQL CASE
expression:
CASE a
WHEN b THEN c
WHEN d THEN e
ELSE f
END
CASE
WHEN a == b THEN c
WHEN a == d THEN e
ELSE f
END
When a SQL expression ends with a column name containing a period, the extract_alias
function does not behave properly. For example:
extract_alias(" 1 + Petal.Length")
# Length
#"1 + Petal."
Currently the translate_direct
replaces columns with names that match function names.
parse_expression("month = 12", T)
#lubridate::month == 12
To fix this, do not call substitute
directly in translate_direct
. Instead call a function that calls substitute
only on functions (the parent element of a call).
The operators &&
and ||
are not vectorized in R so should not be used in expressions that will be applied in a vectorized fashion to data frames. Translate AND
and OR
to &
and |
instead. Consider adding a note and example to the README describing how these operators do not short circuit like SQL's AND
and OR
sometimes do.
SQL expressions containing calls to coalesce()
fail when tidyverse = FALSE
and when the arguments include expressions (not only simple column references).
UNION [ALL]
is probably the most popular set operation in SQL. It may be the easiest to implement as well.
This might affect the structure of parse_query
's output. That will probably have effects on tidyquery.
Initial thoughts on options:
parse_query("select x from df1 union select x from df2")
$`select`
$`select`[[1]]
x$from
$from[[1]]
df1$set_op
[1] union$`select`
$`select`[[1]]
x$from
$from[[1]]
df2
[[1]]
[[1]]$`select`
[[1]]$`select`[[1]]
x[[1]]$from
[[1]]$from[[1]]
df1$set_op
[1] union[[3]]
[[3]]$`select`
[[3]]$`select`[[1]]
x[[3]]$from
[[3]]$from[[1]]
df2
Support for CASE
expressions was added for the tidyverse = TRUE
case (#16). Add it for the tidyverse = FALSE
case as well, using nested ifelse()
calls. Use NA
as the default value in the final else condition of no ELSE
is specified.
Add a translation for the MySQL function substring_index()
as requested by @jimmyg3g in ianmcook/tidyquery#13
Due to errors in the replace_special_keywords()
function, expressions with multiple CAST
and/or BETWEEN
keywords will fail to translate correctly.
When tidyverse = FALSE
, coalesce()
expressions are translated to if() {} else if {} else {}
statements, which are not vectorized in R. Switch to using nested ifelse()
calls which are vectorized.
Some functions in queryparser work fine with input that includes Unicode characters, but other functions cause warnings or throw errors. Add a comprehensive set of tests for input that contains Unicode characters, and make all the changes necessary to get these functions to pass.
When a column name matches the name of a SQL function that queryparser translates, then that column name is converted to lowercase. This also occurs for parts of columns names that are separated from other parts by periods. For example:
parse_expression("Length > 2")
#length > 2
parse_expression("Petal.Length > 2")
#Petal.length > 2
As a workaround until this is resolved, enclose variable names in backticks:
parse_expression("`Length` > 2")
#Length > 2
parse_expression("`Petal.Length` > 2")
#Petal.Length > 2
tidyquery imports queryparser and can require a minimum version of queryparser through R's usual method of specifying forward import versions. But this leaves open the possibility that a user might upgrade to a new version of queryparser but still have an older, incompatible version of tidyquery.
When queryparser is called from tidyquery, check its version and throw an error if the version is incompatible.
When tidyverse = FALSE
, translations of BETWEEN
expressions with quoted literal strings or quoted column names as their operands might be erroneous and might cause warnings. This happens because the replace_special_keywords()
function outputs a string that has more instances of MASKING_CHARACTER
in it than the input string did, so the unmasking code cannot correctly replace the masked quoted strings.
queryparser should allow:
queryparser currently fails to parse queries like these:
SELECT`foo`FROM`bar`;
SELECT*FROM table;
SELECT'literal'FROM table;
To allow parsing of commented SQL queries, queryparser should remove any line comments (beginning with --
and ending at the end of the line) and block comments (beginning with /*
and ending with */
) before parsing a query.
In some environments, readChar()
emits a warning when it encounters an embedded nul
. This causes queryparser to generate a large number of warnings. I believe (but am not certain) that this issue first began to occur in R 4.0.0. I am not sure whether it affects all OSs.
Related to #8, queryparser also performs nontrivial work to determine whether or not each expression in the SELECT
list aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the select
sublist that's returned by parse_query()
indicating whether each expression in the SELECT
list aggregates.
There are data types, operators, functions, and aggregates that need testing. The thought process was laid out by Ian in this comment. How should we handle testing "generic" translations (where tidy = FALSE
and tidy = TRUE
return the same results)? Here's my current idea:
test_that("string is translated as character", {
expect_equal(parse_expression("cast(a as string)", tidy = FALSE), str2lang("as.character(a)"))
expect_equal(parse_expression("cast(a as string)", tidy = TRUE), str2lang("as.character(a)"))
})
Maybe extracting the expressions would be better:
test_that("string is translated as character", {
SQL <- "cast(a as string)"
R <- "as.character(a)"
expect_equal(parse_expression(SQL, tidy = FALSE), str2lang(R))
expect_equal(parse_expression(SQL, tidy = TRUE), str2lang(R))
})
queryparser performs nontrivial work to determine whether or not a query aggregates. To prevent packages or applications that use queryparser from needing to duplicate this work, set an attribute on the list returned by parse_query()
indicating whether the query aggregates.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.