Giter Site home page Giter Site logo

implyr's Introduction

implyr

Build Status CRAN_Status_Badge

Overview

implyr is a SQL backend to dplyr for Apache Impala, the massively parallel processing query engine. Impala enables low-latency SQL queries on large datasets stored in HDFS, Apache HBase, Apache Kudu, Amazon S3, Microsoft ADLS, and Dell EMC Isilon.

implyr is designed to work with any DBI-compatible interface to Impala. implyr does not provide the underlying connectivity to Impala, nor does it require that you use one particular R package for connectivity to Impala. Currently, two packages that can provide this connectivity are odbc and RJDBC. Future packages may provide other options for connectivity.

Installation

You can install the latest release of implyr from CRAN:

install.packages("implyr")

Or you can install the current development version from GitHub:

devtools::install_github("ianmcook/implyr")

You must also install a package and driver to provide connectivity to Impala.

ODBC Connectivity

odbc is currently the preferred R package for connecting to Impala. It provides superior performance and compatibility.

  1. Ensure that the system where your R code will run supports ODBC. ODBC support is built into Windows but requires unixODBC or iODBC on Linux and macOS.

  2. Install the odbc package from CRAN:

    install.packages("odbc")
  3. Download and install the latest version of the Impala ODBC driver from Cloudera.

  4. Complete the installation and configuration steps described in the odbc package README and the Impala ODBC driver installation guide.

JDBC Connectivity

Package RJDBC can provide access to Impala through JDBC.

  1. Ensure that the system where your R code will run has a Java Runtime Environment (JRE) installed.

  2. Install the RJDBC package from CRAN:

    install.packages("RJDBC")
  3. Download and install the latest version of the Impala JDBC driver from Cloudera.

  4. Complete the installation and configuration steps described in the Impala JDBC driver installation guide.

Connecting to Impala

First, load the implyr package:

library(implyr)

The next step depends on the method you will use to connect to Impala.

ODBC Connectivity

Load the odbc package:

library(odbc)

Create an ODBC driver object:

drv <- odbc::odbc()

Call src_impala() to connect to Impala and create a dplyr data source. In the call to src_impala(), specify the arguments required by odbc::dbConnect(). These arguments can consist of individual ODBC keywords (driver, host, port, database, uid, pwd, and others), an ODBC data source name (dsn), or an ODBC connection string (.connection_string). For example:

impala <- src_impala(
    drv = drv,
    driver = "Cloudera ODBC Driver for Impala",
    host = "host",
    port = 21050,
    database = "default",
    uid = "username",
    pwd = "password"
  )

The returned object impala provides a remote dplyr data source to Impala.

For more information about which arguments you can pass to src_impala() when using ODBC connectivity, see ?"dbConnect,OdbcDriver-method" and the Authentication section below.

JDBC Connectivity

Load the RJDBC package:

library(RJDBC)

Initialize the Java Virtual Machine (JVM) by calling .jinit() and passing a vector containing the paths to all the Impala JDBC driver JAR files as the classpath argument:

impala_classpath <- list.files(path = "/path/to/jdbc/driver", pattern = "\\.jar$", full.names = TRUE)
.jinit(classpath = impala_classpath)

If an error occurs, you may need to first set the JAVA_HOME environment variable:

Sys.setenv(JAVA_HOME = "/path/to/java/home/")

Create a JDBC driver object:

drv <- JDBC(
  driverClass = "com.cloudera.impala.jdbc41.Driver",
  classPath = impala_classpath,
  identifier.quote = "`"
)

If you are using the JDBC version 4.0 driver, specify com.cloudera.impala.jdbc4.Driver. If you are using the JDBC version 4.1 driver, specify com.cloudera.impala.jdbc41.Driver.

Call src_impala() to connect to Impala and create a dplyr data source. In the call to src_impala(), specify a JDBC connection string as the first argument. Optionally, specify a username as the second argument and a password as the third argument. For example:

impala <- src_impala(drv, "jdbc:impala://host:21050", "username", "password")

Or include the username and password (or other authentication properties) in the connection string:

impala <- src_impala(drv, "jdbc:impala://host:21050;UID=username;PWD=password")

The returned object impala provides a remote dplyr data source to Impala.

See the Authentication section below for information about how to construct the JDBC connection string when using different authentication methods.

Do not attempt to connect to Impala using more than one method in one R session.

Authentication

The Impala ODBC and JDBC drivers support multiple authentication methods, including no authentication, username, username and password, and Kerberos. To use Kerberos, specify properties including AuthMech, KrbRealm, KrbHostFQDN, and KrbServiceName. Consult your system administrator and the Impala ODBC driver installation guide or Impala JDBC driver installation guide.

Using dplyr

Now you can use dplyr verbs against tables in Impala.

To see what tables are in the current database in Impala, issue the command:

src_tbls(impala)

For this example, start by creating a lazy tbl named flights_tbl representing the data in the Impala table named flights:

flights_tbl <- tbl(impala, "flights")

To specify the database that contains the table, use the function in_schema(). For example, if the Impala table named flights were in a database named nycflights13, then you would use the command:

flights_tbl <- tbl(impala, in_schema("nycflights13", "flights"))

The examples here assume that data has already been loaded into the Impala table named flights. See the Loading Local Data into Impala section below for information about ways to load data from R into Impala.

delay <- flights_tbl %>% 
  select(tailnum, distance, arr_delay) %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20L, dist < 2000L, !is.na(delay)) %>%
  arrange(delay, dist, count) %>%
  collect()

implyr supports the dplyr verbs filter(), arrange(), select(), rename(), distinct(), mutate(), transmute(), and summarise(). It supports grouped operations with the group_by() function.

When using implyr, you must specify table names and column names using lowercase characters. To ensure that results are in sorted order, you must apply arrange() last, after all other dplyr verbs.

Impala does not perform implicit casting; for example, it does not automatically convert numbers to strings when they are used in a string context. Impala requires that you explicitly cast columns to the required types. implyr provides familiar R-style type conversion functions to enable casting to all the scalar Impala data types. For example, as.character() casts a column or column expression to the Impala STRING type.

flights_tbl %>% 
  transmute(flight_code = paste0(carrier, as.character(flight))) %>% 
  distinct(flight_code)

In addition, you should specify integer values as R integer objects instead of numeric objects; for example, 1L or as.integer(1) instead of 1.

See Introduction to dplyr for more examples of dplyr grammar.

Like other SQL backends to dplyr, implyr delays work until a result needs to be computed, then computes the result as a single query operation.

  • Use collect() to execute the query and return the result to R as a data frame tbl.
  • Use as.data.frame() to execute the query and return the result to R as an ordinary data frame.
  • Use compute(temporary = FALSE) to execute the query and store the result in an Impala table. Impala does not support temporary tables, so temporary = FALSE is required.
  • Use collapse() to generate the query for later execution.

If you print or store a result without using one of these functions, then implyr returns a lazy tbl. Only use collect() or as.data.frame() when the result will be small enough to fit in memory in your R session.

See the Introduction to dbplyr for more information.

implyr supports window functions, which enable computation of ranks, offsets, and cumulative aggregates. See Window functions for more information.

worst_delay_each_day <- flights_tbl %>%
  group_by(year, month, day) %>%
  filter(arr_delay == max(arr_delay)) %>%
  arrange(year, month, day) %>%
  collect()

implyr supports most two-table verbs, which enable joins and set operations.

airlines_tbl <- tbl(impala, "airlines")
inner_join(flights_tbl, airlines_tbl, by = "carrier")

implyr supports efficient filtering joins.

airlines_tbl <- tbl(impala, "airlines")
southwest_airlines <- airlines_tbl %>% filter(name == "Southwest Airlines Co.")
southwest_flights <- semi_join(flights_tbl, southwest_airlines, by = "carrier")

You can also use dplyr join functions to bring together values from ARRAY and MAP columns with scalar values from the same rows. See Impala Complex Types for more details about ARRAY and MAP columns.

Read the Warnings and Current Limitations section below to understand the ways that working with Impala as a remote dplyr data source is different from working with local data or other remote dplyr data sources.

Using SQL

In addition to using dplyr grammar, you can also issue SQL queries to Impala.

To execute a statement that returns no result set, use the dbExecute() function:

dbExecute(impala, "REFRESH flights")

To execute a query and return the result to R as a data frame, use the dbGetQuery() function.

flights_by_carrier_df <- dbGetQuery(
  impala,
  "SELECT carrier, COUNT(*) FROM flights GROUP BY carrier"
)

Only use dbGetQuery when the query result will be small enough to fit in memory in your R session.

You can also execute SQL and return the result as a lazy tbl:

flights_tbl <- tbl(impala, sql("SELECT * FROM flights"))

Disconnecting

When you are finished, close the connection to Impala:

dbDisconnect(impala)

Loading Local Data into Impala

The examples above assume that data has already been loaded into Impala. If you wish to run the examples above, you will need to load data from the package nycflights13 into Impala.

implyr does not provide tools for loading local data into Impala tables. This is because Impala can query data stored in several different filesystems and storage systems (HDFS, Apache HBase, Apache Kudu, Amazon S3, Microsoft ADLS, and Dell EMC Isilon) and Impala does not include built-in capability for loading local data into these systems.

Some other dplyr backends implement the function copy_to, which copies a local data frame to a remote source. implyr implements copy_to, but it currently only supports very small data frames. It uses the SQL INSERT ... VALUES() technique, which is not suitable for loading large amounts of data.

HDFS is the most common system for storing data in Impala tables. There are two methods described below for uploading data from R into HDFS.

To load the data frame nycflights13::flights into HDFS, first install and load the nycflights13 package:

install.packages("nycflights13")
library(nycflights13)

Then issue an SQL statement to create a table flights in Impala with a schema that matches the data frame flights:

dbExecute(impala, "CREATE TABLE flights (
    year SMALLINT,
    month TINYINT,
    day TINYINT,
    dep_time SMALLINT,
    sched_dep_time SMALLINT,
    dep_delay SMALLINT,
    arr_time SMALLINT,
    sched_arr_time SMALLINT,
    arr_delay SMALLINT,
    carrier STRING,
    flight SMALLINT,
    tailnum STRING,
    origin STRING,
    dest STRING,
    air_time SMALLINT,
    distance SMALLINT,
    hour TINYINT,
    minute TINYINT,
    time_hour TIMESTAMP)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/hive/warehouse/flights'")

Next, write the data frame flights to a local file:

write.table(flights, file = "flights", quote = FALSE, sep = "\t", na = "\\N", row.names = FALSE, col.names = FALSE)

If the Hadoop File System shell is installed on the system where you are running R, then you can issue an hdfs dfs -put command to load this local file to HDFS. You can issue the command from R using the system() function:

system("hdfs dfs -put flights /user/hive/warehouse/flights/000000_0")

Another option is to use the R package rwebhdfs to load the local file into HDFS using the WebHDFS REST API:

devtools::install_github("saurfang/rwebhdfs")
library(rwebhdfs)
hdfs <- webhdfs("host", 50070, "username")
write_file(hdfs, "/user/hive/warehouse/flights/000000_0", "flights")

After loading the data, issue the Impala command INVALIDATE METADATA to refresh Impala’s metadata cache:

dbExecute(impala, "INVALIDATE METADATA")

Warnings and Current Limitations

Using implyr with RJDBC

The RJDBC package is not fully DBI-compatible. implyr works around these incompatibilities as best it can.

RJDBC has a crude type handling system: columns and column expressions with numeric types are returned to R as numeric columns, and all other types are returned as character columns. This has undesirable effects; for example, Boolean types are returned as character columns with values "0" for FALSE and or "1" for TRUE.

When using the function copy_to with RJDBC, data types may be modified in unexpected ways. For example, inserted character values may be right-padded with whitespace.

If possible, connect to Impala using the odbc package instead of the RJDBC package.

Row Order

Impala’s data storage and processing does not preserve row order. Impala uses parallel processing and stores data in multiple files, so the the notion of data being stored in sorted order is impractical. This has several important implications for the use of implyr:

  • Rows are not necessarily returned in the same order that they were in when added to Impala. To return rows in a specific order, you must use arrange().
  • If row ordering is applied in an intermediate phase of query processing, Impala may not return the final result in sorted order. To ensure that results are in sorted order, apply arrange() last, after all other dplyr verbs. implyr will issue a warning if you apply arrange() in an earlier step.
  • When using compute() to store results in an Impala table, Impala may not preserve row order. implyr will issue a warning if you use arrange() before compute().

See the Impala ORDER BY documentation for more information.

Temporary Tables

Impala does not support temporary tables. When using compute() to store results in an Impala table, you must set temporary = FALSE. implyr will throw an error if you use compute() but do not set temporary = FALSE.

Missing Values

SQL engines including Impala treat missing values differently than R does. To avoid unexpected results, handle missing values before applying other operations on column values.

Table and Column Names

Impala requires table names and column names to be all lowercase. Currently, implyr does not convert table names and column names to lowercase; you must specify them using all lowercase characters. For information about other limitations on table names and column names, see Overview of Impala Identifiers.

dplyr Support

implyr does not support all dplyr verbs and functions. Some verbs including slice(), sample_n(), and sample_frac() are not supported. Some functions including intersect() and setdiff() are not supported.

If you apply an R function to a lazy tbl and the function is not implemented as a remote method, then implyr will compute the result of the prior steps and return that result to R as a tbl or data frame. It will then compute the function and any later steps locally in R. An example of this is the function lm. Only use this technique when the intermediate result will be small enough to fit in memory in your R session.

The median() function returns a value that is approximately (not necessarily exactly) the median. See APPX_MEDIAN Function.

implyr supports some Impala functions that are not specified by R or by dplyr. See Impala Built-In Functions for more information.

implyr's People

Contributors

bendettasd avatar ianmcook avatar karoliskascenas avatar liudvikasakelis avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

implyr's Issues

use spark with tbl() returned object

Hi Ian -

Thanks for your work on implyr! I created an impala object following these step:

impala <- implyr::src_impala(
      drv = drv,
      dsn = "Impala",
      database = "my_db"
    )
my_impala_tbl <- tbl(impala, in_schema("my_db", "mytable"))
test_df = my_impala_tbl %>% select(colA, colB) %>% filter(colA == "a")

Are there ways that can convert the returned test_df to a spark df without having to collect it to R first?

Thanks!

dbplyr 2.0.0 breaks implyr

Unfortunately I cannot paste any output as I have already downgraded to 1.4.4, but the problem was that "db.tbl" was not being properly escaped when using tbl(impala, "db.tbl") or tbl(impala, in_schema("db", "tbl")).

The only way I was able to query data was with dbGetQuery(impala, "select * from db.tbl") where the query string was probably not being manipulated in any way.

I am quite inexperienced with R, but maybe a quick fix would be to cap the dbplyr version dependency on 1.4.4?

Add Function src_insert

Request a function that allows a tbl_impala, or tbl_lazy etc. to be able to be inserted back into an impala table.
Arguments would be
1: implyr object
2: target table (implyr object?)
3: overwrite = FALSE
4: partition spec (if it cannot be extracted automatically)

Potential Method... just a suggestion:
After running collapse on the implyr object, $ ops contains the SQL for the query. We just need to add the Insert Overwrite or Insert INTO statement at the beginning to make this work.

Greater than 1024 rows of some variable types causes an encoding error

I'm finding if a table has more than 1,024 rows that dbWriteTable has an issue with some variable types.

library(implyr)

connect_impala <- function() {
  src_impala(drv = odbc::odbc(),
             driver = 'Cloudera ODBC Driver for Impala',
             host = 'impala.xxxxxxxxxxxxx',
             port = xxxxx,
             database = 'default',
             AuthMech = 3,
             ssl = 1,
             uid = rstudioapi::askForPassword('Username'),
             pwd = rstudioapi::askForPassword('Password'))
}

cx <- connect_impala()


any_number_you_like <- 0L
set.seed(any_number_you_like)


my_test_data_1024 <- data.frame(X = sample(1024))

dbWriteTable(cx$con,
             Id(schema = 'XXXXXXX', table = 'TEST_1024'), 
             my_test_data_1024)

dbRemoveTable(cx$con, Id(schema = 'XXXXXXX', table = 'TEST_1024'))



my_test_data_1025 <- data.frame(X = sample(1025))

dbWriteTable(cx$con,
             Id(schema = 'XXXXXXX', table = 'TEST_1025'), 
             my_test_data_1025)

The first dbWriteTable is fine, but the second throws an error like this:

Error in result_insert_dataframe(rs@ptr, values, batch_rows) : 
  nanodbc/nanodbc.cpp:1617: HY000: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : AnalysisException: Target table 'XXXXXXX.TEST_1025' is incompatible with source expressions.
Expression 'cast('208À€' as string)' (type: STRING) is not compatible with column 'x' (type: INT)

I've replaced the sample with versions that produce double and logical variables with similar results, but sampling from letters did not produce this error.

use of impala as backend for rmarkdown-sql-codechunks supported?

Today I tried to use Impala as a backend-connection to execute sql-code chunks as shown in https://rmarkdown.rstudio.com/authoring_knitr_engines.html#sql

I ran into some error message, which I will post soon. Is this feature supposed to work by any chance? It would allow sql-syntax-highlighting and increase the readability over dbGetQuery(). :)

I tried

impala <- src_impala(...)
```{sql, connection = impala}
SELECT * FROM trials
`

Out of curiosity: is there a conceptual or implemented difference between DBI::dbConnect() and implyr::src_impala?

dplyr verbs fail on functions with dots in names when no column name is assigned

The following code fails

tbl(impala, "flights") %>% transmute(as.character(year))

with the error

Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: HY000: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : AnalysisException: Syntax error in line 1:
SELECT cast(`year` as string) AS `as`.`character(year)`
                                     ^
Encountered: .
Expected: CROSS, FROM, FULL, GROUP, HAVING, INNER, JOIN, LEFT, LIMIT, OFFSET, ON, ORDER, RIGHT, STRAIGHT_JOIN, UNION, USING, WHERE, COMMA

CAUSED BY: Exception: Syntax error 

This seems to be an implyr-specific problem, because the following works without error:

dbplyr::memdb_frame(year = c(2013L)) %>% transmute(as.character(year))

Resolve this by not adding the backticks around the dot in this case.

Until it's fixed, the workaround is to specify a name for the column:

tbl(impala, "flights") %>% transmute(year_string = as.character(year))

Warning: "object '.__C__impala_connection' not found" when initializien impala connection

Hey Ian,

on a new server with a newly set up impala connection we currently run into the following warning, when initializing the impala connection

library(implyr)

impala <- implyr::src_impala(odbc::odbc(), "IMPALA_DSN")
#> Warning in rm(list = what, pos = classWhere) :
#>  object '.__C__impala_connection' not found

Do you have an idea, what could fix this? It doesn't happen the second time impala <- implyr::src_impala(odbc::odbc(), "IMPALA_DSN") is called within the same session.

Add automated tests

Add automated tests that will run on Travis using an EC2 instance running Impala.

No warning on union_all() of two tbl_impala objects that are arrange()d

To reproduce:

flights_aa <- tbl(impala, "flights") %>% filter(carrier == "AA") %>% arrange(dep_delay)
flights_ua <- tbl(impala, "flights") %>% filter(carrier == "UA") %>% arrange(dep_delay)
union_all(flights_aa, flights_ua)

The resulting rows are not in order by dep_delay, but the warning (Results may not be in sorted order!) is not displayed.

Error in .valueClassTest(ans, "data.frame", "dbFetch")

I'm using implyr 0.2.3, dplyr 0.7.8, dbi 0.7, dbplyr 1.2.1

The following code works

tbl(impala, sql("select * from schema.table")

but the following codes

tbl(impala, sql("select * from schema.table;")
tbl(impala, "schema..table")

returned an error message
Error in .valueClassTest(ans, "data.frame", "dbFetch") : invalid value from generic function ‘dbFetch’, class “character”, expected “data.frame”

and the following code

tbl(impala, "schema.table")
tbl(impala, in_schema('schema', 'table'))

returned another error message
Error: is.character(vars) is not TRUE

Cannot get unicode from impala

library(odbc)
library(dplyr)
library(implyr)
drv <- odbc::odbc()
impala <- src_impala(
drv = drv,
driver = "Cloudera ODBC Driver for Impala",
host = "impala-proxy",
port = 21050,
database = "default",
uid = "username",
pwd = "password"
)
data <- dbGetQuery(impala, "SELECT * FROM ghtk.package_logs limit 10")

Output: Th<ea>m coupon E516188e for unicode string Thêm coupon E516188e

"Unable to locate SQLGetPrivateProfileString function" when connecting

I know this is a long shot and difficult to reproduce, but still I want to point it out, because it might be solved.

When connecting to Impala I get the following error message and no connection is made:

Error: nanodbc/nanodbc.cpp:950: HY000: [unixODBC][Simba][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function.

Now strangely enough, when I connect to a different data source first (sqlite) in the R session I have no problem connecting to Impala afterwords. The following works:

library(implyr); library(odbc)

dbConnect(odbc::odbc(), dsn = "SQLite")

impala <- src_impala(
  odbc::odbc(),
  dsn = "Impala")

But it would not work if the second part does not run. I am in the dark why this is, maybe an environment variable is set correctly by connecting to sqlite first. Lately a colleague had the exact same error on a fresh system, and the same fixed work.

Again, I know it is long shot but maybe @ianmcook or @jimhester can think of a direction of resolving this.

I am on macOS 10.12.6.

My /etc/odbcinst.ini file looks like

[SQLite Driver]
Driver = /usr/local/lib/libsqlite3odbc.dylib

[Cloudera ODBC Driver for Impala]
Driver = /opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib

and my .Renviron file looks like

# ODBC
DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/usr/local/lib
ODBCINI=/etc/odbc.ini
ODBCSYSINI=/etc

Let me know if you need any more information.

dbDisconnect error on CentOS

When executing the code example of connecting to Impala, running a query, and disconnecting as shown in the README there is an error "type must be a single element of type 'character'" when attempting to disconnect. OS: CentOS, Implyr Version: 0.2.2

is there character length limit? : nchar(x) < 32700

I downloaded area json text from impala by implyr package & get broken json text when json text length is more than 32700.
Is there character length limit?

p.s.
data download result with python impyla is normal.

Connections pane supported?

I have been able to successfully connect to Impala with implyr using the odbc-package. So far the connections pane isn't working for me (but src_tbls(impala) did).

Is the connctions-pane-functionality supported through the implyr-package?

copy_to limit size requirement

Hi,
Is there a specific reason for implyr limitation of the size of data.frame that can be inserted into database ?

implyr/R/src_impala.R

Lines 340 to 349 in 32a3e12

if (prod(dim(df)) > 1e3L) {
# TBD: consider whether to make this limit configurable, possibly using
# options with the pkgconfig package
stop(
"Data frame ",
name,
" is too large. copy_to currently only supports very small data frames.",
call. = FALSE
)
}

I did not found any mentions in Impala Guides
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_create_table.html

A DBI::dbWriteTable on con <- dbConnect(odbc::odbc(), "Impala_DSN") has worked. As it is not as specific as implyr:::db_create_table.impala_connection , it would be better to use implyr 📦 over dbplyr 📦

Thanks for advice.

Translating expression with extracted entry.

I think I found a bug, although I cannot be sure that it is still present.
I'll mention it here because it's simple, and if needed and relevant I'll work on a reproducible example.

I am calculating a range of dates by which to filter my main frame. Let's just take the whole weeks.
And using that to filter the table.

> dates_range <- tbl(conn, "events") %>%
    group_by(a_week = floor_date(a_date, "week")) %>% 
    mutate(n_days = n()) %>% 
    select(a_week, a_date) %>% distinct() %>%  collect() %>%
    filter(n_days == 7) %$% 
    range(a_date) %>% as.Date()

> events_query <- tbl(conn, "events") %>% 
    filter(a_date >= dates_range[1]) %>% show_query()

I get

SELECT *
FROM `events`
WHERE (`a_date` >= CASE 
  WHEN (1.0) THEN (('2020-05-04', '2020-06-21')) 
  END)

Thus it's interpreting date_range[1] as a CASE WHEN clause, which does weird things.
While for my purpose I can go around and substitue the date_range value, I thought to write it here and do my share.

I must say, that I wasn't able to download the latest version of implyr since I'm on a work computer, and thus find a possibility of this being fixed already. 😮

Regards from Mexico.

Fix literal string arguments in paste()

The following fails because dbplyr does not quote the literal strings in the SQL:

tbl(impala, "flights") %>%
  transmute(date = paste0(
    as.character(year),
    "-",
    lpad(as.character(month), 2L, "0"),
    "-",
    lpad(as.character(day), 2L, "0")
  )
)

I believe this is a dbplyr issue, but need to check and create a reproducible example. Note that you can't use SQLite to create an easily reproducible example because SQLite uses an infix operator for concatenation.

The workaround is to wrap the literal strings in parens: ("-")

Cannot connect to database SASL error

Whenever I try to connect to the impala database, I keep on getting this error:

Error: nanodbc/nanodbc.cpp:950: HY000: [Cloudera][ThriftExtension] (4) Error occurred while contacting server: EAGAIN (timed out). The connection has been configured to not use SASL for authentication. This error might be due to the server has been configured to use SASL for authentication.

I'm not sure if this is because often we have to connect through SSH but so far I've tried a lot of things unsuccessfully. Any idea of what might be happening here?

Drop db_sql_render method?

dbplyr now automatically drops ORDER BY from subqueries, so I don't think you need this any more, and I think you're the only user of the generic, which means I could (eventually) drop it from dbplyr.

show query progress in R

When executing an Impala query from HUE the progress of the query is shown. It would be a nice-to-have in R as well. Similar to the progress bars from readr for reading large datasets.

screen shot 2017-07-13 at 10 39 39

Implement access to complex columns (ARRAY, MAP, STRUCT)

As a beginner it is not immediatly clear to me, how to best use implyr to access Impala-complex types (especially maps, e.g. pull out a couple of columns and join them with the existing data-frame).

The link in the Readme is helpful (to create dbGetQuery()-requests), but a short example - possibly showing dplyr-logic - would be really cool as well. :)

Fix detection of ORDER BY in subqueries with dplyr 0.7.0

implyr issues a warning if you apply arrange() before other verbs. This is because Impala may not return the final result in sorted order if row ordering is applied in an intermediate phase of query processing. But this warning is often a false positive, especially with the new SQL optimizer in dplyr 0.7.0. Need to stop these false positive warnings.

ODBC ERROR: while contacting server - SSL connection

Good Morning,

i'm trying to connect to Impala, but i received:
Error: nanodbc/nanodbc.cpp:950: HY000: [Cloudera][ThriftExtension] (5) Error occurred while contacting server: No more data to read.. This could be because you are trying to establish a non-SSL connection to a SSL-enabled server.

My connection code is:
impala <- src_impala( drv = drv, driver = "Cloudera ODBC Driver for Impala", host = "server.ita.it", port = 21050, database = "dwh", uid = userId, pwd = password )
i thought the problem was to check SSL connection, but my cloudera ODBC have check "ENABLE SSL" on setting and if i connect to IMPALA with DBI library i haven't . DBI connection code:
conHD <- DBI::dbConnect(odbc::odbc(), SERVER_NAME_HDP, # CLODERA ODBC NAME uid = userIdHdp, pwd = passwordHdp)

where am I wrong?

R session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C LC_TIME=Italian_Italy.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] odbc_1.1.6 implyr_0.3.0 dplyr_0.8.3 DBI_1.0.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 zeallot_0.1.0 dbplyr_1.4.2 crayon_1.3.4 assertthat_0.2.1 R6_2.4.0 backports_1.1.5 magrittr_1.5 pillar_1.4.2
[10] rlang_0.4.0 rstudioapi_0.10 blob_1.2.0 vctrs_0.2.0 tools_3.6.1 bit64_0.9-7 glue_1.3.1 bit_1.1-14 purrr_0.3.2
[19] hms_0.5.1 compiler_3.6.1 pkgconfig_2.0.3 tidyselect_0.2.5 tibble_2.1.3
with:
Cloudera ODBC Driver for Impala version 2.5.59

thanks in advance
have nice day
MC

src_impala from existing connection object

Hello,
would it be possible to simply the connection process, and initialize src_impala from an existing connection object?

For example, I already have an odbc connection initialized like this:

conn = odbcConnect("impaladsn")

Is there a way to call src_impala on this object directly? For example, src_impala(conn)?

Thanks in advance.

Test with HBase

A useR!2017 attendee asked after the implyr talk whether implyr could be used with HBase, specifically to write a table to HBase. Create automated tests to check if this is currently possible.

java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.

We are using implyr to connect to impala via JDBC. Usually this works fine, but in ca 25% of the attempt trying to establish the impala connection fails, we get at least three different errors that look a lot like each other:

for(i in 1:100){
  impala <- CreateImpalaConnection(SSLTrustStorePwd = SSLTrustStorePwd)
}

Error in .jcall(conn@jc, "V", "close") : 
  java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.
Error in .jcall(conn@jc, "Ljava/sql/Statement;", "createStatement") : 
  java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.
Error in .jcall(.rJava.class.loader, "[Ljava/lang/String;", "getClassPath") : 
  java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.

I noticed via traceback() that the errors are in different locations:

  • The '"[Ljava/lang/String;", "getClassPath")' error is in the .jinit function:
    6: stop(list(message = "java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.", call = .jcall(.rJava.class.loader, "[Ljava/lang/String;", "getClassPath"), jobj = new("jobjRef", jobj = <pointer: 0xac0e2b8>, jclass = "java/sql/SQLNonTransientConnectionException"))) 5: .jcheck() 4: .jcall(.rJava.class.loader, "[Ljava/lang/String;", "getClassPath") 3: .jclassPath() 2: .jinit(classpath = impala_classpath, force.init = TRUE)

  • The "Ljava/sql/Statement;", "createStatement" is within the dbConnect function (https://github.com/ianmcook/implyr/blob/master/R/src_impala.R#L126)
    13: .getClassesFromCache(Class) 12: getClassDef(classi, where = where) 11: validObject(.Object) 10: initialize(value, ...) 9: initialize(value, ...) 8: new("jobjRef", jobj = r, jclass = substr(returnSig, 2, nchar(returnSig) - 1)) 7: new("jobjRef", jobj = r, jclass = substr(returnSig, 2, nchar(returnSig) - 1)) 6: .jcall("java/sql/DriverManager", "Ljava/sql/Connection;", "getConnection", as.character(url)[1], as.character(user)[1], as.character(password)[1], check = FALSE) 5: .local(drv, ...) 4: dbConnect(drv, ...) 3: dbConnect(drv, ...) 2: src_impala(drv = drv, ....)

  • The "conn@jc, "V", "close"" error is in the db_disconnector function: https://github.com/ianmcook/implyr/blob/master/R/src_impala.R#L506
    and seems to occur less often when I set auto_disconnect to FALSE.

Are these errors known, and what can we do about it? We now wrapped the CreateImpalaConnection function in a try catch loop and try multiple times, but that is not the desired way to do it.

The function CreateImpalaConnection is defined as:

CreateImpalaConnection <- function(SSLTrustStorePwd){
  #Function to create an impala connection
  
  #Imapala settings
  impala_classpath <- "file_path_of_impala_driver"
  
  .jinit(classpath = impala_classpath, force.init = TRUE)
  drv <- JDBC(
    driverClass = "com.cloudera.impala.jdbc4.Driver",
    classPath = impala_classpath,
    identifier.quote = "`"
  )
  impala <- src_impala(
    drv = drv,
    paste0(
      "jdbc:impala://serveradress:21051;",
      "AuthMech=1;KrbRealm=ourdomain;KrbHostFQDN=serveradress;KrbServiceName=impala;",
   "SSL=1;SSLTrustStore=filepath_to_jssecacerts;SSLTrustStorePwd=",SSLTrustStorePwd, ";",
      "CAIssuedCertNamesMismatch=1;"
    ),
    auto_disconnect = FALSE
  )
  return(impala)
}

Add automated tests using odbc

The automated tests currently use RJDBC. Add support for tests using odbc. Use multiple Travis CI jobs to run the same tests using either RJDBC or odbc based on an environment variable.

Test and document limitations for distinct() or unique()

A useR!2017 attendee asked after the implyr talk about what Impala's limitations for using DISTINCT were. Older versions of Impala (before version 2.0) allowed only one DISTINCT clause per query. Newer versions have removed this limitation. But the current version of Impala does have the limitation that you cannot use DISTINCT in more than one aggregation function in the same query. Write tests to check these behaviors in implyr, and document the practical implications of these limitations for implyr users who are using the distinct() verb or the unique() function.

Unable to find an inherited method

Hi,

I'm trying to use implyr with Kerberos auth, connection itself works fine,

src_tbls(impala)

is returning list of tables, but when I'm trying to use tbl function to get any table I get:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbSendQuery’ for signature ‘"impala_connection", "sql"’

Canceling a (long-running) query

I use the implyr-package frequently. Sometimes I realize only after I have sent a query, that it will take (too) long to take the results.

I havent yet found the best way to cancel a query. The Stop-button in Rstudio doesn't do the trick, same with Ctrc-c. I sometimes terminate Rstudio or Rstudio Server, if a query takes too long.

What are the mechanics that make it difficult to cancel a query? What approach would you suggest?

Dbplyr joins fail with default suffixes

To reproduce, join two tables that have column names in common but aren't the join key. Default suffixes (".x" and ".y") are attempted, but obviously column names in Impala (most SQLs I suspect) can't contain dots, so errors are thrown.
Default suffixes should be changed to something valid like "_x" and "_y" when we know we're joining Impala-backed tables.

Add rJava to Suggests

On 2018-05-18, Kurt Hornik emailed the following:

These [packages including implyr] seem to have undeclared package dependencies in their unit test code (R files in tests subdirs), see below.

Can you pls fix as necessary? (Add the missing package dependencies to
Suggests, I guess.)

Please note that these issues are currently not yet detected by the CRAN
incoming (or regular) checks.

$implyr
'::' or ':::' import not declared from: ‘rJava’

connection success but query failed

hi ,I used implyr and RJDBC to connect R and impala succed, but when I use dbGetQuery to select some data from database, I got en error;

the sql was below:

testdata <- dbGetQuery(
impala,
"SELECT * FROM jolly.who_orderinfo limit 10"
)

and the error was below:

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT * FROM jolly.who_orderinfo limit 10 ([Simba]ImpalaJDBCDriver ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:AuthorizationException: User '' does not have privileges to execute 'SELECT' on: jolly.who_orderinfo
), Query: SELECT * FROM jolly.who_orderinfo limit 10.)

can anyone tell me why and how can I solve this problems, thanks!

compute() table to parquet?

Using the compute() method on an implyr table currently stores it as TEXT, creating big files which usually end up scattered across several worker nodes.

Is there an option to store them as Parquet instead, or at least compressed gz?

Thanks

copy_to dimension limit

Is there any reason why this limitation exists? I've rebuilt the package without it and was successfully able to write a 50000x5 table.

Error when join key column name is the same in both tables

When using a join function to join two Impala tables on a key column that has the same name in both tables, one of two errors can occur:

  • If the join result is collected immediately, then dplyr gives the error "Each variable must have a unique name."
  • If the join result is not collected immediately, then Impala gives the error "Duplicated inline view column alias".

This is due to a behavior of Impala and its ODBC and JDBC drivers in which duplicate column names are not disambiguated. See IMPALA-421.

You can reproduce this error by attempting to join the flights and airlines tables from nycflights13:

flights_tbl <- tbl(impala, "flights")
airlines_tbl <- tbl(impala, "airlines")
inner_join(flights_tbl, airlines_tbl, by = "carrier")
inner_join(flights_tbl, airlines_tbl, by = "carrier") %>% collect()

A workaround is to rename the key column in one of the tables before joining:

airlines_tbl <- airlines_tbl %>% rename(airlines.carrier = carrier)
inner_join(flights_tbl, airlines_tbl, by = c("carrier" = "airlines.carrier"))

See if there is a way for implyr to work around this behavior.

Some other SQL engines work around this problem by prepending tablename. to the name of every column in a join.

no existing definition for function ‘dbSendQuery’

I met the "no existing definition for function 'dbSendQuery'" error when I use src_impala like

impala = implyr::src_impala(odbc::odbc(), dsn = "some_dsn_name")

The DSN works well , like
odbc::dbConnect(odbc::odbc(), dsn = "some_dsn_name") %>% odbc::dbGetQuery("some_sql_query")

I don't have any trouble when using JDBC connectivity, but I'd prefer using ODBC as ODBC doesn't require the administrator privilege for Kerberos.

Thank you for your help!

My SessionInfo:
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] odbc_1.0.1 magrittr_1.5 R6_2.2.2 assertthat_0.2.0 RevoUtils_10.0.3
[6] DBI_0.7 tools_3.3.3 glue_1.1.0 dplyr_0.7.0 tibble_1.3.3
[11] implyr_0.2.0 Rcpp_0.12.11 blob_1.0.0 rlang_0.1.1 dbplyr_1.0.0

error when using dbSendQuery on impala connection

I am not sure if this is supported. I was trying to use that to fetch resultset in batches rather than just one shot to avoid memory footprint. dbQuery and other methods work fine. code is like the following:

library(implyr) # this is the only line for library loading
library(RJDBC)

impala <- src_impala(...) #I am using JDBC
dbSendQuery(impala, "select * from tbl1")

Error in (function (classes, fdef, mtable):
unable to find an inherited method for function 'dbSendQuery' for signature ''src_impala", "character"'

session info on windows 64bit:
R 3.4.3
rJava_0.9_9 DBI_0.8 implyr_0.2.2 dplyr_0.7. RJDBC_0.2_7

I saw the other issue you fixed the loading sequence and namespace issue with DBI so I didn't load DBI explicitly.

Thanks.

Unable to handle database name with `tbl()` function

It would be nice to handle database name with tbl() function. In current version of implyr, it returns an error as follows:

> airports <- tbl(impala, "u_ariga.airports_pq")
Error in new_result(connection@ptr, statement) : 
  nanodbc.cpp:1344: HY000: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : AnalysisException: Could not resolve table reference: 'u_ariga.airports_pq'

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.