Giter Site home page Giter Site logo

dplyr.teradata's Introduction

A Teradata Backend for dplyr

Koji Makiyama (@hoxo-m)

Travis-CI Build Status CRAN Version Coverage Status

1. Overview

The package provides a Teradata backend for dplyr.

It makes it possible to operate Teradata Database in the same way as manipulating data frames with dplyr.

library(dplyr.teradata)

# Establish a connection to Teradata
con <- dbConnect(odbc(), 
                 driver = "{Teradata Driver}", DBCName = "host_name_or_IP_address",
                 uid = "user_name", pwd = "*****")
my_table <- tbl(con, "my_table_name")

# Build a query
q <- my_table %>% 
  filter(between(date, "2017-01-01", "2017-01-03")) %>% 
  group_by(date) %>%
  summarise(n = n()) %>%
  arrange(date)

show_query(q)
#> <SQL>
#> SELECT "date", count(*) AS "n"
#> FROM "my_table_name"
#> WHERE ("date" BETWEEN '2017-01-01' AND '2017-01-03')
#> GROUP BY "date"
#> ORDER BY "date"

# Send the query and get its result on R
df <- q %>% collect
df
#> # A tibble: 3 x 2
#>          date        n
#>        <date>    <int>
#>  1 2017-01-01   123456
#>  2 2017-01-02  7891011
#>  3 2017-01-03 12131415

2. Installation

You can install the dplyr.teradata package from CRAN.

install.packages("dplyr.teradata")

You can also install the development version of the package from GitHub.

install.packages("remotes") # if you have not installed "remotes" package
remotes::install_github("hoxo-m/dplyr.teradata")

The source code for dplyr.teradata package is available on GitHub at

3. Motivation

The package provides a Teradata backend for dplyr. It makes it possible to build SQL for Teradata Database in the same way as manipulating data frames with the dplyr package. It also can send the queries and then receive its results on R.

Therefore, you can complete data analysis with Teradata only on R. It means that you are freed from troublesome switching of tools and switching thoughts that cause mistakes.

4. Usage

The package uses the odbc package to connect database and the dbplyr package to build SQL.

First, you need to establish an ODBC connection to Teradata. See:

# Establish a connection to Teradata
con <- dbConnect(odbc(), 
                 driver = "{Teradata Driver}", DBCName = "host_name_or_IP_address",
                 uid = "user_name", pwd = "*****")

Second, you need to specify a table to build SQL. See:

To specify a table, you can use tbl():

# Getting table
my_table <- tbl(con, "my_table_name")

# Getting table in schema
my_table <- tbl(con, in_schema("my_schema", "my_table_name"))

Third, you build queries. It can do in the same way as manipulating data frames with dplyr:

For example, you can use follows:

  • mutate() adds new columns that are functions of existing columns.
  • select() picks columns based on their names.
  • filter() picks rows based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
# Build a query
q <- my_table %>% 
  filter(between(date, "2017-01-01", "2017-01-03")) %>% 
  group_by(date) %>%
  summarise(n = n()) %>%
  arrange(date)

n() is a function in dplyr to return the number of rows in the current group but here it will be translated to count(*) as a SQL function.

If you want to show built queries, use show_query():

show_query(q)
#> <SQL>
#> SELECT "date", count(*) AS "n"
#> FROM "my_table_name"
#> WHERE ("date" BETWEEN '2017-01-01' AND '2017-01-03')
#> GROUP BY "date"
#> ORDER BY "date"

Finally, you send built queries and get its results on R using collect().

# Send the query and get its result on R
df <- q %>% collect
df
#> # A tibble: 3 x 2
#>          date        n
#>        <date>    <int>
#>  1 2017-01-01   123456
#>  2 2017-01-02  7891011
#>  3 2017-01-03 12131415

5. Translatable functions

The package mainly use dbplyr to translate manipulations into queries.

Translatable functions are the available functions in manipulations that it can translate into SQL functions.

For instance, n() is translated to count(*) in the above example.

To know translatable functions for Teradata, refer the following:

Here, we introduce the special translatable functions that it becomes available by dplyr.teradata.

5.1. Treat Boolean

Teradata does not have the boolean data type. So when you use boolean, you need to write some complex statements. The package has several functions to treat it briefly.

bool_to_int transforms boolean to integer.

mutate(is_positive = bool_to_int(x > 0L))
#> <SQL> CASE WHEN (`x` > 0) THEN 1 WHEN NOT(`x` > 0) THEN 0 END

count_if() or n_if() counts a number of rows satisfying a condition.

summarize(n = count_if(x > 0L))
#> <SQL> SUM(CASE WHEN (`x` > 0) THEN 1 WHEN NOT(`x` > 0) THEN 0 END)

5.2. to_timestamp()

When your tables has some columns stored UNIX time and you want to convert it to timestamp, you need to write complex SQL.

to_timestamp() is a translatable function that makes it easy.

mutate(ts = to_timestamp(unixtime_column))

Such as above manipulation is translated into SQL like following:

#> <SQL> CAST(DATE '1970-01-01' + (`unixtime_column` / 86400) AS TIMESTAMP(0)) + (`unixtime_column` MOD 86400) * (INTERVAL '00:00:01' HOUR TO SECOND)

5.3. cut()

cut() is very useful function that you can use in base R.

For example, you want to cut values of x into three parts of ranges by break points 2 and 4:

x <- 1:6
breaks <- c(0, 2, 4, 6)
cut(x, breaks)
#> [1] (0,2] (0,2] (2,4] (2,4] (4,6] (4,6]
#> Levels: (0,2] (2,4] (4,6]

dplyr.teradata has a translatable function similar to this:

breaks = c(0, 2, 4, 6)
mutate(y = cut(x, breaks))

In the result, it is translated to a CASE WHEN statement as follows:

#> <SQL> CASE
#>  WHEN x > 0 AND x <= 2 THEN '(0,2]'
#>  WHEN x > 2 AND x <= 4 THEN '(2,4]'
#>  WHEN x > 4 AND x <= 6 THEN '(4,6]'
#>  ELSE NULL
#> END

Arguments of base cut() are also available:

breaks = c(0, 2, 4, 6)
mutate(y = cut(x, breaks, labels = "-", include.lowest = TRUE))
#> <SQL> CASE
#>  WHEN x >= 0 AND x <= 2 THEN '0-2'
#>  WHEN x > 2 AND x <= 4 THEN '3-4'
#>  WHEN x > 4 AND x <= 6 THEN '5-6'
#>  ELSE NULL
#> END

6. Miscellaneous

6.1. Sampling Data

Teradata supports sampling rows from tables:

and dplyr has the same purpose verb slice_sample(). The package makes them work well.

For example, by the number of rows:

q <- my_table %>% slice_sample(n = 100L)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE RANDOMIZED ALLOCATION 100

or by the proportion of rows:

q <- my_table %>% slice_sample(prop = 0.1)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE RANDOMIZED ALLOCATION 0.1

It also supports sampling with replacement:

q <- my_table %>% slice_sample(n = 100L, replace = TRUE)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE WITH REPLACEMENT RANDOMIZED ALLOCATION 100

and supports a random sample stratified by AMPs (it is much faster, especially for very large samples):

q <- my_table %>% slice_sample(n = 100L, randomized_allocation = FALSE)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE 100

The package currently supports the verbs old versions.

# By the number of rows
q <- my_table %>% sample_n(100L)
# By the proportion of rows
q <- my_table %>% sample_frac(0.1)

6.2. blob_to_string()

The blob objects from databases sometimes prevents manipulations with dplyr.

You might want to convert them to string.

blob_to_string() is a function to make it easy:

x <- blob::as_blob("Good morning")
x
#> <blob[1]>
#> [1] blob[12 B]

# print raw data in blob
x[[1]]
#>  [1] 47 6f 6f 64 20 6d 6f 72 6e 69 6e 67

blob_to_string(x)
#> [1] "476f6f64206d6f726e696e67"

7. Related work

dplyr.teradata's People

Contributors

hoxo-m avatar jimhester avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

dplyr.teradata's Issues

issues with summarise

I have the feeling that our companies version of teradata is just old and so it is going to cause problems regardless. But when I try to summarise without a group_by I always get the error

library(dplyr.teradata)
library(tdplyr)
library(bcbsncR)
con <- DBI::dbConnect(odbc::odbc(), "<internal DSN>", timeout = 10)

copy_to(con,mtcars,"mtcars", temporary = T)

tbl(con,"mtcars") %>% summarise(mean(gears))
#> Warning: Missing values are always removed in SQL.
#> Use `mean(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> Error in new_result(connection@ptr, statement, immediate): nanodbc/nanodbc.cpp:1374: HY000: [Teradata][ODBC Teradata Driver][Teradata Database](-5628)Column NANA not found in temp.mtcars.

tbl(con,"mtcars") %>% summarise(mean(gears)) %>% show_query()
#> <SQL>
#> SELECT NANA, AVG("gears") AS "mean(gears)"
#> FROM "mtcars"
#> GROUP BY NANA

Created on 2021-02-25 by the reprex package (v0.3.0)

Has anyone else ever run across this problem? I can generally get around it by group_by(flag = 1) but there are times where the false grouping occurs deep in packages that I can't preempt.

Unable to install the package -

devtools::install_github('hoxo-m/dplyr.teradata')
(skipping a few compilation msg)
.... 
installing to /Library/Frameworks/R.framework/Versions/3.4/Resources/library/odbc/libs
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (odbc)
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  '/private/var/folders/72/j1qmrjt11cg7tsg0byvjkjh00000gp/T/RtmpJNjKGn/devtools1ec27ad81c3c/hoxo-m-dplyr.teradata-0a12e75'  \
  --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' --install-tests 

* installing *source* package ‘dplyr.teradata’ ...
Warning in as.POSIXlt.POSIXct(x, tz) :
  unknown timezone 'default/America/Chicago'
** R
** tests
** preparing package for lazy loading
Error in setMethod("dbUnQuoteIdentifier", c("Teradata", "SQL"), getMethod("dbUnQuoteIdentifier",  : 
  no existing definition for function ‘dbUnQuoteIdentifier’
Error : unable to load R code in package ‘dplyr.teradata’
ERROR: lazy loading failed for package ‘dplyr.teradata’
* removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/dplyr.teradata’
Installation failed: Command failed (1)
Warning message:
In strptime(x, fmt, tz = "GMT") :
  unknown timezone 'default/America/Chicago'

`count_if()`

I want count_if(cond) that is a short version of sum(ifelse(cond, 1L, 0L)).

Or n_if()

Support DSN

dbConnect(dsn = "my_dsn")

Now

DSN is unsupported yet.

compute functionality

in dplyr you can create a temporary table in database with the compute.

in teradata, one uses the VOLATILE keyword instead of TEMPORARY.

Could this be added to this package to allow for the creation of volatile/temporary tables in database?

copy_to functionality

Hello,
Does the teradata backend allow for dplyr's copy_to to function?

After establishing a connection to the database, the following code fails:

testDB<-mtcars
copy_to(con, testDB, 'WORK_DATABASE.testDB')
Error in db_data_type.RODBC(dest$con, df) : Unimplemented

Is this a general issue or specific to my system setup? What is the SQL query that copy_to refers to (e.g. CREATE TABLE)?

paste0 in mutate

I am not sure if this should be reported here.

I am trying to create a new column from a date and time column :

mutate (dtime = paste0(date_col, time_col))

but I get this error message:

Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Teradata][ODBC Teradata Driver][Teradata Database] Syntax error: expected something between '(' and a string or a Unicode character literal.

slice_sample() is now broken due to dbplyr package update on 20220604

When I installed R in new machine on 20220609 and run code including function slice_sample(), an error occur saying op_single() function is not found.

Turns out that the dependent package dbplyr updated on 20220604 to version 2.2.0 which decommissioned op_single() function which was available from previous version 2.1.1

Cannot work with dplyr 1.0.0

Sorry now fixing...
Please use the old version.

install.packages("https://cran.r-project.org/src/contrib/Archive/dplyr/dplyr_0.8.5.tar.gz", repos = NULL, type = "source")

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.