
dbtest's Introduction

dbtest


Overview

dbtest uses testthat to automate testing of dbplyr translations by running the tests against live database connections. Tests are written in YAML files, and connections are either passed explicitly, read from system DSNs, or configured in YAML files. Further, it is possible to execute a test suite against multiple database connections.

Install

You can install the latest version of dbtest from GitHub:

devtools::install_github("rstudio/dbtest")

Connection

The first step in using dbtest is to set up a DBI connection object. There are several ways to do this.

DSN

Sometimes a system already has a handful of DSNs (Data Source Names) set up that make connecting easy. With plain DBI, the connection might look like DBI::dbConnect(odbc::odbc(), "My Data Source"). On Linux, DSNs are usually defined in /etc/odbc.ini or ~/.odbc.ini. If you have DSNs defined on your system, you can use them by passing the value "dsn" as your connection; dbtest will use all of your DSNs and execute tests against each of them.

dbtest::test_database("dsn")

YAML File

Every database has different connection parameters. To make database connections easy to automate, dbtest reads a YAML file and passes the named parameters into dbConnect to create a DBI connection. Note that the config package is used, so you must label the set of connections and select it by setting R_CONFIG_ACTIVE=mylabel; otherwise, the default heading is selected. An example might look like:

default:
  pg:
    drv: !expr odbc::odbc()
    Driver: PostgreSQL
    Host: postgres.example.com
    Port: 5432
    Database: postgres
    UID: user
    PWD: password

  mssql:
    drv: !expr odbc::odbc()
    Driver: SQLServer
    Server: mssql.example.com
    Port: 1433
    UID: user
    PWD: password

  oracle:
    drv: !expr odbc::odbc()
    Driver: Oracle
    Host: oracle.example.com
    Port: 1521
    SVC: xe
    UID: user
    PWD: password

Notice that the parameter names differ between the databases; they correspond to the parameters each database provider expects in the dbConnect function. Notice also that drv is !expr odbc::odbc(). This allows execution of R code to provide the DBI driver needed to support the connection.

You can test this behavior and create connection objects manually with:

cfg <- config::get(file = "./path/to/conn.yml")
do.call(dbConnect, cfg$pg)
do.call(dbConnect, cfg$mssql)
do.call(dbConnect, cfg$oracle)

Or you can use the config file and dbtest to execute tests against all of these database connections with:

dbtest::test_database("./path/to/conn.yml")

DBI Connection

The most straightforward way to interactively use dbtest is to provide a DBI connection object directly to dbtest::test_database.

con <- DBI::dbConnect(odbc::odbc(), "My DSN")
dbtest::test_database(con)

tbl_sql

If you are familiar with dbplyr and already have a tbl_sql object (which combines a DBI connection object with a reference to a database table), you can pass that object to test_database as well. In this case, tests will be executed directly against that tbl_sql object.

con <- DBI::dbConnect(odbc::odbc(), "PostgreSQL")
DBI::dbWriteTable(con, "mytesttable", iris)
my_tbl_sql <- dplyr::tbl(con, "mytesttable")
dbtest::test_database(my_tbl_sql)
DBI::dbDisconnect(con)

Usage

Once you have decided how you will provide connection objects to dbtest, usage is fairly straightforward: call test_database. test_database is also vectorized, to make testing multiple databases easier.

test_database takes the following arguments:

  • datasource = a data source object used for connecting to a database (as described above)
  • tests = a list or character vector of YAML files from which tests will be sourced. See the example test files below, or list the test files included with dbtest by executing dbtest::all_tests()

If you want to use specific test files included in dbtest, you can reference them explicitly with dbtest::pkg_test("character-basic.yml"), for instance. This is what we will do for ease of use.
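
For example, to list everything that ships with the package and resolve a single bundled file to its installed path:

# All test files bundled with dbtest
dbtest::all_tests()

# Full path to one bundled test file
dbtest::pkg_test("character-basic.yml")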

Finally, dbtest provides reporting functions that make it easier to analyze and explore the results of your tests. This is where the test suite pays off in improving the development process, increasing quality, and ensuring reliability.

test_output <- dbtest::test_database(
  "conn.yml",
  dbtest::pkg_test(c("character-basic.yml", "math-basic.yml"))
)
## ...............EEEEE.E.....................................................
## ...............EEEEE.E.....................................................
## ..E....E....E..EEEEE.EE....E....E....E....E....E....E....E....E..EEEEE..E..
dbtest::plot_summary(test_output)
## (summary plot output)

dbtest::plot_tests(test_output)
## $`character-basic`
## (plot output)

## $`math-basic`
## (plot output)

If you want more specific details about a particular failure, you can use get_dbtest_detail:

dbtest::get_dbtest_detail(test_output)
## # A tibble: 75 x 7
##    test          pg     pg_raw mssql mssql_raw oracle           oracle_raw
##    <chr>         <chr>  <list> <chr> <list>    <chr>            <list>    
##  1 mutate: tolo… succe… <list… succ… <list [1… success          <list [1]>
##  2 "filter: tol… succe… <list… succ… <list [1… success          <list [1]>
##  3 summarize: n… succe… <list… succ… <list [1… "nanodbc/nanodb… <list [1]>
##  4 group_by: to… succe… <list… succ… <list [1… success          <list [1]>
##  5 arrange: tol… succe… <list… succ… <list [1… success          <list [1]>
##  6 mutate: toup… succe… <list… succ… <list [1… success          <list [1]>
##  7 "filter: tou… succe… <list… succ… <list [1… success          <list [1]>
##  8 summarize: n… succe… <list… succ… <list [1… "nanodbc/nanodb… <list [1]>
##  9 group_by: to… succe… <list… succ… <list [1… success          <list [1]>
## 10 arrange: tou… succe… <list… succ… <list [1… success          <list [1]>
## # ... with 65 more rows

Writing Test Files

Writing test files in YAML can feel a bit strange, because what dbtest expects is plain text. For instance, do not use the !expr trick that the config package uses above for a connection object. Instead, you specify a verb and then arbitrary text that will be interpreted as R code. This text gets picked up by the testing process, which will do the following (a rough sketch follows the list of supported verbs below):

  • ensure that test data is set up properly (on most connections, this results in a temporary table)
  • build a dplyr chain focused on the verb you selected
  • insert your arbitrary text into the selected verb
  • execute the dplyr chain against the database
  • execute the dplyr chain against a local copy of the same data
  • compare the outputs using testthat::expect_equal

Currently supported verbs are:

  • summarise / summarize
  • mutate
  • arrange
  • filter
  • group_by
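
As a rough illustration only (this is not dbtest's actual implementation), a single verb/text pair from the YAML conceptually expands to a comparison like the following, where local_df and remote_tbl are assumed to hold the same test data locally and in the database:

library(dplyr)
library(testthat)

# Hypothetical helper (not dbtest's real internals): apply `verb` with
# the user-supplied `text` interpreted as R code
apply_verb <- function(.data, verb, text) {
  expr <- rlang::parse_expr(text)
  switch(verb,
    mutate    = mutate(.data, result = !!expr),
    filter    = filter(.data, !!expr),
    summarise = summarise(.data, result = !!expr),
    group_by  = summarise(group_by(.data, grp = !!expr), n = n()),
    arrange   = arrange(.data, !!expr),
    stop("unsupported verb: ", verb)
  )
}

# Run the same manipulation remotely and locally, then compare
remote_out <- collect(apply_verb(remote_tbl, "mutate", "tolower(fld_character)"))
local_out  <- apply_verb(local_df, "mutate", "tolower(fld_character)")
expect_equal(as.data.frame(local_out), as.data.frame(remote_out))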

Example

An example might be most illustrative. Let's say that we want to test the base R functions tolower and toupper and how they get translated into SQL.

First, we would define a test YAML file like:

/tmp/RtmpQ7gwyi/test-file.yml

- test-tolower:
    mutate: tolower(fld_character)
    group_by: tolower(fld_character)
- test-toupper:
    mutate: toupper(fld_character)
    group_by: toupper(fld_character)

When executed against databases, it might look like:

test_results <- dbtest::test_database("conn.yml", test_file)
## ....
## ....
## ....
dbtest::plot_tests(test_results)[[1]]
## (plot output)


dbtest's Issues

Add support for arbitrary tests / multiple datasets

I think this adds to the complication of the tests, but I thought it would be worthwhile to allow testing things like left_join, right_join, full_join, and other multi-dataset operations. To do this, we may need to increase the dataset count to two.

However, the more useful feature (and, with the right dataset, maybe one that renders multiple datasets irrelevant) would be to allow arbitrary expressions in the YAML test. Then you would just need join-able columns in your test dataset.

I.e. maybe something like the following in run_test:

if (verb == "verbatim") manip <- f    # or some other rlang magic

And how it might be used...

- left_join:
    verbatim: . %>% left_join(., by=c("col1","col2"))
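
For reference, magrittr already makes text like this callable once evaluated, so the verbatim idea could work roughly like this hypothetical sketch:

# With magrittr attached, evaluating ". %>% f()" yields a functional
# sequence, which is itself callable
library(magrittr)
manip <- eval(parse(text = ". %>% head(2)"))
manip(iris)  # equivalent to head(iris, 2)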

Something in the tests ruins the connection object

Not sure what is happening here. Something with a transaction not being cleaned up properly. Maybe problems with the rollback?

con <- dbConnect(odbc(), "pg")
test_results <- test_single_database(con)
# ...F....F....F....F....FF...F....F.F.EEEEEE
dbGetQuery(con, "select * from information_schema.tables")
#Error in new_result(connection@ptr, statement) : 
#  nanodbc/nanodbc.cpp:1344: 25P02: [RStudio][PostgreSQL] (30) Error occurred while trying to execute a query: [SQLState 25P02] ERROR:  current transaction is aborted, commands ignored until end of transaction block
 
con
#<OdbcConnection> postgres@postgres
#  Database: postgres
#  PostgreSQL Version: 10.2.0

Bad YAML test syntax throws cryptic error in `plot_tests`

If the YAML syntax is bad, then the test suite will run, but the format is different enough that plot_tests will fail with a cryptic error message that is very challenging to debug.

 Error in data.frame(file = test$file, context = context, test = test$test,  : 
  arguments imply differing number of rows: 1, 2 

The YAML that generated the bad test results (for repro):

- test-tolower:
    mutate: tolower(fld_character)
    group_by: tolower(fld_character)
  test-toupper:
    mutate: toupper(fld_character)
    group_by: toupper(fld_character)

You can generate that YAML file with:

test_file <- fs::path_temp("test-file.yml")
yaml::write_yaml(
  list(
    setNames(
    list(
      list(
        "mutate" = "tolower(fld_character)"
        , "group_by" = "tolower(fld_character)"
      )
      , list(
        "mutate" = "toupper(fld_character)"
        , "group_by" = "toupper(fld_character)"
      )
    )
    , c("test-tolower", "test-toupper")
  ))
  , file = test_file
)

And then that just needs to be executed as follows:

test_results <- dbtest::test_databases("conn.yml", test_file)
dbtest::plot_tests(test_results)

The fix would be a better error message, better recovery from bad YAML tests, or something of the like.

Add list of packages to attach for a test

A test should be able to declare a set of packages that are automatically attached when it runs.

Either that, or we should add explicit stringr:: package references in the tests themselves. Not sure which is preferable. It seems silly for the tests to fail when stringr is available, for instance.

Document the preferred approach and then implement it for stringr!

Define test helpers for more thorough testing

Specifically, it would be nice to have some of the standard con objects available for every test, so we are not having to re-build them each time.

Further, it would be great to have some wrappers for determining whether DSNs are available, whether Postgres is available (and where that configuration would be stored, etc.). This way, we could easily skip tests when infrastructure is not available (i.e. on CRAN), but have a more thorough testing suite when developing as well.
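
A minimal sketch of one such helper, built on the odbc package (the function name is hypothetical):

# Skip a test when the named DSN is not configured on this system
skip_if_no_dsn <- function(dsn) {
  sources <- odbc::odbcListDataSources()
  if (!dsn %in% sources$name) {
    testthat::skip(paste("DSN", dsn, "is not available"))
  }
}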

Add support for multiple test YAML files

At present, we pull from a single test YAML file. It might be nice to allow the user to specify a directory or a list of YAML files so that a subset of tests can be run easily. This is similar to the way that devtools::test("myfeature") allows easy sub-selection of test files.

There is a related thought of testing section headers within files, but I think we should follow testthat conventions on this. I believe testthat only allows selection at the file level.

Implement JUnit reporter

The JUnit reporter already exists in the testthat package! We should make it easy to use, since that will let Jenkins jobs produce nice Jenkins-readable output.
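
Since testthat reporters can already write to a file, wiring this up might look something like the following sketch (the dbtest integration itself does not exist yet):

# Produce JUnit XML that Jenkins can consume (requires the xml2 package)
testthat::test_dir(
  "tests/testthat",
  reporter = testthat::JunitReporter$new(file = "junit-results.xml")
)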

Add a lightweight way to `get_details`

It is tricky right now to debug statements and figure out what happened... until we implement #18, we should at least have a vectorized way to get details for test results given a dbtest_results object, a context, verb, etc.

BONUS: if we can generate the code for the SQL using the same internals that we use for testing... that would be awesome too!
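
On the bonus point, dbplyr can already render the SQL for a lazy chain without executing it, which might be a starting place (remote_tbl here is assumed to be a tbl_sql):

library(dplyr)
# Print the SQL that dbplyr generates for a chain, without running it
remote_tbl %>%
  mutate(lowered = tolower(fld_character)) %>%
  show_query()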

Fix integer vs. integer64 test failures

Tests fail on integer vs. integer64 types:

$results
$results[[1]]
manip(local_df) not equal to manip(remote_df).
Classes differ: integer is not integer64 

Annoying... makes me not want to use integers in my tests 😮
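
One possible workaround (a sketch only, not what dbtest currently does) is to normalize integer64 columns on both sides before comparing:

# Coerce bit64::integer64 columns to double so expect_equal can compare them
normalize_int64 <- function(df) {
  df[] <- lapply(df, function(col) {
    if (inherits(col, "integer64")) as.numeric(col) else col
  })
  df
}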

Add covr, badges, and other stuff

In order to ensure that we are testing well and communicating well with potential users, we should add:

  • covr
  • CONTRIBUTING.md
  • badges
  • etc.?

Add reporter parameter

We should think through how to allow users to customize which reporter(s) they are using. We should probably allow "additions" to the standard set, although I think our "minimal set" is probably just the ListReporter, since that facilitates the other plotting/reporting that we do.

testthat has good examples of this. It should be pretty straightforward.

Add custom (or extensible) TestThat reporter

The ListReporter does not allow caching additional information while tests execute... it would be nice to be able to do so, so that the definition of the code being executed could be captured, as well as the output being compared.

Motivation:

Remove "code" from the test name for a cleaner ggplot presentation. The "code" is still available in the testthat_results object

well... not completely. Not sure how to capture the output of what manip is defined as...

Add HTML Overview and Coverage Report

In the previous version of the package, there was an HTML overview that allowed easier exploration of error messages, stack traces, results, etc. There was also a nice way to look at code coverage stats / percentages. Both of these "reports" should be added back to the new version!

Add a way to skip tests

It would be nice to have a way to skip tests (in the native testthat way... maybe by parsing the "verb" and putting "skip mutate" or something). This would give a better picture in the testthat output than commented YAML, for instance.

Less magic in `test_databases`, break into separate wrappers

@edgararuiz - Curious for your thoughts on this

In conjunction with #5, we could clean up our path to test_single_database by having multiple helpers - i.e. test_dsns, test_config, test_tbls, etc.

This would also simplify the test_databases function, since it would simply be choosing between one of these helpers. It would also help users have a bit more control if they only want to use one of the approaches and avoid some of the "magic" that is currently happening with test_databases.

Output from `test_single_database` should have an S3 class

I think it is probably worth adding an S3 class dbtest_output or something like that to the output from test_single_database. This would have the benefit of being testable with is_dbtest_output or something like it. It would also be helpful in plot_tests, where we have to test for this type of object.
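
A minimal sketch of the idea, using the names from this issue (nothing here is implemented):

# Hypothetical constructor and predicate for the proposed class
new_dbtest_output <- function(x) {
  structure(x, class = c("dbtest_output", class(x)))
}
is_dbtest_output <- function(x) inherits(x, "dbtest_output")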

Make README generation more atomic

At present, README generation is not atomic / reproducible. Rendered plots, temp file paths, etc. can all change between runs.

See if there is a way to reduce some of this...

Bypass or try failed connections again

At present, test_databases, when encountering a failed connection, failed table creation, or other R error, will die and kill all output.

For instance,

test <- test_databases("conn.yml")
#...F....F....F....F....FF...F....F.F.EEEEEE
#Created a temporary table named: ##xkokevxfncvsbkrdejms
#....E....E....E....E....E....E....E..E...E.
# Error in connection_begin(conn@ptr) : 
#  nanodbc/nanodbc.cpp:1128: HYC00: [RStudio][ODBC] (11470) Transactions are not supported. 
test
#Error: object 'test' not found

It is probably preferable that we either bypass the failed connection or failed table creation (and maybe propagate the error somehow), try again after waiting a little while, or both.

What I am really thinking of here is how do we best integrate with testthat? I.e. should we surface anything in the output test object about a failed connection?
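
A sketch of one possible approach, assuming connections is a list of DBI connections (this is not current behavior):

# Hypothetical: isolate each connection so one failure cannot destroy
# the output of the entire run
results <- lapply(connections, function(con) {
  tryCatch(
    test_single_database(con),
    error = function(e) {
      # Propagate the failure as data instead of aborting
      structure(list(error = conditionMessage(e)), class = "dbtest_failure")
    }
  )
})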

Fix `arrange` test issues

Relates to #8

When testing arrange, the default behavior is to pull the last column of the data.frame in question (since arrange does not modify columns the way a mutate would). This can cause problems with testing, since the uniqueness of the sort is important to test success / failure.

Further, there can be problems with adding a mutate() to create the column so that the same level of uniqueness is present, because:

  1. it is not a unit test of arrange
  2. some helpers (like desc(column)) are not valid within mutate

Very tricky... it almost seems we need a "unique value" column like id or something... and then every arrange needs to be done by the computation in question and then by the unique column.

Better fail if passing a YAML file to `test_single_database`

If we pass a bad input to test_single_database, that should be made very clear. At present, it is very strange how test_databases and test_single_database behave differently... Honestly, I think the interface may be nicer if we just had test_database as the main API for interacting with dbtest and made it vectorized.

Clean up table creation approach

copy_to is currently used for table creation. However, there are some issues related to the persistence of tables, schema references for temporary tables, etc. that makes it very challenging to clean up or use this in an interactive session.

This may be something that we try to take care of in our test suite by being "smart" about the different databases or it could be something that we push upstream to fixes in dplyr or DBI so that this use case is more easily tackled in R packages, generally.

For instance,

  • On Postgres, temp tables are created in a random separate schema (pg_temp_1, for instance) and so cannot be dropped. In this case, dropping the connection and then creating a new one should work ok.
  • On Oracle, something similar happens. I don't know enough about Oracle to know where the table "actually" lives. I presume they support temporary tables, so dropping the connection should be a fine workaround here, too.
  • DB2 does not support temporary tables, at least not in the syntax we are using:
Error: <SQL> 'CREATE TEMPORARY TABLE "rqoshitjquoyizbfnanj" (
  "fld_factor" VARCHAR(255),
  "fld_datetime" VARCHAR(255),
  "fld_date" VARCHAR(255),
  "fld_time" VARCHAR(255),
  "fld_binary" INTEGER,
  "fld_integer" INTEGER,
  "fld_double" DOUBLE PRECISION,
  "fld_character" VARCHAR(255),
  "fld_logical" INTEGER
)
'
  nanodbc/nanodbc.cpp:1587: 42601: [IBM][CLI Driver][DB2/LINUXX8664] SQL0104N  An unexpected token "TEMPORARY" was found following "CREATE ".  Expected tokens may include:  "<space>".  SQLSTATE=42601 
  • Impala does not support transactions:
 Error in connection_begin(conn@ptr) : 
  nanodbc/nanodbc.cpp:1128: HYC00: [RStudio][ODBC] (11470) Transactions are not supported.

Add support for ROracle connections

The package doesn't currently support ROracle connections and returns an error when calling test_database:

unable to find an inherited method for function ‘dbBegin’ for signature ‘"OraConnection"’
Error in invokeRestart("fail_tests", msg = e, tests = tests, label = label) :
  object 'label' not found
