ropensci / arkdb
Archive and unarchive databases as flat text files
Home Page: https://docs.ropensci.org/arkdb
License: Other
https://stackoverflow.com/a/39680636
Right now, we use a really basic OFFSET and LIMIT query. I have run into a few situations where the results come out differently using PostgreSQL.
We need to add ORDER BY X, where X is a column in the table, to be certain that we are getting consistent results; otherwise the window can change between chunks.
@cboettig I'm thinking we add this as a required argument in ark. What do you think?
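To make the idea concrete, here is a sketch of the kind of query this would produce; the table and sort columns are placeholder names, not part of arkdb:

```r
# Sketch only: a stable paging query. Without ORDER BY, OFFSET/LIMIT
# windows are not guaranteed to be consistent between calls on
# PostgreSQL. "flights" and the sort columns are placeholders.
query <- paste(
  "SELECT * FROM flights",
  "ORDER BY year, month, day",    # deterministic ordering of rows
  "LIMIT 50000 OFFSET 100000"     # one window of the table
)
chunk <- DBI::dbGetQuery(db_con, query)
```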
Strange failing cases on certain OSs:
Linux systems with compression (particularly gz, also bz2) fail on the ark call when using MonetDB. (The test suite currently uses uncompressed files on MonetDB to avoid this Linux-specific failure; it passes on Mac & Windows.)
Appveyor has a strange failure, but only when running tests in sequence. Clearing the local SQL database and files still leaves some shadow state, so later tests append tables to the existing db instead of starting fresh. This results in too many rows in the output data, causing the test to fail. (Currently in the test suite but skipped on Appveyor; the problem does not appear on Linux or Mac.)
Wrapping examples the way @coatless suggests here might make the documentation more readable to new users. Otherwise users might erroneously think they need to use a tempdir as well.
#' \dontshow{.old_wd <- setwd(tempdir())}
#' # Your code here
#' \dontshow{setwd(.old_wd)}
There is often a need to prepare, reorder, sort, or recode raw data into a usable format. For example, databases often store categorical information in a binary (0/1) or integer (1,2,3,4,5) format. Researchers and others using these data often expect to receive them with the encoding built in. Adding a callback to the ark function would allow this type of "analytical preparation" to be completed on a just-in-time basis, simplifying some typical database <> in-memory analytical workflows. Combined with the new filter_statement, this could be used to set up common data pipelines where a window of data is required, then perhaps summarized, modeled, or otherwise aggregated.
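As a rough illustration, such a callback might look like the following; note that the callback argument and the recode_status() helper are hypothetical, not part of the current ark() signature:

```r
# Hypothetical sketch: a per-chunk callback for "analytical preparation".
# Neither the `callback` argument nor recode_status() exists in arkdb today.
recode_status <- function(chunk) {
  chunk$status <- factor(chunk$status,
                         levels = c(0, 1),
                         labels = c("inactive", "active"))
  chunk  # return the transformed chunk to be written out
}

ark(db_con, dir = "data", tables = "accounts", callback = recode_status)
```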
In addition to delim and quote, readers should allow the user to specify na (typically "", "NA", or both).
I was writing a bulk import method for Postgres when I realized I needed this.
Interestingly, several SQL CSV import variants allow the user to specify an escape character as well. Base and readr readers just toggle \ escaping on and off.
Ideally, arkdb should detect when the input file is compatible with a known bulk-importer method and use that rather than importing in chunks.
For instance, MonetDB and duckdb have quite reasonable bulk importers, and others can at least do csv. In most cases this should be much faster than chunking, though it tends to be less flexible and will need a mechanism to override it.
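A minimal sketch of such a dispatch could look like this; the function names are invented for illustration, though the duckdb branch uses duckdb's real read_csv_auto SQL function:

```r
# Sketch, not arkdb code: pick a native bulk importer when the backend
# and file format allow it, otherwise fall back to chunked import.
# bulk_import() and unark_chunked() are hypothetical names.
bulk_import <- function(db_con, file, tablename) {
  if (inherits(db_con, "duckdb_connection") && grepl("\\.csv$", file)) {
    # duckdb can ingest a CSV in a single statement
    DBI::dbExecute(db_con, sprintf(
      "CREATE TABLE %s AS SELECT * FROM read_csv_auto('%s')",
      tablename, file
    ))
  } else {
    unark_chunked(db_con, file, tablename)  # assumed chunked fallback
  }
}
```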
Hi folks,
Please, consider this MWE:
# db_con_local: connection to my local MySQL database
# db_con_remote: connection to my remote AWS RDS MySQL database
table_name <- "table_original"
df <- nycflights13::flights
DBI::dbWriteTable(db_con_local, table_name, df, overwrite=TRUE)
arkdb::ark(
db_con = db_con_local,
dir = "data",
tables = table_name,
filenames = NULL,
overwrite = TRUE,
lines=5e+5L
)
files <- glue::glue("./{table_name}.tsv.bz2")
arkdb::unark(
files = files,
db_con = db_con_remote,
tablenames = "table_imported",
overwrite = TRUE,
lines=5e+5L
)
Although the .tsv.bz2 file in this case is just 5 MB, the unark() command above takes several hours and still does not finish, until I'm forced to stop R.
I believe this issue is related to r-dbi/RMariaDB#265. For some reason, loading data into a remote MySQL database is much faster using the SQL command LOAD DATA LOCAL INFILE.
This alternative routine, for example, took only a few seconds:
# db_con_remote: connection to my remote AWS RDS MySQL database
table_name <- "test"
table_file_path <- glue::glue("data/{table_name}.csv") |> fs::path_abs()
df <- nycflights13::flights
data.table::fwrite(df, table_file_path)
DBI::dbWriteTable(db_con_remote, table_name, df[1,], overwrite=TRUE)
DBI::dbExecute(db_con_remote, glue::glue("TRUNCATE {table_name}"))
# Then run this query in MySQL Workbench (which took only 9 seconds):
TRUNCATE <<table_name>>;
LOAD DATA LOCAL INFILE '<<table_file_path>>'
INTO TABLE <<table_name>>
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Is there a solution for that?
Thank you
Hi folks, first let me say that the functionality arkdb provides is great and very much needed.
My issue is the error below, which I get after unark():
[-] chunk 2 ...Done! (in 2.191768 mins)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'name' in selecting a method for function 'dbExistsTable': subscript out of bounds
Here is the code (sorry, my data is restricted, so I cannot share it):
arkdb::ark(
db_con = db_con,
dir = '...',
tables = "original_table",
filenames = NULL,
overwrite = TRUE,
lines=5e+5L
)
files <- fs::dir_ls('...', glob="*.tsv.bz2")
arkdb::unark(
files = files,
db_con = db_con,
tablenames = "new_imported_table",
overwrite = TRUE,
lines=5e+5L
)
original_table has 798594 rows while new_imported_table has 797904 (a few rows are missing).
Please, any idea why this is happening?
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
other attached packages:
[1] arkdb_0.0.15 DBI_1.1.2
Currently, unark() takes the table name from the basename of the .csv file. This name may not be SQL-compliant, leading to quoting, which is a bit of a mess.
- unark() should strip out non-compliant characters by default.
- unark() should also be more flexible, allowing the user to specify the corresponding table names manually, rather than enforcing that they correspond to the incoming csv names.
Should cover:
It may be desirable to use duckdb to read or serialize alternative file types like parquet or CSV. The current bulk loader assumes that a traditional database will be used rather than files. In practice, however, duckdb is often used to create an SQL layer over a "folder of files", much like AWS Athena, but locally.
One way to implement this is to allow tablenames to include the reader function. The duckdb SQL for reading a c/tsv is as follows:
SELECT * FROM read_csv_auto('file/path.csv');
arkdb can generate this SQL easily if read_csv_auto("file/path.csv") is passed to the tablename argument. However, the naming of output files is designed to match the input table name.
Potential solutions:
as an option (or perhaps the default behavior)?
res <- dbSendQuery(con, "SELECT * FROM ...")
while (TRUE) {
  data <- dbFetch(res, n = 50000)  # fetch the next chunk of rows
  if (nrow(data) == 0) break       # no rows left: stop
  readr::write_tsv(data, ...)
}
dbClearResult(res)
Many DBI backends will not reserve client-side memory for the full result set after dbSendQuery(). The result also won't be mixed up if other clients alter the table during the export.
Done in #10
fixed
The progress and general chattiness need to be tuneable (e.g., with a verbose and a progress argument). Ideally both would be tuneable, as progress bars are not very log-friendly but messages are (you can see this play out a bit with the vignette build: the progress output ends up on the parent terminal for me). The log output as it stands is not incredibly informative after the run has finished.
This has been reformatted as Rich recommended
Yeah, I think function overhead means that the fewer chunks you need, the faster it should be. I chose this default as a number that keeps the memory footprint workable on most laptops today for most tables (obviously a wide enough table can break this heuristic) but is large enough to be acceptably fast, e.g. on the nycflights data. There's really no great answer here, and the user should expect to tune this parameter if performance is an issue.
Brilliant. Fixed this, but then Kirill had an even better solution, so I've switched to that. The window-sql method is still there, but perhaps it should just be deprecated now.
inherits, not is. (Oh right, yay, this drops importFrom methods too.)
It could, but it is no longer the best method available; might just deprecate it.
tested now
tested now
most are tested now
still have to digest this...
ark_chunk was effectively sidestepped by the new methods from Kirill's suggestions (#8). Suggestions welcome.
When running revdep checks on future, I noticed that the unit tests of arkdb end up spawning N parallel processes, where N = all CPU cores on the machine. For example, here is what I observe when I run on a host with 48 cores:
Here's what I managed to grab using pstree:
|-sh /wynton/home/cbi/shared/software/CBI/_rocky8/R-4.3.2-gcc10/lib64/R/bin/Rcmd check arkdb_0.0.16.tar.gz --no-manual -o /wynton/home/cbi/hb/repositories/future/revdep/checks/arkdb/old
| `-R --no-restore --no-echo --args nextArgarkdb_0.0.16.tar.gznextArg--no-manualnextArg-onextArg/wynton/home/cbi/hb/repositories/future/revdep/checks/arkdb/old
| |-sh -c LANGUAGE=en _R_CHECK_INTERNALS2_=1 '/wynton/home/cbi/shared/software/CBI/_rocky8/R-4.3.2-gcc10/lib64/R/bin/R' --vanilla --no-echo < '/scratch/hb/RtmpNYi9Br/file2a35234b2e24d0'
| | `-R --vanilla --no-echo
| | `-sh /wynton/home/cbi/shared/software/CBI/_rocky8/R-4.3.2-gcc10/lib64/R/bin/Rcmd BATCH --vanilla testthat.R testthat.Rout
| | `-R -f testthat.R --restore --save --no-readline --vanilla
| | |-{R}
| | |-{R}
| | |-{R}
| | ... (48 {R} threads in total) ...
| | `-{R}
| `-{R}
You probably know my rant by now, but this behavior is problematic when running on compute systems shared with others: I'm using more CPU resources than I'm allowed to/requested. I also risk overusing the CPUs; e.g., running revdepcheck, I actually end up running two such instances at the same time.
Do you know where this N = all CPU cores parallelization comes from? I know that you depend on future.apply, but the Futureverse should respect the R CMD check --as-cran limitation, i.e. availableCores() returns at most 2 (two).
Failed to install the library. What's wrong? Tell me please.
> devtools::install_github("cboettig/arkdb")
Downloading GitHub repo cboettig/arkdb@HEAD
Error: Failed to install 'arkdb' from GitHub:
Multiple results for CXX11FLAGS found, something is wrong.
FALSE
traceback()
10: stop(remote_install_error(remotes[[i]], e))
9: value[3L]
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
7: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch(res[[i]] <- install_remote(remotes[[i]], ...), error = function(e) {
stop(remote_install_error(remotes[[i]], e))
})
5: install_remotes(remotes, auth_token = auth_token, host = host,
dependencies = dependencies, upgrade = upgrade, force = force,
quiet = quiet, build = build, build_opts = build_opts, build_manual = build_manual,
build_vignettes = build_vignettes, repos = repos, type = type,
...)
4: force(code)
3: withr::with_path(rtools_path(), code)
2: pkgbuild::with_build_tools({
ellipsis::check_dots_used(action = getOption("devtools.ellipsis_action",
rlang::warn))
{
remotes <- lapply(repo, github_remote, ref = ref, subdir = subdir,
auth_token = auth_token, host = host)
install_remotes(remotes, auth_token = auth_token, host = host,
dependencies = dependencies, upgrade = upgrade, force = force,
quiet = quiet, build = build, build_opts = build_opts,
build_manual = build_manual, build_vignettes = build_vignettes,
repos = repos, type = type, ...)
}
}, required = FALSE)
1: devtools::install_github("cboettig/arkdb")
sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251
[4] LC_NUMERIC=C LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] rstudioapi_0.13 magrittr_2.0.3 usethis_2.1.5 devtools_2.4.3 pkgload_1.2.4
[6] R6_2.5.1 rlang_1.0.2 fastmap_1.1.0 tools_4.1.3 pkgbuild_1.3.1
[11] sessioninfo_1.2.2 cli_3.2.0 withr_2.5.0 ellipsis_0.3.2 remotes_2.4.2
[16] rprojroot_2.0.3 lifecycle_1.0.1 crayon_1.5.1 brio_1.1.3 processx_3.5.3
[21] purrr_0.3.4 callr_3.7.0 fs_1.5.2 ps_1.6.0 curl_4.3.2
[26] testthat_3.1.3 memoise_2.0.1 glue_1.6.2 cachem_1.0.6 compiler_4.1.3
[31] desc_1.4.1 prettyunits_1.1.1
ark and unark are built around a hardwired lapply call. I'm not sure how well parallel reading from database connections works in all cases, but it seems natural to parallelize this iteration over all tables...
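For what it's worth, a parallel version of that loop might be sketched as follows; DBI connections generally cannot be shared across processes, so each worker opens its own (make_con() is an assumed connector helper, not a real function):

```r
# Sketch, assuming a make_con() helper that returns a fresh DBI
# connection for each worker.
library(future.apply)
plan(multisession, workers = 2)

row_counts <- future_lapply(c("flights", "airlines"), function(tbl) {
  con <- make_con()                      # hypothetical per-worker connection
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  nrow(DBI::dbReadTable(con, tbl))       # stand-in for the per-table work
})
```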
I can think of a couple of situations where I want to dump a database into raw text files for various sorts of inspection. Is it possible to add a compress="none" option for both arking and unarking?
write_tsv/read_tsv for I/O. This should be pluggable.
Done: now supports readr and base utils table I/O.
It would be nice to take advantage of things like the faster speed of fst::read_fst/fst::write_fst, but it looks like fst does not support appending, making it impossible to stream in chunks.
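For reference, the pluggable interface can be sketched with a constructor like arkdb's streamable_table(), which takes read and write functions plus a file extension; treat the exact signature here as an assumption:

```r
# Sketch of a custom streamable table using base utils I/O. The write
# function must support appending so chunks can be streamed; appending
# is exactly what fst lacks.
stream_csv <- arkdb::streamable_table(
  read = function(file, ...) utils::read.csv(file, ...),
  write = function(x, path, omit_header) {
    utils::write.table(x, path, sep = ",", row.names = FALSE,
                       col.names = !omit_header, append = omit_header)
  },
  extension = "csv"
)
```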
Some databases may be so large that it is not reasonable to pull an entire table: doing so could bring down a production system, require far more disk space than the local system has even in the most compressed of formats, or simply be superfluous. It would be useful to be able to inject a WHERE clause into the SQL used.
This would likely need to be passed down as an argument from ark. Here's a simple example of an implementation using a filter_statement argument:
ark(..., filter_statement = 'WHERE status = "Active"')
...
# https://github.com/ropensci/arkdb/blob/master/R/ark.R#L164
res <- DBI::dbSendQuery(db_con, paste("SELECT * FROM", tablename, filter_statement))
arkdb (version 0.0.2) cannot be installed from source in R 3.4.4 under Linux: it requires the utils::askYesNo function, which was added in R 3.5.0.
version$version.string
#> [1] "R version 3.4.4 (2018-03-15)"
install.packages("arkdb")
#> Installing package into ‘/home/R/site-library_R-3.4’
#> (as ‘lib’ is unspecified)
#> trying URL 'http://erratic/cran/src/contrib/arkdb_0.0.2.tar.gz'
#> Content type 'application/x-gzip' length 27979 bytes (27 KB)
#> ==================================================
#> downloaded 27 KB
#>
#> * installing *source* package ‘arkdb’ ...
#> ** package ‘arkdb’ successfully unpacked and MD5 sums checked
#> ** R
#> ** inst
#> ** byte-compile and prepare package for lazy loading
#> Error : object ‘askYesNo’ is not exported by 'namespace:utils'
#> ERROR: lazy loading failed for package ‘arkdb’
#> * removing ‘/home/R/site-library_R-3.4/arkdb’
Dear Carl,
A much-needed package for me. Thanks! Do you have an approximate timeline for the first CRAN release?
It would be good to be able to robustly save and restore column types in a predictable fashion. Data with large numbers of NAs tends to hit errors in column-type detection, and certain types, such as dates, are not retained.
Would it make sense to add an option to create and/or use a file like schema.json to set data types?
One enabling change would be to allow arguments to be passed through to the native bulk importers (which generally use read.csv and colClasses to set up column types when creating tables before importing).
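As a hedged sketch of what a schema.json round trip could look like (nothing here is implemented in arkdb; jsonlite and the file names are used purely for illustration):

```r
# Record each column's class at export time...
df <- data.frame(id = 1L,
                 date = as.Date("2020-01-01"),
                 note = NA_character_,
                 stringsAsFactors = FALSE)
schema <- lapply(df, function(x) class(x)[1])
jsonlite::write_json(schema, "schema.json", auto_unbox = TRUE)

# ...and restore them at import time via colClasses.
col_classes <- unlist(jsonlite::read_json("schema.json"))
restored <- utils::read.csv("table.csv", colClasses = col_classes)
```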
readr uses logical as the default column type when the first n = 1000 lines are all NA. (The reason for this is to play nicely with operations like bind_rows(), since you can upcast a logical NA to a character NA but not vice versa; see tidyverse/readr#839.) (Base read.csv also defaults all-NA columns to logical.)
In general I still feel this is the wrong default, as it allows R to make a somewhat destructive guess when parsing: sparsely filled text columns get coerced to all-NA because their text cannot be coerced into a logical. It would probably be safer to default to character, which preserves the text and does not destroy data.
It seems like it would have been preferable to change the casting behavior used in dplyr to allow it to cast/bind the all-NA character columns?
cc @noamross
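Until the default changes, a workaround on the readr side is to override the guess so that unguessable (all-NA) columns stay character, e.g.:

```r
# Sketch: force all-NA columns to character rather than accepting
# readr's logical default. The inline CSV is illustrative data only.
library(readr)
read_csv(I("id,note\n1,\n2,\n"),
         col_types = cols(id = col_integer(),
                          .default = col_character()))
```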
Would it be possible to support Postgres databases that have tables in schemas? I'm running into problems, I think because table names have to be prefixed with the schema name in the SQL.
# try archiving just tables in specified schema
odm2_tables <- dbGetQuery(db,
"SELECT * FROM information_schema.tables
WHERE table_schema = 'odm2'")
ark(db, dir, lines = 50000, tables = odm2_tables$table_name)
Error:
Exporting actionannotations in 50000 line chunks:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: relation "actionannotations" does not exist
LINE 1: SELECT * FROM actionannotations LIMIT 0
^
)
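One possible workaround while schema support is missing is to build schema-qualified identifiers by hand with DBI; whether ark() can accept such names is exactly the open question here:

```r
# Sketch: query a table inside the "odm2" schema by quoting a
# schema-qualified identifier instead of a bare table name.
tbl_id <- DBI::Id(schema = "odm2", table = "actionannotations")
sql <- paste("SELECT * FROM",
             DBI::dbQuoteIdentifier(db, tbl_id),
             "LIMIT 5")
head_rows <- DBI::dbGetQuery(db, sql)
```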
Hi folks, first let me say that the functionality arkdb provides is great and very much needed.
My issue is that I export a table via ark(), but when I then create a new one using unark(), it has a different number of rows.
This is my code (sorry that I could not share my data; it is restricted):
arkdb::ark(
db_con = db_con,
dir = '...',
tables = "original_table",
filenames = NULL,
overwrite = TRUE,
lines=5e+5L
)
files <- fs::dir_ls('...', glob="*.tsv.bz2")
arkdb::unark(
files = files,
db_con = db_con,
tablenames = "new_imported_table",
overwrite = TRUE,
lines=5e+5L
)
original_table has 798594 rows while new_imported_table has 797904 (a few rows are missing).
Please, what are the possible reasons for that?
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
other attached packages:
[1] arkdb_0.0.15 DBI_1.1.2