ropensci / arkdb
Archive and unarchive databases as flat text files
Home Page: https://docs.ropensci.org/arkdb
License: Other
https://stackoverflow.com/a/39680636
Right now, we use a really basic OFFSET and LIMIT query. I have run into a few situations where the results come out differently using PostgreSQL.
We need to add ORDER BY X, where X is a column in the table, to be certain that we are getting consistent results; otherwise the window can change between chunks.
@cboettig I'm thinking we add this as a required argument in ark. What do you think?
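To make the idea concrete, here is a sketch of the kind of query this would produce; the table and sort columns are placeholder names, not part of arkdb:

```r
# Sketch only: a stable paging query. Without ORDER BY, OFFSET/LIMIT
# windows are not guaranteed to be consistent between calls on
# PostgreSQL. "flights" and the sort columns are placeholders.
query <- paste(
  "SELECT * FROM flights",
  "ORDER BY year, month, day",    # deterministic ordering of rows
  "LIMIT 50000 OFFSET 100000"     # one window of the table
)
chunk <- DBI::dbGetQuery(db_con, query)
```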
Strange failing cases on certain OSs:
Linux systems with compression (particularly gz, also bz2) fail on the ark call when using MonetDB. (The test suite currently uses uncompressed files on MonetDB to avoid this Linux-specific failure; it passes on Mac & Windows.)
Appveyor has a strange failure, but only when running tests in sequence. Clearing the local SQL database and files still leaves some shadow state, so later tests append tables to the existing db instead of starting fresh. This results in too many rows in the output data, causing the test to fail. (Currently in the test suite but skipped on Appveyor; the problem does not appear on Linux or Mac.)
Wrapping examples the way @coatless suggests here might make the documentation more readable to new users. Otherwise users might erroneously think they need to use a tempdir as well.
#' \dontshow{.old_wd <- setwd(tempdir())}
#' # Your code here
#' \dontshow{setwd(.old_wd)}
There is often a need to prepare, reorder, sort, or recode raw data into a usable format. For example, databases often store categorical information in a binary (0/1) or integer (1,2,3,4,5) format. Researchers and others using these data often expect to receive them with the encoding built in. Adding a callback to the ark function would allow this type of "analytical preparation" to be completed on a just-in-time basis, simplifying some typical database <> in-memory analytical workflows. Combined with the new filter_statement, this could be used to set up common data pipelines where a window of data is required, then perhaps summarized, modeled, or otherwise aggregated.
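As a rough illustration, such a callback might look like the following; note that the callback argument and the recode_status() helper are hypothetical, not part of the current ark() signature:

```r
# Hypothetical sketch: a per-chunk callback for "analytical preparation".
# Neither the `callback` argument nor recode_status() exists in arkdb today.
recode_status <- function(chunk) {
  chunk$status <- factor(chunk$status,
                         levels = c(0, 1),
                         labels = c("inactive", "active"))
  chunk  # return the transformed chunk to be written out
}

ark(db_con, dir = "data", tables = "accounts", callback = recode_status)
```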
In addition to delim and quote, readers should allow the user to specify na (typically "", "NA", or both).
I was writing a bulk import method for Postgres when I realized I needed this.
Interestingly, several SQL CSV import variants allow the user to specify an escape character as well. Base and readr readers just toggle \ escaping on and off.
Ideally, arkdb should detect when the input file is compatible with a known bulk-importer method and use that rather than importing in chunks.
For instance, MonetDB and duckdb have quite reasonable bulk importers, and others can at least do csv. In most cases this should be much faster than chunking, though it tends to be less flexible and will need a mechanism to override it.
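A minimal sketch of such a dispatch could look like this; the function names are invented for illustration, though the duckdb branch uses duckdb's real read_csv_auto SQL function:

```r
# Sketch, not arkdb code: pick a native bulk importer when the backend
# and file format allow it, otherwise fall back to chunked import.
# bulk_import() and unark_chunked() are hypothetical names.
bulk_import <- function(db_con, file, tablename) {
  if (inherits(db_con, "duckdb_connection") && grepl("\\.csv$", file)) {
    # duckdb can ingest a CSV in a single statement
    DBI::dbExecute(db_con, sprintf(
      "CREATE TABLE %s AS SELECT * FROM read_csv_auto('%s')",
      tablename, file
    ))
  } else {
    unark_chunked(db_con, file, tablename)  # assumed chunked fallback
  }
}
```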
Hi folks,
Please, consider this MWE:
# db_con_local: connection to my local MySQL database
# db_con_remote: connection to my remote AWS RDS MySQL database
table_name <- "table_original"
df <- nycflights13::flights
DBI::dbWriteTable(db_con_local, table_name, df, overwrite=TRUE)
arkdb::ark(
db_con = db_con_local,
dir = "data",
tables = table_name,
filenames = NULL,
overwrite = TRUE,
lines=5e+5L
)
files <- glue::glue("./{table_name}.tsv.bz2")
arkdb::unark(
files = files,
db_con = db_con_remote,
tablenames = "table_imported",
overwrite = TRUE,
lines=5e+5L
)
Although the .tsv.bz2 file in this case is just 5 MB, the unark() command above takes several hours and still does not finish, until I'm forced to stop R.
I believe this issue is related to r-dbi/RMariaDB#265. For some reason, loading data into a remote MySQL database is much faster using the SQL command LOAD DATA LOCAL INFILE.
This alternative routine, for example, took only a few seconds:
# db_con_remote: connection to my remote AWS RDS MySQL database
table_name <- "test"
table_file_path <- glue::glue("data/{table_name}.csv") |> fs::path_abs()
df <- nycflights13::flights
data.table::fwrite(df, table_file_path)
DBI::dbWriteTable(db_con_remote, table_name, df[1,], overwrite=TRUE)
DBI::dbExecute(db_con_remote, glue::glue("TRUNCATE {table_name}"))
# Then run this query in MySQL Workbench (which took only 9 seconds):
TRUNCATE <<table_name>>;
LOAD DATA LOCAL INFILE '<<table_file_path>>'
INTO TABLE <<table_name>>
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Is there a solution for that?
Thank you
Hi folks, first let me say that the functionality arkdb provides is great and very much needed.
My issue is the error below, which I get after unark():
[-] chunk 2 ...Done! (in 2.191768 mins)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'name' in selecting a method for function 'dbExistsTable': subscript out of bounds
Here is the code (sorry, my data is restricted, so I cannot share it):
arkdb::ark(
db_con = db_con,
dir = '...',
tables = "original_table",
filenames = NULL,
overwrite = TRUE,
lines=5e+5L
)
files <- fs::dir_ls('...', glob="*.tsv.bz2")
arkdb::unark(
files = files,
db_con = db_con,
tablenames = "new_imported_table",
overwrite = TRUE,
lines=5e+5L
)
original_table has 798594 rows while new_imported_table has 797904 (a few rows are missing).
Please, any idea why this is happening?
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
other attached packages:
[1] arkdb_0.0.15 DBI_1.1.2
Currently, unark() takes the table name from the basename of the .csv file. This name may not be SQL-compliant, leading to quoting, which is a bit of a mess.
- unark() should strip out non-compliant characters by default.
- unark() should also be more flexible, allowing the user to specify the corresponding table names manually, rather than enforcing that they correspond to the incoming csv names.
Should cover:
It may be desirable to use duckdb to read or serialize alternative file types like parquet or CSV. The current bulk loader assumes that a traditional database will be used rather than files. In practice, however, duckdb is often used to create an SQL layer over a "folder of files", much like AWS Athena, but locally.
One way to implement this is to allow tablenames to include the reader function. The duckdb SQL for reading a c/tsv is as follows:
SELECT * FROM read_csv_auto('file/path.csv');
arkdb can generate this SQL easily if read_csv_auto("file/path.csv") is passed to the tablename argument. However, the naming of output files is designed to match the input table name.
Potential solutions:
as an option (or perhaps the default behavior)?
res <- dbSendQuery(con, "SELECT * FROM ...")
while (TRUE) {
  data <- dbFetch(res, n = 50000)  # fetch the next chunk of rows
  if (nrow(data) == 0) break       # no rows left: stop
  readr::write_tsv(data, ...)
}
dbClearResult(res)
Many DBI backends will not reserve client-side memory for the full result set after dbSendQuery(). The result also won't be mixed up if other clients alter the table during the export.
Done in #10
fixed
The progress and general chattiness need to be tuneable (e.g., with a verbose and a progress argument). Ideally both would be tuneable, as progress bars are not very log-friendly but messages are (you can see this play out a bit with the vignette build: the progress output ends up on the parent terminal for me). The log output as it stands is not incredibly informative after the run has finished.
This has been reformatted as Rich recommended
Yeah, I think function overhead means that the fewer chunks you need, the faster it should be. I chose this default as a number that keeps the memory footprint workable on most laptops today for most tables (obviously a wide enough table can break this heuristic) but is large enough to be acceptably fast, e.g. on the nycflights data. There's really no great answer here, and the user should expect to tune this parameter if performance is an issue.
Brilliant. Fixed this, but then Kirill had an even better solution, so I've switched to that. The window-sql method is still there, but perhaps it should just be deprecated now.
inherits, not is. (Oh right, yay, this drops importFrom methods too.)
It could, but it is no longer the best method available; might just deprecate it.
tested now
tested now
most are tested now
still have to digest this...
ark_chunk was effectively sidestepped by the new methods from Kirill's suggestions (#8). Suggestions welcome.
When running revdep checks on future, I noticed that the unit tests of arkdb end up spawning N parallel processes, where N = all CPU cores on the machine. For example, here is what I observe when I run on a host with 48 cores:
Here's what I managed to grab using pstree:
|-sh /wynton/home/cbi/shared/software/CBI/_rocky8/R-4.3.2-gcc10/lib64/R/bin/Rcmd check arkdb_0.0.16.tar.gz --no-manual -o /wynton/home/cbi/hb/repositories/future/revdep/checks/arkdb/old
| `-R --no-restore --no-echo --args nextArgarkdb_0.0.16.tar.gznextArg--no-manualnextArg-onextArg/wynton/home/cbi/hb/repositories/future/revdep/checks/arkdb/old
| |-sh -c LANGUAGE=en _R_CHECK_INTERNALS2_=1 '/wynton/home/cbi/shared/software/CBI/_rocky8/R-4.3.2-gcc10/lib64/R/bin/R' --vanilla --no-echo < '/scratch/hb/RtmpNYi9Br/file2a35234b2e24d0'
| | `-R --vanilla --no-echo
| | `-sh /wynton/home/cbi/shared/software/CBI/_rocky8/R-4.3.2-gcc10/lib64/R/bin/Rcmd BATCH --vanilla testthat.R testthat.Rout
| | `-R -f testthat.R --restore --save --no-readline --vanilla
| | |-{R}
| | |-{R}
| | |-{R}
| | ... (48 {R} threads in total) ...
| | `-{R}
| `-{R}
You probably know my rant by now, but this behavior is problematic when running on compute systems shared with others: I'm using more CPU resources than I'm allowed to/requested. I also risk overusing the CPUs; e.g., running revdepcheck, I actually end up running two such instances at the same time.
Do you know where this N = all CPU cores parallelization comes from? I know that you depend on future.apply, but the Futureverse should respect the R CMD check --as-cran limitation, i.e. availableCores() returns at most 2 (two).
Failed to install the library. What's wrong? Tell me please.
> devtools::install_github("cboettig/arkdb")
Downloading GitHub repo cboettig/arkdb@HEAD
Error: Failed to install 'arkdb' from GitHub:
Multiple results for CXX11FLAGS found, something is wrong.
FALSE
traceback()
10: stop(remote_install_error(remotes[[i]], e))
9: value[3L]
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
7: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch(res[[i]] <- install_remote(remotes[[i]], ...), error = function(e) {
stop(remote_install_error(remotes[[i]], e))
})
5: install_remotes(remotes, auth_token = auth_token, host = host,
dependencies = dependencies, upgrade = upgrade, force = force,
quiet = quiet, build = build, build_opts = build_opts, build_manual = build_manual,
build_vignettes = build_vignettes, repos = repos, type = type,
...)
4: force(code)
3: withr::with_path(rtools_path(), code)
2: pkgbuild::with_build_tools({
ellipsis::check_dots_used(action = getOption("devtools.ellipsis_action",
rlang::warn))
{
remotes <- lapply(repo, github_remote, ref = ref, subdir = subdir,
auth_token = auth_token, host = host)
install_remotes(remotes, auth_token = auth_token, host = host,
dependencies = dependencies, upgrade = upgrade, force = force,
quiet = quiet, build = build, build_opts = build_opts,
build_manual = build_manual, build_vignettes = build_vignettes,
repos = repos, type = type, ...)
}
}, required = FALSE)
1: devtools::install_github("cboettig/arkdb")
sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251
[4] LC_NUMERIC=C LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] rstudioapi_0.13 magrittr_2.0.3 usethis_2.1.5 devtools_2.4.3 pkgload_1.2.4
[6] R6_2.5.1 rlang_1.0.2 fastmap_1.1.0 tools_4.1.3 pkgbuild_1.3.1
[11] sessioninfo_1.2.2 cli_3.2.0 withr_2.5.0 ellipsis_0.3.2 remotes_2.4.2
[16] rprojroot_2.0.3 lifecycle_1.0.1 crayon_1.5.1 brio_1.1.3 processx_3.5.3
[21] purrr_0.3.4 callr_3.7.0 fs_1.5.2 ps_1.6.0 curl_4.3.2
[26] testthat_3.1.3 memoise_2.0.1 glue_1.6.2 cachem_1.0.6 compiler_4.1.3
[31] desc_1.4.1 prettyunits_1.1.1
ark and unark are built around a hardwired lapply call. I'm not sure how well parallel reading from database connections works in all cases, but it seems natural to parallelize this iteration over all tables...
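For what it's worth, a parallel version of that loop might be sketched as follows; DBI connections generally cannot be shared across processes, so each worker opens its own (make_con() is an assumed connector helper, not a real function):

```r
# Sketch, assuming a make_con() helper that returns a fresh DBI
# connection for each worker.
library(future.apply)
plan(multisession, workers = 2)

row_counts <- future_lapply(c("flights", "airlines"), function(tbl) {
  con <- make_con()                      # hypothetical per-worker connection
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  nrow(DBI::dbReadTable(con, tbl))       # stand-in for the per-table work
})
```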
I can think of a couple of situations where I want to dump a database into raw text files for various sorts of inspection. Is it possible to add a compress="none" option for both arking and unarking?
write_tsv/read_tsv for I/O. This should be pluggable.
Done: now supports readr and base utils table I/O.
It would be nice to take advantage of things like the faster speed of fst::read_fst/fst::write_fst, but it looks like fst does not support appending, making it impossible to stream in chunks.
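For reference, the pluggable interface can be sketched with a constructor like arkdb's streamable_table(), which takes read and write functions plus a file extension; treat the exact signature here as an assumption:

```r
# Sketch of a custom streamable table using base utils I/O. The write
# function must support appending so chunks can be streamed; appending
# is exactly what fst lacks.
stream_csv <- arkdb::streamable_table(
  read = function(file, ...) utils::read.csv(file, ...),
  write = function(x, path, omit_header) {
    utils::write.table(x, path, sep = ",", row.names = FALSE,
                       col.names = !omit_header, append = omit_header)
  },
  extension = "csv"
)
```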
Some databases may be so large that it is not reasonable to pull an entire table: doing so could bring down a production system, require far more disk space than the local system has even in the most compressed of formats, or simply be superfluous. It would be useful to be able to inject a WHERE clause into the SQL used.
This would likely need to be passed down as an argument from ark. Here's a simple example of an implementation using a filter_statement argument:
ark(..., filter_statement = 'WHERE status = "Active"')
...
# https://github.com/ropensci/arkdb/blob/master/R/ark.R#L164
res <- DBI::dbSendQuery(db_con, paste("SELECT * FROM", tablename, filter_statement))
arkdb (version 0.0.2) cannot be installed from source in R 3.4.4 under Linux: it requires the utils::askYesNo function, which was added in R 3.5.0.
version$version.string
#> [1] "R version 3.4.4 (2018-03-15)"
install.packages("arkdb")
#> Installing package into ‘/home/R/site-library_R-3.4’
#> (as ‘lib’ is unspecified)
#> trying URL 'http://erratic/cran/src/contrib/arkdb_0.0.2.tar.gz'
#> Content type 'application/x-gzip' length 27979 bytes (27 KB)
#> ==================================================
#> downloaded 27 KB
#>
#> * installing *source* package ‘arkdb’ ...
#> ** package ‘arkdb’ successfully unpacked and MD5 sums checked
#> ** R
#> ** inst
#> ** byte-compile and prepare package for lazy loading
#> Error : object ‘askYesNo’ is not exported by 'namespace:utils'
#> ERROR: lazy loading failed for package ‘arkdb’
#> * removing ‘/home/R/site-library_R-3.4/arkdb’
Dear Carl,
A much-needed package for me. Thanks! Do you have an approximate timeline for the first CRAN release?
It would be good to be able to robustly save and restore column types in a predictable fashion. Data with large numbers of NAs tends to hit errors in column-type detection, and certain types, such as dates, are not retained.
Would it make sense to add an option to create and/or use a file like schema.json to set data types?
One enabling change would be to allow arguments to be passed through to the native bulk importers (which generally use read.csv and colClasses to set up column types when creating tables before importing).
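As a hedged sketch of what a schema.json round trip could look like (nothing here is implemented in arkdb; jsonlite and the file names are used purely for illustration):

```r
# Record each column's class at export time...
df <- data.frame(id = 1L,
                 date = as.Date("2020-01-01"),
                 note = NA_character_,
                 stringsAsFactors = FALSE)
schema <- lapply(df, function(x) class(x)[1])
jsonlite::write_json(schema, "schema.json", auto_unbox = TRUE)

# ...and restore them at import time via colClasses.
col_classes <- unlist(jsonlite::read_json("schema.json"))
restored <- utils::read.csv("table.csv", colClasses = col_classes)
```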
readr uses logical as the default column type when the first n = 1000 lines are all NA. (The reason for this is to play nicely with operations like bind_rows(), since you can upcast a logical NA to a character NA but not vice versa; see tidyverse/readr#839.) (Base read.csv also defaults all-NA columns to logical.)
In general I still feel this is the wrong default, as it allows R to make a somewhat destructive guess when parsing: sparsely filled text columns get coerced to all-NA because their text cannot be coerced into a logical. It would probably be safer to default to character, which preserves the text and does not destroy data.
It seems like it would have been preferable to change the casting behavior used in dplyr to allow it to cast/bind the all-NA character columns?
cc @noamross
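Until the default changes, a workaround on the readr side is to override the guess so that unguessable (all-NA) columns stay character, e.g.:

```r
# Sketch: force all-NA columns to character rather than accepting
# readr's logical default. The inline CSV is illustrative data only.
library(readr)
read_csv(I("id,note\n1,\n2,\n"),
         col_types = cols(id = col_integer(),
                          .default = col_character()))
```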
Would it be possible to support Postgres databases that have tables in schemas? I'm running into problems, I think because table names have to be prefixed with the schema name in the SQL.
# try archiving just tables in specified schema
odm2_tables <- dbGetQuery(db,
"SELECT * FROM information_schema.tables
WHERE table_schema = 'odm2'")
ark(db, dir, lines = 50000, tables = odm2_tables$table_name)
Error:
Exporting actionannotations in 50000 line chunks:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: relation "actionannotations" does not exist
LINE 1: SELECT * FROM actionannotations LIMIT 0
^
)
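One possible workaround while schema support is missing is to build schema-qualified identifiers by hand with DBI; whether ark() can accept such names is exactly the open question here:

```r
# Sketch: query a table inside the "odm2" schema by quoting a
# schema-qualified identifier instead of a bare table name.
tbl_id <- DBI::Id(schema = "odm2", table = "actionannotations")
sql <- paste("SELECT * FROM",
             DBI::dbQuoteIdentifier(db, tbl_id),
             "LIMIT 5")
head_rows <- DBI::dbGetQuery(db, sql)
```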
Hi folks, first let me say that the functionality arkdb provides is great and very much needed.
My issue is that I export a table via ark(), but when I then create a new one using unark(), it has a different number of rows.
This is my code (sorry that I could not share my data; it is restricted):
arkdb::ark(
db_con = db_con,
dir = '...',
tables = "original_table",
filenames = NULL,
overwrite = TRUE,
lines=5e+5L
)
files <- fs::dir_ls('...', glob="*.tsv.bz2")
arkdb::unark(
files = files,
db_con = db_con,
tablenames = "new_imported_table",
overwrite = TRUE,
lines=5e+5L
)
original_table has 798594 rows while new_imported_table has 797904 (a few rows are missing).
Please, what are the possible reasons for that?
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
other attached packages:
[1] arkdb_0.0.15 DBI_1.1.2