dataverse-client-r: R Client for Dataverse Repositories
Home Page: https://iqss.github.io/dataverse-client-r
I sat down to try the Dataverse API and dataverse today and ran into HTTP 400 (Bad Request) codes when calling create_dataverse(). As often happens, I figured out the solution while writing up the issue.
Without the dataverse argument, the expected behavior of create_dataverse() is that "a top-level Dataverse is created." I probably don't have the permissions for that on dataverse.harvard.edu, but this isn't relevant for the demonstration:
library("dataverse")
> dv <- create_dataverse()
Error in create_dataverse() : Bad Request (HTTP 400).
> traceback()
3: stop(http_condition(x, "error", task = task, call = call))
2: httr::stop_for_status(r) at create_dataverse.R#27
1: create_dataverse()
Code 400. Stepping through with debug(), the request was:
debug at /home/.../dataverse-client-r/R/create_dataverse.R#26: r <- httr::POST(u, httr::add_headers(`X-Dataverse-key` = key), ...)
Browse[2]> r$request
<request>
POST https://dataverse.harvard.edu/api/dataverses
Output: write_memory
Options:
* useragent: libcurl/7.55.1 r-curl/3.1 httr/1.3.1
* post: TRUE
* postfieldsize: 0
Headers:
* Accept: application/json, text/xml, application/xml, */*
* Content-Type:
* X-Dataverse-key: ####-####
Note the empty body and Content-Type. For completeness, I do have the admin role for the medsl dataverse at harvard.edu, and get the same result with
> r = create_dataverse("medsl")
Error in create_dataverse("medsl") : Bad Request (HTTP 400).
From the Dataverse API docs I see this behavior is actually expected. I'm sending an empty body, and the API minimally requires the fields name, alias, and dataverseContacts.
I can confirm this is the issue by sending a request with httr, using the example content from the docs as the body:
api_url = "https://dataverse.harvard.edu/api/dataverses/medsl"
meta = jsonlite::read_json("http://guides.dataverse.org/en/latest/_downloads/dataverse-complete.json")
r <- httr::POST(api_url, httr::add_headers("X-Dataverse-key" = key), body =
meta, encode = "json")
r$status_code
[1] 201
That's a success, and the new dataverse appears (unpublished) in the GUI. This result can be replicated with create_dataverse() using its dots argument, which is passed through to httr::POST().
# same body; nb encode='json' is required
r = create_dataverse("medsl", body = meta, encode = "json")
> str(r)
chr "{\"status\":\"OK\",\"data\":{\"id\":3131902,\"alias\":\"science\",\"name\":\"Scientific Research\",\"affiliatio"| __truncated__
And that's also successful!
If create_dataverse() will always require body content, it might be worthwhile to move body into its signature as a named argument and handle the encoding. Alternatively, the minimal metadata fields (name, alias, dataverseContacts) could appear in the signature, since passing a named list is a little clunky. It looks something like this (.Names elements added by dput()):
structure(list(name = "Scientific Research", alias = "science", dataverseContacts = list(structure(list(contactEmail = "[email protected]"), .Names = "contactEmail"), structure(list(contactEmail = "[email protected]"), .Names = "contactEmail")), affiliation = "Scientific Research University", description = "We do all the science.", dataverseType = "LABORATORY"), .Names = c("name", "alias", "dataverseContacts", "affiliation", "description", "dataverseType"))
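For readability, the same metadata as the dput() output above can be written as a plain named list (the contact emails are the ones from the example; nothing else is assumed):

```r
# The same minimal-plus-optional metadata as the dput() structure above,
# written as an ordinary named list.
meta <- list(
  name  = "Scientific Research",
  alias = "science",
  dataverseContacts = list(
    list(contactEmail = "pi@example.edu"),
    list(contactEmail = "student@example.edu")
  ),
  affiliation   = "Scientific Research University",
  description   = "We do all the science.",
  dataverseType = "LABORATORY"
)
# jsonlite::toJSON(meta, auto_unbox = TRUE) yields the JSON body the
# POST /api/dataverses/$id endpoint expects.
```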
Thoughts? If you're open to an update I can submit a PR.
Over at IQSS/dataverse#5725 we're talking about a new continuous integration service for the Dataverse community, hosted by UNC (thanks!) at https://jenkins.dataverse.org
We should add a job for this client library, dataverse-client-r.
This Travis build matrix needs to be updated to comply with newer standards. I don't know if it's related to last night's problem: the PR tests passed, but a few minutes later the master build failed. (sudo has no effect anymore.)
https://travis-ci.org/github/IQSS/dataverse-client-r/jobs/664731693/config
https://travis-ci.org/github/IQSS/dataverse-client-r/jobs/664731693
Over at IQSS/IQSSdevtools#1 @christophergandrud just asked if it's possible to create a new version of an existing dataset using dataverse-client-r. It's definitely possible from the Dataverse API side, but I wasn't able to quickly figure out if it's supported by this client or not.
In the current man pages and vignette, the usage of .dta files suggests foreign::read.dta. I would propose switching to haven::read_dta, or at least seeing if all the tests pass with haven. haven is a tidyverse package that has surpassed foreign in recent years (see the download plot below). More importantly, haven can read all Stata dataset versions, whereas foreign is stuck at v12 (Stata is currently at v16).
library(ggplot2)
library(dplyr)
library(dlstats)
dl_stats <- cran_stats(c("haven", "foreign"))
dl_stats %>%
as_tibble() %>%
group_by(package) %>%
slice(-n()) %>%
rename(Package = package) %>%
ggplot(aes(end, downloads, group = Package, color = Package)) +
geom_line(aes(linetype = Package)) + geom_point() +
labs(y = "CRAN downloads",
x = "")
Created on 2019-12-09 by the reprex package (v0.3.0)
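A sketch (untested) of what the switch could look like for a file retrieved through this client: get_file() returns a raw vector, so the bytes are written to a temporary file before haven reads them. The file name and DOI are borrowed from the vignette example discussed elsewhere in this thread.

```r
# Sketch: read a Stata file retrieved by get_file() with haven instead
# of foreign. get_file() returns raw bytes, not a path.
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

raw_bytes <- get_file("alpl2013.tab", "doi:10.7910/DVN/ARKOTI")  # original is Stata
tmp <- tempfile(fileext = ".dta")
writeBin(raw_bytes, tmp)
d <- haven::read_dta(tmp)  # reads Stata v13+ files, unlike foreign::read.dta
```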
@leeper, do you remember a feature we had in DVN 3 where you could upload a zip file and the folder structure would be recorded in the database? People could later download the files as a zip, and DVN would create the zip on the fly with the same folder structure as when the files were uploaded.
Recalculate the UNF value of a dataset version, if it’s missing, by supplying the dataset version database id:
POST http://$SERVER/api/admin/datasets/integrity/{datasetVersionId}/fixmissingunf
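A sketch of the same call from R, assuming a hypothetical `server` and `dataset_version_id` and an admin API token (note that admin endpoints are often restricted to localhost on production installations):

```r
# Sketch: recalculate a missing UNF via the admin endpoint shown above.
# `server` and `dataset_version_id` are placeholders.
server <- "https://demo.dataverse.org"
dataset_version_id <- 123  # the dataset version DATABASE id, not a DOI

r <- httr::POST(
  paste0(server, "/api/admin/datasets/integrity/",
         dataset_version_id, "/fixmissingunf"),
  httr::add_headers(`X-Dataverse-key` = Sys.getenv("DATAVERSE_KEY"))
)
httr::content(r)
```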
Switch to "released" for the main tests, since this is a more important scenario to users than "draft" scenarios.
The "description" field for files is repeated, resulting in a duplicate data.frame column name, which causes all sorts of issues. I'm not sure if this is a problem with the API or the R package, but figured I'd start here. CC @pdurbin
## load package
library("dataverse")
## code goes here
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
colnames(obrien_files)
[1] "description" "label" "restricted"
[4] "version" "datasetVersionId" "categories"
[7] "id" "persistentId" "pidURL"
[10] "filename" "contentType" "filesize"
[13] "description" "storageIdentifier" "rootDataFileId"
[16] "md5" "checksum" "creationDate"
[19] "originalFileFormat" "originalFormatLabel" "originalFileSize"
[22] "UNF" "tabularTags"
## session info for your system
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.0.1 dataverse_0.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 rstudioapi_0.10 xml2_1.2.0 magrittr_1.5
[5] tidyselect_0.2.5 R6_2.4.0 rlang_0.3.4 httr_1.4.1
[9] tools_3.4.3 pkgbuild_1.0.2 cli_1.1.0 withr_2.1.2
[13] remotes_2.1.0 assertthat_0.2.1 rprojroot_1.3-2 tibble_2.1.1
[17] crayon_1.3.4 processx_3.3.0 purrr_0.3.2 callr_3.1.1
[21] ps_1.3.0 curl_3.3 glue_1.3.1 pillar_1.4.2
[25] compiler_3.4.3 backports_1.1.4 prettyunits_1.0.2 jsonlite_1.6
[29] pkgconfig_2.0.2
## load package
library("dataverse")
## code goes here
datasets <- dataverse_search("*", fq = "dateSort:[2018-01-01T00:00:00Z+TO+2019-01-01T00:00:00Z]", type = "dataset", key = "", server = "dataverse.harvard.edu")
## session info for your system
sessionInfo()
I'm proposing a fix for this in #36
I think this is just a question, but it might also be an enhancement/bug report. The Dataverse API allows downloading multiple files as a .zip. This is particularly relevant now, as it preserves the folder structure where available.
There is code in the get_file() function that accesses this functionality, but I don't think it's actually reachable: I find no way of specifying multiple file ids. So, first question: can multiple ids be passed through get_file()'s file parameter?
It looks like there is a new component returned from the dataverse_contents() function. I'll add it to the tests so they pass again.
Error: actual[[1]] not equal to `expected`.
Names: 2 string mismatches
Length mismatch: comparison on first 8 components
Component 7: 1 string mismatch
Component 8: 1 string mismatch
expected <- structure(
list(
id = 396356L,
identifier = "FK2/FAN622",
persistentUrl = "https://doi.org/10.70122/FK2/FAN622",
protocol = "doi",
authority = "10.70122",
publisher = "Demo Dataverse",
storageIdentifier = "file://10.70122/FK2/FAN622",
type = "dataset"
),
class = "dataverse_dataset"
)
> actual[[1]]
Dataset (396356): https://doi.org/10.70122/FK2/FAN622
Publisher: Demo Dataverse
publicationDate: 2020-04-22
> expected
Dataset (396356): https://doi.org/10.70122/FK2/FAN622
Publisher: Demo Dataverse
edit: now I'm pretty sure this is related to "releasing" it (ref #61)
I'm getting a "Not Found (HTTP 404)" error when I try to pull metadata for a DOI that was itself pulled from Dataverse using dataverse_search(). Here's the sequence that leads to the error:
search <- dataverse_search("ICEWS")
md <- dataverse_metadata(search$url[4])
Is this an issue with the R package or Dataverse itself?
There needs to be a different approach to initiating the test suite. Right now tests that should fail... still pass, because testthat::test_check() currently won't run if the API key isn't found as an environment variable.
I'm open to ideas as always. Currently I'm thinking:
Test only against demo.dataverse.org. (A few weeks ago @pdurbin advocated this in a phone call for several reasons, including that Dataverse's retrieval stats won't be misleading, because one article gets hundreds of hits a month just from automated tests.)
Create a (demo) Dataverse account dedicated to testing. At this point, I don't think it needs to be kept secret; there's not really a need to. It could even be set in tests/testthat.R.
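A sketch of what tests/testthat.R could look like under that non-secret demo-account idea (the key value is a placeholder), so that tests run against a known account instead of silently skipping:

```r
# Sketch of tests/testthat.R: fall back to a dedicated (non-secret)
# demo.dataverse.org test account instead of skipping when no key is set.
library("testthat")
library("dataverse")

if (Sys.getenv("DATAVERSE_KEY") == "") {
  Sys.setenv("DATAVERSE_KEY" = "example-demo-key")  # placeholder key
}
Sys.setenv("DATAVERSE_SERVER" = "demo.dataverse.org")

test_check("dataverse")
```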
@pdurbin, will you please check my claim, especially from a security standpoint?
If the above is safe, the API key might be kept in a yaml file in the inst/ directory.
If the API key to the demo server needs to be protected,
@skasberger, @rliebz, @tainguyenbui, and any others, I'd appreciate any advice from your experience with pyDataverse, dataverse-client-python, and dataverse-client-javascript. I'm not experienced with your languages, but it looks like pyDataverse doesn't pass an API key, while dataverse-client-python posts its API key to the demo server.
(This is different from #4 & #29, which involve the battery of tests/comparisons, not the management of API keys or how testthat is initiated.)
It will be useful to have the SWORD v2 client as a separate package. This is in the works, but I want to make a formal note of it here.
User story: As a user or curator, I have a list of files on dataverse and want to add tags and descriptions from a spreadsheet to them.
While it may(?) be possible to update file metadata as part of update_dataset, that's quite messy (and I'm not 100% sure it's even possible; I couldn't get it to work). The native API offers a nice set of functions for this that I think we should implement using the above-noted functions:
http://guides.dataverse.org/en/latest/api/native-api.html#updating-file-metadata
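A sketch (untested) of the native call a wrapper could make. The endpoint and the jsonData form field are taken from the native API guide linked above; the file id, server, and metadata values are hypothetical:

```r
# Sketch: update one file's metadata via the native API
# (POST /api/files/{id}/metadata with a multipart jsonData field).
file_id <- 42  # hypothetical file database id
meta <- list(description = "From the spreadsheet",
             categories  = list("Data"))

r <- httr::POST(
  paste0("https://demo.dataverse.org/api/files/", file_id, "/metadata"),
  httr::add_headers(`X-Dataverse-key` = Sys.getenv("DATAVERSE_KEY")),
  body = list(jsonData = as.character(jsonlite::toJSON(meta, auto_unbox = TRUE))),
  encode = "multipart"
)
httr::stop_for_status(r)
```

A wrapper could then loop this over the rows of the user's spreadsheet.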
Thoughts?
Reported via email:
Sys.setenv("DATAVERSE_SERVER"= "dataverse.harvard.edu")
dataset_metadata('DOI for the dataset', version = ":draft", block = "citation",
key = Sys.getenv("My API Token"),
server = Sys.getenv("DATAVERSE_SERVER"))
# client error: (404) Not Found
@leeper, thanks for developing such a stable and well-designed package. Digging into it during the past week, I'm even more appreciative of its design and consistency. And you left a good roadmap of what the package needs next (eg, #4 & #16), which has helped the transition. It's been fun learning about it and gaining some new tools in my toolbox along the way.
(To those on the repo's watchlist, I promise I won't bend the purpose of GitHub issues too frequently. But I think @leeper deserves the additional recognition.)
ref #21
Hi There,
I was trying out some of the functions of the dataverse package, and one of them returned a 500 error every time I executed it. Unfortunately, I can't share my server and API key, but I can show what I attempted:
## load package
library("dataverse")
## code
meta2<-list(title="test",author="Li,Thomas",datasetContact="Fish, Fishy",dsDescription="FISH",subject="Quantitative Sciences",depositor="Fish, Fishy",dateOfDeposit="Fish Time",datasetContactEmail="[email protected]")
create_dataset("MLPOCs2018",body=meta2)
When running the debugger, I noticed the 500 error occurs when POST() from the 'httr' package is called (which makes sense), so I am trying to find the cause (be it permissions or something else).
Just in case, the sessionInfo() yields the following:
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] zip_2.0.2 flowCore_1.50.0 magick_2.0 ggplot2_3.2.0 usethis_1.5.0 devtools_2.0.2
[7] magrittr_1.5 data.table_1.12.2 dplyr_0.8.1 dataverse_0.2.0
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 remotes_2.1.0 purrr_0.3.2 lattice_0.20-38 pcaPP_1.9-73
[6] colorspace_1.4-1 testthat_2.1.1 stats4_3.6.0 yaml_2.2.0 rlang_0.4.0
[11] pkgbuild_1.0.3 pillar_1.4.1 glue_1.3.1 withr_2.1.2 BiocGenerics_0.30.0
[16] sessioninfo_1.1.1 matrixStats_0.54.0 robustbase_0.93-5 munsell_0.5.0 gtable_0.3.0
[21] mvtnorm_1.0-11 memoise_1.1.0 Biobase_2.44.0 callr_3.2.0 ps_1.3.0
[26] curl_3.3 parallel_3.6.0 DEoptimR_1.0-8 Rcpp_1.0.1 corpcor_1.6.9
[31] backports_1.1.4 scales_1.0.0 desc_1.2.0 pkgload_1.0.2 jsonlite_1.6
[36] graph_1.62.0 fs_1.3.1 digest_0.6.19 processx_3.3.1 grid_3.6.0
[41] rprojroot_1.3-2 cli_1.1.0 lazyeval_0.2.2 tibble_2.1.3 cluster_2.1.0
[46] crayon_1.3.4 rrcov_1.4-7 pkgconfig_2.0.2 MASS_7.3-51.4 xml2_1.2.0
[51] prettyunits_1.0.2 assertthat_0.2.1 httr_1.4.0 rstudioapi_0.10 R6_2.4.0
[56] compiler_3.6.0
If needed, I can also provide the log from the server that shows the "API internal error"
Thanks!
This is the result of recent changes in GitHub and Homebrew.
$ brew install curl
Error:
homebrew-core is a shallow clone.
homebrew-cask is a shallow clone.
To `brew update`, first run:
git -C /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core fetch --unshallow
git -C /usr/local/Homebrew/Library/Taps/homebrew/homebrew-cask fetch --unshallow
This restriction has been made on GitHub's request because updating shallow
clones is an extremely expensive operation due to the tree layout and traffic of
Homebrew/homebrew-core and Homebrew/homebrew-cask. We don't do this for you
automatically to avoid repeatedly performing an expensive unshallow operation in
CI systems (which should instead be fixed to not use shallow clones). Sorry for
the inconvenience!
Warning: You are using macOS 10.13.
We (and Apple) do not provide support for this old version.
You will encounter build failures with some formulae.
Please create pull requests instead of asking for help on Homebrew's GitHub,
Twitter or any other official channels. You are responsible for resolving
any issues you experience while you are running this
old version.
Error: curl: no bottle available!
You can try to install from source with e.g.
brew install --build-from-source curl
Please note building from source is unsupported. You will encounter build
failures with some formulae. If you experience any issues please create pull
requests instead of asking for help on Homebrew's GitHub, Twitter or any other
official channels.
The command "brew install curl" failed and exited with 1 during .
Some discussions
As I mentioned at https://groups.google.com/d/msg/dataverse-community/8WTs3wYF6dc/AzPOxzRKFwAJ there are two new APIs having to do with adding and replacing files that we expect to ship with Dataverse 4.6.1. They address this issue: IQSS/dataverse#1612
Once the code and docs are final, I'll leave a comment on this issue but if anyone wants to kick the tires on the new APIs now, please follow the link to the Google Group above! 😄
tl;dr: Help pages and tests should specify a server, either via Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu") or by passing server = "dataverse.harvard.edu" each time.
Background: The server argument used in most of the functions defaults to Sys.getenv("DATAVERSE_SERVER"), which means that a first-time user gets a "server not specified" error when trying the help-page examples.
So shouldn't all minimal working examples specify a server argument?
Also, this default is inconsistent with the documentation, which says (emphasis mine):
A character string specifying a Dataverse server. There are multiple Dataverse installations, but the default is to use the Harvard Dataverse. This can be modified atomically or globally using
Sys.setenv("DATAVERSE_SERVER" = "dataverse.example.com")
.
I'm attempting to use this client to find the doi of all .R files in a dataverse server, and I've run into a couple of interesting behaviors.
dataverse_search() prints a message to the console, "10 of 3842 results retrieved". However, calling dataverse_search() with a start parameter that would seem to be the last page of the results (start = ceiling(3842/10) = 385) still yields 10 results, and pages beyond that number continue to yield results. How, then, would I determine the appropriate number of pages to request?
Also, dataverse_search() does not return a doi field. The data frame it returns has the following columns:
## [1] "name" "type" "url"
## [4] "file_id" "description" "published_at"
## [7] "file_type" "file_content_type" "size_in_bytes"
## [10] "md5" "checksum" "dataset_citation"
## [13] "unf"
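On the paging question, a sketch of one way to walk through all results, assuming dataverse_search() forwards the Search API's per_page and start parameters through its dots:

```r
# Sketch: page through search results until a short (final) page arrives.
library("dataverse")
out <- list()
start <- 0
per_page <- 100  # the Search API's documented maximum page size

repeat {
  page <- dataverse_search("*", type = "file",
                           per_page = per_page, start = start)
  out[[length(out) + 1]] <- page
  if (nrow(page) < per_page) break  # last page reached
  start <- start + per_page
}
results <- do.call(rbind, out)
```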
For now I can parse the DOI out of the url column with stringr, but having a dedicated doi field would be wonderful.
Add an example (based upon IQSS/dataverse#2739) showing how to archive a git repo using the SWORD API.
You guys might start regretting inviting me to be a maintainer. I'm having trouble reproducing the vignettes, even easy parts like retrieving plain-text R & CSVs.
remotes::install_github("iqss/dataverse-client-r")
#> Skipping install of 'dataverse' from a github remote, the SHA1 (bac89f46) has not changed since last install.
#> Use `force = TRUE` to force installation
library("dataverse")
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("doi:10.7910/DVN/ARKOTI")
# Dataset (75170):
# Version: 1.0, RELEASED
# Release Date: 2015-07-07T02:57:02Z
# License: CC0
# 22 Files:
# label version id contentType
# 1 alpl2013.tab 2 2692294 text/tab-separated-values
# 2 BPchap7.tab 2 2692295 text/tab-separated-values
# 3 chapter01.R 2 2692202 text/plain; charset=US-ASCII
# ...
# 16 drugCoverage.csv 1 2692233 text/plain; charset=US-ASCII
# ...
# Retrieve files by ID
object.size(dataverse::get_file(2692294)) # tab works
#> 211040 bytes
object.size(dataverse::get_file(2692295)) # tab works
#> 61336 bytes
object.size(dataverse::get_file(2692210)) # R fails
#> Error in dataverse::get_file(2692210): Not Found (HTTP 404).
object.size(dataverse::get_file(2692233)) # csv fails
#> Error in dataverse::get_file(2692233): Not Found (HTTP 404).
# Retrieve files by name & doi
object.size(get_file("alpl2013.tab" , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 211040 bytes
object.size(get_file("BPchap7.tab" , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 61336 bytes
object.size(get_file("chapter01.R" , "doi:10.7910/DVN/ARKOTI")) # R fails
#> Error in get_file("chapter01.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
object.size(get_file("drugCoverage.csv" , "doi:10.7910/DVN/ARKOTI")) # csv fails
#> Error in get_file("drugCoverage.csv", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
# Taken straight from https://cran.r-project.org/web/packages/dataverse/vignettes/C-retrieval.html
code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
#> Error in get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
Created on 2019-12-06 by the reprex package (v0.3.0)
Using debug(dataverse::get_file), the error-throwing line in get_file() is:
r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query, ...)
To make things a tad more direct, I called dataverse::get_file(2692233). The two relevant parameters to httr::GET() are
Browse[2]> query
$format
[1] "original"
Browse[2]> u
[1] "https://dataverse.harvard.edu/api/access/datafile/2692233"
The r value returned is
Response [https://dataverse.harvard.edu/api/access/datafile/2692233?format=original]
Date: 2019-12-07 05:13
Status: 404
Content-Type: application/json
Size: 201 B
That u value works fine when pasted into Chrome. I saw several Dataverse discussions about a trailing /; when I added one, the response looks good.
Browse[2]> u2 <- paste0("https://dataverse.harvard.edu/api/access/datafile/2692233", "/")
Browse[2]> httr::GET(u2, httr::add_headers(`X-Dataverse-key` = key), ... )
Response [https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/ARKOTI/14e66408488-c678717f7c4d?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27drugCoverage.csv&response-content-type=text%2Fplain%3B%20charset%3DUS-ASCII&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191207T051632Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20191207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=c1b13a7d3ea2a53c1c1e70c18a762ae0e4ae14eb41fae7d79c71fce26a9b354f]
Date: 2019-12-07 05:16
Status: 200
Content-Type: text/plain; charset=US-ASCII
Size: 4.06 kB
I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to the change in curl released 4 days ago?
Why are csv & R files affected, but not tab files? As I step through a tab file (e.g., dataverse::get_file(2692294)), it appears the exact same lines are executed, and that u value doesn't have a trailing slash (https://dataverse.harvard.edu/api/access/datafile/2692294). I see two differences: (a) the content type, and (b) this one doesn't go through AWS/S3.
Response [https://dataverse.harvard.edu/api/access/datafile/2692294?format=original]
Date: 2019-12-07 05:35
Status: 200
Content-Type: application/x-stata; name="alpl2013.dta"
Size: 211 kB
<BINARY BODY>
This is probably related to @EdJeeOnGitHub's recent issue #31. Notice he mentions problems with certain file formats.
Is this related at all to IQSS/dataverse#3130, IQSS/dataverse#2559, or IQSS/dataverse#4196?
You can see that my knowledge of the web side of this is limited; I don't understand these issues that well.
Session info ---------------------------------------------------------------------------
setting value
version R version 3.6.1 Patched (2019-08-12 r76979)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/Chicago
date 2019-12-06
Packages -------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
callr 3.3.2 2019-09-22 [1] CRAN (R 3.6.1)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
clipr 0.7.0 2019-07-23 [1] CRAN (R 3.6.1)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
curl 4.3 2019-12-02 [1] CRAN (R 3.6.1)
dataverse * 0.2.1 2019-12-07 [1] Github (bac89f4)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
knitr 1.26 2019-11-12 [1] CRAN (R 3.6.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.0)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.1)
rmarkdown 1.18 2019-11-27 [1] CRAN (R 3.6.1)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.1)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
whisker 0.4 2019-08-28 [1] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
xfun 0.11 2019-11-12 [1] CRAN (R 3.6.1)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.1)
Hi! At IQSS/dataverse#2700 (comment) @monogan reported some trouble getting the dataverse package installed on Windows and I was thinking that the readme of this repo would explain the best way to get support but it's non-obvious to me. Can this be clarified in the readme? Thanks!
p.s. If @monogan could get some assistance that would be great as well!
Starting with R 4.0, data frames no longer convert string variables to factors by default. This won't change much for the clients, but the tests break with the different data type:
> str(expected)
List of 4
$ title : chr "dataverse-client-r"
$ generator : list()
..- attr(*, "uri")= chr "http://www.swordapp.org/"
..- attr(*, "version")= chr "2.0"
$ dataverseHasBeenReleased: chr "false"
$ datasets :'data.frame': 1 obs. of 2 variables:
..$ NA: Factor w/ 1 level "Bulls Roster 1996-1997": 1
..$ NA: Factor w/ 1 level "https://demo.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.70122/FK2/FAN622": 1
- attr(*, "class")= chr "dataverse_dataset_list"
> str(actual)
List of 4
$ title : chr "dataverse-client-r"
$ generator : list()
..- attr(*, "uri")= chr "http://www.swordapp.org/"
..- attr(*, "version")= chr "2.0"
$ dataverseHasBeenReleased: chr "true"
$ datasets :'data.frame': 1 obs. of 2 variables:
..$ title: chr "Bulls Roster 1996-1997"
..$ id : chr "https://demo.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.70122/FK2/FAN622"
- attr(*, "class")= chr "dataverse_dataset_list"
Sys.setenv("DATAVERSE_SERVER" = "demo.dataverse.org")
Sys.setenv("DATAVERSE_KEY" = "c7208dd2-6ec5-469a-bec5-f57e164888d4")
dv <- get_dataverse("dataverse-client-r")
list_datasets(dv)
The error is
Error in if (x$dataverseHasBeenReleased[[1]] == "true") "Yes" else "No" :
argument is of length zero
The nested structure of the object has changed, so this function is failing:
Lines 3 to 8 in 697180e
The feed level needs to be added, so x$dataverseHasBeenReleased[[1]] becomes x$feed$dataverseHasBeenReleased[[1]]? But something with dispatching isn't going as I expect: the full structure of x from list_datasets() isn't being recognized by print.dataverse_dataset_list().
Edit: I think print() is fine. It's a problem with the tail end of list_datasets(); these objects are nested in x$feed, not just x.
Lines 82 to 85 in 697180e
@adam3smith, @kuriwaki, @pdurbin, and anyone else,
Should get_file() be refactored into multiple child functions? It seems like we're asking it to do a lot of things, including returning a data.frame or tibble.
I like all these capabilities, and want to discuss organizational ideas with people so the package structure is (a) easy for us to develop, test, & maintain, and (b) useful and natural for users to learn and incorporate.
One possible approach:
A foundational function retrieves the file(s) by ID; it is the workhorse that actually retrieves the file. A second function accepts the file name (not ID); it essentially wraps the first function after calling get_fileid(). Both of these functions deal with a single file at a time.
Another pair of functions deals with multiple files (one by name, one by id), but these return lists, not a single object. They're essentially lapply() loops around their respective single-file siblings described above.
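A sketch of that pairing, using the hypothetical names proposed in this issue (the singular workhorses are assumed to exist):

```r
# Sketch: the plural functions are thin lapply() wrappers around the
# (hypothetical) singular workhorses, returning a list of results.
get_files_by_id <- function(ids, ...) {
  lapply(ids, get_file_by_id, ...)
}

get_files_by_name <- function(file_names, dataset, ...) {
  lapply(file_names, get_file_by_name, dataset = dataset, ...)
}
```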
To avoid breaking the package interface, maybe the existing get_file() keeps its same interface (which ambiguously accepts either file names or ids and returns either a single file or a list of files), but we soft-deprecate it and encourage new code to use the more explicit functions? The guts of the function are moved out into the four new functions.
Maybe the function names are:
get_file() with an unchanged interface
get_file_by_id() (the workhorse)
get_file_by_name()
get_files_by_id()
get_files_by_name()
get_tibble_by_id()
get_tibble_by_name()
get_zip_by_id()
get_zip_by_name()
get_file_by_doi() (see @adam3smith's comment)
I'm pretty sure it would be easier to write tests that isolate problems. The documentation becomes more verbose, but probably more straightforward.
You guys have more experience with Dataverse than I do, and a better sense of the use cases. Would this reorganization help users? If not, maybe we still split it into multiple functions but keep functions 2-5 private.
Maybe I'm making this unnecessarily tedious, but I'm thinking that these download functions are the ones R users call most, and they're certainly the ones new users call first. So if they leave a bad impression, the package is less likely to be used.
@adam3smith & @kuriwaki, I overlooked your recent PRs. You should be recognized for your contributions.
(@EdJeeOnGitHub & @billy34, I haven't forgotten about your PRs.)
Will you please create separate PRs and
I'm using R Packages for the definition/distinction between the "aut" and "ctb".
Hi, the installation instructions say one should use 'ghit'. I'm not sure what that is; I couldn't find it through Google, nor does the provided code work on my Mac. The usual approach does work:
library(devtools) #install first if not yet installed
install_github("iqss/dataverse-client-r")
library("dataverse")
@pdurbin and I discussed possible ways to reduce the effort each of the Dataverse client developers spends creating and maintaining tests. It might be nice if
@skasberger, @rliebz, @tainguyenbui, and any others, please tell me if this isn't worth it, or there's a better approach, etc.
(This is different from #4 & #29, which involve the battery of tests/comparisons. #40 deals with the deployment of API keys used in testing.)
There is no DOI for the dataset that I want to grab
http://hdl.handle.net/10622/DN9QDM
Is it possible to get it somehow? If not, is it possible to add the feature of specifying the dataset by the hdl code?
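Until the client supports handles directly, a sketch (untested) of a direct call may work: the native API's :persistentId endpoints accept handle identifiers as well as DOIs. The server hostname below is a placeholder; use the installation that hosts the dataset.

```r
# Sketch: fetch a dataset by its handle via the native API.
server <- "https://demo.dataverse.org"  # placeholder hostname

r <- httr::GET(
  paste0(server, "/api/datasets/:persistentId"),
  query = list(persistentId = "hdl:10622/DN9QDM")
)
httr::content(r)
```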
As I've expressed with a few tweaks:
I'd (like to) modify some of the package's structure to more closely resemble the approach described in Hadley's R Packages book (and the tidyverse style guide). The first reason is that I think it's a good approach and it has worked well for me in the past. A second reason is that this approach is documented better and more coherently than any other, so it will be easier for someone else to modify the package or become maintainer in the future.
But this won't be a major overhaul. Most of the important elements of the existing package follow this book's approach.
The changes come in two sets: the little guys that come before adding a test suite (issues #4 & #29) and those that come after the test suite (which includes some of the recently accumulating issues like #33, #35, & #37).
- Rproj file
- update Roxygen
- update the .Rbuildignore and .gitignore files
- move daily development off master and onto a dev branch
- update person entries in the DESCRIPTION file
- semantic versioning for dev versions (e.g., x.x.x.9001)
- remove the date from the DESCRIPTION file
- (re)connect Codecov, which currently reads 0%. (@pdurbin, as an admin of this GitHub org, you'll probably see my request for additional permissions granted to Codecov. I requested it only for this single public repo.)
I'm flexible if anyone has other ideas.
Thank you for this package. Do you plan an additional CRAN release soon, to publish the fixes and enhancements since the latest release (2017)? Thanks in advance.
@adam3smith wrote in #49 (comment)
A different approach would be to make it easier to construct native API queries from scratch using the client so that we don't have to code everything but it's more readily accessible. I don't know how hard that would be, but when I wrote API functionality (for downloading .zip files) locally it required quite a bit of code duplication from dataverse library.
We encountered something similar in REDCapR. It's a similar package to this one, in the sense that it's basically a wrapper around curl calls to a server's API. It then adds convenience functions and validation to make R development easier, because it returns R's native data.frame objects.
When we saw all the duplication in REDCapR, we refactored that core functionality into kernel_api(). That central location makes it easier to improve things consistently, like character encoding and error handling. It returns the raw result to the calling function, which decides whether to save it as a local file or return a data.frame.
In dataverse::get_file(), I see the duplication @adam3smith is talking about, with three instances similar to
u <- paste0(api_url(server), "access/datafile/", file, "/metadata/", format)
r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
httr::stop_for_status(r)
This could become its own function like
kernel <- function(command, server, key, ...) {
u <- paste0(api_url(server), command)
r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
httr::stop_for_status(r)
r
}
The code inside dataverse::get_file_metadata() would look like
command <- paste0("access/datafile/", file, "/metadata/", format)
kernel(command, server, key)
The code inside the current dataverse::get_file() would have three different calls:
command <- paste0("access/datafiles/", file)
kernel(command, server, key)
...
command <- paste0("access/datafile/bundle/", file)
kernel(command, server, key)
...
command <- paste0("access/datafile/", file)
kernel(command, server, key)
...
@adam3smith, @kuriwaki, or anyone else, please share your impressions if you'd like. It looks like @skasberger and pyDataverse do something very similar with their get_request().
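Going one step further, the shared kernel could also centralize JSON decoding, so every caller receives a parsed R list rather than a raw httr response. A sketch with a hypothetical name, kernel_json(); it assumes the package's existing api_url() helper plus the httr and jsonlite packages:

```r
# Hypothetical extension of the proposed kernel(): decode the JSON body
# in one place so character encoding and parsing are handled consistently.
# Assumes api_url() from the package is available at call time.
kernel_json <- function(command, server, key, ...) {
  u <- paste0(api_url(server), command)
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
  httr::stop_for_status(r)
  jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
}
```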
There's a subtle & trivial difference in the tests for get_dataverse(), which may be related to new versions of tibble or R 4.0.
> expect_equal(actual, expected)
Error: `actual` not equal to `expected`.
Component “datasets”: Attributes: < Names: 1 string mismatch >
Component “datasets”: Attributes: < Length mismatch: comparison on first 2 components >
Component “datasets”: Attributes: < Component 2: Modes: character, numeric >
Component “datasets”: Attributes: < Component 2: target is character, current is numeric >
> actual
Dataverse name: Demo Dataverse
Released? Yes
data frame with 0 columns and 0 rows
> expected
Dataverse name: Demo Dataverse
Released? Yes
data frame with 0 columns and 0 rows
> str(expected)
List of 4
$ title : chr "Demo Dataverse"
$ generator : list()
..- attr(*, "uri")= chr "http://www.swordapp.org/"
..- attr(*, "version")= chr "2.0"
$ dataverseHasBeenReleased: chr "true"
$ datasets :'data.frame': 0 obs. of 0 variables
- attr(*, "class")= chr "dataverse_dataset_list"
> str(actual)
List of 4
$ title : chr "Demo Dataverse"
$ generator : list()
..- attr(*, "uri")= chr "http://www.swordapp.org/"
..- attr(*, "version")= chr "2.0"
$ dataverseHasBeenReleased: chr "true"
$ datasets :'data.frame': 0 obs. of 0 variables
- attr(*, "class")= chr "dataverse_dataset_list"
> attributes(actual$datasets)
$names # This is new
character(0)
$class
[1] "data.frame"
$row.names
integer(0)
> attributes(expected$datasets)
$class
[1] "data.frame"
$row.names
integer(0)
Somehow the testing dataverse contents were deleted.
I'll
I'm opening this issue because I just opened an equivalent issue for the Python client: IQSS/dataverse-client-python#29
What I'm really after is a way to address this issue that @monogan opened at IQSS/dataverse#2700 in which he's trying to figure out what to write in a book about R with regard to how to download files.
Ideally (in my mind), the R package for Dataverse would provide some insulation between the readers of his book and the Dataverse APIs. The book would say, "Install the dataverse package from CRAN and download the file by..." That way, even if the APIs change a bit, future readers of his book will download the latest version of the R package from CRAN and it will still "just work".
I think that the only things users should need to download a file are the DOI of the dataset and the filename. The dataverse package can do the rest. :) It would be way cleaner than my hack: IQSS/dataverse@812424a .
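As a rough illustration of that interface, a wrapper along these lines could be built from the current exports. The name get_file_by_name() is hypothetical, and it assumes get_dataset() returns a files data frame with label and id columns, as in the examples elsewhere in these issues:

```r
# Hypothetical convenience wrapper: fetch a file's raw bytes given only
# the dataset DOI and the file name, by looking up the numeric id first.
get_file_by_name <- function(filename, doi, ...) {
  ds <- get_dataset(doi, ...)
  id <- ds$files$id[ds$files$label == filename]
  if (length(id) != 1) {
    stop("File '", filename, "' not found in dataset ", doi)
  }
  get_file(id, ...)
}
```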
I often find that get_file() errors because the API can't return a certain file format. Since the Dataverse API returns informative messages when a request fails, it makes sense to pass these along in R instead of just the default from httr::stop_for_status().
For instance:
library(dataverse)
library(tidyverse)
IPA_dataset <- get_dataset("doi:10.7910/DVN/JGLOZF")
# doi: https://doi.org/10.7910/DVN/JGLOZF
community_files <- IPA_dataset$files$id %>%
map(function(x){
print(x)
return(get_file(x))
})
After the PR, it will return:
Not Found (HTTP 404). Failed to '/api/v1/access/datafile/2972336' datafile access error: requested optional service (image scaling, format conversion, etc.) is not supported on this datafile.
instead of:
Error in get_file(x) : Not Found (HTTP 404).
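A minimal sketch of how that richer error could be raised, assuming the API's error body is JSON with a message field; the helper name stop_with_api_message() is hypothetical:

```r
# Hypothetical helper: surface the Dataverse API's own error message
# alongside the generic HTTP status from httr.
stop_with_api_message <- function(r) {
  if (!httr::http_error(r)) return(invisible(r))
  msg <- httr::http_status(r)$message
  # Assumption: failed requests carry a JSON body with a `message` field.
  body <- tryCatch(
    jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8")),
    error = function(e) NULL
  )
  if (!is.null(body$message)) msg <- paste(msg, body$message)
  stop(msg, call. = FALSE)
}
```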
@skasberger kindly offered assistance to the package and its tests (@skasberger, I have it on record so you can't back out).
From my initial look, it appears that I could port/translate a lot of the tests and flat-out steal a lot of the json that's fed to the tests.
@skasberger, are there pyDataverse tests that compare data returned from Dataverse against an expected set of values?
If not, maybe that set of expected values is something that the R & Python libraries could build together. The R data.frame/list could be serialized to JSON and compared against a JSON file; but that's probably too reliant on the R & Python JSON libraries producing the exact same plain-text file.
Alternatively, the json expected files (that are common between Python & R) could be read into the languages' dict/list objects (e.g., by Python's json and R's jsonlite) and compared there. I'm guessing this approach is less brittle and more reusable across languages?
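To illustrate the parsed-object comparison in R (with jsonlite; the literal JSON strings below stand in for the shared expected-values files):

```r
library(jsonlite)

# Parse both the expected-values file and the live response, then compare
# the resulting R objects; key order in the raw JSON no longer matters.
expected <- fromJSON('{"title":"Demo Dataverse","released":true}')
actual   <- fromJSON('{"released":true,"title":"Demo Dataverse"}')
identical(expected[order(names(expected))], actual[order(names(actual))])
#> [1] TRUE
```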
To @skasberger & anyone else, I'm not very familiar with Dataverse or Python package development, and I'm open to any ideas how to do things differently or leverage each package's efforts more efficiently.
When trying to get versions from my dataset using the dataset_versions() function, I get an error:
Error: $ operator is invalid for atomic vectors
Looking at the code, we can see that the JSON decoding of the response is missing. I'll file a PR to correct this.
[EDIT by @kuriwaki -- add reprex]
library(dataverse)
# this works
get_dataset(dataset = "doi:10.70122/FK2/PPIAXE", server = "demo.dataverse.org")
#> Dataset (182162):
#> Version: 1.1, RELEASED
#> Release Date: 2020-12-30T00:00:24Z
#> License: CC0
#> 22 Files:
#> label version id contentType
#> 1 nlsw88_rds-export.rds 1 1734016 application/octet-stream
#> 2 nlsw88.tab 3 1734017 text/tab-separated-values
# but not this
dataset_versions(dataset = "doi:10.70122/FK2/PPIAXE", server = "demo.dataverse.org")
#> Error: $ operator is invalid for atomic vectors
Created on 2021-01-22 by the reprex package (v0.3.0)
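The missing step is decoding the response body before indexing into it. A self-contained sketch of the fix, with a canned response body standing in for the live API reply (the parse_versions() name is illustrative, not the package's):

```r
# Sketch of the fix: `$` must be applied to the parsed list, not to the
# raw character body that httr::content(r, as = "text") returns.
parse_versions <- function(json_text) {
  out <- jsonlite::fromJSON(json_text, simplifyDataFrame = FALSE)
  out$data
}
canned <- '{"status":"OK","data":[{"versionNumber":1,"versionState":"RELEASED"}]}'
parse_versions(canned)
```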
This package is not being actively maintained. If you're interested in contributing or taking over, please express your interest here.
What the issue is about:
Issue: I think most users who want to get data from the R dataverse package want to start working with the data in their R environment right away. However, get_file() only returns raw binary output, which is not usable on its own.
Proposal: The help page shows how to write the class-raw object to a temp file and read it back in. The proposed feature is to add an optional argument to get_file(), or a new function, that does this write-and-read-back process automatically. Users will supply a function that will be used to read in the tempfile. An example function that does this is below.
How does this sound?
# hide my key
library(dataverse)
# function ----
# @param file to be passed on to get_file
# @param dataset to be passed on to get_file
# @param read_function If supplied a function object, this will write the
# raw file to a tempfile and read it back in with the supplied function. This
# is useful when you want to start working with the data right away in the R
# environment
get_file_addon <- function(file,
dataset = NULL,
read_function = NULL,
...) {
raw_file <- get_file(file, dataset)
# default of get_file
if (is.null(read_function))
return(raw_file)
# save to temp and then read it in with supplied function
if (!is.null(read_function)) {
tmp <- tempfile(file, fileext = stringr::str_extract(file, "\\.[A-Za-z]+$"))
writeBin(raw_file, tmp)
return(do.call(read_function, list(tmp)))
}
}
# read in two non-tab ingested files ----
cces_dta <- get_file_addon(file = "cumulative_2006_2018.dta",
dataset = "10.7910/DVN/II2DB6",
read_function = haven::read_dta)
cces_rds <- get_file_addon(file = "cumulative_2006_2018.Rds",
dataset = "10.7910/DVN/II2DB6",
read_function = readr::read_rds)
class(cces_dta)
#> [1] "tbl_df" "tbl" "data.frame"
class(cces_rds)
#> [1] "tbl_df" "tbl" "data.frame"
dim(cces_dta)
#> [1] 452755 73
dim(cces_rds)
#> [1] 452755 73
Created on 2019-12-16 by the reprex package (v0.3.0)
It looks like Travis has not passed in two years (i.e., Oct 2017 was the last full success).
Things pass on my local machine (which is Ubuntu, like Travis), so I think the problem likely lies with the Travis config rather than the R machinery. I'll try to