
dataverse-client-r's People

Contributors

adam3smith · danny-dk · edjeeongithub · jankanis · jbgruber · kuriwaki · leeper · pdurbin · sindribaldur · wibeasley


dataverse-client-r's Issues

Creating a dataverse fails with HTTP 400: Bad Request

I sat down to try the Dataverse API and the dataverse package today and ran into HTTP 400 (Bad Request) responses when calling create_dataverse(). As often happens, I figured out the solution while writing up the issue.

problem

Without the dataverse argument, the expected behavior of create_dataverse() is that "a top-level Dataverse is created." I probably don't have the permissions for that on dataverse.harvard.edu, but that isn't relevant to the demonstration:

library("dataverse")
> dv <- create_dataverse()
Error in create_dataverse() : Bad Request (HTTP 400).
> traceback()
3: stop(http_condition(x, "error", task = task, call = call))
2: httr::stop_for_status(r) at create_dataverse.R#27
1: create_dataverse()

Code 400. Stepping through with debug() the request was

debug at /home/.../dataverse-client-r/R/create_dataverse.R#26: r <- httr::POST(u, httr::add_headers(`X-Dataverse-key` = key), ...)
Browse[2]> r$request
<request>
POST https://dataverse.harvard.edu/api/dataverses
Output: write_memory
Options:
* useragent: libcurl/7.55.1 r-curl/3.1 httr/1.3.1
* post: TRUE
* postfieldsize: 0
Headers:
* Accept: application/json, text/xml, application/xml, */*
* Content-Type:
* X-Dataverse-key: ####-####

Note the empty body and Content-Type. For completeness: I do have the admin role for the medsl dataverse on dataverse.harvard.edu, and I get the same result with

> r = create_dataverse("medsl")
Error in create_dataverse("medsl") : Bad Request (HTTP 400).

solution

From the Dataverse API docs I see this behavior is actually expected. I'm sending an empty body, and the API minimally requires the fields name, alias, and dataverseContacts.

I can confirm this is the issue by sending a request with httr, using the example content from the docs as body content:

api_url <- "https://dataverse.harvard.edu/api/dataverses/medsl"
# example metadata from the API guide
meta <- jsonlite::read_json("http://guides.dataverse.org/en/latest/_downloads/dataverse-complete.json")

r <- httr::POST(api_url, httr::add_headers("X-Dataverse-key" = key),
                body = meta, encode = "json")
r$status_code
[1] 201

That's a success, and the new dataverse appears in the GUI as unpublished. This result can be replicated with create_dataverse() using its dots argument, which is passed through to httr::POST().

# same body; nb encode='json' is required
r = create_dataverse("medsl", body = meta, encode = "json")  
> str(r)
 chr "{\"status\":\"OK\",\"data\":{\"id\":3131902,\"alias\":\"science\",\"name\":\"Scientific Research\",\"affiliatio"| __truncated__

And that's also successful!

If create_dataverse() will always require body content, it might be worthwhile to move body into its signature as a new named argument and handle the encoding there. Alternatively, the minimal metadata fields (name, alias, dataverseContacts) could appear in the signature, since passing a named list is a little clunky. The list looks something like this (tidied from dput() output):

list(
  name = "Scientific Research",
  alias = "science",
  dataverseContacts = list(
    list(contactEmail = "[email protected]"),
    list(contactEmail = "[email protected]")
  ),
  affiliation = "Scientific Research University",
  description = "We do all the science.",
  dataverseType = "LABORATORY"
)
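As a rough sketch of the second option, a hypothetical helper (build_dataverse_body() is not in the package, and the example emails are made up) could assemble the minimal required fields into the nested list the API expects:

```r
# Hypothetical helper (not in the package): assemble the minimal metadata
# the native API requires into the nested list structure shown above.
build_dataverse_body <- function(name, alias, contact_emails, ...) {
  list(
    name  = name,
    alias = alias,
    # each contact is a one-element list holding a `contactEmail` field
    dataverseContacts = lapply(contact_emails, function(e) list(contactEmail = e)),
    ...  # optional extras such as affiliation, description, dataverseType
  )
}

meta <- build_dataverse_body(
  name           = "Scientific Research",
  alias          = "science",
  contact_emails = c("pi@science.example", "lab@science.example"),
  affiliation    = "Scientific Research University"
)
# then, as above: create_dataverse("medsl", body = meta, encode = "json")
```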

Thoughts? If you're open to an update I can submit a PR.

update Travis config

The Travis build matrix needs to be updated to comply with newer standards.

I don't know if it's related to last night's problem. The PR tests passed, but a few minutes later the master build failed.

  • root: deprecated key sudo (The key sudo has no effect anymore.)
  • root: missing dist, using the default xenial
  • root: missing os, using the default linux
  • root: key matrix is an alias for jobs, using jobs


https://travis-ci.org/github/IQSS/dataverse-client-r/jobs/664731693/config

  • Also, the osx image should be updated again. (This came up in November.)


https://travis-ci.org/github/IQSS/dataverse-client-r/jobs/664731693

Expand documentation

  • Vignette demonstrating data access and download
  • Package documentation
    • Native API
    • Search API. This is quite complicated and possibly installation-dependent (see IQSS/dataverse#2558). This should be accommodated, for the time being, on the documentation level only. It can be expanded later.
    • Data access
    • Data deposit/SWORD API
  • README examples

Reading Stata suggestion - change foreign to haven?

In the current man pages and vignette, the examples for .dta files suggest foreign::read.dta. I propose switching to haven::read_dta, or at least checking whether all the tests pass with haven. haven is a tidyverse package that has surpassed foreign in downloads in recent years (see below). More importantly, haven can read all Stata dataset versions, whereas foreign stops at v12 (Stata is currently at v16).

library(ggplot2)
library(dplyr)
library(dlstats)


dl_stats <- cran_stats(c("haven", "foreign"))

dl_stats %>% 
  as_tibble() %>% 
  group_by(package) %>% 
  slice(-n()) %>%            # drop the current, incomplete month
  rename(Package = package) %>% 
  ggplot(aes(end, downloads, group = Package, color = Package)) +
  geom_line(aes(linetype = Package)) + geom_point() +
  labs(y = "CRAN downloads", x = "")

Created on 2019-12-09 by the reprex package (v0.3.0)
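If the switch happens, the vignette pattern would change roughly as follows: get_file() returns a raw vector, while haven::read_dta() reads from a file path, so a small shim bridges the two (raw_to_path() below is a hypothetical helper, not a package function):

```r
# Hypothetical shim: write the raw vector returned by get_file() to a
# temporary file so haven::read_dta() can read it from disk.
raw_to_path <- function(raw_bytes, ext = ".dta") {
  path <- tempfile(fileext = ext)
  writeBin(raw_bytes, path)
  path
}

# old: foreign::read.dta(...)
# new (sketch): df <- haven::read_dta(raw_to_path(get_file(<file>, <doi>)))
```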

Upload and download a zip file, maintaining folder structure

@leeper, do you remember a feature we had in DVN 3 where you could upload a zip file and the folder structure would be recorded in the database? People could then come along later and download the files as a zip, and DVN would create the zip on the fly with the same folder structure as when the files were uploaded.

Add function to recalculate UNF

Recalculate the UNF value of a dataset version, if it’s missing, by supplying the dataset version database id:

POST http://$SERVER/api/admin/datasets/integrity/{datasetVersionId}/fixmissingunf
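A wrapper for this would mostly be URL assembly plus an authenticated POST in the style the package already uses. A sketch (the function name is hypothetical):

```r
# Hypothetical sketch: build the admin endpoint URL for recalculating a
# missing UNF; the actual request would use httr::POST() with the usual
# X-Dataverse-key header, following the package's conventions.
unf_fix_url <- function(server, dataset_version_id) {
  paste0("https://", server,
         "/api/admin/datasets/integrity/", dataset_version_id, "/fixmissingunf")
}

# e.g. httr::POST(unf_fix_url(server, id), httr::add_headers(`X-Dataverse-key` = key))
```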

Duplicate file description

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

The "description" field for files is repeated, resulting in a duplicate data.frame column name, which causes all sorts of issues. Not sure whether this is a problem with the API or the R package, but I figured I'd start here. CC @pdurbin

## load package
library("dataverse")


## code goes here
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

obrien_files <- get_dataset("doi:10.7910/DVN/WOT075")[['files']]
colnames(obrien_files)

 [1] "description"         "label"               "restricted"         
 [4] "version"             "datasetVersionId"    "categories"         
 [7] "id"                  "persistentId"        "pidURL"             
[10] "filename"            "contentType"         "filesize"           
[13] "description"         "storageIdentifier"   "rootDataFileId"     
[16] "md5"                 "checksum"            "creationDate"       
[19] "originalFileFormat"  "originalFormatLabel" "originalFileSize"   
[22] "UNF"                 "tabularTags"

## session info for your system
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.0.1   dataverse_0.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        rstudioapi_0.10   xml2_1.2.0        magrittr_1.5     
 [5] tidyselect_0.2.5  R6_2.4.0          rlang_0.3.4       httr_1.4.1       
 [9] tools_3.4.3       pkgbuild_1.0.2    cli_1.1.0         withr_2.1.2      
[13] remotes_2.1.0     assertthat_0.2.1  rprojroot_1.3-2   tibble_2.1.1     
[17] crayon_1.3.4      processx_3.3.0    purrr_0.3.2       callr_3.1.1      
[21] ps_1.3.0          curl_3.3          glue_1.3.1        pillar_1.4.2     
[25] compiler_3.4.3    backports_1.1.4   prettyunits_1.0.2 jsonlite_1.6     
[29] pkgconfig_2.0.2  
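Until the duplicated "description" column is fixed upstream, a client-side workaround is to de-duplicate the column names before any further data.frame operations; base R's make.unique() suffices. A sketch (dedupe_names() is hypothetical, not a package function):

```r
# Workaround sketch: rename duplicated columns ("description" appears twice
# in the files data.frame) so downstream operations stop failing.
dedupe_names <- function(df) {
  names(df) <- make.unique(names(df), sep = "_")
  df
}

# e.g. obrien_files <- dedupe_names(get_dataset("doi:10.7910/DVN/WOT075")[["files"]])
```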

fq for dataverse_search is broken

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

Put your code here:

## load package
library("dataverse")

## code goes here
datasets <- dataverse_search("*", fq = "dateSort:[2018-01-01T00:00:00Z+TO+2019-01-01T00:00:00Z]", type = "dataset", key = "", server = "dataverse.harvard.edu")


## session info for your system
sessionInfo()

I'm proposing a fix for this in #36
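For context, the Solr range syntax expects literal spaces around TO; the `+` signs are URL encoding that should be applied once by the HTTP layer, not hand-baked into the fq string. A sketch of the intended behavior (assuming the fix leaves encoding to the request layer):

```r
# Write the Solr range query with literal spaces; encoding happens once,
# at request time, instead of pre-encoding spaces as "+" in the fq string.
fq <- "dateSort:[2018-01-01T00:00:00Z TO 2019-01-01T00:00:00Z]"
encoded <- utils::URLencode(fq, reserved = TRUE)
```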

Downloading multiple files

Please specify whether your issue is about:

  • a question about package functionality

I think this is just a question, but it might also be an enhancement/bug report: the Dataverse API allows downloading multiple files as a .zip. This is particularly relevant now because it preserves the folder structure where available.
There is code in the get_file() function that accesses this functionality, but I don't think it's actually reachable: I can find no way of specifying multiple file ids.

So first question:

  1. Am I right about this? Or could someone give me syntax to do this in get_file()?
  2. If I'm right that this isn't possible, what would be a good way to do this? Allow a vector of ids as input for the file parameter?
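For reference, the native API's bulk endpoint takes a comma-separated list of file ids (GET /api/access/datafiles/$id1,$id2,…) and returns a zip. If get_file() accepted a vector for its file parameter, the URL assembly could look roughly like this (sketch; the function name is hypothetical):

```r
# Sketch: build the bulk-download URL from a vector of file ids.
# The /api/access/datafiles/ endpoint returns the files as one zip.
datafiles_zip_url <- function(server, file_ids) {
  paste0("https://", server, "/api/access/datafiles/",
         paste(file_ids, collapse = ","))
}
```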

new `publicationDate` element

It looks like there is a new component returned from the dataverse_contents() function. I'll add it to the tests so they pass again.

Error: actual[[1]] not equal to `expected`.
Names: 2 string mismatches
Length mismatch: comparison on first 8 components
Component 7: 1 string mismatch
Component 8: 1 string mismatch
expected  <- structure(
  list(
    id                = 396356L,
    identifier        = "FK2/FAN622",
    persistentUrl     = "https://doi.org/10.70122/FK2/FAN622",
    protocol          = "doi",
    authority         = "10.70122",
    publisher         = "Demo Dataverse",
    storageIdentifier = "file://10.70122/FK2/FAN622",
    type              = "dataset"
  ),
  class = "dataverse_dataset"
)

> actual[[1]]
Dataset (396356): https://doi.org/10.70122/FK2/FAN622
Publisher: Demo Dataverse
publicationDate: 2020-04-22

> expected
Dataset (396356): https://doi.org/10.70122/FK2/FAN622
Publisher: Demo Dataverse

edit: now I'm pretty sure this is related to "releasing" it (ref #61)

HTTP 404 error for doi pulled from Dataverse

I'm getting a "Not Found (HTTP 404)" error when I try to pull metadata for a doi that was itself pulled from Dataverse using dataverse_search. Here's the sequence that leads to the error:

search <- dataverse_search("ICEWS")
md <- dataverse_metadata(search$url[4])

Is this an issue with the R package or Dataverse itself?
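A likely cause: dataverse_search() returns a resolver URL (https://doi.org/…) in its url column, while the metadata endpoints expect a persistent identifier of the form doi:…. A conversion sketch, assuming that is the mismatch (the helper name is hypothetical):

```r
# Sketch: convert a DOI resolver URL into the "doi:" persistent-id form
# that dataverse_metadata() and friends expect.
doi_url_to_persistent_id <- function(url) {
  sub("^https?://(dx\\.)?doi\\.org/", "doi:", url)
}

# e.g. md <- dataverse_metadata(doi_url_to_persistent_id(search$url[4]))
```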

Testing against a live Dataverse

There needs to be a different approach to initiating the test suite. Right now tests that should fail still pass, because testthat::test_check() currently won't run if the API key isn't found as an environment variable.

I'm open to ideas as always. Currently I'm thinking:

  1. Test only against demo.dataverse.org. (A few weeks ago @pdurbin advocated this in a phone call for several reasons, including that Dataverse's retrieval stats won't be misleading, because one article gets hundreds of hits a month just from automated tests.)

  2. Create a (demo) Dataverse account dedicated to testing. At this point, I don't think it needs to be kept secret; it could even be set in tests/testthat.R.

    @pdurbin, will you please check my claim, especially from a security standpoint?

  3. If the above is safe, the API key might be kept in a yaml file in the inst/ directory.

  4. If the API key to the demo server needs to be protected,

    1. we could save it as Travis environmental variables (ref 1 & ref 2)

    2. however, that would prevent other people from testing the package on their own machines, so we'd get fewer quality contributions from others.


@skasberger, @rliebz, @tainguyenbui, and any others, I'd appreciate any advice from your experience with pyDataverse, dataverse-client-python, and dataverse-client-javascript. I'm not experienced with your languages, but it looks like pyDataverse doesn't pass an API key, while dataverse-client-python posts its API key to the demo server.


(This is different from #4 & #29, which involve the battery of tests/comparisons, not the management of API keys or how testthat is initiated.)

Generalize out the SWORD client

It will be useful to have the SWORD v2 client as a separate package. This is in the works, but I want to make a formal note of it here.

Bulk edit file tags and description via update_file_metadata()

a suggested code or documentation change, improvement to the code, or feature request:

User story: As a user or curator, I have a list of files on dataverse and want to add tags and descriptions from a spreadsheet to them.

While it may(?) be possible to update file metadata as part of update_dataset(), that's quite messy (and I'm not 100% sure it's even possible; I couldn't get it to work). The native API offers a nice set of functions for this, which I think we should expose through the function named above:

http://guides.dataverse.org/en/latest/api/native-api.html#updating-file-metadata

Thoughts?

Reported 404 Error

Reported via email:

Sys.setenv("DATAVERSE_SERVER"= "dataverse.harvard.edu")
dataset_metadata('DOI for the dataset', version = ":draft", block = "citation",
key = Sys.getenv("My API Token"),
server = Sys.getenv("DATAVERSE_SERVER"))
# client error: (404) Not Found

Thank you, Thomas Leeper

@leeper, thanks for developing such a stable and well-designed package. Digging into it during the past week, I'm even more appreciative of its design and consistency. And you left a good roadmap of what the package needs next (eg, #4 & #16), which has helped the transition. It's been fun learning about it and gaining some new tools in my toolbox along the way.

(To those on the repo's watchlist, I promise I won't bend the purpose of GitHub issues too frequently. But I think @leeper deserves the additional recognition.)

ref #21

Creating dataset using create_dataset() yields 500 error


Hi There,

I was trying out some of the functions in the dataverse package and found that create_dataset() returns a 500 error every time I execute it. Unfortunately, I can't share my server and API key, but I can show what I attempted:

## load package
library("dataverse")

## code
meta2 <- list(
  title               = "test",
  author              = "Li,Thomas",
  datasetContact      = "Fish, Fishy",
  dsDescription       = "FISH",
  subject             = "Quantitative Sciences",
  depositor           = "Fish, Fishy",
  dateOfDeposit       = "Fish Time",
  datasetContactEmail = "[email protected]"
)
create_dataset("MLPOCs2018",body=meta2) 

When running the debugger, I noticed the 500 error occurs when POST() is called from the httr package (which makes sense), so I am trying to determine the cause (be it permissions or something else).

Just in case, the sessionInfo() yields the following:


R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] zip_2.0.2         flowCore_1.50.0   magick_2.0        ggplot2_3.2.0     usethis_1.5.0     devtools_2.0.2   
 [7] magrittr_1.5      data.table_1.12.2 dplyr_0.8.1       dataverse_0.2.0  

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5    remotes_2.1.0       purrr_0.3.2         lattice_0.20-38     pcaPP_1.9-73       
 [6] colorspace_1.4-1    testthat_2.1.1      stats4_3.6.0        yaml_2.2.0          rlang_0.4.0        
[11] pkgbuild_1.0.3      pillar_1.4.1        glue_1.3.1          withr_2.1.2         BiocGenerics_0.30.0
[16] sessioninfo_1.1.1   matrixStats_0.54.0  robustbase_0.93-5   munsell_0.5.0       gtable_0.3.0       
[21] mvtnorm_1.0-11      memoise_1.1.0       Biobase_2.44.0      callr_3.2.0         ps_1.3.0           
[26] curl_3.3            parallel_3.6.0      DEoptimR_1.0-8      Rcpp_1.0.1          corpcor_1.6.9      
[31] backports_1.1.4     scales_1.0.0        desc_1.2.0          pkgload_1.0.2       jsonlite_1.6       
[36] graph_1.62.0        fs_1.3.1            digest_0.6.19       processx_3.3.1      grid_3.6.0         
[41] rprojroot_1.3-2     cli_1.1.0           lazyeval_0.2.2      tibble_2.1.3        cluster_2.1.0      
[46] crayon_1.3.4        rrcov_1.4-7         pkgconfig_2.0.2     MASS_7.3-51.4       xml2_1.2.0         
[51] prettyunits_1.0.2   assertthat_0.2.1    httr_1.4.0          rstudioapi_0.10     R6_2.4.0           
[56] compiler_3.6.0  

If needed, I can also provide the log from the server that shows the "API internal error"

Thanks!

fix homebrew on Travis CI

This is the result of recent changes in GitHub and Homebrew.

$ brew install curl
Error: 
  homebrew-core is a shallow clone.
  homebrew-cask is a shallow clone.
To `brew update`, first run:
  git -C /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core fetch --unshallow
  git -C /usr/local/Homebrew/Library/Taps/homebrew/homebrew-cask fetch --unshallow
This restriction has been made on GitHub's request because updating shallow
clones is an extremely expensive operation due to the tree layout and traffic of
Homebrew/homebrew-core and Homebrew/homebrew-cask. We don't do this for you
automatically to avoid repeatedly performing an expensive unshallow operation in
CI systems (which should instead be fixed to not use shallow clones). Sorry for
the inconvenience!
Warning: You are using macOS 10.13.
We (and Apple) do not provide support for this old version.
You will encounter build failures with some formulae.
Please create pull requests instead of asking for help on Homebrew's GitHub,
Twitter or any other official channels. You are responsible for resolving
any issues you experience while you are running this
old version.
Error: curl: no bottle available!
You can try to install from source with e.g.
  brew install --build-from-source curl
Please note building from source is unsupported. You will encounter build
failures with some formulae. If you experience any issues please create pull
requests instead of asking for help on Homebrew's GitHub, Twitter or any other
official channels.
The command "brew install curl" failed and exited with 1 during .

Some discussions

Appropriate server defaults

tldr: Help pages and tests should specify a server, either via Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu") or by passing server = "dataverse.harvard.edu" each time.

Background: The server argument used in most of the functions has the default Sys.getenv("DATAVERSE_SERVER"), which means that the initial user would get a "server not specified" error when trying out the help page examples.

So shouldn't all minimal working examples specify a server argument?

Also this default is inconsistent with the documentation, which says (emphasis mine)

A character string specifying a Dataverse server. There are multiple Dataverse installations, but the defaults is to use the Harvard Dataverse. This can be modified atomically or globally using Sys.setenv("DATAVERSE_SERVER" = "dataverse.example.com").
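One way to reconcile the documentation and the behavior would be to fall back to the Harvard server when the environment variable is unset, so help-page examples run out of the box. A sketch (the helper name is hypothetical):

```r
# Hypothetical default-resolution helper: use DATAVERSE_SERVER when it is
# set, otherwise fall back to the Harvard default the docs describe.
default_server <- function(fallback = "dataverse.harvard.edu") {
  server <- Sys.getenv("DATAVERSE_SERVER")
  if (nzchar(server)) server else fallback
}
```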

doi, "start" parameter using dataverse_search

I'm attempting to use this client to find the doi of all .R files in a dataverse server, and I've run into a couple of interesting behaviors.

  1. Calling dataverse_search prints a message to the console "10 of 3842 results retrieved". However, calling dataverse_search with a "start" parameter that would seem to be the last page of the results (start = ceiling(3842/10) = 385) still yields 10 results, and pages beyond that number continue to yield results. Therefore, how would I determine the appropriate number of pages to get data for?
  2. dataverse_search does not return a doi field. The dataframe returned from dataverse_search has the following columns:
    ## [1] "name" "type" "url"
    ## [4] "file_id" "description" "published_at"
    ## [7] "file_type" "file_content_type" "size_in_bytes"
    ## [10] "md5" "checksum" "dataset_citation"
    ## [13] "unf"
    none of which are doi. I was able to get around this by parsing the "dataset_citation" column using stringr, but having a dedicated doi field would be wonderful.
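Until a dedicated doi field exists, parsing dataset_citation is the workaround; base R (no stringr required) can do it. A sketch (the helper name is hypothetical, and the DOI pattern is a simplification):

```r
# Workaround sketch: pull the first DOI-shaped token out of the
# dataset_citation string; returns NA if nothing matches.
extract_doi <- function(citation) {
  m <- regmatches(citation, regexpr("10\\.[0-9]{4,}/[^ ,\"]+", citation))
  if (length(m)) m else NA_character_
}
```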

404 errors in vignette - get_file()

You guys might start regretting inviting me to be a maintainer. I'm having trouble reproducing the vignettes, even easy parts like retrieving plain-text R & CSVs.

Part 1: out of the box

remotes::install_github("iqss/dataverse-client-r")
#> Skipping install of 'dataverse' from a github remote, the SHA1 (bac89f46) has not changed since last install.
#>   Use `force = TRUE` to force installation
library("dataverse")
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("doi:10.7910/DVN/ARKOTI")
# Dataset (75170): 
# Version: 1.0, RELEASED
# Release Date: 2015-07-07T02:57:02Z
# License: CC0
# 22 Files:
# label version      id                  contentType
# 1                  alpl2013.tab       2 2692294    text/tab-separated-values
# 2                   BPchap7.tab       2 2692295    text/tab-separated-values
# 3                   chapter01.R       2 2692202 text/plain; charset=US-ASCII
# ...
# 16             drugCoverage.csv       1 2692233 text/plain; charset=US-ASCII
# ...

# Retrieve files by ID
object.size(dataverse::get_file(2692294)) # tab works
#> 211040 bytes
object.size(dataverse::get_file(2692295)) # tab works
#> 61336 bytes
object.size(dataverse::get_file(2692210)) # R fails
#> Error in dataverse::get_file(2692210): Not Found (HTTP 404).
object.size(dataverse::get_file(2692233)) # csv fails
#> Error in dataverse::get_file(2692233): Not Found (HTTP 404).

# Retrieve files by name & doi
object.size(get_file("alpl2013.tab"     , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 211040 bytes
object.size(get_file("BPchap7.tab"      , "doi:10.7910/DVN/ARKOTI")) # tab works
#> 61336 bytes
object.size(get_file("chapter01.R"      , "doi:10.7910/DVN/ARKOTI")) # R fails
#> Error in get_file("chapter01.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).
object.size(get_file("drugCoverage.csv" , "doi:10.7910/DVN/ARKOTI")) # csv fails
#> Error in get_file("drugCoverage.csv", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).


# Taken straight from https://cran.r-project.org/web/packages/dataverse/vignettes/C-retrieval.html
code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI")
#> Error in get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI"): Not Found (HTTP 404).

Created on 2019-12-06 by the reprex package (v0.3.0)

Part 2: digging.

Using debug(dataverse::get_file), the error-throwing line is in get_file():

r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query, ...)

To make things a tad more direct, I called dataverse::get_file(2692233). The two relevant parameters to httr::GET() are

Browse[2]> query
$format
[1] "original"

Browse[2]> u
[1] "https://dataverse.harvard.edu/api/access/datafile/2692233"

The r value returned is

Response [https://dataverse.harvard.edu/api/access/datafile/2692233?format=original]
  Date: 2019-12-07 05:13
  Status: 404
  Content-Type: application/json
  Size: 201 B

That u value is fine when pasted into Chrome. I saw several Dataverse discussions about a trailing /. When I added that, the response appears good.

Browse[2]> u2 <- paste0("https://dataverse.harvard.edu/api/access/datafile/2692233", "/")
Browse[2]> httr::GET(u2, httr::add_headers(`X-Dataverse-key` = key), ... )
Response [https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/ARKOTI/14e66408488-c678717f7c4d?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27drugCoverage.csv&response-content-type=text%2Fplain%3B%20charset%3DUS-ASCII&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191207T051632Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20191207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=c1b13a7d3ea2a53c1c1e70c18a762ae0e4ae14eb41fae7d79c71fce26a9b354f]
  Date: 2019-12-07 05:16
  Status: 200
  Content-Type: text/plain; charset=US-ASCII
  Size: 4.06 kB

Part 3: Questions

  1. I assume this error is fairly new. Some change with Dataverse? If not, maybe it's related to the change in curl that was released 4 days ago?

  2. Why are csv & R files affected, but not tab files? As I step through a tab file (e.g., dataverse::get_file(2692294)), it appears the exact same lines are executed. And that u value doesn't have a trailing slash (https://dataverse.harvard.edu/api/access/datafile/2692294). I see two differences: (a) the content type and (b) this one doesn't go through AWS/S3.

    Response [https://dataverse.harvard.edu/api/access/datafile/2692294?format=original]
    Date: 2019-12-07 05:35
    Status: 200
    Content-Type: application/x-stata; name="alpl2013.dta"
    Size: 211 kB
    <BINARY BODY>

    This is probably related to @EdJeeOnGitHub's recent issue #31. Notice he mentions problems with certain file formats.

  3. Is this related at all to IQSS/dataverse#3130, IQSS/dataverse#2559, or IQSS/dataverse#4196?
    You can see that my knowledge of the web side of this is limited; I don't understand those issues well.

devtools::session_info()
  • Session info ---------------------------------------------------------------------------
    setting value
    version R version 3.6.1 Patched (2019-08-12 r76979)
    os Windows 10 x64
    system x86_64, mingw32
    ui RStudio
    language (EN)
    collate English_United States.1252
    ctype English_United States.1252
    tz America/Chicago
    date 2019-12-06

  • Packages -------------------------------------------------------------------------------
    package * version date lib source
    assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
    backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
    callr 3.3.2 2019-09-22 [1] CRAN (R 3.6.1)
    cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
    clipr 0.7.0 2019-07-23 [1] CRAN (R 3.6.1)
    crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
    curl 4.3 2019-12-02 [1] CRAN (R 3.6.1)
    dataverse * 0.2.1 2019-12-07 [1] Github (bac89f4)
    desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
    devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
    digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1)
    ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
    evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
    fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
    glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
    htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
    httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
    jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
    knitr 1.26 2019-11-12 [1] CRAN (R 3.6.1)
    magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
    memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
    packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.0)
    pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
    pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
    prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
    processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
    ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
    R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
    Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1)
    remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
    reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.0)
    rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.1)
    rmarkdown 1.18 2019-11-27 [1] CRAN (R 3.6.1)
    rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
    rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
    sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
    testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.1)
    usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
    whisker 0.4 2019-08-28 [1] CRAN (R 3.6.1)
    withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
    xfun 0.11 2019-11-12 [1] CRAN (R 3.6.1)
    xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.1)

(screenshot of Postman)

data.frames don't default to factors for strings

As of R 4.0, data.frames no longer default to converting string variables to factors. This won't change much for clients, but the tests break because of the different data type:

> str(expected)
List of 4
 $ title                   : chr "dataverse-client-r"
 $ generator               : list()
  ..- attr(*, "uri")= chr "http://www.swordapp.org/"
  ..- attr(*, "version")= chr "2.0"
 $ dataverseHasBeenReleased: chr "false"
 $ datasets                :'data.frame':	1 obs. of  2 variables:
  ..$ NA: Factor w/ 1 level "Bulls Roster 1996-1997": 1
  ..$ NA: Factor w/ 1 level "https://demo.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.70122/FK2/FAN622": 1
 - attr(*, "class")= chr "dataverse_dataset_list"
> str(actual)
List of 4
 $ title                   : chr "dataverse-client-r"
 $ generator               : list()
  ..- attr(*, "uri")= chr "http://www.swordapp.org/"
  ..- attr(*, "version")= chr "2.0"
 $ dataverseHasBeenReleased: chr "true"
 $ datasets                :'data.frame':	1 obs. of  2 variables:
  ..$ title: chr "Bulls Roster 1996-1997"
  ..$ id   : chr "https://demo.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.70122/FK2/FAN622"
 - attr(*, "class")= chr "dataverse_dataset_list"
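The robust fix is to construct data.frames with an explicit stringsAsFactors = FALSE (the new default in R >= 4.0), so expectations are identical across R versions. A sketch using the dataset fields above:

```r
# Make the character/factor behavior explicit so tests agree on R 3.x
# and R 4.x regardless of the changed stringsAsFactors default.
datasets <- data.frame(
  title = "Bulls Roster 1996-1997",
  id    = "https://demo.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.70122/FK2/FAN622",
  stringsAsFactors = FALSE
)
```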

`list_datasets()` & printing is failing

Sys.setenv("DATAVERSE_SERVER" = "demo.dataverse.org")
Sys.setenv("DATAVERSE_KEY"    = "c7208dd2-6ec5-469a-bec5-f57e164888d4")
dv <- get_dataverse("dataverse-client-r")
list_datasets(dv)

The error is

Error in if (x$dataverseHasBeenReleased[[1]] == "true") "Yes" else "No" : 
  argument is of length zero 

The nested structure of the object has changed, so this function is failing:

print.dataverse_dataset_list <- function(x, ...) {
  cat("Dataverse name: ", x$title[[1]], "\n", sep = "")
  cat("Released? ", if (x$dataverseHasBeenReleased[[1]] == "true") "Yes" else "No", "\n", sep = "")
  print(x$datasets)
  invisible(x)
}

The feed level needs to be added, so x$dataverseHasBeenReleased[[1]] becomes x$feed$dataverseHasBeenReleased[[1]]? But something with dispatching isn't going as I expect. The full structure of x from list_datasets() isn't being recognized by print.dataverse_dataset_list().


Edit: I think print() is fine. It's a problem with the tail end of list_datasets(). These objects are nested in x$feed, not just x.

x <- xml2::as_list(xml2::read_xml(r$content))
out <- list(
  title                    = x[["title"]][[1L]],
  generator                = x[["generator"]],
  dataverseHasBeenReleased = x[["dataverseHasBeenReleased"]][[1L]]
)
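So the fix is to index through the feed element first. A self-contained sketch, using a mock list in place of the real xml2::as_list() output (same shape as the str() dump above):

```r
# Mock of the converted SWORD response: after xml2::as_list(), the fields
# live under $feed, not at the top level of the list.
x <- list(feed = list(
  title                    = list("dataverse-client-r"),
  generator                = list(),
  dataverseHasBeenReleased = list("true")
))

# Corrected extraction: go through the feed level before pulling fields.
feed <- x[["feed"]]
out  <- list(
  title                    = feed[["title"]][[1L]],
  generator                = feed[["generator"]],
  dataverseHasBeenReleased = feed[["dataverseHasBeenReleased"]][[1L]]
)
```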

Refine the organization of `get_file()`?

@adam3smith, @kuriwaki, @pdurbin, and anyone else,

Should get_file() be refactored into multiple child functions? It seems like we're asking it to do a lot of things, including

  • retrieve by file id or by file name
  • retrieve single file or multiple files
  • in #35, @kuriwaki has the cool suggestion of returning a specific file format (like Stata)
  • in #35, I hijacked @kuriwaki's thread to suggest also returning R's native data.frame or tibble.
  • in PR #47, @adam3smith has a cool solution of downloading multiple files as a single zip (also discussed in #46)

I like all these capabilities, and want to discuss organizational ideas with people so the package structure is (a) easy for us to develop, test, & maintain, and (b) useful and natural for users to learn and incorporate.

One possible approach:
A foundational function retrieves the file(s) by ID; it is the workhorse that actually retrieves the file. A second function accepts the file name (not ID); it essentially wraps the first function after calling get_fileid(). Both of these functions deal with a single file at a time.

Another pair of functions deal with multiple files (one by name, one by id). But these return lists, not a single object. They're essentially lapplys/loops around their respective siblings described above.

To avoid breaking the package interface, maybe the existing get_file() keeps its same interface (which ambiguously accepts either file names or ids and returns either a single file or a list of files), but we soft-deprecate it and encourage new code to use these more explicit functions? The guts of the function are moved out into the four new functions.

Maybe the function names are

  1. the existing get_file() with an unchanged interface
  2. get_file_by_id() (the workhorse)
  3. get_file_by_name()
  4. get_files_by_id()
  5. get_files_by_name() (see @adam3smith's comment below)
  6. get_tibble_by_id()
  7. get_tibble_by_name()
  8. get_zip_by_id()
  9. get_zip_by_name() (see @adam3smith's comment below)
  10. get_file_by_doi() (see @adam3smith's comment below)
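To make the wrapping relationship concrete, here is a hedged sketch (every function name here is a proposal from the list above, not an existing export):

```r
# Hypothetical: the name-based wrapper resolves the id, then delegates
get_file_by_name <- function(filename, dataset, ...) {
  fileid <- get_fileid(dataset, filename, ...)  # name -> id lookup
  get_file_by_id(fileid, ...)                   # the single workhorse
}

# Hypothetical: a plural variant is a loop over its singular sibling
get_files_by_id <- function(fileids, ...) {
  lapply(fileids, get_file_by_id, ...)
}
```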

I'm pretty sure it would be easier to write tests that isolate problems. The documentation becomes more verbose, but probably more straightforward.

You guys have more experience with Dataverse than I do, and better sense of the use cases. Would this reorganization help users? If not, maybe we still split it into multiple functions, but just keep the visibility of functions 2-5 private.

Maybe I'm making this unnecessarily tedious, but I'm thinking that these download functions are the most called by R users, and they're certainly the ones that are called by new users. So if they leave a bad impression, the package is less likely to be used.

Add to author list

@adam3smith & @kuriwaki, I overlooked this in your recent PRs. You should be recognized for your contributions.

(@EdJeeOnGitHub & @billy34, I haven't forgotten about your PRs.)

Will you please create separate PRs and

  1. add yourselves to the DESCRIPTION file. Add your ORCID id if you have one.
  2. update NEWS.md
  • @adam3smith, please create a separate PR (crediting yourself as an 'aut')
  • @kuriwaki, please create a separate PR (crediting yourself as an 'aut')

I'm using R Packages for the definition/distinction between the "aut" and "ctb".

issue with install instruction

Hi, the installation instructions say one should use 'ghit'. I'm not sure what that is; I couldn't find it through Google, and the provided code doesn't work on my Mac. The usual code does work:

library(devtools) #install first if not yet installed
install_github("iqss/dataverse-client-r")
library("dataverse")

Portability of test suite to clients in other languages

@pdurbin and I discussed possible ways to reduce the effort each Dataverse client developer spends creating and maintaining tests. It might be nice if

  1. there was a common bank of (sub)dataverses and files that covered a nice spread of scenarios, for asserting that a client library (i.e., the two Pythons, one Java, one JavaScript, and R) downloads/uploads/searches/processes correctly. For a download test, the test suite confirms that the client returns a file that matches a specific pre-existing file. For a metadata test, the test suite confirms that the client returns a dataset that matches the pre-existing ~csv.
  2. a manifest file enumerates these files, and certain expected characteristics (e.g., md5, approx file size). Currently, I think a csv adequately meets this un-hierarchical need, where each row represents a file that will be tested.
  3. a client's test suite doesn't code specifically for each file. It probably just loops over the manifest file. To add a new condition, only the manifest file and file bank is modified.
  4. the manifest file and the expected files are eventually stored somewhere centrally, that's easily accessible by the client developers. When someone hits a weird case (e.g., the pyDataverse developer finds a problem when processing a csv with a "txt' extension), they'll add that case to the test banks.
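In R, the loop in point 3 might look like this (the manifest columns and server are assumptions for illustration, not an agreed format):

```r
# Hypothetical manifest columns: file_id, md5
manifest <- read.csv("manifest.csv", stringsAsFactors = FALSE)

for (i in seq_len(nrow(manifest))) {
  raw <- get_file(manifest$file_id[i], server = "demo.dataverse.org")
  tmp <- tempfile()
  writeBin(raw, tmp)
  # each downloaded file must match its recorded checksum
  stopifnot(unname(tools::md5sum(tmp)) == manifest$md5[i])
}
```

Adding a new test condition then only means adding a row to the manifest and a file to the bank; no client code changes.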

@skasberger, @rliebz, @tainguyenbui, and any others, please tell me if this isn't worth it, or there's a better approach, etc.


(This is different from #4 & #29, which involve the battery of tests/comparisons. #40 deals with the deployment of API keys used in testing.)

package scaffolding

As I've expressed with a few tweaks:

I'd (like to) modify some of the package's structure to more closely resemble the approach described in Hadley's R Packages book (and the tidyverse style guide). The first reason is because I think it's a good approach and has worked well for me in the past. A second reason is that this approach is documented better and more coherently than any other approach --therefore it will be easier for someone else to modify or become maintainer in the future.

But this won't be a major overhaul. Most of the important elements of the existing package follow this book's approach.

The changes come in two sets: the little guys that come before adding a test suite (issues #4 & #29) and those that come after the test suite (which includes some of the recently accumulating issues like #33, #35, & #37).

Before test suite

  • Rproj

  • update Roxygen

  • updates to .Rbuildignore and .gitignore files.

  • move daily development off the master and to the dev branch

  • update person entries in DESCRIPTION file

  • semantic versioning for dev versions (e.g., x.x.x.9001)

  • remove date from DESCRIPTION file

  • (re)connect codecov which currently reads 0%. (@pdurbin, as an admin of this GitHub org, you'll probably see my request for additional permissions granted to Codecov. I requested it only for this single public repo.)

    [codecov badge image] (click the image for the live status)

After test suite and functional changes

I'm flexible if anyone has other ideas.

CRAN release

Thank you for this package. Do you plan an additional CRAN release soon to publish the fixes and enhancements since the latest release (2017)? Thanks in advance.

Refactor to an internal "kernel" function?

@adam3smith wrote in #49 (comment)

A different approach would be to make it easier to construct native API queries from scratch using the client so that we don't have to code everything but it's more readily accessible. I don't know how hard that would be, but when I wrote API functionality (for downloading .zip files) locally it required quite a bit of code duplication from dataverse library.

We encountered something similar in REDCapR. It's a package similar to this one --in the sense that it's basically a glove around curl calls to a server's API. Then it adds convenience functions and validation to make R development easier, because it returns R's native data.frame objects.

When we saw all the duplication in REDCapR, we refactored that core functionality into kernel_api(). That central location makes it easier to improve things consistently, like character encoding and error handling. It returns the raw result to the calling function, which decides whether to save it as a local file or return a data.frame.

In dataverse::get_file(), I see the replication @adam3smith's talking about, with three instances similar to

u <- paste0(api_url(server), "access/datafile/", file, "/metadata/", format)
r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
httr::stop_for_status(r)

This could become its own function like

kernel <- function(command, server, key, ...) {
  u <- paste0(api_url(server), command)
  r <- httr::GET(u, httr::add_headers("X-Dataverse-key" = key), ...)
  httr::stop_for_status(r)
  r
}

The code inside dataverse::get_file_metadata() would look like

command <- paste0("access/datafile/", file, "/metadata/", format)
kernel(command, server, key)

The code inside the current dataverse::get_file() would have three different calls:

command <- paste0("access/datafiles/", file)
kernel(command, server, key)
...
command <- paste0("access/datafile/bundle/", file)
kernel(command, server, key)
...
command <- paste0("access/datafile/", file)
kernel(command, server, key)
...

@adam3smith, @kuriwaki, or anyone else, please share your impressions if you'd like. It looks like @skasberger and the pyDataverse do something very similar with their get_request()

new data.frame attribute for `get_dataverse()`.

There's a subtle & trivial difference in the tests for get_dataverse(), which may be related to new versions of tibble or R 4.0.

>   expect_equal(actual, expected)
Error: `actual` not equal to `expected`.
Component "datasets": Attributes: < Names: 1 string mismatch >
Component "datasets": Attributes: < Length mismatch: comparison on first 2 components >
Component "datasets": Attributes: < Component 2: Modes: character, numeric >
Component "datasets": Attributes: < Component 2: target is character, current is numeric >
> actual
Dataverse name: Demo Dataverse
Released?       Yes
data frame with 0 columns and 0 rows
> expected
Dataverse name: Demo Dataverse
Released?       Yes
data frame with 0 columns and 0 rows
> str(expected)
List of 4
 $ title                   : chr "Demo Dataverse"
 $ generator               : list()
  ..- attr(*, "uri")= chr "http://www.swordapp.org/"
  ..- attr(*, "version")= chr "2.0"
 $ dataverseHasBeenReleased: chr "true"
 $ datasets                :'data.frame':	0 obs. of  0 variables
 - attr(*, "class")= chr "dataverse_dataset_list"
> str(actual)
List of 4
 $ title                   : chr "Demo Dataverse"
 $ generator               : list()
  ..- attr(*, "uri")= chr "http://www.swordapp.org/"
  ..- attr(*, "version")= chr "2.0"
 $ dataverseHasBeenReleased: chr "true"
 $ datasets                :'data.frame':	0 obs. of  0 variables
 - attr(*, "class")= chr "dataverse_dataset_list"
> attributes(actual$datasets)
$names                                      # This is new
character(0)

$class
[1] "data.frame"

$row.names
integer(0)

> attributes(expected$datasets)
$class
[1] "data.frame"

$row.names
integer(0)

Add integration test to download a file by filename

I'm opening this issue because I just opened and equivalent issue for the Python client: IQSS/dataverse-client-python#29

What I'm really after is a way to address this issue that @monogan opened at IQSS/dataverse#2700 in which he's trying to figure out what to write in a book about R with regard to how to download files.

Ideally (in my mind), the R package for Dataverse would provide some insulation between the readers of his book and the Dataverse APIs. The book would say, "Install the dataverse package from CRAN and download the file by..." That way, even if the APIs change a bit, future readers of his book will download the latest version of the R package from CRAN and it will still "just work".

I think that the only things users should need to download a file are the DOI of the dataset and the filename. The dataverse package can do the rest. :) It would be way cleaner than my hack: IQSS/dataverse@812424a .
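In other words, the reader-facing call would ideally look something like this (the DOI and filename here are placeholders, not a real dataset):

```r
library(dataverse)
# Hypothetical: fetch one file using only the dataset DOI and the filename
f <- get_file("chapter-data.tab", dataset = "doi:10.7910/DVN/EXAMPLE")
```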

More verbose httr error message for get_file()

I often find that get_file() errors because the API can't return a certain file format. Since the Dataverse API returns informative messages when a request fails, it makes sense to pass these along in R instead of just the default from httr::stop_for_status().

For instance:

library(dataverse)
library(tidyverse)

IPA_dataset <- get_dataset("doi:10.7910/DVN/JGLOZF")
# doi:  https://doi.org/10.7910/DVN/JGLOZF
community_files <- IPA_dataset$files$id %>% 
  map(function(x){
    print(x)
    return(get_file(x))
  })

after the PR will return:

Not Found (HTTP 404). Failed to '/api/v1/access/datafile/2972336' datafile access error: requested optional service (image scaling, format conversion, etc.) is not supported on this datafile.

instead of:

Error in get_file(x) : Not Found (HTTP 404).

Steal tests from pyDataverse

@skasberger kindly offered assistance to the package and its tests (@skasberger, I have it on record so you can't back out).

Tests of uploading to Dataverse

From my initial look, it appears that I could port/translate a lot of the tests and flat-out steal a lot of the json that's fed to the tests.

Tests of downloading from Dataverse

@skasberger, are there pyDataverse tests that compare data returned from Dataverse against an expected set of values?

If not, maybe that set of expected values is something that the R & Python libraries could build together. The R data.frame/list could be serialized to json and compared against a json file; but that's probably too reliant on the R & Python json libraries producing the exact same plain-text file.

Alternatively, the json expected files (that are common between Python & R) could be read into the languages' dict/list objects (e.g., by Python's json and R's jsonlite) and compared there. I'm guessing this approach is less brittle and more reusable across languages?
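A sketch of that second approach in R (the JSON snippet and the shape of `actual` are invented for illustration):

```r
library(jsonlite)

# Both clients parse the same expected-values file into native structures
expected <- fromJSON('{"title": "Bulls Roster 1996-1997", "fileCount": 1}',
                     simplifyVector = FALSE)

# The R client's result, reduced to the same shape (hypothetically)
actual <- list(title = "Bulls Roster 1996-1997", fileCount = 1)

# Compare parsed structures, not serialized text
isTRUE(all.equal(expected, actual))
```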


To @skasberger & anyone else, I'm not very familiar with Dataverse or Python package development, and I'm open to any ideas how to do things differently or leverage each package's efforts more efficiently.

(builds on #4 & possibly #22)

Error returned by dataset_versions

When trying to get versions of my dataset using the dataset_versions function, I get an error

Error: $ operator is invalid for atomic vectors

Looking at the code, we can see that the JSON decoding of the response is missing.

I'll file a PR to correct this
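For reference, the likely one-line fix is to parse the body before indexing into it with `$`, something like this (assuming r is the httr response object, as elsewhere in the package):

```r
# Decode the JSON body; `$` then works on the resulting list
out <- jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
out$data
```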

[EDIT by @kuriwaki -- add reprex]

library(dataverse)

# this works
get_dataset(dataset = "doi:10.70122/FK2/PPIAXE", server = "demo.dataverse.org")
#> Dataset (182162): 
#> Version: 1.1, RELEASED
#> Release Date: 2020-12-30T00:00:24Z
#> License: CC0
#> 22 Files:
#>                   label version      id               contentType
#> 1 nlsw88_rds-export.rds       1 1734016  application/octet-stream
#> 2            nlsw88.tab       3 1734017 text/tab-separated-values

# but not this
dataset_versions(dataset = "doi:10.70122/FK2/PPIAXE", server = "demo.dataverse.org")
#> Error: $ operator is invalid for atomic vectors

Created on 2021-01-22 by the reprex package (v0.3.0)

Seeking new maintainer!

This package is not being actively maintained. If you're interested in contributing or taking over, please express your interest here.

get_file directly into environment with user-specified file format

What the issue is about:

  • a suggested code or documentation change, improvement to the code, or feature request

Issue: I think most users who want to get data from the R dataverse package want to start working with the data in their R environment right away. However, get_file only returns raw binary output, which is not usable on its own.

Proposal: The help page shows how to write the class raw object into a temp file and read it back in. The proposed feature is to add an optional argument in get_file or make a function that does this write-in / read-in-again process automatically. Users will enter a function that will be used to read in the tempfile. An example function that does this is below.

How does this sound?

# hide my key

library(dataverse)

# function ----

# @param file to be passed on to get_file
# @param dataset to be passed on to get_file
# @param read_function If supplied a function object, this will write the 
#   raw file to a tempfile and read it back in with the supplied function. This
#   is useful when you want to start working with the data right away in the R
#   environment
get_file_addon <- function(file,
                            dataset = NULL,
                            read_function = NULL,
                            ...) {
  
  raw_file <- get_file(file, dataset)
  
  # default of get_file
  if (is.null(read_function))
    return(raw_file)
  
  # save to temp and then read it in with supplied function
  if (!is.null(read_function)) {
    tmp <- tempfile(file, fileext = stringr::str_extract(file, "\\.[A-Za-z]+$"))
    writeBin(raw_file, tmp)
    return(do.call(read_function, list(tmp)))
  }
}

# read in two non-tab ingested files ----
cces_dta <- get_file_addon(file = "cumulative_2006_2018.dta", 
                           dataset = "10.7910/DVN/II2DB6",
                           read_function = haven::read_dta)
cces_rds <- get_file_addon(file = "cumulative_2006_2018.Rds", 
                           dataset = "10.7910/DVN/II2DB6",
                           read_function = readr::read_rds)
class(cces_dta)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(cces_rds)
#> [1] "tbl_df"     "tbl"        "data.frame"
dim(cces_dta)
#> [1] 452755     73
dim(cces_rds)
#> [1] 452755     73

Created on 2019-12-16 by the reprex package (v0.3.0)
