ropensci-archive / finch

:warning: ARCHIVED :warning: Read Darwin Core Archive files

License: Other

R 97.15% Makefile 2.85%

rstats darwin-core darwincore biodiversity gbif r r-package

finch's Introduction

This package has been archived. The former README is now in README-not.<

finch's Issues

Reading files from the web

Hi,

Links to DWC-A files shared through the Integrated Publishing Toolkit (IPT) usually look like this:

http://ipt.jbrj.gov.br/jbrj/archive.do?r=redlist_2013_taxons&v=3.12

However, dwca_read() uses the provided url to get the name of the file where the data will be stored:

basename("http://ipt.jbrj.gov.br/jbrj/archive.do?r=redlist_2013_taxons&v=3.12")
[1] "archive.do?r=redlist_2013_taxons&v=3.12"

This doesn't work well as R can't find the path to the zip file later on:

Error in unzip(writepath, exdir = dirpath) :
cannot open file '/Users/gustavo/archive.do?r=lista_especies_flora_brasil&v=393.71/resourcerelationship.txt': Not a directory

I guess this could be solved by changing the basename to something else when the url isn't a direct link to a zip file.

Thanks!
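One way to read the suggestion above is to derive the local file name from the URL only when it already points at a zip file, and to sanitize it otherwise. The following is a minimal sketch of that idea, not finch's actual fix; the helper name `safe_archive_name` is hypothetical.

```r
# Hypothetical helper: build a file-system-safe name for a downloaded archive.
safe_archive_name <- function(url) {
  # Drop any query string or fragment before looking at the last path element.
  name <- basename(sub("[?#].*$", "", url))
  if (grepl("\\.zip$", name, ignore.case = TRUE)) {
    return(name)
  }
  # Not a direct link to a zip file: replace characters such as "?", "=" and "&"
  # with underscores and add the extension ourselves.
  stem <- gsub("[^A-Za-z0-9._-]+", "_", basename(url))
  paste0(stem, ".zip")
}

ipt_name <- safe_archive_name(
  "http://ipt.jbrj.gov.br/jbrj/archive.do?r=redlist_2013_taxons&v=3.12"
)
zip_name <- safe_archive_name(
  "http://api.gbif.org/v1/occurrence/download/request/0020631-151016162008034.zip"
)
```

With a name like this, the path passed to `unzip()` no longer contains a query string, which is what caused the "Not a directory" error above.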

Test on larger files

Related to #9. Need to figure out the best option that will work in as many cases as possible.

What to do about big datasets

Users' machines vary in RAM, so the decision should be left to the user, but helping prevent the session from crashing would be good.

Maybe warn with a prompt if the dataset is over a certain size? That may be too much.
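The size check floated above could look roughly like this. It is a sketch only: the function name `check_archive_size` and the 500 MB default are assumptions, and it issues a plain warning rather than an interactive prompt.

```r
# Hypothetical helper: warn before reading an archive that exceeds a size threshold.
check_archive_size <- function(path, max_mb = 500) {
  size_mb <- file.size(path) / 1024^2
  if (is.na(size_mb)) {
    stop("File not found: ", path, call. = FALSE)
  }
  if (size_mb > max_mb) {
    warning(
      sprintf("Archive is %.1f MB (threshold: %s MB); reading it may exhaust memory.",
              size_mb, max_mb),
      call. = FALSE
    )
  }
  # Return the size invisibly so callers can make their own decision.
  invisible(size_mb)
}

# Demo on a small temporary file that stays under the threshold.
demo_file <- tempfile(fileext = ".zip")
writeBin(raw(2048), demo_file)
demo_size <- check_archive_size(demo_file, max_mb = 1)
```

The compressed size is only a rough proxy for memory use after unzipping and parsing, so any threshold would need tuning.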

Error parsing .xml DwC file with finch::simple_read

Hi, I'm having trouble parsing a DarwinCore file, eml.xml, using the finch package with the code below:
file <- simple_read("eml.xml")

This brings up the error message "Error: no parser for eml". The file eml.xml is in the current file path.

Thanks a lot for any suggestions as to what's going wrong.

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] finch_0.1.0       traits_0.2.0      data.table_1.10.4 vegan_2.4-3      
 [5] lattice_0.20-35   permute_0.9-4     reshape2_1.4.2    rgeos_0.3-23     
 [9] plyr_1.8.4        bindrcpp_0.2      purrr_0.2.2.2     robis_0.1.8      
[13] wellknown_0.1.0   stringr_1.2.0     stringi_1.1.5     dplyr_0.7.2      
[17] jsonlite_1.5      httr_1.2.1       

loaded via a namespace (and not attached):
 [1] taxize_0.8.9     htmltools_0.3.6  yaml_2.1.14      mgcv_1.8-17     
 [5] base64enc_0.1-3  rlang_0.1.1      glue_1.1.1       sp_1.2-5        
 [9] uuid_0.1-2       foreach_1.4.3    bindr_0.1        rvest_0.3.2     
[13] htmlwidgets_0.9  codetools_0.2-15 evaluate_0.10.1  knitr_1.16      
[17] httpuv_1.3.5     crosstalk_1.0.0  curl_2.7         parallel_3.4.1  
[21] Rcpp_0.12.12     xtable_1.8-2     readr_1.1.1      backports_1.1.0 
[25] leaflet_1.1.0    mime_0.5         hms_0.3          digest_0.6.12   
[29] shiny_1.0.3      grid_3.4.1       rprojroot_1.2    tools_3.4.1     
[33] magrittr_1.5     tibble_1.3.3     EML_1.0.3        crul_0.3.8      
[37] bold_0.5.0       cluster_2.0.6    ape_4.1          pkgconfig_2.0.1 
[41] MASS_7.3-47      Matrix_1.2-10    xml2_1.1.1       iterators_1.0.8 
[45] reshape_0.8.7    assertthat_0.2.0 rmarkdown_1.6    R6_2.2.2        
[49] nlme_3.1-131     compiler_3.4.1  
--

Error importing DwC file from URL

> devtools::session_info()
Session info ------------------------------------------------------------
 setting  value
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, mingw32
 ui       RStudio (1.0.143)
 language (EN)
 collate  English_United States.1252
 tz       Europe/Paris
 date     2017-07-18

Packages ----------------------------------------------------------------
 package    * version date       source
 base       * 3.4.1   2017-06-30 local
 compiler     3.4.1   2017-06-30 local
 data.table   1.10.4  2017-02-01 CRAN (R 3.4.0)
 datasets   * 3.4.1   2017-06-30 local
 devtools   * 1.13.2  2017-06-02 CRAN (R 3.4.1)
 digest       0.6.12  2017-01-27 CRAN (R 3.4.1)
 EML          1.0.3   2017-05-01 CRAN (R 3.4.1)
 finch      * 0.1.0   2016-12-23 CRAN (R 3.4.1)
 graphics   * 3.4.1   2017-06-30 local
 grDevices  * 3.4.1   2017-06-30 local
 memoise      1.1.0   2017-04-21 CRAN (R 3.4.1)
 methods    * 3.4.1   2017-06-30 local
 plyr         1.8.4   2016-06-08 CRAN (R 3.4.1)
 rappdirs     0.3.1   2016-03-28 CRAN (R 3.4.1)
 Rcpp         0.12.12 2017-07-15 CRAN (R 3.4.1)
 stats      * 3.4.1   2017-06-30 local
 tools        3.4.1   2017-06-30 local
 utils      * 3.4.1   2017-06-30 local
 uuid         0.1-2   2015-07-28 CRAN (R 3.4.0)
 withr        1.0.2   2016-06-20 CRAN (R 3.4.1)
 xml2         1.1.1   2017-01-24 CRAN (R 3.4.1)
--

I get the following error:

file <- "http://ipt.vliz.be/eurobis/archive.do?r=nbn_ga000467&v=1.1"
out <- dwca_read(file, read = TRUE)
File in cache
Error in if (grepl("<|>", x)) { : argument is of length zero

but the following works fine:

file <- "dwca-nbn_ga000467-v1.1.zip"
out <- dwca_read(file, read = TRUE)

EML changes

e.g., https://travis-ci.org/ropensci/finch#L1526 Error : object ‘eml_read’ is not exported by 'namespace:EML'

Check which functions EML exports and make fixes.

imports fixes

remove plyr
remove rappdirs
importFrom hoardr
importFrom EML
importFrom digest

Parsing occurrence text files in DwC archive

After a year of working with GBIF data in R and constantly running into problems importing occurrence text files correctly, I ended up writing a gist collecting most of the column types I had problems with: type_GBIF_occurrence_fields.R. I discussed with colleagues whether it would be useful to put it in our project package. But, as suggested in trias-project/trias#25 (comment), why not pitch it to the authors of finch? 👍
The typical issue when opening such files is that some DwC fields (columns) are NA for thousands of rows before the first real value appears. This causes parsing failures because R assigns type logical to these fields. My first solution was to increase the value of the guess_max parameter, but that is unfeasible for big files, and it is only a workaround anyway.
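Declaring column types up front, as the gist proposes, avoids type guessing altogether. The sketch below illustrates the idea with base R's `read.delim()` and its `colClasses` argument. It does not reproduce the gist: the four columns listed are an assumed, illustrative subset of Darwin Core terms, and `read_occurrence` is a hypothetical name.

```r
# Illustrative (assumed) subset of column types; the real gist covers many more fields.
occurrence_col_types <- c(
  gbifID             = "character",
  scientificName     = "character",
  individualCount    = "integer",
  establishmentMeans = "character"
)

# Hypothetical reader: apply declared types to whichever columns the file contains.
read_occurrence <- function(path, col_types = occurrence_col_types) {
  header  <- strsplit(readLines(path, n = 1), "\t", fixed = TRUE)[[1]]
  classes <- col_types[intersect(names(col_types), header)]
  utils::read.delim(
    path,
    colClasses       = classes,   # named vector, matched against column names
    quote            = "",
    na.strings       = c("", "NA"),
    check.names      = FALSE,
    stringsAsFactors = FALSE
  )
}

# Demo: two columns are empty in the leading rows and only filled later.
occ_file <- tempfile(fileext = ".txt")
writeLines(
  c("gbifID\testablishmentMeans\tindividualCount",
    "1\t\t",
    "2\t\t",
    "3\tnative\t4"),
  occ_file
)
occ <- read_occurrence(occ_file)
```

Because the types are fixed before reading, a column stays character or integer however many leading rows are empty.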

add code of conduct

Any possibility of more detailed output from dwca_validate()?

I am looking into validating Darwin Core "taxon" class data in R, and want to make sure I don't reinvent the wheel. I see {finch} has a validator function, but it apparently just passes the zip off to https://tools.gbif.org/dwca-validator/ then returns only some very basic statistics (number of records) and a URL to the gbif dwca validator results, where all the juicy info is.

It would be great if the validation results were actually output as a list or (even better) dataframe.

Do the developers of finch have any plans for this sort of functionality or know of any other existing packages that do such a thing? Thanks!

use markdown docs

dwca_read() returns blank occurrence.txt data.frame despite this file not being blank

I set up a GBIF query for all records in the genus "Quercus". The file is fairly large, and that may be part of the problem. I ran dwca_read on a locally downloaded version of the file linked below. I also verified that the examples in the dwca_read help file work; with those datasets I was able to load the occurrence.txt data.frame.

library(finch)

file <- "http://api.gbif.org/v1/occurrence/download/request/0020631-151016162008034.zip"
out <- dwca_read(file, read = TRUE)
#Read 0.0% of 1027888 rowss
out$data$occurrence.txt
#data frame with 0 columns and 0 rows

issue with installing the finch

I tried to install the package but ran into some difficulties. This is the error I get; I think it might have something to do with my R version?

install.packages("finch")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/hanie/OneDrive/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/finch_0.4.0.zip'
Content type 'application/zip' length 302409 bytes (295 KB)
downloaded 295 KB

Also tried

install_github("ropensci/finch")
Error in install_github("ropensci/finch") :
could not find function "install_github"
Can you help please?
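The second error above says that no function called `install_github` is on the search path. That usually means the package providing it (for example remotes or devtools) has not been loaded. A minimal sketch, assuming the remotes package is acceptable; the wrapper name `install_finch_dev` is hypothetical and the function is only defined here, not called:

```r
# Hypothetical wrapper: install the development version from GitHub.
# Defining the function does not install anything; call it explicitly to do so.
install_finch_dev <- function(repo = "ropensci/finch") {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  # Namespace-qualified call, so remotes does not need to be attached.
  remotes::install_github(repo)
}
```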

Add a vignette

This package needs more documentation! Help out the community by contributing a vignette. If you don't know what a vignette is, check out http://r-pkgs.had.co.nz/vignettes.html for an introduction.

If you aren't sure how to contribute on GitHub, check out https://github.com/ropensci/finch/blob/master/.github/CONTRIBUTING.md

Keep in mind our code of conduct https://github.com/ropensci/finch/blob/master/CONDUCT.md

Create darwin core archives

Try new EML version

Validate dca files

... not actually being passed on to fread in dwca_read

Hi!

... arguments are not being passed on to fread in dwca_read. I've fixed this in my local repo, but it's probably easier for you to do it yourself here. Accents in some files aren't displayed correctly unless I pass encoding = "UTF-8" in dwca_read.

Thanks!
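The pattern being asked for is plain forwarding of `...` from the wrapper to the underlying reader. The sketch below shows it with a hypothetical wrapper, `read_core_file`, around base R's `read.delim()`. That substitution is an assumption made to keep the example dependency-free; finch itself calls `data.table::fread`.

```r
# Hypothetical wrapper: every extra argument is forwarded to the reader via `...`.
read_core_file <- function(path, ...) {
  utils::read.delim(path, stringsAsFactors = FALSE, ...)
}

# Write a small UTF-8 file containing an accented character.
core_file <- tempfile(fileext = ".txt")
writeLines(
  enc2utf8(c("scientificName\tlocality",
             "Quercus robur\tS\u00e3o Paulo",
             "Quercus ilex\tMadrid")),
  core_file,
  useBytes = TRUE
)

# `encoding` and `nrows` are not arguments of the wrapper; they reach read.delim().
all_rows  <- read_core_file(core_file, encoding = "UTF-8")
first_row <- read_core_file(core_file, encoding = "UTF-8", nrows = 1)
```

If the wrapper dropped `...` before the reader call, `encoding` and `nrows` would be ignored without any error, which is consistent with the symptom reported above.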

from CRAN: don't write to user library

These all show check problems on the Debian check systems caused by
attempts to write to the user library to which all packages get
installed before checking (and which now is remounted read-only for
checking).

Having package code which is run as part of the checks and attempts to
write to the user library violates the CRAN Policy's

Packages should not write in the user’s home filespace (including
clipboards), nor anywhere else on the file system apart from the R
session’s temporary directory (or during installation in the location
pointed to by TMPDIR: and such usage should be cleaned up).

Hence, please update your package(s) as quickly as possible to no longer
(attempt to) write to the user library (including, of course, the
location where the package itself was installed to).
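In practice, the policy quoted above means that code run during checks should write only beneath `tempdir()` and clean up afterwards. A minimal sketch of that pattern; the helper name `write_scratch` and the subdirectory name `dwca-scratch` are hypothetical:

```r
# Hypothetical helper: write a file beneath the session's temporary directory only.
write_scratch <- function(lines, name = "occurrence.txt") {
  dir <- file.path(tempdir(), "dwca-scratch")  # assumed subdirectory name
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir, name)
  writeLines(lines, path)
  path
}

scratch_path <- write_scratch(c("id\tscientificName", "1\tQuercus robur"))
existed <- file.exists(scratch_path)

# Clean up, as the policy asks for any such usage.
unlink(dirname(scratch_path), recursive = TRUE)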

Some links to look at, etc.

http://iphylo.blogspot.com/2013/11/gbif-and-github-fixing-broken-darwin.html?m=1
http://www.canadensys.net/publication/darwin-core
http://rs.tdwg.org/dwc/terms/simple/index.htm#simpledwcastext

Write path

Hi again,

I noticed that when using finch to download a bunch of DWC-A files, my Sys.getenv("HOME") gets cluttered with them. It seems to me that the best place to write those files would be either tempdir() or, if persistence is desired, the working directory.

Thanks,

Gustavo.
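The choice described above, between `tempdir()` by default and the working directory when files should persist, can be sketched as a small helper. The name `choose_write_dir` and the `persist` argument are hypothetical and not part of finch.

```r
# Hypothetical helper: pick where downloaded archives are written.
choose_write_dir <- function(persist = FALSE) {
  if (persist) getwd() else tempdir()
}

# Default: files land in the session's temporary directory, not in HOME.
default_target <- file.path(choose_write_dir(), "archive.zip")

# Opt in to keeping the file next to the current project.
persistent_target <- file.path(choose_write_dir(persist = TRUE), "archive.zip")
```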