limaraf / plantr

An R Package for Managing Species Records from Biological Collections

License: GNU General Public License v3.0

R 73.02% HTML 26.98%
r data-mining herbarium biological-data data-downloader data-cleaning gbif biodiversity r-package

plantr's Introduction

plantR

An R Package for Managing Species Records from Biological Collections

Description

The package plantR provides tools for downloading, processing, cleaning, validating, summarizing and exporting records of plant species occurrences from biological collections. Please read the Introduction and Tutorial of the package for more details.

Installation

The package can be installed in R from GitHub with:

library("remotes")
install_github("LimaRAF/plantR")
library("plantR")

If you run into errors while installing the package, please check the detailed package introduction for alternatives.

Bug report and suggestions

The plantR project is hosted on GitHub. Please report any bugs and suggestions for improving the package here.

The package gazetteer and the list of taxonomists are constantly being improved. If you want to contribute with regional gazetteers or with missing names of taxonomists, please e-mail [email protected].

Authors and contributors

Renato A. F. de Lima, Sara R. Mortara, Andrea Sánchez-Tapia, Hans ter Steege & Marinez F. de Siqueira

Citation

Lima, R.A.F., Sánchez-Tapia, A., Mortara, S.R., ter Steege, H., Siqueira, M.F. (2021). plantR: An R package and workflow for managing species records from biological collections. Methods in Ecology and Evolution. https://doi.org/10.1111/2041-210X.13779

Funding

The development of this package was supported by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 795114, by the Coordination for the Improvement of Higher Education Personnel (CAPES, process 88887.145924/2017-00), and by the ‘Instituto Nacional da Mata Atlântica’ (INMA).

Acknowledgements

We thank Sidnei Souza from CRIA/speciesLink for his help with the web API. We also thank the CNCFlora and the TreeCo database for providing many of the localities used to construct the package gazetteer. We thank the Harvard University Herbarium, Brazilian Herbaria Network and the American Society of Plant Taxonomists, who were main sources to compile the current list of taxonomists. We also thank Vinícius C. Souza (ESALQ/USP), who helped to validate and improve the list of plant taxonomists used in the package, and André L. de Gasper and Leila Meyer, for their valuable suggestions on how to make this package more useful and flexible for collection managers and taxonomists.

plantr's People

Contributors

andreasancheztapia, limaraf, saramortara

plantr's Issues

error in validateCoord

I get this error when running the command below from the package vignette:
occs <- validateCoord(occs)
Error in `$<-.data.frame`(`*tmp*`, "tmp.order", value = 1:0) :
replacement has 2 rows, data has 0

replaceNames is ambiguous with respect to non-ASCII characters

At the start, replaceNames.csv assumes spellings with the non-ASCII characters removed, for strings such as guyane francaise, perou, franca.
Further down, it contains non-ASCII characters in the names of the Brazilian states.
At the moment the line of code that would strip them is commented out, so the step that builds the dictionaries lets spellings with non-ASCII characters through.
We need to either strip them here or complete the table with the non-ASCII spellings.

#Renato ö: the code below was causing problems and I am almost sure that we need

Small issues with `formatDwc()`

@saramortara @AndreaSanchezTapia

Hi Sara,

I am finishing formatOcc() and looking for the point where you found the error in the vignette. I noticed a few things about the change from fixFields() to formatDwc():

  • Many of the column names that formatDwc() returns start with "extensions.http...rs.gbif.org.terms.1.0.Multimedia.http...purl.org.dc.terms."
  • There are many more columns now (516 in the vignette example). Considering how hard it is to handle such a wide data frame, wouldn't it be worth at least dropping the non-essential or optional columns that contain only NAs?
  • The new version of formatOcc() is running and, since it no longer uses tdwgNames, I could not reproduce the error you sent by email. Please pull and check whether you still get the same error.
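
A minimal base-R sketch of the second point, assuming that a column is dispensable when every value in it is NA. The helper name `drop_empty_cols` and the toy data frame are hypothetical and are not part of plantR.

```r
# Hypothetical helper (not part of plantR): drop columns that contain only NAs.
drop_empty_cols <- function(df) {
  keep <- colSums(!is.na(df)) > 0   # TRUE for columns with at least one value
  df[, keep, drop = FALSE]
}

# Toy data frame standing in for the wide output of formatDwc()
occs_toy <- data.frame(
  scientificName = c("Euterpe edulis", "Trema micrantha"),
  recordedBy     = c("Lima, R.A.F.", NA),
  emptyField     = c(NA, NA),
  stringsAsFactors = FALSE
)

names(drop_empty_cols(occs_toy))
```

Whether required fields should be protected from dropping even when empty is a design choice this sketch does not address.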

Overall check and optional improvements

(To be discussed)

data

  • the data sources need to be very well described - where do the gazetteer, the specialist list, etc. come from.
    • RAFL: will do.
  • checking the gazetteer for Colombia it's clear for me we will need fuzzy greps for this part. the official names of the municipalities do not match what goes in the labels. this is not the problem.
    • RAFL: I avoided fuzzy grepping (for localities or taxonomists) to speed up the process and to avoid any spurious matches. In my opinion, it is safer to keep completing the gazetteer with those exceptions instead of using 'agrep'.
  • download functions should have an optional save parameter - even if the default is TRUE
    • RAFL: agreed. it also automatically creates a folder in the local directory, which I think should be optional. But let's see what @saramortara has to say here. AST: I added the optional save yesterday in rocc, I'll bring it here.

editing

  • fixFields() removes optional fields but this maybe should be optional too, even if the default is TRUE - what if the user wants to keep the optional fields for any reason? The warnings could be saved as metadata

  • colNumber() expectations and functioning are not always clear. In the documentation, a "standardized notation" is mentioned, which is this standard notation?

    • RAFL: will do. The standard notation is set by the argument 'noNumb', whose default is "s.n.".
  • fixName() checks for ampersand and "e". Why doesn't it check for "and", "et", "und", "y"?

    • RAFL: makes sense, will do. Any ideas on how to implement the 'et' so that it is not confused with 'et alli'?
  • in TDWG and maybe other functions that modify names, an initial tolower() may be useful to put everything in lower case and then only at the end capitalize - otherwise you have to list everything (Van, van, Der, der) - I tried to make it work but capName() does not capitalize initials correctly so I didn't apply any change

    • RAFL: will see what I can do.
  • the separator "," in TDWGNames() should not need to have a space. ";" should do the trick. the space should be put internally - I solved this the wrong way, with a paste0.

  • capName() could be replaced by stringr::str_to_title(). It works identically but has the same problem as capName(): it does not capitalize initials correctly. By the way, initials are always written without spaces; is this standard mandatory? Initials with spaces work fine in both functions.

    • RAFL: not sure if I understood, because the function is not meant to be applied to names and initials, only to names. The initials without spaces come from the old TDWG standard (e.g. 'Lima, R.A.F'). But if there is a function in stringr that does exactly the same, I agree that we should not reinvent the wheel.
    • RAFL [2nd edit]: I double checked the function, made some improvements (e.g. vectorizing) and it is now much faster than before. stringr::str_to_title() is still faster but I decided to keep capName() to avoid new dependencies. Marked as solved for now.
  • Why is formatName() executed before TDWGName if it requires TDWG format? Also, it works with first names, not only initials.

    • RAFL: don't remember anymore. Will check.
  • in general the ideal standardized outputs should be described because DarwinCore documentation is a nightmare (ironic)

    • RAFL: could you please elaborate a bit more? AST: It's not urgent, I was thinking about having a reference document with the expected outputs. I was thinking about col numbers, since I understood (after writing this) that names will be Lastname, X.X.
  • getAdmin() has this error message: "input object needs to have a column loc.correct with the locality strings" this should be explained in the documentation. The fact that the name format must be in this "paraguay_paraguari" way needs to be explicit too.

    • RAFL: ok, will do. AST: I think I checked it because I did it.
  • regarding documentation in general not every function needs to be exported, or maybe similar functions can be documented in the same article, so that the person can see their similarities and differences. I am thinking this with all the string formatting inside formatOcc()

    • RAFL: I think this may be nice for the wrappers, and I agree that some functions need not be exported (e.g. capName). But personally I prefer when each article describes only one function. I have seen many packages with articles covering multiple functions, and I often get lost on which arguments and examples correspond to the function I am interested in.
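
On the fixName() question about "et" versus "et al.", one possible approach is a negative lookahead, sketched here in base R. The function name `normalise_separators` and the choice of " & " as the target separator are assumptions made for illustration; this is not how fixName() is implemented.

```r
# Hypothetical sketch (not the fixName() implementation): normalise common
# separators between people's names to " & ", leaving "et al." untouched.
normalise_separators <- function(x) {
  # "et" counts as a separator only when NOT followed by "al", "alii" or "alli"
  x <- gsub("\\s+et(?!\\s+al(ii|li)?\\b)\\s+", " & ", x, perl = TRUE)
  # lower-case whole words "and", "und", "e", "y" between names
  x <- gsub("\\s+(and|und|e|y)\\s+", " & ", x, perl = TRUE)
  x
}

normalise_separators(c("Lima, R.A.F. et Mortara, S.R.",
                       "Lima, R.A.F. et al.",
                       "Costa und Ribeiro"))
```

The patterns are deliberately case-sensitive so that capitalised words such as a surname starting with "Al" or an initial "E" are not mistaken for separators.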

geographic editing and validation

  • prepLoc(): "de la", "del", "du" are missing (lines 37-39)

    • RAFL: yes, this needs to be done, but first we need to check that there is no risk of creating the same string for two different sites. But yes, I think this should be done. And if it is done here, we need to adapt the maps and gazetteers.
  • prepCoord(): possible problems in decimal separator should not happen at this moment but we need an example. I didn't check the math but maybe using package measurements could be a good one

    • RAFL: I took a quick look at the function measurements::conv_unit and it looks like it does the trick. But I do not understand why you think problems with the decimal separator should not happen at this point. Remember that users may import their own database, so it may not have passed GBIF-like validation steps for the coordinates.
    • RAFL [2nd edit]: I double checked the function, made some improvements and it now has an example. Everything seems to work. I decided not to use measurements::conv_unit to avoid new dependencies. Marked as solved for now.
  • the data maps projection and datum is WGS84 but sf does not understand it as 4326. not a huge problem, i think.

    • RAFL: don't know what happened. Will check. Update: checked!
  • the decimalDegree transformation will force us to assume that everything is in the same datum (for validation) - for now there will be no projection

    • RAFL: if I remember well, only problematic coordinates are transformed. Since this should not be the rule, I think the chance of having lots of projection problems is small. And I don't see how we could solve it, since projections are rarely provided. Update: marking as solved for now.

current fieldNames.csv outdated

Fields gbif and splink do not match with current output from their sources. Given that we are only dealing with gbif and speciesLink data, I suggest we keep only those data sources. We should have two other columns: dwc with the current DarwinCore standard and a logical column required to specify if the field is required in the data cleaning procedure. Suggestion: file formarField.csv (and we should keep the script we used to generate it either here or in a different repo) containing the columns: gbif, splink, dwc, and required.

The Plant List website is no longer functional

Need to remove the TPL database and add a different global backbone to check species synonyms.

Error in TPLck(sp = d, infra = infra, corr = corr, diffchar = diffchar, :
Cannot read TPL website.

reducing plantR's dependencies

@saramortara @AndreaSanchezTapia

I have had reports of difficulties installing plantR on Windows. First it was an error installing 'gtable', then 'munsell', and so on. In the end, the person could not install plantR. Regardless of their R version or the versions of the packages they have installed, and considering the user profile we want to reach, this is a problem many people may run into.

Whatever the solution, the root of the problem is, in my view, that we have too many dependencies and recursive dependencies. Besides these installation problems, "dependencies are invitations for other people to break your package" (some nice reading on the subject here and here).

In my view, the solution is to eliminate dependencies gradually. I have been trying to do this by creating alternative internal helper functions (e.g. rmLatin) or local copies of functions from other packages that contain many functions of which we use only one (several examples in acessory_geo.R).

I wrote some code to assess the dependencies (direct and recursive) of the packages currently (19/3/21) listed in DESCRIPTION. My suggestions for dependencies to eliminate (packages that depend on many other packages or that we use little) are, in decreasing order of priority:

Imports:

  • spatialrisk, function haversine in checkCoord() => difficult because it runs in C++, but try to create a local copy in acessory_geo.R
  • tidyr, function separate in checkCoord() => do it using base R
  • knitr, function kable() in getCode(), prepFamily(), summaryData(), summaryFlags(), validateTax()
  • countrycode, function countrycode() => create a dictionary of countries and do the replacements locally
  • flora, functions fixCase(), remove.authors(), trim() => do it using base R
    Comment: only fixCase() and trim() were replaced by other functions
  • flora and Taxonstand, functions get.taxa() and TPLck() => do the name search (exact and fuzzy) locally from the DwC-A of the FBO and of World Flora Online?
  • rgbif, function occ_search() => do it locally using jsonlite? rgbif has a lot of recursive dependencies...
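
As an illustration of the first item, the haversine great-circle distance can be written in a few lines of vectorised base R. This is only a sketch: the function name `haversine_m` and the default radius (mean Earth radius, 6,371 km) are assumptions, and the result has not been compared against spatialrisk::haversine().

```r
# Base-R sketch of the haversine great-circle distance, in metres.
# Hypothetical stand-in for a local copy; not plantR code.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Distance between two points given in decimal degrees
haversine_m(-23.55, -46.63, -22.91, -43.17)
```

Because every operation is vectorised, the function accepts whole coordinate columns at once, which may partly offset the loss of the C++ implementation.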

country names in objects and formats should follow the standard nomenclature in other packages

The user doesn't have a way to know we force things like timor leste or south korea. We should be the ones following current conventions.
This breaks shares_borders using world or spData, either way this should be controlled. Ex. "cote d'ivoire" is returned as "cote ivoire". The apostrophe is ascii so I don't see the necessity to remove it and on the contrary, it breaks things.

Actually we should return standard country names in our objects, that can be recognized by other packages. Either country_codes standards or spData standards. communication will be way easier.

NA_character_

This is a question I don't want to forget: I see that NAs are being replaced everywhere with "NA_character_" but I haven't reached the point where I understand why.
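
One possible reason, offered as a general R note and not as a statement about plantR's design: plain NA is of type logical, whereas NA_character_ is a missing value of type character, so it keeps results character in type-strict code. The helper `f_ok` below is hypothetical.

```r
typeof(NA)             # "logical"
typeof(NA_character_)  # "character"

# Type-strict functions such as vapply() require the promised type,
# so a missing string has to be NA_character_ instead of plain NA
f_ok <- function(x) if (nzchar(x)) x else NA_character_
res <- vapply(c("a", ""), f_ok, character(1), USE.NAMES = FALSE)
res
```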

the logic of missName is inverted

The logic of missName is inverted.

missName <- function(x,                     # the input string
                     type = NULL,           # whether it is "collector" or "identificator" (should be "determiner")
                     noName = "Anonymous")  # the output string, by default

there are many options for type ("collector","coletor","colector","identificator","identificador","determinador")
but it does not detect these strings.
For example:
missName("s/col", type = "collector") returns "Anonymous" (so far so good)
missName("s/col", type = "colector") returns "Anonymous"

but:

missName("s/colector", type = "collector") returns "s/colector"
missName("s/colector", type = "colector") returns "s/colector"

so it is not worth having several options for type (for the package user it can be just collector or determiner); what matters is that the input string can have these variations.
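
A minimal sketch of that idea: match the variations in the input string with a pattern instead of multiplying the options of `type`. The function name `miss_name_sketch` and the list of prefixes and stems in the pattern are assumptions for illustration, not the missName() implementation.

```r
# Hypothetical sketch (not missName()): detect "no collector/determiner"
# strings from the input itself, whatever spelling was used.
miss_name_sketch <- function(x, noName = "Anonymous") {
  # "s/", "s.", "sem", "sin" or "without", then a word starting with
  # "col", "det" or "ident" (coletor, colector, determinador, identificador...)
  pattern <- "^\\s*(s[/.]|sem|sin|without)\\s*(col|det|ident)[a-z.]*\\s*$"
  x[grepl(pattern, x, ignore.case = TRUE)] <- noName
  x
}

miss_name_sketch(c("s/col", "s/colector", "S/Coletor", "Lima, R.A.F."))
```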

It remains to be seen how this fits into formatOcc().

Installation error

Hello, I am having a problem loading plantR.
install_github("LimaRAF/plantR")
library("plantR")

library("plantR")
Error in library("plantR") : there is no package called ‘plantR’

Error in formatTax()

Hi,

Below is the error I am getting in formatTax()

occs <- readData(file = "0428744-210914110416597.zip",
                 path <- "https://api.gbif.org/v1/occurrence/download/request/",
                 output = 'occurrence') [using only output = occurrence because of what I mentioned in the other thread]

#DwC file: format field names for following formatting
occs <- formatDwc(gbif_data = occs)
#Format collection codes, people names, collector number, and dates
occs <- formatOcc(occs)
#Standardize locality info (country, city names)
occs <- formatLoc(occs)
#Geographical coordinates: decimal degrees formatting and retrieves missing coordinates
#from a gazetteer
occs <- formatCoord(occs)
#Format species and family names
occs <- formatTax(occs)

It returns the following (below it says that the family names were replaced, but I imagine the error prevents it, because I checked afterwards and they were not):

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
The following family names were automatically replaced:

|Genus          |Old fam.         |New fam.         |
|:--------------|:----------------|:----------------|
|Cordia         |Cordiaceae       |Boraginaceae     |
|Cunninghamia   |Cupressaceae     |Rubiaceae        |
|Dombeya        |Malvaceae        |Araucariaceae    |
|Ehretia        |Ehretiaceae      |Boraginaceae     |
|Euploca        |Heliotropiaceae  |Boraginaceae     |
|Heisteria      |Erythropalaceae  |Olacaceae        |
|Heliotropium   |Heliotropiaceae  |Boraginaceae     |
|Hydrocotyle    |Apiaceae         |Araliaceae       |
|Lonicera       |Caprifoliaceae   |Rubiaceae        |
|Matourea       |Plantaginaceae   |Scrophulariaceae |
|Myriopus       |Heliotropiaceae  |Boraginaceae     |
|Nephrolepis    |Nephrolepidaceae |Lomariopsidaceae |
|Piriqueta      |Turneraceae      |Turneraceae      |
|Prosopanche    |Hydnoraceae      |Hydnoraceae      |
|Quiina         |Quiinaceae       |Quiinaceae       |
|Sambucus       |Adoxaceae        |Adoxaceae        |
|Tetrastylidium |Strombosiaceae   |Olacaceae        |
|Tournefortia   |Heliotropiaceae  |Boraginaceae     |
|Turnera        |Turneraceae      |Turneraceae      |
|Varronia       |Cordiaceae       |Boraginaceae     |
|Viburnum       |Adoxaceae        |Adoxaceae        |
|Viviania       |Vivianiaceae     |Vivianiaceae     |
|Ximenia        |Ximeniaceae      |Olacaceae        |

Error in `[.data.table`(families.data, is.na(name.correct), tmp.fam, FALSE) : 
  The items in the 'by' or 'keyby' list are length(s) (1). Each must be length 10; the same length as there are rows in x (after subsetting if i is provided).
In addition: Warning messages:
1: In gsub(paste0(" ", rank, " "), paste0(" ", rank), x_new, perl = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
2: In gsub(paste0(" ", rank, " "), paste0(" ", rank), x_new, perl = TRUE) :
  argument 'replacement' has length > 1 and only the first element will be used

I am on {plantR} 0.1.5 and R 4.2.1.

Cheers!

readData(): grepl() error?

All good, @LimaRAF? I will open separate threads for a few problems I am having.

occs <- readData(file = "0428744-210914110416597.zip", path <- "https://api.gbif.org/v1/occurrence/download/request/")

It returns the following error:

Error in if (grepl("verbatim.txt", all.files)) { : 
  the condition has length > 1
In addition: Warning message:
In data.table::fread(occ.path, na.strings = na.strings, quote = quote,  :
  Found and resolved improper quoting out-of-sample. First healed line 2040: <<1305097720																										38360565-6485-48ac-8a91-10033ff89543										CC_BY_4_0			2016-03-10T11:23:54Z		Centro Internacional de Agricultura Tropical - CIAT																		CEN				OCCURRENCE				38360565-6485-48ac-8a91-10033ff89543	19495924	40723	Srgio Duarte Prat Kricun									Native				PRESENT								"Arbol de 4m de altura, flores blanquecinas. Caa - vera"". Co Gilberti N 251.&nf;Doao Herbrio Yerba Mate y TE - EEA Cerro Azul. Cerro Azul, Misiones, Argentina.""". EMBRAPA Recursos Ge>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

Here is the link to my dataset:
Your download is available at the following address:
https://api.gbif.org/v1/occurrence/download/request/0428744-210914110416597.zip

However, the same also happens with the example data from ?readData
occs <- readData(file = "0227351-200613084148143.zip", path <- "https://api.gbif.org/v1/occurrence/download/request/")

Error in if (grepl("verbatim.txt", all.files)) { : 
  the condition has length > 1

In both cases, if I add output = 'occurrence' to the arguments, the error does not occur.
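
The error message itself points to a likely cause: grepl() returns one logical value per element of all.files, while if() needs exactly one (a condition of length greater than one became an error in R 4.2.0). A minimal sketch of a guard follows; the file names are illustrative and the fix actually adopted in readData() may differ.

```r
# File names as they might be listed inside a Darwin Core archive (illustrative)
all.files <- c("occurrence.txt", "verbatim.txt", "multimedia.txt", "meta.xml")

# One logical per file: not usable directly inside if()
n_flags <- length(grepl("verbatim.txt", all.files, fixed = TRUE))

# Collapsing with any() gives the single TRUE/FALSE that if() expects
has_verbatim <- any(grepl("verbatim.txt", all.files, fixed = TRUE))
if (has_verbatim) {
  message("verbatim.txt found in the archive")
}
```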

PS: something minor that is not worth opening an issue for: GBIF apparently no longer has a column named ''country'' in the DwC file. If I run occs <- formatDwc(gbif_data = occs) on your example data, it returns

Warning message:
Important columns were not found in the gbif_data: 
 country 

Thanks!

(I have another problem going on with formatTax(), but I will need to reproduce the error because I cleared the console...)

formatDwc()

The formatDwc() function returned the following error, both for data downloaded with the plantR functions and for data downloaded directly from GBIF and speciesLink:

occs_splink <- rspeciesLink(filename = "Lophostigma_teste_splink.txt", save = TRUE, basisOfRecord = 'PreservedSpecimen', species = "Lophostigma")
occs_gbif <- rgbif2(filename = "Lophostigma_teste_gbif.txt", species = "Lophostigma Radlk.", n.records = 110000, force = TRUE, save = TRUE)
occs <- formatDwc(splink_data = occs_splink, gbif_data = occs_gbif, drop = TRUE)
Error in `[.data.frame`(splink_data, , c("yearIdentified", "monthIdentified", :
undefined columns selected
In addition: Warning message:
some columns in splink_data do not follow the speciesLink pattern

Error using function 'rgbif2()'

@saramortara @AndreaSanchezTapia

When using rgbif2() with "Euterpe edulis", everything is fine. When using rgbif2() with "Trema micrantha" (or "Casearia sylvestris", or both), I got the following error:

rgbif2(species = "Trema micrantha", save = FALSE)
Making request to GBIF...
 Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 0, 1 

Any idea of what might be happening?

Get species info and taxonomic validation using the Brazilian Flora 2020

@saramortara and @AndreaSanchezTapia,

I have been working on something to get the full description of species (including the reference where the species was published) for final reporting of the species included in my analyses. Since flora does not do it, I had to do it in Tropicos... But Tropicos does not suggest the valid names in case of synonyms, so you need first to get the valid names using flora::get.tax and then use taxize::get_tpsid to get species info.

However, the DwC files from flora have the info on species references (extension data table 'Reference'). And then I realized that there is also a table TypesAndSpecimen!!! This info is gold: it would be great for validating the taxonomy of the occurrences and could be added to plantR::validateTax...

I was thus thinking of implementing it here within plantR, but I wanted to talk with you first. I know that you have talked with Gustavo to do upgrades on flora for other reasons. Do you think it is better to propose him a collaboration directly in flora codes (this would require using tables currently not loaded from flora) or to include it all in plantR and then use ??

checkBorders

  • no need at all to return share_border
  • idk why we have long merge messages (maybe it's Vectorize() or something in sharesBorders())

validateCoord()

Hi!
I installed the latest version of plantR today and now the validateCoord() function is no longer working for my dataset.

Code:
occs.all.6 <- read.csv("output.occs.all.6.csv", encoding = "native.enc")
occs.all.7 <- validateCoord(occs.all.6, output = "new.col")

Error in s2_geography_from_wkb(x, oriented = oriented, check = check) :
Evaluation error: Found 1 feature with invalid spherical geometry.
[194] Loop 2 edge 8 crosses loop 3 edge 0.

The dataset is available at:
https://github.com/leilameyer08/plantR/blob/main/output.occs.all.6.csv

Thank you!

vectorize `formatName()`

formatName() is taking a vector, but the very first logical test is not vectorized and thus will only obey the first element of the vector.

if (!grepl("[a-z;A-Z], ", x)) {
    nome <- x
  }

will return

Warning message:
In if (!grepl("[a-z;A-Z], ", x)) { :
  the condition has length > 1 and only the first element will be used
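
One way to vectorize such a test is logical indexing: compute the condition for every element and transform only the elements that match. The sketch below keeps the regular expression from the issue; the function name `format_vectorised` and the placeholder transformation `fmt` are hypothetical and stand in for whatever formatName() does with names in "Last, First" form.

```r
# Hypothetical sketch of a vectorised version of the first test in formatName()
format_vectorised <- function(x, fmt = toupper) {
  has_comma <- grepl("[a-z;A-Z], ", x)   # one TRUE/FALSE per name
  nome <- x                              # names without ", " are kept as they are
  nome[has_comma] <- fmt(x[has_comma])   # placeholder for the real formatting
  nome
}

format_vectorised(c("Lima, R.A.F.", "Renato Lima", "Mortara, S.R."))
```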

Geographic validation

formatOcc()

Hi everyone,

Thank you for fixing the problem with the validateCoord() function.

I reinstalled plantR, but now a new problem has appeared in the formatOcc() function, which is not formatting the names of collectors and identifiers correctly. Completely unrelated names appear in the columns recordedBy.new and identifiedBy.new.

I tested it with the package example with Euterpe edulis and it works normally. The problem only appears with my data.

Code:
if(!require("plantR")) remotes::install_github("LimaRAF/plantR")

data <- 'https://raw.github.com/leilameyer08/plantR/master/occs.exemplo.csv'
data <- read.csv(data, row.names = 1, encoding = "UTF-8")

occs.all.2 <- formatOcc(data)

occs.all.2$recordedBy[1] ##"M. A. Costa, J. Ribeiro, P. A. Assunçao & E. C. Pereira"
occs.all.2$recordedBy.new[1] ## "Ackermann"

occs.all.2$identifiedBy[1] ##"Acevedo-Rodríguez, P., (BOT), Smithsonian Institution - National Museum of Natural History (UNITED STATES)"
occs.all.2$identifiedBy.new[1] ##"Sandwith, N.Y."

Thank you!

dealing with non-ASCII characters

When running devtools::check() we get this warning

Found the following files with non-ASCII characters:
colNumber.R
fixName.R
getYear.R
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.
Use \uxxxx escapes for other characters.

We need to check all these functions to not use non ASCII and maybe use textclean package
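
A small base-R sketch of how non-ASCII strings can be detected and written with \uxxxx escapes. The helper name `has_non_ascii` is hypothetical; tools::showNonASCIIfile() is the base-R utility that lists the offending lines of a source file.

```r
# The escape "\u00fa" keeps this source file ASCII-only while producing "ú"
x <- c("Junior", "J\u00fanior")

# Hypothetical helper: TRUE where a string has any code point above 127
has_non_ascii <- function(s) {
  vapply(s, function(z) any(utf8ToInt(z) > 127L), logical(1), USE.NAMES = FALSE)
}

non_ascii <- has_non_ascii(x)

# To list offending lines in a package file (path is illustrative):
# tools::showNonASCIIfile("R/fixName.R")
```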

error rgbif2

I have an error running the example from the tutorial:


library("plantR")
spp <- c("Casearia sylvestris",
         "Euterpe edulis",
         "Trema micrantha")

occs_gbif <- rgbif2(species =  spp,
                    basisOfRecord = "PRESERVED_SPECIMEN",
                    remove_na = FALSE, limit = 500000)

This gives me an error

Making request to GBIF...
Making request to GBIF...
Making request to GBIF...
Error in names(gbif_data) <- species : 
  attribut 'names' [3] must be of same length than vector [2]

I run under:
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

#bestpractices: commit messages need to be way more informative

Trying to see what was changed in the geographic validation workflow, messages like "fixing minor bugs" tell us that there were bugs but not which bugs they were or where.
If they were not bugs (the workflow was working) but additions made for specific reasons and changes in the way the code is written, these changes should be understandable to anyone looking at a specific modification in a file.

Validation of duplicates

Codes from the package module on search and merge of duplicated specimens

  • function prepDup()
  • function getDup()
  • function mergeDup()
  • function rmDup()
  • Check documentation and examples of the functions

current warnings WIP

as of 2021-02-24:
formatDwc

Error: Can't combine `gbif$dateIdentified` <datetime<UTC>> and `speciesLink$dateIdentified` <character>.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In formatDwc(splink_data = occs_splink, gbif_data = occs_gbif) :
  some columns in gbif_data do not follow the gbif pattern
Called from: signal_abort(cnd)
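
The type clash on dateIdentified could be avoided by bringing date-time columns to character before the two sources are combined. The sketch below uses toy data frames; the helper name `dates_to_character` and the ISO 8601 output format are assumptions, not the approach taken in formatDwc().

```r
# Toy inputs: GBIF-like date-time column and speciesLink-like character column
gbif_toy   <- data.frame(dateIdentified = as.POSIXct("2016-03-10 11:23:54", tz = "UTC"))
splink_toy <- data.frame(dateIdentified = "2016-03-10", stringsAsFactors = FALSE)

# Hypothetical helper: convert POSIXct columns to character so types agree
dates_to_character <- function(df) {
  is_dt <- vapply(df, inherits, logical(1), what = "POSIXct")
  for (col in names(df)[is_dt]) {
    df[[col]] <- format(df[[col]], "%Y-%m-%dT%H:%M:%SZ")
  }
  df
}

combined <- rbind(dates_to_character(gbif_toy), dates_to_character(splink_toy))
str(combined)
```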

formatOcc:

`Warning message:
In last.name[gen.suf] <- gsub("\\s[A-Zà-ýÀ-Ý]\\.", "", name[gen.suf],  :
  number of items to replace is not a multiple of replacement length`

validateCoords [this one is already done]

Error in checkBorders(x1, geo.check = "geo.check", country.shape = country.shape,  : 
  unused arguments (geo.check = "geo.check", country.shape = country.shape, country.gazetteer = country.gazetteer, output = output)

validateTax bad format names

Identifier Records
(mo, R.L. 148
Dias, M.C. 141
(mo, R.L.L. 140
(aau, S.L. 112

validateDup

Warning message:
To merge geographic info, the input data must contain the following column(s): geo.check. Skipping 'geo' from info to merge. 

Make fixSpecies() handle typical NCBI organism names

Names in NCBI/GenBank commonly have a particular notation that fixSpecies() currently does not handle well. Improvements are needed. Examples are:

  • "Syzygium sp. DIR045"
  • "Malaxis sp. TBG132709"
  • "Senecio lautus x Senecio repangae subsp. pokohinuensis"
  • "Arabidopsis thaliana x Arabidopsis arenosa"
  • "Juglans microcarpa x Juglans regia"
  • "Acacia aff. deltoidea x Acacia aff. stigmatophylla R5710"
  • "Metrosideros excelsa x Metrosideros kermadecensis"
  • "Metrosideros excelsa x Metrosideros robusta"
  • "Tetrastigma cf. magnum PBP-2017"
  • "Phyllanthus aff. sootepensis RWB059"
  • "Phyllanthus cf. gomphocarpus Yu 250"
  • "Phyllanthus cf. boehmii Friis 13159"
  • "Acacia aff. shirleyi Carroll s.n."
  • "Acacia aff. cowleana Carroll 18"
  • "Acacia aff. gracillima Barett 2931"
  • "Phyllanthus sp. RWB-2020a"
  • "Asarum sp. DT2604"
  • "Phyllanthus aff. vakinankaratrae Ravelonarivo 4264"
  • "Tragia sp. Philippines"
  • "Tragia sp. Merida"

And also:

  • "Syzygium spp."
  • "Paralychnophora harleyi (H.Rob.) D.J.N.Hind"
  • "Anemopaegma arvense (Vell.) Stellfeld ex de Souza"
  • "Maytenus boaria Molina"
  • "Sloanea fasciculata D. Sampaio & V.C. Souza"
  • "Conchocarpus albiflorus (Bruniera & Groppo) Bruniera & Groppo"
  • "Grabowskia boerhaaviifolia (L. f.) Schltdl."
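
As a starting point, names like those listed above could first be sorted into broad categories with regular expressions before any cleaning is attempted. The function name `classify_ncbi_name` and the category labels are hypothetical; this is not how fixSpecies() works, and names with authorities are simply left in the default category.

```r
# Hypothetical sketch (not fixSpecies()): flag NCBI-style organism names.
classify_ncbi_name <- function(x) {
  status <- rep("possibly_ok", length(x))
  status[grepl("\\s(aff|cf)\\.\\s", x)] <- "affinis_or_conferatur"
  status[grepl("\\sspp?\\.(\\s|$)", x)] <- "indet"   # "sp." or "spp."
  status[grepl("\\sx\\s", x)] <- "hybrid"            # hybrid formula takes precedence
  status
}

classify_ncbi_name(c("Syzygium sp. DIR045",
                     "Arabidopsis thaliana x Arabidopsis arenosa",
                     "Tetrastigma cf. magnum PBP-2017",
                     "Syzygium spp.",
                     "Maytenus boaria Molina"))
```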

fixNames() function

returning " J\xFAnior" makes it parse as "J<fa>nior"

All Unicode escape codes begin with \u, in this case "\u00fa".

But most important: we should not be returning non-ascii characters, if anything it should be nice and plain "Junior". Update: I'm lying, it could be Júnior. The only thing is that Júnior is only for pt-br names, not universal.

rgbif2() for family

Testing the tutorial for a family worked, but not for Sapindaceae. I think the problem is that there are more than 2.5 million records in GBIF for Sapindaceae. The example has a country filter. We ran it in the two ways below. Collecting the information from speciesLink worked. For GBIF, it did not:

familia <- "Sapindaceae"
occs_splink <- rspeciesLink(family = familia) #It's working for specieslink
occs_gbif <- rgbif2(species = familia,
n.records = 2600000) #It's not working for gbif. Maybe the number of records?
Error in names(gbif_data) <- species :
'names' attribute [1] must be the same length as the vector [0]

occs_gbif <- rgbif2(species = familia,
country = "BR",
n.records = 450000) #It's not working, even using the same filter used to do the tutorial
Error in names(gbif_data) <- species :
'names' attribute [1] must be the same length as the vector [0]
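
The message suggests that the list of results ended up shorter than the vector of queried names, which could happen if empty results are dropped before the names are assigned. That cause is a guess, not something confirmed in the rgbif2() code. The sketch below, with toy data, names the list first and filters afterwards so the lengths always agree.

```r
species <- c("Casearia sylvestris", "Euterpe edulis", "Trema micrantha")

# Toy stand-in for per-query downloads; the second query returned nothing
results <- list(data.frame(key = 1:2), NULL, data.frame(key = 3L))

# Name the list BEFORE dropping empty results
names(results) <- species
results <- Filter(function(d) !is.null(d) && nrow(d) > 0, results)

names(results)
```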

Errors in validateCoord() & formatTax()

Hi everyone,

I am running the plantR standardization and validation functions on the occurrence data for the species of the tribe Paullinieae, and errors appeared in the functions formatTax() and validateCoord().
I left the script with the errors below. The script downloads all occurrence records for the six genera of the tribe. This takes a little while (there are slightly more than 200 thousand records), but this way the errors are more certain to appear.

The formatTax() function is the less problematic one because we already have a list with the standardized species names and no longer need to run the function. My goal is just to report the error of the function with this dataset.
The error that appears is: Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 182344, 188400

In the validateCoord() function, the error varies across different tests (please see the end of the script).

Thank you very much once again!

### Clean the environment ####
rm(list=ls())



### Installation ####
if(!require("remotes")) install.packages("remotes")
if(!require("plantR")) remotes::install_github("LimaRAF/plantR")
if(!require("BIEN")) install.packages("BIEN")
if(!require("rgbif")) install.packages("rgbif")
if(!require("stringr")) install.packages("stringr")




### Download occurrence data ####


#### Download occurrence data from BIEN ####
occ.BIEN <- BIEN_occurrence_genus(genus = c("Cardiospermum",
                                            "Lophostigma",
                                            "Paullinia", 
                                            "Serjania",
                                            "Thinouia", 
                                            "Urvillea"),
                                  cultivated = F,
                                  only.new.world = F,
                                  all.taxonomy = F,
                                  native.status = F,
                                  natives.only = T,
                                  observation.type = T,
                                  political.boundaries = T,
                                  collection.info = T) 

dim(occ.BIEN) # 33415   24


## Standardization of character encoding
for (i in 1:ncol(occ.BIEN)){
  if(is.character(occ.BIEN[,i])){
    Encoding(occ.BIEN[,i]) <- "UTF-8"
  }
}



### Download occurrence data from speciesLink ####
occ.splink <- rspeciesLink(basisOfRecord = "PreservedSpecimen",
                           family = "Sapindaceae",
                           species = c("Cardiospermum",
                                       "Lophostigma",
                                       "Paullinia", 
                                       "Serjania",
                                       "Thinouia", 
                                       "Urvillea"),
                           Scope = "plants",
                           Synonyms = "species2000",
                           MaxRecords = 300000)

dim(occ.splink) # 62291     49


## Standardization of character encoding
c.right <- c("À", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë", 
             "Ì", "Î", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø", 
             "Ù", "Ú", "Û", "Ü", "Þ", "ß", "á", "â", "ã", "ä", "å", 
             "æ", "ç", "è", "é", "ê", "ë", "ì", "î", "ï", "ð", "ñ", "ò", 
             "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ", "í")

c.wrong <- c("À", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", 
             "Ê", "Ë", "Ì", "Î", "Ñ", "Ò", "Ó", "Ô", 
             "Õ", "Ö", "×", "Ø", "Ù", "Ú", "Û", "Ü", "Þ", "ß", 
             "á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", 
             "ë", "ì", "î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", 
             "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ", "Ã.")


for (i in 1:ncol(occ.splink)){
  if(is.character(occ.splink[,i])){
    Encoding(occ.splink[,i]) <- "UTF-8"
    
    for(j in 1:length(c.right)){
      occ.splink[,i] <- str_replace_all(occ.splink[,i], 
                                        pattern = c.wrong[j], 
                                        replacement = c.right[j])
    }
  }
}



### Download occurrence data from GBIF ####
occ.gbif <- rgbif2(dir = "data/plantR",
                   filename = "output.gbif",
                   species = c("Thinouia Triana & Planch.", 
                               "Lophostigma Radlk.", 
                               "Cardiospermum L.", 
                               "Paullinia L.",
                               "Serjania Mill.", 
                               "Urvillea Kunth"),
                   n.records = 300000,
                   force = T,
                   basisOfRecord = "PRESERVED_SPECIMEN")

dim(occ.gbif) # 134568    206


## Standardization of character encoding
for (i in 1:ncol(occ.gbif)){
  if(is.character(occ.gbif[,i])){
    Encoding(occ.gbif[,i]) <- "UTF-8"
  }
}





### Combine different databases ####


### Formatting BIEN database before running the formatDwc() function ####

## Separate "date_collected" into year, month and day
occ.BIEN[,25] <- sapply(strsplit(as.character(occ.BIEN$date_collected), "-"), function(x) (x[1]))
occ.BIEN[,26] <- sapply(strsplit(as.character(occ.BIEN$date_collected), "-"), function(x) (x[2]))
occ.BIEN[,27] <- sapply(strsplit(as.character(occ.BIEN$date_collected), "-"), function(x) (x[3]))
colnames(occ.BIEN)[25:27] <- c("year", "month", "day")



## Prepare other required columns
occ.BIEN[,28] <- occ.BIEN$county
occ.BIEN[,29:30] <- NA
occ.BIEN[,31] <- "Sapindaceae"
colnames(occ.BIEN)[28:31] <- c("municipality", "typeStatus", "scientificNameAuthorship", "family")


## Standardize column names
colnames(occ.BIEN)[colnames(occ.BIEN) == "collection_code"] <- "collectionCode"
colnames(occ.BIEN)[colnames(occ.BIEN) == "catalog_number"] <- "catalogNumber"
colnames(occ.BIEN)[colnames(occ.BIEN) == "record_number"] <- "recordNumber"
colnames(occ.BIEN)[colnames(occ.BIEN) == "recorded_by"] <- "recordedBy"
colnames(occ.BIEN)[colnames(occ.BIEN) == "state_province"] <- "stateProvince"
colnames(occ.BIEN)[colnames(occ.BIEN) == "latitude"] <- "decimalLatitude"
colnames(occ.BIEN)[colnames(occ.BIEN) == "longitude"] <- "decimalLongitude"
colnames(occ.BIEN)[colnames(occ.BIEN) == "identified_by"] <- "identifiedBy"
colnames(occ.BIEN)[colnames(occ.BIEN) == "date_identified"] <- "dateIdentified"
colnames(occ.BIEN)[colnames(occ.BIEN) == "scrubbed_species_binomial"] <- "scientificName"
colnames(occ.BIEN)[colnames(occ.BIEN) == "custodial_institution_codes"] <- "institutionCode"
colnames(occ.BIEN)[colnames(occ.BIEN) == "X.U.FEFF.scrubbed_genus"] <- "scrubbed_genus"

occ.BIEN$dateIdentified <- as.character(occ.BIEN$dateIdentified)



### Combine database using formatDwc() function ####
occs.all <- formatDwc(user_data = occ.BIEN, 
                      splink_data = occ.splink,
                      gbif_data = occ.gbif, 
                      drop = T, bind_data = T)

dim(occs.all) # 230274     47  records




### Data editing ####

#### Collection codes, people names, collector number and dates ####


## Formatting strings before running formatOcc() function to avoid this error:
# Error in gsub(x, "", y, fixed = TRUE) : zero-length pattern
occs.all$recordNumber[which(occs.all$recordNumber == "938[=Diary No. 707]")] <- NA
occs.all$verbatimEventDate[which(occs.all$verbatimEventDate == "Sept. 4-'77")] <- NA
occs.all$verbatimEventDate[which(occs.all$verbatimEventDate == "label says \"1841/XIV\"")] <- NA
occs.all$recordedBy[which(occs.all$recordedBy == "M. Nadruz; ,J.F. Baumgratz, M. Bovini, D.S.P. Silva")] <- "M. Nadruz, J.F. Baumgratz, M. Bovini, D.S.P. Silva"



## Replacing "," by ";" to separate names of collectors and identifiers
## Case 1: "M. A. Costa, J. Ribeiro, E. C. Pereira"

## recordedBy
pos.c2semic <- which(sapply(str_locate_all(pattern = "\\.", occs.all$recordedBy), function(x) (x[1])) == 2 &
                       !is.na(sapply(str_locate_all(pattern = "\\,", occs.all$recordedBy), function(x) (x[1]))))

occs.all$recordedBy[pos.c2semic] <- str_replace_all(occs.all$recordedBy[pos.c2semic], "\\,", "\\;")


## identifiedBy
pos.c2semic.I <- which(sapply(str_locate_all(pattern = "\\.", occs.all$identifiedBy), function(x) (x[1])) == 2 &
                         !is.na(sapply(str_locate_all(pattern = "\\,", occs.all$identifiedBy), function(x) (x[1]))))

occs.all$identifiedBy[pos.c2semic.I] <- str_replace_all(occs.all$identifiedBy[pos.c2semic.I], "\\,", "\\;")



### Replacing "|" by " | " 
## Case 2: "M. A. Costa|J. Ribeiro|E. C. Pereira"
occs.all$recordedBy <- str_replace_all(occs.all$recordedBy, "\\|", " | ")
occs.all$identifiedBy <- str_replace_all(occs.all$identifiedBy, "\\|", " | ")



occs.all.2 <- formatOcc(occs.all)



#### Locality information ####
occs.all.3 <- formatLoc(occs.all.2)


#### Geographical coordinates ####
occs.all.4 <- formatCoord(occs.all.3)


#### Species and family names ####
occs.all.5 <- formatTax(occs.all.4, db = "tpl") 
## Error in data.frame(..., check.names = FALSE) : 
# arguments imply differing number of rows: 182344, 188400



#### Locality information
occs.all.6 <- validateLoc(occs.all.4)



#### Geographical coordinates
## Test 1
occs.all.7 <- validateCoord(occs.all.6, output = "new.col")
## Error in s2_geography_from_wkb(x, oriented = oriented, check = check) : 
# Evaluation error: Found 1 feature with invalid spherical geometry.
# [194] Loop 2 edge 8 crosses loop 3 edge 0.



## Test 2
sf::sf_use_s2(F)
occs.all.7 <- validateCoord(occs.all.6, output = "new.col")
## Error in `$<-.data.frame`(`*tmp*`, "geo.check", value = c("ok_county",  : 
# replacement has 182344 rows, data has 182346
# In addition: Warning messages:
#   1: In geo.check[is.na(geo.check)] <- `*vtmp*` :
#   number of items to replace is not a multiple of replacement length
# 2: In geo.check[is.na(geo.check)] <- tmp1[is.na(geo.check)] :
#   number of items to replace is not a multiple of replacement length




## Test 3
sf::sf_use_s2(F)
pos <- which(occs.all.6$recordedBy.new == "Busey, P." &
               occs.all.6$recordNumber.new == "422") # These records seem to be causing the error: "Error in `$<-.data.frame`(`*tmp*`, "geo.check", value = c("ok_county",..."
occs.all.6.a <- occs.all.6[-pos,]

occs.all.7 <- validateCoord(occs.all.6.a, output = "new.col", tax.name = "scientificName")
## Error in robustbase::covMcd(df1[use_these, c("lon2", "lat2")], alpha = 1/2,  : 
# n == p+1  is too small sample size for MCD


Problem while handling first family name with preposition

Check function code related to the problem below:

plantR::prepTDWG("Maria Souza da Silva", get.initials = FALSE, get.prep = TRUE)
[1] "Silva, Maria Souza da" # ok

plantR::prepTDWG("Maria Souza da Silva", get.initials = FALSE, get.prep = TRUE, format = "init_last")
[1] "Maria Souza da Silva" # ok

plantR::prepTDWG("Maria da Silva Souza", get.initials = FALSE, get.prep = TRUE)
[1] "Souza, Maria Silva" # preposition removed! 

plantR::prepTDWG("Maria da Silva Souza", get.initials = FALSE, get.prep = TRUE, format = "init_last")
[1] "Maria Silva Souza" # preposition removed!
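
The expected behaviour can be pinned down with a small reference function to compare against. This is a hypothetical helper, not the prepTDWG() implementation: it moves the last token to the front and leaves every other token, prepositions included, in place.

```r
# Hypothetical reference: "First Middle Last" -> "Last, First Middle".
last_name_first <- function(name) {
  parts <- strsplit(trimws(name), "\\s+")[[1]]
  n <- length(parts)
  if (n < 2) return(name)  # single-token names are returned unchanged
  paste0(parts[n], ", ", paste(parts[-n], collapse = " "))
}

case_ok  <- last_name_first("Maria Souza da Silva")
case_bug <- last_name_first("Maria da Silva Souza")
```

Under this reading, the first call reproduces the output marked "ok" in the report, and the third call in the report would be expected to keep "da" in the result.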

formatDwc() — "can't combine"

Hi,
I usually e-mail Renato directly, but this may be the more appropriate place. (I wrote this in pt-br; if you prefer English, for accessibility reasons, I can rewrite it later and use English in future posts.)

I am having a problem with formatDwc() when I use data coming from rspeciesLink() and readData() (from a .zip that I downloaded directly from GBIF).

Here is the code:

#GBIF points for SC (Tracheophyta; GBIF)
plants_gbif <- readData(file = 'gbif_plants.zip', path = 'C:/Users/Master/OneDrive - FURB/Mestrado')
#Selecting only the first list
plants_gbif <- plants_gbif$occurrence 

#INCT points for all SC plants
plants_inct <- rspeciesLink(Scope = "plants",
                            basisOfRecord = "PreservedSpecimen",
                            Synonyms = "flora2020",
                            stateProvince = "Santa Catarina")

plants_inct2 <- plants_inct

#Preparing input data in the correct format
occs <- formatDwc(gbif_data = plants_gbif,
                  splink_data = plants_inct2,
                  drop = TRUE)

This returns the following error:
Error: Can't combine gbif$eventDate <datetime<UTC>> and speciesLink$eventDate <character>.

Below is the rlang::last_trace() output:

> rlang::last_trace()
<error/vctrs_error_incompatible_type>
Can't combine `gbif$eventDate` <datetime<UTC>> and `speciesLink$eventDate` <character>.
Backtrace:
    x
 1. \-plantR::formatDwc(...)
 2.   \-dplyr::bind_rows(res_list, .id = "data_source")
 3.     \-vctrs::vec_rbind(!!!dots, .names_to = .id)
 4.       \-(function () ...
 5.         \-vctrs::vec_default_ptype2(...)
 6.           \-vctrs::stop_incompatible_type(...)
 7.             \-vctrs:::stop_incompatible(...)
 8.               \-vctrs:::stop_vctrs(...)

If I skip readData() and use rgbif2() directly, the error does not occur for a smaller example (I have not tested it with all my data because downloading all the species would take a while right now, but I can set it running if needed). However, since the GBIF data came straight from the .zip via readData(), they should in theory be the same as those obtained with rgbif2(), right? Another small difference between the GBIF and speciesLink data is that from GBIF I took only vascular plants, whereas from speciesLink I took any kind of plant. I am not sure this information is useful, but it might help.

As for the columns obtained from each database, according to the functions I used above:

Columns of the GBIF data obtained with readData()

[1] "gbifID"                              "abstract"                            "accessRights"                       
  [4] "accrualMethod"                       "accrualPeriodicity"                  "accrualPolicy"                      
  [7] "alternative"                         "audience"                            "available"                          
 [10] "bibliographicCitation"               "conformsTo"                          "contributor"                        
 [13] "coverage"                            "created"                             "creator"                            
 [16] "date"                                "dateAccepted"                        "dateCopyrighted"                    
 [19] "dateSubmitted"                       "description"                         "educationLevel"                     
 [22] "extent"                              "format"                              "hasFormat"                          
 [25] "hasPart"                             "hasVersion"                          "identifier"                         
 [28] "instructionalMethod"                 "isFormatOf"                          "isPartOf"                           
 [31] "isReferencedBy"                      "isReplacedBy"                        "isRequiredBy"                       
 [34] "isVersionOf"                         "issued"                              "language"                           
 [37] "license"                             "mediator"                            "medium"                             
 [40] "modified"                            "provenance"                          "publisher"                          
 [43] "references"                          "relation"                            "replaces"                           
 [46] "requires"                            "rights"                              "rightsHolder"                       
 [49] "source"                              "spatial"                             "subject"                            
 [52] "tableOfContents"                     "temporal"                            "title"                              
 [55] "type"                                "valid"                               "institutionID"                      
 [58] "collectionID"                        "datasetID"                           "institutionCode"                    
 [61] "collectionCode"                      "datasetName"                         "ownerInstitutionCode"               
 [64] "basisOfRecord"                       "informationWithheld"                 "dataGeneralizations"                
 [67] "dynamicProperties"                   "occurrenceID"                        "catalogNumber"                      
 [70] "recordNumber"                        "recordedBy"                          "individualCount"                    
 [73] "organismQuantity"                    "organismQuantityType"                "sex"                                
 [76] "lifeStage"                           "reproductiveCondition"               "behavior"                           
 [79] "establishmentMeans"                  "occurrenceStatus"                    "preparations"                       
 [82] "disposition"                         "associatedReferences"                "associatedSequences"                
 [85] "associatedTaxa"                      "otherCatalogNumbers"                 "occurrenceRemarks"                  
 [88] "organismID"                          "organismName"                        "organismScope"                      
 [91] "associatedOccurrences"               "associatedOrganisms"                 "previousIdentifications"            
 [94] "organismRemarks"                     "materialSampleID"                    "eventID"                            
 [97] "parentEventID"                       "fieldNumber"                         "eventDate"                          
[100] "eventTime"                           "startDayOfYear"                      "endDayOfYear"                       
[103] "year"                                "month"                               "day"                                
[106] "verbatimEventDate"                   "habitat"                             "samplingProtocol"                   
[109] "samplingEffort"                      "sampleSizeValue"                     "sampleSizeUnit"                     
[112] "fieldNotes"                          "eventRemarks"                        "locationID"                         
[115] "higherGeographyID"                   "higherGeography"                     "continent"                          
[118] "waterBody"                           "islandGroup"                         "island"                             
[121] "countryCode"                         "stateProvince"                       "county"                             
[124] "municipality"                        "locality"                            "verbatimLocality"                   
[127] "verbatimElevation"                   "verbatimDepth"                       "minimumDistanceAboveSurfaceInMeters"
[130] "maximumDistanceAboveSurfaceInMeters" "locationAccordingTo"                 "locationRemarks"                    
[133] "decimalLatitude"                     "decimalLongitude"                    "coordinateUncertaintyInMeters"      
[136] "coordinatePrecision"                 "pointRadiusSpatialFit"               "verbatimCoordinateSystem"           
[139] "verbatimSRS"                         "footprintWKT"                        "footprintSRS"                       
[142] "footprintSpatialFit"                 "georeferencedBy"                     "georeferencedDate"                  
[145] "georeferenceProtocol"                "georeferenceSources"                 "georeferenceVerificationStatus"     
[148] "georeferenceRemarks"                 "geologicalContextID"                 "earliestEonOrLowestEonothem"        
[151] "latestEonOrHighestEonothem"          "earliestEraOrLowestErathem"          "latestEraOrHighestErathem"          
[154] "earliestPeriodOrLowestSystem"        "latestPeriodOrHighestSystem"         "earliestEpochOrLowestSeries"        
[157] "latestEpochOrHighestSeries"          "earliestAgeOrLowestStage"            "latestAgeOrHighestStage"            
[160] "lowestBiostratigraphicZone"          "highestBiostratigraphicZone"         "lithostratigraphicTerms"            
[163] "group"                               "formation"                           "member"                             
[166] "bed"                                 "identificationID"                    "identificationQualifier"            
[169] "typeStatus"                          "identifiedBy"                        "dateIdentified"                     
[172] "identificationReferences"            "identificationVerificationStatus"    "identificationRemarks"              
[175] "taxonID"                             "scientificNameID"                    "acceptedNameUsageID"                
[178] "parentNameUsageID"                   "originalNameUsageID"                 "nameAccordingToID"                  
[181] "namePublishedInID"                   "taxonConceptID"                      "scientificName"                     
[184] "acceptedNameUsage"                   "parentNameUsage"                     "originalNameUsage"                  
[187] "nameAccordingTo"                     "namePublishedIn"                     "namePublishedInYear"                
[190] "higherClassification"                "kingdom"                             "phylum"                             
[193] "class"                               "order"                               "family"                             
[196] "genus"                               "subgenus"                            "specificEpithet"                    
[199] "infraspecificEpithet"                "taxonRank"                           "verbatimTaxonRank"                  
[202] "vernacularName"                      "nomenclaturalCode"                   "taxonomicStatus"                    
[205] "nomenclaturalStatus"                 "taxonRemarks"                        "datasetKey"                         
[208] "publishingCountry"                   "lastInterpreted"                     "elevation"                          
[211] "elevationAccuracy"                   "depth"                               "depthAccuracy"                      
[214] "distanceAboveSurface"                "distanceAboveSurfaceAccuracy"        "issue"                              
[217] "mediaType"                           "hasCoordinate"                       "hasGeospatialIssues"                
[220] "taxonKey"                            "acceptedTaxonKey"                    "kingdomKey"                         
[223] "phylumKey"                           "classKey"                            "orderKey"                           
[226] "familyKey"                           "genusKey"                            "subgenusKey"                        
[229] "speciesKey"                          "species"                             "genericName"                        
[232] "acceptedScientificName"              "verbatimScientificName"              "typifiedName"                       
[235] "protocol"                            "lastParsed"                          "lastCrawled"                        
[238] "repatriated"                         "relativeOrganismQuantity"            "recordedByID"                       
[241] "identifiedByID"                      "level0Gid"                           "level0Name"                         
[244] "level1Gid"                           "level1Name"                          "level2Gid"                          
[247] "level2Name"                          "level3Gid"                           "level3Name"                         
[250] "iucnRedListCategory"                 "associatedMedia"                     "country"                            
[253] "minimumElevationInMeters"            "maximumElevationInMeters"            "minimumDepthInMeters"               
[256] "maximumDepthInMeters"                "geodeticDatum"                       "verbatimCoordinates"                
[259] "verbatimLatitude"                    "verbatimLongitude"                   "scientificNameAuthorship"

Columns of the INCT data obtained with rspeciesLink()

[1] "record_id"                "modified"                 "institutionCode"          "collectionCode"          
 [5] "catalogNumber"            "basisOfRecord"            "kingdom"                  "family"                  
 [9] "genus"                    "specificEpithet"          "scientificName"           "scientificNameAuthorship"
[13] "identifiedBy"             "recordedBy"               "year"                     "month"                   
[17] "day"                      "country"                  "stateProvince"            "county"                  
[21] "locality"                 "decimalLongitude"         "decimalLatitude"          "verbatimLongitude"       
[25] "verbatimLatitude"         "minimumElevationInMeters" "occurrenceRemarks"        "barcode"                 
[29] "imagecode"                "recordNumber"             "maximumElevationInMeters" "infraspecificEpithet"    
[33] "typeStatus"               "coordinatePrecision"      "geoFlag"                  "phylum"                  
[37] "order"                    "yearIdentified"           "monthIdentified"          "individualCount"         
[41] "class"                    "dayIdentified"            "continentOcean"           "preparationType"         
[45] "previousCatalogNumber"    "relatedCatalogItem"       "fieldNumber"              "minimumDepthInMeters"    
[49] "maximumDepthInMeters"     "sex"

Thanks in advance.
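
The trace shows dplyr::bind_rows() refusing to combine a date-time column with a character column. A possible workaround, assuming formatDwc() does not need eventDate as a date-time (not verified here), is to convert the GBIF column to text before combining. The two data frames below are toy stand-ins for plants_gbif and plants_inct2.

```r
# Toy stand-ins for the two sources (column name taken from the error message).
gbif <- data.frame(eventDate = as.POSIXct("2019-05-01 00:00:00", tz = "UTC"))
splink <- data.frame(eventDate = "2019-05-01", stringsAsFactors = FALSE)

# Convert the date-time column to text so both columns share one type.
if (inherits(gbif$eventDate, "POSIXct")) {
  gbif$eventDate <- format(gbif$eventDate, "%Y-%m-%d")
}

both <- rbind(gbif, splink)
```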

Taxonomic cleaning

@saramortara @LimaRAF

  • 4.0 First editing (flag indets, class taxon rank, remove authors)
  • 4.1 Remove cf. and aff. (and create a column for the name modifiers?)
  • 4.2 Option to treat names at infraspecific level as names at species level (e.g. remove ‘var.’, ‘subsp.’, ‘forma’, etc.)
  • 4.3 Standardize the nomenclature (typos, synonyms and orthographic variants) -> Data editing or Data validation??
  • 4.4 Edit family names to APG (plantR function formatFamily: step currently done using the ‘flora’ and ‘Taxonstand’ packages)
  • 4.5 Include new functions in the workflow of the package tutorial
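
Step 4.1 can be sketched in base R as follows. The function name and the output columns are hypothetical and only illustrate the idea of removing "cf." and "aff." while keeping them in a separate column; this is not the package implementation.

```r
# Hypothetical helper: strip "cf."/"aff." and store the modifier separately.
split_modifier <- function(x) {
  pattern <- "\\b(cf|aff)\\.?\\s+"
  hit <- regexpr(pattern, x, ignore.case = TRUE, perl = TRUE)

  # Modifier found in each name, or NA when there is none.
  modifier <- rep(NA_character_, length(x))
  modifier[hit > 0] <- tolower(sub("\\.?\\s+$", "", regmatches(x, hit)))

  # Name with the modifier removed.
  clean <- trimws(sub(pattern, "", x, ignore.case = TRUE, perl = TRUE))
  data.frame(name = clean, modifier = modifier, stringsAsFactors = FALSE)
}

out <- split_modifier(c("Casearia cf. sylvestris",
                        "Ocotea aff. puberula",
                        "Euterpe edulis"))
```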

empty sections in .Rd

devtools::check() returns:

prepare_Rd: fixField.Rd:21-23: Dropping empty section \details
prepare_Rd: fixField.Rd:24-27: Dropping empty section \examples
prepare_Rd: formatCoord.Rd:12-14: Dropping empty section \description
checkRd: (5) formatCoord.Rd:0-15: Must have a \description
prepare_Rd: formatLoc.Rd:12-14: Dropping empty section \description
checkRd: (5) formatLoc.Rd:0-15: Must have a \description
prepare_Rd: missName.Rd:35-37: Dropping empty section \references
prepare_Rd: validateCoord.Rd:12-14: Dropping empty section \description
checkRd: (5) validateCoord.Rd:0-15: Must have a \description
prepare_Rd: validateTax.Rd:12-14: Dropping empty section \description
checkRd: (5) validateTax.Rd:0-15: Must have a \description
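
The "Must have a \description" notes mean the generated .Rd files have an empty description section. Assuming the documentation is written with roxygen2 (not stated in the report), the first paragraph of the roxygen block becomes the title and the second becomes the description. The function below is a made-up placeholder, not a plantR function.

```r
#' Trim and Collapse Whitespace
#'
#' Removes leading and trailing whitespace and collapses internal runs of
#' whitespace into a single space. This second paragraph is what fills the
#' description section of the generated .Rd file.
#'
#' @param x A character vector.
#' @return A character vector of the same length as `x`.
#' @examples
#' squish_text("  Serjania   sp.  ")
squish_text <- function(x) {
  trimws(gsub("\\s+", " ", x))
}
```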

checkCoord default keep.cols

If checkBorders() stops when geo.check, country.shape and country.gazetteer are not present, then checkCoord() must return these columns by default.
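
A quick way to see which of these columns an output lacks is sketched below. The helper name is hypothetical, and the data frame is a toy example rather than real checkCoord() output.

```r
# Columns named in the issue as required by checkBorders().
required_cols <- c("geo.check", "country.shape", "country.gazetteer")

# Hypothetical helper: report which required columns a data frame lacks.
missing_cols <- function(x, required) {
  setdiff(required, names(x))
}

# Toy output lacking one of the columns.
out <- data.frame(geo.check = "ok_county", country.shape = "brazil",
                  stringsAsFactors = FALSE)
lacking <- missing_cols(out, required_cols)
```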

Error in validateCoord()

Hi Renato,

I am having a recurring problem, possibly related to issue #79.

When I try to run validateCoord(), I get the error:

occ <- validateCoord(df, output = "new.col")

Error in `$<-.data.frame`(`*tmp*`, "geo.check", value = c("ok_state",  : 
  replacement has 2 rows, data has 3

Sometimes the error comes with ok_state, in other cases with ok_county. I imagine it is caused by some projection problem, where 2 locations are assigned to a single point. I tried adding sf::sf_use_s2(F) before validateCoord(), but that did not solve it.

I am using plantR version 0.1.5 and sf version 1.0-8.

I am sending you the rows from one of the spreadsheets that gives this error:
ErrorValidateCoord.csv
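
The message itself is what base R raises when a column with fewer values than rows is assigned to a data frame. The sketch below only reproduces that mechanism and shows a defensive pattern (matching on a row id); what actually drops rows inside validateCoord() is not known from the report, and the data are made up.

```r
# Three input records, identified by a row id.
df <- data.frame(id = 1:3, decimalLatitude = c(-27.6, -27.1, NA))

# Suppose an intermediate step returns results for only two of them (assumption).
checked <- data.frame(id = c(1, 2), geo.check = c("ok_state", "ok_county"),
                      stringsAsFactors = FALSE)

# Positional assignment fails with the same kind of message as in the report.
msg <- tryCatch({
  df$geo.check <- checked$geo.check
  "no error"
}, error = function(e) conditionMessage(e))

# Matching on the id keeps rows aligned and leaves NA where nothing came back.
df$geo.check <- checked$geo.check[match(df$id, checked$id)]
```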

Issue while preparing names with two family names

Trace back in which function the problem below occurs:

plantR::prepName("AFA Lira")
"Lira, A.F.A." # ok

plantR::prepName("AF Araujo Lira")
"Lira, A.A." # not ok

plantR::prepName("Ferreira dos Santos, André")
"Ferreira Dos, A." # not ok

FormatDwc_CoordinatePrecision!?

The Cardiospermum (Sapindaceae) data produced another error in the formatDwc() function, both with and without the option "drop = TRUE", which discards columns that are not congruent between the databases:

occs_splink <- rspeciesLink(filename = "Cardiospermum_teste_splink.txt", save = TRUE, basisOfRecord = 'PreservedSpecimen', species = "Cardiospermum")
occs_gbif <- rgbif2(filename = "Cardiospermum_teste_gbif.txt", species = "Cardiospermum", n.records = 110000, force = TRUE, save = TRUE)
occs <- formatDwc(splink_data = occs_splink, gbif_data = occs_gbif, drop = TRUE)
occs <- formatDwc(splink_data = occs_splink, gbif_data = occs_gbif)
Error: Can't combine gbif$coordinatePrecision and speciesLink$coordinatePrecision .
Run rlang::last_error() to see where the error occurred.
In addition: Warning messages:
1: some columns in splink_data do not follow the speciesLink pattern
2: some columns in gbif_data does not follow the gbif pattern!

I then ran the following code to see more details, and this appeared:

rlang::last_error()
<error/vctrs_error_incompatible_type>
Can't combine gbif$coordinatePrecision and speciesLink$coordinatePrecision .
Backtrace:

  1. plantR::formatDwc(splink_data = occs_splink, gbif_data = occs_gbif)
  2. dplyr::bind_rows(res_list, .id = "data_source")
  3. vctrs::vec_rbind(!!!dots, .names_to = .id)
  4. vctrs::vec_default_ptype2(...)
  5. vctrs::stop_incompatible_type(...)
  6. vctrs:::stop_incompatible(...)
  7. vctrs:::stop_vctrs(...)
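
A generic way to sidestep this kind of type clash before binding is sketched below. The helper is hypothetical and the column classes are assumed, since the report does not show them; it converts any shared column whose classes differ to character in both data frames, which also turns integer versus numeric pairs into text.

```r
# Hypothetical helper: where a shared column has different classes in the two
# data frames, convert both to character so they can be bound.
harmonise_classes <- function(a, b) {
  for (col in intersect(names(a), names(b))) {
    if (!identical(class(a[[col]]), class(b[[col]]))) {
      a[[col]] <- as.character(a[[col]])
      b[[col]] <- as.character(b[[col]])
    }
  }
  list(a = a, b = b)
}

# Toy data: the actual classes in the report are not shown, so these are assumed.
gbif <- data.frame(coordinatePrecision = 0.01, genus = "Cardiospermum",
                   stringsAsFactors = FALSE)
splink <- data.frame(coordinatePrecision = "10 m", genus = "Cardiospermum",
                     stringsAsFactors = FALSE)

fixed <- harmonise_classes(gbif, splink)
bound <- rbind(fixed$a, fixed$b)
```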

Summaries and checklist

Code for the package module on data and validation summaries and species checklist generation

  • function summData()
  • function summFlags()
  • function checklist()
  • function exportData()
  • Check documentation and examples of the functions

rspeciesLink does not return the field typeStatus

@saramortara

Not sure why the function rspeciesLink is not returning the field typeStatus, which contains the information on type specimens and is key for the taxonomic validation of the specimens. I know speciesLink provides this information, but it is not in the data object returned by the function. I tried setting the argument Typus = TRUE, but then the function reported an error ("Output is empty. Check your request.").

Any ideas on how to fix it?

Cheers!

Error in tax.check

Hi @LimaRAF and everyone!

I am assembling a database of Melastomataceae for the Quadrilátero Ferrífero of Minas Gerais from GBIF, speciesLink and BIEN occurrence records. I am using plantR to clean the records and the package has been working really well! But I ran into two small problems:
1) In the check of identification quality, some records without an identifier (recorded as s.n.) are being classified as high quality ("high"). I think these records should be in the "unknown" category;
2) The function validateDup() is returning the same error reported by WevertonBio:
Error in x1$col.year[ids] <- as.character(sapply(strsplit(x1$col.year[ids], : NAs are not allowed in subscripted assignments

This is the code I am using:

### Installation ####
# install.packages("remotes")
#require("remotes")
#install_github("LimaRAF/plantR")
require("plantR")
#devtools::install_github("bmaitner/RBIEN")
require("BIEN")


### Download occurrence data ###
#### Download occurrence data from BIEN ####
occ.BIEN <- BIEN_occurrence_state(country = "Brazil",
                                  state = "Minas Gerais",
                                  cultivated = F,
                                  new.world = T,
                                  all.taxonomy = T,
                                  native.status = F,
                                  natives.only = T,
                                  observation.type = T,
                                  political.boundaries = T,
                                  collection.info = T)

occ.BIEN <- occ.BIEN[occ.BIEN$scrubbed_family == "Melastomataceae", ]
dim(occ.BIEN)

### Download occurrence data from speciesLink ###
occ.splink <- rspeciesLink(basisOfRecord = "PreservedSpecimen",
                           family = "Melastomataceae",
                           country = "Brazil",
                           stateProvince = "Minas Gerais",
                           Scope = "plants",
                           Synonyms = "flora2020",
                           MaxRecords = 100000)
dim(occ.splink)

### Download occurrence data from GBIF ###
occ.gbif <- rgbif2(dir = "data/raw",
                   filename = "output.gbif",
                   species = "Melastomataceae",
                   country = "BR",
                   stateProvince = "Minas Gerais",
                   n.records = 100000,
                   force = T,
                   basisOfRecord = "PRESERVED_SPECIMEN")
dim(occ.gbif)


### Combine database using formatDwc() function ###
occs.all <- formatDwc(splink_data = occ.splink,
                      gbif_data = occ.gbif,
                      bien_data = occ.BIEN,
                      fix.encoding = c("splink_data", "gbif_data", "bien_data"),
                      drop = T, bind_data = T)
dim(occs.all)


### Data editing ###

#### Collection codes, people names, collector number and dates ####
occs.all.2 <- formatOcc(occs.all)
dim(occs.all.2)


#### Locality information ####
occs.all.3 <- formatLoc(occs.all.2)
dim(occs.all.3)


#### Geographical coordinates ####
occs.all.4 <- formatCoord(occs.all.3)
dim(occs.all.4)


#### Species and family names ####
occs.all.5 <- formatTax(occs.all.4, db = "bfo")
dim(occs.all.5)



### Data validation ####

#### Locality information
occs.all.6 <- validateLoc(occs.all.5)
dim(occs.all.6)


#### Geographical coordinates
occs.all.7 <- validateCoord(occs.all.6, output = "new.col")
dim(occs.all.7)


#### Species taxonomy and identification
occs.all.8 <- validateTax(occs.all.7, top.det = 200,
                          generalist = T)

high_s.n. <- occs.all.8[which(occs.all.8$tax.check == "high" &
                                occs.all.8$identifiedBy.new == "s.n."),]
dim(high_s.n.) # 8535 records classified as high identification quality, although they have no identifier.



#### Duplicate specimens
occs.all.9 <- validateDup(occs.all.8, merge = T, prop = 0.01,
                          ignore.miss = T, remove = T)

# Error in x1$col.year[ids] <- as.character(sapply(strsplit(x1$col.year[ids],  : 
# NAs are not allowed in subscripted assignments

Thank you very much once again!
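
Until the classification is changed in the package, one possible user-side workaround is to relabel those records after validateTax(). A minimal sketch on a toy data frame, assuming only the two column names used above (tax.check and identifiedBy.new); the object names and toy values are invented for illustration, and this is not how plantR itself assigns the classes:

```r
# Toy data standing in for the output of validateTax() (values are made up)
occs.toy <- data.frame(
  identifiedBy.new = c("s.n.", "Silva, J.", "s.n.", NA),
  tax.check = c("high", "high", "low", "unknown"),
  stringsAsFactors = FALSE
)

# %in% returns FALSE (never NA) for missing values, so it is safe for indexing
no.det <- occs.toy$identifiedBy.new %in% "s.n."
is.high <- occs.toy$tax.check %in% "high"

# Relabel only the records flagged as "high" that have no identifier
occs.toy$tax.check[no.det & is.high] <- "unknown"
table(occs.toy$tax.check)
```

Using %in% instead of == avoids the extra which() call, because comparisons with NA never propagate into the index.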

fixSpecies error

Without GBIF data:

Error in check$species_status[id_authors & !is.na(check$species_status)] <- paste(check$species_status[id_authors & :
NAs are not allowed in subscripted assignments
In addition: Warning messages:
1: In `[<-.factor`(`*tmp*`, prob.ids, value = c("Blechnum polypodioides f. maius", :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, !prob.ids, value = c("Telmatoblechnum serrulatum", :
invalid factor level, NA generated
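
The two warnings come from the factor replacement method, which suggests that the column holding the species names was a factor when it was edited. In base R, assigning a string that is not among the existing levels of a factor produces NA with exactly this warning, and those NAs may then be what breaks the later subscripted assignment. A minimal base-R sketch of that behaviour (the toy names and object names are illustrative; this is not the fixSpecies() code):

```r
spp <- factor(c("Blechnum serrulatum", "Blechnum polypodioides"))

# Assigning a value that is not an existing level yields NA
# (the "invalid factor level, NA generated" warning is silenced here)
spp.fac <- spp
suppressWarnings(spp.fac[1] <- "Telmatoblechnum serrulatum")
is.na(spp.fac[1])

# Converting to character first keeps the new value
spp.chr <- as.character(spp)
spp.chr[1] <- "Telmatoblechnum serrulatum"
spp.chr[1]
```

If that is the cause, converting the name column with as.character() (or reading the data with stringsAsFactors = FALSE) before calling fixSpecies() might avoid the error.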

Error in x1$col.year[ids] - validateDup

Hi Renato,

I am getting the following error when I try to use the validateDup function:

Error in x1$col.year[ids] <- as.character(sapply(strsplit(x1$col.year[ids], :
NAs não são permitidos em atribuições por subscritos

I found the part of the data frame where the error occurs (here: SpeciesOccurrences.csv), but I cannot understand why the error occurs.

Could you please help me to solve this error?
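
For what it is worth, base R raises this error whenever the index of a subscripted assignment contains NA and the replacement has more than one element. A minimal sketch that mimics the pattern shown in the error message; the toy data and the way `ids` is built here are assumptions, not the actual validateDup() code:

```r
x1 <- data.frame(col.year = c("1998-05", NA, "2001", "2003-11"),
                 stringsAsFactors = FALSE)

# Hypothetical logical index; an NA appears wherever a condition cannot be evaluated
ids <- c(TRUE, NA, FALSE, TRUE)

# Same shape as the failing line: the NA in 'ids' triggers the error
res <- tryCatch({
  x1$col.year[ids] <- as.character(sapply(strsplit(x1$col.year[ids], "-"), `[`, 1))
  "no error"
}, error = function(e) "error")
res

# Dropping the NAs from the index avoids it: which() keeps only the TRUE positions
ok <- which(ids)
x1$col.year[ok] <- as.character(sapply(strsplit(x1$col.year[ok], "-"), `[`, 1))
x1$col.year
```

If the same mechanism is at play here, rows with a missing value in the fields used to build the index (for example the collection year) would be a reasonable place to start looking in the attached records.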

Check gazetteer and map

  • Check ADM levels per country (number, names) table
  • Check map creation for missing lines and correct order
  • Check if the problem is the projection - no more
  • Check for missing polygons in GADM and flag manual changes in gazetteer
  • Check for mistakes along the way
  • Guyane instead of French Guiana in country.new?
  • The localities in French Guiana are weird. Some strings end with "guyane française" instead of counties (that string is not being removed)
  • Found guyana [guyana] in country.new

From previous issues:

  • In getAdmin() you say that Argentinean subdivisions are Province instead of State (OK) and Department instead of Municipality, but Department exists between the levels of Province and Municipality (and Municipality also exists). We should check this (for all countries, in fact).
  • I'm getting a lot of county-level check errors and thought they were due to differences between countries, but we are having these problems even for Brazil. I will split this problem into another issue.
    • RAFL: class 'check_local.2municip' should be high because all non-NA locality fields are taken as having an original resolution at locality level, and the majority of the localities are not in the gazetteer. 'check_municip2state' or 'check_municip2country' are more problematic. They can be due to typos or to problems in the structure or completeness of the gazetteer. I would create a new issue only about this as well.

checkCoord error

I am trying to use this function exactly as described in the tutorial and I am getting an error.
I had no errors in the previous steps, only at this point, with this function.
I installed the package less than a day ago, so I believe it is up to date.

occs <- checkCoord(occs,
                   keep.cols = c("geo.check", "NAME_0", "country.gazet"))

Error in s2_geography_from_wkb(x, oriented = oriented, check = check) :
Evaluation error: Found 1 feature with invalid spherical geometry.
[194] Loop 2 edge 8 crosses loop 3 edge 0.
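
The message comes from s2_geography_from_wkb(), that is, from the spherical geometry (s2) engine that sf uses by default since version 1.0, and it complains about an invalid polygon in one of the maps. A workaround often used for this kind of error is to switch s2 off around the call. A hedged sketch, assuming that checkCoord() relies on sf for the spatial overlay (suggested by the error message, but not verified here); the checkCoord() call is left commented out because it needs the user's data:

```r
# Assumption: the 'sf' package (>= 1.0) is installed; the guard lets the
# sketch run (and do nothing) when it is not.
if (requireNamespace("sf", quietly = TRUE)) {
  # sf_use_s2(FALSE) switches to planar geometry operations and
  # returns the previous setting, which is stored to restore it later
  old.s2 <- sf::sf_use_s2(FALSE)

  # occs <- checkCoord(occs,
  #                    keep.cols = c("geo.check", "NAME_0", "country.gazet"))

  sf::sf_use_s2(old.s2)  # restore the original setting
  s2.restored <- identical(sf::sf_use_s2(), old.s2)
} else {
  s2.restored <- NA
}
s2.restored
```

Repairing the offending polygon with sf::st_make_valid() is the other common route, but that would have to be done on the package maps themselves.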
