I am running plantR's standardization and validation functions on occurrence data for species of the tribe Paullinieae, and errors came up in formatTax() and validateCoord().
The script that reproduces the errors is below. It downloads all occurrence records for the six genera of the tribe. This takes a while (a little over 200 thousand records), but it makes the errors more likely to appear.
formatTax() is the less problematic of the two because we already have a list of standardized species names and no longer need to run it. My goal is only to report the error the function raises with this data set.
The error is: Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 182344, 188400
In validateCoord(), the error varies across the different tests (please see the end of the script).
### Clean the environment ####
rm(list=ls())
### Installation ####
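if(!require("remotes")) install.packages("remotes")
if(!require("plantR")) remotes::install_github("LimaRAF/plantR")
if(!require("BIEN")) install.packages("BIEN")
if(!require("rgbif")) install.packages("rgbif")
if(!require("stringr")) install.packages("stringr")
## require() does not attach a package that was just installed, so attach explicitly
library(plantR)
library(BIEN)
library(stringr)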
### Download occurrence data ####
#### Download occurrence data from BIEN ####
occ.BIEN <- BIEN_occurrence_genus(genus = c("Cardiospermum",
                                            "Lophostigma",
                                            "Paullinia",
                                            "Serjania",
                                            "Thinouia",
                                            "Urvillea"),
                                  cultivated = FALSE,
                                  only.new.world = FALSE,
                                  all.taxonomy = FALSE,
                                  native.status = FALSE,
                                  natives.only = TRUE,
                                  observation.type = TRUE,
                                  political.boundaries = TRUE,
                                  collection.info = TRUE)
dim(occ.BIEN) # 33415 24
## Standardization of character encoding
for (i in seq_len(ncol(occ.BIEN))) {
  if (is.character(occ.BIEN[,i])) {
    Encoding(occ.BIEN[,i]) <- "UTF-8"
  }
}
#### Download occurrence data from speciesLink ####
occ.splink <- rspeciesLink(basisOfRecord = "PreservedSpecimen",
                           family = "Sapindaceae",
                           species = c("Cardiospermum",
                                       "Lophostigma",
                                       "Paullinia",
                                       "Serjania",
                                       "Thinouia",
                                       "Urvillea"),
                           Scope = "plants",
                           Synonyms = "species2000",
                           MaxRecords = 300000)
dim(occ.splink) # 62291 49
## Standardization of character encoding
c.right <- c("À", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë",
"Ì", "Î", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø",
"Ù", "Ú", "Û", "Ü", "Þ", "ß", "á", "â", "ã", "ä", "å",
"æ", "ç", "è", "é", "ê", "ë", "ì", "î", "ï", "ð", "ñ", "ò",
"ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ", "í")
c.wrong <- c("À", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É",
"Ê", "Ë", "Ì", "Î", "Ñ", "Ò", "Ó", "Ô",
"Õ", "Ö", "×", "Ø", "Ù", "Ú", "Û", "Ü", "Þ", "ß",
"á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê",
"ë", "ì", "î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ",
"ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ", "Ã.")
for (i in seq_len(ncol(occ.splink))) {
  if (is.character(occ.splink[,i])) {
    Encoding(occ.splink[,i]) <- "UTF-8"
    for (j in seq_along(c.right)) {
      occ.splink[,i] <- str_replace_all(occ.splink[,i],
                                        pattern = c.wrong[j],
                                        replacement = c.right[j])
    }
  }
}
#### Download occurrence data from GBIF ####
occ.gbif <- rgbif2(dir = "data/plantR",
                   filename = "output.gbif",
                   species = c("Thinouia Triana & Planch.",
                               "Lophostigma Radlk.",
                               "Cardiospermum L.",
                               "Paullinia L.",
                               "Serjania Mill.",
                               "Urvillea Kunth"),
                   n.records = 300000,
                   force = TRUE,
                   basisOfRecord = "PRESERVED_SPECIMEN")
dim(occ.gbif) # 134568 206
## Standardization of character encoding
for (i in seq_len(ncol(occ.gbif))) {
  if (is.character(occ.gbif[,i])) {
    Encoding(occ.gbif[,i]) <- "UTF-8"
  }
}
### Combine different databases ####
#### Formatting BIEN database before running the formatDwc() function ####
## Separate "date_collected" into year, month and day
date.parts <- strsplit(as.character(occ.BIEN$date_collected), "-")
occ.BIEN[,25] <- sapply(date.parts, function(x) x[1])
occ.BIEN[,26] <- sapply(date.parts, function(x) x[2])
occ.BIEN[,27] <- sapply(date.parts, function(x) x[3])
colnames(occ.BIEN)[25:27] <- c("year", "month", "day")
## Prepare other required columns
occ.BIEN[,28] <- occ.BIEN$county
occ.BIEN[,29:30] <- NA
occ.BIEN[,31] <- "Sapindaceae"
colnames(occ.BIEN)[28:31] <- c("municipality", "typeStatus", "scientificNameAuthorship", "family")
## Standardize column names
colnames(occ.BIEN)[colnames(occ.BIEN) == "collection_code"] <- "collectionCode"
colnames(occ.BIEN)[colnames(occ.BIEN) == "catalog_number"] <- "catalogNumber"
colnames(occ.BIEN)[colnames(occ.BIEN) == "record_number"] <- "recordNumber"
colnames(occ.BIEN)[colnames(occ.BIEN) == "recorded_by"] <- "recordedBy"
colnames(occ.BIEN)[colnames(occ.BIEN) == "state_province"] <- "stateProvince"
colnames(occ.BIEN)[colnames(occ.BIEN) == "latitude"] <- "decimalLatitude"
colnames(occ.BIEN)[colnames(occ.BIEN) == "longitude"] <- "decimalLongitude"
colnames(occ.BIEN)[colnames(occ.BIEN) == "identified_by"] <- "identifiedBy"
colnames(occ.BIEN)[colnames(occ.BIEN) == "date_identified"] <- "dateIdentified"
colnames(occ.BIEN)[colnames(occ.BIEN) == "scrubbed_species_binomial"] <- "scientificName"
colnames(occ.BIEN)[colnames(occ.BIEN) == "custodial_institution_codes"] <- "institutionCode"
colnames(occ.BIEN)[colnames(occ.BIEN) == "X.U.FEFF.scrubbed_genus"] <- "scrubbed_genus"
occ.BIEN$dateIdentified <- as.character(occ.BIEN$dateIdentified)
#### Combine databases using the formatDwc() function ####
occs.all <- formatDwc(user_data = occ.BIEN,
splink_data = occ.splink,
gbif_data = occ.gbif,
drop = T, bind_data = T)
dim(occs.all) # 230274 47 records
### Data editing ####
#### Collection codes, people names, collector number and dates ####
## Formatting strings before running the formatOcc() function to avoid this error:
# Error in gsub(x, "", y, fixed = TRUE) : zero-length pattern
occs.all$recordNumber[which(occs.all$recordNumber == "938[=Diary No. 707]")] <- NA
occs.all$verbatimEventDate[which(occs.all$verbatimEventDate == "Sept. 4-'77")] <- NA
occs.all$verbatimEventDate[which(occs.all$verbatimEventDate == "label says \"1841/XIV\"")] <- NA
occs.all$recordedBy[which(occs.all$recordedBy == "M. Nadruz; ,J.F. Baumgratz, M. Bovini, D.S.P. Silva")] <- "M. Nadruz, J.F. Baumgratz, M. Bovini, D.S.P. Silva"
## Replace "," with ";" to separate the names of collectors and identifiers
## Case 1: "M. A. Costa, J. Ribeiro, E. C. Pereira"
## recordedBy
pos.c2semic <- which(sapply(str_locate_all(pattern = "\\.", occs.all$recordedBy), function(x) (x[1])) == 2 &
!is.na(sapply(str_locate_all(pattern = "\\,", occs.all$recordedBy), function(x) (x[1]))))
occs.all$recordedBy[pos.c2semic] <- str_replace_all(occs.all$recordedBy[pos.c2semic], "\\,", "\\;")
## identifiedBy
pos.c2semic.I <- which(sapply(str_locate_all(pattern = "\\.", occs.all$identifiedBy), function(x) (x[1])) == 2 &
!is.na(sapply(str_locate_all(pattern = "\\,", occs.all$identifiedBy), function(x) (x[1]))))
occs.all$identifiedBy[pos.c2semic.I] <- str_replace_all(occs.all$identifiedBy[pos.c2semic.I], "\\,", "\\;")
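## ---- Illustration of the Case 1 rule (a sketch, base R only) ----
## The selections above pick names whose first "." is at position 2 (an initial
## such as "M.") and that contain at least one ",". The toy helper below applies
## the same rule to a plain character vector so it can be checked on a few strings.
## `comma_to_semicolon` is a name made up for this sketch; it is not part of
## plantR or stringr.
comma_to_semicolon <- function(x) {
  first.dot <- as.integer(regexpr(".", x, fixed = TRUE)) # -1 if no ".", NA if x is NA
  has.comma <- grepl(",", x, fixed = TRUE)               # FALSE if x is NA
  pos <- which(first.dot == 2 & has.comma)               # which() drops NA
  x[pos] <- gsub(",", ";", x[pos], fixed = TRUE)
  x
}
## Example call (only the first string matches the rule; "Costa, M. A." and NA
## are returned unchanged):
# comma_to_semicolon(c("M. A. Costa, J. Ribeiro, E. C. Pereira", "Costa, M. A.", NA))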
## Replace "|" with " | "
## Case 2: "M. A. Costa|J. Ribeiro|E. C. Pereira"
occs.all$recordedBy <- str_replace_all(occs.all$recordedBy, "\\|", " | ")
occs.all$identifiedBy <- str_replace_all(occs.all$identifiedBy, "\\|", " | ")
occs.all.2 <- formatOcc(occs.all)
#### Locality information ####
occs.all.3 <- formatLoc(occs.all.2)
#### Geographical coordinates ####
occs.all.4 <- formatCoord(occs.all.3)
#### Species and family names ####
occs.all.5 <- formatTax(occs.all.4, db = "tpl")
## Error in data.frame(..., check.names = FALSE) :
# arguments imply differing number of rows: 182344, 188400
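## ---- Optional diagnostic (a sketch, not part of plantR) ----
## To help narrow down which records trigger the error above, the helper below
## runs a function on consecutive row chunks and reports the chunks that raise
## an error. `find_failing_chunks` and `chunk.size` are names made up for this
## sketch. If the error depends on the data set as a whole rather than on
## particular records, no single chunk may fail.
find_failing_chunks <- function(df, fun, chunk.size = 10000, ...) {
  stopifnot(nrow(df) > 0, chunk.size > 0)
  starts <- seq(1, nrow(df), by = chunk.size)
  res <- lapply(starts, function(s) {
    rows <- s:min(s + chunk.size - 1, nrow(df))
    # Keep the error message if the call fails, NA otherwise
    msg <- tryCatch({
      fun(df[rows, , drop = FALSE], ...)
      NA_character_
    }, error = function(e) conditionMessage(e))
    data.frame(first.row = s, last.row = max(rows), error = msg,
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, res)
  out[!is.na(out$error), , drop = FALSE]
}
## Possible usage with the objects from this script:
# find_failing_chunks(occs.all.4, formatTax, chunk.size = 10000, db = "tpl")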
#### Locality information
occs.all.6 <- validateLoc(occs.all.4)
#### Geographical coordinates
## Test 1
occs.all.7 <- validateCoord(occs.all.6, output = "new.col")
## Error in s2_geography_from_wkb(x, oriented = oriented, check = check) :
# Evaluation error: Found 1 feature with invalid spherical geometry.
# [194] Loop 2 edge 8 crosses loop 3 edge 0.
## Test 2
sf::sf_use_s2(FALSE)
occs.all.7 <- validateCoord(occs.all.6, output = "new.col")
## Error in `$<-.data.frame`(`*tmp*`, "geo.check", value = c("ok_county", :
# replacement has 182344 rows, data has 182346
# In addition: Warning messages:
# 1: In geo.check[is.na(geo.check)] <- `*vtmp*` :
# number of items to replace is not a multiple of replacement length
# 2: In geo.check[is.na(geo.check)] <- tmp1[is.na(geo.check)] :
# number of items to replace is not a multiple of replacement length
## Test 3
sf::sf_use_s2(FALSE)
pos <- which(occs.all.6$recordedBy.new == "Busey, P." &
               occs.all.6$recordNumber.new == "422")
# These records seem to be causing the error:
# "Error in `$<-.data.frame`(`*tmp*`, "geo.check", value = c("ok_county",..."
# Drop them only if something matched: indexing with -integer(0) would return zero rows
occs.all.6.a <- if (length(pos) > 0) occs.all.6[-pos,] else occs.all.6
occs.all.7 <- validateCoord(occs.all.6.a, output = "new.col", tax.name = "scientificName")
## Error in robustbase::covMcd(df1[use_these, c("lon2", "lat2")], alpha = 1/2, :
# n == p+1 is too small sample size for MCD
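## ---- Optional diagnostic (a sketch, not part of plantR) ----
## The covMcd() call in the error above uses two columns (lon2, lat2), so p = 2
## and "n == p+1" points to a group with exactly 3 records. Assuming the outlier
## check runs per taxon on the names given in `tax.name` (an assumption, not
## checked against the plantR source), the helper below lists the taxa that have
## exactly `n` records with non-missing coordinates. `taxa_with_n_coords` and its
## default column names are assumptions made for this sketch; the edited
## coordinate columns may be the relevant ones instead.
taxa_with_n_coords <- function(df, n = 3, tax = "scientificName",
                               lon = "decimalLongitude",
                               lat = "decimalLatitude") {
  ok <- !is.na(df[[tax]]) & !is.na(df[[lon]]) & !is.na(df[[lat]])
  counts <- table(df[[tax]][ok])
  names(counts)[as.vector(counts) == n]
}
## Possible usage with the objects from this script:
# taxa_with_n_coords(occs.all.6.a)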