This package has been archived. The former README is now in README-not.
ropensci-archive / scrubr
:warning: ARCHIVED :warning: Clean species occurrence records
License: Other
The very fact that this exists is (1) awesome and (2) going in a grant application I'm working on.
Ideas:
via #30
I've run into a strange bug where, when specifying tolerance for dedup(), the number of rows returned is greater than the number of rows in the original dataset:
dim(iris) # 150 rows
dim(iris %>% dedup()) # 149 rows
dim(iris %>% dedup(tolerance = 0)) # 11067 rows
dim(iris %>% dedup(tolerance = 0.2)) # 9156 rows
dim(iris %>% dedup(tolerance = 0.4)) # 4627 rows
dim(iris %>% dedup(tolerance = 0.6)) # 2640 rows
dim(iris %>% dedup(tolerance = 0.8)) # 431 rows
dim(iris %>% dedup(tolerance = 1)) # 150 rows
These additional rows are exact duplicates & can be removed with distinct(), but it seems to be unintended behavior.
Use cases are outlined in the README at the top: https://github.com/sckott/cleanoccs#cleanoccs
There have got to be others:
Would be cleaner to just save as attributes and re-assign, perhaps.
found via http://www.gbif.org/newsroom/uses/2016-gueta-et-al
Some cleaning tasks done (Table 1):
repair these points
remove these points
from https://doi.org/10.1093/jxb/erw451
Occurrence data for each taxon were downloaded from the Global Biodiversity Information Facility (GBIF, http://www.gbif.org) using the RGBIF package in R (Chamberlain et al., 2016; data accessed 1 and 2 July 2016). Occurrence data for the Zambezian C3-C4 within Alloteropsis semialata were taken from Lundgren et al. (2015, 2016). All occurrence data were cleaned by removing any anomalous latitude or longitude points, points falling outside of a landmass, and any points close to GBIF headquarters in Copenhagen, Denmark, which may result from erroneous geolocation. To avoid repeated occurrences, latitude and longitude decimal degree values were rounded to two decimal places, and any duplicates at this resolution were removed. These filters are commonly applied to data extracted from GBIF (Zanne et al., 2014).
any ideas?
Please correct before 2020-04-09 to safely retain your package on CRAN.
This apparently needs PROJ 7 to reproduce.
Failing on two platforms:
https://www.r-project.org/nosvn/R.check/r-patched-linux-x86_64/scrubr-00check.html
https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-gcc/scrubr-00check.html
continuing on from #18 - see comments
I.e., wrapping something like sf::st_within (or whatever the WKT tool is, if that's better) in order to allow filtering observations within a given polygon. One might need to do that when getting data from APIs that only accept a bounding box or circle as a filter.
Filtering out herbaria/museum locations:
I see it's on your roadmap/todo but it's not implemented yet, right?
This is a highly visible, important problem for me. I know it's perhaps not easy to solve(?), but just to say (and I'm sure you know): this is a major problem.
e.g. from an analysis of 4,828,341 GBIF records from 89,180 plant species one can see a clearly visible peak in the 'median-latitude-per-species' histogram at around -33.875. Why? Turns out there are 19,773 records across 2,757 species that all have a latitude of exactly "-33.875" and 19,706 (>99.6%) of these records come from the PRECIS database provided by SANBI.
TL;DR: records with a latitude of exactly "-33.875" from the PRECIS database should be viewed with extreme skepticism. I am working on finding/identifying more such institution/database cases, and I'm happy to supply more data on (plant-related) cases I find if that would be helpful. I'm starting from scratch / zero prior knowledge here, though. There might already be a good list of such known cases?
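One hedged sketch of how such cases could be handled once identified: keep a lookup table of suspect coordinates and flag matches. The helper name is hypothetical; the single PRECIS/SANBI entry comes from the numbers above.

# hypothetical lookup of institution/database coordinates to treat as suspect
suspect_coords <- data.frame(
  source = "PRECIS/SANBI",
  lat = -33.875
)

flag_institution <- function(df, lookup = suspect_coords, digits = 3) {
  # flag rows whose rounded latitude matches a known suspect value
  df$suspect <- round(df$latitude, digits) %in% round(lookup$lat, digits)
  df
}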
Perhaps the maps package's world.cities dataset:
head(world.cities)
#> name country.etc pop lat long capital
#> 1 'Abasan al-Jadidah Palestine 5629 31.31 34.34 0
#> 2 'Abasan al-Kabirah Palestine 18999 31.32 34.35 0
#> 3 'Abdul Hakim Pakistan 47788 30.55 72.11 0
#> 4 'Abdullah-as-Salam Kuwait 21817 29.36 47.98 0
#> 5 'Abud Palestine 2456 32.03 35.07 0
#> 6 'Abwein Palestine 3434 32.03 35.20 0
NROW(world.cities)
#> [1] 43645
It has lots of cities; we could use that.
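A quick sketch of that idea (the tolerance and helper name are made up): flag occurrences sitting suspiciously close to a known city's coordinates.

library(maps)
data(world.cities)

# TRUE if any city lies within tol degrees of the point
near_city <- function(lat, long, tol = 0.01) {
  any(abs(world.cities$lat - lat) < tol & abs(world.cities$long - long) < tol)
}

occs <- data.frame(latitude = c(31.31, 10.123), longitude = c(34.34, -60.456))
occs$near_city <- mapply(near_city, occs$latitude, occs$longitude)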
The idea is to clean coordinates that are imprecise; e.g., -117, 35 and -117.000, 35.000 are not reliable, so remove/flag those. We'd have to make some rules; one possible rule is sketched below.
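A sketch with hypothetical helpers. It assumes coordinates arrive as character strings, since numeric -117.000 parses identically to -117 and the precision signal is lost.

# flag points with no decimal point, too few decimal digits,
# or all-zero decimals (e.g., -117.000)
decimals <- function(x) sub(".*\\.", "", x)
has_point <- function(x) grepl("\\.", x)

imprecise <- function(x, min_digits = 2) {
  !has_point(x) |
    nchar(decimals(x)) < min_digits |
    grepl("^0+$", decimals(x))
}

occs <- data.frame(
  longitude = c("-117", "-117.000", "-117.3451"),
  latitude = c("35", "35.000", "35.0218")
)
occs$flag <- imprecise(occs$longitude) | imprecise(occs$latitude)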
It looks like the tests in this package are sensitive to the order of the attributes() in a tibble object, which has changed in tibble 3.1.4. Can you please confirm? testthat::expect_equal() and testthat::expect_identical() handle this case automatically, but must be applied on the entire object.
testthat::expect_equal(
attributes(structure(list(), a = 1, b = 2)),
attributes(structure(list(), b = 2, a = 1))
)
#> Error: attributes(structure(list(), a = 1, b = 2)) not equal to attributes(structure(list(), b = 2, a = 1)).
#> Names: 2 string mismatches
#> Component 1: Mean relative difference: 1
#> Component 2: Mean relative difference: 0.5
testthat::expect_equal(
structure(list(), a = 1, b = 2),
structure(list(), b = 2, a = 1)
)
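# (passes: as noted above, expect_equal() handles the attribute-order
# difference when applied to the whole objects)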
Created on 2021-08-15 by the reprex package (v2.0.1)
I have a first draft function for this https://github.com/sckott/cleanoccs/blob/master/R/dedup.R
I use a particular method: concatenate each row (all columns) into a single string, then compare all rows against each other (all pairwise combinations), and allow the user to set a tolerance, e.g., 0.9, where all strings with a similarity score of 0.9 or greater will be dropped.
I haven't thought about this deeply. What are better ways to do this? I imagine we could allow multiple options as well.
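A rough sketch of that approach (not the draft linked above): concatenate rows, score pairwise similarity via base R's adist(), and drop later rows at or above the tolerance.

dedup_sketch <- function(df, tolerance = 0.9) {
  strings <- apply(df, 1, paste, collapse = "")
  # similarity = 1 - edit distance normalized by the longer string
  len <- nchar(strings)
  sim <- 1 - adist(strings) / pmax(outer(len, len, pmax), 1)
  drop <- rep(FALSE, length(strings))
  for (i in seq_along(strings)) {
    if (drop[i]) next
    later <- seq_along(strings) > i
    drop[later] <- drop[later] | sim[i, later] >= tolerance
  }
  df[!drop, , drop = FALSE]
}

nrow(dedup_sketch(iris))

The all-pairs comparison is O(n^2), so it will slow down quickly on large datasets.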
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Suggested-packages
Perhaps use or add more data sets that ship with the package for the else case when rgbif is not installed, as sketched below.
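The usual pattern for a Suggests dependency (a sketch; sample_data_1 stands in for whichever bundled data set is used):

if (requireNamespace("rgbif", quietly = TRUE)) {
  dat <- rgbif::occ_data(scientificName = "Ursus americanus", limit = 50)$data
} else {
  dat <- sample_data_1  # a data set shipped with the package
}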
Seems like this package could quickly get very large if it tries to consider every use case.
Any tasks that are not straightforward filtering of data - those that remove bad data, impossible data, missing data, etc.
Filtering based on any columns in the data - this is not really cleaning per se, and can easily be done via data.table, dplyr, etc.
Do these make sense?
res <- sample_data_4
dframe(res) %>% coord_within(country = "Egypt")
returns no results
Quite slow right now - profiling with waxwing data - more soon
Hello!
I'm not able to filter more than one ecoregion in the eco_region function.
I tried to use:
tmp <- eco_region(dframe(OcurrencesOK3), dataset = "meow", region = c('ECO_CODE_X:73', 'ECO_CODE_X:74', 'ECO_CODE_X:75'))
But it only filters the first ecoregion; how do I get it to filter all the selected ones?
Thank you!
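One possible workaround (an untested sketch): call eco_region() once per code and bind the results.

codes <- c('ECO_CODE_X:73', 'ECO_CODE_X:74', 'ECO_CODE_X:75')
res <- do.call(rbind, lapply(codes, function(code) {
  eco_region(dframe(OcurrencesOK3), dataset = "meow", region = code)
}))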
From #2
In fact, it probably makes sense as the default to keep one of a group of duplicates, and drop the others.
New param; or maybe change the drop param from a logical to a character/numeric vector choosing whether to keep one of each group of dups, drop all, drop none, or drop N (i.e., a number of things to drop). A possible shape is sketched below.
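A sketch of those semantics on exact duplicates only (function name and behavior hypothetical, not in scrubr):

dedup_keep <- function(x, drop = c("one", "all", "none")) {
  drop <- match.arg(drop)
  in_group <- duplicated(x) | duplicated(x, fromLast = TRUE)
  switch(drop,
    one  = x[!duplicated(x), , drop = FALSE],  # keep one record per group
    all  = x[!in_group, , drop = FALSE],       # drop every record in a group
    none = { x$dup <- in_group; x }            # flag duplicates, drop nothing
  )
}

nrow(dedup_keep(iris, "one"))  # 149: iris has one exact duplicate row
nrow(dedup_keep(iris, "all"))  # 148: both copies of the duplicate dropped

The drop-N variant would additionally need a per-group id and a per-group count of rows to remove.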
Weird bug: doesn't happen when the package is loaded, only when running check.
> attr(df_inc, "coord_incomplete")
<clean dataset>
Size: 194 X 6
Error in rep(".", i) : invalid 'times' argument
Calls: <Anonymous> ... print.clean_df -> trunc_mat_ -> vapply -> FUN -> paste