ropensci-archive / scrubr

:warning: ARCHIVED :warning: Clean species occurrence records

License: Other

R 73.84% Python 24.92% Makefile 1.24%
biodiversity data-cleaning gbif rstats r spocc r-package

scrubr's Introduction

Project Status: Abandoned

This package has been archived. The former README is now in README-not.

scrubr's People

Contributors

maelle, sckott, vamsikrishna97


scrubr's Issues

Just a heads up

The very fact that this exists is (1) awesome and (2) going in a grant application I'm working on.

👍

taxonomy based cleaning

Ideas:

  • Dups - find and optionally drop
  • Records (see the sketch after this list):
    • w/o names
    • w/ only a genus name
    • w/o a genus, an epithet, or both (i.e., sometimes a family name, for example, is given in the taxon name field mixed in with other ranks)
  • Check that each name is valid against a user-specified source - could be very slow, though
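As a rough illustration of the "records without a full binomial" items (a sketch only, not scrubr's API; the occs data.frame and its name column are hypothetical):

# flag records whose taxon name lacks a genus + epithet pair
# (`occs` and its `name` column are hypothetical)
has_binomial <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  vapply(parts, function(p) length(p) >= 2, logical(1))
}

# drop records with missing names or genus-only names
occs_named <- occs[!is.na(occs$name) & has_binomial(occs$name), ]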

When tolerance is specified, dedup() returns a larger dataset than the original, with many exact duplicates

I've run into a strange bug where, when specifying tolerance for dedup(), the number of rows returned is greater than the number of rows in the original dataset:

dim(iris) # 150 rows
dim(iris %>% dedup()) # 149 rows
dim(iris %>% dedup(tolerance = 0)) # 11067 rows
dim(iris %>% dedup(tolerance = 0.2)) # 9156 rows
dim(iris %>% dedup(tolerance = 0.4)) # 4627 rows
dim(iris %>% dedup(tolerance = 0.6)) # 2640 rows
dim(iris %>% dedup(tolerance = 0.8)) # 431 rows
dim(iris %>% dedup(tolerance = 1)) # 150 rows

These additional rows are exact duplicates & can be removed with distinct(), but it seems to be unintended behavior.

More use cases?

Use cases are outlined in the README at the top: https://github.com/sckott/cleanoccs#cleanoccs

There have got to be others:

  • round coordinates to a user-defined number of decimal places (from #5 )
  • taxonomy based cleaning use cases:
    • collapse duplicates based on the lowest level name (e.g., species)
    • collapse duplicates based on a higher level taxonomic name (e.g., collapse all species within a family)
    • drop duplicates based on name
    • drop any records without names
  • date based cleaning (see the sketch after this list)
    • drop records without dates
    • drop records whose dates cannot be coerced to a real date
    • standardize dates
    • create date from other fields
  • ...
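A base-R sketch of the date items above, assuming a hypothetical occs data.frame with a date column in ISO-like text form:

# coerce raw date strings to Date; anything that can't be parsed becomes NA
parsed <- as.Date(occs$date, format = "%Y-%m-%d")

# drop records with missing dates or dates that can't be coerced
occs <- occs[!is.na(parsed), ]

# standardize the remaining dates to a single format
occs$date <- format(parsed[!is.na(parsed)], "%Y-%m-%d")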

use case: data cleaning paper using GBIF data

found via http://www.gbif.org/newsroom/uses/2016-gueta-et-al

https://www.researchgate.net/profile/Yohay_Carmel/publication/303833067_Quantifying_the_value_of_user-level_data_cleaning_for_big_data_A_case_study_using_mammal_distribution_models/links/57bf37ea08aeb95224d1039d.pdf

Some of the cleaning tasks done (their Table 1); a rough sketch of two of them follows the lists below.

repair these points

  • Wrong coordinate systems - I assume this means UTM vs. decimal degrees
  • Coordinates as strings - not sure what this means
  • Switched Longitude & Latitude
  • Numerical sign confusion

remove these points

  • Records with missing or non-Australian coordinate
  • Coordinates exactly in center of Australia
  • Longitude & Latitude precision less than 3 digits
  • Records collected before the year 1990
  • Records with unknown year
  • Species name was not recognized by ALA
  • Insufficient taxon rank identification
  • Domesticated species
  • Extinct species
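A rough sketch (not the paper's code, nor scrubr's) of two of the repair tasks for Australian data, assuming hypothetical latitude/longitude columns in occs:

# repair swapped latitude/longitude: Australia spans roughly lat -44..-10 and lon 112..154,
# so a (lat, lon) pair landing in (112..154, -44..-10) was probably entered reversed
swapped <- occs$latitude >= 112 & occs$latitude <= 154 &
  occs$longitude >= -44 & occs$longitude <= -10
occs[swapped, c("latitude", "longitude")] <- occs[swapped, c("longitude", "latitude")]

# repair sign confusion: Australian latitudes should be negative
flip <- occs$latitude > 0 & occs$longitude >= 112 & occs$longitude <= 154
occs$latitude[flip] <- -occs$latitude[flip]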

use case: real-world example of data cleaning with GBIF data

from https://doi.org/10.1093/jxb/erw451

Occurrence data for each taxon were downloaded from the Global Biodiversity Information Facility (GBIF, http://www.gbif.org) using the RGBIF package in R (Chamberlain et al., 2016; data accessed 1 and 2 July 2016). Occurrence data for the Zambezian C3–C4 within Alloteropsis semialata were taken from Lundgren et al. (2015, 2016). All occurrence data were cleaned by removing any anomalous latitude or longitude points, points falling outside of a landmass, and any points close to GBIF headquarters in Copenhagen, Denmark, which may result from erroneous geolocation. To avoid repeated occurrences, latitude and longitude decimal degree values were rounded to two decimal places, and any duplicates at this resolution were removed. These filters are commonly applied to data extracted from GBIF (Zanne et al., 2014).
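The rounding-and-deduplication step described there is only a couple of lines in base R; the column names below follow GBIF's decimalLatitude/decimalLongitude convention, but treat them (and occs) as assumptions:

# round coordinates to two decimal places, then drop duplicates at that resolution
occs$decimalLatitude  <- round(occs$decimalLatitude, 2)
occs$decimalLongitude <- round(occs$decimalLongitude, 2)
occs <- occs[!duplicated(occs[, c("species", "decimalLatitude", "decimalLongitude")]), ]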

Extend coord_within for use with any polygon?

I.e., wrapping something like sf::st_within (or whatever the WKT tool is, if that's better) in order to allow filtering observations within a given polygon.

One might need to do that when getting data from APIs that only accept a bounding box or circle as a filter.
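A minimal sketch of that idea with sf, assuming occs has hypothetical longitude/latitude columns and poly is a single-feature sf polygon:

library(sf)

# promote occurrence records to point geometries (WGS84)
pts <- st_as_sf(occs, coords = c("longitude", "latitude"), crs = 4326)

# keep only the points that fall within the supplied polygon
inside <- st_within(pts, poly, sparse = FALSE)[, 1]
occs_within <- occs[inside, ]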

Filtering out Herbaria/Museum locations

I see filtering out herbaria/museum locations is on your roadmap/todo, but it's not implemented yet, right?

This is a highly visible, important problem for me. I know it's perhaps not easy to solve(?) but just to say (and I'm sure you know) this is a major problem.

e.g. from an analysis of 4,828,341 GBIF records from 89,180 plant species one can see a clearly visible peak in the 'median-latitude-per-species' histogram at around -33.875. Why? Turns out there are 19,773 records across 2,757 species that all have a latitude of exactly "-33.875" and 19,706 (>99.6%) of these records come from the PRECIS database provided by SANBI.

TL;DR: records with a latitude of exactly "-33.875" from the PRECIS database should be viewed with extreme cynicism. I am working on finding/identifying more such institution/database cases, and am happy to supply more data on (plant-related) cases I find if that'd be helpful. I'm starting from scratch / zero prior knowledge here, though. There might already be a good list of such known cases?

[Rplot: histogram of median latitude per species, showing the peak at -33.875]
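In the meantime, a hedged sketch of flagging the one known suspect value mentioned above (occs and its columns are hypothetical):

# known suspect coordinates, e.g. institution/database default values
suspect_lats <- c(PRECIS_SANBI = -33.875)

# flag (rather than drop) records matching a suspect latitude to 3 decimal places
occs$suspect_coord <- round(occs$latitude, 3) %in% round(suspect_lats, 3)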

Removing data based on political centroids

Perhaps the maps package's world.cities dataset

head(world.cities)
#>                 name country.etc   pop   lat  long capital
#> 1 'Abasan al-Jadidah   Palestine  5629 31.31 34.34       0
#> 2 'Abasan al-Kabirah   Palestine 18999 31.32 34.35       0
#> 3       'Abdul Hakim    Pakistan 47788 30.55 72.11       0
#> 4 'Abdullah-as-Salam      Kuwait 21817 29.36 47.98       0
#> 5              'Abud   Palestine  2456 32.03 35.07       0
#> 6            'Abwein   Palestine  3434 32.03 35.20       0

NROW(world.cities)
#> [1] 43645

It has lots of cities; we could use that.
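A rough sketch of using world.cities that way, flagging records whose coordinates round to the same location as a city (the occs columns are hypothetical):

library(maps)
data(world.cities)

# flag records that share coordinates with a city, to 2 decimal places
city_key <- paste(round(world.cities$lat, 2), round(world.cities$long, 2))
occ_key  <- paste(round(occs$latitude, 2), round(occs$longitude, 2))
occs$near_city <- occ_key %in% city_key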

imprecise coords

The idea is to flag or remove coordinates that are imprecise.

e.g., -117, 35 and -117.000, 35.000 are not reliable, so remove/flag those

We'd have to make some rules; one possible rule is sketched below.
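For example (a sketch only; occs and its columns are hypothetical, and note trailing zeros are already lost once a value is stored as numeric):

# count decimal places in the printed representation of each coordinate
n_decimals <- function(x) {
  nchar(sub("^[^.]*\\.?", "", as.character(x)))
}

# flag coordinates reported with fewer than 2 decimal places (-117 and -117.000 both fail)
occs$imprecise <- n_decimals(occs$latitude) < 2 | n_decimals(occs$longitude) < 2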

Test failure with the development version of tibble

It looks like the tests in this package are sensitive to the order of the attributes() in a tibble object, which has changed in tibble 3.1.4. Can you please confirm?

testthat::expect_equal() and testthat::expect_identical() handle this case automatically, but must be applied on the entire object.

testthat::expect_equal(
  attributes(structure(list(), a = 1, b = 2)),
  attributes(structure(list(), b = 2, a = 1))
)
#> Error: attributes(structure(list(), a = 1, b = 2)) not equal to attributes(structure(list(), b = 2, a = 1)).
#> Names: 2 string mismatches
#> Component 1: Mean relative difference: 1
#> Component 2: Mean relative difference: 0.5
testthat::expect_equal(
  structure(list(), a = 1, b = 2),
  structure(list(), b = 2, a = 1)
)

Created on 2021-08-15 by the reprex package (v2.0.1)

Deduplication

I have a first draft function for this https://github.com/sckott/cleanoccs/blob/master/R/dedup.R

I use a particular method: concatenate each row (all columns) into a single string, then compare all rows against each other (all pairwise combinations), and allow the user to set a tolerance, e.g., 0.9, where all strings with a similarity score of 0.9 or greater are dropped.

I haven't thought about this deeply. What are better ways to do this? I imagine we could allow multiple options as well.
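A condensed sketch of that approach using base R's adist() as a stand-in similarity metric (this is not the actual dedup() implementation):

dedup_sketch <- function(df, tolerance = 0.9) {
  # collapse each row (all columns) into a single string
  rowstr <- apply(df, 1, paste, collapse = " ")

  # pairwise edit distance, normalized into a 0-1 similarity score
  d   <- adist(rowstr)
  sim <- 1 - d / pmax(nchar(rowstr)[row(d)], nchar(rowstr)[col(d)])

  # drop any row that is at least `tolerance` similar to an earlier row
  dup <- vapply(seq_along(rowstr), function(i) {
    i > 1 && any(sim[i, seq_len(i - 1)] >= tolerance)
  }, logical(1))
  df[!dup, , drop = FALSE]
}

Even this sketch is quadratic in the number of rows, which is presumably part of why the separate "Speed up dedup()" issue exists.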

In vs. out of scope

Seems like this package could quickly get very large if it tries to consider every use case.

What's in scope?

Any tasks that are not straightforward filtering of data - those that remove bad data, impossible data, missing data, etc.

Out of scope?

Filtering based on any columns in the data - this is not really cleaning per se, and can easily be done via data.table, dplyr, etc.
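For example, plain column-based filtering like the following would stay out of scope, since dplyr/data.table already handle it (the column name is hypothetical):

dplyr::filter(occs, year >= 1990)  # ordinary subsetting, not cleaning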

Do these make sense?

Speed up dedup()

Quite slow right now - profiling with waxwing data - more soon

How to filter more than one ecoregion in the eco_region function

Hello!
I'm not able to filter more than one ecoregion in the eco_region function.
I tried to use:
tmp <- eco_region(dframe(OcurrencesOK3), dataset = "meow", region = c('ECO_CODE_X:73','ECO_CODE_X:74','ECO_CODE_X:75'))

But it only filters the first ecoregion. How do I get it to filter all the selected ones?

Thank you!
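Until that's resolved, one untested workaround sketch (reusing the call from the question) is to filter one ecoregion code at a time and bind the results; whether rbind() preserves scrubr's attributes is an open question:

codes <- c("ECO_CODE_X:73", "ECO_CODE_X:74", "ECO_CODE_X:75")

# run eco_region() once per code, then stack the pieces
pieces <- lapply(codes, function(x) {
  eco_region(dframe(OcurrencesOK3), dataset = "meow", region = x)
})
tmp <- do.call(rbind, pieces)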

dedup() should optionally keep one of a group of dups

From #2

In fact, it probably makes sense as the default to keep one of a group of duplicates, and drop the others.

New param: maybe change the drop param from a logical to a character/numeric vector, choosing whether to keep one of each group of dups, drop all, drop none, or drop N (i.e., a number of records to drop).

Print bug

Weird bug; it doesn't happen when the package is loaded, only when running check.

> attr(df_inc, "coord_incomplete")
<clean dataset>
Size: 194 X 6

Error in rep(".", i) : invalid 'times' argument
Calls: <Anonymous> ... print.clean_df -> trunc_mat_ -> vapply -> FUN -> paste
