ropensci-archive / scrubr

:warning: ARCHIVED :warning: Clean species occurrence records

License: Other

R 73.84% Python 24.92% Makefile 1.24%
biodiversity data-cleaning gbif rstats r spocc r-package

scrubr's Introduction

Project Status: Abandoned

This package has been archived. The former README is now in README-not.

scrubr's People

Contributors

maelle, sckott, vamsikrishna97


scrubr's Issues

Just a heads up

The very fact that this exists is (1) awesome and (2) going in a grant application I'm working on.

👍

taxonomy based cleaning

Ideas:

  • Dups - find and optionally drop
  • Records (see the sketch after this list):
    • w/o names
    • w/ only a genus name
    • w/o a genus, an epithet, or both (i.e., sometimes a family name, for example, is given in the taxon name field mixed in with other ranks)
  • Check that each name is valid against a user-specified source - could be very slow, though
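As a rough illustration of the "records without a full binomial" items (a sketch only, not scrubr's API; the occs data.frame and its name column are hypothetical):

# flag records whose taxon name lacks a genus + epithet pair
# (`occs` and its `name` column are hypothetical)
has_binomial <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  vapply(parts, function(p) length(p) >= 2, logical(1))
}

# drop records with missing names or genus-only names
occs_named <- occs[!is.na(occs$name) & has_binomial(occs$name), ]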

When tolerance is specified, dedup() returns a larger dataset than the original, with many exact duplicates

I've run into a strange bug where, when specifying tolerance for dedup(), the number of rows returned is greater than the number of rows in the original dataset:

dim(iris) # 150 rows
dim(iris %>% dedup()) # 149 rows
dim(iris %>% dedup(tolerance = 0)) # 11067 rows
dim(iris %>% dedup(tolerance = 0.2)) # 9156 rows
dim(iris %>% dedup(tolerance = 0.4)) # 4627 rows
dim(iris %>% dedup(tolerance = 0.6)) # 2640 rows
dim(iris %>% dedup(tolerance = 0.8)) # 431 rows
dim(iris %>% dedup(tolerance = 1)) # 150 rows

These additional rows are exact duplicates & can be removed with distinct(), but it seems to be unintended behavior.

More use cases?

Use cases are outlined in the README at the top: https://github.com/sckott/cleanoccs#cleanoccs

There have got to be others:

  • round coordinates to a user-defined number of decimal places (from #5 )
  • taxonomy based cleaning use cases:
    • collapse duplicates based on the lowest level name (e.g., species)
    • collapse duplicates based on a higher level taxonomic name (e.g., collapse all species within a family)
    • drop duplicates based on name
    • drop any records without names
  • date based cleaning (see the sketch after this list)
    • drop records without dates
    • drop records whose dates cannot be coerced to a real date
    • standardize dates
    • create date from other fields
  • ...
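A base-R sketch of the date items above, assuming a hypothetical occs data.frame with a date column in ISO-like text form:

# coerce raw date strings to Date; anything that can't be parsed becomes NA
parsed <- as.Date(occs$date, format = "%Y-%m-%d")

# drop records with missing dates or dates that can't be coerced
occs <- occs[!is.na(parsed), ]

# standardize the remaining dates to a single format
occs$date <- format(parsed[!is.na(parsed)], "%Y-%m-%d")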

use case: data cleaning paper using GBIF data

found via http://www.gbif.org/newsroom/uses/2016-gueta-et-al

https://www.researchgate.net/profile/Yohay_Carmel/publication/303833067_Quantifying_the_value_of_user-level_data_cleaning_for_big_data_A_case_study_using_mammal_distribution_models/links/57bf37ea08aeb95224d1039d.pdf

Some of the cleaning tasks done (their Table 1); a rough sketch of two of them follows the lists below.

repair these points

  • Wrong coordinate systems - I assume this means UTM vs. decimal degrees
  • Coordinates as strings - not sure what this means
  • Switched Longitude & Latitude
  • Numerical sign confusion

remove these points

  • Records with missing or non-Australian coordinate
  • Coordinates exactly in center of Australia
  • Longitude & Latitude precision less than 3 digits
  • Records collected before the year 1990
  • Records with unknown year
  • Species name was not recognized by ALA
  • Insufficient taxon rank identification
  • Domesticated species
  • Extinct species
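A rough sketch (not the paper's code, nor scrubr's) of two of the repair tasks for Australian data, assuming hypothetical latitude/longitude columns in occs:

# repair swapped latitude/longitude: Australia spans roughly lat -44..-10 and lon 112..154,
# so a (lat, lon) pair landing in (112..154, -44..-10) was probably entered reversed
swapped <- occs$latitude >= 112 & occs$latitude <= 154 &
  occs$longitude >= -44 & occs$longitude <= -10
occs[swapped, c("latitude", "longitude")] <- occs[swapped, c("longitude", "latitude")]

# repair sign confusion: Australian latitudes should be negative
flip <- occs$latitude > 0 & occs$longitude >= 112 & occs$longitude <= 154
occs$latitude[flip] <- -occs$latitude[flip]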

use case: real-world example of data cleaning with GBIF data

from https://doi.org/10.1093/jxb/erw451

Occurrence data for each taxon were downloaded from the Global Biodiversity Information Facility (GBIF, http://www.gbif.org) using the RGBIF package in R (Chamberlain et al., 2016; data accessed 1 and 2 July 2016). Occurrence data for the Zambezian C3–C4 within Alloteropsis semialata were taken from Lundgren et al. (2015, 2016). All occurrence data were cleaned by removing any anomalous latitude or longitude points, points falling outside of a landmass, and any points close to GBIF headquarters in Copenhagen, Denmark, which may result from erroneous geolocation. To avoid repeated occurrences, latitude and longitude decimal degree values were rounded to two decimal places, and any duplicates at this resolution were removed. These filters are commonly applied to data extracted from GBIF (Zanne et al., 2014).
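The rounding-and-deduplication step described there is only a couple of lines in base R; the column names below follow GBIF's decimalLatitude/decimalLongitude convention, but treat them (and occs) as assumptions:

# round coordinates to two decimal places, then drop duplicates at that resolution
occs$decimalLatitude  <- round(occs$decimalLatitude, 2)
occs$decimalLongitude <- round(occs$decimalLongitude, 2)
occs <- occs[!duplicated(occs[, c("species", "decimalLatitude", "decimalLongitude")]), ]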

Extend coord_within for use with any polygon?

I.e., wrapping something like sf::st_within (or whatever the WKT tool is, if that's better) in order to allow filtering observations within a given polygon.

One might need to do that when getting data from APIs that only accept a bounding box or circle as a filter.
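A minimal sketch of that idea with sf, assuming occs has hypothetical longitude/latitude columns and poly is a single-feature sf polygon:

library(sf)

# promote occurrence records to point geometries (WGS84)
pts <- st_as_sf(occs, coords = c("longitude", "latitude"), crs = 4326)

# keep only the points that fall within the supplied polygon
inside <- st_within(pts, poly, sparse = FALSE)[, 1]
occs_within <- occs[inside, ]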

Filtering out Herbaria/Museum locations

I see filtering out herbaria/museum locations is on your roadmap/todo, but it's not implemented yet, right?

This is a highly visible, important problem for me. I know it's perhaps not easy to solve(?) but just to say (and I'm sure you know) this is a major problem.

e.g. from an analysis of 4,828,341 GBIF records from 89,180 plant species one can see a clearly visible peak in the 'median-latitude-per-species' histogram at around -33.875. Why? Turns out there are 19,773 records across 2,757 species that all have a latitude of exactly "-33.875" and 19,706 (>99.6%) of these records come from the PRECIS database provided by SANBI.

TL;DR: records with a latitude of exactly "-33.875" from the PRECIS database should be viewed with extreme cynicism. I am working on finding/identifying more such institution/database cases, and am happy to supply more data on (plant-related) cases I find if that'd be helpful. I'm starting from scratch / zero prior knowledge here, though. There might already be a good list of such known cases?

[Rplot: histogram of median latitude per species, showing the peak at -33.875]
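In the meantime, a hedged sketch of flagging the one known suspect value mentioned above (occs and its columns are hypothetical):

# known suspect coordinates, e.g. institution/database default values
suspect_lats <- c(PRECIS_SANBI = -33.875)

# flag (rather than drop) records matching a suspect latitude to 3 decimal places
occs$suspect_coord <- round(occs$latitude, 3) %in% round(suspect_lats, 3)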

Removing data based on political centroids

Perhaps the maps package's world.cities dataset

head(world.cities)
#>                 name country.etc   pop   lat  long capital
#> 1 'Abasan al-Jadidah   Palestine  5629 31.31 34.34       0
#> 2 'Abasan al-Kabirah   Palestine 18999 31.32 34.35       0
#> 3       'Abdul Hakim    Pakistan 47788 30.55 72.11       0
#> 4 'Abdullah-as-Salam      Kuwait 21817 29.36 47.98       0
#> 5              'Abud   Palestine  2456 32.03 35.07       0
#> 6            'Abwein   Palestine  3434 32.03 35.20       0

NROW(world.cities)
#> [1] 43645

It has lots of cities; we could use that.
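A rough sketch of using world.cities that way, flagging records whose coordinates round to the same location as a city (the occs columns are hypothetical):

library(maps)
data(world.cities)

# flag records that share coordinates with a city, to 2 decimal places
city_key <- paste(round(world.cities$lat, 2), round(world.cities$long, 2))
occ_key  <- paste(round(occs$latitude, 2), round(occs$longitude, 2))
occs$near_city <- occ_key %in% city_key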

imprecise coords

The idea is to flag or remove coordinates that are imprecise.

e.g., -117, 35 and -117.000, 35.000 are not reliable, so remove/flag those

We'd have to make some rules; one possible rule is sketched below.
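For example (a sketch only; occs and its columns are hypothetical, and note trailing zeros are already lost once a value is stored as numeric):

# count decimal places in the printed representation of each coordinate
n_decimals <- function(x) {
  nchar(sub("^[^.]*\\.?", "", as.character(x)))
}

# flag coordinates reported with fewer than 2 decimal places (-117 and -117.000 both fail)
occs$imprecise <- n_decimals(occs$latitude) < 2 | n_decimals(occs$longitude) < 2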

Test failure with the development version of tibble

It looks like the tests in this package are sensitive to the order of the attributes() in a tibble object, which has changed in tibble 3.1.4. Can you please confirm?

testthat::expect_equal() and testthat::expect_identical() handle this case automatically, but must be applied on the entire object.

testthat::expect_equal(
  attributes(structure(list(), a = 1, b = 2)),
  attributes(structure(list(), b = 2, a = 1))
)
#> Error: attributes(structure(list(), a = 1, b = 2)) not equal to attributes(structure(list(), b = 2, a = 1)).
#> Names: 2 string mismatches
#> Component 1: Mean relative difference: 1
#> Component 2: Mean relative difference: 0.5
testthat::expect_equal(
  structure(list(), a = 1, b = 2),
  structure(list(), b = 2, a = 1)
)

Created on 2021-08-15 by the reprex package (v2.0.1)

Deduplication

I have a first draft function for this https://github.com/sckott/cleanoccs/blob/master/R/dedup.R

I use a particular method: concatenate each row (all columns) into a single string, then compare all rows against each other (all pairwise combinations), and allow the user to set a tolerance, e.g., 0.9, where all strings with a similarity score of 0.9 or greater are dropped.

I haven't thought about this deeply. What are better ways to do this? I imagine we could allow multiple options as well.
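A condensed sketch of that approach using base R's adist() as a stand-in similarity metric (this is not the actual dedup() implementation):

dedup_sketch <- function(df, tolerance = 0.9) {
  # collapse each row (all columns) into a single string
  rowstr <- apply(df, 1, paste, collapse = " ")

  # pairwise edit distance, normalized into a 0-1 similarity score
  d   <- adist(rowstr)
  sim <- 1 - d / pmax(nchar(rowstr)[row(d)], nchar(rowstr)[col(d)])

  # drop any row that is at least `tolerance` similar to an earlier row
  dup <- vapply(seq_along(rowstr), function(i) {
    i > 1 && any(sim[i, seq_len(i - 1)] >= tolerance)
  }, logical(1))
  df[!dup, , drop = FALSE]
}

Even this sketch is quadratic in the number of rows, which is presumably part of why the separate "Speed up dedup()" issue exists.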

In vs. out of scope

Seems like this package could quickly get very large if it tries to consider every use case.

What's in scope?

Any tasks that are not straightforward filtering of data - those that remove bad data, impossible data, missing data, etc.

Out of scope?

Filtering based on any columns in the data - this is not really cleaning per se, and can easily be done via data.table, dplyr, etc.
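For example, plain column-based filtering like the following would stay out of scope, since dplyr/data.table already handle it (the column name is hypothetical):

dplyr::filter(occs, year >= 1990)  # ordinary subsetting, not cleaning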

Do these make sense?

Speed up dedup()

Quite slow right now - profiling with waxwing data - more soon

How to filter more than one ecoregion in the eco_region function

Hello!
I'm not able to filter more than one ecoregion in the eco_region function.
I tried to use:
tmp <- eco_region(dframe(OcurrencesOK3), dataset = "meow", region = c('ECO_CODE_X:73','ECO_CODE_X:74','ECO_CODE_X:75'))

But it only filters the first ecoregion. How do I get it to filter all the selected ones?

Thank you!
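Until that's resolved, one untested workaround sketch (reusing the call from the question) is to filter one ecoregion code at a time and bind the results; whether rbind() preserves scrubr's attributes is an open question:

codes <- c("ECO_CODE_X:73", "ECO_CODE_X:74", "ECO_CODE_X:75")

# run eco_region() once per code, then stack the pieces
pieces <- lapply(codes, function(x) {
  eco_region(dframe(OcurrencesOK3), dataset = "meow", region = x)
})
tmp <- do.call(rbind, pieces)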

dedup() should optionally keep one of a group of dups

From #2

In fact, it probably makes sense as the default to keep one of a group of duplicates, and drop the others.

New param: maybe change the drop param from a logical to a character/numeric vector, choosing whether to keep one of each group of dups, drop all, drop none, or drop N (i.e., a number of records to drop).

Print bug

Weird bug; it doesn't happen when the package is loaded, only when running check.

> attr(df_inc, "coord_incomplete")
<clean dataset>
Size: 194 X 6

Error in rep(".", i) : invalid 'times' argument
Calls: <Anonymous> ... print.clean_df -> trunc_mat_ -> vapply -> FUN -> paste
