whip's Introduction

whip

Whip is a human- and machine-readable syntax to express specifications for data. It can be used as a whip to test how well data meets certain specifications, be it a feather 😅 or a chain whip 😱.

Example:

my_date_field:
  dateformat: ['%Y-%m-%d', '%Y-%m', '%Y'] # Needs to be ISO8601 format, but don't allow ranges
  mindate: 1830-01-01                     # No dates before 1830
  empty: True                             # Empty values are allowed

Documentation

Implementations

You can test whip specifications with pywhip.

Contributors

License

MIT License

whip's Issues

whip vs DQ IG Tests and assertions

Lee Belbin sent an email on March 3 to the TDWG Biodiversity Data Quality IG group regarding the work of WG2: Data Quality Tests and Assertions:

A select group of TDWGians (highlighted on the Members worksheet on the link below) have invested time to produce what I am calling a core suite of standard tests and associated assertions that can be applied to occurrence records. These tests are to help identify potential occurrence record issues.

Why are we doing this? Largely to try to better align Data Publishers/Data Aggregators/Biodiversity Research Infrastructures/Data Custodians, and hopefully anyone who generates occurrence data. Users would appreciate consistency. A practical example: Merging records from say GBIF and the ALA etc would be greatly facilitated if they both applied the same set of tests.

As a start and to keep it simple, these tests are based on one or more Darwin Core terms. We realize that tests could be applied to all Darwin Core terms, but we wanted a core set that would cover the significant terms that could be implemented relatively easily by all.

The spreadsheet can be found at https://tinyurl.com/h49zwof. Note that this spreadsheet contains a series of worksheets. Please start by reviewing the Principles as they provide a context for what has been learnt during the process.

I'm not familiar with this output, but @cgendreau @tucotuco you're both highlighted as contributing members for this. Would you care to explain the scope of these tests/assertions vs whip? How are the approaches different and what is the chance we're duplicating efforts?

redundant evaluation test `equals`

As the test equals can be expressed by a combination of the terms min and max, this evaluation test is rather redundant.

e.g.

equals: 10

is equivalent to this test description:

min: 10
max: 10

The same holds for the length test (a combination of minlength and maxlength).

It could be considered useful to keep this as a separate test out of convenience for the user. Still, when equals is considered, one could argue to include equalsdate and equalslength as well.

Therefore, I would leave equals out as a separate test option.
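The equivalence argued above can be checked with a small sketch. The `check_*` helpers are hypothetical illustrations, not pywhip functions, and they assume the values have already been parsed to numbers.

```python
def check_min(value, minimum):
    """Pass when the value is not below the minimum."""
    return value >= minimum


def check_max(value, maximum):
    """Pass when the value is not above the maximum."""
    return value <= maximum


def check_equals(value, expected):
    """The candidate `equals` test."""
    return value == expected


def check_min_max(value, bound):
    """`min` and `max` set to the same bound."""
    return check_min(value, bound) and check_max(value, bound)


samples = [9, 9.99, 10, 10.0, 10.01, 11]
# Both formulations should give the same verdict for every sample.
agree = all(check_equals(v, 10) == check_min_max(v, 10) for v in samples)
print(agree)
```

Since both formulations agree, `equals` would only be a convenience alias rather than a new capability.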

When does a value need to be quoted?

We should probably provide a recommendation on when values should be quoted:

license:
  allowed: http://creativecommons.org/publicdomain/zero/1.0/ # is this valid or does it need quotes?
informationWithheld:
  allowed: see metadata # is this valid or does this need quotes?
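As far as I know, both values above are valid unquoted YAML: a colon only ends a plain scalar when it is followed by a space, and `#` only starts a comment when it is preceded by a space. A recommendation could be based on a conservative heuristic such as the sketch below. `needs_quotes` is a hypothetical helper, not part of whip or pywhip, and it does not implement the full YAML grammar. It may recommend quotes in cases where they are not strictly required.

```python
import re

# Characters that can change the meaning of a scalar when they come first.
INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")

# Unquoted values that common YAML loaders resolve to null, booleans or numbers.
NON_STRING = re.compile(
    r"~|null|true|false|yes|no|on|off|[-+]?\d+(\.\d*)?|[-+]?\.\d+",
    re.IGNORECASE,
)


def needs_quotes(value):
    """Return True when quoting is the safe choice for a string value."""
    if value == "" or value != value.strip():
        return True  # empty, or leading/trailing whitespace would be lost
    if value[0] in INDICATORS:
        return True  # conservative: some of these are fine in practice
    if ": " in value or " #" in value or value.endswith(":"):
        return True  # would be read as a mapping or a comment
    return NON_STRING.fullmatch(value) is not None


for candidate in [
    "http://creativecommons.org/publicdomain/zero/1.0/",
    "see metadata",
    "3536",
    "note: see metadata",
]:
    print(candidate, needs_quotes(candidate))
```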

Look into swagger

A suggestion by @dshorthouse that I want to note here so we don't forget:

It also brings to mind the OpenAPI (aka Swagger) specification that, though designed for RESTful API documentation, can also be expressed as YAML, also with rules for ranges & arrays. See http://swagger.io/specification/

Feature request: Checks for multi-column data consistency

As suggested here https://twitter.com/LifeWatchINBO/status/1042363580107182080

You might have two columns "Country" and "Country Code" (not very well normalised, I know!) - and you might want to check that only one country code is present for each unique country? Or perhaps the number of unique Country+Country-code combinations should be the same as the number of unique countries and also the number of unique country-codes.

My original Twitter example was like this: "Site Name", "Latitude", "Longitude"
You might want to validate that for each site, only one lat/long combination exists.

We have other examples (we use ISA-Tab format a lot) where all rows containing a specific value in column C (for example) must have exactly the same values in all rightward columns (D,E,F, etc). It's the same concept as the lat/long example though.
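A check of this kind could look roughly like the sketch below. It is not an existing whip feature: `inconsistent_keys` is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import defaultdict


def inconsistent_keys(rows, key_field, dependent_fields):
    """Return keys that occur with more than one combination of dependent values."""
    seen = defaultdict(set)
    for row in rows:
        combination = tuple(row[field] for field in dependent_fields)
        seen[row[key_field]].add(combination)
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


rows = [
    {"Site Name": "A", "Latitude": "51.0", "Longitude": "4.0"},
    {"Site Name": "A", "Latitude": "51.0", "Longitude": "4.0"},
    {"Site Name": "B", "Latitude": "50.5", "Longitude": "3.2"},
    {"Site Name": "B", "Latitude": "50.6", "Longitude": "3.2"},
]

conflicts = inconsistent_keys(rows, "Site Name", ["Latitude", "Longitude"])
print(conflicts)  # {'B': [('50.5', '3.2'), ('50.6', '3.2')]}
```

The same function would cover the Country / Country Code case by passing `"Country"` as the key field and `["Country Code"]` as the dependent fields. Note that such a test needs the whole data set, unlike the current per-value tests.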

Great project, following with interest!

Negative values not covered with current syntax on `numberformat`

Currently, we have no information or guidelines about negative values for the numberformat. Options are:

  • users need to be explicit about negative values: mixed negative and non-negative values in a single column is probably
  • we just look at the absolute value of the incoming number (ignoring - sign)

Priority of the `empty` test validation

The empty test takes priority over all other tests, as it makes little sense to run any other test when the field is just an empty string. Hence, this test aborts the other tests on a value when an empty string is encountered. This simplifies the implementation, as the other tests do not have to take into account the possibility of getting an empty string as input value. It also reduces processing time (all empty values are evaluated faster with just a single test).

However, this has a major drawback as well:

When a conditional test (if) is added that includes a specification deciding when empty strings are allowed or not, this model runs into trouble. It requires adding a general empty test as well (empty is - sometimes - possible), and the priority of the empty test will stop the other tests, i.e. the if test is never started.

Therefore, we could decide not to give the empty test first priority. This should be balanced against:

  • all implemented tests will have to properly handle empty strings as input
  • redundant tests will be run against these empty strings
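The trade-off can be illustrated with a minimal sketch of both evaluation orders. The spec layout (a dict with Python callables as conditions) and the `validate` function are simplifications invented for this example and do not reflect how pywhip is implemented.

```python
def validate(value, spec, row, empty_first=True):
    """Return a list of error messages for a single value."""
    if empty_first and value == "":
        # Short-circuit: no other test, including `if`, is evaluated.
        return [] if spec.get("empty", False) else ["empty value not allowed"]

    errors = []
    for rule in spec.get("if", []):
        # Every conditional rule now has to handle the empty string itself.
        if rule["condition"](row) and value == "" and not rule.get("empty", False):
            errors.append("empty value not allowed when condition holds")
    if value == "" and not spec.get("empty", False):
        errors.append("empty value not allowed")
    return errors


# Empty is allowed in general, but not when the coordinate system is "UTM 1km".
spec = {
    "empty": True,
    "if": [
        {
            "condition": lambda row: row["verbatimCoordinateSystem"] == "UTM 1km",
            "empty": False,
        }
    ],
}
row = {"verbatimCoordinateSystem": "UTM 1km"}

print(validate("", spec, row, empty_first=True))   # [] (the problem goes unnoticed)
print(validate("", spec, row, empty_first=False))  # ['empty value not allowed when condition holds']
```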

Lookup values

Similar in nature to Issue #9: suppose you wanted to provide vocabulary resolution through a lookup feature? For example, for every three-letter country code, look up and use the two-letter equivalent.
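To make the request concrete, here is a minimal sketch of such a lookup. The `resolve` function and the small mapping are hypothetical, and how a lookup table would be referenced from a whip specification is left open.

```python
# A tiny excerpt of an ISO 3166-1 alpha-3 to alpha-2 mapping, for illustration.
ALPHA3_TO_ALPHA2 = {"BEL": "BE", "NLD": "NL", "FRA": "FR"}


def resolve(value, lookup):
    """Return (resolved_value, error); error is None when the lookup succeeds."""
    key = value.strip().upper()
    if key in lookup:
        return lookup[key], None
    return None, f"no lookup entry for {value!r}"


print(resolve("BEL", ALPHA3_TO_ALPHA2))  # ('BE', None)
print(resolve("XXX", ALPHA3_TO_ALPHA2))  # (None, "no lookup entry for 'XXX'")
```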

Usage of the word 'allowed'

If we look at the current specification types, only allowed is an adjective:

allowed
minlength
maxlength
stringformat
regex
min
max
numberformat
mindate
maxdate
dateformat

That being said, I don't have any great suggestion other than something like simply value or allowedvalue (which is not pretty as one word).

Please treat that as a question/idea/suggestion, I'm not a native English speaker.

Would a "required" spec be useful?

This would be a dataset-level specification: a list of fields that must be found in the input.

Related: what would the implementation expectation be for a specification that cannot be validated because the field is not in the input?
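A dataset-level check like this could run once against the header, before any per-value test. The sketch below assumes CSV input with a header row. `missing_fields` is a hypothetical helper and the field names are only examples.

```python
import csv
import io


def missing_fields(header, required):
    """Return the required fields that are absent from the input header."""
    present = set(header)
    return [field for field in required if field not in present]


data = io.StringIO("eventDate,license\n2018-01-01,CC0\n")
header = next(csv.reader(data))

missing = missing_fields(header, ["eventDate", "license", "country"])
print(missing)  # ['country']
```

The same result could inform the second question: a specification for a field reported as missing could either be skipped with a warning or be treated as an error.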

Questions about numberformat

  1. Does the format need to be a number, or is this also valid:
length:
  numberformat: .2   # test for "a.ab"
  2. Does the format need to be expressed in quotes?
length:
  numberformat: '.2'  # are the quotes necessary?
  3. Can the numberformat handle comma decimal points?
length:
  numberformat: ',2'  # test for "1,23"
  4. Can the numberformat handle more formats?
length:
  numberformat: '1,3.2'  # test for "1,000.00"

Handling integers inside schema definition

Consider the following example:

coordinateUncertaintyInMeters:
  empty: true
  if:
    - verbatimCoordinateSystem:
        allowed: UTM 1km
      numberformat: x
      allowed: 707
    - verbatimCoordinateSystem:
        allowed: UTM 5km
      numberformat: x
      allowed: 3536

Here the allowed values are sometimes integers (when parsing this with default YAML libraries). The question is how to handle this:

  • Always expect (when loading the YAML schema) that allowed values need to be strings, and convert them to strings after loading
  • Expect the user to be explicit about the string, by requiring quotes: allowed: '3536'
  • Raise a 'SchemaError' (or 'SpecificationError'), stating that allowed needs to be a string or a list, since the combination min/max is provided for integers. However, what about allowed: ['4', '27']?
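The first and third options can be compared in a short sketch. `normalize_allowed` and `SchemaError` are hypothetical names used only for illustration. The input is assumed to be the already loaded schema value, so an unquoted `707` arrives as a Python `int`.

```python
class SchemaError(ValueError):
    """Hypothetical error for a specification that cannot be interpreted."""


def normalize_allowed(allowed, strict=False):
    """Return `allowed` as a list of strings.

    strict=False: silently convert non-string values (first option).
    strict=True: reject non-string values with a SchemaError (third option).
    """
    items = allowed if isinstance(allowed, list) else [allowed]
    if strict:
        not_strings = [item for item in items if not isinstance(item, str)]
        if not_strings:
            raise SchemaError(f"allowed values must be quoted strings: {not_strings!r}")
    return [str(item) for item in items]


print(normalize_allowed(707))          # ['707']
print(normalize_allowed(["4", "27"]))  # ['4', '27']
print(normalize_allowed("UTM 1km"))    # ['UTM 1km']
```

One caveat of silent conversion is that the original notation is lost: a value written as `1.0` or `true` in the schema would be converted to `'1.0'` or `'True'`, which may not match the text in the data.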
