gss-cogs / csvcubed Goto Github PK

A CLI to build linked data cubes.

Home Page: https://gss-cogs.github.io/csvcubed-docs/external/

License: Apache License 2.0

Python 88.03% Gherkin 10.54% Dockerfile 0.15% Shell 0.18% PowerShell 0.24% Groovy 0.16% Makefile 0.05% Batchfile 0.06% CSS 0.03% HTML 0.56%

csvw linked-data rdf skos dcat qb csv cubes

csvcubed's Introduction

csvcubed

csvcubed project provides a command line tool which make it straightforward to turn a CSV into 5-star linked data (CSV-W)

By publishing 5-star linked data and leveraging open standards for data, we believe that we can help ensure that statistical data is discoverable, comparable and analysable by automated tools. We hope that this standards-based approach will unlock network effects which accelerate data analysis by making it easier to collate, compare and contrast data from different sources.

All our work depends on open standards; however it isn't just for open data. Share your data with the world or keep it private, the choice is yours.

Getting started immediately

Get going with csvcubed immediately by installing csvcubed using pip.

pip install csvcubed

From there you'll have access to the csvcubed command line tool which features sub commands build and inspect to create CSV-Ws from CSV and inspect CSV-Ws.

Become well acquainted to csvcubed with our quick start, which includes written instructions as well as transcribed videos.

User Documentation

csvcubed has extensive user documentation which tracks the release of csvcubed while it is in its beta phase. Our documentation can always be improved, so treat bad docs as a bug report.

Related Packages

Name	Description
csvcubed	The key library helping to transform tidy-data into qb-flavoured CSV-W cubes.
csvcubed-models	Models and RDF serialisation functionality required by the csvcubed and csvcubed-pmd packages.
csvcubed-pmd	Transforms a CSV-qb into RDF which is compatible with the Publish My Data platform.
csvcubed-devtools	Shared test functionality & dev dependencies which are commonly required.

Developer Documentation

More detailed developer documentation for this project can be found here.

How to report bugs

We welcome and appreciate bug reports. As we are trying to make this tool useful for all levels of experience, any level of bug or improvement helps others. To contribute to making csvcubed better, check out our bug reporting instructions.

csvcubed's People

Contributors

Stargazers

Watchers

Forkers

martincostello muazzamchaud

csvcubed's Issues

Prototype

Enforce validation of model input datatypes

This task requires that suitable validation of the abstract model's attributes is performed when the validate method is called on a Column/QbDataStructureDefinition model.

I found a useful looking tool called pydantic which allows you to validate the inputs to classes to ensure that they're consistent with the static type attributes defined.

Ideally, we'd be able to let the user do what they like to the model, and then only enforce this type validation when the validate method is called, but I'm not entirely sure pydantic supports this workflow - it seems to throw exceptions whenever the model's constructor is called. Have a deeper look at the pydantic library and see if we can use it in the way described.

If pydantic isn't suitable, we'll have to find some other tool or simply write the validation ourselves.

Mapping: sub property default range/codelist

Dimensions declared as rdfs:subPropertyOf have the implicit rdfs:range a subset (or equal) of their parent property.

When declaring the sub property in the DSD, the range and qb:codeList of the parent property can be used as a default, unless these are declared in the mapping.

Would want to agree an approach to setting the rdfs:range property.

Presets for common global dimensions

periods, geographies (could pass "year" or "financial_year" as an argument and get all the detailed filled in for us)

Allow user to specify date-time format in info.json 2.0.
Do we want to do something to help with geographies here?

Can we just have templates for columns that we can extend/override? (let's say you store a JSON file somewhere on the web with these templates)

Should third parties be able to define their own column templates (at some appropriate publically accessble URI)?

Ability to generate CMD output

Continue support for Mike's existing work on generating both PMD & CMD outputs from the same code.

Only some of the transformations we have need to go to CMD. We can use the tool internally without CMD-output options for a while yet.

Rmarkdown style guidance with code extracts walking me through the process

Guess column types from dataframe

Ability to guess whether a column is measure/dimension/attribute and present the user with the tool's guess at runtime, similar to how R's readr::read_csv or other CSV parsers work.

We can accept poor "guesses" at first - e.g., if a column is entirely numeric, assume it is a measure.

Report back to the user what the programme has guessed, like read_csv:

> readr::read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv")

#> ── Column specification ────────────────────────────────────────────
#> cols(
#>   John = col_character(),
#>   Doe = col_character(),
#>   `120 jefferson st.` = col_character(),
#>   Riverside = col_character(),
#>   NJ = col_character(),
#>   `08075` = col_character()
#> )

Minimum Viable Product

Human readable CSV output

This is the opposite of #568. The CSV-Ws we currently output are focused on being machine readable using notation (i.e. codes) instead of the more human-friendly labels.

This means that the code like E14000987 is serialised in the data set csv instead of the label Tatton. Both represent the same parliamentary consistency. In csvcubed 0.1.* only the notation is output for dimensions, measures, units, and non-literal attributes

There are two approaches to address human-readable CSV-Ws:

Provide an option to serialise a CSV-W where the label is in the output of the main data set csv, instead of its notation. e.g.

Constituency	Value	Measure	Unit
Tatton	1	Count of MPs	Number
South Holland and The Deepings	1	Count of MPs	Number

Provide an option to serialise a CSV-W where the label and notation are output as a pair in the main data set csv. e.g.

Constituancy Notation	Constituancy	Value	Measure	Unit
E14000987	Tatton	1	Count of MPs	Number
E14000939	South Holland and The Deepings	1	Count of MPs	Number

There are many implcations to both approaches mostly how to change how we approach the DSD and attach concept schemes to the main data set.

Front end/GUI service/wizardy thing for basic users

This epic contains all of the issues which need to be addressed by the features squad to help to build this new tool.

Generate a local codelist given a column of a dataframe

R interface to utility

R wrapper around python API (or, tool works in R, regardless of whether a wrapper or a port)

So users can make use of the tooling without having to learn python/switch environments

Recognise that it makes sense for us to do development in Python first and to make more progress in the short term with Python, but need to be conscious of R users.

Design starting point? Reticulate

Basic DCAT metadata inside the CSV-W Metadata

The DCAT metadata will hold the publisher/title/etc. of the dataset but won't directly link into the PMD catalogue and won't use any PMD-specific metadata.

info.json v2.0 - User can define metadata via file

We should support defining URIs using CURIES (i.e. with prefixes) i.e. qudt:sterling versus http://qudt.org/units/sterling - this would make it much easier for JSON schema suggestions to be readable.

Map scrapers to new common model

We could output a nice scraper.json file containing the metadata from this process -> feed into CLI which takes {info.json, scraper.json (optional) and the tidy-csv.csv} and outputs the CSV-W.

We probably want this functionality to output a scraper.json file which is fed into the CLI developed in Issue #108.

We should map the scraper to the CatalogMetadata model (or similar from csvwlib). This could then be directly passed to the Cube in a python script, or serialised to a scraper.json for use with the CLI in #108.

You should start making changes to the code in gss-utils.

Install the csv-qb package as a dependency in gss-utils
Alter the (gss-utils) scraper class with a new method which maps itself into a CatalogMetadata (in csvqb) model. ++ do some behave tests in gss-utils here.
Provide easy functionality to read/write the CatalogMetadata from/to JSON.
Augment the CLI from #108 so that it takes a metadata-overrides argument which specifies the location of the CatalogMetadata JSON file. It should load this data in and take it in preference to any configuration from the info.json. ++ alter the behave tests in the csvqb lib.

Ability to produce generic CSV-W

Would like to see csvwhelper form part of the functionality which writes CSVW from our tooling.

Instructions/specification of the shape I need to get my statistical data into

Need some discussion of the simple single-measure dataset approach (using virtual column measures + units)
Need discussion of more complex multi-measure data approach with one measure row.
- Provide some example code of using pandas melt to convert a pivoted measure dataset into the long-form that we need.
~~Talk about the case where you have multiple measures but the same unit (don't need a units column)~~

Make sure to link to the glossary for definitions of commonly used terms.

For the end of March MVP we’ll be pushing users towards initially shaping their data using the measure column & unit column approach to simplify the onboarding process (so they don’t have to learn the distinction between single-measure and multi-measure datasets too early). So whilst we currently support shapes A and B, we will only be actively promoting the use of shape B; shape A will be a specialised shape for use where people complain about the verbosity of shape B with single-measure datasets.

Ultimately, once we support shape C (which is effectively an extension of shape A) we will be able to shift to promoting that as the first shape of data users venture towards.

Package csvlint

"csvlint forms part of a pipeline and plays a role as part of a pipeline to check whether a CSV file which is produced is correct according to the specification.

Some debate whether this is a priority or not."

Additional qb/skos validation.

One-click deploy

One-click-deploy, just get the metadata into such a state (making assumptions along the way) that RDF can be generated and uploaded into PMD

Minimal config from user is to specify whether a column is measure/dimension/attribute.

Add CSV-W compliant arbitrary JSON-LD to CSV-W output

Advanced users may want to include arbitrary RDF which hasn't been baked into the data model - each time this happens we shouldn't require a change request.

Deploy public-facing packages to pypi

We can't make users pip install -e git+https://...

todo: Need to decide what our organisation's name on pypi should be - and which email address should we sign up under?

@canwaf knows all about how to name mailboxes, so he seems like the most suitable person to push this issue forwards. @JasonHowell may be required to support this too.

Investigate unit test frameworks

Think about what the csvwlib project needs to be able to get from a unit testing framework.
Investigate the popular testing unit testing frameworks in python.
Come up with a presentation/document listing the benefits and drawbacks of each tool.
Make a recommendation as to which tool we should adopt in csvwlib.

Note since this issue hasn't been looked into, @robons has started using unittest in the existing projects.

Express metadata declaratively (`info.json` compatible)

Able to express metadata declaritively (through json, yaml)

This will be backwards compatible with the existing info.json syntax, however you'll be able to add attributes, measures and units locally to the dataset.

Multi-chunk/slice outputs

Investigate use of reticulate to call python code

Ensure that we can move typical pandas dataframe between the two tools.
Find out whether we can use docstrings in python code to drive help text in R.
Will we have any problems deploying the package to CRAN? Is there an easy way to specify that the package has a dependency on a particular version of Python? Will the user have to install Python manually themselves or is the package manager capable of doing that for us?
Does reticulate cause any significant reduction in performance? Are there any limits where passing large dataframes causes memory/performance problems?

Scope out support pivoted multi-measure datasets

Presumably we'll start by supporting the measure-dimension approach to multi-measure datasets. It would be good if we could support users with measures/values split across different columns.

Not a strict "minimal" requirement whilst it's possible for users to transform their data into the long/thin measure-column approach.

We need to decide whether we're supporting either:

Users being able to input pivoted datasets which are then transformed into a CSV-W using the measure-dimension/unit column approach,
Or whether we'll support users inputting pivoted datasets which can also be output in a pivoted form (supporting this will be more useful when we get on to making more human-readable CSV-Ws)
- The pivoted-stuff.zip archive contains a small bit of previous work undertaken on this front.

@canwaf spoke with an academic who wanted to build CSV-Ws containing sensor data - holding data in the measure/unit column approach is very wasteful of resources (huge file sizes) so they want to be able to output CSV-Ws in a pivoted format. In this case we'd need to be able to accept pivoted data as an input too.

We would need the inspect command to work in this pivoted format too.

This issue is to investigate and make some reccomendations as to how we should approach this substantial problem.

Steer users to a decent list of units (qudt)

Given we're going with a declarative (maybe JSON) style syntax we need to provide some documentation linking users to the qudt definitions and providing them with easy access to a few key ones. We don't need to create any python objects to represent them at this point in time.

Include common ones like:

Standardised URIs

Having a template for each uri coined. @rossbowen to work with @ajtucker for an opinionated document for this.

Local
Family
Global

And all of the QB resources (components, properties, dsd, etc.)

Select then extend a global codelist (to make a local superset)

Need to agree with Swirrl how mixed codelists are implemented.

Related to #35 since it will give us a way of downloading the code-lists easily (without time-outs).

TODO: Need to think about how we define the mixing in the info.json v2 syntax (or do we do something outside of that?)

Output CSV-W QB from common model

New Jenkins Pipeline

We need a new pipeline targetted at the specific format of metadata that our new tooling will generate. We need to add in the PMD-specific metadata that won't be in the CSV-W anymore

We probably need two pipelines, one which builds the CSV-W as per our existing pipelines. We should also have another one which should accept the already generated CSV-W and a graph_base_uri which can just do the validation/transformation to RDF/upload to PMD.

Give warnings/errors when using methods which will produce invalid CSVW output

"Examples may be, if a user supplies a dataframe with columns which begin with a ""_"", that an error/warning would be raised as that will result in invalid CSVW. The principle would be that we would be trying to avoid invalidating the CSVW spec as we go, and then would also explicitly check for validity when

Other examples which don't exactly break the CSVW spec but would be of help - if an info.json contains references to columns which don't appear in the supplied dataframe, then a warning would inform the user.

If the info.json does not contain references to columns in the dataframe, an output explaining what assumptions about it would be useful.

I have changed this to a MVP deliverable. I would like the MVP design to incorporate the idea that we must validate the user's input and ensure a valid CSVW output.

Rob: Okay, then we'll need to discuss scope. I don't think it's possible or reasonable for us to be validating everything that the user could possibly be doing wrong for the MVP. We need a prioritised list of what we want users to be warned against doing."

Cube model (Abstract + qb)

Validate CSV-W metadata against QB/SKOS/DCAT without converting to RDF

Validate the QB/SKOS/DCAT metadata from the CSV-W without converting to RDF

Should be fine to write this tool in python since the number of triples is pretty small here.

Action: determine which of the constraints can be validated outside of SPARQL, decide which of these is in scope.
Some kind of user journey?
Could we run csvlint at the end? Is that too similar to what we have now (i.e., can be prone to error?)

Ability to define non-local (family/global) components

Ability to define measures/dimensions/attributes which are non-local to a dataset (e.g. family-level dimensions)

Publishers can define consistent/harmonised measures for use across the departments or x-gov

"As a data manager (in a department, not necessarily an ONS DM), I need a way to define measures/dimensions/attributes which I intend to use across multiple publications.

This is important to achieve linkage. The tooling will have the ability for publishers to adopt common URIs, but not have the ability to coin common URIs initially. This may be initially out of scope. This is a set of tools for data-manager types who would assist publishers within their departments. It is necessary, but currently out of scope."

We will also need the ability to publish code-lists which are independent of datasets.

This ticket is just to come up with a proposed schema syntax which you will present to the wider group (including DMs/DEs) to gain feedback on where were should go with the task next.

Consider:

Measures
Units
Attributes
Attribute values
Dimensions
Code Lists (Concept Schemes)

Make sure that the syntax is as close to identical to the qube-config.json column definition syntax as possible to lower confusion & mistakes.

Define measures/dimensions/attributes/units which are local to a dataset

We can only currently define dimensions which are local to datasets

"Users cannot be expected to know lots of different measures/dimensions etc. exist. We want them to be able to say what they do know, and when they don't know something then a sensible default is provided. We can direct them to the ""correct"" or ""better"" thing later.

URIs may be coined relative to a base which does not have to be http://statistics.gov.uk or http://gss-data.org.uk"

Should be invariant of target platform (i.e., we cannot assume users will always publish on our platform or on PMD)

Avoid adding unknown metadata

The web of data is built on the principle that we "don't know what we don't know". In practice, this means that we should avoid filling out data structures with temporary, null, placement or default values when we don't have the data. It also means that applications need to be able to gracefully handle not having all the data.

We need to apply the fix for when distributions don't have a date GSS-Cogs/gss-utils#51 and consider whether the default value for the issued date of a distribution should inherit from the dataset issued data.

Once that's done, we should remove the placement default values, https://github.com/GSS-Cogs/gss-utils/blob/5c1ba1d243b79e625361c6d70dc5c3fd197f8f2a/gssutils/scrapers/govscot.py#L30.

We also need to go through pipelines that use scraper.distributions[0] to use scraper.distribution(latest=True) instead.

Local tool to preview metadata & data

Once users have generated their CSV-W, how do they verify that they've added all of the correct metadata and haven't missed something? Is there some kind of UI or report we can create to allow them browse what they've created without having to read a metadata JSON file?

csvcubed inspect data-qube.csv-metadata.json

Read the CSV-W using RDFLib and print out to console information such as

Oh btw this file is a data qube!
Metadata (including title, description, etc)
DSD (including column name, type (subtype i.e. literal attribute)
Codelists
Head/tail of data in tabular format
Value_counts on units+measures (i.e. similar to pandas value_counts))

csvcubed inspect my-favourite-code-list.csv-metadata.json

Oh btw this is a code-list
You have X concepts
You have a hierarchy depth of 1 (or 20)
Here's the head/tail 10

Note:

RDFlib can only access the metadata files as JSON-LD but cannot access the underlying observaitons, codelists values
Pandas (or similar) will have to be used to access the contents of the various csv files created

User can set custom URIs

Ability for the user to provide their own URIs (overwriting defaults should they so wish).

This is focused on purely existing dimensions, opposed to establishing new namespaces.

Load model from CSV-W

Currently already loading from SKOS Codelists (Codelist reader class; however this doesn't address the qube's csvw itself.

Extract scrapers from gss-utils to new library

We want to think about how we break apart gss-utils at some point, but it probably isn't part of the csvwlib project.

Create core classes/code directly from W3C vocabs

The classes and attributes are declared in RDF vocabularies, e.g. CSV-W tabular data model namespace doc. We can create source code from these vocabs.

Confirm CSV-W output would convert to valid RDF

Users should be able to confirm whether the resulting RDF would be valid according to the SPARQL tests

Try to elminate SPARQL failures when loading in dataframes/specifying metadata without having to run SPARQL queries (e.g., throw an error if you try to specify a dimension which has blanks)

We'll need one feature to test single measure datasets and another one to test multi-measure datasets.

Python interface to utility (user-facing API)

Search for codelists

Search for codes and try to reconcile as a way of discovering codelists (see Ordnance Survey for inspo)

See https://github.com/etalab/csv-detective

Catalog of catalogs

Related to Swirrl.

Validate CSV-W observations against QB/SKOS/DCAT without converting to RDF

Validate the observations according to the QB/SKOS/DCAT specs

Will be run over all of the data in the CSV file so is python performant enough to run these checks? Should we have another tool in Scala?

Implementing the DCAT tests should be easy - we can simply lift the SPARQL queries we already have and run them in a python script against the JSON-LD CSV-W metadata document.

The SKOS & QB constraints might be a bit more difficult and will involve looking at the values defined inside the CSV itself (as well as looking at the metadata). So some of the tests can probably be brought over as SPARQL tests, but others will need to be re-written in python to go look at the CSV values.

Enforce Existing Code-list Foreign Key Constraints

Warn users if they try to assign a codelist but their dataset contains codes not in that codelist. Might not work for external codelists (or more work needed to get that to work), but if the codelist is held locally might be viable.

Local codelists to form part of MVP.

Could be via RDF representation of codelist, or from the codelist as a CSV.

Requires making request to PMD to check contents of code-list match what's in the local dataset (or alternatively specifying the local location for some ttl describing the existing code-list to allow the checks to be made).

We probably want to do this both in the python app - to provide some more helpful messages to the user, as well as adding the Foreign Key constraints into the CSV-W and letting csv-lint validate that they exist.

Contact details

DCAT 1 & 2 suggest that contact details should use VCARD, https://www.w3.org/TR/2014/REC-vocab-dcat-20140116/#Property:dataset_contactPoint and https://www.w3.org/TR/vocab-dcat-2/#Property:resource_contact_point

We're currently using mailto: URIs as both the dcat:contactPoint and the pmd:contactEmail, but should aim for https://www.w3.org/TR/vcard-rdf/#Examples

Need to figure out the support for this in PMD /cc @BillSwirrl .

Decide on Graph URI + Base URI conventions

Decide on conventional graph URI (where no override has been specified in the Jenkins Job)
Decide on the conventional base URI for documents (where no override has been specified in the Jenkins Job)

We need some government-owned domain that should provide the domain for both of these URIs instead of using gss-data.org.uk - we need something that's going to be long-lasting and not reliant on external contractors.

These URIs can redirect to a contractor/some other domain, but they need to be permanent identifiers which will always resolve.

Warning: If no decision is made by when we need this information, we'll start using PURL URLs under an organisation called idpd.