gss-cogs / csvcubed
A CLI to build linked data cubes.
Home Page: https://gss-cogs.github.io/csvcubed-docs/external/
License: Apache License 2.0
Related to Swirrl.
We need a government-owned domain to provide the domain for both of these URIs instead of gss-data.org.uk; it needs to be long-lasting and not reliant on external contractors.
These URIs can redirect to a contractor/some other domain, but they need to be permanent identifiers which will always resolve.
Warning: if no decision is made by the time we need this information, we'll start using PURL URLs under an organisation called idpd.
"csvlint forms part of a pipeline, where it checks whether a produced CSV file is correct according to the specification.
Some debate whether this is a priority or not."
Additional qb/skos validation.
This task requires that suitable validation of the abstract model's attributes is performed when the validate method is called on a Column/QbDataStructureDefinition model.
I found a useful-looking tool called pydantic which allows you to validate the inputs to classes to ensure that they're consistent with the static type attributes defined.
Ideally, we'd be able to let the user do what they like to the model and only enforce this type validation when the validate method is called, but I'm not entirely sure pydantic supports this workflow: it seems to throw exceptions whenever the model's constructor is called. Have a deeper look at the pydantic library and see if we can use it in the way described.
If pydantic isn't suitable, we'll have to find some other tool or simply write the validation ourselves.
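If we do end up writing the validation ourselves, a minimal sketch of deferred type validation over a dataclass might look like the following. The QbColumn attributes here are invented for illustration, not the real csvqb model.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass
class QbColumn:
    # Hypothetical attributes, purely for illustration.
    csv_column_title: str
    propagate_to_rdf: bool = True

    def validate(self) -> list:
        """Check attribute values against the declared static types.

        Validation is deferred: the user may mutate the model freely,
        and errors are only reported when validate() is called.
        """
        errors = []
        hints = get_type_hints(type(self))
        for f in fields(self):
            expected = hints[f.name]
            value = getattr(self, f.name)
            if not isinstance(value, expected):
                errors.append(
                    f"{f.name}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return errors


col = QbColumn(csv_column_title="Constituency")
col.propagate_to_rdf = "yes"  # wrong type, but no exception raised here
print(col.validate())  # errors are reported only on demand
```

This mirrors the workflow described above: construction and mutation are unrestricted, and type consistency is checked only at validate() time.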
One-click-deploy, just get the metadata into such a state (making assumptions along the way) that RDF can be generated and uploaded into PMD
Minimal config from user is to specify whether a column is measure/dimension/attribute.
The classes and attributes are declared in RDF vocabularies, e.g. CSV-W tabular data model namespace doc. We can create source code from these vocabs.
We can't make users pip install -e git+https://...
todo: Need to decide what our organisation's name on pypi should be, and which email address we should sign up under.
@canwaf knows all about how to name mailboxes, so he seems like the most suitable person to push this issue forwards. @JasonHowell may be required to support this too.
Warn users if they try to assign a codelist but their dataset contains codes not in that codelist. Might not work for external codelists (or more work needed to get that to work), but if the codelist is held locally might be viable.
Local codelists to form part of MVP.
Could be via RDF representation of codelist, or from the codelist as a CSV.
Requires making request to PMD to check contents of code-list match what's in the local dataset (or alternatively specifying the local location for some ttl describing the existing code-list to allow the checks to be made).
We probably want to do this both in the python app, to provide more helpful messages to the user, and by adding the Foreign Key constraints into the CSV-W and letting csvlint validate that they exist.
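A minimal sketch of the local code-list membership check might look like this. The "Notation" column name is a guess at the code list CSV's layout, not a confirmed convention.

```python
import csv
import io


def codes_not_in_codelist(dataset_codes, codelist_csv_text, code_column="Notation"):
    """Return dataset codes missing from a locally held code-list CSV.

    code_column is an assumed header name; real code lists may differ.
    """
    reader = csv.DictReader(io.StringIO(codelist_csv_text))
    known = {row[code_column] for row in reader}
    return sorted(set(dataset_codes) - known)


codelist = (
    "Notation,Label\n"
    "E14000987,Tatton\n"
    "E14000939,South Holland and The Deepings\n"
)
print(codes_not_in_codelist(["E14000987", "E14000999"], codelist))
# → ['E14000999'] — the unknown code the user should be warned about
```

Any codes returned would trigger the warning described above before the CSV-W is ever generated.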
periods, geographies (could pass "year" or "financial_year" as an argument and get all the detailed filled in for us)
Can we just have templates for columns that we can extend/override? (let's say you store a JSON file somewhere on the web with these templates)
Should third parties be able to define their own column templates (at some appropriate publicly accessible URI)?
Ability for the user to provide their own URIs (overwriting defaults should they so wish).
This is focused purely on existing dimensions, as opposed to establishing new namespaces.
Dimensions declared as rdfs:subPropertyOf have an implicit rdfs:range that is a subset of (or equal to) their parent property's range.
When declaring the sub-property in the DSD, the range and qb:codeList of the parent property can be used as defaults, unless these are declared in the mapping.
We would want to agree an approach to setting the rdfs:range property.
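As a sketch of the defaulting behaviour described above, the Turtle below shows a sub-property picking up range and code list from its parent. All the example.org URIs, the parent dimension, and the code list are invented for illustration.

```turtle
@prefix qb:       <http://purl.org/linked-data/cube#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdmx-dim: <http://purl.org/linked-data/sdmx/2009/dimension#> .

# Illustrative URIs only.
<http://example.org/dimension/local-authority>
    a qb:DimensionProperty ;
    rdfs:subPropertyOf sdmx-dim:refArea ;
    # These two could be defaulted from the parent property
    # unless overridden in the mapping:
    rdfs:range <http://example.org/class/LocalAuthority> ;
    qb:codeList <http://example.org/codelist/local-authorities> .
```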
We're currently already loading from SKOS code lists (the Codelist reader class); however, this doesn't address the qube's CSV-W itself.
Validate the QB/SKOS/DCAT metadata from the CSV-W without converting to RDF
Should be fine to write this tool in python since the number of triples is pretty small here.
Search for codes and try to reconcile as a way of discovering codelists (see Ordnance Survey for inspiration).
DCAT 1 & 2 suggest that contact details should use VCARD: https://www.w3.org/TR/2014/REC-vocab-dcat-20140116/#Property:dataset_contactPoint and https://www.w3.org/TR/vocab-dcat-2/#Property:resource_contact_point
We're currently using mailto: URIs as both the dcat:contactPoint and the pmd:contactEmail, but should aim for the style shown in https://www.w3.org/TR/vcard-rdf/#Examples
Need to figure out the support for this in PMD /cc @BillSwirrl.
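For reference, a VCARD-style contact point replacing a bare mailto: URI might look like this sketch; the dataset URI, team name, and mailbox are placeholders.

```turtle
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

# Placeholder dataset and contact details, for illustration only.
<http://example.org/dataset/my-dataset> dcat:contactPoint [
    a vcard:Organization ;
    vcard:fn "Example Statistics Team" ;
    vcard:hasEmail <mailto:statistics@example.org>
] .
```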
Would like to see csvwhelper form part of the functionality which writes CSVW from our tooling.
This is the opposite of #568. The CSV-Ws we currently output are focused on being machine readable, using notation (i.e. codes) instead of the more human-friendly labels.
This means that a code like E14000987 is serialised in the dataset CSV instead of the label Tatton. Both represent the same parliamentary constituency. In csvcubed 0.1.* only the notation is output for dimensions, measures, units, and non-literal attributes.
There are two approaches to address human-readable CSV-Ws.

Labels only:

Constituency | Value | Measure | Unit |
---|---|---|---|
Tatton | 1 | Count of MPs | Number |
South Holland and The Deepings | 1 | Count of MPs | Number |

Notation plus labels:

Constituency Notation | Constituency | Value | Measure | Unit |
---|---|---|---|---|
E14000987 | Tatton | 1 | Count of MPs | Number |
E14000939 | South Holland and The Deepings | 1 | Count of MPs | Number |
There are many implications to both approaches, mostly concerning how we change our approach to the DSD and attach concept schemes to the main dataset.
The web of data is built on the principle that we "don't know what we don't know". In practice, this means that we should avoid filling out data structures with temporary, null, placeholder or default values when we don't have the data. It also means that applications need to be able to gracefully handle not having all the data.
We need to apply the fix for when distributions don't have a date GSS-Cogs/gss-utils#51 and consider whether the default value for the issued date of a distribution should inherit from the dataset's issued date.
Once that's done, we should remove the placeholder default values, https://github.com/GSS-Cogs/gss-utils/blob/5c1ba1d243b79e625361c6d70dc5c3fd197f8f2a/gssutils/scrapers/govscot.py#L30.
We also need to go through pipelines that use scraper.distributions[0] and switch them to scraper.distribution(latest=True) instead.
R wrapper around python API (or, tool works in R, regardless of whether a wrapper or a port)
So users can make use of the tooling without having to learn python/switch environments
Recognise that it makes sense for us to do development in Python first and to make more progress in the short term with Python, but need to be conscious of R users.
Design starting point? Reticulate
Presumably we'll start by supporting the measure-dimension approach to multi-measure datasets. It would be good if we could support users with measures/values split across different columns.
Not a strict "minimal" requirement whilst it's possible for users to transform their data into the long/thin measure-column approach.
We need to decide whether we're supporting either:
@canwaf spoke with an academic who wanted to build CSV-Ws containing sensor data - holding data in the measure/unit column approach is very wasteful of resources (huge file sizes) so they want to be able to output CSV-Ws in a pivoted format. In this case we'd need to be able to accept pivoted data as an input too.
We would need the inspect command to work in this pivoted format too.
This issue is to investigate and make some recommendations as to how we should approach this substantial problem.
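The relationship between the pivoted and long (measure-column) shapes discussed above can be sketched with a minimal melt function; the sensor column names and values are invented for illustration.

```python
def melt(rows, id_column, measure_columns):
    """Convert pivoted rows (one column per measure) into the long
    measure-column shape. Column names here are illustrative only."""
    long_rows = []
    for row in rows:
        for measure in measure_columns:
            long_rows.append({
                id_column: row[id_column],
                "Measure": measure,
                "Value": row[measure],
            })
    return long_rows


# One pivoted row per sensor reading becomes one long row per measure,
# which shows why the long shape inflates file sizes for sensor data.
pivoted = [{"Sensor": "s1", "Temperature": 21.5, "Humidity": 40}]
print(melt(pivoted, "Sensor", ["Temperature", "Humidity"]))
```

Accepting pivoted input would essentially mean running this transformation (or its inverse, for the inspect command) inside the tool rather than asking users to reshape their data by hand.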
Able to express metadata declaratively (through JSON, YAML).
This will be backwards compatible with the existing info.json syntax; however, you'll be able to add attributes, measures and units locally to the dataset.
Ability to define measures/dimensions/attributes which are non-local to a dataset (e.g. family-level dimensions)
Publishers can define consistent/harmonised measures for use across the departments or x-gov
"As a data manager (in a department, not necessarily an ONS DM), I need a way to define measures/dimensions/attributes which I intend to use across multiple publications.
This is important to achieve linkage. The tooling will have the ability for publishers to adopt common URIs, but not have the ability to coin common URIs initially. This may be initially out of scope. This is a set of tools for data-manager types who would assist publishers within their departments. It is necessary, but currently out of scope."
We will also need the ability to publish code-lists which are independent of datasets.
This ticket is just to come up with a proposed schema syntax which you will present to the wider group (including DMs/DEs) to gain feedback on where we should go with the task next.
Consider:
Make sure that the syntax is as close to identical to the qube-config.json column definition syntax as possible to lower confusion & mistakes.
qudt:sterling versus http://qudt.org/units/sterling - this would make it much easier for JSON schema suggestions to be readable.
We need a new pipeline targeted at the specific format of metadata that our new tooling will generate. We need to add in the PMD-specific metadata that won't be in the CSV-W anymore.
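Supporting the qudt:sterling shorthand could be a simple prefix expansion. This sketch uses a made-up prefix map mirroring the example above; the qudt URI follows the notes and may not match the real QUDT namespace.

```python
# Hypothetical prefix map; the base URI mirrors the example in the notes.
PREFIXES = {"qudt": "http://qudt.org/units/"}


def expand_curie(curie):
    """Expand a prefix:name shorthand into a full URI.

    Strings that are already full URIs (or use an unknown prefix)
    are returned unchanged.
    """
    prefix, _, name = curie.partition(":")
    if prefix in PREFIXES and name:
        return PREFIXES[prefix] + name
    return curie


print(expand_curie("qudt:sterling"))  # http://qudt.org/units/sterling
```

Note that a full http://... URI passes through untouched because its "prefix" (http) is not in the map, so both spellings could coexist in the same config file.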
We probably need two pipelines: one which builds the CSV-W as per our existing pipelines, and another which accepts an already generated CSV-W and a graph_base_uri and just does the validation/transformation to RDF/upload to PMD.
Note since this issue hasn't been looked into, @robons has started using unittest in the existing projects.
Need to agree with Swirrl how mixed codelists are implemented.
Related to #35 since it will give us a way of downloading the code-lists easily (without time-outs).
TODO: Need to think about how we define the mixing in the info.json v2 syntax (or do we do something outside of that?)
Users should be able to confirm whether the resulting RDF would be valid according to the SPARQL tests
Try to eliminate SPARQL failures when loading in dataframes/specifying metadata without having to run SPARQL queries (e.g., throw an error if you try to specify a dimension which has blanks).
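A pre-emptive check like the blanks example could be as simple as the sketch below; the function name and error wording are illustrative, not part of any existing API.

```python
def check_dimension_has_no_blanks(column_title, values):
    """Raise before any RDF is generated if a dimension column contains
    blank values, pre-empting the equivalent SPARQL test failure."""
    blanks = [
        i for i, v in enumerate(values)
        if v is None or str(v).strip() == ""
    ]
    if blanks:
        raise ValueError(
            f"Dimension column '{column_title}' has blank values "
            f"at rows {blanks}."
        )


# Passes silently for a fully populated dimension column.
check_dimension_has_no_blanks(
    "Constituency", ["Tatton", "South Holland and The Deepings"]
)
```

Running checks like this at dataframe-load time gives the user a row-level error message instead of a generic SPARQL constraint failure much later in the pipeline.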
We'll need one feature to test single measure datasets and another one to test multi-measure datasets.
We want to think about how we break apart gss-utils at some point, but it probably isn't part of the csvwlib project.
We could output a nice scraper.json file containing the metadata from this process -> feed into CLI which takes {info.json, scraper.json (optional) and the tidy-csv.csv} and outputs the CSV-W.
We probably want this functionality to output a scraper.json file which is fed into the CLI developed in Issue #108.
We should map the scraper to the CatalogMetadata model (or similar from csvwlib). This could then be directly passed to the Cube in a python script, or serialised to a scraper.json for use with the CLI in #108.
You should start making changes to the code in gss-utils:
Map the scraper to the CatalogMetadata (in csvqb) model. ++ do some behave tests in gss-utils here.
Serialise CatalogMetadata from/to JSON.
Add a metadata-overrides argument which specifies the location of the CatalogMetadata JSON file. It should load this data in and take it in preference to any configuration from the info.json. ++ alter the behave tests in the csvqb lib.
Ability to guess whether a column is measure/dimension/attribute and present the user with the tool's guess at runtime, similar to how R's readr::read_csv or other CSV parsers work.
We can accept poor "guesses" at first - e.g., if a column is entirely numeric, assume it is a measure.
Report back to the user what the programme has guessed, like read_csv:
> readr::read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv")
#> ── Column specification ────────────────────────────────────────────
#> cols(
#> John = col_character(),
#> Doe = col_character(),
#> `120 jefferson st.` = col_character(),
#> Riverside = col_character(),
#> NJ = col_character(),
#> `08075` = col_character()
#> )
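A deliberately naive first pass at the guessing logic might look like this; the "entirely numeric means measure value" heuristic is the simplification mentioned above, and the category names are illustrative.

```python
def guess_column_type(values):
    """Naive first-pass guess at a column's role: an entirely numeric
    column is assumed to hold measure/observation values, anything
    else is assumed to be a dimension. A deliberately poor heuristic
    to start with, as described above."""
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    return "measure" if values and all(is_number(v) for v in values) else "dimension"


print(guess_column_type(["1", "2", "3"]))           # measure
print(guess_column_type(["Tatton", "Riverside"]))   # dimension
```

The guesses would then be echoed back to the user at runtime in the same spirit as the read_csv column specification shown above.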
Given we're going with a declarative (maybe JSON) style syntax we need to provide some documentation linking users to the qudt definitions and providing them with easy access to a few key ones. We don't need to create any python objects to represent them at this point in time.
Include common ones like:
Currencies
Weights
Pounds
kg
tons
tonnes
Volumes
Lengths
Run a SPARQL query against PMD to see our most commonly used units, and if there are any obvious units missing off the above list, add them to it.
This should contain a "how to" on just plunking the externally defined unit into your column definition.
Having a template for each uri coined. @rossbowen to work with @ajtucker for an opinionated document for this.
And all of the QB resources (components, properties, dsd, etc.)
This epic contains all of the issues which need to be addressed by the features squad to help to build this new tool.
Once users have generated their CSV-W, how do they verify that they've added all of the correct metadata and haven't missed something? Is there some kind of UI or report we can create to allow them browse what they've created without having to read a metadata JSON file?
csvcubed inspect data-qube.csv-metadata.json
Read the CSV-W using RDFLib and print out to console information such as
csvcubed inspect my-favourite-code-list.csv-metadata.json
Note:
"Examples may be, if a user supplies a dataframe with columns which begin with a "_", that an error/warning would be raised as that will result in invalid CSVW. The principle would be that we would be trying to avoid invalidating the CSVW spec as we go, and then would also explicitly check for validity when
Other examples which don't exactly break the CSVW spec but would be of help - if an info.json contains references to columns which don't appear in the supplied dataframe, then a warning would inform the user.
If the info.json does not contain references to columns in the dataframe, an output explaining what assumptions about it would be useful.
I have changed this to a MVP deliverable. I would like the MVP design to incorporate the idea that we must validate the user's input and ensure a valid CSVW output.
Rob: Okay, then we'll need to discuss scope. I don't think it's possible or reasonable for us to be validating everything that the user could possibly be doing wrong for the MVP. We need a prioritised list of what we want users to be warned against doing."
We can only currently define dimensions which are local to datasets
"Users cannot be expected to know lots of different measures/dimensions etc. exist. We want them to be able to say what they do know, and when they don't know something then a sensible default is provided. We can direct them to the "correct" or "better" thing later.
URIs may be coined relative to a base which does not have to be http://statistics.gov.uk or http://gss-data.org.uk"
Should be invariant of target platform (i.e., we cannot assume users will always publish on our platform or on PMD)
Make sure to link to the glossary for definitions of commonly used terms.
For the end of March MVP we’ll be pushing users towards initially shaping their data using the measure column & unit column approach to simplify the onboarding process (so they don’t have to learn the distinction between single-measure and multi-measure datasets too early). So whilst we currently support shapes A and B, we will only be actively promoting the use of shape B; shape A will be a specialised shape for use where people complain about the verbosity of shape B with single-measure datasets.
Ultimately, once we support shape C (which is effectively an extension of shape A) we will be able to shift to promoting that as the first shape of data users venture towards.
Advanced users may want to include arbitrary RDF which hasn't been baked into the data model - each time this happens we shouldn't require a change request.
The DCAT metadata will hold the publisher/title/etc. of the dataset but won't directly link into the PMD catalogue and won't use any PMD-specific metadata.
This will be run over all of the data in the CSV file, so is python performant enough to run these checks? Should we have another tool in Scala?
Implementing the DCAT tests should be easy - we can simply lift the SPARQL queries we already have and run them in a python script against the JSON-LD CSV-W metadata document.
The SKOS & QB constraints might be a bit more difficult and will involve looking at the values defined inside the CSV itself (as well as looking at the metadata). So some of the tests can probably be brought over as SPARQL tests, but others will need to be re-written in python to go look at the CSV values.
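As a sketch of what a lifted metadata-only constraint might look like, here is an illustrative SPARQL ASK query; the specific DCAT constraint chosen is an assumption for illustration, not one of our existing queries.

```sparql
# Illustrative check: flag any dcat:Dataset missing a dct:title.
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

ASK WHERE {
    ?ds a dcat:Dataset .
    FILTER NOT EXISTS { ?ds dct:title ?title . }
}
```

A query in this shape returns true when a violating dataset exists, so the python wrapper would just need to run it over the parsed JSON-LD metadata and report any ASK that comes back true.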
Continue support for Mike's existing work on generating both PMD & CMD outputs from the same code.
Only some of the transformations we have need to go to CMD. We can use the tool internally without CMD-output options for a while yet.