Presumably we'll start by supporting the measure-dimension approach to multi-measure d

See <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id

Scope out support pivoted multi-measure datasets (csvcubed, closed)

gss-cogs commented on June 3, 2024

Scope out support pivoted multi-measure datasets

Comments (4)

robons commented on June 3, 2024

The Plan

After considering the work necessary to output pivoted CSV-Ws from csvcubed, it is clear that some work is required; however, that work is self-contained within the csvcubed project and should have no significant impact on any dependencies (so long as we continue to create RDF compliant with the RDF Data Cube vocabulary). Outputting pivoted data sets is also unlikely to take much more work than pivoting the inputs and outputting them in the canonical csvcubed format. Given the benefits of supporting pivoted outputs (performance, storage, and truly normalised data), we may as well do that work now rather than accepting pivoted inputs now and supporting pivoted outputs later.
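To make the difference between the two shapes concrete, here is a minimal Python sketch (standard library only) that unpivots a pivoted table into a long, one-observation-per-row shape. The helper `to_canonical_rows`, the `OBS_COLUMNS` mapping, and the output column titles `Measure`, `Unit` and `Value` are assumptions made for illustration; they are not csvcubed's API or its actual canonical column names, and the sample values are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping: observation column title -> (measure label, unit label).
OBS_COLUMNS: Dict[str, Tuple[str, str]] = {
    "Average Income per Household": ("Average Income Per household", "GBP"),
    "Average floor area of house": ("Average area of house floor", "Square metres"),
}


def to_canonical_rows(
    pivoted_rows: List[Dict[str, str]],
    obs_columns: Dict[str, Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Unpivot: emit one output row per (input row, observation column) pair."""
    canonical = []
    for row in pivoted_rows:
        # Every column that is not an observation column is carried over unchanged.
        shared = {k: v for k, v in row.items() if k not in obs_columns}
        for title, (measure, unit) in obs_columns.items():
            canonical.append(
                {**shared, "Measure": measure, "Unit": unit, "Value": row[title]}
            )
    return canonical


# Made-up sample row in the pivoted shape: one column per measure.
pivoted = [
    {
        "Period": "2021",
        "Geography": "E01000001",
        "Average Income per Household": "45000",
        "Average floor area of house": "90",
    }
]
canonical_rows = to_canonical_rows(pivoted, OBS_COLUMNS)
```

In the pivoted shape the measure and unit labels are stated once per column rather than once per row, which is where the storage saving mentioned above comes from.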

We will need to ensure that users can link attribute columns to the observed value columns they are paired with. Note that this means users should also be able to create a units column linked to an obs val column, which allows the unit to vary across values of the same measure.
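One way to picture this link is an optional field on the units (or attribute) column model that names the obs val column it describes. The sketch below is hypothetical: the class names, the `linked_obs_column` field and the `unit_for` helper are invented for illustration and do not reflect csvcubed's real QB models. The rows are made-up data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ObsValColumn:
    csv_column_title: str
    measure: str
    unit: Optional[str] = None  # None when a linked units column supplies the unit


@dataclass
class UnitsColumn:
    csv_column_title: str
    # Hypothetical optional property: title of the obs val column this unit describes.
    linked_obs_column: Optional[str] = None


def unit_for(
    row: Dict[str, str], obs: ObsValColumn, units_columns: List[UnitsColumn]
) -> Optional[str]:
    """Return the unit for one row: a linked units column wins, else the fixed unit."""
    for units in units_columns:
        if units.linked_obs_column == obs.csv_column_title:
            return row[units.csv_column_title]
    return obs.unit


income = ObsValColumn("Income", measure="Income")
income_units = UnitsColumn("Income Unit", linked_obs_column="Income")

# Made-up rows: the same measure, with a different unit on each row.
rows = [
    {"Income": "10", "Income Unit": "GBP Million"},
    {"Income": "12", "Income Unit": "USD Million"},
]
units_seen = [unit_for(r, income, [income_units]) for r in rows]
```

An obs val column with no linked units column simply falls back to its own fixed unit, so the existing single-unit behaviour is unaffected by the new optional field.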

Configuring a multi-measure pivoted dataset via convention will not initially be supported, but I have some ideas about how we could address that in the future.

Big Ticket Items

Non-breaking releasable changes

These are changes we can make incrementally which shouldn't alter the behaviour of the application (and so can be released before all features are complete):

Update all QB models representing Attributes and Units; they should have an optional property added on to them which identifies the CSV Column of the observations column with which the attribute/unit is associated.
Update the QbWriter class so that it can support outputting data in both the existing canonical shape, as well as the new pivoted shape.
- This will require us to add a significant number of virtual columns to express new triples, as well as altering the aboutUrl we use.
New validation errors
- We need a validation error to show up where the user has specified multiple obs val columns and a measure dimension column.
- We need a validation error to show up where the user has specified an obs val column but hasn't defined the measure or unit (either against the obs val or as a linked units column).
- We need a validation error to show up where the user has defined an attribute or units column, but hasn't linked it to a particular obs val column where there are multiple obs val columns. If there is only one obs val column, it's clear which one it's associated with.
- We need a validation error where the user has linked a unit or attribute column to a CSV column which we don't see as an obs val column.
Update csvcubed inspect command. N.B. we need to support both the old style and new style csvcubed outputs at the same time.
- Changes will be focused on the csvdataset.transform_dataset_to_canonical_shape function and associated SPARQL queries which pull out the (currently) single unit and measure used in the pivoted data sets.
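The four new validation errors listed above could be sketched roughly as follows. This is a simplified illustration, not csvcubed's validation code: the `Column` class, its `kind` values, the `linked_obs_column` property and the error messages are all hypothetical. It also assumes that an unlinked units column applies to the only obs val column when there is exactly one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    title: str
    kind: str  # "dimension", "observations", "measures", "attribute" or "units"
    measure: Optional[str] = None
    unit: Optional[str] = None
    linked_obs_column: Optional[str] = None  # hypothetical link property


def validate(columns: List[Column]) -> List[str]:
    errors = []
    obs = [c for c in columns if c.kind == "observations"]
    obs_titles = {c.title for c in obs}
    has_measure_dimension = any(c.kind == "measures" for c in columns)

    # 1. Multiple obs val columns cannot be combined with a measure dimension column.
    if len(obs) > 1 and has_measure_dimension:
        errors.append("multiple observation columns used with a measure dimension column")

    # 2. Each obs val column needs a measure and a unit (fixed, or via a units column).
    for o in obs:
        has_units_column = any(
            c.kind == "units"
            and (
                c.linked_obs_column == o.title
                or (c.linked_obs_column is None and len(obs) == 1)
            )
            for c in columns
        )
        if o.measure is None and not has_measure_dimension:
            errors.append(f"'{o.title}' has no measure defined")
        if o.unit is None and not has_units_column:
            errors.append(f"'{o.title}' has no unit defined")

    for c in columns:
        if c.kind not in ("attribute", "units"):
            continue
        if c.linked_obs_column is None:
            # 3. With several obs val columns, the link must be stated explicitly.
            if len(obs) > 1:
                errors.append(f"'{c.title}' must be linked to an observation column")
        elif c.linked_obs_column not in obs_titles:
            # 4. The link must point at a column that is an obs val column.
            errors.append(
                f"'{c.title}' is linked to '{c.linked_obs_column}', "
                "which is not an observation column"
            )
    return errors


# Example: two obs val columns plus an attribute column with no link.
problems = validate(
    [
        Column("Period", "dimension"),
        Column("Income", "observations", measure="Income", unit="GBP"),
        Column("Area", "observations", measure="Floor area", unit="Square metres"),
        Column("Status", "attribute"),
    ]
)
```

Returning a list of messages rather than raising on the first problem lets all four checks report together, which is convenient when testing each rule in isolation.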

Testing some of these changes (with pytest + behave) may require temporarily disabling some of the standard validation functionality, but make sure the validation remains there once you merge your changes into main!

Breaking changes

These tasks should be completed just before getting ready to release. Once started, a release cannot occur until all of them are completed:

Alter the QB validation functionality so that it allows users to define multiple observation value columns, so long as each has a measure (which is the same for all values in the column) and a unit (which can either be the same for all values in the column, or be allowed to vary) defined.
- We will likely need to remove an existing Validation Error and deprecate the existing documentation page (don't delete it since users of older versions of csvcubed may be redirected there still).
Update the qube-config's JSON Schema definition so that a user can specify which observation values column their attribute or units column is associated with (this can be a minor version change).
- At the same time it would make sense to alter the qube-config loader so that it loads these new properties in and maps them onto the QB models.
Update documentation so that it's clear how users can configure a multi-measure pivoted data set (with examples).

Later changes to support conventional definition of pivoted data sets

These tasks can be completed after the initial core release of the pivoted data functionality. They're not core but provide a continuation of the conventional configuration approach.

Allow users to specify an observation values column in the following two syntaxes which can be identified using a regular expression. If a CSV column title matches one of the regexes, then it's an obs val column with measure and unit as defined in the patterns:
- Measure / Unit e.g. Income / £ GBP Million
- Measure (Unit) e.g. Income (£ GBP Million)
An attribute column can be associated with an observations column using the following pattern:
- Attribute Name [Measure], e.g. Observation Status [Income] would represent an observation status column associated with the observations column which uses the Income measure.
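The three title conventions above could be recognised with regular expressions along these lines. This is only a sketch: the exact patterns, the `parse_column_title` helper and the shape of the dictionary it returns are assumptions for illustration, and edge cases (for example a unit that itself contains a slash or brackets) would need more thought.

```python
import re
from typing import Dict, Optional

# "Attribute Name [Measure]", e.g. "Observation Status [Income]"
ATTRIBUTE = re.compile(r"^(?P<attribute>.+?)\s*\[(?P<measure>.+)\]$")
# "Measure / Unit", e.g. "Income / £ GBP Million"
SLASH = re.compile(r"^(?P<measure>.+?)\s*/\s*(?P<unit>.+)$")
# "Measure (Unit)", e.g. "Income (£ GBP Million)"
PAREN = re.compile(r"^(?P<measure>.+?)\s*\((?P<unit>.+)\)$")


def parse_column_title(title: str) -> Optional[Dict[str, str]]:
    """Classify a CSV column title by convention, or return None if nothing matches."""
    title = title.strip()
    match = ATTRIBUTE.match(title)
    if match:
        return {
            "kind": "attribute",
            "attribute": match.group("attribute"),
            "measure": match.group("measure"),
        }
    for pattern in (SLASH, PAREN):
        match = pattern.match(title)
        if match:
            return {
                "kind": "observations",
                "measure": match.group("measure"),
                "unit": match.group("unit"),
            }
    return None


parsed = [
    parse_column_title(t)
    for t in (
        "Income / £ GBP Million",
        "Income (£ GBP Million)",
        "Observation Status [Income]",
        "Period",
    )
]
```

Titles that match none of the patterns return `None`, so they would continue to be handled by whatever conventions already apply to them.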

We don't need to worry about

Updating any SPARQL queries.
- The existing ones should work perfectly well so long as we're outputting RDF compliant with the RDF Data Cube vocabulary. We shouldn't need to add any skipped tests in either as they are not directly related to the proposed approach to creating pivoted CSV-Ws.
Altering the pmdutils CLI or Jenkins pipeline.
- We're not touching any RDF which is restructured in any part of the CSV-W upload to PMD process.

robons commented on June 3, 2024

I'm going to close this issue as done, but we still need to spend some time creating the associated tasks in refinement.

robons commented on June 3, 2024

An example of a pivoted multi-measure dataset qube-config JSON follows:

{
    "$schema": "https://purl.org/csv-cubed/qube-config/v1.0",
    "columns":{
        "Period": {
            "from_template": "year"
        },
        "Geography": {
            "label": "ONS LSOA"
        },
        "Average Income per Household": {
            "type": "observations",
            "measure": {
                "label": "Average Income Per household"
            },
            "unit": {
                "label": "GBP"
            }
        },
        "Average floor area of house": {
            "type": "observations",
            "measure": {
                "label": "Average area of house floor"
            },
            "unit": {
                "label": "Square metres"
            }
        }
    }
}

Note that the user can define basic pivoted cubes using the older minor versions of the qube-config syntax.
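As a rough illustration of how such a config could be recognised as a pivoted multi-measure cube, the Python sketch below parses the example and applies the rule from the plan in this thread: more than one column of type `observations`, each with its own measure and unit. The variable names are illustrative, and this is not csvcubed's qube-config loader.

```python
import json

# The example qube-config from this comment, embedded so the sketch is self-contained.
CONFIG_JSON = """
{
    "$schema": "https://purl.org/csv-cubed/qube-config/v1.0",
    "columns": {
        "Period": {"from_template": "year"},
        "Geography": {"label": "ONS LSOA"},
        "Average Income per Household": {
            "type": "observations",
            "measure": {"label": "Average Income Per household"},
            "unit": {"label": "GBP"}
        },
        "Average floor area of house": {
            "type": "observations",
            "measure": {"label": "Average area of house floor"},
            "unit": {"label": "Square metres"}
        }
    }
}
"""

config = json.loads(CONFIG_JSON)

# Pick out every column explicitly typed as an observations column.
obs_columns = {
    title: column
    for title, column in config["columns"].items()
    if isinstance(column, dict) and column.get("type") == "observations"
}

# Pivoted multi-measure: several obs val columns, each with a measure and a unit.
is_pivoted_multi_measure = len(obs_columns) > 1 and all(
    "measure" in column and "unit" in column for column in obs_columns.values()
)

measure_labels = [column["measure"]["label"] for column in obs_columns.values()]
unit_labels = [column["unit"]["label"] for column in obs_columns.values()]
```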

robons commented on June 3, 2024

See #585 for tasks to action.
