
Comments (4)

robons commented on June 3, 2024

The Plan

After considering what is needed to output pivoted CSV-Ws from csvcubed, it is apparent that some work is required, but that this work is self-contained within the csvcubed project and should have no significant impact on any dependencies (so long as we continue to create RDF compliant with the RDF Data Cube vocabulary). Outputting pivoted data sets is also unlikely to take much more work than pivoting the inputs and outputting them in the canonical csvcubed format. Given the benefits of supporting pivoted outputs (performance, storage and truly normalised data), we may as well do that work now rather than accepting pivoted inputs now and supporting pivoted outputs later.

We will need to ensure that users can link attribute columns to the observed value columns they are paired with. Note that this means users should also be able to create a units column linked to an obs val column, which allows the unit to vary within a single measure.

Configuring a multi-measure pivoted dataset via convention will not initially be supported, but I have some ideas about how we could address that in the future.

Big Ticket Items

Non-breaking releasable changes

These are changes we can make incrementally which shouldn't alter the behaviour of the application (and so can be released before all features are complete):

  • Update all QB models representing Attributes and Units; they should have an optional property added on to them which identifies the CSV Column of the observations column with which the attribute/unit is associated.
  • Update the QbWriter class so that it can support outputting data in both the existing canonical shape, as well as the new pivoted shape.
    • This will require us to add a significant number of virtual columns to express new triples, as well as to alter the aboutUrl we use.
  • New validation errors
    • We need a validation error to show up where the user has specified multiple obs val columns and a measure dimension column.
    • We need a validation error to show up where the user has specified an obs val column but hasn't defined the measure or unit (either against the obs val or as a linked units column).
    • We need a validation error to show up where the user has defined an attribute or units column, but hasn't linked it to a particular obs val column where there are multiple obs val columns. If there is only one obs val column, it's clear which one it's associated with.
    • We need a validation error where the user has linked a unit or attribute column to a CSV column which we don't see as an obs val column.
  • Update csvcubed inspect command. N.B. we need to support both the old style and new style csvcubed outputs at the same time.
    • Changes will be focused on the csvdataset.transform_dataset_to_canonical_shape function and associated SPARQL queries which pull out the (currently) single unit and measure used in the pivoted data sets.

Testing some of these changes (with pytest + behave) may require temporarily disabling some of the standard validation functionality, but make sure the validation remains there once you merge your changes into main!
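The four validation rules listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Column` class, its field names (including `linked_obs_val_column`) and the short error codes are all hypothetical stand-ins, not csvcubed's actual QB models or validation errors. It also assumes that, with a single obs val column, a measure dimension column or an unlinked units column is enough to supply the measure or unit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    """Hypothetical, simplified stand-in for a csvcubed QB column model."""
    title: str
    kind: str  # "dimension", "observations", "measures", "attribute" or "units"
    has_measure: bool = False  # observations only: measure defined on the column
    has_unit: bool = False  # observations only: unit defined on the column
    linked_obs_val_column: Optional[str] = None  # attribute/units columns only


def validate_columns(columns: List[Column]) -> List[str]:
    """Return hypothetical error codes for the four rules described above."""
    errors: List[str] = []
    obs_cols = [c for c in columns if c.kind == "observations"]
    obs_titles = {c.title for c in obs_cols}
    measure_cols = [c for c in columns if c.kind == "measures"]
    units_cols = [c for c in columns if c.kind == "units"]
    multiple_obs = len(obs_cols) > 1

    # Rule 1: multiple obs val columns alongside a measure dimension column.
    if multiple_obs and measure_cols:
        errors.append("measure-column-with-multiple-obs-vals")

    # Rule 2: every obs val column needs a measure and a unit, the latter
    # either defined on the column or supplied by a units column.
    for obs in obs_cols:
        measure_ok = obs.has_measure or (not multiple_obs and bool(measure_cols))
        unit_ok = obs.has_unit or any(
            u.linked_obs_val_column == obs.title
            or (not multiple_obs and u.linked_obs_val_column is None)
            for u in units_cols
        )
        if not measure_ok:
            errors.append(f"no-measure: {obs.title}")
        if not unit_ok:
            errors.append(f"no-unit: {obs.title}")

    for col in columns:
        if col.kind not in ("attribute", "units"):
            continue
        link = col.linked_obs_val_column
        if link is None:
            # Rule 3: a link is only required when the target is ambiguous.
            if multiple_obs:
                errors.append(f"unlinked: {col.title}")
        elif link not in obs_titles:
            # Rule 4: the link must point at an obs val column.
            errors.append(f"link-not-obs-val: {col.title}")
    return errors


example = [
    Column("Period", "dimension"),
    Column("Income", "observations", has_measure=True, has_unit=True),
    Column("Floor Area", "observations", has_measure=True),
    Column("Floor Area Unit", "units", linked_obs_val_column="Floor Area"),
    Column("Status", "attribute"),
    Column("Marker", "attribute", linked_obs_val_column="Period"),
]
example_errors = validate_columns(example)
```

In this made-up example, `Status` is flagged because it is not linked while there are two obs val columns, and `Marker` is flagged because it links to a dimension column.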

Breaking changes

These tasks should be completed just before getting ready to release. Once started, a release cannot occur until all of them are completed:

  • Alter the QB validation functionality so that it respects that users can now define multiple observation value columns, so long as each has a measure (which is the same for all values in the column) and a unit (which can either be the same for all values in the column, or be allowed to vary) defined.
    • We will likely need to remove an existing Validation Error and deprecate the existing documentation page (don't delete it since users of older versions of csvcubed may be redirected there still).
  • Update the qube-config's JSON Schema definition so that a user can specify which observation values column their attribute or units column is associated with (this can be a minor version change).
    • At the same time it would make sense to alter the qube-config loader so that it loads these new properties in and maps them onto the QB models.
  • Update documentation so that it's clear how users can configure a multi-measure pivoted data set (with examples).

Later changes to support conventional definition of pivoted data sets

These tasks can be completed after the initial core release of the pivoted data functionality. They're not core but provide a continuation of the conventional configuration approach.

  • Allow users to specify an observation values column using either of the following two syntaxes, which can be identified using regular expressions. If a CSV column title matches one of the regexes, then it's an obs val column with the measure and unit given in the title:
    • Measure / Unit, e.g. Income / £ GBP Million
    • Measure (Unit), e.g. Income (£ GBP Million)
  • An attribute column can be associated with an observations column using the following pattern:
    • Attribute Name [Measure], e.g. Observation Status [Income] would represent an observation status column associated with the observations column which uses the Income measure.
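As a rough sketch of how the three title conventions above might be recognised, the following uses Python's `re` module. The regexes and function names are hypothetical; nothing here is taken from csvcubed, and a real implementation would need to decide how to handle titles that contain extra slashes or brackets.

```python
import re
from typing import Optional, Tuple

# "Measure / Unit", e.g. "Income / £ GBP Million"
SLASH_PATTERN = re.compile(
    r"^(?P<measure>[^/()\[\]]+?)\s*/\s*(?P<unit>[^/()\[\]]+?)\s*$"
)
# "Measure (Unit)", e.g. "Income (£ GBP Million)"
PAREN_PATTERN = re.compile(
    r"^(?P<measure>[^/()\[\]]+?)\s*\((?P<unit>[^()]+?)\)\s*$"
)
# "Attribute Name [Measure]", e.g. "Observation Status [Income]"
ATTRIBUTE_PATTERN = re.compile(
    r"^(?P<attribute>[^\[\]]+?)\s*\[(?P<measure>[^\[\]]+?)\]\s*$"
)


def parse_obs_val_title(title: str) -> Optional[Tuple[str, str]]:
    """Return (measure, unit) if the title follows either obs val convention."""
    for pattern in (SLASH_PATTERN, PAREN_PATTERN):
        match = pattern.match(title.strip())
        if match:
            return match.group("measure").strip(), match.group("unit").strip()
    return None


def parse_attribute_title(title: str) -> Optional[Tuple[str, str]]:
    """Return (attribute name, measure) if the title follows the convention."""
    match = ATTRIBUTE_PATTERN.match(title.strip())
    if match:
        return match.group("attribute").strip(), match.group("measure").strip()
    return None
```

Titles which match none of the patterns (for example a plain `Period` column) return `None`, so they would fall through to whatever other conventions apply.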

We don't need to worry about

  • Updating any SPARQL queries.
    • The existing ones should work perfectly well so long as we're outputting RDF compliant with the RDF Data Cube vocabulary. We shouldn't need to add any skipped tests in either as they are not directly related to the proposed approach to creating pivoted CSV-Ws.
  • Altering the pmdutils CLI or Jenkins pipeline.
    • We're not touching any RDF which is restructured in any part of the CSV-W upload to PMD process.

from csvcubed.

robons commented on June 3, 2024

I'm going to close this issue as done, but we still need to spend some time creating the associated tasks in refinement.


robons commented on June 3, 2024

An example of a pivoted multi-measure dataset qube-config JSON follows:

{
    "$schema": "https://purl.org/csv-cubed/qube-config/v1.0",
    "columns":{
        "Period": {
            "from_template": "year"
        },
        "Geography": {
            "label": "ONS LSOA"
        },
        "Average Income per Household": {
            "type": "observations",
            "measure": {
                "label": "Average Income Per household"
            },
            "unit": {
                "label": "GBP"
            }
        },
        "Average floor area of house": {
            "type": "observations",
            "measure": {
                "label": "Average area of house floor"
            },
            "unit": {
                "label": "Square metres"
            }
        }
    }
}

Note that the user can define basic pivoted cubes using the older minor versions of the qube-config syntax.
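To make the difference between the pivoted and canonical shapes concrete, here is a small standard-library sketch which unpivots a CSV laid out like the configuration above into one row per observation. The data values, the `Measure`/`Unit`/`Value` output column names and the function name are made up for illustration; this is not csvcubed's own `transform_dataset_to_canonical_shape` logic.

```python
import csv
import io
from typing import Dict, List, Tuple

# Illustrative pivoted CSV matching the column titles in the config above.
# The values are invented purely for this example.
PIVOTED_CSV = """Period,Geography,Average Income per Household,Average floor area of house
2020,E01000001,45000,95
2021,E01000001,47000,96
"""

# Obs val column title -> (measure label, unit label), as in the config above.
OBS_COLUMNS: Dict[str, Tuple[str, str]] = {
    "Average Income per Household": ("Average Income Per household", "GBP"),
    "Average floor area of house": ("Average area of house floor", "Square metres"),
}


def to_canonical_rows(
    pivoted_csv: str, obs_columns: Dict[str, Tuple[str, str]]
) -> List[Dict[str, str]]:
    """Emit one row per (pivoted row, obs val column) pair."""
    rows: List[Dict[str, str]] = []
    for row in csv.DictReader(io.StringIO(pivoted_csv)):
        # Everything which isn't an obs val column is carried over unchanged.
        shared = {k: v for k, v in row.items() if k not in obs_columns}
        for title, (measure, unit) in obs_columns.items():
            rows.append(
                {**shared, "Measure": measure, "Unit": unit, "Value": row[title]}
            )
    return rows


canonical_rows = to_canonical_rows(PIVOTED_CSV, OBS_COLUMNS)
```

Each pivoted row becomes two canonical rows here, one per measure, which is the repetition of dimension values that the pivoted output avoids.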


robons commented on June 3, 2024

See #585 for tasks to action.
