The Plan
Having considered what is needed to output pivoted CSV-Ws from csvcubed, it is clear that the work is self-contained within the csvcubed project and should have no significant impact on any dependencies (so long as we continue to create RDF compliant with the RDF Data Cube vocabulary). Outputting pivoted data sets is also unlikely to take much more work than pivoting the inputs and outputting them in the canonical csvcubed format. Given the benefits of supporting pivoted outputs (performance, storage and truly normalised data), we may as well do that work now rather than accepting pivoted inputs now and supporting pivoted outputs later.
We will need to ensure that users can link attribute columns to the observed value columns they are paired with. This means users should also be able to create a units column linked to an obs val column, which allows the unit to vary within the same measure.
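As a rough sketch of that linkage (not csvcubed's actual QB models), the optional link could be an extra field on the attribute and units models. The class names and the `observed_value_col_title` field below are hypothetical and chosen only for illustration, as are the column titles.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ObservationsColumn:
    """Hypothetical stand-in for an observed-values column."""
    csv_column_title: str
    measure: str
    unit: Optional[str] = None  # None when a linked units column supplies the unit


@dataclass
class UnitsColumn:
    """Hypothetical units column with an optional link to an obs val column."""
    csv_column_title: str
    observed_value_col_title: Optional[str] = None


@dataclass
class AttributeColumn:
    """Hypothetical attribute column with an optional link to an obs val column."""
    csv_column_title: str
    observed_value_col_title: Optional[str] = None


def columns_describing(
    obs_col: ObservationsColumn,
    others: List[Union[UnitsColumn, AttributeColumn]],
) -> List[str]:
    """Return the titles of the units/attribute columns linked to obs_col."""
    return [
        c.csv_column_title
        for c in others
        if c.observed_value_col_title == obs_col.csv_column_title
    ]


income = ObservationsColumn("Average Income per Household", measure="Average Income Per household")
linked = [
    UnitsColumn("Income Unit", observed_value_col_title="Average Income per Household"),
    AttributeColumn("Income Status", observed_value_col_title="Average Income per Household"),
    AttributeColumn("Floor Area Status", observed_value_col_title="Average floor area of house"),
]
print(columns_describing(income, linked))
```

Because the field is optional, existing single-measure configurations would not need to set it.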
Configuring a multi-measure pivoted dataset via convention will not initially be supported, but I have some ideas about how we could address that in the future.
Big Ticket Items
Non-breaking releasable changes
These are changes we can make incrementally which shouldn't alter the behaviour of the application (and so can be released before all features are complete):
- Update all QB models representing Attributes and Units; they should have an optional property added on to them which identifies the CSV Column of the observations column with which the attribute/unit is associated.
- Update the `QbWriter` class so that it can support outputting data in both the existing canonical shape and the new pivoted shape.
  - This will require us to add a significant number of virtual columns to express new triples, as well as altering the aboutUrl we use.
- New validation errors
  - We need a validation error to show up where the user has specified multiple obs val columns and a measure dimension column.
  - We need a validation error to show up where the user has specified an obs val column but hasn't defined the measure or unit (either against the obs val or as a linked units column).
  - We need a validation error to show up where the user has defined an attribute or units column, but hasn't linked it to a particular obs val column where there are multiple obs val columns. If there is only one obs val column, it's clear which one it's associated with.
  - We need a validation error where the user has linked a unit or attribute column to a CSV column which we don't see as an obs val column.
- Update the csvcubed inspect command. N.B. we need to support both the old-style and new-style csvcubed outputs at the same time.
  - Changes will be focused on the `csvdataset.transform_dataset_to_canonical_shape` function and the associated SPARQL queries which pull out the (currently) single unit and measure used in the pivoted data sets.
Testing some of these changes (with pytest + behave) may require temporarily disabling some of the standard validation functionality, but make sure the validation is back in place by the time you merge your changes into main!
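The four validation rules listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not csvcubed's validation code: the `Column` class, its `kind` and `describes` fields, and the error wording are all hypothetical. It also assumes that a measure dimension column counts as defining the measure, as in the existing canonical shape.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    """Hypothetical, simplified column description."""
    title: str
    kind: str  # "observations", "measures", "units", "attribute" or "dimension"
    measure: Optional[str] = None    # only meaningful for observations columns
    unit: Optional[str] = None       # only meaningful for observations columns
    describes: Optional[str] = None  # title of the linked obs val column, if any


def validate(columns: List[Column]) -> List[str]:
    errors = []
    obs = [c for c in columns if c.kind == "observations"]
    obs_titles = {c.title for c in obs}
    has_measure_col = any(c.kind == "measures" for c in columns)

    # Rule 1: multiple obs val columns cannot be combined with a measure dimension column.
    if len(obs) > 1 and has_measure_col:
        errors.append("multiple obs val columns defined alongside a measure dimension column")

    # Rule 2: every obs val column needs a measure and a unit.
    for o in obs:
        if o.measure is None and not has_measure_col:
            errors.append(f"'{o.title}' has no measure")
        linked_units = [
            c for c in columns
            if c.kind == "units"
            and (c.describes == o.title or (c.describes is None and len(obs) == 1))
        ]
        if o.unit is None and not linked_units:
            errors.append(f"'{o.title}' has no unit")

    for c in columns:
        if c.kind not in ("units", "attribute"):
            continue
        # Rule 3: with several obs val columns, the link must be explicit.
        if c.describes is None and len(obs) > 1:
            errors.append(f"'{c.title}' is not linked to an obs val column")
        # Rule 4: the link must point at an obs val column.
        if c.describes is not None and c.describes not in obs_titles:
            errors.append(f"'{c.title}' is linked to '{c.describes}', which is not an obs val column")
    return errors


invalid = [
    Column("Period", "dimension"),
    Column("Income", "observations", measure="Income", unit="GBP"),
    Column("Floor Area", "observations", measure="Floor area"),  # no unit defined
    Column("Status", "attribute"),                               # not linked
    Column("Notes", "attribute", describes="Period"),            # linked to a dimension
]
for error in validate(invalid):
    print(error)
```

A cube with a single obs val column and an unlinked attribute column passes, since it is clear which column the attribute describes.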
Breaking changes
These tasks should be completed just before getting ready to release. Once started, a release cannot occur until all of them are completed:
- Alter the QB validation functionality so that it respects that users can now define multiple observation value columns, so long as each has a measure (which is the same for all values in the column) and a unit (which can either be the same for all values in the column or be allowed to vary) defined.
  - We will likely need to remove an existing validation error and deprecate the existing documentation page (don't delete it, since users of older versions of csvcubed may still be redirected there).
- Update the qube-config's JSON Schema definition so that a user can specify which observation values column their attribute or units column is associated with (this can be a minor version change).
  - At the same time it would make sense to alter the qube-config loader so that it loads these new properties in and maps them onto the QB models.
- Update the documentation so that it's clear how users can configure a multi-measure pivoted data set (with examples).
Later changes to support conventional definition of pivoted data sets
These tasks can be completed after the initial core release of the pivoted data functionality. They're not core, but they provide a continuation of the conventional configuration approach.
- Allow users to specify an observation values column in either of the following two syntaxes, which can be identified using a regular expression. If a CSV column title matches one of the regexes, then it's an obs val column with the measure and unit as defined in the pattern:
  - `Measure / Unit`, e.g. `Income / £ GBP Million`
  - `Measure (Unit)`, e.g. `Income (£ GBP Million)`
- An attribute column can be associated with an observations column using the following pattern: `Attribute Name [Measure]`, e.g. `Observation Status [Income]` would represent an observation status column associated with the observations column which uses the `Income` measure.
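The title conventions above could be recognised with regular expressions along these lines. This is only a sketch: the exact regexes, the function names and the handling of surrounding whitespace are assumptions, not an agreed design.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for the two observation-column syntaxes.
OBS_PATTERNS = [
    re.compile(r"^(?P<measure>.+?)\s+/\s+(?P<unit>.+)$"),        # Measure / Unit
    re.compile(r"^(?P<measure>.+?)\s*\((?P<unit>[^()]+)\)$"),    # Measure (Unit)
]

# Hypothetical pattern for an attribute column tied to a measure.
ATTRIBUTE_PATTERN = re.compile(r"^(?P<attribute>.+?)\s*\[(?P<measure>[^\[\]]+)\]$")


def parse_obs_title(title: str) -> Optional[Tuple[str, str]]:
    """Return (measure, unit) if the title follows one of the obs val conventions."""
    for pattern in OBS_PATTERNS:
        match = pattern.match(title.strip())
        if match:
            return match.group("measure").strip(), match.group("unit").strip()
    return None


def parse_attribute_title(title: str) -> Optional[Tuple[str, str]]:
    """Return (attribute name, measure) if the title follows 'Attribute Name [Measure]'."""
    match = ATTRIBUTE_PATTERN.match(title.strip())
    if match:
        return match.group("attribute").strip(), match.group("measure").strip()
    return None


print(parse_obs_title("Income / £ GBP Million"))
print(parse_obs_title("Income (£ GBP Million)"))
print(parse_attribute_title("Observation Status [Income]"))
print(parse_obs_title("Period"))
```

Titles which match neither convention return `None`, so they could fall through to whatever the existing conventions decide.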
We don't need to worry about
- Updating any SPARQL queries.
  - The existing ones should work perfectly well so long as we're outputting RDF compliant with the RDF Data Cube vocabulary. We shouldn't need to add any skipped tests either, as they are not directly related to the proposed approach to creating pivoted CSV-Ws.
- Altering the `pmdutils` CLI or Jenkins pipeline.
  - We're not touching any RDF which is restructured in any part of the CSV-W upload to PMD process.
from csvcubed.
I'm going to close this issue as done, but we still need to spend some time creating the associated tasks in refinement.
An example of a pivoted multi-measure dataset qube-config JSON follows:
    {
      "$schema": "https://purl.org/csv-cubed/qube-config/v1.0",
      "columns": {
        "Period": {
          "from_template": "year"
        },
        "Geography": {
          "label": "ONS LSOA"
        },
        "Average Income per Household": {
          "type": "observations",
          "measure": {
            "label": "Average Income Per household"
          },
          "unit": {
            "label": "GBP"
          }
        },
        "Average floor area of house": {
          "type": "observations",
          "measure": {
            "label": "Average area of house floor"
          },
          "unit": {
            "label": "Square metres"
          }
        }
      }
    }
Note that the user can define basic pivoted cubes using the older minor versions of the qube-config syntax.
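To show how the pivoted shape configured above relates to the canonical shape, here is a small pure-Python sketch which unpivots one row. It is not the implementation of `csvdataset.transform_dataset_to_canonical_shape`. The output column names `Measure`, `Unit` and `Value`, the geography code and the observation values are placeholders invented for the example.

```python
from typing import Dict, List, Tuple


def to_canonical_shape(
    rows: List[Dict[str, object]],
    obs_columns: Dict[str, Tuple[str, str]],
) -> List[Dict[str, object]]:
    """Unpivot rows into one row per observation.

    obs_columns maps each obs val column title to its (measure label, unit label).
    """
    canonical = []
    for row in rows:
        # Everything that is not an obs val column is carried over unchanged.
        dimensions = {k: v for k, v in row.items() if k not in obs_columns}
        for title, (measure, unit) in obs_columns.items():
            canonical.append(
                {**dimensions, "Measure": measure, "Unit": unit, "Value": row[title]}
            )
    return canonical


# Measure and unit labels taken from the config above.
obs_columns = {
    "Average Income per Household": ("Average Income Per household", "GBP"),
    "Average floor area of house": ("Average area of house floor", "Square metres"),
}

# One pivoted row with placeholder values (not real data).
pivoted = [
    {
        "Period": "2021",
        "Geography": "placeholder-lsoa",
        "Average Income per Household": 100.0,
        "Average floor area of house": 50.0,
    }
]

for record in to_canonical_shape(pivoted, obs_columns):
    print(record)
```

Each pivoted row becomes one canonical row per obs val column, which is why the pivoted shape needs fewer rows to hold the same observations.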
See #585 for tasks to action.