gcamdata's Introduction

gcamdata

The increasing data requirements of complex models demand robust, reproducible, and transparent systems to track and prepare models' inputs. The gcamdata R package processes raw inputs to produce the hundreds of XML files needed by the GCAM integrated human-Earth systems model, which is available in its own GitHub repository.

The package is documented in the online manual.

Copyright 2019 Battelle Memorial Institute; see the LICENSE file.

gcamdata's People

Contributors

aarony2j, abigailsnyder, bpbond, clynchy, cwroney, d3y419, enlochner, fengly20, gokuliyer, jhoring, jonathanhuster, kanishkan91, kdorheim, kvcalvin, marideeweber, mbins, nealtg, orourkepr, ouyang363, pkyle, pralitp, realxinzhao, rplzzz, russellhz, siddarthd96, skim301, ssmithclimate, swaldhoff, swd-turner, zarrarkhan

gcamdata's Issues

Redo old-vs-new outputs testing code

It's going to need to look at disk-saved outputs (don't want to re-run driver each time we test).

In addition, it should transform some outputs to match the old form (no X-years; wide versus long forms).

  1. Build
  2. Run driver(write_intermediate_outputs=TRUE)
  3. Test
  4. Test code: if no intermediate_outputs directory, skip (see the sketch below)
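
A minimal sketch of how step 4 might look in testthat (the directory name and driver argument come from the steps above; everything else is an assumption):

library(testthat)

# Sketch: skip old-vs-new comparison tests when the driver's intermediate
# outputs aren't on disk.
test_that("new outputs match old data system outputs", {
  if(!dir.exists("intermediate_outputs")) {
    skip("No intermediate_outputs directory; run driver(write_intermediate_outputs = TRUE) first")
  }
  # ...load the disk-saved outputs, reshape to the old form, and compare here...
})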

Update GCAM wiki directly from data system

Suggestion from Leon: it would be great if the data system could update the wiki documentation automatically, e.g. a table of energy-system default assumptions, etc.

This seems straightforward to do.

Write a sample trace()

trace <- function(object_name, data_system_output, previous_traces = NULL) {
  # Find what produces object_name
  # Print info
  # Recurse (checking against previous_traces)
}
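
A slightly fuller sketch of the recursion, assuming a hypothetical lookup table (precursor_info) with chunk, output, and input columns; none of these names exist yet:

# Hypothetical sketch; precursor_info is an assumed data frame mapping each
# chunk to its declared outputs and inputs.
trace <- function(object_name, precursor_info, previous_traces = NULL) {
  producer <- unique(precursor_info$chunk[precursor_info$output == object_name])
  cat(object_name, "is produced by", producer, "\n")
  previous_traces <- c(previous_traces, object_name)
  inputs <- precursor_info$input[precursor_info$chunk %in% producer]
  for(inp in setdiff(inputs, previous_traces)) {  # don't re-trace anything
    trace(inp, precursor_info, previous_traces)
  }
}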

Pre-create ALL the chunks?

One way to smooth things might be to pre-create all the chunks. They could all declare their dependencies, read their data, and create fake outputs (flagged). That would give Leon et al. a list of finished and unfinished chunks, and make for a killer graphic at the GCAM meeting.

Need some mutable globals

Right? For example: we want to be able to check whether a requested data set is from_file without re-calling chunk_names() every time.
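
In R, a package-local environment is the standard way to get mutable globals; a minimal sketch, assuming chunk_names() is the expensive call being cached:

# Hypothetical sketch of mutable package state via an environment.
.gcamdata_cache <- new.env(parent = emptyenv())

get_from_file_names <- function() {
  if(!exists("from_file", envir = .gcamdata_cache)) {
    # expensive call happens only once; the cached value's shape is an assumption
    assign("from_file", chunk_names(), envir = .gcamdata_cache)
  }
  get("from_file", envir = .gcamdata_cache)
}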

Logfile(s)

A single log file? Log files for each chunk?

Is socioeconomics_inputs necessary?

Currently this chunk reads USDA_GDP_MER.csv, which is only used by one chunk, so there's no need for it to be here. Look at the other socioeconomics chunks; if there's no need for a dedicated input chunk (i.e. no files are used by more than one chunk) then get rid of it.

`gcam-usa_LA100.Socioeconomics_makedata` and FUTURE_YEARS

gcam-usa_LA100.Socioeconomics_makedata is not robust to anything except a 5-year change in FUTURE_YEARS, because the future population data used there are for 2010, 2015, etc. So a 1-year shift in FUTURE_YEARS means it doesn't intersect with the data years, and in turn approx_fun fails (see the code below, currently lines 118-127 in module-gcam-usa.R).

Basically, I think we'd like to interpolate the future population data to a 1-year timestep (see the sketch after the excerpt below).

  # Future population by scenario. Right now just one scenario.
  PRIMA_pop %>%
    # reshape
    gather(year, population, -state) %>%
    mutate(year = as.numeric(year),
           population = as.numeric(population)) %>%
    # interpolate any missing data from end of history into future
    filter(year %in% c(max(HISTORICAL_YEARS), FUTURE_YEARS)) %>%
    group_by(state) %>%
    mutate(population = approx_fun(year, population)) %>%
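
A hedged sketch of the proposed fix: expand each state to an annual year sequence before calling approx_fun, so FUTURE_YEARS no longer has to line up with the 5-year data years (the tidyr::complete and full_seq usage is an assumption about how we'd write it):

  PRIMA_pop %>%
    gather(year, population, -state) %>%
    mutate(year = as.numeric(year),
           population = as.numeric(population)) %>%
    # expand each state to a 1-year timestep...
    complete(state, year = full_seq(year, 1)) %>%
    group_by(state) %>%
    # ...interpolate population onto it...
    mutate(population = approx_fun(year, population)) %>%
    ungroup() %>%
    # ...and then any 1-year shift in FUTURE_YEARS still intersects the data
    filter(year %in% c(max(HISTORICAL_YEARS), FUTURE_YEARS))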

Maddison_population.csv empty column

The socioeconomics/Maddison_population.csv file has an empty column, way out between 2008 and 2030. This causes readr::read_csv to throw a warning. Can we remove this column? Because it has no column name, I can't force a skip.
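
One possible workaround, as a sketch rather than a confirmed fix: readr fills in a placeholder name for an unnamed column (e.g. X12 in older versions, ...12 in newer ones), so the column could be dropped right after reading:

library(dplyr)
library(readr)

# Sketch: drop the unnamed, empty column by its readr placeholder name.
maddison <- read_csv("socioeconomics/Maddison_population.csv") %>%
  select(-matches("^(X|\\.\\.\\.)[0-9]+$"))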

Multiple datasets into same XML

Pralit notes that there are cases in the data system where multiple datasets flow into a single XML, and the ordering is important. We'll need to look at this.

Some GCAM inputs have bad line endings.

For example: A12.U_curves. You can't see it in the GitHub interface, but this file has carriage-return (\r) line endings (as in Mac OS 9 and earlier). It also doesn't have a final newline. Where possible, when we encounter these we should convert to Unix line endings (\n) and add a final newline.

Driver tests

Need more and better tests for the driver!

  • Catches duplicate outputs
  • Catches unmarked file inputs
  • Catches lying chunks (outputs don't match what was declared)
  • Catches being stuck

For all of these we will need to mock some package functions (find_chunks, etc.).
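
A sketch of one such test using testthat::with_mock; chunk_outputs and the driver's error message are hypothetical:

library(testthat)

# Hypothetical sketch: make two chunks promise the same output and check
# that the driver notices.
test_that("driver catches duplicate outputs", {
  with_mock(
    find_chunks = function(...) c("module_A_chunk1", "module_A_chunk2"),
    chunk_outputs = function(...) data.frame(
      chunk = c("module_A_chunk1", "module_A_chunk2"),
      output = c("x", "x"),
      stringsAsFactors = FALSE),
    expect_error(driver(), "duplicate")
  )
})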

Tests and chunk permissions

Ideally chunks operate under highly restrictive permissions: for example, they can't change the units of their data (?). This would be tested by the automatic testing code. Chunks can request that tests be suspended, but doing so flags the issue for code reviewers.

I.e., we want as many restrictions on chunks as we can get. This needs to be thought out and developed more.

How to handle CSV reads

Should chunks 'know' about CSV reads? Or should they just request data (see #5), and if it's not found, get_data looks for a file with that name?
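
A minimal sketch of the second option; find_csv_file is a hypothetical helper that maps a data name to a file path:

# Hypothetical sketch of get_data with a file fallback.
get_data <- function(all_data, name) {
  if(name %in% names(all_data)) {
    return(all_data[[name]])
  }
  # not produced by any chunk: look for a file with that name instead
  fn <- find_csv_file(name)  # assumed helper
  if(is.null(fn)) {
    stop("Couldn't find ", name, " as a chunk output or a file")
  }
  readr::read_csv(fn)
}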

Reading input data at the beginning versus on-the-fly

We set up the driver to pre-read all input data that it can't otherwise find (i.e. anything that no chunk has declared it will provide). This does allow us to quickly find missing data.

But...this means that the chunks, which 'know' their data and could provide hints to the reader function, can't do so. In particular, socioeconomics/Maddison_population.csv has an empty, unnamed column that we'd like not to fix in-file, but rather handle transparently.

Probably need to change this.

'Legacy_name' attribute

Attach the legacy (current data system) name to every data product; then we can give products more sensible names.
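
A one-liner sketch (the attribute name and the example legacy name are placeholders):

# Hypothetical sketch: record the old data system name as an attribute.
add_legacy_name <- function(x, legacy_name) {
  attr(x, "legacy_name") <- legacy_name
  x
}

# e.g. x <- add_legacy_name(x, "L100.some_old_name")  # illustrative name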

How to handle un-done chunks?

  1. They don't exist until created. Easy. Except then we can't run dependency analyses on un-done work, which would be really useful to have as we go.
  2. Create them ALL at the beginning, but mark them as "disabled". Disabled chunks don't get run. When someone works on a chunk, they change its status. This could be done via a driver.STATUS return message, or by just appending "_DISABLED" to the end of chunk names.

More resolved tracking of data provenance

Good suggestion from Steve: have chunks label each of their outputs with the names of all the chunk inputs that contributed to that output. This is easy, and will give us much more resolution when printing data provenance: instead of "X was produced by chunk Y", the system will be able to say "X was produced by Y using A, B, and C" [and then recurse to whoever made A-C]. E.g.

x %>%
  add_provenance(input1, input2) %>%  # adds caller name and names of input1 and input2 to record
  ...
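
A sketch of how add_provenance might capture those names via substitute(); the attribute name is an assumption, and the caller lookup may need adjustment inside magrittr pipes:

# Hypothetical sketch: append the calling chunk's name and the input object
# names to a "provenance" attribute on x.
add_provenance <- function(x, ...) {
  input_names <- sapply(as.list(substitute(list(...)))[-1], deparse)
  caller <- deparse(sys.call(-1)[[1]])  # calling function; fragile inside pipes
  attr(x, "provenance") <- c(attr(x, "provenance"),
                             paste(caller, "using",
                                   paste(input_names, collapse = ", ")))
  x
}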

Driver checks

The driver should:

  • Declaring outputs: verify no overlaps in promised data
  • Only pass required (declared) data to chunks
  • Verify chunks return exactly what was promised (see the sketch below)
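
A sketch of the third check (names here are assumptions):

# Hypothetical sketch: compare what a chunk returned against what it declared.
check_chunk_outputs <- function(chunk, returned, promised) {
  extra <- setdiff(names(returned), promised)
  missing <- setdiff(promised, names(returned))
  if(length(extra) > 0 || length(missing) > 0) {
    stop(chunk, " did not return exactly what it promised.",
         " Extra: ", paste(extra, collapse = ", "),
         " Missing: ", paste(missing, collapse = ", "))
  }
}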

Add tests for add_title, add_units, add_precursors

We also want to have checks that chunks have provided the required documentation on their objects.
We ALSO want to check that their precursors are in fact among their inputs!
Put this logic in return_data or the driver?

  # Check that the chunk has provided required data for all objects
  for(at in c(ATTR_TITLE, ATTR_UNITS, ATTR_COMMENTS, ATTR_PRECURSORS)) {
    for(obj in names(dots)) {
      if(is.null(attr(dots[[obj]], at))) {
        warning("No '", at, "' attached to ", obj)
      }
    }
  }

Documenting input data

There are a number of ways to document input datasets not generated by the data system itself, and how we do it will sometimes depend on the data.

  • Separate file - e.g. xxx.csv accompanied by xxx-metadata.txt
  • Header in data file
  • In-code (if specialized for reading in one particular dataset)

Units!!!

The benefits of forcing units specification are MANY. Think about this.

Internally one could have e.g.

x %>%
  add_units("variable_name", "units") %>%
...

Then later chunks have to assert they know units, Hector-style:

x <- get_data(all_data, "x", units = UNITS_NAME)
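
A sketch of that assertion, assuming units live in an attribute as in the add_units example above:

# Hypothetical sketch: callers must assert the units they expect.
get_data <- function(all_data, name, units = NULL) {
  x <- all_data[[name]]
  if(!is.null(units) && !identical(attr(x, "units"), units)) {
    stop("Units mismatch for ", name, ": expected ", units,
         " but found ", attr(x, "units"))
  }
  x
}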

Ability to 'shim' into driver

Users will want to test out new things. Per @pralitp's suggestion, we could give them the ability to modify data frames as the driver runs. For example, they can say "whenever X gets created, call this user-supplied function to modify it before anything else happens".

The more general point is that we want to make it easy, or at least provide robust tools, to work with the data system.
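
A sketch of one possible mechanism; apply_shims and the user_modifications list are assumptions, not existing driver machinery:

# Hypothetical sketch: the driver would call this on each chunk's outputs,
# applying any user-supplied modification function keyed by object name.
apply_shims <- function(outputs, user_modifications) {
  for(name in intersect(names(outputs), names(user_modifications))) {
    outputs[[name]] <- user_modifications[[name]](outputs[[name]])
  }
  outputs
}

# e.g. "whenever X gets created, modify it before anything else happens":
# shims <- list(X = function(df) dplyr::mutate(df, value = value * 1.1))
# outputs <- apply_shims(outputs, shims)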

Write documentation on creating a chunk

Wiki pages on:

  • Overview of process: check issues, create an issue, do translation, open PR referencing issue
  • Overview of system and how it works
  • Testing: what chunks are tested on and how to do it
  • Creating globals and input files
  • Looking for efficiencies and finding mistakes (tricky)
  • Documentation: functions, code, data
  • Data reformatting and flags

Also:

  • Improve sample-chunk.R and run by group
  • Presentation for next GCAM meeting

Test code and years

Test code should test chunks for different HISTORICAL_YEARS and stuff like that. Ensure nothing breaks.

Abstract away chunk get/return data

I.e., chunks should not access all_data directly, but rather go through `get_data(all_data, "what_i_want")`.

Similarly, they should return their outputs via `return_data(x, y, z)`.
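
A sketch of return_data picking up the object names automatically (an assumption about the eventual implementation):

# Hypothetical sketch: bundle chunk outputs into a list named after the
# variables passed in.
return_data <- function(...) {
  dots <- list(...)
  names(dots) <- sapply(as.list(substitute(list(...)))[-1], deparse)
  dots
}

# return_data(x, y, z) yields list(x = x, y = y, z = z)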

Create progress() function

...that makes dependency graph; gets finished chunks; appends to pre-existing data; graphs date versus chunks colored by module; saves to README graphics folder.

Separate flags from comments

Right now data flags, like LONG_NO_X_FORM, are just dropped in as comments. We should cleanly separate these: flags and comments should use different structures.
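
A sketch of what the separation could look like (attribute names are assumptions):

# Hypothetical sketch: keep machine-readable flags and free-text comments
# under distinct attributes instead of mixing them.
add_flags <- function(x, ...) {
  attr(x, "flags") <- union(attr(x, "flags"), c(...))
  x
}

add_comments <- function(x, ...) {
  attr(x, "comments") <- c(attr(x, "comments"), c(...))
  x
}

# x <- add_flags(x, "LONG_NO_X_FORM")
# x <- add_comments(x, "reshaped from wide form")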
