gcamdata's Introduction

gcamdata

The increasing data requirements of complex models demand robust, reproducible, and transparent systems to track and prepare models' inputs. The gcamdata R package processes raw inputs to produce the hundreds of XML files needed by the GCAM integrated human-Earth systems model, which is available in its own GitHub repository.

The package is documented in the online manual.

Copyright 2019 Battelle Memorial Institute; see the LICENSE file.

gcamdata's People

Contributors

aarony2j, abigailsnyder, bpbond, clynchy, cwroney, d3y419, enlochner, fengly20, gokuliyer, jhoring, jonathanhuster, kanishkan91, kdorheim, kvcalvin, marideeweber, mbins, nealtg, orourkepr, ouyang363, pkyle, pralitp, realxinzhao, rplzzz, russellhz, siddarthd96, skim301, ssmithclimate, swaldhoff, swd-turner, zarrarkhan

gcamdata's Issues

Redo old-vs-new outputs testing code

It's going to need to look at disk-saved outputs (don't want to re-run driver each time we test).

In addition, it should transform some outputs to match the old form (no X-years; wide versus long forms).

  1. Build
  2. Run driver(write_intermediate_outputs=TRUE)
  3. Test
  4. Test code: if no intermediate_outputs directory, skip (see the sketch below)
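
A minimal sketch of how step 4 might look in testthat (the directory name and driver argument come from the steps above; everything else is an assumption):

library(testthat)

# Sketch: skip old-vs-new comparison tests when the driver's intermediate
# outputs aren't on disk.
test_that("new outputs match old data system outputs", {
  if(!dir.exists("intermediate_outputs")) {
    skip("No intermediate_outputs directory; run driver(write_intermediate_outputs = TRUE) first")
  }
  # ...load the disk-saved outputs, reshape to the old form, and compare here...
})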

Update GCAM wiki directly from data system

Suggestion from Leon: it would be great if the data system could update the wiki documentation automatically, e.g. a table of energy-system default assumptions, etc.

This seems straightforward to do.

Write a sample trace()

trace <- function(object_name, data_system_output, previous_traces = NULL) {
  # Find what produces object_name
  # Print info
  # Recurse (checking against previous_traces)
}
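
A slightly fuller sketch of the recursion, assuming a hypothetical lookup table (precursor_info) with chunk, output, and input columns; none of these names exist yet:

# Hypothetical sketch; precursor_info is an assumed data frame mapping each
# chunk to its declared outputs and inputs.
trace <- function(object_name, precursor_info, previous_traces = NULL) {
  producer <- unique(precursor_info$chunk[precursor_info$output == object_name])
  cat(object_name, "is produced by", producer, "\n")
  previous_traces <- c(previous_traces, object_name)
  inputs <- precursor_info$input[precursor_info$chunk %in% producer]
  for(inp in setdiff(inputs, previous_traces)) {  # don't re-trace anything
    trace(inp, precursor_info, previous_traces)
  }
}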

Pre-create ALL the chunks?

One way to smooth things might be to pre-create all the chunks. They could all declare their dependencies, read their data, and create fake outputs (flagged). That would give Leon et al. a list of finished and unfinished chunks, and make for a killer graphic at the GCAM meeting.

Need some mutable globals

Right? For example: we want to be able to check whether a requested data set is from_file without re-calling chunk_names() every time.
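
In R, a package-local environment is the standard way to get mutable globals; a minimal sketch, assuming chunk_names() is the expensive call being cached:

# Hypothetical sketch of mutable package state via an environment.
.gcamdata_cache <- new.env(parent = emptyenv())

get_from_file_names <- function() {
  if(!exists("from_file", envir = .gcamdata_cache)) {
    # expensive call happens only once; the cached value's shape is an assumption
    assign("from_file", chunk_names(), envir = .gcamdata_cache)
  }
  get("from_file", envir = .gcamdata_cache)
}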

Logfile(s)

A single log file? Log files for each chunk?

Is socioeconomics_inputs necessary?

Currently this chunk reads USDA_GDP_MER.csv, which is only used by one chunk, so there's no need for it to be here. Look at the other socioeconomics chunks; if there's no need for a dedicated input chunk (i.e. no files are used by more than one chunk) then get rid of it.

`gcam-usa_LA100.Socioeconomics_makedata` and FUTURE_YEARS

gcam-usa_LA100.Socioeconomics_makedata is not robust to anything except a 5-year change in FUTURE_YEARS, because the future population data used there are for 2010, 2015, etc. So a 1-year shift in FUTURE_YEARS means it doesn't intersect with the data years, and in turn approx_fun fails (see the code below, currently lines 118-127 in module-gcam-usa.R).

Basically, I think we'd like to interpolate the future population data to a 1-year timestep (see the sketch after the excerpt below).

  # Future population by scenario. Right now just one scenario.
  PRIMA_pop %>%
    # reshape
    gather(year, population, -state) %>%
    mutate(year = as.numeric(year),
           population = as.numeric(population)) %>%
    # interpolate any missing data from end of history into future
    filter(year %in% c(max(HISTORICAL_YEARS), FUTURE_YEARS)) %>%
    group_by(state) %>%
    mutate(population = approx_fun(year, population)) %>%
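
A hedged sketch of the proposed fix: expand each state to an annual year sequence before calling approx_fun, so FUTURE_YEARS no longer has to line up with the 5-year data years (the tidyr::complete and full_seq usage is an assumption about how we'd write it):

  PRIMA_pop %>%
    gather(year, population, -state) %>%
    mutate(year = as.numeric(year),
           population = as.numeric(population)) %>%
    # expand each state to a 1-year timestep...
    complete(state, year = full_seq(year, 1)) %>%
    group_by(state) %>%
    # ...interpolate population onto it...
    mutate(population = approx_fun(year, population)) %>%
    ungroup() %>%
    # ...and then any 1-year shift in FUTURE_YEARS still intersects the data
    filter(year %in% c(max(HISTORICAL_YEARS), FUTURE_YEARS))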

Maddison_population.csv empty column

The socioeconomics/Maddison_population.csv file has an empty column, way out between 2008 and 2030. This causes readr::read_csv to throw a warning. Can we remove this column? Because it has no column name, I can't force a skip.
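
One possible workaround, as a sketch rather than a confirmed fix: readr fills in a placeholder name for an unnamed column (e.g. X12 in older versions, ...12 in newer ones), so the column could be dropped right after reading:

library(dplyr)
library(readr)

# Sketch: drop the unnamed, empty column by its readr placeholder name.
maddison <- read_csv("socioeconomics/Maddison_population.csv") %>%
  select(-matches("^(X|\\.\\.\\.)[0-9]+$"))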

Multiple datasets into same XML

Pralit notes that there are cases in the data system where multiple datasets flow into a single XML, and the ordering is important. We'll need to look at this.

Some GCAM inputs have bad line endings.

For example: A12.U_curves. You can't see it in the GitHub interface, but this file has carriage-return (\r) line endings (as in Mac OS 9 and earlier). It also doesn't have a final newline. Where possible, when we encounter these we should convert to Unix line endings (\n) and add a final newline.

Driver tests

Need more and better tests for the driver!

  • Catches duplicate outputs
  • Catches unmarked file inputs
  • Catches lying chunks (outputs don't match what was declared)
  • Catches being stuck

For all of these we will need to mock some package functions (find_chunks, etc.).
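
A sketch of one such test using testthat::with_mock; chunk_outputs and the driver's error message are hypothetical:

library(testthat)

# Hypothetical sketch: make two chunks promise the same output and check
# that the driver notices.
test_that("driver catches duplicate outputs", {
  with_mock(
    find_chunks = function(...) c("module_A_chunk1", "module_A_chunk2"),
    chunk_outputs = function(...) data.frame(
      chunk = c("module_A_chunk1", "module_A_chunk2"),
      output = c("x", "x"),
      stringsAsFactors = FALSE),
    expect_error(driver(), "duplicate")
  )
})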

Tests and chunk permissions

Ideally chunks operate under highly restrictive permissions: for example, they can't change the units of their data (?). This would be tested by the automatic testing code. Chunks can request that tests be suspended, but doing so flags the issue for code reviewers.

I.e., we want as many restrictions on chunks as we can get. This needs to be thought out and developed more.

How to handle CSV reads

Should chunks 'know' about CSV reads? Or should they just request data (see #5), and if it's not found, get_data looks for a file with that name?
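
A minimal sketch of the second option; find_csv_file is a hypothetical helper that maps a data name to a file path:

# Hypothetical sketch of get_data with a file fallback.
get_data <- function(all_data, name) {
  if(name %in% names(all_data)) {
    return(all_data[[name]])
  }
  # not produced by any chunk: look for a file with that name instead
  fn <- find_csv_file(name)  # assumed helper
  if(is.null(fn)) {
    stop("Couldn't find ", name, " as a chunk output or a file")
  }
  readr::read_csv(fn)
}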

Reading input data at the beginning versus on-the-fly

We set up the driver to pre-read all input data that it can't otherwise find (i.e. anything that no chunk has declared it will provide). This does allow us to quickly find missing data.

But...this means that the chunks, which 'know' their data and could provide hints to the reader function, can't do so. In particular, socioeconomics/Maddison_population.csv has an empty, unnamed column that we'd like not to fix in-file, but rather handle transparently.

Probably need to change this.

'Legacy_name' attribute

Attach the legacy (current data system) name to every data product; then we can give products more sensible names.
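
A one-liner sketch (the attribute name and the example legacy name are placeholders):

# Hypothetical sketch: record the old data system name as an attribute.
add_legacy_name <- function(x, legacy_name) {
  attr(x, "legacy_name") <- legacy_name
  x
}

# e.g. x <- add_legacy_name(x, "L100.some_old_name")  # illustrative name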

How to handle un-done chunks?

  1. They don't exist until created. Easy. Except then we can't run dependency analyses on un-done work, which would be really useful to have as we go.
  2. Create them ALL at the beginning, but mark them as "disabled". Disabled chunks don't get run. When someone works on a chunk, they change its status. This could be done via a driver.STATUS return message, or by just appending "_DISABLED" to the end of chunk names.

More resolved tracking of data provenance

Good suggestion from Steve: have chunks label each of their outputs with the names of all the chunk inputs that contributed to that output. This is easy, and will give us much more resolution when printing data provenance: instead of "X was produced by chunk Y", the system will be able to say "X was produced by Y using A, B, and C" [and then recurse to whoever made A-C]. E.g.

x %>%
  add_provenance(input1, input2) %>%  # adds caller name and names of input1 and input2 to record
  ...
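
A sketch of how add_provenance might capture those names via substitute(); the attribute name is an assumption, and the caller lookup may need adjustment inside magrittr pipes:

# Hypothetical sketch: append the calling chunk's name and the input object
# names to a "provenance" attribute on x.
add_provenance <- function(x, ...) {
  input_names <- sapply(as.list(substitute(list(...)))[-1], deparse)
  caller <- deparse(sys.call(-1)[[1]])  # calling function; fragile inside pipes
  attr(x, "provenance") <- c(attr(x, "provenance"),
                             paste(caller, "using",
                                   paste(input_names, collapse = ", ")))
  x
}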

Driver checks

The driver should:

  • Declaring outputs: verify no overlaps in promised data
  • Only pass required (declared) data to chunks
  • Verify chunks return exactly what was promised (see the sketch below)
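
A sketch of the third check (names here are assumptions):

# Hypothetical sketch: compare what a chunk returned against what it declared.
check_chunk_outputs <- function(chunk, returned, promised) {
  extra <- setdiff(names(returned), promised)
  missing <- setdiff(promised, names(returned))
  if(length(extra) > 0 || length(missing) > 0) {
    stop(chunk, " did not return exactly what it promised.",
         " Extra: ", paste(extra, collapse = ", "),
         " Missing: ", paste(missing, collapse = ", "))
  }
}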

Add tests for add_title, add_units, add_precursors

We also want to have checks that chunks have provided the required documentation on their objects.
We ALSO want to check that their precursors are in fact among their inputs!
Put this logic in return_data or the driver?

  # Check that the chunk has provided required data for all objects
  for(at in c(ATTR_TITLE, ATTR_UNITS, ATTR_COMMENTS, ATTR_PRECURSORS)) {
    for(obj in names(dots)) {
      if(is.null(attr(dots[[obj]], at))) {
        warning("No '", at, "' attached to ", obj)
      }
    }
  }

Documenting input data

There are a number of ways to document input datasets not generated by the data system itself, and how we do it will sometimes depend on the data.

  • Separate file - e.g. xxx.csv accompanied by xxx-metadata.txt
  • Header in data file
  • In-code (if specialized for reading in one particular dataset)

Units!!!

The benefits of forcing units specification are MANY. Think about this.

Internally one could have e.g.

x %>%
  add_units("variable_name", "units") %>%
...

Then later chunks have to assert they know units, Hector-style:

x <- get_data(all_data, "x", units = UNITS_NAME)
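
A sketch of that assertion, assuming units live in an attribute as in the add_units example above:

# Hypothetical sketch: callers must assert the units they expect.
get_data <- function(all_data, name, units = NULL) {
  x <- all_data[[name]]
  if(!is.null(units) && !identical(attr(x, "units"), units)) {
    stop("Units mismatch for ", name, ": expected ", units,
         " but found ", attr(x, "units"))
  }
  x
}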

Ability to 'shim' into driver

Users will want to test out new things. Per @pralitp's suggestion, we could give them the ability to modify data frames as the driver runs. For example, they can say "whenever X gets created, call this user-supplied function to modify it before anything else happens".

The more general point is that we want to make it easy, or at least provide robust tools, to work with the data system.
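
A sketch of one possible mechanism; apply_shims and the user_modifications list are assumptions, not existing driver machinery:

# Hypothetical sketch: the driver would call this on each chunk's outputs,
# applying any user-supplied modification function keyed by object name.
apply_shims <- function(outputs, user_modifications) {
  for(name in intersect(names(outputs), names(user_modifications))) {
    outputs[[name]] <- user_modifications[[name]](outputs[[name]])
  }
  outputs
}

# e.g. "whenever X gets created, modify it before anything else happens":
# shims <- list(X = function(df) dplyr::mutate(df, value = value * 1.1))
# outputs <- apply_shims(outputs, shims)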

Write documentation on creating a chunk

Wiki pages on:

  • Overview of process: check issues, create an issue, do translation, open PR referencing issue
  • Overview of system and how it works
  • Testing: what chunks are tested on and how to do it
  • Creating globals and input files
  • Looking for efficiencies and finding mistakes (tricky)
  • Documentation: functions, code, data
  • Data reformatting and flags

Also:

  • Improve sample-chunk.R and run by group
  • Presentation for next GCAM meeting

Test code and years

Test code should test chunks for different HISTORICAL_YEARS and stuff like that. Ensure nothing breaks.

Abstract away chunk get/return data

I.e., chunks should not access all_data directly, but rather go through `get_data(all_data, "what_i_want")`.

Similarly, they should return their outputs via `return_data(x, y, z)`.
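
A sketch of return_data picking up the object names automatically (an assumption about the eventual implementation):

# Hypothetical sketch: bundle chunk outputs into a list named after the
# variables passed in.
return_data <- function(...) {
  dots <- list(...)
  names(dots) <- sapply(as.list(substitute(list(...)))[-1], deparse)
  dots
}

# return_data(x, y, z) yields list(x = x, y = y, z = z)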

Create progress() function

...that makes dependency graph; gets finished chunks; appends to pre-existing data; graphs date versus chunks colored by module; saves to README graphics folder.

Separate flags from comments

Right now data flags, like LONG_NO_X_FORM, are just dropped in as comments. We should cleanly separate these: flags and comments should use different structures.
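
A sketch of what the separation could look like (attribute names are assumptions):

# Hypothetical sketch: keep machine-readable flags and free-text comments
# under distinct attributes instead of mixing them.
add_flags <- function(x, ...) {
  attr(x, "flags") <- union(attr(x, "flags"), c(...))
  x
}

add_comments <- function(x, ...) {
  attr(x, "comments") <- c(attr(x, "comments"), c(...))
  x
}

# x <- add_flags(x, "LONG_NO_X_FORM")
# x <- add_comments(x, "reshaped from wide form")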
