cli's Introduction

Overview

The open_FRED command line interface and other tools for working with databases containing open_FRED data.

Installation

It's probably best to install the package into a virtual environment. Once you've created a virtualenv, activate it and install the package into it via:

pip install open_FRED-cli

You can also install the in-development version with:

pip install https://github.com/open-fred/cli/archive/master.zip

Once you've done either, make yourself familiar with the program by running:

open_FRED --help

Documentation

https://cli.readthedocs.io/

Development

To run all the tests, run:

tox

Note: to combine the coverage data from all the tox environments, run:

Windows:

set PYTEST_ADDOPTS=--cov-append
tox

Other:

PYTEST_ADDOPTS=--cov-append tox

cli's People

Contributors

  • gnn

Stargazers

  • Johannes

Watchers

  • James Cloos
  • Ludwig Hülk

cli's Issues

Don't drop schema if no `-d`/`--drop` specified

Due to the way click.Choice defaults work, the schema is currently dropped even when no -d/--drop option is specified. This should be changed, even if that means that these options no longer have defaults.
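
A minimal sketch of the intended behaviour, assuming the option is declared with click.Choice; the option and command names here are illustrative, not the actual open_FRED source:

import click

@click.command()
@click.option(
    "-d", "--drop",
    type=click.Choice(["schema", "tables"]),
    default=None,  # no default choice, hence no implicit drop
)
def setup(drop):
    # Only drop when the user explicitly asked for it.
    if drop is not None:
        click.echo("Dropping the {} first.".format(drop))
    click.echo("Creating schema and tables.")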

Rewrite nested for loops to list comprehensions

The huge for loop processing a NetCDF file is incredibly clunky and should be rewritten to use list comprehensions/generators. Not only would this make the code more readable, it could also improve performance a little bit.
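
A rough sketch of the idea, with hypothetical names; itertools.product collapses the nested loops over the variable's dimensions into a single generator:

from itertools import product

def cells(variable):
    """Yield one (index, value) pair per cell of a NetCDF variable."""
    indices = product(*(range(length) for length in variable.shape))
    return ((index, variable[index]) for index in indices)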

Add an option to move NetCDF files after successful import

It would be good to have the option to move NetCDF files out of the imported directory hierarchy after a successful import. This would enable a very coarse variant of incremental imports: one could just stop an ongoing import and restart it with the same command line, since only files not yet imported would remain at the import location.
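
A minimal sketch of what the option could do, assuming a hypothetical done_directory destination:

import shutil
from pathlib import Path

def move_imported(path, done_directory):
    """Move a successfully imported NetCDF file out of the import hierarchy."""
    target = Path(done_directory)
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target / Path(path).name))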

Put default/special values into dimension tables

There should be special values signifying a location that matches everywhere or a timestamp that matches every time. This is especially important for timestamps, since a solution for this is necessary to fix #5.
Whether these should be special NULL rows or, e.g., a timestamp that ranges from -∞ to +∞ still has to be decided.
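
If the database is PostgreSQL, the second variant could lean on its built-in special timestamp values; a sketch, assuming hypothetical table and column names:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///openFRED")  # assumed connection string
with engine.begin() as connection:
    # PostgreSQL accepts '-infinity' and 'infinity' as timestamp literals.
    connection.execute(text(
        'INSERT INTO "timestamp" (start, stop) '
        "VALUES ('-infinity', 'infinity')"
    ))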

Display progress correctly in the presence of masked values

Some values are masked, representing missing data. These values are not imported into the database at all, but when calculating how many values to import they are still taken into account, as this calculation doesn't actually loop over all values but just multiplies dimension lengths. This results in an import that finishes before the progress bar reaches 100%.
As it might take too long to loop over all values just to check whether they are masked or not, this might actually not get fixed at all. I'm still on the fence about this. I also kind of like being able to see (roughly) how many values were masked.
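
If it does get fixed, no Python-level loop should be necessary: numpy masked arrays know how many of their values are unmasked. A sketch, assuming the variable's data fits in memory:

import numpy.ma as ma

def values_to_import(variable):
    data = variable[:]  # netCDF4 returns a masked array when values are masked
    return data.count() if ma.isMaskedArray(data) else data.size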

Handle datasets/variables which are constant wrt. time

Currently, NetCDF Datasets which are constant with respect to time, i.e. don't have a time dimension specified, are skipped. Of course these should also be handled correctly. As dimension positions are now figured out dynamically, it actually shouldn't be too hard to handle this case by either not giving these values a timestamp, or giving them one that ranges from minus infinity to plus infinity.

What's even more important is the fact that the way these variables are detected is really hackish and doesn't actually check any dimensions, but checks for the number of variables occurring before spatial variables. This is a really brittle criterion which is very specific to our data and might break at any time.
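
A less brittle check, sketched under the assumption that the time dimension is actually named "time" in the files:

def is_constant_in_time(variable):
    # netCDF4 variables carry the names of their dimensions, so there is
    # no need to infer anything from variable ordering.
    return "time" not in variable.dimensions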

Add timezone information to timestamps

The timestamp values stored in the database currently don't have timezone information stored with them. This should be changed so that the timestamp fields are created with timezone information in mind when setting up the database. In order for this to work, I have to figure out whether the time information contained in the NetCDF files contains explicit timezone information or whether an implicit timezone is assumed. This timezone information can then be stored along with the timestamp values when importing the data.
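
A sketch of how the conversion could look, assuming the files carry CF-style "units" attributes and that the reference time is UTC; whether that assumption holds for the open_FRED files is exactly what still needs to be verified:

from datetime import timezone
from netCDF4 import num2date

def timestamps(time_variable):
    dates = num2date(
        time_variable[:],
        units=time_variable.units,
        only_use_cftime_datetimes=False,
        only_use_python_datetimes=True,
    )
    return [date.replace(tzinfo=timezone.utc) for date in dates]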

Delete requirement of oemof package

The open_FRED cli requires oemof (see setup.py), but as far as I can see it's not imported in any module.

Did I miss anything? Would you be fine with deleting this requirement from the open_FRED cli?

The reason I came across this is that in pvcompare we require feedinlib (which requires the open_FRED cli) and the new oemof packages (solph, network, ...). This leads to a dependency issue ("oemof 0.3.2 requires Pyomo XY ..."), which is confusing as oemof 0.3.2 is not needed by any of the packages.

Improve code deciding which NetCDF variables to import into the `variable` table

See #5 for the whole picture. Currently the code deciding which variables to import into the database is really brittle. What should actually be done is taking the difference between Dataset.variables.keys() and Dataset.dimensions.keys(), while additionally skipping special variables which are not dimensions, e.g. 'lat', 'lon', 'rotated_pole' and 'time_bnds'.
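
That criterion translates almost directly into code; the skip set below is taken verbatim from the description above:

SPECIAL = {"lat", "lon", "rotated_pole", "time_bnds"}

def importable_variables(dataset):
    # Everything that is a variable but neither a dimension nor special.
    return set(dataset.variables.keys()) - set(dataset.dimensions.keys()) - SPECIAL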

NetCDF import is way too slow

Currently it takes ages to import even a single NetCDF file. In order to fix this, I have to rewrite parts of the importer to use bulk inserts instead of going through the ORM for everything.
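
A sketch of the direction the rewrite could take, using SQLAlchemy Core executemany-style inserts instead of the ORM; the table name and row layout are assumptions:

from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("postgresql:///openFRED")  # assumed connection string
metadata = MetaData()
values = Table("value", metadata, autoload_with=engine)  # assumed table name

def bulk_insert(rows, chunk=10_000):
    """Insert lists of column-value dictionaries in large batches."""
    with engine.begin() as connection:
        for start in range(0, len(rows), chunk):
            connection.execute(values.insert(), rows[start:start + chunk])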

Variable time_bnds not found in netCDF file

Trying to upload a data file, I got a "Variable time_bnds not found" error importing oF_00625_MERRA2.WSS_160M.2002_02.DEplus.nc.
This seems to apply to other files in the dataset as well. Do you know how to fix this?

Enable dropping schema/tables when setting up the database

Currently, setting up the database doesn't work if the tables to be created already exist in the database with a structure that conflicts with what's going to be created. Also, sometimes one wants to recreate the database making sure there is no old data lingering around, so that it's clear that all data adheres to the newly created structure. For this reason it would be convenient to be able to automatically drop the schema, or the tables which will get created in it, before creating the new schema or tables. At the very least it would be a convenience feature, so that one doesn't have to manually issue a drop command in a different application.
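
A minimal sketch, assuming the tables are declared on a SQLAlchemy MetaData object:

from sqlalchemy import create_engine

engine = create_engine("postgresql:///openFRED")  # assumed connection string

def set_up(metadata, drop=False):
    # drop_all only touches the tables known to `metadata` and silently
    # skips ones that don't exist yet.
    if drop:
        metadata.drop_all(engine)
    metadata.create_all(engine)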

Enable passing database configuration parameters as command line options

Currently the script figures out which database to use, how to connect to it and which schema to use via the configuration it finds in oemof.db's configuration file under the (hardcoded) [openFRED] section.
It should be possible to pass the section to use, as well as particular entries, on the command line.
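
A sketch of what the interface could look like; the option names are illustrative, not the current implementation:

import click

@click.command()
@click.option("--section", default="openFRED",
              help="Configuration file section to read database settings from.")
@click.option("--database-url", default=None,
              help="Overrides the connection settings from the configuration file.")
def main(section, database_url):
    click.echo("Using section [{}]".format(section))
    if database_url:
        click.echo("Overridden by {}".format(database_url))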

Handle categorical data correctly

Some variables, e.g. SOILTYP, are categorical, i.e. their values are actually string labels. These values are still stored as numbers in the NetCDF files, but they have additional attributes:

  • flag_values which is an array of the integers used as variable values and
  • flag_meanings which is a single string containing the string labels delimited with spaces.

To get a dictionary mapping flag_values to flag_meanings one can simply use

dict(zip(variable.flag_values, variable.flag_meanings.split(' ')))

Storing these values correctly in the database is a bit harder though, since currently only numeric values are supported. I guess I have to resort to joined table inheritance and a small class hierarchy for values to solve this.
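
A minimal sketch of that class hierarchy using SQLAlchemy's joined table inheritance; the class and table names are illustrative:

from sqlalchemy import Column, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Value(Base):
    __tablename__ = "value"
    id = Column(Integer, primary_key=True)
    type = Column(String)
    __mapper_args__ = {"polymorphic_identity": "value", "polymorphic_on": type}

class NumericValue(Value):
    __tablename__ = "numeric_value"
    id = Column(Integer, ForeignKey("value.id"), primary_key=True)
    value = Column(Float)
    __mapper_args__ = {"polymorphic_identity": "numeric"}

class CategoricalValue(Value):
    __tablename__ = "categorical_value"
    id = Column(Integer, ForeignKey("value.id"), primary_key=True)
    label = Column(String)
    __mapper_args__ = {"polymorphic_identity": "categorical"}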
