pascallesage / presamples Goto Github PK

Package to write, load, manage and verify numerical arrays, called presamples.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

presamples's Introduction

presamples

Package to write, load, manage and verify numerical arrays, called presamples.

Initially written for scenario analysis and for the reuse of sampled data in Monte Carlo Simulations in the Brightway LCA framework.

However, the presamples package is software-generic and built on the datapackage standard by Open Knowledge Foundation.

Presamples are useful anytime values for named parameters or matrix elements need to be saved and reused.

Installation: Via pip or conda (conda install --channel pascallesage presamples).

Documentation: We are in the process of writing better documentation.

To read our documentation "under construction", go to our readthedocs page.
If you can't find what you are looking for, you can also try the Jupyter Notebook here


presamples's Issues

Losses override production amounts

Because presamples replaces values in the technosphere matrix, a loss recorded as a technosphere exchange (where input == output) and included in a presamples package overwrites the corresponding matrix element. That element should actually hold production minus loss.

I can't think of a solution for the loader, so extra care should be taken when generating these presamples (i.e. the presamples should be samples of production minus samples of losses).

For now, a warning should be issued when matrix_data is supplied to create_presamples_package with an element where input==output.
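A minimal sketch of such a check, assuming index rows shaped like the (input, output, type) tuples used elsewhere in these issues. The function name `warn_on_self_exchanges` is hypothetical and is not part of presamples.

```python
import warnings


def warn_on_self_exchanges(indices):
    """Warn for each (input, output, type) row where input == output.

    Returns the offending rows so that callers can inspect them.
    """
    offending = [row for row in indices if row[0] == row[1]]
    for row in offending:
        warnings.warn(
            "Exchange {} has input == output; its samples will replace the "
            "production amount in the technosphere matrix.".format(row[0]),
            UserWarning,
        )
    return offending


indices = [
    (("db", "A"), ("db", "A"), "technosphere"),  # a loss: input == output
    (("db", "B"), ("db", "A"), "technosphere"),
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_on_self_exchanges(indices)
```

The check only warns and never modifies the data, which matches the "for now" scope of the proposal.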

Make sure everything including tests work if bw2data is not installed

`MatrixPresamples` should return samples in order if a "magic" seed is passed

Seed could even be the string "ordered" instead of an integer.

Presamples are randomly selected, which is not useful for use cases where the order of parameters is important (e.g. some sensitivity analysis, time series perhaps).
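The idea could look roughly like the sketch below. `SketchIndexer` is a hypothetical stand-in, not the presamples indexer, and the magic value "ordered" is the one suggested above.

```python
import random


class SketchIndexer:
    """Yield column indices either in order or at random.

    Passing seed="ordered" switches to in-order selection (assumption:
    this is the "magic" seed proposed in the issue).
    """

    def __init__(self, ncols, seed=None):
        self.ncols = ncols
        self.ordered = seed == "ordered"
        self.count = 0
        self.rng = None if self.ordered else random.Random(seed)

    def __iter__(self):
        return self

    def __next__(self):
        if self.ordered:
            index = self.count % self.ncols  # wrap around after the last column
        else:
            index = self.rng.randrange(self.ncols)
        self.count += 1
        return index


ordered = SketchIndexer(ncols=3, seed="ordered")
first_five = [next(ordered) for _ in range(5)]
```

In ordered mode the first index returned is 0, so the first sample used is the first column of the array.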

Presample generator function: Normal Monte Carlo samples for an entire database

errors in quickstart

I'm following the quickstart code of the docs, and copy pasted everything so I'm pretty sure I made no mistakes.

This section gives errors with:
ag_loader.parameters['land [ha]']

TypeError: list indices must be integers or slices, not str

ag_loader.parameters.consolidated_array

AttributeError: 'list' object has no attribute 'consolidated_array'

ag_loader.parameters.names

AttributeError: 'list' object has no attribute 'names'

I haven't tried any further because of the errors.
Please let me know how to resolve this

Generic presample types

Currently we allow a few types of presamples, hardcoded in the source [example].

However, this won't work for extensions like regionalized LCA that define their own matrices. We should investigate a way to specify the matrix name in the datapackage and make the code more generic.

Create exchanges with bounded random parameters and fixed sum to be used in Monte Carlo

Hi Pascal,
I am looking for a way to create exchanges with bounded random parameters and fixed sum to be used in Monte Carlo, see this stackoverflow post for details.
Chris told me you are working on something similar here. Is something to deal with my issue already implemented or there is anything similar I could start from?
If not yet there and you find it useful I would be happy to contribute in developing it.
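One possible starting point, offered only as a sketch and not as something presamples provides: rejection sampling on shares drawn uniformly from the simplex. The function name and arguments are hypothetical.

```python
import random


def sample_bounded_fixed_sum(lower, upper, total, rng, max_tries=10000):
    """Draw one vector with lower[i] <= x[i] <= upper[i] and sum(x) == total.

    Shares are drawn uniformly on the simplex (normalised exponential
    draws), scaled to `total`, and the first draw that respects every
    bound is returned.
    """
    for _ in range(max_tries):
        draws = [rng.expovariate(1.0) for _ in lower]
        scale = total / sum(draws)
        candidate = [d * scale for d in draws]
        if all(lo <= x <= hi for lo, x, hi in zip(lower, candidate, upper)):
            return candidate
    raise RuntimeError("No admissible sample found; bounds may be too tight.")


rng = random.Random(0)
sample = sample_bounded_fixed_sum(
    lower=[0.1, 0.1, 0.1], upper=[0.8, 0.8, 0.8], total=1.0, rng=rng
)
```

Rejection sampling becomes slow when the bounds are tight, so this is only reasonable for loosely bounded exchanges.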

Presample generator function for case: select one input at a time

Should also have a real example

Improve documentation

Review, edit, and make sure the following sections are present:

Introduction and use cases
How to generate presamples
Data format for different kinds of presamples
How to specify presamples for a new matrix (dtype, row_formatter, metadata)
How to use presamples

Add description

Parameter group names must be valid python variable names

Can't have spaces or other nonsense
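A check along these lines could be as small as the following sketch. The helper name `is_valid_group_name` is hypothetical.

```python
import keyword


def is_valid_group_name(name):
    """Return True if `name` could be used as a Python variable name."""
    return (
        isinstance(name, str)
        and name.isidentifier()
        and not keyword.iskeyword(name)
    )
```

For example, "land_use" passes, while "land use", "2nd_group" and "class" are rejected.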

parameters mapping faulty when more than one presamples package with parameters passed

See here

The indices start over from 0 for every name_list:

import numpy as np
import presamples as ps

arr1 = np.random.random(size=(2, 10))
arr2 = np.random.random(size=(2, 10))
names1 = ['a', 'b']
names2 = ['c', 'd']
label1 = 'one'
label2 = 'two'

pp_id, pp_fp = ps.create_presamples_package(
    parameter_data=[(arr1, names1, label1), (arr2, names2, label2)]
)
loader = ps.PackagesDataLoader([pp_fp])
params = loader.parameters[0]
params.mapping

{'a': 0, 'b': 1, 'c': 0, 'd': 1}

params.values()

array([ 0.91739674, 0.89426511, 0.68989611, 0.44477301])
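The expected behaviour can be sketched with a running offset, so that each name list continues where the previous one stopped. `build_mapping` is a hypothetical helper, not the presamples implementation.

```python
def build_mapping(name_lists):
    """Map every parameter name to its row in the stacked sample array."""
    mapping = {}
    offset = 0
    for names in name_lists:
        for position, name in enumerate(names):
            mapping[name] = offset + position
        offset += len(names)  # the next list starts after this one
    return mapping


mapping = build_mapping([["a", "b"], ["c", "d"]])
```

With the two name lists from the example, 'c' and 'd' map to rows 2 and 3 instead of restarting at 0.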

Presample generator function for case: time series data

Should also have a real example

Missing args to `save_presamples` in `ParameterizedBrightwayModel`

Specifically, there is no way to pass seed or overwrite.

Asymmetry in matrix_data and parameter_metadata in PackagesDataLoader not necessary

When loading data from presample package paths, data is collected on both matrix resources and parameter resources.
However, when appending extracted information to matrix_data, all data is passed, including parameter-metadata, which is superfluous.
On the other hand, when appending parameter-data to self.parameter_metadata, only parameter resource data is passed along with some other package level data (path, name (renamed to package_name), and indexer (renamed to sample_indexer)), but not all (id, seed, ncols).

There is no need for such asymmetry.

I recommend (untested for now):

1. Removing parameter_metadata from matrix_data, by adding self.matrix_data[-1].pop('parameter_metadata', None) right after appending the loaded section to matrix_data
2. Treating parameter data extraction like matrix_data extraction in load_data, and similarly deleting matrix_data from the returned data before appending to parameter_metadata.
3. Renaming parameter_metadata to parameter_data

Steps 2 and 3 will have effects on other parts of the package

hash should be for data, not data location

The hash included in the presamples package is for the location of the data on disk. We eventually want to have data stored off-disk (see https://github.com/PascalLesage/brightway2-presamples/issues/7).
Since what we want to ensure is that the actual data passed is not corrupt, it would be preferable to hash the data (samples, parameter names, etc).
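A rough sketch of hashing the data itself, using only the standard library. The function name and the choice of SHA-256 are assumptions; samples are represented here as plain lists of floats instead of numpy arrays.

```python
import hashlib
import json
from array import array


def hash_presample_data(samples, names):
    """Return a SHA-256 hex digest of sample values and parameter names.

    `samples` is a list of rows of numbers; `names` is a list of strings.
    """
    digest = hashlib.sha256()
    # Include the row lengths so that differently shaped data cannot collide.
    digest.update(json.dumps([len(row) for row in samples]).encode("utf-8"))
    for row in samples:
        digest.update(array("d", row).tobytes())  # raw float64 bytes
    digest.update(json.dumps(names).encode("utf-8"))
    return digest.hexdigest()
```

Because the digest depends only on values and names, it stays the same when the files are moved or stored off-disk. The raw bytes follow the platform's byte order, so a portable version would need to fix the endianness.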

Make models more flexible

able to include all inventory types
able to have both parameters and matrix values

Command sequence for parameterized model is suboptimal

Need exactly this:

model = ParameterizedBrightwayModel("teddy_bears")
model.load_parameter_data()
model.calculate_stochastic(iterations=1000, update_amounts=True)
model.calculate_matrix_presamples()
_, filepath = model.save_presample("fidget")

There are some non-default options and other weirdness here. Need to document and make more sensible.

`update_matrices` requires LCA reference each time

Should be able to pass as optional argument on Loader creation

`seed` should be passed during presamples creation.

Right now, seeds are passed when matrix_presamples are loaded. They should be passed when presamples are created. This is the correct time to distinguish between randomly sampled presamples and sequentially sampled presamples. It also improves reproducibility.

Presample generator function: Import presamples from external model outputs

Would be great to have an actual external model, but could fake it for now.

`create_presamples_package` and `append_presamples_package` need to check parameter name uniqueness
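A uniqueness check of this kind might look like the sketch below. Both helper names are hypothetical, and the name lists stand in for the names element of each parameter_data tuple.

```python
from collections import Counter


def find_duplicate_names(name_lists):
    """Return the sorted parameter names that occur more than once."""
    counts = Counter(name for names in name_lists for name in names)
    return sorted(name for name, n in counts.items() if n > 1)


def check_name_uniqueness(name_lists):
    """Raise ValueError if any parameter name is repeated."""
    duplicates = find_duplicate_names(name_lists)
    if duplicates:
        raise ValueError("Parameter names are not unique: {}".format(duplicates))
```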

campaigns.db prevents brightway project directory from being deleted

projects.delete_project(delete_dir=True)

projects.delete_project('my_project', delete_dir=True)

and

projects.purge_deleted_directories()

fail with error:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\pjjoyce\\AppData\\Local\\pylca\\Brightway3\\my_project.8eb9cc4510ff31ac37ca85f9bbee44c1\\campaigns.db'

I think campaigns are a presamples thing right?

Shutting down all running instances of brightway and restarting a fresh brightway instance allows you to delete/purge project directories. But deleting only works if it's the first command you use in a new brightway session.

Steps to reproduce:

from brightway2 import *
projects.set_current('my_project') # in the create new project sense of set_current()
projects.delete_project(delete_dir=True)

No good way to get PresamplePackages in correct order from Campaign

Multiple presample packages can be passed to e.g. MonteCarloLCA
If an element (a parameter, a matrix index) is part of more than one package, the value in the package that is later in the list is the one that is used.
Knowing this, there should be a way to generate a list of presample resources from campaign that is ordered thus:

Ordered resources from oldest parent
Ordered resources from second oldest parent
...
Ordered resources from direct parent
Ordered resources from self.

Currently, Campaign.ancestors will list ancestors in the reverse of the desired order (from direct parent to furthest ancestor). It also does not list presample resources, just campaigns.

The solution should be a class method that looks something like this:

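def get_all_resources_in_proper_order(self):
    """Return package paths from the oldest ancestor down to this campaign."""
    # self.ancestors runs from direct parent to oldest ancestor, so reverse it.
    ancestors = list(self.ancestors)
    ancestors.reverse()
    resources = [p.path for ancestor in ancestors for p in ancestor.packages]
    # extend() handles zero, one or many packages; append(*...) only
    # succeeded when there was exactly one package.
    resources.extend(p.path for p in self.packages)
    return resources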

@cmutel I tag you in case this is already implemented somewhere and I just can't find it.

Generic presamples should accept integers as well as keys

Poor error when passing faulty indices in matrix data

Passing indices with faulty keys results in ValueError: setting an array element with a sequence.

_, _ = presamples.create_presamples_package(matrix_data=[(samples, bad_indices, 'technosphere')])

Yields:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-82-5166baacb765> in <module>()
      1 _, _ = presamples.create_presamples_package(matrix_data=[(samples, bad_indices, 'technosphere')])

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in create_presamples_package(matrix_data, parameter_data, name, id_, overwrite, dirpath, seed)
    295                 "{} and {}".format(samples.shape[1], num_iterations))
    296 
--> 297         indices, metadata = format_matrix_data(indices, kind, *other)
    298 
    299         if samples.shape[0] != indices.shape[0]:

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in format_matrix_data(indices, kind, dtype, row_formatter, metadata)
    203     if dtype is None and row_formatter is None and metadata is None:
    204         try:
--> 205             return FORMATTERS[kind](indices)
    206         except KeyError:
    207             raise KeyError("Can't find formatter for {}".format(kind))

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in format_technosphere_presamples(indices)
    101             TYPE_DICTIONARY.get(row[2], row[2])
    102         )
--> 103     return format_matrix_data(indices, 'technosphere', dtype, func, metadata)
    104 
    105 

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in format_matrix_data(indices, kind, dtype, row_formatter, metadata)
    213         array = np.zeros(len(indices), dtype=dtype)
    214         for index, row in enumerate(indices):
--> 215             array[index] = row_formatter(row)
    216 
    217         return array, metadata

ValueError: setting an array element with a sequence.

Include an exception in the validator with a better error message.
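A validator along these lines could raise a more descriptive error before any array is built. This is a sketch: the function name is hypothetical, and the row layout (input key, output key, exchange type, with keys as (database, code) pairs) is an assumption taken from the examples in this issue list.

```python
def validate_technosphere_indices(indices):
    """Raise a descriptive ValueError for malformed index rows."""
    for position, row in enumerate(indices):
        if not isinstance(row, (tuple, list)) or len(row) != 3:
            raise ValueError(
                "Row {}: expected (input, output, type), got {!r}".format(
                    position, row
                )
            )
        for label, key in (("input", row[0]), ("output", row[1])):
            if not isinstance(key, (tuple, list)) or len(key) != 2:
                raise ValueError(
                    "Row {}: {} key must be (database, code), got {!r}".format(
                        position, label, key
                    )
                )
```

The message names the row and the offending key, which is the information missing from the numpy error above.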

matrix_label should not be passed to KroneckerDelta function (e.g.)

The exchanges are just inventory anyways, split at end of run.

Presample generator function for case: Correlated CFs

Should also have a real example

separate core presamples from use with brightway2

It may be daunting for someone not using brightway2 to have to install brightway2 modules, to have the default Campaign database in a Brightway directory, etc.
It may make sense to separate core presamples functions from uses with brightway2 as two packages.

CFs passed to presamples are not method specific

The current identifier for CFs is a list of flow identifiers, e.g. [('biosphere3', 'f9055607-c571-4903-85a8-7c20e3790c43')].
This is incomplete: the identifiers should also refer to the method.

Poor error when passing wrong group name to `load_parameter_data`

Error is this (not helpful):

IndexError                                Traceback (most recent call last)
~/miniconda3/envs/presamples/lib/python3.6/site-packages/peewee.py in get(self)
   5558         try:
-> 5559             return clone.execute()[0]
   5560         except IndexError:

~/miniconda3/envs/presamples/lib/python3.6/site-packages/peewee.py in __getitem__(self, item)
   3405             self.fill_cache(item if item > 0 else 0)
-> 3406             return self.row_cache[item]
   3407         else:

IndexError: list index out of range

Centralize aggregating functions

One of the uses of presamples is to use aggregate data to simplify models.
The burger paper was an example of how to aggregate data at the LCI level (aggregated LCI datasets, i.e. BA^-1s)
Data can be aggregated on many other levels:

at the indicator score level (lightest possible weight of data in an LCA, least resolution)
at the level of parameters that are themselves calculated from external models and that are then used as input parameters to the LCA model. Here, we are aggregating the parameter values and model structure of everything before the parameter value(s)
as supply vectors

The proposal is to extract what is common to all these cases and then adapt (subclass) for different aggregation levels.

Potential common methods:

aggregating: calculating results once
External transformation functions during aggregation (specific example: balancing of land use or water flows)
saving result arrays
determine history/pedigree of aggregated dataset (store ancestry somehow - see Bonsai's work? blockchain?)
supplanting of model "branches" by aggregate "leaves" and (if possible) vice-versa

Link to be made to temporalis and acyclic trees

brightway2 silently not supporting presamples if presamples module imported first

Importing presamples before brightway2 results in a brightway2 import that does not recognize presamples.


In [1]: import presamples

In [2]: from brightway2 import *

In [3]: lca = LCA({Database('example').random():1}, presamples = ['some_path'])
C:\mypy\anaconda\envs\bw\lib\site-packages\bw2calc\lca.py:90: UserWarning: Skipping presamples; `presamples` not installed
  warnings.warn("Skipping presamples; `presamples` not installed")

Importing in the opposite order does not result in this problem.

I think I traced the sequence of events leading to this problem to this, though I don't understand it:

When importing presamples, the PackagesDataLoader method is imported.
This in turn imports utils.validate_presamples_dirpath
The utils module in turn runs `from bw2calc.utils import md5`, hence loading bw2calc
bw2calc tries to import the PackagesDataLoader. The import is in a try/except, and since presamples is not yet imported, it excepts. bw2calc is loaded without the PackagesDataLoader.

I'm unsure why the imported bw2calc is modified (and includes the PackagesDataLoader) when presamples is imported second.

A quick fix would be to replace the md5 function with something else (or simply copy it into presamples). It does not seem special enough to cause this headache.
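A self-contained replacement could look like the sketch below, which hashes a file in blocks using only hashlib. It is an assumption that the bw2calc function behaves the same way; the signature here is illustrative.

```python
import hashlib
import os
import tempfile


def md5(filepath, blocksize=65536):
    """Return the MD5 hex digest of a file, read in fixed-size blocks."""
    hasher = hashlib.md5()
    with open(filepath, "rb") as handle:
        for block in iter(lambda: handle.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


# Usage: hash a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.bin")
    with open(path, "wb") as handle:
        handle.write(b"abc")
    digest = md5(path)
```

With a local copy, importing presamples no longer needs to touch bw2calc at import time.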

Presample for case: Correlated aggregated datasets

Maybe also include a correlated aggregated dataset generator

ParametersMapping fails if presamples package contains matrix data

This fails:
name_lists = [ json.load(open(package.path / obj['names']['filepath'])) for obj in package.resources ]
With:

KeyError Traceback (most recent call last)
in ()
1 name_lists = [
----> 2 json.load(open(package.path / obj['names']['filepath'])) for obj in package.resources
3 ]

in (.0)
1 name_lists = [
----> 2 json.load(open(package.path / obj['names']['filepath'])) for obj in package.resources
3 ]

KeyError: 'names'

This happens because only parameter resources have a names key; matrix resources do not.
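One way around the KeyError is to skip resources that have no names entry, as in this sketch. The helper name, the example file names, and the "samples"/"indices" keys are illustrative and not taken from a real package.

```python
def parameter_name_paths(resources):
    """Return the names filepath of every resource that has one.

    Resources without a "names" entry (assumed here to be matrix
    resources) are skipped instead of raising KeyError.
    """
    return [obj["names"]["filepath"] for obj in resources if "names" in obj]


resources = [
    # hypothetical parameter resource
    {"samples": {"filepath": "a.samples.npy"}, "names": {"filepath": "a.names.json"}},
    # hypothetical matrix resource: no "names" key
    {"samples": {"filepath": "b.samples.npy"}, "indices": {"filepath": "b.indices.npy"}},
]
paths = parameter_name_paths(resources)
```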

Continuous integration and code coverage

This isn't hard, just need to copy the code from e.g. wurst and activate the CI services.

Need:

Windows tests (appveyor)
Linux tests (travis)
OS X tests (travis)
test coverage (coveralls)
documentation builder (read the docs)

Missing consolidated parameter samples

The PackagesDataLoader will load all data from an iterable of presample package dirpaths.

When multiple presample package dirpaths contain data for the same matrix elements, only the last data will be used. This allows a baseline model to be updated with new values by passing the baseline presample package first and the new values presamples package after in the list of dirpaths. This is expected behaviour and one of the strengths of presamples.

HOWEVER, for parameter data, there is no "consolidation" of parameter data. Parameter data from different packages are all accessible through their own IndexedParametersMapping object. Parameters in different IndexedParametersMapping objects with the same names coexist and do not influence each other.

What is missing is:

a PackagesDataLoader method to return all parameter values from all IndexedParametersMapping objects, for a given index value.
this method must use the latest value for parameters found in multiple packages.

To illustrate simply, let's say we wanted to return a dict with {param_name: sample}. This simple dict comprehension would do the trick (self refers to a PackagesDataLoader object):

{
    name:sample 
        for i in range(len(self.parameters)) # ensures order
        for name, sample in self.parameters[i].items()
}

Perhaps the best place for this would be in the parameters property (in which case what is presently stored in the parameters property would need to be renamed to e.g. index_parameter_mappings).

However, we should think about whether we want to make this richer:

Create a new Class for these consolidated parameters
Keep track of some metadata, for example which package a specific sample was taken from (we don't do this for matrix data...)
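The richer variant could be sketched as follows, with plain dicts standing in for the IndexedParametersMapping objects. `consolidate_parameters` and the package names are hypothetical.

```python
def consolidate_parameters(packages):
    """Merge {name: sample} dicts; later packages override earlier ones.

    `packages` is an ordered list of (package_name, {name: sample}) pairs.
    Returns {name: (sample, package_name)} so the origin stays traceable.
    """
    consolidated = {}
    for package_name, parameters in packages:
        for name, sample in parameters.items():
            consolidated[name] = (sample, package_name)
    return consolidated


baseline = ("baseline", {"yield": 4.0, "area": 10.0})
update = ("scenario", {"yield": 5.5})
merged = consolidate_parameters([baseline, update])
```

Here "yield" comes from the later package while "area" keeps its baseline value, mirroring how matrix data is already consolidated.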

Listing ancestors in campaign without ancestors raises StopIteration

Given:

c1 = Campaign.create(name='a')

which is a campaign with no parent, calling

c1.ancestors

returns an iterator of ancestors (there are none). Iterating raises a StopIteration error, as specified here.

This means that, for example, both [_ for _ in c1.ancestors] and list(c1.ancestors) raise a StopIteration error.

While this is fine, it does not appear to be the expected behaviour based on this test.

Need to agree on expected behaviour and fix either code or test.
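If the agreed behaviour is an empty result for a campaign without parents, a generator that simply ends is enough. The class below is a minimal stand-in and not the peewee-backed Campaign model.

```python
class SketchCampaign:
    """Minimal stand-in for a campaign with an optional parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def ancestors(self):
        """Yield parents from the direct parent up to the oldest ancestor."""
        current = self.parent
        while current is not None:
            yield current
            current = current.parent
        # Falling off the end finishes the generator cleanly. An explicit
        # `raise StopIteration` is not needed, and since Python 3.7 it is
        # converted to RuntimeError inside a generator.


root = SketchCampaign("a")
child = SketchCampaign("b", parent=root)
grandchild = SketchCampaign("c", parent=child)
```

With this shape, list(root.ancestors) is an empty list and no exception reaches the caller.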

Presamples CF package fails if biosphere flows not used in database

It is difficult to impossible to tell ahead of time which flows a given database will use, so instead of raising a mysterious error, we need to just ignore these flows. The error I get now is:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-87-59955be80c45> in <module>()
      1 mc_with = bw.MonteCarloLCA(fu, lcia_method, presamples=[path])
----> 2 mc_with_results = np.array([next(mc_with) for _ in range(250)])

...

~/miniconda3/envs/regional/lib/python3.6/site-packages/scipy/sparse/compressed.py in check_bounds(indices, bound)
    700             if idx >= bound:
    701                 raise IndexError('index (%d) out of range (>= %d)' %
--> 702                                  (idx, bound))
    703             idx = indices.min()
    704             if idx < -bound:

IndexError: index (4294967295) out of range (>= 2077)
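The index in the error, 4294967295, equals 2**32 - 1, which suggests a sentinel for flows that could not be mapped to a matrix row (an assumption, not confirmed here). Under that assumption, ignoring such flows could be sketched as:

```python
MAX_UINT32 = 4294967295  # 2**32 - 1, assumed to mark "no matching row"


def drop_unmapped(rows, samples):
    """Keep only (row index, sample) pairs whose row index was mapped."""
    kept = [(r, s) for r, s in zip(rows, samples) if r != MAX_UINT32]
    kept_rows = [r for r, _ in kept]
    kept_samples = [s for _, s in kept]
    return kept_rows, kept_samples


rows, samples = drop_unmapped([12, MAX_UINT32, 40], [0.5, 0.7, 0.9])
```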

Deal with presamples data with repeated indices/parameter names

When various presamples packages are used in sequence, the last presamples package should determine the value that is used. This is implemented, and so all is good.
When creating the presamples package from a single data resource, however, this makes less sense. Other solutions include:
- Summing values with identical indices/parameter names
- Throwing an error
- ??
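The options above can be compared with a small sketch. `resolve_duplicates` is hypothetical, and single numbers stand in for rows of samples.

```python
def resolve_duplicates(names, values, how="raise"):
    """Collapse repeated names using one of three strategies.

    how="raise": refuse duplicates; how="sum": add their values;
    how="last": keep the last value seen.
    """
    if how not in ("raise", "sum", "last"):
        raise ValueError("Unknown strategy: {}".format(how))
    resolved = {}
    for name, value in zip(names, values):
        if name in resolved:
            if how == "raise":
                raise ValueError("Duplicate entry: {}".format(name))
            if how == "sum":
                value = resolved[name] + value
        resolved[name] = value
    return resolved
```

Making the strategy an explicit argument keeps the "last package wins" rule for stacked packages separate from the choice made within a single resource.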

presamples should work with LCA, not only MonteCarlo

presamples may be a way to use parameters in LCA, as a mechanism to override A and B matrix elements.

IrregularPresamplesArray for parameter_presamples

IrregularPresamplesArray should also apply to parameter_presamples

allow presamples package creation with external reference to array

Current implementation requires that the numpy array is actually passed as an argument to create_presamples_package. This therefore does not allow the use of an external address in the field datapackage['resources']['path'].

passing 'seed=sequential' in LCA object not working

The following did not yield expected results:

excs = [*act.technosphere()]
sample_array = np.eye(len(excs))
indices = [(exc['input'], exc['output'], 'technosphere') for exc in excs]
_, fp = create_presample_matrix(sample_array, indices, 'technosphere')
lca = LCA({act:1}, presamples=[fp], seed='sequential')
lca.lci() #expect first exc to be==1, others 0: not the case
lca.lci() #expect second exc to be==1, others 0: not the case
...
Samples are in fact returned in random order.

Inconsistent arguments for RPA instantiation

Places that assume list of (fp, shape):
https://github.com/PascalLesage/brightway2-presamples/blob/seed/presamples/array.py#L20
https://github.com/PascalLesage/brightway2-presamples/blob/seed/presamples/array.py#L11

Places that assume list of fps only:
https://github.com/PascalLesage/brightway2-presamples/blob/seed/presamples/loader.py#L176
https://github.com/PascalLesage/brightway2-presamples/blob/seed/presamples/package_interface.py#L113

TODO: confirm that list of (fp, shape) is what is required, and revert RPA instantiations to pass (fp, shape).

Presample generator function: Normal MC sampling of CFs

Not anything special, just saved and reproducible.

MatrixPresample with technosphere and biosphere not working

The following MatrixPresample was created for a coal power plant with a technosphere exchange combustion_coal_input and an emission combustion_CO2_emissions:

combustion_coal_sample = combustion_coal_input.random_sample(1000).reshape(1, -1)
combustion_CO2_sample = combustion_coal_sample * CO2_per_kg_coal

combustion_corr_matrix_presample_A = (
    combustion_coal_sample,
    [(combustion_coal_input['input'], combustion_coal_input['output'], 'technosphere')],
    'technosphere'
    )

combustion_corr_matrix_presample_B = (
    combustion_CO2_sample,
    [(combustion_CO2_emissions['input'], combustion_CO2_emissions['output'], 'biosphere')],
    'biosphere'
    )

combustion_corr_id, combustion_corr_fp = create_presamples_package(
    matrix_presamples=[
        combustion_corr_matrix_presample_A,
        combustion_corr_matrix_presample_B
    ]
)

combustion_MC_correlated = MonteCarloLCA(
    {combustion:1},
    method=('IPCC 2013', 'climate change', 'GWP 100a'), 
    presamples=[combustion_corr_fp]
    )

The matrix elements for these two exchanges, returned by next(combustion_MC_correlated), were not correlated as expected.

Presample generator function for case: stacking presamples/inheritance

As a means to account for design iterations with underspecified models

One-off issue in MonteCarloLCA with presamples

Suppose I have a simple system with the technosphere matrix defined as:

I create a presample package like this:

>>> arr = np.array([10, 20, 30, 40, 50]).reshape(1, -1)
>>> _, fp = create_presamples_package(
...    [
...        (arr, 
...         [(('db', 'B'), ('db', 'A'), 'technosphere')], 
...         'technosphere')
...    ], seed='sequential'
... )

I use seed='sequential' because I then want to calculate the correlation between this specific technosphere exchange and results.
Because of the way MonteCarloLCA objects are usually manipulated, I end up including the second sample (index=1) first:

>>> mc = MonteCarloLCA({a:1}, presamples=[fp])
>>> for i in range(5):
...        next(mc)
...        print(mc.technosphere_matrix[mc.product_dict[('db', 'B')], mc.activity_dict[('db', 'A')]])
-20.0
-30.0
-40.0
-50.0
-10.0

To get what I really need, I have to do this:

>>> mc = MonteCarloLCA({a:1}, presamples=[fp])
>>> mc.lci()
>>> for i in range(5):
...    if i == 0:
...        pass
...    else:
...        next(mc)
...    print(mc.technosphere_matrix[mc.product_dict[('db', 'B')], mc.activity_dict[('db', 'A')]])
-10.0
-20.0
-30.0
-40.0
-50.0

The example in the GSA example notebook shows how to build up the arrays during Monte Carlo, but if the seed is sequential, this shouldn't be necessary: one simply needs to go in and read the presample array.

One simple solution would be to change the way the Indexer is instantiated, from:

self.seed_value, self.count, self.index = seed, 0, None

to:

self.seed_value, self.count, self.index = seed, -1, None

This generates the expected results:

```python
>>> mc = MonteCarloLCA({a:1}, presamples=[fp])
>>> for i in range(5):
...        next(mc)
...        print(mc.technosphere_matrix[mc.product_dict[('db', 'B')], mc.activity_dict[('db', 'A')]])
-10.0
-20.0
-30.0
-40.0
-50.0
```

example jupyter notebook and examples pages give 404 error

The example jupyter notebook and examples pages both give a 404 error.
Could this be fixed? I would like to use Presamples and would be really happy with some good example workflows to get me up and running!
