
presamples's Introduction

presamples

Package to write, load, manage and verify numerical arrays, called presamples.

Initially written for scenario analysis and for the reuse of sampled data in Monte Carlo Simulations in the Brightway LCA framework.

However, the presamples package is software-generic and built on the datapackage standard by Open Knowledge Foundation.

Presamples are useful anytime values for named parameters or matrix elements need to be saved and reused.

Installation: Via pip or conda (conda install --channel pascallesage presamples).

Documentation: We are in the process of writing better documentation.

  • To read our documentation "under construction", go to our readthedocs page.
  • If you can't find what you are looking for, you can also try the Jupyter Notebook here



presamples's Issues

Losses override production amounts

Because presamples replaces values in the technosphere matrix, a loss modelled as a technosphere exchange (where input == output) and included in a presample package overwrites the diagonal amount, which should actually be production minus loss when the technosphere matrix is built.

Can't think of a solution for the loader, and hence extra care should be taken when generating these presamples (i.e. the presamples should be formed from samples of production minus samples of losses).

For now, a warning should be issued when matrix_data is supplied to create_presamples_package with an element where input==output.
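A minimal sketch of such a check, assuming matrix_data is a list of (samples, indices, kind, ...) tuples and that the first two entries of each index row are the input and output keys. The helper name warn_on_self_exchanges is hypothetical and not part of presamples:

```python
import warnings


def warn_on_self_exchanges(matrix_data):
    """Warn for every matrix index whose input and output are identical.

    Hypothetical helper: returns the flagged (kind, key) pairs so that
    callers can inspect them as well.
    """
    flagged = []
    for _samples, indices, kind, *_other in matrix_data:
        for row in indices:
            input_key, output_key = row[0], row[1]
            if input_key == output_key:
                flagged.append((kind, input_key))
                warnings.warn(
                    f"{kind} element {input_key} has input == output; "
                    "its samples replace the diagonal value, so they "
                    "should already be production minus losses."
                )
    return flagged
```

The check only reports; deciding whether the samples are already net amounts is left to whoever generates the package.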

errors in quickstart

I'm following the quickstart code of the docs, and copy pasted everything so I'm pretty sure I made no mistakes.

This section gives errors with:
ag_loader.parameters['land [ha]']

TypeError: list indices must be integers or slices, not str

ag_loader.parameters.consolidated_array

AttributeError: 'list' object has no attribute 'consolidated_array'

ag_loader.parameters.names

AttributeError: 'list' object has no attribute 'names'

I haven't tried any further because of the errors.
Please let me know how to resolve this

Generic presample types

Currently we allow a few types of presamples, hardcoded in the source [example].

However, this won't work for extensions like regionalized LCA that define their own matrices. We should investigate a way to specify the matrix name in the datapackage and make the code more generic.
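One way to make this more generic could be a registry that extensions add their own matrix kinds to. The sketch below is only an illustration: register_formatter and format_indices are hypothetical names, and the "regionalized" formatter is a placeholder rather than a real matrix definition.

```python
# Hypothetical registry: matrix kind -> function that formats index rows.
FORMATTERS = {}


def register_formatter(kind, formatter):
    """Let an extension declare how indices for its matrix are formatted."""
    if kind in FORMATTERS:
        raise ValueError(f"A formatter for {kind!r} is already registered")
    FORMATTERS[kind] = formatter


def format_indices(indices, kind):
    """Look up the formatter by name instead of hardcoding the kinds."""
    try:
        formatter = FORMATTERS[kind]
    except KeyError:
        raise KeyError(f"Can't find formatter for {kind}") from None
    return formatter(indices)


# An extension would register its matrix once, at import time.
register_formatter("regionalized", lambda indices: [tuple(row) for row in indices])
```

The matrix name stored in the datapackage would then be the key used for the lookup.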

Create exchanges with bounded random parameters and fixed sum to be used in Monte Carlo

Hi Pascal,
I am looking for a way to create exchanges with bounded random parameters and a fixed sum to be used in Monte Carlo, see this stackoverflow post for details.
Chris told me you are working on something similar here. Is something that deals with my issue already implemented, or is there anything similar I could start from?
If it is not there yet and you would find it useful, I would be happy to contribute to developing it.
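For illustration only, one simple way to get such samples is rejection sampling from a flat Dirichlet distribution: every draw sums to the requested total, and draws outside the bounds are discarded. This is a sketch, not something presamples provides; sample_fixed_sum and its arguments are hypothetical, and the output is arranged as (parameters, iterations) to match the arrays used elsewhere in these issues.

```python
import numpy as np


def sample_fixed_sum(lower, upper, total, size, seed=None, max_tries=1000):
    """Rejection-sample vectors that sum to `total` within per-element bounds."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if not (lower.sum() <= total <= upper.sum()):
        raise ValueError("Bounds are incompatible with the requested total")
    rng = np.random.default_rng(seed)
    accepted, n_accepted = [], 0
    for _ in range(max_tries):
        # A flat Dirichlet draw is uniform on the simplex; scaling fixes the sum.
        candidates = rng.dirichlet(np.ones(len(lower)), size=size) * total
        ok = np.all((candidates >= lower) & (candidates <= upper), axis=1)
        accepted.append(candidates[ok])
        n_accepted += int(ok.sum())
        if n_accepted >= size:
            # Transpose to (n_parameters, n_iterations).
            return np.concatenate(accepted)[:size].T
    raise RuntimeError("Too few samples accepted; the bounds may be too tight")


samples = sample_fixed_sum([0.1, 0.2, 0.1], [0.6, 0.7, 0.5], total=1.0, size=50, seed=42)
```

Rejection sampling gets slow when the bounds are tight, so this is mostly useful as a starting point.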

Improve documentation

Review, edit, and make sure the following sections are present:

  • Introduction and use cases
  • How to generate presamples
  • Data format for different kinds of presamples
  • How to specify presamples for a new matrix (dtype, row_formatter, metadata)
  • How to use presamples

parameters mapping faulty when more than one presamples package with parameters passed

See here

The indices start over from 0 for every name_list:

import numpy as np
import presamples as ps

arr1 = np.random.random(size=(2, 10))
arr2 = np.random.random(size=(2, 10))
names1 = ['a', 'b']
names2 = ['c', 'd']
label1 = 'one'
label2 = 'two'

pp_id, pp_fp = ps.create_presamples_package(parameter_data=[(arr1, names1, label1), (arr2, names2, label2)])
loader = ps.PackagesDataLoader([pp_fp])
params = loader.parameters[0]
params.mapping

{'a': 0, 'b': 1, 'c': 0, 'd': 1}

params.values()

array([ 0.91739674, 0.89426511, 0.68989611, 0.44477301])
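The mapping one would expect here keeps a running offset across the name lists, so that every name points to its own row in the stacked array. A minimal sketch of that logic, independent of the presamples classes (build_mapping is a hypothetical name):

```python
def build_mapping(name_lists):
    """Map each parameter name to its row in the vertically stacked arrays."""
    mapping, offset = {}, 0
    for names in name_lists:
        for position, name in enumerate(names):
            mapping[name] = offset + position
        # The next resource starts where this one ended.
        offset += len(names)
    return mapping


mapping = build_mapping([['a', 'b'], ['c', 'd']])
```

With the two name lists above this gives indices 0 to 3 instead of restarting at 0 for the second list.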

Asymmetry in matrix_data and parameter_metadata in PackagesDataLoader not necessary

When loading data from presample package paths, data is collected on both matrix resources and parameter resources.
However, when appending extracted information to matrix_data, all data is passed, including parameter-metadata, which is superfluous.
On the other hand, when appending parameter-data to self.parameter_metadata, only parameter resource data is passed along with some other package level data (path, name (renamed to package_name), and indexer (renamed to sample_indexer)), but not all (id, seed, ncols).

There is no need for such asymmetry.

I recommend (untested for now):

  1. Removing parameter_metadata from matrix_data, by adding self.matrix_data[-1].pop('parameter_metadata', None) right after appending the loaded section to matrix_data
  2. Treating parameter data extraction like matrix_data extraction in load_data, and similarly deleting matrix_data from the returned data before appending to parameter_metadata.
  3. Renaming parameter_metadata to parameter_data

Steps 2 and 3 will have effects on other parts of the package.

Command sequence for parameterized model is suboptimal

Need exactly this:

model = ParameterizedBrightwayModel("teddy_bears")
model.load_parameter_data()
model.calculate_stochastic(iterations=1000, update_amounts=True)
model.calculate_matrix_presamples()
_, filepath = model.save_presample("fidget")

There are some non-default options and other weirdness here. Need to document and make more sensible.

`seed` should be passed during presamples creation.

Right now, seeds are passed when matrix_presamples are loaded. They should be passed when presamples are created. This is the correct time to distinguish between randomly sampled presamples and sequentially sampled presamples. It also improves reproducibility.
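To show why the creation-time seed is enough, here is a rough sketch of a column indexer that is either sequential or seeded-random. If the seed is stored with the package, any loader can rebuild the same indexer. SketchIndexer is a hypothetical stand-in and not the presamples Indexer class.

```python
import numpy as np


class SketchIndexer:
    """Hypothetical column indexer: sequential, or random with a fixed seed."""

    def __init__(self, ncols, seed=None):
        self.ncols = ncols
        self.sequential = seed == "sequential"
        self.count = 0
        # Only build a random generator when the seed is not 'sequential'.
        self.rng = None if self.sequential else np.random.default_rng(seed)

    def __next__(self):
        if self.sequential:
            index = self.count % self.ncols  # wrap around after the last column
        else:
            index = int(self.rng.integers(self.ncols))
        self.count += 1
        return index


sequential = SketchIndexer(3, seed="sequential")
sequential_indices = [next(sequential) for _ in range(5)]
```

Two indexers created with the same integer seed return the same column order, which is what makes a run reproducible.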

campaigns.db prevents brightway project directory from being deleted

projects.delete_project(delete_dir=True)
projects.delete_project('my_project', delete_dir=True)

and

projects.purge_deleted_directories()

fail with error:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\pjjoyce\\AppData\\Local\\pylca\\Brightway3\\my_project.8eb9cc4510ff31ac37ca85f9bbee44c1\\campaigns.db'

I think campaigns are a presamples thing right?

Shutting down all running instances of brightway and starting a fresh brightway instance allows you to delete/purge project directories. But deleting only works if it's the first command you use in a new brightway session.

Steps to reproduce:

from brightway2 import *
projects.set_current('my_project') # in the create new project sense of set_current()
projects.delete_project(delete_dir=True)

No good way to get PresamplePackages in correct order from Campaign

Multiple presample packages can be passed to e.g. MonteCarloLCA.
If an element (a parameter, a matrix index) is part of more than one package, the value in the package that is later in the list is the one that is used.
Knowing this, there should be a way to generate a list of presample resources from campaign that is ordered thus:

  • Ordered resources from oldest parent
  • Ordered resources from second oldest parent
    ...
  • Ordered resources from direct parent
  • Ordered resources from self.

Currently, Campaign.ancestors will list ancestors in the reverse of the desired order (from direct parent to furthest ancestor). It also does not list presample resources, just campaigns.

The solution should be a class method that looks something like this:

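def get_all_resources_in_proper_order(self):
    # Campaign.ancestors lists the direct parent first, so reverse it
    # to start from the oldest ancestor.
    ancestors = list(self.ancestors)
    ancestors.reverse()
    resources = [p.path for ancestor in ancestors for p in ancestor.packages]
    # extend() also works when self has zero or several packages,
    # unlike append(*[...]), which needs exactly one.
    resources.extend(p.path for p in self.packages)
    return resources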

@cmutel I tag you in case this is already implemented somewhere and I just can't find it.

Poor error when passing faulty indices in matrix data

Passing indices with faulty keys results in a ValueError: setting an array element with a sequence.

_, _ = presamples.create_presamples_package(matrix_data=[(samples, bad_indices, 'technosphere')])

Yields:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-82-5166baacb765> in <module>()
      1 _, _ = presamples.create_presamples_package(matrix_data=[(samples, bad_indices, 'technosphere')])

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in create_presamples_package(matrix_data, parameter_data, name, id_, overwrite, dirpath, seed)
    295                 "{} and {}".format(samples.shape[1], num_iterations))
    296 
--> 297         indices, metadata = format_matrix_data(indices, kind, *other)
    298 
    299         if samples.shape[0] != indices.shape[0]:

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in format_matrix_data(indices, kind, dtype, row_formatter, metadata)
    203     if dtype is None and row_formatter is None and metadata is None:
    204         try:
--> 205             return FORMATTERS[kind](indices)
    206         except KeyError:
    207             raise KeyError("Can't find formatter for {}".format(kind))

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in format_technosphere_presamples(indices)
    101             TYPE_DICTIONARY.get(row[2], row[2])
    102         )
--> 103     return format_matrix_data(indices, 'technosphere', dtype, func, metadata)
    104 
    105 

C:\mypy\anaconda\envs\bw\lib\site-packages\presamples\packaging.py in format_matrix_data(indices, kind, dtype, row_formatter, metadata)
    213         array = np.zeros(len(indices), dtype=dtype)
    214         for index, row in enumerate(indices):
--> 215             array[index] = row_formatter(row)
    216 
    217         return array, metadata

ValueError: setting an array element with a sequence.

Include an exception in the validator with a better error message.
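As a sketch of what such a check could look like for technosphere indices, assuming each row is (input key, output key, type) and each key is a (database, code) tuple as in the other examples here. The function name is hypothetical:

```python
def validate_technosphere_indices(indices):
    """Raise a descriptive error for malformed technosphere index rows."""
    for position, row in enumerate(indices):
        if len(row) != 3:
            raise ValueError(
                f"Index {position} must be (input, output, type); got {row!r}"
            )
        for label, key in (("input", row[0]), ("output", row[1])):
            if not (isinstance(key, tuple) and len(key) == 2):
                raise ValueError(
                    f"Index {position}: {label} key must be a "
                    f"(database, code) tuple; got {key!r}"
                )
```

Running this before the row formatter would replace the NumPy message with one that names the offending row.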

separate core presamples from use with brightway2

It may be daunting for someone not using brightway2 to have to install brightway2 modules, to have the default Campaign database in a Brightway directory, etc.
It may make sense to separate core presamples functions from uses with brightway2 as two packages.

CFs passed to presamples are not method specific

The current identifier for CFs is a list of flow identifiers, e.g. [('biosphere3', 'f9055607-c571-4903-85a8-7c20e3790c43')].
This is incomplete: the identifiers should also refer to the method.

Poor error when passing wrong group name to `load_parameter_data`

Error is this (not helpful):


IndexError Traceback (most recent call last)
~/miniconda3/envs/presamples/lib/python3.6/site-packages/peewee.py in get(self)
5558 try:
-> 5559 return clone.execute()[0]
5560 except IndexError:

~/miniconda3/envs/presamples/lib/python3.6/site-packages/peewee.py in __getitem__(self, item)
3405 self.fill_cache(item if item > 0 else 0)
-> 3406 return self.row_cache[item]
3407 else:

IndexError: list index out of range

Centralize aggregating functions

One of the uses of presamples is to use aggregated data to simplify models.
The burger paper was an example of how to aggregate data at the LCI level (aggregated LCI datasets, i.e. BA-1s)
Data can be aggregated on many other levels:

  • at the indicator score level (lightest possible weight of data in an LCA, least resolution)
  • at the level of parameters that are themselves calculated from external models and that are then used as input parameters to the LCA model. Here, we are aggregating the parameter values and model structure of everything before the parameter value(s)
  • as supply vectors

The proposal is to extract what is common to all these cases and then adapt (subclass) for different aggregation levels.

Potential common methods:

  1. aggregating: calculating results once
  2. external transformation functions during aggregation (specific example: balancing of land use or water flows)
  3. saving result arrays
  4. determine history/pedigree of aggregated dataset (store ancestry somehow - see Bonsai's work? blockchain?)
  5. supplanting of model "branches" by aggregate "leaves" and (if possible) vice-versa

Link to be made to temporalis and acyclic trees

brightway2 silently not supporting presamples if presamples module imported first

Importing presamples before brightway2 results in a brightway2 import that does not recognize presamples.


In [1]: import presamples

In [2]: from brightway2 import *

In [3]: lca = LCA({Database('example').random():1}, presamples = ['some_path'])
C:\mypy\anaconda\envs\bw\lib\site-packages\bw2calc\lca.py:90: UserWarning: Skipping presamples; `presamples` not installed
  warnings.warn("Skipping presamples; `presamples` not installed")

Importing in the opposite order does not result in this problem.

I think I traced the sequence of events leading to this problem to this, though I don't understand it:

  1. When importing presamples, the PackagesDataLoader class is imported.
  2. This in turn imports utils.validate_presamples_dirpath
  3. The .utils module in turn runs from bw2calc.utils import md5, hence loading bw2calc
  4. bw2calc tries to import the PackagesDataLoader. The import is in a try/except, and since presamples is not yet imported, it excepts. bw2calc is loaded without the PackagesDataLoader.

I'm unsure why the imported bw2calc is modified (and includes the PackagesDataLoader) when presamples is imported second.

A quick fix would be to replace the md5 function with something else (or simply to copy it into presamples). It does not seem special enough to cause this headache.
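A local replacement could be as small as the sketch below, assuming all that is needed is the MD5 hex digest of a file's contents. I have not checked this against the bw2calc implementation, and the name file_md5 is made up.

```python
import hashlib


def file_md5(filepath, blocksize=65536):
    """Return the MD5 hex digest of a file, read in blocks to limit memory use."""
    hasher = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()
```

With this in presamples' own utils, importing presamples would no longer pull in bw2calc as a side effect.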

ParametersMapping fails if presamples package contains matrix data

This fails:
name_lists = [ json.load(open(package.path / obj['names']['filepath'])) for obj in package.resources ]
With:

KeyError Traceback (most recent call last)
in ()
1 name_lists = [
----> 2 json.load(open(package.path / obj['names']['filepath'])) for obj in package.resources
3 ]

in (.0)
1 name_lists = [
----> 2 json.load(open(package.path / obj['names']['filepath'])) for obj in package.resources
3 ]

KeyError: 'names'

Because only samples resources have a key names.
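A possible workaround is to skip resources that have no names entry. The sketch below uses plain dicts and a directory path in place of the package object, so load_name_lists and its arguments are hypothetical:

```python
import json
from pathlib import Path


def load_name_lists(package_path, resources):
    """Load the name lists of the resources that actually define names."""
    name_lists = []
    for obj in resources:
        if "names" not in obj:
            continue  # e.g. a matrix resource: nothing to load
        with open(Path(package_path) / obj["names"]["filepath"]) as f:
            name_lists.append(json.load(f))
    return name_lists
```

Using a with block also closes each file, which the one-line version does not do.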

Continuous integration and code coverage

This isn't hard, just need to copy the code from e.g. wurst and activate the CI services.

Need:

  • Windows tests (appveyor)
  • Linux tests (travis)
  • OS X tests (travis)
  • test coverage (coveralls)
  • documentation builder (read the docs)

Missing consolidated parameter samples

The PackagesDataLoader will load all data from an iterable of presample package dirpaths.

When multiple presample package dirpaths contain data for the same matrix elements, only the last data will be used. This allows a baseline model to be updated with new values by passing the baseline presample package first and the new values presamples package after in the list of dirpaths. This is expected behaviour and one of the strengths of presamples.

HOWEVER, for parameter data, there is no "consolidation" of parameter data. Parameter data from different packages are all accessible through their own IndexedParametersMapping object. Parameters in different IndexedParametersMapping objects with the same names coexist and do not influence each other.

What is missing is:

  • a PackagesDataLoader method to return all parameter values from all IndexedParametersMapping objects, for a given index value.
  • this method must use the latest value for parameters found in multiple packages.

To illustrate simply, let's say we wanted to return a dict with {param_name: sample}. This simple dict comprehension would do the trick (self refers to a PackagesDataLoader object):

{
    name:sample 
        for i in range(len(self.parameters)) # ensures order
        for name, sample in self.parameters[i].items()
} 

Perhaps the best place for this would be in the parameters property (in which case what is presently stored in the parameters property would need to be renamed to e.g. index_parameter_mappings).

However, we should think about whether we want to make this richer:

  • Create a new Class for these consolidated parameters
  • Keep track of some metadata, for example which package a specific sample was taken from (we don't do this for matrix data...)
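As a rough illustration of the richer version, the sketch below consolidates values and also records which package each value came from. It works on plain (package name, dict) pairs rather than on IndexedParametersMapping objects, and consolidate_parameters is a hypothetical name.

```python
def consolidate_parameters(mappings):
    """Merge ordered (package_name, {param_name: sample}) pairs.

    Later packages override earlier ones; `sources` records the package
    that supplied the value finally kept for each parameter.
    """
    values, sources = {}, {}
    for package_name, mapping in mappings:
        for name, sample in mapping.items():
            values[name] = sample
            sources[name] = package_name
    return values, sources


values, sources = consolidate_parameters([
    ("baseline", {"a": 1.0, "b": 2.0}),
    ("scenario", {"b": 2.5, "c": 3.0}),
])
```

Here b takes the scenario value and sources records that it came from the later package.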

Listing ancestors in campaign without ancestors raises StopIteration

Given:

c1 = Campaign.create(name='a')

which is a campaign with no parent, calling

c1.ancestors

returns an iterator of ancestors (there are none). Iterating raises a StopIteration error, as specified here.

This means that, for example, both [_ for _ in c1.ancestors] and list(c1.ancestors) raise a StopIteration error.

While this is fine, it does not appear to be the expected behaviour based on this test.

Need to agree on expected behaviour and fix either code or test.
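If the test reflects the intended behaviour, the generator can simply end when there is no parent left, which gives an empty iterator. Note also that since Python 3.7 a StopIteration raised inside a generator is turned into a RuntimeError (PEP 479). A minimal sketch with a stand-in class (Node is hypothetical and only mimics the parent link of a Campaign):

```python
class Node:
    """Stand-in for a campaign: a name and an optional parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def ancestors(self):
        # Walk up the chain; the generator ends by itself when the
        # parent is None, so no explicit StopIteration is needed.
        parent = self.parent
        while parent is not None:
            yield parent
            parent = parent.parent


root = Node("a")
child = Node("b", parent=root)
grandchild = Node("c", parent=child)
```

Here list(root.ancestors) is an empty list, and grandchild lists its direct parent first.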

Presamples CF package fails if biosphere flows not used in database

It is difficult to impossible to tell ahead of time which flows a given database will use, so instead of raising a mysterious error, we need to just ignore these flows. The error I get now is:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-87-59955be80c45> in <module>()
      1 mc_with = bw.MonteCarloLCA(fu, lcia_method, presamples=[path])
----> 2 mc_with_results = np.array([next(mc_with) for _ in range(250)])

...

~/miniconda3/envs/regional/lib/python3.6/site-packages/scipy/sparse/compressed.py in check_bounds(indices, bound)
    700             if idx >= bound:
    701                 raise IndexError('index (%d) out of range (>= %d)' %
--> 702                                  (idx, bound))
    703             idx = indices.min()
    704             if idx < -bound:

IndexError: index (4294967295) out of range (>= 2077)
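The index in the error, 4294967295, is the largest unsigned 32-bit integer, which suggests it marks flows that could not be mapped to a row. Under that assumption, ignoring these flows could look like the sketch below (drop_unmapped is a hypothetical helper working on plain NumPy arrays):

```python
import numpy as np

# Assumed sentinel for "flow not present in this database".
UNMAPPED = np.iinfo(np.uint32).max  # 4294967295


def drop_unmapped(row_indices, samples):
    """Keep only the rows whose index was actually mapped."""
    row_indices = np.asarray(row_indices)
    mask = row_indices != UNMAPPED
    return row_indices[mask], np.asarray(samples)[mask]


rows = np.array([3, 4294967295, 7], dtype=np.uint32)
data = np.arange(6).reshape(3, 2)
kept_rows, kept_data = drop_unmapped(rows, data)
```

Filtering both arrays with the same mask keeps indices and samples aligned.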

Deal with presamples data with repeated indices/parameter names

  • When various presamples packages are used in sequence, the last presamples package should determine the value that is used. This is implemented, and so all is good.
  • When creating the presamples package from a single data resource, however, this makes less sense. Other solutions include:
    • Summing values with identical indices/parameter names
    • Throwing an error
    • ??
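The first two options can be sketched as follows for parameter names. This is an illustration on plain arrays, assuming samples has one row per name; consolidate_duplicates is a hypothetical helper, and note that np.unique returns the names sorted.

```python
import numpy as np


def consolidate_duplicates(names, samples, how="sum"):
    """Sum the rows of repeated names, or raise if how='raise'."""
    samples = np.asarray(samples, dtype=float)
    unique, inverse = np.unique(names, return_inverse=True)
    if how == "raise" and len(unique) != len(names):
        raise ValueError("Repeated names found in a single data resource")
    summed = np.zeros((len(unique), samples.shape[1]))
    # Unbuffered add: rows sharing the same target index are accumulated.
    np.add.at(summed, inverse, samples)
    return unique.tolist(), summed


unique_names, summed = consolidate_duplicates(
    ["a", "b", "a"], [[1, 2], [3, 4], [5, 6]]
)
```

The same idea would apply to matrix indices, with the index rows playing the role of the names.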

passing 'seed=sequential' in LCA object not working

The following did not yield expected results:

excs = [*act.technosphere()]
sample_array = np.eye(len(excs))
indices = [(exc['input'], exc['output'], 'technosphere') for exc in excs]
_, fp = create_presample_matrix(sample_array, indices, 'technosphere')
lca = LCA({act:1}, presamples=[fp], seed='sequential')
lca.lci() #expect first exc to be==1, others 0: not the case
lca.lci() #expect second exc to be==1, others 0: not the case
...
Samples are in fact returned in random order.

Inconsistent arguments for RPA instantiation

MatrixPresample with technosphere and biosphere not working

The following MatrixPresample was created for a coal power plant with a technosphere exchange combustion_coal_input and an emission combustion_CO2_emissions :

combustion_coal_sample = combustion_coal_input.random_sample(1000).reshape(1, -1)
combustion_CO2_sample = combustion_coal_sample * CO2_per_kg_coal

combustion_corr_matrix_presample_A = (
    combustion_coal_sample,
    [(combustion_coal_input['input'], combustion_coal_input['output'], 'technosphere')],
    'technosphere'
    )

combustion_corr_matrix_presample_B = (
    combustion_CO2_sample,
    [(combustion_CO2_emissions['input'], combustion_CO2_emissions['output'], 'biosphere')],
    'biosphere'
    )

combustion_corr_id, combustion_corr_fp = create_presamples_package(
    matrix_presamples=[
        combustion_corr_matrix_presample_A,
        combustion_corr_matrix_presample_B
    ]
)

combustion_MC_correlated = MonteCarloLCA(
    {combustion:1},
    method=('IPCC 2013', 'climate change', 'GWP 100a'), 
    presamples=[combustion_corr_fp]
    )

The matrix elements for these two exchanges, returned by next(combustion_MC_correlated), were not correlated as expected.

One-off issue in MonteCarloLCA with presamples

Suppose I have a simple system with the technosphere matrix defined as:
[Image: technosphere matrix of the example system]

I create a presample package like this:

>>> arr = np.array([10, 20, 30, 40, 50]).reshape(1, -1)
>>> _, fp = create_presamples_package(
...    [
...        (arr, 
...         [(('db', 'B'), ('db', 'A'), 'technosphere')], 
...         'technosphere')
...    ], seed='sequential'
... )

I use seed='sequential' because I then want to calculate the correlation between this specific technosphere exchange and results.
Because of the way MonteCarloLCA objects are usually manipulated, I end up including the second sample (index=1) first:

>>> mc = MonteCarloLCA({a:1}, presamples=[fp])
>>> for i in range(5):
...        next(mc)
...        print(mc.technosphere_matrix[mc.product_dict[('db', 'B')], mc.activity_dict[('db', 'A')]])
-20.0
-30.0
-40.0
-50.0
-10.0

To get what I really need, I have to do this:

>>> mc = MonteCarloLCA({a:1}, presamples=[fp])
>>> mc.lci()
>>> for i in range(5):
...    if i == 0:
...        pass
...    else:
...        next(mc)
...    print(mc.technosphere_matrix[mc.product_dict[('db', 'B')], mc.activity_dict[('db', 'A')]])
-10.0
-20.0
-30.0
-40.0
-50.0

The example in the GSA example notebook shows how to build up the arrays during Monte Carlo, but if the seed is sequential, this shouldn't be necessary: one simply needs to go in and read the presample array.

One simple solution would be to change the way the Indexer is instantiated, from:

self.seed_value, self.count, self.index = seed, 0, None

to:

self.seed_value, self.count, self.index = seed, -1, None

This generates the expected results:

>>> mc = MonteCarloLCA({a:1}, presamples=[fp])
>>> for i in range(5):
...        next(mc)
...        print(mc.technosphere_matrix[mc.product_dict[('db', 'B')], mc.activity_dict[('db', 'A')]])
-10.0
-20.0
-30.0
-40.0
-50.0
