ceholden / yatsm
Yet Another Time Series Model
Home Page: https://yatsm.readthedocs.org/en/latest/
License: MIT License
Try pytables. Might be faster, smaller, and make updating results in place easier?
Can't append if yatsm.record is None:

File "/usr3/graduate/valpasq/Documents/yatsm/yatsm/cli/line.py", line 188, in line
    output.extend(yatsm.record)
TypeError: 'NoneType' object is not iterable

Missed in test datasets because they contain no 50%+ NODATA timeseries.
Separate the hyperparameters from the data: move the data currently passed to __init__ into the fit method, and move everything else from the fit method into __init__. Do any preprocessing in the fit method.
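A sketch of the proposed split, following the scikit-learn convention (the class name and hyperparameters here are hypothetical, not yatsm's actual API):

```python
import numpy as np


class TimeSeriesModel(object):
    """Hypothetical estimator: hyperparameters in __init__, data in fit."""

    def __init__(self, consecutive=5, threshold=3.0):
        # Hyperparameters only -- no data enters the constructor
        self.consecutive = consecutive
        self.threshold = threshold

    def fit(self, X, Y):
        # Data enters here; do any preprocessing in fit
        self.X_ = np.asarray(X, dtype=float)
        self.Y_ = np.asarray(Y, dtype=float)
        self.n_obs_ = self.X_.shape[0]
        return self


model = TimeSeriesModel(consecutive=8).fit(np.ones((10, 4)), np.ones((10, 7)))
```

This is the layout scikit-learn expects for things like cloning and grid search, which is the motivation for the change.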
Right now the mask values are hard-coded to Fmask mask values (0 and 1 pass, else masked). We should be able to specify the individual values that are masked: line_yatsm.py should mask based upon these values.
Make way for more timeseries algorithms within the module by changing the configuration file to be able to point to many different algorithms:
- Rename yatsm to ccdc and place it into an algorithms submodule; rename the YATSM class to CCDCesque.
- Add an algorithm key under the YATSM section. The algorithm specified by the algorithm key will be searched for as the section title from which to extract algorithm parameterization information.
- Keep the YATSM section for options generic to all timeseries algorithms, like reverse or robust.
- Move YATSM (in the future, CCDCesque) into yatsm.algorithms.yatsm. These will be parameterized in the YATSM metadata section.

Proposed change example:
[metadata]
version = 0.5
[YATSM]
algorithm = CCDCesque
regression = LassoCV
design_matrix = 1 + x + harm(x, 1)
reverse = False
robust = False
commission_alpha =
...
[CCDCesque]
consecutive = 5
[LassoCV]
pickle = somefile.pkl
...
It is very difficult to imagine specifying all arguments to a sklearn classifier or regression estimator via a config file. Things like n_alpha could play well, but how would we specify alphas = np.logspace(0.001, 30, 50)? This proposed format sidesteps these concerns by requiring that regression options provide a pickled file from sklearn.externals.joblib that already contains the desired parameterization. If the pickle item is not provided but the section is labeled, default to a pickle of an existing regression object packaged with yatsm.
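The proposed workflow, sketched with the standard library's pickle (the issue suggests sklearn.externals.joblib, which is used the same way); the file path and the fallback stand-in object are illustrative:

```python
import os
import pickle
import tempfile

import numpy as np

# Parameterize the estimator in Python, where arbitrary expressions work
try:
    from sklearn.linear_model import LassoCV
    estimator = LassoCV(alphas=np.logspace(0.001, 30, 50))
except ImportError:
    # Stand-in so the sketch runs without scikit-learn installed
    estimator = {'alphas': np.logspace(0.001, 30, 50)}

# Pickle it; the config's `pickle = somefile.pkl` entry points here
path = os.path.join(tempfile.mkdtemp(), 'somefile.pkl')
with open(path, 'wb') as f:
    pickle.dump(estimator, f)

# At runtime, yatsm would load the fully parameterized object back
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

The design choice: anything expressible in Python is expressible in the pickle, so the config file only needs to carry a path.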
Target v0.5.0 as milestone to coincide with another major rehaul (#28).
Add in the "0"s required to make all rows contain 4 (or 5?) numbers.
Note: will need to update TSTools driver to reflect this
In the documentation of the Batch Interface, in the example of running on Sun Grid Engine, the option to specify the ethernet speed of the node should be -l eth_speed=10; the '=' sign is missing.
Move from many command line programs that use docopt for CLI parsing to one centralized yatsm CLI using click.

Motivation: there are a lot of CLI scripts already that process data, make maps, or perform housekeeping. The map scripts are prefixed with the name "yatsm_" but none of the others are. If we had one centralized yatsm command it would be easier to navigate the various scripts.
Example:
$ yatsm --help
YATSM algorithm text
Options:
--verbose
--help
Commands:
lines runs lines
cache housekeeping for cache data
changemap makes changemaps
map makes maps
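A minimal sketch of what the centralized CLI could look like with click; the subcommand names follow the example above and the bodies are stubs, not yatsm's actual implementation:

```python
import click


@click.group(help='YATSM algorithm text')
@click.option('--verbose', is_flag=True, help='Be verbose')
def cli(verbose):
    """Single entry point; subcommands hang off this group."""
    pass


@cli.command(help='runs lines')
def lines():
    click.echo('running lines')


@cli.command(help='housekeeping for cache data')
def cache():
    click.echo('cache housekeeping')


@cli.command(help='makes changemaps')
def changemap():
    click.echo('making changemaps')


@cli.command(help='makes maps')
def map():
    click.echo('making maps')


if __name__ == '__main__':
    cli()
```

click auto-generates the `--help` listing shown above from the group and command declarations, which is one reason to prefer it over per-script docopt strings.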
Give user more control in terms of what attributes are used in classification:
Documentation for model parameter exploration guide page docs/guide/exploration.rst
Right now we have very limited control over model specification -- one can use "freq" to assign the seasonal harmonics, but that's it. For specific cases with just optical data it might be advantageous to have more control over the model specification -- e.g., don't use a slope term. This becomes more obvious when we consider other timeseries of remote sensing datasets.
I propose we eliminate "freq" in favor of a Patsy (see docs) style description of our models to formulate the design matrix.
See this example from Statsmodels for dummy variable encoding and this guide from Patsy for the Treatment class.
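To illustrate, a formula like "1 + x + harm(x, 1)" would expand into a design matrix with an intercept, a trend term, and a pair of seasonal harmonic columns. Here is that matrix built by hand with NumPy; harm is a hypothetical helper standing in for what a Patsy formula would call:

```python
import numpy as np


def harm(x, n):
    """Hypothetical helper: order-n seasonal harmonic pair (365.25-day year)."""
    w = 2 * np.pi * n / 365.25
    return np.column_stack([np.cos(w * x), np.sin(w * x)])


x = np.arange(0, 730, 16, dtype=float)  # ordinal dates at a 16-day revisit

# "1 + x + harm(x, 1)" -> intercept, slope, cos term, sin term
X = np.column_stack([np.ones_like(x), x, harm(x, 1)])
```

With Patsy, the same matrix would come from a formula string, so users could drop the slope term or add harmonics without touching code.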
We don't use i_x anymore -- fix this in postprocess.commission_test
https://github.com/ceholden/yatsm/blob/master/yatsm/cli/line.py#L196

File "/usr3/graduate/valpasq/Documents/yatsm/yatsm/cli/line.py", line 196, in line
    output.extend(yatsm.record)
TypeError: 'NoneType' object is not iterable

Fix (tested): add an if statement to check whether the record exists:

if yatsm.record is not None:
    output.extend(yatsm.record)
Make the --before and --after options work on prediction and coefficient images as well as for maps. Two ways:

1. Run --before and --after before the intersecting model so that any results found by these more preferable approaches overwrite the previous answers.
2. Run the intersecting query first, then --before and --after; only add indices from --after and --before if they contribute new pixels to the index list from the intersecting query.

Now that classification is almost stabilized, firm up the 0.1 release.
Make "Yet another..." even more obvious by adding in logic from other timeseries monitoring approaches:
No one set of parameters will be ideal for all cases, and online monitoring approaches can either omit or commit changes that would be obviously wrong from other approaches. Omission tests might be useful when looking for more obvious changes, to make sure a high threshold hasn't ignored some obvious changes (e.g., gradual disturbances). Commission tests could nullify spurious or ephemeral changes identified using low thresholds.
Write documentation for dataset preparation guide docs/guide/dataset.rst
Code a flexible interface for running some classifier on the output results. Two parts:
Interface:
scikit-learn
Add option to output a QA/QC band for the timeseries selected by yatsm_map.py.
Add options to mask pixels in output phenology maps based on number of observations in phenology fit (--pheno_obs) or minimum spline correlation (--pheno_cor).
Include a set of already-pickled regression methods for users to pick from:
The long term mean phenology transition date calculation was validated when coding the implementation, so why isn't it included as a test? Including the LTM phenology metric calculation as a test is probably a good first step to accomplish before adding the year-to-year variations from the mean.
Currently only the main prediction method can be read in from the pickles included in YATSM. The config parsing does not try to load pickles for the refit function.
- yatsm.py
- run_yatsm.py
- line_yatsm.py
- line_yatsm.py in parallel for SGE cluster environments
- scripts/yatsm_changemap.py: yatsm_changemap.py --image argument flags
- yatsm_map.py: --after and --before; --root, --result, and --image argument flags

Currently hosted on readthedocs.
Need a way of updating the cache data files for additional new images. Roadmap:

- cache_yatsm.py to just read in and cache data
- cache_yatsm.py should include something like an --update option
- --update will add images to the end of the timeseries (e.g., sort all images found by date and add all_images[previous_n:])
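The --update logic from the roadmap above, sketched in Python (the function name and signature are illustrative, not yatsm's actual API):

```python
def update_cache(all_images, previous_n):
    """Return the images to append to an existing cache.

    all_images: iterable of (date, filename) pairs for every image found;
    previous_n: number of images already cached. Sorts by date and keeps
    only the tail that the cache has not seen yet.
    """
    ordered = sorted(all_images, key=lambda img: img[0])
    return ordered[previous_n:]


# Two images already cached; one new acquisition shows up on disk
new = update_cache([(3, 'c.tif'), (1, 'a.tif'), (2, 'b.tif')], previous_n=2)
```

This only holds if images are never inserted mid-series; out-of-order additions would require rebuilding the cache.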
Documentation for classification guide page docs/guide/classification.rst
Treat the --first and implicit --last change options as required but mutually exclusive. This will allow us to add in the option for mapping the number of changes between dates.
Basically:
Usage:
yatsm_changemap.py (first | last | num) <start_date> <end_date> <output>
Allow users to specify the regression or prediction model used within the online monitoring approach. We're using LASSO with a fixed lambda because of legacy CCDC reasons, but there's no reason we couldn't move toward more algorithms.
A short list of interesting ones:
Should be simple enough to embed his code as additional run option. Record the following:
Biggest problem will be fitting model to segments with low numbers of observations (see Fig 6 and 7 in Melaas et al 2013). Another large problem will be with multiple-cropping
To make our configuration files better pair with the scikit-learn estimator API, each timeseries model must now be configured using an init
and fit
subsection.
Example from some made up estimator:
MadeUpModel:
init:
threshold: 3
likelihood: binomial
sided: two-sided
fit:
sample_weights: '1 / vza'
This greatly improves the flexibility of the configuration files because it explicitly declares which configuration settings belong to which methods. This can be extended to other methods (e.g., transform) for scikit-learn-like APIs.
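A sketch of how the subsections could be consumed: init keys become constructor arguments and fit keys become keyword arguments to fit(). MadeUpModel is re-implemented here as a stub so the example is self-contained; the parsing is illustrative, not yatsm's actual config code:

```python
class MadeUpModel(object):
    """Stub of the made-up estimator from the config example."""

    def __init__(self, threshold=3, likelihood='binomial', sided='two-sided'):
        self.threshold = threshold
        self.likelihood = likelihood
        self.sided = sided

    def fit(self, X, y, sample_weights=None):
        self.sample_weights_ = sample_weights
        return self


# Parsed YAML equivalent of the MadeUpModel section above
cfg = {
    'MadeUpModel': {
        'init': {'threshold': 3, 'likelihood': 'binomial', 'sided': 'two-sided'},
        'fit': {'sample_weights': '1 / vza'},
    }
}

name, params = list(cfg.items())[0]
model = MadeUpModel(**params['init']).fit(X=None, y=None, **params['fit'])
```

Because the split mirrors the estimator API exactly, no per-model parsing code is needed: the dictionaries are splatted straight into the two calls.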
Implement slope test when testing for good training period as a toggle.
Implications:
Looks like the NumPy included in conda
can do multithreaded computation for linear algebra routines by linking to OpenBLAS. You can disable this by setting OPENBLAS_NUM_THREADS=1
as an environment variable, but having all my jobs killed by our cluster's process reaper gave me quite the fright before I found the solution to turning it off!
If we linked against the Intel MKL, we'd also have MKL_NUM_THREADS
to set.
Proposed solutions:

1. Have yatsm set the environment variables so multiprocessing uses just one thread.
2. Use some --num_threads optional argument in the YATSM CLI to turn multiprocessing on.

I'm actually in favor of option 2, but it seems pretty oppressive. Any thoughts @valpasq or @bullocke?
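Option 1 might look like the following; the key detail is that the variables must be set before NumPy (and therefore its BLAS) is first imported, since OpenBLAS and MKL read them at initialization:

```python
import os

# Pin BLAS threading before NumPy loads; OpenBLAS and MKL read these
# variables when the library initializes, so ordering matters.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np  # noqa: E402 -- deliberately imported after the env setup
```

In practice this would have to live at the very top of the CLI entry point, before anything that transitively imports NumPy.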
Allow environment variables to be used in configuration files. Should greatly help usability in situations where you want to run the same configuration on different but identically formatted datasets. Previously this was accomplished by sed'ing many file paths in the config files.
Example usage, where $ROOTDIR
is used in the configuration files to point to the dataset location:
export ROOTDIR=$HOME/Documents/landsat_stack/p013r030/images
yatsm -v pixel $CONFIG/envvar.yaml 25 25
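A minimal sketch of one way to implement the substitution, applying os.path.expandvars to string values after the YAML is loaded (expand_config is a hypothetical helper, not yatsm's API):

```python
import os


def expand_config(cfg):
    """Recursively expand $ENVVAR references in string config values."""
    if isinstance(cfg, dict):
        return {k: expand_config(v) for k, v in cfg.items()}
    if isinstance(cfg, list):
        return [expand_config(v) for v in cfg]
    if isinstance(cfg, str):
        return os.path.expandvars(cfg)
    return cfg


os.environ['ROOTDIR'] = '/data/p013r030/images'
cfg = {'dataset': {'output': '$ROOTDIR/YATSM', 'n_bands': 8}}
expanded = expand_config(cfg)
```

Unset variables are left as literal `$NAME` text by expandvars, so a validation pass that rejects remaining `$` characters might be worth adding.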
Write document defining schema or spec for timeseries result storage used in YATSM. Preliminarily:
Version: 1.0
This specification describes the vocabulary and schema for describing timeseries within Yet Another Timeseries Model (YATSM).
| Term | Definition |
|---|---|
| segment | A period of time without disturbances. A segment represents a period of stable conditions, including stable land cover, such as permanent developed or forested cover, and stable land cover dynamics, such as a prolonged period of regrowing forest or the gradual succession of vegetation species. |
| break | An abrupt change in some characteristic of a segment, including changes in the magnitude, timing, or variability of observed data. Breaks interrupt a segment and necessitate the estimation of another segment. |
| ephemeral | A break in a segment that does not persist. Segments separated by ephemeral breaks are often functionally identical and may be joined together if ephemeral change processes are not of interest. Examples of ephemeral changes include precipitation driven early green-ups of vegetation in arid environments, non-fatal insect or weather driven defoliation events, or flooding that does not permanently change or alter the land cover or land cover characteristics. |
| ensemble | A history of segments and breaks for a given unit of area. |
| event | A change in land surface condition that does not constitute an abrupt change or segment break. Examples include a grassland fire, flood event, or hail storm over crops that does not change land cover and is not large enough to become an ephemeral change, but is something that would be worthwhile recording. Events may be classified based on the departures from the expected signal. |
| ... | fill in more here |
Timeseries models store their ensemble results as a collection of segments. Each segment has the following properties, documented here in JSON format but stored within YATSM as numpy structured arrays:
dtype=[
('start', 'i4'), # ordinal date of first observation in segment
('end', 'i4'), # ordinal date of last observation in segment
('break', 'i4'), # ordinal date of break that ends segment
('coef', 'float32', (n_features, n_models)), # coefficients for all features in X design matrix for all fitted Y data
('px', 'u2'), # column index coordinate of segment
('py', 'u2') # row index coordinate of segment
]
Additional attributes, including the magnitude
of a break
or phenological attributes, may be stored as attributes of each segment
.
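The schema above can be instantiated directly as a NumPy structured array; n_features, n_models, and the example values below are illustrative:

```python
import numpy as np

n_features, n_models = 4, 7  # illustrative design-matrix / series counts
segment_dtype = np.dtype([
    ('start', 'i4'),    # ordinal date of first observation in segment
    ('end', 'i4'),      # ordinal date of last observation in segment
    ('break', 'i4'),    # ordinal date of break that ends segment
    ('coef', 'float32', (n_features, n_models)),  # fitted coefficients
    ('px', 'u2'),       # column index coordinate of segment
    ('py', 'u2'),       # row index coordinate of segment
])

record = np.zeros(3, dtype=segment_dtype)  # e.g., three segments for one pixel
record[0]['start'] = 726834                # an ordinal date
```

Additional attributes like magnitude would simply be extra fields in the dtype, which is why the spec calls them attributes of each segment.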
Pweave can combine Markdown/LaTeX/etc. and Python code to generate reports with matplotlib figures and more.
Write code that has a template for the report. Use the CLI to fill in the details after verifying input. The report includes things like connected objects (e.g., via ndimage.label). After the template is filled in with details, exec it and output to the specified PDF file.
The --magnitude flag for creating maps showing the magnitude of disturbance in each of the test_indices is not yet implemented:

yatsm/scripts/yatsm_changemap.py, line 257 in 9f5cc27

Add it in --magnitude parsing as follows: determine test_indices by first checking for the attribute in the saved record; if this fails, determine the number of nonzero magnitude attributes in the record. Magnitude is a float, but the date is expressed as an integer. Upcasting the integer to float seems more straightforward than adding precision to magnitude by scaling by a factor of 10, plus we're less likely to overflow.
Why make the user choose to save one or the other? We should save both if the user wants robust results, and ask when making maps whether or not to output the robust prediction / coefficient map.

Changes:
- drop the --robust choice
- save both results in the .npz saved models

From logs:
/usr3/graduate/ceholden/code/yatsm/yatsm/yatsm.py:48: DeprecationWarning: assignment will raise an error in the future, most likely because your index result shape does not match the value array shape. You can use `arr.flat[index] = values` to keep the old behaviour.
Code was inserting coef_, an array of length 8 (for example), into a subset of self.coef_ (e.g., into 3 array indices which were non-zero).
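A minimal reproduction sketch of the shape mismatch and the fix; the array sizes and indices are illustrative, not pulled from yatsm:

```python
import numpy as np

coef = np.zeros(8)                       # self.coef_ stand-in
nonzero = np.zeros(8, dtype=bool)
nonzero[[1, 4, 6]] = True                # 3 non-zero positions
fitted = np.arange(8, dtype=float)       # new coefficients, length 8

# Buggy pattern (warned by old NumPy, an error in modern NumPy):
#   coef[nonzero] = fitted               # 8 values into 3 slots
# Fix: assign only the matching subset
coef[nonzero] = fitted[nonzero]
```

The deprecation warning existed precisely because old NumPy silently took the first 3 of the 8 values, which is almost never what the caller meant.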
When making maps / changemaps, the warning logging notice about not finding timeseries model results for certain saved files is stupid: it implies the .npz is corrupted, not that the 'record' .npy subfile contains no model results. Turn this off!
Documentation for model parameter specification guide page docs/guide/model_specification.rst
Switch the config file format to YAML. The primary reason for the change is that YAML can automatically parse the data types in the config files. This saves me a lot of coding, is more flexible since I don't have to update a parser, and probably gives the user more flexibility in the long term.
Target release v0.5.0, since that one doesn't work with existing config files anyway. Why update your config files for v0.5.0 only to also port to YAML in the future?
Example:
metadata:
version: "0.5.0"
dataset:
# Text file containing dates and images
input_file: "/home/ceholden/Documents/yatsm/examples/p022r049_input.csv"
# Input date format
date_format: "%Y%j"
# Output location
output: "/home/ceholden/Documents/landsat_stack/p022r049/images/YATSM"
# Output file prefix (e.g., [prefix]_[line].npz)
output_prefix: "yatsm_r"
# Total number of bands
n_bands: 8
# Mask band (e.g., Fmask)
mask_band: 8
# List of integer values to mask within the mask band
mask_values: [2, 3, 4, 255]
# Valid range of non-mask band data
# specify 1 range for all bands, or specify ranges for each band
valid_range: [0, 10000]
# Indices for multi-temporal cloud masking (indexed on 1)
green_band: 2
swir1_band: 5
# Use BIP image reader? If not, use GDAL to read in
use_bip_reader: true
# Directory location for caching dataset lines
cache_line_dir: "/home/ceholden/Documents/landsat_stack/p022r049/images/.yatsm_cache"
produces:
In [8]: cfg = yaml.safe_load(open('p022r049_example.yaml'))
In [9]: cfg
Out[9]:
{'dataset': {'cache_line_dir': '/home/ceholden/Documents/landsat_stack/p022r049/images/.yatsm_cache',
'date_format': '%Y%j',
'green_band': 2,
'input_file': '/home/ceholden/Documents/yatsm/examples/p022r049_input.csv',
'mask_band': 8,
'mask_values': [2, 3, 4, 255],
'n_bands': 8,
'output': '/home/ceholden/Documents/landsat_stack/p022r049/images/YATSM',
'output_prefix': 'yatsm_r',
'swir1_band': 5,
'use_bip_reader': True,
'valid_range': [0, 10000]},
'metadata': {'version': '0.5.0'}}
Pretty great!
Documentation for configuration guide page docs/guide/configuration.rst
Need example output for test validation and user examples:
Map-making utilities:
Refactor yatsm.algorithms.postprocess.robust_record to be more general. This function can be simplified to refit(yatsm, prefix, predictor), which would allow other regression methods (e.g., a cross-validated Lasso) to be run within the same context of redefining the coef and rmse within each segment of each ensemble model.
Shortcut robust_record as refit(yatsm, prefix='robust', predictor=rlm.RLM).
Also refactor the yatsm map --robust option to --result_prefix so we can output coef/prediction maps from robust_coef, lasso_coef, theilsen_coef, etc...
Config file specification:
YATSM:
...
refit_prefix: robust
refit_prediction: RLM
...
where RLM
is a pickled RLM prediction object.
Merit: This should allow change to be fit using one method of prediction while also summarizing timeseries using other prediction methods.
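A hedged sketch of what the generalized refit could look like. The record handling is omitted and OLS is a stand-in predictor, so the names and signatures here are illustrative rather than yatsm's actual API:

```python
import numpy as np


def refit(X, Y, predictor):
    """Refit one segment: return (coef, rmse) per series using `predictor`.

    predictor is any object with fit(X, y) that exposes coef_ afterwards,
    e.g. a cross-validated Lasso or a robust linear model.
    """
    coef = np.zeros((X.shape[1], Y.shape[0]))
    rmse = np.zeros(Y.shape[0])
    for i, y in enumerate(Y):
        est = predictor.fit(X, y)
        coef[:, i] = est.coef_
        rmse[i] = np.sqrt(np.mean((y - X.dot(est.coef_)) ** 2))
    return coef, rmse


class OLS(object):
    """Stand-in predictor: ordinary least squares via lstsq."""

    def fit(self, X, y):
        self.coef_ = np.linalg.lstsq(X, y, rcond=None)[0]
        return self


X = np.column_stack([np.ones(10), np.arange(10.0)])   # intercept + trend
Y = np.vstack([2 + 3 * np.arange(10.0)])               # one noiseless series
coef, rmse = refit(X, Y, OLS())
```

Stored under a prefix (robust_coef, lasso_coef, ...), each refit's output then maps directly onto the proposed --result_prefix option.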
While monitoring, we only need to update the tested models, not all of them. This is currently done for simplicity, but it would scale awfully if there are many more fitted indices than tested indices.
Proposed changes:
- fit_models should maybe use test_indices as the default if bands=None
- update_model should also take in the bands to fit (pass an np array of fit_indices)
- update self.record[self.n_record]['start'] in each loop (minor change)
- monitor updates all models after change
yatsm line does not check for TSLengthException (e.g., where there is NoData on image edges) and as such will stop when encountering a pixel that does not have enough data. This is a regression from v0.4.0 and prior releases.
Some of the changes associated with #29 and #30 mean that glmnet-python isn't a required package anymore, and the broadening of scope for YATSM means that there will eventually be quite a lot of optional dependencies. This project probably needed better documentation for how to install it anyway, but this was the main impetus for the following change:

- requirements/ directory for storing itemized requirements files
- common.txt, with a requirements.txt that points to requirements/common.txt
- dev.txt for development dependencies (documentation and tests, mostly)
- accel.txt for making things run faster (Fortran wrapper for enets/lasso, Numba for all-purpose fastness)
- pheno.txt for Eli Melaas' LTM phenology

Finally, add a conda environment file that users can point to for automating the installation. We can probably also use this for travis.ci, which is nice. See this repo, conda-env, for details about the environment.yaml file.
Error occurs when attempting to run phenology fitting:
File "/usr3/graduate/valpasq/Documents/yatsm/yatsm/cli/line.py", line 190, in line
ltm = pheno.LongTermMeanPhenology(yatsm, **cfg['phenology'])
TypeError: __init__() got an unexpected keyword argument 'year_interval'
Config location: /projectnb/landsat/projects/Massachusetts/p012r031/p012r031_config.yaml
Add capability to load past results, find new images, and make sense of whether there are new changes (or observations indicating a likelihood of a change).
Make the multitemporal masking more configurable by doing the following:
- allow setting the span of the multitemporal cloud filtering when using smooth_mask / lowess filtering
- make the threshold configurable (currently 2 * np.std(y), where y is the green or swir1 band)
- accept a str containing a mathematical formula (Patsy won't cover all of the maths we might want to do)