ceholden / yatsm
Yet Another Time Series Model
Home Page: https://yatsm.readthedocs.org/en/latest/
License: MIT License
Try pytables. Might be faster, smaller, and make updating results in place easier?
Can't append if yatsm.record is None:

File "/usr3/graduate/valpasq/Documents/yatsm/yatsm/cli/line.py", line 188, in line
    output.extend(yatsm.record)
TypeError: 'NoneType' object is not iterable

Missed in test datasets because they contain no 50%+ NODATA timeseries.
Separate the hyperparameters from the data: move the data currently passed to __init__ into the fit method, and move everything else from the fit method into __init__. Do any preprocessing in the fit method.
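A sketch of the proposed split, following the scikit-learn convention (the class name and hyperparameters here are hypothetical, not yatsm's actual API):

```python
import numpy as np


class TimeSeriesModel(object):
    """Hypothetical estimator: hyperparameters in __init__, data in fit."""

    def __init__(self, consecutive=5, threshold=3.0):
        # Hyperparameters only -- no data enters the constructor
        self.consecutive = consecutive
        self.threshold = threshold

    def fit(self, X, Y):
        # Data enters here; do any preprocessing in fit
        self.X_ = np.asarray(X, dtype=float)
        self.Y_ = np.asarray(Y, dtype=float)
        self.n_obs_ = self.X_.shape[0]
        return self


model = TimeSeriesModel(consecutive=8).fit(np.ones((10, 4)), np.ones((10, 7)))
```

This is the layout scikit-learn expects for things like cloning and grid search, which is the motivation for the change.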
Right now the mask values are hard-coded to Fmask mask values (0 and 1 pass, else masked). We should be able to specify the individual values that are masked: line_yatsm.py should mask based upon these values.
Make way for more timeseries algorithms within the module by changing the configuration file to be able to point to many different algorithms:
- Rename yatsm to ccdc and place it into an algorithms submodule; rename the YATSM class to CCDCesque.
- Add an algorithm key under the YATSM section. The algorithm specified by the algorithm key will be searched for as the section title from which to extract algorithm parameterization information.
- Keep the YATSM section for options generic to all timeseries algorithms, like reverse or robust.
- Move YATSM (in the future, CCDCesque) into yatsm.algorithms.yatsm. These will be parameterized in the YATSM metadata section.

Proposed change example:
[metadata]
version = 0.5
[YATSM]
algorithm = CCDCesque
regression = LassoCV
design_matrix = 1 + x + harm(x, 1)
reverse = False
robust = False
commission_alpha =
...
[CCDCesque]
consecutive = 5
[LassoCV]
pickle = somefile.pkl
...
It is very difficult to imagine specifying all arguments to a sklearn classifier or regression estimator via a config file. Things like n_alpha could play well, but how would we specify alphas = np.logspace(0.001, 30, 50)? This proposed format sidesteps these concerns by requiring that regression options provide a pickled file from sklearn.externals.joblib that already contains the desired parameterization. If the pickle item is not provided but the section is labeled, default to a pickle of an existing regression object packaged with yatsm.
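The proposed workflow, sketched with the standard library's pickle (the issue suggests sklearn.externals.joblib, which is used the same way); the file path and the fallback stand-in object are illustrative:

```python
import os
import pickle
import tempfile

import numpy as np

# Parameterize the estimator in Python, where arbitrary expressions work
try:
    from sklearn.linear_model import LassoCV
    estimator = LassoCV(alphas=np.logspace(0.001, 30, 50))
except ImportError:
    # Stand-in so the sketch runs without scikit-learn installed
    estimator = {'alphas': np.logspace(0.001, 30, 50)}

# Pickle it; the config's `pickle = somefile.pkl` entry points here
path = os.path.join(tempfile.mkdtemp(), 'somefile.pkl')
with open(path, 'wb') as f:
    pickle.dump(estimator, f)

# At runtime, yatsm would load the fully parameterized object back
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

The design choice: anything expressible in Python is expressible in the pickle, so the config file only needs to carry a path.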
Target v0.5.0 as milestone to coincide with another major rehaul (#28).
Add in the "0"s required to make all rows contain 4 (or 5?) numbers.
Note: will need to update TSTools driver to reflect this
In the documentation of the Batch Interface, in the example of running on Sun Grid Engine, the option to specify the ethernet speed of the node should be -l eth_speed=10; the '=' sign is missing.
Move from many command line programs that use docopt for CLI parsing to one centralized yatsm CLI using click.

Motivation: there are a lot of CLI scripts already that process data, make maps, or perform housekeeping. The map scripts are prefixed with the name "yatsm_" but none of the others are. If we had one centralized yatsm command it would be easier to navigate the various scripts.
Example:
$ yatsm --help
YATSM algorithm text
Options:
--verbose
--help
Commands:
lines runs lines
cache housekeeping for cache data
changemap makes changemaps
map makes maps
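A minimal sketch of what the centralized CLI could look like with click; the subcommand names follow the example above and the bodies are stubs, not yatsm's actual implementation:

```python
import click


@click.group(help='YATSM algorithm text')
@click.option('--verbose', is_flag=True, help='Be verbose')
def cli(verbose):
    """Single entry point; subcommands hang off this group."""
    pass


@cli.command(help='runs lines')
def lines():
    click.echo('running lines')


@cli.command(help='housekeeping for cache data')
def cache():
    click.echo('cache housekeeping')


@cli.command(help='makes changemaps')
def changemap():
    click.echo('making changemaps')


@cli.command(help='makes maps')
def map():
    click.echo('making maps')


if __name__ == '__main__':
    cli()
```

click auto-generates the `--help` listing shown above from the group and command declarations, which is one reason to prefer it over per-script docopt strings.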
Give user more control in terms of what attributes are used in classification:
Documentation for model parameter exploration guide page docs/guide/exploration.rst
Right now we have very limited control over model specification -- one can use "freq" to assign the seasonal harmonics, but that's it. For specific cases with just optical data it might be advantageous to have more control over the model specification -- e.g., don't use a slope term. This becomes more obvious when we consider other timeseries of remote sensing datasets.
I propose we eliminate "freq" in favor of a Patsy (see docs) style description of our models to formulate the design matrix.
See this example from Statsmodels for dummy variable encoding and this guide from Patsy for the Treatment class.
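To illustrate, a formula like "1 + x + harm(x, 1)" would expand into a design matrix with an intercept, a trend term, and a pair of seasonal harmonic columns. Here is that matrix built by hand with NumPy; harm is a hypothetical helper standing in for what a Patsy formula would call:

```python
import numpy as np


def harm(x, n):
    """Hypothetical helper: order-n seasonal harmonic pair (365.25-day year)."""
    w = 2 * np.pi * n / 365.25
    return np.column_stack([np.cos(w * x), np.sin(w * x)])


x = np.arange(0, 730, 16, dtype=float)  # ordinal dates at a 16-day revisit

# "1 + x + harm(x, 1)" -> intercept, slope, cos term, sin term
X = np.column_stack([np.ones_like(x), x, harm(x, 1)])
```

With Patsy, the same matrix would come from a formula string, so users could drop the slope term or add harmonics without touching code.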
We don't use i_x anymore -- fix this in postprocess.commission_test
https://github.com/ceholden/yatsm/blob/master/yatsm/cli/line.py#L196

File "/usr3/graduate/valpasq/Documents/yatsm/yatsm/cli/line.py", line 196, in line
    output.extend(yatsm.record)
TypeError: 'NoneType' object is not iterable

Fix (tested): add an if statement to check whether the record exists:

if yatsm.record is not None:
    output.extend(yatsm.record)
Make the --before and --after options work on prediction and coefficient images as well as for maps. Two ways:

1. Run --before and --after before the intersecting model so that any results found by these more preferable approaches overwrite the previous answers.
2. Run the intersecting query first, then --before and --after; only add indices from --after and --before if they contribute new pixels to the index list from the intersecting query.

Now that classification is almost stabilized, firm up the 0.1 release.
Make "Yet another..." even more obvious by adding in logic from other timeseries monitoring approaches:
No one set of parameters will be ideal for all cases, and online monitoring approaches can either omit or commit changes that would be obviously wrong from other approaches. Omission tests might be useful when looking for more obvious changes, to make sure a high threshold hasn't ignored some obvious changes (e.g., gradual disturbances). Commission tests could nullify spurious or ephemeral changes identified using low thresholds.
Write documentation for dataset preparation guide docs/guide/dataset.rst
Code a flexible interface for running some classifier on the output results. Two parts:
Interface:
scikit-learn
Add option to output a QA/QC band for the timeseries selected by yatsm_map.py.
Add options to mask pixels in output phenology maps based on number of observations in phenology fit (--pheno_obs) or minimum spline correlation (--pheno_cor).
Include a set of already-pickled regression methods for users to pick from:
The long term mean phenology transition date calculation was validated when coding the implementation, so why isn't it included as a test? Including the LTM phenology metric calculation as a test is probably a good first step to accomplish before adding the year-to-year variations from the mean.
Currently only the main prediction method can be read in from the pickles included in YATSM. The config parsing does not try to load pickles for the refit function.
- yatsm.py
- run_yatsm.py
- line_yatsm.py
- line_yatsm.py in parallel for SGE cluster environments
- scripts/yatsm_changemap.py: yatsm_changemap.py --image argument flags
- yatsm_map.py: --after and --before; --root, --result, and --image argument flags

Currently hosted on readthedocs.
Need a way of updating the cache data files for additional new images. Roadmap:

- cache_yatsm.py to just read in and cache data
- cache_yatsm.py should include something like an --update option
- --update will add images to the end of the timeseries (e.g., sort all images found by date and add all_images[previous_n:])
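The --update logic from the roadmap above, sketched in Python (the function name and signature are illustrative, not yatsm's actual API):

```python
def update_cache(all_images, previous_n):
    """Return the images to append to an existing cache.

    all_images: iterable of (date, filename) pairs for every image found;
    previous_n: number of images already cached. Sorts by date and keeps
    only the tail that the cache has not seen yet.
    """
    ordered = sorted(all_images, key=lambda img: img[0])
    return ordered[previous_n:]


# Two images already cached; one new acquisition shows up on disk
new = update_cache([(3, 'c.tif'), (1, 'a.tif'), (2, 'b.tif')], previous_n=2)
```

This only holds if images are never inserted mid-series; out-of-order additions would require rebuilding the cache.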
Documentation for classification guide page docs/guide/classification.rst
Treat the --first and implicit --last change options as required but mutually exclusive. This will allow us to add in the option for mapping the number of changes between dates.
Basically:
Usage:
yatsm_changemap.py (first | last | num) <start_date> <end_date> <output>
Allow users to specify the regression or prediction model used within the online monitoring approach. We're using LASSO with a fixed lambda because of legacy CCDC reasons, but there's no reason we couldn't move toward more algorithms.
A short list of interesting ones:
Should be simple enough to embed his code as additional run option. Record the following:
Biggest problem will be fitting model to segments with low numbers of observations (see Fig 6 and 7 in Melaas et al 2013). Another large problem will be with multiple-cropping
To make our configuration files better pair with the scikit-learn estimator API, each timeseries model must now be configured using an init
and fit
subsection.
Example from some made up estimator:
MadeUpModel:
init:
threshold: 3
likelihood: binomial
sided: two-sided
fit:
sample_weights: '1 / vza'
This greatly improves the flexibility of the configuration files because it explicitly declares which configuration settings belong to which methods. This can be extended to other methods (e.g., transform) for scikit-learn-like APIs.
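A sketch of how the subsections could be consumed: init keys become constructor arguments and fit keys become keyword arguments to fit(). MadeUpModel is re-implemented here as a stub so the example is self-contained; the parsing is illustrative, not yatsm's actual config code:

```python
class MadeUpModel(object):
    """Stub of the made-up estimator from the config example."""

    def __init__(self, threshold=3, likelihood='binomial', sided='two-sided'):
        self.threshold = threshold
        self.likelihood = likelihood
        self.sided = sided

    def fit(self, X, y, sample_weights=None):
        self.sample_weights_ = sample_weights
        return self


# Parsed YAML equivalent of the MadeUpModel section above
cfg = {
    'MadeUpModel': {
        'init': {'threshold': 3, 'likelihood': 'binomial', 'sided': 'two-sided'},
        'fit': {'sample_weights': '1 / vza'},
    }
}

name, params = list(cfg.items())[0]
model = MadeUpModel(**params['init']).fit(X=None, y=None, **params['fit'])
```

Because the split mirrors the estimator API exactly, no per-model parsing code is needed: the dictionaries are splatted straight into the two calls.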
Implement slope test when testing for good training period as a toggle.
Implications:
Looks like the NumPy included in conda
can do multithreaded computation for linear algebra routines by linking to OpenBLAS. You can disable this by setting OPENBLAS_NUM_THREADS=1
as an environment variable, but having all my jobs killed by our cluster's process reaper gave me quite the fright before I found the solution to turning it off!
If we linked against the Intel MKL, we'd also have MKL_NUM_THREADS
to set.
Proposed solutions:

1. Have yatsm set the environment variables so multiprocessing uses just one thread.
2. Use some --num_threads optional argument in the YATSM CLI to turn multiprocessing on.

I'm actually in favor of option 2, but it seems pretty oppressive. Any thoughts @valpasq or @bullocke?
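Option 1 might look like the following; the key detail is that the variables must be set before NumPy (and therefore its BLAS) is first imported, since OpenBLAS and MKL read them at initialization:

```python
import os

# Pin BLAS threading before NumPy loads; OpenBLAS and MKL read these
# variables when the library initializes, so ordering matters.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np  # noqa: E402 -- deliberately imported after the env setup
```

In practice this would have to live at the very top of the CLI entry point, before anything that transitively imports NumPy.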
Allow environment variables to be used in configuration files. Should greatly help usability in situations where you want to run the same configuration on different but identically formatted datasets. Previously this was accomplished by sed'ing many file paths in the config files.
Example usage, where $ROOTDIR
is used in the configuration files to point to the dataset location:
export ROOTDIR=$HOME/Documents/landsat_stack/p013r030/images
yatsm -v pixel $CONFIG/envvar.yaml 25 25
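A minimal sketch of one way to implement the substitution, applying os.path.expandvars to string values after the YAML is loaded (expand_config is a hypothetical helper, not yatsm's API):

```python
import os


def expand_config(cfg):
    """Recursively expand $ENVVAR references in string config values."""
    if isinstance(cfg, dict):
        return {k: expand_config(v) for k, v in cfg.items()}
    if isinstance(cfg, list):
        return [expand_config(v) for v in cfg]
    if isinstance(cfg, str):
        return os.path.expandvars(cfg)
    return cfg


os.environ['ROOTDIR'] = '/data/p013r030/images'
cfg = {'dataset': {'output': '$ROOTDIR/YATSM', 'n_bands': 8}}
expanded = expand_config(cfg)
```

Unset variables are left as literal `$NAME` text by expandvars, so a validation pass that rejects remaining `$` characters might be worth adding.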
Write document defining schema or spec for timeseries result storage used in YATSM. Preliminarily:
Version: 1.0
This specification describes the vocabulary and schema for describing timeseries within Yet Another Timeseries Model (YATSM).
| Term | Definition |
|---|---|
| segment | A period of time without disturbances. A segment represents a period of stable conditions, including stable land cover, such as permanent developed or forested cover, and stable land cover dynamics, such as a prolonged period of regrowing forest or the gradual succession of vegetation species. |
| break | An abrupt change in some characteristic of a segment, including changes in the magnitude, timing, or variability of observed data. Breaks interrupt a segment and necessitate the estimation of another segment. |
| ephemeral | A break in a segment that does not persist. Segments separated by ephemeral breaks are often functionally identical and may be joined together if ephemeral change processes are not of interest. Examples of ephemeral changes include precipitation driven early green-ups of vegetation in arid environments, non-fatal insect or weather driven defoliation events, or flooding that does not permanently change or alter the land cover or land cover characteristics. |
| ensemble | A history of segments and breaks for a given unit of area. |
| event | A change in land surface condition that does not constitute an abrupt change or segment break. Examples include a grassland fire, flood event, or hail storm over crops that does not change land cover and is not large enough to become an ephemeral change, but is something that would be worthwhile recording. Events may be classified based on the departures from the expected signal. |
| ... | fill in more here |
Timeseries models store their ensemble results as a collection of segments. Each segment has the following properties, documented here in JSON format but stored within YATSM as numpy structured arrays:
dtype=[
('start', 'i4'), # ordinal date of first observation in segment
('end', 'i4'), # ordinal date of last observation in segment
('break', 'i4'), # ordinal date of break that ends segment
('coef', 'float32', (n_features, n_models)), # coefficients for all features in X design matrix for all fitted Y data
('px', 'u2'), # column index coordinate of segment
('py', 'u2') # row index coordinate of segment
]
Additional attributes, including the magnitude
of a break
or phenological attributes, may be stored as attributes of each segment
.
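The schema above can be instantiated directly as a NumPy structured array; n_features, n_models, and the example values below are illustrative:

```python
import numpy as np

n_features, n_models = 4, 7  # illustrative design-matrix / series counts
segment_dtype = np.dtype([
    ('start', 'i4'),    # ordinal date of first observation in segment
    ('end', 'i4'),      # ordinal date of last observation in segment
    ('break', 'i4'),    # ordinal date of break that ends segment
    ('coef', 'float32', (n_features, n_models)),  # fitted coefficients
    ('px', 'u2'),       # column index coordinate of segment
    ('py', 'u2'),       # row index coordinate of segment
])

record = np.zeros(3, dtype=segment_dtype)  # e.g., three segments for one pixel
record[0]['start'] = 726834                # an ordinal date
```

Additional attributes like magnitude would simply be extra fields in the dtype, which is why the spec calls them attributes of each segment.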
Pweave can combine Markdown/LaTeX/etc. and Python code to generate reports with matplotlib figures and more.
Write code that has a template for the report. Use the CLI to fill in the details after verifying input. The report includes things like connected objects (e.g., via ndimage.label). After the template is filled in with details, exec it and output to the specified PDF file.
The --magnitude flag for creating maps showing the magnitude of disturbance in each of the test_indices is not yet implemented:

yatsm/scripts/yatsm_changemap.py, line 257 in 9f5cc27

Add it in --magnitude parsing as follows: determine test_indices by first checking for the attribute in the saved record; if this fails, determine the number of nonzero magnitude attributes in the record. Magnitude is a float, but the date is expressed as an integer. Upcasting the integer to float seems more straightforward than adding precision to magnitude by scaling by a factor of 10, plus we're less likely to overflow.
Why make the user choose to save one or the other? We should save both if the user wants robust results, and ask when making maps whether or not to output the robust prediction / coefficient map.

Changes:
- drop the --robust choice
- save both results in the .npz saved models

From logs:
/usr3/graduate/ceholden/code/yatsm/yatsm/yatsm.py:48: DeprecationWarning: assignment will raise an error in the future, most likely because your index result shape does not match the value array shape. You can use `arr.flat[index] = values` to keep the old behaviour.
Code was inserting coef_, an array of length 8 (for example), into a subset of self.coef_ (e.g., into 3 array indices which were non-zero).
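A minimal reproduction sketch of the shape mismatch and the fix; the array sizes and indices are illustrative, not pulled from yatsm:

```python
import numpy as np

coef = np.zeros(8)                       # self.coef_ stand-in
nonzero = np.zeros(8, dtype=bool)
nonzero[[1, 4, 6]] = True                # 3 non-zero positions
fitted = np.arange(8, dtype=float)       # new coefficients, length 8

# Buggy pattern (warned by old NumPy, an error in modern NumPy):
#   coef[nonzero] = fitted               # 8 values into 3 slots
# Fix: assign only the matching subset
coef[nonzero] = fitted[nonzero]
```

The deprecation warning existed precisely because old NumPy silently took the first 3 of the 8 values, which is almost never what the caller meant.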
When making maps / changemaps, the warning logging notice about not finding timeseries model results for certain saved files is stupid: it implies the .npz is corrupted, not that the 'record' .npy subfile contains no model results. Turn this off!
Documentation for model parameter specification guide page docs/guide/model_specification.rst
Switch the config file format to YAML. The primary reason for the change is that YAML can automatically parse the data types in the config files. This saves me a lot of coding, is more flexible since I don't have to update a parser, and probably gives the user more flexibility in the long term.
Target release v0.5.0, since that one doesn't work with existing config files anyway. Why update your config files for v0.5.0 only to also port to YAML in the future?
Example:
metadata:
version: "0.5.0"
dataset:
# Text file containing dates and images
input_file: "/home/ceholden/Documents/yatsm/examples/p022r049_input.csv"
# Input date format
date_format: "%Y%j"
# Output location
output: "/home/ceholden/Documents/landsat_stack/p022r049/images/YATSM"
# Output file prefix (e.g., [prefix]_[line].npz)
output_prefix: "yatsm_r"
# Total number of bands
n_bands: 8
# Mask band (e.g., Fmask)
mask_band: 8
# List of integer values to mask within the mask band
mask_values: [2, 3, 4, 255]
# Valid range of non-mask band data
# specify 1 range for all bands, or specify ranges for each band
valid_range: [0, 10000]
# Indices for multi-temporal cloud masking (indexed on 1)
green_band: 2
swir1_band: 5
# Use BIP image reader? If not, use GDAL to read in
use_bip_reader: true
# Directory location for caching dataset lines
cache_line_dir: "/home/ceholden/Documents/landsat_stack/p022r049/images/.yatsm_cache"
produces:
In [8]: cfg = yaml.safe_load(open('p022r049_example.yaml'))
In [9]: cfg
Out[9]:
{'dataset': {'cache_line_dir': '/home/ceholden/Documents/landsat_stack/p022r049/images/.yatsm_cache',
'date_format': '%Y%j',
'green_band': 2,
'input_file': '/home/ceholden/Documents/yatsm/examples/p022r049_input.csv',
'mask_band': 8,
'mask_values': [2, 3, 4, 255],
'n_bands': 8,
'output': '/home/ceholden/Documents/landsat_stack/p022r049/images/YATSM',
'output_prefix': 'yatsm_r',
'swir1_band': 5,
'use_bip_reader': True,
'valid_range': [0, 10000]},
'metadata': {'version': '0.5.0'}}
Pretty great!
Documentation for configuration guide page docs/guide/configuration.rst
Need example output for test validation and user examples:
Map-making utilities:
Refactor yatsm.algorithms.postprocess.robust_record to be more general. This function can be simplified to refit(yatsm, prefix, predictor), which would allow other regression methods (e.g., a cross-validated Lasso) to be run within the same context of redefining the coef and rmse within each segment of each ensemble model.
Shortcut robust_record as refit(yatsm, prefix='robust', predictor=rlm.RLM).
Also refactor the yatsm map --robust option to --result_prefix so we can output coef/prediction maps from robust_coef, lasso_coef, theilsen_coef, etc...
Config file specification:
YATSM:
...
refit_prefix: robust
refit_prediction: RLM
...
where RLM
is a pickled RLM prediction object.
Merit: This should allow change to be fit using one method of prediction while also summarizing timeseries using other prediction methods.
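A hedged sketch of what the generalized refit could look like. The record handling is omitted and OLS is a stand-in predictor, so the names and signatures here are illustrative rather than yatsm's actual API:

```python
import numpy as np


def refit(X, Y, predictor):
    """Refit one segment: return (coef, rmse) per series using `predictor`.

    predictor is any object with fit(X, y) that exposes coef_ afterwards,
    e.g. a cross-validated Lasso or a robust linear model.
    """
    coef = np.zeros((X.shape[1], Y.shape[0]))
    rmse = np.zeros(Y.shape[0])
    for i, y in enumerate(Y):
        est = predictor.fit(X, y)
        coef[:, i] = est.coef_
        rmse[i] = np.sqrt(np.mean((y - X.dot(est.coef_)) ** 2))
    return coef, rmse


class OLS(object):
    """Stand-in predictor: ordinary least squares via lstsq."""

    def fit(self, X, y):
        self.coef_ = np.linalg.lstsq(X, y, rcond=None)[0]
        return self


X = np.column_stack([np.ones(10), np.arange(10.0)])   # intercept + trend
Y = np.vstack([2 + 3 * np.arange(10.0)])               # one noiseless series
coef, rmse = refit(X, Y, OLS())
```

Stored under a prefix (robust_coef, lasso_coef, ...), each refit's output then maps directly onto the proposed --result_prefix option.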
While monitoring, we only need to update the tested models, not all of them. This is currently done for simplicity, but it would scale awfully if there are many more fitted indices than tested indices.
Proposed changes:
- fit_models should maybe use test_indices as the default if bands=None
- update_model should also take in the bands to fit (pass an np array of fit_indices)
- update self.record[self.n_record]['start'] in each loop (minor change)
- monitor updates all models after change
yatsm line does not check for TSLengthException (e.g., where there is NoData on image edges) and as such will stop when encountering a pixel that does not have enough data. This is a regression from v0.4.0 and prior releases.
Some of the changes associated with #29 and #30 mean that glmnet-python isn't a required package anymore, and the broadening of scope for YATSM means that there will eventually be quite a lot of optional dependencies. This project probably needed better documentation for how to install it anyway, but this was the main impetus for the following change:

- requirements/ directory for storing itemized requirements files
- common.txt, with a requirements.txt that points to requirements/common.txt
- dev.txt for development dependencies (documentation and tests, mostly)
- accel.txt for making things run faster (Fortran wrapper for enets/lasso, Numba for all-purpose fastness)
- pheno.txt for Eli Melaas' LTM phenology

Finally, add a conda environment file that users can point to for automating the installation. We can probably also use this for travis.ci, which is nice. See this repo, conda-env, for details about the environment.yaml file.
Error occurs when attempting to run phenology fitting:
File "/usr3/graduate/valpasq/Documents/yatsm/yatsm/cli/line.py", line 190, in line
ltm = pheno.LongTermMeanPhenology(yatsm, **cfg['phenology'])
TypeError: __init__() got an unexpected keyword argument 'year_interval'
Config location: /projectnb/landsat/projects/Massachusetts/p012r031/p012r031_config.yaml
Add capability to load past results, find new images, and make sense of whether there are new changes (or observations indicating a likelihood of a change).
Make the multitemporal masking more configurable by doing the following:
- allow setting the span of the multitemporal cloud filtering when using smooth_mask / lowess filtering
- make the threshold configurable (currently 2 * np.std(y), where y is the green or swir1 band)
- accept a str containing a mathematical formula (Patsy won't cover all of the maths we might want to do)