A previous version with only Gaussian cost function is available here.

gfpop Vignette

Vincent Runge

LaMME, Evry University

August 22, 2019 (version 2)

The Graph-constrained Change-point problem

Quick Start

Some examples

Graph construction

Supplementary R functions

More on the gfpop function and its C++ structure

The Graph-constrained Change-point problem

General description

gfpop is an R package developed to complete parametric change-point detection in univariate time series constrained to a graph structure. Constraints are imposed to the sequence of inferred parameters (most of the time a mean parameter) of consecutive segments and related to the direction and/or the magnitude of the mean changes.

Change-point detection is performed using the functional pruning optimal partitioning method (fpop) based on an exact dynamic programming algorithm. Cf paper. On optimal multiple changepoint algorithms for large data. This algorithmic strategy can be seen as an extented Viterbi algorithm for a Hidden Markov Model with continuous parameter states. A presentation of the generalized fpop (gfpop) algorithm is available in the paper A log-linear time algorithm for constrained changepoint detection.

In the main gfpop function the user chooses a global parametric model for the change-point problem to solve (change in "mean", "variance", "exp", "poisson" or "negbin" distribution in type variable) and a graph structure. To run the function the user also gives some data to segment potentially associated with a weights vector of same length.

gfpop <- function(data, mygraph, type = "mean", weights = NULL)

The algorithm of our package is designed to consider a large variety of constraints. These constraints are modelled by a graph. At each time a number of states is accessible, these states are the nodes of the graph. Possible transitions between states at time and are represented by the edges of the graph. Each edge is associated to main elements: a constraint (for example ), some parameters associated to the constraint, a penalty (possibly null) and robust parameters to use Huber or biweight losses only on the considered edge.

In addition to the edge definition, the nodes can be constrained to a range of values (means) in the inference. We also can specify starting and ending nodes. In our implementation the transitions are not time-dependent except for starting and ending steps.

The edges of the graph can be of type "null", "std", "up", "down" or "abs" with the following meaning:

"null" edge : there is no change-point introduced. We stay on the same segment. "null" corresponds to the constraint . The value does not change (or we have an exponential decay if );
"std" edge : no contraint, the next segment can have any mean;
"up" edge : the next segment has a greater mean (we can also force the size of the gap to be greater than a minimal value). "up" corresponds to the constraint ;
"down" edge : the next segment has a lower mean (wan also can force the size of the gap to be greater that a minimal value). "down" corresponds to the constraint ;
"abs" edge : the absolute value of the difference of two consecutive means is greater than a given parameter.

A nonnegative internal parameter can thus be associated to an edge (in "up", "down" and "abs""). The "null" edge refers to an absence of change-point. This edge can be used between different states to constraint segment lengths. Thus our package includes the segment neighborhood algorithm for which the number of change-points is fixed. More details on graph construction are given in the section Graph construction.

The non-constrained changepoint detection problem

The change-point vector defines the segments , with fixed bounds and . We use the set to define the non-constrained minimal global cost given by

where is a penalty parameter and is the minimal cost over the segment . The penalty is understood as an additional cost when introducing a new segment. The argminimum of this quantity gives a vector containing the last position of each segment (if we do not consider ). The quantity is the solution of the non-constrained optimal partitioning method.

In our setting, the cost is the result of the minimization of a cost function with additive property:

with the argminimum defining the infered mean of the i+1-th segment with . Additivity of the cost (the decomposition) is guaranteed as we will use costs deriving from a likelihood. has in our package possible basis decompositions:

We can associate a current mean to each data point and we write a cost on a segment as a result of a constrained minimization:

so that we get another description of the objective function :

From this expression, it will be possible to impose some constraints on the vector of means .

Graph-constrained problem

is the indicator function. This previous approach can be generalized to complex constraints on consecutive means using a graph structure. We define the transition graph as a directed acyclic graph with the following properties:

Nodes are indexed by time and state. for node with time and state . The states are elements of the set ;
All the nodes have time in except for the unique starting node and the unique ending node , where denotes an undefinite state;
Edges are transitions between "time-consecutive" vertices of type and which gives us the edge for ;
Each edge is associated to a function constraining consecutive means and by the indicator function (for example ) and a penalty .

The next table summarizes all the possible constraints encoded into the gfpop package.

constraints	,
null
std
up
down
abs

We define a path of the graph as a collection of vertices with and and for and . Morever, the path is made of edges denoted with . A vector verifies the path if for all , we have (valid constraint). We write to say that the vector verifies the path . The formulation of our graph-constrained problem is then the following:

In the gfpop package, the edges for are not time-dependent. In this case, we redefine a graph by the label of its vertices and its set of oriented edges with defining an edge going from node to node with indicator function and penalty .

In most applications, the set of edges always contains edges of type for all with indicator function as it corresponds to an absence of constraint on segment lengths.

The main algorithm of the package, the gfpop function, returns the change-point vector defined as , with the argminimum of in and . we also give the possibility to restrict the set of valid paths by imposing a starting and/or an ending ,node and contraint the range of infered means, replacing by in the definition of .

Quick Start

we present a basic use of the main functions of the gfpop package. More details about optional arguments are given in later sections.

We install the package from Github:

#devtools::install_github("vrunge/gfpop")
library(gfpop)

We simulate some univariate gaussian data (n = 1000 points) with relative change-point positions 0.1, 0.3, 0.5, 0.8, 1 and means 1, 2, 1, 3, 1 with a variance equal to 1.

n <- 1000
myData <- dataGenerator(n, c(0.1,0.3,0.5,0.8,1), c(1,2,1,3,1), sigma = 1)

We define the graph of constraints to use for the dynamic programming algorithm. A simple case is the up-down constraint with a penalty here equal to a classic 2 log(n).

myGraph <- graph(penalty = 2*log(n), type = "updown")

The gfpop function gives the result of the segmentation using myData and myGraph as parameters. We choose a gaussian cost.

gfpop(data = myData, mygraph = myGraph, type = "mean")

## changepoints
## [1]  100  300  500  799 1000
## 
## states
## [1] "Dw" "Up" "Dw" "Up" "Dw"
## 
## forced
## [1] 0 0 0 0
## 
## parameters
## [1] 1.031384 2.130883 1.047181 2.961262 1.005459
## 
## globalCost
## [1] 984.3485
## 
## attr(,"class")
## [1] "gfpop"

The vector changepoints gives the last index of each segment. It always ends with the length of the vector vectData.

The vector states contains the states in which lies each mean. The length of this vector is the same as the length of changepoint.

The vector forced is a boolean vector. A forced element means that two consecutive means have been forced to satisfy the constraint. For example, the "up" edge with parameter is forced if .

The vector parameters contains the inferred means/parameters of the successive segments.

The number globalCost is equal to , the overall cost of the segmented data.

Some examples

Isotonic regression

The isotonic regression infers a sequence of nondecreasing means.

n <- 1000
mydata <- dataGenerator(n, c(0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1), c(0, 0.5, 1, 1.5, 2, 2.5, 3), sigma = 1)
myGraphIso <- graph(penalty = 2*log(n), type = "isotonic")
gfpop(data =  mydata, mygraph = myGraphIso, type = "mean")

## changepoints
## [1]  211  383  713 1000
## 
## states
## [1] "Iso" "Iso" "Iso" "Iso"
## 
## forced
## [1] 0 0 0
## 
## parameters
## [1] 0.2898816 1.2191021 2.1403436 2.8936415
## 
## globalCost
## [1] 1025.616
## 
## attr(,"class")
## [1] "gfpop"

In this example, we use in gfpop function a robust biweight gaussian cost with K = 1 and the min parameter in order to infer means greater than 0.5.

Fixed number of change-points

This algorithm is called segment neighborhood in the change-point litterature. In this example, we fixed the number of segments at with an isotonic constraint. The graph contains two "up" edges with no cycling.

n <- 1000
mydata <- dataGenerator(n, c(0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1), c(0, 0.5, 1, 1.5, 2, 2.5, 3), sigma = 1)
beta <- 0
myGraph <- graph(
  Edge(0, 1,"up", beta),
  Edge(1, 2, "up", beta),
  Edge(0, 0, "null"),
  Edge(1, 1, "null"),
  Edge(2, 2, "null"),
  StartEnd(start = 0, end = 2))

gfpop(data =  mydata, mygraph = myGraph, type = "mean")

## changepoints
## [1]  298  600 1000
## 
## states
## [1] "0" "1" "2"
## 
## forced
## [1] 0 0
## 
## parameters
## [1] 0.5195712 1.7562499 2.6745127
## 
## globalCost
## [1] 1061.791
## 
## attr(,"class")
## [1] "gfpop"

Robust up-down with constrained starting and ending states

In presence of outliers we need a robust loss (biweight). We can also force the starting and ending state and a minimal gap between the means (here equal to 1)

n <- 1000
mydata <- dataGenerator(n, c(0.1,0.3,0.5,0.8,1), c(0,1,0,1,0), sigma = 1) + 5*(rbinom(n, 1, 0.05)) - 5*(rbinom(n, 1, 0.05))
beta <- 2*log(n)
myGraph <- graph(
  Edge(0, 1, "up", penalty = beta, gap = 1, K = 3),
  Edge(1, 0, "down", penalty = beta, gap = 1, K = 3),
  Edge(0, 0, "null"),
  Edge(1, 1, "null"),
  StartEnd(start = 0, end = 0))
gfpop(data =  mydata, mygraph = myGraph, type = "mean")

## changepoints
## [1]  138  305  493  818 1000
## 
## states
## [1] "0" "1" "0" "1" "0"
## 
## forced
## [1] 1 1 0 1
## 
## parameters
## [1] 0.01850678 1.01850678 0.01850678 1.02272647 0.02272647
## 
## globalCost
## [1] 1079.697
## 
## attr(,"class")
## [1] "gfpop"

If we skip all these constraints and use a standard fpop algorithm, the result is the following

myGraphStd <- graph(penalty = 2*log(n), type = "std")
gfpop(data =  mydata, mygraph = myGraphStd, type = "mean")

## changepoints
##  [1]   17   18   31   32   33   48   51   52   65   66   88   89  111  112  129  130  149  150  174  176  194  195  205  206  220  221  227  229
## [29]  284  285  340  341  358  359  407  408  424  425  428  442  449  450  473  474  479  480  486  487  510  512  549  550  551  565  566  570
## [57]  571  574  590  591  595  596  613  614  627  629  639  640  701  702  778  779  796  797  818  847  848  873  875  878  882  928  929  933
## [85]  934  953  955  957  958  962  963 1000

## states
##  [1] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [24] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [47] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [70] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## 
## forced
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## 0 0
## [72] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 
## parameters
##  [1]  0.37883666 -5.60737627  0.14265729 -4.92600986  5.76837573  0.46081049 -2.07025038  6.85184085 -0.08186373  6.07962475  0.16536163
## [12] -6.72115878  0.11779333 -6.03851292  0.38897505  6.36318506  1.22328631 -4.84795400  1.37996121  4.93788071  0.28409636  6.30089144
## [23]  0.79739606 -5.77855389  1.43330020  6.66018030  0.79843808 -4.43936467  1.17189869  7.11750913  0.30445508  5.63498352  0.07145084
## [34] -6.13223532 -0.10524605  6.53335020  0.29761434  4.42059093 -3.52301587 -0.41869828  1.60922633 -6.59086009  0.07051877  4.87531990
## [45] -2.14802188  7.64909748 -0.03687840 -7.05437136  0.62557676  5.83147112  1.26743138  6.64006018 -4.30260774  0.94530626 -4.62237889
## [56]  1.24394216  8.62245556 -1.51272772  1.14038368  6.62490433  0.97797273  6.91505783  0.45741709 -5.00510212  0.94368809 -4.22010595
## [67]  0.82243663  6.88636862  1.06272055 -4.38864051  0.97078589 -4.92447920  1.46891367 -3.97447863  1.27138347 -0.25178268  6.50770084
## [78]  0.66843446  4.86486110 -0.36798139 -3.52991185  0.28266676 -5.37296951  1.26764318  6.32132912 -0.29253887  5.10483206 -0.24247171
## [89]  7.11097890 -0.62541613 -5.62987116  0.27378843
## 
## globalCost
## [1] 1747.779
## 
## attr(,"class")
## [1] "gfpop"

abs edge

With a unique "abs" edge, we impose a difference between the means of size at least 1.

n <- 10000
mydata <- dataGenerator(n, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1), c(0, 1, 0, 2, 1, 2, 0, 1, 0, 1), sigma = 0.5)
beta <- 2*log(n)
myGraph <- graph(
  Edge(0, 0,"abs", penalty = beta, gap = 1),
  Edge(0, 0,"null"))
gfpop(data =  mydata, mygraph = myGraph, type = "mean")

## changepoints
##  [1]  1000  1999  3000  4000  5000  6000  7000  8002  9000 10000
## 
## states
## [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 
## forced
## [1] 0 1 0 0 1 0 1 1 0
## 
## parameters
## [1] 0.008077166 1.023073852 0.023073852 1.994958839 0.990916305 1.990916305 0.001495140 1.001495140 0.001495140 1.024786144
## 
## globalCost
## [1] 2494.115
## 
## attr(,"class")
## [1] "gfpop"

Notice that some of the edges are forced, the vector forced contains non-zero values.

Exponential decay

The null edge corresponds to an exponential decay state if its parameter is not equal to 1.

n <- 1000
mydata <- dataGenerator(n, c(0.2, 0.5, 0.8, 1), c(5, 10, 15, 20), sigma = 1, gamma = 0.966)
beta <- 2*log(n)
myGraphDecay <- graph(
  Edge(0, 0, "up", penalty = beta),
  Edge(0, 0, "null", 0, decay = 0.966)
  )
g <- gfpop(data =  mydata, mygraph = myGraphDecay, type = "mean")
g

## changepoints
## [1]  200  500  800 1000
## 
## states
## [1] "0" "0" "0" "0"
## 
## forced
## [1] 0 0 0
## 
## parameters
## [1] 0.0052340381 0.0003157659 0.0004575503 0.0198921241
## 
## globalCost
## [1] 935.4619
## 
## attr(,"class")
## [1] "gfpop"

and we plot the result

gamma <- 0.966
len <- diff(c(0, g<img src="/tex/46024bfe5198e5e1ff626011648c9edf.svg?invert_in_darkmode&sanitize=true" align=middle width=558.8873268pt height=24.65753399999998pt/>parameters[i]*c(1, cumprod(rep(1/gamma,len[i]-1))))}
signal <- rev(signal)

ylimits <- c(min(mydata), max(mydata))
plot(mydata, type ='p', pch ='+', ylim = ylimits)
par(new = TRUE)
plot(signal, type ='l', col = 4, ylim = ylimits, lwd = 3)

Graph construction

In the gfpop package, graphs are represented by a dataframe with 9 features and build with the R functions Edge, Node, StartEnd and graph.

emptyGraph <- graph()
emptyGraph

## [1] state1    state2    type      penalty   parameter K         a         min       max
## <0 rows> (or 0-length row.names)

state1 is the starting node of an edge, state2 its ending node. type is one of the available edge type ("null", "std", "up", "down", "abs"). penalty is a nonnegative parameter: the additional cost to consider when we move within the graph using a edge (or stay on the same node). parameter is annother nonnegative parameter, a characteristics of the edge, depending of its type (it is a decay if type is "null" and a gap otherwise). K and a are robust parameters. min and max are used to constrain the rang of value for the node parameter.

We add edges into a graph as follows

myGraph <- graph(
  Edge("E1", "E1", "null"),
  Edge("E1", "E2", "down", 3.1415, gap = 1.5)
)
myGraph

##  state1 state2 type parameter penalty   K   a min max
## 1     E1     E1 null       1.0       0 Inf Inf  NA  NA
## 2     E1     E2 down       1.5       0 Inf Inf  NA  NA

we can only add edges to this dataframe using the object Edge.

The graph can contain information on the starting and/or ending edge to use with the StartEnd function.

beta <- 2 * log(1000)
myGraph <- graph(
  Edge("Dw", "Dw", "null"),
  Edge("Up", "Up", "null"),
  Edge("Dw", "Up", "up", penalty = beta, gap = 1),
  Edge("Dw", "Dw", "down", penalty = beta),
  Edge("Up", "Dw", "down", penalty = beta),
  StartEnd(start = "Dw", end = "Dw"))
myGraph

## state1 state2  type parameter  penalty   K   a min max
## 1     Dw     Dw  null         1  0.00000 Inf Inf  NA  NA
## 2     Up     Up  null         1  0.00000 Inf Inf  NA  NA
## 3     Dw     Up    up         1 13.81551 Inf Inf  NA  NA
## 4     Dw     Dw  down         0 13.81551 Inf Inf  NA  NA
## 5     Up     Dw  down         0 13.81551 Inf Inf  NA  NA
## 6     Dw   <NA> start        NA       NA  NA  NA  NA  NA
## 7     Dw   <NA>   end        NA       NA  NA  NA  NA  NA

Some graphs are often used: they are defined by default in the graph function. To use these graphs, we specify a string type equal to "std", "isotonic", "updown" or "relevant". For example,

myGraphIso <- graph(penalty = 12, type = "isotonic")
myGraphIso

##   state1 state2 type parameter penalty   K   a min max
## 1    Iso    Iso null         1       0 Inf Inf  NA  NA
## 2    Iso    Iso   up         0      12 Inf Inf  NA  NA

The function Node can be used to restrict the range of value for parameter associated to a node (called also a vertex). For example the following graph is an isotonic graph with inferred parameters between 0 et 1 only.

myGraph <- graph(
  Edge("Up", "Up", "up", penalty = 3.1415),
  Edge("Up", "Up"),
  Node("Up", min = 0, max = 1)
  )
myGraph

##   state1 state2 type parameter penalty   K   a min max
## 1     Up     Up   up         0  3.1415 Inf Inf  NA  NA
## 2     Up     Up null         1  0.0000 Inf Inf  NA  NA
## 3     Up     Up node        NA      NA  NA  NA   0   1

Supplementary R functions

Data generator function

the dataGenerator function is used to simulate data-points from a distribution of type equal to “mean”, “poisson”, “exp”, “variance” or “negbin”. Standard deviation parameter sigma and decay gamma are specific to the Gaussian mean model. size is linked to the R “rnbinom” function from R stats package.

Standard deviation estimation

We often need to estimate the standard deviation from the observed data to normalize the data or choose the edge penalties. The sdDiff returns such an estimation with the default HALL method [Hall et al., 1990] well suited for time series with change-points.

More on the gfpop function and its C++ structure

To be done.

muskanmahajan37 / gfpop Goto Github PK

gfpop's Introduction

gfpop Vignette

Vincent Runge

LaMME, Evry University

August 22, 2019 (version 2)

The Graph-constrained Change-point problem

General description

The non-constrained changepoint detection problem

Graph-constrained problem

Quick Start

Some examples

Isotonic regression

Fixed number of change-points

Robust up-down with constrained starting and ending states

abs edge

Exponential decay

Graph construction

Supplementary R functions

Data generator function

Standard deviation estimation

More on the gfpop function and its C++ structure

gfpop's People

Contributors

Recommend Projects

Recommend Topics

Recommend Org