morrowcj / remotePARTS

remotePARTS is a set of tools for running partitioned spatiotemporal autoregression analyses on remotely sensed data sets.

License: GNU General Public License v3.0

Languages: R 83.80%, C++ 15.64%, C 0.57%
Topics: remote-sensing-in-r, statistical-analysis, big-data, autocorrelation

remotePARTS's Introduction

remotePARTS

Badges: License: GPL v3 · Lifecycle: stable · R-CMD-check

remotePARTS is an R package that contains tools for analyzing spatiotemporal data, typically obtained via remote sensing.

Description

These tools were created to test map-scale hypotheses about trends in large remotely sensed data sets, but any data with spatial and temporal variation can be analyzed. Tests are conducted using the PARTS method for analyzing spatially autocorrelated time series (Ives et al., 2021). The method's unique approach can handle extremely large data sets that other spatiotemporal models cannot, while still appropriately accounting for spatial and temporal autocorrelation. This is done by partitioning the data into smaller chunks, analyzing the chunks separately, and then combining the separate analyses into a single, correlated test of the map-scale hypotheses.

Installation

To install the package and its dependencies, use the following R code:

install.packages("remotePARTS")

To install the latest development version of this package from GitHub, use

install.packages("devtools") # ensure you have the latest devtools
devtools::install_github("morrowcj/remotePARTS")

Then, upon successful installation, load the package with library(remotePARTS).

On Windows, the latest version of Rtools is required; other systems require a C++11-capable compiler.

Example usage

For examples on how to use remotePARTS, see the Alaska vignette:

vignette("Alaska")

Note that the vignette needs to be built during installation, which may require the build_vignettes = TRUE argument when installing with install_github().
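For example (build_vignettes is a standard install_github() argument; the repository path is the same one used above):

devtools::install_github("morrowcj/remotePARTS", build_vignettes = TRUE)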

If you're having trouble installing or building the package, you may need to double-check that the R build tools are properly installed on your machine (see the official RStudio development prerequisites: https://support.posit.co/hc/en-us/articles/200486498-Package-Development-Prerequisites). To do this, use pkgbuild::has_build_tools(debug = TRUE) and pkgbuild::check_build_tools(debug = TRUE) to ensure that your build tools are up to date.
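For example, from an R session (both functions belong to the pkgbuild package named above):

pkgbuild::has_build_tools(debug = TRUE)    # returns TRUE when a working toolchain is found
pkgbuild::check_build_tools(debug = TRUE)  # raises an informative error if tools are missing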

The vignette is also available online: https://morrowcj.github.io/remotePARTS/Alaska.html.

Bugs and feature requests

If you find any bugs, have a feature or improvement to suggest, or have any other feedback about the remotePARTS package, please submit a GitHub Issue. We really appreciate any and all feedback.

References

Ives, Anthony R., et al. “Statistical inference for trends in spatiotemporal data.” Remote Sensing of Environment 266 (2021): 112678. https://doi.org/10.1016/j.rse.2021.112678

remotePARTS's People

Contributors: arives, morrowcj

remotePARTS's Issues

Default output of fitCor is massive

UPDATE

See comments below


Problem

Because I elected to save the fitted nls object (as $mod) in the output of fitCor, the results can be quite large. I was unaware of how big nls objects are.

Solution

The solution is to add a save_mod argument to fitCor(). Setting it to FALSE will save a great deal of memory for large datasets.
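A sketch of the proposed interface (save_mod is the argument suggested above; the other arguments are illustrative placeholders, not fitCor()'s documented signature):

# proposed usage: drop the fitted nls object from the output to keep the result small
corfit <- fitCor(residuals, coords, save_mod = FALSE)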

High core utilization during CRAN submission.

After my second attempt to submit to CRAN, we were flagged for the following note:

* checking examples ... [49s/6s] NOTE
Examples with CPU (user + system) or elapsed time > 5s
            user system elapsed
partGLS 43.138  1.395   2.815
Examples with CPU time > 2.5 times elapsed time
            user system elapsed ratio
partGLS 43.138  1.395   2.815 15.82

The offending function must be fitGLS_partition(), but I have been unable to replicate CRAN's results on any of my test machines. If others could run the following code with the newest version of the package, it might provide some insight.

library(remotePARTS)
data(ndvi_AK10000)
df = ndvi_AK10000[seq_len(1000), ]
pm = sample_partitions(nrow(df), npart = 3)

# ratio of total CPU time (user + system) to wall-clock time;
# a ratio well above 1 means multiple cores were in use
proc_ratio <- function(expr){
  tab = system.time(expr)
  ratio = (tab["user.self"] + tab["sys.self"]) / tab["elapsed"]
  return(ratio)
}

test_1 <- proc_ratio(fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm, data = df, 
                                      nugget = 0))
test_2 <- proc_ratio(fitGLS_partition(formula = CLS_coef ~ lat, partmat = pm, data = df, 
                                      nugget = 0))
test_3 <- proc_ratio(fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm, data = df, 
                                      nugget = NA))
test_4 <- proc_ratio(multicore_fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm, 
                                                data = df, nugget = 0, ncores = 2))
test_5 <- proc_ratio(fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm, data = df, 
                                      nugget = 0, ncores = 2, parallel = TRUE))
test_6 <- proc_ratio(fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm, data = df, 
                                      nugget = 0, ncores = 2, parallel = FALSE))

all_tests <- c(test_1, test_2, test_3, test_4, test_5, test_6)

all_tests
any(all_tests >= 2.5) # did any test exceed CRAN's 2.5x CPU-to-elapsed threshold?

fitAR_map needs parallel version

Problem

With large datasets, fitAR_map() (and perhaps fitCLS_map()) is exceptionally slow - it can take longer than the partitioned GLS, since it only runs in serial (1 thread). This is clearly an oversight.

Solution

This functionality lends itself well to parallelization: AR models are fit to each pixel independently. So, it makes sense to implement a parallel version of fitAR_map() via an ncores argument (see the sketch below).

Early on in the development of remotePARTS, I gave examples of how to apply fitAR() to a full map with the raster package. But, since this package doesn't officially support rasters, an implementation that can handle matrices (and data frames) is important.
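A minimal sketch of one possible implementation, assuming the map is an n_pixel × n_time matrix Y with a common time vector t (the real fitAR_map() interface and data layout may differ):

library(parallel)
library(remotePARTS)

# hypothetical parallel per-pixel AR fit: one fitAR() call per row of Y
par_fitAR_map <- function(Y, t, ncores = 2) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl))
  clusterEvalQ(cl, library(remotePARTS))  # load the package on each worker
  parApply(cl, Y, 1, function(x, t) fitAR(x ~ t), t = t)
}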

rho in fitAR_map()

Clay,

It might be nice to have fitAR_map() include rho in the output. I can do this if you want.

Cheers, Tony

distm_FUN for projections like UTM.

My rasters are in a UTM projection, so the distances are Euclidean. I used raster::pointDistance to calculate them.
I tried to use pointDistance for fitCor() but failed.
Any suggestions on distance functions that will work for UTM?

Thank you.
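One possibility is a plain Euclidean distance function for projected coordinates (a sketch; whether fitCor() accepts a user-defined function through distm_FUN like this is an assumption):

# Euclidean distances between two-column matrices of projected (e.g., UTM) coordinates
distm_euclid <- function(x, y = x) {
  sqrt(outer(x[, 1], y[, 1], "-")^2 + outer(x[, 2], y[, 2], "-")^2)
}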

fitGLS_* max out CPU use on Linux

When running fitGLS_opt on Linux, CPU use is maxed out (on an application server with 64 cores) or more than 80 CPUs are in use (on an application server with 112 cores). The issue occurs even when computing an example from the vignette (3,000 data points), and the computation succeeds only on the 112-core server, where it takes almost an hour. For comparison, the very same vignette is computed on a Windows machine (64 cores, 1023 GB RAM) in less than 5 minutes, using <3% CPU.

Furthermore, when the ncores argument is passed to fitGLS_partition, it is ignored on Linux and all available cores are used (not even all minus one). For bigger datasets (~240,000 pixels after taking every 3rd pixel and excluding NAs) this leads to the following error:

tryChatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11509 OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=259200 XDR=TRUE SETUPSTRATEGY=parallel

The following arguments are passed to the fitGLS_partition function (though the same error is observed with some tweaks to the passed arguments):

fitGLS_partition(formula = Est ~ 1,
                 # formula0 = Est ~ 1,
                 data = gvDM,
                 partmat = GV.pm,
                 covar_FUN = "covar_exp",
                 covar.pars = list(range = 4.805467),
                 distm_FUN = "distm_km",
                 ncross = 5,
                 ncores = 20,
                 debug = TRUE, # FALSE,
                 progressbar = TRUE)

remotePARTS is run in a Docker container created from this image: https://github.com/erfea/R.Docker4remotePARTS. The exact specification of the system architecture and setup is available here: https://github.com/morrowcj/remotePARTS/files/11434853/serverSetup_remotePARTS.txt

Limiting the number of cores available to the container leads to the use of all assigned cores, and the same error occurs on each node.

At the same time, in a Windows environment, where limiting CPU use works well, the same bigger dataset generates the following error: Error in unserialize(socklist[[n]]) : error reading from connection (with the same fitGLS_partition setup as above).

I am wondering whether anyone else has come across similar behaviour.

I am happy to share the data and run some more tests if needed.
Thank you and cheers!
Kasia

speeding up `distm_km`

Dear Clay and Tony,

I wanted to speed up the calculation of the maximum distance for fitCor().

Instead of running distm_km on the complete lat/lon matrix, one could take only the coordinates of the points with xmin, xmax, ymin, and ymax, combining them into a [4, 2] matrix.
These points are easily selected from a matrix or an sp object.

Assuming GV.df3 is a data frame with all NAs excluded and with lat/lon information in the lat and lng columns, we can extract the needed columns (and reproject from LAEA to WGS84, an additional step specific to my workflow):

library(sp)

# coordinates
coords <- data.frame(GV.df3[, c('lng', 'lat')])
# reproject LAEA (EPSG:3035) to WGS84 (EPSG:4326)
coordinates(coords) <- c('lng', 'lat')
proj4string(coords) <- CRS("+init=epsg:3035")
coordsLL <- spTransform(coords, CRS("+init=epsg:4326"))

and then calculate distm_km on the complete dataset:

start_time <- Sys.time()
max.dist <- max(distm_km(coordsLL@coords))
end_time <- Sys.time()
end_time - start_time
> Time difference of 34.43078 min
> max.dist = 39.05035

or use an alternative solution:

start_time <- Sys.time()
# select the rows holding the extreme coordinates
xmin <- coordsLL@coords[which.min(coordsLL@coords[, 'lng']), ]
xmax <- coordsLL@coords[which.max(coordsLL@coords[, 'lng']), ]
ymin <- coordsLL@coords[which.min(coordsLL@coords[, 'lat']), ]
ymax <- coordsLL@coords[which.max(coordsLL@coords[, 'lat']), ]

d.mat <- rbind(xmin, xmax, ymin, ymax)

max.dist <- max(distm_km(d.mat))
end_time <- Sys.time()
end_time - start_time
> Time difference of 3.392486 secs
> max.dist = 38.28768

However, what troubles me is that the estimates are a tad different: 39 km vs. 38.3 km. It is not a massive difference, yet both methods should, in principle, give exactly the same value, as the only difference in the alternative approach is the exclusion of 'redundant' points. Since I am using a 30-m dataset, pixel origin is also not really at play here.

I am wondering whether you deem the alternative solution valid, and whether you have, by chance, any idea why the difference in maximum distance occurs?

Thank you and cheers,
Kasia
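A related sketch: for planar coordinates, the most-distant pair of points always lies on the convex hull, so grDevices::chull() can shrink the matrix passed to distm_km without changing the answer (for lon/lat coordinates and great-circle distances this is only an approximation, as the "Need geographic convex hull" issue below explains):

# reduce to convex-hull points before computing pairwise distances
hull_idx <- chull(coordsLL@coords)
max.dist <- max(distm_km(coordsLL@coords[hull_idx, ]))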

Variance instead of SE

In fitAR, the returned values of SE are, in fact, variances.

  • The code should be something like SE = diag(s2beta)^0.5.
  • Check whether other versions of fitX have the same problem.

fitAR() results & estimates for the same value in the time series

Describe the bug
Hello!

I have run the fitAR() model over a raster stack and have now started exploring the results.
In that raster stack, there is a bunch of pixels that have the same value across all layers. For those pixels, I get a tiny coefficient and a low p-value (see the example below). I know that the model "estimates parameters for the regression model with AR(1) random error terms", but I find this result confusing: it is a false-positive output. The estimate reflects the random error, so it is an artificial result.
Of course, I can round the coefficient estimate to 0, but the p-value still exists, and I do not see a point in generating that result at all.
For the same raster stack and those stable pixels, the Mann-Kendall test gives a coefficient of 0 and a p-value of 1.

Maybe when fitAR() is applied to a vector consisting of identical values, the fitting should be skipped and the output should just give coef = 0 and p-value = 1? Unless that would further affect the spatiotemporal part of the modeling?
Maybe adding a condition like skip.same = TRUE/FALSE would, for TRUE, yield coef = 0 and p-value = 1, and for FALSE, a regular model fit (see the sketch below).
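A minimal sketch of the proposed guard, written as a wrapper (the returned list is illustrative only, not fitAR()'s actual output class):

# skip fitting when the response is constant; otherwise fall through to fitAR()
safe_fitAR <- function(formula, ...) {
  x <- eval(formula[[2]], environment(formula))  # response values from the formula
  if (var(x, na.rm = TRUE) == 0) {
    return(list(coef = 0, p.value = 1))  # placeholder "no trend" result
  }
  fitAR(formula, ...)
}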

Thanks!
To Reproduce

x <- rep(46, 20)
t <- 1:20
AR.time <- fitAR(x ~ t)
Warning messages:
1: In optimize(function(par) fn(par, ...)/con$fnscale, lower = lower,  :
  NA/Inf replaced by maximum positive value
2: In optimize(function(par) fn(par, ...)/con$fnscale, lower = lower,  :
  NA/Inf replaced by maximum positive value
summary(AR.time)

Call:
fitAR(formula = x ~ t)
Residuals:
       Min         1Q     Median         3Q        Max 
-3.553e-14 -2.132e-14 -7.105e-15  1.776e-15  1.421e-14 
Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept)  4.600e+01  7.335e-29 5.371e+15  < 2e-16 ***
t           -2.490e-15  5.157e-31 3.467e+00  0.00275 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Mean squared error: 0
Log-likelihood: 577.7417

Desktop (please complete the following information):

  • OS: Platform: x86_64-w64-mingw32/x64 (64-bit); Running under: Windows 10 x64 (build 19045)
  • R: R version 4.2.2 (2022-10-31 ucrt)
  • Package version: 1.0.0

Add fitGLS_opt() recommendation to documentation

Tony identified numerous examples where fitCor() does not work but fitGLS_opt() does. He recommends that we state a preference for the latter in our documentation.

Package needs unit tests

Currently, the package has build tests that ensure the code can run, but no formal tests that the functions behave properly are implemented yet. These are needed to ensure that updates to the package do not break its functionality.

I wrote this package before I understood the importance of unit tests. I now know how necessary they are, and they should be implemented ASAP (a starting-point sketch is below).
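A minimal testthat sketch (the file name and expectations are illustrative; it assumes only that fitAR(x ~ t) runs, as shown in the fitAR() issue above):

# tests/testthat/test-fitAR.R
library(testthat)
library(remotePARTS)

test_that("fitAR runs on a simple linear trend", {
  t <- 1:30
  x <- 2 + 0.5 * t + rnorm(30, sd = 0.1)
  expect_error(fitAR(x ~ t), NA)  # NA means: expect no error
})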

improve memory of parallel fitGLS_partition

Problem

The parallel partitioned GLS is driven by the function MC_GLSpart(). This function uses foreach(i = 1:npart, ...) %dopar% {...} syntax, which imports the entire dataset into each instance (thread). That leads to memory usage snowballing quite quickly (ncores $\times$ the size of the data object).

Solution

foreach() accepts an iterator that allows data to be constructed on the fly. In short, this could allow only the data from the partition of interest to be imported into a given instance. The upshot is that total memory usage shouldn't be much greater than the total size of the original object. So, we should swap i = 1:npart for an iterator that provides partitions (see the sketch below).
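A sketch of that swap, using the standard iterators protocol (MC_GLSpart()'s real internals may differ; fit_one_partition() is a placeholder for the per-chunk work, and indexing partitions as columns of partmat is an assumption):

library(foreach)
library(iterators)

# custom iterator: yields one partition's rows at a time, so only that chunk
# (not the full data set) is serialized to each worker
partition_iter <- function(data, partmat) {
  i <- 0L
  nextEl <- function() {
    i <<- i + 1L
    if (i > ncol(partmat)) stop("StopIteration", call. = FALSE)
    data[partmat[, i], , drop = FALSE]
  }
  structure(list(nextElem = nextEl), class = c("abstractiter", "iter"))
}

# results <- foreach(chunk = partition_iter(df, pm)) %dopar% fit_one_partition(chunk)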

Need geographic convex hull

Problem

Currently, max_dist(coords) [branch "maxdist_fitcor"] fits a convex hull by assuming that the coords are in Euclidean space. This has major implications for some global scenarios. For example, this method would determine that the longitude-latitude pairs [-150, 0] and [150, 0] are more distant than [-90, 0] and [90, 0], even though the opposite is true.

Solution

I don't know of a good solution to this: efficiently finding the "hull" among many points on the surface of a sphere.

Fatal error that aborts R

Occasionally, using some functions in this package can cause a fatal error that aborts R.

As far as I've been able to tell, these errors occur when inputs to the functions are improper (e.g., incorrect dimensions) or when there is insufficient memory.

They appear to only occur with functions that use C++ (via Rcpp).

Default values of fitGLS_opt() allow for negative nugget.

Problem

By default, no transformations or limits are placed on the parameters for fitGLS_opt(). This allows the nugget to be any value that returns a valid likelihood - including negative values. Negative values of the nugget are nonsensical, though.

Solution

The most straightforward solution would be to limit the nugget to between 0 and 1 by default, since this makes the most sense from the semivariogram perspective. This is easily achieved through the trans and backtrans arguments and the logit function:

fitGLS_opt(formula, data, coords,
    trans = list(nugget = function(p){log(p/(1-p))}),       # logit: (0, 1) -> real line
    backtrans = list(nugget = function(l){1/(1+exp(-l))}),  # inverse logit: real line -> (0, 1)
    ...
)

Note that this solution is already presented in the examples for the function, but should be implemented by default to prevent nonsensical results.
