iqss / amelia

Amelia: A Package for Missing Data

Home Page: http://gking.harvard.edu/amelia

R 93.37% C++ 4.75% C 0.48% TeX 1.40%

amelia's Issues

Error in as.POSIXct.numeric(value) : 'origin' must be supplied

Y2015W1_m1_NA <- Y2015W1_m1 %>% 
  dplyr::select(index, BidOpen, BidHigh, BidLow, BidClose, AskOpen, AskHigh, AskLow,  AskClose) %>% 
  prodNA(noNA = 0.01)
> Y2015W1_m1_NA
# A tibble: 7,200 x 9
   index               BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose
   <dttm>                <dbl>   <dbl>  <dbl>    <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 2015-01-04 22:00:00    120.    121.   120.     121.    121.    121.   121.     121.
 2 2015-01-04 22:01:00    121.    121.   120.     121.    121.    121.   121.     121.
 3 2015-01-04 22:02:00    121.    121.   121.     121.    121.    121.   121.     121.
 4 2015-01-04 22:03:00    121.    121.   121.     121.    121.    121.   121.     121.
 5 2015-01-04 22:04:00    121.    121.   120.     121.    121.    121.   121.     121.
 6 2015-01-04 22:05:00    121.    121.   120.     120.    121.    121.   121.     121.
 7 2015-01-04 22:06:00    120.    121.   120.     121.    121.    121.   120.     121.
 8 2015-01-04 22:07:00    121.    121.   121.     121.    121.    121.   121.     121.
 9 2015-01-04 22:08:00    121.    121.   121.     121.    121.    121.   121.     121.
10 2015-01-04 22:09:00    121.    121.   121.     121.    121.    121.   121.     121.
# ... with 7,190 more rows
> Y2015W1_m1_NA %>% str
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	7200 obs. of  9 variables:
 $ index   : POSIXct, format: "2015-01-04 22:00:00" "2015-01-04 22:01:00" "2015-01-04 22:02:00" ...
 $ BidOpen : num  120 121 121 121 121 ...
 $ BidHigh : num  121 121 121 121 121 ...
 $ BidLow  : num  120 120 121 121 120 ...
 $ BidClose: num  121 121 121 121 121 ...
 $ AskOpen : num  121 121 121 121 121 ...
 $ AskHigh : num  121 121 121 121 121 ...
 $ AskLow  : num  121 121 121 121 121 ...
 $ AskClose: num  121 121 121 121 121 ...
> Y2015W1_m1_NA %>% amelia(ts = 'index')
-- Imputation 1 --

  1  2
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
> Y2015W1_m1_NA %>% amelia(idvars = 'index')
-- Imputation 1 --

  1  2
Error in as.POSIXct.numeric(value) : 'origin' must be supplied

Following https://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf, I tried to impute the missing values in a tibble, but amelia() fails with the date 'origin' error shown above.
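A possible workaround can be sketched as follows, assuming the error arises because the POSIXct column is turned into plain numbers internally and then converted back without an origin. The data frame below is a hypothetical stand-in for the data in the report, and the amelia() call is only indicated in a comment, not run.

```r
# Hypothetical stand-in for the issue's data: a POSIXct index plus one numeric column.
df <- data.frame(
  index   = as.POSIXct("2015-01-04 22:00:00", tz = "UTC") + 60 * (0:4),
  BidOpen = c(120.1, NA, 120.9, 121.0, NA)
)

# Remember the time zone, then store the timestamp as seconds since 1970-01-01.
tz <- attr(df$index, "tzone")
df$index <- as.numeric(df$index)

# ... run the imputation on `df` here, for example amelia(df, ts = "index") ...

# Restore the POSIXct class afterwards, supplying the origin explicitly.
df$index <- as.POSIXct(df$index, origin = "1970-01-01", tz = tz)
```

If the input is a tibble, calling as.data.frame() on it first may also be needed; that is an assumption based on the other tibble reports in this tracker.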

Data not found when using moList and parallel

From a user email:

I am currently trying to use the overimputation feature of "Amelia" in conjunction with parallelized computation in the "snow" package. Unfortunately, my computations fail with error code 3 and message "The setting for the data argument doesn't exist." On the CRAN, you're listed as the maintainer, so I was hoping that you might be able to help me sort out my problem.

Some additional details: Following the examples in the documentation, I generate a "molist" with moPrep(), which I then pass to amelia(). When run without parallelization, this works perfectly fine, and amelia() returns the imputed data as expected. However, in parallel, amelia() appears unable to find the data set that is pointed to in the "molist" generated by moPrep(). There appears to be no difference between the "molist" I get when running with/without parallelization (i.e., both have a symbolic reference to the data in the $data slot). However, in parallel, amelia() can't find it, whereas it can when run without parallelization.

This is likely due to scoping issues with "eval" and parallel. We could try to specify the frame in the amelia.molist function.
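The suspected scoping problem can be illustrated in miniature. This is a sketch under assumptions, not Amelia's code: prep() and resolve() are hypothetical helpers that mimic storing a symbolic reference to the data and evaluating it later in a given environment.

```r
# Hypothetical miniature of the pattern: the prep step stores a *name*, not the data.
prep <- function(x) list(data = substitute(x))

# Resolve the stored name in an explicitly supplied environment;
# return NULL when the name cannot be found there.
resolve <- function(mo, envir) {
  tryCatch(eval(mo$data, envir = envir), error = function(e) NULL)
}

# The environment where the user created the data.
caller_env <- new.env()
assign("my_data", data.frame(a = 1:3), envir = caller_env)
mo <- eval(quote(prep(my_data)), caller_env)

# A fresh environment, standing in for a parallel worker that never saw `my_data`.
worker_env <- new.env(parent = emptyenv())

in_worker <- resolve(mo, worker_env)  # lookup fails, so NULL
in_caller <- resolve(mo, caller_env)  # lookup succeeds
```

Passing the right frame to eval(), or storing the data itself rather than its name, would both avoid the failed lookup in this toy version.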

chol() throws std::runtime_error on singular matrices causing R to terminate

When testing a piece of software I am working on that uses Amelia I came across an error where R suddenly terminates. After some investigation it turns out that chol() throws a std::runtime_error when the matrix it tries to decompose is singular.

I have attached a file which causes the problem when running Amelia with the command

test <- amelia(temp_data, m = 5, ts = "dates", p2s = 0, idvars = NULL, cs = NULL, parallel="no", lags = colnames(temp_data)[-1])

I have also forked the repository to https://github.com/jonlachmann/Amelia where a fix is applied. Please let me know if you want me to create a pull request or if you find a better solution. My solution is quite simple: I added try/catch blocks around the chol() calls, and on failure the C++ code returns R_NilValue. R then handles this the same way it already handles the case where Amelia determines, before calling the C++ function, that the matrix is singular.

ameliacrash.RData.zip
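The fix described above lives in the C++ code. As an R-level analogue only, the same idea (catch the failed decomposition and return a null value for the caller to handle) might look like the sketch below; safe_chol() is a hypothetical helper, not part of Amelia.

```r
# Hypothetical helper: return NULL instead of raising an error when
# the Cholesky decomposition fails.
safe_chol <- function(m) {
  tryCatch(chol(m), error = function(e) NULL)
}

singular <- matrix(1, nrow = 2, ncol = 2)  # rank 1, so not positive definite
regular  <- diag(2)                        # identity matrix, decomposes fine

res_singular <- safe_chol(singular)
res_regular  <- safe_chol(regular)

# The caller can then branch on the NULL result instead of crashing.
if (is.null(res_singular)) message("matrix is singular, skipping this step")
```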

overimpute() not using bounds parameter of amelia() output structure

Bug, problem, or intended behavior?

"overimpute()
This function temporarily treats each observed value in var as missing and imputes that value based on the imputation model of output. "
^^ Quote from R-Help.

The bounds parameter is applied by amelia() to all NA values of the specified dataset column.
However, overimpute() ignores the bounds stored in the amelia() output.

The problem is that amelia() can be constrained in a specific way (for example, to impute only values between 40 and 50),
but overimpute() does not respect these boundaries.

In any case, this behavior is not documented.

Demosource:

set.seed(1234)
x.out_overimpute_bug<-amelia(africa,cs=2,ts=1 ,bounds=rbind(c(5,40,50)) ,lags="infl" )
test<-overimpute(x.out_overimpute_bug,var=c(5))
test$lower.overimputed[c(100:115)]
test$mean.overimputed[c(100:115)]
test$upper.overimputed[c(100:115)]

^^ values are not between 40 and 50 ..

tscsPlot() - Missing warning about initialization of random number generator and behavior or is it a bug.

Bug or missing warning:

"tscsPlot()
Plots a time series for a given variable in a given cross-section and provides confidence intervals for the imputed values."
^^Quote from R help

My first thought was: great, tscsPlot() plots my imputed values, so there is no need for a separate ggplot().

The function takes as input the output of the imputation process from amelia().
In that situation I would normally expect repeated calls of the same function (tscsPlot) to produce identical output.
That is not the case: the output does not depend only on the amelia() output.
Internal functions of amelia() and the random number generator are involved too.
The question is: what is the information gain if the values always change, or is this a bug?

Currently the same result is only obtained if the random number seed is set before every call.

Missing:

First, a warning in the documentation about this behavior (random numbers).
Second, a warning in the documentation that the plotted (mean) values are not equal to the imputed values from amelia().

Example Source:

set.seed(1234)
tcc<-amelia(africa,cs="country",ts="year")

set.seed(1234)
tscsPlot(output=tcc,cs="Cameroon",var="trade")
set.seed(4711)
tscsPlot(output=tcc,cs="Cameroon",var="trade")

(mean) imputed values are shifting

tscsPlot(output=tcc,cs="Cameroon",var="trade",ylim=c(40,60))
tscsPlot(output=tcc,cs="Cameroon",var="trade",ylim=c(40,60))
tscsPlot(output=tcc,cs="Cameroon",var="trade",ylim=c(40,60))
tscsPlot(output=tcc,cs="Cameroon",var="trade",ylim=c(40,60))
tscsPlot(output=tcc,cs="Cameroon",var="trade",ylim=c(40,60))
tscsPlot(output=tcc,cs="Cameroon",var="trade",ylim=c(40,60))

tscsPlot() can't handle tibble objects

Per an email on the amelia list from Jonathan Zadra, the following code doesn't work:

library(Amelia)
library(tibble)
data(africa)
africa <- as_tibble(africa)
a.out <- amelia(africa, ts = "year", cs = "country")
tscsPlot(a.out, cs = "Burundi", var = "trade")

We get the following error:

Error: Unsupported use of matrix or array for column indexing

Imputation errors when character variables used in noms

MWE:

library(Amelia)
data(freetrade)

freetrade$signed <- ifelse(freetrade$signed, "yes", "no")
out <- amelia(freetrade, ts = "year", cs = "country", noms = "signed")

with error:

-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Error in yy %*% unique(na.omit(x.orig[, i])) : non-conformable arguments
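Until this is fixed, one thing to try is recoding the character column before calling amelia(). The sketch below shows two recodings on a hypothetical column that mimics `signed`; whether either avoids the error has not been verified against the package here.

```r
# Hypothetical data mimicking the recoded column from the report.
d <- data.frame(signed = c("yes", "no", NA, "yes"), stringsAsFactors = FALSE)

# Option 1: store the categories as a factor.
d$signed_factor <- factor(d$signed, levels = c("no", "yes"))

# Option 2: store the categories as integer codes (1 = "no", 2 = "yes").
d$signed_code <- as.integer(d$signed_factor)

# The mapping from code back to label stays available through the factor levels.
labels_back <- levels(d$signed_factor)[d$signed_code]
```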

Imputed values depend on platform and R version

The imputed values are different on OS X with R 3.5.0 than they are on other platforms or other R versions. To demonstrate:

On linux with R 3.5.0

library(Amelia)
data(africa)
set.seed(1)
a.out <- amelia(africa[ , c("infl", "trade")],
                     m = 1,
                     boot.type = "none")
saveRDS(a.out, "linux350.rds")

On Mac with R 3.4.4

library(Amelia)
data(africa)
set.seed(1)
a.out <- amelia(africa[ , c("infl", "trade")],
                     m = 1,
                     boot.type = "none")
saveRDS(a.out, "mac344.rds")

On Mac with R 3.5.0

library(Amelia)
data(africa)
set.seed(1)
a.out <- amelia(africa[ , c("infl", "trade")],
                     m = 1,
                     boot.type = "none")
saveRDS(a.out, "mac350.rds")

On any system

linux350 <- readRDS("linux350.rds")
mac344 <- readRDS("mac344.rds")
mac350 <- readRDS("mac350.rds")

all.equal(linux350, mac344)
## TRUE
all.equal(mac344, mac350)
## [1] "Component “imputations”: Component “imp1”: Component “trade”: Mean relative difference: 0.2454121"

This is a bit troubling since in some cases we would like to provide code that is exactly reproducible.

Label spacing in missmap

Add more options for spacing in the missmap() functions. Ideas include adding spacing or allowing rotation. Especially needed for x-axis.

Cannot install Amelia due to compilation error

Hello,
I cannot install Amelia in my R environment due to a compilation error:

/usr/bin/ld: cannot find -lgfortran

g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o Amelia.so em.o init.o -llapack -lblas -lgfortran -lm -lquadmath -L/usr/lib/R/lib -lR
/usr/bin/ld: cannot find -lgfortran
collect2: error: ld returned 1 exit status
/usr/share/R/share/make/shlib.mk:10: recipe for target 'Amelia.so' failed
make: *** [Amelia.so] Error 1
ERROR: compilation failed for package ‘Amelia’
* removing ‘/home/user/R/x86_64-pc-linux-gnu-library/4.0/Amelia’

The downloaded source packages are in
	‘/tmp/RtmpUGAush/downloaded_packages’
Warning message:
In install.packages("Amelia") :
  installation of package ‘Amelia’ had non-zero exit status

Although I have gcc, g++, and gfortran (all version 7.5.0) installed on my computer, the compilation still fails.

Impute held out data

Amelia should be able to use its output to impute a new dataset. Ideal for held out data.

Amelia - Output-Parameter missing "names" in dataset-structure on m>1

After running Amelia, the output structure contains a lot of information.

The covMatrices and mu output matrices include the corresponding field names only if the m parameter is set to one.
To use this information with multiple imputations (m>1), it is necessary to call the function twice
(once with m=1 and once with m>1).

That's irritating.

Example:

names(amelia(x=africa,m=1,cs="country")$covMatrices[1,,])

names(amelia(x=africa,m=2,cs="country")$covMatrices[1,,])

Amelia reruns EM algorithm even when bootstrap is off

When boot.type = "none", amelia() will run the identical EM algorithm m times. Instead, we should probably run EM once and then call amelia.impute() multiple times.

xlim/ylim in disperse()

The xlim and ylim values for the disperse function are currently hardcoded. Users should be able to set them.

Installation problems

Sorry for the basic/stupid question but, during the installation, Amelia cannot find the R directory, and does not accept the right one (just reinstalled). I tried this on different PCs, and with Win10 and Win11 (see screenshot).

Surprisingly, I haven't found similar questions on the web or on Github, and thus suppose that I am the problem...

Any suggestion? Thanks a lot

AmeliaView() - Logfile includes not working call of amelia

AmeliaView writes a log file, which is very helpful if you don't know the syntax of the statements.
When the bounds parameter is used, the generated amelia() call does not work.

To reproduce, set a few bounds and run amelia to get a log file.

Logfile output:

amelia(x = getAmelia("amelia.data"), m = 5, idvars = "country",
ts = "year", cs = NULL, priors = NULL, lags = "infl",
empri = 0, intercs = FALSE, leads = "population", splinetime = NULL,
logs = NULL, sqrts = NULL, lgstc = NULL, ords = NULL, noms = NULL,
bounds = c(3, 5, 5, 10, 10, 20), max.resample = 1000, tolerance = 1e-04)

Running this call in R generates the following error:

Error in bounds[, 1] : incorrect number of dimensions

I think the correct output is:

amelia(x = africa, m = 5, idvars = "country",
ts = NULL, cs = NULL, priors = NULL, lags = NULL, empri = 0,
intercs = FALSE, leads = NULL, splinetime = NULL, logs = NULL,
sqrts = NULL, lgstc = NULL, ords = "year", noms = NULL,
bounds = rbind(c(3,5,10),c(5,10,20))
, max.resample = 1000, tolerance = 1e-04)
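Comparing the two calls, the flat vector in the log file looks like the intended bounds matrix flattened column by column. That reading is an inference from the values shown, not something the source confirms. Under that assumption, the three-column matrix can be rebuilt like this:

```r
# The bounds as printed in the AmeliaView log file.
flat <- c(3, 5, 5, 10, 10, 20)

# matrix() fills column by column, giving columns: variable index, lower, upper.
bounds <- matrix(flat, ncol = 3)

# The matrix written out by hand in the proposed correct call.
expected <- rbind(c(3, 5, 10), c(5, 10, 20))

same <- identical(bounds, expected)
```

A log writer could therefore emit the dimensions along with the values, for example as a matrix() or rbind() call, so that the logged call can be pasted back into R.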

Tests re-implemented in testthat fail during R CMD check

I'm working on implementing a testthat testing regime for Amelia, and I'm running into an issue with the moPrep test. The test passes when I run it in the standard R console, but fails when I run the R CMD check. Is there anything unique in the mo methods that could potentially cause this behavior?

Bug when `NA`s are present in date/time variables supplied to `idvars`

amcheck() checks to see if there are any NA values in any POSIXt (date/time) variables in the dataset and throws an error if so, but it runs the check for all variables in the dataset, not just those not named in idvars. It should ignore variables named in idvars, as these will neither be imputed nor used in the imputation but may exist in the dataset as metadata. The simple fix is to exclude the idvars variables in the check for POSIXt variables.

Amelia/R/amcheck.r

Lines 941 to 947 in 4de6306

if (is.data.frame(x)) {
  is.posix <- function(x) inherits(x, c("POSIXt", "POSIXct", "POSIXlt"))
  posix.check <- sapply(x, is.posix)
  if (any(is.na(x[, posix.check]))) {
    stop("NA in POSIXt variable: remove or convert to numeric")
  }
}
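The proposed fix, skipping the idvars columns in the date/time check, might look roughly like the sketch below. posix_na_outside_idvars() is a hypothetical helper, not Amelia's code, and it assumes idvars is given as column names (amelia() also accepts column numbers, which this sketch does not handle).

```r
# Hypothetical helper: TRUE if any date/time column that is NOT listed
# in `idvars` contains an NA.
posix_na_outside_idvars <- function(x, idvars = NULL) {
  stopifnot(is.data.frame(x))
  checked  <- setdiff(names(x), idvars)  # drop the id variables first
  is_posix <- vapply(x[checked], inherits, logical(1), what = "POSIXt")
  any(vapply(x[checked[is_posix]], anyNA, logical(1)))
}

# Example data: a timestamp with a missing value, kept only as metadata.
d <- data.frame(
  stamp = as.POSIXct(c("2020-01-01", NA), tz = "UTC"),
  y     = c(1, NA)
)

flag_all   <- posix_na_outside_idvars(d)                    # stamp is checked
flag_idvar <- posix_na_outside_idvars(d, idvars = "stamp")  # stamp is skipped
```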

Plot

I'm using Amelia 1.7.1, R 3.0.1

When I try to plot the result of an amelia run, I get the following error:

Error in compare.density(output = x, var = which.vars[i], legend = FALSE, :
The 'var' option points to a non-existant column.

My test data is a 500 by 4 matrix with time in the first column and doubles in columns 2, 3 and 4 (columns 2, 3 and 4 contain missing values).

-------------------------------------------------------------------------------------------------------

require(Amelia)
require(R.matlab)

with_missings <- readMat("Artdata_wm_xt2_ut2.mat")
time_series = as.matrix(with_missings$simp)
time_series[is.nan(time_series)] <- NA

colnames(time_series) <- c("time","S2", "u_sin","u_lin")

a.out <- amelia(x = time_series, m=1, ts=1, lgstc=c("S2"), lags=c("S2","u_sin"), leads=c("S2","u_sin"), polytime = 3)
summary(a.out)
plot(a.out)

---------------------------------------------------------------------------------------------------------

If I change the last line to
plot(a.out, which.vars = c(2,3,4))
it works fine.

----------------------------------------------------------------------------------------------------------

Calling plot.amelia() without specification of vars:
It seems that the problem is caused by the following line

numericVars <- sapply(x$imputations[[1]], "is.numeric")

changing it to

numericVars <- sapply(x$imputations$imp1[1,], "is.numeric")
or
numericVars <- sapply(x$imputations[[1]][1,], "is.numeric")
solved the problem.
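The difference between the original line and the proposed replacements comes from how sapply() walks a matrix versus a data frame. A small base R illustration, using a hypothetical 2 by 4 matrix with the same column names as in the report:

```r
# Hypothetical 2 x 4 matrix standing in for one imputed dataset.
m <- matrix(c(1, 2, NA, 4, 5, 6, 7, 8), nrow = 2,
            dimnames = list(NULL, c("time", "S2", "u_sin", "u_lin")))

# On a matrix, sapply() visits every cell, so the result has nrow * ncol entries.
per_element <- sapply(m, is.numeric)

# Taking the first row gives one entry per column, as in the proposed fix.
per_first_row <- sapply(m[1, ], is.numeric)

# Converting to a data frame also gives one entry per column.
per_column <- sapply(as.data.frame(m), is.numeric)

lengths_seen <- c(length(per_element), length(per_first_row), length(per_column))
```

With a data frame input the original line already iterates over columns, which would explain why the problem only shows up when a matrix is passed to amelia().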

Dispersion statistics

When all your data are at the aggregate level (e.g., data for a set of 100 studies) with partly missing summary statistics (such as a mean and standard deviation for age), is there any specific advice for imputing the standard deviations?
A log transformation makes sense to me since the SD must be positive, but is there anything else we should think about?

Plot only overimpute (without compare)

I'm trying to plot the overimpute diagnostics on their own:
plot(amelia_result, var_names, overimpute=TRUE, compare=FALSE)

However, your function set.mfrow forces mfrow to c(2,1) whenever overimpute=TRUE.
I propose extending the if statement to check that both compare and overimpute are TRUE before forcing mfrow to c(2,1):

set.mfrow <- function(nvars = 1, overimpute = FALSE, compare = TRUE) {

  if (compare && overimpute) {
    ## If we are overimputing as well, we need
    ## two plots per variable
    mfrow <- switch(min(nvars, 13),
                    c(2,1), ## 2  plots: 2x1
                    c(2,2), ## 4  plots: 2x2
                    c(3,2), ## 6  plots: 3x2
                    c(4,2), ## 8  plots: 4x2
                    c(3,2), ## 10 plots: 3x2
                    c(3,2), ## 12 plots: 3x2
                    c(4,2), ## 14 plots: 4x2
                    c(4,2), ## 16 plots: 4x2
                    c(4,2), ## 18 plots: 4x2
                    c(3,2), ## 20 plots: 3x2
                    c(3,2), ## 22 plots: 3x2
                    c(3,2), ## 24 plots: 3x2
                    c(4,2)) ## 26 plots: 4x2
  } else {
    mfrow <- switch(min(nvars, 13),
                    c(1,1), ## 1  plot : 1x1
                    c(2,1), ## 2  plots: 2x1
                    c(2,2), ## 3  plots: 2x2
                    c(2,2), ## 4  plots: 2x2
                    c(3,2), ## 5  plots: 3x2
                    c(3,2), ## 6  plots: 3x2
                    c(3,3), ## 7  plots: 3x3
                    c(3,3), ## 8  plots: 3x3
                    c(3,3), ## 9  plots: 3x3
                    c(3,2), ## 10 plots: 3x2
                    c(3,2), ## 11 plots: 3x2
                    c(3,2), ## 12 plots: 3x2
                    c(3,3)) ## 13 plots: 3x3
  }

  return(mfrow)
}

The current result (small in height);

any way to use lists with ameliabind()?

As a workaround for #21, I used purrr::map() to run amelia() with m=1. This returns a list. I was hoping to use ameliabind() to combine the list, but it seems to require typing out the names of the individual objects. dplyr::bind_rows() is an example of a function that combines datasets and accepts either separate object arguments or a list. This might be a nice feature for ameliabind().
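In the meantime, base R's do.call() can pass the elements of a list as separate arguments to a function that takes its inputs through `...`. The sketch below uses a hypothetical stand-in combiner, combine_all(), so that it runs without Amelia; the ameliabind() line in the comment is the assumed analogue and is not tested here.

```r
# Stand-in for a variadic combiner; ameliabind() also takes its inputs through `...`.
combine_all <- function(...) {
  parts <- list(...)
  do.call(rbind, parts)
}

# Hypothetical list of per-run results, like the list returned by purrr::map().
runs <- lapply(1:3, function(i) data.frame(run = i, value = i * 10))

# do.call() spreads the list elements out as separate arguments.
combined <- do.call(combine_all, runs)

# With Amelia loaded, the analogous (untested here) call would be:
# a.out <- do.call(ameliabind, list_of_amelia_outputs)
```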

AmeliaView - Typo to link of help file

Bug in version 1.8

When using the Documentation menu, the link that is opened does not show the documentation (only an empty page).

Source code:
label = "Documentation", command = function() browseURL("http://gking.harvard.edu/amelia/docs/"),

Fix:

For example, the link could be corrected to: http://r.iq.harvard.edu/docs/amelia/amelia.pdf
By the way, that document covers version 1.6.2.

Amelia - data integrity violation using priors parameter in cs column

Bug:

This happens ONLY when the priors parameter is used AND the cs and ts columns are in swapped positions in the dataset.
In the following example, amelia() then produces NA values in the cs column of the imputed output.

According to the documentation, fixed positions of ts and cs in a dataset are not required.
The cs= and ts= parameters exist to make this dynamic.

In version 1.8 a warning is thrown, but it's only a warning. There should be an error ;-)

Source:

africa_switch<-data.frame( country= africa$country,year= africa$year, gdp_pc= africa$gdp_pc ,
infl= africa$infl, trade= africa$trade, civlib= africa$civlib, population=africa$population)

imp_amelia<-amelia(africa_switch,ts=c("year"),cs=c("country"),m=2,logs=c("trade","population"),
lags=c("trade","population"),leads=c("trade","population"),
idvars=c("infl"),
polytime=2, boot.type="ordinary",
splinetime=1,
bound = rbind(c(5, 0, Inf)),
priors=matrix(c(35,5,100,95),nrow=1,ncol=4 ), p2s=1)

unique(imp_amelia$imputations$imp1$country)
[1] Burkina Faso NA Burundi Cameroon Congo Senegal Zambia
Levels: Burkina Faso Burundi Cameroon Congo Senegal Zambia

unique(africa_switch$country)
[1] Burkina Faso Burundi Cameroon Congo Senegal Zambia
Levels: Burkina Faso Burundi Cameroon Congo Senegal Zambia

(Greater and smaller symbols surrounding NA removed)

Problem:

In other situations (more cases and priors), the cs value is overwritten with a unique value (1,2,3,4...), which could be identical to a real cs value.
It's not clear (to me) whether this affects the EM process and other parts of the amelia() imputation.
In such a case the generated output is not very useful.

Workaround:

Change the column positions in the dataset.

Crash when the number of missingness patterns == rows of the data

If each row of the data has its own pattern of missingness, there is an out of bounds error in the ameliaImpute() function in the C++ code. Here is the error:

error: Mat::operator(): index out of bounds
libc++abi.dylib: terminating with uncaught exception of type std::logic_error: Mat::operator(): index out of bounds

It probably has to do with the loop over patterns and the last iteration of that loop.

Amelia Runs Indefinitely on Smallish Dataset

Hi,

I have a dataset of around 2000 rows by 50ish columns that belong to around 170 cross-sectional units for a population. My call looks like this:

df_test <-
  df[Cross.Sectional.ID %in% df[, pmax(response, na.rm = TRUE) > 100, by =
                                   Cross.Sectional.ID][V1 == FALSE, Cross.Sectional.ID],]
start_time <- Sys.time()
df_amelia <- amelia(setDF(df_test[, c(-1,-2)])
  , m = 1
  , p2s = 2
  , cs = "Cross.Sectional.ID"
  , ts = "Time_Unit"
  , ords = c("Ordinal.Variable.1", "Ordinal.Variable.2", "Ordinal.Variable.3")
)
end_time <- Sys.time()

Running this on my business laptop has taken multiple days without completing. Oddly, R doesn't seem to use much of my processor or RAM: processor usage is only about 30 percent of capacity, even when nothing else is running. Are there any common mistakes on a dataset this size that might cause Amelia to run interminably or break silently? How could I adjust my settings to speed things up?

R fatal error RAM limit

RStudio crashes running Amelia on a 324000 rows x 17 cols dataframe at about 11GiB: "R session aborted. R encountered a fatal error. The session was terminated".

Suggestions would be welcome for (a) how to deal with memory issues or (b) features that address them, e.g. running on a distributed system or on a local database rather than in memory.

MI <- amelia(x = data,
    m = 1, 
    p2s = 1,
    idvars = c("index","var2"),
    noms = c("var3","var4"),
    ords = "var5",
    ts = "dt",
    cs = "var1",
    empri = 0.05 * nrow(data),
    polytime = 2,
    intercs = TRUE,
    bounds = matrix(c(11,0,1000, 12,0,1000, 13,0,1000), nrow = 3, ncol = 3, byrow = TRUE),
    parallel = 'multicore',
    ncpus = 8,
    collect = TRUE)

Amelia has no problem running on subsets of the data, i.e. any one of the 1-10 datasets indicated by var2.

Debian Bullseye

Missmap plot (axis, label and order)

Hi, first of all I would like to thank you for a great package!

As you can see in the picture below, the missmap plot looks a bit strange:

The y-axis is covered
The plot legend ("missing" and "observed") is covered

Secondly, the missing data in this example is located at observations 2 and 6 of X8 and observations 2, 3 and 7 of X13, but in the plot it appears to be located at the end of the variable (around observation 120).

Furthermore, it would be great to also show the percentage missing in the legend, such as "missing 5%".

To reproduce the plot, see data and code-snippet below.

R 3.3.2
Amelia 1.7.4

---------------Script below---------------------
dates <- c("2004-01-01","2004-02-01","2004-03-01","2004-04-01","2004-05-01","2004-06-01","2004-07-01","2004-08-01","2004-09-01","2004-10-01","2004-11-01","2004-12-01","2005-01-01","2005-02-01","2005-03-01","2005-04-01","2005-05-01","2005-06-01","2005-07-01","2005-08-01","2005-09-01","2005-10-01","2005-11-01","2005-12-01","2006-01-01","2006-02-01","2006-03-01","2006-04-01","2006-05-01","2006-06-01","2006-07-01","2006-08-01","2006-09-01","2006-10-01","2006-11-01","2006-12-01","2007-01-01","2007-02-01","2007-03-01","2007-04-01","2007-05-01","2007-06-01","2007-07-01","2007-08-01","2007-09-01","2007-10-01","2007-11-01","2007-12-01","2008-01-01","2008-02-01","2008-03-01","2008-04-01","2008-05-01","2008-06-01","2008-07-01","2008-08-01","2008-09-01","2008-10-01","2008-11-01","2008-12-01","2009-01-01","2009-02-01","2009-03-01","2009-04-01","2009-05-01","2009-06-01","2009-07-01","2009-08-01","2009-09-01","2009-10-01","2009-11-01","2009-12-01","2010-01-01","2010-02-01","2010-03-01","2010-04-01","2010-05-01","2010-06-01","2010-07-01","2010-08-01","2010-09-01","2010-10-01","2010-11-01","2010-12-01","2011-01-01","2011-02-01","2011-03-01","2011-04-01","2011-05-01","2011-06-01","2011-07-01","2011-08-01","2011-09-01","2011-10-01","2011-11-01","2011-12-01","2012-01-01","2012-02-01","2012-03-01","2012-04-01","2012-05-01","2012-06-01","2012-07-01","2012-08-01","2012-09-01","2012-10-01","2012-11-01","2012-12-01","2013-01-01","2013-02-01","2013-03-01","2013-04-01","2013-05-01","2013-06-01","2013-07-01","2013-08-01","2013-09-01","2013-10-01","2013-11-01","2013-12-01")
c0 <- c(33736.25,NA,35005.65,35640.35,36275.05,NA,37604.00,35536.00,37919.25,38211.00,39905.75,38832.75,36678.75,37647.75,41619.50,39772.00,34867.25,38081.75,37346.50,41084.00,40469.00,40494.25,45103.50,44942.25,49926.50,49098.25,55861.75,49798.50,60079.50,54494.25,52755.50,54108.50,51919.50,58384.00,59443.75,53449.75,61783.50,56632.25,60741.25,53469.25,58679.25,56215.50,60113.75,55327.25,47813.50,56163.75,55138.25,42860.50,53791.75,58305.75,57092.25,65094.00,58048.50,62106.75,70625.75,58003.75,57788.25,48779.00,37041.50,31290.50,29668.50,26596.25,29381.00,28410.00,27741.25,34613.25,38353.25,38667.75,40339.25,41320.00,40927.75,45773.50,44696.75,40971.50,50719.75,46328.00,38762.75,42482.50,43731.25,44469.75,47563.50,49267.50,51317.50,49352.00,48782.50,50155.00,58700.25,47921.25,51833.50,56210.75,52743.50,52627.50,50518.75,45608.75,45609.25,40429.50,45020.25,46274.50,48017.00,38877.50,44003.75,35805.50,41224.25,40429.75,41069.75,45422.00,42740.25,39639.75,44829.50,41062.75,38255.50,38981.00,38435.50,36320.00,40649.25,38103.50,36962.00,41676.00,36727.25,12062.25)
c1 <- c(50885.5,NA,NA,58949.00,51924.25,59090.50,NA,59753.00,63674.50,63230.50,68688.25,66035.00,63383.50,65055.75,70958.75,71269.00,64961.75,77509.00,75885.00,83534.50,84858.00,85241.75,93909.75,91522.00,99407.00,99633.25,117389.50,114951.75,168926.25,158309.50,161919.50,169263.50,159617.50,164985.50,154616.75,126790.50,124711.25,113504.50,141918.00,147539.75,161306.75,156962.00,175396.50,165245.25,152951.00,184169.25,153251.25,118557.25,155322.25,165626.25,160326.00,191048.25,167630.75,173460.25,193503.50,152676.25,159510.75,113272.75,74325.25,64498.75,67625.75,66280.75,82473.00,88117.25,86794.75,110290.00,119936.50,123294.50,136306.75,138322.25,140173.25,146597.25,147712.25,136953.75,171634.50,154887.75,129906.75,142970.50,148161.75,152943.75,169596.50,174128.75,186321.00,192080.00,191098.25,197343.50,219192.50,170692.25,178529.75,198992.50,201994.75,198898.00,182915.25,154289.25,166130.00,151343.25,168903.00,176862.50,186044.00,156918.75,174224.25,140976.00,166951.50,164822.00,161360.50,185588.75,169266.25,151279.75,177073.00,161400.25,153244.75,151262.25,151801.00,140074.25,158527.75,150819.50,150383.25,165332.75,148387.25,49426.25)
data <- data.frame(c0,c1)
colnames(data) <- c("X8","X13")
ind_category <- NULL
ind_group <- NULL
target <- 0
amelia_results <-amelia(data,m=5,ts='dates',p2s=0,idvars=ind_category,cs=ind_group,lags=colnames(data))
missingMapAmelia <- missmap(amelia_results, legend = TRUE)
missingMapData <- missmap(data[,-1])

reproducibility with parallel processing

set.seed() does not seem to work when using parallel = "multicore". I assume that's because there's no way to pass the seed onto the parallel jobs. I'm not sure if this is a bug or simply a limitation of using parallel processing with Amelia.

library(Amelia)
#> Warning: package 'Amelia' was built under R version 4.0.2
#> Loading required package: Rcpp
#> ## 
#> ## Amelia II: Multiple Imputation
#> ## (Version 1.7.6, built: 2019-11-24)
#> ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
#> ## Refer to http://gking.harvard.edu/amelia/ for more information
#> ##
library(parallel)
data(africa)

# Reproducible:
set.seed(123)
a.out1 <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0)

set.seed(123)
a.out2 <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0)

## original
africa[38:42, ]
#>    year  country gdp_pc  infl trade    civlib population
#> 38 1989  Burundi    532 11.66    NA 0.1666667    5330730
#> 39 1990  Burundi    550  7.00    NA 0.1666667    5487000
#> 40 1991  Burundi    560  9.00 38.42 0.1666667    5643320
#> 41 1972 Cameroon    815  8.09 46.48 0.5000000    6835870
#> 42 1973 Cameroon     NA 10.38    NA 0.5000000    7021850

## run 1
a.out1$imputations[[1]][38:42, ]
#>    year  country   gdp_pc  infl    trade    civlib population
#> 38 1989  Burundi  532.000 11.66 34.01444 0.1666667    5330730
#> 39 1990  Burundi  550.000  7.00 28.77401 0.1666667    5487000
#> 40 1991  Burundi  560.000  9.00 38.42000 0.1666667    5643320
#> 41 1972 Cameroon  815.000  8.09 46.48000 0.5000000    6835870
#> 42 1973 Cameroon 1534.801 10.38 85.77617 0.5000000    7021850

## run 2
a.out2$imputations[[1]][38:42, ]
#>    year  country   gdp_pc  infl    trade    civlib population
#> 38 1989  Burundi  532.000 11.66 34.01444 0.1666667    5330730
#> 39 1990  Burundi  550.000  7.00 28.77401 0.1666667    5487000
#> 40 1991  Burundi  560.000  9.00 38.42000 0.1666667    5643320
#> 41 1972 Cameroon  815.000  8.09 46.48000 0.5000000    6835870
#> 42 1973 Cameroon 1534.801 10.38 85.77617 0.5000000    7021850

# Not Reproducible:
set.seed(123)
a.out1 <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0, parallel = "multicore", ncpus = detectCores() - 1)

set.seed(123)
a.out2 <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0, parallel = "multicore", ncpus = detectCores() - 1)

## run 1
a.out1$imputations[[1]][38:42, ]
#>    year  country  gdp_pc  infl    trade    civlib population
#> 38 1989  Burundi 532.000 11.66 41.76351 0.1666667    5330730
#> 39 1990  Burundi 550.000  7.00 64.16109 0.1666667    5487000
#> 40 1991  Burundi 560.000  9.00 38.42000 0.1666667    5643320
#> 41 1972 Cameroon 815.000  8.09 46.48000 0.5000000    6835870
#> 42 1973 Cameroon 871.101 10.38 64.33208 0.5000000    7021850

## run 2
a.out2$imputations[[1]][38:42, ]
#>    year  country   gdp_pc  infl    trade    civlib population
#> 38 1989  Burundi 532.0000 11.66 40.37939 0.1666667    5330730
#> 39 1990  Burundi 550.0000  7.00 25.26368 0.1666667    5487000
#> 40 1991  Burundi 560.0000  9.00 38.42000 0.1666667    5643320
#> 41 1972 Cameroon 815.0000  8.09 46.48000 0.5000000    6835870
#> 42 1973 Cameroon 570.4258 10.38 48.06267 0.5000000    7021850

Created on 2020-08-17 by the reprex package (v0.3.0)
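One general way to make parallel work reproducible is to derive one seed per task up front and set it inside each task, so that the result no longer depends on which worker runs which job. The sketch below shows only that pattern in base R; run_job() and make_seeds() are hypothetical helpers, and whether amelia() can be driven this way (for example by running m = 1 imputations as separate jobs) is an assumption that is not tested here.

```r
# Hypothetical job: sets its own seed, then draws random numbers.
run_job <- function(seed) {
  set.seed(seed)
  rnorm(3)
}

# Derive one seed per job from a single master seed.
make_seeds <- function(master, n) {
  set.seed(master)
  sample.int(.Machine$integer.max, n)
}

seeds  <- make_seeds(123, 5)
first  <- lapply(seeds, run_job)
second <- lapply(make_seeds(123, 5), run_job)

reproducible <- identical(first, second)
```

Because each job resets the generator itself, lapply() could be swapped for parallel::mclapply() without changing the draws.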

Error: Subscript `AMr1.orig` is a matrix, the data `x.imp[, -possibleFactors][AMr1.orig]` must have size 1

In case you had not seen this StackOverflow error report from 2020:

https://stackoverflow.com/questions/64056125/how-to-fix-erreur-subscript-amr1-orig-is-a-matrix-the-data-x-imp-possib

I just ran into the same problem. The solution is simple. Just add this to convert x to data frame if it is a tibble:

if (inherits(x, "tbl_df")) {
  x <- as.data.frame(x)
}

mi.combine() reversed confidence interval bounds

When mi.combine() outputs confidence intervals, the lower and upper bounds are reversed, i.e., the lower bound (conf.low) is higher than the upper bound (conf.high).

Setting lower.tail=TRUE when calculating the critical value should fix this, as the critical value is currently negative.
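The effect of the lower.tail argument can be checked in isolation with qt(). The sketch below assumes the critical value is computed from the upper quantile level, 1 - (1 - conf.level) / 2, as the report implies; the estimate, standard error and degrees of freedom are made-up illustration values.

```r
conf.level <- 0.95
df         <- 10
estimate   <- 1     # hypothetical pooled estimate
std.error  <- 0.5   # hypothetical pooled standard error

p <- 1 - (1 - conf.level) / 2  # upper quantile level

# With lower.tail = FALSE the quantile comes from the other tail and is negative.
crit_upper_tail <- qt(p, df = df, lower.tail = FALSE)
# With lower.tail = TRUE the critical value is positive.
crit_lower_tail <- qt(p, df = df, lower.tail = TRUE)

# A negative critical value swaps the two interval bounds.
ci_reversed <- c(low  = estimate - crit_upper_tail * std.error,
                 high = estimate + crit_upper_tail * std.error)
ci_correct  <- c(low  = estimate - crit_lower_tail * std.error,
                 high = estimate + crit_lower_tail * std.error)
```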

Question regarding imputed datasets.

In a.out$imputations, is the last imputed dataset the final result, or do I need to average across all datasets?
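In standard multiple-imputation practice, neither option applies: the analysis is run on each of the m imputed datasets and the m results are pooled with Rubin's rules. The sketch below implements those rules for a single coefficient in base R; pool_rubin() is a hypothetical helper and the input numbers are made up. The mi.combine() function mentioned elsewhere in this tracker appears to serve this purpose within Amelia.

```r
# Pool one coefficient across m imputed datasets using Rubin's rules.
pool_rubin <- function(estimates, std_errors) {
  m       <- length(estimates)
  q_bar   <- mean(estimates)                   # pooled point estimate
  within  <- mean(std_errors^2)                # average within-imputation variance
  between <- var(estimates)                    # between-imputation variance
  total   <- within + (1 + 1 / m) * between    # total variance
  c(estimate = q_bar, std.error = sqrt(total))
}

# Hypothetical results from fitting the same model to m = 3 imputed datasets.
pooled <- pool_rubin(estimates  = c(1.0, 1.2, 0.8),
                     std_errors = c(0.5, 0.5, 0.5))
```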

R encountered fatal error - unsure cause

Hello, I'm trying to figure out why I am suddenly receiving a fatal error while running amelia. I'm fairly new to this package, but when I used it originally (2 weeks ago), it was running. Now, I can't run a MI without R crashing. Any suggestions? I've already deleted and re-installed all of R on my laptop.

Here is the code that I am running:

library(lavaan)
library(readxl)
library(haven)
library(semTools)
set.seed(5)

library(naniar)
library(finalfit)
library(Amelia)

#load in new data set, adolescents only

trauma<- read_excel("C:/Users/PayneWinston/OneDrive - Newport Academy/Research Papers & Projects/Trauma Paper/adol_short.rev.xlsx")
View(trauma)

#multiple imputation

mi<-amelia(trauma, m=5, idvars = c("age","bothpar", "coerc","lsu",
"ECR_01M", "ECR_02M","ECR_03M","ECR_04M", "ECR_01F", "ECR_02F","ECR_03F","ECR_04F"),
ords =c("ECR_01Mr", "ECR_02Mr","ECR_03Mr","ECR_04Mr","ECR_05M","ECR_06M","ECR_07M","ECR_08M","ECR_09M",
"ECR_01Fr", "ECR_02Fr","ECR_03Fr","ECR_04Fr","ECR_05F","ECR_06F","ECR_07F","ECR_08F","ECR_09F"))

AmeliaView - Wrong implementation of splinetime parameter

Bug based on AmeliaView 1.8:

In AmeliaView it is possible to set the splinetime= parameter via "Splines", with knots from zero to ten.

amelia() is then called with parameter values from 0 to 10, but the parameter is defined differently (see below).
In short: the number of knots is limited to three and has to be translated into the values 4, 5 and 6.
On the other hand, if you want a polynomial of time, 1, 2 and 3 are the possible values.

Source: help function to amelia()

"splinetime:
interger value of 0 or greater to control cubic smoothing splines of time. Values between 0 and 3 create a simple polynomial of time (identical to the polytime argument). Values k greater than 3 create a spline with an additional k-3 knotpoints."

How to test/check:

Just use the entry 10 knots, which leads to an error.
Read the debug-log-file for amelia() call.

Impact/Conclusion:

Normally a simple typo is no problem, but this case is different.
Using AmeliaView() with ts without knowing about this bug leads to incorrect documentation of the functions and model that were used.
Conclusions drawn from such work should be checked and reviewed.
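Going only by the help text quoted above (values k greater than 3 give k-3 knot points), the translation a GUI would need can be sketched as below. knots_to_splinetime() is a hypothetical helper, not part of AmeliaView, and the upper limit of three knots mentioned in this report is not enforced here.

```r
# Hypothetical helper: convert a requested number of knots into the value
# expected by the splinetime argument, following the quoted help text.
knots_to_splinetime <- function(knots) {
  stopifnot(is.numeric(knots), length(knots) == 1,
            knots >= 1, knots == round(knots))
  as.integer(knots) + 3L
}

one_knot    <- knots_to_splinetime(1)
three_knots <- knots_to_splinetime(3)
```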

Typo and missing parts in documentation

Help file in R-"Amelia":

"Note that the theta, mu and covMatrcies[!] objects refers to the data as seen by the EM algorithm and is thusly centered, scaled, stacked, tranformed and rearranged. See the manual for details and how to access this information."

There is no link to the detailed information (centering, scaling, stacking...).
