melff / memisc

Tools for Managing Survey Data, Creating Tables of Estimates and Data Summaries

Home Page: https://melff.github.io/memisc

R 83.46% C 16.54%

cran r r-package rstats survey-data

memisc's Issues

Use of "anyNA" function breaks compatibility with R < 3.1

Hi,

I have been having trouble installing memisc from source on a system where I am stuck with R version 3.0.2 (2013-09-25). It seems that the use of anyNA in the function [.mtable, introduced with commit [https://github.com/melff/memisc/commit/dd3da28045220e5b3726cfc0f1b56cdaeda87b0c], breaks compatibility with R versions < 3.1.

I therefore suggest that the R dependency in DESCRIPTION is bumped up to 3.1.

Palmar
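An alternative to raising the dependency would be a small fallback for older R versions. The following is only a sketch under that assumption; anyNA_compat is a hypothetical name and is not part of memisc.

```r
# Hypothetical compatibility shim (not part of memisc): use base::anyNA()
# where it exists (R >= 3.1.0) and fall back to any(is.na(x)) otherwise.
anyNA_compat <- if (exists("anyNA", envir = baseenv(), inherits = FALSE)) {
  get("anyNA", envir = baseenv())
} else {
  function(x, recursive = FALSE) any(is.na(x))
}

anyNA_compat(c(1, NA, 3))   # TRUE
anyNA_compat(c(1, 2, 3))    # FALSE
```

The simpler route remains the one suggested above: declaring the minimum R version in DESCRIPTION.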

SPSS variable names trimmed by memisc/spss.system.file?

Hi Martin:

In my experience so far, it seems as though variable names exceeding 8 characters are trimmed to 8 characters when importing from SPSS .sav files to R. Any subsequent variable names sharing the same initial 8 characters are then coerced into unique, but generic, strings (e.g., longername1 -> longerna, longername2 -> V2_a, longername3 -> V3_a).

I'm wondering if perhaps there is something I'm overlooking or doing incorrectly, as I've seen no mention of variable name truncation in the documentation or any of the discussions I've read. I did go as far as to read the pspp-system-for-R.c code where I saw the following call in line 412: trim(curr_var.name,8). I don't know if this bit of code is relevant to my query, but it did make me wonder if the truncation of variable names was by design and static (i.e., there is no option to turn it off).

If it is not an intentional feature, would you have any ideas about why my variable names are being trimmed? If this, however, is a feature, would altering it be a possibility in the future?

Thanks!

-Mike
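To illustrate why truncation forces the importer to invent replacement names, here is a minimal base-R sketch. It does not reproduce memisc's actual code; it only shows that cutting the example names to 8 characters makes them collide.

```r
# Illustration only: truncating names to 8 characters makes distinct
# SPSS variable names collide, so an importer has to invent replacements.
long_names <- c("longername1", "longername2", "longername3")
short_names <- substr(long_names, 1, 8)
short_names                     # all three become "longerna"
anyDuplicated(short_names) > 0  # TRUE: the truncated names are no longer unique
```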

summary.stats.default is unused

Extending mtable for a new model class "cls" currently requires specifying a corresponding "summary.stats.cls" option in addition to a getSummary.cls method, otherwise no summary statistics are displayed with the default summary.stats = TRUE argument of mtable(). From what help("mtable") says about summary.stats, I would have expected that all summary statistics from getSummary.cls() are reported when no such option has been defined.

A minimal reproducible example follows:

lm0 <- lm(sr ~ pop15 + pop75, data = LifeCycleSavings)
class(lm0) <- "cls"
getSummary.cls <- function (obj, ...) {
    class(obj) <- "lm"
    getSummary.lm(obj, ...)
}
mtable(lm0, summary.stats = TRUE)  # summary statistics are not shown
mtable(lm0, summary.stats = "N")   # works
oopt <- options("summary.stats.cls" = getOption("summary.stats.default"))
mtable(lm0, summary.stats = TRUE)  # now the defaults (logLik and N) are shown
options(oopt)

The reason is that "summary.stats.default" will not be selected in selectSummaryStats:

memisc/pkg/R/mtable.R

Lines 85 to 93 in 977c022

    
    cls <- class(x)
    sumstats.name <- paste0("summary.stats.",cls)
    sumstats <- lapply(sumstats.name,getOption)
    if(length(sumstats)){
        sumstats <- unlist(sumstats)
        sumstats[1]
    }
    else
        sumstats <- getOption("summary.stats.default")

The condition length(sumstats) equals the length of the class vector class(x), which will contain at least one element for a new model class, so the option "summary.stats.default" won't come into play and "summary.stats.cls" needs to be specified to avoid a NULL result.

Should the condition be replaced by something like any(!vapply(sumstats, is.null, TRUE))?
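The suggested condition can be sketched in isolation as follows. select_summary_stats() is a hypothetical stand-in for the internal selectSummaryStats(); it only inspects options, and returning the whole option of the first class that has one set is an assumption about the intended behaviour.

```r
# Sketch of the selection logic with the proposed condition.
# select_summary_stats() is a hypothetical stand-in for memisc's internal
# selectSummaryStats(); it only looks at options, not at model objects.
select_summary_stats <- function(cls) {
  sumstats <- lapply(paste0("summary.stats.", cls), getOption)
  found <- !vapply(sumstats, is.null, TRUE)
  if (any(found)) {
    sumstats[[which(found)[1]]]   # first class that has an option set
  } else {
    getOption("summary.stats.default")
  }
}

old <- options(summary.stats.cls = NULL,
               summary.stats.default = c("logLik", "N"))
select_summary_stats("cls")       # falls back to the default option
options(old)
```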

memisc with data.table

Since I often work with surveys, the memisc library looks very promising. However, I always work with data.table, and there seems to be an issue.

When I generate a codebook with memisc on a data.table and then type the name of the object (say, mtcars), an error is thrown; see the code below.

    library(memisc)
    library(data.table)

    data(mtcars)
    setDT(mtcars, keep.rownames = TRUE)
    mtcars = within(mtcars, {

      description(vs) = "whatever"
      description(am) = "something unclear"
      description(carb) = "something different"
      wording(vs) = "this is going to be a long comment"

      labels(vs) = c(
        "many" = 1,
        "not so many" = 0
      )

      labels(carb) = c(
        "one" = 1,
        "two" = 2,
        "three" = 3
      )

      missing.values(carb) = c(4,6, 8)

    })

    codebook(mtcars) %>% show_html

    # annotation(mtcars) = "my long story"

    # if data.frame = error and annotation(mtcars) = "....text...." then message "Error in if (nzchar(nm.i)) { : argument is of length zero"
    # annotation(mtcars)

    # if data.table: throws error "Error in format.item(char.trunc(col), justify = justify, ...) : 
    # unused argument (justify = justify)
    mtcars


    > sessionInfo()
    R version 3.2.3 (2015-12-10)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 15.04

    locale:
      [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
    [10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

    attached base packages:
      [1] grid      stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
      [1] Hmisc_3.17-0      ggplot2_2.0.0     Formula_1.2-1     survival_2.38-3   magrittr_1.5      DescTools_0.99.15 manipulate_1.0.1  memisc_0.99.3     MASS_7.3-44       lattice_0.20-33   data.table_1.9.7 

    loaded via a namespace (and not attached):
      [1] Rcpp_0.12.3         nloptr_1.0.4        RColorBrewer_1.1-2  plyr_1.8.3          tools_3.2.3         rpart_4.1-10        boot_1.3-17         lme4_1.1-9          nlme_3.1-122        gtable_0.1.2        mgcv_1.8-7          Matrix_1.2-2       
    [13] parallel_3.2.3      mvtnorm_1.0-3       SparseM_1.7         proto_0.3-10        gridExtra_2.0.0     cluster_2.0.3       MatrixModels_0.4-1  nnet_7.3-11         foreign_0.8-66      latticeExtra_0.6-26 minqa_1.2.4         car_2.0-25         
    [25] scales_0.3.0        splines_3.2.3       rsconnect_0.4.1.11  pbkrtest_0.4-2      colorspace_1.2-6    quantreg_5.19       acepack_1.3-3.3     munsell_0.4.2       chron_2.3-47

write_html error

I've noticed a failure when trying to export a codebook to HTML. It turns out this can be reproduced simply by having a character variable in the data frame:

> write_html(codebook(data.frame(x="a")), file="codebook.html")
Error in is.finite(x) : default method not implemented for type 'list'

This happens on git master too.

format_html(codebook()) for factor variables

stops with an error, tested with 34e66cd. What am I doing wrong?

library(memisc)
#> Loading required package: lattice
#> Loading required package: MASS
#> 
#> Attaching package: 'memisc'
#> The following objects are masked from 'package:stats':
#> 
#>     contr.sum, contr.treatment, contrasts
#> The following object is masked from 'package:base':
#> 
#>     as.array
format_html(codebook(iris[5]))
#> Error in tab[, 3]: subscript out of bounds

Created on 2018-01-12 by the reprex package (v0.1.1.9000).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_US                       
#>  collate  en_US.UTF-8                 
#>  tz       Europe/Busingen             
#>  date     2018-01-12
#> Packages -----------------------------------------------------------------
#>  package      * version    date       source                          
#>  backports      1.1.2      2017-12-13 cran (@1.1.2)                   
#>  base         * 3.4.3      2017-12-01 local                           
#>  car            2.1-5      2017-07-04 CRAN (R 3.4.1)                  
#>  compiler       3.4.3      2017-12-01 local                           
#>  datasets     * 3.4.3      2017-12-01 local                           
#>  devtools       1.13.4     2017-11-09 CRAN (R 3.4.2)                  
#>  digest         0.6.13     2017-12-14 CRAN (R 3.4.3)                  
#>  evaluate       0.10.1     2017-06-24 CRAN (R 3.4.1)                  
#>  graphics     * 3.4.3      2017-12-01 local                           
#>  grDevices    * 3.4.3      2017-12-01 local                           
#>  grid           3.4.3      2017-12-01 local                           
#>  htmltools      0.3.6      2017-04-28 CRAN (R 3.4.1)                  
#>  knitr          1.18       2017-12-27 CRAN (R 3.4.3)                  
#>  lattice      * 0.20-35    2017-03-25 CRAN (R 3.4.1)                  
#>  lme4           1.1-13     2017-04-19 CRAN (R 3.4.0)                  
#>  magrittr       1.5        2014-11-22 CRAN (R 3.4.3)                  
#>  MASS         * 7.3-47     2017-04-21 CRAN (R 3.4.1)                  
#>  Matrix         1.2-10     2017-04-28 CRAN (R 3.4.1)                  
#>  MatrixModels   0.4-1      2015-08-22 CRAN (R 3.4.0)                  
#>  memisc       * 0.99.15    2018-01-12 Github (melff/memisc@34e66cd)   
#>  memoise        1.1.0      2017-08-07 Github (hadley/memoise@d63ae9c) 
#>  methods      * 3.4.3      2017-12-01 local                           
#>  mgcv           1.8-17     2017-02-08 CRAN (R 3.4.0)                  
#>  minqa          1.2.4      2014-10-09 CRAN (R 3.4.0)                  
#>  nlme           3.1-131    2017-02-06 CRAN (R 3.4.0)                  
#>  nloptr         1.0.4      2014-08-04 CRAN (R 3.4.0)                  
#>  nnet           7.3-12     2016-02-02 CRAN (R 3.4.0)                  
#>  parallel       3.4.3      2017-12-01 local                           
#>  pbkrtest       0.4-7      2017-03-15 CRAN (R 3.4.0)                  
#>  quantreg       5.33       2017-04-18 CRAN (R 3.4.0)                  
#>  Rcpp           0.12.14.5  2018-01-11 local                           
#>  repr           0.12.0     2017-04-07 CRAN (R 3.4.0)                  
#>  rmarkdown      1.8        2017-11-17 CRAN (R 3.4.3)                  
#>  rprojroot      1.3-2      2018-01-03 local (krlmlr/rprojroot@851d293)
#>  SparseM        1.77       2017-04-23 CRAN (R 3.4.0)                  
#>  splines        3.4.3      2017-12-01 local                           
#>  stats        * 3.4.3      2017-12-01 local                           
#>  stringi        1.1.6      2017-11-17 CRAN (R 3.4.3)                  
#>  stringr        1.2.0      2017-02-18 CRAN (R 3.4.1)                  
#>  tools          3.4.3      2017-12-01 local                           
#>  utils        * 3.4.3      2017-12-01 local                           
#>  withr          2.1.1.9000 2017-12-30 Github (r-lib/withr@df18523)    
#>  yaml           2.1.16     2017-12-12 CRAN (R 3.4.3)

toLatex.ftable(extrarowsep = ...) produces wrong results

library(magrittr)
library(memisc)
array(as.character(1:12), dim = c(2,2,3), dimnames = list(X=1:2, Y=letters[1:2], Z=LETTERS[1:3])) %>%
  ftable %>% toLatex(extrarowsep="1ex")

Output:

\begin{tabular}{lllD{.}{.}{0}D{.}{.}{0}D{.}{.}{0}}
\toprule
& && \multicolumn{3}{c}{Z}\\
\cmidrule{4-4}\cmidrule{5-5}\cmidrule{6-6}
X&Y && \multicolumn{1}{c}{A}&\multicolumn{1}{c}{B}&\multicolumn{1}{c}{C}\\
\midrule
1&a && 1  & 5  &  9\\
 &b && 3  & 7  & 11\\
2&a && 2  & 6  & 10\\[1ex]
 &b && 4  & 8  & 12\\
1&a && 1  & 5  &  9\\[1ex]
\bottomrule
\end{tabular}

The first row is repeated for some reason. The extrarowsep should appear one line above where it actually appears, and it should be omitted for the last row.
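The expected placement can be sketched with a small helper. This is not memisc code; add_row_sep() is a hypothetical function that appends the separation to the last row of every group except the final one, using the first row-variable column to detect where groups start.

```r
# Hypothetical helper (not memisc code): append the extra row separation to
# the last row of every group of rows except the final group.
add_row_sep <- function(rows, first_col, sep = "1ex") {
  starts <- which(nzchar(first_col))   # rows where a new X value begins
  ends <- starts[-1] - 1               # last row of each group but the final one
  if (length(ends) > 0) {
    rows[ends] <- paste0(rows[ends], "[", sep, "]")
  }
  rows
}

rows <- c("1&a && 1 & 5 & 9\\\\", " &b && 3 & 7 & 11\\\\",
          "2&a && 2 & 6 & 10\\\\", " &b && 4 & 8 & 12\\\\")
writeLines(add_row_sep(rows, first_col = c("1", "", "2", "")))
```

With the example table, only the second body row (the last row for X = 1) receives the [1ex] suffix.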

omit specific covariates in mtable [question]

Is there a simple way to omit specific covariates (i.e. control variables) from the table?

parseHeaderPorStream(ptr) : unknown tag "T" found

Is there any way to work past such an error?

m2003 <- spss.portable.file("../rawData/census/f466/f466ind.por")
[1] "f basic metalS/21/Manufacture of metal products (excl. machinery and equipment)T"
[2] "/18/Manufacture of machinery and equipment10/20/Manufacture of office and accoun"
Error in parseHeaderPorStream(ptr) : unknown tag "T" found in line 65 offset 54
stop("unknown tag ", dQuote(tag.code), " found in line ", currline, " offset ", offset)
 parseHeaderPorStream(ptr)

Saving files to XLSX

I want to use openxlsx::write.xlsx(my_data_set,file="mydata.xlsx").
It works fine but I get the labels. The missing codes are replaced with missing values. I would actually like to get the numeric codes or labels for the missing values. Is that feasible? I think I remember doing it once but that may be my imagination. Suggestions?

"NewSysFile" not resolved from current namespace (memisc)

[memisc 0.99.14.12]
[R version 3.3.1]
[openSuSE leap]

Loading an SPSS .sav file with

library("memisc")
data <- as.data.set(spss.system.file("foo.sav"))

results in

"NewSysFile" not resolved from current namespace (memisc)

Note that memisc has been installed into a subdirectory of ~/R/x86_64-suse-linux-gnu-library/3.3, which .libPath() properly shows.

Is this a mistake or problem on my side?

Issue with new version of spss.system.file

I have been using memisc for a while to open some SPSS files.

I have just updated the package to the current version (0.99.25.5), and the function spss.system.file is not able anymore to open those files, and returns an error.

The error states that the variable "encoding" is not defined, when running the lines of code message(sprintf("File character set is '%S'.", encoding)).
The result is that the function spss.system.file fails to execute and it is not possible to load the file.

Please note that I went back using the old version of memisc (0.99.22) and with this package version I am able to correctly open the SPSS files with spss.system.file. This suggests that the problem is not with the file, but there might be a bug in the new version of spss.system.file.

P.s.
Your package has been extremely helpful over the years, thank you for your work.

Reduce output content of codebook()?

Hi, great package, love your work.

I'm wondering if there is any straight-forward way of manipulating content from codebook(). It's producing more output than I want in my documentation.

How do you suggest getting codebook() output without information on the following objects:
-storage mode & measurement
-description
-N and percent

Thanks <3

library(memisc) causes functions to print NAMESPACE cache information

Hi,

It would appear that attaching both the memisc and the tibble R packages causes certain functions to print text about caches and NAMESPACES. For example, I've included a reproducible version below (tested with memisc CRAN version 0.99.25.6 and GitHub version 0.99.26.3):

# load packages
library(tibble)
library(memisc)

# create data
d <- tibble(value = seq(0, 100))

# subset data
subset_d <- subset(d, d$value < 50)
# Found more than one class "tbl_df" in cache; using the first, from namespace 'tibble'
# Also defined by ‘memisc’
# Found more than one class "tbl_df" in cache; using the first, from namespace 'tibble'
# Also defined by ‘memisc’

Although this text is neither a warning nor an error message, it could potentially cause issues for some users. One of the packages I contribute to had a similar issue a while ago, and @davidcanarte reported that they are currently experiencing this issue with the memisc R package and were looking for a fix. I believe the issue arises because memisc defines tbl_df as an S4 class (i.e. setOldClass("tbl_df")) while the tibble R package also defines tbl_df as an S4 class. Therefore, a potential fix could involve (1) removing the setOldClass("tbl_df") code and (2) updating the NAMESPACE file to import the tbl_df S4 class from the tibble R package. This would also require (3) listing the tibble R package under Imports and not Enhances in the DESCRIPTION file. What do you think?

I've verified that this approach fixes the issue locally (i.e. by running the example code above on an updated fork of the memisc GitHub repository). In case this is helpful, I've submitted a PR with the proposed fix. I've bumped the version to 0.99.26.4, but please let me know if you need any additional updates to merge the PR.

Escaping character "_" when using mtable_format_latex

When one of the models that are being printed with mtable_format_latex() has a LaTeX special character, LaTeX won't compile the table. Of course, using those special characters in model names can easily be avoided by the user - but could mtable_format_latex() throw a warning when it detects such issues?
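A check of this kind could look roughly like the sketch below. The helper names are hypothetical and not part of memisc, and only a handful of common LaTeX specials are covered.

```r
# Hypothetical helpers (not part of memisc) for LaTeX special characters
# in model names. Only a few common specials are handled in this sketch.
has_latex_specials <- function(x) grepl("[_%&#$]", x)
escape_latex <- function(x) gsub("([_%&#$])", "\\\\\\1", x)

warn_latex_specials <- function(x) {
  bad <- x[has_latex_specials(x)]
  if (length(bad) > 0) {
    warning("names contain LaTeX special characters: ",
            paste(bad, collapse = ", "))
  }
  invisible(x)
}

model_names <- c("model_1", "baseline")
# Capture the warning text instead of emitting it, to show what a user would see.
tryCatch(warn_latex_specials(model_names),
         warning = function(w) conditionMessage(w))
escape_latex(model_names)   # the underscore in the first name gets a backslash
```

Whether to warn or to escape silently is a design choice; a warning keeps the output unchanged for users who intentionally pass LaTeX markup.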

Forum / Mailinglist

There should be a forum or a mailinglist related to memisc.

Currently I see no good way to ask memisc-related questions. GitHub issues are for development, bug reports etc. and not for support questions.
StackOverflow doesn't offer a memisc tag currently.

This is the currently opened question http://stackoverflow.com/questions/41208734/how-to-drop-labels-from-a-memiscdata-set-in-r.

Not for R 3.3.1

I tried to install.packages("memsic") but got the message Paket ‘memsic’ ist nicht verfügbar (for R version 3.3.1) (means not available for R 3.3.1).

I am using Siduction (Debian GNU/Linux unstable) with R version 3.3.1 (2016-06-21).

memisc::cases(): check.xor="ignore" not working for non-exhaustive

The help text for memisc::cases() states for the parameter check.xor: "checks, whether the case conditions are mutually exclusive and exhaustive".

If the conditions are not exhaustive, check.xor="ignore" does not work: a warning is still issued.

> x <- c(1,2)
> memisc::cases(
+   "1"=x==1,
+   "2"=x==2,
+   "3"=x==3,
+   check.xor="ignore"
+ )
[1] 1 2
Levels: 1 2 3
Warning message:
In memisc::cases(`1` = x == 1, `2` = x == 2, `3` = x == 3, check.xor = "stop") :
  condition x == 3 is never satisfied

In cases.R one should probably change the check as follows (similarly to the check for done):

if(any(never) && check.xor!="ignore"){
  msg <- switch(check.xor,warn=warning,stop=stop)
  neverlab <- deflabels[never]
  if(length(neverlab)==1)
    msg("condition ",neverlab," is never satisfied")
  else
    msg("conditions ",paste(neverlab,collapse=", ")," are never satisfied")
}

Using string variable with variable names to subset .sav data

The function memisc::subset() does not allow a character vector of variable names to be passed as the select argument. Currently, you can't do:

sav <- memisc::spss.system.file(filename) 
vars = c("var1", "var2")
data =  memisc::subset(sav, select=vars)

That simple limitation is a little counterproductive and makes the package less portable. I am proposing a modification in the file importer-methods.R below, in the function setMethod("subset","importer",.... The modification preserves the current way the subset works, but now also accepts a string vector in the select parameter of the function subset in the package. I posted it here instead of using a pull request because this little piece of code is sufficient to do the job.

CURRENT CODE:

setMethod("subset","importer", 
    function (x, subset, select, drop = FALSE,
              ...)
....
        nl <- as.list(1:nvars)
        names(nl) <- names
        cols <- logical(nvars)
        cols[eval(substitute(select), nl, parent.frame())] <- TRUE
        select.vars <- sapply(substitute(select)[-1],as.character)
....
{

PROPOSED MODIFICATION:

setMethod("subset","importer", 
    function (x, subset, select, drop = FALSE,
              ...)
....
        nl <- as.list(1:nvars)
        names(nl) <- names
        cols <- logical(nvars)
        if (class(substitute(select)) == 'call') {
            cols[eval(substitute(select), nl, parent.frame())] <- TRUE
            select.vars <- sapply(substitute(select)[-1],as.character)
        }else{
            select.vars = select
            cols[which(names(nl) %in% select.vars )] = TRUE
        }
....
{
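The idea can also be tried out in isolation. The sketch below is not the importer method; resolve_select() is a hypothetical helper, and it follows the evaluation idiom of base R's subset() for data frames instead of checking the class of substitute(select), which is a different technique from the proposed modification.

```r
# Sketch with a hypothetical helper, resolve_select(); var_names stands in
# for the variable names of an importer object.
resolve_select <- function(select, var_names) {
  nl <- as.list(seq_along(var_names))
  names(nl) <- var_names
  # Unquoted names evaluate to positions via `nl`; anything else (such as a
  # character vector defined by the caller) is looked up in the caller's frame.
  idx <- eval(substitute(select), nl, parent.frame())
  if (is.character(idx)) idx <- match(idx, var_names)
  if (anyNA(idx)) stop("unknown variable in 'select'")
  var_names[idx]
}

all_vars <- c("id", "var1", "var2", "var3")
resolve_select(c(var1, var2), all_vars)   # "var1" "var2"
vars <- c("var1", "var3")
resolve_select(vars, all_vars)            # "var1" "var3"
```

One caveat of this approach: if the data contain a variable with the same name as the caller's vector (here vars), the data variable takes precedence.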

How do I dynamically use subset selections

My use case is that I'm using the UK Labour Force Survey, and I want to pool a number of datasets.

I only want to load a small number of variables for analysis, but the variable names change between datasets depending on the naming convention. The weight variable changes from time to time (pwt16, pwt14, pwt11 etc).

I'd like to load a list of datasets, do a set of transformations to a common form, and then pool the datasets for analysis using the survey or srvyr packages.

When I try to use the subset specification as a pre-prepared character vector, I get
Error in max(sapply(args, length)) : invalid 'type' (list) of argument.
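The harmonising and pooling step described above can be sketched with plain data frames standing in for the imported files. harmonise() and the columns other than the pwt* weights are hypothetical, and the sketch does not address the select error on importer objects itself.

```r
# Sketch with plain data frames standing in for the imported survey files.
# harmonise() and the column names other than the pwt* weights are hypothetical.
harmonise <- function(df, keep, weight_pattern = "^pwt[0-9]+$") {
  wt <- grep(weight_pattern, names(df), value = TRUE)
  stopifnot(length(wt) == 1)            # expect exactly one weight column
  out <- df[, c(keep, wt), drop = FALSE]
  names(out)[names(out) == wt] <- "weight"
  out
}

d16 <- data.frame(age = c(30, 41), sex = c(1, 2), pwt16 = c(1.2, 0.8))
d14 <- data.frame(age = c(25, 57), sex = c(2, 2), pwt14 = c(1.1, 0.9))
pooled <- do.call(rbind, lapply(list(d16, d14), harmonise, keep = c("age", "sex")))
pooled
```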

Problem importing string variables from SPSS (.sav)

Hi,

I'm only experiencing this issue on my Linux PC, the same script runs without trouble on Windows.

I'm importing a survey dataset, there are a couple of string variables in the dataset. The string variables mostly contain numbers, eg "1", "2", "-66", but some of them contain text-based answers as well, eg. "Please don't send me more surveys".

I've now noticed that only the string variables with absolutely no text seem to be working properly in memisc on Linux. is.character(ds$var) returns TRUE, and it can be coerced into numeric without errors. The variables with values containing text on the other hand will give errors:

>ds$problemvar
Item 'blablabla variable label blablabla' (measurement: nominal, type: character, length = 1729)

Error in if (any(xw > width)) { : missing value where TRUE/FALSE needed

> str(ds$problemvar)
 Nmnl. item  chr [1:1729] "-66

                                                  "|
 __truncated__ ...

It appears that some form of truncation is happening. Here is what it looks like when indexing the column:

> ds[1]

Data set with 1729 observations and 1 variables

   ...
 1 ...
 2 ...
 3 ...
 4 ...
 5 ...
 6 ...
 7 ...
 8 ...
 9 ...
10 ...

While another variable, based on a near identical survey question, works fine:

> str(ds$noproblemvar)
 Nmnl. item  chr [1:1729] "-66" "-66" "-66" "-66" ...

I have been comparing the above variables every which way, both in SPSS and R; the only discernible difference is that one of them, while being exported from the survey software as a string variable because text input was allowed, only contains numbers.

I'm importing the data from .sav files like so:

in_file = suppressWarnings(
	spss.system.file(
		file.path(use_dir, sav_file, fsep= .Platform$file.sep)))

ds = as.data.set(in_file)

Anyway, thanks for making memisc. My script works on Windows so I can still make use of it, but it would be nice to figure out a workaround so I can handle these datasets on Linux as well. I can send you a .sav dataset to troubleshoot with if that helps.

library(memisc) cause errors in TukeyHSD() ?

Hello. This is something I ran into while working with SPSS files.

The issue can be reproduced as follows:

example(TukeyHSD) # works fine

library(memisc)
example(TukeyHSD)

Error in FUN(X[[i]], ...) : subscript out of bounds

Problems on platform i386-w64-mingw32/i386 (32-bit)

While testing memisc I got a strange problem. It works well on platform x86_64-w64-mingw32/x64 (64-bit) but not on platform i386-w64-mingw32/i386 (32-bit).

Is it a bug, or did I miss something?

Thanks,

Manel Salamero

PS: I attached a file with the instructions and results.
memisc_test.txt

example dataset

Does the memisc package contain example datasets like the built-in R datasets (e.g. mtcars, iris)?

If not, it would be nice to have them. Of course they should use/demonstrate the memisc-specific data-types/classes.

Processing SPSS control files

Double quotes embedded in a quoted string are not handled. In other words, this string
"Has used drugs - lifetime (incl marijuana ""just once"")"
2
does not work. I had to replace the double quotes with the ` character to read the file.
The examples are in SPSS.zip.
I can provide a link to the data file if you wish to test it, but it is 18 MB zipped.
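For reference, the expected handling of doubled quotes can be sketched in a few lines of base R. unquote_spss() is a hypothetical helper, not a memisc function, and it assumes the usual SPSS syntax convention that a quote character inside a string delimited by the same character is written twice.

```r
# Hypothetical helper (not memisc code): strip the delimiting quotes of an
# SPSS string literal and collapse doubled delimiters inside it.
unquote_spss <- function(s) {
  q <- substr(s, 1, 1)
  stopifnot(q %in% c("'", "\""), substr(s, nchar(s), nchar(s)) == q)
  inner <- substr(s, 2, nchar(s) - 1)
  gsub(paste0(q, q), q, inner, fixed = TRUE)
}

lit <- "\"Has used drugs - lifetime (incl marijuana \"\"just once\"\")\""
cat(unquote_spss(lit), "\n")
```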

Exhibit odds ratios in the regression estimates table [question]

Hello,

A question related to the memisc package. Is it possible to present the exponentiated coefficients of a GLM model? I used the mtable() function to exhibit the results of models from the mclogit and lme4 packages, and tried to see if there was an option to report the estimates as odds ratios.

Thank you very much.
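Outside of mtable(), exponentiated coefficients can be computed directly from a fitted model. The sketch below uses a plain glm() on the built-in mtcars data as an assumed stand-in for the models in question and does not show how to get such values into an mtable.

```r
# Odds ratios with Wald confidence intervals from a logistic regression.
# The model (am ~ wt on mtcars) is only an illustrative stand-in.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
odds_ratios <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(odds_ratios, 3)
```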

toLatex.ftable() raises error if axes are not named

library(magrittr)
library(memisc)
array(as.character(1:12), dim = c(2,2,3), dimnames = list(1:2, letters[1:2], LETTERS[1:3])) %>%
  ftable %>% toLatex
## Error in hleaders[n.col.vars, 1:n.row.vars] <- names(row.vars) : 
##   number of items to replace is not a multiple of replacement length
array(as.character(1:12), dim = c(2,2,3), dimnames = list(X=1:2, Y=letters[1:2], Z=LETTERS[1:3])) %>%
  ftable %>% toLatex
## ...works
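Since the named version works, one possible workaround until this is fixed is to give the dimnames placeholder names before calling ftable. name_dims() below is a hypothetical helper, and the "Var" prefix is an arbitrary choice.

```r
# Hypothetical workaround helper: fill in missing names of dimnames(a).
name_dims <- function(a, prefix = "Var") {
  dn <- dimnames(a)
  if (is.null(dn)) return(a)          # nothing to name
  nms <- names(dn)
  if (is.null(nms)) nms <- rep("", length(dn))
  nms[is.na(nms)] <- ""
  empty <- !nzchar(nms)
  nms[empty] <- paste0(prefix, which(empty))
  names(dn) <- nms
  dimnames(a) <- dn
  a
}

a <- array(as.character(1:12), dim = c(2, 2, 3),
           dimnames = list(1:2, letters[1:2], LETTERS[1:3]))
names(dimnames(name_dims(a)))   # "Var1" "Var2" "Var3"
```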

Error in row(mrang_val[, 1:2])

When trying to import the attached dataset with spss.system.file(), this error occurs:

Error in row(mrang_val[, 1:2]) :
a matrix-like object is required as an argument for 'row' (translated from the German message "ein matrixähnliches Objekt ist als Argument für 'row' nötig")
AtestSet.sav.zip

This happens only if a discrete missing value AND a missing value range are set at the same time.

Measure info and Custom Attributes

I want to import an SPSS file (*.sav) to R and I guess that memisc is the best available package for that. However, I'm facing two issues - probably due to my low R skills - and any help would be greatly appreciated.

Firstly, the measure information "Ordinal", defined in the SPSS file, is skipped when imported into R, and ordinal variables are declared as "Nominal". In addition, the SPSS file has two extra Attributes (two custom columns added in Variable View) that are not imported when the memisc importer is used.

I wonder if either of the above issues can be handled successfully with memisc.

Thanks in advance.

importing large files

Hi, I get an out of memory error

> dataset2003<- as.data.set(spss.portable.file("../rawData/census/f463/f463ind.por"))
Error in readStringPorStream(pstream) : 
 cannot allocate memory block of size 16777216 Tb

Is there any way around this? I want to import 15 files of this size and can't even do 1.

toLatex.escape.tex option as argument

Thanks for the nice package. I just had an issue with a new improvement that you listed in the NEWS file for 0.99 under Improvements, namely:

toLatex() methods optionally escape dollar, subscript and superscript symbols.

This is great for some cases I am sure, but for my case it meant that my paper written a while ago (which has math symbols in table headers) could not be knitted without error anymore. It also seems that this is set via a global argument

toLatex.escape.tex

but the toLatex() method (in this case for an ftable) does not expose this as an argument. I was able to fix the issue with

options(toLatex.escape.tex = FALSE)

after debugging a few times. However, it would be much nicer if this argument was directly accessible in the function toLatex() itself, and documented there.

So the proposal is to have an argument in the toLatex() methods for changing this behavior, together with documentation.

For backwards compatibility, would it be an option to have toLatex.escape.tex = FALSE by default?

Sorry for not submitting a patch, but I need to work on the referenced paper now....

Thanks!

Pieter
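The proposed interface amounts to an argument whose default is read from the global option. A minimal sketch of that pattern follows; escape_header() is a hypothetical function and not a memisc method, it only handles dollar signs and underscores, and the FALSE default reflects the backwards-compatible choice asked about above.

```r
# Hypothetical function showing an argument that defaults to a global option.
escape_header <- function(x,
                          escape.tex = getOption("toLatex.escape.tex", FALSE)) {
  if (isTRUE(escape.tex)) x <- gsub("([$_])", "\\\\\\1", x)
  x
}

old <- options(toLatex.escape.tex = NULL)        # make sure the option is unset
escape_header("$\\beta_1$")                      # unchanged: math markup survives
escape_header("$\\beta_1$", escape.tex = TRUE)   # dollar signs and underscore escaped
options(old)
```

With this pattern a single call can override the behaviour without touching options(), while existing code that sets the option keeps working.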

Multiple quotes and escape issues

The variable descriptions in an SPSS dataset contain embedded quotes.
Example:
abc 'The rain doesn''t fall on Sunday '
The result is a missing variable error message.
I could not get your software to process it unless I edited the duplicated single quotes to a pattern such as *.
I tried using ' instead of * but that did not work.
That is not a particular problem. I can edit the descriptions, but I can't replace the original descriptions with the edited ones. I have forgotten how to do bulk edits of
your description. Do I have to use some kind of "for" loop?
I tried to do an edit with
descriptions(my.ds)<-edited_descriptions_as_character_array but this does not work.
string_replacement_problem.pdf

Fails to import variable with duplicate labels

Is it too much to ask that memisc would be able to cope with duplicated labels?

myurl <- "http://hansekbrand.se/temp/BUG.SAV"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z)
Warning message:
1 variables have duplicated labels:
  HML23 
> my.df <- as.data.frame(my.meta.data)
Error in as.factor(x) : Duplicate labels
file.remove(z)

I work a lot with data from the Demographic and Health Surveys (DHS), and some of those files are so big that importing them with read.spss() requires amounts of RAM not found in most computers. Many or even most of the hundreds of files from DHS have duplicated labels in them. To get memisc working with such files would really help my work, currently I have a computer with 68 GB RAM so I manage, but I want others to be able to use my code.

Kind regards,

Hans Ekbrand, university of Gothenburg, Sweden.
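One conceivable way to cope with such files is to make duplicated labels unique by appending the underlying code. The sketch below works on a plain named vector; dedupe_labels() is a hypothetical helper, and how it would be hooked into the importer is left open.

```r
# Hypothetical helper: disambiguate duplicated value labels by appending
# the code they belong to. `labs` is a named vector (label = code).
dedupe_labels <- function(labs) {
  nms <- names(labs)
  dup <- duplicated(nms) | duplicated(nms, fromLast = TRUE)
  names(labs)[dup] <- paste0(nms[dup], " (", labs[dup], ")")
  labs
}

labs <- c(Yes = 1, No = 2, Yes = 3)
names(dedupe_labels(labs))   # "Yes (1)" "No" "Yes (3)"
```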

Mysterious behaviour with factors and labels

Please see this base-R example with a data.frame.

d <- data.frame(a = sample(1:100))
d$a_strat <- cut(d$a, breaks=seq(1,100, by=10)) # stratify by 10
e <- d[,c('a_strat')]

> str(d$a_strat)
 Factor w/ 9 levels "(1,11]","(11,21]",..: 2 6 1 8 6 9 5 3 NA 9 ...
> str(e)
 Factor w/ 9 levels "(1,11]","(11,21]",..: 2 6 1 8 6 9 5 3 NA 9 ...

You see the labels for the levels are not lost. But when I do the same with a memisc::data.set they are lost.

d <- data.set(a = sample(1:100))
d$a_strat <- cut(d$a, breaks=seq(1,100, by=10))
e <- d[,c('a_strat')]

> str(d$a_strat)
 Factor w/ 9 levels "(1,11]","(11,21]",..: 4 9 3 1 NA 9 5 4 9 9 ...
> str(e)
Data set with 100 obs. of 1 variable:
 $ a_strat: Nmnl. item w/ 9 labels for 1,2,3,...  int  4 9 3 1 NA 9 5 4 9 9 ...

What is behind that behaviour?

as.character(as.item(..., missing.values = 0, ...)) returns NA instead of a string

Since an update in memisc (somewhere between 0.99.22 and 0.99.28), the following code returns an error:

foo <- as.item(c(0,1,1,-1), missing.values= 0, labels = structure(c(-1,0,1), names=c('Yes', 'PNR', 'No')))
bar <- as.factor(as.character(foo))
bar <- relevel(bar, 'PNR')

Before the update (I had 0.99.22), we had: foo = c('PNR', 'Yes', 'Yes', 'No') while now we have foo = c(NA, 'Yes', 'Yes', 'No').
This is a weird behavior because the point of memisc is to distinguish missing values from NA.
Can you revert to the previous behavior?

Variables with duplicate labels cause an infinite loop in xtabs

The Statcan SPSS file that I am processing (community health survey) has two variables with duplicate labels. If these variables are in the dataset that is used as a source for xtabs, even if the variables are not in the xtab, there is an error message about factor problems and R hangs in a loop.

to.data.frame not working

I am trying to convert a data set to data.frame using to.data.frame, after loading it using memisc::spss.system.file(filename) and subset(..). I get the following error:

Error in as.factor(x) : Duplicate labels

Problems with slicing: "read_sysfile_slice" not resolved from current namespace

Hi,

I have been processing SPSS .sav files with great success with an older version of memisc and R on Windows.

I have recently upgraded to the 3.4 series of R and memisc version 0.99.14.2.

In code where I use slicing e.g.:
data_sp[,2]
or
data_sp[,names(data_sp)=="grp"]
I now get this error:
Error in .Call("read_sysfile_slice", x@ptr, what = x, j = cols, i = rows, : "read_sysfile_slice" not resolved from current namespace (memisc

I find the function "read_sysfile_slice" in pspp-system-for-R.c.

I believe that "read_sysfile_slice" should be exported in memisc.h

Best regards,

Jon Wickmann

relabel with gsub

I'm having issues with relabeling items using argument gsub = TRUE. The regular expression produces incorrect matches. For example trying to relabel c to foo by regular expression also relabels a, which should not be matched by c at all:

f <- as.factor(rep(letters[1:4],5))
f <- as.item(f)
f2 <- relabel(f, c = 'foo', gsub = TRUE)
labels(f)
Values and labels:

1 'a'
2 'b'
3 'c'
4 'd'

labels(f2)

Values and labels:

1 'foo'
2 'b'
3 'foo'
4 'd'

I expected the following result:

Values and labels:

1 'a'
2 'b'
3 'foo'
4 'd'
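
The expected result corresponds to what base R's gsub() does when applied to the label strings alone. The sketch below is not the relabel() implementation; `labs` and `new_labs` are illustrative names.

```r
# Labels as plain strings, in code order 1..4
labs <- c("a", "b", "c", "d")

# Anchored pattern: replace only labels that are exactly "c"
new_labs <- gsub("^c$", "foo", labs)

# An unanchored pattern gives the same result here, because "c"
# does not occur inside any other label
new_labs_unanchored <- gsub("c", "foo", labs)
```

Only the third label changes in both cases, so a match on 'a' cannot come from the regular expression itself.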

Control encoding when parsing SPSS file using spss.system.file()

Dear Martin,

I have been given an SPSS system file that I would like to analyse using R. I am using the following magic for parsing the file into R.

library(memisc)
foo <- spss.system.file("foobar.sav")
bar <- subset(foo, select=c(var1,var2,var3))

When having a look at the parsed data, you get the following:

> bar
Data set with 379 observations and 3 variables

var1       var2        var3
1      gut    weiblich      Herbst
2      gut mnlich      Sommer
3      gut mnlich      Sommer
4      gut mnlich      Winter
5      gut mnlich Fr�hling
6      gut mnlich Fr�hling
7      gut    weiblich Fr�hling
.
.
.
25      gut    weiblich Fr�hling
.. ........ ........... ...........
(27 of 379 observations shown)

I guess you get the idea. The collaborator has saved the sav-file in utf-8 by adding a line SET UNICODE = ON. to his/her syntax-file. My locale is set to utf-8, too.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.04

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] graphics  grDevices datasets  utils     stats     methods   base     

other attached packages:
[1] foreign_0.8-63  memisc_0.97     MASS_7.3-40     lattice_0.20-29
[5] ggplot2_1.0.1   reshape2_1.4.1  plyr_1.8.2

I am using the uxterm terminal-emulator for running R. Thus, everything is in utf-8. I have the strong suspicion that memisc is using a latin1 encoding when parsing the SPSS sav-file by default. Is this correct? Is it possible to change this encoding when parsing?

Thank you very much!

PS. Why does it say 27 of 379 observations shown, when in fact only 25 of them are shown?
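
As a general illustration of the kind of conversion involved, base R's iconv() can re-encode strings after import. This sketch assumes the bytes really are latin1; it does not show how memisc decodes the file, and whether spss.system.file() accepts an encoding option is not assumed here.

```r
# "Frühling" as it would be stored in latin1: the u-umlaut is byte 0xfc
latin1_bytes <- as.raw(c(0x46, 0x72, 0xfc, 0x68, 0x6c, 0x69, 0x6e, 0x67))
x <- rawToChar(latin1_bytes)

# Re-encode from latin1 to UTF-8 so that a UTF-8 terminal can display it
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
```

If the converted strings still look wrong, the source encoding was probably not latin1 and a different `from` value is needed.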

Using `data.set` with Date class?

Hi,

Very helpful package! I am just running into one issue where there doesn't seem to be any support for the Date class, which our group finds pretty important for survey data. Are there any plans to add support in the near future? I installed the latest version of memisc from GitHub, and code that shows the problem is below.

Thanks!
~Andrew

> rccsData<-data.set(rccsData)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘as.item’ for signature ‘"Date"’
> colnames(rccsData)[3]<-"sample.date"
> rccsData<-within(rccsData, {
+   description(sample.date)<-"Date of interview and sample collection"
+ })
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] memisc_0.99.4.1 MASS_7.3-40     lattice_0.20-31

loaded via a namespace (and not attached):
[1] tools_3.2.0 grid_3.2.0
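
Until Date columns are supported, one possible workaround is to store dates as plain numbers and convert them back when needed. This is a base-R sketch with made-up variable names, not a memisc feature.

```r
# Hypothetical interview dates, including a missing one
sample_date <- as.Date(c("2015-04-16", NA, "2015-05-01"))

# Dates are stored internally as days since 1970-01-01, so a numeric
# copy loses no information and can be placed in a numeric column
date_num <- as.numeric(sample_date)

# Convert back to Date for analysis or display
restored <- as.Date(date_num, origin = "1970-01-01")
```

The numeric column can then be annotated like any other numeric variable, at the cost of converting back by hand.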

Error when generating codebook

When importing with the following syntax, description works but I get an error when generating the codebook.
ZA <- spss.system.file(Alt_in)
description(ZA)
codebook(ZA)

The error generated says:
"Error in if (ncol(descr) > 1) { : argument is of length zero"

If I ask for a codebook on a specific variable, it works.
I'm sure it's something simple I'm missing?

Indicate charset in format_html() output

format_html currently does not output any information regarding character encoding used in the output. In that case, ISO-8859-1 is assumed. But R strings are typically in UTF-8. This means any non-ASCII characters are not interpreted correctly by web browsers.

This can easily be fixed by adding the following output to the HTML:

 <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

Cf. http://www.w3schools.com/html/html_charset.asp
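
As a stopgap on the user side, the fragment can be wrapped in a minimal page that declares the encoding. The helper `wrap_html_utf8` below is hypothetical and works on any character vector of HTML; it does not change what format_html() itself returns.

```r
# Hypothetical helper: wrap an HTML fragment in a page with a UTF-8 declaration
wrap_html_utf8 <- function(fragment) {
  paste0(
    "<!DOCTYPE html>\n<html>\n<head>\n",
    "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\">\n",
    "</head>\n<body>\n",
    paste(fragment, collapse = "\n"),
    "\n</body>\n</html>\n"
  )
}

page <- wrap_html_utf8("<table><tr><td>Fr\u00fchling</td></tr></table>")

# To save the page as UTF-8 bytes regardless of the locale:
# writeLines(enc2utf8(page), "table.html", useBytes = TRUE)
```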

toLatex.data.frame() cannot deal with matrices inside of data frames

Probable issue with record length

I am trying to process cycle 30 of the General Social Survey from Statistics Canada.
The file has a very large record layout. To see whether processing it was feasible, I split off just the record layout
and got the following error message. I have included the columns file and the log from the run. Is there anything that can be done, or should I find someone who can load it for me in SPSS?
Error in rofseek(fptr, pos = 0) : not an rofile
gss30_processl_log.pdf
gss30_columns.txt

Removing digits when using `toLatex` directly with `ftable`

For some very simple tables (with only one variable, counting the number of appearances of that variable), I use xtabs to generate the table and then export it with ftable. However, doing so adds no fewer than seven decimal digits.
What could be the reason for this? How could the decimal digits be made to disappear in the final, LaTeX-ready table (without manually removing them in the .tex file)?
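
Independently of toLatex, counts can be turned into digit-free strings in base R before export. The data frame `d` and the other names below are made up for illustration, and this sketch does not show which toLatex argument, if any, controls the digits.

```r
# Hypothetical one-variable count table
d   <- data.frame(g = c("a", "b", "a"))
tab <- xtabs(~ g, data = d)

# Format the counts as whole numbers, without any decimals
cells <- formatC(as.vector(tab), format = "d")
names(cells) <- names(tab)
```

The resulting character vector can then be pasted into table rows by hand if the export routine keeps adding decimals.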

Process stops prematurely - spss.fixed - data files appear fine

Statistics Canada's summer release of the Survey of Financial Security seems fine. There are two datasets. A large special weight dataset which processes fine and an economic family dataset which has issues. The spss.fixed process seems to stop midway through the first record. I tried it with only the record layout as well as with the full set of files. I have attached the efam set in the zip file. Suggestions as to how to move forward would be appreciated.
sfs2016_for_elff.zip

Suggestion for new method codebookEntry

memisc appears to lack a dedicated codebookEntry method for the "datetime.item" class; it falls back to the "atomic" method, which uses R's summary method. Because summary does not report missing values, the codebook entry for date variables lacks information on missing values as well.

Here is some quick code I used to patch this after the package is loaded. It's not perfect, but it seems to provide the needed information.

# set the method after loading the package
setGeneric("codebookEntry", getGeneric("codebookEntry", package="memisc"))

setMethod("codebookEntry","datetime.item",function(x){
  spec <- c(
    "Storage mode:"=storage.mode(x)
  )
  isna <- is.na(x)
  stats <- summary(x)
  stats <- list(descr=cbind(names(stats),paste(stats), NAs=sum(isna)))
  new("codebookEntry",
      spec = spec,
      stats = stats,
      annotation = annotation(x)
  )
})

What do you think?

deduplicate_labels() is painfully slow on large datasets

Thanks for closing the issue with duplicate labels. It works, but unfortunately it is very slow on large data sets. The time spent on importing data is about 100 times longer with deduplicate_labels() than without. My guess is that the implementation could be improved.

Beware that the test file is big: 1.7 GB, and that it will take almost 4 hours to run deduplicate_labels() on it.

myurl <- "http://hansekbrand.se/temp/test_deduplicate.sav"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z)
## File character set is 'UTF-8'.
## Converting character set to the local 'utf-8'.
## Warning message:
## 1 variables have duplicated labels:
##   SHDISTRI 

####  The next step takes almost 4 hours on my machine
fixed.meta.data <- deduplicate_labels(my.meta.data)

Importing a subset of the file without running deduplicate_labels() takes only a few minutes.

my.subset <- c("HHID", "HVIDX", "HV000", "HV001", "HV002", "HV005", "HV006", 
"HV007", "HV009", "HV013", "HV014", "HV016", "HV024", "HV025", 
"HV028", "HV201", "HV204", "HV205", "HV207", "HV208", "HV209", 
"HV210", "HV211", "HV212", "HV213", "HV214", "HV215", "HV216", 
"HV221", "HV225", "HV226", "HV227", "HV228", "HV230A", "HV236", 
"HV237", "HV239", "HV241", "HV242", "HV243B", "HV243C", "HV243D", 
"HV244", "HV245", "HV246", "HV247", "HV271", "SH36", "HV101", 
"HV104", "HV105", "HV106", "HV108", "HV111", "HV112", "HV113", 
"HV114", "HV140", "HC60")
names(my.subset) <- my.subset
my.ds <- subset(my.meta.data, select = my.subset)
my.df <- as.data.frame(within(my.ds, {
    missing.values(HV112) <- c("Mother not in household")
    }))

Is there a way to speed up deduplicate_labels()?
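
Since value labels are metadata, making them unique should in principle not depend on the number of rows. The sketch below works on a hypothetical named vector of labels only; it is not the implementation of deduplicate_labels(), and `dedup_labels` is a made-up name.

```r
# Hypothetical vectorised de-duplication of a label table
# (codes as values, labels as names); the data column is never touched
dedup_labels <- function(labels) {
  nms <- names(labels)
  # Flag every label that occurs more than once
  dup <- nms %in% nms[duplicated(nms)]
  # Append the code to each ambiguous label to make it unique
  nms[dup] <- paste0(nms[dup], " (", labels[dup], ")")
  names(labels) <- nms
  labels
}

labs  <- c(North = 1, South = 2, Other = 7, Other = 8)
fixed <- dedup_labels(labs)
```

An approach like this runs in time proportional to the number of labels rather than the number of observations.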

package ‘memisc’ is not available (for R version 3.3.1)

I am not able to install the package. An error message appears.

	cls <- class(x)
	sumstats.name <- paste0("summary.stats.",cls)
	sumstats <- lapply(sumstats.name,getOption)
	if(length(sumstats)){
	sumstats <- unlist(sumstats)
	sumstats[1]
	}
	else
	sumstats <- getOption("summary.stats.default")
