jknowles / datasynthr

Functions to procedurally generate synthetic data in R for testing and collaboration.

R 100.00%

datasynthr's Issues

Generate ordinal factors

The package needs to generate ordinal factors such as Likert scales. Currently it can only generate unordered factors and can only test for associations between unordered factors.
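As a rough illustration of what an ordinal generator could return, here is a minimal base R sketch. The helper name `gen_ordinal` and its arguments are hypothetical and are not part of datasynthr; the sketch only shows that `factor(..., ordered = TRUE)` preserves the level order a Likert scale needs.

```r
# Hypothetical helper (not a datasynthr function): draw an ordered factor.
# `levels` gives the categories from lowest to highest; `prob` is optional.
gen_ordinal <- function(n, levels, prob = NULL) {
  draws <- sample(levels, size = n, replace = TRUE, prob = prob)
  # ordered = TRUE keeps the low-to-high ranking of the levels
  factor(draws, levels = levels, ordered = TRUE)
}

set.seed(1)
likert <- c("Strongly disagree", "Disagree", "Neutral",
            "Agree", "Strongly agree")
x <- gen_ordinal(200, likert)
table(x)
```

Because the result is an ordered factor, comparisons such as `x >= "Agree"` are meaningful, which is what rank-based association tests rely on.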

Generate simulated data from known data

Allow users to generate simulated data from a known dataset. Solve for:

Simple numeric data
Nested numeric data
Simple categorical data
Nested categorical data
Simple numeric and categorical data
Nested numeric and categorical data

The current test suite was built before testthat was released to CRAN. As a result, the package uses ad hoc comparison code instead of the equality and identity expectations available in testthat (expect_equal and expect_identical). These tests should be updated.

Multivariate correlation structures

Currently the correlation structure of generated data comes from enforcing bivariate correlations that chain off one another sequentially or all come from a single starting vector.

This needs to be fixed to allow data to be generated using a user-specified covariance matrix.
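One common way to honor a user-specified covariance matrix is to multiply independent normal draws by a Cholesky factor of that matrix. The sketch below uses only base R; `gen_mvn` is a hypothetical name, and it covers multivariate normal data only, not the other distributions the package targets.

```r
# Hypothetical helper (not a datasynthr function): draw n rows from a
# multivariate normal with covariance matrix `sigma` and mean vector `mu`.
gen_mvn <- function(n, sigma, mu = rep(0, ncol(sigma))) {
  stopifnot(is.matrix(sigma), nrow(sigma) == ncol(sigma), isSymmetric(sigma))
  p <- ncol(sigma)
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)  # independent N(0, 1) draws
  r <- chol(sigma)           # upper triangular, t(r) %*% r equals sigma
  sweep(z %*% r, 2, mu, "+") # rows now have covariance sigma, mean mu
}

set.seed(2)
sigma <- matrix(c(1.0,  0.5,  0.2,
                  0.5,  1.0, -0.3,
                  0.2, -0.3,  1.0), nrow = 3, byrow = TRUE)
x <- gen_mvn(20000, sigma)
round(cov(x), 2)
```

`chol()` stops with an error if `sigma` is not positive definite, which doubles as a basic validity check on the user's input.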

Missing data

Synthetic data needs to have an option to include missing data.

Missing Completely At Random
Missing at Random
Missing Not at Random

These options need to be user controllable to specifically test different types of missingness.

Additionally, functions are needed to detect and measure missing data in a dataset, in order to verify that the missingness options are working properly.
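The three mechanisms differ only in what the probability of missingness depends on. The sketch below is one possible shape for a user-controllable option; `add_missing`, its arguments, and the logistic link are all assumptions made for illustration, not existing package code.

```r
# Hypothetical helper (not a datasynthr function): set values of x to NA.
#   MCAR: every value has the same probability `rate` of being dropped
#   MAR:  the probability depends on another observed variable, `driver`
#   MNAR: the probability depends on the value of x itself
add_missing <- function(x, type = c("MCAR", "MAR", "MNAR"),
                        rate = 0.2, driver = NULL) {
  type <- match.arg(type)
  if (type == "MAR" && is.null(driver)) {
    stop("`driver` must be supplied when type = \"MAR\"", call. = FALSE)
  }
  n <- length(x)
  p <- switch(type,
    MCAR = rep(rate, n),
    MAR  = plogis(qlogis(rate) + as.numeric(scale(driver))),
    MNAR = plogis(qlogis(rate) + as.numeric(scale(x)))
  )
  x[runif(n) < p] <- NA
  x
}

set.seed(3)
x <- rnorm(5000)
z <- rnorm(5000)
mcar <- add_missing(x, "MCAR", rate = 0.2)
mar  <- add_missing(x, "MAR",  rate = 0.2, driver = z)
mnar <- add_missing(x, "MNAR", rate = 0.2)

# Measuring missingness: share of NA values per variable
colMeans(is.na(data.frame(mcar, mar, mnar)))
```

For MAR and MNAR the overall share of missing values is only roughly `rate`, because the logistic shift is not recentered; a real implementation would probably want to calibrate it.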

Generate continuous dependent variables

Allow the user to pass dataframe elements and generate a continuous dependent variable with a known relationship to predictors.
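A known relationship can be imposed by building the outcome as a linear combination of the predictors plus noise. The following is a minimal sketch assuming a linear model with normal errors; `gen_outcome` and its arguments are hypothetical names.

```r
# Hypothetical helper (not a datasynthr function): build a continuous outcome
#   y = intercept + X %*% coefs + e,   e ~ N(0, sigma^2)
# `coefs` is a named numeric vector whose names pick columns of `data`.
gen_outcome <- function(data, coefs, intercept = 0, sigma = 1) {
  stopifnot(is.data.frame(data), all(names(coefs) %in% names(data)))
  x <- as.matrix(data[names(coefs)])
  as.numeric(intercept + x %*% coefs + rnorm(nrow(data), sd = sigma))
}

set.seed(4)
df <- data.frame(x1 = rnorm(1000), x2 = runif(1000))
df$y <- gen_outcome(df, coefs = c(x1 = 2, x2 = -1),
                    intercept = 0.5, sigma = 0.5)

# Fitting the same model should recover coefficients near 0.5, 2, and -1
fit <- coef(lm(y ~ x1 + x2, data = df))
fit
```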

Error checking within functions

Make sure each function has error checking in place to report to users when an error has been made and to provide helpful context.
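As an example of the kind of check intended here, a small validator for a correlation argument might look like the sketch below. `check_rho` is a hypothetical helper; the actual argument names in the package may differ.

```r
# Hypothetical helper (not a datasynthr function): validate a correlation
# argument and fail early with a message that tells the user what to fix.
check_rho <- function(rho) {
  if (!is.numeric(rho) || length(rho) != 1L || is.na(rho)) {
    stop("`rho` must be a single non-missing number", call. = FALSE)
  }
  if (rho < -1 || rho > 1) {
    stop("`rho` must lie in [-1, 1], got ", rho, call. = FALSE)
  }
  invisible(rho)
}

# Capture the message instead of halting, to show what the user would see
msg <- tryCatch(check_rho(1.5), error = function(e) conditionMessage(e))
msg
```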

Documentation

Need to document each function with roxygen.

Make a vignette that reproduces internal R datasets synthetically (pick a couple)

Generators are sensitive to scale of input variable

The following fails:

a <- rnorm(5000)
a <- a * 10
RHO1 <- 0.7
RHO2 <- -0.7
RHO3 <- 0.01

vecB <- rpoiscor(a, RHO1)
vecC <- rpoiscor(a, RHO2)
vecD <- rpoiscor(a, RHO3)
tol <- 0.05

In this case vecB, vecC, and vecD are not well-behaved variables. They have infinite standard deviations or contain missing values.
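Until the generators handle scale themselves, one possible workaround is to standardize the input before passing it on and to check the output for non-finite values. The sketch below does not call `rpoiscor`; `is_well_behaved` is a hypothetical diagnostic, and whether standardizing actually cures this failure is an assumption that would need to be tested against the package.

```r
# Hypothetical diagnostic (not a datasynthr function): TRUE only when the
# vector has no NA/NaN/Inf values and a finite standard deviation.
is_well_behaved <- function(x) {
  all(is.finite(x)) && is.finite(sd(x))
}

set.seed(5)
a <- rnorm(5000) * 10

# Standardize to mean 0 and sd 1 so the generator sees the scale it expects
a_std <- as.numeric(scale(a))

c(mean = mean(a_std), sd = sd(a_std))
is_well_behaved(a_std)
```

The same diagnostic could be applied to the vectors returned by the generators to turn the failure described above into an automated test.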

Generators not optimized

Need to vectorize the generator functions. They are taking much too long to execute. Generating a correlated numeric matrix that is 88000 x 80 takes close to one minute in some cases. This is most likely due to the loop through the columns inside the function and needs to be fixed.
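For normally distributed columns, the column loop can be replaced with a couple of matrix operations. The sketch below is an assumption about how that could look, not the package's current code; `gen_correlated_matrix` is a hypothetical name, and it builds every column from a single starting vector.

```r
# Hypothetical helper (not a datasynthr function): build all columns at once.
# Column j is rho[j] * z + sqrt(1 - rho[j]^2) * e_j, so its correlation with
# the standardized starting vector z is approximately rho[j].
gen_correlated_matrix <- function(seed_col, rho) {
  stopifnot(all(abs(rho) <= 1))
  n <- length(seed_col)
  z <- as.numeric(scale(seed_col))
  noise <- matrix(rnorm(n * length(rho)), nrow = n)
  outer(z, rho) + sweep(noise, 2, sqrt(1 - rho^2), "*")
}

set.seed(6)
a <- rnorm(20000)
rho <- seq(-0.8, 0.8, length.out = 9)
m <- gen_correlated_matrix(a, rho)

# Observed correlation of each column with the starting vector
obs <- as.numeric(cor(a, m))
round(obs, 2)
```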

Make sure it can pass CRAN checks

The package must first get into a buildable state. Features will then be added through branches until passing CRAN checks can be verified for them as well.

Allow parameters to be single or list-wise for generator functions

Allow the user to pass a list of correlation structures instead of only a single correlation for an entire dataset.

Allow the user to pass a list of names for the resulting dataset.

Allow the user to pass desired scales for the variables so they can be rescaled.
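One way to accept either a single value or a list is to recycle every parameter to the number of columns with `rep_len()`. The function below is a sketch under that assumption; `gen_numeric_sketch` and its argument names are hypothetical and do not describe the existing genNumeric interface.

```r
# Hypothetical helper (not a datasynthr function): each of rho, means, and sds
# may be a single value, a vector, or a list; all are recycled to k columns.
gen_numeric_sketch <- function(n, k, rho, names = NULL, means = 0, sds = 1) {
  rho   <- rep_len(unlist(rho), k)
  means <- rep_len(unlist(means), k)
  sds   <- rep_len(unlist(sds), k)
  if (is.null(names)) names <- paste0("V", seq_len(k))
  stopifnot(length(names) == k, all(abs(rho) <= 1), all(sds > 0))

  z <- rnorm(n)                               # shared starting vector
  noise <- matrix(rnorm(n * k), nrow = n)
  x <- outer(z, rho) + sweep(noise, 2, sqrt(1 - rho^2), "*")
  x <- sweep(sweep(x, 2, sds, "*"), 2, means, "+")  # rescale, then shift

  out <- as.data.frame(x)
  names(out) <- names
  out
}

set.seed(7)
d1 <- gen_numeric_sketch(1000, 3, rho = 0.5)            # one rho for all
d2 <- gen_numeric_sketch(1000, 3, rho = list(0.7, -0.7, 0.01),
                         names = c("a", "b", "c"),
                         sds = c(1, 10, 100))           # list-wise
sapply(d2, sd)
```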

Package not available for R version 3.2.1

Are you considering updating this package for use with newer versions of R?

Updated missingness handling

Create a method for testing MCAR data that is more efficient than a dump of Gamma statistics
Allow user to specify functional form / model for MAR data

Make sure correlation samplers pass internal tests

Right now, depending on which distribution the correlation sampler is generating a correlated distribution from, the resulting correlations are far from the target, especially when switching between a uniform and a normal distribution.

This needs to be investigated further and these errors need to be reduced. These functions should pass more of the tests in inst/test-correlatedsamplers than they do.

Chain together generators for entire datasets

Make it so users can chain the generator functions in order to create datasets with multiple data types and interesting correlation structures.

Setting the seed

Results need to be reproducible.

Allow users to set the seed at various points in the options to functions. Make sure that this is reproducible and write tests to check this.
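A simple pattern is an optional `seed` argument that calls `set.seed()` before any random draw, paired with a test that two calls with the same seed return identical results. `gen_with_seed` below is a hypothetical stand-in for any generator function.

```r
# Hypothetical helper (not a datasynthr function): an optional seed argument.
# When `seed` is NULL the current RNG state is used unchanged.
gen_with_seed <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  rnorm(n)
}

first  <- gen_with_seed(5, seed = 42)
second <- gen_with_seed(5, seed = 42)
third  <- gen_with_seed(5, seed = 43)

identical(first, second)  # same seed, same draws
identical(first, third)   # different seed, different draws
```

Note that calling `set.seed()` inside a function changes the global RNG state for the rest of the session, which may or may not be what the user expects.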

Tests are crude

Need to vectorize the tests so they run across distribution combinations and across a sensible range of rho values. Right now arbitrarily using 0.7, 0.05, and -0.7 is OK, but insufficient for ensuring the functions are robust.
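Running the checks over a grid could look like the sketch below, which crosses starting distributions with rho values using `expand.grid()`. The `toy_sampler` function is a stand-in written only so the example runs; the package's own correlation samplers would take its place, and the tolerance of 0.05 is an assumption.

```r
# Stand-in sampler (not a datasynthr function), used only so the grid runs.
toy_sampler <- function(x, rho) {
  z <- as.numeric(scale(x))
  rho * z + sqrt(1 - rho^2) * rnorm(length(x))
}

# Starting distributions to test against
starts <- list(
  normal  = function(n) rnorm(n),
  uniform = function(n) runif(n),
  poisson = function(n) rpois(n, lambda = 4)
)

# Every combination of starting distribution and target correlation
grid <- expand.grid(dist = names(starts),
                    rho = seq(-0.9, 0.9, by = 0.3),
                    stringsAsFactors = FALSE)

set.seed(8)
grid$observed <- mapply(function(dist, rho) {
  x <- starts[[dist]](10000)
  cor(x, toy_sampler(x, rho))
}, grid$dist, grid$rho, USE.NAMES = FALSE)

tol <- 0.05
grid$pass <- abs(grid$observed - grid$rho) < tol
grid[!grid$pass, ]  # any rows listed here are failing combinations
```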

Allow the user to specify the index for correlations

Currently a randomly generated first column is used by genFactor and genNumeric to generate subsequent correlations. It would be nice if the user could chain together correlation structures by specifying the column used, allowing multiple dataframes to be combined with a single "seed" column for the correlation structure.

Alternative RNGs

Hook in with alternative RNG backends to allow users to use RNGs that are not built into R but come through add-on packages.