datasynthr's People

Contributors

aaronrudkin, jknowles

datasynthr's Issues

Generate ordinal factors

Need to generate ordinal factors like Likert scales. Currently can only generate unordered factors and can only test for associations between unordered factors.
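
One possible approach, sketched here in base R, is to cut a latent normal variable into ordered categories. The helper name `gen_likert` and the default labels are assumptions made for illustration; neither is part of datasynthr.

```r
# Sketch only: gen_likert() is a hypothetical helper, not a datasynthr function.
gen_likert <- function(n, labels = c("Strongly disagree", "Disagree", "Neutral",
                                     "Agree", "Strongly agree")) {
  latent <- rnorm(n)
  # Cut points that give each category equal probability under a standard normal
  breaks <- qnorm(seq(0, 1, length.out = length(labels) + 1))
  cut(latent, breaks = breaks, labels = labels, ordered_result = TRUE)
}

set.seed(1)
likert <- gen_likert(1000)
is.ordered(likert)
table(likert)
```

Because the result is an ordered factor, rank-based measures such as `cor(..., method = "spearman")` on the integer codes could be used to test associations.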

Generate simulated data from known data

Allow users to generate simulated data from a known dataset. Solve for:

  • Simple numeric data
  • Nested numeric data
  • Simple categorical data
  • Nested categorical data
  • Simple numeric and categorical data
  • Nested numeric and categorical data
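
A rough sketch of the simplest case (simple numeric and categorical data) might look like the following. The helper `synth_simple` is hypothetical. It assumes the numeric columns are roughly multivariate normal, resamples each categorical column independently, and ignores nesting and any dependence between numeric and categorical columns.

```r
# Sketch only: synth_simple() is a hypothetical helper, not a datasynthr function.
synth_simple <- function(data, n = nrow(data)) {
  is_num <- vapply(data, is.numeric, logical(1))
  out <- vector("list", ncol(data))
  names(out) <- names(data)
  if (any(is_num)) {
    num <- as.matrix(data[is_num])
    # Multivariate normal draw matching the observed means and covariance
    z <- matrix(rnorm(n * ncol(num)), nrow = n)
    sim <- sweep(z %*% chol(cov(num)), 2, colMeans(num), "+")
    out[is_num] <- lapply(seq_len(ncol(sim)), function(j) sim[, j])
  }
  for (nm in names(data)[!is_num]) {
    # Resample categories according to their observed proportions
    p <- prop.table(table(data[[nm]]))
    out[[nm]] <- factor(sample(names(p), n, replace = TRUE, prob = as.numeric(p)),
                        levels = names(p))
  }
  as.data.frame(out)
}

set.seed(10)
sim_iris <- synth_simple(iris, n = 500)
str(sim_iris)
```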

Update test infrastructure

The current test suite was built before the release of testthat to CRAN. As a result the package uses some ad hoc declarations instead of the equality and identity expectations available in testthat (for example expect_equal and expect_identical). These should be updated.
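
As an illustration of the target style, an updated test might look roughly like this. It assumes testthat is installed; the test body builds a correlated pair by hand and is not taken from the package's existing suite.

```r
library(testthat)

test_that("sample correlation is close to the target", {
  set.seed(42)
  x <- rnorm(1e4)
  # Hand-built correlated variable, standing in for a package generator
  y <- 0.7 * x + sqrt(1 - 0.7^2) * rnorm(1e4)
  expect_equal(cor(x, y), 0.7, tolerance = 0.05)
  expect_identical(length(y), length(x))
})
```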

Multivariate correlation structures

Currently the correlation structure of generated data comes from enforcing bivariate correlations that chain off one another sequentially or all come from a single starting vector.

This needs to be fixed to allow data to be generated using a user-specified covariance matrix.
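
A minimal sketch of one standard way to do this, using the Cholesky factor of the covariance matrix, is shown below. The helper `gen_mvn` is hypothetical and only covers multivariate normal data.

```r
# Sketch only: gen_mvn() is a hypothetical helper, not a datasynthr function.
gen_mvn <- function(n, sigma, mu = rep(0, ncol(sigma))) {
  stopifnot(is.matrix(sigma), isSymmetric(sigma))
  r <- chol(sigma)  # upper triangular factor, t(r) %*% r equals sigma
  z <- matrix(rnorm(n * ncol(sigma)), nrow = n)
  # Rows of z %*% r have covariance sigma; then shift each column by its mean
  sweep(z %*% r, 2, mu, "+")
}

sigma <- matrix(c( 1.0, 0.7, -0.3,
                   0.7, 1.0,  0.2,
                  -0.3, 0.2,  1.0), nrow = 3)
set.seed(123)
mvn <- gen_mvn(50000, sigma)
round(cov(mvn), 2)
```

`chol()` fails if the matrix is not positive definite, which doubles as a basic check on the user-specified input.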

Missing data

Synthetic data needs an option to include missing data under each of the following mechanisms:

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)

These options need to be user controllable to specifically test different types of missingness.

Additionally, functions are needed to detect and measure missing data in a dataset, in order to verify that the missingness generators are working properly.
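
The three mechanisms could be sketched in base R as follows. The logistic form of the missingness model, the coefficients, and the helper `miss_rate` are all assumptions chosen for illustration.

```r
set.seed(2024)
n <- 5000
d <- data.frame(x = rnorm(n), y = rnorm(n))

# MCAR: every y value has the same 20% chance of being missing
mcar <- d
mcar$y[runif(n) < 0.2] <- NA

# MAR: missingness in y depends only on the fully observed x
mar <- d
mar$y[runif(n) < plogis(-1.5 + d$x)] <- NA

# MNAR: missingness in y depends on the value of y itself
mnar <- d
mnar$y[runif(n) < plogis(-1.5 + d$y)] <- NA

# Hypothetical measuring helper: share of NA values per column
miss_rate <- function(data) colMeans(is.na(data))
miss_rate(mcar)
miss_rate(mar)
miss_rate(mnar)
```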

Error checking within functions

Make sure each function has error checking in place to report to users when an error has been made and to provide helpful context.
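
For example, a small argument validator along these lines could be called at the top of each generator. `check_rho` is a hypothetical name and the messages are only suggestions.

```r
# Sketch only: check_rho() is a hypothetical validator, not a datasynthr function.
check_rho <- function(rho) {
  if (!is.numeric(rho) || length(rho) != 1L || is.na(rho)) {
    stop("`rho` must be a single non-missing number.", call. = FALSE)
  }
  if (rho < -1 || rho > 1) {
    stop("`rho` must lie in [-1, 1]; got ", rho, ".", call. = FALSE)
  }
  invisible(rho)
}

check_rho(0.7)
# Capture the error message instead of stopping the script
msg <- tryCatch(check_rho(1.5), error = conditionMessage)
msg
```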

Documentation

Need to document each function with roxygen.

Make a vignette that synthetically reproduces a couple of the datasets built into R.

Generators are sensitive to scale of input variable

The following fails:

a <- rnorm(5000)
a <- a * 10
RHO1 <- 0.7
RHO2 <- -0.7
RHO3 <- 0.01

vecB <- rpoiscor(a, RHO1)
vecC <- rpoiscor(a, RHO2)
vecD <- rpoiscor(a, RHO3)
tol <- 0.05

In this case vecB, vecC, and vecD are not well behaved: they have infinite standard deviations or missing values.
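
One possible mitigation, offered as an assumption rather than a diagnosis of rpoiscor, is to standardize the input inside the generator. Pearson correlation is unchanged by a positive linear rescaling, so this should not alter the target. The sketch below checks that invariance with a hand-built correlated variable instead of the package generators.

```r
set.seed(7)
a <- rnorm(5000) * 10
a_std <- as.numeric(scale(a))  # centered, unit standard deviation

# Hand-built variable correlated with a at roughly 0.7
b <- 0.7 * a_std + sqrt(1 - 0.7^2) * rnorm(5000)

# The correlation with b is the same for the raw and standardized input
all.equal(cor(a, b), cor(a_std, b))
sd(a_std)
```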

Generators not optimized

Need to vectorize the generator functions; they take much too long to execute. Generating a correlated numeric matrix of 88000 x 80 takes close to one minute in some cases. This is definitely due to the loop through the columns inside the function and needs to be fixed!
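
As a rough sketch of what a vectorized version could look like, the following builds every column in one matrix operation from a single standardized starting vector. `gen_numeric_vec` is a hypothetical stand-in that handles only normal data and is not the package's generator.

```r
# Sketch only: gen_numeric_vec() is hypothetical and covers only the normal case.
gen_numeric_vec <- function(n, rho) {
  z <- as.numeric(scale(rnorm(n)))
  noise <- matrix(rnorm(n * length(rho)), nrow = n)
  # Column j is rho[j] * z + sqrt(1 - rho[j]^2) * noise[, j], with no explicit loop
  outer(z, rho) + sweep(noise, 2, sqrt(1 - rho^2), "*")
}

set.seed(99)
rho <- rep(c(0.7, -0.7, 0.05, 0.3), 20)  # 80 target correlations
system.time(m <- gen_numeric_vec(88000, rho))
dim(m)
```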

Allow parameters to be single or list-wise for generator functions

Allow the user to pass a list of correlation structures instead of only a single correlation for an entire dataset.

Allow the user to pass a list of names for the resulting dataset.

Allow the user to pass desired scales for the variables so they can be rescaled.
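
A possible interface is sketched below. The function name, argument names, and recycling behavior are assumptions: a single value is recycled across columns, while a list or vector must supply one value per column.

```r
# Sketch only: gen_numeric_list() and its arguments are hypothetical.
gen_numeric_list <- function(n, k, rho, col_names = paste0("V", seq_len(k)),
                             scales = 1) {
  rho <- unlist(rho)
  scales <- unlist(scales)
  stopifnot(length(rho) %in% c(1L, k), length(scales) %in% c(1L, k),
            length(col_names) == k)
  rho <- rep_len(rho, k)        # recycle a single correlation across columns
  scales <- rep_len(scales, k)  # recycle a single scale across columns
  z <- rnorm(n)
  x <- outer(z, rho) + sweep(matrix(rnorm(n * k), nrow = n), 2, sqrt(1 - rho^2), "*")
  x <- sweep(x, 2, scales, "*") # rescale each column
  colnames(x) <- col_names
  as.data.frame(x)
}

set.seed(5)
dl <- gen_numeric_list(1000, 3, rho = list(0.7, -0.7, 0.1),
                       col_names = c("a", "b", "c"), scales = c(1, 10, 100))
sapply(dl, sd)
```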

Updated missingness handling

  • Create a method for testing MCAR data that is more efficient than a dump of Gamma statistics
  • Allow user to specify functional form / model for MAR data

Make sure correlation samplers pass internal tests

Right now, depending on which distribution the correlation sampler is generating a correlated variable from, the resulting correlations are far from the target, especially when switching between a uniform and a normal distribution.

This needs to be investigated further and these errors need to be reduced. These functions should pass more of the tests in inst/test-correlatedsamplers than they do.

Setting the seed

Results need to be reproducible.

Allow users to set the seed through function options at various points. Make sure the results are reproducible and write tests to check this.
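
A minimal sketch of an optional seed argument, with a matching reproducibility check, could look like this. `gen_with_seed` is a hypothetical stand-in for any generator. Note that `set.seed()` changes the global RNG state, which a real implementation may want to restore afterwards.

```r
# Sketch only: gen_with_seed() is a hypothetical stand-in for a generator.
gen_with_seed <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # only touch the RNG state when asked
  rnorm(n)
}

run1 <- gen_with_seed(10, seed = 1)
run2 <- gen_with_seed(10, seed = 1)
run3 <- gen_with_seed(10, seed = 2)
identical(run1, run2)
```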

Tests are crude

Need to vectorize tests so they run across distribution combinations and across a sensible range of rho values. Right now arbitrarily using 0.7, 0.05, and -0.7 is OK, but insufficient for ensuring the functions are robust.
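
One way to structure such tests is to build a grid of distributions and rho values and check every row. In this sketch `gen_pair` is a stand-in generator written only for the example; a real test would call the package's samplers instead. The tolerance of 0.05 and the rho grid are also assumptions.

```r
dists <- list(normal = rnorm, uniform = runif)
grid <- expand.grid(seed_dist = names(dists), rho = seq(-0.9, 0.9, by = 0.3),
                    stringsAsFactors = FALSE)

# Stand-in generator for this sketch only
gen_pair <- function(x, rho) {
  rho * as.numeric(scale(x)) + sqrt(1 - rho^2) * rnorm(length(x))
}

set.seed(11)
grid$observed <- mapply(function(d, rho) {
  x <- dists[[d]](20000)       # draw the starting vector from distribution d
  cor(x, gen_pair(x, rho))     # observed correlation for this grid row
}, grid$seed_dist, grid$rho, USE.NAMES = FALSE)
grid$pass <- abs(grid$observed - grid$rho) < 0.05
grid
```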

Allow the user to specify the index for correlations

Currently a randomly generated first column is used by genFactor and genNumeric to generate subsequent correlations. It would be nice if the user could chain together correlation structures by specifying the column used, allowing multiple dataframes to be combined with a single "seed" column for the correlation structure.

Alternative RNGs

Hook in with alternative RNG backends to allow users to use RNGs that are not built into R but come through add-on packages.
