jknowles / datasynthr
Functions to procedurally generate synthetic data in R for testing and collaboration.
Need to generate ordinal factors like Likert scales. Currently can only generate unordered factors and can only test for associations between unordered factors.
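One way this could look, as a minimal base R sketch: draw integer codes and wrap them in an ordered factor. The helper name gen_likert and its arguments are hypothetical and not part of the package.

```r
# Hypothetical helper (not part of datasynthr): draw n Likert-style
# responses and return them as an ordered factor.
gen_likert <- function(n,
                       labels = c("Strongly disagree", "Disagree", "Neutral",
                                  "Agree", "Strongly agree"),
                       prob = NULL) {
  draws <- sample(seq_along(labels), size = n, replace = TRUE, prob = prob)
  factor(draws, levels = seq_along(labels), labels = labels, ordered = TRUE)
}

set.seed(42)
x <- gen_likert(200)
y <- gen_likert(200)

is.ordered(x)        # ordered factors keep the level ordering
mean(x >= "Agree")   # comparisons between levels are allowed

# A rank-based association between two ordinal variables
cor(as.integer(x), as.integer(y), method = "spearman")
```

Because the levels are ordered, rank-based measures such as Spearman's correlation can be used to test associations, which is not meaningful for unordered factors.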
Allow users to generate simulated data from a known dataset. Solve for:
The current test suite was built before the release of testthat to CRAN. This means the package uses some weird declarations instead of using the equal or identical constructions available in testthat. These should be updated.
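As a rough illustration of the target style, assuming the testthat package is installed, the equal and identical constructions correspond to expect_equal() and expect_identical(). The values tested here are made up for the example and do not come from the package's test suite.

```r
library(testthat)

test_that("expectations replace hand-rolled comparisons", {
  x <- c(0.1 + 0.2, 1)
  expect_equal(x, c(0.3, 1))           # equal within numerical tolerance
  expect_identical(length(x), 2L)      # exact match on type and value
  expect_false(identical(x[1], 0.3))   # 0.1 + 0.2 is not bit-identical to 0.3
})
```

expect_equal() is usually the right choice for simulated numeric output, since exact identity is too strict for floating-point results.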
Currently the correlation structure of generated data comes from enforcing bivariate correlations that chain off one another sequentially or all come from a single starting vector.
This needs to be fixed to allow data to be generated using a user-specified covariance matrix.
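A minimal sketch of one standard approach, assuming multivariate normal data is acceptable: multiply independent standard normal draws by the Cholesky factor of the requested covariance matrix. The helper gen_mvn is hypothetical and is not the package's API.

```r
# Hypothetical helper (not part of datasynthr): n draws from a multivariate
# normal distribution with a user-specified covariance matrix.
gen_mvn <- function(n, sigma, mu = rep(0, ncol(sigma))) {
  stopifnot(is.matrix(sigma), isSymmetric(sigma), length(mu) == ncol(sigma))
  r <- chol(sigma)                                # upper triangular, t(r) %*% r == sigma
  z <- matrix(rnorm(n * ncol(sigma)), nrow = n)   # independent N(0, 1) columns
  sweep(z %*% r, 2, mu, `+`)                      # impose covariance, then shift means
}

sigma <- matrix(c( 1.0, 0.7, -0.3,
                   0.7, 2.0,  0.2,
                  -0.3, 0.2,  1.5), nrow = 3, byrow = TRUE)

set.seed(1)
x <- gen_mvn(50000, sigma)
round(cov(x), 2)   # should be close to sigma, up to sampling error
```

chol() fails if sigma is not positive definite, which doubles as a basic check on user input.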
Synthetic data needs to have an option to include missing data.
These options need to be user controllable to specifically test different types of missingness.
Additionally need functions in place to detect and measure missing data in a dataset in order to ensure function is working properly.
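A small sketch of both halves, covering only the simplest case where values are missing completely at random (MCAR). The helpers add_mcar and missing_rate are hypothetical names, and other missingness mechanisms would need their own generators.

```r
# Hypothetical helper (not part of datasynthr): set each cell to NA
# independently with probability p (missing completely at random).
add_mcar <- function(df, p) {
  stopifnot(is.data.frame(df), p >= 0, p <= 1)
  mask <- matrix(runif(nrow(df) * ncol(df)) < p, nrow = nrow(df))
  df[mask] <- NA
  df
}

# Hypothetical helper: share of missing values in each column.
missing_rate <- function(df) colMeans(is.na(df))

set.seed(1)
df  <- data.frame(a = rnorm(10000), b = runif(10000))
out <- add_mcar(df, p = 0.2)
missing_rate(out)   # both rates should be near 0.2
```

Comparing the measured rate against the requested p is one way to confirm the generator is working properly.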
Allow the user to pass dataframe elements and generate a continuous dependent variable with a known relationship to predictors.
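A minimal sketch of the idea, assuming a linear relationship with normal noise: the user supplies the coefficients, and a regression on the generated data should recover them. The helper gen_outcome and its arguments are hypothetical.

```r
# Hypothetical helper (not part of datasynthr): build a continuous outcome
# as intercept + X %*% beta + normal noise.
gen_outcome <- function(df, beta, intercept = 0, sd = 1) {
  x <- as.matrix(df)
  stopifnot(is.numeric(x), length(beta) == ncol(x))
  drop(intercept + x %*% beta + rnorm(nrow(x), sd = sd))
}

set.seed(1)
df <- data.frame(x1 = rnorm(5000), x2 = runif(5000))
df$y <- gen_outcome(df[, c("x1", "x2")], beta = c(1.5, -0.5), intercept = 2)

fit <- lm(y ~ x1 + x2, data = df)
round(coef(fit), 2)   # should be close to 2, 1.5 and -0.5
```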
Make sure each function has error checking in place to report to users when an error has been made and to provide helpful context.
Need to document each function with roxygen
Make a vignette that reproduces internal R datasets synthetically (pick a couple)
The following fails:
a <- rnorm(5000)
a <- a * 10
RHO1 <- 0.7
RHO2 <- -0.7
RHO3 <- 0.01
vecB <- rpoiscor(a, RHO1)
vecC <- rpoiscor(a, RHO2)
vecD <- rpoiscor(a, RHO3)
tol <- 0.05
In this case vecB, vecC, and vecD are not well behaved: they have infinite standard deviations or missing values.
Need to vectorize the generator functions. They are taking much too long to execute. Generating a numeric correlated matrix that is 88000 x 80 takes close to one minute in some cases. This is definitely due to the loop through the columns inside this function and needs to be fixed!
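To illustrate what removing the column loop might look like, here is a sketch for the normal case only: all columns are built at once from a single seed vector with matrix operations. The helper gen_correlated is hypothetical and does not reproduce the package's generators or their other distributions.

```r
# Hypothetical helper (not part of datasynthr): build one normal column per
# element of rho, each correlated with the seed vector a, without a loop.
gen_correlated <- function(a, rho) {
  stopifnot(all(abs(rho) <= 1))
  a_std <- as.numeric(scale(a))                          # mean 0, sd 1
  noise <- matrix(rnorm(length(a) * length(rho)), nrow = length(a))
  # column j is rho[j] * a_std + sqrt(1 - rho[j]^2) * noise[, j]
  outer(a_std, rho) + sweep(noise, 2, sqrt(1 - rho^2), `*`)
}

set.seed(1)
a   <- rnorm(20000) * 10
rho <- c(0.7, -0.7, 0.01)
x   <- gen_correlated(a, rho)
round(cor(a, x), 3)   # should be close to rho, up to sampling error
```

The correlations hold in expectation rather than exactly, so tests should compare them against a tolerance.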
Must get into a buildable state, then features will be added through branches until passing CRAN can be verified for them as well.
Allow the user to pass a list of correlation structures instead of only a single correlation for an entire dataset.
Allow the user to pass a list of names for the resulting dataset.
Allow the user to pass desired scales for the variables so they can be rescaled.
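One simple interpretation of rescaling, sketched under the assumption that a scale means a target minimum and maximum: a linear min-max transform. The helper rescale_to is hypothetical.

```r
# Hypothetical helper (not part of datasynthr): linearly map x onto
# the interval [lower, upper].
rescale_to <- function(x, lower, upper) {
  rng <- range(x, na.rm = TRUE)
  stopifnot(upper > lower, rng[2] > rng[1])
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

set.seed(1)
x <- rnorm(1000)
z <- rnorm(1000)
y <- rescale_to(x, lower = 1, upper = 5)

range(y)                 # 1 and 5
c(cor(x, z), cor(y, z))  # unchanged by the linear transform
```

Because the transform is linear with a positive slope, Pearson correlations with other variables are preserved.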
Are you considering updating this package for use with newer versions of R?
Right now depending on which distribution the correlation sampler is trying to generate a correlated distribution from, the correlations are way off -- especially when switching between a uniform and a normal distribution.
This needs to be investigated further and these errors need to be reduced. These functions should pass more of the tests in inst/test-correlatedsamplers than they do.
Make it so users can chain the generator functions in order to create datasets with multiple data types and interesting correlation structures.
Results need to be reproducible.
Allow users to set the seed at various points in the options to functions. Make sure that this is reproducible and write tests to check this.
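A minimal sketch of a seed option and the kind of check a test could make. The function gen_numeric_seeded is a hypothetical stand-in, not one of the package's generators.

```r
# Hypothetical stand-in generator (not part of datasynthr) with an
# optional seed argument.
gen_numeric_seeded <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # note: this resets the global RNG state
  rnorm(n)
}

first  <- gen_numeric_seeded(5, seed = 123)
second <- gen_numeric_seeded(5, seed = 123)
other  <- gen_numeric_seeded(5, seed = 124)

identical(first, second)   # TRUE: same seed, same draws
identical(first, other)    # FALSE: different seed
```

Calling set.seed() inside a function changes the RNG state for the rest of the session, so this side effect should be documented for users.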
Need to vectorize tests so they run across distribution combinations and across levels of rho that are sensible. Right now arbitrarily using 0.7, .05, and -0.7 is OK, but insufficient for ensuring the functions are robust.
Currently a randomly generated first column is used by genFactor and genNumeric to generate subsequent correlations. It would be nice if the user can chain together correlation structures by specifying the column used -- allowing multiple dataframes to be combined with a single "seed" column for the correlation structure.
Hook in with alternative RNG backends to allow users to use RNGs that are not built into R but come through add-on packages.