
ctgan's Introduction


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.



Overview

CTGAN is a collection of deep-learning-based synthetic data generators for single-table data. They learn from real data and generate synthetic clones with high fidelity.

Important Links
  • 💻 Website: Check out the SDV Website for more information about the project.
  • 📙 SDV Blog: Regularly published, useful content about Synthetic Data Generation.
  • 📖 Documentation: Quickstarts, User and Development Guides, and API Reference.
  • Repository: The link to the GitHub Repository of this library.
  • 📜 License: The entire ecosystem is published under the MIT License.
  • ⌨️ Development Status: This software is in its Pre-Alpha stage.
  • Community: Join our Slack Workspace for announcements and discussions.
  • Tutorials: Run the SDV Tutorials in a Binder environment.

Implemented Models

Currently, this library implements the CTGAN and TVAE models proposed in the paper "Modeling Tabular data using Conditional GAN". For more information about these models, please check out their respective user guides.

Install

CTGAN is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.

Optionally, CTGAN can also be installed as a standalone library using the following commands:

Using pip:

pip install ctgan

Using conda:

conda install -c pytorch -c conda-forge ctgan

For more installation options, please visit the CTGAN Installation Guide.
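After installing, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch that uses only the Python standard library (Python 3.8+); `installed_version` is a hypothetical helper name, not part of CTGAN.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the ctgan version string, or None if ctgan is not installed.
print(installed_version('ctgan'))
```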

Usage Example

⚠️ WARNING: If you're just getting started with synthetic data, we recommend using the SDV library, which provides user-friendly APIs for interacting with CTGAN. To learn more about using CTGAN through SDV, check out the SDV user guide.

To get started with CTGAN, you should prepare your data as either a numpy.ndarray or a pandas.DataFrame object with two types of columns:

  • Continuous Columns: can contain any numerical value.
  • Discrete Columns: contain a finite number of values, whether these are string values or not.
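As a rough illustration of the two column types, the sketch below guesses which columns of a pandas.DataFrame are discrete. `guess_discrete_columns` and its `max_unique` threshold are hypothetical, not part of CTGAN, and the toy table is made up. Because discrete columns may also hold numeric codes, the helper flags numeric columns with few distinct values; treat its output as a starting point to review by hand.

```python
import pandas as pd


def guess_discrete_columns(data, max_unique=10):
    """Return the names of columns that look discrete.

    Hypothetical heuristic: a column is treated as discrete when it is
    non-numeric, or when it is numeric but has at most `max_unique`
    distinct values.
    """
    discrete = []
    for name in data.columns:
        column = data[name]
        is_numeric = pd.api.types.is_numeric_dtype(column)
        if not is_numeric or column.nunique() <= max_unique:
            discrete.append(name)
    return discrete


# Tiny made-up table; the values are illustrative only.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 30, 23],
    'workclass': ['Private', 'State-gov', 'Private', 'Private',
                  'Private', 'Private', 'Self-emp', 'Private',
                  'Private', 'State-gov', 'Private', 'Private'],
    'has-degree': [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1],
})

toy_discrete_columns = guess_discrete_columns(toy)
print(toy_discrete_columns)  # ['workclass', 'has-degree']
```

Here `age` has 12 distinct numeric values and is left as continuous, while `has-degree` is numeric but binary, so it is flagged as discrete.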

In this example, we load the Adult Census Dataset, which is a built-in demo dataset. We then model it using the CTGANSynthesizer and generate a synthetic copy of it.

from ctgan import CTGANSynthesizer
from ctgan import load_demo

data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

# Train the model for 10 epochs, telling it which columns are discrete
ctgan = CTGANSynthesizer(epochs=10)
ctgan.fit(data, discrete_columns)

# Generate 1000 synthetic rows
samples = ctgan.sample(1000)
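One quick sanity check on a synthetic copy is to compare how often each category appears in a discrete column of the real and synthetic tables. The sketch below assumes both tables are pandas DataFrames. `max_frequency_gap` is a hypothetical helper, not part of CTGAN, and the two small tables are made up so that the block runs on its own.

```python
import pandas as pd


def max_frequency_gap(real, synthetic, column):
    """Largest absolute difference in category share between two tables."""
    real_share = real[column].value_counts(normalize=True)
    synth_share = synthetic[column].value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side counts as 0.
    gap = real_share.sub(synth_share, fill_value=0).abs()
    return float(gap.max())


# Made-up stand-ins for the real and synthetic tables.
real_toy = pd.DataFrame({'sex': ['Male', 'Male', 'Female', 'Female']})
synthetic_toy = pd.DataFrame({'sex': ['Male', 'Male', 'Male', 'Female']})

print(max_frequency_gap(real_toy, synthetic_toy, 'sex'))  # 0.25
```

With the example above, the same helper could be called as `max_frequency_gap(data, samples, 'income')`. A small gap suggests the category proportions were preserved, although it says nothing about relationships between columns.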

Join our community

  1. Please have a look at the Contributing Guide to see how you can contribute to the project.
  2. If you have any questions or feature requests, or if you find an error, please open an issue on GitHub or join our Slack Workspace.
  3. Also, do not forget to check the project documentation site!

Citing CTGAN

If you use CTGAN, please cite the following work:

  • Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
@inproceedings{xu2019modeling,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}

Related Projects

Please note that these libraries are external contributions and are neither maintained nor supervised by the MIT DAI-Lab team.

R interface for CTGAN

A wrapper around CTGAN has been implemented by Kevin Kuo @kevinykuo, bringing the functionalities of CTGAN to R users.

More details can be found in the corresponding repository: https://github.com/kasaai/ctgan

CTGAN Server CLI

A package to easily deploy CTGAN onto a remote server. This package is developed by Timothy Pillow @oregonpillow.

More details can be found in the corresponding repository: https://github.com/oregonpillow/ctgan-server-cli




The DataCebo team is the proud developer of The Synthetic Data Vault Project, the largest open source ecosystem for synthetic data generation & evaluation. The ecosystem is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi-table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

