dieterich-lab / asyh

The Anonymous Synthesizer for Health Data

License: MIT License

Python 98.84% Jinja 1.16%
anonymization synthetic-data ctgan gaussian-mixture-models ml-pipeline synthetic-dataset-generation variational-autoencoder synthetic-data-vault health-data

asyh's Introduction

ASyH - Anonymous Synthesizer for Health Data (Release 1).

Overview

ASyH is software that helps clinics, as holders of large quantities of highly restricted personal health data, provide the medical data community with realistic datasets without breaching privacy. It does this by synthesizing data with machine learning techniques that preserve the distribution and correlations of the data while adding enough variation that the synthetic data does not resemble any of the original patient entries.

For synthesis, metrics, and quality assurance, ASyH mainly uses the Synthetic Data Vault (SDV, available on GitHub).

Installation and Upgrading

Using pip, the easiest way to install/upgrade ASyH is

pip install --upgrade https://github.com/dieterich-lab/ASyH/tarball/v1.0.2

Usage

The most basic use case for ASyH is to create an ASyH Application object and call synthesize() to get a synthetic dataset from the best-performing SDV model/synthesizer (one of CopulaGAN, CTGAN, GaussianCopula, or TVAE [cf. the SDV documentation]). The original dataset should be provided as a pandas DataFrame; the synthesized dataset is returned as a pandas DataFrame as well. To identify numerical and categorical variables, a metadata file in JSON format needs to be provided (see below).

import ASyH

asyh = ASyH.Application()
synthetic_data = asyh.synthesize('original_data.csv', metadata_file='metadata.json')

# write the synthetic dataset to a CSV file (the file name is just an example):
synthetic_data.to_csv('synthetic_data.csv', index=False)

Alternatively, you can specify an Excel file as the first argument to asyh.synthesize().

Additionally, a report on the quality of the output data (in terms of its similarity to the original data) can be generated with the following code, which can be appended to the script above or run on its own:

import ASyH
import pandas
import json

# We will need the original dataset as pandas DataFrame
original_data = pandas.read_csv('input_data.csv')

# We also need the metadata as a dict:
with open('metadata.json', 'r', encoding='utf-8') as md_file:
    metadata = json.load(md_file)

asyh = ASyH.Application()
synthetic_data = asyh.synthesize('input_data.csv', metadata_file='metadata.json')

# the following will create the md file
#   report.md
# and, if an installation of TeXLive and pandoc is available
#   report.pdf
report = ASyH.Report(original_data, synthetic_data, metadata)
report.generate('report', asyh.model.model_type)

After the script has run, you will find a zip archive containing all images, the markdown file (and the PDF, if one was generated), and the synthetic data as a CSV file. Note that the above code assumes that the metadata specifies the table name as 'data'.

Metadata format

ASyH uses SDV's metadata format (cf. 'Metadata' in the SDV documentation).

The skeleton of the JSON file should look like the following

{"columns":
    { ...column specifications...
    },
 "primary_key":...
}

Specifying a primary_key is optional.

The column specifications are of the form

"COLUMN_NAME": {"sdtype": "COLUMN_TYPE"}

or

"COLUMN_NAME": {"sdtype": "COLUMN_TYPE", "SPECIFIER": SPECIFIER_VALUE}

where COLUMN_NAME is the name of a column variable and COLUMN_TYPE is one of numerical, datetime, categorical, boolean, or id. Which SPECIFIER/SPECIFIER_VALUE pair to use depends on the sdtype of the variable. No specifier applies to boolean and categorical variables; for the other types, the specifiers are:

  • computer_representation for numerical variables.
    Allowed values are "Float", "Int8", "Int16", "Int32", "Int64", "UInt8", "UInt16", "UInt32", "UInt64"

  • regex_format for id variables.
    The regex string should use Perl-style regular expression syntax (cf. also the Python documentation).

  • datetime_format is required for datetime type variables.
    The SPECIFIER_VALUE for this specifier is a string in strftime format.
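As an illustration of the rules above, here is a minimal sketch of a metadata dictionary together with a small consistency check. The column names (patient_id, age, and so on) and the helper check_metadata are hypothetical and not part of ASyH or SDV; the check only encodes the rules listed in this section and assumes that a numerical column without a computer_representation is acceptable.

```python
import json

# Specifier allowed for each sdtype, following the rules in this section.
SPECIFIER_FOR_SDTYPE = {
    "numerical": "computer_representation",
    "id": "regex_format",
    "datetime": "datetime_format",
    "categorical": None,
    "boolean": None,
}

NUMERICAL_REPRESENTATIONS = {
    "Float", "Int8", "Int16", "Int32", "Int64",
    "UInt8", "UInt16", "UInt32", "UInt64",
}


def check_metadata(metadata):
    """Return a list of problems found in a metadata dict (empty if none)."""
    problems = []
    columns = metadata.get("columns", {})
    for name, spec in columns.items():
        sdtype = spec.get("sdtype")
        if sdtype not in SPECIFIER_FOR_SDTYPE:
            problems.append(f"{name}: unknown sdtype {sdtype!r}")
            continue
        allowed = SPECIFIER_FOR_SDTYPE[sdtype]
        extra = set(spec) - {"sdtype", allowed}
        if extra:
            problems.append(f"{name}: unexpected specifier(s) {sorted(extra)}")
        if sdtype == "datetime" and "datetime_format" not in spec:
            problems.append(f"{name}: datetime_format is required")
        if sdtype == "numerical":
            # Assumption: a missing representation is treated as "Float".
            rep = spec.get("computer_representation", "Float")
            if rep not in NUMERICAL_REPRESENTATIONS:
                problems.append(f"{name}: invalid computer_representation {rep!r}")
    key = metadata.get("primary_key")
    if key is not None and key not in columns:
        problems.append(f"primary_key {key!r} is not a declared column")
    return problems


# Hypothetical example table; all column names are made up.
example_metadata = {
    "columns": {
        "patient_id": {"sdtype": "id", "regex_format": "[0-9]{6}"},
        "age": {"sdtype": "numerical", "computer_representation": "Int64"},
        "admission_date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"},
        "sex": {"sdtype": "categorical"},
        "smoker": {"sdtype": "boolean"},
    },
    "primary_key": "patient_id",
}

# The JSON text that would go into the metadata file.
metadata_json = json.dumps(example_metadata, indent=4)
```

Calling check_metadata(example_metadata) returns an empty list for this example; writing metadata_json to a file gives a metadata file in the skeleton shown earlier.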

Development

To set up a development environment for this software:

  • Check out the repository

  • Create a Python venv for the project

  • Activate the venv

  • Install the package editable (-e) with the test dependencies:

      pip install -e '.[tests]'
    

To run the tests set the PYTHONPATH and execute pytest on the 'tests' folder:

    export PYTHONPATH=$(pwd)
    pytest tests

Release History

Release   Date
1.0.0     25/05/2023

asyh's People

Contributors

garrgravarr, timjohann

asyh's Issues

Delay SDV model creation.

The actual instantiation of the SDV model should be done just before we want to train with data, since we want to make sure we can adapt the argument list for the model constructor according to the input data and metadata.

The _train() method is defined in the generic Model class. The specific constructor should therefore be called with adapted arguments within the generic interface.

Thus, we should not pass a ready-made SDV model object to Model.__init__(), but instead pass a callable that invokes the actual constructor with the specific arguments (as a dictionary, compare Issue #6).
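One way the deferred construction proposed in this issue could look is sketched below. This is not ASyH's actual implementation: the attribute names, the adapt_arguments hook, and DummySynthesizer (a stand-in for an SDV model class) are all assumptions made for illustration.

```python
class Model:
    """Generic wrapper that creates the underlying model only when training."""

    def __init__(self, constructor, argument_dict=None):
        # Store how to build the model, not a ready-made model object.
        self._constructor = constructor
        self._argument_dict = dict(argument_dict or {})
        self.sdv_model = None

    def adapt_arguments(self, data, metadata):
        """Hook for specific models: adjust constructor arguments to the input."""
        return self._argument_dict

    def _train(self, data, metadata):
        arguments = self.adapt_arguments(data, metadata)
        # Instantiate as late as possible, right before fitting.
        self.sdv_model = self._constructor(**arguments)
        self.sdv_model.fit(data)


class DummySynthesizer:
    """Hypothetical stand-in for an SDV synthesizer; records what it receives."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.fitted_on = None

    def fit(self, data):
        self.fitted_on = data


model = Model(DummySynthesizer, {"epochs": 10})
created_early = model.sdv_model is not None  # False: nothing built yet
model._train([1, 2, 3], metadata={})
```

Because the constructor runs inside _train(), a specific model can override adapt_arguments() to inspect the data and metadata before the underlying object exists.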

Construct argument list from input data layout.

Adaptation of SDV model (number of hidden layers for encoding and decoding, etc.) is done at the SDV model's construction, via arguments.
Models have different keyword arguments to specify their internal layout, therefore, every specific ASyH model needs to construct an argument list from the input data layout.
For keyword arguments, a dictionary should be used.

The dictionary can then be used as argument as in
sdv_model_constructor(**argument_dict).
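A minimal sketch of this idea follows, assuming a made-up sizing rule (layer width proportional to the number of columns). The function build_argument_dict, the keyword names compress_dims and decompress_dims as used here, and fake_model_constructor are placeholders for illustration and do not describe what ASyH or SDV actually do.

```python
def build_argument_dict(metadata, units_per_column=4, minimum_units=32):
    """Derive keyword arguments for a model constructor from the metadata.

    The sizing rule is purely illustrative, not taken from ASyH or SDV.
    """
    n_columns = len(metadata["columns"])
    width = max(minimum_units, units_per_column * n_columns)
    return {
        "compress_dims": (width, width),
        "decompress_dims": (width, width),
    }


def fake_model_constructor(compress_dims, decompress_dims):
    """Hypothetical constructor standing in for an SDV model class."""
    return {"encoder": compress_dims, "decoder": decompress_dims}


# Ten hypothetical numerical columns.
metadata = {"columns": {f"col_{i}": {"sdtype": "numerical"} for i in range(10)}}
argument_dict = build_argument_dict(metadata)

# The dictionary is unpacked into keyword arguments at construction time.
model = fake_model_constructor(**argument_dict)
```

Keeping the arguments in a dictionary means each specific model only has to supply its own builder function, while the generic code that calls the constructor stays the same.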

ASyH 1.0: report.j2 not found

Generating a report with ASyH 1.0.0 will fail with errors like:

Traceback (most recent call last):
  File "/beegfs/scratch/ASyH/ARX-ASyH-Comparison/ASyH-scripts/2-train+sample+report.py", line 35, in <module>
    report.generate(dataset, asyh_application.model.model_type)
  File "/beegfs/homes/hwilhelmi/.venvs/ASyH/lib/python3.9/site-packages/ASyH/report.py", line 49, in generate
    markdown = self.get_mark_down_report(dataset_name, sd_model_name, images)
  File "/beegfs/homes/hwilhelmi/.venvs/ASyH/lib/python3.9/site-packages/ASyH/report.py", line 107, in get_mark_down_report
    jinja_template = self._get_report_template()
  File "/beegfs/homes/hwilhelmi/.venvs/ASyH/lib/python3.9/site-packages/ASyH/report.py", line 124, in _get_report_template
    return env.get_template('report.j2')
  File "/beegfs/homes/hwilhelmi/.venvs/ASyH/lib/python3.9/site-packages/jinja2/environment.py", line 1010, in get_template
    return self._load_template(name, globals)
  File "/beegfs/homes/hwilhelmi/.venvs/ASyH/lib/python3.9/site-packages/jinja2/environment.py", line 969, in _load_template
    template = self.loader.load(self, name, self.make_globals(globals))
  File "/beegfs/homes/hwilhelmi/.venvs/ASyH/lib/python3.9/site-packages/jinja2/loaders.py", line 126, in load
    source, filename, uptodate = self.get_source(environment, name)
  File "/beegfs/homes/hwilhelmi/.venvs/ASyH/lib/python3.9/site-packages/jinja2/loaders.py", line 218, in get_source
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: report.j2
