rheinwerk-verlag / pganonymize


A commandline tool for anonymizing PostgreSQL databases

Home Page: http://pganonymize.readthedocs.io/

License: Other

Languages: Python 94.23%, Makefile 5.33%, Dockerfile 0.44%
Topics: gdpr, database, anonymization, postgresql, python, python2, python3, python27, command-line, command-line-tool, cli, dsgvo, anonymizer, developer-tools, privacy

pganonymize's Introduction

pganonymize

A commandline tool to anonymize PostgreSQL databases for DSGVO/GDPR purposes.

It uses a YAML file to define which tables and fields should be anonymized and provides various methods of anonymization. The tool requires a direct PostgreSQL connection to perform the anonymization.


Demo: https://raw.githubusercontent.com/rheinwerk-verlag/pganonymize/main/docs/_static/demo.gif

Features

  • Intentionally compatible with Python 2.7 (for old production platforms)
  • Anonymize PostgreSQL tables on a per-column (data) level with various providers (some examples in the table below)
  • Exclude data from anonymization based on regular expressions or SQL WHERE clauses
  • Truncate entire tables to drop unwanted data
Field        Value                 Provider               Output
first_name   John                  choice                 (Bob|Larry|Lisa)
title        Dr.                   clear
street       Irving St             faker.street_name      Miller Station
password     dsf82hFxcM            mask                   XXXXXXXXXX
credit_card  1234-567-890          partial_mask           1??????????0
email        [email protected]      md5                    0cba00ca3da1b283a57287bcceb17e35
email        [email protected]      faker.unique.email     [email protected]
phone_num    65923473              md5 (as_number: True)  3948293448
ip           157.50.1.20           set                    127.0.0.1
uuid_col     00010203-0405-......  uuid4                  f7c1bd87-4d....
  • Note: faker.unique.[provider] is only supported on Python 3.6+ (the minimum Python version supported by the Faker library)
  • Note: uuid4 is only available for native uuid columns

See the documentation for a more detailed description of the provided anonymization methods.

The default installation method is to use pip:

$ pip install pganonymize

Usage:

usage: pganonymize [-h] [-v] [-l] [--schema SCHEMA] [--dbname DBNAME]
               [--user USER] [--password PASSWORD] [--host HOST]
               [--port PORT] [--dry-run] [--dump-file DUMP_FILE]

Anonymize data of a PostgreSQL database

optional arguments:
-h, --help            show this help message and exit
-v, --verbose         Increase verbosity
-l, --list-providers  Show a list of all available providers
--schema SCHEMA       A YAML schema file that contains the anonymization
                        rules
--dbname DBNAME       Name of the database
--user USER           Name of the database user
--password PASSWORD   Password for the database user
--host HOST           Database hostname
--port PORT           Port of the database
--dry-run             Don't commit changes made on the database
--dump-file DUMP_FILE
                      Create a database dump file with the given name
--dump-options DUMP_OPTIONS
                      Options to pass to the pg_dump command
--init-sql INIT_SQL   SQL to run before starting anonymization
--parallel            Data anonymization is done in parallel

Besides the database connection values, you have to define a YAML schema file that includes all anonymization rules for that database. Take a look at the schema documentation or the YAML sample schema.
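
A minimal schema sketch (table, field, and pattern names are illustrative; the structure mirrors the YAML examples quoted in the issues further down):

tables:
  - customers:
      fields:
        - first_name:
            provider:
              name: fake.first_name
        - email:
            provider:
              name: md5
      excludes:
        - email:
            - "info.*@example.com"

truncate:
  - django_session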

Example calls:

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    -v

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --init-sql "set search_path to non_public_search_path; set work_mem to '1GB';" \
    -v

With the --dump-file argument it is possible to create a dump file after anonymizing the database. Please note that the pg_dump command from the postgresql-client-common package is necessary to create the dump file, e.g. under Linux:

$ sudo apt-get install postgresql-client-common

Example call:

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --dump-file=/tmp/dump.gz \
    -v

So that the password for dumping does not have to be entered manually, it can also be provided via the PGPASSWORD environment variable:

$ PGPASSWORD=password pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --dump-file=/tmp/dump.gz \
    -v

Warning

Currently only the dump-file operation supports environment variables.

If you want to run the anonymizer within a Docker container you first have to build the image:

$ docker build -t pganonymize .

After that you can pass a schema file to the container, using Docker volumes, and call the anonymizer:

$ docker run \
    -v <path to your schema>:/schema.yml \
    -it pganonymize \
    /usr/local/bin/pganonymize \
    --schema=/schema.yml \
    --dbname=<database> \
    --user=<user> \
    --password=<password> \
    --host=<host> \
    -v

pganonymize's People

Contributors

abhinavvaidya, bobslee, fblackburn1, hkage, kianmeng, koptelovav, korsar182, nurikk, rheinwerk-mp, tilley, w1ldpo1nter


pganonymize's Issues

Script fails when a table column has upper case characters

Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 39, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 140, in import_data
    copy_from(connection, data, 'source', table_columns)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 119, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=columns)
psycopg2.errors.UndefinedColumn: column "createdat" of relation "source" does not exist

Better schema validation

If a schema has an invalid structure, the resulting errors are a bit confusing. E.g. if the "tables" definition is missing, the resulting error looks like this:

Traceback (most recent call last):
  File "/home/henning/.local/share/virtualenvs/postgresql-anonymizer-AUSqld0C/bin/pganonymize", line 11, in <module>
    load_entry_point('pganonymize', 'console_scripts', 'pganonymize')()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/cli.py", line 32, in main
    truncate_tables(connection, schema.get('truncate', []))
AttributeError: 'list' object has no attribute 'get'

It would be nice to have a validation method that checks the basic structure of the YAML file:

  • The tables keyword
  • fields at the table level
  • A provider at the field level, with at least a name
  • ...

Instead of writing our own validation methods, it might be possible to use one of the general Python-based YAML/schema validation libraries.
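
A minimal hand-rolled validation sketch along those lines (the checks and error messages are illustrative, not the project's actual code):

def validate_schema(schema):
    """Check the basic structure of a pganonymize YAML schema.

    Raises ValueError with a readable message instead of letting the
    anonymizer fail later with an AttributeError.
    """
    if not isinstance(schema, dict):
        raise ValueError('The schema must be a mapping with a "tables" key')
    tables = schema.get('tables')
    if not isinstance(tables, list):
        raise ValueError('"tables" must be a list of table definitions')
    for table in tables:
        for table_name, definition in table.items():
            for field in (definition or {}).get('fields', []):
                for field_name, field_def in field.items():
                    provider = (field_def or {}).get('provider')
                    if not provider or 'name' not in provider:
                        raise ValueError(
                            'Field "%s" in table "%s" needs a provider with a name'
                            % (field_name, table_name))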

Support --format=plain option for pg_dump

I would like to have an option to choose how pg_dump will format the dump (i.e. plain vs. custom).
I could make a PR to support it, but I need to know:

  • Would you be interested in having this kind of option?
  • What do you think of --dump-format (plain, custom, directory, tar) and --dump-compress (0..9)?
  • Or do you prefer something more generic to pass any pg_dump options to the command?
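
Note that the help output above already lists a --dump-options argument for passing options to pg_dump; with it, a generic invocation might look like this (values are illustrative):

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --dump-file=/tmp/dump.sql \
    --dump-options='--format=plain' \
    -v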

Anonymizing error if there is a JSONB column in a table

I have a strange error:

pganonymizer.exceptions.BadDataFormat: invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: {'...
COPY source, line 29, column ui_settings: "{'firstTime': True}"

YAML file:

tables:
  - accounts:
      fields:
        - name:
            provider:
              name: fake.name
        - email:
            provider:
              name: fake.email
        - phone:
            provider:
              name: fake.phone_number
        - title:
            provider:
              name: choice
              values:
                - "Mr"
                - "Mrs"
                - "Dr"
                - "Prof"
                - "Ms"

truncate:
  - django_session

ui_settings column values:
{"firstTime": true, "licenseBannerHasBeenShown": true}
{"firstTime": true}
{}

What am I doing wrong?
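
For context, the error message suggests that the JSONB value is being serialized with Python's string representation of a dict rather than as JSON; a quick illustration of the difference (not the project's actual code path):

import json

value = {'firstTime': True}

print(str(value))         # {'firstTime': True}  -> single quotes and True, rejected by PostgreSQL's json parser
print(json.dumps(value))  # {"firstTime": true}  -> valid JSON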

Commandline argument to create a database dump

After anonymizing a database it would be nice to be able to create a PostgreSQL dump file for further usage. The format could be hard coded for the first version (e.g. bzip2 compressed). Example usage:

$ pganonymize --schema=my_schema.yml \
    --host=localhost \
    --user=user \
    --password=password \
    --dbname=database \
    --dump-file=my_anonymized_database.bz2

parmap no longer supports Python 2.7

parmap release 1.5.3 dropped support for older versions of Python.

Relevant changelog entry:

parmap (1.5.3)

  • Drop support for unsupported python versions
  • Add support for python 3.10
  • Use tqdm.auto to have nice progress bars on jupyter notebooks (#26)
  • Add dummy _number_left for parallel async (#23)

Downgrading parmap to 1.5.2 fixes the issue on older versions of Python.

compatibility with "GENERATED ALWAYS" columns

It would be great to have compatibility with https://www.postgresql.org/docs/current/ddl-generated-columns.html.
At the moment, when I name a generated column in the yml file, I get the following error:

- members:
    primary_key: uuid
    chunk_size: 5000
    fields:
     - name:
        provider:
          name: md5
     - firstname:
        provider:
          name: fake.first_name
     - lastname:
        provider:
          name: fake.last_name

psycopg2.errors.GeneratedAlways: column "name" can only be updated to DEFAULT
DETAIL:  Column "name" is a generated column.

This is, of course, because generated columns can only be updated in a certain way.
Maybe a generated provider would be a nice addition?

ValueError when copying a table with id of type uuid4

Hey,

I get this error when trying to anonymize any table whose primary key (id) is of type uuid:

INFO: Found table definition "users"
1216it [00:00, 38715.87it/s]                                                                   
Processing 1 batches for users:   0%|                           | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 210, in f
    return formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 109, in uuid_formatter
    return 'i2Q', (16, (guid.int >> 64) & MAX_INT64, guid.int & MAX_INT64)
AttributeError: 'str' object has no attribute 'int'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/cli.py", line 79, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose, dry_run=args.dry_run)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    build_and_then_import_data(connection, table_name, primary_key, columns, excludes,
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 94, in build_and_then_import_data
    import_data(connection, temp_table, [primary_key] + column_names, filter(None, data))
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 173, in import_data
    mgr.copy([[escape_str_replace(val) for col, val in row.items()] for row in data])
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 294, in copy
    self.writestream(data, datastream)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 322, in writestream
    f, d = formatter(val)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 135, in <lambda>
    return lambda v: ('i', (-1,)) if v is None else formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 213, in f
    errors.raise_from(ValueError, message, exc)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/errors/py3.py", line 9, in raise_from
    raise exccls(message) from exc
ValueError: error formatting value 16cfc1fb-16fc-4888-b6a7-3638698df7ae for column id
ERROR: 1

The value received by pgcopy is the string 16cfc1fb-16fc-4888-b6a7-3638698df7ae, while it expects an instance of class uuid.UUID.

This is happening with this as schema.yml:

tables:
 - users:
    fields:
     - password:
        provider:
          name: mask
          sign: '?'

For context:

$ pip freeze
Faker==9.8.0
parmap==1.5.3
pganonymize==0.6.1
pgcopy==1.5.0
psycopg2==2.9.1
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
six==1.16.0
text-unidecode==1.3
tqdm==4.62.3

$ postgres --version
postgres (PostgreSQL) 11.8

$ psql -d my_db -c "\d users"

                                      Table "public.users"
          Column          |            Type             | Collation | Nullable |         Default         
--------------------------+-----------------------------+-----------+----------+-------------------------
 id                       | uuid                        |           | not null | 
 ...

Please let me know if some more information is needed, or if I missed some info from the documentation 😅
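
A possible fix sketch (the helper name and its place in the pipeline are illustrative, not the project's actual API): convert string values of uuid columns to uuid.UUID before handing rows to pgcopy.

import uuid

def coerce_uuid(value):
    # Return a uuid.UUID for valid UUID strings so pgcopy's uuid formatter
    # receives the type it expects; pass everything else through unchanged.
    if isinstance(value, str):
        try:
            return uuid.UUID(value)
        except ValueError:
            return value
    return value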

Test completion

Due to the urgency, most parts of the project are untested. Therefore a lot of unit tests should be added, mostly for the utils.py module.

Subprocess "run" being used on Python2.7

When attempting to use the create_database_dump utility on a legacy python2.7 system, the subprocess command fails because run doesn't exist in the subprocess module bundled with Python 2.7.

https://docs.python.org/3.5/library/subprocess.html#older-high-level-api

I believe the equivalent is call instead of run.

Relevant traceback:

Traceback (most recent call last):
  File "/Users/brett/.virtualenvs/iris/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/cli.py", line 89, in main
    create_database_dump(args.dump_file, pg_args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/utils.py", line 274, in create_database_dump
    subprocess.run(cmd, shell=True)
AttributeError: 'module' object has no attribute 'run'
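
A minimal compatibility sketch (assuming the command is built as a string and executed through the shell, as in the traceback above; the function name is illustrative):

import subprocess

def run_command(cmd):
    # subprocess.call exists on both Python 2.7 and Python 3.x,
    # while subprocess.run was only added in Python 3.5.
    return subprocess.call(cmd, shell=True)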

Script fails to run

Version: pganonymize-0.5.0

INFO: Found table definition "users"
Anonymizing |████████████████████████████████| 5656/5656
Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 154, in import_data
    copy_from(connection, data, temp_table, table_columns)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 132, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=quoted_cols)
psycopg2.errors.UndefinedColumn: column ""id"" of relation "tmp_users" does not exist


Commandline argument to list available providers

It would be nice to have a commandline argument that lists all available providers, e.g.:

$ pganonymize --list-providers

choice - Provider that returns a random value from a list of choices.
clear - Provider to set a field value to None.
fake - Provider to generate fake data.
...

Python 2.7 tests are failing

Because Python 2.7 support has ended, most of the CI images used for testing have dropped the Python 2.7 interpreter (and with it the ability to set up Python 2.7 within the actions, see actions/setup-python#672). This leads to failing tests and breaks the testing chain for other Python versions. As our company still uses Python 2.7 in production environments, this project still needs to support Python 2.7.

During Exclude if the Result is "None" then TypeError is Raised

During the exclusion of rows, if a column value is None, then matching row[column] against the exclude pattern raises a TypeError:

https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/5f6d7b3e1a9f4ae22e843eb1c6d57314a1939936/pganonymizer/utils.py#L101

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/pganonymize", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 38, in anonymize_tables
    data, table_columns = build_data(connection, table_name, columns, excludes, total_count, verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 68, in build_data
    if not row_matches_excludes(row, excludes):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 101, in row_matches_excludes
    if pattern.match(row[column]):
TypeError: expected string or bytes-like object

Ref: myschema.yml

tables:
 - res_partner:
    fields:
     - name:
        provider:
          name: fake.name
     - email:
        provider:
          name: fake.email
    excludes: 
     - email:
        - "info.*@example.com"

Expected behaviour:

  • Ignore such records, since a None value cannot match the pattern anyway (see the sketch below)

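A possible None-safe version of the check (a simplified sketch based on the row_matches_excludes call in the traceback, not the project's exact implementation):

import re

def row_matches_excludes(row, excludes):
    # Return True if any exclude pattern matches the corresponding column value.
    # None values are treated as non-matching instead of raising a TypeError.
    for exclude in excludes or []:
        for column, patterns in exclude.items():
            value = row.get(column)
            if value is None:
                continue  # a missing value can never match an exclude pattern
            for pattern in patterns:
                if re.match(pattern, str(value)):
                    return True
    return False
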
Using 'faker.unique.xx' as provider doesn't ensure uniqueness of the values...

... due to the use of raw parallelization.

To prove it, run the following:

import parmap
from pganonymizer.providers import FakeProvider


provider = FakeProvider(name='fake.unique.user_name')


def gen_values(qty=100):
    base_values = [f'v{n}' for n in range(qty)]
    parallel_values = parmap.map(provider.alter_value, base_values)
    serial_values = parmap.map(provider.alter_value, base_values, pm_parallel=False)
    return parallel_values, serial_values


pvals, svals = gen_values()

# verify uniqueness
print(len(set(pvals)), len(set(svals)))

On my machine, when I run it I get the following output:

$ python test.py
16 100

If uniqueness were preserved, the output would have been 100 100.
