rheinwerk-verlag / pganonymize


A commandline tool for anonymizing PostgreSQL databases

Home Page: http://pganonymize.readthedocs.io/

License: Other

Languages: Python 94.23%, Makefile 5.33%, Dockerfile 0.44%
Topics: gdpr, database, anonymization, postgresql, python, python2, python3, python27, command-line, command-line-tool, cli, dsgvo, anonymizer, developer-tools, privacy

pganonymize's Introduction

pganonymize

A commandline tool to anonymize PostgreSQL databases for DSGVO/GDPR purposes.

It uses a YAML file to define which tables and fields should be anonymized and provides various methods of anonymization. The tool requires a direct PostgreSQL connection to perform the anonymization.


Demo: https://raw.githubusercontent.com/rheinwerk-verlag/pganonymize/main/docs/_static/demo.gif

Features

  • Intentionally compatible with Python 2.7 (for old production platforms)
  • Anonymize PostgreSQL tables on a per-column (data) level with various providers (some examples in the table below)
  • Exclude data from anonymization based on regular expressions or SQL WHERE clauses
  • Truncate entire tables to drop unwanted data
Field        Value                 Provider               Output
first_name   John                  choice                 (Bob|Larry|Lisa)
title        Dr.                   clear
street       Irving St             faker.street_name      Miller Station
password     dsf82hFxcM            mask                   XXXXXXXXXX
credit_card  1234-567-890          partial_mask           1??????????0
email        [email protected]      md5                    0cba00ca3da1b283a57287bcceb17e35
email        [email protected]      faker.unique.email     [email protected]
phone_num    65923473              md5 (as_number: True)  3948293448
ip           157.50.1.20           set                    127.0.0.1
uuid_col     00010203-0405-......  uuid4                  f7c1bd87-4d....
  • Note: faker.unique.[provider] is only supported on Python 3.6+ (the minimum Python version supported by the Faker library)
  • Note: uuid4 is only available for native uuid columns

See the documentation for a more detailed description of the provided anonymization methods.

The default installation method is to use pip:

$ pip install pganonymize

Usage:

usage: pganonymize [-h] [-v] [-l] [--schema SCHEMA] [--dbname DBNAME]
               [--user USER] [--password PASSWORD] [--host HOST]
               [--port PORT] [--dry-run] [--dump-file DUMP_FILE]

Anonymize data of a PostgreSQL database

optional arguments:
-h, --help            show this help message and exit
-v, --verbose         Increase verbosity
-l, --list-providers  Show a list of all available providers
--schema SCHEMA       A YAML schema file that contains the anonymization
                        rules
--dbname DBNAME       Name of the database
--user USER           Name of the database user
--password PASSWORD   Password for the database user
--host HOST           Database hostname
--port PORT           Port of the database
--dry-run             Don't commit changes made on the database
--dump-file DUMP_FILE
                      Create a database dump file with the given name
--dump-options DUMP_OPTIONS
                      Options to pass to the pg_dump command
--init-sql INIT_SQL   SQL to run before starting anonymization
--parallel            Data anonymization is done in parallel

Besides the database connection values, you have to define a YAML schema file that includes all anonymization rules for that database. Take a look at the schema documentation or the YAML sample schema.
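
A minimal schema sketch (table, field, and pattern names are illustrative; the structure mirrors the YAML examples quoted in the issues further down):

tables:
  - customers:
      fields:
        - first_name:
            provider:
              name: fake.first_name
        - email:
            provider:
              name: md5
      excludes:
        - email:
            - "info.*@example.com"

truncate:
  - django_session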

Example calls:

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    -v

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --init-sql "set search_path to non_public_search_path; set work_mem to '1GB';" \
    -v

With the --dump-file argument it is possible to create a dump file after anonymizing the database. Please note that the pg_dump command from the postgresql-client-common package is necessary to create the dump file, e.g. under Linux:

$ sudo apt-get install postgresql-client-common

Example call:

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --dump-file=/tmp/dump.gz \
    -v

So that the password for dumping does not have to be entered manually, it can also be provided via the PGPASSWORD environment variable:

$ PGPASSWORD=password pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --dump-file=/tmp/dump.gz \
    -v

Warning

Currently only the dump-file operation supports environment variables.

If you want to run the anonymizer within a Docker container you first have to build the image:

$ docker build -t pganonymize .

After that you can pass a schema file to the container, using Docker volumes, and call the anonymizer:

$ docker run \
    -v <path to your schema>:/schema.yml \
    -it pganonymize \
    /usr/local/bin/pganonymize \
    --schema=/schema.yml \
    --dbname=<database> \
    --user=<user> \
    --password=<password> \
    --host=<host> \
    -v

pganonymize's People

Contributors

abhinavvaidya, bobslee, fblackburn1, hkage, kianmeng, koptelovav, korsar182, nurikk, rheinwerk-mp, tilley, w1ldpo1nter


pganonymize's Issues

Script fails when a table column has upper case characters

Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 39, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 140, in import_data
    copy_from(connection, data, 'source', table_columns)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 119, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=columns)
psycopg2.errors.UndefinedColumn: column "createdat" of relation "source" does not exist

Better schema validation

If a schema has an invalid structure, the resulting errors are a bit confusing. E.g. if the "tables" definition is missing, the resulting error looks like this:

Traceback (most recent call last):
  File "/home/henning/.local/share/virtualenvs/postgresql-anonymizer-AUSqld0C/bin/pganonymize", line 11, in <module>
    load_entry_point('pganonymize', 'console_scripts', 'pganonymize')()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/cli.py", line 32, in main
    truncate_tables(connection, schema.get('truncate', []))
AttributeError: 'list' object has no attribute 'get'

It would be nice to have a validation method that checks the basic structure of the YAML file:

  • The tables keyword
  • fields at the table level
  • A provider at the field level, with at least a name
  • ...

Instead of writing our own validation methods, it might be possible to use one of the general Python-based YAML/schema validation libraries.
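
A minimal hand-rolled validation sketch along those lines (the checks and error messages are illustrative, not the project's actual code):

def validate_schema(schema):
    """Check the basic structure of a pganonymize YAML schema.

    Raises ValueError with a readable message instead of letting the
    anonymizer fail later with an AttributeError.
    """
    if not isinstance(schema, dict):
        raise ValueError('The schema must be a mapping with a "tables" key')
    tables = schema.get('tables')
    if not isinstance(tables, list):
        raise ValueError('"tables" must be a list of table definitions')
    for table in tables:
        for table_name, definition in table.items():
            for field in (definition or {}).get('fields', []):
                for field_name, field_def in field.items():
                    provider = (field_def or {}).get('provider')
                    if not provider or 'name' not in provider:
                        raise ValueError(
                            'Field "%s" in table "%s" needs a provider with a name'
                            % (field_name, table_name))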

Support --format=plain option for pg_dump

I would like to have an option to choose how pg_dump will format the dump (i.e. plain vs. custom).
I could make a PR to support it, but I need to know:

  • Would you be interested in having this kind of option?
  • What do you think of --dump-format (plain, custom, directory, tar) and --dump-compress (0..9)?
  • Or do you prefer something more generic to pass any pg_dump options to the command?
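
Note that the help output above already lists a --dump-options argument for passing options to pg_dump; with it, a generic invocation might look like this (values are illustrative):

$ pganonymize --schema=myschema.yml \
    --dbname=test_database \
    --user=username \
    --password=mysecret \
    --host=db.host.example.com \
    --dump-file=/tmp/dump.sql \
    --dump-options='--format=plain' \
    -v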

Anonymizing error if there is a JSONB column in a table

I have a strange error:

pganonymizer.exceptions.BadDataFormat: invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: {'...
COPY source, line 29, column ui_settings: "{'firstTime': True}"

YAML file:

tables:
  - accounts:
      fields:
        - name:
            provider:
              name: fake.name
        - email:
            provider:
              name: fake.email
        - phone:
            provider:
              name: fake.phone_number
        - title:
            provider:
              name: choice
              values:
                - "Mr"
                - "Mrs"
                - "Dr"
                - "Prof"
                - "Ms"

truncate:
  - django_session

ui_settings column values:
{"firstTime": true, "licenseBannerHasBeenShown": true}
{"firstTime": true}
{}

What am I doing wrong?
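
For context, the error message suggests that the JSONB value is being serialized with Python's string representation of a dict rather than as JSON; a quick illustration of the difference (not the project's actual code path):

import json

value = {'firstTime': True}

print(str(value))         # {'firstTime': True}  -> single quotes and True, rejected by PostgreSQL's json parser
print(json.dumps(value))  # {"firstTime": true}  -> valid JSON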

Commandline argument to create a database dump

After anonymizing a database it would be nice to be able to create a PostgreSQL dump file for further usage. The format could be hard coded for the first version (e.g. bzip2 compressed). Example usage:

$ pganonymize --schema=my_schema.yml \
    --host=localhost \
    --user=user \
    --password=password \
    --dbname=database \
    --dump-file=my_anonymized_database.bz2

parmap no longer supports Python 2.7

parmap release 1.5.3 dropped support for older versions of Python.

Relevant changelog entry:

parmap (1.5.3)

  • Drop support for unsupported python versions
  • Add support for python 3.10
  • Use tqdm.auto to have nice progress bars on jupyter notebooks (#26)
  • Add dummy _number_left for parallel async (#23)

Downgrading parmap to 1.5.2 fixes the issue on older versions of Python.

compatibility with "GENERATED ALWAYS" columns

It would be great to have compatibility with https://www.postgresql.org/docs/current/ddl-generated-columns.html.
At the moment, when I name a generated column in the yml file, I get the following error:

- members:
    primary_key: uuid
    chunk_size: 5000
    fields:
     - name:
        provider:
          name: md5
     - firstname:
        provider:
          name: fake.first_name
     - lastname:
        provider:
          name: fake.last_name

psycopg2.errors.GeneratedAlways: column "name" can only be updated to DEFAULT
DETAIL:  Column "name" is a generated column.

This is, of course, because generated columns can only be updated in a certain way.
Maybe a generated provider would be a nice addition?

ValueError when copying a table with id of type uuid4

Hey,

I get this error when trying to anonymize any table whose primary key (id) is of type uuid:

INFO: Found table definition "users"
1216it [00:00, 38715.87it/s]                                                                   
Processing 1 batches for users:   0%|                           | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 210, in f
    return formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 109, in uuid_formatter
    return 'i2Q', (16, (guid.int >> 64) & MAX_INT64, guid.int & MAX_INT64)
AttributeError: 'str' object has no attribute 'int'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/cli.py", line 79, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose, dry_run=args.dry_run)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    build_and_then_import_data(connection, table_name, primary_key, columns, excludes,
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 94, in build_and_then_import_data
    import_data(connection, temp_table, [primary_key] + column_names, filter(None, data))
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 173, in import_data
    mgr.copy([[escape_str_replace(val) for col, val in row.items()] for row in data])
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 294, in copy
    self.writestream(data, datastream)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 322, in writestream
    f, d = formatter(val)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 135, in <lambda>
    return lambda v: ('i', (-1,)) if v is None else formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 213, in f
    errors.raise_from(ValueError, message, exc)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/errors/py3.py", line 9, in raise_from
    raise exccls(message) from exc
ValueError: error formatting value 16cfc1fb-16fc-4888-b6a7-3638698df7ae for column id
ERROR: 1

The value received by pgcopy is the string 16cfc1fb-16fc-4888-b6a7-3638698df7ae, while it expects an instance of class uuid.UUID.

This is happening with this as schema.yml:

tables:
 - users:
    fields:
     - password:
        provider:
          name: mask
          sign: '?'

For context:

$ pip freeze
Faker==9.8.0
parmap==1.5.3
pganonymize==0.6.1
pgcopy==1.5.0
psycopg2==2.9.1
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
six==1.16.0
text-unidecode==1.3
tqdm==4.62.3

$ postgres --version
postgres (PostgreSQL) 11.8

$ psql -d my_db -c "\d users"

                                      Table "public.users"
          Column          |            Type             | Collation | Nullable |         Default         
--------------------------+-----------------------------+-----------+----------+-------------------------
 id                       | uuid                        |           | not null | 
 ...

Please let me know if some more information is needed, or if I missed some info from the documentation 😅
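
A possible fix sketch (the helper name and its place in the pipeline are illustrative, not the project's actual API): convert string values of uuid columns to uuid.UUID before handing rows to pgcopy.

import uuid

def coerce_uuid(value):
    # Return a uuid.UUID for valid UUID strings so pgcopy's uuid formatter
    # receives the type it expects; pass everything else through unchanged.
    if isinstance(value, str):
        try:
            return uuid.UUID(value)
        except ValueError:
            return value
    return value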

Test completion

Due to the urgency, most parts of the project are untested. Therefore a lot of unit tests should be added, mostly for the utils.py module.

Subprocess "run" being used on Python2.7

When attempting to use the create_database_dump utility on a legacy python2.7 system, the subprocess command fails because run doesn't exist in the subprocess module bundled with Python 2.7.

https://docs.python.org/3.5/library/subprocess.html#older-high-level-api

I believe the equivalent is call instead of run.

Relevant traceback:

Traceback (most recent call last):
  File "/Users/brett/.virtualenvs/iris/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/cli.py", line 89, in main
    create_database_dump(args.dump_file, pg_args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/utils.py", line 274, in create_database_dump
    subprocess.run(cmd, shell=True)
AttributeError: 'module' object has no attribute 'run'
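
A minimal compatibility sketch (assuming the command is built as a string and executed through the shell, as in the traceback above; the function name is illustrative):

import subprocess

def run_command(cmd):
    # subprocess.call exists on both Python 2.7 and Python 3.x,
    # while subprocess.run was only added in Python 3.5.
    return subprocess.call(cmd, shell=True)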

Script fails to run

Version: pganonymize-0.5.0

INFO: Found table definition "users"
Anonymizing |████████████████████████████████| 5656/5656
Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 154, in import_data
    copy_from(connection, data, temp_table, table_columns)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 132, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=quoted_cols)
psycopg2.errors.UndefinedColumn: column ""id"" of relation "tmp_users" does not exist


Commandline argument to list available providers

It would be nice to have a commandline argument that lists all available providers, e.g.:

$ pganonymize --list-providers

choice - Provider that returns a random value from a list of choices.
clear - Provider to set a field value to None.
fake - Provider to generate fake data.
...

Python 2.7 tests are failing

Because Python 2.7 support has ended, most of the CI images used for testing have dropped the Python 2.7 interpreter (and with it the ability to set up Python 2.7 within the actions, see actions/setup-python#672). This leads to failing tests and breaks the testing chain for other Python versions. As our company still uses Python 2.7 in production environments, this project still needs to support Python 2.7.

During Exclude if the Result is "None" then TypeError is Raised

During the exclusion of rows, if a column value is None, then matching row[column] against the exclude pattern raises a TypeError:

https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/5f6d7b3e1a9f4ae22e843eb1c6d57314a1939936/pganonymizer/utils.py#L101

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/pganonymize", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 38, in anonymize_tables
    data, table_columns = build_data(connection, table_name, columns, excludes, total_count, verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 68, in build_data
    if not row_matches_excludes(row, excludes):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 101, in row_matches_excludes
    if pattern.match(row[column]):
TypeError: expected string or bytes-like object

Ref: myschema.yml

tables:
 - res_partner:
    fields:
     - name:
        provider:
          name: fake.name
     - email:
        provider:
          name: fake.email
    excludes: 
     - email:
        - "info.*@example.com"

Expected behaviour:

  • Ignore such records, since a None value cannot match the pattern anyway (see the sketch below)

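A possible None-safe version of the check (a simplified sketch based on the row_matches_excludes call in the traceback, not the project's exact implementation):

import re

def row_matches_excludes(row, excludes):
    # Return True if any exclude pattern matches the corresponding column value.
    # None values are treated as non-matching instead of raising a TypeError.
    for exclude in excludes or []:
        for column, patterns in exclude.items():
            value = row.get(column)
            if value is None:
                continue  # a missing value can never match an exclude pattern
            for pattern in patterns:
                if re.match(pattern, str(value)):
                    return True
    return False
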
Using 'faker.unique.xx' as provider doesn't ensure uniqueness of the values...

... due to the use of raw parallelization.

To prove it, run the following:

import parmap
from pganonymizer.providers import FakeProvider


provider = FakeProvider(name='fake.unique.user_name')


def gen_values(qty=100):
    base_values = [f'v{n}' for n in range(qty)]
    parallel_values = parmap.map(provider.alter_value, base_values)
    serial_values = parmap.map(provider.alter_value, base_values, pm_parallel=False)
    return parallel_values, serial_values


pvals, svals = gen_values()

# verify uniqueness
print(len(set(pvals)), len(set(svals)))

On my machine, when I run it I get the following output:

$ python test.py
16 100

If uniqueness were preserved, the output would have been 100 100.
